In this paper we study the use of neural networks for speech recognition. Artificial Neural Networks (ANNs) are relatively crude electronic models based on the neural structure of the brain, and are among the most effective approaches to recognition tasks, whether the target is a pattern, a character, a gesture, or speech. We design a recognition system capable of recognizing spoken language. We propose a combination of extracting the characteristics of the audio signal using linear predictive coding and a computational approach using artificial neural networks to identify the correct sample. The analog audio is first converted into a digital signal. Since speech recognition involves matching a voice pattern against a provided or acquired vocabulary, a neural net is constructed to achieve maximum accuracy.
Speech recognition is the field of computer science that deals with designing computer systems that can recognize spoken words. Such systems generally require an extended training session during which the system becomes accustomed to a particular voice and accent; they are therefore said to be speaker dependent. For use with computers, analog audio must be converted into digital signals, which requires analog-to-digital (A/D) conversion. To decipher the signal, the computer must have a digital database, or vocabulary, of words or syllables, and a speedy means of comparing this data with incoming signals. The speech patterns are stored on the hard drive and loaded into memory when the program is run; a comparator checks these stored patterns against the output of the A/D converter. Speech recognition is composed of two parts:
- Feature Extraction
- Pattern Classification
It is also possible to analyze the speech recognition search result by looking at the confidence of the top-scoring word. The word can be rejected as out-of-vocabulary if the confidence falls below a pre-determined threshold. We currently do not use these values during general-purpose recognition. A neural network is composed of a number of interconnected units (artificial neurons). Each unit has input/output (I/O) characteristics and implements a local computation or function. The output of any unit is determined by its I/O characteristics, its interconnections to other units, and (possibly) external inputs. A single neuron by itself is not a very useful pattern recognition tool; the real power of neural networks comes when neurons are combined into multilayer structures, called neural networks.
The neuron has: a set of nodes, also called synapses, that connect it to inputs, outputs, or other neurons; and a linear combiner, a function that takes all inputs and produces a single value. A simple way of computing this value is to add together each input multiplied by its synaptic weight.
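For illustration, the linear combiner can be sketched in Python; the inputs and synaptic weights below are arbitrary example values:

```python
import numpy as np

def linear_combiner(inputs, weights):
    """Add together each input multiplied by its synaptic weight."""
    return float(np.dot(inputs, weights))

# Arbitrary example: three inputs and their synaptic weights.
x = np.array([0.5, 1.0, -0.25])
w = np.array([0.4, 0.3, 0.8])
value = linear_combiner(x, w)   # 0.2 + 0.3 - 0.2 = 0.3
```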
Since frequency is one of the most important pieces of information needed to accurately recognize sound, a transformation is required that breaks a signal into its frequency components. The Fourier transform of a signal is a representation of the frequencies and amplitudes that compose it. Because the signal is cut off abruptly at the analysis boundaries (its derivative is not continuous there), phantom frequencies appear. Moreover, common, everyday signals such as speech are rarely stationary: they almost always contain frequency components that exist for only a short period of time. The plain Fourier transform is therefore ill-suited to the task of speech recognition. Since the Fourier transform of a real signal is symmetric, the first half of the data is all that is of interest. A short-time Fourier transform was therefore used, together with a band-pass filter to remove unwanted frequencies.
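A minimal short-time Fourier transform can be sketched as follows; the frame length, hop size, and Hanning window are illustrative assumptions, and only the first half of each spectrum is kept because of the symmetry noted above:

```python
import numpy as np

def stft(signal, frame_len=256, hop=128):
    """Windowed frames -> magnitude spectra. Only the first half of each
    spectrum is kept, since the FFT of a real signal is symmetric."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

# A 1 kHz tone sampled at 8 kHz: with a 256-sample frame the bin width is
# 8000 / 256 = 31.25 Hz, so the energy peaks at bin 1000 / 31.25 = 32.
fs = 8000
t = np.arange(fs) / fs
spec = stft(np.sin(2 * np.pi * 1000 * t))
peak_bin = spec[0].argmax()   # 32
```

A band-pass filter can then be approximated simply by zeroing the unwanted frequency bins of each frame.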
- Stationary phonemes (like ‘a’, ‘e’) do not change over time
- Therefore, just a single frame of signal is sufficient to classify them
- A single non-linear element can do the job.
Non-linear element for stationary phoneme classification
- There are as many inputs to the unit as there are frequency bands.
- Non-linearity changes output towards 0 or 1.
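A single non-linear unit of this kind might look as follows; the weights, bias, and four-band energies are hypothetical values, chosen so that a vowel-like frame drives the output towards 1 and silence towards 0:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def classify_frame(band_energies, weights, bias):
    """One input per frequency band; the non-linearity pushes the
    output towards 0 or 1."""
    return float(sigmoid(np.dot(band_energies, weights) + bias))

# Hypothetical weights favouring energy in the two lowest bands,
# e.g. where the formants of a vowel such as 'a' would fall.
w = np.array([4.0, 3.0, -2.0, -3.0])
vowel_like = np.array([0.9, 0.8, 0.1, 0.1])
silence    = np.array([0.05, 0.05, 0.05, 0.05])
p_vowel   = classify_frame(vowel_like, w, bias=-2.0)   # close to 1
p_silence = classify_frame(silence, w, bias=-2.0)      # close to 0
```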
- Phonemes like ‘k’ consist of silence, explosion, and fade-out parts
- A single frame of signal is insufficient to recognize them
In the figure above, the waveform and its spectrogram are shown for the pronunciation of the word “ONE”.
A spectrogram provides a good visual representation of speech but still varies significantly between samples. Cepstrum analysis is a popular method for feature extraction in speech recognition applications, and can be accomplished using Mel-Frequency Cepstral Coefficient (MFCC) analysis.
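A simplified MFCC computation for a single pre-windowed frame can be sketched as below; the filter and coefficient counts (26 filters, 13 coefficients) are common choices used here for illustration, not values mandated by the text:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs, n_filters=26, n_coeffs=13):
    """Power spectrum -> triangular mel filterbank -> log -> DCT."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    # Filter centre frequencies spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((len(frame) + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    energies = np.maximum(fbank @ power, 1e-10)   # floor before the log
    return dct(np.log(energies), type=2, norm='ortho')[:n_coeffs]

fs = 8000
t = np.arange(256) / fs
frame = np.hamming(256) * np.sin(2 * np.pi * 440 * t)
coeffs = mfcc(frame, fs)    # 13 cepstral coefficients for this frame
```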
Neural Networks for Speech Recognition
Most applications require neural networks that contain at least the three standard types of layers: input, hidden, and output. The layer of input neurons receives the data either from input files or directly from electronic sensors in real-time applications. The output layer sends information directly to the outside world, to a secondary computer process, or to other devices such as a mechanical control system. Between these two layers there can be many hidden layers. These internal layers contain many of the neurons in various interconnected structures. The inputs and outputs of each of these hidden neurons simply go to other neurons. In most networks, each neuron in a hidden layer receives the signals from all of the neurons in the layer above it, typically an input layer. After a neuron performs its function, it passes its output to all of the neurons in the layer below it, providing a feedforward path to the output.
These lines of communication from one neuron to another are an important aspect of neural networks: they are the glue of the system, the connections that provide a variable strength to an input. There are two types of connections: one causes the summing mechanism of the next neuron to add, while the other causes it to subtract. In more human terms, one excites while the other inhibits.
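The feedforward pass described above can be sketched as follows, assuming for illustration 26 inputs, 100 hidden units, and 8 outputs; positive weights excite (add to the neuron's sum) and negative weights inhibit (subtract from it):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, layers):
    """Feedforward pass: each layer receives the signals from all
    neurons in the layer above it. Positive weights excite (add to
    the neuron's sum); negative weights inhibit (subtract from it)."""
    for W, b in layers:
        x = sigmoid(W @ x + b)
    return x

rng = np.random.default_rng(0)
layers = [
    (rng.uniform(-1, 1, (100, 26)), np.zeros(100)),   # input -> hidden
    (rng.uniform(-1, 1, (8, 100)),  np.zeros(8)),     # hidden -> output
]
out = forward(rng.uniform(-1, 1, 26), layers)          # 8 output values
```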
Major Components of an Artificial Neuron
- Weighting Factors
- Summation Function
- Transfer Function
- Scaling and Limiting
- Output Function
- Error Function and Back-Propagated Value
- Learning Function
The figure of the neural network is shown below:
Input
♦ 26 cepstral coefficients
Hidden Layer
♦ 100 fully-connected hidden-layer units
♦ Weights range between -1 and +1; initially random, they remain constant
Output
♦ 1 output unit for each target
♦ Limited to values between 0 and +1
The values shown in the hidden and output layers are the weights, which are set by repetitive training of the neural network. There are various laws on the basis of which neural networks learn. They are listed below:
- Hebb’s Rule
- Hopfield Law
- The Delta Rule
- The Gradient Descent Rule
- Kohonen’s Learning Law
Training A Network
Once a network has been structured for a particular application, it is ready to be trained. To start this process, the initial weights are chosen randomly; then the training, or learning, begins. There are two approaches to training: supervised and unsupervised. Supervised training involves providing the network with the desired output, either by manually “grading” the network’s performance or by supplying the desired outputs together with the inputs. Unsupervised training is where the network has to make sense of the inputs without outside help. In supervised training, the network processes the inputs and compares its resulting outputs against the desired outputs. Errors are then propagated back through the system, causing it to adjust the weights which control the network. This process occurs over and over as the weights are continually tweaked. The set of data which enables the training is called the “training set.” During training, the same set of data is processed many times as the connection weights are continually refined. Here we use the supervised training format, and hence the desired outputs are provided along with the inputs.
Spoken digits were recorded
♦ Seven samples of each digit
♦ “One” through “eight” recorded
♦ Total of 56 different recordings with varying lengths and environmental conditions
♦ Important: background noise was removed from each sample
♦ Choose intended target and create a target vector
♦ 56-dimensional target vector
If training the network to recognize a spoken “one”, the target has a value of +1 for each of the known “one” stimuli and 0 for everything else.
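Constructing the 56-dimensional target vector can be sketched as below, assuming the recordings are ordered digit by digit (seven samples of “one”, then seven of “two”, and so on):

```python
import numpy as np

def target_vector(digit_index, samples_per_digit=7, n_digits=8):
    """+1 for each known sample of the intended digit, 0 elsewhere."""
    t = np.zeros(samples_per_digit * n_digits)
    start = digit_index * samples_per_digit
    t[start:start + samples_per_digit] = 1.0
    return t

t_one = target_vector(0)   # training target for spoken "one"
# t_one has 56 entries: +1 for the seven "one" stimuli, 0 for the rest.
```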
Train a multilayer Perceptron with feature vectors (simplified)
♦ Select stimuli at random
♦ Calculate response to stimuli
♦ Calculate error
♦ Update weights
In a finite amount of time, the Perceptron will successfully learn to distinguish between stimuli that belong to the intended target and stimuli that do not.
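The four steps above can be sketched as a training loop; the feature vectors below are fabricated, linearly separable toy data used only to exercise the loop:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_perceptron(stimuli, targets, lr=0.5, epochs=2000, seed=0):
    """Select a stimulus at random, calculate the response, calculate
    the error, update the weights."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-1, 1, stimuli.shape[1])
    b = 0.0
    for _ in range(epochs):
        i = rng.integers(len(stimuli))      # select stimuli at random
        o = sigmoid(stimuli[i] @ w + b)     # calculate response (0..1)
        err = targets[i] - o                # calculate error (t - o)
        w = w + lr * err * stimuli[i]       # update weights
        b = b + lr * err
    return w, b

# Fabricated toy data: class 1 has positive energy in the first band,
# class 0 negative, so the classes are linearly separable.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 0.3, (20, 3)) * [1.0, 0.1, 0.1],
               rng.normal(-2.0, 0.3, (20, 3)) * [1.0, 0.1, 0.1]])
y = np.array([1.0] * 20 + [0.0] * 20)
w, b = train_perceptron(X, y)
accuracy = ((sigmoid(X @ w + b) > 0.5).astype(float) == y).mean()
```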
♦ For a given stimulus, the error is the difference between the target (t) and the response (o)
♦ t will be either 0 or 1
♦ o will be between 0 and +1
Response to unseen stimuli
♦ Stimuli produced by the same voice used to train the network, with noise removed
♦ Network was tested against eight unseen stimuli corresponding to eight spoken digits
♦ Returned 1 (full activation) for “one” and zero for all other stimuli.
♦ Results were consistent across targets, i.e. when the network was trained to recognize “two”, “three”, and so on
Response to noisy sample
♦ The network returned a low but nonzero (> 0) response to a sample from which background noise had not been removed
Response to foreign speaker
♦ The network gave mixed results when presented with samples from speakers other than those who produced the training stimuli
In all cases, the error rate decreased and accuracy improved with more learning iterations.
Some Related Information
Hebb’s Rule: The first, and undoubtedly the best known, learning rule was introduced by Donald Hebb. The description appeared in his book The Organization of Behavior in 1949. His basic rule is: If a neuron receives an input from another neuron, and if both are highly active (mathematically have the same sign), the weight between the neurons should be strengthened.
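Hebb's rule can be sketched as a single weight update; the learning rate of 0.25 is an arbitrary illustrative choice:

```python
def hebb_update(w, pre, post, lr=0.25):
    """Strengthen the weight when the pre- and post-synaptic
    activities share a sign; weaken it when they differ."""
    return w + lr * pre * post

w = 0.5
w = hebb_update(w, pre=1.0, post=1.0)    # both highly active: 0.75
w = hebb_update(w, pre=1.0, post=-1.0)   # opposite signs: back to 0.5
```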
Hopfield Law: It is similar to Hebb’s rule with the exception that it specifies the magnitude of the strengthening or weakening. It states, “if the desired output and the input are both active or both inactive, increment the connection weight by the learning rate, otherwise decrement the weight by the learning rate.”
The Delta Rule: This rule is a further variation of Hebb’s Rule, and one of the most commonly used. It is based on the simple idea of continuously modifying the strengths of the input connections to reduce the difference (the delta) between the desired output value and the actual output of a processing element. This rule changes the synaptic weights in the way that minimizes the mean squared error of the network. It is also referred to as the Widrow-Hoff Learning Rule and the Least Mean Square (LMS) Learning Rule. The way the Delta Rule works is that the delta error in the output layer is transformed by the derivative of the transfer function and is then used in the previous neural layer to adjust input connection weights. In other words, this error is back-propagated into previous layers one layer at a time, and the process continues until the first layer is reached. The network type called feedforward back-propagation derives its name from this method of computing the error term. When using the Delta Rule, it is important to ensure that the input data set is well randomized. A well-ordered or structured presentation of the training set can lead to a network which cannot converge to the desired accuracy; if that happens, the network is incapable of learning the problem.
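One way to sketch a single Delta-Rule (back-propagation) step for a small two-layer network, using the sigmoid transfer function whose derivative is o(1 - o); the layer sizes and learning rate are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(x, t, W1, W2, lr=0.5):
    """One Delta-Rule step. The output delta is the error transformed by
    the derivative of the sigmoid, o * (1 - o); it is then back-propagated
    one layer at a time to adjust the input connection weights."""
    h = sigmoid(W1 @ x)                          # hidden activations
    o = sigmoid(W2 @ h)                          # network output
    delta_o = (t - o) * o * (1.0 - o)            # output-layer delta
    delta_h = (W2.T @ delta_o) * h * (1.0 - h)   # back-propagated delta
    W2 += lr * np.outer(delta_o, h)
    W1 += lr * np.outer(delta_h, x)
    return 0.5 * float(np.sum((t - o) ** 2))     # squared error

rng = np.random.default_rng(0)
W1 = rng.uniform(-1, 1, (4, 2))
W2 = rng.uniform(-1, 1, (1, 4))
x, t = np.array([1.0, 0.0]), np.array([1.0])
errors = [train_step(x, t, W1, W2) for _ in range(200)]
# The squared error shrinks as the weights are repeatedly adjusted.
```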
The Gradient Descent Rule: This rule is similar to the Delta Rule in that the derivative of the transfer function is still used to modify the delta error before it is applied to the connection weights. Here, however, an additional proportional constant tied to the learning rate is appended to the final modifying factor acting upon the weight. This rule is commonly used, even though it converges to a point of stability very slowly. It has been shown that different learning rates for different layers of a network help the learning process converge faster. In these tests, the learning rates for those layers close to the output were set lower than those layers near the input. This is especially important for applications where the input data is not derived from a strong underlying model.
Kohonen’s Learning Law: This procedure, developed by Teuvo Kohonen, was inspired by learning in biological systems. In this procedure, the processing elements compete for the opportunity to learn, or update their weights. The processing element with the largest output is declared the winner and has the capability of inhibiting its competitors as well as exciting its neighbors. Only the winner is permitted an output, and only the winner plus its neighbors are allowed to adjust their connection weights.
Further, the size of the neighborhood can vary during the training period. The usual paradigm is to start with a larger definition of the neighborhood, and narrow in as the training process proceeds. Because the winning element is defined as the one that has the closest match to the input pattern, Kohonen networks model the distribution of the inputs. This is good for statistical or topological modeling of the data and is sometimes referred to as self-organizing maps or self-organizing topologies.
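A sketch of Kohonen's competitive learning on a one-dimensional map of five units; the two input clusters and the shrinking neighbourhood schedule are illustrative assumptions:

```python
import numpy as np

def kohonen_step(weights, x, lr=0.2, radius=1):
    """The unit whose weight vector is closest to the input wins; only
    the winner and its neighbours on the map adjust their weights."""
    winner = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
    for j in range(len(weights)):
        if abs(j - winner) <= radius:            # 1-D neighbourhood
            weights[j] += lr * (x - weights[j])
    return winner

# Five units on a 1-D map, spread along the diagonal of the input space.
weights = np.column_stack([np.linspace(0, 1, 5), np.linspace(0, 1, 5)])
rng = np.random.default_rng(0)
data = np.vstack([rng.normal([0.2, 0.2], 0.05, (50, 2)),
                  rng.normal([0.8, 0.8], 0.05, (50, 2))])
for radius in (2, 1, 0):                         # start wide, narrow in
    for x in rng.permutation(data):
        kohonen_step(weights, x, radius=radius)
# Probing with lr=0.0 leaves the weights unchanged; the two input
# clusters are claimed by different winning units.
low  = kohonen_step(weights, np.array([0.2, 0.2]), lr=0.0)
high = kohonen_step(weights, np.array([0.8, 0.8]), lr=0.0)
```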
The current commercial network development packages provide tools to monitor how well an artificial neural network is converging on the ability to predict the right answer. These tools allow the training process to go on for days, stopping only when the system reaches some statistically desired point, or accuracy. However, some networks never learn. This could be because the input data does not contain the specific information from which the desired output is derived. Networks also don’t converge if there is not enough data to enable complete learning. Ideally, there should be enough data so that part of the data can be held back as a test. Many layered networks with multiple nodes are capable of memorizing data. To monitor the network to determine if the system is simply memorizing its data in some nonsignificant way, supervised training needs to hold back a set of data to be used to test the system after it has undergone its training. (Note: memorization is avoided by not having too many processing elements.)
If a network simply can’t solve the problem, the designer then has to review the input and outputs, the number of layers, the number of elements per layer, the connections between the layers, the summation, transfer, and training functions, and even the initial weights themselves. Those changes required to create a successful network constitute a process wherein the “art” of neural networking occurs.
Another part of the designer’s creativity governs the rules of training. There are many laws (algorithms) used to implement the adaptive feedback required to adjust the weights during training. The most common technique is backward-error propagation, more commonly known as back-propagation. These various learning techniques are explored in greater depth later in this report.
Yet, training is not just a technique. It involves a “feel,” and conscious analysis, to ensure that the network is not overtrained. Initially, an artificial neural network configures itself with the general statistical trends of the data. Later, it continues to “learn” about other aspects of the data which may be spurious from a general viewpoint.
When, finally, the system has been correctly trained and no further learning is needed, the weights can, if desired, be “frozen”. In some systems this finalized network is then turned into hardware so that it can be fast. Other systems do not lock themselves in but continue to learn while in production use.