Simple Speech Recognition using Neural Networks

Ever since my first home computer, I have been interested in speech recognition. Imagine how fast you could whip out a term paper or quarterly report if you just dictated it to your PC. This could mean an end to "all-nighters" and wrist cramps. My first successful system was a non-robust, speaker-dependent system. In other words, it would only recognize a few words, and only when they were spoken by the same person or the same few people.

There are several products on the market today that provide some form of speech recognition. The quality of these products varies along with their price. I will attempt to demonstrate some of the building blocks that can be used in a simple speech recognition system or any other pattern recognition system. As an example, I will go through the steps to create a system that will recognize single-word or short-phrase commands spoken by an individual. The system discussed here is a speaker-dependent, limited-vocabulary speech recognition system. The system has four sections: data acquisition, preprocessing, pattern identification, and postprocessing. Most pattern identification systems are composed of these four components.


The first section, data acquisition, uses an ordinary PC sound card as an A/D (analog to digital) converter. The system samples 4096 points over approximately 0.5 seconds. The second block, preprocessing, performs a frequency analysis of the time samples from the first block. The spectrum is then compressed into 40 points. The third block, pattern identification, uses a trained artificial neural network (ANN) to determine how closely the spoken word matches members of the ANN vocabulary. Finally, the last section contains some postprocessing to interpret and act on the output of the neural network.

The goal of the data acquisition block is to collect 4096 data samples over about 0.5 seconds. 4096 is the magic number of samples required by the preprocessing block to perform the spectral analysis; it is a power of two (2^12), which is the block size many FFT routines require. Half a second is about the length of most common words such as UP, DOWN, LEFT and RIGHT. The bandwidth of human speech is usually between 50 and 4000 Hz. According to Nyquist, you must sample at 8000 Hz or more, at least twice the highest frequency of interest, to get a digital representation of the data. This criterion is met by sampling at 8000 Hz. Half a second at that rate gives 4000 samples, and the full 4096 samples take 0.512 seconds. To collect the speech samples, hardware must be considered. This function is performed by an A/D converter. Fortunately, there is a very common and economical A/D converter available for the PC, the sound card. The sound card uses an A/D converter to collect input from the microphone jack.
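The sampling arithmetic above can be checked with a few lines of Python. This is only a worked calculation; the variable names are made up for illustration.

```python
# Sampling arithmetic for the data acquisition block.
MAX_SPEECH_HZ = 4000   # upper edge of the speech band quoted above
NUM_SAMPLES = 4096     # samples needed by the preprocessing block

# Nyquist: sample at no less than twice the highest frequency of interest.
sample_rate_hz = 2 * MAX_SPEECH_HZ

# Time needed to collect 4096 samples at that rate.
capture_seconds = NUM_SAMPLES / sample_rate_hz

# A power of two has exactly one bit set, so n & (n - 1) is zero.
is_power_of_two = (NUM_SAMPLES & (NUM_SAMPLES - 1)) == 0

print(sample_rate_hz, capture_seconds, is_power_of_two)  # 8000 0.512 True
```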

Accessing this function of the sound card is achieved using a sound card driver library. Without the driver library, programming the sound card at the A/D level could be tricky for someone inexperienced at this type of programming and not much fun for everyone else. Another important hardware component of the data acquisition subsystem is the microphone. A cheap cassette recorder microphone will work, but not as well as a slightly more expensive model. The final aspect of the data acquisition block to be considered is the trigger. Many commercial speech recognition systems use a keyboard combination to get the "attention" of the speech recognition system. This key combination tells the system to "listen up" for a command. In this system, a threshold on the input level triggers sampling for the next 0.5 seconds. The threshold should be high enough to avoid triggering by background noise.
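A minimal sketch of a threshold trigger in Python, assuming the samples arrive as a plain sequence of numbers. The function name capture_on_trigger and the choice to compare the absolute sample value against the threshold are assumptions made for illustration; the original program and its sound card driver calls are not shown here.

```python
from itertools import islice

def capture_on_trigger(stream, threshold, num_samples=4096):
    """Wait for a sample whose magnitude exceeds threshold, then return
    that sample plus the ones that follow, num_samples in total.
    Returns None if the stream never triggers or ends too early."""
    samples = iter(stream)
    for sample in samples:
        if abs(sample) > threshold:
            window = [sample] + list(islice(samples, num_samples - 1))
            return window if len(window) == num_samples else None
    return None

# Background noise stays below the threshold, so it is skipped.
quiet = [0.01, -0.02, 0.015] * 100
# Stand-in for a spoken word: a loud alternating signal.
word = [0.5 if i % 2 == 0 else -0.5 for i in range(5000)]

window = capture_on_trigger(quiet + word, threshold=0.1)
print(len(window), window[0])  # 4096 0.5
```

Setting the threshold just above the loudest background noise keeps the system from sampling when nobody is speaking.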

The second block of the speech recognition system is preprocessing. This section of the system must condition the data for the pattern recognition block. There are two steps in the preprocessing block. The first is to get the frequency spectrum of the samples provided by the data acquisition block. An FFT (Fast Fourier Transform) is used to provide a frequency spectrum of the speech sample. The FFT will yield 4096 frequency data points. Most of the delay before a decision is made by the speech recognition system is here in the preprocessing block. The FFT algorithm is very time-consuming. I am not sure why they call it a Fast Fourier Transform. The second step of the preprocessing block takes the 4096 frequency points and reduces them to 40 points of normalized frequency data.
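A minimal Python sketch of the two preprocessing steps, under stated assumptions. The article does not say how the spectrum is reduced to 40 points, so averaging into equal-width bands and scaling by the largest band are guesses, and the names fft and compress_spectrum are made up for illustration. The sketch keeps only the first 2048 magnitudes because, for real-valued input, the remaining ones mirror them.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def compress_spectrum(samples, num_bands=40):
    """Reduce a block of time samples to num_bands normalized values.
    Assumption: average the magnitudes in equal-width bands, then
    divide by the largest band so the result lies between 0 and 1."""
    spectrum = fft(samples)
    half = len(samples) // 2  # real input: upper half mirrors the lower half
    magnitudes = [abs(c) for c in spectrum[:half]]
    bands = []
    for b in range(num_bands):
        lo = b * half // num_bands
        hi = (b + 1) * half // num_bands
        bands.append(sum(magnitudes[lo:hi]) / (hi - lo))
    peak = max(bands)
    return [v / peak for v in bands] if peak > 0 else bands

# Test signal: a 1250 Hz tone sampled at 8000 Hz for 4096 samples.
tone = [math.sin(2 * math.pi * 1250 * n / 8000) for n in range(4096)]
features = compress_spectrum(tone)
print(len(features), features.index(max(features)))  # 40 12
```

The tone falls in FFT bin 640 (1250 * 4096 / 8000), which belongs to band 12 of the 40 bands, so that band carries the peak value of 1.0.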

The next block in the system, pattern recognition, consists of an artificial neural network. Neural Lab was used to create the neural network block that will recognize 5 spoken commands. More commands can be added, but this will increase the training time and the number of examples needed. The training examples consist of sampled words passed through the preprocessor, each paired with an output pattern identifying the spoken word.


{40 frequency points}    {5 vocabulary points}
{spectrum of word 1}     {1 0 0 0 0}
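The training pair layout above can be sketched in Python as follows. The word order in VOCABULARY is an assumption (the article only says that word 1 maps to {1 0 0 0 0}), and one_hot and make_example are hypothetical helper names, not part of Neural Lab.

```python
# Assumed ordering of the five commands mentioned in this article.
VOCABULARY = ["LEFT", "RIGHT", "FORWARD", "BACK", "STOP"]

def one_hot(word, vocabulary=VOCABULARY):
    """Output pattern with a 1 at the word's position and 0 elsewhere."""
    if word not in vocabulary:
        raise ValueError("unknown word: %r" % (word,))
    return [1 if w == word else 0 for w in vocabulary]

def make_example(features, word):
    """Pair 40 preprocessed frequency points with the word's output pattern."""
    if len(features) != 40:
        raise ValueError("expected 40 frequency points")
    return (list(features), one_hot(word))

print(one_hot("LEFT"))   # [1, 0, 0, 0, 0]
print(one_hot("STOP"))   # [0, 0, 0, 0, 1]
```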

Good results can be obtained with as few as three training examples of each command. Backpropagation with momentum and unipolar sigmoid activation are used to train the neural network. The neural network was configured with 40 input nodes (one for each output of the preprocessing section), 20 hidden nodes, and 5 output nodes (one for each word in the vocabulary). Changing the number of hidden nodes may improve or degrade performance. Training on a 386 should take no more than 30 minutes to get encouraging results. If the neural network does not appear to converge, more training samples are needed. A bad training pattern can slow down or stop the neural network from training properly.
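For readers without Neural Lab, here is a minimal Python sketch of a 40-20-5 network trained by backpropagation with momentum and unipolar sigmoid activation. It is not the Neural Lab implementation. The learning rate, momentum value, weight initialization, and the toy "spectra" used as training data are all assumptions chosen for illustration.

```python
import math
import random

def sigmoid(x):
    # Unipolar sigmoid; the clamp only guards math.exp against overflow.
    x = max(-60.0, min(60.0, x))
    return 1.0 / (1.0 + math.exp(-x))

class Network:
    """One hidden layer, trained by backpropagation with momentum."""

    def __init__(self, n_in=40, n_hidden=20, n_out=5, seed=0):
        rng = random.Random(seed)
        # Each row holds one node's weights plus a trailing bias weight.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                      for _ in range(n_out)]
        self.v_hidden = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        self.v_out = [[0.0] * (n_hidden + 1) for _ in range(n_out)]

    @staticmethod
    def _layer(weights, inputs):
        ext = inputs + [1.0]  # constant 1.0 feeds the bias weight
        return [sigmoid(sum(w * x for w, x in zip(row, ext))) for row in weights]

    def forward(self, inputs):
        hidden = self._layer(self.w_hidden, list(inputs))
        return hidden, self._layer(self.w_out, hidden)

    @staticmethod
    def _update(weights, velocity, deltas, ext_inputs, rate, momentum):
        for row, vel, d in zip(weights, velocity, deltas):
            for i, x in enumerate(ext_inputs):
                # Momentum: reuse a fraction of the previous weight change.
                vel[i] = momentum * vel[i] + rate * d * x
                row[i] += vel[i]

    def train_step(self, inputs, target, rate=0.2, momentum=0.8):
        inputs = list(inputs)
        hidden, out = self.forward(inputs)
        # Output deltas: error times the sigmoid derivative y * (1 - y).
        d_out = [(t - y) * y * (1 - y) for t, y in zip(target, out)]
        # Hidden deltas: output deltas propagated back through w_out.
        d_hidden = [h * (1 - h) * sum(d_out[k] * self.w_out[k][j]
                                      for k in range(len(d_out)))
                    for j, h in enumerate(hidden)]
        self._update(self.w_out, self.v_out, d_out, hidden + [1.0], rate, momentum)
        self._update(self.w_hidden, self.v_hidden, d_hidden, inputs + [1.0], rate, momentum)
        return sum((t - y) ** 2 for t, y in zip(target, out))

def make_pattern(word_index, rng, n_in=40, n_words=5):
    """Toy 'spectrum': one block of bands is strong, the rest are weak."""
    width = n_in // n_words
    return [(1.0 if i // width == word_index else 0.1) + rng.uniform(-0.05, 0.05)
            for i in range(n_in)]

rng = random.Random(1)
examples = []
for _ in range(3):  # three examples per command
    for word in range(5):
        target = [1.0 if k == word else 0.0 for k in range(5)]
        examples.append((make_pattern(word, rng), target))

net = Network()
errors = []
for epoch in range(200):
    errors.append(sum(net.train_step(x, t) for x, t in examples))

print("first epoch error: %.3f, last epoch error: %.3f" % (errors[0], errors[-1]))
```

The summed squared error per epoch gives a rough view of convergence. If it stops falling, that is the point where the advice above applies: check for a bad training pattern or add more examples.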

The final block in the speech recognition system is postprocessing. Here the output of the neural network is interpreted. The neural network will produce an output corresponding to each word in the vocabulary. The larger the value is, the more likely it is that the word was spoken. A simple sort will find the neural network output with the largest value. The word corresponding to this output is taken as the word that was spoken.

This simple system was designed as a neural network demonstration. Multiple users can be sampled and used in the command training set. This often requires more training time, and could require more hidden nodes in the neural network. This system was successfully used as a voice command system for a robot. The commands LEFT, RIGHT, FORWARD, BACK, and STOP were spoken to control the robot's movements. A long parallel port cable was connected between a 486 and the robot to relay the actions to the robot. Occasionally, a command had to be quickly repeated when the robot "misunderstood" it. This system could also be ported to a microcontroller to provide a robot with voice control.
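Returning to the postprocessing step described above, picking the largest network output can be sketched in Python as follows. The command order and the function name interpret are assumptions. The optional min_score cutoff is not part of the original system; it is included only as one hypothetical way to reject weak matches.

```python
# Assumed ordering of the network's five outputs.
COMMANDS = ["LEFT", "RIGHT", "FORWARD", "BACK", "STOP"]

def interpret(outputs, commands=COMMANDS, min_score=0.0):
    """Return the command whose network output is largest.
    Returns None if even the best output is below min_score
    (a hypothetical cutoff, not described in the article)."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    if outputs[best] < min_score:
        return None
    return commands[best]

print(interpret([0.05, 0.10, 0.92, 0.03, 0.20]))                  # FORWARD
print(interpret([0.05, 0.10, 0.30, 0.03, 0.20], min_score=0.5))   # None
```

A full sort is not needed here; a single pass for the maximum does the same job.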

There are many other techniques and tricks that can be used for speech recognition. This is just a simple example of how a neural network can be used to solve this very interesting problem. It also demonstrates how many practical neural-network-based systems use preprocessing and postprocessing to make the most of the data available.


Copyright 2022 - Zagros Robotics, All Rights Reserved - Please send webpage comments or corrections to webmaster@zagrosrobotics.com - Zagros Robotics,PO Box 460342, St. Louis, MO 63146, info@zagrosrobotics.com for answers to any questions.
