Speech              

Speech Recognition -

The goal is to understand speech utterances. There are many applications. The field draws on many techniques, including hidden Markov models (HMMs), neural networks, and statistical pattern recognition.

Speaker Verification: confirming the identity of an individual from his or her speech.

Speech Synthesis -

The generation of artificial speech utterances. Typically used in automated response systems and in text-to-speech applications.

Speech Coding and Compression -

The representation of speech samples and the reduction of the storage needed to save them.

Phonetics/Linguistics -

The study of speech in terms of its sounds (phonetics) and of other language issues such as syntax (linguistics), all of which are used in the applications above. This can cover parsing, natural language processing, phonology, and prosody.

Steps in Speech Recognition

Sampling

Sampling converts an analog speech signal, recorded from a phone, microphone, or similar device, into a discrete sequence of sample points.
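As a rough illustration of what "creating discrete sample points" means, the sketch below evaluates a continuous-time signal at evenly spaced instants. The helper names `sample` and `tone`, and the 440 Hz test tone, are assumptions made for this example; a real system would read the samples from an audio device instead.

```python
import math

def sample(signal, duration_s, sample_rate):
    """Evaluate a continuous-time signal at the instants n / sample_rate."""
    count = int(duration_s * sample_rate)
    return [signal(n / sample_rate) for n in range(count)]

def tone(t):
    """Hypothetical 'analog' input: a 440 Hz sine wave as a function of time in seconds."""
    return math.sin(2 * math.pi * 440 * t)

# Half a second of the tone at an 8 kHz sampling rate gives 4000 sample points.
samples = sample(tone, 0.5, 8000)
print(len(samples), "samples, spaced", 1000 / 8000, "ms apart")
```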

Minimum sampling (poor speech): For recorded speech to be understood by humans, you need a sampling rate of 8 kHz or more and at least 8 bits per sample. Quality can be improved by increasing the number of bits per sample to 12 or 16, or by using a non-linear encoding technique such as mu-law or A-law. Either change improves the signal-to-noise ratio.
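The effect of non-linear encoding can be sketched with the standard mu-law companding curve, F(x) = sgn(x) ln(1 + mu|x|) / ln(1 + mu), with mu = 255. The quantizer, the quiet 440 Hz test tone, and the function names below are assumptions chosen for illustration; this is not a bit-exact telephony codec.

```python
import math

MU = 255  # value used for 8-bit mu-law telephony

def mu_law_compress(x, mu=MU):
    """Map a sample in [-1, 1] to a companded value in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)

def quantize(v, bits):
    """Uniformly quantize a value in [-1, 1] to 2**bits levels (mid-point reconstruction)."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = min(int((v + 1.0) / step), levels - 1)
    return -1.0 + (index + 0.5) * step

def snr_db(signal, reconstructed):
    """Signal-to-noise ratio in decibels, treating the reconstruction error as noise."""
    signal_power = sum(s * s for s in signal)
    noise_power = sum((s - r) ** 2 for s, r in zip(signal, reconstructed))
    return 10 * math.log10(signal_power / noise_power)

# A quiet tone (1% of full scale) sampled at 8 kHz: the hard case for linear 8-bit coding.
quiet = [0.01 * math.sin(2 * math.pi * 440 * n / 8000) for n in range(800)]
linear = [quantize(s, 8) for s in quiet]
companded = [mu_law_expand(quantize(mu_law_compress(s), 8)) for s in quiet]

print("linear 8-bit SNR (dB):", round(snr_db(quiet, linear), 1))
print("mu-law 8-bit SNR (dB):", round(snr_db(quiet, companded), 1))
```

With the same 8 bits, the companded version spends more of its levels on small amplitudes, which is why quiet speech comes out with less quantization noise.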

Typical sampling (adequate for most speech recognition): A 16 kHz sampling rate is a reasonable target for high-quality speech recording and playback.

 

Endpoint Detection and Blocking/Windowing

Endpoint detection is the detection of the start and stop of speech utterances in a sound sample.

Blocking/windowing is the breaking up of a speech utterance into short windows of time (e.g. 10-30 ms).
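A minimal sketch of blocking is shown here. The 25 ms frame length falls inside the 10-30 ms range given above; the 10 ms hop between frames and the Hamming taper are common choices but are assumptions of this example, as are the function names.

```python
import math

def frame_signal(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a signal into overlapping frames of frame_ms, advancing hop_ms each time."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop_len)]

def hamming(n):
    """Hamming window of length n, which tapers the frame edges toward 0.08."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def apply_window(frame, window):
    """Multiply a frame by a window, sample by sample."""
    return [s * w for s, w in zip(frame, window)]

# One second of a 200 Hz tone at 16 kHz, cut into 25 ms frames (400 samples each).
signal = [math.sin(2 * math.pi * 200 * n / 16000) for n in range(16000)]
frames = frame_signal(signal, 16000)
windowed = [apply_window(f, hamming(len(f))) for f in frames]
print(len(frames), "frames of", len(frames[0]), "samples")
```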

Accurate end-pointing is a non-trivial task. However, reasonable behaviour can be obtained for inputs that contain only speech surrounded by silence (no other noises).

Typical algorithms look at the energy or amplitude of the incoming signal and at the rate of "zero-crossings". A zero-crossing is where the audio signal changes from positive to negative or vice versa. When the energy and the zero-crossing rate reach certain levels, it is reasonable to guess that speech is present. More detailed descriptions are provided in the papers.
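The energy and zero-crossing idea can be sketched as follows. The decision rule, the thresholds, and the function names are illustrative assumptions rather than a published algorithm, and the test input is a synthetic tone surrounded by pure silence, which is the easy case described above.

```python
import math

def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def detect_endpoints(samples, frame_len, energy_threshold, zcr_threshold):
    """Return (start, end) sample indices of the detected utterance, or None."""
    active = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = short_time_energy(frame)
        zcr = zero_crossing_rate(frame)
        # Assumed rule: loud frames count as speech; weaker frames count only
        # if they also cross zero often (as unvoiced sounds tend to do).
        if energy > energy_threshold or (energy > energy_threshold / 10
                                         and zcr > zcr_threshold):
            active.append(start)
    if not active:
        return None
    return active[0], active[-1] + frame_len

# 0.2 s silence, 0.4 s of a 200 Hz tone, 0.2 s silence, all at 8 kHz.
silence = [0.0] * 1600
speech = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(3200)]
recording = silence + speech + silence

print(detect_endpoints(recording, frame_len=160,
                       energy_threshold=0.01, zcr_threshold=0.3))
```

In practice the thresholds would be tuned, for example from the background level measured in the first few frames of the recording.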

Signal Processing: Preprocessing, Feature Extraction and Postprocessing

Preprocessing

So far we have isolated a potential speech utterance, but this does not mean the time sample is free of noise. Other issues, such as variation in speaking rate and pitch differences between speakers, can also influence the performance of a recognition system. Speech signals are often processed not only in the temporal domain but also in the frequency (spectral) domain.

Feature Extraction

Rather than processing the raw data samples, speech systems sometimes try to extract higher-order information. This serves two purposes:

  1. it reduces the amount of information to be processed;
  2. it (hopefully) feeds more meaningful information into the recognition system.

This stage can be optional. Feature extraction can take place in either the temporal or the spectral/frequency domain.

Temporal Feature Extraction

Extracting features of speech from each frame in the time domain has the advantages of simplicity, quick calculation, and easy physical interpretation.

These temporal features include:

  • short-time average energy and amplitude,
  • short-time zero-crossing rate,
  • short-time autocorrelation,
  • pitch periods,
  • root mean square (rms),
  • maximum of amplitude,
  • voicing quality,
  • difference between maximum and minimum values in the positive and negative halves of the signal,
  • autocorrelation peaks.
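Several of the temporal features listed above can be computed in a few lines per frame. The sketch below is one possible formulation; the function names are assumptions, and the pitch estimate simply picks the lag with the largest autocorrelation inside a caller-supplied range, which is a simplification of real pitch trackers.

```python
import math

def average_energy(frame):
    """Short-time average energy: mean of the squared samples."""
    return sum(s * s for s in frame) / len(frame)

def average_amplitude(frame):
    """Short-time average amplitude: mean of the absolute sample values."""
    return sum(abs(s) for s in frame) / len(frame)

def rms(frame):
    """Root mean square of the frame."""
    return math.sqrt(average_energy(frame))

def autocorrelation(frame, lag):
    """Short-time autocorrelation of the frame at the given lag."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

def pitch_period(frame, min_lag, max_lag):
    """Lag (in samples) of the autocorrelation peak within [min_lag, max_lag]."""
    return max(range(min_lag, max_lag + 1), key=lambda lag: autocorrelation(frame, lag))

# A 50 ms frame of a 100 Hz tone at 8 kHz: one period lasts 80 samples.
frame = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(400)]
print("rms:", round(rms(frame), 4))
print("max amplitude:", round(max(abs(s) for s in frame), 4))
print("pitch period (samples):", pitch_period(frame, 20, 160))
```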

Spectral/Frequency Features

The features in the frequency domain may include:

  • the difference between peak and valley,
  • the energy in a particular frequency region,
  • the spectral gradient,
  • spectral variation contour.
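The first two spectral features above can be sketched with a direct discrete Fourier transform of one frame. A naive DFT is used here only to keep the example self-contained; an FFT routine would normally be used. The function names and the 1000 Hz test tone are assumptions of this example.

```python
import cmath
import math

def power_spectrum(frame):
    """Power at each non-negative frequency bin, computed with a direct (slow) DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum

def band_energy(spectrum, sample_rate, frame_len, low_hz, high_hz):
    """Energy in a particular frequency region: sum of the bins between low_hz and high_hz."""
    bin_hz = sample_rate / frame_len
    return sum(p for k, p in enumerate(spectrum) if low_hz <= k * bin_hz <= high_hz)

def peak_valley_difference(spectrum):
    """Difference between the largest and smallest value of the power spectrum."""
    return max(spectrum) - min(spectrum)

# A 64-sample frame of a 1000 Hz tone at 8 kHz; each bin is 125 Hz wide.
frame = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(64)]
spectrum = power_spectrum(frame)
print("energy 800-1200 Hz:", round(band_energy(spectrum, 8000, 64, 800, 1200), 3))
print("energy 2000-3000 Hz:", round(band_energy(spectrum, 8000, 64, 2000, 3000), 3))
```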

 

 

 

Training & Recognition

Some systems take advantage of a training session with the user, which tunes the recognition algorithms for better performance. This is called Speaker-dependent Recognition (works somewhere at 98%-? accuracy).

Recognition without training is a more difficult problem and is called Speaker-independent Recognition (works somewhere around 95-?% accuracy).

Many algorithms have been developed for recognition. Most use statistical pattern recognition approaches; others may also incorporate natural language processing. Important issues include speed and accuracy. In some sense, search and optimization are part of the recognition process ("find the utterance that matches with the minimum error").
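The "minimum error" view of recognition can be illustrated with template matching. The sketch below uses dynamic time warping (DTW), a classic technique chosen here as an assumption because it tolerates differences in speaking rate; it is not necessarily the method used by any particular system. The templates and the one-number-per-frame feature sequences are made up for the example.

```python
def dtw_distance(a, b):
    """Minimum cumulative |a[i] - b[j]| cost over all monotonic alignments of a and b."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]

def recognize(features, templates):
    """Return the template label whose sequence matches the input with minimum error."""
    return min(templates, key=lambda label: dtw_distance(features, templates[label]))

# Hypothetical per-frame feature values (e.g. energies) for two stored words.
templates = {"yes": [1, 3, 5, 3, 1], "no": [5, 4, 3, 2, 1]}

# The same contour as "yes", spoken at half the speed.
spoken = [1, 1, 3, 3, 5, 5, 3, 3, 1, 1]
print(recognize(spoken, templates))
```

In a speaker-dependent system, the templates would come from the user's training session.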

 

Example: Simple Speech Verification Algorithm

 

© Lynne Grewe