A significant share of the errors made by a speech recognizer are related to speech detection. Even weak background sound before, after, or during the speech to be recognized leads to insertion errors: the speech recognizer tries to explain the background noise as speech. The reason is that speech recognizers are based on accurate statistical models for speech, while "background sound" is such a wide concept that it cannot be modeled accurately and is therefore represented by simple models. The decoder will often consider these simple models improbable, and models for speech may get a higher probability, leading to the insertion errors mentioned above.
To solve this problem, a speech detector is built into the speech recognizer, and only the speech is sent to the decoder. This speech detector may be based on features that cannot be incorporated easily in a decoder, for instance harmonicity (voiced speech), duration, or pitch contour.
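As an illustration of one such feature, harmonicity can be approximated by the peak of the normalized autocorrelation of a short frame over lags that correspond to plausible pitch periods. The sketch below is only one possible measure, not a prescribed method: the function name `harmonicity`, the pitch range of 60 to 400 Hz, the 8 kHz sample rate, and the synthetic test signals are all assumptions made for the example.

```python
import math
import random


def harmonicity(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Return the peak normalized autocorrelation over plausible pitch lags.

    Values near 1 suggest a periodic (voiced) frame; low values suggest noise.
    The pitch range f0_min..f0_max is an assumed default, not taken from the text.
    """
    n = len(frame)
    if n == 0:
        return 0.0
    mean = sum(frame) / n
    x = [s - mean for s in frame]  # remove any DC offset

    # Convert the assumed pitch range into a range of lags (in samples).
    lag_min = max(1, int(sample_rate / f0_max))
    lag_max = min(int(sample_rate / f0_min), n - 1)

    best = 0.0
    for lag in range(lag_min, lag_max + 1):
        head = x[: n - lag]
        tail = x[lag:]
        cross = sum(a * b for a, b in zip(head, tail))
        norm = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
        if norm > 0.0:
            best = max(best, cross / norm)
    return best


# Illustrative input: a 100 Hz tone versus white noise, 50 ms at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 100 * i / sample_rate) for i in range(400)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(400)]

print(harmonicity(tone, sample_rate))   # close to 1 for the periodic signal
print(harmonicity(noise, sample_rate))  # clearly lower for the noise
```

A real detector would combine such a per-frame value with other cues (duration, pitch contour) and smooth the decision over time; a single threshold on this measure would miss unvoiced speech.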
Besides avoiding recognition errors, a speech detector serves other purposes. When only background sounds are present, the decoder cannot use the acoustic model or the grammatical model to restrict its search (in other words, there are many possible words at the start of a sentence!). Therefore a lot of computation power is used when ... nothing ... is said! This effect is even worse when loud background sounds are present that may mask soft speech. So for low-energy (portable) applications, a speech detector is a good investment. Another reason to use a good speech detector is that it makes it possible to adapt the model for the background noise, so as to compensate for that noise during speech.
So the aim of the thesis is to build a reliable speech detector that exploits the above signal properties (e.g. harmonicity). The system will be evaluated by checking how well audio with different types of background noise can be classified as speech or non-speech.
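One way such an evaluation could be scored is per frame, comparing the detector output with a hand-labeled reference. The sketch below assumes this frame-level setup; the function name `detection_scores`, the choice of metrics (miss rate, false-alarm rate, accuracy), and the toy label sequences are illustrative assumptions and not part of the thesis description.

```python
def detection_scores(reference, hypothesis):
    """Compare per-frame speech/non-speech decisions with reference labels.

    Both arguments are equal-length sequences of booleans (True = speech).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    pairs = list(zip(reference, hypothesis))
    n_speech = sum(1 for r, _ in pairs if r)
    n_nonspeech = len(pairs) - n_speech
    misses = sum(1 for r, h in pairs if r and not h)        # speech frames rejected
    false_alarms = sum(1 for r, h in pairs if h and not r)  # noise frames accepted
    correct = len(pairs) - misses - false_alarms

    return {
        "miss_rate": misses / n_speech if n_speech else 0.0,
        "false_alarm_rate": false_alarms / n_nonspeech if n_nonspeech else 0.0,
        "accuracy": correct / len(pairs) if pairs else 0.0,
    }


# Toy example: 8 frames, one missed speech frame and one false alarm.
reference = [False, False, True, True, True, True, False, False]
hypothesis = [False, True, True, True, True, False, False, False]
print(detection_scores(reference, hypothesis))
# {'miss_rate': 0.25, 'false_alarm_rate': 0.25, 'accuracy': 0.75}
```

Reporting misses and false alarms separately, per noise type, shows whether a detector fails by dropping soft speech or by passing background sound on to the decoder, which a single accuracy figure would hide.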