SPCH03: Speech synthesis based on speech parameters
In this project we want to reverse the signal processing done in current speech recognizers, i.e. we want to re-synthesize speech from the parameters used internally by the speech recognizer.
Our interest in this operation is threefold. First, there is the purely scientific question of whether all the information needed to understand the speech is still retained in these parameters. Any information that is lost in this very first step of the speech recognition process will inevitably limit the final recognition accuracy.
The second purpose of this research is more practical. When adding a speech interface to mobile information appliances (e.g. taxi and truck drivers requesting a route description for a certain address), a distributed approach is typically used. In such a distributed approach, the mobile appliance only converts the speech into a compact (low-bandwidth) set of parameters, while a dedicated (non-mobile) server does the actual recognition of the spoken queries and then sends back the requested information. Current mobile appliances code the speech according to the GSM standard. However, GSM coding is not well suited to automatic recognition systems. When switching to the parameters used by automatic recognition systems, it should still be possible to re-synthesize the speech in case a human operator is needed to complete the query.
The third purpose is speech modification. This is needed, for example, to post-synchronize the voices of actors recorded in the studio with the low-quality recording made on location. Creating new voices for TTS (text-to-speech) systems starting from an existing voice is also a big market. The signal processing done in speech recognizers more or less extracts the following three basic components of a speech sound: the excitation signal (voiced or unvoiced, and pitch), the energy and the vocal tract. Given this decomposition and some re-synthesis algorithm, it is easy to reconstruct the speech signal with one or more properties changed.
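To make the three-component decomposition concrete, the following is a minimal sketch of re-synthesis with one property (the pitch) changed. It is an illustration under assumptions, not the method of the project: the vocal tract is modelled here as a simple all-pole filter with a single resonance, whereas a recognizer front end would more likely deliver a smoothed spectral envelope. All function names, the sampling rate and the filter values are hypothetical.

```python
import numpy as np

def excitation(n_samples, fs, pitch_hz=None, seed=0):
    """Impulse train when voiced (pitch_hz given), white noise when unvoiced."""
    if pitch_hz is None:
        return np.random.default_rng(seed).standard_normal(n_samples)
    period = int(round(fs / pitch_hz))  # pitch period in samples
    e = np.zeros(n_samples)
    e[::period] = 1.0
    return e

def all_pole_filter(e, a):
    """Assumed vocal tract model: y[n] = e[n] - sum_k a[k] * y[n-k]."""
    y = np.zeros(len(e))
    for n in range(len(e)):
        acc = e[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y[n] = acc
    return y

def synthesize(fs, n_samples, pitch_hz, energy, a):
    """Combine excitation, vocal tract filter and energy (mean square value)."""
    y = all_pole_filter(excitation(n_samples, fs, pitch_hz), a)
    rms = np.sqrt(np.mean(y ** 2))
    return y * np.sqrt(energy) / max(rms, 1e-12)

# Hypothetical vocal tract: one resonance at 700 Hz with 100 Hz bandwidth.
fs = 8000
r, theta = np.exp(-np.pi * 100 / fs), 2 * np.pi * 700 / fs
a = [-2 * r * np.cos(theta), r ** 2]

# Same vocal tract and energy, different pitch: a simple "voice modification".
original = synthesize(fs, 800, pitch_hz=100, energy=0.01, a=a)
higher = synthesize(fs, 800, pitch_hz=150, energy=0.01, a=a)
```

Because the three components are kept separate, only the `pitch_hz` argument differs between the two calls; the resonance and the energy are untouched.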
The basic operations involved in converting the speech signal into parameters suitable for a recognizer are depicted in the figure below. When reversing the process, the first difficulty will be estimating good phase information from the power spectrum alone, since the power spectrum discards the phase of every frequency component. A second difficulty is the handling of pitch and voicing information.
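One well-known way to attack the phase problem is the iterative Griffin-Lim procedure, sketched below with NumPy only. It is given purely as an illustration of what "estimating phase from magnitudes" can look like; the project does not prescribe this algorithm. The frame length, hop size, iteration count and all function names are assumptions made for the example.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Short-time Fourier transform with a Hann window (one row per frame)."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, n_fft=256, hop=64):
    """Least-squares inverse: windowed overlap-add, normalized by window energy."""
    win = np.hanning(n_fft)
    n_frames = spec.shape[0]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i in range(n_frames):
        out[i * hop:i * hop + n_fft] += frames[i] * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_fft=256, hop=64, n_iter=50, seed=0):
    """Start from random phase, then alternate between the time domain and
    the spectral domain, always re-imposing the known magnitudes."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        signal = istft(magnitude * phase, n_fft, hop)
        phase = np.exp(1j * np.angle(stft(signal, n_fft, hop)))
    return istft(magnitude * phase, n_fft, hop)

# Usage on a synthetic test tone: keep only the magnitudes, then rebuild.
t = np.arange(4000) / 8000.0
x = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)
magnitude = np.abs(stft(x))  # square root of the power spectrum
reconstructed = griffin_lim(magnitude)
```

The sketch starts from full-resolution magnitudes. The parameters of a real recognizer are much coarser, so the spectrum itself would first have to be approximated before any such phase estimation can be applied.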
Applicants for this project should have basic knowledge of signal processing algorithms (Fast Fourier transform, ...).