Speech group year report 1999-2000





Centre for Processing of Speech & Images

The Centre for Processing of Speech and Images (PSI), formerly known as MI2, bundles the research on the acquisition, processing and generation of audio-visual signals and on their application to real-world problems. Because the applications differ, the medical and the industrial applications are handled by two separate subgroups (MIC and VISICS), and speech/audio processing forms a third subgroup (SPEECH). For more information about PSI, please visit http://www.esat.kuleuven.be/psi/.

The PSI group is headed by Paul Suetens. Research is supervised by the staff members Paul Suetens, Luc Van Gool, Patrick Wambacq, Dirk Vandermeulen, Luc Van Eycken, Werner Verhelst, and Dirk Van Compernolle. Specialized equipment for I/O and processing of images, video, 3D and speech is available. The more than 50 researchers, about 10 of whom are post-docs, are supported by an administrative and university-level technical staff of more than 10 people. The staff members are also actively involved in courses on image and speech processing and their telecommunication aspects, at the university level as well as at the post-graduate level. Their contribution to industrial courses is solicited on a regular basis.

PSI is involved in several national and international projects, including fundamental as well as applied research. The national collaborations are either sponsored directly by companies (Agfa, Siemens, Lernout & Hauspie, ...), or co-financed by a governmental organization (IWT, FWO, ...). The international projects are mainly European ones (ACTS, Esprit, Brite, ...). Within and outside these projects, the group has collaborations and exchanges with many of the major players in computer vision and speech recognition, including non-European research institutes. Furthermore, PSI participates in standardization bodies: BIN (Belgian standardization organization), MPEG (ISO), and IUE/TargetJr (computer vision software standardization).

The industrial image processing group (VISICS), headed by L. Van Gool, conducts research in the field of non-medical image processing. Current research topics are 3D scene reconstruction, shape and object description and recognition, visual inspection, compression, remote sensing, and image database retrieval.

The laboratory for medical image computing (MIC), headed by P. Suetens, is a joint initiative of the Division Radiology (Faculty of Medicine) and PSI (Faculty of Applied Sciences). The group conducts research in the field of medical imaging in its broad sense. An emphasis is put on demonstrators in a clinical setting, which allow evaluating the actual clinical benefits of the recently developed methods. The research unit conducts fundamental and applied research that yields the necessary knowledge to acquire, manipulate, analyse, display, transmit and archive 2-D and 3-D medical images in a useful and efficient way. Important research activities include image acquisition (physical aspects and reconstruction), image processing (enhancement, quantification, registration, ...) for diagnosis, therapy and surgery planning, image guided surgery, and multimedia archiving and communication systems.

The research activities of the speech processing group, coordinated by P. Wambacq, are situated in the domain of speech recognition, enhancement, compression, modification and neural networks. The acquired know-how and expertise in prototype development is used to turn theoretical algorithms into applied procedures.





9.1 Acoustic modelling for large vocabulary speech recognition

P. Wambacq, J. Duchateau

The ongoing research in the field of acoustic modelling for large vocabulary speech recognition mainly consisted of the development of acoustic models for different languages and recognition tasks.

In previous years, the research focused on strategies for the development of acoustic models that are both accurate and efficient. This resulted in models with state-of-the-art recognition performance that can be evaluated in real time.

More recently, acoustic models for Dutch were developed in the same way as for English. Special attention was paid to the use of several acoustic databases, with different channel and noise conditions. More specifically, a noise-masking algorithm was developed, and the necessity of database specific decision trees and feature transforms was checked. Furthermore, the influence of the class definition on selection and decorrelation of features, and on initialisation of gaussians was investigated.

Based on the acquired knowledge, acoustic models are currently being developed for different languages and tasks in the framework of several speech group projects: Dutch models for dictation, spontaneous speech and telephone speech, American English models for spontaneous speech, dictation models for British English and Greek.





9.2 Automatic transcription and normalization of speech (ATraNoS)

P. Wambacq, J. Duchateau, K. Demuynck, T. Laureys, D.H. Van Uytsel

The ATraNoS project aims at contributing to the development of better products for the automatic verbatim transcription of speech, and for the conversion of these transcriptions to a form that is better adapted to the needs of the end-user. One application, which will serve as a case study, is the generation of subtitles for the benefit of hearing-impaired people.

The research at ESAT focuses on two topics: (1) the detection of out-of-vocabulary words based on confidence measures for the recognition results; (2) the improvement of the robustness of the recogniser against the presence of disfluencies (hesitations, repetitions, ...) in spontaneous speech utterances using statistical language models.

More information can be found on the web site http://atranos.esat.kuleuven.be/.





9.3 Optimal linear feature transformation

P. Wambacq, K. Demuynck

One of the first steps in speech recognition is the extraction of a set of relevant parameters from the speech signal. For optimal performance, this set of parameters should be compact, provide a good distinction between the different sounds in a language and be consistent with the parametric models used by the speech recogniser.

In speech recognition, a two-stage approach is used to achieve these goals. The first stage is based on physical considerations and transforms the speech signal to a set of log-energy values in different frequency-bands. The second stage consists of a linear transformation, which condenses these log-energy values and their time-derivatives into a compact set of features.
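As an illustration of the second stage, the following minimal Python sketch appends a time derivative to each frame of log-energy values and then applies a linear transformation. The function names, the central-difference estimate of the derivative and the toy matrix are assumptions made for this example only; they are not the transformations used in our recogniser.

```python
from typing import List

Vector = List[float]


def delta(frames: List[Vector], t: int) -> Vector:
    """Estimate the time derivative at frame t with a central difference.

    At the edges the nearest available frame is reused (an assumption
    made for this sketch).
    """
    prev = frames[max(t - 1, 0)]
    nxt = frames[min(t + 1, len(frames) - 1)]
    return [(b - a) / 2.0 for a, b in zip(prev, nxt)]


def transform(matrix: List[Vector], x: Vector) -> Vector:
    """Apply a linear transformation (matrix-vector product)."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def extract_features(frames: List[Vector], matrix: List[Vector]) -> List[Vector]:
    """Stack each frame with its derivative and project it to fewer dimensions."""
    features = []
    for t, frame in enumerate(frames):
        stacked = frame + delta(frames, t)
        features.append(transform(matrix, stacked))
    return features
```

With two log-energy values per frame, the stacked vector has four entries, and a 2x4 matrix condenses it into a two-dimensional feature vector. In practice the matrix is not chosen by hand but estimated from data.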

Two new data-driven algorithms were developed to optimise this linear transformation. The first algorithm calculates the linear transformation that maps a large feature set onto a smaller one with a minimal loss of information. The second algorithm makes the features more suitable for modelling with mixtures of diagonal covariance gaussians as used in our speech recogniser. To that end, it calculates the transformation matrix that minimizes the local correlations between the features at the gaussian level. Both algorithms showed a 5 to 10% relative improvement over existing algorithms such as linear discriminant analysis and principal component analysis, while increasing the number of parameters in the acoustic models by less than 0.5%.





9.4 Hybrid MLP/HMM speech recognition

P. Wambacq, K. Demuynck

One of the pillars on which the success of the current generation of speech recognisers is based is the use of stochastic models to capture the inherent variability of speech. The stochastic properties of the elementary sounds in a language are usually modelled with mixtures of diagonal covariance gaussians.

In this project, an alternative modelling strategy with multi-layer perceptrons (MLPs) as probability estimators was investigated. Advantages of the MLPs are the invariance of the structure with respect to linear transformations, the ability to cope with non-gaussian distributed data and the discriminative training.

The first research topic was the objective function used to train the neural network. By replacing the standard least-squares criterion with a cross-entropy objective function, a 30% relative improvement in recognition accuracy was achieved.
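The difference between the two criteria can be made concrete with a small Python sketch. It assumes one-hot targets and network outputs that can be read as class probabilities; the function names are chosen for this example. It does not reproduce the experiment above, it only shows that the cross-entropy criterion penalises a confident wrong decision much more heavily than the least-squares criterion, whose value is bounded.

```python
import math
from typing import Sequence


def squared_error(outputs: Sequence[float], target_index: int) -> float:
    """Least-squares criterion against a one-hot target."""
    return sum(
        (o - (1.0 if i == target_index else 0.0)) ** 2
        for i, o in enumerate(outputs)
    )


def cross_entropy(outputs: Sequence[float], target_index: int) -> float:
    """Cross-entropy criterion: negative log-probability of the correct class."""
    return -math.log(outputs[target_index])


# A network that is confidently wrong: the correct class (index 0)
# receives a probability of only 0.01.
confident_mistake = [0.01, 0.99]
least_squares_loss = squared_error(confident_mistake, 0)   # bounded by 2
cross_entropy_loss = cross_entropy(confident_mistake, 0)   # grows without bound
```

For two classes the least-squares loss can never exceed 2, whereas the cross-entropy loss keeps growing as the probability assigned to the correct class approaches zero.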

The second topic was the training algorithm. A comparison between the scaled conjugate gradient algorithm and a simple (incremental) gradient descent algorithm showed that, despite the outstanding performance of the conjugate gradient algorithm on small artificial problems, the best algorithm for speech recognition is still the gradient descent scheme, which converges up to 100 times faster.

The third and last topic was the structure of the MLPs. By using a hierarchical structure, both the number of parameters in the MLPs and the evaluation time could be limited. This new structure, however, also resulted in a substantial drop in the performance of the network, which limits the usability of this technique.

The final comparison with mixtures of gaussians showed that MLPs could achieve the same performance with one third of the parameters. However, MLPs also showed some drawbacks: their evaluation is slow, the networks become too large when using context-dependent acoustic models, and their structure is not flexible, which makes techniques such as noise or speaker adaptation difficult if not impossible.





9.5 Language modelling: less data, more structure

P. Wambacq, D. Van Compernolle, D.H. Van Uytsel, P. Vanroose

A language model estimates the a priori probability of a word sequence. The performance of a large vocabulary recognition system relies heavily on its language model. The main workhorse of state-of-the-art language models is the trigram, which predicts the occurrence of a word given the two preceding ones. The probability estimates are extracted from very large text corpora in a process called training.
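The trigram idea can be sketched in a few lines of Python. The example below estimates trigram probabilities by plain relative frequencies on a toy corpus; the function names and the sentence boundary markers are assumptions of this sketch, and real language models add smoothing so that unseen trigrams do not receive probability zero.

```python
from collections import Counter
from typing import List, Tuple


def train_trigram(sentences: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count trigrams and their two-word contexts in a tokenised corpus."""
    trigram_counts: Counter = Counter()
    context_counts: Counter = Counter()
    for words in sentences:
        # Two start markers give the first word a full two-word context.
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            trigram_counts[context + (padded[i],)] += 1
            context_counts[context] += 1
    return trigram_counts, context_counts


def trigram_prob(trigram_counts: Counter, context_counts: Counter,
                 w1: str, w2: str, w3: str) -> float:
    """Relative-frequency estimate of P(w3 | w1, w2); 0.0 for unseen contexts."""
    seen = context_counts[(w1, w2)]
    if seen == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / seen
```

On the corpus "the cat sat", "the cat ran", "the dog sat", the estimate of P(sat | the, cat) is 1/2, because "the cat" is followed once by "sat" and once by "ran".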

Our efforts to improve language models focus on two goals: reducing training data requirements (enhancing the accuracy of the parameter estimation given a training corpus) and introducing more structure (finding a closer match between model structure and reality).

Information-theoretic methods

One approach to relax training data requirements is to use syllables instead of words as text atoms. Text can easily be split into syllables, and there are far fewer syllables than words. Also, unseen words are most likely built from seen syllables.

A second approach to combat data sparseness is borrowed from universal source coding (data compression), where the Context-Tree Weighting (CTW) algorithm has proven successful. Using a modified version of CTW for trigram language modelling reduces the perplexity by about 5%.
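For readers unfamiliar with the measure, perplexity is the inverse geometric mean of the probabilities that the model assigns to the words of a test text. A minimal Python sketch, with a function name chosen for this example:

```python
import math
from typing import Sequence


def perplexity(word_probs: Sequence[float]) -> float:
    """Perplexity of a test text, given the model probability of each word.

    Computed as exp(-(1/N) * sum(log p_i)); all probabilities must be
    strictly positive.
    """
    if not word_probs:
        raise ValueError("at least one word probability is required")
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))
```

A model that assigns probability 1/4 to every word has a perplexity of 4: on average it is as uncertain as if it had to choose uniformly among four words. A lower perplexity therefore indicates a better match between model and text.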

Modelling of spontaneous speech

The available amount of training material for spontaneous speech is typically much smaller than for dictated speech, due to the high costs of verbatim hand transcription of spontaneous speech. We have devised a corpus adaptation method, called weighted counting, that attempts to match a written text corpus to a small spontaneous speech transcription corpus by weighting the observations according to a style similarity measure. Evaluated on a meeting transcription task, this method produced a modest but significant improvement in recognition accuracy.
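The principle of weighted counting can be sketched as follows in Python. This report does not specify the style similarity measure, so the measure below (the fraction of words in a written sentence that also occur in the spontaneous speech transcriptions) is a placeholder chosen purely for illustration, as are the function names and the use of bigrams.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Sentence = List[str]


def style_similarity(sentence: Sentence, target_vocab: Set[str]) -> float:
    """Placeholder similarity in [0, 1]; NOT the measure used in the project."""
    if not sentence:
        return 0.0
    return sum(1 for w in sentence if w in target_vocab) / len(sentence)


def weighted_bigram_counts(written: List[Sentence],
                           spontaneous: List[Sentence]) -> Dict[Tuple[str, str], float]:
    """Pool bigram counts, down-weighting written text by style similarity."""
    target_vocab = {w for s in spontaneous for w in s}
    counts: Dict[Tuple[str, str], float] = defaultdict(float)
    # Observations from the small spontaneous corpus count fully.
    for s in spontaneous:
        for pair in zip(s, s[1:]):
            counts[pair] += 1.0
    # Observations from the written corpus count in proportion to their weight.
    for s in written:
        weight = style_similarity(s, target_vocab)
        for pair in zip(s, s[1:]):
            counts[pair] += weight
    return counts
```

The resulting fractional counts can replace integer counts when estimating n-gram probabilities, so that written sentences resembling spontaneous speech contribute more than sentences in a very different style.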

We are currently seeking to improve spontaneous speech language models by explicit modelling of the disfluencies that typically occur in spontaneous speech, since they are likely to conflict with the n-gram model.

Structured language models for dictation

In a different line of research we have investigated techniques to integrate linguistic knowledge in the form of syntactic structure into language models for dictation. This has first led to a language model based on a probabilistic context-free grammar. Since the context-free assumption turned out to be too restrictive, a model based on a dynamic-programming Earley parser with context-dependent rules was developed. Further improvements, in execution speed, perplexity as well as recognition accuracy, were finally obtained with a model based on a context-dependent dynamic-programming version of a probabilistic left-corner parser.

The three above-mentioned models are initially trained on a corpus of parsed text (treebank), but can be re-trained on unparsed material with an expectation-maximization algorithm. So far, evaluation results on the Wall Street Journal recognition task have been positive. We expect even larger improvements for the recognition of Dutch, since it tends to contain more long-span syntactic constraints than English.





9.6 The influence of dialects on speech recognition systems

D. Van Compernolle, P.-J. Ghesquiere

Current speech recognition systems can transcribe dictated speech of a known speaker fairly well. However, these systems often fail when the speaker has a pronounced accent or a speaking style that differs from that of the 'standard' speaker. We want to investigate the influence of situation-specific acoustical and phonological models, which are obtained by transformations of general speaker-independent models.

Currently, we are investigating which features are important for the development of these transformations. Articulatory features such as open-closed, sonorant-obstruent, etc. can be used to indicate differences in phoneme models. To this end, a formant tracker has been developed.





9.7 CGN (het Corpus Gesproken Nederlands - the Spoken Dutch Corpus)

P. Wambacq, K. Demuynck, J. Duchateau, T. Laureys

The project CGN aims to compile a ten-million-word corpus that will constitute a plausible sample of contemporary standard Dutch as spoken in Flanders (1/3 of data) and the Netherlands (2/3 of data). The resulting corpus, planned to be complete in June 2003, will be an indispensable source for research in linguistics as well as language and speech technology. ESAT/PSI is an active member in the working groups for corpus design and signal analysis, selects appropriate speech fragments, digitises them, delivers orthographic transcriptions and works on automatic speech segmentation for the project.

As a member of the working groups for corpus design and signal analysis, we have left a definite mark on the theoretical framework for the corpus, mainly with a view to speech research. To that end, we have written several working documents, which were discussed in the specialist working groups.

For the first two of seven planned releases, more than 1.5 million spoken words were orthographically transcribed in Flanders and the Netherlands. We selected, digitised and transcribed a total of 190k words (31% of the currently available Flemish data) for those releases.

In addition, we developed an HMM-based tool for automatically segmenting speech at the word level. A first pilot experiment, in which such automatically generated segmentations were compared to manual segmentations, showed promising results. At the moment, confidence measures are being incorporated so that the results can be used in a data fusion system.
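One simple way to compare an automatic segmentation with a manual one is to count how many word boundaries fall within a fixed tolerance of the manually placed boundary. The Python sketch below illustrates this idea; the scoring method, the function name and the 20 ms tolerance are assumptions of this example and do not describe how the pilot experiment was actually scored.

```python
from typing import Sequence


def boundary_agreement(auto_ms: Sequence[int], manual_ms: Sequence[int],
                       tolerance_ms: int = 20) -> float:
    """Fraction of automatic word boundaries within tolerance of the manual ones.

    Both sequences hold boundary times in milliseconds for the same words,
    in the same order.
    """
    if len(auto_ms) != len(manual_ms):
        raise ValueError("both segmentations must contain the same boundaries")
    if not auto_ms:
        return 0.0
    hits = sum(1 for a, m in zip(auto_ms, manual_ms) if abs(a - m) <= tolerance_ms)
    return hits / len(auto_ms)
```

For example, with automatic boundaries at 100, 515 and 1000 ms and manual boundaries at 110, 500 and 1100 ms, two of the three boundaries lie within 20 ms of each other.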

More information can be found on the official web site: http://lands.let.kun.nl/cgn/.





9.8 Computer aided language learning (MYTHE)

P. Wambacq, P. Vanroose

The MYTHE project (Multimedia young children thesaurus for educational purposes) deals with the design and development of a multilingual interactive computer-based language-learning environment for children around the transition-to-literacy age, i.e., 6 to 8 years old. This software tool can be used for both mother tongue and foreign language learning and will initially support three languages: English, Greek and Dutch.

MYTHE is built around a modern fable, especially written for this purpose. The fable will be presented in an interactive 3D environment in which animated characters present the language exercises in a game-like fashion. Advanced text parsing and correction tools will be included, as well as the possibility for the child to interact with the system by speaking to it. For that purpose, the speech recogniser developed at the ESAT speech group will be plugged into the system, and extended to support Greek and British English.

This will also make it possible to assess the pronunciation of the pupil, especially the correct stress position in a word. We are currently working on this topic.

More information can be found on http://mythe.ilsp.gr/.





9.9 Optimised subspace weighting for robust speech recognition in additive noise environments

P. Wambacq, W. Verhelst, K. Hermus

With the increase in computing power and the advances in speech processing, current Automatic Speech Recognisers (ASR) have attained a high level of accuracy. However, as soon as the input speech signal is corrupted by disturbing sources, severe performance degradation can occur, due to the mismatch between training (laboratory) and operating (test) conditions. We concentrated on one of several possible approaches, namely speech enhancement (noise removal before the feature extraction step).

Signal Subspace (SS) based speech enhancement techniques obtain significant additive-noise reduction by altering the singular value spectrum of the speech observation matrix. High-energy singular components contain almost only (pure) signal, whereas the low-energy ones are noise-related and are (completely) suppressed.

These SS approaches were developed as pure speech enhancement techniques that pursue optimal perceptual speech quality. In speech recognition applications we focus on optimal recognition accuracy. In this respect, we presented and investigated the idea of 'optimal SS weighting' for speech recognition systems.

We compared the SS estimation techniques with our optimal weighting scheme, which was trained on known matched pairs of clean and noisy singular spectra. We found that, for robust speech recognition, the Minimum Variance (MV) based weighting can be considered near optimal in the class of SS weighting methods.
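To make the notion of SS weighting concrete, the Python sketch below computes per-component weights for a given singular value spectrum. It assumes white noise with a known noise singular value and uses the form of the MV weight commonly quoted in the subspace literature, 1 - (noise level / singular value)^2; this formula and the function names are assumptions of the sketch and are not taken from our implementation. The trained optimal weighting scheme is not shown.

```python
from typing import List, Sequence


def truncation_weights(singular_values: Sequence[float],
                       noise_level: float) -> List[float]:
    """Keep components above the noise level unchanged, suppress the rest."""
    return [1.0 if s > noise_level else 0.0 for s in singular_values]


def minimum_variance_weights(singular_values: Sequence[float],
                             noise_level: float) -> List[float]:
    """MV-style weights, assuming white noise (form assumed, see text above)."""
    if noise_level < 0.0:
        raise ValueError("noise_level must be non-negative")
    weights = []
    for s in singular_values:
        if s > noise_level:
            weights.append(1.0 - (noise_level / s) ** 2)
        else:
            weights.append(0.0)  # noise-dominated component is suppressed
    return weights


def apply_weights(singular_values: Sequence[float],
                  weights: Sequence[float]) -> List[float]:
    """Weighted singular value spectrum of the enhanced observation matrix."""
    return [w * s for w, s in zip(weights, singular_values)]
```

Strong components receive a weight close to one, components near the noise level are attenuated, and components at or below the noise level are removed. The enhanced signal is then reconstructed from the weighted singular values and the original singular vectors.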

Recognition experiments on an LV-CSR (Large Vocabulary Continuous Speech Recognition) task under stationary noise conditions revealed that relative reductions in error rate of 60% could be achieved with the SS methods.

When combined with an accurate speech-noise classifier and a prewhitening step, the basic SS based algorithms are able to remove non-stationary coloured additive noise.

This topic is illustrated by the following files: st0058_10db_w.wav and st0128_10db_w.wav (both speech with white noise, 10dB SNR), and st0058_10db_w_enh.wav and st0128_10db_w_enh.wav (enhanced with our method).





9.10 Speech and audio representations with damped sinusoids based on Total Least Squares (TLS) algorithms

P. Wambacq, W. Verhelst, K. Hermus, Ph. Lemmerling

The representation of speech and audio signals with a weighted sum of constant-amplitude, constant-frequency sinusoids has gained a lot of attention in speech analysis/synthesis, speech coding and speech modification. When it comes to capturing the transitional parts of these signals, damped sinusoids have several advantages (e.g. a more compact representation). Total Least Squares algorithms are suitable for the difficult task of automatically extracting the parameters of these damped sinusoids.

Total Least Squares algorithms form a natural extension of the basic LS algorithm, deriving the parameters of an Auto-Regressive (AR) model that exactly matches a (slightly) perturbed version of the input signal. Among the different TLS solutions that are available, we opted for the HTLS (Hankel TLS) algorithm since this method combines a reasonable computational complexity with a good modelling accuracy.

Our scheme decomposes a speech/audio frame into a predetermined number of damped sinusoids, each with its proper amplitude, damping factor, frequency and phase.
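The signal model behind this decomposition can be written out as a short Python sketch that resynthesises a frame from a list of damped sinusoids. The class and function names, and the choice of per-sample units for damping and frequency, are assumptions of this example; the HTLS parameter estimation itself is not shown.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DampedSinusoid:
    amplitude: float
    damping: float    # exponential decay rate per sample (>= 0)
    frequency: float  # in cycles per sample
    phase: float      # in radians


def synthesize(components: Sequence[DampedSinusoid], num_samples: int) -> List[float]:
    """Sum of damped sinusoids: a * exp(-d * n) * cos(2 * pi * f * n + phi)."""
    frame = []
    for n in range(num_samples):
        sample = sum(
            c.amplitude
            * math.exp(-c.damping * n)
            * math.cos(2.0 * math.pi * c.frequency * n + c.phase)
            for c in components
        )
        frame.append(sample)
    return frame
```

Setting the damping factor to zero gives the classical constant-amplitude sinusoidal model; a positive damping factor lets a single component describe a decaying transient.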





Figure 40: Relative computation time for TLS as a function of the number of subbands in the filter bank.





Figure 41: Behaviour of TLS in the frequency domain for a voiced speech segment (Fullband approach, blue: original spectrum, red: TLS spectrum).





Figure 42: Behaviour of TLS in the frequency domain for a voiced speech segment. The subband approach has a much better spectral coverage than its fullband counterpart. (blue: original spectrum, red: TLS spectrum)

By applying the HTLS algorithm to the fully decimated outputs of a filter bank, we manage to drastically reduce the total computational complexity. Figure 40 plots the relative computation time as a function of the number of subbands. At the same time, the modelling precision can be adjusted per subband, which leads to a much more balanced distribution of the components over the total spectral range. Figure 41 and Figure 42 clearly illustrate the difference between the fullband and the subband TLS modelling. The subband scheme also allows us to incorporate perceptual criteria.

Current research focuses on the challenging task of quantising the parameters (amplitude, damping factor, frequency and phase) in our signal representation. With increased modelling precision or 'over'-modelling, a number of components can become extra-sensitive to quantisation errors. A straightforward quantisation of all the damped sinusoids is then impossible (or would lead to unacceptable bit rates). A perceptual model could be used to extract the perceptually most important components (these are least sensitive to quantisation errors). The remaining components, which contribute little to the signal representation, can be quantised with traditional techniques. The whole operation can (and preferably should) be done in a subband scheme.

Some results are available in the files f116.wav (original), f50.wav (encoded-decoded, fullband with 50 sinusoids), s4comp32.wav and s16comp32.wav (subband approach, 32 components in 4, respectively 16 subbands).





9.11 Speech modification

W. Verhelst, T. Ceyssens

The ultimate goal of our speech modification research project is to build an automatic voice imitation system. Such a system is able to convert an utterance of a (source) speaker into a transformed version that sounds as if that same utterance was produced by another (target) speaker. Among many others, one possible application of such a system would be the conversion of a dubbed movie speech track. In a dubbing process, the original speech track is re-uttered by another speaker (with another voice), in another language. The system would allow converting the dubbed speech track, obtaining the voice characteristics of the original speech track again, but keeping the desired language.

To obtain a complete voice conversion, one has to convert both the voice characteristics (i.e. how a voice sounds) and the speaking characteristics (i.e. way of speaking). Voice characteristics are determined mostly by the location of the formants of every sound a speaker may produce, and by the average pitch. These are segmental aspects, i.e. from a time domain point of view they are located within one pitch period (= main period of the quasi-periodic speech signal). Speaking characteristics on the other hand, mostly deal with suprasegmental aspects (these extend over several pitch periods). A few important examples are: speaking rate, location and strength of stress, and way of co-articulation. Stress itself is created by specific time variations in pitch, loudness and speaking rate along the stressed syllable.

Currently we are investigating the important aspect of pitch conversion. The conversion itself has already been accomplished. Now we are focusing on how to apply this conversion algorithm in a systematic way to convert pitch between different speakers, as the relations between pitch variations of different speakers are not straightforward.





9.12 Low budget information retrieval

P. Wambacq, M. Osian, D. Martens

To enable a virtual reality speech recognition agent to respond on topic to a spoken user query, three different information retrieval methods are explored, each with distinct features. With the final goal of augmenting an existing virtual reality environment with scene related textual information in mind, three document mapping techniques and document similarity measures are compared. The focus for comparison of the three methods is the obtained robustness (and hence, usability) with respect to computational complexity. On account of the excessive computational cost, two neural network based techniques, a classic Kohonen self-organizing network and a neural gas, are abandoned in favour of an ad hoc raw statistical technique, dubbed relative similarity measurement. It is demonstrated that both robustness and computational cost are directly related to the level of connectivity in the mapping topology used to determine the similarity. The usability of the system is also investigated.





9.13 CLIF research community

P. Wambacq

CLIF (Computational Linguistics in Flanders) is a research community funded by FWO Vlaanderen. Its goal is to gather all Flemish expertise in the field of language and speech processing. Joining forces is essential if Dutch is to remain an equally valued language among the larger ones in Europe. The participating Flemish research groups are CNTS (University of Antwerp), CCL (KULeuven), ELIS (University of Ghent), ESAT/PSI-Speech (KULeuven), ARTI (University of Brussels) and Taal en Computer (KVH Antwerp). The cooperation between the partners focuses on

  • joint multidisciplinary efforts in research on language and speech technology in Flanders, both fundamental and application oriented;
  • facilitating research activities by an enhanced reusability of data for written and spoken language;
  • very specifically, the integration of fundamental research on language and speech processing in the Flemish context;
  • providing services by giving advice and organizing specific educational activities.