Automatic continuous speech recognition with rapid speaker adaptation for human/machine interaction

Nikko Ström  

Department of Speech, Music and Hearing, KTH, Sweden.  

Academic dissertation for the degree of Teknologie Doctor (Ph.D.), to be publicly defended on December 5, 1997, at 14.00 in Kollegiesalen, Valhallavägen 7, KTH, Stockholm. The opponent is Dr. James Glass, MIT Lab. for Computer Science, Spoken Language Systems Group.


This thesis presents work in three main directions of the automatic speech recognition field. The work within two of these, dynamic decoding and hybrid HMM/ANN speech recognition, has resulted in a real-time speech recognition system, currently in use in the human/machine dialogue demonstration system WAXHOLM, developed at the department. The third direction is fast unsupervised speaker adaptation, where "fast" refers to adaptation with a small amount of adaptation speech.
The work in dynamic decoding has involved the development of a continuous speech decoding engine based on the A* search paradigm. An efficient implementation of the algorithms has made real-time continuous speech recognition possible in the WAXHOLM dialogue system with a lexicon of about 1000 words. The thesis proposes features of the search algorithms that are important for real-time performance. These include efficient use of beam pruning and graph reduction methods that greatly reduce the effective search space.
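As a rough illustration of the beam pruning mentioned above, the following minimal sketch shows one time-synchronous decoding step in which hypotheses scoring too far below the current best are discarded. It is a generic textbook formulation, not the decoder of the thesis; the function names, the dictionary-based state representation, and the log-probability inputs are all assumptions made for this example.

```python
import math

def beam_prune(hypotheses, beam_width):
    """Keep only hypotheses whose log score is within beam_width of the best."""
    if not hypotheses:
        return {}
    best = max(hypotheses.values())
    return {state: score for state, score in hypotheses.items()
            if score >= best - beam_width}

def viterbi_beam_step(active, transitions, emission_logp, beam_width):
    """Advance all active states by one frame, then prune.

    active:        {state: log score so far}
    transitions:   {state: [(next_state, transition log prob), ...]}
    emission_logp: {state: log prob of the current frame in that state}
    (All three structures are hypothetical, chosen for illustration.)
    """
    expanded = {}
    for state, score in active.items():
        for nxt, trans_logp in transitions.get(state, []):
            candidate = score + trans_logp + emission_logp[nxt]
            # Viterbi recombination: keep the best path into each state.
            if candidate > expanded.get(nxt, -math.inf):
                expanded[nxt] = candidate
    return beam_prune(expanded, beam_width)
```

A narrow beam removes more hypotheses and so costs less per frame, at the risk of discarding the path that would eventually have scored best; a wide beam approaches an exhaustive search.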
The hybrid HMM/ANN recognition is an area of work in its own right, but is also important in the speaker adaptation experiments. A very flexible ANN architecture has been developed and refined during the course of the thesis work. The architecture is a generalization of the time-delay neural network (TDNN) and recurrent neural network (RNN) architectures, and allows both delayed and look-ahead connections. In the latest experiments, sparsely connected networks were investigated. Sparsely connected networks were shown to perform significantly better than their fully connected counterparts with an equal number of connections. In a phoneme recognition experiment on the TIMIT database, the recognition rate of the hybrid HMM/ANN system is in the range of the highest reported, and is outperformed only by another hybrid system.
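The idea of delayed and look-ahead connections can be sketched as a unit whose input at frame t is drawn from other frames at fixed time offsets: a negative offset is a delayed connection and a positive offset is a look-ahead connection. This is only a simplified illustration under stated assumptions (a single scalar input stream, a tanh activation, zero contribution from frames outside the utterance, and no recurrent connections); it is not the architecture developed in the thesis, and all names are hypothetical.

```python
import math

def unit_activations(inputs, connections, bias=0.0):
    """Activations of one unit over an utterance.

    inputs:      list of floats, one value per frame
    connections: list of (offset, weight); offset < 0 is a delayed
                 connection, offset > 0 is a look-ahead connection
    """
    num_frames = len(inputs)
    outputs = []
    for t in range(num_frames):
        total = bias
        for offset, weight in connections:
            src = t + offset
            # Frames outside the utterance contribute nothing (an assumption).
            if 0 <= src < num_frames:
                total += weight * inputs[src]
        outputs.append(math.tanh(total))
    return outputs
```

Listing only a few `(offset, weight)` pairs per unit, rather than one for every possible source, is also a simple way to picture sparse connectivity.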
The fast speaker adaptation work is based on the notion that an explicit a priori model of the speaker variability helps to rapidly adapt to a new speaker. In the experiments, a parametric speaker characterization is introduced in the ANN by adding special-purpose speaker-space input units whose activity values are determined by the speaker adaptation. Experiments were carried out with both the American English TIMIT database and the Swedish WAXHOLM database, and a positive adaptation effect is detected after only a few syllables.
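To make the notion of speaker-space input units concrete, the sketch below feeds a small speaker vector into a classifier alongside the acoustic features, keeps the weights fixed, and adapts only the speaker vector. The adaptation criterion used here (choosing, from a list of candidates, the speaker vector that makes the network most confident on unlabeled frames) is an assumption chosen for brevity and is not claimed to be the procedure used in the thesis. The single-layer softmax model and all names are likewise hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def posteriors(acoustic, speaker, w_acoustic, w_speaker):
    """Class posteriors from acoustic inputs plus speaker-space inputs."""
    logits = [
        sum(w * a for w, a in zip(w_acoustic[k], acoustic))
        + sum(w * s for w, s in zip(w_speaker[k], speaker))
        for k in range(len(w_acoustic))
    ]
    return softmax(logits)

def adapt_speaker(frames, candidates, w_acoustic, w_speaker):
    """Unsupervised adaptation: weights stay fixed, only the speaker
    vector changes. Returns the candidate with the highest summed log
    of the best posterior per frame (an assumed criterion)."""
    def confidence(speaker):
        return sum(
            math.log(max(posteriors(f, speaker, w_acoustic, w_speaker)))
            for f in frames
        )
    return max(candidates, key=confidence)
```

Because only a handful of speaker parameters are estimated rather than all network weights, very little adaptation speech is needed, which is the intuition behind the fast adaptation described above.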


Included papers 
1. Introduction 
2. Hybrid HMM/ANN speech recognition 
2.1 Introduction 
2.2 The standard CDHMM 
2.3 Problems with the standard model 
2.4 The artificial neural network 
2.5 Sparse connectivity and pruning in the ANN
3. Dynamic decoding 
3.1 Introduction 
3.2 Viterbi decoding 
3.3 The N-best paradigm and A* search 
3.4 Word lattice representation of multiple hypotheses
4. Word graph minimization 
4.1 Introduction 
4.2 Problem formulation 
4.3 Approximative methods 
4.4 The minimal deterministic graph 
4.5 Determinization 
4.6 Minimization 
4.7 Computational considerations 
4.8 Discussion
5. Speaker Adaptation 
5.1 Introduction 
5.2 Speaker modeling 
5.3 Speaker-sensitive phonetic evaluation 
5.4 Speaker adaptation in the speaker modeling framework
6. Applications 
6.1 The WAXHOLM dialogue system 
6.2 An instructional system for teaching spoken dialogue systems technology
7. Summaries and comments on individual papers 
7.1 Paper 1 
7.2 Paper 2 
7.3 Paper 3 
7.4 Paper 4 
7.5 Paper 5 
7.6 Paper 6
8. Acknowledgments 
9. References 

Keywords: Automatic speech recognition (ASR), hybrid HMM/ANN, lexical search, speaker adaptation, speaker characterization, human/machine dialogue system.