1. Introduction

Continuous speech recognition (CSR), not long ago imaginable only in science-fiction stories, is a reality today. The first commercial, large vocabulary, continuous speech dictation system for use on standard PCs is already on the market (Naturally Speaking™ by Dragon Systems), and others will follow. IBM has announced a similar product to be released at the end of 1997. There is, however, a long way to go from contemporary state-of-the-art recognition systems to a system comparable with human speech perception. The research on the SWITCHBOARD corpus of spontaneous speech over the telephone illustrates this discrepancy (Cohen, 1996). For this task, today's technology is clearly insufficient.
Today's best large vocabulary systems are used for one speaker only, because it is desirable to fine-tune to the speaker's voice, and the systems are still vulnerable to noisy conditions. Also, humans have an unlimited lexicon, as new words can always be formed. This is particularly true for many Germanic languages, where there is virtually no limit on the compounding of words. CSR systems recognize at best a few tens of thousands of words.
The recent commercial breakthrough is due to a gradual improvement, in small steps, of the prevalent HMM method. The improvements have gone hand-in-hand with the development of computer technology, which has allowed increasingly computationally demanding models and the storage of larger probabilistic grammars. Although this development has been very fruitful, concerns have been raised about the long-term development of the field. In order to bridge the remaining gap between human perception and automatic speech recognition, new innovative solutions may be required. It has been argued that the presence of one dominant technique, tuned for many years, suppresses work along such innovative lines (Bourlard, 1995). CSR is a complex process that naturally breaks down into different sub-tasks (see Figure 1). Because of the years of tuning of the standard system, almost any significantly new approach in one of the modules of the system will most probably lead to an increased error-rate when first investigated. This thesis reports in part on work that can be characterized as tuning the standard model, but the focus is on work on alternative methods.
In the first two blocks of Figure 1, signal processing and feature extraction, the speech signal is transformed into a time-series of feature vectors that is a suitable representation for the subsequent statistical phonetic evaluation in the next module. The sample rate of the feature vectors is usually about 100 Hz. These first two blocks are often called the front-end of the ASR system. Apart from the implementation of a few of the most popular features, e.g., mel-frequency cepstrum coefficients, the work reported in this thesis is not much concerned with the front-end.
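To make the front-end concrete, the following is a minimal, illustrative sketch of a mel-cepstrum computation: framing and windowing, a power spectrum (here a deliberately naive DFT; a real front-end uses an FFT), a triangular mel filterbank, a log, and a DCT. The function name, frame and hop sizes (25 ms / 10 ms, giving the roughly 100 Hz feature rate mentioned above), and filterbank dimensions are typical textbook choices, not the specific front-end used in this thesis.

```python
import math

def mfcc_like(signal, sr=16000, frame_ms=25, hop_ms=10, n_filt=20, n_ceps=12):
    """Toy front-end: frame -> window -> power spectrum -> mel filterbank -> log -> DCT."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)          # 10 ms hop -> about 100 vectors per second
    n_fft = frame_len
    # Hamming window
    win = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
           for n in range(frame_len)]
    # Triangular mel filterbank: filter edges equally spaced on the mel scale
    def hz2mel(f): return 2595.0 * math.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = [mel2hz(i * hz2mel(sr / 2) / (n_filt + 1)) for i in range(n_filt + 2)]
    bins = [int(n_fft * f / sr) for f in mel_pts]
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * win[n] for n in range(frame_len)]
        # Naive DFT power spectrum (O(N^2); for illustration only)
        power = []
        for k in range(n_fft // 2 + 1):
            re = sum(frame[n] * math.cos(2 * math.pi * k * n / n_fft) for n in range(n_fft))
            im = sum(frame[n] * math.sin(2 * math.pi * k * n / n_fft) for n in range(n_fft))
            power.append((re * re + im * im) / n_fft)
        # Log mel filterbank energies
        fbank = []
        for m in range(1, n_filt + 1):
            lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
            e = sum(power[k] * (k - lo) / max(ctr - lo, 1) for k in range(lo, ctr))
            e += sum(power[k] * (hi - k) / max(hi - ctr, 1) for k in range(ctr, hi))
            fbank.append(math.log(e + 1e-10))
        # DCT-II decorrelates the filterbank energies into cepstral coefficients
        feats.append([sum(fbank[m] * math.cos(math.pi * c * (m + 0.5) / n_filt)
                          for m in range(n_filt)) for c in range(n_ceps)])
    return feats
```

The DCT at the end is what makes the coefficients approximately decorrelated, which is what later permits the diagonal-covariance Gaussians commonly used in the phonetic classification module.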
The standard method for implementing the phonetic classification module is to evaluate the feature vectors phonetically using multivariate Gaussian probability density functions. The density functions are conditioned on the phoneme in context, i.e., they estimate the likelihood of the feature vector given the hypothesized phoneme and its neighboring phonemes. However, several alternative methods have been proposed. For example, Digalakis, Ostendorf and Rohlicek (1992) use a stochastic model of phone-length segments instead of evaluating the feature vectors independently, and Glass, Chang, and McCandless (1996) use a method based on phone-length segment feature-vectors and discrete landmarks in the speech signal.
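The standard evaluation above can be sketched in a few lines for the common diagonal-covariance case: each context-dependent phone model scores a feature vector with a Gaussian log-likelihood, and the decoder compares these scores across hypotheses. The model names, means, and variances below are invented for illustration.

```python
import math

def diag_gauss_loglik(x, mean, var):
    """Log-likelihood of feature vector x under a diagonal-covariance Gaussian."""
    ll = 0.0
    for xi, mu, v in zip(x, mean, var):
        ll += -0.5 * (math.log(2 * math.pi * v) + (xi - mu) ** 2 / v)
    return ll

# One density per phone-in-context (hypothetical parameters):
# here /a/ with left/right contexts k_t and t_p.
models = {
    ("a", "k_a_t"): ([0.0, 1.0], [1.0, 0.5]),
    ("a", "t_a_p"): ([0.5, 0.8], [0.8, 0.6]),
}
x = [0.1, 0.9]  # one feature vector (toy 2-dimensional example)
best = max(models, key=lambda m: diag_gauss_loglik(x, *models[m]))
```

In a real system each context-dependent phone is typically modeled by a mixture of such Gaussians, and the log-likelihoods feed the dynamic decoding search rather than a direct argmax over phones.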
Another alternative approach is the hybrid HMM/ANN method (Bourlard and Wellekens, 1988), where an ANN is used for the phonetic evaluation of the feature vectors. The hybrid HMM/ANN method is used in Paper 3, Paper 5 and Paper 6 of this thesis and is discussed in more detail in the next section of this summary.
The dynamic decoding module is where the recognition system searches for word sequences whose phoneme sequences are assigned high likelihood by the phonetic evaluation. This is also the module where the probabilistic grammar is included. Thus, word sequences are evaluated on the merits of both their phonetic match and their grammatical likelihood. The output from the module is the most likely sequence of words, or a set of likely word sequence hypotheses that are then further processed by other modules of the system.
The dynamic decoding search of a large vocabulary CSR system can be computationally very costly. Therefore, current research and development in this field are dominated by computational issues. Popular themes for reducing computation are pruning and fast search. In pruning methods, partial hypotheses that are relatively unlikely are not pursued in the continued search. A particularly efficient pruning method is proposed in Paper 3. In fast search methods, an initial fast but less accurate search is used to guide a subsequent, more accurate search, in order to pursue only the most promising hypotheses.
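The most common form of pruning, beam pruning, can be stated very compactly: at each time frame, discard every partial hypothesis whose score falls more than a fixed margin below the current best. This is a generic textbook sketch, not the specific pruning method of Paper 3; the function name and beam value are illustrative.

```python
def beam_prune(hypotheses, beam=10.0):
    """Keep only partial hypotheses whose log-score is within `beam` of the best.

    hypotheses: list of (state, log_score) pairs active at the current frame.
    """
    best = max(score for _, score in hypotheses)
    return [(h, s) for h, s in hypotheses if s >= best - beam]
```

The beam width trades search errors against computation: a narrow beam is fast but may discard the hypothesis that would eventually have won; a wide beam approaches the exhaustive search.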
Multi-pass methods are generalizations of fast search. In a multi-pass method, the output of the first search pass is a set of the most likely hypotheses given the knowledge sources available in that pass. Subsequent passes re-score the set of hypotheses based on additional knowledge sources. The additional knowledge sources require more computation, so it is beneficial to apply them only to the selected set of hypotheses instead of the whole search space. Examples of knowledge sources that can be added in later passes are those that span word boundaries, e.g., higher-order N-grams in the probabilistic grammar, and triphones that condition on phones across word boundaries.
A concept that can be applied to all the modules of Figure 1 is speaker adaptation. Although the role of speaker adaptation in the human perception process is not completely understood, there is convincing evidence that such a process takes place. In particular, the presence of a rapid adaptation process, operating on a time scale of a few syllables, was proposed by Ladefoged and Broadbent (1957).
Speaker adaptation methods have also been successfully applied to ASR. The success should be no surprise, given the discrepancy in performance between speaker-dependent (SD) and speaker-independent (SI) systems (e.g., Huang and Lee, 1991). Different adaptation methods can be classified by the amount of supervision required from the user and by the time-scale on which they operate. At one end of the spectrum are methods that require the user to read a sample text that is then used to re-train the system. This method can asymptotically reach the performance of a corresponding SD system as the size of the sample text is increased (e.g., Brown, Lee, and Spohrer, 1983; Gauvain and Lee, 1994). More advanced methods collect the speaker-dependent sample as the system is running, and do the re-training based on the collected data, without the need of a special training session by the user. This latter scheme is called unsupervised adaptation. However, both supervised and unsupervised re-training methods are typically relatively slow: the adaptation effect is significant only after a minute or more.
At the other end of the spectrum of speaker adaptation are methods that have access to a model of speaker variability. This additional knowledge lets the system concentrate on adapting to voices that are likely to exist in the population of speakers, instead of blindly trying to adapt based on the (initially) very small sample from the new speaker. These methods can reach a positive adaptation effect after only a few syllables of speech. The speaker modeling approach to speaker adaptation is developed and investigated in the context of hybrid HMM/ANN recognition in Paper 2, Paper 4 and Paper 5. However, the approach does not require an ANN model: a related adaptation scheme in the HMM domain is found in a paper by Leggetter and Woodland (1994), and Hazen and Glass (1997) use a related method in a segment-based system.
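One simple way to picture how a model of speaker variability enables fast adaptation is the following toy sketch: the speaker population is summarized by a few cluster models, the first frames from a new speaker select among the clusters via posterior probabilities, and the recognizer's model means are shifted toward the posterior-weighted combination. This is only an illustrative stand-in with spherical Gaussians and equal priors; it is not the specific method of Papers 2, 4 and 5, nor that of Leggetter and Woodland.

```python
import math

def cluster_posteriors(frames, cluster_means, var=1.0):
    """Posterior of each speaker cluster given a few frames, assuming equal
    priors and spherical unit-scale Gaussians (a toy speaker-variability model)."""
    logs = []
    for mean in cluster_means:
        ll = sum(-0.5 * ((x - m) ** 2 / var + math.log(2 * math.pi * var))
                 for f in frames for x, m in zip(f, mean))
        logs.append(ll)
    mx = max(logs)                       # subtract max for numerical stability
    ws = [math.exp(l - mx) for l in logs]
    s = sum(ws)
    return [w / s for w in ws]

def adapted_mean(frames, cluster_means):
    """Adapted model mean: posterior-weighted mix of the cluster means."""
    post = cluster_posteriors(frames, cluster_means)
    dim = len(cluster_means[0])
    return [sum(p * m[d] for p, m in zip(post, cluster_means)) for d in range(dim)]
```

Because the adapted parameters are constrained to the span of the cluster models, a handful of frames already yields a sensible estimate, which is exactly the advantage over unconstrained re-training.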
The remainder of this summary is organized in sections corresponding to the three main directions of work: hybrid HMM/ANN recognition, dynamic decoding, and fast speaker adaptation. The exception is section 4, which contains previously unpublished material on the representation of large sets of alternative hypotheses, the output of the dynamic decoding search. Section 6 contains a brief presentation of the applications of the developed CSR system. Finally, section 7 contains brief summaries of and comments on the included papers. In particular, the original contributions and innovative elements of each paper are discussed in that section.

Figure 1. Overview of the ASR process. Most contemporary speech recognition systems conform to this chain of processing. In the signal processing module the raw speech waveform is transformed to the frequency domain. In the feature extraction module a data-compressing transform is applied, and the resulting output stream is a time-series of feature vectors. The phonetic classification module evaluates the feature vectors phonetically, and in the dynamic decoding module the lexical and grammatical constraints are enforced.