5. Speaker Adaptation

5.1 Introduction

Speaker adaptation is one of the main areas of work in this thesis. In particular, rapid adaptation to a new voice has been investigated. Although there is evidence for such an adaptation process in speech perception, its role in the perception process is still not completely understood. In a famous experiment with synthetic speech stimuli, Ladefoged and Broadbent (1957) found that altering the formant frequencies in a precursor utterance changed the perceived identity of a target word. To account for this effect they postulated a psychological adaptation process and concluded that:
    "... unknown vowels are identified in terms of the way in which their acoustic structure fits into the pattern of sounds that the listener has been able to observe."
This view has been very influential, but it has been challenged by others who argue that the effect is of minor importance in the perception of natural speech. In particular, the dynamic patterns of vowels in their consonantal context have been ascribed more importance (Verbrugge & Strange 1976, Strange 1989). Strange (1989) wrote:
    "If, as perceptual results suggest, there is sufficient information within single syllables to allow the listener to identify intended vowels, even when those vowels are coarticulated by different speakers in different consonantal contexts, then the need to postulate psychological processes by which the perceiver enriches otherwise ambiguous sensory input is eliminated."
In ASR, the performance gap between speaker-independent (SI) and speaker-dependent (SD) systems (e.g., Huang & Lee 1991) indicates that there is a lot to be gained from adapting SI models to the speaker. Certainly, the variation due to consonantal context is recognized as an important property of the speech signal for ASR as well. This is usually modeled by context-dependent models, e.g., triphones. Nevertheless, there is general agreement that speaker adaptation schemes can further improve recognition performance significantly. An overview of existing methods for speaker adaptation in ASR is given in Paper 4.
Paper 2, Paper 4 and Paper 5 represent a line of work that aims at performing rapid speaker adaptation by accessing knowledge from an explicit model of the speaker variability.  The idea is that a priori knowledge of the speaker variability will reduce the amount of adaptation data necessary from the new speaker by constraining the parameter space. For example, the variability due to varying vocal tract length affects the formant patterns of all voiced phones in a coherent and systematic fashion. Thus, if the rules governing this variability are known, it is not necessary to collect adaptation data for all phonemes. 
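The coherent effect of vocal tract length can be illustrated with the textbook uniform-tube approximation, in which the resonances of a tube closed at one end are F_n = (2n - 1)c / (4L). This approximation, the function name `tube_formants` and the two tract lengths are assumptions introduced here for illustration only; they are not the speaker model used in the papers. The sketch shows that a single parameter moves all resonances by the same factor.

```python
# Approximate speed of sound in warm, moist air (cm/s); an illustrative constant.
SPEED_OF_SOUND_CM_S = 35000.0


def tube_formants(length_cm, n_formants=3):
    """Resonances (Hz) of a uniform tube closed at one end (quarter-wave resonator)."""
    return [
        (2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * length_cm)
        for n in range(1, n_formants + 1)
    ]


# Two hypothetical tract lengths (cm), chosen only to give round numbers.
long_tract = tube_formants(17.5)
short_tract = tube_formants(14.0)

# Every resonance is scaled by the same factor, 17.5 / 14.0 = 1.25.
ratios = [s / l for s, l in zip(short_tract, long_tract)]

print(long_tract)   # resonances of the longer tube
print(short_tract)  # resonances of the shorter tube
print(ratios)       # one common scale factor for all resonances
```

Because one scale factor accounts for every resonance, an estimate obtained from a few phonemes carries over to the others, which is the kind of constraint the speaker model is meant to exploit.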
Key concepts of the framework of Paper 2, Paper 4 and Paper 5 are the speaker model, which models the speaker variability, and the speaker space, which is the domain of a set of parameters describing different voice characteristics. A speaker model is essentially a probability distribution over a speaker space.

5.2 Speaker modeling

A speaker model is a statistical model of the speaker variability. It includes a parametric description of the variability and an a priori probability distribution for the different characteristics. For example, vocal-tract length is a characteristic that can be of importance in a speaker model, independent of the particular ASR method used. 
Vocal-tract length is an example of a speaker parameter - a parameter that describes the speaker. The speaker space is the domain of the speaker parameters. In this framework, individual speakers have positions in the speaker space, and adapting to a speaker involves estimating this position. Note, however, that speaker parameters are not necessarily physiologically related as in the previous example. In Paper 2, I experiment with a data-driven method to extract a speaker space that does not explicitly correspond to any knowledge-based parameters.
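A minimal sketch of these concepts is given here, assuming independent Gaussian priors for each speaker parameter. The class names, the parameter names and all numerical values are hypothetical and chosen only for illustration; the papers may use a different parameterization. A speaker is represented as a position in the speaker space, and the speaker model assigns an a priori log-probability to that position.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeakerParameter:
    """One axis of the speaker space with a Gaussian a priori distribution."""
    name: str
    mean: float
    std: float

    def log_prior(self, value):
        z = (value - self.mean) / self.std
        return -0.5 * z * z - math.log(self.std * math.sqrt(2.0 * math.pi))


class SpeakerModel:
    """A priori distribution over the speaker space (independent axes assumed)."""

    def __init__(self, parameters):
        self.parameters = list(parameters)

    def log_prior(self, position):
        # `position` maps parameter names to values: one point in the speaker space.
        return sum(p.log_prior(position[p.name]) for p in self.parameters)


# One knowledge-based axis and one data-driven axis; all values are illustrative.
model = SpeakerModel([
    SpeakerParameter("vocal_tract_length_cm", mean=16.0, std=1.5),
    SpeakerParameter("latent_dim_1", mean=0.0, std=1.0),
])

typical_speaker = {"vocal_tract_length_cm": 16.0, "latent_dim_1": 0.0}
unusual_speaker = {"vocal_tract_length_cm": 20.5, "latent_dim_1": 2.5}

log_p_typical = model.log_prior(typical_speaker)
log_p_unusual = model.log_prior(unusual_speaker)
print(log_p_typical > log_p_unusual)
```

The prior is what constrains the estimation: positions far from the bulk of the speaker population need more acoustic evidence before they are accepted.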
An explicit model of speaker variation may offer advantages beyond increased speech recognition performance. The speaker model can provide an interface for coupling with other modules of the human-machine interface. For example, consider the two possible speaker parameters dialect and age, both known to affect the acoustic realization of speech. Speakers of differing dialects also have different lexical preferences (Shiel 1993), and speakers of different ages are interested in different subjects, so an estimate of these parameters could be of use to other modules as well.
Figure 14. Structure of a speaker-sensitive ANN for phoneme probability estimation. Speaker parameters are introduced to the ANN by means of the special-purpose speaker-space units. The speaker space units are connected to other units like other input units, but they get their activation values from the speaker adaptation. 

5.3 Speaker-sensitive phonetic evaluation

In the speaker modeling approach to speaker adaptation, discussed in the previous section, the phonetic evaluation is conditioned on the speaker parameters, i.e., an utterance is recognized differently depending on the hypothesized voice characteristics of the speaker. A speaker-sensitive phonetic classifier is a phonetic classifier that depends on the hypothesized speaker parameters. In the case of an ANN classifier, the speaker parameters can be introduced in the network as extra input units (see Figure 14).
The speaker parameters can be thought of as the "source" of the variation. Thus, changing a speaker parameter related to the vocal tract length should alter the ANN's probabilities for all voiced phonemes. The ease of modeling such complex systematic variation shared by different phonemes is an important strength of explicitly modeling the speaker variability.
An example of a speaker-sensitive ANN can be found in (Carlson and Glass, 1992a,b), where the acoustic observations as well as the speaker parameters are input to the network. This study inspired the work reported in Paper 2, Paper 4 and Paper 5, where the same basic ANN structure is used. In this continued work, the domain has been extended from only the vowels to the whole phoneme inventory, and from classifying segments to the complete continuous speech ASR problem.
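The idea of feeding speaker parameters to the network as extra input units can be sketched as follows. This is a toy forward pass under stated assumptions: the layer sizes, the hand-picked weights and the function name `phoneme_probabilities` are hypothetical, and the networks in the papers are trained and considerably larger. The sketch only shows that the same acoustic input yields different phoneme probabilities for different hypothesized speaker parameters.

```python
import math


def softmax(logits):
    """Normalize logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def phoneme_probabilities(acoustic, speaker, w_hidden, w_out):
    """One-hidden-layer network whose input is acoustic features plus speaker units."""
    # Speaker-space units are appended to the ordinary inputs; 1.0 is the bias input.
    x = list(acoustic) + list(speaker) + [1.0]
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    h = hidden + [1.0]
    logits = [sum(w * hi for w, hi in zip(row, h)) for row in w_out]
    return softmax(logits)


# Arbitrary illustrative weights: 2 acoustic inputs + 1 speaker unit + bias -> 3 hidden.
W_HIDDEN = [
    [0.8, -0.4, 1.2, 0.0],
    [-0.5, 0.9, -0.7, 0.1],
    [0.3, 0.3, 0.5, -0.2],
]
# 3 hidden units + bias -> 3 phoneme classes.
W_OUT = [
    [1.0, -1.0, 0.5, 0.0],
    [-0.6, 0.8, 0.2, 0.1],
    [0.1, 0.2, -0.9, 0.0],
]

acoustic_frame = [0.5, -0.2]
probs_speaker_a = phoneme_probabilities(acoustic_frame, [-1.0], W_HIDDEN, W_OUT)
probs_speaker_b = phoneme_probabilities(acoustic_frame, [1.0], W_HIDDEN, W_OUT)

print(probs_speaker_a)
print(probs_speaker_b)
```

Since the speaker unit feeds every hidden unit, a change in its activation shifts the output for all classes at once, without the network needing a separate set of weights per speaker.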

5.4 Speaker adaptation in the speaker modeling framework

The previous sections discussed how to model the speaker variability, but did not specify a particular adaptation procedure. This is not a coincidence - in fact, it is one of the advantages of the speaker modeling framework that the adaptation procedure is not determined by the model of the variability. Different adaptation schemes may be chosen for different tasks. For example, the amount of adaptation speech available, whether adaptation is text-dependent or text-independent, and real-time requirements may all be relevant to the choice of adaptation scheme.
Clearly, speaker adaptation in the speaker modeling framework includes, in one form or another, estimation of the current speaker's position in the speaker space. If the speaker's position is known, this information can be used to condition the phonetic evaluation and give more accurate recognition. If the exact position in the speaker space is unknown, it must be estimated from the knowledge sources at hand. These include speech recorded previously from the speaker (possibly only the one utterance to be recognized), but could also include other types of information that are available to the system. In the analysis of the recorded speech, features that are often discarded in ASR can potentially be of use for estimating the speaker characteristics, e.g., fundamental frequency, which is strongly correlated with gender.
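As a rough illustration of how a normally discarded feature could sharpen the estimate of the speaker's position, the sketch below applies Bayes' rule to a single discrete speaker parameter with two regions, using one fundamental frequency observation. The region names, the Gaussian class-conditional distributions and all numbers are assumptions made for this example; they are not taken from the papers.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior(prior, likelihood):
    """Bayes' rule over a discrete set of speaker-space regions."""
    joint = {region: prior[region] * likelihood[region] for region in prior}
    total = sum(joint.values())
    return {region: value / total for region, value in joint.items()}


# Hypothetical F0 distributions (mean Hz, std Hz) for two regions of the speaker space.
F0_MODEL = {
    "low_f0_region": (120.0, 20.0),
    "high_f0_region": (210.0, 30.0),
}

# A priori distribution from the speaker model (uniform here for simplicity).
prior = {"low_f0_region": 0.5, "high_f0_region": 0.5}

observed_f0_hz = 195.0
likelihood = {
    region: gaussian_pdf(observed_f0_hz, mean, std)
    for region, (mean, std) in F0_MODEL.items()
}
post = posterior(prior, likelihood)
print(post)
```

The resulting posterior could then replace the a priori distribution when conditioning the phonetic evaluation, so that even a very short stretch of speech already narrows down the plausible speaker characteristics.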
The estimation of the speaker-space position need not be explicit. A concept strongly related to speaker adaptation is the so-called speaker consistency principle. This principle is a formulation of the observation that an utterance is spoken by one and the same speaker, from beginning to end. This constrains the observation space and can therefore be used to reduce the variation in the ASR model. In the speaker modeling framework, the speaker consistency principle can be introduced by enforcing constant speaker parameters throughout the utterance. This can be implemented by adding a new dimension to the search space of the dynamic decoding. The original two dimensions, time and HMM state, are then complemented with a third dimension for the speaker parameters. This is the method used in Paper 5. The extended search space is illustrated in Figure 15.
The dynamic decoding search in the extended search space of Figure 15 is pruned with beam pruning, just as in the standard two-dimensional search. The effect is that partial hypotheses with low probability are not investigated further, leaving more computational resources for the most promising hypotheses. In the extended search space, the speaker characteristics are part of a hypothesis, so beam pruning removes hypotheses with unlikely speaker parameters. In effect this is speaker adaptation - as the Viterbi search progresses, unlikely speaker characteristics are successively pruned, and the speaker's position in the speaker space gradually becomes more precisely specified. Consequently, more resources can be allocated to the other dimension. Thus, in this framework, adaptation in the sense that something in the system is progressively changed is the shifting balance of attention in the search from speaker characteristics to HMM states.
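The extended search can be sketched with a toy Viterbi decoder whose hypotheses are indexed by (HMM state, speaker value) and where the speaker value is never allowed to change along a path. Everything concrete here is an assumption for illustration: the two-state left-to-right HMM, the one-dimensional observations, the additive speaker shift of the state means, the beam width and the function names. The decoder in Paper 5 is not reproduced.

```python
import math


def log_gauss(x, mean, std=1.0):
    """Log density of a 1-D Gaussian."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def prune(hyps, beam):
    """Keep only hypotheses whose score is within `beam` of the best one."""
    best = max(score for score, _ in hyps.values())
    return {key: val for key, val in hyps.items() if val[0] >= best - beam}


def decode(observations, state_means, speaker_shifts, log_trans, beam):
    """Viterbi search over (state, speaker) with beam pruning."""
    # Frame 0: start in state 0 for every candidate speaker (uniform prior).
    log_prior = -math.log(len(speaker_shifts))
    active = {
        (0, k): (log_prior + log_gauss(observations[0], state_means[0] + shift), [0])
        for k, shift in enumerate(speaker_shifts)
    }
    active = prune(active, beam)
    survivors = [sorted({k for _, k in active})]
    for obs in observations[1:]:
        expanded = {}
        for (state, k), (score, path) in active.items():
            for nxt, log_p in log_trans[state].items():
                mean = state_means[nxt] + speaker_shifts[k]
                cand = score + log_p + log_gauss(obs, mean)
                key = (nxt, k)  # speaker index k never changes along a path
                if key not in expanded or cand > expanded[key][0]:
                    expanded[key] = (cand, path + [nxt])
        active = prune(expanded, beam)
        # Record which speaker values are still inside the beam at this frame.
        survivors.append(sorted({k for _, k in active}))
    (_, best_speaker), (_, best_path) = max(active.items(), key=lambda item: item[1][0])
    return best_path, best_speaker, survivors


# Toy setup: the observations were made up to suit the speaker with shift +2.0.
STATE_MEANS = [0.0, 5.0]
SPEAKER_SHIFTS = [-2.0, 0.0, 2.0]
LOG_TRANS = {0: {0: math.log(0.5), 1: math.log(0.5)}, 1: {1: 0.0}}
OBSERVATIONS = [2.3, 1.8, 2.1, 6.9, 7.2, 7.0]

best_path, best_speaker, survivors = decode(
    OBSERVATIONS, STATE_MEANS, SPEAKER_SHIFTS, LOG_TRANS, beam=10.0
)
print(best_path, best_speaker)
print(survivors)  # speaker indices inside the beam, frame by frame
```

In this toy run all three speaker values start inside the beam and the set of surviving values shrinks as frames are processed, which mirrors the gradual specification of the speaker's position described above. No separate estimation step is needed; the pruning does the work.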
As a final note, the speaker modeling approach and speaker consistency modeling do not require an ANN model - a related adaptation scheme in the HMM domain is found in a paper by Leggetter and Woodland (1994), and a slightly different approach to implementing the speaker consistency principle is taken by Hazen and Glass (1997). In their consistency model, the key concept is long-range correlation between speech sounds. No explicit speaker model is used, but the method is successfully combined with speaker clustering and a technique called "reference speaker weighting". Both of these implicitly define speaker models through the space spanned by the clusters and the reference speakers, respectively.
Figure 15. Search space of the dynamic decoding. Top: the standard search space with the two dimensions time and HMM state. The objective of the dynamic decoding is to find the most likely path through the search space. Because of beam pruning, many paths in the search space are never investigated. This is indicated by the shaded "beam" in the figure. Bottom: the search space with an additional speaker characteristics dimension. At the beginning of the search, all speaker characteristics are inside the beam, but as the search progresses, unlikely speaker characteristics are successively pruned, and the speaker's position in the speaker space gradually becomes more precisely specified.