7. Summaries and comments on individual papers

7.1 Paper 1

This paper was written after the first implementation of the ASR module of the WAXHOLM demonstration system. Since then, the system has been continuously improved and has been one of the main tools for the experiments in this thesis.
The A* algorithm is defined in a graph formalism, where the input and output of the algorithm, as well as the lexicon and grammar model, are represented by directed graphs. The acoustic input observation of an utterance, and the word sequence hypotheses that are the output, are represented by graphs called observation graphs. The nodes of an observation graph carry a time-tag, and arcs between the nodes represent some aspect of the observation between the two times. The graph representing the lexicon and grammar model is a probabilistic finite state automaton, e.g., an HMM. The graph formulation of the A* algorithm is also used in Paper 3 and has been a useful conceptual tool in the development of the search.
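As a minimal illustration of the formalism, an observation graph can be represented roughly as follows; the class and field names are my own assumptions, not data structures from the paper.

    # A minimal sketch of an observation graph: time-tagged nodes and arcs that
    # carry a hypothesis about the observation between two times. The names and
    # fields are illustrative assumptions, not the paper's data structures.
    from dataclasses import dataclass, field

    @dataclass
    class ObsNode:
        time: float        # time-tag of the node (e.g., a frame index or seconds)

    @dataclass
    class ObsArc:
        source: int        # index of the start node
        target: int        # index of the end node
        label: str         # e.g., a phone or word hypothesis for this time span
        log_score: float   # log-probability of the hypothesis

    @dataclass
    class ObservationGraph:
        nodes: list = field(default_factory=list)   # ObsNode instances, ordered by time
        arcs: list = field(default_factory=list)    # ObsArc instances between the nodes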
A large part of the paper is concerned with computational aspects of the search. CPU and memory requirements are discussed for the implementation of the first pass of the A* algorithm, but the graph reduction methods in the second, stack-decoding, pass of the implementation are the most important contribution of this paper. The idea is to utilize the regularities of the words in the lexicon to reduce the effective search space. For example, many words begin with the same phoneme, and that phoneme need not be evaluated more than once at a particular time. This notion is generalized with the help of quotient graphs. The quotient graph is typically significantly smaller than the original model graph, but every node of the original graph has a corresponding node in the quotient graph. The search can be made in the quotient graph, and when a search result is needed for a node in the original graph, there is always a reference to a quotient graph node where the result is available.
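The shared-prefix idea can be illustrated with a small Python sketch; this is an illustrative trie over phoneme strings under my own naming, not the paper's actual quotient-graph construction.

    # Words that start with the same phoneme sequence share evaluation states,
    # and every (word, position) state of the original lexicon keeps a
    # reference to the shared state where the score is actually computed.
    def build_shared_prefix_states(lexicon):
        """lexicon: dict mapping word -> list of phonemes."""
        shared_states = {}   # phoneme prefix (tuple) -> shared state id
        reference = {}       # (word, position) -> shared state id
        for word, phonemes in lexicon.items():
            for pos in range(1, len(phonemes) + 1):
                prefix = tuple(phonemes[:pos])
                if prefix not in shared_states:
                    shared_states[prefix] = len(shared_states)
                reference[(word, pos - 1)] = shared_states[prefix]
        return shared_states, reference

    lexicon = {"boat": ["b", "ow", "t"], "bone": ["b", "ow", "n"], "beat": ["b", "iy", "t"]}
    states, refs = build_shared_prefix_states(lexicon)
    # The three words share the state for ("b",); "boat" and "bone" also share ("b", "ow").
    print(len(states), "shared states for", sum(len(p) for p in lexicon.values()), "original states")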

"Optimising the Lexical Representation to Speed Up A* Lexical Search," STL QPSR 2-3/1994, pp. 113-124.


7.2 Paper 2

In this report, the speaker-sensitive ANN introduced in Ström (1994b) is further investigated. The general framework, an ANN for phoneme evaluation with a set of extra input units that characterize the speaker, is inspired by the work of Carlson and Glass (1992a,b), and is also used in Paper 5 and discussed in Paper 4.
In this report, the speaker parameters supplied by the extra input units are automatically extracted, i.e., they have no explicit relation to any knowledge-based parameters. However, using a novel analysis-by-synthesis method, the influence of the automatically extracted parameters was visualized in the formant space. An ensemble of synthetic vowels was generated by a formant synthesizer driven by an LF voice source (Fant, Liljencrants and Lin, 1985). The individual vowels of the ensemble varied only in F1 and F2 (the first and second formant frequencies). The synthetic vowels were then fed to the ANN and the vowel classification was recorded. This makes it possible to draw a map of the phoneme boundaries in the F1/F2 space for the ANN. By repeating the procedure with different speaker parameters, the effect of the speaker parameters on the phoneme boundaries can be studied. It was found that, in agreement with theory, the phoneme boundaries were lower in frequency for speaker parameters corresponding to male voices than for female voices. In another analysis, a correlation between the knowledge-based parameter fundamental frequency and one of the two automatically generated parameters was also found.
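The visualization procedure can be sketched as a simple loop; synthesize_vowel and ann.classify are assumed placeholders here, standing in for the formant synthesizer and the speaker-sensitive ANN described above.

    # Sweep F1/F2, synthesize a vowel for each grid point, classify it with the
    # ANN under fixed speaker parameters, and record the winning phoneme; the
    # phoneme boundaries lie where the recorded label changes.
    import numpy as np

    def phoneme_boundary_map(ann, speaker_params, synthesize_vowel,
                             f1_range=(250, 900), f2_range=(600, 2800), steps=50):
        f1_values = np.linspace(*f1_range, steps)
        f2_values = np.linspace(*f2_range, steps)
        labels = np.empty((steps, steps), dtype=object)
        for i, f1 in enumerate(f1_values):
            for j, f2 in enumerate(f2_values):
                waveform = synthesize_vowel(f1=f1, f2=f2)   # only F1 and F2 vary
                labels[i, j] = ann.classify(waveform, speaker_params)
        return f1_values, f2_values, labels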
This report was written during my stay as a guest researcher at ATR, Kyoto, Japan. As a curiosity I can mention that, as part of my attempt to learn the Japanese language, it was written on a Macintosh with only Japanese labels on all buttons and menu items. However, I never quite succeeded in learning the kanji characters, so sometimes some pretty spectacular things happened on the screen. This was how I learned to recognize the "undo" character.

"A Speaker Sensitive Artificial Neural Network Architecture for Speaker Adaptation," ATR Technical Report, TR-IT-0116, 1995, ATR, Japan. 


7.3 Paper 3

This paper presents the status of the continuous speech recognition engine of the WAXHOLM project at the end of 1996. At this point, the demonstration system was mature enough to be displayed and tested outside the laboratory by complete novices. One such successful attempt was made at "Tekniska Mässan" (the technology fair) in Älvsjö in October 1996, where visitors with no prior experience of the system were invited to try the demonstrator in a rather noisy environment.
All parts of the ASR module are described in some detail in the paper, including the different modes of operation: standard CDHMM, hybrid HMM/ANN, and a general phone-graph input mode. However, the focus is on the aspects of the system that are original in some sense, and the ANN part of the system is covered more thoroughly in Paper 6.
Most of the report is devoted to the dynamic decoding block of the system. The graph representation of the search space of the dynamic decoding is described in detail, and algorithms for graph reduction, Viterbi search, and A* stack decoding are given.
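As an illustration of the kind of search involved, the following is a generic time-synchronous Viterbi pass over a graph of emitting states; it is a sketch of the textbook algorithm under my own interface, not the system's implementation.

    import math

    # graph: dict mapping node -> list of (successor, transition log-probability);
    # acoustic(node, frame): log-probability of the frame given the node;
    # initial/final: sets of non-emitting start nodes and emitting end nodes.
    def viterbi(graph, nodes, acoustic, frames, initial, final):
        score = {n: (0.0 if n in initial else -math.inf) for n in nodes}
        back = [{} for _ in frames]
        for t, frame in enumerate(frames):
            new_score = {n: -math.inf for n in nodes}
            for n in nodes:
                if score[n] == -math.inf:
                    continue
                for succ, trans_lp in graph[n]:
                    cand = score[n] + trans_lp + acoustic(succ, frame)
                    if cand > new_score[succ]:
                        new_score[succ] = cand
                        back[t][succ] = n
            score = new_score
        best_end = max(final, key=lambda n: score[n])
        path = [best_end]                       # trace back the best state sequence
        for t in range(len(frames) - 1, 0, -1):
            path.append(back[t][path[-1]])
        return score[best_end], list(reversed(path))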
The optimization of the lexical graph is perhaps the most original aspect of the decoding block of the system. This is a continuation of the work reported in Paper 1. The resulting graph has only one word-end and one word-start node for each word class, which results in a very small number of word-connecting arcs. Without the graph reduction, the algorithm would spend a large part of its CPU time processing the word-connecting arcs. The key concept used to achieve this high degree of reduction is the word pivot arc, an innovation that was not used in Paper 1.
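A back-of-the-envelope illustration of why this matters, with invented numbers: with V words fully interconnected by a bigram grammar, the decoder must handle on the order of V*V word-connecting arcs, whereas with one word-end and one word-start node per word class (C classes) it handles roughly V arcs into the class end nodes, C*C class-to-class arcs, and V arcs out to the word starts.

    V, C = 1000, 40
    full_bigram_arcs = V * V                  # 1,000,000 word-to-word arcs
    class_based_arcs = V + C * C + V          # 3,600 arcs with per-class end/start nodes
    print(full_bigram_arcs, class_based_arcs)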
For the estimation of bigram grammar probabilities, a novel estimate based on Zipf's law for word frequency is proposed. However, the WAXHOLM corpus is not large enough for a conclusive judgment of the method.
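Zipf's law states that the frequency of the r-th most frequent word is roughly proportional to 1/r. Purely as a hedged sketch of how such a law can be brought to bear on sparse bigram data, and not necessarily the estimate proposed in the paper, one can interpolate the maximum-likelihood bigram with a Zipf-shaped prior over successor words:

    # Interpolate the observed bigram relative frequency with a prior that is
    # proportional to 1/rank of the successor word (Zipf-shaped). This is an
    # illustrative construction, not the estimate from the paper.
    def zipf_backoff_bigram(bigram_counts, word_ranks, history, word, backoff_mass=0.1):
        """bigram_counts: dict (w1, w2) -> count; word_ranks: dict word -> rank (1 = most frequent)."""
        zipf_norm = sum(1.0 / r for r in word_ranks.values())
        zipf_prior = (1.0 / word_ranks[word]) / zipf_norm
        history_total = sum(c for (w1, _), c in bigram_counts.items() if w1 == history)
        if history_total == 0:
            return zipf_prior
        ml_estimate = bigram_counts.get((history, word), 0) / history_total
        return (1.0 - backoff_mass) * ml_estimate + backoff_mass * zipf_prior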

"Continuous Speech Recognition in the WAXHOLM Dialogue System," STL QPSR 4/1996, pp. 67-96.


7.4 Paper 4

This book illuminates the subject of talker variability from the different viewpoints of research in auditory word recognition, speech perception, voice perception, and ASR. A common theme of all chapters is to approach speaker variability as a source of information rather than unwanted noise. In the ASR field in particular, the latter has often been the case in the past. This approach is very clear in my chapter, which describes the ideas behind the experiments with fast speaker adaptation reported in Paper 2 and Paper 5. The concepts of a speaker space and an explicit model of the speaker variability are laid out.
The chapter also contains a survey of existing speaker adaptation methods, and a comparison with adaptation effects in the human speech perception process is made.

"Speaker Modeling for Speaker Adaptation in Automatic Speech Recognition," in: Talker Variability in Speech Processing, Chapter 9, pp. 167-190, Eds: Keith Johnson & John Mullennix, 1997, Academic Press.


7.5 Paper 5

This is the most recent of the included papers on fast speaker adaptation. In this paper, the speaker modeling approach introduced in Ström (1994b), and developed further in Paper 2, is used in a full continuous speech recognition experiment.
The adaptation performance was evaluated on the WAXHOLM database. Overall, the adaptation effect was not very large, but the database contains a large portion of very short utterances (only a few words), and these were not long enough for the adaptation to have a positive effect. On average, the adaptation improved the word-error rate for utterances longer than three words, but increased the word-error rate for shorter utterances. The utterance-level result was, in contrast, slightly improved for all utterance lengths.
Although it was shown that the proposed adaptation method can improve recognition at both the word and utterance level, the merit of this paper is not the results achieved. The ANN used in the study is rather small and not the highest performing available, the speaker space used in the experiments is simplistic (only two speaker parameters), and the sampling of the speaker space is coarse. The main contribution of this paper is that it showed that the method is feasible in a full ASR system.

"Speaker Adaptation by Modeling the Speaker Variation in a Continuous Speech Recognition System," Proc. ICSLP '96, Philadelphia, pp. 989-992.


7.6 Paper 6

This paper focuses on the latest developments, but provides detailed information on a complete ANN toolkit developed during the course of my thesis studies. The Department of Speech, Music and Hearing at KTH has a history of research in the area (Elenius and Takacs, 1990), and the first step in the development of the toolkit is reported in Ström (1992). Formulae are given for back-propagation training of the generalized dynamic ANN architecture used. The dynamic networks require that the units' activities are computed in a particular, time-asynchronous order, and an algorithm for computing this order is described in detail. All aspects of network training and evaluation are discussed: network topology, weight initialization, input feature representation, the "softmax" output activation function, and the theory that establishes the link between multi-layer perceptrons and a posteriori phoneme probabilities.
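As a short standalone illustration of the softmax output and its probabilistic interpretation, the usual hybrid HMM/ANN recipe (not code from the toolkit) looks as follows: with softmax outputs and a suitable training criterion, the outputs approximate the posterior probabilities P(phoneme | input), and dividing by the phoneme priors gives scaled likelihoods for the HMM decoder.

    import numpy as np

    def softmax(activations):
        shifted = activations - np.max(activations)   # subtract max for numerical stability
        exps = np.exp(shifted)
        return exps / exps.sum()

    def scaled_likelihoods(output_activations, phone_priors):
        posteriors = softmax(output_activations)      # approximate P(phoneme | input)
        return posteriors / phone_priors              # proportional to P(input | phoneme)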
This paper also announces the toolkit as a publicly available software resource, free for the research community to use. An appendix gives detailed instructions for training and evaluating ANNs for phoneme probability estimation with the toolkit. The source code and documentation are available on the Internet. To date, the toolkit has been downloaded from more than 75 different sites worldwide.
The innovative part of the paper is the sparse connection and connection pruning of the networks, which, to my knowledge, had not been used for phone probability estimation ANNs before. The sparse connectivity of the networks allows much larger hidden layers without increasing the computational demands, and this is shown to improve recognition accuracy significantly. The phoneme recognition results for the TIMIT database are in the range of the lowest error rates reported, and outperform all HMM-based systems on the core test set of the database.
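A small illustration of the computational argument, with invented numbers: the forward-pass cost of a layer scales with the number of connections rather than the number of units, so a sparsely connected network can afford a much larger hidden layer at the same cost.

    inputs = 400
    dense_hidden = 300
    dense_connections = dense_hidden * inputs                        # 120,000 connections

    sparse_hidden, connectivity = 3000, 0.10                         # keep 10% of possible connections
    sparse_connections = int(sparse_hidden * inputs * connectivity)  # also 120,000 connections
    print(dense_connections, sparse_connections)                     # same cost, ten times more hidden units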
Since this paper was published, further development of the toolkit and a new ANN architecture have led to additional improvements in phoneme recognition. Currently, our lowest error rate for the core test set of the TIMIT database is 26.7% (Ström, 1997). The development of other systems has also resulted in reduced error rates; Chang and Glass (1997) report an error rate of 26.6% using their segment-based approach.

"Phoneme Probability Estimation with Dynamic Sparsely Connected Artificial Neural Networks,"The Free Speech Journal, Vol 1(5), 1997.