Speech Recognition Simulation and its Application for Wizard of Oz Experiments

This contribution focuses on the simulation of speech recognition systems in the framework of WoZ (Wizard of Oz) experiments performed to design and evaluate dialogue-based vocal systems. Simulating such systems is useful when different dialogue management strategies are to be evaluated in terms of user satisfaction and no ASR engine, or no training data (necessary to build the acoustic, phonetic and/or language models of an existing ASR engine), is available. The aim of the described work is to build a methodology and a tool that allow recognition errors to be simulated in a controlled way. More precisely, we describe an approach that produces, for any set of word sequences, simulated recognition outputs (i.e. noised versions of the input sequences) such that the average WER (Word Error Rate) and WA (Word Accuracy) scores computed on the simulated outputs match, as accurately as possible, pre-defined values representative of a targeted true speech recognition engine. Under this assumption, the recognised outputs can then be fed to a dialogue manager during a WoZ experiment in order to test, for different issues related to dialogue management strategy (such as the level of mixed initiative, or the management of repetitions or of conflicting data), their adequacy and robustness with respect to speech recognition quality. Two different approaches are proposed and evaluated: 1. the first approach aims at simulating WA and WER levels only. In this case, no information beyond the input data is needed, and the simulation is implemented as a stochastic process whose parameters (probabilities for substitution, insertion and deletion actions) can be estimated on a corpus produced by the targeted speech recognition system. The main advantage of this approach is its simplicity.
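The first, stochastic approach can be sketched as follows. This is a minimal illustration, not the paper's tool: the probability values and the vocabulary used for substitutions and insertions are hypothetical placeholders for parameters that would be estimated on a corpus produced by the targeted recogniser.

```python
import random

def simulate_recognition(words, p_sub=0.10, p_del=0.05, p_ins=0.03,
                         vocab=("yes", "no", "maybe"), rng=random):
    """Return a noised copy of `words` simulating a recognition output.

    Each word is independently deleted with probability p_del or replaced
    by a random vocabulary word with probability p_sub; after each kept or
    substituted word, an extra word is inserted with probability p_ins.
    """
    output = []
    for word in words:
        r = rng.random()
        if r < p_del:
            continue                          # deletion: drop the word
        elif r < p_del + p_sub:
            output.append(rng.choice(vocab))  # substitution
        else:
            output.append(word)               # correct recognition
        if rng.random() < p_ins:
            output.append(rng.choice(vocab))  # insertion
    return output
```

Averaged over many sentences, the three probabilities control the simulated WER and WA levels, which is all this first approach guarantees.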
However, it also remains limited, as it does not fully simulate any real speech recognition engine: it does not produce the same sentences as the real speech recogniser would; 2. to overcome this drawback, the second approach integrates the Viterbi decoding algorithm used in many HMM-based speech recognition systems. In other words, the goal is now not only to guarantee global characteristics such as average WER and WA, but also to integrate into the simulated recognition process more of the information actually used by the real speech recogniser. In particular, the phonetic lexicon and language model are taken into account in order to produce sentences closer to those the real speech recogniser might have produced. More precisely, for this second approach the input consists of the word sequences along with the available phonetic and language models. The input word sequences are first transformed into phoneme sequences using the phonetic lexicon. Each phoneme is then transformed into a fixed number of feature vectors, each associated with a time frame and corresponding to a probability distribution over the phoneme set; these probability distributions are derived from a predefined phoneme confusion matrix. Meta-parameters, such as the relative confidence weights of the language and acoustic models or the pruning factors, are trained in the usual way. The evaluation of the first approach was carried out using results produced by the Loquendo speech recognition system. Different noising techniques, representing different noise conditions, were applied to a set of 1'370 sentences recorded during the Inspire project. The resulting data was recognised by the Loquendo speech recogniser, and the obtained WA and WER scores were used as parameters for the simulator. The original sentences were then used as input word sequences for the simulator.
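The pipeline of the second approach can be illustrated under strong simplifying assumptions: the tiny phonetic lexicon and phoneme confusion matrix below are hypothetical stand-ins for the real models, each phoneme yields a single frame rather than a fixed number of them, and a word-level maximisation stands in for the full Viterbi search over the language model.

```python
import math

# Illustrative P(observed phoneme | true phoneme); a real confusion
# matrix would be estimated for the targeted recogniser.
CONFUSION = {
    "b": {"b": 0.8, "p": 0.2},
    "p": {"p": 0.7, "b": 0.3},
    "a": {"a": 0.9, "e": 0.1},
    "e": {"e": 0.9, "a": 0.1},
    "t": {"t": 1.0},
}

# Hypothetical phonetic lexicon: word -> pronunciation.
LEXICON = {"bat": ["b", "a", "t"], "pat": ["p", "a", "t"],
           "bet": ["b", "e", "t"], "pet": ["p", "e", "t"]}

def frames_for(phonemes):
    """Turn true phonemes into per-frame distributions over the phoneme set."""
    return [CONFUSION[p] for p in phonemes]

def decode(frames, lm_weight=1.0, lm=lambda w: 1 / len(LEXICON)):
    """Pick the lexicon word maximising acoustic score + weighted LM score."""
    best_word, best_score = None, -math.inf
    for word, pron in LEXICON.items():
        if len(pron) != len(frames):
            continue  # a real decoder aligns; here lengths must match
        acoustic = sum(math.log(f.get(p, 1e-9)) for f, p in zip(frames, pron))
        score = acoustic + lm_weight * math.log(lm(word))
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```

With noise added to the frame distributions, the decoder can output a different but phonetically close word, which is the kind of confusion a real recogniser produces.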
The evaluation of the simulated recognition outputs, carried out with the MAPSSWE (Matched Pairs Sentence-Segment Word Error) test, showed WA and WER scores very similar to those originally obtained with Loquendo: the average relative difference computed over all test sentences was 1.54%.
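The WER scores compared above follow the standard edit-distance formulation; the generic sketch below is not the paper's scoring tool.

```python
def wer(reference, hypothesis):
    """Word Error Rate: minimum substitutions + insertions + deletions
    needed to turn the hypothesis into the reference, divided by the
    reference length (standard Levenshtein dynamic programming)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```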
Published in 2004