Extreme Speech Recognition System

Distinctive Features of the Extreme System

  • Emphasis on greater use of knowledge from many sources

    • Acoustic modeling with more knowledge of articulation

    • Acoustic modeling with more knowledge of perception

    • Speaker-independent modeling as population of distinct individual voices

    • Acoustic modeling with millions of hours of data

    • Acoustically-sensitive language modeling

    • Injecting Lexical, Syntactic, and Semantic Preferences in Parser-based Models
    • Knowledge acquisition from millions of users

    • Training language models on trillions of words of text

  • Modular architecture with many layers and parallel modules

    • Multiple diverse and redundant modules

      • Allows knowledge-based modules to be combined with statistical modules

      • Improves performance even of purely statistical systems

      • Novel methods to use multiple modules computationally efficiently

    • More levels of analysis also permit more kinds of knowledge

  • Novel training methods

    • Training on untranscribed/unannotated data

    • Training distinct levels with distinct objective functions

    • Joint optimization of multiple parallel modules

    • Unique optimization for critical pair-wise discrimination task

    • Optimum training formulated as constrained optimization problem

    • Distributed computing to train on massive distributed data sources

Emphasis on Greater Use of Knowledge

The Extreme System builds on the work of the past thirty years.  Its highly modular framework can easily incorporate any of the existing techniques.  However, the Extreme System is ambitious not only in its size and scope, but in the number of innovations that are being made in going from one generation of speech recognition to the next.  For more than twenty years, most of the improvement in speech recognition has come from gradual incremental improvements.

However, it is the belief of the ICISLT program that now is the time to make a substantial number of interdependent, mutually beneficial improvements.  Although the steady accumulation of gradual incremental improvements has led to a large overall improvement in performance, there has also been constant frustration from research experiments that do not work in this paradigm.  In particular, even where we know that the current models represent speech incorrectly, systems that attempt to inject more knowledge into the process often fail to perform as well as the current known-to-be-incorrect models.

One reason for this result is that it is very difficult to get a knowledge-based system to have as complete a representation of the facts as the automatically trained systems.  Even though automatically trained systems have less human-supplied knowledge and use models that are structurally incorrect, it is easy to train either stochastic modeling systems or neural networks on a large quantity of data.  The training process adjusts the model parameters to get the best possible fit to this large quantity of training data.  This fitting lessens the effect of incorrect structural assumptions because the trained model parameters are adjusted to compensate.  In addition, by being trained on a very large quantity of data, these automatically trained systems learn to fit their models to a large number of facts, even though they do it blindly, lacking underlying knowledge.

It still seems, however, that at least further incremental improvement could be made by incrementally adding more knowledge.  Unfortunately, knowledge-based modules do not work well within the current architectures of automatically trained systems.  When automatically trained modules are replaced by initial versions of knowledge-based modules, the performance suffers because the knowledge-based systems lack all the detailed facts that the automatic systems have learned by brute force from training on a massive amount of data.  Under such an architecture, the knowledge-based system can show an improvement only once its knowledge is very complete and very detailed, which would require many years of development before any improvement could be shown.

Multiple Module Architecture

The Extreme System changes this architectural limitation.  Each task is done jointly by many diverse and redundant modules.  These modules can include both knowledge-based modules and automatically trained modules.  Of course, it is not simply a matter of putting many modules together and, say, letting them vote to get the final answer.  Indirectly, that technique has already been tried, but it does not change the basic phenomenon.  The multiple module architecture must be accompanied by new training techniques that do not train the modules separately but rather train them jointly to cooperate with each other to produce a jointly optimum result.
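
The idea of training modules jointly rather than separately can be illustrated with a minimal sketch.  It is not the Extreme System's training method, which this page does not specify.  It assumes a toy two-class task, a hypothetical fixed rule standing in for a knowledge-based module, a hypothetical linear scorer standing in for an automatically trained module, and plain gradient descent on the log loss of the combined output.  All function names and parameters are illustrative assumptions.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def knowledge_module(x):
    # Hypothetical hand-written rule: fixed, incomplete, never trained.
    return 1.0 if x[0] > 0.5 else -1.0

def statistical_module(x, w):
    # Hypothetical trainable linear scorer.
    return sum(wi * xi for wi, xi in zip(w, x))

def joint_loss(data, w, c):
    # Mean log loss of the combined system (c = knowledge weight, statistical weight, bias).
    total = 0.0
    for x, y in data:
        z = c[0] * knowledge_module(x) + c[1] * statistical_module(x, w) + c[2]
        p = min(max(sigmoid(z), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)

def train_jointly(data, n_features, steps=200, lr=0.5):
    # Update the statistical module and the combination weights together,
    # so the trained module learns what the fixed rule misses.
    w = [0.0] * n_features
    c = [0.0, 1.0, 0.0]
    n = len(data)
    for _ in range(steps):
        gw = [0.0] * n_features
        gc = [0.0, 0.0, 0.0]
        for x, y in data:
            k = knowledge_module(x)
            s = statistical_module(x, w)
            err = sigmoid(c[0] * k + c[1] * s + c[2]) - y
            gc[0] += err * k
            gc[1] += err * s
            gc[2] += err
            for i, xi in enumerate(x):
                gw[i] += err * c[1] * xi
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        c = [ci - lr * g / n for ci, g in zip(c, gc)]
    return w, c

def make_data(n=200, seed=0):
    # Toy data: the rule explains most positives, but not those with x[1] > 0.8.
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.random(), rng.random()]
        data.append((x, 1 if (x[0] > 0.5 or x[1] > 0.8) else 0))
    return data
```

Calling `train_jointly(make_data(), 2)` returns the trained scorer weights and combination weights; comparing `joint_loss` before and after training shows how much the combined system improves on this toy task.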

Another reason that research in knowledge-based approaches has difficulty catching up with automatically trained models is that the performance of the automatically trained models is a moving target.  The systems keep getting better and especially learn more and more detailed facts as they are trained on larger and larger data corpora.  It has been suggested that the potential for knowledge-based approaches could be shown by handicapping the automatically trained systems by allowing them to train on only a small fraction of the data that is available.  However, for any real deployment the best system trained on all available data would be used.  Pragmatic considerations indicate that the results of systems deliberately operating under a handicap would not be interesting to decision makers.

Massive Data Collections and Novel Training Techniques

The ICISLT program endorses the opposite point of view.  It is economically feasible to collect and process orders of magnitude more data than are presently being used.  Once this very large amount of data has been collected, even the automatically trained systems will have difficulty digesting it all.  Knowledge-based systems should have the advantage of being able to generalize better from a small amount of training data.  Properly designed, however, they also have the potential to make better sense of what would otherwise be an overwhelming amount of data.  Of course, acquiring this knowledge from these massive data sets will require new training techniques, distinct from the novel training methods the Extreme Speech System already plans to develop to jointly train multiple modules.

One class of novel training technique will actually help the automatically trained modules as well as the knowledge-based modules.  We refer to these techniques as "Training on Untranscribed Data."  Literally, "transcription" refers to the process of having a human write down the sequence of words that occur in a corpus of speech utterances.  However, the training technique referred to applies to any situation in which conventional training might require human-supplied annotations to be added to the training corpus.  Examples would include human annotations of text for training language models that use higher-level knowledge than merely word sequences, in particular parses for syntactic knowledge and semantic tags or categories for semantic knowledge.  Examples would also include human annotation of acoustic phonetic phenomena beyond those that can be derived simply from knowing the word sequence.
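
This page does not say which algorithms are used to train on untranscribed data.  As one familiar illustration of the general idea, the sketch below shows confidence-thresholded self-training: a model trained on a small annotated set labels unannotated items, and only the labels it is confident about are added to the training set.  The nearest-centroid classifier, the one-dimensional features, and the `min_margin` threshold are all assumptions chosen for brevity.

```python
def centroids(labeled):
    # Mean feature value per label, from (value, label) pairs.
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(cents, x):
    # Return the nearest label and its margin over the runner-up label.
    ranked = sorted(cents, key=lambda y: abs(x - cents[y]))
    best = ranked[0]
    if len(ranked) == 1:
        return best, float("inf")
    margin = abs(x - cents[ranked[1]]) - abs(x - cents[best])
    return best, margin

def self_train(labeled, unlabeled, min_margin=1.0, rounds=5):
    # Repeatedly adopt confident machine-made labels as training data.
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        cents = centroids(labeled)
        remaining = []
        added = False
        for x in pool:
            label, margin = predict(cents, x)
            if margin >= min_margin:
                labeled.append((x, label))
                added = True
            else:
                remaining.append(x)  # too ambiguous; leave unannotated
        pool = remaining
        if not added:
            break
    return centroids(labeled), pool
```

For example, `self_train([(0.0, "a"), (10.0, "b")], [1.0, 2.0, 9.0, 8.0, 5.2])` adopts labels for the four clear items and leaves the ambiguous value 5.2 in the returned pool.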

Other novel training techniques introduce algorithms supporting distributed computing to allow training on millions of hours of speech and to get the maximum benefit from on-going training of systems widely deployed in the field.
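
The distributed algorithms themselves are not described on this page.  One simple pattern that makes training on distributed data possible, shown here only as an assumed illustration, is to have each site compute additive statistics on its own shard and then merge them centrally.  The sketch uses bigram counts for a language model; the function names and the tiny corpus are hypothetical.

```python
from collections import Counter

def shard_counts(sentences):
    # Bigram counts for one shard; this step can run where the data lives.
    counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        counts.update(zip(words, words[1:]))
    return counts

def merge(count_list):
    # Counts are additive, so merging shards equals counting the whole corpus.
    total = Counter()
    for counts in count_list:
        total.update(counts)
    return total

def bigram_prob(total, prev, word):
    # Unsmoothed maximum-likelihood estimate of P(word | prev).
    denom = sum(n for (p, _), n in total.items() if p == prev)
    return total[(prev, word)] / denom if denom else 0.0

shards = [["the cat sat", "the dog sat"], ["the cat ran"]]
total = merge(shard_counts(shard) for shard in shards)
```

Because only the counts travel between machines, the raw data never has to be gathered in one place.  Models whose training statistics are not simply additive would need a different scheme.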

   

 

 

Here is a web version of a PowerPoint presentation of the Extreme Speech architecture

 
 

Copyright © 2005 James K. Baker