Million Hour Corpus Project

The goal of this project is to collect and process at least one million hours of recorded speech in English and another million hours of speech in non-English languages.  The goal for non-English languages may be satisfied by collecting at least 10,000 hours of speech in each of at least 100 languages.

Although the cost of storing and processing a million hours of speech is quite moderate, speech recognition and synthesis research has been limited to corpora of fewer than 10,000 hours in English and smaller still in other languages.  Historically, substantially more data, combined with the time and effort to do the research needed to use it properly, has led to improved performance.  Given these facts, why has so little speech data been collected?
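To give a rough sense of the storage involved, the arithmetic can be sketched as below.  The audio format (16 kHz, 16-bit, mono, uncompressed PCM) is an assumption made for illustration and is not specified by the project; the function name is likewise hypothetical.

```python
# Rough storage estimate for a speech corpus.
# ASSUMPTION: uncompressed 16 kHz, 16-bit, mono PCM. The project text
# does not state a recording format; these defaults are illustrative only.

def corpus_size_bytes(hours, sample_rate_hz=16_000, bytes_per_sample=2, channels=1):
    """Return the raw size in bytes of `hours` of uncompressed audio."""
    seconds = hours * 3600
    return seconds * sample_rate_hz * bytes_per_sample * channels

# One million hours of English speech.
english_hours = 1_000_000

# The non-English goal: 10,000 hours in each of 100 languages.
non_english_hours = 10_000 * 100

english_bytes = corpus_size_bytes(english_hours)
print(f"{english_bytes / 1e12:.1f} TB")  # prints "115.2 TB"
```

Under these assumptions each million-hour corpus occupies on the order of a hundred terabytes before any compression; a different sample rate or a compressed format would change the figure proportionally.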

There are three main problems, all of which are addressed by this or other ICISLT projects.

1) The first problem is the cost of transcription.  It is orders of magnitude less expensive to simply collect speech than it is to transcribe it.  The present research corpora consist of material that has been expensively transcribed or for which transcriptions or near-transcriptions (such as closed captioning on television broadcasts) are available from other sources.  This project will undertake only the recording and collection of the million hour corpora.  Other ICISLT research projects will develop algorithms for training on speech for which transcripts are not available.

2) The second problem is that continually collecting more data makes it difficult to do before-and-after comparisons of performance.  That is, if more data has been collected and utilized, there will always be a performance improvement, and it will be difficult to determine how much of that improvement is due to the greater quantity of data and how much is due to other causes.  Because research and algorithm changes are necessary to properly utilize a significantly increased quantity of data, you can't simply run the old algorithms on the new data, measure that gain, and attribute all of the remaining improvement to other causes.

On the other hand, if you are trying to get the best possible performance then it would be silly to use any less than all the data that is available.  Certainly no commercial company would deliberately put out an inferior product just to make research comparisons easier.

With the million hour corpus, this problem solves itself.  At least for English, once the million hour corpus has been collected, there will be no great need to continually collect more data.  Performance will depend less on collecting yet more data than on learning to better utilize the million hours already available.  With the data held fixed, further performance improvements can be attributed to algorithm and modeling changes.

3) A third problem is the fear that developing techniques based on a large quantity of data will make it difficult and time-consuming to develop systems for new languages.  A new language may become of interest without notice, especially for security needs.  The ICISLT 100 languages project will solve this problem by working on at least 100 languages at once.  Not only will most languages of interest already be covered, but with this amount of experience, adding a new language will be routine.


Copyright © 2005 James K. Baker