Million Hour Corpus Project
The goal of this project is to collect and
process at least one million hours of recorded speech in English
and another million hours of speech in non-English languages.
The goal for non-English languages may be satisfied by
collecting at least 10,000 hours of speech in each of at least
100 languages.
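To make the scale of these targets concrete, the following is a minimal sketch of the arithmetic. The audio format (16 kHz, 16-bit, mono, uncompressed) is an assumption made only for illustration and is not specified by the project. The constant and function names are likewise hypothetical.

```python
# Rough scale of the two corpus targets. The audio format below is an
# assumption for illustration, not a figure from the project description.
SAMPLE_RATE_HZ = 16_000      # assumed: 16 kHz sampling
BYTES_PER_SAMPLE = 2         # assumed: 16-bit linear PCM, mono, uncompressed

ENGLISH_HOURS = 1_000_000    # English target stated in the project goal
LANGUAGES = 100              # minimum number of non-English languages
HOURS_PER_LANGUAGE = 10_000  # minimum hours per non-English language


def storage_terabytes(hours: int) -> float:
    """Uncompressed size in decimal terabytes (1 TB = 10**12 bytes)."""
    bytes_total = hours * 3600 * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE
    return bytes_total / 10**12


# 100 languages at 10,000 hours each meets the non-English target.
non_english_hours = LANGUAGES * HOURS_PER_LANGUAGE
print(non_english_hours)                  # 1000000
print(storage_terabytes(ENGLISH_HOURS))   # 115.2
```

Under these assumptions, each million-hour corpus comes to roughly a hundred terabytes before any compression. A different sampling rate or encoding would change the figure proportionally.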
Although the cost of storing and processing a
million hours of speech is quite moderate, speech recognition
and synthesis research has been limited to corpora of fewer than
10,000 hours in English and even fewer in other languages.
Substantially more data, together with the time and research
effort needed to use it properly, has invariably led to improved
performance. Given these facts, why has so little speech
data been collected?
There are three main problems, all of which are
addressed by this or other ICISLT projects.
1) The first problem is the cost of
transcription. It is orders of magnitude less expensive to
simply collect speech than it is to transcribe it. The
present research corpora consist of material that has been
expensively transcribed or for which transcriptions or near
transcriptions (such as closed captioning on television
broadcasts) are available from other sources. This project
will undertake only the recording and collection of the
million-hour corpora. Other ICISLT research projects will develop
algorithms for training on speech for which transcripts are not
available.
2) The second problem is that continually
collecting more data makes it difficult to do before-and-after
comparisons of performance. That is, if more data has been
collected and utilized, there will always be a performance
improvement, and it will be difficult to determine how much of
that improvement is due to the greater quantity of data and how
much is due to other causes. Because research and
algorithm changes are necessary to properly utilize a
significantly increased quantity of data, you can't simply run
the old algorithms on the new data and say that all of the rest
of the improvement is due to other causes.
On the other hand, if you are trying to get the
best possible performance then it would be silly to use any less
than all the data that is available. Certainly no
commercial company would deliberately put out an inferior
product just to make research comparisons easier.
With the million-hour corpus, this problem
solves itself. At least for English, once the million-hour
corpus has been collected, there will be no great need to
continually collect more data. Performance will depend
less on collecting yet more data than on learning to make
better use of the million hours that will already be available.
Once the data is available, all further performance improvements
will be due to algorithm and modeling changes.
3) The third problem is the fear that developing
techniques based on a large quantity of data will make it
difficult and time-consuming to develop systems for new
languages. Especially for security needs, a new language
may become of interest without notice. The ICISLT 100
languages project will solve this problem by working on at least
100 languages at once. Not only will most languages
already be covered, but with this amount of experience, adding a
new language will be routine.