Massively Multilingual Automatic Speech Recognition



Probabilistic Transcription: Post-Doctoral Position Open in Singapore

Speech input permits people to find data (maps, search, contacts) by talking to their cell phones. Of the 6700 languages spoken in the world, speech input is available in only about 40. Why so few? The problem is data. Before it can be used, speech input software must learn a language by studying hundreds of hours of transcribed audio. In most languages, finding somebody who can transcribe hundreds of hours of audio (somebody who is computer literate, yet has time available to perform this task) is nearly impossible. Faced with this problem, we proposed a radical solution: solicit transcriptions from people who don't speak the language. Non-native listeners make many mistakes. By building a probabilistic model of their mistakes, we are able to infer correct transcriptions, and thus to train speech technology in any language. We are seeking a post-doctoral researcher who can scale these algorithms to commercial relevance. Necessary qualifications include a Ph.D. in speech technology, natural language processing, computational linguistics and phonetics, information theory, or machine learning. Objectives of the research include the derivation, implementation, testing, and publication of new algorithms that train state-of-the-art speech input technologies from probabilistic transcription in the under-resourced languages of southeast Asia.

This is a 20-month post-doctoral research position at the Advanced Digital Sciences Center (ADSC) in Singapore. The post-doc will work most closely with Dr. Nancy Chen, Institute for Infocomm Research, Singapore, and with Dr. Preethi Jyothi and Prof. Mark Hasegawa-Johnson, University of Illinois. For inquiries contact probtranspostdoc@gmail.com.



Project Description

Widely accepted estimates list 6700 languages spoken in the world today (www.ethnologue.com). Of these, commercial automatic speech recognition (ASR) is available in approximately 40 (e.g., Baidu, Google). Almost all academic publications describing ASR in a language outside the "top 10" are focused on the same core research problem: the lack of transcribed speech training data. In (Jyothi & Hasegawa-Johnson, AAAI 2015 and Interspeech 2015) we proposed a method called mismatched crowdsourcing that acquires transcriptions without native-language transcribers. Instead of recruiting transcribers who speak the language, we recruit transcribers who don't speak it, and we ask them to transcribe as if listening to nonsense speech. Mistakes caused by non-native speech perception are encoded in a noisy-channel model: a finite state transducer (FST) with learned or estimated error probabilities. We demonstrated that it's possible to acquire transcribed speech data this way; by the end of August 2015, a JSALT team led by PI Hasegawa-Johnson will use mismatched crowdsourcing to train ASR.

We propose to continue this research in Singapore for five reasons: (1) under-resourced language data is more plentiful in Singapore than in Urbana; (2) leveraging Singapore's unique global position, I2R is now the world leader in speech technology for under-resourced languages; (3) Dr. Hasegawa-Johnson and I2R's Dr. Haizhou Li are both officers of the International Speech Communication Association, and this proposal stems from their discussion at the 2014 officers' meeting in Singapore; (4) as a result of that meeting, Dr. Chen and Dr. Hasegawa-Johnson successfully proposed an I2R project, 3/2015-3/2016, on the subject of mismatched crowdsourcing; and (5) commercialization of this research is more likely to be profitable in Singapore than anywhere else in the world, because Singapore is uniquely positioned to sell spoken language user interfaces in Singapore, Vietnam, Malaysia, the Philippines, Thailand, Hong Kong, and the many other countries and cities of southeast Asia that are currently unserved by speech technology because their languages have no ASR.

The Co-PIs, Dr. Hasegawa-Johnson at UIUC and Dr. Chen at I2R, had the same doctoral adviser at MIT until his retirement in 2007. All other proposed I2R and ADSC personnel are UIUC students or alumni. Staff supported by the I2R grant include Dr. Chen and Research Scientist Dr. Boon Pang Lim; Dr. Lim was an AUIP student hooded by Dr. Hasegawa-Johnson in 2011. Dr. Hasegawa-Johnson's second AUIP student, Wenda Chen, is currently working with Dr. Chen, and will arrive in Urbana for the first time in August 2015.

Introduction

For most of the world's 6700 languages, creating ASR using current methods is impossible at any price, because it is not possible to acquire transcribed speech data. Pavlick et al. (Proc. LREC 2013) hired transcribers on Amazon Mechanical Turk and asked them for a list of the languages they know how to speak; their results list 40 languages represented by at least 20 workers each. The other 6660 languages of the world are inaccessible to speech technology, in the sense that current methods will never develop ASR in those languages. In (Jyothi & Hasegawa-Johnson, AAAI 2015 and Interspeech 2015) we proposed a methodology called mismatched crowdsourcing that bypasses the need for native-language transcription. In our methodology, speech is transcribed by workers who don't know the language. Workers treat the speech as nonsense, and transcribe it as nonsense syllables. Since the transcribers do not speak the language they are transcribing, they will necessarily make mistakes: some phonemes of the target language are indistinguishable to the transcriber, and will therefore be mapped to a single phoneme of the transcriber's language. If we assume that the phoneme inventories of both the spoken language and the transcriber's language are known in advance (www.phoible.org), then it is possible to specify explicit mathematical models of second-language phonetic perception, and to use these models to recover an equivalent transcription in the language of the speaker.

Preliminary experiments in mismatched crowdsourcing were carried out in our AAAI paper using Hindi speech excerpts extracted from SBS radio podcasts (http://www.sbs.com.au/podcasts/yourlanguage/hindi/, Australian Special Broadcasting Service). Approximately one hour of speech was extracted from the podcasts (about 10000 word tokens in total) and phonetically transcribed by a Hindi speaker. The data were then segmented into very short speech clips (1 to 2 seconds long). The crowd workers were asked to listen to these short clips and provide English text, in the form of nonsense syllables that most closely matched what they heard. The English text was aligned with the Hindi phone transcripts using a finite state transducer (FST) whose input alphabet was the set of Hindi phonemes, and whose output alphabet was the set of English orthographic symbols. The FST substitution costs, deletion costs and insertion costs were learned using expectation maximization.
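
To make the training step concrete, the sketch below shows the kind of channel estimation described above. It is a simplified hard-EM (Viterbi) variant, not the full FST expectation-maximization used in the published experiments: it alternates between aligning toy (Hindi phoneme, English letter) pairs by weighted edit distance and re-estimating operation probabilities from the resulting counts. The data and the global normalization are illustrative simplifications.

```python
# Simplified (hard-EM / Viterbi) sketch of the channel estimation described above:
# pairs of Hindi phoneme strings and English nonsense-letter strings are repeatedly
# (1) aligned by weighted edit distance under the current model and (2) used to
# re-estimate substitution / insertion / deletion probabilities. The published
# system trains a full FST by expectation maximization; this toy version uses hard
# alignments, normalizes counts globally rather than per phoneme, and runs on
# invented placeholder data, not the project's actual corpus.

import math
from collections import defaultdict

pairs = [                      # (Hindi phonemes, English letters) -- toy examples
    (("n", "a", "m"), ("n", "u", "m")),
    (("d", "i", "n"), ("d", "e", "e", "n")),
]

def align(src, tgt, logp):
    """Best alignment (list of edit operations) under current log-probabilities."""
    n, m = len(src), len(tgt)
    D = [[(-math.inf, None)] * (m + 1) for _ in range(n + 1)]
    D[0][0] = (0.0, None)
    for i in range(n + 1):
        for j in range(m + 1):
            best = D[i][j][0]
            if best == -math.inf:
                continue
            if i < n and j < m:      # substitution (or identity)
                op = ("sub", src[i], tgt[j])
                if best + logp[op] > D[i + 1][j + 1][0]:
                    D[i + 1][j + 1] = (best + logp[op], (i, j, op))
            if i < n:                # deletion of a source phoneme
                op = ("del", src[i], "")
                if best + logp[op] > D[i + 1][j][0]:
                    D[i + 1][j] = (best + logp[op], (i, j, op))
            if j < m:                # insertion of a target letter
                op = ("ins", "", tgt[j])
                if best + logp[op] > D[i][j + 1][0]:
                    D[i][j + 1] = (best + logp[op], (i, j, op))
    ops, i, j = [], n, m             # trace back the best path
    while D[i][j][1] is not None:
        i, j, op = D[i][j][1]
        ops.append(op)
    return ops[::-1]

logp = defaultdict(lambda: math.log(0.01))        # flat initial model
for _ in range(5):                                # a few hard-EM iterations
    counts = defaultdict(float)
    for src, tgt in pairs:
        for op in align(src, tgt, logp):
            counts[op] += 1.0
    total = sum(counts.values())
    logp = defaultdict(lambda: math.log(1e-4),
                       {op: math.log(c / total) for op, c in counts.items()})

print(sorted(logp.items(), key=lambda kv: -kv[1])[:5])   # most probable mismatches
```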

Having trained the mismatch FST, we now have a complete and invertible model of the process by which Hindi words are transcribed into English orthography. By composing the mismatch FST with a Hindi dictionary, and searching for the best-matching word, it is possible to recover the best Hindi transcription matching any given English-language nonsense transcription. We find that the correct Hindi word string is usually not the one with maximum posterior probability, but is almost always (96%) within the 8-best list. By combining these methods with existing semi-supervised training methods, we expect it will be possible to train ASR in any language, even with no native-language transcribers.
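
The decoding idea can be illustrated in the same simplified setting. The sketch below is a stand-in for true FST composition: it brute-force scores each entry of an invented mini-lexicon against the observed English nonsense letters under a toy channel model plus a unigram prior, and returns an n-best list. A real system composes the channel with a lexicon and language model rather than enumerating words.

```python
# Illustrative stand-in for "compose the mismatch FST with a Hindi dictionary and
# take the 8-best list": brute-force score every pronunciation in an invented
# mini-lexicon against the observed English nonsense letters, using a toy channel
# model plus a unigram word prior, then rank. A real system does this with FST
# composition (channel o lexicon o language model) rather than enumeration.

import heapq
import math
from collections import defaultdict

logp = defaultdict(lambda: math.log(0.02),        # toy channel model (cf. sketch above)
                   {("sub", "n", "n"): math.log(0.6),
                    ("sub", "m", "m"): math.log(0.6),
                    ("sub", "a", "u"): math.log(0.3)})

def channel_logprob(phones, letters, logp):
    """Log-probability that the phoneme string produced the letter string,
    computed with the same weighted-edit-distance recursion used for training."""
    n, m = len(phones), len(letters)
    D = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if D[i][j] == -math.inf:
                continue
            if i < n and j < m:
                D[i+1][j+1] = max(D[i+1][j+1], D[i][j] + logp[("sub", phones[i], letters[j])])
            if i < n:
                D[i+1][j] = max(D[i+1][j], D[i][j] + logp[("del", phones[i], "")])
            if j < m:
                D[i][j+1] = max(D[i][j+1], D[i][j] + logp[("ins", "", letters[j])])
    return D[n][m]

def nbest(letters, lexicon, unigram, n=8):
    """Return the n dictionary words most likely to have produced the letters."""
    scored = [(channel_logprob(ph, letters, logp) + unigram.get(w, math.log(1e-6)), w)
              for w, ph in lexicon.items()]
    return heapq.nlargest(n, scored)

# toy lexicon and unigram prior: placeholders, not a real Hindi dictionary
lexicon = {"naam": ("n", "a", "m"), "din": ("d", "i", "n"), "nam": ("n", "a", "m")}
unigram = {"naam": math.log(0.4), "din": math.log(0.4), "nam": math.log(0.2)}
print(nbest(("n", "u", "m"), lexicon, unigram))
```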

State of the Art

Speech-to-text transcription is performed by finding the word sequence W that maximizes the joint probability p(W,Q,A), where A is the recorded acoustic input and Q is a hidden sequence of phonetic segments. The joint probability p(W,Q,A) is computed in real time, on the fly, by composing three models: a language model p(W), a pronunciation model p(Q|W), and an acoustic model p(A|Q). In order to achieve high-accuracy ASR, these three component statistical models must first be trained using large databases of representative speech and text. The speech technology development cycle is the process of preparing corpora, training the models, and applying the models in a speech-to-text software engine.
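
Written out as an equation (a standard formulation of this decoding objective, not specific to this proposal):

```latex
\hat{W} \;=\; \arg\max_{W} \max_{Q} \; p(W)\, p(Q \mid W)\, p(A \mid Q)
```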

The three models are each expensive to create. The language model, p(W), is trained using large databases of text data, e.g., language models in English and Mandarin are typically trained using one billion words, and are typically represented by interpolating between neural network and finite state transducer models. The pronunciation model, p(Q|W), is a database of word pronunciations manually constructed by linguists at moderately high cost, e.g., the GlobalPhone project provides pronunciation models in 22 languages for 650 Euro per language. The acoustic model, p(A|Q), is trained using large databases of transcribed speech. The training process for well-resourced languages usually begins with 1000 hours of speech audio recordings, chopped into waveform files of about thirty seconds, each of which has an associated text file specifying its transcription using the standard orthographic conventions of that language. Orthographic transcriptions are converted to phonetic segments using the pronunciation model, then aligned to the speech audio using a preliminary ASR. The aligned phonetic transcriptions are then used to train deep neural networks (DNNs), which are then linked to the preliminary ASR in order to create a commercially viable speech recognizer. All of these methods are useless for under-resourced languages, because of the high data cost.
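
As a small illustration of one step in this pipeline, the sketch below expands an orthographic transcription into a phone sequence with a pronunciation lexicon, so it can be force-aligned to the audio. The lexicon entries and the out-of-vocabulary convention are invented placeholders.

```python
# Minimal sketch of one pipeline step described above: expanding an orthographic
# transcription into phones with a pronunciation lexicon. Entries are placeholders.

lexicon = {                      # word -> list of possible pronunciations
    "hello": [("HH", "AH", "L", "OW"), ("HH", "EH", "L", "OW")],
    "world": [("W", "ER", "L", "D")],
}

def expand(transcription, lexicon, oov=("SPN",)):
    """Map each word to its first listed pronunciation; unknown words get a
    generic 'spoken noise' placeholder (a common convention, not a requirement)."""
    phones = []
    for word in transcription.lower().split():
        phones.extend(lexicon.get(word, [oov])[0])
    return phones

print(expand("Hello world", lexicon))   # ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```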

The state of the art in speech technology for under-resourced languages is exemplified by Dr. Chen's work at I2R. Dr. Chen leads the I2R team competing in the IARPA-funded OpenKWS competition (open keyword search). OpenKWS competitors develop technology that locates spoken keywords in a large untranscribed speech corpus. IARPA funds OpenKWS because they believe that full speech-to-text in an under-resourced language is currently impossible; OpenKWS seeks to develop the relevant foundational technologies. On April 28, 2015, at 2pm EDT, the US National Institute of Standards and Technology (NIST) provided competitors with the identity of this year's surprise language (Swahili), and with the passwords necessary to unlock 3 hours of transcribed speech and 80 hours of untranscribed speech. On May 12, 2015, competitors were given evaluation data and a list of keywords to find; on May 19, competitors uploaded their detected "hits" (proposed locations of each keyword in the evaluation dataset) to NIST. Further resources were released May 20, and sites uploaded revised results on June 10.
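
For readers unfamiliar with keyword-search output, the toy sketch below scans time-stamped 1-best word hypotheses for keyword occurrences and emits candidate "hits". Real OpenKWS systems search full lattices and report hits in NIST's exchange format; the hypothesis tuples shown here are invented examples.

```python
# Toy illustration of keyword search: scan time-stamped 1-best word hypotheses
# for each keyword and emit candidate "hits". Real OpenKWS systems search lattices
# and use NIST's exchange format; these (file, word, start, duration, confidence)
# tuples are invented examples.

hypotheses = [
    ("file1", "habari", 0.42, 0.31, 0.91),
    ("file1", "ya", 0.73, 0.10, 0.88),
    ("file1", "asubuhi", 0.83, 0.55, 0.76),
]

def find_hits(keywords, hypotheses, threshold=0.5):
    hits = []
    for fname, word, start, dur, conf in hypotheses:
        if word in keywords and conf >= threshold:
            hits.append({"keyword": word, "file": fname,
                         "tbeg": start, "dur": dur, "score": conf})
    return hits

print(find_hits({"habari", "asubuhi"}, hypotheses))
```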

Research Approach and Methodology

Proposed research is organized around five tasks:

Task 1: Matched and mismatched transcriptions of six languages.
Task 2: DNN-HMM ASR in five languages (train and test).
Task 3: Distinctive-feature mismatch transducer (derive, train and test).
Task 4: MCE-regularized DNN transfer learning (train and test).
Task 5: Bayesian RBM-based transfer learning (derive, train and test).

Task 1: Matched and mismatched transcriptions of six languages. Untranscribed speech audio in six under-resourced languages will be transcribed. The ADSC post-doc will collaborate with Dr. Preethi Jyothi (currently a Beckman Fellow at UIUC, through May 2016) in order to acquire mismatched-crowdsourcing transcriptions. Matched transcriptions will be sought by posting a request, on Mechanical Turk or on other sites, for crowd workers who speak the desired language. In order to even seek matched transcriptions, we will need a task description written in the target language, and a set of previously transcribed speech samples that we can intersperse with the unknown speech in order to test the reliability of each crowd worker. These task descriptions and quality-control questions will be designed by a trusted native language informant, recruited either locally in Singapore or on-line, and paid at a higher hourly rate (for far fewer hours of work) than crowd workers. Languages to be transcribed will be chosen by Dr. Chen and Dr. Hasegawa-Johnson, in consultation with other I2R staff.
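
The quality-control mechanism can be sketched as follows: each worker's submissions on the interspersed, previously transcribed "gold" clips are compared to the gold transcriptions, and the worker's reliability is estimated from the agreement. The agreement metric (difflib similarity) and the acceptance threshold are illustrative choices, not the project's specification.

```python
# Sketch of the quality-control idea described above: gold clips are interspersed
# with the unknown clips, and each crowd worker's reliability is estimated from
# agreement on the gold clips. Metric and threshold are illustrative choices.

import difflib

def agreement(hyp, ref):
    """Similarity in [0, 1] between a worker's transcription and the gold one."""
    return difflib.SequenceMatcher(None, hyp.lower(), ref.lower()).ratio()

def worker_reliability(submissions, gold, min_score=0.6):
    """submissions: {(worker, clip_id): transcription}; gold: {clip_id: reference}.
    Returns each worker's mean gold-clip agreement and whether to keep them."""
    scores = {}
    for (worker, clip), hyp in submissions.items():
        if clip in gold:
            scores.setdefault(worker, []).append(agreement(hyp, gold[clip]))
    report = {}
    for worker, s in scores.items():
        mean = sum(s) / len(s)
        report[worker] = (round(mean, 2), mean >= min_score)
    return report

gold = {"clip07": "naa mas te"}
submissions = {("w1", "clip07"): "nama stay", ("w2", "clip07"): "booboo"}
print(worker_reliability(submissions, gold))
```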

Task 2: Baseline ASR in five languages (train and test). As data become available in each language, the ADSC post-doc and the I2R research staff member will work jointly with interns to train and test baseline ASR systems. We have good reason to believe that this will succeed: I2R Research Scientist Boon Pang Lim is already collaborating with Dr. Hasegawa-Johnson's students in the training and testing of ASR systems in English (exchanging a few e-mails per week on the progress of experiments at each end), and has stated his interest in collaborating on other languages. Baseline ASR will be trained and tested using the same open-source toolkit that is used by the OpenKWS team at I2R (at the time of this writing, the Kaldi toolkit), and configuration files will be designed to ensure compatibility.
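
To indicate what compatible configurations mean in practice, here is a hedged sketch of preparing speech data in the standard Kaldi data-directory layout (wav.scp, text, utt2spk) so that both sites can share recipes. All paths, utterance IDs, and transcriptions are hypothetical placeholders.

```python
# Hedged sketch of preparing speech data in the Kaldi data-directory layout
# (wav.scp, text, utt2spk). Paths, utterance IDs and transcriptions are invented.

import os

utterances = [   # (utterance id, speaker id, wav path, transcription)
    ("spk01_utt001", "spk01", "/data/lang1/audio/utt001.wav", "toy transcription one"),
    ("spk02_utt001", "spk02", "/data/lang1/audio/utt002.wav", "toy transcription two"),
]

def write_kaldi_data_dir(utterances, data_dir):
    os.makedirs(data_dir, exist_ok=True)
    with open(os.path.join(data_dir, "wav.scp"), "w") as wav_scp, \
         open(os.path.join(data_dir, "text"), "w") as text, \
         open(os.path.join(data_dir, "utt2spk"), "w") as utt2spk:
        for utt, spk, wav, trans in sorted(utterances):   # Kaldi expects sorted files
            wav_scp.write(f"{utt} {wav}\n")
            text.write(f"{utt} {trans}\n")
            utt2spk.write(f"{utt} {spk}\n")

write_kaldi_data_dir(utterances, "data/train_lang1")
# Kaldi's utils/utt2spk_to_spk2utt.pl and utils/fix_data_dir.sh can then derive
# the remaining files (e.g., spk2utt) before feature extraction and training.
```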

Task 3: Distinctive-feature mismatch transducer (derive, train and test). The mismatch transducer is a context-dependent probabilistic mapping between spoken phonemes (in the language of the speaker) and orthographic symbols (in the language of the transcriber). Our published experiments to date find optimum performance when the mismatch transducer is trained from data, meaning that it is necessary to acquire both matched and mismatched transcriptions for about thirty minutes of speech, and to learn, from those thirty minutes, the probability of every possible phoneme substitution error. Unfortunately, there may be some languages in which matched transcription is truly impossible to acquire. In preparation for that eventuality, we propose to train a language-independent mismatch transducer. Phonemes are not language-independent, but every phoneme in every language can be described by a vector of binary phonological distinctive features (www.phoible.org). Both Dr. Chen and Dr. Hasegawa-Johnson have used distinctive features in the design of automatic speech recognizers (ASR) and of computer-assisted language learning (CALL) systems, e.g., Dr. Chen is mentoring a PhD student at GaTech who is training distinctive-feature-based attribute detectors for his CALL system, and Dr. Hasegawa-Johnson led multi-university teams on this subject in 2004 (Hasegawa-Johnson et al., ICASSP 2005), 2006 (Livescu et al., ICASSP 2007) and 2009 (Yoon et al., Interspeech 2010). Dr. Hasegawa-Johnson, Dr. Chen, Dr. Lim and the ADSC post-doc will jointly derive algorithms that learn the mixture weights in a distinctive-feature-based model of language-independent phoneme substitution probabilities, and will run validation experiments.
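
One plausible parameterization of such a language-independent substitution model is sketched below. It is only an illustration of the idea (substitution probability decaying with a weighted distance between binary distinctive-feature vectors, with the weights to be learned), not the algorithm Task 3 proposes to derive, and the feature vectors are toy values rather than PHOIBLE data.

```python
# Illustrative parameterization of a language-independent substitution model (an
# assumption for illustration, not the algorithm the proposal will derive): the
# probability of hearing target-language phoneme x as transcriber-language phoneme
# y decays with a weighted distance between their binary distinctive-feature
# vectors. Feature vectors and weights below are toy values, not PHOIBLE data.

import math

FEATURES = ("voice", "nasal", "continuant", "labial", "coronal", "dorsal")

phones = {              # phoneme -> binary distinctive-feature vector (toy values)
    "p": (0, 0, 0, 1, 0, 0),
    "b": (1, 0, 0, 1, 0, 0),
    "m": (1, 1, 0, 1, 0, 0),
    "t": (0, 0, 0, 0, 1, 0),
}

def substitution_probs(x, candidates, weights):
    """P(y | x) over the transcriber's phoneme inventory, from feature distances."""
    def dist(a, b):
        return sum(w * abs(fa - fb) for w, fa, fb in zip(weights, phones[a], phones[b]))
    scores = {y: math.exp(-dist(x, y)) for y in candidates}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

weights = [1.0] * len(FEATURES)     # to be learned from mismatched transcriptions
print(substitution_probs("b", ["p", "b", "m", "t"], weights))
```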

Task 4: Regularized transfer learning (train and test). Other investigators have demonstrated that ASR error rate in an under-resourced language can be reduced if one starts with a deep neural network (DNN) trained on a related language, then adapts the DNN using the available transcribed data in the target language. Dr. Chen's results in the NIST OpenKWS15 competition show that Swahili ASR and KWS can be improved using multilingual acoustic features (bottleneck features) in both exemplar-based and ASR-based systems. Further improvements are possible (Das and Hasegawa-Johnson, Interspeech 2015) using semi-supervised minimum-conditional-entropy (MCE) training, i.e., by explicitly minimizing the expected error rate not only of transcribed but also of untranscribed speech in the target language. Dr. Chen, Dr. Hasegawa-Johnson, Dr. Lim and the ADSC post-doc will derive the form of the MCE training criterion that is most applicable when an error-free reference transcription is replaced by the probabilistic transcription (a distribution over possible transcriptions) characteristic of mismatched crowdsourcing, and will perform validation experiments.
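
Under simplifying assumptions, a criterion of the kind discussed here can be sketched as a cross-entropy term against the probabilistic-transcription soft targets plus a conditional-entropy penalty on untranscribed frames; the exact criterion to be derived in Task 4 may well differ. The arrays below are random placeholders standing in for DNN posteriors.

```python
# Sketch of a semi-supervised criterion of the kind discussed above, under
# simplifying assumptions: frames with probabilistic transcriptions contribute a
# cross-entropy term against their soft label distributions, and untranscribed
# frames contribute a conditional-entropy penalty on the model's own posteriors.
# The arrays are random placeholders standing in for DNN senone posteriors.

import numpy as np

def semi_supervised_loss(post_labeled, soft_targets, post_unlabeled, weight=0.3):
    """post_labeled: (N, C) model posteriors for frames with probabilistic labels.
    soft_targets: (N, C) label distributions from mismatched crowdsourcing.
    post_unlabeled: (M, C) model posteriors for untranscribed frames."""
    eps = 1e-12
    xent = -np.mean(np.sum(soft_targets * np.log(post_labeled + eps), axis=1))
    cond_ent = -np.mean(np.sum(post_unlabeled * np.log(post_unlabeled + eps), axis=1))
    return xent + weight * cond_ent

rng = np.random.default_rng(0)
def random_posteriors(n, c=5):
    x = rng.random((n, c))
    return x / x.sum(axis=1, keepdims=True)

print(semi_supervised_loss(random_posteriors(4), random_posteriors(4),
                           random_posteriors(6)))
```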

Task 5: Bayesian RBM-based transfer learning (derive, train and test). DNNs became popular in 2010 when Abdel-rahman Mohamed and Geoffrey Hinton (ICASSP, 2010) demonstrated that a Bayesian interpretation of the network weights called the "Restricted Boltzmann Machine" (RBM) can be used to rapidly initialize the neural net, using data without any labels (unsupervised pre-training). They argued that supervised training (using error back-propagation) is more effective if preceded by unsupervised pre-training. It has since been demonstrated (Hinton et al., IEEE Signal Processing Magazine 2012) that a large training corpus renders unsupervised pre-training unnecessary, but since the proposed research will use very small training corpora (only six hours of speech per language), we expect to benefit routinely from unsupervised pre-training. Task 5 of the proposed research will apply the RBM in a somewhat different way. Maximum A Posteriori (MAP) adaptation (Gauvain & Lee, IEEE Trans. SAP, 1994) uses a large mismatched training corpus (e.g., data in the wrong language) combined with a small matched training set (e.g., data in the right language) to jointly optimize ASR. MAP adaptation has never been used for DNN-based ASR, because the Bayesian interpretation of a DNN is not obvious. Dr. Hasegawa-Johnson, Dr. Chen, Dr. Lim and the ADSC post-doc will collaborate to derive MAP adaptation for DNNs, based on Mohamed & Hinton's Bayesian interpretation of the DNN. The I2R staff member and the ADSC post-doc will perform validation experiments by implementing the derived algorithms as part of ASR training. To our knowledge this idea has never previously been articulated.
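
As a rough illustration only (not the derivation proposed in Task 5), one common neural-network analogue of MAP adaptation treats the source-language weights as the mean of a Gaussian prior, so adaptation to the small target-language corpus minimizes task loss plus a penalty on deviation from those weights.

```python
# Crude stand-in for the MAP-adaptation idea (an illustration, not the Task 5
# derivation): penalize deviation of the adapted DNN weights from the
# source-language weights, i.e., a Gaussian prior centered on them. tau controls
# how strongly the prior is trusted. All values below are placeholders.

import numpy as np

def map_regularized_loss(task_loss, weights, source_weights, tau=10.0):
    """task_loss: scalar loss on target-language data.
    weights / source_weights: flattened parameter vectors of equal shape."""
    prior_penalty = 0.5 * tau * float(np.sum((weights - source_weights) ** 2))
    return task_loss + prior_penalty

w_src = np.zeros(4)                 # placeholder "source language" parameters
w = np.array([0.1, -0.2, 0.05, 0.0])
print(map_regularized_loss(1.25, w, w_src))
```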

Connections to Other Activities in Singapore

The Straits of Malacca are the best place in the world to study massively multilingual speech recognition, and I2R has already published some of the world's best results. The world's only commercially viable Vietnamese-language ASR was created at I2R, as was one of the world's best Mandarin ASRs. I2R is one of the competitive sites regularly participating in the NIST OpenKWS (open keyword search) competition, a competition designed by IARPA program manager Mary Harper to build the infrastructure necessary to address the problem of massively multilingual ASR. Singapore sits at the only geographical juncture connecting the East Asian, South Asian, Southeast Asian, and European language communities, and has chosen national languages from the Indo-European, Dravidian, Sino-Tibetan and Austronesian language families. Of the world's 6700 living languages, almost half are spoken within 5000 kilometers of Singapore, e.g., 700 languages are spoken in Indonesia, 850 in Papua New Guinea, 1600 in India, and hundreds more in southeast China.

Untranscribed speech data in 70 languages are available as podcasts from http://www.sbs.com.au/podcasts/yourlanguage; Dr. Hasegawa-Johnson has been downloading and archiving SBS podcasts since July 2014. Dr. Hasegawa-Johnson is head of the Probabilistic Transcription research workshop team at WS15 (the Jelinek Memorial Workshop). In preparation for that workshop, Beckman Post-Doctoral Fellow Preethi Jyothi has acquired one hour of mismatched crowdsourced transcriptions in each of 23 languages. During the workshop, we will seek to train and test preliminary baseline ASR using these data.

One hour of speech per language is enough to train a proof-of-concept ASR, but not much more (e.g., it's not enough to train triphone models). If this ADSC proposal is funded, it will include additional crowdsourcing of six hours of speech in each of six languages, thus scaling the databases up to a somewhat more useful size (e.g., we'll be able to train triphone models). For each language, crowdsourcing tasks will be posted in the target language (with questions in the target language) and in other languages (mismatched crowdsourcing). Both matched and mismatched crowdsourcing will be posted simultaneously, permitting us to measure the degree to which mismatched crowdsourcing is actually useful in each selected language. Progress of both sets of tasks will be monitored daily, so that, for example, if more native-language transcribers than expected are available in one of these languages, it will be possible to move some of the mismatched crowdsourcing tasks into the matched pool (and conversely if no native-language transcribers are found).
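
The daily monitoring rule can be sketched as a simple reallocation between the two task pools; the thresholds and batch size below are placeholders, not project parameters.

```python
# Small sketch of the monitoring rule described above: if native-language (matched)
# transcribers are completing tasks faster than expected, move some clips from the
# mismatched pool into the matched pool, and vice versa. Numbers are placeholders.

def reallocate(matched_done_today, expected_per_day, matched_pool, mismatched_pool, batch=50):
    """Return updated (matched_pool, mismatched_pool) task lists."""
    if matched_done_today > expected_per_day and mismatched_pool:
        moved, mismatched_pool = mismatched_pool[:batch], mismatched_pool[batch:]
        matched_pool = matched_pool + moved
    elif matched_done_today == 0 and matched_pool:
        moved, matched_pool = matched_pool[:batch], matched_pool[batch:]
        mismatched_pool = mismatched_pool + moved
    return matched_pool, mismatched_pool

matched, mismatched = reallocate(120, 80, [f"clip{i}" for i in range(100)],
                                 [f"clip{i}" for i in range(100, 300)])
print(len(matched), len(mismatched))
```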

UIUC Beckman Post-Doctoral Fellow Preethi Jyothi is the inventor of the mismatched crowdsourcing technique, and will continue to be involved in this research. UIUC graduate student Amit Das has written about cross-language transfer learning, and will continue to be involved in this research. Current I2R intern Wenda Chen will begin graduate study at UIUC in August 2015 as an A*STAR graduate fellow, and will continue to be involved in this research.