Project Details
Description
Recent advances in speech technology have resulted in the wide use of Spoken Dialogue Systems (SDS) such as Siri (iPhone) and Google Assistant (Android). These systems support major improvements in information access by voice for High Resource Languages (HRLs) such as English, French, Mandarin, Japanese, and Spanish. For these languages, researchers have built dictionaries, parsers, part-of-speech taggers, language models, search engines, and machine translation engines to support speech technologies. However, there are roughly 6,500 world languages, including Tagalog, Tamil, Swahili, Vietnamese, and Pashto, many of which are spoken by millions of people but lack the computational resources necessary to build SDS. These are termed Low Resource Languages (LRLs). Speakers of LRLs do not benefit from the same communication and search capabilities that speakers of HRLs enjoy. In particular, there is little research and there are few resources supporting the development of Text-to-Speech Synthesis (TTS) systems to produce Siri-like speech for SDS in these languages. Furthermore, both commercial and research TTS systems require large amounts of carefully recorded, single-speaker speech data, creating another major (and expensive) barrier to TTS development for LRLs. This work will create TTS systems in LRLs and, in the process, create and make available tools for others to build their own systems using 'found' data: data recorded for other purposes or available on the web.
New paradigms for TTS synthesis (parametric synthesis and the use of Deep Neural Networks) are now being developed that make it theoretically possible to build systems quickly and cheaply without recording large, special-purpose speech corpora, instead using data recorded for other purposes, such as training speech recognizers. This work will investigate the use of these techniques to produce TTS systems for LRLs. Two major problems will be explored: 1) What are the best techniques for filtering found data (removing data that is too loud, too noisy, or disfluent, for example) to obtain intelligible and natural-sounding results? 2) Can basic prosodic features of LRLs, such as phrasing and emphasis, be identified using crowdsourcing and tools developed for HRLs? Pilot studies on English have shown that more natural and intelligible voices can be created by using subsets of the data selected on features such as pitch variation and level of articulation; a sketch of this style of feature-based filtering appears below. These methods will be tested on LRLs such as Turkish, Amharic, and Telugu. Evaluations will measure intelligibility and naturalness, both automatically and via crowdsourcing with native speakers of each language. The ultimate goal of this exploratory work is to test these techniques on a broad variety of LRLs for which data have already been collected for the purpose of developing speech recognizers.
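To make the feature-based filtering idea concrete, here is a minimal Python sketch of selecting a TTS training subset from found audio by per-utterance acoustic features: mean RMS energy as a loudness proxy and the standard deviation of log-F0 as a pitch-variation proxy. This is not the project's actual pipeline; the thresholds, the `found_data/` directory, and the specific feature choices are illustrative assumptions, while the `librosa` calls themselves are standard.

```python
"""Sketch: filter 'found' speech for TTS training by acoustic features.

Assumes utterances are stored as individual WAV files; all thresholds
below are illustrative, not the project's actual selection criteria.
"""
import glob

import numpy as np
import librosa


def utterance_features(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)
    # Frame-level RMS energy: a crude proxy for overall loudness.
    rms = librosa.feature.rms(y=y)[0]
    # F0 contour via pYIN; unvoiced frames come back as NaN.
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    f0 = f0[~np.isnan(f0)]
    return {
        "path": path,
        "mean_rms": float(np.mean(rms)),
        # Std of log-F0 as a simple measure of pitch variation.
        "pitch_var": float(np.std(np.log(f0))) if f0.size else 0.0,
    }


def select_subset(paths, max_rms=0.1, min_pitch_var=0.05, max_pitch_var=0.4):
    """Keep utterances that are neither too loud nor prosodically extreme."""
    kept = []
    for p in paths:
        feats = utterance_features(p)
        if (feats["mean_rms"] <= max_rms
                and min_pitch_var <= feats["pitch_var"] <= max_pitch_var):
            kept.append(feats["path"])
    return kept


if __name__ == "__main__":
    wavs = glob.glob("found_data/*.wav")  # hypothetical data directory
    print(f"kept {len(select_subset(wavs))} of {len(wavs)} utterances")
```

In practice one would tune such thresholds against listening tests (e.g., the crowdsourced intelligibility and naturalness evaluations described above) rather than fixing them a priori.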
| Status | Finished |
| --- | --- |
| Effective start/end date | 9/1/17 → 8/31/21 |
Funding
- National Science Foundation: US$500,000.00
ASJC Scopus Subject Areas
- Information Systems
- Computer Science (all)