StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis

Yinghao Aaron Li, Cong Han, Nima Mesgarani

Research output: peer-reviewed

Abstract

Text-to-Speech (TTS) has recently seen great progress in synthesizing high-quality speech owing to the rapid development of parallel TTS systems. Yet producing speech with naturalistic prosodic variations, speaking styles, and emotional tones remains challenging. In addition, many existing parallel TTS models struggle to identify optimal monotonic alignments because speech and duration generation typically occur independently. Here, we propose StyleTTS, a style-based generative model for parallel TTS that can synthesize diverse speech with natural prosody from a reference speech utterance. Using our novel Transferable Monotonic Aligner (TMA) and duration-invariant data augmentation, StyleTTS significantly outperforms other baseline models on both single- and multi-speaker datasets in subjective tests of speech naturalness and speaker similarity of the synthesized speech. It also demonstrates higher robustness and emotional similarity to the reference speech, as indicated by word error rate (WER) and acoustic feature correlations. Through self-supervised learning, StyleTTS can generate speech with the same emotional and prosodic tone as the reference speech without needing explicit labels for these categories. In addition, when trained with a large number of speakers, our model can perform zero-shot speaker adaptation. The source code and audio samples can be found on our demo page https://styletts.github.io/.
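The abstract describes conditioning speech generation on a reference utterance so that the synthesized speech follows its prosodic and emotional tone. The sketch below illustrates one common way such style conditioning can be wired up, assuming a style encoder that pools a reference mel-spectrogram into a fixed-length style vector and an adaptive instance normalization (AdaIN) layer that modulates decoder features. All module names, dimensions, and layer choices are illustrative assumptions, not the paper's implementation.

# Minimal sketch of style-vector conditioning for a parallel TTS decoder.
# Assumptions (not taken from the paper's code): a style encoder pools a
# reference mel-spectrogram into a fixed-length style vector, which then
# modulates decoder features via adaptive instance normalization (AdaIN).
import torch
import torch.nn as nn


class StyleEncoder(nn.Module):
    """Pools a reference mel-spectrogram into a single utterance-level style vector."""
    def __init__(self, n_mels: int = 80, style_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, style_dim, kernel_size=5, padding=2),
            nn.AdaptiveAvgPool1d(1),  # average over time -> one vector per utterance
        )

    def forward(self, ref_mel: torch.Tensor) -> torch.Tensor:
        # ref_mel: (batch, n_mels, frames) -> (batch, style_dim)
        return self.net(ref_mel).squeeze(-1)


class AdaIN1d(nn.Module):
    """Adaptive instance norm: the style vector predicts per-channel scale and shift."""
    def __init__(self, channels: int, style_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        self.affine = nn.Linear(style_dim, 2 * channels)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames); style: (batch, style_dim)
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        return (1 + gamma.unsqueeze(-1)) * self.norm(x) + beta.unsqueeze(-1)


if __name__ == "__main__":
    ref_mel = torch.randn(2, 80, 120)      # mel-spectrogram of a reference utterance
    text_feats = torch.randn(2, 256, 200)  # aligned text/phoneme features (illustrative)
    style = StyleEncoder()(ref_mel)        # (2, 128)
    out = AdaIN1d(256, 128)(text_feats, style)
    print(out.shape)                       # torch.Size([2, 256, 200])

In this toy setup, swapping the reference utterance changes only the style vector, so the same text features can be rendered with different prosodic styles, which is the behavior the abstract attributes to StyleTTS.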

Original language: English
Journal: IEEE Journal on Selected Topics in Signal Processing
DOI: 10.1109/JSTSP.2025.3530171
Publication status: Accepted/In press - 2025

ASJC Scopus Subject Areas

  • Signal Processing
  • Electrical and Electronic Engineering

Cite

Li, Y. A., Han, C., & Mesgarani, N. (Accepted/In press). StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis. IEEE Journal on Selected Topics in Signal Processing. https://doi.org/10.1109/JSTSP.2025.3530171