This is the accompanying website for the following paper:
@inproceedings{BauerZMD25_emphasiscontrol_SSW,
author = {Judith Bauer and Frank Zalkow and Meinard M{\"{u}}ller and Christian Dittmar},
title = {Explicit Emphasis Control in Text-to-Speech Synthesis},
booktitle = {Proceedings of the 13th {ISCA} Speech Synthesis Workshop},
address = {Leeuwarden, Netherlands},
year = {2025},
pages = {},
}
Recent text-to-speech (TTS) systems are able to generate synthetic speech with high naturalness. However, the synthesized speech usually lacks variation in emphasis. Since it is well-known that emphasizing different words can alter a sentence's meaning, it is desirable to extend TTS models to include the ability for emphasis control, i.e., the option to indicate during synthesis which words should carry special emphasis. In this work, we realize such functionality by automatically annotating TTS training datasets with emphasis scores and modifying the TTS model to use these scores during training. In particular, we propose a new architecture for emphasis detection and compare its suitability for TTS with existing emphasis detectors. We introduce an extension for the ForwardTacotron TTS model and train multiple versions of the model with scores from the different emphasis detectors. Finally, we compare the naturalness and the perceived emphasis of speech synthesized by the models.
The goal of emphasis detection is to estimate one emphasis score for each word of an utterance from input features such as acoustic features or text information. In this work, we use emphasis detectors to generate emphasis scores for TTS training datasets. With the predicted emphasis scores, we can train TTS models with explicit emphasis control. We experiment with different approaches for emphasis detection:
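To make the detector interface concrete, here is a minimal Python sketch. It is not one of the detectors evaluated in the paper: it only illustrates the expected input/output format, assuming that frame-level log-energy and F0 contours and word boundaries are already available. All names and the simple scoring heuristic (per-word feature averages, z-scored across the utterance and squashed to [0, 1]) are illustrative assumptions.

# Minimal sketch of the emphasis-detector interface: one score per word.
# Not one of the detectors compared in the paper; names and the scoring
# heuristic are illustrative assumptions.
import numpy as np

def emphasis_scores(log_energy, f0, word_bounds):
    """log_energy, f0: frame-level contours (1-D arrays of equal length).
    word_bounds: list of (start_frame, end_frame) tuples, one per word.
    Returns one emphasis score in [0, 1] per word."""
    feats = np.stack([log_energy, np.nan_to_num(f0)], axis=1)  # (frames, 2)
    # Average the frame-level features over each word.
    word_means = np.array([feats[s:e].mean(axis=0) for s, e in word_bounds])
    # Z-score across the words of the utterance, then squash to [0, 1].
    z = (word_means - word_means.mean(axis=0)) / (word_means.std(axis=0) + 1e-8)
    return 1.0 / (1.0 + np.exp(-z.mean(axis=1)))

# Example: three words, the second one louder and higher-pitched.
scores = emphasis_scores(
    log_energy=np.array([-5.0, -5.0, -2.0, -2.0, -5.0, -5.0]),
    f0=np.array([120.0, 120.0, 180.0, 180.0, 120.0, 120.0]),
    word_bounds=[(0, 2), (2, 4), (4, 6)],
)
print(scores)  # the middle word receives the highest score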
We use a TTS system consisting of an acoustic model for predicting mel spectrograms and a vocoder model for synthesizing speech.
The acoustic model is based on ForwardTacotron [3] with additional prosody predictor modules for pitch, energy, voicing confidence, and phoneme duration prediction [4]. We propose to augment the input to the prosody predictors with the emphasis information (a minimal sketch of this conditioning follows the vocoder description below). During training, we use the predictions from the emphasis detectors as emphasis information. During inference, we use values based on statistics derived from the training dataset. We train multiple variants of the acoustic model using the different emphasis detectors:
Additionally, we use the following models:
For synthesizing speech from the mel spectrograms, we use a pretrained StyleMelGAN [5] vocoder model.
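The following PyTorch sketch illustrates the conditioning idea described above: a per-token emphasis score is concatenated as an extra channel to the prosody predictor input. It is not the actual ForwardTacotron implementation; the module structure, dimensions, and names are assumptions made for illustration.

# Sketch of conditioning a prosody predictor on per-token emphasis scores.
# Not the actual ForwardTacotron code; module names, dimensions, and the
# simple Conv1d stack are illustrative assumptions.
import torch
import torch.nn as nn

class EmphasisConditionedPredictor(nn.Module):
    """Predicts one prosody value (e.g., pitch or energy) per input token
    from encoder outputs concatenated with an emphasis-score channel."""

    def __init__(self, enc_dim=256, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(enc_dim + 1, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=1),
        )

    def forward(self, encoder_out, emphasis):
        # encoder_out: (batch, tokens, enc_dim); emphasis: (batch, tokens)
        x = torch.cat([encoder_out, emphasis.unsqueeze(-1)], dim=-1)
        x = x.transpose(1, 2)              # (batch, channels, tokens)
        return self.net(x).squeeze(1)      # (batch, tokens)

# During training the emphasis tensor would hold detector scores for the
# training utterance; at inference time it is filled with values derived
# from training-set statistics (see the level-mapping sketch further below).
predictor = EmphasisConditionedPredictor()
enc = torch.randn(2, 10, 256)          # dummy encoder outputs
emph = torch.full((2, 10), 0.3)        # neutral emphasis everywhere
emph[:, 4] = 0.9                       # emphasize the fifth token
pitch = predictor(enc, emph)           # (2, 10)

Concatenating the score as an additional input channel is one simple way to realize such conditioning, since the predictors themselves stay unchanged apart from their input dimension.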
For the listening tests in this paper, we randomly selected 12 sentences from the Expresso dataset [7]. A few audio samples from the listening tests are provided on this website.
With our models, gradual emphasis control is possible. For the listening tests, we defined four levels of increasing emphasis, from "level 0" (lowest emphasis) to "level 3" (highest emphasis), as described in the paper.
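As an illustration of how such levels could be tied to training-set statistics, the sketch below maps levels 0 to 3 to quantiles of the detector-score distribution observed on the training data. The specific quantiles are an assumption made for this sketch; the actual mapping is defined in the paper.

# Illustrative mapping from the four listening-test emphasis levels to
# inference-time emphasis values. The concrete quantiles are an assumption
# for this sketch, not the mapping used in the paper.
import numpy as np

def level_to_value(level, training_scores, quantiles=(0.25, 0.5, 0.75, 0.95)):
    """Map an emphasis level (0..3) to a score drawn from the distribution
    of detector scores observed on the TTS training data."""
    return float(np.quantile(training_scores, quantiles[level]))

# Dummy training-score distribution; in practice these would be the detector
# scores collected over all words of the training corpus.
training_scores = np.random.beta(2, 5, size=100_000)
values = [level_to_value(lvl, training_scores) for lvl in range(4)]
print(values)  # monotonically increasing emphasis values for levels 0..3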
The following audio samples show how emphasis is realized by different models and with different voices.
"I can definitely help *you* with that." (speaker: "female1", ephasis level 2)
"It's mostly clear and minus *one* degrees." (speaker: "female2", ephasis level 2)
"He was forced *to* quit." (speaker: "male1", ephasis level 2)
"I'm *trying* to help her!" (speaker: "male2", ephasis level 2)
The following audio samples show how emphasis changes depending on the level ("level 0" to "level 3" represent increasing emphasis levels).
"We should see *some* rainfall later today." (speaker: "female1")
"We should see *some* rainfall later today." (speaker: "female1")
"We should see *some* rainfall later today." (speaker: "female1")
"We should see *some* rainfall later today." (speaker: "female1")
This research was partially supported by the Free State of Bavaria in the DSAI project and by the Fraunhofer-Zukunftsstiftung. The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer Institute for Integrated Circuits IIS.
The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU). The hardware is funded by the German Research Foundation (DFG).
We thank all participants of our listening tests.
@article{SeysselAWD23_Emphassess_ArXiv,
title = {{EmphAssess}: {A} Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models},
author = {Maureen de Seyssel and Antony D'Avirro and Adina Williams and Emmanuel Dupoux},
year = {2023},
journal = {arXiv preprint arXiv:2312.14069}
}
@inproceedings{MorrisonPPCP24_AutomaticProminence_ICASSP,
address = {Seoul, Republic of Korea},
title = {Crowdsourced and Automatic Speech Prominence Estimation},
author = {Max Morrison and Pranav Pawar and Nathan Pruyne and Jennifer Cole and Bryan Pardo},
booktitle = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing ({ICASSP})},
year = {2024},
pages = {12281--12285}
}
@misc{Schaefer20_ForwardTacotron_Github,
author = {Christian Schäfer and Ollie McCarthy and contributors},
howpublished = {\url{https://github.com/as-ideas/ForwardTacotron}},
journal = {GitHub repository},
publisher = {GitHub},
title = {{ForwardTacotron}},
year = {2020}
}
@inproceedings{ZalkowEtAl23_AudioLabs_Blizzard,
address = {Grenoble, France},
author = {Frank Zalkow and Paolo Sani and Michael Fast and Judith Bauer and Mohammad Joshaghani and Kishor Kayyar and Emanu{\"e}l A. P. Habets and Christian Dittmar},
booktitle = {Proceedings of the Blizzard Challenge Workshop},
pages = {63--68},
title = {The {AudioLabs} System for the {B}lizzard {C}hallenge 2023},
year = {2023}
}
@inproceedings{MustafaPF21_StyleMelGAN_ICASSP,
address = {Toronto, Canada},
author = {Ahmed Mustafa and Nicola Pia and Guillaume Fuchs},
booktitle = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing ({ICASSP})},
pages = {6034--6038},
title = {{StyleMelGAN}: {A}n Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization},
year = {2021}
}
@inproceedings{JolyEtAl23_ZeroDataEmph_SSW,
address = {Grenoble, France},
author = {Arnaud Joly and Marco Nicolis and Ekaterina Peterova and Alessandro Lombardi and Ammar Abbas and Arent van Korlaar and Aman Hussain and Parul Sharma and Alexis Moinet and Mateusz Lajszczak and Penny Karanasou and Antonio Bonafonte and Thomas Drugman and Elena Sokolova},
title = {Controllable Emphasis with zero data for text-to-speech},
year = {2023},
booktitle = {Proceedings of the {ISCA} Workshop on Speech Synthesis ({SSW})},
pages = {113--119},
doi = {10.21437/SSW.2023-18}
}
@inproceedings{NguyenEtAl23_Expresso_Interspeech,
address = {Dublin, Ireland},
author = {Tu Anh Nguyen and Wei-Ning Hsu and Antony D'Avirro and Bowen Shi and Itai Gat and Maryam Fazel-Zarani and Tal Remez and Jade Copet and Gabriel Synnaeve and Michael Hassid and Felix Kreuk and Yossi Adi and Emmanuel Dupoux},
title = {{EXPRESSO}: {A} Benchmark and Analysis of Discrete Expressive Speech Resynthesis},
year = {2023},
booktitle = {Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech)},
pages = {4823--4827},
doi = {10.21437/Interspeech.2023-1905},
issn = {2958-1796}
}