Explicit Emphasis Control in Text-to-Speech Synthesis

This is the accompanying website for the following paper:

  1. Judith Bauer, Frank Zalkow, Meinard Müller, and Christian Dittmar
    Explicit Emphasis Control in Text-to-Speech Synthesis
    In Proceedings of the 13th ISCA Speech Synthesis Workshop, 2025.
    @inproceedings{BauerZMD25_emphasiscontrol_SSW,
    author = {Judith Bauer and Frank Zalkow and Meinard M{\"{u}}ller and Christian Dittmar},
    title = {Explicit Emphasis Control in Text-to-Speech Synthesis},
    booktitle = {Proceedings of the 13th {ISCA} Speech Synthesis Workshop},
    address = {Leeuwarden, Netherlands},
    year = {2025},
    pages = {},
    }

Abstract

Recent text-to-speech (TTS) systems are able to generate synthetic speech with high naturalness. However, the synthesized speech usually lacks variation in emphasis. Since it is well-known that emphasizing different words can alter a sentence's meaning, it is desirable to extend TTS models to include the ability for emphasis control, i.e., the option to indicate during synthesis which words should carry special emphasis. In this work, we realize such functionality by automatically annotating TTS training datasets with emphasis scores and modifying the TTS model to use these scores during training. In particular, we propose a new architecture for emphasis detection and compare its suitability for TTS with existing emphasis detectors. We introduce an extension for the ForwardTacotron TTS model and train multiple versions of the model with scores from the different emphasis detectors. Finally, we compare the naturalness and the perceived emphasis of speech synthesized by the models.

Emphasis Detectors

The goal of emphasis detection is to estimate one emphasis score for each word of an utterance from features such as audio features or text information. In this work, we use emphasis detectors to generate emphasis scores for TTS training datasets. With the predicted emphasis scores, we can train TTS models with explicit emphasis control. We experiment with different approaches for emphasis detection:

  • RNN-based: We propose an emphasis detection model based on a recurrent neural network (RNN). The architecture consists of linear layers, bidirectional LSTM layers, and a final sigmoid activation function; a minimal code sketch follows this list.
  • XLS-R-based: De Seyssel et al. [1] proposed a model for emphasis detection from speech waveforms. The model is based on a cross-lingual speech representation (XLS-R) model. We modified the model output to obtain real-valued emphasis scores.
  • Combination of RNN-based and XLS-R-based: We combine the outputs of the RNN-based and XLS-R-based emphasis detectors.
  • CNN-based: Morrison et al. [2] published a convolutional neural network (CNN) for predicting emphasis scores.
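
The RNN-based detector can be summarized with the following minimal sketch (PyTorch): linear layers, bidirectional LSTM layers, and a final sigmoid producing one score per word. The feature dimension, layer sizes, and the assumption of pooled per-word input features are illustrative and not the exact configuration from the paper.

    import torch
    import torch.nn as nn

    class EmphasisDetectorRNN(nn.Module):
        def __init__(self, feat_dim: int = 128, hidden_dim: int = 256):
            super().__init__()
            # linear "pre-net" on per-word input features (assumed pooled per word)
            self.pre = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
            self.lstm = nn.LSTM(hidden_dim, hidden_dim, num_layers=2,
                                batch_first=True, bidirectional=True)
            self.post = nn.Linear(2 * hidden_dim, 1)

        def forward(self, word_feats: torch.Tensor) -> torch.Tensor:
            # word_feats: (batch, num_words, feat_dim)
            x = self.pre(word_feats)
            x, _ = self.lstm(x)
            # one emphasis score in [0, 1] per word: (batch, num_words)
            return torch.sigmoid(self.post(x)).squeeze(-1)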

TTS System with Emphasis Information

We use a TTS system consisting of an acoustic model for predicting mel spectrograms and a vocoder model for synthesizing speech.
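
As a rough illustration of this two-stage pipeline, the sketch below chains an acoustic model and a vocoder; acoustic_model and vocoder are hypothetical callables standing in for the trained ForwardTacotron-based model and the pretrained StyleMelGAN vocoder, not their actual APIs.

    # Hedged sketch of the two-stage pipeline: acoustic model -> mel spectrogram,
    # vocoder -> waveform. Both callables are hypothetical placeholders.
    def synthesize(text, acoustic_model, vocoder, emphasis_scores=None):
        mel = acoustic_model(text, emphasis_scores)  # mel spectrogram, e.g. (mel_bins, frames)
        waveform = vocoder(mel)                      # audio samples, e.g. (num_samples,)
        return waveform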

The acoustic model is based on ForwardTacotron [3] with additional prosody predictor modules for pitch, energy, voicing confidence, and phoneme duration prediction [4]. We propose to augment the input of the prosody predictors with the emphasis information. During training, we use the predictions from the emphasis detectors as emphasis information. During inference, we use values based on statistics derived from the training dataset. We train multiple variants of the acoustic model using the different emphasis detectors (a sketch of this conditioning follows the list below):

  • RNN: acoustic model trained with emphasis scores from the RNN-based emphasis detector
  • XLSR: acoustic model trained with emphasis scores from the XLS-R-based emphasis detector
  • RNN+XLSR: acoustic model trained with emphasis scores from the combination of the RNN-based and XLS-R-based emphasis detectors
  • CNN: acoustic model trained with emphasis scores from the CNN-based emphasis detector
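
The following sketch shows one way such emphasis conditioning could look: word-level emphasis scores are repeated over the phonemes of each word and concatenated with the phoneme encodings before a prosody predictor. Module names, dimensions, and the concatenation itself are illustrative assumptions, not the exact ForwardTacotron extension from the paper.

    import torch
    import torch.nn as nn

    def broadcast_word_scores(scores: torch.Tensor, phones_per_word: torch.Tensor) -> torch.Tensor:
        # scores: (num_words,), phones_per_word: (num_words,) integer phoneme counts
        return torch.repeat_interleave(scores, phones_per_word)  # (num_phonemes,)

    class EmphasisConditionedPredictor(nn.Module):
        """Predicts one prosody value (e.g., pitch) per phoneme."""
        def __init__(self, enc_dim: int = 256, hidden_dim: int = 256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(enc_dim + 1, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))

        def forward(self, phoneme_enc: torch.Tensor, emphasis: torch.Tensor) -> torch.Tensor:
            # phoneme_enc: (batch, num_phonemes, enc_dim), emphasis: (batch, num_phonemes)
            x = torch.cat([phoneme_enc, emphasis.unsqueeze(-1)], dim=-1)
            return self.net(x).squeeze(-1)  # (batch, num_phonemes)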

Additionally, we use the following models:

  • NoEmph: acoustic model trained without emphasis scores; this model is not able to synthesize emphasis
  • DD: same acoustic model as NoEmph; emphasis is modelled by scaling the predicted phoneme durations with a constant factor (Duration Dilatation), as suggested by Joly et al. [6] (see the sketch after this list)
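
A rough sketch of the duration-dilatation baseline: the phoneme durations predicted for the emphasized word are stretched by a constant factor before synthesis. The factor 1.25 below is a placeholder, not the value used in the paper or by Joly et al. [6].

    import numpy as np

    def dilate_durations(durations, emphasized_mask, factor=1.25):
        # durations: per-phoneme durations (e.g., in frames)
        # emphasized_mask: boolean per phoneme, True for phonemes of the emphasized word
        durations = np.asarray(durations, dtype=float)
        out = durations.copy()
        out[np.asarray(emphasized_mask, dtype=bool)] *= factor
        return out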

For synthesizing speech from the mel spectrograms, we use a pretrained StyleMelGAN [5] vocoder model.

Audio Samples: Selection and Emphasis Levels

For the listening tests in this paper, we randomly selected 12 sentences from the Expresso dataset [7]. A few audio samples from the listening tests are shown on this website.

With our models, gradual emphasis control is possible. For the listening tests, we defined four levels of increasing emphasis, from "level 0" (lowest emphasis) to "level 3" (highest emphasis), as described in the paper.
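
As an illustration of how such levels could be tied to the training-set statistics mentioned above, the snippet below maps levels to quantiles of the score distribution. The quantile choice is an assumption made for illustration; the actual level definitions are given in the paper.

    import numpy as np

    def emphasis_levels_from_scores(train_scores, quantiles=(0.25, 0.5, 0.75, 0.95)):
        # train_scores: all word-level emphasis scores of the training dataset
        qs = np.quantile(np.asarray(train_scores, dtype=float), quantiles)
        return {f"level {i}": float(q) for i, q in enumerate(qs)}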

Audio Samples: Emphasis Generated with Different Models

The following audio samples show how emphasis is realized by different models and with different voices.

"I can definitely help *you* with that." (speaker: "female1", ephasis level 2)

"It's mostly clear and minus *one* degrees." (speaker: "female2", ephasis level 2)

"He was forced *to* quit." (speaker: "male1", ephasis level 2)

"I'm *trying* to help her!" (speaker: "male2", ephasis level 2)


Audio Samples: Gradual Emphasis Control

The following audio samples show how emphasis changes depending on the level ("level 0" to "level 3" represent increasing emphasis levels).

"We should see *some* rainfall later today." (speaker: "female1")

"We should see *some* rainfall later today." (speaker: "female1")

"We should see *some* rainfall later today." (speaker: "female1")

"We should see *some* rainfall later today." (speaker: "female1")

Acknowledgements

This research was partially supported by the Free State of Bavaria in the DSAI project and by the Fraunhofer-Zukunftsstiftung. The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer Institute for Integrated Circuits IIS.

The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU). The hardware is funded by the German Research Foundation (DFG).

We thank all participants of our listening tests.

References

  1. Maureen de Seyssel, Antony D'Avirro, Adina Williams, and Emmanuel Dupoux
    EmphAssess: A Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models
    arXiv preprint arXiv:2312.14069, 2023.
    @article{SeysselAWD23_Emphassess_ArXiv,
    title={{EmphAssess}: {A} Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models},
    author={Maureen de Seyssel and Antony D'Avirro and Adina Williams and Emmanuel Dupoux},
    year={2023},
    journal = {arXiv preprint arXiv:2312.14069}
    }
  2. Max Morrison, Pranav Pawar, Nathan Pruyne, Jennifer Cole, and Bryan Pardo
    Crowdsourced and Automatic Speech Prominence Estimation
    In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 12281–12285, 2024.
    @inproceedings{MorrisonPPCP24_AutomaticProminence_ICASSP,
    address={Seoul, Republic of Korea},
    title={Crowdsourced and Automatic Speech Prominence Estimation},
    author={Max Morrison and Pranav Pawar and Nathan Pruyne and Jennifer Cole and Bryan Pardo},
    booktitle={Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing ({ICASSP})},
    year={2024},
    pages = {12281--12285}
    }
  3. Christian Schäfer, Ollie McCarthy, and contributors
    ForwardTacotron
    https://github.com/as-ideas/ForwardTacotron, 2020.
    @misc{Schaefer20_ForwardTacotron_Github,
    author = {Christian Schäfer and Ollie McCarthy and contributors},
    howpublished = {\url{https://github.com/as-ideas/ForwardTacotron}},
    journal = {GitHub repository},
    publisher = {GitHub},
    title = {{ForwardTacotron}},
    year = {2020}
    }
  4. Frank Zalkow, Paolo Sani, Michael Fast, Judith Bauer, Mohammad Joshaghani, Kishor Kayyar, Emanuël A. P. Habets, and Christian Dittmar
    The AudioLabs System for the Blizzard Challenge 2023
    In Proceedings of the Blizzard Challenge Workshop: 63–68, 2023.
    @inproceedings{ZalkowEtAl23_AudioLabs_Blizzard,
    address = {Grenoble, France},
    author = {Frank Zalkow and Paolo Sani and Michael Fast and Judith Bauer and Mohammad Joshaghani and Kishor Kayyar and Emanu{\"e}l A. P. Habets and Christian Dittmar},
    booktitle = {Proceedings of the Blizzard Challenge Workshop},
    pages = {63--68},
    title = {The {AudioLabs} System for the {B}lizzard {C}hallenge 2023},
    year = {2023}
    }
  5. Ahmed Mustafa, Nicola Pia, and Guillaume Fuchs
    StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization
    In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 6034–6038, 2021.
    @inproceedings{MustafaPF21_StyleMelGAN_ICASSP,
    address = {Toronto, Canada},
    author = {Ahmed Mustafa and Nicola Pia and Guillaume Fuchs},
    booktitle = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing ({ICASSP})},
    pages = {6034--6038},
    title = {{StyleMelGAN}: {A}n Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization},
    year = {2021}
    }
  6. Arnaud Joly, Marco Nicolis, Ekaterina Peterova, Alessandro Lombardi, Ammar Abbas, Arent van Korlaar, Aman Hussain, Parul Sharma, Alexis Moinet, Mateusz Lajszczak, Penny Karanasou, Antonio Bonafonte, Thomas Drugman, and Elena Sokolova
    Controllable Emphasis with zero data for text-to-speech
    In Proceedings of the ISCA Workshop on Speech Synthesis (SSW): 113–119, 2023. DOI: 10.21437/SSW.2023-18
    @inproceedings{JolyEtAl23_ZeroDataEmph_SSW,
    address = {Grenoble, France},
    author={Arnaud Joly and Marco Nicolis and Ekaterina Peterova and Alessandro Lombardi and Ammar Abbas and Arent van Korlaar and Aman Hussain and Parul Sharma and Alexis Moinet and Mateusz Lajszczak and Penny Karanasou and Antonio Bonafonte and Thomas Drugman and Elena Sokolova},
    title={Controllable Emphasis with zero data for text-to-speech},
    year=2023,
    booktitle={Proceedings of the {ISCA} Workshop on Speech Synthesis ({SSW})},
    pages={113--119},
    doi={10.21437/SSW.2023-18}
    }
  7. Tu Anh Nguyen, Wei-Ning Hsu, Antony D'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, and Emmanuel Dupoux
    EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis
    In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech): 4823–4827, 2023. DOI: 10.21437/Interspeech.2023-1905
    @inproceedings{NguyenEtAl23_Expresso_Interspeech,
    address = {Dublin, Ireland},
    author={Tu Anh Nguyen and Wei-Ning Hsu and Antony D'Avirro and Bowen Shi and Itai Gat and Maryam Fazel-Zarani and Tal Remez and Jade Copet and Gabriel Synnaeve and Michael Hassid and Felix Kreuk and Yossi Adi and Emmanuel Dupoux},
    title={{EXPRESSO}: {A} Benchmark and Analysis of Discrete Expressive Speech Resynthesis},
    year=2023,
    booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech)},
    pages={4823--4827},
    doi={10.21437/Interspeech.2023-1905},
    issn={2958-1796}
    }