Evaluating Speech–Phoneme Alignment and Its Impact on Neural Text-To-Speech Synthesis

This is the accompanying website for the following paper:

  1. Frank Zalkow, Prachi Govalkar, Meinard Müller, Emanuël A. P. Habets, and Christian Dittmar
    Evaluating Speech–Phoneme Alignment and Its Impact on Neural Text-To-Speech Synthesis
    In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023.
    @inproceedings{ZalkowGMHD23_EvalAlignmentTTS_ICASSP,
    author      = {Frank Zalkow and Prachi Govalkar and Meinard M{\"u}ller and Emanu{\"e}l A.\ P.\ Habets and Christian Dittmar},
    title       = {Evaluating Speech--Phoneme Alignment and Its Impact on Neural Text-To-Speech Synthesis},
    booktitle   = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing ({ICASSP})},
    address     = {Rhodes Island, Greece},
    year        = {2023},
    pages       = {},
    doi         = {10.1109/ICASSP49357.2023.10097248},
    url-pdf     = {https://ieeexplore.ieee.org/document/10097248},
    url-details = {https://www.audiolabs-erlangen.de/resources/NLUI/2023-ICASSP-eval-alignment-tts},
    }

Abstract

In recent years, the quality of text-to-speech (TTS) synthesis has vastly improved due to deep-learning techniques, with parallel architectures, in particular, providing excellent synthesis quality at fast inference speeds. Training these models usually requires speech recordings, corresponding phoneme-level transcripts, and the temporal alignment of each phoneme to the utterances. Since manually creating such fine-grained alignments requires expert knowledge and is time-consuming, it is common practice to estimate them using automatic speech–phoneme alignment methods. In the literature, either the estimation methods' accuracy or their impact on the TTS system's synthesis quality is evaluated. In this study, we perform experiments with five state-of-the-art speech–phoneme aligners and evaluate their output with objective and subjective measures. As our main result, we show that small alignment errors (below 75 ms) do not decrease the synthesis quality, which implies that the alignment error may not be the crucial factor when choosing an aligner for TTS training.

Details for RAD

Our implementation of the RAD system was inspired by the parallel model described by Badlani et al. [1], which extends and generalizes RAD-TTS [2]. We use a few 1-D convolutional blocks for the phonetic encoder (processing phoneme embeddings) and for the spectral encoder (processing log-scaled mel-spectral frames). An alignment matrix is then computed from the negative Euclidean distance between each pair of phonetic and spectral encodings. We then apply the forward-sum loss to this matrix, as described by Shih et al. [2]. As proposed by the original authors, we also use an additional loss term that increases the values around a prior along the main diagonal. Furthermore, we employ a binarization loss term, which is added after convergence, when a second training phase begins. The table below summarises the architecture of the model.

Layer              Output    Activation  Parameters
Phonetic Encoder
  Input            (N, V)
  Embedding (512)  (N, 512)              512 · V
  1D-Conv (3)      (N, 512)  lReLU       786432
  1D-BatchNorm     (N, 512)              1024
  1D-Conv (1)      (N, 512)  lReLU       262144
  1D-BatchNorm     (N, 512)              1024
Spectral Encoder
  Input            (N, M)
  1D-Conv (3)      (N, 512)  lReLU       1536 · M
  1D-BatchNorm     (N, 512)              1024
  1D-Conv (3)      (N, 512)  lReLU       786432
  1D-BatchNorm     (N, 512)              1024
  1D-Conv (1)      (N, 512)  lReLU       262144
  1D-BatchNorm     (N, 512)              1024
Architecture for RAD approach. lReLU refers to the leaky ReLU activation (with a negative slope of 0.01). N refers to the variable number of time steps, M to the fixed number of mel bands, and V to the fixed phoneme alphabet size.
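
To make the alignment computation described above more concrete, the following is a minimal PyTorch-style sketch of the soft alignment matrix and the forward-sum loss. It is only an illustration under our own assumptions: the function name, tensor shapes, and the use of F.ctc_loss to realize the forward sum are not taken from the authors' code.

    import torch
    import torch.nn.functional as F

    def soft_alignment_and_forward_sum(phon_enc, spec_enc, text_lens, mel_lens):
        """Hypothetical sketch: phon_enc has shape (B, L, D) (phonetic encodings),
        spec_enc has shape (B, N, D) (spectral encodings). Returns the log-scale
        soft alignment matrix (B, N, L) and a CTC-based forward-sum loss."""
        # Negative Euclidean distance between every (frame, phoneme) pair,
        # normalized over the phoneme axis to obtain a soft alignment.
        dist = torch.cdist(spec_enc, phon_enc)             # (B, N, L)
        log_attn = F.log_softmax(-dist, dim=-1)            # (B, N, L)

        # Forward-sum loss via CTC: the target sequence is simply 1, ..., L,
        # so all monotonic paths through the full phoneme sequence are summed.
        B, N, L = log_attn.shape
        log_probs = F.pad(log_attn, (1, 0), value=-1e4)    # prepend an (unused) blank class
        log_probs = log_probs.transpose(0, 1)              # (N, B, L + 1), as ctc_loss expects
        targets = torch.arange(1, L + 1, device=log_attn.device).expand(B, L)
        loss = F.ctc_loss(log_probs, targets, input_lengths=mel_lens,
                          target_lengths=text_lens, blank=0, zero_infinity=True)
        return log_attn, loss

The diagonal prior and the binarization loss of the second training phase would come on top of this sketch, e.g., by adding a static prior to the negative distances before the softmax and by penalizing the discrepancy between the soft alignment and its binarized (Viterbi) counterpart.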

Details for CTC*

We were inspired by the idea of Teytaut and Roebel [3] to temporally stabilize the CTC-based posteriogram by using a spectral decoder. We simplified all other aspects of their model, e.g., by removing the phonetic attention mechanism. In this section, we describe the simplified model, highlighting the changes compared to the original model [3].

The input to the model is only a mel spectrogram (without any scaling and without the side information or delta features used in [3]). A 2-layer bidirectional LSTM with 128 units (instead of 512 units as in [3]) directly processes the input (without prior convolutional blocks as in [3]). Similar to [3], a linear layer reduces the feature dimension to the size of the CTC posteriogram (the size of the phoneme alphabet plus the additional blank symbol). Unlike [3], we do not use a recurrent network for the spectral decoder but three fully connected (dense) layers. We think a recurrent network might compensate for temporal inaccuracies in the posteriogram, which would undermine the decoder's aim of stabilizing the posteriogram temporally. Thus, limiting the receptive field of the decoder should be beneficial for that aim. The table below summarises the architecture of the model. For training, we use a reconstruction loss weight of 1.0 instead of 0.1.
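
As a concrete illustration, here is a minimal PyTorch sketch of this simplified model. It is a sketch under our own assumptions: the class and layer names are hypothetical, and we place the blank symbol at index 0 of the posteriogram.

    import torch.nn as nn

    class SimplifiedCTCAligner(nn.Module):
        """Hypothetical sketch of the simplified encoder/decoder."""
        def __init__(self, n_mels, n_phonemes, hidden=128):
            super().__init__()
            # Encoder: two bidirectional LSTM layers, then a linear projection
            # to the phoneme alphabet plus the CTC blank symbol.
            self.lstm = nn.LSTM(n_mels, hidden, num_layers=2,
                                bidirectional=True, batch_first=True)
            self.to_posteriogram = nn.Linear(2 * hidden, n_phonemes + 1)
            # Decoder: three dense layers that reconstruct each mel frame from the
            # blank-free posteriogram frame, i.e., without any temporal context.
            self.decoder = nn.Sequential(
                nn.Linear(n_phonemes, 256), nn.LeakyReLU(0.01),
                nn.Linear(256, 256), nn.LeakyReLU(0.01),
                nn.Linear(256, n_mels),
            )

        def forward(self, mel):                    # mel: (B, N, n_mels)
            h, _ = self.lstm(mel)                  # (B, N, 2 * hidden)
            logits = self.to_posteriogram(h)       # (B, N, V + 1)
            post = logits.softmax(dim=-1)
            recon = self.decoder(post[..., 1:])    # drop the assumed blank at index 0
            return logits, recon

Training would then combine a CTC loss on the log-softmaxed logits (with the phoneme transcript as target) and a mean-squared-error reconstruction loss between recon and the input mel spectrogram, the latter weighted with 1.0 as stated above.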

To retrieve the final alignment from the CTC posteriogram, we employ a different procedure compared to [3]. First, we remove the blank symbol probability from the posteriogram and ℓ1-normalize the rows, similar to [4]. Then, we apply a dynamic programming procedure (similar to the Viterbi algorithm) to find a probability-maximizing alignment.
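
The following NumPy sketch illustrates one way such a Viterbi-like procedure could look, under our own simplifying assumptions: the path starts at the first phoneme, ends at the last one, and per frame either stays on the current phoneme or advances by one; the function and variable names are hypothetical.

    import numpy as np

    def monotonic_alignment(post, phoneme_ids):
        """post: (N, V) blank-free, row-normalized posteriogram;
        phoneme_ids: transcript as a list of phoneme-alphabet indices.
        Returns, for each frame, an index into the transcript."""
        log_p = np.log(post[:, phoneme_ids] + 1e-8)   # (N, L) frame-wise log-probabilities
        N, L = log_p.shape
        dp = np.full((N, L), -np.inf)                 # best log-prob ending at (frame, phoneme)
        back = np.zeros((N, L), dtype=int)            # backpointers
        dp[0, 0] = log_p[0, 0]                        # path must start at the first phoneme
        for n in range(1, N):
            for l in range(L):
                stay = dp[n - 1, l]
                advance = dp[n - 1, l - 1] if l > 0 else -np.inf
                if advance > stay:
                    dp[n, l], back[n, l] = advance + log_p[n, l], l - 1
                else:
                    dp[n, l], back[n, l] = stay + log_p[n, l], l
        path = np.zeros(N, dtype=int)
        path[-1] = L - 1                              # path must end at the last phoneme
        for n in range(N - 1, 0, -1):
            path[n - 1] = back[n, path[n]]
        return path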

Layer           Output      Activation  Parameters
Encoder
  Input         (N, M)
  Bi-LSTM       (N, 256)    lReLU       1024 · (M + 130)
  Bi-LSTM       (N, 256)    lReLU       395264
  Dense         (N, V + 1)  softmax     257 · (V + 1)
Decoder
  Input         (N, V + 1)
  Remove blank  (N, V)
  Dense         (N, 256)    lReLU       (V + 1) · 256
  Dense         (N, 256)    lReLU       65792
  Dense         (N, M)      linear      257 · M
Architecture for CTC* approach. lReLU refers to the leaky ReLU activation (with a negative slope of 0.01). N refers to the variable number of time steps, M to the fixed number of mel bands, and V to the fixed phoneme alphabet size.

Audio Samples

T6B06201847: "You have neither of you any doubt as to your son's guilt?"

T6B06202613: "Very good. Now, Mister Wilson?"

T6B06202664: "Good God! What a week she must have spent!"

T6B06202849: "Did Lady Brackenstall say that screw was used?"

T6B06202850: "Are any of your people tinsmiths?"

T6B06313689: A very warm welcome to you and your family.

T6B06314479: six eight seven six seven three

T6B06324701: Mangel-wurzels are grown chiefly as cattle feed.

T6B06325330: The chewing-gum tasted spearminty.

T6B06335464: Do you not think it strange that these judgments are made?

Acknowledgements

We thank all participants of our listening test. Furthermore, we thank Alexander Adami for fruitful discussions on the listening test design and its evaluation. Parts of this work have been supported by the SPEAKER project (FKZ 01MK20011A), funded by the German Federal Ministry for Economic Affairs and Climate Action. In addition, this work was supported by the Free State of Bavaria in the DSAI project. The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and the Fraunhofer Institute for Integrated Circuits IIS. The authors gratefully acknowledge the technical support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the FAU.

References

  1. Rohan Badlani, Adrian Lancucki, Kevin J. Shih, Rafael Valle, Wei Ping, and Bryan Catanzaro
    One TTS Alignment To Rule Them All
    CoRR, abs/2108.10447, 2021.
    @article{BadlaniEtAl21_TTSAlignment_arXiv,
    title         = {One {TTS} Alignment To Rule Them All},
    author        = {Rohan Badlani and Adrian {\L{}}ancucki and Kevin J. Shih and Rafael Valle and Wei Ping and Bryan Catanzaro},
    journal       = {CoRR},
    year          = {2021},
    volume        = {abs/2108.10447},
    eprinttype    = {arXiv},
    eprint        = {2108.10447},
    }
  2. Kevin Shih, Rafael Valle, Rohan Badlani, Adrian Lancucki, Wei Ping, and Bryan Catanzaro
    RAD-TTS: Parallel Flow-Based TTS with Robust Alignment Learning and Diverse Synthesis
    In Proceedings of the International Conference on Machine Learning (ICML) Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021.
    @inproceedings{ShihEtAl21_RADTTS_ICML,
    author    = {Kevin Shih and Rafael Valle and Rohan Badlani and Adrian {\L{}}ancucki and Wei Ping and Bryan Catanzaro},
    title     = {{RAD-TTS}: {P}arallel Flow-Based {TTS} with Robust Alignment Learning and Diverse Synthesis},
    booktitle = {Proceedings of the International Conference on Machine Learning ({ICML}) Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models},
    year      = {2021},
    }
  3. Yann Teytaut and Axel Roebel
    Phoneme-to-Audio Alignment with Recurrent Neural Networks for Speaking and Singing Voice
    In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech): 61–65, 2021.
    @inproceedings{TeytautRoebel21_PhonemeAudioAlignment_Interspeech,
    author    = {Yann Teytaut and Axel Roebel},
    title     = {Phoneme-to-Audio Alignment with Recurrent Neural Networks for Speaking and Singing Voice},
    booktitle = {Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech)},
    pages     = {61--65},
    address   = {Brno, Czech Republic},
    year      = {2021},
    }
  4. Frank Zalkow and Meinard Müller
    CTC-Based Learning of Chroma Features for Score-Audio Music Retrieval
    IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29: 2957–2971, 2021.
    @article{ZalkowMueller_CTC_TASLP,
    author        = {Frank Zalkow and Meinard M{\"{u}}ller},
    title         = {{CTC}-Based Learning of Chroma Features for Score-Audio Music Retrieval},
    journal       = {{IEEE}/{ACM} Transactions on Audio, Speech, and Language Processing},
    year          = {2021},
    volume        = {29},
    pages         = {2957--2971},
    doi           = {10.1109/TASLP.2021.3110137},
    }