--------------------------------> Abstract <---------------------------------


Text-to-speech (TTS) aims to synthesize human-like speech from text input. Recent language model based TTS frameworks demonstrate scalability and in-context learning capabilities. However, they often suffer from robustness issues, such as word mispronunciation, deletion, and repetition, caused by autoregressive language modeling. In this paper, we propose a phonetic enhanced language modeling method to improve the performance of TTS models. We leverage self-supervised representations that are phonetically rich as the training target for the autoregressive language model. Subsequently, a non-autoregressive model is employed to predict discrete acoustic codecs that contain fine-grained acoustic details. The TTS model focuses solely on linguistic modeling during autoregressive training, thereby reducing the errors propagated to the non-autoregressive stage. Both objective and subjective evaluations validate the effectiveness of our proposed method.
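To make the two-stage pipeline above concrete, here is a minimal inference sketch. The callables `ar_lm`, `nar_model`, and `codec_decoder` are hypothetical placeholders for the autoregressive language model, the non-autoregressive model, and the codec decoder; they are not a released API, and the greedy decoding and length cap are illustrative choices only.

```python
# Hypothetical two-stage inference matching the description above.
# ar_lm, nar_model, and codec_decoder are placeholder callables, not a real API.
import torch

@torch.no_grad()
def synthesize(text_ids, prompt_ssl_tokens, ar_lm, nar_model, codec_decoder,
               eos_id=0, max_len=2000):
    # Stage 1 (autoregressive): extend the acoustic prompt's SSL tokens one
    # token at a time, conditioned on the input text. Only linguistic content
    # is modeled at this stage.
    ssl = prompt_ssl_tokens.tolist()  # assumes a 1-D tensor of token IDs
    while len(ssl) < max_len:
        logits = ar_lm(text_ids, torch.tensor([ssl]))  # (1, T, n_ssl)
        next_token = int(logits[0, -1].argmax())       # greedy for simplicity
        if next_token == eos_id:
            break
        ssl.append(next_token)
    ssl = torch.tensor([ssl])
    # Stage 2 (non-autoregressive): predict all 8 codec layers in parallel
    # from the SSL tokens, then decode the codecs to a waveform.
    codec_tokens = nar_model(ssl)       # e.g. (1, 8, T) discrete codec indices
    return codec_decoder(codec_tokens)  # waveform tensor
```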


We summarize our experimental setups below (a sketch of how the SSL tokens can be extracted follows the list):
- VALL-E [1] (Baseline): VALL-E utilizes a similar decoder-only language model; it predicts the first layer of acoustic codecs autoregressively, then predicts the remaining layers non-autoregressively.
- Proposed w/ HuBERT (K = 500): Our proposed phonetic enhanced language model based TTS, where HuBERT features with a cluster size of 500 are used as the SSL tokens.
- Proposed w/ HuBERT (K = 1024): Our proposed phonetic enhanced language model based TTS, where HuBERT features with a cluster size of 1024 are used as the SSL tokens.
- Proposed w/ WavLM (K = 500): Our proposed phonetic enhanced language model based TTS, where WavLM features with a cluster size of 500 are used as the SSL tokens.
- Proposed w/ WavLM (K = 1024): Our proposed phonetic enhanced language model based TTS, where WavLM features with a cluster size of 1024 are used as the SSL tokens.
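The SSL tokens referenced above are cluster IDs obtained by running k-means over hidden features of a pretrained speech encoder. Below is a minimal sketch for the HuBERT case, assuming the Hugging Face `transformers` checkpoint `facebook/hubert-base-ls960` and scikit-learn's `MiniBatchKMeans`; the checkpoint, feature layer, and file names are illustrative assumptions rather than the exact configuration behind the systems above (the WavLM variants would swap in a WavLM checkpoint).

```python
# Minimal sketch: discretize HuBERT features into K cluster IDs ("SSL tokens").
# Checkpoint, feature layer, and K below are illustrative assumptions.
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans
from transformers import HubertModel

model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def extract_features(wav_path):
    # Load audio, force mono, and resample to the 16 kHz rate HuBERT expects.
    wav, sr = torchaudio.load(wav_path)
    wav = wav.mean(0, keepdim=True)
    wav = torchaudio.functional.resample(wav, sr, 16000)
    with torch.no_grad():
        # (T, D) frame-level features at a 20 ms hop. Features are often taken
        # from an intermediate layer; the final layer is used here for brevity.
        return model(wav).last_hidden_state.squeeze(0).numpy()

# Fit k-means (K = 500 or 1024 in the setups above) on features pooled from a
# training subset, then map every frame to its nearest centroid ID.
kmeans = MiniBatchKMeans(n_clusters=500, random_state=0)
kmeans.fit(extract_features("train_utterance.wav"))  # in practice: many files
ssl_tokens = kmeans.predict(extract_features("prompt.wav"))  # frame-level IDs
```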

----------------------------> System Overview <----------------------------



The overall diagram of the proposed phonetic enhanced language model (LM) based text-to-speech framework. Given input text and an acoustic prompt, the autoregressive decoder predicts self-supervised learning (SSL) tokens that contain phonetic information, and the non-autoregressive decoder further predicts 8 layers of acoustic codecs that represent fine-grained acoustic details.
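As a structural sketch of this diagram, the two stages can be laid out as a decoder-only autoregressive model over the concatenated text and SSL token sequence, followed by a non-autoregressive model with parallel heads for the 8 codec layers. All module sizes and names below are illustrative placeholders, and the single causal mask and parallel codec heads simplify the actual system (where, for instance, each codec layer can condition on the layers below it).

```python
# Structural sketch of the two-stage framework (illustrative sizes/names only).
import torch
import torch.nn as nn

class PhoneticEnhancedTTS(nn.Module):
    def __init__(self, n_text=256, n_ssl=500, n_codec=1024,
                 n_codec_layers=8, d=512):
        super().__init__()
        self.text_emb = nn.Embedding(n_text, d)
        self.ssl_emb = nn.Embedding(n_ssl, d)
        # Stage 1: decoder-only AR model over [text ; SSL tokens].
        ar_layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.ar_decoder = nn.TransformerEncoder(ar_layer, num_layers=6)
        self.ssl_head = nn.Linear(d, n_ssl)
        # Stage 2: NAR model predicts the 8 codec layers from SSL tokens.
        nar_layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.nar_decoder = nn.TransformerEncoder(nar_layer, num_layers=6)
        self.codec_heads = nn.ModuleList(
            [nn.Linear(d, n_codec) for _ in range(n_codec_layers)]
        )

    def forward(self, text_ids, ssl_ids):
        # Concatenate text and SSL embeddings into one decoder-only sequence.
        x = torch.cat([self.text_emb(text_ids), self.ssl_emb(ssl_ids)], dim=1)
        T = x.size(1)
        # Additive causal mask: -inf above the diagonal blocks future positions
        # (simplified: one causal mask over the whole sequence).
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.ar_decoder(x, mask=causal)
        # Logits at each SSL position (shift by one for next-token training).
        ssl_logits = self.ssl_head(h[:, -ssl_ids.size(1):])
        # NAR stage: bidirectional attention over SSL tokens, all codec layers
        # predicted in parallel by separate heads.
        h_nar = self.nar_decoder(self.ssl_emb(ssl_ids))
        codec_logits = [head(h_nar) for head in self.codec_heads]
        return ssl_logits, codec_logits
```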

----------------------------> Speech Samples <-----------------------------


Test set: LibriTTS (test-clean)

Acoustic Prompt | VALL-E (Baseline) | Proposed w/ HuBERT (K = 500) | Proposed w/ HuBERT (K = 1024) | Proposed w/ WavLM (K = 500) | Proposed w/ WavLM (K = 1024) | Ground-truth
Synthesized Text: So that if you get once the right clue to any group of them, it will grasp the simplest, yet reach to the highest truths.
Synthesized Text: The count advanced a step towards his friend, and pressed him warmly in his arms.
Synthesized Text: Faithful to their legislative declaration they knew but one issue, slavery.
Synthesized Text: Let your actors have tea by all means, but see that it is a properly histrionic tea.
Synthesized Text: He wore a flannel cap and duck trousers, and the sleeves of his white flannel shirt were rolled back to the elbow.
Synthesized Text: He stooped to the evil of hypocrisy with others, sceptical of their innocence which he could cajole so easily.
Synthesized Text: Every plant in the grass is set formally, grows perfectly, and may be realized completely.
Synthesized Text: Edison held that the electricity sold must be measured just like gas or water, and he proceeded to develop a meter.
Synthesized Text: Such consumption as falls to the women is merely incidental to their work; it is a means to their continued labour, and not a consumption directed to their own comfort and fulness of life.
Synthesized Text: The canon of reputability is at hand and seizes upon such innovations as are, according to its standard, fit to survive.

Test set: LibriTTS (test-other)

Acoustic Prompt | VALL-E (Baseline) | Proposed w/ HuBERT (K = 500) | Proposed w/ HuBERT (K = 1024) | Proposed w/ WavLM (K = 500) | Proposed w/ WavLM (K = 1024) | Ground-truth
Synthesized Text: The imagination is cultivated. A man puts himself in the place of another.
Synthesized Text: You must go on trying to improve your mind, said the pawnbroker fussily.
Synthesized Text: Why, only last month a brute of a dog bit me in the leg, at a back door Sutton way.
Synthesized Text: From some mysterious source mr Beale had obtained an old double perambulator, which must have been made, Dickie thought, for very fat twins, it was so broad and roomy.
Synthesized Text: I could do so much for all at home how I should enjoy that! And Polly let her thoughts revel in the luxurious future her fancy painted.
Synthesized Text: This had an especial charm to Polly, for she soon found that this side of his character was not shown to every one.
Synthesized Text: If you had a little more faith, and if you could have been in her cell, she would have cured your leg merely by touching it. She smiled.
Synthesized Text: General observations on preserves, confectionary, ices, and dessert dishes.
Synthesized Text: I did but laugh to think the sword of Ethelried had been so quickly found, responded the Jester, and he pointed to the scissors hanging from the Tailor's girdle.

-------------------------------> Reference <--------------------------------

[1] C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023.