--------------------------------> Abstract <---------------------------------
Text-to-speech (TTS) aims to synthesize human-like speech from text input. Recent language model based TTS frameworks demonstrate scalability and in-context learning capabilities. However, they often suffer from robustness issues, such as word mispronunciation, deletion, and repetition, caused by autoregressive language modeling. In this paper, we propose a phonetic enhanced language modeling method to improve the performance of TTS models. We leverage phonetically rich self-supervised representations as the training target for the autoregressive language model. Subsequently, a non-autoregressive model is employed to predict discrete acoustic codecs that contain fine-grained acoustic details. The TTS model focuses solely on linguistic modeling during autoregressive training, thereby reducing the error propagation that occurs in non-autoregressive training. Both objective and subjective evaluations validate the effectiveness of our proposed method.
We summarize our experimental setups below:
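The two-stage flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the end-of-sequence convention, the number of codec layers, and the stand-in "models" (simple arithmetic lambdas) are all assumptions made only to show how the data moves. Stage one emits phonetic (SSL) tokens one at a time, each conditioned on the text and the tokens generated so far. Stage two maps the complete SSL token sequence to acoustic codec tokens, one layer at a time, without a step-by-step loop over frames.

```python
from typing import Callable, List

EOS = -1  # hypothetical end-of-sequence marker


def autoregressive_decode(
    next_token: Callable[[List[int], List[int]], int],
    text_ids: List[int],
    max_len: int,
) -> List[int]:
    """Stage 1: generate SSL tokens one by one, conditioned on all previous ones."""
    tokens: List[int] = []
    for _ in range(max_len):
        tok = next_token(text_ids, tokens)
        if tok == EOS:
            break
        tokens.append(tok)
    return tokens


def non_autoregressive_decode(
    predict_layer: Callable[[List[int], int], List[int]],
    ssl_tokens: List[int],
    num_layers: int,
) -> List[List[int]]:
    """Stage 2: predict each codec layer for all frames in a single call."""
    return [predict_layer(ssl_tokens, layer) for layer in range(num_layers)]


# Toy stand-ins for trained models (assumptions, for illustration only).
def toy_next_token(text_ids: List[int], prev: List[int]) -> int:
    # Emit one token per text id, then stop.
    return text_ids[len(prev)] if len(prev) < len(text_ids) else EOS


def toy_predict_layer(ssl_tokens: List[int], layer: int) -> List[int]:
    # Map every SSL token to a codec token for the given layer.
    return [(t + layer) % 1024 for t in ssl_tokens]


text_ids = [3, 7, 11]
ssl_tokens = autoregressive_decode(toy_next_token, text_ids, max_len=10)
codec_tokens = non_autoregressive_decode(toy_predict_layer, ssl_tokens, num_layers=2)
```

In this sketch the autoregressive loop only ever produces SSL tokens. Acoustic detail is handled entirely by the second stage, which mirrors the separation of linguistic and acoustic modeling that the abstract describes.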
- VALL-E [1] (Baseline): VALL-E utilizes a similar decoder-only language model and predicts the first layer of acoustic codecs autoregressively, followed by predicting the other layers non-autoregressively.
- Proposed w/ Hubert (K = 500): Our proposed phonetic enhanced language model based TTS, where Hubert features with a cluster size of 500 are used as the SSL tokens.
- Proposed w/ Hubert (K = 1024): Our proposed phonetic enhanced language model based TTS, where Hubert features with a cluster size of 1024 are used as the SSL tokens.
- Proposed w/ WavLM (K = 500): Our proposed phonetic enhanced language model based TTS, where WavLM features with a cluster size of 500 are used as the SSL tokens.
- Proposed w/ WavLM (K = 1024): Our proposed phonetic enhanced language model based TTS, where WavLM features with a cluster size of 1024 are used as the SSL tokens.
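The cluster size K in the setups above is the number of discrete SSL tokens available to the language model. A minimal sketch of how continuous frame-level features could be turned into such tokens follows, assuming plain k-means clustering (the page does not name the clustering algorithm). The toy 2-dimensional frames and K = 2 stand in for real Hubert or WavLM features and K = 500 or 1024. All function names are hypothetical.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(frame: Vector, centroids: List[List[float]]) -> int:
    """Index of the centroid closest to the given frame."""
    return min(range(len(centroids)), key=lambda k: squared_distance(frame, centroids[k]))


def kmeans(frames: List[Vector], k: int, iterations: int = 10, seed: int = 0) -> List[List[float]]:
    """Learn k centroids with Lloyd's algorithm, initialised from random frames."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(list(frames), k)]
    for _ in range(iterations):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for frame in frames:
            clusters[nearest(frame, centroids)].append(frame)
        for idx, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                dim = len(members[0])
                centroids[idx] = [sum(m[d] for m in members) / len(members) for d in range(dim)]
    return centroids


def tokenize(frames: List[Vector], centroids: List[List[float]]) -> List[int]:
    """Replace each frame by the index of its nearest centroid (its SSL token)."""
    return [nearest(frame, centroids) for frame in frames]


# Toy data: two well-separated groups of frames (an assumption for illustration).
frames = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
codebook = kmeans(frames, k=2)
ssl_tokens = tokenize(frames, codebook)
```

Frames that lie close together receive the same token. A larger K gives a finer partition of the feature space at the cost of a larger vocabulary for the autoregressive model, which is the trade-off the K = 500 and K = 1024 setups compare.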
----------------------------> System Overview <----------------------------
----------------------------> Speech Samples <-----------------------------
Test set: LibriTTS (test-clean)
Acoustic Prompt | VALL-E (Baseline) | Proposed w/ Hubert (K = 500) | Proposed w/ Hubert (K = 1024) | Proposed w/ WavLM (K = 500) | Proposed w/ WavLM (K = 1024) | Ground-truth
- Synthesize Text: So that if you get once the right clue to any group of them, it will grasp the simplest, yet reach to the highest truths.
- Synthesize Text: The count advanced a step towards his friend, and pressed him warmly in his arms.
- Synthesize Text: Faithful to their legislative declaration they knew but one issue, slavery.
- Synthesize Text: Let your actors have tea by all means, but see that it is a properly histrionic tea.
- Synthesize Text: He wore a flannel cap and duck trousers, and the sleeves of his white flannel shirt were rolled back to the elbow.
- Synthesize Text: He stooped to the evil of hypocrisy with others, sceptical of their innocence which he could cajole so easily.
- Synthesize Text: Every plant in the grass is set formally, grows perfectly, and may be realized completely.
- Synthesize Text: Edison held that the electricity sold must be measured just like gas or water, and he proceeded to develop a meter.
- Synthesize Text: Such consumption as falls to the women is merely incidental to their work; it is a means to their continued labour, and not a consumption directed to their own comfort and fulness of life.
- Synthesize Text: The canon of reputability is at hand and seizes upon such innovations as are, according to its standard, fit to survive.
Test set: LibriTTS (test-other)
Acoustic Prompt | VALL-E (Baseline) | Proposed w/ Hubert (K = 500) | Proposed w/ Hubert (K = 1024) | Proposed w/ WavLM (K = 500) | Proposed w/ WavLM (K = 1024) | Ground-truth
- Synthesize Text: The imagination is cultivated. A man puts himself in the place of another.
- Synthesize Text: You must go on trying to improve your mind, said the pawnbroker fussily.
- Synthesize Text: Why, only last month a brute of a dog bit me in the leg, at a back door Sutton way.
- Synthesize Text: From some mysterious source mr Beale had obtained an old double perambulator, which must have been made, Dickie thought, for very fat twins, it was so broad and roomy.
- Synthesize Text: I could do so much for all at home how I should enjoy that! And Polly let her thoughts revel in the luxurious future her fancy painted.
- Synthesize Text: This had an especial charm to Polly, for she soon found that this side of his character was not shown to every one.
- Synthesize Text: If you had a little more faith, and if you could have been in her cell, she would have cured your leg merely by touching it. She smiled.
- Synthesize Text: General observations on preserves, confectionary, ices, and dessert dishes.
- Synthesize Text: I did but laugh to think the sword of Ethelried had been so quickly found, responded the Jester, and he pointed to the scissors hanging from the Tailor's girdle.