--------------------------------> Abstract <---------------------------------
Current emotional text-to-speech (TTS) systems face challenges in conveying the full spectrum of human emotions, largely due to the inherent complexity of emotions and the limited range of emotional labels in existing speech datasets. To address these limitations, this paper introduces a TTS framework that provides flexible user control over three emotional dimensions—pleasure, arousal, and dominance—enabling the synthesis of a diverse array of emotional styles. The framework leverages an emotional attribute predictor, trained solely on categorical labels from speech data and grounded in earlier psychological research, which is seamlessly integrated into a language model-based TTS system. Experimental results demonstrate that the proposed framework effectively learns emotional styles from expressive speech, eliminating the need for explicit emotion labels during TTS training, while enhancing the naturalness and diversity of synthesized emotional speech.
----------------------------> System Overview <----------------------------

----------------------------> Speech Samples <-----------------------------
(A) 8 Emotions Evaluation
(1) Anger (Pleasure: -0.51, Arousal: 0.59, Dominance: 0.25) vs. Anxiety (Pleasure: 0.01, Arousal: 0.59, Dominance: -0.15)
Columns: Anger | Anxiety
Input Text: "Gave him a little brandy and left him collapsed in a chair, while I made a most careful examination of the room."
Input Text: "I have only been able to find a few which I seem to have jotted down almost unconsciously."
Input Text: "My scholar has been left very poor, but he is hard-working and industrious."
Input Text: "It must be as wide as the Mediterranean or the Atlantic-and why not?"
Input Text: "'Well, perhaps not,' said Alice in a soothing tone: 'don't be angry about it.'"
Input Text: "I must suppress what I feel, or you will think me foolishly enthusiastic."
(2) Surprise (Pleasure: 0.40, Arousal: 0.67, Dominance: -0.13) vs. Alert (Pleasure: 0.49, Arousal: 0.57, Dominance: 0.45)
Columns: Surprise | Alert
Input Text: "Won't you tell, Douglas? somebody else inquired."
Input Text: "For sheer terror? I remember asking."
Input Text: "'Well, perhaps not,' said Alice in a soothing tone: 'don't be angry about it.'"
Input Text: "It has cost me twice sixty dollars in annoyance."
Input Text: "There was no such thing as a hawk in sight."
Input Text: "This I read with great attention, while they sat silent."
(3) Pleasure (Pleasure: 1.00, Arousal: 0.00, Dominance: 0.00) vs. Excited (Pleasure: 0.62, Arousal: 0.75, Dominance: 0.38)
Columns: Pleasure | Excited
Input Text: "His heart trembled in an ecstasy of fear and his soul was in flight."
Input Text: "He soon stopped again, and waited for the whole party to come up."
Input Text: "The darkness deepens; scarcely can I jot down a few hurried notes."
Input Text: "I must suppress what I feel, or you will think me foolishly enthusiastic."
Input Text: "We suffer stifling pains."
Input Text: "After his visit I told Esprit to take me to the Palais Royal, and I left him at the gates."
(4) Relaxed (Pleasure: 0.68, Arousal: -0.46, Dominance: 0.20) vs. Protected (Pleasure: 0.60, Arousal: -0.22, Dominance: -0.40)
Columns: Relaxed | Protected
Input Text: "Free thinkers, replied the young woman laconically."
Input Text: "There was no such thing as a hawk in sight."
Input Text: "These observations partly apply here."
Input Text: "But it were a pretty close call an' I hope it won't happen again."
Input Text: "After that there was no attempt at speaking."
Input Text: "What do you mean? inquired Louis."
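The eight emotion targets above are each specified as a point in pleasure-arousal-dominance (PAD) space. As a minimal sketch of how such targets could be stored and compared, the snippet below collects the coordinates listed on this page into a lookup table and measures how far apart the two emotions of each pair are. The names `PAD`, `EMOTION_PAD`, and `pad_distance` are hypothetical and not part of the authors' system. The assumption that each dimension lies in [-1, 1] is inferred from the listed values and is not stated on the page.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PAD:
    """A point in pleasure-arousal-dominance space (hypothetical helper type)."""
    pleasure: float
    arousal: float
    dominance: float

    def __post_init__(self):
        # Assumption: each dimension is normalized to [-1, 1].
        for name in ("pleasure", "arousal", "dominance"):
            value = getattr(self, name)
            if not -1.0 <= value <= 1.0:
                raise ValueError(f"{name}={value} is outside [-1, 1]")


# Coordinates copied from the sample headings on this page.
EMOTION_PAD = {
    "Anger": PAD(-0.51, 0.59, 0.25),
    "Anxiety": PAD(0.01, 0.59, -0.15),
    "Surprise": PAD(0.40, 0.67, -0.13),
    "Alert": PAD(0.49, 0.57, 0.45),
    "Pleasure": PAD(1.00, 0.00, 0.00),
    "Excited": PAD(0.62, 0.75, 0.38),
    "Relaxed": PAD(0.68, -0.46, 0.20),
    "Protected": PAD(0.60, -0.22, -0.40),
}


def pad_distance(a: PAD, b: PAD) -> float:
    """Euclidean distance between two PAD points."""
    return math.dist(
        (a.pleasure, a.arousal, a.dominance),
        (b.pleasure, b.arousal, b.dominance),
    )


if __name__ == "__main__":
    # The four pairs compared in the samples above.
    pairs = [("Anger", "Anxiety"), ("Surprise", "Alert"),
             ("Pleasure", "Excited"), ("Relaxed", "Protected")]
    for first, second in pairs:
        d = pad_distance(EMOTION_PAD[first], EMOTION_PAD[second])
        print(f"{first} vs. {second}: distance {d:.3f}")
```

Euclidean distance is only one plausible way to compare PAD points. It is used here to illustrate that some pairs differ mainly along a single dimension; for example, Anger and Anxiety share the same arousal value and differ in pleasure and dominance.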
(B) Emotion Cloning Evaluation
(B-1) Angry
Columns: MixedEmotion [1] | CosyVoice [2] | Proposed | Ground-truth
Input Text: "Clear than clear water!"
Input Text: "Andy what's the gyre and to gimble."
Input Text: "That was his chief thought."
Input Text: "At the end of four."
Input Text: "At the roots-of a bush of a grass."
(B-2) Happy
Columns: MixedEmotion [1] | CosyVoice [2] | Proposed | Ground-truth
Input Text: "Clear than clear water!"
Input Text: "Andy what's the gyre and to gimble."
Input Text: "That was his chief thought."
Input Text: "At the end of four."
Input Text: "At the roots-of a bush of a grass."
(B-3) Surprise
Columns: MixedEmotion [1] | CosyVoice [2] | Proposed | Ground-truth
Input Text: "Clear than clear water!"
Input Text: "Andy what's the gyre and to gimble."
Input Text: "That was his chief thought."
Input Text: "At the end of four."
Input Text: "At the roots-of a bush of a grass."
(B-4) Sad
Columns: MixedEmotion [1] | CosyVoice [2] | Proposed | Ground-truth
Input Text: "Clear than clear water!"
Input Text: "Andy what's the gyre and to gimble."
Input Text: "That was his chief thought."
Input Text: "At the end of four."
Input Text: "At the roots-of a bush of a grass."
(C) Zero-Shot TTS Evaluation
Columns: CosyVoice [2] | Proposed | Ground-truth
Input Text: "We waited in fact till two nights later; but that same evening, before we scattered, he brought out what was in his mind."
Input Text: "Run back, Uncas, and bring me the size of the singer's foot."
Input Text: "So choose for yourself-to make a rush or tarry here."
[2] Z. Du et al., "CosyVoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens."