--------------------------------> Abstract <---------------------------------


Current emotional text-to-speech (TTS) systems face challenges in conveying the full spectrum of human emotions, largely due to the inherent complexity of emotions and the limited range of emotional labels in existing speech datasets. To address these limitations, this paper introduces a TTS framework that provides flexible user control over three emotional dimensions—pleasure, arousal, and dominance—enabling the synthesis of a diverse array of emotional styles. The framework leverages an emotional attribute predictor, trained solely on categorical labels from speech data and grounded in earlier psychological research, which is seamlessly integrated into a language model-based TTS system. Experimental results demonstrate that the proposed framework effectively learns emotional styles from expressive speech, eliminating the need for explicit emotion labels during TTS training, while enhancing the naturalness and diversity of synthesized emotional speech.


----------------------------> System Overview <----------------------------


Fig. 1 An overview of the proposed text-to-speech (TTS) framework with emotional dimension control, consisting of: (a) Emotional Dimension (ED) Predictor Training, and (b) Text-to-Speech Flow. The ED predictor is pre-trained on an emotional speech dataset to map emotional features to dimension representations via anchored dimensionality reduction. It then guides the non-autoregressive language model (LM) to predict acoustic details. 'P', 'A', and 'D' denote 'Pleasure', 'Arousal', and 'Dominance'.


----------------------------> Speech Samples <-----------------------------


(A) 8 Emotions Evaluation

(1) Anger (Pleasure: -0.51, Arousal: 0.59, Dominance: 0.25) vs. Anxiety (Pleasure: 0.01, Arousal: 0.59, Dominance: -0.15)
Anger Anxiety
Input Text: "Gave him a little brandy and left him collapsed in a chair, while I made a most careful examination of the room."
Input Text: "I have only been able to find a few which I seem to have jotted down almost unconsciously."
Input Text: "My scholar has been left very poor, but he is hard-working and industrious."
Input Text: "It must be as wide as the Mediterranean or the Atlantic-and why not?"
Input Text: "'Well, perhaps not,' said Alice in a soothing tone: 'don't be angry about it.'"
Input Text: "I must suppress what I feel, or you will think me foolishly enthusiastic."
(2) Surprise (Pleasure: 0.40, Arousal: 0.67, Dominance: -0.13) vs. Alert (Pleasure: 0.49, Arousal: 0.57, Dominance: 0.45)
Surprise Alert
Input Text: "Won't you tell, Douglas? somebody else inquired."
Input Text: "For sheer terror? I remember asking."
Input Text: "'Well, perhaps not,' said Alice in a soothing tone: 'don't be angry about it.'"
Input Text: "It has cost me twice sixty dollars in annoyance."
Input Text: "There was no such thing as a hawk in sight."
Input Text: "This I read with great attention, while they sat silent."
(3) Pleasure (Pleasure: 1.00, Arousal: 0.00, Dominance: 0.00) vs. Excited (Pleasure: 0.62, Arousal: 0.75, Dominance: 0.38)
Pleasure Excited
Input Text: "His heart trembled in an ecstasy of fear and his soul was in flight."
Input Text: "He soon stopped again, and waited for the whole party to come up."
Input Text: "The darkness deepens; scarcely can I jot down a few hurried notes."
Input Text: "I must suppress what I feel, or you will think me foolishly enthusiastic."
Input Text: "We suffer stifling pains."
Input Text: "After his visit I told Esprit to take me to the Palais Royal, and I left him at the gates."
(4) Relaxed (Pleasure: 0.68, Arousal: -0.46, Dominance: 0.20) vs. Protected (Pleasure: 0.60, Arousal: -0.22, Dominance: -0.40)
Relaxed Protected
Input Text: "Free thinkers, replied the young woman laconically."
Input Text: "There was no such thing as a hawk in sight."
Input Text: "These observations partly apply here."
Input Text: "But it were a pretty close call an' I hope it won't happen again."
Input Text: "After that there was no attempt at speaking."
Input Text: "What do you mean? inquired Louis."
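
The pleasure-arousal-dominance (PAD) coordinates in the four comparisons above can be collected into a small lookup table, which makes it easier to see which dimension separates each pair. The following is a minimal sketch, not part of the proposed system: the coordinate values are copied from the headings above, while the names `PAD`, `PAIRS`, `pad_distance`, and `dimension_gaps` are hypothetical helpers, and Euclidean distance is an assumed convenience measure rather than a metric taken from the paper.

```python
import math

# PAD coordinates (pleasure, arousal, dominance) copied from the
# comparison headings above. Each value lies in [-1, 1].
PAD = {
    "Anger": (-0.51, 0.59, 0.25),
    "Anxiety": (0.01, 0.59, -0.15),
    "Surprise": (0.40, 0.67, -0.13),
    "Alert": (0.49, 0.57, 0.45),
    "Pleasure": (1.00, 0.00, 0.00),
    "Excited": (0.62, 0.75, 0.38),
    "Relaxed": (0.68, -0.46, 0.20),
    "Protected": (0.60, -0.22, -0.40),
}

# The four pairs compared in the listening samples.
PAIRS = [
    ("Anger", "Anxiety"),
    ("Surprise", "Alert"),
    ("Pleasure", "Excited"),
    ("Relaxed", "Protected"),
]

DIMENSIONS = ("pleasure", "arousal", "dominance")


def pad_distance(a, b):
    """Euclidean distance between two named emotions in PAD space.

    Assumption: Euclidean distance is used only for illustration.
    """
    return math.dist(PAD[a], PAD[b])


def dimension_gaps(a, b):
    """Per-dimension difference (b minus a), rounded to two decimals."""
    return {
        name: round(y - x, 2)
        for name, x, y in zip(DIMENSIONS, PAD[a], PAD[b])
    }


if __name__ == "__main__":
    for a, b in PAIRS:
        print(f"{a} vs. {b}: distance={pad_distance(a, b):.3f}, "
              f"gaps={dimension_gaps(a, b)}")
```

Read this way, Anger and Anxiety share the same arousal value and differ only in pleasure and dominance, so that pair isolates the effect of those two dimensions at a fixed arousal level.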

(B) Emotion Cloning Evaluation

(B-1) Angry
MixedEmotion [1] CosyVoice [2] Proposed Ground-truth
Input Text: "Clear than clear water!"
Input Text: "Andy what's the gyre and to gimble."
Input Text: "That was his chief thought."
Input Text: "At the end of four."
Input Text: "At the roots-of a bush of a grass."
(B-2) Happy
MixedEmotion [1] CosyVoice [2] Proposed Ground-truth
Input Text: "Clear than clear water!"
Input Text: "Andy what's the gyre and to gimble."
Input Text: "That was his chief thought."
Input Text: "At the end of four."
Input Text: "At the roots-of a bush of a grass."
(B-3) Surprise
MixedEmotion [1] CosyVoice [2] Proposed Ground-truth
Input Text: "Clear than clear water!"
Input Text: "Andy what's the gyre and to gimble."
Input Text: "That was his chief thought."
Input Text: "At the end of four."
Input Text: "At the roots-of a bush of a grass."
(B-4) Sad
MixedEmotion [1] CosyVoice [2] Proposed Ground-truth
Input Text: "Clear than clear water!"
Input Text: "Andy what's the gyre and to gimble."
Input Text: "That was his chief thought."
Input Text: "At the end of four."
Input Text: "At the roots-of a bush of a grass."

(C) Zero-Shot TTS Evaluation

CosyVoice [2] Proposed Ground-truth
Input Text: "We waited in fact till two nights later; but that same evening, before we scattered, he brought out what was in his mind."
Input Text: "Run back, Uncas, and bring me the size of the singer's foot."
Input Text: "So choose for yourself-to make a rush or tarry here."

[1] K. Zhou et al., "Speech Synthesis with Mixed Emotions", IEEE Transactions on Affective Computing, 2023.

[2] Z. Du et al., "CosyVoice: A Scalable Multi-lingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens".