Abstract


[Code]   [arXiv]

Achieving high-fidelity audio compression while preserving perceptual quality across diverse audio types remains a significant challenge in Neural Audio Coding (NAC). This paper introduces MUFFIN, a fully convolutional NAC framework that leverages psychoacoustically guided multi-band frequency reconstruction. Central to MUFFIN is the Multi-Band Spectral Residual Vector Quantization (MBS-RVQ) mechanism, which quantizes the latent speech representation across distinct frequency bands. Guided by psychoacoustic studies, this design optimizes bitrate allocation and enhances fidelity, achieving efficient compression with perceptually meaningful features that separate content from speaker attributes through distinct codebooks. MUFFIN integrates a transformer-inspired convolutional architecture with a proposed modified snake activation function to capture fine frequency details with greater precision. Extensive evaluations on diverse datasets (LibriTTS, IEMOCAP, GTZAN, BBC) demonstrate that MUFFIN consistently surpasses existing codecs in audio reconstruction across domains. Notably, a high-compression variant achieves a state-of-the-art 12.5 Hz token rate while preserving reconstruction quality. Furthermore, MUFFIN excels in downstream generative tasks, demonstrating its potential as a robust token representation for integration with large language models. These results establish MUFFIN as a groundbreaking advancement in NAC and as the first neural psychoacoustic coding system.
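For context, the sketch below shows the standard snake activation that MUFFIN's modified variant builds on, snake(x) = x + sin²(αx)/α with a learnable per-channel frequency α; the exact modification used in MUFFIN is described in the paper, and the class and parameter names here are illustrative.

```python
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Standard snake activation: x + (1/alpha) * sin^2(alpha * x).

    A minimal sketch of the periodic activation family that MUFFIN's
    "modified snake" builds on; the paper's specific modification is
    not reproduced here. `alpha` is a learnable per-channel frequency.
    """

    def __init__(self, channels: int, alpha_init: float = 1.0):
        super().__init__()
        # One frequency parameter per channel, broadcast over (B, C, T) inputs.
        self.alpha = nn.Parameter(alpha_init * torch.ones(1, channels, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Small epsilon keeps the division stable as alpha approaches zero.
        return x + torch.sin(self.alpha * x) ** 2 / (self.alpha + 1e-9)
```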


System Overview



Fig. 1 Architecture of MUFFIN, incorporating a fully convolutional structure. Each autoencoder block implements transformer-like operations through (1) a multi-receptive-field communication layer for spatial dependency modeling, and (2) an inverted bottleneck layer for increased neural complexity.
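To make the caption concrete, here is a minimal PyTorch sketch of such a block, assuming a ConvNeXt-style realization; the class name, kernel sizes, expansion factor, and the GELU standing in for MUFFIN's modified snake activation are all illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class TransformerLikeConvBlock(nn.Module):
    """Hypothetical sketch of the Fig. 1 autoencoder block (names assumed)."""

    def __init__(self, channels: int, kernels=(3, 7, 11), expansion: int = 4):
        super().__init__()
        # (1) Multi-receptive-field communication layer: parallel depthwise
        # convolutions with different kernel sizes mix information over time,
        # playing the role that self-attention plays in a transformer.
        self.communicate = nn.ModuleList(
            nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernels
        )
        self.norm = nn.GroupNorm(1, channels)  # LayerNorm-like, over channels
        # (2) Inverted bottleneck: pointwise expand -> nonlinearity -> project,
        # the convolutional analogue of a transformer feed-forward layer.
        # GELU is a stand-in; the paper uses a modified snake activation.
        self.bottleneck = nn.Sequential(
            nn.Conv1d(channels, expansion * channels, 1),
            nn.GELU(),
            nn.Conv1d(expansion * channels, channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mixed = sum(conv(x) for conv in self.communicate) / len(self.communicate)
        return x + self.bottleneck(self.norm(mixed))  # residual, as in a transformer
```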


Speech Samples



(A) Audio Reconstruction

DAC EnCodec HiFi-Codec MUFFIN (75 Hz) MUFFIN (25 Hz) MUFFIN (12.5 Hz) Target

(B) Music Reconstruction

DAC EnCodec HiFi-Codec MUFFIN (75 Hz) MUFFIN (25 Hz) MUFFIN (12.5 Hz) Target

(C) Emotional Speech Reconstruction

DAC EnCodec HiFi-Codec MUFFIN (75 Hz) MUFFIN (25 Hz) MUFFIN (12.5 Hz) Mimi Target

(D) LibriTTS Test-Clean Reconstruction

DAC EnCodec HiFi-Codec MUFFIN (75 Hz) MUFFIN (25 Hz) MUFFIN (12.5 Hz) Mimi Target

(E) LibriTTS Test-Other Reconstruction

DAC EnCodec HiFi-Codec MUFFIN (75 Hz) MUFFIN (25 Hz) MUFFIN (12.5 Hz) Mimi Target

(F) Psychoacoustic Codebook Auditory Analysis

Remark:
We find that, through MBS-RVQ, MUFFIN learns to quantize latent speech features such that speech attributes are segmented across distinct frequency bands, achieving weak disentanglement without labels (no supervision required). Specifically, we identify the following roles of the learned codebooks (a minimal code sketch follows the list):
  • Codebook 1: Encodes features critical for high-level speech intelligibility, focusing on content-related attributes;
  • Codebook 2: Captures articulation-related properties, including formants, which are essential for the clarity and precision of consonant sounds;
  • Codebook 3: Represents speaker-specific attributes, such as timbre, contributing to the identification and characterization of individual speakers.
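To illustrate how band-wise quantization can induce this role separation, here is a hypothetical PyTorch sketch of multi-band residual vector quantization; the FFT-mask band split, the band edges, and all names are illustrative assumptions and do not reproduce MUFFIN's psychoacoustically derived band allocation or training losses.

```python
import torch
import torch.nn as nn

def bandpass(x: torch.Tensor, lo: float, hi: float) -> torch.Tensor:
    """Keep only the [lo, hi) fraction of the latent's temporal spectrum."""
    spec = torch.fft.rfft(x, dim=1)                        # x: (B, T, D)
    mask = torch.zeros(spec.shape[1], device=x.device)
    mask[int(lo * spec.shape[1]):int(hi * spec.shape[1])] = 1.0
    return torch.fft.irfft(spec * mask.view(1, -1, 1), n=x.shape[1], dim=1)

class MultiBandRVQSketch(nn.Module):
    """Hypothetical multi-band residual VQ: one codebook per band, so each
    codebook specializes in one spectral region (content / articulation /
    speaker attributes, per the roles listed above)."""

    def __init__(self, dim: int = 256, codebook_size: int = 1024,
                 bands=((0.0, 0.25), (0.25, 0.5), (0.5, 1.0))):  # assumed edges
        super().__init__()
        self.bands = bands
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in bands
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:    # z: (B, T, D)
        residual, quantized = z, torch.zeros_like(z)
        for (lo, hi), cb in zip(self.bands, self.codebooks):
            target = bandpass(residual, lo, hi)             # band-limited view
            codes = torch.cdist(target.flatten(0, 1), cb.weight).argmin(-1)
            q = cb(codes).view_as(z)
            q = target + (q - target).detach()              # straight-through
            quantized = quantized + q
            residual = residual - q                         # next stage sees the rest
        return quantized
```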
Target Codebook 1 Codebook 2 Codebook 2 + Codebook 1 Codebook 3 Codebook 3 + Codebook 1
Speech (MUFFIN with MBS-RVQ)
Non-stationary or transient sound: Music (MUFFIN with MBS-RVQ)
Speech (MUFFIN with plain RVQ-Baseline)
Non-stationary or transient sound: Music (MUFFIN with plain RVQ-Baseline)

(G) Zero-Shot Text-to-Speech Synthesis

Prompt VALL-E w/ EnCodec VALL-E w/ HiFi-Codec VALL-E w/ MUFFIN (75 Hz)
Input Text: Then, turning to Jane, she asked, in a somewhat altered tone, "Has she been a good girl, Jane?"
Input Text: After early nightfall the yellow lamps would light up, here and there, the squalid quarter of the brothels.
Input Text: So choose for yourself to make a rush or tarry here.
Input Text: This I took for a sign that he had himself something to produce and that we should only have to wait.
Input Text: But it's not the first occurrence of its charming kind that I know to have involved a child.
Note: Our MUFFIN neural psychoacoustic coding naturally disentangles speech attributes without labels, guided by psychoacoustic studies. This positions our approach as an innovative and simpler alternative to FACodec, achieving similar factorization goals with a more straightforward optimization process, and it could significantly advance the study of low-resource factorized speech representation learning.