Abstract


[Code]   [arXiv]

Achieving high-fidelity audio compression while preserving perceptual quality across diverse audio types remains a significant challenge in Neural Audio Coding (NAC). This paper introduces MUFFIN, a fully convolutional NAC framework that leverages psychoacoustically guided multi-band frequency reconstruction. Central to MUFFIN is the Multi-Band Spectral Residual Vector Quantization (MBS-RVQ) mechanism, which quantizes the latent speech representation across distinct frequency bands. Guided by psychoacoustic studies, this design optimizes bitrate allocation and enhances fidelity, achieving efficient compression with perceptually meaningful features that separate content from speaker attributes through distinct codebooks. MUFFIN integrates a transformer-inspired convolutional architecture with a proposed modified snake activation function to capture fine frequency details with greater precision. Extensive evaluations on diverse datasets (LibriTTS, IEMOCAP, GTZAN, BBC) demonstrate that MUFFIN consistently surpasses existing codecs in audio reconstruction across domains. Notably, a high-compression variant achieves a state-of-the-art 12.5 Hz token rate while preserving reconstruction quality. Furthermore, MUFFIN excels in downstream generative tasks, demonstrating its potential as a robust token representation for integration with large language models. These results establish MUFFIN as a groundbreaking advancement in NAC and as the first neural psychoacoustic coding system.
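For context, the sketch below shows the standard snake activation that MUFFIN's modified variant builds on, snake(x) = x + sin²(αx)/α with a learnable per-channel frequency α; the exact modification used in MUFFIN is described in the paper, and the class and parameter names here are illustrative.

```python
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Standard snake activation: x + (1/alpha) * sin^2(alpha * x).

    A minimal sketch of the periodic activation family that MUFFIN's
    "modified snake" builds on; the paper's specific modification is
    not reproduced here. `alpha` is a learnable per-channel frequency.
    """

    def __init__(self, channels: int, alpha_init: float = 1.0):
        super().__init__()
        # One frequency parameter per channel, broadcast over (B, C, T) inputs.
        self.alpha = nn.Parameter(alpha_init * torch.ones(1, channels, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Small epsilon keeps the division stable as alpha approaches zero.
        return x + torch.sin(self.alpha * x) ** 2 / (self.alpha + 1e-9)
```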


System Overview



Fig. 1 Architecture of MUFFIN, incorporating a fully convolutional structure. Each autoencoder block implements transformer-like operations through (1) a multi-receptive-field communication layer for spatial dependency modeling, and (2) an inverted bottleneck layer for increased neural complexity.
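To make the caption concrete, here is a minimal PyTorch sketch of such a block, assuming a ConvNeXt-style realization; the class name, kernel sizes, expansion factor, and the GELU standing in for MUFFIN's modified snake activation are all illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class TransformerLikeConvBlock(nn.Module):
    """Hypothetical sketch of the Fig. 1 autoencoder block (names assumed)."""

    def __init__(self, channels: int, kernels=(3, 7, 11), expansion: int = 4):
        super().__init__()
        # (1) Multi-receptive-field communication layer: parallel depthwise
        # convolutions with different kernel sizes mix information over time,
        # playing the role that self-attention plays in a transformer.
        self.communicate = nn.ModuleList(
            nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernels
        )
        self.norm = nn.GroupNorm(1, channels)  # LayerNorm-like, over channels
        # (2) Inverted bottleneck: pointwise expand -> nonlinearity -> project,
        # the convolutional analogue of a transformer feed-forward layer.
        # GELU is a stand-in; the paper uses a modified snake activation.
        self.bottleneck = nn.Sequential(
            nn.Conv1d(channels, expansion * channels, 1),
            nn.GELU(),
            nn.Conv1d(expansion * channels, channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mixed = sum(conv(x) for conv in self.communicate) / len(self.communicate)
        return x + self.bottleneck(self.norm(mixed))  # residual, as in a transformer
```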


Speech Samples



(A) Audio Reconstruction

DAC EnCodec HiFi-Codec MUFFIN (75 Hz) MUFFIN (25 Hz) MUFFIN (12.5 Hz) Target

(B) Music Reconstruction

DAC EnCodec HiFi-Codec MUFFIN (75 Hz) MUFFIN (25 Hz) MUFFIN (12.5 Hz) Target

(C) Emotional Speech Reconstruction

DAC EnCodec HiFi-Codec MUFFIN (75 Hz) MUFFIN (25 Hz) MUFFIN (12.5 Hz) Mimi Target

(D) LibriTTS Test-Clean Reconstruction

DAC EnCodec HiFi-Codec MUFFIN (75 Hz) MUFFIN (25 Hz) MUFFIN (12.5 Hz) Mimi Target

(E) LibriTTS Test-Other Reconstruction

DAC EnCodec HiFi-Codec MUFFIN (75 Hz) MUFFIN (25 Hz) MUFFIN (12.5 Hz) Mimi Target

(F) Psychoacoustic Codebook Auditory Analysis

Remark:
We find that, through MBS-RVQ, MUFFIN learns to quantize latent speech features such that speech attributes are segmented across distinct frequency bands, achieving weak disentanglement without labels (no supervision required). Specifically, we identify the following roles of the learned codebooks (a minimal code sketch follows the list):
  • Codebook 1: Encodes features critical for high-level speech intelligibility, focusing on content-related attributes;
  • Codebook 2: Captures articulation-related properties, including formants, which are essential for the clarity and precision of consonant sounds;
  • Codebook 3: Represents speaker-specific attributes, such as timbre, contributing to the identification and characterization of individual speakers.
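To illustrate how band-wise quantization can induce this role separation, here is a hypothetical PyTorch sketch of multi-band residual vector quantization; the FFT-mask band split, the band edges, and all names are illustrative assumptions and do not reproduce MUFFIN's psychoacoustically derived band allocation or training losses.

```python
import torch
import torch.nn as nn

def bandpass(x: torch.Tensor, lo: float, hi: float) -> torch.Tensor:
    """Keep only the [lo, hi) fraction of the latent's temporal spectrum."""
    spec = torch.fft.rfft(x, dim=1)                        # x: (B, T, D)
    mask = torch.zeros(spec.shape[1], device=x.device)
    mask[int(lo * spec.shape[1]):int(hi * spec.shape[1])] = 1.0
    return torch.fft.irfft(spec * mask.view(1, -1, 1), n=x.shape[1], dim=1)

class MultiBandRVQSketch(nn.Module):
    """Hypothetical multi-band residual VQ: one codebook per band, so each
    codebook specializes in one spectral region (content / articulation /
    speaker attributes, per the roles listed above)."""

    def __init__(self, dim: int = 256, codebook_size: int = 1024,
                 bands=((0.0, 0.25), (0.25, 0.5), (0.5, 1.0))):  # assumed edges
        super().__init__()
        self.bands = bands
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in bands
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:    # z: (B, T, D)
        residual, quantized = z, torch.zeros_like(z)
        for (lo, hi), cb in zip(self.bands, self.codebooks):
            target = bandpass(residual, lo, hi)             # band-limited view
            codes = torch.cdist(target.flatten(0, 1), cb.weight).argmin(-1)
            q = cb(codes).view_as(z)
            q = target + (q - target).detach()              # straight-through
            quantized = quantized + q
            residual = residual - q                         # next stage sees the rest
        return quantized
```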
Target Codebook 1 Codebook 2 Codebook 2 + Codebook 1 Codebook 3 Codebook 3 + Codebook 1
Speech (MUFFIN with MBS-RVQ)
Non-stationary or transient sound: Music (MUFFIN with MBS-RVQ)
Speech (MUFFIN with plain RVQ-Baseline)
Non-stationary or transient sound: Music (MUFFIN with plain RVQ-Baseline)

(G) Zero-Shot Text-to-Speech Synthesis

Prompt VALL-E w/ EnCodec VALL-E w/ HiFi-Codec VALL-E w/ MUFFIN (75 Hz)
Input Text: Then, turning to Jane, she asked, in a somewhat altered tone, "Has she been a good girl, Jane?"
Input Text: After early nightfall the yellow lamps would light up, here and there, the squalid quarter of the brothels.
Input Text: So choose for yourself to make a rush or tarry here.
Input Text: This I took for a sign that he had himself something to produce and that we should only have to wait.
Input Text: But it's not the first occurrence of its charming kind that I know to have involved a child.
Note: Our MUFFIN neural psychoacoustic coding naturally disentangles speech attributes without labels, guided by psychoacoustic studies. This positions our approach as an innovative and simpler alternative to FACodec, achieving similar factorization goals with a more straightforward optimization process, and it could significantly advance the study of low-resource factorized speech representation learning.