Beyond Spectrograms: Rethinking Audio Classification from EnCodec's Latent Space

Jorge Perianez-Pascual, Juan D. Gutiérrez, Laura Escobar-Encinas, Álvaro Rubio-Largo, Roberto Rodriguez-Echeverria

February 2025 Quercus, I3lab

Abstract

This article presents an approach to audio classification that leverages the latent representation produced by Meta’s EnCodec neural audio codec. We hypothesize that the compressed latent space captures essential audio features and offers a representation better suited to classification than traditional spectrogram-based inputs. To test this hypothesis, we train a standard convolutional neural network on EnCodec’s encoder output to classify music genres, distinguish speech from music, and recognize environmental sounds, and we compare its performance with that of the same network trained on a spectrogram-based representation. Our experiments show that this method achieves accuracy comparable to state-of-the-art techniques, with significantly faster convergence and lower computational load during training. These results highlight the potential of EnCodec’s latent representation for faster, more efficient, and lower-cost audio classification. We also analyze the characteristics of EnCodec’s output to better understand the advantages of this approach over spectrogram-based methods.
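To give a sense of the input the classifier would receive, the sketch below estimates the shape of the encoder output for a mono clip. It is an illustration only, not the authors' pipeline: the sample rate, encoder strides, and latent dimension are assumed to match the publicly released 24 kHz EnCodec model, and the helper name `latent_shape` is hypothetical. The abstract does not state which EnCodec configuration was used.

```python
from math import ceil, prod

# Assumed values for the public 24 kHz EnCodec model; the abstract does not
# confirm which configuration the authors used.
SAMPLE_RATE = 24_000             # samples per second
ENCODER_STRIDES = (2, 4, 5, 8)   # downsampling ratios of the encoder's conv stack
LATENT_DIM = 128                 # channels in the encoder output


def latent_shape(duration_s: float) -> tuple[int, int]:
    """Return (channels, frames) of the encoder output for a mono clip.

    Hypothetical helper: it only does the arithmetic and loads no model.
    """
    hop = prod(ENCODER_STRIDES)                   # 320 samples per latent frame
    n_samples = round(duration_s * SAMPLE_RATE)
    # Assumes the encoder pads the final partial frame rather than dropping it.
    return LATENT_DIM, ceil(n_samples / hop)


if __name__ == "__main__":
    channels, frames = latent_shape(30.0)
    print(f"30 s clip -> latent of {channels} channels x {frames} frames")
```

Under these assumptions the latent can be treated as a two-dimensional array (channels by frames) and passed to a convolutional network in the same way as a spectrogram.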

Type

Journal article

Publication

Algorithms

Artificial Intelligence, Audio Classification, Deep Learning, Foundation Models