Beyond Spectrograms: Rethinking Audio Classification from EnCodec's Latent Space

Abstract

This article presents an approach to audio classification that uses the latent representation produced by Meta’s EnCodec neural audio codec. Our hypothesis is that the compressed latent space captures the essential features of the audio and is therefore better suited to classification than traditional spectrogram-based representations. To test this hypothesis, we train a standard convolutional neural network on three tasks (music genre classification, speech versus music discrimination, and environmental sound recognition) using the output of EnCodec’s encoder as input, and compare its performance with that of the same network trained on a spectrogram-based representation. Our experiments show that the method achieves accuracy comparable to state-of-the-art techniques, with significantly faster convergence and a lower computational load during training. These results highlight the potential of EnCodec’s latent representation for faster and lower-cost audio classification. We also analyze the characteristics of EnCodec’s output to better understand the advantages of this approach.
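As a rough illustration of what "EnCodec's encoder output as input" means in terms of tensor size, the sketch below computes the shape of the latent representation for a clip of a given length. It assumes the publicly released 24 kHz EnCodec model, whose encoder downsamples with strides 2, 4, 5 and 8 (one 128-dimensional latent frame per 320 samples, about 75 frames per second). The abstract does not say which EnCodec variant or clip length was used, so these values and the helper name `encodec_latent_shape` are assumptions for illustration only.

```python
import math


def encodec_latent_shape(num_samples, strides=(2, 4, 5, 8), latent_dim=128):
    """Approximate (channels, frames) shape of the encoder output.

    Assumed values: strides and latent_dim match the released 24 kHz
    EnCodec model; they are not taken from the article itself.
    """
    hop = math.prod(strides)               # total downsampling factor (320)
    frames = math.ceil(num_samples / hop)  # the encoder pads the last partial frame
    return latent_dim, frames


# One second and thirty seconds of 24 kHz mono audio (hypothetical clip lengths).
one_second = encodec_latent_shape(24_000)
thirty_seconds = encodec_latent_shape(30 * 24_000)
print(one_second, thirty_seconds)
```

The resulting two-dimensional array (latent channels by time frames) can be fed to a convolutional network in the same way as a spectrogram (frequency bins by time frames), which is what makes a like-for-like comparison with the same network possible.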

Publication
Algorithms
Álvaro Rubio-Largo
INTIA Secretary and Associate Professor

Academic Secretary of INTIA and Associate Professor at the University of Extremadura.

Roberto Rodriguez-Echeverria
INTIA Director and Associate Professor

Associate Professor at the University of Extremadura. Passionate about software, deep learner, MTB rider and father of two.