Spectrogram Features for Audio and Speech Analysis

  • Ian McLoughlin
  • Lam Pham
  • Yan Song
  • Xiao Xiao Miao
  • Huy Phan
  • Pengfei Cai
  • Qing Gu
  • Jiang Nan
  • Haoyu Song
  • Donny Soh
  • Singapore Institute of Technology
  • The University of Science and Technology of China
  • Duke Kunshan University (DKU)
  • Meta Inc.

Research output: Contribution to journal › Article › peer-review

Abstract

Spectrogram-based representations have grown to dominate the feature space for deep learning audio analysis systems, and are often adopted for speech analysis also. Initially, the primary motivation behind spectrogram-based representations was their ability to present sound as a two-dimensional signal in the time–frequency plane, which not only provides an interpretable physical basis for analysing sound, but also unlocks the use of a range of machine learning techniques such as convolutional neural networks, which had been developed for image processing. A spectrogram is a matrix characterised by the resolution and span of its dimensions, as well as by the representation and scaling of each element. Many possibilities for these three characteristics have been explored by researchers across numerous application areas, with different settings showing affinity for various tasks. This paper reviews the use of spectrogram-based representations and surveys the state-of-the-art to question how front-end feature representation choice allies with back-end classifier architecture for different tasks.
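The three characteristics named in the abstract (resolution and span of the dimensions, and the scaling of each element) can be illustrated with a minimal sketch. This is not taken from the paper: the function name `spectrogram`, the parameters `frame_len`, `hop`, and `log_scale`, and the test tone are all illustrative assumptions, and a naive DFT is used in place of an FFT purely for readability.

```python
import cmath
import math


def spectrogram(signal, sample_rate, frame_len=64, hop=32, log_scale=True):
    """Return (matrix, freq_step_hz, time_step_s) for a real-valued signal.

    matrix[t][k] holds the magnitude (or dB value) of frequency bin k
    in frame t. All names and defaults here are hypothetical choices.
    """
    # Periodic Hann window to reduce spectral leakage at frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # keep non-negative frequencies only
    matrix = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(n_bins):
            # Naive DFT for clarity; a practical system would use an FFT.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mag = abs(acc)
            # Element scaling: linear magnitude or decibels.
            row.append(20 * math.log10(mag + 1e-10) if log_scale else mag)
        matrix.append(row)
    # Frequency resolution and time resolution of the matrix axes.
    return matrix, sample_rate / frame_len, hop / sample_rate


# Usage: a 1 kHz tone sampled at 8 kHz (assumed test signal).
sr = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sr) for n in range(512)]
spec, df, dt = spectrogram(tone, sr)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), len(spec[0]), df, dt, peak_bin * df)
```

In this sketch, `frame_len` sets the frequency resolution (`sample_rate / frame_len` Hz per bin), `hop` sets the time resolution, the signal length and sample rate set the span of each axis, and `log_scale` switches the element representation between linear magnitude and decibels.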
Original language: English
Article number: 572
Number of pages: 29
Journal: Applied Sciences (Switzerland)
Volume: 16
Issue number: 2
Publication status: Published - 6 Jan 2026

Research Field

  • Multimodal Analytics

Keywords

  • spectrogram
  • spectrogram image feature
  • mel-frequency spectrogram
  • mel frequency cepstral coefficient (MFCC)
  • constant-Q transform
  • audio analysis
  • speech classification
