Spatio-temporal attention pooling for audio scene classification

Oliver Y. Chen, Lam Pham, Philipp Koch, Maarten De Vos, Alfred Mertins (School of Computing, The University of Kent)

Research output: Chapter in Book or Conference Proceedings › Conference Proceedings with Oral Presentation › peer-review

Abstract

Acoustic scenes are rich and redundant in their content. In this work, we present a spatio-temporal attention pooling layer coupled with a convolutional recurrent neural network to learn from patterns that are discriminative while suppressing those that are irrelevant for acoustic scene classification. The convolutional layers in this network learn invariant features from time-frequency input. The bidirectional recurrent layers are then able to encode the temporal dynamics of the resulting convolutional features. Afterwards, a two-dimensional attention mask is formed via the outer product of the spatial and temporal attention vectors learned from two designated attention layers to weight and pool the recurrent output into a final feature vector for classification. The network is trained with between-class examples generated from between-class data augmentation. Experiments demonstrate that the proposed method not only outperforms a strong convolutional neural network baseline but also sets new state-of-the-art performance on the LITIS Rouen dataset.
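
The abstract describes the pooling mechanism only in prose, so the following is a minimal sketch of the idea, assuming PyTorch and illustrative sizes (feat_dim, time_steps, and att_dim are hypothetical, not the paper's values): a temporal attention vector and a spatial attention vector are learned from the recurrent output, combined via an outer product into a two-dimensional mask, and used to weight and pool that output into a single feature vector.

# Sketch of spatio-temporal attention pooling over a recurrent output
# of shape (batch, time_steps, feat_dim). Layer sizes are illustrative.
import torch
import torch.nn as nn


class SpatioTemporalAttentionPooling(nn.Module):
    def __init__(self, feat_dim: int, time_steps: int, att_dim: int = 64):
        super().__init__()
        # Attention layer over the feature (spatial) axis, computed from the
        # time profile of each feature channel.
        self.spatial_att = nn.Sequential(
            nn.Linear(time_steps, att_dim), nn.Tanh(), nn.Linear(att_dim, 1)
        )
        # Attention layer over the time axis, computed per time step.
        self.temporal_att = nn.Sequential(
            nn.Linear(feat_dim, att_dim), nn.Tanh(), nn.Linear(att_dim, 1)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, time_steps, feat_dim), e.g. bidirectional RNN output.
        # Temporal attention vector: one weight per time step.
        a_t = torch.softmax(self.temporal_att(h).squeeze(-1), dim=1)            # (B, T)
        # Spatial attention vector: one weight per feature dimension.
        a_s = torch.softmax(self.spatial_att(h.transpose(1, 2)).squeeze(-1), dim=1)  # (B, F)
        # Outer product of the two vectors forms the 2-D attention mask.
        mask = a_t.unsqueeze(2) * a_s.unsqueeze(1)                               # (B, T, F)
        # Weight the recurrent output and pool over time into one feature vector.
        return (h * mask).sum(dim=1)                                             # (B, F)


# Hypothetical usage: pool the recurrent output before a classification head.
if __name__ == "__main__":
    pooling = SpatioTemporalAttentionPooling(feat_dim=256, time_steps=100)
    recurrent_out = torch.randn(8, 100, 256)
    pooled = pooling(recurrent_out)
    print(pooled.shape)  # torch.Size([8, 256])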
Original language: English
Title of host publication: INTERSPEECH 2019
Pages: 3845-3849
Publication status: Published - Sept 2019
Event: Interspeech 2019 - Graz, Austria
Duration: 15 Sept 2019 - 19 Sept 2019

Conference

Conference: Interspeech 2019
Country/Territory: Austria
City: Graz
Period: 15/09/19 - 19/09/19

Research Field

  • Former Research Field - Data Science
