Spatio-temporal attention pooling for audio scene classification

Oliver Y. Chen, Lam Pham, Philipp Koch, Maarten De Vos, Alfred Mertins (School of Computing, The University of Kent)

Publication: Contribution to book or conference proceedings › Talk with paper in conference proceedings › Peer-reviewed

Abstract

Acoustic scenes are rich and redundant in their content. In this work, we present a spatio-temporal attention pooling layer coupled with a convolutional recurrent neural network to learn from patterns that are discriminative while suppressing those that are irrelevant for acoustic scene classification. The convolutional layers in this network learn invariant features from time-frequency input. The bidirectional recurrent layers are then able to encode the temporal dynamics of the resulting convolutional features. Afterwards, a two-dimensional attention mask is formed via the outer product of the spatial and temporal attention vectors learned from two designated attention layers to weight and pool the recurrent output into a final feature vector for classification. The network is trained with between-class examples generated from between-class data augmentation. Experiments demonstrate that the proposed method not only outperforms a strong convolutional neural network baseline but also sets new state-of-the-art performance on the LITIS Rouen dataset.
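The outer-product pooling described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the shapes, the linear-plus-softmax form of the two attention layers, and the weight names `w_t`/`w_s` are assumptions made for the example.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D array
    e = np.exp(x - x.max())
    return e / e.sum()

def spatio_temporal_attention_pool(H, w_t, w_s):
    """Pool a recurrent output H of shape (T, D) into one feature vector.

    w_t: (D,) weights of an assumed temporal attention layer
    w_s: (T,) weights of an assumed spatial attention layer
    """
    a_t = softmax(H @ w_t)      # temporal attention vector, shape (T,)
    a_s = softmax(H.T @ w_s)    # spatial attention vector, shape (D,)
    A = np.outer(a_t, a_s)      # 2-D attention mask, shape (T, D)
    return (A * H).sum(axis=0)  # weighted pooling over time -> (D,)

rng = np.random.default_rng(0)
H = rng.standard_normal((20, 8))   # e.g. T=20 time steps, D=8 recurrent features
w_t = rng.standard_normal(8)
w_s = rng.standard_normal(20)
z = spatio_temporal_attention_pool(H, w_t, w_s)
print(z.shape)  # (8,)
```

Because both attention vectors are softmax-normalized, the mask `A` sums to one, so the pooled vector is a convex-like weighting of the recurrent output rather than a plain average.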
Original language: English
Title: INTERSPEECH, 2019
Pages: 3845-3849
Publication status: Published - September 2019
Event: Interspeech 2019 - Graz, Austria
Duration: 15 September 2019 - 19 September 2019

Conference

Conference: Interspeech 2019
Country/Territory: Austria
City: Graz
Period: 15/09/19 - 19/09/19

Research Field

  • Data Science
