Deepfake Audio Detection Using Spectrogram-based Feature and Ensemble of Deep Learning Models

Publication: Contribution to book or conference proceedings › Conference contribution with poster presentation › Peer-reviewed

Abstract

In this paper, we propose a deep-learning-based system for the task of deepfake audio detection. This work is part of the proposed toolchain for speech analysis in the EUCINF (EUropean Cyber and INFormation) project, a European project with multiple partners across Europe. In particular, the raw input audio is first transformed into various spectrograms using three transformation methods, Short-time Fourier Transform (STFT), Constant-Q Transform (CQT), and Wavelet Transform (WT), combined with different auditory-based filters: Mel, Gammatone, linear filters (LF), and the discrete cosine transform (DCT). Given the spectrograms, we evaluate a wide range of classification models based on three deep learning approaches. The first approach trains the spectrograms on our proposed baseline models: a CNN-based model (CNN-baseline), an RNN-based model (RNN-baseline), and a C-RNN model (C-RNN baseline). The second approach applies transfer learning from computer vision models such as ResNet-18, MobileNet-V3, EfficientNet-B0, DenseNet-121, ShuffleNet-V2, Swin-T, ConvNeXt-Tiny, GoogLeNet, MNASNet, and RegNet. In the third approach, we leverage the state-of-the-art audio pre-trained models Whisper, Seamless, SpeechBrain, and Pyannote to extract audio embeddings from the input spectrograms. The audio embeddings are then fed to a multilayer perceptron (MLP) model to classify audio samples as fake or real. Finally, high-performing deep learning models from these approaches are fused to achieve the best performance. We evaluated our proposed models on the ASVspoof 2019 benchmark dataset. Our best ensemble model achieved an Equal Error Rate (EER) of 0.03, which is highly competitive with top-performing systems in the ASVspoof 2019 challenge. Experimental results also highlight the potential of selective spectrograms and deep learning approaches to enhance model performance on the task of audio deepfake detection.
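The abstract reports performance as the Equal Error Rate (EER), the operating point where the false-acceptance rate equals the false-rejection rate. As an illustration only (this is not the authors' evaluation code, and the score/label conventions are assumptions), a minimal NumPy sketch of EER over detection scores with binary labels (1 = bona fide, 0 = spoof) might look like:

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal Error Rate for a binary detection task.

    scores: higher score = more likely bona fide (assumed convention)
    labels: 1 = bona fide, 0 = spoof
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Sort samples by descending score; sweep the threshold down the list.
    order = np.argsort(scores)[::-1]
    labels = labels[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    # Cumulative accepts at each candidate threshold.
    tp = np.cumsum(labels)        # bona fide correctly accepted
    fp = np.cumsum(1 - labels)    # spoofs wrongly accepted
    far = fp / n_neg              # false-acceptance rate
    frr = 1 - tp / n_pos          # false-rejection rate
    # EER is where the two error curves cross.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0
```

For example, perfectly separated scores (`[0.9, 0.8]` bona fide vs. `[0.2, 0.1]` spoof) give an EER of 0.0; the paper's reported 0.03 means the best ensemble misclassifies about 3% of trials at the crossover threshold.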
Original language: English
Title: 5th IEEE International Symposium on the Internet of Sounds 2024
Number of pages: 5
ISBN (electronic): 979-8-3503-6652-5
DOIs
Publication status: Published - 2024
Event: IEEE 5th International Symposium on the Internet of Sounds (IS2 2024), Erlangen, Germany
Duration: 30 Sept 2024 – 2 Oct 2024

Conference

Conference: IEEE 5th International Symposium on the Internet of Sounds (IS2 2024)
Country/Territory: Germany
City: Erlangen
Period: 30/09/24 – 02/10/24

Research Field

  • Multimodal Analytics

