GerDISDETECT: A German Multilabel Dataset for Disinformation Detection

Mina Schütz (author and speaker), Daniela Pisoiu, Daria Liakhovets, Alexander Schindler, Melanie Siegel

Publication: Contribution to book or conference proceedings › Talk with contribution in conference proceedings › Peer-reviewed

Abstract

Disinformation has become increasingly relevant in recent years, both as a political issue and as an object of research. Datasets for training machine learning models, especially for languages other than English, are sparse and costly to create. Annotated datasets often have only binary or multiclass labels, which provide little information about the grounds and system of such classifications. We propose GerDISDETECT, a novel textual dataset for German disinformation. Providing comprehensive analytical insights requires a fine-grained, taxonomy-guided annotation scheme. Instead of providing a direct assessment of true or false, the dataset aims to provide wide-ranging semantic descriptors that allow for complex interpretation as well as inferred decision-making regarding the information and trustworthiness of potentially critical articles. This also allows the dataset to be used for other tasks. The dataset was collected in the first three months of 2022 and contains 1,890 articles annotated with 39 multilabel classes in 5 top-level categories: General View (3 labels), Offensive Language (11 labels), Reporting Style (15 labels), Writing Style (6 labels), and Extremism (4 labels). As a baseline, we further pre-trained a multilingual XLM-R model on around 200,000 unlabeled news articles and fine-tuned it for each category.
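
The category structure described in the abstract can be illustrated with a minimal sketch. It assumes one multi-hot target vector per top-level category, which would match fine-tuning a separate model for each category. The abstract does not list the individual label names, so labels are addressed by hypothetical indices, and the helper names `multi_hot` and `encode_article` are illustrative only, not part of the dataset or its tooling.

```python
# Category sizes as stated in the abstract. Individual label names are not
# given there, so labels are referred to by hypothetical indices only.
CATEGORY_SIZES = {
    "General View": 3,
    "Offensive Language": 11,
    "Reporting Style": 15,
    "Writing Style": 6,
    "Extremism": 4,
}


def multi_hot(active, size):
    """Return a 0/1 list of length `size` with ones at the `active` indices."""
    active = set(active)
    if any(i < 0 or i >= size for i in active):
        raise ValueError("label index out of range")
    return [1 if i in active else 0 for i in range(size)]


def encode_article(annotations):
    """Encode one article as a dict of per-category multi-hot vectors.

    `annotations` maps a category name to the indices of its active labels;
    categories that are missing get an all-zero vector.
    """
    return {
        category: multi_hot(annotations.get(category, []), size)
        for category, size in CATEGORY_SIZES.items()
    }


# The five category sizes add up to the 39 classes named in the abstract.
total_labels = sum(CATEGORY_SIZES.values())

# Hypothetical annotation: two Writing Style labels and one Extremism label.
example = encode_article({"Writing Style": [0, 4], "Extremism": [2]})
```

Because an article can carry several labels within a category at once, each vector may contain more than one `1`, which is what distinguishes this multilabel setup from a multiclass one.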
Original language: English
Title: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Pages: 7683–7695
Number of pages: 13
Publication status: Published - 2024
Event: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) - Torino, Italy
Duration: 20 May 2024 to 25 May 2024

Conference

Conference: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Country/Territory: Italy
City: Torino
Period: 20/05/24 to 25/05/24

Research Field

  • Former Research Field - Data Science
