Abstract
This master's thesis explores the degree of similarity between 22 European languages, using the Needleman-Wunsch algorithm for the comparison. The comparison is performed on written text extracted from the website of the European Union with a custom-built web scraper. In addition to comparing the languages, the thesis analyzes how text size and preprocessing affect the comparison. To this end, texts of three different sizes are extracted for each language, and each text size is subjected to different combinations of four preprocessing techniques: lemmatization, removal of diacritics, removal of whitespace, and removal of punctuation. The results are visualized as dendrograms and heatmaps that show the relationships and similarities between the languages under investigation.
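To make the approach concrete, the following is a minimal sketch of three of the four preprocessing steps and of a Needleman-Wunsch global alignment score. It is not the thesis's implementation: the function names, the scoring values (match +1, mismatch -1, gap -1), and the length-based normalization in `similarity` are assumptions chosen for illustration, and lemmatization is left out because it requires a language-specific model.

```python
import unicodedata


def preprocess(text, strip_diacritics=True, strip_punctuation=True,
               strip_whitespace=True):
    """Apply optional preprocessing steps (lemmatization is not covered)."""
    if strip_diacritics:
        # Decompose characters (NFD) and drop the combining marks.
        # Letters without a decomposition (e.g. "ø") are left unchanged.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    if strip_punctuation:
        # Drop every character in a Unicode punctuation category (P*).
        text = "".join(ch for ch in text
                       if not unicodedata.category(ch).startswith("P"))
    if strip_whitespace:
        # split() without arguments splits on any run of whitespace.
        text = "".join(text.split())
    return text


def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    The scoring values are assumed defaults, not taken from the thesis.
    Only two rows of the dynamic-programming table are kept in memory.
    """
    # prev[j] is the best score for aligning the processed prefix of a
    # with b[:j]; the first row aligns the empty prefix against gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Hypothetical normalization: score divided by the longer length."""
    longest = max(len(a), len(b))
    return needleman_wunsch_score(a, b) / longest if longest else 1.0
```

Pairwise values of such a score over all language pairs would form the matrix from which heatmaps and dendrograms can be drawn; how the thesis derives its distances is not stated in the abstract.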
Original language | English |
---|---|
Qualification | Master of Science |
Awarding institution |
|
Supervisor / Advisor |
|
Date of approval | 9 Oct 2023 |
Publication status | Published - Oct 2023 |
Research Field
- Former Research Field - Data Science