Abstract
The rapid advancement of Large Language Models (LLMs) has transformed natural language processing, yet comprehensive evaluation methods are necessary to ensure their reliability, particularly in Retrieval-Augmented Generation (RAG) tasks. This study aims to evaluate and compare the performance of open-source LLMs by introducing a rigorous evaluation framework. We benchmark 20 LLMs using a combination of established metrics such as BLEU, ROUGE, and BERTScore, along with a novel metric, RAGAS. The models were tested across two distinct datasets to assess their text generation quality. Our findings reveal that models like nous-hermes-2-solar-10.7b and mistral-7b-instruct-v0.1 consistently excel in tasks requiring strict instruction adherence and effective use of large contexts, while other models show areas for improvement. This research contributes to the field by offering a comprehensive evaluation framework that aids in selecting the most suitable LLMs for complex RAG applications, with implications for future developments in natural language processing and big data analysis.
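As a rough illustration of the kind of scoring the abstract describes, the sketch below computes BLEU, ROUGE-L, and BERTScore for one model's generated answers against reference answers using the Hugging Face `evaluate` library. The tooling choice, the example strings, and the aggregation are assumptions for illustration only and are not taken from the paper; RAGAS scoring is omitted because it additionally requires an LLM judge and the retrieved contexts.

```python
# Minimal sketch (not the authors' code): score hypothetical model outputs
# against reference answers with established text-generation metrics.
import evaluate

predictions = ["Paris is the capital of France."]   # hypothetical model outputs
references = ["The capital of France is Paris."]    # hypothetical ground-truth answers

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

scores = {
    "bleu": bleu.compute(predictions=predictions, references=references)["bleu"],
    "rougeL": rouge.compute(predictions=predictions, references=references)["rougeL"],
    # BERTScore returns per-example F1 values; average them for a single number.
    "bertscore_f1": sum(
        bertscore.compute(predictions=predictions, references=references, lang="en")["f1"]
    ) / len(predictions),
}
print(scores)
```

Repeating such a loop per model and per dataset yields the comparison tables one would expect from a benchmark like the one described, with RAGAS added as a separate, judge-based step.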
Original language | English |
---|---|
Title | Proceedings - 2024 IEEE International Conference on Big Data, BigData 2024 |
Pages | 5342-5351 |
ISBN (electronic) | 979-8-3503-6248-0 |
DOIs | |
Publication status | Published - 16 Jan 2025 |
Event | 2024 IEEE International Conference on Big Data (BigData) - Washington, Washington, United States Duration: 15 Dec 2024 → 18 Dec 2024 |
Conference
Conference | 2024 IEEE International Conference on Big Data (BigData) |
---|---|
Country/Territory | United States |
City | Washington |
Period | 15/12/24 → 18/12/24 |
Research Field
- Cyber Security