Identifying Open-Source Threat Detection Resources on GitHub: A Scalable Machine Learning Approach

Publication: Contribution to journal, Article, Peer-reviewed

Abstract

Many businesses rely on open-source software modules to build their technology stacks. However, those who lack domain expertise may struggle to find the right software due to unfamiliar terminology and specific names. As a consequence, search engines and other platforms often cannot be utilized effectively to discover appropriate solutions. There is thus a need for a more applicable approach to assist non-domain experts in navigating the vastness of available repositories, enabling them to efficiently discover and select the right solution for their business needs. To overcome these gaps, we introduce an approach that supports finding unpopular yet important open-source software repositories on GitHub using advanced machine learning techniques. For this purpose, we propose novel strategies for information gathering and data pre-processing that resolve scalability issues of existing solutions and enable clustering of repositories even when topics, descriptions, or repository names are unclear or absent. For our evaluation, we gathered a dataset of 221,971 repositories using GitHub search and keywords related to incident detection. We show that our approach is able to separate threat detection repositories from others with an F1-score of 0.93.
Original language: English
Article number: 158
Number of pages: 14
Journal: International Journal of Information Security
Volume: 24
Issue: 4
Publication status: Published - 17 June 2025

Research Field

  • Cyber Security