Identifying Open-Source Threat Detection Resources on GitHub: A Scalable Machine Learning Approach

    Research output: Contribution to journal › Article › peer-review

    Abstract

    Many businesses rely on open-source software modules to build their technology stacks. However, those who lack domain expertise may struggle to find the right software due to unfamiliar terminology and specific names. As a consequence, search engines and other platforms often cannot be utilized effectively to discover appropriate solutions. There is thus a need for a more applicable approach to assist non-domain experts in navigating the vastness of available repositories, enabling them to efficiently discover and select the right solution for their business needs. To overcome these gaps, we introduce an approach that supports finding unpopular yet important open-source software repositories on GitHub using advanced machine learning techniques. For this purpose, we propose novel strategies for information gathering and data pre-processing that resolve scalability issues of existing solutions and enable clustering of repositories even when topics, descriptions, or repository names are unclear or absent. For our evaluation, we gathered a dataset of 221,971 repositories using GitHub search and keywords related to incident detection. We show that our approach is able to separate threat detection repositories from others with an F1-score of 0.93.
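The F1-score reported above is the harmonic mean of precision and recall for the positive class (here, threat detection repositories). The following is a minimal sketch of that computation, not the authors' evaluation code; the function name `f1_score` and the label lists are hypothetical and serve only as illustration.

```python
def f1_score(y_true, y_pred, positive=1):
    """Compute the F1-score for one class from parallel label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = threat detection repository, 0 = other.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.2f}")
```

With these made-up labels there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75 and the F1-score is 0.75.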
    Original language: English
    Article number: 158
    Number of pages: 14
    Journal: International Journal of Information Security
    Volume: 24
    Issue number: 4
    Publication status: Published - 17 Jun 2025

    Research Field

    • Cyber Security

    Keywords

    • ids
    • security
    • cluster
