Abstract
Cyber threats evolve continuously, and new attack techniques emerge rapidly. Anomaly detection (AD) in system log data is therefore an increasingly important task, as it can detect both known and previously unknown attacks. The configuration of AD algorithms depends heavily on the data and involves complex feature selection as well as the definition of parameters such as thresholds or window sizes. This process is not straightforward and often requires manual intervention by domain experts, which restricts the accessibility and effectiveness of AD algorithms. This work therefore introduces the Configuration-Engine (CE), a semi-supervised approach that automates the configuration of AD algorithms. The CE applies a data science approach to identify properties of parts of log lines. It uses a parser to recognize meaningful static and variable tokens in the log lines that AD detectors can analyze, and it categorizes variables based on their characteristics and behavior over time. Based on the requirements of the AD detectors at hand, the CE specifies which log parts a detector should observe and determines the appropriate configuration parameters. This thesis considers six detectors of the AMiner, an advanced AD pipeline encompassing a wide range of AD algorithms. Additionally, the CE contains an optimization approach for further refinement of configurations. The performance was evaluated on point and collective anomalies occurring in a set of Apache Access and audit datasets. For collective anomalies, the CE provided configurations that reached an average precision of over 0.95 for Apache and over 0.9 for audit datasets for 5 out of the 6 detectors, while maintaining a recall of 1.0 during detection. It thereby competes with the performance of handcrafted configurations by 3 different experts, which formed the baseline for the evaluation.
Additionally, the optimization improved the precision of both CE and expert configurations in 29 out of 32 cases for Apache data and in 6 out of 20 cases for audit data. Moreover, the configurations can be represented as dictionaries and thus compared for similarity using the Jaccard index. By this measure, the experts' configurations differ significantly from those of the CE. Meanwhile, the CE's configurations are remarkably similar to each other across various datasets, suggesting that CE configurations are portable across different datasets of the same type. The CE represents a significant advancement in AD: it reduces the need for domain expertise and manual configuration and makes AD more accessible and efficient across different datasets and detection techniques.
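To illustrate how configurations represented as dictionaries can be compared with the Jaccard index, the following is a minimal sketch. The dictionary layout (detector name mapped to a list of observed log paths), the detector names, and the paths are assumptions made for illustration only; the actual configuration schema of the thesis and of the AMiner may differ.

```python
def config_to_items(config):
    """Flatten a {detector: [paths]} dictionary into a set of (detector, path) pairs.

    The dictionary layout is a hypothetical stand-in for a detector configuration.
    """
    return {(detector, path) for detector, paths in config.items() for path in paths}


def jaccard_index(config_a, config_b):
    """Return |A ∩ B| / |A ∪ B| for the flattened item sets of two configurations."""
    items_a = config_to_items(config_a)
    items_b = config_to_items(config_b)
    union = items_a | items_b
    if not union:
        # Convention chosen here: two empty configurations count as identical.
        return 1.0
    return len(items_a & items_b) / len(union)


# Hypothetical configurations: detector names and log paths are invented.
ce_config = {
    "detector_a": ["/request/method", "/request/status"],
    "detector_b": ["/request/size"],
}
expert_config = {
    "detector_a": ["/request/status"],
    "detector_b": ["/request/size", "/client/ip"],
}

similarity = jaccard_index(ce_config, expert_config)
print(f"Jaccard index: {similarity:.2f}")
```

In this sketch the two configurations share two of four distinct (detector, path) pairs, so a value near 1 would indicate nearly identical configurations and a value near 0 largely disjoint ones.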
Translated title of the contribution | Ein halbüberwachter Ansatz zur Konfiguration und Optimierung von Machine-Learning basierten Anomalieerkennungs-Algorithmen |
---|---|
Original language | English |
Qualification | Master of Science |
Awarding Institution | |
Supervisors/Advisors | |
Thesis sponsors | |
Award date | 10 Oct 2024 |
Publication status | Published - 18 Oct 2024 |
Research Field
- Cyber Security
Web of Science subject categories (JCR Impact Factors)
- Computer Science, Artificial Intelligence
- Computer Science, Information Systems