Week #3 #
Implemented MVP features #
Log Upload Functionality
Description: Users can upload .log files (only for HDFS) containing raw system logs.
Purpose: This is the entry point of the user journey.
Implementation: Frontend + backend endpoint for file upload.
PR: Add log upload endpoint
Issue: Users need to upload logs for analysis
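As a rough illustration of this feature, the sketch below shows what a minimal upload endpoint could look like, assuming a FastAPI backend; the route, upload directory, and validation logic are placeholders, not the actual implementation from the PR.

```python
# Hypothetical sketch of a log-upload endpoint (framework, route, and paths are
# assumptions for illustration, not the project's actual code).
from pathlib import Path

from fastapi import FastAPI, HTTPException, UploadFile

app = FastAPI()
UPLOAD_DIR = Path("uploads")
UPLOAD_DIR.mkdir(exist_ok=True)


@app.post("/upload")
async def upload_log(file: UploadFile):
    # The MVP accepts only raw .log files (HDFS logs), so reject anything else.
    if not file.filename or not file.filename.endswith(".log"):
        raise HTTPException(status_code=400, detail="Only .log files are supported")
    destination = UPLOAD_DIR / file.filename
    destination.write_bytes(await file.read())
    return {"filename": file.filename, "status": "uploaded"}
```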
Preprocessing Pipeline
Description: Uploaded logs are automatically parsed into a structured CSV format.
Limitation: Works only with HDFS (Hadoop Distributed File System) log files.
Purpose: Aggregate or convert logs into a model-friendly format (a parsing sketch is shown below).
PR: Implement log preprocessing pipeline
Issue: Parse and normalize log files
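A minimal sketch of the kind of parsing this step involves, assuming HDFS log lines carry a `blk_<id>` token and that each line can be mapped to an event template ID; the regex, the `match_event` callable, and the column layout are illustrative assumptions, not the project's actual parser.

```python
# Illustrative parsing sketch: group raw HDFS log lines by block ID and count
# occurrences of each event template (E1-E29), then write a structured CSV.
import csv
import re
from collections import Counter, defaultdict

BLOCK_RE = re.compile(r"blk_-?\d+")  # HDFS block IDs look like blk_123... / blk_-123...


def parse_logs_to_csv(log_path: str, csv_path: str, match_event) -> None:
    """match_event: callable mapping a raw log line to an event ID such as 'E5'."""
    counts = defaultdict(Counter)
    with open(log_path) as f:
        for line in f:
            block = BLOCK_RE.search(line)
            if block is None:
                continue
            counts[block.group()][match_event(line)] += 1

    event_ids = [f"E{i}" for i in range(1, 30)]
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["BlockId", *event_ids])
        for block_id, counter in counts.items():
            writer.writerow([block_id, *(counter.get(e, 0) for e in event_ids)])
```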
Model Inference
Description: Uses a trained Decision Tree to predict whether each block is normal or anomalous.
Limitation: Works only with HDFS (Hadoop Distributed File System) log files.
Output: A prediction ∈ [0, 1] (the rate of anomalous blocks in the log file)
PR: Model inference pipeline
Issue: Integrate ML model for prediction
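The inference step can be pictured roughly as follows; the joblib model file, the feature column names, and the helper name are assumptions for illustration (the trained classifier itself is the scikit-learn DecisionTreeClassifier described in the ML section below).

```python
# Sketch of inference: load the trained Decision Tree, classify each block in
# the structured CSV, and report the share of anomalous blocks in [0, 1].
# The model path and column names are assumptions.
import joblib
import pandas as pd


def anomaly_rate(features_csv: str, model_path: str = "decision_tree.joblib") -> float:
    model = joblib.load(model_path)
    features = pd.read_csv(features_csv)
    X = features[[f"E{i}" for i in range(1, 30)]]
    predictions = model.predict(X)    # 0 = normal, 1 = anomalous
    return float(predictions.mean())  # rate of anomalous blocks in the file
```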
Functional User Journey
User uploads a log file.
System parses logs, encodes features, and performs inference.
Anomaly detection model analyzes data.
User receives a report with the predictions.
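Tying the steps together, the journey could be wired roughly like this; the function names refer to the illustrative sketches above (not the actual codebase), and the placeholder event matcher and 0.5 verdict threshold are assumptions.

```python
# Rough end-to-end flow: uploaded file -> parsing -> inference -> report.
def analyze_uploaded_log(log_path: str) -> dict:
    features_csv = "parsed_features.csv"
    # Placeholder matcher: a real pipeline would map each line to its event template.
    parse_logs_to_csv(log_path, features_csv, match_event=lambda line: "E1")
    rate = anomaly_rate(features_csv)
    return {"anomaly_rate": rate, "verdict": "anomalous" if rate > 0.5 else "normal"}
```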
Demonstration of the working MVP #
ML #
Link to the training code: link
The goal of the model is to detect anomalies in system log data (for HDFS) using classification techniques. An anomaly may indicate a system error, potential cyberattack, or unexpected behavior.
Model Selection
Model: DecisionTreeClassifier from scikit-learn
Reason for choice: This model was chosen for its interpretability, fast training, and ability to handle sparse categorical input (such as event type counts). It also showed slightly better results than Logistic Regression, Random Forest, and AdaBoost; a comparison sketch is shown below.
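The comparison could be reproduced along the following lines; this is a sketch assuming the hybrid dataset described below and cross-validated F1 as the selection criterion, while the exact training code lives in the linked repository.

```python
# Sketch of the model comparison that motivated the Decision Tree choice.
# The dataset path, column names, and scoring choice are assumptions.
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv("hybrid_training_dataset.csv")
X, y = data[[f"E{i}" for i in range(1, 30)]], data["Label"]

candidates = {
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```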
Training dataset
To prepare the training dataset, we first used the structured dataset Event_occurence_matrix.csv from the loghub repository.
The original logs were parsed and converted into a structured format:
- BlockId or log ID as the unit of grouping
- EventId (e.g., E1–E29) as features
- Label (0: Normal, 1: Anomaly)
This dataset was then converted into a more diverse one to address the imbalance in feature distribution between training and inference. Each row represents either:
- A full block (multiple events grouped by BlockId)
- Or a medium chunk of logs (for medium-data simulation)
- Or a small “chunk” of logs (for low-data simulation)
Example of a feature row, where the numbers are counts (or presence/absence) of each EventId (E1–E29) for the block: E1=0, E2=0, E3=2, E4=0, E5=1, …, E29=0, Label=1
The result was saved to hybrid_training_dataset.csv, and the model was trained on this data; a sketch of the augmentation step is shown below.
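A rough sketch of the kind of augmentation described above: the full-block rows are kept, and additional rows are produced by randomly thinning the event counts to mimic medium and small chunks. The thinning fractions, column names, and file names are assumptions, not the project's actual script.

```python
# Sketch of building a "hybrid" dataset from the event-occurrence matrix:
# full blocks plus thinned copies that simulate medium and small log chunks.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
matrix = pd.read_csv("Event_occurence_matrix.csv")
event_cols = [f"E{i}" for i in range(1, 30)]


def thinned(df: pd.DataFrame, keep_fraction: float) -> pd.DataFrame:
    """Keep each individual event occurrence with probability keep_fraction."""
    out = df.copy()
    out[event_cols] = rng.binomial(df[event_cols].to_numpy().astype(int), keep_fraction)
    return out


# Full blocks, medium chunks (~50% of events), small chunks (~10% of events).
hybrid = pd.concat([matrix, thinned(matrix, 0.5), thinned(matrix, 0.1)], ignore_index=True)
hybrid.to_csv("hybrid_training_dataset.csv", index=False)
```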
Internal demo #
ML model:
- The model tends to over-predict anomalies on short or “thin” log blocks
- There were many false-positive outputs, which is why recall and F1 were not high
- It is necessary to add support for different log types
UI:
- We need to choose more readable colors.
Weekly commitments #
Individual contribution of each participant #
Bulat:
- Data preparation and feature extraction
- Raw input log parser
- Model training
- Model testing and inference
Paramon:
Plan for Next Week #
- Improve performance of current model (especially recall / F1)
- Add support for other log types (e.g., BGP, Linux system logs, macOS)
- Add parsers for other log types
- Make the user interface more intuitive, responsive, and informative — especially for uploading logs, viewing results, and understanding model predictions.
Confirmation of the code’s operability #
We confirm that the code in the main branch:
- Is in working condition.
- Runs via docker-compose (or an alternative described in the README.md).