Module 1 - Baseline
TF-IDF + Logistic Regression
Build a fast word-frequency baseline for coarse, lightweight filtering on CPU.
Pipeline
- Lowercase, normalize leetspeak, decode obfuscated spellings, collapse repeated characters
- Word n-grams (1,3) + lemmatization
- Character n-grams (3,5)
- Feature stacking into around 100,000 dimensions
- One-vs-Rest Logistic Regression with class_weight='balanced'
- Sigmoid output thresholded at 0.5 for each of the 6 labels
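The normalization steps in the first pipeline bullet could look roughly like the sketch below. The substitution table `LEET_MAP`, the separator set, and the function name `normalize` are all hypothetical; the source does not list its actual dictionaries or rules, so treat this as one plausible reading rather than the module's implementation.

```python
import re

# Hypothetical leetspeak substitutions; the real dictionary is not given in the source.
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})

# Assumed obfuscation pattern: single characters joined by ., -, _ or *
# (for example "s.t.u.p.i.d"), with at least three characters in the chain.
_SPLIT_WORD = re.compile(r"\b(?:\w[.\-_*]){2,}\w\b")
_SEPARATORS = re.compile(r"[.\-_*]")

# Any character repeated three or more times in a row.
_REPEATS = re.compile(r"(.)\1{2,}")


def normalize(text: str) -> str:
    """Lowercase, undo simple obfuscation, map leetspeak, collapse repeats."""
    text = text.lower()
    # Remove separators inside character-by-character spellings.
    text = _SPLIT_WORD.sub(lambda m: _SEPARATORS.sub("", m.group(0)), text)
    # Replace leetspeak characters with their assumed letter equivalents.
    text = text.translate(LEET_MAP)
    # Collapse runs of three or more identical characters down to two.
    text = _REPEATS.sub(r"\1\1", text)
    return text
```

A table this aggressive also rewrites genuine digits (for example "2017" becomes "20it"), so a real pipeline would likely restrict the mapping to tokens that mix letters and digits.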
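The feature and classifier steps listed above might be wired together in scikit-learn as follows. This is a minimal sketch under stated assumptions: the even 50,000/50,000 split between word and character features is a guess (the source only gives a total of around 100,000), lemmatization is omitted because it needs an external tool, and `build_baseline` and `predict_labels` are illustrative names.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline


def build_baseline(word_features=50_000, char_features=50_000):
    """TF-IDF word (1,3) and char (3,5) n-grams stacked into one sparse matrix,
    followed by one logistic regression per label."""
    features = FeatureUnion([
        # Word n-grams; lemmatization (not shown) would be applied beforehand.
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 3),
                                 max_features=word_features)),
        # Character n-grams help with misspellings and obfuscated words.
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5),
                                 max_features=char_features)),
    ])
    classifier = OneVsRestClassifier(
        LogisticRegression(class_weight="balanced", max_iter=1000)
    )
    return Pipeline([("features", features), ("clf", classifier)])


def predict_labels(model, texts, threshold=0.5):
    """Return a 0/1 matrix with one column per label, using a fixed threshold
    on each label's predicted probability."""
    proba = np.asarray(model.predict_proba(texts))
    return (proba >= threshold).astype(int)
```

The model is fitted with `model.fit(texts, Y)`, where `Y` is a binary indicator matrix with one column per label (6 columns in this module).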
Results
- Very fast CPU training
- AUC of about 0.95 or higher
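The source does not say how the AUC figure is averaged. One common convention for multi-label problems is the mean of per-label ROC AUC, sketched here with the hypothetical helper `mean_columnwise_auc`; the actual evaluation protocol may differ.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def mean_columnwise_auc(y_true, y_score):
    """Compute ROC AUC separately for each label column, then average.

    y_true:  binary indicator matrix, shape (n_samples, n_labels)
    y_score: predicted probabilities, same shape
    Every label column must contain both classes, otherwise ROC AUC
    is undefined for that column.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    per_label = [
        roc_auc_score(y_true[:, j], y_score[:, j])
        for j in range(y_true.shape[1])
    ]
    return float(np.mean(per_label)), per_label
```

Because AUC is computed from the ranking of scores, it does not depend on the 0.5 decision threshold.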
Limitations
- Limited by bag-of-words representation
- Dependent on data quality and dictionaries
- Cannot capture deep semantic meaning
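The bag-of-words limitation can be illustrated with a small standard-library sketch. It uses plain unigram counts, which is a simplification of the pipeline above; the word n-grams up to length 3 recover some local word order but not meaning. The helper name `same_bag` is hypothetical.

```python
from collections import Counter


def same_bag(a: str, b: str) -> bool:
    """True when two texts contain exactly the same words with the same counts,
    regardless of the order in which the words appear."""
    return Counter(a.lower().split()) == Counter(b.lower().split())


# Opposite meanings, identical unigram representation.
print(same_bag("man bites dog", "dog bites man"))
# Different words, so the representations differ.
print(same_bag("man bites dog", "man pets dog"))
```

A unigram model receives the same input for the first pair, so it must assign both sentences the same score.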