Author: Andrei Cristian
Version: 0.1
SSOSM (Simple Spam Or Sanitized Mail) is a Python-based spam filtering tool designed to classify text or file content using a Naive Bayes classifier trained on manually labeled data. It can detect spam from raw text files and HTML emails using a lightweight and customizable dataset.
This project is ideal for security research, email filtering prototypes, or educational purposes on NLP and machine learning fundamentals.
.
├── main.py            # Main script containing training and scanning logic
├── requirements.txt   # Dependencies
└── .python-version    # Python version (3.7.12)
- Detects spam content using Multinomial Naive Bayes
- Supports plain text and HTML content (auto-strips HTML)
- Trains on a CSV dataset (dataset.csv) generated from labeled folders
- Scans directories for potential spam files
- Outputs results to a log file
- Saves and reuses trained model and vectorizer with pickle
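To illustrate what a Multinomial Naive Bayes classifier computes, here is a minimal pure-Python sketch with add-one (Laplace) smoothing. It is not the code in main.py, which appears to rely on a CountVectorizer-based pipeline; the function names, the whitespace tokenizer, the "spam"/"clean" labels, and the toy samples are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase/whitespace tokenizer (assumption; a real vectorizer does more).
    return text.lower().split()


def train(samples):
    """Count words per label from a list of (text, label) pairs."""
    word_counts = {}        # label -> Counter of token frequencies
    doc_counts = Counter()  # label -> number of training documents
    vocab = set()
    for text, label in samples:
        tokens = tokenize(text)
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior for `text`."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior: share of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # words never seen in training are skipped
            # Add-one smoothing keeps unseen (label, word) pairs above zero.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, only to make the sketch runnable.
SAMPLES = [
    ("win a free prize now", "spam"),
    ("free gift claim your prize", "spam"),
    ("meeting agenda for monday", "clean"),
    ("please review the attached report", "clean"),
]
MODEL = train(SAMPLES)
```

With this toy data, `predict(MODEL, "claim your free prize")` returns `"spam"` and `predict(MODEL, "monday report meeting")` returns `"clean"`.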
Python 3.7.12
Install dependencies with:
pip install -r requirements.txt
python main.py -info info.txt
Writes project metadata to the file info.txt.
python main.py -scan <directory_path> <output_file>
Scans all files in the specified directory, classifies them, and writes results to the output file. Each line contains:
<filename>|cln # Clean
<filename>|inf # Infected (Spam)
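The scan step and its output format can be sketched roughly as follows. This is a hedged illustration, not the logic in main.py: `scan_directory`, `keyword_stub`, the non-recursive traversal, and the sorted ordering are assumptions, and the keyword check merely stands in for the trained classifier.

```python
import os


def scan_directory(directory, output_file, is_spam):
    """Classify each regular file in `directory` and write one
    '<filename>|cln' or '<filename>|inf' line per file to `output_file`."""
    lines = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # assumption: subdirectories are not traversed
        # Undecodable bytes are dropped so odd encodings do not abort the scan.
        with open(path, "r", encoding="utf-8", errors="ignore") as handle:
            text = handle.read()
        verdict = "inf" if is_spam(text) else "cln"
        lines.append(f"{name}|{verdict}")
    with open(output_file, "w", encoding="utf-8") as out:
        out.write("\n".join(lines) + ("\n" if lines else ""))
    return lines


def keyword_stub(text):
    # Hypothetical placeholder for the trained model's prediction.
    return "free" in text.lower()
```

Passing the classifier in as a callable keeps the file handling independent of how the model is loaded.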
Note: Training logic is currently commented out in main.py. You can enable it manually for re-training.
Directory structure for training:
<root>
├── Lot1/
│   ├── Clean/
│   └── Spam/
└── Lot2/
    ├── Clean/
    └── Spam/
Each subfolder should contain text files. After uncommenting the train() function and adjusting the logic:
# Enable and modify in main.py
# python main.py -train <path_to_data_root>
It generates a new dataset.csv, trains the classifier, and saves:
- naive_bayes_clf.pkl: the trained model
- naive_bayes_cv.pkl: the fitted CountVectorizer
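Generating the dataset from the labeled folders might look like the sketch below. The real column layout of dataset.csv is not documented here, so the `text`/`label` header, the `clean`/`spam` label values, and the `build_dataset` name are assumptions.

```python
import csv
import os


def build_dataset(root, csv_path):
    """Walk <root>/<Lot>/Clean and <root>/<Lot>/Spam and write one CSV row
    per file. Returns the number of samples written."""
    rows = []
    for lot in sorted(os.listdir(root)):
        for folder, label in (("Clean", "clean"), ("Spam", "spam")):
            class_dir = os.path.join(root, lot, folder)
            if not os.path.isdir(class_dir):
                continue  # tolerate lots that lack one of the two folders
            for name in sorted(os.listdir(class_dir)):
                path = os.path.join(class_dir, name)
                if not os.path.isfile(path):
                    continue
                with open(path, "r", encoding="utf-8", errors="ignore") as handle:
                    rows.append((handle.read(), label))
    # newline="" lets the csv module handle line endings inside quoted text.
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["text", "label"])  # hypothetical header
        writer.writerows(rows)
    return len(rows)
```

Calling `build_dataset("<path_to_data_root>", "dataset.csv")` would then produce a file that a vectorizer and classifier can be fitted on.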
- Only plain text and HTML are processed (malicious scripts in HTML are stripped).
- Binary files are ignored.
- This tool is not a full antivirus scanner; it is intended for text-based spam detection.
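One way to strip HTML, including script and style content, before classification is shown below using only the standard library. Whether main.py uses html.parser or another library is not stated, so `TextExtractor` and `strip_html` are hypothetical names and this is only a sketch of the idea.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, dropping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by removed markup.
        return " ".join(" ".join(self._chunks).split())


def strip_html(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

Plain text passes through unchanged apart from whitespace normalization, so the same function could be applied to both input types.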
offer.txt|inf
newsletter.html|cln
free-gift.msg|inf
- Enable training via CLI argument
- Add confusion matrix and classification reports
- Extend to detect phishing keywords
- Add support for .eml email formats
- Web dashboard (Flask/FastAPI)
This project is currently not licensed. Contact the author for reuse or contributions.