🧼 SSOSM: Spam Filter

Author: Andrei Cristian
Version: 0.1


📌 Overview

SSOSM (Simple Spam Or Sanitized Mail) is a Python-based spam filter that classifies text or file content with a Naive Bayes classifier trained on manually labeled data. It detects spam in raw text files and HTML emails using a lightweight, customizable dataset.

This project is suited to security research, email filtering prototypes, and teaching NLP and machine learning fundamentals.


📁 Project Structure

.
├── main.py              # Main script containing training and scanning logic
├── requirements.txt     # Dependencies
├── .python-version      # Python version (3.7.12)

🛠 Features

  • ✅ Detects spam content using Multinomial Naive Bayes
  • ✅ Supports plain text and HTML content (auto-strips HTML)
  • ✅ Trains on a CSV dataset (dataset.csv) generated from labeled folders
  • ✅ Scans directories for potential spam files
  • ✅ Outputs results to a log file
  • ✅ Saves and reuses trained model and vectorizer with pickle
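
To illustrate the classification technique named above, here is a minimal, standard-library sketch of Multinomial Naive Bayes with add-one smoothing. It is not the code in main.py, which by the README's description uses a CountVectorizer and a pickled model; the `tokenize` helper, the `TinyMultinomialNB` class, and the toy training sentences are all hypothetical and exist only for this example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TinyMultinomialNB:
    """Hypothetical teaching class: Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)    # word frequencies per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:             # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy data, invented for this example; labels reuse the cln/inf codes from the log format.
train_texts = [
    "win a free prize now",
    "free money claim your prize",
    "meeting agenda for monday",
    "please review the project report",
]
train_labels = ["inf", "inf", "cln", "cln"]

model = TinyMultinomialNB().fit(train_texts, train_labels)
print(model.predict("claim your free prize"))           # inf
print(model.predict("agenda for the project meeting"))  # cln
```

Working in log space avoids numeric underflow on long messages, and the smoothing term keeps a word seen in only one class from zeroing out the other class's score.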

🧪 Requirements

Python 3.7.12
Install dependencies with:

pip install -r requirements.txt

⚙️ Usage

1. Print Project Info

python main.py -info info.txt

Writes project metadata to the file info.txt.


2. Scan a Folder for Spam

python main.py -scan <directory_path> <output_file>

Scans all files in the specified directory, classifies them, and writes results to the output file. Each line contains:

<filename>|cln  # Clean
<filename>|inf  # Infected (Spam)
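
The scan step described above can be sketched roughly as follows. This is an assumption-laden illustration, not the logic in main.py: `scan_directory` and `keyword_stub` are hypothetical names, the scan is assumed to be non-recursive, and the stub stands in for the trained classifier.

```python
import tempfile
from pathlib import Path


def scan_directory(directory, output_file, is_spam):
    """Classify each regular file in `directory` and log `<filename>|cln` or `<filename>|inf`."""
    lines = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # assumption: subdirectories are not scanned
        text = path.read_text(encoding="utf-8", errors="ignore")
        verdict = "inf" if is_spam(text) else "cln"
        lines.append(f"{path.name}|{verdict}")
    body = "\n".join(lines) + ("\n" if lines else "")
    Path(output_file).write_text(body, encoding="utf-8")
    return lines


def keyword_stub(text):
    """Placeholder for the real model: flags any text containing 'free'."""
    return "free" in text.lower()


with tempfile.TemporaryDirectory() as tmp:
    mail_dir = Path(tmp) / "mail"
    mail_dir.mkdir()
    (mail_dir / "offer.txt").write_text("FREE gift inside", encoding="utf-8")
    (mail_dir / "notes.txt").write_text("Minutes from Monday", encoding="utf-8")
    log_path = Path(tmp) / "results.log"
    results = scan_directory(mail_dir, log_path, keyword_stub)
    log_text = log_path.read_text(encoding="utf-8")

print(results)  # ['notes.txt|cln', 'offer.txt|inf']
```

Passing the classifier in as a callable keeps the file-walking and logging code independent of how the model is loaded.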

3. (Optional) Train Model from Labeled Dataset

Note: Training logic is currently commented out in main.py. You can enable it manually for re-training.

Directory structure for training:

<root>
├── Lot1/
│   ├── Clean/
│   └── Spam/
└── Lot2/
    ├── Clean/
    └── Spam/

Each subfolder should contain text files. After uncommenting the train() function in main.py and adjusting its logic, the intended invocation is:

# Enable and modify in main.py
# python main.py -train <path_to_data_root>

Training generates a new dataset.csv, fits the classifier, and saves:

  • naive_bayes_clf.pkl – the trained model
  • naive_bayes_cv.pkl – the fitted CountVectorizer
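
As a rough sketch of the dataset-generation step, the snippet below walks the Lot*/Clean and Lot*/Spam layout shown earlier and writes one CSV row per file. The `build_dataset` function, the `text` and `label` column names, and the 0/1 label encoding are assumptions made for illustration; the actual columns of dataset.csv are not documented here.

```python
import csv
import tempfile
from pathlib import Path


def build_dataset(data_root, csv_path):
    """Write one (text, label) row per file under <root>/<lot>/Clean and <root>/<lot>/Spam."""
    rows = []
    for lot in sorted(Path(data_root).iterdir()):
        if not lot.is_dir():
            continue
        # Assumed encoding: 0 = clean, 1 = spam.
        for folder, label in (("Clean", 0), ("Spam", 1)):
            class_dir = lot / folder
            if not class_dir.is_dir():
                continue
            for path in sorted(class_dir.iterdir()):
                if path.is_file():
                    text = path.read_text(encoding="utf-8", errors="ignore")
                    rows.append((text, label))
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text", "label"])  # hypothetical header names
        writer.writerows(rows)
    return len(rows)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data"
    (root / "Lot1" / "Clean").mkdir(parents=True)
    (root / "Lot1" / "Spam").mkdir(parents=True)
    (root / "Lot1" / "Clean" / "a.txt").write_text("see you at lunch", encoding="utf-8")
    (root / "Lot1" / "Spam" / "b.txt").write_text("win cash now", encoding="utf-8")
    csv_file = Path(tmp) / "dataset.csv"
    row_count = build_dataset(root, csv_file)
    with open(csv_file, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))

print(row_count)  # 2
```

Once a classifier and vectorizer have been fitted on such a file, each can be serialized with `pickle.dump` and restored with `pickle.load`, which matches the two .pkl files listed above.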

🔒 Security Notes

  • Only plain text and HTML are processed (malicious scripts in HTML are stripped).
  • Binary files are ignored.
  • This tool is not a full antivirus scanner; it is intended for text-based spam detection.
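
One possible way to implement the HTML stripping and binary-file checks noted above, using only the standard library, is sketched below. The `TextExtractor` class, `html_to_text`, and `looks_binary` are hypothetical names, and the NUL-byte test is a simple heuristic chosen for this example; main.py may use a different approach.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text while skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(markup):
    """Return the visible text of an HTML document with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def looks_binary(raw_bytes):
    """Heuristic assumption: content containing a NUL byte is treated as binary."""
    return b"\x00" in raw_bytes


sample = (
    "<html><body><h1>Big sale</h1>"
    "<script>alert('x')</script>"
    "<p>Buy &amp; save</p></body></html>"
)
print(html_to_text(sample))           # Big sale Buy & save
print(looks_binary(b"\x89PNG\x00"))   # True
```

Dropping script and style contents before classification keeps JavaScript and CSS tokens out of the word counts, so only the text a reader would see influences the verdict.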

📄 Output Example

offer.txt|inf
newsletter.html|cln
free-gift.msg|inf

🧹 Future Improvements

  • ✅ Enable training via CLI argument
  • 📊 Add confusion matrix and classification reports
  • 🤖 Extend to detect phishing keywords
  • 🗂 Add support for .eml email formats
  • 🌐 Web dashboard (Flask/FastAPI)

📜 License

This project is currently not licensed. Contact the author for reuse or contributions.