Spam Classification of Emails

Models Used

  • Logistic Regression
  • Decision Trees
  • Naive Bayes

Requirements

  • Python 3.6.10
  • NumPy 1.18.4
  • Scikit-learn 0.23.1
  • SciPy 1.4.1

Steps to run:

  1. Run the preprocessing script. It extracts the text from each email, cleans it by removing stopwords and punctuation, splits the data into training and testing sets, and uploads the result to Elasticsearch (ES).

    python preprocessing.py
    --dirpath
    --labels
    --index
    --seed 4

  2. Get the unigrams from Elasticsearch.

    python getUnigrams.py
    --index

  3. Build the feature matrix and run the models.

    python spamClassifier.py
    --index
    --labels
    --features
    --cutoff
    --result
    --model <one of: reg, logit (default), tree, nb (Naive Bayes)>
    --sparse (use only when you want to build a sparse matrix)
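
The cleaning and splitting described in step 1 can be sketched roughly as follows. This is only an illustration, not the code in preprocessing.py: the function names `clean_text` and `split_train_test`, the tiny stopword list, and the 80/20 split ratio are all assumptions. Only the seed value 4 comes from the command above.

```python
import random
import string

# Tiny illustrative stopword list (an assumption; the list used by
# preprocessing.py is not shown in this README).
STOPWORDS = {"a", "an", "and", "the", "to", "of", "is", "in", "for", "your"}


def clean_text(text):
    """Lowercase the text, delete punctuation, and drop stopwords."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in no_punct.split() if word not in STOPWORDS)


def split_train_test(items, seed=4, test_fraction=0.2):
    """Shuffle with a fixed seed so the same split is produced on every run."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # Returns (training items, testing items).
    return shuffled[n_test:], shuffled[:n_test]


cleaned = clean_text("Click here to REMOVE your account!!!")
train, test = split_train_test(range(10), seed=4)
print(cleaned)
print(len(train), len(test))
```

Fixing the seed is what makes the training/testing split repeatable, so models run later are compared on the same documents.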

Results

Top 10 words after running Logistic Regression on the sparse unigram matrix:

('freebsd', 1.1172201336086656)
('click', 1.1067116575308757)
('antivir', 1.079915881679438)
('penis', 1.0705047080248018)
('opt', 1.0648006849173453)
('girl', 1.0265354745168107)
('adf', 1.0224262305744003)
('website', 0.9591053071640367)
('products', 0.8749953047612927)
('remove', 0.8748610351597089)
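
A list like the one above can be produced along these lines, assuming the numbers are the logistic regression coefficients of each unigram. This is a minimal sketch using scikit-learn, not the code in spamClassifier.py: the helper name `top_words` and the four-message toy corpus are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def top_words(texts, labels, k=10):
    """Fit logistic regression on unigram counts and return the k
    (word, coefficient) pairs with the largest coefficients."""
    # CountVectorizer returns a SciPy sparse matrix of unigram counts.
    vectorizer = CountVectorizer(ngram_range=(1, 1))
    X = vectorizer.fit_transform(texts)

    model = LogisticRegression(max_iter=1000)
    model.fit(X, labels)

    # vocabulary_ maps each word to its column index in X.
    weights = model.coef_[0]
    pairs = [(word, float(weights[col]))
             for word, col in vectorizer.vocabulary_.items()]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:k]


# Toy corpus (hypothetical data): 1 = spam, 0 = not spam.
texts = [
    "click here to remove your account",
    "click now for free products",
    "meeting notes attached for review",
    "please review the project schedule",
]
labels = [1, 1, 0, 0]

top = top_words(texts, labels, k=3)
for pair in top:
    print(pair)
```

Words with large positive coefficients push a message toward the spam class, which is why the list is sorted in descending order.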