#!/usr/bin/env python
# coding: utf-8
# ════════════════════════════════════════════════════════════════════════════════
# PROJECT: SMS SPAM CLASSIFICATION WITH CLASS IMBALANCE HANDLING
# ════════════════════════════════════════════════════════════════════════════════
# 1. EXECUTIVE SUMMARY
# ────────────────────────────────────────────────────────────────────────────────
# OBJECTIVE:
#   To build a robust SMS spam classifier that handles class imbalance using 
#   synthetic oversampling (ADASYN and SMOTE). The project demonstrates an 
#   end-to-end NLP pipeline built with imbalanced-learn's Pipeline, which 
#   extends scikit-learn's Pipeline API so that a resampling step can be 
#   combined with text preprocessing, feature extraction, and classification.
# PROBLEM STATEMENT & CONTEXT:
#   SMS spam detection is a critical security task. However, spam datasets are 
#   typically imbalanced (far more "ham" than "spam"). Standard classifiers 
#   trained on imbalanced data tend to be biased toward the majority class, 
#   leading to poor recall on spam messages. This project addresses that 
#   challenge using synthetic oversampling.
# ALGORITHM OVERVIEW:
#   - Preprocessing: Custom function to remove punctuation and stopwords.
#   - Vectorization: Bag-of-Words (CountVectorizer) + TF-IDF.
#   - Resampling: ADASYN (Adaptive Synthetic Sampling) and SMOTE (Synthetic 
#     Minority Over-sampling Technique).
#   - Classification: Random Forest Classifier.
#   - Pipeline: Integrated workflow for reproducibility and deployment.
# DATASET:
#   - Source: SMSSpamCollection (UCI Machine Learning Repository).
#   - Structure: Tab-separated file with 'label' (ham/spam) and 'message'.
#   - Size: ~5,574 messages (about 87% ham, 13% spam; imbalanced).
# EXPECTED OUTCOMES:
#   - High precision and recall on spam detection.
#   - Comparison of ADASYN vs. SMOTE performance.
#   - Production-ready pipeline for real-time inference.
# PREREQUISITES:
#   - Python 3.x
#   - Libraries: nltk (plus its 'stopwords' corpus), pandas, numpy, 
#     scikit-learn, imbalanced-learn
#   - Data: 'SMSSpamCollection' file in the working directory.
# ────────────────────────────────────────────────────────────────────────────────
# TABLE OF CONTENTS
# ────────────────────────────────────────────────────────────────────────────────
# 1. Setup & Library Imports
# 2. Data Loading & Exploration
# 3. Text Preprocessing (Custom Function)
# 4. Feature Extraction (BOW & TF-IDF)
# 5. Understanding Class Imbalance
# 6. Pipeline 1: ADASYN + Random Forest (training, evaluation, sample inference)
# 7. Pipeline 2: SMOTE + Random Forest (training, evaluation, sample inference)
# 8. Final Summary & Conclusion
# ════════════════════════════════════════════════════════════════════════════════
# ════════════════════════════════════════════════════════════════════════════════
# STEP 1: SETUP & LIBRARY IMPORTS
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Import all necessary libraries.
#   - 'nltk': For tokenization and stopwords.
#   - 'pandas': For data manipulation.
#   - 'sklearn': For ML models and pipelines.
#   - 'imbalanced-learn': For ADASYN and SMOTE.
import nltk
import pandas as pd
import numpy as np
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import ADASYN, SMOTE
from imblearn.pipeline import Pipeline
from collections import Counter
# ════════════════════════════════════════════════════════════════════════════════
# STEP 2: DATA LOADING & EXPLORATION
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Load the SMS dataset.
# WHY: To understand the data structure and distribution.
# Reading the dataset from a tab-separated file named 'SMSSpamCollection'
# The dataset contains two columns: 'label' for spam/ham classification and 'message' for the SMS text
# Using 'latin1' encoding to properly read special characters in the dataset
dataset = pd.read_csv('SMSSpamCollection', sep='\t', names=['label', 'message'], encoding="latin1")
# Display basic statistics
# dataset.describe(include='all')
# ════════════════════════════════════════════════════════════════════════════════
# STEP 3: TEXT PREPROCESSING (CUSTOM FUNCTION)
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Define a reusable preprocessing function.
# WHY: To clean text by removing punctuation and stopwords.
# DESIGN PATTERN: This function will be passed to CountVectorizer's 'analyzer' 
# parameter, making the pipeline modular and testable.
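# Load stopwords (fetch the NLTK corpus on first run; skipped if already present)
nltk.download('stopwords', quiet=True)
stopwrd = set(stopwords.words('english'))  # a set gives fast membership tests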
# FUNCTION: text_process
# INPUT: A raw string (e.g., "Sample #message ! Notice: it has @ punctuations.")
# OUTPUT: A list of cleaned words (e.g., ['Sample', 'message', 'Notice', 'punctuations'])
# STEPS:
#   1. Remove punctuation.
#   2. Split into words.
#   3. Filter out stopwords.
def text_process(sample_msg):
    # Remove all punctuation characters
    noPunctuation = [c for c in sample_msg if c not in string.punctuation]
    # Join back into a string
    noPunctuation = ''.join(noPunctuation)
    # Split on whitespace and drop stopwords. The comparison is case-insensitive,
    # but the returned words keep their original casing.
    return [word for word in noPunctuation.split() if word.lower() not in stopwrd]
# DEMONSTRATION: Test the function
# sample_msg = 'Sample #message ! Notice: it has @ punctuations.'
# text_process(sample_msg)
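# ILLUSTRATION (a minimal sketch, not part of the original pipeline):
# The three preprocessing steps above, run on the sample message.
# ASSUMPTION: 'demo_text_process' and the tiny '_DEMO_STOPWORDS' set are
# hypothetical stand-ins so the example runs without the NLTK corpus. The real
# text_process uses NLTK's full English stopword list.
import string

_DEMO_STOPWORDS = {'it', 'has', 'a', 'the', 'is'}  # assumed toy list, not NLTK's


def demo_text_process(sample_msg, stop_words=_DEMO_STOPWORDS):
    # 1. Remove punctuation in one pass with a translation table.
    no_punct = sample_msg.translate(str.maketrans('', '', string.punctuation))
    # 2. Split on whitespace. 3. Drop stopwords (case-insensitive match).
    return [word for word in no_punct.split() if word.lower() not in stop_words]


print(demo_text_process('Sample #message ! Notice: it has @ punctuations.'))
# -> ['Sample', 'message', 'Notice', 'punctuations']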
# ════════════════════════════════════════════════════════════════════════════════
# STEP 4: FEATURE EXTRACTION (BOW & TF-IDF)
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Convert text to numerical features.
# APPROACH:
#   1. Bag-of-Words (CountVectorizer): Counts word occurrences.
#   2. TF-IDF (TfidfTransformer): Re-weights counts by document frequency.
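# Initialize CountVectorizer with custom text processing.
# The 'analyzer' parameter lets us inject our custom preprocessing function.
# NOTE: Steps 4-5 fit on the full dataset for exploration only. The pipelines in
# Steps 6-7 refit these transformers on the training split, which avoids leaking
# test-set vocabulary and document frequencies into training.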
bow_transformer = CountVectorizer(
    analyzer=text_process  # Use our custom text processor
).fit(dataset['message'])  # Fit on the entire message column
# Transform a sample message
# msg = dataset['message'][2]
# sam_bowTrns = bow_transformer.transform([msg])
# sam_bowTrns.toarray()
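# ILLUSTRATION (a minimal sketch under stated assumptions):
# What a Bag-of-Words transform produces, shown on two toy token lists.
# ASSUMPTION: 'demo_bow' is a hypothetical pure-Python helper. It mimics the idea
# behind CountVectorizer (one column per vocabulary word, cell = occurrence
# count) but returns dense lists instead of a sparse matrix.
def demo_bow(tokenized_docs):
    # Vocabulary: every distinct token, in sorted order (a simplifying choice).
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    column = {tok: j for j, tok in enumerate(vocab)}
    rows = []
    for doc in tokenized_docs:
        row = [0] * len(vocab)
        for tok in doc:
            row[column[tok]] += 1  # count each occurrence of the token
        rows.append(row)
    return vocab, rows


demo_vocab, demo_counts = demo_bow([['free', 'entry', 'free'], ['entry', 'now']])
print(demo_vocab)   # -> ['entry', 'free', 'now']
print(demo_counts)  # -> [[1, 2, 0], [1, 0, 1]]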
# Transform the entire dataset
bow_msg_trns = bow_transformer.transform(dataset['message'])
# Apply TF-IDF transformation
# Initialize and fit the TF-IDF transformer on Bag-of-Words data
tfidf_transformer = TfidfTransformer().fit(bow_msg_trns)
# Transform BOW to TF-IDF
msg_tfidf = tfidf_transformer.transform(bow_msg_trns)
# msg_tfidf.toarray()
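# ILLUSTRATION (a minimal sketch, not the library's implementation):
# The re-weighting that TfidfTransformer applies with its default settings
# (smooth_idf=True, norm='l2'):
#   idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1
#   each row of tf * idf is then scaled to unit Euclidean length.
# ASSUMPTION: 'demo_tfidf' is a hypothetical helper working on dense lists of
# counts. The real transformer works on sparse matrices.
import math


def demo_tfidf(count_rows):
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents does each term appear?
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted_rows = []
    for row in count_rows:
        weighted = [count * w for count, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0  # guard empty rows
        weighted_rows.append([v / norm for v in weighted])
    return weighted_rows


# Term 0 appears in both toy documents, term 1 only in the first, so term 1
# receives the larger weight within the first document.
demo_weights = demo_tfidf([[1, 1], [1, 0]])
print(demo_weights)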
# ════════════════════════════════════════════════════════════════════════════════
# STEP 5: UNDERSTANDING CLASS IMBALANCE
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Analyze the distribution of 'ham' vs. 'spam'.
# WHY: To justify the use of resampling techniques.
# Map labels to numeric values
labels = dataset['label'].map({'ham': 0, 'spam': 1}).values
# Display class distribution
print('Original dataset shape:', Counter(labels))
# Expected output: Counter({0: ~4800, 1: ~750}) - Highly imbalanced!
# That is roughly six to seven ham messages for every spam message.
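# ILLUSTRATION (a minimal sketch on toy labels, not on the real dataset):
# Why the class counts matter: the majority share is also the accuracy of a
# model that always predicts the majority class.
# ASSUMPTION: 'demo_imbalance_summary' and the 87/13 toy split are hypothetical
# and chosen only to mirror the proportions quoted in the header.
from collections import Counter


def demo_imbalance_summary(label_values):
    counts = Counter(label_values)
    total = sum(counts.values())
    majority_label, majority_count = counts.most_common(1)[0]
    minority_count = min(counts.values())
    return {
        'majority_label': majority_label,
        'majority_share': majority_count / total,  # "always ham" accuracy
        'imbalance_ratio': majority_count / minority_count,
    }


demo_summary = demo_imbalance_summary([0] * 87 + [1] * 13)
print(demo_summary)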
# ════════════════════════════════════════════════════════════════════════════════
# STEP 6: PIPELINE 1 - ADASYN + RANDOM FOREST
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Build an end-to-end pipeline with ADASYN.
# WHY: ADASYN focuses on generating synthetic samples near the decision boundary, 
#      which helps the model learn difficult cases.
# ADASYN THEORY:
#   - Adaptive Synthetic Sampling (ADASYN) is an advanced oversampling technique.
#   - Unlike SMOTE (which creates samples uniformly), ADASYN generates more 
#     synthetic samples for minority instances that are harder to learn.
#   - This shifts the decision boundary to better classify difficult cases.
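# ILLUSTRATION (a simplified sketch of the ADASYN idea, not imbalanced-learn's code):
# For each minority point, measure the share of majority points among its k
# nearest neighbours, then hand out the synthetic-sample budget in proportion
# to that share. Points surrounded by the majority class get more samples.
# ASSUMPTION: 'demo_adasyn_allocation', the 1-D toy points, and returning all
# zeros when no minority point has a majority neighbour are hypothetical
# simplifications for illustration.
import math


def demo_adasyn_allocation(points, labels, minority_label, k, n_to_generate):
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    ratios = []
    for i in minority_idx:
        # All other points, nearest first (Euclidean distance).
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        neighbours = others[:k]
        # Difficulty score: fraction of the k neighbours in the majority class.
        ratios.append(sum(labels[j] != minority_label for j in neighbours) / k)
    total = sum(ratios)
    if total == 0:
        return [0] * len(minority_idx)
    # Normalise the scores and split the budget accordingly.
    return [round(r / total * n_to_generate) for r in ratios]


# Three spam points sit in their own cluster; the fourth sits among ham points.
demo_points = [(0.0,), (0.1,), (0.2,), (5.0,),
               (4.9,), (5.1,), (5.2,), (6.0,), (7.0,), (8.0,), (9.0,), (9.5,)]
demo_labels = ['spam'] * 4 + ['ham'] * 8
print(demo_adasyn_allocation(demo_points, demo_labels, 'spam', k=2, n_to_generate=4))
# -> [0, 0, 0, 4]  (the whole budget goes to the hard-to-learn spam point)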
# Split data
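X_train, X_test, y_train, y_test = train_test_split(
    dataset['message'], dataset['label'], test_size=0.3,
    random_state=42,            # fixed seed so the split is reproducible
    stratify=dataset['label']   # keep the ham/spam ratio equal in both splits
)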
# Initialize ADASYN
ADA = ADASYN(random_state=42)
# Build Pipeline
# PIPELINE STAGES:
#   1. 'bow': Convert text to word counts.
#   2. 'tfidf': Convert counts to TF-IDF scores.
#   3. 'adasyn': Generate synthetic minority samples.
#   4. 'classifier': Train Random Forest.
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=text_process)),
    ('tfidf', TfidfTransformer()),
    ('adasyn', ADA),  # Apply ADASYN
    ('classifier', RandomForestClassifier(random_state=42))
])
# Fit the pipeline on the original training data
pipeline.fit(X_train, y_train)
# Make predictions
predictions = pipeline.predict(X_test)
print("\n" + "="*80)
print("PIPELINE 1: ADASYN + RANDOM FOREST")
print("="*80)
print("\nCLASSIFICATION REPORT:")
print(classification_report(y_test, predictions))
print("\nCONFUSION MATRIX:")
print(confusion_matrix(y_test, predictions))
# Test on a custom spam message
print("\nSAMPLE PREDICTION:")
print("Input: 'Free entry in 2 a wkly comp to win FA Cup fina'")
print("Predicted:", pipeline.predict(['Free entry in 2 a wkly comp to win FA Cup fina']))
# ════════════════════════════════════════════════════════════════════════════════
# STEP 7: PIPELINE 2 - SMOTE + RANDOM FOREST
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Build a second pipeline with SMOTE for comparison.
# WHY: SMOTE is a simpler, more widely used technique, which makes it a natural
#      baseline for ADASYN.
# SMOTE THEORY:
#   - Synthetic Minority Over-sampling Technique (SMOTE) creates synthetic samples 
#     by interpolating between existing minority class instances.
#   - For each minority sample, SMOTE finds its k-nearest neighbors and creates 
#     new samples along the line segments connecting them.
#   - Simpler than ADASYN but doesn't adapt to difficulty.
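# ILLUSTRATION (a minimal sketch of the interpolation step only):
# SMOTE places a synthetic point at a random position on the line segment
# between a minority sample and one of its minority-class neighbours:
#   x_new = x + lam * (neighbour - x), with lam drawn uniformly from [0, 1).
# ASSUMPTION: 'demo_smote_sample' is a hypothetical helper. Neighbour search
# and the choice of how many samples to create are left out.
import random


def demo_smote_sample(x, neighbour, rng):
    lam = rng.random()  # one interpolation factor shared by all coordinates
    return [a + lam * (b - a) for a, b in zip(x, neighbour)]


demo_rng = random.Random(0)  # seeded only to make the demo repeatable
demo_point = demo_smote_sample([0.0, 1.0], [1.0, 3.0], demo_rng)
print(demo_point)  # lies on the segment between [0.0, 1.0] and [1.0, 3.0]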
# Build Pipeline with SMOTE
pipeline2 = Pipeline([
    ('bow', CountVectorizer(analyzer=text_process)),
    ('tfidf', TfidfTransformer()),
    ('smote', SMOTE(random_state=40)),  # Use SMOTE instead of ADASYN
    ('classifier', RandomForestClassifier(random_state=42))
])
# Fit the pipeline
pipeline2.fit(X_train, y_train)
# Make predictions
predictions2 = pipeline2.predict(X_test)
print("\n" + "="*80)
print("PIPELINE 2: SMOTE + RANDOM FOREST")
print("="*80)
print("\nCLASSIFICATION REPORT:")
print(classification_report(y_test, predictions2))
print("\nCONFUSION MATRIX:")
print(confusion_matrix(y_test, predictions2))
# Test on a custom spam message
print("\nSAMPLE PREDICTION:")
print("Input: 'Free entry in 2 a wkly comp to win FA Cup fina'")
print("Predicted:", pipeline2.predict(['Free entry in 2 a wkly comp to win FA Cup fina']))
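# ILLUSTRATION (a minimal sketch with made-up numbers, not real results):
# How the spam-class precision, recall, and F1 in a classification report follow
# from a 2x2 confusion matrix. scikit-learn's confusion_matrix puts true labels
# on the rows and predicted labels on the columns, in sorted label order, so
# with labels 'ham' and 'spam' the layout is [[TN, FP], [FN, TP]] for spam.
# ASSUMPTION: 'demo_spam_scores' and the matrix below are hypothetical.
def demo_spam_scores(cm):
    fp, fn, tp = cm[0][1], cm[1][0], cm[1][1]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical matrix: 90 ham kept, 2 ham flagged, 2 spam missed, 8 spam caught.
demo_precision, demo_recall, demo_f1 = demo_spam_scores([[90, 2], [2, 8]])
print(round(demo_precision, 3), round(demo_recall, 3), round(demo_f1, 3))
# -> 0.8 0.8 0.8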
# ════════════════════════════════════════════════════════════════════════════════
# FINAL SUMMARY & CONCLUSION
# ════════════════════════════════════════════════════════════════════════════════
#   We built two complete SMS spam classifiers:
#   1. ADASYN-based pipeline (adaptive oversampling).
#   2. SMOTE-based pipeline (uniform oversampling).
# KEY TAKEAWAYS:
#   - Class imbalance is a critical issue in spam detection. With about 87% "ham", 
#     a model can reach roughly 87% accuracy by always predicting "ham", so spam 
#     recall and F1 matter more than overall accuracy.
#   - ADASYN can perform better on datasets with complex decision boundaries 
#     because it focuses on hard-to-classify samples, but this is not guaranteed.
#   - SMOTE is simpler and faster, making it a good baseline.
#   - Pipelines ensure reproducibility and make deployment easier (no need to 
#     manually apply each transformation step).
# PERFORMANCE COMPARISON:
#   - Compare the F1-scores for the 'spam' class in both reports.
#   - ADASYN may show slightly higher recall if the spam messages are diverse.
# PRACTICAL APPLICATION:
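#   - This pipeline can be integrated into a messaging app to filter spam in real-time.
#   - The model can be saved using pickle (or joblib) and loaded in a production 
#     environment, provided text_process is importable there under the same 
#     module path, since the fitted CountVectorizer stores a reference to it.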
# FURTHER IMPROVEMENTS:
#   - Hyperparameter tuning (GridSearchCV on n_estimators, max_depth).
#   - Try other classifiers (SVM, Gradient Boosting).
#   - Use word embeddings (Word2Vec, GloVe) instead of TF-IDF.
# ════════════════════════════════════════════════════════════════════════════════
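# ILLUSTRATION (a minimal sketch of the save/load round trip mentioned under
# PRACTICAL APPLICATION):
# ASSUMPTION: 'demo_save_and_load', the file name 'model.pkl', and the dict
# used as a stand-in model are hypothetical. A fitted pipeline would be passed
# in the same way. Only unpickle files from a trusted source, because loading
# a pickle can execute arbitrary code.
import os
import pickle
import tempfile


def demo_save_and_load(model, path):
    # Serialise the object to disk ...
    with open(path, 'wb') as fh:
        pickle.dump(model, fh)
    # ... and read it back, as a serving process would at start-up.
    with open(path, 'rb') as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as demo_dir:
    demo_restored = demo_save_and_load(
        {'name': 'stand-in model'}, os.path.join(demo_dir, 'model.pkl')
    )
print(demo_restored == {'name': 'stand-in model'})  # -> True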