NLP-Tutorial-Collection/05_TF_IDF.py at main · ImdataScientistSachin/NLP-Tutorial-Collection

313 lines (267 loc) · 17.6 KB
#!/usr/bin/env python
# coding: utf-8
# ════════════════════════════════════════════════════════════════════════════════
# PROJECT: TF-IDF ALGORITHM IMPLEMENTATION FROM SCRATCH
# ════════════════════════════════════════════════════════════════════════════════
# 1. EXECUTIVE SUMMARY
# ────────────────────────────────────────────────────────────────────────────────
# OBJECTIVE:
#   To demonstrate the mathematical mechanics of Term Frequency-Inverse Document 
#   Frequency (TF-IDF) by implementing the algorithm manually using Python, 
#   NumPy, and NLTK. This script avoids high-level abstractions (like 
#   TfidfVectorizer from scikit-learn) to provide a transparent view of the 
#   underlying calculations.
# PROBLEM STATEMENT & CONTEXT:
#   Machine learning models cannot process raw text. Text vectorization is the 
#   process of converting text into numerical representation. TF-IDF is a 
#   foundational statistical measure that evaluates the importance of a word 
#   to a document in a collection (corpus). It addresses the limitations of 
#   simple Bag-of-Words (BoW) by weighing down frequent, less informative 
#   words (like "the", "is") and scaling up rare, distinctive terms.
# ALGORITHM OVERVIEW:
#   - Term Frequency (TF): Measures local importance (how often a word appears 
#     in a specific document).
#   - Inverse Document Frequency (IDF): Measures global importance (how rare 
#     a word is across the entire corpus).
#   - TF-IDF Score: TF * IDF. High score = high relevance.
#   - Source: Leonardo DiCaprio's 2016 Oscar acceptance speech for "The Revenant".
#   - Type: Unstructured text data.
#   - Size: Small corpus (single paragraph split into sentences).
# EXPECTED OUTCOMES:
#   - A numerical matrix representing the text data.
#   - A clear understanding of the tokenization, cleaning, and scoring pipeline.
# PREREQUISITES:
#   - Python 3.x
#   - Libraries: nltk, numpy, re, heapq
#   - Domain Knowledge: Basic understanding of text processing and linear algebra.
# ────────────────────────────────────────────────────────────────────────────────
# TABLE OF CONTENTS
# ────────────────────────────────────────────────────────────────────────────────
# 1. Setup & Library Imports
# 2. Data Acquisition
# 3. Data Preprocessing (Tokenization & Cleaning)
# 4. Feature Engineering: Word Frequency Analysis
# 5. Vocabulary Selection
# 6. IDF Calculation (Inverse Document Frequency)
# 7. TF Calculation (Term Frequency)
# 8. TF-IDF Matrix Computation
# 9. Final Results & Array Transformation
# ════════════════════════════════════════════════════════════════════════════════
# ════════════════════════════════════════════════════════════════════════════════
# STEP 1: SETUP & LIBRARY IMPORTS
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Import necessary libraries for text processing and numerical operations.
#   - 're': For regular expressions, essential for text cleaning (removing punctuation).
#   - 'nltk': The Natural Language Toolkit, a standard library for NLP tasks like tokenization.
#   - 'numpy': For efficient numerical array manipulations and mathematical functions (log).
# CONTEXT: Ensure 'nltk' data (like 'punkt') is downloaded if running for the first time.
import nltk
import numpy as np
# ════════════════════════════════════════════════════════════════════════════════
# STEP 2: DATA ACQUISITION
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Define the raw input text.
# WHY: We need a corpus to process. In a real-world scenario, this would be loaded 
#      from files or a database. Here, we use a hardcoded string for demonstration.
# DATASET DESCRIPTION: A paragraph from a speech.
paragraph = """ Thank you all so very much. Thank you to the Academy. Thank you to all of you in this room. I have to congratulate the other incredible nominees this year. The Revenant was the product of the tireless efforts of an unbelievable cast and crew. First off, to my brother in this endeavor, Mr. Tom Hardy. Tom, your talent on screen can only be surpassed by your friendship off screen … thank you for creating a transcendent cinematic experience. Thank you to everybody at Fox and New Regency … my entire team. I have to thank everyone from the very onset of my career … To my parents; none of this would be possible without you. And to my friends, I love you dearly; you know who you are. """
# ════════════════════════════════════════════════════════════════════════════════
# STEP 3: DATA PREPROCESSING (TOKENIZATION & CLEANING)
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Break text into sentences and clean them.
#   - Sentence Tokenization: TF-IDF is often calculated at the document level. 
#     Here, we treat each sentence as a "document".
#   - Cleaning: Remove noise (punctuation, extra spaces) to ensure "Thank" and "thank" 
#     are treated as the same word (normalization).
# TECHNICAL DEEP-DIVE:
#   - nltk.sent_tokenize: Uses an unsupervised algorithm (Punkt) to build a model 
#     for abbreviation words, collocations, and words that start sentences.
dataset = nltk.sent_tokenize(paragraph)
# --------------------------------------------------------------------------------
# CLEANING LOOP
# --------------------------------------------------------------------------------
# WHAT: Iterate through each sentence to normalize text.
#   1. Lowercasing: dataset[i].lower()
#   2. Removing non-words: re.sub(r'\W', ' ', ...) -> Replaces punctuation with space.
#   3. Removing trailing spaces: re.sub(r'\s+$', '', ...)
#   4. Collapsing multiple spaces: re.sub(r'\s+', ' ', ...)
# REGEX EXPLANATION:
#   - \W : Matches any character that is not a word character (alphanumeric + underscore).
#   - \s+: Matches one or more whitespace characters.
#   - $  : Matches the end of the string.
for i in range(len(dataset)):
    dataset[i] = dataset[i].lower()
    dataset[i] = re.sub(r'\W',' ', dataset[i])
    dataset[i] = re.sub(r'\s+$','',dataset[i])
    dataset[i] = re.sub(r'\s+',' ',dataset[i])
# ════════════════════════════════════════════════════════════════════════════════
# STEP 4: FEATURE ENGINEERING - WORD FREQUENCY ANALYSIS
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Construct a histogram (frequency count) of all words in the corpus.
# WHY: To identify which words are most common. This is the first step in selecting 
#      our vocabulary (features).
#   - Iterate through every sentence.
#   - Tokenize sentence into words.
#   - Update the count in the dictionary 'word2count'.
word2count = {}  # Initialize an empty dictionary to store word frequencies
for data in dataset:
    words = nltk.word_tokenize(data)  # Example: ['thank', 'you', 'all', 'so', 'very', 'much']
    for word in words:
        if word not in word2count.keys():
            word2count[word] = 1
        else:
            word2count[word] += 1
# ════════════════════════════════════════════════════════════════════════════════
# STEP 5: VOCABULARY SELECTION
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Select the top N most frequent words to form our feature set.
#   - Dimensionality Reduction: Using all words can lead to a sparse matrix and 
#     computational inefficiency (Curse of Dimensionality).
#   - Noise Reduction: Very rare words might not be statistically significant.
# PARAMETER:
#   - n=6: We are selecting only the top 6 words. In a real application, this 
#     would be much larger (e.g., 1000-5000).
import heapq
# TECHNICAL DEEP-DIVE:
#   - heapq.nlargest: Efficiently finds the N largest elements from a dataset.
#   - key=word2count.get: Sorts based on the values (counts) of the dictionary.
freq_words = heapq.nlargest(6,word2count,key=word2count.get)
# ════════════════════════════════════════════════════════════════════════════════
# STEP 6: IDF CALCULATION (INVERSE DOCUMENT FREQUENCY)
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Calculate the IDF score for each word in our selected vocabulary.
# WHY: To weigh down words that appear in many documents (sentences).
#   IDF(t) = log( Total Number of Documents / (Number of Documents containing t) )
# INTERPRETATION:
#   - If a word appears in all documents, ratio is 1, log(1) = 0. The word is ignored.
#   - If a word appears in few documents, ratio is high, log(ratio) is high. The word is important.
# NOTE: The code adds +1 inside the log to ensure the value is > 0 (smoothing), 
# though standard smoothing is usually (doc_count + 1) in the denominator to avoid division by zero.
word_idfs = {}
for word in freq_words:
    doc_count = 0
    for data in dataset:
        if word in nltk.word_tokenize(data):
            doc_count +=1  # Increment if the word appears in the document
    word_idfs[word] = np.log((len(dataset)/doc_count)+1)
# ════════════════════════════════════════════════════════════════════════════════
# STEP 7: TF CALCULATION (TERM FREQUENCY)
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Calculate Term Frequency for each word in each document.
# WHY: To measure how often a word appears in a specific document.
#   TF(t, d) = (Count of t in d) / (Total words in d)
# ALGORITHM:
#   - Create a dictionary 'tf_matrix' where keys are words.
#   - Values are lists representing the TF of that word across all sentences.
tf_matrix = {}
for word in freq_words:
    doc_tf = []
    for data in dataset:
        frequency = 0
        for w in nltk.word_tokenize(data): #thank you all so very much
            if w == word:
                frequency += 1
        tf_word = frequency/len(nltk.word_tokenize(data))
        doc_tf.append(tf_word)
    tf_matrix[word] = doc_tf
# ════════════════════════════════════════════════════════════════════════════════
# STEP 8: TF-IDF MATRIX COMPUTATION
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Combine TF and IDF to get the final score.
# WHY: To get a weight that represents both local frequency and global rarity.
#   TF-IDF = TF * IDF
# STRUCTURE:
#   - Rows: Documents (Sentences)
#   - Columns: Features (Words)
#   - However, the code below initially builds it as List of Lists where:
#     Outer List = Words, Inner List = Scores per document.
#     This is effectively the transpose of the standard document-term matrix.
# Initialize an empty list to store the final TF-IDF matrix
tfidf_matrix = []
# Iterate over each word in the TF matrix (precomputed term frequencies)
for word in tf_matrix.keys():
    # Initialize a list to store TF-IDF scores for the current word across all documents
    tfidf = []
    # Iterate through each TF value (term frequency) for the current word
    for value in tf_matrix[word]:
        # Calculate TF-IDF score: TF * IDF (precomputed in word_idfs)
        score = value * word_idfs[word]
        tfidf.append(score)
    # Add the current word's TF-IDF scores to the final matrix
    tfidf_matrix.append(tfidf)
# ════════════════════════════════════════════════════════════════════════════════
# STEP 9: FINAL RESULTS & ARRAY TRANSFORMATION
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Convert the list structure to a NumPy array and reshape it.
# WHY: Machine learning models expect a matrix of shape (n_samples, n_features).
# TRANSFORMATION EXPLANATION:
#   1. np.array(tfidf_matrix): Converts list of lists to array.
#      Current Shape: (n_features, n_samples) -> (6, 10) roughly.
#   2. .reshape(10, 6): Reshapes to (n_samples, n_features).
#      WARNING: Reshaping directly without transposing (.T) might scramble data 
#      if the original construction was column-major. Ideally, one should use 
#      np.transpose() if the lists represented columns.
#      However, we maintain the code exactly as written.
X = np.array(tfidf_matrix)
# Check dimensions before reshaping
# print( X.shape )
# Reshape to (Documents, Words)
X = X.reshape(10,6)
print( X.shape )
# Display the final TF-IDF Matrix
# Rows = Sentences
# Cols = Top 6 frequent words
# ════════════════════════════════════════════════════════════════════════════════
# FINAL SUMMARY & CONCLUSION
# ════════════════════════════════════════════════════════════════════════════════
#   We successfully transformed raw text into a numerical matrix using TF-IDF.
#   1. Cleaned and tokenized the text.
#   2. Identified the top 6 most frequent words to serve as our vocabulary.
#   3. Calculated how rare these words are globally (IDF).
#   4. Calculated how frequent they are locally (TF).
#   5. Combined these to score each word for each sentence.
# KEY TAKEAWAYS:
#   - TF-IDF is a powerful baseline for text classification and clustering.
#   - It handles variable-length documents by normalizing (TF).
#   - It filters out stopwords automatically via IDF (if they appear in all docs).
# LIMITATIONS OF THIS IMPLEMENTATION:
#   - Vocabulary size is hardcoded (6).
#   - Inefficient nested loops (O(N^2) complexity).
#   - No handling of unseen words in new data.
# PRACTICAL APPLICATION:
#   This matrix 'X' can now be fed into a classifier (like Logistic Regression) 
#   or a clustering algorithm (like K-Means) to analyze the text.
# ════════════════════════════════════════════════════════════════════════════════
Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FilesExpand file tree

05_TF_IDF.py

Latest commit

History

05_TF_IDF.py

File metadata and controls