-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy path05_TF_IDF.py
More file actions
313 lines (267 loc) · 17.6 KB
/
05_TF_IDF.py
File metadata and controls
313 lines (267 loc) · 17.6 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
#!/usr/bin/env python
# coding: utf-8
"""
# ════════════════════════════════════════════════════════════════════════════════
# PROJECT: TF-IDF ALGORITHM IMPLEMENTATION FROM SCRATCH
# ════════════════════════════════════════════════════════════════════════════════
#
# 1. EXECUTIVE SUMMARY
# ────────────────────────────────────────────────────────────────────────────────
# OBJECTIVE:
# To demonstrate the mathematical mechanics of Term Frequency-Inverse Document
# Frequency (TF-IDF) by implementing the algorithm manually using Python,
# NumPy, and NLTK. This script avoids high-level abstractions (like
# TfidfVectorizer from scikit-learn) to provide a transparent view of the
# underlying calculations.
#
# PROBLEM STATEMENT & CONTEXT:
# Machine learning models cannot process raw text. Text vectorization is the
# process of converting text into numerical representation. TF-IDF is a
# foundational statistical measure that evaluates the importance of a word
# to a document in a collection (corpus). It addresses the limitations of
# simple Bag-of-Words (BoW) by weighing down frequent, less informative
# words (like "the", "is") and scaling up rare, distinctive terms.
#
# ALGORITHM OVERVIEW:
# - Term Frequency (TF): Measures local importance (how often a word appears
# in a specific document).
# - Inverse Document Frequency (IDF): Measures global importance (how rare
# a word is across the entire corpus).
# - TF-IDF Score: TF * IDF. High score = high relevance.
#
# DATASET:
# - Source: Leonardo DiCaprio's 2016 Oscar acceptance speech for "The Revenant".
# - Type: Unstructured text data.
# - Size: Small corpus (single paragraph split into sentences).
#
# EXPECTED OUTCOMES:
# - A numerical matrix representing the text data.
# - A clear understanding of the tokenization, cleaning, and scoring pipeline.
#
# PREREQUISITES:
# - Python 3.x
# - Libraries: nltk, numpy, re, heapq
# - Domain Knowledge: Basic understanding of text processing and linear algebra.
#
# ────────────────────────────────────────────────────────────────────────────────
# TABLE OF CONTENTS
# ────────────────────────────────────────────────────────────────────────────────
# 1. Setup & Library Imports
# 2. Data Acquisition
# 3. Data Preprocessing (Tokenization & Cleaning)
# 4. Feature Engineering: Word Frequency Analysis
# 5. Vocabulary Selection
# 6. IDF Calculation (Inverse Document Frequency)
# 7. TF Calculation (Term Frequency)
# 8. TF-IDF Matrix Computation
# 9. Final Results & Array Transformation
# ════════════════════════════════════════════════════════════════════════════════
"""
# ════════════════════════════════════════════════════════════════════════════════
# STEP 1: SETUP & LIBRARY IMPORTS
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Import necessary libraries for text processing and numerical operations.
# WHY:
# - 're': For regular expressions, essential for text cleaning (removing punctuation).
# - 'nltk': The Natural Language Toolkit, a standard library for NLP tasks like tokenization.
# - 'numpy': For efficient numerical array manipulations and mathematical functions (log).
# CONTEXT: Ensure 'nltk' data (like 'punkt') is downloaded if running for the first time.
import re
import nltk
import numpy as np
# ════════════════════════════════════════════════════════════════════════════════
# STEP 2: DATA ACQUISITION
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Define the raw input text.
# WHY: We need a corpus to process. In a real-world scenario, this would be loaded
# from files or a database. Here, we use a hardcoded string for demonstration.
# DATASET DESCRIPTION: A paragraph from a speech.
paragraph = """ Thank you all so very much. Thank you to the Academy. Thank you to all of you in this room. I have to congratulate the other incredible nominees this year. The Revenant was the product of the tireless efforts of an unbelievable cast and crew. First off, to my brother in this endeavor, Mr. Tom Hardy. Tom, your talent on screen can only be surpassed by your friendship off screen … thank you for creating a transcendent cinematic experience. Thank you to everybody at Fox and New Regency … my entire team. I have to thank everyone from the very onset of my career … To my parents; none of this would be possible without you. And to my friends, I love you dearly; you know who you are. """
# ════════════════════════════════════════════════════════════════════════════════
# STEP 3: DATA PREPROCESSING (TOKENIZATION & CLEANING)
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Break text into sentences and clean them.
# WHY:
# - Sentence Tokenization: TF-IDF is often calculated at the document level.
# Here, we treat each sentence as a "document".
# - Cleaning: Remove noise (punctuation, extra spaces) to ensure "Thank" and "thank"
# are treated as the same word (normalization).
# TECHNICAL DEEP-DIVE:
# - nltk.sent_tokenize: Uses an unsupervised algorithm (Punkt) to build a model
# for abbreviation words, collocations, and words that start sentences.
dataset = nltk.sent_tokenize(paragraph)
# --------------------------------------------------------------------------------
# CLEANING LOOP
# --------------------------------------------------------------------------------
# WHAT: Iterate through each sentence to normalize text.
# HOW:
# 1. Lowercasing: dataset[i].lower()
# 2. Removing non-words: re.sub(r'\W', ' ', ...) -> Replaces punctuation with space.
# 3. Removing trailing spaces: re.sub(r'\s+$', '', ...)
# 4. Collapsing multiple spaces: re.sub(r'\s+', ' ', ...)
#
# REGEX EXPLANATION:
# - \W : Matches any character that is not a word character (alphanumeric + underscore).
# - \s+: Matches one or more whitespace characters.
# - $ : Matches the end of the string.
for i in range(len(dataset)):
dataset[i] = dataset[i].lower()
dataset[i] = re.sub(r'\W',' ', dataset[i])
dataset[i] = re.sub(r'\s+$','',dataset[i])
dataset[i] = re.sub(r'\s+',' ',dataset[i])
# ════════════════════════════════════════════════════════════════════════════════
# STEP 4: FEATURE ENGINEERING - WORD FREQUENCY ANALYSIS
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Construct a histogram (frequency count) of all words in the corpus.
# WHY: To identify which words are most common. This is the first step in selecting
# our vocabulary (features).
# LOGIC:
# - Iterate through every sentence.
# - Tokenize sentence into words.
# - Update the count in the dictionary 'word2count'.
word2count = {} # Initialize an empty dictionary to store word frequencies
for data in dataset:
words = nltk.word_tokenize(data) # Example: ['thank', 'you', 'all', 'so', 'very', 'much']
for word in words:
if word not in word2count.keys():
word2count[word] = 1
else:
word2count[word] += 1
# ════════════════════════════════════════════════════════════════════════════════
# STEP 5: VOCABULARY SELECTION
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Select the top N most frequent words to form our feature set.
# WHY:
# - Dimensionality Reduction: Using all words can lead to a sparse matrix and
# computational inefficiency (Curse of Dimensionality).
# - Noise Reduction: Very rare words might not be statistically significant.
# PARAMETER:
# - n=6: We are selecting only the top 6 words. In a real application, this
# would be much larger (e.g., 1000-5000).
import heapq
# TECHNICAL DEEP-DIVE:
# - heapq.nlargest: Efficiently finds the N largest elements from a dataset.
# - key=word2count.get: Sorts based on the values (counts) of the dictionary.
freq_words = heapq.nlargest(6,word2count,key=word2count.get)
# ════════════════════════════════════════════════════════════════════════════════
# STEP 6: IDF CALCULATION (INVERSE DOCUMENT FREQUENCY)
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Calculate the IDF score for each word in our selected vocabulary.
# WHY: To weigh down words that appear in many documents (sentences).
# MATH:
# IDF(t) = log( Total Number of Documents / (Number of Documents containing t) )
#
# INTERPRETATION:
# - If a word appears in all documents, ratio is 1, log(1) = 0. The word is ignored.
# - If a word appears in few documents, ratio is high, log(ratio) is high. The word is important.
#
# NOTE: The code adds +1 inside the log to ensure the value is > 0 (smoothing),
# though standard smoothing is usually (doc_count + 1) in the denominator to avoid division by zero.
word_idfs = {}
for word in freq_words:
doc_count = 0
for data in dataset:
if word in nltk.word_tokenize(data):
doc_count +=1 # Increment if the word appears in the document
word_idfs[word] = np.log((len(dataset)/doc_count)+1)
# ════════════════════════════════════════════════════════════════════════════════
# STEP 7: TF CALCULATION (TERM FREQUENCY)
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Calculate Term Frequency for each word in each document.
# WHY: To measure how often a word appears in a specific document.
# MATH:
# TF(t, d) = (Count of t in d) / (Total words in d)
#
# ALGORITHM:
# - Create a dictionary 'tf_matrix' where keys are words.
# - Values are lists representing the TF of that word across all sentences.
tf_matrix = {}
for word in freq_words:
doc_tf = []
for data in dataset:
frequency = 0
for w in nltk.word_tokenize(data): #thank you all so very much
if w == word:
frequency += 1
tf_word = frequency/len(nltk.word_tokenize(data))
doc_tf.append(tf_word)
tf_matrix[word] = doc_tf
# ════════════════════════════════════════════════════════════════════════════════
# STEP 8: TF-IDF MATRIX COMPUTATION
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Combine TF and IDF to get the final score.
# WHY: To get a weight that represents both local frequency and global rarity.
# MATH:
# TF-IDF = TF * IDF
#
# STRUCTURE:
# - Rows: Documents (Sentences)
# - Columns: Features (Words)
# - However, the code below initially builds it as List of Lists where:
# Outer List = Words, Inner List = Scores per document.
# This is effectively the transpose of the standard document-term matrix.
# Initialize an empty list to store the final TF-IDF matrix
tfidf_matrix = []
# Iterate over each word in the TF matrix (precomputed term frequencies)
for word in tf_matrix.keys():
# Initialize a list to store TF-IDF scores for the current word across all documents
tfidf = []
# Iterate through each TF value (term frequency) for the current word
for value in tf_matrix[word]:
# Calculate TF-IDF score: TF * IDF (precomputed in word_idfs)
score = value * word_idfs[word]
tfidf.append(score)
# Add the current word's TF-IDF scores to the final matrix
tfidf_matrix.append(tfidf)
# ════════════════════════════════════════════════════════════════════════════════
# STEP 9: FINAL RESULTS & ARRAY TRANSFORMATION
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Convert the list structure to a NumPy array and reshape it.
# WHY: Machine learning models expect a matrix of shape (n_samples, n_features).
# TRANSFORMATION EXPLANATION:
# 1. np.array(tfidf_matrix): Converts list of lists to array.
# Current Shape: (n_features, n_samples) -> (6, 10) roughly.
# 2. .reshape(10, 6): Reshapes to (n_samples, n_features).
# WARNING: Reshaping directly without transposing (.T) might scramble data
# if the original construction was column-major. Ideally, one should use
# np.transpose() if the lists represented columns.
# However, we maintain the code exactly as written.
X = np.array(tfidf_matrix)
# Check dimensions before reshaping
# print( X.shape )
# Reshape to (Documents, Words)
X = X.reshape(10,6)
print( X.shape )
# Display the final TF-IDF Matrix
# Rows = Sentences
# Cols = Top 6 frequent words
X
"""
# ════════════════════════════════════════════════════════════════════════════════
# FINAL SUMMARY & CONCLUSION
# ════════════════════════════════════════════════════════════════════════════════
#
# RECAP:
# We successfully transformed raw text into a numerical matrix using TF-IDF.
# 1. Cleaned and tokenized the text.
# 2. Identified the top 6 most frequent words to serve as our vocabulary.
# 3. Calculated how rare these words are globally (IDF).
# 4. Calculated how frequent they are locally (TF).
# 5. Combined these to score each word for each sentence.
#
# KEY TAKEAWAYS:
# - TF-IDF is a powerful baseline for text classification and clustering.
# - It handles variable-length documents by normalizing (TF).
# - It filters out stopwords automatically via IDF (if they appear in all docs).
#
# LIMITATIONS OF THIS IMPLEMENTATION:
# - Vocabulary size is hardcoded (6).
# - Inefficient nested loops (O(N^2) complexity).
# - No handling of unseen words in new data.
#
# PRACTICAL APPLICATION:
# This matrix 'X' can now be fed into a classifier (like Logistic Regression)
# or a clustering algorithm (like K-Means) to analyze the text.
#
# ════════════════════════════════════════════════════════════════════════════════
"""