#!/usr/bin/env python
# coding: utf-8
"""
# ════════════════════════════════════════════════════════════════════════════════
# PROJECT: N-GRAM TEXT GENERATION FROM SCRATCH
# ════════════════════════════════════════════════════════════════════════════════
# 1. EXECUTIVE SUMMARY
# ────────────────────────────────────────────────────────────────────────────────
# OBJECTIVE:
# To demonstrate the fundamental concepts of N-Gram language models by
# implementing a simple text generator from scratch using Python. This script
# builds a probabilistic model that predicts the next word in a sequence
# based on the previous N words.
# PROBLEM STATEMENT & CONTEXT:
# Language modeling is the core of many NLP tasks, including autocomplete,
# speech recognition, and machine translation. An N-Gram model assumes that
# the probability of a word depends only on the previous N-1 words (Markov
# Property). While modern LLMs (like GPT) use complex neural networks,
# N-Grams provide the foundational statistical understanding of sequence
# prediction.
# ALGORITHM OVERVIEW:
# - N-Gram: A contiguous sequence of N items from a text.
# - Training: Building a dictionary where keys are N-word sequences and
# values are lists of possible next words.
# - Generation: Starting with a seed sequence and iteratively picking the
# next word based on the learned probabilities (random sampling).
# DATASET:
# - Source: A historical paragraph about the assassination of Mahatma Gandhi.
# - Type: Unstructured text data.
# - Size: Small corpus (single paragraph).
# EXPECTED OUTCOMES:
# - A generated text sequence that mimics the style and vocabulary of the
# input text, though it may lack long-term coherence due to the simplicity
# of the model.
# PREREQUISITES:
# - Python 3.x
# - Libraries: nltk, numpy ('random' is part of the Python standard library)
# - Domain Knowledge: Basic understanding of probability and text processing.
# ────────────────────────────────────────────────────────────────────────────────
# TABLE OF CONTENTS
# ────────────────────────────────────────────────────────────────────────────────
# 1. Theoretical Background
# 2. Setup & Library Imports
# 3. Data Acquisition
# 4. Data Preprocessing (Tokenization)
# 5. Model Configuration (N-Gram Size)
# 6. Model Training (Building the N-Gram Dictionary)
# 7. Text Generation Initialization
# 8. Text Generation Loop
# 9. Final Output
# ════════════════════════════════════════════════════════════════════════════════
"""
# ════════════════════════════════════════════════════════════════════════════════
# STEP 1: THEORETICAL BACKGROUND
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Define what N-Grams are.
# WHY: To provide context for the implementation.
# DEFINITIONS:
# - Unigram (N=1): Single words (e.g., "dog"). No context.
# - Bigram (N=2): Two words (e.g., "big dog"). Uses previous 1 word for context.
# - Trigram (N=3): Three words (e.g., "the big dog"). Uses previous 2 words.
# - Higher-order: More context, but requires more data (sparsity issue).
# WHAT IS AN N-GRAM IN NLP?
# An N-gram is a contiguous sequence of N items (such as words, characters, or
# phonemes) from a given sample of text or speech. In natural language
# processing (NLP), N-grams are most commonly sequences of words used to
# analyze language patterns, build statistical models, and perform various
# text analysis tasks.
#
# TYPES OF N-GRAMS:
# - Unigram (N=1): single words (e.g., "dog", "runs")
# - Bigram (N=2): two consecutive words (e.g., "big dog", "runs fast")
# - Trigram (N=3): three consecutive words (e.g., "the big dog")
# - Higher-order N-grams (N>3): longer sequences, offering more context but
#   also increasing data sparsity and computational complexity
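# The definitions above can be demonstrated with a tiny standalone sketch
# (the helper name `extract_ngrams` is ours, not part of the main script):

```python
def extract_ngrams(tokens, n):
    """Return every contiguous n-token sequence as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

sample = "the big dog runs fast".split()
print(extract_ngrams(sample, 1))  # unigrams: [('the',), ('big',), ('dog',), ('runs',), ('fast',)]
print(extract_ngrams(sample, 2))  # bigrams:  [('the', 'big'), ('big', 'dog'), ('dog', 'runs'), ('runs', 'fast')]
print(extract_ngrams(sample, 3))  # trigrams: [('the', 'big', 'dog'), ('big', 'dog', 'runs'), ('dog', 'runs', 'fast')]
```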
# ════════════════════════════════════════════════════════════════════════════════
# STEP 2: SETUP & LIBRARY IMPORTS
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Import necessary libraries.
# WHY:
# - 'random': To stochastically select the next word from the list of possibilities.
# - 'nltk': For robust word tokenization.
# - 'numpy': Standard numerical library (imported but not strictly used in the core logic here).
import random
import nltk
import numpy as np

# Ensure the tokenizer models used by nltk.word_tokenize are available
# (this is a no-op if they have already been downloaded)
nltk.download('punkt', quiet=True)
# ════════════════════════════════════════════════════════════════════════════════
# STEP 3: DATA ACQUISITION
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Define the training corpus.
# WHY: The model needs text to learn which words follow which sequences.
# CONTEXT: This is a historical text about the assassination of Mahatma Gandhi.
paragraph = """The assassination of Mahatma Gandhi by Nathuram Godse on 30 January 1948 stemmed from a combination of political grievances and ideological opposition rooted in Hindu nationalist extremism. Here are the primary reasons:Godse and his associates resented Gandhi’s insistence on transferring ₹550 million to Pakistan during the Kashmir conflict. Despite India withholding the payment due to Pakistan’s aggression, Gandhi’s fast-unto-death pressured the government to release the funds, which Hindu nationalists viewed as appeasement. Godse argued this decision prioritized Pakistan’s interests over India’s security.Godse accused Gandhi of disproportionately advocating for Muslim rights at the expense of Hindus, particularly during communal violence post-Partition. He believed Gandhi’s emphasis on nonviolence (ahimsa) and interfaith harmony weakened Hindu interests, especially after Pakistan’s creation. Godse stated Gandhi’s policies would lead to further Hindu subjugation.Godse, influenced by Vinayak Savarkar’s Hindu nationalist ideology, saw Gandhi’s pluralistic vision as a threat to the establishment of a Hindu Rashtra (Hindu nation). Gandhi’s rejection of religion-based citizenship and support for equal treatment of all faiths directly conflicted with the Hindu Mahasabha’s agenda.Godse criticized Gandhi’s moral authority over Indian politics, claiming the government capitulated to his demands outside democratic processes. He believed removing Gandhi would enable India to adopt a more assertive, militarized stance against Pakistan.The mass displacement and violence during Partition exacerbated tensions. Godse held Gandhi responsible for failing to protect Hindus in Pakistan and for enabling Partition through his earlier negotiations. Hindu nationalists viewed Gandhi as a symbol of Hindu "weakness" in the face of Muslim demands."""
# ════════════════════════════════════════════════════════════════════════════════
# STEP 4: DATA PREPROCESSING (TOKENIZATION)
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Convert the raw string into a list of individual words (tokens).
# WHY: The N-Gram model operates on discrete units (words), not characters.
# TECHNICAL NOTE: nltk.word_tokenize handles punctuation better than simple .split().
words = nltk.word_tokenize(paragraph)
# Check the number of tokens
# len(words)
# ════════════════════════════════════════════════════════════════════════════════
# STEP 5: MODEL CONFIGURATION
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Set the 'N' in N-Gram.
# WHY: This determines the context window size.
# - n=3: The model looks at the previous 3 words to predict the next one.
# - Note: In standard terminology, a trigram model predicts the 3rd word from
#   the previous 2. Here, the code uses 'n' words as the key to predict the
#   (n+1)-th word, so with n=3 this is effectively a 4-gram model with a
#   context of 3 words.
n = 3
# ════════════════════════════════════════════════════════════════════════════════
# STEP 6: MODEL TRAINING (BUILDING THE N-GRAM DICTIONARY)
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Build a lookup table (dictionary) representing the language model.
# WHY: To map every unique sequence of 'n' words to a list of all words that
# have followed it in the training text.
#
# ALGORITHM:
# 1. Slide a window of size 'n' across the text.
# 2. Key = The sequence of 'n' words (joined by space).
# 3. Value = The word immediately following this sequence.
# 4. Store these in a dictionary where values are lists (to handle multiple possibilities).
# Initialize empty dictionary to store n-grams and their subsequent words
ngram = {}
# Slide an n-sized window through the list of words
# Range goes up to len(words)-n to ensure we don't go out of bounds when accessing i+n
for i in range(len(words)-n):
    # Create the current n-gram by joining n consecutive words
    gram = ' '.join(words[i:i+n])  # e.g. "The assassination of" (with n=3)
    # Initialize an empty list for this n-gram if not already present
    if gram not in ngram:
        ngram[gram] = []
    # Append the next word (position i+n) to this n-gram's list.
    # This builds the frequency distribution implicitly: more frequent
    # next words appear more often in the list.
    ngram[gram].append(words[i+n])  # e.g. "Mahatma" after "The assassination of"
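# To see what this step produces, here is the same dictionary-building logic
# run on a toy corpus, split on whitespace so it works without nltk (the
# names `toy_words`, `n_toy`, and `model` are ours):

```python
toy_words = "the cat sat on the mat and the cat slept".split()
n_toy = 2  # context of two words

model = {}
for i in range(len(toy_words) - n_toy):
    key = ' '.join(toy_words[i:i + n_toy])
    # setdefault creates the list on first sight, then appends either way
    model.setdefault(key, []).append(toy_words[i + n_toy])

# "the cat" occurs twice, with two different continuations
print(model["the cat"])  # → ['sat', 'slept']
```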
# ════════════════════════════════════════════════════════════════════════════════
# STEP 7: TEXT GENERATION INITIALIZATION
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Select a starting point (seed) for generation.
# WHY: The model needs a context to start predicting.
# STRATEGY: We simply pick the first 'n' words of the text. In a more advanced
# version, we could pick a random valid n-gram or let the user input a seed.
# Initialize the first n-gram from the beginning of the word list
CurrentGram = ' '.join(words[0:n]) # First n words as a single string
# ════════════════════════════════════════════════════════════════════════════════
# STEP 8: TEXT GENERATION LOOP
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Iteratively generate new words.
# WHY: To construct a new sentence based on the learned patterns.
#
# LOGIC:
# 1. Look up the 'CurrentGram' in the dictionary.
# 2. If it exists, retrieve the list of possible next words.
# 3. Randomly select one word from the list (weighted random sampling because
# words appearing multiple times in the list have higher probability).
# 4. Append the new word to the result.
# 5. Update 'CurrentGram' by sliding the window: drop the first word, add the new word.
# 6. Repeat until a limit (150 words) or a dead end is reached.
# Initialize result with the starting n-gram
result = CurrentGram  # e.g. "The assassination of" (with n=3)
# Generate up to 150 words
for i in range(150):
    # Stop if the current n-gram has no continuations (it ended the training text)
    if CurrentGram not in ngram:
        break
    # Get all possible next words for the current n-gram
    possibilities = ngram[CurrentGram]
    # Randomly select one; duplicates in the list make frequent words likelier
    nextItem = random.choice(possibilities)
    # Append the selected word to the result
    result += ' ' + nextItem
    # Re-tokenize the updated result to slice out the last n words.
    # OPTIMIZATION NOTE: re-tokenizing the whole string every iteration is
    # inefficient (O(N^2)); maintaining a list of generated words and slicing
    # it would be better.
    rwords = nltk.word_tokenize(result)
    # Update CurrentGram to the last n words as context for the next iteration
    CurrentGram = ' '.join(rwords[-n:])
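# As the optimization note above suggests, generation can keep a list of
# tokens instead of re-tokenizing the growing string. A minimal standalone
# sketch (`generate` and the toy model are our names; in this script they
# correspond to the Step 8 loop and the `ngram` dictionary):

```python
import random

def generate(model, seed_tokens, n, max_words=150, rng=random):
    out = list(seed_tokens)
    for _ in range(max_words):
        key = ' '.join(out[-n:])  # context = last n generated tokens
        if key not in model:      # dead end: context never seen in training
            break
        # duplicates in the list give frequent continuations higher probability
        out.append(rng.choice(model[key]))
    return ' '.join(out)

# Toy usage on a deterministic model (one continuation per context)
toy_model = {'a b': ['c'], 'b c': ['d'], 'c d': ['a']}
print(generate(toy_model, ['a', 'b'], n=2, max_words=4))  # → "a b c d a"
```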
# ════════════════════════════════════════════════════════════════════════════════
# STEP 9: FINAL OUTPUT
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Display the generated text.
# INTERPRETATION: The output will look locally coherent (phrases make sense) but
# may lack global meaning or repeat itself, which is characteristic of simple
# N-Gram models without smoothing or backoff strategies.
print(result)
"""
# ════════════════════════════════════════════════════════════════════════════════
# FINAL SUMMARY & CONCLUSION
# ════════════════════════════════════════════════════════════════════════════════
#
# RECAP:
# We built a text generator that learns the style of a source text.
# 1. Tokenized the input paragraph.
# 2. Created a dictionary mapping every sequence of 3 words to the word that follows it.
# 3. Used this dictionary to probabilistically generate a new sequence.
#
# KEY TAKEAWAYS:
# - N-Grams capture local syntax and style effectively.
# - The "Markov Assumption" simplifies language modeling by looking only at fixed context.
# - Data Sparsity: If an n-gram sequence never appeared in training, the model fails
# (returns nothing). This is why "Smoothing" is used in production models.
#
# LIMITATIONS OF THIS IMPLEMENTATION:
# - Deterministic start (always starts with the first words of the text).
# - Inefficient string operations inside the loop.
# - No handling of unseen n-grams (zero probability problem).
#
# PRACTICAL APPLICATION:
# - Autocomplete systems (predicting the next word).
# - Spelling correction (context-aware).
# - Simple chatbots.
#
# ════════════════════════════════════════════════════════════════════════════════
"""