#!/usr/bin/env python
# coding: utf-8
# ================================================================
# EXECUTIVE SUMMARY : Natural Language Toolkit (NLTK)
# ================================================================
# This project demonstrates **NLTK (Natural Language Toolkit)**, a leading open-source Python library for natural language processing (NLP).
# ## Natural Language Toolkit (NLTK)
# NLTK is designed for working with human language data, making it a foundational tool for NLP tasks in data science, linguistics, and research.
# ### Key Features
# Comprehensive NLP Suite: NLTK provides a wide array of modules and datasets for tasks such as classification, tokenization, stemming, lemmatization, part-of-speech tagging, parsing, and semantic reasoning.
# Text Preprocessing: It offers robust tools for cleaning and preparing raw text, including tokenizers to split text into words or sentences, stopword removal, and normalization.
# Linguistic Feature Extraction: NLTK can extract parts of speech, named entities, syntactic structures, and more, enabling deeper linguistic analysis.
# Large Corpora Access: The library includes access to numerous sample corpora and lexical resources for training and testing NLP models.
# Integration with Machine Learning: NLTK works well with other Python machine learning libraries, allowing users to build and evaluate classifiers for tasks like sentiment analysis, topic modeling, and text categorization.
# Educational Resources: NLTK is accompanied by extensive documentation, graphical demonstrations, and the well-known book "Natural Language Processing with Python," making it a popular choice for teaching and self-study
# ### Common Applications
# Text Classification: Spam detection, sentiment analysis, and topic classification.
# Tokenization: Breaking down text into manageable units for analysis.
# Stemming and Lemmatization: Reducing words to their root forms for more effective analysis.
# Part-of-Speech Tagging: Identifying grammatical roles of words in context.
# Named Entity Recognition (NER): Extracting entities like people, organizations, and places from text.
# Parsing and Chunking: Analyzing grammatical structure and extracting phrases.
# Purpose:
# - To showcase text preprocessing techniques such as tokenization, stemming,
# lemmatization, stopword removal, and named entity recognition (NER).
# - To highlight how NLTK enables linguistic analysis and prepares text for
# machine learning tasks.
# Why it matters:
# - Text preprocessing is the foundation of NLP pipelines.
# - Recruiters value candidates who can demonstrate practical knowledge of
# tokenization, normalization, and linguistic feature extraction.
# Techniques highlighted:
# - Sentence and word tokenization
# - Stemming and lemmatization
# - Stopword removal
# - Part-of-speech tagging
# - Named Entity Recognition (NER)
# ================================================================
# ==== Step 1: Import Libraries ====
import nltk

# Note: run the downloads below once to install the required corpora and models.
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger'); nltk.download('maxent_ne_chunker'); nltk.download('words')
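# The downloads only need to run once; repeating them on every run is wasteful.
# A minimal sketch of an idempotent helper (`ensure_nltk_data` is my own name,
# not an NLTK function) that fetches a package only when its resource is
# missing, relying on `nltk.data.find` raising LookupError for absent data:

```python
import nltk

def ensure_nltk_data(resource_path, package):
    """Download an NLTK package only if its resource is not already installed."""
    try:
        nltk.data.find(resource_path)  # raises LookupError when the data is missing
    except LookupError:
        nltk.download(package)

# Usage (resource paths follow NLTK's on-disk layout):
# ensure_nltk_data('tokenizers/punkt', 'punkt')
# ensure_nltk_data('corpora/stopwords', 'stopwords')
```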
# ==== Step 2: Create a Sample Paragraph ====
paragraph = """ Thank you all so very much. Thank you to the Academy. Thank you to all of you in this room. I have to congratulate the other incredible nominees this year. The Revenant was the product of the tireless efforts of an unbelievable cast and crew. First off, to my brother in this endeavor, Mr. Tom Hardy. Tom, your talent on screen can only be surpassed by your friendship off screen … thank you for creating a transcendent cinematic experience. Thank you to everybody at Fox and New Regency … my entire team. I have to thank everyone from the very onset of my career … To my parents; none of this would be possible without you. And to my friends, I love you dearly; you know who you are."""
# ==== Step 3: Sentence Tokenization ====
# Splitting text into sentences is the first step in NLP.
# Naive approach: split on '.' (not robust; fails on abbreviations and ellipses).
print(paragraph.split('.'))
# Better approach: use NLTK's sentence tokenizer.
sentences = nltk.sent_tokenize(paragraph)
print(sentences)
# ==== Step 4: Word Tokenization ====
# Naive approach: split on spaces (keeps punctuation attached to words).
print(paragraph.split(' '))
# Robust approach: use NLTK's word tokenizer.
words = nltk.word_tokenize(paragraph)
print('Length of words:', len(words))
print(words)
# ==== Step 5: Stemming ====
# Stemming reduces a word to its root or base form, usually by chopping off
# prefixes or suffixes. The resulting stem may not be a real word, but it
# serves as a common representation for related words.
# ### Example:
# "playing", "played", "plays" → "play"
# "better" → "better" (no change, as stemmers often don’t handle irregular forms)
# Popular stemming algorithms: Porter, Lancaster, and Snowball stemmers.
from nltk.stem import PorterStemmer
sentences = nltk.sent_tokenize(paragraph)
stemmer = PorterStemmer()
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    newWords = [stemmer.stem(word) for word in words]
    sentences[i] = ' '.join(newWords)
print("Stemmed sentences:")
print(sentences)
# Interpretation: Stemming may produce non-dictionary words (e.g., "univers" for "universe"),
# but it helps normalize text for tasks like search or classification.
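# The differences between the algorithms named earlier can be seen side by
# side. A minimal sketch comparing the Porter, Snowball, and Lancaster
# stemmers on a few sample words (stemmers need no corpus downloads):

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

samples = ['playing', 'played', 'plays', 'universe', 'better']
stemmers = {
    'Porter': PorterStemmer(),
    'Snowball': SnowballStemmer('english'),
    'Lancaster': LancasterStemmer(),  # the most aggressive of the three
}
for name, st in stemmers.items():
    print(name, [st.stem(w) for w in samples])
# Porter maps 'playing'/'played'/'plays' to 'play' and 'universe' to 'univers';
# 'better' is left unchanged because stemmers do not handle irregular forms.
```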
# ==== Step 6: Lemmatization ====
# Lemmatization reduces words to their base dictionary form (lemma),
# considering context and part of speech. Because it uses a vocabulary and
# morphological analysis, the output is always a valid word.
# ### Example:
# "playing", "played", "plays" → "play"
# "better" → "good" (because "better" is the comparative form of "good")
# Popular Lemmatizer: WordNet Lemmatizer
# Stemming does not preserve grammatical meaning, whereas lemmatization does.
from nltk.stem import WordNetLemmatizer
lemat = WordNetLemmatizer()
sentences = nltk.sent_tokenize(paragraph)
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    newWords = [lemat.lemmatize(word) for word in words]
    sentences[i] = ' '.join(newWords)
print("Lemmatized sentences:")
print(sentences)
# Interpretation: Lemmatization uses vocabulary and context, producing valid words.
# ==== Step 7: Stopword Removal ====
# Stopwords are frequently used words such as "the", "is", "in", "and", "to"
# that carry little meaning on their own. They are usually filtered out before
# tasks like text classification, information retrieval, or sentiment analysis.
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # precompute once for speed
for i in range(len(sentences)):
    # Tokenize the current sentence, drop stopwords, and rejoin the rest.
    words = nltk.word_tokenize(sentences[i])
    newWords = [word for word in words if word not in stop_words]
    sentences[i] = ' '.join(newWords)
print("Sentences after stopword removal:")
print(sentences)
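# One caveat with this kind of filtering: stopwords.words('english') is all
# lowercase, so capitalized tokens like "The" slip through, and punctuation
# tokens survive too. A sketch of a stricter filter, using a tiny hand-picked
# stopword set as a stand-in for the full NLTK list (so no download is needed):

```python
# Tiny stand-in for stopwords.words('english'); the real list is much longer.
stop_words = {'the', 'is', 'in', 'and', 'to', 'of'}

tokens = ['The', 'Taj', 'Mahal', 'is', 'in', 'Agra', ',', 'India', '.']
filtered = [t for t in tokens
            if t.lower() not in stop_words  # case-insensitive stopword check
            and t.isalpha()]                # drop punctuation tokens
print(filtered)  # ['Taj', 'Mahal', 'Agra', 'India']
```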
# ==== Step 8: Named Entity Recognition (NER) ====
# Named Entity Recognition (NER) is an NLP task that identifies and classifies named entities in text into predefined categories such as:
# Person names (e.g., "Albert Einstein") , Organizations (e.g., "Google"),Locations (e.g., "New York"), Dates, times, monetary values, percentages, etc.
# NER helps extract structured information from unstructured text, which is valuable in information retrieval, question answering, and knowledge graph construction.
paragraph = "The Taj Mahal was built by Emperor Shah Jahan in 1653 at Agra, India. Yash is living in New York and working for IBM."
# Tokenize words
words = nltk.word_tokenize(paragraph)
# Part-of-speech tagging
tagged_word = nltk.pos_tag(words)
print("POS Tagged words:")
print(tagged_word)
# Named Entity Chunking
namedEnt = nltk.ne_chunk(tagged_word)
print("Named Entities:")
print(namedEnt)
# Visualize the named entity tree (opens a GUI window; requires a display,
# so comment this out on headless systems).
namedEnt.draw()
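# Since draw() needs a GUI, the entities can also be pulled out
# programmatically: nltk.ne_chunk returns an nltk.tree.Tree whose
# named-entity subtrees carry labels like PERSON or GPE. A sketch using a
# hand-built tree of the same shape, so it runs without the chunker models
# (`extract_entities` is my own helper name):

```python
from nltk.tree import Tree

def extract_entities(chunked):
    """Collect (label, text) pairs from every named-entity subtree."""
    entities = []
    for node in chunked:
        if isinstance(node, Tree):  # entity subtrees; plain tuples are ordinary tokens
            entities.append((node.label(), ' '.join(tok for tok, _ in node.leaves())))
    return entities

# Hand-built chunk tree mimicking the shape of nltk.ne_chunk output:
chunked = Tree('S', [
    Tree('PERSON', [('Yash', 'NNP')]),
    ('works', 'VBZ'), ('in', 'IN'),
    Tree('GPE', [('New', 'NNP'), ('York', 'NNP')]),
])
print(extract_entities(chunked))  # [('PERSON', 'Yash'), ('GPE', 'New York')]
```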
# ================================================================
# FINAL SUMMARY
# ================================================================
# - We demonstrated key NLP preprocessing techniques using NLTK:
# * Sentence and word tokenization
# * Stemming and lemmatization
# * Stopword removal
# * Part-of-speech tagging
# * Named Entity Recognition (NER)
# - Tokenization breaks text into manageable units.
# - Stemming and lemmatization normalize words for analysis.
# - Stopword removal reduces noise.
# - POS tagging and NER extract linguistic and semantic features.
# ================================================================