-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy path09_summarization.py
More file actions
310 lines (252 loc) · 16.4 KB
/
09_summarization.py
File metadata and controls
310 lines (252 loc) · 16.4 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
#!/usr/bin/env python
# coding: utf-8
"""
# ════════════════════════════════════════════════════════════════════════════════
# PROJECT: AUTOMATED TEXT SUMMARIZATION WITH WEB SCRAPING
# ════════════════════════════════════════════════════════════════════════════════
#
# 1. EXECUTIVE SUMMARY
# ────────────────────────────────────────────────────────────────────────────────
# OBJECTIVE:
# To build an extractive text summarization tool that scrapes content from a
# Wikipedia page and generates a concise summary by selecting the most
# important sentences.
#
# PROBLEM STATEMENT & CONTEXT:
# With the explosion of digital content, reading entire articles to extract
# key information is time-consuming. Automated summarization helps in quickly
# understanding the gist of long documents.
#
# ALGORITHM OVERVIEW:
# - Data Acquisition: Web scraping using `urllib` and `BeautifulSoup`.
# - Preprocessing: Cleaning text, removing citations, and tokenization.
# - Feature Engineering: Calculating word frequencies (weighted by occurrence).
# - Scoring: Ranking sentences based on the sum of weighted frequencies of
# their constituent words.
# - Selection: Picking the top N sentences to form the summary.
#
# DATASET:
# - Source: Wikipedia article on "Yuvraj Singh".
# - Type: Unstructured HTML text.
#
# EXPECTED OUTCOMES:
# - A paragraph-length summary containing the 20 most significant sentences
# from the article.
#
# PREREQUISITES:
# - Python 3.x
# - Libraries: bs4 (BeautifulSoup), urllib, nltk, re, heapq
# - Internet connection for scraping.
#
# ────────────────────────────────────────────────────────────────────────────────
# TABLE OF CONTENTS
# ────────────────────────────────────────────────────────────────────────────────
# 1. Theoretical Background (Summarization Types)
# 2. Setup & Library Imports
# 3. Data Acquisition (Web Scraping)
# 4. Data Preprocessing (Cleaning)
# 5. Feature Engineering (Word Frequencies)
# 6. Sentence Scoring
# 7. Summary Generation
# 8. Final Output
# ════════════════════════════════════════════════════════════════════════════════
"""
# ════════════════════════════════════════════════════════════════════════════════
# STEP 1: THEORETICAL BACKGROUND
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Define Extractive vs. Abstractive Summarization.
# WHY: To understand the approach used in this script.
# # ### Text Summarization with web scrapping
###### Text summarization in NLP is the process of automatically condensing large texts into shorter versions while preserving the essential information and meaning. It enables faster information retrieval and decision-making across various domains such as healthcare, law, and finance
##### Types of Text Summarization
#### There are two primary approaches:
### 1. Extractive Summarization
## Selects and combines the most important sentences or phrases directly from the original text without generating new content.
## Common algorithms include frequency-based methods (e.g., TF-IDF), graph-based methods like TextRank (a variant of PageRank), and machine learning models that score sentences based on relevance.
## Extractive summarization is simpler and computationally less expensive, but the summary may lack fluency or coherence as it only rearranges existing sentences.
### 2. Abstractive Summarization
## Generates new sentences that capture the core meaning of the text, often rephrasing or compressing information.
## Requires advanced NLP techniques such as sequence-to-sequence deep learning models (LSTM, Transformer architectures like BERT, GPT, BART).
## Uses attention mechanisms to focus on relevant parts of the input text for better coherence and relevance.
## More computationally intensive but can produce more natural, human-like summaries
# ════════════════════════════════════════════════════════════════════════════════
# STEP 2: SETUP & LIBRARY IMPORTS
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Import libraries for scraping and NLP.
# WHY:
# - 'bs4': To parse HTML and extract text.
# - 'urllib': To fetch the web page.
# - 'nltk': For tokenization and stopwords.
# - 're': For regex-based cleaning.
# - 'heapq': To efficiently find the top N sentences.
import bs4 as bs
import urllib.request
import re
import nltk
import heapq
# install Libraries for use summarization
# pip install beautifulsoup4
# pip install lxml
# ════════════════════════════════════════════════════════════════════════════════
# STEP 3: DATA ACQUISITION (WEB SCRAPING)
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Fetch and parse the Wikipedia article.
# WHY: We need raw text data to summarize.
# TECHNICAL FIX: Added 'User-Agent' header to avoid HTTP 403 Forbidden errors
# from Wikipedia, which blocks automated scripts.
# Define the URL
url = 'https://en.wikipedia.org/wiki/Yuvraj_Singh'
# Create a request object with a User-Agent header
# Wikipedia blocks requests without a User-Agent to prevent bot scraping
req = urllib.request.Request(
url,
data=None,
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
)
# Open the URL and read the content
web_source = urllib.request.urlopen(req).read()
# Parse the HTML content stored in the variable 'source' using BeautifulSoup
# 'lxml' is specified as the parser, which is fast and lenient with malformed HTML
soup = bs.BeautifulSoup(web_source,'lxml')
# print(soup)
# Initialize an empty string to store the combined text
text = ""
# Loop through all <p> (paragraph) tags found in the HTML content parsed by BeautifulSoup
# Wikipedia content is mostly contained within <p> tags
for paragraph in soup.find_all('p'):
# Extract the text content from each paragraph tag and append it to the 'text' string
text += paragraph.text
# print(text)
# ════════════════════════════════════════════════════════════════════════════════
# STEP 4: DATA PREPROCESSING (CLEANING)
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Clean the text for analysis.
# WHY:
# - Remove citations like [1], [2] which are irrelevant for summary.
# - Normalize whitespace.
# - Create a separate 'clean_text' for word frequency calculation (lowercased, no punctuation).
# Remove reference numbers or citations enclosed in square brackets, e.g., [1], [23], [456]
# The pattern '\[[0-9]*\]' matches any sequence of digits inside square brackets
text = re.sub(r'\[[0-9]*\]', ' ', text)
text = re.sub(r'\s+', ' ', text)
# Convert the entire text to lowercase to ensure uniformity
clean_text = text.lower()
# Replace all non-word characters (anything other than letters, digits, and underscore) with a space
# This removes punctuation and special characters
clean_text = re.sub(r'\W', ' ', clean_text)
# Remove all digits by replacing them with a space
clean_text = re.sub(r'\d', ' ', clean_text)
# Remove any single character surrounded by spaces (isolated letters), e.g., ' a ', ' b '
# This helps remove stray single letters that are usually not meaningful
clean_text = re.sub(r'\s+[a-z]\s+', ' ', clean_text)
# Replace multiple consecutive whitespace characters with a single space
# This normalizes spacing after previous substitutions
clean_text = re.sub(r'\s+', ' ', clean_text)
# clean_text
# ════════════════════════════════════════════════════════════════════════════════
# STEP 5: FEATURE ENGINEERING (WORD FREQUENCIES)
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Calculate the importance of each word.
# LOGIC:
# - Frequency = Importance.
# - Stopwords (common words like "the", "is") are removed as they are noise.
# - Frequencies are normalized (divided by max frequency) to scale between 0 and 1.
# convert paragraph into sentences
sentences = nltk.sent_tokenize(text)
# Download the stopwords corpus if not already downloaded
nltk.download('stopwords')
# Load the list of English stopwords from the NLTK corpus
stop_words = nltk.corpus.stopwords.words('english')
# Initialize an empty dictionary to store word frequencies
word2count = {}
# Tokenize the cleaned text into individual words
for word in nltk.word_tokenize(clean_text):
# Check if the word is NOT in the list of stopwords
if word not in stop_words:
# If the word is not already in the dictionary, add it with count 1
if word not in word2count.keys():
word2count[word] = 1
else:
# If the word is already in the dictionary, increment its count by 1
word2count[word] += 1
# Normalize the word counts by dividing each count by the maximum word count
# This scales the frequencies to a range between 0 and 1
max_count = max(word2count.values())
for key in word2count.keys():
# Divide the count of the current word by the highest count in the dictionary
word2count[key] = word2count[key] / max_count
# word2count
# ════════════════════════════════════════════════════════════════════════════════
# STEP 6: SENTENCE SCORING
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Score each sentence based on the words it contains.
# ALGORITHM:
# - Iterate through every sentence.
# - For each word in the sentence, if it exists in our frequency table, add its
# normalized score to the sentence's total score.
# - Constraint: Ignore very long sentences (>100 words) as they might be
# hard to read or contain too much mixed info.
# Initialize an empty dictionary to store scores for each sentence
sent2score = {}
# Iterate over each sentence in the list 'sentences'
for sentence in sentences:
# Tokenize the sentence into words after converting to lowercase
for word in nltk.word_tokenize(sentence.lower()):
# Check if the word exists in the normalized word frequency dictionary
if word in word2count.keys():
# Consider only sentences shorter than 100 words to avoid overly long sentences
if len(sentence.split(' ')) < 100:
# If the sentence is not already in the dictionary, initialize its score with the word's frequency
if sentence not in sent2score.keys():
sent2score[sentence] = word2count[word]
else:
# If the sentence is already scored, add the current word's frequency to its total score
sent2score[sentence] += word2count[word]
# sent2score
# ════════════════════════════════════════════════════════════════════════════════
# STEP 7: SUMMARY GENERATION
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Select the top N sentences.
# WHY: To create the final summary.
# Select the top 20 sentences with the highest scores from the sent2score dictionary
# heapq.nlargest returns a list of the n largest elements based on the key function
best_sentences = heapq.nlargest(20, sent2score, key=sent2score.get)
# Display the selected top sentences
# print(best_sentences)
# Join the selected top sentences into a single string to form the summary
summary = ' '.join(best_sentences)
# ════════════════════════════════════════════════════════════════════════════════
# STEP 8: FINAL OUTPUT
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Print the result.
print("ORIGINAL LENGTH:", len(text))
print("SUMMARY LENGTH:", len(summary))
print("\nGENERATED SUMMARY:\n")
print(summary)
"""
# ════════════════════════════════════════════════════════════════════════════════
# FINAL SUMMARY & CONCLUSION
# ════════════════════════════════════════════════════════════════════════════════
#
# RECAP:
# We built an extractive summarizer that:
# 1. Scraped Wikipedia.
# 2. Calculated word importance based on frequency.
# 3. Scored sentences by summing the importance of their words.
# 4. Extracted the top 20 sentences.
#
# KEY TAKEAWAYS:
# - Frequency-based summarization is a simple yet effective baseline.
# - It doesn't require training data (unsupervised).
# - Limitation: It assumes that frequent words imply importance, which isn't
# always true (though stopword removal helps).
# - Limitation: It doesn't handle synonyms or context (e.g., "king" and "monarch"
# are treated as unrelated).
#
# PRACTICAL APPLICATION:
# - News aggregators.
# - Legal document summarization.
# - Meeting minutes generation.
#
# ════════════════════════════════════════════════════════════════════════════════
"""