#!/usr/bin/env python
# coding: utf-8
"""
# ════════════════════════════════════════════════════════════════════════════════
# PROJECT: AUTOMATED TEXT SUMMARIZATION WITH WEB SCRAPING
# ════════════════════════════════════════════════════════════════════════════════
#
# 1. EXECUTIVE SUMMARY
# ────────────────────────────────────────────────────────────────────────────────
# OBJECTIVE:
# To build an extractive text summarization tool that scrapes content from a
# Wikipedia page and generates a concise summary by selecting the most
# important sentences.
#
# PROBLEM STATEMENT & CONTEXT:
# With the explosion of digital content, reading entire articles to extract
# key information is time-consuming. Automated summarization helps in quickly
# understanding the gist of long documents.
#
# ALGORITHM OVERVIEW:
# - Data Acquisition: Web scraping using `urllib` and `BeautifulSoup`.
# - Preprocessing: Cleaning text, removing citations, and tokenization.
# - Feature Engineering: Calculating word frequencies (weighted by occurrence).
# - Scoring: Ranking sentences based on the sum of weighted frequencies of
# their constituent words.
# - Selection: Picking the top N sentences to form the summary.
#
# DATASET:
# - Source: Wikipedia article on "Yuvraj Singh".
# - Type: Unstructured HTML text.
#
# EXPECTED OUTCOMES:
# - A paragraph-length summary containing the 20 most significant sentences
# from the article.
#
# PREREQUISITES:
# - Python 3.x
# - Third-party libraries: bs4 (BeautifulSoup), lxml, nltk
# - Standard library: urllib, re, heapq
# - Internet connection for scraping.
#
# ────────────────────────────────────────────────────────────────────────────────
# TABLE OF CONTENTS
# ────────────────────────────────────────────────────────────────────────────────
# 1. Theoretical Background (Summarization Types)
# 2. Setup & Library Imports
# 3. Data Acquisition (Web Scraping)
# 4. Data Preprocessing (Cleaning)
# 5. Feature Engineering (Word Frequencies)
# 6. Sentence Scoring
# 7. Summary Generation
# 8. Final Output
# ════════════════════════════════════════════════════════════════════════════════
"""
# ════════════════════════════════════════════════════════════════════════════════
# STEP 1: THEORETICAL BACKGROUND
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Define Extractive vs. Abstractive Summarization.
# WHY: To understand the approach used in this script.
# ### Text Summarization with Web Scraping
# Text summarization in NLP is the process of automatically condensing large
# texts into shorter versions while preserving the essential information and
# meaning. It enables faster information retrieval and decision-making across
# domains such as healthcare, law, and finance.
#
# ### Types of Text Summarization
# There are two primary approaches:
#
# 1. Extractive Summarization
#    - Selects and combines the most important sentences or phrases directly
#      from the original text without generating new content.
#    - Common algorithms include frequency-based methods (e.g., TF-IDF),
#      graph-based methods like TextRank (a variant of PageRank), and machine
#      learning models that score sentences by relevance.
#    - Simpler and computationally cheaper, but the summary may lack fluency
#      or coherence because it only rearranges existing sentences.
#
# 2. Abstractive Summarization
#    - Generates new sentences that capture the core meaning of the text,
#      often rephrasing or compressing information.
#    - Requires advanced NLP techniques such as sequence-to-sequence deep
#      learning models (LSTMs, Transformer architectures like BERT, GPT, BART).
#    - Uses attention mechanisms to focus on relevant parts of the input text
#      for better coherence and relevance.
#    - More computationally intensive, but can produce more natural,
#      human-like summaries.
# ════════════════════════════════════════════════════════════════════════════════
# STEP 2: SETUP & LIBRARY IMPORTS
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Import libraries for scraping and NLP.
# WHY:
# - 'bs4': To parse HTML and extract text.
# - 'urllib': To fetch the web page.
# - 'nltk': For tokenization and stopwords.
# - 're': For regex-based cleaning.
# - 'heapq': To efficiently find the top N sentences.
import bs4 as bs
import urllib.request
import re
import nltk
import heapq
# Install the third-party libraries used by this script:
# pip install beautifulsoup4
# pip install lxml
# pip install nltk
# ════════════════════════════════════════════════════════════════════════════════
# STEP 3: DATA ACQUISITION (WEB SCRAPING)
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Fetch and parse the Wikipedia article.
# WHY: We need raw text data to summarize.
# TECHNICAL FIX: Added 'User-Agent' header to avoid HTTP 403 Forbidden errors
# from Wikipedia, which blocks automated scripts.
# Define the URL
url = 'https://en.wikipedia.org/wiki/Yuvraj_Singh'
# Create a request object with a User-Agent header
# Wikipedia blocks requests without a User-Agent to prevent bot scraping
req = urllib.request.Request(
    url,
    data=None,
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
)
# Open the URL and read the content
web_source = urllib.request.urlopen(req).read()
# Parse the HTML content stored in 'web_source' using BeautifulSoup
# 'lxml' is specified as the parser, which is fast and lenient with malformed HTML
soup = bs.BeautifulSoup(web_source, 'lxml')
# print(soup)
# Initialize an empty string to store the combined text
text = ""
# Loop through all <p> (paragraph) tags found in the HTML content parsed by BeautifulSoup
# Wikipedia content is mostly contained within <p> tags
for paragraph in soup.find_all('p'):
    # Extract the text content from each paragraph tag and append it to the 'text' string
    text += paragraph.text
# print(text)
# ════════════════════════════════════════════════════════════════════════════════
# STEP 4: DATA PREPROCESSING (CLEANING)
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Clean the text for analysis.
# WHY:
# - Remove citations like [1], [2] which are irrelevant for summary.
# - Normalize whitespace.
# - Create a separate 'clean_text' for word frequency calculation (lowercased, no punctuation).
# Remove reference numbers or citations enclosed in square brackets, e.g., [1], [23], [456]
# The pattern '\[[0-9]*\]' matches any sequence of digits inside square brackets
text = re.sub(r'\[[0-9]*\]', ' ', text)
text = re.sub(r'\s+', ' ', text)
# Convert the entire text to lowercase to ensure uniformity
clean_text = text.lower()
# Replace all non-word characters (anything other than letters, digits, and underscore) with a space
# This removes punctuation and special characters
clean_text = re.sub(r'\W', ' ', clean_text)
# Remove all digits by replacing them with a space
clean_text = re.sub(r'\d', ' ', clean_text)
# Remove any single character surrounded by spaces (isolated letters), e.g., ' a ', ' b '
# This helps remove stray single letters that are usually not meaningful
clean_text = re.sub(r'\s+[a-z]\s+', ' ', clean_text)
# Replace multiple consecutive whitespace characters with a single space
# This normalizes spacing after previous substitutions
clean_text = re.sub(r'\s+', ' ', clean_text)
# clean_text
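# Illustrative sketch (not part of the original pipeline): the same regex
# cleaning steps applied to a short, made-up sample string, so each
# substitution's effect is visible in isolation.

```python
import re

sample = "Yuvraj Singh [12] scored 6 sixes in a single over."
demo = re.sub(r'\[[0-9]*\]', ' ', sample)   # drop citation markers like [12]
demo = re.sub(r'\s+', ' ', demo).lower()    # collapse whitespace, lowercase
demo = re.sub(r'\W', ' ', demo)             # replace punctuation with spaces
demo = re.sub(r'\d', ' ', demo)             # replace digits with spaces
demo = re.sub(r'\s+[a-z]\s+', ' ', demo)    # drop isolated single letters
demo = re.sub(r'\s+', ' ', demo).strip()    # normalize spacing again
print(demo)  # → yuvraj singh scored sixes in single over
```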
# ════════════════════════════════════════════════════════════════════════════════
# STEP 5: FEATURE ENGINEERING (WORD FREQUENCIES)
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Calculate the importance of each word.
# LOGIC:
# - Frequency = Importance.
# - Stopwords (common words like "the", "is") are removed as they are noise.
# - Frequencies are normalized (divided by max frequency) to scale between 0 and 1.
# convert paragraph into sentences
sentences = nltk.sent_tokenize(text)
# Download the tokenizer models and stopwords corpus if not already present
# ('punkt' is required by nltk.sent_tokenize and nltk.word_tokenize)
nltk.download('punkt')
nltk.download('stopwords')
# Load the list of English stopwords from the NLTK corpus
stop_words = nltk.corpus.stopwords.words('english')
# Initialize an empty dictionary to store word frequencies
word2count = {}
# Tokenize the cleaned text into individual words
for word in nltk.word_tokenize(clean_text):
    # Check if the word is NOT in the list of stopwords
    if word not in stop_words:
        # If the word is not already in the dictionary, add it with count 1
        if word not in word2count.keys():
            word2count[word] = 1
        else:
            # If the word is already in the dictionary, increment its count by 1
            word2count[word] += 1
# Normalize the word counts by dividing each count by the maximum word count
# This scales the frequencies to a range between 0 and 1
max_count = max(word2count.values())
for key in word2count.keys():
    # Divide the count of the current word by the highest count in the dictionary
    word2count[key] = word2count[key] / max_count
# word2count
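# Illustrative sketch of the normalization step above, using a toy frequency
# dictionary (the counts are invented): after dividing by the maximum count,
# the most frequent word gets weight 1.0 and everything else scales below it.

```python
toy_counts = {'cricket': 4, 'sixes': 2, 'captain': 1}
toy_max = max(toy_counts.values())
toy_weights = {word: count / toy_max for word, count in toy_counts.items()}
print(toy_weights)  # → {'cricket': 1.0, 'sixes': 0.5, 'captain': 0.25}
```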
# ════════════════════════════════════════════════════════════════════════════════
# STEP 6: SENTENCE SCORING
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Score each sentence based on the words it contains.
# ALGORITHM:
# - Iterate through every sentence.
# - For each word in the sentence, if it exists in our frequency table, add its
# normalized score to the sentence's total score.
# - Constraint: Ignore very long sentences (>100 words) as they might be
# hard to read or contain too much mixed info.
# Initialize an empty dictionary to store scores for each sentence
sent2score = {}
# Iterate over each sentence in the list 'sentences'
for sentence in sentences:
    # Tokenize the sentence into words after converting to lowercase
    for word in nltk.word_tokenize(sentence.lower()):
        # Check if the word exists in the normalized word frequency dictionary
        if word in word2count.keys():
            # Consider only sentences shorter than 100 words to avoid overly long ones
            if len(sentence.split(' ')) < 100:
                # If the sentence is not yet scored, initialize its score with this word's frequency
                if sentence not in sent2score.keys():
                    sent2score[sentence] = word2count[word]
                else:
                    # Otherwise, add the current word's frequency to its running total
                    sent2score[sentence] += word2count[word]
# sent2score
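# Illustrative sketch of the scoring logic above on toy data (weights and
# sentences are invented; a plain split() stands in for nltk.word_tokenize):
# each sentence's score is the sum of the weights of the words it contains.

```python
toy_weights = {'cricket': 1.0, 'sixes': 0.5}
toy_sentences = ['He loves cricket.', 'He hit six sixes in cricket.']
toy_scores = {}
for s in toy_sentences:
    for w in s.lower().replace('.', '').split():
        if w in toy_weights:
            toy_scores[s] = toy_scores.get(s, 0) + toy_weights[w]
print(toy_scores)  # → {'He loves cricket.': 1.0, 'He hit six sixes in cricket.': 1.5}
```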
# ════════════════════════════════════════════════════════════════════════════════
# STEP 7: SUMMARY GENERATION
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Select the top N sentences.
# WHY: To create the final summary.
# Select the top 20 sentences with the highest scores from the sent2score dictionary
# heapq.nlargest returns a list of the n largest elements based on the key function
best_sentences = heapq.nlargest(20, sent2score, key=sent2score.get)
# Display the selected top sentences
# print(best_sentences)
# Join the selected top sentences into a single string to form the summary
summary = ' '.join(best_sentences)
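# Illustrative sketch of how heapq.nlargest selects the top-scoring keys from
# a dictionary (toy scores are invented): iterating a dict yields its keys,
# and key=dict.get ranks them by their scores.

```python
import heapq

toy_scores = {'A': 0.2, 'B': 1.5, 'C': 0.9, 'D': 0.4}
top2 = heapq.nlargest(2, toy_scores, key=toy_scores.get)
print(top2)  # → ['B', 'C']
```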
# ════════════════════════════════════════════════════════════════════════════════
# STEP 8: FINAL OUTPUT
# ════════════════════════════════════════════════════════════════════════════════
# WHAT: Print the result.
print("ORIGINAL LENGTH:", len(text))
print("SUMMARY LENGTH:", len(summary))
print("\nGENERATED SUMMARY:\n")
print(summary)
"""
# ════════════════════════════════════════════════════════════════════════════════
# FINAL SUMMARY & CONCLUSION
# ════════════════════════════════════════════════════════════════════════════════
#
# RECAP:
# We built an extractive summarizer that:
# 1. Scraped Wikipedia.
# 2. Calculated word importance based on frequency.
# 3. Scored sentences by summing the importance of their words.
# 4. Extracted the top 20 sentences.
#
# KEY TAKEAWAYS:
# - Frequency-based summarization is a simple yet effective baseline.
# - It doesn't require training data (unsupervised).
# - Limitation: It assumes that frequent words imply importance, which isn't
# always true (though stopword removal helps).
# - Limitation: It doesn't handle synonyms or context (e.g., "king" and "monarch"
# are treated as unrelated).
#
# PRACTICAL APPLICATION:
# - News aggregators.
# - Legal document summarization.
# - Meeting minutes generation.
#
# ════════════════════════════════════════════════════════════════════════════════
"""