Commit 7a5973c: add deleted files
1 parent 525edb9

2 files changed: 227 additions & 0 deletions

**TEXT_SUMMARIZER/README.md** (114 additions & 0 deletions)
# Text Summarizer

A Python script that summarizes text by keeping only the most important sentences, using simple word frequency analysis.

## Features

- **Frequency-based Extractive Algorithm**: Analyzes word frequencies to identify important sentences
- **Stop Words Filtering**: Removes common words to focus on meaningful content
- **Normalized Scoring**: Scores sentences based on normalized word frequencies
- **Maintains Original Order**: Preserves the original sequence of selected sentences
- **Configurable Summary Ratio**: Adjusts the proportion of sentences to include in the summary

## Requirements

- Python 3.x
- No external dependencies (uses only the standard library)

## Installation

No installation required. Simply clone the repository:

```bash
git clone https://github.com/sumanth-0/100LinesOfPythonCode.git
cd 100LinesOfPythonCode/TEXT_SUMMARIZER
```
## Usage

### Interactive Mode

Run the script and enter text when prompted:

```bash
python text_summarizer.py
```

### Programmatic Usage

```python
from text_summarizer import summarize

text = """
Your long text goes here. The text summarizer will analyze word frequencies
and select the most important sentences. It uses an extractive approach,
meaning it pulls sentences directly from the original text rather than
generating new ones.
"""

# Summarize with the default 30% ratio
summary = summarize(text)
print(summary)

# Summarize with a custom ratio (e.g., 50% of sentences)
summary = summarize(text, ratio=0.5)
print(summary)
```
## How It Works

1. **Text Cleaning**: Removes extra whitespace and normalizes the input text
2. **Sentence Tokenization**: Splits the text into individual sentences
3. **Word Tokenization**: Extracts words and converts them to lowercase
4. **Frequency Analysis**: Calculates word frequencies while filtering stop words
5. **Sentence Scoring**: Scores each sentence based on the frequencies of the words it contains
6. **Sentence Selection**: Selects the top-scoring sentences based on the specified ratio
7. **Order Preservation**: Returns the selected sentences in their original order
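The steps above can be sketched end to end on a toy input (a minimal sketch using only the standard library; stop-word filtering and the ratio-based cutoff are omitted for brevity, and the names here are illustrative rather than the script's actual API):

```python
import re
from collections import Counter

text = "Cats purr.  Cats sleep a lot. Dogs bark."

# Steps 1-2: clean whitespace and split into sentences.
cleaned = re.sub(r'\s+', ' ', text).strip()
sentences = [s.strip() for s in re.split(r'[.!?]+', cleaned) if s.strip()]

# Steps 3-4: lowercase words and count raw frequencies.
words = re.findall(r'\b[a-zA-Z]+\b', cleaned.lower())
freq = Counter(words)

# Step 5: score each sentence by the average frequency of its words.
def score(sentence):
    ws = re.findall(r'\b[a-zA-Z]+\b', sentence.lower())
    return sum(freq[w] for w in ws) / len(ws) if ws else 0

scores = {s: score(s) for s in sentences}

# Steps 6-7: keep the top-scoring sentence ("Cats purr", since "cats"
# appears twice and the sentence is short).
best = max(sentences, key=lambda s: scores[s])
print(best)
```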
## Algorithm Details

### Stop Words Filtering

The script filters out common English stop words like "the", "is", "at", "which", etc. This helps focus on content-bearing words.
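A minimal illustration of the filtering step, using a small subset of the script's stop-word list:

```python
# Subset of the stop-word set used by the script, for illustration only
stop_words = {"the", "is", "at", "which", "on"}

words = ["the", "cat", "is", "on", "the", "mat"]
content_words = [w for w in words if w not in stop_words]
print(content_words)  # ['cat', 'mat']
```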
### Scoring Formula

Each sentence is scored using:

```
sentence_score = sum(normalized_word_frequencies) / number_of_words
```

This approach favors sentences with high-frequency important words while normalizing for sentence length.
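As a worked example with hypothetical numbers (assuming the most frequent content word appears 4 times, so each raw count is divided by 4):

```python
# Hypothetical normalized frequencies: raw count / max count (max = 4)
norm_freq = {"learning": 4 / 4, "networks": 2 / 4, "data": 1 / 4}

# Score a three-word sentence containing all three words
sentence_words = ["learning", "networks", "data"]
sentence_score = sum(norm_freq[w] for w in sentence_words) / len(sentence_words)
# (1.0 + 0.5 + 0.25) / 3, roughly 0.583
```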
## Example

### Input

```
Artificial intelligence is transforming technology. Machine learning algorithms can process vast amounts of data. Neural networks are inspired by the human brain. Deep learning has revolutionized computer vision. Natural language processing enables computers to understand text. AI applications are everywhere in modern life.
```

### Output (30% ratio - 1 sentence)

```
Deep learning has revolutionized computer vision.
```

With six input sentences and the default 30% ratio, `max(1, int(6 * 0.3))` selects a single sentence, and this one scores highest ("learning" is the most frequent content word, and the sentence is short).
## Limitations

- Works best with structured, well-written text
- May not capture context or semantic relationships
- Limited stop word list (can be expanded)
- No handling of complex linguistic structures
- Best suited for informative/factual text rather than narrative content
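For instance, because the sentence tokenizer treats every `.`, `!`, or `?` as a boundary, abbreviations get split into fragments:

```python
import re

# The same split pattern the script uses for sentence tokenization
text = "Dr. Smith arrived. The meeting began."
sentences = [s.strip() for s in re.split(r'[.!?]+', text) if s.strip()]
print(sentences)  # ['Dr', 'Smith arrived', 'The meeting began']
```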
## Contributing

Feel free to submit issues, fork the repository, and create pull requests for any improvements.

## License

This project is part of the 100LinesOfPythonCode repository. See the main repository for license information.

## References

- Issue: #682
- Repository: [100LinesOfPythonCode](https://github.com/sumanth-0/100LinesOfPythonCode)
**TEXT_SUMMARIZER/text_summarizer.py** (113 additions & 0 deletions)
```python
#!/usr/bin/env python3
"""
Text Summarizer using Frequency-based Extractive Algorithm
Summarizes text by keeping only the most important sentences.
"""

import re
from collections import Counter


def clean_text(text):
    """Remove extra whitespace and normalize text."""
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


def tokenize_sentences(text):
    """Split text into sentences."""
    sentences = re.split(r'[.!?]+', text)
    return [s.strip() for s in sentences if s.strip()]


def tokenize_words(text):
    """Extract words and convert them to lowercase."""
    words = re.findall(r'\b[a-zA-Z]+\b', text.lower())
    return words


def get_word_frequencies(sentences):
    """Calculate normalized word frequency scores."""
    all_words = []
    for sentence in sentences:
        all_words.extend(tokenize_words(sentence))

    # Remove common stop words
    stop_words = {'the', 'is', 'at', 'which', 'on', 'a', 'an', 'and', 'or',
                  'but', 'in', 'with', 'to', 'for', 'of', 'as', 'by', 'that',
                  'this', 'it', 'from', 'be', 'are', 'was', 'were', 'been',
                  'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would',
                  'could', 'should', 'may', 'might', 'can'}

    filtered_words = [w for w in all_words if w not in stop_words]

    word_freq = Counter(filtered_words)
    max_freq = max(word_freq.values()) if word_freq else 1

    # Normalize frequencies by the most frequent content word
    for word in word_freq:
        word_freq[word] = word_freq[word] / max_freq

    return word_freq


def score_sentences(sentences, word_freq):
    """Score each sentence based on word frequencies."""
    sentence_scores = {}

    for sentence in sentences:
        words = tokenize_words(sentence)
        score = sum(word_freq.get(word, 0) for word in words)

        if len(words) > 0:
            sentence_scores[sentence] = score / len(words)
        else:
            sentence_scores[sentence] = 0

    return sentence_scores


def summarize(text, ratio=0.3):
    """Summarize text by extracting top sentences.

    Args:
        text: Input text to summarize
        ratio: Proportion of sentences to keep (0.0 to 1.0)

    Returns:
        Summarized text
    """
    text = clean_text(text)
    sentences = tokenize_sentences(text)

    if len(sentences) <= 2:
        return text

    word_freq = get_word_frequencies(sentences)
    sentence_scores = score_sentences(sentences, word_freq)

    # Select the top-scoring sentences
    num_sentences = max(1, int(len(sentences) * ratio))
    top_sentences = sorted(sentence_scores.items(),
                           key=lambda x: x[1], reverse=True)[:num_sentences]

    # Maintain original order
    summary_sentences = sorted(top_sentences, key=lambda x: sentences.index(x[0]))
    summary = '. '.join(s[0] for s in summary_sentences) + '.'

    return summary


if __name__ == "__main__":
    print("Text Summarizer - Frequency-based Extractive Algorithm")
    print("=" * 55)
    text = input("\nEnter text to summarize:\n")

    if text.strip():
        summary = summarize(text)
        print("\n" + "=" * 55)
        print("SUMMARY:")
        print("=" * 55)
        print(summary)
    else:
        print("No text provided!")
```
