End-to-end natural language processing pipeline that collects public Reddit posts, clusters them into events, and generates short extractive summaries per event.
The codebase is organized as a reusable Python package under src/ and includes notebooks that mirror the pipeline stages.
- Highlights
- Quickstart (Windows / PowerShell)
- Configuration
- Usage
- Evaluation (ROUGE)
- Notebooks
- Outputs
- Project Structure
- Troubleshooting
- Future Work
- Author
## Highlights

- Data ingestion via the Reddit API (`praw`), persisted as JSONL
- Text preprocessing and TF–IDF feature extraction
- Unsupervised event discovery via agglomerative clustering (cosine distance)
- Centroid-based extractive summarization with evidence post IDs
- ROUGE evaluation against a small human-written reference set
- Clear module boundaries: `collect`, `preprocess`, `events`, `summarize`, `evaluate`, `pipeline`
## Quickstart (Windows / PowerShell)

From the repo root:
```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
python -c "import nltk; nltk.download('stopwords')"
$env:PYTHONPATH = "src"
python -m iris_reddit_events.pipeline
```
## Configuration

To collect live Reddit data, create a `.env` file in the repo root:
```
REDDIT_CLIENT_ID=your_client_id_here
REDDIT_CLIENT_SECRET=your_client_secret_here
REDDIT_USER_AGENT=iris-reddit-events (by u/your_username)
```

Credentials are loaded in `src/iris_reddit_events/config.py`.
## Usage

All commands below assume `PYTHONPATH=src`.
Collect data:

```powershell
$env:PYTHONPATH = "src"
python -m iris_reddit_events.collect
```

This writes a timestamped JSONL file under `data/raw/`.
Run the full pipeline:

```powershell
$env:PYTHONPATH = "src"
python -m iris_reddit_events.pipeline
```

Defaults:

- Subreddits: `news`, `worldnews`, `technology`
- Clustering: `distance_threshold=0.8`
- Summaries: `top_k=3` titles

Tune parameters in `src/iris_reddit_events/pipeline.py`.
## Evaluation (ROUGE)

- Create `data/gold_summaries.csv` with:

  ```csv
  event_id,reference
  0,"<your human-written summary for event 0>"
  3,"..."
  ```

- Run evaluation:

  ```powershell
  $env:PYTHONPATH = "src"
  python -m iris_reddit_events.evaluate --gold data/gold_summaries.csv
  ```
## Notebooks

Notebooks mirror the pipeline stages:

- `notebooks/00_check_config.ipynb`: confirms paths + `.env` loading
- `notebooks/01_preprocess.ipynb`: loads raw JSONL, builds TF–IDF
- `notebooks/02_events.ipynb`: clusters posts into `event_id`
- `notebooks/03_summarize.ipynb`: generates summaries from labeled events
- `notebooks/04_eval.ipynb`: runs ROUGE against a gold CSV
## Outputs

- `data/raw/reddit_*.jsonl`: raw collection results
- `data/processed/posts.parquet`: cleaned post table
- `data/events/events_labeled.parquet`: post table + `event_id`
- `data/summaries/summaries.parquet`: event-level summaries (+ evidence post IDs)
## Project Structure

```text
src/iris_reddit_events/
  collect.py      # Reddit API ingestion
  preprocess.py   # text cleaning + TF–IDF
  events.py       # clustering into events
  summarize.py    # centroid-based extractive summaries
  evaluate.py     # ROUGE evaluation helpers + CLI
  pipeline.py     # end-to-end runner
  config.py       # paths + .env loading
notebooks/        # exploratory / report notebooks
```
## Troubleshooting

- Parquet read/write error: install an engine (recommended): `pip install pyarrow`
- `LookupError: Resource stopwords not found`: run `python -c "import nltk; nltk.download('stopwords')"`
- No raw files found: run the collector so `data/raw/reddit_*.jsonl` exists
- Import errors in scripts: ensure `PYTHONPATH=src` is set
## Future Work

- Replace TF–IDF with dense embeddings for improved clustering
- Move to sentence-level extraction to reduce redundancy
- Add time-aware segmentation (windowing / burst detection)
## Author

Symaedchit Octavius Leo

Built for CSE 482 (Big Data Analysis).
## Detailed Usage

```powershell
python -m iris_reddit_events.collect
```

Fetches recent posts from configured subreddits and saves to `data/raw/`.

Configuration (in `collect.py`):

- Subreddits: `news`, `worldnews`, `technology`
- Posts per subreddit: 300 (configurable via the `limit` parameter)
The stages can also be driven from Python:

```python
from iris_reddit_events import preprocess as pp

df = pp.load_raw()
X, vectorizer, df_clean = pp.vectorize(df)
```

```python
from iris_reddit_events import events

df_events = events.cluster_events(
    X,
    df_clean,
    n_clusters=None,
    distance_threshold=0.8,
)
```

```python
from iris_reddit_events import summarize

df_summaries = summarize.summarize_events(df_events, top_k=3)
```
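Under the hood, centroid-based extraction amounts to ranking each event's posts by cosine similarity to the event's mean TF–IDF vector and keeping the top `k`. A self-contained sketch (function name and arrays are illustrative, not the package's internals):

```python
import numpy as np

def centroid_top_k(X, labels, event_id, k=3):
    """Return indices of the k posts closest (cosine) to their event centroid."""
    idx = np.where(labels == event_id)[0]
    V = X[idx]
    centroid = V.mean(axis=0)
    # cosine similarity of each post vector to the centroid
    sims = (V @ centroid) / (
        np.linalg.norm(V, axis=1) * np.linalg.norm(centroid) + 1e-12
    )
    return idx[np.argsort(-sims)[:k]]

# Tiny worked example: three posts, all labeled event 0
X = np.array([[1.0, 0.0], [1.0, 0.2], [0.0, 1.0]])
labels = np.array([0, 0, 0])
evidence = centroid_top_k(X, labels, 0, k=2)  # -> indices [1, 0]
```

The returned indices serve as the "evidence post IDs" idea from the highlights: the summary quotes the posts nearest the centroid.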
## Notebook Descriptions

The `notebooks/` directory contains Jupyter notebooks that mirror each pipeline stage:
| Notebook | Purpose | Key Outputs |
|---|---|---|
| `00_check_config.ipynb` | Verify environment setup | Confirms paths and `.env` loading |
| `01_preprocess.ipynb` | Data preprocessing | TF-IDF matrix, cleaned DataFrame |
| `02_events.ipynb` | Event clustering | Event assignments per post |
| `03_summarize.ipynb` | Summary generation | Event summaries with evidence IDs |
| `04_eval.ipynb` | ROUGE evaluation | Precision, recall, F1 scores |
Running Notebooks:

- Launch Jupyter: `jupyter notebook`
- Navigate to the `notebooks/` directory
- Open the desired notebook
- Run cells sequentially
Note: Notebooks automatically configure the Python path to import the iris_reddit_events package.
## Complete Project Structure
```text
cse482-reddit-events/
├── src/
│   └── iris_reddit_events/
│       ├── __init__.py
│       ├── collect.py        # Reddit API data collection
│       ├── config.py         # Configuration and paths
│       ├── preprocess.py     # Text preprocessing & TF-IDF
│       ├── events.py         # Event clustering logic
│       ├── summarize.py      # Summary generation
│       ├── evaluate.py       # ROUGE evaluation + CLI
│       └── pipeline.py       # End-to-end orchestration
│
├── notebooks/
│   ├── 00_check_config.ipynb # Environment verification
│   ├── 01_preprocess.ipynb   # Preprocessing exploration
│   ├── 02_events.ipynb       # Clustering analysis
│   ├── 03_summarize.ipynb    # Summary generation
│   └── 04_eval.ipynb         # Evaluation workflow
│
├── data/
│   ├── raw/                  # Raw Reddit JSONL files
│   ├── processed/            # Preprocessed Parquet files
│   ├── events/               # Event-labeled data
│   ├── summaries/            # Generated summaries
│   └── gold_summaries.csv    # Human reference summaries
│
├── requirements.txt          # Python dependencies
├── .env                      # Environment variables (gitignored)
├── .gitignore
└── README.md
```
- `config.py`: Centralized configuration (paths, environment variables)
- `pipeline.py`: Main entry point for full pipeline execution
- `evaluate.py`: Can be run as a CLI tool or imported as a module
- `requirements.txt`: Pinned dependencies for reproducibility
## Configuration Options
Reddit API credentials (`.env`):

```
REDDIT_CLIENT_ID=your_client_id
REDDIT_CLIENT_SECRET=your_secret_key
REDDIT_USER_AGENT=iris-reddit-events (by u/username)
```

Clustering (`events.py`):
```python
cluster_events(
    X,                       # TF-IDF matrix
    df_posts,                # Posts DataFrame
    n_clusters=None,         # Auto-determine via threshold
    distance_threshold=0.8,  # Cosine distance cutoff
)
```

Summarization (`summarize.py`):
```python
summarize_events(
    df_events,  # Event-labeled posts
    top_k=3,    # Number of representative posts
)
```

Data Collection (`collect.py`):
```python
collect_sample(
    subreddits=("news", "worldnews", "technology"),
    limit=300,  # Posts per subreddit
)
```

Edit `src/iris_reddit_events/pipeline.py` to change default parameters:
```python
def main():
    # Adjust these values
    df_events = events.cluster_events(
        X, df_clean,
        distance_threshold=0.7,  # More granular events
    )
    df_sum = summarize.summarize_events(
        df_events,
        top_k=5,  # Longer summaries
    )
```

## Evaluation Methodology
The system uses ROUGE (Recall-Oriented Understudy for Gisting Evaluation) to compare generated summaries against human references.
Metrics Computed:
- ROUGE-1: Unigram overlap
- ROUGE-2: Bigram overlap
- ROUGE-L: Longest common subsequence
Workflow:

- Generate system summaries first:

  ```powershell
  python -m iris_reddit_events.pipeline
  ```

- Create `data/gold_summaries.csv`:

  ```csv
  event_id,reference
  0,"Human-written summary for event 0"
  2,"Human-written summary for event 2"
  5,"Human-written summary for event 5"
  ```

- Run evaluation:

  ```powershell
  python -m iris_reddit_events.evaluate --gold data/gold_summaries.csv
  ```
Example output:

```
Evaluated 3 event summaries with ROUGE.
rouge1: P=0.478 R=0.431 F1=0.447
rouge2: P=0.216 R=0.195 F1=0.202
rougeL: P=0.401 R=0.347 F1=0.367
```
Interpretation:
- Precision (P): How much of the generated summary is relevant
- Recall (R): How much of the reference is captured
- F1: Harmonic mean of precision and recall
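To make these metrics concrete, ROUGE-1 can be computed by hand from clipped unigram counts. This toy function is only illustrative (the project itself uses Google's ROUGE implementation):

```python
from collections import Counter

def rouge1(reference, candidate):
    """ROUGE-1 precision/recall/F1 from overlapping unigram counts."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())      # clipped unigram matches
    p = overlap / max(sum(cand.values()), 1)  # precision: matches / candidate length
    r = overlap / max(sum(ref.values()), 1)   # recall: matches / reference length
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

p, r, f1 = rouge1("the summit opened today", "the summit opened")
# 3 matching unigrams: p = 3/3 = 1.0, r = 3/4 = 0.75, f1 = 6/7 ≈ 0.857
```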
## Common Issues and Solutions
**Import errors (`iris_reddit_events` not found)**

Solution:

```powershell
$env:PYTHONPATH = "src"
```

Ensure this is set in every new terminal session, or add it to your shell profile.
**`LookupError: Resource stopwords not found`**

Solution:

```powershell
python -c "import nltk; nltk.download('stopwords')"
```

**Parquet read/write errors**

Solution:

```powershell
pip install pyarrow
```

This should already be in `requirements.txt`, but reinstall if needed.
**`AgglomerativeClustering` errors about `affinity`**

Cause: scikit-learn API change in version 1.4+

Solution: Already fixed in `events.py` (uses `metric=` instead of `affinity=`)
**No raw data files found**

Solution: The project includes sample data. If it is missing, create test data or run the collector:

```powershell
python -m iris_reddit_events.collect
```

**Reddit API authentication errors**

Cause: Invalid Reddit API credentials

Solution:

- Verify credentials in the `.env` file
- Ensure no extra spaces or quotes
- Regenerate credentials at https://www.reddit.com/prefs/apps
**Import errors inside notebooks**

Solution: Notebooks include path setup cells. Run them first:

```python
import sys
from pathlib import Path

sys.path.insert(0, str(Path.cwd().parent / "src"))
```

## Planned Improvements
- Replace TF-IDF with dense embeddings (sentence-transformers, BERT)
- Implement sentence-level extraction using TextRank
- Add temporal segmentation for time-aware event detection
- Create web dashboard for real-time monitoring
- Add support for multiple languages
- Implement abstractive summarization with fine-tuned T5/BART
- Add named entity recognition for key figure extraction
- Implement topic modeling (LDA) alongside clustering
- Create REST API for programmatic access
- Add Docker containerization
- Real-time streaming pipeline with Apache Kafka
- Multi-source integration (Twitter, news sites, forums)
- Fact-checking integration with claim verification
- Personalized event filtering based on user interests
- Mobile application for summary consumption
The repository includes sample Reddit data in `data/raw/reddit_20260101_120000.jsonl` for immediate testing. This contains 10 synthetic posts across three topics:
- AI technology announcements
- Climate summit news
- Local news events
To collect live Reddit data, configure API credentials and run the collector.
This project is licensed under the MIT License. See LICENSE file for details.
Acknowledgments:

- Reddit API via `praw`
- scikit-learn for machine learning tools
- NLTK for natural language processing
- Google's ROUGE implementation for evaluation
