End-to-end natural language processing pipeline that collects public Reddit posts, clusters them into events, and generates short extractive summaries per event.
The codebase is organized as a reusable Python package under src/ and includes notebooks that mirror the pipeline stages.
- Highlights
- Quickstart (Windows / PowerShell)
- Configuration
- Usage
- Evaluation (ROUGE)
- Notebooks
- Outputs
- Project Structure
- Troubleshooting
- Future Work
- Author
## Highlights

- Data ingestion via the Reddit API (`praw`), persisted as JSONL
- Text preprocessing and TF–IDF feature extraction
- Unsupervised event discovery via agglomerative clustering (cosine distance)
- Centroid-based extractive summarization with evidence post IDs
- ROUGE evaluation against a small human-written reference set
- Clear module boundaries: `collect`, `preprocess`, `events`, `summarize`, `evaluate`, `pipeline`
## Quickstart (Windows / PowerShell)

From the repo root:
```powershell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
python -c "import nltk; nltk.download('stopwords')"
$env:PYTHONPATH = "src"
python -m iris_reddit_events.pipeline
```
## Configuration

To collect live Reddit data, create a `.env` file in the repo root:
```
REDDIT_CLIENT_ID=your_client_id_here
REDDIT_CLIENT_SECRET=your_client_secret_here
REDDIT_USER_AGENT=iris-reddit-events (by u/your_username)
```

Credentials are loaded in `src/iris_reddit_events/config.py`.
## Usage

All commands below assume `PYTHONPATH=src`.
Collect data:

```powershell
$env:PYTHONPATH = "src"
python -m iris_reddit_events.collect
```

This writes a timestamped JSONL file under `data/raw/`.
Run the full pipeline:

```powershell
$env:PYTHONPATH = "src"
python -m iris_reddit_events.pipeline
```

Defaults:

- Subreddits: `news`, `worldnews`, `technology`
- Clustering: `distance_threshold=0.8`
- Summaries: `top_k=3` titles

Tune parameters in `src/iris_reddit_events/pipeline.py`.
## Evaluation (ROUGE)

- Create `data/gold_summaries.csv` with:

  ```csv
  event_id,reference
  0,"<your human-written summary for event 0>"
  3,"..."
  ```

- Run evaluation:

  ```powershell
  $env:PYTHONPATH = "src"
  python -m iris_reddit_events.evaluate --gold data/gold_summaries.csv
  ```
## Notebooks

Notebooks mirror the pipeline stages:

- `notebooks/00_check_config.ipynb`: confirms paths + `.env` loading
- `notebooks/01_preprocess.ipynb`: loads raw JSONL, builds TF–IDF
- `notebooks/02_events.ipynb`: clusters posts into `event_id`
- `notebooks/03_summarize.ipynb`: generates summaries from labeled events
- `notebooks/04_eval.ipynb`: runs ROUGE against a gold CSV
## Outputs

- `data/raw/reddit_*.jsonl`: raw collection results
- `data/processed/posts.parquet`: cleaned post table
- `data/events/events_labeled.parquet`: post table + `event_id`
- `data/summaries/summaries.parquet`: event-level summaries (+ evidence post IDs)
## Project Structure

```text
src/iris_reddit_events/
  collect.py      # Reddit API ingestion
  preprocess.py   # text cleaning + TF–IDF
  events.py       # clustering into events
  summarize.py    # centroid-based extractive summaries
  evaluate.py     # ROUGE evaluation helpers + CLI
  pipeline.py     # end-to-end runner
  config.py       # paths + .env loading
notebooks/        # exploratory / report notebooks
```
## Troubleshooting

- Parquet read/write error: install an engine (recommended): `pip install pyarrow`
- `LookupError: Resource stopwords not found`: run `python -c "import nltk; nltk.download('stopwords')"`
- No raw files found: run the collector so `data/raw/reddit_*.jsonl` exists
- Import errors in scripts: ensure `PYTHONPATH=src` is set
## Future Work

- Replace TF–IDF with dense embeddings for improved clustering
- Move to sentence-level extraction to reduce redundancy
- Add time-aware segmentation (windowing / burst detection)
## Author

Symaedchit Octavius Leo

Built for CSE 482 (Big Data Analysis).
## Detailed Usage

```powershell
python -m iris_reddit_events.collect
```

Fetches recent posts from configured subreddits and saves to `data/raw/`.

Configuration (in `collect.py`):

- Subreddits: `news`, `worldnews`, `technology`
- Posts per subreddit: 300 (configurable via the `limit` parameter)
The stages can also be driven from Python:

```python
from iris_reddit_events import preprocess as pp

df = pp.load_raw()
X, vectorizer, df_clean = pp.vectorize(df)
```

```python
from iris_reddit_events import events

df_events = events.cluster_events(
    X,
    df_clean,
    n_clusters=None,
    distance_threshold=0.8,
)
```

```python
from iris_reddit_events import summarize

df_summaries = summarize.summarize_events(df_events, top_k=3)
```
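Under the hood, centroid-based extraction amounts to ranking each event's posts by cosine similarity to the event's mean TF–IDF vector and keeping the top `k`. A self-contained sketch (function name and arrays are illustrative, not the package's internals):

```python
import numpy as np

def centroid_top_k(X, labels, event_id, k=3):
    """Return indices of the k posts closest (cosine) to their event centroid."""
    idx = np.where(labels == event_id)[0]
    V = X[idx]
    centroid = V.mean(axis=0)
    # cosine similarity of each post vector to the centroid
    sims = (V @ centroid) / (
        np.linalg.norm(V, axis=1) * np.linalg.norm(centroid) + 1e-12
    )
    return idx[np.argsort(-sims)[:k]]

# Tiny worked example: three posts, all labeled event 0
X = np.array([[1.0, 0.0], [1.0, 0.2], [0.0, 1.0]])
labels = np.array([0, 0, 0])
evidence = centroid_top_k(X, labels, 0, k=2)  # -> indices [1, 0]
```

The returned indices serve as the "evidence post IDs" idea from the highlights: the summary quotes the posts nearest the centroid.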
## Notebook Descriptions

The `notebooks/` directory contains Jupyter notebooks that mirror each pipeline stage:
| Notebook | Purpose | Key Outputs |
|---|---|---|
| `00_check_config.ipynb` | Verify environment setup | Confirms paths and `.env` loading |
| `01_preprocess.ipynb` | Data preprocessing | TF-IDF matrix, cleaned DataFrame |
| `02_events.ipynb` | Event clustering | Event assignments per post |
| `03_summarize.ipynb` | Summary generation | Event summaries with evidence IDs |
| `04_eval.ipynb` | ROUGE evaluation | Precision, recall, F1 scores |
Running Notebooks:

- Launch Jupyter: `jupyter notebook`
- Navigate to the `notebooks/` directory
- Open the desired notebook
- Run cells sequentially
Note: Notebooks automatically configure the Python path to import the iris_reddit_events package.
## Complete Project Structure
```text
cse482-reddit-events/
├── src/
│   └── iris_reddit_events/
│       ├── __init__.py
│       ├── collect.py        # Reddit API data collection
│       ├── config.py         # Configuration and paths
│       ├── preprocess.py     # Text preprocessing & TF-IDF
│       ├── events.py         # Event clustering logic
│       ├── summarize.py      # Summary generation
│       ├── evaluate.py       # ROUGE evaluation + CLI
│       └── pipeline.py       # End-to-end orchestration
│
├── notebooks/
│   ├── 00_check_config.ipynb # Environment verification
│   ├── 01_preprocess.ipynb   # Preprocessing exploration
│   ├── 02_events.ipynb       # Clustering analysis
│   ├── 03_summarize.ipynb    # Summary generation
│   └── 04_eval.ipynb         # Evaluation workflow
│
├── data/
│   ├── raw/                  # Raw Reddit JSONL files
│   ├── processed/            # Preprocessed Parquet files
│   ├── events/               # Event-labeled data
│   ├── summaries/            # Generated summaries
│   └── gold_summaries.csv    # Human reference summaries
│
├── requirements.txt          # Python dependencies
├── .env                      # Environment variables (gitignored)
├── .gitignore
└── README.md
```
- `config.py`: Centralized configuration (paths, environment variables)
- `pipeline.py`: Main entry point for full pipeline execution
- `evaluate.py`: Can be run as a CLI tool or imported as a module
- `requirements.txt`: Pinned dependencies for reproducibility
## Configuration Options
Reddit API credentials (`.env`):

```
REDDIT_CLIENT_ID=your_client_id
REDDIT_CLIENT_SECRET=your_secret_key
REDDIT_USER_AGENT=iris-reddit-events (by u/username)
```

Clustering (`events.py`):
```python
cluster_events(
    X,                       # TF-IDF matrix
    df_posts,                # Posts DataFrame
    n_clusters=None,         # Auto-determine via threshold
    distance_threshold=0.8,  # Cosine distance cutoff
)
```

Summarization (`summarize.py`):
```python
summarize_events(
    df_events,  # Event-labeled posts
    top_k=3,    # Number of representative posts
)
```

Data Collection (`collect.py`):
```python
collect_sample(
    subreddits=("news", "worldnews", "technology"),
    limit=300,  # Posts per subreddit
)
```

Edit `src/iris_reddit_events/pipeline.py` to change default parameters:
```python
def main():
    # Adjust these values
    df_events = events.cluster_events(
        X, df_clean,
        distance_threshold=0.7,  # More granular events
    )
    df_sum = summarize.summarize_events(
        df_events,
        top_k=5,  # Longer summaries
    )
```

## Evaluation Methodology
The system uses ROUGE (Recall-Oriented Understudy for Gisting Evaluation) to compare generated summaries against human references.
Metrics Computed:
- ROUGE-1: Unigram overlap
- ROUGE-2: Bigram overlap
- ROUGE-L: Longest common subsequence
Workflow:

- Generate system summaries first:

  ```powershell
  python -m iris_reddit_events.pipeline
  ```

- Create `data/gold_summaries.csv`:

  ```csv
  event_id,reference
  0,"Human-written summary for event 0"
  2,"Human-written summary for event 2"
  5,"Human-written summary for event 5"
  ```

- Run evaluation:

  ```powershell
  python -m iris_reddit_events.evaluate --gold data/gold_summaries.csv
  ```
Example output:

```
Evaluated 3 event summaries with ROUGE.
rouge1: P=0.478 R=0.431 F1=0.447
rouge2: P=0.216 R=0.195 F1=0.202
rougeL: P=0.401 R=0.347 F1=0.367
```
Interpretation:
- Precision (P): How much of the generated summary is relevant
- Recall (R): How much of the reference is captured
- F1: Harmonic mean of precision and recall
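To make these metrics concrete, ROUGE-1 can be computed by hand from clipped unigram counts. This toy function is only illustrative (the project itself uses Google's ROUGE implementation):

```python
from collections import Counter

def rouge1(reference, candidate):
    """ROUGE-1 precision/recall/F1 from overlapping unigram counts."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())      # clipped unigram matches
    p = overlap / max(sum(cand.values()), 1)  # precision: matches / candidate length
    r = overlap / max(sum(ref.values()), 1)   # recall: matches / reference length
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

p, r, f1 = rouge1("the summit opened today", "the summit opened")
# 3 matching unigrams: p = 3/3 = 1.0, r = 3/4 = 0.75, f1 = 6/7 ≈ 0.857
```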
## Common Issues and Solutions
**Import errors (`iris_reddit_events` not found)**

Solution:

```powershell
$env:PYTHONPATH = "src"
```

Ensure this is set in every new terminal session, or add it to your shell profile.
**`LookupError: Resource stopwords not found`**

Solution:

```powershell
python -c "import nltk; nltk.download('stopwords')"
```

**Parquet read/write errors**

Solution:

```powershell
pip install pyarrow
```

This should already be in `requirements.txt`, but reinstall if needed.
**`AgglomerativeClustering` errors about `affinity`**

Cause: scikit-learn API change in version 1.4+

Solution: Already fixed in `events.py` (uses `metric=` instead of `affinity=`)
**No raw data files found**

Solution: The project includes sample data. If it is missing, create test data or run the collector:

```powershell
python -m iris_reddit_events.collect
```

**Reddit API authentication errors**

Cause: Invalid Reddit API credentials

Solution:

- Verify credentials in the `.env` file
- Ensure no extra spaces or quotes
- Regenerate credentials at https://www.reddit.com/prefs/apps
**Import errors inside notebooks**

Solution: Notebooks include path setup cells. Run them first:

```python
import sys
from pathlib import Path

sys.path.insert(0, str(Path.cwd().parent / "src"))
```

## Planned Improvements
- Replace TF-IDF with dense embeddings (sentence-transformers, BERT)
- Implement sentence-level extraction using TextRank
- Add temporal segmentation for time-aware event detection
- Create web dashboard for real-time monitoring
- Add support for multiple languages
- Implement abstractive summarization with fine-tuned T5/BART
- Add named entity recognition for key figure extraction
- Implement topic modeling (LDA) alongside clustering
- Create REST API for programmatic access
- Add Docker containerization
- Real-time streaming pipeline with Apache Kafka
- Multi-source integration (Twitter, news sites, forums)
- Fact-checking integration with claim verification
- Personalized event filtering based on user interests
- Mobile application for summary consumption
The repository includes sample Reddit data in `data/raw/reddit_20260101_120000.jsonl` for immediate testing. This contains 10 synthetic posts across three topics:
- AI technology announcements
- Climate summit news
- Local news events
To collect live Reddit data, configure API credentials and run the collector.
This project is licensed under the MIT License. See LICENSE file for details.
Acknowledgments:

- Reddit API via `praw`
- scikit-learn for machine learning tools
- NLTK for natural language processing
- Google's ROUGE implementation for evaluation
