This project builds a complete, intelligent data pipeline that not only extracts and processes Reddit data at scale, but also transforms it into actionable insights and live AI-generated summaries, all within an interactive Grafana dashboard.
- ✅ Ingest real-world, unstructured data from Reddit and automate its collection
- ✅ Transform and model the data for analytics using AWS-native tools
- ✅ Visualize meaningful trends (e.g., trending topics, user engagement) through Grafana
- ✅ Bridge data engineering with AI by integrating a live LLM-based summarization API into the dashboard
📊 This project combines the power of batch processing with AI augmentation, creating a smart dashboard where users can not only explore the data but also ask it to explain itself.
Whether it’s for media trend analysis, opinion monitoring, or tech content discovery, this architecture can serve as a template for data pipelines enhanced with LLMs — turning raw data into both visual stories and summarized knowledge.
| Layer | Tool/Service | Purpose |
|---|---|---|
| Ingestion | AWS Lambda + PRAW | Scheduled batch extraction from Reddit |
| Storage (raw/clean) | Amazon S3 | Cost-effective, scalable data lake |
| ETL | AWS Glue | Data cleaning and transformation |
| Query Engine | AWS Athena | Serverless SQL engine for querying S3 |
| Visualization | Grafana | Dashboarding tool connected to Athena |
| Integration/API | Flask on EC2 | JSON API for LLM-powered summarization |
| Language | Python | All processing and APIs |
| LLM | Groq LLM | Summarization of Reddit articles |
I chose this architecture to reflect a real-world production-ready batch data flow using serverless and cloud-native components:
- Batch processing fits Reddit's post dynamics (daily trends) better than real-time streaming.
- AWS Lambda + EventBridge offers a simple and cost-efficient ingestion trigger.
- S3 + Glue + Athena is a popular trio in the data engineering world for building flexible data lakes.
- Grafana provides an open-source dashboard with plugin support (Athena + JSON API).
- Flask API on EC2 separates LLM logic from visualization and is easily extendable.
This architecture allows each component to scale independently and be swapped if needed.
This project uses batch processing, running on a daily schedule:
- **Ingestion:** AWS Lambda pulls the top 100 "hot" posts from the r/ArtificialInteligence subreddit using Reddit's API (via PRAW) and saves the results as JSON Lines to S3.
- **ETL with AWS Glue:** A scheduled Glue job transforms the raw Reddit data into a cleaned format, adding structure for efficient querying.
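Glue jobs can run as PySpark or plain Python shell scripts; purely as an illustration (the raw field names are assumptions about the schema), the per-record cleaning step might look like:

```python
from datetime import datetime, timezone

def clean_post(raw: dict) -> dict:
    """Normalize one raw Reddit record for analytics (field names are assumed)."""
    return {
        "id": raw["id"],
        "title": raw.get("title", "").strip(),
        "author": raw.get("author") or "[deleted]",  # missing author -> deleted account
        "score": int(raw.get("score", 0)),
        "num_comments": int(raw.get("num_comments", 0)),
        # Date string derived from the epoch timestamp, handy for S3 partitioning
        "created_date": datetime.fromtimestamp(
            raw["created_utc"], tz=timezone.utc
        ).strftime("%Y-%m-%d"),
    }
```

Writing the cleaned records back to S3 in a columnar or partitioned layout is what makes the Athena queries below cheap and fast.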
- **Querying via AWS Athena:** Athena reads the cleaned data directly from S3 and enables SQL-based analytics on the posts, such as top authors, engagement trends, or most-discussed posts.
- **Visualization with Grafana:** Grafana queries Athena and displays the results using panels, filters, and time-based analysis. Grafana runs on EC2 and authenticates to Athena via the default AWS credential chain.
- **Summarization via API:** Grafana also integrates a custom JSON API panel. When a user selects a Reddit article title, the panel sends a request to a Flask API hosted on EC2. This API:
  - searches Reddit by title
  - uses the Groq LLM to summarize the post content
  - returns a clean summary that is displayed live on the dashboard
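A minimal sketch of such a Flask endpoint (the route name, query parameter, and Groq model are assumptions; the real service also looks up the post body on Reddit before summarizing):

```python
import os

from flask import Flask, jsonify, request

app = Flask(__name__)

def summarize(text: str) -> str:
    """Ask a Groq-hosted model for a summary (model name is an assumption)."""
    from groq import Groq  # imported lazily so the app loads without the package
    client = Groq(api_key=os.environ["GROQ_API_KEY"])
    resp = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[
            {"role": "user", "content": f"Summarize this Reddit post:\n\n{text}"}
        ],
    )
    return resp.choices[0].message.content

@app.route("/summarize")
def summarize_endpoint():
    title = request.args.get("title", "")
    if not title:
        return jsonify({"error": "missing 'title' parameter"}), 400
    # The real service first fetches the matching post's body from Reddit
    return jsonify({"title": title, "summary": summarize(title)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Grafana's JSON API panel only needs the endpoint to return JSON, so the LLM provider behind it can be swapped without touching the dashboard.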
```
reddit-data-pipeline/
├── README.md                 # You are here
├── docs/
│   └── architecture.png      # Architecture diagram
├── ingestion_layer/
│   ├── lambda_function.py    # Reddit to S3 ingestion script
│   ├── README.md             # Lambda deployment and trigger setup
│   └── screenshots/          # Setup screenshots
├── api_integration/
│   ├── app.py                # Flask API for summarization
│   ├── requirements.txt      # Python dependencies
│   └── README.md             # API deployment on EC2 and Grafana JSON plugin
└── visualization_layer/
    ├── README.md             # Grafana install & Athena config steps
    └── screenshots/          # EC2 and Grafana configuration screenshots
```

See the detailed READMEs in each layer's folder.
- Deploy the Lambda function with Reddit API keys to collect daily data → ingestion_layer/
- Query with Athena and make sure the tables are defined
- Set up Grafana on EC2 and connect it to Athena → visualization_layer/
- Deploy the Flask summarization API on the same EC2 instance → api_integration/
- Use the Grafana JSON API plugin to send article titles for LLM-based summaries
Author: Youssef Makhlouf 📧 [[email protected]] 📍 Tunisia — SUPCOM Engineering Student 🎯 Interested in Data Engineering · Cloud · AI
