This project builds a complete, intelligent data pipeline that not only extracts and processes Reddit data at scale, but also transforms it into actionable insights and live AI-generated summaries, all within an interactive Grafana dashboard.
- ✅ Ingest real-world, unstructured data from Reddit and automate its collection
- ✅ Transform and model the data for analytics using AWS-native tools
- ✅ Visualize meaningful trends (e.g., trending topics, user engagement) through Grafana
- ✅ Bridge data engineering with AI by integrating a live LLM-based summarization API into the dashboard
📊 This project combines the power of batch processing with AI augmentation, creating a smart dashboard where users can not only explore the data but also ask it to explain itself.
Whether it’s for media trend analysis, opinion monitoring, or tech content discovery, this architecture can serve as a template for data pipelines enhanced with LLMs — turning raw data into both visual stories and summarized knowledge.
| Layer | Tool/Service | Purpose |
|---|---|---|
| Ingestion | AWS Lambda + PRAW | Scheduled batch extraction from Reddit |
| Storage (raw/clean) | Amazon S3 | Cost-effective, scalable data lake |
| ETL | AWS Glue | Data cleaning and transformation |
| Query Engine | AWS Athena | Serverless SQL engine for querying S3 |
| Visualization | Grafana | Dashboarding tool connected to Athena |
| Integration/API | Flask on EC2 | JSON API for LLM-powered summarization |
| Language | Python | All processing and APIs |
| LLM | Groq LLM | Summarization of Reddit articles |
I chose this architecture to reflect a real-world production-ready batch data flow using serverless and cloud-native components:
- Batch processing fits Reddit's post dynamics (daily trends) better than real-time streaming.
- AWS Lambda + EventBridge offers a simple and cost-efficient ingestion trigger.
- S3 + Glue + Athena is a popular trio in the data engineering world for building flexible data lakes.
- Grafana provides an open-source dashboard with plugin support (Athena + JSON API).
- Flask API on EC2 separates LLM logic from visualization and is easily extendable.
This architecture allows each component to scale independently and be swapped if needed.
This project uses batch processing, running on a daily schedule:
- **Ingestion:** AWS Lambda pulls the top 100 "hot" posts from the r/ArtificialInteligence subreddit using Reddit's API (via PRAW) and saves the results as JSON Lines to S3.
- **ETL with AWS Glue:** A scheduled Glue job transforms the raw Reddit data into a cleaned format, adding structure for efficient querying.
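Glue jobs can run as PySpark or plain Python shell scripts; purely as an illustration (the raw field names are assumptions about the schema), the per-record cleaning step might look like:

```python
from datetime import datetime, timezone

def clean_post(raw: dict) -> dict:
    """Normalize one raw Reddit record for analytics (field names are assumed)."""
    return {
        "id": raw["id"],
        "title": raw.get("title", "").strip(),
        "author": raw.get("author") or "[deleted]",  # missing author -> deleted account
        "score": int(raw.get("score", 0)),
        "num_comments": int(raw.get("num_comments", 0)),
        # Date string derived from the epoch timestamp, handy for S3 partitioning
        "created_date": datetime.fromtimestamp(
            raw["created_utc"], tz=timezone.utc
        ).strftime("%Y-%m-%d"),
    }
```

Writing the cleaned records back to S3 in a columnar or partitioned layout is what makes the Athena queries below cheap and fast.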
- **Querying via AWS Athena:** Athena reads the cleaned data directly from S3 and enables SQL-based analytics on the posts, such as top authors, engagement trends, or most-discussed posts.
- **Visualization with Grafana:** Grafana queries Athena and displays the results using panels, filters, and time-based analysis. Grafana runs on EC2 and authenticates to Athena via the default AWS credential chain.
- **Summarization via API:** Grafana also integrates a custom JSON API panel. When a user selects a Reddit article title, the panel sends a request to a Flask API hosted on EC2. This API:
  - searches Reddit by title
  - uses the Groq LLM to summarize the post content
  - returns a clean summary that is displayed live on the dashboard
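A minimal sketch of such a Flask endpoint (the route name, query parameter, and Groq model are assumptions; the real service also looks up the post body on Reddit before summarizing):

```python
import os

from flask import Flask, jsonify, request

app = Flask(__name__)

def summarize(text: str) -> str:
    """Ask a Groq-hosted model for a summary (model name is an assumption)."""
    from groq import Groq  # imported lazily so the app loads without the package
    client = Groq(api_key=os.environ["GROQ_API_KEY"])
    resp = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[
            {"role": "user", "content": f"Summarize this Reddit post:\n\n{text}"}
        ],
    )
    return resp.choices[0].message.content

@app.route("/summarize")
def summarize_endpoint():
    title = request.args.get("title", "")
    if not title:
        return jsonify({"error": "missing 'title' parameter"}), 400
    # The real service first fetches the matching post's body from Reddit
    return jsonify({"title": title, "summary": summarize(title)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Grafana's JSON API panel only needs the endpoint to return JSON, so the LLM provider behind it can be swapped without touching the dashboard.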
```
reddit-data-pipeline/
├── README.md                 # You are here
├── docs/
│   └── architecture.png      # Architecture diagram
├── ingestion_layer/
│   ├── lambda_function.py    # Reddit to S3 ingestion script
│   ├── README.md             # Lambda deployment and trigger setup
│   └── screenshots/          # Setup screenshots
├── api_integration/
│   ├── app.py                # Flask API for summarization
│   ├── requirements.txt      # Python dependencies
│   └── README.md             # API deployment on EC2 and Grafana JSON plugin
└── visualization_layer/
    ├── README.md             # Grafana install & Athena config steps
    └── screenshots/          # EC2 and Grafana configuration screenshots
```

See the detailed READMEs in each layer's folder.
- Deploy the Lambda function with Reddit API keys to collect daily data → ingestion_layer/
- Query with Athena and make sure the tables are defined
- Set up Grafana on EC2 and connect it to Athena → visualization_layer/
- Deploy the Flask summarization API on the same EC2 instance → api_integration/
- Use the Grafana JSON API plugin to send article titles for LLM-based summaries
Author: Youssef Makhlouf 📧 [[email protected]] 📍 Tunisia — SUPCOM Engineering Student 🎯 Interested in Data Engineering · Cloud · AI
