A production-ready data engineering pipeline that ingests Reddit posts and comments with Python, delivers batches to AWS Kinesis Firehose, lands raw data in S3, publishes processing events with SNS, transforms with AWS Lambda, and stores curated datasets in Supabase.
Reddit → Python Ingestion → AWS Kinesis Firehose → AWS S3 → AWS SNS → AWS Lambda → Supabase
- Reddit: data source (subreddits feed)
- Python Ingestion: PRAW-based scrapers with batching and retry/backoff
- Kinesis Firehose: reliable delivery stream to S3
- S3: raw data lake storage (partitioned)
- SNS: event notifications to trigger downstream processing
- Lambda: stateless validation, transformation, enrichment
- Supabase: relational storage for curated posts and comments
Notebook/
README.md
makefile
pyproject.toml
poetry.lock
mypy.ini
sentiment_analysis/
__init__.py
logging_config.py # project-wide logger
exception.py # custom exception utilities
scrapping/
__init__.py
data_scrapping.py # Reddit ingestion (PRAW): posts + comments, batching
AWS_processing/
__init__.py
kinesis_firehose/
__init__.py
kinesis_firehose.py # Firehose client and PutRecordBatch wrapper
lambda_function/
__init__.py
lambda_function.py # SNS-triggered Lambda: read S3 → transform (basic ETL) → load
data_ingestion/ # ingests data from Supabase for machine-learning workflows
data_transformation/ # data transformation for machine-learning feature engineering
storage/
db_schema.sql # Supabase schema (posts, comments)
utils/ # (reserved for helpers)
`sentiment_analysis/scrapping/data_scrapping.py`
- Authenticates with Reddit using PRAW
- Pulls posts/comments by topic (best/new/trending)
- Batches and prepares payloads for delivery
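The ingestion flow above can be sketched as follows. This is an illustrative outline, not the module's actual API: the helper names, the credential placeholders, and the record shape are assumptions.

```python
# Sketch of Reddit ingestion with batching. CLIENT_ID / CLIENT_SECRET are
# placeholders for values loaded from your .env / secrets store.
from itertools import islice
from typing import Iterable, Iterator


def batched(items: Iterable[dict], size: int) -> Iterator[list[dict]]:
    # Yield fixed-size batches; Firehose PutRecordBatch accepts at most 500 records.
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def fetch_posts(subreddit: str, limit: int = 100) -> Iterator[dict]:
    # Lazy import so the batching helper is usable even without PRAW installed.
    import praw

    reddit = praw.Reddit(
        client_id="CLIENT_ID",          # placeholder
        client_secret="CLIENT_SECRET",  # placeholder
        user_agent="sentiment_analysis/0.1",
    )
    for post in reddit.subreddit(subreddit).new(limit=limit):
        yield {"id": post.id, "title": post.title, "created_utc": post.created_utc}
```

Batches produced by `batched` can then be handed to the Firehose wrapper for delivery.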
`sentiment_analysis/AWS_processing/kinesis_firehose/kinesis_firehose.py`
- Thin client around Firehose
- Handles PutRecord/PutRecordBatch with basic retries
- Expects JSON payloads (posts/comments)
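A minimal sketch of a `PutRecordBatch` wrapper with retries, assuming the client is injected (with boto3 that would be `boto3.client("firehose")`). The real module's interface may differ; `max_retries` and the backoff policy here are assumptions.

```python
# Retry only the records Firehose reports as failed, with exponential backoff.
import json
import time
from typing import Any


def put_batch(client: Any, stream: str, records: list[dict], max_retries: int = 3) -> int:
    # Firehose records are bytes; newline-delimited JSON keeps S3 objects splittable.
    pending = [{"Data": (json.dumps(r) + "\n").encode()} for r in records]
    for attempt in range(max_retries + 1):
        resp = client.put_record_batch(DeliveryStreamName=stream, Records=pending)
        if resp["FailedPutCount"] == 0:
            return 0
        # Keep only the records whose per-record response carries an ErrorCode.
        pending = [
            rec for rec, res in zip(pending, resp["RequestResponses"])
            if "ErrorCode" in res
        ]
        time.sleep(2 ** attempt * 0.1)  # simple exponential backoff
    return len(pending)  # records still undelivered after all retries
```

Passing the client in (rather than constructing it inside) keeps the retry logic testable with a stub.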
`sentiment_analysis/AWS_processing/lambda_function/lambda_function.py`
- SNS-triggered entrypoint
- Reads new S3 objects (from Firehose)
- Validates/cleans/transforms records
- Upserts curated rows into Supabase
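The handler's shape can be sketched like this. SNS delivers the S3 event notification as a JSON string in the message body; the `transform` step and the Supabase load are placeholders for the module's actual logic.

```python
# Sketch of an SNS-triggered Lambda: unwrap the S3 event, read the object,
# transform line-delimited JSON records, then load the curated rows.
import json
from typing import Iterator


def s3_objects_from_sns(event: dict) -> Iterator[tuple[str, str]]:
    # SNS wraps the S3 notification as a JSON string in Sns.Message.
    for rec in event["Records"]:
        s3_event = json.loads(rec["Sns"]["Message"])
        for s3_rec in s3_event.get("Records", []):
            yield s3_rec["s3"]["bucket"]["name"], s3_rec["s3"]["object"]["key"]


def transform(raw: dict) -> dict:
    # Placeholder cleanup step: keep only the curated columns.
    return {"id": raw["id"], "title": raw.get("title", "").strip()}


def handler(event: dict, context: object) -> None:
    import boto3  # provided by the Lambda runtime

    s3 = boto3.client("s3")
    for bucket, key in s3_objects_from_sns(event):
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode()
        rows = [transform(json.loads(line)) for line in body.splitlines() if line]
        # Upsert `rows` into Supabase here (e.g. via supabase-py or PostgREST).
```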
`sentiment_analysis/storage/db_schema.sql`
- SQL schema for Supabase tables: posts and comments
- Indexes for common query patterns
`sentiment_analysis/data_ingestion`
- ML-focused data ingestion modules for model training workflows
- Contains utilities for data staging and preprocessing
`sentiment_analysis/data_transformation`
- ML-focused data transformation utilities for feature engineering
- Handles data normalization and preprocessing for machine learning models
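An illustrative normalization helper of the kind these modules provide; the function name and cleaning rules here are assumptions, not the actual API.

```python
# Basic text normalization for sentiment features: lowercase, drop URLs,
# strip punctuation, collapse whitespace.
import re


def normalize_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"http\S+", "", text)       # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # strip punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace
```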
- Python 3.10+
- Poetry
- AWS access (Kinesis Firehose, S3, SNS, Lambda)
- Supabase project (database URL + API key)
`poetry install`

Create an `.env` file (see your own secrets store) with at least:
- Reddit credentials (PRAW)
- AWS credentials (profile/keys or role)
- Supabase URL + Key
`poetry run python sentiment_analysis/scrapping/data_scrapping.py`

This will fetch batches from Reddit and deliver them to Kinesis Firehose, which writes to S3.
- Deploy `sentiment_analysis/AWS_processing/lambda_function/lambda_function.py` as a Lambda
- Configure SNS to trigger the Lambda when new S3 objects land
- Lambda reads the S3 object, transforms, and loads into Supabase using the schema in `storage/db_schema.sql`
- Scraper pulls posts/comments → batches to Firehose
- Firehose delivers to S3 under time-based prefixes
- S3 PUT event → SNS notification
- SNS triggers Lambda
- Lambda reads object, validates/transforms, writes to Supabase
- make lint — ruff/mypy/black checks
- make test — run tests (add tests under tests/)
- Notebooks in Notebook/ for exploration only
MIT