A production-ready data engineering pipeline that ingests Reddit posts and comments with Python, delivers batches to AWS Kinesis Firehose, lands raw data in S3, publishes processing events with SNS, transforms with AWS Lambda, and stores curated datasets in Supabase.
Reddit → Python Ingestion → AWS Kinesis Firehose → AWS S3 → AWS SNS → AWS Lambda → Supabase
- Reddit: data source (subreddits feed)
- Python Ingestion: PRAW-based scrapers with batching and retry/backoff
- Kinesis Firehose: reliable delivery stream to S3
- S3: raw data lake storage (partitioned)
- SNS: event notifications to trigger downstream processing
- Lambda: stateless validation, transformation, enrichment
- Supabase: relational storage for curated posts and comments
Notebook/
README.md
makefile
pyproject.toml
poetry.lock
mypy.ini
sentiment_analysis/
__init__.py
logging_config.py # project-wide logger
exception.py # custom exception utilities
scrapping/
__init__.py
data_scrapping.py # Reddit ingestion (PRAW): posts + comments, batching
AWS_processing/
__init__.py
kinesis_firehose/
__init__.py
kinesis_firehose.py # Firehose client and PutRecordBatch wrapper
lambda_function/
__init__.py
lambda_function.py # SNS-triggered Lambda: read S3 → transform (basic ETL) → load
data_ingestion/ # ingests data from Supabase for machine-learning workflows
data_transformation/ # data transformation for machine-learning feature engineering
storage/
db_schema.sql # Supabase schema (posts, comments)
utils/ # (reserved for helpers)
`sentiment_analysis/scrapping/data_scrapping.py`
- Authenticates with Reddit using PRAW
- Pulls posts/comments by topic (best/new/trending)
- Batches and prepares payloads for delivery
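The ingestion flow above can be sketched as follows. This is an illustrative outline, not the module's actual API: the helper names, the credential placeholders, and the record shape are assumptions.

```python
# Sketch of Reddit ingestion with batching. CLIENT_ID / CLIENT_SECRET are
# placeholders for values loaded from your .env / secrets store.
from itertools import islice
from typing import Iterable, Iterator


def batched(items: Iterable[dict], size: int) -> Iterator[list[dict]]:
    # Yield fixed-size batches; Firehose PutRecordBatch accepts at most 500 records.
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def fetch_posts(subreddit: str, limit: int = 100) -> Iterator[dict]:
    # Lazy import so the batching helper is usable even without PRAW installed.
    import praw

    reddit = praw.Reddit(
        client_id="CLIENT_ID",          # placeholder
        client_secret="CLIENT_SECRET",  # placeholder
        user_agent="sentiment_analysis/0.1",
    )
    for post in reddit.subreddit(subreddit).new(limit=limit):
        yield {"id": post.id, "title": post.title, "created_utc": post.created_utc}
```

Batches produced by `batched` can then be handed to the Firehose wrapper for delivery.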
`sentiment_analysis/AWS_processing/kinesis_firehose/kinesis_firehose.py`
- Thin client around Firehose
- Handles PutRecord/PutRecordBatch with basic retries
- Expects JSON payloads (posts/comments)
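A minimal sketch of a `PutRecordBatch` wrapper with retries, assuming the client is injected (with boto3 that would be `boto3.client("firehose")`). The real module's interface may differ; `max_retries` and the backoff policy here are assumptions.

```python
# Retry only the records Firehose reports as failed, with exponential backoff.
import json
import time
from typing import Any


def put_batch(client: Any, stream: str, records: list[dict], max_retries: int = 3) -> int:
    # Firehose records are bytes; newline-delimited JSON keeps S3 objects splittable.
    pending = [{"Data": (json.dumps(r) + "\n").encode()} for r in records]
    for attempt in range(max_retries + 1):
        resp = client.put_record_batch(DeliveryStreamName=stream, Records=pending)
        if resp["FailedPutCount"] == 0:
            return 0
        # Keep only the records whose per-record response carries an ErrorCode.
        pending = [
            rec for rec, res in zip(pending, resp["RequestResponses"])
            if "ErrorCode" in res
        ]
        time.sleep(2 ** attempt * 0.1)  # simple exponential backoff
    return len(pending)  # records still undelivered after all retries
```

Passing the client in (rather than constructing it inside) keeps the retry logic testable with a stub.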
`sentiment_analysis/AWS_processing/lambda_function/lambda_function.py`
- SNS-triggered entrypoint
- Reads new S3 objects (from Firehose)
- Validates/cleans/transforms records
- Upserts curated rows into Supabase
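The handler's shape can be sketched like this. SNS delivers the S3 event notification as a JSON string in the message body; the `transform` step and the Supabase load are placeholders for the module's actual logic.

```python
# Sketch of an SNS-triggered Lambda: unwrap the S3 event, read the object,
# transform line-delimited JSON records, then load the curated rows.
import json
from typing import Iterator


def s3_objects_from_sns(event: dict) -> Iterator[tuple[str, str]]:
    # SNS wraps the S3 notification as a JSON string in Sns.Message.
    for rec in event["Records"]:
        s3_event = json.loads(rec["Sns"]["Message"])
        for s3_rec in s3_event.get("Records", []):
            yield s3_rec["s3"]["bucket"]["name"], s3_rec["s3"]["object"]["key"]


def transform(raw: dict) -> dict:
    # Placeholder cleanup step: keep only the curated columns.
    return {"id": raw["id"], "title": raw.get("title", "").strip()}


def handler(event: dict, context: object) -> None:
    import boto3  # provided by the Lambda runtime

    s3 = boto3.client("s3")
    for bucket, key in s3_objects_from_sns(event):
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode()
        rows = [transform(json.loads(line)) for line in body.splitlines() if line]
        # Upsert `rows` into Supabase here (e.g. via supabase-py or PostgREST).
```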
`sentiment_analysis/storage/db_schema.sql`
- SQL schema for Supabase tables: posts and comments
- Indexes for common query patterns
`sentiment_analysis/data_ingestion`
- ML-focused data ingestion modules for model training workflows
- Contains utilities for data staging and preprocessing
`sentiment_analysis/data_transformation`
- ML-focused data transformation utilities for feature engineering
- Handles data normalization and preprocessing for machine learning models
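An illustrative normalization helper of the kind these modules provide; the function name and cleaning rules here are assumptions, not the actual API.

```python
# Basic text normalization for sentiment features: lowercase, drop URLs,
# strip punctuation, collapse whitespace.
import re


def normalize_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"http\S+", "", text)       # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # strip punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace
```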
- Python 3.10+
- Poetry
- AWS access (Kinesis Firehose, S3, SNS, Lambda)
- Supabase project (database URL + API key)
`poetry install`

Create an `.env` file (see your own secrets store) with at least:
- Reddit credentials (PRAW)
- AWS credentials (profile/keys or role)
- Supabase URL + Key
`poetry run python sentiment_analysis/scrapping/data_scrapping.py`

This will fetch batches from Reddit and deliver them to Kinesis Firehose, which writes to S3.
- Deploy `sentiment_analysis/AWS_processing/lambda_function/lambda_function.py` as a Lambda
- Configure SNS to trigger the Lambda when new S3 objects land
- Lambda reads the S3 object, transforms, and loads into Supabase using the schema in `storage/db_schema.sql`
- Scraper pulls posts/comments → batches to Firehose
- Firehose delivers to S3 under time-based prefixes
- S3 PUT event → SNS notification
- SNS triggers Lambda
- Lambda reads object, validates/transforms, writes to Supabase
- make lint — ruff/mypy/black checks
- make test — run tests (add tests under tests/)
- Notebooks in Notebook/ for exploration only
MIT