# Data Platform Development Lab & Group Project
> Data Engineering 2025 Program at STI (Stockholm)
An event-driven data engineering platform simulating a fleet of smart home appliances. It ingests, validates, and stores raw sensor data via Apache Kafka and PostgreSQL, implementing a strict Medallion Architecture (Bronze, Silver, Gold). The platform is designed to process streaming telemetry, catch physical machine faults, and serve analytical data for predictive maintenance dashboards.
This platform translates raw IoT telemetry into actionable operational insights through four core pillars:
- Streaming Ingestion (Apache Kafka): From Reactive to Proactive. We ingest real-time telemetry to detect anomalies instantly, rather than waiting for customer breakdown reports.
- Medallion Architecture: Trust in Data. By enforcing strict Pydantic quality gates across decoupled layers (Bronze, Silver, Gold), we ensure that our BI dashboards reflect an accurate, noise-free single source of truth.
- Infrastructure as Code (Docker & CI/CD): Disaster Recovery & Portability. The entire platform is containerized and validated by GitHub Actions. If a server goes down, the environment can be fully restored in minutes.
- Database Migrations (Alembic): Safe Evolution. We manage our PostgreSQL schemas with Alembic, enabling zero-downtime upgrades and instant rollbacks to protect historical data integrity.
- Data Generation: Python 3.12+, Faker
- Message Broker: Apache Kafka (Docker/KRaft)
- Ingestion & ETL: confluent-kafka, psycopg (v3)
- Data Validation: Pydantic
- Database & Migrations: PostgreSQL 16, Alembic
- Serving & Visualization: FastAPI, Streamlit
- DevOps: Docker Compose, uv, GitHub Actions, Ruff, Pytest
```text
iot_sensor_pipeline/
├── alembic/             # Database migration scripts & history
├── data/
│   ├── raw/             # Cold storage for generated JSONL source of truth
│   └── processed/       # Silver layer backups
├── docs/                # Architecture Docs (CDM, LDM, PDM, Overviews)
│   ├── diagrams/        # Visual representations of data flow
│   └── modules/         # Deep dives into Business Value & Technical decisions
├── src/
│   ├── api/             # FastAPI backend with connection pooling
│   ├── config/          # Centralized environment variable management
│   ├── consumer/        # Kafka Consumer & Bronze ingestion (Quality Gate)
│   ├── dashboard/       # Streamlit UI (Overview, Anomalies, Errors)
│   ├── gold/            # Star Schema ETL & Daily Aggregations
│   ├── producer/        # Stateful Fleet simulator & Kafka producer
│   ├── schemas/         # Global Pydantic data contracts
│   ├── silver/          # Idempotent cleaning & structural transformations
│   └── test/            # Pytest suite for API and Data Validation
├── .env.example         # Template for environment variables
├── docker-compose.yml   # Local infrastructure orchestration
├── Dockerfile           # Unified, optimized Python image
└── pyproject.toml       # Dependencies managed by uv
```
- Docker / Docker Desktop
- uv (fast Python package manager)
```bash
git clone https://github.com/JohnnyHyytiainen/group_project_dataplatform
cd group_project_dataplatform
cp .env.example .env
```

(Ensure your .env contains the correct database credentials.)
Spin up the entire Medallion Architecture (PostgreSQL, Kafka, API, Consumer, Producer, and Dashboard) with a single command:

```bash
docker compose up -d --build
```

Note: The Producer will automatically start simulating the 1,200-machine fleet, and the Consumer will begin ingesting into the Bronze layer.
- Streamlit BI Dashboard: http://localhost:8501
- FastAPI Swagger UI: http://localhost:8000/docs
To ensure your database schema is up to date with the latest code, run:

```bash
uv run alembic upgrade head
```

(To tear down the environment and wipe the database volumes, run `docker compose down -v`.)
- Stateful Data Generation: Simulates continuous wear-and-tear (`run_hours`) with Chaos Engineering (intentional anomalies).
- Event Streaming: Decoupled architecture using Apache Kafka.
- Quality Gate: Real-time Pydantic validation routes corrupt data to a Dead Letter Queue (`faulty_events`).
- ELT Storage: Preserves raw JSON payloads as `TEXT` in PostgreSQL.
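The quality-gate routing above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `EngineReading` model, its field names, and the destination table names are all hypothetical stand-ins for the real contracts in `src/schemas/`.

```python
import json

from pydantic import BaseModel, ValidationError


class EngineReading(BaseModel):
    """Hypothetical data contract -- field names are illustrative only."""
    machine_id: str
    run_hours: float
    temperature_c: float
    rpm: int


def route_event(raw: str) -> tuple[str, str]:
    """Validate a raw JSON payload and pick its destination.

    Valid events go to the Bronze table; corrupt ones go to the
    Dead Letter Queue (faulty_events). In both cases the raw TEXT
    payload is preserved unchanged, in ELT style.
    """
    try:
        EngineReading.model_validate_json(raw)
        return ("bronze_readings", raw)
    except ValidationError:
        return ("faulty_events", raw)


good = json.dumps({"machine_id": "M-001", "run_hours": 12.5,
                   "temperature_c": 71.2, "rpm": 1800})
bad = json.dumps({"machine_id": "M-002", "rpm": "not-a-number"})

print(route_event(good)[0])  # valid -> Bronze
print(route_event(bad)[0])   # corrupt -> Dead Letter Queue
```

The key design point is that validation failures never drop data: the unmodified payload is kept so faulty events can be inspected and replayed later.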
- Pure Python ETL: Extracts raw Bronze data without relying on heavy frameworks like Pandas.
- Data Cleaning: Strips whitespace, standardizes casing, and handles missing IDs using soft-filtering `is_valid` flags.
- Idempotent Upserts: Ensures no duplicate data via `ON CONFLICT DO NOTHING`.
- Database Versioning: Schema managed securely via Alembic.
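As a rough illustration of the idempotent-upsert pattern, here is what such a statement could look like, together with a pure-Python mirror of its semantics. The table and column names are hypothetical placeholders, not the project's actual Silver schema.

```python
# Sketch of an idempotent upsert as it might be executed via psycopg (v3).
# Names are illustrative; the real Silver tables may differ.
UPSERT_SQL = """
INSERT INTO silver_readings (reading_id, machine_id, temperature_c)
VALUES (%(reading_id)s, %(machine_id)s, %(temperature_c)s)
ON CONFLICT (reading_id) DO NOTHING;
"""


def dedupe_locally(rows: list[dict]) -> list[dict]:
    """Pure-Python mirror of ON CONFLICT DO NOTHING semantics:
    only the first row per reading_id survives, so re-running the
    same batch is a no-op -- the property that makes the ETL safe
    to replay after a crash."""
    seen: set[str] = set()
    kept: list[dict] = []
    for row in rows:
        if row["reading_id"] not in seen:
            seen.add(row["reading_id"])
            kept.append(row)
    return kept


batch = [{"reading_id": "r1"}, {"reading_id": "r1"}, {"reading_id": "r2"}]
print([r["reading_id"] for r in dedupe_locally(batch)])
```

Because the conflict target is the primary key, reprocessing an already-loaded Kafka offset simply inserts zero rows instead of raising a duplicate-key error.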
- Dimensional Modeling: Implemented a strict Star Schema (`FACT_SENSOR_READING`, `DIM_ENGINE`, `DIM_LOCATION`, etc.).
- Business Logic in SQL: Calculates physical machine faults (Maintenance, Temperature, RPM, and Vibration warnings).
- BI Integration: Connects seamlessly to Streamlit for real-time Executive Dashboards.
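To make the fault-detection logic concrete, here is a pure-Python mirror of the kind of SQL CASE rules the Gold layer applies. The thresholds and field names are illustrative assumptions, not the project's actual values.

```python
def fault_warnings(reading: dict) -> list[str]:
    """Sketch of the Gold layer's fault rules in Python.

    Each rule corresponds to one warning category; thresholds
    here are made up for illustration only.
    """
    warnings: list[str] = []
    if reading["run_hours"] >= 10_000:       # due for service
        warnings.append("MAINTENANCE")
    if reading["temperature_c"] > 95.0:      # overheating
        warnings.append("TEMPERATURE")
    if reading["rpm"] > 3_000:               # over-revving
        warnings.append("RPM")
    if reading["vibration_mm_s"] > 7.1:      # bearing wear
        warnings.append("VIBRATION")
    return warnings


hot_machine = {"run_hours": 12_000, "temperature_c": 99.0,
               "rpm": 1_500, "vibration_mm_s": 2.0}
print(fault_warnings(hot_machine))
```

In the platform itself these rules live in SQL over `FACT_SENSOR_READING`, so the dashboard reads precomputed warning flags instead of re-deriving them per request.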
- FastAPI Backend: Serves clean data with enforced pagination, capping response sizes and shielding the database from oversized queries.
- Connection Pooling: Uses `psycopg_pool` managed via `@asynccontextmanager` to prevent database overload.
- Dynamic Filtering: `WHERE 1=1` implementation for flexible query parameters.
For a deep dive into our engineering decisions, please explore the docs/ folder:
- Bronze Layer Architecture & Setup
- Database Design & Medallion Philosophy
- Data Lineage & Soft Deletes
- Database Migrations with Alembic
- CI/CD with GitHub Actions
- API Core & Connection Pooling
- Bronze Layer Flowchart overview
- Silver Layer Flowchart overview
- Conceptual, Logical, and Physical Data Models for Bronze, Silver, Gold layers