RideStream is an end-to-end Azure-based data engineering project that ingests batch ride data and real-time booking events, unifies them into a lakehouse architecture, and serves curated analytics-ready datasets for business users.
The project is inspired by real-world ride-hailing data platform patterns and demonstrates modern data engineering practices using Azure Data Factory, Azure Storage, Azure Event Hubs, Databricks, Delta Live Tables, and a medallion-style architecture.
The pipeline is designed to process two types of data:
- Batch data: Bulk ride and mapping data stored on an HTTP server and in internal sources.
- Streaming data: Live booking events captured from the application through Azure Event Hubs.
Both sources are landed in Azure Data Lake Storage Gen2, then processed in Databricks using streaming and batch transformations. The data is unified into a silver-layer OBT (One Big Table) and then modeled into a gold-layer star schema for analytics.
Ride-hailing companies generate both historical and live operational data. Business users need a reliable platform to:
- Track ride bookings in real time.
- Analyze driver, passenger, vehicle, and location performance.
- Maintain consistent historical records for changing dimensions.
- Expose clean fact and dimension tables for downstream reporting.
RideStream solves this by combining batch and streaming data into a unified analytics model. The platform is built on:
- Azure Data Factory for orchestration and batch ingestion.
- Azure Data Lake Storage Gen2 for raw and curated data storage.
- Azure Event Hubs for capturing live booking events.
- Azure Databricks for data processing and transformation.
- Delta Live Tables for declarative pipeline management.
- Streaming tables for incremental processing.
- Delta Lake for reliable storage and ACID transactions.
- JSON-based external mappings for reference and lookup data.
Batch files and mapping data are copied from HTTP/internal sources into Azure Data Lake using Azure Data Factory.
Live booking events are ingested through Azure Event Hubs and made available for processing in Databricks.
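Before a booking event can join the lakehouse flow, its Event Hubs message body has to be decoded into a flat record. The sketch below shows that step in plain Python; the field names (`booking_id`, `ride_id`, `event_time`, `status`) are illustrative assumptions, since the real schema depends on the producing application.

```python
import json

def parse_booking_event(raw_body: bytes) -> dict:
    """Decode one Event Hubs message body into a flat booking record.

    Field names are hypothetical -- adapt them to the actual
    application payload.
    """
    event = json.loads(raw_body.decode("utf-8"))
    return {
        "booking_id": event["booking_id"],
        "ride_id": event["ride_id"],
        "event_time": event["event_time"],
        # default for producers that omit the status field
        "status": event.get("status", "unknown"),
    }
```

In the actual pipeline this logic would typically run as a `from_json` transformation on the Event Hubs stream inside Databricks rather than per-message Python.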
Batch and streaming sources are combined into one unified flow using append-based processing and Databricks streaming logic.
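The append-based unification can be pictured as: concatenate both sources, then keep the most recent record per key. A minimal sketch of that rule, assuming illustrative `ride_id` and `event_time` columns (the real pipeline does this incrementally with Databricks streaming, not an in-memory pass):

```python
def unify_append(batch_rows, stream_rows):
    """Append both sources, then deduplicate: for each ride_id,
    the record with the latest event_time wins."""
    latest = {}
    for row in list(batch_rows) + list(stream_rows):
        key = row["ride_id"]
        if key not in latest or row["event_time"] > latest[key]["event_time"]:
            latest[key] = row
    # stable ordering for readability only
    return sorted(latest.values(), key=lambda r: r["ride_id"])
```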
The silver layer contains the OBT (One Big Table), which consolidates ride, booking, driver, passenger, vehicle, and location information into a clean analytical format.
The gold layer is modeled as a star schema with dimension tables and a fact table for business reporting.
The silver layer acts as the curated transformation layer. It integrates data from:
- Batch ride datasets.
- Streaming booking events.
- External JSON mappings.
This layer creates a unified OBT that simplifies downstream modeling and improves usability for analytics and reporting.
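The OBT construction amounts to left-joining the supporting entities onto the ride grain. A simplified sketch, assuming hypothetical keys and columns (`ride_id`, `driver_id`, `payment_code`); in the pipeline itself these joins run in Databricks over Delta tables:

```python
def build_obt(rides, bookings, drivers, mappings):
    """Left-join bookings, drivers, and external JSON mappings
    onto rides to form the One Big Table (silver layer)."""
    bookings_by_ride = {b["ride_id"]: b for b in bookings}
    drivers_by_id = {d["driver_id"]: d for d in drivers}
    obt = []
    for ride in rides:
        row = dict(ride)
        booking = bookings_by_ride.get(ride["ride_id"], {})
        driver = drivers_by_id.get(ride.get("driver_id"), {})
        row["booking_status"] = booking.get("status")
        row["driver_name"] = driver.get("name")
        # external JSON mapping, e.g. payment code -> readable label
        row["payment_label"] = mappings.get(ride.get("payment_code"), "unknown")
        obt.append(row)
    return obt
```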
The gold layer is built as a dimensional model with the following tables:
- dim_driver
- dim_passenger
- dim_vehicle
- dim_booking
- dim_location
- fact_ride
This structure is optimized for business intelligence, dashboarding, and SQL-based reporting.
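Deriving the star schema from the OBT follows a standard pattern: take distinct rows per natural key into each dimension, assign surrogate keys, then build the fact table referencing those keys. A hedged sketch with hypothetical column names (`driver_id`, `fare_amount`):

```python
def build_dim(obt_rows, natural_key, attrs):
    """Distinct rows on natural_key with a dense surrogate key.
    Returns the dimension and a natural-key -> surrogate-key map."""
    dim, key_map = [], {}
    for row in obt_rows:
        nk = row[natural_key]
        if nk not in key_map:
            key_map[nk] = len(dim) + 1
            dim.append({f"{natural_key}_sk": key_map[nk],
                        natural_key: nk,
                        **{a: row.get(a) for a in attrs}})
    return dim, key_map

def build_fact(obt_rows, driver_keys):
    """Fact table at ride grain, referencing dimension surrogate keys."""
    return [{"ride_id": r["ride_id"],
             "driver_sk": driver_keys[r["driver_id"]],
             "fare_amount": r["fare_amount"]} for r in obt_rows]
```

The same `build_dim` pattern applies to each of the dimensions above; only the natural key and attribute list change.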
The project implements Slowly Changing Dimension Type 2 for the city dimension.
This allows the pipeline to preserve historical changes to city attributes over time while still supporting current-state analytics. It is especially useful when business attributes change and historical accuracy must be maintained.
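The SCD Type 2 merge boils down to: for each changed city, expire the current row (set `is_current = False` and an end date) and insert a new current row. A minimal sketch of that rule with hypothetical column names; in the pipeline this would be expressed as a Delta Lake merge rather than in-memory Python:

```python
def scd2_apply(dim_city, changes, as_of):
    """SCD Type 2: expire the current row for each changed city
    and insert a new current version effective as_of."""
    out = [dict(r) for r in dim_city]
    current = {r["city_id"]: r for r in out if r["is_current"]}
    for change in changes:
        existing = current.get(change["city_id"])
        if existing and existing["city_name"] == change["city_name"]:
            continue  # attributes unchanged; nothing to do
        if existing:  # close out the old version
            existing["is_current"] = False
            existing["end_date"] = as_of
        out.append({"city_id": change["city_id"],
                    "city_name": change["city_name"],
                    "start_date": as_of,
                    "end_date": None,
                    "is_current": True})
    return out
```

Keeping both rows is what lets the model answer "what was this city called when the ride happened" while current-state queries simply filter on `is_current`.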
- End-to-end Azure lakehouse pipeline.
- Batch and streaming data integration.
- Medallion architecture with bronze, silver, and gold layers.
- Unified OBT creation in the silver layer.
- Star schema modeling in the gold layer.
- SCD Type 2 support for historical dimension tracking.
- Declarative pipelines using Delta Live Tables.
- Scalable and production-style data engineering design.