End-to-end data pipeline project for Uber Taxi 🚕
In this project, I designed and implemented an end-to-end data pipeline that consists of several stages:
- Extracted data from the NYC TLC Trip Record Data website and loaded it into Google Cloud Storage for further processing.
- Transformed and modeled the data in Python on Jupyter Notebook, applying fact and dimension data modeling concepts.
- Orchestrated the ETL pipeline on Mage and loaded the transformed data into Google BigQuery.
- Developed a dashboard on Looker Studio to generate insights.
As a data engineering project, the emphasis is primarily on the engineering aspect, with a lesser emphasis on analytics and dashboard development.
This project uses the TLC Trip Record Data which includes fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
More info about the dataset can be found here:
- Website: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
- Data Dictionary: https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
- Raw Data: uber_data.csv
The following technologies are used to build this project:
- Language: Python, SQL
- Extraction and transformation: Jupyter Notebook, Google BigQuery
- Storage: Google Cloud Storage
- Orchestration: Mage
- Dashboard: Looker Studio
- Step 1: Cleaning and transformation: The transformation code, written in Python to bring the data into the desired state and tested locally, is in transformation-code.ipynb
- Step 2: Storage: Load the data into a Google Cloud Storage bucket. Methods for uploading data to a bucket are described in Uploading objects in Google Cloud Storage
- Step 3: ETL orchestration: Mage
  - Data Loader: This code block fetches the data from the Google Cloud Storage bucket and readies it for transformation. Its output is the input to the transform block.
  - Transform: This code block transforms the raw CSV data into a model of fact and dimension tables and cleanses the data, for example by removing duplicates. Its output is the input to the export block.
  - Export: This code block takes the output of the transform block and exports the tables to a Google BigQuery dataset for further analysis.
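The cleaning in Step 1 can be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names follow the TLC yellow-taxi data dictionary (`tpep_pickup_datetime`, etc.), and the tiny inline DataFrame stands in for uber_data.csv.

```python
import pandas as pd

# Tiny stand-in for uber_data.csv (note the duplicated first trip)
df = pd.DataFrame({
    "tpep_pickup_datetime":  ["2016-03-01 00:00:00", "2016-03-01 00:00:00", "2016-03-01 01:10:00"],
    "tpep_dropoff_datetime": ["2016-03-01 00:07:55", "2016-03-01 00:07:55", "2016-03-01 01:25:00"],
    "passenger_count": [1, 1, 2],
    "fare_amount": [9.0, 9.0, 15.5],
})

# Parse timestamps, drop exact duplicates, and add a surrogate key for the fact table
df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])
df["tpep_dropoff_datetime"] = pd.to_datetime(df["tpep_dropoff_datetime"])
df = df.drop_duplicates().reset_index(drop=True)
df["trip_id"] = df.index

# One example dimension: a datetime dimension derived from the pickup timestamp
datetime_dim = df[["tpep_pickup_datetime"]].copy()
datetime_dim["pickup_hour"] = datetime_dim["tpep_pickup_datetime"].dt.hour
datetime_dim["pickup_weekday"] = datetime_dim["tpep_pickup_datetime"].dt.weekday
datetime_dim["datetime_id"] = datetime_dim.index

print(len(df))  # 2 rows remain after duplicate removal
```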
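The Step 2 upload might look like the snippet below, using the official google-cloud-storage client. The bucket name and object path are placeholders, and authentication is assumed to be configured (e.g. via GOOGLE_APPLICATION_CREDENTIALS); the import is kept inside the function so the path helper is usable without the client library installed.

```python
def gcs_object_name(prefix: str, filename: str) -> str:
    """Build the destination object path, e.g. 'raw/uber_data.csv'."""
    return f"{prefix}/{filename}"

def upload_csv(bucket_name: str, local_path: str, object_name: str) -> None:
    # Requires the google-cloud-storage package and valid credentials
    from google.cloud import storage
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    bucket.blob(object_name).upload_from_filename(local_path)

if __name__ == "__main__":
    # Placeholder bucket name; uncomment to actually upload:
    # upload_csv("my-uber-bucket", "uber_data.csv", gcs_object_name("raw", "uber_data.csv"))
    print(gcs_object_name("raw", "uber_data.csv"))  # raw/uber_data.csv
```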
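The three Mage blocks in Step 3 could be wired together along these lines. This is a standalone sketch, not the project's pipeline code: inside Mage the decorators are injected from mage_ai.data_preparation.decorators, so no-op stand-ins are defined here, and the bucket, dataset, and dimension shown are illustrative placeholders.

```python
import pandas as pd

# No-op stand-ins; in Mage these come from mage_ai.data_preparation.decorators
def data_loader(fn): return fn
def transformer(fn): return fn
def data_exporter(fn): return fn

@data_loader
def load_from_gcs() -> pd.DataFrame:
    # Placeholder path; reading gs:// URLs with pandas requires gcsfs
    return pd.read_csv("gs://my-uber-bucket/raw/uber_data.csv")

@transformer
def transform(df: pd.DataFrame) -> dict:
    # Cleanse (remove duplicates) and split into one example dimension + fact table
    df = df.drop_duplicates().reset_index(drop=True)
    payment_type_dim = df[["payment_type"]].drop_duplicates().reset_index(drop=True)
    payment_type_dim["payment_type_id"] = payment_type_dim.index
    fact = df.merge(payment_type_dim, on="payment_type")
    return {
        "fact_table": fact.to_dict(orient="records"),
        "payment_type_dim": payment_type_dim.to_dict(orient="records"),
    }

@data_exporter
def export_to_bigquery(tables: dict) -> None:
    # Requires google-cloud-bigquery and credentials; placeholder dataset name
    from google.cloud import bigquery
    client = bigquery.Client()
    for name, rows in tables.items():
        client.load_table_from_dataframe(pd.DataFrame(rows), f"uber_dataset.{name}").result()
```

Each function maps to one Mage block, and Mage passes each block's return value to the next block as its input, mirroring the loader → transform → export flow described above.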
Here is the Looker Studio Dashboard link: Uber Dashboard

