End-to-end data pipeline project for Uber Taxi 🚕
In this project, I designed and implemented an end-to-end data pipeline that consists of several stages:
- Extracted data from the NYC TLC Trip Record Data website and loaded it into Google Cloud Storage for further processing.
- Transformed and modeled the data in Python on Jupyter Notebook, applying fact and dimension data modeling concepts.
- Orchestrated the ETL pipeline on Mage and loaded the transformed data into Google BigQuery.
- Developed a dashboard on Looker Studio to generate insights.
As a data engineering project, the emphasis is primarily on the engineering aspect, with a lesser emphasis on analytics and dashboard development.
This project uses the TLC Trip Record Data which includes fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
More info about the dataset can be found here:
- Website: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
- Data Dictionary: https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
- Raw Data: uber_data.csv
The following technologies are used to build this project:
- Language: Python, SQL
- Extraction and transformation: Jupyter Notebook, Google BigQuery
- Storage: Google Cloud Storage
- Orchestration: Mage
- Dashboard: Looker Studio
- Step 1: Cleaning and transformation: The transformation code, written in Python to bring the data into the desired state and tested locally, is in transformation-code.ipynb
- Step 2: Storage: Load the data into a Google Cloud Storage bucket. Methods for uploading data to a bucket are described in Uploading objects in Google Cloud Storage
- Step 3: ETL orchestration: Mage
  - Data Loader: This code block fetches the data from the Google Cloud Storage bucket and readies it for transformation. Its output is the input to the transform block.
  - Transform: This code block transforms the raw CSV data into a model of fact and dimension tables and cleanses the data, for example by removing duplicates. Its output is the input to the export block.
  - Export: This code block takes the output of the transform block and exports the tables to a Google BigQuery dataset for further analysis.
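The cleaning in Step 1 can be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names follow the TLC yellow-taxi data dictionary (`tpep_pickup_datetime`, etc.), and the tiny inline DataFrame stands in for uber_data.csv.

```python
import pandas as pd

# Tiny stand-in for uber_data.csv (note the duplicated first trip)
df = pd.DataFrame({
    "tpep_pickup_datetime":  ["2016-03-01 00:00:00", "2016-03-01 00:00:00", "2016-03-01 01:10:00"],
    "tpep_dropoff_datetime": ["2016-03-01 00:07:55", "2016-03-01 00:07:55", "2016-03-01 01:25:00"],
    "passenger_count": [1, 1, 2],
    "fare_amount": [9.0, 9.0, 15.5],
})

# Parse timestamps, drop exact duplicates, and add a surrogate key for the fact table
df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])
df["tpep_dropoff_datetime"] = pd.to_datetime(df["tpep_dropoff_datetime"])
df = df.drop_duplicates().reset_index(drop=True)
df["trip_id"] = df.index

# One example dimension: a datetime dimension derived from the pickup timestamp
datetime_dim = df[["tpep_pickup_datetime"]].copy()
datetime_dim["pickup_hour"] = datetime_dim["tpep_pickup_datetime"].dt.hour
datetime_dim["pickup_weekday"] = datetime_dim["tpep_pickup_datetime"].dt.weekday
datetime_dim["datetime_id"] = datetime_dim.index

print(len(df))  # 2 rows remain after duplicate removal
```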
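The Step 2 upload might look like the snippet below, using the official google-cloud-storage client. The bucket name and object path are placeholders, and authentication is assumed to be configured (e.g. via GOOGLE_APPLICATION_CREDENTIALS); the import is kept inside the function so the path helper is usable without the client library installed.

```python
def gcs_object_name(prefix: str, filename: str) -> str:
    """Build the destination object path, e.g. 'raw/uber_data.csv'."""
    return f"{prefix}/{filename}"

def upload_csv(bucket_name: str, local_path: str, object_name: str) -> None:
    # Requires the google-cloud-storage package and valid credentials
    from google.cloud import storage
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    bucket.blob(object_name).upload_from_filename(local_path)

if __name__ == "__main__":
    # Placeholder bucket name; uncomment to actually upload:
    # upload_csv("my-uber-bucket", "uber_data.csv", gcs_object_name("raw", "uber_data.csv"))
    print(gcs_object_name("raw", "uber_data.csv"))  # raw/uber_data.csv
```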
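The three Mage blocks in Step 3 could be wired together along these lines. This is a standalone sketch, not the project's pipeline code: inside Mage the decorators are injected from mage_ai.data_preparation.decorators, so no-op stand-ins are defined here, and the bucket, dataset, and dimension shown are illustrative placeholders.

```python
import pandas as pd

# No-op stand-ins; in Mage these come from mage_ai.data_preparation.decorators
def data_loader(fn): return fn
def transformer(fn): return fn
def data_exporter(fn): return fn

@data_loader
def load_from_gcs() -> pd.DataFrame:
    # Placeholder path; reading gs:// URLs with pandas requires gcsfs
    return pd.read_csv("gs://my-uber-bucket/raw/uber_data.csv")

@transformer
def transform(df: pd.DataFrame) -> dict:
    # Cleanse (remove duplicates) and split into one example dimension + fact table
    df = df.drop_duplicates().reset_index(drop=True)
    payment_type_dim = df[["payment_type"]].drop_duplicates().reset_index(drop=True)
    payment_type_dim["payment_type_id"] = payment_type_dim.index
    fact = df.merge(payment_type_dim, on="payment_type")
    return {
        "fact_table": fact.to_dict(orient="records"),
        "payment_type_dim": payment_type_dim.to_dict(orient="records"),
    }

@data_exporter
def export_to_bigquery(tables: dict) -> None:
    # Requires google-cloud-bigquery and credentials; placeholder dataset name
    from google.cloud import bigquery
    client = bigquery.Client()
    for name, rows in tables.items():
        client.load_table_from_dataframe(pd.DataFrame(rows), f"uber_dataset.{name}").result()
```

Each function maps to one Mage block, and Mage passes each block's return value to the next block as its input, mirroring the loader → transform → export flow described above.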
Here is the Looker Studio Dashboard link: Uber Dashboard

