Pushpen-Joshi/project-uber-taxi

Modern Data Engineering Project 🚗

End-to-end data pipeline project for Uber Taxi 🚕

Objective

In this project, I designed and implemented an end-to-end data pipeline that consists of several stages:

  1. Extracted data from the NYC Trip Record Data website and loaded it into Google Cloud Storage for further processing.
  2. Transformed and modeled the data with fact and dimension tables, using Python in a Jupyter Notebook.
  3. Orchestrated the ETL pipeline on Mage and loaded the transformed data into Google BigQuery.
  4. Developed a dashboard on Looker Studio to generate insights.

Because this is a data engineering project, the emphasis is on the engineering, with less attention given to analytics and dashboard development.

Dataset

This project uses the TLC Trip Record Data, which includes fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
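To make the field list concrete, here is a minimal sketch of parsing a few of those fields into typed records. It assumes the yellow-taxi CSV layout (column names such as `tpep_pickup_datetime` and `RatecodeID`); the `Trip` class, the `parse_trips` helper, and the two sample rows are hypothetical and are not taken from this repository.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime

# Assumed column names (yellow-taxi layout); the rows are made-up sample data.
SAMPLE_CSV = """VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,payment_type,fare_amount,tip_amount,total_amount
1,2016-03-01 00:00:00,2016-03-01 00:07:55,1,2.50,1,1,9.0,2.05,12.35
2,2016-03-01 00:05:10,2016-03-01 00:25:40,2,6.10,1,2,21.5,0.0,22.8
"""

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


@dataclass(frozen=True)
class Trip:
    """Hypothetical typed view of one trip record."""
    pickup: datetime
    dropoff: datetime
    passenger_count: int
    trip_distance: float
    rate_code_id: int
    payment_type: int
    fare_amount: float
    total_amount: float


def parse_trips(text):
    """Parse CSV text into Trip records, converting strings to typed values."""
    trips = []
    for row in csv.DictReader(io.StringIO(text)):
        trips.append(
            Trip(
                pickup=datetime.strptime(row["tpep_pickup_datetime"], TIMESTAMP_FORMAT),
                dropoff=datetime.strptime(row["tpep_dropoff_datetime"], TIMESTAMP_FORMAT),
                passenger_count=int(row["passenger_count"]),
                trip_distance=float(row["trip_distance"]),
                rate_code_id=int(row["RatecodeID"]),
                payment_type=int(row["payment_type"]),
                fare_amount=float(row["fare_amount"]),
                total_amount=float(row["total_amount"]),
            )
        )
    return trips


trips = parse_trips(SAMPLE_CSV)
```

Converting the timestamps and numeric fields up front keeps later steps, such as computing trip duration as `dropoff - pickup`, free of string handling.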

More info about the dataset can be found here:

The following technologies are used to build this project:

Data Pipeline Architecture

image

Steps Involved

  • Step 1: Cleaning and transformation: The transformation code, written in Python, brings the data into the desired state and can be tested locally: transformation-code.ipynb
  • Step 2: Storage: Load the data into a Google Cloud Storage bucket. Methods for uploading data to a bucket are described at the following link: Uploading objects in Google Cloud Storage
  • Step 3: ETL Orchestration - Mage
    • Data Loader: This code block fetches the data from the Google Cloud Storage bucket and prepares it for the transformation step. Its output is the input to the transform block.
    • Transform: This code block converts the raw CSV data into a fact and dimension table model and cleans the data, for example by removing duplicates. Its output is the input to the export block.
    • Export: This code block takes the output of the transform block and writes it to a Google BigQuery dataset for further analysis.
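The loader, transform, and export chain described above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's actual Mage blocks: it uses only the standard library, reads an in-memory CSV instead of a bucket, and counts rows instead of writing to BigQuery. The function names, table names, and sample rows are all hypothetical.

```python
import csv
import io

# Made-up sample data; the second row is an exact duplicate of the first.
RAW_CSV = """VendorID,tpep_pickup_datetime,passenger_count,trip_distance,payment_type,fare_amount
1,2016-03-01 00:00:00,1,2.5,1,9.0
1,2016-03-01 00:00:00,1,2.5,1,9.0
2,2016-03-01 00:05:10,2,6.1,2,21.5
2,2016-03-01 00:09:30,1,2.5,1,10.0
"""


def load_data(text):
    """Loader stand-in: parse raw CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def build_dimension(rows, column, id_name):
    """Assign a surrogate key to each distinct value, in first-seen order."""
    keys = {}
    for row in rows:
        keys.setdefault(row[column], len(keys))
    table = [{id_name: key, column: value} for value, key in keys.items()]
    return keys, table


def transform(rows):
    """Drop exact duplicates, then split rows into dimension and fact tables."""
    seen = set()
    unique = []
    for row in rows:
        fingerprint = tuple(row.items())
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)

    payment_keys, payment_dim = build_dimension(
        unique, "payment_type", "payment_type_id")
    distance_keys, distance_dim = build_dimension(
        unique, "trip_distance", "trip_distance_id")

    # The fact table keeps measures plus foreign keys into the dimensions.
    fact = [
        {
            "trip_id": i,
            "VendorID": int(row["VendorID"]),
            "payment_type_id": payment_keys[row["payment_type"]],
            "trip_distance_id": distance_keys[row["trip_distance"]],
            "fare_amount": float(row["fare_amount"]),
        }
        for i, row in enumerate(unique)
    ]
    return {
        "fact_table": fact,
        "payment_type_dim": payment_dim,
        "trip_distance_dim": distance_dim,
    }


def export(tables):
    """Exporter stand-in: report row counts instead of writing to BigQuery."""
    return {name: len(rows) for name, rows in tables.items()}


tables = transform(load_data(RAW_CSV))
summary = export(tables)
```

Each function consumes the previous function's return value, mirroring how each block's output feeds the next block in the pipeline. Removing duplicates before assigning surrogate keys keeps `trip_id` values contiguous in the fact table.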

Data Model

image

Dashboard

Here is the Looker Studio Dashboard link: Uber Dashboard

image image image image
