GCP-Data-Engineering-Project

Overview :

This project builds a pipeline that extracts orders and order items data from a relational database management system (RDBMS), processes it to compute daily product revenue, and loads the results into BigQuery tables. The primary objective is to make the processed data accessible to Business Analytics (BA) developers so they can build the required reports and dashboards.

Requirements :

Data Extraction: Retrieve relevant data from the RDBMS, specifically focusing on orders and order items.
Data Processing: Perform computation tasks to derive daily product revenue from the extracted data.
Data Loading: Incorporate the pre-processed daily product revenue data into designated BigQuery tables.
Accessibility: Ensure that BA developers have the necessary access and permissions to utilize the data stored in BigQuery for their reporting and dashboard development needs.

Usage:

Data Extraction: Utilize appropriate methods to query the RDBMS and extract the required data related to orders and order items.
Data Processing: Implement computational logic to process the extracted data and compute daily product revenue.
Data Loading: Establish a mechanism to load the pre-processed daily product revenue data into the designated BigQuery tables.
Report and Dashboard Development: BA developers can leverage the data stored in BigQuery to generate the necessary reports and dashboards as per the project requirements.
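
The extraction and processing steps above can be sketched as a single SQL query that joins orders to order items and sums revenue per day and product. This is only an illustration: it uses an in-memory SQLite database as a stand-in for the project's RDBMS, and the table names, column names, sample rows, and the filter on order status are all assumptions rather than the project's actual schema.

```python
import sqlite3

# Hypothetical schema; the real tables live in the project's RDBMS.
SCHEMA = """
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    order_date TEXT,
    order_status TEXT
);
CREATE TABLE order_items (
    order_item_id INTEGER PRIMARY KEY,
    order_item_order_id INTEGER,
    order_item_product_id INTEGER,
    order_item_subtotal REAL
);
"""

# Revenue per day and product. Counting only COMPLETE and CLOSED orders
# is an assumption made for this sketch.
DAILY_PRODUCT_REVENUE_SQL = """
SELECT o.order_date,
       oi.order_item_product_id,
       ROUND(SUM(oi.order_item_subtotal), 2) AS revenue
FROM orders AS o
JOIN order_items AS oi ON oi.order_item_order_id = o.order_id
WHERE o.order_status IN ('COMPLETE', 'CLOSED')
GROUP BY o.order_date, oi.order_item_product_id
ORDER BY o.order_date, revenue DESC
"""


def daily_product_revenue(conn):
    """Return (order_date, product_id, revenue) rows from an open connection."""
    return conn.execute(DAILY_PRODUCT_REVENUE_SQL).fetchall()


def build_sample_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            (1, "2013-07-25", "COMPLETE"),
            (2, "2013-07-25", "CANCELED"),
            (3, "2013-07-26", "CLOSED"),
        ],
    )
    conn.executemany(
        "INSERT INTO order_items VALUES (?, ?, ?, ?)",
        [
            (1, 1, 957, 299.98),
            (2, 1, 1073, 199.99),
            (3, 1, 957, 299.98),
            (4, 2, 502, 250.00),
            (5, 3, 502, 150.00),
        ],
    )
    return conn


if __name__ == "__main__":
    for row in daily_product_revenue(build_sample_db()):
        print(row)
```

The same query shape should carry over to Postgres or another RDBMS with little change, since it relies only on a join, a filter, and a grouped sum.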

JSON files containing orders and order items data are uploaded to Google Cloud Storage (GCS), which serves as our data lake. These JSON files are first converted to Parquet format. The Parquet files are then exposed as views or tables so that daily product revenue can be computed efficiently. Once computed, the daily product revenue results are written back to GCS in Parquet format.
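
A minimal PySpark sketch of this flow is shown below, assuming an existing `SparkSession` is passed in as `spark`. The function names, the column names in the query, and any paths passed to these functions are hypothetical and would need to match the real data.

```python
# Hypothetical column names; adjust to the actual JSON schema.
REVENUE_QUERY = """
SELECT o.order_date,
       oi.order_item_product_id,
       ROUND(SUM(oi.order_item_subtotal), 2) AS revenue
FROM orders AS o
JOIN order_items AS oi ON oi.order_item_order_id = o.order_id
GROUP BY o.order_date, oi.order_item_product_id
"""


def json_to_parquet(spark, source_path, target_path):
    """Read JSON files from source_path and rewrite them as Parquet."""
    spark.read.json(source_path).write.mode("overwrite").parquet(target_path)


def compute_daily_product_revenue(spark, orders_path, order_items_path, target_path):
    """Register the Parquet data as temporary views, run the revenue query,
    and save the result as Parquet under target_path."""
    spark.read.parquet(orders_path).createOrReplaceTempView("orders")
    spark.read.parquet(order_items_path).createOrReplaceTempView("order_items")
    result = spark.sql(REVENUE_QUERY)
    result.write.mode("overwrite").parquet(target_path)
    return result
```

On a cluster, these functions would be called with `gs://` paths pointing at the data lake bucket. Temporary views are used here for brevity; persistent tables would be an equally valid reading of "views or tables".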

We plan to utilize Spark code to read the data from Parquet files and load it into BigQuery for further analysis and utilization.
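
One way this load step might look is sketched below, assuming the spark-bigquery connector is available on the cluster and that `spark` is an existing `SparkSession`. The function name and all argument values are hypothetical. The connector stages data in a GCS bucket before loading it, which is what the `temporaryGcsBucket` option is for.

```python
def load_parquet_to_bigquery(spark, parquet_path, table, temp_bucket):
    """Read Parquet files and write them to a BigQuery table.

    table is expected in "dataset.table" form; temp_bucket is the name of a
    GCS bucket the connector can use for staging.
    """
    df = spark.read.parquet(parquet_path)
    (
        df.write.format("bigquery")
        .option("temporaryGcsBucket", temp_bucket)
        .mode("overwrite")
        .save(table)
    )
    return df
```

Using `overwrite` replaces the table on every run; `append` may be a better fit if each run loads only new dates.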

Additional Considerations :

Ensure that error handling mechanisms are in place to address any issues encountered during the extraction, processing, or loading phases of the pipeline. Implement logging to support troubleshooting and monitoring of pipeline activity. Pipeline execution is automated by scheduling it to run at regular intervals with a tool such as cron or Apache Airflow.
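
As an illustration of the error handling and logging described above, the sketch below wraps a pipeline step in a small retry helper built on the standard `logging` module. The helper name `run_step`, the logger name, and the retry defaults are assumptions for this example, not part of the project.

```python
import logging
import time

logger = logging.getLogger("daily_product_revenue_pipeline")


def run_step(name, func, retries=3, delay_seconds=0.0):
    """Run func(), logging each attempt and retrying on any exception.

    Re-raises the last exception once all retries are used up.
    """
    for attempt in range(1, retries + 1):
        try:
            logger.info("step=%s attempt=%d started", name, attempt)
            result = func()
            logger.info("step=%s attempt=%d succeeded", name, attempt)
            return result
        except Exception:
            # logger.exception records the traceback along with the message.
            logger.exception("step=%s attempt=%d failed", name, attempt)
            if attempt == retries:
                raise
            time.sleep(delay_seconds)
```

A step would be wrapped as `run_step("extract", extract_orders)`, where `extract_orders` is whatever callable performs that phase. A scheduler such as Airflow has its own retry settings, so a helper like this is mostly useful for cron-driven runs.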

Data Model :

This overview outlines the project's objectives, requirements, and usage guidelines.

Repository contents :

apps
data
notebooks
scripts
01 Getting Started with Data Engineering on GCP.md
02 Setting up Data Lake using GCS.md
04 Setup Postgres Database using GCP Cloud SQL.md
05 Data Warehouse using Google Big Query.md
06 Data Processing using Google Cloud Functions.md
07 Big Data Processing using Google Dataproc.md
08 Big Data Processing using Databricks on GCP.md
09 Integration of Dataproc and Google BigQuery.md
12 Data Pipeline Orchestration using Google Cloud Composer.md
CURRICULUM.md
README.md
