Skip to content

muhammadrauhan/End-to-End-Data-Engineering-Project

Repository files navigation

Netflix Data Cleaning and Analysis Project | End to End ELT Project (SQL + Python)

An end-to-end Data Engineering project on Netflix dataset implementing ELT (Extract, Load, Transform) using PostgreSQL and Python. It focuses on data cleaning, data manipulation, transformation, and analysis to derive meaningful insights.

Introduction

This project demonstrates an End-to-End Data Engineering pipeline on the Netflix dataset using Python and PostgreSQL. It covers the complete ELT (Extract, Load, Transform) process, including data extraction, cleaning (handling foreign characters, removing duplicates, filling missing values, type conversions), and transformation (creating tables, dropping unnecessary columns). Using advanced SQL techniques such as Joins, Aggregate, Window, and Built-in Functions, the project delivers meaningful data analysis and serves as a practical reference for ETL workflows, data cleaning, and real-world database management.

Workflow / ELT Pipeline

E2E ELT Project

📊 Insights Required

The goal of this project is to analyze the Netflix dataset and extract meaningful business insights by answering the following key questions:

  • Count the number of movies and TV shows created by each director (in separate columns).
  • Identify the country with the highest number of comedy movies.
  • For each year (based on date added), find the director with the maximum number of movies released.
  • Calculate the average duration of movies in each genre.
  • List the directors who created both horror and comedy movies, along with the count of each.

Note

If anyone wants to see Netflix Data Analysis, All data analysis questions are present at the bottom of the SQL script. You can view by clicking on this Netflix Data Analysis.

🔑 Key Features / Highlights

  • End-to-End ELT Workflow: Implemented Extract, Load, and Transform pipeline using Python and PostgreSQL.
  • Database Integration: Established secure connections and executed queries directly from Python.
  • Data Cleaning: Removed duplicates, handled foreign characters, populated missing values, and standardized null entries.
  • Schema Design & Transformation: Created new tables, converted data types, and dropped unnecessary columns for a clean dataset.
  • Advanced SQL Operations: Applied DDL, DML, and DQL queries including Joins, Aggregate Functions, Window Functions, and Built-in Functions.
  • Data Analysis: Conducted meaningful insights on Netflix dataset using optimized SQL queries.
  • Practical Use Case: Demonstrates real-world data engineering and data analysis best practices with open-source tools.

🛠️ Data Preprocessing & Transformation Steps

Here I explain the data preprocessing and transformation process applied to the Netflix dataset such as schema setup, restructuring tables, column modifications, handling missing values and duplicates, supported with screencshots for better understanding.

  • Data Check

    data check

    In this above mentioned picture, We can clearly see in the title column the value was ambiguous because of the unstructured schema I got and datatype. I handled this by transforming the schema which you can see in the picture below.

  • Schema Setup

    Schema before
    Raw Schema
    schema after
    Transformed Schema
  • Checking Duplicates in Raw Table

    There were some duplicates in netflix_raw table. I checked for title column. So I solved the duplicates issue using CTE (Common Tbale Expression).

    final-result-duplicates
  • Handling Missing Values

    There were some missing values in netflix_raw table.

    • Handled Missing Values in country:

      I handled country column which I had inserted in new country column called netflix_country for further analysis.

      populate-na-countries
    • Handled Missing Values in duration column:

      I handled missing values column which is duration table, data type coversion for date_added column and drop director, listed_in, country and cast columns for final netflix table.

      final-netflix
  • Creation of New Tables

    new-tables

About

An End to End Data Engineering Project, specifically implementing ELT (Extract, Load, Transform) in which I had used Netflix dataset to clean and analyze the data using SQL and Python.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages