Netflix Data Cleaning and Analysis Project | End to End ELT Project (SQL + Python)

An end-to-end Data Engineering project on Netflix dataset implementing ELT (Extract, Load, Transform) using PostgreSQL and Python. It focuses on data cleaning, data manipulation, transformation, and analysis to derive meaningful insights.

Introduction

This project demonstrates an End-to-End Data Engineering pipeline on the Netflix dataset using Python and PostgreSQL. It covers the complete ELT (Extract, Load, Transform) process, including data extraction, cleaning (handling foreign characters, removing duplicates, filling missing values, type conversions), and transformation (creating tables, dropping unnecessary columns). Using advanced SQL techniques such as Joins, Aggregate, Window, and Built-in Functions, the project delivers meaningful data analysis and serves as a practical reference for ETL workflows, data cleaning, and real-world database management.

Workflow / ELT Pipeline

📊 Insights Required

The goal of this project is to analyze the Netflix dataset and extract meaningful business insights by answering the following key questions:

Count the number of movies and TV shows created by each director (in separate columns).
Identify the country with the highest number of comedy movies.
For each year (based on date added), find the director with the maximum number of movies released.
Calculate the average duration of movies in each genre.
List the directors who created both horror and comedy movies, along with the count of each.

Note

If anyone wants to see Netflix Data Analysis, All data analysis questions are present at the bottom of the SQL script. You can view by clicking on this Netflix Data Analysis.

🔑 Key Features / Highlights

End-to-End ELT Workflow: Implemented Extract, Load, and Transform pipeline using Python and PostgreSQL.
Database Integration: Established secure connections and executed queries directly from Python.
Data Cleaning: Removed duplicates, handled foreign characters, populated missing values, and standardized null entries.
Schema Design & Transformation: Created new tables, converted data types, and dropped unnecessary columns for a clean dataset.
Advanced SQL Operations: Applied DDL, DML, and DQL queries including Joins, Aggregate Functions, Window Functions, and Built-in Functions.
Data Analysis: Conducted meaningful insights on Netflix dataset using optimized SQL queries.
Practical Use Case: Demonstrates real-world data engineering and data analysis best practices with open-source tools.

🛠️ Data Preprocessing & Transformation Steps

Here I explain the data preprocessing and transformation process applied to the Netflix dataset such as schema setup, restructuring tables, column modifications, handling missing values and duplicates, supported with screencshots for better understanding.

Data Check

In this above mentioned picture, We can clearly see in the title column the value was ambiguous because of the unstructured schema I got and datatype. I handled this by transforming the schema which you can see in the picture below.
Schema Setup

Raw Schema

Transformed Schema
Checking Duplicates in Raw Table

There were some duplicates in netflix_raw table. I checked for title column. So I solved the duplicates issue using CTE (Common Tbale Expression).
Handling Missing Values

There were some missing values in netflix_raw table.
- Handled Missing Values in country:
  
  I handled country column which I had inserted in new country column called netflix_country for further analysis.
- Handled Missing Values in duration column:
  
  I handled missing values column which is duration table, data type coversion for date_added column and drop director, listed_in, country and cast columns for final netflix table.
Creation of New Tables

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
LICENSE		LICENSE
README.md		README.md
data_analysis_dml_ddl_dql_queries.sql		data_analysis_dml_ddl_dql_queries.sql
netflix_data_extraction.py		netflix_data_extraction.py
netflix_raw_table_queries.sql		netflix_raw_table_queries.sql
netflix_titles.csv		netflix_titles.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Netflix Data Cleaning and Analysis Project | End to End ELT Project (SQL + Python)

Introduction

Workflow / ELT Pipeline

📊 Insights Required

🔑 Key Features / Highlights

🛠️ Data Preprocessing & Transformation Steps

Data Check

Schema Setup

Checking Duplicates in Raw Table

Handling Missing Values

Handled Missing Values in `country`:

Handled Missing Values in `duration` column:

Creation of New Tables

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Netflix Data Cleaning and Analysis Project | End to End ELT Project (SQL + Python)

Introduction

Workflow / ELT Pipeline

📊 Insights Required

🔑 Key Features / Highlights

🛠️ Data Preprocessing & Transformation Steps

Data Check

Schema Setup

Checking Duplicates in Raw Table

Handling Missing Values

Handled Missing Values in country:

Handled Missing Values in duration column:

Creation of New Tables

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Handled Missing Values in `country`:

Handled Missing Values in `duration` column:

Packages