An end-to-end data engineering project on the Netflix dataset implementing ELT (Extract, Load, Transform) using PostgreSQL and Python. It focuses on data cleaning, manipulation, transformation, and analysis to derive meaningful insights.
This project demonstrates an end-to-end data engineering pipeline on the Netflix dataset using Python and PostgreSQL. It covers the complete ELT (Extract, Load, Transform) process, including data extraction, cleaning (handling foreign characters, removing duplicates, filling missing values, type conversions), and transformation (creating tables, dropping unnecessary columns). Using advanced SQL techniques such as joins, aggregate functions, window functions, and built-in functions, the project delivers meaningful data analysis and serves as a practical reference for ELT workflows, data cleaning, and real-world database management.
The goal of this project is to analyze the Netflix dataset and extract meaningful business insights by answering the following key questions:
- Count the number of movies and TV shows created by each director (in separate columns).
- Identify the country with the highest number of comedy movies.
- For each year (based on date added), find the director with the maximum number of movies released.
- Calculate the average duration of movies in each genre.
- List the directors who created both horror and comedy movies, along with the count of each.
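As an illustration, the first question (movie and TV show counts per director in separate columns) is typically answered with conditional aggregation. Below is a minimal sketch; it uses an in-memory SQLite database as a stand-in for PostgreSQL, and the tiny sample rows are made up for demonstration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL database
conn.execute("CREATE TABLE netflix (director TEXT, type TEXT)")
conn.executemany(
    "INSERT INTO netflix VALUES (?, ?)",
    [("Martin Scorsese", "Movie"),
     ("Martin Scorsese", "Movie"),
     ("Martin Scorsese", "TV Show"),
     ("Greta Gerwig", "Movie")],
)

# One row per director, with movie and TV show counts in separate columns.
query = """
SELECT director,
       SUM(CASE WHEN type = 'Movie'   THEN 1 ELSE 0 END) AS movie_count,
       SUM(CASE WHEN type = 'TV Show' THEN 1 ELSE 0 END) AS tv_show_count
FROM netflix
GROUP BY director
ORDER BY director
"""
for row in conn.execute(query):
    print(row)  # ('Greta Gerwig', 1, 0) then ('Martin Scorsese', 2, 1)
```

The same `SUM(CASE WHEN ...)` pattern runs unchanged in PostgreSQL, where `COUNT(*) FILTER (WHERE ...)` is an equivalent alternative.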
Note
All data analysis questions are answered at the bottom of the SQL script; you can view them by opening the Netflix Data Analysis script.
- End-to-End ELT Workflow: Implemented Extract, Load, and Transform pipeline using Python and PostgreSQL.
- Database Integration: Established secure connections and executed queries directly from Python.
- Data Cleaning: Removed duplicates, handled foreign characters, populated missing values, and standardized null entries.
- Schema Design & Transformation: Created new tables, converted data types, and dropped unnecessary columns for a clean dataset.
- Advanced SQL Operations: Applied DDL, DML, and DQL queries including Joins, Aggregate Functions, Window Functions, and Built-in Functions.
- Data Analysis: Conducted meaningful insights on Netflix dataset using optimized SQL queries.
- Practical Use Case: Demonstrates real-world data engineering and data analysis best practices with open-source tools.
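The Extract and Load stages above can be sketched as follows. This is a simplified illustration, not the project's actual loader: it reads an inline CSV sample instead of the real `netflix_titles.csv`, and uses an in-memory SQLite database as a stand-in for PostgreSQL (the real project would connect with a driver such as psycopg2):

```python
import csv
import io
import sqlite3

# Extract: read the raw CSV (an inline two-row sample here).
raw_csv = io.StringIO(
    "show_id,type,title\n"
    "s1,Movie,Inception\n"
    "s2,TV Show,Dark\n"
)
rows = list(csv.DictReader(raw_csv))

# Load: insert rows as-is into a raw staging table; in ELT, the
# transformation step then happens inside the database with SQL.
conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
conn.execute("CREATE TABLE netflix_raw (show_id TEXT, type TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO netflix_raw VALUES (:show_id, :type, :title)", rows
)
print(conn.execute("SELECT COUNT(*) FROM netflix_raw").fetchone()[0])  # 2
```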
Here I explain the data preprocessing and transformation process applied to the Netflix dataset, including schema setup, restructuring tables, column modifications, and handling missing values and duplicates, supported with screenshots for better understanding.
- In the screenshot above, the values in the `title` column were ambiguous because of the unstructured schema and data types of the raw import. I handled this by transforming the schema, as shown in the screenshot below.
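Garbled foreign characters in text columns are often mojibake, i.e. UTF-8 bytes that were decoded with the wrong encoding. Assuming that common UTF-8/Latin-1 mix-up is the cause here (the actual corruption in the screenshots may differ), one standard repair looks like this:

```python
# A title whose UTF-8 bytes were mistakenly decoded as Latin-1 ("mojibake").
garbled = "CafÃ©"  # what 'Café' looks like after the wrong decode

# Reverse the wrong decode, then decode the original bytes correctly as UTF-8.
fixed = garbled.encode("latin-1").decode("utf-8")
print(fixed)  # Café
```

In practice the cleaner fix is to re-read the source file with the correct `encoding=` in the first place; the round-trip above is a salvage step for data already loaded.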
- There were some duplicate rows in the `netflix_raw` table, which I detected by checking the `title` column. I resolved the duplicates using a CTE (Common Table Expression).
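A CTE-based de-duplication typically ranks rows per title with `ROW_NUMBER()` and deletes everything past the first. The sketch below runs on in-memory SQLite as a stand-in for PostgreSQL (SQLite addresses rows by `rowid`; PostgreSQL would use a unique key or `ctid` instead), with made-up sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
conn.execute("CREATE TABLE netflix_raw (show_id TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO netflix_raw VALUES (?, ?)",
    [("s1", "Dark"), ("s2", "Dark"), ("s3", "Inception")],
)

# Rank duplicates per title in a CTE, keep the first row, delete the rest.
conn.execute("""
WITH ranked AS (
    SELECT rowid AS rid,
           ROW_NUMBER() OVER (PARTITION BY title ORDER BY show_id) AS rn
    FROM netflix_raw
)
DELETE FROM netflix_raw
WHERE rowid IN (SELECT rid FROM ranked WHERE rn > 1)
""")
print(conn.execute("SELECT COUNT(*) FROM netflix_raw").fetchone()[0])  # 2
```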
- There were some missing values in the `netflix_raw` table.
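Missing values per column can be profiled in a single scan, since SQL's `COUNT(column)` skips NULLs while `COUNT(*)` does not. A minimal sketch (in-memory SQLite as a stand-in for PostgreSQL, sample rows and column names assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
conn.execute("CREATE TABLE netflix_raw (title TEXT, country TEXT, duration TEXT)")
conn.executemany(
    "INSERT INTO netflix_raw VALUES (?, ?, ?)",
    [("Dark", "Germany", "3 Seasons"),
     ("Inception", None, "148 min"),
     ("Roma", "Mexico", None)],
)

# COUNT(*) counts all rows; COUNT(col) counts only non-NULL values,
# so the difference is the number of missing values in that column.
nulls = conn.execute("""
SELECT COUNT(*) - COUNT(country)  AS missing_country,
       COUNT(*) - COUNT(duration) AS missing_duration
FROM netflix_raw
""").fetchone()
print(nulls)  # (1, 1)
```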
- I handled the `country` column by inserting its values into a new table called `netflix_country` for further analysis.
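The raw `country` field often holds a comma-separated list, so a dedicated table with one row per (show, country) pair makes country-level analysis possible. A Python-side sketch of that split (sample rows assumed; in PostgreSQL the same result can be produced in SQL with `unnest(string_to_array(country, ','))`):

```python
# Split multi-valued country strings into one row per (show_id, country),
# mirroring the rows the project stores in the netflix_country table.
raw = [("s1", "United States, India"), ("s2", "Germany"), ("s3", None)]

netflix_country = [
    (show_id, c.strip())
    for show_id, countries in raw
    if countries  # skip rows where country is missing
    for c in countries.split(",")
]
print(netflix_country)
# [('s1', 'United States'), ('s1', 'India'), ('s2', 'Germany')]
```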
- I handled missing values in the `duration` column, converted the data type of the `date_added` column, and dropped the `director`, `listed_in`, `country`, and `cast` columns to produce the final `netflix` table.
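The last two fixes can be sketched in a few lines. This is an illustration, not the project's exact logic: the sample rows are made up, the raw dataset's `"Month D, YYYY"` date format is assumed, and the `"Unknown"` placeholder for missing durations is a stand-in for whatever fill strategy the project actually uses:

```python
from datetime import datetime

rows = [
    {"title": "Dark", "date_added": "July 1, 2017", "duration": "3 Seasons"},
    {"title": "Roma", "date_added": "December 14, 2018", "duration": None},
]

for row in rows:
    # Type conversion: parse the text date into a real date value,
    # assuming the raw "Month D, YYYY" format.
    row["date_added"] = datetime.strptime(row["date_added"], "%B %d, %Y").date()
    # Missing-value handling: substitute a placeholder where duration is NULL.
    if row["duration"] is None:
        row["duration"] = "Unknown"

print(rows[1])
```

In PostgreSQL itself, the equivalent conversion would be `ALTER TABLE ... ALTER COLUMN date_added TYPE date USING date_added::date` (after cleaning), with `COALESCE` or an `UPDATE` to fill the missing durations.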