A project for learning data engineering as a professional
Python libraries for data engineering:
- pandas (data manipulation)
- SQLAlchemy (Python SQL toolkit)
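A first taste of the pandas side of this stack: filtering, deriving a column, and aggregating. The data and the 0.9 conversion rate are made up for illustration.

```python
import pandas as pd

# Hypothetical sales data (invented for illustration)
df = pd.DataFrame({
    "region": ["north", "south", "north"],
    "amount": [100.0, 250.0, 175.0],
})

# Typical manipulations: derive a column, then aggregate
df["amount_eur"] = df["amount"] * 0.9          # assumed conversion rate
totals = df.groupby("region")["amount"].sum()
print(totals.to_dict())  # {'north': 275.0, 'south': 250.0}
```

SQLAlchemy enters the picture once these frames need to be read from or written to a database (see the ETL sketch further down).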
SQL deep dive:
- Complex queries, window functions, joins
- Performance tuning
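Window functions are the centerpiece of the "complex queries" item, and they can be practiced without any server: SQLite (bundled with Python, window support since SQLite 3.25) runs the same `OVER (PARTITION BY ...)` syntax as Postgres. The table and values below are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("a", 10), ("a", 30), ("b", 20)])

# Rank each customer's orders by amount and compute a per-customer total,
# all in one pass -- no self-join needed
rows = con.execute("""
    SELECT customer, amount,
           ROW_NUMBER() OVER (PARTITION BY customer ORDER BY amount DESC) AS rn,
           SUM(amount)  OVER (PARTITION BY customer) AS customer_total
    FROM orders
    ORDER BY customer, rn
""").fetchall()
print(rows)
# [('a', 30.0, 1, 40.0), ('a', 10.0, 2, 40.0), ('b', 20.0, 1, 20.0)]
```

The same query runs unchanged on Postgres, which is where the performance-tuning work (indexes, `EXPLAIN ANALYZE`) becomes relevant.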
ETL concepts:
- Build simple pipelines
- Write ETL scripts that extract data from CSV files and JSON APIs
- Transform data with pandas
- Load the data into a local Postgres database
- Learn and write complex SQL queries to prepare datasets
Resources:
- Python for Data Analysis by Wes McKinney (focus on pandas)
- Mode Analytics SQL Tutorial
- Intro to ETL with Python and SQL (many tutorials on YouTube)
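The extract → transform → load loop above can be sketched end to end. SQLite stands in for the local Postgres target so the example runs anywhere; with Postgres you would swap the connection for a SQLAlchemy engine (e.g. `create_engine("postgresql://...")`). The CSV payload is invented.

```python
import io
import sqlite3
import pandas as pd

# Extract: pretend this CSV arrived from a file or an API response
raw_csv = io.StringIO("id,city,temp_f\n1,oslo,41\n2,cairo,95\n")
df = pd.read_csv(raw_csv)

# Transform: derive Celsius, keep only the columns downstream needs
df["temp_c"] = ((df["temp_f"] - 32) * 5 / 9).round(1)
out = df[["city", "temp_c"]]

# Load: SQLite stands in for the local Postgres database
con = sqlite3.connect(":memory:")
out.to_sql("weather", con, index=False)

loaded = con.execute("SELECT city, temp_c FROM weather ORDER BY city").fetchall()
print(loaded)  # [('cairo', 35.0), ('oslo', 5.0)]
```

Keeping extract, transform, and load as separate steps (rather than one tangled script) is what makes the later jump to Spark and Airflow straightforward.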
Spark and Airflow:
- Apache Spark fundamentals (PySpark preferred)
- Build batch data processing jobs
- Apache Airflow basics: DAGs, operators, scheduling
- Set up Airflow locally or in Docker
- Build a Spark job to process a medium-size public dataset (e.g., NYC Taxi Trips, Kaggle datasets)
- Build an Airflow DAG to run your Spark job on schedule and track success/failure
Resources:
- Databricks free courses on Apache Spark
- Airflow official tutorial
- Hands-on projects from GitHub repos for Spark + Airflow integration
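A minimal DAG for the "run your Spark job on schedule" deliverable might look like the sketch below, assuming Airflow 2.4+ is installed; the `dag_id`, task name, and script path are all placeholders, not anything this repo defines. Airflow tracks each run's success or failure in its UI automatically, and `retries` gives transient failures a second chance.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG: submit a Spark batch job once a day
with DAG(
    dag_id="nyc_taxi_batch",                  # made-up name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    run_spark_job = BashOperator(
        task_id="spark_submit",
        # Placeholder path -- point this at your own PySpark script
        bash_command="spark-submit /opt/jobs/process_taxi_trips.py",
    )
```

Dropping this file into Airflow's `dags/` folder is all the registration needed; the scheduler picks it up on its next scan.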
Cloud data engineering (AWS):
- AWS Glue (serverless ETL)
- AWS Redshift (data warehouse)
- AWS Kinesis basics or Apache Kafka (open-source alternative)
- Build real-time data ingestion and processing pipelines
- Create an ETL job in AWS Glue that extracts from S3 and loads into Redshift
- Build a Kafka producer and consumer app in Python or Java
- Set up a simple streaming pipeline to process data in real-time (Kafka → Spark Streaming or Kinesis Data Analytics)
Resources:
- AWS Glue tutorial
- Confluent Kafka tutorials
- Kafka + Spark Streaming sample projects on GitHub
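The produce/consume loop at the heart of a Kafka app can be rehearsed with Python's stdlib before touching a broker. This is only a pattern sketch, not a Kafka client: the `queue.Queue` stands in for a topic, `put` for `producer.send`, and the consumer thread for a poll loop (with kafka-python you would swap in `KafkaProducer`/`KafkaConsumer`). Events and the doubling transform are invented.

```python
import json
import queue
import threading

topic = queue.Queue()   # stand-in for a Kafka topic
SENTINEL = None         # signals end-of-stream (real consumers just keep polling)

def producer(events):
    for event in events:
        topic.put(json.dumps(event).encode())   # serialize, like a real producer
    topic.put(SENTINEL)

results = []

def consumer():
    while True:
        msg = topic.get()
        if msg is SENTINEL:
            break
        event = json.loads(msg)
        results.append(event["value"] * 2)      # toy per-message transformation

events = [{"value": v} for v in (1, 2, 3)]
t = threading.Thread(target=consumer)
t.start()
producer(events)
t.join()
print(results)  # [2, 4, 6]
```

The same shape carries over to Kafka → Spark Streaming: the producer side stays tiny, and all the interesting logic lives in the per-message (or per-microbatch) transform.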
Portfolio and community:
- Document your projects on GitHub with READMEs and architecture diagrams
- Share progress as blog posts or short videos — great for portfolio & networking
- Join data engineering communities (LinkedIn, Reddit r/dataengineering, Slack groups)
This repository contains various data engineering projects and learning resources:
```
├── 3-weeks-plan/                    # 3-week intensive data engineering plan
│   ├── week1-batch-etl/
│   ├── week2-streaming-airflow/
│   └── week3-cloud-etl/
├── full-phased-project/             # Comprehensive phased data engineering project
│   ├── phase1-batch-etl/
│   ├── phase2-streaming-orchestration/
│   └── phase3-cloud-pipeline/
├── basic-statistics/                # Production-ready statistics with Python + Fortran
│   ├── src/                         # Python and Fortran implementations
│   ├── docs/                        # Theory, theorems, and guides
│   ├── examples/                    # Real-world use cases
│   ├── tests/                       # Comprehensive test suite
│   ├── api/                         # FastAPI service
│   └── docker/                      # Production deployment
├── cobol-project/                   # Production-ready COBOL project with converters
│   ├── src/                         # COBOL source programs
│   ├── converters/                  # Python ↔ COBOL conversion tools
│   ├── examples/                    # Example programs
│   └── docs/                        # Comprehensive documentation
├── fortan-ai/                       # Fortran AI project
├── snowflake-databricks-mastery/    # Cloud data warehouse projects
└── README.md
```
basic-statistics: a comprehensive statistics project featuring:
- Dual Implementation: Python (flexible) + Fortran (performance)
- Complete Theory: Statistical theorems with proofs
- Real Use Cases: A/B testing, quality control, market analysis
- Big Data Ready: Spark integration, distributed computing
- Production API: FastAPI service with Docker deployment
- Fully Tested: 28+ unit tests, property-based testing
cobol-project: a comprehensive COBOL project featuring:
- 5 Production-Ready COBOL Programs demonstrating different COBOL features
- Bidirectional Converters: Python ↔ COBOL conversion tools
- Complete Documentation: COBOL features guide and conversion guide
- Example Programs: Ready-to-use examples for learning