This is an ongoing project; further enhancements will make the data more useful and accessible to downstream stakeholders for analysis and decision-making.
This project is built using Apache Airflow on Amazon Web Services (AWS) for processing weather data from the OpenWeather API.
The pipeline follows a standard ETL flow:
- Extract weather data from the OpenWeather API using Airflow tasks.
- Transform the raw data into a structured, analysis-ready format.
- Load the transformed data into an Amazon S3 bucket.
- OpenWeather API – Provides real-time weather data, forecasts, and historical information.
- Amazon S3 – Stores intermediate and transformed weather data in CSV format for further analysis.
- Apache Airflow – Orchestrates the entire data pipeline through DAGs, operators, and schedulers.
The pipeline DAG (`weather_dag`) contains three main tasks:
- `is_weather_api_ready` – Uses `HttpSensor` to check API availability.
- `extract_weather_data` – Uses `SimpleHttpOperator` to pull data from the API.
- `transform_load_weather_data` – Uses `PythonOperator` to clean and save data into S3.
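A minimal sketch of how these tasks might be wired together, assuming Airflow 2.4+ with the HTTP provider installed. The connection ID `openweathermap_api`, the example city, and the placeholder API key are assumptions for illustration, not values from this repo:

```python
import json
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.http.operators.http import SimpleHttpOperator
from airflow.providers.http.sensors.http import HttpSensor

# Assumed endpoint; the city and API key are placeholders.
WEATHER_ENDPOINT = "/data/2.5/weather?q=Seattle&appid=YOUR_API_KEY"

def transform_load(task_instance):
    # Pull the raw JSON pushed by the extract task; the full transform/load
    # logic is sketched later in this README.
    data = task_instance.xcom_pull(task_ids="extract_weather_data")
    ...

with DAG(
    dag_id="weather_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=2)},
) as dag:
    # Poke the API until it responds so downstream tasks don't fail on an outage.
    is_weather_api_ready = HttpSensor(
        task_id="is_weather_api_ready",
        http_conn_id="openweathermap_api",  # assumed connection ID
        endpoint=WEATHER_ENDPOINT,
    )

    # GET the current weather and push the parsed JSON to XCom.
    extract_weather_data = SimpleHttpOperator(
        task_id="extract_weather_data",
        http_conn_id="openweathermap_api",
        endpoint=WEATHER_ENDPOINT,
        method="GET",
        response_filter=lambda response: json.loads(response.text),
        log_response=True,
    )

    transform_load_weather_data = PythonOperator(
        task_id="transform_load_weather_data",
        python_callable=transform_load,
    )

    is_weather_api_ready >> extract_weather_data >> transform_load_weather_data
```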
You can monitor DAG runs, success/failure states, and task execution duration through the Airflow UI.
The transformed weather data is stored in CSV format inside an S3 bucket (openweatherapidata).
Each DAG run creates a new file with a timestamped filename, making it easy to track and manage historical data.
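For example, the load step could derive the object key from the run time. The `current_weather_` prefix and timestamp format here are illustrative, not the repo's exact convention:

```python
from datetime import datetime

# Illustrative naming scheme: one CSV per DAG run, keyed by the run time.
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
s3_path = f"s3://openweatherapidata/current_weather_{timestamp}.csv"
# -> s3://openweatherapidata/current_weather_20240101063000.csv
```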
- Extract Data – Airflow triggers a call to the OpenWeather API to fetch weather data for a given location.
- Transform Data – The raw JSON response is parsed and converted into structured CSV format.
- Load Data to S3 – The cleaned data is saved in S3 with a consistent naming convention for future analysis (see the sketch after this list).
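A sketch of what the transform/load callable might look like, assuming the extract task pushed OpenWeather's current-weather JSON to XCom and that pandas plus s3fs are installed so `to_csv` can write directly to S3. The selected fields and the Fahrenheit conversion are illustrative choices:

```python
from datetime import datetime, timezone

import pandas as pd  # writing to an s3:// path also requires s3fs

def kelvin_to_fahrenheit(k):
    # OpenWeather returns temperatures in Kelvin by default.
    return (k - 273.15) * 9 / 5 + 32

def transform_load(task_instance):
    # Raw current-weather JSON pushed to XCom by the extract task.
    data = task_instance.xcom_pull(task_ids="extract_weather_data")

    # Flatten the nested response into one analysis-ready row.
    record = {
        "city": data["name"],
        "description": data["weather"][0]["description"],
        "temp_f": kelvin_to_fahrenheit(data["main"]["temp"]),
        "feels_like_f": kelvin_to_fahrenheit(data["main"]["feels_like"]),
        "humidity_pct": data["main"]["humidity"],
        "wind_speed": data["wind"]["speed"],
        "observed_at_utc": datetime.fromtimestamp(data["dt"], tz=timezone.utc).isoformat(),
    }
    df = pd.DataFrame([record])

    # Timestamped key per DAG run, as described in the storage section above.
    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    df.to_csv(f"s3://openweatherapidata/current_weather_{timestamp}.csv", index=False)
```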
- Create an AWS account.
- Set up an S3 bucket to store weather data.
- Sign up on OpenWeather and generate an API key.
- Deploy Airflow on AWS using EC2 or ECS (or run it locally for development).
- Define the DAG and operators to orchestrate ETL tasks.
- Configure Airflow connections (see the sketch after this list) for:
- OpenWeather API
- Amazon S3
- Trigger the DAG manually or schedule it using Airflow's scheduler (`@daily`).
- Monitor task execution in the Airflow UI.
- Verify the generated CSV files in the S3 bucket.
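The two connections from the setup steps are usually created in the Airflow UI (Admin → Connections) or with the `airflow connections add` CLI. As a sketch, they can also be registered programmatically with Airflow's `Connection` model; the IDs, host, and credentials below are placeholders:

```python
from airflow import settings
from airflow.models import Connection

session = settings.Session()

# HTTP connection the HttpSensor/SimpleHttpOperator reference.
session.add(Connection(
    conn_id="openweathermap_api",  # assumed ID; match your operators
    conn_type="http",
    host="https://api.openweathermap.org",
))

# AWS connection for S3 access. On EC2/ECS an attached IAM role can be
# used instead of explicit keys, in which case login/password stay empty.
session.add(Connection(
    conn_id="aws_default",
    conn_type="aws",
    login="YOUR_AWS_ACCESS_KEY_ID",         # placeholder
    password="YOUR_AWS_SECRET_ACCESS_KEY",  # placeholder
))

session.commit()
```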
- Monitor AWS storage costs.
- Implement proper logging and error handling for robustness.
- Set up monitoring and alerts to catch pipeline failures early.
- Vecham Gautham
For questions or feedback, contact [email protected].


