laineq/DataWings

DataWings

SFU BigData Lab1 Project

📌 Project Overview

This project analyzes how large-scale climate variability, specifically the ENSO phases (El Niño, Neutral, La Niña), influences bird migration patterns across North America.

We built:

  • A reusable, configuration-driven ETL pipeline using PySpark and AWS EMR
  • Machine learning prediction models for bird observations
  • A real-time data streaming system using Kafka → S3 → Athena → QuickSight
  • Interactive ML and real-time dashboards

👥 Team

  • Joohyun Park – Real-Time Pipeline, Kafka, QuickSight Dashboard
  • Jiayi Li – ML Modeling, Shiny Dashboard
  • Hongrui Qu – Historical ETL, PySpark, EMR

🔗 Dashboard Links

We developed a historical analysis and machine learning–based prediction dashboard.
This dashboard visualizes long-term bird migration patterns and model-based predictions of bird observations under different ENSO phases.
Built using R, Shiny, and ggplot2.

Key Features

Exploratory Data Analysis:

  • Historical migration pattern visualization by week and by species
  • Average latitude comparison between El Niño and La Niña

Prediction Map:

  • Random Forest–based bird observation prediction under different ENSO phases (El Niño, Neutral, and La Niña)
  • Interactive map-based visualization that lets users select the month, ENSO phase, and species

We developed a real-time bird observation dashboard using AWS QuickSight.
The dashboard provides an interactive map that allows users to explore live bird observations across Canada and the United States.

Key Features

  • Interactive map-based exploration
  • Filter by date, country (CA / US), and species
  • View real-time observation counts and locations
  • Built on a real-time data pipeline using Kafka → S3 → Athena → QuickSight

Access Note

  • You may need to sign in with your AWS account to view the dashboard.
  • Access is restricted to invited users.

🏗 System Architecture

Historical Pipeline:
eBird & NOAA ENSO datasets → S3 → ETL (EMR/PySpark) → ML (R) → Shiny app

  • ETL Configuration:
    Run the entire historical ETL at once with: python3 pipeline.py
    config.json stores all user settings, including species, countries, year ranges, file paths, and the choice of weekly or monthly aggregation

  • Raw file requirement:
    eBird: raw .txt file downloaded from eBird and named using the pattern species_country.txt
    NOAA ENSO: CSV file downloaded from NOAA

  • Overall structure for ETL
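
The configuration-driven setup described above can be sketched roughly as follows. This is only an illustration: the actual keys in config.json are not shown in this README, so every key name here (species, countries, start_year, end_year, raw_dir, aggregation), the species labels, and the helper functions are assumptions. Only the species_country.txt naming pattern and the weekly/monthly choice come from the text above.

```python
import json
from datetime import date

# Hypothetical config contents; the real config.json schema may differ.
EXAMPLE_CONFIG = """
{
  "species": ["swainsons_thrush", "golden_winged_warbler"],
  "countries": ["CA", "US"],
  "start_year": 1995,
  "end_year": 2025,
  "raw_dir": "raw/ebird",
  "aggregation": "weekly"
}
"""


def load_config(text):
    """Parse the JSON settings and reject obviously invalid values."""
    cfg = json.loads(text)
    if cfg["aggregation"] not in ("weekly", "monthly"):
        raise ValueError("aggregation must be 'weekly' or 'monthly'")
    if cfg["start_year"] > cfg["end_year"]:
        raise ValueError("start_year must not be after end_year")
    return cfg


def raw_file_paths(cfg):
    """Build the expected raw eBird file paths (species_country.txt)."""
    return [
        f'{cfg["raw_dir"]}/{species}_{country}.txt'
        for species in cfg["species"]
        for country in cfg["countries"]
    ]


def period_key(day, aggregation):
    """Return the grouping key for one observation date."""
    if aggregation == "weekly":
        iso = day.isocalendar()
        return (iso[0], iso[1])  # (ISO year, ISO week number)
    return (day.year, day.month)


cfg = load_config(EXAMPLE_CONFIG)
paths = raw_file_paths(cfg)
```

Keeping the aggregation choice in one small function means the rest of a pipeline can group by the returned key without caring whether the user asked for weeks or months.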

Real-Time Pipeline:
eBird API → Kafka → S3 → Glue → Athena → QuickSight

🛠 Tech Stack

Data Engineering

  • Python, PySpark, Apache Kafka, Kafka S3 Sink Connector
  • Amazon S3, EMR, Glue, Athena

Machine Learning

  • RStudio (randomForest, dplyr, ggplot2)

Visualization

  • R Shiny
  • gganimate
  • AWS QuickSight (Real-Time Dashboard)

Database

  • SQLite (Real-Time Deduplication)
  • S3 (Raw & Clean data storage)

📂 Data Source

🗂️ Historical Data

  • eBird Observation Dataset (Cornell Lab of Ornithology)
  • NOAA Oceanic Niño Index (ONI)
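
One common way to turn ONI values into the three ENSO phases is a ±0.5 °C threshold. The sketch below assumes that simplified rule; this README does not state how the project labels phases, and NOAA's operational definition additionally requires the threshold to persist across five consecutive overlapping three-month seasons, which this sketch ignores. The function name and the threshold parameter are illustrative.

```python
def enso_phase(oni, threshold=0.5):
    """Label a single ONI value with a simplified ENSO phase.

    Assumption: a plain threshold on the ONI anomaly (in degrees C),
    with no persistence check across consecutive seasons.
    """
    if oni >= threshold:
        return "El Niño"
    if oni <= -threshold:
        return "La Niña"
    return "Neutral"


# Example ONI values chosen only for illustration.
labels = [enso_phase(v) for v in (1.2, 0.1, -0.8)]
```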

Scope for historical data

  • Time Range: 1995 Jan – 2025 Sep
  • Region: Canada & United States
  • Species:
    • Swainson’s Thrush (olive-backed subspecies)
    • Golden-winged Warbler

⏱️ Real-Time Data

  • eBird “Recent Observations in a Region” API (24-hour rolling window, 30-min interval)
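
Because a 24-hour window is polled every 30 minutes, consecutive polls overlap heavily and most records repeat. A minimal sketch of SQLite-based deduplication is shown below. It assumes each observation is a dict with subId (checklist ID) and speciesCode fields and that the pair identifies a record; the table name, the key choice, and the function names are hypothetical and not taken from the project's code.

```python
import sqlite3


def open_seen_store(path=":memory:"):
    """Open (or create) the store of observation keys already forwarded."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen ("
        " sub_id TEXT NOT NULL,"
        " species_code TEXT NOT NULL,"
        " PRIMARY KEY (sub_id, species_code))"
    )
    return conn


def filter_new(conn, observations):
    """Return only observations whose key has not been seen before."""
    fresh = []
    for obs in observations:
        cur = conn.execute(
            "INSERT OR IGNORE INTO seen (sub_id, species_code) VALUES (?, ?)",
            (obs["subId"], obs["speciesCode"]),
        )
        # rowcount is 1 when the row was inserted, 0 when it already existed.
        if cur.rowcount == 1:
            fresh.append(obs)
    conn.commit()
    return fresh


conn = open_seen_store()
first_poll = [
    {"subId": "S1", "speciesCode": "sp_a"},
    {"subId": "S2", "speciesCode": "sp_a"},
]
second_poll = first_poll + [{"subId": "S3", "speciesCode": "sp_b"}]

new_first = filter_new(conn, first_poll)
new_second = filter_new(conn, second_poll)
```

Only the records returned by filter_new would then be published downstream, so the overlapping window does not inflate observation counts.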
