Inspiration
Training ML models is getting expensive in two ways: cloud bills and carbon. We kept seeing the same pattern across teams—people wanted to make training greener, but the “how” was messy: inconsistent codebases, unclear configs, and no simple way to translate “this repo” into “this many kg of CO₂” plus a concrete optimization path.
GreenPull started from a simple question:
Can we turn any ML repo into a carbon estimate + a practical optimization PR—without executing code?
What it does
GreenPull takes a GitHub repository URL and produces:
- Detected training entrypoint (best-guess command + file)
- Extracted training configuration (model type, epochs, batch size, hardware assumptions) via static signals + LLM analysis
- Baseline emissions estimate using Green Algorithms-style computation:
  - P = PUE × (P_CPU + P_GPU + P_mem)
  - E = (P × t) / 1000
  - CO₂ = E × CI
- A preview-only optimization patch (AMP / LoRA / INT8 ideas) as a unified diff
- Optimized emissions estimate + savings + real-world comparisons (car km, tree-months, etc.)
No training code is executed—everything is static analysis and estimation.
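The estimator's arithmetic fits in a few lines of Python. This is a minimal sketch; the wattages, PUE, and carbon-intensity values below are illustrative defaults, not GreenPull's actual hardware profiles:

```python
def estimate_co2(p_cpu_w, p_gpu_w, p_mem_w, runtime_h,
                 pue=1.67, carbon_intensity=0.475):
    """Green Algorithms-style estimate.

    p_*_w: average component draw in watts (illustrative inputs).
    pue: data-centre Power Usage Effectiveness (1.67 is a commonly
         cited global average, used here as a placeholder default).
    carbon_intensity: grid intensity in kgCO2/kWh (0.475 is roughly
         a world-average figure, again a placeholder).
    """
    power_w = pue * (p_cpu_w + p_gpu_w + p_mem_w)  # P = PUE * (P_CPU + P_GPU + P_mem)
    energy_kwh = power_w * runtime_h / 1000        # E = (P * t) / 1000
    co2_kg = energy_kwh * carbon_intensity         # CO2 = E * CI
    return energy_kwh, co2_kg

# Example: an 8-hour run on a single GPU-class node
energy, co2 = estimate_co2(p_cpu_w=50, p_gpu_w=300, p_mem_w=20, runtime_h=8)
```

The same two outputs (kWh and kg CO₂) feed the real-world comparisons: once you have kg CO₂, car-km and tree-month equivalents are just published conversion factors applied on top.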
How we built it
- Frontend: React + TypeScript (Vite) + Tailwind + shadcn/ui, with Recharts dashboards for energy/CO₂ breakdowns and savings.
- Backend: FastAPI that exposes /api/analyze and /api/jobs/{id}.
- Async workers: Redis + RQ, so repo cloning, analysis, extraction, and patch generation run reliably in the background.
- Static analysis core:
- Repo cloning + file scanning
- Entrypoint detection using regex scoring + iterative LLM assistance
- Context gathering: configs, README, imports, dependency hints
- Code analysis via Python AST where possible, and structured extraction prompts when needed
- Carbon estimator: Green Algorithms-inspired formulas using PUE, hardware profiles, runtime estimate, and country carbon intensity.
- Patch generator: LLM-generated preview patches for AMP/LoRA with guardrails—always output as diff, never mutate the repo.
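As a rough illustration of the AST path above (the real extractor handles many more patterns and falls back to structured LLM prompts when static analysis comes up empty), a no-execution scan for common training hyperparameters can look like this:

```python
import ast

def extract_training_config(source: str) -> dict:
    """Pull simple hyperparameter hints (epochs, batch size, lr) from
    top-level assignments and call keywords, without executing code."""
    config = {}
    interesting = {"epochs", "num_epochs", "batch_size", "lr", "learning_rate"}
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Plain assignments: epochs = 10
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if (isinstance(target, ast.Name) and target.id in interesting
                        and isinstance(node.value, ast.Constant)):
                    config[target.id] = node.value.value
        # Call keywords: Trainer(epochs=10, lr=3e-4)
        elif isinstance(node, ast.Call):
            for kw in node.keywords:
                if kw.arg in interesting and isinstance(kw.value, ast.Constant):
                    config[kw.arg] = kw.value.value
    return config

snippet = "batch_size = 32\ntrainer = Trainer(epochs=10, lr=3e-4)\n"
print(extract_training_config(snippet))
```

Because `ast.parse` never runs the code, this stays within the no-execution guarantee: the repo is read, never imported.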
Challenges we ran into
1) GitHub integration was more painful than expected
Getting from “repo URL” to “actionable PR flow” had sharp edges: auth quirks, rate limits, varying default branches, monorepos, and repos that don’t follow conventions. Even “clone + detect entrypoint” became a reliability problem when projects had multiple training scripts, notebooks, or custom launchers.
How we overcame it: we added layered fallbacks—regex heuristics first, then deeper context gathering, then iterative LLM reasoning with confidence scoring. We also made job states explicit (queued → cloning → analyzing → extracting → estimating → patching → completed) so failures were debuggable instead of mysterious.
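A stripped-down version of that regex-scoring first pass might look like the following (the signal patterns and weights here are invented for illustration; the real scorer uses more signals and hands low-confidence cases to the LLM):

```python
import re

# Heuristic signals and weights (illustrative, not GreenPull's actual list).
SIGNALS = [
    (re.compile(r"if __name__ == .__main__."), 3),  # script-style entrypoint
    (re.compile(r"\.fit\(|trainer\.train\("), 3),   # training-loop launchers
    (re.compile(r"argparse|click|hydra"), 2),       # CLI/config frameworks
    (re.compile(r"loss\.backward\(\)"), 2),         # hand-rolled training loop
    (re.compile(r"import torch|import tensorflow"), 1),
]

def score_entrypoint(source: str) -> int:
    """Sum the weights of every signal that fires in this file."""
    return sum(weight for pattern, weight in SIGNALS if pattern.search(source))

files = {
    "train.py": "import torch\nif __name__ == '__main__':\n    loss.backward()",
    "utils.py": "import torch\ndef pad(x): return x",
}
best = max(files, key=lambda name: score_entrypoint(files[name]))
```

When the top score is low or two files tie, that's the trigger to gather more context (README, configs, imports) and escalate to LLM reasoning, which is what "layered fallbacks" means in practice.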
2) Analytics tooling and measurement consistency
We wanted the dashboard to be credible, not hand-wavy. But “carbon estimation” depends on assumptions: runtime, GPU type, PUE, utilization, memory draw, and country carbon intensity. Making those assumptions transparent while keeping the UI simple was tricky.
How we overcame it: we standardized the estimator inputs, separated “detected facts” from “assumptions,” and surfaced breakdowns (CPU/GPU/memory) so users could see what drives emissions. The dashboard became less of a vanity chart and more of a diagnostic tool.
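One way to keep "detected facts" and "assumptions" separate is to tag every estimator input with its provenance. A hypothetical schema (not GreenPull's actual data model):

```python
from dataclasses import dataclass

@dataclass
class EstimatorInput:
    """One estimator input, tagged so the UI can render
    'detected from repo' and 'assumed default' separately."""
    name: str
    value: float
    source: str  # "detected" or "assumed"

inputs = [
    EstimatorInput("epochs", 10, "detected"),         # found in train.py
    EstimatorInput("gpu_tdp_watts", 300, "assumed"),  # hardware-profile default
    EstimatorInput("pue", 1.67, "assumed"),           # data-centre default
]
detected = [i.name for i in inputs if i.source == "detected"]
```

Surfacing this split is what turns the dashboard from a vanity chart into a diagnostic: users can see exactly which numbers came from their repo and which are defaults they can override.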
3) Generating error-free “green code” patches
The hardest part was producing patches that looked correct across diverse code styles. AMP is not just one line—you need correct autocast scopes, scaler usage, and safe integration with existing training loops. LoRA touches model construction and optimizer setup. A patch that compiles in one repo can break another.
This was the “several hours non-stop debugging” part: we repeatedly hit edge cases—different frameworks, custom trainers, unusual module layouts, missing imports, or training loops split across files.
How we overcame it: we introduced patch templates, stricter diff formatting, AST-informed insertion points where possible, and repeated validation passes (static checks, sanity heuristics, and “does this patch match the repo’s style?” prompts). The output became consistently reviewable and far less brittle.
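The "always a diff, never a mutation" guardrail is cheap to enforce with Python's standard library. A sketch (the AMP rewrite shown in the example strings is illustrative, not a template GreenPull emits verbatim):

```python
import difflib

def preview_patch(path: str, original: str, patched: str) -> str:
    """Render a proposed change as a unified diff without ever
    writing to the file on disk."""
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        patched.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)

before = "out = model(x)\nloss = criterion(out, y)\nloss.backward()\n"
after = ("with torch.autocast('cuda'):\n"
         "    out = model(x)\n"
         "    loss = criterion(out, y)\n"
         "scaler.scale(loss).backward()\n")
print(preview_patch("train.py", before, after))
```

Keeping the output in `a/...` / `b/...` unified-diff form also means the patch drops straight into normal review tooling (`git apply`, PR diffs), which is the whole point of making the result PR-shaped.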
Accomplishments that we're proud of
- A full end-to-end pipeline: URL → baseline estimate → optimization diff → optimized estimate → dashboard
- No-code-execution design that still produces meaningful outputs through static analysis
- A robust async system (Redis/RQ) that handles long-running jobs cleanly with observable states
- Patch previews that are practical to review and merge—bringing sustainability closer to normal developer workflows
What we learned
- In real repos, the main challenge isn’t the formula—it’s reliability across messy codebases.
- Carbon reporting only matters if it leads to action; the killer feature is the PR-shaped output.
- “AI-generated patches” need engineering guardrails: structure, templates, insertion logic, and clear diffs.
- Transparency wins trust: users accept estimates when assumptions are explicit and breakdowns are shown.
What's next for GreenPull
- Broader framework coverage: better support for PyTorch Lightning, HuggingFace Trainer variants, distributed training patterns, and notebook-heavy repos.
- Smarter runtime estimation: learn from repo signals (dataset size, steps/epoch hints, logs in README) to tighten uncertainty.
- Policy & reporting exports: sustainability reports for org compliance (internal dashboards, audit-friendly summaries).
- PR automation (optional): safer GitHub PR creation with configurable checks, plus “human-in-the-loop” approval steps.
- Team features: baselines over time, project portfolios, and “where do we save the most CO₂ per engineering hour?” ranking.
GreenPull’s goal is simple: make “green ML” feel like normal software practice: measure, patch, and ship.