Autonomous Kaggle competition loop inspired by Karpathy's autoresearch. Three persistent specialist agents coordinated by an orchestrator. No human involvement once started. Resilient to crashes and session restarts.
Where autoresearch runs a single-agent loop (edit -> train -> keep/revert) on one file, AutoKaggle runs a multi-agent loop (research -> plan -> review -> code -> submit -> learn) across an entire ML competition pipeline. The multi-agent design catches tunnel vision and forgotten learnings that a single agent misses after many rounds.
```
+--------------------------------------------------------------+
|                        ORCHESTRATOR                          |
|                        (program.md)                          |
|                                                              |
|  Reads: state.json + results.tsv (stays context-lean)        |
|  Sends short triggers to persistent agents via SendMessage   |
|  Decides submission, logs atomically, loops forever          |
+-------+----------------------+----------------------+--------+
        |                      |                      |
  spawn v                spawn v                spawn v
+--------------+  +---------------------+  +--------------+
|  RESEARCHER  |  |       BUILDER       |  |   REVIEWER   |
|              |  |                     |  |              |
| Scrapes      |  | Plans, codes,       |  | Challenges   |
| Kaggle for   |  | handles submission  |  | the plan     |
| new findings |  | CSV                 |  | before any   |
|              |  |                     |  | code runs    |
| Returns:     |  | Returns:            |  |              |
| "DONE"       |  | "DONE" (plan)       |  | Returns:     |
|              |  | "REVISED" (plan)    |  | "APPROVED"   |
|              |  | "CV_SCORE=X" (code) |  | or           |
|              |  |                     |  | "REVISE: X"  |
+--------------+  +---------------------+  +--------------+
```
Flow each round:

```
Round 0:           Builder (EDA) -> done (no experiment, no submission)
Round 1+:          Research -> Builder (plan) -> Reviewer -> [Builder (revise)] -> Builder (code+submit) -> Reviewer (verify) -> [Submit]
Round 10, 20, ...: Reviewer (retro) -> Research -> Builder (plan) -> ...
```
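One round of this flow can be sketched as orchestration pseudocode in Python. The `send` helper and the message strings are hypothetical stand-ins for the actual SendMessage tool, not the repo's API:

```python
def run_round(n: int, send) -> None:
    """One round of the loop, via a hypothetical send(agent, msg) -> reply helper."""
    if n == 0:
        send("builder", "Run EDA, write R00_eda.md")  # Round 0: EDA only
        return
    if n % 10 == 0:
        send("reviewer", f"Round {n}: run campaign retrospective")
    send("researcher", f"Round {n}: scrape Kaggle for new findings")
    send("builder", f"Round {n}: write plan")
    verdict = send("reviewer", f"Round {n}: review plan")
    if verdict.startswith("REVISE"):
        # One revision cycle per round maximum
        send("builder", f"Round {n}: revise plan per review")
    send("builder", f"Round {n}: code + submit")
    send("reviewer", f"Round {n}: verify results")
```

The orchestrator itself stays context-lean: it only routes triggers and reads the one-line replies; all substance lives in the files the agents write.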
Key design principles:
- Agents are persistent — spawned once per competition, kept alive via SendMessage
- Each agent accumulates context naturally — no re-reading full history each round
- Agents communicate via file paths and one-line returns — never file contents
- Orchestrator reads only `state.json` + `results.tsv` — stays lean across many rounds
- All file writes are atomic (write `.tmp` then `mv`) — safe against mid-write crashes
- Resume check uses `[ -s file ]` — guards against empty files from partial crashes
- Reviewer = strategic memory — catches tunnel vision and forgotten learnings
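The atomic-write rule is the standard temp-file-plus-rename pattern. A minimal Python sketch (the function name is mine, not from the repo):

```python
import os
import tempfile

def atomic_write(path: str, text: str) -> None:
    """Write to a temp file in the target directory, then rename atomically.

    A crash mid-write leaves either the old file or nothing at `path`,
    never a truncated file, which is what the `[ -s file ]` resume
    check relies on.
    """
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())  # make sure bytes hit disk before the rename
        os.replace(tmp, path)  # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp)  # clean up the orphaned temp file
        raise
```

The temp file must live in the same directory as the target: `os.replace` is only atomic within a filesystem.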
Edit `config.json` — replace all `<PLACEHOLDERS>`:
```json
{
  "competition": "playground-series-s6e4",
  "data_dir": "./competition/data",
  "task_type": "binary_classification",
  "metric": "auc_roc",
  "metric_direction": "higher",
  "target_column": "target",
  "cv_folds": 5,
  "deadline": "2026-04-30"
}
```

Reset `state.json` (or leave the template defaults).
The loop needs Kaggle API access for submissions, leaderboard checks, and GPU kernels.
```bash
# Option A: API token file (recommended)
# Download from https://www.kaggle.com/settings → API → Create New Token
# This saves kaggle.json to ~/.kaggle/
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

# Option B: Environment variable (alternative)
# Add to your .env file — never hardcode tokens in scripts
export KAGGLE_API_TOKEN='{"username":"your_username","key":"your_key"}'
```

Verify it works:
```bash
kaggle competitions list | head -5
```

Set your Kaggle username in `config.json` under `kaggle_username` — this is used for kernel push commands and leaderboard tracking.
```bash
kaggle competitions download -c <COMPETITION_SLUG>
unzip <COMPETITION_SLUG>.zip -d competition/data/
```

Create markdown files with competition-specific notes, community insights, or research and list them in `config.json` under `knowledge_files`. The agents read these at startup.
Read `autokaggle/program.md` and run the loop.
To resume after a crash: same command. The orchestrator checks which phases are already complete and skips them.
Edit `config.json` only. Change:
- `competition` — Kaggle slug
- `competition_dir` / `data_dir` / `existing_oof_dir` — local paths
- `task_type` — `binary_classification`, `multiclass_classification`, `regression`, `ranking`, `object_detection`, `nlp_generation`, `time_series`, etc.
- `metric` — `auc_roc`, `rmse`, `logloss`, `map@5`, `f1`, `bleu`, etc.
- `metric_direction` — `higher` or `lower`
- `target_column`, `id_column`, `cv_strategy`, `cv_folds`, `deadline`, `max_submissions_per_day`
- `knowledge_files` — list of any `.md` files with competition notes
- `best_pipeline_script` — path to your current best script (optional; the Builder reads it at startup)
- `scraper_path` — path to a Kaggle scraper script (optional)
All agent prompts read from config.json — nothing else needs to change.
The Reviewer is the most important addition over a single-agent loop. It catches:
| Failure mode | What the Reviewer asks |
|---|---|
| Tunnel vision (same model 3 rounds) | "Is ensemble diverse enough?" |
| Forgotten learnings | "Does this contradict a prior finding?" |
| Narrow search space | "Are we fine-tuning an already-explored region?" |
| Weak ensemble | "What model family hasn't been tried yet?" |
| Missing bold moves | "What are top competitors doing that we haven't?" |
| Poor ROI | "Is this the best use of the next N hours?" |
One revision cycle per round maximum.
Every 10 rounds, the Reviewer runs a full campaign retrospective — reading all results and findings to find cross-round connections (e.g., a round 2 insight that explains a round 15 failure), reassess discarded experiments, detect diminishing returns, and recommend the top 5 experiments to try next.
Every phase output is a named, atomically-written file.
On restart the orchestrator checks file existence with `[ -s ]` and skips done phases:

```
R{NN}_research_brief.md -> Research done?
R{NN}_plan.md           -> Plan written?
R{NN}_review.md         -> Review done?
R{NN}_submission.csv    -> Code + submit done?
```
Periodic re-spawn: orchestrator re-spawns any agent that has been running 15+ rounds to prevent context overflow.
| File | Writer | Contents |
|---|---|---|
| `R00_eda.md` | Builder | EDA report (Round 0 only) — referenced by all agents |
| `R{NN}_research_raw.md` | Research | Raw findings (large, never read into context) |
| `R{NN}_research_brief.md` | Research | New findings only (<500 words) |
| `R{NN}_plan.md` | Main | Full experiment specification (may be revised) |
| `R{NN}_review.md` | Reviewer | Six-check review + APPROVED/REVISE verdict |
| `R{NN}_experiment.py` | Main | Self-contained ML training script |
| `R{NN}_run.log` | Main | Training stdout/stderr (grepped, never fully read) |
| `R{NN}_oof.npy` | Main | OOF predictions for new model |
| `R{NN}_test.npy` | Main | Test predictions for new model |
| `R{NN}_findings.md` | Main | Post-round analysis + LB submission recommendation |
| `R{NN}_submission.csv` | Main | The actual Kaggle submission |
| File | Contents |
|---|---|
| `config.json` | Competition config — edit this for each new competition |
| `state.json` | Round N, best scores, agent IDs, submissions today, CV/LB history |
| `results.tsv` | One row per round: round, script, cv_score, lb, status, submitted, timestamp, description |
| `LEARNINGS.md` | Generalised rules from prior campaigns — agents read at startup |
| `KAGGLE_API.md` | GPU types, kernel commands, submission API, session limits |
| `experiments.json` | Structured experiment registry — model families, scores, correlations |
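The exact schema of `experiments.json` is not specified here; a hypothetical entry, with every field name assumed for illustration, might look like:

```json
{
  "R07_lgbm_baseline": {
    "model_family": "lightgbm",
    "cv_score": 0.9712,
    "lb_score": 0.9698,
    "oof_path": "R07_oof.npy",
    "corr_with_best": 0.991,
    "status": "kept"
  }
}
```

A structured registry like this is what lets the Reviewer query "which model families haven't been tried?" without re-reading round logs.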
See `LEARNINGS.md` for full details with evidence.
- Ridge stacking on correlated models hurts LB. If pairwise correlation > 0.995, use simple weighted blend.
- Diversity needs a minimum metric threshold. Models must score within 0.003 of the best to contribute.
- Research > tuning. Community notebook research gave 10x more gain than hyperparameter tuning.
- Fold-1 kill gates save hours. Run fold-1 before committing to full K-fold.
- Simple blends generalise better. 2-4 diverse models with rank blending beats 20-model Ridge stack on LB.
- Research before building. Understanding what works for this problem class beats starting from defaults.
- The Reviewer catches false negatives. "Was this truly exhausted or just tried with wrong params?"
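The first two learnings can be made concrete: gate each model on its gap to the best CV score, drop near-duplicates by pairwise correlation, and fall back to a simple weighted blend. A numpy sketch, assuming a higher-is-better metric, with the thresholds taken from the list above:

```python
import numpy as np

def blend(oof_preds: list[np.ndarray], cv_scores: list[float],
          best_gap: float = 0.003, corr_cap: float = 0.995) -> np.ndarray:
    """Simple weighted blend with the two gates from LEARNINGS.md.

    - Drop models more than `best_gap` below the best CV score.
    - Drop a model whose predictions correlate > `corr_cap` with an
      already-kept model (a near-duplicate adds no diversity).
    """
    best = max(cv_scores)
    kept: list[int] = []
    for i, score in enumerate(cv_scores):
        if best - score > best_gap:
            continue  # too weak to contribute
        if any(np.corrcoef(oof_preds[i], oof_preds[j])[0, 1] > corr_cap
               for j in kept):
            continue  # near-duplicate of a kept model
        kept.append(i)
    # CV-score-proportional weights; for AUC-range scores this is
    # close to equal weighting, which is the point: keep it simple
    w = np.array([cv_scores[i] for i in kept])
    w = w / w.sum()
    return sum(w_i * oof_preds[i] for w_i, i in zip(w, kept))
```

This is a sketch of the principle, not the repo's blending code; rank blending would replace the raw predictions with their ranks before averaging.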
- Andrej Karpathy — autoresearch concept: autonomous improvement loops via `program.md`
- Battle-tested across 80 rounds on Kaggle Playground Series S6E3