Autonomous Kaggle competition loop inspired by Karpathy's autoresearch. Three persistent specialist agents coordinated by an orchestrator. No human involvement once started. Resilient to crashes and session restarts.
Where autoresearch runs a single-agent loop (edit -> train -> keep/revert) on one file, AutoKaggle runs a multi-agent loop (research -> plan -> review -> code -> submit -> learn) across an entire ML competition pipeline. The multi-agent design catches tunnel vision and forgotten learnings that a single agent misses after many rounds.
```
+--------------------------------------------------------------+
|                        ORCHESTRATOR                          |
|                        (program.md)                          |
|                                                              |
|  Reads: state.json + results.tsv (stays context-lean)        |
|  Sends short triggers to persistent agents via SendMessage   |
|  Decides submission, logs atomically, loops forever          |
+-------+----------------------+----------------------+--------+
        |                      |                      |
  spawn v                spawn v                spawn v
+--------------+  +---------------------+  +--------------+
|  RESEARCHER  |  |       BUILDER       |  |   REVIEWER   |
|              |  |                     |  |              |
| Scrapes      |  | Plans, codes,       |  | Challenges   |
| Kaggle for   |  | handles submission  |  | the plan     |
| new findings |  | CSV                 |  | before any   |
|              |  |                     |  | code runs    |
| Returns:     |  | Returns:            |  |              |
| "DONE"       |  | "DONE" (plan)       |  | Returns:     |
|              |  | "REVISED" (plan)    |  | "APPROVED"   |
|              |  | "CV_SCORE=X" (code) |  | or           |
|              |  |                     |  | "REVISE: X"  |
+--------------+  +---------------------+  +--------------+
```
Flow each round:

```
Round 0:           Builder (EDA) -> done (no experiment, no submission)
Round 1+:          Research -> Builder (plan) -> Reviewer -> [Builder (revise)] -> Builder (code+submit) -> Reviewer (verify) -> [Submit]
Round 10, 20, ...: Reviewer (retro) -> Research -> Builder (plan) -> ...
```
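One round of this flow can be sketched as orchestration pseudocode in Python. The `send` helper and the message strings are hypothetical stand-ins for the actual SendMessage tool, not the repo's API:

```python
def run_round(n: int, send) -> None:
    """One round of the loop, via a hypothetical send(agent, msg) -> reply helper."""
    if n == 0:
        send("builder", "Run EDA, write R00_eda.md")  # Round 0: EDA only
        return
    if n % 10 == 0:
        send("reviewer", f"Round {n}: run campaign retrospective")
    send("researcher", f"Round {n}: scrape Kaggle for new findings")
    send("builder", f"Round {n}: write plan")
    verdict = send("reviewer", f"Round {n}: review plan")
    if verdict.startswith("REVISE"):
        # One revision cycle per round maximum
        send("builder", f"Round {n}: revise plan per review")
    send("builder", f"Round {n}: code + submit")
    send("reviewer", f"Round {n}: verify results")
```

The orchestrator itself stays context-lean: it only routes triggers and reads the one-line replies; all substance lives in the files the agents write.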
Key design principles:
- Agents are persistent — spawned once per competition, kept alive via SendMessage
- Each agent accumulates context naturally — no re-reading full history each round
- Agents communicate via file paths and one-line returns — never file contents
- Orchestrator reads only `state.json` + `results.tsv` — stays lean across many rounds
- All file writes are atomic (write `.tmp` then `mv`) — safe against mid-write crashes
- Resume check uses `[ -s file ]` — guards against empty files from partial crashes
- Reviewer = strategic memory — catches tunnel vision and forgotten learnings
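The atomic-write rule is the standard temp-file-plus-rename pattern. A minimal Python sketch (the function name is mine, not from the repo):

```python
import os
import tempfile

def atomic_write(path: str, text: str) -> None:
    """Write to a temp file in the target directory, then rename atomically.

    A crash mid-write leaves either the old file or nothing at `path`,
    never a truncated file, which is what the `[ -s file ]` resume
    check relies on.
    """
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())  # make sure bytes hit disk before the rename
        os.replace(tmp, path)  # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp)  # clean up the orphaned temp file
        raise
```

The temp file must live in the same directory as the target: `os.replace` is only atomic within a filesystem.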
Edit `config.json` — replace all `<PLACEHOLDERS>`:
```json
{
  "competition": "playground-series-s6e4",
  "data_dir": "./competition/data",
  "task_type": "binary_classification",
  "metric": "auc_roc",
  "metric_direction": "higher",
  "target_column": "target",
  "cv_folds": 5,
  "deadline": "2026-04-30"
}
```

Reset `state.json` (or leave the template defaults).
The loop needs Kaggle API access for submissions, leaderboard checks, and GPU kernels.
```bash
# Option A: API token file (recommended)
# Download from https://www.kaggle.com/settings → API → Create New Token
# This saves kaggle.json to ~/.kaggle/
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

# Option B: Environment variable (alternative)
# Add to your .env file — never hardcode tokens in scripts
export KAGGLE_API_TOKEN='{"username":"your_username","key":"your_key"}'
```

Verify it works:
```bash
kaggle competitions list | head -5
```

Set your Kaggle username in `config.json` under `kaggle_username` — this is used for kernel push commands and leaderboard tracking.
```bash
kaggle competitions download -c <COMPETITION_SLUG>
unzip <COMPETITION_SLUG>.zip -d competition/data/
```

Create markdown files with competition-specific notes, community insights, or research and list them in `config.json` under `knowledge_files`. The agents read these at startup.
Read `autokaggle/program.md` and run the loop.
To resume after a crash: same command. The orchestrator checks which phases are already complete and skips them.
Edit `config.json` only. Change:
- `competition` — Kaggle slug
- `competition_dir` / `data_dir` / `existing_oof_dir` — local paths
- `task_type` — `binary_classification`, `multiclass_classification`, `regression`, `ranking`, `object_detection`, `nlp_generation`, `time_series`, etc.
- `metric` — `auc_roc`, `rmse`, `logloss`, `map@5`, `f1`, `bleu`, etc.
- `metric_direction` — `higher` or `lower`
- `target_column`, `id_column`, `cv_strategy`, `cv_folds`, `deadline`, `max_submissions_per_day`
- `knowledge_files` — list of any `.md` files with competition notes
- `best_pipeline_script` — path to your current best script (optional; the Builder reads it at startup)
- `scraper_path` — path to a Kaggle scraper script (optional)
All agent prompts read from config.json — nothing else needs to change.
The Reviewer is the most important addition over a single-agent loop. It catches:
| Failure mode | What the Reviewer asks |
|---|---|
| Tunnel vision (same model 3 rounds) | "Is ensemble diverse enough?" |
| Forgotten learnings | "Does this contradict a prior finding?" |
| Narrow search space | "Are we fine-tuning an already-explored region?" |
| Weak ensemble | "What model family hasn't been tried yet?" |
| Missing bold moves | "What are top competitors doing that we haven't?" |
| Poor ROI | "Is this the best use of the next N hours?" |
One revision cycle per round maximum.
Every 10 rounds, the Reviewer runs a full campaign retrospective — reading all results and findings to find cross-round connections (e.g., a round 2 insight that explains a round 15 failure), reassess discarded experiments, detect diminishing returns, and recommend the top 5 experiments to try next.
Every phase output is a named, atomically-written file.
On restart the orchestrator checks file existence with `[ -s ]` and skips done phases:

```
R{NN}_research_brief.md -> Research done?
R{NN}_plan.md           -> Plan written?
R{NN}_review.md         -> Review done?
R{NN}_submission.csv    -> Code + submit done?
```
Periodic re-spawn: orchestrator re-spawns any agent that has been running 15+ rounds to prevent context overflow.
| File | Writer | Contents |
|---|---|---|
| `R00_eda.md` | Builder | EDA report (Round 0 only) — referenced by all agents |
| `R{NN}_research_raw.md` | Research | Raw findings (large, never read into context) |
| `R{NN}_research_brief.md` | Research | New findings only (<500 words) |
| `R{NN}_plan.md` | Main | Full experiment specification (may be revised) |
| `R{NN}_review.md` | Reviewer | Six-check review + APPROVED/REVISE verdict |
| `R{NN}_experiment.py` | Main | Self-contained ML training script |
| `R{NN}_run.log` | Main | Training stdout/stderr (grepped, never fully read) |
| `R{NN}_oof.npy` | Main | OOF predictions for new model |
| `R{NN}_test.npy` | Main | Test predictions for new model |
| `R{NN}_findings.md` | Main | Post-round analysis + LB submission recommendation |
| `R{NN}_submission.csv` | Main | The actual Kaggle submission |
| File | Contents |
|---|---|
| `config.json` | Competition config — edit this for each new competition |
| `state.json` | Round N, best scores, agent IDs, submissions today, CV/LB history |
| `results.tsv` | One row per round: round, script, cv_score, lb, status, submitted, timestamp, description |
| `LEARNINGS.md` | Generalised rules from prior campaigns — agents read at startup |
| `KAGGLE_API.md` | GPU types, kernel commands, submission API, session limits |
| `experiments.json` | Structured experiment registry — model families, scores, correlations |
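The exact schema of `experiments.json` is not specified here; a hypothetical entry, with every field name assumed for illustration, might look like:

```json
{
  "R07_lgbm_baseline": {
    "model_family": "lightgbm",
    "cv_score": 0.9712,
    "lb_score": 0.9698,
    "oof_path": "R07_oof.npy",
    "corr_with_best": 0.991,
    "status": "kept"
  }
}
```

A structured registry like this is what lets the Reviewer query "which model families haven't been tried?" without re-reading round logs.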
See `LEARNINGS.md` for full details with evidence.
- Ridge stacking on correlated models hurts LB. If pairwise correlation > 0.995, use simple weighted blend.
- Diversity needs a minimum metric threshold. Models must score within 0.003 of the best to contribute.
- Research > tuning. Community notebook research gave 10x more gain than hyperparameter tuning.
- Fold-1 kill gates save hours. Run fold-1 before committing to full K-fold.
- Simple blends generalise better. 2-4 diverse models with rank blending beats 20-model Ridge stack on LB.
- Research before building. Understanding what works for this problem class beats starting from defaults.
- The Reviewer catches false negatives. "Was this truly exhausted or just tried with wrong params?"
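The first two learnings can be made concrete: gate each model on its gap to the best CV score, drop near-duplicates by pairwise correlation, and fall back to a simple weighted blend. A numpy sketch, assuming a higher-is-better metric, with the thresholds taken from the list above:

```python
import numpy as np

def blend(oof_preds: list[np.ndarray], cv_scores: list[float],
          best_gap: float = 0.003, corr_cap: float = 0.995) -> np.ndarray:
    """Simple weighted blend with the two gates from LEARNINGS.md.

    - Drop models more than `best_gap` below the best CV score.
    - Drop a model whose predictions correlate > `corr_cap` with an
      already-kept model (a near-duplicate adds no diversity).
    """
    best = max(cv_scores)
    kept: list[int] = []
    for i, score in enumerate(cv_scores):
        if best - score > best_gap:
            continue  # too weak to contribute
        if any(np.corrcoef(oof_preds[i], oof_preds[j])[0, 1] > corr_cap
               for j in kept):
            continue  # near-duplicate of a kept model
        kept.append(i)
    # CV-score-proportional weights; for AUC-range scores this is
    # close to equal weighting, which is the point: keep it simple
    w = np.array([cv_scores[i] for i in kept])
    w = w / w.sum()
    return sum(w_i * oof_preds[i] for w_i, i in zip(w, kept))
```

This is a sketch of the principle, not the repo's blending code; rank blending would replace the raw predictions with their ranks before averaging.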
- Andrej Karpathy — autoresearch concept: autonomous improvement loops via `program.md`
- Battle-tested across 80 rounds on Kaggle Playground Series S6E3