AutoKaggle

Autonomous Kaggle competition loop inspired by Karpathy's autoresearch. Three persistent specialist agents coordinated by an orchestrator. No human involvement once started. Resilient to crashes and session restarts.

Where autoresearch runs a single-agent loop (edit -> train -> keep/revert) on one file, AutoKaggle runs a multi-agent loop (research -> plan -> review -> code -> submit -> learn) across an entire ML competition pipeline. The multi-agent design catches tunnel vision and forgotten learnings that a single agent misses after many rounds.


Architecture

+--------------------------------------------------------------+
|                       ORCHESTRATOR                            |
|                      (program.md)                             |
|                                                               |
|  Reads: state.json + results.tsv (stays context-lean)        |
|  Sends short triggers to persistent agents via SendMessage    |
|  Decides submission, logs atomically, loops forever           |
+-------+----------------------+----------------------+--------+
        |                      |                      |
  spawn v                spawn v                spawn v
+--------------+   +---------------------+   +--------------+
|  RESEARCHER  |   |       BUILDER       |   |   REVIEWER   |
|              |   |                     |   |              |
| Scrapes      |   | Plans, codes,       |   | Challenges   |
| Kaggle for   |   | handles submission  |   | the plan     |
| new findings |   | CSV                 |   | before any   |
|              |   |                     |   | code runs    |
| Returns:     |   | Returns:            |   |              |
| "DONE"       |   | "DONE" (plan)       |   | Returns:     |
|              |   | "REVISED" (plan)    |   | "APPROVED"   |
|              |   | "CV_SCORE=X" (code) |   | or           |
|              |   |                     |   | "REVISE: X"  |
+--------------+   +---------------------+   +--------------+

Flow each round:

Round 0:  Builder (EDA) -> done (no experiment, no submission)
Round 1+: Research -> Builder (plan) -> Reviewer -> [Builder (revise)] -> Builder (code+submit) -> Reviewer (verify) -> [Submit]
Round 10, 20, ...: Reviewer (retro) -> Research -> Builder (plan) -> ...
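The round structure above can be sketched as a small dispatch function. This is illustrative only — the real control flow lives in program.md, and the `agent:phase` names here are assumptions, not identifiers from the repo:

```python
def phases_for_round(n: int) -> list[str]:
    """Return the ordered phases the orchestrator triggers in round n."""
    if n == 0:
        return ["builder:eda"]              # Round 0: EDA only, no submission
    phases = []
    if n % 10 == 0:
        phases.append("reviewer:retro")     # every 10th round: full retrospective
    phases += [
        "researcher:research",
        "builder:plan",
        "reviewer:review",                  # may trigger at most one revision cycle
        "builder:code_and_submit",
        "reviewer:verify",
    ]
    return phases
```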

Key design principles:

  • Agents are persistent — spawned once per competition, kept alive via SendMessage
  • Each agent accumulates context naturally — no re-reading full history each round
  • Agents communicate via file paths and one-line returns — never file contents
  • Orchestrator reads only state.json + results.tsv — stays lean across many rounds
  • All file writes are atomic (write .tmp then mv) — safe against mid-write crashes
  • Resume check uses [ -s file ] — guards against empty files from partial crashes
  • Reviewer = strategic memory — catches tunnel vision and forgotten learnings

Quick Start

1. Set up for a new competition

Edit config.json — replace all <PLACEHOLDERS>:

{
  "competition": "playground-series-s6e4",
  "data_dir": "./competition/data",
  "task_type": "binary_classification",
  "metric": "auc_roc",
  "metric_direction": "higher",
  "target_column": "target",
  "cv_folds": 5,
  "deadline": "2026-04-30"
}
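A quick, optional sanity check — not part of the repo — to catch any `<PLACEHOLDERS>` you forgot to replace before starting the loop:

```python
import re

def unfilled_placeholders(config: dict) -> list[str]:
    # Flag string values still in <ANGLE_BRACKET> placeholder form.
    return [k for k, v in config.items()
            if isinstance(v, str) and re.fullmatch(r"<[A-Z_]+>", v)]

cfg = {"competition": "playground-series-s6e4", "metric": "<METRIC>"}
print(unfilled_placeholders(cfg))  # ['metric']
```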

Reset state.json (or leave the template defaults).

2. Set up Kaggle authentication

The loop needs Kaggle API access for submissions, leaderboard checks, and GPU kernels.

# Option A: API token file (recommended)
# Download from https://www.kaggle.com/settings → API → Create New Token
# This saves kaggle.json to ~/.kaggle/
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

# Option B: Environment variable (alternative)
# Add to your .env file — never hardcode tokens in scripts
export KAGGLE_API_TOKEN='{"username":"your_username","key":"your_key"}'

Verify it works:

kaggle competitions list | head -5

Set your Kaggle username in config.json under kaggle_username — this is used for kernel push commands and leaderboard tracking.

3. Download competition data

kaggle competitions download -c <COMPETITION_SLUG>
unzip <COMPETITION_SLUG>.zip -d competition/data/

4. Add knowledge files (optional but high-ROI)

Create markdown files with competition-specific notes, community insights, or research and list them in config.json under knowledge_files. The agents read these at startup.

5. Start the loop

Read autokaggle/program.md and run the loop.

To resume after a crash: same command. The orchestrator checks which phases are already complete and skips them.


To Use for a New Competition

Edit config.json only. Change:

  • competition — Kaggle slug
  • competition_dir / data_dir / existing_oof_dir — local paths
  • task_type — binary_classification, multiclass_classification, regression, ranking, object_detection, nlp_generation, time_series, etc.
  • metric — auc_roc, rmse, logloss, map@5, f1, bleu, etc.
  • metric_direction — higher or lower
  • target_column, id_column, cv_strategy, cv_folds, deadline, max_submissions_per_day
  • knowledge_files — list of any .md files with competition notes
  • best_pipeline_script — path to your current best script (optional, Builder Agent reads at startup)
  • scraper_path — path to a Kaggle scraper script (optional)

All agent prompts read from config.json — nothing else needs to change.


The Reviewer

The Reviewer is the most important addition over a single-agent loop. It catches:

Failure mode                          What the Reviewer asks
Tunnel vision (same model 3 rounds)   "Is ensemble diverse enough?"
Forgotten learnings                   "Does this contradict a prior finding?"
Narrow search space                   "Are we fine-tuning an already-explored region?"
Weak ensemble                         "What model family hasn't been tried yet?"
Missing bold moves                    "What are top competitors doing that we haven't?"
Poor ROI                              "Is this the best use of the next N hours?"

One revision cycle per round maximum.

Every 10 rounds, the Reviewer runs a full campaign retrospective — reading all results and findings to find cross-round connections (e.g., a round 2 insight that explains a round 15 failure), reassess discarded experiments, detect diminishing returns, and recommend the top 5 experiments to try next.


Crash Recovery

Every phase output is a named, atomically-written file. On restart the orchestrator checks that each phase's output file exists and is non-empty with [ -s ] and skips completed phases:

R{NN}_research_brief.md  -> Research done?
R{NN}_plan.md            -> Plan written?
R{NN}_review.md          -> Review done?
R{NN}_submission.csv     -> Code + submit done?

Periodic re-spawn: orchestrator re-spawns any agent that has been running 15+ rounds to prevent context overflow.


Per-Round Files (in rounds/)

File                     Writer    Contents
R00_eda.md               Builder   EDA report (Round 0 only) — referenced by all agents
R{NN}_research_raw.md    Research  Raw findings (large, never read into context)
R{NN}_research_brief.md  Research  New findings only (<500 words)
R{NN}_plan.md            Main      Full experiment specification (may be revised)
R{NN}_review.md          Reviewer  Six-check review + APPROVED/REVISE verdict
R{NN}_experiment.py      Main      Self-contained ML training script
R{NN}_run.log            Main      Training stdout/stderr (grepped, never fully read)
R{NN}_oof.npy            Main      OOF predictions for new model
R{NN}_test.npy           Main      Test predictions for new model
R{NN}_findings.md        Main      Post-round analysis + LB submission recommendation
R{NN}_submission.csv     Main      The actual Kaggle submission

State Files

File              Contents
config.json       Competition config — edit this for each new competition
state.json        Round N, best scores, agent IDs, submissions today, CV/LB history
results.tsv       One row per round: round, script, cv_score, lb, status, submitted, timestamp, description
LEARNINGS.md      Generalised rules from prior campaigns — agents read at startup
KAGGLE_API.md     GPU types, kernel commands, submission API, session limits
experiments.json  Structured experiment registry — model families, scores, correlations
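As an illustration of why this keeps the orchestrator context-lean, the best round can be recovered from results.tsv alone (a sketch with made-up rows; the column layout follows the table above):

```python
import csv, io

RESULTS = (
    "round\tscript\tcv_score\tlb\tstatus\tsubmitted\ttimestamp\tdescription\n"
    "1\tR01_experiment.py\t0.951\t0.948\tok\tyes\t...\tbaseline lgbm\n"
    "2\tR02_experiment.py\t0.957\t0.953\tok\tyes\t...\tcatboost blend\n"
)

def best_round(tsv_text, higher_is_better=True):
    # Pick the row with the best cv_score; direction comes from
    # metric_direction in config.json.
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    key = lambda r: float(r["cv_score"])
    return max(rows, key=key) if higher_is_better else min(rows, key=key)

print(best_round(RESULTS)["round"])  # 2
```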

Key Learnings (from prior campaigns)

See LEARNINGS.md for full details with evidence.

  1. Ridge stacking on correlated models hurts LB. If pairwise correlation > 0.995, use simple weighted blend.
  2. Diversity needs minimum metric threshold. Models must score within 0.003 of the best to contribute.
  3. Research > tuning. Community notebook research gave 10x more gain than hyperparameter tuning.
  4. Fold-1 kill gates save hours. Run fold-1 before committing to full K-fold.
  5. Simple blends generalise better. 2-4 diverse models with rank blending beats 20-model Ridge stack on LB.
  6. Research before building. Understanding what works for this problem class beats starting from defaults.
  7. The Reviewer catches false negatives. "Was this truly exhausted or just tried with wrong params?"
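Learnings 1 and 5 can be sketched together (illustrative, not repo code — the 0.995 threshold is the one quoted above, and the tie handling in the rank transform is deliberately crude):

```python
import numpy as np

def max_pairwise_corr(oofs: np.ndarray) -> float:
    # oofs: (n_models, n_samples) matrix of OOF predictions.
    c = np.corrcoef(oofs)
    iu = np.triu_indices_from(c, k=1)
    return float(c[iu].max())

def rank_blend(preds: np.ndarray, weights=None) -> np.ndarray:
    # Convert each model's predictions to ranks in [0, 1], then take a
    # weighted average. Rank blending is robust to differing score scales
    # (double argsort breaks ties arbitrarily, fine for a sketch).
    ranks = preds.argsort(axis=1).argsort(axis=1).astype(float)
    ranks /= preds.shape[1] - 1
    w = np.ones(len(preds)) / len(preds) if weights is None else np.asarray(weights)
    return (w[:, None] * ranks).sum(axis=0)

def blend(oofs, test_preds):
    # Learning 1: if models are near-duplicates (corr > 0.995), a Ridge
    # stack just fits noise — fall back to the simple blend.
    if max_pairwise_corr(oofs) > 0.995:
        return rank_blend(test_preds)
    raise NotImplementedError("stacking path omitted in this sketch")
```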

Acknowledgements

  • Andrej Karpathy — autoresearch concept: autonomous improvement loops via program.md
  • Battle-tested across 80 rounds on Kaggle Playground Series S6E3
