AIcrowd Forum - Latest posts
https://discourse.aicrowd.com

🏆 Final Results & Winner Announcement
:boom: Congratulations to everyone on your final rankings! :tada:
@RickySong good job, you deserved this :muscle:

https://discourse.aicrowd.com/t/final-results-winner-announcement/17849#post_4 Tue, 10 Mar 2026 12:43:15 +0000
🏆 Final Results & Winner Announcement
It seems that the evaluation was conducted very thoroughly.

Thank you for hosting such an interesting competition.

https://discourse.aicrowd.com/t/final-results-winner-announcement/17849#post_3 Tue, 10 Mar 2026 00:02:19 +0000
🏆 Final Results & Winner Announcement
https://discourse.aicrowd.com/t/final-results-winner-announcement/17849#post_2 Fri, 06 Mar 2026 11:09:44 +0000

🏆 Final Results & Winner Announcement
The Orak Game Agent Challenge 2025 has come to a close. Over the course of the challenge, 497 participants across 117 teams took part, collectively producing 685 submissions and steadily improving agent performance throughout the competition.

Orak is an open benchmark designed to test agentic LLM systems in real games. Participants submitted MCP-connected agents capable of consuming textual and visual state across several environments including Super Mario, Pokémon Red, StarCraft II, and 2048.
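For a sense of the general shape of such an agent, here is a minimal, hypothetical sketch of an LLM-driven decision loop. None of the names below (`env.get_state`, `env.apply_action`, `llm.complete`) are Orak's actual MCP interface; they stand in for whatever the benchmark exposes.

```python
# Hypothetical agent loop: read the game state, ask an LLM for one of the
# environment's predefined actions, apply it, repeat until the episode ends.
def run_episode(env, llm, max_steps=200):
    state = env.get_state()  # textual (and/or visual) observation, assumed API
    for _ in range(max_steps):
        prompt = (
            "You are playing a game. Current state:\n"
            f"{state}\n"
            "Reply with exactly one of the predefined actions."
        )
        action = llm.complete(prompt).strip()   # assumed LLM client call
        state, done = env.apply_action(action)  # assumed env interface
        if done:
            break
```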

Final Evaluation

Final standings were computed as a weighted average across four environments:

  • Pokémon: 0.30
  • StarCraft II: 0.30
  • Super Mario: 0.15
  • 2048: 0.15

The final evaluation includes hidden test cases designed to test generalisation, meaning final scores are typically lower than those observed on the live leaderboard.
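For concreteness, a minimal sketch of the weighting arithmetic (the dictionary layout is illustrative; the weights are the published ones):

```python
# Published environment weights for the final evaluation.
WEIGHTS = {"pokemon": 0.30, "sc2": 0.30, "mario": 0.15, "2048": 0.15}

def final_score(scores):
    """Weighted average of per-game scores; a missing game counts as 0."""
    return sum(w * scores.get(game, 0.0) for game, w in WEIGHTS.items())

# Example using emaeon's per-game results from the Winners section below:
# 0.15*0.020 + 0.15*0.000 + 0.30*0.143 + 0.30*1.000 = 0.3459, i.e. 0.346.
print(round(final_score({"2048": 0.020, "mario": 0.0,
                         "pokemon": 0.143, "sc2": 1.000}), 3))  # 0.346
```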

LLM Usage Threshold

The challenge evaluates LLM-powered agents. During the final evaluation, individual game scores were treated as non-qualifying (zeroed) if language model usage fell below a minimum threshold.

This ensures the rankings reflect meaningful LLM-driven decision making, rather than approaches where classical solvers or rule-based controllers dominate the agent’s behavior.
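A hedged sketch of how such a gate could be applied per game; the `MIN_LLM_USAGE` value and the `llm_usage` measurement are assumptions for illustration, since the post does not state the actual threshold:

```python
# Hypothetical per-game gating: a score only counts if the measured share
# of LLM-driven decisions meets a minimum threshold; otherwise it is zeroed.
MIN_LLM_USAGE = 0.2  # assumed value; the real threshold is not published here

def gated_score(score, llm_usage):
    """Return the raw game score, or 0.0 if LLM usage is below threshold."""
    return score if llm_usage >= MIN_LLM_USAGE else 0.0
```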

Integrity Review

All submissions were reviewed against the published competition rules and clarifications to ensure that results reflect generalisable agent behavior.

Disqualification decisions were based on one or more of the following categories:

Hidden-test overfitting via hardcoding
Submissions containing game-specific routes, coordinates, or scripted behaviours tied directly to the public evaluation environment.

Disallowed action interfaces
Creation of new high-level actions beyond the predefined functions provided by the environment.

Tool restriction bypass
Use of external tools or services beyond what is permitted under the competition rules.

Reproducibility or verification failure
Submissions that could not be reliably reproduced or verified using the required code, artifacts, and logs.

Evaluation Updates Applied Before Finalising Results

To ensure fairness and consistent scoring across teams, the organisers applied an update to the Pokémon environment before confirming final results: score normalisation was fixed to a consistent 0–1 scale, and reset handling was improved to clear milestone counters between episodes.
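As an illustration only (the post does not describe Pokémon's raw metric), a min-max style normalisation onto a 0–1 scale might look like this, with milestone counters cleared on reset:

```python
# Hypothetical normalisation of a raw milestone count onto [0, 1].
def normalise(milestones_reached, total_milestones):
    """Clamp a raw milestone count to a consistent 0-1 scale."""
    if total_milestones <= 0:
        return 0.0
    return max(0.0, min(milestones_reached / total_milestones, 1.0))

def reset_episode(counters):
    """Clear milestone counters between episodes, per the applied fix."""
    counters.clear()
```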

All final standings reflect results after these corrections were applied.

Winners

The final standings are based on the weighted evaluation described above.

Track 1: Lightweight (SLM ≤10B parameters)

| Rank | Team | 2048 | Mario | Pokémon | SC2 | Final Weighted Score |
| --- | --- | --- | --- | --- | --- | --- |
| :1st_place_medal: 1 | a-great-toe (yucheon, hwanggeumhwan, yujin_kim, kgb) | 0.860 | 0.186 | 0.095 | 0.000 | 0.185 |
| :2nd_place_medal: 2 | artist | 0.000* | 0.000* | 0.000 | 0.333 | 0.100 |
| :3rd_place_medal: 3 | Actrix | 0.181 | 0.236 | 0.000 | 0.000 | 0.063 |

Track 2: Open (No parameter limit)

| Rank | Team | 2048 | Mario | Pokémon | SC2 | Final Weighted Score |
| --- | --- | --- | --- | --- | --- | --- |
| :1st_place_medal: 1 | emaeon | 0.020 | 0.000* | 0.143 | 1.000 | 0.346 |
| :2nd_place_medal: 2 | RickySong | 0.000* | 0.218 | 0.286 | 0.333 | 0.218 |
| :3rd_place_medal: 3 | olawale_ibrahim | 0.001 | 0.177 | 0.000 | 0.333 | 0.127 |

(*) Score zeroed due to LLM usage below the required threshold in that game.

Thank you to everyone who participated and contributed submissions throughout the challenge. We appreciate the experimentation, engineering effort, and persistence that went into building agents capable of operating across diverse game environments.

We will be reaching out to the winning teams shortly regarding prize distribution and will also share follow-up insights from the challenge with the community.

https://discourse.aicrowd.com/t/final-results-winner-announcement/17849#post_1 Fri, 06 Mar 2026 11:03:46 +0000
Curious about the announcement of the results of the competition
Hello, the winner announcement will be made soon. The solutions are currently undergoing validation checks for cheating and malicious activity.

https://discourse.aicrowd.com/t/curious-about-the-announcement-of-the-results-of-the-competition/17844#post_2 Tue, 03 Mar 2026 09:09:39 +0000
Curious about the announcement of the results of the competition
The competition seems to have ended a long time ago, and the results are now well past the announced date.

Please let me know when the winner of the competition will be announced.

https://discourse.aicrowd.com/t/curious-about-the-announcement-of-the-results-of-the-competition/17844#post_1 Tue, 03 Mar 2026 04:52:52 +0000
Submission Failing Without Message
My submission passed the verify_submission check, but when I submitted it, it immediately failed without giving any reason why. Additionally, looking at old submissions, every submission before Oct 10, 2025 had a comment, but none have had one since. There also hasn’t been a successful submission since May 2025.

The challenge says it’s still open, but is it actually? Is the evaluator broken?

https://discourse.aicrowd.com/t/submission-failing-without-message/17842#post_1 Tue, 03 Mar 2026 02:42:56 +0000