AIcrowd Forum - Latest posts
https://discourse.aicrowd.com

🏆 Final Results & Winner Announcement
:boom: Congratulations to everyone on your final rankings! :tada:
@RickySong good job, you deserved this :muscle:

https://discourse.aicrowd.com/t/final-results-winner-announcement/17849#post_4 Tue, 10 Mar 2026 12:43:15 +0000
🏆 Final Results & Winner Announcement
It seems that the evaluation was conducted very thoroughly.

Thank you for hosting such an interesting competition.

https://discourse.aicrowd.com/t/final-results-winner-announcement/17849#post_3 Tue, 10 Mar 2026 00:02:19 +0000
🏆 Final Results & Winner Announcement
https://discourse.aicrowd.com/t/final-results-winner-announcement/17849#post_2 Fri, 06 Mar 2026 11:09:44 +0000

🏆 Final Results & Winner Announcement
The Orak Game Agent Challenge 2025 has come to a close. Over the course of the challenge, 497 participants across 117 teams took part, collectively producing 685 submissions and steadily improving agent performance throughout the competition.

Orak is an open benchmark designed to test agentic LLM systems in real games. Participants submitted MCP-connected agents capable of consuming textual and visual state across several environments including Super Mario, Pokémon Red, StarCraft II, and 2048.
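For a sense of the general shape of such an agent, here is a minimal, hypothetical sketch of an LLM-driven decision loop. None of the names below (`env.get_state`, `env.apply_action`, `llm.complete`) are Orak's actual MCP interface; they stand in for whatever the benchmark exposes.

```python
# Hypothetical agent loop: read the game state, ask an LLM for one of the
# environment's predefined actions, apply it, repeat until the episode ends.
def run_episode(env, llm, max_steps=200):
    state = env.get_state()  # textual (and/or visual) observation, assumed API
    for _ in range(max_steps):
        prompt = (
            "You are playing a game. Current state:\n"
            f"{state}\n"
            "Reply with exactly one of the predefined actions."
        )
        action = llm.complete(prompt).strip()   # assumed LLM client call
        state, done = env.apply_action(action)  # assumed env interface
        if done:
            break
```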

Final Evaluation

Final standings were computed as a weighted average across four environments:

  • Pokémon: 0.30
  • StarCraft II: 0.30
  • Super Mario: 0.15
  • 2048: 0.15

The final evaluation includes hidden test cases designed to test generalisation, meaning final scores are typically lower than those observed on the live leaderboard.
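For concreteness, a minimal sketch of the weighting arithmetic (the dictionary layout is illustrative; the weights are the published ones):

```python
# Published environment weights for the final evaluation.
WEIGHTS = {"pokemon": 0.30, "sc2": 0.30, "mario": 0.15, "2048": 0.15}

def final_score(scores):
    """Weighted average of per-game scores; a missing game counts as 0."""
    return sum(w * scores.get(game, 0.0) for game, w in WEIGHTS.items())

# Example using emaeon's per-game results from the Winners section below:
# 0.15*0.020 + 0.15*0.000 + 0.30*0.143 + 0.30*1.000 = 0.3459, i.e. 0.346.
print(round(final_score({"2048": 0.020, "mario": 0.0,
                         "pokemon": 0.143, "sc2": 1.000}), 3))  # 0.346
```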

LLM Usage Threshold

The challenge evaluates LLM-powered agents. During the final evaluation, individual game scores were treated as non-qualifying (zeroed) if language model usage fell below a minimum threshold.

This ensures the rankings reflect meaningful LLM-driven decision making, rather than approaches where classical solvers or rule-based controllers dominate the agent’s behavior.
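A hedged sketch of how such a gate could be applied per game; the `MIN_LLM_USAGE` value and the `llm_usage` measurement are assumptions for illustration, since the post does not state the actual threshold:

```python
# Hypothetical per-game gating: a score only counts if the measured share
# of LLM-driven decisions meets a minimum threshold; otherwise it is zeroed.
MIN_LLM_USAGE = 0.2  # assumed value; the real threshold is not published here

def gated_score(score, llm_usage):
    """Return the raw game score, or 0.0 if LLM usage is below threshold."""
    return score if llm_usage >= MIN_LLM_USAGE else 0.0
```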

Integrity Review

All submissions were reviewed against the published competition rules and clarifications to ensure that results reflect generalisable agent behavior.

Disqualification decisions were based on one or more of the following categories:

Hidden-test overfitting via hardcoding
Submissions containing game-specific routes, coordinates, or scripted behaviours tied directly to the public evaluation environment.

Disallowed action interfaces
Creation of new high-level actions beyond the predefined functions provided by the environment.

Tool restriction bypass
Use of external tools or services beyond what is permitted under the competition rules.

Reproducibility or verification failure
Submissions that could not be reliably reproduced or verified using the required code, artifacts, and logs.

Evaluation Updates Applied Before Finalising Results

To ensure fairness and consistent scoring across teams, the organisers applied an update to the Pokémon environment before confirming final results: score normalisation was fixed to a consistent 0–1 scale, and reset handling was improved to clear milestone counters between episodes.
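As an illustration only (the post does not describe Pokémon's raw metric), a min-max style normalisation onto a 0–1 scale might look like this, with milestone counters cleared on reset:

```python
# Hypothetical normalisation of a raw milestone count onto [0, 1].
def normalise(milestones_reached, total_milestones):
    """Clamp a raw milestone count to a consistent 0-1 scale."""
    if total_milestones <= 0:
        return 0.0
    return max(0.0, min(milestones_reached / total_milestones, 1.0))

def reset_episode(counters):
    """Clear milestone counters between episodes, per the applied fix."""
    counters.clear()
```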

All final standings reflect results after these corrections were applied.

Winners

The final standings are based on the weighted evaluation described above.

Track 1: Lightweight (SLM ≤10B parameters)

| Rank | Team | 2048 | Mario | Pokémon | SC2 | Final Weighted Score |
| --- | --- | --- | --- | --- | --- | --- |
| :1st_place_medal: 1 | a-great-toe (yucheon, hwanggeumhwan, yujin_kim, kgb) | 0.860 | 0.186 | 0.095 | 0.000 | 0.185 |
| :2nd_place_medal: 2 | artist | 0.000* | 0.000* | 0.000 | 0.333 | 0.100 |
| :3rd_place_medal: 3 | Actrix | 0.181 | 0.236 | 0.000 | 0.000 | 0.063 |

Track 2: Open (No parameter limit)

| Rank | Team | 2048 | Mario | Pokémon | SC2 | Final Weighted Score |
| --- | --- | --- | --- | --- | --- | --- |
| :1st_place_medal: 1 | emaeon | 0.020 | 0.000* | 0.143 | 1.000 | 0.346 |
| :2nd_place_medal: 2 | RickySong | 0.000* | 0.218 | 0.286 | 0.333 | 0.218 |
| :3rd_place_medal: 3 | olawale_ibrahim | 0.001 | 0.177 | 0.000 | 0.333 | 0.127 |

(*) Score zeroed due to LLM usage below the required threshold in that game.

Thank you to everyone who participated and contributed submissions throughout the challenge. We appreciate the experimentation, engineering effort, and persistence that went into building agents capable of operating across diverse game environments.

We will be reaching out to the winning teams shortly regarding prize distribution and will also share follow-up insights from the challenge with the community.

https://discourse.aicrowd.com/t/final-results-winner-announcement/17849#post_1 Fri, 06 Mar 2026 11:03:46 +0000
Curious about the announcement of the results of the competition
Hello, the winner announcement will be made soon. The solutions are currently undergoing validation checks for cheating and malicious activity.

https://discourse.aicrowd.com/t/curious-about-the-announcement-of-the-results-of-the-competition/17844#post_2 Tue, 03 Mar 2026 09:09:39 +0000
Curious about the announcement of the results of the competition
The competition seems to have ended a long time ago, and the results are now well past the announced date.

Please let me know when the winner of the competition will be announced.

https://discourse.aicrowd.com/t/curious-about-the-announcement-of-the-results-of-the-competition/17844#post_1 Tue, 03 Mar 2026 04:52:52 +0000
Submission Failing Without Message
My submission passed the verify_submission check, but when I submitted it, it immediately failed without giving any reason why. Additionally, looking at old submissions, every submission before Oct 10, 2025 had a comment, but none have had one since. There also hasn’t been a successful submission since May 2025.

The challenge says it’s still open, but is it actually? Is the evaluator broken?

https://discourse.aicrowd.com/t/submission-failing-without-message/17842#post_1 Tue, 03 Mar 2026 02:42:56 +0000