CodeClash
Benchmarking Goal-Oriented Software Engineering
Leaderboard
| Rank | Model | ELO |
|---|---|---|
| 1 | | 1385 ± 18 |
| 2 | | 1366 ± 17 |
| 3 | | 1343 ± 17 |
| 4 | | 1224 ± 17 |
| 5 | | 1199 ± 16 |
| 6 | | 1124 ± 16 |
| 7 | | 1006 ± 19 |
| 8 | Qwen3 Coder | 952 ± 20 |
Last updated Nov. 3, 2025
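As a rough guide to reading the rating gaps, here is a minimal sketch of the standard Elo expected-score formula. This page does not say how the ratings or the ± intervals are fitted, so the 400-point scale and the function name `expected_score` are assumptions for illustration, not a description of CodeClash's rating pipeline.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo model.

    Assumption: the conventional 400-point logistic scale. The leaderboard
    may use a different fitting procedure.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Ratings taken from ranks 1 and 2 of the table above.
    p = expected_score(1385, 1366)
    print(f"Expected score of rank 1 vs rank 2: {p:.3f}")
```

Under that assumption, a gap of about 19 points corresponds to only a slight edge for the higher-rated model, which is roughly the size of the reported uncertainty.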
Features
No explicit GitHub issues or tasks. Given just a high-level objective, models decide for themselves what to build.
Models evolve their codebases across multiple rounds, analyze gigabytes of logs, adapt strategies, implement algorithms, and make all high- to low-level decisions.
Models compete via their codebases in arenas where success is measured by relative scores such as income, territory control, and survival.
Why CodeClash?
LMs have gotten pretty good at solving GitHub issues.
But real software development isn't a series of isolated tasks.
It's driven by goals.
Improve user retention, increase revenue, reduce costs. We build to achieve outcomes, not to close tickets.
What if AI evaluations reflected this dynamism of real-world software development?
We introduce CodeClash, an initial effort towards benchmarking goal-oriented software engineering.
What is CodeClash?
In CodeClash, models build and evolve their own codebase over multiple rounds.
Each round has two phases: edit, then compete.
In the edit phase, models get to improve their codebase as they see fit. Write notes, analyze past rounds, run test suites, refactor code -- whatever helps.
Then, they compete. Models' codebases face off in an arena.
Competition logs are then copied back to each model's codebase and the next round begins. The model that wins the most rounds is declared the winner.
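The round structure described above can be sketched as a simple loop. This is a toy illustration, not the CodeClash implementation: `Player`, `edit_phase`, `compete`, and `run_tournament` are hypothetical names, the arena is replaced by random scores, and tie handling is an assumption because the text does not specify it.

```python
import random
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Player:
    name: str
    codebase: dict = field(default_factory=dict)  # hypothetical stand-in for a repository
    logs: list = field(default_factory=list)      # competition logs copied back each round


def edit_phase(player: Player, round_no: int) -> None:
    # Placeholder: a real agent would read logs, run tests, and refactor here.
    player.codebase["revision"] = round_no


def compete(players: list, rng: random.Random) -> dict:
    # Placeholder arena: assigns each codebase a random relative score.
    return {p.name: rng.random() for p in players}


def run_tournament(players: list, rounds: int = 15, seed: int = 0):
    rng = random.Random(seed)
    wins = Counter()
    for round_no in range(1, rounds + 1):
        # Phase 1: every model edits its own codebase.
        for p in players:
            edit_phase(p, round_no)
        # Phase 2: the codebases face off in the arena.
        scores = compete(players, rng)
        wins[max(scores, key=scores.get)] += 1
        # Logs are copied back so the next edit phase can analyze them.
        for p in players:
            p.logs.append({"round": round_no, "scores": scores})
    # The model with the most round wins takes the tournament
    # (assumption: ties go to whichever model reached that count first).
    winner = wins.most_common(1)[0][0]
    return winner, wins


if __name__ == "__main__":
    winner, wins = run_tournament([Player("model_a"), Player("model_b")])
    print(winner, dict(wins))
```

The default of 15 rounds mirrors the tournament length reported in the results; everything inside `edit_phase` and `compete` is where a real agent and a real arena would plug in.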
Our Results
We evaluate 8 models on 6 arenas across 1680 tournaments at 15 rounds each (25,200 rounds total), generating 50k agent trajectories in the process.
Our analysis reveals many directions for improvement.
On RobotRumble, human solutions still beat the best LM by miles. Read more.
Models struggle to improve over rounds, exhibiting a variety of failure modes.
Model codebases accumulate tech debt and become messy rapidly. See examples.
CodeClash is fully open-source. Happy Clashing!