tl;dr We introduce boss battles as a new format for evaluating LMs' coding and reasoning capabilities.
We pitted Claude 4.5 Sonnet against GigaChad in RobotRumble and found that today's best coding models still struggle heavily to develop suboptimal codebases into ones that rival the best human-written solutions.
Inspired by this finding, we introduce CC:Ladder, a twist that makes evaluating LMs as competitive, long-horizon software developers hill-climbable and cheaper.
How it works
In CC:Ladder, models begin against the weakest human solution and must win a majority of n rounds to advance to increasingly stronger opponents; evaluation is determined by the highest-ranked opponent defeated.

Some key details:
- Models start with a codebase containing the weakest opponent's solution.
- Models play n rounds against an opponent, where n >= 3 and n is odd.
- A model "advances" to the next opponent if it wins at least (n+1)/2 rounds and it wins the last round.
- If a model advances, its codebase carries over. In other words, a model's codebase at the start of round 0 against opponent rank 60 is the same as the codebase at the end of round 5 against opponent rank 61. The model's codebase does not get reset to the initial state.
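The advancement rule above can be sketched in a few lines of Python. This is a minimal illustration, not the CodeClash implementation: the function names `advances` and `highest_defeated` are hypothetical, and the sketch assumes that a climb stops at the first opponent the model fails to beat, which the rules above do not state explicitly.

```python
from typing import Dict, List, Optional, Sequence


def advances(round_wins: Sequence[bool]) -> bool:
    """Return True if the model advances past one opponent.

    round_wins holds one boolean per round (True = the model won that round).
    The model must win a majority of the n rounds AND win the last round.
    """
    n = len(round_wins)
    if n < 3 or n % 2 == 0:
        raise ValueError("n must be odd and at least 3")
    majority = (n + 1) // 2
    return sum(round_wins) >= majority and bool(round_wins[-1])


def highest_defeated(
    ladder: List[str], outcomes: Dict[str, Sequence[bool]]
) -> Optional[str]:
    """Walk the ladder from weakest to strongest opponent.

    Assumption (not stated in the post): the climb ends at the first
    opponent the model does not advance past.
    """
    best = None
    for opponent in ladder:
        wins = outcomes.get(opponent)
        if wins is None or not advances(wins):
            break
        best = opponent
    return best


if __name__ == "__main__":
    # Majority won, but the last round was lost: no advancement.
    print(advances([True, True, False]))
    # Majority won, including the last round: advancement.
    print(advances([False, True, True]))
```

Requiring a win in the final round means a model cannot advance on early wins alone; the codebase it carries forward must be one that actually beat the current opponent.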
CC:Ladder has several advantages over the default Elo leaderboard.
- Hill-climbable: See how far up the rankings a model can go. Better models achieve higher rankings.
- Cheaper: The model competes against static human solutions. No need to spend $$ to run another LM as an opponent.
- Less noise: Again, because the opponent is a static human solution.
- Long Horizon: To beat the ladder, models must play m opponents * n rounds per opponent, where m=58 for RobotRumble and m=264 for Core War.
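To put the long-horizon point in numbers, here is a quick calculation. It assumes n = 5 rounds per opponent (the value used in the example config in this post; other odd values of n >= 3 are possible) and that all n rounds are played against every opponent. The helper name `total_rounds` is made up for illustration.

```python
# Number of human opponents (m) per arena, as stated in the post.
LADDER_SIZES = {"RobotRumble": 58, "CoreWar": 264}


def total_rounds(num_opponents: int, rounds_per_opponent: int) -> int:
    """Rounds needed to play every opponent on the ladder once (m * n)."""
    return num_opponents * rounds_per_opponent


if __name__ == "__main__":
    for arena, m in LADDER_SIZES.items():
        # Assumed n = 5, taken from the example config.
        print(arena, total_rounds(m, rounds_per_opponent=5))
    # prints:
    # RobotRumble 290
    # CoreWar 1320
```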
Building CC:Ladder
Putting together a ladder for a CodeClash arena depends entirely on how many open-source, human-written solutions are available on the web.
- For RobotRumble, we found 58 open-source implementations on the public leaderboard.
- For Core War, we found 264 open-source implementations by manually crawling the Core War online directory.
Given a solution, we (1) check that the solution compiles and runs properly, then (2) push the solution as a branch (named human/<name> or human/<author>/<name>) to the corresponding repository (branches for Core War, RobotRumble).
We currently execute this workflow manually.
Ping us in Slack if you'd be interested in automating this process or putting together a new ladder for a different arena!
Initial Findings
Part 1: Ranking human-written solutions
Given an arena's m solutions, we make every unique pair of solutions compete t times.
- t=250 for RobotRumble
- t=4000 for Core War
t differs between the two arenas solely because of compute constraints: Core War simulations run more quickly than RobotRumble simulations.
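As a rough sense of scale, the round-robin described above can be enumerated with the standard library. This is only a sketch of the pairing scheme and the resulting workload, not the scheduling code actually used; `round_robin_pairs` and `total_games` are hypothetical names.

```python
from itertools import combinations
from math import comb
from typing import List, Sequence, Tuple


def round_robin_pairs(solutions: Sequence[str]) -> List[Tuple[str, str]]:
    """Every unique, unordered pair of solutions."""
    return list(combinations(solutions, 2))


def total_games(num_solutions: int, t: int) -> int:
    """Head-to-head games when each unique pair competes t times."""
    return comb(num_solutions, 2) * t


if __name__ == "__main__":
    print(round_robin_pairs(["a", "b", "c"]))
    # prints [('a', 'b'), ('a', 'c'), ('b', 'c')]
    print(total_games(58, 250))    # RobotRumble: prints 413250
    print(total_games(264, 4000))  # Core War: prints 138864000
```

The number of pairs grows quadratically with the number of solutions, so the Core War ladder involves far more pairings (34,716) than the RobotRumble ladder (1,653).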
Then, we compute each solution's Elo and determine the rankings.
Elo ratings are computed by fitting a Bradley-Terry model to the pairwise win matrix via maximum likelihood estimation with L2 regularization.
We set the regularization strength to 0.01 and use a base Elo of 1200 with a slope of 400 to convert log-odds strengths to interpretable ratings.
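For readers who want to reproduce this kind of rating, below is a minimal pure-Python sketch of an L2-regularized Bradley-Terry fit. Several things here are assumptions rather than the authors' code: the optimizer (plain gradient ascent with a conservative step size), the conversion to Elo (the common convention of 400 points per factor of 10 in win odds, i.e. `base + slope * strength / ln(10)`), the omission of ties, and the names `fit_bradley_terry` and `to_elo`.

```python
import math
from typing import List


def fit_bradley_terry(
    wins: List[List[float]], l2: float = 0.01, iters: int = 20000, tol: float = 1e-9
) -> List[float]:
    """Fit log-odds strengths by maximizing the L2-regularized likelihood.

    wins[i][j] is the number of games solution i won against solution j.
    Model: P(i beats j) = 1 / (1 + exp(theta_j - theta_i)). Ties are ignored.
    """
    n = len(wins)
    games = [[wins[i][j] + wins[j][i] for j in range(n)] for i in range(n)]
    # The curvature of the objective is at most 0.5 * (max games per player) + l2,
    # so this step size keeps gradient ascent stable.
    step = 1.0 / (0.5 * max(sum(row) for row in games) + l2)
    theta = [0.0] * n
    for _ in range(iters):
        grad = []
        for i in range(n):
            g = -l2 * theta[i]  # gradient of the L2 penalty
            for j in range(n):
                if i != j and games[i][j]:
                    p_win = 1.0 / (1.0 + math.exp(theta[j] - theta[i]))
                    g += wins[i][j] - games[i][j] * p_win
            grad.append(g)
        theta = [t + step * g for t, g in zip(theta, grad)]
        if max(abs(g) for g in grad) < tol:
            break
    return theta


def to_elo(theta: List[float], base: float = 1200.0, slope: float = 400.0) -> List[float]:
    """Assumed conversion: `slope` Elo points per factor of 10 in win odds."""
    return [base + slope * t / math.log(10) for t in theta]


if __name__ == "__main__":
    # Toy win matrix for three solutions A, B, C (made-up numbers).
    wins = [
        [0, 7, 8],
        [3, 0, 6],
        [2, 4, 0],
    ]
    for name, rating in zip("ABC", to_elo(fit_bradley_terry(wins))):
        print(name, round(rating, 1))
```

The regularization term keeps the strengths finite even when one solution wins or loses every game against another, which would otherwise push the maximum-likelihood estimate towards infinity.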
For Core War, the top ten:
- human/toxic: 1408.7
- human/forjohn: 1401.9
- human/maelstrom: 1396.0
- human/silkworm: 1392.2
- human/returnofthefugitive: 1386.1
- human/unheardof: 1385.3
- human/devilstick: 1384.7
- human/mascafe: 1379.6
- human/cloudburst: 1376.9
- human/decoysignal: 1372.2
Full Core War rankings:
- human/toxic: 1408.7
- human/forjohn: 1401.9
- human/maelstrom: 1396.0
- human/silkworm: 1392.2
- human/returnofthefugitive: 1386.1
- human/unheardof: 1385.3
- human/devilstick: 1384.7
- human/mascafe: 1379.6
- human/cloudburst: 1376.9
- human/decoysignal: 1372.2
- human/chainlockv02a: 1370.0
- human/burningmetal: 1367.7
- human/defensive: 1365.0
- human/firestorm: 1364.8
- human/dawn2: 1362.2
- human/mercenary: 1361.5
- human/pdqscan: 1358.1
- human/lastjudgement: 1351.7
- human/rust: 1350.8
- human/snowscan: 1350.6
- human/frothandfizzle: 1346.6
- human/thefugitive: 1346.3
- human/blackknight: 1342.6
- human/sonofvain: 1340.3
- human/dawn: 1339.8
- human/goldeneye: 1335.4
- human/silking: 1332.1
- human/artofcorewar: 1331.9
- human/blowrag: 1329.2
- human/returnofthejedimp: 1326.9
- human/danceoffallenangels: 1324.6
- human/azathoth: 1320.9
- human/kosmos: 1319.4
- human/simplicity: 1314.0
- human/armadillo: 1313.3
- human/combatra: 1313.2
- human/cinammon: 1309.9
- human/returnofthependragon: 1306.9
- human/numb: 1305.0
- human/neith: 1304.3
- human/halcyon: 1303.2
- human/olivia: 1303.2
- human/reepicheep: 1301.3
- human/hullab3loo: 1301.0
- human/npaperii: 1300.7
- human/elvenking: 1298.3
- human/gargantuan: 1297.8
- human/mandragora: 1296.4
- human/safetyinnumbers: 1295.4
- human/hullabaloo: 1290.9
- human/eccentric: 1290.0
- human/thunderstrike: 1289.6
- human/impishv02: 1289.2
- human/ziggy: 1289.0
- human/stylizedeuphoria: 1288.7
- human/ironicimps: 1287.6
- human/gigolo: 1286.8
- human/gremlin: 1285.1
- human/borgir: 1283.6
- human/unrequitedlove: 1279.4
- human/themystery: 1278.0
- human/spiritualblackdimension: 1276.2
- human/recycledbits: 1273.1
- human/jade: 1272.7
- human/luca: 1268.9
- human/vain: 1268.8
- human/bitethebullet: 1268.3
- human/disharmonious: 1267.6
- human/uninvited: 1267.6
- human/revengeofthepapers: 1267.4
- human/bulldozed: 1265.7
- human/diehard: 1264.2
- human/nighttrain: 1263.0
- human/blacken: 1262.7
- human/sunset: 1261.6
- human/devilish202: 1261.4
- human/retroq: 1259.8
- human/evolcap66: 1259.3
- human/fixed: 1258.7
- human/nemesis: 1258.5
- human/ompega: 1258.2
- human/stormkeeper: 1256.1
- human/quicksilver: 1255.7
- human/slimetest: 1255.3
- human/rosebud: 1255.2
- human/bluecandle: 1253.0
- human/riseofthedragon: 1252.6
- human/kryptonite: 1250.0
- human/digitalis2003: 1245.4
- human/freighttrain: 1245.4
- human/electricrazor: 1244.8
- human/forgottenlore2: 1244.3
- human/timescape10: 1243.4
- human/revivalfire: 1240.3
- human/hellfire: 1239.7
- human/nightterrors: 1238.1
- human/thehistorian: 1236.9
- human/borg: 1236.7
- human/falconv03: 1236.2
- human/torment: 1234.1
- human/impfinityv4g1: 1232.7
- human/behemot: 1230.5
- human/returnofvanquisher: 1229.9
- human/forgottenlore: 1228.4
- human/sputnik: 1228.3
- human/unpitq: 1227.8
- human/vanquisher: 1227.7
- human/blade: 1227.2
- human/arrow: 1225.5
- human/electrichead: 1225.2
- human/lithobolia: 1224.1
- human/enigma: 1223.8
- human/valkyrie: 1223.5
- human/hazylazy: 1223.3
- human/shottonothing: 1222.1
- human/bigitalshot: 1221.9
- human/hazylazyc11: 1221.5
- human/alladinscave: 1220.8
- human/dust07: 1220.6
- human/unpit: 1219.5
- human/herbalavenger: 1219.3
- human/grendelsrevenge: 1218.8
- human/fireandice: 1218.5
- human/whitemist: 1218.3
- human/macromagic: 1218.0
- human/xenosmilus: 1217.3
- human/hector2: 1215.3
- human/oblivion: 1214.1
- human/bpanamax: 1213.9
- human/carmilla: 1213.4
- human/excalibur: 1213.3
- human/simple88v2: 1212.9
- human/kusanagi: 1212.8
- human/perseus: 1211.7
- human/barrage: 1211.1
- human/jackinthebox: 1210.4
- human/discord: 1209.7
- human/boysarebackintown: 1208.8
- human/nosferatu: 1208.1
- human/pendulum: 1207.4
- human/jinx: 1207.0
- human/vampsareback02: 1205.1
- human/zooom: 1204.8
- human/sprawlingchaos: 1204.7
- human/eternalexile: 1204.5
- human/bloodlust: 1204.1
- human/curseoftheundead: 1203.9
- human/recon2: 1201.0
- human/jackintheboxii: 1200.5
- human/blizzard: 1199.8
- human/hazyshadeii: 1199.0
- human/sneakyb2: 1198.8
- human/labomba: 1198.8
- human/bluefunk3: 1198.3
- human/lithium: 1197.8
- human/damageincorporated: 1197.6
- human/torcht18: 1197.0
- human/probe: 1196.3
- human/intotheunknown: 1195.6
- human/grilledoctopus05: 1194.4
- human/yogibear: 1193.5
- human/infiltrator: 1193.1
- human/myvamp54: 1192.5
- human/claw: 1192.4
- human/stoninc: 1192.2
- human/chameleon: 1191.7
- human/thenextstep88: 1191.3
- human/julietandpaper: 1190.4
- human/stalker: 1189.8
- human/zygote: 1189.7
- human/tnt: 1189.1
- human/bayonet: 1188.4
- human/mason20: 1185.1
- human/tornado30: 1184.8
- human/bluefunk: 1184.6
- human/myvamp37: 1184.3
- human/onebite: 1183.8
- human/icedragon: 1182.6
- human/win: 1181.2
- human/soldieroffortune: 1179.0
- human/mirage15: 1178.8
- human/mirage2: 1178.7
- human/nightofthelivingdead: 1178.7
- human/flurry: 1177.2
- human/blur2: 1176.4
- human/blur: 1175.3
- human/thermiteii: 1175.2
- human/gemoftheocean: 1173.9
- human/replicant: 1172.5
- human/vamp02b: 1171.2
- human/aeka: 1170.6
- human/quiz: 1167.8
- human/gothik: 1164.0
- human/evoltmp88: 1162.1
- human/twister: 1161.1
- human/agonyii: 1158.8
- human/steppingstone: 1157.2
- human/abomination: 1155.6
- human/phq: 1155.3
- human/beholderseye17: 1150.3
- human/armorya5: 1149.9
- human/foggyswamp: 1149.9
- human/elementaldust2: 1149.5
- human/heremscimitar: 1149.2
- human/pacman: 1148.8
- human/leviathan: 1146.3
- human/chimerav35: 1146.0
- human/leapfrog: 1144.4
- human/snake: 1143.9
- human/irongate: 1141.6
- human/fatexpansionv: 1138.7
- human/seventyfive: 1137.6
- human/kitchensinkii: 1136.9
- human/cannonade: 1133.5
- human/lucky3: 1133.3
- human/winterwerewolf3: 1133.0
- human/blur88: 1132.1
- human/leprechaunonspeed: 1130.5
- human/stasis: 1130.1
- human/agony51: 1128.4
- human/ttti: 1127.0
- human/thermite10: 1124.5
- human/capskeyisstuck: 1124.2
- human/sj4a: 1123.4
- human/medusasv7x: 1122.7
- human/ncdecoy: 1122.2
- human/agony31: 1122.2
- human/hordesofmicrowarriors: 1121.1
- human/sphinxv28: 1118.6
- human/rave: 1115.5
- human/keystonet13: 1113.6
- human/charonv81: 1113.2
- human/leprechaun1b: 1106.0
- human/nomuckingabout: 1096.6
- human/charonv70: 1095.4
- human/bscannersliveinvain: 1094.9
- human/crimp2: 1092.1
- human/crimp: 1090.7
- human/killerinstinct: 1088.4
- human/imprimis6: 1084.4
- human/griffin2: 1083.7
- human/requestv20: 1076.7
- human/impurge: 1067.2
- human/backstabber: 1066.2
- human/0stormbringer: 1065.0
- human/twilightpitsv60: 1060.2
- human/fastfoodv21: 1056.8
- human/flashpaper: 1046.7
- human/flashpaper37: 1045.9
- human/gammapaper30: 1045.4
- human/flypaper30: 1040.7
- human/hydra: 1026.4
- human/precipice: 1025.0
- human/trinity: 1022.7
- human/paratroopsv21: 1017.9
- human/genocide: 1015.6
- human/vagabond: 1001.0
- human/notepaper: 967.6
- human/returnofthelivingdead: 955.5
- human/smoothnoodlemap6: 909.9
- human/smoothnoodlemap: 887.8
- human/dwarf: 864.3
- human/validate: 344.1
- human/pspace: -889.5
For RobotRumble, the top ten:
- human/entropicdrifter/gigachad: 3219.0
- human/entropicdrifter/seven-of-nine: 2627.3
- human/entropicdrifter/we-are-borg: 2560.0
- human/entropicdrifter/glommerv2: 2456.8
- human/mousetail/coward-bot: 2326.5
- human/entropicdrifter/glommer: 2250.2
- human/mitch84/crw_preempt: 2109.9
- human/mitch84/retreat_walk2: 2040.6
- human/devchris/black_magic: 2001.7
- human/tabaxi3k/black-magic-1: 1994.3
Full RobotRumble rankings:
- human/entropicdrifter/gigachad: 3219.0
- human/entropicdrifter/seven-of-nine: 2627.3
- human/entropicdrifter/we-are-borg: 2560.0
- human/entropicdrifter/glommerv2: 2456.8
- human/mousetail/coward-bot: 2326.5
- human/entropicdrifter/glommer: 2250.2
- human/mitch84/crw_preempt: 2109.9
- human/mitch84/retreat_walk2: 2040.6
- human/devchris/black_magic: 2001.7
- human/tabaxi3k/black-magic-1: 1994.3
- human/mitch84/walk_retreat: 1968.8
- human/jammyliu/sixty-nine-line: 1889.7
- human/atl15/centerrr: 1838.2
- human/clay/diag-lattice: 1719.0
- human/gerenuk/gere-ape: 1712.4
- human/wolfsleuth/simple: 1656.1
- human/essickmango/pickle-up: 1655.9
- human/mkap/test: 1638.9
- human/ketza/arthur: 1624.4
- human/mountain/neuralbot4-3h: 1622.5
- human/aaoutkine/silo34: 1618.6
- human/anton/om-om: 1594.2
- human/mee42/follow-bot: 1594.1
- human/lanity/sivuy: 1593.7
- human/underscore/bot1: 1589.8
- human/mario31313/alpha_13: 1588.9
- human/thesmilingturtl/naivefaa: 1587.8
- human/aaoutkine/school-bot: 1570.6
- human/suddenlyseals/control-center: 1551.4
- human/ketza/bob: 1543.2
- human/mjburgess/rule99: 1499.7
- human/kalkin/maxad: 1498.1
- human/mousetail/genetic-robot: 1493.7
- human/edward/flail: 1477.2
- human/aayyad/testbot: 1427.0
- human/anton/anton4000: 1397.8
- human/luisa/baselinegere: 1226.0
- human/luisa/luisasrobot: 1223.1
- human/jay0jayjay/naivestarter: 1168.3
- human/aaa/jippty5: 1032.3
- human/devchris/first_test: 940.9
- human/tabaxi3k/charles: 936.3
- human/essickmango/fruity-test: 935.9
- human/sbasu3/meek-bot: 499.4
- human/jiricodes/jiricodes-bot: 400.0
- human/navster8/maginot-line: 397.3
- human/kalkin/artemis2: 390.0
- human/kalkin/artemis: 340.7
- human/mountain/neuralbot2-6h: 331.4
- human/sivecano/clouded-mind: 75.9
- human/mountain/neuralbot1-1h: 23.5
- human/aaoutkine/dark-knight: -55.6
- human/navster8/bash-brothers: -496.0
- human/ldang/nemo: -496.7
- human/ldang/nessy: -538.5
- human/anton/wallifier: -911.3
- human/happysquid/test: -1624.4
- human/anton/anton3000: -1736.7
Part 2: How high do current models climb?
On Core War
Claude Opus 4.5 reaches [coming soon]
GPT 5.2 (medium thinking) reaches [coming soon]
Gemini 3 Pro reaches [coming soon]
On RobotRumble
Claude Opus 4.5 reaches [coming soon]
GPT 5.2 (medium thinking) reaches [coming soon]
Gemini 3 Pro reaches [coming soon]
How to run?
Run your model against CC:Ladder today.
Set up CodeClash and run uv run python ladder.py configs/ladder/<arena>.yaml, where <arena>.yaml specifies (using Core War as the example arena):
tournament:
  rounds: 5  # Number of rounds the model plays against each opponent
game:
  name: CoreWar
  sims_per_round: 1000
  args: {}
player:
  agent: mini
  name: claude-sonnet-4-5-20250929
  config:
    agent: !include mini/default.yaml
    model:
      model_name: '@anthropic/claude-sonnet-4-5-20250929'
      model_kwargs:
        temperature: 0.2
        max_tokens: 4096
Relationship between CC:Ladder & CodeClash
For Pokémon fans, CC:Ladder is the equivalent of the Elite Four battles (and for the real aficionados, CC:Ladder is inspired heavily by the Trainer Tower).
CodeClash is the real-world Video Game Championships, where individuals compete against other humans (not a static bot).

As with the Elite Four, CC:Ladder tests progression against fixed opponents, whereas CodeClash reflects real competition by measuring performance against intelligent competitors.
We recommend CC:Ladder be treated as a proper evaluation as well.
Much as SWE-bench Lite and Verified were created as easier subsets of SWE-bench, we think of CC:Ladder as the more approachable evaluation, while CodeClash remains the north-star evaluation.
Competing against dynamic, intelligent opponents is more challenging than competing against static solutions.
However, given the rather dismal current state of models' ability to code against smart rivals across a long horizon, we introduce CC:Ladder as a stepping stone towards such capabilities.