SkylineBench: an AI agent benchmark

Why I built this

Most agent benchmarks have a right answer. This one doesn't.

I have a theory: agents are bad at the second-order consequences of their own actions. I keep running into the same failure in my own engineering work. The moment an agent believes it has a solution, it stops thinking. It ships the fix and never asks what else the fix touched.

A city is about the cruelest test of that I could think of, because in a city everything is connected.

Widen a road → more cars → more noise → residents leave → shops close → no traffic, no city

The agent that widened the road got exactly what it asked for and lost the city doing it. That cascade is the whole point.

The benchmark isn't really asking whether an agent can read a congestion number and bring it down. It's asking whether the agent keeps reasoning after it thinks it's done.

How it works

The agent plays the game through tools, the same moves a human player has.

It looks at the map, inspects the traffic on any road, traces where cars are actually going, then bulldozes, builds, upgrades roads, and rezones. It can pause time, make a batch of changes, and step the simulation forward to watch what they do. It gets a few hours of wall-clock time, then submits and walks away.

Observe

get_city_overview
observe_area
render_map
get_metrics

Act

build_road
bulldoze
upgrade_road
set_zoning

Reference

list_road_types
list_zone_types

Control

control_time
reset_scenario

A handful of deliberate choices decide what it's really being tested on.

01

It never sees the score

The agent is told, in plain language, to make traffic flow better while keeping the city somewhere people want to live. It is never shown the formula, the weights, or the thresholds. There's no scoreboard to play to. The only way to score well is to leave the city better than it found it.

02

It can't win by bulldozing the city

Congestion has a trivial solution: demolish everything until there's no one left to drive. So the whole score is multiplied by a health factor tied to population. Let the city hollow out and your gains evaporate with the residents. The two pressures pull against each other on purpose.

03

It has to slow down

Traffic doesn't re-route the instant you change a road. It gets worse for a while as cars find the new layout, then settles. A good change and a bad change look identical for the first few steps, so the agent has to tell a settling transient apart from real damage instead of reacting to the first number it sees. Patience is part of the test.
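One way to picture the patience this asks for: keep stepping the simulation until the congestion reading stops moving, and only then compare it with the value from before the change. The sketch below is illustrative only. `has_settled`, `judge_change`, the window size, the tolerance, and the sample readings are all assumptions, not part of the benchmark or its tools.

```python
from typing import Sequence


def has_settled(readings: Sequence[float], window: int = 5, tolerance: float = 0.02) -> bool:
    """True once the last `window` readings stay within `tolerance` of each other.

    `readings` is a hypothetical series of congestion values, one per sim step.
    """
    if len(readings) < window:
        return False
    recent = readings[-window:]
    return max(recent) - min(recent) <= tolerance


def judge_change(before: float, readings: Sequence[float]) -> str:
    """Classify a change only after the transient has died down."""
    if not has_settled(readings):
        return "wait"  # cars are still re-routing: too early to judge
    return "improved" if readings[-1] < before else "worse"


# Congestion spikes right after the change, then settles below the old level.
good = [0.50, 0.62, 0.58, 0.49, 0.41, 0.40, 0.40, 0.41, 0.40]
# Same early spike, but this one never recovers.
bad = [0.50, 0.62, 0.63, 0.64, 0.64, 0.65, 0.64, 0.64, 0.65]

print(judge_change(0.50, good[:4]))  # wait
print(judge_change(0.50, bad[:4]))   # wait
print(judge_change(0.50, good))      # improved
print(judge_change(0.50, bad))       # worse
```

After four steps the two changes are indistinguishable. Only the settled readings separate them.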

04

It can't read the answer key

The agent runs inside a sandbox that blocks it from reading this repository, so it can't inspect the scoring code. It can only play the game through the tools. An early, unsandboxed run did read the scoring code, which is why the sandbox exists.

Scoring

A formula the operator can see, and the agent never can.

The prompt frames the task as "optimise this city's traffic simulation" and states its objectives qualitatively. The agent is deliberately not told the formula, the weights, the caps, or the population thresholds, so it optimises the city, not the scoreboard.

score.json · composite hidden from agent

score =
(0.60·congestion_reward
+ 0.20·(1−norm(money))
+ 0.20·(1−norm(changes)))
· health

congestion_reward: blend of metres-reduced and congested-junctions-reduced (0.5 / 0.5).

congested: road density ≥ 0.7; a junction of degree ≥ 3 with ≥ 2 congested segments.

health: graded population factor: 1.0 at ≥ 95% of baseline, 0.0 at ≤ 75%, linear between.

norm: money against a $10M budget; changes against a 300-change cap.
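As a worked example, the composite above can be sketched in a few lines of Python. This is not the harness code (that lives in the Rust broker). The weights, budget, cap, and health thresholds come from the definitions above; the exact shape of `norm` (value divided by cap, clamped to [0, 1]) and the assumption that `congestion_reward` lies in [0, 1] are my guesses.

```python
MONEY_BUDGET = 10_000_000  # $10M budget, from the definitions above
CHANGE_CAP = 300           # 300-change cap, from the definitions above


def norm(value: float, cap: float) -> float:
    """Assumed normalisation: value / cap, clamped to [0, 1]."""
    return min(max(value / cap, 0.0), 1.0)


def health(population: float, baseline_population: float) -> float:
    """1.0 at >= 95% of baseline, 0.0 at <= 75%, linear in between."""
    ratio = population / baseline_population
    return min(max((ratio - 0.75) / (0.95 - 0.75), 0.0), 1.0)


def composite(congestion_reward: float, money: float, changes: int,
              population: float, baseline_population: float) -> float:
    """Weighted sum of the three terms, multiplied by the health factor."""
    base = (0.60 * congestion_reward
            + 0.20 * (1 - norm(money, MONEY_BUDGET))
            + 0.20 * (1 - norm(changes, CHANGE_CAP)))
    return base * health(population, baseline_population)


# Same fix, same spend, same change count; only the surviving population differs.
print(round(composite(0.5, 2_000_000, 30, 9_800, 10_000), 2))  # 0.64
print(round(composite(0.5, 2_000_000, 30, 8_500, 10_000), 2))  # 0.32
print(round(composite(1.0, 0, 0, 7_000, 10_000), 2))           # 0.0
```

The last line is the delete-everything case: a perfect congestion reward at zero cost still scores nothing once the population falls to 70% of baseline.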

Congestion is 60% of the weight

Reward comes from cutting the total length of jammed road and the number of jammed junctions versus a measured baseline, never from an absolute number the agent could chase.
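A minimal sketch of how that reward could be computed from two snapshots of the road network, assuming each "reduced" term is a fractional reduction versus the baseline, clamped to [0, 1]. The `Segment` fields, function names, and the tiny example network are hypothetical. Only the 0.7 density threshold, the junction rule, and the 0.5 / 0.5 blend come from the definitions above.

```python
from dataclasses import dataclass

DENSITY_THRESHOLD = 0.7  # a segment counts as congested at density >= 0.7


@dataclass
class Segment:
    a: str           # junction id at one end (hypothetical field)
    b: str           # junction id at the other end (hypothetical field)
    length_m: float
    density: float


def congested_metres(segments):
    return sum(s.length_m for s in segments if s.density >= DENSITY_THRESHOLD)


def congested_junctions(segments):
    """Count junctions of degree >= 3 that touch >= 2 congested segments."""
    degree, jammed = {}, {}
    for s in segments:
        for j in (s.a, s.b):
            degree[j] = degree.get(j, 0) + 1
            if s.density >= DENSITY_THRESHOLD:
                jammed[j] = jammed.get(j, 0) + 1
    return sum(1 for j, d in degree.items() if d >= 3 and jammed.get(j, 0) >= 2)


def reduction(before, after):
    """Assumed: fractional reduction versus baseline, clamped to [0, 1]."""
    if before <= 0:
        return 0.0  # nothing to reduce in this component
    return min(max((before - after) / before, 0.0), 1.0)


def congestion_reward(baseline, final):
    return (0.5 * reduction(congested_metres(baseline), congested_metres(final))
            + 0.5 * reduction(congested_junctions(baseline), congested_junctions(final)))


# One three-way junction J; the fix unjams the J-A arm only.
baseline = [Segment("J", "A", 100, 0.9), Segment("J", "B", 200, 0.8), Segment("J", "C", 150, 0.3)]
final = [Segment("J", "A", 100, 0.5), Segment("J", "B", 200, 0.8), Segment("J", "C", 150, 0.3)]

print(congested_metres(baseline), congested_junctions(baseline))  # 300 1
print(congested_metres(final), congested_junctions(final))        # 200 0
print(round(congestion_reward(baseline, final), 3))               # 0.667
```

Because both terms are measured against the baseline snapshot, the reward only moves when the agent actually reduces jammed road or jammed junctions relative to where the city started.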

Cost and restraint matter

Money spent and number of changes each carry 20%. A surgical fix beats a sprawling rebuild that happens to land the same congestion number.

The population multiplier governs everything

Health multiplies the whole score, so depopulating the city drags it down smoothly rather than off a cliff. A run is invalid (score 0) only when the baseline has no congestion to fix.

How it's built

Three pieces between the game and the agent.

A C# mod exposes the live simulation. A Rust MCP server turns it into agent tools and runs the harness. The benchmark layer holds the prompt, the maps, and the run script.

mod/ · C#

The game

A mod for Cities: Skylines 1 that runs inside the game and exposes the simulation's state and controls over a localhost HTTP API.

HTTP :8787

broker/ · Rust

The harness

An MCP server. It turns the game into agent tools and runs the harness: measure a baseline, run the agent, let the sim settle, score it, and write out the artifacts.

MCP tools

benchmark/ · agent

The run

The prompt the agent sees, the run script, and the maps. The agent works inside a Seatbelt sandbox that blocks it from reading the repo.

Observe → act → step the sim → re-measure. The agent loops through the tools for hours of wall-clock time, watching changes settle, until it submits a solution or the clock runs out. Then the broker settles, scores, and writes score.json, the transcript, renders, and the timelapse.
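The loop can be sketched against a toy stand-in for the tools. Only the tool names (`get_metrics`, `upgrade_road`, `control_time`) come from the list earlier on this page. Their parameters, return shapes, and the behaviour of `FakeTools` are assumptions made up for illustration, and the real agent reaches the tools over MCP rather than through a Python object.

```python
class FakeTools:
    """Toy stand-in for the MCP tools; only the tool names come from the text."""

    def __init__(self):
        self.congestion = 0.60
        self.pending = 0.0  # effect of changes not yet stepped through

    def get_metrics(self):
        # Assumed return shape: a dict with a single congestion reading.
        return {"congestion": round(self.congestion, 3)}

    def upgrade_road(self, segment_id):
        # Pretend every upgrade helps a little (the real game is not this kind).
        self.pending -= 0.05

    def control_time(self, steps):
        # Changes only show up in the metrics once the simulation advances.
        self.congestion += self.pending
        self.pending = 0.0


def run_loop(tools, plan, batch_size=2, steps_per_batch=10):
    """Observe -> act -> step the sim -> re-measure, one batch at a time."""
    history = [tools.get_metrics()["congestion"]]           # observe
    for i in range(0, len(plan), batch_size):
        for segment_id in plan[i:i + batch_size]:           # act
            tools.upgrade_road(segment_id)
        tools.control_time(steps_per_batch)                 # step the sim
        history.append(tools.get_metrics()["congestion"])   # re-measure
    return history


print(run_loop(FakeTools(), ["seg-1", "seg-2", "seg-3"]))  # [0.6, 0.5, 0.45]
```

The point of the shape is the re-measure after every batch: the history is what lets the agent attribute an effect to the batch that caused it.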

Learnings

AI is crafty... and lazy.

Pretty much every design decision in the prompt, the scoring, and the sandbox came from something the agent broke first.

01

It read the answer key

The first run had no sandbox. The agent noticed it was running in the same directory as the repository, found the harness code, read the scoring function, and sidestepped the benchmark. Its solution: delete everything. No city, no traffic. A perfect congestion score. It took about five minutes to find the loophole I hadn't thought to close. This is why the sandbox exists.

02

When you close a loophole, it finds the margin.

The first fix for the delete-everything loophole was a population floor: a minimum the population couldn't fall below, supplied in the prompt. The agent found the floor and parked exactly on it. It reduced the population to the minimum viable number and held it there, treating the floor as a target rather than a guardrail, since it figured this was easier than fixing the actual structural problems. The lesson was that a hard limit just tells the agent where the limit is. The fix was to make the penalty a gradient, not a cliff.
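The difference between the two designs is easy to see side by side. In this sketch the graded thresholds (0.75 and 0.95 of baseline population) are the ones given in the Scoring section; the 0.75 value for the original hard floor is an assumption, since the original floor's value isn't stated here.

```python
def hard_floor(ratio: float, floor: float = 0.75) -> float:
    """Cliff: full credit at or above the floor, nothing below it.

    The 0.75 default is assumed for illustration.
    """
    return 1.0 if ratio >= floor else 0.0


def graded_health(ratio: float, lo: float = 0.75, hi: float = 0.95) -> float:
    """Gradient: 0.0 at <= lo, 1.0 at >= hi, linear in between."""
    return min(max((ratio - lo) / (hi - lo), 0.0), 1.0)


# ratio = final population / baseline population
for ratio in (1.00, 0.95, 0.85, 0.76, 0.75):
    print(f"{ratio:.2f}  cliff={hard_floor(ratio):.2f}  gradient={graded_health(ratio):.2f}")
```

Under the cliff, parking on the floor costs nothing, so the floor becomes the target. Under the gradient, every resident lost below 95% of baseline costs score, and sitting at 75% is worth zero.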

03

Without pressure, it took the easy road.

Early runs showed a consistent pattern: the agent only widened roads. It would find a bottleneck, upgrade the segment, and call it done. The problem is that widening a road doesn't fix congestion. It moves it. Cars that couldn't get through one junction pile up at the next. The agent knew this, described it in its own reasoning, and did it anyway, because upgrading an existing road is reversible and cheap. Risk aversion looks like competence until you measure outcomes. The change-count penalty exists to force a commitment. The same pattern led me to change the scoring function so it counts blocked junctions as well, instead of relying only on overall flow rate or total metres of congestion.

Where this is going

A roadmap toward a city built from scratch.

Right now the agent inherits a city and repairs it. Repairing someone else's mistakes is the warm-up. Each step below hands it more rope.

1

Run the benchmark on more models

Extend the run script so it drives agents beyond the Claude line, all on the same hidden scoring.

2

Find harder maps

Source bigger, messier, more tangled cities so a quick fix can't paper over the real problems.

3

Give the agent more traffic tools

Add levers beyond roads, like public transport, so it can move people without only moving cars.

4

Introduce the rest of the city

Open up rezoning, education, healthcare, and the other systems that decide whether a city actually works.

5

Add a multi-agent mode

Split the city between agents that each own a district and have to communicate, all working toward one shared goal.

The destination

Hand it empty land.

The version I actually want is harder: hand the agent empty land and have it build and run a whole city from scratch, balancing budgets, population growth, taxation, happiness, and the environment.

Results

How the models did.

Every model runs the same gridlock-v1 scenario under identical scoring, ranked by composite score. Open a run to see how it got there.

1. Claude Fable 5 · gridlock-v1 · 0.63 / 1.00 · +23% flow · -1% population

2. Claude Sonnet 4.5 · gridlock-v1 · 0.31 / 1.00 · +6% flow · -9% population

3. Claude Opus 4.8 · gridlock-v1 · 0.21 / 1.00 · +9% flow · -15% population

4. Claude Haiku 4.5 · gridlock-v1 · 0.00 / 1.00 · -17% flow · -57% population

Non-Anthropic models, coming soon

Other frontier models will run the same gridlock-v1 scenario under identical scoring. Their results land here as the runs complete.

Findings

What the scores actually tell us.

Four models ran the same city under the same hidden scoring. The results didn't line up the way I expected, and the surprises all come back to the same thing.

01

Model size didn't decide it

I expected the biggest models to come out on top. They didn't. Haiku, the smallest, did finish last. But Opus 4.8 is a flagship and it landed below Sonnet, which sits a tier under it. The thing that decided the order wasn't how clever the model was. It was whether it noticed the damage it was doing while it worked on the traffic.

02

Nobody lost on traffic. They lost on the city

Every model could move cars around. Only one could do it without emptying the place out. The population side of the score did almost all of the work. Fable left the city essentially intact, losing 1% of its population, and scored 0.63. The other three lost residents, 9% and 15% and then a brutal 57%, and once the people were gone it stopped mattering what they had done to the traffic.

03

The fix was what broke the city

This is the exact failure the benchmark was built to catch. Opus widened a road to a Large Road without checking that the wider road has a bigger footprint, and it flattened about 60 homes. Haiku bulldozed five highway ramps to push traffic somewhere else, then couldn't rebuild three of them, and cut an interchange in half. 1,238 buildings emptied out in a single step. Both of them were staring at the traffic and never asked what else the change was touching.

04

Doing less was safer than doing harm

Sonnet barely touched the city. It made nine changes and still beat two models that went in hard and broke things. But doing less isn't the answer either. Fable made more changes than anyone, 197 of them, and it won. The difference was that it stepped the simulation forward and watched each batch settle before it moved on. The point was never to do nothing. It was to know what you had actually done.

Get involved

It's open source. Drop an agent into a city and watch what it breaks.

You'll need Cities: Skylines 1, Rust, and Mono to build the mod. The full scoring, artifacts, and mod API live in the component READMEs.

Have an idea for a harness improvement, a new tool, a great Cities: Skylines map to test on, or a model you want benchmarked? Email me or reach out on LinkedIn. Contributions on GitHub are always welcome.

