A benchmark for AI agents

Fix the traffic.
Don't kill the city.

SkylineBench drops an AI agent into a congested Cities: Skylines city and asks it to improve the traffic, without ever telling it how it's being judged.

Cities: Skylines 1 · Rust MCP harness · no right answer
[Timelapse: gridlock-v1 annotated run · assets/timelapse.mp4 · the city changing, with a live HUD]

Why I built this

Most agent benchmarks have a right answer. This one doesn't.

I have a theory: agents are bad at anticipating the second-order consequences of their own actions. I keep running into the same failure in my own engineering work. The moment an agent believes it has a solution, it stops thinking. It ships the fix and never asks what else the fix touched.

A city is about the cruelest test of that I could think of, because in a city everything is connected.

Widen a road → more cars → more noise → residents leave → shops close → no traffic, no city

The agent that widened the road got exactly what it asked for and lost the city doing it. That cascade is the whole point.

The benchmark isn't really asking whether an agent can read a congestion number and bring it down. It's asking whether the agent keeps reasoning after it thinks it's done.

How it works

The agent plays the game through tools, the same moves a human player has.

It looks at the map, inspects the traffic on any road, traces where cars are actually going, then bulldozes, builds, upgrades roads, and rezones. It can pause time, make a batch of changes, and step the simulation forward to watch what they do. It gets a few hours of wall-clock time, then submits and walks away.

Observe

  • get_city_overview
  • observe_area
  • render_map
  • get_metrics

Act

  • build_road
  • bulldoze
  • upgrade_road
  • set_zoning

Reference

  • list_road_types
  • list_zone_types

Control

  • control_time
  • reset_scenario
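Because the harness is an MCP server, each tool above would be invoked with a JSON-RPC `tools/call` request. The sketch below shows only that envelope. The tool names come from the lists above, but this page does not document their argument schemas, so `arguments` is left empty and the helper `tool_call_request` is purely illustrative.

```python
import json
from itertools import count

# Tool names as listed on this page. Their argument schemas are NOT given
# here, so anything passed in `arguments` is an assumption by the caller.
KNOWN_TOOLS = {
    "get_city_overview", "observe_area", "render_map", "get_metrics",
    "build_road", "bulldoze", "upgrade_road", "set_zoning",
    "list_road_types", "list_zone_types",
    "control_time", "reset_scenario",
}

_request_ids = count(1)


def tool_call_request(name, arguments=None):
    """Build an MCP `tools/call` JSON-RPC 2.0 request for one tool."""
    if name not in KNOWN_TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }


if __name__ == "__main__":
    # Print the envelope an MCP client would send for a metrics read.
    print(json.dumps(tool_call_request("get_metrics"), indent=2))
```

In practice an MCP client library builds this envelope for you. The point is only that the agent's whole interface to the city is a dozen named calls.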

A handful of deliberate choices decide what it's really being tested on.

01

It never sees the score

The agent is told, in plain language, to make traffic flow better while keeping the city somewhere people want to live. It is never shown the formula, the weights, or the thresholds. There's no scoreboard to play to. The only way to score well is to leave the city better than it found it.

02

It can't win by bulldozing the city

Congestion has a trivial solution: demolish everything until there's no one left to drive. So the congestion score is multiplied by a health factor tied to population. Let the city hollow out and your gains evaporate with the residents. The two pressures pull against each other on purpose.

03

It has to slow down

Traffic doesn't re-route the instant you change a road. It gets worse for a while as cars find the new layout, then settles. A good change and a bad change look identical for the first few steps, so the agent has to tell a settling transient apart from real damage instead of reacting to the first number it sees. Patience is part of the test.
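One way an agent could separate a settling transient from real damage is to refuse to judge a change until the metric stops moving. The sketch below is a hypothetical heuristic, not something the harness provides: it compares the mean of the last two windows of congestion readings and only returns a verdict once they agree. The window size, the tolerance, and the assumption that lower readings are better are all illustrative.

```python
from statistics import mean


def has_settled(samples, window=5, tolerance=0.02):
    """Return True once the last two windows of a metric agree.

    samples   -- congestion readings, oldest first, one per sim step
    window    -- readings per window (illustrative default)
    tolerance -- max relative gap between window means (illustrative)
    """
    if len(samples) < 2 * window:
        return False  # not enough history to judge yet
    previous = mean(samples[-2 * window:-window])
    recent = mean(samples[-window:])
    if previous == 0:
        return recent == 0
    return abs(recent - previous) / abs(previous) <= tolerance


def verdict(baseline, samples, window=5, tolerance=0.02):
    """Classify a change only after the transient has died down."""
    if not has_settled(samples, window, tolerance):
        return "wait"
    settled = mean(samples[-window:])
    # Lower congestion is assumed to be better.
    return "improved" if settled < baseline else "no better"


if __name__ == "__main__":
    # Congestion spikes after the change, then drops below the baseline.
    readings = [100, 130, 140, 120, 100, 90, 85, 82]
    print(verdict(100, readings))              # still moving
    print(verdict(100, readings + [80] * 10))  # flat for two full windows
```

Reacting at the spike would have reverted a change that ends up below the baseline, which is exactly the mistake this rule is meant to prevent.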

04

It can't read the answer key

The agent runs inside a sandbox that blocks it from reading this repository, so it can't inspect the scoring code. An early run did exactly that, which is why the sandbox exists. Now the agent can only play the game through the tools.


Scoring

A formula the operator can see, and the agent never can.

The prompt frames the task as "optimise this city's traffic simulation" and states its objectives qualitatively. The agent is deliberately not told the formula, the weights, the caps, or the population thresholds, so it optimises the city, not the scoreboard.

score.json · composite, hidden from agent

score = ( 0.60 · congestion_reward
        + 0.20 · (1 − norm(money))
        + 0.20 · (1 − norm(changes)) )
        · health

  • congestion_reward: blend of metres-reduced and congested-junctions-reduced (0.5 / 0.5).
  • congested: road density ≥ 0.7; a junction of degree ≥ 3 with ≥ 2 congested segments.
  • health: graded population factor, 1.0 at ≥ 95% of baseline, 0.0 at ≤ 75%, linear between.
  • norm: money against a $10M budget; changes against a 300-change cap.
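The composite above can be sketched in a few lines. This is an illustration under stated assumptions, not the broker's code: `norm` is taken to be a plain ratio against the cap, clamped to [0, 1], and `congestion_reward` and `health` are assumed to arrive already scaled to [0, 1].

```python
MONEY_BUDGET = 10_000_000  # the $10M budget from the definitions above
CHANGE_CAP = 300           # the 300-change cap from the definitions above


def clamp01(x):
    """Clamp a value into the range [0, 1]."""
    return max(0.0, min(1.0, x))


def composite_score(congestion_reward, money_spent, changes, health):
    """Weighted blend of the three terms, multiplied by the health factor.

    Assumption: norm(x) = clamp01(x / cap). The page names the caps but
    not the exact normalisation.
    """
    cost_term = 1.0 - clamp01(money_spent / MONEY_BUDGET)
    restraint_term = 1.0 - clamp01(changes / CHANGE_CAP)
    raw = (0.60 * congestion_reward
           + 0.20 * cost_term
           + 0.20 * restraint_term)
    return raw * health


if __name__ == "__main__":
    # Half the congestion reward, $2.5M spent, 60 changes, population intact.
    print(round(composite_score(0.5, 2_500_000, 60, 1.0), 2))
```

For that example the terms come to 0.30 + 0.15 + 0.16, so the run scores about 0.61 before any population penalty.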

Congestion is 60% of the weight

Reward comes from cutting the total length of jammed road and the number of jammed junctions versus a measured baseline, never from an absolute number the agent could chase.
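To make "versus a measured baseline" concrete, here is a minimal sketch of how congested metres and congested junctions could be counted and turned into a reward. The thresholds (density ≥ 0.7, junction degree ≥ 3 with ≥ 2 congested segments) and the 0.5 / 0.5 blend are the ones stated on this page. The `Segment` representation and the choice to express each reduction as a fraction of its baseline are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass

DENSITY_THRESHOLD = 0.7  # a segment at or above this density is congested


@dataclass(frozen=True)
class Segment:
    a: str           # junction id at one end (hypothetical representation)
    b: str           # junction id at the other end
    length_m: float
    density: float   # 0.0 .. 1.0


def congestion_stats(segments):
    """Return (congested metres, number of congested junctions)."""
    degree = defaultdict(int)
    jammed = defaultdict(int)
    metres = 0.0
    for s in segments:
        is_jammed = s.density >= DENSITY_THRESHOLD
        if is_jammed:
            metres += s.length_m
        for node in (s.a, s.b):
            degree[node] += 1
            jammed[node] += is_jammed
    junctions = sum(1 for n in degree if degree[n] >= 3 and jammed[n] >= 2)
    return metres, junctions


def fraction_reduced(before, after):
    """Share of the baseline removed, clamped to [0, 1] (an assumption)."""
    if before <= 0:
        return 0.0  # nothing to fix in the baseline
    return max(0.0, min(1.0, (before - after) / before))


def congestion_reward(baseline, final):
    """Blend the two reductions 0.5 / 0.5."""
    m0, j0 = congestion_stats(baseline)
    m1, j1 = congestion_stats(final)
    return 0.5 * fraction_reduced(m0, m1) + 0.5 * fraction_reduced(j0, j1)
```

With one junction fed by three segments, clearing a 200 m jam out of 300 m congested removes two thirds of the jammed length and the only jammed junction, for a reward of about 0.83 under these assumptions.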

Cost and restraint matter

Money spent and number of changes each carry 20%. A surgical fix beats a sprawling rebuild that happens to land the same congestion number.

The population multiplier governs everything

Health multiplies the whole score, so depopulating the city drags it down smoothly rather than off a cliff. A run is invalid (score 0) only when the baseline has no congestion to fix.
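The graded population factor is a simple ramp, and a short sketch shows how gently or harshly it bites. The 95% and 75% thresholds are the ones given in the formula's definitions. The function name and the error on a non-positive baseline are illustrative.

```python
FULL_HEALTH_AT = 0.95  # at or above 95% of baseline population: no penalty
ZERO_HEALTH_AT = 0.75  # at or below 75% of baseline population: score is 0


def health_factor(population, baseline_population):
    """Linear ramp from 0.0 at 75% of baseline to 1.0 at 95%."""
    if baseline_population <= 0:
        raise ValueError("baseline population must be positive")
    ratio = population / baseline_population
    if ratio >= FULL_HEALTH_AT:
        return 1.0
    if ratio <= ZERO_HEALTH_AT:
        return 0.0
    return (ratio - ZERO_HEALTH_AT) / (FULL_HEALTH_AT - ZERO_HEALTH_AT)


if __name__ == "__main__":
    for remaining in (10_000, 9_500, 9_000, 8_500, 7_500):
        print(remaining, round(health_factor(remaining, 10_000), 2))
```

A city that keeps 85% of its residents sits halfway along the ramp, so it keeps about half of whatever the rest of the formula awarded. Losing a quarter of the population zeroes the run.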

How it's built

Three pieces between the game and the agent.

A C# mod exposes the live simulation. A Rust MCP server turns it into agent tools and runs the harness. The benchmark layer holds the prompt, the maps, and the run script.

mod/ · C#

The game

A mod for Cities: Skylines 1 that runs inside the game and exposes the simulation's state and controls over a localhost HTTP API.

HTTP :8787

broker/ · Rust

The harness

An MCP server. It turns the game into agent tools and runs the harness: measure a baseline, run the agent, let the sim settle, score it, and write out the artifacts.

MCP tools

benchmark/ · agent

The run

The prompt the agent sees, the run script, and the maps. The agent works inside a Seatbelt sandbox that blocks it from reading the repo.

Observe → act → step the sim → re-measure. The agent loops through the tools for hours of wall-clock time, watching changes settle, until it submits a solution or the clock runs out. Then the broker settles, scores, and writes score.json, the transcript, renders, and the timelapse.


Where this is going

A roadmap toward a city built from scratch.

Right now the agent inherits a city and repairs it. Repairing someone else's mistakes is the warm-up. Each step below hands it more rope.

  1. Run the benchmark on more models

    Extend the run script so it drives agents beyond the Claude line, all on the same hidden scoring.

  2. Find harder maps

    Source bigger, messier, more tangled cities so a quick fix can't paper over the real problems.

  3. Give the agent more traffic tools

    Add levers beyond roads, like public transport, so it can move people without only moving cars.

  4. Introduce the rest of the city

    Open up rezoning, education, healthcare, and the other systems that decide whether a city actually works.

  5. Add a multi-agent mode

    Split the city between agents that each own a district and have to communicate, all working toward one shared goal.

The destination

Hand it empty land.

The version I actually want is harder: hand the agent empty land and have it build and run a whole city from scratch, balancing budgets, population growth, taxation, happiness, and the environment.

Results

How the models did.

Each card pairs the run's annotated timelapse with its composite score on gridlock-v1. Scores land here as runs complete.

  • Claude Fable 5 · gridlock-v1 · 0.63 / 1.00 composite score
  • Claude Opus 4.8 · gridlock-v1 · 0.21 / 1.00 composite score
  • Claude Haiku 4.5 · gridlock-v1 · pending
  • Claude Sonnet 4.5 · gridlock-v1 · pending

Non-Anthropic models, coming soon

Other frontier models will run the same gridlock-v1 scenario under identical scoring. Their results land here as the runs complete.


Run it yourself

It's open source. Drop an agent into a city and watch what it breaks.

You'll need Cities: Skylines 1, plus Mono to build the C# mod and Rust to build the broker. The full scoring, artifacts, and mod API live in the component READMEs.