A benchmark for AI agents

Fix the traffic.
Don't kill the city.

SkylineBench drops an AI agent into a congested Cities: Skylines city and asks it to improve the traffic, without ever telling it how it's being judged.

Cities: Skylines 1 · Rust MCP harness · no right answer

[Timelapse placeholder · assets/timelapse.mp4 · gridlock-v1 annotated run: the city changing, with a live HUD]

Why I built this

Most agent benchmarks have a right answer. This one doesn't.

I have a theory: agents are bad at the second-order consequences of their own actions. I keep running into the same failure in my own engineering work. The moment an agent believes it has a solution, it stops thinking. It ships the fix and never asks what else the fix touched.

A city is about the cruelest test of that I could think of, because in a city everything is connected.

Widen a road → more cars → more noise → residents leave → shops close → no traffic, no city

The agent that widened the road got exactly what it asked for and lost the city doing it. That cascade is the whole point.

The benchmark isn't really asking whether an agent can read a congestion number and bring it down. It's asking whether the agent keeps reasoning after it thinks it's done.

How it works

The agent plays the game through tools, the same moves a human player has.

It looks at the map, inspects the traffic on any road, traces where cars are actually going, then bulldozes, builds, upgrades roads, and rezones. It can pause time, make a batch of changes, and step the simulation forward to watch what they do. It gets a few hours of wall-clock time, then submits and walks away.

Observe

get_city_overview
observe_area
render_map
get_metrics

Act

build_road
bulldoze
upgrade_road
set_zoning

Reference

list_road_types
list_zone_types

Control

control_time
reset_scenario

A handful of deliberate choices decide what it's really being tested on.

It never sees the score

The agent is told, in plain language, to make traffic flow better while keeping the city somewhere people want to live. It is never shown the formula, the weights, or the thresholds. There's no scoreboard to play to. The only way to score well is to leave the city better than it found it.

It can't win by bulldozing the city

Congestion has a trivial solution: demolish everything until there's no one left to drive. So the congestion score is multiplied by a health factor tied to population. Let the city hollow out and your gains evaporate with the residents. The two pressures pull against each other on purpose.

It has to slow down

Traffic doesn't re-route the instant you change a road. It gets worse for a while as cars find the new layout, then settles. A good change and a bad change look identical for the first few steps, so the agent has to tell a settling transient apart from real damage instead of reacting to the first number it sees. Patience is part of the test.
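One way to picture the patience this asks for is a simple settle check over repeated congestion readings. The sketch below is illustrative only: the function names, the five-sample window, and the 0.02 tolerance are assumptions made for the example, not values taken from the harness.

```python
def has_settled(samples, window=5, tolerance=0.02):
    """True when the last `window` readings stay within `tolerance` of each other.

    `window` and `tolerance` are made-up defaults for illustration.
    """
    if len(samples) < window:
        return False
    recent = samples[-window:]
    return max(recent) - min(recent) <= tolerance


def judge_change(baseline, samples, window=5, tolerance=0.02):
    """Return 'wait', 'better', or 'worse' for a congestion series (lower is better)."""
    if not has_settled(samples, window, tolerance):
        return "wait"  # still in the transient: do not react yet
    settled = sum(samples[-window:]) / window
    return "better" if settled < baseline else "worse"


# The same early spike, two different outcomes once the series flattens.
early = [0.70, 0.78, 0.74, 0.66, 0.58]
good = early + [0.51, 0.50, 0.50, 0.51, 0.50]
bad = [0.70, 0.78, 0.80, 0.81, 0.80, 0.81, 0.80]
print(judge_change(0.60, early), judge_change(0.60, good), judge_change(0.60, bad))
```

The point of the design is that the verdict is withheld until the readings stop moving, so an early spike alone never triggers a rollback.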

It can't read the answer key

The agent runs inside a sandbox that blocks it from reading this repository, so it can't inspect the scoring code. It can only play the game through the tools. An early run went and read the scoring code, which is why the sandbox exists.

Scoring

A formula the operator can see, and the agent never can.

The prompt frames the task as "optimise this city's traffic simulation" and states its objectives qualitatively. The agent is deliberately not told the formula, the weights, the caps, or the population thresholds, so it optimises the city, not the scoreboard.

score.json · composite hidden from agent

score =
(0.60·congestion_reward
+ 0.20·(1−norm(money))
+ 0.20·(1−norm(changes)))
· health

congestion_reward: blend of metres-reduced and congested-junctions-reduced (0.5 / 0.5).

congested: road density ≥ 0.7; a junction of degree ≥ 3 with ≥ 2 congested segments.

health: graded population factor: 1.0 at ≥ 95% of baseline, 0.0 at ≤ 75%, linear between.

norm: money against a $10M budget; changes against a 300-change cap.
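As a rough sketch of how the composite combines, the weights and caps below are the ones stated in the formula and definitions. The clamping inside `norm` is an assumption (the page gives only the caps, not the scaling), and `congestion_reward` and `health` are taken as already-computed inputs in the range 0 to 1.

```python
BUDGET = 10_000_000  # the $10M budget from the definitions
CHANGE_CAP = 300     # the 300-change cap from the definitions


def norm(value, cap):
    # Assumption: linear scaling clamped to [0, 1]; only the caps are documented.
    return min(max(value / cap, 0.0), 1.0)


def composite_score(congestion_reward, money, changes, health):
    """Weighted blend of congestion, cost, and restraint, multiplied by health."""
    base = (
        0.60 * congestion_reward
        + 0.20 * (1 - norm(money, BUDGET))
        + 0.20 * (1 - norm(changes, CHANGE_CAP))
    )
    return base * health


# Same congestion result, very different cost and number of changes.
surgical = composite_score(0.5, money=1_000_000, changes=30, health=1.0)   # about 0.66
sprawling = composite_score(0.5, money=8_000_000, changes=240, health=1.0)  # about 0.38
print(round(surgical, 2), round(sprawling, 2))
```

With congestion held equal, the cheaper and smaller intervention comes out well ahead, and a health of 0.0 zeroes the result whatever the other terms are.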

Congestion is 60% of the weight

Reward comes from cutting the total length of jammed road and the number of jammed junctions versus a measured baseline, never from an absolute number the agent could chase.
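To make the baseline comparison concrete, here is a minimal sketch of one way those two quantities could be measured on a road graph. The thresholds (density ≥ 0.7, degree ≥ 3, ≥ 2 congested segments) and the 0.5 / 0.5 blend come from the scoring definitions; the `Segment` type, the fractional-reduction formula, and its clamping are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:  # hypothetical shape, not the harness's data model
    a: int           # node id at one end
    b: int           # node id at the other end
    length_m: float
    density: float   # 0.0 .. 1.0


def is_congested(seg):
    return seg.density >= 0.7


def measure(segments):
    """Return (congested metres, congested junction count)."""
    degree = defaultdict(int)
    jammed = defaultdict(int)
    for seg in segments:
        for node in (seg.a, seg.b):
            degree[node] += 1
            if is_congested(seg):
                jammed[node] += 1
    metres = sum(s.length_m for s in segments if is_congested(s))
    junctions = sum(1 for n in degree if degree[n] >= 3 and jammed[n] >= 2)
    return metres, junctions


def reduction(before, after):
    # Assumption: fraction of the baseline removed, clamped to [0, 1].
    if before <= 0:
        return 0.0
    return min(max((before - after) / before, 0.0), 1.0)


def congestion_reward(baseline, final):
    m0, j0 = measure(baseline)
    m1, j1 = measure(final)
    return 0.5 * reduction(m0, m1) + 0.5 * reduction(j0, j1)


# A three-way junction at node 0; the fix clears one of its two jammed arms.
before = [Segment(0, 1, 100, 0.9), Segment(0, 2, 200, 0.8), Segment(0, 3, 150, 0.3)]
after = [Segment(0, 1, 100, 0.9), Segment(0, 2, 200, 0.5), Segment(0, 3, 150, 0.3)]
print(measure(before), measure(after), congestion_reward(before, after))
```

Because both terms are relative to the measured baseline, the same absolute congestion can earn different rewards on different maps.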

Cost and restraint matter

Money spent and number of changes each carry 20%. A surgical fix beats a sprawling rebuild that happens to land the same congestion number.

The population multiplier governs everything

Health multiplies the whole score, so depopulating the city drags it down smoothly rather than off a cliff. A run is invalid (score 0) only when the baseline has no congestion to fix.
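The graded factor can be written as a small piecewise function. The 95% and 75% thresholds and the linear ramp come from the scoring definitions; the function name and signature are illustrative, and a positive baseline population is assumed.

```python
def health(population, baseline_population):
    """Population health factor: 1.0 at >= 95% of baseline, 0.0 at <= 75%, linear between.

    Assumes baseline_population > 0.
    """
    ratio = population / baseline_population
    if ratio >= 0.95:
        return 1.0
    if ratio <= 0.75:
        return 0.0
    return (ratio - 0.75) / (0.95 - 0.75)


# Losing 15% of residents roughly halves the whole score.
print(health(10_000, 10_000), health(8_500, 10_000), health(7_000, 10_000))
```

The ramp is what makes the penalty smooth: each resident lost between 95% and 75% of baseline costs a little more of the score, with no sudden drop.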

How it's built

Three pieces between the game and the agent.

A C# mod exposes the live simulation. A Rust MCP server turns it into agent tools and runs the harness. The benchmark layer holds the prompt, the maps, and the run script.

mod/ · C#

The game

A mod for Cities: Skylines 1 that runs inside the game and exposes the simulation's state and controls over a localhost HTTP API.

HTTP :8787

broker/ · Rust

The harness

An MCP server. It turns the game into agent tools and runs the harness: measure a baseline, run the agent, let the sim settle, score it, and write out the artifacts.

MCP tools

benchmark/ · agent

The run

The prompt the agent sees, the run script, and the maps. The agent works inside a Seatbelt sandbox that blocks it from reading the repo.

Observe → act → step the sim → re-measure. The agent loops through the tools for hours of wall-clock time, watching changes settle, until it submits a solution or the clock runs out. Then the broker lets the sim settle, scores the run, and writes score.json, the transcript, renders, and the timelapse.
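The shape of that loop can be sketched against a toy stand-in for the game. Everything here is hypothetical: `ToyCity` is an in-memory fake, its method names only borrow from the tool list, and the arguments, return values, and the effect of an upgrade are invented for the example rather than taken from the real MCP tools.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCity:
    """In-memory stand-in for the game; NOT the real tool interface."""
    congestion: float = 0.80  # made-up scalar, lower is better
    pending: list = field(default_factory=list)

    def get_metrics(self):
        return {"congestion": round(self.congestion, 3)}

    def upgrade_road(self, segment_id):
        # Hypothetical effect: each upgrade queues a small improvement.
        self.pending.append(0.05)

    def control_time(self, steps):
        # Hypothetical: queued changes only show up once the sim advances.
        if steps > 0:
            self.congestion = max(0.0, self.congestion - sum(self.pending))
            self.pending.clear()


def run_loop(city, segments, max_rounds):
    history = [city.get_metrics()["congestion"]]           # observe
    for segment_id in segments[:max_rounds]:
        city.upgrade_road(segment_id)                      # act
        city.control_time(steps=10)                        # step the sim
        history.append(city.get_metrics()["congestion"])   # re-measure
    return history


history = run_loop(ToyCity(), segments=[1, 2, 3], max_rounds=5)
print(history)
```

In the real benchmark the re-measure step is where the hard judgement lives, since the agent has to decide whether a reading reflects the change or just the traffic still re-routing.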

Where this is going

A roadmap toward a city built from scratch.

Right now the agent inherits a city and repairs it. Repairing someone else's mistakes is the warm-up. Each step below hands it more rope.

1. Run the benchmark on more models

Extend the run script so it drives agents beyond the Claude line, all on the same hidden scoring.

2. Find harder maps

Source bigger, messier, more tangled cities so a quick fix can't paper over the real problems.

3. Give the agent more traffic tools

Add levers beyond roads, like public transport, so it can move people without only moving cars.

4. Introduce the rest of the city

Open up rezoning, education, healthcare, and the other systems that decide whether a city actually works.

5. Add a multi-agent mode

Split the city between agents that each own a district and have to communicate, all working toward one shared goal.

The destination

Hand it empty land.

The version I actually want is harder: hand the agent empty land and have it build and run a whole city from scratch, balancing budgets, population growth, taxation, happiness, and the environment.

Results