Production Simulation: A Benchmark for Planning Under Constraints

Most of what we do at Haladir comes down to a single problem: how to make an optimal decision inside a complex operation, where every choice is bound by constraints that overlap and shift over time. Some time ago we built a benchmark for language models that, almost by accident, turned out to be a miniature of that exact problem. The benchmark is about factory operations, but we believe the operations research (OR) principles behind it deserve some explanation.
The benchmark consists of ten different factories, spanning a kitchen, a job shop, a pharmacy, and more. Each is a turn-based production simulation where a model has to plan, build, and run a small factory under deadlines, scarce resources, and disruptions it cannot see coming. Labor pressure, limited machines, a breakdown at the wrong moment: none of these can be reasoned about in isolation, which is what makes the benchmark a clean test of planning under interacting constraints.
The Environment
For the sake of simplicity, we’ll focus on the agent managing a CNC machining job shop.
The shop floor is a 10 by 12 grid on which the agent can build stations, hire workers, order materials, configure product pipelines, set routing rules, and ship finished products before their deadlines. Each turn it reads a text observation of roughly 1,500 tokens describing the current state of the floor and responds with tool calls. There are thirteen actions, and each one costs action points drawn from a budget of five per turn, so the model is never choosing a single move in isolation but a slate of moves against a tight allowance.
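To make the "slate against an allowance" idea concrete, here is a minimal sketch of a budget check. The five-point allowance comes from the description above; the action names and their individual costs are hypothetical, chosen only for illustration, and are not the benchmark's actual values.

```python
AP_PER_TURN = 5  # from the text: five action points per production turn

# Hypothetical action names and costs, for illustration only.
ACTION_COST = {
    "repair_station": 2,
    "assign_worker": 1,
    "order_materials": 1,
    "ship_order": 1,
    "build_station": 3,
}


def slate_cost(slate):
    """Total action points a list of action names would consume."""
    return sum(ACTION_COST[name] for name in slate)


def fits_budget(slate, ap=AP_PER_TURN):
    """True if the whole slate can be executed within one turn's allowance."""
    return slate_cost(slate) <= ap
```

Under these assumed costs, repairing a station, assigning a worker, and ordering materials fit in one turn, while building and repairing in the same turn leave nothing for a shipment.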
A condensed version of what the agent sees mid-game looks like this:
=== Turn 16 | Phase: production | AP: 5 | Budget: $11240 ===
-- Stations --
Saw-1 [Band Saw] at (1,3) | status=idle | health=80% | recipe=saw_steel_4140
Lathe-1 [CNC Lathe] at (2,4) | status=processing | health=65% | recipe=turn_steel
Mill-1 [CNC Mill] at (3,5) | status=idle | health=0% | recipe=mill_steel_body | queue=2/5 [Turned Steel Shaft(f=9), Turned Steel Shaft(f=10)]
Grinder-1 [Surface Grinder] at (4,6) | status=processing | health=75% | recipe=grind_steel
Stage-1 [Staging Rack] at (6,5) | status=idle | health=90% | recipe=stage_part
Pack-1 [Deburring/Packaging] at (7,5) | status=idle | health=85% | recipe=deburr_manifold | setup=Hydraulic Manifold
-- Cold Storage (4/15) at (0,5) --
Metal:steel_4140 id=22 freshness=18 | Metal:steel_4140 id=23 freshness=18
Consumable:cutting_tools id=24 freshness=no expiry | Consumable:grinding_wheels id=25 freshness=no expiry
-- Workers --
W1 at (2,4) | status=assigned | fatigue=0 | assigned=(2,4)
W2 at (3,5) | status=idle
-- Dispatch Bay (0/10) at (9,5) --
-- Orders --
ORD-1: Hydraulic Manifold qty=2 due=30 shipped=0
-- Events --
Breakdown at Mill-1: health=0, offline until repaired

From this the agent has to notice the broken mill, weigh it against everything else competing for its five points, and commit to a response before ending the turn, perhaps repairing Mill-1 before the stalled queue backs up the whole line. How a turn resolves comes down to three things:
Two phases: Turn zero is the planning phase, where the agent lays out stations, sets recipes, hires workers, places initial orders, and defines routing, all with unlimited points and instant construction. Every turn after is production, where the constraints matter: the points budget is enforced, construction takes real time, and workers have to walk to their stations before anything happens.
The tick: When the turn ends, a fourteen-step deterministic tick advances the world: delivering ordered materials, expiring stale intermediates, running recipes, dispatching finished outputs, deducting salaries, and applying whatever disruption the scenario has scheduled. Because it is deterministic, the same actions always produce the same world, so a score reflects the quality of the agent’s decisions and nothing else.
Scoring: The score is the budget the agent ends with divided by the budget it started with, capped at one. This rewards shipping on time and punishes idle workers, construction it did not need, and orders left unfulfilled.
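Returning to the observation shown earlier, the sketch below illustrates how a harness might pull the station lines apart to spot a breakdown. It assumes the line layout of the condensed sample (name, bracketed station type, grid position, then `key=value` fields separated by ` | `); the full observation format may differ, and the function names are our own.

```python
import re

# Matches lines such as:
#   Mill-1 [CNC Mill] at (3,5) | status=idle | health=0% | recipe=mill_steel_body
STATION_RE = re.compile(
    r"^(?P<name>\S+) \[(?P<kind>[^\]]+)\] "
    r"at \((?P<x>\d+),(?P<y>\d+)\) \| (?P<fields>.*)$"
)


def parse_station(line):
    """Parse one station line into a dict, or return None if it is not one."""
    m = STATION_RE.match(line.strip())
    if m is None:
        return None
    station = {
        "name": m["name"],
        "kind": m["kind"],
        "pos": (int(m["x"]), int(m["y"])),
    }
    for part in m["fields"].split(" | "):
        key, _, value = part.partition("=")
        station[key] = value
    if "health" in station:
        # "65%" -> 65
        station["health"] = int(station["health"].rstrip("%"))
    return station


def broken_stations(lines):
    """Names of stations whose health has dropped to zero."""
    parsed = (parse_station(line) for line in lines)
    return [s["name"] for s in parsed if s and s.get("health") == 0]
```

Fed the station lines from the sample turn, `broken_stations` returns only Mill-1; worker, storage, and header lines do not match the pattern and are skipped.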
What Makes It Hard
The difficulty is in how the mechanics compound. A typical product takes roughly five to eight sequential steps across different stations, so the initial layout an agent commits to affects the transit times for the rest of the game. But layout is only one thing it has to track. Stations lose health, tooling wears, and intermediates spoil. Breakdowns, walkouts, scrapped batches, and rush orders land on turns it cannot predict and invalidate parts of the plan it already committed to. And every fix has a cost somewhere else: an extra station to clear a bottleneck means budget, points, and a new hire, and whether that trade pays off is a marginal call against deadlines several turns out, something the agent has to reason toward on its own.
Each of these is manageable alone. What’s hard is holding all of them at once, before a single move: every added part is another thing to weigh at the margin and another consequence to trace forward, and the combinations branch faster than the agent can reason through. That coupling is the whole difficulty, and it’s the same property that keeps cross-cutting operational decisions in human hands outside the benchmark too.
Grading
A run earns one number: the budget the agent ends with divided by the budget it started with, clamped to the range zero to one. Because the tick is deterministic, that number depends only on the agent’s decisions. A run that ends with at least the money it began with scores one, whether it broke even or came out ahead. A run that spends itself into the ground scores zero. Most land in between, and where they land is set by the quality of the initial plan.
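As a minimal sketch, the grading rule above reduces to a clamped ratio. The function name and the guard against a non-positive starting budget are our own additions.

```python
def score(start_budget: float, end_budget: float) -> float:
    """Ending budget over starting budget, clamped to the range [0, 1]."""
    if start_budget <= 0:
        raise ValueError("start_budget must be positive")
    return max(0.0, min(1.0, end_budget / start_budget))
```

A run that turns 10,000 into 12,000 and one that ends exactly at 10,000 both score one; a run that ends in debt scores zero.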
The budget is the only ledger, and every choice flows through it. Shipping an order on time pays its full revenue, shipping it late pays the revenue minus a penalty, and an order that is never shipped is charged that penalty on the final tick. Against that income, the agent pays for everything it does: building and moving stations, hiring, upgrades, repairs, materials, and a salary every turn for every worker, busy or idle. The final balance nets all of it into one figure, and that figure, measured against the starting budget and capped at one, is the score.
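The three payout cases for an order can be sketched as a single function. This is an illustration under stated assumptions: the parameter names are hypothetical, an order shipped exactly on its due turn is assumed to count as on time, and partial shipments of a multi-unit order are not modeled.

```python
from typing import Optional


def settle_order(
    revenue: float,
    penalty: float,
    due_turn: int,
    shipped_turn: Optional[int],
) -> float:
    """Net effect of one order on the budget.

    shipped_turn is None when the order was never shipped.
    """
    if shipped_turn is None:
        # Never shipped: the penalty is charged on the final tick.
        return -penalty
    if shipped_turn <= due_turn:
        # On time (assumption: the due turn itself still counts).
        return revenue
    # Late: revenue minus the penalty.
    return revenue - penalty
```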
Feasibility is settled before a scenario ever reaches a model. Each scenario passes a feasibility analysis that checks its critical-path timeline against the deadlines and its projected revenue against the budget, and we confirm it can be won by tuning a parameterized strategy with CMA-ES until that strategy finishes solvent. This is part of building the benchmark, not part of grading a run. It guarantees every scenario can be won, so a run that scores zero has failed to plan rather than hit a problem with no solution.
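The timeline half of that check can be illustrated with a rough lead-time bound. This is a sketch under our own simplifying assumptions, not the benchmark's actual analysis: one station per step, identical units flowing through the steps in order, unlimited buffers, and transit that never blocks. Under those assumptions the first unit takes the whole path and each further unit adds one cycle of the slowest step. The CMA-ES tuning step is not reproduced here.

```python
def lead_time(step_turns, transit_turns, quantity):
    """Turns needed to finish `quantity` identical units on a serial line.

    step_turns:    processing time of each station, in route order
    transit_turns: walking/transport time between consecutive stations
    """
    if quantity < 1:
        raise ValueError("quantity must be at least 1")
    first_unit = sum(step_turns) + sum(transit_turns)
    bottleneck = max(step_turns)
    return first_unit + (quantity - 1) * bottleneck


def timeline_feasible(step_turns, transit_turns, quantity, start_turn, due_turn):
    """True if the order can finish by its deadline when started at start_turn."""
    return start_turn + lead_time(step_turns, transit_turns, quantity) <= due_turn
```

For a hypothetical five-step route with step times 2, 3, 1, 2, 1 and one turn of transit between stations, two units need 16 turns, so an order due on turn 30 is comfortably feasible from turn 1 while one due on turn 15 is not.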
Calibration
The benchmark runs ten domains, each with twenty scenarios, two hundred in total, and every one is verified solvable before it is admitted. Every scenario ships with two to four natural-language hints describing its key planning decisions, and when those hints are included, both of the frontier models we tested clear a positive score on all two hundred. That tells us the failures we care about, the runs that score zero without hints, are planning failures on problems that are in fact solvable. The scenarios sit in a useful range: a small fraction goes unsolved by every model, a stable floor is solved on every run, and most of the rest land in between, where the initial plan decides the outcome.
Results
We ran a thousand rollouts per model, five runs across each of the two hundred scenarios. Both leading models scored around a third of what the scenarios allow, and they got there by opposite strategies. One plans broadly, coordinating across products for partial credit. The other optimizes deeply, perfecting a single line and handling disruptions well.
| Domain | GPT-5.4 | Claude Opus 4.6 | Δ (Claude − GPT) |
|---|---|---|---|
| Automotive | 0.251 | 0.602 | +0.351 |
| Pharma | 0.492 | 0.681 | +0.189 |
| Chocolate | 0.244 | 0.347 | +0.103 |
| Bakery | 0.419 | 0.501 | +0.082 |
| Brewery | 0.115 | 0.113 | −0.002 |
| Textile | 0.232 | 0.198 | −0.034 |
| Cooking | 0.168 | 0.106 | −0.062 |
| Woodworking | 0.227 | 0.159 | −0.068 |
| CNC | 0.521 | 0.405 | −0.116 |
| Ceramics | 0.323 | 0.146 | −0.177 |
| Overall | 0.299 | 0.326 | +0.027 |
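As a consistency check on the table, the Δ column and the Overall row can be recomputed from the per-domain figures. The sketch assumes the Overall row is the unweighted mean of the ten domains, which is what one would expect with twenty scenarios in each; the published figures agree with that reading to three decimals.

```python
# Per-domain mean scores copied from the table: (GPT-5.4, Claude Opus 4.6).
DOMAIN_SCORES = {
    "Automotive": (0.251, 0.602),
    "Pharma": (0.492, 0.681),
    "Chocolate": (0.244, 0.347),
    "Bakery": (0.419, 0.501),
    "Brewery": (0.115, 0.113),
    "Textile": (0.232, 0.198),
    "Cooking": (0.168, 0.106),
    "Woodworking": (0.227, 0.159),
    "CNC": (0.521, 0.405),
    "Ceramics": (0.323, 0.146),
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Delta is Claude minus GPT, rounded to the table's three decimals.
deltas = {d: round(c - g, 3) for d, (g, c) in DOMAIN_SCORES.items()}

gpt_overall = round(mean(g for g, _ in DOMAIN_SCORES.values()), 3)
claude_overall = round(mean(c for _, c in DOMAIN_SCORES.values()), 3)
overall_delta = round(claude_overall - gpt_overall, 3)
```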
The shared failure is planning. Neither model plans much before acting, so both discover the constraints by trial and error and let a wrong early decision cascade into a dead run. Once a start goes bad, the model never registers that it’s stuck, so the zero-score runs tend to burn the whole turn budget producing nothing.
Multi-product coordination, where a scenario asks a model to run several lines at once, stays essentially unsolved: it is the lowest-scoring scenario type for both models, and the fullest scenarios come back near zero.
| Scenario type | Scenarios | GPT-5.4 | Claude Opus 4.6 |
|---|---|---|---|
| Single product | 41 | 0.351 | 0.362 |
| Constrained | 42 | 0.346 | 0.385 |
| Doubled quantity | 32 | 0.293 | 0.354 |
| Disruptions | 35 | 0.269 | 0.349 |
| Multi-product | 50 | 0.243 | 0.211 |
And more effort doesn’t help: for both models, spending more tokens correlates with a lower score, weakly for GPT-5.4 and more strongly for Claude Opus 4.6.
| Metric | GPT-5.4 | Claude Opus 4.6 |
|---|---|---|
| Mean tokens, perfect-score run | 1.4M | 1.8M |
| Mean tokens, zero-score run | 2.0M | 6.5M |
| Mean turns, perfect-score run | 47 | 55 |
| Mean turns, zero-score run | 57 | 101 |
| Token–score correlation (r) | −0.14 | −0.44 |
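For readers who want to reproduce this kind of figure on their own rollouts, the sketch below computes a Pearson correlation between tokens spent and score. We assume here that r in the table is a Pearson coefficient over individual runs; the sample numbers in the usage lines are invented for illustration and are not benchmark data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Illustrative, made-up runs: tokens in millions and the score of each run.
toy_tokens = [1.2, 1.5, 2.0, 3.5, 6.0]
toy_scores = [1.0, 0.8, 0.4, 0.1, 0.0]
toy_r = pearson_r(toy_tokens, toy_scores)  # negative: more tokens, lower score
```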
That last finding matters most, because it states the benchmark’s whole point in one line. The environment is small and fully specified. Real operations are not. But the demand on the agent is the same: track many constraints at once, plan before acting, and stay coherent when conditions change mid-run. A model that cannot do this inside a 10 by 12 grid shows why the cross-cutting decisions that run a real operation still rest with experienced people. We built the environment to study that capability in a setting we control completely, and it exposes the same gap our platform closes.
GPT-5.4 (openai/gpt-5.4, medium reasoning effort) and Claude Opus 4.6 (anthropic/claude-opus-4-6). 1,000 rollouts per model, five runs across each of 200 scenarios. Δ is the Opus 4.6 mean minus the GPT-5.4 mean. Full per-domain detail, including score-above-zero rate and perfect and zero counts, is available on request.