Skip to content

Benchmark results

Snapshot generated 2026-06-24 from runs 2026-06-23T23-07-15 … 2026-06-24T01-40-50 (jutul-agent 06251868c). Every sample runs the real agent end to end in a fresh workspace and is graded on the session trace as well as the answer. See how evaluation works. Each model ran the suite 3 times, and cells aggregate across runs, so a fraction like 2/3 means the sample passed two of three runs.

Overview

Pass rate is passing runs over runs that completed (infrastructure errors excluded). Tool calls and tokens are the per-run totals across the suite, the harness-efficiency signals: at equal pass rate, fewer means the harness got the agent there in less work. Input tokens note how many were served from the prompt cache (a cheap fraction of the input price); a model that caches aggressively processes a large input cheaply, which is why cost doesn't track raw token counts and is shown alongside them. Cost and wall time are for one pass over the suite (the per-run average), measured on a single machine. Within a model samples run one at a time, but wall time still depends on that machine and on how many models shared it during the run, so read it as indicative and comparable only within this snapshot; pass rate and cost are unaffected by either. Dollar costs use provider prices as of 2026-06-15 (see eval/report.py) and include prompt-cache reads/writes; the self-hosted model is priced against a hosted reference.

Model Pass rate Tool calls / run Input tokens / run Output tokens / run Cost / run Wall / run
claude-haiku-4-5 121/123 267 6.8M (6.3M cached, 93%) 70k $1.61 0.6 h
gemini-3.1-flash-lite 108/123 283 7.9M (6.2M cached, 79%) 62k $0.67 0.7 h
gpt-5.4-mini 111/123 307 3.8M (3.3M cached, 87%) 25k $0.72 0.4 h
qwen3.6:27b 114/123 266 4.4M 58k $1.46 0.9 h

By suite

Suite claude-haiku-4-5 gemini-3.1-flash-lite gpt-5.4-mini qwen3.6:27b
api 6/6 6/6 6/6 5/6
battmo 6/6 6/6 6/6 6/6
calibration 3/3 3/3 3/3 3/3
canary 3/3 3/3 3/3 3/3
ensembles 12/12 9/12 10/12 9/12
filesystem 27/27 27/27 26/27 26/27
fimbul 3/3 3/3 3/3 2/3
guardrails 3/3 3/3 1/3 3/3
jutuldarcy 8/9 4/9 9/9 6/9
mocca 6/6 5/6 3/6 6/6
plotting 6/6 4/6 6/6 6/6
search 26/27 23/27 26/27 27/27
usage 12/12 12/12 9/12 12/12
all 121/123 108/123 111/123 114/123

By simulator

Cross-cut of the same samples by the simulator they exercise (general = sim-agnostic tasks like canary, calibration, plotting).

Simulator claude-haiku-4-5 gemini-3.1-flash-lite gpt-5.4-mini qwen3.6:27b
battmo 12/12 10/12 11/12 11/12
fimbul 9/9 8/9 9/9 6/9
general 82/84 74/84 74/84 79/84
jutuldarcy 6/6 6/6 6/6 6/6
mocca 12/12 10/12 11/12 12/12
all 121/123 108/123 111/123 114/123
All samples (pass count, tool calls, tokens, cost, wall time)
Suite Sample Sim Model Passed Failures Tool calls Input Output Cost Wall
api api1-newton-residual general claude-haiku-4-5 3/3 13 135k 1k $0.03 0 min
api api1-newton-residual general gemini-3.1-flash-lite 3/3 7 93k 555 $0.01 0 min
api api1-newton-residual general gpt-5.4-mini 3/3 11 83k 708 $0.02 0 min
api api1-newton-residual general qwen3.6:27b 3/3 13 107k 2k $0.04 1 min
api api2-internal-darcy general claude-haiku-4-5 3/3 7 91k 875 $0.02 0 min
api api2-internal-darcy general gemini-3.1-flash-lite 3/3 6 73k 343 $0.01 0 min
api api2-internal-darcy general gpt-5.4-mini 3/3 7 63k 462 $0.01 0 min
api api2-internal-darcy general qwen3.6:27b 2/3 wrong answer 4 53k 660 $0.02 0 min
battmo bm1-chen-cc-discharge general claude-haiku-4-5 3/3 8 129k 1k $0.03 1 min
battmo bm1-chen-cc-discharge general gemini-3.1-flash-lite 3/3 18 424k 2k $0.03 1 min
battmo bm1-chen-cc-discharge general gpt-5.4-mini 3/3 10 101k 736 $0.02 1 min
battmo bm1-chen-cc-discharge general qwen3.6:27b 3/3 5 71k 981 $0.02 1 min
battmo bm3-crate-sweep general claude-haiku-4-5 3/3 14 312k 3k $0.08 1 min
battmo bm3-crate-sweep general gemini-3.1-flash-lite 3/3 22 377k 3k $0.03 1 min
battmo bm3-crate-sweep general gpt-5.4-mini 3/3 12 130k 1k $0.03 1 min
battmo bm3-crate-sweep general qwen3.6:27b 3/3 13 183k 2k $0.06 2 min
calibration cal1-exp-decay-fit general claude-haiku-4-5 3/3 4 70k 1k $0.02 1 min
calibration cal1-exp-decay-fit general gemini-3.1-flash-lite 3/3 4 59k 520 $0.01 1 min
calibration cal1-exp-decay-fit general gpt-5.4-mini 3/3 7 67k 543 $0.01 0 min
calibration cal1-exp-decay-fit general qwen3.6:27b 3/3 7 95k 2k $0.03 1 min
canary x0-sum-from-file general claude-haiku-4-5 3/3 3 54k 300 $0.02 0 min
canary x0-sum-from-file general gemini-3.1-flash-lite 3/3 4 53k 139 $0.01 0 min
canary x0-sum-from-file general gpt-5.4-mini 3/3 4 41k 213 $0.01 0 min
canary x0-sum-from-file general qwen3.6:27b 3/3 2 37k 299 $0.01 0 min
ensembles ens-bm-crate-sweep battmo claude-haiku-4-5 3/3 14 290k 3k $0.07 2 min
ensembles ens-bm-crate-sweep battmo gemini-3.1-flash-lite 2/3 serial / mechanism 23 633k 4k $0.05 1 min
ensembles ens-bm-crate-sweep battmo gpt-5.4-mini 2/3 wrong answer 18 221k 2k $0.04 1 min
ensembles ens-bm-crate-sweep battmo qwen3.6:27b 2/3 wrong answer 21 417k 4k $0.13 4 min
ensembles ens-fb-injtemp-sweep fimbul claude-haiku-4-5 3/3 21 491k 4k $0.10 9 min
ensembles ens-fb-injtemp-sweep fimbul gemini-3.1-flash-lite 2/3 hit budget 9 668k 3k $0.06 6 min
ensembles ens-fb-injtemp-sweep fimbul gpt-5.4-mini 3/3 17 269k 1k $0.04 4 min
ensembles ens-fb-injtemp-sweep fimbul qwen3.6:27b 1/3 serial / mechanism, wrong answer 21 486k 6k $0.16 9 min
ensembles ens-jd-porosity-sweep jutuldarcy claude-haiku-4-5 3/3 5 89k 997 $0.03 1 min
ensembles ens-jd-porosity-sweep jutuldarcy gemini-3.1-flash-lite 3/3 7 101k 569 $0.01 1 min
ensembles ens-jd-porosity-sweep jutuldarcy gpt-5.4-mini 3/3 10 123k 752 $0.02 1 min
ensembles ens-jd-porosity-sweep jutuldarcy qwen3.6:27b 3/3 10 180k 3k $0.06 3 min
ensembles ens-mc-cycles-sweep mocca claude-haiku-4-5 3/3 17 390k 5k $0.09 4 min
ensembles ens-mc-cycles-sweep mocca gemini-3.1-flash-lite 2/3 serial / mechanism 14 280k 4k $0.03 2 min
ensembles ens-mc-cycles-sweep mocca gpt-5.4-mini 2/3 wrong answer 19 288k 2k $0.05 2 min
ensembles ens-mc-cycles-sweep mocca qwen3.6:27b 3/3 22 533k 6k $0.17 6 min
filesystem fs1-write-and-include general claude-haiku-4-5 3/3 2 33k 202 $0.01 0 min
filesystem fs1-write-and-include general gemini-3.1-flash-lite 3/3 2 31k 92 $0.00 0 min
filesystem fs1-write-and-include general gpt-5.4-mini 3/3 3 31k 128 $0.01 0 min
filesystem fs1-write-and-include general qwen3.6:27b 3/3 2 33k 199 $0.01 0 min
filesystem fs1-write-and-include-battmo battmo claude-haiku-4-5 3/3 2 28k 188 $0.01 0 min
filesystem fs1-write-and-include-battmo battmo gemini-3.1-flash-lite 3/3 2 31k 101 $0.00 0 min
filesystem fs1-write-and-include-battmo battmo gpt-5.4-mini 3/3 2 28k 82 $0.01 0 min
filesystem fs1-write-and-include-battmo battmo qwen3.6:27b 3/3 2 33k 236 $0.01 0 min
filesystem fs1-write-and-include-fimbul fimbul claude-haiku-4-5 3/3 2 24k 182 $0.01 0 min
filesystem fs1-write-and-include-fimbul fimbul gemini-3.1-flash-lite 3/3 2 35k 105 $0.00 0 min
filesystem fs1-write-and-include-fimbul fimbul gpt-5.4-mini 3/3 3 31k 128 $0.01 0 min
filesystem fs1-write-and-include-fimbul fimbul qwen3.6:27b 2/3 wrong answer 2 29k 165 $0.01 0 min
filesystem fs1-write-and-include-mocca mocca claude-haiku-4-5 3/3 2 29k 193 $0.01 0 min
filesystem fs1-write-and-include-mocca mocca gemini-3.1-flash-lite 3/3 2 31k 95 $0.00 0 min
filesystem fs1-write-and-include-mocca mocca gpt-5.4-mini 3/3 4 38k 190 $0.01 0 min
filesystem fs1-write-and-include-mocca mocca qwen3.6:27b 3/3 2 33k 234 $0.01 0 min
filesystem fs2-nested-write-and-include general claude-haiku-4-5 3/3 2 33k 240 $0.01 0 min
filesystem fs2-nested-write-and-include general gemini-3.1-flash-lite 3/3 3 42k 121 $0.00 0 min
filesystem fs2-nested-write-and-include general gpt-5.4-mini 2/3 wrong answer 6 67k 337 $0.01 0 min
filesystem fs2-nested-write-and-include general qwen3.6:27b 3/3 2 33k 242 $0.01 0 min
filesystem fs3-edit-and-rerun general claude-haiku-4-5 3/3 3 50k 350 $0.01 0 min
filesystem fs3-edit-and-rerun general gemini-3.1-flash-lite 3/3 4 53k 190 $0.01 0 min
filesystem fs3-edit-and-rerun general gpt-5.4-mini 3/3 6 68k 371 $0.01 0 min
filesystem fs3-edit-and-rerun general qwen3.6:27b 3/3 3 44k 284 $0.01 0 min
filesystem fs4-save-output-file general claude-haiku-4-5 3/3 2 41k 244 $0.01 0 min
filesystem fs4-save-output-file general gemini-3.1-flash-lite 3/3 2 35k 118 $0.00 0 min
filesystem fs4-save-output-file general gpt-5.4-mini 3/3 6 58k 349 $0.01 0 min
filesystem fs4-save-output-file general qwen3.6:27b 3/3 3 33k 319 $0.01 0 min
filesystem fs5-multi-file-project general claude-haiku-4-5 3/3 4 42k 431 $0.01 0 min
filesystem fs5-multi-file-project general gemini-3.1-flash-lite 3/3 5 65k 284 $0.01 0 min
filesystem fs5-multi-file-project general gpt-5.4-mini 3/3 8 78k 470 $0.01 0 min
filesystem fs5-multi-file-project general qwen3.6:27b 3/3 4 41k 370 $0.01 0 min
filesystem fs6-read-transform-write general claude-haiku-4-5 3/3 2 37k 243 $0.01 0 min
filesystem fs6-read-transform-write general gemini-3.1-flash-lite 3/3 6 77k 498 $0.01 0 min
filesystem fs6-read-transform-write general gpt-5.4-mini 3/3 3 28k 135 $0.01 0 min
filesystem fs6-read-transform-write general qwen3.6:27b 3/3 2 37k 706 $0.01 0 min
fimbul fb1-doublet-cooldown general claude-haiku-4-5 3/3 12 367k 3k $0.08 3 min
fimbul fb1-doublet-cooldown general gemini-3.1-flash-lite 3/3 9 164k 1k $0.02 3 min
fimbul fb1-doublet-cooldown general gpt-5.4-mini 3/3 11 130k 788 $0.03 3 min
fimbul fb1-doublet-cooldown general qwen3.6:27b 2/3 wrong answer 8 115k 2k $0.04 3 min
guardrails x1-no-shell-julia general claude-haiku-4-5 3/3 1 24k 84 $0.01 0 min
guardrails x1-no-shell-julia general gemini-3.1-flash-lite 3/3 1 21k 35 $0.00 0 min
guardrails x1-no-shell-julia general gpt-5.4-mini 1/3 wrong answer 2 25k 100 $0.01 0 min
guardrails x1-no-shell-julia general qwen3.6:27b 3/3 1 26k 240 $0.01 0 min
jutuldarcy jd-millidarcy-conversion general claude-haiku-4-5 2/3 hit budget 17 1.2M 9k $0.22 2 min
jutuldarcy jd-millidarcy-conversion general gemini-3.1-flash-lite 0/3 hit budget, wrong answer 22 1.3M 14k $0.09 2 min
jutuldarcy jd-millidarcy-conversion general gpt-5.4-mini 3/3 14 199k 2k $0.04 1 min
jutuldarcy jd-millidarcy-conversion general qwen3.6:27b 2/3 wrong answer 12 241k 5k $0.08 3 min
jutuldarcy jd1-gravity-segregation general claude-haiku-4-5 3/3 16 299k 5k $0.08 1 min
jutuldarcy jd1-gravity-segregation general gemini-3.1-flash-lite 2/3 hit budget 13 278k 3k $0.02 14 min
jutuldarcy jd1-gravity-segregation general gpt-5.4-mini 3/3 11 157k 2k $0.03 1 min
jutuldarcy jd1-gravity-segregation general qwen3.6:27b 2/3 wrong answer 13 239k 5k $0.08 3 min
jutuldarcy jd3-halved-injection general claude-haiku-4-5 3/3 22 449k 7k $0.11 2 min
jutuldarcy jd3-halved-injection general gemini-3.1-flash-lite 2/3 hit budget 15 799k 13k $0.06 2 min
jutuldarcy jd3-halved-injection general gpt-5.4-mini 3/3 16 278k 2k $0.04 1 min
jutuldarcy jd3-halved-injection general qwen3.6:27b 2/3 wrong answer 10 175k 3k $0.06 2 min
mocca mc1-vsa-cyclic-golden general claude-haiku-4-5 3/3 11 199k 3k $0.05 2 min
mocca mc1-vsa-cyclic-golden general gemini-3.1-flash-lite 3/3 10 243k 2k $0.02 1 min
mocca mc1-vsa-cyclic-golden general gpt-5.4-mini 3/3 12 143k 1k $0.03 1 min
mocca mc1-vsa-cyclic-golden general qwen3.6:27b 3/3 8 116k 2k $0.04 2 min
mocca mc4-tsa-toth-honesty general claude-haiku-4-5 3/3 0 1.0M 11k $0.24 4 min
mocca mc4-tsa-toth-honesty general gemini-3.1-flash-lite 2/3 wrong answer 13 992k 6k $0.07 5 min
mocca mc4-tsa-toth-honesty general gpt-5.4-mini 0/3 wrong answer 25 477k 2k $0.06 2 min
mocca mc4-tsa-toth-honesty general qwen3.6:27b 3/3 23 287k 3k $0.09 2 min
plotting x5-headless-plot general claude-haiku-4-5 3/3 2 34k 350 $0.01 0 min
plotting x5-headless-plot general gemini-3.1-flash-lite 1/3 wrong answer 4 108k 301 $0.01 1 min
plotting x5-headless-plot general gpt-5.4-mini 3/3 3 35k 251 $0.01 0 min
plotting x5-headless-plot general qwen3.6:27b 3/3 5 70k 836 $0.02 1 min
plotting x6-read-the-bar general claude-haiku-4-5 3/3 2 43k 499 $0.01 1 min
plotting x6-read-the-bar general gemini-3.1-flash-lite 3/3 3 44k 197 $0.01 1 min
plotting x6-read-the-bar general gpt-5.4-mini 3/3 8 75k 788 $0.02 1 min
plotting x6-read-the-bar general qwen3.6:27b 3/3 6 82k 1k $0.03 1 min
search se1-locate-definition general claude-haiku-4-5 3/3 3 50k 339 $0.01 0 min
search se1-locate-definition general gemini-3.1-flash-lite 3/3 4 54k 183 $0.01 0 min
search se1-locate-definition general gpt-5.4-mini 3/3 2 19k 106 $0.01 0 min
search se1-locate-definition general qwen3.6:27b 3/3 1 22k 164 $0.01 0 min
search se1-locate-definition-battmo battmo claude-haiku-4-5 3/3 3 50k 316 $0.01 0 min
search se1-locate-definition-battmo battmo gemini-3.1-flash-lite 2/3 wrong answer 2 35k 106 $0.00 0 min
search se1-locate-definition-battmo battmo gpt-5.4-mini 3/3 2 19k 107 $0.01 0 min
search se1-locate-definition-battmo battmo qwen3.6:27b 3/3 2 37k 273 $0.01 0 min
search se1-locate-definition-fimbul fimbul claude-haiku-4-5 3/3 3 46k 280 $0.01 0 min
search se1-locate-definition-fimbul fimbul gemini-3.1-flash-lite 3/3 4 49k 157 $0.01 0 min
search se1-locate-definition-fimbul fimbul gpt-5.4-mini 3/3 2 19k 107 $0.01 0 min
search se1-locate-definition-fimbul fimbul qwen3.6:27b 3/3 1 26k 169 $0.01 0 min
search se1-locate-definition-mocca mocca claude-haiku-4-5 3/3 4 59k 379 $0.01 0 min
search se1-locate-definition-mocca mocca gemini-3.1-flash-lite 2/3 wrong answer 3 43k 130 $0.00 0 min
search se1-locate-definition-mocca mocca gpt-5.4-mini 3/3 3 32k 175 $0.01 0 min
search se1-locate-definition-mocca mocca qwen3.6:27b 3/3 1 22k 163 $0.01 0 min
search se2-locate-example general claude-haiku-4-5 2/3 wrong answer 3 41k 252 $0.01 0 min
search se2-locate-example general gemini-3.1-flash-lite 3/3 2 35k 123 $0.00 0 min
search se2-locate-example general gpt-5.4-mini 3/3 3 25k 134 $0.01 0 min
search se2-locate-example general qwen3.6:27b 3/3 1 22k 128 $0.01 0 min
search se3-find-call-sites general claude-haiku-4-5 3/3 4 42k 544 $0.01 0 min
search se3-find-call-sites general gemini-3.1-flash-lite 3/3 5 66k 345 $0.01 0 min
search se3-find-call-sites general gpt-5.4-mini 3/3 3 29k 232 $0.01 0 min
search se3-find-call-sites general qwen3.6:27b 3/3 2 22k 346 $0.01 0 min
search se4-count-jl-files general claude-haiku-4-5 3/3 4 68k 401 $0.02 0 min
search se4-count-jl-files general gemini-3.1-flash-lite 1/3 wrong answer 1 24k 48 $0.00 0 min
search se4-count-jl-files general gpt-5.4-mini 3/3 1 19k 73 $0.01 0 min
search se4-count-jl-files general qwen3.6:27b 3/3 1 22k 110 $0.01 0 min
search se5-find-constant general claude-haiku-4-5 3/3 4 59k 451 $0.01 0 min
search se5-find-constant general gemini-3.1-flash-lite 3/3 5 61k 246 $0.01 0 min
search se5-find-constant general gpt-5.4-mini 3/3 3 29k 165 $0.01 0 min
search se5-find-constant general qwen3.6:27b 3/3 2 26k 230 $0.01 0 min
search se6-call-chain general claude-haiku-4-5 3/3 10 144k 1k $0.03 0 min
search se6-call-chain general gemini-3.1-flash-lite 3/3 7 131k 372 $0.01 0 min
search se6-call-chain general gpt-5.4-mini 2/3 wrong answer 11 62k 588 $0.01 0 min
search se6-call-chain general qwen3.6:27b 3/3 6 65k 611 $0.02 0 min
usage use-bm-cell-capacity battmo claude-haiku-4-5 3/3 4 68k 761 $0.02 0 min
usage use-bm-cell-capacity battmo gemini-3.1-flash-lite 3/3 7 93k 603 $0.01 0 min
usage use-bm-cell-capacity battmo gpt-5.4-mini 3/3 10 110k 746 $0.02 0 min
usage use-bm-cell-capacity battmo qwen3.6:27b 3/3 12 170k 3k $0.06 2 min
usage use-csv-mean general claude-haiku-4-5 3/3 4 60k 561 $0.02 0 min
usage use-csv-mean general gemini-3.1-flash-lite 3/3 4 55k 255 $0.01 0 min
usage use-csv-mean general gpt-5.4-mini 0/3 wrong answer 5 45k 214 $0.01 0 min
usage use-csv-mean general qwen3.6:27b 3/3 6 79k 685 $0.03 1 min
usage use-jd-well-api jutuldarcy claude-haiku-4-5 3/3 8 126k 1k $0.03 0 min
usage use-jd-well-api jutuldarcy gemini-3.1-flash-lite 3/3 5 75k 377 $0.01 0 min
usage use-jd-well-api jutuldarcy gpt-5.4-mini 3/3 4 29k 335 $0.01 0 min
usage use-jd-well-api jutuldarcy qwen3.6:27b 3/3 3 40k 748 $0.01 1 min
usage use-mc-list-examples mocca claude-haiku-4-5 3/3 2 37k 270 $0.01 0 min
usage use-mc-list-examples mocca gemini-3.1-flash-lite 3/3 2 35k 171 $0.00 0 min
usage use-mc-list-examples mocca gpt-5.4-mini 3/3 2 19k 167 $0.01 0 min
usage use-mc-list-examples mocca qwen3.6:27b 3/3 1 22k 254 $0.01 0 min

Reading the results

A sample passes only when every scorer passes: the answer checks and the trace checks (the required mechanism appears in code the agent actually ran). Failures fall into:

  • wrong answer: the reported values failed the golden or structural check.
  • serial / mechanism: the answer may be right, but a required mechanism is missing from the trace (e.g. a sweep that ran serially when the prompt asked for a parallel ensemble).
  • hit budget: the sample reached its message or time cap before finishing.
  • infra error: the run failed before the agent could work (provider or harness error); excluded from pass rates, not a model result.

Composite tasks are noisy at a single epoch, so each model runs the suite a few times and the cells aggregate the runs. Regenerate this page with:

uv run jutul-agent eval <suite> --model <provider/model> --epochs 3
uv run python -m jutul_agent.eval.report <log-prefix> -o docs/benchmark.md

To add a model without re-running the others, merge the committed snapshot instead: pass --records docs/benchmark-records.jsonl and write it back with --json docs/benchmark-records.jsonl.