Testing
How the test infrastructure works: what runs where, what gates what, and the conventions that keep an agent system testable at all.
Three tiers
| Tier | Selected by | Needs | Time |
|---|---|---|---|
| Unit | uv run pytest (default) |
nothing beyond the venv | ~35 s |
| Integration | -m integration |
Julia (some tests an instantiated env) | seconds to minutes |
| Live | pytest tests/live/ |
a provider API key | one real agent turn |
The default run must work on a fresh clone with no Julia, no keys, and no
network. Anything that cannot promise that is marked integration or
lives in tests/live/.
Gating: tests skip themselves
Integration tests check their own prerequisites and skip with a reason
rather than fail: Julia on PATH, an env template with a generated
Manifest.toml, a display or Xvfb for GL tests. This is what makes one
test suite serve three environments: a laptop without Julia, a dev box
with everything, and CI lanes that instantiate exactly what their job
needs. The simulator smoke tests are the pattern: parametrized over every
registered adapter, each instance skipping unless that simulator's env is
instantiated.
The live smoke test gates on a provider key, runs one real turn (read a file, evaluate through the kernel, answer), asserts on the answer and on the trace, and retries once, a deliberate compromise with model nondeterminism. Proper repetition statistics belong to the bench, not the test suite.
Fakes, not mocks
tests/fakes.py provides hand-written fakes (a fake simulator adapter, a
fake Julia session that returns scripted results) instead of patch-based
mocks. Tests build real objects (Session, backends, tools) around a fake
edge, so they exercise real wiring and survive refactors that would break
patch paths. The eval scorer tests follow the same idea: they write a real
trace.sqlite with synthetic events and run the actual scorer against it.
Snapshots
Prompt assembly and other rendered text are snapshot-tested with syrupy.
A change to the system prompt shows up as a snapshot diff in review:
deliberate changes are accepted with --snapshot-update, accidental ones
get caught. Treat a snapshot update in a PR as a prompt change to review,
not noise.
TUI tests
Every widget that renders model-controlled text gets a Textual pilot test (drive the real app headless, press keys, assert on the screen). The failure class this guards is real: model output containing markup-like text can crash or corrupt a widget that renders it as markup. New TUI surfaces ship with a pilot test that feeds them hostile text.
Kernel tests
The Julia kernel has two layers. Unit tests drive connection.py over an
in-process socketpair, playing the Julia side byte-for-byte: framing,
routing, and failure paths with no Julia at all. Integration tests run the
real kernel: eval round-trips, fd-level output capture (C printf),
interrupt-under-load, and cancel recovery. Two of them are the protocol's
safety net and intentionally stress race conditions
(test_interrupt_during_heavy_printing_repeatedly, the bounded cancel
test). Treat a flake there as a real bug until proven otherwise.
Eval tests
tests/test_eval_bench.py runs entirely offline: scorers against
synthetic traces, RunConfig hashing stability, and an import-and-build of
every task suite so API drift in the entrypoint files fails fast. The
whole module skips when the eval extra is not installed. Live model
calls never run under pytest. They are bench runs
(evaluation).
What CI runs
ci.yml: lint, the unit suite on Linux/macOS/Windows, the kernel integration job, and the plot job (instantiates the JutulDarcy env under xvfb, renders a real GLMakie figure).simulators.yml: per-simulator env instantiate + smoke, on PRs and weekly. The weekly schedule is the upstream-breakage canary, since envs carry no version pins. Both instantiate steps use an explicitPkg.precompile()because best-effort auto-precompile exits 0 on failure (see development).
Conventions for new tests
- Unit by default. Mark
integrationthe moment a test needs Julia. - Gate on prerequisites with a skip and a reason, never a hard fail.
- Test through public seams (
JuliaSession, backends,TurnRunner), not internals, so implementation rewrites keep their tests. - When a real failure surfaces in use, pin it: a regression test if it is mechanical, a bench task if it is behavioral.