
Validator Bench: Leaderboard

More context about the benchmark setup can be found here, with a per-step study here and the YAML task description and corner cases here. Briefly: the model needs to implement a validator for some language (JSON, TOML, etc.) according to a specification, and is scored on a predefined test set with hundreds of test cases.
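For concreteness, here is a minimal sketch of how such a submission could be scored. The test-case format, file layout, and the validator's CLI contract (reads the document on stdin, exits 0 iff it considers it valid) are illustrative assumptions, not the actual harness.

```python
import json
import subprocess
from pathlib import Path

def score_validator(binary: Path, cases_file: Path) -> float:
    """Run a compiled validator on every test case and return its accuracy.

    Assumed (hypothetical) format: `cases_file` is a JSON list of
    {"input": "...", "valid": true/false} objects, and the validator
    reads the document on stdin and exits 0 iff it accepts it.
    """
    cases = json.loads(cases_file.read_text())
    correct = 0
    for case in cases:
        proc = subprocess.run(
            [str(binary)],
            input=case["input"].encode(),
            capture_output=True,
            timeout=10,
        )
        verdict = proc.returncode == 0
        correct += verdict == case["valid"]
    return correct / len(cases)
```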

The benchmark varies tasks across several dimensions: the language to validate (TOML, YAML, HCL), whether the specification is included in the prompt, and the implementation language (C++17 or Zig).

We fully expect models to know what YAML or C++ is, and some of the tests might also have appeared in training data. As long as models don’t cheat by hardcoding answers (spot checks found not a single such case), we can still make useful observations about a model’s performance; it’s hard to make generalization claims for any existing coding benchmark anyway, SWE-bench Verified included.

Leaderboard

This is a leaderboard for 6 selected models over the (TOML 1.0, YAML 1.2, HCL 2) × (with-specification, without-specification) × (C++17, Zig) set of tasks. For each task the model gets multiple independent attempts, so we can gauge the variance within a task.

model                      P(acc≥0.90)  P(acc≥0.99)  P(acc≥1.00)
opus-4-7-adaptive                0.850        0.717        0.528
gpt-5.5-xhigh                    0.706        0.642        0.567
sonnet-4-6-enabled               0.631        0.200        0.033
glm-5p1                          0.533        0.250        0.150
deepseek-v4-pro-thinking         0.363        0.071        0.037
kimi-k2.6-thinking               0.304        0.044        0.000

In the previous notes about this benchmark we used MCC; for these charts we look at mean accuracy. Since we mostly focus on the higher end (90%+ of tests passed), this is adequate and easier to reason about: a trivial baseline (always returning true/false, or returning a random verdict) is not going to score over 90% on these test suites.

The way to read the data: how likely is a model to produce a classifier (validator) with accuracy >= x for a task/environment from the dataset?
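A sketch of how such a curve can be computed from per-attempt results. The record layout (one dict per attempt) is an assumption about the shape of the data, not the actual analysis code, and pooling all attempts is one reasonable estimator; averaging per task first would be another.

```python
def p_accuracy_at_least(records, model, threshold):
    """Fraction of attempts with accuracy >= threshold for one model.

    `records` is assumed to be a list of per-attempt dicts such as
    {"model": ..., "task": ..., "impl_lang": ..., "accuracy": ...}.
    """
    accs = [r["accuracy"] for r in records if r["model"] == model]
    return sum(a >= threshold for a in accs) / len(accs)

# Sweeping the threshold gives the cumulative curve plotted below, e.g.:
# curve = [(t, p_accuracy_at_least(records, "gpt-5.5-xhigh", t))
#          for t in (x / 100 for x in range(50, 101))]
```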

[Figure: cumulative accuracy]

Confidence intervals here should be understood like this: if we reran the same set of tasks with fresh attempts, in 95% of cases we’d expect P(mean_accuracy ≥ threshold) to fall within the interval. It is not a confidence interval for an ‘arbitrary validator in an arbitrary environment we might ask a model to build’.
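One way to obtain such an interval is a percentile bootstrap over attempts, keeping the set of tasks fixed and resampling attempts within each task, which matches the ‘same tasks, fresh attempts’ reading above. This is a sketch under that assumption, not necessarily the exact procedure used here.

```python
import random

def bootstrap_ci(per_task_accs, threshold, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for P(accuracy >= threshold).

    `per_task_accs` is assumed to map task -> list of accuracies, one per
    independent attempt.  Tasks stay fixed; attempts are resampled with
    replacement within each task.
    """
    stats = []
    for _ in range(n_boot):
        hits = total = 0
        for accs in per_task_accs.values():
            resampled = [random.choice(accs) for _ in accs]
            hits += sum(a >= threshold for a in resampled)
            total += len(resampled)
        stats.append(hits / total)
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```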

Observations:

Breakdowns

We can also plot a similar chart to understand the language/environment breakdown; it reflects a combination of a model’s proficiency with a language and that language’s/stdlib’s suitability for the task.
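The breakdown itself is just the same estimate restricted to one slice of the task grid. A sketch, reusing the per-attempt record layout assumed above; field names are illustrative.

```python
def breakdown(records, model, threshold, field="impl_lang"):
    """P(accuracy >= threshold) per value of `field` (e.g. C++17 vs Zig).

    `field` could equally be the target language ("toml", "yaml", "hcl")
    or the with/without-specification flag.
    """
    groups = {}
    for r in records:
        if r["model"] != model:
            continue
        groups.setdefault(r[field], []).append(r["accuracy"])
    return {key: sum(a >= threshold for a in accs) / len(accs)
            for key, accs in groups.items()}
```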

[Figure: cpp vs zig]

We can clearly see that C++17 is the easier implementation language for these models and tasks.

Here’s the breakdown by task, i.e. by the language we need to validate. YAML is a clear outlier, while ~90% accuracy on TOML and HCL is fairly attainable.

[Figure: specification type]

We can see a big gap here, and more target languages would definitely help: easier ones (like JSON) for smaller, weaker models, and harder ones for stronger models.

Moving to one shot

After experimenting with this benchmark, I’m considering simplifying it further to avoid multi-turn submissions entirely. This would get rid of failures caused by exceeding the context window, simplify the data studies, allow requests to be submitted in a single batch with a shared initial prompt, give better kv-cache hit rates, remove the possibility of leaking information about the tests, and further decouple the model from the coding agent: there would be no ‘agentic loop’ at all.
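A rough sketch of what the one-shot setup could look like: every request is the same shared prefix (the fixed instructions) followed by a short per-task suffix, so requests can be submitted as one batch and the shared prefix is kv-cache friendly. The prompt wording, field names, and Task structure are illustrative assumptions, not the final design.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    target_lang: str         # e.g. "toml", "yaml", "hcl"
    impl_lang: str           # e.g. "c++17", "zig"
    spec: Optional[str]      # full specification text, or None for the without-spec variant

# Identical across all requests, so it can be reused across the batch.
SHARED_PREFIX = (
    "You have a single attempt to write a complete validator.\n"
    "Reply with one code block containing the full program and nothing else.\n"
)

def one_shot_prompt(task: Task) -> str:
    """Build a single-turn prompt: no tools, no follow-ups, no agentic loop."""
    parts = [
        SHARED_PREFIX,
        f"Language to validate: {task.target_lang}",
        f"Implementation language: {task.impl_lang}",
    ]
    if task.spec is not None:
        parts.append("Specification:\n" + task.spec)
    return "\n\n".join(parts)

# prompts = [one_shot_prompt(t) for t in tasks]  # submitted as one batch
```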

Plans for the one shot:

References

Prior notes in this series:

Specifications and tooling:

Models: