
Validator Bench: YAML 1.2

TOML parsing proved fairly easy for top-tier models: Opus 4.6-4.7 and GPT 5.3+ hit a perfect score. Good open-weight models were also fairly strong, reaching 0.9+ MCC, and Kimi K2.6 achieved a perfect score on one of its attempts.

Compared to TOML, writing a compliant YAML parser is considerably harder: there are many more corner cases to consider.

I used libfyaml as the reference YAML parser and the yaml-test-suite as the test suite.

As with TOML, I fully expect the models to be aware of the YAML format, its specification, and existing implementations. Still, as we can see, producing a good implementation is hard.

Just as before, the model's task is to write a validator. To distinguish 'validator thinks the YAML is invalid' from 'validator crashed', the validator is expected to exit with code 0 and print its verdict to stdout.

For each attempt the model gets up to 5 submissions; on failure it receives test names and outcomes. That gives it partial information about which tests to fix, but no full test leakage.

Each dot is the best MCC score out of the 5 turns in an attempt. A perfect validator scores 1, a random one 0, and a perfectly wrong one -1.
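The post doesn't show its scoring code, but MCC is the standard Matthews correlation coefficient over the confusion matrix; assuming the usual convention of returning 0 when a marginal is empty, it looks like this:

```cpp
#include <cmath>

// Matthews correlation coefficient from confusion-matrix counts.
// Returns 0 when any row/column marginal is empty (usual convention).
double mcc(long tp, long tn, long fp, long fn) {
    double denom = std::sqrt(double(tp + fp) * double(tp + fn) *
                             double(tn + fp) * double(tn + fn));
    if (denom == 0.0) return 0.0;
    return (double(tp) * double(tn) - double(fp) * double(fn)) / denom;
}
```

This is why the scale runs from 1 (all verdicts correct) through 0 (no better than chance) down to -1 (every verdict inverted).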

YAML 1.2 validators in C++17

I ran each model only 5-10 times, so a boxplot would add more noise than value; we can just look at the individual submissions.

Next steps

References