
Validator Bench 0.0.2

This is a small update for Validator Bench.

The task stays the same: write a validator for TOML 1.0 in C++17. The goal for now is to get the benchmark into a stable state on this one task, then quickly scale to many more tasks with the same basic property: a real-valued score rather than a binary pass/fail, which yields more information per sample and therefore a more efficient estimate of model ability.
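To make the real-valued scoring concrete, here is a minimal sketch of how such a harness could grade a submission. Everything here is illustrative: `Case`, `toy_validate`, and `score` are hypothetical names, and the toy validator only checks bracket balance, nothing like the full TOML 1.0 grammar. The score is simply the fraction of conformance cases on which the validator's verdict matches the ground truth.

```cpp
#include <string>
#include <vector>

// Ground-truth conformance case: a candidate TOML document plus
// whether it is actually valid TOML 1.0.
struct Case {
    std::string input;
    bool valid;
};

// Stand-in for the model-written validator. A real submission would
// implement the full TOML 1.0 grammar; this toy heuristic only
// rejects empty documents and unbalanced square brackets.
bool toy_validate(const std::string& doc) {
    if (doc.empty()) return false;
    long depth = 0;
    for (char c : doc) {
        if (c == '[') ++depth;
        else if (c == ']') --depth;
        if (depth < 0) return false;
    }
    return depth == 0;
}

// Real-valued score in [0, 1]: fraction of cases where the
// validator's verdict matches the ground truth.
double score(const std::vector<Case>& cases) {
    if (cases.empty()) return 0.0;
    int correct = 0;
    for (const auto& c : cases)
        if (toy_validate(c.input) == c.valid) ++correct;
    return static_cast<double>(correct) / cases.size();
}
```

A validator that gets 3 of 4 cases right scores 0.75 instead of a flat fail, which is the extra signal per sample this benchmark is after.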

One change is to the benchmark harness itself: there are now separate runners for different model providers, since provider-specific settings were making the litellm-based harness overloaded and bug-prone.

The other changes add runs for newly released models.

Here’s the overall summary of the results. Note that some models, such as Opus 4.7, have multiple entries: an old one from the deprecated litellm-based runner and new ones from the dedicated Anthropic SDK runner with the correct thinking-budget setting. The old results will be cleaned up in the next run. Some runs are still in progress and will be updated for completeness, but I do not expect the results for this task to change significantly.

In the chart below, color coding distinguishes open-weight from proprietary models, and model names in bold are the ones newly added in this update.

toml-1.0-cpp

Observations:

References: