Using Claude Code, May 2026

Versions used: Claude Code 2.1.123, Opus 4.7.

The workflow hasn’t changed much since the previous note; below are just a few specific mistakes I noticed.

Poor choice of command output capture

Claude Code would start a long-running command like foo_bar.sh | tail -n 20 when the important output was supposed to come at the end. This has multiple disadvantages:

Recording it via /oops mitigated the issue.
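For reference, a pipeline of this shape prints nothing until the command finishes, discards everything except the last 20 lines, and (unless pipefail is set in the shell) reports the exit status of tail rather than of the script. A minimal sketch of a capture that avoids all three, assuming the full output can go to a log file; run_and_capture and run.log are hypothetical names, and the child process is only a stand-in for foo_bar.sh:

```python
import subprocess
import sys
from collections import deque


def run_and_capture(cmd, log_path, tail_lines=20):
    """Run cmd, write ALL output to log_path, return (exit_code, last lines)."""
    tail = deque(maxlen=tail_lines)  # keeps only the most recent lines
    with open(log_path, "w") as log:
        with subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # interleave stderr with stdout
            text=True,
        ) as proc:
            for line in proc.stdout:
                log.write(line)  # full output stays available for inspection
                tail.append(line.rstrip("\n"))
            exit_code = proc.wait()  # the command's own status, not tail's
    return exit_code, list(tail)


if __name__ == "__main__":
    # Hypothetical stand-in for foo_bar.sh: 100 lines, then exit status 3.
    child = "import sys\nfor i in range(100): print('line', i)\nsys.exit(3)"
    code, tail = run_and_capture([sys.executable, "-c", child], "run.log")
    print(code, tail[-1])  # prints: 3 line 99
```

If the interesting output turns out not to be in the last 20 lines, the log file can be searched without re-running the long command.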

Benchmark scoring issues

For project context, see Validator Bench and Validator Bench: YAML 1.2.

Briefly: in this task Claude Code had to implement a scorer/evaluator that runs a submission against a set of (already existing) test cases. The submitted code is a ‘validator’ - something that checks whether a given JSON, TOML, etc. document is valid. We started with the exit code only - non-zero implies ‘invalid’. This is incorrect when the validator crashes or times out: a crashed or killed process typically also exits with a non-zero status, so a test case that expects ‘invalid’ still gets a ‘good outcome’ even though the validator never produced a verdict.

When I asked to switch to printing the verdict to stdout, Claude Code decided to ignore the exit code entirely - so printing something and then crashing would be ‘acceptable’. I think this is something a good engineer should catch.
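A minimal sketch of the check I would expect, combining both signals. The protocol here is an assumption for illustration, not the actual Validator Bench one: the validator prints exactly ‘valid’ or ‘invalid’ on stdout and exits with status 0 on any clean run. The names run_validator and passed are hypothetical.

```python
import subprocess
import sys

VERDICTS = {"valid", "invalid"}  # assumed stdout protocol


def run_validator(cmd, document, timeout_s=5.0):
    """Return 'valid', 'invalid', or 'error' for one test document."""
    try:
        proc = subprocess.run(
            cmd, input=document, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return "error"  # no verdict was produced in time
    if proc.returncode != 0:
        return "error"  # crashed, even if a verdict was printed first
    verdict = proc.stdout.strip()
    return verdict if verdict in VERDICTS else "error"


def passed(cmd, document, expected, timeout_s=5.0):
    """A test passes only on a clean run whose verdict matches `expected`."""
    return run_validator(cmd, document, timeout_s) == expected


if __name__ == "__main__":
    py = [sys.executable, "-c"]
    clean = py + ["print('invalid')"]
    print_then_crash = py + ["print('invalid'); raise SystemExit(1)"]
    hang = py + ["while True: pass"]
    print(
        run_validator(clean, "{"),
        run_validator(print_then_crash, "{"),
        run_validator(hang, "{", timeout_s=1.0),
    )  # prints: invalid error error
```

Keeping ‘error’ as a third outcome, separate from both verdicts, means a crash or a timeout can never be counted as a correct ‘invalid’.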

Reasoning about a wrong submission

Another interesting case was asking Claude Code to explain a submission’s result for the project above. The submission was getting an MCC score of -1; such a score means the submission is wrong on every test case (see Phi coefficient (MCC)).

For two such submissions, Claude produced a plausible but entirely wrong explanation, focusing on the idea that ‘the labels must be swapped’ - while the actual cause was much more trivial: the solution was timing out, so the scorer was marking it incorrect every time (see the infinite-loop repro). After I inspected it manually and asked Claude Code to reproduce the run and see what was going on, it was fixed.
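The score alone cannot distinguish the two explanations. A small sketch, assuming the scorer records every timed-out case as the wrong verdict (so an expected ‘valid’ becomes a false negative and an expected ‘invalid’ a false positive); the function names and the 6/4 split of test cases are made up for illustration:

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column is empty
    return (tp * tn - fp * fn) / denom


def confusion(expected, correct):
    """Count (tp, tn, fp, fn), treating 'valid' as the positive class.

    `correct[i]` is True when the scorer accepted the outcome of case i.
    """
    tp = tn = fp = fn = 0
    for exp, ok in zip(expected, correct):
        if exp == "valid":
            tp, fn = (tp + 1, fn) if ok else (tp, fn + 1)
        else:
            tn, fp = (tn + 1, fp) if ok else (tn, fp + 1)
    return tp, tn, fp, fn


if __name__ == "__main__":
    expected = ["valid"] * 6 + ["invalid"] * 4
    # Every case timed out, so the scorer marked every case incorrect.
    always_wrong = [False] * len(expected)
    counts = confusion(expected, always_wrong)
    print(counts, mcc(*counts))  # prints: (0, 0, 4, 6) -1.0
```

A submission with swapped labels produces exactly the same counts, which is why the -1 looks like a label mix-up until the run is actually reproduced.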

Some form of TDD - or at least nudging the model to reproduce the problem first - still seems important. The model + assistant combination doesn’t do it on its own.

References