Daily HumanEvalPlus-CC164 benchmark for Claude Code (Opus 4.7)
Tasks where Opus 4.6 and 4.7 have different pass rates over recent paired runs.
Green = always passes, red = always fails. A task that’s green for one model and red for
the other reveals a real tradeoff, not noise. Historical divergences include
HumanEval/97 (Python's signed-modulo quirk) and HumanEval/141
(Unicode `.isalpha()` vs a literal a–z range); both quirks are sketched below the table.
| Task | Opus 4.7 | Opus 4.6 | Delta |
|---|---|---|---|
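For reference, a minimal Python sketch of both quirks (illustrative values only, not part of the benchmark harness):

```python
# HumanEval/97-style quirk: Python's % takes the sign of the divisor,
# so the "unit digit" of a negative int differs from a C-style remainder.
print(-15 % 10)       # 5 in Python; a C-style remainder would give -5
print(abs(-15) % 10)  # 5, with abs() making the intended digit explicit

# HumanEval/141-style quirk: str.isalpha() accepts any Unicode letter,
# while a literal a-z / A-Z range check does not.
print("é".isalpha())      # True: 'é' counts as a letter
print("a" <= "é" <= "z")  # False: 'é' is outside the ASCII a-z range
```

The two checks only disagree on non-ASCII input, which is exactly where model solutions can split.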
Per-task detail for the latest run:

| Task | Function | Result | Base | EvalPlus | Attempts | Turns | Cost | Error |
|---|---|---|---|---|---|---|---|---|