Daily HumanEvalPlus-CC164 benchmark for Claude Code (Opus 4.8)
Tasks where Opus 4.8, 4.7, and 4.6 have different pass rates over recent paired runs.
Green = always passes, red = always fails. Spread is the gap in pass rate between the best and
worst model on that task; a high spread suggests a real tradeoff rather than run-to-run noise. Historical
divergences include HumanEval/97 (Python signed-modulo quirk) and
HumanEval/141 (Unicode .isalpha() vs literal a–z range).
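The two Python behaviors named above can be reproduced in isolation. The sketch below does not reproduce the benchmark tasks or any model's solution; the helper names are hypothetical and only illustrate why a solution relying on `%` with negative operands, or on `.isalpha()`, can disagree with one that uses `abs()` or an explicit a–z check.

```python
import math


def last_digit_floor_mod(n: int) -> int:
    # Python's % follows the sign of the divisor, so -13 % 10 is 7, not 3 or -3.
    return n % 10


def last_digit_abs(n: int) -> int:
    # Taking abs() first gives the decimal units digit regardless of sign.
    return abs(n) % 10


def last_digit_trunc(n: int) -> int:
    # math.fmod truncates toward zero (C-style), so the sign follows the dividend.
    return int(math.fmod(n, 10))


def is_ascii_letter(ch: str) -> bool:
    # Literal range check: only a-z and A-Z count as letters.
    return ("a" <= ch <= "z") or ("A" <= ch <= "Z")


if __name__ == "__main__":
    print(last_digit_floor_mod(-13), last_digit_abs(-13), last_digit_trunc(-13))
    # str.isalpha() accepts any Unicode letter, unlike the literal range check.
    for ch in ("a", "é", "ß", "1"):
        print(repr(ch), ch.isalpha(), is_ascii_letter(ch))
```

A solution that passes the base tests with either variant can still fail stricter tests that include negative inputs or non-ASCII letters.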
| Task | Opus 4.8 | Opus 4.7 | Opus 4.6 | Spread |
|---|---|---|---|---|
| Loading… | ||||
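
The Spread column can be read as the best pass rate minus the worst pass rate across the three models. A minimal sketch follows, assuming a pass rate is simply passes divided by runs; the function names and the sample run data are made up for illustration and are not the benchmark's code or results.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of runs that passed; assumes at least one run."""
    return sum(results) / len(results)


def spread(rates: dict[str, float]) -> float:
    """Gap between the best and worst model's pass rate on one task."""
    return max(rates.values()) - min(rates.values())


def diverges(rates: dict[str, float]) -> bool:
    """True when the models do not all share the same pass rate."""
    return spread(rates) > 0


# Illustrative data only: four hypothetical paired runs per model on one task.
sample_runs = {
    "Opus 4.8": [True, True, True, True],
    "Opus 4.7": [True, False, True, False],
    "Opus 4.6": [False, False, False, True],
}
sample_rates = {model: pass_rate(runs) for model, runs in sample_runs.items()}

if __name__ == "__main__":
    print(sample_rates, spread(sample_rates), diverges(sample_rates))
```

With only a handful of runs per model, a small nonzero spread can still be noise, so the number of paired runs matters when reading this column.
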
| Task | Function | Result | Base | EvalPlus | Attempts | Turns | Cost | Error |
|---|---|---|---|---|---|---|---|---|
| Loading… | ||||||||