Is Claude Dumb Today?

Daily HumanEvalPlus-CC164 benchmark for Claude Code (Opus 4.8)

Score | Model | Cost | Runtime

Score History (last 90 runs)

Where the models disagree

Tasks where Opus 4.8, 4.7, and 4.6 have different pass rates over recent paired runs. Green = always passes, red = always fails. Spread is the gap between the highest and lowest per-model pass rate on a task; a high spread points to a real behavioral difference between the models rather than run-to-run noise. Historical divergences include HumanEval/97 (Python's signed-modulo quirk) and HumanEval/141 (Unicode-aware .isalpha() vs. a literal a–z range).
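The two Python behaviors named above can be reproduced in a few lines. This is only a sketch of the language quirks themselves. The helper names are hypothetical, and it makes no claim about which variant the benchmark's tests expect or which model produced which.

```python
# Signed modulo: Python's % takes the sign of the divisor, so for a negative
# operand `n % 10` is NOT the last decimal digit of n.
def last_digit_naive(n: int) -> int:
    """Hypothetical helper: last digit via plain modulo."""
    return n % 10


def last_digit_abs(n: int) -> int:
    """Hypothetical helper: last digit of the magnitude."""
    return abs(n) % 10


# Letter check: str.isalpha() is Unicode-aware and accepts letters such as
# 'é', while an explicit range check accepts only ASCII a-z and A-Z.
def starts_with_letter_unicode(name: str) -> bool:
    """Hypothetical helper: first character is any Unicode letter."""
    return name[:1].isalpha()


def starts_with_letter_ascii(name: str) -> bool:
    """Hypothetical helper: first character is in a-z or A-Z."""
    first = name[:1]
    return ("a" <= first <= "z") or ("A" <= first <= "Z")


if __name__ == "__main__":
    print(last_digit_naive(-12), last_digit_abs(-12))        # 8 2
    print(starts_with_letter_unicode("été.txt"),
          starts_with_letter_ascii("été.txt"))               # True False
```

Two solutions that differ only in these details agree on typical inputs and split on negative numbers or non-ASCII names, which is the kind of input where per-model pass rates can diverge.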

Task | Opus 4.8 | Opus 4.7 | Opus 4.6 | Spread
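As a rough illustration of the Spread column, the sketch below computes a per-model pass rate for each task and takes the gap between the best and worst model. The data layout, function names, and task IDs are assumptions made for this example, and the sample results are invented. The dashboard's actual computation is not shown on this page.

```python
from typing import Dict, List


def pass_rate(results: List[bool]) -> float:
    """Fraction of runs that passed; raises on an empty history."""
    if not results:
        raise ValueError("no runs recorded")
    return sum(results) / len(results)


def spread(per_model: Dict[str, List[bool]]) -> float:
    """Gap between the highest and lowest per-model pass rate on one task."""
    rates = [pass_rate(runs) for runs in per_model.values()]
    return max(rates) - min(rates)


def disagreeing_tasks(
    history: Dict[str, Dict[str, List[bool]]]
) -> Dict[str, float]:
    """Map task -> spread for tasks with a nonzero spread, largest first."""
    spreads = {task: spread(models) for task, models in history.items()}
    nonzero = {task: s for task, s in spreads.items() if s > 0}
    return dict(sorted(nonzero.items(), key=lambda kv: kv[1], reverse=True))


if __name__ == "__main__":
    # Invented pass/fail history over four paired runs (not real results).
    history = {
        "example/A": {"4.8": [True] * 4, "4.7": [True] * 4, "4.6": [False] * 4},
        "example/B": {"4.8": [True] * 4, "4.7": [True, True, False, False],
                      "4.6": [True] * 4},
        "example/C": {"4.8": [True] * 4, "4.7": [True] * 4, "4.6": [True] * 4},
    }
    print(disagreeing_tasks(history))
```

With only a handful of paired runs, a small nonzero spread can still come from a single flaky run, so a minimum-spread threshold may be a sensible filter before a task is listed.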

Per-Task Results (latest run)

Task | Function | Result | Base | EvalPlus | Attempts | Turns | Cost | Error