Is Claude Dumb Today?

Daily HumanEvalPlus-CC164 benchmark for Claude Code (Opus 4.7)



Score · Model · Cost · Runtime

Score History (last 90 runs)

Where the models disagree

Tasks where Opus 4.6 and 4.7 have different pass rates over recent paired runs. Green = always passes, red = always fails. A task that’s green for one model and red for the other reveals a real tradeoff, not noise. Historical divergences include HumanEval/97 (Python signed-modulo quirk) and HumanEval/141 (Unicode .isalpha() vs literal a–z range).
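Both historical divergences come down to well-defined Python semantics that a model can get wrong if it reasons in C-like terms. The snippet below is an illustrative sketch of the two quirks, not the benchmark's actual test cases:

```python
# HumanEval/97-style quirk: Python's % takes the sign of the divisor,
# unlike C, where -7 % 3 evaluates to -1.
assert -7 % 3 == 2
assert 7 % -3 == -2
# A solution assuming C semantics fails exactly these negative-operand cases.

# HumanEval/141-style quirk: str.isalpha() is Unicode-aware, so it accepts
# letters well outside a literal a-z / A-Z range check.
assert "é".isalpha()            # True: "é" is a Unicode letter
assert not ("a" <= "é" <= "z")  # ...but it falls outside ASCII a-z
```

A model that filters with `isalpha()` and one that filters with an `a-z` range check will therefore disagree only on inputs containing non-ASCII letters, which is why the split shows up as a consistent green/red pair rather than noise.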

Task | Opus 4.7 | Opus 4.6 | Delta

Per-Task Results (latest run)

Task | Function | Result | Base | EvalPlus | Attempts | Turns | Cost | Error