Benchmarking AI is a hard problem. As the benchmarks become the goal (Goodhart's law), model makers train their models to do well on them, and hence the benchmarks cease to be good measures.
Unpacking Deepseek