News commentary

Journalists Need Their Own Benchmark Tests for AI Tools

Columbia Journalism Review · Klaudia Jaźwińska and Aisvarya Chandrasekar

A recent paper from OpenAI researchers sheds new light on why large language models (LLMs) are prone to “hallucination,” or fabricating information. According to the paper, the evaluation methods major AI companies use encourage overconfidence. Performance tests often take the form of multiple-choice questions with a single correct answer, a format that unintentionally rewards models for guessing rather than declining to answer when they are uncertain. By optimizing their systems to score highly on these evaluations, AI companies are training their models to be good test-takers instead of improving their overall accuracy.
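
The incentive described above comes down to simple arithmetic, which can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's method: the function name `expected_score`, the zero score for abstaining, and the one-point penalty for wrong answers are all hypothetical choices made for the example.

```python
def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected points for answering a question that is right with probability p_correct.

    A correct answer earns 1 point; a wrong answer loses `wrong_penalty` points.
    """
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


# Assumption: declining to answer ("I don't know") earns zero points.
ABSTAIN_SCORE = 0.0

# A blind guess on a four-option multiple-choice question.
p_guess = 0.25

# Right-or-wrong grading with no penalty for errors.
binary = expected_score(p_guess)

# Hypothetical alternative: each wrong answer costs one point.
penalized = expected_score(p_guess, wrong_penalty=1.0)

print(f"guess, no penalty:   {binary:+.2f} vs abstain {ABSTAIN_SCORE:+.2f}")
print(f"guess, with penalty: {penalized:+.2f} vs abstain {ABSTAIN_SCORE:+.2f}")
```

With no penalty for errors, a guess can never score below an abstention, so a model tuned to maximize its score learns to answer every question. Only when wrong answers cost something does declining to answer become the better choice for a model that is unsure.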

There’s a growing recognition among researchers that popular benchmark tests used to evaluate how well models perform at skills such as general reasoning, math, or coding often fail to capture the models’ real-world capabilities. Companies are under competitive pressure to demonstrate constant progress, so they optimize their models to perform well and rank high on benchmark leaderboards. A study of Chatbot Arena, a widely used benchmark platform, found that major companies like OpenAI, Meta, and Google test many variants of their models privately and release the scores of only the best-performing versions, leaving out poor results. The study authors argue that this process misrepresents the actual capabilities of the models.
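
Why publishing only the best of many private runs inflates a score can be shown with a small simulation. This is a rough sketch, not a reconstruction of the study's analysis: it assumes every variant has the same underlying skill (50 on an arbitrary scale) and that each test run adds random noise, and the function names and numbers are invented for the illustration.

```python
import random
import statistics


def measured_score(true_skill: float, noise_sd: float, rng: random.Random) -> float:
    """One noisy benchmark measurement of a model with the given true skill."""
    return rng.gauss(true_skill, noise_sd)


def average_reported_score(
    n_variants: int,
    trials: int = 2000,
    true_skill: float = 50.0,  # assumption: all variants are equally capable
    noise_sd: float = 3.0,     # assumption: run-to-run noise in the benchmark
    seed: int = 0,
) -> float:
    """Average published score when only the best of n_variants runs is reported."""
    rng = random.Random(seed)
    reported = [
        max(measured_score(true_skill, noise_sd, rng) for _ in range(n_variants))
        for _ in range(trials)
    ]
    return statistics.mean(reported)


for n in (1, 5, 10):
    print(f"best of {n:>2} variants: {average_reported_score(n):.1f}")
```

Even though no variant in this toy setup is better than any other, the published number climbs as more variants are tested, because the maximum of several noisy measurements tends to sit above the true value.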