Insider Brief
- Stanford University researchers have introduced a new evaluation method for large language models (LLMs) that reduces costs and enhances fairness by assigning difficulty scores to benchmark questions, according to a paper presented at the International Conference on Machine Learning.
- Funded by the MacArthur Foundation, Stanford HAI, and Google Inc., the method uses Item Response Theory, a concept from standardized testing, to adaptively select question subsets that deliver more accurate comparisons across AI models while cutting evaluation costs by up to 80%.
- Applied across 22 datasets and 172 models spanning medicine, mathematics, and law, the system improves the integrity of AI assessments by identifying and removing previously seen questions and allows for tracking model safety metrics over time, supporting more transparent and trustworthy AI development.
A new method for evaluating artificial intelligence models promises to cut costs and improve fairness, according to Stanford University researchers who developed the approach with funding from the MacArthur Foundation, Stanford HAI, and Google Inc.
Detailed in a paper presented at the International Conference on Machine Learning, the method introduces an adaptive question selection system that assesses the difficulty of benchmark questions to more accurately compare language model performance.
As AI developers release increasingly advanced language models, they often claim improved performance based on benchmark testing. These evaluations, which use large banks of test questions, typically require extensive human review, making them both time-consuming and expensive. According to Stanford researchers, the evaluation process can cost as much as, or more than, model training itself. Moreover, practical limitations force developers to use only subsets of questions, which can skew results if easier questions are overrepresented.
Led by computer science assistant professor Sanmi Koyejo, the Stanford team developed a system that assigns difficulty scores to benchmark questions using Item Response Theory, a concept long employed in standardized testing. This enables evaluators to account for question difficulty when comparing model results, leveling the playing field between models and reducing the chance of misleading outcomes due to easier test sets.
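Item Response Theory models the probability that a test taker answers a question correctly as a function of the taker's ability and the question's difficulty. The article does not reproduce the paper's exact formulation, but a minimal sketch of the widely used two-parameter logistic (2PL) model, with illustrative function names and made-up item parameters, shows how difficulty-aware scoring can separate two models that answer the same number of questions correctly:

```python
import numpy as np

def irt_2pl(ability, difficulty, discrimination=1.0):
    """Two-parameter logistic IRT model: probability that a model with a
    given ability answers an item of a given difficulty correctly."""
    return 1.0 / (1.0 + np.exp(-discrimination * (ability - difficulty)))

def estimate_ability(responses, difficulties, discriminations,
                     grid=np.linspace(-4, 4, 801)):
    """Maximum-likelihood ability estimate over a grid, given 0/1 responses
    to items with known difficulty and discrimination parameters."""
    responses = np.asarray(responses, dtype=float)
    probs = irt_2pl(grid[:, None], np.asarray(difficulties), np.asarray(discriminations))
    log_lik = (responses * np.log(probs) + (1 - responses) * np.log(1 - probs)).sum(axis=1)
    return grid[np.argmax(log_lik)]

# Toy example: both models answer three of five items correctly, but the
# model that gets the harder items right receives a higher ability score.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
discriminations = [1.0] * 5
print(estimate_ability([1, 1, 1, 0, 0], difficulties, discriminations))  # easier items correct
print(estimate_ability([0, 0, 1, 1, 1], difficulties, discriminations))  # harder items correct
```

In this toy setup, raw accuracy alone would rate the two models identically; the difficulty-adjusted estimate does not, which is the "leveling the playing field" effect the researchers describe.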
“The key observation we make is that you must also account for how hard the questions are,” said Koyejo, lead researcher and assistant professor of computer science in the School of Engineering. “Some models may do better or worse just by luck of the draw. We’re trying to anticipate that and adjust for it to make fairer comparisons.”
The researchers applied their approach across 22 datasets and 172 language models, demonstrating its adaptability across varied domains such as medicine, mathematics, and law. The system uses AI-generated questions, calibrated to specific difficulty levels, which both lowers costs and automates the replenishment of question banks. This method also allows for the identification and removal of previously seen, or “contaminated,” questions from the datasets, improving the integrity of evaluations.
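The article does not spell out the team's selection rule, but a common heuristic from computerized adaptive testing, sketched below with hypothetical names and a made-up question bank, asks the not-yet-used question that is most informative about the model's current ability estimate, which is how adaptive methods get by with far fewer questions:

```python
import numpy as np

def fisher_information(ability, difficulty, discrimination=1.0):
    """Fisher information of a 2PL item at the current ability estimate."""
    p = 1.0 / (1.0 + np.exp(-discrimination * (ability - difficulty)))
    return discrimination**2 * p * (1.0 - p)

def pick_next_item(ability_estimate, item_difficulties, already_asked):
    """Pick the unasked item that is most informative about the current
    ability estimate (a standard adaptive-testing heuristic)."""
    best, best_info = None, -1.0
    for idx, difficulty in enumerate(item_difficulties):
        if idx in already_asked:
            continue
        info = fisher_information(ability_estimate, difficulty)
        if info > best_info:
            best, best_info = idx, info
    return best

# With a current ability estimate of 0.5, the item whose difficulty is
# closest to 0.5 carries the most information and is asked next (index 2).
bank = [-2.0, -0.5, 0.4, 1.5, 3.0]
print(pick_next_item(0.5, bank, already_asked=set()))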
Co-author Sang Truong, a doctoral candidate at the Stanford Artificial Intelligence Lab, emphasized that this adaptive method reduces evaluation costs by up to 80% in some cases while delivering more consistent comparisons. Additionally, the system was able to detect nuanced changes in the safety metrics of versions of GPT-3.5, highlighting its capacity for tracking performance shifts over time. Safety in this context refers to the robustness of models against manipulation, exploitation, and other vulnerabilities, researchers noted.
The Stanford researchers argue that better evaluation tools will benefit both AI developers and end-users by improving diagnostics and providing more transparent assessments of AI models. By reducing costs and enhancing fairness in model evaluation, the system could help accelerate AI development while increasing trust in the technology.
In addition to Stanford, contributors to the research include collaborators from the University of California, Berkeley, and the University of Illinois Urbana-Champaign. Koyejo and co-author Bo Li are affiliated with Virtue AI, which also supported the project.
“And, for everyone else,” Koyejo said, “it will mean more rapid progress and greater trust in the quickly evolving tools of artificial intelligence.”