Study: New Method for Evaluating AI Can Cut Costs, Improve Fairness

Insider Brief

  • Stanford University researchers have introduced a new evaluation method for large language models (LLMs) that reduces costs and enhances fairness by assigning difficulty scores to benchmark questions, according to a paper presented at the International Conference on Machine Learning.
  • Funded by the MacArthur Foundation, Stanford HAI, and Google Inc., the method uses Item Response Theory, a concept from standardized testing, to select adaptive question subsets that deliver more accurate comparisons across AI models while cutting evaluation costs by up to 80%.
  • Applied across 22 datasets and 172 models spanning medicine, mathematics, and law, the system improves the integrity of AI assessments by identifying and removing previously seen questions and allows for tracking model safety metrics over time, supporting more transparent and trustworthy AI development.

A new method for evaluating artificial intelligence models promises to cut costs and improve fairness, according to Stanford University researchers who developed the approach with funding from the MacArthur Foundation, Stanford HAI, and Google Inc.

Detailed in a paper presented at the International Conference on Machine Learning, the method introduces an adaptive question selection system that assesses the difficulty of benchmark questions to more accurately compare language model performance.

As AI developers release increasingly advanced language models, they often claim improved performance based on benchmark testing. These evaluations, which use large banks of test questions, typically require extensive human review, making them both time-consuming and expensive. According to Stanford researchers, the evaluation process can cost as much as, or more than, model training itself. Moreover, practical limitations force developers to use only subsets of questions, which can skew results if easier questions are overrepresented.

Led by computer science assistant professor Sanmi Koyejo, the Stanford team developed a system that assigns difficulty scores to benchmark questions using Item Response Theory, a concept long employed in standardized testing. This enables evaluators to account for question difficulty when comparing model results, leveling the playing field between models and reducing the chance of misleading outcomes due to easier test sets.
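The paper itself does not include code, but the core idea of difficulty-adjusted scoring can be illustrated with the simplest Item Response Theory variant, the one-parameter Rasch model, in which the probability of a correct answer depends on the gap between a model's latent ability and a question's difficulty. The sketch below (function names and the toy data are illustrative, not from the study) fits an ability score that accounts for which questions a model got right *and* how hard they were:

```python
import math

def p_correct(ability, difficulty):
    """Rasch (one-parameter IRT) model: probability that a model of the
    given ability answers a question of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def estimate_ability(responses, difficulties, steps=500, lr=0.1):
    """Fit a single ability score by gradient ascent on the Rasch
    log-likelihood, given 0/1 correctness and per-question difficulties.
    (A minimal sketch; the paper's actual estimation procedure may differ.)"""
    ability = 0.0
    for _ in range(steps):
        grad = sum(r - p_correct(ability, d)
                   for r, d in zip(responses, difficulties))
        ability += lr * grad
    return ability

# Toy example: a model that aces the easy questions but misses the hard ones.
responses    = [1, 1, 1, 0, 0]
difficulties = [-2.0, -1.0, 0.0, 1.5, 2.5]  # calibrated difficulty scores
theta = estimate_ability(responses, difficulties)
```

Under this model, two systems with the same raw accuracy receive different ability estimates if one happened to draw an easier question subset, which is exactly the "luck of the draw" effect the researchers describe.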

“The key observation we make is that you must also account for how hard the questions are,” Koyejo said. “Some models may do better or worse just by luck of the draw. We’re trying to anticipate that and adjust for it to make fairer comparisons.”

The researchers indicated they applied their approach across 22 datasets and 172 different language models, demonstrating its adaptability across varied domains such as medicine, mathematics, and law. The system uses AI-generated questions, calibrated to specific difficulty levels, which both lowers costs and automates the replenishment of question banks. This method also allows for the identification and removal of previously seen, or “contaminated,” questions from the datasets, improving the integrity of evaluations.

Co-author Sang Truong, a doctoral candidate at the Stanford Artificial Intelligence Lab, emphasized that this adaptive method reduces evaluation costs by up to 80% in some cases while delivering more consistent comparisons. Additionally, the system was able to detect nuanced changes in the safety metrics of versions of GPT-3.5, highlighting its capacity for tracking performance shifts over time. Safety in this context refers to the robustness of models against manipulation, exploitation, and other vulnerabilities, researchers noted.
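The cost savings come from asking fewer, better-chosen questions. A standard heuristic from computerized adaptive testing (a plausible reading of the article's "adaptive question selection," not necessarily the paper's exact rule) is to pick the next question that is most informative at the current ability estimate; under the Rasch model that is the question whose difficulty sits closest to the ability, where the Fisher information p(1 − p) peaks:

```python
import math

def p_correct(ability, difficulty):
    """Rasch model probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def next_question(ability, pool):
    """Pick the question from the pool with the highest Rasch information
    p * (1 - p) at the current ability estimate -- i.e. the question whose
    difficulty is nearest the model's ability. Hypothetical selection rule
    for illustration; the study's criterion may differ."""
    def info(difficulty):
        p = p_correct(ability, difficulty)
        return p * (1.0 - p)
    return max(pool, key=info)

# With an estimated ability of 0.6, the most informative remaining
# question is the one with difficulty closest to 0.6.
pool = [-2.0, -0.5, 0.7, 1.8, 3.0]
chosen = next_question(0.6, pool)
```

Repeating this pick-and-update loop lets an evaluator reach a stable ability estimate with a fraction of the full question bank, which is consistent with the reported reductions in evaluation cost.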

The Stanford researchers argue that better evaluation tools will benefit both AI developers and end-users by improving diagnostics and providing more transparent assessments of AI models. By reducing costs and enhancing fairness in model evaluation, the system could help accelerate AI development while increasing trust in the technology.

In addition to Stanford, contributors to the research include collaborators from the University of California, Berkeley, and the University of Illinois Urbana-Champaign. Koyejo and co-author Bo Li are affiliated with Virtue AI, which also supported the project.

“And, for everyone else,” Koyejo said, “it will mean more rapid progress and greater trust in the quickly evolving tools of artificial intelligence.”
