Microsoft Research Examines New Way to Predict and Explain AI Model Performance

Insider Brief

  • Microsoft researchers have developed a new AI evaluation method that aims to predict and explain model performance on unfamiliar tasks.
  • ADeLe assesses 18 types of cognitive and knowledge-based abilities, creating detailed ability profiles that reveal strengths, weaknesses, and performance thresholds across large language models.
  • The method achieved 88% accuracy in predicting task success for models like GPT-4o and LLaMA-3.1-405B and could serve as a tool for evaluating AI in policy, security, and multimodal systems.

A new evaluation method developed by Microsoft researchers aims to predict and explain how artificial intelligence (AI) models will perform on unfamiliar tasks, offering a detailed look into their strengths and weaknesses and paving the way for more transparent and reliable AI deployments.

The method, introduced in a study led by Lexin Zhou and Xing Xie and supported by Microsoft’s Accelerating Foundation Models Research grant program, centers on a framework called ADeLe (annotated-demand-levels), the company announced in a blog post. Unlike traditional benchmarks that measure overall accuracy, ADeLe scores tasks along measurement scales for 18 types of cognitive and knowledge-based abilities the tasks require. It then compares those demands with a model’s capabilities to forecast outcomes.

The paper, titled “General Scales Unlock AI Evaluation with Explanatory and Predictive Power,” outlines a method that applies a scoring scale from 0 to 5 to each of the 18 dimensions, which include attention, reasoning, and domain-specific knowledge. Originally created for human evaluations, these rubrics were adapted to AI and shown to be consistent when applied by the models themselves, according to the researchers. By analyzing 16,000 examples from 63 tasks across 20 AI benchmarks, the team created ability profiles for 15 large language models (LLMs).
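To make the annotation scheme concrete, here is a minimal sketch of what an ADeLe-style task annotation could look like in code. This is purely illustrative, not the paper's implementation: the dimension names shown are a hypothetical subset of the 18, and the data structure is an assumption.

```python
# Illustrative sketch: representing an ADeLe-style task annotation,
# where each demand dimension receives a level from 0 to 5.

from dataclasses import dataclass, field

# Hypothetical subset of the 18 dimensions mentioned in the article.
DIMENSIONS = ["attention", "reasoning", "domain_knowledge"]


@dataclass
class TaskDemands:
    """Demand levels for one task, keyed by dimension name."""
    scores: dict = field(default_factory=dict)

    def set_level(self, dimension: str, level: int) -> None:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        if not 0 <= level <= 5:
            raise ValueError("demand levels run from 0 to 5")
        self.scores[dimension] = level


task = TaskDemands()
task.set_level("reasoning", 4)
task.set_level("attention", 2)
print(task.scores)  # {'reasoning': 4, 'attention': 2}
```

Aggregating such annotations over a benchmark's tasks yields the demand histograms shown in the paper's Figure 1.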

Figure 1. Top: For each AI model, (1) run the new system on the ADeLe benchmark, and (2) extract its ability profile. Bottom: For each new task or benchmark, (A) apply 18 rubrics and (B) get demand histograms and profiles that explain what abilities the tasks require. Optionally, predict performance on the new tasks for any system based on the demand and ability profiles, or past performance data, of the systems. Credit: Microsoft

The findings, Microsoft points out, highlight three central insights. First, many existing AI benchmarks fail to test what they intend. For example, the Civil Service Examination benchmark, assumed to measure logical reasoning, also draws heavily on specialized knowledge and metacognition. The TimeQA benchmark only includes questions of moderate difficulty, limiting its usefulness in assessing true model range.

Second, the ability profiles created with ADeLe reveal meaningful patterns in model behavior. Each profile is based on the difficulty level at which a model achieves a 50% success rate for a given ability. The researchers plotted subject characteristic curves for each ability, uncovering performance thresholds and documenting how ability varies by model architecture, training, and scale. The results show that newer LLMs generally perform better, but not universally across all abilities. The researchers report that gains are especially evident in reasoning and abstraction but taper off beyond a certain model size.
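The 50%-threshold idea can be sketched as follows. This is an assumption-laden simplification: the success rates below are made up, and the paper fits characteristic curves rather than linearly interpolating between observed levels.

```python
# Illustrative sketch: estimate a model's ability on one dimension as the
# demand level at which its success rate crosses 50%, by linear
# interpolation between observed per-level success rates.

def ability_at_50pct(success_by_level: dict) -> float:
    """Return the demand level where success falls through 0.5."""
    levels = sorted(success_by_level)
    for lo, hi in zip(levels, levels[1:]):
        s_lo, s_hi = success_by_level[lo], success_by_level[hi]
        if s_lo >= 0.5 >= s_hi:  # crossing happens between lo and hi
            return lo + (s_lo - 0.5) / (s_lo - s_hi) * (hi - lo)
    # No crossing observed: the model is above (or below) 50% everywhere.
    last = levels[-1]
    return float(last) if success_by_level[last] >= 0.5 else float(levels[0])


# Hypothetical success rates for one model on the "reasoning" dimension.
rates = {0: 0.98, 1: 0.95, 2: 0.80, 3: 0.50, 4: 0.20, 5: 0.05}
print(ability_at_50pct(rates))  # 3.0
```

Repeating this estimate for each of the 18 dimensions produces the radial ability profiles the study reports for its 15 models.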

Third, ADeLe can predict how models will perform on new, unseen tasks. The prediction system, which matches task demands with model abilities, achieved around 88% accuracy on well-known models like GPT-4o and LLaMA-3.1-405B. This exceeds the performance of traditional evaluation methods and offers a way to anticipate failures before deployment.
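A toy version of demand-vs-ability matching could look like the sketch below. Note the hedge: the paper trains an assessor model to make these predictions, whereas this sketch uses a simple threshold rule (succeed when ability meets or exceeds demand on every dimension) purely to illustrate the idea; all profile values are invented.

```python
# Illustrative sketch: predict task success by comparing a model's ability
# profile against a task's demand profile, dimension by dimension.
# (A simplification; the actual system is a trained predictor.)

def predict_success(abilities: dict, demands: dict) -> bool:
    """True if the model's ability covers every demanded level."""
    return all(abilities.get(dim, 0.0) >= level
               for dim, level in demands.items())


# Hypothetical ability and demand profiles.
model_abilities = {"attention": 3.5, "reasoning": 3.0, "domain_knowledge": 2.0}
easy_task = {"attention": 2, "reasoning": 3}
hard_task = {"reasoning": 3, "domain_knowledge": 4}

print(predict_success(model_abilities, easy_task))  # True
print(predict_success(model_abilities, hard_task))  # False
```

Because the rule pinpoints which dimension fails, a wrong prediction is itself explanatory, which is the property the researchers highlight over accuracy-only benchmarks.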

The researchers report seeing broad applications for ADeLe. Beyond text-based LLMs, the framework could be adapted for evaluating multimodal and embodied AI systems. They also suggest that ADeLe could serve as a standardized tool for government and industry bodies developing AI policy, security protocols, or research roadmaps.

Microsoft added in the post that the study contributes to a growing body of work urging the adoption of psychometric-style evaluation for AI. It aligns with recent white papers advocating for more rigorous, scalable, and interpretable evaluation systems as general-purpose AI capabilities outpace existing benchmarks. Microsoft researchers aim to establish a collaborative community to refine and expand ADeLe as an open standard.

“As general-purpose AI advances faster than traditional evaluation methods, this work lays a timely foundation for making AI assessments more rigorous, transparent, and ready for real-world deployment,” the Microsoft team noted. “The research team is working toward building a collaborative community to strengthen and expand this emerging field.”
