AI Flunks First Scientific Reasoning Test, Study Finds


Insider Brief

  • A new benchmark called Scientists’ First Exam reveals that state-of-the-art AI models perform poorly on scientific reasoning tasks, scoring far below their results on general benchmarks.
  • The test evaluates 830 expert-designed, bilingual visual questions across five scientific domains and three levels of cognitive difficulty: perception, understanding, and reasoning.
  • Results show that model size alone does not predict success; performance depends more on domain-specific training data, with structured tasks in materials science proving easier and astronomy posing the greatest challenge.

A new test for evaluating artificial intelligence in scientific domains has handed today's AI systems a grade that won't make their parents — or, at least, their developers — very happy.

According to the study, published on the pre-print server arXiv, even the most advanced AI systems perform poorly on real-world scientific reasoning tasks. Developed by researchers at the Shanghai Artificial Intelligence Laboratory, the “Scientists’ First Exam” (SFE) reveals that multimodal large language models (MLLMs) still fall short in perception, understanding and reasoning—three core components of scientific cognition.

The benchmark, which comprises 830 bilingual visual question-answering (VQA) pairs across 66 expert-defined tasks, spans five disciplines: astronomy, chemistry, earth science, life sciences and materials science. Each task is designed to reflect real scientific challenges by integrating native data formats such as spectra, X-ray diffraction patterns and molecular diagrams, making it far more demanding than prior academic-style tests.

State-of-the-art models, including GPT-o3 and InternVL-3, achieved top scores of just 34.08% and 26.52%, respectively—far below their 80%–90% performance on general-purpose benchmarks like MMLU and ScienceQA. This performance gap highlights the difficulty of transferring AI’s existing language prowess to rigorous scientific reasoning.

Perception, Understanding and Reasoning

Unlike traditional tests that focus primarily on factual recall or general comprehension, SFE breaks down scientific cognition into three levels: signal perception (identifying features in raw data), attribute understanding (interpreting scientific meaning), and comparative reasoning (deriving conclusions from multiple data sources). Each level is probed using open-ended, multiple-choice, and exact-match formats. All questions are paired with rendered scientific images or visualizations, requiring models to integrate multimodal data — something existing models still struggle with.
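To make that three-level structure concrete, here is a minimal sketch of how one SFE-style question item could be represented in Python. The field names and example values are illustrative assumptions made for this article, not the benchmark's actual data schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SFEItem:
    """Hypothetical representation of one bilingual VQA item (not the official schema)."""
    domain: str              # e.g. "chemistry", "astronomy"
    level: int               # 1 = signal perception, 2 = attribute understanding, 3 = comparative reasoning
    question_format: str     # "open-ended", "multiple-choice", or "exact-match"
    image_path: str          # rendered spectrum, XRD pattern, molecular diagram, etc.
    question_en: str
    question_zh: str
    answer: str
    choices: Optional[list[str]] = None  # only used for multiple-choice items

# Illustrative example item (invented for this sketch)
item = SFEItem(
    domain="materials science",
    level=1,
    question_format="multiple-choice",
    image_path="xrd_pattern_001.png",
    question_en="Which peak corresponds to the (111) reflection?",
    question_zh="哪个峰对应 (111) 反射？",
    answer="B",
    choices=["A", "B", "C", "D"],
)
```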

Construction of the SFE benchmark involved deep collaboration with domain experts. Researchers first identified 18 core scientific directions, such as reaction prediction in chemistry or atmospheric circulation in earth science. Experts then formulated tasks grounded in real-world research needs, selected appropriate datasets (e.g., ERA5, RCSB PDB, PubChem), and created questions and answers with corresponding data visualizations. Each VQA pair underwent a two-step validation process — scientific peer review followed by format checking — ensuring both rigor and consistency across English and Chinese.

To evaluate model performance, the study employed multiple metrics. Beyond simple accuracy, it used BERTScore for text similarity and an “LLM-as-a-Judge” approach, where GPT-4o evaluated model answers for semantic correctness. For visual tasks such as object bounding or precipitation mapping, metrics like Intersection over Union (IoU) and execution success rates were used.
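For readers unfamiliar with Intersection over Union, the snippet below shows the standard box-IoU computation for two axis-aligned bounding boxes in (x1, y1, x2, y2) form. It illustrates the metric generically and is not taken from the study's evaluation code.

```python
def box_iou(box_a, box_b):
    """Intersection over Union for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the overlapping region (if any)
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a predicted box that partially overlaps the ground-truth box
print(box_iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.14
```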

The results revealed that closed-weight models from commercial developers consistently outperformed open-weight models. Closed-weight models, unlike open-weight models, are AI systems whose internal parameters, or weights, are not publicly released. GPT-o3, a proprietary OpenAI model, led the field, particularly in earth and materials sciences, where visual structure and symbolic representation played to the model's strengths. At the other end, Gemini-2.5-Pro scored just 8% overall. Even models built on the same architecture showed significant variation, reflecting the impact of fine-tuning and post-training strategies.

Materials Science Most Tractable

Across disciplines, materials science emerged as the easiest to handle for AI systems. Its tasks — often involving structured visual inputs such as lattice diagrams or XRD patterns — aligned well with current model capabilities. In contrast, astronomy presented the greatest challenge. Models had difficulty interpreting noisy spectra or estimating physical properties from images, underscoring the difficulty of applying machine reasoning to complex observational data.

More importantly, the benchmark revealed a potential shift in model capability emphasis. Some models showed marked improvement on Level 3 tasks — such as comparative reasoning — without corresponding gains in Level 2 tasks. For instance, GPT-o3 improved Level 3 performance by 10 percentage points over its predecessor, GPT-4.1, while maintaining similar performance on Level 2. This suggests that recent models may be increasingly optimized for abstract reasoning, possibly due to changes in training methods such as chain-of-thought prompting and tool-use strategies.

Model Robustness

SFE also tested model robustness across languages. On average, models performed 1–3 percentage points worse in Chinese than in English, reflecting the challenges of cross-lingual scientific reasoning. Though the dataset was fully bilingual, linguistic nuances and model training biases likely contributed to the gap.

The researchers also investigated the impact of model scale and temperature settings. In this context, 'temperature' refers not to heat but to a setting that controls the randomness of the model's responses. Larger models didn't consistently perform better than smaller ones, suggesting that simply increasing model size isn't enough. To handle scientific reasoning well, models need to be trained on large-scale data that is both relevant to science and well-structured. Meanwhile, moderate temperature settings (0.4–0.6) provided optimal performance, striking a balance between deterministic outputs and creative reasoning.
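As an illustration of what a temperature sweep looks like in practice, the sketch below queries a chat model at several temperature values using the OpenAI Python client. The model name, prompt, and sweep values are placeholders chosen for this article; this is a generic example, not the study's evaluation harness.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder prompt standing in for an SFE-style question
question = "Which peak in this XRD pattern corresponds to the (111) reflection?"

# Ask the same question at several temperatures; mid-range values (0.4-0.6)
# were reported as the sweet spot between deterministic and overly random output.
for temperature in (0.0, 0.4, 0.6, 1.0):
    response = client.chat.completions.create(
        model="gpt-4o",          # placeholder model name
        messages=[{"role": "user", "content": question}],
        temperature=temperature,
    )
    print(f"T={temperature}: {response.choices[0].message.content[:80]}")
```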

Beyond benchmark scores, the study poses deeper questions about AI’s role in science. While current MLLMs are capable of generating human-like text or summarizing documents, they remain fragile when applied to the layered reasoning required in scientific workflows. Many models failed to interpret multi-step visual information or draw correct inferences across modalities. Even top models focused disproportionately on the first few images when presented with longer visual sequences, highlighting limitations in processing extended visual context.

Looking ahead, the SFE benchmark offers a roadmap for model improvement. By clearly categorizing cognitive levels and scientific domains, it enables developers to pinpoint weaknesses, whether in molecular structure interpretation or precipitation event analysis, and tailor training accordingly. The dataset's open-source nature and dual-language support also make it a valuable resource for global AI and scientific communities.

But the study also issues a warning. As reliance on AI in scientific research grows, so does the risk of overestimating its capabilities. The allure of automation must not overshadow the need for human oversight, intuition, and creativity. Until MLLMs can robustly interpret data, draw valid conclusions, and do so across languages and modalities, their role will remain that of an assistant to, not a substitute for, human researchers.

The team writes: “… while SFE has the potential to significantly advance these discoveries by offering a robust evaluation framework, it also raises concerns about increasing reliance on AI in scientific research. This might inadvertently undermine the value of human intuition and creativity.”

No word on whether these AI systems will be grounded for the rest of the semester to try to bring these grades up.

The paper on arXiv goes into far greater technical depth than this summary, so reviewing the study is recommended for more exact scientific and technological detail. ArXiv is a pre-print server, meaning the work has not yet been officially peer-reviewed, a key step of the scientific process.
