Insider Brief
- A new study from Westlake University finds that AI systems built to do science are improving but still fall short of being independent researchers.
- While large language models can now generate hypotheses and draft papers, they struggle with executing experiments and adapting based on feedback—key traits of real scientists.
- An analysis of AI-generated research papers showed consistent flaws in methodology and reasoning, raising concerns about the risk of low-quality AI-generated science entering the research ecosystem.
Artificial intelligence systems built to conduct scientific research are showing rapid progress—but they still fall short of becoming independent agents of discovery.
That’s the conclusion of a new study led by researchers at Westlake University and posted to the pre-print server arXiv, which evaluates the maturity of AI-powered “scientist” systems. These systems, often powered by large language models (LLMs), can already retrieve scientific knowledge, generate hypotheses, and write research papers. In a few cases, AI-generated papers have been accepted at major machine learning workshops. But despite these early milestones, the study finds that today’s AI scientists remain deeply reliant on human oversight and lack the rigor and reliability to operate on their own.
The study lays out a framework for what it calls a “mature AI Scientist,” identifying four core capabilities: acquiring domain knowledge, generating original scientific ideas, verifying those ideas through experiments, and improving over time through feedback and adaptation. Most systems today have made progress in the first two categories. The last two — experimentation and learning — remain weak spots, and they are critical if AI is to become a real scientific actor rather than just an assistant.
Where AI Scientists Excel—and Where They Don’t
The paper treats these four capabilities as the developmental stages of an AI Scientist system and maps where progress is strongest and where major gaps remain:
- Knowledge Acquisition: AI systems are increasingly capable of retrieving, summarizing, and synthesizing scientific literature using advanced retrieval-augmented generation and multi-agent architectures. Domain-specific models like BioBERT and tools such as Semantic Scholar’s API have helped AI better understand specialized topics (a minimal sketch of this kind of retrieval follows the list).
- Idea Generation: LLMs can now formulate hypotheses that are rated by experts as more novel — but not necessarily more feasible — than those proposed by humans. Some systems can generate thousands of hypotheses at once and use re-ranking tools to identify promising candidates. However, feasibility evaluations still depend largely on human judgment.
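To make the retrieval step concrete, here is a minimal, illustrative sketch of the kind of literature lookup described above, using Semantic Scholar’s public Graph API. The query string, the fields requested, and the crude keyword-overlap re-ranking are assumptions made for this article, not the pipeline of any specific AI Scientist system covered in the paper.

```python
# Illustrative sketch of literature retrieval via Semantic Scholar's public Graph API.
# The query, field list, and keyword-overlap re-ranking are this article's assumptions,
# not the method used by the study or by any particular AI Scientist system.
import requests


def search_papers(query: str, limit: int = 20) -> list[dict]:
    """Fetch candidate papers (title, abstract, year) for a search query."""
    resp = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": query, "limit": limit, "fields": "title,abstract,year"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])


def rerank_by_keyword_overlap(papers: list[dict], keywords: set[str]) -> list[dict]:
    """Crude stand-in for a re-ranking step: score papers by keyword hits."""
    def score(paper: dict) -> int:
        text = ((paper.get("title") or "") + " " + (paper.get("abstract") or "")).lower()
        return sum(1 for kw in keywords if kw.lower() in text)
    return sorted(papers, key=score, reverse=True)


if __name__ == "__main__":
    hits = search_papers("retrieval-augmented generation for scientific discovery")
    top = rerank_by_keyword_overlap(hits, {"hypothesis", "benchmark", "agent"})
    for paper in top[:5]:
        print(paper.get("year"), "-", paper.get("title"))
```

Real systems layer embedding-based retrieval and LLM synthesis on top of this kind of lookup; the sketch covers only the first, and easiest, step.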
Verification and evolution remain underdeveloped, according to the study.
While several AI systems can draft experimental plans and generate code to test hypotheses, the study shows that few can carry out those plans with enough accuracy to meet scientific standards. Benchmarks that test the ability to implement or reproduce experiments from scientific papers show performance levels far below what would be required for reliable research. In one benchmark, models succeeded only 16% of the time at producing working code for a standard machine learning task. In another, they were able to reproduce published results just 26% of the time.
The final capability — evolution — is where current systems are furthest behind. Most AI scientists operate in short cycles. They don’t yet dynamically adjust their research path based on past findings, nor do they learn across experiments the way human scientists do. Some attempts are emerging, such as tree-search planning and multi-agent collaboration platforms, but these remain early-stage.
The Risks of Shallow Science
The consequences of these shortcomings are clear when researchers look at what AI scientists are actually producing. The study includes a review of 28 publicly available papers generated by five leading AI Scientist systems. Every paper had at least one major experimental flaw. Most had several. Issues ranged from missing methodology sections to weak novelty, flawed logic, and inadequate literature reviews. The highest-scoring system received just 4.63 out of 10 in aggregate quality.
The authors argue that these aren’t isolated glitches. They reflect a systemic problem: AI Scientists today can mimic the form of research, but they struggle with its substance. While they can generate plausible-sounding papers, they often lack rigorous empirical support, precise reasoning, or a clear connection to existing knowledge. In short, they may look like scientists — but they don’t think like scientists.
That raises a risk for the broader research ecosystem. As these systems become more widespread, there’s a danger that AI-generated research could flood journals or preprint servers with papers that are difficult to verify but hard to distinguish from human work. Without stronger validation tools, the line between real and artificial scholarship could blur.
What Needs to Change
To bridge the gap, the study calls for improvements on two fronts: better models and better scientific instincts. Today’s LLMs are powerful but flawed. They can hallucinate false claims, forget previously learned information when retrained, and struggle to keep up with fast-moving research fields. Updating them with the latest findings remains slow and expensive.
Equally important are advances in what the authors call “scientific reasoning.” That includes the ability to plan multi-step experiments, understand tradeoffs, adapt research direction based on feedback, and reason through ambiguous or conflicting results. These are all areas where current AI systems fall short.
The study also recommends developing standardized communication protocols between AI scientists, allowing them to interact and collaborate more effectively. A few prototype communities already exist, such as RESEARCHTOWN, where multiple AI agents simulate a research ecosystem by reading, reviewing, and building on each other’s work. But most systems today still operate in isolation.
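The paper’s call for standardized communication protocols is high-level, so the following is a purely hypothetical illustration, not something drawn from the study or from RESEARCHTOWN, of what a minimal shared message format between AI research agents might look like. All field names and message types here are this article’s invention.

```python
# Hypothetical sketch of a shared message format for AI research agents.
# The fields and message types are illustrative assumptions, not part of the study.
from dataclasses import dataclass, field, asdict
import json


@dataclass
class ResearchMessage:
    msg_id: str                                      # unique message identifier
    sender: str                                      # agent identifier, e.g. "hypothesis-agent-1"
    msg_type: str                                    # e.g. "hypothesis", "review", "experiment_result"
    content: str                                     # free-text body of the claim, critique, or result
    cites: list[str] = field(default_factory=list)   # paper IDs or prior msg_ids

    def to_json(self) -> str:
        """Serialize to JSON so heterogeneous agents can exchange messages."""
        return json.dumps(asdict(self))


# Example exchange: one agent proposes a hypothesis, another replies with a review.
proposal = ResearchMessage(
    "msg-001", "hypothesis-agent-1", "hypothesis",
    "Smaller curated corpora improve reproducibility of retrieval pipelines.",
)
review = ResearchMessage(
    "msg-002", "review-agent-2", "review",
    "Plausible, but needs a controlled benchmark before acceptance.",
    cites=["msg-001"],
)
print(proposal.to_json())
print(review.to_json())
```

Any real protocol would also need to carry structured evidence and provenance for experimental results, which is precisely the verification capability the study identifies as weakest.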
Still a Tool, Not a Colleague
While the long-term promise of AI science is real, the study’s central message is one of caution. Even as these systems grow more capable, they remain tools: powerful tools, to be sure, but not yet peers. Their ability to mimic human researchers masks deeper limitations in judgment, creativity, and experimental rigor.
The authors compare current AI scientists to early-stage graduate students: energetic, capable of summarizing literature and brainstorming ideas, but not yet ready to lead their own research agenda. As the field moves forward, success will depend less on generating flashier outputs and more on deepening the systems’ ability to reason, revise, and reflect.
The study was conducted by a team of researchers based primarily at Westlake University, including Qiujie Xie, Yixuan Weng, Minjun Zhu, Fuchen Shen, Shulin Huang, Zhen Lin, Zilan Mao, Zijie Yang, Jian Wu, and corresponding author Yue Zhang; Xie and Zhu are also affiliated with Zhejiang University. The team also includes Jiahui Zhou of Dalian University of Technology and Linyi Yang of University College London.
For a deeper, more technical dive, please review the paper on arXiv. It is worth noting that arXiv is a pre-print server, which allows researchers to receive quick feedback on their work; however, neither an arXiv posting nor this article is a peer-reviewed publication. Peer review remains an important step in the scientific process for verifying the work.