Insider Brief
- Google researchers developed an AI review system that found nearly 90% of certain math and computer science errors in a benchmark of flawed papers, suggesting AI could help ease pressure on scientific peer review.
- The Paper Assistant Tool divides manuscripts into sections, assigns more computing effort to difficult areas such as proofs and experiments, and synthesizes feedback into a technical review for authors.
- Pilot programs at STOC and ICML found strong author interest and led some researchers to fix theory gaps, revise claims and run new experiments before submission.
Google researchers say an AI system built to review scientific papers found nearly 90% of certain mathematical and computer science errors in a benchmark of flawed manuscripts, pointing to a possible new role for AI in easing the strain on peer review.
The system, called the Paper Assistant Tool, or PAT, is designed to read full scientific papers and produce detailed technical feedback before a manuscript is submitted to a conference or journal. In a study posted to arXiv, researchers from Google Research and Carnegie Mellon University said the tool can check theoretical claims, review experiments, flag possible flaws and suggest improvements.
The work comes as artificial intelligence is increasing the pace of scientific writing and discovery, while also adding pressure to a peer-review system that already depends heavily on unpaid expert labor. The researchers frame the problem as a validation bottleneck: AI can help generate more papers, proofs and experiments, but science still depends on careful checking before those claims are accepted.
The study does not suggest that AI should replace human reviewers now. Instead, the researchers present PAT as a tool that can help authors catch errors early and help reviewers spend less time on routine technical checking. They also warn that more powerful review agents could raise hard questions about accountability, bias, access and the proper line between machine assessment and human judgment.
A Tool for a Strained System
The researchers point to the rapid growth of submissions to major AI conferences as evidence that peer review is under mounting stress. According to figures cited in the study, combined submissions to ICLR, ICML and NeurIPS rose from 17,051 in 2020 to an estimated 73,883 in 2026.
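For a sense of scale, the two submission totals above can be turned into a growth multiple and an implied average annual rate. The short Python sketch below does only that arithmetic. It assumes smooth compound growth between 2020 and 2026, which is a simplification, and the 2026 figure is itself an estimate. The function names are illustrative.

```python
# Figures quoted in the article; the 2026 value is an estimate.
SUBMISSIONS_2020 = 17_051
SUBMISSIONS_2026 = 73_883


def growth_multiple(start: int, end: int) -> float:
    """How many times larger the end value is than the start value."""
    return end / start


def implied_annual_rate(start: int, end: int, years: int) -> float:
    """Average yearly growth rate, assuming smooth compound growth."""
    return (end / start) ** (1 / years) - 1


multiple = growth_multiple(SUBMISSIONS_2020, SUBMISSIONS_2026)
rate = implied_annual_rate(SUBMISSIONS_2020, SUBMISSIONS_2026, years=6)

print(f"Growth multiple: {multiple:.2f}x")
print(f"Implied average annual growth: {rate:.1%}")
```

On these figures, submissions grew roughly 4.3 times over six years, or about 28% a year if the growth had been even.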
The team said the trend cannot be attributed solely to AI, but they cite other work finding signs of AI-assisted writing in scientific abstracts and biomedical papers. Their broader point is that scientific production is rising faster than the community’s ability to verify it.
That problem is especially difficult in fields such as mathematics and theoretical computer science. In those areas, a serious review may require checking long proofs line by line. In machine learning, reviewers may also need to assess experiments, datasets, comparisons and statistical claims.
PAT was built to address that kind of technical burden. The system is powered by Google’s Gemini Deep Think technology and uses a multi-stage process rather than a single request to a language model.
First, the tool divides a paper into logical sections, such as theory, methods, experiments and conclusions. It can use overlapping or non-contiguous segments when needed. It then assigns more computing effort to dense or difficult sections, such as proofs or technical methods, and less effort to simpler parts of the paper.
Specialized review agents then analyze each section while still having access to the full paper. A final synthesis agent combines the results, removes duplicate comments, checks severity and uses search to reduce the chance of hallucinated claims, such as references to non-existent papers or theorems.
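The flow described above can be illustrated with a minimal Python sketch. It is not PAT's implementation. The `Section` and `Comment` types, the `difficulty` score, the proportional budget rule and the text-matching deduplication are all hypothetical stand-ins chosen for illustration. The review agents themselves and the search-based hallucination checks are left out.

```python
from dataclasses import dataclass


@dataclass
class Section:
    name: str
    difficulty: float  # hypothetical score: higher means denser material


@dataclass
class Comment:
    section: str
    issue: str
    severity: int  # hypothetical scale: higher means more serious


def allocate_budget(sections: list[Section], total_budget: float) -> dict[str, float]:
    """Split a compute budget across sections in proportion to difficulty."""
    total_difficulty = sum(s.difficulty for s in sections)
    if total_difficulty <= 0:
        raise ValueError("at least one section needs a positive difficulty")
    return {s.name: total_budget * s.difficulty / total_difficulty for s in sections}


def synthesize(comments: list[Comment]) -> list[Comment]:
    """Merge comments that raise the same issue, then order by severity.

    Two comments count as duplicates when their issue text matches after
    trimming whitespace and lowercasing. The most severe copy is kept.
    """
    best: dict[str, Comment] = {}
    for comment in comments:
        key = comment.issue.strip().lower()
        if key not in best or comment.severity > best[key].severity:
            best[key] = comment
    return sorted(best.values(), key=lambda c: c.severity, reverse=True)


# Illustrative inputs only; none of these values come from the study.
paper = [
    Section("theory", difficulty=0.5),
    Section("experiments", difficulty=0.25),
    Section("conclusions", difficulty=0.25),
]
raw_comments = [
    Comment("theory", "Gap in Lemma 2", severity=3),
    Comment("experiments", "gap in lemma 2", severity=2),
    Comment("experiments", "Missing baseline", severity=1),
]

print(allocate_budget(paper, total_budget=100))
for comment in synthesize(raw_comments):
    print(comment.severity, comment.section, comment.issue)
```

The proportional rule is the simplest way to give harder sections more effort. A real system would need some way to estimate difficulty, which this sketch takes as given.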
The researchers said this structure is meant to solve two common problems with simpler AI review methods. A single model call may not have enough effective context to deeply check a full paper. Running many independent calls may increase the chance of finding an error, but it can also produce many false alarms that a human must sort through.
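The trade-off with many independent calls can be made concrete with a toy model. The sketch below assumes each call catches a given error with the same probability `p`, independently of the others, and raises on average `f` false alarms. Both numbers are made up for illustration and do not come from the study, and real model calls are unlikely to be fully independent.

```python
def detection_probability(p: float, k: int) -> float:
    """Chance that at least one of k independent calls catches the error."""
    return 1 - (1 - p) ** k


def expected_false_alarms(f: float, k: int) -> float:
    """Expected total false alarms across k calls, before any deduplication."""
    return f * k


# Hypothetical per-call values, chosen only to show the shape of the trade-off.
P_DETECT = 0.3
FALSE_ALARMS_PER_CALL = 1.5

for calls in (1, 2, 4, 8):
    hit = detection_probability(P_DETECT, calls)
    noise = expected_false_alarms(FALSE_ALARMS_PER_CALL, calls)
    print(f"{calls} calls: detection {hit:.0%}, expected false alarms {noise:.1f}")
```

Under these assumptions the detection chance rises with diminishing returns, while the pile of false alarms grows in step with the number of calls. That is the sorting burden the researchers describe.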
Benchmark Results
To test PAT, the researchers used a subset of the SPOT benchmark, a collection of scientific manuscripts that contained verified mistakes later tied to errata or retractions. They narrowed the test to mathematics and computer science papers with equation or proof errors, producing a set of 26 papers and 29 errors.
The study compared PAT with a single zero-shot call to Gemini 3.1 Pro and with the original SPOT state-of-the-art result. The original SPOT result detected 21.1% of the errors. A single Gemini 3.1 Pro call detected 55.2%. PAT detected 89.7%.
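The article gives percentages rather than raw counts. As a rough check, the Python sketch below looks for whole numbers of errors out of 29 that would round to each reported rate. The counts it finds are inferences from the rounding and are not figures stated in the study. The helper name `counts_matching` is illustrative.

```python
TOTAL_ERRORS = 29  # size of the math and computer science subset described above


def counts_matching(percent: float, total: int) -> list[int]:
    """Whole-number counts whose share of `total` rounds to `percent` (one decimal)."""
    return [k for k in range(total + 1) if abs(100 * k / total - percent) < 0.05]


reported = {
    "Original SPOT result": 21.1,
    "Single Gemini 3.1 Pro call": 55.2,
    "PAT": 89.7,
}

for system, percent in reported.items():
    print(system, percent, counts_matching(percent, TOTAL_ERRORS))
```

The 55.2% and 89.7% figures are consistent with 16 and 26 of the 29 errors. No whole number out of 29 rounds to 21.1%, so that figure may rest on a different denominator. The article does not say.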
The researchers said the result shows that modern foundation models can already serve as useful error-detection tools, even without a specialized review pipeline. But they said PAT’s stronger performance suggests that inference scaling — giving the system a structured way to spend more reasoning effort on difficult parts of a task — can substantially improve the quality of review.
The study gives an example involving a paper on dual Banach spaces. According to the researchers, the single-model baseline accepted a false mathematical claim, while PAT constructed a counterexample and identified a gap that undermined the main theorem.
The researchers said the results imply that automated checks, if used carefully, could catch many serious problems before papers enter the literature. They also suggest that such tools could one day be incorporated into preprint servers or submission systems to give authors immediate feedback.
Google also tested PAT in pilot programs tied to two major computer science conferences: STOC, a leading theoretical computer science conference, and ICML, a leading machine learning conference.
In those pilots, PAT was offered to authors before final submission deadlines. It was not used as part of the formal peer-review process. Authors received one review of their manuscript and could decide whether and how to respond.
The STOC deployment, in November 2025, focused on math-heavy theory papers. The ICML deployment, in January 2026, used a broader version of PAT that could assess experiments, missing comparisons, possible confounding factors and technical claims. Across both pilots, PAT reviewed more than 4,700 submissions.
Survey results were largely positive. In the STOC group, 97% of respondents said they would use PAT again, and 85.1% said it improved clarity or readability. In the ICML group, 92.1% said they would use it again, and 87% said it improved clarity or readability.
More than 90% of respondents in both groups said the feedback was very or mostly helpful. In the STOC group, 55.8% said the feedback was mostly or fully grounded, while 64.8% of ICML respondents said the same.
The most striking results concerned substantive changes. In the STOC group, 11.6% of respondents said PAT identified significant errors in theoretical results that took more than an hour to fix. In the ICML group, 35.4% said the tool identified substantive theory gaps. Among ICML respondents, 31% said they ran new experiments because of PAT’s review.
The researchers said those results show the tool can do more than catch typographical errors. It can also push authors to strengthen claims, revise proofs and improve experimental support.
Limits and Risks
The study also identifies several limits. Authors reported cases of date hallucinations, outdated knowledge, problems parsing PDFs and false claims that a proof or argument was wrong. The researchers said they have improved search and parsing to address the first two issues, while false critiques remain a broader challenge for large language models.
Peer review is more than a search for mistakes. It is also a judgment about novelty, importance, clarity and fit. PAT currently avoids subjective rankings and acceptance recommendations and focuses on objective technical checks and suggestions for improvement.
The researchers also warn that AI review tools could create new burdens. If used by reviewers, AI-generated critiques may force humans to spend time checking the machine’s claims rather than checking the paper itself. If used by authors, such tools could make weak papers look more polished, making it harder for reviewers to distinguish deep contributions from superficial improvements.
The paper proposes four roles for AI in peer review. The first is AI as a tool for authors, the role PAT played in the STOC and ICML pilots. The second is AI as a tool for reviewers, helping human reviewers understand papers and find possible flaws. The third is AI as a supporting reviewer, producing a separate technical review for humans to consider. A variation of that role would allow AI to provide ratings or recommendations. The fourth is full automation of peer review.
The researchers treat that final role as a possible future scenario, not a recommendation for immediate adoption. They note that human peer review is itself inconsistent, citing prior NeurIPS experiments showing that independent review committees can reach different decisions on the same papers. But they also warn that automated review could narrow intellectual debate, embed centralized viewpoints or be gamed by authors who learn how to satisfy review agents.
Future Directions
The study points toward a future in which AI becomes part of the scientific quality-control system, first as a private assistant for authors and then possibly as a technical aide for reviewers and editors. The researchers said the immediate value is in catching errors before submission, improving paper clarity and reducing the cognitive burden on reviewers.
Longer term, they suggest AI systems could support new publication models, including repositories where papers receive automated technical vetting before being widely circulated. Such a system could sit between ordinary preprints and traditional peer-reviewed journals, offering some measure of validation without the full delay of human review.
But the researchers argue that the scientific community must set clear policies before moving too far in that direction. Key questions include who is accountable when an AI review is wrong, how authors can challenge hallucinated critiques, how to prevent unequal access to powerful review tools and how to avoid deskilling human reviewers.
The paper was written by Rajesh Jayaram, Drew Tyler, David Woodruff, Corinna Cortes, Yossi Matias, Vahab Mirrokni and Vincent Cohen-Addad.
For a deeper, more technical dive, please review the paper on arXiv. It’s important to note that arXiv is a preprint server, which allows researchers to receive quick feedback on their work. However, neither an arXiv posting nor this article is an official peer-reviewed publication. Peer review is an important step in the scientific process to verify results.