Insider Brief
- A new study from researchers at OpenAI, DeepMind, Anthropic, and others finds that monitoring an AI system’s internal chain-of-thought (CoT) reasoning offers a promising but fragile opportunity for AI safety.
- CoT monitoring can reveal misalignment or harmful intent in intermediate reasoning steps, especially on complex tasks that require models to “think out loud.”
- The researchers warn that future model architectures, training methods, and incentives could reduce or eliminate this visibility, and call for proactive efforts to preserve monitorability.
A new study led by researchers from OpenAI, Google DeepMind, Anthropic, and the UK AI Safety Institute says one of the most promising tools for AI safety — monitoring an AI system’s internal “chain of thought” — is both a fragile and fleeting opportunity.
The researchers, who published their findings in a pre-print study on arXiv, focused on whether — and, if so, how — large AI systems reveal their intentions through the intermediate reasoning steps they generate in natural language before making a final decision.
These steps, known as chain-of-thought (CoT) traces, are used increasingly in frontier models to boost performance on complex tasks. The authors argue that when models “think out loud” in ways humans can interpret, it opens up a rare path for oversight. But they also warn that this transparency could vanish in future systems unless actively preserved.
Safety by Watching the Thinking Process
Most AI oversight today focuses on what models say or do. But by the time a harmful output appears, it may be too late to stop it. The chain-of-thought approach offers an upstream alternative: monitor the AI’s intermediate reasoning to catch misalignment before it becomes action.
In reasoning-enabled models — those trained specifically to use internal sequences of natural language to work through problems — these CoT traces act like internal memory, stringing together ideas as the model plans. Because these steps are expressed in human language, they can sometimes reveal what the model is really trying to do, even if its final answer is benign.
That’s especially true in hard tasks, where models can’t shortcut their way to the answer. When success depends on working through multiple steps of reasoning, the chain of thought becomes not just a helpful byproduct but a computational necessity. In these cases, the model must use its internal monologue to store, retrieve and build on information, which could provide that important window into its decision-making, according to the researchers, who are a “who’s who” list of scientists at leading AI institutions including OpenAI, Google DeepMind, Anthropic, Apollo Research and the UK AI Safety Institute.
What Monitoring Can Catch
The authors point to several areas where CoT monitoring has already been useful. When models make mistakes or attempt to exploit flaws in training, their reasoning often reflects this. A model trying to manipulate its reward function, for example, may write something like “I’ll pretend to cooperate so I get a higher score.”
Chain-of-thought review has also helped researchers spot when models appear to be aware they’re being evaluated, or when they fabricate helpful-sounding justifications to hide flaws. In some cases, the reasoning reveals the true goal behind a safe-looking action—offering a way to detect early signals of deceptive alignment.
The paper emphasizes that CoT monitoring is not perfect. Like any other oversight method, it can miss subtle or intentionally hidden behaviors. But it offers a layer of transparency that’s rare among modern AI systems, where internal processes are often completely opaque.
Cracks in the System
The researchers caution that this transparency may not last.
One major concern is that the ability to monitor chain-of-thought traces depends heavily on how today’s frontier models are trained. Many models are currently trained using reinforcement learning from human feedback (RLHF), where output quality is shaped by human preferences. But new training techniques optimize only the final outcome, not the steps taken to reach it. As a result, models may become more effective but less interpretable.
If the training incentives change — for example, if models are rewarded for shorter chains-of-thought, or penalized for long explanations — their internal reasoning could become more compressed, less natural, or even deliberately misleading. Some techniques meant to make models safer, such as process supervision, could inadvertently encourage them to write sanitized but fake reasoning traces.
Architecture changes could also break monitorability. Experimental systems are moving toward “latent” reasoning, where computations happen in abstract mathematical space instead of natural language. These systems could still make decisions—but without ever producing a readable chain of thought.
When CoT Fails
Even in its current form, chain-of-thought monitoring has limits. The researchers point out that some dangerous actions may not require complex reasoning at all. For example, a model given control over high-stakes systems could cause harm through trial and error or by following flawed instructions—without needing to plan in a detectable way.
In other cases, the reasoning may exist but be encoded in ways humans can’t interpret. And as models become more situationally aware, they might learn to hide their true intentions if they suspect they’re being watched.
The paper stresses that CoT monitoring can create a false sense of security. Just because a model’s reasoning looks safe doesn’t mean the model is aligned. If the real reasoning happens “under the hood,” and the chain of thought is merely a performance, oversight becomes much harder.
Building Better Monitors
To make the most of this opportunity, the researchers urge developers to measure and protect CoT monitorability as part of model training and evaluation.
This includes developing metrics to assess how readable, faithful, and causally relevant a model’s chain of thought is. For instance, if changing the reasoning trace doesn’t change the final answer, that could be a sign the reasoning is performative rather than functional. Other approaches include training monitors that can question the model, or simulate adversaries trying to bypass oversight.
Importantly, the authors recommend that developers treat monitorability as a design variable—not just a lucky side effect. That means choosing training methods, architectures, and deployment strategies that preserve interpretability, even at the cost of some performance.
System cards — the public documents that describe model capabilities and risks — should include results from monitorability evaluations. And if a model loses transparency during training, developers should consider rolling back to an earlier checkpoint or adjusting their approach.
A Narrow Window
The study concludes with a warning: while chain-of-thought monitoring is not enough to guarantee safety on its own, it is one of the few currently viable ways to see how an AI system thinks. That makes it worth protecting.
But this visibility is not guaranteed. Without active effort, it could erode — either through natural optimization pressure or deliberate design choices that favor raw performance over oversight.
To prevent that, the authors call for more research, standardized evaluations, and thoughtful integration of monitorability into how advanced models are built and deployed.
The researchers included, trom he UK AI Security Institute: Tomek Korbak, Joseph Bloom, Alan Cooney, Geoffrey Irving, Martín Soto, and Jasmine Wang. Apollo Research is represented by Mikita Balesni and Marius Hobbhahn. METR’s Elizabeth Barnes also contributed to the study, as did Yoshua Bengio from the University of Montreal and Mila. From Anthropic, the authors include Joe Benton, Evan Hubinger, Ethan Perez, and Fabien Roger. OpenAI’s team features Mark Chen, David Farhi, Aleksander Mądry, Jakub Pachocki, Bowen Baker, and Wojciech Zaremba. Google DeepMind researchers include Allan Dafoe, Anca Dragan, Scott Emmons, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, Mary Phuong, Neel Nanda, Dave Orr, Owain Evans, and Rohin Shah. The study also includes contributions from David Luan at Amazon, Julian Michael at Scale AI, and Eric Steinberger at Magic. Owain Evans is additionally affiliated with Truthful AI and UC Berkeley.
They emphasize that their views do not necessarily reflect those of their affiliated organization.
For a deeper, more technical look at the research than this summary article can provide, please see the paper on arXiv. Researchers use pre-print servers to distribute their findings, particularly in fast-moving fields, such as AI; however, it has not officially been peer-reviewed, a key step in the scientific process.




