Insider Brief
- A new study from UC Berkeley and Yale shows that LLMs can improve reasoning skills without external rewards by training on their own internal confidence scores, using a system called INTUITOR.
- The method, Reinforcement Learning from Internal Feedback (RLIF), replaces human approval or verifiable answers with a self-certainty score, allowing models to generalize more effectively across tasks like math and code generation.
- INTUITOR demonstrated performance gains of up to 76% over baseline on code reasoning benchmarks and may represent an early step toward autonomous, self-improving AI systems.
A new study from researchers at UC Berkeley and Yale shows that large language models (LLMs) can learn to reason without any external rewards, human labels, or correct answers, relying instead on their own internal sense of confidence. The system, called INTUITOR, may open the door to more autonomous AI development by cutting the need for costly human feedback or domain-specific benchmarks.
In the paper, posted to arXiv, the researchers propose a method called Reinforcement Learning from Internal Feedback (RLIF), where an LLM is trained to prefer outputs it is more confident about — defined by a mathematical measure called self-certainty. This confidence score replaces traditional reinforcement learning rewards such as human approval (used in RLHF) or programmatically verifiable answers (used in RLVR), offering a streamlined and unsupervised way for LLMs to improve reasoning capabilities across domains.
Findings and Performance Benchmarks
The researchers tested INTUITOR by training smaller LLMs (Qwen2.5-1.5B and 3B) on the MATH dataset and evaluating their performance on multiple benchmarks, including grade-school math (GSM8K), advanced math (MATH500), code generation (LiveCodeBench), and code reasoning (CRUXEval). INTUITOR matched the performance of GRPO, a popular RL algorithm that uses gold-standard rewards, on in-domain math tasks while outperforming GRPO on out-of-domain generalization tasks like code generation.
In particular, models trained with INTUITOR achieved up to a 76% gain in accuracy on CRUXEval compared to baseline models, and a 65% gain on LiveCodeBench. These models also demonstrated stronger instruction-following and structured reasoning, producing longer, more coherent responses without being explicitly told to do so.
The Core Idea: Self-Certainty as Reward
Traditional reinforcement learning for LLMs depends on external signals: outside sources of feedback used to train or evaluate a model’s outputs. RLHF relies on human preference data, while RLVR uses domain-specific rules to verify correctness. Both approaches are effective, but they require extensive infrastructure, which limits their scalability.
INTUITOR circumvents this by replacing external rewards with self-certainty, a measure of how confidently the model selects each token in a generated output. Technically, self-certainty is the Kullback–Leibler (KL) divergence between the model’s next-token distribution and a uniform distribution, averaged across the output. In plainer terms, high self-certainty means the model strongly prefers certain tokens, suggesting greater confidence in its output.
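To make the idea concrete, here is a minimal sketch of how such a score could be computed from a model’s output logits, assuming PyTorch and a single generated sequence; the divergence direction and normalization here are a simplification and may differ in detail from the paper’s exact formulation.

```python
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Illustrative self-certainty score for one generated sequence.

    logits: tensor of shape [seq_len, vocab_size] holding the model's
    next-token logits at each generated position. Returns a scalar: the KL
    divergence between the predicted next-token distribution and a uniform
    distribution, averaged over the output.
    """
    log_probs = F.log_softmax(logits, dim=-1)      # log p(token | context)
    vocab_size = logits.size(-1)
    log_uniform = -torch.log(torch.tensor(float(vocab_size)))
    # KL(p || U) at each position: sum_j p_j * (log p_j - log(1/V)).
    kl_per_token = (log_probs.exp() * (log_probs - log_uniform)).sum(dim=-1)
    return kl_per_token.mean()                     # higher = more confident
```

A sharply peaked next-token distribution sits far from uniform, so decisive generations score higher than diffuse, uncertain ones.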
By reinforcing outputs with high self-certainty, INTUITOR encourages the model to produce responses it “believes” are well-formed, coherent, and internally consistent. Importantly, this mechanism does not simply reward short or deterministic outputs. Instead, it promotes the formation of reasoning chains that help the model understand and justify its own answers.
Methodology and Training Pipeline
The researchers implemented INTUITOR using Group Relative Policy Optimization (GRPO), a variant of policy gradient reinforcement learning. During training, the model generates multiple candidate responses to a question, scores them based on self-certainty, and updates its policy to prefer the more confident ones.
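To illustrate how self-certainty can slot into a GRPO-style update, the sketch below samples a group of candidate answers per prompt, scores each with the self_certainty function from the earlier sketch, and nudges the policy toward the higher-scoring candidates. The policy object, its generate method, and the candidate fields are hypothetical stand-ins, and the sketch omits the clipping and KL-penalty terms used in practice; it is not the authors’ released code.

```python
import torch

def intuitor_style_step(policy, prompts, group_size=4, lr=1e-6):
    """One illustrative GRPO-style update using self-certainty as the reward."""
    optimizer = torch.optim.AdamW(policy.parameters(), lr=lr)
    total_loss = 0.0
    for prompt in prompts:
        # Hypothetical API: returns `group_size` candidate responses, each
        # carrying per-token logits and per-token log-probabilities.
        candidates = policy.generate(prompt, n=group_size)
        # Reward each candidate with its own self-certainty; no gold answers needed.
        rewards = torch.stack([self_certainty(c.logits) for c in candidates])
        # Group-relative advantage: compare each candidate against its siblings.
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
        for cand, adv in zip(candidates, advantages):
            # Policy-gradient term: raise the likelihood of higher-certainty outputs.
            total_loss = total_loss - adv.detach() * cand.token_log_probs.sum()
    (total_loss / len(prompts)).backward()
    optimizer.step()
    optimizer.zero_grad()
```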
This pipeline eliminates the need for labeled data or gold answers. The team conducted training runs with various model sizes, including larger 7B and 14B versions of Qwen2.5, and also applied the method to code generation datasets like Codeforces. Across tasks and scales, INTUITOR consistently produced meaningful performance gains.
Implications for AI Development
The key implication of the study is that LLMs can learn and generalize without any external feedback — if given the right internal signals. This could dramatically reduce the costs and limitations associated with RLHF and RLVR, especially in domains where gold-standard answers are unavailable or subjective.
It also suggests a path forward for building autonomous AI agents capable of self-improvement. If a model can assess the quality of its own reasoning, it can be trained in open-ended environments where external validation is not feasible, such as creative writing, philosophical dialogue, or novel scientific discovery.
Moreover, INTUITOR’s reliance on process-based feedback rather than outcome-based reward could improve the robustness and interpretability of AI outputs. In some cases, models trained with INTUITOR developed emergent reasoning formats, first providing a natural-language explanation and then supplying the final answer or code, despite being instructed to give structured JSON outputs.
A Step Toward Self-Improving AI?
While INTUITOR does not mark the arrival of artificial superintelligence (ASI), the method may represent a step in that direction. By enabling models to improve their reasoning skills without external feedback, such as human approval or programmatic correctness, the study highlights a path toward more autonomous AI systems.
Unlike traditional training methods that depend on labeled datasets or engineered reward functions, INTUITOR relies entirely on internal confidence scores to guide learning. This self-training ability could be essential for developing AI systems that operate in open-ended domains where correct answers are unknown or subjective.
Researchers also found that models trained with INTUITOR generalized more effectively to unfamiliar tasks and developed structured reasoning formats without explicit prompts. These emergent behaviors hint at early forms of metacognition, a trait often associated with more advanced forms of intelligence.
Still, the approach has limitations. Confidence is not always a reliable proxy for correctness, and improperly tuned models risk “reward hacking” — manipulating their own confidence scores without genuine improvement. Despite these challenges, the study moves closer to AI systems capable of continual self-improvement, a capability widely considered foundational to ASI.
Other Limitations and Open Questions
While promising, INTUITOR is not without limitations. The researchers acknowledge that tuning hyperparameters, especially the KL-divergence penalty that regulates how far the model can deviate from its initial behavior, is critical to stability. In early trials with larger models, improper calibration led to collapse behaviors, where the model started solving unrelated problems or gaming its own reward function.
To prevent reward exploitation, the team used online self-certainty, calculating the reward from the evolving model itself, rather than from a fixed baseline model. Offline variants were found to be vulnerable to “hacks” where the model inflated its confidence by appending solved problems to its output.
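The distinction can be sketched roughly as follows, reusing the hypothetical helpers from the earlier sketches; the score method standing in for a forward pass that returns per-token logits is an assumption, not an API from the paper.

```python
# Illustrative contrast between offline and online self-certainty rewards.
# `.score()` is a hypothetical method returning per-token logits for a response.

def offline_reward(frozen_base_model, response):
    # Scored by a fixed snapshot of the initial model. A static judge like this
    # can be gamed, for example by padding outputs with already-solved problems
    # that the frozen model rates as highly certain.
    return self_certainty(frozen_base_model.score(response))

def online_reward(current_policy, response):
    # Scored by the evolving policy itself, so the judge changes along with the
    # model, which the researchers found harder to exploit.
    return self_certainty(current_policy.score(response))
```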
Another concern is that self-certainty, while useful, is only a proxy for correctness. A confident model can still be confidently wrong. As a result, INTUITOR may benefit from hybrid approaches that combine internal confidence with sparse external signals, such as occasional human feedback or domain-specific tests.
Future Directions
Looking ahead, the study opens several avenues for research:
- Scalability: The framework could be applied to larger foundation models and richer datasets to test its performance in more complex reasoning environments.
- Algorithmic Extensions: Although the study focused on GRPO, INTUITOR could theoretically be paired with other policy gradient algorithms like PPO or REINFORCE.
- Hybrid Rewards: Future work could explore combining self-certainty with outcome-based rewards, formatting signals, or preference feedback to balance confidence with correctness (a minimal sketch of one such blend follows this list).
- Applications: RLIF may be particularly useful in autonomous agents, simulation environments, or edge settings where reward engineering is impractical.
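On the hybrid-reward direction, one simple possibility, sketched below with made-up weights and helper names, is a weighted blend that falls back to pure self-certainty whenever no external signal is available.

```python
from typing import Optional

def hybrid_reward(self_certainty_score: float,
                  external_score: Optional[float] = None,
                  format_ok: bool = True,
                  alpha: float = 0.5,
                  format_bonus: float = 0.1) -> float:
    """Hypothetical blend of internal confidence and sparse external signals.

    external_score: an occasional verifiable or human-provided score in [0, 1],
    or None when no external feedback exists; alpha trades confidence against
    correctness. All weights here are illustrative, not from the paper.
    """
    reward = self_certainty_score
    if external_score is not None:
        reward = (1 - alpha) * self_certainty_score + alpha * external_score
    if format_ok:
        reward += format_bonus  # small bonus for following the required output format
    return reward
```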
Overall, the framework represents a shift in how LLMs can be trained to reason. By replacing external validation with internal confidence as a reward signal, the study shows that models can improve in both accuracy and generalization without gold answers or human supervision.
In a field where scaling often means throwing more data, compute, or human labor at a problem, this internal feedback loop potentially offers a leaner path toward intelligent, self-improving AI.
The research team included Xuandong Zhao, Zhewei Kang, Sergey Levine, and Dawn Song of UC Berkeley, and Aosong Feng of Yale University.