Why Do Language Models Hallucinate? OpenAI Scientists Say LLMs Rewarded For Being Too Cocky


Insider Brief

  • A team of OpenAI scientists reports that language models hallucinate because their training and evaluation processes reward confident guesses over admitting uncertainty.
  • Hallucinations are predictable statistical errors that arise during pretraining and persist because benchmarks penalize responses that express doubt.
  • Fixing the problem requires changing mainstream evaluations to credit uncertainty, aligning incentives toward more trustworthy AI systems.

Stopping language models from hallucinating starts with figuring out why they hallucinate in the first place.

Now, a new study from researchers at OpenAI and Georgia Tech finds that language models often generate confident falsehoods because they are rewarded for guessing rather than admitting uncertainty. The paper, posted on the pre-print server arXiv, argues that hallucinations — plausible but incorrect statements — are not mysterious quirks of artificial intelligence but predictable outcomes of the way these systems are trained and evaluated.

The researchers frame hallucinations as a statistical inevitability, drawing on decades of computational learning theory. During training, language models learn from massive text corpora using a method called density estimation: predicting the probability of sequences of words. Even with perfect training data, the statistical objectives used in this process guarantee that errors will appear.

Don’t get too judgmental. We’ve all been guilty of using a similar strategy, the OpenAI team suggests in a recent blog post on the research.

“Think about it like a multiple-choice test,” the team writes in the post. “If you do not know the answer but take a wild guess, you might get lucky and be right. Leaving it blank guarantees a zero. In the same way, when models are graded only on accuracy, the percentage of questions they get exactly right, they are encouraged to guess rather than say ‘I don’t know.'”

The researchers liken this to a binary classification problem: if a model must decide whether a given sentence is valid or erroneous, some misclassifications are unavoidable. When applied to language generation, these misclassifications surface as hallucinations. For instance, when prompted with questions about obscure birthdays or dissertation titles, models produce wrong but confident answers because the correct information appears only once in the training data, or not at all, according to the study.

The paper introduces the idea of a “singleton rate,” or the proportion of facts that appear only once in the training set. This measure predicts hallucination frequency: if 20% of birthdays are singletons, then models are expected to hallucinate on about 20% of such queries.
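To make the singleton rate concrete, the following minimal Python sketch counts singletons in a tiny, invented set of (person, birthday) facts; the counting rule follows the paper's definition, but the data and the resulting numbers are purely illustrative.

from collections import Counter

# Hypothetical toy "training set" of (person, birthday) facts; the names,
# dates, and repetition counts are invented purely for illustration.
training_facts = [
    ("Ada Lovelace", "Dec 10"), ("Ada Lovelace", "Dec 10"),   # appears twice
    ("Alan Turing", "Jun 23"),                                # appears once -> singleton
    ("Grace Hopper", "Dec 9"), ("Grace Hopper", "Dec 9"), ("Grace Hopper", "Dec 9"),
]

counts = Counter(training_facts)
singletons = [fact for fact, n in counts.items() if n == 1]
singleton_rate = len(singletons) / len(counts)

# The paper's prediction: models should hallucinate on roughly this
# fraction of queries about facts of this kind.
print(f"distinct facts: {len(counts)}, singletons: {len(singletons)}")
print(f"singleton rate: {singleton_rate:.0%}")

Here one of the three distinct facts appears only once, so the singleton rate is about 33%, which under the paper's argument is roughly the rate at which a model would be expected to hallucinate on queries about these facts.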

Why Training Encourages Mistakes

Errors also persist because language models are designed to be calibrated, according to the team. A well-calibrated model assigns probabilities to responses that reflect their likelihood, much like a weather forecast. But this calibration creates a trade-off: a model that always refuses to answer would avoid mistakes, but it would fail as a language generator. To balance usefulness with accuracy, models inevitably produce some errors.
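As a rough illustration of what calibration means here, the short Python sketch below compares a model's stated confidence with how often it is actually right, the way a rain forecast is judged; the confidence levels and outcomes are invented for illustration and are not drawn from the paper.

# Toy calibration check: a well-calibrated model's stated confidence should
# roughly match how often it is actually correct, like a rain forecast.
# (The confidence levels and outcomes below are invented for illustration.)
predictions = [   # (stated confidence, was the answer correct?)
    (0.9, True), (0.9, True), (0.9, True), (0.9, False),   # right 3 of 4 times at "90%"
    (0.6, True), (0.6, False),                              # right 1 of 2 times at "60%"
]

for level in sorted({conf for conf, _ in predictions}, reverse=True):
    outcomes = [ok for conf, ok in predictions if conf == level]
    accuracy = sum(outcomes) / len(outcomes)
    print(f"stated confidence {level:.0%} -> observed accuracy {accuracy:.0%}")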

Beyond statistical limits, models struggle with poor representations. For example, older systems such as trigram models, which predict each word based only on the previous two words, misfired on grammar, while modern transformer models can still falter on tasks like counting letters when text is tokenized into larger chunks. Distribution shifts — where test inputs differ from training data — and the presence of errors in the training corpus itself (“garbage in, garbage out”) further amplify hallucination risks, according to the study.

The second phase of model development, known as post-training, is meant to refine outputs and reduce hallucinations. Techniques such as reinforcement learning from human or AI feedback attempt to align models with human expectations. Yet, the study argues, post-training often makes hallucinations worse.

The reason lies in how benchmarks evaluate model performance. Most benchmarks use binary grading, scoring answers as simply right or wrong. Under this scheme, admitting uncertainty with “I don’t know” earns no more credit than an outright wrong answer. Models, like students gaming multiple-choice exams, learn that guessing maximizes their expected score.

This creates what the researchers call an “epidemic” of overconfidence. A model that always provides an answer, even when wrong, will outperform a more cautious system under prevailing test rules, the researchers report.
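A quick expected-score calculation makes the incentive problem plain. The sketch below assumes the simplest accuracy-only rubric, 1 point for a correct answer and 0 for anything else; the numbers are illustrative rather than taken from any specific benchmark.

# Accuracy-only grading: a correct answer scores 1, everything else,
# including "I don't know", scores 0. (Illustrative rubric and numbers.)
def expected_score(confidence: float, abstain: bool) -> float:
    if abstain:
        return 0.0               # abstaining never earns points
    return confidence * 1.0      # a guess pays off with probability `confidence`

for p in (0.9, 0.5, 0.1):
    print(f"confidence {p:.0%}: guess EV = {expected_score(p, False):.2f}, "
          f"abstain EV = {expected_score(p, True):.2f}")
# Even a 10%-confident guess has a positive expected score, so a model graded
# only on accuracy learns to guess rather than say "I don't know."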

The study catalogues widely used benchmarks such as MMLU-Pro, GPQA, and SWE-bench, finding that most either lack a mechanism for crediting uncertainty or explicitly penalize it. Even new hallucination-specific benchmarks have struggled to gain traction because mainstream leaderboards still rely on binary accuracy metrics.

This misalignment means that companies optimizing for leaderboard performance inadvertently reinforce hallucinatory behavior. A model trained to signal uncertainty truthfully would lose ground in rankings to one that guesses confidently. In practice, the incentive structure tilts toward bluffing.

Proposed Solutions

The team argues that hallucination cannot be solved by adding new evaluations alone. Instead, they call for modifying mainstream benchmarks to incorporate “confidence targets.” Under this system, models would be explicitly told that uncertain answers are acceptable and even preferable in some cases.

For example, a benchmark might specify: only answer if you are more than 75% confident, otherwise respond with “I don’t know.” Correct answers earn a point, wrong answers lose two points, and abstentions neither gain nor lose. This mirrors certain standardized human exams that penalize wrong guesses more heavily than skipped questions.
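Repeating that expected-score arithmetic with the rubric described above (+1 for a correct answer, -2 for a wrong one, 0 for abstaining) shows how the incentive flips; the sketch below only works through the arithmetic implied by those example numbers.

# The example rubric quoted above: +1 for a correct answer, -2 for a wrong one,
# 0 for responding "I don't know."
def expected_penalized_score(confidence: float) -> float:
    return confidence * 1.0 - (1.0 - confidence) * 2.0

for p in (0.9, 0.75, 0.5, 0.1):
    ev = expected_penalized_score(p)
    better = "guess" if ev > 0 else "abstain"
    print(f"confidence {p:.0%}: guess EV = {ev:+.2f} -> better to {better}")
# The expected value 3p - 2 crosses zero at two-thirds confidence, so an
# "answer only above 75% confidence" rule keeps guesses on the positive side
# while abstention becomes the rational choice for shakier answers.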

Such changes would encourage what the researchers call “behavioral calibration”—the habit of expressing uncertainty appropriately. Unlike probabilistic confidence scores, which can be unwieldy in natural language, behavioral calibration emphasizes practical, human-like communication of doubt.

Limits of the Framework

The study acknowledges its limitations. It focuses on plausible falsehoods rather than nonsensical strings, and it simplifies open-ended generation into binary categories of valid or erroneous. The framework does not account for subtle pragmatic factors such as hedging or asking clarifying questions.

The study also cautions that retrieval-augmented models, which use search engines to ground responses, are not immune. According to the researchers, if benchmarks still penalize uncertainty, even these systems will prefer to guess when search fails. Similarly, reasoning-based models that chain through logical steps may reduce certain errors but remain vulnerable to the same incentive misalignments.

What Comes Next

Future work will likely focus on refining evaluation schemes, experimenting with explicit confidence penalties across popular benchmarks, and studying how users respond to models that admit uncertainty more often. There is also interest in developing richer forms of pragmatic competence, enabling models to hedge or ask clarifying questions instead of presenting false facts.

The study suggests that hallucinations are not evidence of a fundamental flaw in large language models but rather artifacts of the systems built around them. With recalibrated incentives, the researchers argue, AI systems can become more reliable partners — less like students bluffing on exams and more like cautious collaborators who know when to say, “I don’t know.”

On a broader level, the findings highlight a fundamental tension in artificial intelligence development: the push for models to appear competent under evaluation clashes with the need for trustworthy systems in real-world use. By rewarding guessing, the field has inadvertently created machines that bluff.

Correcting this requires a socio-technical shift, not just better algorithms. Benchmarks and leaderboards — the currency of AI progress — must evolve to reward honesty about uncertainty. Without such reforms, hallucinations will remain entrenched, regardless of technical improvements in architecture or training scale.

For a deeper, more technical dive, please review the paper on arXiv. It’s important to note that arXiv is a pre-print server, which allows researchers to receive quick feedback on their work. However, it is not — nor is this article itself — an official peer-reviewed publication. Peer review is an important step in the scientific process to verify the work.

The research team included Adam Tauman Kalai, Ofir Nachum and Edwin Zhang, all of OpenAI, and Santosh S. Vempala of Georgia Tech.

Matt Swayne

With a background in journalism and communications spanning several decades, Matt Swayne has worked as a science communicator for an R1 university for more than 12 years, specializing in translating high tech and deep tech for a general audience. He has served as a writer, editor and analyst at The Space Impulse since its inception. In addition to his work as a science communicator, Matt develops and teaches courses to improve the media and communications skills of scientists.
