Study: Chatbots Overconfident, Even When Wrong

Insider Brief

  • A Carnegie Mellon University study found that AI chatbots tend to become more overconfident after underperforming on cognitive tasks, unlike humans, who moderate their confidence after seeing their results.
  • The research compared human participants and four large language models (LLMs) across trivia, prediction, and image recognition tasks, revealing that LLMs consistently overestimated their performance and lacked metacognitive adjustment.
  • The findings raise concerns about AI reliability in high-stakes fields, suggesting the need for future improvements in AI self-assessment to mitigate overtrust and factual errors in applications like law, journalism, and healthcare.

Artificial intelligence tools are increasingly used in daily life, but a new study suggests they may be too sure of themselves. Carnegie Mellon University researchers tested how well large language models (LLMs) judge their own performance—and found that unlike humans, AI chatbots tend not to learn from their mistakes.

According to Carnegie Mellon, the study, published in Memory & Cognition, compared four LLMs to human participants in a series of cognitive challenges: trivia questions, future predictions like Oscar winners or NFL outcomes, and a sketch recognition task similar to Pictionary. Both groups were asked how well they thought they would do, then completed the tasks, and finally evaluated how well they thought they had done.

Across the board, both humans and AIs initially overestimated how well they would perform. But afterward, only the humans recalibrated their confidence to reflect their actual results.

“Say the people told us they were going to get 18 questions right, and they ended up getting 15 questions right. Typically, their estimate afterwards would be something like 16 correct answers,” noted lead author Trent Cash, who recently completed a joint Ph.D. at Carnegie Mellon University in the departments of Social and Decision Sciences and Psychology. “So, they’d still be a little bit overconfident, but not as overconfident.

“The LLMs did not do that,” added Cash. “They tended, if anything, to get more overconfident, even when they didn’t do so well on the task.”
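To make that calibration gap concrete, here is a minimal sketch (not part of the study; the helper and numbers below simply restate Cash’s example) that treats overconfidence as the difference between an estimated score and the actual one:

```python
def overconfidence(estimated_correct: int, actual_correct: int) -> int:
    """Positive values mean the estimate exceeded actual performance."""
    return estimated_correct - actual_correct

# Numbers from the human example above: predicted 18 correct, scored 15,
# then estimated 16 correct in hindsight.
pre_task = overconfidence(estimated_correct=18, actual_correct=15)   # +3
post_task = overconfidence(estimated_correct=16, actual_correct=15)  # +1

# Humans shrank the gap after seeing their results (3 -> 1); the LLMs in
# the study tended to hold or widen theirs instead.
print(pre_task, post_task)
```

In these terms, human post-task estimates moved toward zero overconfidence, while the chatbots’ retrospective estimates often drifted further from it.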

This lack of introspection suggests LLMs lack metacognition—the ability to assess their own thinking. Over the two-year study period, which included multiple versions of ChatGPT, Gemini (formerly Bard), and Anthropic’s Sonnet and Haiku models, the systems consistently displayed this flaw. One example involved the sketch recognition task: while humans and ChatGPT-4 averaged around 12 correct answers out of 20, Gemini identified fewer than one image correctly on average but later estimated it had gotten more than 14 right.

The implications are wide-ranging. While these tests are relatively low-stakes, the same overconfidence in LLMs has been documented in legal and journalistic applications, where factual errors or “hallucinations” can cause real-world harm. A 2023 study cited by the researchers found hallucination rates in legal AI applications ranging between 69% and 88%, and a BBC review found factual issues in more than half of the LLM-generated news answers it examined, the researchers noted.

These patterns suggest that current LLMs do not reliably communicate uncertainty. Unlike humans, who reveal doubt through tone, facial expressions, or hesitation, AI systems present answers in a uniform, confident style. This creates a risk that users will overtrust their responses, especially in domains like health, law, or public policy where accuracy matters most.

“Humans have evolved over time and practiced since birth to interpret the confidence cues given off by other humans. If my brow furrows or I’m slow to answer, you might realize I’m not necessarily sure about what I’m saying, but with AI, we don’t have as many cues about whether it knows what it’s talking about,” said coauthor Danny Oppenheimer, a professor in CMU’s Department of Social and Decision Sciences.

Researchers used a variety of prompts to probe the boundaries of LLM self-awareness. While simple factual queries like “What is the population of London?” yielded accurate results with appropriate confidence, tasks involving predictions or abstract reasoning revealed the models’ limits. In those cases, AI responses remained confidently wrong.

Among the tested systems, Sonnet demonstrated slightly better self-assessment. ChatGPT-4 performed closer to human levels in visual recognition but still overestimated its own performance. Gemini performed poorly in accuracy and self-awareness.

The study’s methods relied on text-based interactions with the LLMs, recording each model’s predicted confidence before a task and its retrospective confidence afterward. This approach offers a systematic way to measure metacognition but does not account for possible changes in newer or unreleased models.
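As a rough illustration of that pre/post protocol (a sketch under stated assumptions, not the study’s actual code: `ask_model` is a hypothetical stand-in for whichever chat API you use, and the prompts are simplified), one trial might be structured like this:

```python
from typing import Callable, Dict

def run_trial(ask_model: Callable[[str], str],
              question: str,
              correct_answer: str) -> Dict[str, object]:
    """One round of the pre/post confidence protocol.

    `ask_model` takes a prompt string and returns the model's text reply;
    plug in any chat backend you like.
    """
    # 1. Predicted confidence, elicited before the model attempts the task.
    predicted = ask_model(
        "On a scale of 0-100, how confident are you that you can answer "
        f"this question correctly? Reply with a number only.\n{question}"
    )
    # 2. The task itself.
    answer = ask_model(f"Answer the following question.\n{question}")
    # 3. Retrospective confidence, elicited after the answer is on record.
    retrospective = ask_model(
        "You answered the question below as follows.\n"
        f"Question: {question}\nYour answer: {answer}\n"
        "On a scale of 0-100, how confident are you that your answer was "
        "correct? Reply with a number only."
    )
    return {
        "answer": answer,
        "was_correct": correct_answer.strip().lower() in answer.lower(),
        "predicted_confidence": predicted,
        "retrospective_confidence": retrospective,
    }
```

Comparing the predicted and retrospective numbers against `was_correct` across many trials is one simple way to see whether a model’s confidence moves toward its actual performance, as the human participants’ did.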

Researchers acknowledged it’s possible that more extensive training, larger datasets, or improved design could help LLMs develop more accurate self-assessment over time.

Future research will explore whether repeated interactions or feedback loops can improve AI confidence calibration. Some AI developers already use reinforcement learning to fine-tune models; integrating metacognitive awareness into that loop may help reduce hallucinations and improve trust.

“I do think it’s interesting that LLMs often fail to learn from their own behavior,” said Cash. “And maybe there’s a humanist story to be told there. Maybe there’s just something special about the way that humans learn and communicate.”

Greg Bock

Greg Bock is an award-winning investigative journalist with more than 25 years of experience in print, digital, and broadcast news. His reporting has spanned crime, politics, business and technology, earning multiple Keystone Awards and Pennsylvania Association of Broadcasters honors. Through the Associated Press and Nexstar Media Group, his coverage has reached audiences across the United States.
