OpenAI's New Reasoning AI Models, o3 and o4-mini, Show Increased Hallucination Rates

10:34 AM   |   12 June 2025

OpenAI's Reasoning Models o3 and o4-mini Exhibit Higher Hallucination Rates, Raising Concerns About AI Accuracy

In the rapidly evolving landscape of artificial intelligence, the pursuit of more capable and reliable models is a constant endeavor. OpenAI, a leading force in AI research and development, recently unveiled a pair of new AI models, dubbed o3 and o4-mini. These models are specifically designed as 'reasoning models,' intended to excel in complex tasks requiring logical deduction and problem-solving. While they demonstrate state-of-the-art performance in certain domains, an unexpected and concerning finding has emerged: these new models appear to hallucinate, or generate factually incorrect information, more frequently than several of OpenAI's older models, including previous reasoning models and even the widely used GPT-4o.

Hallucination remains one of the most significant and persistent challenges in the field of artificial intelligence, particularly for large language models (LLMs). It refers to the phenomenon where an AI model generates outputs that are nonsensical, untrue, or unfaithful to the input data or the real world. Despite continuous advancements in model architecture, training techniques, and scale, even the best-performing AI systems today still struggle with hallucinations. Historically, the trend has been towards a gradual reduction in hallucination rates with each new generation of models. However, the performance of o3 and o4-mini seems to buck this trend, presenting a puzzling setback in the quest for more reliable AI.

Unexpected Findings from Internal and External Testing

According to OpenAI's own internal testing, as detailed in their technical report for o3 and o4-mini, these new reasoning models hallucinate more often than their predecessors in the reasoning series (o1, o1-mini, and o3-mini) and also more than non-reasoning models like GPT-4o. This finding is particularly concerning because reasoning models are specifically optimized for tasks where accuracy and logical consistency are paramount.

OpenAI's internal benchmark, PersonQA, is designed to measure a model's accuracy regarding knowledge about individuals. On this test, o3 reportedly hallucinated in response to 33% of questions. This rate is approximately double that of previous reasoning models like o1 (16%) and o3-mini (14.8%). The o4-mini model performed even worse on PersonQA, exhibiting a hallucination rate of 48%. These figures indicate a substantial increase in the propensity for these models to generate incorrect biographical information.
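For context on what these percentages measure: a hallucination rate on a PersonQA-style benchmark is simply the share of graded answers flagged as containing a fabricated claim. The following is a minimal sketch of that calculation, using hypothetical graded data rather than OpenAI's actual evaluation harness:

```python
# Toy illustration of how a hallucination rate is computed on a
# PersonQA-style benchmark. The graded data below is hypothetical;
# it does not reproduce OpenAI's evaluation pipeline.
from dataclasses import dataclass

@dataclass
class GradedAnswer:
    question: str
    answer: str
    hallucinated: bool  # set by a human or automated grader

def hallucination_rate(graded: list[GradedAnswer]) -> float:
    """Fraction of answers flagged as containing a fabricated claim."""
    if not graded:
        return 0.0
    return sum(g.hallucinated for g in graded) / len(graded)

# Example: 33 hallucinated answers out of 100 questions -> 33%,
# the kind of rate reported for o3 on PersonQA.
graded = [GradedAnswer(f"q{i}", "…", hallucinated=(i < 33)) for i in range(100)]
print(f"hallucination rate: {hallucination_rate(graded):.0%}")  # 33%
```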

Beyond OpenAI's internal assessments, third-party testing has also corroborated concerns about the hallucination tendencies of these models. Transluce, a nonprofit AI research laboratory, conducted its own investigations into the truthfulness of o3. Their testing found evidence that o3 has a tendency not only to generate incorrect facts but also to fabricate the steps it took to arrive at an answer. One notable example cited by Transluce involved o3 claiming it had run code on a specific type of computer ('2021 MacBook Pro') 'outside of ChatGPT' and then copied the results into its response. While AI models can interact with tools, this specific claim about running code externally in that manner was inaccurate, highlighting a form of procedural hallucination where the model invents actions it did not perform.

Why Are Reasoning Models Hallucinating More?

Perhaps most perplexing is that OpenAI itself acknowledges it doesn't fully understand why this increase in hallucination is occurring as it scales up its reasoning models. The technical report states that "more research is needed" to pinpoint the exact causes. One hypothesis put forward by researchers like Neil Chowdhury, a Transluce researcher and former OpenAI employee, is that the specific type of reinforcement learning used in the development of the o-series models might be amplifying issues that are typically mitigated by standard post-training processes.

Reinforcement learning from human feedback (RLHF) and similar techniques are commonly used to align AI models with human preferences and instructions, often improving helpfulness and reducing harmful outputs. However, if the reward signals or training data inadvertently encourage models to be overly assertive or to fill in knowledge gaps with plausible-sounding but incorrect information, it could potentially exacerbate hallucination, especially in models designed to perform complex reasoning where the space of possible outputs is vast.
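To see why a reward scheme might push a model toward confident guessing, consider a toy expected-reward calculation. This is purely an illustration of the hypothesis above, not OpenAI's actual training objective: if a grader rewards correct answers but neither penalizes wrong ones nor credits abstention, guessing dominates saying "I don't know" whenever the guess has any chance of being right.

```python
# Toy expected-reward calculation illustrating the hypothesis above.
# The reward values are hypothetical, not OpenAI's training setup.
def expected_reward(p_correct: float,
                    r_correct: float = 1.0,
                    r_wrong: float = 0.0,
                    r_abstain: float = 0.0) -> tuple[float, float]:
    """Return (expected reward of guessing, reward of abstaining)."""
    guess = p_correct * r_correct + (1 - p_correct) * r_wrong
    return guess, r_abstain

# Even a 10%-confident guess beats abstaining when wrong answers carry
# no penalty, so the policy learns to answer rather than admit
# uncertainty -- a plausible route to more hallucinated claims.
print(expected_reward(0.10))              # (0.1, 0.0) -> guessing wins
print(expected_reward(0.10, r_wrong=-1))  # (-0.8, 0.0) -> abstaining wins
```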

Another potential factor mentioned in OpenAI's report is that because o3 and o4-mini "make more claims overall," they are consequently led to make "more accurate claims as well as more inaccurate/hallucinated claims." This suggests a trade-off: models that are more verbose or attempt to provide more detailed answers inherently increase their chances of being wrong simply by stating more things. However, this explanation doesn't fully account for the higher *rate* of hallucination relative to older models, which also make claims, implying there may be deeper issues in how these specific models process information and generate responses.

Implications for AI Deployment and Business Use

The increased hallucination rates in OpenAI's new reasoning models have significant implications, particularly for their potential deployment in business and professional settings where accuracy is not just desired but essential. While hallucinations might occasionally lead to creative or interesting outputs in casual use, they are unacceptable in many real-world applications.

Consider industries like law, medicine, finance, or engineering. An AI model used to draft legal documents, summarize medical research, analyze financial reports, or assist in coding must be highly reliable. A model that frequently inserts factual errors, fabricates case law, invents medical conditions, misrepresents financial data, or generates non-functional code could lead to severe consequences, including legal liabilities, incorrect diagnoses, financial losses, or critical system failures. As Sarah Schwettmann, co-founder of Transluce, noted, o3's hallucination rate could make it less useful than its potential capabilities might suggest.

Kian Katanforoosh, a Stanford adjunct professor and CEO of the upskilling startup Workera, shared his team's experience testing o3 in coding workflows. While finding it a step above competitors in some respects, they encountered a specific type of hallucination: the model would generate broken or non-functional website links. This seemingly minor issue can be highly disruptive in tasks requiring resource lookup or verification, undermining user trust and workflow efficiency.
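A lightweight guard against this particular failure mode is to validate every URL a model emits before surfacing it to the user. The sketch below assumes the third-party `requests` library; the pattern and timeout are illustrative choices, not part of any OpenAI or Workera tooling.

```python
import re
import requests  # third-party: pip install requests

URL_PATTERN = re.compile(r"https?://\S+")

def find_broken_links(model_output: str, timeout: float = 5.0) -> list[str]:
    """Return URLs in the model's output that do not resolve successfully."""
    broken = []
    for match in URL_PATTERN.findall(model_output):
        url = match.rstrip(".,;:)\"'")  # trim trailing punctuation
        try:
            resp = requests.head(url, allow_redirects=True, timeout=timeout)
            if resp.status_code >= 400:
                broken.append(url)
        except requests.RequestException:
            broken.append(url)
    return broken

# Example: flag fabricated documentation links before showing the answer.
answer = "See https://example.com/docs/setup and https://example.com/missing-page"
print(find_broken_links(answer))
```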

The challenge is particularly acute because the AI industry has been increasingly pivoting towards reasoning models. This shift is partly driven by the observation that traditional AI scaling laws, which predicted performance improvements simply by increasing model size and data, have begun showing diminishing returns. Reasoning capabilities offer a path to improving model performance on complex tasks without requiring the same exponential increases in computational resources and data during training. However, if this pivot inadvertently leads to higher hallucination rates, it presents a fundamental dilemma for researchers and developers.

Addressing the Hallucination Challenge: Current Approaches and Future Directions

The problem of AI hallucination is complex and multifaceted, with no silver-bullet solution. Researchers are exploring various strategies to mitigate it, ranging from improvements in training data and model architecture to advanced post-processing techniques.

One promising approach mentioned in the context of improving accuracy is integrating web search capabilities. Giving AI models access to up-to-date external information sources allows them to ground their responses in real-time data rather than relying solely on their potentially outdated or incomplete training data. OpenAI's GPT-4o, when equipped with web search, reportedly achieves 90% accuracy on SimpleQA, another of OpenAI's accuracy benchmarks. This suggests that for tasks requiring current factual knowledge, external search can significantly reduce hallucinations. However, this approach has limitations, including potential latency, cost, and privacy concerns if prompts need to be exposed to third-party search providers.
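The search-grounding pattern itself is straightforward: retrieve fresh snippets first, then instruct the model to answer only from what was retrieved. A minimal sketch of that flow is shown below, with a placeholder `web_search` helper (any search provider could fill that role) and the public OpenAI chat-completions interface assumed as the model API.

```python
# Sketch of search-grounded prompting: answer only from retrieved snippets.
from openai import OpenAI

client = OpenAI()

def web_search(query: str, k: int = 3) -> list[str]:
    """Hypothetical search helper returning the top-k result snippets."""
    raise NotImplementedError("plug in a real search provider here")

def grounded_answer(question: str) -> str:
    snippets = web_search(question)
    context = "\n".join(f"- {s}" for s in snippets)
    prompt = (
        "Answer the question using ONLY the sources below. "
        "If they are insufficient, say you don't know.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```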

Other research directions include:

  • Retrieval-Augmented Generation (RAG): Systems that retrieve relevant information from a knowledge base before generating a response, helping to ensure factual accuracy.
  • Improved Training Data and Curation: Focusing on higher quality, more diverse, and less biased training datasets can help models learn more accurate representations of the world.
  • Fact-Checking Layers: Developing mechanisms where the AI model's output is automatically checked against reliable external sources or internal knowledge graphs.
  • Uncertainty Quantification: Training models to express confidence levels in their outputs, allowing users to identify potentially unreliable information (a lightweight approximation of this idea is sketched after this list).
  • Enhanced Fine-tuning and Alignment: Developing more sophisticated reinforcement learning techniques or alternative alignment methods that specifically penalize factual errors more effectively.
  • Human Feedback Loops: Continuously incorporating human feedback to identify and correct instances of hallucination, improving the model over time.
  • Model Architecture Innovations: Designing new model architectures that are inherently less prone to generating false information, perhaps by better separating knowledge storage from reasoning processes.
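One lightweight way to approximate the uncertainty-quantification idea from the list, without retraining anything, is self-consistency sampling: ask the same question several times at non-zero temperature and treat disagreement between the samples as a warning sign. The sketch below again assumes the OpenAI chat-completions interface and uses a deliberately crude string comparison.

```python
# Self-consistency as a rough confidence signal: low agreement across
# samples flags answers that deserve review or an explicit "not sure".
from collections import Counter
from openai import OpenAI

client = OpenAI()

def self_consistency(question: str, n: int = 5, temperature: float = 1.0):
    """Sample n answers and return (majority answer, agreement ratio)."""
    answers = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": question}],
            temperature=temperature,
        )
        # Crude normalization; a real system would compare answers semantically.
        answers.append(resp.choices[0].message.content.strip().lower())
    top, count = Counter(answers).most_common(1)[0]
    return top, count / n
```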

As OpenAI spokesperson Niko Felix stated, "Addressing hallucinations across all our models is an ongoing area of research, and we're continually working to improve their accuracy and reliability." This underscores the fact that tackling hallucination is not a solved problem but an active frontier in AI research.

The Challenge of Scaling Reasoning and Reliability

The findings regarding o3 and o4-mini highlight a critical challenge in the current phase of AI development. As models become more complex and capable of sophisticated reasoning, they may also develop new failure modes or amplify existing ones like hallucination. The pivot to reasoning models was partly motivated by the desire to achieve greater capabilities more efficiently, but if this comes at the cost of reduced reliability, it could hinder widespread adoption in critical applications.

The increased hallucination in these specific reasoning models suggests that the mechanisms enabling better reasoning might, under certain conditions, also increase the likelihood of generating plausible but incorrect inferences or statements. This could be due to the models connecting disparate pieces of information in ways that are logically sound within the model's internal state but factually inaccurate in the real world, or perhaps due to the reinforcement learning process encouraging confident-sounding answers even when the underlying knowledge is uncertain.

The urgency to find effective solutions is amplified if, as the data from o3 and o4-mini suggests, scaling up reasoning capabilities indeed correlates with worsening hallucinations. Researchers must not only focus on enhancing AI's intelligence and reasoning abilities but also concurrently prioritize the fundamental problem of grounding AI outputs in verifiable reality. The future success and trustworthiness of advanced AI systems, particularly those designed for complex tasks, will heavily depend on the ability to build models that are not only smart but also consistently truthful and reliable.

The journey towards truly reliable and trustworthy AI is far from over. The case of OpenAI's o3 and o4-mini models serves as a stark reminder that progress is not always linear and that fundamental challenges like hallucination require continued, dedicated research effort across the entire AI community.