Doctors’ AI Systems Are Hallucinating Nonexistent Medical Issues — The 1 Hidden Cost That Could Bankrupt Your Hospital

Fast Facts: When a doctor’s AI scribe invents a disease you don’t have, the cost isn’t just clinical—it’s financial. A 2024 Associated Press investigation found OpenAI’s Whisper transcription tool hallucinates in medical settings, fabricating treatments, racial commentary, and violent rhetoric. With over 30,000 clinicians already using Whisper-based tools, the liability exposure is massive. This article breaks down why these hallucinations happen, what they cost, and why procurement teams are the last line of defense.

Why Hospitals Are Paying for AI That Invents Diseases

In October 2024, the Associated Press dropped a bombshell: doctors’ AI systems are hallucinating nonexistent medical issues during patient appointments. OpenAI’s Whisper transcription tool—used by more than 30,000 clinicians across 40 health systems—was found to fabricate text that patients never spoke.

These aren’t minor typos. The invented text includes racial commentary, violent rhetoric, and entirely imagined medical treatments. One transcription invented a non-existent medication called “hyperactivated antibiotics.” Another turned a patient mentioning an umbrella into a hallucinated narrative about a terror knife and multiple killings.

“Nobody wants a misdiagnosis. There should be a higher bar.” — Alondra Nelson, former White House Office of Science and Technology Policy director

The economic logic that pushed hospitals toward AI scribes is sound on paper: reduce documentation time, cut burnout, see more patients. But when the tool invents information, the savings evaporate into liability.

Why the Error Rate Is a Financial Time Bomb

A University of Michigan researcher found hallucinations in 8 out of every 10 public meeting transcripts Whisper processed. A machine learning engineer analyzed over 100 hours of Whisper transcriptions and discovered fabrications in roughly half. A third developer created 26,000 transcripts and found hallucinations in nearly every single one.

A peer-reviewed study by researchers at Cornell, the University of Washington, and other institutions found that about 1% of audio transcriptions contained entirely hallucinated phrases, and that 38% of those hallucinations included explicit harms: perpetuating violence, making inaccurate associations, or implying false authority.

Now apply that to healthcare. Nabla, a company building on Whisper, estimates its tool has transcribed 7 million medical conversations. If the study's 1% rate held for clinical audio, roughly 70,000 patient records could contain fabricated text, and at the same study's 38% harm share about 26,600 of those could contain harmful fabrications. Each one is a potential malpractice claim.
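The arithmetic behind that estimate can be checked in a few lines. This is a back-of-the-envelope sketch, not a measurement: it assumes the study's rates, which were not measured on clinical visits, carry over unchanged to Nabla's 7 million conversations. The function name `expected_counts` is illustrative.

```python
# Figures quoted in this article. Applying them to medical visits is an assumption.
CONVERSATIONS = 7_000_000    # Nabla's estimate of transcribed medical conversations
HALLUCINATION_RATE = 0.01    # about 1% of transcriptions contained hallucinated phrases
HARMFUL_SHARE = 0.38         # share of those hallucinations with explicit harms


def expected_counts(n: int, rate: float, harmful_share: float) -> tuple[int, int]:
    """Return (records with a hallucination, subset with harmful content)."""
    affected = round(n * rate)
    harmful = round(affected * harmful_share)
    return affected, harmful


affected, harmful = expected_counts(CONVERSATIONS, HALLUCINATION_RATE, HARMFUL_SHARE)
print(f"{affected:,} affected records, {harmful:,} with harmful content")
# prints "70,000 affected records, 26,600 with harmful content"
```

Changing either rate shows how sensitive the exposure is: the headline number scales linearly with whatever hallucination rate a hospital's own audio actually produces.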

Why Procurement Is Ignoring OpenAI’s Own Warnings

Here’s the procurement failure: OpenAI explicitly warns against using Whisper in “high-risk domains.” Hospitals used it anyway.

“This seems solvable if the company is willing to prioritize it. It’s problematic if you put this out there and people are overconfident about what it can do and integrate it into all these other systems.” — William Saunders, former OpenAI research engineer

Nabla’s response? The company claims its proprietary model eliminates hallucinations—but it also erases the original audio recordings for “data safety reasons.” Doctors cannot verify AI-generated notes against the source. Procurement teams who approved this tool effectively bought a black box with no audit trail.

The BMJ found that one in five UK GPs are already using generative AI tools like ChatGPT in clinical practice, despite no formal training or governance frameworks. Speed of adoption is outpacing safety—and procurement is greenlighting tools without requiring independent validation.

Why the Hallucination Problem Hits Vulnerable Patients Hardest

⚠ Fiction: Sarah, a 58-year-old teacher with aphasia following a stroke, visited her neurologist. The AI scribe transcribed her silence as a detailed confession of self-harm. The note triggered a psychiatric hold—entirely based on words she never said. The original audio was deleted by the AI vendor. It took her family three weeks and $12,000 in legal fees to correct the record.

This fictional scenario is uncomfortably close to reality. The Cornell study found that Whisper’s hallucinations disproportionately affect people with aphasia—those who speak with longer pauses. Deaf and hard-of-hearing patients face similar risks, with no way to identify fabrications in their transcripts.

The vulnerability isn’t random. AI scribes fail most on the patients who are already hardest to treat—speech-impaired, elderly, non-native speakers. The financial cost of misdiagnosis for these groups is amplified by longer recovery times and higher litigation risk.

Global Implications: A Regulatory Reckoning Is Coming

Ontario’s Auditor General found that 9 out of 20 AI scribe systems evaluated during a provincial procurement process fabricated information or suggested treatment plans doctors never made—including referring patients for therapy and ordering blood tests. Despite these findings, four systems were still approved.

The global pattern is clear: governments are racing to deploy AI in healthcare without adequate testing. The European Union’s AI Act classifies medical AI as high-risk, but enforcement mechanisms are still being built. The U.S. has no federal AI medical device framework with teeth. Procurement offices are making de facto regulatory decisions with limited technical expertise.

💡 CreedTec Analyst’s Note

By Daniel Ikechukwu

Strategic Impact: The Whisper hallucination crisis exposes a fundamental flaw in healthcare AI procurement: vendors are treated as partners rather than suppliers subject to independent verification. Every AI scribe contract signed without an audit clause is a liability time bomb.

Stop / Start / Watch:

Stop procuring AI scribes that delete original audio recordings.
Start requiring independent hallucination benchmarks before purchase.
Watch the legal landscape: the first major malpractice suit naming an AI vendor as co-defendant will reshape the market.
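
What an independent hallucination benchmark might look like, in its simplest form: compare each AI transcript against a human-verified reference and flag word runs that appear only in the AI output. The sketch below uses Python's standard `difflib` for the alignment. The function names and the choice to flag pure insertions only are assumptions made for illustration; this is a crude proxy, not an established benchmark, and it requires the original audio to be retained so a reference can be produced at all.

```python
from difflib import SequenceMatcher


def inserted_phrases(reference: str, hypothesis: str) -> list[str]:
    """Return word runs present in the AI transcript but absent from the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    flagged = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" marks hypothesis words with no counterpart in the reference.
        # Substitutions ("replace") are skipped here: they may be ordinary
        # misrecognitions and need separate review.
        if tag == "insert":
            flagged.append(" ".join(hyp[j1:j2]))
    return flagged


def hallucination_rate(pairs) -> float:
    """Fraction of (reference, hypothesis) pairs with at least one inserted run."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(1 for ref, hyp in pairs if inserted_phrases(ref, hyp))
    return hits / len(pairs)


# Toy example, not real patient data.
print(inserted_phrases(
    "patient reports mild headache",
    "patient reports mild headache and takes hyperactivated antibiotics",
))
# prints ['and takes hyperactivated antibiotics']
```

A procurement team would run something like this over a held-out set of verified visits, ideally including speakers with long pauses, before signing.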

ROI Outlook: The cost of one missed hallucination—a wrongful treatment, a psychiatric hold, a missed diagnosis—can exceed the entire annual savings from AI documentation. Procurement teams must price hallucination risk into every contract.
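
One way to price that risk is a simple break-even calculation: how many costly incidents per year would cancel out the documentation savings? The sketch below is a minimal model under stated assumptions. Every dollar figure and rate in the usage lines is a hypothetical placeholder, not data from any hospital or study, and the function names are illustrative.

```python
def break_even_incidents(annual_savings: float, cost_per_incident: float) -> float:
    """Incidents per year at which incident costs equal documentation savings."""
    if cost_per_incident <= 0:
        raise ValueError("cost_per_incident must be positive")
    return annual_savings / cost_per_incident


def expected_net_value(annual_savings: float, notes_per_year: int,
                       incident_rate: float, cost_per_incident: float) -> float:
    """Annual savings minus the expected cost of hallucination-driven incidents."""
    expected_incidents = notes_per_year * incident_rate
    return annual_savings - expected_incidents * cost_per_incident


# Hypothetical placeholders: replace with your organization's own estimates.
savings = 500_000            # assumed annual documentation savings, in dollars
incident_cost = 250_000      # assumed average cost of one harmful incident

print(break_even_incidents(savings, incident_cost))  # prints 2.0
print(expected_net_value(savings, notes_per_year=100_000,
                         incident_rate=1e-4, cost_per_incident=incident_cost))
```

With these placeholder inputs, two incidents a year erase the savings. The point of the exercise is the structure, not the numbers: the contract should state who bears `cost_per_incident`.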

FAQ

How common are AI hallucinations in medical transcription?

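Reported rates range from about 1% to nearly every transcript, depending on the study, the audio, and the speaker. The Cornell-led study found hallucinated phrases in about 1% of transcriptions, with 38% of those containing harmful content. A University of Michigan researcher found hallucinations in 8 of every 10 public meeting transcripts, and one developer found them in nearly all of 26,000 transcripts.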

Which AI tools are most affected?

OpenAI’s Whisper is the primary tool under scrutiny, but Ontario’s audit found hallucinations in 9 of 20 AI scribe systems tested. The problem spans multiple vendors.

Can doctors verify AI-generated notes?

Not if the vendor deletes original audio recordings. Nabla, whose Whisper-based tool is among those used by more than 30,000 clinicians, erases source audio for “data safety reasons,” so doctors cannot check a note against what was actually said.

What’s the procurement risk of buying an AI scribe without an audit clause?

Without audit rights, a healthcare organization cannot independently verify accuracy, is exposed to malpractice liability, and may violate emerging AI governance regulations.

Are regulators catching up?

The EU AI Act classifies medical AI as high-risk, but enforcement is nascent. Ontario’s auditor flagged inadequate testing, but governments continue to approve flawed systems.

What should procurement teams demand from AI scribe vendors?

Mandatory third-party hallucination benchmarks, retention of original audio for audit, and contractual liability for AI-generated clinical errors.
