Mayo Clinic’s secret weapon against AI hallucinations: Reverse RAG in action

Even as large language models (LLMs) become ever more sophisticated and capable, they continue to suffer from hallucinations: offering up inaccurate information, or, to put it more harshly, lying.
This can be particularly harmful in areas like healthcare, where wrong information can have dire results.
Mayo Clinic, one of the top-ranked hospitals in the U.S., has adopted a novel technique to address this challenge. To succeed, the medical facility had to overcome the limitations of retrieval-augmented generation (RAG), the process by which LLMs pull information from specific, relevant data sources. The hospital has employed what is essentially reverse RAG: the model extracts relevant information, then links every data point back to its original source content.
Remarkably, this has eliminated nearly all data-retrieval-based hallucinations in non-diagnostic use cases — allowing Mayo to push the model out across its clinical practice.
“With this approach of referencing source information through links, extraction of this data is no longer a problem,” Matthew Callstrom, Mayo’s medical director for strategy and chair of radiology, told VentureBeat.
Accounting for every single data point
Dealing with healthcare data is a complex challenge, and it can be a time sink. Although vast amounts of data are collected in electronic health records (EHRs), that data can be extremely difficult to find and parse.
Mayo’s first use case for AI in wrangling all this data was discharge summaries (visit wrap-ups with post-care tips), with its models using traditional RAG. As Callstrom explained, that was a natural place to start because it is simple extraction and summarization, which is what LLMs generally excel at.
“In the first phase, we’re not trying to come up with a diagnosis, where you might be asking a model, ‘What’s the next best step for this patient right now?’,” he said.
The danger of hallucinations was also not nearly as significant as it would be in doctor-assist scenarios, though that’s not to say the data-retrieval mistakes weren’t head-scratching.
“In our first couple of iterations, we had some funny hallucinations that you clearly wouldn’t tolerate — the wrong age of the patient, for example,” said Callstrom. “So you have to build it carefully.”
While RAG has been a critical component of grounding LLMs (anchoring their outputs in trusted source data), the technique has its limitations. Models may retrieve irrelevant, inaccurate or low-quality data; fail to determine whether information is relevant to the user’s query; or create outputs that don’t match requested formats (returning plain text rather than a detailed table, for instance).
There are workarounds to these problems, such as graph RAG, which draws on knowledge graphs to provide context, and corrective RAG (CRAG), where an evaluation mechanism assesses the quality of retrieved documents before they are used. Even so, hallucinations haven’t gone away.
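As a rough illustration of the CRAG idea, a retrieval step can be gated by an evaluator that grades each document before it reaches the model. The sketch below is a minimal, hypothetical version: the word-overlap grader, threshold and function names are illustrative stand-ins, not any production implementation (a real pipeline would typically ask a second LLM to do the grading).

```python
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    text: str

def relevance_score(query: str, doc: Document) -> float:
    """Stand-in grader: crude word overlap between query and document.
    A real CRAG setup would have an LLM judge relevance instead."""
    q_words = set(query.lower().split())
    d_words = set(doc.text.lower().split())
    return len(q_words & d_words) / max(len(q_words), 1)

def corrective_retrieve(query: str, docs: list[Document],
                        threshold: float = 0.5) -> list[Document]:
    """Keep only documents the evaluator judges relevant enough.
    If nothing passes, signal that retrieval must be corrected,
    e.g. by rewriting the query or widening the search."""
    passing = [d for d in docs if relevance_score(query, d) >= threshold]
    if not passing:
        raise LookupError("No retrieved document passed the relevance check")
    return passing
```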
Referencing every data point
This is where the reverse RAG process comes in. Specifically, Mayo paired what’s known as the clustering using representatives (CURE) algorithm with LLMs and vector databases to double-check data retrieval.
Clustering is critical to machine learning (ML) because it organizes, classifies and groups data points based on their similarities or patterns. This essentially helps models “make sense” of data. CURE goes beyond typical clustering with a hierarchical technique, using distance measures to group data based on proximity (think: data points closer to one another are more related than those further apart). The algorithm can also detect “outliers,” or data points that don’t match the others.
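To make the mechanics concrete, here is a toy sketch of CURE’s core ideas, written from the algorithm’s published description rather than Mayo’s code: each cluster is summarized by a few well-scattered representative points shrunk toward its centroid, clusters merge by closest representatives, and clusters that stay tiny are flagged as outliers. All parameter values are illustrative.

```python
import numpy as np

def cure_cluster(points: np.ndarray, k: int, n_reps: int = 3,
                 shrink: float = 0.3) -> list[list[int]]:
    """Toy CURE-style agglomerative clustering over row vectors.
    Each cluster is summarized by up to n_reps scattered representative
    points shrunk toward the centroid; the distance between two
    clusters is the closest pair of their representatives."""
    clusters = [[i] for i in range(len(points))]

    def reps(idx: list[int]) -> np.ndarray:
        members = points[idx]
        centroid = members.mean(axis=0)
        # Farthest-from-centroid points approximate "well-scattered" reps
        order = np.argsort(-np.linalg.norm(members - centroid, axis=1))
        chosen = members[order[:n_reps]]
        # Shrinking toward the centroid dampens the pull of outliers
        return chosen + shrink * (centroid - chosen)

    while len(clusters) > k:
        best = None  # (distance, index_a, index_b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(float(np.linalg.norm(p - q))
                        for p in reps(clusters[a]) for q in reps(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    return clusters

def flag_outliers(clusters: list[list[int]],
                  min_size: int = 2) -> list[list[int]]:
    """CURE treats clusters that stay very small as outliers."""
    return [c for c in clusters if len(c) < min_size]
```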
Combining CURE with a reverse RAG approach, Mayo’s LLM split the summaries it generated into individual facts, then matched those back to source documents. A second LLM then scored how well each fact aligned with its source, specifically whether there was a causal relationship between the two.
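In code, that verification loop might look something like the sketch below. It is a hypothetical reconstruction from the description above, not Mayo’s system: the sentence-level fact splitter and word-overlap scorer are stand-ins for the LLM-based extraction and scoring the article describes.

```python
from dataclasses import dataclass

@dataclass
class SourceChunk:
    source_id: str  # e.g. a lab result or imaging report identifier
    text: str

def split_into_facts(summary: str) -> list[str]:
    """Stand-in fact splitter: one 'fact' per sentence. The described
    pipeline would use an LLM to extract atomic claims instead."""
    return [s.strip() for s in summary.split(".") if s.strip()]

def support_score(fact: str, chunk: SourceChunk) -> float:
    """Stand-in alignment scorer using word overlap; per the article,
    a second LLM scores whether the source actually supports the fact."""
    f = set(fact.lower().split())
    c = set(chunk.text.lower().split())
    return len(f & c) / max(len(f), 1)

def verify_summary(summary: str, chunks: list[SourceChunk],
                   threshold: float = 0.6) -> list[dict]:
    """Link every generated fact back to its best source chunk and
    flag any fact without adequate support as a likely hallucination."""
    results = []
    for fact in split_into_facts(summary):
        best = max(chunks, key=lambda c: support_score(fact, c))
        score = support_score(fact, best)
        results.append({"fact": fact,
                        "source": best.source_id if score >= threshold else None,
                        "supported": score >= threshold})
    return results
```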
“Any data point is referenced back to the original laboratory source data or imaging report,” said Callstrom. “The system ensures that references are real and accurately retrieved, effectively solving most retrieval-related hallucinations.”
Callstrom’s team first used vector databases to ingest patient records so that the model could quickly retrieve information. They initially used a local database for the proof of concept (POC); the production version is a generic database with logic in the CURE algorithm itself.
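For readers unfamiliar with that ingestion step, the sketch below shows the general shape of a vector-database workflow: embed each record chunk, store the vectors, then retrieve the nearest chunks for a query. The hash-based embedding and the RecordStore class are illustrative stand-ins; a real deployment would use a proper embedding model and database.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Stand-in embedding: hash words into a fixed-size bag-of-words
    vector. A real deployment would call an embedding model."""
    v = np.zeros(dim)
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        v[h % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

class RecordStore:
    """Minimal in-memory vector index over patient-record chunks."""

    def __init__(self) -> None:
        self.ids: list[str] = []
        self.vectors: list[np.ndarray] = []

    def ingest(self, record_id: str, text: str) -> None:
        self.ids.append(record_id)
        self.vectors.append(embed(text))

    def search(self, query: str, top_k: int = 3) -> list[str]:
        q = embed(query)
        sims = np.array([v @ q for v in self.vectors])
        order = np.argsort(sims)[::-1][:top_k]
        return [self.ids[i] for i in order]
```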
“Physicians are very skeptical, and they want to make sure that they’re not being fed information that isn’t trustworthy,” Callstrom explained. “So trust for us means verification of anything that might be surfaced as content.”
‘Incredible interest’ across Mayo’s practice
The CURE technique has proven useful for synthesizing new patient records too. Outside records detailing patients’ complex problems can have “reams” of data content in different formats, Callstrom explained. This needs to be reviewed and summarized so that clinicians can familiarize themselves before they see the patient for the first time.
“I always describe outside medical records as a little bit like a spreadsheet: You have no idea what’s in each cell, you have to look at each one to pull content,” he said.
But now, the LLM does the extraction, categorizes the material and creates a patient overview. Typically, that task could take 90 or so minutes out of a practitioner’s day — but AI can do it in about 10, Callstrom said.
He described “incredible interest” in expanding the capability across Mayo’s practice to help reduce administrative burden and frustration.
“Our goal is to simplify the processing of content — how can I augment the abilities and simplify the work of the physician?” he said.
Tackling more complex problems with AI
Of course, Callstrom and his team see great potential for AI in more advanced areas. For instance, they have teamed up with Cerebras Systems to build a genomic model that predicts the best arthritis treatment for a given patient, and they are also working with Microsoft on an image encoder and an imaging foundation model.
Their first imaging project with Microsoft focuses on chest X-rays. So far they have converted 1.5 million X-rays and plan to process another 11 million in the next round. Callstrom explained that building an image encoder is not extraordinarily difficult; the complexity lies in making the resulting images actually useful.
The goals are to simplify the way Mayo physicians review chest X-rays and to augment their analyses. AI might, for example, identify where to insert an endotracheal tube or a central line to help patients breathe. “But that can be much broader,” said Callstrom. For instance, physicians can unlock other content and data, such as a simple prediction of ejection fraction (the amount of blood the heart pumps out) from a chest X-ray.
“Now you can start to think about predicting response to therapy on a broader scale,” he said.
Mayo also sees “incredible opportunity” in genomics (the study of DNA), as well as other “omic” areas, such as proteomics (the study of proteins). AI could support gene transcription, or the process of copying a DNA sequence, to create reference points to other patients and help build a risk profile or therapy paths for complex diseases.
“So you basically are mapping patients against other patients, building each patient around a cohort,” Callstrom explained. “That’s what personalized medicine will really provide: ‘You look like these other patients, this is the way we should treat you to see expected outcomes.’ The goal is really returning humanity to healthcare as we use these tools.”
But Callstrom emphasized that everything on the diagnosis side requires a lot more work. It’s one thing to demonstrate that a foundation model for genomics works for rheumatoid arthritis; it’s another to actually validate that in a clinical environment. Researchers have to start by testing small datasets, then gradually expand test groups and compare against conventional or standard therapy.
“You don’t immediately go to, ‘Hey, let’s skip methotrexate’” [a popular rheumatoid arthritis medication], he noted.
Ultimately: “We recognize the incredible capability of these [models] to actually transform how we care for patients and diagnose in a meaningful way, to have more patient-centric or patient-specific care versus standard therapy,” said Callstrom. “The complex data that we deal with in patient care is where we’re focused.”