As research on Generative AI begins hitting the presses, we need to examine these studies as critically as we would work from any other field, and avoid feeding the either/or narrative of hype versus fear around this technology.

Case in point: a paper titled Faithfulness Hallucination Detection in Healthcare AI has been making the rounds on LinkedIn and various media outlets, with people saying, look, see, this technology is dangerous in the healthcare space, and others commenting in ready agreement. The paper’s abstract said the goal was to investigate whether the GPT-4o and Llama-3 models could provide concise and accurate summaries of lengthy electronic health records. It concluded that both models exhibit hallucinations on almost all of the sample health records, and that these pose significant risks of misdiagnosis and inappropriate treatment.

I decided to take a closer look.

The Prompts

First, I wanted to see which prompts were used to generate the summaries of the medical notes. If you’re familiar with Gen. AI, you know that crafting the prompt is an important part of working with these tools, especially for complex tasks or documentation.

I was disappointed to find that the paper glossed over the summary prompt, offering all of three sentences on the topic:

GPT-4o and Llama-3 were prompted to summarize the medical notes with specific instructions to provide any available information regarding the variables of interest. To promote brevity, the models were additionally prompted to generate the summaries of no more than n = 500 words. Explicitly the prompt provided was: Summarize the provided clinical note with at most {n} words. Ensure to capture the following essential information, when they exist: the specific cancer type, its morphology, cancer stage, progression, TNM staging, prescribed medications, diagnostic tests conducted, surgical interventions performed, and the patient’s response to treatment. {clinical note}
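To make the setup concrete, here is a minimal sketch of how that template might be filled in and sent to a model. The paper does not share its code, so the client library (OpenAI’s Python SDK), the model identifier, and the default word limit below are my assumptions for illustration only; I have also renamed the {clinical note} placeholder so it formats in Python.

```python
# Illustrative sketch only: the paper does not publish its implementation.
# Client library, model name, and default word limit are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

SUMMARY_PROMPT = (
    "Summarize the provided clinical note with at most {n} words. "
    "Ensure to capture the following essential information, when they exist: "
    "the specific cancer type, its morphology, cancer stage, progression, "
    "TNM staging, prescribed medications, diagnostic tests conducted, "
    "surgical interventions performed, and the patient's response to treatment.\n\n"
    "{clinical_note}"  # renamed from {clinical note} so str.format works
)

def summarize(clinical_note: str, n: int = 500) -> str:
    """Fill the paper's single-shot template and request a summary."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": SUMMARY_PROMPT.format(n=n, clinical_note=clinical_note),
        }],
    )
    return response.choices[0].message.content
```

Written out like this, it is a single generic instruction for a complex clinical document, with no examples, no role framing, and no guidance on handling missing or conflicting information.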

Given such a loose, underspecified prompt, I didn’t find it surprising that the study found so many hallucinations. No rationale is given for using this particular prompt, as if it were a trivial part of the experiment. That alone is a significant, unacknowledged limitation in the methodology, and it undermines the entire study. (Contrast this approach with the one in the paper A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models, which includes an appendix on prompt tuning explaining the different prompts tried and the choice of the best one.)

There is a brief reference to a second prompt (not provided) used to try to get Llama-3 to include more details in the summary without a word limit, an indication that the first prompt was inadequate. A third prompt was used to compare the original medical notes against the AI-generated summaries to detect inconsistencies:

We employ a straightforward prompt for hallucination detection: Detect all the inconsistencies between an original document and its summary. The inconsistency could be any details in the summary that could potentially have a different interpretation from the document, or vice versa.
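Again, only the prompt text is given; how the original note and the summary were actually attached to it is not described. A rough sketch of that detection step, with the same assumed client and model as above and a document/summary layout of my own guessing, might look like this:

```python
# Illustrative sketch only; the concatenation of note and summary is assumed.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

DETECTION_PROMPT = (
    "Detect all the inconsistencies between an original document and its summary. "
    "The inconsistency could be any details in the summary that could potentially "
    "have a different interpretation from the document, or vice versa.\n\n"
    "Original document:\n{document}\n\n"
    "Summary:\n{summary}"
)

def detect_inconsistencies(document: str, summary: str) -> str:
    """Ask the model to compare the original note against its own summary."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": DETECTION_PROMPT.format(document=document, summary=summary),
        }],
    )
    return response.choices[0].message.content
```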

The authors do acknowledge at the end of this section that better prompting may be necessary to improve LLM-based hallucination detection systems, but the final conclusion section does not mention prompting at all. Given the quality of the prompts used, one could actually make the case that the models produced reasonably good output!

The Authors

Second, I looked into the researchers’ affiliations and where the paper was published, as an authority and accuracy check. This is where things got even more interesting. The first author’s affiliation is listed as the University of Massachusetts Amherst, the second author’s as Mendel AI, and the remaining nine authors are split between the two. A quick search on LinkedIn shows that the first author has also been a graduate student researcher at Mendel AI for the past seven months, working on clinical data summarization and hallucination projects. That is a relationship with the company that should have been disclosed.

Curious about Mendel AI, I did a quick Google search that turned up a press release from Mendel AI titled Mendel and UMass Amherst Unveil Groundbreaking Research on AI-driven Hallucination Detection in Healthcare. It explains that Mendel AI is a California-based company whose Hypercube system uses medical knowledge bases and natural language processing to detect hallucinations, and that Hypercube was used in the study for an initial hallucination detection step. It would have been nice if this were stated more plainly in the paper because, again, there is a bias here toward the use of this company’s product.

Review

In terms of publication information, the paper is set to be presented at a data science conference (ACM KDD 2024) next week. It has not been formally published; it is available on OpenReview.net, a platform that manages academic conference and workshop submissions. This means it has only been reviewed by the conference committee as a conference paper, not as a fully peer-reviewed journal publication (not that this in itself would be a guarantee of anything).

In summary, this is a conference paper describing a pilot study of hallucinations in summaries of health records, using loose prompts, authored by 11 people, more than half of whom are affiliated with an AI company that develops software for detecting hallucinations in health records.

It sounds like this pilot study was really an opportunity to make Mendel AI’s product look good. But the affiliation with the university lends the research legitimacy, and people aren’t looking twice before sharing the paper with warnings about AI and healthcare. I would like the authors to go into more detail on their methodology around the prompts used and on how the conflict of interest with the AI company is being managed. For starters, the first author’s connection with Mendel AI should be disclosed in the paper and the limitations of the methodology acknowledged.

There is a risk that people lose faith in studies of Gen. AI and/or its use in healthcare settings, and further risks if research institutions aren’t upfront about their relationships with AI companies and the nature of the research they’re conducting under a university banner. We want and need good-quality research to discuss as we engage with this emerging technology.
