The original title of this talk was “A Turing Test for Clinical Reasoning: Large Language Models and the Future of Diagnosis.” A lot can happen in a year, so now we find ourselves “Turing” in the rearview as we careen toward an exciting and uncertain future. Adam Rodman, MD, MPH, a general internist and medical educator at Beth Israel Deaconess Medical Center and assistant professor at Harvard Medical School, both in Boston, assured the audience, “I don’t think we’re in danger of being replaced any time soon.” But by the end of the presentation, I wondered about his definition of the word “soon.”
The new models
Dr. Rodman began the session by reviewing the difference between traditional large language models (LLMs) and newer reasoning models. He described traditional LLMs as “autocomplete on steroids,” a technology whose power comes from predicting the next most appropriate word in a string of text. Traditional LLMs, while powerful, cannot explain how they arrive at an answer.
In contrast, reasoning models are designed to show their chain of thought. You might think of it as the difference between an early trainee developing a plan based solely on prior experiences with similar patients (the traditional LLM) and a seasoned attending walking through how each piece of data influenced their thinking (the reasoning model). These reasoning models are incredibly powerful and can solve problems they have never seen before.
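For readers curious what “autocomplete on steroids” means mechanically, here is a minimal, purely illustrative Python sketch (not from the talk, and with made-up example data) of a toy next-word predictor. A real LLM does this same next-token prediction with a neural network over billions of parameters; a reasoning model additionally generates and displays intermediate steps before committing to an answer.

```python
# Toy illustration only: a traditional LLM is, at its core, a vastly more
# sophisticated version of this next-word predictor. The tiny "corpus" and
# function names below are invented for demonstration.
from collections import Counter, defaultdict

corpus = (
    "fever and cough suggest pneumonia . "
    "fever and rash suggest a viral exanthem . "
    "fever and dysuria suggest a urinary tract infection ."
).split()

# For each word, count which words tend to follow it (a bigram model).
followers = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    followers[current_word][next_word] += 1

def predict_next(word: str) -> str:
    """Return the word most often observed after `word` in the corpus."""
    options = followers.get(word)
    return options.most_common(1)[0][0] if options else "<unknown>"

print(predict_next("fever"))    # 'and'
print(predict_next("suggest"))  # most common continuation in this tiny corpus
```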
The human and the machine
Next, we were taken on a tour of recent studies evaluating the performance of reasoning models. These models show an incredible ability to generate differential diagnoses, with the best-performing model’s differential containing the correct diagnosis more than 75% of the time, compared with about 30% for clinicians. In a recent publication,1 GPT-4 was shown to display higher-quality reasoning than residents and attendings, with equivalent efficiency, accuracy, and identification of can’t-miss items.
And the reasoning models don’t just do diagnosis. Early signs indicate they can perform at a level equivalent to, and possibly superior to, that of human clinicians in taking clinical histories and recommending management.
Interestingly, while these models can outperform clinicians on some measures of clinical reasoning, there are concerning data that they lose some of their edge when partnered with a clinician. Clinicians using artificial intelligence (AI) did no better on clinical reasoning than those without AI, yet the models alone scored better. This may suggest that human biases in reasoning can negate the value of AI assistance.
The demo
In the next portion of the presentation, the man (Daniel Restrepo, MD, core educator faculty member and associate program director of the internal medicine residency program at Massachusetts General Hospital, and assistant professor of medicine at Harvard Medical School, both in Boston) took on the machine (a preview version of a reasoning model). Kudos to Dr. Restrepo for keeping pace without sacrificing quality during a fast and furious case recitation by Dr. Rodman. As Dr. Rodman doled out aliquots of information, both doctor and computer impressed the audience with their reasoning and rapid processing speed.
And while Dr. Restrepo’s skill as a diagnostician was on full display, it was hard not to direct the majority of the awe in the room to the screen behind him, where the reasoning model rapidly developed a prioritized table of diagnoses with columns for key supporting information, refuting information, pitfalls if missed, and a reasoned defense of each diagnosis’s place in the table. Below the table were management recommendations that could have saved the patient quite a lot of time and trouble had they been followed.
The case wound through its twists and turns, with the reasoning model suggesting the final diagnosis and highlighting it as a “can’t miss” before Dr. Restrepo got there. To be fair, I don’t think the key piece of clinical data (scrotal swelling) that triggered the reasoning model was ever verbally communicated to Dr. Restrepo amid the avalanche of information. This was very true to life: the volume of information we need to process often obscures small but important details.
As we neared the end of the case and the imaging revealed a scrotal abscess with associated Fournier’s gangrene, the reasoning model impressively advised, “What to tell the team: ‘This is Fournier’s gangrene – he needs the OR right now.’” That’s the kind of situational awareness that makes you high-five a resident.
The real world
Finally, Dr. Rodman presented data from a study comparing the second-opinion power of two reasoning models and two experienced hospitalists. For “all comers” to the emergency department over two weeks, the models and physicians were asked to provide second opinions at three pre-defined touchpoints. The reasoning models were equivalent to the experienced hospitalists at times of high information density (later in the patient course) but outperformed the hospitalists when information density was lower (earlier in the patient course).
The Q and A
Drs. Rodman and Restrepo turned to the audience to ask, “What does hospitalist-computer collaboration look like in the future?” Here are some pearls from that discussion:
- Sometimes, patients using an AI tool will get it right before we do (or when we didn’t). Humility and keeping the patient at the center are key.
- Overuse of AI can lead to cognitive de-skilling. “AI can make us stupid.” Keep your mind sharp.
- Great questions to ask a reasoning model: “What could I be missing?” and “What else should I check?”
Author’s Note: I did not use AI to assist in the writing of this article. If I had, it might have been better.
Dr. Herrle is a hospitalist and the associate medical director for professional development in the division of hospital medicine at Maine Medical Center in Portland, Maine, and the associate chief medical information officer for MaineHealth.
Key Takeaways
- The technological advancement of LLMs has led to reasoning models that are able to document their thought processes and perform well on problems they have not previously encountered.
- For a variety of clinical reasoning tasks, reasoning models are showing equivalent or superior performance when compared to experienced clinicians. There are studies suggesting that the benefit of these models is lost or moderated when clinicians are in the loop.
- As clinicians, we need to drive the discussion around the responsible use of technology in patient care. If we do not participate, the decision making will happen without us.
Reference
- Cabral S, et al. Clinical reasoning of a generative artificial intelligence model compared with physicians. JAMA Intern Med. 2024;184(5):581-583. doi: 10.1001/jamainternmed.2024.0295.