Can AI answer medical questions better than your doctor?

Last year, headlines describing a study about artificial intelligence (AI) were eye-catching, to say the least:

  • ChatGPT Rated as Better Than Real Doctors for Empathy, Advice
  • The AI will see you now: ChatGPT provides higher quality answers and is more empathetic than a real doctor, study finds
  • Is AI Better Than A Doctor? ChatGPT Outperforms Physicians In Compassion And Quality Of Advice

At first glance, the idea that a chatbot using AI might be able to generate good answers to patient questions isn’t surprising. After all, ChatGPT boasts that it passed a final exam for a Wharton MBA, wrote a book in a few hours, and composed original music.

But showing more empathy than your doctor? Ouch. Before assigning final honors on quality and empathy to either side, let’s take a second look.

What tasks is AI taking on in health care?

Already, a rapidly growing list of medical applications of AI includes drafting doctor’s notes, suggesting diagnoses, helping to read x-rays and MRI scans, and monitoring real-time health data such as heart rate or oxygen level.

But the idea that AI-generated answers might be more empathetic than actual physicians struck me as amazing — and sad. How could even the most advanced machine outperform a physician in demonstrating this important and particularly human virtue?

Can AI deliver good answers to patient questions?

It’s an intriguing question.

Imagine you’ve called your doctor’s office with a question about one of your medications. Later in the day, a clinician on your health team calls you back to discuss it.

Now, imagine a different scenario: you ask your question by email or text, and within minutes receive an answer generated by a computer using AI. How would the medical answers in these two situations compare in terms of quality? And how might they compare in terms of empathy?

To answer these questions, researchers collected 195 questions posed by anonymous users of an online social media site, along with the answers written by physicians who volunteer there. The same questions were later submitted to ChatGPT, and the chatbot's answers were collected.

A panel of three physicians or nurses then rated both sets of answers for quality and empathy. Panelists were asked which answer was better, and rated quality and empathy on five-point scales. The rating options for quality were: very poor, poor, acceptable, good, or very good. The rating options for empathy were: not empathetic, slightly empathetic, moderately empathetic, empathetic, and very empathetic.

What did the study find?

The results weren’t even close. For nearly 80% of answers, ChatGPT was considered better than the physicians.

  • Good or very good quality answers: ChatGPT received these ratings for 78% of responses, while physicians did for only 22% of responses.
  • Empathetic or very empathetic answers: ChatGPT scored 45% and physicians 4.6%.

Notably, physicians' answers were much shorter (52 words on average) than ChatGPT's (211 words on average).

Like I said, not even close. So, were all those breathless headlines appropriate after all?

Not so fast: Important limitations of this AI research

The study wasn’t designed to answer two key questions:

  • Do AI responses offer accurate medical information and improve patient health while avoiding confusion or harm?
  • Will patients accept the idea that questions they pose to their doctor might be answered by a bot?

And it had some serious limitations:

  • Evaluating and comparing answers: The evaluators applied untested, subjective criteria for quality and empathy. Importantly, they did not assess the actual accuracy of the answers. Nor were answers assessed for fabrication, a problem that has been noted with ChatGPT.
  • The difference in length of answers: More detailed answers might seem to reflect patience or concern. So, higher ratings for empathy might be related more to the number of words than to true empathy.
  • Incomplete blinding: To minimize bias, the evaluators weren’t supposed to know whether an answer came from a physician or ChatGPT. This is a common research technique called “blinding.” But AI-generated communication does not always sound exactly like a human, and the AI answers were significantly longer. So, it’s likely that for at least some answers, the evaluators were not blinded.

The bottom line

Could physicians learn something about expressions of empathy from AI-generated answers? Possibly. Might AI work well as a collaborative tool, generating responses that a physician reviews and revises? Actually, some medical systems already use AI in this way.

But it seems premature to rely on AI answers to patient questions without solid proof of their accuracy and actual supervision by healthcare professionals. This study wasn’t designed to provide either.

And by the way, ChatGPT agrees: I asked it if it could answer medical questions better than a doctor. Its answer was no.

We’ll need more research to know when it’s time to set the AI genie free to answer patients’ questions. We may not be there yet — but we’re getting closer.

Want more information about the research? Read the responses composed by doctors and by the chatbot, such as their answers to a person worried about the consequences of swallowing a toothpick.

About the Author

Robert H. Shmerling, MD, Senior Faculty Editor, Harvard Health Publishing; Editorial Advisory Board Member, Harvard Health Publishing

Dr. Robert H. Shmerling is the former clinical chief of the division of rheumatology at Beth Israel Deaconess Medical Center (BIDMC), and is a current member of the corresponding faculty in medicine at Harvard Medical School.
