
This study, published in the Annals of Internal Medicine, tested an AI system in real-world digital urgent care visits. Patients used an app to begin their appointments, and the AI got the first crack at the case. Let’s unpack what this AI did, how it was evaluated, and why those results may not be the slam dunk they seem.
What AI Did
The AI conducted “a structured dynamic interview,” asking questions about symptoms and medical history and reviewing the patient’s electronic medical record for relevant additional information.
That information was used to generate a patient summary and “potential medication prescriptions, laboratory test orders, and referrals” for the supervising physician, who then conducted a video consultation to determine the final diagnosis and treatment. In essence, the AI performed triage, much like the website chatbots and phone trees we encounter when calling for assistance in other settings.
How the AI Knows When to Speak Up
The AI system uses a "confidence-based" approach to decide when to offer advice. If it is unsure whether a patient needs to go to the emergency department, it simply suggests that doctors “consider” a referral without pushing other recommendations, leaving the doctor in control in uncertain cases.
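To make that gating concrete, here is a minimal, purely illustrative Python sketch. The study does not publish the vendor's code, its confidence values, or its data structures; the names and the 0.8 cutoff below are assumptions, not the actual implementation.

```python
# Purely illustrative: the threshold and field names are assumptions.
from dataclasses import dataclass, field

CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff for "high confidence"


@dataclass
class Suggestion:
    text: str
    confidence: float  # model's self-assessed certainty, 0.0 to 1.0


@dataclass
class DraftPlan:
    ed_referral: Suggestion  # e.g., "refer to the emergency department"
    other_items: list[Suggestion] = field(default_factory=list)  # meds, labs, referrals


def gate_for_physician(plan: DraftPlan) -> list[str]:
    """Decide what the supervising physician actually sees."""
    if plan.ed_referral.confidence < CONFIDENCE_THRESHOLD:
        # Uncertain case: hedge with a "consider" note and withhold the rest,
        # leaving the decision entirely to the physician.
        return [f"Consider: {plan.ed_referral.text}"]

    # Confident case: pass along the full set of high-confidence suggestions.
    shown = [f"Suggested: {plan.ed_referral.text}"]
    shown += [f"Suggested: {s.text}" for s in plan.other_items
              if s.confidence >= CONFIDENCE_THRESHOLD]
    return shown
```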
Measuring Recommendations
To evaluate the AI’s performance, researchers looked at one month of urgent care visits for common complaints. Each case was labeled concordant only if the physician fully agreed with the AI’s plan; the diagnosis, prescriptions, tests, and referrals all had to match exactly. A mismatch on any of those points made the case non-concordant. All treatment plans were judged on a 4-point scale: optimal (guideline-adherent), acceptable, insufficient, or potentially harmful. [1]
All non-concordant cases, plus a random sample of concordant ones, were reviewed by two experienced physicians whose ratings provided the study’s outcome measure. When the two physicians disagreed by more than 1 point, e.g., optimal vs. insufficient, a third physician adjudicator was asked to weigh in.
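As a rough sketch of those two rules (the field names and rating encoding here are my assumptions; the study does not publish its scoring code):

```python
# Illustrative only: encodes the concordance and adjudication rules described above.
RATING = {"optimal": 1, "acceptable": 2, "insufficient": 3, "potentially harmful": 4}

PLAN_FIELDS = ("diagnosis", "prescriptions", "tests", "referrals")


def is_concordant(ai_plan: dict, physician_plan: dict) -> bool:
    """Concordant only when every component of the plan matches exactly."""
    return all(ai_plan.get(f) == physician_plan.get(f) for f in PLAN_FIELDS)


def needs_third_reviewer(rating_1: str, rating_2: str) -> bool:
    """A third physician adjudicates when the two reviews differ by more than 1 point."""
    return abs(RATING[rating_1] - RATING[rating_2]) > 1


# 'optimal' vs. 'insufficient' is a 2-point gap, so it would go to the adjudicator.
assert needs_third_reviewer("optimal", "insufficient")
assert not needs_third_reviewer("optimal", "acceptable")
```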
So, how did the AI stack up? Here’s what researchers found when they dug into 461 virtual urgent care visits. The AI looks like a digital overachiever, but let’s walk through the findings first, then take a closer look at some behind-the-scenes study-design choices that raise important questions.
Findings
A total of 461 virtual visits during the study period met the inclusion criteria. Patients had a mean age of 45.3 years, were predominantly female (70.2%), and presented mostly with respiratory symptoms (65.3%), followed by urinary (20.4%), vaginal, eye, and dental complaints. They were cared for by 18 physicians, all with at least 2 years of post-residency experience.
Physicians’ decisions and AI recommendations were concordant 57% of the time. Of the 199 virtual visits where AI and physicians disagreed, nearly 60% required a third adjudicator to weigh in.
- In over half of cases, both the AI and the physicians made top-rated (optimal) decisions.
- Reviewers judged the AI and doctors as equally good in about two-thirds of all cases.
- The AI was rated better than physicians in about 1 in 5 cases and worse in about 1 in 10.
- AI recommendations were rated optimal more often than physicians’ decisions (roughly 3 out of 4 times vs. 2 out of 3).
- Harmful decisions were rare for both—but the AI had fewer.
- AI’s performance varied by presenting symptoms, e.g., outperforming doctors in 40% of urinary cases.
- In cases where AI and doctors disagreed and at least one made a questionable call, the AI was judged better 64% of the time. Common physician errors included skipping important tests or referrals, treating without justification, or not following guidelines. Conversely, when physicians were better, they often caught details the AI missed, such as changes in patient history, subtle exam findings, or the chance to avoid an unnecessary ER visit.
Let’s Talk Fine Print
Nearly 60% of the visits where AI and doctors disagreed required not two but three physicians to resolve the case. That alone says something about how murky medical decision-making can be, even before adding AI to the mix.
“Thoughtful integration of AI into clinical practice, combining its strengths with those of physicians, could improve the quality of care.”
While that is a reasonable conclusion based on the findings, all studies have caveats and limitations, and several here speak to the “strengths” and “integration” of the deployed AI.
Sometimes, guidelines aren’t crystal clear, patients’ symptoms don’t fit standard checklists, or there’s conflicting information about side effects and underlying conditions. Decisions can hinge on a single symptom mentioned halfway through an appointment or a past allergy that only surfaces in conversation. This level of uncertainty reminds us that any tool, no matter how advanced, must contend with the inherent human challenges of diagnosing and treating people in all their complexity.
The researchers only included cases where the AI had high confidence in its recommendations, which conveniently excluded any medical visits where the AI might have struggled or given poor advice.
Adding to this lopsided comparison, we don’t even know whether doctors looked at the AI’s advice: they had to scroll down to see it, and the study didn’t track whether they did. So we cannot say whether physicians saw, let alone “integrated,” those recommendations into their thinking. Meanwhile, the study’s funder, the company providing the AI, was involved in “case adjudication.” Further, reviewers weren’t blinded to whether a recommendation came from a human or a machine, since the AI’s suggestions were neatly structured while the physicians’ were a mixture of structured and unstructured guidance. Finally, we have no outcome data, so the quality of the recommendations rests on adherence to guidelines rather than on how well the care was personalized.
The AI looked impressive—matching or beating physicians in many cases. But the game wasn't entirely fair: only confident AI calls were included, doctors may not have even seen the AI's suggestions, and the adjudicators weren’t blind to which answers came from machine vs. human. Nearly 60% of disagreements needed a third expert to break the tie, a reminder that real-life medicine remains more art than algorithm. Bottom line? AI shows promise, but this study reports on the dress rehearsal, not opening night.
[1] Deviations from guidelines were judged in a clinical context—for example, travel plans might justify an antibiotic for borderline sinusitis.