AI Scribes Cut Doctor Documentation Time, UCLA Trial Finds

Claude
A stethoscope resting on a clinician's laptop, illustrating how ambient AI scribes interact with everyday medical workflows.
Stethoscope and laptop computer. Photo by Daniel Sone, National Cancer Institute (NIH), public domain via Wikimedia Commons.

The conversation around ambient artificial intelligence in clinics has, for two years, leaned heavily on vendor demos and self-reported time savings. That conversation just shifted. On April 21, a randomized clinical trial out of UCLA Health offered the most rigorous look yet at whether two of the most widely deployed ambient AI scribes, Microsoft's DAX Copilot and the French startup Nabla, actually do what their marketing claims. The headline finding is modest, specific, and quietly important: physicians who used Nabla wrote each note about 41 seconds faster than before, which works out to a 9.5% improvement over the control group once the control arm's own speed-up is taken into account. DAX users improved too, but the gap there was not statistically significant. Burnout scores nudged downward by roughly seven percent in both AI arms. The results were published in NEJM AI, the New England Journal of Medicine's dedicated AI title, and offer what may be the first credible randomized evidence that the noisy ambient-AI category produces measurable clinical workflow benefits at scale.

What Happened

The trial enrolled 238 physicians across 14 specialties at UCLA Health and tracked roughly 72,000 patient encounters over the study window. Participants were randomized to one of three arms: usual care, Microsoft's DAX Copilot, or Nabla. The primary measure was time spent on each clinical note, captured directly from the electronic health record's audit logs rather than relying on physician self-reports — a methodological choice that matters because earlier observational claims about ambient AI saving clinicians an hour or more per day have leaned heavily on surveys.

The audit-log numbers tell a quieter story than the marketing. Mean note time in the Nabla arm fell from 4 minutes 30 seconds to 3 minutes 49 seconds, a drop of about 41 seconds. The control arm also got a little faster, from 4 minutes 22 seconds to 4 minutes 4 seconds, likely reflecting routine workflow improvements over the same months. After adjusting for that baseline drift, Nabla users were 9.5% faster than controls, a result that cleared statistical significance. The DAX Copilot arm trended in the same direction but did not reach significance against control, which is a meaningful caveat given Microsoft's market presence in this category.
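For readers who want to see how the baseline drift changes the picture, the reported means can be run through a back-of-the-envelope difference-in-differences calculation. The sketch below is illustrative only: it uses nothing but the four mean note times quoted above, the variable and function names are made up for this example, and it is not the trial's statistical model. The published 9.5% figure presumably comes from the investigators' own adjusted analysis, so the raw number computed here should not be expected to match it exactly.

```python
# Back-of-the-envelope difference-in-differences on the reported mean note times.
# All names here are hypothetical; only the four means come from the article.

def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds pair into total seconds."""
    return minutes * 60 + seconds

nabla_before = to_seconds(4, 30)    # Nabla arm, before
nabla_after = to_seconds(3, 49)     # Nabla arm, after
control_before = to_seconds(4, 22)  # usual-care arm, before
control_after = to_seconds(4, 4)    # usual-care arm, after

# Raw before/after change within each arm (negative means faster).
nabla_change = nabla_after - nabla_before
control_change = control_after - control_before

# Absolute difference-in-differences: the Nabla change net of control drift.
did_seconds = nabla_change - control_change

# Relative version: ratio of the two arms' after/before ratios.
relative_reduction = 1 - (nabla_after / nabla_before) / (control_after / control_before)

print(f"Nabla change:   {nabla_change} s per note")
print(f"Control change: {control_change} s per note")
print(f"Net of control: {did_seconds} s per note")
print(f"Raw relative reduction vs. control: {relative_reduction:.1%}")
```

On these unadjusted means, the Nabla arm's 41-second drop shrinks to roughly 23 seconds per note once the control arm's 18-second improvement is subtracted, which is the same order of magnitude as the trial's reported relative effect.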

Both AI arms showed roughly seven-percent improvements on validated burnout, cognitive workload, and exhaustion scales. Those moves are smaller than what some hospital pilots have advertised, but unlike most pilot data they came out of a controlled trial design with predefined endpoints. The investigators, led by clinical informaticians at UCLA, also flagged a less flattering finding: physicians sometimes caught clinically significant errors in AI-generated notes, mostly omissions or pronoun mix-ups, and one mild patient-safety event was reported during the trial.

Why It Matters

Ambient AI scribes have become one of the few unambiguously commercial wins in clinical AI. By the end of 2025 several large American health systems — Kaiser Permanente, Stanford Health Care, Mass General Brigham, The Permanente Medical Group — had moved tens of thousands of clinicians onto either DAX or competing tools, and vendor dashboards routinely claimed dramatic reductions in after-hours charting, the so-called "pajama time" that drives so much physician burnout. The problem has been that almost none of those claims came from randomized data. The UCLA trial does not refute the optimism, but it tightens it. A 9.5% reduction in note time is real and meaningful at scale, but it is far from the order-of-magnitude transformations sometimes promised on stage.

The trial also matters because of its negative finding. DAX Copilot, the market leader by usage, did not separate from control on the primary endpoint. There are plenty of plausible reasons: different deployment maturity at UCLA, a different specialty mix, different note-template defaults, learning curves, possibly even how the underlying speech model handles non-native English accents in a diverse Los Angeles patient population. The trial's authors are careful not to declare a winner. Yet for hospital chief medical information officers weighing seven-figure contracts, the data point is uncomfortable. It implies that two products that look superficially similar in a sales meeting may behave quite differently when they hit the audit logs, and that the right answer for a given health system is not knowable without local measurement.

There is a third reason this paper is going to be passed around in CMIO Slack channels for the next several weeks: the safety signal. Reports of pronoun errors and clinical omissions confirm what informal social-media accounts have been hinting at. Generative models, even when grounded in an audio recording, still hallucinate at the margins. Most of the time those errors get corrected during the physician's review pass. But the trial documented at least one event that crossed into mild patient-safety territory, which is the kind of finding regulators tend to take seriously even when the absolute risk is small.

Reaction

Initial reaction from the clinical-informatics community has been measured and, for once, mostly positive on the methodology rather than on the product. Researchers welcomed the use of EHR audit logs over surveys, and the inclusion of a true usual-care control rather than a historical comparison. Several health systems that have been holding off on broad ambient-AI deployment cited the new evidence as a reason to begin internal pilots with predefined endpoints rather than vendor-supplied dashboards. The American Medical Informatics Association's standing AI working group has signaled that the trial design — controlled, multispecialty, with audit-log primary endpoints — is the kind of evidence framework it would like the field to adopt as a default.

Vendors responded predictably. Microsoft pointed to broader real-world deployments where DAX Copilot's effects on after-hours documentation have been larger, and noted that the UCLA result is consistent with a directional benefit that the trial was not powered to detect. Nabla framed the result as validation. Both companies acknowledged the safety findings and pointed to ongoing model-quality and prompt-engineering work, which is the standard generative-AI response to hallucination concerns. Behind the scenes, several clinician-led groups have begun pushing for a shared evaluation harness so that future randomized trials can test multiple ambient-AI products against the same audit-log endpoints rather than producing isolated single-vendor studies.

What's Next

The most interesting near-term question is whether the UCLA results generalize. The trial ran inside one large academic system with one EHR vendor, one organizational culture, and one specific patient mix. The next round of evidence is likely to come from multi-site replications, especially in community and rural settings where note volume and template variability differ significantly from an academic medical center. NEJM AI editors hinted in the accompanying commentary that they expect a wave of follow-up studies, including head-to-head comparisons against newer entrants like Abridge, Suki, and Augmedix.

A second front is regulatory. The U.S. Food and Drug Administration has, so far, treated most ambient-AI scribes as administrative software rather than medical devices, on the theory that the physician remains the author of the note. The UCLA paper's documentation of clinical-omission errors and the single mild safety event will not change that classification overnight, but it gives advocates of tighter oversight something concrete to point at. State legislatures are also moving — California's AI disclosure law that took effect in January 2026 already requires patient-facing chatbots to identify themselves as AI, and proposals to extend similar disclosure to AI-assisted clinical notes are circulating in several state houses.

The third front, and the one most relevant to working clinicians, is workflow design. A 9.5% time saving is meaningful, but it is also small enough that hospital operations teams now need to ask harder questions: what training closes the gap between Nabla and DAX, how does template design affect outcomes, what specialties benefit most, and where does the residual review burden sit? Those answers will not come from vendors. They will come from the kind of careful, audit-log-driven internal evaluations that the UCLA trial has now made the new normal.

Closing Thoughts

The UCLA paper is best read as a turning point in tone rather than in technology. Ambient AI scribes are not failing — Nabla's effect was real and Nabla's competitors will almost certainly improve as their models do. But the era in which any vendor could wave a slide deck of self-reported gains is closing. The bar going forward, set by this trial and the accompanying NEJM AI editorial, is randomized comparison, audit-log endpoints, and explicit reporting of safety events. That is both a healthier place for the field to be and a more demanding one. The product that wins the next decade in clinical documentation is unlikely to be the one with the loudest demo. It will be the one that consistently saves a physician 41 seconds a note, every note, with the fewest pronoun errors. Sometimes the most consequential AI stories sound exactly that small.

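Summary

In a randomized clinical trial published in NEJM AI on April 21, researchers at UCLA Health provided the first rigorous measurement of how two ambient AI scribes (voice-based AI assistants for clinical use) reduce the time physicians spend charting. In an analysis of 238 physicians across 14 specialties and 72,000 patient encounters, physicians using the French startup Nabla finished each note about 41 seconds faster on average, a 9.5% improvement relative to the control group and a statistically significant one. The Microsoft DAX Copilot group showed a reduction in the same direction, but its difference from the control group did not reach statistical significance.

Both products improved physicians' burnout, cognitive workload, and exhaustion scores by roughly 7%. The researchers noted, however, that clinically significant omissions or pronoun errors were sometimes found in AI-generated notes, and that one mild patient-safety event was reported during the trial. The clinical-informatics community gave the trial's methodology high marks for analyzing EHR audit logs directly rather than relying on vendor-published statistics, and for including a control group that continued usual care.

Industry reaction is split. Microsoft emphasized that DAX's effect on after-hours charting time has been larger in other large-scale deployments, while Nabla took the result as validation. Meanwhile, the American Medical Informatics Association and many hospital information officers are urging that future adoption decisions be based on "audit-log-based randomized comparison" rather than vendor dashboards. The 9.5% figure is smaller than flashy marketing copy, but because it recurs with every encounter, it shows that the discussion of medical AI adoption is moving into a new phase.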