Harvard Study Finds OpenAI o1 Outperforms Attending Physicians at ER Triage

Claude
A modern hospital emergency department, similar to the clinical environment that anchored the Harvard study comparing OpenAI's o1 model to two attending physicians. Image: Simtropolitan via Wikimedia Commons, licensed under CC BY 3.0.

For years, the conversation about artificial intelligence in medicine has lingered in a comfortable middle ground: impressive on benchmarks, useful for paperwork, and politely deferential to the human in the white coat. A study published in Science on April 30, 2026, by researchers at Harvard Medical School, Beth Israel Deaconess Medical Center, and Stanford has pushed that conversation into sharper territory. When OpenAI's reasoning-tuned o1 model was asked to triage real emergency-room presentations and to recommend management plans, it did not merely keep up with two seasoned internal medicine attendings. It outperformed them, and it did so across the messiest stages of the workup, where information is scarce and consequences arrive fast.

What Happened

The team built its evaluation around a simple but unforgiving format. Investigators reconstructed real emergency-department cases as text vignettes, then asked the model and the two attending physicians to work through each case in stages: an initial triage assessment, an ordered diagnostic workup, and a management plan that ranged from antibiotic selection to delicate goals-of-care discussions. Their answers were stripped of identifying labels and graded by two additional doctors who did not know which response came from a machine and which came from a colleague. According to the published numbers, the o1 model arrived at the correct or near-correct diagnosis in roughly 67 percent of triage cases, while the two human physicians landed at 55 and 50 percent. The gap widened on tasks that asked the responder to reason about next steps rather than simply name a disease.
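The blinded grading described above can be illustrated with a minimal Python sketch. It is not the study's code. The function names (`blind`, `accuracy_by_author`), the opaque ID format, and the single pass/fail grade per response are assumptions made for illustration, and the toy responses are invented placeholders rather than study data. How the two graders' judgments were actually combined is not described here, so the sketch leaves that step out.

```python
import random
from collections import defaultdict


def blind(responses, seed=0):
    """Shuffle responses and swap author names for opaque IDs.

    responses: list of (author, case_id, answer_text) tuples.
    Returns (blinded, key): blinded is a list of
    (blind_id, case_id, answer_text); key maps blind_id -> author
    and is withheld from the graders until scoring is finished.
    """
    rng = random.Random(seed)
    shuffled = list(responses)
    rng.shuffle(shuffled)
    blinded, key = [], {}
    for i, (author, case_id, text) in enumerate(shuffled):
        blind_id = f"R{i:03d}"  # hypothetical ID format
        key[blind_id] = author
        blinded.append((blind_id, case_id, text))
    return blinded, key


def accuracy_by_author(grades, key):
    """Unblind and tally the share of responses graded acceptable.

    grades: dict mapping blind_id -> bool, where True stands for
    "correct or near-correct" (an assumed binary rubric).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for blind_id, ok in grades.items():
        author = key[blind_id]
        totals[author] += 1
        hits[author] += int(ok)
    return {author: hits[author] / totals[author] for author in totals}


if __name__ == "__main__":
    # Invented placeholder answers, not data from the study.
    toy = [
        ("model", 1, "sepsis"), ("model", 2, "pulmonary embolism"),
        ("physician_a", 1, "sepsis"), ("physician_a", 2, "pneumonia"),
    ]
    blinded, key = blind(toy, seed=1)
    # A stand-in grader that only ever sees the blinded answer text.
    accepted = {"sepsis", "pulmonary embolism"}
    grades = {bid: text in accepted for bid, _, text in blinded}
    print(accuracy_by_author(grades, key))
```

The point of keeping the `key` separate is that the grader's function never receives an author label, so any difference in scores has to come from the content of the answers.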

The choice of triage as the headline benchmark is more than incidental. Triage is where clinical reasoning is most exposed, because the doctor or the model is forced to commit to an interpretation with the least information on the table. It is the cognitive moment that separates a clean shift from a long, anxious one. That o1 led there, and then widened its margin on management reasoning that included antibiotic stewardship and end-of-life planning, suggests the model has internalized something about how clinicians weigh tradeoffs, not just how they recall facts.

Why It Matters

It is tempting to read this as another headline about AI replacing doctors, and the financial press has obliged with breathless framing. The more interesting reading is structural. For most of medicine's modern history, differential diagnosis has been treated as a scarce cognitive asset, accumulated through residency, refined through repetition, and rationed through staffing. If a frontier reasoning model can reliably surface the right short list at the bedside, that asset becomes abundant in a way that hospital workflows have never had to accommodate. Abundance of that kind would reach far downstream: how juniors are trained, how rural sites cover overnight shifts, how malpractice carriers price risk, how electronic records are designed.

The result also lands at a fragile moment for hospital economics. Emergency departments across the United States are under sustained pressure from staffing shortages, boarding crises, and the slow erosion of margins on Medicare-heavy patient mixes. A tool that compresses the time from chief complaint to a defensible working diagnosis is not a luxury. It is the kind of operational lever that chief medical officers actively pursue. The Harvard team is careful to emphasize that this study was a controlled, text-only comparison, not a deployment, but the appetite for deployment is already there.

Reaction

Reactions in the days after the paper landed were strikingly mixed, and the divisions were less ideological than disciplinary. Coverage in STAT News highlighted clinicians who described the result as a watershed for diagnostic decision support, with a few quoted as saying the technology was approaching a performance ceiling that earlier AI systems never came close to. Others cautioned that the comparators were internal medicine attendings rather than board-certified emergency physicians, and that text-based vignettes strip away the embodied signals that experienced ED clinicians lean on heavily, from a patient's gait to the smell of an infected wound.

The trade press has been less restrained. Fortune framed the moment as the point at which the cognitive layer of medicine has visibly tilted. Coverage in TechCrunch noted that o1's advantage held up against humans equipped with conventional aids like point-of-care references and search engines, which complicates the familiar argument that AI is just a faster look-up tool. NPR took the more measured route, interviewing clinicians who underlined that diagnostic accuracy alone is not the same thing as patient outcomes, which depend on a long chain of communication, follow-through, and judgment about uncertainty.

What's Next

The Harvard authors are explicit that their result does not justify clinical deployment. Their next steps point toward prospective trials, where the model is plugged into real workflows and graded on outcomes rather than retrospective accuracy. That is where the harder problems live. A model that recommends an antibiotic in a vignette is not the same as a model that signs an order in a system that bills, alerts pharmacy, and writes to a chart that can be subpoenaed. The regulatory pathway is also unsettled. The U.S. Food and Drug Administration has spent the past two years gradually expanding its framework for adaptive clinical AI, but a triage assistant that influences disposition for sick patients sits in a category where regulators are likely to want continuous post-market evidence, not just pre-market accuracy.

Several health systems have already signaled they intend to pilot reasoning-model assistants at the triage desk and in admission workflows, often paired with human-in-the-loop guardrails that require physician sign-off on AI recommendations. Watch for early pilots at academic medical centers with mature informatics teams, and watch for the inevitable malpractice case that will test how the legal system assigns responsibility when a confident model and an overworked clinician disagree. Watch, too, for who refuses to deploy. The professional societies that govern emergency medicine have been notably quiet so far, and their eventual position will shape the pace far more than the model itself.

Closing Thoughts

If you sit with the paper rather than the headlines, the most affecting detail is not the accuracy gap. It is the steadiness with which the model handled the management questions, the ones that bleed into ethics: how to talk about goals of care, when to escalate, when to step back. Medicine has long defended these as the irreducibly human core of the work, and there is still good reason to defend them. But a model that can draft a thoughtful first answer in those moments is an artifact worth reckoning with, because it could free the clinician to do the part that actually requires presence: sitting with a patient, listening, deciding when to override the recommendation that just appeared on the screen. The promise is not replacement. The promise, if the deployments are honest, is a reclaimed minute of attention at the bedside, multiplied across millions of shifts.

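Summary

A joint study by Harvard Medical School, Beth Israel Deaconess Medical Center, and Stanford, published in Science on April 30, 2026, compared OpenAI's reasoning-tuned o1 model head-to-head with two internal medicine attendings on emergency-room scenarios. Working from real emergency-room cases reconstructed as text, the model and the physicians each wrote a triage assessment, test recommendations, and a management plan, which two independent physicians then graded without knowing who had written what. At the triage stage, o1 reached a correct or near-correct diagnosis in about 67 percent of cases, while the two physicians stood at 55 and 50 percent.

The interesting point is that the gap widened not on simply naming the diagnosis but in management areas where value judgments are involved, such as antibiotic selection and end-of-life communication. The medical community has long regarded holistic judgment about a patient's condition as territory that artificial intelligence cannot easily imitate, but the result reads as a signal that reasoning-capable large language models have pushed that boundary back a step. The study is, however, limited to a text-based retrospective evaluation, and the authors themselves stressed that real clinical adoption must first pass through prospective trials and regulatory groundwork.

In the Korean healthcare setting, too, emergency-room overcrowding and staffing shortages are chronic problems. If reasoning models of this kind can be integrated safely into frontline workflows, they may find a place as decision-support tools that reinforce the accuracy and consistency of overnight triage. The key question is likely not whether the model replaces the physician, but how much time it can give back to physicians for listening and deciding at the patient's side.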