Enumeration Versus Reasoning in Large Language Model AI Clinical Benchmarks
The Brodeur Science paper says AI beats doctors at clinical reasoning. The 89-vs-34% number measures something narrower.
The Brodeur et al. paper in Science last month claimed AI eclipses physician benchmarks of clinical reasoning. The most-quoted number is 89 percent versus 34 percent. What the paper actually measures is narrower. The methods version of this critique is now an eLetter on the Brodeur paper at Science. Full text below.
Three takeaways:
The rubric measures enumeration. The Grey Matters management rubric is largely additive across its line items (the example workup question awards 19 points across 22 items), with no penalty for excess or wrong tests. An AI that lists everything earns most of the points. A focused physician answer cannot reach as many items, even when the reasoning behind it is sharper. (A toy sketch of this scoring mechanic follows these takeaways.)
AI’s edge depends on the information level. In the paper’s one head-to-head experiment, AI put the correct diagnosis at the top of the differential 67% of the time at ED triage, versus 50–55% for two attending physicians. By admission, with the full workup available, o1’s lead over the better-performing physician is no longer statistically significant. Same patients, same model, same physicians; only the information changes.
Historical-comparator design inflates the gap. Five of the six experiments compare AI in 2024–2025 against physicians scored on different cases, by different graders, in earlier publications. The 55-percentage-point gap on Grey Matters cannot be cleanly attributed to model superiority over physicians.
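To make the first takeaway concrete, here is a toy sketch of an additive, no-penalty checklist scorer in Python. The item names and one-point weights are invented for illustration; this is not the Grey Matters rubric or the paper’s grading code. The only point is that when nothing subtracts, the longer list wins.

    # Toy illustration of an additive checklist rubric with no penalty for
    # extraneous items. Item names and weights are invented; this is not the
    # Grey Matters rubric or the paper's grading code.
    RUBRIC = {
        "cbc": 1, "cmp": 1, "tsh": 1, "hiv test": 1, "ct abdomen/pelvis": 1,
        "colonoscopy": 1, "depression screen": 1, "food insecurity screen": 1,
        # ...imagine ~22 such line items for an involuntary-weight-loss workup
    }

    def additive_score(answer_items):
        """Sum credit for every rubric item mentioned; nothing ever subtracts."""
        return sum(pts for item, pts in RUBRIC.items() if item in answer_items)

    focused = {"cbc", "cmp", "tsh", "ct abdomen/pelvis", "colonoscopy"}
    exhaustive = focused | {
        "hiv test", "depression screen", "food insecurity screen",
        # low-yield or inappropriate additions cost nothing under this rubric
        "bone marrow biopsy", "hla-b27 typing", "ehrlichia pcr",
    }

    print(additive_score(focused))     # 5
    print(additive_score(exhaustive))  # 8 -- the longer list wins; wrong tests are free

A rubric of this shape rewards coverage: any scorer that only ever adds points converts output length into score, which is the mechanism the eLetter below examines in detail.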
eLetter
Brodeur and colleagues report that an advanced large language model has eclipsed most benchmarks of clinical reasoning (1). The paper documents real progress, and the emergency department experiment is well-engineered. The authors themselves note in their Discussion that existing benchmarks “may therefore overstate the performance of AI models when using ‘messy’ data available in more realistic clinical workflows.” Hopkins and Cornelisse note in the accompanying Perspective (2) that “passing examinations is not the same as being a doctor, and demonstrating physician-level performance on authentic clinical tasks is a fundamentally harder challenge.”
We extend these acknowledgments with three specific concerns about how the most-cited number from the study (89% versus 34% on the Grey Matters management cases) is being interpreted as evidence that AI surpasses physicians in clinical reasoning. The first concern is structural, about how the rubrics score answers. The second comes from the paper’s own strongest experiment. The third concerns the reliance on historical rather than concurrent physician comparators.
First, the scoring rubrics share a structural feature that systematically advantages verbose, comprehensive answers. The Bond Score awards full credit for inclusion of the correct diagnosis anywhere in the differential, with the prompt explicitly stating no limit on differential length (Supplementary Text 1A and 1B). On the 101-case overlap with a prior physician study, o1-preview’s top-1 accuracy was 66.3% but its top-10 accuracy was 81.2% (Table S2), a 15-point gap that reflects enumeration value rather than diagnostic precision. The Grey Matters management rubric is largely additive across line items: each item earns a small fixed number of points if included, and no item subtracts from the total. Question 1 awards 19 points across 22 line items for the workup of involuntary weight loss, with no penalty for ordering excess or wrong tests (Supplementary Text 3B). CT of abdomen and pelvis receives the same 1 point as screening for food insecurity. (Question 3, on antibiotic management, is a noted exception with explicit 0-point options for clinically wrong management, but the additive pattern dominates the other questions in the example rubric.)
Table S5 case 3-2022 illustrates the resulting scoring problem: an o1-preview test plan listing more than 30 tests across 8 categories (including bone marrow biopsy, four specialty consults, Leptospira microscopic agglutination, Ehrlichia PCR, and HLA-B27 typing) for a case ultimately diagnosed as Crohn disease scored 1 (“helpful”) on the rubric, just one step down from the full score of 2 (“exactly right”). The excess items piled on top did not drop the score below “helpful.” The rubric distinguishes “helpful” from “exactly right” (matching the actual case plan), but it provides no mechanism to penalize inappropriate items added alongside reasonable ones. o1-preview generates substantially longer outputs than the focused physician answers used as comparators, and a focused physician answer cannot reach as many rubric items as comprehensive enumeration, even when the reasoning behind it is sharper.
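The same dynamic can be sketched for inclusion-anywhere diagnostic credit. The following is a hypothetical illustration, not the Bond Score implementation, with invented case lists: when the differential has no length limit, credit for listing the correct diagnosis anywhere behaves like top-k accuracy with unbounded k, so longer differentials are rewarded regardless of how well they are ranked.

    # Hypothetical sketch of "correct diagnosis anywhere in the differential"
    # credit; not the Bond Score implementation. Diagnoses are invented.
    def top_k_accuracy(cases, k):
        """Fraction of cases whose true diagnosis appears in the first k entries."""
        hits = sum(1 for ranked, truth in cases if truth in ranked[:k])
        return hits / len(cases)

    # Two toy cases: a short, focused differential and a long, enumerated one.
    cases = [
        (["crohn disease", "celiac disease", "lymphoma"], "crohn disease"),
        (["tuberculosis", "sarcoidosis", "lymphoma", "whipple disease",
          "amyloidosis", "behcet disease", "crohn disease"], "crohn disease"),
    ]

    print(top_k_accuracy(cases, 1))   # 0.5 -- only the focused list ranks it first
    print(top_k_accuracy(cases, 10))  # 1.0 -- the enumerated list still earns full credit

This is why the 15-point spread between top-1 and top-10 accuracy in Table S2 reads as enumeration value rather than diagnostic precision.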
The pattern across the paper’s six experiments is consistent with the rubric-bias hypothesis. Where the metric measures focused clinical judgment, AI’s advantage is small or absent: cannot-miss diagnoses (Figure 3B, where o1-preview was not significantly higher than physicians, GPT-4, or residents), Landmark diagnostic reasoning (Figure 4B, where o1-preview was not significantly different from physicians with conventional resources, p=0.055, or with GPT-4 access, p=0.076; the Landmark experiment is also the only one whose cases were never publicly released, specifically to prevent memorization), and probabilistic reasoning (Table S6, where all groups overestimate the likelihood of low-prevalence conditions). Where the metric rewards comprehensive enumeration most heavily (the Grey Matters additive checklist), AI’s advantage is largest. The cross-experiment gradient is itself evidence that the rubric type is doing more work than the abstract framing acknowledges.
Second, the paper’s own strongest experiment shows an information gradient that is more informative than the headline number. The 76-case ED study (Figure 5) tested o1 against two attending physicians on real emergency department patients at three points in their evaluation: triage (chief complaint, vitals, and nursing note only), ED evaluation (after the initial physician encounter and labs), and admission (full workup). At triage, o1 placed the correct diagnosis at the top of the differential 67.1% of the time, compared with 55.3% and 50.0% for the two physicians, a 12- to 17-percentage-point gap. By admission, the rates were 81.6% (o1), 78.9% (physician 1), and 69.7% (physician 2). The difference between o1 and physician 1 at admission is no longer statistically significant.
Same patients, same model, same physicians. The only thing that changes across the three time points is the amount of clinical information available. The pattern shows AI’s advantage concentrating where information is sparse and shrinking as information accumulates. This is the strongest piece of work in the paper, and it tells a different story than the headline. Even though AI shows an advantage at triage, in real ED practice no clinical decision turns on the triage differential alone; the next steps (orders, examination, response to therapy) are exactly where the information gradient erodes the AI’s edge. The within-experiment gradient is also consistent with the cross-experiment pattern noted above: rubrics that reward enumeration favor AI most when there is little focused information to constrain the differential.
Third, five of the six experiments use historical physician comparators rather than concurrent ones. The Grey Matters numbers come from a 2025 publication (3); the Landmark cases from a 2024 publication (4); the NEJM Healer comparison from a 2024 publication (5); and the probabilistic reasoning vignettes from a 2021 publication (6). The Brodeur runs spanned September 2024 to August 2025, with reasoning effort set to high and the maximum 65,536-token output budget. Historical-control comparisons in any field carry confounders that head-to-head designs eliminate: era effects, scorer drift between graders, and selection in the original cohort. Whatever the magnitude of the apparent advantage, it cannot be cleanly attributed to model superiority over physicians. The 55-percentage-point gap on Grey Matters is the most striking example.
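To illustrate the comparator problem, consider a toy simulation with invented numbers rather than the paper’s data: if the historical physician cohort faced a harder case mix or stricter grading than the contemporary AI run, a large apparent gap emerges even when underlying ability is identical.

    # Toy simulation of a historical-comparator confound. All numbers are
    # invented; this is not a model of the actual studies.
    import random

    random.seed(0)

    def simulate(true_skill, handicap, n=1000):
        """Fraction of n cases solved when case mix or grading imposes a handicap."""
        return sum(random.random() < true_skill - handicap for _ in range(n)) / n

    SKILL = 0.75  # identical latent ability for both groups in this toy model

    ai_score = simulate(SKILL, handicap=0.05)         # contemporary run, easier conditions
    physician_score = simulate(SKILL, handicap=0.25)  # historical cohort, harder conditions

    print(f"AI {ai_score:.0%} vs physicians {physician_score:.0%}")
    # Roughly "AI 70% vs physicians 50%": a ~20-point gap produced entirely by
    # the comparison design, not by any difference in skill.

A concurrent head-to-head design removes this class of confound by holding cases, graders, and era fixed, which is exactly what the ED experiment does.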
These three observations do not invalidate the paper. They locate where its conclusions are most defensible (the ED experiment, with concurrent comparators and blinded scoring, examined across the information gradient) and where they are most strained (the historical-control comparisons that are now circulating as evidence of AI superiority over physicians). The paper itself is more cautious than the secondary coverage suggests: the limitations section acknowledges that performance gains were not robust on the cannot-miss diagnoses or the Landmark cases.
As the field considers the implications for clinical practice, magnitude claims should be matched to the strongest version of the evidence. On contemporaneous head-to-head testing in real ED data, o1 was narrowly better than two attending physicians where information was sparse and statistically indistinguishable from the better-performing physician at admission, when the full workup was available. That is meaningful and worth confirming in larger trials. It is not the gap implied by the 89%-versus-34% framing now in circulation, and it is not what the paper’s own internal evidence supports.
See the video breakdown on YouTube
References
1. Brodeur PG, Buckley TA, Kanjee Z, et al. Performance of a large language model on the reasoning tasks of a physician. Science 392, 524 (2026).
2. Hopkins AM, Cornelisse E. AI can reason like a physician—what comes next? Science 392, 466-467 (2026). doi: 10.1126/science.aeg8766
3. Goh E, Gallo RJ, Strong E, et al. GPT-4 assistance for improvement of physician performance on patient care tasks: A randomized controlled trial. Nat Med 31, 1233-1238 (2025).
4. Goh E, Gallo R, Hom J, et al. Large language model influence on diagnostic reasoning: A randomized clinical trial. JAMA Netw Open 7, e2440969 (2024).
5. Cabral S, Restrepo D, Kanjee Z, et al. Clinical reasoning of a generative artificial intelligence model compared with physicians. JAMA Intern Med 184, 581-583 (2024).
6. Morgan DJ, Pineles L, Owczarzak J, et al. Accuracy of practitioner estimates of probability of diagnosis before and after testing. JAMA Intern Med 181, 747-755 (2021).



