Benchmarking clinical reasoning and accuracy of large language models on breast oncology multiple-choice questions
Recommended Citation
Odabashian R, Basta AS, Sidgal R, Chao A, Lin T, Alhassani W, Aboujaoude MT, Bryant III L, Dyson G, Soni S, Muthu P, Haider A, Jabbour Aida H, Arjyal L, Flaherty LE, Elayoubi J. Benchmarking clinical reasoning and accuracy of large language models on breast oncology multiple-choice questions. 2025; (16_suppl):e13637.
Document Type
Conference Proceeding
Publication Date
5-28-2025
Abstract
Background: Large language models (LLMs) such as GPT-4 (OpenAI) and Claude Opus (Anthropic) have shown high accuracy on medical multiple-choice exams, but data on their oncology-specific clinical reasoning and performance remain limited. This study evaluates their accuracy and clinical reasoning on breast oncology multiple-choice questions (MCQs) from the American Society of Clinical Oncology (ASCO) question bank.

Methods: Using the OpenAI and Anthropic Application Programming Interfaces (APIs), questions were tested without additional prompts under consistent settings (GPT-4 and Claude Opus; Temperature = 0, Tokens = Max). Each question was tested three times to assess precision. Chain-of-thought (COT) prompting was then applied to elicit stepwise reasoning from the LLMs and improve their accuracy. Accuracy before and after COT prompting was compared. Incorrect responses were reviewed by board-certified medical oncologists specializing in breast cancer, who scored reasoning clarity, bias, and clinical relevance. Qualitative feedback was analyzed descriptively.

Results: A total of 273 breast oncology MCQs were evaluated across the two LLMs. GPT-4 achieved an initial accuracy of 81.3% (222/273; 95% CI: 76.3%-85.5%), compared to Claude Opus, which achieved an accuracy of 79.5% (217/273; 95% CI: 74.3%-83.9%). A Chi-squared test comparing the models before COT prompting yielded a p-value of 0.59, indicating no statistically significant difference in accuracy between GPT-4 and Claude Opus at baseline. After COT prompting, the performance of the two models diverged. GPT-4 showed a net decline of one correct answer, resulting in an overall accuracy of 80.95% (221/273). In contrast, Claude Opus improved by a net of 19 additional correct answers, reaching an accuracy of 86.4% (236/273). A Chi-squared test of the post-COT results revealed a 5.5% difference in accuracy between the two models (p = 0.08), indicating that the improvement for Claude Opus after COT prompting was of borderline statistical significance compared with GPT-4. Thematic analysis of oncologists' feedback revealed that the most common reasons for incorrect answers were reliance on outdated guidelines, misinterpretation of clinical trial data, and failure to consider multidisciplinary or patient-specific approaches in clinical decision-making.

Conclusions: Although AI models can achieve high scores on multiple-choice exams, they still require human supervision. These models rely on potentially outdated training data and lack the ability to individualize patient care or apply clinical trial data appropriately, especially in unique or unconventional, non-textbook scenarios.
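The abstract does not include the evaluation harness itself. As a rough illustration of the Methods described above, the sketch below queries a single MCQ against both models through the OpenAI and Anthropic Python SDKs with temperature set to 0, repeats the call three times, and optionally prepends a chain-of-thought instruction. The model identifiers, prompt wording, and helper names (ask_gpt4, ask_claude_opus) are illustrative assumptions, not details taken from the study.

# Minimal sketch (not the study's actual harness): query one MCQ against GPT-4 and
# Claude Opus with deterministic settings, three repeats, with or without a
# chain-of-thought (COT) instruction. Model IDs and prompt text are assumptions.
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()        # reads OPENAI_API_KEY from the environment
anthropic_client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

COT_PREFIX = "Think through the problem step by step before giving the final answer letter.\n\n"

def ask_gpt4(question: str, use_cot: bool = False) -> str:
    prompt = (COT_PREFIX if use_cot else "") + question
    resp = openai_client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # "Temperature = 0" per the Methods
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def ask_claude_opus(question: str, use_cot: bool = False) -> str:
    prompt = (COT_PREFIX if use_cot else "") + question
    resp = anthropic_client.messages.create(
        model="claude-3-opus-20240229",  # assumed Opus snapshot; max_tokens is required
        max_tokens=4096,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

if __name__ == "__main__":
    mcq = "A 54-year-old woman with ... Which is the best next step?\nA) ...\nB) ...\nC) ...\nD) ..."
    # Each question was tested three times to assess precision of the responses.
    for run in range(3):
        print(f"Run {run + 1} GPT-4:       ", ask_gpt4(mcq))
        print(f"Run {run + 1} Claude Opus: ", ask_claude_opus(mcq, use_cot=True))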
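Likewise, the reported model comparisons can be checked against the stated counts with a standard two-by-two Chi-squared test. The snippet below is a minimal sketch of that calculation, assuming no continuity correction; the abstract does not specify which variant was used, but this choice reproduces the reported p-values of 0.59 and 0.08.

# Chi-squared comparison of correct/incorrect counts reported in the abstract.
# correction=False (no Yates correction) is an assumption, not stated in the study.
from scipy.stats import chi2_contingency

# Before COT prompting: GPT-4 222/273 correct vs Claude Opus 217/273 correct.
pre_cot = [[222, 273 - 222],
           [217, 273 - 217]]
# After COT prompting: GPT-4 221/273 correct vs Claude Opus 236/273 correct.
post_cot = [[221, 273 - 221],
            [236, 273 - 236]]

for label, table in [("pre-COT", pre_cot), ("post-COT", post_cot)]:
    chi2, p, _, _ = chi2_contingency(table, correction=False)
    print(f"{label}: chi2 = {chi2:.2f}, p = {p:.2f}")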
Issue
16_suppl
First Page
e13637
