Benchmarking clinical reasoning and accuracy of large language models on breast oncology multiple-choice questions

Document Type

Conference Proceeding

Publication Date

5-28-2025

Publication Title

J Clin Oncol

Keywords

alanine aminotransferase, accuracy, benchmarking, borderline state, breast cancer, chain of thought prompting, clinical decision making, clinical practice guideline, clinical reasoning, clinical significance, clinical trial, conference abstract, diagnosis, diagnostic test accuracy study, human, large language model, major clinical study, medical oncologist, multiple choice test, oncologist, patient care, reasoning, thematic analysis

Abstract

Background: Large language models (LLMs) such as GPT-4 (OpenAI) and Claude Opus (Anthropic) have shown high accuracy on medical multiple-choice exams, but data on their oncology-specific clinical reasoning and performance are limited. This study evaluates their accuracy and clinical reasoning on breast oncology multiple-choice questions (MCQs) from the American Society of Clinical Oncology (ASCO) question bank. Methods: Using the OpenAI and Anthropic application programming interfaces (APIs), questions were tested without additional prompts under consistent settings (GPT-4 and Claude Opus; Temperature = 0, Tokens = Max). Each question was tested three times to assess precision. Chain-of-thought (COT) prompting was then applied to promote stepwise reasoning by the LLMs and increase their accuracy. Accuracy before and after COT prompting was compared. Incorrect responses were reviewed by board-certified medical oncologists specialized in breast cancer, who scored reasoning clarity, bias, and clinical relevance. Qualitative feedback was analyzed descriptively. Results: A total of 273 breast oncology MCQs were evaluated across the two LLMs. GPT-4 achieved an initial accuracy of 81.3% (222/273; 95% CI: 76.3%-85.5%), compared with 79.5% (217/273; 95% CI: 74.3%-83.9%) for Claude Opus. A chi-squared test for the difference between the models before COT prompting yielded a p-value of 0.59, indicating no statistically significant difference in accuracy. After COT prompting, the performance of the two models diverged. GPT-4 had a net loss of 1 correct answer, resulting in an overall accuracy of 80.95% (221/273). In contrast, Claude Opus had a net gain of 19 correct answers, leading to an accuracy of 86.4% (236/273). A chi-squared test comparing the two models after COT prompting showed a difference in accuracy of 5.5 percentage points (p = 0.08), a borderline result that falls short of the conventional 0.05 significance threshold. Thematic analysis of the oncologists' feedback revealed that the most common reasons for incorrect answers were reliance on outdated guidelines, misinterpretation of clinical trial data, and failure to consider multidisciplinary or patient-specific approaches in clinical decision-making. Conclusions: Although AI models can achieve high scores on multiple-choice exams, they still require human supervision. These models rely on potentially outdated training data and lack the ability to individualize patient care or apply data from clinical trials, especially in unique or unconventional (non-textbook) scenarios.
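
The accuracy figures in the Results can be checked with a few lines of arithmetic. The sketch below is an illustration only, not the authors' analysis code: the abstract does not say which confidence interval method or chi-squared variant was used. It assumes a Wilson score interval and a Pearson chi-squared test without continuity correction on a 2x2 table of correct and incorrect answers, and it treats the two models' answers as independent samples. Under those assumptions the reported intervals and p-values are reproduced. The function names `wilson_ci` and `chi2_2x2` are hypothetical.

```python
import math


def wilson_ci(correct, total, z=1.96):
    """Wilson score interval for a proportion (assumed CI method)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def chi2_2x2(correct_a, correct_b, total):
    """Pearson chi-squared test (1 df, no continuity correction) comparing
    two models that each answered the same number of questions."""
    a, b = correct_a, total - correct_a  # model A: correct, incorrect
    c, d = correct_b, total - correct_b  # model B: correct, incorrect
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


TOTAL = 273  # number of MCQs, from the abstract

# Baseline accuracy with 95% confidence intervals.
for name, correct in [("GPT-4", 222), ("Claude Opus", 217)]:
    low, high = wilson_ci(correct, TOTAL)
    print(f"{name}: {correct / TOTAL:.1%} (95% CI {low:.1%}-{high:.1%})")

# Between-model comparison before and after COT prompting.
_, p_before = chi2_2x2(222, 217, TOTAL)
_, p_after = chi2_2x2(221, 236, TOTAL)
print(f"before COT: p = {p_before:.2f}")  # 0.59
print(f"after COT:  p = {p_after:.2f}")   # 0.08
```

Because both models answered the same 273 questions, a paired test such as McNemar's could also be considered, but it needs per-question agreement counts that the abstract does not report.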

Volume

43
