Performance of GPT-4 with Vision on Text- and Image-based ACR Diagnostic Radiology In-Training Examination Questions

Document Type

Article

Publication Date

9-1-2024

Publication Title

Radiology

Abstract

Background: Recent advancements, including image processing capabilities, present potential new applications in radiology for large language models such as ChatGPT (OpenAI), a generative pretrained transformer. However, the baseline performance of ChatGPT on radiology-related tasks is understudied.

Purpose: To evaluate the performance of GPT-4 with vision (GPT-4V) on radiology in-training examination questions, including those with images, to gauge the model's baseline knowledge in radiology.

Materials and Methods: In this prospective study, conducted between September 2023 and March 2024, the September 2023 release of GPT-4V was assessed using 386 retired questions (189 image-based and 197 text-only) from the American College of Radiology Diagnostic Radiology In-Training Examinations. Nine question pairs were identified as duplicates; only the first instance of each duplicate was included in the assessment. A subanalysis assessed the impact of different zero-shot prompts on performance. Statistical analysis included χ² tests of independence to ascertain whether the performance of GPT-4V varied between question types or subspecialties. The McNemar test was used to evaluate performance differences between prompts, with Benjamini-Hochberg adjustment of the P values to control the false discovery rate (FDR). A P value threshold of less than .05 denoted statistical significance.
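For readers who want the mechanics of this analysis, the following Python sketch applies the same tests with statsmodels. The paired 2x2 table is a hypothetical placeholder (per-question responses are not reproduced in this record), and the study's FDR adjustment spanned more comparisons than the three raw P values shown, so the adjusted values computed here will not match the published FDRs.

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.multitest import multipletests

# Hypothetical paired 2x2 table for one prompt comparison:
# rows = prompt A (correct, incorrect); columns = prompt B (correct, incorrect).
paired = np.array([[140, 20],
                   [8, 27]])
result = mcnemar(paired, exact=True)  # exact binomial McNemar test on the discordant pairs
print(f"McNemar P = {result.pvalue:.3f}")

# Benjamini-Hochberg step-up adjustment across a family of comparisons.
# The raw P values below are those reported for the text-based prompt comparisons.
raw_p = [0.02, 0.009, 0.001]
reject, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
print(list(zip(raw_p, p_adj.round(3), reject)))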

Results: GPT-4V correctly answered 246 (65.3%) of the 377 unique questions, with significantly higher accuracy on text-only questions (81.5%, 159 of 195) than on image-based questions (47.8%, 87 of 182) (χ² test, P < .001). Subanalysis revealed differences between prompts on text-based questions, where chain-of-thought prompting outperformed long instruction by 6.1% (McNemar test, P = .02; FDR = 0.063), basic prompting by 6.8% (P = .009; FDR = 0.044), and the original prompting style by 8.9% (P = .001; FDR = 0.014). No differences were observed between prompts on image-based questions (P values, .27 to >.99).
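The headline text-versus-image comparison can be checked directly from the reported counts. A minimal sketch using the χ² test of independence from scipy.stats (which by default applies Yates continuity correction for 2x2 tables):

import numpy as np
from scipy.stats import chi2_contingency

# 2x2 contingency table from the reported counts:
# rows = question type; columns = (correct, incorrect).
table = np.array([[159, 195 - 159],  # text-only: 159 of 195 correct (81.5%)
                  [87, 182 - 87]])   # image-based: 87 of 182 correct (47.8%)
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, P = {p:.1e}")  # P < .001, consistent with the Results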

Conclusion: While GPT-4V demonstrated a level of competence on text-based questions, it showed deficits in interpreting radiologic images.

Medical Subject Headings

Humans; Prospective Studies; Radiology; Educational Measurement; Clinical Competence; United States; Internship and Residency; Education, Medical, Graduate

PubMed ID

39225605

Volume

312

Issue

3

First Page

240153

Last Page

240153
