Battle of the Bots: Assessing the Ability of Four Large Language Models to Tackle Different Surgery Topics

Document Type

Article

Publication Date

5-26-2025

Publication Title

The American Surgeon

Abstract

Objective: Our study aims to compare the performance of different large language model chatbots on surgical questions across different topics and question categories.

Materials and Methods: Four different chatbots (ChatGPT 4.0, Medical Chat, Google Bard, and Copilot AI) were used for our study. A total of 114 multiple-choice surgical questions covering nine different topics were entered into each chatbot, and the answers were recorded.

Results: The performance of ChatGPT was significantly better than that of Bard (P < 0.0001) and Medical Chat (P = 0.0013) but not significantly better than that of Copilot (P = 0.9663). When we assessed performance by surgical specialty, we also found a statistically significant difference among the chatbots on ENT (P = 0.0199) and GI (P = 0.0124) questions. Finally, the mean scores of Bard, Copilot, Medical Chat, and ChatGPT 4.0 on diagnosis questions were higher than on management questions; however, the difference was statistically significant only for Bard (P = 0.0281).
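The abstract does not state which statistical test produced these P values. As an illustration only, the sketch below shows one common way to compare two chatbots' paired accuracy on the same question set (McNemar's test for paired binary outcomes); the correctness data in it are hypothetical placeholders, not the study's data.

```python
# Minimal sketch: pairwise accuracy comparison between two chatbots answering
# the same 114 questions. McNemar's test is assumed here (the abstract does
# not name the test used); all correctness vectors are hypothetical.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_questions = 114

# Hypothetical 0/1 correctness per question for two chatbots.
chatbot_a_correct = rng.integers(0, 2, n_questions)
chatbot_b_correct = rng.integers(0, 2, n_questions)

# 2x2 table of paired outcomes (both right, A-only, B-only, both wrong).
table = np.zeros((2, 2), dtype=int)
for a, b in zip(chatbot_a_correct, chatbot_b_correct):
    table[a, b] += 1

result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"Accuracy A: {chatbot_a_correct.mean():.2%}, "
      f"Accuracy B: {chatbot_b_correct.mean():.2%}, p = {result.pvalue:.4f}")
```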

Conclusion: Our study offers insight into the performance of different chatbots on surgery-related questions and topics. The strengths and shortcomings of each can provide a better understanding of how to use chatbots in the surgical field, including in surgical education.

Medical Subject Headings

chatbots; educational tools; innovation; large language models; resident education; surgical education

PubMed ID

40420550

ePublication

ePub ahead of print

First Page

31348251346538

Last Page

31348251346538
