Battle of the Bots: Assessing the Ability of Four Large Language Models to Tackle Different Surgery Topics
Recommended Citation
Madi M, Araji T, Hazimeh D, and Adra SW. Battle of the Bots: Assessing the Ability of Four Large Language Models to Tackle Different Surgery Topics. Am Surg 2025.
Document Type
Article
Publication Date
5-26-2025
Publication Title
The American Surgeon
Abstract
Objective: Our study aims to compare the performance of different large language model chatbots on surgical questions across different topics and question categories.
Materials and Methods: Four chatbots (ChatGPT 4.0, Medical Chat, Google Bard, and Copilot AI) were used in our study. A total of 114 multiple-choice surgical questions covering 9 different topics were entered into each chatbot, and their answers were recorded.
Results: The performance of ChatGPT was significantly better than that of Bard (P < 0.0001) and Medical Chat (P = 0.0013) but not significantly better than that of Copilot (P = 0.9663). We also found a statistically significant difference among the chatbots on ENT (P = 0.0199) and GI (P = 0.0124) questions when we assessed their performance by surgical specialty. Finally, the mean scores of Bard, Copilot, Medical Chat, and ChatGPT 4.0 were higher on diagnosis questions than on management questions; however, the difference was statistically significant only for Bard (P = 0.0281).
Conclusion: Our study offers insight into the performance of different chatbots on surgery-related questions and topics. The strengths and shortcomings of each can provide a better understanding of how to use chatbots in the surgical field, including surgical education.
Medical Subject Headings
chatbots; educational tools; innovation; large language models; resident education; surgical education
PubMed ID
40420550
ePublication
ePub ahead of print
First Page
31348251346538
Last Page
31348251346538
