Performance of large language models in addressing patient queries on colorectal cancer screening in different languages: An international study across 28 countries
Recommended Citation
Maida M, Papaefthymiou A, Gupta S, Voiosu T, Lau LHS, Baraldo S, Pal P, Mwachiro M, Zuchelli T, Uchima H, Aguila E, Bouberra D, Degroote H, Düzenli T, Gameel A, Khurelbaatar T, Lakkasani S, Luvsandagva B, Maulahela H, Nobre R, Okubo Y, Rimondi A, Taiymi A, and Mostafa I. Performance of large language models in addressing patient queries on colorectal cancer screening in different languages: An international study across 28 countries. Dig Liver Dis 2025;58(2):250-257.
Document Type
Article
Publication Date
2-1-2026
Publication Title
Digestive and Liver Disease: Official Journal of the Italian Society of Gastroenterology and the Italian Association for the Study of the Liver
Keywords
Humans, Colorectal Neoplasms, Early Detection of Cancer, Language, Comprehension, Surveys and Questionnaires, Asia, Multilingualism, Europe, Male, Female, Mass Screening, Africa, Large Language Models
Abstract
BACKGROUND: Colorectal cancer (CRC) screening reduces incidence and mortality, yet patient adherence remains suboptimal. Large language models may improve participation by addressing patient questions in native languages, but their multilingual performance has not been systematically assessed.
METHODS: From April to June 2025, we conducted a cross-continental study involving 28 countries and 23 languages. A standardized set of 15 CRC screening-related questions was translated into each language and submitted to ChatGPT (GPT-4o). Responses were independently evaluated by 140 gastroenterologists (five per country) for accuracy, completeness, and comprehensibility on a 5-point Likert scale. Statistical analyses included t-tests, chi-square tests, and two-way ANOVA.
RESULTS: The study included experts and data from Europe, Asia, Africa, the Americas, and Oceania. Mean scores (±SD) for accuracy, completeness, and comprehensibility were 4.1 ± 1.0, 4.1 ± 1.0, and 4.2 ± 0.9, respectively. Most languages achieved high ratings, with 73.9%, 86.9%, and 82.6% of languages scoring ≥4 for accuracy, completeness, and comprehensibility, respectively. However, lower scores were observed in Chinese, Dutch, and Greek. Variability was also noted between countries sharing the same language, highlighting language- and context-dependent performance.
DISCUSSION: ChatGPT showed strong ability to answer CRC screening questions across multiple languages, supporting its promise as a multilingual patient education tool. Nonetheless, regional variability requires careful validation before clinical integration.
Medical Subject Headings
Humans; Colorectal Neoplasms; Early Detection of Cancer; Language; Comprehension; Surveys and Questionnaires; Asia; Multilingualism; Europe; Male; Female; Mass Screening; Africa; Large Language Models
PubMed ID
41436291
ePublication
ePub ahead of print
Volume
58
Issue
2
First Page
250
Last Page
257
