Adequacy of prostate cancer prevention and screening recommendations provided by an artificial intelligence-powered large language model

Recommended Citation

Chiarelli G, Stephens A, Finati M, Cirulli GO, Beatrici E, Filipas DK, Arora S, Tinsley S, Bhandari M, Carrieri G, Trinh QD, Briganti A, Montorsi F, Lughezzani G, Buffi N, Rogers C, and Abdollah F. Adequacy of prostate cancer prevention and screening recommendations provided by an artificial intelligence-powered large language model. Int Urol Nephrol 2024.

Document Type

Article

Publication Date

4-2-2024

Publication Title

International urology and nephrology

Keywords

Humans, Prostatic Neoplasms/prevention & control/diagnosis, Male, Early Detection of Cancer/methods, Artificial Intelligence, Language, Artificial intelligence, Cancer screening, Preventive health service, Prostatic neoplasm

Abstract

PURPOSE: We aimed to assess the appropriateness of ChatGPT in providing answers related to prostate cancer (PCa) screening, comparing GPT-3.5 and GPT-4.

METHODS: A committee of five reviewers designed 30 questions related to PCa screening, categorized into three difficulty levels. Each question was submitted to both GPT versions three times, worded identically for the two models but with a different prompt each time. Each reviewer scored the answers for accuracy, clarity, and conciseness. Readability was assessed with the Flesch-Kincaid Grade (FKG) and Flesch Reading Ease (FRE). Mean scores were compared using the Wilcoxon test, and readability across the three prompts was compared by ANOVA.

RESULTS: For GPT-3.5, the mean (SD) scores for accuracy, clarity, and conciseness were 1.5 (0.59), 1.7 (0.45), and 1.7 (0.49), respectively, for easy questions; 1.3 (0.67), 1.6 (0.69), and 1.3 (0.65) for medium; and 1.3 (0.62), 1.6 (0.56), and 1.4 (0.56) for hard. For GPT-4, they were 2.0 (0), 2.0 (0), and 2.0 (0.14), respectively, for easy questions; 1.7 (0.66), 1.8 (0.61), and 1.7 (0.64) for medium; and 2.0 (0.24), 1.8 (0.37), and 1.9 (0.27) for hard. GPT-4 performed better than GPT-3.5 on all three qualities at every difficulty level. The mean (SD) FKG for GPT-3.5 and GPT-4 answers was 12.8 (1.75) and 10.8 (1.72), respectively; the mean (SD) FRE was 37.3 (9.65) and 47.6 (9.88), respectively. The second prompt achieved better results in terms of clarity (all p < 0.05).

CONCLUSIONS: GPT-4 showed better accuracy, clarity, conciseness, and readability than GPT-3.5. Although prompts influenced response quality in both GPTs, their impact was significant only for clarity.
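
For readers unfamiliar with the two readability measures named in the abstract, the following is a minimal sketch of the standard published Flesch Reading Ease and Flesch-Kincaid Grade formulas. It is illustrative only: the abstract does not say which tool the authors used, the function names are hypothetical, and the word, sentence, and syllable counts are assumed to be supplied by the caller (syllable counting in particular varies between tools).

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula; higher scores mean easier text."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade formula; approximates a US school grade."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


if __name__ == "__main__":
    # Hypothetical counts for one answer, not data from the study.
    words, sentences, syllables = 100, 5, 150
    print(round(flesch_reading_ease(words, sentences, syllables), 2))
    print(round(flesch_kincaid_grade(words, sentences, syllables), 2))
```

Under these formulas a higher FRE and a lower FKG both indicate text that is easier to read, which is the direction in which the GPT-4 answers differ from the GPT-3.5 answers in the results above.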

Medical Subject Headings

Humans; Prostatic Neoplasms/prevention & control/diagnosis; Male; Early Detection of Cancer/methods; Artificial Intelligence; Language; Artificial intelligence; Cancer screening; Preventive health service; Prostatic neoplasm

PubMed ID

38564079

ePublication

ePub ahead of print

Volume

56

Issue

8

First Page

2589

Last Page

2595
