Late Radiotherapy-related Toxicity Extraction From Clinical Notes Using Large Language Models for Definitively Treated Prostate Cancer Patients
Recommended Citation
Ghanem AI, Khanmoham-Madi R, Verdecchia K, Hall R, Elshaikh MA, Movsas B, Bagher-Ebadian H, Chetty I, Ghassemi MM, Thind K. Late Radiotherapy-related Toxicity Extraction From Clinical Notes Using Large Language Models for Definitively Treated Prostate Cancer Patients. Am J Clin Oncol 2025; 48(8):S22.
Document Type
Conference Proceeding
Publication Date
8-1-2025
Publication Title
Am J Clin Oncol
Abstract
Background: For definitively treated prostate cancer, late radiotherapy (RT)-related toxicities are difficult to track because patients often seek management outside radiation oncology years after the RT course. Reporting these toxicities therefore entails a complex, time-consuming review of a large number of follow-up notes from different specialties over a long period. Large language models (LLMs) represent a major advancement in artificial intelligence and are capable of extracting clinically relevant information from electronic medical records (1,2). Objectives: We sought to automate the extraction of late RT-related toxicity symptoms from clinical notes for prostate cancer patients using LLMs in a teacher-student architecture. Methods: For a cohort of 177 patients with localized prostate cancer treated definitively with 78 to 79.2 Gy with or without androgen deprivation therapy between 2013 and 2020, we identified 1133 clinical notes written more than 6 months after RT conclusion. For validation (434 notes), radiation oncologists manually captured late RT toxicities and relevant symptoms across twelve genitourinary/gastrointestinal domains: cystitis, urgency, urinary obstruction, dysuria, hematuria, nocturia, secondary malignancy (urothelial carcinoma), incontinence, stricture, proctitis, rectal bleeding, and erectile dysfunction. For LLM optimization, 699 notes were used: 294 with a single symptom and 375 with multiple symptoms per note (median of 5 per note). The Mixtral-8x7B student model first extracts toxicity symptoms, and this extraction is then refined by the GPT-4 teacher model over 16 rounds and 5 epochs, based on the student's performance and rationale. In this process, the student ranks each concept as positive, negative, or neutral and justifies the ranking, and the teacher model evaluates this analysis and improves the prompts accordingly.
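The refinement process described in the Methods can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how rounds nest within epochs, how the teacher receives the student's rationale, or when a revised prompt is accepted. Here a revised prompt is kept only if accuracy improves, and `toy_student` and `toy_teacher` are hypothetical stand-ins for the Mixtral-8x7B and GPT-4 calls.

```python
from typing import Callable, List, Tuple

Label = str                                   # "positive", "negative", or "neutral"
Example = Tuple[str, Label]                   # (note text, reference label)
Mistake = Tuple[str, Label, Label, str]       # (note, reference, predicted, rationale)
Student = Callable[[str, str], Tuple[Label, str]]   # (prompt, note) -> (label, rationale)
Teacher = Callable[[str, List[Mistake]], str]       # (prompt, mistakes) -> revised prompt


def accuracy(prompt: str, data: List[Example], student: Student) -> float:
    """Fraction of notes where the student's label matches the reference."""
    hits = sum(student(prompt, note)[0] == gold for note, gold in data)
    return hits / len(data)


def refine_prompt(prompt: str, data: List[Example], student: Student,
                  teacher: Teacher, epochs: int = 5, rounds: int = 16):
    """Assumed loop: the teacher revises the prompt from the student's mistakes."""
    best_prompt, best_acc = prompt, accuracy(prompt, data, student)
    for _ in range(epochs):
        for _ in range(rounds):
            mistakes: List[Mistake] = []
            for note, gold in data:
                label, rationale = student(best_prompt, note)
                if label != gold:
                    mistakes.append((note, gold, label, rationale))
            if not mistakes:                  # nothing left to correct
                return best_prompt, best_acc
            candidate = teacher(best_prompt, mistakes)
            cand_acc = accuracy(candidate, data, student)
            if cand_acc > best_acc:           # assumed acceptance rule
                best_prompt, best_acc = candidate, cand_acc
    return best_prompt, best_acc


# Hypothetical stand-ins: the "prompt" is a comma-separated keyword list.
def toy_student(prompt: str, note: str) -> Tuple[Label, str]:
    hit = next((k for k in prompt.split(",") if k in note.lower()), None)
    if hit:
        return "positive", f"matched keyword '{hit}'"
    return "negative", "no keyword matched"


def toy_teacher(prompt: str, mistakes: List[Mistake]) -> str:
    # Adds the last word of the first misclassified note as a new keyword.
    return prompt + "," + mistakes[0][0].lower().split()[-1]


toy_notes = [
    ("Patient reports hematuria", "positive"),
    ("No complaints today", "negative"),
    ("Blood in urine", "positive"),
]
final_prompt, final_acc = refine_prompt("hematuria", toy_notes, toy_student, toy_teacher)
print(final_prompt, final_acc)
```

In a real setting the two stand-ins would wrap model calls, and the acceptance rule would likely be evaluated on held-out notes rather than on the notes used to generate feedback.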
Using the validation set as a reference, we used accuracy to assess the student model refinement and calculated precision, recall, and F1 scores to compare the performance of the refined prompts with that of their initial forms. Results: For single-symptom notes (n = 294), overall average accuracy for toxicity symptom extraction reached 0.71 after refinement, with final precision, recall, and F1 scores of 0.82, 0.71, and 0.73, respectively (Table 1). 'Urgency,' 'Urothelial Carcinoma,' and 'Stricture' reached perfect accuracy (score = 1) after refinement with excellent precision, followed by 'Urinary Obstruction' (score = 0.8). The initial model already performed optimally, with no further improvement, for 'Dysuria,' 'Incontinence,' and 'Hematuria.' Scores were relatively lower for multiple-symptom notes (n = 375), with the best accuracy for 'Hematuria' (0.76), 'Urothelial Carcinoma' (0.7), and 'Dysuria' (0.62), and the largest improvements for 'Urothelial Carcinoma' (0.05 to 0.7) and 'Rectal Bleeding' (0.16 to 0.57). Compared with initial performance, the assessment metrics improved by 16% to 30% for both single-symptom and multiple-symptom notes (Table 1). Across all note types, the incremental per-epoch improvement for each symptom yielded a final accuracy of 72% to 97% (Fig. 1 left), with an overall average (SD) accuracy of 84 (72 to 96)% (Fig. 1 right). Description: A clear trend of improvement is observed for nearly all symptoms as the number of epochs increases, illustrating the efficacy of our prompt refinement process. Conclusions: Using our novel in-house student-teacher LLM framework, which is still under development and relies on incremental self-improvement, we achieved clinically meaningful results with strong potential to accurately automate late RT-related toxicity extraction for prostate cancer patients without compromising patient privacy.
We are working to strengthen the validity of this model by including data from more patients with longer follow-up, and we hope to expand its scope to encompass RT toxicity grading and management.
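The evaluation metrics named in the Methods can be sketched as follows. This is a minimal illustration, assuming binary per-note labels for each symptom and macro-averaging across symptoms; the abstract does not state which averaging scheme was used, and the labels below are made-up toy values, not study data. One hint that the reported scores are averages of per-symptom values: the harmonic mean of the reported precision and recall, 2(0.82)(0.71)/(0.82 + 0.71) ≈ 0.76, differs from the reported F1 of 0.73.

```python
from typing import Dict, List, Tuple


def prf(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one symptom (1 = present, 0 = absent)."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def macro_average(per_symptom: Dict[str, Tuple[float, float, float]]):
    """Unweighted mean of each metric across symptoms (assumed scheme)."""
    n = len(per_symptom)
    return tuple(sum(s[i] for s in per_symptom.values()) / n for i in range(3))


# Toy reference and predicted labels for two symptoms over four notes.
toy_gold = {"hematuria": [1, 1, 0, 0], "dysuria": [1, 0, 1, 0]}
toy_pred = {"hematuria": [1, 0, 0, 0], "dysuria": [1, 1, 1, 0]}

scores = {name: prf(toy_gold[name], toy_pred[name]) for name in toy_gold}
macro = macro_average(scores)
print(scores)
print(macro)
```

In the toy data the macro F1 is lower than the harmonic mean of the macro precision and macro recall, which is the same pattern seen in the reported figures.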
Volume
48
Issue
8
First Page
S22
