Efficient CTCAE Grading for Post-Radiotherapy Toxicities Using Large Language Models: A Privacy-Preserving Approach Using Instruction Fine-Tuning

Document Type

Conference Proceeding

Publication Date

9-1-2025

Publication Title

Int J Radiat Oncol Biol Phys

Abstract

Purpose/Objective(s): Accurate Common Terminology Criteria for Adverse Events (CTCAE) grading is vital for patient care and for clinical decision modeling in support of precision medicine. This study introduces a novel, parameter-efficient, and privacy-preserving method for automated CTCAE grading based on instruction fine-tuning (IFT) of compact language models, aiming to improve grading accuracy while minimizing computational demands.

Materials/Methods: We fine-tuned two language models, Llama-3.1-8B (Llama) and Qwen2.5-7B (Qwen), using explicit CTCAE grading guidelines. Low-Rank Adaptation (LoRA, rank 128, α = 32) was applied to the attention, feed-forward, and embedding layers to improve the models’ understanding of clinical terminology and sharpen their focus on relevant context. Chain-of-thought (CoT) prompting further enhanced reasoning during grading. The models were trained on 333 expert-labeled clinical notes from 45 prostate cancer patients treated with 78 Gy of radiation (2017–2021), covering 12 toxicity symptoms: cystitis, dysuria, erectile dysfunction, hematuria, incontinence, nocturia, proctitis, rectal bleeding, stricture, urgency, urinary frequency, and urinary retention. Two expert clinicians assigned each note a Grade (G) of 1–3 (Cohen’s κ = 0.88; 92% agreement). Stratified five-fold cross-validation was performed with a 50-10-40 train-validation-test split (approximately 166, 33, and 134 notes per fold) that preserved the toxicity severity distribution. Metrics included class-specific F1, macro-averaged precision, recall, area under the receiver operating characteristic curve (AUCROC), and area under the precision-recall curve (AUCPR).

Results: Both models improved after IFT across metrics (Table 1). Llama-3.1-8B’s median F1 scores rose from 48% to 53% (Grade 1), 68% to 71% (Grade 2), and 56% to 71% (Grade 3); precision increased from 49% to 66% and recall from 43% to 72%. Qwen2.5-7B’s median F1 scores improved from 47% to 52% (Grade 1), 53% to 69% (Grade 2), and 56% to 66% (Grade 3); precision rose from 45% to 62% and recall from 42% to 67%.

Conclusion: This framework, which combines IFT, LoRA, and CoT, improves the accuracy and consistency of toxicity grading. It offers a privacy-preserving, scalable solution that can support better clinical decisions and patient care in radiation oncology.
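The 50-10-40 stratified split described in the Materials/Methods can be illustrated with a minimal sketch. It is not the authors' code. The function name `stratified_split`, the seed, and the per-grade counts (150, 120, 63) are hypothetical, since the abstract reports only the total of 333 notes. One plausible reading of the five-fold design, also an assumption, is that the split is repeated five times with different seeds.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.5, 0.1, 0.4), seed=0):
    """Split note indices into train/validation/test sets so that each
    grade keeps roughly the same share in every subset."""
    rng = random.Random(seed)

    # Group note indices by their toxicity grade.
    by_grade = defaultdict(list)
    for idx, grade in enumerate(labels):
        by_grade[grade].append(idx)

    train, val, test = [], [], []
    for grade in sorted(by_grade):
        idxs = by_grade[grade]
        rng.shuffle(idxs)
        n_train = int(len(idxs) * fractions[0])
        n_val = int(len(idxs) * fractions[1])
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        # The remainder of each grade goes to the test set.
        test += idxs[n_train + n_val:]
    return train, val, test


# Hypothetical grade distribution summing to 333 notes (not from the abstract).
labels = [1] * 150 + [2] * 120 + [3] * 63
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

With these assumed counts the subsets contain 166, 33, and 134 notes, in line with the approximate per-fold sizes reported above. Other grade distributions would shift the sizes by a note or two because each grade is rounded down separately.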

Volume

123

Issue

1S

First Page

e745

Last Page

e746
