Using a Large Language Model (LLM) for Automated Extraction of Discrete Elements from Clinical Notes for Creation of Cancer Databases

Document Type

Conference Proceeding

Publication Date

10-1-2024

Publication Title

Int J Radiat Oncol Biol Phys

Abstract

Purpose/Objective(s): A common barrier to developing and maintaining institutional cancer databases with large cohorts is having the resources to review electronic medical records (EMR) quickly and accurately to capture important variables. With the advancement of artificial intelligence (AI) in healthcare, access to powerful local large language models (LLMs) is increasing, allowing researchers to create custom data queries. We sought to automate the extraction of critical health information for cancer patients by searching for keywords in a patient's EMR using an LLM.

Materials/Methods: We employed the LLM Mixtral 8x7B by Mistral AI to automate the extraction of keywords and phrases from multidisciplinary head and neck cancer tumor board notes of 50 patients diagnosed and treated in 2021. For each patient, the first tumor board note following cancer diagnosis was anonymized and used as input to the LLM. Keywords tested were age at diagnosis (years), gender, baseline ECOG score, tumor site, histology, clinical TNM stage, and overall AJCC stage (8th edition). Prompt engineering was used to iteratively optimize the output, which was then compared against ground-truth data abstracted by manual review. The effectiveness of the LLM was quantitatively assessed by calculating precision, recall, accuracy, and F1 score.

Results: Most patients were male (n = 32, 64%), and the median age at diagnosis was 62 years (range, 42–93 years). Cancer of the larynx was the most prevalent site (n = 19, 38%), followed by cancer of the oral cavity (n = 15, 30%), oropharynx (n = 12, 24%), and hypopharynx (n = 4, 8%). The keywords "age at diagnosis" and "gender" both captured the requested data with 100% precision, recall, F1 score, and accuracy. Outcomes for the remaining keywords and details of their performance are captured in Table 1. Collectively, the precision, recall, F1 score, and accuracy were 96.9%, 95.7%, 96.3%, and 93.1%, respectively.
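The evaluation step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the exact counting convention are assumptions, since the abstract does not specify how a field counts toward each metric. Here a field is treated as a true positive when the extracted value matches ground truth, a false positive when a non-matching value is output, and a false negative when the model outputs nothing for a field the note actually contains.

```python
def score_extractions(predicted, truth):
    """Micro-averaged precision, recall, F1, and accuracy over extracted fields.

    predicted, truth: lists of dicts mapping a field name (e.g. "gender",
    "clinical TNM stage") to its value, or None when the field is absent.
    The TP/FP/FN/TN convention below is an assumed one, not taken from
    the abstract.
    """
    tp = fp = fn = tn = 0
    for pred, gold in zip(predicted, truth):
        for field, gold_val in gold.items():
            pred_val = pred.get(field)
            if gold_val is None:
                if pred_val is None:
                    tn += 1          # correctly reported nothing
                else:
                    fp += 1          # hallucinated a value
            elif pred_val is None:
                fn += 1              # missed a documented value
            elif pred_val == gold_val:
                tp += 1              # exact match with manual review
            else:
                fp += 1              # extracted a wrong value
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}
```

Under this convention, a wrong extracted value lowers precision (and accuracy) without affecting recall, which is one way the four reported aggregate figures can diverge.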
Conclusion: For this small cohort using an untrained LLM, these performance metrics exceeded our expectations and are consistent with the literature. LLMs have great potential to be used confidently by researchers to automate manual chart review and expedite cancer database development and maintenance. Prompt engineering should be used to optimize output, especially with an untrained LLM. As more health systems allow researchers to use these AI tools and platforms, larger and more robust retrospective data abstraction can be conducted more efficiently.

Volume

120

Issue

2 Suppl

First Page

e625