Comparison of natural language processing rules-based and machine-learning systems to identify lumbar spine imaging findings related to low back pain

W. K. Tan
Saeed Hassanpour
Patrick J. Heagerty
Sean D. Rundell
Pradeep Suri
Hannu T. Huhdanpaa
Kathryn James
David S. Carrell
Curtis P. Langlotz
Nancy L. Organ
Eric N. Meier
Karen J. Sherman
David F. Kallmes
Patrick H. Luetmer
Brent Griffith, Henry Ford Health
David R. Nerenz, Henry Ford Health
Jeffrey G. Jarvik

Recommended Citation

Tan WK, Hassanpour S, Heagerty PJ, Rundell SD, Suri P, Huhdanpaa HT, James K, Carrell DS, Langlotz CP, Organ NL, Meier EN, Sherman KJ, Kallmes DF, Luetmer PH, Griffith B, Nerenz DR, Jarvik JG. Comparison of natural language processing rules-based and machine-learning systems to identify lumbar spine imaging findings related to low back pain. Academic radiology 2018; 25(11):1422-1432.

Document Type

Article

Publication Date

11-1-2018

Publication Title

Academic radiology

Abstract

RATIONALE AND OBJECTIVES: To evaluate a natural language processing (NLP) system built with open-source tools for identification of lumbar spine imaging findings related to low back pain on magnetic resonance and x-ray radiology reports from four health systems.

MATERIALS AND METHODS: We used a limited data set (de-identified except for dates) sampled from lumbar spine imaging reports of a prospectively assembled cohort of adults. From N = 178,333 reports, we randomly selected N = 871 to form a reference-standard dataset, consisting of N = 413 x-ray reports and N = 458 MR reports. Using standardized criteria, four spine experts annotated the presence of 26 findings, where 71 reports were annotated by all four experts and 800 were each annotated by two experts. We calculated inter-rater agreement and finding prevalence from annotated data. We randomly split the annotated data into development (80%) and testing (20%) sets. We developed an NLP system from both rule-based and machine-learned models. We validated the system using accuracy metrics such as sensitivity, specificity, and area under the receiver operating characteristic curve (AUC).
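As an illustration of the evaluation described above, the following minimal sketch shows a random 80%/20% split and the three accuracy metrics named in the abstract for a single binary finding. It is not the authors' code: the function names, the fixed seed, and the rank-based AUC computation are assumptions made for this example.

```python
import random


def split_dev_test(items, test_fraction=0.2, seed=0):
    """Randomly split items into (development, testing) lists.

    The seed and rounding rule are illustrative assumptions.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


def auc(y_true, scores):
    """AUC as the probability that a randomly chosen positive report
    scores higher than a randomly chosen negative one (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    dev, test = split_dev_test(range(871))
    print(len(dev), len(test))
    print(sensitivity_specificity([1, 1, 1, 0, 0], [1, 1, 0, 0, 1]))
    print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

Sensitivity and specificity need a hard yes/no prediction per report, whereas AUC needs a continuous score. For a rules-based classifier that only outputs 0 or 1, the score is simply that binary output.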

RESULTS: The multirater annotated dataset achieved inter-rater agreement of Cohen's kappa > 0.60 (substantial agreement) for 25 of 26 findings, with finding prevalence ranging from 3% to 89%. In the testing sample, rules-based and machine-learned predictions had comparable average specificity (0.97 and 0.95, respectively). The machine-learned approach had a higher average sensitivity (0.94, compared to 0.83 for rules-based), and a higher overall AUC (0.98, compared to 0.90 for rules-based).
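For reference, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance from each rater's label frequencies. The sketch below computes the standard two-rater form for one finding. The abstract does not say how agreement was aggregated for the reports annotated by all four experts, so this is only an illustration, and the function name is an assumption.

```python
def cohens_kappa(rater_a, rater_b):
    """Two-rater Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of reports")
    n = len(rater_a)
    # Observed agreement: fraction of reports where both raters agree.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    a = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
    b = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
    print(cohens_kappa(a, b))
```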

CONCLUSIONS: Our NLP system performed well in identifying the 26 lumbar spine findings, as benchmarked by reference-standard annotation by medical experts. Machine-learned models provided substantial gains in model sensitivity with slight loss of specificity, and overall higher AUC.

PubMed ID

29605561

ePublication

ePub ahead of print

Volume

25

Issue

11

First Page

1422

Last Page

1432

Center for Health Policy and Health Services Research Articles
