Machine Learning Outperforms Logistic Regression Analysis to Predict Next-Season NHL Player Injury: An Analysis of 2322 Players From 2007 to 2017

Document Type


Publication Date


Publication Title

Orthopaedic Journal of Sports Medicine


© The Author(s) 2020. Background: The opportunity to quantitatively predict next-season injury risk in the National Hockey League (NHL) has become a reality with the advent of advanced computational processors and machine learning (ML) architecture. Unlike static regression analyses that provide a momentary prediction, ML algorithms are dynamic in that they are readily capable of imbibing historical data to build a framework that improves with additive data. Purpose: To (1) characterize the epidemiology of publicly reported NHL injuries from 2007 to 2017, (2) determine the validity of a machine learning model in predicting next-season injury risk for both goalies and position players, and (3) compare the performance of modern ML algorithms versus logistic regression (LR) analyses. Study Design: Descriptive epidemiology study. Methods: Professional NHL player data were compiled for the years 2007 to 2017 from 2 publicly reported databases in the absence of an official NHL-approved database. Attributes acquired from each NHL player from each professional year included age, 85 performance metrics, and injury history. A total of 5 ML algorithms were created for both position player and goalie data: random forest, K Nearest Neighbors, Naïve Bayes, XGBoost, and Top 3 Ensemble. LR was also performed for both position player and goalie data. Area under the receiver operating characteristic curve (AUC) primarily determined validation. Results: Player data were generated from 2109 position players and 213 goalies. For models predicting next-season injury risk for position players, XGBoost performed the best with an AUC of 0.948, compared with an AUC of 0.937 for LR (P <.0001). For models predicting next-season injury risk for goalies, XGBoost had the highest AUC with 0.956, compared with an AUC of 0.947 for LR (P <.0001). Conclusion: Advanced ML models such as XGBoost outperformed LR and demonstrated good to excellent capability of predicting whether a publicly reportable injury is likely to occur the next season.