Introduction

In 2003, Ganz et al. described femoroacetabular impingement (FAI) as one of the primary causes of hip osteoarthritis [10]. Over the last two decades, hip arthroscopy has been increasingly performed for the treatment of this intra-articular hip disorder along with cartilage and labral injuries [4, 7, 42]. As the annual number of procedures has increased, many studies have sought to evaluate the risk of undergoing a subsequent revision hip arthroscopy [1, 2, 5, 6, 9, 11, 12, 14, 17, 23, 28, 32, 34, 38]. Though these studies have identified several risk factors associated with revision surgery, the ability to translate these pre-operative factors into a specific risk score is poor. A clinical tool to estimate a patient’s individual risk of having subsequent revision hip arthroscopy would be a valuable adjunct for the surgeon to guide discussions regarding surgical decision-making and expectations.

Machine learning has the potential to improve the ability to estimate outcome at an individual level. Machine learning uses data to build flexible prediction and decision-making models without the need for researchers to pre-specify how predictors relate to each other and to the outcome of interest. Through analysis of large clinical datasets, machine learning models can identify factors associated with outcome and use these factors to formulate prospective predictive algorithms. The ideal database for clinically useful machine learning analysis is one that contains a large volume of patient data that is representative of a diverse portion of the population under evaluation. National registries are a potentially strong data source and hold promise for the development of clinically impactful outcome prediction models because they capture a large volume of patients from multiple institutions and surgeons.

The Danish Hip Arthroscopy Registry (DHAR) has been prospectively collecting demographic, surgical, and outcome data since 2012. There are currently more than 6000 patients registered in the database who have undergone hip arthroscopy throughout Denmark. This national registry has yielded several clinically useful contributions to the orthopaedic literature [15, 26, 27, 29, 30, 31], and machine learning enables further analysis. The purpose of this study was to apply machine learning to the DHAR with the primary goal of developing a clinically useful algorithm capable of predicting subsequent revision hip arthroscopy. The hypothesis was that a resulting algorithm would be able to accurately estimate a patient’s risk of subsequent revision hip arthroscopy based on variables available in the pre-operative clinical setting. If successful, the resulting prediction model could be implemented in the clinic as an online calculator to guide discussions regarding surgical decision-making and outcome expectations at a patient-specific level.

Materials and methods

At the time of data entry in the DHAR, all patients provide informed consent. The DHAR complies with all current national data protection legislation. Data management in the current study was performed confidentially according to Danish and European Union (EU) data protection rules, with all data de-identified prior to retrieval for analysis. As this was a register-based study, ethical approval was automatically waived according to national legislation.

Transparent reporting

This manuscript was written in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement [3]. The TRIPOD statement represents recommendations for studies developing and/or validating prediction models. The goal of the TRIPOD statement is to improve the transparency of prediction model studies through full and clear information reporting and includes a 22-item checklist.

Data preparation

Patients in the DHAR with primary hip arthroscopy dates between January 2012 and December 2020 were included. A full list of variables used in the analysis is shown in Table 1a (pre-operative variables only) and 1b (intraoperative variables). Patients with previous surgery to the same hip were excluded to focus model prediction on patients undergoing primary hip arthroscopy for FAI. Additionally, a small number of patients with a history of Legg-Calvé-Perthes disease, developmental dysplasia of the hip, avascular necrosis, slipped capital femoral epiphysis, or hip fracture were excluded to limit heterogeneity of the population and focus on surgical management of primary FAI. New variables were defined for type of previous injury to the same hip (acetabular dysplasia, FAI), an indicator if the patient was missing any patient-reported outcome variable, type of labral repair anchors (bioabsorbable, PEEK, all suture), number of anchors, type of knots, type of cartilage treatment (microfracture, fixation/resection), and type of other pathology found (adhesions, partial/full ligamentum teres rupture, synovitis, bursitis, calcified labrum, os acetabuli, loose bodies, other). The following variables were recoded: MRI performed (non-contrast, arthrogram) and Tönnis grade (Grades 0, 1, 2, 3, and missing). Time to revision was calculated as the number of months from primary hip arthroscopy to revision. For assessing concordance at specific follow-up times, patients with a revision at or prior to the time point were considered as having experienced the event.

Table 1 Characteristics of patients

Machine learning modelling

The cleaned data were split randomly into training (75%) and test (25%) sets for model fitting and evaluation, respectively. The primary outcome for the models was probability of revision hip arthroscopy within 1, 2, and/or 5 years after primary hip arthroscopy. This approach utilises a survival-analysis temporal framing structure [25] and the program R (version: 4.1.1, R Core Team 2021, R Foundation for Statistical Computing, Vienna, Austria) was used to fit and evaluate several models adapted for censored, time-to-event data. “Censoring” refers to the fact that at any given time, complete information is not known for all the patients in the registry. For example, if a patient has two years of follow-up after primary surgery with no revision, we do not know if or when that patient will go on to have a revision. Models adapted for censored data allow use of the partial information contained in these censored observations while accounting for the incompleteness.
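The handling of censored observations at a fixed follow-up time can be illustrated with a minimal sketch. It is written in Python for illustration only (the study itself used R), and the function name `status_at_horizon` and the example values are hypothetical, not taken from the registry. It assumes follow-up is recorded in months and that, for revised patients, follow-up time equals time to revision.

```python
from typing import Optional


def status_at_horizon(followup_months: float, revised: bool,
                      horizon_months: float) -> Optional[int]:
    """Classify one patient at a fixed follow-up horizon.

    Returns 1 if a revision occurred at or before the horizon,
    0 if the patient is known to be revision-free at the horizon,
    and None if the patient was censored before the horizon,
    so the status at that time is unknown.
    """
    if revised and followup_months <= horizon_months:
        return 1
    if followup_months >= horizon_months:
        # Either never revised, or revised only after the horizon.
        return 0
    return None


# Hypothetical patients: (months of follow-up, revision observed?)
patients = [(8, True), (30, False), (14, False), (40, True)]
statuses = [status_at_horizon(t, r, horizon_months=24) for t, r in patients]
print(statuses)  # [1, 0, None, 0]
```

The third patient illustrates censoring: 14 months without revision says nothing definite about the 2-year status, which is why models adapted for censored data are needed rather than a simple binary classifier.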

The following four machine learning models were used: Cox elastic net, random survival forest, gradient boosted regression (GBM), and super learner. The Cox elastic net is a penalised, semi-parametric regression model that selects a subset of the predictors for inclusion in the model. “Elastic net” refers to the combination of L1 and L2 penalties used to shrink model coefficients toward zero [35]. The random survival forest is an adaptation of the popular tree-based random forest method for censored data. It uses all predictors and is nonparametric, meaning it does not require specification of the model structure [16]. The GBM is also tree-based and nonparametric. It iteratively improves the model fit using all predictors [8]. The super learner is an “ensemble” technique that averages over model fits from several different types of models for an even more flexible approach [24]. Our super learner combined the other three model types: Cox elastic net, random survival forest, and GBM.
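To make the combination of L1 and L2 penalties concrete, the sketch below computes the elastic net penalty term for a given coefficient vector. It is an illustration in Python rather than the study's R code, and it assumes the parameterisation commonly used by glmnet, in which a mixing parameter alpha weights the L1 term and (1 - alpha)/2 weights the squared L2 term. The function name and the coefficient values are hypothetical.

```python
def elastic_net_penalty(coefs, lam, alpha):
    """Elastic net penalty: lam * (alpha * L1 + (1 - alpha) / 2 * L2^2).

    alpha = 1 gives the lasso (pure L1) penalty and alpha = 0 gives
    the ridge (pure squared-L2) penalty.
    """
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


# Hypothetical coefficients, with a mixing value weighted toward L1.
penalty = elastic_net_penalty([1.0, -2.0], lam=1.0, alpha=0.9)
```

With alpha close to 1 the penalty behaves mostly like the lasso, which is what drives some coefficients exactly to zero and yields a subset of selected predictors.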

The Cox elastic net model (package glmnet, alpha value 0.9, lambda value selected via cross-validation) was fit to the data and predictors with non-zero coefficients were retained, shown in the top panel of Fig. 1. The random survival forest, GBM, and super learner were fit using a grid search method to arrive at hyperparameters (package MachineShop). The grid search method compares all possible combinations of a given set of hyperparameters to find the best fit based on a specified performance metric, for which the C-Index was used, as described below. The random survival forest (package randomForestSRC) used 1000 trees, a minimum node size of 200, and 10 variables tried per split. The GBM (package gbm) used 1000 trees, an interaction depth of 3, a minimum node size of 100, and shrinkage of 0.01. The super learner model (SuperModel function, package MachineShop) combined the three previous models with the specified hyperparameters.
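The grid search idea described above can be sketched generically. The following Python example is illustrative only and is not the MachineShop implementation used in the study: `grid_search` and `toy_score` are hypothetical names, the candidate values are invented for the example, and the scoring function is a placeholder for what would in practice be a resampled C-Index from fitting a model with each hyperparameter combination.

```python
from itertools import product


def grid_search(grid, score_fn):
    """Score every combination of hyperparameters and keep the best.

    grid:     dict mapping hyperparameter name -> list of candidates.
    score_fn: callable taking a dict of hyperparameters and returning
              a score where higher is better.
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Placeholder only: a real score would come from fitting and
    # evaluating a model, not from a closed-form expression.
    return -abs(params["node_size"] - 200) - abs(params["vars_per_split"] - 10)


candidate_grid = {"node_size": [50, 100, 200], "vars_per_split": [5, 10, 15]}
best, score = grid_search(candidate_grid, toy_score)
```

Because every combination is evaluated, the cost grows multiplicatively with the number of candidate values per hyperparameter, which is one reason grids are usually kept small.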

Fig. 1

Variable importance. The four plots show relative feature importance in each of the machine learning models. The highlighted bars indicate features selected into the Cox model. Random forest, gradient boosted (GBM), and super learner plots show features in the top half by importance score, for readability. Feature importance is measured on a different scale for each model, and thus only rankings of features, rather than scores, should be compared among the models. The Cox model measures feature importance by absolute effect size. The random forest and super learner models use permutation-based importance, which measures the relative change in model performance upon randomly permuting values of the given feature. The GBM measures importance as the difference in error rate that would result if the feature were removed, normalised to sum to 100.

Each of the machine learning models was fit using two different sets of predictors: all predictors, and all predictors excluding intraoperative variables (Table 1a). The two separate analyses allowed for comparison of model performance given only variables available in the pre-operative setting versus a model considering all variables available after surgical intervention.

Model evaluation

Performance measures adapted for censored data were used to evaluate the four models on survival probabilities calculated for the hold-out test set. A measure of model concordance adapted for censored data, Harrell’s C-Index, was used at 1-, 2-, and 5-year follow-up times. The C-Index computes the proportion of pairs of observations in which predicted survival probability ranking corresponds to actual ranking [13]. It is a generalisation of the common area under the receiver operating characteristic curve (AUC) metric for censored data and, as with AUC, ranges from 0 to 1 with 1 indicating perfect concordance and 0.5 representing random chance. Concordance is a measure of the model’s ability to differentiate between patients who do and do not experience the event. A model is said to have perfect concordance if the predicted risks for all individuals who experience the outcome are higher than those for all those who do not. Most clinically useful prediction models have a concordance in the 0.65–0.8 range [41]. Calibration, adapted for censored data, was also calculated. Calibration measures the accuracy of the predicted probabilities by comparing actual to expected outcomes. For this purpose, a version of the Hosmer–Lemeshow statistic intended for censored data was used. The statistic sums average misclassification in predicted risk quintiles and converts the sum into a chi-squared statistic [37]. Larger values of the calibration statistic indicate worse accuracy and produce smaller p-values. Statistical significance of the calibration statistic means we reject the null hypothesis of perfect calibration. Each of these performance metrics was calculated separately for models trained using the full set of predictors and pre-operative variables only.
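The pairwise logic behind Harrell’s C-Index can be sketched in a few lines. This Python example is a plain, untruncated version for illustration and does not reproduce the time-specific computation performed by the R packages used in the study. The function name and the toy data are hypothetical. It takes a risk score per patient, where a higher score corresponds to a lower predicted survival probability.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's C for right-censored data.

    A pair (i, j) is comparable when patient i has an observed event
    and a strictly shorter time than patient j. The pair is concordant
    when patient i also has the higher predicted risk. Tied risk
    scores count as half a concordant pair.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


# Hypothetical data: months to revision or censoring, and event flags.
times = [5, 10, 15, 20]
events = [1, 0, 1, 0]
print(harrell_c_index(times, events, [0.9, 0.4, 0.6, 0.1]))  # 1.0
print(harrell_c_index(times, events, [0.5, 0.5, 0.5, 0.5]))  # 0.5
```

The second call shows why 0.5 corresponds to random chance: a model that assigns every patient the same risk cannot rank any pair and earns half credit on each.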

Missing data

Because of high rates of missing data (Table 1) on some variables used for prediction, imputation was performed on the cleaned data prior to analysis. The imputation was performed via random forest (function missForest in package missForest) to arrive at a single imputed data set for each of the training and test data. The random forest imputation method trains a random forest on the observed data and uses it to predict imputed values for missing data [36]. To avoid leakage between the training and test data, the forest was trained on only the training observed data and was then used to predict for both training and test sets. All models were fit and evaluated on the imputed training and test sets, respectively. Imputation was performed separately for the two analyses described above (pre-operative only and all variables). In each case, only the predictor variables included in the specified analysis were used in imputation.
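The leakage-avoidance pattern described above, in which the imputation model is learned from the training data only and then applied to both sets, can be shown with a deliberately simple stand-in. The Python sketch below uses column means in place of the random forest imputation (missForest) used in the study, purely to keep the example short. The function names and values are hypothetical, and it assumes every column has at least one observed training value.

```python
from statistics import mean


def fit_column_means(train_rows):
    """Learn one fill value per column from observed training values only."""
    n_cols = len(train_rows[0])
    return [
        mean(row[c] for row in train_rows if row[c] is not None)
        for c in range(n_cols)
    ]


def apply_imputation(rows, fill_values):
    """Replace missing entries (None) with the learned fill values."""
    return [
        [fill_values[c] if value is None else value
         for c, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical data with missing entries marked as None.
train = [[1.0, None], [3.0, 10.0], [None, 20.0]]
test = [[None, 100.0], [5.0, None]]

fill = fit_column_means(train)          # uses training rows only
train_imputed = apply_imputation(train, fill)
test_imputed = apply_imputation(test, fill)
```

Note that the test value of 100.0 never influences the fill values, so nothing about the test set can leak into the fitted models through the imputation step.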

Results

Data characteristics

After data cleaning, 5581 patients were included in the analysis (713 patients excluded for previous hip surgery, 16 more patients excluded based on type of previous injury to same hip). Table 1 describes the characteristics of the population at the time of primary hip arthroscopy and lists all predictor variables considered in the analysis. Of the patients included after data cleaning, 603 (11%) underwent revision surgery during an average follow-up time of 4.25 years (SD 2.51). The population was predominantly female (3079 patients; 55%), the average alpha angle was 67° (SD 14°), average Tönnis grade was 0, and the majority had unilateral hip pain (3824 patients; 69%). Table 2 describes the number of patients who experienced revision at or before 1, 2, and 5 years post primary surgery as well as the number with complete follow-up but no revision, and the number censored before the follow-up time.

Table 2 Description of censoring

Machine learning model performance

The four models exhibited concordance in the moderate range across the follow-up times when restricted to only pre-operative variables (0.62–0.67) and exhibited similar concordance when using all variables (Tables 3, 4). The 95% confidence intervals for model concordance were wide for both analyses, ranging from a low of 0.53 to a high of 0.75, indicating uncertainty about the true concordance of the models. The random survival forest and GBM had a slight edge over the other two models in terms of concordance at 1-, 2-, and 5-year follow-up times using only pre-operative variables. The GBM had the best concordance of the models for the analysis using all variables. In general, the models were well calibrated, with only the random survival forest showing evidence of mis-calibration at 1 year (p value less than 0.01) and slight evidence of mis-calibration at 5 years (p value between 0.01 and 0.05) for the analysis restricted to pre-operative variables. For the analysis using all variables, only the Cox elastic net model showed evidence of mis-calibration at 1 year and slight evidence of mis-calibration at 5 years.

Table 3 Model performance
Table 4 Model performance – all variables

Factors predicting risk of revision surgery

Variables with non-zero coefficients in the pre-operative variable Cox elastic net model were, in order of importance: sex, pre-operative HAGOS Quality of Life score, pre-operative NRS Activity score, and pre-operative HAGOS Symptoms and Sport scores. The relative importance of these variables for predicting probability of revision surgery is shown in the top panel of Fig. 1, where the size of each bar corresponds to the absolute value of the variable’s effect size. Variables in the top third by importance for the other three pre-operative variable models also included pre-operative HAGOS scores, pre-operative NRS Activity score, and sex (random survival forest and super learner). However, age at surgery was the most important variable for these three models (Fig. 1, bottom three panels). The random survival forest and super learner models use permutation-based variable importance, which measures importance as the relative change in model performance upon randomly permuting values of the given variable. The GBM quantifies importance as the difference in error rate that would result if the variable were removed.
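Permutation-based importance can be illustrated with a small, model-agnostic sketch. The Python code below is not the implementation used by the R packages in the study. The function names, the toy prediction rule, and the data are hypothetical. It also reports the mean absolute drop in a performance metric rather than a relative change, which is one of several possible normalisations.

```python
import random


def accuracy(outcomes, predictions):
    """Proportion of predictions that match the observed outcomes."""
    hits = sum(1 for y, p in zip(outcomes, predictions) if y == p)
    return hits / len(outcomes)


def permutation_importance(predict, rows, outcomes, metric, column,
                           n_repeats=20, seed=0):
    """Mean drop in the metric after shuffling one column's values."""
    rng = random.Random(seed)
    baseline = metric(outcomes, [predict(row) for row in rows])
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        permuted = [row[:column] + [value] + row[column + 1:]
                    for row, value in zip(rows, shuffled)]
        score = metric(outcomes, [predict(row) for row in permuted])
        drops.append(baseline - score)
    return sum(drops) / n_repeats


def toy_predict(row):
    # Hypothetical rule that looks only at the first column.
    return 1 if row[0] > 0.5 else 0


rows = [[0.1, 5], [0.9, 3], [0.2, 8], [0.8, 1], [0.3, 2], [0.7, 9]]
outcomes = [0, 1, 0, 1, 0, 1]
imp_used = permutation_importance(toy_predict, rows, outcomes, accuracy, column=0)
imp_unused = permutation_importance(toy_predict, rows, outcomes, accuracy, column=1)
```

Shuffling the column the rule ignores leaves performance unchanged, so its importance is zero, while shuffling the column the rule depends on degrades performance. This is also why importance scores from different models are comparable only by rank, not by magnitude.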

Discussion

The most important finding of this study is that while machine learning analysis of a national hip arthroscopy registry enabled the development of algorithms capable of predicting subsequent revision surgery, the clinical utility of these models is likely limited. Analysis was performed using only variables that would be available in the pre-operative setting and again using the full data set. Both scenarios resulted in well-calibrated models with moderate concordance, but also with wide confidence intervals that approached random chance. Overall, the analysis was limited by a substantial proportion of missing data but encourages optimism for future models if data collection can be improved.

Machine learning represents an approach to health care research that is increasingly being applied to analyse large orthopaedic databases. The main advantage of machine learning relates to the ability of the technique to realise complex associations and relationships within large datasets. With minimal direct human programming, these models can “learn” which factors are associated with a specified outcome and can then create an algorithm with the goal of accurate outcome prediction. The most common machine learning applications in orthopaedic surgery involve clinical prediction modelling and automated image interpretation. It is anticipated that machine learning models will serve as a valuable adjunct for clinicians in the future, guiding clinical discussions at a patient-specific level.

Within the field of hip arthroscopy several studies have now been performed that seek to predict patient-specific outcome following the procedure. Most have focused on patient reported outcome, with Kunze et al. analysing single-surgeon data to predict multiple post-operative endpoints based on different outcome measuring tools [19,20,21,22]. The prediction of subsequent surgery following hip arthroscopy has also been performed by Haeberle et al. based on another single-surgeon database of over 3000 patients [11]. With their study, Haeberle et al. achieved an AUC of 0.77 ± 0.08 for predicting a patient’s risk of subsequent revision hip arthroscopy. These early studies show promise for clinical usefulness of hip arthroscopy prediction models but are of uncertain real-world applicability due to the single-surgeon nature of the databases and lack of external validation.

This study represents the first national registry-based machine learning model for hip arthroscopy outcome prediction. The goal of the present study was to develop an accurate model based on pre-operative variables that could provide a risk estimate for subsequent hip arthroscopy at a patient-specific level. This would allow a surgeon to input their patient’s data into a prediction calculator during the initial patient encounter and estimate that patient’s individual revision surgery risk. This information could then guide expectations and the surgical decision-making discussion with the patient. While the results of this study did demonstrate the ability to predict revision surgery with reasonable accuracy, the wide discrimination confidence interval likely limits the clinical utility of the algorithms.

There are some possible explanations for the inferior model performance of the present study relative to the revision hip arthroscopy model developed by Haeberle et al. [11]. Although overall compliance with the DHAR is between 78% and 97% annually [43], the completeness of the data limits the ability of the models to accurately predict outcome. This is partly because, as the DHAR evolved from its initial stages to the present version, some variables were added, removed, or modified, which contributes to the data inconsistency. Variance within the DHAR is also expected given the multiple-surgeon nature of the registry, while the single-surgeon institutional registry likely benefits from more overall consistency. As more patients are enrolled and the data collection stability improves, it is anticipated that future machine learning analysis of the DHAR may yield improved prediction accuracy.

The variables recorded in the DHAR itself may also limit the ability of machine learning analysis to develop useful risk prediction models. The multiple factors included in the register were chosen by the founding surgeons as they were felt to be the most relevant based on current literature. It is possible that some factors not currently included in the DHAR may in fact be more strongly associated with outcome and thus, their exclusion may bias the models toward suboptimal performance. Future analysis may clarify this limitation and the advancements of other machine learning techniques such as computer vision [18] and natural language processing [39, 40] may make register-based data collection both simpler and more comprehensive.

Substantial missing data represent the main limitation of this study, although there are other limitations to consider. First, four common machine learning models that represent various approaches to variable selection and model complexity were selected for data analysis, but it is possible that a model that was not considered may have performed better. Second, the analysis included all variables in the DHAR but there may be other factors associated with the risk of subsequent surgery which are not included in the registry and therefore not considered in our models. Some examples of factors that may be relevant for outcome prediction include clinical examination findings, rehabilitation details, or raw imaging data files. The main concern regarding clinical applicability of this study lies in the accuracy of the model, with concordance limited by a wide confidence interval approaching random chance. Additionally, the ability to pre-operatively predict who is at risk of subsequent revision hip arthroscopy is likely limited by the endpoint itself. That is, a commonly cited reason for revision surgery is residual cam deformity, a factor that is not known in the pre-operative setting [2, 12, 33, 38].

Although the results from this preliminary study are not suitable for immediate clinical application, they should serve as a baseline for future outcome prediction studies applying machine learning to large hip arthroscopy datasets. Additionally, there is optimism regarding the future development of patient-specific revision risk estimation if data collection can be improved. Accurate prediction of outcome using machine learning relies on both data quantity and quality. As a national registry, the DHAR will naturally continue to grow the quantity of data collected over time as all hip arthroscopy procedures performed in Denmark are captured. Data quality is more challenging to improve upon. Overcoming bias related to the surgeon-selected nature of the variables currently collected by the registry will require ongoing critical assessment over time, and emerging technology like natural language processing for data collection may enable the identification of additional variables that may influence outcome. Another way to potentially improve machine learning driven outcome prediction is through the creation of an international hip arthroscopy register or collaboration between national registers. International collaboration would require a pre-determined definition of a minimum common dataset across registers but would greatly improve predictive power through data sharing. Resulting algorithms could then be implemented into clinical practice to guide outcome expectations and discussions around surgical decision-making in the pre-surgical setting.

Conclusion

The association between pre-surgical factors and outcome following hip arthroscopy is complex. Machine learning analysis of the DHAR produced a model capable of predicting revision surgery risk following primary hip arthroscopy that demonstrated moderate accuracy but likely limited clinical usefulness. Prediction accuracy would benefit from enhanced data quality within the registry and this preliminary study holds promise for future model generation as the DHAR matures. Ongoing collection of high-quality data by the DHAR should enable improved patient-specific outcome prediction that is generalisable across the population.