Introduction

Living kidney donors face the same risk of developing end-stage kidney disease (ESKD) as the general population [1, 2]. However, recent studies have called this statement into question [3, 4]. Many transplantation centers encounter a heterogeneous donor pool that is different from the healthy study cohorts of older investigations. Due to long transplantation waiting lists, donors with a lower starting glomerular filtration rate (GFR) or other risk factors such as smoking history may be eligible for donation.

Therefore, thorough screening before donation is essential. Various pre-donation risk assessments have been developed to identify donors at risk for ESKD [5,6,7]. We use the ESKD risk score for donors, which was first published in 2016 by Grams et al. [7]. All risk scores provide applicable tools for clinical practice but are based on classical statistical approaches. We hypothesized that artificial intelligence (AI) has the potential to improve these predictions.

Whereas classic statistics outline relationships between a data sample and a population, Machine Learning (ML), a subgroup of artificial intelligence, is capable of making personalized predictions about a desired outcome by attempting to uncover hidden patterns within the provided data [8]. The goal of identifying borderline donors may be facilitated with machine learning, enabling this donor group to be educated in detail about their possibly increased risk of kidney failure after donation and initiating intensified follow-up care.

The main focus of machine learning studies in transplantation has been on the outcome of graft function and the prediction of graft failure [9,10,11]. When it comes to donors, machine learning research is very scarce. To our knowledge, there is only one recent work using machine learning, carried out by a Korean study group, to predict renal adaptation of living kidney donors [12].

Our study aims to test different machine learning techniques to classify the average eGFR slope or the accelerated declining eGFR slope of living kidney donors, utilizing distinct subsets of the provided data, including parameters from the ESKD risk score, clinical data, and histopathological parameters. We chose the eGFR slope as our target for predictions since it represents a dynamic parameter over time of kidney function.

Methods

Objects and inclusion criteria

For this retrospective study, a total of 238 living kidney donors (sex at birth, female/male [%]: 154 [65]/84 [35]; mean age [standard deviation, SD]: 54 [10]) after donor nephrectomy between 2009 and 2020 at the Department of General, Visceral, Cancer and Transplant Surgery, University Hospital of Cologne, Germany, were included. Hand-assisted retroperitoneoscopic donor nephrectomy (HARP) was the surgical technique used [13]. Donors were included if they had completed 3 years of postoperative follow-up with complete documentation of serum creatinine values pre-donation and at years 1, 2 and 3 after donation, so that the estimated GFR (eGFR) could be calculated at each time point. Included patient characteristics can be divided into three groups:

  1.

    Clinical characteristics of the risk tool for ESKD for kidney donor candidates (age, sex at birth, eGFR, systolic blood pressure, hypertension medication, body mass index [BMI], urine albumin creatinine ratio [ACR] and smoking history) [7]. Non-insulin-dependent diabetes and race were excluded from the dataset due to one-dimensional distribution. We excluded outliers (n = 2) in albumin creatinine ratio to avoid distorting model performance.

  2.

    Other donor characteristics assessed preoperatively (height, weight, smoking pack years, serum creatinine, side of the removed kidney, renal cortex volumetry of the graft and of the remaining kidney, and their ratio [remaining to transplant cortex volumetry]). Renal cortex volumetry was assessed from preoperative computed tomography (CT) scans [14].

  3.

    Histopathological assessment of the time-zero biopsy of the graft (total glomeruli, global glomerulosclerosis, ratio glomerulosclerosis [global glomerulosclerosis to total glomeruli], Banff Lesion Scores [15] of glomerulitis g, tubular atrophy ct, and arteriolar hyalinosis ah). We omitted the other Banff Lesion Scores due to one-dimensional distribution. To ensure that only representative core biopsies were included, a minimum set of ten glomeruli was defined to be representative [16].

The final dataset comprised 22 donor features and a missing feature rate of 17.7%, mainly due to incomplete documentation of the time-zero biopsy. A detailed description of the feature distribution is provided in Table 1. The Ethics Committee of the Faculty of Medicine, University of Cologne, Germany, approved this retrospective study (reference number: 23-1462-retro) and waived the need for patient consent. Data analysis was performed in accordance with relevant guidelines, as outlined by the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) statement [17].

Table 1 Patient characteristics and correlation to eGFR slope with a cut-off decline of -1 mL/min/1.73 m2/year

Labeling, feature pre-processing and engineering

The dataset was dichotomized into two groups based on the overall decline in eGFR (eGFR slope) over the first, second, and third year after donation. We defined an average decline of the eGFR in year 3 of the follow-up at a rate of < 1 mL/min/1.73 m2/year (average eGFR slope) based on the normal decline in kidney function of approximately 1 mL/min/1.73 m2/year [18]. An accelerated decline of the eGFR in year 3 at a rate of ≥ 1 mL/min/1.73 m2/year was considered a relevant deterioration in kidney function and is referred to as an accelerated declining eGFR slope throughout the remainder of this study. Labeling resulted in an unbalanced dataset (average eGFR slope: 185 donors, 78%; accelerated declining eGFR slope: 53 donors, 22%). We therefore used class weights in favor of the underrepresented class. We performed feature engineering of the 7 categorical and 15 continuous variables within scikit-learn Pipelines to ensure proper pre-processing of the respective training and test data. We normalized continuous variables before imputing missing data points with scikit-learn's k-Nearest Neighbor imputer (n_neighbors = 3). Missing values in categorical data were imputed with the most frequent category. All categorical features were then converted into dummy variables with one-hot encoding. For binary variables, the first dummy variable was dropped.
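The labeling rule described above can be illustrated with a minimal sketch. It assumes that the eGFR slope is an ordinary least-squares slope fitted through the eGFR values at follow-up years 1, 2 and 3; the exact slope calculation is not specified in this section, and the function names and donor values below are hypothetical.

```python
from statistics import mean


def egfr_slope(years, egfr_values):
    """Least-squares slope of eGFR over time (mL/min/1.73 m2 per year).

    Assumption: a simple linear fit; the study's own slope formula may differ.
    """
    t_bar = mean(years)
    y_bar = mean(egfr_values)
    numerator = sum((t - t_bar) * (y - y_bar) for t, y in zip(years, egfr_values))
    denominator = sum((t - t_bar) ** 2 for t in years)
    return numerator / denominator


def label_donor(slope, cutoff=-1.0):
    """Return 1 for an accelerated declining slope (decline of >= 1 per year), else 0."""
    return 1 if slope <= cutoff else 0


# Hypothetical donors: eGFR at follow-up years 1, 2 and 3 (illustrative values only).
follow_up_years = [1, 2, 3]
donors = {
    "donor_a": [62.0, 61.5, 61.4],  # nearly stable kidney function
    "donor_b": [58.0, 56.0, 54.0],  # loses 2 mL/min/1.73 m2 per year
}

slopes = {name: egfr_slope(follow_up_years, values) for name, values in donors.items()}
labels = {name: label_donor(slope) for name, slope in slopes.items()}
print(slopes, labels)
```

In this sketch a slope of exactly −1 mL/min/1.73 m2/year falls into the accelerated class, matching the "≥ 1" definition in the text.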

Feature selection via sequential forward selection

We performed machine learning-driven sequential forward selection (SFS) for each algorithm on the entire dataset using the open-source MLxtend library [19]. This methodology is considered model-agnostic, meaning that feature selection is independent of the architecture of the model and is based instead on each feature's influence on performance metrics [20]. The best estimator of each model after hyperparameter search was utilized for sequential forward selection with stratified 5-fold cross-validation (CV), aiming to find the smallest subset of features with the best cross-validated model performance. The evaluation of model performance after feature selection on the training folds was conducted solely on the respective testing fold to prevent data leakage. The important features identified for each model served as a reduced dataset for retraining that model.
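The greedy logic behind sequential forward selection can be sketched in a few lines. This is a hand-rolled illustration, not the MLxtend implementation used in the study. In practice the scoring function would return the mean cross-validated metric of the tuned estimator; here a toy scoring function with invented per-feature gains (`TOY_GAIN`, `toy_cv_score`) stands in for it.

```python
def sequential_forward_selection(features, score_fn):
    """Greedy SFS: repeatedly add the feature that yields the highest score.

    Returns the smallest subset that reaches the best score seen along the path.
    """
    selected = []
    remaining = list(features)
    best_subset, best_score = [], float("-inf")
    while remaining:
        # Score every candidate extension of the current subset.
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        step_score, step_feature = max(scored, key=lambda pair: pair[0])
        selected.append(step_feature)
        remaining.remove(step_feature)
        # Strict ">" keeps the smaller subset when scores tie.
        if step_score > best_score:
            best_subset, best_score = list(selected), step_score
    return best_subset, best_score


# Hypothetical per-feature contributions; a negative value mimics a noisy feature.
TOY_GAIN = {"pack_years": 0.08, "banff_g": 0.05, "bmi": 0.01, "height": -0.02}


def toy_cv_score(subset):
    """Stand-in for a cross-validated score of a model trained on `subset`."""
    return 0.5 + sum(TOY_GAIN[f] for f in subset)


selected_subset, selected_score = sequential_forward_selection(list(TOY_GAIN), toy_cv_score)
print(selected_subset, round(selected_score, 2))
```

With these toy gains the search stops improving once the noisy feature would be added, so the returned subset omits it.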

Study design

The study design contains two major parts to classify the eGFR slope at year three post-donation (Fig. 1):

  1.

    We utilized both the entire dataset and two predefined subsets generated from the entire dataset for model training to evaluate model performance:

    Dataset 1: Parameters of the ESKD risk score (n features = 8, n features after one-hot encoding = 10)

    Dataset 2: Dataset 1 + other clinical parameters (n features = 16, n features after one-hot encoding = 18)

    Dataset 3: Whole dataset including histopathological parameters (n features = 22, n features after one-hot encoding = 26)

  2.

    Feature selection with sequential forward selection was performed only on Dataset 3 for each model. We subsequently used the features selected for each model to retrain that model and to evaluate its performance.

Fig. 1

Flow diagram of the study design. First, distinct subsets (dataset 1 and 2) and the whole dataset 3 were used for model training with Random Forest (RF), XGBoost (XG), Support Vector Machine (SVM) and Logistic regression (LR) to classify eGFR slope of living kidney donors in the third follow-up year (y3). Second, for each model, ML-driven feature selection was performed on the entire dataset, resulting in a correspondingly selected feature dataset for model retraining and predictions.

Machine learning models

We used supervised machine learning techniques for binary classification using the scikit-learn package [3, 4]. Identifying these at-risk donors is still an unmet need in clinical practice.

We used the eGFR slope as our target for predictions. As a dynamic parameter, we consider the eGFR slope to be a better parameter for assessing donor kidney function than just eGFR at a specific time point during follow-up. Particularly for donors with borderline pre-donation eGFR, the extent of eGFR changes over time provides a more comprehensive picture of the current kidney function compared to past time-points, and reflects the approach of clinicians by putting eGFR in a temporal context.

The use of eGFR slope as a surrogate parameter to evaluate kidney function has been discussed in previous literature [35–39]. A recently published meta-analysis reported associations between treatment effects altering the GFR slope and the respective clinical endpoints targeting worsening kidney function. The authors concluded that GFR slope serves as a good surrogate parameter for evaluating kidney function in clinical trials [38], which has also been considered by regulatory agencies such as the U.S. Food and Drug Administration (FDA) [40] and the European Medicines Agency (EMA) [41].

A normal decline in kidney function is approximately 1 mL/min/1.73 m2/year [18]. The median eGFR slope of our donor collective was −0.33 mL/min/1.73 m2/year, which is consistent with previous findings reporting the measured GFR slope of donors to be around −0.4 mL/min/1.73 m2/year [2, 42]. Based on these findings, we set the cut-off for a relevant eGFR slope at −1 mL/min/1.73 m2/year in the third follow-up year. This resulted in an unbalanced dataset with 185 donors in the average eGFR slope cohort and 53 donors in the accelerated declining eGFR slope cohort.
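To make the effect of this imbalance concrete, the following sketch computes inverse-frequency class weights for the two cohorts. It assumes the "balanced" heuristic, n_samples / (n_classes × class count), which is the heuristic scikit-learn applies for `class_weight="balanced"`; the exact weighting scheme used in the study is not stated, and the function name is illustrative.

```python
def balanced_class_weights(class_counts):
    """Inverse-frequency weights: n_samples / (n_classes * count of the class)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Cohort sizes reported in the text.
counts = {"average": 185, "accelerated": 53}
weights = balanced_class_weights(counts)

for label, weight in weights.items():
    print(label, round(weight, 3))
# average 0.643
# accelerated 2.245
```

Under this assumption each misclassified donor in the accelerated cohort counts roughly 3.5 times as much as one in the average cohort, so both cohorts contribute equally to the weighted loss.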

Neither the ESKD risk score nor the descriptive statistics of the other pre-donation donor features used for model training effectively discriminated the donor cohort with the accelerated declining eGFR slope. Therefore, we employed machine learning to identify this donor cohort. Three machine learning models (random forests, extreme gradient boosting, support vector machines) and logistic regression as the state-of-the-art model were utilized to predict accelerated declining eGFR slope of our donor cohort. Overall, no model sufficiently predicted the outcome. None of the models exceeded an AUC of 0.7 or an F1 score of 0.5.
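For readers less familiar with the two metrics, the sketch below computes them from scratch on a small invented example. The AUC is calculated as the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half, and the F1 score as the harmonic mean of precision and recall. The labels, scores, and the 0.5 decision threshold are illustrative assumptions, not study data.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a positive case outranks a negative case."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented example: 1 = accelerated declining eGFR slope, 0 = average eGFR slope.
y_true = [1, 0, 1, 0, 0, 0]
y_score = [0.8, 0.6, 0.4, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # assumed 0.5 threshold

auc_value = roc_auc(y_true, y_score)
f1_value = f1_score(y_true, y_pred)
print(auc_value, f1_value)
```

The example shows why the two metrics can diverge on an unbalanced dataset: the ranking is fairly good (7 of 8 pairs ordered correctly), while a single missed and a single false-positive case already halve the F1 score.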

Also, Jeon et al. [12] reported mediocre performance with machine learning in predicting the percentage of renal adaptation (6–12 months post-donation eGFR/pre-donation eGFR, cut-off: 65% of pre-donation eGFR after donation) of kidney donors after training with preoperatively assessed donor features. The authors reported an AUC of 0.63, which is similar to our results. They additionally trained the machine learning model to predict the absolute median eGFR of the second half of the first follow-up year (cut-off: 60 mL/min/1.73 m2). Here, clearly improved model performance with an AUC of 0.85 was observed. However, we consider predicting excretory kidney function decline to be superior to predicting GFR alone, as discussed above.

Despite the low predictive performance of the machine learning models, there are some observed trends of the distinct model performances when trained on different data subsets. The first data subset we used for model training included patient characteristics for calculating the ESKD risk score for kidney donors. The risk score was first introduced in 2016 by Grams et al. [7] after observing more than 4,000,000 individuals who were formally eligible for kidney donation, for 4–16 years. In our transplant center, we use this risk score to screen for potential donor candidates and to exclude donors at risk. Our interest was to find out whether these well-established parameters are sufficient to predict accelerated declining eGFR slope with machine learning.

The calculated 15-year and lifetime ESKD risk score for our donor cohort was below 1% for both eGFR-slope cohorts. Interestingly, a statistically significant difference was noted for the 15-year ESKD risk score. However, the differences in the absolute values were marginal. The calculated risk scores themselves were not included in model training. Likewise, we did not consider non-insulin dependent diabetes and race for model training due to one-dimensionality in our patient cohort.

The best performance using the risk score dataset was noted for support vector machines, which are known to be efficient with small datasets [43]. However, differences in model performance compared to the other models were marginal. In our study, machine learning models failed to adequately predict accelerated declining eGFR slope after being trained on previously evaluated patient characteristics for ESKD risk-prediction.

Subsequently, we integrated more features into the dataset and expected improved predictions related to the greater amount of information. We included additional donor details such as body weight, height, pack years, or renal cortex volumetry from CT scans (Dataset 2). For the entire dataset (Dataset 3), results of the time-zero biopsy, including Banff Lesion Scores, were added. Even though the histology of living donor kidneys is not available in pre-donation screening, results of the time-zero biopsy might affect the remaining renal outcome of living kidney donors.

Including more parameters led to slightly better results for random forests and extreme gradient boosting but worsened the predictions for support vector machines. Logistic regression showed consistent performance across the different data subsets. Barah et al. [44] also reported a slight improvement in model performance for predicting kidney discard with machine learning after adding parameters from the graft biopsy. Nevertheless, expanding the dataset with predefined features did not improve predicting donors with an accelerated declining eGFR slope.

Finally, we applied machine learning-driven feature selection to the whole dataset. We used model agnostic sequential feature selection in a forward approach by sequentially adding the most informative features to enhance model performance in k-fold cross-validation [20]. After sequential forward selection, each model exhibited a different subset of best predictive features. Eight and six best predictive features were found for logistic regression/support vector machines and extreme gradient boosting/random forests, respectively. We then retrained each model with the respective selected features. A clear improvement in prediction was observed for extreme gradient boosting and random forests. Both ensemble methods revealed a k-fold AUC of 0.66 and a k-fold F1 score of 0.44, and outperformed logistic regression and support vector machines which did not show improved predictive performance. These findings are consistent with previous machine learning studies in kidney transplantation: Feature selection improved predictive performance [9], and random forests or extreme gradient boosting outperformed logistic regression [9, 10, 44].

The best predictive features that appeared in all four models after sequential forward selection were the features related to smoking, namely smoking history or pack years, and the Banff Lesion Score g (glomerulitis). Smoking as a cardiovascular risk factor is widely known to enhance the incidence of developing chronic kidney disease [45]. Therefore, it is not surprising that all four models use features related to smoking to improve predictions for accelerated declining eGFR slope.

The Banff classification is designed for allograft pathologies [15]. Nevertheless, pathologies in the time-zero biopsy provide insights about the donor’s remaining kidney. The Banff g lesion score classifies the proportion of microvascular inflammation within glomeruli which may be linked to antibody-mediated graft rejection or to recurrent or de novo glomerulonephritis [15]. Previous studies reported that glomerulitis was associated with allograft pathologies or graft failure [46–50]. The conclusive determination of whether the reasons for glomerulitis may be recipient-associated, such as humoral rejection or recurrence of an underlying condition, is hindered by inconsistent documentation regarding the timing of biopsy acquisition in relation to reperfusion. Whether the presence of glomerulitis in the time-zero biopsy of the graft allows a conclusion to be drawn about the outcome of the remaining kidney function of living kidney donors needs to be investigated in further studies.

From a data science perspective, we faced a few hurdles that accounted for the moderate model performances. We trained our models on a small, unbalanced dataset that contained missing values. There is a widespread belief that artificial intelligence is designed to only recognize patterns in large amounts of data. However, small datasets are common in the medical field. Althnian et al. [51] empirically investigated the influence of data size on the performance of machine learning models using datasets from the medical domain. They found that it is not the data size itself that affects the predictive ability, but rather how closely the data reflect the general distribution of a patient cohort. These findings are consistent with the results of our study: predictions improved not by including more data but by identifying the predictive features and retraining the models without redundant features.

Our study has several limitations. We used eGFR values instead of measured GFR to calculate the eGFR slope. Our dataset contained missing values, mainly due to incomplete documentation of the histopathological parameters. There are no gold standards in data science for the acceptable number of missing values in a dataset, which thus remains a matter of empirical testing. We did not include all parameters that define the Banff classification due to one-dimensionality. Our dataset stemmed from one transplantation center. The performance of the machine learning models was evaluated by k-fold cross-validation, which allows the ability of the models to generalize to be investigated. To further test the predictive performance and generalizability of the models, an external test set is required for validation.

Conclusion

Our aim was to predict accelerated declining eGFR slope of living kidney donors using machine learning. Training the models with distinct predefined data subsets did not produce satisfactory predictions for any model. However, the predictive performance of the random forests and extreme gradient boosting improved and outperformed logistic regression after training with only important features after machine learning-driven feature selection. Future studies need to be conducted with extended data size to evaluate whether machine learning can sufficiently predict the eGFR slope to identify donors at risk for declining kidney function.