Background

The patellar height is important in the anatomy and biomechanics of the patellofemoral joint. Recently, several studies have shown that an abnormal patellar height is associated with various knee diseases. These diseases include patellar dislocation [1, 2], patellar instability [3], Osgood–Schlatter disease [4, 5], anterior knee pain [6], chondromalacia patella [7], and anterior cruciate ligament (ACL) injuries [8, 9]. Moreover, abnormalities in the patellar height are closely linked to complications and poor recovery after total knee arthroplasty (TKA) [10,11,12], tibial osteotomy [13], and ACL reconstruction [14]. Therefore, early assessment and treatment of abnormal patellar height are vital to effectively control symptoms, prevent and alleviate related diseases, and improve patients’ quality of life.

The patellar height is typically measured directly or indirectly using radiological or magnetic resonance imaging (MRI) methods [15]. However, these standard procedures are time-consuming and repetitive, and they require additional computational support. They are also prone to substantial inter- and intra-observer variability, which may affect the accuracy of the measurements [16, 17].

Recently, deep-learning algorithms have been increasingly applied across the medical field, particularly in orthopedics. Kim et al. [18] automated the detection and segmentation of lumbar vertebrae from radiographs to assess compressive fractures. Similarly, Krogue et al. [19] implemented the automatic identification and classification of hip fractures. Regarding surgical planning, Qu et al. [20] utilized MRI for the precise segmentation of pelvic bone tumors, aiding the development of effective surgical plans for tumor excision and reconstruction. Von Schacky et al. [21] segmented and classified primary bone tumors, whereas Consalvo et al. [22] detected and differentiated Ewing’s sarcoma and acute osteomyelitis. Leung et al. [23] predicted the risk of TKA in patients with osteoarthritis. Ye et al. [24] developed a deep learning-based automatic measurement algorithm using a convolutional neural network (CNN) with VGG16 as the encoder. Kwolek et al. [25] employed the YOLO neural network and U-Net to detect and segment the patellofemoral joint, enabling automatic measurement of the Caton-Deschamps index (CDI) and Blackburne-Peel index.

Deep-learning algorithms are particularly effective at navigating the complex anatomical structures and variations between normal and pathological states in humans. They excel at identifying and learning intricate patterns from extensive datasets, which is crucial for accurately measuring patellar height in various radiographic images [26]. Furthermore, the automation capabilities of deep learning significantly streamline the measurement process, reducing both the variability and the time required for manual assessments [27].

Therefore, our objective was to develop a novel deep learning-based algorithm that automatically measures patellar height, enabling high-precision and rapid analysis of medical images.

Methods

Study aim, design, and setting

This multicenter retrospective study aimed to develop a deep learning-based algorithm to automatically measure patellar height parameters in lateral knee radiographs and evaluate its performance and generalization ability. We utilized a dataset containing X-ray images from three tertiary level A hospitals.

Datasets

The images used in this study’s dataset were obtained from three tertiary A-grade comprehensive hospitals: The Second Affiliated Hospital of ** knees; and (4) unclear superior or inferior poles of the patella or patellar ligament endpoints. The osteoarthritis criteria were used to include radiographs with clear patellar height markers. We retrospectively analyzed 3,923 knee joints, of which 2,341 met the inclusion criteria. A random selection of 90% of the cases from The Second Affiliated Hospital of ** to enhance the model’s robustness and generalization ability.

Model selection

The residual network for human pose estimation (Pose_ResNet) is a human pose estimation method that uses deep residual networks (ResNets) to locate human body keypoints [28, 29]. The network first extracts image features through convolutional layers, which recognize complex patterns in the images and provide the visual information needed to detect keypoints. The output is a set of heatmaps, each corresponding to a specific keypoint and representing the probability distribution of that keypoint’s location within the image. During training, the loss function typically compares the predicted heatmaps with the ground-truth heatmaps to optimize the network parameters. The final keypoint locations are determined through post-processing steps such as finding the local maxima in the heatmaps. The architecture builds on the fundamental principles of ResNet by integrating deep convolutional layers with residual connections. These connections mitigate the vanishing-gradient problem, enabling the training of deeper networks that extract richer features without losing crucial information.
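The heatmap post-processing step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it takes the single global maximum of each heatmap (a simplification of local-maximum search, with no sub-pixel refinement), and the function names `heatmap_argmax` and `decode_keypoints` are hypothetical.

```python
def heatmap_argmax(heatmap):
    """Return (x, y, score) of the highest-scoring cell in a 2D heatmap.

    `heatmap` is a list of rows; x indexes columns and y indexes rows.
    """
    best = (0, 0, heatmap[0][0])
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best[2]:
                best = (x, y, value)
    return best


def decode_keypoints(heatmaps, image_size):
    """Map the peak of each heatmap back to input-image pixel coordinates.

    `image_size` is (width, height) of the network input (assumed here).
    """
    img_w, img_h = image_size
    keypoints = []
    for hm in heatmaps:
        hm_h, hm_w = len(hm), len(hm[0])
        x, y, score = heatmap_argmax(hm)
        # Rescale from the heatmap grid to the image grid.
        keypoints.append((x * img_w / hm_w, y * img_h / hm_h, score))
    return keypoints
```

Each returned tuple holds the estimated x and y position in image pixels and the heatmap score at the peak, which can serve as a rough confidence value.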

The High-Resolution Network for Human Pose Estimation (Pose_HRNet) is an efficient network for human pose estimation designed to improve keypoint detection accuracy by maintaining high-resolution feature maps throughout the process [30,31,32,33]. Starting with a high-resolution subnetwork, it progressively adds high-to-low-resolution subnetworks stage by stage and connects these multiresolution subnetworks in parallel to learn features at different scales. This multiscale approach makes the model more robust to challenges in pose estimation such as occlusions and varying object sizes. Global context and local detail are fused throughout the process by exchanging information across the parallel multiresolution subnetworks. Finally, the keypoints are estimated from the network’s high-resolution output feature maps, which are both semantically rich and spatially precise.
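The information exchange between parallel resolutions can be sketched in a heavily simplified form. In the sketch below, fixed 2 × 2 average pooling and nearest-neighbour upsampling stand in for the learned convolutions that HRNet uses, feature maps are plain nested lists with even dimensions, and all function names are hypothetical; it only illustrates the idea that each branch receives the other branch's features resampled to its own resolution.

```python
def downsample2x(fm):
    """2x2 average pooling (stand-in for a learned strided convolution)."""
    h, w = len(fm), len(fm[0])
    return [
        [(fm[y][x] + fm[y][x + 1] + fm[y + 1][x] + fm[y + 1][x + 1]) / 4
         for x in range(0, w, 2)]
        for y in range(0, h, 2)
    ]


def upsample2x(fm):
    """Nearest-neighbour upsampling: every cell becomes a 2x2 block."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a, b):
    """Element-wise sum of two feature maps of identical shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def exchange(high, low):
    """Fuse a high-resolution map with a map at half its resolution."""
    new_high = add_maps(high, upsample2x(low))   # low -> high branch
    new_low = add_maps(low, downsample2x(high))  # high -> low branch
    return new_high, new_low
```

After the exchange, the high-resolution branch keeps its spatial size while carrying coarse context from the low-resolution branch, and vice versa.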

Performance and generalization of keypoint detection

We assessed the landmark detection performance of five models (pose_resnet_50, pose_resnet_101, pose_resnet_152, pose_hrnet_w32, and pose_hrnet_w48) using root mean square error (RMSE), object keypoint similarity (OKS), and percentage of correct keypoints (PCK). Furthermore, we compared the parameter counts and computational complexities of the models. Subsequently, the most appropriate model was selected and evaluated on the external test datasets to assess its accuracy on knee joint radiographs obtained from different devices.

  • RMSE measures model accuracy as the square root of the mean squared difference between the predicted and actual keypoint positions, indicating the error magnitude.

  • OKS evaluates keypoint detection accuracy by considering object scale and keypoint distance, offering a normalized accuracy measure that penalizes larger errors more.

  • PCK calculates the share of keypoints accurately detected within a set distance from true positions, reflecting the model’s precision in keypoint localization.
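The three metrics listed above can be sketched as follows. This is an illustrative version only: the OKS form follows the commonly used COCO-style definition, and the per-keypoint constants `k`, the object `area`, and the PCK `threshold` are assumptions, since the values used in this study are not stated here.

```python
import math


def distances(pred, true):
    """Euclidean distance between each predicted and true (x, y) keypoint."""
    return [math.hypot(px - tx, py - ty)
            for (px, py), (tx, ty) in zip(pred, true)]


def rmse(pred, true):
    """Root of the mean squared keypoint distance."""
    d = distances(pred, true)
    return math.sqrt(sum(di ** 2 for di in d) / len(d))


def oks(pred, true, area, k):
    """COCO-style object keypoint similarity.

    `area` is the object scale squared; `k` holds one constant per keypoint
    (both assumed inputs). The result lies in (0, 1], with 1 for a perfect match.
    """
    d = distances(pred, true)
    terms = [math.exp(-di ** 2 / (2 * area * ki ** 2))
             for di, ki in zip(d, k)]
    return sum(terms) / len(terms)


def pck(pred, true, threshold):
    """Share of keypoints lying within `threshold` of the true position."""
    d = distances(pred, true)
    return sum(di <= threshold for di in d) / len(d)
```

Lower RMSE is better, whereas higher OKS and PCK are better; OKS decays smoothly with distance, while PCK applies a hard cut-off.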

Performance and generalization of Insall–Salvati index (ISI) measurement

The analyses compared the average ISI test results across the three test sets. In addition, we calculated the intraclass correlation coefficient (ICC) to evaluate the reliability and consistency of the manual and automatic measurements. The ICC reflects the consistency between measurements, with values ≥ 0.75 considered sufficiently reliable [34]. This approach was used to assess the performance and generalizability of ISI measurements.
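As a minimal sketch of these two calculations: the ISI is the ratio of patellar tendon length to patellar length, computed below from four landmark coordinates (the four-point layout is an assumption), and the ICC is shown in its two-way random-effects, absolute-agreement, single-measurement form, ICC(2,1). The ICC variant used in this study is not specified here, so that choice is also an assumption, as are the function names.

```python
import math


def insall_salvati_index(patella_upper, patella_lower, tendon_start, tendon_end):
    """ISI = patellar tendon length / patellar length, from (x, y) landmarks."""
    patella_length = math.dist(patella_upper, patella_lower)
    tendon_length = math.dist(tendon_start, tendon_end)
    return tendon_length / patella_length


def icc_2_1(ratings):
    """ICC(2,1) for a table with one row per knee and one column per rater.

    Here the two "raters" would be the manual and automatic measurements.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
```

An ICC computed this way would then be compared against the 0.75 reliability threshold mentioned above.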

Statistical analysis

Statistical analyses were performed using PASW Statistics v18 (IBM Corp., Armonk, NY, USA) and Microsoft Excel 2019 (Microsoft Corp., Redmond, WA, USA). We employed ANOVA to compare the mean ages of different groups to identify any statistically significant differences in age distributions. For categorical variables such as sex, side, osteoarthritis grade, and surgery status, chi-square tests were utilized to assess the homogeneity of these variables across the groups. All statistical tests were two-tailed, and a p-value of less than 0.05 was considered to indicate statistical significance.

Results

General data distributions

In this study of 2,241 knee radiographs, the average ages of the patients in the training, internal test, first external test, and second external test sets were 50.10 ± 17.77 years (range, 20–87 years), 50.76 ± 16.96 years (range, 21–83 years), 48.88 ± 15.01 years (range, 23–86 years), and 47.43 ± 13.01 years (range, 20–81 years), respectively. Table 1 presents the overall patient characteristics.

Table 1 Patient characteristics

Performance and generalization of keypoint detection

Quantitative analysis

Experiments were conducted on keypoint-detection tasks using different deep-learning models. We primarily compared five models: pose_resnet_50, pose_resnet_101, pose_resnet_152, pose_hrnet_w32, and pose_hrnet_w48. We also examined the impact of two input sizes, 256 × 192 and 384 × 288 pixels, on model performance. In the comparative analysis, we focused on the number of model parameters (#Params), computational complexity (GFLOPs), and three performance metrics: RMSE, OKS, and PCK. The results are shown in Table 2 and Figs. 4, 5, and 6.

Fig. 4 Comparison of RMSE of different models. RMSE, root mean square error

Fig. 5 Comparison of OKS of different models. OKS, object keypoint similarity

Fig. 6 Comparison of PCK of different models. PCK, percentage of correct keypoints

Model Parameters and Computational Complexity: The parameter counts for pose_resnet_50, pose_resnet_101, and pose_resnet_152 were 34.0 M, 53.0 M, and 68.6 M, respectively, with computational complexities of 8.9, 12.4, and 15.7 GFLOPs, respectively. In contrast, pose_hrnet_w32 and pose_hrnet_w48 had parameter counts of 28.5 M and 63.6 M, respectively, with computational complexities of 7.1 and 14.6 GFLOPs, respectively. A positive correlation was observed between the model parameter count and computational complexity. The results of the performance analysis are as follows:

  • RMSE: The pose_hrnet_w48 model achieved the lowest RMSE (10.60), followed by pose_hrnet_w32 (11.84). The pose_resnet series had higher RMSE values, with pose_resnet_152 having the highest at 29.06.

  • OKS: The pose_hrnet_w48 model scored the highest at 0.9592, while pose_hrnet_w32 scored 0.9505. In the pose_resnet series, pose_resnet_50 scored 0.8408.

  • PCK: The pose_hrnet_w48 model had the highest score of 0.8705. The pose_hrnet_w32 model scored 0.7961. The Pose_ResNet series generally exhibited lower PCK scores.
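The positive relationship between parameter count and computational complexity noted above can be checked directly from the five reported value pairs. The sketch below computes a Pearson correlation coefficient by hand; the `pearson` helper is illustrative and not part of the study's analysis.

```python
import math

# (parameters in millions, GFLOPs) as reported in the text.
models = {
    "pose_resnet_50": (34.0, 8.9),
    "pose_resnet_101": (53.0, 12.4),
    "pose_resnet_152": (68.6, 15.7),
    "pose_hrnet_w32": (28.5, 7.1),
    "pose_hrnet_w48": (63.6, 14.6),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


params = [p for p, _ in models.values()]
gflops = [g for _, g in models.values()]
r = pearson(params, gflops)
print(f"Pearson r between #Params and GFLOPs: {r:.3f}")
```

With only five models the coefficient is descriptive rather than inferential, but it comes out close to 1, consistent with the stated positive correlation.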

Table 2 Performance metrics for keypoint detection on the internal test set

Qualitative analysis

From the image analysis perspective, our patellar height measurement system can accurately identify the patella’s upper and lower poles and the patellar tendon’s endpoints. The precision of this identification is high, with virtually no deviation discernible to the human eye (Fig. 7). Moreover, the keypoint detection results after total knee arthroplasty show that the system can still identify and measure the keypoints effectively even when a knee prosthesis has been implanted (Fig. 8).

Fig. 7 Visualized results of keypoint detection on three datasets. From left to right: the internal test set, external test set 1, and external test set 2

Fig. 8 Rendering of keypoint detection after total knee replacement

Generalized analysis

Our study also included an analysis of the model’s generalizability. Tables 3 and 4 show the model’s performance on external test sets from other institutions. In the external test set 1 from ** and rotation. These methods significantly enhanced the model’s robustness. The rationale behind this setup was to test the applicability of the model trained on data from the Second Affiliated Hospital of **’an Jiaotong University in broader contexts. The successful performance of the model on datasets from various hospitals indicates its strong potential for clinical application.

In the future, the application of automated patellar height measurement will reduce human errors and variability associated with manual methods, which is critical for diagnosing diseases related to abnormal patellar heights. Accurate patellar height indices will also aid orthopedic surgeons in devising treatment strategies, including surgical planning such as alignment and balance of the patellofemoral joint in TKA. Additionally, this technology can effectively monitor disease progression and recovery post-surgery, ensuring timely and appropriate interventions that can shorten recovery times and improve overall patient outcomes.

Compared to previously developed models for automatically measuring patellar height using deep learning, this study utilizes a large, multicenter dataset and the advanced HRNet architecture, achieving notably low RMSE and high OKS and PCK scores. Despite these results, the study has some limitations. First, training and testing the model on specific hospital datasets may not fully represent the global diversity of knee joint anatomy. Additionally, county and township hospitals may employ different imaging protocols for knee X-rays, which could impact the model’s ability to accurately detect keypoints. Second, the dataset primarily included images with clear patellar height markers, which may have affected the outcomes in patients with abnormal patellar morphology. Third, as a purely linear measurement method, the ISI only considers the patellar height parameter and does not evaluate other anatomical features, such as the patellar type and posterior tibial slope. To mitigate these limitations, future research should train the model with datasets from different races, regions, populations, and devices and focus on optimizing performance for specific patient groups to broaden the model’s applicability. Closer integration with clinical practice could guide improvements for patients with abnormal patellar morphology. Furthermore, additional anatomical indicators and three-dimensional imaging techniques (such as computed tomography and MRI) should be considered for a more comprehensive assessment of patellar anatomy and function. By implementing these measures, we anticipate addressing the limitations of the existing model and enabling its application in clinical settings.

Conclusions

This study successfully developed and validated a deep learning-based automatic patellar height measurement system. This system can accurately measure the patellar height index, performing on par with experienced radiologists. Extensive dataset testing demonstrated the system’s excellent generalization ability and reliability, particularly for processing radiographs from different hospitals and equipment. This measurement system is expected to help in the assessment, treatment, and postoperative monitoring of knee joint diseases, thereby providing a powerful tool for enhancing patients’ quality of life. Due to the potential bias in the selection of datasets in this study, the model still has some shortcomings. In the future, further optimizing and incorporating more anatomical structure indicators will significantly improve the application scope and accuracy of the system, offering a more precise and comprehensive assessment tool for clinical use.