Introduction

Sarcopenia, a condition characterized by age-related loss of muscle mass, strength and function, has garnered significant attention in recent years. As research progresses, the consensus and focus have shifted from considering total muscle volume alone to exploring muscle quality and performance factors1. Understanding the intricate relationship between muscle composition, function and overall health has become crucial in developing effective interventions for individuals at risk of or affected by sarcopenia2.

The diagnostic consensus on sarcopenia has been revised in recent years to improve accuracy and standardization. In 2019, two consensus updates, known as EWGSOP2 and AWGS2019, introduced diagnostic criteria that consider multiple parameters3,4. These criteria include calf circumference, grip strength, the SARC-F questionnaire and assessments of limb muscle mass using techniques such as dual-energy X-ray absorptiometry (DXA) and bioelectrical impedance analysis (BIA). The updated guidelines reflect a more comprehensive approach to sarcopenia diagnosis, incorporating not only muscle mass but also functional measures and patient-reported performance outcomes.

Numerous studies have investigated the use of cross-sectional area measurements as a proxy for muscle assessment. For instance, the study by Miller et al. revealed that hip fracture caused asymmetry in muscle cross-sectional area and intermuscular adipose tissue (IMAT) on CT scans5. In that study, thigh muscle cross-sectional area (CSA) was smaller on the fractured side by 9.2 cm² (95% CI 5.9, 12.4 cm²), whereas the CSA of IMAT was greater on the fractured side by 2.8 cm² (95% CI 1.9, 3.8 cm²). Another study, by Jung et al., observed a significant reduction in the muscle mass of the hip flexors (iliopsoas and rectus femoris) on postoperative CT scans. These findings imply that targeted exercise for the hip flexors may be beneficial in the rehabilitation of hip fractures6. Furthermore, the study by Byun et al. indicated that measuring the psoas cross-sectional area (PCA) has potential as a diagnostic tool for sarcopenia7. Notably, that study found that the lowest quintile of PCA was significantly associated with mortality in females, with a hazard ratio of 1.76 (95% CI 1.05–2.70, p = 0.017).

By quantifying the muscle area in a single plane, these works aimed to assess muscle size and identify individuals at risk of sarcopenia. However, this approach has limitations. Sole reliance on cross-sectional area overlooks the complexity of individual muscle groups and fails to capture important aspects such as muscle composition, distribution and overall performance8. Moreover, factors such as participant positioning, limb orientation and the choice of imaging plane can influence the accuracy and comparability of cross-sectional area measurements.

The volume of each individual muscle can be obtained from an annotated segmentation mask. However, manual segmentation of CT scans is a time-intensive, laborious and costly task that requires significant effort and expertise. The process often exhibits high variation because tissue characteristics are difficult to differentiate. CT provides excellent visualization of bony structures and dense tissues, but it is limited in differentiating soft tissues such as muscles. Muscles have similar radiodensity, making it challenging to distinguish individual muscles on CT scans. Nevertheless, precise voxel-level segmentation is essential for accurately quantifying the volume of each individual muscle and gaining comprehensive insights into muscle performance.
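As a minimal illustration of how a volume follows from a segmentation mask, the sketch below counts the voxels carrying one label and multiplies by the volume of a single voxel. The function name, the label id and the voxel spacing are hypothetical choices for this example and are not taken from the study.

```python
def muscle_volume_cm3(mask, label, spacing_mm):
    """Volume of one labelled structure: voxel count times single-voxel volume."""
    voxel_mm3 = spacing_mm[0] * spacing_mm[1] * spacing_mm[2]
    count = sum(
        1
        for slice_ in mask
        for row in slice_
        for value in row
        if value == label
    )
    return count * voxel_mm3 / 1000.0  # 1 cm^3 = 1000 mm^3


# Toy 2 x 2 x 3 label volume; label 1 is a hypothetical muscle id, 0 is background.
toy_mask = [
    [[0, 1, 1], [0, 1, 0]],
    [[0, 1, 1], [0, 0, 0]],
]
# Assumed voxel spacing of 1 x 1 x 5 mm, chosen only for illustration.
volume = muscle_volume_cm3(toy_mask, label=1, spacing_mm=(1.0, 1.0, 5.0))
print(volume)
```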

Figure 3

Pre-processing procedure. Presents image pre-processing result.

Deep learning method of automatic muscle segmentation

Our study centers on semantic segmentation, a crucial task in computer vision and particularly in medical imaging, which delineates and classifies regions of interest within an image16. In the context of muscle segmentation, semantic segmentation assigns each voxel in the image to a specific class, such as muscle tissue, bone tissue or background. The main challenge is to capture the intricate details and variations in the images accurately, including differences in patient size, position and tissue texture, while also dealing with noise and other imaging artifacts. Because the boundaries between individual muscles are ambiguous in CT scans, our study requires precise voxel-level segmentation9. For automatic segmentation, we adopted the UNETR architecture14, shown in Fig. 4.

Figure 4

UNETR architecture. Overview of the UNETR architecture. The model uses a transformer to extract sequence representations from multiple layers, and these representations are merged with the decoder through skip connections. The output sizes shown in the figure correspond to a patch dimension of N = 16 and an embedding size of C = 768.

The UNETR model leverages transformers, which have shown exceptional performance in sequence modeling domains such as natural language processing (NLP). The task of 3D medical image segmentation is reformulated as a sequence-to-sequence prediction problem: UNETR employs a transformer as the encoder to learn sequence representations of the input volume, effectively capturing global multi-scale information. The overall network follows a U-shaped design, reminiscent of the original U-Net architecture, which is known for its success in biomedical image segmentation. The transformer encoder is directly connected to a decoder via skip connections at different resolutions, from which the final semantic segmentation output is computed. This design enables the model to capture both high-level contextual information and low-level spatial details, making it well suited to individual thigh muscle segmentation.
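To make the sequence view concrete, the following sketch computes how many patch tokens a 3D volume yields for the patch dimension N = 16 and embedding size C = 768 quoted in the Fig. 4 caption. The 96 × 96 × 96 input size and the function name are assumptions made for illustration; the input size used in the study is not stated here.

```python
def unetr_sequence_shape(volume_shape, patch_size=16, embed_dim=768):
    """Return (number of patch tokens, embedding size) for a 3D input volume."""
    if any(dim % patch_size != 0 for dim in volume_shape):
        raise ValueError("each dimension must be divisible by the patch size")
    tokens = 1
    for dim in volume_shape:
        tokens *= dim // patch_size  # non-overlapping patches along this axis
    return tokens, embed_dim


# Assumed input size, chosen only for illustration.
seq_shape = unetr_sequence_shape((96, 96, 96))
print(seq_shape)
```

Each token corresponds to one non-overlapping 16 × 16 × 16 patch, so the transformer sees the whole volume as a single sequence rather than through a growing receptive field.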

Training of automatic muscle segmentation model

Training was conducted on an NVIDIA DGX A100 workstation (NVIDIA A100 GPU) running Ubuntu 20.04. The deep learning-based segmentation model was implemented with the PyTorch and MONAI frameworks. During training, a loss based on the Dice coefficient was used, which is a common choice for semantic segmentation tasks17. The AdamW optimizer was configured with a learning rate of 8e-5 and a weight decay of 1e-5, and the model was trained with a batch size of 2 for 5000 epochs. The configuration was determined through a grid search in which learning rates between 1e-5 and 1e-3 were explored. The final hyperparameters were chosen heuristically, as the values that achieved the best performance on the current dataset for the muscle segmentation model.
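The reported settings and the idea of a Dice-based loss can be sketched in plain Python as follows. This is a dependency-free illustration for a single class and is not the MONAI loss used in the study; the function name, the smoothing term `eps` and the dictionary keys are assumptions.

```python
# Hyperparameters as reported in the text above.
TRAINING_CONFIG = {
    "optimizer": "AdamW",
    "learning_rate": 8e-5,
    "weight_decay": 1e-5,
    "batch_size": 2,
    "epochs": 5000,
}


def soft_dice_loss(probs, target, eps=1e-5):
    """1 - soft Dice for one class; probs and target are flat lists of equal length."""
    intersection = sum(p * t for p, t in zip(probs, target))
    denominator = sum(probs) + sum(target)
    # eps (an assumed smoothing term) keeps the ratio defined for empty masks.
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


# Predicted foreground probabilities versus a binary ground truth mask (toy values).
loss = soft_dice_loss([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0])
print(loss)
```

Because the loss works on probabilities rather than thresholded labels, it stays differentiable, which is what allows an overlap measure to be used directly as a training objective.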

Intraclass correlation coefficient (ICC)

To assess the consistency of the predicted results, we computed the intraclass correlation coefficient (ICC) between the manual segmentation masks annotated by two researchers and the masks predicted by our proposed model on whole-thigh CT scans. The ICC is a statistical measure of the consistency of ratings performed on the same subjects or objects. Based on the 95% confidence interval of the ICC estimate, values below 0.5 are generally taken to indicate poor reliability, values between 0.5 and 0.75 moderate reliability, values between 0.75 and 0.9 good reliability and values greater than 0.90 excellent reliability18.
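The text does not state which ICC form was used. As one possible reading, the sketch below computes ICC(2,1) (two-way random effects, absolute agreement, single rater) from a subjects × raters table and maps a value to the reliability bands listed above. The choice of ICC(2,1), the function names and the toy volumes are all assumptions.

```python
def icc_2_1(ratings):
    """ICC(2,1) from a list of rows (one per subject, one value per rater)."""
    n = len(ratings)        # number of subjects
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def reliability_band(icc):
    """Map an ICC value to the qualitative bands quoted in the text."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"


# Hypothetical muscle volumes (cm^3): annotator A, annotator B, model prediction.
toy_volumes = [
    [310.0, 305.0, 308.0],
    [452.0, 460.0, 455.0],
    [275.0, 270.0, 279.0],
    [390.0, 398.0, 392.0],
    [505.0, 499.0, 503.0],
]
value = icc_2_1(toy_volumes)
print(value, reliability_band(value))
```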

By calculating the ICC value, we aimed to determine the level of agreement between the manual annotations and the predictions, thereby assessing the consistency and reliability of the model’s performance.

Evaluation metrics

In evaluating the performance of our model, we employed several metrics to assess its accuracy, including (1) Dice score (DC), (2) Average symmetric surface distance (ASSD), (3) Volume correlation (VC), (4) Relative absolute volume difference (RAVD) and (5) Hausdorff distance (HD)10,11.

The (1) Dice score (DC) measures the overlap between the model's segmentation results and the ground truth, providing a clear indication of the model's accuracy in segmenting muscles. The (2) Average Symmetric Surface Distance (ASSD) calculates the mean distance between the contours of the predicted and actual segmentations, showcasing the model's ability to precisely outline muscle shapes. (3) Volume Correlation (VC) evaluates the correlation between the segmented and actual muscle volumes, demonstrating the model's accuracy in volume estimation. The (4) Relative Absolute Volume Difference (RAVD) sheds light on the volume discrepancies between the model's segmentation and the ground truth, acting as a gauge of the model's consistency in volume representation. The (5) Hausdorff Distance (HD) measures the greatest distance between the surfaces of the predicted and actual segmentations, ensuring the accuracy of the segmentation right up to its furthest points, thus emphasizing the model's capability in accurately defining muscle boundaries.

Each metric has its own advantages and limitations, allowing us to evaluate the performance of the models comprehensively from several perspectives.

The (1) Dice score, also known as F1 score, quantifies the overlap between the predicted segmentation and the ground truth. It is calculated as twice the intersection of the predicted and ground truth regions divided by the sum of their individual volumes.

$$Dice \;Score= \frac{2TP}{2TP+FP+FN}$$

The Dice score accounts for true positives (TP), false positives (FP) and false negatives (FN), and is widely used because it is easy to interpret. However, it can be sensitive to class imbalance: for small structures, a few misclassified voxels change the score substantially.
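A minimal sketch of the Dice score on binary masks, using the standard form 2TP / (2TP + FP + FN), is given below. The masks are flattened to plain lists, and returning 1.0 when both masks are empty is a convention assumed here rather than something stated in the text.

```python
def dice_score(pred, truth):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    denominator = 2 * tp + fp + fn
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if denominator == 0 else 2 * tp / denominator


# Toy masks with TP = 2, FP = 1, FN = 1.
dice = dice_score([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
print(dice)
```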

The (2) Average symmetric surface distance measures the average distance between the predicted surface and the ground truth surface, providing insight into localization accuracy.

$$ASSD=\frac{1}{{N}_{P}+{N}_{G}}\left(\sum_{i=1}^{{N}_{P}}{d}_{i}+\sum_{j=1}^{{N}_{G}}{d}_{j}^{\prime}\right)$$
$${d}_{i}=\text{distance from the } i\text{th point on the predicted surface to the nearest point on the ground truth surface}$$
$${d}_{j}^{\prime}=\text{distance from the } j\text{th point on the ground truth surface to the nearest point on the predicted surface}$$
$${N}_{P}, {N}_{G}=\text{number of points on the predicted and ground truth surfaces, respectively}$$

However, the ASSD does not consider volumetric differences and is influenced by outliers.
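The sketch below computes ASSD under its common definition, assumed here: the nearest-surface distances in both directions are summed and divided by the total number of surface points. Surfaces are represented as small lists of 3D points, and the brute-force nearest-point search and the function names are illustrative choices only.

```python
import math


def nearest_distances(source, target):
    """Distance from every point in source to its nearest point in target."""
    return [min(math.dist(s, t) for t in target) for s in source]


def assd(pred_surface, truth_surface):
    """Average symmetric surface distance between two surface point sets."""
    d_pred = nearest_distances(pred_surface, truth_surface)
    d_truth = nearest_distances(truth_surface, pred_surface)
    return (sum(d_pred) + sum(d_truth)) / (len(d_pred) + len(d_truth))


# Toy surfaces; the last ground truth point is deliberately far from the prediction.
pred_points = [(0, 0, 0), (1, 0, 0)]
truth_points = [(0, 0, 1), (1, 0, 1), (4, 0, 1)]
assd_value = assd(pred_points, truth_points)
print(assd_value)
```

In this toy case four of the five distances equal 1 and one is much larger, so the average is pulled upward by a single distant point, which is the outlier sensitivity noted above.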

(3) Volume correlation quantifies the correlation between predicted and ground truth volumes, indicating overall size agreement.

$$Volume \;Correlation=\frac{\mathit{cov}\left({V}_{P}, {V}_{gt}\right)}{\sqrt{\mathit{var}\left({V}_{P}\right) \times \mathit{var}\left({V}_{gt}\right)}}$$

Volume correlation is calculated as Pearson's correlation between the predicted and ground truth volumes only; it therefore ignores spatial correspondence and may not capture localized errors.
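As a small worked example, the sketch below computes Pearson's correlation between lists of predicted and ground truth volumes. The volumes are hypothetical, and the second call illustrates a further limitation: a constant offset in every prediction leaves the correlation at 1.

```python
def pearson_correlation(xs, ys):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # The 1/n factors of covariance and variances cancel in the ratio.
    return cov / (var_x * var_y) ** 0.5


# Hypothetical muscle volumes in cm^3.
truth_volumes = [310.0, 452.0, 275.0, 390.0]
pred_volumes = [305.0, 460.0, 280.0, 385.0]
vc = pearson_correlation(pred_volumes, truth_volumes)

# Every prediction is 20 cm^3 too large, yet the correlation is still 1.
biased = [v + 20.0 for v in truth_volumes]
vc_biased = pearson_correlation(biased, truth_volumes)
print(vc, vc_biased)
```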

The (4) Relative absolute volume difference measures the percentage of difference between predicted and ground truth volumes, indicating volume estimation accuracy.

$$RAVD=\frac{\left|{V}_{P} -{V}_{gt}\right|}{{V}_{gt}}\times 100$$

Like volume correlation, however, RAVD does not capture spatial correspondence.
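A short sketch of RAVD as a percentage follows, taking the absolute value of the volume difference as the name of the metric implies. The function name and the example volumes are hypothetical.

```python
def ravd_percent(volume_pred, volume_truth):
    """Relative absolute volume difference, in percent of the ground truth volume."""
    if volume_truth == 0:
        raise ValueError("ground truth volume must be non-zero")
    return abs(volume_pred - volume_truth) / volume_truth * 100.0


# Hypothetical volumes in cm^3: under- and over-estimation by the same amount.
under = ravd_percent(95.0, 100.0)
over = ravd_percent(105.0, 100.0)
print(under, over)
```

Under- and over-estimation by the same amount give the same value, so RAVD reports the size of a volume error but not its direction.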

The (5) Hausdorff distance measures the maximum distance between the predicted and ground truth surfaces, indicating the worst-case localization error. It is therefore sensitive to outliers and noise, and it does not consider volumetric differences.

$$HD=\mathit{max}\left(h\left(P,G\right), h\left(G,P\right)\right)$$
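The following sketch evaluates this formula on small point sets, assuming the usual directed distance h(P, G): for each point of P, take the distance to its nearest point in G, then take the largest of these values. The brute-force search and the function names are illustrative choices.

```python
import math


def directed_hausdorff(source, target):
    """h(source, target): largest nearest-point distance from source to target."""
    return max(min(math.dist(s, t) for t in target) for s in source)


def hausdorff_distance(pred_surface, truth_surface):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(
        directed_hausdorff(pred_surface, truth_surface),
        directed_hausdorff(truth_surface, pred_surface),
    )


# Toy surfaces; one ground truth point lies far from the predicted surface.
surface_p = [(0, 0, 0), (1, 0, 0)]
surface_g = [(0, 0, 1), (1, 0, 1), (4, 0, 1)]
hd = hausdorff_distance(surface_p, surface_g)
print(hd)
```

Here h(P, G) is 1 while h(G, P) is larger because of the single distant point, so that point alone determines the result.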

Ethical standards

The study adhered to the principles of the Declaration of Helsinki and was approved by the IRB at Gyeongsang National University Hospital (IRB No. GNUH 2022-01-032-008). All research procedures were carried out in strict adherence to ethical standards, including protection of participants' privacy, confidentiality and rights.