Introduction

Accurate anatomic region labeling of medical images is required for classification of the body parts included in medical imaging studies. Body part study labels contain key information used to search, sort, transfer, and display medical imaging datasets across clinical and research healthcare systems [1]. Unfortunately, with the increase in multisystem imaging techniques and the consolidation or sharing of Picture Archiving and Communication System (PACS) datasets, currently implemented body part image labeling methods can fall short, resulting in incomplete selection and presentation of relevant imaging studies in clinical viewers. Additionally, the increased demand for automated image-based post-processing workflows, automated selection of studies for clinical AI analysis, and automated anatomy-based study selection for the development of AI research datasets has accelerated the need for more efficient and reliable anatomical image labeling techniques.

Ideally, labeling of cross-sectional medical images should accurately reflect the anatomy contained in each individual image and identify all body regions included in a study. Currently, for MR and CT, body region labels applied at the image, series, and study level are often limited to one predominant body region (e.g., chest or abdomen) and either do not indicate the other body regions included in the scan or do not define a body region at all (e.g., PET-CT or whole-body MR). Furthermore, the lack of standardization of anatomic labels between institutions and human data entry errors both contribute to unreliable anatomy-based labeling of imaging studies. These labeling limitations can adversely affect imaging workflows. They can also compromise image interpretation if automated hanging protocols fail to display all information relevant to accurate interpretation, or if data are incorrectly selected for automated post-processing, including clinical AI workflows. Finally, these limitations force the use of manual search strategies for procuring anatomically based datasets for AI research, which impedes rapid development.

We describe two pixel-based models that automatically identify 17 body regions in CT studies (CT model) and 18 body regions in MRI studies (MRI model). Our approach addresses some of the limitations of previous attempts to tackle this classification problem with supervised and unsupervised deep learning techniques, which have reported accuracy results ranging from 72 to 92% [2, 10]. Of the labeled data, 2.5% was reviewed, with an equal number of studies assigned for all body regions.

Table 1 Anatomical landmarks for all 18 body region classes
Fig. 1

Labeling can be done on any cross-sectional series (axial, sagittal, coronal, or oblique). In this example, a rectangle is first drawn on the axial plane to indicate the pelvis area and then extended on the coronal viewport down to the lesser trochanter. This creates a thin 3D bounding box that can be easily manipulated in all dimensions to cover the whole anatomy. The 3D bounding box label is automatically carried over to all other series within the same frame of reference using the patient coordinate system. With this technique, hundreds of images can be labeled in a few seconds with a handful of clicks. If needed, AI body region inference results can be displayed with a color code associated with each anatomical class (see bottom color bar). For this image, the AI prediction indicates a pelvis (pink) with a confidence level of 0.825. Lower in the scan, thigh images (purple) are also correctly predicted
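As an illustration of how a 2D label drawn on one series can be propagated to other series sharing the same frame of reference, the sketch below applies the standard DICOM image-to-patient coordinate mapping (PS3.3 C.7.6.2.1.1). The function name and implementation are illustrative and are not taken from the labeling tool described in the paper.

```python
import numpy as np

def pixel_to_patient(ipp, iop, pixel_spacing, col, row):
    """Map a (col, row) pixel index on a slice to DICOM patient coordinates (mm).

    ipp: ImagePositionPatient (0020,0032), three floats
    iop: ImageOrientationPatient (0020,0037), six floats (row cosines, then column cosines)
    pixel_spacing: PixelSpacing (0028,0030) as [row_spacing, col_spacing]
    """
    ipp = np.asarray(ipp, dtype=float)
    row_dir = np.asarray(iop[:3], dtype=float)   # direction of increasing column index
    col_dir = np.asarray(iop[3:], dtype=float)   # direction of increasing row index
    row_spacing, col_spacing = (float(s) for s in pixel_spacing)
    return ipp + col * col_spacing * row_dir + row * row_spacing * col_dir
```

The corners of a rectangle drawn on one series can thus be expressed in patient space and re-projected onto any other series that shares the same FrameOfReferenceUID.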

Data Partitions

Test and validation datasets were stratified by patient ID so that no patient could be present in both datasets. Labeled datasets were organized according to the main body region and sorted by study size. A 75/25 split between the training and validation patient datasets was performed for each body region. To estimate the size of the test datasets, a strict survey sampling model [11] was used, assuming a model at least 90% accurate, a 95% confidence interval, and a 10% relative error. Based on this model, it was determined that at least 7600 images per body region were needed for CT and at least 3600 images per body region for MRI.
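For reference, the sketch below computes a base sample size from the stated assumptions (90% accuracy, 95% confidence, 10% relative error) using the standard survey formula for estimating a proportion. It is illustrative only; the per-region figures reported above result from additional adjustments in [11] (e.g., design effects for correlated slices within a series) that are not reproduced here.

```python
import math

def base_sample_size(p=0.90, confidence_z=1.96, relative_error=0.10):
    """Base survey sample size for estimating a proportion p with a given
    relative error (illustrative; adjustments from [11] are not included)."""
    absolute_error = relative_error * p
    return math.ceil(confidence_z ** 2 * p * (1 - p) / absolute_error ** 2)

# base_sample_size() -> 43 images before design-effect and clustering
# adjustments; the study's final per-region targets are substantially larger.
```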

Model

The classification task is composed of multiple stages that are detailed in Fig. 2. As a first step, we used the standard ResNet50V2 model [12] in a multi-class framework. Following the 2D CNN classifier, a few post-processing steps were applied at the series level. First, a rule engine merged the abdomen-chest class into the abdomen and chest classes and classified an entire series as breast if at least 50% of the images in the series were classified as such. Last, a smoothing step removed labels inconsistent with those in their immediate vicinity, increasing label consistency and decreasing noise.
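The following is a schematic sketch of these series-level rules, assuming per-image class labels as input. The abdomen-chest rule is implemented here as described in the Fig. 2 caption (the predominant class wins), and the smoothing step is interpreted as a sliding majority vote; the actual implementation may differ.

```python
from collections import Counter

def postprocess_series(labels, window=3):
    """Schematic series-level post-processing (names and details are illustrative)."""
    counts = Counter(labels)

    # Rule 1: resolve the mixed "abdomen-chest" class using whichever of
    # abdomen or chest is predominant in the series (default: abdomen).
    dominant = "chest" if counts["chest"] > counts["abdomen"] else "abdomen"
    labels = [dominant if lbl == "abdomen-chest" else lbl for lbl in labels]

    # Rule 2: call the series "breast" only when at least 50% of its
    # images were classified as breast.
    if counts["breast"] >= 0.5 * len(labels):
        labels = ["breast"] * len(labels)

    # Rule 3: sliding majority vote to remove isolated labels that are
    # inconsistent with their immediate neighbours.
    half = window // 2
    return [
        Counter(labels[max(0, i - half): i + half + 1]).most_common(1)[0][0]
        for i in range(len(labels))
    ]
```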

Fig. 2

Body region processing steps. To accommodate the model's input, several pre-processing steps were applied at the image level. First, pixel values were clipped to fit the interval [mean − 4*std, mean + 4*std], where mean and std correspond to the mean value and standard deviation of the pixels in the image. Second, pixel values were normalized using the transform (pixel value − mean)/(2*std) to obtain input values in the range −1 to 1, matching the pretrained model's requirements. Third, the grayscale images were converted to RGB images by copying the pixel data from the first channel to the second and third channels. Last, each image was resized with zero padding to fit the model's required input size of 224 × 224 pixels. Following the 2D CNN image classifier, two rules were applied at the series level. The first merged the abdomen-chest class into either the abdomen or chest class, depending on which body region was predominant; in the absence of chest and abdomen predictions, an abdomen-chest prediction was classified as abdomen. The second rule classified a series as breast only when at least 50% of the images in the series were classified as such, eliminating spurious predictions in noisy breast acquisitions. In the last stage of post-processing, we first applied a routine at the series level to remove outlier labels and then applied a moving average filter with a window size of three images to smooth the results, so that a continuity in classified labels could be observed throughout the series
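A minimal sketch of the image-level pre-processing described in this caption (clipping, normalization, channel replication, and padded resize), assuming a NumPy array input; the helper name and the small epsilon are illustrative additions.

```python
import numpy as np
import tensorflow as tf

def preprocess_image(pixels, target_size=224):
    """Image-level pre-processing sketch following the steps described above."""
    pixels = np.asarray(pixels, dtype=np.float32)
    mean, std = pixels.mean(), pixels.std()

    # 1. Clip pixel values to [mean - 4*std, mean + 4*std].
    pixels = np.clip(pixels, mean - 4 * std, mean + 4 * std)

    # 2. Normalize to roughly [-1, 1]: (pixel - mean) / (2 * std).
    pixels = (pixels - mean) / (2 * std + 1e-8)  # epsilon added for safety

    # 3. Replicate the grayscale channel to obtain a 3-channel RGB image.
    rgb = np.stack([pixels, pixels, pixels], axis=-1)

    # 4. Resize with zero padding to the model input size of 224 x 224.
    return tf.image.resize_with_pad(rgb, target_size, target_size).numpy()
```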

Fig. 3

Confusion matrix for the 262,326 images in the CT test database with a threshold of 0.5. Rows represent the predictions, and columns represent the ground truth

Hardware and Framework

The ML experiments took place in a containerized cloud environment using TensorFlow 2.3.0. The Docker images were built with GPU support. Argo was used to manage the execution of the data ingestion, training, evaluation, and reporting workflows, while MLflow was used to manage the experimental results and generated artifacts such as model checkpoints, reports, and figures. An 8-core CPU, 30 GB RAM, NVIDIA V100 GPU cloud instance was used for training both models. An 8-core CPU, 30 GB RAM, NVIDIA T4 GPU cloud instance was used for running the TensorFlow Serving inference engine for both models. All the pre-processing and post-processing steps were written in Python, while the ingestion control, results aggregation, and dispatch were implemented using Node.js.
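For illustration, inference against a TensorFlow Serving engine can be issued through its standard REST API as sketched below. The host, default port 8501, and model name "body_region_ct" are assumptions, not values from the paper.

```python
import json
import numpy as np
import requests

def predict_series(images: np.ndarray, host: str = "localhost") -> np.ndarray:
    """Send pre-processed slices, shape (N, 224, 224, 3), to TensorFlow Serving.

    Model name and port are illustrative placeholders.
    """
    url = f"http://{host}:8501/v1/models/body_region_ct:predict"
    payload = json.dumps({"instances": images.tolist()})
    response = requests.post(url, data=payload, timeout=60)
    response.raise_for_status()
    # Each prediction is a softmax vector over the body-region classes.
    return np.array(response.json()["predictions"])
```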

Training

We enriched our dataset by applying spatial deformations to a random set of images in each training epoch. These transforms include rotation within an angle of ±π/10, translation and shear with a maximum of 10% of the image size in both directions, scaling with a maximum of 20% of the image size in both directions, and bilinear interpolation. The transformations were applied using built-in TensorFlow library functions. We used a transfer learning approach with model weights from a ResNet50V2 model pretrained on the ImageNet vision benchmark dataset. The training hyperparameters are listed in Supplemental Materials – Training Parameters. The loss function was categorical cross-entropy, which is well suited to the multi-class case. We trained each model with all available slices in each series.
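The sketch below approximates this augmentation and transfer-learning setup with standard TensorFlow/Keras components. The actual hyperparameters are given in the Supplemental Materials, so the values used here (optimizer, shear setting, batch size, class count) are placeholders only.

```python
import tensorflow as tf

NUM_CLASSES = 17  # 17 body regions for the CT model, 18 for MRI (plus any merged classes)

# Augmentation approximating the described transforms: rotation within
# +/- pi/10 (18 degrees), 10% translation, shear, 20% zoom, bilinear
# interpolation with zero fill. Exact settings are assumptions.
augmenter = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=18,
    width_shift_range=0.1,
    height_shift_range=0.1,
    shear_range=10,
    zoom_range=0.2,
    fill_mode="constant",
    cval=0.0,
)

# Transfer learning from ResNet50V2 pretrained on ImageNet.
base = tf.keras.applications.ResNet50V2(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3)
)
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# model.fit(augmenter.flow(x_train, y_train, batch_size=32), epochs=...)
```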

Evaluation

We applied the models to the test datasets and evaluated them by computing the average weighted values of the following performance metrics: F1 score, sensitivity, and specificity. Details on the choice of metrics are provided in the supplemental materials; the choice also draws on [13]. Results were derived by body region, institution, patient demographics, and acquisition parameters (manufacturer, contrast, CT kernel, slice thickness, sequence type). Performance metrics and their corresponding confidence intervals were determined using a spatially aware bootstrap resampling method [14]. The image sampling procedure ensured that no image slices were closer than 10 mm, based on slice position and slice thickness information; this is consistent with the 7.5-mm sampling approach reported in [15]. This spatially aware random sampling was performed at the series level to reduce the impact of strongly correlated images and provide more realistic statistical results. To reduce inter-series correlation, only one series per study (randomly selected at each sampling iteration) was kept in the subsampled dataset. The correlation between the model's accuracy and each confounding factor, and its statistical significance, were assessed using Cramer's V and Pearson's chi-squared test.
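A simplified sketch of this spatially aware bootstrap is given below; the data layout, function names, and the 95% percentile interval are assumptions made for illustration, not the authors' implementation.

```python
import random

def spatially_aware_bootstrap(studies, metric_fn, n_iter=1000, min_gap_mm=10.0):
    """Sketch of the spatially aware bootstrap described above.

    studies: {study_id: {series_id: [(slice_position_mm, y_true, y_pred), ...]}}
             -- this data layout is an assumption made for illustration.
    metric_fn: computes a scalar metric from (y_true_list, y_pred_list).
    """
    scores = []
    study_list = list(studies.values())
    for _ in range(n_iter):
        y_true, y_pred = [], []
        # Resample studies with replacement to build a bootstrap replicate.
        for study in random.choices(study_list, k=len(study_list)):
            # Keep one randomly selected series per study to limit
            # inter-series correlation.
            series = random.choice(list(study.values()))
            # Subsample slices so that no two kept slices are closer than
            # min_gap_mm along the scan direction.
            last_pos = None
            for pos, t, p in sorted(series, key=lambda s: s[0]):
                if last_pos is None or abs(pos - last_pos) >= min_gap_mm:
                    y_true.append(t)
                    y_pred.append(p)
                    last_pos = pos
        scores.append(metric_fn(y_true, y_pred))
    scores.sort()
    lo, hi = scores[int(0.025 * n_iter)], scores[int(0.975 * n_iter)]
    return sum(scores) / len(scores), (lo, hi)
```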

Results

Data

The data consisted of 2891 CT cases (training, 1804 studies; validation, 602 studies; test, 485 studies) and 3339 MRI cases (training, 1911 studies; validation, 636 studies; test, 792 studies). Flowcharts in the Supplemental Materials – Inclusion and Exclusion Criteria (Figs. 1–4) show the distribution of images after the different stages of series and image exclusion criteria. The evaluation of the ground truth revealed a total of 4 labeling errors out of 1455 labeled CT and MRI studies, which represents an error rate of 0.3%.

Fig. 4

Confusion matrix for the 118,829 images in the MRI test database with a threshold of 0.5. Rows represent the predictions, and columns represent the ground truth

Distributions of images and results by confounding factors for the test sets can be found in Tables 2 and 3. Twenty-seven institutions contributed to each of the CT and MRI test datasets. For CT, 56% of the datasets came from primary care hospitals and 44% from critical access hospitals and imaging centers, while for MRI, 55% of the datasets came from primary care hospitals. Sex parity was respected for the CT dataset, while females were slightly over-represented in the MRI dataset (56.1%). The age coverage ranged from 18 to over 90 years, roughly following the distribution of imaging tests in US healthcare systems [16]. Compared to the development datasets (Supplemental Materials – Distribution Development Datasets), the test datasets differed in some key areas (values in parentheses give the test-set proportion and its difference from the development datasets). For CT, acquisitions mostly originated from Siemens and other non-GE scanners (87.6%, +80.1%), with a larger proportion of older adults ≥ 65 years (45.6%, +9.3%), intermediate slice thickness (2 mm < slice thickness < 5 mm) (54.8%, +46.3%), and non-contrast imaging (76.5%, +9.1%). For MRI, acquisitions mostly originated from Siemens and other non-GE scanners (83.5%, +69%), with a larger proportion of older adults (31.7%, +11.7%), slice thicknesses between 2 and 5 mm (68.5%, +20.3%), and non-contrast imaging (84.0%, +6.4%).

Table 2 CT image performance metrics by confounding factors. n = number of studies (*series). The p-value for the median chi-square is provided to determine if a significant difference in accuracy is found for each confounding factor
Table 3 MR image performance metrics by confounding factors. n = number of studies (*series). The p-value for the median chi-square is provided to determine if a significant difference in accuracy is found for each confounding factor. **124 studies did not have institution information. ***137 series did not have any of the preset sequence tags. Due to the small number of cases, the performance metrics and confidence interval are not reliable for “In and Out of Phase”

Model Performance

An overall body region image-level sensitivity of 92.5% (92.1–92.8) was achieved for CT and 92.3% (92.0–95.6) for MRI. The post-processing stages contributed an improvement in classification accuracy of approximately 1.1% for CT and 1.6% for MRI. Classification results by body region and confusion matrices by modality are reported in Table 4 and in Figs. 3 and 4, respectively. Head and breast images have very discernible features, so they tend to be classified more accurately than other body regions such as the neck and extremities.

Table 4 CT and MR image classification sensitivity and specificity by body region. n = number of images

No formal association was found between classification accuracy and CT institution, CT kernel, or MRI contrast. However, statistically superior classification results were observed in a few instances, with Cramer's V correlations ranging from negligible (V < 0.05) to moderate (V = 0.17). For CT, this was the case for datasets from older patients (≥ 65 years) (p < 0.001, V = 0.041), with contrast (p < 0.001, V = 0.042), and with thick (≥ 5 mm) slices (p < 0.001, V = 0.048). For MRI, imaging centers (p < 0.001, V = 0.064), patients 44 years and older (p < 0.001, V = 0.087), Philips scanners (p = 0.001, V = 0.076), thin slices (p < 0.001, V = 0.0838), and inversion and MRA sequences (p < 0.001, V = 0.179) exhibited better classification performance. For some of the classes in the test sets, the association between accuracy and factors such as manufacturer and MRI sequence could not be reliably assessed: Hitachi and Canon scanners, and In and Out of Phase MRI sequences. Despite these limitations, the evaluation of accuracy results and confidence intervals points to performance robustness across age, manufacturer, CT slice thickness, and MRI sequence categories.
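For reference, Cramer's V and the associated chi-squared p-value can be computed from a contingency table of correct versus incorrect classifications per factor level, as in the sketch below (using SciPy, which the paper does not explicitly name).

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramer's V and p-value for a contingency table
    (rows: factor levels, columns: correct vs. incorrect classifications)."""
    table = np.asarray(table, dtype=float)
    chi2, p_value, _, _ = chi2_contingency(table)
    n = table.sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k)), p_value
```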

When mining the DICOM tags in the test datasets for either the "BodyPartExamined" (BP) or "ProcedureType" (PT) tag, the body region information at the study level was only 22.3% (BP) and 42.2% (PT) accurate for CT, and 58.3% (BP) and 47.8% (PT) accurate for MRI. In this cohort, the anatomical AI could therefore prove useful in improving the search for anatomically matched cases in about 50% of cases.
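As an illustration of this kind of metadata mining, the sketch below reads the standard "BodyPartExamined" tag (0018,0015) with pydicom; procedure-type information is stored in site-dependent fields, so the second lookup (StudyDescription) is only a placeholder.

```python
import pydicom

def dicom_body_part(path):
    """Return header-based body part hints for one DICOM file (illustrative)."""
    ds = pydicom.dcmread(path, stop_before_pixels=True)
    body_part = ds.get("BodyPartExamined", "")   # (0018,0015), often empty or wrong
    study_desc = ds.get("StudyDescription", "")  # site-specific free text
    return body_part, study_desc
```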

Discussion

The ability to automate accurate anatomic region labeling of medical images using pixel-based AI could address clinical and research workflow challenges stemming from the existing limitations that affect body region labeling of medical images. Our work demonstrates that a deep learning CNN-based classifier can achieve state-of-the-art overall accuracy greater than 90% in identifying body regions in CT and MR images while covering the entire human body and a large spectrum of acquisition protocols obtained from separate institutions. This is the first known attempt to (a) provide a solution that includes MR and (b) offer a solution that covers numerous body regions, in particular extremities, which have been excluded from other CT studies.

Our methods achieved an overall body region image-level sensitivity of 92.5%, similar to other publications restricted to CT image classification [2,

References

  1. Towbin AJ, Roth CJ, Petersilge CA, Garriott K, Buckwalter KA, Clunie DA: The importance of body part labeling to enable enterprise imaging: A HIMSS - SIIM enterprise imaging community collaborative white paper. J Digit Imaging 34:1-15, 2021. https://doi.org/10.1007/s10278-020-00415-0.

  2. Roth HR, Lee CT, Shin HC, Seff A, Kim L, Yao J, Summers RM: Anatomy-specific classification of medical images using deep convolutional nets. Proc IEEE International Symposium on Biomedical Imaging, 2015. https://doi.org/10.1109/ISBI.2015.7163826.

  3. Zhennan Y, Yiqiang Z, Zhigang P, Shu L, Shinagawa Y, Shaoting Z, Metaxas DN, Xiang Sean Z: Multi-instance deep learning: discover discriminative local anatomies for bodypart recognition. IEEE Trans Med Imaging 35:1332-1343, 2016. https://doi.org/10.1109/TMI.2016.2524985.

  4. Zhang P, Wang F, Zheng Y: Self-supervised deep representation learning for fine-grained body part recognition. Proc IEEE International Symposium on Biomedical Imaging, 2017. https://doi.org/10.1109/ISBI.2017.7950587.

  5. Sugimori H: Classification of computed tomography images in different slice positions using deep learning. J Healthc Eng, 2018. https://doi.org/10.1155/2018/1753480.

  6. Yan K, Lu L, Summers RM: Unsupervised body part regression via spatially self-ordering convolutional neural networks. Proc IEEE International Symposium on Biomedical Imaging, 2018. https://doi.org/10.1109/ISBI.2018.8363745.

  7. TCIA. Submission and de-identification overview. Available at https://wiki.cancerimagingarchive.net/display/Public/Submission+and+De-identification+Overview. Updated 2020. Accessed August 2022.

  8. Wang K, Zhang D, Li Y, Zhang R, Lin L: Cost-effective active learning for deep image classification. IEEE Trans Circuits Syst Video Technol 27(12):2591–2600, 2016. https://doi.org/10.1109/TCSVT.2016.2589879.


  9. Budd S, Robinson EC, Kainz B: A survey on active learning and human-in-the-loop deep learning for medical image analysis. Med Image Anal 71:102062, 2021. https://doi.org/10.1016/j.media.2021.102062.

  10. Mahgerefteh S, Kruskal JB, Yam CS, Blachar A, Sosna J: Peer review in diagnostic radiology: current state and a vision for the future. Radiographics 29:1221–1231, 2009. https://doi.org/10.1148/rg.295095086.

  11. Turner AG: Expert group meeting to review the draft handbook on designing of household sample surveys: sampling strategies (draft). November 2003. Available at https://unstats.un.org/unsd/demographic/meetings/egm/sampling_1203/docs/no_2.pdf. Accessed August 2022.

  12. He K, Zhang X, Ren S, Sun J: Deep residual learning for image recognition. Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit 2016. https://doi.org/10.1109/CVPR.2016.90.

  13. Towards Data Science. Available at https://towardsdatascience.com/multi-class-metrics-made-simple-part-i-precision-and-recall-9250280bddc2. Accessed 18 August 2022.

  14. Efron B: Bootstrap methods: another look at the jackknife. Ann Stat 7(1):1-26, 1979. https://doi.org/10.1214/aos/1176344552.

  15. Leuschner J, Schmidt M, Baguer DO, Maas P: The LoDoPaB-CT dataset: a benchmark dataset for low-dose CT reconstruction methods. Sci Data 8:109, 2021. https://doi.org/10.1038/s41597-021-00893-z.

  16. Smith-Bindman R, Kwan ML, Marlow EC, et al: Trends in use of medical imaging in US health care systems and in Ontario, Canada, 2000–2016. JAMA 322(9):843–856, 2019. https://doi.org/10.1001/jama.2019.11456.

  17. Elahi A, Reid D, Redfern RO, Kahn CE, Cook TS: Automating import and reconciliation of outside examinations submitted to an academic radiology department. J Digit Imaging 33(2):355–360, 2020. https://doi.org/10.1007/s10278-019-00291-3.


Author information


Contributions

Philippe Raffy, Jean-François Pambrun, Ashish Kumar, and David Dubois contributed to the study conception, design, material preparation, and data collection. Analysis was performed by Philippe Raffy, David Dubois, and Ryan Young. The first draft of the manuscript was written by Philippe Raffy, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to David Dubois.

Ethics declarations

Ethical Approval

This retrospective research study was conducted using de-identified data from clinical partners. Based on the nature of the study, official IRB waivers of ethical approval were granted from our healthcare partners.

Competing Interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Key Points

•   An off-the-shelf deep learning model can achieve state-of-the-art anatomic classification results of over 90% sensitivity on CT and MRI test sets completely disjoint from the training sets.

•   Classification results cover the entire human body, in particular extremities, which have been excluded from previous CT studies.

•   Image-based analysis has the potential to provide accurate metadata about the image composition of a given CT or MRI study.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 1449 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Raffy, P., Pambrun, JF., Kumar, A. et al. Deep Learning Body Region Classification of MRI and CT Examinations. J Digit Imaging 36, 1291–1301 (2023). https://doi.org/10.1007/s10278-022-00767-9

