Abstract
There is a growing body of evidence that artificial intelligence may be helpful across the entire prostate cancer disease continuum. However, building machine learning algorithms robust to inter- and intra-radiologist segmentation variability remains a challenge. With this goal in mind, several model training approaches were compared: removing unstable features according to the intraclass correlation coefficient (ICC); training independently with features extracted from each radiologist’s mask; training with the feature average between both radiologists; extracting radiomic features from the intersection or union of masks; and creating a heterogeneous dataset by randomly selecting one of the radiologists’ masks for each patient. The classifier trained with this last resampled dataset presented the lowest generalization error, suggesting that training with heterogeneous data leads to the most robust classifiers. Conversely, removing features with low ICC resulted in the highest generalization error. The selected radiomics dataset, with the randomly chosen radiologist masks, was concatenated with deep features extracted from neural networks trained to segment the whole prostate, and this hybrid dataset was then used to train a classifier. The results revealed that, even though the hybrid classifier was less overfitted than the one trained with deep features, it was still unable to outperform the radiomics model.
Introduction
In 2020, prostate cancer was the second most frequent cancer in men worldwide and ranked fifth in terms of mortality, being the leading cause of cancer death in men in 48 of the 185 countries analysed by Sung et al.1. The prostate cancer diagnostic procedure relies on unspecific measures such as PSA (prostate-specific antigen) levels and DRE (digital rectal examination), followed by biopsy2, where disease aggressiveness is assessed with the Gleason score. This biomarker is used to define the clinical significance of lesions, according to which treatment decisions are made. Thus, an accurate determination of clinical significance is essential for ascertaining the most appropriate treatment options and ensuring the best clinical outcome.
Recent developments in Artificial Intelligence anticipate much needed improvements in the detection, diagnosis, screening, and staging of prostate cancer3. One area of particular interest is radiomics, which allows for quantitative analysis of medical images, contrary to the qualitative analysis performed by experts in the field thus far. Radiomics is defined as the transformation of medical images into high-dimensional mineable data in the form of extracted quantitative features4. Studies using radiomics have shown potential since the features’ quantitative nature eliminates some of the inherent subjectiveness of medical image interpretation. Moreover, radiomics is able to turn radiological images into tabular data, which machine learning algorithms can later analyse5. The latter are designed to detect patterns in the data and are able to find useful diagnostic and prognostic biomarkers that would not be seen by the naked eye of an expert radiologist. The use of radiomic features extracted from bpMRI (bi-parametric Magnetic Resonance Imaging) exams to predict prostate cancer disease aggressiveness can be found in the literature6,7,8,9,10,11,12,13,14.
Radiomics, however, has several shortcomings. A major limitation is the tight link between the computed radiomic features and the volume of interest (VOI) from which they have been extracted. Tissue or lesion segmentation is performed either manually, when a radiologist outlines the boundary of the VOI in the image; automatically, for example by a deep learning algorithm trained to segment a certain VOI; or semi-automatically, where the mask is drawn in an automated fashion and later verified and corrected by a radiologist. When manually traced, the segmentation masks suffer from inter- and intra-reader variability15. These slight real-world differences in human-defined segmentation margins may in principle affect the distribution of the calculated radiomic features, which subsequently affects the algorithms trained with them16. Building machine learning algorithms that are robust to this real-world heterogeneity is essential for the future safe application of AI methods in the clinical setting.
Parallel to the growing interest in radiomic features, deep neural networks have emerged as a promising technique for prostate cancer detection and for the segmentation of anatomic zones or tumorous lesions17. However, it has been shown that deep learning models tend to overfit when attempting to solve prostate cancer classification problems, failing to generalize to out-of-distribution data18. Despite the large number of studies published, very few compare the performance of handcrafted radiomic features and deep features on the same data and objective18,19.
Regarding the classification of prostate cancer disease aggressiveness, we attempted to answer two research questions. First, we compared different approaches and obtained insights into how to produce classifiers that are robust to differences in segmentation margins. Second, we not only compared the performance of radiomic and deep features for the classification of prostate cancer aggressiveness but also assessed the performance of models trained with hybrid datasets incorporating both handcrafted radiomic and deep features.
Research questions
In this section, we will describe the research questions addressed in this study.
Research question I (RQ I)
Which training approach yields classifiers that are robust to minor differences in segmentation margins between two radiologists?
An example of the common differences in segmentation margins can be found in Fig. 1. To answer RQ I, we compared different approaches at training:
- 1st approach (stableRad1): Perform a feature stability analysis and train only with stable features.
- 2nd approach (Rad1): Train with features extracted from masks drawn by radiologist 1 (rad1).
- 3rd approach (Rad2): Train with features extracted from masks drawn by radiologist 2 (rad2).
- 4th approach (avgRad): Train with the feature average between both radiologists.
- 5th approach (intersectionRad): Train with features extracted from the intersection of the two masks.
- 6th approach (unionRad): Train with features extracted from the union of the two masks.
- 7th approach (resampledRad): Train with a randomly resampled dataset where, for some patients, features were extracted from the mask drawn by rad1 and, for others, from the mask drawn by rad2.
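The resampling in the 7th approach (resampledRad) can be sketched as follows. The feature tables, patient IDs, and single `glcm_contrast` column are hypothetical, used only to illustrate the per-patient random selection:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical per-patient feature tables, one per radiologist
# (index = patient ID, columns = radiomic features).
rad1 = pd.DataFrame({"glcm_contrast": [1.0, 2.0, 3.0, 4.0]},
                    index=["p1", "p2", "p3", "p4"])
rad2 = pd.DataFrame({"glcm_contrast": [1.1, 1.9, 3.2, 3.8]},
                    index=["p1", "p2", "p3", "p4"])

# For each patient, randomly keep the feature vector extracted
# from one of the two radiologists' masks.
pick_rad1 = rng.random(len(rad1)) < 0.5
resampled = pd.concat([rad1[pick_rad1], rad2[~pick_rad1]]).sort_index()
```

Each row of the resulting table comes unchanged from one radiologist's extraction, so the training set mixes both segmentation styles.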
The dataset that resulted in the most robust classifier was selected for further analysis (research question II).
Research question II
Can deep features significantly improve the performance of machine learning classifiers trained with handcrafted radiomic features?
To answer research question II, three model-training approaches were compared: a handcrafted radiomics dataset, a dataset of deep features, and a hybrid dataset combining both.
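The hybrid dataset described above amounts to a per-patient concatenation of the two feature sets. A minimal sketch, where the feature counts are illustrative and not the numbers used in the study:

```python
import numpy as np

rng = np.random.default_rng(0)
n_patients = 8

# Illustrative feature matrices (dimensions are hypothetical):
# handcrafted radiomic features and deep features taken from a
# trained segmentation network.
radiomic = rng.normal(size=(n_patients, 20))
deep = rng.normal(size=(n_patients, 64))

# The hybrid dataset is a column-wise concatenation per patient.
hybrid = np.concatenate([radiomic, deep], axis=1)
```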
Methods
Data description
Our dataset consisted of T2W, DW, and ADC data from the SPIE-AAPM-NCI PROSTATEx challenge20,21,22. The MRI exams were acquired at the Prostate MR Reference Center, Radboud University Medical Centre, the Netherlands. Due to the public nature of the data, ethics committee approval and patient consent were waived for this study. Due to complications with some segmentation files, we did not utilize the full extent of the PROSTATEx dataset; the full list of excluded patients can be found in the Supplementary Material (SM 1), leaving a total of 181 patients. The approximate location of the centroid of each lesion was provided in DICOM coordinates. Cancer was considered significant when the biopsy Gleason score was 7 or higher. The lesions were labelled “TRUE” or “FALSE” for the presence of clinically significant cancer, with a distribution of 67 clinically significant lesions (TRUE) and 214 clinically non-significant lesions (FALSE). A gland was considered to have clinically significant cancer if at least one of its lesions was clinically significant, resulting in a label distribution of 122 clinically insignificant glands and 59 clinically significant glands.
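The lesion-to-gland label aggregation described above can be sketched as follows; the patient IDs and lesion labels are hypothetical:

```python
import pandas as pd

# Hypothetical lesion-level labels: a gland is clinically significant
# if at least one of its lesions is clinically significant.
lesions = pd.DataFrame({
    "patient": ["p1", "p1", "p2", "p3", "p3"],
    "significant": [False, True, False, False, False],
})
gland_labels = lesions.groupby("patient")["significant"].any()
```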
Methods specific to RQ1
Segmentation
Manual segmentations of the whole prostate gland were performed independently by two radiologists (M.L., 10 years of experience, and A.U., radiology resident) on T2W and DWI high b-value images separately.
Radiomic features extraction
Bias field correction was performed on T2W images using the N4 bias field correction algorithm23 and the Python package SimpleITK (version 2.0.0)24. First, each image’s x-, y-, and z-spacing was checked for discrepancies. Since the x- and y-spacings differed from the z-spacing, feature extraction was later performed in 2D. Additionally, the T2W images’ x- and y-spacings differed within and between patients, so these were resampled to the largest value of 0.5. The intensities of the non-quantitative images (T2W and DWI) were normalized. The bin width was selected to produce discretized images with between 30 and 130 bins, resulting in a bin width of 20 for T2W images, 5 for DWI, and 70 for ADC maps.
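The fixed-bin-width discretisation step above maps an intensity range to a bin count roughly as follows. This is a minimal sketch: the intensity values are hypothetical, and Pyradiomics’ own implementation handles bin-edge placement in more detail:

```python
import numpy as np

def n_bins(intensities, bin_width):
    """Approximate number of grey-level bins produced by
    fixed-bin-width discretisation of an intensity array."""
    span = intensities.max() - intensities.min()
    return max(int(np.ceil(span / bin_width)), 1)

# Hypothetical normalised T2W intensities.
rng = np.random.default_rng(1)
t2w = rng.normal(300, 200, size=10_000)

# With a bin width of 20, this lands in the 30-130 bin target range.
assert 30 <= n_bins(t2w, bin_width=20) <= 130
```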
Radiomic features were extracted from the whole gland segmentation using the Pyradiomics package (version 3.0)25 in Python (version 3.7.9)26. All the pre-processing steps mentioned before were performed as parameters of the extractor function, except for the bias field correction, which was performed prior to the extraction. All image filters and feature classes were enabled, resulting in a total of 3111 features extracted, 1037 from each MRI modality (T2W, high b-value DWI and ADC). In the feature extraction of the ADC map, the mask drawn on the DWI was used. The mathematical expressions and semantic meanings of the features extracted can be found at https://pyradiomics.readthedocs.io/en/latest/.
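The extractor settings described above can be collected in a Pyradiomics parameter file. The sketch below is illustrative, not the study’s actual configuration file; only a subset of filters and feature classes is shown, and the values are the T2W ones:

```yaml
# Sketch of a Pyradiomics parameter file for the T2W extraction.
imageType:
  Original: {}
  Wavelet: {}          # all image filters were enabled in the study
featureClass:
  firstorder:
  glcm:
  glrlm:
setting:
  normalize: true                        # T2W/DWI intensity normalization
  resampledPixelSpacing: [0.5, 0.5, 0]   # in-plane resampling only
  force2D: true                          # 2D extraction (anisotropic voxels)
  binWidth: 20                           # T2W; 5 for DWI, 70 for ADC maps
```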
Spatial stability of radiomic features
For approach 1, spatial stability was assessed by comparing the features extracted from the VOIs created by each radiologist. This analysis was conducted with a two-way, single rater, absolute agreement Intraclass correlation coefficient (ICC) formulation (ICC 2.1)27. Features with ICC 95% confidence interval lower limit over 0.75 were considered to be robust to segmentation and were kept for further analysis.
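The ICC(2,1) computation can be written directly from the two-way ANOVA mean squares. The sketch below is a minimal illustration; in practice a statistics package that also returns the 95% confidence interval would be used:

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement,
    single rater.  `ratings` is an (n_subjects, k_raters) array."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-subject means
    col_means = ratings.mean(axis=0)   # per-rater means

    # Mean squares from the two-way ANOVA decomposition.
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # subjects
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # raters
    sse = np.sum((ratings
                  - row_means[:, None]
                  - col_means[None, :]
                  + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Perfect agreement between two raters yields ICC = 1.
same = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
assert np.isclose(icc_2_1(same), 1.0)
```

A constant offset between raters lowers the coefficient, since ICC(2,1) measures absolute agreement rather than consistency.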
Methods specific to RQ2
To answer RQ2, we extracted deep features in addition to the radiomic features used in RQ1.
Deep features extraction
Deep features were extracted from segmentation models trained to segment the whole gland. To train the networks, the volumes were cropped on the X and Y axes and zero-padded on the Z axis. A comprehensive description of the training, as well as network performance, can be found in previous work28. The ground truth masks for the gland were obtained as described previously, while the peripheral and transitional zone masks were the publicly available ones from the SPIE-AAPM-NCI PROSTATEx challenge.
The models used to perform the segmentations were all encoder-decoder U-Net variations, namely: Unet44,45.

Discussion

A disagreement between segmentation borders does not necessarily mean the resulting features would not be good predictors; it simply means the readers do not agree on what their values should be. Hence, new approaches to combining information from two or more radiologists are of high importance. In this study, we addressed this with research question I.
Regarding this first research question, our results revealed that approach 1, corresponding to the evaluation of feature stability through inter-rater absolute agreement and subsequent removal of unstable features, a technique currently recommended by radiomics guidelines and included in the radiomics quality score46, produced the classifiers with the least ability to generalize to hold-out data. On the other hand, approach 7, corresponding to training classifiers with a radiomics dataset where segmentation masks were randomly chosen from the two available radiologists, proved to be the highest performing across all hold-out test sets, supporting the hypothesis that the more heterogeneous the training data, the more generalizable the classifier may be on unseen data. Additionally, the performance of this classifier was very similar across the different hold-out test sets, indicating robustness to radiologists with different years of experience. This was further confirmed by the performance on the resampledRad test set, which simulates a real-world clinical environment where a deployed model would be used by several physicians. These results are therefore highly relevant for the clinical translation of AI models. Pending further validation studies, they suggest that gathering segmentations from different radiologists will produce classifiers that are more robust to slight differences in segmentation margins.
Regarding research question II, the few publications comparing deep learning and radiomics-trained classical machine learning on the same classification problem18,19 reported higher performance on the train set when using deep learning, but lower performance on the test set compared to classical machine learning algorithms trained with radiomic features. In this work, deep learning’s natural tendency to overfit was confirmed for both the deep and hybrid classifiers. Even though the hybrid classifier showed considerably less overfitting than the deep model, it was not enough to outperform the radiomics classifier. Nevertheless, we believe this hybrid approach is worth exploring further with larger datasets and external validation.
This study has several limitations. First, this was a retrospective study, so a multi-center prospective analysis should be carried out to validate these results and investigate the impact these predictive models have on patient outcomes. Second, only T2W, DWI, and ADC sequences were used. Other sequences, such as dynamic contrast-enhanced MRI, could be worth exploring; however, since they are not consistently part of MRI examination protocols, they were not included in our models. Third, although the overall class imbalance was addressed through SMOTE upsampling of the minority class, we did not address the imbalanced nature of the anatomical location of lesions, with the large majority of lesions belonging to the peripheral zone. It would be interesting to investigate the model’s performance on the different anatomical zones independently. Fourth, using a publicly available dataset increased transparency but limited our access to clinical data, such as PSA levels, patient age, or PI-RADS score, which are fundamental components of a clinician’s assessment but could not be included in our model. Fifth, proper assessment of real-world clinical performance is only possible through external validation; this important validation step will be addressed in future work. Finally, inherent to the Gleason system is the subjectivity of cancer grading, so we must keep in mind that the gold standard used in this study is subject to human error and inter- or intra-observer variability. In addition, the definition of clinical significance might be based on more than the Gleason score alone, and variables such as tumour volume or tumour category might be relevant.
Conclusion
In conclusion, the results presented in this study are highly relevant to the clinical translation of AI models. Heterogeneous radiomics datasets, where segmentation masks come from more than one radiologist, produced the classifiers with the highest generalization power. Additionally, the combination of radiomic and deep features for the classification of prostate cancer disease aggressiveness was studied. The hybrid approach showed promising results and is worth exploring further with larger datasets.
Data availability
The datasets analysed during the current study are available in the Cancer Imaging Archive repository, https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=23691656.
References
Sung, H. et al. Global cancer statistics 2020: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 71, 209–249 (2021).
Halpern, J. A. et al. Use of digital rectal examination as an adjunct to prostate specific antigen in the detection of clinically significant prostate cancer. J. Urol. 199, 947–953 (2018).
Ferro, M. et al. Radiomics in prostate cancer: An up-to-date review. Ther. Adv. Urol. 14, 17562872221109020 (2022).
Scapicchio, C. et al. A deep look into radiomics. Radiol. Med. (Torino) 126, 1296–1311 (2021).
Twilt, J. J., van Leeuwen, K. G., Huisman, H. J., Fütterer, J. J. & de Rooij, M. Artificial intelligence based algorithms for prostate cancer classification and detection on magnetic resonance imaging: A narrative review. Diagnostics 11, 959 (2021).
Cutaia, G. et al. Radiomics and prostate mri: Current role and future applications. J. Imaging 7, 34 (2021).
Midiri, F., Vernuccio, F., Purpura, P., Alongi, P. & Bartolotta, T. V. Multiparametric mri and radiomics in prostate cancer: A review of the current literature. Diagnostics 11, 1829 (2021).
Gugliandolo, S. G. et al. Mri-based radiomics signature for localized prostate cancer: A new clinical tool for cancer aggressiveness prediction? Sub-study of prospective phase ii trial on ultra-hypofractionated radiotherapy (airc ig-13218). Eur. Radiol. 31, 716–728 (2021).
Kwon, D. et al. Classification of suspicious lesions on prostate multiparametric mri using machine learning. J. Med. Imaging 5, 034502 (2018).
Li, T. et al. Development and validation of a radiomics nomogram for predicting clinically significant prostate cancer in pi-rads 3 lesions. Front. Oncol. 11, 25 (2021).
Gong, L. et al. Noninvasive prediction of high-grade prostate cancer via biparametric mri radiomics. J. Magn. Reson. Imaging 52, 1102–1109 (2020).
Woźnicki, P. et al. Multiparametric mri for prostate cancer characterization: Combined use of radiomics model with pi-rads and clinical parameters. Cancers 12, 1767 (2020).
Bernatz, S. et al. Comparison of machine learning algorithms to predict clinically significant prostate cancer of the peripheral zone with multiparametric mri using clinical assessment categories and radiomic features. Eur. Radiol. 30, 6757–6769 (2020).
Li, J. et al. Support vector machines (svm) classification of prostate cancer gleason score in central gland using multiparametric magnetic resonance images: A cross-validated study. Eur. J. Radiol. 98, 61–67 (2018).
Steenbergen, P. et al. Prostate tumor delineation using multiparametric mri: Inter observer variability and pathology validation. Radiother. Oncol. 111, S53–S54 (2014).
Zhao, B. Understanding sources of variation to improve the reproducibility of radiomics. Front. Oncol. 2021, 826 (2021).
Li, H. et al. Machine learning in prostate mri for prostate cancer: Current status and future opportunities. Diagnostics 12, 289 (2022).
Castillo, T. J. M. et al. Classification of clinically significant prostate cancer on multi-parametric mri: A validation study comparing deep learning and radiomics. Cancers 14, 12 (2021).
Bertelli, E. et al. Machine and deep learning prediction of prostate cancer aggressiveness using multiparametric mri. Front. Oncol. 11, 802964–802964 (2021).
Litjens, G., Debats, O., Barentsz, J., Karssemeijer, N. & Huisman, H. ProstateX challenge data. The Cancer Imaging Archive (2017). https://doi.org/10.7937/K9TCIA.2017.MURS5CL.
Litjens, G., Debats, O., Barentsz, J., Karssemeijer, N. & Huisman, H. Computer-aided detection of prostate cancer in mri. IEEE Trans. Med. Imaging 33, 1083–1092. https://doi.org/10.1109/TMI.2014.2303821 (2014).
Clark, K. et al. The cancer imaging archive (tcia): Maintaining and operating a public information repository. J. Digit. Imaging 26, 1045–1057 (2013).
Tustison, N. J. et al. N4itk: Improved n3 bias correction. IEEE Trans. Med. Imaging 29, 1310–1320 (2010).
Yaniv, Z., Lowekamp, B. C., Johnson, H. J. & Beare, R. Simpleitk image-analysis notebooks: A collaborative environment for education and reproducible research. J. Digit. Imaging 31, 290–303 (2018).
Van Griethuysen, J. J. et al. Computational radiomics system to decode the radiographic phenotype. Can. Res. 77, e104–e107 (2017).
Van Rossum, G. & Drake, F. L. Python 3 Reference Manual (CreateSpace, 2009).
Koo, T. K. & Li, M. Y. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J. Chiropr. Med. 15, 155–163 (2016).
Rodrigues, N. M., Silva, S., Vanneschi, L. & Papanikolaou, N. A comparative study of automated deep learning segmentation models for prostate mri. Cancers 15, 1467 (2023).
Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. arXiv:1505.04597 (2015).
Zhou, Z., Siddiquee, M. M. R., Tajbakhsh, N. & Liang, J. UNet++: A nested U-net architecture for medical image segmentation. arXiv:1807.10165 (2018).
Oktay, O. et al. Attention U-net: Learning where to look for the pancreas. arXiv:1804.03999 (2018).
Aldoj, N., Biavati, F., Michallek, F., Stober, S. & Dewey, M. Automatic prostate and prostate zones segmentation of magnetic resonance images using densenet-like u-net. Sci. Rep. 10, 25 (2020).
Alom, M. Z., Hasan, M., Yakopcic, C., Taha, T. M. & Asari, V. K. Recurrent residual convolutional neural network based on U-net (R2U-net) for medical image segmentation. arXiv:1802.06955 (2018).
Pedregosa, F. et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Harris, C. R. et al. Array programming with NumPy. Nature 585, 357–362. https://doi.org/10.1038/s41586-020-2649-2 (2020).
McKinney, W. et al. Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference, vol. 445, 51–56 (Austin, TX, 2010).
Kuhn, M. Building predictive models in r using the caret package. J. Stat. Softw. 28, 1–26. https://doi.org/10.18637/jss.v028.i05 (2020).
Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. Smote: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Raschka, S. Mlxtend: Providing machine learning and data science utilities and extensions to python’s scientific computing stack. J. Open Sourc. Softw. 3, 56. https://doi.org/10.21105/joss.00638 (2018).
Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 30, 25 (2017).
Han, C. et al. Radiomics models based on apparent diffusion coefficient maps for the prediction of high-grade prostate cancer at radical prostatectomy: Comparison with preoperative biopsy. J. Magn. Reson. Imaging 54, 1892–1901 (2021).
Rodrigues, A. et al. Prediction of prostate cancer disease aggressiveness using bi-parametric mri radiomics. Cancers 13, 6065 (2021).
Niu, X.-K. et al. Clinical application of biparametric mri texture analysis for detection and evaluation of high-grade prostate cancer in zone-specific regions. Am. J. Roentgenol. 210, 549–556 (2018).
Xu, M. et al. Using biparametric mri radiomics signature to differentiate between benign and malignant prostate lesions. Eur. J. Radiol. 114, 38–44 (2019).
Lambin, P. et al. Radiomics: The bridge between medical imaging and personalized medicine. Nat. Rev. Clin. Oncol. 14, 749–762 (2017).
Acknowledgements
This work received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement no. 952159 (ProCAncer-I). This work was also partially supported by FCT, Portugal, through funding of the LASIGE Research Unit (UIDB/00408/2020 and UIDP/00408/2020), and Nuno Rodrigues PhD Grant 2021/05322/BD.
Author information
Contributions
Conceptualization, N.P., A.R., and N.R.; methodology, A.R., N.R., J.S., N.P.; software, A.R., and N.R.; validation, A.R.; formal analysis, A.R., J.S., N.R.; investigation, A.R.; resources, A.R., N.R.; data curation, A.R., J.S.; writing-original draft preparation, A.R.; writing-review and editing, A.R., N.R., J.S., C.M., I.D., and N.P.; visualization, A.R., and N.R.; supervision, I.D., and N.P.; project administration, A.R.; funding acquisition, N.P.; all authors have read and agreed to the published version of the manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Rodrigues, A., Rodrigues, N., Santinha, J. et al. Value of handcrafted and deep radiomic features towards training robust machine learning classifiers for prediction of prostate cancer disease aggressiveness. Sci Rep 13, 6206 (2023). https://doi.org/10.1038/s41598-023-33339-0