Introduction

In 2020, prostate cancer was the second most frequent cancer in men worldwide and ranked fifth in terms of mortality, being the leading cause of cancer death in men in 48 of the 185 countries analysed by Sung et al.1. The prostate cancer diagnostic pathway relies on non-specific measures such as PSA (prostate-specific antigen) levels and DRE (digital rectal examination), followed by biopsy2, in which disease aggressiveness is assessed with the Gleason score. This biomarker is used to define the clinical significance of lesions, according to which treatment decisions are made. Thus, an accurate determination of clinical significance is essential for selecting the most appropriate treatment options and ensuring the best clinical outcome.

Recent developments in Artificial Intelligence promise much-needed improvements in the detection, diagnosis, screening, and staging of prostate cancer3. One area of particular interest is radiomics, which allows for quantitative analysis of medical images, in contrast to the qualitative analysis performed by experts in the field thus far. Radiomics is defined as the transformation of medical images into high-dimensional mineable data in the form of extracted quantitative features4. Studies using radiomics have shown potential, since the quantitative nature of the features eliminates some of the inherent subjectiveness of medical image interpretation. Moreover, radiomics turns radiological images into tabular data, which machine learning algorithms can then analyse5. These algorithms are designed to detect patterns in the data and can find useful diagnostic and prognostic biomarkers that would escape the naked eye of an expert radiologist. The use of radiomic features extracted from bpMRI (bi-parametric Magnetic Resonance Imaging) exams to predict prostate cancer disease aggressiveness is well documented in the literature6,7,8,9,10,11,12,13,14.

Radiomics, however, has several shortcomings. A major limitation is the tight link between the computed radiomic features and the volume of interest (VOI) from which they are extracted. Tissue or lesion segmentation is performed either manually, when a radiologist outlines the boundary of the VOI in the image; automatically, for example by a deep learning algorithm trained to segment a certain VOI; or semi-automatically, where the mask is drawn in an automated fashion and later verified and corrected by a radiologist. When manually traced, the segmentation masks suffer from inter- and intra-reader variability15. These slight real-world differences in human-defined segmentation margins may in principle affect the distribution of the calculated radiomic features, which in turn affects the algorithms trained with them16. Building machine learning algorithms that are robust to this real-world heterogeneity is essential for the future safe application of AI methods in the clinical setting.

In parallel with the growing interest in radiomic features, deep neural networks have emerged as a promising technique for prostate cancer detection and for the segmentation of anatomic zones or tumorous lesions17. However, it has been shown that deep learning models tend to overfit when attempting to solve prostate cancer classification problems, failing to generalize to out-of-distribution data18. Despite the large number of studies published, very few compare the performance of handcrafted radiomic features and deep features on the same data and with the same objective18,19.

Regarding the classification of prostate cancer disease aggressiveness, we set out to answer two research questions. First, we compared different approaches and obtained insights into how to produce classifiers that are robust to differences in segmentation margins. Second, we not only compared the performance of radiomic features and deep features for the classification of prostate cancer aggressiveness, but also assessed the performance of models trained with hybrid datasets incorporating both handcrafted radiomic and deep features.

Research questions

In this section, we describe the research questions addressed in this study.

Research question I (RQ I)

Which is the best approach for training classifiers that are robust to minor differences in segmentation margins between two radiologists?

An example of the common differences in segmentation margins can be found in Fig. 1. To answer RQ I, we compared the following training approaches:

1st approach:

(stableRad1) Perform a feature stability analysis and train only with stable features.

2nd approach:

(Rad1) Train with features extracted from masks drawn by radiologist 1 (rad1).

3rd approach:

(Rad2) Train with features extracted from masks drawn by radiologist 2 (rad2).

4th approach:

(avgRad) Train with the feature average between both radiologists.

5th approach:

(intersectionRad) Train with features extracted from the intersection of the two masks.

6th approach:

(unionRad) Train with features extracted from the union of the two masks.

7th approach:

(resampledRad) Train with a randomly resampled dataset in which, for each patient, the features were extracted from the mask drawn by either rad1 or rad2, chosen at random (see the sketch after this list).
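As a minimal sketch of the resampledRad construction, assuming per-radiologist feature tables indexed by patient ID (the file names and column layout are illustrative, not from the study's code):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility

# Hypothetical per-radiologist feature tables: one row per patient,
# with identical feature columns and patient ordering in both files.
rad1_df = pd.read_csv("features_rad1.csv", index_col="patient_id")
rad2_df = pd.read_csv("features_rad2.csv", index_col="patient_id")

# For each patient, keep the feature vector computed from the mask of
# rad1 or rad2, chosen at random with equal probability.
use_rad2 = rng.random(len(rad1_df)) < 0.5
resampled_df = rad1_df.copy()
resampled_df.loc[use_rad2] = rad2_df.loc[use_rad2]
```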

The dataset that resulted in the most robust classifier was selected for further analysis (research question II).

Figure 1

An example of the evident segmentation variability in the masks drawn by radiologist 1 (a) and radiologist 2 (b) on patient Prostatex0000.

Research question II (RQ II)

Can deep features significantly improve the performance of machine learning classifiers trained with handcrafted radiomic features?

To answer research question II, three approaches to model training were compared: a handcrafted radiomics dataset, a dataset of deep features, and a hybrid dataset combining both (see the sketch below).
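As a minimal sketch, assuming the handcrafted and deep feature tables are stored as per-patient CSV files (hypothetical names), the hybrid dataset is simply their column-wise combination:

```python
import pandas as pd

# Hypothetical feature tables indexed by patient ID; column names are
# assumed to be disjoint between the two tables.
radiomics_df = pd.read_csv("radiomics_features.csv", index_col="patient_id")
deep_df = pd.read_csv("deep_features.csv", index_col="patient_id")

# Hybrid dataset: handcrafted radiomic and deep features side by side,
# keeping only patients present in both tables.
hybrid_df = radiomics_df.join(deep_df, how="inner")
```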

Methods

Data description

Our dataset consisted of T2W, DWI, and ADC data from the SPIE-AAPM-NCI PROSTATEx challenge20,21,22. The MRI exams were acquired at the Prostate MR Reference Center, Radboud University Medical Centre, in the Netherlands. Due to the public nature of the data, ethics committee approval and patient consent were waived for this study. Due to complications with some segmentation files, we did not utilize the full extent of the PROSTATEx dataset; the full list of excluded patients can be found in the Supplementary Material (SM 1). The approximate location of the centroid of each lesion was provided in DICOM coordinates. Cancer was considered significant when the biopsy Gleason score was 7 or higher. The lesions were labelled "TRUE" or "FALSE" for the presence of clinically significant cancer, with a distribution of 67 clinically significant lesions (TRUE) and 214 clinically non-significant lesions (FALSE). A gland was considered to have clinically significant cancer if at least one of its lesions was clinically significant. This resulted in a label distribution of 122 clinically non-significant glands and 59 clinically significant glands, for a total of 181 patients.
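As a short illustration of the gland-level labelling rule (the rows below are toy data, not PROSTATEx cases):

```python
import pandas as pd

# Toy lesion table: one row per lesion, with its patient ID and a boolean
# flag for clinical significance (biopsy Gleason score of 7 or higher).
lesions = pd.DataFrame({
    "patient_id": ["p0", "p0", "p1", "p2", "p2"],
    "significant": [False, True, False, False, False],
})

# A gland is clinically significant if ANY of its lesions is significant.
gland_labels = lesions.groupby("patient_id")["significant"].any()
print(gland_labels)  # p0: True, p1: False, p2: False
```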

Methods specific to RQ I

Segmentation

Manual segmentations of the whole prostate gland were performed independently by two radiologists (M.L., 10 years of experience, and A.U., radiology resident) on T2W and DWI high b-value images separately.

Radiomic features extraction

Bias field correction was performed on the T2W images using the N4 bias field correction algorithm23 and the Python package SimpleITK (version 2.0.0)24. First, each image's x-, y-, and z-spacing was checked for discrepancies. Since the x- and y-spacings differed from the z-spacing, feature extraction was later performed in 2D. Additionally, the x- and y-spacings of the T2W images differed within and between patients, so these were resampled to the highest value observed, 0.5 mm. The intensities of the non-quantitative images (T2W and DWI) were normalized. The bin width was selected to produce discretized images with between 30 and 130 bins. This resulted in a bin width of 20 for T2W images, 5 for DWI, and 70 for ADC maps.
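A minimal sketch of the bias field correction step with SimpleITK is shown below; the file names are illustrative, and the Otsu-based foreground mask follows the standard SimpleITK N4 example rather than the study's (unstated) masking choice:

```python
import SimpleITK as sitk

# Read the T2W volume as floating point (file name is hypothetical).
image = sitk.ReadImage("t2w.nii.gz", sitk.sitkFloat32)

# Rough foreground mask via Otsu thresholding (200 histogram bins),
# as in the standard SimpleITK N4 example; an assumption here.
mask = sitk.OtsuThreshold(image, 0, 1, 200)

# Run N4 bias field correction and save the corrected volume.
corrector = sitk.N4BiasFieldCorrectionImageFilter()
corrected = corrector.Execute(image, mask)
sitk.WriteImage(corrected, "t2w_n4.nii.gz")
```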

Radiomic features were extracted from the whole-gland segmentation using the Pyradiomics package (version 3.0)25 in Python (version 3.7.9)26. All the pre-processing steps mentioned above were performed as parameters of the extractor function, except for the bias field correction, which was performed prior to extraction. All image filters and feature classes were enabled, resulting in a total of 3111 extracted features, 1037 from each MRI modality (T2W, high b-value DWI, and ADC). For the feature extraction from the ADC map, the mask drawn on the DWI was used. The mathematical expressions and semantic meanings of the extracted features can be found at https://pyradiomics.readthedocs.io/en/latest/.
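A hedged sketch of the extractor configuration for the T2W modality follows; the settings mirror the preprocessing described above (bin widths differ per modality), but the exact parameter file used in the study is not reproduced here:

```python
from radiomics import featureextractor

# Settings mirroring the preprocessing described in the text (T2W case).
settings = {
    "normalize": True,                       # intensity normalization
    "resampledPixelSpacing": [0.5, 0.5, 0],  # resample in-plane only
    "force2D": True,                         # slice-wise (2D) extraction
    "binWidth": 20,                          # 20 for T2W; 5 for DWI, 70 for ADC
}
extractor = featureextractor.RadiomicsFeatureExtractor(**settings)
extractor.enableAllImageTypes()  # all image filters
extractor.enableAllFeatures()    # all feature classes

# Paths are illustrative; bias field correction is assumed already applied.
features = extractor.execute("t2w_n4.nii.gz", "gland_mask_rad1.nii.gz")
```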

Spatial stability of radiomic features

For approach 1, spatial stability was assessed by comparing the features extracted from the VOIs created by each radiologist. This analysis was conducted with a two-way, single-rater, absolute-agreement intraclass correlation coefficient formulation (ICC(2,1))27. Features whose ICC 95% confidence interval lower limit was above 0.75 were considered robust to segmentation and were kept for further analysis.
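The filtering step could look like the sketch below, using the pingouin package as one possible ICC(2,1) implementation (an assumption; the section does not name the software used):

```python
import pandas as pd
import pingouin as pg

def stable_features(rad1_df, rad2_df, cutoff=0.75):
    """Keep features whose ICC(2,1) 95% CI lower bound exceeds `cutoff`.

    rad1_df / rad2_df: one row per lesion, one column per radiomic
    feature, extracted from each radiologist's masks (same index order).
    """
    stable = []
    for feat in rad1_df.columns:
        # Long-format table expected by pingouin: target, rater, rating.
        long_df = pd.DataFrame({
            "target": list(rad1_df.index) * 2,
            "rater": ["rad1"] * len(rad1_df) + ["rad2"] * len(rad2_df),
            "score": pd.concat([rad1_df[feat], rad2_df[feat]]).to_numpy(),
        })
        icc = pg.intraclass_corr(data=long_df, targets="target",
                                 raters="rater", ratings="score")
        # "ICC2": two-way random effects, single rater, absolute agreement.
        ci_lower = icc.set_index("Type").loc["ICC2", "CI95%"][0]
        if ci_lower > cutoff:
            stable.append(feat)
    return stable
```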

Methods specific to RQ II

To answer RQ II, we extracted deep features in addition to the radiomic features used for RQ I.

Deep features extraction

Deep features were extracted from segmentation models trained to segment the whole gland. To train the network, the volumes were cropped along the X and Y axes and zero-padded along the Z axis. A comprehensive description of the training, as well as of the network performance, can be found in previous work28. The ground-truth masks for the whole gland were obtained as described previously, while the peripheral and transitional zone masks were the publicly available ones from the SPIE-AAPM-NCI PROSTATEx challenge.
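As an illustrative sketch (not the study's actual architecture), deep features can be read out of a trained segmentation network by hooking an intermediate layer and global-average-pooling its activations; `TinySegNet` below is a hypothetical stand-in for the trained U-Net variants:

```python
import torch
import torch.nn as nn

# Minimal stand-in encoder-decoder; the study used trained U-Net variants.
class TinySegNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv3d(1, 8, 3, padding=1), nn.ReLU())
        self.bottleneck = nn.Sequential(nn.Conv3d(8, 16, 3, padding=1), nn.ReLU())
        self.decoder = nn.Conv3d(16, 1, 3, padding=1)

    def forward(self, x):
        return self.decoder(self.bottleneck(self.encoder(x)))

model = TinySegNet().eval()

# Capture the bottleneck activations with a forward hook.
activations = {}
handle = model.bottleneck.register_forward_hook(
    lambda module, inputs, output: activations.update(bottleneck=output.detach())
)

volume = torch.randn(1, 1, 16, 64, 64)  # dummy (B, C, D, H, W) MRI crop
with torch.no_grad():
    model(volume)
handle.remove()

# Global average pooling yields a fixed-length "deep" feature vector.
deep_features = activations["bottleneck"].mean(dim=(2, 3, 4))  # shape (1, 16)
```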

The models used to perform the segmentations were all encoder-decoder Unet variations, namely Unet44,45.

Discussion

The issue with filtering features by inter-rater agreement is that a disagreement between segmentation borders does not necessarily mean the resulting features would not be good predictors; it simply means the readers do not agree on what their values should be. Hence, new approaches to combining information from two or more radiologists are of high importance. In this study, we addressed this with research question I.

Regarding this first research question, our results revealed that approach 1, corresponding to the evaluation of feature stability through inter-rater absolute agreement and the subsequent removal of unstable features, a technique currently recommended by radiomics guidelines and evaluated in the radiomics quality score46, produced the classifiers with the least ability to generalize to hold-out data. On the other hand, approach 7, corresponding to training classifiers with a radiomics dataset in which segmentation masks were randomly chosen from the two available radiologists, proved to be the highest performing across all hold-out test sets, supporting the hypothesis that the more heterogeneous the training data, the more generalizable the classifier may be on unseen data. Additionally, the performance of this classifier was very similar across the different hold-out test sets, indicating its robustness to radiologists with different years of experience. This was further confirmed by the performance on the resampledRad test set, which simulates a real-world clinical environment in which a deployed model would be used by several physicians. These results are therefore highly relevant for the clinical translation of AI models. Pending further validation studies, they suggest that gathering segmentations from different radiologists will produce classifiers that are more robust to slight differences in segmentation margins.

Regarding research question II, the few publications comparing deep learning and radiomics-trained classical machine learning on the same classification problem18,19 reported higher performance on the training set when using deep learning, but lower performance on the test set when compared to classical machine learning algorithms trained with radiomic features. In this work, deep learning's natural tendency to overfit was confirmed for both the deep and hybrid classifiers. Even though the hybrid classifier overfitted considerably less than the deep model, it did not outperform the radiomics classifier. Despite this, we believe this hybrid approach is worth exploring further with larger datasets and external validation.

This study has several limitations. First, this was a retrospective study, so a multi-center prospective analysis should be carried out to validate these results and to investigate the impact these predictive models have on patient outcomes. Second, only T2W, DWI, and ADC sequences were used. Other sequences, such as dynamic contrast-enhanced MRI, could be worth exploring; however, since they are not consistently part of MRI examination protocols, they were not included in our models. Third, although the overall class imbalance was addressed through SMOTE upsampling of the minority class, we did not address the imbalanced anatomical distribution of the lesions, the large majority of which belong to the peripheral zone. It would be interesting to investigate the model's performance on the different anatomical zones independently. Fourth, using a publicly available dataset increased transparency but limited our access to clinical data, such as PSA levels, patient age, or PI-RADS score, which are fundamental components of a clinician's assessment but could not be included in our model. Fifth, proper assessment of real-world clinical performance is only possible through external validation; this important step will be addressed in future work. Finally, inherent to the Gleason system is the subjectivity of cancer grading, so we must keep in mind that the gold standard used in this study is subject to human error and to inter- and intra-observer variability. In addition, the definition of clinical significance might be based on more than the Gleason score alone; variables such as tumour volume or tumour category might also be relevant.

Conclusion

In conclusion, the results presented in this study are highly relevant to the clinical translation of AI models. Heterogeneous radiomics datasets, in which segmentation masks come from more than one radiologist, produced the classifiers with the highest generalization power. Additionally, we studied the combination of radiomic and deep features for the classification of prostate cancer disease aggressiveness, showing promising results with the hybrid approach, which is worth exploring further with larger datasets.