Background

Prostate cancer (PCa) is one of the most common cancers and a leading cause of cancer-related death in men [1]. There has been an increasing interest in positron emission tomography (PET) agents targeting prostate-specific membrane antigen (PSMA), a transmembrane protein overexpressed on PCa cells, for imaging and directing therapy of PCa [2]. Radiotracer-avid and non-avid pitfalls have been described with PSMA PET imaging [3, 4]. Reliable classification of lesions with or without radiotracer uptake is an important clinical step in verifying the detection and determining the prognosis of PCa [5]. We developed a PSMA reporting and data system (PSMA-RADS version 1.0) framework to classify PSMA PET scans and individual findings that reflect the probability of PCa, thereby guiding management [5, 6]. We organized the PSMA-RADS framework around a 5-point scale where a higher score indicates a greater likelihood of PCa [5].

While medical images are typically visually evaluated by trained radiologists, this process may be time-consuming and subject to operator variability [7]. Radiomics is a rapidly advancing field that aims to perform high-throughput extraction of clinically relevant features from radiologic data to build diagnostic and prognostic models [8, 9]. Unlike traditional radiomics workflows that utilize engineered handcrafted features, deep learning (DL) approaches can automatically extract deep features to directly model medical endpoints from the input images [9]. Automated artificial intelligence and DL methods have significant advantages over manual evaluation, including more consistent extraction of radiomic features and reliable characterization of disease [7]. Several machine learning and DL applications have been developed for PSMA PET in patients with metastatic disease, including radiomics-based risk stratification, attenuation map estimation for PSMA PET/magnetic resonance imaging (MRI), and bone and lymph node lesion detection in PSMA PET/computed tomography (CT) images [10,11,12,13].

While DL methods can be conveniently treated as a black box, deep neural networks often suffer from a lack of interpretability.

PSMA-RADS classification

The framework performed per-lesion and per-patient PSMA-RADS classification. Softmax probabilities were averaged across slices for lesion-level predictions [26]. Patient-level predictions were performed by taking the highest PSMA-RADS score across all lesions on the scan following the recommended guidelines for PSMA-RADS interpretation [5]. Lesion-level performance was evaluated on the validation and test sets. Patient-level performance was evaluated on the test set. The receiver operating characteristic (ROC) curve, area under the ROC curve (AUROC), confusion matrix, overall accuracy, precision, recall, and F1 score were assessed. Accuracy metrics were class-weighted, and ROC curves were micro-averaged. The framework’s performance was evaluated when using both the physician-annotated and automatically extracted radiomic feature and tissue type information inputs. Performance was compared across different scanners.
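The two aggregation steps described above can be sketched in a few lines. This is a minimal illustration, not the study's implementation: the ordering of the nine PSMA-RADS subcategories in `CATEGORIES`, the function names, and the choice to report the patient-level result as the main 1 to 5 score are all assumptions made for the example.

```python
from statistics import fmean

# Assumed output order of the nine PSMA-RADS subcategories (hypothetical).
CATEGORIES = ["1A", "1B", "2", "3A", "3B", "3C", "3D", "4", "5"]


def lesion_prediction(slice_probs):
    """Average per-slice softmax vectors; return (category, mean_probs)."""
    # zip(*rows) iterates over columns, i.e. one category across all slices.
    mean_probs = [fmean(column) for column in zip(*slice_probs)]
    best = max(range(len(mean_probs)), key=mean_probs.__getitem__)
    return CATEGORIES[best], mean_probs


def patient_prediction(lesion_categories):
    """Return the highest main PSMA-RADS score (1-5) among the lesions."""
    # The leading digit of each subcategory label is its main score.
    return max(int(category[0]) for category in lesion_categories)


# Two slices of one lesion: the averaged probabilities favor PSMA-RADS-4.
slices = [
    [0.1, 0.0, 0.0, 0.6, 0.1, 0.0, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0, 0.8, 0.0],
]
category, _ = lesion_prediction(slices)
scan_score = patient_prediction(["1A", "3C", "2"])
```

Averaging before taking the argmax lets a slice with a confident prediction outweigh a slice with an ambiguous one, which a per-slice majority vote would not do.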

PCa classification

The framework provided a broad PCa classification, formulated as a binary classification task, based on the likelihood of benign versus disease findings according to the PSMA-RADS framework [5]. PSMA-RADS-1 and -2 lesions were categorized as likely benign findings, and PSMA-RADS-3, 4, and 5 lesions were categorized as likely disease [5]. The predicted softmax probabilities were summed across the respective PSMA-RADS categories. Lesion-level and patient-level performance was evaluated on the validation and test sets when using manually and automatically extracted inputs and compared across different scanners.
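As a sketch of the summation described above, the snippet below collapses a nine-way softmax vector into a single likelihood of disease. The category ordering, the names, and the 0.5 decision threshold are assumptions made for illustration; the text does not state the threshold used.

```python
# Assumed output order of the nine PSMA-RADS subcategories (hypothetical).
CATEGORIES = ["1A", "1B", "2", "3A", "3B", "3C", "3D", "4", "5"]
# PSMA-RADS-3, -4, and -5 are grouped as likely disease, per the text.
DISEASE = {"3A", "3B", "3C", "3D", "4", "5"}


def disease_probability(probs):
    """Sum the softmax probabilities of the likely-disease categories."""
    return sum(p for c, p in zip(CATEGORIES, probs) if c in DISEASE)


def classify(probs, threshold=0.5):
    """Binary label; the 0.5 threshold is an assumption of this sketch."""
    if disease_probability(probs) >= threshold:
        return "likely disease"
    return "likely benign"


probs = [0.30, 0.20, 0.10, 0.10, 0.10, 0.05, 0.05, 0.05, 0.05]
label = classify(probs)  # disease mass is 0.40, so "likely benign"
```

Because the softmax outputs sum to one, thresholding the disease mass at 0.5 is equivalent to picking whichever group, benign or disease, holds more total probability.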

t-SNE analysis

The framework’s predictions were visualized using t-SNE to provide an understanding of how the framework clusters its predictions. t-SNE is an unsupervised dimensionality reduction technique used to visualize the local structure and global geometry of high-dimensional data [27]. The framework’s predictions were visualized in two dimensions via t-SNE with principal components analysis initialization.

A confidence score for PSMA-RADS classification

The framework provided confidence scores reflecting the expected level of accuracy. Temperature scaling, a single-parameter variant of Platt scaling, was performed to calibrate the framework’s outputs before the softmax activation [15]. The optimal temperature, T, for temperature scaling calibration was found on the validation set and applied on the test set to yield well-calibrated softmax probabilities. Confidence scores were defined as the calibrated softmax probability corresponding to the predicted PSMA-RADS category. Confidence histograms were examined, and the confidence scores of correct and incorrect predictions were compared. Confidence scores were also visualized in t-SNE space.
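The calibration step can be illustrated as follows. Temperature scaling divides the pre-softmax outputs (logits) by a scalar T chosen to minimize the negative log-likelihood on the validation set. The sketch below swaps in a simple grid search over candidate temperatures in place of a gradient-based optimizer; the grid range, the function names, and the toy logits are assumptions for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels."""
    losses = []
    for row, label in zip(logit_rows, labels):
        p = softmax(row, temperature)[label]
        losses.append(-math.log(max(p, 1e-12)))  # floor avoids log(0)
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates=None):
    """Grid search for T on validation data (a stand-in for an optimizer)."""
    if candidates is None:
        candidates = [0.5 + 0.05 * i for i in range(191)]  # 0.5 to 10.0
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))


def confidence(logits, temperature):
    """Calibrated softmax probability of the predicted category."""
    return max(softmax(logits, temperature))


# Toy validation set: confident logits, but only 3 of 4 predictions correct.
val_logits = [[4, 0, 0], [4, 0, 0], [0, 4, 0], [4, 0, 0]]
val_labels = [0, 1, 1, 0]
T = fit_temperature(val_logits, val_labels)
```

Dividing by T > 1 flattens the softmax without changing which category has the highest probability, so calibration lowers the confidence scores while leaving the predicted categories, and hence the accuracy, unchanged.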

A probability score for PCa

The framework provided a probability score that reflected the likelihood of PCa. The probability scores were derived by summing the calibrated softmax probabilities across the respective PSMA-RADS categories corresponding to disease findings. The distribution of probability scores for the test set predictions was compared on boxplots according to their PSMA-RADS categories and visualized on a t-SNE scatter plot.

Feature importance

Feature importance experiments were performed to evaluate the robustness of the framework. Different input combinations, including the cropped PET image (I), the extracted radiomic features (F), and the tissue type of the lesion (L), were used to train the framework (Additional file 1: Table S3). The framework was evaluated on the validation set for lesion-level prediction using the manually extracted inputs for each input combination. Feature ablation experiments were also performed to further assess the importance of the radiomic features where individual radiomic features were removed from the inputs during prediction. The relative performance reductions in overall accuracy due to feature ablation compared to the model predictions without feature ablation were assessed on the validation set for lesion-level prediction using the manually extracted inputs.
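The ablation comparison reduces to a simple ratio. The sketch below assumes the relative reduction is expressed as a percentage of the accuracy obtained without ablation; the function names, feature names, and accuracy values are hypothetical and are not results from the study.

```python
def relative_reduction(baseline_accuracy, ablated_accuracy):
    """Percent drop in accuracy relative to the un-ablated baseline."""
    return 100.0 * (baseline_accuracy - ablated_accuracy) / baseline_accuracy


def rank_features(baseline_accuracy, ablated_accuracies):
    """Sort features from largest to smallest relative reduction."""
    reductions = {
        name: relative_reduction(baseline_accuracy, accuracy)
        for name, accuracy in ablated_accuracies.items()
    }
    return sorted(reductions.items(), key=lambda item: item[1], reverse=True)


# Hypothetical accuracies after removing one feature at a time.
ranking = rank_features(0.80, {"feature_a": 0.60, "feature_b": 0.76})
```

Expressing the drop relative to the baseline makes the rankings comparable between networks whose baseline accuracies differ, such as one given the image and radiomic features and one given radiomic features only.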

Statistical analysis

Statistical significance was determined using a two-tailed t test, with P < 0.05 taken to indicate a significant difference. ROC curve 95% confidence and tolerance intervals were computed with 1000 bootstrap samples. Statistical analysis and data processing were implemented in Python 3.8.8 and MATLAB 2022b. The framework was implemented in TensorFlow 2.4.1 and Keras 2.4.3 on NVIDIA Quadro P5000 and NVIDIA A6000 GPUs with Linux CentOS 7.6 and Windows 10 operating systems.
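To illustrate the bootstrap procedure, the sketch below computes a percentile 95% confidence interval for a scalar binary AUROC from 1000 resamples. It is a simplification under stated assumptions: the study reports intervals on micro-averaged ROC curves, whereas this example treats a single binary task, and the percentile method, seed, and function names are choices made for the example.

```python
import random


def auroc(labels, scores):
    """Binary AUROC: probability that a positive outranks a negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile confidence interval for AUROC from resampled cases."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUROC is undefined when one class is missing
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


labels = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.5]
point_estimate = auroc(labels, scores)
ci_low, ci_high = bootstrap_ci(labels, scores)
```

Resampling cases with replacement approximates the sampling variability of the metric without assuming a particular distribution, which is useful when the test set is small.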

Results

Characterizing the PSMA PET data

A histogram of the PSMA-RADS categories and tissue types across all lesions is shown in Fig. 1b. Lesion-level and patient-level distributions of the PSMA-RADS categories and scanner types are shown in Table 1. There were 898, 1873, 127, and 896 lesions with a tissue type of bone, lymphadenopathy, prostate, and soft tissue, respectively.

PSMA-RADS classification

Accuracy metrics, ROC curves, and confusion matrices on the framework’s performance on the PSMA-RADS classification task are given in Table 2 and Fig. 3a, b. When using automatically extracted inputs, the framework yielded AUROC values of 0.93 and 0.87 (Fig. 3a) and overall accuracies of 0.67 and 0.52 on the validation and test sets, respectively, for lesion-level prediction. ROC curves and AUROC values on the validation and test sets for lesions with a tissue type of bone, lymphadenopathy, prostate, and soft tissue, respectively, are shown in Additional file 1: Fig. S5 for lesion-level prediction when using manually extracted inputs. For patient-level prediction, the framework yielded AUROC values of 0.91 and 0.90 and overall accuracies of 0.77 and 0.77 on the test set with the manually and automatically extracted inputs, respectively (Table 2, Fig. 3a). The framework’s lesion-level and patient-level overall accuracy was not significantly different across different scanners (P > 0.05).

Table 2 Performance on PSMA-RADS classification
Fig. 3

ROC curves and confusion matrices for the PSMA-RADS classification task (a, b) and the broad prostate cancer classification task (c, d) when using the automatically extracted inputs. The shaded blue and gray areas correspond to the 95% confidence intervals and the 95% tolerance intervals on the ROC curves, respectively

PCa classification

Accuracy metrics, ROC curves, and confusion matrices on the framework’s performance for PCa classification are given in Table 3 and Fig. 3c, d. The framework yielded AUROC values of 0.98 and 0.96 and overall accuracies of 0.94 and 0.89 on the validation and test sets, respectively, for lesion-level prediction using manually extracted inputs (Table 3). When using automatically extracted inputs, the framework yielded AUROC values of 0.95 and 0.92 and overall accuracies of 0.89 and 0.85 on the validation and test sets, respectively (Fig. 3c). For patient-level prediction, the framework yielded overall accuracies of 0.92 and 0.89 and AUROC values of 0.84 and 0.85, when using the manually and automatically extracted inputs, respectively, on the test set. The framework’s lesion-level and patient-level overall accuracy was not significantly different across scanners (P > 0.05).

Table 3 Performance on PCa classification

t-SNE analysis

The t-SNE scatter plots of the framework’s predictions are shown in Fig. 4. The framework formed well-defined clusters of the predicted PSMA-RADS categories (Fig. 4a). These clusters were preserved when labeled according to the physician annotations (Fig. 4b). The framework learned the global relationship between benign, equivocal, and disease findings (Fig. 4c, d). PSMA-RADS-1A, -1B, and -2 predictions were clustered together in the upper right of the t-SNE space forming a global cluster of benign findings. PSMA-RADS-4 and -5 predictions, findings that were highly likely PCa, were closely clustered in the lower left of the t-SNE space. Equivocal findings corresponding to PSMA-RADS-3A, -3B, and -3D predictions were closely clustered near the PSMA-RADS-4 and -5 predictions at the center of the t-SNE space between the global benign and disease clusters (Fig. 4a). This reflects the uncertainty of those equivocal findings on their compatibility with PCa. Interestingly, PSMA-RADS-3C predictions were clustered near PSMA-RADS-1B and -2 predictions (Fig. 4a). This may be because PSMA-RADS-3C findings are atypical for PCa and likely to be other non-prostate malignancies or benign tumors [5].

Fig. 4

t-SNE scatter plots of predictions on test set labeled according to their predicted PSMA-RADS categories (a, b) and to their predicted PSMA-RADS categories corresponding to benign, equivocal, and disease findings (c, d)

A confidence score for PSMA-RADS classification

The optimal temperature for temperature scaling was T = 4.26. Confidence histograms before and after performing temperature scaling calibration are shown in Fig. 5a, b. Before calibration, the framework’s average confidence was 0.90 on the test set. After calibration, the average confidence was 0.63, closely matching the framework’s overall accuracy of 0.61. A confidence histogram comparing correct and incorrect predictions is shown in Fig. 5c. The mean confidence scores were significantly higher (P < 0.05) for correct predictions (0.68) than for incorrect predictions (0.55). The distribution of confidence scores on t-SNE space is shown in Fig. 5d. The framework was less confident of predictions near the boundaries between individual PSMA-RADS subcategory clusters and more confident of predictions farther away from those boundaries.

Fig. 5

Confidence histograms comparing the average confidence to expected accuracy (a, b). Stacked confidence histogram for correct and incorrect predictions (c). Confidence scores were depicted on a t-SNE scatter plot (d)

A probability score for PCa

Boxplots of probability scores reflecting the likelihood of PCa are shown in Fig. 6a–c. Higher probability scores were assigned to lesions with higher PSMA-RADS scores (Fig. 6b). PSMA-RADS-1 and -2 lesions had a mean probability score of 0.19 corresponding to benign findings (Fig. 6c). PSMA-RADS-4 and -5 lesions had a mean probability score of 0.86, reflecting the high likelihood of PCa. PSMA-RADS-3 lesions had an intermediate mean probability score of 0.75 corresponding to equivocal findings. However, PSMA-RADS-3C lesions had a significantly lower mean probability score of 0.57 (P < 0.05) when compared to PSMA-RADS-3A, -3B, and -3D lesions (Fig. 6a). This reflects the PSMA-RADS categorization scheme since PSMA-RADS-3C lesions are atypical for PCa [5]. The distribution of probability scores on t-SNE space (Fig. 6d) showed an increased likelihood of PCa from the benign to the disease clusters.

Fig. 6

Notched boxplots of probability scores according to each of the PSMA-RADS sub-categories (a), the main PSMA-RADS categories (b), and the broad disease categories (c) where the green triangles correspond to the mean and the horizontal lines correspond to the median. Probability scores that reflect the likelihood of PCa were depicted on a t-SNE scatter plot (d)

Feature importance

The network trained on all input features had the highest performance across all evaluation metrics for lesion-level prediction (Fig. 7a, b and Additional file 1: Table S4) and had a significantly higher overall accuracy than all other networks (P < 0.05). The network trained on both the image and radiomic features outperformed the networks trained only on either the image or radiomic features, highlighting the synergy in combining the radiomic and CNN-extracted features. The networks trained on both the image and tissue type information and both the radiomic features and tissue type information outperformed the networks trained only on either the image or tissue type information, respectively, highlighting the importance of the tissue type information.

Fig. 7

Accuracy metrics (a) and ROC curves (b) for different input feature combinations and the relative reduction in overall accuracy due to radiomic feature ablation (c, d). Error bars correspond to 95% confidence intervals (a)

The relative reductions in performance due to radiomic feature ablation are shown in Fig. 7c, d. Ablation of the lesion-to-background ratio and the mean standardized uptake value (SUVmean) of the lesion resulted in the highest and second highest reductions in performance, respectively, for the network given both the image and radiomic features and the network given only radiomic features, highlighting the importance of those features. Circularity and maximum standardized uptake value (SUVmax) of the lesion were the third and fourth most important features in both cases, respectively, followed by lesion volume as the fifth or sixth most important feature. This emphasizes the importance of accurate lesion delineation for reliable extraction of radiomic features reflecting such intensity and shape characteristics.

Discussion

PSMA PET imaging has shown superior performance in the detection and staging of primary and metastatic PCa compared to conventional imaging modalities such as CT, MRI, and bone scan [3, 28, 29]. Our framework incorporated DL and radiomics to classify findings on [18F]DCFPyL PET scans. The framework classified findings on the test set into appropriate PSMA-RADS categories and yielded AUROC values of 0.87 and 0.90 for lesion-level and patient-level predictions, respectively. The framework provided broad PCa classification with AUROC values of 0.92 and 0.85 on the test set for lesion-level and patient-level predictions, respectively. A t-SNE analysis showed prediction clusters consistent with the PSMA-RADS categorization scheme. The framework provided confidence and probability scores reflecting the uncertainty and likelihood of PCa, respectively.

Lesion-level PSMA-RADS classification performance was comparable across the test and validation sets, except for PSMA-RADS-3D lesions, which were largely misclassified as PSMA-RADS-3A lesions on the test set. Such cases of inaccuracy would not affect the recommendation suggested by the PSMA-RADS framework since further work-up or follow-up imaging would be required for PSMA-RADS-3A and -3D lesions [5]. Three out of six lesions incorrectly classified as PSMA-RADS-3D lesions on the test set were PSMA-RADS-1A lesions (Fig. 3b), likely because PSMA-RADS-3D lesions lack uptake on PSMA PET imaging despite representing potential malignancy on anatomic imaging [5]. Similarly, 8/9 lesions incorrectly classified as PSMA-RADS-3C lesions on the test set were PSMA-RADS-1B and -2 lesions (Fig. 3b). These observations reflect the complexity of the PSMA-RADS-3 designation.

The framework maintained an overall accuracy of 0.77 (41/53) on the test set with both automatically and manually extracted inputs for the patient-level PSMA-RADS classification (Table 2), highlighting the robustness of the framework. Similarly, the framework yielded overall accuracies of 0.85 (621/732) and 0.89 (47/53) on the test set with automatically extracted inputs for lesion-level and patient-level broad PCa classification, respectively (Fig. 3d, Table 3).

A t-SNE analysis revealed learned local and global relationships between the PSMA-RADS categories and benign, equivocal, and disease findings (Fig. 4). The framework provided confidence and probability scores, which may help radiologists interpret the predicted outputs to make a more informed clinical diagnosis (Figs. 5, 6). A high level of uncertainty could serve as a flag for physicians to put less weight on the predicted output or to take a second look when determining diagnosis [30]. The confidence and probability scores may assist in better defining how patients should be treated when they appear to have limited volume recurrent or metastatic disease and are being considered for metastasis-directed therapy [31].

PSMA PET radiotracers have been observed to have physiologic uptake patterns and uptake in various benign bone pathologies, which may result in false-positive findings [3]. Benign findings were accounted for in our dataset where PSMA-RADS-1 and PSMA-RADS-2 findings corresponded to certainly or almost certainly benign regions of uptake [5]. For example, of the 898 regions of uptake in the bone, 33 (3.67%) were PSMA-RADS-1A findings, 15 (1.67%) were PSMA-RADS-1B, and 53 (5.90%) were PSMA-RADS-2 (Fig. 1b). Our framework was trained to differentiate regions of uptake corresponding to PCa and benign findings.

The tissue type information was found to be especially important in improving overall performance (Fig. 7a, b and Additional file 1: Tables S3–S4). Incorporating CT or MRI imaging may provide further anatomic information, especially for lesions with low uptake on the PET image [13]. For example, incorporating dynamic contrast-enhanced MRI may help improve the detection and characterization of skeletal metastases in patients with PCa [32, 33]. While performing textural analysis is challenging on PET due to limited spatial resolution, incorporating higher-order radiomic features, such as gray-level co-occurrence matrix, gray-level run-length matrix, and gray-level size zone matrix, from CT or MRI imaging, may help further improve performance [34].

The four most important radiomic features were lesion-to-background ratio, lesion SUVmean, circularity, and lesion SUVmax in feature ablation experiments (Fig. 7c, d). The variance of the background SUV was relatively important for the network given only radiomic features resulting in a 20.50% reduction in performance after ablation (Fig. 7d). However, the background SUV variance was the least important feature for the network that was given both the image and radiomic features with less than 1% reduction in performance (Fig. 7c), indicating that the CNN extracted important characteristics about the overall context of the lesion from the surrounding background in the input image. Interestingly, the network that was given both the image and the radiomic features was generally less sensitive to feature ablations and had smaller reductions in relative performance when compared to the network that was not given the image (Fig. 7c, d). This suggests that deep features extracted by the CNN are complementary to the radiomic features and can help compensate for the loss of information in ablated features.

While the highest PSMA-RADS score was used to determine the overall scan category, we also considered the impact of using the lower PSMA-RADS scores on patient-level performance which may be relevant in the case of less experienced readers. This was done by taking the median PSMA-RADS score predicted for individual lesions on the scan as the overall score compared to the maximum PSMA-RADS score. ROC curves and AUROC values for patient-level prediction when using manually extracted inputs are shown in Additional file 1: Fig. S6. As expected, using the maximum PSMA-RADS score yielded more reliable patient-level predictions than when using the median PSMA-RADS score, which yielded a lower AUROC value of 0.62 (Additional file 1: Fig. S6).

The framework’s performance was affected by the class imbalance present in the dataset (Fig. 1b). The PSMA-RADS-3C and -3D categories had the lowest performance and the fewest lesions in the entire dataset. Most scans had an overall PSMA-RADS score of either PSMA-RADS-4 or -5, further contributing to the patient-level class imbalance (Table 1). To combat class imbalances, generative adversarial networks could be leveraged to generate a large amount of simulated data to train the framework [35,36,37]. Training the framework using ensemble learning may also improve performance, as such meta-learning approaches have had success for classification and prognostic tasks [38, 39].

Related work by Johnsson et al. introduced aPROMISE, a software platform for lesion detection in PSMA PET/CT [13]. In aPROMISE, U-net-based anatomical segmentations of bones and organs on the CT image are fused to the PET image, and lesion detection and segmentation are performed by blob detection and fast marching method, respectively [13]. In contrast, our approach performs deep learning-based lesion classification and segmentation using only the PET image. Another key difference is that aPROMISE is based on the PROMISE criteria, whereas our approach is based on PSMA-RADS [5, 13].

Our study had limitations. First, the framework was validated with physician-annotated PSMA-RADS categories subject to inter-operator variability. While the PSMA-RADS categorization scheme has been shown to have a high inter-observer agreement rate, further validation of the framework by histopathology or a multiple-reader consensus study is important for clinical translation [17, 18, 40]. Second, the framework was trained on a per-slice basis. Incorporating the whole imaged volume may help provide anatomic context by considering the presence of other lesions, for example, in the chest or abdomen regions [41, 42]. Third, while the framework incorporates lesion classification and segmentation tasks, the framework does not perform lesion detection. Incorporating the automated detection task could help identify regions of uptake that might be missed [43, 44].

Conclusion

In conclusion, a DL- and radiomics-based framework was developed to perform lesion-level and patient-level PSMA-RADS and PCa classification on PSMA PET images. A t-SNE analysis revealed learned relationships between the PSMA-RADS categories and disease findings on PSMA PET scans. The framework was interpretable and provided well-calibrated confidence and probability scores for each prediction.