Introduction

Treatment individualisation is a central objective for the improvement of radiotherapy outcomes1. In particular, patients diagnosed with locally advanced head and neck squamous cell carcinoma (HNSCC) might benefit from individualised treatment, since the five-year overall survival probability after primary radiochemotherapy is only approx. 50%2. Subgroups of patients may be identified that are currently under- or overtreated and might benefit from, e.g., escalated or de-escalated dose prescriptions. Individualisation of treatment may be based on statistical survival models that predict endpoints such as overall survival or loco-regional tumour control (LRC). Survival models are able to analyse time-to-event data, which frequently contain censored observations. The prognostic value of these models is based on biomarkers that are able to stratify patients into groups at different risk of treatment failure. Such biomarkers may result from clinical or tumour-related features such as age, gender or tumour stage, molecular analyses of tumour biopsies such as human papillomavirus (HPV) status or gene signatures, dosimetric information, clinical imaging data from computed tomography (CT), magnetic resonance imaging (MRI) or positron emission tomography (PET) scans, or combinations thereof3,4,5,6,7,8,9,10,11,12,13.

Imaging data are considered a valuable source of information for tailoring individual treatment due to their non-invasiveness, repeatability and ability to represent the entire tumour. Numerous radiomics models, in which traditional machine-learning (ML) methods were applied to hundreds or thousands of pre-defined, handcrafted image features, have been developed14,15,16,17,18, but have not yet surpassed the threshold for clinical acceptance and applicability19. Recently, Ger et al.20 found that radiomics features of CT and PET scans failed to improve upon clinical risk models in a large head and neck cancer dataset. With the recent advances that deep convolutional neural networks (CNNs) have brought to the fields of natural and medical image analysis, there is hope to elevate model performance for radiotherapy outcome modelling as well. This is mostly due to the fact that CNNs automatically learn abstract feature representations of the input data during training. However, so far most applications of deep learning to medical images revolve around tasks of segmentation21 or classification22,23,24. The same holds true for the field of radiotherapy, where most applications of deep learning focus on segmentation, computer-aided detection or motion management25. Only a few attempts have been published that combine deep learning on medical imaging data with survival analysis26,27,1.

We investigated four approaches to develop survival models based on the Cox proportional hazards model (CPHM) for the prediction of LRC hazards for patients diagnosed with locally advanced HNSCC. First, (i) a clinical model was trained on the exploratory cohort and evaluated on the independent validation cohort to provide baseline performance metrics. Moreover, three deep learning based strategies using CNNs were applied: we (ii) trained models completely from scratch, using 3D-CNNs as well as 2D-CNNs, (iii) applied a transfer learning approach leveraging weights of pre-trained 2D-CNN networks, and (iv) created a deep autoencoder and used its bottleneck features in a traditional CPHM.

Figure 1

Design of the analysis. (i) To provide baseline results, a clinical Cox proportional hazards model (CPHM) was trained on the exploratory cohort and evaluated on the independent validation cohort. (ii)–(iv) Three deep learning approaches were evaluated by training convolutional neural networks in a cross-validation approach. Subsequently, for each approach ensembles were constructed from the models obtained during cross-validation and their performance was evaluated on the independent validation cohort.

Prognostic performance was evaluated by two approaches: calculation of the concordance index (C-index) and assessment of the ability to stratify patients into two risk groups based on the model predictions. The C-index37,38,39 measures the alignment between the observed times of loco-regional recurrence and the model predictions. It is given on a scale between zero and one, with 0.5 indicating no prognostic value of the model. A C-index close to zero represents perfect predictions, since predicted hazards should be lower for patients with a longer recurrence-free time. We emphasise that this is in contrast to the situation of directly predicting event times, where a C-index close to one would be desirable. 95% confidence intervals (CI) for C-indices were computed using the survcomp R package40,41, which implements the method proposed by Pencina et al.42. Models that did not contain the C-index 0.5 within the 95% CI on the independent validation cohort were considered successfully validated.
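The C-index itself was computed with the survcomp R package; purely as an illustration of its orientation in our setting, the following minimal Python sketch (using the lifelines package, which is our choice for this example, with hypothetical data) shows that concordance between predicted hazards and recurrence-free times approaches zero for a well-performing model:

```python
import numpy as np
from lifelines.utils import concordance_index

# Hypothetical data: times to loco-regional recurrence (months),
# event indicators (1 = recurrence observed, 0 = censored) and
# model-predicted log-hazards (higher hazard should mean shorter time).
times = np.array([6.0, 12.0, 24.0, 36.0, 60.0])
events = np.array([1, 1, 1, 0, 0])
predicted_hazards = np.array([2.1, 1.4, 0.3, -0.5, -1.2])

# lifelines counts a pair as concordant when a HIGHER score goes with a
# LONGER time; with hazards as scores, a good model therefore yields a
# C-index close to zero, as described above.
print(concordance_index(times, predicted_hazards, events))  # 0.0 here
```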

Furthermore, based on the model predictions, patients were assigned to two groups, at low or at high risk for loco-regional recurrence. This stratification was based on the hazard values predicted by the models for every individual patient. The median value of these predictions on the exploratory cohort was used as a cutoff. Patients with a predicted hazard exceeding the cutoff were assigned to the high risk group and the remaining patients with hazards smaller or equal to the cutoff were assigned to the low risk group. To stratify patients of the independent validation cohort, the same cutoff was applied. The difference in LRC between the stratified patient groups was assessed using the log-rank test for the Kaplan–Meier (KM) curves of both risk groups. Significance was established for p values below 0.05.
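A minimal sketch of this stratification and test procedure, assuming the lifelines Python package (the package choice and the function name are ours for illustration):

```python
import numpy as np
from lifelines.statistics import logrank_test

def stratify_and_test(pred_explor, pred_valid, times_valid, events_valid):
    """Median-cutoff risk stratification with a log-rank test (sketch)."""
    # cutoff: median of the predicted hazards on the exploratory cohort
    cutoff = np.median(pred_explor)
    # the same cutoff is applied to the independent validation cohort
    high = pred_valid > cutoff
    res = logrank_test(times_valid[~high], times_valid[high],
                       event_observed_A=events_valid[~high],
                       event_observed_B=events_valid[high])
    return cutoff, res.p_value  # significant stratification if p < 0.05
```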

To address the random nature of the CNN training procedure and to leverage the benefits of model ensembles43, we repeated model training three times, each time using 10-fold cross-validation (CV) based on the exploratory cohort, stratified by the LRC event status, for a total of 30 CV runs. By applying CV on the exploratory cohort, splits of the samples into training and internal test folds were obtained. Models were built in each CV run using the data of the training fold. Data of the internal test fold was set aside for optional hyperparameter tuning and data of the independent validation cohort was used to measure model performance on previously unseen data.

Since each of the 30 CV runs resulted in a trained model (which we refer to as a single model), we created ensemble predictions by averaging the network outputs, essentially considering the information of multiple models before making a final prediction.

Image processing

Preprocessing of patient CT scans was carried out using an in-house developed toolkit44 (available from https://github.com/oncoray/mirp) by performing (1) cubic interpolation to an isotropic voxel size of \(1 \, {\text {mm}}^{3}\), (2) cropping of the transversal plane to 224 by 224 pixels (with the tumour's centre of mass as the centre of the cropped slice), (3) clipping of the intensity range of Hounsfield units (HU) to the range [-200, 200] and (4) normalisation of pixel values to the interval (0, 1).
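These four steps can be sketched as follows (a simplified NumPy/SciPy version with hypothetical inputs; the actual pipeline is implemented in the linked mirp toolkit, and the boundary handling here is an assumption):

```python
import numpy as np
from scipy import ndimage

def preprocess(ct, mask, spacing):
    """Sketch of steps (1)-(4) for one scan; `spacing` is (z, y, x) in mm."""
    # (1) cubic interpolation to isotropic 1 mm voxels
    ct = ndimage.zoom(ct, zoom=spacing, order=3)
    mask = ndimage.zoom(mask, zoom=spacing, order=0)
    # (2) crop the transversal (y, x) plane to 224 x 224 around the
    # tumour's centre of mass (out-of-bounds padding omitted for brevity)
    cz, cy, cx = (int(round(c)) for c in ndimage.center_of_mass(mask))
    ct = ct[:, cy - 112:cy + 112, cx - 112:cx + 112]
    # (3) clip Hounsfield units to [-200, 200]
    ct = np.clip(ct, -200, 200)
    # (4) normalise pixel values towards the interval (0, 1)
    return (ct + 200) / 400.0
```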

Multiple image samples of each patient's CT scan were extracted and used for model training and prediction. For all 2D-CNN models, we used 7 slices cranial and 8 slices caudal of the slice with the largest tumour area as provided by the segmentation mask, comprising a total of 16 transversal CT slices per patient. For training of the 3D-CNNs, we used smaller image regions of the axial plane due to GPU memory limitations. We first extracted a \(32 \times 64 \times 64\) (\(\hbox {z} \times \hbox {y} \times \hbox {x}\)) sized volume centred at the tumour's centre of mass. Then, 15 additional random volumes of the same size were extracted for each patient, with volume centres sampled uniformly from a cubic region of edge length 32 around the tumour's centre of mass. Zero padding was added to all extracted volumes where necessary. For each of the volumes, a prediction was computed; these were subsequently averaged to obtain a single prediction for each patient.
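The 3D sampling strategy might look as follows (a sketch; variable names and the padding implementation are ours):

```python
import numpy as np

def extract_volume(ct, centre, shape=(32, 64, 64)):
    """Extract a (z, y, x) volume around `centre`, zero-padded at borders."""
    half = np.array(shape) // 2
    padded = np.pad(ct, [(h, h) for h in half], mode='constant')
    c = np.array(centre) + half  # shift centre into padded coordinates
    return padded[tuple(slice(cc - h, cc + h) for cc, h in zip(c, half))]

# one volume at the tumour centre of mass plus 15 random volumes whose
# centres are drawn uniformly from a cube of edge length 32 around it
rng = np.random.default_rng(42)
com = np.array([40, 120, 118])  # hypothetical tumour centre of mass (z, y, x)
centres = [com] + [com + rng.integers(-16, 17, size=3) for _ in range(15)]
# volumes = [extract_volume(ct_array, c) for c in centres]
# the per-patient prediction is the average of the model outputs over volumes
```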

Cox proportional hazards model

The traditional CPHM fits the effect of p-dimensional covariates \({\varvec{x}}\) on the hazard function h via \(h(t, {\varvec{x}}) = h_{0}(t) \exp \left( \sum _{j=1}^{p}\beta _{j}x_{j}\right) ,\) with an unspecified baseline hazard function \(h_{0}(t)\). We followed Katzman et al.29 in extending this to the more general form of \(h(t, {\varvec{x}}) = h_{0}(t) \exp \left( \gamma _{\varvec{\beta }}({\varvec{x}})\right)\) with \(\varvec{\beta }\) denoting weights learned by a neural network. Log-hazard values \(\gamma _{\varvec{\beta }}({\varvec{x}})\) were estimated from CT image samples \({\varvec{x}}\) by minimisation of (a batch approximation of) the negative of the Cox partial log-likelihood function

$$\begin{aligned} \ln L = \sum _{i=1}^{n} \delta _{i} \left( \gamma _{\varvec{\beta }} ({\varvec{x}}_{i}) - \ln \left( \sum _{\begin{array}{c} j=1\\ t_{j} \ge t_{i} \end{array}}^{n}\exp (\gamma _{\varvec{\beta }}({\varvec{x}}_{j}))\right) \right) , \end{aligned}$$
(1)

letting \(\delta _{i}\) denote an event indicator variable that takes on the value 1 if loco-regional tumour recurrence was observed for CT sample i and 0 otherwise, and n being the total number of available CT samples. Further details on survival analysis and the CPHM are given in “Survival analysis and deep Cox proportional hazards modelling” section of the supplement.
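The negative of Eq. (1) can serve directly as a batch loss. The following is a minimal sketch under our own conventions (TensorFlow 2.x; the paper used TensorFlow 1.12, where the cumulative log-sum-exp would need to be written out manually), not the authors' exact implementation; event times and indicators are packed into y_true:

```python
import tensorflow as tf

def neg_cox_partial_log_likelihood(y_true, y_pred):
    """y_true[:, 0] = time t_i, y_true[:, 1] = event indicator delta_i;
    y_pred[:, 0] = estimated log-hazard gamma(x_i)."""
    time = y_true[:, 0]
    event = y_true[:, 1]
    gamma = y_pred[:, 0]
    # sort by descending time so that the risk set of sample i
    # (all j with t_j >= t_i) is exactly the prefix [0..i]
    order = tf.argsort(time, direction='DESCENDING')
    gamma = tf.gather(gamma, order)
    event = tf.gather(event, order)
    # log of the sum over the risk set, as a running log-sum-exp
    log_risk = tf.math.cumulative_logsumexp(gamma)
    # sum over observed events only; ties are ignored here, as discussed
    # in the "Discussion" section
    return -tf.reduce_sum(event * (gamma - log_risk)) / (
        tf.reduce_sum(event) + 1e-8)
```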

All computations were done using Python 3.6.7 and Keras 2.2.445 with tensorflow (v1.12.0) backend. Our code is publicly available from https://github.com/oncoray/cnn-hnscc and experimental outputs can be downloaded from http://doi.org/10.14278/rodare.255.

Clinical model

To develop the clinical CPHM, we considered the clinical features patient age, gender, cT-stage, cN-stage, UICC-stage, tumour site, p16 status, pathological grading, smoking status, alcohol consumption and primary tumour volume. These features have already been considered in previous studies7,35. Tumour site comprised the values oropharynx, hypopharynx, larynx and oral cavity and was one-hot encoded. Tumour volume was computed by summation of the tumour segmentation mask voxels and division by a factor of 1000 to obtain units of \({\text {cm}}^{3}\), followed by a (natural) logarithmic transformation.

Imputation of missing values for cN-stage, pathological grading and smoking status (1, 14 and 23 cases, respectively) was performed through selection of the most frequent value in the exploratory cohort. Due to more missing values (58 cases), p16 was converted into the variables \({\text {p16}}_{{\text {unknown}}}\) and p16. The same was done for alcohol consumption for which there were 96 missing cases. cT, cN, UICC and pathological grading stages were converted into the binary categories \({\text {cT}}<4\), \({\text {cN}}<2\), \({\text {UICC}}<4\)7 and \({\text {pathological grading}}<2\). Patient age and tumour volume were z-score normalised with means and standard deviations obtained from the exploratory cohort. Clinical features prognostic for LRC were selected by applying a forward variable selection CPHM based on the likelihood ratio test (inclusion \(\alpha = 0.05\), exclusion \(\alpha = 0.1\)) using the exploratory cohort. Finally, a CPHM was trained on the exploratory cohort using the selected features and applied to the independent validation cohort.
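A forward selection step based on likelihood-ratio tests could be implemented along the following lines (a sketch using the lifelines Python package, which is our assumption for illustration; the exclusion step with \(\alpha = 0.1\) is omitted for brevity):

```python
from lifelines import CoxPHFitter
from scipy import stats

def forward_select(df, candidates, time_col='time', event_col='event',
                   alpha_in=0.05):
    """Greedy forward feature selection for a CPHM via likelihood-ratio tests."""
    selected, ll_current = [], None
    while True:
        best_p, best_feat, best_ll = 1.0, None, None
        for feat in (f for f in candidates if f not in selected):
            cols = [time_col, event_col] + selected + [feat]
            cph = CoxPHFitter().fit(df[cols], time_col, event_col)
            if ll_current is None:
                # first feature: likelihood-ratio test against the null model
                stat = cph.log_likelihood_ratio_test().test_statistic
            else:
                # nested models: LR statistic with one degree of freedom
                stat = 2.0 * (cph.log_likelihood_ - ll_current)
            p = stats.chi2.sf(stat, df=1)
            if p < best_p:
                best_p, best_feat, best_ll = p, feat, cph.log_likelihood_
        if best_feat is None or best_p >= alpha_in:
            return selected
        selected.append(best_feat)
        ll_current = best_ll
```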

Model ensembles

Due to our cross-validation approach (10-fold CV repeated three times), 30 different models were trained in every analysis. By averaging the resulting predicted log-hazard values, one final ensemble prediction for the hazard of loco-regional recurrence was obtained for every patient. On the independent validation cohort, a patient's ensemble prediction was computed by averaging over all 30 model predictions. For every patient of the exploratory cohort, both a training and an internal test ensemble prediction were computed, since each patient appeared in training folds as well as in internal test folds. Training ensemble predictions were obtained by averaging, for every patient, over the 27 models for which that patient was part of the training fold. Similarly, internal test ensemble predictions were computed using only the remaining three models for which the patient belonged to the internal test fold. For ensemble stratification of patients into groups at low and high risk of loco-regional recurrence, the cutoff value was determined as the median of the training ensemble predictions.
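In array terms, the three kinds of ensemble predictions reduce to (masked) averages over the 30 models; a sketch with assumed array layouts:

```python
import numpy as np

def ensemble_predictions(preds_expl, in_test, preds_val):
    """preds_expl: (30, n_expl) log-hazards of each CV model on the
    exploratory cohort; in_test: boolean mask of the same shape, True where
    the patient was in that model's internal test fold (3 of 30 models per
    patient); preds_val: (30, n_val) predictions on the validation cohort."""
    train_ens = np.nanmean(np.where(in_test, np.nan, preds_expl), axis=0)
    test_ens = np.nanmean(np.where(in_test, preds_expl, np.nan), axis=0)
    val_ens = preds_val.mean(axis=0)
    cutoff = np.median(train_ens)  # risk-stratification cutoff (see above)
    return train_ens, test_ens, val_ens, cutoff
```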

Training from scratch

Different network architectures of 3D-CNN and 2D-CNN models were trained from scratch. In all training runs, we used the AMSGrad version46 of the Adam optimiser to estimate model parameters. For the 3D-CNN experiments, the same architecture and hyperparameters as given by Hosny et al.23 were used with small changes. Due to a different input shape, the first dense layer contained slightly fewer neurons. In the last layer, a single output neuron with \(\tanh\) activation was used instead of the two neurons with softmax activation which they used for classification purposes. Each model was trained for a fixed number of 200 epochs with a batch size of 24. Neither data augmentation nor callbacks for early stopping or learning rate adjustments were used.
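In Keras terms, this training configuration amounts to the following sketch (the learning rate is an assumption, taken only by analogy to Hosny et al.; `neg_cox_partial_log_likelihood` refers to the loss sketched in the previous section):

```python
from tensorflow.keras.optimizers import Adam

def train_3d_cnn(model, x_train, y_train):
    """Fixed-schedule training: no callbacks, no data augmentation."""
    model.compile(optimizer=Adam(learning_rate=1e-4, amsgrad=True),  # LR assumed
                  loss=neg_cox_partial_log_likelihood)
    model.fit(x_train, y_train, epochs=200, batch_size=24)
    return model
```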

The 2D-CNN architecture (Fig. 2) was loosely inspired by the VGG architecture. Training employed learning rate reduction (\(\min\_\hbox {lr}=10^{-7}\)) via the provided Keras callbacks. We also evaluated performance after replacing the final \(\tanh\) activation with a linear output, essentially allowing for unrestricted log-hazard ranges. Moreover, the effect of inserting batch normalisation (BN) layers between convolutions and ReLU activations was assessed.

Figure 2

Architecture used when training a 2D-convolutional neural network from scratch. Numbers give shapes of computed feature maps. The network consists of convolutional filters (‘conv’, light orange), with ReLU activation functions (orange). These are followed by a flattening layer and fully-connected dense layers (‘fc’, green). Network output is computed through a \(\tanh\) activation (purple). (a) This architecture was used when training only on image data. The model output is given by \(\gamma _{\varvec{\beta }}({\varvec{x}}_{{\text {img}}})\). (b) An additional dense layer was introduced when clinical features were used in addition to image data. The network output in this case is given by \(\gamma _{\varvec{\beta }}({\varvec{x}})\).

The effect of combining clinical features and CT samples as two separate inputs to a 2D-CNN was evaluated. First, Spearman correlation coefficients between the 2D-CNN model output (with BN and \(\tanh\) as final activation) and the clinical features were computed. Then, a second input branch, designed to estimate log-hazard values from the clinical features, was added to the network architecture as depicted in Fig. 2b. It consisted of a single dense neuron with \(\tanh\) activation and BN. The log-hazard estimates coming from the clinical branch and the image branch were then concatenated and fed through the final output layer, consisting again of a single dense neuron with \(\tanh\) activation and BN.
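A functional-API sketch of this two-branch architecture (the convolutional backbone and the exact ordering of dense, BN and \(\tanh\) layers are our reading of the description; the feature count is hypothetical):

```python
from tensorflow.keras import layers, Model

def hazard_neuron(x, name):
    # single dense neuron with batch normalisation and tanh activation
    x = layers.Dense(1, name=name)(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation('tanh')(x)

def conv_backbone(x):
    # placeholder for the convolutional part of Fig. 2a (layer counts assumed)
    for filters in [16, 32, 64]:
        x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
        x = layers.MaxPooling2D(2)(x)
    return layers.Flatten()(x)

n_clinical_features = 8  # hypothetical

img_in = layers.Input(shape=(224, 224, 1), name='ct_slice')
img_hazard = hazard_neuron(conv_backbone(img_in), 'image_branch')

clin_in = layers.Input(shape=(n_clinical_features,), name='clinical')
clin_hazard = hazard_neuron(clin_in, 'clinical_branch')

merged = layers.Concatenate()([img_hazard, clin_hazard])
log_hazard = hazard_neuron(merged, 'output')
model = Model(inputs=[img_in, clin_in], outputs=log_hazard)
```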

Transfer learning

We evaluated the capabilities of transfer learning for training 2D-CNNs, using the ResNet50, DenseNet201 and InceptionResNetV2 (IRNV2) architectures with pre-trained weights as foundations. Training employed learning rate reduction (\(\min\_\hbox {lr}= 10^{-7}\)) via the provided Keras callbacks. A batch size of 32 was used and neither data augmentation nor weight regularisation were applied.
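Such a transfer-learning head on top of a pre-trained network might be set up as follows (a sketch: the pooling choice, head size, reduction factor, patience and epoch count are assumptions, as is keeping the base trainable):

```python
from tensorflow.keras.applications import DenseNet201
from tensorflow.keras import layers, Model
from tensorflow.keras.callbacks import ReduceLROnPlateau

base = DenseNet201(weights='imagenet', include_top=False,
                   input_shape=(224, 224, 3))
# use the last convolutional layer of the pre-trained model as foundation
x = layers.GlobalAveragePooling2D()(base.output)  # pooling choice assumed
log_hazard = layers.Dense(1, activation='tanh')(x)
model = Model(base.input, log_hazard)

model.compile(optimizer='adam', loss=neg_cox_partial_log_likelihood)
lr_cb = ReduceLROnPlateau(monitor='val_loss', factor=0.5,  # factor assumed
                          patience=5, min_lr=1e-7)
# model.fit(x_train, y_train, batch_size=32, epochs=100,  # epochs assumed
#           validation_data=(x_test, y_test), callbacks=[lr_cb])
```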

Deep features

Following Wang et al.34, we trained a 2D-CNN autoencoder model that learns to reproduce input CT slices as closely as possible while passing them through a so-called bottleneck layer, which acts as a means of compression and dimensionality reduction. Successful reconstruction requires capturing important image characteristics at the bottleneck, and we assumed that relevant tumour information is also encoded within those features. The model architecture is provided in Fig. 3. It consisted of an encoder part of six convolutional layers with filter size \(3 \times 3\), starting with 16 filters and doubling the number on each subsequent layer. Leaky ReLU (\(\alpha =0.01\)) was used as activation and no BN was applied. Between convolutional layers, max-pooling was used to reduce spatial resolution by a factor of two. Finally, a last \(3 \times 3\) convolutional layer with 64 filters and the same specification as above was applied to reduce the number of features in the bottleneck representation. The decoder was constructed as a mirror image of the encoder, using upsampling layers to double the spatial resolution in each step. The decoder's last layer was a single \(1 \times 1\) convolutional filter with sigmoid activation function to produce outputs with a data range of (0, 1), matching the input image range. Using the binary-crossentropy loss function, we trained the autoencoder for 100 epochs with batches of size 32 using the AMSGrad version of the Adam optimiser with learning rate \(10^{-3}\). We used data augmentation by randomly shearing (shear_range = 0.1), zooming (zoom_range = 0.1) and rotating (rotation_range = 45) the input data.

We then extracted the bottleneck feature maps of each slice, which were of shape \(7 \times 7 \times 64\), corresponding to a reduction to 6.25% of the original image size (\(224 \times 224\)). Those features were flattened into a 3136-dimensional vector, and a principal component analysis (PCA) was performed as a means of dimensionality reduction, using the features of all slices of every patient from the training fold of the CV. Classical CPHMs were subsequently fitted on those training folds using one, two, five and ten PCA features. The learned PCA transformation was then applied to the features of the independent validation cohort before evaluating the performance of the trained CPHMs on those transformed features. In addition, a Lasso-based CPHM (LCPHM)52, which automatically selects relevant features, was fitted on the full set of bottleneck features of each training fold without performing a PCA, for a maximum of 5000 iterations. The best hyperparameter \(\lambda\), which determines the amount of L1 regularisation of the LCPHM, was obtained by a nested CV run on each training fold. This procedure was implemented using the R programming language and the glmnet package53.
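The encoder-decoder described above can be sketched in Keras as follows (shapes follow the description; minor details such as padding mode and the exact placement of pooling layers are assumptions):

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.optimizers import Adam

def conv_lrelu(x, filters):
    x = layers.Conv2D(filters, 3, padding='same')(x)
    return layers.LeakyReLU(alpha=0.01)(x)

inp = layers.Input(shape=(224, 224, 1))
x = inp
for f in [16, 32, 64, 128, 256]:   # encoder: conv + max-pooling, 224 -> 7
    x = conv_lrelu(x, f)
    x = layers.MaxPooling2D(2)(x)
x = conv_lrelu(x, 512)             # sixth encoder convolution at 7 x 7
bottleneck = conv_lrelu(x, 64)     # 7 x 7 x 64 bottleneck features

x = bottleneck                     # decoder mirrors the encoder
for f in [512, 256, 128, 64, 32]:
    x = conv_lrelu(x, f)
    x = layers.UpSampling2D(2)(x)
x = conv_lrelu(x, 16)
out = layers.Conv2D(1, 1, activation='sigmoid')(x)  # match (0, 1) input range

autoencoder = Model(inp, out)
autoencoder.compile(optimizer=Adam(learning_rate=1e-3, amsgrad=True),
                    loss='binary_crossentropy')
encoder = Model(inp, bottleneck)   # used to extract bottleneck features
```

After training, bottleneck features for a patient's slices would be obtained via encoder.predict(...), flattened to 3136-dimensional vectors, and passed to the PCA and CPHM/LCPHM steps described above.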

Figure 3

Architecture of the applied autoencoder. Numbers describe the shapes of computed feature maps. Convolutional layers ('conv') are comprised of convolutional filters (light orange) and Leaky ReLU (\(\alpha =0.01\)) activation functions (orange). Spatial downsampling is performed using max-pooling layers (red), resulting in a set of bottleneck features. Upsampling operations ('up', blue) and convolutional layers are then used to reconstruct the input image. A sigmoid activation (purple) is used as model output to match the range of the input data.

Results

Clinical model

All available clinical features were considered to develop a clinical model for the prediction of LRC hazards. Based on the forward variable selection procedure, only the tumour volume was selected. This univariate CPHM achieved a C-index of 0.39 (95% CI: 0.32–0.45) on the exploratory cohort and a C-index of 0.39 (95% CI: 0.30–0.48) on the independent validation cohort. Stratification of the independent validation cohort into patient groups at low and high risk of loco-regional recurrence based on this clinical model showed a statistical trend towards significance (\(p=0.052\), Supplementary Fig. 1).

Training from scratch

An ensemble of 3D-CNNs was successfully validated for the prediction of LRC. It achieved a C-index of 0.31 (95% CI: 0.22–0.39) on the independent validation cohort (Table 2), outperforming the clinical model. Ensembling slightly improved upon the average single model performance (C-index: 0.32, Supplementary Table 1). Moreover, stratification of patients of the independent validation cohort (Fig. 4, top row) into groups at low and high risk of loco-regional recurrence based on the model predictions revealed significant differences in LRC (\(p=0.001\)). Ensembles of 2D-CNN models trained from scratch were also successfully validated for prognosis of LRC. However, they showed higher C-indices than the 3D model (C-index: 0.38–0.39, Table 2), i.e. a performance comparable to the clinical model. Average single model performance was similar (Supplementary Table 1). All 2D ensemble models led to significant patient stratifications of the independent validation cohort for LRC or showed a statistical trend (\(p=0.051\); Fig. 4, centre row). Table 2 also shows that the inclusion of BN and the choice of final activation did not have a strong impact on performance regarding C-indices or stratification ability on the independent validation cohort. The Spearman correlation coefficient between model predictions and z-score normalised log-tumour volume was moderate across all 30 models (with BN, \(\tanh\) as final activation), with average values of 0.30 and 0.36 for the exploratory and independent validation cohort, respectively. Combining imaging data and tumour volume as network input resulted in decreased performance compared to models with only the CT image as input: a C-index of 0.40 (95% CI: 0.29–0.50) was obtained on the independent validation cohort and model predictions did not result in a statistically significant stratification (\(p=0.070\)).

Table 2 Ensemble training from scratch: C-indices for the endpoint loco-regional control (LRC) are computed by averaging the model predictions of the repeated cross-validation models to build an ensemble model.

Transfer learning

For transfer learning, the ensemble of DenseNet201 models built on its last convolutional layer was successfully validated for prognosis of LRC and achieved the best C-index of 0.37 (95% CI: 0.27–0.47) on the independent validation cohort (Table 3), which was slightly better than the clinical model. Compared to the average single model performance (C-index: 0.41, Supplementary Table 2), this was an improvement of 0.04. Moreover, a statistically significant stratification into groups at low and high risk of loco-regional recurrence was achieved by this ensemble on the independent validation cohort (\(p=0.041\); Fig. 4, bottom row). Using the last convolutional layer as foundation, ensembles of ResNet50 or IRNV2 models were not able to successfully stratify patients of the independent validation cohort. Using layers other than the last convolutional layer of the pre-trained models as input to the newly added dense layers resulted in slightly worse C-indices in all cases.

Table 3 Ensemble of transfer learning models: C-indices for the endpoint loco-regional control (LRC) are computed by averaging the model predictions of the repeated cross-validation models to build an ensemble model.

Boxplots showing the variability of ensemble predictions for patients of the independent validation cohort are provided in Supplementary Figs. 2, 3 and 4 for the ensembles of 3D-CNN models, 2D-CNN models and DenseNet201 models, respectively.

Deep features

The prognostic performance of classical CPHMs using bottleneck features of autoencoder models as covariates is given in Table 4. Model performance was inferior to the clinical model in all scenarios, and none of the models achieved a statistically significant stratification of the independent validation cohort into low and high risk groups. The best C-index on the independent validation cohort was 0.42 (95% CI: 0.32–0.53), obtained by the LCPHM ensemble. The ensemble model improved the C-index on the independent validation cohort by 0.03 compared to the average single model C-index (Supplementary Table 3). The proportion of the total variance of the data captured by the PCA features is provided in Supplementary Table 4.

Table 4 Ensemble of autoencoder models: C-indices for the endpoint loco-regional control (LRC) are computed by averaging the model predictions of the repeated cross-validation models to build an ensemble model.
Figure 4

Ensemble Kaplan–Meier curves: Kaplan–Meier curves for patient groups at low risk (blue) and high risk (orange) of loco-regional recurrence for training and internal test folds as well as for the independent validation cohort. The stratification was created using the median of the training ensemble predictions as cutoff. The top row shows the curves obtained from an ensemble of 3D-CNN models trained from scratch based on the architecture of Hosny et al.23 with \(\tanh\) as final activation. The centre row shows the curves obtained from an ensemble of 2D-CNN models trained from scratch without batch normalisation and \(\tanh\) as final activation. The bottom row shows the curves obtained from an ensemble of transfer learning models based on DenseNet201 with the last convolutional layer as foundation.

Discussion

We investigated deep learning methods in a survival analysis setting for the endpoint LRC, based on treatment-planning CT images of locally advanced HNSCC patients treated with primary radiochemotherapy. The best performance and successful validation were achieved by an ensemble of 3D-CNNs with a C-index of 0.31 on the independent validation cohort. Patient risk groups defined by the model predictions showed significant differences in LRC (\(p=0.001\)). Ensembles of different 2D-CNN approaches performed similarly to a clinical CPHM based on the tumour volume (independent validation C-index of 0.39). Compared to using only a single trained model instance, our analysis revealed benefits of using model ensembles for final predictions, which is in line with the reasoning of Dietterich43.

Overall, the reported performances for 2D-CNNs were comparable to results previously published by our group by Leger et al.17, who evaluated multiple combinations of feature selection algorithms and classical machine learning models based on handcrafted radiomics features on the same dataset. An average independent validation C-index over all combinations of 0.62 was achieved (which corresponds to a C-index of 0.38 in our context, as explained in the "Methods" section). Similarly, Haarburger et al.54 reported comparable findings. Collaborative learning across institutions55 seems to be a promising way to tackle the small sample size problem of medical imaging; this includes setting up infrastructures that allow models to be trained on data of multiple institutions without violating data-privacy regulations. Also, exploring generative adversarial networks for enhancing dataset sizes through simultaneous generation of synthetic image samples and plausible time-to-event labels56,57 might be a promising direction. However, for HNSCC, treatment-planning CT scans may simply not contain much more predictive information, no matter the deep learning approach, model architecture or hyperparameters. As previously indicated3,9,58, considering additional imaging during the course of treatment or additional imaging modalities such as MRI or PET may offer improved predictive potential.

Another limitation of our analysis concerns the Cox partial log-likelihood function, as given by Eq. (1), which does not account for ties in the data. Ties can easily occur if multiple samples of the same patient are present in a single training batch. We therefore plan to use, e.g., Efron's correction method59 in future analyses, but refrained from doing so in our current experiments in order to avoid introducing additional complexity into the loss function. Instead, we experimented with applying slight random perturbations to the observed event times to avoid exact matches, but did not observe noteworthy changes in model performance (see first row of Supplementary Table 6). An alternative to the CPHM is the combination of deep learning with accelerated failure time models, as demonstrated by Chapfuwa et al.57 on clinical data. Due to their fully-parametric nature, direct prediction of event times becomes easier and non-monotonic hazard functions can be modelled.

Deep learning approaches on treatment-planning CT images can be useful building blocks on the way towards personalised radiotherapy. They may be extended using additional information, e.g. from tumour histology or molecular samples. Nevertheless, deep learning approaches should not be considered a universal remedy, since they also bring drawbacks compared to simpler models. These include increased computational complexity and difficulties in understanding the image-based causes of their predictions, leading to decreased model interpretability.

In this study, we implemented CNNs for the prediction of LRC after primary radiochemotherapy of locally advanced HNSCC based on CT imaging. An ensemble of 3D-CNN models was successfully validated and showed improved performance compared to the 2D-CNN approaches and a clinical model. Risk groups defined by these models differed significantly in LRC. In the future, we aim to assess the robustness and translational ability of our trained models by applying them to data of the prospective HNPrädBio trial of the DKTK-ROG as another independent validation (NCT02059668)60.