Introduction

Non-Small Cell Lung Cancer (NSCLC) is one of the most common cancers, with high incidence and mortality rates [1]. For patients ineligible for radical surgery, the combination of radiotherapy and chemotherapy is the primary treatment option [2]. The assessment of treatment response, which relates to quality of survival and treatment effectiveness, is key to improving patient prognosis [3]. Even within the same disease stage, patients exhibit varying responses to radiotherapy: some experience tumor shrinkage, while others show signs of tumor progression.

Neural network structure

A custom 3D convolutional neural network, conv3DNet, was used in this study. The model was designed from scratch and trained from random initialization rather than relying on a pre-trained architecture. It consists of three 3D convolutional layers, three 3D max-pooling layers, two fully connected layers, and a final Softmax layer that maps the output to class probabilities. The architecture of the model is shown in Fig. 1.
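A minimal PyTorch sketch of this three-conv / three-pool / two-FC layout is given below. The channel counts, kernel sizes, and the 1×64×64×64 input volume are illustrative assumptions; the paper does not specify them.

```python
import torch
import torch.nn as nn

class Conv3DNet(nn.Module):
    """Sketch of conv3DNet: three conv blocks, each followed by 3D max
    pooling, then two fully connected layers and a Softmax output.
    All layer sizes here are assumptions for illustration."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),  # 64 -> 32 per spatial dimension
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),  # 32 -> 16
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),  # 16 -> 8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8 * 8, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Softmax converts logits to class probabilities
        # (responsive / non-responsive).
        return torch.softmax(self.classifier(self.features(x)), dim=1)
```

Because the last layer is a Softmax, each output row sums to 1 and can be read directly as the predicted probability of each treatment-response class.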

Fig. 1
figure 1

The architecture of conv3DNet. Including convolution layers, max pooling layers, and fully connected layers

Federated learning framework

The FL part of this study was implemented using an FL framework, Flower 1.3.0 [25], which consists of three main modules: server-side, client-side, and strategy [26]. The server side handles global aggregation, while the client side manages local training. The built-in strategy module embeds various FL training schemes, making it easy to select an appropriate approach for aggregating model parameters on the server. The framework includes popular FL algorithms such as FedAvg [12] and FedProx [27].

The process of federated learning is presented in Fig. 2. First, all clients (participants) wait for the server (central node) to transmit the initial parameters. After receiving them, each client trains the model locally on its own training set while the server waits. Each local iteration produces a model parameter update. Once a client finishes training, it transmits its model parameters to the server over the network. After receiving the parameters from all clients, the server uses the FedAvg algorithm to aggregate the updates by taking their weighted average. The server then sends the aggregated parameters back to the clients, and the clients continue training their local models from these aggregated parameters. This cycle repeats: each round trains the local model on the client, generates parameter updates, and transmits them to the server for aggregation, eventually yielding the final global model [15, 28].
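The server-side aggregation step can be sketched as follows. FedAvg averages each parameter across clients, weighted by the number of local training examples; real frameworks apply the same rule per layer tensor, but a flat parameter list shows the idea.

```python
from typing import List

def fedavg(client_weights: List[List[float]],
           num_examples: List[int]) -> List[float]:
    """FedAvg aggregation: for each parameter position, take the
    average across clients weighted by local dataset size."""
    total = sum(num_examples)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, num_examples)) / total
        for i in range(n_params)
    ]

# Two clients with unequal dataset sizes: the larger client's
# parameters dominate the aggregate.
global_w = fedavg([[1.0, 2.0], [3.0, 4.0]], num_examples=[30, 10])
# global_w == [1.5, 2.5]
```

The weighting by `num_examples` is why, as noted later in the Discussion, the hospital with the largest dataset has the greatest influence on the global model.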

Fig. 2
figure 2

The process of federated learning

Model construction

In this study, we first built a DL model and a two-client FL model in a simulated environment and compared their performance to illustrate the advantages of FL. Because the data from hospital D could not be pooled for centralized training, we then built a real-world three-client FL model to explore the potential of FL in healthcare applications. All model frameworks use the conv3DNet architecture described in 2.3.

Federated learning in a simulation environment

Initially, we developed a centralized DL model by dividing the data from hospitals A and B into training and validation sets in a 7:3 ratio. Data from hospital C was used as an external validation set for testing the model and was not involved in model training or tuning. The training hyperparameters were as follows: (I) batch size of 8; (II) learning rate of 0.001; (III) the Adam optimizer; and (IV) 100 training epochs. The loss function was cross entropy.
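Wired together, this centralized configuration looks roughly like the sketch below. The tiny linear model and random tensors are stand-ins for conv3DNet and the CT volumes; only the hyperparameters (batch size 8, Adam, learning rate 0.001, cross entropy, 100 epochs) come from the text.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data and model; in the study these would be the
# pooled CT volumes from hospitals A and B, and conv3DNet.
X = torch.randn(32, 10)
y = torch.randint(0, 2, (32,))
model = nn.Linear(10, 2)

# Hyperparameters from the text: batch size 8, Adam, lr 0.001.
loader = DataLoader(TensorDataset(X, y), batch_size=8, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(100):          # 100 training epochs
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
```

Note that `nn.CrossEntropyLoss` expects raw logits, so in training the Softmax is folded into the loss rather than applied explicitly.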

The data were then used to build an FL model. Two clients (hospitals A and B) each divided their local data into training and validation sets in a 7:3 ratio. Data from hospital C was used as an external validation set for model testing and was not involved in model training or tuning. We configured the same initial global model for both clients and used the Flower framework, communicating with the central server via gRPC, to build the FL model. The training hyperparameters were as follows: (I) batch size of 16; (II) learning rate of 0.001; (III) the SGD optimizer (to assess the initial performance of the FL model, we chose SGD, which allows fine-grained tuning); (IV) 10 communication rounds; and (V) 50 local epochs per client (a local epoch is one pass over a client's local data before the model parameters are sent to the central server). The loss function remained cross entropy.
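The interplay between communication rounds and local epochs can be sketched without any framework. The toy below uses a scalar "model" and two simulated clients with made-up data means and sizes (all values are assumptions); each round runs 50 local epochs per client, then the server applies the FedAvg size-weighted average.

```python
ROUNDS, LOCAL_EPOCHS = 10, 50   # from the text: 10 rounds, 50 local epochs

def local_train(w: float, data_mean: float,
                epochs: int, lr: float = 0.1) -> float:
    """Stand-in for client-side SGD: pull the scalar weight toward the
    client's local data mean, one gradient step per local epoch."""
    for _ in range(epochs):
        w -= lr * (w - data_mean)   # gradient of 0.5 * (w - mean)^2
    return w

# Two simulated clients (e.g. hospitals A and B) with different
# local data; means and sample counts are illustrative only.
clients = {"A": {"mean": 1.0, "n": 30}, "B": {"mean": 3.0, "n": 10}}

global_w = 0.0
for _ in range(ROUNDS):
    updates, sizes = [], []
    for c in clients.values():
        updates.append(local_train(global_w, c["mean"], LOCAL_EPOCHS))
        sizes.append(c["n"])
    # Server-side FedAvg: size-weighted average of client updates.
    global_w = sum(w * n for w, n in zip(updates, sizes)) / sum(sizes)

# global_w converges toward the size-weighted mean
# (30 * 1.0 + 10 * 3.0) / 40 = 1.5
```

Only the aggregated scalar crosses the "network" boundary each round; the clients' raw data (their local means, standing in for CT volumes) never leaves the client, which is the privacy property FL provides.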

Federated learning in real-world environments

The real-world FL experiment also used the Flower framework, configured to process data from different healthcare organizations while protecting data privacy. The approach was as follows: three clients (hospitals A, B, and D) each divided their local data into training and validation sets in a 7:3 ratio. Data from hospital C was used as an external validation set for model testing and was not involved in model training or tuning. All model training was performed on the client side; the workflow is shown in Fig. 3. The training hyperparameters were as follows: (I) batch size of 16; (II) learning rate of 0.001; (III) the Adam optimizer (in a real-world environment with greater data heterogeneity and complexity, Adam adaptively adjusts the learning rate for faster and more stable convergence); (IV) 10 communication rounds; and (V) 50 local epochs per client. The loss function remained cross entropy.

Fig. 3
figure 3

Model training architecture for NSCLC treatment response from CT images. (FL training process: ①Model is trained using local data. ②Model parameters are sent to the server. ③Server aggregation parameters. ④Update parameters.)

In deep learning, a model's performance is influenced by the distribution of the dataset. To assess the robustness of the proposed models, we repeated the procedure described above with a different external validation dataset. In the simulated environment, we used data from hospitals A and C to train the centralized model DL2, while two clients (hospitals A and C) trained the federated model FL3. In the real-world setting, three clients (hospitals A, C, and D) trained the federated model FL4. Experimental parameters are shown in Supplementary 3. Data from hospital B was used as an external validation set to test DL2, FL3, and FL4, and was not involved in model training or tuning.

Experimental environment

Statistical analyses were performed using IBM SPSS Statistics 21.0. The CPU processors were an Intel(R) Core(TM) i5-11500 @ 2.70 GHz and an Intel(R) Xeon(R) Silver 4114 @ 2.20 GHz; the GPU was an NVIDIA Quadro P4000. The operating system was 64-bit Windows 10 Professional with Python 3.7, and the network model was implemented using the PyTorch (1.13.1) deep learning framework.

Results

Training a well-performing multicenter predictive model requires a highly diverse dataset. We therefore collected data from a total of 245 NSCLC patients across four centers. Of these, 110 cases were classified into the responsive group and the remaining 135 into the non-responsive group. A detailed summary of the cohort's demographic information is presented in Table 1. To check for significant differences in patient characteristics across institutions, we performed one-way ANOVA, Chi-square tests, and Fisher's exact test. Apart from gender, no characteristic differed significantly across institutions (p > 0.05). Although the gender difference among the four cohorts was significant (p < 0.05), we do not expect it to affect our results, as the analysis relies primarily on image features.

Table 1 Patient characteristics

The performance of the centralized DL models (DL1, DL2) and FL models (FL1, FL3) in the simulated environment, as well as the FL models (FL2, FL4) in the real world, was evaluated using accuracy, specificity, AUC, and confusion matrices. For a fair comparison, all models in each comparison were evaluated on the same test set (hospital B or C).
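For a binary task like this one, accuracy and specificity fall directly out of the 2×2 confusion matrix. A minimal sketch (the label convention 1 = responsive, 0 = non-responsive is an assumption):

```python
def confusion_counts(y_true, y_pred):
    """2x2 confusion-matrix counts for a binary task
    (assumed convention: 1 = responsive, 0 = non-responsive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy_specificity(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    specificity = tn / (tn + fp)   # true-negative rate
    return accuracy, specificity

acc, spec = accuracy_specificity([1, 0, 1, 0, 0], [1, 0, 0, 0, 1])
# acc = 3/5 = 0.6, spec = 2/3
```

AUC, by contrast, is computed from the predicted probabilities rather than hard labels (e.g. by ranking positives against negatives across all thresholds), which is why it is reported separately from the confusion-matrix metrics.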

When validated on the hospital C dataset in the simulated environment, the centralized DL1 model achieved an AUC of 0.718 (95% CI: 0.52–0.88), while the FL1 model achieved a higher AUC of 0.725 (95% CI: 0.55–0.90). In the real-world setting, the FL2 model achieved an AUC of 0.698 (95% CI: 0.49–0.87). After switching to dataset B for external validation, the DL2 model built in the simulated environment achieved an AUC of 0.695 (95% CI: 0.45–0.90), while the FL3 model achieved an AUC of 0.689 (95% CI: 0.51–0.85). Additionally, the FL4 model built in the real world attained an AUC of 0.672 (95% CI: 0.45–0.89). Figure 4 summarizes the training loss of the FL model across the two hospitals. Figures 5 and 6 summarize the ROC curves and confusion matrices obtained from the three models. Tables 2 and 3 summarize the performance metrics for training, validation, and testing.

Fig. 4
figure 4

The training loss of the FL model across the two hospitals

Table 2 Three model’s performance on treatment response prediction using hospital C test set
Table 3 Three model’s performance on treatment response prediction using hospital B test set
Fig. 5
figure 5

Performance comparison between local and collaborative FL training based on imaging data to predict treatment response in NSCLC patients. ROC curve of three models

Fig. 6
figure 6

Performance comparison between local and collaborative FL training based on imaging data to predict treatment response in NSCLC patients. Confusion matrix of three models

Discussion

In previous multicenter studies, machine learning or deep learning methods were usually employed to construct models from diverse medical imaging data such as CT and MRI. For instance, Cui et al. [29] developed a DL model to predict individual patient response to neoadjuvant chemotherapy from CT images of patients with locally advanced gastric cancer at four hospitals in China. Braman et al. [30] built a machine learning model to predict pathological complete response to neoadjuvant chemotherapy from MRI images of breast cancer patients. However, these conventional methods typically require data centralization for training, raising concerns about data privacy.

Methods such as swarm learning (SL) [31] and FL therefore allow models to be trained in multicenter collaborations without sharing sensitive data. SL is a decentralized machine learning approach that does not require server-coordinated parameters; parties communicate directly through a blockchain network. For instance, Saldanha et al. [32] used SL in a multicenter study to predict gene mutation status and microsatellite instability, and another study employed SL to predict molecular biomarkers for gastric cancer [33]; both achieved remarkable results.

Unlike SL, FL utilizes a central server for coordination, through which all participants communicate. This method effectively integrates multicenter data while protecting data privacy. For example, Sheller et al. [34] divided the BraTS dataset into 10 simulated institutions to study simulated FL, aiming to distinguish healthy brain tissue from cancerous tissue. Sadilek et al. [35] conducted several studies on FL in different scenarios to explore its performance. However, many existing FL studies have been conducted in simulated environments; they focus primarily on technological innovations in data security and privacy protection but lack validation in real healthcare environments. In contrast, our study focuses on predicting efficacy in practical clinical applications and validates the feasibility of FL in real clinical settings.

In this study, we introduced the Flower FL framework to establish a collaborative multicenter learning model based on 3D CT images for predicting the treatment response to radiotherapy in NSCLC patients. Our research involved a cohort of 245 patients from four hospitals. We began with a theoretical performance comparison in a simulation environment using data from three of these hospitals. By comparing the performance of a centrally trained DL model with an FL model, our findings support the effectiveness of the FL approach [DL model accuracy = 0.688/0.691, AUC = 0.718/0.695; FL model accuracy = 0.750/0.714, AUC = 0.725/0.689]. In real-world scenarios, to match actual medical application settings more closely and to address the challenges of data acquisition and privacy protection, we utilized data from all four hospitals to develop an FL model [FL model accuracy = 0.688/0.667; area under the curve (AUC) = 0.698/0.672].

The left panel of Fig. 4 shows that the loss values fluctuate after 20 epochs, while the right panel shows that as training proceeds, the losses for both hospitals gradually decrease and stabilize. The relatively smoother loss curve for hospital A may reflect the larger amount of data at this site, which facilitates smoother learning [36]. As demonstrated in Tables 2 and 3 and Figs. 5 and 6, the FL1 and FL3 models outperform DL1 and DL2: because the global model's weights are the weighted average of the local model weights from all clients, which are aggregated and then returned to each client's local model, every client benefits from the experience of the others. The federated models FL2 and FL4, trained with data from three clients, exhibited lower metrics than FL1 and FL3, which were trained with data from two clients. This discrepancy may be attributed to the uneven distribution of hospital D's data relative to the other two clients. When data distributions differ noticeably among clients, inconsistencies arise during model aggregation; using fewer clients reduces the likelihood of such inconsistency. In addition, the model trained on data from hospitals A and C (with hospital B as the test set) performed somewhat worse than the model trained on data from hospitals A and B (with hospital C as the test set). In FL, hospital A, which possesses the largest dataset, is assigned the highest weight; the performance gap could therefore reflect the fact that the data distribution of hospital C aligns more closely with that of hospital A than hospital B does.

Certain limitations remain in this study. First, the NSCLC dataset used here was relatively small; including data from more medical institutions could enhance the model's performance. In addition, limited by the sample size, we grouped CR and PR as treatment with remission and SD and PD as treatment without remission for a dichotomous analysis. In the future, we hope to use the RECIST criteria to classify efficacy into four classes for more accurate prediction, while expanding and balancing the sample size. Furthermore, this study constructed a CT-based unimodal model with FL, omitting additional data such as clinical features and pathology image features, which could enhance the model's predictive capacity for cancer treatment response. Future studies will expand the dataset, invite more medical institutions to participate, and integrate data from various sources to build a more comprehensive model for the precise prediction of radiotherapy treatment response in NSCLC patients.

Conclusions

To emphasize the efficacy of a distributed learning approach in a data-private setting, we conducted a study on FL for predicting treatment responses among patients with NSCLC.

We compared traditional DL and FL approaches. Our results show that FL can achieve performance comparable to centralized DL without sharing sensitive data. In addition, we validated the feasibility of FL in real-world applications. We believe this approach is applicable not only to NSCLC efficacy prediction but can also be extended to other DL applications in medical image analysis. This research provides an effective way to address data privacy and collaboration issues in multicenter medical image analysis and is expected to have a broader impact in clinical applications.