Introduction

Nuclear fusion could be the ultimate energy source for humankind. The tokamak is the leading candidate for a practical fusion reactor; it uses magnetic fields to confine plasma at extremely high temperatures (~100 million K). A disruption is a catastrophic loss of plasma confinement that releases a large amount of energy and can cause severe damage to the tokamak machine1,2,3,4. Disruptions are one of the biggest hurdles to realizing magnetic confinement fusion. Disruption mitigation systems (DMS) such as massive gas injection (MGI) and shattered pellet injection (SPI) can effectively alleviate the damage caused by disruptions in current devices5,6. For large tokamaks such as ITER, unmitigated disruptions during high-performance discharges are unacceptable. Predicting potential disruptions is a critical factor in triggering the DMS effectively, so it is important to predict disruptions accurately and with enough warning time7. Currently, there are two main approaches to disruption prediction research: rule-based and data-driven methods. Rule-based methods build on the current understanding of disruptions, focus on identifying event chains and disruption paths, and provide interpretability8,9,10,11. Feature engineering may also benefit from broader domain knowledge that is not specific to disruption prediction and does not require detailed knowledge of disruptions. Data-driven methods, on the other hand, learn from the vast amount of data accumulated over the years and have achieved excellent performance, but lack interpretability12,13,14,15,16,17,18,19,20. Each approach can benefit from the other: rule-based methods can be accelerated with surrogate models, while data-driven methods benefit from domain knowledge when choosing input signals and designing the model. Currently, both approaches require sufficient data from the target tokamak to train the predictors before they can be applied. Most methods published in the literature focus on predicting disruptions for a single device and lack generalization ability. Since unmitigated disruptions of high-performance discharges would severely damage future fusion reactors, it is difficult to accumulate enough disruptive data, especially in the high-performance regime, to train a usable disruption predictor.

There have been attempts to build models that work on new machines using data from existing machines. Previous cross-machine studies have shown that predictors trained on one tokamak perform poorly when used directly to predict disruptions on another15,19,21, and that domain knowledge is necessary to improve performance. The Fusion Recurrent Neural Network (FRNN) was trained with mixed discharges from DIII-D and a ‘glimpse’ of discharges from JET (5 disruptive and 16 non-disruptive discharges), and was able to predict disruptive discharges in JET with high accuracy15. The Hybrid Deep-Learning (HDL) architecture was trained with 20 disruptive discharges and thousands of discharges from EAST, combined with more than a thousand discharges from DIII-D and C-Mod, and achieved a boost in performance when predicting disruptions in EAST19. An adaptive disruption predictor was built based on the analysis of large databases of AUG and JET discharges, and was transferred from AUG to JET with a success rate of 98.14% for mitigation and 94.17% for prevention22.

Mixing data from the target and existing machines is one form of transfer learning, namely instance-based transfer learning. However, the information carried by the limited data from the target machine can be flooded by data from the existing machines. Moreover, these works were carried out among tokamaks with similar configurations and sizes, whereas the gap between future tokamak reactors and any tokamak existing today is very large23,24. Differences in machine size, operation regime, configuration, feature distribution, disruption causes, characteristic paths, and other factors all result in different plasma behavior and different disruption processes. Thus, in this work we selected the J-TEXT and EAST tokamaks, which differ substantially in configuration, operation regime, time scale, feature distribution, and disruption causes, to demonstrate the proposed transfer learning method. J-TEXT is a tokamak with a full-carbon wall where the main types of disruptions are those induced by density limits and locked modes25,26,27,28,29. In contrast, EAST is a tokamak with a metal wall where disruptions caused by density limits and locked modes are also observed, but the most frequent causes of disruptions are temperature hollowing, edge cooling, and vertical displacement events (VDEs)51. Discharges from the J-TEXT tokamak are used to validate the effectiveness of the deep fusion feature extractor and to provide a pre-trained model for further transfer to predict disruptions on the EAST tokamak. To keep the inputs of the disruption predictor the same, 47 channels of diagnostics are selected from J-TEXT and EAST respectively, as shown in Table 4. During selection, the consistency of the geometry and viewing range of the diagnostics, both across discharges and between the two tokamaks, is considered as much as possible. The selected diagnostics cover the typical frequency of 2/1 tearing modes, the period of sawtooth oscillations, radiation asymmetry, and other spatial and temporal information at a sufficiently low level. Since the diagnostics span multiple physical and temporal scales, different sample rates are chosen for different diagnostics.

Table 4 Channels in J-TEXT & EAST for input of the predictor (for Soft X-Ray (SXR), eXtreme UltraViolet (AXUV)/Pulsed Xenon-based UltraViolet (PXUV), saddle coils (exsad) and Mirnov probes, the number in parentheses represents the number of input channels).

A total of 854 discharges (525 disruptive) from the 2017–2018 campaigns are selected from J-TEXT. These discharges cover all the channels selected as inputs and include all types of disruptions observed on J-TEXT. Most of the dropped disruptive discharges were triggered manually, for example by MGI, and showed no sign of instability before the disruption; some additional discharges were dropped because most of their input channels contained invalid data. In transfer learning it is difficult for the model in the target domain to outperform the model in the source domain, so the pre-trained model from the source domain is expected to contain as much information as possible. In this case the model pre-trained on J-TEXT discharges is supposed to acquire as much disruption-related knowledge as possible. The J-TEXT discharges are therefore randomly shuffled and split into training, validation, and test sets: the training set contains 494 discharges (189 disruptive), the validation set 140 discharges (70 disruptive), and the test set 220 discharges (110 disruptive). Normally, to simulate real operational scenarios, a model should be trained with data from earlier campaigns and tested with data from later ones, since its performance can degrade as the experimental environment changes between campaigns; a model that performs well in one campaign may not perform as well in a new campaign, which is known as the “aging problem”. However, when training the source model on J-TEXT we care more about disruption-related knowledge, so the J-TEXT datasets are split randomly.

As for the EAST tokamak, a total of 1896 discharges, including 355 disruptive discharges, are selected as the training set. 60 disruptive and 60 non-disruptive discharges are selected as the validation set, while 180 disruptive and 180 non-disruptive discharges are selected as the test set. It is worth noting that, since the output of the model is the probability of each sample being disruptive with a time resolution of 1 ms, the imbalance between disruptive and non-disruptive discharges does not affect the model learning. The samples themselves, however, are imbalanced, since samples labeled as disruptive make up only a small fraction; how the imbalanced samples are handled is discussed in the “Weight calculation” section. Both the training and validation sets are selected randomly from earlier campaigns, while the test set is selected randomly from later campaigns, simulating real operational scenarios. For the use case of transferring across tokamaks, 10 non-disruptive and 10 disruptive discharges from EAST are randomly selected from earlier campaigns as the training set, while the test set is kept the same as above, again to simulate realistic operational scenarios chronologically. Given our emphasis on the flattop phase, the dataset is constructed to contain samples exclusively from this phase. Furthermore, since the number of non-disruptive samples is significantly higher than the number of disruptive samples, only the disruptive samples from the disruptive discharges are used and their non-disruptive samples are disregarded. This split results in slightly worse performance compared with randomly splitting the datasets across all available campaigns. The split of the datasets is shown in Table 5, and a splitting sketch is given below.
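
The following is a minimal sketch of the two splitting strategies, assuming a hypothetical list of (shot number, is_disruptive) records; the actual metadata, campaign boundaries, and set sizes come from the J-TEXT and EAST shot databases.

```python
import random

# Hypothetical discharge records: (shot number, is_disruptive).
discharges = [(shot, shot % 3 == 0) for shot in range(1000, 1854)]

# J-TEXT source model: random shuffle, then split into training/validation/test.
random.seed(42)
shuffled = discharges[:]
random.shuffle(shuffled)
train, val, test = shuffled[:494], shuffled[494:634], shuffled[634:]

# EAST target model: chronological split, with earlier campaigns used for
# training/validation and later campaigns held out as the test set.
ordered = sorted(discharges, key=lambda d: d[0])   # shot number as a proxy for time
split_point = int(0.8 * len(ordered))              # placeholder campaign boundary
earlier, later = ordered[:split_point], ordered[split_point:]
```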

Normalization

With the database established, normalization is performed to eliminate the numerical differences between diagnostics and to map the inputs to an appropriate range that facilitates the initialization of the neural network. According to the results of J.X. Zhu et al.19, the performance of a deep neural network depends only weakly on the normalization parameters as long as all inputs are mapped to an appropriate range. The normalization is therefore performed independently for each tokamak, and for the two EAST datasets the normalization parameters are calculated separately from the corresponding training sets. The inputs are normalized with the z-score method, \(X_{\mathrm{norm}}=\frac{X-\mathrm{mean}(X)}{\mathrm{std}(X)}\). Ideally, inputs that follow a Gaussian distribution are thus mapped to a standard normal distribution with zero mean and unit standard deviation. However, not all inputs follow a Gaussian distribution, and some contain extreme values that could distort the normalization. We therefore clipped any normalized values beyond (−5, 5), so that the final range of all normalized inputs is between −5 and 5. A clipping value of 5 is appropriate for model training: it is small enough not to cause numerical issues and large enough to distinguish outliers from normal values.
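
A minimal sketch of this normalization step is shown below; the synthetic training signal is only a placeholder for a real diagnostic channel.

```python
import numpy as np

def normalize(x, mean, std, clip=5.0):
    """z-score normalization followed by clipping to suppress extreme outliers."""
    z = (x - mean) / std
    return np.clip(z, -clip, clip)

# The per-channel mean and std are computed on the training set only and then
# reused for the validation/test sets (and computed separately for each tokamak).
rng = np.random.default_rng(0)
train_signal = rng.normal(loc=2.0, scale=0.5, size=10_000)   # placeholder channel
mean, std = train_signal.mean(), train_signal.std()
normalized = normalize(train_signal, mean, std)
```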

Labeling

All discharges are split into consecutive temporal sequences. A time threshold before the disruption, defined for each tokamak in Table 5, indicates the precursor phase of a disruptive discharge. The “unstable” sequences of disruptive discharges are labeled as “disruptive”, while all sequences from non-disruptive discharges are labeled as “non-disruptive”. To determine the time threshold, we first obtained a time span from prior discussions and consultations with tokamak operators, who provided valuable insight into the time span within which disruptions can be reliably predicted. We then conducted a systematic scan within this span, aiming to identify the constant that yielded the best overall disruption prediction performance. By iteratively testing various constants, we selected the value that maximized the predictive accuracy of the model.
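
A minimal labeling sketch under these assumptions is given below; the 30 ms threshold and the example disruption time are placeholders, not the values actually used.

```python
import numpy as np

def label_samples(times_ms, is_disruptive, t_disruption_ms=None, threshold_ms=30.0):
    """Label each 1-ms sample of a discharge.

    For a disruptive discharge, samples within `threshold_ms` before the
    disruption are labeled 1 ("disruptive"); all other samples are 0.
    """
    labels = np.zeros_like(times_ms, dtype=int)
    if is_disruptive:
        labels[times_ms >= t_disruption_ms - threshold_ms] = 1
    return labels

# Example: a 500 ms flattop sampled every 1 ms, disrupting at t = 480 ms.
t = np.arange(0.0, 500.0, 1.0)
y = label_samples(t, is_disruptive=True, t_disruption_ms=480.0)
```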

Table 5 Split of datasets of the predictor.

However, research has shown that the time scale of the “disruptive” phase can vary depending on the disruptive path, and labeling samples with an unfixed, precursor-related time is more physically accurate than using a constant. In our study, we first trained the model using “real” labels based on precursor-related times, which made the model more confident in distinguishing between disruptive and non-disruptive samples. However, the performance on individual discharges decreased compared to a model trained with constant-labeled samples, as demonstrated in Table 6. Although the precursor-related model was still able to predict all disruptive discharges, more false alarms occurred and degraded the overall performance. These results indicate that a model trained with precursor-related labels is more sensitive to unstable events and has a higher false alarm rate. From the perspective of disruption prediction alone, precursor-related labels are preferable. However, since the disruption predictor is designed to trigger the DMS effectively while keeping false alarms low, constant-based labels are the better choice in our work. We therefore ultimately used a constant to label the “disruptive” samples, striking a balance between sensitivity and false alarm rate.

Table 6 The results of predicting disruptions on J-TEXT using different labeling strategies.

Weight calculation

Because not all sequences of disruptive discharges are used, and the number of non-disruptive discharges far exceeds the number of disruptive ones, the dataset is greatly imbalanced. To deal with this problem, weights for both classes are calculated and passed to the neural network so that it pays more attention to the under-represented class, the disruptive sequences. The weights are calculated as \(W_{\mathrm{disruptive}}=\frac{N_{\mathrm{disruptive}}+N_{\mathrm{non\text{-}disruptive}}}{2N_{\mathrm{disruptive}}}\) and \(W_{\mathrm{non\text{-}disruptive}}=\frac{N_{\mathrm{disruptive}}+N_{\mathrm{non\text{-}disruptive}}}{2N_{\mathrm{non\text{-}disruptive}}}\). Scaling by \(\frac{N_{\mathrm{disruptive}}+N_{\mathrm{non\text{-}disruptive}}}{2}\) keeps the loss at a similar magnitude to the unweighted case.
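
The formulas above translate directly into code; the sample counts in the example are illustrative only.

```python
def class_weights(n_disruptive, n_non_disruptive):
    """Class weights as defined above, scaled so the loss keeps a similar magnitude."""
    total = n_disruptive + n_non_disruptive
    return {
        1: total / (2 * n_disruptive),        # weight of the disruptive class
        0: total / (2 * n_non_disruptive),    # weight of the non-disruptive class
    }

# e.g. 5,000 disruptive samples vs. 95,000 non-disruptive samples -> {1: 10.0, 0: ~0.53}
weights = class_weights(n_disruptive=5_000, n_non_disruptive=95_000)
```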

Overfitting handling

Overfitting occurs when a model is too complex and fits the training data too well but performs poorly on new, unseen data. It is often caused by the model learning noise in the training data rather than the underlying patterns. To prevent overfitting when training the deep learning-based model on the small number of samples from EAST, we employed several techniques. The first is batch normalization. Batch normalization helps to prevent overfitting by reducing the impact of noise in the training data: by normalizing the inputs of each layer, it makes the training process more stable and less sensitive to small changes in the data. In addition, we applied dropout layers. Dropout randomly drops out some neurons during training, which forces the network to learn more robust and generalizable features. L1 and L2 regularization were also applied: L1 regularization shrinks the coefficients of less important features to zero, effectively removing them from the model, while L2 regularization shrinks all coefficients toward zero without removing any features entirely. Furthermore, we employed an early stopping strategy and a learning rate schedule. Early stopping halts training when the model’s performance on the validation set starts to degrade, while the learning rate schedule lowers the learning rate as training approaches convergence, allowing the model to make more precise adjustments to the weights and avoid overfitting to the training data.
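
A minimal Keras sketch of these regularization techniques is shown below; the layer size, dropout rate, regularization strengths, and patience are placeholders rather than the values used in this work.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers, callbacks

# A single regularized block combining L1/L2 penalties, batch normalization and dropout.
regularized_block = tf.keras.Sequential([
    layers.Dense(64, kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4)),
    layers.BatchNormalization(),   # stabilizes training and damps input noise
    layers.Activation("relu"),
    layers.Dropout(0.3),           # randomly drops neurons to force more robust features
])

# Early stopping halts training once the validation loss stops improving.
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True)
```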

The deep learning-based FFE design

Our deep learning model, the disruption predictor, consists of a feature extractor and a classifier, as demonstrated in Fig. 1. The feature extractor is composed of parallel Conv1D layers and LSTM layers. The parallel Conv1D layers are designed to extract spatial features and temporal features on a relatively small time scale. Temporal features with different time scales are sliced with different sampling rates and time steps, respectively. To avoid mixing up information from different channels, a parallel Conv1D structure is adopted: different channels are fed into separate parallel Conv1D layers, each providing its own output. The extracted features are then stacked and concatenated with the other diagnostics that do not require feature extraction on a small time scale. The concatenated features make up a feature frame, and several time-consecutive feature frames form a sequence that is fed into the LSTM layers to extract features on a larger time scale. ReLU is chosen as the activation function for these layers. After the LSTM layers, the outputs are fed into a classifier consisting of fully connected layers, all of which except the output layer also use ReLU activation. The output layer has two neurons with sigmoid activation, giving the probabilities of the sequence being disruptive and non-disruptive. The result is then passed through a softmax function to decide whether the slice is disruptive.
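
The sketch below illustrates this structure under simplifying assumptions: the number of parallel branches, input shapes, filter counts, and layer widths are placeholders, not the architecture actually used.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

N_FRAMES, WINDOW = 32, 100   # feature frames per sequence, fast samples per frame

# Each high-sample-rate diagnostic gets its own Conv1D branch so that
# information from different channels is not mixed during convolution.
fast_inputs = [layers.Input(shape=(N_FRAMES, WINDOW, 1), name=f"fast_{i}") for i in range(2)]
branches = []
for inp in fast_inputs:
    x = layers.TimeDistributed(layers.Conv1D(16, 5, activation="relu"))(inp)
    x = layers.TimeDistributed(layers.GlobalMaxPooling1D())(x)
    branches.append(x)

# Slowly varying diagnostics that need no small-time-scale feature extraction.
slow_input = layers.Input(shape=(N_FRAMES, 8), name="slow")

# Concatenate into feature frames, then extract larger-time-scale features with an LSTM.
frames = layers.Concatenate()(branches + [slow_input])
sequence_features = layers.LSTM(64)(frames)

# Classifier: fully connected layers, two sigmoid outputs, softmax for the decision.
hidden = layers.Dense(32, activation="relu")(sequence_features)
logits = layers.Dense(2, activation="sigmoid")(hidden)
outputs = layers.Softmax()(logits)

model = Model(inputs=fast_inputs + [slow_input], outputs=outputs)
```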

Training and transferring details

When pre-training the model on J-TEXT, 8 RTX 3090 GPUs are used to train the model in parallel and to speed up hyperparameter searching. Since the samples are greatly imbalanced, class weights are calculated and applied according to the distribution of the two classes. The training set for the pre-trained model finally reaches ~125,000 samples. To avoid overfitting and to generalize better, the model contains only ~100,000 parameters. A learning rate schedule is also applied to further mitigate overfitting: the learning rate follows an exponential decay schedule with an initial learning rate of 0.01 and a decay rate of 0.9. Adam is chosen as the optimizer of the network, and binary cross-entropy is selected as the loss function. The pre-trained model is trained for 100 epochs. For each epoch, the loss on the validation set is monitored, and the model is checkpointed at the end of any epoch in which the validation loss is the best so far. When training finishes, the best checkpoint is loaded as the pre-trained model for further evaluation.
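
A sketch of this training configuration is shown below; `model` is the network sketched earlier, `train_ds`/`val_ds` and the class weights are assumed to come from the previous steps, and `decay_steps` and the checkpoint filename are placeholders.

```python
import tensorflow as tf

# Exponential learning rate decay: initial rate 0.01, decay rate 0.9.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=1_000, decay_rate=0.9)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
              loss="binary_crossentropy")

# Checkpoint the model whenever the validation loss reaches a new best value.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "pretrained_jtext.h5", monitor="val_loss", save_best_only=True)

model.fit(train_ds, validation_data=val_ds, epochs=100,
          class_weight=weights, callbacks=[checkpoint])
```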

When transferring the pre-trained model, part of the model is frozen. The frozen layers are typically the bottom layers of the neural network, as they are considered to extract general features; their parameters are not updated during training. The remaining layers are not frozen and are tuned with the new data fed to the model. Since the amount of new data is very small, the model is tuned at a much lower learning rate of 1E-4 for 10 epochs to avoid overfitting. When replacing layers, the layers that are not frozen are replaced with layers of the same structure as in the previous model, but their weights and biases are re-initialized randomly; this model is also tuned at a learning rate of 1E-4 for 10 epochs. When unfreezing the frozen layers, the previously frozen layers are made trainable again and the model is further tuned at an even lower learning rate of 1E-5 for 10 epochs, yet the resulting models still suffer greatly from overfitting.
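
A minimal sketch of the freeze-and-fine-tune procedure follows; which layers form the frozen “bottom” of the network is illustrative, and `east_train_ds`/`east_val_ds` are assumed to hold the small EAST training and validation sets.

```python
import tensorflow as tf

pretrained = tf.keras.models.load_model("pretrained_jtext.h5")

# 1) Freeze the feature-extractor layers and fine-tune the rest at a low learning rate.
for layer in pretrained.layers[:-3]:   # placeholder boundary for the frozen part
    layer.trainable = False
pretrained.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                   loss="binary_crossentropy")
pretrained.fit(east_train_ds, validation_data=east_val_ds, epochs=10)

# 2) Unfreeze all layers and continue tuning at an even lower learning rate.
for layer in pretrained.layers:
    layer.trainable = True
pretrained.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                   loss="binary_crossentropy")
pretrained.fit(east_train_ds, validation_data=east_val_ds, epochs=10)
```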