Introduction

Recent years have seen the use of deep learning for a wide range of critical tasks, such as autonomous vehicle driving (Grigorescu et al. 2020; Muhammad et al. 2021), facial recognition (Hu et al. 2015; Wang and Guo 2021), and machine translation (Costa-jussà 2018; Koehn 2020). As deep learning expands its application scope, its security issues are also garnering increased attention (Berman et al. 2019; Liu et al. 2021; Guowen et al. 2019). Deep neural networks are key components of deep learning, and their security has always been emphasized in research. Training a deep neural network is expensive and time-consuming, so many users resort to Machine Learning as a Service (MLaaS) (Ribeiro et al. 2015) or directly download pre-trained networks from the Internet. In either case, a third party handles the training of the network. An honest third party will train normally and return a clean model; a malicious third party, however, may manipulate the training process and return a tainted model. Due to the black-box nature of neural networks (Rudin 2019), users cannot determine whether the model has been maliciously modified. The service features of MLaaS and the black-box nature of the models thus open the door to backdoor attacks.

The backdoor attack (Gao et al. 2020) consists of two phases, namely the implanting phase and the activating phase. A backdoor is implanted during the training of the neural network, for example by tampering with the training data, and it is then activated during the testing of the network. The main characteristic of backdoor attacks is that the network makes specific incorrect predictions only when a trigger is present in the input; otherwise it behaves normally. As shown in Fig. 1, a “STOP” sign will be predicted as “LIMIT 50” when the image recognition system is given an image stamped with a trigger. When such a system is used for autonomous driving, this kind of backdoor behavior could cause serious traffic accidents. As mentioned earlier, a malicious third party is well positioned to implant a backdoor and return the backdoored network to the user. Users are also provided with a partially clean dataset when they receive the network, so that they can test whether it performs as expected. Nevertheless, the backdoor in the network cannot be activated by clean data, i.e., the user cannot determine whether the network contains a backdoor.

The defense based on knowledge distillation is currently considered to be the most effective method for mitigating backdoor attacks. NAD (Li et al. 2021) was the first to introduce knowledge distillation into backdoor defense. It uses attention features to represent the internal neuron activation information of the neural network and achieves backdoor defense by aligning the intermediate-layer attention features of the student network and the teacher network. The limitation of NAD is that it only involves same-order attention features during knowledge distillation, while the correlation among attention features of different orders is ignored. On this basis, ARGD (Xia et al. 2022) proposes the attention relation graph, which fully considers and utilizes the relationship between attention features of different orders, further improving defense performance. However, both methods share a common limitation: they only optimize the knowledge representation, and the representation they use is too narrow. Knowledge distillation was originally proposed for network compression, so we argue that simply optimizing the knowledge representation is far from enough to defend against backdoors.

In this work, we propose a new defense mechanism called NBA, which simultaneously optimizes the knowledge representation and the training samples according to the characteristics of backdoor defense. In terms of knowledge representation, NBA defines and extracts three types of neural behaviors from within the neural network to fully represent the network’s knowledge. By optimizing the corresponding loss function, the student network is encouraged to align its neural behavior with that of the teacher network, resulting in better training results. By contrast, the knowledge representation used by NAD and ARGD can essentially be regarded as one kind of neural behavior used by NBA. In terms of training samples, we construct pseudo-poisoned samples and input them to the student network. Once the backdoor neural behavior is exposed, NBA can remove the backdoor more thoroughly. Based on the above optimizations, NBA achieves better defensive performance than NAD and ARGD.

In summary, we make the following contributions:

  • We propose novel forms of knowledge and extract neural behavior as an efficient representation of the knowledge to be transferred. By aligning the neural behavior of the teacher and student networks during defensive distillation, the latter achieves better learning results than other distillation-based defenses (Li et al. 2021; Xia et al. 2022).

  • We optimize the original training samples into pseudo samples that induce the student network to exhibit backdoor neural behavior. On this basis, the backdoor in the student network can be further removed actively when combined with the neural behavior alignment mechanism.

  • We conduct extensive experiments on a number of well-known backdoor attacks. The experimental results corroborate the effectiveness and generality of our approach.

Fig. 1 Example of a backdoor attack

Related work

Backdoor attack

We refer to a neural network that has been implanted with a backdoor as a backdoored network, and to a sample that has been injected with a trigger as a poisoned sample. The backdoored network exhibits backdoor behavior when it takes a poisoned sample as input, namely, it makes a specific incorrect prediction.

Existing backdoor attacks can be divided into poison-label attacks and clean-label attacks according to whether the label of the poisoned sample is modified. Poison-label attacks require the attacker to modify both the samples and the labels, so that the mapping between the trigger and the target label can be directly established; BadNets (Gu et al. 2020) is a representative example.

Based on Hinton et al. (2015) and Furlanello et al. (2018), we argue that knowledge distillation can be applied to backdoor removal. Existing work confirms this: defensive distillation with attention maps (Li et al. 2021) and its subsequent improvement (Xia et al. 2022) have achieved good results. Accordingly, defensive distillation is a promising method of defending against backdoors. In addition, Ge et al. (2021) considered the problem that knowledge distillation may cause the backdoor to fail, and proposed targeted optimizations. However, since its threat model and attack scenarios are not consistent with those discussed in this article, we do not analyze it further.

Threat model

We consider a common scenario in which the training process of the network is outsourced to a third party. This applies both to the case where the user downloads a trained model directly from the Internet and to the case where the user obtains a customized trained network through MLaaS.

The attacker is free to implant backdoors into the network in any manner he chooses: different trigger patterns can be designed, different target labels can be chosen, and poisoning rates can be set arbitrarily. Here, we assume that the network has been successfully implanted with a backdoor and returned to the user. The user is often provided with a partially clean dataset so that they can confirm the usability of the network once it has been returned (or released). The network is expected to perform well on this dataset.

No details about the training process or attack methods are available to the defender. He is only provided with a trained network and a small clean dataset. Since it is unknown whether a given network contains a backdoor, the proposed defense method should be network-agnostic. If the given network is backdoored, the defensive solution should remove the backdoor without significantly degrading its normal performance; if the network is clean, the defense mechanism should not significantly affect its performance.

Methodology

Overview

Figure 2 illustrates the pipeline of using NBA for backdoor defense. It consists of two steps: first, fine-tuning the backdoored network to obtain the teacher network, and then, through defensive distillation, aligning the neural behavior of the student network to remove the backdoor.

As shown in Fig. 2a, the defender fine-tunes the given network using a local clean dataset to obtain the teacher network. Figure 2b illustrates the subsequent defensive distillation step. Since knowledge distillation was originally proposed for neural network compression, it does not provide good results when applied directly to backdoor defense. To obtain satisfactory defense performance, we optimize the knowledge representation and training samples in knowledge distillation in accordance with the characteristics of backdoor defense.
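For concreteness, the following PyTorch-style sketch illustrates the first step (Fig. 2a): fine-tuning a copy of the given network on the defender’s small clean dataset to obtain the teacher network. The function name `build_teacher`, the data loader `clean_loader`, and the learning rate and epoch count are illustrative assumptions, not the exact configuration used in our experiments.

```python
import copy
import torch
import torch.nn.functional as F

def build_teacher(given_net, clean_loader, epochs=10, lr=0.01, device="cuda"):
    """Fine-tune a copy of the (possibly backdoored) network on local clean data."""
    teacher = copy.deepcopy(given_net).to(device)
    optimizer = torch.optim.SGD(teacher.parameters(), lr=lr, momentum=0.9)
    teacher.train()
    for _ in range(epochs):
        for x, y in clean_loader:
            x, y = x.to(device), y.to(device)
            loss = F.cross_entropy(teacher(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    teacher.eval()  # the teacher is kept frozen during the defensive distillation step
    return teacher
```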

The improvements we have made to defensive distillation are inspired by real-life teaching experience. As an analogy, we compare the behavior of the backdoored network in processing samples to that of students in solving problems. Consequently, knowledge distillation can be viewed as the process by which teachers instruct students in the proper method of solving problems. Two lessons can be drawn from practical teaching experience.

Firstly, teachers should provide students with a complete understanding of problem solving, including intermediate steps, intermediate answers, and final solutions. Students cannot fully comprehend the ins and outs of the correct method of solving problems if any of these is missing. To simulate this process, NBA extracts and aligns three kinds of neural behavior within the student and teacher networks. Each of these three neural behaviors has its own focus, and their combination enables the student network to comprehensively learn the knowledge imparted by the teacher network. Secondly, teachers provide concentrated problem-solving guidance for students’ error-prone problems and correct students’ faulty problem-solving methods. Inspired by this, we construct pseudo samples to induce the student network to actively exhibit backdoor neural behaviors and align them with the benign neural behavior of the teacher network, so as to remove the backdoor more effectively.

Based on this, we propose learning distillation loss and unlearning distillation loss, which are used to encourage the student network to learn benign knowledge efficiently and to actively unlearn backdoor knowledge.

Fig. 2 The overview of NBA. NBA consists of two main procedures to remove the backdoor: (1) fine-tuning: fine-tuning based on local clean data to obtain the teacher network; (2) defensive distillation: extracting and aligning high-level representations of neural behavior from the teacher network and the student network. The backdoor is eliminated from the student network by optimizing the two distillation loss functions adopted in the defensive distillation phase

Neural behavior

The definition and extraction of the neural behavior of the neural network are the keys to NBA. We define two types of neural behaviors, namely response neural behavior and learning neural behavior, corresponding respectively to the intermediate answers and intermediate steps in the problem-solving process. Figure 3 shows the procedure of extracting these two kinds of neural behavior. In addition, we introduce the dark knowledge proposed by Hinton et al. (2015) as prediction neural behavior to represent the final answer to the problem.

Fig. 3 The illustration of extracting response neural behavior and learning neural behavior

Response neural behavior

Previous studies (Zagoruyko and Komodakis 2017; Li et al. 2021; Xia et al. 2022; Romero et al. 2015) have shown that feature maps can represent the response of neurons inside the network to input samples. We extract the feature maps of each intermediate layer of the network and regard them as the original representation of the model’s response neural behavior. To capture the focus of the response neural behavior, inspired by Gatys et al. (2016), we use Gram matrices to capture the key features of the feature maps. In particular, the Gram matrices obtained here are called response matrices, and they serve as the high-level representation of the response neural behavior.

The feature maps in the l-th layer of the teacher network and the student network are denoted by \(F^{Tl} \in {\mathbb {R}}^{C_{l} \times H_{l} \times W_{l}}\) and \(F^{Sl} \in {\mathbb {R}}^{C_{l} \times H_{l} \times W_{l}}\), where \(C_{l}\), \(H_{l}\), \(W_{l}\) represent the number, height and width of the feature maps in the l-th layer. Further, the response matrices in the l-th layer of the teacher network and the student network are denoted as \(G^{Tl} \in {\mathbb {R}}^{C_{l} \times C_{l}}\) and \(G^{Sl} \in {\mathbb {R}}^{C_{l} \times C_{l}}\). The response matrix \(G^{l}\) is the inner product of the feature maps of the corresponding network:

$$\begin{aligned} G_{ij}^{l}=\sum _{k=1}^{H_{l} \times W_{l}}F_{ik}^{l}F_{jk}^{l}. \end{aligned}$$
(1)

\({\mathcal {L}}_{RNB}\) is defined as the mean squared error between the response matrices \(G^{Tl}\) and \(G^{Sl}\), \(l=1,2,\ldots ,n\):

$$\begin{aligned} {\mathcal {L}}_{RNB}=\sum _{l=1}^{n} {\mathcal {L}}_{RNB}^{l}, \end{aligned}$$
(2)

where \({\mathcal {L}}_{RNB}^{l}\) is defined as

$$\begin{aligned} {\mathcal {L}}_{RNB}^{l}=\frac{1}{4C_{l}^{2}(H_{l} \times W_{l})^{2}} \sum _{i=1}^{C_{l}}\sum _{j=1}^{C_{l}}\left( G_{ij}^{Tl}-G_{ij}^{Sl}\right) ^{2}. \end{aligned}$$
(3)

By optimizing Eq. (2), the student network is encouraged to align its response neural behavior with that of the teacher network.
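A minimal PyTorch sketch of Eqs. (1)–(3) is given below. It assumes that `feats_t` and `feats_s` are lists of per-layer feature maps of shape (B, C_l, H_l, W_l) collected from the teacher and student networks (e.g. via forward hooks); the batch-mean reduction is an assumption.

```python
import torch

def response_matrix(feat):
    """Eq. (1): Gram-style response matrix over flattened spatial positions."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2))        # shape (B, C_l, C_l)

def rnb_loss(feats_t, feats_s):
    """Eqs. (2)-(3): normalized MSE between teacher and student response matrices."""
    loss = 0.0
    for ft, fs in zip(feats_t, feats_s):
        b, c, h, w = ft.shape
        g_t, g_s = response_matrix(ft), response_matrix(fs)
        layer_loss = ((g_t - g_s) ** 2).sum(dim=(1, 2)) / (4 * c ** 2 * (h * w) ** 2)
        loss = loss + layer_loss.mean()           # average over the batch
    return loss
```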

Learning neural behavior

Learning neural behavior is used to simulate the intermediate problem-solving steps in actual teaching. It is defined as the transformation between the response neural behaviors of adjacent layers:

$$\begin{aligned} M^{l}= \Vert G^{l}-G^{l+1} \Vert _{2}^{2}, \end{aligned}$$
(4)

where \(M^{l}\) is the learning matrix between the l-th layer and the \((l+1)\)-th layer. \(M^{Tl}\) and \(M^{Sl}\) can be calculated by the above equation.

Given n response matrices, there are \(n-1\) learning matrices. By optimizing the following loss function, we encourage the student network to align its learning neural behavior with the teacher’s:

$$\begin{aligned} {\mathcal {L}}_{LNB}= \sum _{l=1}^{n-1}\Vert M^{Tl}-M^{Sl} \Vert _{2}^{2}. \end{aligned}$$
(5)
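Continuing the sketch above (and reusing `response_matrix`), Eqs. (4)–(5) might be realized as follows. The sketch assumes that adjacent layers have the same number of channels so that their response matrices can be subtracted directly, and it treats each \(M^{l}\) as a per-sample scalar.

```python
def learning_matrices(feats):
    """Eq. (4): squared L2 distance between response matrices of adjacent layers."""
    grams = [response_matrix(f) for f in feats]
    return [((grams[l] - grams[l + 1]) ** 2).sum(dim=(1, 2))   # shape (B,)
            for l in range(len(grams) - 1)]

def lnb_loss(feats_t, feats_s):
    """Eq. (5): align the learning neural behavior of student and teacher."""
    m_t, m_s = learning_matrices(feats_t), learning_matrices(feats_s)
    return sum(((mt - ms) ** 2).mean() for mt, ms in zip(m_t, m_s))
```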

Prediction neural behavior

We introduce the dark knowledge proposed by Hinton et al. (2015) as the prediction neural behavior. Usually, the output of the model is obtained by processing the logits through softmax, and the neural behavior that this output can represent is limited. To this end, we follow the procedure of Hinton et al. (2015) and introduce a temperature T to soften the result of the softmax, which serves as the prediction neural behavior. The prediction neural behavior for class i is calculated as follows:

$$\begin{aligned} p_{i}(z_{i},T)= \frac{\exp (z_{i}/T)}{\sum \nolimits _{j=1}^{k}\exp (z_{j}/T)}, \end{aligned}$$
(6)

where \(z_{i}\) and \(z_{j}\) are logits and k is the number of classes. Here we set \(T=5\).

Therefore, the alignment of prediction neural behavior can be implemented by optimizing the KL-divergence between the prediction neural behavior of the teacher network and the student network:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{PNB}&= L_{KD}(p(u,T),p(z,T))\\&=\sum _{i=1}^{k}-p_{i}(u_{i},T)\log (p_{i}(z_{i},T)), \end{aligned} \end{aligned}$$
(7)

where u and z are the logits of the teacher network and the student network, respectively.

By optimizing the \({\mathcal {L}}_{PNB}\), the student model is encouraged to align its own prediction neural behavior with that of the teacher.
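A sketch of Eqs. (6)–(7) is shown below, where `u` and `z` are teacher and student logits of shape (B, k). T = 5 follows the setting above, while the batch-mean reduction is an assumption.

```python
import torch.nn.functional as F

def pnb_loss(u, z, T=5.0):
    """Eqs. (6)-(7): KD loss between temperature-softened predictions."""
    p_teacher = F.softmax(u / T, dim=1)           # Eq. (6) applied to teacher logits
    log_p_student = F.log_softmax(z / T, dim=1)   # log of Eq. (6) for the student
    return -(p_teacher * log_p_student).sum(dim=1).mean()
```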

Learning distillation loss

Based on the three losses defined in the “Neural behavior” section, we define the NBA learning distillation loss \({\mathcal {L}}_{LDL}\) to encourage the student network to fully learn the knowledge transferred by the teacher network during defensive distillation:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{LDL}&= {\mathbb {E}}_{(x,y) \sim {\mathcal {D}}}[\lambda _{1} {\mathcal {L}}_{RNB}(T(x),S(x))\\&\quad +\lambda _{2} {\mathcal {L}}_{LNB}(T(x),S(x))\\&\quad +\lambda _{3} {\mathcal {L}}_{PNB}(T(x),S(x))], \end{aligned} \end{aligned}$$
(8)

where \({\mathcal {D}}\) is the local clean dataset, and T and S are the teacher and student networks. Specifically, \(T(\cdot )\) and \(S(\cdot )\) represent the corresponding knowledge forms in each loss term. \(\lambda _{i}(i=1, 2, 3)\) are hyperparameters controlling the weight of each loss term; they are set to 2.0, 2.0 and 0.1, respectively.
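Putting the three behavior losses together, a sketch of Eq. (8) might look as follows. It assumes a hypothetical forward interface `net(x, return_features=True)` that returns both the logits and the list of intermediate feature maps (and only the logits when the flag is omitted), and it reuses `rnb_loss`, `lnb_loss`, and `pnb_loss` from the sketches above.

```python
import torch

def ldl_loss(teacher, student, x, l1=2.0, l2=2.0, l3=0.1):
    """Eq. (8): learning distillation loss on a batch of clean samples x."""
    with torch.no_grad():                          # the teacher is frozen
        logits_t, feats_t = teacher(x, return_features=True)
    logits_s, feats_s = student(x, return_features=True)
    return (l1 * rnb_loss(feats_t, feats_s)
            + l2 * lnb_loss(feats_t, feats_s)
            + l3 * pnb_loss(logits_t, logits_s))
```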

Unlearning distillation loss

The unlearning distillation loss is proposed to correct the backdoor behavior of the student network and improve its generalization. The core idea is to construct pseudo samples and to input the original samples and the pseudo samples into the teacher network and the student network, respectively. By optimizing this loss function, the student network further eliminates its backdoor behavior, i.e., it actively unlearns the backdoor knowledge.

The student network typically exhibits backdoor behavior only in the presence of poisoned samples, but the defender only has clean samples. We introduce adversarial attacks to address this challenge.

Typically, the backdoored network can be obtained by optimizing the following loss function during training:

$$\begin{aligned} {\mathcal {L}} = {\mathbb {E}}_{(x,y) \sim {\mathcal {D}}_{c}}[l(f_{\theta }(x),y)]+{\mathbb {E}}_{(x,y) \sim {\mathcal {D}}_{p}}[l(f_{\theta }(x+ \triangle ),y_{t})], \end{aligned}$$
(9)

where \(l(\cdot )\) denotes a loss function such as cross-entropy loss, and \({\mathcal {D}}_{c}\) and \({\mathcal {D}}_{p}\) denote the clean and poisoned subsets of the training dataset. In particular, \(\triangle\) denotes the trigger and \(y_{t}\) denotes the target label. The function of the second term is to implant the backdoor. Compared to training a clean neural network, optimizing this loss function creates a shortcut in the network’s decision-making process that maps trigger-bearing inputs to the target label. Potential adversarial attacks are thus affected by this shortcut.

We conduct an untargeted adversarial attack on the student network to craft pseudo samples \(x^{'}=x+\delta\), where \(\delta\) is an adversarial perturbation obtained by optimizing:

$$\begin{aligned} \mathop {\max }\limits _{\delta } {\mathcal {L}}(f_{\theta }(x+\delta ),y), \quad s.t. \ \Vert \delta \Vert _{p} < \epsilon \end{aligned}$$
(10)
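A PGD-style sketch of solving Eq. (10) on the student network to craft pseudo samples is given below. The perturbation budget, step size, and iteration count are illustrative assumptions, and inputs are assumed to lie in [0, 1].

```python
import torch
import torch.nn.functional as F

def craft_pseudo_samples(student, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Maximize the classification loss w.r.t. a bounded perturbation delta."""
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(student(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    return (x + delta).clamp(0, 1).detach()        # pseudo samples x' = x + delta
```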

This optimization problem can be solved by gradient-based adversarial methods such as PGD (Madry et al. 2018), as sketched above. According to their predicted labels, the pseudo samples can be divided into two categories. Samples that are predicted as the target label are pseudo-poisoned samples; during optimization they converge to the local extremum created by the backdoor, which means that \(\delta\) and \(\triangle\) are strongly related. Figure 4 visualizes the feature maps generated by pseudo-poisoned samples and poisoned samples: they exhibit strong similarities, indicating that both activate the network to exhibit similar behavior. The other samples obtained by the above optimization are called pseudo-clean samples, and their predictions are inconsistent with the target label. The unlearning distillation loss is defined as follows:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{UDL}&= {\mathbb {E}}_{(x,y) \sim {\mathcal {D}}}[\lambda _{1} {\mathcal {L}}_{RNB}(T(x),S(x+\delta ))\\&\quad +\lambda _{2} {\mathcal {L}}_{LNB}(T(x),S(x+\delta ))\\&\quad +\lambda _{3} {\mathcal {L}}_{PNB}(T(x),S(x+\delta ))]. \end{aligned} \end{aligned}$$
(11)

It is important to note that during the optimization of the above equation, the two types of pseudo samples play different roles, but both contribute to the defense performance. In particular, the pseudo-poisoned samples induce the student network to exhibit backdoor behavior, allowing the backdoor knowledge to be removed more efficiently, while the pseudo-clean samples essentially provide a regularization effect.
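The unlearning distillation loss of Eq. (11) then mirrors the learning distillation loss, except that the student receives the pseudo samples while the teacher still receives the original clean samples. The sketch below reuses the loss functions and the assumed forward interface from the earlier sketches.

```python
import torch

def udl_loss(teacher, student, x, x_pseudo, l1=2.0, l2=2.0, l3=0.1):
    """Eq. (11): align the student's behavior on pseudo samples with the
    teacher's benign behavior on the corresponding clean samples."""
    with torch.no_grad():
        logits_t, feats_t = teacher(x, return_features=True)
    logits_s, feats_s = student(x_pseudo, return_features=True)
    return (l1 * rnb_loss(feats_t, feats_s)
            + l2 * lnb_loss(feats_t, feats_s)
            + l3 * pnb_loss(logits_t, logits_s))
```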

Fig. 4 Visualization of feature maps extracted from the backdoored network

Total loss

The overall objective of defensive distillation has the form:

$$\begin{aligned} {\mathcal {L}}_{Total} = \alpha ({\mathcal {L}}_{LDL} +\beta {\mathcal {L}}_{UDL}) +l(f_{\theta }(x),y), \end{aligned}$$
(12)

where \(\alpha\) controls the weight of distillation loss in total loss, and \(\beta\) controls the weight of unlearning distillation loss in distillation loss. We set \(\alpha =1.0\) and \(\beta =0.5\) in this paper.
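A sketch of one defensive-distillation training step with the total objective of Eq. (12) (with α = 1.0 and β = 0.5) is shown below, reusing the sketches above. Crafting the pseudo samples freshly for each batch and the redundant forward passes are simplifications of a real implementation.

```python
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, x, y, alpha=1.0, beta=0.5):
    x_pseudo = craft_pseudo_samples(student, x, y)                       # Eq. (10)
    logits_s, _ = student(x, return_features=True)
    loss = (alpha * (ldl_loss(teacher, student, x)                       # Eq. (8)
                     + beta * udl_loss(teacher, student, x, x_pseudo))   # Eq. (11)
            + F.cross_entropy(logits_s, y))                              # hard-label term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```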

Experiments

Experimental settings

Attack setups

We conduct experiments on 6 representative backdoor attacks, which have distinct characteristics in trigger design (BadNets (Gu et al. 2017), TrojanNN (Liu et al. 2018) and Refool (Liu et al. 2020)), trigger injection (Blend (Chen et al. 2017)), and label modification (CLA (Turner et al. 2019) and SIG (Barni et al. 2019)). The poisoned samples constructed by these methods are shown in Fig. 5. We follow the settings given in the original papers, such as trigger patterns and target labels. We evaluate all attacks and defenses on CIFAR10 and GTSRB with WideResNet (WRN-16-1) (Zagoruyko and Komodakis 2016). Other attack details are described in Table 1.

Fig. 5 Examples of clean samples (top) and poisoned samples (bottom) used in our experiments

Table 1 Settings of 6 well-known backdoor attacks

Defense setups

We compare our approach with state-of-the-art defenses, including Fine-tuning (FT) (Papernot et al. 2016), Fine-pruning (FP) (Liu et al. 2018), Mode Connectivity Repair (MCR) (Zhao et al. 2020), Neural Attention Distillation (NAD) (Li et al. 2021), and Attention Relation Graph Distillation (ARGD) (Xia et al. 2022). The defenses can be compared fairly since they are based on the same threat model. Consistent with previous work (Li et al. 2021; Xia et al. 2022), we assume that the defender has 5% of the clean training set. We set the batch size to 64 and the initial learning rate to 0.1, and train the network using the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9.
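The training configuration above corresponds roughly to the following sketch; `clean_subset` (the defender’s 5% clean split) and the absence of a learning-rate schedule are assumptions.

```python
import torch

def make_defense_loader_and_optimizer(student, clean_subset):
    loader = torch.utils.data.DataLoader(clean_subset, batch_size=64, shuffle=True)
    optimizer = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)
    return loader, optimizer
```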

Metrics

We use attack success rate (ASR) and benign accuracy (BA) to evaluate the effectiveness of the defenses. In general, the lower the ASR and the higher the BA, the better the defense method. A code sketch of both metrics is given after the definitions below.

  • Attack Success Rate (ASR) This metric measures the proportion of the poisoned testing set that is predicted as the target class.

  • Benign Accuracy (BA) This metric measures the proportion of the clean testing set that is predicted as the ground-truth classes.
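A minimal sketch of the two metrics, assuming `poisoned_loader` yields trigger-stamped inputs and `clean_loader` yields clean inputs with ground-truth labels; values are reported in percent, as in Table 2.

```python
import torch

@torch.no_grad()
def attack_success_rate(net, poisoned_loader, target_label, device="cuda"):
    hits = total = 0
    for x, _ in poisoned_loader:
        preds = net(x.to(device)).argmax(dim=1)
        hits += (preds == target_label).sum().item()
        total += x.size(0)
    return 100.0 * hits / total

@torch.no_grad()
def benign_accuracy(net, clean_loader, device="cuda"):
    hits = total = 0
    for x, y in clean_loader:
        preds = net(x.to(device)).argmax(dim=1)
        hits += (preds == y.to(device)).sum().item()
        total += x.size(0)
    return 100.0 * hits / total
```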

Experimental results

Effectiveness of NBA

We present the detailed results on the comparison of performance in Table 2.

Overall, our proposed method achieves good defensive performance. We further illustrate this from two aspects.

We first analyze the defensive effects of NBA against different attacking methods. It is noted that NBA is always effective on different attack methods, i.e., lower ASR and higher BA can be achieved. This demonstrates NBA’s impressive adaptability.

Second, we compare the defenses with one another. FT and the three methods based on knowledge distillation (NAD, ARGD, and NBA) achieve better results. Further analysis reveals two important findings. (1) The knowledge distillation schemes outperform FT. This is because FT relies solely on loss functions such as cross-entropy loss for self-learning, while the distillation schemes introduce a distillation loss and benefit from the guidance of a teacher network. (2) Among the distillation schemes, NBA achieves the best performance (average ASR of 1.52% and average BA of 81.14%). The reason is that NAD’s distillation loss is based only on attention maps extracted from feature maps, and ARGD’s distillation loss additionally takes into account the order relationship between attention maps; essentially, both are equivalent to the response neural behavior. NBA, by contrast, adopts two different types of distillation loss simultaneously. The first is the learning distillation loss, which utilizes three types of neural behavior, including the response neural behavior, as the form of knowledge and leads to better learning results. The second is the unlearning distillation loss, which actively removes backdoor neural behaviors. As a result, NBA has a significant advantage in reducing ASR while maintaining BA.

In addition, it is worth noting that although ARGD performs better than NAD on average, its BA (80.35%) is lower than NAD’s (80.47%) when defending against attacks such as BadNets. This indicates that the improvement achieved by ARGD is limited. In contrast, NBA outperforms both ARGD and NAD, whether defending against specific attack methods or on average. In terms of the degree of improvement, NBA consistently improves the defense performance (both BA and ASR) by at least 1 percentage point, achieving the best overall defense performance.

Furthermore, we find that although FT, NAD, and ARGD do not adopt a loss function similar to NBA’s unlearning distillation loss, they can still reduce ASR to a certain extent. It is important to stress, however, that for these schemes the reduction of ASR relies on the “catastrophic forgetting” effect (Goodfellow et al. 2014; Kirkpatrick et al. 2017) of the neural network, rather than on actively removing the backdoor.

Table 2 Performance (%) comparison of 6 backdoor defenses against 6 backdoor attacks
Fig. 6 The ASRs of 6 defenses under different sizes of clean data

Fig. 7 The BAs of 6 defenses under different sizes of clean data

Effectiveness under different defender’s capacity

According to the threat model presented in this paper, defender capabilities are primarily determined by the size of the local dataset. Here we investigate the effect of local dataset size on defense performance.

Figures 6 and 7 illustrate that most defense schemes perform better as the size of the clean dataset increases. FP, however, is an exception: as can be seen from Fig. 6, its ASR at 5% is similar to that at 20%. This suggests that the size of the local dataset does not significantly affect the defense performance of FP, because more clean samples do not help FP identify compromised neurons more accurately.

With only a very small dataset (1%), NBA fails to perform as well as ARGD. However, as the dataset becomes larger, its advantages gradually become apparent. Specifically, with a dataset size of 5%, NBA achieves the best performance among all defenses.

Further observation shows that the performance gap between these defenses (except FP) narrows as the dataset size increases, which indicates that dataset size does significantly affect defense performance. NBA, however, retains a clear advantage in this case: at a dataset size of 20%, although NBA’s BA is similar to that of the other schemes, its ASR remains much lower.

The dataset provided by the third party along with the trained network is usually not very large. We therefore argue that our assumption about the size of the local dataset (i.e., 5% of the training set) is reasonable.

Further understanding of NBA

Fig. 8 The ASRs of NBA under different representations of neural behaviors

Fig. 9 The BAs of NBA under different representations of neural behaviors

The advantage of neural behavioral representations

Here, we investigate how different representations of neural behavior affect defense performance. Feature maps are treated as low-level representations, while Gram matrices are treated as high-level representations.

The results are shown in Figs. 8 and 9. Generally speaking, Gram-matrix-based NBA outperforms feature-map-based NBA. Although feature maps directly represent the neural behaviors inside the model, they contain too much redundant information. By contrast, Gram matrices capture neural behavioral knowledge more effectively through the inner products of feature maps, and therefore achieve better knowledge transfer.

Ablation study of distillation loss

NBA combines three kinds of neural behavior and two forms of loss function to effectively remove the backdoor. Here we perform two ablation studies to demonstrate that none of them can be omitted.

In Table 3, we show that applying each neural behavior alone achieves a certain defensive effect. However, when only one or two kinds of neural behavior are used, the defense scheme cannot achieve the best performance.

For the learning distillation loss and the unlearning distillation loss, Table 4 shows the performance of different settings. The results demonstrate that a reasonable overall defense performance can be obtained when only the learning distillation loss is used. The backdoor can be removed more thoroughly by the unlearning distillation loss (as indicated by the lower ASR), but at the cost of a lower BA. By reasonably adjusting the coefficient \(\beta\) in Eq. (12), we can achieve a better trade-off between ASR and BA.

Table 3 Ablation results of neural behavior loss
Table 4 Ablation results of distillation loss

Possible settings for unlearning distillation loss

Crafting pseudo samples is the key to the unlearning distillation loss. This section covers the rationale and necessity of the pseudo sample crafting approach.

We replace the pseudo samples with actual poisoned samples and perform additional experiments; Table 5 shows the results. The first row presents the results of defensive distillation using only the learning distillation loss, where the ASR is reduced to 3.2%, indicating that the backdoor has been largely removed. The last two rows of Table 5 display NBA’s results using poisoned samples and pseudo samples. Both further reduce the ASR (to 1.38% and 1.52%, respectively), indicating that introducing the unlearning distillation loss can indeed remove the backdoor more effectively, although there is a diminishing marginal effect. Notably, there is little difference between using poisoned samples and using pseudo samples in the unlearning distillation loss. Several defense methods can accurately reverse-engineer approximate poisoned samples (Qiao et al. 2019; Wang et al. 2019; Tao et al. 2022), but this is unnecessary for our approach, given that even real poisoned samples do not provide significantly better performance. In addition, these schemes require high computational overhead or attack details such as the trigger size (Wang et al. 2019) when crafting potential poisoned samples.

Table 5 Performance comparison of different samples

Conclusion

This paper presents NBA, a novel defensive distillation mechanism for backdoor removal. We optimize the knowledge distillation process from the perspectives of both knowledge form and training samples to make it better suited to the defense scenario. In terms of knowledge form, we extract and align three kinds of neural behavior to achieve efficient knowledge transfer. In terms of training samples, we construct pseudo samples to further eliminate the backdoor from the backdoored network.

To the best of our knowledge, NBA is the first active defensive distillation mechanism and has competitive advantages in terms of backdoor removal. NBA’s highly effective defense performance and realistic threat model make it an attractive candidate for practical defensive scenarios.