1 Introduction

Satellite remote sensing has made significant breakthroughs in recent years [1], and it is essential to interpret and identify remote sensing images that record information on land features so that they can be effectively used for military and livelihood purposes. Hyperspectral imaging employs the two-dimensional imaging technology to collect spatial information on the object surface and the spectral technology to decompose the total radiation in each pixel into radiation spectra of various bands [2]. The surface material exhibits different radiation profiles at various wavelengths depending on the material composition and lighting conditions. Therefore, the hyperspectral image (HSI) classification technique, which determines the substances represented by hyperspectral pixels and assigns classification labels depending on radiation conditions, is of great significance in meteorology [3], environment protection [4], agriculture [5], and military [6].

The efficacy of HSI classification has surged owing to advances in deep learning techniques [7–14]. However, this efficacy has been challenged by the emergence of various adversarial attacks [15–17]. The seminal work of Szegedy et al. [18] revealed that deep neural networks (DNNs) can be misled by adversarial examples, which are created by minute alterations to the original examples and cause the DNNs to assign incorrect labels. Subsequently, an increasing number of adversarial attacks and robustness evaluation algorithms were proposed [19–21]. As illustrated in Fig. 1, even a seemingly insignificant modification to the original example can significantly reduce the overall accuracy (OA) of a deep learning-based HSI classification model. Notably, the human visual system is incapable of distinguishing between the adversarial example and the original one. Therefore, adversarial defense strategies have been introduced to enhance the robustness of HSI classification models against adversarial examples. These strategies fall broadly into two categories: changing inputs [22–27] and changing models [16, 17].

Figure 1

Adversarial examples that induce misclassification in deep learning-based hyperspectral image (HSI) classification. The overall accuracy (OA) of the deep learning model for the original example is 97.42%. However, adding a subtle perturbation reduces the OA to 22.66%

Adversarial training is one of the most commonly used defense strategies, where the generated adversarial examples are trained together with clean examples to enhance model robustness [26]. However, this strategy has two limitations. First, adversarial training cannot augment the intrinsic robustness of a model, and new adversarial examples can still render the model erroneous. Second, adversarial training necessitates additional adversarial examples for training, which requires more computing resources [25]. An alternative strategy involves directly modifying the models to enhance their resistance to adversarial examples. Xu et al. [17] presented a new HSI classification framework SACNet to resist adversarial attacks by establishing global contextual relationships of spatial pixels and achieved state-of-the-art results. However, given the rich spectral features of hyperspectral data, incorporating spectral features in addition to spatial features could potentially lead to even better or optimal performance.

To defend against adversarial attacks, we jointly consider the features of both spectral and spatial domains and propose a weighted fusion of spectral transformer and spatial self-attention (WFSS) to achieve robust HSI classification. In the spectral domain, we establish long-range dependencies between spectral channels using a transformer network. HSI is inherently a sequence-based data structure in the spectral domain, as it contains hundreds of bands. Therefore, it is challenging for a convolutional neural network (CNN) to learn long-range sequential dependence between bands at a distance. The transformer network with self-attention as the core can establish the relationship between every two bands. For the spatial domain, we use a self-attention context network inspired by Ref. [17] to establish the global context relationship of spatial pixels. We share the local loss with the global loss by establishing the respective global relations of spectral bands and spatial pixels, and the influence of attack can be dispersed globally. Therefore, a higher level of perturbation is needed to cause the input to be misclassified. Moreover, we propose a method for the weighted fusion of spectral features and spatial features to further enhance the model’s defense against adversarial attacks. The contributions of this study are summarized as follows.

  1) We systematically analyze the features of adversarial examples in HSI classification and find that adversarial attacks have a greater impact on spectral information. The state-of-the-art defense strategies still cannot achieve satisfactory results on several datasets.

  2) We propose a new framework named WFSS through the weighted fusion of spectral transformer and spatial self-attention to defend against adversarial attacks. The WFSS framework can learn global information of spatial and spectral domains and construct global connections through self-attention and transformer. We find that the constructed spectral and spatial global connections can enhance the resistance of HSI classification models to adversarial examples.

  3) To the best of our knowledge, this is the first work that has introduced the transformer structure in HSI classification adversarial attacks and defenses. We adopt the transformer structures to build long-distance dependencies among different spectral channels to enhance the robustness against adversarial attacks.

The remainder of this paper is organized as follows. Section 2 provides an overview of adversarial attacks and defenses in HSI classification. Section 3 presents the proposed WFSS in detail. Section 4 reports the performance, comparison and analysis of the proposed method on three benchmark datasets. Finally, Section 5 provides a conclusion and discussion of the results and directions for future work.

2 Related works

2.1 Adversarial attacks

We present an overview of some of the most representative methods used to generate adversarial examples.

Fast gradient sign method (FGSM). Goodfellow et al. [28] presented an adversarial example generation method based on the gradient of the loss, called the FGSM. The fundamental concept is to add to the input a perturbation that follows the sign of the loss gradient with respect to the input, which increases the loss value of the model output and consequently makes the network output inaccurate. The expression is as follows:

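$$ \boldsymbol{\delta} = \epsilon \cdot \mathrm{sign} \bigl( \nabla _{x} \mathrm{J} ( \boldsymbol{x},\boldsymbol{\theta} ,y_{ \mathrm{true}} ) \bigr), $$
(1)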

where δ indicates the perturbation, \(\mathrm{sign(\cdot )}\) indicates the sign function, and \(\mathrm{J} ( \boldsymbol{x},\boldsymbol{\theta} ,y_{ \mathrm{true}} )\) is the cross-entropy loss function. θ represents the parameters of a model. \(\nabla _{x}\) denotes the derivative with respect to x. ϵ is used to limit the size of the added perturbation. When \(y_{\mathrm{true}}\) is set to the target class t, the perturbation with target attack is represented by \(\boldsymbol{\delta}_{t}\). The process of adversarial example generation is described by Eq. (2).

$$ \textstyle\begin{cases} \boldsymbol{x}^{\mathrm{adv}} = \boldsymbol{x} + \boldsymbol{\delta}, & \mathrm{no\ target}, \\ \boldsymbol{x}^{\mathrm{adv}} = \boldsymbol{x} - \boldsymbol{\delta}_{t}, & \mathrm{target\ on\ \mathit{t}}. \end{cases} $$
(2)
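As a minimal sketch of Eqs. (1) and (2), the snippet below applies FGSM to a toy linear softmax classifier in NumPy. The classifier, its weights `W`, and the helper names `loss_grad` and `fgsm` are illustrative assumptions and are not part of any model evaluated in this paper. The gradient of the cross-entropy loss with respect to the input is written analytically so that no automatic differentiation library is needed.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(W, x, y):
    # J(x, theta, y) for a linear model with logits W @ x
    return -np.log(softmax(W @ x)[y])

def loss_grad(W, x, y):
    # Analytic gradient of the cross-entropy w.r.t. x: W^T (p - onehot(y))
    p = softmax(W @ x)
    p[y] -= 1.0
    return W.T @ p

def fgsm(W, x, y_true, eps, target=None):
    if target is None:
        # Untargeted case of Eq. (2): x_adv = x + delta
        delta = eps * np.sign(loss_grad(W, x, y_true))
        return x + delta
    # Targeted case of Eq. (2): x_adv = x - delta_t, with y_true replaced by t
    delta_t = eps * np.sign(loss_grad(W, x, target))
    return x - delta_t

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))   # hypothetical 3-class model on 5 "bands"
x = rng.normal(size=5)
x_adv = fgsm(W, x, y_true=0, eps=0.04)
```

For this convex toy loss the untargeted step cannot decrease the loss on the true label, while the perturbation stays within ϵ in every band.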

Projected gradient descent (PGD). PGD [29] obtains the perturbation by performing several iterations on the basis of FGSM. The process of generating adversarial examples by PGD is presented as follows:

$$ \boldsymbol{x}^{\mathrm{adv}}_{i+1} =\mathrm{clip} \bigl(\boldsymbol{x}^{ \mathrm{adv}}_{i} - \epsilon \cdot \mathrm{sign} \bigl( \nabla _{x} \mathrm{J} \bigl( \boldsymbol{x}^{\mathrm{adv}}_{i}, \boldsymbol{\theta} ,t \bigr) \bigr)\bigr), $$
(3)

where \(\boldsymbol{x}^{\mathrm{adv}}_{0}\) is the original image x. \(\mathrm{sign}(\cdot )\) indicates the sign function, \(\mathrm{J} ( \boldsymbol{x},\boldsymbol{\theta},y_{ \mathrm{true}} )\) is the cross-entropy loss function. ϵ denotes the step size of each iteration, and t is the target class.
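The iteration in Eq. (3) can be sketched as follows for the same kind of toy linear softmax classifier. The text does not state what the clip operation projects onto, so this sketch assumes an \(\ell _{\infty}\) ball of radius `radius` around the original input; `radius`, `step`, and the toy model are hypothetical choices made only for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_grad(W, x, y):
    # Gradient of the cross-entropy w.r.t. x for logits W @ x
    p = softmax(W @ x)
    p[y] -= 1.0
    return W.T @ p

def pgd_targeted(W, x, target, step, radius, n_iter):
    x_adv = x.copy()                      # x_adv_0 is the original input
    for _ in range(n_iter):
        # Signed gradient step that lowers the loss on the target class t
        x_adv = x_adv - step * np.sign(loss_grad(W, x_adv, target))
        # Assumed clip: keep x_adv within `radius` of x in every component
        x_adv = np.clip(x_adv, x - radius, x + radius)
    return x_adv

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
x = rng.normal(size=5)
x_adv = pgd_targeted(W, x, target=0, step=0.01, radius=0.04, n_iter=10)
```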

Carlini and Wagner’s attack (C&W). Carlini and Wagner [30] introduced C&W as an optimization-based attack method with the objective of minimizing the distance between the adversarial example and the original example. The expression of the formulation is as follows.

$$\begin{aligned} \mathrm{min} \bigl\Vert \boldsymbol{x}-\boldsymbol{x}^{\mathrm{adv}} \bigr\Vert _{2}^{2} + c\cdot \mathrm{f} \bigl( \boldsymbol{x}^{ \mathrm{adv}},t \bigr), \end{aligned}$$
(4)

where \(\mathrm{f} (\boldsymbol{x}^{\mathrm{adv}},t )= \mathrm{max}(\mathrm{max}_{i\neq t}\mathrm{Z}(\boldsymbol{x}^{ \mathrm{adv}})_{i}-\mathrm{Z}(\boldsymbol{x}^{\mathrm{adv}})_{t},-k)\). The parameter c is a constant that weighs the relationship between the two losses and k is confidence.
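To make the two terms of Eq. (4) concrete, the following sketch evaluates the margin function f and the full objective for given logits. It only evaluates the objective and does not perform the optimization itself; the function names and the example numbers are assumptions chosen for illustration.

```python
import numpy as np

def cw_margin(logits, target, k=0.0):
    # f(x_adv, t) = max(max_{i != t} Z_i - Z_t, -k)
    others = np.delete(logits, target)
    return float(max(others.max() - logits[target], -k))

def cw_objective(x, x_adv, logits_adv, target, c, k=0.0):
    # Squared L2 distance plus c times the margin term, as in Eq. (4)
    distance = float(np.sum((x - x_adv) ** 2))
    return distance + c * cw_margin(logits_adv, target, k)

# Target class 1 is not yet the top class, so the margin is positive.
margin_before = cw_margin(np.array([2.0, 1.0, 0.5]), target=1)
# Target class 1 wins by more than k, so the margin saturates at -k.
margin_after = cw_margin(np.array([0.0, 5.0, 1.0]), target=1, k=2.0)
objective = cw_objective(np.zeros(2), np.array([0.3, 0.4]),
                         np.array([2.0, 1.0, 0.5]), target=1, c=2.0)
```

The saturation at −k means that once the target class leads by the confidence margin, only the distance term continues to drive the optimization.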

2.2 Robust HSI classifications

Deep learning algorithms are extensively applied to HSI classification tasks [7, 14, 31–34]. Chen et al. [35] performed the first feature extraction and HSI classification task with deep learning algorithms. Subsequently, they performed classification by stacked autoencoders and spatially dominated information [7]. Recurrent neural networks are used in HSI analysis because of their powerful sequence data learning capabilities [36–38].

CNN-based methods. The aforementioned studies place greater emphasis on the spectral information of HSI, while studies have demonstrated that spatial features can effectively improve classification accuracy [39]. CNNs are utilized to learn the spatial texture features of HSIs because of their excellent feature representation capabilities [9, 40, 41]. To further fuse the spectral information, three-dimensional (3D) CNNs are introduced [10, 42, 43]. However, 3D-CNNs contain a very large number of parameters and therefore suffer from high model complexity and computational cost. Therefore, Roy et al. [11] presented HybridSN to reduce the complexity of the model. Xu et al. [12] reduced redundant calculations by processing the entire hyperspectral cube rather than patches. Then, they further addressed the difficulty of collecting pixel-level labels for HSIs and introduced a robust self-embedding network (RSEN) to achieve competitive classification results with few training samples [13].

Defense of HSI classification. However, the emergence of adversarial examples has exposed the vulnerabilities of deep learning models [26]. Park et al. [16] randomly sampled all bands of each pixel in HSI to reduce the strength of adversarial perturbation. At the same time, the spectral shape features that are robust to adversarial attacks are first extracted and then encoded to enhance the robustness of the model. Most networks focus only on local features of the data and ignore the connections between global information. Xu et al. [17] determined the global relationships between HSI pixels in the spatial domain through the self-attention mechanism and encoded them in context. Because the loss of a given pixel is shared over all relevant pixels globally, the model is harder to attack. However, the rich spectral features play an irreplaceable role in HSI classification. Therefore, it is worthwhile to further investigate the influence of spectral features on the robustness of HSI classification models under adversarial attacks.

2.3 Transformer on HSI

Vision transformers [44] have recently achieved great success in computer vision. The vision transformers efficiently learn long-range interactions between sequential data through a self-attention mechanism [45, 46].

Pure transformer. He et al. [47] performed the first HSI classification task with transformer and presented HSI-BERT. The proposed method learns the global dependence between pixels using a multi-head self-attention mechanism. Hong et al. [48] noted that CNNs have limitations in learning spectral features. They presented a new network called SpectralFormer to learn local spectral sequence features from adjacent bands of HSIs. Zhong et al. [49] introduced a spectral–spatial transformer network to solve the fixed geometric structure of the convolution kernel.

Hybrid CNN-transformer. Recent studies have demonstrated that combining CNNs and transformers can better learn the local and global features of HSIs. Zhao et al. [50] presented a convolutional transformer network to fuse spectral information and pixel positions using center position encoding. Sun et al. [51] used the transformer structure to capture the deep semantic features of HSIs. First, convolutional operations are used to obtain shallow features. Then, the features are turned into semantic tokens, which are fed into the transformer encoder for advanced semantic feature modeling. The powerful long-range interaction modeling capabilities of transformers provide inspiration for establishing spectral robustness. We treat each band of the HSI as a sequence element input into the transformer to establish a long-range dependency between bands.

3 Proposed method

We propose a novel framework named WFSS to establish the long-range dependence between spectral bands and global context connections between spatial pixels, as illustrated in Fig. 2. WFSS consists of four modules. First, HSIs are fed into the backbone network, which uses convolutional layers and average pooling layers for feature extraction and dimensionality reduction. Subsequently, the extracted features are further sent to the spectral transformer module and the spatial self-attention module to establish two global relationships in spectral and spatial dimensions, respectively. These two dependencies are weighted and fused by the fusion module. Finally, the features learned from the backbone network and the fusion module are concatenated to implement multi-scale classification. In this section, we discuss the proposed WFSS in detail.

Figure 2

Framework of the proposed weighted fusion of spectral transformer and spatial self-attention (WFSS). \(\boldsymbol{C}_{i} (i=1,2,3)\) is the feature of the i-th convolutional layer; h, w, and b represent the height, width, and band number in the input cube, respectively; u, v, and n denote the height, width and number of channels of the input features, respectively. The input of the transformer encoder (TE) is \(\boldsymbol{T}_{\mathrm{in}}\) and the output is \(\boldsymbol{T}_{\mathrm{out}}\). \(\boldsymbol{F}_{\mathrm{spe}}\) and \(\boldsymbol{F}_{\mathrm{spa}}\) are the output features of the spectral and spatial branches, respectively

3.1 Overview formulation

Adversarial attack causes the model to misclassify images by adding a perturbation to each band of the HSI. This can be defended by limiting the impact of the adversarial perturbation on the bands. Global connections are established for all bands to disperse the impact of the adversarial perturbation on a given band to other relevant bands. Assume that the impact of the adversarial attack on the i-th band is \(L_{i}\). The problem is formulated as

$$\begin{aligned} L_{i}=\sum^{c}_{j=1} \alpha _{ji} L_{ji}, \end{aligned}$$
(5)

where \(\alpha _{ji}\) is the correlation weight of the i-th and j-th bands. \(L_{ji}\) denotes the impact of being dispersed to the j-th band, where \(j=i\) indicates the true impact of the perturbation on the i-th band.

What to correct? Recent HSI adversarial defense researchers have shifted their attention from the local to the global level [17]. While the state-of-the-art method can capture the global context information, it is still limited to the spatial domain. The rich spectral information unique to HSIs is ignored, making it impossible for existing defense methods to achieve satisfactory results at higher levels of perturbation. Therefore, combining spatial and spectral information would be more reasonable to enhance the robustness against adversarial attacks.

Optimization. Let \(\mathcal{L}_{\mathrm{cls}}\) be the loss of the classification network. Suppose \(Y^{\prime }\) and Y are the predicted label and the true label, respectively. The cross-entropy loss \(\mathcal{L}_{\mathrm{cls}}\) is formulated as

$$\begin{aligned} \mathcal{L}_{\mathrm{cls}}=-\frac{1}{hw}\sum _{i=1}^{h}\sum _{j=1}^{w} \sum_{k=1}^{m}Y_{(i,j,k)} \log \bigl(Y^{\prime }_{(i,j,k)}\bigr), \end{aligned}$$
(6)

where m indicates the number of categories, and h and w represent the height and width of the hyperspectral data, respectively. Let \(\boldsymbol{x}^{\mathrm{adv}}\) denote the adversarial example that causes the model to misclassify, where \(\boldsymbol{x}^{\mathrm{adv}}= \arg \max\mathcal{L}_{ \mathrm{adv}}(\boldsymbol{x}^{\mathrm{adv}},y;\boldsymbol{\theta})\). An adversarial attack on the HSI classification task targets all pixels rather than individual images. Therefore, the adversarial loss in this paper is the sum of the losses across all pixels.
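A minimal NumPy sketch of the pixel-wise cross-entropy in Eq. (6) is given below. It assumes one-hot labels Y and predicted class probabilities Y′ stored as arrays of shape (h, w, m); the small constant `tiny` is an added safeguard against log(0) and is not part of Eq. (6).

```python
import numpy as np

def pixelwise_cross_entropy(y_true, y_pred, tiny=1e-12):
    # y_true: one-hot labels, shape (h, w, m)
    # y_pred: predicted class probabilities, shape (h, w, m)
    h, w, _ = y_true.shape
    return float(-np.sum(y_true * np.log(y_pred + tiny)) / (h * w))

h, w, m = 2, 3, 4
labels = np.zeros((h, w, m))
labels[..., 0] = 1.0                      # every pixel belongs to class 0
uniform = np.full((h, w, m), 1.0 / m)     # an uninformative prediction
loss_uniform = pixelwise_cross_entropy(labels, uniform)
loss_perfect = pixelwise_cross_entropy(labels, labels)
```

With a uniform prediction over m classes the loss equals log m, and it approaches zero as the prediction approaches the one-hot label.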

3.2 Backbone network

We input the entire hyperspectral cube into the network to avoid redundant calculations due to a large amount of overlap among adjacent patches. There are three convolutional layers and two average pooling layers in the backbone network. Convolution operations perform preliminary feature extraction by converting the original data from high-dimensional to low-dimensional space. The entire hyperspectral cube is used as input, increasing the feature’s spatial dimensions. Pooling operations are used to reduce the size of the feature map, thus reducing the computational burden and the number of parameters.

Let \(\boldsymbol{I} \in \mathbb{R}^{h\times w\times b}\) indicate the input HSI, where h, w, and b represent the height, width, and band number in the input cube, respectively. The outputs of the three convolution layers are presented below:

$$ \boldsymbol{C}_{i}=f(\boldsymbol{W}_{i} \times \boldsymbol{I}+ \boldsymbol{b}_{i}), $$
(7)

where \(\boldsymbol{C}_{i} (i=1,2,3)\) is the feature of the i-th convolutional layer, \(\boldsymbol{W}_{i}\) denotes the weight matrix and \(\boldsymbol{b}_{i}\) is the bias vector. \(f(x)=\max(0,x)\) indicates the rectified linear units (ReLU) function. The size of the convolution kernel is 3 × 3 in this study. In addition, the dilated convolution can expand the receptive field by using different dilation rates [52]. This allows features to be extracted from a larger receptive field without increasing the computational load. The three convolutional layers extract features from different receptive fields with different dilation rates.
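The effect of the dilation rate on the receptive field can be checked with a short calculation. For a kernel of size k and dilation rate d, the effective kernel extent is k + (k − 1)(d − 1), while the number of weights stays at k × k. The dilation rates 1, 2, and 3 used below are hypothetical, since the rates of the three layers are not stated here.

```python
def effective_kernel_size(k, dilation):
    # A dilated kernel inserts (dilation - 1) gaps between adjacent taps.
    return k + (k - 1) * (dilation - 1)

def weights_per_filter(k):
    # The number of weights per input channel does not depend on the dilation.
    return k * k

# Hypothetical dilation rates for the three 3 x 3 convolutional layers.
extents = [effective_kernel_size(3, d) for d in (1, 2, 3)]
```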

3.3 Spectral transformer

The transformer achieves the most advanced performance in natural language processing and computer vision. It can effectively establish long-range dependencies between sequence data.

As illustrated in Fig. 2, the output features X of the backbone network are input into the spectral transformer (ST) module. In the first step, the input features are flattened into \(\boldsymbol{T} \in \mathbb{R}^{n\times uv}\), where n is the number of bands, u is the height, and v is the width. The input token is represented by \(\boldsymbol{T} \in \mathbb{R}^{n\times z}\), where n and \(z=uv\) denote the number and length of tokens, respectively.

The token is indicated by \([T_{1},T_{2},\ldots,T_{n}]\). We concatenate a learnable parameter class token \(T_{*}\) with T, which is used for the classification task. Then, a position encoder (PE) is added to each token to mark the position information. The resulting input token is given by

$$ \boldsymbol{T}_{\mathrm{in}}=[T_{*},T_{1},T_{2}, \ldots,T_{n}]+ \boldsymbol{PE}. $$
(8)
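The token construction of Eq. (8) can be sketched in NumPy as follows. The feature sizes, the zero-initialized class token, and the randomly initialized position encoding are placeholders; in a trained model both the class token and the position encoding would be learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
u, v, n = 4, 5, 6                         # hypothetical feature height, width, channels
X = rng.normal(size=(u, v, n))            # backbone output features

# Flatten each channel into one token of length z = u * v, giving T of shape (n, z).
T = X.reshape(u * v, n).T

cls_token = np.zeros((1, u * v))          # placeholder for the learnable class token T_*
PE = rng.normal(scale=0.02, size=(n + 1, u * v))  # placeholder position encoding

# Eq. (8): prepend the class token, then add the position encoding.
T_in = np.concatenate([cls_token, T], axis=0) + PE
```

Each row of `T` is the flattened spatial map of one channel, so the sequence axis of the transformer runs over spectral channels rather than over pixels.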

A transformer encoder (TE) is applied to learn the deep relationship between tokens as displayed in Fig. 3. It consists of a multi-head self-attention (MSA) block, a multi-layer perceptron (MLP) layer, and two normalization layers.

Figure 3

Illustration of the spectral transformer encoder and transformer decoder. MLP refers to multi-layer perceptron and norm refers to normalization. The input \(\boldsymbol{T}_{\mathrm{in}}\) is linearly mapped into three matrices (queries Q, keys K, values V)

The core of the MSA module is the self-attention (SA) mechanism. We define three learnable weight matrices \(\boldsymbol{W}_{Q}\), \(\boldsymbol{W}_{K}\), and \(\boldsymbol{W}_{V}\) to linearly map the token to the three matrices (queries Q, keys K, values V) as

$$\begin{aligned} &\boldsymbol{Q}=\boldsymbol{T}_{\mathrm{in}} \boldsymbol{W}_{Q}, \\ &\boldsymbol{K}=\boldsymbol{T}_{\mathrm{in}}\boldsymbol{W}_{K}, \\ &\boldsymbol{V}=\boldsymbol{T}_{\mathrm{in}}\boldsymbol{W}_{V}. \end{aligned}$$
(9)

Attention scores are obtained by dot product of the matrices Q and K. Attention weights are calculated by normalization operations. SA is expressed as follows:

$$\begin{aligned} &\mathrm{SA}=\mathrm{Attention}(\boldsymbol{Q},\boldsymbol{K}, \boldsymbol{V})=\sigma \biggl( \frac{\boldsymbol{Q}\boldsymbol{K}^{T}}{\sqrt{d_{k}}}\biggr)\boldsymbol{V}, \\ &\sigma (x)_{i}=\frac{\exp (x_{i})}{\sum_{j=1}\exp (x_{j})}, \end{aligned}$$
(10)

where \(\sigma (x)_{i}\) indicates the softmax function, and \(d_{k}\) is the dimension of K.

MSA is a particular SA that simultaneously performs multiple independent attention heads, first concatenating the output of each head and then projecting the final result. Multi-head self-attention can obtain information about different features in different positions. Formally,

$$\begin{aligned} &\mathrm{MSA}(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V})= \mathrm{Concat}(\mathrm{SA_{1},\ldots,SA_{h}}) \boldsymbol{W}^{o}, \\ &\mathrm{SA_{i}=Attention} \bigl(\boldsymbol{T}_{\mathrm{in}} \boldsymbol{W}_{i}^{Q},\boldsymbol{T}_{\mathrm{in}} \boldsymbol{W}_{i}^{K}, \boldsymbol{T}_{\mathrm{in}} \boldsymbol{W}_{i}^{V}\bigr), \end{aligned}$$
(11)

where \(\boldsymbol{W}_{i}^{Q},\boldsymbol{W}_{i}^{K},\boldsymbol{W}_{i}^{V} \in \mathbb{R}^{z\times d_{k}}\), \(\boldsymbol{W}^{o}\in \mathbb{R}^{hd_{k}\times z}\) is a linear projection matrix, and h here is the number of heads.
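Equations (9) to (11) can be illustrated with a compact NumPy sketch of scaled dot-product attention and its multi-head extension. The token count, token length, head dimension, and the random weights are hypothetical; the sketch only shows the shapes and the order of operations, not a trained module.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Eq. (10): softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V, weights

def multi_head_self_attention(T_in, W_Q, W_K, W_V, W_o):
    # Eq. (11): one attention head per (W_Q[i], W_K[i], W_V[i]) triple
    heads = [attention(T_in @ wq, T_in @ wk, T_in @ wv)[0]
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    # Concatenate the heads along the feature axis and project back to length z
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
n_tokens, z, d_k, n_heads = 7, 20, 5, 4
T_in = rng.normal(size=(n_tokens, z))
W_Q = rng.normal(size=(n_heads, z, d_k))
W_K = rng.normal(size=(n_heads, z, d_k))
W_V = rng.normal(size=(n_heads, z, d_k))
W_o = rng.normal(size=(n_heads * d_k, z))

out = multi_head_self_attention(T_in, W_Q, W_K, W_V, W_o)
_, head0_weights = attention(T_in @ W_Q[0], T_in @ W_K[0], T_in @ W_V[0])
```

Every row of the attention weights sums to one, so each output token is a convex combination of the value vectors of all tokens, which is what links every band to every other band.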

The MLP block consists of two fully connected layers and a GELU activation function. There are also two layers of dropout to prevent overfitting. The output \(\boldsymbol{T}_{\mathrm{out}}\) of TE is equal in size to the input \(\boldsymbol{T}_{\mathrm{in}}\).

\(\boldsymbol{T}_{\mathrm{out}}\) is decoded by the transformer decoder (TD) module in order to fuse spectral features with spatial features. To better fuse the global spectral features with the spatial pixel-level features, we need to map the output of the encoder back to the pixel space. To this end, we use an improved decoder to refine the image features in each spectral band. The difference between TD and TE is that the MSA is replaced with multi-head cross attention (MCA). The queries, keys, and values in TE are from the same input data. In TD, however, the queries are the output of the backbone network, and the keys and values are the output of the TE. We need to combine different sequence inputs using MCA to avoid extensive calculations on the dense relationships between pixels in features X. We will obtain the final spectral global relationship features \(\boldsymbol{F}_{\mathrm{spe}}\).
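The cross-attention step of the decoder can be sketched as follows, under several assumptions that the text leaves open: the backbone features are flattened to one row per pixel, and the projection matrices `W_Q`, `W_K`, and `W_V` (all hypothetical names and sizes) map queries and keys to a common dimension d. Only a single head is shown.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(X_pix, T_out, W_Q, W_K, W_V):
    Q = X_pix @ W_Q      # queries from the backbone features, shape (u*v, d)
    K = T_out @ W_K      # keys from the encoder output, shape (n+1, d)
    V = T_out @ W_V      # values from the encoder output, shape (n+1, d_v)
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
u, v, n, d, d_v = 4, 5, 6, 8, 6
z = u * v
X_pix = rng.normal(size=(u * v, n))       # one row per pixel
T_out = rng.normal(size=(n + 1, z))       # class token plus n band tokens
W_Q = rng.normal(size=(n, d)) * 0.1
W_K = rng.normal(size=(z, d)) * 0.1
W_V = rng.normal(size=(z, d_v)) * 0.1

F_spe, weights = cross_attention(X_pix, T_out, W_Q, W_K, W_V)
```

In this sketch the attention matrix has one row per pixel and one column per token, so its size grows with uv(n + 1) rather than with (uv)², which reflects the stated motivation of avoiding dense pixel-to-pixel relationships.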

3.4 Spatial self-attention

As demonstrated in Fig. 4, the backbone network first performs preliminary feature extraction on the input data and then feeds the features X into the self-attention module. We generate the corresponding queries \(\boldsymbol{Q}^{s}\), keys \(\boldsymbol{K}^{s}\) and values \(\boldsymbol{V}^{s}\) by inputting X into three 1 × 1 convolutional layers. The global spatial attention map M is obtained by the calculation of Eq. (13). The attention weights \(\boldsymbol{W}^{s}\) of global spatial pixels are obtained by matrix multiplication of \(\boldsymbol{Q}^{s}\) and the transpose of \(\boldsymbol{K}^{s}\).

$$\begin{aligned} \boldsymbol{W}^{s}_{(i,j)}= \frac{\exp (\boldsymbol{Q}_{i}^{s}\times (\boldsymbol{K}_{j}^{s})^{T})}{\sum_{k=1}^{d_{k}}\exp (\boldsymbol{Q}_{k}^{s}\times (\boldsymbol{K}_{j}^{s})^{T})}, \end{aligned}$$
(12)

where \(\boldsymbol{W}^{s}_{(i,j)}\) is used to evaluate the influence of the i-th pixel on the j-th pixel. The global spatial pixels’ attention weight \(\boldsymbol{W}^{s}\) is further multiplied by \(\boldsymbol{V}^{s}\) to obtain the result:

$$\begin{aligned} \boldsymbol{M}=\boldsymbol{W}^{s}_{(i,j)} \times \boldsymbol{V}^{s} + \boldsymbol{C}_{3}, \end{aligned}$$
(13)

where \(\boldsymbol{C}_{3}\) is the output of the third convolutional layer in the backbone network. We obtain the final global attention feature map by fusing the attention features with the original features.
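A NumPy sketch of Eqs. (12) and (13) is given below. A 1 × 1 convolution is a per-pixel linear map, so it is written as a matrix product on the flattened features. The sketch follows Eq. (12) as printed, normalizing over the first index of \(\boldsymbol{W}^{s}_{(i,j)}\), and assumes that the sum in the denominator runs over all pixels; the weights and sizes are hypothetical.

```python
import numpy as np

def spatial_self_attention(C3, Wq, Wk, Wv):
    u, v, n = C3.shape
    P = C3.reshape(u * v, n)              # one row per pixel
    Q, K, V = P @ Wq, P @ Wk, P @ Wv      # 1 x 1 convolutions as linear maps
    scores = Q @ K.T                      # scores[i, j] = Q_i (K_j)^T
    scores = scores - scores.max(axis=0, keepdims=True)
    # Eq. (12): normalize over the first index i for every fixed j
    W_s = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
    # Eq. (13): attention output plus the residual backbone features C3
    M = W_s @ V + P
    return M.reshape(u, v, n), W_s

rng = np.random.default_rng(0)
u, v, n, d = 4, 5, 6, 3
C3 = rng.normal(size=(u, v, n))
Wq = rng.normal(size=(n, d)) * 0.1
Wk = rng.normal(size=(n, d)) * 0.1
Wv = rng.normal(size=(n, n)) * 0.1        # keeps n channels so the residual can be added

M, W_s = spatial_self_attention(C3, Wq, Wk, Wv)
```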

Figure 4

Illustration of the spatial domain module: self-attention learning and spatial pixel global encoding. \(\boldsymbol{W}^{s}\) refers to the attention weights. X denotes the input features of the spatial branch. FC layers denotes fully connected layers. The superscript “s” distinguishes the matrices of the spatial branch from those of the spectral branch

Then, we use the attention feature map as input to encode the spatial pixels. Let \(m_{i}\) represent the i-th element of M, and define a codebook A, which is indicated by \([a_{1},a_{2},\ldots,a_{k}]\). The codebook is used to learn the visual center through the global spatial attention map M. The normalized residual between M and A is calculated by Eq. (14).

$$\begin{aligned} e_{ij}= \frac{\exp (-s_{j}\times r_{ij})}{\sum_{l=1}^{k}\exp (-s_{l}\times r_{il})} r_{ij}, \end{aligned}$$
(14)

where \(s_{j}\) is the scale factor of the j-th codeword, and \(r_{ij}= \Vert m_{i}-a_{j} \Vert ^{2}\) represents the residuals between the i-th element in M and the j-th codeword in A. Batch normalization activated by ReLU is used to generate global context vector e.

The dimensions of e are converted to the original size with a fully connected layer as follows:

$$\begin{aligned} \hat{\boldsymbol{e}}=s(\boldsymbol{W}_{f} \boldsymbol{e} + \boldsymbol{b}_{f}), \end{aligned}$$
(15)

where \(s(\cdot )\) denotes the sigmoid function, and \(\boldsymbol{W}_{f}\) and \(\boldsymbol{b}_{f}\) are the weight matrix and the bias vector, respectively. Our final spatially enhanced context features \(\boldsymbol{F}_{\mathrm{spa}}\) are derived as

$$\begin{aligned} \boldsymbol{F}_{\mathrm{spa}}=\boldsymbol{M}\hat{ \boldsymbol{e}}+ \boldsymbol{M}. \end{aligned}$$
(16)
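The encoding steps of Eqs. (14) to (16) can be sketched as below. Two points are assumptions rather than statements of the text: the per-pixel terms \(e_{ij}\) are summed over all pixels to form the context vector e, and the batch normalization with ReLU is omitted for brevity. The codebook, scale factors, and layer weights are random placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def context_encoding(M, A, s, W_f, b_f):
    # M: (N, n) attention features, A: (k, n) codebook, s: (k,) scale factors
    r = ((M[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)   # r_ij, shape (N, k)
    logits = -s[None, :] * r
    logits = logits - logits.max(axis=1, keepdims=True)
    assign = np.exp(logits)
    assign = assign / assign.sum(axis=1, keepdims=True)       # softmax over codewords
    e_ij = assign * r                                         # Eq. (14)
    e = e_ij.sum(axis=0)              # assumed aggregation over pixels, shape (k,)
    e_hat = sigmoid(W_f @ e + b_f)    # Eq. (15), one gate per channel, shape (n,)
    F_spa = M * e_hat[None, :] + M    # Eq. (16), channel-wise scaling plus residual
    return F_spa, assign

rng = np.random.default_rng(0)
N, n, k = 20, 6, 3
M = rng.normal(size=(N, n))
A = rng.normal(size=(k, n))
s = np.ones(k)
W_f = rng.normal(size=(n, k)) * 0.1
b_f = np.zeros(n)

F_spa, assign = context_encoding(M, A, s, W_f, b_f)
```

Because the sigmoid output lies between 0 and 1, Eq. (16) rescales each channel of M by a factor between 1 and 2 and never changes its sign.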

3.5 Weighted fusion and classification

Since the impact of the adversarial attack is different for spatial pixels and spectral channels, the feature distributions of the two branches are different. We assign different weights γ to the two modules to perform different degrees of scaling for better integration of the model, as follows:

$$\begin{aligned} \boldsymbol{F}=\gamma \boldsymbol{F}_{\mathrm{spa}}+(1-\gamma ) \boldsymbol{F}_{\mathrm{spe}}, \end{aligned}$$
(17)

where F, \(\boldsymbol{F}_{\mathrm{spa}}\) and \(\boldsymbol{F}_{\mathrm{spe}}\) denote the fusion features, spatial features, and spectral features, respectively. We fuse the features learned from the backbone network and the fusion module by concatenation to implement multi-scale classification. Finally, the classification task is performed by a convolutional layer (kernel size = 1) and a softmax function.
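The fusion in Eq. (17) and the subsequent classification head can be sketched as follows. The convolution with kernel size 1 is written as a per-pixel linear map; the feature sizes, the classifier weights `W_cls` and `b_cls`, and the constant inputs are hypothetical and serve only to show the data flow.

```python
import numpy as np

def weighted_fusion(F_spa, F_spe, gamma):
    # Eq. (17): convex combination of spatial and spectral features
    return gamma * F_spa + (1.0 - gamma) * F_spe

def classify(F, C, W_cls, b_cls):
    # Concatenate backbone features C with fused features F along the channel axis
    feats = np.concatenate([C, F], axis=-1)
    logits = feats @ W_cls + b_cls            # kernel-size-1 convolution per pixel
    logits = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)  # softmax over the classes

rng = np.random.default_rng(0)
u, v, n, m = 4, 5, 6, 9
F_spa = np.ones((u, v, n))
F_spe = np.zeros((u, v, n))
F = weighted_fusion(F_spa, F_spe, gamma=0.1)

C = rng.normal(size=(u, v, n))
W_cls = rng.normal(size=(2 * n, m)) * 0.1
b_cls = np.zeros(m)
probs = classify(F, C, W_cls, b_cls)
```

With γ = 0.1, the spectral branch contributes 90% of the fused feature and the spatial branch 10%.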

4 Experiments

To evaluate the proposed method and compare it with classical and most advanced HSI classification methods (1D-CNN [9], HybridSN [11], SSFCN [12], 3D-CNN [10], SpectralFormer [48], RSEN [13] and SACNet [17]), different adversarial attack methods (FGSM [28], PGD [29] and C&W [30]) are applied to generate adversarial examples on three hyperspectral benchmark datasets: Pavia University (PaviaU), Salinas and Indian Pines.

4.1 Datasets

PaviaU was acquired by the Reflective Optics System Imaging Spectrometer. It consists of 103 bands with a size of 610 × 340, and a total of 42,776 samples were labeled and divided into nine categories. Salinas was captured by the airborne visible/infrared imaging spectrometer (AVIRIS) sensor in Salinas Valley, California. The spatial resolution of this dataset is 3.7 m and the size is 512 × 217. The original dataset has 224 bands; after removing the bands with heavy water absorption, 204 bands remain. Indian Pines was taken by the AVIRIS sensor in Indiana. The size of this dataset is 145 × 145 and it has 224 bands, of which 200 are valid bands. This dataset has a total of 16 land-cover classes. The training and test sets are listed in Table 1.

Table 1 Number of training/test samples in each class for the three datasets

4.2 Evaluation metrics and experimental settings

Overall accuracy (OA), average accuracy (AA), and the kappa coefficient (κ) are commonly used evaluation metrics for HSI classification. The OA is the number of correctly predicted samples divided by the total number of samples. The AA is the sum of the per-category accuracies divided by the number of categories. The kappa coefficient is \(\kappa =\frac{p_{0}-p_{e}}{1-p_{e}}\), where \(p_{0}\) is the proportion of samples correctly classified and \(p_{e}\) is the expected proportion of samples correctly classified by chance.
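The three metrics can be computed from a confusion matrix as in the sketch below. The convention that rows are true classes and columns are predicted classes, as well as the two-class example matrix, are assumptions made for illustration.

```python
import numpy as np

def classification_metrics(conf):
    # conf[i, j]: number of samples of true class i predicted as class j
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    oa = np.trace(conf) / total                       # overall accuracy, p_0
    per_class = np.diag(conf) / conf.sum(axis=1)      # accuracy of each class
    aa = per_class.mean()                             # average accuracy
    # Chance agreement p_e from the row and column marginals
    p_e = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2
    kappa = (oa - p_e) / (1.0 - p_e)
    return oa, aa, kappa

oa, aa, kappa = classification_metrics([[40, 10],
                                        [5, 45]])
```

For this example, OA = 0.85, AA = (0.8 + 0.9)/2 = 0.85, \(p_{e}\) = 0.5, and κ = 0.35/0.5 = 0.7.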

We choose the Adam optimizer with an initial learning rate of \(5\times 10^{-4}\) and a weight decay of \(5\times 10^{-5}\) to train all models. The number of training epochs for each dataset is set to 500. For the fusion coefficient, we test γ = 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9 to verify its effect. As depicted in Fig. 5, the best OA value was attained for all three datasets when \(\gamma = 0.1\). Therefore, the fusion coefficient was fixed at 0.1 in all experiments. The study in this paper is implemented in the PyTorch environment, using an AMD EPYC 7543 32-Core Processor, 90-GB RAM, and an NVIDIA A40 48-GB GPU server. All experiments are repeated 10 times with random initialization, and the results are averaged to reduce experimental error.

Figure 5

The classification performance of the WFSS framework with different fusion coefficients γ on each dataset

4.3 Robustness evaluation and comparison

We select the most representative methods for robustness evaluation and comparison in the field of hyperspectral classification, including 1D-CNN [9], HybridSN [11], SSFCN [12], 3D-CNN [10], SpectralFormer [48], RSEN [13], and state-of-the-art defense method SACNet [17]. We adopt PGD [29] with \(\ell _{\infty}\) to generate adversarial examples, where ϵ is set to 0.04, the iteration number is 10, and the first class is targeted as the attack.

Tables 2, 3 and 4 report the quantitative comparison of the different models against adversarial attacks. In each table, the upper section reports the classification accuracy of each class. In the absence of defense, common hyperspectral image classification frameworks exhibit a surprising vulnerability to adversarial examples: the OA of all advanced classification models drops below 30%. Although the fusion of spectral and spatial features dramatically improves classification performance, it exhibits lower robustness against adversarial attacks. Tables 2 to 4 show that the classification results of 1D-CNN are better than those of 3D-CNN on all three datasets. In addition, HybridSN, which combines 3D-CNN and 2D-CNN, has lower classification accuracy than 1D-CNN on the adversarial examples. This phenomenon suggests that the influence of adversarial perturbation may be more significant for spectral features than for spatial features. For the PaviaU dataset, the OA scores of 1D-CNN, 3D-CNN, and HybridSN are 23.40%, 17.50%, and 22.48%, respectively. We also find that other state-of-the-art classification models are susceptible to adversarial perturbations. For example, SSFCN achieves OA scores of only 22.46% and 18.51% on PaviaU and Indian Pines, respectively, and only 8.61% on the Salinas dataset. SpectralFormer uses a state-of-the-art transformer structure for HSI classification, but it is still not robust enough to withstand adversarial attacks, which indicates that simply adopting a transformer does not by itself provide robustness. Furthermore, we evaluate and compare the WFSS framework with the current most advanced defense method SACNet, as summarized in Table 6. Although SACNet achieves superior defense performance under FGSM attacks using spatial context coding, it still fails to achieve satisfactory results under strong attacks such as PGD.
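The two indicators used in these tables, per-class accuracy and overall accuracy (OA), can be computed as in the short sketch below. The function name and the toy labels are hypothetical and serve only to illustrate the definitions.

```python
import numpy as np

def per_class_and_overall_accuracy(y_true, y_pred, num_classes):
    """Return (per-class accuracy array, overall accuracy).

    Per-class accuracy is the fraction of samples of each true class that
    are predicted correctly; OA is the fraction of all samples predicted
    correctly. Classes with no samples are reported as NaN.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = np.full(num_classes, np.nan)
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():
            per_class[c] = np.mean(y_pred[mask] == c)
    oa = float(np.mean(y_pred == y_true))
    return per_class, oa

# Toy labels: a targeted attack pushing some samples toward class 0.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 0, 2]
acc, oa = per_class_and_overall_accuracy(y_true, y_pred, num_classes=3)
print(acc, oa)
```

Note that under an attack targeting the first class, the per-class accuracy of that class can stay high even when OA collapses, which is why both indicators are reported.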

Table 2 Robustness evaluation results of the PaviaU dataset on the adversarial examples generated by the PGD [29] attack with \(\epsilon =0.04\). Best results are marked in bold
Table 3 Robustness evaluation results of the Salinas dataset on the adversarial examples generated by the PGD [29] with \(\epsilon =0.04\). Best results are marked in bold
Table 4 Robustness evaluation results of the Indian Pines dataset on the adversarial examples generated by the PGD [29] with \(\epsilon =0.04\). Best results are marked in bold

The WFSS performs better than SACNet under FGSM attacks and achieves an OA score close to 90% against PGD and C&W. Specifically, it obtains OA values of 90.97% and 95.09% on the PaviaU and Salinas datasets, respectively, under the PGD attack. WFSS yields an OA score of 88.62% on the Indian Pines dataset, which is significantly better than existing state-of-the-art defense methods. In conclusion, the WFSS framework can significantly enhance robustness against adversarial attacks by combining global spectral dependence with the global context relationships among spatial pixels.

We further visualize the classification maps in Figs. 6 to 8 to demonstrate the classification results of different models under adversarial attacks. We find that the most advanced classification models misclassify most of the samples. For the PaviaU dataset, the targeted attack successfully pushes most samples into the category “Asphalt”. Some samples in the “grass” category are misclassified as the “trees” category, probably because the spectral reflectance properties of meadows and trees are similar. The most advanced defense method, SACNet, successfully defends against some attacks, but there is still a noticeable gap between the classification map of SACNet and the ground truth map. In contrast, our WFSS achieves satisfactory classification results, with almost no difference between its classification map and the ground truth map.

Figure 6

Classification maps for the PaviaU dataset on the adversarial examples generated by the PGD [29] with ϵ = 0.04. (a) Ground truth. (b) 1D-CNN [9]. (c) HybridSN [11]. (d) SSFCN [12]. (e) 3D-CNN [10]. (f) SpectralFormer [48]. (g) RSEN [13]. (h) SACNet [17]. (i) WFSS (ours)

Figure 7

Classification maps for the Salinas dataset on the adversarial examples generated by the PGD [29] with ϵ = 0.04. (a) Ground truth. (b) 1D-CNN [9]. (c) HybridSN [11]. (d) SSFCN [12]. (e) 3D-CNN [10]. (f) SpectralFormer [48]. (g) RSEN [13]. (h) SACNet [17]. (i) WFSS (ours)

Figure 8

Classification maps for the Indian Pines dataset on the adversarial examples generated by the PGD [29] with ϵ = 0.04. (a) Ground truth. (b) 1D-CNN [9]. (c) HybridSN [11]. (d) SSFCN [12]. (e) 3D-CNN [10]. (f) SpectralFormer [48]. (g) RSEN [13]. (h) SACNet [17]. (i) WFSS (ours)

4.4 Ablation study on different modules

We first explore the influence of different perturbation intensities on model classification. We let ϵ take 12 different values in [0, 1] to generate the adversarial perturbations under the PGD [29] attack. The OA scores of different models are illustrated in Fig. 9. The OA scores of all models decrease as the perturbation intensity increases, and the OA scores of most models drop below 30% once ϵ exceeds 0.04. Even the state-of-the-art defense method SACNet has an OA score below 20% at the maximum perturbation intensity. In contrast, the WFSS achieves the best classification results at all perturbation intensities, and its OA score is still greater than 30% at the maximum perturbation intensity.

Figure 9

Overall accuracy of different models on the adversarial examples generated by PGD [29] with different values of ϵ. (a) PaviaU dataset. (b) Salinas dataset. (c) Indian Pines dataset

We further explore the influences of different modules in the proposed WFSS on the classification performance and robustness of the model. In Table 5, SA indicates the spatial self-attention module, and ST denotes the spectral transformer module. Table 5 shows that the ST module has stronger robustness than the SA module. For the Salinas dataset, the SA module can increase the OA score of the model from 17.91% to 61.13%, while the ST module enables an OA score of 82.35%. This suggests that spectral information significantly contributes to improving network robustness. The best results can be achieved for all three datasets by combining the SA module and the ST module.

Table 5 Classification and defense results of each module in the WFSS framework on the adversarial examples generated by PGD [29] with \(\epsilon =0.04\). Best results are marked in bold

We perform experimental evaluation under different attack methods, including FGSM [28] and C&W [30], to demonstrate the effectiveness of the WFSS framework. All experimental settings are the same as for PGD, and the ϵ of FGSM is set to 0.04. The classification results of the three datasets are presented in Table 6. The state-of-the-art defense method SACNet defends well against the FGSM attack but is ineffective against PGD and C&W (Table 6). For instance, the OA score of SACNet under the FGSM attack on the Salinas dataset is 91.65%, but it decreases to 67.94% and 66.82% under the PGD and C&W attacks, respectively. In contrast, the WFSS framework can still achieve approximately 90% OA on all datasets against PGD and C&W attacks.
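For contrast with the iterative PGD attack, FGSM takes a single signed-gradient step of size ϵ. The sketch below shows a targeted variant with ϵ = 0.04 on a toy linear softmax classifier; the classifier, the function names, and the [0, 1] input range are assumptions made for illustration, not details from the paper.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fgsm_targeted(x, W, b, target, eps=0.04):
    """One-step targeted FGSM on a toy linear classifier (logits = x @ W + b).

    The input is moved by eps in the direction that lowers the
    cross-entropy toward `target`, then clipped to the assumed [0, 1] range.
    """
    n, k = x.shape[0], W.shape[1]
    onehot = np.zeros((n, k))
    onehot[:, target] = 1.0
    p = softmax(x @ W + b)
    grad = (p - onehot) @ W.T  # d(loss toward target) / d(input)
    return np.clip(x - eps * np.sign(grad), 0.0, 1.0)

# Synthetic demonstration.
rng = np.random.default_rng(1)
x_demo = rng.uniform(0.0, 1.0, size=(5, 12))
W_demo = rng.normal(size=(12, 3))
x_fgsm = fgsm_targeted(x_demo, W_demo, np.zeros(3), target=0)
print("max |perturbation|:", np.abs(x_fgsm - x_demo).max())
```

Because FGSM uses only one gradient evaluation, it is generally a weaker attack than PGD at the same ϵ, which is consistent with a defense holding up under FGSM while degrading under PGD.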

Table 6 Robustness evaluation results of all datasets on the adversarial examples with different attack methods. Best results are marked in bold

In addition to robustness against adversarial attacks, classification performance on clean examples is a critical metric. The proposed WFSS achieves more than 96% OA on all datasets: it attains the highest accuracy on the Salinas dataset and is second only to RSEN on the PaviaU and Indian Pines datasets. These results show that the WFSS achieves competitive results on clean images while remaining resistant to adversarial examples.

4.5 Effect of adversarial attacks on HSI classification models

We plot spectral curves of the clean images, the adversarial examples, and the adversarial perturbations to explore how adversarial attacks affect HSI classification models, as presented in Fig. 10. Although human vision cannot discern the difference between adversarial examples and clean examples in image form, changes in the spectral curve are readily perceived by the human eye. Figure 10 shows that the adversarial perturbations make the spectral curves of the different samples more dispersed.

Figure 10

Spectral curves of clean images (green), adversarial examples (blue), and adversarial perturbations (red) in different categories of the Salinas dataset with ϵ = 0.4. (a)-(c) The first class (Brocoli-green-weeds-1). (d)-(f) The fifth class (Fallow-smooth). (g)-(i) The eighth class (grapes-untrained)

Take the grapes-untrained class in the third row of Fig. 10 as an example. Between bands 40 and 100, the reflectance of the clean examples shows an upward trend (Fig. 10(g)), whereas the adversarial perturbations show a downward trend (Fig. 10(i)). Between bands 100 and 125, the clean examples first decline and then rise (Fig. 10(g)), while the adversarial perturbations do the opposite, first rising and then falling (Fig. 10(i)). The same opposing pattern can be observed between bands 125 and 155. We speculate that the adversarial attack misleads the classification network because the perturbation alters the spectral curves of the samples. This provides further evidence that the spectral domain has a more significant influence on the robustness of the model.
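One simple way to quantify the opposing trends described above is to compute, within each band window, the Pearson correlation between the clean spectrum and the perturbation; a strongly negative value indicates that the perturbation moves against the clean curve. This is only an illustrative sketch: the function name, the synthetic spectrum, and the use of Pearson correlation are assumptions, not the analysis performed in the paper.

```python
import numpy as np

def windowed_trend_correlation(clean, perturbation, windows):
    """Pearson correlation between a clean spectrum and its perturbation.

    clean, perturbation : 1-D arrays indexed by band
    windows             : iterable of (start, stop) band ranges, stop exclusive
    Returns a dict mapping each window to its correlation coefficient.
    """
    clean = np.asarray(clean, dtype=float)
    perturbation = np.asarray(perturbation, dtype=float)
    results = {}
    for start, stop in windows:
        c = clean[start:stop]
        d = perturbation[start:stop]
        if c.size < 2:
            raise ValueError("each window needs at least two bands")
        results[(start, stop)] = float(np.corrcoef(c, d)[0, 1])
    return results

# Synthetic spectrum and a perturbation built to oppose it (illustration only).
bands = np.arange(200)
clean_demo = 0.3 + 0.1 * np.sin(bands / 15.0)
pert_demo = -0.1 * (clean_demo - clean_demo.mean())
print(windowed_trend_correlation(
    clean_demo, pert_demo, [(40, 100), (100, 125), (125, 155)]))
```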

We perform a correlation analysis using the t-SNE algorithm [53] to explore the influence of the spectral transformer module on the spectral features. t-SNE is a nonlinear dimensionality reduction algorithm that can reduce high-dimensional data to 2 or 3 dimensions for visualization. In Fig. 11, we perform the correlation analysis of the spectral features in high-dimensional space for all samples of the PaviaU dataset. Specifically, the first column indicates the influence of the ST module on the clean example correlation, and the second column represents the influence of the ST module on the adversarial example correlation. When the ST module is absent, the features of all samples lie close together in the high-dimensional space. Figure 11(b) demonstrates that the adversarial attack causes the high-dimensional features to become disordered and crowded, which makes it more difficult for the model to classify the samples. The results in Fig. 11(c) indicate that the ST module makes the feature clusters of different categories more distinct and widens the classification boundaries. It establishes dependencies between spectra, making the data more relevant in high-dimensional space. This helps the model remain robust even when encountering adversarial attacks.
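A visualization of this kind can be produced along the following lines. The sketch assumes that scikit-learn is available and uses its `TSNE` class; the feature matrix is random stand-in data, and the function name, perplexity handling, and seed are illustrative assumptions rather than the settings used for Fig. 11.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_features_2d(features, perplexity=30.0, seed=0):
    """Project high-dimensional features to 2-D with t-SNE for plotting.

    features : (n_samples, n_features) array, e.g. activations taken from
               a network with or without the spectral transformer module.
    """
    features = np.asarray(features, dtype=np.float64)
    # t-SNE requires the perplexity to be smaller than the number of samples.
    perplexity = min(perplexity, (features.shape[0] - 1) / 3.0)
    tsne = TSNE(n_components=2, perplexity=perplexity,
                init="pca", random_state=seed)
    return tsne.fit_transform(features)

# Stand-in data: three Gaussian blobs playing the role of three classes.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 16))
feats = np.vstack([c + rng.normal(size=(20, 16)) for c in centers])
labels = np.repeat(np.arange(3), 20)
embedding = embed_features_2d(feats)
print(embedding.shape)
```

The resulting 2-D points can then be scattered and colored by `labels` to compare how tightly each class clusters on clean and adversarial inputs.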

Figure 11

Visualization results of the spectral features in high-dimensional space based on the t-SNE algorithm in the PaviaU dataset. (a) Without ST module on clean examples. (b) Without ST module on adversarial examples. (c) With ST module on clean examples. (d) With ST module on adversarial examples

5 Conclusion

This study presents the WFSS, a new weighted fusion framework, to tackle the issue of adversarial attacks on HSI classification. By combining spatial and spectral information, the WFSS framework enhances the classification robustness. The extensive experimental results show that the WFSS framework achieves superior performance over state-of-the-art defense methods. Furthermore, we find that the combination of spatial and spectral information can significantly enhance classification robustness. However, the fusion of different modules can result in larger model size and longer runtime, which poses a challenge in achieving lightweight models. Therefore, addressing this issue requires further research and solutions.