WFSS: weighted fusion of spectral transformer and spatial self-attention for robust hyperspectral image classification against adversarial attacks

Tang, Lichun; Yin, Zhaoxia; Su, Hang; Lyu, Wanli; Luo, Bin

doi:10.1007/s44267-024-00038-x


Research
Open access
Published: 28 February 2024

Volume 2, article number 5, (2024)

Visual Intelligence


Lichun Tang¹,
Zhaoxia Yin²,
Hang Su³,
Wanli Lyu¹ &
…
Bin Luo ORCID: orcid.org/0000-0001-5948-5055¹


Abstract

The emergence of adversarial examples poses a significant challenge to hyperspectral image (HSI) classification, as they can attack deep neural network-based models. Recent adversarial defense research tends to establish global connections of spatial pixels to resist adversarial attacks. However, it cannot yield satisfactory results when only spatial pixel information is used. Starting from the premise that the spectral band is equally important for HSI classification, this paper explores the impact of spectral information on model robustness. We aim to discover potential relationships between different spectral bands and establish global connections to resist adversarial attacks. We design a spectral transformer based on the transformer structure to model long-distance dependency relationships among spectral bands. Additionally, we use a self-attention mechanism in the spatial domain to develop global relationships among spatial pixels. Based on the above framework, we further explore the influence of both spectral and spatial domains on the robustness of the model against adversarial attacks. Specifically, a weighted fusion of spectral transformer and spatial self-attention (WFSS) is designed to achieve the multi-scale fusion of spectral and spatial connections, which further improves the model’s robustness. Comprehensive experiments on three benchmarks show that the WFSS framework has superior defensive capabilities compared to state-of-the-art HSI classification methods.


1 Introduction

Satellite remote sensing has made significant breakthroughs in recent years [1], and it is essential to interpret and identify remote sensing images that record information on land features so that they can be effectively used for military and livelihood purposes. Hyperspectral imaging employs the two-dimensional imaging technology to collect spatial information on the object surface and the spectral technology to decompose the total radiation in each pixel into radiation spectra of various bands [2]. The surface material exhibits different radiation profiles at various wavelengths depending on the material composition and lighting conditions. Therefore, the hyperspectral image (HSI) classification technique, which determines the substances represented by hyperspectral pixels and assigns classification labels depending on radiation conditions, is of great significance in meteorology [3], environment protection [4], agriculture [5], and military [6].

The domain of HSI classification has witnessed an unprecedented surge in its efficacy, owing to the advances in deep learning techniques [7–14]. However, the efficacy of these techniques has been challenged by the emergence of various adversarial attacks [15–17]. The seminal work of Szegedy et al. [18] revealed that deep neural networks (DNNs) can be misled by adversarial examples, which are created by minute alterations to the original examples, resulting in incorrect labels being assigned by the DNNs. Subsequently, an increasing number of adversarial attacks and robustness evaluation algorithms were proposed [19–21]. As illustrated in Fig. 1, even a seemingly insignificant modification to the original example can significantly reduce the overall accuracy (OA) of the HSI classification model based on deep learning. Notably, the human visual system is incapable of distinguishing between the adversarial example and the original one. Therefore, adversarial defense strategies for HSI classification are introduced to enhance the robustness of HSI classification models against adversarial examples. These strategies can be broadly divided into two categories: changing inputs [22–27] and changing models [16, 17].

Adversarial training is one of the most commonly used defense strategies, where the generated adversarial examples are trained together with clean examples to enhance model robustness [26]. However, this strategy has two limitations. First, adversarial training cannot augment the intrinsic robustness of a model, and new adversarial examples can still render the model erroneous. Second, adversarial training necessitates additional adversarial examples for training, which requires more computing resources [25]. An alternative strategy involves directly modifying the models to enhance their resistance to adversarial examples. Xu et al. [17] presented a new HSI classification framework SACNet to resist adversarial attacks by establishing global contextual relationships of spatial pixels and achieved state-of-the-art results. However, given the rich spectral features of hyperspectral data, incorporating spectral features in addition to spatial features could potentially lead to even better or optimal performance.

To defend against adversarial attacks, we jointly consider the features of both spectral and spatial domains and propose a weighted fusion of spectral transformer and spatial self-attention (WFSS) to achieve robust HSI classification. In the spectral domain, we establish long-range dependencies between spectral channels using a transformer network. HSI is inherently a sequence-based data structure in the spectral domain, as it contains hundreds of bands. Therefore, it is challenging for a convolutional neural network (CNN) to learn long-range sequential dependence between bands at a distance. The transformer network with self-attention as the core can establish the relationship between every two bands. For the spatial domain, we use a self-attention context network inspired by Ref. [17] to establish the global context relationship of spatial pixels. We share the local loss with the global loss by establishing the respective global relations of spectral bands and spatial pixels, and the influence of attack can be dispersed globally. Therefore, a higher level of perturbation is needed to cause the input to be misclassified. Moreover, we propose a method for the weighted fusion of spectral features and spatial features to further enhance the model’s defense against adversarial attacks. The contributions of this study are summarized as follows.

1) We systematically analyze the features of adversarial examples in HSI classification and find that adversarial attacks have a greater impact on spectral information. The state-of-the-art defense strategies still cannot achieve satisfactory results on several datasets.
2) We propose a new framework named WFSS through the weighted fusion of spectral transformer and spatial self-attention to defend against adversarial attacks. The WFSS framework can learn global information of spatial and spectral domains and construct global connections through self-attention and transformer. We find that the constructed spectral and spatial global connections can enhance the resistance of HSI classification models to adversarial examples.
3) To the best of our knowledge, this is the first work that has introduced the transformer structure in HSI classification adversarial attacks and defenses. We adopt the transformer structures to build long-distance dependencies among different spectral channels to enhance the robustness against adversarial attacks.

The remainder of this paper is organized as follows. Section 2 provides an overview of adversarial attacks and defenses in HSI classification. Section 3 presents the proposed WFSS in detail. Section 4 reports the performance of the proposed method, together with comparison and analysis on three benchmark datasets. Finally, Section 5 concludes the paper and discusses the results and directions for future work.

2 Related works

2.1 Adversarial attacks

We present an overview of some of the most widely used methods for generating adversarial attacks.

Fast gradient sign method (FGSM). Goodfellow et al. [28] presented a gradient-based adversarial example generation method called the FGSM. The fundamental concept is to perturb the input along the sign of the gradient of the loss with respect to the input, which increases the loss value of the model output and consequently makes the network output inaccurate. The expression is as follows:

$$ \boldsymbol{\delta} = \epsilon \cdot \mathrm{sign} \bigl( \nabla _{x} \mathrm{J} ( \boldsymbol{x},\boldsymbol{\theta} ,y_{ \mathrm{true}} ) \bigr), $$

(1)

where δ indicates the perturbation, $\mathrm{sign(\cdot )}$ indicates the sign function, and $\mathrm{J} ( \boldsymbol{x},\boldsymbol{\theta} ,y_{ \mathrm{true}} )$ is the cross-entropy loss function. θ represents the parameters of a model. $\nabla _{x}$ denotes the derivative with respect to x. ϵ is used to limit the size of the added perturbation. When $y_{\mathrm{true}}$ is set to the target class t, the perturbation with target attack is represented by $\boldsymbol{\delta}_{t}$. The process of adversarial example generation is described by Eq. (2).

$$ \textstyle\begin{cases} \boldsymbol{x}^{\mathrm{adv}} = \boldsymbol{x} + \boldsymbol{\delta}, & \mathrm{no\ target}, \\ \boldsymbol{x}^{\mathrm{adv}} = \boldsymbol{x} - \boldsymbol{\delta}_{t}, & \mathrm{target\ on\ \mathit{t}}. \end{cases} $$

(2)
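
As a minimal sketch of Eqs. (1) and (2), the following Python snippet applies one untargeted FGSM step to a toy logistic-regression classifier. The classifier, its parameters `w` and `b`, the input `x`, and all helper names are hypothetical; they are chosen only because the input gradient has a closed form and are not the models evaluated in this paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grad(x, w, b, y_true):
    """Cross-entropy J(x, theta, y_true) of a toy logistic classifier
    and its gradient with respect to the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    loss = -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))
    grad_x = [(p - y_true) * wi for wi in w]  # closed-form dJ/dx
    return loss, grad_x

def sign(v):
    return (v > 0) - (v < 0)

def fgsm_untargeted(x, w, b, y_true, eps):
    _, grad_x = loss_and_grad(x, w, b, y_true)
    delta = [eps * sign(g) for g in grad_x]        # Eq. (1)
    return [xi + di for xi, di in zip(x, delta)]   # Eq. (2), no target

# Hypothetical input and model parameters
x = [0.2, 0.7, 0.5]
w = [1.5, -2.0, 0.5]
b = 0.1
x_adv = fgsm_untargeted(x, w, b, y_true=1, eps=0.04)

clean_loss, _ = loss_and_grad(x, w, b, 1)
adv_loss, _ = loss_and_grad(x_adv, w, b, 1)
print(clean_loss, adv_loss)  # the loss on x_adv is larger than on x
```

Each input component moves by at most ϵ, so the perturbation stays within an $\ell _{\infty}$ ball of radius ϵ around x.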

Projected gradient descent (PGD). PGD [29] obtains the perturbation by performing several iterations on the basis of FGSM. The process of generating adversarial examples by PGD is presented as follows:

$$ \boldsymbol{x}^{\mathrm{adv}}_{i+1} =\mathrm{clip} \bigl(\boldsymbol{x}^{ \mathrm{adv}}_{i} - \epsilon \cdot \mathrm{sign} \bigl( \nabla _{x} \mathrm{J} \bigl( \boldsymbol{x}^{\mathrm{adv}}_{i}, \boldsymbol{\theta} ,t \bigr) \bigr)\bigr), $$

(3)

where $\boldsymbol{x}^{\mathrm{adv}}_{0}$ is the original image x, $\mathrm{sign}(\cdot )$ indicates the sign function, and $\mathrm{J} ( \boldsymbol{x},\boldsymbol{\theta},t )$ is the cross-entropy loss function evaluated against the target class t. ϵ denotes the step size of each iteration.
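
The iteration in Eq. (3) can be sketched on the same kind of toy logistic classifier. This is an illustration under stated assumptions: the clip operation is interpreted here as a projection onto an $\ell _{\infty}$ ball of radius `budget` around the original input followed by clamping to [0, 1], and the separate `step` and `budget` arguments, the model parameters, and the function names are all hypothetical.

```python
import math

def prob_class1(x, w, b):
    """Probability of class 1 under a toy logistic classifier (hypothetical model)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def grad_wrt_input(x, w, b, t):
    """Gradient of the cross-entropy J(x, theta, t) with respect to x."""
    return [(prob_class1(x, w, b) - t) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def clip(v, lo, hi):
    return max(lo, min(hi, v))

def pgd_targeted(x, w, b, t, step, budget, iters):
    x_adv = list(x)  # x_0^adv is the original input
    for _ in range(iters):
        g = grad_wrt_input(x_adv, w, b, t)
        # Descend on the loss of the target class t, as in Eq. (3)
        x_adv = [xa - step * sign(gi) for xa, gi in zip(x_adv, g)]
        # Assumed projection: stay within `budget` of x and inside [0, 1]
        x_adv = [clip(clip(xa, xo - budget, xo + budget), 0.0, 1.0)
                 for xa, xo in zip(x_adv, x)]
    return x_adv

x = [0.2, 0.7, 0.5]
w = [1.5, -2.0, 0.5]
b = 0.1
x_adv = pgd_targeted(x, w, b, t=0, step=0.01, budget=0.04, iters=10)
print(prob_class1(x, w, b), prob_class1(x_adv, w, b))
```

With target class 0, the probability of class 1 decreases after the attack, which means the input has been pushed towards the target.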

Carlini and Wagner’s attack (C&W). Carlini and Wagner [30] introduced C&W as an optimization-based attack method with the objective of minimizing the distance between the adversarial example and the original example. The expression of the formulation is as follows.

$$\begin{aligned} \mathrm{min} \bigl\Vert \boldsymbol{x}-\boldsymbol{x}^{\mathrm{adv}} \bigr\Vert _{2}^{2} + c\cdot \mathrm{f} \bigl( \boldsymbol{x}^{ \mathrm{adv}},t \bigr), \end{aligned}$$

(4)

where $\mathrm{f} (\boldsymbol{x}^{\mathrm{adv}},t )= \mathrm{max}(\mathrm{max}_{i\neq t}\mathrm{Z}(\boldsymbol{x}^{ \mathrm{adv}})_{i}-\mathrm{Z}(\boldsymbol{x}^{\mathrm{adv}})_{t},-k)$ and $\mathrm{Z}(\cdot )$ denotes the logits of the model. The parameter c is a constant that balances the two loss terms, and k is the confidence parameter.
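
To make the objective in Eq. (4) concrete, the sketch below evaluates the margin term f and the full objective for given logits. It only evaluates the objective and does not perform the optimization; the logits, inputs, and function names are hypothetical illustration values.

```python
def cw_margin(logits, t, k=0.0):
    """f(x_adv, t) = max(max_{i != t} Z_i - Z_t, -k)."""
    best_other = max(z for i, z in enumerate(logits) if i != t)
    return max(best_other - logits[t], -k)

def cw_objective(x, x_adv, logits_adv, t, c, k=0.0):
    """Squared L2 distance plus c times the margin term, as in Eq. (4)."""
    dist = sum((a - b) ** 2 for a, b in zip(x, x_adv))
    return dist + c * cw_margin(logits_adv, t, k)

# Target class not yet the top class: the margin is positive
m_fail = cw_margin([2.0, 1.0, 0.5], t=1)
# Target class wins by more than k: the margin saturates at -k
m_done = cw_margin([0.5, 3.0, 1.0], t=1, k=0.5)
obj = cw_objective([0.0, 0.0], [0.1, 0.2], [2.0, 1.0, 0.5], t=1, c=2.0)
print(m_fail, m_done, obj)
```

The margin stops decreasing once the target logit exceeds every other logit by k, so the optimizer is then driven only by the distance term.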

2.2 Robust HSI classifications

Deep learning algorithms are extensively applied to HSI classification tasks [7, 14, 31–34]. Chen et al. [35] performed the first feature extraction and HSI classification task with deep learning algorithms. Subsequently, they performed classification by stacked autoencoders and spatially dominated information [7]. Recurrent neural networks are used in HSI analysis because of their powerful sequence data learning capabilities [36–38].

CNN-based methods. The aforementioned studies place greater emphasis on the spectral information of HSI, while studies have demonstrated that spatial features can effectively improve classification accuracy [39]. CNNs are utilized to learn the spatial texture features of HSIs because of their excellent feature representation capabilities [9, 40, 41]. To further fuse the spectral information, three-dimensional (3D) CNNs are introduced [10, 42, 43]. However, 3D-CNNs contain a very large number of parameters and therefore suffer from high model complexity and computational cost. Therefore, Roy et al. [11] presented HybridSN to reduce the complexity of the model. Xu et al. [12] reduced redundant calculations by processing the entire hyperspectral cube rather than patches. Then, they further addressed the difficulty of collecting pixel-level labels for HSIs and introduced a robust self-embedding network (RSEN) to achieve competitive classification results with few training samples [13].

Defense of HSI classification. However, the emergence of adversarial examples has exposed the vulnerabilities of deep learning models [26]. Park et al. [16] randomly sampled all bands of each pixel in HSI to reduce the strength of adversarial perturbation. At the same time, the spectral shape features that are robust to adversarial attacks are first extracted and then encoded to enhance the robustness of the model. Most networks focus only on local features of the data and ignore the connections between global information. Xu et al. [17] determined the global relationships between HSI pixels in the spatial domain through the self-attention mechanism and encoded them in context. They can share the loss of a given pixel over all relevant pixels globally; thus it is not easy to be attacked. However, the rich spectral features play an irreplaceable role in HSI classification. Therefore, it is worthwhile to further investigate the influence of spectral features on the robustness of HSI classification models under adversarial attacks.

2.3 Transformer on HSI

Vision transformers [44] have recently achieved great success in computer vision. The vision transformers efficiently learn long-range interactions between sequential data through a self-attention mechanism [45, 46].

Pure transformer. He et al. [47] performed the first HSI classification task with transformer and presented HSI-BERT. The proposed method learns the global dependence between pixels using a multi-head self-attention mechanism. Hong et al. [48] noted that CNNs have limitations in learning spectral features. They presented a new network called SpectralFormer to learn local spectral sequence features from adjacent bands of HSIs. Zhong et al. [49] introduced a spectral–spatial transformer network to solve the fixed geometric structure of the convolution kernel.

Hybrid CNN-transformer. Recent studies have demonstrated that combining CNNs and transformers can better learn the local and global features of HSIs. Zhao et al. [50] presented a convolutional transformer network to fuse spectral information and pixel positions using center position encoding. Sun et al. [51] used the transformer structure to capture the deep semantic features of HSIs. First, convolutional operations are used to obtain shallow features. Then, the features are turned into semantic tokens, which are fed into the transformer encoder for advanced semantic feature modeling. The powerful long-range interaction modeling capabilities of transformers provide inspiration for establishing spectral robustness. We treat each band of the HSI as a sequence element input into the transformer to establish a long-range dependency between bands.

3 Proposed method

We propose a novel framework named WFSS to establish the long-range dependence between spectral bands and global context connections between spatial pixels, as illustrated in Fig. 2. WFSS consists of four modules. First, HSIs are fed into the backbone network, which uses convolutional layers and average pooling layers for feature extraction and dimensionality reduction. Subsequently, the extracted features are further sent to the spectral transformer module and the spatial self-attention module to establish two global relationships in spectral and spatial dimensions, respectively. These two dependencies are weighted and fused by the fusion module. Finally, the features learned from the backbone network and the fusion module are concatenated to implement multi-scale classification. In this section, we discuss the proposed WFSS in detail.

3.1 Overview formulation

Adversarial attack causes the model to misclassify images by adding a perturbation to each band of the HSI. This can be defended by limiting the impact of the adversarial perturbation on the bands. Global connections are established for all bands to disperse the impact of the adversarial perturbation on a given band to other relevant bands. Assume that the impact of the adversarial attack on the i-th band is $L_{i}$. The problem is formulated as

$$\begin{aligned} L_{i}=\sum^{c}_{j=1} \alpha _{ji} L_{ji}, \end{aligned}$$

(5)

where $\alpha _{ji}$ is the correlation weight of the i-th and j-th bands. $L_{ji}$ denotes the impact of being dispersed to the j-th band, where $j=i$ indicates the true impact of the perturbation on the i-th band.

What to correct? Recent HSI adversarial defense researchers have shifted their attention from the local to the global level [17]. While the state-of-the-art method can capture the global context information, it is still limited to the spatial domain. The rich spectral information unique to HSIs is ignored, making it impossible for existing defense methods to achieve satisfactory results at higher levels of perturbation. Therefore, combining spatial and spectral information would be more reasonable to enhance the robustness against adversarial attacks.

Optimization. Let $\mathcal{L}_{\mathrm{cls}}$ be the loss of the classification network. Suppose $Y^{\prime }$ and Y are the predicted label and the true label, respectively. The cross-entropy loss $\mathcal{L}_{\mathrm{cls}}$ is formulated as

$$\begin{aligned} \mathcal{L}_{\mathrm{cls}}=-\frac{1}{hw}\sum _{i=1}^{h}\sum _{j=1}^{w} \sum_{k=1}^{m}Y_{(i,j,k)} \log \bigl(Y^{\prime }_{(i,j,k)}\bigr), \end{aligned}$$

(6)

where m indicates the number of categories, and h and w represent the height and width of the hyperspectral data, respectively. Let $\boldsymbol{x}^{\mathrm{adv}}$ denote the adversarial examples that cause the model to misclassify images, where $\boldsymbol{x}^{\mathrm{adv}}= \arg \max\mathcal{L}_{ \mathrm{adv}}(\boldsymbol{x}^{\mathrm{adv}},y;\boldsymbol{\theta})$. The adversarial attack against the HSI classification task targets all pixels rather than individual images. Therefore, the adversarial loss in this paper is the sum of the losses across all pixels.
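
The pixel-wise cross-entropy of Eq. (6) can be written out directly. The sketch below uses nested Python lists of shape h × w × m with one-hot labels; the tiny 1 × 2 example map and the function name are hypothetical.

```python
import math

def pixelwise_cross_entropy(y_true, y_pred):
    """Eq. (6): mean over all h*w pixels of -sum_k Y log(Y')."""
    h, w = len(y_true), len(y_true[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            for y_k, p_k in zip(y_true[i][j], y_pred[i][j]):
                if y_k > 0:  # zero-label terms contribute nothing; skipping avoids log(0)
                    total += y_k * math.log(p_k)
    return -total / (h * w)

# A 1 x 2 map with m = 2 classes (hypothetical values)
y_true = [[[1, 0], [0, 1]]]
y_pred = [[[0.5, 0.5], [0.25, 0.75]]]
loss = pixelwise_cross_entropy(y_true, y_pred)
print(loss)
```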

3.2 Backbone network

We input the entire hyperspectral cube into the network to avoid redundant calculations due to a large amount of overlap among adjacent patches. There are three convolutional layers and two average pooling layers in the backbone network. Convolution operations perform preliminary feature extraction by converting the original data from high-dimensional to low-dimensional space. The entire hyperspectral cube is used as input, increasing the feature’s spatial dimensions. Pooling operations are used to reduce the size of the feature map, thus reducing the computational burden and the number of parameters.

Let $\boldsymbol{I} \in \mathbb{R}^{h\times w\times b}$ indicate the input HSI, where h, w, and b represent the height, width, and band number in the input cube, respectively. The outputs of the three convolution layers are presented below:

$$ \boldsymbol{C}_{i}=f(\boldsymbol{W}_{i} \times \boldsymbol{I}+ \boldsymbol{b}_{i}), $$

(7)

where $\boldsymbol{C}_{i} (i=1,2,3)$ is the feature of the i-th convolutional layer, $\boldsymbol{W}_{i}$ denotes the weight matrix and $\boldsymbol{b}_{i}$ is the bias vector. $f(x)=\max(0,x)$ indicates the rectified linear units (ReLU) function. The size of the convolution kernel is 3 × 3 in this study. In addition, the dilated convolution can expand the receptive field by using different dilation rates [52]. This allows features to be extracted from a larger receptive field without increasing the computational load. The three convolutional layers extract features from different receptive fields with different dilation rates.
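
The effect of dilation on the receptive field can be checked with simple arithmetic. A 3 × 3 kernel with dilation rate d covers an effective window of size 3 + 2(d − 1), while the number of weights stays at nine per filter. The sketch below stacks three stride-1 layers and ignores the pooling layers; the dilation rates (1, 2, 3) are hypothetical, since the rates are not listed in this section.

```python
def effective_kernel(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def receptive_field(dilations, kernel_size=3):
    """Receptive field of stacked stride-1 convolutions (pooling ignored)."""
    rf = 1
    for d in dilations:
        rf += effective_kernel(kernel_size, d) - 1
    return rf

def weights_per_filter(kernel_size=3):
    """Weights per input channel; independent of the dilation rate."""
    return kernel_size * kernel_size

rf_dilated = receptive_field((1, 2, 3))  # hypothetical dilation rates
rf_plain = receptive_field((1, 1, 1))    # same layers without dilation
print(rf_dilated, rf_plain, weights_per_filter())
```

Under these assumptions the dilated stack sees a 13 × 13 window compared with 7 × 7 for the undilated stack, with the same number of weights.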

3.3 Spectral transformer

The transformer achieves the most advanced performance in natural language processing and computer vision. It can effectively establish long-range dependencies between sequence data.

As illustrated in Fig. 2, the output features X of the backbone network are input into the spectral transformer (ST) module. In the first step, the input features are flattened into $\boldsymbol{T} \in \mathbb{R}^{n\times uv}$, where n is the number of bands, u is the height, and v is the width. The input token is represented by $\boldsymbol{T} \in \mathbb{R}^{n\times z}$, where n and z denote the number and length of tokens, respectively.

The token is indicated by $[T_{1},T_{2},\ldots,T_{n}]$. We concatenate a learnable parameter class token $T_{*}$ with T, which is used for the classification task. Then, a position encoder (PE) is added to each token to mark the position information. The resulting input token is given by

$$ \boldsymbol{T}_{\mathrm{in}}=[T_{*},T_{1},T_{2}, \ldots,T_{n}]+ \boldsymbol{PE}. $$

(8)
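
The token construction of Eq. (8) can be sketched as follows: each band of the feature map is flattened into one token of length z = uv, a class token is prepended, and a position encoding is added element-wise. In the paper the class token and the position encoding are learnable; here they are fixed hypothetical values, and the function name is illustrative.

```python
def build_input_tokens(features, class_token, pos_enc):
    """features: n bands, each a u x v grid. Returns (n + 1) tokens of length u*v."""
    # Flatten every band into a single token T_1 ... T_n
    tokens = [[val for row in band for val in row] for band in features]
    # Prepend the class token T_*
    sequence = [list(class_token)] + tokens
    # Add the position encoding to every token
    return [[t + p for t, p in zip(tok, pe)] for tok, pe in zip(sequence, pos_enc)]

# n = 2 bands, u = v = 2 (hypothetical values)
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[5.0, 6.0], [7.0, 8.0]]]
class_token = [0.0, 0.0, 0.0, 0.0]
pos_enc = [[0.5] * 4, [0.25] * 4, [0.125] * 4]
t_in = build_input_tokens(features, class_token, pos_enc)
print(len(t_in), len(t_in[0]))
```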

A transformer encoder (TE) is applied to learn the deep relationship between tokens as displayed in Fig. 3. It consists of a multi-head self-attention (MSA) block, a multi-layer perceptron (MLP) layer, and two normalization layers.

The core of the MSA module is the self-attention (SA) mechanism. We define three learnable weight matrices $\boldsymbol{W}_{Q}$, $\boldsymbol{W}_{K}$, and $\boldsymbol{W}_{V}$ to linearly map the token to the three matrices (queries Q, keys K, values V) as

$$\begin{aligned} &\boldsymbol{Q}=\boldsymbol{T}_{\mathrm{in}} \boldsymbol{W}_{Q}, \\ &\boldsymbol{K}=\boldsymbol{T}_{\mathrm{in}}\boldsymbol{W}_{K}, \\ &\boldsymbol{V}=\boldsymbol{T}_{\mathrm{in}}\boldsymbol{W}_{V}. \end{aligned}$$

(9)

Attention scores are obtained by dot product of the matrices Q and K. Attention weights are calculated by normalization operations. SA is expressed as follows:

$$\begin{aligned} &\mathrm{SA}=\mathrm{Attention}(\boldsymbol{Q},\boldsymbol{K}, \boldsymbol{V})=\sigma \biggl( \frac{\boldsymbol{Q}\boldsymbol{K}^{T}}{\sqrt{d_{k}}}\biggr)\boldsymbol{V}, \\ &\sigma (x)_{i}=\frac{\exp (x_{i})}{\sum_{j=1}\exp (x_{j})}, \end{aligned}$$

(10)

where $\sigma (x)_{i}$ indicates the softmax function, and $d_{k}$ is the dimension of K.
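
Equation (10) can be illustrated with a small pure-Python implementation of scaled dot-product attention. The matrices below are hypothetical, and the learnable projections of Eq. (9) are omitted, so Q, K, and V are passed in directly.

```python
import math

def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, returning the output and the attention weights."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out_self, w_self = attention(k, k, v)         # queries equal to keys
out_flat, w_flat = attention([[0.0, 0.0]], k, v)  # zero query gives uniform weights
print(w_self, out_flat)
```

A zero query produces equal scores for every key, so its output is the plain average of the value rows. This is the sense in which each token's representation is spread over all other tokens.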

MSA is a particular SA that simultaneously performs multiple independent attention heads, first concatenating the output of each head and then projecting the final result. Multi-head self-attention can obtain information about different features in different positions. Formally,

$$\begin{aligned} &\mathrm{MSA}(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V})= \mathrm{Concat}(\mathrm{SA_{1},\ldots,SA_{h}}) \boldsymbol{W}^{o}, \\ &\mathrm{SA_{i}=Attention} \bigl(\boldsymbol{T}_{\mathrm{in}} \boldsymbol{W}_{i}^{Q},\boldsymbol{T}_{\mathrm{in}} \boldsymbol{W}_{i}^{K}, \boldsymbol{T}_{\mathrm{in}} \boldsymbol{W}_{i}^{V}\bigr), \end{aligned}$$

(11)

where $\boldsymbol{W}_{i}^{Q},\boldsymbol{W}_{i}^{K},\boldsymbol{W}_{i}^{V} \in \mathbb{R}^{z\times d_{k}}$, $\boldsymbol{W}^{o}\in \mathbb{R}^{h \times d_{k}\times z}$ is a linear projection matrix and h is the number of heads.

The MLP block consists of two fully connected layers and a GELU activation function. There are also two layers of dropout to prevent overfitting. The output $\boldsymbol{T}_{\mathrm{out}}$ of TE is equal in size to the input $\boldsymbol{T}_{\mathrm{in}}$.

$\boldsymbol{T}_{\mathrm{out}}$ is decoded by the transformer decoder (TD) module in order to fuse spectral features with spatial features. To better fuse the global spectral features with the spatial pixel-level features, we need to map the output of the encoder back to the pixel space. To this end, we use an improved decoder to refine the image features in each spectral band. The difference between TD and TE is that the MSA is replaced with multi-head cross attention (MCA). The queries, keys, and values in TE are from the same input data. In TD, however, the queries are the output of the backbone network, and the keys and values are the output of the TE. We need to combine different sequence inputs using MCA to avoid extensive calculations on the dense relationships between pixels in features X. We will obtain the final spectral global relationship features $\boldsymbol{F}_{\mathrm{spe}}$.
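
The difference between self-attention and the cross attention used in the decoder can be sketched as follows. Queries come from one sequence (standing in for backbone features) while keys and values come from another (standing in for encoder output tokens). This is a simplified single-head illustration without learnable projections; all values and names are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, memory):
    """Each query attends to the memory tokens; keys and values are both `memory`."""
    d_k = len(memory[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d_k) for key in memory]
        weights = softmax(scores)
        out.append([sum(wt * tok[c] for wt, tok in zip(weights, memory))
                    for c in range(d_k)])
    return out

pixel_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 queries
encoder_tokens = [[2.0, 0.0], [0.0, 2.0]]              # 2 keys/values
decoded = cross_attention(pixel_features, encoder_tokens)
print(decoded)
```

The output has one row per query, so the result is mapped back to the query resolution, and the score matrix has size (number of queries) × (number of encoder tokens) rather than growing quadratically with the number of pixels.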

3.4 Spatial self-attention

As demonstrated in Fig. 4, the backbone network first performs preliminary feature extraction on the input data and then feeds the features X into the self-attention module. We generate the corresponding queries $\boldsymbol{Q}^{s}$, keys $\boldsymbol{K}^{s}$ and values $\boldsymbol{V}^{s}$ by inputting X into three 1 × 1 convolutional layers. The global spatial attention map M is obtained by the calculation of Eq. (13). The attention weights $\boldsymbol{W}^{s}$ of global spatial pixels are obtained by matrix multiplication of $\boldsymbol{Q}^{s}$ and the transpose of $\boldsymbol{K}^{s}$.

$$\begin{aligned} \boldsymbol{W}^{s}_{(i,j)}= \frac{\exp (\boldsymbol{Q}_{i}^{s}\times (\boldsymbol{K}_{j}^{s})^{T})}{\sum_{k=1}^{d_{k}}\exp (\boldsymbol{Q}_{k}^{s}\times (\boldsymbol{K}_{j}^{s})^{T})}, \end{aligned}$$

(12)

where $\boldsymbol{W}^{s}_{(i,j)}$ is used to evaluate the influence of the i-th pixel on the j-th pixel. The global spatial pixels’ attention weight $\boldsymbol{W}^{s}$ is further multiplied by $\boldsymbol{V}^{s}$ to obtain the result:

$$\begin{aligned} \boldsymbol{M}=\boldsymbol{W}^{s}_{(i,j)} \times \boldsymbol{V}^{s} + \boldsymbol{C}_{3}, \end{aligned}$$

(13)

where $\boldsymbol{C}_{3}$ is the output of the third convolutional layer in the backbone network. We obtain the final global attention feature map by fusing the attention features with the original features.

Then, we use the attention feature map as input to encode the spatial pixels. Let $m_{i}$ represent the i-th element of M, and define a codebook A, which is indicated by $[a_{1},a_{2},\ldots,a_{k}]$. The codebook is used to learn the visual center through the global spatial attention map M. The normalized residual between M and A is calculated by Eq. (14).

$$\begin{aligned} e_{ij}= \frac{\exp (-s_{j}\times r_{ij})}{\sum_{l=1}^{k}\exp (-s_{l}\times r_{il})} r_{ij}, \end{aligned}$$

(14)

where $s_{j}$ is the scale factor of the j-th codeword, and $r_{ij}= \Vert m_{i}-a_{j} \Vert ^{2}$ represents the residuals between the i-th element in M and the j-th codeword in A. Batch normalization activated by ReLU is used to generate global context vector e.
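
Equation (14) can be sketched as a soft assignment of each feature to the codewords, weighted by the squared residual. The snippet computes only the terms $e_{ij}$; the subsequent batch normalization and ReLU are omitted, and the features, codebook, and scale factors are hypothetical.

```python
import math

def encode(features, codebook, scales):
    """Return e[i][j] = softmax_j(-s_j * r_ij) * r_ij with r_ij = ||m_i - a_j||^2."""
    encoded = []
    for m_i in features:
        r = [sum((x - a) ** 2 for x, a in zip(m_i, a_j)) for a_j in codebook]
        logits = [-s * r_ij for s, r_ij in zip(scales, r)]
        top = max(logits)  # stabilize the exponentials
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        encoded.append([e / total * r_ij for e, r_ij in zip(exps, r)])
    return encoded

features = [[0.0, 0.0]]              # one element m_1 of M
codebook = [[0.0, 0.0], [1.0, 0.0]]  # codewords a_1, a_2
scales = [1.0, 1.0]                  # scale factors s_1, s_2
e = encode(features, codebook, scales)
print(e)
```

A feature that coincides with a codeword has zero residual for that codeword, so the corresponding entry vanishes, while more distant codewords receive exponentially smaller assignment weights.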

The dimensions of e are converted to the original size with a fully connected layer as follows:

$$\begin{aligned} \hat{\boldsymbol{e}}=s(\boldsymbol{W}_{f} \boldsymbol{e} + \boldsymbol{b}_{f}), \end{aligned}$$

(15)

where $s(\cdot )$ denotes the sigmoid function, and $\boldsymbol{W}_{f}$ and $\boldsymbol{b}_{f}$ are the weight matrix and the bias vector, respectively. Our final spatially enhanced context features $\boldsymbol{F}_{\mathrm{spa}}$ are derived as

$$\begin{aligned} \boldsymbol{F}_{\mathrm{spa}}=\boldsymbol{M}\hat{ \boldsymbol{e}}+ \boldsymbol{M}. \end{aligned}$$

(16)

3.5 Weighted fusion and classification

Since the impact of the adversarial attack is different for spatial pixels and spectral channels, the feature distributions of the two branches are different. We assign different weights γ to the two modules to perform different degrees of scaling for better integration of the model, as follows:

$$\begin{aligned} \boldsymbol{F}=\gamma \boldsymbol{F}_{\mathrm{spa}}+(1-\gamma ) \boldsymbol{F}_{\mathrm{spe}}, \end{aligned}$$

(17)

where F, $\boldsymbol{F}_{\mathrm{spa}}$ and $\boldsymbol{F}_{\mathrm{spe}}$ denote the fusion features, spatial features, and spectral features, respectively. We fuse the features learned from the backbone network and the fusion module by concatenation to implement multi-scale classification. Finally, the classification task is performed by a convolutional layer (kernel size = 1) and a softmax function.
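
As a small worked example of Eq. (17), the snippet below fuses two flattened feature vectors element-wise. The feature values and the function name are hypothetical. Note that γ scales the spatial branch, so a small γ places most of the weight on the spectral branch.

```python
def weighted_fusion(f_spa, f_spe, gamma):
    """F = gamma * F_spa + (1 - gamma) * F_spe, applied element-wise."""
    return [gamma * a + (1.0 - gamma) * b for a, b in zip(f_spa, f_spe)]

f_spa = [1.0, 2.0]  # hypothetical spatial features
f_spe = [3.0, 4.0]  # hypothetical spectral features
fused = weighted_fusion(f_spa, f_spe, gamma=0.1)
print(fused)  # approximately [2.8, 3.8]
```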

4 Experiments

To evaluate the proposed method and compare it with classical and most advanced HSI classification methods (1D-CNN [9], HybridSN [11], SSFCN [12], 3D-CNN [10], SpectralFormer [48], RSEN [13] and SACNet [17]), different adversarial attack methods (FGSM [28], PGD [29] and C&W [30]) are applied to generate adversarial examples on three hyperspectral benchmark datasets: Pavia University (PaviaU), Salinas and Indian Pines.^{Footnote 1}

4.1 Datasets

PaviaU was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS). It consists of 103 bands with a size of 610 × 340, and a total of 42,776 samples were labeled and divided into nine categories. Salinas was captured by the airborne visible/infrared imaging spectrometer (AVIRIS) sensor in Salinas Valley, California. The spatial resolution of this dataset is 3.7 m and the size is 512 × 217. The original dataset has 224 bands; after removing the bands with heavy water absorption, 204 bands remain. Indian Pines was taken by the AVIRIS sensor in Indiana. The size of this dataset is 145 × 145 and it has 224 bands, of which 200 are valid bands. This dataset has a total of 16 land-cover classes. The training and test sets are listed in Table 1.

Table 1 Number of training/test samples in each class for the three datasets


4.2 Evaluation metrics and experimental settings

Overall accuracy (OA), average accuracy (AA), and kappa coefficient (κ) are commonly used evaluation metrics for HSI classification. The OA is calculated as the number of correctly predicted samples divided by the total number of samples. The AA is calculated as the sum of the per-category accuracies divided by the number of categories. $\kappa =\frac{p_{0}-p_{e}}{1-p_{e}}$, where $p_{0}$ is the proportion of samples correctly classified and $p_{e}$ is the expected proportion of samples correctly classified by chance.
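
The three metrics can be computed from a confusion matrix as sketched below. The sketch assumes that rows index the true classes, that columns index the predicted classes, and that every class has at least one sample; the 2 × 2 matrix is a hypothetical example.

```python
def classification_metrics(conf):
    """Return (OA, AA, kappa) from a square confusion matrix."""
    k = len(conf)
    n = sum(sum(row) for row in conf)
    row_sums = [sum(row) for row in conf]
    col_sums = [sum(row[j] for row in conf) for j in range(k)]
    oa = sum(conf[i][i] for i in range(k)) / n                 # p_0
    aa = sum(conf[i][i] / row_sums[i] for i in range(k)) / k   # mean per-class accuracy
    p_e = sum(r * c for r, c in zip(row_sums, col_sums)) / (n * n)
    kappa = (oa - p_e) / (1 - p_e)
    return oa, aa, kappa

conf = [[8, 2],
        [1, 9]]
oa, aa, kappa = classification_metrics(conf)
print(oa, aa, kappa)  # approximately 0.85, 0.85, 0.70
```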

We choose the Adam optimizer with an initial learning rate of $5\times 10^{-4}$ and a weight decay of $5\times 10^{-5}$ to train all models. The number of training epochs for each dataset is set to 500. For the fusion coefficient, we vary γ over 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9 to verify its effect. As depicted in Fig. 5, the best OA value was attained for all three datasets when $\gamma = 0.1$. Therefore, the fusion coefficient was fixed at 0.1 in all experiments. The study in this paper is implemented in the PyTorch environment, using an AMD EPYC 7543 32-Core Processor, 90-GB RAM, and an NVIDIA A40 48-GB GPU server. All experiments are randomly repeated 10 times, and the results are averaged to reduce experimental error.

4.3 Robustness evaluation and comparison

We select the most representative methods for robustness evaluation and comparison in the field of hyperspectral classification, including 1D-CNN [9], HybridSN [11], SSFCN [12], 3D-CNN [10], SpectralFormer [48], RSEN [13], and state-of-the-art defense method SACNet [17]. We adopt PGD [29] with $\ell _{\infty}$ to generate adversarial examples, where ϵ is set to 0.04, the iteration number is 10, and the first class is used as the attack target.

Tables 2, 3 and 4 report the quantitative comparison of the different models against adversarial attacks. In each table, the upper section lists the classification accuracy of each class. In the absence of defense, common hyperspectral image classification frameworks are strikingly vulnerable to adversarial examples: the OA of all advanced classification models drops below 30%. Although the fusion of spectral and spatial features dramatically improves classification performance, it exhibits lower robustness against adversarial attacks. Tables 2 to 4 show that the classification results of 1D-CNN are better than those of 3D-CNN on all three datasets. In addition, HybridSN, which combines 3D-CNN and 2D-CNN, has lower classification accuracy than 1D-CNN on the adversarial examples. This phenomenon indicates that the influence of adversarial perturbation may be more significant for spectral features than for spatial features. For the PaviaU dataset, the OA scores of 1D-CNN, 3D-CNN, and HybridSN are 23.40%, 17.50%, and 22.48%, respectively. Additionally, we find that other state-of-the-art classification models are also susceptible to adversarial perturbations. For example, SSFCN achieves OA scores of only 22.46% and 18.51% on PaviaU and Indian Pines, respectively, and only 8.61% on the Salinas dataset. SpectralFormer uses the state-of-the-art transformer structure for HSI classification, but it is still not robust enough to withstand adversarial attacks. This indicates that simply using a transformer does not make a model robust against adversarial attacks. Furthermore, we evaluate and compare the WFSS framework with the current most advanced defense method SACNet, as summarized in Table 6. Although SACNet achieves superior defense performance under FGSM attacks using spatial context coding, it still fails to achieve satisfactory results under strong attacks such as PGD.
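The per-class accuracy, OA, AA, and κ values reported in the tables can be derived from a confusion matrix. The sketch below assumes the standard definitions of these metrics. The helper names and the small label vectors are hypothetical and are not taken from the paper's experiments.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)  # rows: ground truth, columns: prediction
    return cm

def oa_aa_kappa(cm):
    total = cm.sum()
    rows = cm.sum(axis=1)                            # samples per true class
    cols = cm.sum(axis=0)                            # samples per predicted class
    oa = np.trace(cm) / total                        # overall accuracy
    per_class = np.diag(cm) / np.maximum(rows, 1)    # guard against empty classes
    aa = per_class[rows > 0].mean()                  # average accuracy
    pe = (rows * cols).sum() / total**2              # chance agreement
    kappa = (oa - pe) / (1.0 - pe)                   # kappa coefficient
    return oa, aa, kappa, per_class

# Hypothetical labels for three classes.
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 0])
oa, aa, kappa, per_class = oa_aa_kappa(confusion_matrix(y_true, y_pred, 3))
print(f"OA={oa:.4f}  AA={aa:.4f}  kappa={kappa:.4f}")
```

Under a targeted attack, predictions collapse towards one class, so a low OA is accompanied by a κ near zero, which is why the three metrics are usually read together.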

Table 2 Robustness evaluation results of the PaviaU dataset on the adversarial examples generated by the PGD [29] attack with $\epsilon =0.04$. Best results are marked in bold


Table 3 Robustness evaluation results of the Salinas dataset on the adversarial examples generated by the PGD [29] with $\epsilon =0.04$. Best results are marked in bold


Table 4 Robustness evaluation results of the Indian Pines dataset on the adversarial examples generated by the PGD [29] with $\epsilon =0.04$. Best results are marked in bold


The WFSS performs better than SACNet when encountering FGSM attacks and achieves an OA score close to 90% against PGD and C&W. Specifically, it obtains an OA value of 90.97% and 95.09% on the PaviaU and Salinas datasets, respectively, under PGD attack. WFSS can yield an OA score of 88.62% on the Indian Pines dataset, which is significantly better than existing state-of-the-art defense methods. In conclusion, the WFSS framework can significantly enhance the robustness against adversarial attacks by combining global spectral dependence and spatial pixel global context relationships.

We further visualize the classification maps in Figs. 6 to 8 to demonstrate the classification results of different models under adversarial attacks. We find that most of the samples are misclassified by the most advanced classification models. For the PaviaU dataset, the targeted attack pushes most samples into the category “Asphalt”. Some samples in the “grass” category are misclassified as the “trees” category, probably because the spectral reflectance properties of meadows and trees are similar. The most advanced defense method, SACNet, successfully defends against some attacks, but a noticeable gap remains between the classification map of SACNet and the ground truth map. In contrast, our WFSS achieves satisfactory classification results with almost no difference between its classification map and the ground truth map.

4.4 Ablation study on different modules

We first explore the influence of different perturbation intensities on model classification. We let ϵ take 12 different values in [0, 1] to generate the adversarial perturbations under the PGD [29] attack. The OA scores of different models are illustrated in Fig. 9. The OA scores of all models decrease as the perturbation intensity increases. The OA scores of most models drop below 30% once ϵ exceeds 0.04. Even the state-of-the-art defense method SACNet has an OA score below 20% at the maximum perturbation intensity. In contrast, the WFSS achieves the best classification results at all perturbation intensities, and its OA score is still greater than 30% at the maximum perturbation intensity.
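The perturbation sweep can be illustrated as follows. This is a simplified sketch under stated assumptions and not the paper's protocol: it uses an untargeted attack on a toy binary logistic-regression model, the 12 values of ϵ are taken as evenly spaced (the text only says they lie in [0, 1]), and the step size, the [0, 1] clipping, and all function names are hypothetical.

```python
import numpy as np

def pgd_linf_binary(x, y, w, b, eps, steps=10):
    """Untargeted l_inf PGD on a logistic-regression model; labels y are in {-1, +1}."""
    alpha = eps / 4  # step size is an assumption
    x_adv = x.copy()
    for _ in range(steps):
        margin = y * (x_adv @ w + b)
        # Gradient of log(1 + exp(-margin)) w.r.t. the input; ascending it shrinks the margin.
        grad = (-y / (1.0 + np.exp(margin)))[:, None] * w[None, :]
        x_adv = np.clip(x_adv + alpha * np.sign(grad), x - eps, x + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)  # keep inputs in the assumed valid range
    return x_adv

def oa_curve(x, y, w, b, eps_values):
    """Overall accuracy of the toy model at each perturbation intensity."""
    curve = []
    for eps in eps_values:
        x_adv = pgd_linf_binary(x, y, w, b, eps)
        pred = np.where(x_adv @ w + b >= 0, 1, -1)
        curve.append(float((pred == y).mean()))
    return curve

# Illustrative data: labels are produced by the model itself, so clean OA is 1.0.
rng = np.random.default_rng(1)
w = rng.normal(size=20)
b = -0.5 * w.sum()
x = rng.uniform(0.0, 1.0, size=(200, 20))
y = np.where(x @ w + b >= 0, 1, -1)
for eps, oa in zip(np.linspace(0.0, 1.0, 12), oa_curve(x, y, w, b, np.linspace(0.0, 1.0, 12))):
    print(f"eps={eps:.3f}  OA={oa:.3f}")
```

For this linear toy model the accuracy can only fall as ϵ grows, which mirrors the qualitative trend described for Fig. 9. The numbers themselves have no relation to the reported results.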

We further explore the influence of the different modules in the proposed WFSS on the classification performance and robustness of the model. In Table 5, SA indicates the spatial self-attention module, and ST denotes the spectral transformer module. Table 5 shows that the ST module provides stronger robustness than the SA module. For the Salinas dataset, the SA module increases the OA score of the model from 17.91% to 61.13%, while the ST module yields an OA score of 82.35%. This suggests that spectral information contributes significantly to improving network robustness. The best results on all three datasets are achieved by combining the SA module and the ST module.

Table 5 Classification and defense results of each module in the WFSS framework on the adversarial examples generated by PGD [29] with $\epsilon =0.04$. Best results are marked in bold


We perform experimental evaluation under different attack methods, including FGSM [28] and C&W [30], to demonstrate the effectiveness of the WFSS framework. All experimental settings are the same as for PGD, and the ϵ of FGSM is set to 0.04. The classification results of the three datasets are presented in Table 6. The state-of-the-art defense method SACNet defends well against the FGSM attack but is ineffective against PGD and C&W (Table 6). For instance, the OA score of SACNet under the FGSM attack on the Salinas dataset is 91.65%, but it decreases to 67.94% and 66.82% under the PGD and C&W attacks, respectively. In contrast, the WFSS framework still achieves approximately 90% OA on all datasets under PGD and C&W attacks.

Table 6 Robustness evaluation results of all datasets on the adversarial examples with different attack methods. Best results are marked in bold


In addition to robustness against adversarial attacks, classification performance on clean examples is a critical metric. The proposed WFSS achieves more than 96% OA on all datasets. It achieves the highest accuracy on the Salinas dataset and is second only to RSEN on the PaviaU and Indian Pines datasets. These results show that the WFSS achieves competitive results on clean images while remaining resistant to adversarial examples.

4.5 Effect of adversarial attacks on HSI classification models

We plot spectral curves of the clean images, the adversarial examples, and the adversarial perturbations to explore how adversarial attacks affect HSI classification models, as presented in Fig. 10. Although human vision cannot discern the difference between adversarial examples and clean examples in the form of images, changes in the spectral curve are readily perceived by the human eye. Figure 10 shows that the adversarial perturbations make the spectral curves of the different samples more discrete.

Take the grapes-untrained class in the third row of Fig. 10 as an example. Between bands 40 and 100, the reflectance of the clean examples shows an upward trend (Fig. 10(g)), whereas the adversarial perturbations show a decreasing trend (Fig. 10(i)). Between bands 100 and 125, the reflectance of the clean examples first declines and then rises (Fig. 10(g)). Conversely, the adversarial perturbations first rise and then fall (Fig. 10(i)). The same opposition can be observed between bands 125 and 155. We speculate that the adversarial attack causes the classification network to err because the perturbation changes the spectral curves of the samples. This further suggests that the spectral domain has a more significant influence on the robustness of the model.
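One hypothetical way to quantify the opposing trends described above is to correlate the band-to-band slopes of the clean spectrum with those of the perturbation inside a band window. The function `trend_correlation` and the synthetic spectra below are illustrative assumptions. The paper reports this observation visually and does not describe such a measure.

```python
import numpy as np

def trend_correlation(clean, adversarial, start, stop):
    """Pearson correlation between the slopes of the clean spectrum and the
    slopes of the perturbation (adversarial - clean) over bands [start, stop)."""
    perturbation = adversarial - clean
    d_clean = np.diff(clean[start:stop])
    d_pert = np.diff(perturbation[start:stop])
    return float(np.corrcoef(d_clean, d_pert)[0, 1])

# Synthetic spectrum with 204 bands (an illustrative choice): the clean curve
# rises over bands 40 to 100 while the perturbation falls.
bands = np.arange(204)
clean = 0.3 + 0.2 * np.sin(bands / 80.0)
adversarial = clean - 0.02 * np.sin(bands / 80.0)
print(trend_correlation(clean, adversarial, 40, 100))
```

A value near -1 means the perturbation moves against the clean curve in that window, and a value near +1 means it follows the curve.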

We perform a correlation analysis using the t-SNE algorithm [53] to explore the influence of the spectral transformer module on the spectral features. t-SNE is a nonlinear dimensionality reduction algorithm that can reduce high-dimensional data to 2 or 3 dimensions for visualization. In Fig. 11, we perform the correlation analysis of the spectral features in high-dimensional space for all samples of the PaviaU dataset. Specifically, the first column indicates the influence of the ST module on the clean example correlation, and the second column represents the influence of the ST module on the adversarial example correlation. When the ST module is absent, the features of all samples in the high-dimensional space are close together. Figure 11(b) demonstrates that the adversarial attack causes the high-dimensional features to become disordered and crowded, which makes it more difficult for the model to classify the samples. The results in Fig. 11(c) indicate that the ST module makes the feature clusters of different categories more distinct and widens the classification boundaries. It establishes dependencies between spectra, making the data more correlated in high-dimensional space. This helps the model remain robust even when encountering adversarial attacks.

5 Conclusion

This study presents the WFSS, a new weighted fusion framework, to tackle the issue of adversarial attacks on HSI classification. By combining spatial and spectral information, the WFSS framework enhances classification robustness. The extensive experimental results show that the WFSS framework achieves superior performance over state-of-the-art defense methods. Furthermore, we find that the combination of spatial and spectral information can significantly enhance classification robustness. However, the fusion of different modules results in a larger model size and a longer runtime, which makes it challenging to obtain a lightweight model. Addressing this issue requires further research.

Availability of data and materials

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Notes

Hyperspectral datasets. https://ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes. Accessed April 7, 2022.

Abbreviations

AA: average accuracy
AVIRIS: airborne visible/infrared imaging spectrometer
CNN: convolutional neural network
C&W: Carlini and Wagner’s attack
DNNs: deep neural networks
FGSM: fast gradient sign method
HSI: hyperspectral image
κ: kappa coefficient
MCA: multi-head cross attention
MSA: multi-head self-attention
MLP: multilayer perceptron
OA: overall accuracy
PGD: projected gradient descent
PE: position encoding
PaviaU: Pavia University
RSEN: robust self-ensembling network
ReLU: rectified linear unit
ST: spectral transformer
SA: self-attention
TE: transformer encoder
TD: transformer decoder
WFSS: weighted fusion of spectral transformer and spatial self-attention

References

[1] Cheng, G., Xie, X., Han, J., Guo, L., & Xia, G.-S. (2020). Remote sensing image scene classification meets deep learning: challenges, methods, benchmarks, and opportunities. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 13, 3735–3756.
[2] Plaza, A., Benediktsson, J. A., Boardman, J. W., Brazile, J., Bruzzone, L., Camps-Valls, G., et al. (2009). Recent advances in techniques for hyperspectral image processing. Remote Sensing of Environment, 113, S110–S122.
[3] Xu, W., Wooster, M. J., & Grimmond, C. S. B. (2008). Modelling of urban sensible heat flux at multiple spatial scales: a demonstration using airborne hyperspectral imagery of Shanghai and a temperature–emissivity separation approach. Remote Sensing of Environment, 112(9), 3493–3510.
[4] Roberts, D. A., Quattrochi, D. A., Hulley, G. C., Hook, S. J., & Green, R. O. (2012). Synergies between VSWIR and TIR data for the urban environment: an evaluation of the potential for the hyperspectral infrared imager (HyspIRI) decadal survey mission. Remote Sensing of Environment, 117, 83–101.
[5] Lu, B., Dao, P. D., Liu, J., He, Y., & Shang, J. (2020). Recent advances of hyperspectral imaging technology and applications in agriculture. Remote Sensing, 12(16), 2659.
[6] Shimoni, M., Haelterman, R., & Perneel, C. (2019). Hyperspectral imaging for military and security applications: combining myriad processing and sensing techniques. IEEE Geoscience and Remote Sensing Magazine, 7(2), 101–117.
[7] Chen, Y., Zhao, X., & Jia, X. (2015). Spectral–spatial classification of hyperspectral data based on deep belief network. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 8(6), 2381–2392.
[8] Zhou, P., Han, J., Cheng, G., & Zhang, B. (2019). Learning compact and discriminative stacked autoencoder for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 57(7), 4823–4833.
[9] Hu, W., Huang, Y., Wei, L., Zhang, F., & Li, H. (2015). Deep convolutional neural networks for hyperspectral image classification. Journal of Sensors, 2015, 1–12.
[10] Hamida, A. B., Benoit, A., Lambert, P., & Amar, C. B. (2018). 3-D deep learning approach for remote sensing image classification. IEEE Transactions on Geoscience and Remote Sensing, 56(8), 4420–4434.
[11] Roy, S. K., Krishna, G., Dubey, S. R., & Chaudhuri, B. B. (2019). HybridSN: exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification. IEEE Geoscience and Remote Sensing Letters, 17(2), 277–281.
[12] Xu, Y., Du, B., & Zhang, L. (2019). Beyond the patchwise classification: spectral-spatial fully convolutional networks for hyperspectral image classification. IEEE Transactions on Big Data, 6(3), 492–506.
[13] Xu, Y., Du, B., & Zhang, L. (2022). Robust self-ensembling network for hyperspectral image classification. IEEE Transactions on Neural Networks and Learning Systems. Advance online publication. https://doi.org/10.1109/TNNLS.2022.3198142.
[14] Jia, S., Jiang, S., Zhang, S., Xu, M., & Jia, X. (2022). Graph-in-graph convolutional network for hyperspectral image classification. IEEE Transactions on Neural Networks and Learning Systems. Advance online publication. https://doi.org/10.1109/TNNLS.2022.3182715.
[15] Shi, C., Dang, Y., Fang, L., Lv, Z., & Zhao, M. (2021). Hyperspectral image classification with adversarial attack. IEEE Geoscience and Remote Sensing Letters, 19, 1–5.
[16] Park, S., Lee, H. J., & Ro, Y. M. (2021). Adversarially robust hyperspectral image classification via random spectral sampling and spectral shape encoding. IEEE Access, 9, 66791–66804.
[17] Xu, Y., Du, B., & Zhang, L. (2021). Self-attention context network: addressing the threat of adversarial attacks for hyperspectral image classification. IEEE Transactions on Image Processing, 30, 8671–8685.
[18] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., et al. (2013). Intriguing properties of neural networks. arXiv preprint. arXiv:1312.6199.
[19] Liu, A., Guo, J., Wang, J., Liang, S., Tao, R., Zhou, W., et al. (2023). X-Adv: physical adversarial object attacks against x-ray prohibited item detection. In C. Troncoso & J. A. Calandrino (Eds.), Proceedings of the 32nd USENIX security symposium (pp. 1–18). Berkeley: USENIX Association.
[20] Liu, A., Liu, X., Fan, J., Ma, Y., Zhang, A., Xie, H., et al. (2019). Perceptual-sensitive GAN for generating adversarial patches. In Proceedings of the 33rd AAAI conference on artificial intelligence (pp. 1028–1035). Palo Alto: AAAI Press.
[21] Guo, J., Bao, W., Wang, J., Ma, Y., Gao, X., Xiao, G., et al. (2023). A comprehensive evaluation framework for deep model robustness. Pattern Recognition, 137, 109308.
[22] Zantedeschi, V., Nicolae, M.-I., & Rawat, A. (2017). Efficient defenses against adversarial attacks. In B. Thuraisingham, B. Biggio, D. M. Freeman, et al. (Eds.), Proceedings of the 10th ACM workshop on artificial intelligence and security (pp. 39–49). New York: ACM.
[23] Tramèr, F., Kurakin, A., Papernot, N., Goodfellow, I., Boneh, D., & McDaniel, P. (2017). Ensemble adversarial training: attacks and defenses. arXiv preprint. arXiv:1705.07204.
[24] Grosse, K., Manoharan, P., Papernot, N., Backes, M., & McDaniel, P. (2017). On the (statistical) detection of adversarial examples. arXiv preprint. arXiv:1702.06280.
[25] Akhtar, N., & Mian, A. (2018). Threat of adversarial attacks on deep learning in computer vision: a survey. IEEE Access, 6, 14410–14430.
[26] Xu, Y., Du, B., & Zhang, L. (2020). Assessing the threat of adversarial examples on deep neural networks for remote sensing scene classification: attacks and defenses. IEEE Transactions on Geoscience and Remote Sensing, 59(2), 1604–1617.
[27] Liu, A., Liu, X., Yu, H., Zhang, C., Liu, Q., & Tao, D. (2021). Training robust deep neural networks via adversarial noise propagation. IEEE Transactions on Image Processing, 30, 5769–5781.
[28] Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. In Y. LeCun & Y. Bengio (Eds.), Proceedings of the 3rd international conference on learning representations, San Diego, USA (pp. 1–11). [Poster presentation].
[29] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2018). Towards deep learning models resistant to adversarial attacks. In Proceedings of the 6th international conference on learning representations (pp. 1–18). Retrieved May 25, 2023, from https://openreview.net/pdf?id=rJzIBfZAb.
[30] Carlini, N., & Wagner, D. (2017). Towards evaluating the robustness of neural networks. In Proceedings of the IEEE symposium on security and privacy (pp. 39–57). Piscataway: IEEE.
[31] Li, S., Song, W., Fang, L., Chen, Y., Ghamisi, P., & Benediktsson, J. A. (2019). Deep learning for hyperspectral image classification: an overview. IEEE Transactions on Geoscience and Remote Sensing, 57(9), 6690–6709.
[32] Hong, D., Gao, L., Yao, J., Zhang, B., Plaza, A., & Chanussot, J. (2020). Graph convolutional networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 59(7), 5966–5978.
[33] Zhang, H., Zou, J., & Zhang, L. (2022). EMS-GCN: an end-to-end mixhop superpixel-based graph convolutional network for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–16.
[34] Xu, Q., Yang, C., Tang, J., & Luo, B. (2022). Grouped bidirectional LSTM network and multi-stage fusion convolutional transformer for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–14.
[35] Chen, Y., Lin, Z., Zhao, X., Wang, G., & Gu, Y. (2014). Deep learning-based classification of hyperspectral data. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 7(6), 2094–2107.
[36] Mou, L., Ghamisi, P., & Zhu, X. X. (2017). Deep recurrent neural networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 55(7), 3639–3655.
[37] Zhang, X., Sun, Y., Jiang, K., Li, C., Jiao, L., & Zhou, H. (2018). Spatial sequential recurrent neural network for hyperspectral image classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 11(11), 4141–4155.
[38] Hang, R., Liu, Q., Hong, D., & Ghamisi, P. (2019). Cascaded recurrent neural networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 57(8), 5384–5394.
[39] Ghamisi, P., Maggiori, E., Li, S., Souza, R., Tarabalka, Y., Moser, G., et al. (2018). New frontiers in spectral-spatial hyperspectral image classification: the latest advances based on mathematical morphology, Markov random fields, segmentation, sparse representation, and deep learning. IEEE Geoscience and Remote Sensing Magazine, 6(3), 10–43.
[40] Zhu, X. X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., et al. (2017). Deep learning in remote sensing: a comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine, 5(4), 8–36.
[41] Zhao, W., & Du, S. (2016). Spectral–spatial feature extraction for hyperspectral image classification: a dimension reduction and deep learning approach. IEEE Transactions on Geoscience and Remote Sensing, 54(8), 4544–4554.
[42] Chen, Y., Jiang, H., Li, C., Jia, X., & Ghamisi, P. (2016). Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Transactions on Geoscience and Remote Sensing, 54(10), 6232–6251.
[43] Zhong, Z., Li, J., Luo, Z., & Chapman, M. (2017). Spectral–spatial residual network for hyperspectral image classification: a 3-D deep learning framework. IEEE Transactions on Geoscience and Remote Sensing, 56(2), 847–858.
[44] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., et al. (2021). An image is worth 16x16 words: transformers for image recognition at scale. In Proceedings of the 9th international conference on learning representations (pp. 1–12). Retrieved May 25, 2023, from https://openreview.net/pdf?id=YicbFdNTTy.
[45] Naseer, M. M., Ranasinghe, K., Khan, S. H., Hayat, M., Shahbaz Khan, F., & Yang, M.-H. (2021). Intriguing properties of vision transformers. In Y. Dauphin, M. Ranzato, A. Beygelzimer, et al. (Eds.), Proceedings of the 35th international conference on neural information processing systems (pp. 23296–23308). Red Hook: Curran Associates.
[46] Park, N., & Kim, S. (2022). How do vision transformers work? In Proceedings of the 10th international conference on learning representations (pp. 1–14). Retrieved May 25, 2023, from https://openreview.net/pdf?id=D78Go4hVcxO.
[47] He, J., Zhao, L., Yang, H., Zhang, M., & Wei, L. H. (2019). Hyperspectral image classification using the bidirectional encoder representation from transformers. IEEE Transactions on Geoscience and Remote Sensing, 58(1), 165–178.
[48] Hong, D., Han, Z., Yao, J., Gao, L., Zhang, B., Plaza, A., et al. (2021). Spectralformer: rethinking hyperspectral image classification with transformers. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–15.
[49] Zhong, Z., Li, Y., Ma, L., Li, J., & Zheng, W.-S. (2021). Spectral–spatial transformer network for hyperspectral image classification: a factorized architecture search framework. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–15.
[50] Zhao, Z., Hu, D., Wang, H., & Yu, X. (2022). Convolutional transformer network for hyperspectral image classification. IEEE Geoscience and Remote Sensing Letters, 19, 1–5.
[51] Sun, L., Zhao, G., Zheng, Y., & Wu, Z. (2022). Spectral–spatial feature tokenization transformer for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 60, 1–14.
[52] Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2017). Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848.
[53] van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2579–2605.


Funding

This research work is partly supported by the National Natural Science Foundation of China (No.62172001, No.62076147, and No.61860206004).

Author information

Authors and Affiliations

Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University, Anhui, China
Lichun Tang, Wanli Lyu & Bin Luo
School of Communication & Electronic Engineering, East China Normal University, Shanghai, China
Zhaoxia Yin
Department of Computer Science and Technology, Tsinghua University, Beijing, China
Hang Su


Contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by LT, ZY and BL. The first draft of the manuscript was written by LT and all the authors have commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Bin Luo.

Ethics declarations

Competing interests

Bin Luo is an Associate Editor-in-Chief at Visual Intelligence and was not involved in the editorial review of this article or the decision to publish it. The authors declare that they have no other competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Tang, L., Yin, Z., Su, H. et al. WFSS: weighted fusion of spectral transformer and spatial self-attention for robust hyperspectral image classification against adversarial attacks. Vis. Intell. 2, 5 (2024). https://doi.org/10.1007/s44267-024-00038-x


Received: 05 June 2023
Revised: 26 January 2024
Accepted: 28 January 2024
Published: 28 February 2024
DOI: https://doi.org/10.1007/s44267-024-00038-x
