1 Introduction

Image manipulation generation [22, 27, 37, 49] significantly enriches sample diversity and visual interest; however, it also creates a crisis of trust in image reliability. To address concerns about malicious tampering and illegal dissemination, image forgery forensic methods [2, 10, 38] have been continuously studied in recent years to automatically detect such manipulations. Various image manipulation detectors based on deep neural networks (DNNs) [1, 17, 31, 43] have demonstrated remarkable performance. Nevertheless, these detectors are susceptible to adversarial examples, where imperceptible perturbations artificially applied to clean inputs mislead the detectors into producing incorrect predictions. Furthermore, adversarial examples usually exhibit transferability across diverse models, even those with unknown structures and parameters.

Fig. 1 Workflows for adversarial examples in image manipulation detection models. The adversarial example of an authentic image spoofs detectors into predicting a false-positive mask, while the example of a fake image prevents models from detecting manipulations as usual

The possibility that a detector can be deceived simply by adding subtle noise presents a severe threat to information security. Therefore, it is important to investigate adversarial attacks specifically targeting existing manipulation detection methods from the perspective of reverse forensics. Meanwhile, exploring the transferability of adversarial examples helps reveal universal vulnerabilities within detectors, thereby improving their robustness to malicious adversaries in black-box attack scenarios.

Despite extensive efforts in developing adversarial attacks [50, 52, 56], most works primarily design adversaries tailored for image classifiers [46, 51, 53]. In such attacks, classifiers fail to correctly classify inputs in untargeted attacks or produce predictions consistent with preset labels in targeted attacks. However, these attack principles may not be optimal for effectively disrupting image manipulation detectors. This limitation arises from the fundamental differences between classification and manipulation detection: manipulation detectors not only perform image-level identification of authentic and forged images, but also predict pixel-wise probability distribution maps (known as binary masks) to localize manipulated regions.

A noteworthy development is that recently studied adversarial attacks have begun to address dense prediction tasks, where adversaries are designed for object detection or semantic segmentation models. Huang et al. [20] explore a transfer-based self-ensemble attack (T-SEA) on object detection, which ensembles the input, the attacked model, and the adversarial patch to boost adversarial transferability under black-box attacks. Cai et al. [3] propose ensemble-based black-box targeted and untargeted attacks on semantic segmentation and object detection. Li et al. [28] optimally search adversarial points to generate anti-forensic fake face images of high visual quality by exploring Style-GAN's manifold. Jia et al. [21] propose a hybrid adversarial attack against face forgery detectors based on a meta-learning strategy. However, there are few works on crafting imperceptible adversarial examples specific to image manipulation detection models. Moreover, the aforementioned attacks are mainly implemented in the spatial domain, which increases the perturbation intensity for better effects at the expense of image quality. Although Zhu et al. [55] propose adversarial manipulation generation (AMG), which incorporates both spatial and frequency features into a GAN architecture to attack manipulation detection, this generative attack requires a generator pre-trained to perform reasonably well on a dataset in advance, and acquiring dataset authorization for cross-dataset attacks may be difficult. In contrast, utilizing transferable adversaries crafted on a small number of models to disrupt black-box victims is more cost-effective in practical applications. This is a worthwhile topic, as adversarial examples that exploit detector vulnerabilities could easily increase security threats.

To address the aforementioned problems, we propose a novel adversarial attack called RevAggAL, which not only effectively performs iterative attacks against image manipulation detectors at both the image level and the pixel level, but also generates adversarial examples with good invisibility and high transferability. Specifically, we first design a new loss function for optimizing perturbations from the perspective of pixel-level segmentation prediction. Then, we extract low-frequency components of input samples to constrain the visual differences between clean and perturbed images. We further improve adversarial transferability across DNN-based detectors via gradient aggregation over mid-layer features in black-box attack scenarios. It is worth noting that our method is advantageous in cross-model and cross-dataset settings without additional pre-training of white-box adversarial generators in advance.

We summarize our main contributions as follows:

  • We propose an efficient loss function for optimally adding perturbations from the view of pixel-level segmentation decisions, which can also be extended to other classic iterative attack algorithms for adversarial example generation.

  • We introduce a low-frequency constraint for limiting subtle noise to finer details. This suppresses perturbations in regions that are sensitive to human observers. We also exploit aggregated gradients over mid-layer features from white-box surrogate detectors to improve the transferability of the designed adversarial examples to black-box victims.

  • We conduct extensive experiments on three DNN-based image manipulation detectors with five datasets under both white-box and black-box settings. Compared with traditional iterative attacks, our method achieves favorable attack performance in generating more imperceptible adversaries while reducing the degradation of image quality.

2 Related work

2.1 Digital image forensics

Digital image forensics technology plays a crucial role in identifying the authenticity, integrity, and source of images. Conventional approaches, such as digital watermarking [45] and digital signatures [33, 36], require encrypted information to be inserted before an image is transmitted, followed by feature extraction and consistency verification on the detector side. However, these active forensic methods are impractical because most imaging devices lack automatic key-embedding capabilities. Recently, blind forensics has emerged as an effective alternative, directly analyzing inherent characteristics and proprietary properties to determine image sources [12, 13, 42] and manipulation artifacts [14, 26, 41]. Our work falls within blind forensics against forged image manipulation.

2.2 Image manipulation detection and localization

In generic image manipulation blind forensics, it is necessary to distinguish authentic images from ones manipulated by a DNN-based generator, as well as to accurately localize modified regions. Consequently, this task is commonly regarded as an image manipulation detection and localization (IMDL) problem. Early methods extract image-block features (such as DCT, PCA, and SVD) or internal statistical characteristics (such as pixel mean and RGB correlation) to establish true-or-false classifications. However, these patch-based operations often provide imprecise manipulation localization. Building upon this foundation, Li et al. [29] implemented a fully convolutional network (FCN) [32] for precise localization. Salloum et al. [40] developed a multi-task FCN (MFCN) to segment a pixel-level fine-grained manipulated area and its boundary. Zhou et al. [54] applied the SRM kernel [11] to Faster R-CNN [39] to localize forgeries with bounding boxes. Chen et al. [6] proposed MVSSNet, which effectively captures subtle changes in suspicious boundaries and learns more general features through multi-level supervision. Wu et al. [48] considered the impact of image transmission on various online social networks (OSNs) and proposed a training scheme for improving the robustness of image manipulation detection by modeling predictable noises and intentionally introducing unseen noises. Our work treats manipulation localization as a pixel-wise semantic segmentation task, and we select a modified baseline method (ResFCN, an FCN combined with ResNet-50 [18]) and two state-of-the-art methods (MVSSNet and OSN) for detecting and localizing manipulations.

2.3 Adversarial attack

Adversarial examples with subtle perturbations, closely resembling their originals, have been demonstrated to effectively disrupt DNN-based models. Beyond image classification, the scope of adversarial attacks has expanded to dense prediction tasks such as object detection and semantic segmentation. Despite growing research on attack methodologies for face forgery detection [4, 9, 28], there are few studies specifically targeting common image manipulation detectors. We introduce classic attack algorithms that hold potential against manipulation detectors. The fast gradient sign method (FGSM) [15] generates adversarial examples by leveraging the sign of the gradient. Projected gradient descent (PGD) [34] is an iterative attack that takes a small step at each iteration while projecting the perturbation back into a specified range. Nevertheless, most existing works primarily generate adversaries with spatial constraints, easily resulting in a certain degree of image quality degradation with perturbations visible to human eyes. Inspired by recent research on attacking classifiers in the frequency domain [21, 55], we select low-frequency components from decomposed input samples to more precisely constrain the perturbations. Furthermore, we employ an Adam optimizer to optimize the loss, as in C&W [5].

3 Methodology

3.1 Problem formulation

Assume we have an image manipulation detector F which accepts a normalized image \(x \in [0,1]^{H \times W \times C}\) and predicts a pixel-wise probability distribution map \(y_{map} \in [0,1]^{H \times W}\), where each pixel is assigned a probability value indicating its likelihood of belonging to a specific category, to localize the manipulated region. Subsequently, a Global Max Pooling (GMP) function selects the most significant response across spatial dimensions, effectively compressing pixel-level probabilities into a single value for a binary image-level discrimination label \(y \in Y=\{0,1\}\) (i.e., 0 for Authentic and 1 for Fake). In this work, we define F as the full neural network including the GMP function, \(M(x)=y_{map}\) as the output of all layers excluding the GMP function, and

$$\begin{aligned} F(x) = GMP(M(x)) = y. \end{aligned}$$
(1)
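To make this formulation concrete, the following is a minimal PyTorch sketch of the detector interface assumed above; the function names (`global_max_pool`, `detect`) and the 0.5 threshold are illustrative assumptions, not the authors' released code.

```python
import torch

def global_max_pool(y_map: torch.Tensor) -> torch.Tensor:
    """Collapse a pixel-wise probability map (B, H, W) to an image-level score (B,)."""
    return y_map.flatten(start_dim=1).max(dim=1).values

def detect(M, x: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """F(x) = GMP(M(x)): 1 (Fake) if any pixel exceeds the threshold, else 0 (Authentic)."""
    y_map = M(x)                       # (B, H, W) probabilities in [0, 1]
    score = global_max_pool(y_map)     # (B,)
    return (score > threshold).long()  # image-level label y in {0, 1}
```
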
Fig. 2 Overview of our RevAggAL attack for loss optimization to generate effective adversarial examples for image manipulation detection models

The purpose of a targeted attack task is to craft an adversarial example \(x^{adv}\) with a small perturbation \(\delta \), which misleads the detector into predicting a result consistent with the preset target value. The generation of the adversarial example can be formalized as follows:

$$\begin{aligned} \begin{aligned}&\mathrm {\mathop {minimize}\limits _{\delta }} \ D(x, x+\delta ) \\&\mathrm {such \ that} \ F(x+\delta ) = y^{*} \end{aligned} \end{aligned}$$
(2)

where \(D(\cdot , \cdot )\) denotes a distance metric to quantify the difference between the benign image and its adversarial example, and \(y^{*}\) represents the desired target label corresponding to a pre-specified target image \(x^{*}\).

Instead of directly solving this difficult constrained minimization problem, we employ an optimizer and transform it into the following loss optimization problem:

$$\begin{aligned} x^{adv} = \mathrm {\mathop {argmin}\limits _{\delta }} \ \{D(x, x+\delta ) + loss_{F, y^{*}}(x+\delta )\}. \end{aligned}$$
(3)

The first term in Eq. (3) constrains the perturbation, and the second one enforces the prediction to be aligned with the target label \(y^{*}\).
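As an illustration of this reformulation, the sketch below optimizes the penalized objective of Eq. (3) with an Adam optimizer, in the spirit of C&W; `distance_fn` and `attack_loss_fn` are placeholders for the constraint and target-alignment terms specified in the following subsections, and the step count and learning rate are assumptions.

```python
import torch

def optimize_perturbation(x, distance_fn, attack_loss_fn, steps=100, lr=0.01):
    """Solve Eq. (3): minimize D(x, x+delta) + loss_{F,y*}(x+delta) with Adam.

    distance_fn(x, x_adv)  -> scalar tensor (perturbation constraint)
    attack_loss_fn(x_adv)  -> scalar tensor (target-alignment term)
    """
    delta = torch.zeros_like(x, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_adv = (x + delta).clamp(0.0, 1.0)        # keep the example a valid image
        loss = distance_fn(x, x_adv) + attack_loss_fn(x_adv)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return (x + delta.detach()).clamp(0.0, 1.0)
```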

Instead of directly using discrimination labels as reference values, we consider two target cases at the pixel level: 1) For an authentic image, the adversarial example is designed to mislead the detector into predicting some false-positive pixel regions via the target fake map \(y^{*}_{map0}\), in which parts of the pixel regions are labeled as fake. 2) For a fake image, the perturbation is added so that the detector fails to flag any pixels in the manipulated area, based on the target authentic map \(y^{*}_{map1}\) with all pixels labeled as authentic. Note that the binary image-level label determines which target map is chosen in this process.

3.2 Pixel reverse content decision-making loss

On the basis of manipulation detection definition, we can further express \(loss_{F, y^{*}}(x+\delta )\) in Eq. (3) as:

$$\begin{aligned} loss_{F, y^{*}}(x+\delta ) = {\left\{ \begin{array}{ll} J(M(x+\delta ), y^{*}_{map0}), y=1 \\ J(M(x+\delta ), y^{*}_{map1}), y=0 \end{array}\right. } \end{aligned}$$
(4)

where the loss function \(J(\cdot , \cdot )\) measures the distance between the predicted and target maps. Loss functions commonly used in conventional adversarial attacks are cross-entropy (CE) for image-level discrimination labels and mean squared error (MSE) for pixel-wise probability distribution maps. However, using an MSE loss to drive all pixel values toward zero is an overly strict objective.

Generally, all pixel values in a normalized pixel-level prediction probability map are continuously distributed in the range [0, 1]. The default decision threshold in image manipulation detection is commonly set to 0.5, meaning pixels with values lower than 0.5 are classified as authentic and the rest as fake. Similar to [7] and [48], we keep 0.5 as our decision-making boundary, and only force fake pixels below the boundary value when the target label is 0; otherwise, we focus on pushing authentic pixels beyond the boundary when the target label is 1.

To optimize losses more reasonably, we propose a simple yet efficient loss function, named pixel reverse content decision-making (PRevCDm) loss, to replace the MSE loss. Specifically, we first define an inverse function that produces the element-wise opposite of the original probability map,

$$\begin{aligned} Rev(M(x))^{i} = 1-M(x)^{i}, \quad i \in \{1, \ldots , H \times W\}. \end{aligned}$$
(5)

Next, we express a mapping function Y as follows to separate pixels belonging to different labels into two groups:

$$\begin{aligned} Y = {\left\{ \begin{array}{ll} M(x^{adv}) - Rev(M(x^{adv})), y=1 \\ Rev(M(x^{adv})) - M(x^{adv}), y=0 \end{array}\right. } \end{aligned}$$
(6)

where the sign of each element of Y indicates whether the corresponding pixel in the original probability map is predicted with the label y.

Considering the impact of pixels near the boundary on decision-making, we introduce a new constant kappa (abbreviated as k for convenience) to further widen the boundary range. With k as a tolerance bias, the swing pixels within this finite border width are moved toward a more favorable position with respect to the specific target, such that the binary label of a pixel at a position where \(Y>-k\) equals the target label. Eq. (6) can be adjusted as follows:

$$\begin{aligned} Y = {\left\{ \begin{array}{ll} M(x^{adv}) - Rev(M(x^{adv})) + k, y=1 \\ Rev(M(x^{adv})) - M(x^{adv}) + k, y=0. \end{array}\right. } \end{aligned}$$
(7)

Then, we utilize the \(ReLU(\cdot )\) function to filter and retain only the pixels in the original probability map that are relevant to the target label, and feed them into the subsequent loss calculation. More specifically, for an originally authentic image, only pixels whose predicted probabilities are lower than \((0.5+k/2)\) remain and contribute to the loss function, while for an originally fake image, pixels with predicted probabilities higher than \((0.5-k/2)\) need more attention and are counted in the loss function.

To encourage the optimizer to search along a descent direction in the initial stage, we use \((e^{x}-1)\) as a further mapping function with a larger slope when \(x>0\), where x is substituted by \(ReLU(Y)\). Consequently, we define the PRevCDm loss as the average over all pixel losses:

$$\begin{aligned} loss_{F, y^{*}}(x^{adv}) = \frac{1}{N} \sum \nolimits _{i}^{N} (e^{ReLU(Y^{i})}-1). \end{aligned}$$
(8)
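A minimal sketch of the PRevCDm loss following Eqs. (5)-(8) is given below, assuming the element-wise form of \(Rev\) and treating N as the number of pixels; the default kappa value is a placeholder, not the paper's setting.

```python
import torch

def prevcdm_loss(pred_map: torch.Tensor, y: int, kappa: float = 0.1) -> torch.Tensor:
    """Pixel Reverse Content Decision-making (PRevCDm) loss, Eqs. (5)-(8).

    pred_map: M(x_adv), pixel-wise probabilities in [0, 1], shape (B, H, W)
    y:        original image-level label (1 = Fake, 0 = Authentic)
    kappa:    tolerance bias widening the 0.5 decision boundary
    """
    rev = 1.0 - pred_map                     # Eq. (5): opposite probability map
    if y == 1:                               # fake original: push fake pixels below the boundary
        Y = pred_map - rev + kappa           # > 0 only for pixels above 0.5 - kappa/2
    else:                                    # authentic original: push pixels above the boundary
        Y = rev - pred_map + kappa           # > 0 only for pixels below 0.5 + kappa/2
    return (torch.exp(torch.relu(Y)) - 1.0).mean()   # Eq. (8)
```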

3.3 Low-frequency constraint

Previous works typically utilize the \(\ell _{p}\)-norm (\(p\in \{0, 2, \infty \}\)) to regularize perturbations in the representation space. However, it is still difficult to generate completely imperceptible perturbations under such traditional constraints. Existing spatial perturbations are commonly added at arbitrary positions on the original benign image; once noise and aliasing artifacts appear in a light blank background, a random distribution of perturbations acceptable to deep discriminators may be easily detected by the resolution-sensitive human visual system. Therefore, adopting a constraint other than the \(\ell _{p}\)-norm is the key to confining perturbations to imperceptible details.

For an image in the spatial domain, visually sensitive information mainly refers to colorful style content and the general object structure, while slender edges and complex textures are less noticeable. After converting the image to the frequency domain, the low-frequency component contains the basic structural content, while rich detailed features such as object edges and textures reside in the high-frequency components.

Motivated by the above principle, instead of directly using an \(\ell _{p}\)-norm constraint, we introduce a low-frequency constraint. Using the discrete wavelet transform (DWT), we first decompose the input image into one low-frequency component (\(x_{LL}\)) and three high-frequency components (\(x_{LH}\), \(x_{HL}\), \(x_{HH}\)). Then, we apply the inverse DWT (IDWT) to reconstruct a new image \(x^{new}=\varPhi (x)\) from only the low-frequency component, thereby retaining the main content information. The alternative constraint on the perturbation in the first term of Eq. (3) can be expressed as:

$$\begin{aligned} \quad D(x, x+\delta ) = D_{lf}(x, x^{adv}) = \left\| \varPhi (x) - \varPhi (x^{adv}) \right\| _2 \end{aligned}$$
(9)

Minimizing Eq. (9) reduces the loss of main content in the perturbed example and thus mitigates image quality degradation.
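A minimal differentiable sketch of \(\varPhi\) and \(D_{lf}\) is shown below, assuming a single-level Haar wavelet (the paper's exact wavelet basis is not specified here); for Haar, reconstructing from the LL sub-band alone reduces to replacing each 2x2 block by its average, which can be expressed with pooling and upsampling.

```python
import torch
import torch.nn.functional as F

def phi_low_frequency(x: torch.Tensor) -> torch.Tensor:
    """Haar approximation of Phi(x): keep only the LL sub-band and reconstruct.
    Input x has shape (N, C, H, W); each 2x2 block is replaced by its average."""
    ll = F.avg_pool2d(x, kernel_size=2)                        # low-frequency (LL) component
    return F.interpolate(ll, scale_factor=2, mode="nearest")   # IDWT with high frequencies zeroed

def low_frequency_distance(x: torch.Tensor, x_adv: torch.Tensor) -> torch.Tensor:
    """D_lf(x, x_adv) = || Phi(x) - Phi(x_adv) ||_2  (Eq. (9))."""
    return torch.norm(phi_low_frequency(x) - phi_low_frequency(x_adv), p=2)
```

Other wavelet bases would require a full DWT/IDWT implementation; the Haar case is used here only because its LL-only reconstruction has this simple closed form.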

3.4 Transferability improvement with aggregate gradient

We also enhance the transferability of adversarial examples to perform more generalized black-box attacks against different manipulation detectors. In the black-box attack scenario, a common operation is to use a surrogate source model to craft adversarial examples.

Intuitively, we hope adversarial examples generated from the surrogate model are generalizable to diverse victim models for high transferability. However, designing such adversarial examples is nontrivial. Current DNN-based manipulation detectors with various structures usually extract different proprietary features to better adapt themselves to cross-data domains, which comes with the appearance of model-specific feature representations. Therefore, we argue that the indiscriminate distortion of arbitrary extracted features is more likely to fall into a model-specific local optimum, significantly reducing the transferability of adversarial examples.

To avoid being trapped in a local optimum caused by overfitting to model-specific features, we propose to disturb the model-agnostic features of the source model as guidance to generate more transferable adversarial examples. Inspired by [44], we consider model-agnostic features specifically for manipulated fake images when generating such examples. In the forward propagation of the surrogate detector F, let \(F_{k}(x)\) denote the features of the k-th layer; the gradient w.r.t. \(F_{k}(x)\) can be written as:

$$\begin{aligned} \varDelta _{k}^{x} = \frac{\partial O(x,y)}{\partial F_{k}(x)} = \frac{\partial M(x)}{\partial F_{k}(x)} \end{aligned}$$
(10)

where \(O(\cdot , \cdot )\) denotes the logit output with respect to the ground-truth label y, and \(M(\cdot )\) denotes the logit output of the pixel-level distribution map defined in Sect. 3.1. Note that the raw gradient \(\varDelta _{k}^{x}\) calculated with global feature maps generally carries model-specific information, resulting in visual pulses and large gradient noise in non-object regions.

To distort model-specific details while preserving general structures, we use a binarization mask to randomly discard pixels within partial regions of the input sample x with probability \(p_{r}\), and then average the resulting gradients to obtain an aggregate gradient over such transformed inputs. These two steps can be expressed as:

$$\begin{aligned} \bar{\varDelta }_{k}^{x} = \frac{1}{N} \sum _{n=1}^{N} \varDelta _{k}^{x\odot M_{p_{r}}^{n}}, M_{p_{r}} \sim Bernoulli(1-p_{r}) \end{aligned}$$
(11)

where \(M_{p_{r}}\) is a binary mask with the same size as x, \(\odot \) denotes the element-wise product, and N indicates the number of random masks applied to x in the ensemble. For simplicity, we denote \(\bar{\varDelta }_{k}^{x}\) as \(\bar{\varDelta }\) in the rest of this paper.
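The sketch below computes the aggregate gradient of Eqs. (10)-(11) with a forward hook on an assumed mid-layer module; the default values of \(p_r\) and N are placeholders rather than the paper's settings.

```python
import torch

def aggregate_gradient(model, feature_module, x, p_r=0.3, n_masks=30):
    """Average gradient of the predicted map w.r.t. a mid-layer feature over
    randomly masked copies of x (Eq. (11)).

    model(x)        -> pixel-wise logit map M(x)
    feature_module  -> the k-th layer whose output F_k(x) we differentiate against
    """
    feats = []
    handle = feature_module.register_forward_hook(
        lambda module, inp, out: feats.append(out))
    agg = None
    for _ in range(n_masks):
        mask = torch.bernoulli(torch.full_like(x, 1.0 - p_r))   # M_{p_r} ~ Bernoulli(1 - p_r)
        masked = (x * mask).detach().requires_grad_(True)
        feats.clear()
        y_map = model(masked)                                   # forward pass on the masked input
        grad_k = torch.autograd.grad(y_map.sum(), feats[0])[0]  # d M(x) / d F_k(x), Eq. (10)
        agg = grad_k if agg is None else agg + grad_k
    handle.remove()
    return agg / n_masks                                        # aggregate gradient, Eq. (11)
```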

Given the particularity of targeted attacks against manipulation detection models, we aim to divert the detector's attention to both spurious regions and trivial backgrounds of a manipulated image. This strategy is designed to produce pixel-wise predictions that trend toward authenticity, effectively deceiving the detector. Specifically, we suppress model-important features using the aggregate gradients and design the loss function in Eq. (12) to guide the generation of transferable adversarial examples for manipulated fake images.

$$\begin{aligned} loss_{F, y^{*}}(x_{Fake}^{adv}) = \left\| \bar{\varDelta } \odot F_{k}(x_{Fake}^{adv}) \right\| _2 \end{aligned}$$
(12)

Here, we choose the \(\ell _2\) regularization norm to suppress all relatively high-intensity components of \(\bar{\varDelta }\), forcing the output prediction toward zero.
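As a minimal sketch, the term in Eq. (12) can be computed directly from the adversarial example's k-th-layer feature and the (detached) aggregate gradient; the function name and tensor shapes are assumptions.

```python
import torch

def agg_grad_loss(feat_adv: torch.Tensor, agg_grad: torch.Tensor) -> torch.Tensor:
    """Eq. (12): l2 norm of the feature F_k(x_adv) weighted by the aggregate gradient."""
    return torch.norm(agg_grad * feat_adv, p=2)
```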

3.5 The unified attack

Combining the PRevCDm loss for authentic images and the aggregate gradient for fake images in adversarial example generation, together with the low-frequency constraint, the overall attack problem of this work follows from Eq. (3) as:

$$\begin{aligned} \begin{aligned} x^{adv}&= \mathrm {\mathop {argmin}} \{\alpha D_{lf}(x, x^{adv}) \\&\quad + \beta _1 loss_{F, y^{*}}(x_{Au}^{adv})_{PRevCDm} \\&\quad + \beta _2 loss_{F, y^{*}}(x_{Fake}^{adv})_{AggGrad}\} \end{aligned} \end{aligned}$$
(13)

where \(\alpha \), \(\beta _1\), and \(\beta _2\) are hyper-parameters. For clarity, we present the pseudo-code in Algorithm 1 to outline the main procedures of our attack.

Algorithm 1 RevAggAL Attack
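The following sketch assembles the components above into an Algorithm 1-style optimization loop for Eq. (13); it relies on the helper functions from the earlier sketches, and the hyper-parameter defaults (\(\alpha\), \(\beta _1\), \(\beta _2\), step count, learning rate, kappa) are placeholders rather than the paper's settings.

```python
import torch

def revaggal_attack(model, feature_module, x, y, alpha=1.0, beta1=1.0, beta2=1.0,
                    steps=100, lr=0.01, kappa=0.1):
    """Sketch of the unified attack (Eq. (13)).

    y: original image-level label (0 = Authentic, 1 = Fake).
    Uses prevcdm_loss, low_frequency_distance, aggregate_gradient, and
    agg_grad_loss from the earlier sketches.
    """
    agg = aggregate_gradient(model, feature_module, x) if y == 1 else None

    feats = []
    handle = feature_module.register_forward_hook(lambda m, i, o: feats.append(o))
    delta = torch.zeros_like(x, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        x_adv = (x + delta).clamp(0.0, 1.0)
        feats.clear()
        y_map = model(x_adv)
        loss = alpha * low_frequency_distance(x, x_adv)          # D_lf term
        if y == 0:                                               # authentic input: PRevCDm term
            loss = loss + beta1 * prevcdm_loss(y_map, y, kappa)
        else:                                                    # fake input: aggregate-gradient term
            loss = loss + beta2 * agg_grad_loss(feats[0], agg)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    handle.remove()
    return (x + delta.detach()).clamp(0.0, 1.0)
```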

Table 1 The performance evaluation of different white-box attacks on three manipulation detectors (ResFCN, MVSSNet, and OSN) with five datasets (COVERAGE, COLUMBIA, CASIA1, NIST 2016, and Realistic Tampering)

4 Experiments

4.1 Experimental setting

Table 2 Attack Success Rate (%) of adversarial attacks on three target detectors with five datasets

Datasets and Models. We evaluate our method with five image manipulation datasets, namely COVERAGE [47], COLUMBIA [35], CASIA1 [8], NIST 2016 [16], and Realistic Tampering [24, 25]. COVERAGE contains 100 negative images manipulated by copy-move and their originals with genuine objects. COLUMBIA provides 363 images, including 183 genuine images and 180 spliced images. CASIA1 derives from the Corel image dataset and consists of 800 authentic images, 459 copy-move images, and 461 spliced images. NIST 2016 contains 564 high-resolution images with copy-move, splicing, and removal. Realistic Tampering includes 220 realistic images captured by four cameras and their corresponding forgeries created by modern photo-editing software. We adopt ResFCN [18, 32], MVSSNet [7], and OSN [48] as our experimental image manipulation detectors. Previous studies have demonstrated their detection performance, and we directly use the officially released detectors with pre-trained models and optimized parameters. Each detector serves as the victim model in its white-box attack, while in each black-box attack scenario only a known surrogate model is assumed for generating adversarial examples transferable to the other models.

Evaluation metrics. We calculate the attack success rate (ASR) to evaluate the overall attack performance, where ASR denotes the ratio of successfully attacked images to the entire test dataset. We also evaluate detector performance changes before and after the attack with two manipulation detection metrics: image-level F1-score (imF1) and pixel-level F1-score (pF1). Here, authentic images are only considered for imF1 calculation, while fake images are used for both metrics. In addition, we adopt three typical metrics, namely Fréchet inception distance (FID) [19], peak signal-to-noise ratio (PSNR), and structural similarity (SSIM), to measure visual image quality.
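For reference, a minimal sketch of the ASR and image-quality computations is given below, assuming scikit-image (0.19 or later) for PSNR and SSIM; FID requires a pretrained Inception network and is omitted here.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def attack_success_rate(pred_labels, target_labels) -> float:
    """ASR: fraction of test images whose post-attack prediction matches the target."""
    pred_labels = np.asarray(pred_labels)
    target_labels = np.asarray(target_labels)
    return float((pred_labels == target_labels).mean())

def image_quality(clean: np.ndarray, adv: np.ndarray):
    """PSNR / SSIM between a clean image and its adversarial example (H, W, C in [0, 1])."""
    psnr = peak_signal_noise_ratio(clean, adv, data_range=1.0)
    ssim = structural_similarity(clean, adv, channel_axis=-1, data_range=1.0)
    return psnr, ssim
```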

Fig. 3 Visualization results of adversarial examples generated under several attacks. Compared to classic attacks, examples from our method can successfully fool the detector at the pixel level while preserving better image quality

Implementation details. We use an Adam [4

By comparing the first and second rows, we observe that the PRevCDm loss improves the attack performance. Despite a slight degradation in image quality with the PRevCDm loss, using the low-frequency constraint \(D_{lf}\) instead of spatial constraints helps mitigate this effect. Furthermore, comparing the third and last rows, we observe an improvement in ASR while maintaining image quality with the AggGrad. These ablation experiments validate the effectiveness of the proposed components and provide valuable inspiration for future work in this field.

Table 4 Ablation study on CASIA1 under the black-box attack setting

5 Conclusion

In this paper, we propose an efficient adversarial attack named RevAggAL to explore the vulnerability of current state-of-the-art image manipulation detectors. To address the challenge of transferable attacks with more imperceptible perturbation, we combine the PRevCDm loss with aggregated gradients for adversarial example generation under the low-frequency constraint. Experiments demonstrate that our proposed method can achieve good attack performance while ensuring better image quality.