Abstract
Detecting digital face manipulation in images and video has attracted extensive attention due to the potential risk to public trust. To counteract the malicious use of such techniques, deep learning-based deepfake detection methods have been deployed and have exhibited remarkable performance. However, the performance of such detectors is often assessed on benchmarks that hardly reflect real-world situations. For example, the impact of various image and video processing operations and typical workflow distortions on detection accuracy has not been systematically measured. In this paper, a more reliable assessment framework is proposed to evaluate the performance of learning-based deepfake detectors in more realistic settings. To the best of our knowledge, it is the first systematic assessment approach for deepfake detectors that not only reports the general performance under real-world conditions but also quantitatively measures their robustness toward different processing operations. To demonstrate the effectiveness and usage of the framework, extensive experiments and detailed analysis of four popular deepfake detection methods are further presented in this paper. In addition, a stochastic degradation-based data augmentation method driven by realistic processing operations is designed, which significantly improves the robustness of deepfake detectors.
1 Introduction
In recent years, the rapid development of deep convolutional neural networks (DCNNs) and easy access to large-scale datasets have led to significant progress on a broad range of computer vision tasks and, meanwhile, created a surge of new applications. For example, recent advances in generative adversarial networks (GANs) [6, 7] make it possible to change the expression, attributes, and even identity of a human face image; the resulting media are commonly referred to by the popular term ‘Deepfake’. The rapid development of such technologies and the wide availability of open-source software have simplified the creation of deepfakes, increasingly damaging our trust in online media and raising serious public concerns. To counteract the misuse of deepfake techniques and malicious attacks, detecting manipulations in facial images and video has become a hot topic in the media forensics community and has received increasing attention from both academia and industry.
Nowadays, multiple grand challenges, competitions, and public benchmarks [8,9,10] are organized to drive progress in deepfake detection. At the same time, building on advanced deep learning techniques and large-scale datasets, numerous detection methods [4, 11,12,13,14,15,16] have been published and have reported promising results on different datasets. However, several studies [17, 18] have shown that detection performance drops significantly in the cross-dataset scenario, where the fake samples are forged by unknown manipulation methods. Therefore, cross-dataset evaluation has become an important step in recent studies to better demonstrate the advantages of deepfake detection methods, encouraging researchers [19,20,21] to propose detection methods that generalize better to different types of manipulations.
Nevertheless, another scenario that commonly exists in the real world has received little attention from researchers. It has long been known that DCNN-based methods are vulnerable to real-world perturbations and processing operations [22,23,24] in different vision tasks. In realistic conditions, images and video can suffer unpredictable distortions from the extrinsic environment, such as noise and poor illumination, or undergo various processing operations that ease their distribution. For example, a deployed deepfake detector could mistakenly block a pristine yet heavily compressed image, while a malicious agent could fool the detector by simply adding imperceptible noise to fake media content. To the best of our knowledge, most current deep learning-based deepfake detection methods are developed on constrained and less realistic face manipulation datasets, and therefore they are not robust enough for real-world situations. Similarly, the conventional assessment approach adopted in various benchmarks often samples test data directly from the same distribution as the training data and can hardly reflect model performance in more complex situations. In fact, most existing deepfake detection methods only report their performance on a few well-known benchmarks in the community.
Therefore, a more reliable and systematic approach is needed to assess the performance of a deepfake detector in realistic scenarios and to further motivate researchers to develop robust detection methods. In this paper, a comprehensive assessment framework for deepfake detection in real-world conditions is conceived for both image and video deepfakes. Notably, realistic situations are simulated by applying common image and video processing operations to the test data. The performance of multiple deepfake detectors is measured under the impact of various real-world processing operations. In addition, a generic approach to improve the robustness of the detectors is proposed.
In summary, the following contributions have been made.
-
A realistic assessment framework is proposed to evaluate and benchmark the performance of learning-based deepfake detection systems. To the best of our knowledge, this is the first framework that systematically evaluates deepfake detectors in realistic situations.
-
The performance of several popular deepfake detection methods has been evaluated and analyzed with the proposed performance evaluation framework. The extensive results demonstrate the necessity and effectiveness of the assessment approach.
-
Inspired by the real-world data degradation process, a stochastic degradation-based augmentation (SDAug) method driven by typical image and video processing operations is designed for deepfake detection tasks. It brings remarkable improvement in the robustness of different detectors.
-
A flexible Python toolbox is developed and the source code of the proposed assessment framework is released to facilitate relevant research activities.
This article is an extended version of our recent publication [25]. The additional contents of this paper are summarized as follows.
-
More recent deepfake detection methods have been summarized and introduced in the related work section.
-
The proposed assessment framework has been extended to support the evaluation of video deepfake detectors.
-
The performance of two current state-of-the-art deepfake detection methods has been additionally evaluated using the assessment framework.
-
More substantial experimental results have been presented to better demonstrate the necessity and usage of the assessment framework. The performance and characteristics of four popular deepfake detection methods are analyzed in depth based on the assessment results.
-
The impact of different image compression operations on the performance of deepfake detectors is additionally studied in detail.
-
More experiments, comparisons, and cross-manipulation evaluations have been conducted for the proposed stochastic degradation-based augmentation method. Its effectiveness and limitations are further analyzed.
2 Related work
2.1 Deepfake detection
Deepfake detection is often treated as a binary classification problem in computer vision. Early on, solutions based on facial expressions [26], head movements [27], and eye blinking [28] were proposed to address the detection problem. In recent years, the primary solution has been to leverage advanced neural network architectures. Zhou et al. [29] proposed to detect deepfakes with a two-stream neural network. Rössler et al. [4] retrained an XceptionNet [30] on a manipulated face dataset, achieving the best results on their proposed benchmark. Nguyen et al. [
3.3 Assessment methodology
Current deepfake detection algorithms are based on deep learning and rely heavily on the distribution of the training data. These methods are typically evaluated on a test dataset drawn from the same distribution as the training sets. Some benchmarks, such as [8, 9], attempt to measure the performance of deepfake detectors under more realistic conditions by adding random perturbations to part of the test data and mixing it with unmodified samples. However, there is no standard approach for determining the proportion or strength of these perturbations, which makes the results of such benchmarks more stochastic and less reliable. The assessment methodology proposed in this paper aims to more thoroughly measure the impact of various influencing factors, at different severity levels, on the performance of deepfake detection algorithms.
In this section, the principle and usage of our assessment framework are introduced in detail. First, the deepfake detector is trained on its original target dataset, such as FaceForensics++ [4]. The processing operations and corruptions in the framework are not applied to the training data. Then, as illustrated in Fig. 3, multiple copies of the test set are created, and each type of distortion at one specific severity level is applied to a copy independently. The standard test data and the different distorted copies are fed to the deepfake detector in turn. Finally, the detector generates “real or fake” predictions. During the entire evaluation, the true positive rate (TPR) and false positive rate (FPR) are measured by constantly comparing the detector’s predictions with the binary ground-truth labels. The ROC curve is plotted and the Area Under the Curve (AUC) score is reported as the final metric. An overall evaluation score, obtained by averaging the scores across all distortion types and strength levels, reports the general performance of a tested detector. In addition, the computed metrics can be grouped by operation category to further analyze the robustness of a deepfake detector to a specific processing operation.
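As a sketch, the aggregation step described above can be expressed as follows; the function name and the data layout are illustrative assumptions for this paper's procedure, not the released toolbox's API.

```python
def aggregate_scores(auc_per_condition):
    """Aggregate per-condition AUC scores into the framework's summary
    metrics.  `auc_per_condition` maps a (distortion, severity) pair to
    the AUC measured on the test set processed by that distortion at
    that severity level (hypothetical layout for illustration)."""
    by_distortion = {}
    for (distortion, _severity), auc in auc_per_condition.items():
        by_distortion.setdefault(distortion, []).append(auc)
    # Per-category mean: robustness toward one specific operation.
    per_distortion = {d: sum(v) / len(v) for d, v in by_distortion.items()}
    # Overall mean across all distortion styles and strength levels.
    overall = sum(auc_per_condition.values()) / len(auc_per_condition)
    return per_distortion, overall
```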
In addition, to relieve the storage burden caused by the multiple copies of the test set, a Python toolbox is developed that addresses this problem in an online manner: it implements the digital processing operations directly and exposes the strength level as a parameter. It follows the same interface as the well-known Transforms module in the TorchVision toolbox and can be easily integrated into the evaluation process.
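A minimal sketch of such an online distortion transform is given below; the class name, the severity-to-sigma mapping, and the flat pixel-list representation are assumptions for illustration, not the actual toolbox interface.

```python
import random

class GaussianNoiseTransform:
    """Transforms-style callable: the distortion type is the class and
    the severity level is a constructor parameter, so no distorted
    copy of the test set needs to be stored on disk."""
    SIGMA = {1: 5, 2: 10, 3: 20, 4: 35, 5: 50}  # assumed severity map

    def __init__(self, severity, seed=None):
        self.sigma = self.SIGMA[severity]
        self.rng = random.Random(seed)

    def __call__(self, pixels):
        # `pixels` is a flat list of 0-255 intensities for simplicity;
        # a real implementation would operate on a PIL image or tensor.
        return [min(255, max(0, round(p + self.rng.gauss(0, self.sigma))))
                for p in pixels]
```

Such a transform can then be composed into a standard evaluation pipeline exactly like any TorchVision transform.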
4 Stochastic degradation-based augmentation
To improve the ability of deepfake detection methods to handle realistic distortions and pre-processing operations, an effective data augmentation approach is proposed.
Standard data augmentation methods often introduce geometric and color space transformations to enrich training data and improve model generalization. However, according to our experiments, this type of augmentation is less effective for deepfake detection under realistic conditions.
Motivated by a typical data acquisition and transmission pipeline in the real world, the stochastic degradation-based augmentation (SDAug) method is proposed. The main novelty of the proposed augmentation technique resides in the fact that it is driven by the typical operations that images and video are subject to in realistic conditions. Based on the observation of the data degradation process, a carefully designed augmentation chain is conceived, which allows the training data to better resemble real-world conditions and further boosts the performance of deepfake detection methods.
Generally, the brightness and contrast of an input image \(x\) are first modified by an image enhancement operator \(\mathrm{enh}\). Afterward, the image is convolved with a blurring kernel \(f\), followed by additive Gaussian noise \(n\). In the end, JPEG compression is applied to obtain the augmented training data \(x_{\text {aug}}\). The augmentation chain is described by the following formula:

\(x_{\text {aug}} = \mathrm{JPEG}\left(\mathrm{enh}(x) * f + n\right),\)

where \(*\) denotes the convolution operation.
In addition, unlike common data augmentation processes, the SDAug method is implemented in a stochastic manner. The term ‘stochastic’ can be interpreted in two ways. First, each of the aforementioned augmentation operations occurs with a certain probability in the augmentation chain. Second, each operation uses a random severity level for every frame. Realistic scenarios are complex, and a given image does not necessarily undergo every type of distortion and processing operation. A random mixture of several distortions and severity levels creates more diversity in the augmented training data. Moreover, stochastic augmentation helps preserve more information from the original training data and therefore prevents accuracy loss on high-quality data. The augmentation operations are explained in sequence as follows.
Enhancement: The augmentation chain begins with an image enhancement operation. With a probability of 50%, either a brightness or a contrast operation is applied to the training data, non-linearly modifying the image by a factor randomly selected from [0.5, 1.5].
Smoothing: An image blurring operation is then applied with a probability of 50%. Either a Gaussian or an average blur filter is used, with the kernel size varying in the range [3, 15].
Additive Gaussian noise: For each batch of training data, Gaussian noise is added with a probability of 30%. Its standard deviation varies randomly in the interval [0, 50].
JPEG compression: Finally, JPEG compression is applied with a probability of 70%. The corresponding quality factor is randomly chosen in the range [10, 95].
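Putting the four steps together, one realization of the stochastic chain can be sketched as below. The probabilities and parameter ranges follow the description above; the operation names are placeholders, and actually applying them to pixels (e.g. with PIL or OpenCV) is omitted to keep the sketch self-contained.

```python
import random

def sdaug_plan(rng):
    """Sample which SDAug operations fire for one frame and with what
    parameters; each frame draws a fresh plan, which is what makes the
    augmentation 'stochastic'."""
    ops = []
    if rng.random() < 0.5:  # enhancement, p = 0.5
        ops.append((rng.choice(["brightness", "contrast"]),
                    rng.uniform(0.5, 1.5)))
    if rng.random() < 0.5:  # smoothing, p = 0.5
        ops.append((rng.choice(["gaussian_blur", "average_blur"]),
                    rng.randrange(3, 16, 2)))  # odd kernel size in [3, 15]
    if rng.random() < 0.3:  # additive Gaussian noise, p = 0.3
        ops.append(("gaussian_noise", rng.uniform(0, 50)))
    if rng.random() < 0.7:  # JPEG compression, p = 0.7
        ops.append(("jpeg", rng.randint(10, 95)))
    return ops
```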
5 Experimental results
In this work, numerous experiments have been conducted to demonstrate the effectiveness and usage of the proposed assessment framework. The experimental setup will be described at the beginning of this section, followed by the substantial assessment results and analysis for both image and video scenarios. Then, the impact of three image compression technologies on deepfake detectors is further discussed as an example of the multiple applications of the framework. In the end, the effectiveness of the proposed augmentation technique is reported and analyzed.
5.1 Implementation details
5.1.1 Datasets
Two widely used face manipulation datasets are selected for extensive experimentation. For both datasets, the strict split suggested by the dataset provider is followed, so video used for training does not appear in the validation and testing stages.
FaceForensics++ [4], denoted by FFpp, contains 1000 pristine and 4000 manipulated video generated by four different deepfake creation algorithms. In addition, the raw video contents are compressed with two quality parameters using the AVC/H.264 codec, denoted as C23 and C40. In the experiments, the training set is denoted as FFpp-Raw, FFpp-C23, or FFpp-C40 when the model is trained on single-quality-level data, and as FFpp-Full when data of all three quality levels are used for training. In contrast, to provide a fair baseline, only uncompressed data are used for the final assessment.
Celeb-DFv2 [63] is another high-quality dataset, with 590 pristine celebrity video and 5639 fake video. The test data are selected as recommended by [63], while the rest is used for training, split 80%/20% into training and validation sets.
5.1.2 Detection methods
Experiments have been conducted with the following learning-based deepfake detectors, all of which have reported excellent performance on popular benchmarks.
Capsule-Forensics is a deepfake detection method based on a combination of capsule networks and CNNs. The capsule network was initially proposed by [31] to address some limitations of CNNs and uses far fewer parameters than a traditional CNN to train very deep neural networks. [11] employed the capsule network as a component in a deepfake detection pipeline for detecting manipulated images and video. At the time of its publication, this method achieved the best performance on the FaceForensics++ dataset compared to competing methods.
XceptionNet [30] is a popular CNN architecture for many computer vision tasks and has been used as a classification network to detect face manipulations. Rössler et al. [4] first adopted it as a baseline in the FaceForensics++ benchmark along with three other approaches. The detection system based on the XceptionNet architecture is first pre-trained on the ImageNet database [49] and then re-trained on a specific dataset for the deepfake detection task. It achieved excellent performance in the FaceForensics++ benchmark on both compressed and uncompressed contents and has become a popular baseline for recent deepfake detection approaches.
SBIs [21] refers to a data synthesis method, Self-blended Images, specially designed for deepfake detection tasks. The method generates hardly recognizable fake samples that contain common face forgery traces, encouraging the model to learn more general and robust representations for face forgery detection. The overall detection system is based on a pre-trained deep classification network, EfficientNet-b4 [64]. After retraining with the SBIs technique, the detector demonstrates impressive generalization to different unseen face manipulations and achieves the current state-of-the-art in cross-dataset settings. However, its robustness to common image and video processing operations has not been measured.
UIA-VIT [39] detects face forgery using the vision transformer technique. This approach trains an end-to-end pipeline that both classifies deepfake images and estimates the modified areas in an unsupervised manner. Overall, the UIA-VIT method focuses on intra-frame inconsistency without pixel-level annotations and achieves state-of-the-art generalization performance.
5.1.3 Training details
The Capsule-Forensics, XceptionNet, and UIA-VIT methods are trained with the Adam optimizer with \(\beta _1=0.9\), \(\beta _2=0.999\). Following the hyper-parameters suggested in the original papers, the Capsule-Forensics model is trained from scratch for 25 epochs with a learning rate of \(5 \times 10^{-4}\), the XceptionNet model is trained for 10 epochs with a learning rate of \(1 \times 10^{-3}\), and the UIA-VIT model is trained for 8 epochs with a learning rate of \(3 \times 10^{-5}\). During training, 100 frames are randomly sampled from each video in the training set. For evaluation and testing, 32 frames are extracted from each video in the validation and test sets. Extracted frames are pre-processed and cropped around the face regions using the dlib toolbox [65]. The face regions are finally resized to \(300 \times 300\) pixels before being fed to the network.
The SBIs method has a different experimental setting from the previous three methods. It is retrained with SAM [66] optimizer for 100 epochs. The batch size and learning rate are set to 32 and \(1 \times 10^{-3},\) respectively. During the training phase, only authentic high-quality video is used and the corresponding fake samples are created by their proposed self-blending method.
5.1.4 Performance metrics
During the evaluation, the Area Under the Receiver Operating Characteristic curve (AUC) is used as the metric in all experiments.
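For reference, the AUC can be computed directly from detector scores with the rank-based formulation below, which equals the area under the ROC curve; in practice a library routine such as scikit-learn's roc_auc_score would typically be used instead.

```python
def auc_score(scores, labels):
    """Probability that a fake sample (label 1) receives a higher
    detector score than a real one (label 0), with ties counted as
    half; this equals the area under the ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```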
5.2 Assessment results on realistic image deepfakes
In this section, the performance of the Capsule-Forensics, XceptionNet, and UIA-VIT methods is measured when facing more realistic image deepfakes produced by the assessment framework. The three deepfake detectors are trained on the original unaltered training sets of FFpp and Celeb-DFv2, respectively. The assessment framework further evaluates the performance of these detectors and summarizes the results as shown in Table 2 and Fig. 4.
In general, our findings lead to the following conclusions. First, even mild real-world processing operations can have a noticeable negative impact on detection accuracy. The first two detectors present exceptional performance on unaltered FFpp and Celeb-DFv2 testing data, as expected, but show severe performance deterioration on all kinds of modified data from the assessment framework, which indicates a lack of robustness. Although UIA-VIT is known for its outstanding generalization ability, it also suffers performance degradation under processing operations.
Second, the Capsule-Forensics and XceptionNet methods are affected differently by different types of perturbation. When trained on the same high-quality dataset, the Capsule-Forensics method is generally more robust to JPEG compression, synthetic noise, and gamma correction, while XceptionNet at times presents slightly better results that may not be statistically significant. The results from the assessment framework thus provide valuable guidance for improving a specific deepfake detector. Moreover, among the considered influencing factors, noise and blurring have the most prominent effect on deepfake detectors. The performance of both detectors deteriorates rapidly as the severity levels of these two distortions increase.
Finally, the impact of quality variants of the training data on learning-based detectors has been analyzed based on the assessment results. When trained only with very high-quality data (FFpp-Raw), both the Capsule-Forensics and XceptionNet models are extremely sensitive to nearly all kinds of realistic processing operations. Conversely, training the models with relatively low-quality data slightly improves robustness to low-intensity processing operations and distortions, but at a cost on the original high-quality test set. For example, both models trained with compressed data (FFpp-C23, FFpp-Full) show a higher AUC score on our realistic benchmark, but their performance on original unaltered data decreases by 0.5–1%. For UIA-VIT, however, although training with compressed data slightly improves robustness to compression and noise, it has a more negative impact on other processing operations.
5.3 Assessment results on realistic video deepfakes
In addition to images, the framework provides a comprehensive evaluation for the four detection methods, i.e. Capsule-Forensics, XceptionNet, SBIs, and UIA-VIT on video deepfakes under real-world conditions. Table 3 summarizes the performance of the four deepfake detection methods using the proposed realistic benchmark.
When trained with high-quality data, both the Capsule-Forensics and XceptionNet methods show a similar trend as in the previous image deepfake detection benchmark and perform poorly on pre-processed video deepfakes. The SBIs and UIA-VIT methods outperform the other two detectors and present relatively stable scores under most video processing operations, particularly the artifacts introduced by changing brightness or applying video filters.
However, when Capsule-Forensics and XceptionNet are trained directly on compressed data, they maintain higher robustness to multiple processing operations and even outperform the SBIs method, whose overall score instead decreases by 0.66%. On the other hand, none of the three methods can properly classify video deepfakes processed by heavy compression, resolution reduction, or video noise.
In addition to benchmarking overall performance, the assessment framework also provides the means to analyze the behavior of a method in one specific realistic situation and helps reveal the mechanism behind it. For instance, it is interesting to observe that, regardless of the training data, the SBIs method is more robust to geometric transformations than the other two and retains a good ability to accurately classify vertically flipped video. This is because the SBIs method is based on local forgery traces instead of global inconsistency of the face.
While the generalization problem is well-explored by synthetic data-based methods, how to improve robustness toward processing operations and distortions which exist in the real world is still an open question. This paper provides a systematic benchmarking approach that helps reveal the drawbacks of general deepfake detectors. For instance, although the SBIs method demonstrates a good generalization ability in cross-dataset experiments in their paper [21], our assessment framework shows that it is susceptible to some common perturbations in the real world, such as video compression, video noise, and low resolution.
5.4 Impact of different image coding algorithms
The assessment framework additionally provides means to measure the impact of a specific type of processing operation on the performance of a deepfake detector. For instance, image compression is almost inevitable during the distribution of a fake image. Meanwhile, AI-based compression technologies have become increasingly popular and are often capable of producing smaller bitstreams. However, it is unknown to what extent learning-based compression algorithms affect deepfake detection methods compared to conventional JPEG compression.
In this section, a detailed comparison is made between JPEG compression and two popular AI-based image compression methods, denoted bmshj [60] and hific [61], respectively. In detail, the Capsule-Forensics and XceptionNet methods are first trained on uncompressed data. Afterward, their performance on different compressed data is evaluated using the framework and reported in Fig. 5. Overall, image compression has a more negative impact on XceptionNet than on the Capsule-Forensics method; the latter obtains relatively high AUC scores when the test data are compressed by JPEG with high compression factors. Although the bmshj-based compression method is capable of achieving lower bitrates than JPEG, it significantly degrades both detectors, whose predictions are close to random guessing regardless of the selected compression factor. In contrast, both tested detectors are more robust to test data compressed with the hific codec than with JPEG or bmshj, even at extremely low bitrates. These results imply that the hific codec introduces fewer adversarial artifacts that can disrupt the functionality of AI-based detectors.
5.5 Experimental results with augmentation
Table 4 shows the evaluation results of the Capsule-Forensics and XceptionNet methods trained on the unaltered FFpp dataset together with the proposed augmentation strategy. Models trained with the proposed stochastic degradation-based augmentation method are denoted +SDAug.
In comparison, it is evident that training with the stochastic degradation-based augmentation technique on the same dataset remarkably improves performance on nearly all kinds of processed data, even at high severity levels. For example, previous experiments show that the detectors are most vulnerable to synthetic noise and blurring. The sub-figures in Figs. 6 and 7 further illustrate the impact of increasing the severity of these distortions on the two detection methods. The data augmentation scheme significantly improves robustness while still maintaining high performance on original unaltered data.
It is worth noting that the performance improves not only on the four types of processing operations that appear during data augmentation but also on other different kinds of distortions. As shown in Table 4 and the last two sub-figures in Figs. 6 and 7, both detectors are much more robust toward learning-based compression, low-resolution effects, and other mixed distortions. A similar observation is obtained from the video deepfake assessment framework, see Tables 5 and 6. Although these video processing operations are not present in the proposed augmentation chain, the SDAug technique brings performance improvement to the Capsule-Forensics and XceptionNet methods on nearly all kinds of processed video deepfakes.
To compare with conventional augmentation methods based on geometric and color space transformation, the well-known Augmix [67] augmentation technique is evaluated under the same realistic assessment framework. This method generates multiple augmentation chains that work in parallel by randomly applying transformations to the training data. As a result, Augmix brings limited improvements to the robustness of the detector compared to SDAug, see Table 4. Its overall performance is even worse than simply training with low-quality data, which implies that the traditional data augmentation method is less practical when facing real-world distortions.
To show the effectiveness of the stochastic mechanism, an extra model has been trained using the same degradation-based augmentation chain but without randomness, meaning the input data are processed by all the augmentation operations with a fixed strength level. The corresponding experimental results are also reported in Table 4 and Figs. 6 and 7, denoted as +DAug. The models trained with DAug improve performance on multiple kinds of processed data, but their AUC scores degrade heavily on the original unmodified data. In comparison, the model trained with SDAug shows a more significant robustness improvement while maintaining high performance on original high-quality data.
Cross-manipulation experiments on the FaceForensics++ (Raw) dataset with XceptionNet trained separately on four different types of manipulated data, namely DeepFakes, Face2Face, FaceSwap, and NeuralTextures. AUC (%) scores are compared between the XceptionNet model trained with and without the SDAug technique
Finally, cross-dataset evaluations have been conducted for the Capsule-Forensics and XceptionNet methods to evaluate the generalization ability of the models trained with the proposed augmentation technique. First, the two detectors are trained on the FFpp dataset but tested on the Celeb-DFv1 and Celeb-DFv2 test sets for frame-level AUC scores. Both methods obtain very low scores on the new datasets. In comparison, the proposed augmentation scheme brings a noticeable performance improvement for both detectors on the new datasets, showing its ability to improve generalization to unseen forged face content. Moreover, we conduct further cross-manipulation experiments on FaceForensics++, which consists of four types of manipulations, namely DeepFakes, Face2Face, FaceSwap, and NeuralTextures. Specifically, the XceptionNet model is trained on one type of manipulation and tested on the remaining three. The results in Fig. 8 show that the model trained with SDAug consistently achieves superior generalization performance.
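The cross-manipulation protocol described above can be sketched as follows; `train_and_eval` is a hypothetical callback that trains the detector on one manipulation type and returns its AUC on another.

```python
MANIPULATIONS = ["DeepFakes", "Face2Face", "FaceSwap", "NeuralTextures"]

def cross_manipulation(train_and_eval):
    """Train on each manipulation type in turn and test on the
    remaining three, yielding a 4 x 3 grid of AUC scores."""
    results = {}
    for train_m in MANIPULATIONS:
        for test_m in MANIPULATIONS:
            if test_m != train_m:
                results[(train_m, test_m)] = train_and_eval(train_m, test_m)
    return results
```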
5.6 Limitations and Future Work
The experiments carried out in this paper are mainly limited to video deepfakes or standard-quality image deepfakes. The detection of HD single-image deepfakes created by completely different methods, such as GANs, has not been evaluated with the proposed assessment framework. Although preliminary explorations have been done by previous work [68], there have been more advanced techniques recently to create HD single-image deepfakes, not only by GANs but also by Diffusion Models, and corresponding detection methods. It would be interesting to extend the assessment framework to be able to study the robustness of state-of-the-art HD image deepfake detectors.
On the other hand, although the proposed augmentation technique is in general very helpful in improving the robustness of deepfake detectors facing various real-world image and video processing operations, some limitations have been observed in the reported results. First, the augmentation chain is hand-designed and the selection of hyperparameters might not be optimal; the chain could be improved by conducting a parameter search with AutoML technology. Second, according to Table 5, the augmentation method generally provides limited help for the SBIs method, because SBIs is entirely based on synthetic data and the augmentation can corrupt the manually designed forgery traces. It could be promising to incorporate the proposed augmentation operations into the forgery data synthesis process to further improve the robustness of detectors based on synthetic forgery data.
6 Conclusion
Most of the current deepfake detection methods are designed to be as high performing as possible on specific benchmarks. But it has been shown that current assessment and ranking approaches employed in related benchmarks are less reliable and insightful. In this work, a more systematic performance assessment approach is proposed for deepfake detectors in realistic situations. To show the necessity and usage of the assessment framework, extensive experiments have been performed, where the robustness of four popular deepfake detectors is reported and analyzed. Furthermore, motivated by the assessment results, a new data augmentation chain based on a natural data degradation process has been conceived and shown to significantly improve the model’s robustness against distortions from various image and video processing operations. The effectiveness and limitations of the proposed augmentation method have been also discussed in detail.