Abstract
Detecting digital face manipulation in images and video has attracted extensive attention due to the potential risk to public trust. To counteract the malicious use of such techniques, deep learning-based deepfake detection methods have been deployed and have exhibited remarkable performance. However, the performance of such detectors is often assessed on benchmarks that hardly reflect real-world situations. For example, the impact of various image and video processing operations and typical workflow distortions on detection accuracy has not been systematically measured. In this paper, a more reliable assessment framework is proposed to evaluate the performance of learning-based deepfake detectors in more realistic settings. To the best of our knowledge, it is the first systematic assessment approach for deepfake detectors that not only reports the general performance under real-world conditions but also quantitatively measures their robustness toward different processing operations. To demonstrate the effectiveness and usage of the framework, extensive experiments and detailed analysis of four popular deepfake detection methods are further presented in this paper. In addition, a stochastic degradation-based data augmentation method driven by realistic processing operations is designed, which significantly improves the robustness of deepfake detectors.
1 Introduction
In recent years, the rapid development of deep convolutional neural networks (DCNNs) and easy access to large-scale datasets have led to significant progress on a broad range of computer vision tasks and, meanwhile, created a surge of new applications. For example, recent advances in generative adversarial networks (GANs) [6, 7] make it possible to change the expression, attributes, and even identity of a human face image; the resulting media are commonly referred to by the popular term ‘Deepfake’. The rapid development of such technologies and the wide availability of open-source software have simplified the creation of deepfakes, increasingly damaging our trust in online media and raising serious public concerns. To counteract the misuse of deepfake techniques and malicious attacks, detecting manipulations in facial images and video has become a hot topic in the media forensics community and has received increasing attention from both academia and industry.
Nowadays, multiple grand challenges, competitions, and public benchmarks [8,9,10] are organized to drive progress in deepfake detection. At the same time, building on advanced deep learning techniques and large-scale datasets, numerous detection methods [4, 11,12,13,14,15,16] have been published and have reported promising results on different datasets. However, several studies [17, 18] have shown that detection performance drops significantly in the cross-dataset scenario, where the fake samples are forged by unknown manipulation methods. Therefore, cross-dataset evaluation has become an important step in recent studies to better demonstrate the advantages of deepfake detection methods, encouraging researchers [19,20,21] to propose detection methods that generalize better to different types of manipulations.
Nevertheless, another scenario that commonly exists in the real world has received little attention from researchers. It has long been known that DCNN-based methods are vulnerable to real-world perturbations and processing operations [22,23,24] in different vision tasks. In realistic conditions, images and video can suffer unpredictable distortions from the extrinsic environment, such as noise and poor illumination, or undergo various processing operations that ease their distribution. For example, a deployed deepfake detector could mistakenly block a pristine yet heavily compressed image, while a malicious agent could fool the detector by simply adding imperceptible noise to fake media content. To the best of our knowledge, most current deep learning-based deepfake detection methods are developed on constrained and less realistic face manipulation datasets, and therefore they are not robust enough for real-world situations. Similarly, the conventional assessment approach adopted in various benchmarks often samples test data directly from the same distribution as the training data and can hardly reflect model performance in more complex situations. In fact, most existing deepfake detection methods only report their performance on a few well-known benchmarks in the community.
Therefore, a more reliable and systematic approach is needed to assess the performance of a deepfake detector in realistic scenarios and to further motivate researchers to develop robust detection methods. In this paper, a comprehensive assessment framework for deepfake detection in real-world conditions is conceived for both image and video deepfakes. Notably, realistic situations are simulated by applying common image and video processing operations to the test data. The performance of multiple deepfake detectors is measured under the impact of various real-world processing operations. In addition, a generic approach to improve the robustness of the detectors is proposed.
In summary, the following contributions have been made.
-
A realistic assessment framework is proposed to evaluate and benchmark the performance of learning-based deepfake detection systems. To the best of our knowledge, this is the first framework that systematically evaluates deepfake detectors in realistic situations.
-
The performance of several popular deepfake detection methods has been evaluated and analyzed with the proposed performance evaluation framework. The extensive results demonstrate the necessity and effectiveness of the assessment approach.
-
Inspired by the real-world data degradation process, a stochastic degradation-based augmentation (SDAug) method driven by typical image and video processing operations is designed for deepfake detection tasks. It brings remarkable improvement in the robustness of different detectors.
-
A flexible Python toolbox is developed and the source code of the proposed assessment framework is released to facilitate relevant research activities.
This article is an extended version of our recent publication [25]. The additional contents of this paper are summarized as follows.
-
More recent deepfake detection methods have been summarized and introduced in the related work section.
-
The proposed assessment framework has been extended to support the evaluation of video deepfake detectors.
-
The performance of two current state-of-the-art deepfake detection methods has been additionally evaluated using the assessment framework.
-
More substantial experimental results have been presented to better demonstrate the necessity and usage of the assessment framework. The performance and characteristics of four popular deepfake detection methods are analyzed in depth based on the assessment results.
-
The impact of different image compression operations on the performance of deepfake detectors is additionally studied in detail.
-
More experiments, comparisons, and cross-manipulation evaluations have been conducted for the proposed stochastic degradation-based augmentation method. Its effectiveness and limitations are further analyzed.
2 Related work
2.1 Deepfake detection
Deepfake detection is often treated as a binary classification problem in computer vision. Early on, solutions based on facial expressions [26], head movements [27], and eye blinking [28] were proposed to address the detection problem. In recent years, the primary solution has been to leverage advanced neural network architectures. Zhou et al. [29] proposed to detect deepfakes with a two-stream neural network. Rössler et al. [4] retrained an XceptionNet [30] on a manipulated face dataset, achieving the best results on their proposed benchmark. Nguyen et al. [
3.3 Assessment methodology
Current deepfake detection algorithms are based on deep learning and rely heavily on the distribution of the training data. These methods are typically evaluated on a test dataset drawn from the same distribution as the training sets. Some benchmarks, such as [8, 9], attempt to measure the performance of deepfake detectors under more realistic conditions by adding random perturbations to part of the test data and mixing it with unmodified samples. However, there is no standard approach for determining the proportion or strength of these perturbations, which makes the results of such benchmarks more stochastic and less reliable. The assessment methodology proposed in this paper aims to more thoroughly measure the impact of various influencing factors, at different severity levels, on the performance of deepfake detection algorithms.
In this section, the principle and usage of our assessment framework are introduced in detail. First, the deepfake detector is trained on its original target dataset, such as FaceForensics++ [4]. The processing operations and corruptions in the framework are not applied to the training data. Then, as illustrated in Fig. 3, multiple copies of the test set are created, and each type of distortion at one specific severity level is applied to a copy independently. The standard test data and the different distorted copies are fed to the deepfake detector in turn. Finally, the detector generates “real or fake” predictions. During the entire evaluation, the true positive rate (TPR) and false positive rate (FPR) are measured by constantly comparing the detector’s predictions with the binary ground-truth labels. The ROC curve is plotted and the Area Under the Curve (AUC) score is reported as the final metric. An overall evaluation score, obtained by averaging the scores across all distortion types and strength levels, reports the general performance of a tested detector. In addition, the computed metrics can be grouped by operation category to further analyze the robustness of a deepfake detector to a specific processing operation.
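As a sketch, the aggregation step described above can be expressed as follows; the function name and the data layout are illustrative assumptions for this paper's procedure, not the released toolbox's API.

```python
def aggregate_scores(auc_per_condition):
    """Aggregate per-condition AUC scores into the framework's summary
    metrics.  `auc_per_condition` maps a (distortion, severity) pair to
    the AUC measured on the test set processed by that distortion at
    that severity level (hypothetical layout for illustration)."""
    by_distortion = {}
    for (distortion, _severity), auc in auc_per_condition.items():
        by_distortion.setdefault(distortion, []).append(auc)
    # Per-category mean: robustness toward one specific operation.
    per_distortion = {d: sum(v) / len(v) for d, v in by_distortion.items()}
    # Overall mean across all distortion styles and strength levels.
    overall = sum(auc_per_condition.values()) / len(auc_per_condition)
    return per_distortion, overall
```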
In addition, to relieve the storage burden caused by the multiple copies of the test set, a Python toolbox is developed that addresses this problem in an online manner: it implements the digital processing operations directly and exposes the strength level as a parameter. It follows the same interface as the well-known Transforms module in the TorchVision toolbox and can be easily integrated into the evaluation process.
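A minimal sketch of such an online distortion transform is given below; the class name, the severity-to-sigma mapping, and the flat pixel-list representation are assumptions for illustration, not the actual toolbox interface.

```python
import random

class GaussianNoiseTransform:
    """Transforms-style callable: the distortion type is the class and
    the severity level is a constructor parameter, so no distorted
    copy of the test set needs to be stored on disk."""
    SIGMA = {1: 5, 2: 10, 3: 20, 4: 35, 5: 50}  # assumed severity map

    def __init__(self, severity, seed=None):
        self.sigma = self.SIGMA[severity]
        self.rng = random.Random(seed)

    def __call__(self, pixels):
        # `pixels` is a flat list of 0-255 intensities for simplicity;
        # a real implementation would operate on a PIL image or tensor.
        return [min(255, max(0, round(p + self.rng.gauss(0, self.sigma))))
                for p in pixels]
```

Such a transform can then be composed into a standard evaluation pipeline exactly like any TorchVision transform.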
4 Stochastic degradation-based augmentation
To improve the ability of deepfake detection methods to handle realistic distortions and pre-processing operations, an effective data augmentation approach is proposed.
Standard data augmentation methods often introduce geometric and color space transformations to enrich training data and improve model generalization. However, according to our experiments, this type of augmentation is less effective for deepfake detection under realistic conditions.
Motivated by a typical data acquisition and transmission pipeline in the real world, the stochastic degradation-based augmentation (SDAug) method is proposed. The main novelty of the proposed augmentation technique resides in the fact that it is driven by the typical operations that images and video are subject to in realistic conditions. Based on the observation of the data degradation process, a carefully designed augmentation chain is conceived, which allows the training data to better resemble real-world conditions and further boosts the performance of deepfake detection methods.
Generally, the brightness and contrast of an input image \(x\) are first modified by an image enhancement operator \(\mathrm{enh}\). Afterward, the image is convolved with a blurring kernel \(f\), followed by additive Gaussian noise \(n\). In the end, JPEG compression is applied to obtain the augmented training data \(x_{\text {aug}}\). The augmentation chain is described by the following formula:

\(x_{\text {aug}} = \mathrm{JPEG}\left(\mathrm{enh}(x) * f + n\right),\)

where \(*\) denotes the convolution operation.
In addition, unlike common data augmentation processes, the SDAug method is implemented in a stochastic manner. The term ‘stochastic’ can be interpreted in two ways. First, each of the aforementioned augmentation operations occurs with a certain probability in the augmentation chain. Second, each operation uses a random severity level for every frame. Realistic scenarios are complex, and a given image does not necessarily undergo every type of distortion and processing operation. A random mixture of several distortions and severity levels creates more diversity in the augmented training data. Moreover, stochastic augmentation helps preserve more information from the original training data and therefore prevents accuracy loss on high-quality data. The augmentation operations are explained in sequence as follows.
Enhancement: The augmentation chain begins with an image enhancement operation. With a probability of 50%, either a brightness or a contrast operation is applied to the training data, non-linearly modifying the image by a factor randomly selected from [0.5, 1.5].
Smoothing: An image blurring operation is then applied with a probability of 50%. Either a Gaussian or an average blur filter is used, with the kernel size varying in the range [3, 15].
Additive Gaussian noise: For each batch of training data, Gaussian noise is added with a probability of 30%. Its standard deviation varies randomly in the interval [0, 50].
JPEG compression: Finally, JPEG compression is applied with a probability of 70%. The corresponding quality factor is randomly chosen in the range [10, 95].
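Putting the four steps together, one realization of the stochastic chain can be sketched as below. The probabilities and parameter ranges follow the description above; the operation names are placeholders, and actually applying them to pixels (e.g. with PIL or OpenCV) is omitted to keep the sketch self-contained.

```python
import random

def sdaug_plan(rng):
    """Sample which SDAug operations fire for one frame and with what
    parameters; each frame draws a fresh plan, which is what makes the
    augmentation 'stochastic'."""
    ops = []
    if rng.random() < 0.5:  # enhancement, p = 0.5
        ops.append((rng.choice(["brightness", "contrast"]),
                    rng.uniform(0.5, 1.5)))
    if rng.random() < 0.5:  # smoothing, p = 0.5
        ops.append((rng.choice(["gaussian_blur", "average_blur"]),
                    rng.randrange(3, 16, 2)))  # odd kernel size in [3, 15]
    if rng.random() < 0.3:  # additive Gaussian noise, p = 0.3
        ops.append(("gaussian_noise", rng.uniform(0, 50)))
    if rng.random() < 0.7:  # JPEG compression, p = 0.7
        ops.append(("jpeg", rng.randint(10, 95)))
    return ops
```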
5 Experimental results
In this work, numerous experiments have been conducted to demonstrate the effectiveness and usage of the proposed assessment framework. The experimental setup will be described at the beginning of this section, followed by the substantial assessment results and analysis for both image and video scenarios. Then, the impact of three image compression technologies on deepfake detectors is further discussed as an example of the multiple applications of the framework. In the end, the effectiveness of the proposed augmentation technique is reported and analyzed.
5.1 Implementation details
5.1.1 Datasets
Two widely used face manipulation datasets are selected for extensive experimentation. For both datasets, the strict split suggested by the dataset provider is followed, so video used for training does not appear in the validation and testing stages.
FaceForensics++ [4], denoted by FFpp, contains 1000 pristine and 4000 manipulated video generated by four different deepfake creation algorithms. In addition, the raw video contents are compressed with two quality parameters using the AVC/H.264 codec, denoted as C23 and C40. In the experiments, the training set is denoted as FFpp-Raw, FFpp-C23, or FFpp-C40 when the model is trained on single-quality-level data, and as FFpp-Full when data of all three quality levels are used for training. In contrast, to provide a fair baseline, only uncompressed data are used for the final assessment.
Celeb-DFv2 [63] is another high-quality dataset, with 590 pristine celebrity video and 5639 fake video. The test data are selected as recommended by [63], while the rest is used for training, split 80%/20% into training and validation sets.
5.1.2 Detection methods
Experiments have been conducted with the following learning-based deepfake detectors, all of which have reported excellent performance on popular benchmarks.
Capsule-Forensics is a deepfake detection method based on a combination of capsule networks and CNNs. The capsule network was initially proposed by [31] to address some limitations of CNNs and uses far fewer parameters than a traditional CNN to train very deep neural networks. [11] employed the capsule network as a component in a deepfake detection pipeline for detecting manipulated images and video. At the time of its publication, this method achieved the best performance on the FaceForensics++ dataset compared to competing methods.
XceptionNet [30] is a popular CNN architecture for many computer vision tasks and has been used as a classification network to detect face manipulations. Rössler et al. [4] first adopted it as a baseline in the FaceForensics++ benchmark along with three other approaches. The detection system based on the XceptionNet architecture is first pre-trained on the ImageNet database [49] and then re-trained on a specific dataset for the deepfake detection task. It achieved excellent performance in the FaceForensics++ benchmark on both compressed and uncompressed contents and has become a popular baseline for recent deepfake detection approaches.
SBIs [21] refers to a data synthesis method, Self-blended Images, specially designed for deepfake detection tasks. The method generates hardly recognizable fake samples that contain common face forgery traces, encouraging the model to learn more general and robust representations for face forgery detection. The overall detection system is based on a pre-trained deep classification network, EfficientNet-b4 [64]. After retraining with the SBIs technique, the detector demonstrates impressive generalization to different unseen face manipulations and achieves the current state-of-the-art in cross-dataset settings. However, its robustness to common image and video processing operations has not been measured.
UIA-VIT [39] detects face forgery using the vision transformer technique. This approach trains an end-to-end pipeline that both classifies deepfake images and estimates the modified areas in an unsupervised manner. Overall, the UIA-VIT method focuses on intra-frame inconsistency without pixel-level annotations and achieves state-of-the-art generalization performance.
5.1.3 Training details
The Capsule-Forensics, XceptionNet, and UIA-VIT methods are trained with the Adam optimizer with \(\beta _1=0.9\), \(\beta _2=0.999\). Following the hyper-parameters suggested in the original papers, the Capsule-Forensics model is trained from scratch for 25 epochs with a learning rate of \(5 \times 10^{-4}\), the XceptionNet model is trained for 10 epochs with a learning rate of \(1 \times 10^{-3}\), and the UIA-VIT model is trained for 8 epochs with a learning rate of \(3 \times 10^{-5}\). During training, 100 frames are randomly sampled from each video in the training set. For evaluation and testing, 32 frames are extracted from each video in the validation and test sets. Extracted frames are pre-processed and cropped around the face regions using the dlib toolbox [65]. The face regions are finally resized to \(300 \times 300\) pixels before being fed to the network.
The SBIs method has a different experimental setting from the previous three methods. It is retrained with SAM [66] optimizer for 100 epochs. The batch size and learning rate are set to 32 and \(1 \times 10^{-3},\) respectively. During the training phase, only authentic high-quality video is used and the corresponding fake samples are created by their proposed self-blending method.
5.1.4 Performance metrics
During the evaluation, the Area Under the Receiver Operating Characteristic curve (AUC) is used as the metric in all experiments.
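For reference, the AUC can be computed directly from detector scores with the rank-based formulation below, which equals the area under the ROC curve; in practice a library routine such as scikit-learn's roc_auc_score would typically be used instead.

```python
def auc_score(scores, labels):
    """Probability that a fake sample (label 1) receives a higher
    detector score than a real one (label 0), with ties counted as
    half; this equals the area under the ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```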
5.2 Assessment results on realistic image deepfakes
In this section, the performance of the Capsule-Forensics, XceptionNet, and UIA-VIT methods is measured when facing more realistic image deepfakes produced by the assessment framework. The three deepfake detectors are trained on the original unaltered training sets of FFpp and Celeb-DFv2, respectively. The assessment framework further evaluates the performance of these detectors and summarizes the results as shown in Table 2 and Fig. 4.
In general, our findings lead to the following conclusions. First, even mild real-world processing operations can have a noticeable negative impact on detection accuracy. The first two detectors present exceptional performance on unaltered FFpp and Celeb-DFv2 testing data, as expected, but show severe performance deterioration on all kinds of modified data from the assessment framework, which indicates a lack of robustness. Although UIA-VIT is known for its outstanding generalization ability, it also suffers performance degradation under processing operations.
Second, the Capsule-Forensics and XceptionNet methods are affected differently by different types of perturbation. When trained on the same high-quality dataset, the Capsule-Forensics method is generally more robust to JPEG compression, synthetic noise, and gamma correction, while XceptionNet at times presents slightly better results that may not be statistically significant. The results from the assessment framework thus provide valuable guidance for improving a specific deepfake detector. Moreover, among the considered influencing factors, noise and blurring have the most prominent effect on deepfake detectors. The performance of both detectors deteriorates rapidly as the severity levels of these two distortions increase.
Finally, the impact of quality variants of the training data on learning-based detectors has been analyzed based on the assessment results. When trained only with very high-quality data (FFpp-Raw), both the Capsule-Forensics and XceptionNet models are extremely sensitive to nearly all kinds of realistic processing operations. Conversely, training the models with relatively low-quality data slightly improves robustness to low-intensity processing operations and distortions, but at a cost on the original high-quality test set. For example, both models trained with compressed data (FFpp-C23, FFpp-Full) show a higher AUC score on our realistic benchmark, but their performance on original unaltered data decreases by 0.5–1%. For UIA-VIT, however, although training with compressed data slightly improves robustness to compression and noise, it has a more negative impact on other processing operations.
5.3 Assessment results on realistic video deepfakes
In addition to images, the framework provides a comprehensive evaluation for the four detection methods, i.e. Capsule-Forensics, XceptionNet, SBIs, and UIA-VIT on video deepfakes under real-world conditions. Table 3 summarizes the performance of the four deepfake detection methods using the proposed realistic benchmark.
When trained with high-quality data, both the Capsule-Forensics and XceptionNet methods show a similar trend as in the previous image deepfake detection benchmark and perform poorly on pre-processed video deepfakes. The SBIs and UIA-VIT methods outperform the other two detectors and present relatively stable scores under most video processing operations, particularly the artifacts introduced by changing brightness or applying video filters.
However, when Capsule-Forensics and XceptionNet are trained directly on compressed data, they maintain higher robustness to multiple processing operations and even outperform the SBIs method, whose overall score instead decreases by 0.66%. On the other hand, none of the three methods can properly classify video deepfakes processed by heavy compression, resolution reduction, or video noise.
In addition to benchmarking overall performance, the assessment framework also provides the means to analyze the behavior of a method in one specific realistic situation and helps reveal the mechanism behind it. For instance, it is interesting to observe that, regardless of the training data, the SBIs method is more robust to geometric transformations than the other two and retains a good ability to accurately classify vertically flipped video. This is because the SBIs method is based on local forgery traces instead of global inconsistency of the face.
While the generalization problem is well-explored by synthetic data-based methods, how to improve robustness toward processing operations and distortions which exist in the real world is still an open question. This paper provides a systematic benchmarking approach that helps reveal the drawbacks of general deepfake detectors. For instance, although the SBIs method demonstrates a good generalization ability in cross-dataset experiments in their paper [21], our assessment framework shows that it is susceptible to some common perturbations in the real world, such as video compression, video noise, and low resolution.
5.4 Impact of different image coding algorithms
The assessment framework additionally provides means to measure the impact of a specific type of processing operation on the performance of a deepfake detector. For instance, image compression is almost inevitable during the distribution of a fake image. Meanwhile, AI-based compression technologies have become increasingly popular and are often capable of producing smaller bitstreams. However, it is unknown to what extent learning-based compression algorithms affect deepfake detection methods compared to conventional JPEG compression.
In this section, a detailed comparison is made between JPEG compression and two popular AI-based image compression methods, denoted bmshj [60] and hific [61], respectively. In detail, the Capsule-Forensics and XceptionNet methods are first trained on uncompressed data. Afterward, their performance on different compressed data is evaluated using the framework and reported in Fig. 5. Overall, image compression has a more negative impact on XceptionNet than on the Capsule-Forensics method; the latter obtains relatively high AUC scores when the test data are compressed by JPEG with high compression factors. Although the bmshj-based compression method is capable of achieving lower bitrates than JPEG, it significantly degrades both detectors, whose predictions are close to random guessing regardless of the selected compression factor. In contrast, both tested detectors are more robust to test data compressed with the hific codec than with JPEG or bmshj, even at extremely low bitrates. These results imply that the hific codec introduces fewer adversarial artifacts that can disrupt the functionality of AI-based detectors.
5.5 Experimental results with augmentation
Table 4 shows the evaluation results of the Capsule-Forensics and XceptionNet methods trained on the unaltered FFpp dataset together with the proposed augmentation strategy. Models trained with the proposed stochastic degradation-based augmentation method are denoted +SDAug.
In comparison, it is evident that training with the stochastic degradation-based augmentation technique on the same dataset remarkably improves performance on nearly all kinds of processed data, even at high severity levels. For example, previous experiments show that the detectors are most vulnerable to synthetic noise and blurring. The sub-figures in Figs. 6 and 7 further illustrate the impact of increasing the severity of these distortions on the two detection methods. The data augmentation scheme significantly improves robustness while still maintaining high performance on original unaltered data.
It is worth noting that the performance improves not only on the four types of processing operations that appear during data augmentation but also on other different kinds of distortions. As shown in Table 4 and the last two sub-figures in Figs. 6 and 7, both detectors are much more robust toward learning-based compression, low-resolution effects, and other mixed distortions. A similar observation is obtained from the video deepfake assessment framework, see Tables 5 and 6. Although these video processing operations are not present in the proposed augmentation chain, the SDAug technique brings performance improvement to the Capsule-Forensics and XceptionNet methods on nearly all kinds of processed video deepfakes.
To compare with conventional augmentation methods based on geometric and color space transformation, the well-known Augmix [67] augmentation technique is evaluated under the same realistic assessment framework. This method generates multiple augmentation chains that work in parallel by randomly applying transformations to the training data. As a result, Augmix brings limited improvements to the robustness of the detector compared to SDAug, see Table 4. Its overall performance is even worse than simply training with low-quality data, which implies that the traditional data augmentation method is less practical when facing real-world distortions.
To show the effectiveness of the stochastic mechanism, an extra model has been trained using the same degradation-based augmentation chain but without randomness, meaning the input data are processed by all the augmentation operations with a fixed strength level. The corresponding experimental results are also reported in Table 4 and Figs. 6 and 7, denoted as +DAug. The models trained with DAug improve performance on multiple kinds of processed data, but their AUC scores degrade heavily on the original unmodified data. In comparison, the model trained with SDAug shows a more significant robustness improvement while maintaining high performance on original high-quality data.
Cross-manipulation experiments on the FaceForensics++ (Raw) dataset with XceptionNet trained separately on four different types of manipulated data, namely DeepFakes, Face2Face, FaceSwap, and NeuralTextures. AUC (%) scores are compared between the XceptionNet model trained with and without the SDAug technique
Finally, cross-dataset evaluations have been conducted for the Capsule-Forensics and XceptionNet methods to evaluate the generalization ability of the models trained with the proposed augmentation technique. First, the two detectors are trained on the FFpp dataset but tested on the Celeb-DFv1 and Celeb-DFv2 test sets for frame-level AUC scores. Both methods obtain very low scores on the new datasets. In comparison, the proposed augmentation scheme brings a noticeable performance improvement for both detectors on the new datasets, showing its ability to improve generalization to unseen forged face content. Moreover, we conduct further cross-manipulation experiments on FaceForensics++, which consists of four types of manipulations, namely DeepFakes, Face2Face, FaceSwap, and NeuralTextures. Specifically, the XceptionNet model is trained on one type of manipulation and tested on the remaining three. The results in Fig. 8 show that the model trained with SDAug consistently achieves superior generalization performance.
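The cross-manipulation protocol described above can be sketched as follows; `train_and_eval` is a hypothetical callback that trains the detector on one manipulation type and returns its AUC on another.

```python
MANIPULATIONS = ["DeepFakes", "Face2Face", "FaceSwap", "NeuralTextures"]

def cross_manipulation(train_and_eval):
    """Train on each manipulation type in turn and test on the
    remaining three, yielding a 4 x 3 grid of AUC scores."""
    results = {}
    for train_m in MANIPULATIONS:
        for test_m in MANIPULATIONS:
            if test_m != train_m:
                results[(train_m, test_m)] = train_and_eval(train_m, test_m)
    return results
```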
5.6 Limitations and Future Work
The experiments carried out in this paper are mainly limited to video deepfakes or standard-quality image deepfakes. The detection of HD single-image deepfakes created by completely different methods, such as GANs, has not been evaluated with the proposed assessment framework. Although preliminary explorations have been done by previous work [68], there have been more advanced techniques recently to create HD single-image deepfakes, not only by GANs but also by Diffusion Models, and corresponding detection methods. It would be interesting to extend the assessment framework to be able to study the robustness of state-of-the-art HD image deepfake detectors.
On the other hand, although the proposed augmentation technique is in general very helpful in improving the robustness of deepfake detectors facing various real-world image and video processing operations, some limitations have been observed in the reported results. First, the augmentation chain is hand-designed and the selection of hyperparameters might not be optimal; the chain could be improved by conducting a parameter search with AutoML technology. Second, according to Table 5, the augmentation method generally provides limited help for the SBIs method, because SBIs is entirely based on synthetic data and the augmentation can corrupt the manually designed forgery traces. It could be promising to incorporate the proposed augmentation operations into the forgery data synthesis process to further improve the robustness of detectors based on synthetic forgery data.
6 Conclusion
Most of the current deepfake detection methods are designed to be as high performing as possible on specific benchmarks. But it has been shown that current assessment and ranking approaches employed in related benchmarks are less reliable and insightful. In this work, a more systematic performance assessment approach is proposed for deepfake detectors in realistic situations. To show the necessity and usage of the assessment framework, extensive experiments have been performed, where the robustness of four popular deepfake detectors is reported and analyzed. Furthermore, motivated by the assessment results, a new data augmentation chain based on a natural data degradation process has been conceived and shown to significantly improve the model’s robustness against distortions from various image and video processing operations. The effectiveness and limitations of the proposed augmentation method have been also discussed in detail.