1 Introduction

In recent years, the rapid development of deep convolutional neural networks (DCNNs) and easy access to large-scale datasets have led to significant progress on a broad range of computer vision tasks and, at the same time, created a surge of new applications. For example, recent generative adversarial networks (GANs) [6, 7] are capable of changing the expression, attributes, and even identity of a human face image; the results are popularly referred to as 'deepfakes'. The rapid development of such technologies and the wide availability of open-source software have simplified the creation of deepfakes, increasingly damaging trust in online media and raising serious public concerns. To counteract the misuse of deepfake techniques and malicious attacks, detecting manipulations in facial images and videos has become a hot topic in the media forensics community and has received increasing attention from both academia and industry.

Nowadays, multiple grand challenges, competitions, and public benchmarks [8, 9, 10] are organized to advance progress in deepfake detection. Meanwhile, building on advanced deep learning techniques and large-scale datasets, numerous detection methods [4, 11, 12, 13, 14, 15, 16] have been published and report promising results on various datasets. However, several studies [17, 18] have shown that detection performance drops significantly in the cross-dataset scenario, where fake samples are forged by unknown manipulation methods. Cross-dataset evaluation has therefore become an important step in recent studies to better demonstrate the advantages of deepfake detection methods, encouraging researchers [19, 20, 21] to propose detectors that generalize better to different types of manipulations.

Nevertheless, another scenario that commonly exists in the real world has received little attention from researchers. It has long been shown that DCNN-based methods are vulnerable to real-world perturbations and processing operations [22, 23, 24] across different vision tasks. In realistic conditions, images and videos can suffer unpredictable distortions from the environment, such as noise and poor illumination, or undergo various processing operations that ease their distribution. In the context of this paper, a deployed deepfake detector could mistakenly block a pristine yet heavily compressed image. Conversely, a malicious agent could fool the detector simply by adding imperceptible noise to fake media content. To the best of our knowledge, most current deep learning-based deepfake detection methods are developed on constrained and less realistic face manipulation datasets, and they are therefore not robust enough for real-world situations. Similarly, the conventional assessment approach adopted by various benchmarks samples test data directly from the same distribution as the training data and can hardly reflect model performance in more complex situations. Indeed, most existing deepfake detection methods only report their performance on a few well-known benchmarks in the community.

Therefore, a more reliable and systematic approach is needed to assess the performance of deepfake detectors in realistic scenarios and to motivate researchers to develop robust detection methods. In this paper, a comprehensive assessment framework for deepfake detection in real-world conditions is conceived for both image and video deepfakes. Realistic situations are simulated by applying common image and video processing operations to the test data, and the performance of multiple deepfake detectors is measured under the impact of these operations. In addition, a generic approach to improve the robustness of the detectors is proposed.

In summary, the following contributions have been made.

  • A realistic assessment framework is proposed to evaluate and benchmark the performance of learning-based deepfake detection systems. To the best of our knowledge, this is the first framework that systematically evaluates deepfake detectors in realistic situations.

  • The performance of several popular deepfake detection methods has been evaluated and analyzed with the proposed performance evaluation framework. The extensive results demonstrate the necessity and effectiveness of the assessment approach.

  • Inspired by the real-world data degradation process, a stochastic degradation-based augmentation (SDAug) method driven by typical image and video processing operations is designed for deepfake detection tasks. It brings remarkable improvements in the robustness of different detectors (a minimal illustrative sketch of the idea is given after this list).

  • A flexible Python toolbox is developed and the source code of the proposed assessment framework is released to facilitate relevant research activities.
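
To give a rough idea of how such a degradation-based augmentation can be realized, the sketch below randomly applies one common processing operation at a random strength during training. The operation set, probability, and strength ranges shown here are illustrative assumptions only, not the actual SDAug configuration.

```python
import io
import random
import numpy as np
from PIL import Image, ImageFilter

def stochastic_degradation(image: Image.Image, p: float = 0.5) -> Image.Image:
    """Illustrative augmentation: with probability p, apply one randomly
    chosen processing operation at a random strength (assumed ranges)."""
    if random.random() > p:
        return image
    op = random.choice(["jpeg", "blur", "noise", "resize"])
    if op == "jpeg":
        # Re-encode as JPEG at a random quality factor
        buf = io.BytesIO()
        image.save(buf, format="JPEG", quality=random.randint(30, 90))
        buf.seek(0)
        return Image.open(buf).convert("RGB")
    if op == "blur":
        return image.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.5, 3.0)))
    if op == "noise":
        arr = np.asarray(image).astype(np.float32)
        arr += np.random.normal(0.0, random.uniform(2.0, 15.0), arr.shape)
        return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    # "resize": down-scale then up-scale to simulate resolution loss
    w, h = image.size
    s = random.uniform(0.25, 0.75)
    return image.resize((int(w * s), int(h * s))).resize((w, h))
```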

This article is an extended version of our recent publication [25]. The additional contents of this paper are summarized as follows.

  • More recent deepfake detection methods have been summarized and introduced in the related work section.

  • The proposed assessment framework has been extended to support the evaluation of video deepfake detectors.

  • The performance of two current state-of-the-art deepfake detection methods has been additionally evaluated using the assessment framework.

  • More substantial experimental results have been presented to better demonstrate the necessity and usage of the assessment framework. The performance and characteristics of four popular deepfake detection methods are analyzed in depth based on the assessment results.

  • The impact of different image compression operations on the performance of deepfake detectors is additionally studied in detail.

  • More experiments, comparisons, and cross-manipulation evaluations have been conducted for the proposed stochastic degradation-based augmentation method. Its effectiveness and limitations are further analyzed.

2 Related work

2.1 Deepfake detection

Deepfake detection is often treated as a binary classification problem in computer vision. Early on, solutions based on facial expressions [26], head movements [27], and eye blinking [28] were proposed to address the detection problem. In recent years, the primary approach has been to leverage advanced neural network architectures. Zhou et al. [29] proposed to detect deepfakes with a two-stream neural network. Rössler et al. [4] retrained an XceptionNet [30] on a manipulated face dataset, which outperforms other methods on their proposed benchmark. Nguyen et al. [

Fig. 3 Workflow of the proposed assessment framework. Distortions caused by processing operations are first applied to the test data separately. The corresponding predictions of the deepfake detector are compared with the ground-truth labels ("real" or "fake")

3.3 Assessment methodology

Current deepfake detection algorithms are based on deep learning and rely heavily on the distribution of the training data. These methods are typically evaluated on a test set drawn from the same distribution as the training set. Some benchmarks, such as [8, 9], attempt to measure the performance of deepfake detectors under more realistic conditions by adding random perturbations to part of the test data and mixing them with unperturbed samples. However, there is no standard approach for determining the proportion or strength of these perturbations, which makes the results of such benchmarks more stochastic and less reliable. The assessment methodology proposed in this paper aims to measure more thoroughly the impact of various influencing factors, at different severity levels, on the performance of deepfake detection algorithms.

In this section, the principle and usage of our assessment framework are introduced in detail. First, the deepfake detector is trained on its original target dataset, such as FaceForensics++ [4]; the processing operations and corruptions in the framework are not applied to the training data. Then, as illustrated in Fig. 3, multiple copies of the test set are created, and each type of distortion at one specific severity level is applied to a copy independently. The standard test data and each distorted copy are then fed to the deepfake detector in turn, and the detector generates "real or fake" predictions. Throughout the evaluation, the true positive rate (TPR) and false positive rate (FPR) are measured by comparing the detector's predictions with the binary ground-truth labels. The ROC curve is plotted and the area under the curve (AUC) score is reported as the final metric. An overall evaluation score is obtained by averaging the scores over all distortion types and strength levels to report the general performance of a tested detector. In addition, the computed metrics can be grouped by operation category to further analyze the robustness of a deepfake detector against a specific processing operation.
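
As a concrete illustration of this protocol, the following sketch loops over every (operation, severity) pair, scores the corresponding distorted copy of the test set, and averages the per-condition AUC scores into an overall score. The function and variable names are hypothetical, and scikit-learn's roc_auc_score is used for the AUC computation; this is not the toolbox's actual interface.

```python
from itertools import product
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical names: `distortions` maps an operation name to a callable that
# corrupts a batch of test images at a given severity level, and `detector`
# returns a "fake" probability for each image.
def evaluate(detector, test_images, labels, distortions, severity_levels=(1, 2, 3, 4, 5)):
    scores = {}
    # Reference score on the clean (unprocessed) test set
    scores[("clean", 0)] = roc_auc_score(labels, detector(test_images))
    # One evaluation run per (operation, severity) pair
    for name, severity in product(distortions, severity_levels):
        distorted = distortions[name](test_images, severity)
        scores[(name, severity)] = roc_auc_score(labels, detector(distorted))
    overall = float(np.mean(list(scores.values())))
    return scores, overall
```

Per-operation robustness can then be reported by averaging the entries of `scores` over the severity levels of a single operation, as described above.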

In addition, to relieve the storage burden caused by the multiple copies of the test set, a Python toolbox is developed that applies the distortions in an online manner, implementing the digital processing operations with the strength level as a parameter. It follows the same format as the well-known Transforms module in the torchvision toolbox and can easily be integrated into the evaluation process.
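
For instance, a distortion can be wrapped as a callable object whose constructor takes the severity level, in the same spirit as torchvision transforms. The class below is a hypothetical sketch (JPEG re-encoding via Pillow with an assumed five-level quality mapping), not the actual toolbox code.

```python
import io
from PIL import Image

class JPEGCompression:
    """Callable transform that re-encodes a PIL image as JPEG at a quality
    factor determined by the severity level (1 = mild, 5 = severe)."""

    QUALITY = {1: 85, 2: 65, 3: 45, 4: 25, 5: 10}  # assumed mapping

    def __init__(self, severity: int = 1):
        self.quality = self.QUALITY[severity]

    def __call__(self, img: Image.Image) -> Image.Image:
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=self.quality)
        buf.seek(0)
        return Image.open(buf).convert("RGB")
```

Such a transform can be composed with the usual preprocessing pipeline (e.g., inside `transforms.Compose`) so that distorted test copies are generated on the fly instead of being stored on disk.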