1 Introduction

There is a paradigm shift in how people consume news today. Readers mostly look for summarized versions of news on social media platforms to gather information quickly [1]. This change is driven by the easy access to news readily available on social media platforms like Twitter and Facebook. Taking advantage of this inevitable dependency, people with malicious intent use these platforms to spread fake images. Fake images are digitally manipulated images that undergo multiple alterations. Morphed images, in which a person's face is replaced with another person's face, are a prime example. Nowadays, fake images are widely used to propagate narratives or political propaganda. The Norwegian Media Authority [2] conducted a survey on fake information about the coronavirus. The study concluded that social media, mostly microblogging platforms, were the most significant contributors to the spread of false information.

Similarly, a survey conducted by CIGI-IPSOS and the Internet Society [3] showed that Facebook and Twitter are the top two platforms for spreading fake news. Fake images and videos are the key material in the broadcast of fake news, and fake images gather more attention than text [4]. Fake images and videos have had grave impacts. Global tech giants like Adobe, Facebook, and Google are investing in developing artificial intelligence (AI) applications to counter the fake images and videos flooding the internet.

Figure 1a shows a digitally altered fake image of a child with three eyes, and Fig. 1b displays a morphed image of Zuckerberg with Prime Minister Narendra Modi. These fake photos went viral on social media platforms and were circulated across the globe. Some fake images do no harm, but others, like the hurricane Sandy images, instilled fear among citizens [5]. Thus, there is a need to build solutions to detect fake images on microblogging platforms.

Fig. 1 Examples of fake images: a child with three eyes (copy-and-move technique); b Zuckerberg with Modi (image splicing technique) (a and b are from Boomlive)

1.1 Motivation

The use of fake images in fake news has increased, as their impact is greater than that of text. There are psychological reasons why images change the way humans remember and consume information [4]. A similar observation was outlined in Adobe's 2015 State of Content survey, which showed that posts with images received three times more interaction than posts with only textual messages [6]. A survey conducted by the activist group Avaaz showed that Facebook causes most public health threats by sharing significant health misinformation [7]. Therefore, there is a critical need to create solutions to spot fake images on social media. The need of the hour is to check the proliferation of tampered images and mitigate their impact on people.

Sharma and Sharma [8] described various methods for detecting fake images. Conventional hand-crafted image forensic methods do not fare well in identifying manipulated images on social platforms. More recently, multi-modal approaches have been used, combining multiple content and context types such as text, visual, statistical, user profile, and network propagation features to detect fake news. Among these, multi-modal frameworks using image and text have fared somewhat better than the others [9, 10].

Our paper proposes a multi-modal approach that utilizes new and upgraded models to detect fake images shared over social media platforms. Many well-known convolutional neural network (CNN) models, such as ResNet and InceptionNet, are available for image classification; these pre-trained models are trained on millions of images from the ImageNet dataset. A newer model, EfficientNetB0, has shown better accuracy in image classification with fewer parameters and lower FLOPS than other CNN models like ResNet34, ResNet50, and InceptionNet-v2 [11]. Our model employs EfficientNetB0 for learning the inherent features of fake images. In some cases of fake news, the images are authentic but out of context; thus, text analysis is also required for fake image detection. The proposed model uses the bidirectional encoder-based sentence transformer RoBERTa [12] for text analysis. Bidirectional Encoder Representations from Transformers (BERT) has been widely used in text classification, and the creators of RoBERTa have shown that it outperforms BERT and tends to understand context better [12]. Therefore, our proposed multi-modal framework utilizes EfficientNetB0 and RoBERTa for image and text analysis, respectively.

In summary, the critical points of this paper are as follows:

  • Developing a practical multi-modal deep learning framework for the detection of fake images shared over social media platforms.

  • The model applies error level analysis (ELA) images instead of regular images for image learning, which helps deep learning models to converge faster and have better accuracy.

  • The model introduces the novel use of EfficientNet for images and an optimized sentence transformer for text analysis within a multi-modal approach.

  • Studying and analysing how earlier Twitter datasets have changed by analysing a new Twitter dataset containing images shared in India [13].

The model can be used by fact-checking websites across the world to move towards automated marking of fake news and fake images in posts shared over microblogging websites. Currently, a lot of manpower is required for this detection work. Moreover, with automation, more content can be generated on these websites, since at present only a limited set of viral news items and images is selected for fact-checking. Another use case is applying these models directly on social media platforms, in the form of browser extensions or mobile apps.

The remainder of this paper is organized as follows. Section 2 reviews related work on detecting fake images using various techniques. Section 3 outlines the proposed model framework explaining all three components. Section 4 shares information about the datasets, experimental results, and comparative analysis with other models. Section 5 concludes and provides direction towards future work.

2 Related work

Digital alterations over images can be done in various ways. Image splicing, copy and move, resampling, and compression are majorly used techniques. There are numerous software tools available, like Photoshop, GIMP, Pixlr, and Paint.net, for altering the images.

Detection of manipulated images can be done either by hand-crafting and learning forensic image features or by applying deep learning methods that learn the features by themselves.

2.1 Forensic methods

For detecting copy-and-move tampering, forensic approaches primarily use discrete cosine transform (DCT) and discrete wavelet transform (DWT) coefficients [14,15,16,17]. Other novel methods like multiscale WLD histograms [18] and fractional Zernike moments (FrZMs) [19] are also used. Similarly, for identifying image splicing, CFA-based methods [20, 21], discrete octonion cosine transforms (DOCT) [22], and histograms of gradients [23, 24] are applied. The problem with forensic techniques is that each technique is suited to an individual manipulation type. Various studies using forensic techniques achieved high detection accuracy when a single tampering method was applied to an image. However, when multiple tampering methods like rotation, resampling, mirroring, and compression were applied along with copy-and-move or image splicing on the same image, the accuracy suffered [17,18,19, 22, 24]. Fake images shared over social media platforms typically undergo multiple tampering operations, and their quality is further deteriorated by added noise. Thus, forensic techniques are not an optimal option.

2.2 Single modality

Another approach uses deep learning frameworks with a single modality, where a single content type, such as image, text, context, or user profile, is used to classify information on the social platform as fake or real. Huang et al. [25] proposed a spatial–temporal structural neural network framework to model message spread from temporal and spatial perspectives for rumour detection. It worked well for rumours, but the propagation of fake images was not considered. A single SRM-CNN-based model was suggested by Rao and Ni [26] for the detection of fake images, and other hybrid CNN models were proposed later [27,28,29]. Mangal and Sharma [30] used the cosine similarity index between the text over images and the headline text to identify fake images, using a CNN-LSTM framework. Singh and Sharma [31] used a custom CNN model with high-pass filters for fake image detection over social platforms. Johnston et al. [32] proposed a CNN model to spot and localize tampered regions in manipulated videos; the model used a CNN to estimate the quantization parameter, intra/inter mode, and deblocking settings of pixel patches in videos to identify and mark tampered regions. Ghanem et al. [33] proposed using a suspicious account's semantic and stylistic features to assess the credibility of the news generated from such accounts. On the contrary, Vishwakarma et al. [34] proposed web scraping and reverse image search for fake image detection. Kaliyar et al. [35] used a text-based modality. Wang and Chen [36] used an information credibility model and suggested a solution based on an online social network credibility evaluation behaviour model built on the SOR framework.

2.3 Multi-modal methods

Recently, research using multiple modalities has outperformed single-modality approaches [9, 10]. Jin et al. [37] integrated multiple content types and suggested a solution using a recurrent neural network (RNN) with an attention mechanism for combining visual, textual, and social-context features. Text and social context were first combined with an LSTM network into a fused representation, which was then joined with image features mined from a deep CNN. Wang et al. [38] proposed EANN (event adversarial neural networks) to detect fake news; it learns event-invariant characteristics, which assist fake-news detection on newly emerged events. The architecture comprises three major modules: the multi-modal feature extractor, the fake news detector, and the event discriminator. The multi-modal feature extractor generates the visual and textual features from posts, while the event discriminator eliminates event-specific features and keeps event-invariant features. Gupta et al. [39] proposed MVAE (multimodal variational autoencoder), an end-to-end network whose main task is to build an autoencoder model. It has three primary modules: encoder, decoder, and classifier. The model uses two streams, text and visual, whose respective features are learned in the encoder component, using a bidirectional LSTM for text features and VGG19 for image features. Cui et al. [40] presented a novel method, SAME (sentiment-aware multi-modal embedding), which incorporates users' hidden opinions from their comments into a unified deep multi-modal embedding framework for detecting forged news. Different networks handle the heterogeneous data, such as text, image, user profile, and publisher; an adversarial mechanism is then adopted to learn semantically meaningful spaces per data modality; and, in the last phase, the model uses a unique regularization loss to bring embeddings of relevant pairs closer. Zhou et al. [41] proposed the SAFE (similarity-aware fake news detection) framework, which computes the probability of false reports from text and visual learning separately, and then considers both probabilities along with the calculated similarity index between the text and visual content to classify a post as fake or not. Another prominent multi-modal framework proposed by other researchers is SpotFake [42].

However, the models mentioned above, which employ text and images, have certain drawbacks. First, they have low accuracy on social platform datasets ([38, 39, 42]). We hypothesize this is because the deep learning model learns the main features of the image and suppresses the manipulation features. The second drawback is that they rely on sub-activities, such as learning correlations across modalities, or on sub-tasks like an event discriminator or a domain classifier ([38,39,40,41]). This paper suggests an explicit multi-modal approach using text and visual content. It uses two streams, one each for text and image; the intrinsic features of the image and text are learnt separately and fused for the final classification. The proposed model has better accuracy than the above-stated state-of-the-art models. Table 1 summarizes the studies mentioned above, showing the features and techniques used, the datasets, and the resulting performance evaluation.

Table 1 Summary of former studies on fake image detection

In this research work, the authors have attempted to overcome the drawbacks mentioned above as follows. First, we employ the EfficientNetB0 model; EfficientNet utilizes inverted residual blocks and is highly optimized for image classification. Second, to inflate the manipulation features in an image, we use ELA images instead of regular images. Third, to improve the text analysis, a fine-tuned sentence transformer, RoBERTa, is employed, which yields better results on similar text by understanding context better. Last, the model does not depend on any sub-activities for prediction.

3 System design

This paper proposes an efficient approach to the problem of fake image detection using a multi-modal framework. The text modality is also considered to cover the case where the image is authentic but out of context with respect to the news shared. The proposed model takes both text and image modalities from the social media platform and passes them to their respective feature extraction channels.

The comprehensive architecture of the proposed model is illustrated in Fig. 2. It comprises three components:

  • Image feature learning—It learns the intrinsic features from the fake images.

  • Text feature learning—This layer learns the latent text features provided along with the fake images.

  • Classifier—Softmax is used as a classifier which classifies the image using the fused features.

Fig. 2 Architecture of the proposed system

Presuming that we have N training pairs, the model is \(M = \left\{ {\text{FS}}_{k} ,G_{k} \right\}^{N}_{k = 1}\), where FSk is the feature set from the text and image embeddings and Gk is the ground-truth label of the data. As the model is multi-modal, the features from both modalities are taken:

$$ {\text{FS}}_{k} = {\text{FS}}_{t} + {\text{FS}}_{i} $$
(1)

For extracting the latent features of the images, we use the recent lightweight CNN model EfficientNetB0 [11], a highly optimized variation of CNN. In the pre-processing phase, ELA-generated images are used instead of the regular images in the dataset; ELA highlights the compression features within an image. It has been noted that applying such an image processing filter helps to improve the generalization ability and expedite the convergence of deep learning networks [26]. The ELA images are passed through the pre-trained EfficientNetB0 model, and transfer learning is used to generate the image embeddings from the output of its third-to-last layer. The image embeddings are forwarded to two fully connected dense layers to learn the image features, represented as FSi. After pre-processing, the text is passed through the sentence transformer RoBERTa to generate the text embeddings for learning the text features. The text embeddings are forwarded to two fully connected dense layers, and the resulting text features are represented as FSt. After normalization, the image and text feature sets are concatenated and passed through two fully connected network (FCN) layers, where the feature vectors from both modalities are learned, before the final Softmax classifier predicts the probability that an image is fake. The learning of the model can be represented as below.

3.1 Image feature learning

At the pre-processing level, besides resizing, all the images are passed through the error level analysis (ELA) process. ELA is a forensic method that highlights the compression differences in an image. The fundamental concept behind ELA is that if an image has been tampered with and compressed, the compression levels within the image will not be uniform, and a significant difference in compression levels will be observed. ELA images prove beneficial as they subdue an image's main features and inflate the manipulation features; such a forensic pre-processing step helps the neural network learn and converge faster [26]. For learning the latent features of images, EfficientNetB0, a variant of deep convolutional neural network, is utilized. Very deep networks saturate, and their output accuracy ends up on par with shallower networks at a much higher computational cost. Hence, the EfficientNet team at Google Brain proposed a compound scaling formula for DNNs and designed EfficientNet. They verified multiple variations of the EfficientNet framework, from EfficientNetB0 to EfficientNetB7, and showed them to be more efficient than other well-known DNNs like ResNet-152, Inception-ResNet-V2, and NASNet-A.
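A minimal Python sketch of the ELA pre-processing step described above is given below, using Pillow; the JPEG re-save quality and brightness scaling are illustrative assumptions rather than the exact settings used in our experiments.

```python
from PIL import Image, ImageChops, ImageEnhance

def ela_image(path, quality=90):
    """Generate an error level analysis (ELA) image.

    The image is re-saved at a known JPEG quality; regions that were
    previously compressed differently (e.g. spliced patches) leave a
    different error level when the re-saved copy is subtracted from the
    original. The quality and brightness scaling here are illustrative.
    """
    original = Image.open(path).convert("RGB")
    original.save("resaved.jpg", "JPEG", quality=quality)
    resaved = Image.open("resaved.jpg")

    # Pixel-wise absolute difference exposes compression inconsistencies.
    diff = ImageChops.difference(original, resaved)

    # Stretch the usually faint differences so the network can see them.
    max_diff = max(ch_max for _, ch_max in diff.getextrema()) or 1
    return ImageEnhance.Brightness(diff).enhance(255.0 / max_diff)
```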

Their basic model, EfficientNetB0, outperforms many DNNs by achieving better accuracy with far fewer parameters and FLOPS in image classification [11]. Our proposed framework experimented with multiple variations from EfficientNetB0 to EfficientNetB5; the highest accuracy was achieved with EfficientNetB0.

The key architectural elements of EfficientNet are:

  • Swish activation—It is the multiplication of a linear and a sigmoid function. The Swish activation function has been shown to match or outperform the rectified linear unit (ReLU), especially in image classification [44]. Figure 3 illustrates the comparison between the ReLU and Swish activation functions. Swish's advantages stem primarily from the fact that it is bounded below, unbounded above, and non-monotonic. These attributes help it outperform ReLU in deep networks and avoid dead neurons in the neural network.

  • Inverted residual block (MBConv block)—These blocks form a shortcut between the beginning and end of a convolutional block. A traditional residual block has a wide → narrow → wide structure: there are a large number of channels at the input, which are compressed with a 1 × 1 convolution, and the number of channels is then increased again with a 1 × 1 convolution so that input and output can be added. In contrast, an inverted residual block follows a narrow → wide → narrow approach, hence the inversion: it first widens with a 1 × 1 convolution, then applies a 3 × 3 depthwise convolution, and finally uses a 1 × 1 convolution to reduce the number of channels so that input and output can be added (a minimal sketch of this block is given after this list).

  • Squeeze and excitation block—It is a way to give a weight to each channel instead of treating them all equally.
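A minimal Keras sketch of the Swish activation and a simplified inverted residual (MBConv) block is given below; the expansion ratio and filter counts are illustrative, and the real EfficientNet blocks additionally include squeeze-and-excitation, batch normalization, and drop-connect.

```python
import tensorflow as tf
from tensorflow.keras import layers

def swish(x):
    # Swish: x * sigmoid(x) -- smooth, non-monotonic, unbounded above.
    return x * tf.sigmoid(x)

def inverted_residual_block(x, expand_ratio=6, filters=16, kernel_size=3):
    """Simplified MBConv block (narrow -> wide -> narrow)."""
    in_channels = x.shape[-1]

    # 1x1 expansion (narrow -> wide)
    y = layers.Conv2D(in_channels * expand_ratio, 1, padding="same",
                      activation=swish)(x)
    # 3x3 depthwise convolution in the expanded space
    y = layers.DepthwiseConv2D(kernel_size, padding="same",
                               activation=swish)(y)
    # 1x1 projection back to a narrow representation
    y = layers.Conv2D(filters, 1, padding="same")(y)

    # Residual shortcut only when input and output shapes match
    if in_channels == filters:
        y = layers.Add()([x, y])
    return y
```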

Fig. 3 Comparison of graphs between ReLU and Swish activation functions [45]

Figure 4 illustrates the complete architecture of EfficientNetB0. The other variations, B1 to B7, have a similar architecture but are scaled according to the prescribed compound scaling formula.

Fig. 4 The architecture of the baseline network EfficientNet-B0, as provided by its authors [11]

To extract the image embeddings, all the pre-processed images are passed through EfficientNet-B0, and the output of the third-to-last layer is extracted; this output contains the image feature vectors.

The latent features of images can be modelled as:

$${\text{FS}}_{i} = ~\varnothing ~\left( {W_{{{\text{if}}}} {\text{FSi}}_{{{\text{effb}}0}} } \right) $$
(2)

Here, the activation function is represented by ∅, Wif denotes the weights of the third-to-last layer of EfficientNet-B0, and FSieffb0 is the output from the preceding layer.
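A minimal Keras sketch of this image feature learning branch is given below; the input resolution, choice of frozen layers, and dense layer widths are illustrative assumptions.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import EfficientNetB0

# Pre-trained EfficientNetB0 used as a frozen feature extractor.
# "Third-to-last layer" is taken literally; with the default top this is
# the global average pooling layer, giving a 1280-dimensional embedding.
base = EfficientNetB0(weights="imagenet")
base.trainable = False
embedder = Model(base.input, base.layers[-3].output)

# Two fully connected layers learn the image feature set FS_i.
# The input resolution and layer widths below are illustrative assumptions.
img_in = layers.Input(shape=(224, 224, 3))
x = embedder(img_in)
x = layers.Dense(256, activation="relu")(x)
fs_i = layers.Dense(128, activation="relu")(x)
image_branch = Model(img_in, fs_i, name="image_feature_learning")
```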

3.2 Text feature learning

For text analysis, the text data is pre-processed using NLP libraries to remove stopwords and to translate the text into English if it is in any other language. After pre-processing, the text is passed to the sentence transformer RoBERTa. Using a transformer avoids the vanishing and exploding gradient problems of RNNs. RoBERTa is a fine-tuned, robustly optimized version of the BERT base model. BERT, designed and proposed by Google, is an innovative self-supervised pre-training method that learns to predict deliberately hidden (masked) sections of text; it has shown remarkable results in text classification, especially in the analysis of Twitter tweets.

RoBERTa, designed by Facebook [46], uses a 50 K subword vocabulary compared to BERT's 30 K subwords. There are two main differences between RoBERTa and BERT. First, it uses dynamic masking instead of BERT's static masking. Second, it is trained without the next sentence prediction (NSP) objective; the results achieved without NSP are better than those with it.

The sentence embedding vectors obtained from RoBERTa are passed through two stacked, fully connected dense layers, so that these features can be concatenated with the image embeddings in the next phase.

The textual feature learning can be modelled as:

$$ {\text{FS}}_{t} = \varnothing \left( {W_{{{\text{tf}}}} {\text{FSt}}_{{{\text{st}}}} } \right) $$
(3)

Here, the activation function is denoted by ∅, Wtf denotes the weights of the last dense layer, and FStst is the output from the sentence transformer stacked layer.
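A minimal sketch of the text feature learning branch is given below, using the sentence-transformers library; the specific RoBERTa-based checkpoint, embedding size, and dense layer widths are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer
from tensorflow.keras import layers, Model

# A RoBERTa-based sentence transformer; the exact checkpoint is an
# assumption -- the paper only states that a RoBERTa sentence
# transformer is used for the text embeddings.
text_encoder = SentenceTransformer("all-distilroberta-v1")

def embed_texts(texts):
    # One fixed-size (768-d) sentence embedding per pre-processed tweet.
    return text_encoder.encode(texts, convert_to_numpy=True)

# Two stacked dense layers learn the text feature set FS_t.
# The embedding size and layer widths are illustrative assumptions.
txt_in = layers.Input(shape=(768,))
x = layers.Dense(256, activation="relu")(txt_in)
fs_t = layers.Dense(128, activation="relu")(x)
text_branch = Model(txt_in, fs_t, name="text_feature_learning")
```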

3.3 Classification

Before classification, we fuse the feature vectors obtained from the dense layers of the image and text streams. The two distinct feature sets, FSt and FSi, are fused into a vector of dimensionality 2p, denoted \({\text{FS}}_{k} \in {\mathbb{R}}^{2p}\). We denote the multi-modal feature extractor as FE(IP; ∅fe), where IP denotes the vectorized input data, ∅fe represents the set of parameters of the multi-modal extractor, and FE represents the overall mapping function. It is ensured that the vectors from both channels have the same dimension, and batch normalization is applied. The final feature set obtained by concatenating both modalities is represented as:

\({\text{IP}}_{k} = {\text{FS}}_{t} *{\text{FS}}_{i}\) (the combination of both feature sets)

$$ {\text{FS}}_{k} = {\text{FE}} \left( {{\text{IP}}; \varnothing_{{{\text{fe}}}} } \right) $$
(4)

After fusion, two dense layers are added to learn the combined feature vectors; the activation function used in these dense layers is tanh. The output from the dense layers is passed to the Softmax layer for classification. We represent the fake-image predictor obtained from Softmax as PR(FSk; ∅pr), where ∅pr represents the parameter set of the predictor and PR represents the mapping function. The Adam optimizer is used to optimize learning. The output of the predictor, ŷ, for the multi-modal input IPj represents the probability of the event being fake and can be represented as:

$$ \hat{y} = {\text{PR}}\left( {{\text{FE}}\left( {{\text{IP}}^{j} ; \varnothing_{{{\text{fe}}}} } \right); \varnothing_{{{\text{pr}}}} } \right) $$
(5)

The learning loss is calculated using categorical crossentropy. Let n be the number of samples and m the number of categories (m = 2 for our binary classification), with y′ the ground-truth distribution and y the predicted probabilities. The loss and its gradient with respect to the predictions are:

$$ {\text{Loss}}_{{{\text{pr}}}} \left( {\varnothing_{{{\text{fe}}}} ,\varnothing_{{{\text{pr}}}} } \right) = - \sum\limits_{i = 1}^{n} \sum\limits_{j = 1}^{m} y^{\prime}_{ij} \log y_{ij} , \qquad \frac{{\partial {\text{Loss}}_{{{\text{pr}}}} }}{{\partial y_{ij} }} = - \frac{{y^{\prime}_{ij} }}{{y_{ij} }} $$

For optimization of parameters \({\varnothing }_{\mathrm{fe}}\, {\text{and}}\, {\varnothing }_{\mathrm{pr}}\), we need to minimize the crossentropy classification loss, which is represented as below:

$$ \left( {\varnothing_{{{\text{fe}}}}^{*} , \varnothing_{{\text{pr }}}^{*} } \right) = \mathop {\min }\limits_{{\varnothing {\text{fe }}, \varnothing {\text{pr}}}} {\text{Loss}}_{{{\text{pr}}}} $$
(6)
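A minimal Keras sketch of the fusion and classification stage, reusing the illustrative image and text branches sketched in Sections 3.1 and 3.2, is given below; the layer widths remain assumptions, while the tanh dense layers, Softmax output, categorical crossentropy loss, and Adam optimizer follow the description above.

```python
from tensorflow.keras import layers, Model, optimizers

# Fusion of the image and text branches sketched above.
img_in = layers.Input(shape=(224, 224, 3))
txt_in = layers.Input(shape=(768,))
fs_i = image_branch(img_in)
fs_t = text_branch(txt_in)

# Batch-normalize each modality, then concatenate into FS_k (dimension 2p).
fused = layers.Concatenate()([layers.BatchNormalization()(fs_t),
                              layers.BatchNormalization()(fs_i)])

# Two dense layers with tanh learn the joint representation,
# followed by a Softmax predictor over {fake, real}.
x = layers.Dense(128, activation="tanh")(fused)
x = layers.Dense(64, activation="tanh")(x)
y_hat = layers.Dense(2, activation="softmax")(x)

model = Model([img_in, txt_in], y_hat)
model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```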

The summarized algorithm for the working of the proposed model is provided in Algorithm 1. ISk represents the input set and FSk the feature set; FSt and FSi represent the feature sets of text and images, respectively. The algorithm illustrates the steps followed by the model: the text and visual features are extracted on separate channels, FStk and FSik, for each image–text pair in the dataset, and the fusion of features is optimized against the loss until good accuracy is achieved.

4 Experiment results and analysis

In this section, we present the experimental results to evaluate the effectiveness of the proposed multi-modal model empirically. The section covers the datasets, the results compared with other multi-modal frameworks, and our study on the latest Twitter dataset. Three evaluation metrics were considered: accuracy, area under the ROC (receiver operating characteristic) curve (AUC), and F-score. Accuracy measures the proportion of samples that the model classifies correctly. AUC represents the degree of separability, i.e., how well the model can distinguish between classes. The F-score is the harmonic mean of Precision and Recall.
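A minimal sketch of how these three metrics can be computed with scikit-learn is given below; the variable names and decision threshold are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score

def evaluate(y_true, y_prob, threshold=0.5):
    """y_true: 0 = real, 1 = fake; y_prob: predicted probability of 'fake'."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_prob),   # area under the ROC curve
        "f_score": f1_score(y_true, y_pred),    # harmonic mean of P and R
    }
```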

4.1 Experimental setup

Images were resized to 300 × 300 pixels. The model was implemented with the Keras library on Google's TensorFlow framework, using a computer system with 32 GB RAM and an Nvidia GeForce RTX 2080 8 GB GPU. Selecting the right combination of hyperparameters required multiple iterations with different batch sizes and dropout probabilities; in each iteration, the number of possible combinations was reduced based on the previous iteration's performance. The Talos library, developed for automated hyperparameter tuning and model evaluation of deep learning networks, was used to conduct and evaluate the random search. The optimum results were achieved in 300 epochs with a batch size of 128. The Adam optimizer was used with a learning rate of \(10^{-4}\). Figure 5 illustrates the plot diagram of the proposed model. We performed each experiment by randomly dividing our dataset into 75% training, 10% testing, and 15% validation subsets. The final results were obtained when the highest accuracy was reached, with the accuracy metric used as the stopping criterion for the network.
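A minimal sketch of the data split and training call described above is given below; the array names are placeholders for the preprocessed inputs, and the Talos hyperparameter search itself is omitted.

```python
from sklearn.model_selection import train_test_split

# x_img, x_txt are the preprocessed ELA images and sentence embeddings,
# and y is one-hot encoded {fake, real}; the names are placeholders.
# 75% train / 25% rest, then the rest split into 15% validation / 10% test.
x_img_tr, x_img_rest, x_txt_tr, x_txt_rest, y_tr, y_rest = train_test_split(
    x_img, x_txt, y, train_size=0.75, random_state=42)
x_img_val, x_img_te, x_txt_val, x_txt_te, y_val, y_te = train_test_split(
    x_img_rest, x_txt_rest, y_rest, test_size=0.4, random_state=42)

model.fit([x_img_tr, x_txt_tr], y_tr,
          validation_data=([x_img_val, x_txt_val], y_val),
          epochs=300, batch_size=128)
```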

Fig. 5 Plot diagram of the proposed architecture with dimension information

4.2 Datasets

The experiments were conducted on three publicly available datasets. CASIA 2.0 [47] is an image-only dataset. MediaEval [48] and the Chinese Weibo [39] datasets are social media datasets consisting of images and text: MediaEval is a Twitter dataset, and Weibo comes from the Weibo microblogging platform of China.

4.2.1 CASIA 2.0

The dataset has 12,616 images: 7492 authentic images and 5124 tampered images. The images are altered by applying copy-move and image splicing techniques; cropping and resizing are also performed while tampering with the images.

4.2.2 MediaEval

This social media dataset has 193 cases of real images, 218 cases of fake images, and two altered videos, with about 6000 rumour and 5000 non-rumour tweets from 11 events. We observed that the tweets were in different languages, so we translated them using the googletrans library. A few tweets could not be translated through the API and were ignored.
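A minimal sketch of the translation step is given below, assuming the googletrans library; the error handling simply skips the few tweets that fail to translate.

```python
from googletrans import Translator

translator = Translator()

def translate_tweets(tweets):
    """Translate tweets to English, skipping the few that fail."""
    translated = []
    for tweet in tweets:
        try:
            translated.append(translator.translate(tweet, dest="en").text)
        except Exception:
            # A few tweets raised errors in the translation API; ignored.
            continue
    return translated
```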

4.2.3 Weibo

The Weibo dataset consists of data collected from the Xinhua News Agency, an authoritative news source in China, and Weibo, a Chinese microblogging website. The fake images and text from Weibo were collected over the period from 2012 to 2016. The dataset is verified by Weibo's official rumour-debunking system, which encourages everyday users to report suspicious tweets on Weibo; these are then examined by a reputable committee that classifies the suspicious posts as false or real. The posts were entirely in Chinese, so all were first translated into English. Some posts were too long for the sentence transformer, and those few were ignored.

4.2.4 Twitter Indian dataset v.01

To study the changing trends on the social platform Twitter, a new dataset was created from an Indian perspective [13]. We collected fake and authentic images shared over Twitter. The authors searched specifically for morphed/forged image news on Indian fact-checking websites; the corresponding news articles were then searched on the Twitter platform, and the images and tweets were collected from Twitter. The events covered are mainly from the politics and religion arenas, as they are the most targeted areas for fake images in India. The data was reviewed in two phases. First, all the collected news was verified against various well-known fact-checking websites active in India, namely Boomlive, Alt News, and India Today. Second, peer reviewers performed manual annotation, reviewing the images by going to the Twitter platform and cross-checking them. The dataset has a total of 110 such images: 61 images are fake, and 49 are authentic. All the events covered are from November 2019 to November 2020 and were shared over Twitter in India.

4.3 Experiment results

The initial experiment was conducted to select the EfficientNet variation for the proposed problem. The experiment was run with different variations of EfficientNet, from B0 to B5, and the best results were obtained with the base model, EfficientNetB0. We observed that, with the limited dataset and the low resolution of the images in the datasets (~ 300/400 pixels), EfficientNetB0 gave better results than the other EfficientNet variations; using more heavily scaled variations beyond B0 led to overfitting and reduced accuracy. Table 2 shows the accuracy values for the various EfficientNet variations.

Table 2 Experiment results on different variations of EfficientNet

We conducted the first experiment on the CASIA 2.0 dataset. As this dataset has only images, only the image channel with EfficientNetB0 was employed; the experiment showed an accuracy of 87.13% on the CASIA dataset. On the social media datasets MediaEval and Weibo, the accuracies were 85.3% and 81.2%, respectively. Table 3 provides the performance metrics for all the datasets. The accuracy on the CASIA dataset is higher because CASIA contains images manipulated with only a single manipulation type: the tampered images are manipulated with either image splicing or copy-and-move, and no noise or compression is applied. The images in MediaEval and Weibo are morphed using multiple tampering operations, including noise and compression. Furthermore, MediaEval and Weibo contain more human faces and buildings, while CASIA contains more nature images. Figure 6 compares the accuracies of each modality on the social media datasets; we obtained higher accuracy with the image modality than with the text modality.

Table 3 Performance metrics of the proposed model
Fig. 6 Accuracy comparison of the two modalities: a MediaEval; b Weibo

Table 4 compares our results with other benchmark multi-modal methods. Among the benchmark multi-modal frameworks, the MVAE [39] and SpotFake [42] models have good accuracy on social media datasets; both use text and visual content, like our proposed model. MVAE uses an additional variational autoencoder component to learn the similarities between the two modalities, which gives it an edge over earlier models like EANN. SpotFake, on the other hand, uses VGG19 for extracting the image feature vector and the BERT transformer for text feature extraction. Our results surpass MVAE by 10.8% and SpotFake by 7.6% on the MediaEval dataset. The better results are attributed to the following reasons. First, we employ EfficientNet for images, which works better than other models like VGG19 and ResNet on smaller datasets [11]. Second, we use error level analysis (ELA)-generated images rather than plain images; ELA images inflate the latent features of the compression that is typically applied to social media images, showing the disparities at image edges caused by differing compression levels. Highlighting the tampering features while subduing the main image features leads to faster and better learning in deep learning models. On the text side, RoBERTa also learns the context of short tweets better than the simple LSTM or BERT used in MVAE and SpotFake. As tweets are short texts with many similar words, word frequency plays a vital role in this detection; using an optimized sentence transformer for text-embedding generation therefore gives our model an advantage over the methods used in the other state-of-the-art models.

Table 4 Performance comparison-proposed model to other models

Over the Weibo dataset, the accuracy is slightly higher than that of most models and at par with a few of them. The somewhat lower accuracy compared with MediaEval is due to two reasons. First, the images in MediaEval relate more to natural disasters and nature scenes, while the Weibo dataset contains more images of people and human faces. Second, the translation is not very accurate because of the complexity of the Chinese text; moreover, the Weibo posts are long compared with the concise tweets on Twitter.

4.4 Error analysis

Some observations emerged from the wrongly detected fake images. High-resolution images in which only a small region was manipulated were not detected correctly, and uncompressed images accounted for a few of the failed cases. Also, images that were authentic but accompanied by largely irrelevant text posts were not correctly predicted.

4.5 Experiment classification graphs analysis

Figures 7 and 8 illustrate the metric graphs captured during the learning and validation phases of the experiment. The AUROC graph shows that the area under the curve, 0.87, is good and supports the model's accuracy. Another important graph is the Precision–Recall graph, which provides better information than the ROC graph when evaluating binary classification on imbalanced datasets. The proposed model has a high recall value, which is a positive sign in fake image classification; this is due to the additional text analysis that supports the image data.

Fig. 7 MediaEval dataset: a confusion matrix; b ROC curve; c Precision–Recall graph

Fig. 8 Weibo dataset: a confusion matrix; b ROC curve; c Precision–Recall graph

4.6 Performance over Indian dataset

Both the MediaEval and Weibo datasets are older datasets about specific events from the 2012–2016 timeframe, and the usage of social media platforms has changed in the last few years. So, as part of our research, we created the India-perspective dataset "Indian dataset v.01" over the Twitter platform. All news events in the dataset are from the period October 2019 to November 2020. The dataset comprises 110 photographs and their corresponding tweets. India has multiple languages, and people tweet in local languages; we have considered tweets in the English language only. The news is mostly from the politics, religion, and Bollywood arenas. Figure 9 shows some examples from the Indian dataset; all three images are morphed, with a tampered face, poster, or placard.

Fig. 9 Examples of fake images from the Indian dataset: a Kamala Harris morphed photo; b an altered Kamal Nath poster; c a morphed placard held by CPM party members

There are differences from previous Twitter datasets, and these are due to three primary reasons. First, the Twitter platform rules for tweets were updated: Twitter extended its 140-character limit to 280 characters in 2017, so people started writing longer tweets, and these long textual comments affect models that learn from short texts. This confirms the concern raised in [42]. Second, the latest manipulation software is far more capable: with advanced software, people can edit a very small area of an image, and identifying such small manipulated regions results in lower accuracy. Third, people's use of social platforms in India has evolved. Owing to its broad reach, Twitter is currently used as a complaint forum directed at elected representatives; therefore, several tweets were irrelevant to the image, as individuals shared their grievances and complaints in the tweet rather than commenting on the associated image.

When we ran our Indian dataset through the same model, an accuracy of only 58.3% was observed. Figure 10 shows the evaluation graphs for the Indian dataset. The low accuracy is due to the long textual posts, the majority of which are irrelevant to the images, and to the fact that only a small region of each image was tampered with, so the CNN model learnt the main image features. This shows that textual and image cues have changed with time, and models need to be continuously trained on new data to maintain accuracy.

Fig. 10 Indian Dataset v.01: a accuracy comparison across modalities; b ROC curve; c Precision–Recall graph

This calls for substantial new datasets of fake images to keep up with the changing technological advances in the digital platform industry; older databases will not work well with current social media network trends. A recently created dataset, Fakeddit [10], is sourced from Reddit, a web aggregator rather than a microblogging platform, and another new dataset, "New Politifact" [50], is also not from a microblogging platform. The CIGI-IPSOS survey observed that microblogging platforms spread more fake news content than other websites [3]. The authors are working on creating a new dataset based solely on Twitter images, considering the latest events of 2020 and 2021.

5 Conclusion

This paper has proposed an explicit deep learning-based multi-modal approach to detect fake images shared over social media platforms. Forensic methods have their limitations. In the proposed model, the visual and textual modalities are learned on separate channels and later fused to obtain the feature sets from both modalities; no additional components are required for understanding the correlation between the modalities. The model uses EfficientNet-B0 and the sentence transformer RoBERTa for extracting the features of images and text, respectively. ELA-generated images are used as input to the CNN model, which further supports the learning of image manipulations. EfficientNetB0 has also been verified against the CASIA 2.0 dataset, which comprises tampered images, and an accuracy of 87.13% is recorded over CASIA 2.0. The experiment has been tested against the Twitter and Weibo datasets, where accuracies of 85.3% and 81.2% are achieved; these results surpass previous state-of-the-art models. The research shows that a multi-modal deep learning model can detect fake images over social media platforms. We further created a new Twitter dataset using the latest 2020 events from an Indian perspective. The observation was that there are currently many changes in image and textual cues compared with the previous datasets, which lowers the accuracy of models trained on old data. This indicates a dire need to create social media image datasets based on the latest trends to keep up with the microblogging industry's changing trends.

The detection of satire images is not covered, and the proposed solution has not been verified against fake images generated by generative adversarial networks. The text written over the images is also not considered. These will be taken up in further research.