1 Introduction

Micro-blogging [10, 14, 36, 40] sites such as Twitter, Facebook and Instagram are helpful for collecting situational information [13] during disasters such as earthquakes, floods and disease outbreaks [25]. During these events, only a small number of the posted tweets are relevant to specific classes such as infrastructure damage, resources [6, 33] and service requests [24]; spam tweets, communal tweets and emotional content are posted as well [8, 16, 17, 19, 31, 38]. Therefore, powerful methodologies are required for detecting specific class tweets (such as Need and Availability of resources), so that relevant tweets can be detected automatically from a large set of tweets. The detection of specific class tweets [1, 11, 21, 35] has received much attention in the last two years, and it is likely to become more important in social media in the next few years. In particular, detecting the two types of tweets that contain information related to the Need and Availability of resources is a challenging task. During a disaster, victims post tweets stating where essential resources such as food, water, medical aid and shelter are needed. Similarly, humanitarian organizations post tweets stating where specific resources such as medical resources, food and water packets are available in the affected area. Examples of Need and Availability of Resource tweets are shown in Table 1. The first four tweets represent the need for resources such as mobile hospitals, password-free Wi-Fi, blood and ambulances. The next four tweets report the availability of resources such as the Italian Army providing services to earthquake victims, shelter tents, money and ambulances. Detecting Need and Availability of Resource tweets is therefore beneficial for both humanitarian organizations and victims during a disaster.

Table 1 Examples of need and availability of resource tweets

1.1 Objectives of this study

The main objective of this work is to assist victims and humanitarian organizations in the event of a disaster by designing a method for the automatic identification of Need and Availability of Resource (NAR) tweets from Twitter. The problem of detecting NAR tweets can be treated as a multi-class classification problem. The classes are (i) Need of resource tweet, (ii) Availability of resource tweet and (iii) neither of the two.

1.2 Prior work with limitations

Only a few existing works [1, 3, 11] focus on extracting the need and availability of resource tweets during a disaster. Most of them use information-retrieval methodologies such as word2vec or a combination of word embeddings and character embeddings. Specifically, the authors in [3] used both information-retrieval methodologies and classification methodologies (CNN with crisis word embeddings) to extract the Need and Availability of Resource tweets during a disaster. The main drawback of CNN with crisis embeddings is that it does not work well when the number of training tweets is small. In the case of information-retrieval methodologies, the keywords used to identify the need and availability of resource tweets must be supplied manually.

To overcome the above-mentioned issues, a novel method based on the stacking mechanism [44] is proposed to identify NAR tweets during a disaster. The stacking mechanism uses two levels of classifiers. The first level uses multiple classifiers, and their outputs are used as the input of the second level, which uses only one classifier. The stacking method does not produce improved results if the models used in it are stable. Therefore, diverse models, namely a CNN and a KNN classifier with domain-specific features, are used in this work. The CNN captures the semantic similarity between words, even when the words seen in the testing phase differ from those seen in training. To overcome the problem of a small number of training tweets, new features are proposed and used in the KNN classifier to detect NAR tweets. The two models (CNN and the KNN classifier with the proposed features) therefore detect tweets in different ways. The outputs of these two models are given as input to the SVM (second-level) classifier. The SVM classifier is trained to learn the relationship between the outputs of the CNN and KNN models. It gives the final prediction for a tweet: resource need, resource availability or neither. The efficacy of the final prediction depends on the classifiers used in level-1 and level-2. The reasons for selecting the KNN and SVM classifiers as first- and second-level classifiers are explained in Sections 4.4.2 and 4.5.2.

1.3 Contributions of this work

The main contributions are summarized as:

  1. A Stacked Convolutional Neural Network is proposed to automatically identify the need and availability of resource tweets during a disaster.

  2. Crisis word embeddings are used in a deep learning model, and domain-specific features are used in a feature-based classification method. Various classification algorithms such as SVM, Bagging, gradient boosting, random forest, KNN, Decision tree and Naive Bayes are also used.

  3. Extensive experiments are carried out on real Twitter datasets from the Nepal and Italy earthquakes of 2015 and 2016.

  4. The proposed model is compared with the existing methodologies using different parameters. In addition, statistical validation is performed to compare the methods using the McNemar test.

This paper is organized as follows. The second section examines the related work. The proposed approach for the detection of NAR tweets during a disaster is described in the third section. Experimental results and analysis are discussed in the fourth section. The last section is the conclusion of the paper.

2 Related work

In [2], the authors manually analyzed WhatsApp messages for the requirement of medical, human and infrastructural resources during a disaster, using the 2015 Nepal earthquake dataset as a case study. However, they did not propose an automatic method for identifying the resources. In [11], the authors found that neural retrieval models that integrate character-level and word-level embeddings with pattern recognition techniques perform better than state-of-the-art models. The authors applied information retrieval techniques for detecting the NAR tweets. In [7], the authors used a novel vector training approach for clustering tweets about emergency situations and compared their method with Bag-Of-Words (BOW), word2vec-sum and doc2vec. They noted that clustering of tweets helps to identify the different aspects of a topic in emergency situations. However, they did not propose a method for identifying the NAR tweets during a disaster.

3 Stacked convolutional neural network

The problem can be defined as follows: given a set of N tweets X = {x1, x2, x3, x4, ..., xN}, assign each tweet to one of three classes: (1) Need of the resource, (2) Availability of the resource and (3) None of the above. This section describes the stacked convolutional neural network for identifying the NAR tweets during a crisis. The overview of the proposed stacked convolutional neural network is shown in Fig. 1. The stacking mechanism [44] combines the predictions of diverse classifiers by learning the relationship between the models. Different classifiers make different prediction errors on the same data. For instance, some classifiers mispredict a sample while other classifiers predict the same sample correctly. Combining them increases the generalization ability of the model and reduces its misclassification rate, bias and variance. Stacking-based classifiers give higher performance than individual classifier models due to this generalization ability [42]. However, most resource detection systems focus on individual classifier models rather than ensemble methods (a combination of diverse classifiers). In this work, a stacked convolutional neural network is proposed for detecting resource tweets from social media during a disaster.

Fig. 1 The overview of the proposed stacked convolutional neural network

The proposed model consists of two phases of classifiers. In the first phase, the Convolutional Neural Network and the KNN classifier are used; they are referred to as base-level classifiers. The SVM classifier is used as the meta-level classifier in the second phase. Before the tweets are given as inputs to the base-level classifiers, the following pre-processing and feature extraction steps are performed:

3.1 Tokenization and pre-processing

  • All tweets are converted to lower case to avoid multiple copies of the same word.

  • Tweets are divided into words, which are referred to as tokens.

  • The user mentions (@users), hash-tags (#) and URLs are removed from the tweets.

  • Similarly, stop-words, numerals and unknown symbols are omitted from the tweets.
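The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function name `preprocess`, the tiny `STOP_WORDS` set and the regular expressions are assumptions, and the sketch assumes that a hash-tag is removed together with its word. The removals are applied before tokenization here for convenience.

```python
import re

# Hypothetical, deliberately tiny stop-word list (the paper does not name the list it used).
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "at", "of", "for", "to", "and"}

def preprocess(tweet):
    """Lower-case a tweet, strip URLs/mentions/hash-tags, tokenize and drop stop-words."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)                # remove @mentions and #hash-tags
    tokens = re.findall(r"[a-z]+", text)                # alphabetic tokens only: drops numerals and symbols
    return [tok for tok in tokens if tok not in STOP_WORDS]
```

For instance, calling `preprocess` on a made-up tweet such as "@RedCross Need 20 tents at Kathmandu #NepalQuake http://t.co/abc" leaves only the tokens "need", "tents" and "kathmandu".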

3.2 Feature extraction

For each tweet, two types of feature representation are generated using the following techniques:

3.2.1 Word embeddings

We used pre-trained crisis word embeddings to represent each word in a tweet as a 300-dimensional vector. The embeddings were trained with the word2vec tool on 52 million crisis-related tweets collected during 19 crisis events. The Continuous Bag Of Words (CBOW) architecture with negative sampling was used to generate the word embeddings.

3.2.2 Domain-specific features

The χ2 statistic feature selection algorithm [45] is used to extract the most informative words from tweets because it has already been shown to be one of the most efficient feature selection algorithms for text categorization. The SVM classifier is used with the χ2 statistic feature selection algorithm because the authors in [20] concluded that SVM with χ2 statistic feature selection performed better than other traditional methods. The extracted domain-specific features are shown in Table 2. The first, second and third columns are the serial number, features and information category, respectively.
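As an illustration of how a χ2 statistic ranks words, the following sketch scores each term by the standard 2×2 contingency form of the statistic, N(AD − CB)² / ((A+C)(B+D)(A+B)(C+D)), where A, B, C and D count tweets by term presence and class membership. The function names, the toy tweets and the class labels are assumptions made for this example; the paper does not give its implementation.

```python
def chi2_score(docs, labels, term, target):
    """Chi-squared statistic between presence of `term` and membership in class `target`."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term, in_class = term in tokens, label == target
        if has_term and in_class:
            a += 1          # term present, tweet in class
        elif has_term:
            b += 1          # term present, tweet not in class
        elif in_class:
            c += 1          # term absent, tweet in class
        else:
            d += 1          # term absent, tweet not in class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def top_terms(docs, labels, target, k):
    """Return the k terms with the highest chi-squared score for `target`."""
    vocab = sorted(set().union(*docs))
    return sorted(vocab, key=lambda t: chi2_score(docs, labels, t, target), reverse=True)[:k]

# Made-up toy data: each tweet is a set of tokens.
docs = [{"need", "water"}, {"need", "food"}, {"tents", "available"}, {"road", "blocked"}]
labels = ["need", "need", "availability", "none"]
```

On this toy data the word "need" occurs in every Need tweet and in no other tweet, so it receives the highest score for the Need class.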

Table 2 Proposed domain-specific features for identifying the NAR tweets

3.3 Base-level classifiers

The above two methods provide two feature vector representations for each tweet, which are given as input to the base-level classifiers, namely the CNN and KNN classifiers.

3.3.1 Convolutional neural network

CNN is suitable for extracting local and deep features from natural language. The authors of [12] have shown that CNN achieves good results in sentence classification. The authors in [34] proposed a convolutional-recursive deep model for 3D object classification that combines Convolutional and Recursive Neural Networks (RNN). The CNN layer learns low-level translation-invariant features that are fed into multiple fixed-tree RNNs to compose higher-order features. In [27], the authors have shown that CNN outperforms many traditional methods in biomedical text classification, especially for assigning medical subject headings to biomedical articles. CNN contains the following layers: the Embedding layer, Convolutional layer, Pooling layer and Dense layer.

Embedding Layer

It is the very first layer of the CNN. It takes a fixed number of words from the tweet as input and converts each word into the corresponding 300-dimensional crisis word vector. The resulting tweet representation is passed through a series of convolution and pooling operations to learn high-level feature representations.

Convolution and Pooling Layer

In the convolution layer, the new features ‘F’ are generated by applying a convolution kernel $U \in \mathbb{R}^{gd}$ to a window of g words (filter size), as shown in (1).

$$ F_{j} =f(U \cdot x_{j:j+g-1}+b) $$
(1)

Where $x_{j:j+g-1}$ is the concatenation of the input vectors $(x_{j}, x_{j+1}, \ldots, x_{j+g-1})$, ‘b’ is a bias term and ‘f’ is a non-linear activation function such as ‘sig’ or ‘tanh’. The filter is applied to every window of ‘g’ words to obtain a feature map $F \in \mathbb{R}^{n-g+1}$, which is shown in (2); each entry of the feature map is a feature computed as in (1). Different ‘g’ values (3, 4, 5) are used to capture different n-gram features from the tweet.

$$ F_{i}=\{f_{1}, f_{2}, \ldots, f_{n-g+1}\} $$
(2)

This process is repeated 100 times (100 filters) for each filter size, producing 100 feature maps that learn complementary features. Maximum pooling is then applied to each feature map.

$$ m=[\mu_{q}(F_{1}), \mu_{q}(F_{2}), \mu_{q}(F_{3})...........\mu_{q}(F_{N})] $$
(3)

where ‘μq(Fi)’ refers to the maximum pooling operation [4] applied to each window of ‘q’ features in the feature map ‘Fi’. Max-pooling reduces the output dimension while keeping the important features from each feature map.

After the maximum pooling operation, different feature vectors are generated from the convolution layer with filter sizes (3, 4, 5). Then, the concatenation operation is applied to the different feature vectors to become a single block.
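The convolution, pooling and concatenation steps can be illustrated with a small pure-Python sketch. It follows the form of (1) and (2) with ‘tanh’ as the activation, but it is not the authors' implementation: the function names, the toy 2-dimensional word vectors and the kernel weights are made up, only one filter per filter size is used instead of 100, and the pooling window is assumed to span the whole feature map.

```python
import math

def conv_feature_map(x, U, b, g):
    """Slide one kernel over every window of g word vectors, as in (1) and (2)."""
    fmap = []
    for j in range(len(x) - g + 1):
        # x_{j:j+g-1}: concatenate the g word vectors in the window
        window = [value for vec in x[j:j + g] for value in vec]
        fmap.append(math.tanh(sum(u * w for u, w in zip(U, window)) + b))
    return fmap

def global_max_pool(fmap):
    """Assumption: a single pooling window covering the whole feature map."""
    return max(fmap)

d = 2                                          # toy embedding size (the paper uses 300)
x = [[0.1 * i, -0.05 * i] for i in range(6)]   # made-up tweet of n = 6 word vectors

pooled = []                                    # concatenation of pooled features
for g in (3, 4, 5):                            # the three filter sizes
    U = [0.5] * (g * d)                        # made-up kernel weights
    pooled.append(global_max_pool(conv_feature_map(x, U, 0.0, g)))
```

With n = 6 words, the feature maps for g = 3, 4 and 5 have 4, 3 and 2 entries respectively (n − g + 1), and `pooled` holds one value per filter.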

Dense layer

The dense layer with the softmax activation function is used on top of the pooling layer to retain the features generated by the pooling layer. It is shown in (4).

$$ z=f(Wm+b_{e}) $$
(4)

Where ‘W’ is a weight matrix, ‘be’ is a bias vector and ‘f’ is a non-linear activation function. The input of the dense layer may be of variable length; the layer produces a fixed-length output ‘z’, which is given as input for classification.

The output layer defines the probability distribution over the classes and uses a softmax function. The probability that the output label is ‘t’ is given by (5).

$$ P(y=t \mid Tw, \theta)=\frac{\exp(W_{t}^{T} z +b_{t})}{\sum_{i=1}^{C} \exp(W_{i}^{T} z+b_{i})} $$
(5)

Where ‘Wt’ denotes the weights associated with the class label ‘t’ in the output layer and ‘C’ is the number of classes (C = 3 in this work).
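As a small worked example of (5), the sketch below turns three class scores (standing for the values of $W_{i}^{T} z + b_{i}$) into a probability vector. The function name and the scores are made up for illustration; subtracting the maximum score is a common numerical-stability step that does not change the result.

```python
import math

def softmax(scores):
    """Convert per-class scores into probabilities that sum to 1."""
    m = max(scores)                           # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for the classes (Need, Availability, None).
probs = softmax([2.0, 1.0, 0.1])
```

The class with the largest score (here the first one) receives the largest probability.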

3.4 KNN classifier

We adopted the K-Nearest Neighbour classifier as a base-level classifier in the proposed model to supply a feature vector for each tweet to the meta-level (second-level) classifier. It is chosen as a first-level classifier because it performs better than other classifiers (Decision tree, Naive Bayes classifier); a detailed explanation is given in Sections 4.4 and 4.5.2. It accepts domain-specific features such as aid, needs, etc., as the input feature vector of a tweet. The KNN classifier scores the neighbors of a tweet among the training tweets and uses the class labels of the ‘k’ most similar neighbors to predict the probability vector of the tweet. We use the Euclidean distance ‘E(Tw,Tw1)’ to measure the similarity between the tweets ‘Tw’ and ‘Tw1’, as shown in (6).

$$ E(Tw,Tw^{1})=\sqrt{\sum_{i=1}^{N}(Tw_{i}-Tw_{i}^{1})^{2}} $$
(6)

Where ‘N’ is the dimension of the tweet vectors ‘Tw’ and ‘Tw1’. The classes of these neighbors are weighted using the similarity of each neighbor to the tweet Tw0 as follows:

$$ score(Tw_{0},C_{i})=\sum\limits_{Tw_{j} \in KNN (Tw_{0})} Sim(Tw_{0},Tw_{j})\,\delta(Tw_{j},C_{i}) $$
(7)

where ‘KNN(Tw0)’ indicates the set of K-nearest neighbors of the tweet Tw0, δ(Twj, Ci) represents the probability of Twj with respect to the class Ci, and i ranges over the three classes: Need of resource, Availability of resource and neither of the two.

Finally, it produces the three-dimensional probability vector for each tweet in testing data. Results indicate that the KNN classifier also plays a significant role in the proposed model for detecting the NAR tweets.
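The scoring in (6) and (7) can be sketched as follows. Two choices in this sketch are assumptions, since the text does not spell them out: the similarity is taken as Sim = 1 / (1 + E), so that closer neighbors weigh more, and δ is treated as a 0/1 indicator of the neighbor's class label. The function names and the toy vectors are also made up.

```python
import math
from collections import defaultdict

def euclidean(u, v):
    """Euclidean distance between two feature vectors, as in (6)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_probabilities(query, train_vecs, train_labels, classes, k=3):
    """Similarity-weighted class scores of the k nearest neighbors, normalized to sum to 1."""
    neighbors = sorted(
        ((euclidean(query, vec), label) for vec, label in zip(train_vecs, train_labels)),
        key=lambda pair: pair[0],
    )[:k]
    scores = defaultdict(float)
    for dist, label in neighbors:
        scores[label] += 1.0 / (1.0 + dist)   # assumed Sim; delta is a 0/1 label indicator
    total = sum(scores.values())
    return [scores[c] / total for c in classes]

# Made-up toy training data with 2-dimensional feature vectors.
CLASSES = ("need", "availability", "none")
train_vecs = [[1, 0], [1, 1], [0, 1], [5, 5]]
train_labels = ["need", "need", "availability", "none"]
probs = knn_probabilities([1, 0], train_vecs, train_labels, CLASSES, k=3)
```

Here the query coincides with a Need tweet and its second-nearest neighbor is also a Need tweet, so the Need entry of the three-dimensional vector is the largest, and the distant None tweet contributes nothing.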

3.5 Meta-level classifier

In this work, we have adopted the SVM classifier [39], one of the traditional machine learning algorithms, in the proposed model. SVM is used as the meta-level classifier because it performs better than other classifiers (Decision tree, Naive Bayes classifier); a detailed explanation is given in Sections 4.4 and 4.5.2. It accepts the concatenation of the predicted outputs of the CNN and KNN classifiers as input features, so the input vector is six-dimensional. We used the Radial Basis Function (RBF) kernel in the SVM classifier to transform the data into a higher-dimensional feature space. Given a set of testing tweets, the base-level classifiers produce six-dimensional output vectors. These vectors are sent as input features to the meta-level classifier (SVM classifier). The output of the SVM (second-level classifier) is used as the final tweet prediction. Later, the learned model can be used to detect NAR tweets during a disaster.

The main advantage of the proposed stacked convolutional neural network for detecting NAR tweets during a disaster is that it works effectively even for small datasets, due to the use of domain-specific features. In addition, because of the CNN model, it works even when the words in the training and testing tweets differ. The proposed method is summarized in Algorithm 1.

figure a
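The data flow of the stacked model can be sketched as below: two three-dimensional probability vectors are concatenated into a six-dimensional vector that is passed to the meta-level classifier. This is a minimal sketch under stated assumptions, not a reproduction of Algorithm 1. The three models are represented by hypothetical stand-in callables (`toy_cnn`, `toy_knn`, `toy_meta`); in particular, `toy_meta` is a placeholder rule and not an SVM with an RBF kernel.

```python
CLASSES = ("need", "availability", "none")

def stack_features(cnn_probs, knn_probs):
    """Concatenate the two 3-dimensional base-level outputs into a 6-dimensional vector."""
    if len(cnn_probs) != 3 or len(knn_probs) != 3:
        raise ValueError("each base-level classifier must output three class probabilities")
    return list(cnn_probs) + list(knn_probs)

def predict_stacked(tweet, cnn_model, knn_model, meta_model):
    """Base level: CNN and KNN. Meta level: a classifier over the concatenated outputs."""
    return meta_model(stack_features(cnn_model(tweet), knn_model(tweet)))

# Hypothetical stand-ins for trained models (fixed outputs, for illustration only).
def toy_cnn(tweet):
    return [0.6, 0.3, 0.1]

def toy_knn(tweet):
    return [0.2, 0.7, 0.1]

def toy_meta(features):
    # Placeholder rule: add the two halves and take the arg-max class.
    combined = [features[i] + features[i + 3] for i in range(3)]
    return CLASSES[combined.index(max(combined))]

label = predict_stacked("tents available near the stadium", toy_cnn, toy_knn, toy_meta)
```

In a real pipeline the meta-level model would be trained on the six-dimensional vectors produced for the training tweets, rather than applying a fixed rule.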

4 Experimental results and analysis

In this section, we first introduce the datasets, parameters details of the model and metrics used for performance evaluation. Subsequently, the experimental results include the results of the preliminary experiments, the classifier selection experiments in the proposed model and the ablation experiments. Furthermore, a comparison is made between the proposed approach and existing approaches.

4.1 Datasets

The data are collected from the Nepal and Italy earthquakes that occurred in 2015 and 2016, respectively. Tweets are crawled through the Twitter API using tweet-ids obtained from the authors of [11]. Of the total tweets, 80% are used for training and 20% for testing the proposed model. The details of the disaster datasets are given in Table 3. The code is made available to the public (see Footnote 1).

Table 3 Details of Nepal and Italy earthquake datasets

4.2 Parameter details

The CNN model is trained by minimizing the sparse cross-entropy of (5) using the ADADELTA [23] optimizer. Table 4 gives the abbreviations of the various methods. The first, second and third columns indicate the serial number, method name and abbreviation, respectively. In the abbreviation, the methods before and after the ‘+’ symbol are the base-level classifiers (first-level classifiers), ‘+’ indicates the concatenation of the predicted outputs of the base-level classifiers, and the ‘→’ symbol indicates the flow of these predicted outputs as input to the meta-classifier. The method after the ‘→’ symbol is the meta-level classifier (second-level classifier).

Table 4 Inscriptions

4.3 Metrics for performance evaluation

The performance of the proposed models is assessed using standard measures, namely accuracy, precision, recall and f1-score, which are calculated using Eqs. 8 to 11, respectively.

$$ Accuracy=\frac{TP_{i}+TN_{i}}{TP_{i}+TN_{i}+FP_{i}+FN_{i}} $$
(8)
$$ Precision=\frac{TP_{i}}{TP_{i}+FP_{i}} $$
(9)
$$ Recall=\frac{TP_{i}}{TP_{i}+FN_{i}} $$
(10)
$$ F1-score=\frac{2*Precision*Recall}{Precision+Recall} $$
(11)

where TPi = Total No. of positive tweets detected correctly as positive.

TNi = Total No. of negative tweets detected correctly as negative.

FPi = Total No. of negative tweets wrongly detected as positive.

FNi = Total No. of positive tweets wrongly detected as negative.

i = index of the class for which the measure is computed.
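Eqs. 8 to 11 can be computed per class by treating that class as positive and all other classes as negative. The following sketch shows this; the function name and the toy labels are made up for illustration, and the guards against division by zero are an added assumption.

```python
def per_class_metrics(y_true, y_pred, cls):
    """Accuracy, precision, recall and f1-score for one class (Eqs. 8 to 11)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    tn = sum(1 for t, p in pairs if t != cls and p != cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0   # guard: no predicted positives
    recall = tp / (tp + fn) if tp + fn else 0.0      # guard: no actual positives
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up gold labels and predictions for six tweets.
y_true = ["need", "need", "availability", "none", "availability", "none"]
y_pred = ["need", "availability", "availability", "none", "availability", "need"]
need_metrics = per_class_metrics(y_true, y_pred, "need")
```

For the Need class in this toy example, TP = 1, TN = 3, FP = 1 and FN = 1, which gives an accuracy of 4/6 and a precision, recall and f1-score of 0.5 each.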

The accuracy of the CNN for various batch sizes is shown in Table 6. A batch size of 64 gives the best accuracy compared with batch sizes of 32 and 128. Therefore, a batch size of 64 is used for the CNN in further experiments.

4.4 Experimental results

This section explains the results of the preliminary experiments, the classifier selection experiments in the proposed model, and the ablation experiments.

4.4.1 Preliminary experiments

Initially, an experiment is performed with the SVM classifier using the proposed domain-specific features for the identification of NAR tweets, and the result is compared with the BoW model in Table 5. The comparison highlights the impact of the proposed domain-specific features relative to the BoW model. These features are beneficial for identifying tweets, especially for smaller datasets. Later, various experiments are performed with the CNN model to determine the best batch size. Batch sizes of 32, 64 and 128 are used. The accuracy of the CNN model for the different batch sizes is shown in Table 6. The results show that the CNN model provides the best outcome for a batch size of 64 compared with 32 and 128. Therefore, a batch size of 64 is used for additional experiments. Note that the values reported in all tables are averages over the Need and Availability of resource classes.

Table 5 Comparison of SPROP with baseline model
Table 6 Accuracy of CNN by varying the batch sizes

4.4.2 Classifier selection in the proposed method

The following four experiments are performed to choose the most appropriate base-level and meta-level classifiers for the proposed method.

  1. In the first experiment, the outputs of CNN and SVM (base-level classifiers) are given as features to the meta-level classifier. The results obtained by varying the meta-level classifier (SVM, KNN, Decision tree and Naive Bayes) are reported in Table 7. KNN gives the best performance among the classifiers for the Nepal earthquake dataset, whereas SVM gives the best performance for the Italy earthquake dataset.

  2. In the second experiment, the outputs of CNN and the decision tree (base-level classifiers) are given as features to the meta-level classifier. The models obtained by varying the meta-level classifier are CDS, CDK, CDNB and CDD, and the results are reported in Table 8. Among these models, CDK gives the best accuracy for both the Nepal and Italy earthquake datasets. CDNB provides the same accuracy as CDK for the Italy earthquake dataset.

  3. In the third experiment, the outputs of the CNN and Naive Bayes classifiers (base-level classifiers) are given as features to the meta-level classifier. The models obtained by varying the meta-level classifier are CNBS, CNBK, CNBNB and CNBD, and the results are reported in Table 9. CNBNB has the best accuracy among the models for both disaster datasets. CNBS gives the same accuracy as CNBNB for the Italy earthquake dataset.

  4. Finally, in the fourth experiment, the outputs of the CNN and KNN classifiers (base-level classifiers) are given as input to the meta-classifier. The models obtained by varying the meta-classifier are CKS, CKK, CKNB and CKD, and the results are tabulated in Table 10. CKS achieves the highest accuracy among the models for both disaster datasets.

After performing the four experiments, the models that achieve the best f1-score in each experiment are selected: CDK, CKS / CKK, CNBS and CSK for both disaster datasets. In the same way, the models that achieve the highest precision on the Nepal earthquake dataset are CKNB, CDNB, CNBB / CNBD and CSNB. Similarly, the CSNB, CDS, CNBNB and CKS models achieve the best precision for the Italy earthquake dataset. In terms of execution time, CDS is the fastest on average over both disaster datasets. However, it does not give the best results compared with the other models.

Table 7 Comparison of proposed models (CNN+SVM → classifier) on Nepal and Italy earthquake datasets
Table 8 Comparison of proposed models (CNN+ Decision tree → classifier) on Nepal and Italy earthquake datasets
Table 9 Comparison of proposed models (CNN+ Naive Bayes classifier → classifier) with variations on Nepal and Italy earthquake datasets
Table 10 Comparison of proposed models (CNN+ KNN → classifier) with variations on Nepal and Italy earthquake datasets

Finally, all models are compared. The CSK model achieves the best f1-score for the Nepal earthquake dataset. In terms of accuracy, the CSK model also gives the best performance for the Nepal earthquake dataset, but not for the Italy earthquake dataset. In the overall comparison, CKS performs better than the other models on both disaster datasets. Therefore, CKS is selected to identify NAR tweets during a disaster.

4.4.3 Ablation experiments

Various experiments are conducted to assess the effectiveness of the individual components of the proposed model (CKS) on the Nepal and Italy earthquake datasets. The full proposed model is evaluated first, and the results for the two datasets are tabulated in Table 11. Then, experiments are performed by excluding the informative (domain-specific) features and the CNN individually from the proposed model; these results are also reported in Table 11. The informative features play a crucial role for the Italy earthquake dataset: excluding them reduces the accuracy of the proposed model by almost 5.31%. For the Nepal earthquake dataset, the accuracy is reduced by approximately 0.90%. Removing the CNN model drastically reduces the performance, by almost 25% and 15% for the Nepal and Italy earthquake datasets, respectively. This indicates that the CNN plays a significant role for both disaster datasets. When both the CNN and SVM classifiers are removed from the proposed model, the performance reduction is the same as when only the CNN is removed. This indicates that the SVM classifier alone does not have much impact on the performance of the model. Overall, the proposed method (CKS) provides better accuracy than any of its individual components for identifying NAR tweets during a disaster. This is also supported by the statistical validation given in Section 4.5.2.

Table 11 Accuracy of ablation experiments on Nepal and Italy earthquakes

4.5 Comparison of the proposed approach with the existing approaches

This section briefly explains the methods that are compared with the proposed model. It is divided into two subsections: (1) Classification methodologies and (2) Statistical validation of the classifier models.

4.5.1 Classification methodologies

This section compares the proposed model with the existing classification methodologies [9, 12, 30, 35]. In [9], the authors presented the AIDR platform for the automatic classification of tweets into user-defined categories using uni-gram and bi-gram features. Accordingly, in this paper the SVM classifier with uni-gram and bi-gram features is used as a baseline in the experiments. In [35], the authors used features such as location, infrastructure damage and communication for identifying the resources during a disaster, with an SVM classifier for classification. The authors of [12] used CNN for sentence classification with hyper-parameter tuning. Similarly, a CNN is evaluated and compared with the proposed model. In [30], the authors used low-level lexical and syntactic features for identifying situational information during a disaster. The proposed CKS model achieves the best accuracy compared with the existing methods on both the Nepal and Italy earthquake datasets for identifying the NAR tweets, and the results are reported in Table 12. The better accuracy of the proposed model is due to the use of informative features and traditional classifiers, which enhance the diversity of the model. In general, stacking models give better accuracy than individual models when the models are diverse. It is also observed from Table 12 that the proposed method has a larger impact on the Italy earthquake dataset than on the Nepal earthquake dataset, because the Italy dataset is small. In terms of execution time, the Rudra model [30] is the fastest and the BoW model [9] is the slowest of the compared models. However, the fastest model does not give the best result for detecting the NAR tweets during a disaster.

Table 12 Comparison of proposed model with existing methods on average of Nepal and Italy earthquake datasets using Accuracy parameter

4.5.2 Statistical Validation for comparison of the various classifier models

In this section, we investigate the statistical significance of the differences between the classification models. The authors in [5] suggest the use of the McNemar statistical test for deep learning models. Therefore, we use the McNemar test [5] to study the statistical significance of the classification methods. The contingency table of the McNemar test is shown in Table 13.

Table 13 Contingency table

Here ‘N01’ represents the number of tweets correctly detected by both Model A and Model B. ‘N02’ represents the number of tweets correctly detected by Model B and wrongly detected by Model A. ‘N11’ represents the number of tweets correctly detected by Model A and wrongly detected by Model B. ‘N12’ represents the number of tweets wrongly detected by both Model A and Model B.

The chi-squared (χ2) can be defined as follows:

$$ \chi^{2}=\frac{(|N_{02}-N_{11}|-1)^{2}}{N_{02}+N_{11}} $$
(12)

The hypotheses are:

  1. Null hypothesis (N0): There is no significant difference between the performances of the classifier models.

  2. Alternate hypothesis (N1): There is a significant difference between the performances of the classifier models.

If the probability (p) value is greater than 0.05, N0 is not rejected. If the p value is less than 0.05, N0 is rejected in favour of N1.
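The statistic in (12) and the corresponding p value can be computed with a few lines of Python. The sketch below uses the fact that, for a chi-squared distribution with one degree of freedom, the upper-tail probability equals erfc(sqrt(χ2 / 2)). The function name and the disagreement counts are made up for illustration; they are not results from the paper.

```python
import math

def mcnemar(n02, n11):
    """McNemar statistic with continuity correction, as in (12), and its p value (1 degree of freedom)."""
    if n02 + n11 == 0:
        return 0.0, 1.0                      # the two models never disagree
    chi2 = (abs(n02 - n11) - 1) ** 2 / (n02 + n11)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Made-up counts: Model B alone is correct on 5 tweets, Model A alone on 25 tweets.
chi2, p_value = mcnemar(5, 25)
significant = p_value < 0.05
```

With these made-up counts the statistic is 361/30 (about 12.03) and the p value is well below 0.01, so N0 would be rejected; counts such as 10 and 12 give a p value far above 0.05, so N0 would not be rejected.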

Tables 14 and 15 show the results of the McNemar statistical test for the various proposed methods and for the comparison with the existing methods. In the tables, ‘↑↑’ indicates strong evidence that the proposed method differs significantly from the other method, with a probability value less than 0.01 (p < 0.01). This corresponds to a confidence level of 99%. ‘↑’ indicates weak evidence that the proposed method differs significantly from the other method, with a probability value between 0.01 and 0.05 (0.01 < p < 0.05). ‘\(\sim \)’ indicates that there is no statistically significant difference between the two methods, i.e. they have the same classification performance. In Tables 14 and 15, the method in the first column is compared with the other methods in the same row. From Table 14, we can make the following observations:

  1. 1.

    There is strong evidence that the CSS is statistically significant compared to other methods such as CSD and failed to reject the N1 hypothesis. It is the weak evidence that the CSS is statistically significant than the CSK and was unable to accept the N0 hypothesis. There is no significant difference between the CSS and CSNB, and failed to reject the N0 hypothesis.

  2. 2.

    There is strong evidence that the CDK is statistically significant as the CDD method and failed to reject the N1 hypothesis. It is weak evidence that the CDK is statistically significant than the CSK and failed to accept the N0 hypothesis. However, there is no statistically significant difference between the CDK and the CDNB, and accept the N0 hypothesis.

  3. 3.

    There is no statistically significant difference between the CNBNB and CNBS, and accept the N0 hypothesis. But there is strong evidence that CNBNB is statistically significant than the CNBK and CNBD, and failed to reject the N1 hypothesis.

  4. 4.

    There is strong evidence that CKS is significantly different from the other methods, namely CKK, CKNB and CKD, so the N1 hypothesis is accepted.

  5. 5.

    Finally, there is strong evidence that CKS is significantly different from CNBNB, CDK and CSS, so the N1 hypothesis is accepted.

Table 14 Results of the McNemar tests on variants of the proposed method for both Nepal and Italy Earthquake datasets
Table 15 Results of the McNemar tests for ablation methods and the existing methods for both Nepal and Italy Earthquake datasets

Similarly, Table 15 can be interpreted as follows:

  1. 1.

    The first row shows a comparison of the ablation experiments, and the second row represents a comparison of the proposed method with the existing methods.

  2. 2.

    The results show strong evidence that the proposed method is significantly different from the existing methods and that adding the proposed features to the model leads to a significant improvement. Hence, the N1 hypothesis is accepted.
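
The decision rule and the significance markers used in this section can be illustrated with a minimal sketch. It assumes the continuity-corrected chi-square form of the McNemar test with one degree of freedom, which may differ from the exact variant used in the experiments. The function names `mcnemar_test` and `significance_marker` and the toy predictions are hypothetical, and the marker reflects only the p value, not which of the two methods performs better.

```python
import math

def mcnemar_test(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar test; returns (statistic, p_value)."""
    # b: cases only method A classifies correctly; c: cases only method B does.
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    if b + c == 0:
        return 0.0, 1.0  # the methods never disagree: no evidence of a difference
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

def significance_marker(p_value):
    """Map a p value to the markers used in Tables 14 and 15."""
    if p_value < 0.01:
        return "↑↑"  # strong evidence, N1 accepted
    if p_value < 0.05:
        return "↑"   # weak evidence, N1 accepted
    return "~"       # no significant difference, N0 accepted

# Toy example (assumed data): method A is always right, method B errs 12 times.
y_true = [1] * 20
method_a = [1] * 20
method_b = [0] * 12 + [1] * 8
statistic, p = mcnemar_test(y_true, method_a, method_b)
print(round(statistic, 3), significance_marker(p))
```

Only the cases on which the two methods disagree enter the statistic, which is why the test suits paired comparisons of classifiers evaluated on the same tweets.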

4.6 Discussion

This paper proposes a method named CKS (CNN and KNN are used as base-level classifiers, and SVM is used as a meta-level classifier) for identifying tweets related to the Need and Availability of Resources during a disaster. It is intended to help victims and humanitarian organizations identify, through social media, where resources are available and where they are needed in the event of a disaster. It also helps service providers collect and transport the necessary resources to the victims who need them. For example, locations where a large number of resources are needed or available can be automatically marked on a map to help local volunteers and victims. The performance of the proposed method has been demonstrated on both the Nepal and Italy earthquake datasets. The research implications for COVID-19 are discussed in the following section.

4.6.1 Research implication to COVID-19

Our study has practical implications for resource management during COVID-19. Developing countries like India suffer from limited financial resources, particularly when the government has no choice but to close businesses and impose a lockdown during COVID-19. This can affect people's daily food consumption and nutrition, as well as the supply of medical resources (such as ventilators), masks and other urgent needs. Information about these types of resources can be identified automatically using social media. Some organizations post on social media where resources are available. In the same way, victims post where resources are needed. The proposed model can be used to identify these types of resource tweets during COVID-19. It may help both the public and humanitarian or government organizations save time and use resources effectively.

The number of COVID-19 cases may increase in the future, which may lead to a shortage of medical resources such as ventilators, hospital services, quarantine facilities for victims, masks, etc. Identifying where these resources are available and where they are needed is very important for their effective use. It can help save lives and limit the spread of the disease from one person to another.

In the future, this work can be extended to extract the specific type of resource that is required or available, along with a priority-based geo-location of tweets. For example, the highest priority is given to tweets that contain information about very urgent needs, such as ventilators and masks, and that have geo-location information (where the resource need or availability exists). Tweets that contain information on the need or availability of food, donations of money or services are given the next priority. Tweets that include some other type of resource information without geo-location are given the minimum priority. Another extension is the automatic matching of Need and Availability of Resource tweets during COVID-19.
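
The three priority levels described above could be expressed as a simple rule. The following is only a sketch of that future extension: the resource vocabularies, the function name `tweet_priority` and the input representation (a resource type plus a geo-location flag) are hypothetical. The text does not say how an urgent resource without geo-location should be ranked, so this sketch assumes that it falls to the minimum priority.

```python
# Assumed example vocabularies; a deployed system would need a fuller list.
URGENT_RESOURCES = {"ventilator", "mask"}
SECONDARY_RESOURCES = {"food", "money", "service"}

def tweet_priority(resource_type, has_geo_location):
    """Return 1 (highest), 2 (next) or 3 (minimum) for a resource tweet."""
    if resource_type in URGENT_RESOURCES and has_geo_location:
        return 1
    if resource_type in SECONDARY_RESOURCES:
        return 2
    # Assumption: everything else, including urgent resources
    # without geo-location, receives the minimum priority.
    return 3

# Toy tweets as (resource_type, has_geo_location) pairs, sorted by priority.
tweets = [("blanket", False), ("food", False), ("ventilator", True)]
ranked = sorted(tweets, key=lambda t: tweet_priority(*t))
print(ranked)
```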

The prerequisites for deploying the model in the real world are as follows:

  1. 1.

    GPU server.

  2. 2.

    Finding the relevant hashtags and keywords.

  3. 3.

    Filtering the Fake and Spam tweets.

GPU server

A GPU server is needed to store and process the tweets collected from Twitter during COVID-19.

Finding the relevant hashtags and keywords

Finding the relevant hashtags and keywords is one of the important modules for deploying the model in the real world. The relevant hashtags and keywords can be used to filter the tweets related to COVID-19. During COVID-19, users post millions of tweets on Twitter using different hashtags and keywords such as #Covid-19, #CORONA, COVID, etc. Therefore, finding the relevant hashtags is an important task. Various methods [15, 18, 43] are available in the existing literature for finding hashtags relevant to COVID-19. Most of these methods require some seed keywords to be supplied manually.
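
As a rough illustration of keyword-based filtering, the sketch below keeps a tweet if any of its tokens matches a manually chosen seed term after normalization. This is plain string matching, not one of the hashtag-discovery methods cited above; the seed list and the function names `normalize` and `is_relevant` are hypothetical.

```python
import re

# Hypothetical, manually chosen seed terms (already normalized).
SEED_TERMS = {"covid19", "covid", "corona"}

def normalize(token):
    """Lowercase a token and drop '#', '-' and any other non-alphanumeric character."""
    return re.sub(r"[^a-z0-9]", "", token.lower())

def is_relevant(tweet_text, seed_terms=SEED_TERMS):
    """True if any whitespace-separated token matches a seed term."""
    return any(normalize(token) in seed_terms for token in tweet_text.split())

tweets = [
    "Need masks urgently in Pune #Covid-19",
    "Lovely weather today",
    "#CORONA helpline numbers for Delhi",
]
relevant = [t for t in tweets if is_relevant(t)]
print(relevant)
```

Normalizing before matching lets variants such as #Covid-19, #CORONA and COVID map to the same small seed set.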

Filtering the Fake and Spam tweets

After extracting the tweets related to COVID-19, fake and spam tweets need to be filtered out. A tweet is called fake [26] if it contains an incorrect time or location related to the need and availability of resources, a link to misleading information, etc. A tweet is called spam [8, 19, 26] if it contains links to advertisements, loans or other irrelevant content.

After removing the fake and spam tweets from the relevant tweets, the resultant tweets are passed to the model for identifying the resource tweets during the disaster.
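
The order of the deployment steps described above can be summarized in a short sketch. The function `run_pipeline` and the three toy callables passed to it are hypothetical placeholders: in a real deployment they would be replaced by a proper relevance filter, a fake and spam detector, and the trained model.

```python
def run_pipeline(tweets, is_relevant, is_fake_or_spam, classify):
    """Filter relevant tweets, drop fake/spam ones, then label the rest."""
    relevant = [t for t in tweets if is_relevant(t)]
    clean = [t for t in relevant if not is_fake_or_spam(t)]
    return [(t, classify(t)) for t in clean]

# Toy stand-ins (assumptions for illustration only, not the actual components).
def toy_is_relevant(tweet):
    return "covid" in tweet.lower()

def toy_is_fake_or_spam(tweet):
    return "loan" in tweet.lower()

def toy_classify(tweet):
    return "need" if "need" in tweet.lower() else "availability"

tweets = [
    "We need ventilators at the city hospital #covid",
    "Instant loan offers for everyone #covid",
    "Free masks available at the community hall #covid",
    "Watching football tonight",
]
results = run_pipeline(tweets, toy_is_relevant, toy_is_fake_or_spam, toy_classify)
for text, label in results:
    print(label, "|", text)
```

Passing the three stages as callables keeps the ordering explicit while leaving each component free to be swapped independently.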

5 Conclusion

Detection of NAR tweets during a disaster is a difficult task because many different kinds of tweets related to the disaster are posted. A model is proposed for identifying NAR tweets during a disaster. The results suggest that stacking a convolutional neural network with traditional feature-based classifiers is useful for detecting NAR tweets. The results also suggest that the combination of CNN, KNN and SVM (CKS) with domain-specific features outperforms the other combinations. The proposed method also outperforms the existing methods on the Nepal and Italy earthquake datasets. Furthermore, we discussed the application of the proposed method to real-time scenarios such as the COVID-19 outbreak. In future work, we plan to improve the accuracy of the model by using other deep learning models to detect NAR tweets during a disaster.