
1 Introduction

A rumor is defined as a story or a statement whose truth value is unverified or deliberately false [18]. With the rapid growth of social media, large numbers of rumors spread easily across the Internet, which has negative effects on society (e.g., public panic). For example, on April 23, 2013, a rumor about an “explosion” that injured Barack Obama in the White House spread through Twitter and wiped out $130 billion in stock value. Therefore, it is crucial to detect rumors on social media effectively and as early as possible, before they spread widely.

In previous studies, several methods have been proposed to detect rumors at the level of individual posts [18, 25]. However, individual posts typically contain limited context, and rumors may be phrased in the same truthful-sounding way as non-rumors. Besides, a single post reveals little about the temporal properties of a claim spreading on social media. Therefore, current studies on rumor detection classify aggregations of posts by identifying whether an event is a rumor or not [8, 11, 27, 28]. An event, which may carry true or false information, is defined as a set of posts (e.g., microblogs, tweets, and WeChat messages) related to some specific claim [11]. Most previous approaches for rumor detection apply conventional supervised learning algorithms with manually designed features. A wide variety of handcrafted features, such as content-based, user-based, and propagation-based features [1, 8, 12, 25], have been incorporated. Some other rumor detection methods exploit more complicated features, e.g., users’ feedback [19], variation of features along the event lifecycle [12], signal posts reflecting skepticism about factual claims [28], and conflicting viewpoints [4]. Recently, Gated Recurrent Unit (GRU) and Convolutional Neural Network (CNN) based methods [11, 27] have been shown to be competitive for rumor detection over events. Both methods view an event as a series of posts and split the posts into groups along time. GRU is particularly suited to modeling sequential phenomena and can capture the variation of contextual characteristics over rumor diffusion, while CNN can extract both local and global features through the convolution operation and reveal high-level interactions [26]. However, these models have some drawbacks. Firstly, GRU is biased towards the latest groups [16], and CNN is not inherently equipped with a sense of time series [2]. Secondly, they identify rumors according to content features of groups represented by tf-idf or unsupervised paragraph vectors [9]. We observe that social media contains many redundant posts, and many posts about an event contribute little to rumor detection. Therefore, it is unsuitable to generate group embeddings by tf-idf or unsupervised paragraph vectors. Thirdly, since these models use group representations as the input, salient posts that contribute to rumor classification are hard to obtain, yet they are important for further analysis in real-world tasks, such as public opinion monitoring, where picking out salient posts is crucial for experts to verify conclusions drawn by automatic methods. Furthermore, the above-mentioned models only use content information, while other useful features of groups (e.g., temporal information) are ignored. Recently, Liu et al. [10] incorporated an attention mechanism to model the content and dynamic information of individual posts for rumor detection. However, without grouping the aggregated posts, it cannot utilize the variation of features along an event’s lifecycle, which is helpful for rumor detection [12]. In addition, the model may become very complex when an event consists of a large number of posts.

To address the issues mentioned above, we propose a model named Supervised Group Embedding based Rumor Detection (SGERD). First, in order to make each group contain as many correlated posts as possible, we split the posts of each event into several groups with an equal time interval, following [27]. Our intuition is that group representations built by unsupervised methods can hardly alleviate the negative effects of redundant posts in a group. Thus, we directly take the content of posts as the input to learn a task-oriented representation for each post, and extract the local features of nearby posts to generate group embeddings. Furthermore, considering that groups in different time windows have different influence, we model the temporal information of groups and equip SGERD with a sense of group order.

The main contributions of our work are as follows:

  • We conduct rumor detection at the post-level by proposing a supervised method to learn group embeddings, which significantly improves the model performance. Moreover, we can conveniently pick out meaningful posts from each event.

  • We incorporate temporal information besides textual features into neural networks, which is shown to be helpful for rumor detection.

  • Experiments are conducted on two real-world datasets, and the results show that SGERD is effective and outperforms state-of-the-art methods.

The remainder of this paper is organized as follows. We review related work in Sect. 2, and present the SGERD for rumor detection in Sect. 3. We detail the dataset, experimental results, and discussion in Sect. 4. Finally, we present conclusions in Sect. 5.

2 Related Work

In recent years, the task of detecting rumors on the Internet has received considerable attention. As for research objectives, most studies attempted either to detect rumors at the post level, i.e., classify a single post as a rumor or not [18, 25], or to identify whether the aggregation of posts under an event is a rumor [8, 11, 27, 28]. Other studies aimed to detect fake images [3] and identify hoax articles in Wikipedia [7]. With respect to rumor detection methods, many previous studies employed traditional classifiers using different sets of hand-crafted features. For instance, various features are extracted from the content, user characteristics, and the propagation pattern [1, 8, 12, 25]. Moreover, some rumor detection methods exploited more complicated features, such as users’ feedback [19], variation of features along an event’s lifecycle [12], signal posts reflecting skepticism about factual claims [28], and conflicting viewpoints [4].

Recently, methods based on Gated Recurrent Unit (GRU) [11, 14], Convolutional Neural Network (CNN) [27], and attention [10] have been proposed for rumor detection. Different from prior works, they exploited the content of posts rather than typical features of events (e.g., the retweet number of an event and information used to evaluate a user’s credibility). Yu et al. [27] adopted a two-layer CNN model named CAMI to extract both local and global features of events. First, the posts of an event are split into twenty groups according to an equal time interval. Then, the groups are embedded into fixed-size representations by paragraph vector [9]. Last, the model takes the group embeddings of the event as input and detects whether the event is a rumor. Liu et al. [10] proposed an attention-based approach called AIM to detect rumors using content and dynamic information. However, their approach does not use a grouping method, and the number of model parameters is proportional to the number of posts in each event. As a result, classification becomes intractable when events contain a great number of posts. Ma et al. [14] regarded rumor detection and stance classification as highly relevant tasks. They linked the two tasks and proposed a model that uses a multi-task learning scheme to model the features shared by them. Some studies modeled the propagation structures of different events by exploiting tree kernels [13] and recursive neural networks [24] in order to capture the patterns of propagation trees.

Among the above, methods that integrate various hand-crafted features into traditional classifiers rely only on limited context and cannot capture high-level features, and thus fail to adapt to the complicated scenarios of social media. Methods based on GRU have a preference for the latest group of an event, while the latest group may not play a key role in rumor detection [27]. Though CAMI, which uses CNN, achieves state-of-the-art performance, it has the drawback of not being equipped with a sense of time series. AIM does not use the grouping method, so it cannot model the variation of features along an event’s lifecycle, which is useful for rumor detection [12], and it cannot be applied to events with a large number of posts.

3 Proposed Rumor Detection Model

3.1 Definitions

According to [11, 27], an event is defined as a set of posts related to a specific claim, e.g., “Trump Campaign colluded with Russia during 2016 presidential election”, and each post is associated with a timestamp. In this way, an event contains much more information than a single post. We denote an event instance as \(e = \{(post_{i}, timestamp_{i})\}\), consisting of ideally all relevant posts \(post_{i}\) at \(timestamp_{i}\), where the \(timestamp_{i}\) are in chronological order and \(timestamp_{1}\) is the start time of e, i.e., the timestamp of the first post of the corresponding event e. The total number of posts under e is denoted as \(\vert e \vert \), and thus \(i \in \left[ 1, \vert e \vert \right] \). Based on this definition, our task is to detect whether a sequence of relevant posts associated with an event is a rumor. Following the previous work [27], we split the posts of each event into n groups according to an equal time interval and set n to twenty.
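The equal-interval grouping described above can be sketched as follows. This is a minimal illustration, not the authors’ implementation: the function name `split_into_groups`, the representation of a post as a `(text, timestamp)` pair, and the handling of empty intervals (they are simply left empty) are assumptions made here.

```python
def split_into_groups(posts, n=20):
    """Assign (text, timestamp) pairs to n groups covering equal time spans.

    `posts` must be non-empty and sorted by timestamp. Returns a list of n
    lists; a group stays empty if no post falls into its interval (assumption).
    """
    start = posts[0][1]
    end = posts[-1][1]
    # Width of one interval; fall back to 1.0 if all timestamps coincide.
    span = (end - start) / n or 1.0
    groups = [[] for _ in range(n)]
    for text, ts in posts:
        # The very last post would map to index n, so clamp it to n - 1.
        idx = min(int((ts - start) / span), n - 1)
        groups[idx].append((text, ts))
    return groups
```

For example, eleven posts spaced ten time units apart and split into five groups yield group sizes 2, 2, 2, 2, and 3.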

3.2 Model Structure

The overall model architecture is illustrated in Fig. 1. It contains four modules: splitting posts into n groups and learning task-oriented post embeddings, constructing the group embedding G over a variable number of posts, learning the temporal embedding \(T_{emb}\) to equip the model with a sense of group order, and employing a series of convolution operations for classification.

Fig. 1. The architecture of SGERD

3.2.1 Representation of Posts in Each Group

For an event instance e with n groups, our input consists of streams of posts, which can be interpreted as a time series where nearby posts are likely to be correlated. Shen et al. [20] showed that a model using only simple operations (e.g., parameter-free pooling) on word embeddings may achieve comparable performance on some tasks. Inspired by their work, and different from the procedure used in [27] for rumor detection, we use averaged word embeddings to represent posts instead of paragraph vectors [9], which gives a simpler procedure. Concretely, let (\(v_1\), \(v_2\), ..., \(v_L\)) denote the sequence of words of a post, where each word \(v_i\) is represented by a d-dimensional word embedding trained by Word2Vec [15], and L is the number of words in this specific post. We represent each post as the average of its word embeddings: \(\frac{1}{L}(\sum _{i=1}^{L}v_i)\). This operation can be viewed as average pooling and results in an embedding with the same dimension d as the word embedding \(v_i\).
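As a small illustration of this average pooling step, the sketch below averages a post’s word vectors dimension by dimension. The function name `average_embedding` and the use of plain Python lists in place of trained Word2Vec vectors are assumptions for illustration only.

```python
def average_embedding(word_vectors):
    """Average pooling over a post's word vectors.

    `word_vectors` is a non-empty list of L vectors, each a list of d floats.
    Returns a single list of d floats: (1/L) * sum of the word vectors.
    """
    L = len(word_vectors)
    d = len(word_vectors[0])
    return [sum(v[k] for v in word_vectors) / L for k in range(d)]
```

A post with the two word vectors `[1.0, 2.0]` and `[3.0, 4.0]` is thus represented by `[2.0, 3.0]`.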

However, a post representation obtained by averaging word embeddings is not task-oriented, i.e., it is produced by an unsupervised method and is thus not well suited to rumor detection. To generate a representation of each post that fits the rumor detection task, we apply a fully connected feed-forward network (FFN) [22] to each post. It consists of two linear transformations with a hidden Rectified Linear Unit (ReLU) [17] nonlinearity in between:

$$\begin{aligned} \mathrm{FFN}(x) = \mathrm{ReLU}(xW_{1})W_{2}, \end{aligned} \tag{1}$$

where \(W_{1} \in \mathbb {R}^{d \times h}\) and \(W_{2} \in \mathbb {R}^{h \times d}\) are parameter matrices. The FFN uses trainable weight matrices to attend to different dimensions of the input, so after applying it we obtain a representation of each post that is better suited to the task of rumor detection.
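Equation (1) can be traced by hand with the following sketch, which treats x as a row vector of length d. The helper names `matvec`, `relu`, and `ffn` are hypothetical, and the weights in the usage example are arbitrary numbers rather than trained parameters.

```python
def matvec(x, W):
    """Row vector times matrix: x has length a, W is a x b, result has length b."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def relu(x):
    """Element-wise Rectified Linear Unit."""
    return [max(0.0, v) for v in x]

def ffn(x, W1, W2):
    """FFN(x) = ReLU(x W1) W2, with W1 of shape d x h and W2 of shape h x d."""
    return matvec(relu(matvec(x, W1)), W2)
```

With d = 2 and h = 3, the input `[1.0, -2.0]` and the weights used in the test below give a hidden activation of `[1.0, 0.0, 0.0]` after ReLU, so the output equals the first row of `W2`.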

3.2.2 Generation of Group Embedding

In this section, we propose a supervised method to generate a group embedding over a variable number of posts. We define \(Group_{i} \in \mathbb {R}^{len_{i} \times d}\) as the i-th group of event e with a sequence of \(len_{i}\) posts, where each post is allocated by the grouping method proposed in [27], \(Group_{i}[j] \in \mathbb {R}^{d}\) as the embedding of the j-th post in \(Group_{i}\), obtained by average pooling of word embeddings followed by Eq. (1), and \(Group_{i}[j:j+len] \in \mathbb {R}^{(len+1) \times d}\) as the concatenation of the embeddings from post j up to post \(j+len\). We apply the convolution operation to combine nearby posts within temporal windows of the filter size and extract local features for group embeddings. We denote a one-dimensional convolution filter F as a weight matrix \(W_{F} \in \mathbb {R}^{ws \times d} \), where ws is the size of filter F. When F is applied to \(Group_i\), the dot product is calculated between \(W_{F}\) and each possible window of ws successive post representations, then the bias \(b_{F}\) is added and the activation function f is applied. This results in a feature map \(p \in \mathbb {R}^{len_i-ws+1} \) with entry j as

$$\begin{aligned} p[j] = f(W_{F} \cdot Group_i[j:j+ws-1] + b_F), \end{aligned} \tag{2}$$

where \( j \in [1, 1+len_i-ws] \), \(b_F \in \mathbb {R}\), and f is a non-linear function such as ReLU. Note that the weights of the convolutional kernels are shared across different groups. After the convolution operation, a sequence of local features of nearby posts is extracted, where each feature corresponds to the posts of one time window. Since many posts do not contribute to rumor detection, we want to suppress the non-salient local features and keep the important features that are helpful for rumor detection in the group embedding. For this purpose, we apply a max-over-time pooling operation [5] over the feature map p and take the maximum value as the salient feature. The general idea is to keep the single most significant feature, i.e., the highest value, of the feature map produced by each filter, while ignoring less important information. After the pooling operation, we aggregate the pooled features to obtain a global representation for each group, i.e., each group is represented by a fixed-length vector \(g_i \in \mathbb {R}^{m} \), whose size equals the number of filters.
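The convolution of Eq. (2) followed by max-over-time pooling can be sketched as below, using ReLU as f and 0-based indices. The function names are hypothetical, and the sketch assumes that every group holds at least ws posts; how shorter groups are padded is not specified here and is left out.

```python
def conv_feature_map(group, W_F, b_F):
    """Eq. (2) for a single filter.

    `group` is a list of len_i post embeddings (each of length d),
    `W_F` is a ws x d weight matrix, `b_F` a scalar bias.
    Returns the feature map p of length len_i - ws + 1.
    """
    ws = len(W_F)
    fmap = []
    for j in range(len(group) - ws + 1):
        s = b_F
        for r in range(ws):
            # Dot product between filter row r and the r-th post of the window.
            s += sum(w * x for w, x in zip(W_F[r], group[j + r]))
        fmap.append(max(0.0, s))  # ReLU plays the role of f
    return fmap

def group_embedding(group, filters):
    """Max-over-time pooling per filter; returns g_i of length m = len(filters).

    `filters` is a list of (W_F, b_F) pairs shared across all groups.
    """
    return [max(conv_feature_map(group, W, b)) for W, b in filters]
```

Because only the maximum of each feature map survives, windows made of redundant posts influence the group embedding only when no other window produces a larger response for that filter.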

To equip SGERD with a sense of group order, as well as to model the influence of each group within different time windows, we incorporate the temporal information \(T = [t_1, t_2, ..., t_n]\) of e into the generated group embeddings, where each entry \(t_i\) is the min-max normalized time interval between the end time of the i-th group and the start time of e, i.e., the latest timestamp of the posts in the i-th group minus the start time of e. This is similar to exploiting the position of tokens in a sequence through position embeddings [2, 23]. In particular, we embed T by a weight vector \(V \in \mathbb {R}^{n}\) and a bias \(b_T\), followed by a hyperbolic tangent (tanh) non-linearity, which results in \(T_{emb}\) with each row as:

$$\begin{aligned} T_{emb}[i] = \mathrm{tanh}(T \circ V^\mathrm {T} + b_T), \end{aligned} \tag{3}$$

where \(i \in [1, m]\), \(b_{T} \in \mathbb {R}\), and \(\circ \) represents element-wise multiplication.

Finally, the group embedding and the temporal embedding are combined to obtain a temporal-aware group embedding: \(\tilde{G} = G + T_{emb}\), where \(G=[g_1, g_2, ..., g_n]\) and the columns of \(\tilde{G}\) can be viewed as group embeddings tuned with the temporal information. The temporal embedding is useful in our architecture since it gives SGERD a sense of which part of the event it is currently dealing with and reflects the different influence of each group (see Sect. 4.3).
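The temporal embedding of Eq. (3) and its combination with G can be sketched as follows. Several points are assumptions made for illustration: G is stored as an m x n nested list (rows index filters, columns index groups), min-max normalization is applied over the n groups of a single event, and the function names are hypothetical.

```python
import math

def min_max_normalize(values):
    """Scale values to [0, 1]; returns zeros if all values are equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def temporal_embedding(intervals, V, b_T, m):
    """Eq. (3): each of the m rows equals tanh(T * V + b_T), element-wise.

    `intervals[i]` is the end time of group i minus the start time of the
    event; `V` is a weight vector of length n and `b_T` a scalar bias.
    """
    T = min_max_normalize(intervals)
    row = [math.tanh(t * v + b_T) for t, v in zip(T, V)]
    return [list(row) for _ in range(m)]

def add_temporal(G, T_emb):
    """Temporal-aware group embedding: element-wise sum G + T_emb."""
    return [[g + t for g, t in zip(g_row, t_row)]
            for g_row, t_row in zip(G, T_emb)]
```

Since tanh is bounded in (-1, 1), the temporal term shifts each entry of G by less than one in absolute value.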

3.2.3 Group Embedding-Based Rumor Detection

After constructing the embedding for each group, we repeat the above convolution operation twice, with different filter settings, to extract low-level and high-level group features from \(\tilde{G}\). The resulting features are then passed through a fully connected layer, and the final output \(\hat{l}_{e}\) is obtained via softmax, where \(\hat{l}_{e}\) is the predicted probability that event e is a rumor.

Our model is trained end-to-end by minimizing the following error over the training set D:

$$\begin{aligned} J = -\sum _{\forall e \in D}l_{e}\ln {\hat{l}_{e}} - \sum _{\forall e \in D}(1-l_{e})\ln {(1-\hat{l}_{e})} + \frac{\lambda }{2}||\theta ||_{2}, \end{aligned} \tag{4}$$

where \(l_{e}\) is the ground truth label of e, \(\lambda \) is the regularization coefficient, and \(\theta \) is the set of parameters to be learned. Training is done through stochastic gradient descent over shuffled mini-batches with the Adam [6] update rule.
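The objective in Eq. (4) can be evaluated numerically with the sketch below. It follows the equation as printed, i.e., it uses the (unsquared) L2 norm of the parameters; many implementations penalize the squared norm instead. The function name `objective` and the flat list used for \(\theta \) are assumptions.

```python
import math

def objective(labels, probs, theta, lam):
    """Eq. (4): binary cross-entropy over events plus (lam / 2) * ||theta||_2.

    `labels` holds ground truth values l_e in {0, 1}, `probs` the predicted
    rumor probabilities (strictly between 0 and 1), `theta` a flat list of
    all trainable parameters, and `lam` the regularization coefficient.
    """
    cross_entropy = -sum(
        l * math.log(p) + (1 - l) * math.log(1 - p)
        for l, p in zip(labels, probs)
    )
    l2_norm = math.sqrt(sum(w * w for w in theta))
    return cross_entropy + lam / 2 * l2_norm
```

For two events predicted at probability 0.5 each, the cross-entropy part equals \(2\ln 2\) regardless of the labels.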

4 Experiment

In this section, we evaluate the performance of the proposed model for rumor detection. We have designed the experiments to achieve the following goals: (i) to compare the performance of different methods in detecting rumors, (ii) to evaluate the function of different components for learning group embedding, (iii) to evaluate the effectiveness of mainstream methods in early detection of rumors, and (iv) to validate the model performance by extracting salient posts which contribute more to detect rumors.

4.1 Dataset

Following previous works on rumor detection [12, 27], we evaluate the effectiveness of SGERD on two real-world datasets: Weibo and Twitter. Weibo contains 2,313 rumor events and 2,351 non-rumor events, and Twitter contains 498 rumor events and 494 non-rumor events. As for temporal information, the average time intervals of events are 2,460.7 h and 1,582.6 h for Weibo and Twitter, respectively. The rumor events from Weibo were obtained from the Sina community management center, and a similar number of non-rumor events were gathered by crawling the posts of general threads that are not reported as rumors. For Twitter, rumor and non-rumor events were confirmed by Snopes, an online rumor debunking service, and combined with some non-rumor events from two public datasets [1, 8].

4.2 Experimental Settings

To demonstrate the effectiveness of our proposed SGERD on rumor detection, we have implemented the following baselines for comparison:

AIM is an attention-based method which utilizes both content and dynamic information of posts [10].

CAMI is based on two CNN hidden layers [27]. The input layer consists of content features of groups learned by paragraph vector [9], and the number of groups is fixed to twenty.

GRU-2 is based on two GRU hidden layers [11]. The input layer consists of content features represented by tf-idf, and the time span of each group has variable length.

SVM-TS is an SVM classifier with a linear kernel which uses time-series structures to model the variation of social context features [12]. These features are manually designed and based on contents, users and propagation patterns.

RFC is a Random Forest Classifier which aims to fit the temporal tweets volume curve with three parameters [8].

DT-Rank is a ranking model implemented with a decision tree method to detect trending rumors, which ranks the clustered results by focusing on rumors with enquiry phrases and clustering disputed factual claims [28].

DTC is a Decision Tree Classifier, which models information credibility based on overall statistic handcrafted features [1].

SVM-RBF is an SVM-based classifier adopting an RBF kernel, which models information credibility based on overall statistic handcrafted features [25].

Note that although the method proposed by Ma et al. [14] can also detect rumors, it jointly optimizes rumor detection and stance classification, which makes it unsuitable for comparison here. Methods that model propagation structures [13, 24] need the propagation trees of posts for each event, so they cannot be compared on Weibo and Twitter. Following the setting of previous works [11, 27], we select 10% of the data for validation, and split the remaining 90% into training and testing sets in a 3:1 ratio for both datasets. Note that the validation, training and testing sets are obtained by stratified shuffling according to class. We employ Accuracy, Precision, Recall and \(F_{1}\) to evaluate the performance on rumor detection [27].
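The data split described above can be sketched as follows: 10% of each class is held out for validation, and the remainder is divided 3:1 into training and testing sets. The function name, the fixed random seed, and the use of `round` for the split sizes are assumptions; the exact splitting code of the compared works is not given here.

```python
import random

def stratified_split(labels, seed=0):
    """Return (train, test, val) lists of indices into `labels`.

    Per class: 10% validation, then the remaining events are split 3:1
    into training and testing, after shuffling with a fixed seed.
    """
    rng = random.Random(seed)
    train, test, val = [], [], []
    for cls in sorted(set(labels)):
        idx = [i for i, l in enumerate(labels) if l == cls]
        rng.shuffle(idx)
        n_val = round(0.1 * len(idx))
        n_test = round(0.25 * (len(idx) - n_val))
        val += idx[:n_val]
        test += idx[n_val:n_val + n_test]
        train += idx[n_val + n_test:]
    return train, test, val
```

With 40 events per class, each class contributes 4 validation, 9 testing, and 27 training events, so the class ratio is preserved in all three sets.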

Our proposed SGERD is implemented with Keras. For each dataset, we set the regularization coefficient \(\lambda \) to 0.001, the dimensionality d of word embeddings to 100, the inner-layer dimensionality h of the FFN to 50, and the filter sizes ws of the three convolution layers to (3, 3, 3). The corresponding filter numbers m for the three layers are (50, 20, 20) for Weibo and (50, 10, 10) for Twitter. The above hyperparameters are tuned on the validation set.

4.3 Comparison with Baselines

Table 1 presents the performance of different methods in terms of Accuracy, Precision, Recall, and \(F_{1}\). The accuracy on Twitter is generally much lower than on Weibo because the Twitter dataset is smaller than the Weibo dataset and has a higher ratio of reposts. The methods rank as follows in terms of rumor detection performance: SGERD, CAMI, AIM, GRU-2, SVM-TS, RFC, DTC, SVM-RBF, and DT-Rank. All methods based on deep neural networks (DNN) perform better than the conventional ones. The classical methods, i.e., SVM-TS, RFC, DTC, SVM-RBF, and DT-Rank, are mainly implemented by combining traditional classifiers with different sets of handcrafted features. They are not able to capture high-level features, and thus fail to be effective in the complicated scenarios of social media. In contrast, methods based on DNN, i.e., SGERD, CAMI, AIM, and GRU-2, can extract deep latent semantic features and tend to adapt to complicated scenarios. Furthermore, DT-Rank extracts regular expressions from posts with enquiry phrases, which are not common in the datasets we used. Relying on limited enquiry phrases, DT-Rank therefore has rather restricted adaptability to different datasets. Compared to SVM-RBF and DTC, both SVM-TS and RFC achieve better performance by integrating time-series structure, which confirms that temporal features are significant for rumor detection.

Table 1. Rumor detection results (R: Rumor; N: Non-rumor)

Among all DNN-based methods, our proposed SGERD outperforms all previously published ones for rumor detection. All these methods can model high-level features and extract deep latent semantic information from posts, and thus reach high performance. However, the baselines all have limitations. GRU-2 is capable of capturing the variation of contextual characteristics over rumor diffusion, but it is biased towards the latest group, which usually does not play a key role. CAMI has been shown to extract both local and global features through the convolution operation and to reveal high-level interactions, but its main structure is a simple CNN, which is not inherently equipped with a sense of group order. Both GRU-2 and CAMI use unsupervised methods to generate group embeddings, which do not consider the different importance of each post. AIM employs two kinds of attention mechanisms to help exploit the dynamic information of each post; however, the potential interactions among nearby groups are discarded since no grouping method is used. By overcoming these shortcomings, SGERD achieves a considerable performance improvement on rumor detection.

4.4 Ablation Experiments

To evaluate the function of different components for learning group embeddings, we implement three variations of SGERD, denoted as SGERD - P, SGERD - G, and SGERD - \(T_{emb}\), respectively. Compared with SGERD, SGERD - P directly uses the unsupervised average word embeddings to generate group embeddings, rather than learning the representation of each post with supervision. Similarly, SGERD - G represents a group by applying a max-pooling operation over the supervised post embeddings, without the convolution operation that learns local features for the group embedding. Note that these two variations of our model do not use the temporal embedding. To investigate the contribution of the temporal embedding, SGERD - \(T_{emb}\) is implemented by removing the temporal embedding from SGERD. The results of the ablation experiments on rumor detection for each dataset are shown in Table 2.

Table 2. Rumor detection results (R: Rumor; N: Non-rumor)

Compared to SGERD - \(T_{emb}\), SGERD - P performs worse on both datasets, decreasing by 0.3% and 0.4% on the Weibo and Twitter datasets, respectively. This shows that learning the representation of each post with supervision is helpful for the rumor detection task, because it generates more task-oriented post embeddings. Similarly, SGERD - G decreases by 0.5% and 0.9% on the Weibo and Twitter datasets, respectively. This shows that the convolution operation for extracting local features of nearby posts is important for detecting rumors. Furthermore, the performance drop of SGERD - G is larger than that of SGERD - P. We assume this is because the convolution operation can combine features of multiple posts in the same time window, which may be discarded when max-pooling is applied directly over the post embeddings. Compared to SGERD in Table 1, SGERD - \(T_{emb}\) decreases by 0.6% and 0.9% on Weibo and Twitter, respectively, which indicates that the temporal information of posts plays an important part in deciding whether or not an event is a rumor, and that our temporal embeddings successfully model the temporal features. Finally, all these variations, each lacking a specific component, still perform better than the state-of-the-art methods, which indicates that it is better to model the post-level content using our supervised method than to aggregate groups in an unsupervised way.

4.5 Early Detection of Rumors

In practice, rumors usually need to be detected as early as possible, so early detection of rumors is a crucial task. To investigate the performance of SGERD on early detection of rumors, we set several detection deadlines and ignore all posts published after a given deadline. The mean official report time (ORT) of rumors given by the debunking services of Snopes and the Sina community management center is taken as a reference. We conduct early detection experiments on AIM, CAMI, GRU-2 and SVM-TS for comparison, since these methods have the best performance among all mainstream methods. Although DT-Rank is mainly designed for the early detection task, its performance on rumor detection is much poorer than that of the other baseline methods, and thus we do not take it into consideration.
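Applying a detection deadline amounts to truncating each event before it is fed to a model, as in the sketch below. The function name and the assumption that timestamps are given in seconds are illustrative choices, not details taken from the experiments.

```python
def truncate_at_deadline(posts, deadline_hours):
    """Keep only posts published within `deadline_hours` of the event start.

    `posts` is a non-empty list of (text, timestamp) pairs sorted by
    timestamp, with timestamps in seconds (assumption).
    """
    start = posts[0][1]
    cutoff = start + deadline_hours * 3600
    return [(text, ts) for text, ts in posts if ts <= cutoff]
```

For a 12 h deadline, a post published 25 h after the first post of the event is dropped, while posts from the first two hours are kept.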

The accuracy of different methods at different detection deadlines is presented in Fig. 2, from which we can observe that the accuracy curves of most methods climb from a small value and gradually converge to a certain accuracy. During the first several hours, the accuracy of SGERD climbs rapidly and converges to a relatively high value earlier than the other methods, which take longer to converge and cannot reach such a high accuracy. The accuracy of SGERD reaches 91.4% for Weibo and 77.3% for Twitter within 12 h, which is much earlier than the official report time of rumors.

Fig. 2. Early detection performance of rumors

When the temporal embedding is discarded, SGERD - \(T_{emb}\) achieves an accuracy curve nearly coinciding with that of SGERD, which indicates that the temporal embedding has a rather limited impact on early detection. This is because the proposed temporal embedding is designed to model temporal information for sequences of posts, which requires the input sequence to be long enough. For early detection of rumors, however, the detection deadlines are restricted to a certain range and the number of posts is limited. As a result, the temporal features of posts cannot be captured, and the temporal embedding is not as effective for early detection as for usual rumor detection. Similarly, SVM-TS and GRU-2 model the time-series information of the input sequences in a way that conflicts with the requirements of early detection. Therefore, SVM-TS and GRU-2 are unsuitable when the detection deadline is early: their accuracy curves climb slowly and converge to low accuracies. Without integrating the temporal information, CAMI can extract key features even from a short sequence of posts, so its accuracy curves are steadier and stay higher than those of GRU-2 and SVM-TS. However, CAMI does not consider the different importance of each post in the same group, and AIM ignores the variation among nearby groups as it does not group the posts. Therefore, their accuracy curves converge to lower levels than that of SGERD. With the benefit of modeling from the post level, SGERD can mine useful posts and alleviate the negative effect of redundant posts, and it ranks first for early detection at every stage.

4.6 Samples of the Salient Posts

Similar to the visualization work in information retrieval [21], we present samples of the salient posts extracted by SGERD. Firstly, for each post, we evaluate the output of CNN before the max-over-time pooling operation, i.e., picking out the largest output value \(p^{*}\) as in Eq. (2) among all the different filters to represent this post. Secondly, we sort all the posts according to their output value \(p^{*}\), and trace back to posts that have large output value for each event. Finally, we get the salient posts making significant contribution to rumor detection. Figure 3 presents several events, from which we can visualize posts with large output values in blue that contribute more to rumor detection. Similarly, we illustrate trivial posts with small output values in green for comparison.
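The ranking procedure described above can be sketched as follows, starting from the feature maps computed before max-over-time pooling. Mapping each convolution window to its center post follows the note in the caption of Fig. 3; the function name, the 0-based indices, and the nested-list layout are assumptions.

```python
def rank_posts_by_saliency(feature_maps, ws):
    """Rank the posts of one group by their largest pre-pooling CNN output.

    `feature_maps` holds one list per filter, each of length len_i - ws + 1.
    Window j covers posts j .. j + ws - 1 and is attributed to its center
    post. Returns (post_index, p_star) pairs, most salient first.
    """
    n_windows = len(feature_maps[0])
    scored = []
    for j in range(n_windows):
        # p* is the largest output among all filters for this window.
        p_star = max(fmap[j] for fmap in feature_maps)
        center = j + ws // 2
        scored.append((p_star, center))
    scored.sort(reverse=True)
    return [(center, p_star) for p_star, center in scored]
```

Posts at the top of the returned list correspond to the salient posts, and posts at the bottom to the trivial ones.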

Fig. 3. Extracted salient and trivial posts in rumor events. Blue and green represent posts with large and small output values, respectively. Note that the feature of a post is extracted from that post together with the posts in the same window, but only the center post is selected here. (Color figure online)

Sample (a) is an identified rumor on Weibo about “Enter reversed password when robbers threaten you to take money from ATM and it will call the police for help secretly”, sample (b) is an incorrect opinion about beer spreading on Weibo, which claims that beer causes male feminization, sample (c) is a false news story on Twitter about hairbands exported from China, and sample (d) is a fabricated claim on Twitter that vegetarian hot dogs contain meat and human DNA. From the above examples, we can observe that many posts expressing doubt about or opposition to these events, such as “is it true?”, have a relatively large output value, while the posts with small values are redundant (e.g., “repost”) or just give a factual description. This visualization step is especially useful for rumor detection because it provides explanatory information about how the model works and helps us better understand what is learned by SGERD.

5 Conclusions

In this paper, we have proposed a supervised group embedding based rumor detection model named SGERD. Our SGERD processes each event starting from posts and leverages the max-pooling operation to alleviate the effect of redundant posts, so it is able to pick out salient posts. Furthermore, SGERD incorporates temporal information to equip CNN with a sense of group order, and it models the influence of different groups. To demonstrate the effectiveness of our SGERD, we have done experiments on two real-world datasets, and the results show that it outperforms traditional handcrafted features based models as well as deep neural network models (i.e. GRU-2, CAMI and AIM). Finally, visualizing salient posts contributing more to rumor detection can help us comprehend better how SGERD works, and help domain experts more easily verify conclusions drawn by this automatic method.