Introduction

Internet and social media impact all aspects of our lives. We use them to read news, connect with friends and family, share opinions, buy products, and entertain ourselves. They affect our beliefs and behaviour, and so they shape our political, financial, health, and other important decisions. Unfortunately, as a result, social networks have created an information platform in which automated accounts (including human-assisted bot accounts and bot-assisted humans) can try to take advantage of the system for various opportunistic reasons: trigger collective attention [12, 28], gain status [10, 43], monetize public attention [9], diffuse disinformation [5, 16, 34], or seed discord [47]. It is known that a large fraction of active Twitter users are bots [44] and that they are responsible for much disinformation—see [48] for many examples of manipulation of public opinion. Having said that, not all bot accounts are designed to harm or take advantage of other users. Some of them are legitimate and useful tools, such as chatbots that respond to common questions of users, or knowbots that are designed to automatically retrieve useful information from the Internet. On the other hand, human accounts may also spread disinformation and be responsible for other malicious behaviour. Detecting bots and understanding the roles they play within the system falls into a common machine learning task of node classification. It is important to note that in this work we do not focus on the intent of the bot accounts (whether they are benign or malicious); determining this is outside the scope of our study.

The objectives of this paper are to investigate: (1) whether graph embeddings extract information from the associated network that can be successfully used for the node classification task, (2) what the relative value of classical vs. structural node embeddings is for bot detection, and (3) whether the predictive power of embeddings depends on their complexity (measured by the dimension of the embedding). To achieve these goals, we start by defining classical and structural embedding techniques. Classical embedding techniques, such as Node2Vec [21] and DeepWalk [37], learn information related to the proximity of nodes in the network. On the other hand, structural embedding algorithms, such as Role2Vec [1] and Struc2Vec [39], learn representations of the local structure surrounding each node. In our work, we build features using both classical and structural embedding techniques and use those features to train models for classifying bots.

In our experiments, we concentrate on Twitter data and the task of identifying bot accounts, but our questions (and answers) are broader and so potentially more influential. They are applicable to all kinds of networks and data sets that are naturally represented as graphs, which, of course, includes social media platforms such as Twitter. Moreover, they are applicable to a much wider class of machine learning tasks: node classification algorithms train a model to learn which class a node of the graph belongs to. Bot detection is a specific example of this class of problems in which a binary classification is performed (nodes are categorized into bots and humans). However, in general, multi-class classification is also often considered and needed. Other important applications of this nature include, for example, identifying nodes associated with users that might be interested in some specific product, or detecting hostile actors. For this reason, there is an increasing need for effective methods of analyzing data represented as graphs. For more details we direct the reader to a recent survey [30] and a book [25]. Lastly, we point out that although our study focuses on bot detection on the Twitter social network, bot detection in general is a domain-specific task. Users (including bots) on other social networks may interact with one another in different ways, which could impact the predictive information captured by the underlying social network.

There are many approaches that can be used to perform node classification in graphs. Most techniques attempt to detect bots at the account level by processing many social media posts and analyzing them using various NLP techniques with the goal of extracting important and distinguishing features. These features are usually complemented with user metadata, friend metadata, content and language sentiment, as well as temporal features [13]. In this paper, we will refer to these features as NLP and P (Profile). These techniques are very powerful, but a supervised machine learning algorithm is only as good as the data used for training. Unfortunately, good quality datasets with the ground truth are rarely available. An additional challenge is that bot accounts evolve rapidly, and so one needs to constantly update datasets and evolve the set of features to keep up with the other side. In particular, hot topics discussed on social media evolve rapidly; for example, NLP features that were important for bot detection before presidential elections in some country might become quickly outdated after the election. Similarly, results of NLP analysis cannot be easily transferred from one language to another or across different geographical regions or countries. Furthermore, recent developments in large language models (LLMs) such as GPT-4 [35] will make it more difficult to distinguish human-generated from bot-generated language. Language-derived features, which may have helped identify bots, may become obsolete as bots take advantage of more sophisticated LLMs. As a result, a collaborative effort of many researchers and data scientists is needed to maintain bot detection models. One successful example is Botometer, a bot detection tool developed at Indiana University using various labelled datasets and 1,209 features (in the current, third version of the model) [48]; see also [42] for a new supervised learning method that trains classifiers specialized for each class of bots and combines their decisions through the maximum rule (an ensemble approach). Botometer handles over a quarter million requests every day! However, since the bot score is intended to be used with English-language accounts, what can one do with non-English accounts? What if the content or metadata is not easily available? Finally, what about other node classification tasks that cannot rely on such powerful tools as Botometer?

An alternative approach is to use features of nodes that can be calculated exclusively using graph data. The main advantage of this approach is that such information is easier to obtain and is typically less sensitive, as it does not include the analysis of user messages and the metadata associated with them. More importantly, it can be hypothesized that the signal is more stable in time and graph space, that is, if some topological structure of the network indicates that some nodes are likely to be bots, then such a signal is likely to lose its predictive power more slowly than, for example, discussion topics extracted from NLP features. Typical features concentrate on local properties of nodes such as node degree, various node centralities, the local clustering coefficient, etc. We will refer to features derived using this approach as GF (Graph Features). The idea behind this is that bots need to use some strategies to form an audience. They employ various algorithms to gather followers and expand their own social circles, such as following popular accounts and asking to be followed back [2], generating specific content around a given topic in the hope of gaining trust and catching attention [17], or even interacting with other users by engaging in conversation [24]. These algorithms create networks around the bots that should be structurally and topologically distinguishable from the ones around real human beings, which, in turn, affects the extracted graph features. The same rationale applies to other applications of node classification.

The above approach, based on the analysis of predefined graph features, has proved to be useful in various node classification tasks, but it has a few issues. First of all, very often the features of one node alone are not enough to adequately classify the node. Indeed, bots typically work in a coordinated way and are usually not suspicious when considered individually. Hence, bot detection requires combining information about multiple bots and analyzing them together [11]. This is often very challenging, both conceptually and computationally, as it requires considering at least a quadratic number of pairs of nodes. Moreover, such features capture properties that are rather local, whereas some embedding algorithms aim to extract more global and structural properties. We also often do not have access to a complete network but rather sample it using some sampling method. Unfortunately, the choice of a sampling algorithm may substantially affect GF. Finally, the features that are to be analyzed need to be predefined by the analyst. Therefore, the result of this approach depends heavily on the skills, knowledge, or just sheer luck of the user.

To solve at least some of these problems, we propose to utilize node embedding algorithms that assign nodes of a graph to points in a low-dimensional space of real numbers. The goal of the embedding is to decrease the dimension but, at the same time, to extract the most important features of nodes and ignore noise. We will call features obtained with this approach EMB. As mentioned, we consider two classes of embedding techniques. The first, which we call classical embeddings, focuses on learning local and global proximity information about nodes. Such techniques can be used to identify communities and groups in networks. The second class of algorithms, called structural embeddings, learns representations of the local graph structure around each node. Structural embedding techniques are often used to identify what roles nodes play in their local environment. These algorithms (both classical and structural) have quickly become an intensely researched topic in recent years; see, for example, [8] or a recent book [25], and the list of potential applications constantly increases. After reducing the dimension via node embeddings, node classification can be done more efficiently compared to extracting graph features and using the original network to identify synchronized behaviour. On the other hand, synchronized behaviour should create similar network structure around the involved nodes and so should be captured by the embedding. Such groups of nodes may then potentially be extracted (even in an unsupervised way) by machine learning tools such as DBSCAN that are able to identify dense regions of the embedded space. Some embedding algorithms not only capture local properties of graphs but also try to pay attention to the global structure and the different roles the nodes play within the network [14, 39], which might carry more predictive power than local GF. An additional benefit of such an approach, in comparison to using GF, is that features are identified automatically in an unsupervised way by the algorithm, as opposed to having to be identified manually by the analyst. Finally, embeddings seem to be less sensitive to sampling techniques and so they might be used as a foundation for more robust classification algorithms. There are many different node embedding algorithms considered in the literature. Additionally, these embeddings have many hyperparameters, of which a common one among all embedding approaches is the target dimension. Although embeddings have been considered for various tasks in the earlier literature, an analysis of how useful they can be for bot detection remains an open field that we investigate in this paper. To answer this question we report on the predictive power of NLP, P, GF, and EMB features. This investigation also allows us to compare classical and structural embeddings to find which of them are more useful for this task. Additionally, we check how the target embedding dimension hyperparameter affects their predictive power.
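To make this comparison concrete, the overall evaluation idea can be sketched as follows. This is only an illustration, not the paper's exact setup: the feature matrices (X_profile, X_nlp, X_gf, X_emb), the random forest classifier, and the AUC metric are assumptions introduced for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def score_feature_set(X: np.ndarray, y: np.ndarray) -> float:
    """Cross-validated AUC of a classifier trained on one feature set."""
    clf = RandomForestClassifier(n_estimators=300, random_state=42)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

# X_profile, X_nlp, X_gf, X_emb: per-node feature matrices (rows aligned with
# the label vector y, where 1 = bot), built as described in later sections.
for name, X in [("P", X_profile), ("NLP", X_nlp), ("GF", X_gf), ("EMB", X_emb),
                ("GF+EMB", np.hstack([X_gf, X_emb]))]:
    print(name, round(score_feature_set(X, y), 3))
```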

Research motivation and goals

As we have highlighted in the introduction, there are many aspects of user data that one could leverage in building models for identifying whether a user is a bot or not. In this work, we categorize these features into four groups. The first two are features captured from user profile data (P) and features derived from user tweets using natural language models (NLP). The third are simple graph features computed for each node, such as degree or eigenvector centrality (denoted GF). The fourth are features extracted from the user's social network using embedding algorithms (EMB). EMB features are further broken down into classical and structural embeddings. We build bot classification models using various combinations of the above feature sets, for two different datasets (as will be highlighted in the coming sections). It is important to note that our focus is not on identifying whether bot accounts are malicious or benign. Also, we are not introducing a new bot classification model; rather, our aim is to compare the predictive power of these feature sets (NLP, P, GF and EMB), with the focus on node (classical) and structural features extracted using embedding algorithms as a new and novel source of predictive features. More specifically, we compare the predictive power of structural versus classical embeddings and show that bot accounts on Twitter often form local social structures, which can be captured by structural embedding techniques.

The main contributions of our paper are as follows:

  • By analyzing the performance of bot detection models using various combinations of feature sets, we show that all four feature sets (NLP, P, GF and EMB) have predictive power for identifying bot accounts.

  • The addition of classical and structural embedding features enhances the performance of bot detection models, hinting at the fact that the structure of the users' social network contains clues for detecting bot accounts that are not captured by other types of features.

  • By analyzing six different embedding algorithms (EMB) and comparing their performance to human-engineered features (GF), we show that embedding algorithms can capture, in an unsupervised way, features with predictive power that would be difficult to design manually.

  • Using two different Twitter datasets, we show that features extracted using structural embedding techniques have higher predictive power as compared to features learned using classical embedding techniques.

  • We perform dimensionality analysis on both structural and classical embeddings and show that increasing the dimensionality of embeddings does not add much value; already low-dimensional embeddings are useful for bot detection.

  • Lastly, we perform stability analysis on our embedding features and show that models built using embedding algorithms can be resistant to the addition of noise in the underlying network.

Finally, let us stress that although these results are promising and show the potential of algorithms based on graph embeddings, this is an early stage of research in this direction. We finish the paper with a discussion of future work that will deepen our understanding of the power (as well as potential issues) of embedding algorithms.

Related work

In this section, we provide a brief overview of various studies focusing on bot detection algorithms as well as the feature sets used for building such models. There are numerous studies focusing on feature engineering and feature extraction from user information on social media networks such as Twitter [18, 27, 30, 33, 45]. For example, in their work Minnich et al. [33] categorize the feature sets used for detecting bots into the following categories: metadata-based features, content-based features, temporal-based features, and network-based features. In this work, the authors mention the importance of including information about the social network of bot accounts, such as the number of followers captured by node out-degrees. Similar feature sets were used by Lee et al. [26], where the authors focus on extracting features from user-generated information such as tweets and profile data, in addition to first-degree graph features such as node degree. Indeed, many of the research efforts on extracting network features are focused on first-degree features that can be mapped to real-world metrics, such as the number of followers or friends, in addition to the network of users who directly interact with a user's tweets by liking or re-sharing them. As we will show in our work, higher-order network features, such as the network structure of a user's followers (the followers of one's followers), can have additional predictive power when it comes to building bot classification algorithms. There are also a number of studies focused on using node and graph embeddings as features (in addition to the other types of features mentioned above) for building bot detection algorithms [3, 4, 22, 29, 38]. For example, Alkulaib et al. [4] build a bot detection technique using the anomalous properties of certain nodes in the graph. The authors use a graph transformer as a self-attention encoder to learn both node and structural representations of nodes. In another work, Hamdi et al. [22] investigate fake news detection on Twitter using node and graph embeddings. Although the studies mentioned here highlight the fact that embeddings can be used as a source of features with predictive power for building classifiers on the social network, they do not explore the difference between structural versus classical (node) embedding techniques.

Although the majority of the effort has been focused on identifying bots, there has been some recent research that focuses on identifying whether bot accounts are malicious or benign. For example, Mbona et al. [31] use similar feature sets, such as user information and tweet data, to predict whether a user account is malicious or benign. Other recent research includes the two datasets used in this study: Feng et al. [15] (TwiBot-20) and Stella et al. [43] (Italian Election). We recognize that labeled bot datasets often contain some level of bias, since the real ground truth is not readily available. In general, labels are identified by careful analysis of humans or by cleverly designed algorithms. Throughout our study, we stay aware of this fact and highlight any impact this may have on our findings.

Datasets

In the TwiBot-20 dataset, the authors focus on building a comprehensive Twitter dataset composed of semantic, property, and neighbourhood information. Here, semantic refers to the Tweet text generated by the user; property refers to information related to the user profile, such as the number of followers and followings; and, finally, neighbourhood refers to the network structure of the user. We highlight the features used from this dataset below. To capture a natural representation of the ground-truth Twittersphere, the authors implemented a breadth-first search algorithm to sample and build the dataset. In this methodology, a user is selected as the root of the tree and subsequent layers are built using the directed follow edges of each user. This process is repeated up to layer 3, creating a sample network with a selected user at its root [15]. The sampling algorithm used by the authors builds a directed graph, where nodes are users and edges represent follow relationships. As highlighted by Feng et al., this method of sampling does not focus on any particular topic or pattern and should be a more natural representation of the Twittersphere.
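The layer-wise sampling can be sketched as follows. This is a simplified illustration of the procedure described by Feng et al., not their code; get_following is a hypothetical helper that would wrap the Twitter API, and the per-user cap is an assumption.

```python
import networkx as nx

def sample_twittersphere(seed, get_following, max_layer=3, per_user=20):
    """Breadth-first expansion along directed follow edges, up to layer 3."""
    G = nx.DiGraph()
    G.add_node(seed)
    frontier = [seed]
    for _ in range(max_layer):
        next_frontier = []
        for u in frontier:
            for v in get_following(u)[:per_user]:  # cap neighbours per user
                if v not in G:
                    next_frontier.append(v)
                G.add_edge(u, v)  # edge u -> v means "u follows v"
        frontier = next_frontier
    return G
```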

We compile a list of raw features available from the TwiBot-20 dataset in Table 1. Note that the values for these features are a snapshot captured at the time of sampling. We categorize each feature into three types: Profile, NLP and Graph. The profile features are data points available through Twitter's API and highlight some properties of each user. As pointed out by Feng et al., the followers and followings are randomly selected. We use the raw user Tweets as the input to our NLP feature engineering. We provide more detail in our NLP feature analysis section. Lastly, graph features are built using the raw edge list provided in the TwiBot-20 dataset. As mentioned before, an edge between two nodes indicates a follow relationship between them. Although the original network provided by Feng et al. is a directed graph, we convert it to an undirected graph for our analysis. Finally, we note that the profile feature verified is excluded from the bot classification process. This is done for two main reasons. Firstly, most user accounts are not subject to Twitter's verification process, in which an account is confirmed to be owned by the user it claims to be. This process would inherently exclude bots from being verified. Secondly, due to the nature of the verification process, this feature could introduce bias for any classifier, thus making the discovery of other meaningful features more difficult.

Table 1 Feature list for TwiBot-20 dataset

In the Italian Election dataset, Stella et al. [43] aim to investigate the online social interactions during the 2018 Italian election and how they help to understand the political landscape. In their work, the authors study the relationship between real users and bots, using the Twitter network. Unlike the TwiBot-20 dataset, the authors build a sample of the social network by focusing on tweets containing a list of political topics, such as "#ItalyElection2018", "#voto", etc. The sampling technique used by Stella et al. results in a network with a vastly different graph topology than that created by Feng et al. By sampling the Twittersphere based on topics, the authors created a dataset in which nodes are users and edges represent interactions between users, such as retweets or mentions. Although this makes it difficult to compare the performance of bot detection algorithms between these two datasets, having diversity in how a social network is constructed helps us understand how bots manifest themselves within a network. The Italian Election dataset also contains labels indicating whether a user is identified as a bot or not. As described by the authors, the bot/not-bot labels were generated using a classifier trained on Twitter users' profile information [43]. Although the original dataset used by Stella et al. [43] contains user profile and raw Tweet data, in this work we only have access to the network data and thus we can only focus on features extracted from the underlying network structure. Similar to the TwiBot-20 network, the Italian Election graph is directed, with edges pointing from users who interact with other users' content. We also convert the Italian Election graph into an undirected graph for the purpose of our study.

We summarize some high-level statistics of both networks in Table 2. It is important to note that we apply additional data cleansing and filtering to the provided datasets. For example, we run our analysis on the largest component of each graph, and we convert both graphs into undirected networks. The reason for converting these graphs to undirected networks is that some embedding algorithms only take undirected graphs as input. Using undirected networks ensures that the comparison between the performance of each embedding is fair. It is important to note, however, that by converting graphs from directed to undirected we lose some (potentially predictive) information. Lastly, we note that the sampling technique used to construct the above two networks has a potential impact on the level of predictive information captured by features built using node/graph statistics and embedding algorithms. One could construct a social network based on a variety of information; for example, edges could represent follow/friend relationships, retweets, or likes. Nodes themselves could represent users or tweets.

Table 2 Graph statistics for the Twibot-20 and Italian Election datasets
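A minimal sketch of this cleansing step, assuming the raw edge list has already been loaded into a directed NetworkX graph G_raw (the exact loading code depends on the dataset format):

```python
import networkx as nx

G = G_raw.to_undirected()                              # drop edge direction
largest_cc = max(nx.connected_components(G), key=len)  # keep the largest component
G = G.subgraph(largest_cc).copy()
print(G.number_of_nodes(), G.number_of_edges())        # cf. Table 2
```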

Profile and NLP features

In this section we focus on features extracted from users' profile information and their tweets. We perform feature engineering, especially on the raw tweets, using various NLP techniques. Since we only have profile and tweet data for the Twibot-20 dataset, our analysis is centered around this dataset. To maximize their impact on a social network, bots aim to mimic real-user behaviour. To this end, bots aim to create accounts and content that seem natural, as if they were generated by a real user. Examples of such actions include following other users, tweeting about relevant topics, and engaging in conversations. Despite their efforts, as we will discuss in this section, bots often leave behind signs that allow us to distinguish them from non-bots. Starting with the profile Twitter API data listed in the "Datasets" section, the number of public lists that a user is a member of, \(listed\_count\), is strikingly different for the two groups (bots vs. non-bots)—see Fig. 1 and Table 3. It is a measure of a user's popularity, and it turns out that humans tend to be added to Twitter lists by other users of the platform more often than bots are. It means that, in general, Twitter users value human-generated tweets and intuitively prefer this type of content.

Table 3 Statistics for the profile features belonging to the Twibot-20’s bot and non-bot accounts
Fig. 1 Histogram of \(listed\_count\), number of public lists that users are a member of, for bots and non-bots
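A small exploratory sketch of this comparison, assuming the profile fields have been loaded into a pandas DataFrame named profiles with a boolean bot label column (the names are illustrative, not the paper's code):

```python
import numpy as np
import matplotlib.pyplot as plt

bins = np.logspace(0, 6, 40)                     # listed_count is heavy-tailed
for label, grp in profiles.groupby("bot"):
    plt.hist(grp["listed_count"].clip(lower=1), bins=bins, alpha=0.5,
             density=True, label="bot" if label else "non-bot")
plt.xscale("log")
plt.xlabel("listed_count")
plt.legend()
plt.show()
```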

It is worth noting the difference between the number of users that follow an account (\(followers\_count\)) and the number of accounts the user follows (\(friends\_count\))—see Fig. 2 and Table 3. There is a clear asymmetry here. In general, humans follow fewer users and get followed more than bots do. The reason behind this could be that bots become friends with many users in order to seem more legitimate and, at the same time, human users are less interested in bot-generated content than in content created by humans.

Fig. 2 Histogram of \(followers\_count\), number of users that follow the account (left), and \(friends\_count\), number of user's followings (right), for bots and non-bots

In addition to the original profile data from the Twitter API listed in the "Datasets" section, we have extracted a number of features from user tweets in the Twibot-20 dataset. This was done by sampling each user's tweets and running NLP feature extraction on them. For language detection we utilized the fastText Python module [6], whereas for the sentiment analysis (only for the English tweets) we used a pre-trained HuggingFace transformer model [46]. Here is the list of extracted features (a brief extraction sketch follows the list):

  • \(links\_no\) - total number of hyperlinks in user tweets

  • \(mentions\_no\) - total number of references to other users in user tweets

  • \(tweets\_no\) - total number of tweets generated by user

  • \(links\_per\_tweet\) - average number of hyperlinks per tweet\(^{1}\)

  • \(mentions\_per\_tweet\) - average number of references to other users per tweet\(^{1}\)

  • \(av\_tweet\_len\) - average tweet length (in characters)

  • \(std\_tweet\_len\) - standard deviation of tweet length

  • \(no\_langs\) - number of dominant languages used in tweets (it is assumed that each tweet has exactly one dominant language)

  • \(perc\_en\) - percentage of tweets written in English

  • \(no\_odd\_langs\) - number of languages present in less than 10% of tweets

  • \(perc\_legit\) - percentage of tweets written in languages present in more than 10% of tweets

  • \(av\_sent\) - average sentiment score (using the scores of the dominant labels)

  • \(std\_sent\) - standard deviation of the sentiment score (using the scores of the dominant labels, e.g., taking \(-0.9\) for a 0.9 score with negative sentiment)

  • \(positive\_sent\_perc\) - percentage of English tweets with positive sentiment assigned
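The sketch below illustrates how a subset of these features could be computed with the tools mentioned above. It assumes the fastText lid.176.bin language-identification model has been downloaded and that each user's tweets are available as a list of strings; the exact sentiment model and thresholds used in the paper may differ.

```python
import fasttext
from transformers import pipeline

# Assumptions: lid.176.bin has been downloaded from fasttext.cc, and the
# default sentiment pipeline stands in for the model used in the paper.
lang_model = fasttext.load_model("lid.176.bin")
sentiment = pipeline("sentiment-analysis")

def nlp_features(tweets):
    """Compute a subset of the per-user features listed above."""
    langs = []
    for t in tweets:
        labels, _ = lang_model.predict(t.replace("\n", " "))
        langs.append(labels[0].replace("__label__", ""))
    en_tweets = [t for t, lang in zip(tweets, langs) if lang == "en"]
    scores = []
    for t in en_tweets:
        out = sentiment(t[:512])[0]  # truncate long inputs
        scores.append(out["score"] if out["label"] == "POSITIVE" else -out["score"])
    n = len(tweets)
    return {
        "tweets_no": n,
        "links_per_tweet": sum(t.count("http") for t in tweets) / n,
        "mentions_per_tweet": sum(t.count("@") for t in tweets) / n,
        "av_tweet_len": sum(len(t) for t in tweets) / n,
        "no_langs": len(set(langs)),
        "perc_en": len(en_tweets) / n,
        "av_sent": sum(scores) / len(scores) if scores else 0.0,
        "positive_sent_perc": sum(s > 0 for s in scores) / len(scores) if scores else 0.0,
    }
```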

Fig. 3 Histogram of \(tweets\_no\), total number of tweets (left), and \(mentions\_per\_tweet\), average number of references to other users per tweet (right), for bots and non-bots

Based on exploratory data analysis, there are some noticeable differences between bots and non-bots. First of all, as reported in [15], bots in Twibot-20 generate fewer tweets than humans do—see Fig. 3 (left). This is quite surprising and in contrast with earlier findings reported in [36]. This difference may be attributed to the fact that bots change behaviour over time; they are constantly getting more clever. Currently, they interact with the system only to achieve a very specific goal and often disappear shortly after, generating fewer tweets in total. This indicates that the NLP approach cannot be easily generalized and might require constant re-training.

It seems that bots tag other users more frequently than humans do—see Fig. 3 (right). The reason might be that some types of bots do not produce much of their own content but, instead, tag many different users to attract their attention and hope for a potential link click. This is reflected in \(mentions\_per\_tweet\) (the average number of references to other users per tweet) depicted in the figure, but also in \(mentions\_no\), its cumulative counterpart. We use the former value in further analyses.

Fig. 4 Histogram of \(links\_no\), the total number of hyperlinks (left), and \(links\_per\_tweet\), the number of hyperlinks per tweet (right), for bots and non-bots

Perhaps surprisingly, Fig. 4 suggests that the same reasoning cannot be as easily applied to \(links\_no\) and its counterpart \(links\_per\_tweet\), which focus on the number of hyperlinks generated by users. The two charts are more ambiguous. For \(links\_no\), it seems that humans include more hyperlinks in total than bots do. The lowest values of this feature (less than 80) are dominated by bots and the largest ones are inconclusive (interchanging between bot and human dominance). However, if one looks at the per-tweet counterpart, the largest values in the distribution are visibly assigned to bots more often than to humans. Bots achieve the extreme values of \(links\_per\_tweet\) about twice as frequently as non-bots, even though the direction was not obvious for the absolute value—the conclusion is that humans generate more hyperlinks because they generate more tweets. For this reason, we exclude \(links\_no\) from the analysis and work with per-tweet features instead.

Fig. 5 Histogram of \(av\_sent\), average sentiment score (left), and \(positive\_sent\_perc\), percentage of positive tweets (right), for bots and non-bots

In both examples above, we removed two features that are highly correlated with others, without affecting the quality of the model. Of course, one does not need to do this and could instead let the classifier deal with the situation. Removing redundant features is, however, good practice and the main reason to perform EDA. It reduces the dimension of the problem and so improves scalability. It is further evidence that the NLP approach requires the supervision of domain experts and careful investigation.
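A simple version of this step can be sketched as follows; the 0.9 threshold is an illustrative choice rather than the value used in the paper, and df is assumed to hold the numeric NLP/profile features.

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature out of every pair whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    # keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)
```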

One of the most noticeable differences seems to be the one in the sentiment of the posts. In Fig. 5 there is a visible mismatch between bots and non-bots, both in terms of sentiment score (left) and sentiment label (right). While for both groups the histograms are right-skewed, tweets posted by non-humans in general tend to be more negative than those created by real people. Bots may be present on social media platforms to lead a campaign against some product, company or political faction, and their actions are deterministic, not affected by external sources. On the other hand, humans seem to react more strongly to negative stimuli, that is, we believe negative opinions more so than praise [41]. This may explain the observed phenomenon.

Fig. 6 Histogram of \(no\_odd\_langs\), number of languages present in less than 10% of user tweets (left), and \(perc\_en\), percentage of tweets written in English (right), for bots and non-bots

As one would expect, the way tweets are written seems to differ linguistically between bots and non-bots. Language detection performed on tweets suggests that humans may write in a more convoluted way, in which case the fastText model fails to detect the tweet language properly. In Fig. 6 (left) we present the number of languages that the user used in less than 10% of their tweets, i.e. \(no\_odd\_langs\). A larger fraction of bots use zero or only one rare language. The same result holds for \(no\_langs\) and \(perc\_legit\). Moreover, bots seem to use mostly English, as can be observed in Fig. 6 (right). There are only a few bots in the dataset for which the percentage of tweets written in English (\(perc\_en\)) is low.

With recent advances in Transformer models, computer-generated text is becoming ever more human-like. A prominent example is OpenAI's GPT-3 [7], a generative model for NLP tasks with 175 billion parameters. GPT-3 has been demonstrated to be effective on a variety of few-shot tasks: due to its extensive pre-training and size, it is able to learn rapidly from very few training examples. It generates texts that are nearly indistinguishable from human-written texts: humans correctly distinguished GPT-3 generated texts from real human texts approximately 52% of the time, which is not significantly better than random chance [7]. For more details on GPT-3 and other related topics we direct the reader to, for example, a recent survey [30].

Another potential source of information for identifying bot accounts lies within the raw tweet text produced by each user. To this end, we perform topic modelling using BERTopic (BERT for Topic Modelling) [20]. BERTopic is a topic modelling technique that uses transformers and c-TF-IDF to produce dense clusters that allow for clearly understandable topics while maintaining key phrases in the topic descriptions. This is done to gain insight into differences in the types of topics bot and non-bot accounts focus on while interacting with other users. In Figs. 7 and 8 we highlight the scores of various topics used by bot and non-bot accounts. The graphs represent relative c-TF-IDF scores between and within topics. Darker shades mean that the tweets in which the words from a particular topic appeared are strongly related to each other. Topics extracted here are tokenized and used in the bot classification models, as we will highlight in later sections. Note that topics for bots and non-bots are not the same (e.g. topic 0 for bots is roughly topic 1 for non-bots).
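A minimal sketch of this topic modelling step, assuming bot_tweets and human_tweets are lists of raw tweet strings for the two groups; BERTopic's default configuration is used here and may differ from the one behind Figs. 7 and 8.

```python
from bertopic import BERTopic

def fit_topics(docs):
    """Fit a BERTopic model and return it together with a topic summary."""
    model = BERTopic(language="english")
    topics, _ = model.fit_transform(docs)  # assigns a topic id to every document
    return model, model.get_topic_info()   # topic ids, sizes, and representative words

bot_model, bot_topics = fit_topics(bot_tweets)
human_model, human_topics = fit_topics(human_tweets)
print(bot_model.get_topic(0))              # top words of topic 0 with their c-TF-IDF scores
```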

Fig. 7 Topic scores for tweets made by bots

Fig. 8 Topic scores for tweets made by non-bots

Graph derived features

We showed in the "Profile and NLP features" section that there are statistically significant differences in NLP and P features between bot and non-bot accounts. Another, potentially independent, source of features is rooted in the way users/bots interact with others in the network. One can capture this information by analyzing various graph properties derived from the underlying social network. This can be done in two ways: first, by carefully designing statistical features of the nodes; second, by using unsupervised methods to learn node and structural representations of the nodes. In this section, we provide a detailed analysis of node feature engineering in addition to features extracted using various embedding techniques.

Node features

In this section, we build node features derived from the underlying network structure for both the TwiBot-20 and Italian Election datasets. For extracting features we use the NetworkX and igraph Python packages, depending on the efficiency of the corresponding algorithms. Here is the list of extracted node features that were computed for all nodes (a brief extraction sketch follows the list and the description of the neighbour aggregates). For detailed definitions we direct the reader to, for example, [25] or any other textbook on network science.

  • \(degree\_centrality\) - degree (the number of edges the vertex has)

  • strength - minimum ratio of edges removed/components created during graph decomposition process

  • \(eigen\_centrality\) - eigenvector centrality, a measure of the importance of the vertex (using relative scores)

  • closeness - closeness centrality, a measure of the importance of the vertex calculated using the sum of the lengths of the shortest paths between the vertex and other vertices

  • \(harmonic\_centrality\) - harmonic centrality (another variant of closeness centrality, calculated similarly)

  • betweenness - betweenness centrality, a measure of the importance of the vertex calculated using number of shortest paths that pass through the node

  • authority - authority score, the sum of the scaled hub values of the nodes that have an edge to the given node

  • \(hub\_score\) - hub score, the sum of the scaled authority values of the nodes it has an edge to

  • constraint - Burt’s constraint, an index that measures the extent to which a person’s contacts are redundant

  • coreness - coreness (unique value of k such that a node belongs to the k-core but not to the \((k+1)\)-core)

  • eccentricity - eccentricity (the maximum distance from a given node to other nodes)

  • pagerank - PageRank, another way of measuring node importance, originally invented by Google Search to rank web pages in its search engine output

In addition to the above list of features, we compute the average, standard deviation, minimum, and maximum of every feature over the neighbouring nodes of each vertex. A full list of these features is given in Tables 6 and 7.
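A sketch of the feature extraction referenced above, assuming G is the undirected largest component described in the "Datasets" section; only a subset of the listed measures is shown, and the paper also relies on igraph for some of the heavier computations.

```python
import networkx as nx
import pandas as pd

def node_features(G: nx.Graph) -> pd.DataFrame:
    """Compute a few node features plus mean/std/min/max over each vertex's neighbours."""
    feats = pd.DataFrame({
        "degree_centrality": dict(G.degree()),
        "eigen_centrality": nx.eigenvector_centrality(G, max_iter=1000),
        "closeness": nx.closeness_centrality(G),
        "pagerank": nx.pagerank(G),
        "coreness": nx.core_number(G),
    })
    for col in list(feats.columns):
        stats = {
            v: feats.loc[list(G.neighbors(v)), col].agg(["mean", "std", "min", "max"])
            for v in G.nodes()
        }
        stats = pd.DataFrame(stats).T.add_prefix(f"{col}_nb_")
        feats = feats.join(stats)
    return feats
```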

Fig. 9 Histogram of closeness centrality measure for bots and non-bot users' neighbours for TwiBot-20 (left) and Italian Election dataset (right)

Fig. 10 Histogram of degree centrality measure for bots and non-bot users' neighbours for TwiBot-20 (left) and Italian Election dataset (right)

Fig. 11 Histogram of harmonic centrality measure for bots and non-bot users' neighbours for TwiBot-20 (left) and Italian Election dataset (right)

Fig. 12 Histogram of pagerank measure for bots and non-bot users' neighbours for TwiBot-20 (left) and Italian Election dataset (right)

As we highlighted in the "Datasets" section, there are major differences in how the TwiBot-20 and Italian Election datasets were constructed. Firstly, the underlying network constructed in TwiBot-20 captures follower-following relationships, while the network in the Italian Election dataset represents interactions between users. Secondly, the sampling technique used in the TwiBot-20 dataset results in a much more uniform graph topology, since at each sampling layer a fixed number of nodes (followers) were sampled. This is in contrast to the Italian Election dataset, where nodes were sampled more randomly. The difference in network topology between these two datasets is reflected in the values captured by the node features, as shown in Figs. 9, 10, 11, and 12. This is indeed an important observation, since one could extract more meaningful node features by resampling the same underlying graph using different techniques.

Moreover, the features whose calculation did not involve neighbours (Figs. 9, 10, 11, and 12) indicate only slight differences between bots and non-bots, both in terms of feature count and the magnitude of discrepancies. Nevertheless, in the case of harmonic and closeness centrality (Figs. 11 and 9) the difference is more visible on the Italian Election dataset: bots seem to be more likely to take extreme values. Regarding the TwiBot-20 dataset, the discrepancies between bots and non-bots are less visible, but the closeness, degree, harmonic centrality, and pagerank distributions all seem to be more left-skewed for non-bots (Figs. 9, 10, 11, and 12).

Fig. 13 Histogram of neighbours' mean betweenness measures for bots and non-bot users' neighbours for TwiBot-20 (left) and Italian Election dataset (right)

Fig. 14 Histogram of neighbours' max closeness measures for bots and non-bot users' neighbours for TwiBot-20 (left) and Italian Election dataset (right)

Fig. 15 Histogram of neighbours' mean authority measures for bots and non-bot users' neighbours for TwiBot-20 (left) and Italian Election dataset (right)

Fig. 16 Histogram of neighbours' mean eccentricity measures for bots and non-bot users' neighbours for TwiBot-20 (left) and Italian Election dataset (right)

Figures 13, 14, 15, and 16 reveal that discrepancies between the bot and non-bot groups are more visible in the distributions of features that involve a node's neighbours in their calculation. Similarly to the previous group of characteristics, differences are more visible on the Italian Election data and, again, values in this dataset seem to have lower variance (Figs. 13, 14, and 15) or lower variance among groups (Fig. 16). In particular, values of the bots' features seem to have even lower standard deviation (Figs. 14 and 16), which may be an indicator of the fact that bots constitute a homogeneous group. Nevertheless, different conclusions may be drawn on the basis of the TwiBot-20 dataset (Figs. 13, 14, 15, and 16), so this observation may be attributed to the different sampling or annotation methods.

The fact that node features constructed on the basis of data about vertices' neighbours may help in distinguishing bots from non-bots (at least more than pure node features) indicates that node embeddings are worth using. However, as this observation is based solely on graphical analysis, one may be interested in modelling the relationship between node features and "being a bot". This is done in the following sections.

Classical and structural embeddings

There are over 100 algorithms proposed in the literature for classical and structural embeddings, based on various approaches such as random walks, linear algebra, and deep learning [19, 25]. Moreover, many of these algorithms have various parameters that can be carefully tuned to generate embeddings in multidimensional spaces, possibly of different dimensions. In this paper, we typically set all parameters but the dimension to the default values recommended by their authors. Once the parameters are fixed, the algorithms learn the embedding in an unsupervised way. Having said that, some algorithms are randomized and so the outcome might vary. For our experiments, we selected six popular algorithms that span different families and include both node and structural embeddings.

The first two algorithms, DeepWalk [37] and Node2Vec [21], are based on random walks performed on the graph. This approach was successfully used in NLP; for example, the Word2Vec algorithm [32] is based on the assumption that "words are known by the company they keep". For a given word, an embedding is obtained by looking at words appearing close to each other, as defined by context windows (groups of consecutive words). For graphs, the nodes play the role of words and "sentences" are constructed via random walks. The exact procedure for performing such random walks differs between the two algorithms we selected.

In the DeepWalk algorithm, the family of walks is sampled by performing random walks on graph G, typically between 32 and 64 per node, each of some fixed length. The walks are then used as sentences. For each node \(v_i\), the algorithm tries to find an embedding \(e_i\) of \(v_i\) that maximizes the approximated likelihood of observing the nodes in its context windows obtained from the generated walks, assuming independence of observations.

In Node2Vec, biased random walks are defined via two main parameters. The return parameter (p) controls the likelihood of immediately revisiting a node in the random walk. Setting it to a high value ensures that we are less likely to sample an already-visited node in the following two steps. The in-out parameter (q) allows the search to differentiate between inward and outward nodes, so we can smoothly interpolate between breadth-first search (BFS) and depth-first search (DFS) exploration.
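As an illustration, the sketch below uses one publicly available Node2Vec implementation (the node2vec Python package); the dimensions, walk parameters, and p/q values are illustrative rather than the exact settings used in our experiments.

```python
import networkx as nx
from node2vec import Node2Vec  # one of several available implementations

# G is assumed to be the undirected graph from the "Datasets" section.
node2vec = Node2Vec(
    G,
    dimensions=32,    # target embedding dimension (the hyperparameter we vary)
    walk_length=40,   # length of each random walk
    num_walks=32,     # number of walks started from every node
    p=1.0,            # return parameter: larger p discourages revisiting the previous node
    q=1.0,            # in-out parameter: q > 1 is more BFS-like, q < 1 more DFS-like
    workers=4,
)
model = node2vec.fit(window=10, min_count=1)           # trains Word2Vec on the generated walks
embedding = {v: model.wv[str(v)] for v in G.nodes()}   # node -> 32-dimensional vector
```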

The above algorithms primarily capture proximity among the nodes: nodes that are close to one another in the network are embedded close together. But proximity among the nodes does not always imply similarity, as is the case in the specific application we consider in this paper, bot detection. The role the nodes play within the network depends more on the structure of the network around them than on the distance between them. (See [40] for a survey on roles.) The next four algorithms aim to create embeddings that capture structural properties of the network.

The first algorithm from this family, Role2Vec [

Availability of data and materials

The two datasets that are analyzed (TwiBot-20 and Italian Election) are publicly available—see the "Datasets" section for more details. A Jupyter notebook containing all experiments performed in the paper is available from the corresponding author on request.

Notes

  1. Redundant feature—it may be computed from the other features but it is explicitly included in the model.

References

  1. Ahmed Nesreen K, Rossi Ryan A, Lee John Boaz, Willke Theodore L, Zhou Rong, Kong Xiangnan, Eldardiry Hoda. role2vec: Role-based network embeddings. In Proc. DLG KDD, 2019;1–7.

  2. Aiello Luca Maria, Deplano Martina, Schifanella Rossano, Ruffo Giancarlo. People are strange when you’re a stranger: Impact and influence of bots on social networks. In Sixth International AAAI Conference on Weblogs and Social Media, 2012.

  3. Ali Alhosseini Seyed, Bin Tareaf Raad, Najafi Pejman, Meinel Christoph. Detect me if you can: Spam bot detection using inductive representation learning. In Companion Proceedings of The 2019 World Wide Web Conference, 2019;pages 148–153.

  4. Alkulaib Lulwah, Zhang Lei, Sun Yanshen, Lu Chang-Tien. Twitter bot identification: An anomaly detection approach. In 2022 IEEE International Conference on Big Data (Big Data), pages 3577–3585. IEEE, 2022.

  5. Bail Christopher A, Guay Brian, Maloney Emily, Combs Aidan, Hillygus D Sunshine, Merhout Friedolin, Freelon Deen, Volfovsky Alexander. Assessing the Russian internet research agency’s impact on the political attitudes and behaviors of American twitter users in late 2017. Proc Natl Acad Sci. 2020;117(1):243–50.

  6. Bojanowski Piotr, Grave Edouard, Joulin Armand, Mikolov Tomas. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.

  7. Brown Tom B, Mann Benjamin, Ryder Nick, Subbiah Melanie, Kaplan Jared, Dhariwal Prafulla, Neelakantan Arvind, Shyam Pranav, Sastry Girish, Askell Amanda, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

  8. Cai Hongyun, Zheng Vincent W, Chen-Chuan Chang Kevin. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Trans Knowl Data Eng. 2018;30(9):1616–37.

  9. Carter Daniel. Hustle and brand: The sociotechnical shaping of influence. Social Media + Society, 2016;2(3):2056305116666305.

  10. Cha Meeyoung, Haddadi Hamed, Benevenuto Fabricio, Gummadi Krishna. Measuring user influence in twitter: The million follower fallacy. In Proceedings of the International AAAI Conference on Web and Social Media, 2010;volume 4.

  11. Chavoshi Nikan, Hamooni Hossein, Mueen Abdullah. Debot: Twitter bot detection via warped correlation. In ICDM, 2016;pages 817–822.

  12. De Domenico Manlio, Altmann Eduardo G. Unraveling the origin of social bursts in collective attention. Sci Rep. 2020;10(1):1–9.

  13. Dong Guozhu, Liu Huan. Feature engineering for machine learning and data analytics. CRC Press, 2018.

  14. Donnat Claire, Zitnik Marinka, Hallac David, Leskovec Jure. Learning structural node embeddings via diffusion wavelets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018;pages 1320–1329.

  15. Feng S, Wan H, Wang N, Li J, Luo M. Twibot-20: A comprehensive twitter bot detection benchmark. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021;1:4485–4494.

  16. Freelon Deen, Bossetta Michael, Wells Chris, Lukito Josephine, Xia Yiping, Adams Kirsten. Black trolls matter: Racial and ideological asymmetries in social media disinformation. Soc Sci Comput Rev. 2022;40(3):560–78.

  17. Freitas Carlos, Benevenuto Fabricio, Ghosh Saptarshi, Veloso Adriano. Reverse engineering socialbot infiltration strategies in twitter. In 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 25–32. IEEE, 2015.

  18. Gao Hongyu, Chen Yan, Lee Kathy, Palsetia Diana, Choudhary Alok N. Towards online spam filtering in social networks. NDSS. 2012;12:1–16.

  19. Goyal Palash, Ferrara Emilio. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Syst. 2018;151:78–94.

  20. Grootendorst Maarten. BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics. 2020.

  21. Grover Aditya, Leskovec Jure. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 2016;pp. 855–864.

  22. Hamdi Tarek, Slimi Hamda, Bounhas Ibrahim, Slimani Yahya. A hybrid approach for fake news detection in twitter based on user features and graph embedding. In Distributed Computing and Internet Technology: 16th International Conference, ICDCIT 2020, Bhubaneswar, India, January 9–12, 2020, Proceedings 16, 2020;p. 266–280. Springer.

  23. Henderson Keith, Gallagher Brian, Eliassi-Rad Tina, Tong Hanghang, Basu Sugato, Akoglu Leman, Koutra Danai, Faloutsos Christos, Li Lei. Rolx: structural role extraction & mining in large graphs. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, 2012;pages 1231–1239.

  24. Hwang Tim, Pearce Ian, Nanis Max. Socialbots: Voices from the fronts. Interactions. 2012;19(2):38–45.

  25. Kamiński Bogumił, Prałat Paweł, Théberge François. Mining Complex Networks. CRC Press, 2021.

  26. Lee Kyumin, Eoff Brian, Caverlee James. Seven months with the devils: A long-term study of content polluters on twitter. In: Proceedings of the international AAAI conference on web and social media. 2011;5:185–92.

  27. Lee Sangho, Kim Jong. Warningbird: A near real-time detection system for suspicious urls in twitter stream. IEEE transactions on dependable and secure computing. 2013;10(3):183–95.

  28. Lehmann Janette, Gonçalves Bruno, Ramasco José J, Cattuto Ciro. Dynamical classes of collective attention in twitter. In Proceedings of the 21st international conference on World Wide Web, 2012;p. 251–260.

  29. Magelinski Thomas, Beskow David, Carley Kathleen M. Graph-hist: Graph classification from latent feature histograms with application to bot detection. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020;34:5134–5141.

  30. Matwin Stan, Milios Aristides, Prałat Paweł, Soares Amilcar, Théberge François. Generative methods for social media analysis. SpringerBriefs in Computer Science, 2023.

  31. Mbona I, Eloff JHP. Classifying social media bots as malicious or benign using semi-supervised machine learning. J Cybersec. 2023;9(1):015.

  32. Mikolov T, Sutskever I, Chen K, Corrado-Greg S, Dean J. Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, 2013;1:3111–3119.

  33. Minnich Amanda, Chavoshi Nikan, Koutra Danai, Mueen Abdullah. Botwalk: Efficient adaptive exploration of twitter bot networks. In Proceedings of the 2017 IEEE/ACM international conference on advances in social networks analysis and mining 2017, 2017;pages 467–474.

  34. Monti Federico, Frasca Fabrizio, Eynard Davide, Mannion Damon, Bronstein Michael M. Fake news detection on social media using geometric deep learning. arXiv preprint arXiv:1902.06673, 2019.

  35. OpenAI. Gpt-4 technical report, 2023.

  36. Perdana Rizal Setya, Muliawati Tri Hadiah, Alexandro Reddy. Bot spammer detection in twitter using tweet similarity and time interval entropy. Jurnal Ilmu Komputer dan Informasi. 2015;8(1):19–25.

  37. Perozzi Bryan, Al-Rfou Rami, Skiena Steven. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014;p. 701–710.

  38. Pham Phu, Nguyen Loan TT, Vo Bay, Yun Unil. Bot2vec: A general approach of intra-community oriented representation learning for bot detection in different types of social networks. Inform Syst. 2022;103: 101771.

  39. Ribeiro Leonardo FR, Saverese Pedro HP, Figueiredo Daniel R. struc2vec: Learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 2017;385–394.

  40. Rossi Ryan A, Ahmed Nesreen K. Role discovery in networks. IEEE Trans Knowl Data Eng. 2014;27(4):1112–31.

  41. Rozin Paul, Royzman Edward B. Negativity bias, negativity dominance, and contagion. Pers Soc Psychol Rev. 2001;5(4):296–320.

  42. Sayyadiharikandeh Mohsen, Varol Onur, Yang Kai-Cheng, Flammini Alessandro, Menczer Filippo. Detection of novel social bots by ensembles of specialized classifiers. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020;pages 2725–2732.

  43. Stella Massimo, Cristoforetti Marco, De Domenico Manlio. Influence of augmented humans in online interactions during voting events. PLoS ONE. 2019;14(5): e0214210.

  44. Tan Zhaoxuan, Feng Shangbin, Sclar Melanie, Wan Herun, Luo Minnan, Choi Yejin, Tsvetkov Yulia. Botpercent: Estimating twitter bot populations from groups to crowds. arXiv preprint arXiv:2302.00381, 2023.

  45. Thomas Kurt, Grier Chris, Ma Justin, Paxson Vern, Song Dawn. Design and evaluation of a real-time url spam filtering service. In 2011 IEEE symposium on security and privacy, pages 447–462. IEEE, 2011.

  46. Wolf Thomas, Debut Lysandre, Sanh Victor, Chaumond Julien, Delangue Clement, Moi Anthony, Cistac Pierric, Rault Tim, Louf Rémi, Funtowicz Morgan, Davison Joe, Shleifer Sam, von Platen Patrick, Ma Clara, Jernite Yacine, Plu Julien, Xu Canwen, Scao Teven Le, Gugger Sylvain, Drame Mariama, Lhoest Quentin, Rush Alexander M. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, October 2020. Association for Computational Linguistics.

  47. Woolley Samuel C, Howard Philip N. Computational propaganda: political parties, politicians, and political manipulation on social media. Oxford University Press, 2018.

  48. Yang Kai-Cheng, Varol Onur, Davis Clayton A, Ferrara Emilio, Flammini Alessandro, Menczer Filippo. Arming the public with artificial intelligence to counter social bots. Hum Behav Emerg Technol. 2019;1(1):48–61.


Acknowledgements

Not applicable.

Funding

The project was supported by Patagona Technologies through Canadian Department of National Defense project on “Detecting and Responding to Hostile Information Activities: unsupervised methods for measuring the quality of graph embeddings”.

Research program of PP is partially supported by NSERC under Discovery Grant No. 2022–03804.

Author information

Contributions

ADe, BK, PP designed experiments and were major contributors in writing the manuscript. ADe, KS, AS, ADu performed experiments. All authors provided feedback. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Paweł Prałat.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Tables 6 and 7

Table 6 Statistics for the Twibot-20 dataset node features
Table 7 Statistics for the Italian Election dataset node features

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Dehghan, A., Siuta, K., Skorupka, A. et al. Detecting bots in social-networks using node and structural embeddings. J Big Data 10, 119 (2023). https://doi.org/10.1186/s40537-023-00796-3

