1 Introduction

Online social media platforms such as Facebook, Twitter, and WeChat have become essential channels for individuals to interact, access information, and share posts. Nevertheless, social media has also become a convenient platform for the rapid spread of false information, rumors, and terrorist statements [1]. This phenomenon poses a significant challenge to society and to the supervision of online media. A classical rumor event occurred during the Fukushima earthquake in Japan (2011), which raised fears of a nuclear leak [2]. Iodized salt was believed to protect the body from nuclear radiation; hence, many Chinese consumers rushed to buy salt and supermarkets quickly sold out, causing widespread panic and confusion among the public. Another example is the belief concerning COVID-19 that ingesting pure alcohol could eradicate the virus in an infected body, which led to 800 fatalities in Iran and an additional 5876 hospitalizations for methanol poisoning [3]. These cases demonstrate that public opinion is adversely affected by the spread of malicious rumors, which disrupt the regular social order and weaken government credibility [4]. It is therefore important to take proactive measures to minimize the negative impact of network rumors as soon as they are discovered on social media.

Rumor control is essential for providers of social platform services to supply accurate and truthful information, preventing rumors from spreading further and potentially causing more harm. The techniques for blocking rumors discussed in previous research can be divided into three categories:

  • Control rumor propagation by blocking nodes [5,6,7,8,9,10,11]: The purpose of these approaches is to reduce the spread of rumors by identifying influential nodes in a social network and blocking them when rumors are spread;

  • Control rumor propagation by blocking key edges [12,13,14]: These techniques restrict rumor propagation by obstructing a particular set of edges that are useful for rumor propagation;

  • Clarifying rumors by spreading the truth [15,16,17]: The assumption behind these approaches is that once individuals come to understand the truth, they will no longer believe the rumor. Their primary idea is to propagate the truth by identifying a set of nodes that users can trust.

Previous studies have shown that limiting the influence of key users in the dissemination of rumors can be an effective way of controlling rumor propagation. However, these works treat rumor blocking as a static process and use greedy techniques to solve it; they do not consider how blocking nodes influences rumor propagation over multiple propagation rounds. This paper investigates a new problem in minimizing the influence of rumors on social networks, referred to as the dynamic rumor influence minimization (DRIM) problem, with the goal of blocking rumor propagation.

To keep the rumor propagation process comprehensible, the process represented by an independent cascade model is divided into multiple time steps. At every time step, our aim is to discover an appropriate group of message-blocking individuals (blockers), denoted by B and comprising k members. The messages sent by the blockers are filtered or blocked to prevent rumors from propagating from the blockers to other nodes, which forms the basis of rumor control.

Our strategy for addressing the DRIM problem involves two components. First, a static rumor propagation model (SRPM) is developed based on rumor popularity and the independent cascade pattern, and a dynamic rumor propagation model (DRPM) is then constructed by treating rumor popularity as a dynamic variable that evolves over time steps. Second, we propose a rumor-blocking model based on deep reinforcement learning, which selects the most suitable blockers to control rumor propagation by interacting with the DRPM. Finally, experimental results demonstrate that the models learned by deep reinforcement learning achieve better results in a variety of situations. The main contributions of this paper can be summarized as follows:

  • We formally introduce the dynamic rumor influence minimization (DRIM) problem, which incorporates the dynamic changes resulting from rumor propagation in social networks better than its predecessor, the static RIM problem.

  • The popularity of rumors is determined according to the characteristics of information propagation in a social network. This paper presents two types of rumor propagation models: a static model (SRPM) that assumes that popularity remains constant and a dynamic model (DRPM) that considers changes in popularity. The models obtained from this study may be beneficial in simulating real-world rumors.

  • We propose a deep reinforcement learning-based rumor blocking model to control the dissemination of rumors. The model has the ability to modify the control policy depending on the state evolution of a social network. Analyzing the blocking model can yield insights into rumor control by providing a dynamic perspective on network evolution.

The rest of this paper is organized as follows. Section 2 reviews the work related to rumor influence minimization and reinforcement learning. Section 3 introduces the preliminaries of social networks, rumor propagation models, and reinforcement learning. Section 4 formalizes the dynamic rumor influence minimization problem, and its solution is provided in Section 5. The experimental results and the conclusion are reported in Sections 6 and 7, respectively.

2 Related work

In this section, we examine existing research concerning the influence minimization of rumors and the application of reinforcement learning.

2.1 Rumor influence minimization

A great deal of research has been conducted on ways to reduce the influence of rumors. The first serious exploration of the influence between users in a social network was made in the work of Domingos et al. [18]. Kempe et al. [19] cast viral marketing as an optimization problem, referred to as influence maximization (IM). Inspired by the influence maximization problem, Fan et al. [5] explored the opposite problem, least-cost rumor blocking. They attempted to identify a minimum set of nodes that would act as protectors; protectors are actively involved in limiting the negative impact of a rumor, i.e., reducing the number of individuals affected by it. User experience in minimizing rumor influence was studied by Wang et al. [6], who provided a method for addressing IM issues while maintaining a high level of user experience. Among the various propagation models proposed for rumor analysis, Adil et al. [16, 17] investigated the problem of multiple-rumor dissemination in a social network and offered the HISBM model to address this issue. The dynamic programming approach of Yan et al. [7] showed how to solve the rumor influence minimization problem in a tree network. This paper proposes a reinforcement learning-based approach to reduce the influence of rumors, which is distinct from previous methods: it can adjust its blocking strategy dynamically in response to the propagation of rumors in a social network, and the outcomes of this process are used to refine the strategy.

2.2 Deep reinforcement learning

Recent advances in computer vision and natural language processing have been driven by improvements in deep learning techniques. Deep learning has also enabled remarkable progress on reinforcement learning challenges, such as the game of Go [20] and Atari games [21]. Reinforcement learning is a type of machine learning in which an agent learns to interact with its environment to maximize a reward, improving its performance over time. Unlike other types of machine learning, reinforcement learning relies on a reward function, which helps the agent determine the value of its actions and guides its decision-making. Reinforcement learning algorithms can generally be divided into two categories:

  • Value function-based approaches. These approaches optimize the strategy while maintaining a value function. An example is the Q-learning algorithm proposed by Watkins et al. [22]. The algorithm of Mnih et al. [21] outperformed all previous algorithms by combining Q-learning with deep learning in their study of Atari games. The authors referred to the improved algorithm as the deep Q-network (DQN); its stability and efficiency were improved by using frozen target networks and experience replay.

  • Policy-based approaches. These approaches directly model and optimize strategies without maintaining a value function. Early work on policy-based approaches was undertaken by Williams et al. [23], who sampled trajectories and proposed the REINFORCE algorithm. Policy-based training suffers from high variance across trajectories, which makes it challenging. Silver et al. [24] developed the deterministic policy gradient (DPG) algorithm, and Lillicrap et al. [25] employed a DQN to estimate the value function on top of the DPG algorithm, producing the deep deterministic policy gradient (DDPG) algorithm.

Combining a value function with an explicit representation of a policy yields an actor-critic method, in which the value function is used as a baseline to compute policy gradients. Actor-critic methods differ from plain baseline methods only in that they employ a learned value function.

Motivated by the reinforcement learning research on influence maximization [26], we pioneer the use of reinforcement learning to solve the influence minimization problem. Compared with the previous RIM methods, the reinforcement learning-based method achieves more successful outcomes in a variety of scenarios.

3 Preliminaries

This section briefly introduces three concepts and their related definitions: social networks, rumor propagation models, and reinforcement learning.

3.1 Social networks

A social network is usually represented by a directed graph \(G = (\mathcal {V}, \mathcal {E})\), where the set \(\mathcal {V}\) of nodes and the set \(\mathcal {E}\) of edges denote the users and the relationships between users (e.g., following or being followed), respectively. Figure 1 illustrates the social network of Zachary’s karate club, where different-colored nodes represent distinct communities and the size of each node indicates its influence within the community. Nodes with a strong influence play a prominent role in spreading rumors. It has been observed that nodes with more connections have a stronger capacity to spread information; consequently, nodes with a higher degree generally tend to be more influential.

Fig. 1

Illustration of Zachary’s karate club social network. Zachary’s karate club is a university club whose network consists of 34 nodes, 78 edges, and two opinion leaders. The two opinion leaders have attracted distinct followings, resulting in the establishment of distinct communities

An edge \((u,v) \in \mathcal {E}\) in a rumor propagation scenario represents the fact that user v follows user u. As a result, user u is permitted to share a rumor with user v. Let \(p_{uv} \in [0,1]\) denote the probability that node u activates v, i.e., the probability that the rumor is passed from user u to user v. Specifically, we have \(p_{uv} = 0\) when edge \((u,v) \notin \mathcal {E}\).

3.2 Rumor propagation models

With the development of deep learning, social networks have been studied in greater depth [27,28,29]. Rumors share many similarities with ordinary information in terms of how they spread, and most rumor propagation solutions can be regarded as a particular type of information propagation model. Models for simulating rumors have been explored in several studies [6, 16, 17, 30], many of which recognize the central role played by the linear threshold and independent cascade patterns.

Linear threshold models are characterized by a predetermined threshold \(\theta\) for each node. A node is activated when the accumulated influence of its neighbors exceeds the fixed value \(\theta\), and activation continues until no inactive node satisfies the threshold condition. In contrast to linear threshold models, which use fixed threshold parameters, independent cascade models propagate information probabilistically. Propagation is typically depicted as a sequence of node-state distributions over discrete time steps starting at an initial time t = 0. Let \(S_{u}(t)\) represent the activation state of a node u at moment t; node u is active at moment t if \(S_{u}(t) = 1\) and inactive if \(S_{u}(t) = 0\). A node u that is activated at moment \(t_{i}\) will activate its inactive neighbor v at moment \(t_{i+1}\) with probability \(p_{uv}\). The propagation process simulated by an independent cascade pattern terminates if no node is activated at moment \(t_{i}\). In this paper, independent cascade models serve as the foundation of our models, into which rumor propagation characteristics are incorporated.
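To make the independent cascade process concrete, the following minimal Python sketch simulates one diffusion over a NetworkX directed graph. The edge attribute name 'p' for the propagation probability and the function interface are illustrative assumptions, not the paper's implementation.

```python
import random
import networkx as nx

def independent_cascade(G: nx.DiGraph, seeds, rng=random.Random(0)):
    """Simulate one rumor diffusion under the independent cascade model.

    Each edge (u, v) is assumed to carry a propagation probability G[u][v]['p'];
    an activated node gets exactly one chance to activate each inactive successor.
    Returns the set of all nodes ever infected.
    """
    infected = set(seeds)          # nodes that have been exposed to the rumor
    active = set(seeds)            # nodes that can still spread it at this step
    while active:
        newly_active = set()
        for u in active:
            for v in G.successors(u):
                if v not in infected and rng.random() < G[u][v].get('p', 0.0):
                    newly_active.add(v)
        infected |= newly_active
        active = newly_active      # previously active nodes become merely infected
    return infected
```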

An independent cascade model is illustrated in Fig. 2 to represent the propagation of rumors. The user states of a network can be divided into three categories: uninfected, infected, and activated. An uninfected node has not been affected by the rumor. An infected node has already been exposed to the rumor, but is unable to propagate it. An activated node has just been affected by the rumor and can spread it. Each edge has a weight value that indicates the probability of spreading the rumor along that edge. For example, a weight of 0.7 on edge (1,3) implies that the probability of rumor propagation from node 1 to node 3 is 0.7. Figure 2(a) shows the situation at the initial point of rumor propagation with node 1 as the original node. According to Fig. 2(b), the rumor spreads from node 1 to node 3, but not to node 2. A node in an independent cascade model has only one opportunity to activate its adjacent nodes. Therefore, at time step 2, node 1 switches from the active to the infected state, thereby losing its ability to activate other nodes. Instead, node 3 continues to activate nodes 2 and 4. The sole active node 6 has no successor that can be activated at the last time step, as shown in Fig. 2(d). Eventually, the propagation process ends.

Fig. 2

Illustration of the independent cascade model. The decomposition diagrams of four time steps shows how probabilities affect propagation in an independent cascade model

3.3 Reinforcement learning

Recent developments in AI have heightened the need for reinforcement learning [31,32,33]. Reinforcement learning adjusts strategies to maximize expected reward based on continuous interaction with the environment and involves two main objects: the agent and the environment. The agent perceives changes in its environment and acts accordingly. The environment reacts to the agent’s actions by changing its state and offers feedback to the agent in the form of rewards. In addition to the agent and environment, there are several other crucial components.

  • States: The environment description at a moment is called a state, denoted by s, which refers to the discrete situations of rumor spreading in a social network.

  • Actions: The node blocking behavior of an agent is referred to as an action, denoted as a.

  • Policies: A policy in rumor analysis is a function, denoted by π, describing the agent’s behavior: it determines which action a the agent takes in a particular state s. For example, the policy π(s) = a represents an action a that decides which nodes should be blocked from spreading the rumor in state s.

  • Markov Decision Processes: “Markovianity” is the property that a “future” state is independent of the “past”. Markov decision processes (MDPs) are stochastic processes that exhibit this property, and almost all reinforcement learning problems can be formulated as MDPs. Figure 3 provides a visual representation of the interaction between the agent and its environment in an MDP. Every interaction can be divided into three steps: (1) the agent perceives state \(s_{t}\) and reward \(r_{t}\) from the environment; (2) the agent takes action \(a_{t}\) according to state \(s_{t}\); and (3) the interaction updates the state \(s_{t}\) and the reward \(r_{t}\) to \(s_{t+1}\) and \(r_{t+1}\), respectively. The interaction is repeated multiple times to form a trajectory τ.

    $$ \tau = s_{0}, a_{0}, r_{1}, s_{1}, a_{1}, \cdots, s_{t} $$
    (1)
  • Value Function: The future reward that the agent can expect to receive from state s is represented by the value function, denoted by \(V_{\pi}(s)\). Value functions can be employed to evaluate the benefits and drawbacks of strategies. The sum of future rewards at different time steps is denoted as:

    $$ R_{t} = r_{t+1} + \gamma r_{t+2} + {\cdots} = \sum\limits_{k=0}^{T} \gamma^{k} r_{t+k+1} $$
    (2)

    where γ denotes the discount factor. Similarly, we can define the action-value function \(Q_{\pi}(s,a)\), which gives the expected future reward R conditioned on state s and action a (its standard form is recalled below).
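Written out under the definitions above, this action-value function is the expected discounted return obtained when starting from state s, taking action a, and thereafter following policy π (the standard textbook form, stated here for completeness):

$$ Q_{\pi}(s,a) = \mathbb{E}_{\pi}\left[ R_{t} \mid s_{t} = s, a_{t} = a \right] $$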

Fig. 3

Illustration of agent-environment interaction in a Markov decision process. The agent perceives state \(s_{t}\) in the environment and then takes action \(a_{t}\) according to \(s_{t}\). Due to action \(a_{t}\), the state of the environment changes to \(s_{t+1}\), which the agent then perceives. Once the environment reaches a terminal state, the process is complete

One of the breakthroughs in reinforcement learning was the development of Q-learning [22], defined by

$$ Q(s_{t}, a_{t}) \leftarrow Q(s_{t}, a_{t}) + \alpha\left[R_{t+1} + \gamma \max\limits_{a} Q(s_{t+1}, a) - Q(s_{t}, a_{t})\right] $$
(3)

The Q-learning approach learns an action-value function Q that closely approximates the optimal action-value function \(Q^{*}\). Algorithm 1 shows the process of Q-learning. In line 5 of the algorithm, the action a with the maximum Q value must be determined, which makes Q-learning unsuitable for environments with continuous state or action spaces.

Algorithm 1

Q-learning.
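Since Algorithm 1 is reproduced as a figure, the following minimal tabular Q-learning sketch is given for reference. The environment interface (reset, step, actions) and the epsilon-greedy exploration rate are illustrative assumptions, not part of the original algorithm listing.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, eps=0.1, rng=random.Random(0)):
    """Tabular Q-learning (cf. Algorithm 1). env is assumed to expose
    reset() -> state, step(action) -> (next_state, reward, done), and actions(state)."""
    Q = defaultdict(float)                                   # Q[(state, action)], zero-initialized
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            acts = env.actions(s)
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = rng.choice(acts)
            else:
                a = max(acts, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            # TD target uses the greedy value of the next state, as in (3)
            best_next = 0.0 if done else max(Q[(s2, x)] for x in env.actions(s2))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```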

With the rise of deep learning, the concept of deep reinforcement learning has also been proposed. Mnih et al. [21] proposed the deep Q-network (DQN) algorithm, which approximates Q(s,a) with a neural network \(Q_{\theta}(s,a)\) parameterized by θ. The DQN thereby overcomes Q-learning’s inability to deal with continuous or very large state spaces. Figure 4 shows how the DQN model plays an Atari game.

Fig. 4

DQN model for Atari game playing. The Atari game has a screen resolution of 210 × 160, with each pixel having two states: black and white. Therefore, the total number of states is \(2^{210 \times 160}\). Q-learning does not have the capacity to retain all of these states; consequently, the value function approximation technique was devised, which approximates the optimal Q function by utilizing a neural network. The DQN model process for the Atari game involves the following four steps: (1) divide the screen into individual pixels; (2) calculate the gray value of each pixel and output it as a vector; (3) input the vector into the neural network and obtain the Q value of the state s; and (4) select the action a with the largest Q value and execute it

4 Problem formulation

This section describes the dynamic rumor influence minimization (DRIM) problem in two respects: the dynamic rumor propagation model (DRPM) and the formalization of the DRIM problem based on this propagation model.

4.1 Dynamic rumor propagation model

Many studies of rumor propagation adopt a fixed propagation probability \(p_{uv}\). In a real social network, however, the probability of information propagation generally varies with the evolution of time and the number of participants. Hence, this study proposes a dynamic rumor propagation model (DRPM) that calculates and updates the rumor propagation probability \(p_{uv}\) dynamically. Three main factors influence \(p_{uv}\):

  (1) Credibility of spreaders. Rumors are more likely to be believed if the spreader has credibility. For example, in online social networks, bloggers who post more reliable information tend to attract more followers. This study measures the credibility of a spreader by the number of followers of each user.

  (2) Probability that an inactive node will believe a rumor. Typically, only a small percentage of the content in a social network consists of rumors. Users who follow many accounts may not notice a rumor because of their limited interaction time, and following more bloggers expands a user’s sources of information, enabling better judgement of rumors. Therefore, the more users an inactive node follows, the lower the probability that the node will ultimately believe a rumor.

  (3) Popularity of rumors. Users are more likely to believe rumors that are popular on social networks. The mechanisms of social platforms focus people’s attention on new hot topics, so the popularity of these topics changes dynamically. As a result, a rumor becomes popular when enough people share information about it.

The present study first describes a static rumor propagation model (SRPM) that combines factors (1) and (2), where the probability \(p_{uv}\) is calculated as follows:

$$ p_{uv} = \frac{\alpha * \log(1 + \textup{OUT}(u))}{\alpha * \log(1+\textup{OUT}(u)) + \beta * \log(1+\textup{IN}(v))} $$
(4)

where α and β are balance coefficients satisfying α,β ∈ (0,1) and α + β = 1. The notation OUT(u) represents the out-degree of node u. The influence of a spreader is calculated using the function \(\log (1 + \textup {OUT}(u))\); this form avoids the effect of extreme values better than using the out-degree directly. For example, a large network may contain bloggers with millions of followers, and with linearly growing influence the probability of rumor spreading by these bloggers would be very close to 1. Similarly, we use IN(v) to represent the in-degree of node v and apply \(\log (1+\textup {IN}(v))\) to capture the probability that user v receives and believes a rumor posted by user u. Since α + β = 1, the expression for \(p_{uv}\) can be simplified as follows.

$$ p_{uv} = \frac{\alpha*\log(1+\textup{OUT}(u))}{\alpha*\log(1+\textup{OUT}(u)) +(1-\alpha)*\log(1+\textup{IN}(v))} $$
(5)

The static parameter α adjusts the distribution of the propagation probability p. Unfortunately, it is difficult to simulate changes in popularity using a static parameter. Hence, we replace the static α in the SRPM with a dynamically varying popularity \(\alpha^{t}\), i.e., the popularity at moment t given by (6).

$$ \alpha^{t} = \frac{\lambda * \vert A^{t} \vert +c_{1}}{\vert I^{t} \vert +c_{2}} $$
(6)

where \(c_{1}\) and \(c_{2}\) are constants used to smooth the change in popularity, and the parameter λ is a scaling factor that controls how strongly the dynamic popularity is affected by the current number of infected users. \(\vert A^{t} \vert\) and \(\vert I^{t} \vert\) denote the numbers of activated nodes and rumor-infected nodes at moment t, respectively. The static model (SRPM) updated by (6) is called the dynamic rumor propagation model (DRPM). The popularity increases when the number of activated nodes at moment t grows significantly compared with historical moments. Over time, the popularity \(\alpha^{t}\) decreases as \(\vert I^{t} \vert\) accumulates, until the end of the propagation process. According to (4) and (6), the propagation probability \(p_{uv}^{t}\) in the DRPM is formalized as follows:

$$ p_{uv}^{t} = \frac{\alpha^{t-1} * \log(1+\textup{OUT}(u))}{\alpha^{t-1} *\log(1+\textup{OUT}(u)) + (1-\alpha^{t-1}) * \log(1+\textup{IN}(v))} $$
(7)
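For concreteness, the following is a minimal sketch of how (5)-(7) can be computed over a NetworkX directed graph. The function and parameter names are illustrative assumptions, not taken from the paper's implementation.

```python
import math
import networkx as nx

def srpm_probability(G: nx.DiGraph, u, v, alpha: float) -> float:
    """Static propagation probability p_uv of Eq. (5)."""
    out_u = math.log(1 + G.out_degree(u))
    in_v = math.log(1 + G.in_degree(v))
    denom = alpha * out_u + (1 - alpha) * in_v
    return alpha * out_u / denom if denom > 0 else 0.0

def dynamic_popularity(num_activated: int, num_infected: int,
                       lam: float = 1.0, c1: float = 1.0, c2: float = 1.0) -> float:
    """Popularity alpha^t of Eq. (6): grows with |A^t|, decays as |I^t| accumulates."""
    return (lam * num_activated + c1) / (num_infected + c2)

def drpm_probability(G: nx.DiGraph, u, v, alpha_prev: float) -> float:
    """Dynamic propagation probability p_uv^t of Eq. (7), driven by alpha^{t-1}."""
    return srpm_probability(G, u, v, alpha_prev)
```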

4.2 Dynamic rumor influence minimization

DRPMs have been established to simulate rumors dynamically. A systematic understanding of how DRPMs contribute to rumor influence analysis is still lacking. This section introduces the dynamic rumor influence minimization problem based on the proposed DRPMs.

Definition 1

Blockers: If an inactive node u is selected as a blocker at time step t, it will not be activated at time step t + 1.

Definition 2

Dynamic Rumor Influence Minimization (DRIM): DRIM for a social network \(G=(\mathcal {V}, \mathcal {E})\) aims to find the blocker set \(B^{t}\) (containing k blocker nodes) at each moment so as to minimize the final number of rumor-infected users \(\vert I^{T} \vert\), where T is the moment when rumor propagation terminates. The DRIM problem can be formalized as

$$ B^{*} = \mathop{\arg\min}\limits_{B} \vert I^{T} \vert $$
(8)

where B is the sequence \(\{B^{1}, B^{2}, \cdots, B^{T}\}\).

Let the positive integer k denote the blocker budget, i.e., the number of blockers allowed to be selected at each time step. Figure 5 illustrates the dynamic rumor influence minimization process for a social network when the blocker budget equals one. The initial state of the network is shown in Fig. 5(a). Suppose that node 1 is the only seed in the initial state. If node 2 is selected as the blocker at the initial time step, i.e., \(B^{1} = \{2\}\), it will remain unaffected while the rumor activates nodes 4 and 5. The rumor-blocking behavior at time step 2, again selecting a single blocker (k = 1), is similar to that shown in Fig. 5(a). After node 6 is chosen as the blocker at time step 2, i.e., \(B^{2} = \{6\}\), the rumor is prevented from spreading from node 5 to node 6, whereas node 8 is activated by the rumor diffusing from node 4. Finally, node 9 is selected as the blocker at time step 3, which stops the rumor propagation from node 8. As a result, no active node remains, and the propagation process is complete.

Fig. 5

Illustration of the dynamic rumor influence minimization problem

5 Methodology

Recently, investigators [6, 16, 17] have examined the use of survival theory to calculate the likelihood that nodes are activated during each time step. One of the main disadvantages of survival theory is that it ignores the future implications of blocked nodes. The main advantage of utilizing reinforcement learning for the selection of blocking nodes is that it can anticipate the results after multiple rounds of propagation, allowing us to choose the optimal nodes to block according to the final rumor propagation results. To this end, we propose a novel reinforcement learning for dynamic blocking (RLDB) model, which obtains excellent performance by considering the role of blockers in an integrated manner. The workflow of the proposed model comprises two core processes: training the model and utilizing the trained model to identify blockers. The first stage, training the model, includes gathering and analyzing data, initializing parameters, and developing Algorithm 3. The second stage utilizes the trained model to select blockers using Algorithm 2. With this workflow, the proposed model can accurately and efficiently identify blockers, thus allowing the DRIM problem to be solved.

Algorithm 2

Blocker selection using the RLDB model.

5.1 Reinforcement learning model for blocker selection

The primary difficulty encountered when using an RLDB model is determining effective blocking strategies. Figure 6 illustrates how the RLDB model selects blockers when the blocker budget satisfies k = 1. The selection process can be represented by the following three steps:

  (1) Determine which nodes in the given social network can potentially be activated in the upcoming time step and add them to the candidate set C. For example, the candidate set is C = {4,6} in Fig. 6.

  (2) Calculate the future reward R of blocking each candidate in the candidate set C using the deep Q-network.

  (3) Take the action with the greatest future reward.

Fig. 6

Reinforcement learning model for blocker selection

The algorithm for choosing the blockers of the RLDB model is outlined in Algorithm 2. In particular, between Lines 7 and 10 in Algorithm 2, we employ a neural network to predict the number of individuals influenced by the rumor after we have selected a single blocker. Predicting and selecting multiple blockers simultaneously drastically decreases the amount of computation involved and has a negligible impact on performance.
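As a rough illustration of this selection step (not the authors' exact implementation), the sketch below scores every candidate with a learned Q-network and keeps the k best. The Q-network call signature q_net(state, node) and the state encoding are assumptions.

```python
import torch

def select_blockers(q_net, state_vec: torch.Tensor, candidates: list, k: int) -> list:
    """Pick k blockers from the candidate set C (nodes reachable by the rumor
    in the next step) by scoring each candidate with the learned Q-network.

    q_net(state, node) is assumed to return the predicted future reward
    (related to the number of eventually infected users) of blocking that node.
    """
    scores = []
    with torch.no_grad():
        for node in candidates:
            scores.append((q_net(state_vec, node).item(), node))
    # a higher predicted reward means fewer users are eventually infected
    scores.sort(key=lambda t: t[0], reverse=True)
    return [node for _, node in scores[:k]]
```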

5.2 Parameter learning

Figure 7 depicts the overall framework of the deep Q-network (DQN) [21] with experience replay and objective function freezing. An application of deep Q-learning with experience replay will be explored in this section. This learning methodology is characterized by the fact that the agent’s experience at each time step is stored in a replay memory for parameter updates. The loss function for model training can be represented as:

$$ \mathcal{L}(s_{t},a_{t},s_{t+1} \vert \theta) = \left(r^{t} + \gamma\max_{a}\hat{Q}_{\theta^{-}}(s_{t+1}, a) - Q_{\theta}(s_{t}, a_{t})\right)^{2} $$
(9)

where \(r^{t} = -\vert A^{t} \vert\), and \(\theta^{-}\) denotes the parameters of the target network \(\hat{Q}\). Similarly, θ denotes the parameters of the policy network Q; the policy network \(Q_{\theta}\) is the one the DQN uses to make decisions. Every \(T^{\prime}\) time steps, the value of θ is copied to \(\theta^{-}\), and \(\theta^{-}\) stays unchanged at other times. Hence, the optimization process updates only θ.

Fig. 7

Illustration of the deep Q-network with experience replay and objective function freezing

The parameter training for the DQN is demonstrated in Algorithm 3, which is based on (9) and utilizes experience replay. It is important to recognize that there are two neural networks in the DQN, i.e., \(Q_{\theta^{-}}\) and \(Q_{\theta}\), where \(\theta^{-}\) and θ are their parameters. Every \(T^{\prime}\) time steps, we set \(\theta^{-}\) to be the same as θ. This approach allows the network to be trained more effectively and to reach a steady state more quickly.

Algorithm 3

DQN with experience replay.
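Since Algorithm 3 appears only as a figure, the following compact PyTorch sketch shows one way to implement DQN training with experience replay and a periodically synchronized target network, in the spirit of (9). The environment interface, the discount factor γ, and the network call signature q_net(state, action) are assumptions for illustration.

```python
import copy
import random
from collections import deque
import torch
import torch.nn as nn

def train_dqn(env, q_net: nn.Module, episodes=200, gamma=0.99,
              batch_size=32, sync_every=100, lr=1e-3, eps=0.1):
    """DQN with experience replay and target-network freezing (cf. Algorithm 3).
    env is assumed to expose reset(), step(action), and candidate_actions(state)."""
    target_net = copy.deepcopy(q_net)            # frozen copy, parameters theta^-
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    replay = deque(maxlen=10_000)                # replay memory of transitions
    step_count = 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy over the candidate actions
            if random.random() < eps:
                a = random.choice(env.candidate_actions(s))
            else:
                with torch.no_grad():
                    a = max(env.candidate_actions(s), key=lambda x: q_net(s, x).item())
            s2, r, done = env.step(a)            # reward r^t = -|A^t| in this paper
            replay.append((s, a, r, s2, done))
            s = s2
            if len(replay) >= batch_size:
                batch = random.sample(replay, batch_size)
                loss = 0.0
                for bs, ba, br, bs2, bdone in batch:
                    with torch.no_grad():
                        target = br if bdone else br + gamma * max(
                            target_net(bs2, x) for x in env.candidate_actions(bs2))
                    loss = loss + (target - q_net(bs, ba)) ** 2   # squared TD error, Eq. (9)
                opt.zero_grad()
                (loss / batch_size).backward()
                opt.step()
            step_count += 1
            if step_count % sync_every == 0:
                target_net.load_state_dict(q_net.state_dict())    # theta^- <- theta
    return q_net
```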

6 Experiment

This section evaluates the effectiveness of the developed rumor influence minimization method for rumor control under a given DRPM. First, we provide an overview of the datasets and experimental settings. Second, a thorough assessment is performed by examining the outcomes of the experiments and interpreting them from various perspectives. Finally, we compare reinforcement learning for dynamic blocking with the other baselines. The experiments were implemented with PyTorch as the deep learning framework and NetworkX for manipulating the graph structure; the reinforcement learning environment was built with reference to the gym library. Model training was conducted on a server equipped with an RTX 3090 GPU.

6.1 Datasets

We chose four real-world social networks to evaluate the feasibility and performance of the proposed method.

  1. Zachary’s karate club [34]. This dataset, reported by Wayne W. Zachary, concerns the social network of a karate club on a university campus and is frequently employed as an illustration in community structure analysis.

  2. Facebook [35]. This dataset consists of “circles” (or “friends lists”) from Facebook. The dataset was gathered from participants who used a certain app to access Facebook.

  3. Cora [36]. Cora is a collection of academic papers related to machine learning that is available in a citation network format. The citation relationships between the papers are extracted and utilized to form the network topology.

  4. Email [37]. Email data obtained from a major European research institution were utilized to construct the network. The e-mails only represent communication between institution members (the core); the dataset does not include any messages sent to or received from external sources.

Rumor control research requires only community structure information. Therefore, two versions of the dataset, namely Facebook-s and Cora-s, were created by taking the community structures from the original Facebook and Cora. The details of the datasets are shown in Table 1.

Table 1 The statistics of experimental datasets

6.2 Evaluation criteria

To evaluate the performance of our proposed method, we use the infection rate [6, 16, 17], i.e., the proportion of people affected by the rumor relative to the total number of people, as the most intuitive measure of the outcome of rumor propagation. A lower infection rate indicates that rumor control is effective.

$$ Infection\_Rate = \frac{\vert I^{T} \vert}{\vert \mathcal{V} \vert} $$
(10)

This study conducts a thorough assessment of the method that includes not only the infection rate but also the precision, recall, and F1 score. These indicators assess the accuracy of the prediction, whereas the infection rate alone cannot give a full account of the effect on rumor control.

6.3 Hyperparameter setting

Hyperparameters, such as the number of neural network layers, the batch size, and the learning rate, are settings of our approach that cannot be learned from the data. They are chosen by the practitioner, are typically specific to the problem at hand, and are set before training the model; they can significantly affect model performance. Let n represent the number of nodes in the dataset under consideration. First, we attempt to reduce the size of the neural network to prevent overfitting; the size of each neural network layer, expressed in terms of n, is given in Table 2. Second, increasing the batch size during the training process can achieve better training results [38]; therefore, we use a batch size that increases as the experiment progresses. Third, the learning rate depends on the size of the dataset, with larger datasets requiring lower learning rates. As a result, the training process can be dramatically improved by doubling or halving the learning rate as shown in Table 2.

Table 2 Hyperparameter settings of the RLDB model

The ReLU function is used to activate the hidden layers, and Adam [39] is used as the optimizer. Determining the hyperparameters requires a combination of expertise and trial and error to identify the best configuration. The number of neurons should be at most 210, and the neural network should contain at most 4 layers. The batch size should be chosen according to the memory capacity of the GPU; a larger batch size can help make the training process more stable. It is worth mentioning that using Dropout in the deep reinforcement learning model can prevent the training loss from converging.

6.4 Baseline methods

We chose four baseline methods against which to compare the performance of the proposed RLDB model. Our experiments use an optimal setting for the algorithms with tunable parameters. (A minimal sketch of how the baseline scores can be computed with NetworkX follows the list.)

  (1) Random. Randomly select a node as a blocker from the set of candidate nodes C.

  (2) Out-Degree (OD) [19]. The out-degree of a node u in a network is the number of outgoing edges from u. Compared with other centrality-based approaches, out-degree gives a more precise estimate of an individual’s influence in a social network.

  (3) Betweenness Centrality (BC) [40]. The betweenness of node u equals the number of shortest paths between all pairs of other nodes that pass through node u. Social network research has increasingly emphasized the importance of betweenness centrality.

  (4) PageRank (PR) [41]. Google commonly uses the PageRank score to determine the importance of a web page. The PageRank damping factor is set to 0.85 in all our experiments.
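For reference, the three centrality-based baseline scores can be computed directly with NetworkX, for example as follows; in the experiments, blockers are then chosen from the candidate set C according to these scores.

```python
import networkx as nx

def baseline_scores(G: nx.DiGraph):
    """Node scores used by the centrality-based baselines; the top-scoring
    candidates are chosen as blockers at each time step."""
    out_degree = dict(G.out_degree())                # OD baseline
    betweenness = nx.betweenness_centrality(G)       # BC baseline
    pagerank = nx.pagerank(G, alpha=0.85)            # PR baseline (damping factor 0.85)
    return out_degree, betweenness, pagerank
```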

6.5 Results

6.5.1 Study of parameter α

The experiment aims to analyze the influence of popularity α on the propagation of rumors in SRPMs. The graph in Fig. 8 displays the time elapsed from the start of a rumor to its termination and the infection rate of the whole social network. A rumor with low popularity will be difficult to spread on a social network. Consequently, its life expectancy will be brief. The time for a rumor to spread in a social network is roughly equal to the diameter of the entire network when the popularity is high. Such rumors can also be disseminated quickly.

Fig. 8

Variation of propagation time and infection rate with parameter α

A rumor has the longest spreading time when the popularity is moderate. We can determine a suitable level of popularity α according to the time needed for the rumor to spread in a social network. The optimal popularity, represented by \(\alpha ^{\prime }\), is achieved when the propagation time is greatest. Rumor infection rates can similarly be examined to analyze the spread of rumors within social networks.

Figure 8 reveals similar trends in popularity α across the four datasets: the rumor propagation time first increases and then decreases as the popularity α rises. This consistent trend reinforces our findings regarding the influence of rumor popularity on the duration of propagation. Nevertheless, each dataset requires a distinct level of popularity for optimal performance. The Email dataset has an optimal popularity of \(\alpha ^{\prime }=0.1\), and the best popularity in the Cora dataset is likewise \(\alpha ^{\prime }=0.1\). Sparse networks tend to be preferable to dense networks with regard to achieving the optimal popularity. The rumor infection rate is estimated to be in the range of 0.5 to 0.7 when \(\alpha =\alpha ^{\prime }\).

6.5.2 Study of the parameter λ

This section examines the effects of the scaling factor λ in the DRPM, aiming to discover its role in determining the propagation time and infection rate. Figure 9 shows the relationships between λ and the propagation time or infection rate in the four datasets. An interesting observation is that a value of λ that is too large or too small reduces the time required for a rumor to propagate in a social network. Accordingly, we can determine an appropriate scaling factor λ based on the propagation time of a rumor in social networks; the factor λ associated with the longest propagation time is designated as the optimal scaling factor \(\lambda ^{\prime }\).

Fig. 9

Variation in the propagation time and infection rate with parameter λ

6.5.3 Performance comparison

This section empirically compares the RLDB method with the baselines, i.e., Random, OD, BC, and PR, under two rumor propagation models, an SRPM and a DRPM. The infection rates for the five blocker budgets are recorded in Tables 3 and 4 for the SRPM and DRPM, respectively. The trend comparisons across the four datasets are depicted in Figs. 10 and 11.

Table 3 Infection rate comparison under the SRPM
Table 4 Infection rate comparison under the DRPM
Fig. 10

Comparison with other methods under SRPM

Fig. 11

Infection rate comparison under DRPM

Figure 10 presents the baseline comparison under the SRPM. Overall, the RLDB method maintains a significant rumor control effect on all datasets, especially under the smallest blocker set size k = 1, where it achieves a markedly lower infection rate than the baseline methods. The gap between RLDB and the other comparison methods narrows as the value of k increases. A control boundary can be observed on the Zachary’s karate club and Cora datasets: when k equals 10, rumor infection rates are kept to an extremely low level, and the performance difference between RLDB and the comparison methods becomes small. The RLDB method still retains an advantage over the other approaches on the Facebook and Email datasets as the value of k increases.

An interesting observation can be drawn from the Facebook-s dataset. The infection rates of PR and BC methods reverse as the blocker set size k increases. The PR strategy is more effective at control than the BC method when k < 4. Increasing the value of k improves the BC technique’s capability, eventually leading to better results than those of PR. The reversal may be explained by the complexity of the network topology. The performance of techniques based on a single metric can be adversely affected by fluctuations in the dataset parameters. In contrast, the data-driven RLDB method achieves superior performance across multiple scenarios through the strategies derived from learning processes.

Figure 11 presents the infection rates under the DRPM. In contrast to the SRPM, the DRPM introduces dynamic popularity, resulting in a more complex propagation process. The stronger control effect of the RLDB model attests to its generalizability, especially for intricate propagation models and network structures.

Figures 12 and 13 illustrate the precision, recall, and F1 scores for the SRPM and DRPM, respectively. RLDB performs better than the other methods on every metric under both the SRPM and the DRPM, and it stands out as particularly superior when analyzed under the DRPM on multiple datasets. It is noteworthy that, in terms of precision, recall, and F1 score, Random outperforms all comparison methods except RLDB, whereas OD fares the worst, in contrast to its infection-rate results. These scores arise because the OD, BC, and PR methods, being based on node statistical attributes, tend to select highly influential nodes. Despite their large influence, the opinions of high-influence nodes are usually not swayed by other nodes, which makes these methods less successful than randomly choosing nodes.

Fig. 12

Precision, Recall and F1 comparison under SRPMs

Fig. 13

Precision, Recall and F1 comparison under DRPMs

7 Conclusion and future work

This paper is an initial exploration of the potential of reinforcement learning to control the spread of rumors on social media. The insights on model construction gained from this study may assist in reducing rumor infection rates in an integrated manner. First, we propose the static and dynamic rumor propagation models, SRPM and DRPM, based on the independent cascade model. Second, this research advances the understanding of rumor propagation and presents the dynamic rumor influence minimization problem, which offers more control over the spread of rumors than the traditional static rumor influence minimization problem by breaking the blocking process down into multiple time steps. Another significant accomplishment is the implementation of reinforcement learning for dynamic blocking (RLDB) as a practical strategy for preventing the spread of rumors across multiple sources and blocking rounds.

Testing our approach on real-world social networks has proven it to be effective. However, the results on large-scale artificial datasets show that RLDB takes significantly more time when dealing with networks of more than 1500 nodes. A potential way to address this issue is to split a large network into smaller networks by identifying distinct communities. In future research, we will optimize the efficiency of RLDB on large networks to improve its applicability.