1 Introduction

Graph neural networks (GNNs) preserve the attributes and structural features of graphs by converting node information into low-dimensional dense embeddings. In the past few years, they have become a powerful strategy for analyzing graph-structured data, with overwhelming success in various graph-based tasks such as node classification [1,21]. It has been shown that the local–local contrasting mode generally outperforms the other contrasting modes (local–global and global–global) in node classification tasks. However, the local–local contrasting mode has an obvious flaw: comparing node-level embeddings overemphasizes local semantic information and ignores global semantic information, yet global semantic information is equally important for node classification. This leads to two challenging questions: (1) How can data augmentation strategies be improved to minimize the loss of semantic information during the augmentation stage? (2) How can a GCL framework exploit global semantic information while retaining the local–local contrasting mode?

To address the above problems, this paper proposes MDGCL, a graph contrastive learning framework based on multiple graph diffusion methods, which introduces graph diffusion as a data augmentation method and adopts a deterministic-stochastic data augmentation strategy. Specifically, MDGCL first applies PPR diffusion and Markov diffusion to the original graph. In contrast to data augmentation methods that remove semantic information from the original graph, graph diffusion adds semantic information to it, thus alleviating the semantic loss caused by the subsequent stochastic augmentation methods. At the same time, the augmented views produced by graph diffusion carry diffusion matrices, which contain global semantic information that is not available in the adjacency matrix. Therefore, the representations learned on these augmented views allow MDGCL to exploit global semantic information during training while retaining the local–local contrasting mode. These contributions improve the performance of graph contrastive learning frameworks on node classification tasks, which holds significant theoretical and practical implications for the analysis and application of graph data.

In summary, our main contributions are as follows:

  • Firstly, we propose a novel graph contrastive learning framework MDGCL, which uses Markov diffusion and PPR diffusion for data augmentation, effectively reducing the loss of semantic information while providing global semantic information for framework training.

  • Secondly, a deterministic-stochastic data augmentation strategy is introduced. This strategy aims to improve the node classification performance of the framework through the combination of different data augmentation methods.

  • Thirdly, we conducted comparison experiments on a series of benchmark datasets, and the results show that our framework achieves better performance on node classification tasks than the chosen baseline frameworks.

2 Related Work

Our work mainly involves the application of graph diffusion in graph contrastive learning. The following subsections introduce the research related to graph contrastive learning methods and graph diffusion.

2.1 Graph Self-supervised Learning

Since label information for real-world graph-structured data is scarce and costly to acquire, more and more researchers have shifted their attention from supervised to self-supervised graph learning in recent years, and several representative methods have been proposed.

2.1.1 Generation-Based Graph Self-supervised Learning

Generation-based methods usually take the original graph or subgraphs as the model's input and obtain encoders by reconstructing features or structures. Graph completion [22] takes inspiration from image completion: it first masks the features of certain nodes in the input graph, and a GCN encoder then uses the features of adjacent nodes to predict the features of the masked nodes. Instead of learning to reconstruct the original feature matrix, AttributeMask [27] improves the data augmentation process based on GRACE: edge dropping and node feature masking are no longer performed randomly; the edges to drop and the node features to mask are determined by edge centrality and node centrality. Although this method reduces the loss of semantic information to some extent, it does not adopt the local–local contrasting mode, which is optimal for node classification tasks. NCLA [28] argues that most graph contrastive learning frameworks manually select data augmentation methods for different datasets, which limits their universality. It therefore adopts a multi-head attention mechanism as a learnable graph data augmentation method that automatically selects the optimal augmented view, eliminating the need for manual selection of augmentation methods and improving the universality of the framework. In order not to lose information, SimGRACE does not perform data augmentation at all. Previous studies [36,37] have shown that as the number of GCN layers increases, the over-smoothing problem becomes inevitable, and a two-layer GCN performs best. This means that, for a target node, the encoder can only aggregate local information within its two-hop neighborhood. However, in graph data, nodes outside the two-hop range may also be closely related to the target node. To reconstruct these connections, the MDGCL framework adopts graph diffusion as a data augmentation method, and the resulting augmented view carries a diffusion matrix that contains global information. We demonstrate the graph diffusion process of the MDGCL framework in Fig. 2.

Fig. 2: Graph diffusion process

As illustrated in Fig. 2, graph diffusion transforms some nodes outside the two-hop range into nodes within the two-hop range by adding edges. When encoding the target node, the encoder’s field of view extends from local to global, which means that the encoder can utilize more semantic information, thereby improving the framework’s performance in node classification tasks.

The generated graph diffusion matrix is given by Eq. (1):

$$\begin{aligned} \textbf{S}=\sum _{k=0}^{\infty } \Theta _k \textbf{T}^k \in \mathbb {R}^{N \times N}. \end{aligned}$$
(1)

where \(\textbf{T} \in \mathbb {R}^{N \times N}\) is the generalized transition matrix defined by \(\textbf{T}=\textbf{A}\textbf{D}^{-1}\), \(\textbf{A}\in \mathbb {R}^{N \times N}\) is the adjacency matrix, and \(\textbf{D}\in \mathbb {R}^{N \times N}\) is the diagonal degree matrix with \(d_{ii}=\sum _{j} a_{ij}\). For the Personalized PageRank (PPR) method, \(\Theta _k=\alpha \left( 1-\alpha \right) ^k\), where \(\alpha \) denotes the teleport (restart) probability of the random walk, and its diffusion matrix is given by Eq. (2):

$$\begin{aligned} \textbf{S}^{PPR}=\alpha \left( \textbf{I}_n-\left( 1-\alpha \right) \textbf{D}^{-1/2}{\textbf{AD}}^{-1/2}\right) ^{-1}. \end{aligned}$$
(2)

where \(\textbf{I}_n\) denotes the identity matrix. Through PPR diffusion we obtain an augmented view \(\widetilde{\textrm{G}_1}\) that extends the graph topology and provides global information. However, we argue that a single graph diffusion method is not always effective in extending the graph topology, and its expansion results lack diversity. Therefore, in addition to the PPR diffusion method, we introduce a second graph diffusion method: Markov diffusion. The Markov diffusion process starts with \(M_0={\textbf{A}\textbf{D}}^{-1}\), and the random walk is extended by squaring the Markov matrix M, as shown in Eq. (3) for the i-th round of the random walk:

$$\begin{aligned} M_i=M_{i-1}\times M_{i-1}. \end{aligned}$$
(3)

where \(M_{i-1}\) is the output of the previous round of the random walk. For Markov diffusion, the \(\Theta _k\) in Eq. (1) is \(\left( 1-\alpha \right) ^k\), where \(\alpha \) again denotes the transition probability of the random walk, and the Markov diffusion matrix is given by Eq. (4):

$$\begin{aligned} \textbf{S}^{\text{ Markov } }=\sum _{k=0}^{\infty }(1-\alpha )^k M_{k-1} \times M_{k-1} \in \mathbb {R}^{N \times N}. \end{aligned}$$
(4)

After PPR diffusion and Markov diffusion, we obtain two augmented views \(\widetilde{\textrm{G}_1}=\left\{ \widetilde{A_1}, X\right\} \) and \(\widetilde{\textrm{G}_2}=\left\{ \widetilde{A_2}, X\right\} \) from the original graph G, as shown in Eq. (5):

$$\begin{aligned} \begin{aligned} \widetilde{A_1}&=\textrm{S}^{\textrm{PPR}}(A) \\ \widetilde{A_2}&=\textrm{S}^{\textrm{Markov}}(A) \end{aligned} \end{aligned}$$
(5)

where \(A\in R^{N\times N}\) is the adjacency matrix of the original graph G.
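To make the deterministic augmentation step concrete, the following is a minimal sketch of Eqs. (2)–(5): the PPR diffusion matrix in its closed form and a truncated approximation of the Markov diffusion series. It is an illustration under our own assumptions (dense numpy matrices, the truncation depth `n_rounds`, and \(\alpha =0.2\) taken from the range used in Sect. 4.3), not the authors' implementation.

```python
# Minimal sketch of the deterministic augmentations, Eqs. (2)-(5).
import numpy as np

def ppr_diffusion(A: np.ndarray, alpha: float = 0.2) -> np.ndarray:
    """PPR diffusion matrix S^PPR of Eq. (2)."""
    N = A.shape[0]
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    A_sym = D_inv_sqrt @ A @ D_inv_sqrt                    # D^{-1/2} A D^{-1/2}
    return alpha * np.linalg.inv(np.eye(N) - (1.0 - alpha) * A_sym)

def markov_diffusion(A: np.ndarray, alpha: float = 0.2, n_rounds: int = 4) -> np.ndarray:
    """Truncated approximation of the Markov diffusion matrix, Eqs. (3)-(4)."""
    d = A.sum(axis=1)
    M = A @ np.diag(1.0 / np.maximum(d, 1e-12))            # M_0 = A D^{-1}
    S = np.zeros_like(A, dtype=float)
    for k in range(n_rounds):
        S += (1.0 - alpha) ** k * M                        # k-th term of the series
        M = M @ M                                          # squaring step, Eq. (3)
    return S

# Eq. (5): two diffused adjacency matrices (augmented views) from one original graph.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A1_tilde = ppr_diffusion(A)     # view G1~
A2_tilde = markov_diffusion(A)  # view G2~
```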

3.2.2 FeatureDrop and EdgeDrop

  1. FeatureDrop: MDGCL masks node features to drop them. It first generates a random vector \(M=\left\{ m_1,m_2,m_3,...,m_F\right\} \), where F is the dimension of the node features and \(m_i \in \left\{ 0,1\right\} \). Then, the new feature vector \(\tilde{X}\) is obtained by the element-wise product of M and the node feature vector X, as shown in Eq. (6):

    $$\begin{aligned} \tilde{X}=\left\{ x_1 \cdot \textrm{m}_1, x_2 \cdot m_2, x_3 \cdot m_3 \ldots x_F \cdot m_F\right\} . \end{aligned}$$
    (6)

    The drop rate, which controls the number of zero elements in the random vector M, is between 0 and 1.

  2. EdgeDrop: MDGCL randomly removes a fixed proportion of edges from the augmented view: for the adjacency matrix A, p non-zero elements of A are randomly selected and set to zero. In terms of implementation, we first form a masking matrix \(U\in \left\{ 0,1\right\} ^{N \times N}\), where N is the number of nodes. The masking matrix U is then multiplied element-wise with the adjacency matrix A to obtain the new adjacency matrix \(\tilde{A}\), as shown in Eq. (7):

    $$\begin{aligned} \tilde{A}=U \cdot A. \end{aligned}$$
    (7)

    The two final augmented views \(\widetilde{\mathcal {G}_1}\) and \(\widetilde{\mathcal {G}_2}\) are obtained by applying FeatureDrop and EdgeDrop, as described above, to the two diffused views \(\widetilde{\textbf{G}_1}\) and \(\widetilde{\textbf{G}_2}\) respectively, as shown in Fig. 3; a minimal implementation sketch is given after this list.
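The following is a minimal sketch of the stochastic FeatureDrop and EdgeDrop steps of Eqs. (6)–(7), again under assumptions of our own: dense numpy arrays, drop probabilities `p_f` and `p_e` as hyperparameters, and a symmetric edge mask so that undirected edges are removed consistently.

```python
# Minimal sketch of FeatureDrop, Eq. (6), and EdgeDrop, Eq. (7).
import numpy as np

rng = np.random.default_rng(0)

def feature_drop(X: np.ndarray, p_f: float) -> np.ndarray:
    """Zero out whole feature dimensions with probability p_f (Eq. (6))."""
    m = (rng.random(X.shape[1]) >= p_f).astype(X.dtype)    # random 0/1 vector M
    return X * m                                           # column-wise masking

def edge_drop(A: np.ndarray, p_e: float) -> np.ndarray:
    """Zero out a fraction of the entries of A with a 0/1 mask U (Eq. (7))."""
    U = (rng.random(A.shape) >= p_e).astype(A.dtype)
    U = np.triu(U, 1)
    U = U + U.T                                            # symmetric mask, no self-loops
    return A * U                                           # element-wise product
```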

Fig. 3: The process of FeatureDrop and EdgeDrop in MDGCL

3.3 View Encoding

In this section, we choose GCN as the encoder. GCN is a neural network that uses a nonlinear function \(\sigma \) as the propagation function between layers, as shown in Eq. (8):

$$\begin{aligned} H^{(l+1)}=\sigma \left( \widetilde{D}^{-\frac{1}{2}} \tilde{A} \widetilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) . \end{aligned}$$
(8)

where \(\sigma \) is ReLU or PReLU, \(\tilde{A} = A + I_n\) with A the adjacency matrix and \(I_n\) the identity matrix, and \(\tilde{D}\) is the degree matrix of \(\tilde{A}\). \(H^{\left( l\right) }\) is the output representation of the previous layer, and \(W^{\left( l\right) }\) is the weight matrix of the l-th layer.
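As an illustration of Eqs. (8)–(9), the sketch below implements a two-layer GCN encoder on dense tensors. It is a hedged example rather than the exact MDGCL encoder; the hidden width and the choice of PReLU over ReLU are assumptions.

```python
# Minimal sketch of a two-layer GCN encoder following Eqs. (8)-(9).
import torch
import torch.nn as nn

class GCNEncoder(nn.Module):
    def __init__(self, in_dim: int, hid_dim: int, out_dim: int):
        super().__init__()
        self.W1 = nn.Linear(in_dim, hid_dim, bias=False)
        self.W2 = nn.Linear(hid_dim, out_dim, bias=False)
        self.act = nn.PReLU()

    @staticmethod
    def normalize(A: torch.Tensor) -> torch.Tensor:
        """Symmetric normalization D~^{-1/2} (A + I) D~^{-1/2} of Eq. (8)."""
        A_tilde = A + torch.eye(A.shape[0], device=A.device)
        d_inv_sqrt = A_tilde.sum(dim=1).clamp(min=1e-12).pow(-0.5)
        return d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)

    def forward(self, A: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
        A_hat = self.normalize(A)
        H = self.act(A_hat @ self.W1(X))  # first propagation layer
        return A_hat @ self.W2(H)         # second layer gives the N x F_h output
```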

Fig. 4: Projection head of MDGCL

We define two two-layer GCNs \(f_1(\cdot )\) and \(f_2(\cdot )\) as the encoders of MDGCL, as shown in Eq. (9):

$$\begin{aligned} f_1(\cdot ), f_2(\cdot ): R^{{N} \times {N}} \times R^{{N} \times {F}} \mapsto {R}^{{N} \times {F}_{{h}}}. \end{aligned}$$
(9)

The inputs of \(f_1(\cdot )\) and \(f_2(\cdot )\) are an adjacency matrix in \(R^{{N} \times {N}}\) and a node feature matrix in \(R^{{N} \times {F}}\), and the output is a node representation in \({R}^{{N} \times {F}_{{h}}}\), where N is the number of nodes, F is the feature dimension, and \({F}_{{h}}\) is the embedding dimension. As shown in Fig. 1, the two encoders share parameters. After feeding the adjacency matrices and node feature matrices of the two augmented views \(\widetilde{\mathcal {G}_1}\) and \(\widetilde{\mathcal {G}_2}\) obtained in Sect. 3.2 into the encoders \(f_1(\cdot )\) and \(f_2(\cdot )\) respectively, two node representations \(H_1\) and \(H_2\) are obtained.

To improve the learning ability of contrastive learning frameworks, SimCLR [38] proposed the projection head, a learnable nonlinear projection inserted between the representations produced by the encoder and the contrastive loss, and achieved good results. To further improve the learning ability of MDGCL, we also use a projection head. The projection head of MDGCL is a multi-layer perceptron (MLP) \(f_{\text{ pro } }(\cdot ): R^{{N} \times {F}_{{h}}} \longmapsto {R}^{{N} \times {F}_{{h}}}\) consisting of two hidden layers with an ELU activation function. As shown in Fig. 4, the node-level representations \(H_1\) and \(H_2\) are transformed into new node-level embeddings \(\widetilde{H_1}\) and \(\widetilde{H_2}\) after being fed into \(f_{\text{ pro } }(\cdot )\), as shown in Eq. (10):

$$\begin{aligned} \begin{aligned} \widetilde{H_1}&=f_{\text{ pro } }\left( \textrm{H}_1\right) \\ \widetilde{H_2}&=f_{\text{ pro } }\left( \textrm{H}_2\right) \end{aligned}. \end{aligned}$$
(10)
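A minimal sketch of the projection head \(f_{\text{pro}}(\cdot )\) of Eq. (10) follows: two linear layers with an ELU in between, mapping \(F_h\)-dimensional representations to \(F_h\)-dimensional embeddings. Keeping the hidden width equal to \(F_h\) is an assumption.

```python
# Minimal sketch of the projection head f_pro(.) of Eq. (10).
import torch.nn as nn

class ProjectionHead(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ELU(),
            nn.Linear(dim, dim),
        )

    def forward(self, H):
        return self.net(H)  # H~ = f_pro(H)
```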

3.4 Local–Local Contrastive Learning

Fig. 5: The local–local contrastive learning of MDGCL

We use the local–local contrasting mode in our framework, where contrastive learning is performed between node-level representations. The contrastive learning of MDGCL is inspired by GRACE [26]. For the two augmented views \(\widetilde{\mathcal {G}_1}\) and \(\widetilde{\mathcal {G}_2}\), a node \(v_1\) in \(\widetilde{\mathcal {G}_1}\) serves as the anchor, and the corresponding node \(u_1\) in \(\widetilde{\mathcal {G}_2}\) serves as the positive sample. All nodes other than \(v_1\) and \(u_1\) serve as negative samples; as shown in Fig. 5, they are divided into intra-view and inter-view negative samples. The contrastive goal of MDGCL is to maximize the consistency between positive samples and minimize the consistency between negative samples. We use the cosine similarity \(\textrm{s}\) to define the consistency \(\theta \), as shown in Eq. (11):

$$\begin{aligned} \theta \left( \textrm{v}_1, \textrm{u}_1\right) =\textrm{s}\left( \textrm{H}_1^{\textrm{v}_1}, \textrm{H}_2^{u_1}\right) . \end{aligned}$$
(11)

where \(\textrm{H}_1^{v_1}\) and \(\textrm{H}_2^{u_1}\) are the node-level representations of nodes \(v_1\) and \(u_1\) in \(\widetilde{\mathcal {G}_1}\) and \(\widetilde{\mathcal {G}_2}\), respectively. After introducing negative samples, the contrastive objective for a sample pair \(\left( v_i,u_i\right) \) is defined in Eq. (12):

$$\begin{aligned} \ell \left( v_i, u_i\right) =\log \frac{e^{\theta \left( v_i, u_i\right) / \tau }}{e^{\theta \left( v_i, u_i\right) / \tau }+\sum _{k=1}^N {1}_{\left[ k \ne i\right] } e^{\theta \left( v_i, u_k\right) / \tau }+\sum _{k=1}^N {1}_{[k \ne i]} e^{\theta \left( v_i, v_k\right) / \tau }}. \end{aligned}$$
(12)

where \(\tau \) is the temperature parameter and \({1}_{\left[ k\ne i\right] }\) is an indicator function that equals 1 iff \(k\ne i\).

Due to the symmetry of the two augmented views, when \(u_i\) is used as the anchor node there is an analogous objective \(\ell \left( u_i,v_i\right) \). Finally, the overall objective maximized by the MDGCL framework is defined as the average of the objectives over all sample pairs, as shown in Eq. (13):

$$\begin{aligned} \mathcal {J}=\frac{1}{2 N} \sum _{i=1}^N\left[ \ell \left( \varvec{u}_i, \varvec{v}_i\right) +\ell \left( \varvec{v}_i, \varvec{u}_i\right) \right] . \end{aligned}$$
(13)
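The objective of Eqs. (11)–(13) can be sketched as follows, assuming the projected embeddings have shape \(N \times F_h\). Rows are L2-normalized so that the inner product equals cosine similarity, and the function returns the negative of \(\mathcal {J}\) so that it can be minimized by gradient descent.

```python
# Minimal sketch of the local-local contrastive objective, Eqs. (11)-(13).
import torch
import torch.nn.functional as F

def contrastive_loss(H1: torch.Tensor, H2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    Z1, Z2 = F.normalize(H1, dim=1), F.normalize(H2, dim=1)

    def one_direction(Za, Zb):
        inter = torch.exp(Za @ Zb.t() / tau)        # theta(v_i, u_k), Eq. (11)
        intra = torch.exp(Za @ Za.t() / tau)        # theta(v_i, v_k)
        pos = inter.diag()                          # positive pairs (v_i, u_i)
        denom = inter.sum(dim=1) + intra.sum(dim=1) - intra.diag()
        return torch.log(pos / denom)               # l(v_i, u_i), Eq. (12)

    # Eq. (13): average over both anchor directions and all N pairs.
    return -0.5 * (one_direction(Z1, Z2) + one_direction(Z2, Z1)).mean()
```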

In summary, in each training epoch the MDGCL framework first performs data augmentation on the original graph G, using the methods described in Sect. 3.2 to obtain two augmented views \(\widetilde{\mathcal {G}_1}\) and \(\widetilde{\mathcal {G}_2}\). Then, node representations \(\widetilde{H_1}\) and \(\widetilde{H_2}\) are obtained from \(\widetilde{\mathcal {G}_1}\) and \(\widetilde{\mathcal {G}_2}\) using the encoders \(f_1(\cdot )\), \(f_2(\cdot )\) and the projection head \(f_{\text{ pro } }(\cdot )\) of Sect. 3.3. Finally, the parameters of \(f_1(\cdot )\) and \(f_2(\cdot )\) are updated by maximizing the objective in Eq. (13). The learning algorithm of MDGCL is summarized in Algorithm 1.

Algorithm 1: MDGCL learning algorithm
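For reference, one training epoch of Algorithm 1 can be sketched by combining the hypothetical helpers introduced above (`ppr_diffusion`, `markov_diffusion`, `feature_drop`, `edge_drop`, `GCNEncoder`, `ProjectionHead`, `contrastive_loss`); the tensor/array conversions and hyperparameter values are placeholders, not the authors' exact settings.

```python
# Minimal sketch of one MDGCL training epoch, assuming dense CPU torch tensors A, X.
import torch

def train_epoch(A, X, encoder, proj, optimizer, p_f=0.2, p_e=0.2, tau=0.5):
    # deterministic augmentation (graph diffusion), then stochastic augmentation
    A1 = torch.tensor(edge_drop(ppr_diffusion(A.numpy()), p_e), dtype=torch.float)
    A2 = torch.tensor(edge_drop(markov_diffusion(A.numpy()), p_e), dtype=torch.float)
    X1 = torch.tensor(feature_drop(X.numpy(), p_f), dtype=torch.float)
    X2 = torch.tensor(feature_drop(X.numpy(), p_f), dtype=torch.float)

    # shared-parameter encoder and projection head (Sect. 3.3)
    H1, H2 = proj(encoder(A1, X1)), proj(encoder(A2, X2))
    loss = contrastive_loss(H1, H2, tau)  # negative of the objective in Eq. (13)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```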

3.5 Theoretical Justification

In this section, we provide a theoretical justification for our framework in terms of mutual information (MI) maximization. MI measures the amount of information that one random variable contains about another and can be understood as the degree of correlation between the two variables. To be clear, we consider the MI between the original graph node features and the embedding representations output by the encoder. Next, we reveal the relationship between our objective function \(\mathcal {J}\) and MI.

Theorem 1

Let \(\textbf{X}_i=\left\{ {x}_k\right\} _{k\in \mathcal {N}(i)}\) be the neighborhood of node \(v_i\) that collectively maps to its output embedding, where \(\mathcal {N}(i)\) denotes the set of neighbors of node \(v_i\) specified by GCN, and \(\textbf{X}\) be the corresponding random variable with a uniform distribution \(p\left( \textbf{X}_i\right) =\frac{1}{N}\). Given two random variables \(\textbf{U},\textbf{V}\in R^{F^\prime }\) being the embedding in the two views, with their joint distribution denoted as \(p\left( \textbf{U},\textbf{V}\right) \), our objective \(\mathcal {J}\) is a lower bound of MI between encoder input \(\textbf{X}\) and node representations in two graph views \(\textbf{U},\textbf{V}\). Formally

$$\begin{aligned} \mathcal {J}\le I\left( \textbf{X};\textbf{U},\textbf{V}\right) . \end{aligned}$$
(14)

Proof

Inspired by [26], we first establish the connection between our objective \(\mathcal {J}\) and the InfoNCE objective [39], which is defined in [40] as

$$\begin{aligned} I_{\textrm{NCE}}(\textbf{U} ; \textbf{V}) \triangleq \mathbb {E}_{\prod _i p\left( \varvec{u}_i, \varvec{v}_{\varvec{i}}\right) }\left[ \frac{1}{N} \sum _{i=1}^N \log \frac{e^{\theta \left( \varvec{u}_{\varvec{i}}, \varvec{v}_i\right) }}{\frac{1}{N} \sum _{j=1}^N e^{\theta \left( \varvec{u}_i, \varvec{v}_j\right) }}\right] . \end{aligned}$$
(15)

where \(\theta \) is the same as defined in Eq. (11). For convenience of notation, we define \(\rho _r\left( \varvec{u}_i\right) =\sum _{j=1}^N {1}_{\left[ j\ne i\right] }\exp \left( \theta \left( \varvec{u}_i, \varvec{u}_j\right) / \tau \right) \) and \(\rho _c\left( \varvec{u}_i\right) =\sum _{j=1}^N \exp \left( \theta \left( \varvec{u}_i, \varvec{v}_j\right) / \tau \right) \). Using \(\rho _r\) and \(\rho _c\), we rewrite the objective \(\mathcal {J}\) as

$$\begin{aligned} \mathcal {J}=\mathbb {E}_{\prod _i p\left( \varvec{u}_i, \varvec{v}_i\right) }\left[ \frac{1}{N} \sum _{i=1}^N \log \frac{\exp \left( \theta \left( \varvec{u}_i, \varvec{v}_i\right) / \tau \right) }{\sqrt{\left( \rho _c\left( \varvec{u}_i\right) +\rho _r\left( \varvec{u}_i\right) \right) \left( \rho _c\left( \varvec{v}_i\right) +\rho _r\left( \varvec{v}_i\right) \right) }}\right] . \end{aligned}$$
(16)

Then, using \(\rho _c\), we rewrite the InfoNCE estimator \(I_{\textrm{NCE}}\) as

$$\begin{aligned} I_{\textrm{NCE}}(\textbf{U}, \textbf{V})=\mathbb {E}_{\Pi _i p\left( \varvec{u}_i, \varvec{v}_i\right) }\left[ \frac{1}{N} \sum _{i=1}^N \log \frac{\exp \left( \theta \left( \varvec{u}_i, \varvec{v}_i\right) / \tau \right) }{\rho _c\left( \varvec{u}_i\right) }\right] . \end{aligned}$$
(17)


Therefore,

$$\begin{aligned} \begin{aligned} 2 \mathcal {J}&= I_{\textrm{NCE}}(\textbf{U}, \textbf{V})-\mathbb {E}_{\Pi _i p\left( \textbf{u}_i, \varvec{v}_i\right) }\left[ \frac{1}{N} \sum _{i=1}^N \log \left( 1+\frac{\rho _r\left( \varvec{u}_i\right) }{\rho _c\left( \varvec{u}_i\right) }\right) \right] \\&+I_{\textrm{NCE}}(\textbf{V}, \textbf{U})-\mathbb {E}_{\Pi _i p\left( \varvec{u}_i, \varvec{v}_i\right) }\left[ \frac{1}{N} \sum _{i=1}^N \log \left( 1+\frac{\rho _r\left( \varvec{v}_i\right) }{\rho _c\left( \varvec{v}_i\right) }\right) \right] \\&\le I_{\textrm{NCE}}(\textbf{U}, \textbf{V})+I_{\textrm{NCE}}(\textbf{V}, \textbf{U}) . \end{aligned}. \end{aligned}$$
(18)

According to [28], the InfoNCE estimator is a lower bound of the true MI, i.e.

$$\begin{aligned} I_{\textrm{NCE}}(\textbf{U}, \textbf{V}) \le I(\textbf{U} ; \textbf{V}). \end{aligned}$$
(19)

Then, we arrive at

$$\begin{aligned} 2 \mathcal {J} \le I(\textbf{U} ; \textbf{V})+I(\textbf{V} ; \textbf{U})=2 I(\textbf{U} ; \textbf{V}). \end{aligned}$$
(20)

which leads to the inequality

$$\begin{aligned} \mathcal {J} \le I(\textbf{U} ; \textbf{V}). \end{aligned}$$
(21)

The data processing inequality states that for all random variables \(\textbf{X},\textbf{Y},\textbf{Z}\) satisfying the Markov relation \(\textbf{X}\rightarrow \textbf{Y}\rightarrow \textbf{Z}\), the inequality \(I\left( \textbf{X};\textbf{Z}\right) \le I\left( \textbf{X};\textbf{Y}\right) \) holds. We observe that \(\textbf{X},\textbf{U},\textbf{V}\) satisfy the relation \(\textbf{U}\leftarrow \textbf{X}\rightarrow \textbf{V}\). Since \(\textbf{U}\) and \(\textbf{V}\) are conditionally independent given \(\textbf{X}\), this relation is Markov equivalent to \(\textbf{U}\rightarrow \textbf{X}\rightarrow \textbf{V}\), which leads to \(I\left( \textbf{U};\textbf{V}\right) \le I\left( \textbf{U};\textbf{X}\right) \). We further notice that the relation \(\textbf{X}\rightarrow (\textbf{U},\textbf{V})\rightarrow \textbf{U}\) holds, which gives \(I\left( \textbf{X};\textbf{U}\right) \le I\left( \textbf{X};\textbf{U},\textbf{V}\right) \). Combining the two inequalities, we obtain

$$\begin{aligned} I\left( \textbf{U};\textbf{V}\right) \le I\left( \textbf{X};\textbf{U},\textbf{V}\right) . \end{aligned}$$
(22)

According to Eqs. (21) and (22), we finally arrive at the inequality

$$\begin{aligned} \mathcal {J}\le I\left( \textbf{X};\textbf{U},\textbf{V}\right) . \end{aligned}$$
(23)

4 Experiments

4.1 Datasets

We select 8 representative datasets. Table 1 summarizes the basic information of these datasets, and a brief description of each is given below:

Table 1 Statistics of experimental datasets
  • Cora [41]: It is a citation network composed of papers in the field of machine learning. Papers are divided into 7 categories, which are characterized by 1433 different words. Nodes represent papers, while edges represent citation relationships.

  • CiteSeer [41]: It is a part of academic papers selected from the CiteSeer Digital Paper Library. Papers are divided into 6 categories, which are characterized by 3703 different words. Nodes represent papers, while edges represent reference relationships.

  • Pubmed [41]: It is a dataset which has 19717 scientific publications on diabetes from the Pubmed database, which are divided into 3 categories. Nodes represent publications, while edges represent reference relationships. Each node is characterized by 500 words.

  • DBLP [41]: It is a dataset of papers from the DBLP computer literature database. All papers are divided into 4 categories. Nodes represent papers, while edges represent citation links.

  • Amazon-Computers [41]: It is a purchasing relationship network for the Amazon computer category. In the network, nodes represent the purchased products, edges represent two products being purchased at the same time, and labels are the internal classification of products on the Amazon website, characterized by 767 words from product purchase comments.

  • Amazon-Photo [41]: It is an Amazon photo category product purchase relationship network, similar to Amazon Computers. Nodes represent the purchased products, while edges represent two products being purchased simultaneously. 745 words from product purchase comments are used to describe the features of the products, and the product category labels are derived from the Amazon website’s classification of the products.

  • Coauthor-CS [41]: It is a co-author relationship network in the field of computer science, where nodes represent authors and edges indicate that they are co-authors of the same paper. The node label represents the most active research field of each author, and node features are 6805 keywords from the author’s paper.

  • Coauthor-Physics [41]: It is a network of co-author relationships in the field of physics, similar to Coauthor-CS. Nodes represent authors, edges indicate that they are co-authors of the same paper, node features represent the keywords of each author’s paper, and node labels represent the most active research field of each author.

4.2 Baseline Frameworks

To evaluate the effectiveness of MDGCL, we selected 6 representative graph contrastive learning frameworks, namely DGI, GMI, MVGRL, GRACE, GCA, and COSTA, for comparative experiments. Each framework is briefly introduced as follows:

  • DGI [20]: It uses a local–global contrasting mode. It obtains a node-level representation \(h_1\) and a global representation s on the original graph, and a node-level representation \(h_2\) on the augmented view. Finally, it conducts contrastive learning by maximizing the consistency between \(h_1\) and s and minimizing the consistency between \(h_2\) and s.

  • GMI [42]: It proposes the concept of graph mutual information for contrastive learning by maximizing the graph mutual information between the input and output of the encoder.

  • MVGRL [34]: Based on DGI, it obtains an augmented view through PPR diffusion alongside the original view. The node-level embeddings of one view and the graph-level embedding of the other view are cross-compared for learning. Its contrasting mode is local–global.

  • GCA [27]: It improves the data augmentation process of GRACE. Edge dropping and node feature masking are no longer random; node centrality and edge centrality are used to measure the importance of nodes and edges, so that unimportant edges are removed and unimportant node features are masked preferentially.

  • COSTA [31]: Instead of augmenting the graph itself, it performs data augmentation on the embeddings after the node representations are obtained from the original graph, and it proposes both single-view and multi-view variants.

4.3 Experimental Setting

Following previous graph contrastive learning frameworks [19, 20], we set the number of GCN layers to 2 and choose the parameter \(\alpha \) of PPR diffusion and Markov diffusion from [0.1, 0.2, 0.3, 0.4]. The parameters of the other baseline frameworks are set to the default values reported in their original papers.

We first train the GCN encoder in an unsupervised manner and then use it to obtain the embedding of each node on the original graph. Finally, a simple l2-regularized logistic regression classifier is trained on the obtained embeddings to classify the nodes, and the predictions are compared with the ground-truth labels to compute classification accuracy. For a fair evaluation, we train MDGCL for twenty runs and report the average result. Micro-F1 is used as the evaluation metric for node classification. The specific parameter settings of MDGCL on the 8 datasets are given in Table 2.
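A minimal sketch of this evaluation protocol, assuming the learned embeddings `Z`, labels `y`, and split indices are available as numpy arrays, using scikit-learn's l2-regularized logistic regression and micro-F1:

```python
# Minimal sketch of the linear-evaluation step described above.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def evaluate(Z: np.ndarray, y: np.ndarray, train_idx, test_idx) -> float:
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)  # l2-regularized
    clf.fit(Z[train_idx], y[train_idx])
    pred = clf.predict(Z[test_idx])
    return f1_score(y[test_idx], pred, average="micro")           # micro-F1
```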

Table 2 Parameter setting of MDGCL

4.4 Results and Analyses

After initializing MDGCL with the above parameters, we conducted node classification experiments on the 8 datasets, using micro-F1 to evaluate framework performance; the experimental results are shown in Table 3.

Table 3 Results of MDGCL with baseline frameworks for node classification task on 8 datasets

The results in Table 3 show that MDGCL performs well on all 8 publicly available datasets and achieves the best results on 5 of them, which indicates that our proposed improvements are effective. On the DBLP and Amazon-Photo datasets, GRACE performs best among the baseline frameworks, but it simply uses data augmentation methods that lose a lot of semantic information. In contrast, before applying stochastic augmentation methods, MDGCL first uses deterministic augmentation methods (Markov and PPR diffusion) to add semantic information, which alleviates the subsequent loss of semantic information; this is why MDGCL achieves higher classification accuracy than GRACE on these two datasets. On CiteSeer, MDGCL achieves better classification accuracy than MVGRL, the best-performing baseline, indicating that our deterministic-stochastic augmentation strategy is more effective than MVGRL's purely deterministic augmentation. Comparing the three GCA variants with GRACE on the Coauthor-Physics dataset, it is easy to see that the optimized edge dropping and node feature masking proposed by GCA do improve its learning ability; however, GCA only adopts simple stochastic augmentation methods and still performs worse than MDGCL, which further validates the superiority of our deterministic-stochastic augmentation strategy: combining multiple data augmentation methods can further improve the performance of the framework. It can also be observed that although DGI uses a local–global contrasting mode to give the encoder access to global semantic information, the experimental results show that MDGCL is superior. This verifies the advantage of the local–local contrasting mode and also demonstrates the effectiveness of multiple graph diffusion methods in providing global semantic information. COSTA argues that traditional data augmentation methods cause the augmented view to lose too much semantic information, so it performs augmentation on the embeddings instead; however, its learning ability is weaker than that of MDGCL in our experiments, and we argue that embedding augmentation is less applicable to most datasets than data augmentation.

In addition to the 5 datasets on which MDGCL performs best, on Cora MDGCL performs slightly worse than COSTA, which achieves the best result; on Pubmed and Amazon-Computers, MDGCL's results are further from the best. On both of these datasets, GCA-DE achieves the best results. It uses node degree as an indicator of the importance of nodes and edges and preferentially drops unimportant nodes and edges during the augmentation stage. We argue that these two datasets may be degree-sensitive, which explains the outstanding performance of GCA-DE. Therefore, we consider combining the GCA-DE method in future research to further enhance the node classification performance of MDGCL on different datasets.

4.5 Ablation Experiments

To verify the effectiveness of the multiple graph diffusion methods and the deterministic-stochastic data augmentation strategy used in MDGCL, we conducted ablation experiments on two datasets, Cora and CiteSeer. The variant settings for the ablation experiments are shown in Table 4.

Table 4 Variant setting for ablation experiments

After obtaining the variants, we set the parameter values for each variant according to Sect. 4.3, and the ablation experiment results are shown in Fig. 6.

Fig. 6: Experimental results of node classification comparison between MDGCL and its five variants

Observing the comparative results in Fig. 6, the following conclusions can be drawn. MDGCL performs better than MDGCL-N on both datasets; on Cora, its node classification accuracy is 1.77% higher than that of MDGCL-N, mainly because the graph diffusion methods allow MDGCL to retain more semantic information. MDGCL also performs better than MDGCL-M and MDGCL-P, which use only a single graph diffusion method; on CiteSeer, its accuracy improves by 3.90% over MDGCL-M and 3.87% over MDGCL-P, indicating that the multiple graph diffusion methods used by MDGCL are effective. Meanwhile, MDGCL always outperforms MDGCL-C and MDGCL-G, which indicates that the deterministic-stochastic augmentation strategy further improves the learning ability of the framework.

Figure 6 also shows that MDGCL outperforms MDGCL-G in node classification accuracy; on the Cora dataset, where the improvement is largest, the accuracy increases by 3.04%. This demonstrates the effectiveness of the global semantic information carried by the graph diffusion matrix for node classification tasks, and shows that it is practical to keep the local–local contrasting mode and rely on graph diffusion to preserve global semantic information. The comparison between MDGCL and MDGCL-G also verifies the superiority of the local–local contrasting mode over the local–global contrasting mode in node classification tasks.

5 Conclusions

In this paper, we propose a novel graph contrastive learning framework, MDGCL, which addresses two problems of existing graph contrastive learning frameworks by introducing Markov diffusion, PPR diffusion, and a deterministic-stochastic data augmentation strategy: stochastic augmentation methods lose too much semantic information, and the local–local contrasting mode ignores global semantic information. We conducted a series of comparison experiments on 8 publicly available graph datasets against 6 representative graph contrastive learning frameworks, and the experimental results validate the effectiveness of the proposed MDGCL framework. However, there is still a gap between the performance of MDGCL and that of supervised GNN models, and the application of MDGCL is currently limited to node classification tasks. How to further improve the performance of the MDGCL framework and apply it to other downstream graph tasks (e.g., graph classification and link prediction) will be our main focus in future work.