1 Introduction

Graph neural networks (GNNs) preserve the attributes and structural features of graphs by converting node information into low-dimensional dense embeddings. In the past few years, they have become a powerful strategy for analyzing graph-structured data, with overwhelming success in various graph-based tasks such as node classification [1,21]. It has been shown that the local–local contrasting mode generally outperforms the other contrasting modes (local–global and global–global) in node classification tasks. However, the local–local contrasting mode has an obvious flaw: comparing node-level embeddings overemphasizes local semantic information and ignores global semantic information, yet global semantic information is equally important for node classification. This leads to two challenging questions: (1) How can data augmentation strategies be improved to minimize the loss of semantic information during the augmentation stage? (2) How can a GCL framework exploit global semantic information while retaining the local–local contrasting mode?

To address the above problems, this paper proposes MDGCL, a graph contrastive learning framework based on multiple graph diffusion methods, which introduces graph diffusion as a data augmentation method and adopts a deterministic-stochastic data augmentation strategy. Specifically, MDGCL first applies PPR diffusion and Markov diffusion to the original graph. In contrast to data augmentation methods that remove semantic information from the original graph, graph diffusion adds semantic information to it, thus alleviating the semantic loss caused by the subsequent stochastic augmentation methods. At the same time, the augmented views produced by graph diffusion carry diffusion matrices, which contain global semantic information that is not available in the adjacency matrix. Therefore, the representations learned on these augmented views allow MDGCL to exploit global semantic information during training while retaining the local–local contrasting mode. These contributions improve the performance of graph contrastive learning frameworks on node classification tasks, which holds significant theoretical and practical implications for the analysis and application of graph data.

In summary, our main contributions are as follows:

  • Firstly, we propose a novel graph contrastive learning framework MDGCL, which uses Markov diffusion and PPR diffusion for data augmentation, effectively reducing the loss of semantic information while providing global semantic information for framework training.

  • Secondly, a deterministic-stochastic data augmentation strategy is introduced. This strategy aims to improve the node classification performance of the framework through the combination of different data augmentation methods.

  • Thirdly, we conducted comparison experiments on a series of benchmark datasets, and the results show that our framework achieves better performance on node classification tasks than the chosen baseline frameworks.

2 Related Work

Our work mainly involves the application of graph diffusion in graph contrastive learning. The following subsections introduce the research related to graph contrastive learning methods and graph diffusion.

2.1 Graph Self-supervised Learning

Since label information for real-world graph-structured data is scarce and costly to acquire, more and more researchers have shifted their attention from supervised to self-supervised graph learning in recent years, and several representative methods have been proposed.

2.1.1 Generation-Based Graph Self-supervised Learning

Generation-based methods usually take the original graph or subgraphs as the model's input and obtain encoders by reconstructing features or structures. Graph completion [22] takes inspiration from image completion: it first masks the features of certain nodes in the input graph, and a GCN encoder then uses the features of adjacent nodes to predict the features of the masked nodes. Instead of learning to reconstruct the original feature matrix, AttributeMask [27] improves the data augmentation process based on GRACE: edge dropping and node feature masking are no longer performed randomly; the edges to drop and the node features to mask are determined by edge centrality and node centrality. Although this method reduces the loss of semantic information to some extent, it does not adopt the local–local contrasting mode, which is optimal for node classification tasks. NCLA [28] argues that most graph contrastive learning frameworks manually select data augmentation methods for different datasets, which limits their universality. It therefore adopts a multi-head attention mechanism as a learnable graph data augmentation method that automatically selects the optimal augmented view, eliminating the need for manual selection of augmentation methods and improving the universality of the framework. In order not to lose information, SimGRACE does not perform data augmentation at all. Previous studies [36,37] have shown that as the number of GCN layers increases, the over-smoothing problem becomes inevitable, and a two-layer GCN performs best. This means that, for a target node, the encoder can only aggregate local information within its two-hop neighborhood. However, in graph data, nodes outside the two-hop range may also be closely related to the target node. To reconstruct these connections, the MDGCL framework adopts graph diffusion as a data augmentation method, and the resulting augmented view carries a diffusion matrix that contains global information. We demonstrate the graph diffusion process of the MDGCL framework in Fig. 2.

Fig. 2: Graph diffusion process

As illustrated in Fig. 2, graph diffusion transforms some nodes outside the two-hop range into nodes within the two-hop range by adding edges. When encoding the target node, the encoder’s field of view extends from local to global, which means that the encoder can utilize more semantic information, thereby improving the framework’s performance in node classification tasks.

The generated graph diffusion matrix is given by Eq. (1):

$$\begin{aligned} \textbf{S}=\sum _{k=0}^{\infty } \Theta _k \textbf{T}^k \in \mathbb {R}^{N \times N}. \end{aligned}$$
(1)

where \(\textbf{T} \in \mathbb {R}^{N \times N}\) is the generalized transition matrix defined by \(\textbf{T}=\textbf{A}\textbf{D}^{-1}\), \(\textbf{A}\in \mathbb {R}^{N \times N}\) is the adjacency matrix, and \(\textbf{D}\in \mathbb {R}^{N \times N}\) is the diagonal degree matrix with \(d_{ii}=\sum _{j} a_{ij}\). For the Personalized PageRank (PPR) method, \(\Theta _k=\alpha \left( 1-\alpha \right) ^k\), where \(\alpha \) denotes the teleport (restart) probability of the random walk, and its diffusion matrix is given by Eq. (2):

$$\begin{aligned} \textbf{S}^{PPR}=\alpha \left( \textbf{I}_n-\left( 1-\alpha \right) \textbf{D}^{-1/2}{\textbf{AD}}^{-1/2}\right) ^{-1}. \end{aligned}$$
(2)

where \(\textbf{I}_n\) denotes the identity matrix. Through PPR diffusion we obtain an augmented view \(\widetilde{\textrm{G}_1}\) that extends the graph topology and provides global information. However, we argue that a single graph diffusion method is not always effective in extending the graph topology, and its expansion results lack diversity. Therefore, in addition to the PPR diffusion method, we introduce a second graph diffusion method: Markov diffusion. The Markov diffusion process starts with \(M_0={\textbf{A}\textbf{D}}^{-1}\), and the random walk is extended by squaring the Markov matrix M, as shown in Eq. (3) for the i-th round of the random walk:

$$\begin{aligned} M_i=M_{i-1}\times M_{i-1}. \end{aligned}$$
(3)

where \(M_{i-1}\) is the output of the previous round of the random walk. For Markov diffusion, the \(\Theta _k\) in Eq. (1) is \(\left( 1-\alpha \right) ^k\), where \(\alpha \) again denotes the transition probability of the random walk, and the Markov diffusion matrix is given by Eq. (4):

$$\begin{aligned} \textbf{S}^{\text{ Markov } }=\sum _{k=0}^{\infty }(1-\alpha )^k M_{k-1} \times M_{k-1} \in \mathbb {R}^{N \times N}. \end{aligned}$$
(4)

After PPR diffusion and Markov diffusion, we obtain two augmented views \(\widetilde{\textrm{G}_1}=\left\{ \widetilde{A_1}, X\right\} \) and \(\widetilde{\textrm{G}_2}=\left\{ \widetilde{A_2}, X\right\} \) from the original graph G, as shown in Eq. (5):

$$\begin{aligned} \begin{aligned} \widetilde{A_1}&=\textrm{S}^{\textrm{PPR}}(A) \\ \widetilde{A_2}&=\textrm{S}^{\textrm{Markov}}(A) \end{aligned} \end{aligned}$$
(5)

where \(A\in R^{N\times N}\) is the adjacency matrix of the original graph G.
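To make the deterministic augmentation step concrete, the following is a minimal sketch of Eqs. (2)–(5): the PPR diffusion matrix in its closed form and a truncated approximation of the Markov diffusion series. It is an illustration under our own assumptions (dense numpy matrices, the truncation depth `n_rounds`, and \(\alpha =0.2\) taken from the range used in Sect. 4.3), not the authors' implementation.

```python
# Minimal sketch of the deterministic augmentations, Eqs. (2)-(5).
import numpy as np

def ppr_diffusion(A: np.ndarray, alpha: float = 0.2) -> np.ndarray:
    """PPR diffusion matrix S^PPR of Eq. (2)."""
    N = A.shape[0]
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    A_sym = D_inv_sqrt @ A @ D_inv_sqrt                    # D^{-1/2} A D^{-1/2}
    return alpha * np.linalg.inv(np.eye(N) - (1.0 - alpha) * A_sym)

def markov_diffusion(A: np.ndarray, alpha: float = 0.2, n_rounds: int = 4) -> np.ndarray:
    """Truncated approximation of the Markov diffusion matrix, Eqs. (3)-(4)."""
    d = A.sum(axis=1)
    M = A @ np.diag(1.0 / np.maximum(d, 1e-12))            # M_0 = A D^{-1}
    S = np.zeros_like(A, dtype=float)
    for k in range(n_rounds):
        S += (1.0 - alpha) ** k * M                        # k-th term of the series
        M = M @ M                                          # squaring step, Eq. (3)
    return S

# Eq. (5): two diffused adjacency matrices (augmented views) from one original graph.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A1_tilde = ppr_diffusion(A)     # view G1~
A2_tilde = markov_diffusion(A)  # view G2~
```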

3.2.2 FeatureDrop and EdgeDrop

  1. FeatureDrop: MDGCL masks node features to drop them. It first generates a random vector \(M=\left\{ m_1,m_2,m_3,...,m_F\right\} \), where F is the dimension of the node features and \(m_i \in \left\{ 0,1\right\} \). Then, the new feature vector \(\tilde{X}\) is obtained by the element-wise product of M and the node feature vector X, as shown in Eq. (6):

    $$\begin{aligned} \tilde{X}=\left\{ x_1 \cdot \textrm{m}_1, x_2 \cdot m_2, x_3 \cdot m_3 \ldots x_F \cdot m_F\right\} . \end{aligned}$$
    (6)

    The drop rate, which controls the number of zero elements in the random vector M, is between 0 and 1.

  2. EdgeDrop: MDGCL randomly removes a fixed proportion of edges from the augmented view: for the adjacency matrix A, p non-zero elements of A are randomly selected and set to zero. In terms of implementation, we first form a masking matrix \(U\in \left\{ 0,1\right\} ^{N \times N}\), where N is the number of nodes. The masking matrix U is then multiplied element-wise with the adjacency matrix A to obtain the new adjacency matrix \(\tilde{A}\), as shown in Eq. (7):

    $$\begin{aligned} \tilde{A}=U \cdot A. \end{aligned}$$
    (7)

    The two final augmented views \(\widetilde{\mathcal {G}_1}\) and \(\widetilde{\mathcal {G}_2}\) are obtained by applying FeatureDrop and EdgeDrop, as described above, to the two diffused views \(\widetilde{\textbf{G}_1}\) and \(\widetilde{\textbf{G}_2}\) respectively, as shown in Fig. 3; a minimal implementation sketch is given after this list.
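The following is a minimal sketch of the stochastic FeatureDrop and EdgeDrop steps of Eqs. (6)–(7), again under assumptions of our own: dense numpy arrays, drop probabilities `p_f` and `p_e` as hyperparameters, and a symmetric edge mask so that undirected edges are removed consistently.

```python
# Minimal sketch of FeatureDrop, Eq. (6), and EdgeDrop, Eq. (7).
import numpy as np

rng = np.random.default_rng(0)

def feature_drop(X: np.ndarray, p_f: float) -> np.ndarray:
    """Zero out whole feature dimensions with probability p_f (Eq. (6))."""
    m = (rng.random(X.shape[1]) >= p_f).astype(X.dtype)    # random 0/1 vector M
    return X * m                                           # column-wise masking

def edge_drop(A: np.ndarray, p_e: float) -> np.ndarray:
    """Zero out a fraction of the entries of A with a 0/1 mask U (Eq. (7))."""
    U = (rng.random(A.shape) >= p_e).astype(A.dtype)
    U = np.triu(U, 1)
    U = U + U.T                                            # symmetric mask, no self-loops
    return A * U                                           # element-wise product
```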

Fig. 3: The process of FeatureDrop and EdgeDrop in MDGCL

3.3 View Encoding

In this section, we choose GCN as the encoder. GCN is a neural network that uses a nonlinear function \(\sigma \) as the propagation function between layers, as shown in Eq. (8):

$$\begin{aligned} H^{(l+1)}=\sigma \left( \widetilde{D}^{-\frac{1}{2}} \tilde{A} \widetilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) . \end{aligned}$$
(8)

where \(\sigma \) is ReLU or PReLU, \(\tilde{A} = A + I_n\) with A the adjacency matrix and \(I_n\) the identity matrix, and \(\tilde{D}\) is the degree matrix of \(\tilde{A}\). \(H^{\left( l\right) }\) is the output representation of the previous layer, and \(W^{\left( l\right) }\) is the weight matrix of the l-th layer.
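As an illustration of Eqs. (8)–(9), the sketch below implements a two-layer GCN encoder on dense tensors. It is a hedged example rather than the exact MDGCL encoder; the hidden width and the choice of PReLU over ReLU are assumptions.

```python
# Minimal sketch of a two-layer GCN encoder following Eqs. (8)-(9).
import torch
import torch.nn as nn

class GCNEncoder(nn.Module):
    def __init__(self, in_dim: int, hid_dim: int, out_dim: int):
        super().__init__()
        self.W1 = nn.Linear(in_dim, hid_dim, bias=False)
        self.W2 = nn.Linear(hid_dim, out_dim, bias=False)
        self.act = nn.PReLU()

    @staticmethod
    def normalize(A: torch.Tensor) -> torch.Tensor:
        """Symmetric normalization D~^{-1/2} (A + I) D~^{-1/2} of Eq. (8)."""
        A_tilde = A + torch.eye(A.shape[0], device=A.device)
        d_inv_sqrt = A_tilde.sum(dim=1).clamp(min=1e-12).pow(-0.5)
        return d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)

    def forward(self, A: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
        A_hat = self.normalize(A)
        H = self.act(A_hat @ self.W1(X))  # first propagation layer
        return A_hat @ self.W2(H)         # second layer gives the N x F_h output
```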

Fig. 4: Projection head of MDGCL

We define two two-layer GCNs \(f_1(\cdot )\) and \(f_2(\cdot )\) as the encoders of MDGCL, as shown in Eq. (9):

$$\begin{aligned} f_1(\cdot ), f_2(\cdot ): R^{{N} \times {N}} \times R^{{N} \times {F}} \mapsto {R}^{{N} \times {F}_{{h}}}. \end{aligned}$$
(9)

The inputs of \(f_1(\cdot )\) and \(f_2(\cdot )\) are an adjacency matrix in \(R^{{N} \times {N}}\) and a node feature matrix in \(R^{{N} \times {F}}\), and the output is a node representation in \({R}^{{N} \times {F}_{{h}}}\), where N is the number of nodes, F is the feature dimension, and \({F}_{{h}}\) is the embedding dimension. As shown in Fig. 1, the two encoders share parameters. After feeding the adjacency matrices and node feature matrices of the two augmented views \(\widetilde{\mathcal {G}_1}\) and \(\widetilde{\mathcal {G}_2}\) obtained in Sect. 3.2 into the encoders \(f_1(\cdot )\) and \(f_2(\cdot )\) respectively, two node representations \(H_1\) and \(H_2\) are obtained.

To improve the learning ability of contrastive learning frameworks, SimCLR [38] proposed the projection head, a learnable nonlinear projection inserted between the representations produced by the encoder and the contrastive loss, and achieved good results. To further improve the learning ability of MDGCL, we also use a projection head. The projection head of MDGCL is a multi-layer perceptron (MLP) \(f_{\text{ pro } }(\cdot ): R^{{N} \times {F}_{{h}}} \longmapsto {R}^{{N} \times {F}_{{h}}}\) consisting of two hidden layers with an ELU activation function. As shown in Fig. 4, the node-level representations \(H_1\) and \(H_2\) are transformed into new node-level embeddings \(\widetilde{H_1}\) and \(\widetilde{H_2}\) after being fed into \(f_{\text{ pro } }(\cdot )\), as shown in Eq. (10):

$$\begin{aligned} \begin{aligned} \widetilde{H_1}&=f_{\text{ pro } }\left( \textrm{H}_1\right) \\ \widetilde{H_2}&=f_{\text{ pro } }\left( \textrm{H}_2\right) \end{aligned}. \end{aligned}$$
(10)
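A minimal sketch of the projection head \(f_{\text{pro}}(\cdot )\) of Eq. (10) follows: two linear layers with an ELU in between, mapping \(F_h\)-dimensional representations to \(F_h\)-dimensional embeddings. Keeping the hidden width equal to \(F_h\) is an assumption.

```python
# Minimal sketch of the projection head f_pro(.) of Eq. (10).
import torch.nn as nn

class ProjectionHead(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ELU(),
            nn.Linear(dim, dim),
        )

    def forward(self, H):
        return self.net(H)  # H~ = f_pro(H)
```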

3.4 Local–Local Contrastive Learning

Fig. 5: The local–local contrastive learning of MDGCL

We use the local–local contrasting mode in our framework, where contrastive learning is performed between node-level representations. The contrastive learning of MDGCL is inspired by GRACE [26]. For the two augmented views \(\widetilde{\mathcal {G}_1}\) and \(\widetilde{\mathcal {G}_2}\), a node \(v_1\) in \(\widetilde{\mathcal {G}_1}\) serves as the anchor, and the corresponding node \(u_1\) in \(\widetilde{\mathcal {G}_2}\) serves as the positive sample. All nodes other than \(v_1\) and \(u_1\) serve as negative samples; as shown in Fig. 5, they are divided into intra-view and inter-view negative samples. The contrastive goal of MDGCL is to maximize the consistency between positive samples and minimize the consistency between negative samples. We use the cosine similarity \(\textrm{s}\) to define the consistency \(\theta \), as shown in Eq. (11):

$$\begin{aligned} \theta \left( \textrm{v}_1, \textrm{u}_1\right) =\textrm{s}\left( \textrm{H}_1^{\textrm{v}_1}, \textrm{H}_2^{u_1}\right) . \end{aligned}$$
(11)

where \(\textrm{H}_1^{v_1}\) and \(\textrm{H}_2^{u_1}\) are the node-level representations of nodes \(v_1\) and \(u_1\) in \(\widetilde{\mathcal {G}_1}\) and \(\widetilde{\mathcal {G}_2}\), respectively. After introducing negative samples, the contrastive objective for a sample pair \(\left( v_i,u_i\right) \) is defined in Eq. (12):

$$\begin{aligned} \ell \left( v_i, u_i\right) =\log \frac{e^{\theta \left( v_i, u_i\right) / \tau }}{e^{\theta \left( v_i, u_i\right) / \tau }+\sum _{k=1}^N {1}_{\left[ k \ne i\right] } e^{\theta \left( v_i, u_k\right) / \tau }+\sum _{k=1}^N {1}_{[k \ne i]} e^{\theta \left( v_i, v_k\right) / \tau }}. \end{aligned}$$
(12)

where \(\tau \) is the temperature parameter and \({1}_{\left[ k\ne i\right] }\) is an indicator function that equals 1 iff \(k\ne i\).

Due to the symmetry of the two augmented views, when \(u_i\) is used as the anchor node there is an analogous objective \(\ell \left( u_i,v_i\right) \). Finally, the overall objective maximized by the MDGCL framework is defined as the average of the objectives over all sample pairs, as shown in Eq. (13):

$$\begin{aligned} \mathcal {J}=\frac{1}{2 N} \sum _{i=1}^N\left[ \ell \left( \varvec{u}_i, \varvec{v}_i\right) +\ell \left( \varvec{v}_i, \varvec{u}_i\right) \right] . \end{aligned}$$
(13)
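The objective of Eqs. (11)–(13) can be sketched as follows, assuming the projected embeddings have shape \(N \times F_h\). Rows are L2-normalized so that the inner product equals cosine similarity, and the function returns the negative of \(\mathcal {J}\) so that it can be minimized by gradient descent.

```python
# Minimal sketch of the local-local contrastive objective, Eqs. (11)-(13).
import torch
import torch.nn.functional as F

def contrastive_loss(H1: torch.Tensor, H2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    Z1, Z2 = F.normalize(H1, dim=1), F.normalize(H2, dim=1)

    def one_direction(Za, Zb):
        inter = torch.exp(Za @ Zb.t() / tau)        # theta(v_i, u_k), Eq. (11)
        intra = torch.exp(Za @ Za.t() / tau)        # theta(v_i, v_k)
        pos = inter.diag()                          # positive pairs (v_i, u_i)
        denom = inter.sum(dim=1) + intra.sum(dim=1) - intra.diag()
        return torch.log(pos / denom)               # l(v_i, u_i), Eq. (12)

    # Eq. (13): average over both anchor directions and all N pairs.
    return -0.5 * (one_direction(Z1, Z2) + one_direction(Z2, Z1)).mean()
```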

In summary, in each training epoch the MDGCL framework first performs data augmentation on the original graph G, using the methods described in Sect. 3.2 to obtain two augmented views \(\widetilde{\mathcal {G}_1}\) and \(\widetilde{\mathcal {G}_2}\). Then, node representations \(\widetilde{H_1}\) and \(\widetilde{H_2}\) are obtained from \(\widetilde{\mathcal {G}_1}\) and \(\widetilde{\mathcal {G}_2}\) using the encoders \(f_1(\cdot )\), \(f_2(\cdot )\) and the projection head \(f_{\text{ pro } }(\cdot )\) of Sect. 3.3. Finally, the parameters of \(f_1(\cdot )\) and \(f_2(\cdot )\) are updated by maximizing the objective in Eq. (13). The learning algorithm of MDGCL is summarized in Algorithm 1.

Algorithm 1: MDGCL learning algorithm
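For reference, one training epoch of Algorithm 1 can be sketched by combining the hypothetical helpers introduced above (`ppr_diffusion`, `markov_diffusion`, `feature_drop`, `edge_drop`, `GCNEncoder`, `ProjectionHead`, `contrastive_loss`); the tensor/array conversions and hyperparameter values are placeholders, not the authors' exact settings.

```python
# Minimal sketch of one MDGCL training epoch, assuming dense CPU torch tensors A, X.
import torch

def train_epoch(A, X, encoder, proj, optimizer, p_f=0.2, p_e=0.2, tau=0.5):
    # deterministic augmentation (graph diffusion), then stochastic augmentation
    A1 = torch.tensor(edge_drop(ppr_diffusion(A.numpy()), p_e), dtype=torch.float)
    A2 = torch.tensor(edge_drop(markov_diffusion(A.numpy()), p_e), dtype=torch.float)
    X1 = torch.tensor(feature_drop(X.numpy(), p_f), dtype=torch.float)
    X2 = torch.tensor(feature_drop(X.numpy(), p_f), dtype=torch.float)

    # shared-parameter encoder and projection head (Sect. 3.3)
    H1, H2 = proj(encoder(A1, X1)), proj(encoder(A2, X2))
    loss = contrastive_loss(H1, H2, tau)  # negative of the objective in Eq. (13)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```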

3.5 Theoretical Justification

In this section, we provide a theoretical justification for our framework in terms of mutual information (MI) maximization. MI measures the amount of information that one random variable contains about another and can be understood as the degree of correlation between the two variables. To be clear, we consider the MI between the original graph node features and the embedding representations output by the encoder. Next, we reveal the relationship between our objective function \(\mathcal {J}\) and MI.

Theorem 1

Let \(\textbf{X}_i=\left\{ {x}_k\right\} _{k\in \mathcal {N}(i)}\) be the neighborhood of node \(v_i\) that collectively maps to its output embedding, where \(\mathcal {N}(i)\) denotes the set of neighbors of node \(v_i\) specified by GCN, and \(\textbf{X}\) be the corresponding random variable with a uniform distribution \(p\left( \textbf{X}_i\right) =\frac{1}{N}\). Given two random variables \(\textbf{U},\textbf{V}\in R^{F^\prime }\) being the embedding in the two views, with their joint distribution denoted as \(p\left( \textbf{U},\textbf{V}\right) \), our objective \(\mathcal {J}\) is a lower bound of MI between encoder input \(\textbf{X}\) and node representations in two graph views \(\textbf{U},\textbf{V}\). Formally

$$\begin{aligned} \mathcal {J}\le I\left( \textbf{X};\textbf{U},\textbf{V}\right) . \end{aligned}$$
(14)

Proof

Inspired by [26], we first establish the connection between our objective \(\mathcal {J}\) and the InfoNCE objective [39], which is defined in [40] as

$$\begin{aligned} I_{\textrm{NCE}}(\textbf{U} ; \textbf{V}) \triangleq \mathbb {E}_{\prod _i p\left( \varvec{u}_i, \varvec{v}_{\varvec{i}}\right) }\left[ \frac{1}{N} \sum _{i=1}^N \log \frac{e^{\theta \left( \varvec{u}_{\varvec{i}}, \varvec{v}_i\right) }}{\frac{1}{N} \sum _{j=1}^N e^{\theta \left( \varvec{u}_i, \varvec{v}_j\right) }}\right] . \end{aligned}$$
(15)

where \(\theta \) is the same as defined in Eq. (11). For convenience of notation, we define \(\rho _r\left( \varvec{u}_i\right) =\sum _{j=1}^N {1}_{\left[ j\ne i\right] }\exp \left( \theta \left( \varvec{u}_i, \varvec{u}_j\right) / \tau \right) \) and \(\rho _c\left( \varvec{u}_i\right) =\sum _{j=1}^N \exp \left( \theta \left( \varvec{u}_i, \varvec{v}_j\right) / \tau \right) \). Using \(\rho _r\) and \(\rho _c\), we rewrite the objective \(\mathcal {J}\) as

$$\begin{aligned} \mathcal {J}=\mathbb {E}_{\prod _i p\left( \varvec{u}_i, \varvec{v}_i\right) }\left[ \frac{1}{N} \sum _{i=1}^N \log \frac{\exp \left( \theta \left( \varvec{u}_i, \varvec{v}_i\right) / \tau \right) }{\sqrt{\left( \rho _c\left( \varvec{u}_i\right) +\rho _r\left( \varvec{u}_i\right) \right) \left( \rho _c\left( \varvec{v}_i\right) +\rho _r\left( \varvec{v}_i\right) \right) }}\right] . \end{aligned}$$
(16)

Then, using \(\rho _c\), we rewrite the InfoNCE estimator \(I_{\textrm{NCE}}\) as

$$\begin{aligned} I_{\textrm{NCE}}(\textbf{U}, \textbf{V})=\mathbb {E}_{\Pi _i p\left( \varvec{u}_i, \varvec{v}_i\right) }\left[ \frac{1}{N} \sum _{i=1}^N \log \frac{\exp \left( \theta \left( \varvec{u}_i, \varvec{v}_i\right) / \tau \right) }{\rho _c\left( \varvec{u}_i\right) }\right] . \end{aligned}$$
(17)


Therefore,

$$\begin{aligned} \begin{aligned} 2 \mathcal {J}&= I_{\textrm{NCE}}(\textbf{U}, \textbf{V})-\mathbb {E}_{\Pi _i p\left( \textbf{u}_i, \varvec{v}_i\right) }\left[ \frac{1}{N} \sum _{i=1}^N \log \left( 1+\frac{\rho _r\left( \varvec{u}_i\right) }{\rho _c\left( \varvec{u}_i\right) }\right) \right] \\&+I_{\textrm{NCE}}(\textbf{V}, \textbf{U})-\mathbb {E}_{\Pi _i p\left( \varvec{u}_i, \varvec{v}_i\right) }\left[ \frac{1}{N} \sum _{i=1}^N \log \left( 1+\frac{\rho _r\left( \varvec{v}_i\right) }{\rho _c\left( \varvec{v}_i\right) }\right) \right] \\&\le I_{\textrm{NCE}}(\textbf{U}, \textbf{V})+I_{\textrm{NCE}}(\textbf{V}, \textbf{U}) . \end{aligned}. \end{aligned}$$
(18)

According to [28], the InfoNCE estimator is a lower bound of the true MI, i.e.

$$\begin{aligned} I_{\textrm{NCE}}(\textbf{U}, \textbf{V}) \le I(\textbf{U} ; \textbf{V}). \end{aligned}$$
(19)

Then, we arrive at

$$\begin{aligned} 2 \mathcal {J} \le I(\textbf{U} ; \textbf{V})+I(\textbf{V} ; \textbf{U})=2 I(\textbf{U} ; \textbf{V}). \end{aligned}$$
(20)

which leads to the inequality

$$\begin{aligned} \mathcal {J} \le I(\textbf{U} ; \textbf{V}). \end{aligned}$$
(21)

The data processing inequality states that for all random variables \(\textbf{X},\textbf{Y},\textbf{Z}\) satisfying the Markov relation \(\textbf{X}\rightarrow \textbf{Y}\rightarrow \textbf{Z}\), the inequality \(I\left( \textbf{X};\textbf{Z}\right) \le I\left( \textbf{X};\textbf{Y}\right) \) holds. We observe that \(\textbf{X},\textbf{U},\textbf{V}\) satisfy the relation \(\textbf{U}\leftarrow \textbf{X}\rightarrow \textbf{V}\). Since \(\textbf{U}\) and \(\textbf{V}\) are conditionally independent given \(\textbf{X}\), this relation is Markov equivalent to \(\textbf{U}\rightarrow \textbf{X}\rightarrow \textbf{V}\), which leads to \(I\left( \textbf{U};\textbf{V}\right) \le I\left( \textbf{U};\textbf{X}\right) \). We further notice that the relation \(\textbf{X}\rightarrow (\textbf{U},\textbf{V})\rightarrow \textbf{U}\) holds, which gives \(I\left( \textbf{X};\textbf{U}\right) \le I\left( \textbf{X};\textbf{U},\textbf{V}\right) \). Combining the two inequalities, we obtain

$$\begin{aligned} I\left( \textbf{U};\textbf{V}\right) \le I\left( \textbf{X};\textbf{U},\textbf{V}\right) . \end{aligned}$$
(22)

According to Eqs. (21) and (22), we finally arrive at the inequality

$$\begin{aligned} \mathcal {J}\le I\left( \textbf{X};\textbf{U},\textbf{V}\right) . \end{aligned}$$
(23)

4 Experiments

4.1 Datasets

We select 8 representative datasets. Table 1 summarizes the basic information of these datasets, and a brief description of each is given below:

Table 1 Statistics of experimental datasets
  • Cora [41]: It is a citation network composed of papers in the field of machine learning. Papers are divided into 7 categories, which are characterized by 1433 different words. Nodes represent papers, while edges represent citation relationships.

  • CiteSeer [41]: It is a part of academic papers selected from the CiteSeer Digital Paper Library. Papers are divided into 6 categories, which are characterized by 3703 different words. Nodes represent papers, while edges represent reference relationships.

  • Pubmed [41]: It is a dataset which has 19717 scientific publications on diabetes from the Pubmed database, which are divided into 3 categories. Nodes represent publications, while edges represent reference relationships. Each node is characterized by 500 words.

  • DBLP [41]: It is a dataset of papers from the DBLP computer literature database. All papers are divided into 4 categories. Nodes represent papers, while edges represent citation links.

  • Amazon-Computers [41]: It is a purchasing relationship network for the Amazon computer category. In the network, nodes represent the purchased products, edges represent two products being purchased at the same time, and labels are the internal classification of products on the Amazon website, characterized by 767 words from product purchase comments.

  • Amazon-Photo [41]: It is an Amazon photo category product purchase relationship network, similar to Amazon Computers. Nodes represent the purchased products, while edges represent two products being purchased simultaneously. 745 words from product purchase comments are used to describe the features of the products, and the product category labels are derived from the Amazon website’s classification of the products.

  • Coauthor-CS [41]: It is a co-author relationship network in the field of computer science, where nodes represent authors and edges indicate that they are co-authors of the same paper. The node label represents the most active research field of each author, and node features are 6805 keywords from the author’s paper.

  • Coauthor-Physics [41]: It is a network of co-author relationships in the field of physics, similar to Coauthor-CS. Nodes represent authors, edges indicate that they are co-authors of the same paper, node features represent the keywords of each author’s paper, and node labels represent the most active research field of each author.

4.2 Baseline Frameworks

To evaluate the effectiveness of MDGCL, we selected 6 representative graph contrastive learning frameworks, namely DGI, GMI, MVGRL, GRACE, GCA, and COSTA, for comparative experiments. Each framework is briefly introduced as follows:

  • DGI [20]: It uses a local–global contrasting mode. It obtains a node-level representation \(h_1\) and a global representation s on the original graph, and a node-level representation \(h_2\) on the augmented view. Finally, it conducts contrastive learning by maximizing the consistency between \(h_1\) and s and minimizing the consistency between \(h_2\) and s.

  • GMI [42]: It proposes the concept of graph mutual information for contrastive learning by maximizing the graph mutual information between the input and output of the encoder.

  • MVGRL [34]: Based on DGI, it obtains an augmented view through PPR diffusion alongside the original view. The node-level embeddings of one view and the graph-level embedding of the other view are cross-compared for learning. Its contrasting mode is local–global.

  • GCA [27]: It improves the data augmentation process of GRACE. Edge dropping and node feature masking are no longer random; node centrality and edge centrality are used to measure the importance of nodes and edges, so that unimportant edges are removed and unimportant node features are masked preferentially.

  • COSTA [31]: Instead of augmenting the graph itself, it performs data augmentation on the embeddings after the node representations are obtained from the original graph, and it proposes both single-view and multi-view variants.

4.3 Experimental Setting

Following previous graph contrastive learning frameworks [19, 20], we set the number of GCN layers to 2 and choose the parameter \(\alpha \) of PPR diffusion and Markov diffusion from [0.1, 0.2, 0.3, 0.4]. The parameters of the other baseline frameworks are set to the default values reported in their original papers.

We first train the GCN encoder in an unsupervised manner and then use it to obtain the embedding of each node on the original graph. Finally, a simple l2-regularized logistic regression classifier is trained on the obtained embeddings to classify the nodes, and the predictions are compared with the ground-truth labels to compute classification accuracy. For a fair evaluation, we train MDGCL for twenty runs and report the average result. Micro-F1 is used as the evaluation metric for node classification. The specific parameter settings of MDGCL on the 8 datasets are given in Table 2.
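A minimal sketch of this evaluation protocol, assuming the learned embeddings `Z`, labels `y`, and split indices are available as numpy arrays, using scikit-learn's l2-regularized logistic regression and micro-F1:

```python
# Minimal sketch of the linear-evaluation step described above.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def evaluate(Z: np.ndarray, y: np.ndarray, train_idx, test_idx) -> float:
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)  # l2-regularized
    clf.fit(Z[train_idx], y[train_idx])
    pred = clf.predict(Z[test_idx])
    return f1_score(y[test_idx], pred, average="micro")           # micro-F1
```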

Table 2 Parameter setting of MDGCL

4.4 Results and Analyses

After initializing MDGCL with the above parameters, we conducted node classification experiments on the 8 datasets, using micro-F1 to evaluate framework performance; the experimental results are shown in Table 3.

Table 3 Results of MDGCL with baseline frameworks for node classification task on 8 datasets

The results in Table 3 show that MDGCL performs well on all 8 publicly available datasets and achieves the best results on 5 of them, which indicates that our proposed improvements are effective. On the DBLP and Amazon-Photo datasets, GRACE performs best among the baseline frameworks, but it simply uses data augmentation methods that lose a lot of semantic information. In contrast, before applying stochastic augmentation methods, MDGCL first uses deterministic augmentation methods (Markov and PPR diffusion) to add semantic information, which alleviates the subsequent loss of semantic information; this is why MDGCL achieves higher classification accuracy than GRACE on these two datasets. On CiteSeer, MDGCL achieves better classification accuracy than MVGRL, the best-performing baseline, indicating that our deterministic-stochastic augmentation strategy is more effective than MVGRL's purely deterministic augmentation. Comparing the three GCA variants with GRACE on the Coauthor-Physics dataset, it is easy to see that the optimized edge dropping and node feature masking proposed by GCA do improve its learning ability; however, GCA only adopts simple stochastic augmentation methods and still performs worse than MDGCL, which further validates the superiority of our deterministic-stochastic augmentation strategy: combining multiple data augmentation methods can further improve the performance of the framework. It can also be observed that although DGI uses a local–global contrasting mode to give the encoder access to global semantic information, the experimental results show that MDGCL is superior. This verifies the advantage of the local–local contrasting mode and also demonstrates the effectiveness of multiple graph diffusion methods in providing global semantic information. COSTA argues that traditional data augmentation methods cause the augmented view to lose too much semantic information, so it performs augmentation on the embeddings instead; however, its learning ability is weaker than that of MDGCL in our experiments, and we argue that embedding augmentation is less applicable to most datasets than data augmentation.

In addition to the 5 datasets on which MDGCL performs best, on Cora MDGCL performs slightly worse than COSTA, which achieves the best result; on Pubmed and Amazon-Computers, MDGCL's results are further from the best. On both of these datasets, GCA-DE achieves the best results. It uses node degree as an indicator of the importance of nodes and edges and preferentially drops unimportant nodes and edges during the augmentation stage. We argue that these two datasets may be degree-sensitive, which explains the outstanding performance of GCA-DE. Therefore, we consider combining the GCA-DE method in future research to further enhance the node classification performance of MDGCL on different datasets.

4.5 Ablation Experiments

To verify the effectiveness of the multiple graph diffusion methods and the deterministic-stochastic data augmentation strategy used in MDGCL, we conducted ablation experiments on two datasets, Cora and CiteSeer. The variant settings for the ablation experiments are shown in Table 4.

Table 4 Variant setting for ablation experiments

After obtaining the variants, we set the parameter values for each variant according to Sect. 4.3, and the ablation experiment results are shown in Fig. 6.

Fig. 6: Experimental results of node classification comparison between MDGCL and its five variants

Observing the comparative results in Fig. 6, the following conclusions can be drawn. MDGCL performs better than MDGCL-N on both datasets; on Cora, its node classification accuracy is 1.77% higher than that of MDGCL-N, mainly because the graph diffusion methods allow MDGCL to retain more semantic information. MDGCL also performs better than MDGCL-M and MDGCL-P, which use only a single graph diffusion method; on CiteSeer, its accuracy improves by 3.90% over MDGCL-M and 3.87% over MDGCL-P, indicating that the multiple graph diffusion methods used by MDGCL are effective. Meanwhile, MDGCL always outperforms MDGCL-C and MDGCL-G, which indicates that the deterministic-stochastic augmentation strategy further improves the learning ability of the framework.

Figure 6 also shows that MDGCL outperforms MDGCL-G in node classification accuracy; on the Cora dataset, where the improvement is largest, the accuracy increases by 3.04%. This demonstrates the effectiveness of the global semantic information carried by the graph diffusion matrix for node classification tasks, and shows that it is practical to keep the local–local contrasting mode and rely on graph diffusion to preserve global semantic information. The comparison between MDGCL and MDGCL-G also verifies the superiority of the local–local contrasting mode over the local–global contrasting mode in node classification tasks.

5 Conclusions

In this paper, we propose a novel graph contrastive learning framework, MDGCL, which addresses two problems of existing graph contrastive learning frameworks by introducing Markov diffusion, PPR diffusion, and a deterministic-stochastic data augmentation strategy: stochastic augmentation methods lose too much semantic information, and the local–local contrasting mode ignores global semantic information. We conducted a series of comparison experiments on 8 publicly available graph datasets against 6 representative graph contrastive learning frameworks, and the experimental results validate the effectiveness of the proposed MDGCL framework. However, there is still a gap between the performance of MDGCL and that of supervised GNN models, and the application of MDGCL is currently limited to node classification tasks. How to further improve the performance of the MDGCL framework and apply it to other downstream graph tasks (e.g., graph classification and link prediction) will be our main focus in future work.