1 Introduction

Graph neural networks (GNNs) have demonstrated remarkable ability on graph learning tasks [5, 53, 58, 65]. The input to GNNs is so-called graph data, which records useful features and structural information among the data points. Such data are widely seen in many fields, such as biomedical science [1], social networks [16], and recommender systems [32, 51, 53, 54, 68]. Most GNNs are designed based on the homophily assumption, i.e., nodes with similar features or labels are often linked with each other. This phenomenon is commonly observed in citation networks [13], which have been widely used as benchmarks in empirical studies of GNNs. However, the homophily assumption may not always hold, and its opposite, i.e., heterophily, is often observed in real-world datasets in which linked nodes are more likely to have different class labels and features [59]. For example, in online transaction networks, fraudsters are more likely to link to customers than to other fraudsters [42]. GNNs designed under the homophily assumption are deemed unsuitable for heterophilic graphs, as is evident from their significant performance degradation on them [59]. The reason is that neighbourhoods in heterophilic graphs contain instances from different classes, and hence the signals should be sharpened rather than smoothed out. An ideal framework for learning on graphs should be able to accommodate both homophilic and heterophilic scenarios.

One active direction for resolving this adaptation challenge is to regularize the solution of GNNs from the perspective of optimization. The work of [67] unified most state-of-the-art GNNs within an optimization framework. Furthermore, a recent work [18] considered assigning an adjustable p-Laplacian regularizer to the (discrete) graph regularization problem that is conventionally treated as the mechanism producing GNN outcomes (i.e., Laplacian smoothing). In view of the fact that the classic graph Laplacian regularizer measures the graph signal energy along edges under the \(L_2\) metric, it would be beneficial if GNNs could be adapted to heterophilic graphs under an \(L_p\) metric (\(1 \le p < 2\)). Since the \(L_1\) metric is more robust to high-frequency signals, more of the model's discriminative power is preserved. The early work [27] demonstrated the advantage of adopting the \(L_1\) metric in the Locality Preserving Projection (LPP) model. In addition, the recently proposed p-Laplacian GNN [18] adaptively modifies aggregation weights by exploiting the variation of node embeddings on edges measured by the graph gradient. Investigating the application of the p-Laplacian further, [36] suggested an efficient gradient descent optimization strategy to construct the p-Laplacian embedding space that guarantees convergence to viable local solutions. Although the p-Laplacian based optimization scheme has been successfully lodged in a basic GNN model [18], whether it can be incorporated into more advanced GNN models (i.e., graph framelets) with higher prediction power and flexibility remains unclear. Specifically, one may ask whether the advantage of deploying the p-Laplacian in a GNN carries over to graph framelets, and what exactly the p-Laplacian does when interacting with the feature representations generated by both the low-pass and high-pass framelet domains. In addition, as graph framelets have shown great potential for graph noise reduction [56, 61, 69], whether the inclusion of a p-Laplacian regularizer enhances or dilutes this power also remains unclear. These research gaps motivate us to incorporate the p-Laplacian into graph framelets and further explore the underlying relationship between them.

Since the p-Laplacian regularized optimization problem lacks a closed-form solution except when \(p = 2\) [64], an iterative algorithm is suggested to approximate the solution, and each iteration can be regarded as a GNN layer [62]. The solution of such an algorithm is defined by the first-order condition of the optimization problem. As a result, one can relate this to the implicit layer approach [10, 57], which has the potential to avoid over-smoothing since the adjusted node features are re-injected at each iteration step. By adjusting the value of p, both the p-Laplacian GNN and the algorithm can produce learning outcomes with different discriminative levels and thus handle both homophilic and heterophilic graphs. Similar work has been done by [17], which presents a framework based on Bregman layers to fulfill bi-level optimization formulations. To reap the benefit of p-Laplacian regularization in the framelet domain, in this paper we propose two p-Laplacian regularized framelet GCN models, named p-Laplacian Undecimated Framelet GCN (pL-UFG) and p-Laplacian Fourier Undecimated Framelet GCN (pL-fUFG). In these models, the p-Laplacian regularization can be applied in either the framelet domain or the well-known framelet Fourier domain. The p-Laplacian based regularizer penalizes both the low- and high-frequency domains created by the framelet, producing flexible models that can adapt to different types of graph inputs (i.e., homophilic and heterophilic, directed and undirected). We summarize our contributions as follows:

  • We define two types of new GNN layers by introducing p-Laplacian regularizers to both decomposed and reconstructed framelets. This paves the way for introducing more general Bregman divergence regularization in the graph framelet framework;

  • We propose an iterative algorithm to fit the proposed models and explore an alternative algorithm developed from the F-norm LPP;

  • We prove that each iteration of the proposed algorithm performs a spectral graph convolution and a diagonal scaling over framelet-filtered graph signals, which gives a deeper explanation of how the p-Laplacian regularizer interacts with the framelet;

  • We connect the proposed p-Laplacian regularizer to the previously studied framelet regularizer to illustrate the denoising power of the proposed model;

  • We investigate the performance of the new models on graph learning tasks for both homophilic (undirected) graphs and heterophilic (directed) graphs. To the best of our knowledge, we are the first to explore applying framelet GCNs to directed graphs. The experimental results demonstrate the effectiveness of pL-UFG and pL-fUFG on real-world node classification tasks, together with strong robustness.

The rest of the paper is organized as follows. In Sect. 2, we introduce the p-Laplacian operator and regularized GNNs, followed by a review of recent studies on regularized graph neural networks, including implicit layers and graph homophily. In Sect. 3, we introduce the fundamental properties of graph framelets and propose the p-Laplacian regularized framelet models. Furthermore, we develop an algorithm that solves a regularization problem more general than p-Laplacian regularization, and we provide a theoretical analysis of how the p-Laplacian based regularizer interacts with the graph framelet. Sect. 4 provides further discussion of the proposed model, including its denoising power, computational complexity and limitations. Lastly, we present the experimental results and analysis in Sect. 5, and conclude the paper in Sect. 6.

2 Preliminaries

This section provides an in-depth exploration of the fundamental concepts, encompassing graphs, graph framelets, and regularized graph neural networks. For the sake of reader comprehension and ease of following the intended ideas, each newly introduced model and definition will be accompanied by a succinct review of its developmental history.

2.1 Basic notations

Let \(\mathcal {G} = (\mathcal {V}, \mathcal {E}, W)\) denote a weighted graph, where \(\mathcal {V} = \{v_1, v_2, \cdots , v_N \}\) and \(\mathcal {E} \subseteq \mathcal {V} \times \mathcal {V}\) represent the node set and the edge set, respectively. \(\textbf{X} \in {\mathbb {R}}^{N \times c}\) is the feature matrix for \(\mathcal {G}\) with \(\{ \textbf{x}_1, \textbf{x}_2, \cdots , \textbf{x}_N \}\) as its rows, and \(W = [w_{i,j}] \in \mathbb {R}^{N \times N}\) is the weight matrix on edges with \(w_{i,j}>0\) if \((v_i, v_j) \in \mathcal {E}\) and \(w_{i,j} = 0\) otherwise. For undirected graphs, we have \(w_{i,j} = w_{j,i}\) which means that W is a symmetric matrix. For directed graphs, it is likely that \(w_{i,j} \ne w_{j, i}\) which means that W may not be a symmetric matrix. In most cases, the weight matrix is the graph adjacency matrix, i.e., \(w_{i,j} \in \{0, 1\}\) with elements \(w_{i,j}=1\) if \((v_i, v_j) \in \mathcal {E}\) and \(w_{i,j} = 0\) otherwise. Furthermore, let \(\mathcal {N}_i = \{v_j: (v_i,v_j)\in \mathcal {E}\}\) denote the set of neighbours of node \(v_i\) and \(\textbf{D} = \text {diag}(d_{1,1},..., d_{N,N})\in \mathbb {R}^{N\times N}\) denote the diagonal degree matrix with \(d_{i,i} = \sum ^N_{j=1}w_{i,j}\) for \(i = 1,..., N\). The normalized graph Laplacian is defined as \(\widetilde{\textbf{L}} = \textbf{I}- \textbf{D}^{-\frac{1}{2}} (\textbf{W}+ \textbf{I}) \textbf{D}^{-\frac{1}{2}}\). Lastly, for any vector \(\textbf{x}= (x_1,..., x_c)\in \mathbb {R}^c\), we use \(\Vert \textbf{x}\Vert _2 = (\sum ^c_{i=1} x^2_i)^{\frac{1}{2}}\) to denote its L\(_2\)-norm, and similarly for any matrix \(\textbf{M}= [m_{i,j}]\), \(\Vert \textbf{M}\Vert :=\Vert \textbf{M}\Vert _F = (\sum _{i,j}m^2_{i,j})^{\frac{1}{2}}\) is used to denote its Frobenius norm.
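For concreteness, the following is a minimal sketch (in Python with dense tensors and illustrative names, not the released implementation) of the normalized graph Laplacian defined above; the degrees follow \(d_{i,i} = \sum_j w_{i,j}\) exactly as written, which is an assumption that may differ in practice when self-loops are counted.

```python
import torch

def normalized_laplacian(W: torch.Tensor) -> torch.Tensor:
    """Sketch of L~ = I - D^{-1/2} (W + I) D^{-1/2} with d_ii = sum_j w_ij."""
    N = W.size(0)
    I = torch.eye(N, dtype=W.dtype)
    d_isqrt = W.sum(dim=1).clamp(min=1e-8).pow(-0.5)   # D^{-1/2} as a vector
    return I - d_isqrt[:, None] * (W + I) * d_isqrt[None, :]
```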

2.2 Consistency in graphs

Generally speaking, most GNN frameworks are designed under the homophily assumption, in which the labels of nodes and their neighbours in the graph are mostly identical. The recent work [66] emphasises that general GNNs fail to obtain outstanding results on graphs whose connected nodes have different class labels and dissimilar features, which we call heterophilic or low-homophily graphs. The definitions of homophilic and heterophilic graphs are given below:

Definition 1

(Homophily and Heterophily [18]) The homophily or heterophily of a network is used to describe the relationship between the labels of connected nodes. The level of homophily of a graph can be measured by \(\mathcal {H(G)} = \mathbb {E}_{i \in \mathcal {V}}[\vert \{j: j \in \mathcal {N}_i, y_j=y_i\}\vert /\vert \mathcal {N}_i \vert ]\), where \(\vert \{j: j \in \mathcal {N}_i, y_j= y_i\}\vert\) denotes the number of neighbours of \(i \in \mathcal {V}\) that share the same label as i, i.e., \(y_j = y_i\). \(\mathcal {H(G)} \rightarrow 1\) corresponds to strong homophily, while \(\mathcal {H(G)} \rightarrow 0\) indicates strong heterophily. We say that a graph is a homophilic (heterophilic) graph if it has strong homophily (heterophily).
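As a quick illustration, the following is a minimal sketch of the homophily measure \(\mathcal {H(G)}\) from Definition 1; the tensor layout (a \(2\times E\) edge list) is an assumption for illustration only.

```python
import torch

def homophily_ratio(edge_index: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """H(G): mean fraction of each node's neighbours that share its label (Definition 1)."""
    src, dst = edge_index                                  # directed edges i -> j
    same = (labels[src] == labels[dst]).float()
    n = labels.numel()
    same_per_node = torch.zeros(n).scatter_add_(0, src, same)
    deg = torch.zeros(n).scatter_add_(0, src, torch.ones_like(same))
    per_node = same_per_node / deg.clamp(min=1)
    return per_node[deg > 0].mean()   # -> 1: strong homophily, -> 0: strong heterophily
```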

2.3 p-Laplacian operator

The notion of the graph p-Laplacian and its use in regularization problems on graph-structured data can be traced back to [64]; this regularization scheme was further explored in [18] to build a flexible GNN learning model. In this paper, we adopt notation similar to that of [18] to define the graph p-Laplacian operator. We first define the graph gradient as follows:

Definition 2

(Graph Gradient) Let \({\mathcal {F}}_{\mathcal{V}}:=\{\textbf{F} \vert {\textbf{F}}: {\mathcal {V}} \rightarrow {\mathbb {R}}^d\}\) and \({\mathcal {F}}_{\mathcal {E}}:=\{\textbf{g} \vert {\textbf{g}}: {\mathcal {E}} \rightarrow{ \mathbb {R}}^d\}\) be the function spaces on nodes and edges, respectively. Given a graph \({\mathcal {G}} = ({\mathcal {V}}, {\mathcal {E}}, W)\) and a function \({\textbf{F}} \in {\mathcal {F}}_{\mathcal {V}}\), the graph gradient is an operator \(\nabla _W\): \({\mathcal {F}}_{\mathcal {V}} \rightarrow {\mathcal {F}}_{\mathcal {E}}\) defined, for all \([i,j] \in {\mathcal {E}}\), by

$$(\nabla _W {\textbf{F}}) ([i,j]): = \sqrt{\frac{w_{i,j}}{d_{j,j}}}{\textbf{f}}_j - \sqrt{\frac{w_{i,j}}{d_{i,i}}}{\textbf{f}}_i,$$

where \({\textbf{f}}_i\) and \({\textbf{f}}_j\) are the signal vectors on nodes i and j, i.e., the rows of \({\textbf{F}}\).

Without confusion, we will simply denote \(\nabla _W\) as \(\nabla\) for convenience. For \([i,j] \notin \mathcal {E}\), \((\nabla {\textbf{F}})([i,j]):= 0\). The graph gradient of \({\textbf{F}}\) at a vertex i, \(\forall i \in \{1,..., N\}\), is defined as \(\nabla {\textbf{F}}(i):= ((\nabla {\textbf{F}})([i, 1]);\dots ; (\nabla {\textbf{F}})([i,N]))\) and its Frobenius norm is given by \(\Vert \nabla {\textbf{F}}(i) \Vert _2:= (\sum ^N_{j=1}\Vert (\nabla {\textbf{F}})([i,j])\Vert ^2)^{\frac{1}{2}}\), which measures the variation of \({\textbf{F}}\) around node i. Note that there are two readings of the notation \(\nabla {\textbf{F}}\): as a graph gradient (over edges) and as a node gradient (over nodes); the intended meaning can be inferred from the context in the rest of the paper. We also provide the definition of graph divergence, analogous to the continuous setting, where the Laplacian is the divergence of the gradient of a function.

Definition 3

(Graph Divergence) Given a graph \(\mathcal {G} = (\mathcal {V}, \mathcal {E}, W)\) and functions \({\textbf{F}}: \mathcal V \rightarrow \mathbb R^d\), \({\textbf{g}}: \mathcal E \rightarrow \mathbb R^d\), the graph divergence is an operator \(\text {div}: \mathcal F_\mathcal E \rightarrow \mathcal F_\mathcal V\) which satisfies:

$$\begin{aligned} \langle \nabla {\textbf{F}}, {\textbf{g}}\rangle = \langle {\textbf{F}}, -\text {div}({\textbf{g}}) \rangle . \end{aligned}$$
(1)

Furthermore, the graph divergence can be computed by:

$$\begin{aligned} \text {div}({\textbf{g}})(i) = \sum _{j =1 }^N \sqrt{\frac{w_{i,j}}{d_{i,i}}}({\textbf{g}}[i,j]-{\textbf{g}}[j,i]). \end{aligned}$$
(2)

Given the above definitions on graph gradient and divergence, we reach the definition of the graph p-Laplacian.

Definition 4

(p-Laplacian operator [18]) Given a graph \(\mathcal {G} = (\mathcal {V}, \mathcal {E}, W)\) and a multi-channel signal function \({\textbf{F}}: \mathcal V \rightarrow \mathbb R^d\), the graph p-Laplacian is an operator \(\Delta _p: \mathcal F_\mathcal V \rightarrow \mathcal F_\mathcal V\) defined by:

$$\begin{aligned} \Delta _p {\textbf{F}}: = - \frac{1}{2} \text {div}( \Vert \nabla {\textbf{F}}\Vert ^{p-2} \nabla {\textbf{F}}), \quad \text {for} \,\,\, p\ge 1. \end{aligned}$$
(3)

where \(\Vert \cdot \Vert ^{p-2}\) is element-wise power over the node gradient \(\nabla {\textbf{F}}\).

Remark 1

Clearly, when \(p=2\), Eq. (3) recovers the classic graph Laplacian. When \(p =1\), Eq. (3) can be interpreted as a curvature operator defined on the nodes of the graph, because in that case it becomes \(\Delta _1 {\textbf{F}}= -\frac{1}{2}\text {div}\left( \frac{\nabla {\textbf{F}}}{\Vert \nabla {\textbf{F}}\Vert }\right)\). This aligns with the definition of the mean curvature operator in the continuous domain. Furthermore, we note that the p-Laplacian operator is linear when \(p = 2\), while for general \(p \ne 2\) the p-Laplacian is a non-linear operator since \(\Delta _p(a{\textbf{F}}) \ne a\Delta _p({\textbf{F}}) \,\,\, \text {for} \,\,\, a \in \mathbb R\backslash \{1\}\).
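The following is a minimal dense-tensor sketch of Definitions 2-4 (graph gradient, divergence and p-Laplacian); the names and the \(\epsilon\)-clamping used to avoid division by zero are illustrative assumptions rather than part of the definitions.

```python
import torch

def graph_gradient(F, W, deg, eps=1e-8):
    """(grad_W F)([i,j]) = sqrt(w_ij/d_jj) f_j - sqrt(w_ij/d_ii) f_i  (Definition 2)."""
    d_isqrt = deg.clamp(min=eps).pow(-0.5)
    Fn = d_isqrt[:, None] * F                      # row i holds f_i / sqrt(d_ii)
    return W.sqrt()[:, :, None] * (Fn[None, :, :] - Fn[:, None, :])

def graph_divergence(g, W, deg, eps=1e-8):
    """div(g)(i) = sum_j sqrt(w_ij/d_ii) (g[i,j] - g[j,i])  (Eq. (2))."""
    coef = (W / deg.clamp(min=eps)[:, None]).sqrt()
    return (coef[:, :, None] * (g - g.transpose(0, 1))).sum(dim=1)

def p_laplacian(F, W, deg, p=1.5, eps=1e-8):
    """Delta_p F = -(1/2) div(||grad F||^{p-2} grad F)  (Eq. (3)); norm taken edge-wise."""
    grad = graph_gradient(F, W, deg)
    scale = grad.norm(dim=-1, keepdim=True).clamp(min=eps).pow(p - 2)
    return -0.5 * graph_divergence(scale * grad, W, deg)
```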

Definition 5

(p-eigenvector and p-eigenvalue) Let \(\psi _p({\textbf{v}}) = (\Vert v_1 \Vert ^{p-2} v_1,..., \Vert v_N \Vert ^{p-2} v_N)\) for \({\textbf{v}}\in \mathbb R^N\). With some abuse of notation, we call \({\textbf{u}}\in \mathcal F_{\mathcal V}\) (\(d=1\)) a p-eigenvector of \(\Delta _p\) if \(\Delta _p {\textbf{u}}= \lambda \psi _p(\textbf{u})\), where \(\lambda\) is the associated p-eigenvalue.

Note that in the above definition, we use the fact that \(\textbf{u}\) acting on all nodes gives a vector in \(\mathbb R^N\). Let \(\psi _p(\textbf{U}) = (\psi _p(\textbf{U}_{:,1}),..., \psi _p(\textbf{U}_{:, N}))\) for \(\textbf{U}\in \mathbb R^{N \times N}\). It can be shown [18] that the p-Laplacian can be decomposed as \(\Delta _p = \psi _p(\textbf{U}) \varvec{\Lambda } \psi _p(\textbf{U})^T\) for some diagonal matrix \(\varvec{\Lambda }\).

2.4 Graph framelets

A framelet is a type of wavelet frame. The study [46] was the first to present a wavelet with a lifting scheme, providing a foundation for research on wavelet transforms on graphs. With the increase of computational power, [22] proposed a framework for wavelet transforms on graphs and employed Chebyshev polynomials to approximate the wavelets. [15] further developed tight framelets on graphs. This design has been applied to graph learning tasks [60] with substantial performance gains over classic GNNs. Recent studies show that framelets can naturally decompose the graph signal and re-aggregate it effectively, achieving significant graph noise reduction [62] with a double-term regularizer on the framelet coefficients. Combined with the singular value decomposition (SVD), framelets have also been made applicable to directed graphs [69].

A simple method of building more versatile and stable framelet families, known as Quasi-Framelets, was suggested in [56]. In this study, we introduce graph framelets using the same architecture as described in [56]. We begin by defining the filtering functions for Quasi-Framelets:

Definition 6

(Filtering functions for Quasi-Framelets) We call a set of \(R+1\) positive filtering functions defined on \([0, \pi ]\), \(\mathcal {F} = \{g_0(\xi ), g_1(\xi ),..., g_R(\xi )\}\), Quasi-Framelet scaling functions if the following identity condition is satisfied:

$$\begin{aligned} g_0(\xi )^2 + g_1(\xi )^2 + \cdots + g_R(\xi )^2 \equiv 1,\;\;\; \forall \xi \in [0, \pi ], \end{aligned}$$
(4)

such that \(g_0\) descends from 1 to 0 and \(g_R\) ascends from 0 to 1 as frequency increases over the spectral domain \([0, \pi ]\).

In particular, \(g_0\) aims to regulate the highest frequencies, \(g_R\) regulates the lowest frequencies, and the remaining functions regulate the frequencies in between.
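As a concrete illustration of Definition 6, the pair below satisfies the identity in Eq. (4) with \(R=1\); it is a simple Haar-type example chosen for clarity and is only an assumption here, since the Quasi-Framelet families used in [56] may differ.

```python
import torch

# g_0 descends from 1 to 0 and g_1 ascends from 0 to 1 on [0, pi],
# and g_0(xi)^2 + g_1(xi)^2 = 1 holds identically (Eq. (4)).
g0 = lambda xi: torch.cos(xi / 2)
g1 = lambda xi: torch.sin(xi / 2)

xi = torch.linspace(0, torch.pi, steps=100)
assert torch.allclose(g0(xi) ** 2 + g1(xi) ** 2, torch.ones_like(xi))
```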

Consider a graph \(\mathcal {G} =(\mathcal {V}, \mathcal {E})\) with its normalized graph Laplacian \(\widetilde{\textbf{L}}\). Let \(\widetilde{\textbf{L}}\) have the eigen-decomposition \(\widetilde{\textbf{L}} = \textbf{U}\varvec{\Lambda } \textbf{U}^T\), where \(\textbf{U}\) is the matrix of orthogonal spectral bases and \(\varvec{\Lambda } = \text {diag}(\lambda _1, \lambda _2,..., \lambda _N)\) contains the spectrum in increasing order. For a given set of Quasi-Framelet functions \(\mathcal {F} = \{g_0(\xi ), g_1(\xi ),..., g_R(\xi )\}\) defined on \([0, \pi ]\) and a given level J (\(\ge 0\)), define the following Quasi-Framelet signal transformation matrices

$$\begin{aligned} \mathcal {W}_{0,J}&= \textbf{U g}_0\left(\frac{\varvec{\Lambda }}{2^{m+J}}\right) \cdots g_0(\frac{\varvec{\Lambda }}{2^{m}}) \textbf{U}^T, \end{aligned}$$
(5)
$$\begin{aligned} \mathcal {W}_{r,0}&= \textbf{U g}_r\left(\frac{\varvec{\Lambda }}{2^{m}}\right) \textbf{U}^T, \;\;\text {for } r = 1,...,R, \end{aligned}$$
(6)
$$\begin{aligned} \mathcal {W}_{r,\ell }&= \textbf{U g}_r\left(\frac{\varvec{\Lambda }}{2^{m+\ell }}\right)g_0\left(\frac{\varvec{\Lambda }}{2^{m+\ell -1}}\right) \cdots g_0\left(\frac{\varvec{\Lambda }}{2^{m}}\right) \textbf{U}^T, \nonumber \\ \;\;\;\;&\text {for } r = 1,...,R,\;\; \ell = 1,...,J. \end{aligned}$$
(7)

Note that in the above definition, m is called the coarsest scale level, i.e., the smallest m satisfying \(2^{-m}\lambda _N \le \pi\). Denote by \(\mathcal {W} = [\mathcal {W}_{0,J}; \mathcal {W}_{1,0};...; \mathcal {W}_{R,0};\) \(\mathcal {W}_{1,1};...; \mathcal {W}_{R,J}]\) the stacked transformation matrix. It can be proved that \(\mathcal {W}^T\mathcal {W} = \textbf{I}\), which provides a signal decomposition and reconstruction process based on \(\mathcal {W}\). We call this the graph framelet transformation.

To alleviate the computational cost imposed by the eigendecomposition of the graph Laplacian, the framelet transformation matrices can be approximated by Chebyshev polynomials. Empirically, Chebyshev polynomials \(\mathcal {T}^n_r(\xi )\) of a fixed degree \(n=3\) are sufficient to approximate \(g_r(\xi )\). In the sequel, we simplify the notation by writing \(\mathcal {T}_r(\xi )\) instead of \(\mathcal {T}^n_r(\xi )\). The Quasi-Framelet transformation matrices defined in Eqs. (5)-(7) can then be approximated by

$$\begin{aligned} \mathcal {W}_{0,J}&\approx \mathcal {T}_0\left(\frac{1}{2^{m+J}}\widetilde{\textbf{L}}\right) \cdots \mathcal {T}_0\left(\frac{1}{2^{m}}\widetilde{\textbf{L}}\right), \end{aligned}$$
(8)
$$\begin{aligned} \mathcal {W}_{r,0}&\approx \mathcal {T}_r\left(\frac{1}{2^{m}}\widetilde{\textbf{L}}\right), \;\;\; \text {for } r = 1, ..., R, \end{aligned}$$
(9)
$$\begin{aligned} \mathcal {W}_{r,\ell }&\approx \mathcal {T}_r\left(\frac{1}{2^{m+\ell }}\widetilde{\textbf{L}}\right)\mathcal {T}_0\left(\frac{1}{2^{m+\ell -1}}\widetilde{\textbf{L}}\right) \cdots \mathcal {T}_0\left(\frac{1}{2^{m}}\widetilde{\textbf{L}}\right),&\;\;\;\; \text {for } r=1, ..., R, \ell = 1, ..., J. \end{aligned}$$
(10)

Remark 1

The approximated transformation matrices defined in Eqs. (8, 9, 10) depend only on the graph Laplacian. For directed graphs, we directly use the Laplacian normalized by the out-degrees in our experiments; we observe that this strategy leads to improved performance in general.

For a graph signal \(\textbf{X}\), the framelet (graph) convolution, analogous to the spectral graph convolution, can be defined as

$$\begin{aligned} \theta \star \textbf{X} = \mathcal {W}^T(\text {diag}(\theta ))(\mathcal {W}\textbf{X}), \end{aligned}$$
(11)

where \(\theta\) is the learnable network filter. We also call \(\mathcal {W}\textbf{X}\) the framelet coefficients of \(\textbf{X}\) in the Fourier domain. The signal is then filtered in the spectral domain according to the learnable filter \(\text {diag}(\theta )\).
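The sketch below assembles the transformation matrices of Eqs. (5)-(7) and the convolution of Eq. (11); it uses an exact eigendecomposition rather than the Chebyshev approximation purely for readability, and all names are illustrative.

```python
import torch

def framelet_matrices(L, gs, J, m):
    """W_{0,J}, W_{r,0}, W_{r,l} of Eqs. (5)-(7) via exact eigendecomposition of L."""
    lam, U = torch.linalg.eigh(L)
    spec = lambda g, s: U @ torch.diag(g(lam / s)) @ U.T      # U g(Lambda/s) U^T
    W0J = spec(gs[0], 2 ** m)
    for l in range(1, J + 1):
        W0J = spec(gs[0], 2 ** (m + l)) @ W0J                 # Eq. (5)
    Ws = [W0J]
    for r in range(1, len(gs)):
        Ws.append(spec(gs[r], 2 ** m))                        # Eq. (6)
        low = spec(gs[0], 2 ** m)
        for l in range(1, J + 1):
            Ws.append(spec(gs[r], 2 ** (m + l)) @ low)        # Eq. (7)
            low = spec(gs[0], 2 ** (m + l)) @ low
    return Ws

def framelet_conv(X, Ws, thetas):
    """theta * X = sum_k W_k^T diag(theta_k) (W_k X), the block form of Eq. (11)."""
    return sum(Wk.T @ (tk[:, None] * (Wk @ X)) for Wk, tk in zip(Ws, thetas))
```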

2.5 Regularized graph neural network

In reality, graph data normally have large, noisy and complex structures [25, 29], which makes choosing a suitable neural network for fitting the data challenging. It is well known that a common computational issue of most classic GNNs is over-smoothing, which can be quantified by the Dirichlet energy converging to zero as the number of layers increases. This observation has led to investigations into so-called implicit layers in regularized graph neural networks.

The first work to consider a GNN layer as a regularized signal smoothing process is [67], in which classic GNN layers are interpreted as solutions of regularized optimization problems, with certain approximation strategies to avoid the matrix inversion in the closed-form solution. The regularized layer can also be linked to implicit layers [57]; more specific examples are given in [67]. In general, given the node features, the output of a GNN layer can be written as the solution of the following optimization problem

$$\begin{aligned} \textbf{F}= \mathop {\mathrm {arg\,min}}\limits _{\textbf{F}} \left\{ \mu \Vert \textbf{X}- \textbf{F}\Vert ^2_F +\frac{1}{2} \text {tr}(\textbf{F}^T \widetilde{\textbf{L}}\textbf{F})\right\} , \end{aligned}$$
(12)

where \(\textbf{F}\) is the graph signal. Equation (12) defines the layer output as the solution of the (layer) optimization problem with a regularization term \(\text {tr}(\textbf{F}^T \widetilde{\textbf{L}}\textbf{F})\), which enforces smoothness of the graph signal. The closed-form solution is given by \(\textbf{F}= (\textbf{I}+ \frac{1}{2\mu } \widetilde{\textbf{L}})^{-1}\textbf{X}\), which is computationally inefficient because of the matrix inversion. However, we can rearrange it to \((\textbf{I}+ \frac{1}{2\mu } \widetilde{\textbf{L}})\textbf{F}= \textbf{X}\) and interpret it as the fixed point of the iteration \(\textbf{F}^{(t+1)} = -\frac{1}{2\mu } \widetilde{\textbf{L}} \textbf{F}^{(t)} + \textbf{X}\). The latter is then an implicit GNN layer at step t. Such an iteration is motivated by recurrent GNNs, as shown in several recent works [21, 34, 43]. Different from explicit GNNs, the output features \(\textbf{F}\) of a general implicit GNN are directly modeled as the solution of a well-defined implicit function, e.g., the first-order condition of an optimization problem. Denote the rows of \(\textbf{F}\) by \(\textbf{f}_i\) (\(i=1, 2,..., N\)), written as column vectors. It is well known that the regularization term in (12) is

$$\begin{aligned} \frac{1}{2}\text {tr}(\textbf{F}^T \widetilde{\textbf{L}}\textbf{F}) = \frac{1}{2} \sum _{(v_i,v_j)\in \mathcal {E}}\left\| \sqrt{\frac{w_{i,j}}{d_{j,j}}}\textbf{f}_j -\sqrt{\frac{w_{i,j}}{d_{i,i}}}\textbf{f}_i\right\| ^2, \end{aligned}$$

which is the so-called graph Dirichlet energy [64]. As proposed in [64], one can replace the graph Laplacian in the above equation with the graph p-Laplacian; the p-Laplacian based energy, denoted \(\mathcal S_p(\textbf{F})\), is then defined as (for any \(p \ge 1\)):

$$\begin{aligned} \mathcal {S}_p(\textbf{F}) = \frac{1}{2}\sum _{(v_i,v_j)\in \mathcal {E}}\left\| \sqrt{\frac{w_{i,j}}{d_{j,j}}}\textbf{f}_j -\sqrt{\frac{w_{i,j}}{d_{i,i}}}\textbf{f}_i\right\| ^p, \end{aligned}$$
(13)

where we adopt the definition of the element-wise p-norm as in [18]. Finally, the regularization problem becomes

$$\begin{aligned} \textbf{F}= \mathop {\mathrm {arg\,min}}\limits _{\textbf{F}} \left\{ \mu \Vert \textbf{X}- \textbf{F}\Vert ^2_F + \mathcal {S}_p(\textbf{F})\right\} . \end{aligned}$$
(14)

Given the strong generalizability of the regularization form in Eq. (14), it is natural to deploy it in multi-scale graph neural networks (i.e., graph framelets), both to explore whether the benefits of the p-Laplacian based regularizer remain and to understand how changes in p affect the feature propagation process in each individual filtered domain. In the next section, we accordingly propose our model and provide a detailed discussion and analysis of it.
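To make the above concrete, here is a minimal sketch of the fixed-point view of Eq. (12) and of the p-Laplacian energy \(\mathcal S_p(\textbf{F})\) in Eq. (13); the dense-tensor layout, the convergence condition noted in the comment, and the summation over ordered node pairs are illustrative assumptions.

```python
import torch

def laplacian_smoothing_layer(X, L_tilde, mu=2.0, n_iter=30):
    """Fixed-point iteration F^{(t+1)} = -(1/(2 mu)) L~ F^{(t)} + X below Eq. (12).

    Approximates (I + L~/(2 mu))^{-1} X without an explicit inverse; it converges
    when the spectral radius of L~/(2 mu) is below one (e.g. mu > 1 here)."""
    F = X.clone()
    for _ in range(n_iter):
        F = -(1.0 / (2 * mu)) * (L_tilde @ F) + X
    return F

def p_dirichlet_energy(F, W, deg, p=1.5, eps=1e-8):
    """S_p(F) of Eq. (13): edge-wise p-th powers of the graph gradient (ordered pairs)."""
    d_isqrt = deg.clamp(min=eps).pow(-0.5)
    Fn = d_isqrt[:, None] * F
    grad = W.sqrt()[:, :, None] * (Fn[None, :, :] - Fn[:, None, :])
    return 0.5 * grad.norm(dim=-1).pow(p).sum()
```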

3 The proposed models

In this section, we present our proposed models: the p-Laplacian Undecimated Framelet GNN (pL-UFG) and the p-Laplacian Fourier Undecimated Framelet GNN (pL-fUFG). In addition, we introduce a more general regularization framework and describe the algorithms of our proposed models.

3.1 p-Laplacian undecimated framelet GNN (pL-UFG)

Instead of simply taking the convolution result in Eq. (11) as the framelet layer output, we apply the p-Laplacian regularization to the framelet-reconstructed signal and define the new layer via the following optimization problem,

$$\begin{aligned} \textbf{F}= \mathop {\mathrm {arg\,min}}\limits _{\textbf{F}}\mathcal {S}_p(\textbf{F}) + \mu \Vert \textbf{F} - \mathcal {W}^T\text {diag}(\theta )\mathcal {W}\textbf{X}\Vert ^2_F , \end{aligned}$$
(15)

where \(\mu \in (0, \infty )\). Recall that the first term \(\mathcal {S}_p(\textbf{F})\) in the above equation is the p-Laplacian based energy defined in Eq. (13); it measures the total variation of the signal over the graph in terms of the p-Laplacian. The second term dictates that the optimal \(\textbf{F}\) should not deviate excessively from the framelet-reconstructed signal of the input \(\textbf{X}\). Each component of the network filter \(\theta\) in the frequency domain is applied to the framelet coefficients \(\mathcal {W}\textbf{X}\). We call the model in Eq. (15) pL-UFG2.

One possible variant of model (15) is to apply the regularization individually to the reconstruction at each scale level, i.e., for all \(r, \ell\), define

$$\begin{aligned} \textbf{F}_{r, \ell }&= \mathop {\mathrm {arg\,min}}\limits _{ \textbf{F}_{r, \ell } } \mathcal {S}_p( \textbf{F}_{r, \ell } ) + \mu \Vert \textbf{F}_{r, \ell } - \mathcal {W}^T_{r, \ell } \text {diag}(\theta _{r, \ell }) \mathcal {W}_{r, \ell } \textbf{X}\Vert ^2_F, \nonumber \\ \textbf{F}&= \textbf{F}_{0,J} + \sum ^R_{r=1}\sum ^J_{\ell =0} \textbf{F}_{r, \ell }. \end{aligned}$$
(16)

We name the model with the propagation defined in the above equation pL-UFG1. In our experiments, we observe that pL-UFG1 performs better than pL-UFG2.

3.2 p-Laplacian Fourier undecimated framelet GNN (pL-fUFG)

In pL-fUFG, we instead regularize the framelet coefficients. Following the same strategy as above, we consider the following optimization problem for each framelet transformation in the Fourier domain. For each framelet transformation \(\mathcal {W}_{r,\ell }\) defined in Eqs. (8)-(10), define

$$\begin{aligned} \textbf{F}_{r,\ell } = \mathop {\mathrm {arg\,min}}\limits _{ \textbf{F}_{r,\ell } } \mathcal {S}_p( \textbf{F}_{r,\ell } ) + \mu \Vert \textbf{F}_{r,\ell } - \text {diag}(\theta _{r,\ell }) \mathcal {W}_{r,\ell } \textbf{X}\Vert ^2_F. \end{aligned}$$
(17)

Then the final layer output is defined by the reconstruction as follows

$$\begin{aligned} \textbf{F}= \mathcal {W}^T_{0, J} \textbf{F}_{0,J} + \sum ^R_{r=1}\sum ^J_{\ell =0} \mathcal {W}^T_{r, \ell } \textbf{F}_{r, \ell }. \end{aligned}$$
(18)

Alternatively, we can simply aggregate all the regularized and filtered framelet coefficients as

$$\begin{aligned} \textbf{F}= \textbf{F}_{0,J} + \sum ^R_{r=1}\sum ^J_{\ell =0} \textbf{F}_{r,\ell }. \end{aligned}$$
(19)
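A minimal sketch of how a pL-fUFG layer composes Eqs. (17) and (19) is given below; the per-band solver `solve_p_laplacian` stands for the iterative algorithm derived later in Sect. 3.4 and, like all other names here, is an assumption for illustration rather than the released implementation.

```python
import torch

def pl_fufg_layer(X, Ws, thetas, solve_p_laplacian, mu=0.5, p=1.5):
    """Per-band regularization (Eq. (17)) followed by aggregation (Eq. (19)).

    Ws     : list of framelet matrices W_{r,l} (including W_{0,J}).
    thetas : matching list of learnable filters theta_{r,l} (one vector per band).
    solve_p_laplacian(Y, mu, p) : returns argmin_F S_p(F) + mu ||F - Y||_F^2."""
    F = torch.zeros_like(X)
    for W_rl, theta_rl in zip(Ws, thetas):
        Y_rl = theta_rl[:, None] * (W_rl @ X)          # diag(theta_{r,l}) W_{r,l} X
        F = F + solve_p_laplacian(Y_rl, mu=mu, p=p)    # F_{r,l}, aggregated per Eq. (19)
    return F
```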

With the forms of both pL-UFG and pL-fUFG in place, in the next subsection we present an even more general regularization framework that incorporates the p-Laplacian with graph framelets.

3.3 More general regularization

Fig. 1

The figure shows the workflow of the p-Laplacian regularized framelet. The input graph data is first filtered and reconstructed by the framelet model, which contains one low-pass and two high-pass filters (i.e., \(J=2\)). The result (denoted \(\textbf{Y}\)) is then further regularized by a sequence of graph convolutions and diagonal rescalings induced by the p-Laplacian, which is generated from the graph gradient information and serves as an implicit layer of the model. By adjusting the value of p, the node features resulting from this implicit layer can be smoothed or sharpened accordingly, allowing the model to adapt to both homophilic and heterophilic graphs. Lastly, the layer output is either forwarded to the task objective function or to the next framelet and p-Laplacian layers before the final prediction task

For convenience, we write the graph gradient for a multi-channel signal \(\textbf{F}\) as

$$\begin{aligned} \nabla _W \textbf{F}([i,j]) =\sqrt{\frac{w_{i,j}}{d_{j,j}}}\textbf{f}_j -\sqrt{\frac{w_{i,j}}{d_{i,i}}}\textbf{f}_i. \end{aligned}$$

or simply \(\nabla \textbf{F}\) if no confusion arises. For an undirected graph we have \(\nabla _W \textbf{F}([i,j]) = - \nabla _W \textbf{F}([j,i])\). With this notation, we can re-write the p-Laplacian regularizer in (13) (the element-wise p-norm) as follows.

$$\begin{aligned}&\mathcal {S}_p(\textbf{F}) = \frac{1}{2}\sum _{(v_i,v_j)\in \mathcal {E}}\left\| \nabla _W \textbf{F}([i,j])\right\| ^p \nonumber \\&=\frac{1}{2}\sum _{v_i\in \mathcal {V}} \left[ \left( \sum _{v_j \sim v_i} \left\| \nabla _W \textbf{F}([i,j])\right\| ^p\right) ^{\frac{1}{p}} \right] ^p \nonumber \\&= \frac{1}{2} \sum _{v_i \in \mathcal {V}} \Vert \nabla _W \textbf{F}(v_i) \Vert _p^p \end{aligned}$$
(20)

where \(v_j \sim v_i\) stands for the node \(v_j\) that is connected to node \(v_i\) and \(\nabla _W\textbf{F}(v_i)= \left( \nabla _W \textbf{F}([i,j]) \right) _{v_j: (v_i, v_j) \in \mathcal {E}}\) is the node gradient vector for each node \(v_i\) and \(\Vert \cdot \Vert _p\) is the vector p-norm. In fact, \(\Vert \nabla _W\textbf{F}(v_i)\Vert _p\) measures the variation of \(\textbf{F}\) in the neighbourhood of each node \(v_i\). Next, we generalize the regularizer by considering any positive convex function \(\phi\) as

$$\begin{aligned} \mathcal {S}^{\phi }_p(\textbf{F}) = \frac{1}{2} \sum _{v_i\in \mathcal {V}} \phi (\Vert \nabla _W\textbf{F}(v_i)\Vert _p). \end{aligned}$$
(21)

It is clear that when \(\phi (\xi ) = \xi ^p\), we recover the p-Laplacian regularizer. In the image processing field, several penalty functions have been proposed in the literature. For example, \(\phi (\xi ) = \xi ^2\) is known in the context of Tikhonov regularization [2, 3, 30]. When \(\phi (\xi ) = \xi\) (i.e., \(p=1\)), it is the classic total variation regularization. When \(\phi (\xi ) = \sqrt{\xi ^2 + \epsilon ^2} - \epsilon\), it is referred to as the regularized total variation; an example can be found in [11]. The case of \(\phi (\xi ) = r^2\log (1 + \xi ^2/r^2)\) corresponds to nonlinear diffusion [40].
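For reference, the sketch below lists these penalty functions \(\phi\) and their derivatives \(\phi'\) as used by the generalized regularizer in Eq. (21); the constants \(\epsilon\) and r (and their default values) are illustrative assumptions.

```python
import torch

# (phi, phi') pairs for the penalties discussed above; xi is a non-negative tensor.
penalties = {
    "p_power":  (lambda xi, p=1.5: xi.pow(p),
                 lambda xi, p=1.5: p * xi.pow(p - 1)),
    "tikhonov": (lambda xi: xi.pow(2),
                 lambda xi: 2 * xi),
    "total_variation": (lambda xi: xi,
                        lambda xi: torch.ones_like(xi)),
    "regularized_tv":  (lambda xi, e=1e-3: (xi ** 2 + e ** 2).sqrt() - e,
                        lambda xi, e=1e-3: xi / (xi ** 2 + e ** 2).sqrt()),
    "nonlinear_diffusion": (lambda xi, r=1.0: r ** 2 * torch.log1p(xi ** 2 / r ** 2),
                            lambda xi, r=1.0: 2 * xi / (1 + xi ** 2 / r ** 2)),
}
```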

In pL-UFG and pL-fUFG, we use \(\textbf{Y}\) to denote \(\mathcal {W}^T\text {diag}(\theta )\mathcal {W}\textbf{X}\) and \(\text {diag}(\theta _{r,\ell }) \mathcal {W}_{r,\ell } \textbf{X}\), respectively. We then propose the generalized p-Laplacian regularization model as

$$\begin{aligned} \textbf{F}= \mathop {\mathrm {arg\,min}}\limits _{\textbf{F}}\mathcal {S}^{\phi }_p(\textbf{F}) + \mu \Vert \textbf{F} - \textbf{Y}\Vert ^2_F. \end{aligned}$$
(22)

3.4 The algorithm

We derive an iterative algorithm for solving the generalized p-Laplacian regularization problem (22), presented as the following theorem.

Theorem 1

For a given positive convex function \(\phi (\xi )\), define

$$\begin{aligned} M_{i,j}&= \frac{w_{i,j}}{2} \left\| \nabla _W \textbf{F}([i,j]) \right\| ^{p-2} \cdot \\&\quad \left[ \frac{\phi '(\Vert \nabla _W\textbf{F}(v_i)\Vert _p)}{\Vert \nabla _W\textbf{F}(v_i)\Vert ^{p-1}_p} \right. + \left. \frac{\phi '(\Vert \nabla _W\textbf{F}(v_j)\Vert _p)}{\Vert \nabla _W\textbf{F}(v_j)\Vert ^{p-1}_p} \right] ,\\ \alpha _{ii}&= 1/\left( \sum _{v_j\sim v_i}\frac{M_{i,j}}{d_{i,i}} + 2\mu \right) ,\quad \quad \beta _{ii} = 2\mu \alpha _{ii} \end{aligned}$$

and denote the matrices \(\textbf{M}= [M_{i,j}]\), \(\varvec{\alpha } = \text {diag}(\alpha _{11},..., \alpha _{NN})\) and \(\varvec{\beta } = \text {diag}(\beta _{11},..., \beta _{NN})\). Then the solution to problem (22) can be solved by the following message passing process

$$\begin{aligned} \textbf{F}^{(k+1)} = \varvec{\alpha }^{(k)} \textbf{D}^{-1/2}\textbf{M}^{(k)} \textbf{D}^{-1/2} \textbf{F}^{(k)} + \varvec{\beta }^{(k)}\textbf{Y}, \end{aligned}$$
(23)

with an initial value, e.g., \(\textbf{F}^{(0)} = \textbf{0}\).

Here \(\varvec{\alpha }^{(k)}\), \(\varvec{\beta }^{(k)}\) and \(\textbf{M}^{(k)}\) are calculated from the current features \(\textbf{F}^{(k)}\). When \(\phi (\xi ) = \xi ^p\), Eq. (23) yields the algorithm for solving pL-UFG and pL-fUFG.
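A minimal dense-tensor sketch of the message passing in Eq. (23) follows; initializing with \(\textbf{F}^{(0)} = \textbf{Y}\) (as also used in Sect. 3.5), the \(\epsilon\)-clamping, and all tensor names are assumptions made for illustration only.

```python
import torch

def p_laplacian_iteration(Y, W, deg, p=1.5, mu=0.1, K=10, phi_prime=None, eps=1e-8):
    """Iterate F^{(k+1)} = alpha D^{-1/2} M D^{-1/2} F^{(k)} + beta Y  (Eq. (23)).

    Y : (N, c) framelet-filtered signal; W : (N, N) weights; deg : (N,) degrees.
    phi_prime defaults to the derivative of phi(xi) = xi^p."""
    if phi_prime is None:
        phi_prime = lambda xi: p * xi.pow(p - 1)
    d_isqrt = deg.clamp(min=eps).pow(-0.5)
    F = Y.clone()                                             # F^{(0)} = Y
    for _ in range(K):
        Fn = d_isqrt[:, None] * F
        grad = W.sqrt()[:, :, None] * (Fn[None, :, :] - Fn[:, None, :])
        edge_norm = grad.norm(dim=-1).clamp(min=eps)          # ||grad_W F([i,j])||
        node_norm = edge_norm.pow(p).sum(dim=1).pow(1 / p).clamp(min=eps)
        ratio = phi_prime(node_norm) / node_norm.pow(p - 1)   # phi'(.) / ||.||_p^{p-1}
        M = 0.5 * W * edge_norm.pow(p - 2) * (ratio[:, None] + ratio[None, :])
        alpha = 1.0 / ((M / deg.clamp(min=eps)[:, None]).sum(dim=1) + 2 * mu)
        beta = 2 * mu * alpha
        F = alpha[:, None] * (d_isqrt[:, None] * (M @ (d_isqrt[:, None] * F))) \
            + beta[:, None] * Y
    return F
```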

Proof

Define \(\mathcal {L}^{\phi }_p(\textbf{F})\) as the objective function in Eq. (22) and consider a node feature \(\textbf{f}_i\) in \(\textbf{F}\); then we have

$$\begin{aligned}&\frac{\partial \mathcal {L}^{\phi }_p(\textbf{F})}{\partial \textbf{f}_i} = 2\mu (\textbf{f}_i \!-\! \textbf{y}_i)\! \\&\qquad +\! \frac{1}{2}\sum _{v_k\in \mathcal {V}}\frac{\partial }{\partial \textbf{f}_i}\phi (\Vert \nabla _W\textbf{F}(v_k)\Vert _p) \\&\quad =2\mu (\textbf{f}_i - \textbf{y}_i) + \frac{1}{2}\frac{\partial }{\partial \textbf{f}_i}\phi (\Vert \nabla _W\textbf{F}(v_i)\Vert _p) \\&\qquad +\frac{1}{2}\sum _{v_j\sim v_i}\frac{\partial }{\partial \textbf{f}_i}\phi (\Vert \nabla _W\textbf{F}(v_j)\Vert _p)\\&\quad =2\mu (\textbf{f}_i - \textbf{y}_i) + \frac{1}{2}\phi '(\Vert \nabla _W\textbf{F}(v_i)\Vert _p)\\&\quad \frac{1}{p}(\Vert \nabla _W\textbf{F}(v_i)\Vert _p)^{1-p}\frac{\partial }{\partial \textbf{f}_i}\!\\&\quad \left( \!\sum _{v_j\sim v_i} \Vert \nabla _{i,j}(w, \textbf{f})\Vert ^p\!\right) \! \\&\qquad +\!\frac{1}{2p} \sum _{v_j\sim v_i} \frac{\partial }{\partial \textbf{f}_i}\Vert \nabla _{j,i}(w,\textbf{f})\Vert ^p \\&\quad \cdot \phi '(\Vert \nabla _W\textbf{F}(v_j)\Vert _p)(\Vert \nabla _W\textbf{F}(v_j)\Vert _p)^{1-p} \\&\quad = 2\mu (\textbf{f}_i - \textbf{y}_i) + \frac{1}{2p}\phi '(\Vert \nabla _W\textbf{F}(v_i)\Vert _p)(\Vert \nabla _W\textbf{F}(v_i)\Vert _p)^{1-p} \cdot \\&\quad \left( \!\sum _{v_j\sim v_i}p \Vert \nabla _W \textbf{F}([i,j])\Vert ^{p-2} \sqrt{\frac{w_{i,j}}{d_{i,i}}} (-\nabla _W \textbf{F}([i,j]))\!\right) \\&\qquad + \frac{1}{2}\sum _{v_j\sim v_i} \phi '(\Vert \nabla _W\textbf{F}(v_j)\Vert _p) (\Vert \nabla _W\textbf{F}(v_j)\Vert _p)^{1-p} \Vert \nabla _W \textbf{F}([i,j])\Vert ^{p-2} \\&\quad \sqrt{\frac{w_{i,j}}{d_{i,i}}}\nabla _W \textbf{F}([j,i])\\&\quad =2\mu (\textbf{f}_i\! -\! \textbf{y}_i)\! +\! \sum _{v_j\sim v_i}\! \frac{1}{2}\Vert \nabla _W \textbf{F}([i,j])\Vert ^{p-2}\!\!\\&\quad \sqrt{\frac{w_{i,j}}{d_{i,i}}}\nabla _W \textbf{F}([j,i]) \cdot \\&\quad \left[ \frac{\phi '(\Vert \nabla _W\textbf{F}(v_i)\Vert _p)}{\Vert \nabla _W\textbf{F}(v_i)\Vert ^{p-1}_p} + \frac{\phi '(\Vert \nabla _W\textbf{F}(v_j)\Vert _p)}{\Vert \nabla _W\textbf{F}(v_j)\Vert ^{p-1}_p} \right] \end{aligned}$$

Setting \(\frac{\partial \mathcal {L}^{\phi }_p(\textbf{F})}{\partial \textbf{f}_i} = 0\) gives the following first order condition:

$$\begin{aligned}&\bigg ( \sum _{v_j\sim v_i}\frac{1}{2}\left[ \frac{\phi '(\Vert \nabla _W\textbf{F}(v_i)\Vert _p)}{\Vert \nabla _W\textbf{F}(v_i)\Vert ^{p-1}_p} + \frac{\phi '(\Vert \nabla _W\textbf{F}(v_j)\Vert _p)}{\Vert \nabla _W\textbf{F}(v_j)\Vert ^{p-1}_p} \right] \\&\quad \cdot \Vert \nabla _{i,j}(w, \textbf{f})\Vert ^{p-2}\frac{w_{i,j}}{d_{i,i}} + 2\mu \bigg ) \textbf{f}_i \\&=\sum _{v_j\sim v_i}\frac{1}{2}\left[ \frac{\phi '(\Vert \nabla _W\textbf{F}(v_i)\Vert _p)}{\Vert \nabla _W\textbf{F}(v_i)\Vert ^{p-1}_p} + \frac{\phi '(\Vert \nabla _W\textbf{F}(v_j)\Vert _p)}{\Vert \nabla _W\textbf{F}(v_j)\Vert ^{p-1}_p} \right] \\&\quad \cdot \Vert \nabla _W \textbf{F}([i,j])\Vert ^{p-2}\frac{w_{i,j}}{\sqrt{d_{i,i}d_{j,j}}}\textbf{f}_j + 2\mu \textbf{y}_i. \end{aligned}$$

This equation defines the message passing on each node \(v_i\) from its neighbour nodes \(v_j\). With the definition of \(M_{i,j}\), \(\alpha _{ii}\) and \(\beta _{ii}\), it can be turned into the iterative formula in Eq. (23). This completes the proof. \(\square\)

Remark 2

When \(p\le 1\), the objective function is not differentiable in some extreme cases, for example when neighbouring node signals are identical and the nodes have the same degrees. In these rare cases, the first-order condition cannot be applied in the above proof. In practice, however, we suggest the following alternative iterative algorithm to solve the optimization problem. In fact, we can split the terms in \(\mathcal {S}_p(\textbf{F})\) as

$$\begin{aligned} \mathcal {S}_p(\textbf{F}) = \frac{1}{2}\sum _{(v_i,v_j)\in \mathcal {E}}\left\| \nabla _W \textbf{F}([i,j])\right\| ^{p-2}\left\| \nabla _W \textbf{F}([i,j])\right\| ^2. \end{aligned}$$
(24)

At iteration k, we take \(w^{\text {new}}_{i,j} = w_{i,j} \left\| \nabla _W \textbf{F}([i,j])\right\| ^{p-2}\) as the new edge weights; the next iterate is then defined as the minimizer of the Dirichlet energy with the new weights, i.e.,

$$\begin{aligned} \textbf{F}^{(k+1)} = \mathop {\mathrm {arg\,min}}\limits \frac{1}{2} \sum _{(v_i,v_j)\in \mathcal {E}} \left\| \nabla _{W^{\text {new}}} \textbf{F}([i,j])\right\| ^2. \end{aligned}$$

Thus one step of the classic GCN can be applied in the iteration to solve the p-Laplacian regularized problem (14).
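A minimal sketch of this alternative iteration from Remark 2 is shown below; interpreting the Dirichlet-energy minimization step as one normalized-adjacency smoothing step is our simplifying assumption for illustration.

```python
import torch

def reweighted_gcn_step(F, W, deg, p=1.0, eps=1e-8):
    """Reweight edges by ||grad_W F([i,j])||^{p-2} (Eq. (24)) and take one smoothing step."""
    d_isqrt = deg.clamp(min=eps).pow(-0.5)
    Fn = d_isqrt[:, None] * F
    grad_norm = (W.sqrt()[:, :, None] * (Fn[None, :, :] - Fn[:, None, :])).norm(dim=-1)
    W_new = W * grad_norm.clamp(min=eps).pow(p - 2)           # w_ij^{new}
    deg_new = W_new.sum(dim=1).clamp(min=eps)
    A_hat = deg_new.pow(-0.5)[:, None] * W_new * deg_new.pow(-0.5)[None, :]
    return A_hat @ F                                          # one GCN-style step
```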

3.5 Interaction between p-Laplacian and framelets

In this section, we present theoretical support for how the p-Laplacian regularizer interacts with the framelets in the model. Specifically, we show that the proposed algorithm in Eq. (23) provides a sequence of (p-Laplacian based) spectral graph convolutions and diagonal scalings of the node features over the framelet-filtered graph signals. This is indicated by the following analysis.

First, considering the iteration in Eq. (23) with initial value \(\textbf{F}^{(0)} = \textbf{Y}= \mathcal {W}^T\text {diag}(\theta )\mathcal {W}\textbf{X}\) or \(\text {diag}(\theta _{r,\ell }) \mathcal {W}_{r,\ell } \textbf{X}\), without loss of generality, we have

$$\begin{aligned}&\textbf{F}^{(K)} = \varvec{\alpha }^{(K\!-\!1)} \textbf{D}^{-1/2}\textbf{M}^{(K-1)} \textbf{D}^{-1/2} \textbf{F}^{(K\!-\!1)}\! + \!\varvec{\beta }^{(K-1)}\textbf{Y}\\&\quad = \varvec{\alpha }^{(K-1)}\widetilde{\textbf{M}}^{(K-1)}\textbf{F}^{(K-1)} + \varvec{\beta }^{(K-1)}\textbf{Y}\\&\quad = \varvec{\alpha }^{(K-1)}\widetilde{\textbf{M}}^{(K-1)}\left( \varvec{\alpha }^{(K-2)}\widetilde{\textbf{M}}^{(K-2)}\textbf{F}^{(K-2)} \right. \\&\left. \quad + \varvec{\beta }^{(K-2)}\textbf{Y}\right) + \varvec{\beta }^{(K-1)}\textbf{Y}\\&\quad = \varvec{\alpha }^{(K-1)}\widetilde{\textbf{M}}^{(K-1)}\varvec{\alpha }^{(K-2)} \widetilde{\textbf{M}}^{(K-2)}\textbf{F}^{(K-2)}\\&\quad + \varvec{\alpha }^{(K-1)}\widetilde{\textbf{M}}^{(K-1)} \varvec{\beta }^{(K-2)}\textbf{Y}+\varvec{\beta }^{(K-1)}\textbf{Y}\\&= \left( \prod _{k=0}^{K-1} \varvec{\alpha }^{(k)}\widetilde{\textbf{M}}^{(k)}\right) \textbf{Y}\\&\quad + \varvec{\beta }^{(K-1)}\textbf{Y}+ \sum _{k=0}^{K-1} \left( \prod _{l =K-k}^{K-1} \varvec{\alpha }^{(l)}\widetilde{\textbf{M}}^{(l)}\right) \varvec{\beta }^{(K-k-1)}\textbf{Y}. \end{aligned}$$

The result \(\textbf{F}^{(K)}\) depends on the key operations \(\varvec{\alpha }^{(k)}\widetilde{\textbf{M}}^{(k)}\) for \(k = 0, 1,..., K-1\), where

$$\begin{aligned}&\widetilde{M}^{(k)}_{i,j} = \frac{w_{i,j}}{\sqrt{d_{i,i}d_{j,j}}} \left\| \nabla _W\textbf{F}^{(k)}[(i,j)] \right\| ^{p-2} \!\!\! \cdot \left[ \frac{\phi '(\Vert \nabla _W\textbf{F}^{(k)} (v_i)\Vert _p)}{\Vert \nabla _W\textbf{F}^{(k)} (v_i)\Vert ^{p-1}_p} \right. \nonumber \\&\qquad \left. + \frac{\phi '(\Vert \nabla _W\textbf{F}^{(k)} (v_j)\Vert _p)}{\Vert \nabla _W\textbf{F}^{(k)} (v_j)\Vert ^{p-1}_p} \right] \nonumber \\&\quad =\frac{w_{i,j} \cdot w^{(k)}_{i,j}(p)}{\sqrt{d_{i,i}d_{j,j}}} \left\| \nabla _W \textbf{F}^{(k)}[(i,j)] \right\| ^{p-2}, \end{aligned}$$
(25)
$$\begin{aligned}&\alpha _{i,i}^{(k)} = 1/\left( \sum _{v_j\sim v_i}{\widetilde{M}^{(k)}_{i,j}} + 2\mu \right) , \end{aligned}$$
(26)

where \(\widetilde{M}_{ij}\) has absorbed \(d_{i,i}\) and \(d_{j,j}\) into \(M_{ij}\) defined in Theorem 1, i.e., \(\widetilde{M}_{ij} = M_{i,j}/\sqrt{d_{i,i}d_{j,j}}\).

Denote

$$\begin{aligned} w^{(k)}_{i,j}(p)=\left[ \frac{\phi '(\Vert \nabla _W\textbf{F}^{(k)} (v_i)\Vert _p)}{\Vert \nabla _W\textbf{F}^{(k)} (v_i)\Vert ^{p-1}_p} + \frac{\phi '(\Vert \nabla _W\textbf{F}^{(k)} (v_j)\Vert _p)}{\Vert \nabla _W\textbf{F}^{(k)} (\!v_j\!)\Vert ^{p-1}_p} \right] . \end{aligned}$$

To introduce the following analysis, define the relevant matrices \(\widetilde{\textbf{M}}^{(k)} = [\widetilde{M}^{(k)}_{i,j}]\), \(\varvec{\alpha }^{(k)} = \text {diag}(\alpha _{ii}^{(k)} )\) and \(\textbf{W}^{(k)}_p = [w^{(k)}_{i,j}(p)]= \textbf{w}^{(k)}_p \oplus (\textbf{w}^{(k)}_p)^T\) (the Kronecker sum) with the column vector

$$\begin{aligned} \textbf{w}^{(k)}_p = \left[ \frac{\phi '(\Vert \nabla _W\textbf{F}^{(k)} (v_i)\Vert _p)}{\Vert \nabla _W\textbf{F}^{(k)} (v_i)\Vert ^{p-1}_p}\right] ^N_{i=1}. \end{aligned}$$

Now, recall that the definition of the classic p-Laplacian is:

$$\begin{aligned} \Delta _p (\textbf{F}): = - \frac{1}{2} \text {div}( \Vert \nabla \textbf{F}\Vert ^{p-2} \nabla \textbf{F}), \quad \text {for} \,\,\, p\ge 1, \end{aligned}$$
(27)

Our purpose is to show that the key operation in Eq. (25) of our generalized algorithm can be implemented as a linear combination of classic p-Laplacian filtering. First, based on [18], the operator in Eq. (27) in the original p-Laplacian message passing algorithm can be written in matrix form as follows

$$\begin{aligned} \Delta ^{(k)}_p(\textbf{F}^{(k)}) = \left( (\varvec{\alpha }^{(k)}_0)^{-1} - 2\mu \textbf{I}_N \right) \textbf{F}^{(k)} - {\textbf{G}^{(k)}} \textbf{F}^{(k)}, \end{aligned}$$
(28)

where the matrix \(\textbf{G}^{(k)}\) has elements

$$\begin{aligned} G^{(k)}_{i,j} =&\frac{w_{i,j}}{\sqrt{d_{i,i}d_{j,j}}} \left\| \nabla _W \textbf{F}^{(k)}[(i,j)] \right\| ^{p-2}, \end{aligned}$$

and the diagonal matrix \(\varvec{\alpha }^{(k)}_0\) has diagonal element defined as

$$\begin{aligned} (\alpha ^{(k)}_0)_{ii} =&1/\left( \sum _{v_j\sim v_i}{G^{(k)}_{i,j}} + 2\mu /p\right) . \end{aligned}$$

Eq. (28) shows that the operation \(\Delta ^{(k)}_p(\textbf{F}^{(k)})\) can be represented as the product of a matrix (still denoted as \(\Delta ^{(k)}_p\)) and the signal matrix \(\textbf{F}^{(k)}\).

Noting that \(\textbf{G}^{(k)} = \widetilde{\textbf{M}}^{(k)}\oslash \textbf{W}^{(k)}_p\), multiplying diagonal matrix \(\varvec{\alpha }^{(k)}\) on both sides of Eq. (28) and taking out \(\textbf{F}^{(k)}\) give

$$\begin{aligned} \varvec{\alpha }^{(k)} \Delta ^{(k)}_p = \left( \varvec{\alpha }^{(k)}(\varvec{\alpha }^{(k)}_0)^{-1} - 2\mu \varvec{\alpha }^{(k)} \right) - \varvec{\alpha }^{(k)} (\widetilde{\textbf{M}}^{(k)}\oslash \textbf{W}^{(k)}_p). \end{aligned}$$

As \(\varvec{\alpha }^{(k)}\) is diagonal, we can re-write the above equality as

$$\begin{aligned} \varvec{\alpha }^{(k)} \widetilde{\textbf{M}}^{(k)} = \left( \left( \varvec{\alpha }^{(k)}(\varvec{\alpha }^{(k)}_0)^{-1} - 2\mu \varvec{\alpha }^{(k)} \right) - \varvec{\alpha }^{(k)} \Delta ^{(k)}_p \right) \odot \textbf{W}^{(k)}_p. \end{aligned}$$

Given that \(\textbf{W}^{(k)}_p\) is the Kronecker sum of the vector \(\textbf{w}^{(k)}_p\) and its transpose, if we also use \(\textbf{w}^{(k)}_p\) to denote the diagonal matrix formed from this vector, then we have

$$\begin{aligned}&\varvec{\alpha }^{(k)} \widetilde{\textbf{M}}^{(k)} = \left( \left( \varvec{\alpha }^{(k)}/\varvec{\alpha }^{(k)}_0 - 2\mu \varvec{\alpha }^{(k)} \right) - \varvec{\alpha }^{(k)} \Delta ^{(k)}_p \right) \textbf{w}^{(k)}_p \nonumber \\&\quad + \textbf{w}^{(k)}_p \left( \left( \varvec{\alpha }^{(k)}/\varvec{\alpha }^{(k)}_0 - 2\mu \varvec{\alpha }^{(k)} \right) - \varvec{\alpha }^{(k)} \Delta ^{(k)}_p \right) . \end{aligned}$$
(29)

This demonstrates that the key term \(\varvec{\alpha }^{(k)} \widetilde{\textbf{M}}^{(k)}\) in our generalized algorithm is linear in the p-Laplacian operator \(\Delta ^{(k)}_p\). As demonstrated in [18], the operator \(\Delta ^{(k)}_p\) approximately performs a spectral graph convolution. Hence we can conclude that the generalized p-Laplacian iteration (23) indeed performs a sequence of graph spectral convolutions (see Lemma 1 below) and gradient-based diagonal transformations (i.e., node feature scaling) over the framelet-filtered graph signals.

Lemma 1

The matrix \(\Delta ^{(k)}_p\) is symmetric positive definite (SPD; see [4, 37]) and hence has its own eigendecomposition, offering a graph spectral convolution interpretation.

Remark 3

Eq. (29) indicates that the p-Laplacian regularizer performs a graph spectral convolution on top of the framelet filtering, which conceptually produces a second layer of filtering and hence further restricts the solution space. See Fig. 1 for an illustration. Interestingly, the combination of p-Laplacian regularization and the framelet offers a more adaptive smoothness that suits both homophilic and heterophilic data, as shown in our experiments.

Remark 4

In terms of the asymptotic behavior of Eq. (25), one can show roughly that the elements of \(\varvec{\alpha }^{(k)}\widetilde{\textbf{M}}^{(k)}\) lie between 0 and 1, so their product converges to zero if K is too large. Therefore a large K annihilates the first term in the expansion of \(\textbf{F}^{(K)}\) above but leaves the second term and part of the sum in the third. A larger value of \(\mu\) appears to speed up this convergence further, shortening the time needed to find the solution; however, this is a model selection problem in terms of generalizability. Moreover, the inclusion of the source term \(\varvec{\beta }^{(K-1)}\textbf{Y}\) guarantees that a certain amount of variation of the node features from both the low- and high-frequency domains of the graph framelet is supplied, and it has been shown that such an inclusion can help the model escape from the over-smoothing issue [12].

Remark 5

Compared to GPRGNN [12] in which the model outcome is obtained by

$$\begin{aligned} \hat{\textbf{F}} = \sum _{k=0}^{K-1}\varvec{\gamma }_k\textbf{F}^{(k)}, \end{aligned}$$

where the \(\varvec{\gamma }_k \in \mathbb R^{N\times N}\) are learnable generalized PageRank coefficients, our proposed algorithm defined in Eq. (23), as shown in Eq. (29), applies a mixed operation (spectral convolution and diagonal scaling) to the framelet graph signal outputs and thus further directs the solution space of the framelet to a reasonably defined region. Furthermore, due to the use of Chebyshev polynomials in the framelet, the computational complexity of the first filtering is not high, which helps define a provably correct space for the second operation. This shows the efficiency of incorporating framelets into p-Laplacian based regularization.

4 Discussions on the proposed model

In this section, we provide comprehensive discussions of our proposed models. We mainly focus on the properties of the generalized p-Laplacian based framelet GNN (Eq. (22)) and the iterative algorithm in Eq. (23), although we also present some conclusions for pL-UFG and pL-fUFG. Specifically, in Sect. 4.1 we discuss the denoising power of our proposed model. Section 4.2 analyzes the model's computational complexity. Section 4.3 provides a comparison between the p-Laplacian regularizer and other regularizers applied to GNNs in different learning tasks. Finally, the section concludes (Sect. 4.4) with the limitations of the model and potential directions for future studies.

4.1 Discussion on the denoising power

The denoising power of framelets was developed in the study [15] and empirically examined in past works [15, 56, 62]. Let \(\textbf{X}= \textbf{F}+ \mathbf \tau\) be the input node feature matrix with noise \(\mathbf \tau\). In [15, 62], a framelet-based regularizer was proposed, leading to the optimization problem:

$$\begin{aligned} \min _\textbf{F} \Vert \mathcal D\textbf{F}\Vert _{1,\mathcal G}+ f(\textbf{F}, \textbf{X}), \end{aligned}$$
(30)

where \(f(\cdot ,\cdot )\) is some fidelity function that takes different forms for different applications. For any graph signal \(\textbf{F}\), the graph \(\mathbb L_p\)-norm is \(\Vert \textbf{F}\Vert _{p,\mathcal G}:= \left( \sum _i \vert \textbf{f}_i\vert ^p \times d_{i,i}\right) ^{1/p}\), where \(d_{i,i}\) is the degree of the i-th node. \(\mathcal D\) is a linear transformation generated from discrete transformations, for example the graph Fourier transform or the framelet transform. Thus, the regularizer assigns a higher penalty to nodes with a larger number of neighbours. Specifically, taking \(\mathcal D\) to be the framelet transformation, the first term in Eq. (30) can be written as:

$$\begin{aligned} \Vert \mathcal D\textbf{F}\Vert _{1,\mathcal G} = \left( \sum _{(r,\ell )\in \mathcal Z}\sum _i \vert \textbf{F}_{r,\ell }[i]\vert ^p \times d_{i,i}\right) ^{1/p}. \end{aligned}$$
(31)

where \(\mathcal Z = \{ (r,\ell ): r = 1,...,R, \ell = 0,...,J \} \cup \{ (0,J)\}\). Replacing the p-Laplacian regularizer in Eq. (15) by the framelet regularizer defined in Eq. (31), we obtain:

$$\begin{aligned} \textbf{F}&= \mathop {\mathrm {arg\,min}}\limits _{\textbf{F}} \left( \sum _{(r,\ell )\in \mathcal Z}\sum _i \vert \textbf{F}_{r,\ell }[i]\vert ^p \times d_{i,i} \right) ^{1/p} +\mu \Vert \textbf{F}\nonumber \\&\quad - \mathcal {W}^T\text {diag}(\theta )\mathcal {W}\textbf{X}\Vert ^2_F. \end{aligned}$$
(32)

Following the work in [15], the framelet regularizer-based denoising problem in Eq. (30) is associated with a variational regularization problem that can be described in a form similar to Eq. (15), where the variational term \(\mathcal S_p(\textbf{F})\) is used to regularize the framelet objective function. By simply replacing the argument of the first term in Eq. (32) with \(\nabla \textbf{F}\) and omitting the node degree information, we have:

$$\begin{aligned} \textbf{F}= \mathop {\mathrm {arg\,min}}\limits _{\textbf{F}} \sum _{(r, \ell )\in \mathcal Z}\Vert \nabla \textbf{F}_{r,\ell }\Vert _p^p+\mu \Vert \textbf{F} -\mathcal {W}^T\text {diag}(\theta )\mathcal {W}\textbf{X}\Vert ^2_F. \end{aligned}$$
(33)

Therefore our proposed model is naturally equipped with denoising capability, and the corresponding (denoising) regularization term aligns with the denoising regularizer developed in [64] without the node degree information. However, [64] only handles \(p=1\) and \(p=2\), whereas our p-Laplacian framelet model covers all values of \(p \in \mathbb R_{+}\). This allows more flexibility and effectiveness in the model.

Furthermore, the major difference between most wavelet frame models and classic variational models is the choice of the underlying transformation (i.e., \(\mathcal D\) applied to the graph signal) that maps the data into a transformed domain where it can be sparsely approximated. Given that both the p-Laplacian (Eq. (15)) and framelet (Eq. (31)) regularizers target the graph signal \(\textbf{F}_\mathcal Z\) produced from the sparse and tight framelet transformation (via Chebyshev polynomials), this further illustrates the equivalence between the two denoising regularizers. Please refer to [15] for more details.
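For completeness, a minimal sketch of the graph \(\mathbb L_p\)-norm appearing in Eqs. (30)-(31) is given below; taking \(\vert \textbf{f}_i\vert\) entry-wise over the feature channels is our assumption for illustration, and the exact convention in [15, 62] may differ.

```python
import torch

def graph_lp_norm(F, deg, p=1.0):
    """||F||_{p,G} = (sum_i |f_i|^p * d_ii)^{1/p}, with |f_i| taken entry-wise."""
    return (F.abs().pow(p) * deg[:, None]).sum().pow(1.0 / p)
```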

4.2 Discussion on the computational complexity

In this section, we briefly discuss the computational complexity of our proposed model, specifically the generalized model defined in Eq. (22), whose solution can be approximated by the algorithm presented in Eq. (23). The primary computational cost of our model stems from the generation of the framelet decomposition matrices (\(\mathcal W\)) or the framelet transformation applied to the node features. We acknowledge that generating \(\mathcal W\) involves a high computational complexity, as traditional framelet methods typically employ an eigen-decomposition of the graph Laplacian for this purpose. Consequently, the framelet transform itself exhibits a time complexity of \(\mathcal {O}(N^2(nj+1)Kc)\) and a space complexity of \(\mathcal {O}(N^2(nj+1)c)\) for a graph with N nodes and c-dimensional features. Notably, the constants n, j, and K are independent of the graph data. Existing literature [60] has demonstrated that when the input graph contains fewer than \(4\times 10^4\) nodes (with fixed feature dimension), the computational time for framelet convolution is comparable to that of graph attention networks with 8 attention heads [48]. As the number of nodes increases, the performance of graph attention networks degrades rapidly, while graph framelets maintain faster computation. Moreover, our use of Chebyshev polynomials to approximate \(\mathcal W\) significantly reduces the associated computational cost compared to traditional methods. Additionally, we acknowledge that the inclusion of the p-Laplacian based implicit layer introduces additional computational cost to the original framelet model, arising primarily from the computation of the norm of the graph gradient, denoted \(\Vert \nabla \textbf{F}\Vert\). Taking the Euclidean norm as an example, the computational cost of \(\Vert \nabla \textbf{F}\Vert\) scales as \(\mathcal O(N\times c)\), where N and c represent the number of nodes and the feature dimension, respectively. Thus, even when accounting for the computation of the implicit layer, the overall cost of our proposed method remains comparable to that of the framelet model.

4.3 Comparison with other regularizers and potential application scenarios

As illustrated in Sect. 3.3, with the assistance of different forms of regularizers, GNN performance can be enhanced across different learning tasks. In this discussion, we position our study within the wider research landscape that investigates various regularizers to enhance the performance of graph neural networks (GNNs) across different learning tasks. We discussed the denoising power of pL-UFG earlier in Sect. 4.1, establishing that it can be expressed in terms of a denoising formulation. This is comparable to the approach of [38], which used a different regularization term to highlight the denoising capabilities of GNNs; the authors showed that their regularizer can effectively reduce noise and enhance the robustness of GNN models.

Our study, however, emphasizes the unique advantages that the p-Laplacian brings, a theme also echoed in the works by [33] and [8]. Both studies incorporated the \(L_1\)-norm as a regularization term in robust principal component analysis (RPCA), showcasing its ability to recover low-rank matrices even in the presence of significant noise. Furthermore, the study by [35] reinforces the benefits of the \(L_1\)-norm in preserving discontinuity and enhancing local smoothness. Characterized by its heavy-tail property, the \(L_1\)-norm imposes less penalty on large values, thereby making it effective in handling data with substantial corruption or outliers.

Furthermore, in terms of potential practical implementations, our method can be deployed in various settings. For example, the p-Laplacian based regularizer can be used in image processing and computer vision applications, where it helps smooth out noisy or jagged images while preserving important features. In addition, since our graph learning method is implemented via an optimization (regularization) framework, other practical applications of GNNs (such as graph generation [26] and graph-based time series forecasting [49]) can also be handled by our proposed methods with greater flexibility.

4.4 Limitation and potential future studies

While our proposed model exhibits outstanding properties, how the p-Laplacian based regularization framework can be deployed on other types of graphs (e.g., dynamic or spatio-temporal graphs) remains to be explored. When the input graph is dynamic, one may require the corresponding GNN to capture the pattern of the graph's evolution (e.g., by combining a GNN with an LSTM or a Transformer [39]). We note that such a requirement is beyond the capability of the current model. However, as the p-Laplacian based regularizer restricts the solution of the graph framelet, it would be interesting to make the quantity p learnable throughout the training process. We leave this to future work.

Moreover, following the idea of making p a learnable parameter, one can also explore assigning different values of p to different frequency domains of the framelet. It has been verified that the low-frequency domain of the framelet induces a smoothing effect on the graph signal, whereas a sharpening effect is produced by the high-frequency domain [23]. Therefore, one may prefer to assign a relatively large value of p to the low-frequency domain to enhance the smoothing effect of the framelet when the input graph is highly homophilic, since identical labels should be predicted for connected nodes. On the other hand, a smaller value of p is preferred for the high-frequency domain to further sharpen the differences between node features (i.e., \(\nabla \textbf{F}\)) so that distinct labels can be produced when the input graph is heterophilic. Moreover, as the p-Laplacian-based implicit layers are allocated before the framelet reconstruction, it would be interesting to explore how such implicit layers affect the signal reconstruction of the framelet.
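As a speculative illustration of the band-specific learnable p discussed above (not part of our current model), one could re-parameterise p for each framelet frequency band so that it stays within an admissible range during training; the class and parameter names below are hypothetical.

```python
import torch
import torch.nn as nn

class BandwiseLearnableP(nn.Module):
    """Hypothetical learnable p per framelet frequency band, constrained to
    (p_min, p_max) through a sigmoid re-parameterisation."""

    def __init__(self, num_bands: int, p_min: float = 1.0, p_max: float = 2.5):
        super().__init__()
        self.p_min, self.p_max = p_min, p_max
        self.raw = nn.Parameter(torch.zeros(num_bands))  # one unconstrained scalar per band

    def forward(self) -> torch.Tensor:
        # Map the unconstrained parameters into the open interval (p_min, p_max)
        return self.p_min + (self.p_max - self.p_min) * torch.sigmoid(self.raw)

# Usage: calling the module returns a tensor of per-band p values that could be
# plugged into the p-Laplacian regularizer of each frequency domain.
```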

5 Experiments

In this section, we present empirical studies of pL-UFG and pL-fUFG on real-world node classification tasks with both heterophilic and homophilic graphs. We also test the robustness of the proposed models against noise. Both experiments are presented with detailed discussions of their results. In addition, we note that by adjusting the quantity of p in our proposed model, an ablation study is conducted implicitly in our experiments. The code for our experiments can be accessed via https://github.com/superca729/pL-UFG. Lastly, it is worth noting that our proposed method has the potential to be applied to graph learning tasks other than node classification, such as graph-level classification (pooling) [60] and link prediction [31]. Although we have yet to delve into these tasks, we believe that with some simple modifications, such as deploying a readout function for graph pooling (sketched after this paragraph) or computing the log-likelihood for link prediction, our method is capable of handling them. Similarly, our model could be beneficial in community detection tasks by identifying clusters with similar characteristics or behaviors. We leave these promising research directions to future work.
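For instance, a graph-level readout of the kind mentioned above could, assuming a standard batched-graph representation, be as simple as averaging node embeddings per graph; the sketch below is illustrative and not part of our released code.

```python
import torch

def mean_readout(node_embeddings: torch.Tensor, batch_index: torch.Tensor,
                 num_graphs: int) -> torch.Tensor:
    """Average node embeddings per graph in a batch (illustrative sketch).

    node_embeddings : (N, d) embeddings produced by the node-level model
    batch_index     : (N,)   graph id of each node
    """
    out = torch.zeros(num_graphs, node_embeddings.size(1))
    out.index_add_(0, batch_index, node_embeddings)          # sum embeddings per graph
    counts = torch.bincount(batch_index, minlength=num_graphs).clamp(min=1)
    return out / counts.unsqueeze(1).to(out.dtype)           # divide by node counts
```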

5.1 Datasets, baseline models and the parameter setting

Datasets. We use both homophilic and heterophilic graphs (datasets) from https://www.pyg.org/ to assess pL-UFG and pL-fUFG, including the benchmark heterophilic datasets Chameleon, Squirrel, Actor, Wisconsin, Texas, and Cornell, and the homophilic datasets Cora, CiteSeer, PubMed, Computers, Photos, CS, and Physics. In our experiments, homophilic graphs are undirected and heterophilic graphs are directed (we observe improved performance when direction information is provided). Summary statistics of the included datasets, together with their homophily index and split ratio, are provided in Table 1.

Baseline models. We consider eight baseline models for comparison:

  • MLP: standard feedforward multilayer perceptron.

  • GCN [28]: GCN is the first of its kind to implement a linear approximation to spectral graph convolutions.

  • SGC [50]: SGC reduces GCNs’ complexity by removing nonlinearities and collapsing weight matrices between consecutive layers.

  • GAT [48]: GAT is a graph neural network that applies the attention mechanism to node features to learn edge weights.

  • JKNet [55]: JKNet can flexibly leverage different neighbourhood ranges to enable better structure-aware representation for each node.

  • APPNP [19]: APPNP combines GNN with personalized PageRank to separate the neural network from the propagation scheme.

  • GPRGNN [12]: GPRGNN is an architecture that adaptively learns GPR (Generalized PageRank) weights so as to jointly optimize node feature and topological information extraction, regardless of the level of homophily of a graph.

  • p-GNN [18]: p-GNN is a p-Laplacian based GNN model whose message-passing mechanism is derived from a discrete regularization framework.

  • UFG [60]: UFG is a type of GNN based on framelet transforms; the framelet decomposition naturally aggregates the graph features into low-pass and high-pass spectra.

The test results are reproduced using code run on our machine and might therefore differ from the originally reported results. In addition, compared to p-GNN, the extra time and space complexity induced by our algorithm is \(\mathcal O(N^2(RJ+1)d)\).

Hyperparameter tuning. We use grid search to tune hyperparameters. We test \(p \in \{1.0, 1.5, 2.0, 2.5\}\) for p-GNN, pL-UFG and pL-fUFG. The learning rate is chosen from \(\{0.01, 0.005\}\). We consider the number of iterations T in p-Laplacian message passing from \(\{4, 5\}\) after 10 warming-up steps. For homophilic datasets, we tune \(\mu \in \{0.1, 0.5, 1, 5, 10\}\), and for heterophilic graphs \(\mu \in \{3, 5, 10, 20, 30, 50, 70\}\). The framelet type is fixed as Linear (see [56]) and the level J is set to 1. The dilation scale \(s \in \{1, 1.5, 2, 3, 6\}\), and n, the degree of the Chebyshev polynomial approximation to all g's in Eqs. (15)–(18), is drawn from \(\{2, 3, 7\}\). It is worth noting that in graph framelets, the Chebyshev polynomial is utilized to approximate the spectral filtering of the Laplacian eigenvalues. Thus, a d-degree polynomial approximates the d-hop neighbourhood information of each node. Therefore, when the input graph is heterophilic, one may require d to be relatively large, as node labels tend to differ between directly connected (1-hop) nodes. The number of epochs is set to 200, the same as for the baseline model [18].
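The grid described above can be enumerated straightforwardly; the following sketch (with a placeholder training call and the homophilic \(\mu\) range) illustrates the search loop and is not the exact script used in our experiments.

```python
from itertools import product

# Search ranges as listed above (homophilic mu range shown; swap in
# {3, 5, 10, 20, 30, 50, 70} for heterophilic graphs).
grid = {
    "p":  [1.0, 1.5, 2.0, 2.5],
    "lr": [0.01, 0.005],
    "T":  [4, 5],
    "mu": [0.1, 0.5, 1, 5, 10],
    "s":  [1, 1.5, 2, 3, 6],
    "n":  [2, 3, 7],
}

for values in product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    # train_and_evaluate(config)  # hypothetical training/evaluation routine
```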

Table 1 Statistics of the datasets; \(\mathcal H(\mathcal G)\) represents the level of homophily of each benchmark dataset

5.2 Experiment results and discussion

Node classification: Tables 2 and 3 summarize the results on homophilic and heterophilic datasets. The values after ± are standard deviations. The top three results are highlighted as First, Second, and Third. From the experimental results, we observe that pL-UFG and pL-fUFG achieve competitive performance against the baselines on most of the homophilic and heterophilic benchmarks. Regarding performance on homophilic undirected graphs, Table 2 shows that pL-UFG2\(^{1.5}\) has the top accuracy on Cora, Photos and CS. For CiteSeer, pL-UFG1\(^{2.0}\), pL-UFG2\(^{2.0}\) and pL-fUFG\(^{2.0}\) perform best. In terms of performance on heterophilic directed graphs (Table 3), both of our models achieve the highest accuracy on all benchmark datasets except Actor, where the top performance is obtained by MLP. However, we note that on Actor, our pL-UFG1\(^{1.0}\) achieves nearly identical results to MLP.

Aligned with our theoretical prediction in Remark 4, we find that the value of the trade-off term \(\mu\) for pL-UFG and pL-fUFG is significantly higher than that in p-GNN, indicating that the framelet has a major impact on the performance of the model. The performance of pL-UFG, pL-fUFG and the baselines on heterophilic graphs is shown in Table 3. Both pL-UFG and pL-fUFG outperform MLP and other state-of-the-art GNNs under low homophily. In terms of denoising capacity, pL-UFG and pL-fUFG are far better than the baseline models: Figs. 6 and 7 show that both pL-UFG and pL-fUFG produce the top accuracies across different noise levels.

Discussion for \({p < 2}\). To adapt better to heterophilic graphs, an ideal GNN model should allow a relatively high variation in the node features it generates. Therefore, compared to models with \(p=2\), a model with a \(p < 2\) regularizer imposes a smaller penalty and thus produces outputs with higher variation. This is supported by the empirical observation from Table 3 that the highest prediction accuracy for heterophilic graphs is usually achieved by our model with \(p < 2\).

Discussion for p = 1. Here we specifically focus on \(p=1\) in our models. Recall that when \(p=1\), the p-Laplacian operator acting on the graph signal \(\textbf{F}\) is \(\Delta _1 \textbf{F}= -\frac{1}{2}\text {div}\left( \frac{\nabla \textbf{F}}{\Vert \nabla \textbf{F}\Vert }\right)\). Based on Remark 1, when \(p=1\), \(\Delta _1 \textbf{F}\) is equivalent to the mean curvature operator defined on the embedded surface. In analogy to differential geometry, points at which curvature cannot be properly defined are called singular points. Thus one can similarly conclude that when a graph contains a singular node (i.e., the graph is crumpled [7]), both \(\Delta _1 \textbf{F}\) and the corresponding \(\mathcal S_p (\textbf{F})\) cannot be properly defined. Furthermore, one can easily check that, when \(p=1\), the regularization term \(\phi (\xi ) = \xi ^{p}\) produces a higher penalty than its \(\mathcal S_p(\textbf{F})\) counterpart. This is because, when \(p=1\), \(\mathcal S_p(\textbf{F}) = (\sum _i^c (\nabla \textbf{f}_i)^2)^{\frac{1}{2}}\), whereas \(\phi (\xi ) = \xi ^{1} = \sum _i \vert \nabla \textbf{f}_i\vert\). Clearly, \(\mathcal S_p(\textbf{F}) < \phi (\xi )\) unless all nodes of the graph are singular, or only one node is non-singular while the rest are singular.
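The inequality \(\mathcal S_1(\textbf{F}) \le \phi(\xi)\) used above is the familiar relation between the Euclidean and absolute-value norms of the channel-wise gradient; a quick numerical check, purely illustrative, is given below.

```python
import numpy as np

rng = np.random.default_rng(0)
grad_f = rng.normal(size=8)          # graph gradient of one node over c = 8 feature channels

s_1 = np.sqrt(np.sum(grad_f ** 2))   # S_1(F): Euclidean norm over channels
phi = np.sum(np.abs(grad_f))         # phi(xi) with p = 1: sum of absolute values

# ||v||_2 <= ||v||_1, with equality only when at most one entry is non-zero
assert s_1 <= phi
print(s_1, phi)
```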

5.3 Visualization on the effect of p

Based on the claims made in previous sections, increasing the value of p leads our model to exhibit a stronger smoothing effect on the node features, making it more suitable for homophilic graphs. Conversely, when p is small, the model preserves distinct node features, making it a better fit for heterophilic graphs. To validate this, we visualize the changes in relative positions (distances) between node features for \(p=1\) (the minimum value) and \(p=2.5\) (the maximum value). We selected the Cora and Actor datasets and employed Isomap [47] to map both the initial node features and the node features generated by our model to \(\mathbb R^2\) while preserving their distances. The results are depicted in Figs. 2 and 3.
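A minimal sketch of this visualisation step, using scikit-learn's Isomap (the neighbourhood size below is illustrative, not necessarily the value used for our figures):

```python
from sklearn.manifold import Isomap

def project_to_plane(features, n_neighbors=10):
    """Embed node features into R^2 while approximately preserving
    pairwise (geodesic) distances."""
    return Isomap(n_neighbors=n_neighbors, n_components=2).fit_transform(features)

# coords_raw   = project_to_plane(X)             # initial node features
# coords_model = project_to_plane(model_output)  # features produced by the model for a given p
```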

From both Figs. 2 and 3, it can be observed that when \(p=1\), the sharpening effect induced by our model causes the node features to become more distinct from each other: under the Isomap projection, the nodes are scattered apart compared to the input features. Conversely, when a relatively large value \(p=2.5\) is assigned, the model exhibits a stronger smoothing effect, resulting in all nodes being aggregated towards the center in the Isomap visualization. These observations directly support our main claim regarding the proposed model.

Fig. 2 Isomap visualization of the homophilic dataset (Cora) under varying p

Fig. 3 Isomap visualization of the heterophilic dataset (Actor) under varying p

Table 2 Test accuracy (%) on homophilic undirected graphs
Table 3 Test accuracy (%) on heterophilic directed graphs

5.4 Experiments on denoising capacity

Following the discussion in Sect. 4.1, in this section we evaluate the denoising capacity of our proposed models (pL-UFG and pL-fUFG) on both a homophilic graph (Cora) and a heterophilic graph (Chameleon). Since the node features of the included datasets are binary, we randomly assign binary noise to different proportions of the node features (i.e., \(r \in \{5\%, 10\%, 15\%, 20\%\}\)). From Figs. 6 and 7 we see that pL-UFG2\(^{1.5}\), defined in Eq. (16), outperforms the other baselines and shows the strongest robustness on the contaminated datasets. This is expected from our discussion of the denoising power in Sect. 4.1 and the formulation of our models (defined in Eqs. (15), (16) and (17)). The denoising capacity of our proposed model stems from two parts: the sparse approximation of the framelet decomposition and reconstruction, i.e., \(\mathcal W\) and \(\mathcal W^T\), and the variational regularizer \(\mathcal S_p(\textbf{F})\) [15]. Compared to pL-UFG1 defined in Eq. (15), pL-UFG2 applies the denoising operators \(\mathcal W\), \(\mathcal W^T\) and \(\mathcal S_{p}(\textbf{F})\) to node inputs that are reconstructed from both the sparsely decomposed low- and high-frequency domains, so that the denoising operators target the original graph inputs at different scales, naturally resulting in better denoising performance. In addition, without performing the reconstruction first, the term \(\Vert \textbf{F}_{r,j} - \text {diag}(\theta _{r,j}) \mathcal {W}_{r,j} \textbf{X}\Vert ^2_F\) in pL-fUFG is less sparse than its pL-UFG2 counterpart. Thus, when regularizing with the same \(\mathcal S_p(\textbf{F})\), the insufficiently eliminated noise leads pL-fUFG to a solution space that is incorrect (with respect to pL-UFG2) and cannot be recovered by the additional reconstruction using \(\mathcal W^T\); this is the likely reason why pL-UFG2 outperforms our other proposed models.
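A minimal sketch of one possible reading of the noise-injection step above (flipping a fraction r of the binary feature entries); the function and argument names are illustrative and not the released code.

```python
import numpy as np

def add_binary_noise(X, r, seed=0):
    """Flip a proportion r of the entries of a binary feature matrix X."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < r     # select roughly r * X.size entries
    return np.where(mask, 1 - X, X)    # flip 0 <-> 1 on the selected entries

# Example: contaminate 10% of the feature entries
# X_noisy = add_binary_noise(X, r=0.10)
```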

In addition to the above experiment, here we show how the quantity of p affects the denoising power of the proposed model while keeping all other parameters constant. The resulting changes in denoising power are shown in Fig. 4 (homophilic graph) and Fig. 5 (heterophilic graph).

Fig. 4 Changes of the denoising power with the quantity of p on the homophilic graph

Fig. 5 Changes of the denoising power with the quantity of p on the heterophilic graph

Based on these results, the performance of different values of p varies with the noise rate. For the homophilic graph, the performance of pL-UFG2\(^{1.5}\) and pL-UFG2\(^{2.0}\) is relatively stable across noise rates, while the performance of other values of p fluctuates more. For the heterophilic graph, the performance of pL-UFG1\(^{2.5}\) and pL-UFG2\(^{2.5}\) is relatively stable across noise rates, while other values of p fluctuate more.

We also note that these fluctuations are expected, as we have shown that the denoising process of our proposed model can be expressed as:

$$\begin{aligned} \textbf{F}= \mathop {\mathrm {arg\,min}}\limits _{\textbf{F}} \sum _{(r,\ell )\in \mathcal Z} \Vert \nabla \textbf{F}_{(r,\ell )}\Vert _p^p + \mu \Vert \textbf{F} - \mathcal {W}^T\text {diag}(\theta )\mathcal {W}\textbf{X}\Vert ^2_F, \end{aligned}$$
(34)

where \(\mathcal Z = \{ (r,\ell ): r = 1,\dots ,R,\ \ell = 1,\dots ,J \} \cup \{ (0,J)\}\) is the set of indices of all framelet-decomposed frequency domains. It is not hard to verify that once a larger quantity of p is assigned, the penalty on the node feature difference (\(\nabla \textbf{F}\)) becomes greater, and therefore a stronger denoising effect is induced. In terms of graph types, when the input graph is heterophilic, connected nodes tend to have distinct features in most cases; after noise is added to those features, if the feature difference becomes larger, a higher quantity of p is preferred. Meanwhile, adding noise to a heterophilic graph can also make the feature difference smaller, in which case a large quantity of p may not be appropriate; this explains why most of the lines in Fig. 5 show large fluctuations. Similar reasoning applies to the denoising performance of our model on homophilic graphs, which we omit here.

5.5 Regarding the ablation study

It is worth noting that one advantage of assigning an adjustable p-Laplacian-based regularizer is the convenience of conducting an ablation study. The key principle of an ablation study is to test whether a new component (in our case the regularizer) in a proposed method consistently adds advantages over the baseline counterparts regardless of the newly involved parameters. This suggests that the ablation study of our method is naturally conducted by comparing our models (with the regularizers) against the underlying networks via both the node classification tasks and the denoising power test (Figs. 6 and 7). It is not hard to observe that in most cases, regardless of the change of p, our proposed model keeps outperforming the baseline models. However, as our proposed model is derived from a regularization framework, another parameter that could affect model performance is the quantity of \(\mu\). Based on Eq. (22), an increase in \(\mu\) leads to an increase in the model's convexity and, as a consequence, can guide the model closer to the global optimum when \(p \ne 2\). Another observation regarding \(\mu\) is that a smaller value of \(\mu\) together with a relatively larger value of p is more suitable for homophilic/heterophilic graphs, implying that \(\mu\) seems to have an opposite effect on the model's adaptation power compared to p. However, quantifying the effect of \(\mu\) via a suitable measure (e.g., the Dirichlet energy [23, 45]) is out of the scope of this paper, and we leave it to future discussions.

Fig. 6 Denoising power on the homophilic graph (Cora)

Fig. 7 Denoising power on the heterophilic graph (Chameleon)

6 Conclusion and further work

This paper showcases the application of p-Laplacian based Graph Neural Networks (GNNs) in conjunction with graph framelets. The incorporation of p-Laplacian regularization brings remarkable flexibility, enabling effective adaptation to both homophilic undirected and heterophilic directed graphs and thereby significantly enhancing the predictive capabilities of the model. To validate the efficacy of our proposed model, we conducted extensive numerical experiments on diverse graph datasets, demonstrating its superiority over baseline methods. Notably, our model exhibits robustness against noise perturbation, even under high noise levels. These promising findings highlight the potential of our approach and warrant further investigation in several directions. For instance, delving into the mathematical properties of our model, including weak and strong convergence, analyzing the behavior of the (Dirichlet) energy, and establishing connections with non-linear diffusion equations, opens up fascinating avenues for future research.