1 Introduction

As the branch that best reflects the intelligence in the field of artificial intelligence, machine learning has attracted considerable attention in the past few decades (Michalski and Anderson 1984). Machine learning has achieved a huge success in a variety of tasks, especially in supervised learning tasks such as classification and regression. Most successful supervised learning, such as supervised deep learning (Lecun et al. 2015), requires sufficient labeled samples.

However, in many practical tasks, it is hard to obtain sufficient labeled samples because the labeling process is too costly, while a large number of unlabeled samples can be easily obtained. Although these unlabeled samples are unable to provide clear supervision information, they contain important information about the data distribution. Therefore, unlabeled samples can help improve the performance of the learner. It is the motivation and the ultimate goal of semi-supervised learning to enhance the generalization ability of learners by using a large number of inexpensive unlabeled samples. For this reason, semi-supervised learning has received much attention in the past few decades (Chapelle et al. 2006; Zhu and Goldberg 2009; Triguero et al. 2015; Van Engelen and Hoos 2020). Various types of semi-supervised learning methods have been proposed, forming four important semi-supervised learning paradigms: the generative semi-supervised learning method (Shahshahani and Landgrebe 1994; Cozman and Cohen 2002), the co-training style semi-supervised learning method (or the disagreement-based semi-supervised learning method) (Blum and Mitchell 1998; Wang and Zhou 2010), the semi-supervised SVM method (Joachims 1999; Chapelle et al. 2008) and the graph-based semi-supervised learning method (Zhu et al. 2003; Zhou et al. 2003). Meanwhile, semi-supervised learning has also been extensively studied in other fields such as regression (Zhou and Li 2005), clustering (Wagstaff et al. 2001; Basu et al. 2002; Zeng and Cheung 2012), dimensionality reduction (Zhang et al. 2007) and feature selection (Sheikhpour et al. 2017; Sechidis and Brown 2018), etc.

In recent years, deep learning (Lecun et al. 2015) has also made great progress in the semi-supervised learning field, just as it has in the supervised learning field. On the one hand, neural network methods have been applied to three of the semi-supervised learning paradigms: the generative method (Kingma et al. 2014; Dai et al. 2017; Li et al. 2017), the co-training style method (Chen et al. 2018) and the graph-based method (Kipf and Welling 2017; Li et al. 2018; Jiang et al. 2019). On the other hand, the methodology of semi-supervised learning, using unlabeled samples to enhance the generalization performance of the learner, has also been applied to deep learning to train deep neural networks (Li et al. 2018; Weston et al. 2008; Lee 2013) or design new neural networks (Rasmus et al. 2015; Park et al. 2018; Berthelot et al. 2019). For more details, we refer readers to the recent survey article (Van Engelen and Hoos 2020). These methods not only enrich the semi-supervised learning field but also improve the performance of semi-supervised learning in related tasks.

Graph-based semi-supervised learning (GSSL) is an important semi-supervised learning paradigm, and its core assumption is that similar samples on the graph should possess the same label. Due to its flexibility (various relationships between samples can be captured by constructing a specific graph), high interpretability and good generalization performance, many methods within this framework have been proposed and have achieved some success (Zhu et al. 2003; Zhou et al. 2003; Belkin et al. 2006; Subramanya and Bilmes 2011). Moreover, it is still an active research area in semi-supervised learning (Berton et al. 2017; Rustamov and Klosowski 2018). The study of GSSL methods can be divided into two aspects:

  (1) Label inference

The label inference in GSSL mainly focuses on how to carry out label learning based on the supervised information provided by labeled samples and the similarities between samples provided by the graph. There are many successful methods, such as the traditional semi-supervised learning method using Gaussian fields and harmonic functions (Zhu et al. 2003), the graph label propagation method (Zhou et al. 2003), the LapSVM and LapRLS method based on manifold regularization (Belkin et al. 2006) and the class probability distribution measure propagation on the graph (Subramanya and Bilmes 2011).

  (2) Graph construction

For the GSSL method, it is critical and very difficult to construct high-quality graphs (De Sousa et al. 2013).

Recent research indicates that the key to the success of the GSSL is to construct high-quality graphs rather than design better label inference algorithms (Jebara et al. 2009; Berton and de Andrade Lopes 2014; Li et al. 2016). The GSSL’s core assumption is that similar samples on the graph should share the same class label. According to this criterion, if the similarities between samples on the graph are consistent with their true class labels, the class labels of the unlabeled samples can be correctly predicted through the smoothness constraint on the graph. Conversely, if the similarities on the graph are contrary to the true class labels of the samples, the unlabeled samples will be given the wrong class labels by the label inference. Related experimental results of such cases are found in the literature (Belkin and Niyogi 2008; Karlen et al. 2008). In these cases, the utilization of unlabeled samples will lead to a negative effect: deteriorating the performance of the learner, an occurrence known as the unsafe phenomenon in semi-supervised learning (Li et al. 2016; Li and Zhou 2015; Wei et al. 2018). Therefore, the quality of the graph is extremely important for the performance of the GSSL method.

However, it is extremely difficult to construct a high-quality graph in GSSL because there is no operational metric for evaluating the quality of the graph. The quality of the graph can only be evaluated indirectly by the classification accuracy of the result of label inference on the graph, which is a post-mortem verification and cannot provide any guidance for graph construction. This is why constructing a high-quality graph is difficult in GSSL. Fortunately, this difficulty provides us with an insight: why not let graph construction and label inference guide each other to achieve their common improvement? Motivated by this insight, in this paper we integrate graph construction and label inference into a unified optimization model.

Before describing the details of the proposed method, we need to briefly review the existing graph construction methods in GSSL. The basic task of the graph construction is measuring the similarities between samples. According to how the similarities between samples are computed, the existing graph construction methods in GSSL can be roughly divided into two categories:

  (2.1) Distance metric-based methods

The distance metric-based graph construction methods measure the similarities between samples by computing a certain distance between them; intuitively, a pair of samples with a smaller distance should have a higher similarity. Among such methods, the most commonly used graph construction methods include the kNN graph and the \(\epsilon \)-ball neighborhood graph (Zhu et al. 2003) based on the Euclidean distance. Meanwhile, the Gaussian kernel weighting method is also popular for graph construction in GSSL. In some cases, the degrees of the nodes on the kNN graph vary greatly, which deteriorates the quality of the graph. To solve this problem, the b-matching graph, in which the degree of each node is constrained to be b, was used in GSSL (Jebara et al. 2009).

Instead of using the Euclidean distance, the graph construction method based on the manifold hypothesis measures the similarities between samples through the geodesic distance. The key to this kind of method is how to compute the geodesic distances between samples accurately. The classic method uses the length of the shortest path on the Euclidean distance-based kNN graph to approximate the geodesic distance (Tenenbaum et al. 2000). Nevertheless, this approximation suffers from the “short circuit” and “open circuit” problems over manifolds due to the inherent defects of the kNN graph. To relieve the “short circuit” problem over manifolds, a method for detecting and correcting the weight of the “short circuit” edge was proposed in Ghazvininejad et al. (2011) and was used to better compute the geodesic distances between samples.

In addition, some studies note that the valuable supervision information provided by the labeled samples should also be used for the graph construction in GSSL. In Berton and de Andrade Lopes (2014), the graph construction based on informativeness of labeled instances (GBILI) method was proposed, in which the distances between samples and the sum of distances between each sample and all labeled samples are jointly considered to guide the edge generation on the graph. The GBILI method makes the labeled nodes tend to connect more edges so that the label information can spread to the unlabeled samples effectively. To further improve the robustness of the GBILI method, the literature (Berton et al. 2017) proposed a robust graph construction method considering the label information and proved that the graph constructed by this method is the optimal graph for modeling the smoothness hypothesis under certain conditions.

The basic principle of the distance metric-based graph construction methods is that a pair of samples with a smaller distance should have a higher similarity, where the distance metric needs to be chosen in advance. If the distance metric is chosen inappropriately, the corresponding graph will not correctly reflect the similarities between samples, resulting in performance deterioration of the subsequent GSSL. At the same time, the quality of the graph is also affected significantly by the choice of parameters (such as the number of neighbors and the distance threshold), which also affects the performance of the GSSL. Furthermore, once the distance metric and parameters are selected, the corresponding graph is fixed; thus, it is unable to deal with various data distributions adaptively, which makes it difficult to guarantee the performance of the GSSL.

  (2.2) Data representation-based methods

The data representation-based graph construction methods measure the similarities between samples by the representation coefficients between samples that are obtained by solving a certain data representation model. The literature (Wang and Zhang 2008) reconstructed a sample using a convex combination of the sample’s k nearest neighbors, and designed the linear neighborhood propagation (LNP) algorithm to propagate the labels on the graph. Inspired by the strong discriminating power of sparse representation (Wright et al. 2009), the \(\ell _{1}\) graph (Yan and Wang 2009) was constructed by using the absolute values of the linear representation coefficients learned by sparse representation to measure the similarities between samples. After that, the literature (Cheng et al. 2010) added a nonnegative constraint on the sparse representation coefficients to better measure the similarity. In this literature, the representation coefficient matrix can be regarded as a graph for spectral clustering, subspace learning and GSSL. Since the three methods mentioned above all optimize each sample’s representation coefficients individually, the representation coefficient matrix cannot capture the global information of the data distribution.

Inspired by the low-rank representation of data (Liu et al. 2013), the literature (Zhuang et al. 2011) proposed a method that implements sparse and low-rank representation learning simultaneously. In this method, the \(\ell _{1}\) norm and nuclear norm regularization terms are both applied to the representation coefficient matrix of all samples to capture the local and global structure simultaneously. When the data representation coefficients are obtained, the absolute values of the representation coefficients are used to measure the similarities between samples. Similar to the distance metric-based graph construction methods, the supervision information is also applied in the data representation-based graph construction methods. In the literature (Zhuang et al. 2017), semi-supervised low-rank representation (SSLRR) was proposed to construct a graph for GSSL, in which the representation coefficients between two labeled samples with different class labels are constrained to be 0.

The data representation-based methods can learn the adjacency structure and edge weights of the graph simultaneously and are robust to noisy data. However, this kind of method is unable to reveal the data distribution correctly when the data distribution does not satisfy the subspace hypothesis, making it impossible to guarantee the performance of the subsequent GSSL.

It can be seen from the above analysis that the distance metric-based and the data representation-based graph construction methods heavily depend on their corresponding assumptions. If the assumption is incorrect, the two kinds of methods mentioned above will be unable to correctly capture the similarity that is consistent with the data distribution, which could result in the deterioration of the performance of the subsequent GSSL. However, the data distribution is complex and varies from data to data in practice, so it is hard to measure the true similarities between samples adaptively by using a specific assumption. Thus, to build a high-quality graph, it is necessary to propose a graph construction method that can alleviate the issues caused by the complex and various data distribution and can discover the potential data distribution adaptively.

Considering the above requirements, in this paper, we turn to the domain of the clustering ensemble to find a solution. In the field of clustering, the clustering ensemble is a popular way to improve the quality and robustness of the final clustering results. By integrating multiple clustering results into a final clustering result, the clustering ensemble can obtain a stronger (Bai et al. 2018) or a more robust clustering result (Zhao et al. 2017). Among a large number of clustering ensemble methods, the similarity-based method, fusing many base clusterings to construct a sample similarity matrix (Fred and Jain 2005), is flexible and effective. The reason why this method is effective is that different types of base clusterings can capture different types of data distribution. Furthermore, by fusing multiple different clustering results, a robust similarity measure can be obtained for the subsequent clustering, which improves the quality and robustness of the final clustering result.

Inspired by this idea, we propose a graph construction method that measures the similarities between samples by a weighted fusion of multiple clustering results. First, different clustering algorithms with different settings are run to obtain multiple clustering results, which can capture complex and various data distributions. Then, the weighted fusion of these clustering results is used for graph construction, and the weights of the multiple clustering results are adjusted dynamically according to the “must link” and “cannot link” constraints provided by the labeled samples and the result of label inference on the graph. During the learning process, the dynamic graph construction and the label inference on the graph are optimized alternately, which achieves the dynamic improvement of the graph quality. In summary, the contributions of the proposal in this paper include the following three points:

  1. The weighted fusion of multiple clustering results is used to construct the graph in GSSL. By fusing multiple different clustering results, this method can alleviate the issues caused by complex and various data distributions effectively.

  2. The graph construction and label inference on the graph are integrated into a unified optimization model, which realizes the mutual guidance between these two processes. In the optimization model, the supervision information and the iterative intermediate results are rationally utilized, which dynamically improves the quality of the graph during the learning process.

  3. The proposed method is a general framework and many existing GSSL methods can be embedded into this framework to improve their performance.

The rest of this paper is organized as follows. Section 2 introduces some basic notions and related works, including the general framework of GSSL and some representative graph construction methods. Section 3 provides the description of the GSSL-IQGD framework proposed in this paper. In Sect. 4, we explain why the proposal is effective by using three toy examples on artificial data sets and verify its effectiveness by comparing it with other classic GSSL methods on ten benchmark data sets. The conclusion and further work prospects are given in Sect. 5.

2 Notations and related works

2.1 Formalization of the problem

For a given semi-supervised classification task, let \(D_{l}=\{(\bf{x}_{i},y_{i})\}_{i=1}^{l}\) denote the l labeled samples and \(D_{u}=\{\bf{x}_{j}\}_{j=l+1}^{n}\) denote the u unlabeled samples, where \(n=l+u\) and \(\bf{x}_{i}\in {{\mathbb {R}}}^{d}\) is the d dimension description for the ith sample, and \(y_{i}\in \{1,2,\cdots ,c\}\) is the class label of the ith labeled sample, and c is the number of categories.

For convenience of discussion, the class label of the sample is described in the form of a matrix. Let \(\bf{F}\in \{0,1\}^{n \times c}\) be the label matrix, where

$$\begin{aligned} f_{ik}=\left\{ \begin{array}{ll} 1, &{\text{ if }} (1\le i \le l) \wedge (y_{i}=k)\\ 0, &{} {\text{ otherwise }} \end{array} \right. \end{aligned}$$
(1)

and let \({{\bf{Z}}}\in {{\mathbb {R}}}^{n \times c}\) be the predicting label matrix, where \(z_{ik}\) represents the membership degree of the ith sample to the kth category. In the rest of this article, let \({\bf{f}}_{i}\) and \({\bf{Z}}_{i}\) be the ith row of the matrix \(\bf{F}\) and \({\bf{Z}}\), respectively. More notations are included in Table 1.

Table 1 Definition of main notations

2.2 Graph smoothness term and label inference of GSSL

For a semi-supervised classification task described in Sect. 2.1, the GSSL method first converts the data into a graph \(G=(V,E,\bf{W})\), where \(V=\{v_{i}\}_{i=1}^{n}\) is the vertices set, and the vertex \(v_{i}\) corresponds to the sample \(\bf{x}_{i}\). Additionally, E is the edges set. The nonnegative matrix \(\bf{W}\) represents the weight of each edge in E and \(w_{ij}=0\) means there is no edge between vertex \(v_{i}\) and \(v_{j}\). For an undirected graph, we have \(\bf{W}=\bf{W}^{T}\).

The graph smoothness term is an important component of the GSSL method. In general, the smoothness loss term on the graph can be written as:

$$\begin{aligned} L_{\rm{smooth}}({\bf{Z}})=\frac{1}{2}\sum _{i=1}^{n}\sum _{j=1}^{n}w_{ij}d({\bf{Z}}_{i},{\bf{Z}}_{j}) , \end{aligned}$$
(2)

where \(d(\cdot , \cdot )\) is a certain distance or dissimilarity metric, \(d({\bf{Z}}_{i},{\bf{Z}}_{j})\) measures the difference between the prediction results of sample \(\bf{x}_{i}\) and \(\bf{x}_{j}\), and \(w_{ij}\) reflects the similarities between them. The effect of minimizing Eq. (2) can be explained as follows: the more similar the samples are on the graph, the closer their prediction labels should be. Based on the smoothness assumption, GSSL learns labels for unlabeled samples.
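
For concreteness, a minimal NumPy sketch of the smoothness term (2) is given below, assuming the squared Euclidean distance for \(d(\cdot ,\cdot )\) and a symmetric weight matrix; under that assumption the term equals \({\text{ tr }}({\bf{Z}}^{T}\bf{LZ})\) with \(\bf{L}=\bf{D}-\bf{W}\), the identity used later in Sect. 3.4.1. The function names are illustrative only.

```python
import numpy as np

def smoothness_loss(W, Z):
    """Eq. (2) with d(Z_i, Z_j) = ||Z_i - Z_j||_2^2; for symmetric W this
    equals tr(Z^T L Z), where L = D - W is the unnormalized graph Laplacian."""
    L = np.diag(W.sum(axis=1)) - W
    return np.trace(Z.T @ L @ Z)

def smoothness_loss_direct(W, Z):
    """The same quantity evaluated directly from the double sum in Eq. (2)."""
    n = W.shape[0]
    return 0.5 * sum(W[i, j] * np.sum((Z[i] - Z[j]) ** 2)
                     for i in range(n) for j in range(n))
```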

In general, the GSSL framework (Zhou et al. 2003) can be written as:

$$\begin{aligned} {\mathop {\min }\limits _{{\bf{Z}}}}L({\bf{Z}})=\gamma _{\rm{fit}}L_{\rm{fit}}({\bf{Z}})+\gamma _{\rm{smooth}}L_{\rm{smooth}}({\bf{Z}}) , \end{aligned}$$
(3)

where \(L_{\rm{fit}}({\bf{Z}})\) is the fitting loss on the known class labels, and \(L_{\rm{smooth}}({\bf{Z}})\) is the smoothness loss on the graph. \(\gamma _{\rm{fit}}\) and \(\gamma _{\rm{smooth}}\) are two hyper-parameters trading off the fitting loss and smoothness loss, respectively.

Accordingly, the prediction function for the unlabeled sample \(\bf{x}_{j}\) is:

$$\begin{aligned} y_{j}={\mathop {\arg \;\max }\limits _{k=1,2,\cdots ,c}} \, z_{jk}. \end{aligned}$$
(4)

It should be noted that, in essence, the model represented by (3) is the same as the regularization framework in Zhou et al. (2003). Most classical semi-supervised learning methods, such as the Harmonic (Zhu et al. 2003), LLGC (Zhou et al. 2003), LapRLS (Belkin et al. 2006), LapSVM (Belkin et al. 2006) and measure propagation (Subramanya and Bilmes 2011), can be described by this framework.

2.3 Semi-supervised classification with graph convolutional networks

Among all deep neural network based semi-supervised learning methods, the SSC-GCN (semi-supervised classification with graph convolutional networks) proposed in 2017 (Kipf and Welling 2017) and its extensions (Li et al. 2018; Jiang et al. 2019) are closest to the GSSL. In addition to the input data described in Sect. 2.1, the graph \(G=(V,E,\bf{W})\) is also given in advance in the SSC-GCN.

In this method, first, the symmetrically normalized matrix \(\hat{\bf{W}}\) is computed as:

$$\begin{aligned} \hat{\bf{W}}=\bar{\bf{D}}^{-1/2}\bar{\bf{W}}\bar{\bf{D}}^{-1/2}, \end{aligned}$$
(5)

where \(\bar{\bf{W}}=\bf{W}+\bf{I}\) and \( \bar{\bf{D}}=diag \left( \bar{d}_1, \bar{d}_2, \cdots , \bar{d}_n \right) \) is the degree matrix with \(\bar{d}_i=\sum _{j=1}^n \bar{w}_{ij}, i=1,2,\cdots ,n\). Then, the spatial-based graph convolution is applied to the output of each layer of the neural network to obtain smooth hidden representations, i.e., the hidden representations of similar samples on the graph are close. Finally, the two-layer SSC-GCN model used in the literature (Kipf and Welling 2017) is expressed in the following form:

$$\begin{aligned} {\bf{Z}}= {\text{ softmax }} \left( \hat{\bf{W}} {\text{ ReLU }} \left( \hat{\bf{W}} \bf{X} \bf{W}^{(0)} \right) \bf{W}^{(1)} \right) , \end{aligned}$$
(6)

where \(\bf{X}=\left( \bf{x}_1^T, \bf{x}_2^T, \cdots , \bf{x}_n^T \right) ^T \in {\mathbb {R}}^{n \times d}\) is the matrix arranged by the description vectors of n samples, \(\bf{W}^{(0)} \in {\mathbb {R}}^{d \times h}\) and \(\bf{W}^{(1)} \in {\mathbb {R}}^{h \times c}\) are parameters of the graph convolution network, h is the number of hidden neural units, \( {\text{ ReLU }}(\cdot )=\max (\cdot ,0) \) is the nonlinear activation function, and

$$\begin{aligned} \forall \bf{u} \in {\mathbb {R}}^{c},\; {\text{ softmax }}\left( \bf{u}\right) =\left( \frac{\exp (u_1) }{\sum _{k=1}^c \exp (u_k)}, \frac{\exp (u_2) }{\sum _{k=1}^c \exp (u_k)}, \cdots , \frac{\exp (u_c) }{\sum _{k=1}^c \exp (u_k)} \right) \end{aligned}$$

is applied to the final output of the network, so that the ith row of the predicting label matrix \({\bf{Z}}\) corresponds to the membership degrees of the ith sample to all the categories. After obtaining the predicting label matrix \({\bf{Z}}\), the fitting loss

$$\begin{aligned} L_{\rm{fit}}({\bf{Z}})=-\sum _{i=1}^{l} \sum _{k=1}^{c} f_{ik} \log z_{ik} \end{aligned}$$
(7)

is evaluated over l labeled samples, then parameters \(\bf{W}^{(0)}\) and \(\bf{W}^{(1)}\) are trained by gradient descent.
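
The inference step of Eqs. (5) and (6) can be sketched in a few lines of NumPy, as shown below. This is only an illustrative reconstruction of the forward pass; the training of \(\bf{W}^{(0)}\) and \(\bf{W}^{(1)}\) by gradient descent on the loss (7) is omitted, and the function names are not taken from the original implementation.

```python
import numpy as np

def normalize_adjacency(W):
    """Eq. (5): add self-loops and symmetrically normalize the weight matrix."""
    W_bar = W + np.eye(W.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(W_bar.sum(axis=1))
    return (W_bar * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

def row_softmax(U):
    """Row-wise softmax applied to the final network output."""
    E = np.exp(U - U.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def ssc_gcn_forward(W, X, W0, W1):
    """Eq. (6): two-layer SSC-GCN forward pass producing the label matrix Z (n x c)."""
    W_hat = normalize_adjacency(W)
    H = np.maximum(W_hat @ X @ W0, 0.0)   # ReLU hidden representation, shape (n, h)
    return row_softmax(W_hat @ H @ W1)
```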

In this paper, we focus on how to construct a better graph to improve the performance of GSSL. Although the graph is given in advance in the SSC-GCN method, this method can still be easily embedded in the proposed framework and its performance can be improved.

2.4 Graph construction method

The study of the GSSL method is mainly divided into two parts: graph construction and label inference. Recent research shows that the key to the success of the GSSL method is constructing a high-quality graph instead of designing a better label inference algorithm (Berton et al. 2017; De Sousa et al. 2013; Jebara et al. 2009; Zhuang et al. 2017). Therefore, this paper focuses on the graph construction method. The classical graph construction methods used in GSSL are briefly reviewed as follows.

2.4.1 Nearest neighbor graph

The “0–1” kNN graph and the weighted kNN graph are the most commonly used methods in GSSL (Zhu et al. 2003; Zhou et al. 2003). Formally, the entry of the edge weight matrix \(\bf{W}\) is defined as

$$\begin{aligned} w_{ij}=\left\{ \begin{array}{ll} 1, & \bf{x}_{i}\in kNN(\bf{x}_{j})\\ 0,& {\text{ otherwise} } \end{array}\right. \end{aligned}$$
(8)

or

$$\begin{aligned} w_{ij}=\left\{ \begin{array}{ll} e^{\frac{-\Vert \bf{x}_{i}-\bf{x}_{j}\Vert _{2}^{2}}{2\delta ^{2}}}, & \bf{x}_{i}\in kNN(\bf{x}_{j})\\ 0,& {\text{ otherwise} } \end{array}\right. . \end{aligned}$$
(9)

Apart from these forms, there are some other kinds of the nearest neighbor graph that are also frequently used in GSSL, such as the \(\varepsilon \)-ball nearest neighbor graph, the mutual kNN graph and so on.
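
As an illustration, the Gaussian-weighted kNN graph of Eq. (9) can be built as in the following NumPy sketch (replacing the exponential weight with 1 recovers the “0–1” graph of Eq. (8)); the symmetrization step at the end is a common practical choice rather than part of the definitions above.

```python
import numpy as np

def gaussian_knn_graph(X, k=10, delta=1.0):
    """Weighted kNN graph of Eq. (9): w_ij = exp(-||x_i - x_j||^2 / (2 delta^2))
    if x_i is among the k nearest neighbors of x_j, and 0 otherwise."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared Euclidean distances
    np.fill_diagonal(D2, np.inf)                      # exclude self-edges
    W = np.zeros((n, n))
    for j in range(n):
        nn = np.argsort(D2[:, j])[:k]                 # indices of kNN(x_j)
        W[nn, j] = np.exp(-D2[nn, j] / (2.0 * delta ** 2))
    return np.maximum(W, W.T)                         # symmetrize so that W = W^T
```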

2.4.2 \(b\)-matching graph

To avoid the situation where the degrees of some vertices in the kNN graph are very large while those of others are very small, the b-matching graph was proposed in the literature (Jebara et al. 2009), where the degree of each vertex is constrained to b. The corresponding optimization problem is as follows.

$$\begin{aligned}\underset{\bf{W}}{{\text { min}}} \;&\sum _{i=1}^{n}\sum _{j=1}^{n}w_{ij}d(\bf{x}_{i},\bf{x}_{j})\\ s.t.\;&w_{ij}\in \{0,1\},\;w_{ij}=w_{ji}, \; i,j=1,2,\cdots ,n\\&\sum _{j=1}^{n}w_{ij}=b, \;w_{ii}=0, \; i=1,2,\cdots ,n \end{aligned} $$
(10)

where \(d(\bf{x}_{i},\bf{x}_{j})\) is the distance between sample \(\bf{x}_{i}\) and \(\bf{x}_{j}\). This optimization problem can be solved efficiently by the loopy belief propagation algorithm (Huang and Jebara 2007).

2.4.3 Linear neighbor graph

Unlike the methods that directly use the distances between samples to measure the similarity, the literature (Wang and Zhang 2008) proposed a linear representation-based similarity measure. In detail, for the sample \(\bf{x}_{i}\), its k nearest neighbors \(kNN(\bf{x}_{i})\) are computed, and it is reconstructed by a convex combination of them. The combination coefficients are used as the weights of the edges connected to vertex \(v_{i}\):

$$\begin{aligned} \underset{\bf{W}}{min}\;&\sum _{i=1}^{n} \left\| \bf{x}_{i}-\sum _{\bf{x}_{j} \in kNN(\bf{x}_{i})} w_{ij} \bf{x}_{j} \right\| _{2}^{2}\\ s.t.\;&w_{ij}\ge 0,\;i,j=1,2,\cdots ,n\\&\sum _{j=1}^{n}w_{ij}=1,\; i=1,2,\cdots ,n\\&w_{ij}=0,\;i=1,2,\cdots ,n,\;\bf{x}_{j} \notin kNN(\bf{x}_{i}). \end{aligned}$$
(11)
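
A simple way to approximate problem (11) sample by sample is sketched below. It uses nonnegative least squares with the sum-to-one constraint enforced softly through an appended, heavily weighted row of ones followed by a renormalization; this is an approximation for illustration only, not the solver used in Wang and Zhang (2008).

```python
import numpy as np
from scipy.optimize import nnls

def linear_neighbor_graph(X, k=10, penalty=1e3):
    """Approximate solution of problem (11): reconstruct each sample as a convex
    combination of its k nearest neighbors and use the coefficients as edge weights."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        nn = np.argsort(d)[:k]                                  # indices of kNN(x_i)
        A = np.vstack([X[nn].T, penalty * np.ones((1, k))])     # soft sum-to-one row
        b = np.concatenate([X[i], [penalty]])
        w, _ = nnls(A, b)                                       # nonnegative coefficients
        if w.sum() > 0:
            w /= w.sum()                                        # renormalize to the simplex
        W[i, nn] = w
    return W
```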

2.4.4 \(\ell _{1}\) graph

To mine the subspace structure of the data, the literature (Cheng et al. 2010) used the \(\ell _{1}\) graph to learn the adjacency structure and the edge weights simultaneously. The similarities between samples are measured by the absolute values of the linear combination coefficients learned by the sparse representation. In detail, the ith sample is reconstructed by the remaining \(n-1\) samples:

$$\begin{aligned} \underset{\varvec{\alpha }}{{\text { min}}}\; \Vert \varvec{\alpha }\Vert _{1} \;\;s.t.\;\left( \bf{X}_{\bar{i}},\bf{I} \right) \varvec{\alpha }=\bf{x}_{i} \end{aligned}$$
(12)

where \(\bf{X}_{\bar{i}}=\left( \bf{x}_{1},\bf{x}_{2},\cdots ,\bf{x}_{i-1},\bf{x}_{i+1},\cdots ,\bf{x}_{n}\right) \in {\mathbb {R}}^{d \times (n-1)}\) is a matrix of all samples except the ith sample. In addition, the identity matrix \(\bf{I} \in {\mathbb {R}}^{d \times d}\) is used as the basis for reconstructing the noise on \(\bf{x}_{i}\), which can improve the robustness of the model. Actually, the combination coefficient vector can be split into two segments:

$$\begin{aligned} \varvec{\alpha }=\left( \begin{array}{l} \varvec{\alpha }_{samp}\\ \varvec{\alpha }_{noise} \end{array} \right) \in {\mathbb {R}}^{(n-1)+d}. \end{aligned}$$
(13)

Accordingly, we have \(\bf{x}_{i} =\bf{X}_{\bar{i}}\varvec{\alpha }_{samp}+ \bf{I}\varvec{\alpha }_{noise} =\bf{X}_{\bar{i}}\varvec{\alpha }_{samp}+ \varvec{\alpha }_{noise}\); that is, the segment \(\varvec{\alpha }_{samp} \in {\mathbb {R}}^{n-1}\) contains the coefficients for reconstructing the ith sample with the rest of the samples. Based on the above analysis, let \(\varvec{\alpha }^{*}\) be the optimal solution of problem (12); then, the weights of the edges connected to vertex \(v_{i}\) are calculated by

$$\begin{aligned} w_{ij}= \left\{ \begin{array}{ll} |\alpha _{j}^{*}|, & j<i\\ 0, & j=i\\ |\alpha _{j-1}^{*}|, & j>i \end{array} \right. . \end{aligned}$$
(14)
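
The sketch below illustrates the construction of Eqs. (12)–(14) using a Lasso relaxation of the equality-constrained \(\ell _{1}\) problem; the relaxation and the regularization strength are assumptions made purely for illustration, not the exact basis-pursuit solver used in the cited work.

```python
import numpy as np
from sklearn.linear_model import Lasso

def l1_graph(X, alpha=0.01):
    """l1 graph: for each sample, obtain sparse coefficients over the dictionary
    (X_without_i, I) and use their absolute values as edge weights (Eq. (14))."""
    n, d = X.shape
    W = np.zeros((n, n))
    for i in range(n):
        idx = [j for j in range(n) if j != i]
        D = np.hstack([X[idx].T, np.eye(d)])                  # dictionary (X_{-i}, I)
        coef = Lasso(alpha=alpha, fit_intercept=False,
                     max_iter=5000).fit(D, X[i]).coef_
        W[i, idx] = np.abs(coef[: n - 1])                     # drop the noise part alpha_noise
    return W
```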

2.4.5 LRR graph and SSLRR graph

The low-rank representation is a robust subspace structure recovery method proposed in the literature (Liu et al. 2013). Unlike the sparse representation model, which learns each sample’s representation coefficients individually so that the representation coefficients of all samples lack a global constraint, the low-rank representation can better capture the global structure of the data by applying low-rank regularization to the representation coefficient matrix. Usually, the nuclear norm \(\Vert \cdot \Vert _{*}\) is used to approximate the rank of a matrix, and the low-rank representation model can be written as follows.

$$\begin{aligned} \underset{\bf{R,E}}{{\text {min}}}\; \Vert \bf{R}\Vert _{*}+\lambda \Vert \bf{E}\Vert _{2,1} \;\;s.t.\;\bf{X}=\bf{X} \bf{R}+\bf{E} . \end{aligned}$$
(15)

After obtaining the optimal representation matrix \(\bf{R}^{*}\), the weight of the edge between vertex \(v_{i}\) and \(v_{j}\) can be calculated by the following formula (Zhuang et al. 2011)

$$\begin{aligned} w_{ij}=\frac{\mid r_{ij}^{*} \mid + \mid r_{ji}^{*} \mid }{2}, \end{aligned}$$
(16)

where \(r_{ij}^{*}\) is the element in the ith row and the jth column of the matrix \(\bf{R}^*\).

To better use the supervised information to improve the quality of the graph, a semi-supervised graph construction method based on the LRR graph (SSLRR) was proposed in the literature (Zhuang et al. 2017). In the SSLRR method, the “cannot link” constraint is added into the low-rank representation model to enforce the representation coefficients between samples with different class labels to be 0:

$$\begin{aligned} \underset{\bf{R,E}}{{\text {min}}}\;&\Vert \bf{R}\Vert _{*}+\lambda \Vert \bf{E}\Vert _{2,1}\\ s.t.\;&\bf{X}=\bf{X} \bf{R}+\bf{E}\\&\sum _{j=1}^{n}r_{ji}=1,\; i=1,2,\cdots ,n\\&r_{ij}=0,\; \;i,j=1,2,\cdots ,l, \; y_{i}\ne y_{j}. \end{aligned}$$
(17)

Similar to the LRR graph, the weight of the edge between vertex \(v_{i}\) and \(v_{j}\) is calculated by the formula (16).

It can be seen from the above discussion that the quality of the graph depends heavily on the assumptions used in these methods, whether they are graph construction methods based on the distance metric or on data representation (some kind of distance metric for the former and subspace structure for the latter). If the data distribution does not meet the corresponding assumptions, the quality of the graph will be seriously degraded, resulting in the performance deterioration of the subsequent GSSL. Indeed, the data distribution tends to be complex and various, so a graph construction method based on a specific assumption encounters difficulty in adaptively capturing the similarities between samples that are consistent with the data distribution.

Motivated by the above analysis, a graph construction method by fusing multiple clustering results is proposed for the following reasons:

  1. Clustering is a classic method to mine the structure of a data distribution, and different clustering algorithms are good at mining different data distribution structures. Thus we can use different clustering algorithms to capture various data distribution structures.

  2. We can construct a high-quality graph by integrating multiple clustering results reasonably.

3 GSSL via improving the quality of the graph dynamically

From the above discussion, we can see that the quality of the graph directly affects the performance of the GSSL method. Traditional methods are based on certain specific assumptions, so it is difficult for them to capture complex and various data distributions. To address this problem, the method of fusing multiple clustering results is employed to elevate the quality of the graph, which can improve the performance of the GSSL method.

3.1 Measuring similarity via the weighted co-association matrix

In practice, the potential data distribution tends to be complex and varies from data to data. Clustering is a classical unsupervised learning method that aims to discover the data distribution structure. Many classical clustering algorithms (Jain 2010) have been proposed, and different algorithms are adept at dealing with different data distributions. Therefore, different clustering algorithms with different settings (for more implementation details, see the experimental section) can be used to obtain a candidate set that covers various data distributions.

Assume that the clustering process produces m clustering results \(\Pi =\{\pi _{t}\}_{t=1}^{m}\), where \(\pi _{t}\) is the tth clustering result, and let \(\bf{R}^{(t)} \in \{0,1\}^{n \times n}\) be the matrix derived from \(\pi _{t}\), where

$$\begin{aligned} r^{(t)}_{ij}=\left\{ \begin{array}{ll} 1,& {\text{ if }} \bf{x}_i {\text{ and }} \bf{x}_j {\text{ belong to the same cluster in }} \pi _t\\ 0,& {\text{ otherwise }} \end{array} \right. . \end{aligned}$$
(18)

The co-association matrix (Fred and Jain 2005) is defined as:

$$\begin{aligned} \bf{M}^{\rm{(co)}}=\frac{1}{m}\sum _{t=1}^{m} \bf{R}^{(t)}. \end{aligned}$$
(19)

The element \(m_{ij}^{\rm{(co)}}\) of matrix \(\bf{M}^{\rm{(co)}}\) can be used to measure the similarity between \(\bf{x}_{i}\) and \(\bf{x}_{j}\). It is widely accepted that the importance of different clustering results should be different. Thus, the weighted co-association matrix can measure the similarity better:

$$\begin{aligned} \bf{W}=\sum _{t=1}^{m}\lambda _{t} \bf{R}^{(t)} \;\;s.t.\;\varvec{\lambda }\in \bf{conv}_m , \end{aligned}$$
(20)

where \(\varvec{\lambda }=(\lambda _{1},\lambda _{2},\cdots ,\lambda _{m})^T\) is the weights vector of the m clustering results, and \(\bf{conv}_m=\{\varvec{\alpha }|\varvec{\alpha }\in {\mathbb {R}}^m,\;\alpha _t\ge 0,\;t=1,2,\cdots ,m,\;\sum _{t=1}^{m}\alpha _{t}=1\}\) is the simplex. Obviously, \(w_{ij}\in [0,1]\), and the larger the value is, the greater the similarity between \(\bf{x}_i\) and \(\bf{x}_j\) is.
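
The co-association matrix (19) and its weighted version (20) are straightforward to compute from the base clustering labels, as in the following sketch (the function names are illustrative):

```python
import numpy as np

def cocluster_matrices(labelings):
    """Eq. (18): binary matrices R^(t), r_ij = 1 iff x_i and x_j share a cluster in pi_t."""
    return [np.equal.outer(y, y).astype(float) for y in labelings]

def fused_graph(R_list, lam=None):
    """Eq. (19) for equal weights (lam=None) or Eq. (20) for a weight vector lam
    lying on the simplex conv_m."""
    m = len(R_list)
    if lam is None:
        lam = np.full(m, 1.0 / m)
    return sum(l_t * R for l_t, R in zip(lam, R_list))
```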

The difference between formulas (19) and (20) is that the weight of each clustering result in formula (19) is equal, while in formula (20) the weights are obtained by optimization (which will be shown in the subsequent discussion). From the perspective of the clustering ensemble, the quality of each clustering result is different. Naturally, they should be given different weights in the fusion process. However, in the unsupervised scenario, how to evaluate the quality of a clustering result is an open problem. Therefore, the equal weight strategy adopted by formula (19) is not a bad choice. In this paper, some criteria are used to evaluate the quality of the clustering results. First, the supervision information provided by the labeled samples can be used to evaluate the quality of each clustering result (see Sect. 3.2), making higher quality clustering results obtain greater weights. Second, the result of label inference can be used as the pseudo label to evaluate the quality of each clustering result, which can dynamically adjust the weights of clustering results (see Sect. 3.3).

3.2 Refining the weights by means of supervision information

Unlike the clustering ensemble, the supervision information provided by the labeled samples can be used to evaluate the clustering result in semi-supervised learning. We can directly obtain the “must link” and “cannot link” constraint (Wagstaff et al. 2001; Zeng and Cheung 2012) through the labeled samples. Let \(ML \) and \(CL \) be the sets of “must link” sample pairs and “cannot link” sample pairs, respectively. The definitions of \(ML \) and \(CL \) are

$$\begin{aligned} {ML}=\left\{ \left( \bf{x}_{i},\bf{x}_{j}\right) |\;i,j=1,2,\cdots ,l,\;y_{i}=y_{j}\right\} , \end{aligned}$$
(21)

and

$$\begin{aligned} {CL}=\left\{ \left( \bf{x}_{i},\bf{x}_{j}\right) |\; i,j=1,2,\cdots ,l,\;y_{i}\ne y_{j}\right\} . \end{aligned}$$
(22)

For the sake of this discussion, their matrix representations are defined as

$$\begin{aligned} \bf{M} \in \{0,1\}^{n \times n},\;\; m_{ij}= \left\{ \begin{array}{ll} 1,& \left( \bf{x}_{i},\bf{x}_{j}\right) \in ML \\ 0,&{\text{ otherwise }} \end{array} \right. , \end{aligned}$$
(23)

and

$$\begin{aligned} \bf{C} \in \{0,1\}^{n \times n},\;\; c_{ij}= \left\{ \begin{array}{ll} 1,& \left( \bf{x}_{i},\bf{x}_{j}\right) \in CL \\ 0,& {\text{ otherwise }} \end{array} \right. . \end{aligned}$$
(24)
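
Given the class labels of the l labeled samples, the matrices \(\bf{M}\) and \(\bf{C}\) of Eqs. (23) and (24) can be built as follows (an illustrative sketch):

```python
import numpy as np

def link_matrices(y_labeled, n):
    """Eqs. (23)-(24): must-link matrix M and cannot-link matrix C of size n x n,
    nonzero only on the first l rows/columns corresponding to the labeled samples."""
    y = np.asarray(y_labeled)
    l = len(y)
    same = np.equal.outer(y, y)
    M = np.zeros((n, n))
    C = np.zeros((n, n))
    M[:l, :l] = same
    C[:l, :l] = ~same
    return M, C
```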

If the sample pair \(\left( \bf{x}_{i},\bf{x}_{j}\right) \) meets the “must link” constraint, then on the graph G their corresponding weight \(w_{ij}\) should be large. We can use the following optimization problem to achieve this goal.

$$\begin{aligned} \underset{\varvec{\lambda }}{{\text {min}}}\; -\sum _{\left( \bf{x}_{i},\bf{x}_{j}\right) \in ML } w_{ij} \;\;s.t.\;\bf{W}=\sum _{t=1}^{m}\lambda _{t} \bf{R}^{(t)}, \;\varvec{\lambda }\in \bf{conv}_m . \end{aligned}$$
(25)

If the sample pair \(\left( \bf{x}_{i},\bf{x}_{j}\right) \) meets the “cannot link” constraint, then on the graph G, their corresponding node pair should ideally be disconnected. However, this goal corresponds to a discrete optimization problem that is very difficult to implement by numerical optimization techniques. In this paper, we use the following optimization problem to approximate the necessary condition for achieving this goal.

$$\begin{aligned} \underset{\varvec{\lambda }}{{\text {min}}}\; \sum _{\left( \bf{x}_{i}, \bf{x}_{j}\right) \in CL } \bf{w}_{i} \bf{w}_{j}^{T} \;\;s.t.\;\bf{W}=\sum _{t=1}^{m}\lambda _{t} \bf{R}^{(t)},\; \varvec{\lambda }\in \bf{conv}_m . \end{aligned}$$
(26)

From formula (20) we know that the weight \(w_{ij}\) on the graph G is nonnegative. Therefore, the inner product of the weight vectors \(\bf{w}_{i} \bf{w}_{j}^{T}\) is also nonnegative. The inner product \(\bf{w}_{i} \bf{w}_{j}^{T}=0\) means that the intersection of the set of nodes that directly connect to the ith node and the set of nodes that directly connect to the jth node is empty, i.e., \( \{v_{k}|\;w_{ik}\ne 0,\;k=1,2,\cdots ,n\} \bigcap \{v_{k}|\;w_{jk}\ne 0,\;k=1,2,\cdots ,n\}=\phi \), which is a necessary condition for the nodes \(v_{i}\) and \(v_{j}\) to be disconnected on the graph G.

Combining optimization problems (25) and (26), we obtain the following optimization problem.

$$\begin{aligned} &\underset{\varvec{\lambda }}{{\text {min}}}\; -\gamma _{\rm{ml}}\sum _{\left( \bf{x}_{i},\bf{x}_{j}\right) \in ML } w_{ij} \;+\; \gamma _{\rm{cl}}\sum _{\left( \bf{x}_{i}, \bf{x}_{j}\right) \in CL } \bf{w}_{i} \bf{w}_{j}^{T}\\&s.t.\;\bf{W}=\sum _{t=1}^{m}\lambda _{t} \bf{R}^{(t)},\; \varvec{\lambda }\in \bf{conv}_m . \end{aligned}$$
(27)

where \(\gamma _{\rm{ml}}\) and \(\gamma _{\rm{cl}}\) are two nonnegative trade-off hyper-parameters.

3.3 Optimizing the quality of the graph and the class label iteratively

In this section, the most commonly used squared loss is chosen as an example to show how to integrate the graph construction by fusing multiple clustering results and the label inference into a unified framework to achieve their mutual guidance and dynamic improvement. Formally, the squared fitting loss and smoothness terms can be written as follows.

$$\begin{aligned} L_{\rm{fit}}({\bf{Z}})& = \sum _{i=1}^{n}\Vert \bf{f}_{i} - {\bf{Z}}_{i}\Vert _{2}^{2} , \end{aligned}$$
(28)
$$\begin{aligned} L_{\rm{smooth}}({\bf{Z}})& = \frac{1}{2}\sum _{i=1}^{n}\sum _{j=1}^{n}w_{ij}\Vert {\bf{Z}}_{i} -{\bf{Z}}_{j}\Vert _{2}^{2} . \end{aligned}$$
(29)

It should be noted that the method proposed in this paper is a general framework for improving the quality of the graph dynamically in GSSL. Other GSSL methods, such as the Harmonic method (Zhu et al. 2003) (in which the squared loss is used as the fitting and smoothness loss), LLGC (Zhou et al. 2003) (in which the squared loss is used as the fitting and smoothness loss), LapSVM (Belkin et al. 2006) (in which the hinge loss is used as the fitting loss and the squared loss is used as the smoothness loss) and measure propagation (Subramanya and Bilmes 2011) (in which the KL divergence is used as the fitting and smoothness loss), can be embedded in this framework without modification and their performance can be improved.

By integrating formula (27), (28) and (29), the final form of the proposed model is given below.

$$\begin{aligned} \underset{{\bf{Z}},\varvec{\lambda }}{{\text {min}}}\;&L({\bf{Z}},\varvec{\lambda })= \gamma _{\rm{fit}}\sum _{i=1}^{n}\Vert \bf{f}_{i}-{\bf{Z}}_{i}\Vert _{2}^{2}+\frac{\gamma _{\rm{smooth}}}{2}\sum _{i=1}^{n}\sum _{j=1}^{n}w_{ij}\Vert {\bf{Z}}_{i} -{\bf{Z}}_{j}\Vert _{2}^{2}\\&\;\;\;\;\;\;\;\;\;\;\;-\gamma _{\rm{ml}}\sum _{(\bf{x}_{i},\bf{x}_{j}) \in ML } w_{ij} +\gamma _{\rm{cl}}\sum _{(\bf{x}_{i},\bf{x}_{j}) \in CL } \bf{w}_{i} \bf{w}_{j}^{T}\\ s.t.\;&\bf{W}=\sum _{t=1}^{m}\lambda _{t} \bf{R}^{(t)},\; \varvec{\lambda }\in \bf{conv}_m . \end{aligned}$$
(30)

In the rest of this paper, the method corresponding to the above formula is named SLLI-IQGD (Squared Loss Label Inference via Improving the Quality of the Graph Dynamically).

In formula (30), label inference and graph construction are integrated into an optimization model. In this model, the graph is constructed by fusing multiple clustering results, and the fusion weights are learned iteratively under the joint guidance of three kinds of information: “must link”, “cannot link” and the pseudo label generated by label inference. As a result, the quality of the graph and the result of label inference are improved dynamically.

3.4 Model solution

The alternating optimization method is used to solve the optimization problem (30), which alternates between two steps: fixing \(\varvec{\lambda }\) and updating \({\bf{Z}}\), and fixing \({\bf{Z}}\) and updating \(\varvec{\lambda }\).

3.4.1 Fixing \(\varvec{\lambda }\) and updating \({\bf{Z}}\)

When \(\varvec{\lambda }\) is fixed, the optimization problem (30) can be written as follows.

$$\begin{aligned} \underset{{\bf{Z}}}{{\text {min}}}\; L({\bf{Z}})&= \gamma _{\rm{fit}}\sum _{i=1}^{n}\Vert {\bf{Z}}_{i} - \bf{f}_{i}\Vert _{2}^{2} +\frac{\gamma _{\rm{smooth}}}{2}\sum _{i=1}^{n}\sum _{j=1}^{n}w_{ij}\Vert {\bf{Z}}_{i} -{\bf{Z}}_{j}\Vert _{2}^{2}\\&=\gamma _{\rm{fit}}{\text{ tr }}(({\bf{Z}}-\bf{F})({\bf{Z}}-\bf{F})^{T})+\gamma _{\rm{smooth}}{\text{ tr }}({\bf{Z}}^{T} \bf{LZ}) , \end{aligned}$$
(31)

where \(\bf{L}=\bf{D}-\bf{W}\) is the Laplacian matrix of graph G and \(\bf{D}=diag(d_1,d_2,\cdots ,d_n)\) is the diagonal matrix with \(d_{i}=\sum _{j=1}^{n}w_{ij},\;i=1,2,\cdots ,n\). The differential of \(L({\bf{Z}})\) w.r.t \({\bf{Z}}\) is:

$$\begin{aligned} \frac{\partial L({\bf{Z}})}{\partial {\bf{Z}}}=2\gamma _{\rm{fit}} {\bf{Z}}-2\gamma _{\rm{fit}} \bf{F}+2\gamma _{\rm{smooth}} \bf{LZ} . \end{aligned}$$
(32)

Setting the differential to \(\bf{0}\), we obtain the following formula for updating \({\bf{Z}}\).

$$\begin{aligned} {\bf{Z}}^{*}=\gamma _{\rm{fit}}(\gamma _{\rm{smooth}} \bf{L}+\gamma _{\rm{fit}} \bf{I})^{-1} \bf{F} . \end{aligned}$$
(33)

Since the matrix \(\bf{L}\) is a positive semidefinite matrix and both \(\gamma _{\rm{fit}}\) and \(\gamma _{\rm{smooth}}\) are greater than 0, the matrix \((\gamma _{\rm{smooth}} \bf{L}+\gamma _{\rm{fit}} \bf{I})\) is an invertible matrix.

The optimization problem (31) is actually a label propagation algorithm on the graph under the regularization framework proposed in Zhou et al. (2003). Different from the literature (Zhou et al. 2003), the weight matrix \(\bf{W}\) is the weighted fusion of multiple clustering results. It can be seen from the solving process of the above subproblem that the weighted co-association matrix obtained in the previous iteration guides the learning of the labels of the unlabeled samples through the smoothness loss term.
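
A direct NumPy implementation of the update (33) is sketched below (illustrative only):

```python
import numpy as np

def update_Z(W, F, gamma_fit, gamma_smooth):
    """Eq. (33): Z* = gamma_fit * (gamma_smooth * L + gamma_fit * I)^{-1} F,
    where L = D - W is the Laplacian of the current fused graph W."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W
    return gamma_fit * np.linalg.solve(gamma_smooth * L + gamma_fit * np.eye(n), F)
```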

3.4.2 Fixing \({\bf{Z}}\) and updating \(\varvec{\lambda }\)

When \({\bf{Z}}\) is fixed, the optimization problem (30) can be written as follows.

$$\begin{aligned} \underset{\varvec{\lambda }}{{\text {min}}}\;&L(\varvec{\lambda })= \frac{\gamma _{\rm{smooth}}}{2}\sum _{i=1}^{n}\sum _{j=1}^{n}w_{ij}\Vert {\bf{Z}}_{i} -{\bf{Z}}_{j}\Vert _{2}^{2} -\gamma _{\rm{ml}}\sum _{(\bf{x}_{i},\bf{x}_{j}) \in ML } w_{ij}\\&\;\;\;\;\;\;\;\;\;+\gamma _{\rm{cl}}\sum _{(\bf{x}_{i},\bf{x}_{j}) \in CL } \bf{w}_{i} \bf{w}_{j}^{T}\\ s.t.\;&\bf{W}=\sum _{t=1}^{m}\lambda _{t} \bf{R}^{(t)},\; \varvec{\lambda }\in \bf{conv}_m . \end{aligned}$$
(34)

By substituting the first constraint, the matrix representation of the “must link” constraint defined in formula (23) and the “cannot link” constraint defined in (24) into the above objective function, the optimization problem (34) can be converted equivalently into the following form.

$$\begin{aligned} &\underset{\varvec{\lambda }}{{\text {min}}}\; L(\varvec{\lambda })= \sum _{t=1}^{m}\lambda _{t}\frac{\gamma _{\rm{smooth}}}{2} \left( \sum _{i=1}^{n}\sum _{j=1}^{n}r_{ij}^{(t)}\Vert {\bf{Z}}_{i} -{\bf{Z}}_{j}\Vert _{2}^{2} \right) \\&\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;-\sum _{t=1}^{m}\lambda _{t}\gamma _{\rm{ml}} \left( \sum _{i=1}^{l}\sum _{j=1}^{l}r_{ij}^{(t)}m_{ij} \right) \\&\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+\gamma _{\rm{cl}}\sum _{i=1}^{l}\sum _{j=1}^{l}c_{ij} \left( \varvec{\lambda }^{T} \left( \begin{array}{c} \bf{r}_{i}^{(1)}\\ \bf{r}_{i}^{(2)}\\ \vdots \\ \bf{r}_{i}^{(m)}\\ \end{array} \right) \right) \left( \varvec{\lambda }^{T} \left( \begin{array}{c} \bf{r}_{j}^{(1)}\\ \bf{r}_{j}^{(2)}\\ \vdots \\ \bf{r}_{j}^{(m)}\\ \end{array} \right) \right) ^{T} \\&s.t.\;\varvec{\lambda }\in \bf{conv}_m \end{aligned}$$
(35)

Let

$$\begin{aligned} \bf{v}^{\rm{smooth}}&=(v_{1}^{\rm{smooth}},v_{2}^{\rm{smooth}},\cdots ,v_{m}^{\rm{smooth}})^T \in {\mathbb {R}}^m,\nonumber \\ v^{\rm{smooth}}_{t}&=\sum _{i=1}^{n}\sum _{j=1}^{n}r_{ij}^{(t)}\Vert {\bf{Z}}_{i}-{\bf{Z}}_{j}\Vert _{2}^{2},\;t=1,2,\cdots ,m, \end{aligned}$$
(36)
$$\begin{aligned} \bf{v}^{\rm{ml}}&=(v^{\rm{ml}}_1,v^{\rm{ml}}_2,\cdots ,v^{\rm{ml}}_m)^T \in {\mathbb {R}}^m,\nonumber \\ v^{\rm{ml}}_{t}&=\sum _{i=1}^{l}\sum _{j=1}^{l}r_{ij}^{(t)}m_{ij},\;t=1,2,\cdots ,m, \end{aligned}$$
(37)
$$\begin{aligned} \bf{v}&=\frac{ \gamma _{\rm{smooth}}}{2} \bf{v}^{\rm{smooth}}-\gamma _{\rm{ml}} \bf{v}^{\rm{ml}}\in {\mathbb {R}}^{m} , \end{aligned}$$
(38)
$$\begin{aligned} \bf{S}&=\gamma _{\rm{cl}}\sum _{i=1}^{l}\sum _{j=1}^{l}c_{ij} \bf{S}^{(ij)}, \end{aligned}$$
(39)
$$\begin{aligned} \bf{S}^{(ij)}&= \left( \begin{array}{c} \bf{r}_{i}^{(1)}\\ \bf{r}_{i}^{(2)}\\ \vdots \\ \bf{r}_{i}^{(m)}\\ \end{array} \right) \left( \begin{array}{c} \bf{r}_{j}^{(1)}\\ \bf{r}_{j}^{(2)}\\ \vdots \\ \bf{r}_{j}^{(m)}\\ \end{array} \right) ^{T} \in {\mathbb {R}}^{m \times m} . \end{aligned}$$
(40)

Obviously, \(\bf{S}\) is a symmetric matrix.

By using formulas (36)–(39), the optimization problem (35) can be equivalently written in the following form.

$$\begin{aligned} \underset{\varvec{\lambda }}{{\text {min}}}\; L(\varvec{\lambda })= \varvec{\lambda }^{T} \bf{S} \varvec{\lambda }+\bf{v}^{T} \varvec{\lambda } \;\;s.t.\; \varvec{\lambda }\in \bf{conv}_m \end{aligned}$$
(41)

The optimization problem (41) is a standard quadratic programming problem whose convexity depends on whether the matrix \(\bf{S}\) is positive semidefinite. Because \(\bf{S}\) is a symmetric matrix, all of its eigenvalues are real. Let \(\mu _{\rm{min}}\) be the smallest eigenvalue of \(\bf{S}\).

If \(\mu _{\rm{min}} \ge 0\), then \(\bf{S}\) is a positive semidefinite matrix, and the optimization problem (41) turns out to be a convex quadratic programming (CQP) problem, which can be solved by using a convex quadratic programming algorithm (Boyd and Vandenberghe 2004).

If \(\mu _{\rm{min}} < 0\), then \(\bf{S}\) is not a positive semidefinite matrix, and the optimization problem (41) is not a convex quadratic programming problem. In this case, the optimization problem (41) can be converted into the following equivalent form.

$$\begin{aligned} \underset{\varvec{\lambda }}{{\text {min}}}\; L(\varvec{\lambda })= \varvec{\lambda }^{T} \bf{S}_{+} \varvec{\lambda }+\bf{v}^{T} \varvec{\lambda }+\mu _{\rm{min}}\varvec{\lambda }^{T} \bf{I} \varvec{\lambda } \;\;s.t.\;\varvec{\lambda }\in \bf{conv}_m , \end{aligned}$$
(42)

where \(\bf{S}_{+}=\bf{S}-\mu _{\rm{min}} \bf{I}\) is a positive semidefinite matrix, and the first term of the objective function in problem (42) is convex w.r.t \(\varvec{\lambda }\). The second term \(\bf{v}^{T} \varvec{\lambda }\) is linear w.r.t \(\varvec{\lambda }\). The third term \(\mu _{\rm{min}}\varvec{\lambda }^{T} \bf{I} \varvec{\lambda }\) is concave w.r.t \(\varvec{\lambda }\) since \(\mu _{\rm{min}} < 0\). Therefore, the optimization problem (42) is a concave-convex quadratic programming (CCQP) problem. Such a problem can be solved by transforming the concave part of the objective function into a series of convex quadratic programming problems; for details, see the literature (Yuille and Rangarajan 2003).
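
For illustration, the subproblem (41) can be assembled from Eqs. (36)–(40) and solved with a general-purpose constrained optimizer, as sketched below. A generic SLSQP solver is used here only as a stand-in for the CQP/CCQP procedures discussed above (for a non-convex \(\bf{S}\) it returns a local solution), and the function name is an assumption of this sketch.

```python
import numpy as np
from scipy.optimize import minimize

def update_lambda(R_list, Z, M, C, l, gamma_smooth, gamma_ml, gamma_cl):
    """Subproblem (41): minimize lambda^T S lambda + v^T lambda over the simplex conv_m."""
    m = len(R_list)
    # v_smooth (Eq. 36): sum_ij r_ij^(t) ||Z_i - Z_j||^2 for each base clustering t
    sq = np.sum(Z ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    v_smooth = np.array([np.sum(R * D2) for R in R_list])
    # v_ml (Eq. 37): agreement of each base clustering with the must-link pairs
    v_ml = np.array([np.sum(R[:l, :l] * M[:l, :l]) for R in R_list])
    v = 0.5 * gamma_smooth * v_smooth - gamma_ml * v_ml      # Eq. (38)
    # S (Eqs. 39-40): accumulate outer products over the cannot-link pairs
    Rstack = np.stack(R_list)                                 # shape (m, n, n)
    S = np.zeros((m, m))
    for i, j in zip(*np.nonzero(C[:l, :l])):
        S += Rstack[:, i, :] @ Rstack[:, j, :].T
    S *= gamma_cl
    # quadratic program on the simplex, solved here with SLSQP for simplicity
    fun = lambda lam: lam @ S @ lam + v @ lam
    cons = ({'type': 'eq', 'fun': lambda lam: lam.sum() - 1.0},)
    res = minimize(fun, np.full(m, 1.0 / m), method='SLSQP',
                   bounds=[(0.0, 1.0)] * m, constraints=cons)
    return res.x
```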

As seen from subproblem (41), three factors jointly guide the learning of the weights of clustering results.

  1. Through the graph smoothness term, the predicting label matrix \({\bf{Z}}\) obtained in the last iteration provides guidance information that is encoded in \(\bf{v}^{\rm{smooth}}\).

  2. The supervision information expressed by the “must link” constraint provides guidance information that is encoded in \(\bf{v}^{\rm{ml}}\).

  3. The supervision information expressed by the “cannot link” constraint also provides guidance information that is encoded in \(\bf{S}\).

3.5 The framework of the GSSL-IQGD algorithm

In this section, the framework of the GSSL-IQGD algorithm is described in Algorithm 1.

Algorithm 1 The framework of the GSSL-IQGD algorithm (pseudocode figure not reproduced here)
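
Since the pseudocode figure is not reproduced here, the following condensed sketch outlines the three phases described in Sect. 3.6, reusing the helper functions fused_graph, update_Z and update_lambda sketched earlier (all illustrative names, not the authors' code); the stopping condition is simplified to a fixed number of iterations.

```python
import numpy as np

def gssl_iqgd(R_list, F, M, C, l, gamma_fit, gamma_smooth, gamma_ml, gamma_cl, n_ite=20):
    """Condensed sketch of Algorithm 1 with the squared-loss label inference (SLLI-IQGD)."""
    m = len(R_list)
    lam = np.full(m, 1.0 / m)                 # preparation: start from equal weights (Eq. 19)
    W = fused_graph(R_list, lam)
    for _ in range(n_ite):                    # learning phase: alternate the two subproblems
        Z = update_Z(W, F, gamma_fit, gamma_smooth)                    # Eq. (33)
        lam = update_lambda(R_list, Z, M, C, l,
                            gamma_smooth, gamma_ml, gamma_cl)          # subproblem (41)
        W = fused_graph(R_list, lam)                                   # Eq. (20)
    y_unlabeled = np.argmax(Z[l:], axis=1)    # predicting phase: Eq. (4)
    return y_unlabeled, W, lam
```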

3.6 Computational complexity analysis

In this section, we analyze the computational complexity of Algorithm 1, which consists of three phases: the preparation phase (lines 1–3), the learning phase (lines 4–13) and the predicting phase (line 14).

In the preparation phase (lines 1–3), assume that the time complexity of the clustering algorithm is \(O\left( T_{\rm{cluster}}\right) \); then the time complexity of generating m clustering results is \(O\left( mT_{\rm{cluster}}\right) \). The time complexities of calculating \(\bf{v}^{\rm{ml}}\) (see formula (37)) and \(\bf{S}\) (see formula (39)) are \(O\left( l^2m\right) \) and \(O\left( l^2m^2n \right) \), respectively, and the time complexity of calculating the smallest eigenvalue \(\mu _{\rm{min}}\) of \(\bf{S}\) is \(O\left( m^3\right) \). Finally, the time complexity of initializing \(\bf{W}\) is \(O\left( mn^2\right) \). To sum up, the time complexity of the preparation phase is \(O\left( mT_{\rm{cluster}} + l^2m + l^2m^2n + m^3 + mn^2 \right) \). In semi-supervised learning, l is usually small and \(l \ll n\), so it can be simplified to \(O\left( mT_{\rm{cluster}} + m^2n + m^3 + mn^2\right) \).

In the learning phase (lines 4–13), the loop body contains four main steps: updating the predicting label matrix \({\bf{Z}}\), updating \(\bf{v}\), updating the weights of the m clustering results and updating the edge weight matrix \(\bf{W}\). The time complexities of these four steps are as follows.

  1. The time complexity of updating \({\bf{Z}}\) depends on the label inference algorithm \(A_{\rm{label \, inference}}\). Assume that its time complexity is \(O\left( T_{\rm{label \, inference}} \right) \).

  2. When \({\bf{Z}}\) is given, the time complexity of updating \(\bf{v}\) (see formula (38)) is \(O\left( mn^2 \right) \).

  3. When \(\bf{v}\) is given, the time complexity of updating \(\varvec{\lambda }\) depends on whether \(\bf{S}\) is a positive semi-definite matrix.

     (a) If \(\bf{S}\) is a positive semi-definite matrix, then the update of \(\varvec{\lambda }\) is a convex quadratic programming problem that can be solved by the ellipsoid method in polynomial time, so the time complexity is \(O\left( P(m) \right) \), where \(P(\cdot )\) is a polynomial function.

     (b) If \(\bf{S}\) is not a positive semi-definite matrix, then \(\varvec{\lambda }\) is updated by solving a series of convex quadratic programming problems, so the time complexity is \(O\left( n_{\rm{qp}}P(m) \right) \), where \( n_{\rm{qp}}\) is the number of quadratic programming problems involved in the process. Usually \(n_{\rm{qp}}\) can be treated as a constant, so we have \(O\left( n_{\rm{qp}}P(m) \right) = O\left( P(m) \right) \).

     As a result, the time complexity of updating \(\varvec{\lambda }\) is \(O\left( P(m) \right) \).

  4. When \(\varvec{\lambda }\) is given, the time complexity of updating \(\bf{W}\) is \(O\left( mn^2 \right) \).

Assuming that the number of iterations is \(n_{\rm{ite}}\), the time complexity of the learning phase is \(O\left( n_{\rm{ite}}\left( T_{\rm{label \, inference}} + mn^2 + P(m) + mn^2 \right) \right) \). In practice, we can specify a maximum number of iterations as the stopping condition of the algorithm, so \( n_{\rm{ite}}\) can be treated as a constant, and the time complexity of this phase can be simplified to \(O\left( T_{\rm{label \, inference}} + mn^2 + P(m) \right) \).

In the predicting phase (line 14), the time complexity of calculating the predicted labels of the u unlabeled samples is \(O\left( uc \right) \). Noticing that \(u<n\) and \(c \ll n\), we have \( O\left( uc \right) =O\left( n \right) \).

To sum up, the time complexity of Algorithm 1 is

$$\begin{aligned} \begin{aligned}&O\left( mT_{{\mathrm{cluster}}} + m^2n + m^3 + mn^2 \right) +O \left( T_{{\mathrm{label \, inference}}} + mn^2 + P(m) \right) +O\left( n \right) \\&\quad =O\left( mT_{{\mathrm{cluster}}} + T_{\mathrm{label \, inference}} + P(m) +m^2n + mn^2 \right) \end{aligned}. \end{aligned}$$

4 Experiments

In this section, systematic experiments are conducted to illustrate the working mechanism and the effectiveness of the proposed GSSL-IQGD framework.

4.1 Experiments on artificial data sets

To illustrate the working mechanism of the proposed GSSL-IQGD framework, as a specific algorithm under the framework, the SLLI-IQGD algorithm described in formula (30) is selected to perform experiments on the 3 artificial data sets.

4.1.1 Artificial data sets

There are 3 artificial data sets used in this experiment. These data sets can be downloaded from https://github.com/deric/clustering-benchmark. The basic information of the three artificial data sets is given in Table 2.

Table 2 Basic information of three artificial data sets

These three data sets are all 2-dimensional or 3-dimensional data. Thus, we can observe their distribution via the visualization. Figure 1 shows the distribution of the three artificial data sets.

Fig. 1 Three artificial data sets and the selected labeled samples in each class (marked in a dark cross)

As seen from Fig. 1a, the Xclara data set is a typical mixed Gaussian distribution containing three components, and each of them corresponds to a class. From Fig. 1b, we can observe that the Chainlink data set contains two typical manifold structures, and each of them corresponds to a class. In Fig. 1c, there are two kinds of data distribution on the Spiralsquare data set, including four mixed Gaussian clusters and two manifold structures. Therefore, there are in total six classes on the Spiralsquare data set. We randomly selected two samples in each class as labeled samples for each data set, which are marked with dark crosses in Fig. 1.

For the first two data sets shown in Fig. 1a and b, we can construct high-quality graphs that are consistent with the true data distribution by using the appropriate distance metric and parameters. For the Xclara data set, the Euclidean distance should be the best choice, while for the Chainlink data set, the geodesic distance would be better. However, in practical applications, the true data distribution is usually unknown and cannot be visualized since the data’s dimension is much larger than three; therefore, the correct distance metric is unknown. Even if we fortuitously choose the correct distance metric, there is still a lack of theoretical guidance on how to choose the graph construction parameters. If the parameters are set improperly, the quality of the graph will still be poor.

For the third data set, Spiralsquare, shown in Fig. 1c, the situation is worse. If we use the Euclidean distance, the similarities between samples in the four Gaussian clusters are calculated correctly, while the similarities on the manifolds are calculated incorrectly; if we switch to the geodesic distance, the situation is reversed. In this case, it is difficult to construct a high-quality graph by using only one particular distance metric.

From the above analysis, the data distribution in practice is usually unknown, complex and varies from data set to data set. Most traditional graph construction methods make specific assumptions about the data distribution, so it is difficult for them to adaptively measure the similarities between samples and build a high-quality graph. The next section elaborates how the SLLI-IQGD algorithm proposed in this paper adaptively discovers the data distribution by fusing multiple clustering results and thus copes with the complexity and variety of the unknown data distribution.

4.1.2 Experimental setting

For each data set, the k-means (Jain 2010) and DBSCAN (Ester et al. 1996) algorithms are used to obtain different clustering results. For the k-means algorithm, the number of clusters is fixed to the number of classes c, and the initial cluster centers are selected randomly; the algorithm is run five times on each data set to obtain five clustering results. For the DBSCAN algorithm, the parameter MinPts is fixed to 3 and the parameter Eps is determined by the given noise ratios of samples. By setting different noise ratios (\(\{0\%,0.01\%,0.03\%,0.05\%,0.07\%\}\) for Xclara and \(\{0\%,0.1\%,0.3\%,0.5\%,0.7\%\}\) for Chainlink and Spiralsquare), five different clustering results are obtained for each data set. According to the above settings, a total of \(5+5=10\) different clustering results are obtained for each data set.
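As a rough illustration of this setting, the following sketch (not the authors' code) shows how the \(5+5=10\) base clusterings could be generated with scikit-learn; in particular, the rule that derives DBSCAN's Eps from a target noise ratio via a quantile of the MinPts-th neighbor distances is only an assumption made for illustration.

```python
# Sketch (not the authors' code): generating the 5 + 5 base clusterings with
# scikit-learn. Deriving DBSCAN's Eps from a target noise ratio via a quantile
# of the MinPts-th neighbor distances is an assumption made for illustration.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.neighbors import NearestNeighbors

def kmeans_partitions(X, c, repeats=5, seed=0):
    """Five k-means runs with random initial centers and k fixed to c."""
    rng = np.random.RandomState(seed)
    return [KMeans(n_clusters=c, init="random", n_init=1,
                   random_state=rng.randint(1 << 30)).fit_predict(X)
            for _ in range(repeats)]

def dbscan_partitions(X, noise_ratios, min_pts=3):
    """One DBSCAN run per target noise ratio; Eps is set to a quantile of the
    distances to the MinPts-th nearest neighbor so that roughly the desired
    fraction of samples falls out as noise (heuristic)."""
    dist, _ = NearestNeighbors(n_neighbors=min_pts + 1).fit(X).kneighbors(X)
    kth = np.sort(dist[:, -1])                      # distance to MinPts-th neighbor
    labels = []
    for r in noise_ratios:
        eps = np.quantile(kth, max(1.0 - r, 1e-6))  # leave roughly r of the points as noise
        labels.append(DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X))
    return labels

# e.g. for Xclara (c = 3):
# partitions = kmeans_partitions(X, c=3) + dbscan_partitions(
#     X, noise_ratios=[0.0, 0.0001, 0.0003, 0.0005, 0.0007])
```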

In these three toy examples, the hyper-parameters of the SLLI-IQGD algorithm are set to be \(\gamma _{\rm{fit}}=\gamma _{\rm{smooth}}=\gamma _{\rm{ml}}=\gamma _{\rm{cl}}=0.5\).

4.1.3 Experimental results and analysis

The representative clustering results on the three data sets are given in Figs. 2–4.

Fig. 2 k-means and DBSCAN clustering results on Xclara data set

Fig. 3 k-means and DBSCAN clustering results on Chainlink data set

As seen from Fig. 2, for the Xclara data set, the k-means algorithm can discover the data distribution structure correctly, while the DBSCAN algorithm does not work effectively. For the Chainlink data set, the situation is reversed (see Fig. 3), i.e., the DBSCAN algorithm can discover the data distribution structure correctly, while the k-means algorithm does not work well. For the Spiralsquare data set, neither of the two clustering algorithms can discover the true data distribution structure (see Fig. 4). However, each clustering algorithm can discover a part of the data distribution structure.

Fig. 4 k-means and DBSCAN clustering results on Spiralsquare data set. Note that only the 6 biggest of the 72 clusters obtained by the DBSCAN algorithm are shown in sub-figure b

It can be seen from the above results that different clustering algorithms are good at mining different data distributions. Therefore, different data distributions can be discovered by using different clustering algorithms, and by fusing multiple clustering results reasonably, the similarities between samples can be measured adaptively and the constructed graph will be of high quality.
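A minimal sketch of this idea is given below, assuming the fused graph is a \(\varvec{\lambda }\)-weighted co-association matrix of the base clusterings; the paper's exact edge-weight definition may differ, and the treatment of DBSCAN noise points is an assumption.

```python
# Sketch: fusing m base clusterings into one similarity graph as a
# lambda-weighted co-association matrix; the paper's exact edge-weight
# definition may differ, and treating DBSCAN noise points (label -1) as
# singletons is an assumption.
import numpy as np

def fused_graph(partitions, lam):
    """W[i, j] = sum_t lam[t] * 1[samples i and j share a cluster in partition t]."""
    partitions = [np.asarray(p) for p in partitions]
    n = partitions[0].size
    W = np.zeros((n, n))
    for t, p in enumerate(partitions):
        same = (p[:, None] == p[None, :]) & (p[:, None] != -1)
        W += lam[t] * same
    np.fill_diagonal(W, 0.0)
    return W

# With equal initial weights lam = np.full(m, 1/m), sample pairs that most
# clusterings place together receive the largest fused similarity.
```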

The classification error rate and the learned weight vector \(\varvec{\lambda }^{*}\) of the clustering results on three artificial data sets are given in Table 3. In the last column of Table 3, the former five numerical values are the weights of the clustering results obtained by the k-means algorithm, and the latter five numerical values are the weights of the clustering results obtained by the DBSCAN algorithm. The largest element in the weight vector is marked in bold.

Table 3 The classification error rate and the weight vector \(\varvec{\lambda }^{*}\) on three artificial data sets

It can be seen from Table 3 that, in the learned weight vector \(\varvec{\lambda }^{*}\), the clustering results that capture the true data distribution are given the largest weights on all data sets (marked in bold). Specifically, on the Xclara data set, the five clustering results obtained by k-means are given the largest weight 0.20, while the other five clustering results obtained by DBSCAN are given weights very close to 0. On the Chainlink data set, the first two clustering results obtained by DBSCAN are given the largest weight 0.50, while the five clustering results obtained by k-means and the other three clustering results obtained by DBSCAN are given weights close to 0. On the Spiralsquare data set, the largest weight 1 is given to the last clustering result obtained by DBSCAN, while the other nine clustering results are given weights close to 0. This result is consistent with the clustering results shown in Figs. 2–4. As a result, the graph constructed by the weighted fusion of multiple clustering results measures the similarities between samples correctly, and the classification error rate of the SLLI-IQGD algorithm is very low on all three data sets.

To further illustrate that the quality of the graph is gradually improved during the iterative learning process, Table 4 records the classification error rate of the unlabeled samples after each iteration of the SLLI-IQGD algorithm and the \(\ell _2\) norm of the difference between the weight vectors \(\varvec{\lambda }\) obtained by two successive iterations on three artificial data sets.

It can be seen from Table 4 that on artificial data sets, after a few iterations, the \(\ell _2\) norm of the difference between the weight vectors obtained by successive iterations is rapidly reduced to 0 or very close to 0, i.e., the weight vector converges to the final result \(\varvec{\lambda }^{*}\).

Table 4 Classification error rate and change of \(\varvec{\lambda }\) after each iteration on three artificial data sets

As seen from the first row of Table 4, on the Xclara data set, when the weight of each clustering result is initialized to be equal, the classification error rate of the label inference result on the corresponding graph is \(2.34\times 10^{-3}\), which is very close to 0. As the number of iterations increases, the weights of the former five clustering results generated by the k-means algorithm continuously increase, while the weights of the latter five clustering results generated by the DBSCAN algorithm continuously decrease. During the iterative learning, the classification error rate does not decrease significantly. This is explained by the clustering results themselves: the former five clustering results generated by the k-means algorithm on the Xclara data set correctly capture the data distribution, while the latter five clustering results generated by the DBSCAN algorithm assign almost all the samples to one cluster, which is trivial and carries no data distribution information. Therefore, the classification error rate does not change significantly during the iterations. It can be seen from the second and third rows of Table 4 that, as the number of iterations increases, the weights of the clustering results are continuously adjusted and the classification error gradually decreases to 0 or very close to 0.

Recall that the error rate of the result of the label inference is the most credible criterion to evaluate the quality of the graph. Therefore, we can conclude that in the three toy examples, the quality of the graph is gradually improved during the iterative process.
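The iterative process discussed above can be summarized by the following simplified sketch of the outer loop (a stand-in for Algorithm 1, not a faithful reproduction): graph construction from the weighted clustering results and label inference alternate, and the loop stops when \(\Vert \varvec{\lambda }_{t}-\varvec{\lambda }_{t-1}\Vert _2\) falls below a tolerance or a maximum number of iterations is reached. The weight-update step is left as a placeholder because formula (30) is not reproduced here.

```python
# Simplified sketch of the outer alternation (a stand-in for Algorithm 1, not a
# faithful reproduction). `build_graph`, `label_inference` and `update_weights`
# are caller-supplied placeholders for the corresponding steps of the framework.
import numpy as np

def gssl_iqgd(partitions, Y, labeled_idx, build_graph, label_inference,
              update_weights, max_iter=50, tol=1e-6):
    m = len(partitions)
    lam = np.full(m, 1.0 / m)                         # equal initial weights
    F = None
    for _ in range(max_iter):                         # bounded number of iterations
        W = build_graph(partitions, lam)              # e.g. the fused co-association graph
        F = label_inference(W, Y, labeled_idx)        # e.g. Harmonic or LLGC on W
        lam_new = update_weights(partitions, F, Y, labeled_idx)
        converged = np.linalg.norm(lam_new - lam) < tol   # ||lambda_t - lambda_{t-1}||_2
        lam = lam_new
        if converged:
            break
    return F, lam
```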

These three toy examples illustrate the effectiveness of the proposed framework and explain why it works well. In the next section, we will evaluate three algorithms under the proposed framework through a large number of comparative experiments on benchmark data sets.

4.2 Compared with GSSL based on static graph construction methods

In this section, a large number of comparison experiments are conducted to verify the effectiveness of the proposal from two perspectives. Specifically, the experiments in this section are designed around the following two questions. First, compared with the commonly used static graph construction methods, can the proposed IQGD method construct better graphs and thus obtain better GSSL results? Second, is the proposed GSSL-IQGD framework a general framework, i.e. can different GSSL methods be embedded into it to improve their performance?

4.2.1 Data sets

A total of 10 data sets are used in this experiment, and their basic information is shown in Table 5. Among these 10 data sets, the number of samples ranges from several hundred to thirty-five thousand, the number of features ranges from several to more than seven hundred, and both binary and multi-class classification tasks are included.

Table 5 Basic information of 10 benchmark data sets

For the first 8 data sets, 10 different ratios \(\{1\%,2\%,\cdots ,10\%\}\) of samples are selected as labeled samples, and for the last 2 data sets, 5 different numbers \(\{1, 5, 10, 15, 20\}\) of samples are selected randomly from each class as labeled samples. The rest of the samples are used as unlabeled samples. It should be pointed out that the number of samples on D9 and D10 is relatively large (an order of magnitude larger than on the first 8 data sets). If the ratio of labeled samples were still set to \(\{1\%, 2\%, \cdots , 10\%\}\), the total number of labeled samples would be relatively large, which contradicts the idea of semi-supervised learning. For every ratio (or number) of labeled samples, the labeled samples are randomly selected so that their class proportions are equal (or approximately equal) to those of the whole data set. These experiments are repeated 10 times.
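For reference, a small sketch of how such class-proportional labeled sets could be drawn and repeated over trials is given below; the helper name is hypothetical and scikit-learn's stratified splitter is only one possible choice.

```python
# Sketch (hypothetical helper, not the authors' code): drawing labeled sets
# whose class proportions approximately match the whole data set, repeated
# over several random trials with scikit-learn's stratified splitter.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

def labeled_splits(y, ratio, n_trials=10, seed=0):
    sss = StratifiedShuffleSplit(n_splits=n_trials, train_size=ratio,
                                 random_state=seed)
    X_dummy = np.zeros((len(y), 1))    # the splitter only needs the sample count
    # each "train" index set plays the role of the labeled set; the rest are unlabeled
    return [labeled for labeled, _ in sss.split(X_dummy, y)]

# e.g. labeled_splits(y, ratio=0.01) for the 1% setting on D1-D8; for D9/D10 a
# fixed per-class count would be drawn instead of a ratio.
```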

4.2.2 Comparison methods and experimental setting

A. Label inference methods for comparison

The focus of this paper is how to construct a high-quality graph to improve the performance of GSSL methods. For this goal, this paper proposes a framework for dynamically improving the quality of the graph for GSSL. In order to verify the effectiveness of the proposed IQGD method, three of the most representative label inference methods in GSSL, namely Harmonic (Zhu et al. 2003), LLGC (Zhou et al. 2003) and SSC-GCN (Kipf and Welling 2017), are selected as comparison methods. These GSSL methods serve two purposes. First, they are combined with various representative graph construction methods to serve as comparison methods. Second, they are embedded into the proposed framework to verify whether the proposed framework can construct higher-quality graphs and obtain better label inference results.

The first two are traditional methods, and the third method is based on graph neural networks. The Harmonic method can only deal with the binary classification problem; in this experiment, the "One vs Rest" strategy is employed to extend the Harmonic method to multi-class classification tasks. For the LLGC method, the hyper-parameter \(\alpha \) is set to 0.99 throughout this experiment, as recommended in the literature (Zhou et al. 2003). For the SSC-GCN method, a two-layer neural network is used, the same as in the literature (Kipf and Welling 2017). Unlike the literature (Kipf and Welling 2017), no validation set is used to assist training, because the number of labeled samples is limited in semi-supervised learning.
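For completeness, the two classical label inference steps can be written in closed form as in the original papers. The sketch below (illustrative variable names, dense matrices for brevity) shows the harmonic solution of Zhu et al. (2003) and the LLGC solution of Zhou et al. (2003) with \(\alpha =0.99\); encoding the labels one-hot per class also realizes the one-vs-rest extension of the harmonic method.

```python
# Sketch of the two classical label inference steps used for comparison
# (dense matrices, illustrative names): the harmonic solution of Zhu et al.
# (2003) and the LLGC closed form of Zhou et al. (2003) with alpha = 0.99.
import numpy as np

def harmonic(W, Y_l, labeled_idx):
    """Harmonic solution f_u = (D_uu - W_uu)^{-1} W_ul Y_l.
    Y_l is the (l x c) one-hot label matrix of the labeled samples; using one
    one-hot column per class realises the one-vs-rest multi-class extension."""
    n = W.shape[0]
    u = np.setdiff1d(np.arange(n), labeled_idx)      # indices of unlabeled samples
    L = np.diag(W.sum(axis=1)) - W                   # graph Laplacian D - W
    F_u = np.linalg.solve(L[np.ix_(u, u)], W[np.ix_(u, labeled_idx)] @ Y_l)
    return u, F_u                                    # soft labels of unlabeled samples

def llgc(W, Y, alpha=0.99):
    """LLGC closed form F = (I - alpha*S)^{-1} Y with S = D^{-1/2} W D^{-1/2}.
    Y is the (n x c) label matrix, one-hot for labeled rows and zero otherwise;
    the constant (1 - alpha) factor is dropped since it does not change argmax."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d) + 1e-12)
    return np.linalg.solve(np.eye(W.shape[0]) - alpha * S, Y)

# The predicted class of each sample is the argmax of the corresponding row.
```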

B. Graph construction methods for comparison

As mentioned above, the focus of this paper is how to construct a high-quality graph. In order to verify the effectiveness of the proposed IQGD method, 4 commonly used graph construction methods are selected as comparison methods in this experiment: the kNN graph, see formula (8); the \(b-matching\) graph, see formula (10); the \(\ell _{1}\) graph, see formulas (12) and (14); and the LRR graph, see formulas (15) and (16).

Among the 4 methods, the former 2 are graph construction methods based on a distance metric. The kNN graph is the simplest but most frequently used graph construction method in GSSL. The \(b-matching\) graph is a regularized neighbor graph in which every node has the same degree b, which can overcome the adverse effects caused by large differences between node degrees. The latter 2 methods are based on data representation; they simultaneously learn the adjacency structure and the edge weight matrix of the graph. The \(\ell _{1}\) graph learns the linear representation coefficients for every sample individually with a sparse regularization term, while the LRR graph learns the linear representation coefficients for all samples at the same time with a low-rank regularization term.
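As a concrete point of reference for the simplest comparison graph, the following sketch builds a Gaussian-weighted kNN graph; the paper's exact formula (8) and the parameter grid of Table 6 are not reproduced, and the symmetrization rule is an assumption.

```python
# Sketch of a standard Gaussian-weighted kNN graph, the simplest of the
# comparison graphs; the exact formula (8) and the parameter grid of Table 6
# are not reproduced, and the symmetrisation rule below is an assumption.
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_graph(X, k=10, sigma=1.0):
    D = kneighbors_graph(X, n_neighbors=k, mode="distance").toarray()
    W = np.where(D > 0, np.exp(-(D ** 2) / (2 * sigma ** 2)), 0.0)
    return np.maximum(W, W.T)   # keep an edge if either endpoint selects it
```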

The parameter settings for the 4 kinds of graph construction methods are given in Table 6. As seen from the table, there are a total of \(3+3+1+1=8\) graph construction methods. Combined with the 3 label inference methods, \(3\times 8=24\) comparison methods are obtained in the experiment. For the last two data sets, only the 3 kinds of kNN graphs are used for comparison, because constructing the remaining 5 kinds of graphs is very time-consuming: each of them cannot be constructed within 15 days.

Table 6 The setting of four graph construction methods
C. The proposed methods

The setting of the proposed method in this paper includes three aspects, i.e., the selection of the label inference algorithm, the way to generate clustering results and the setting of the hyper-parameters.

First, for the selection of the label inference algorithm \(A_{\rm{label \, inference}}\), 3 label inference algorithms, namely Harmonic (Zhu et al. 2003), LLGC (Zhou et al. 2003) and SSC-GCN (Kipf and Welling 2017), are selected and embedded in the GSSL-IQGD framework. As a result, 3 corresponding methods, Harmonic-IQGD, LLGC-IQGD and SSC-GCN-IQGD, are obtained. By comparing Harmonic-IQGD, LLGC-IQGD and SSC-GCN-IQGD with Harmonic, LLGC and SSC-GCN respectively, the effectiveness of the IQGD method can be directly observed.

Second, k-means (Jain 2010) and DBSCAN (Ester et al. 1996) are employed to generate the clustering results, and their settings are shown in Table 7. According to these settings, a total of \(50+12=62\) clustering results are obtained for each data set, i.e. in Algorithm 1, \(m=62\). These clustering results are fused for graph construction, and their fusion weights are dynamically adjusted to improve the quality of the graph.

Table 7 The setting of k-means and DBSCAN

In practice, if there is more prior knowledge about the data distribution, choosing the appropriate clustering method can achieve better results. If there is little prior knowledge of the data distribution, different kinds of clustering algorithms can be used to cover as many kinds of data distribution as possible. Regardless of how the clustering results are generated, the proposed method can adaptively fuse them to construct a high-quality graph for GSSL. In this experiment, only two kinds of classical clustering methods are selected to discover the potential data distribution, and they are enough to verify the effectiveness of the proposed GSSL-IQGD framework.

Third, the simplest settings are used for the hyper-parameters throughout this experiment. In the Harmonic-IQGD, LLGC-IQGD and SSC-GCN-IQGD methods, \(\gamma _{\rm{smooth}}=\gamma _{\rm{ml}}=\gamma _{\rm{cl}}=0.5\) is employed for updating the weight vector \(\varvec{\lambda }\). The relative magnitudes of \(\gamma _{\rm{fit}}\) and \(\gamma _{\rm{smooth}}\) in the Harmonic-IQGD and LLGC-IQGD methods are set to be the same as in their counterparts, i.e. Harmonic and LLGC, respectively.

4.2.3 Experimental results and analysis

Tables 8–15 show the classification error rate for the proposed methods and comparison methods on the first 8 data sets.

Table 8 Data set D1 classification error rate \((\%)\)

In Tables 8–15, there are 10 columns in every table, and each column corresponds to a ratio of labeled samples. Each row corresponds to a method, and a total of 27 methods are compared.

These 27 methods can be divided into 3 groups. The first group (the 1st–9th rows in each table) contains the methods obtained by combining Harmonic with different graph construction methods. The second group (the 10th–18th rows) contains the methods obtained by combining LLGC with different graph construction methods. The third group (the 19th–27th rows) contains the methods obtained by combining SSC-GCN with different graph construction methods.

In the three groups, the Harmonic-IQGD, LLGC-IQGD and SSC-GCN-IQGD are obtained by embedding the corresponding GSSL methods into the proposed GSSL-IQGD framework. In each group, the method of graph label inference is the same, and the only difference is in the method of graph construction. Therefore, it is straightforward to demonstrate the effectiveness of the proposal by comparing the Harmonic-IQGD, LLGC-IQGD and SSC-GCN-IQGD with their counterparts, respectively.

In each table, the ranks of the 3 methods under the proposed GSSL-IQGD framework and of the top 3 methods are marked with digital superscripts. These ranks allow a direct comparison of the performance of each method.

Table 8 records the classification error rates of the 27 methods on the D1 data set. The following results can be drawn directly from this table.

  1. At labeled ratios of \(2\%\), \(6\%\), \(7\%\), \(8\%\) and \(9\%\), the Harmonic-IQGD method defeats its counterparts with 8 different graph construction methods; at labeled ratios of \(1\%\), \(4\%\), \(5\%\) and \(10\%\), the Harmonic-IQGD defeats its 7 counterparts and is defeated by the Harmonic method with the C3M graph; at the labeled ratio of \(3\%\), the Harmonic-IQGD defeats its 5 counterparts and is defeated by the Harmonic method with the \(b-matching\) graph using 3 different distance metrics. To sum up, when the label inference method is fixed to Harmonic, the Harmonic-IQGD method obtained by embedding Harmonic into the proposed GSSL-IQGD framework achieves better performance than the Harmonic method using the 8 different graphs in most cases. These results show that the proposed IQGD method can indeed construct a higher-quality graph than the 8 different graph construction methods.

  2. The LLGC-IQGD method defeats its counterparts with 8 different graph construction methods at all 10 labeled ratios. These results indicate that when the label inference method is fixed to LLGC, compared with the 8 different graph construction methods, the IQGD method can construct a better graph and thus achieve better label inference performance.

  3. At all 10 labeled ratios, the SSC-GCN-IQGD defeats its 8 counterparts with a significant advantage. These results indicate that when the label inference method is fixed to SSC-GCN, compared with the 8 different graph construction methods, the proposed IQGD method can also construct a better graph, and better label inference performance can then be achieved. It can also be seen from these results that, as one of the most advanced methods at present, SSC-GCN also depends heavily on the quality of the graph, and by embedding it into the proposed framework its performance can also be improved significantly.

  4. The Harmonic-IQGD, LLGC-IQGD and SSC-GCN-IQGD methods take the top 3 places among the 27 methods at labeled ratios of \(2\%\), \(6\%\), \(7\%\), \(8\%\) and \(9\%\), and at the remaining 5 labeled ratios two of them rank in the top 3. These results show that by embedding Harmonic, LLGC and SSC-GCN into the proposed GSSL-IQGD framework, the performance of all three methods is improved significantly. That is to say, the proposed method is a general framework.

Table 9 Data set D2 classification error rate \((\%)\)

The comparison results on the D2, D3, D4 and D5 data sets, shown in Tables 9, 10, 11 and 12, are similar to those on the D1 data set and are summarized as follows; the results on the individual data sets are not described separately.

  1. When the label inference method is fixed to Harmonic, the Harmonic-IQGD method obtained by embedding Harmonic into the proposed GSSL-IQGD framework achieves better performance than the Harmonic method using 8 different graphs in most cases. The results for the LLGC and SSC-GCN methods are virtually the same as for Harmonic. These results show that, compared with the 8 representative graph construction methods, the proposed IQGD method is an effective graph construction method.

  2. The Harmonic-IQGD, LLGC-IQGD and SSC-GCN-IQGD methods take the top 3 places, or two of them rank in the top 3, among the 27 methods in almost all cases. These results show that the proposed method is a general framework, since different GSSL methods improve their performance when embedded into it.

Table 10 Data set D3 classification error rate \((\%)\)
Table 11 Data set D4 classification error rate \((\%)\)
Table 12 Data set D5 classification error rate \((\%)\)

For the D6, D7 and D8 data sets, as shown in Tables 13, 14 and 15, the comparison results are summarized as follows.

  1. The Harmonic-IQGD, LLGC-IQGD and SSC-GCN-IQGD methods defeat their counterparts, the Harmonic, LLGC and SSC-GCN with 8 different graphs, at all 10 ratios of labeled samples, respectively. These results show that, compared with the 8 representative graph construction methods, the proposed IQGD method can construct a better graph and, as a result, the performance of the corresponding GSSL method is improved.

  2. The 3 methods under the GSSL-IQGD framework, Harmonic-IQGD, LLGC-IQGD and SSC-GCN-IQGD, take the top 3 places among the 27 methods at all 10 ratios of labeled samples. These results show that the proposed framework can be applied to all three GSSL methods and the expected performance gains are achieved.

Table 13 Data set D6 classification error rate \((\%)\)
Table 14 Data set D7 classification error rate \((\%)\)
Table 15 Data set D8 classification error rate \((\%)\)

Tables 16 and 17 show the classification error rate for the proposed methods and comparison methods on the last 2 data sets. Each table contains 5 columns and 12 rows: each column corresponds to a number of labeled samples and each row corresponds to a method. Similar to Tables 8–15, these 12 methods can also be divided into 3 groups.

Table 16 Data set D9 classification error rate \((\%)\)

Table 16 shows the classification error rate of the 12 methods on D9 data set. The following conclusions can be drawn from the table.

  1. At 5 different numbers of labeled samples, the Harmonic-IQGD, LLGC-IQGD and SSC-GCN-IQGD under the GSSL-IQGD framework take the top 3 places among the 12 methods.

  2. As a direct result, the Harmonic-IQGD, LLGC-IQGD and SSC-GCN-IQGD methods defeat the Harmonic, LLGC and SSC-GCN methods using the other three graph construction methods, respectively. These results show that, compared with the three kNN graphs, the proposed IQGD method can construct a better graph and thus obtain a better GSSL result.

Table 17 Data set D10 classification error rate \((\%)\)

Table 17 shows the classification error rate of the 12 methods on D10 data set. The following conclusions can be drawn from the table.

  1. Although the Harmonic-IQGD ranks in the top 3 only once among the 12 methods, it ranks first in its group and is significantly better than the other methods in the same group. These results show that, when the label inference method is fixed to Harmonic, the proposed IQGD method can construct a better graph than the three kNN graph construction methods.

  2. The methods in the second group perform well on this data set. When the number of labeled samples is 10 and 50, the LLGC-IQGD method ranks first in this group, and for the remaining numbers of labeled samples, the LLGC method with the three different kNN graphs performs better. These results show that when the number of labeled samples is smaller, the proposed IQGD method outperforms the other three graph construction methods for GSSL.

  3. The SSC-GCN-IQGD method ranks first in its group at the first 4 numbers of labeled samples and ranks second in the last case. These results show that, when the label inference method is fixed to SSC-GCN, the proposed IQGD method can construct a better graph and thus improve the performance of the SSC-GCN method.

To exhibit the performance of each method more intuitively, Fig. 5 shows the mean and standard deviation of the rankings of the 27 methods over the whole experiment. The mean and standard deviation of the ranking of each method are computed from the \(8 \times 10=80\) ranking results (the first 8 data sets and the 10 labeled ratios on each data set). In Fig. 5, each method corresponds to a line segment: the midpoint of the segment represents the mean of the method's rank, and the length of the segment represents twice the standard deviation. The methods in the 3 groups are marked with 3 different colors and shapes.
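The ranking statistics plotted in Fig. 5 can be reproduced in a few lines; the sketch below assumes a hypothetical array `errors` of shape (80, 27) holding the classification error rates of the 27 methods over the \(8\times 10=80\) settings.

```python
# Sketch: mean and standard deviation of each method's rank over the
# 8 data sets x 10 labeled ratios. `errors` is a hypothetical array of shape
# (80, 27) holding the classification error rates of the 27 methods.
import numpy as np
from scipy.stats import rankdata

def rank_summary(errors):
    ranks = np.apply_along_axis(rankdata, 1, errors)   # rank 1 = lowest error
    return ranks.mean(axis=0), ranks.std(axis=0)
```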

As seen from Fig. 5, among all 27 methods, LLGC-IQGD, SSC-GCN-IQGD and Harmonic-IQGD are the top 3 methods and rank significantly better than the 4th-ranked method, LLGC-C3M. Moreover, within each of the three groups, Harmonic-IQGD, LLGC-IQGD and SSC-GCN-IQGD are the best methods and rank significantly better than the 2nd-ranked method of their respective group.

Fig. 5 Mean and standard deviation of rankings of 27 methods on the first 8 data sets

Based on the above experimental results, it can be concluded that, compared with the commonly used graph construction methods, the proposed IQGD method can construct a better graph for GSSL. At the same time, the proposed GSSL-IQGD is a general framework, and the performance of different GSSL methods can be improved by embedding them into it.

4.3 Compared with GSSL based on dynamic graph construction methods

4.3.1 Data sets

In this section, the first 8 data sets shown in Table 5 are used in the experiment. The ratio of labeled samples and the division into labeled and unlabeled samples are the same as in the experiment in Sect. 4.2.

4.3.2 Comparison methods and experimental setting

In this experiment, 3 different GSSL methods based on dynamic graph construction methods are selected for comparison. The 3 methods and their settings are described as follows.

Table 18 Data sets D1–D4 classification error rate \((\%)\)

The STSSL-S\(^3\)R (Semi-Supervised Sparse Representation) and the STSSL-S\(^2\)LRR (Semi-Supervised Low Rank Representation) are two specific methods based on the framework proposed in the literature (Li et al. 2015). In the STSSL-S\(^3\)R (STSSL-S\(^2\)LRR) method, the semi-supervised sparse (low rank) representation based graph construction and the Harmonic-like label inference are integrated into a unified optimization framework, and the alternating minimization algorithm is used to solve the optimization problem. In the process of model solving, the graph constructed by semi-supervised sparse (low rank) representation is updated dynamically. In this experiment, the model hyper-parameters [\(\gamma \) and \(\lambda \), see formula (13) in the literature (Li et al. 2015)] and the optimization hyper-parameters [\(\epsilon \) and \(\rho _1\), see Algorithm 1 in the literature (Li et al. 2015)] are set to the default values in the authors' code.

The MGR-GGMC (Multiple Graph Regularized graph transduction via Greedy Gradient Max-Cut) method was proposed in literature (**u et al. 2018). In this method, the weighted sum of multiple graph smoothness terms and multiple greedy gradient max-cut based label inference are integrated into a unified optimization framework. In the process of model solving, the weights of multiple graph smoothness terms and the multiple label inference results are updated alternatively. In this method, the graph construction process essentially is a dynamic weighted fusion of multiple base graphs. And these base graphs are constructed by common used methods, for example the kNN graph. In this experiment, the hyper-parameters are set to the recommended values in the article, i.e. \(\mu =0.99\) and \(\beta =0.1\) [see Sect. 3.2 in the literature (**u et al. 2018)]. At the same time, the multiple base graphs used in MGR-GGMC method are same as the 8 different graphs used in Sect. 4.2, see Table 6 for details.

Finally, the SLLI-IQGD method described in formula (30), one of the simplest methods under the proposed framework, is used for comparison. In this experiment, the hyper-parameters of the SLLI-IQGD method are set to \(\gamma _{\rm{fit}}=\gamma _{\rm{smooth}}=\gamma _{\rm{ml}}=\gamma _{\rm{cl}}=0.5\), the same setting as in Sect. 4.1. At the same time, the multiple clustering results used in the SLLI-IQGD method are those used in the Harmonic-IQGD method in Sect. 4.2 (see Table 7).

4.3.3 Experimental results and analysis

Tables 18 and 19 show the classification error of the 3 comparison methods and the proposed SLLI-IQGD method on the D1–D4 and D5–D8 data sets, respectively. For each comparison experiment (a given data set and labeled ratio), bold marks the best result among the four methods.

The following conclusions can be drawn from Tables 18 and 19.

  1. On 6 data sets (D1 and D4–D8), the proposed SLLI-IQGD method significantly outperforms the other 3 comparison methods at all 10 ratios of labeled samples.

  2. On the D2 data set, the proposed SLLI-IQGD method outperforms the other 3 comparison methods at the first 6 labeled ratios, and it outperforms the STSSL-S\(^2\)LRR and MGR-GGMC methods at the last 4 labeled ratios.

  3. On the D3 data set, the proposed SLLI-IQGD method outperforms the other 3 comparison methods at 8 labeled ratios (all except \(1\%\) and \(5\%\)). At the labeled ratio \(1\%\), the STSSL-S\(^2\)LRR method obtains a better result than the proposed SLLI-IQGD method, and at the labeled ratio \(5\%\), the STSSL-S\(^3\)R method performs better.

  4. To sum up, the proposed SLLI-IQGD method is significantly superior to the 3 comparison methods in most cases, and its advantages are more significant when the labeled ratio is small.

Table 19 Data sets D5–D8 classification error rate \((\%)\)

The STSSL-S\(^3\)R (STSSL-S\(^2\)LRR) method realizes mutual guidance between sparse (low rank) representation based graph construction and graph based label inference, but it still makes a strong assumption about the data distribution, namely that the data distribution satisfies a subspace structure. These two methods alternately optimize the label inference result and the sparse (low rank) representation coefficient matrix; when the assumption is not satisfied, the model moves further and further in the wrong direction. Therefore, in this experiment, it is difficult for these two methods to achieve good results. Different from them, SLLI-IQGD uses different clustering methods to mine a variety of possible data distribution structures, which weakens the dependence on data distribution assumptions. At the same time, by designing a reasonable learning scheme for the weights of the multiple clustering results, the quality of the graph is dynamically improved, so a better result can be obtained.

The MGR-GGMC method integrates the smoothness terms on multiple base graphs and multiple greedy gradient max-cut based label inferences into a unified optimization problem. Compared with the proposed SLLI-IQGD method, there are three differences. First, the base graphs used in MGR-GGMC are obtained by conventional graph construction methods, while the base graphs in the proposal are obtained by different clustering methods; compared with conventional graph construction methods, clustering methods can mine the potential data distribution structure more effectively. Second, in the MGR-GGMC method label inference is performed on every base graph, whereas in the proposed method it is carried out on the fused graph; in this way, the graph obtained by fusing multiple clustering results can guide graph-based label propagation more effectively. Third, in the MGR-GGMC method the learning of the weights of the multiple base graphs is guided only by the intermediate label inference result, while in the proposed method the learning of the weights of the multiple clustering results is jointly guided by three factors: the intermediate label inference result, the "must link" supervision information and the "cannot link" supervision information. So the proposed method can learn better weights and construct a better graph. Therefore, compared with the MGR-GGMC method, the method in this paper achieves better results.

Through the experiments in this section, the following conclusion can be drawn: compared with existing GSSL methods that adopt dynamic graph construction, whether based on data representation or on the fusion of multiple graphs, the method proposed in this paper can construct a better graph and thus obtain better performance.

5 Conclusions and future work

This paper proposes a general framework named GSSL-IQGD for improving the quality of the graph in GSSL and the performance of existing GSSL methods. In this framework, the two processes, graph construction based on the weighted fusion of multiple clustering results and label inference, are integrated into a unified optimization problem. During model solving, these two processes are executed alternately and guide each other, which dynamically improves both the quality of the graph and the result of label inference. In the experiments, three toy examples first illustrate the working mechanism of the method. Then, a large number of comparative experiments verify the effectiveness of the proposed IQGD method for improving the quality of the graph in GSSL. Meanwhile, these experimental results also indicate that the proposed GSSL-IQGD method is a general framework, i.e. it can be used to improve the performance of different existing GSSL methods. Finally, the advantage of the proposed GSSL-IQGD method over other existing GSSL methods based on dynamic graph construction is verified through a large number of comparative experiments.

The method proposed in this paper is a general framework. In the experiments, three classic GSSL methods were embedded into the framework and the desired performance gains were achieved. In the future, embedding other classic GSSL methods into the proposed framework to improve their performance is worthwhile research. In addition, like most GSSL methods, the method proposed in this paper is a transductive learning method; how to extend it to inductive learning is also a meaningful research direction.