1 Introduction

High-dimensional data often lie in a union of low-dimensional subspaces in many practical scenarios. Based on this observation, subspace clustering algorithms [2] aim to partition the data such that data belonging to the same subspace are identified as one cluster. Among various subspace clustering algorithms, the ones that employ a sparsity prior, such as Sparse Subspace Clustering (SSC) [1], have been proven effective, under certain assumptions, in separating the data in accordance with the subspaces that the data lie in.

Sparse subspace clustering methods construct the sparse similarity graph by sparse representation of the data, where the vertices represent the data. Subspace-sparse representation ensures that vertices corresponding to different subspaces are disconnected in the sparse similarity graph, leading to compelling performance when spectral clustering [3] is applied to such a graph. Elhamifar and Vidal [1] prove that when the subspaces are independent or disjoint, subspace-sparse representations can be obtained by solving the canonical sparse coding problem using the data as the dictionary, under certain conditions on the rank or singular values of the data matrix and the principal angles between the subspaces. Under the independence assumption on the subspaces, low rank representation [4, 5] is also proposed to recover the subspace structures. Relaxing the assumptions on the subspaces to allow overlapping subspaces, the Greedy Subspace Clustering [6] and the Low-Rank Sparse Subspace Clustering [7] achieve subspace-sparse representation with high probability. However, their results rely on the semi-random model or the full-random model, which assumes that the data in each subspace are generated i.i.d. uniformly on the unit sphere in that subspace, as well as certain additional conditions on the size and dimensionality of the data. In addition, the geometric analysis in [8] also adopts the semi-random model and handles overlapping subspaces. Noisy SSC proposed in [9] handles noisy data that lie in disjoint or overlapping subspaces.

To avoid the non-convex optimization problem incurred by the \(\ell ^{0}\)-norm, most sparse subspace clustering or sparse graph based clustering methods use the \(\ell ^{1}\)-norm [1, 10–14] or the \(\ell ^{2}\)-norm with thresholding [15] to impose sparsity on the constructed similarity graph. In addition, the \(\ell ^{1}\)-norm has been widely used as a convex relaxation of the \(\ell ^{0}\)-norm for efficient sparse coding algorithms [16–18]. On the other hand, sparse representation methods such as [19] that directly optimize an objective function involving the \(\ell ^{0}\)-norm demonstrate compelling performance compared to their \(\ell ^{1}\)-norm counterparts. It remains an interesting question whether sparse subspace clustering equipped with the \(\ell ^{0}\)-norm, the original sparsity measure that counts the number of nonzero elements, has an advantage in obtaining the subspace-sparse representation. In this paper, we propose \(\ell ^{0}\)-induced sparse subspace clustering, which employs the \(\ell ^{0}\)-norm to enforce the sparsity of the representation, and present a novel algorithm, \(\hbox {A}\ell ^{0}\)-SSC, for its optimization. This paper offers two major contributions:

  1.

    We propose the \(\ell ^{0}\)-induced Sparse Subspace Clustering method and prove that it almost surely renders the desired subspace-sparse representation. We present the theory of \(\ell ^{0}\)-induced sparse subspace clustering (\(\ell ^{0}\)-SSC), which shows that \(\ell ^{0}\)-SSC gives the subspace-sparse representation almost surely under a minimum assumption on the underlying subspaces the data lie in, i.e. that the subspaces are distinct. To the best of our knowledge, this is the mildest assumption on the subspaces compared to most existing sparse subspace clustering methods. Furthermore, our theory presented in Theorem 1 assumes that the data in each subspace are generated i.i.d. from an arbitrary continuous distribution supported on that subspace, which is milder than the semi-random model in [6, 7], which assumes that the data are i.i.d. uniformly distributed on the unit sphere in each subspace. Moreover, we prove that under the general conditions of Theorem 1, finding the subspace representation cannot be computationally cheaper than solving the corresponding \(\ell ^{0}\) problem. In fact, if there is an algorithm that obtains the subspace representation for each data point, then it can be used to obtain the optimal solution to the \(\ell ^{0}\) problem of \(\ell ^{0}\)-SSC with an additional step of polynomial complexity.

  2.

    We propose Approximate \(\ell ^{0}\)-SSC to efficiently obtain an approximate solution to the problem of \(\ell ^{0}\)-SSC with a theoretical guarantee. The optimization problem of \(\ell ^{0}\)-SSC is NP-hard and it is impractical to directly pursue the global optimal solution. Instead, we develop an approximate algorithm named Approximate \(\ell ^{0}\)-SSC (\(\hbox {A}\ell ^{0}\)-SSC), which obtains a sub-optimal solution for \(\ell ^{0}\)-SSC by the proximal gradient descent method with a theoretical guarantee. Under certain assumptions on the sparse eigenvalues of the data, the sub-optimal solution by \(\hbox {A}\ell ^{0}\)-SSC is a critical point of the original objective, and a bound for the \(\ell ^{2}\)-distance between such a sub-optimal solution and the global optimal solution is given. It should be emphasized that the techniques we develop to derive this bound could be applied to more general sparse coding problems optimized by proximal gradient descent, so as to bound the gap between the sub-optimal solution and the global solution of the associated \(\ell ^{0}\) problem.

Similar to SSC, the sub-optimal solution by \(\hbox {A}\ell ^{0}\)-SSC is used to build a sparse similarity matrix upon which spectral clustering is performed to obtain the data clusters. Extensive experimental results on various real data sets show the impressive performance of \(\hbox {A}\ell ^{0}\)-SSC compared to other competing clustering methods including SSC.

The remaining parts of the paper are organized as follows. The representative subspace clustering method, SSC [1], is introduced in the next subsection. The theoretical property of \(\ell ^{0}\)-SSC, the detailed formulation of \(\hbox {A}\ell ^{0}\)-SSC, and the theoretical guarantee on the obtained sub-optimal solution are then presented. We then show the clustering performance of the proposed models, and conclude the paper. We use bold letters for matrices and vectors, and regular lowercase letters for scalars throughout this paper. A bold letter with a superscript indicates the corresponding column of a matrix, and a bold letter with a subscript indicates the corresponding element of a matrix or vector. \(\Vert \cdot \Vert _F\) and \(\Vert \cdot \Vert _p\) denote the Frobenius norm and the \(\ell ^{p}\)-norm, and \(\mathrm{diag}(\cdot )\) indicates the diagonal elements of a matrix.

1.1 Sparse Subspace Clustering and \(\ell ^{1}\)-Graph

SSC [1] and the \(\ell ^{1}\)-graph [10, 11] employ the broadly used sparse representation [13, 20–22] of the data to construct the sparse similarity graph. With the data \(\varvec{X} = [\mathbf {x}_1,\ldots ,\mathbf {x}_n] \in \mathbb {R}^{d \times n}\), where n is the size of the data and d is the dimensionality, SSC and the \(\ell ^{1}\)-graph solve the following sparse coding problem:

$$\begin{aligned} \mathop {\min }\limits _{{\varvec{\alpha }}} {\Vert {{\varvec{\alpha }}} \Vert _1}\quad s.t.\;{\varvec{X}} = {{\varvec{X}}}{\varvec{\alpha }},\,\, \mathrm{diag}(\varvec{\alpha }) = \mathbf {0}\end{aligned}$$
(1)

Both SSC and the \(\ell ^{1}\)-graph construct a sparse similarity graph \(G = ( {{\varvec{X}},{\mathbf {W}}} )\) where the data \({\varvec{X}}\) are represented as vertices, \({\mathbf {W}}\) of size \(n \times n\) is the weighted adjacency matrix of G, and \({\mathbf {W}}_{ij}\) indicates the edge weight, i.e. the similarity between \(\mathbf {x}_i\) and \(\mathbf {x}_j\). \({\mathbf {W}}\) is a sparse similarity matrix set from the sparse codes \(\varvec{\alpha }\) as below:

$$\begin{aligned} {{\mathbf {W}}_{ij}}=({|{\varvec{\alpha }_{ij}}|+|{\varvec{\alpha }_{ji}}|})/{2} \quad 1 \le i,j \le n \end{aligned}$$
(2)

There is an edge between \(\mathbf {x}_i\) and \(\mathbf {x}_j\) if and only if \({\mathbf {W}}_{ij} \ne 0\). Furthermore, if the underlying subspaces that the data lie in are independent or disjoint, Elhamifar and Vidal [1] prove that the optimal solution to (1) is the subspace-sparse representation under several additional conditions. The sparse representation \(\varvec{\alpha }^i\), namely the sparse code of the datum \(\mathbf {x}_i\), is called a subspace-sparse representation if its nonzero elements correspond to data points in the same subspace as \(\mathbf {x}_i\). Therefore, vertices corresponding to different subspaces are disconnected in the sparse similarity graph. With the subsequent spectral clustering [3] applied to such a sparse similarity graph, compelling clustering performance is achieved. Allowing some tolerance for inexact representation, robust sparse subspace clustering methods such as [9, 23] turn to solve the following Lasso-type problem for SSC and the \(\ell ^{1}\)-graph:

$$\begin{aligned} \mathop {\min }\limits _{{\varvec{\alpha }}} {\Vert {{\varvec{\alpha }}} \Vert _1}\quad s.t.\;\Vert {\varvec{X}} - {{\varvec{X}}}{\varvec{\alpha }}\Vert _F \le \delta ,\,\, \mathrm{diag}(\varvec{\alpha }) = \mathbf {0}\end{aligned}$$

which is equivalent to the following problem

$$\begin{aligned} \mathop {\min }\limits _{{\varvec{\alpha }}} {\Vert \varvec{X}- \varvec{X}\varvec{\alpha }\Vert _F^2 + {\lambda _{\ell ^{1}}}\Vert {\varvec{\alpha }}\Vert _1} \quad s.t. \,\, \mathrm{diag}(\varvec{\alpha }) = \mathbf {0}\end{aligned}$$
(3)

where \({\lambda _{\ell ^{1}}}>0\) is a weighting parameter for the \(\ell ^{1}\) term.
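For concreteness, the following sketch illustrates how the pipeline of this subsection could be assembled: the Lasso-type problem (3) is solved column by column with an off-the-shelf solver, the resulting sparse codes are symmetrized into the similarity matrix of (2), and spectral clustering is applied to the result. This is only an illustration under our own naming; it uses scikit-learn's Lasso and SpectralClustering rather than the solvers used in [1, 10, 11], and scikit-learn's internal scaling of the quadratic term means that the parameter lam matches \(\lambda _{\ell ^{1}}\) in (3) only up to a constant factor.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def ssc_l1_codes(X, lam=0.1):
    """Column-wise Lasso approximation of problem (3); X is (d, n) with unit-norm columns."""
    d, n = X.shape
    alpha = np.zeros((n, n))
    for i in range(n):
        idx = [j for j in range(n) if j != i]          # enforce diag(alpha) = 0
        lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
        lasso.fit(X[:, idx], X[:, i])
        alpha[idx, i] = lasso.coef_
    return alpha

def similarity_from_codes(alpha):
    """Symmetric sparse similarity matrix W of Eq. (2)."""
    W = 0.5 * (np.abs(alpha) + np.abs(alpha).T)
    np.fill_diagonal(W, 0.0)
    return W

def spectral_clustering_on_graph(W, n_clusters):
    """Spectral clustering [3] on the precomputed sparse similarity graph."""
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed", random_state=0)
    return sc.fit_predict(W)
```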

2 \(\ell ^{0}\)-Induced Sparse Subspace Clustering

In this paper, we propose \(\ell ^{0}\)-induced Sparse Subspace Clustering (\(\ell ^{0}\)-SSC), which solves the following \(\ell ^{0}\) problem:

$$\begin{aligned} \mathop {\min }\limits _{{\varvec{\alpha }}} {\Vert {{\varvec{\alpha }}} \Vert _0}\quad s.t.\;{\varvec{X}} = {{\varvec{X}}}{\varvec{\alpha }},\,\, \mathrm{diag}(\varvec{\alpha }) = \mathbf {0}\end{aligned}$$
(4)

The solution to the above problem is used to build a sparse similarity graph for clustering. We now give the theorem on \(\ell ^{0}\)-induced almost surely subspace-sparse representation; the proof is presented in the supplementary document of this paper.

Theorem 1

(\(\ell ^{0}\)-Induced Almost Surely Subspace-Sparse Representation) Suppose the data lie in a union of K distinct subspaces \(\{\mathcal {S}_k\}_{k=1}^K\) of dimensions \(\{d_k\}_{k=1}^K\), i.e. \(\mathcal {S}_k \ne \mathcal {S}_{k'}\) for \(k \ne k'\). Let \(n_k\) denote the number of data points that belong to subspace \(\mathcal {S}_k\), with \(\sum \limits _{k=1}^K n_k = n\). When \(n_k \ge d_k+1\), if the data belonging to each subspace are generated i.i.d. from an arbitrary unknown continuous distribution supported on that subspace, then with probability 1, the optimal solution to (4), denoted by \(\varvec{\alpha }^*\), is a subspace-sparse representation, i.e. the nonzero elements in \({\varvec{\alpha }^*}^i\) correspond to the data that lie in the same subspace as \(\mathbf {x}_i\).

Proof

(Sketch of the proof) It can be verified that the probability measure of “inter-subspace hyperplane” is 0, and we defer the details to the supplementary.

According to Theorem 1, \(\ell ^{0}\)-SSC (4) obtains the subspace-sparse representation almost surely under the minimum assumption on the subspaces, i.e. it only requires that the subspaces be distinct. To the best of our knowledge, this is the mildest assumption on the subspaces among existing sparse subspace clustering methods. Moreover, the only assumption on the data generation is that the data in each subspace are i.i.d. random samples from an arbitrary continuous distribution supported on that subspace. In light of the assumed data distribution, this assumption on the data generation is much milder than that of the semi-random model in [6–8] (note that the data can always be normalized to have unit norm and reside on the unit sphere). Table 1 summarizes the different assumptions on the subspaces and the random data generation for different subspace clustering methods, including sparse subspace clustering methods. It can be seen that \(\ell ^{0}\)-SSC has the mildest assumptions on both the subspaces and the random data generation. Note that Theorem 1 is also free from geometric assumptions such as those involving subspace incoherence in [7, 8].

Table 1. Assumptions on the subspaces and random data generation (for the randomized part of the algorithm) for different subspace clustering methods. \(D_1\) means the data in each subspace are generated i.i.d. uniformly on the unit sphere in that subspace, and \(D_2\) means the data in each subspace are generated i.i.d. from an arbitrary continuous distribution supported on that subspace. Note that \(S_1< S_2< S_3 < S_4\), \(D_1 < D_2\), where the assumption on the right hand side of < is milder than that on the left hand side. The methods that are based on these assumptions are listed as follows. \(S_1\): [4, 5]; \(S_2\): [1]; \(S_3\): [6–9]; \(D_1\): [6–8, 23].

The \(\ell ^{0}\) sparse representation problem (4) is known to be NP-hard. One may ask if there is a shortcut to the almost surely subspace-sparse representation under the conditions of Theorem 1. We show that such a shortcut is almost surely impossible. Namely, suppose there is an algorithm which, for each data point \(\mathbf {x}_i\), can find the data from the same subspace as \(\mathbf {x}_i\) that linearly represent \(\mathbf {x}_i\); then such a representation almost surely leads to the solution to the \(\ell ^{0}\) problem:

$$\begin{aligned} \mathop {\min }\limits _{{\varvec{\alpha }^i}} {\Vert {{\varvec{\alpha }^i}} \Vert _0}\quad s.t.\;{\mathbf {x}_i} = {{\varvec{X}}}{\varvec{\alpha }^i},\,\, \varvec{\alpha }_{ii} = 0 \end{aligned}$$
(5)
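To make (4) and its per-point form (5) concrete, the following brute-force sketch enumerates supports of increasing size for a single point; it is exponential in n and therefore only meant for tiny toy data (the problem is NP-hard, as noted above). The function name and the numerical tolerance are our own choices.

```python
import itertools
import numpy as np

def l0_code_bruteforce(X, i, tol=1e-8):
    """Smallest-support code for x_i in problem (5), found by exhaustive search (toy data only)."""
    d, n = X.shape
    others = [j for j in range(n) if j != i]            # alpha_ii = 0
    for k in range(1, n):
        for S in itertools.combinations(others, k):
            cols = list(S)
            coef, *_ = np.linalg.lstsq(X[:, cols], X[:, i], rcond=None)
            if np.linalg.norm(X[:, cols] @ coef - X[:, i]) < tol:
                alpha_i = np.zeros(n)
                alpha_i[cols] = coef
                return alpha_i                           # first exact fit has minimal support size
    return np.zeros(n)                                   # no exact representation exists
```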

Theorem 2

(There is “no free lunch” for obtaining subspace representation under the general conditions of Theorem 1) Under the assumptions of Theorem 1, if there is an algorithm which, for any data point \(\mathbf {x}_i \in \mathcal {S}_k\), \(1 \le i \le n, 1 \le k \le K\), can find the data from the same subspace as \(\mathbf {x}_i\) that linearly represent \(\mathbf {x}_i\), i.e.

$$\begin{aligned}&\mathbf {x}_i = \varvec{X}\varvec{\beta }\quad (\varvec{\beta }_i = 0) \end{aligned}$$
(6)

where the nonzero elements of \(\varvec{\beta }\) correspond to the data that lie in the subspace \(\mathcal {S}_k\), then, with probability 1, the solution to the \(\ell ^{0}\) problem (5) can be obtained from \(\varvec{\beta }\) in \(\mathcal {O}({\hat{n}}^3)\) time, where \(\hat{n}\) is the number of nonzero elements in \(\varvec{\beta }\).

Therefore, we have the interesting “no free lunch” conclusion: with probability 1, finding the subspace representation for each data point \(\mathbf {x}_i\) cannot be much cheaper computationally than solving the \(\ell ^{0}\) sparse representation problem (5).
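The precise construction behind the \(\mathcal {O}({\hat{n}}^3)\) step of Theorem 2 is given in the supplementary; the sketch below is only one plausible realization of such a polynomial-time refinement, assuming it amounts to restricting to \(\mathrm{supp}(\varvec{\beta })\), selecting a linearly independent subset of those columns by a rank-revealing QR (cubic in \(\hat{n}\)), and re-solving a least-squares system. The function name is hypothetical and the sketch is not claimed to reproduce the paper's exact procedure.

```python
import numpy as np
from scipy.linalg import qr

def refine_to_sparse_candidate(X, i, beta, tol=1e-8):
    """Hypothetical polynomial-time refinement of a subspace representation beta of x_i."""
    support = np.flatnonzero(np.abs(beta) > tol)         # hat{n} columns, all from the subspace of x_i
    Xs = X[:, support]
    _, R, piv = qr(Xs, mode="economic", pivoting=True)   # rank-revealing QR, O(hat{n}^3)
    r = int(np.sum(np.abs(np.diag(R)) > tol * max(Xs.shape)))
    keep = support[piv[:r]]                               # a linearly independent subset of the support
    coef, *_ = np.linalg.lstsq(X[:, keep], X[:, i], rcond=None)
    alpha_i = np.zeros(X.shape[1])
    alpha_i[keep] = coef
    return alpha_i
```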

It should be emphasized that our theoretical results on \(\ell ^{0}\)-SSC are significantly different from those in [24]. First, our results are developed under the widely used randomized subspace clustering models, while the recovered subspaces are supposed to form a minimal union-of-subspace structure in [24]. In addition, Theorem 1 shows that any global optimal solution to \(\ell ^{0}\)-SSC can almost surely recover the unknown underlying subspaces, which matters because there can be multiple globally optimal solutions to \(\ell ^{0}\)-SSC. In contrast, given an underlying unknown minimal union-of-subspace structure, [24] does not show which globally optimal solution to \(\ell ^{0}\)-SSC recovers such a minimal union-of-subspace structure.

Note that SSC-OMP [25] adopts Orthogonal Matching Pursuit (OMP) [26] to choose the neighbors of each datum in the sparse similarity graph, which can be interpreted as approximately solving the \(\ell ^{0}\) problem (5) for \(1 \le i \le n\). However, SSC-OMP does not enjoy the theoretical properties of \(\ell ^{0}\)-SSC shown above. Moreover, we give a theory about the distance between the sub-optimal solution by our \(\hbox {A}\ell ^{0}\)-SSC and the global optimal solution to the \(\ell ^{0}\)-SSC problem under assumptions on the sparse eigenvalues of the data matrix. Extensive experimental results show the significant performance advantage of \(\hbox {A}\ell ^{0}\)-SSC over SSC-OMP.

3 Approximate \(\ell ^{0}\)-SSC (\(\hbox {A}\ell ^{0}\)-SSC)

Solving the \(\ell ^{0}\)-SSC problem exactly is NP-hard; therefore, in this section we introduce an approximate algorithm for \(\ell ^{0}\)-SSC with a theoretical guarantee.

3.1 Optimization of \(\hbox {A}\ell ^{0}\)-SSC

Similar to the case of SSC and the \(\ell ^{1}\)-graph, by allowing tolerance for inexact representation, we turn to optimize the following \(\ell ^{0}\) problem for \(\ell ^{0}\)-SSC:

$$\begin{aligned} \mathop {\min }\limits _{{\varvec{\alpha }}} {\Vert \varvec{X}- \varvec{X}\varvec{\alpha }\Vert _F^2 + {\lambda }\Vert {\varvec{\alpha }}\Vert _0} \quad s.t. \,\, \mathrm{diag}(\varvec{\alpha }) = \mathbf {0}\end{aligned}$$
(7)

Problem (7) is NP-hard, and it is impractical to seek its global optimal solution. The literature extensively resorts to approximate algorithms for \(\ell ^{0}\) problems, such as Orthogonal Matching Pursuit [26] or methods that use surrogate functions [27]. In this paper we present \(\hbox {A}\ell ^{0}\)-SSC, which employs the proximal gradient descent (PGD) method to optimize (7) and obtains a sub-optimal solution with a theoretical guarantee. The sub-optimal solution is used to build a sparse similarity matrix for clustering. In the following text, a superscript with brackets indicates the iteration number of PGD. Note that problem (7) is equivalent to the set of problems

$$\begin{aligned} \mathop {\min }\limits _{{\varvec{\alpha }^i}} L(\varvec{\alpha }^i) = \Vert \mathbf {x}_i - \varvec{X}{\varvec{\alpha }^i}\Vert _2^2 + \lambda \Vert {\varvec{\alpha }^i}\Vert _0 \quad s.t.\;\varvec{\alpha }_{ii} = 0 \end{aligned}$$
(8)

for \(1 \le i \le n\). We describe PGD for optimizing \(L(\varvec{\alpha }^i)\) with respect to the sparse code of the i-th data point, i.e. \(\varvec{\alpha }^i\), for any \(1 \le i \le n\). We initialize \(\varvec{\alpha }\) as \({\varvec{\alpha }}^{(0)} = \varvec{\alpha }_{\ell ^{1}}\), where \(\varvec{\alpha }_{\ell ^{1}}\) is the sparse code matrix generated by solving (3) with \(\lambda _{\ell ^{1}} = \lambda \). The data matrix \(\varvec{X}\) is normalized such that each column has unit \(\ell ^{2}\)-norm.

In the t-th iteration of PGD, for \(t \ge 1\), gradient descent is performed on the squared loss term of \(L(\varvec{\alpha }^i)\), i.e. \(Q(\varvec{\alpha }^i) = \Vert \mathbf {x}_i - \varvec{X}{\varvec{\alpha }^i}\Vert _2^2\), to obtain

$$\begin{aligned} \tilde{\varvec{\alpha }^i}^{(t)} = {\varvec{\alpha }^i}^{(t-1)} - \frac{2}{{\tau }s} ({\varvec{X}^\top }{\varvec{X}}{{\varvec{\alpha }^i}^{(t-1)}}-{\varvec{X}^\top }{\mathbf {x}_i}) \end{aligned}$$
(9)

where \(\tau \) is any constant greater than 1 and s is the Lipschitz constant for the gradient of the function \(Q(\cdot )\); s is usually chosen as two times the largest eigenvalue of \(\varvec{X}^{\top }\varvec{X}\). Due to the sparsity of \(\varvec{\alpha }^i\), it is shown in Lemma 1 that s can be much smaller, which also ensures the shrinkage of the support of the sequence \(\{{\varvec{\alpha }^i}^{(t)}\}_t\) and the decrease of the objective function. \({\varvec{\alpha }^i}^{(t)}\) is then the solution to the following \(\ell ^{0}\) regularized problem:

$$\begin{aligned} {\varvec{\alpha }^i}^{(t)} = \mathop {\mathrm{argmin}}\limits _{\mathbf {v}} \frac{{\tau }s}{2}\Vert \mathbf {v}- \tilde{\varvec{\alpha }^i}^{(t)}\Vert _2^2 + \lambda \Vert \mathbf {v}\Vert _0 \quad s.t.\;\mathbf {v}_{i} = 0 \end{aligned}$$
(10)

It can be verified that (10) has a closed-form solution, and the j-th element of \({\varvec{\alpha }^i}^{(t)}\) is

$$\begin{aligned}&{\varvec{\alpha }^i_j}^{(t)} = \left\{ \begin{array} {lll} 0 &{}:&{} {|{\tilde{\varvec{\alpha }^i_j}^{(t)}}| < \sqrt{\frac{2\lambda }{{\tau }s}} \,\, \mathrm{or } \,\, i = j } \\ {\tilde{\varvec{\alpha }^i_j}^{(t)}} &{} : &{} \mathrm{otherwise} \end{array} \right. \end{aligned}$$
(11)

for \(1 \le j \le n\). The iterations start from \(t=1\) and continue until the sequence \(\{L({\varvec{\alpha }^i}^{(t)})\}_t\) or \(\{{\varvec{\alpha }^i}^{(t)}\}_t\) converges or the maximum iteration number is reached, at which point a sub-optimal solution is obtained. A sparse similarity matrix is built from the sub-optimal solution, upon which spectral clustering is performed to get the clustering result, as described in Algorithm 1 for \(\hbox {A}\ell ^{0}\)-SSC. The time complexity of the PGD method is \(\mathcal {O}(Mn^2)\) where M is the number of iterations (or the maximum number of iterations) of PGD.

Algorithm 1. The \(\hbox {A}\ell ^{0}\)-SSC algorithm: sparse codes by PGD, followed by spectral clustering on the resulting sparse similarity matrix.
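As an illustration of the PGD iterations (9) and (11), a vectorized sketch that updates all columns of \(\varvec{\alpha }\) at once is given below. It is a simplified stand-in for Algorithm 1 under our own naming: the \(\ell ^{1}\) initialization of Sect. 3.1 is passed in as alpha_init, s is simply set to twice the largest eigenvalue of \(\varvec{X}^{\top }\varvec{X}\) (Lemma 1 allows a smaller, column-dependent choice), and the final spectral clustering step on the resulting similarity matrix is omitted.

```python
import numpy as np

def al0_ssc_codes(X, alpha_init, lam=0.5, tau=1.5, max_iter=100, eps=1e-6):
    """PGD for A l0-SSC following (9) and (11); X is (d, n) with unit l2-norm columns."""
    G = X.T @ X                                    # Gram matrix X^T X; its i-th column is X^T x_i
    s = 2.0 * np.linalg.eigvalsh(G)[-1]            # Lipschitz constant of the gradient of Q
    thr = np.sqrt(2.0 * lam / (tau * s))           # hard-threshold level in (11)
    alpha = alpha_init.copy()
    for _ in range(max_iter):
        # gradient step (9) on Q(alpha^i) = ||x_i - X alpha^i||_2^2, for all i simultaneously
        alpha_tilde = alpha - (2.0 / (tau * s)) * (G @ alpha - G)
        # hard thresholding (11), with the diagonal forced to zero
        alpha_new = np.where(np.abs(alpha_tilde) >= thr, alpha_tilde, 0.0)
        np.fill_diagonal(alpha_new, 0.0)
        if np.max(np.abs(alpha_new - alpha)) < eps:
            alpha = alpha_new
            break
        alpha = alpha_new
    return alpha
```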

3.2 Theoretical Analysis

In this section we present the bound for the distance between the sub-optimal solution by \(\hbox {A}\ell ^{0}\)-SSC and the global optimal solution to problem (8). In Lemma 1 we first prove that the sequence \(\{{\varvec{\alpha }^i}^{(t)}\}_t\) produced by PGD has shrinking support and that the objective sequence \(\{L({\varvec{\alpha }^i}^{(t)})\}_t\) is decreasing, so that it always converges. Under certain assumptions on the sparse eigenvalues of the data \(\varvec{X}\), we show that the sub-optimal solution by \(\hbox {A}\ell ^{0}\)-SSC is actually a critical point, namely \(\{{\varvec{\alpha }^i}^{(t)}\}_t\) converges to a critical point of the objective (8), and that this sub-optimal solution and the global optimal solution to (8) are local solutions of a carefully designed capped-\(\ell ^{1}\) regularized problem. Based on the established theory in [28], which bounds the distance between different local solutions of various sparse estimation problems including the capped-\(\ell ^{1}\) problem, the bound for the \(\ell ^{2}\)-distance between the sub-optimal solution and the global optimal solution is presented in Theorem 3, again under assumptions on the sparse eigenvalues of \(\varvec{X}\). Note that our analysis is valid for all \(1 \le i \le n\).

In the following analysis, we let \(\varvec{\beta }_{{\mathbf {I}}}\) denote the vector formed by the elements of \(\varvec{\beta }\) with indices in \({\mathbf {I}}\) when \(\varvec{\beta }\) is a vector, or the matrix formed by the columns of \(\varvec{\beta }\) with indices in \({\mathbf {I}}\) when \(\varvec{\beta }\) is a matrix. Also, we let \(\mathbf {S}_i = \mathrm{supp}({\varvec{\alpha }^i}^{(0)})\) and \(|\mathbf {S}_i| = A_i\) for \(1 \le i \le n\).

Lemma 1

(Support shrinkage in the proximal iterations and sufficient decrease of the objective) When \(s > \max \{2A_i, \frac{2(1+{\lambda }A_i)}{\lambda \tau }\}\), the sequence \(\{{\varvec{\alpha }^i}^{(t)}\}_t\) generated by PGD with (9) and (11) satisfies

$$\begin{aligned} \mathrm{supp} ({\varvec{\alpha }^i}^{(t)}) \subseteq \mathrm{supp} ({\varvec{\alpha }^i}^{(t-1)}), t \ge 1 \end{aligned}$$
(12)

namely the support of the sequence \(\{{\varvec{\alpha }^i}^{(t)}\}_t\) shrinks as the iterations proceed. Moreover, the sequence of objective values \(\{L({\varvec{\alpha }^i}^{(t)})\}_t\) decreases, and the following inequality holds for \(t \ge 1\):

$$\begin{aligned}&L({\varvec{\alpha }^i}^{(t)}) \le L({\varvec{\alpha }^i}^{(t-1)}) - \frac{(\tau -1)s}{2} \Vert {\varvec{\alpha }^i}^{(t)} - {\varvec{\alpha }^i}^{(t-1)}\Vert _2^2 \end{aligned}$$
(13)

It follows that the sequence \(\{L({\varvec{\alpha }^i}^{(t)})\}_t\) converges. The above results hold for any \(1 \le i \le n\).

Before stating Lemma 2, the following definitions are introduced which are essential for our analysis.

Definition 1

(Critical points) Given a non-convex function \(f :\mathbb {R}^n \rightarrow \mathbb {R}\cup \{+\infty \}\) which is proper and lower semi-continuous:

  • for a given \(\mathbf {x}\in \mathrm{dom}\,f\), the Fréchet subdifferential of f at \(\mathbf {x}\), denoted by \(\tilde{\partial } f(\mathbf {x})\), is the set of all vectors \(\mathbf {u}\in \mathbb {R}^n\) which satisfy

    $$\begin{aligned}&\limsup \limits _{\mathbf {y}\ne \mathbf {x},\mathbf {y}\rightarrow \mathbf {x}} \frac{f(\mathbf {y})-f(\mathbf {x})-\langle \mathbf {u}, \mathbf {y}-\mathbf {x}\rangle }{\Vert \mathbf {y}-\mathbf {x}\Vert } \ge 0 \end{aligned}$$
  • The limiting-subdifferential of f at \(\mathbf {x}\), written \(\partial f(\mathbf {x})\), is defined by

    $$\begin{aligned} \partial f(\mathbf {x}) := \{\mathbf {u}\in \mathbb {R}^n :\exists \, \mathbf {y}_k \rightarrow \mathbf {x}, \, f(\mathbf {y}_k) \rightarrow f(\mathbf {x}), \, \mathbf {u}_k \in \tilde{\partial } f(\mathbf {y}_k), \, \mathbf {u}_k \rightarrow \mathbf {u}\} \end{aligned}$$

The point \(\mathbf {x}\) is a critical point of f if \(0 \in \partial f(\mathbf {x})\).

Also, we consider the following capped-\(\ell ^{1}\) regularized problem, which replaces the discontinuous \(\ell ^{0}\)-norm with the continuous capped-\(\ell ^{1}\) regularization term R:

$$\begin{aligned} \mathop {\min }\limits _{\varvec{\beta }} \Vert \mathbf {x}_i - \varvec{X}\varvec{\beta }\Vert _2^2 + \mathbf {R}(\varvec{\beta };b) \end{aligned}$$
(14)

where \(\mathbf {R}(\varvec{\beta };b) = \sum \limits _{j=1}^n R(\varvec{\beta }_j;b)\), \(R(t;b) = {\lambda }\frac{\min \{|t|,b\}}{b}\) for some \(b > 0\). It can be seen that \(R(t;b)\) approaches the weighted \(\ell ^{0}\) penalty \(\lambda \mathbf {1}_{\{t \ne 0\}}\) as \(b \rightarrow 0^{+}\).
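The capped-\(\ell ^{1}\) penalty is simple enough to state in one line of code; the small helper below (our own naming) evaluates \(R(t;b)\) and makes the limiting behavior for small b easy to check numerically.

```python
import numpy as np

def capped_l1(t, b, lam):
    """Capped-l1 penalty R(t; b) = lam * min(|t|, b) / b used in (14)."""
    return lam * np.minimum(np.abs(t), b) / b

# For small b the penalty is close to lam on nonzero entries and 0 at zero, e.g.
# capped_l1(np.array([0.0, 0.3, -2.0]), b=1e-6, lam=0.5) -> [0.0, 0.5, 0.5]
```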

Now we define the local solution of problem (14).

Definition 2

(Local solution) A vector \(\tilde{\varvec{\beta }}\) is a local solution to the problem (14) if

$$\begin{aligned}&\Vert 2{\varvec{X}^{\top }}({\varvec{X}} {\tilde{\varvec{\beta }}} - \mathbf {x}_i ) + {\dot{\mathbf {R}}} (\tilde{\varvec{\beta }};b)\Vert _2 = 0 \end{aligned}$$
(15)

where \({\dot{\mathbf {R}}(\tilde{\varvec{\beta }};b) = [\dot{R}(\tilde{\varvec{\beta }}_1;b),\dot{R}(\tilde{\varvec{\beta }}_2;b),\ldots ,\dot{R}(\tilde{\varvec{\beta }}_n;b) ]^{\top }}\).

Note that in the above definition and the following text, \(\dot{R}(t;b)\) can be chosen as any value between the right differential \(\frac{\partial R}{\partial t}(t+;b)\) (or \({\dot{R}(t+;b)}\)) and left differential \(\frac{\partial R}{\partial t}(t-;b)\) (or \({\dot{R}(t-;b)}\)).

Definition 3

(Sparse eigenvalues) The lower and upper sparse eigenvalues of a matrix \(\mathbf {A}\) are defined as

$$\begin{aligned}&\kappa _-(m) := \min _{\Vert \mathbf {u}\Vert _0 \le m; \Vert \mathbf {u}\Vert _2=1} \Vert \mathbf {A}\mathbf {u}\Vert _2^2 \quad \kappa _+(m) := \max _{\Vert \mathbf {u}\Vert _0 \le m,\Vert \mathbf {u}\Vert _2=1}\Vert \mathbf {A}\mathbf {u}\Vert _2^2 \end{aligned}$$

It is worthwhile mentioning that the sparse eigenvalues are closely related to the Restricted Isometry Property (RIP) [29] used frequently in the compressive sensing literature. Typical RIP conditions require bounds such as \(\delta _\tau +\delta _{2\tau }+\delta _{3\tau }< 1\) or \(\delta _{2\tau } < \sqrt{2}-1\) [30] for stably recovering the signal from measurements, where \(\tau \) is the sparsity of the signal and \(\delta _{\tau }=\max \{\kappa _+(\tau )-1,1-\kappa _-(\tau )\}\). Similar to [28], we use more general conditions on the sparse eigenvalues in this paper (in the sense of not requiring bounds in terms of \(\delta \)) to obtain our theoretical results. In the following text, the sparse eigenvalues \(\kappa _-\) and \(\kappa _+\) are those of the data matrix \(\varvec{X}\).
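Definition 3 can be checked directly on small matrices: since any support of size at most m is contained in a support of size exactly m, it suffices to enumerate size-m supports and take the extreme eigenvalues of the corresponding Gram blocks. The brute-force helper below (our own naming, combinatorial cost) is only intended for such sanity checks.

```python
import itertools
import numpy as np

def sparse_eigenvalues(A, m):
    """Brute-force kappa_-(m) and kappa_+(m) of Definition 3 for a small matrix A."""
    n = A.shape[1]
    k_minus, k_plus = np.inf, 0.0
    for S in itertools.combinations(range(n), m):
        cols = list(S)
        evals = np.linalg.eigvalsh(A[:, cols].T @ A[:, cols])
        k_minus = min(k_minus, evals[0])               # smallest eigenvalue over size-m supports
        k_plus = max(k_plus, evals[-1])                # largest eigenvalue over size-m supports
    return k_minus, k_plus
```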

Definition 4

(Degree of Nonconvexity of a Regularizer) For \(\kappa \ge 0\) and \(t \in \mathbb {R}\), define

$$\begin{aligned} \theta (t,\kappa ):= \sup _s \{ -\mathrm{sgn}(s-t) ({\dot{P}}(s;b) - {\dot{P}}(t;b)) - \kappa |s-t|\} \end{aligned}$$

as the degree of nonconvexity of the function P. If \(\mathbf {u}= [u_1,\ldots ,u_n]^{\top } \in \mathbb {R}^n\), then \(\theta (\mathbf {u},\kappa )=[\theta (u_1,\kappa ),\ldots ,\theta (u_n,\kappa )]\).

Note that \(\theta (t,\kappa ) = 0\) for a convex function P.

In the following lemma, we show that the sequence \(\{{\varvec{\alpha }^i}^{(t)}\}_t\) generated by \(\hbox {A}\ell ^{0}\)-SSC converges to a critical point of \(L(\varvec{\alpha }^i)\), denoted by \( {\hat{\varvec{\alpha }}}^i\), under a certain assumption on the sparse eigenvalues of \(\varvec{X}\). Therefore, the sub-optimal solution by \(\hbox {A}\ell ^{0}\)-SSC is a critical point of \(L(\varvec{\alpha }^i)\) in this case. Denote by \({\varvec{\alpha }^i}^*\) the global optimal solution to the \(\ell ^{0}\)-SSC problem (8), and let \({\hat{\mathbf {S}}_i} = \mathrm{supp}( {\hat{\varvec{\alpha }}}^i)\), \({\mathbf {S}_i^*} = \mathrm{supp}({{\varvec{\alpha }^i}^*})\). The following lemma also shows that both \( {\hat{\varvec{\alpha }}}^i\) and \({\varvec{\alpha }^i}^*\) are local solutions to the capped-\(\ell ^{1}\) regularized problem (14).

Lemma 2

For any \(1 \le i \le n\), suppose \(\kappa _-(A_i) > 0\). Then the sequence \(\{{\varvec{\alpha }^i}^{(t)}\}_t\) generated by PGD with (9) and (11) converges to a critical point of \(L(\varvec{\alpha }^i)\), which is denoted by \( {\hat{\varvec{\alpha }}}^i\). Moreover, if

$$\begin{aligned}&0< b < \min \{\min _{j \in {\hat{\mathbf {S}}_i}} | {\hat{\varvec{\alpha }}}_j^i|, \frac{\lambda }{ \max _{j \notin {\hat{\mathbf {S}}_i}} |\frac{\partial Q}{\partial {\varvec{\alpha }_j^i}}|_{\varvec{\alpha }^i = {\hat{\varvec{\alpha }}}^i}|}, \min _{j \in {\mathbf {S}_i^*}} | {\varvec{\alpha }_j^i}^*|, \frac{\lambda }{ \max _{j \notin {\mathbf {S}_i^*}} |\frac{\partial Q}{\partial {\varvec{\alpha }_j^i}}|_{\varvec{\alpha }^i = {\varvec{\alpha }^i}^*}|} \} \end{aligned}$$
(16)

(if the denominator is 0, \(\frac{\lambda }{0}\) is defined to be \(+\infty \) in the above inequality), then both \( {\hat{\varvec{\alpha }}}^i\) and \({\varvec{\alpha }^i}^*\) are local solutions to the capped-\(\ell ^{1}\) regularized problem (14).

Theorem 5 in [28] gives an estimate of the distance between two local solutions of the capped-\(\ell ^{1}\) regularized problem. Based on this result, we have the following theorem, which shows that under assumptions on the sparse eigenvalues of \(\varvec{X}\), the sub-optimal solution \({\hat{\varvec{\alpha }}}^i\) obtained by \(\hbox {A}\ell ^{0}\)-SSC has bounded \(\ell ^{2}\)-distance to \({\varvec{\alpha }^i}^*\), the global optimal solution to the original \(\ell ^{0}\) problem (8).

Theorem 3

(Sub-optimal solution is close to the global optimal solution) For any \(1 \le i \le n\), suppose \(\kappa _-(A_i) > 0\) and \(\kappa _-(|\hat{\mathbf {S}}_i \cup \mathbf {S}_i^*|)> \kappa > 0\), and b is chosen according to (16) as in Lemma 2. Then

$$\begin{aligned}&\Vert \varvec{X}( {\hat{\varvec{\alpha }}}^i - {{\varvec{\alpha }^i}^*})\Vert _2^2 \le \frac{2\kappa _-(|\hat{\mathbf {S}}_i \cup \mathbf {S}_i^*|)}{(\kappa _-(|\hat{\mathbf {S}}_i \cup \mathbf {S}_i^*|)-\kappa )^2} \\&\big (\sum \limits _{j \in \hat{\mathbf {S}}_i} (\max \{0,\frac{\lambda }{b} - {\kappa } |{\hat{\varvec{\alpha }}}^i_j - b|\})^2 +|\mathbf {S}_i^*\setminus \hat{\mathbf {S}}_i| (\max \{0, \frac{\lambda }{b} - {\kappa } b\})^2 \big ) \nonumber \end{aligned}$$
(17)

In addition,

$$\begin{aligned}&\Vert ( {\hat{\varvec{\alpha }}}^i - {{\varvec{\alpha }^i}^*})\Vert _2^2 \le \frac{2}{(\kappa _-(|\hat{\mathbf {S}}_i \cup \mathbf {S}_i^*|)-\kappa )^2} \\&\big (\sum \limits _{j \in \hat{\mathbf {S}}_i} (\max \{0,\frac{\lambda }{b} - {\kappa } |{\hat{\varvec{\alpha }}}^i_j - b|\})^2 +|\mathbf {S}_i^*\setminus \hat{\mathbf {S}}_i| (\max \{0, \frac{\lambda }{b} - {\kappa } b\})^2 \big ) \nonumber \end{aligned}$$
(18)

Remark 1

This result follows from Lemma 2 and Theorem 5 in [28]. The property of support shrinkage in Lemma 1 guarantees that \(\hat{\mathbf {S}}_i \subseteq \mathbf {S}_i\), indicating that the sub-optimal solution \( {\hat{\varvec{\alpha }}}^i\) is sparse, so we can expect \(|\hat{\mathbf {S}}_i \cup \mathbf {S}_i^*|\) to be reasonably small. Also note that the bound on the distance between the sub-optimal solution and the global optimal solution presented in Theorem 3 does not require typical RIP conditions. Moreover, when \(\frac{\lambda }{b} - {\kappa } |{\hat{\varvec{\alpha }}}^i_j - b|\) for nonzero \({\hat{\varvec{\alpha }}}_j^i\) and \(\frac{\lambda }{b} - {\kappa } b\) are no greater than 0, or they are small positive numbers, the sub-optimal solution \( {\hat{\varvec{\alpha }}}^i\) is equal to or very close to the global optimal solution.
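To make Remark 1 concrete, the helper below (hypothetical naming) evaluates the right-hand side of (18) for given \({\hat{\varvec{\alpha }}}^i\), \(\mathbf {S}_i^*\), \(\kappa _-(|\hat{\mathbf {S}}_i \cup \mathbf {S}_i^*|)\), \(\kappa \), \(\lambda \) and b; in practice \(\mathbf {S}_i^*\) and the sparse eigenvalues are of course unknown, so this is only a tool for reasoning about when the bound collapses to zero.

```python
import numpy as np

def l2_gap_bound(alpha_hat, S_star, kappa_low, kappa, lam, b):
    """Right-hand side of the l2-distance bound (18) in Theorem 3."""
    S_hat = set(np.flatnonzero(alpha_hat).tolist())
    term1 = sum(max(0.0, lam / b - kappa * abs(alpha_hat[j] - b)) ** 2 for j in S_hat)
    term2 = len(set(S_star) - S_hat) * max(0.0, lam / b - kappa * b) ** 2
    return 2.0 * (term1 + term2) / (kappa_low - kappa) ** 2
```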

The detailed proofs of the theorems and lemmas in this paper are included in the supplementary document. The theoretical results in this section are mainly derived from the optimization perspective. Due to limited space, we present an additional theorem in the supplementary which applies the bound (18) to show how accurate the sub-optimal solution \({\hat{\varvec{\alpha }}}^i\) is from the perspective of subspace-sparse representation, connecting \(\hbox {A}\ell ^{0}\)-SSC to the correctness of subspace clustering.

4 Experimental Results

The superior clustering performance of \(\hbox {A}\ell ^{0}\)-SSC is demonstrated in this section with extensive experimental results. Two measures are used to evaluate the performance of the clustering methods, i.e. the Accuracy (AC) and the Normalized Mutual Information (NMI) [31]. We compare our \(\hbox {A}\ell ^{0}\)-SSC to K-means (KM), Spectral Clustering (SC), SSC, and Sparse Manifold Clustering and Embedding (SMCE) [12]. \(\hbox {A}\ell ^{0}\)-SSC is also compared to SSC-OMP to show the advantage of the proposed PGD described in the previous sections. By adjusting the parameters, SSC and the \(\ell ^{1}\)-graph solve almost the same problem and generate equivalent results, so we report their performance under the same name SSC.

4.1 Clustering on UCI Data

In this subsection, we conduct experiments on the Ionosphere and Heart data from UCI machine learning repository [32], revealing the performance of \(\hbox {A}\ell ^{0}\)-SSC on general machine learning data. The Ionosphere data contains 351 points of dimensionality 34. The Heart data contains 270 points of dimensionality 13.

The clustering results on the two data sets are shown in Table 2.

Table 2. Clustering results on UCI ionosphere and heart.
Table 3. Clustering results on COIL-20 and COIL-100 database. c in the left column is the cluster number, i.e. the first c clusters of the entire data are used for clustering. c has the same meaning in Table 4.
Table 4. Clustering results on the extended Yale Face Database B.
Table 5. Clustering Results on UMIST Face, CMU PIE, AR Face, CMU Multi-PIE and Georgia Tech Face database. Note that the CMU Multi-PIE contains the facial images captured in four sessions (S1 to S4).

4.2 Clustering on COIL-20 and COIL-100 Database

The COIL-20 Database has 1440 images of 20 objects with the background removed, and the size of each image is \(32 \times 32\), so the dimensionality of this data is 1024. The COIL-100 Database contains 100 objects with 72 images of size \(32 \times 32\) for each object. The images of each object were taken 5 degrees apart while the object was rotated on a turntable. The clustering results on these two data sets are shown in Table 3. We observe that \(\hbox {A}\ell ^{0}\)-SSC performs consistently better than all other competing methods. On the COIL-100 Database, SMCE renders slightly better results than SSC on the entire data due to its capability of modeling non-linear manifolds.

4.3 Clustering on Extended Yale Face Database B and More Face Data Sets

The Extended Yale Face Database B contains face images of 38 subjects, with 64 frontal face images taken under different illuminations for each subject. The clustering results are shown in Table 4. We can see that \(\hbox {A}\ell ^{0}\)-SSC achieves a significantly better clustering result than SSC, which is the second best method on this data. We present more experimental results on the UMIST Face, CMU PIE, AR Face, CMU Multi-PIE and Georgia Tech Face databases in Table 5; these data sets are described at http://www.face-rec.org/databases/.

4.4 Parameter Setting

\(\lambda \) is usually set to 0.5 for \(\hbox {A}\ell ^{0}\)-SSC, with the maximum iteration number \(M = 100\) and the stopping threshold \(\varepsilon = 10^{-6}\). We observe that the average number of non-zero elements of the sparse codes generated by \(\hbox {A}\ell ^{0}\)-SSC is around 3 for most data sets. In SSC-OMP, \(\Vert \varvec{\alpha }^i\Vert _0\) is tuned to control the sparsity of the generated sparse codes such that the aforementioned average number of non-zero elements matches that of \(\hbox {A}\ell ^{0}\)-SSC. For SSC, the weighting parameter for the \(\ell ^{1}\)-norm has the default value of 0.1. For all the methods that use spectral clustering to obtain the clustering results, K-means is performed multiple times and the data partition with minimum distortion is taken as the final result.
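Putting the illustrative functions sketched earlier together with the default parameters stated here, a hypothetical end-to-end run could look as follows (the data loading below is a random placeholder, not one of the data sets used in the experiments).

```python
import numpy as np

# X: (d, n) data matrix, columns normalized to unit l2-norm (placeholder data here)
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))
X /= np.linalg.norm(X, axis=0, keepdims=True)

alpha_l1 = ssc_l1_codes(X, lam=0.5)        # l1 initialization from (3) with lambda_{l1} = lambda
alpha = al0_ssc_codes(X, alpha_l1, lam=0.5, max_iter=100, eps=1e-6)
labels = spectral_clustering_on_graph(similarity_from_codes(alpha), n_clusters=10)
```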

We investigate how the clustering performance on the Extended Yale Face Database B changes by varying the weighting parameter \(\lambda \) for \(\hbox {A}\ell ^{0}\)-SSC, and illustrate the result in Fig. 1. The parameter sensitivity result on COIL-20 Database is presented in the supplementary document. We observe that the performance of \(\hbox {A}\ell ^{0}\)-SSC is much better than other algorithms over a relatively large range of \(\lambda \), revealing the robustness of our algorithm with respect to the weighting parameter \(\lambda \).

Fig. 1.

Clustering performance with different values of \(\lambda \), i.e. the weight for the \(\ell ^{0}\)-norm, on the Extended Yale Face Database B. Left: Accuracy; Right: NMI. Note that the performance of SSC does not vary with \(\lambda \) since its weighting parameter for the \(\ell ^{1}\)-norm is chosen from [0.1, 1] for the best performance.

5 Conclusion

We propose a novel method, \(\hbox {A}\ell ^{0}\)-SSC, for data clustering under the principle of \(\ell ^{0}\)-induced sparse subspace clustering (\(\ell ^{0}\)-SSC). Compared to existing sparse subspace clustering methods, \(\ell ^{0}\)-SSC features \(\ell ^{0}\)-induced almost surely subspace-sparse representation under milder assumptions on the subspaces and the random data generation. \(\hbox {A}\ell ^{0}\)-SSC uses proximal gradient descent to solve the optimization problem of \(\ell ^{0}\)-SSC and obtains a sub-optimal solution with a theoretical guarantee. Extensive experimental results on various real data sets demonstrate the effectiveness and superiority of \(\hbox {A}\ell ^{0}\)-SSC over other competing methods.