Abstract
We consider estimating matrix-valued model parameters with a dedicated focus on their robustness. Our setting concerns large-scale structured data so that a regularization on the matrix’s rank becomes indispensable. Though robust loss functions are expected to be effective, their practical implementations are known difficult due to the non-smooth criterion functions encountered in the optimizations. To meet the challenges, we develop a highly efficient computing scheme taking advantage of the projection-free Frank–Wolfe algorithms that require only the first-order derivative of the criterion function. Our methodological framework is broad, extensively accommodating robust loss functions in conjunction with penalty functions in the context of matrix estimation problems. We establish the non-asymptotic error bounds of the matrix estimations with the Huber loss and nuclear norm penalty in two concrete cases: matrix completion with partial and noisy observations and reduced-rank regressions. Our theory demonstrates the merits from using robust loss functions, so that matrix-valued estimators with good properties are achieved even when heavy-tailed distributions are involved. We illustrate the promising performance of our methods with extensive numerical examples and data analysis.
1 Introduction
Massive data with informative structures from the data collection processes are becoming increasingly available in many data-enabled areas. Examples include those from functional MRI (fMRI), electroencephalogram (EEG), and tick-by-tick financial trading records of many assets. Methodologically, matrix-valued model parameters are analyzed in the core step(s) of many popular approaches to multivariate data analysis, including principal component analysis, canonical correlation analysis (Anderson, 2003), Gaussian graphical models (Lauritzen, 1996), reduced-rank regression (Reinsel & Velu, 1998), sufficient dimension reduction (Cook, 2009), and many others.
Structural information—our foremost consideration in this study—is indispensable in solving many matrix estimation problems with large-scale data. For matrix-valued model parameters, a class of methods imposes restrictions on the rank of the targeted matrix. In matrix completion with partial and noisy observations, for example, without such structural information, successfully recovering the signal is not possible. For multi-response regression problems, structural information is vital for both methodological development and practical implementation for drawing informative conclusions. Constraining the rank of the parameter matrix in multi-response regression leads to the conventional reduced-rank regression (Reinsel & Velu, 1998).
Our primary goal in this study is to investigate robustness when estimating matrices with large-scale data and structural information. Robustness is a foundational concern in current data-enabled investigations. During massive data collection processes, observations of heterogeneous quality are inevitable, and even erroneous records are common. On one hand, due to the huge size of the data in modern large-scale investigations, validations and error corrections become too daunting to be practical; robust statistical methods in these scenarios are thus highly desirable. On the other hand, the criterion functions commonly applied in existing methods, including the squared loss and the negative log-likelihood, are convenient but unfortunately not robust to such violations of the model assumptions.
We are thus motivated to consider robustness in a context where structural information is incorporated by constraining the rank of the matrix-valued model parameters. The foremost challenge in this scenario is the fundamental computational difficulty. One source of the difficulty is that constraining a matrix’s rank results in a non-convex problem. As a rare example, in reduced-rank multivariate regression an analytic solution is available despite the non-convexity; see Reinsel and Velu (1998). Unfortunately, such a convenience generally no longer exists in broader settings, and optimization problems with rank constraints are generally difficult to solve. To meet the challenge, a convex relaxation of the problem leads to regularizing the nuclear norm of the matrix-valued model parameter. From the statistical perspective, numerous works (Candès & Tao, 2010; Negahban & Wainwright, 2011; Agarwal et al., 2012) have studied the theoretical properties of estimators constructed with the nuclear norm relaxation, and have proved that the resulting estimators achieve optimal or near-optimal statistical properties under different settings. In addition to the non-convexity, the consideration of robustness further adds to the computational difficulty. Robust loss functions are a traditional and influential device for establishing more robust statistical methods; see Huber (2004) and Hampel et al. (2011). Though demonstrated effective in conventional statistical analysis, substantial difficulties arise when handling large-scale modern complex data-enabled problems. Computationally, in particular, their applications encounter major challenges because robust loss functions are not smooth: their second-order derivatives do not exist everywhere.
Analytically, establishing the statistical properties of the matrix estimations is challenging in this scenario too, because the impacts from possibly heavy-tailed errors are involved in studying large-scale problems. Existing methods using the squared loss or the negative log-likelihood as the loss functions require the noises to be sub-Gaussian in order to handle high-dimensional data. Robust methods can accommodate noises with heavier tails than sub-Gaussian; meanwhile, the capacity for handling high-dimensional data remains desirable.
There has been active recent development in robust statistical methods for high-dimensional data; see, for example, Loh (2017), Zhou et al. (2018), Sun et al. (2020), and references therein. Recently, there has been increasing interest in robust methods for matrix-valued model parameters. She and Chen (2017) studied robust reduced-rank regression in a scenario concerning outliers. They define the estimator as the minimizer of a non-convex optimization problem, establish theoretical error bounds, and propose an iterative algorithm that alternately solves for two parts of the model parameters in their setting. Due to the non-convexity, their algorithm does not guarantee convergence to the global minimum. Wong and Lee (2017) studied matrix completion with the Huber loss. Their algorithm iteratively projects non-robust matrix estimators, which is computationally demanding owing to the many projection operations required. Elsener and van de Geer (2018) investigated robust matrix completion with the Huber loss function and nuclear norm penalization. Their computational algorithms involve a soft-thresholding step for singular values, which works well when the solution is of exact low rank; when the solution is only of approximately low rank, or of modestly higher rank, such a step becomes computationally demanding. As pointed out in She and Chen (2017), efficient algorithms are desirable for solving optimization problems with rank constraints and robust loss functions.
We attempt our study with a foremost consideration on an efficient computing scheme for solving large-scale statistical problems with robustness. In particular, we aim to develop efficient first-order algorithms by building a scheme with Frank–Wolfe-type algorithms for robust matrix estimation problems. The Frank–Wolfe algorithm is a first-order method and has drawn considerable attention recently (Jaggi, 2013; Lacoste-Julien & Jaggi, 2015; Freund & Grigas, 2016; Freund et al., 2017; Kerdreux et al., 2018; Swoboda & Kolmogorov, 2019). The key advantage of Frank–Wolfe algorithms is their freedom from the projections required by most proximal-type algorithms. In addition, as we shall see in Sect. 2, for matrix estimation problems the Frank–Wolfe algorithm only requires computing the leading pair of singular vectors in each iteration, which can be done efficiently even for huge-size problems. These merits make Frank–Wolfe-type algorithms particularly appealing for solving large-scale robust low-rank matrix estimation problems.
Our study makes two main contributions. Foremost, we develop a new computation scheme for robust matrix estimation and demonstrate that the first-order optimization technique makes solving large-scale robust estimation problems practically convenient. We show extensively that our framework is broadly applicable, covering general robust loss functions including those used in median and quantile regression; see Sect. 2. Second, our theoretical analysis reveals the benefit from using robust loss functions and rank constraints. Our non-asymptotic results demonstrate that our framework can accommodate high-dimensional data. For matrix completion and reduced-rank regression, the resulting matrix-valued estimator works satisfactorily even when the model error distributions are heavy-tailed.
The rest of this article is organized as follows. Section 2 elaborates a concrete framework using the Frank–Wolfe algorithm to solve robust matrix estimation problems. We present matrix completion and reduced-rank regression with various robust loss functions. Section 3 justifies the validity of our method with theory on the algorithm convergence and error bounds of the resulting estimators. Section 4 presents extensive numerical examples demonstrating the promising performance of our methods.
For a generic matrix A, we denote by \(A^\top\) its transpose, \(\sigma _1(A)\) its largest singular value, \(\Vert A\Vert _*\) its nuclear norm, and \(\Vert A\Vert _F\) its Frobenius norm. Let \(\langle A, B\rangle =\text {trace}(A^\top B)\) for \(A, B \in {\mathbb {R}}^{p\times q}\). We denote by \(\varTheta \in {\mathbb {R}} ^{p\times q}\) a generic matrix-valued model parameter. In this study, we focus on two concrete cases. In one case, \(\varTheta =M\) where M is the signal to be recovered in the matrix completion problem with a single copy of partial and noisy observations; the other one is \(\varTheta =C\) where C is the matrix-valued coefficients in the multi-response regression problem. Furthermore, we show that our framework broadly applies in solving a general class of problems.
2 Methodology
2.1 Matrix completion
We consider the matrix completion problem first. In this setting, one observes a noisy subset of all entries of a matrix \(M\in {{\mathbb {R}}}^{p\times q}\), which is the model parameter of interest. Let the set of observed entries be \(\varOmega = \{(i_t,j_t)\}_{t=1}^n\), where \(i_t\in \{1,\dots , p\}; j_t\in \{1,\dots , q\}\), and denote by \(X_{i_t,j_t}\), \((i_t,j_t)\in \varOmega\), the corresponding noisy observations such that

$$X_{i_t,j_t}=M_{i_t,j_t}+\xi _t, \qquad t=1,\dots ,n.$$
We assume that \(\xi _t\)’s are independent and identically distributed random variables with mean zero.
To effectively recover M with a single copy of partial and noisy observations over \(\varOmega\), one popular approach is to assume that the underlying true matrix, denoted by \(M^*\), is of low rank with \(\text {rank}(M^*)\le r\) for some \(r\le \min (p,q)\). Then one can estimate \(M^*\) by minimizing the objective function \((2n)^{-1}\sum _{t=1}^n \ell (X_{i_t,j_t}-M_{i_t,j_t})\) over M, subject to \(\text {rank}(M)\le r\), for some loss function \(\ell (\cdot )\). Since the rank constraint is non-convex, solving the optimization is generally not tractable. To obtain a practical solution, a common strategy is to relax the rank constraint to the convex nuclear norm constraint.
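A quick numerical illustration of why the relaxation helps (a sketch with matrices of our own choosing): the rank function is not convex, while the nuclear norm is.

```python
import numpy as np

# Two rank-one matrices whose midpoint has rank two: the rank function
# violates convexity, so rank constraints cannot be handled directly by
# convex solvers.
A = np.outer([1.0, 0.0], [1.0, 0.0])
B = np.outer([0.0, 1.0], [0.0, 1.0])
mid = 0.5 * (A + B)
rank_mid = np.linalg.matrix_rank(mid)   # exceeds (rank A + rank B) / 2 = 1

# The nuclear norm is convex: it satisfies the defining inequality for the
# same convex combination, making it a tractable surrogate for the rank.
lhs = np.linalg.norm(mid, 'nuc')
rhs = 0.5 * np.linalg.norm(A, 'nuc') + 0.5 * np.linalg.norm(B, 'nuc')
```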
The Huber loss function leads to robust estimators because its design alleviates the excessive contribution from a data point that deviates extremely from the fit. Practically, the Huber loss performs promisingly when handling a substantial portion of noisy observations whose distribution can be heavy-tailed; see Huber (2004).
By applying the Huber loss with a constraint on the nuclear norm, we consider the following robust matrix completion problem:

$$\hat{M} = \mathop {\text {argmin}}\limits _{\Vert M\Vert _*\le \lambda }\; \frac{1}{2n}\sum _{t=1}^n \ell _\eta (X_{i_t,j_t}-M_{i_t,j_t}), \qquad (1)$$
where \(\ell _\eta (\cdot )\) is the classical Huber loss function:

$$\ell _\eta (x)= {\left\{ \begin{array}{ll} x^2/2, &{} |x|\le \eta ,\\ \eta |x|-\eta ^2/2, &{} |x|>\eta . \end{array}\right. } \qquad (2)$$
Here \(\eta\) is the tuning parameter of the Huber loss, and \(\lambda\) is the tuning parameter regularizing the nuclear norm of M. In our numerical studies, we choose both tuning parameters by cross-validation.
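For reference, the Huber loss can be coded in a few lines; this is a sketch with our own function name, and `eta` plays the role of the robustification parameter \(\eta\).

```python
import numpy as np

def huber_loss(r, eta):
    """Elementwise Huber loss: quadratic near zero, linear in the tails,
    so extreme residuals contribute only linearly to the criterion."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= eta,
                    0.5 * r ** 2,                    # least-squares regime
                    eta * (np.abs(r) - 0.5 * eta))   # robust linear regime
```

A small `eta` downweights outliers more aggressively, while a large `eta` recovers the squared loss on most residuals.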
Since \(\ell _\eta\) is not smooth, methods commonly applied in solving \(\ell _2\)-loss problems, which require second-order derivatives, do not directly apply. Solving optimization problem (1) is generally hard; see the discussion in She and Chen (2017). Efficient algorithms for solving (1) are lacking; the primary difficulty is the absence of the second-order derivative of the Huber loss. It is even more challenging to minimize the Huber loss on a restricted low-rank region, and to achieve computational efficiency with large-scale data. More broadly, non-smooth criterion functions are common among general robust loss functions, with prominent examples including the least absolute deviation loss of median regression, the check loss of quantile regression, and Tukey’s biweight loss, besides the aforementioned Huber loss.
To address the computational difficulty in handling large-scale problems with robust loss functions, we propose to apply the Frank–Wolfe algorithm. The Frank–Wolfe algorithm has been particularly powerful for convex optimization. As a first-order approach requiring no second-order derivative of the criterion function, it is well suited to problems with non-smooth loss functions, which is exactly the case for our problem (1). Briefly speaking, the Frank–Wolfe algorithm pursues some constrained approximation of the gradient, the first-order derivative of the criterion function evaluated at a given value. The algorithm runs iteratively, with the optimization proceeding along the direction identified by the approximation of the gradient. The Frank–Wolfe algorithm is therefore practically appealing, as one has the opportunity to best exploit a constrained approximation that can be computed efficiently. For a detailed account of Frank–Wolfe algorithms and recent advances in the area, we refer to Freund and Grigas (2016), Freund et al. (2017), and references therein.
Concretely in our setting, we develop an algorithm that runs iteratively. Specifically, at the \((k+1)\)-th iteration with \(M^{(k)}\) from the previous step, the matrix-valued gradient of (1), \(\nabla {\mathcal {L}}(M^{(k)}) \in {\mathbb {R}}^{p\times q}\), is analytically calculated by

$$\nabla {\mathcal {L}}(M^{(k)})=-\frac{1}{2n}\sum _{t=1}^{n}\Big [(X_{i_t,j_t}-M^{(k)}_{i_t,j_t})\,\text {1}\big (|X_{i_t,j_t}-M^{(k)}_{i_t,j_t}|\le \eta \big )+\eta \,\text {sign}\big (X_{i_t,j_t}-M^{(k)}_{i_t,j_t}\big )\,\text {1}\big (|X_{i_t,j_t}-M^{(k)}_{i_t,j_t}|>\eta \big )\Big ]\, J_t, \qquad (3)$$
where \(J_t\) is a matrix with \(J_{t,i_tj_t}=1\) and all the other entries 0, \(\text {1}(\cdot )\) is the indicator function, and \(\text {sign}(x)=1\) if x is positive and \(-1\) otherwise. Hence, evaluating the gradient can be done efficiently, and it is a scalable process that can be distributed if multiple computing units are available. The Frank–Wolfe algorithm then suggests computing a descent direction in the \((k+1)\)-th iteration:

$$V^{(k+1)}=\mathop {\text {argmin}}\limits _{\Vert V\Vert _*\le \lambda }\,\big \langle \nabla {\mathcal {L}}(M^{(k)}),\,V\big \rangle .$$
In this step, a key observation is that

$$V^{(k+1)}=-\lambda \, u_1v_1^\top ,$$
where \(u_1\) and \(v_1\) are the leading left and right singular vectors of \(\nabla {\mathcal {L}}(M^{(k)})\). The required singular value decomposition can be computed efficiently by an existing algorithm implemented in the standard “PROPACK” package in Matlab. Then, we conduct a descent step to update \(M^{(k)}\) by

$$M^{(k+1)}=(1-\alpha _{k+1})\,M^{(k)}+\alpha _{k+1} V^{(k+1)}, \qquad (4)$$
where \(\alpha _{k+1}\in [0,1]\) is a pre-specified step-size. For example, \(\alpha _{k+1} = 1/(k+3)\) guarantees convergence to an optimal solution. Meanwhile, line search is viable, and there are various ways to further accelerate this algorithm.
Intuitively, the updating direction in Equation (4) is viewed as the best rank-one approximation of the gradient matrix (3). Further, if we view the vector \(u_1\) as the direction corresponding to the first principal component of the columns of M, then formula (4) is essentially a column-wise update along this direction, with the step sizes proportional to the components in the vector \(v_1\). From this perspective, the update formula (4) can also be viewed as a computationally efficient matrix-valued coordinate descent along the direction \(u_1\). Since the objective function (1) is convex, such an update progressing along the gradient direction ensures that the criterion function converges, approaching the minimum.
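The rank-one form of the linear-minimization step can be checked numerically; the snippet below is a sketch (dimensions and names are our own) verifying that \(-\lambda u_1v_1^\top\) attains the smallest inner product with the gradient over the nuclear-norm ball.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))   # stand-in for a gradient matrix
lam = 2.0

# Claimed minimizer of <G, V> over {V : ||V||_* <= lam}
U, s, Vt = np.linalg.svd(G, full_matrices=False)
V_star = -lam * np.outer(U[:, 0], Vt[0, :])
best = float(np.sum(G * V_star))   # equals -lam * sigma_1(G) by norm duality

# Spot check: no random feasible point on the boundary does better
for _ in range(200):
    W = rng.standard_normal((6, 4))
    W *= lam / np.linalg.norm(W, 'nuc')   # scale onto the boundary of the ball
    assert np.sum(G * W) >= best - 1e-9
```

The check relies on the duality between the spectral and nuclear norms, so \(\langle G, V\rangle \ge -\sigma _1(G)\Vert V\Vert _*\) for every feasible V.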
We summarize the algorithm in Algorithm 1.
![Algorithm 1](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10994-023-06325-w/MediaObjects/10994_2023_6325_Figa_HTML.png)
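A minimal executable sketch of Algorithm 1 follows (function names are ours; a full SVD stands in for a PROPACK-style top-singular-pair routine, the step size is the \(1/(k+3)\) choice from the text, and the \(1/(2n)\) scaling follows the objective):

```python
import numpy as np

def frank_wolfe_completion(x_obs, rows, cols, shape, lam, eta, iters=300):
    """Projection-free Frank-Wolfe for Huber-loss matrix completion over
    the nuclear-norm ball {M : ||M||_* <= lam} (a sketch of Algorithm 1)."""
    n = len(x_obs)
    M = np.zeros(shape)
    for k in range(iters):
        # Gradient of the Huber criterion at the observed entries
        r = x_obs - M[rows, cols]
        psi = np.where(np.abs(r) <= eta, r, eta * np.sign(r))  # Huber score
        G = np.zeros(shape)
        np.add.at(G, (rows, cols), -psi / (2 * n))  # scatter-add the J_t terms
        # Linear minimization: argmin over the ball is -lam * u1 v1^T
        U, s, Vt = np.linalg.svd(G, full_matrices=False)
        V = -lam * np.outer(U[:, 0], Vt[0, :])
        # Convex-combination update keeps M inside the nuclear-norm ball
        alpha = 1.0 / (k + 3)
        M = (1 - alpha) * M + alpha * V
    return M
```

For large matrices, the full SVD line would be replaced by an iterative top-singular-pair solver, which is the step that makes the method scale.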
2.2 Reduced-rank regression
In our second concrete problem with matrix-valued model parameters, we consider a multivariate linear regression

$$y_{ij}=x_i^\top c_j+\xi _{ij}, \qquad i=1,\dots ,n;\; j=1,\dots ,q,$$
where \(\xi _{ij}\)’s are model errors. We assume that \(\xi _{ij}\)’s are independent and identically distributed random variables with mean zero. Then, we have in a matrix form

$$Y=XC+\varXi ,$$
where \(Y=[y_{ij}]_{n\times q}, X=[x_{ij}]_{n\times p}=[x_1,\dots , x_n]^\top , C=[c_1,\dots , c_q]\in {\mathbb {R}}^{p\times q}\), and \(\varXi =[\xi _{ij}]_{n\times q}\).
In this setting, one may opt to restrict the rank of C, i.e., \(\text {rank}(C)\le r\) \((r \le \min (p,q))\), leading to the conventional reduced-rank regression (Reinsel & Velu, 1998). Also relaxing the rank constraint with the nuclear norm, we consider the estimation problem

$$\hat{C} = \mathop {\text {argmin}}\limits _{\Vert C\Vert _*\le \lambda }\; \frac{1}{2n}\sum _{i=1}^n\sum _{j=1}^q \ell _\eta (y_{ij}-x_i^\top c_j), \qquad (5)$$
where \(c_j\) denotes the j-th column of C, and \(\ell _\eta (\cdot )\) is the Huber loss function with parameter \(\eta\).
Again, to address the computational challenges, analogous to problem (1), we propose to solve problem (5) by applying the Frank–Wolfe algorithm iteratively, with the steps described as follows. Denote by \(C^{(k)}\) the solution after the k-th iteration. At the \((k+1)\)-th iteration, let \(\nabla {\mathcal {L}}_{\eta }(C^{(k)})\) be the gradient of the loss function at \(C^{(k)}\):

$$\nabla {\mathcal {L}}_\eta (C^{(k)})=-\frac{1}{2n}\sum _{i=1}^{n}\sum _{j=1}^{q}\Big [(y_{ij}-x_i^\top c_j^{(k)})\,\text {1}\big (|y_{ij}-x_i^\top c_j^{(k)}|\le \eta \big )+\eta \,\text {sign}\big (y_{ij}-x_i^\top c_j^{(k)}\big )\,\text {1}\big (|y_{ij}-x_i^\top c_j^{(k)}|>\eta \big )\Big ]\, Z^{ij}, \qquad (6)$$
where \(Z^{ij}\) is a matrix with the j-th column being \(x_i\) and the remaining entries 0. Then, we compute a descent direction from

$$V^{(k+1)}=\mathop {\text {argmin}}\limits _{\Vert V\Vert _*\le \lambda }\,\big \langle \nabla {\mathcal {L}}_\eta (C^{(k)}),\,V\big \rangle ,$$
with the solution

$$V^{(k+1)}=-\lambda \, u_1v_1^\top ,$$
where \(u_1\) and \(v_1\) are the leading left and right singular vectors of \(\nabla {\mathcal {L}}_\eta (C^{(k)})\).
The algorithm follows Algorithm 1, with different input data and the gradient matrix specified by (6).
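For concreteness, the two ingredients of the reduced-rank step, the gradient and the rank-one descent direction, can be sketched as follows (our own function names; the \(1/(2n)\) scaling mirrors the matrix completion objective, and the sum over the matrices \(Z^{ij}\) collapses into a single matrix product):

```python
import numpy as np

def rrr_huber_gradient(C, X, Y, eta):
    """Gradient of the Huber reduced-rank-regression loss at C."""
    R = Y - X @ C                                          # n x q residuals
    psi = np.where(np.abs(R) <= eta, R, eta * np.sign(R))  # Huber score
    return -(X.T @ psi) / (2 * X.shape[0])                 # p x q gradient

def fw_direction(G, lam):
    """Descent direction -lam * u1 v1^T from the top singular pair of G."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return -lam * np.outer(U[:, 0], Vt[0, :])
```

When every residual falls in the quadratic zone, the gradient reduces to the familiar least-squares expression, which offers a quick sanity check.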
2.3 Other robust loss functions
Our framework for developing efficient computation algorithms easily accommodates a broad class of robust loss functions that are not smooth. Examples include the \(\ell _1\)-loss (the least absolute deviation loss), the check loss, Tukey’s biweight loss, and more; see Hampel et al. (2011).
A scheme is developed as follows. The only necessary adjustment relative to Algorithm 1 is calculating the gradient of the loss function, \(\nabla {\mathcal {L}}(\cdot )\). Then, the general updating step is

$$\varTheta ^{(k+1)}=(1-\alpha _{k+1})\,\varTheta ^{(k)}+\alpha _{k+1} V^{(k+1)},$$
where \(\alpha _{k+1}\) is some pre-specified step-size, \(V^{(k+1)} = -\lambda \cdot u_1v_1^\top ,\) with \(u_1\) and \(v_1\) being the first left and right singular vectors of \(\nabla {\mathcal {L}}(\varTheta ^{(k)})\).
Table 1 presents gradients for several common loss functions in the context of matrix completion and reduced-rank regression.
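In code, swapping losses amounts to swapping score (derivative) functions, in the spirit of Table 1. The sketch below uses our own names and subgradient conventions at the kinks; only this score enters the general updating step.

```python
import numpy as np

def robust_score(r, loss, eta=1.345, tau=0.5):
    """Derivative of common robust losses; tau is the quantile level of
    the check loss, eta the threshold of the Huber and Tukey losses."""
    r = np.asarray(r, dtype=float)
    if loss == "huber":
        return np.where(np.abs(r) <= eta, r, eta * np.sign(r))
    if loss == "lad":     # l1 / least absolute deviation
        return np.sign(r)
    if loss == "check":   # quantile check loss rho_tau(r) = r * (tau - 1(r < 0))
        return np.where(r >= 0.0, tau, tau - 1.0)
    if loss == "tukey":   # Tukey's biweight (non-convex, redescending)
        return np.where(np.abs(r) <= eta, r * (1.0 - (r / eta) ** 2) ** 2, 0.0)
    raise ValueError(f"unknown loss: {loss}")
```

Note how Tukey's score vanishes beyond the threshold, so gross outliers exert no pull at all, at the price of non-convexity.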
3 Theory
3.1 Convergence of the algorithms
For self-completeness, we present the theoretical guarantees for the Frank–Wolfe algorithm in the context of robust matrix estimations, together with a simple way to choose the step-sizes.
We prove that, by choosing the step-size properly, the objective functions with the Huber loss in both the matrix completion and reduced-rank regression problems converge to their optima at the rate \({\mathcal {O}}(1/k)\), where k is the iteration counter. The next proposition is for reduced-rank regression problems; the result for the matrix completion problem can be proved similarly.
Proposition 1
Consider the loss function \({\mathcal {L}}_\eta (\cdot ): {\mathbb {R}}^{n \times p} \rightarrow {\mathbb {R}}\) constructed from the Huber loss function (2) with parameter \(\eta\). For the reduced-rank regression problem (5), by the Frank–Wolfe Algorithm with stepsize set as
where \(L_z\) is some positive number. Suppose the diameter of the feasible set is \(D:=\max _{V_1,V_2\in {\mathbb {S}}} \Vert V_1-V_2\Vert _F\), where \({\mathbb {S}}= \{V: \Vert V\Vert _*\le \lambda \}\). Then \({\mathcal {L}}(C^{(k)})\) is monotonically decreasing in k, and we have
Proof
Since the Huber loss function is differentiable everywhere, \(\nabla {\mathcal {L}}_\eta ( C )\) is Lipschitz-continuous. Thus, with \(L_z\) defined above as its Lipschitz constant, the result follows from Theorem 1 of Freund et al. (2017). \(\square\)
We point out that for the matrix completion problem (1), the result holds by the same argument by letting \(L_z = 1\).
Meanwhile, our broad interests include some non-convex losses such as Tukey’s biweight loss. A strategy for handling them is to approximate the loss, to arbitrary precision, by a function with a Lipschitz continuous gradient via simple smoothing techniques. Upon applying the same step-sizes as discussed above, we can show that the algorithm converges to a stationary point at the same rate; see the analysis in the recent work of Reddi et al. (2016).
Recently, Charisopoulos et al. (2021) studied low-rank matrix recovery algorithms with the non-convex rank constraint and non-smooth loss functions. They established optimization convergence rates for a prox-linear method and a subgradient method for matrix completion, and proved that, with a sufficient number of observations and an appropriate initialization, both methods are guaranteed to converge to the truth. The prox-linear method possesses a much faster convergence rate of \({\mathcal {O}}(2^{-k})\) but with a higher computational cost at each iteration from solving a convex subproblem. The subgradient method has a lower cost at each iteration, with a subgradient evaluation step and a projection step onto the desired region, but a slower rate. Compared with their algorithms, our method has a lower computational burden in each iteration, with no projection required, and a relatively slower convergence rate. Directly minimizing a robust loss function under the non-convex constraint is worth studying in the future.
3.2 Statistical properties
We investigate the non-asymptotic error bounds in this section. We first introduce two conditions for both matrix completion and reduced-rank regression models.
Assumption 1
The truths \(M^*\) and \(C^*\) have rank at most \(r,~0<r<\min (p,q)\).
Assumption 2
The noises \(\xi\)’s are i.i.d. with zero mean and a distribution function \(F_\xi\) satisfying

$$F_\xi (m+\eta )-F_\xi (m-\eta )>\frac{1}{c_1^2}$$
for any \(|m|\le \eta\) and \(\eta >0\), where \(c_1=c_1(\eta )\) is a constant depending only on \(\eta\).
Assumption 2 is the key condition on the distribution of the noises. It is very mild: it only requires non-vanishing probability mass of \(\xi\) between \(m-\eta\) and \(m+\eta\) for a positive \(\eta\) and \(|m|\le \eta\), avoiding explicit conditions on the tail probability and/or the existence of moments up to some order. Since the condition holds for \(\eta >0\) as long as the probability mass of \(\xi\) near 0 is not too small, it is easily satisfied by a wide range of distributions, including heavy-tailed ones; see more discussion of this assumption and examples in Appendix 1.
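As a quick numerical illustration (our own choice of example), the standard Cauchy distribution, which has no finite mean, satisfies Assumption 2: the mass \(F_\xi (m+\eta )-F_\xi (m-\eta )\) stays bounded away from zero over \(|m|\le \eta\).

```python
import math

def cauchy_cdf(x):
    """CDF of the standard Cauchy law, a heavy-tailed distribution with
    no finite mean or variance."""
    return 0.5 + math.atan(x) / math.pi

def assumption2_mass(m, eta):
    """The probability mass F(m + eta) - F(m - eta) in Assumption 2."""
    return cauchy_cdf(m + eta) - cauchy_cdf(m - eta)

# Scan m over [-eta, eta]; the mass is minimized at the endpoints |m| = eta,
# so any c_1 with 1 / c_1^2 below that minimum makes Assumption 2 hold.
eta = 1.0
min_mass = min(assumption2_mass(i / 100.0 - 1.0, eta) for i in range(201))
```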
3.2.1 Matrix completion
For any matrix A and a linear subspace \({\mathcal {M}}\) of \({\mathbb {R}}^{p\times q}\), we define \(A_{{\mathcal {M}}}\) as the projection of A onto \({\mathcal {M}}\). We consider without loss of generality that \(p>q>1\). Recall that \(J_t\) \((t=1,\dots , n)\) is a \(p\times q\) random matrix, independent of \(X_{i_t,j_t}\) and \(\xi _t\), with one randomly chosen entry \(J_{t,i_tj_t}\) being 1 and the others 0. \(M_{i_t,j_t}\) can be written as

$$M_{i_t,j_t}=\langle J_t, M\rangle$$
for all \((i_t,j_t)\in \varOmega\). As a working model, we treat \(J_t\) as uniformly distributed over its support. That is, the probability of \(M_{i_t,j_t}\) being the t-th observation is \((p q)^{-1}\). This assumes that the observed entries in the target matrix are uniformly sampled at random (Koltchinskii et al., 2011; Rohde & Tsybakov, 2011; Elsener & van de Geer, 2018), and we refer to Klopp (2014) for more discussions. Recht (2011) analyzed the matrix completion model under this assumption. As pointed out in Recht (2011), this is a sampling with replacement scheme and therefore may appear less realistic as it may result in duplicated entries; however, it has the benefit of simplifying the technical proof and assumptions. Overall, it is a reasonable and informative showcase without requiring any prior information on the sampling scheme. If additional information is available in the sampling process, other models such as the weighted sampling model (Negahban & Wainwright, 2012) can be applied.
We first show that the estimator belongs to a restricted set. We consider the singular value decomposition

$$M^*=U\varLambda V^\top ,$$
where U is a \(p\times q\) matrix, \(\varLambda\) is a \(q\times q\) diagonal matrix with diagonal entries the ordered singular values \(\sigma _1\ge \sigma _2\ge \dots \ge \sigma _{q}\), and V is a \(q\times q\) matrix. For \(k=1,2,\dots ,q\), let \(u_k\) be the k-th column of U, and \(v_k\) the k-th column of V. For any positive integer \(r\le \min \{p,q\}\), let \(U_r\) be the subspace of \({\mathbb {R}}^{p\times q}\) spanned by \(u_1, \dots , u_r\), and \(V_r\) be the subspace spanned by \(v_1, \dots , v_r\). Define a pair of subspaces of \({\mathbb {R}}^{p\times q}\) as
where \(\text {row}(M)\) and \(\text {col}(M)\) denote the row and column spaces of M. For notational simplicity, we use \({\mathcal {M}}_r ={\mathcal {M}}_r(U,V)\) and \(\overline{{\mathcal {M}}}_r^\perp =\overline{{\mathcal {M}}}_r^\perp (U,V)\). Lemma 1 indicates that the estimator \({\hat{M}}\) belongs to the set
To establish the error bounds, we need the following technical assumption.
Assumption 3
For any \(M\in {\mathcal {M}}_0\), there exists a real number \(L>1\) such that

$$\Vert M-M^*\Vert _\infty \le \frac{L}{\sqrt{pq}}\,\Vert M-M^*\Vert _F.$$
Assumptions of this type, referred to as the ‘spikiness condition’, are made in the existing literature on analogous problems, e.g., in Negahban and Wainwright (2012) for matrix completion; see also the recent work of Fan et al. (2021). Intuitively, this assumption requires that for \(M\in {\mathcal {M}}_0\), the entries of \(M-M^*\) are not overly ‘spiky’, or in other words, are relatively evenly distributed, so that the maximum discrepancy is not extremely far from the average discrepancy. We remark that the term \(\frac{1}{\sqrt{pq}}\) relates to the aforementioned uniform sampling scheme, under which each entry is observed with probability \(\frac{1}{pq}\). Hence, it reflects an increasingly more difficult high-dimensional problem due to sparse entries in a single copy of a large matrix. In contrast, if the probability of each entry being observed is a constant independent of p and q, this assumption is not required.
We consider the Lagrangian form of the problem (1):

$$\hat{M}=\mathop {\text {argmin}}\limits _{M}\; \frac{1}{2n}\sum _{t=1}^n \ell _\eta (X_{i_t,j_t}-M_{i_t,j_t})+\gamma \Vert M\Vert _*, \qquad (7)$$
where \(\gamma >0\) is the corresponding regularization tuning parameter. Let \({\hat{\varDelta }}={\hat{M}}-M^*\) and \(\varDelta =M-M^*\). Theorem 1 establishes a non-asymptotic upper bound on the error in estimating a low-rank \(M^*\).
Theorem 1
For problem (7), suppose that Assumptions 1, 2, and 3 hold and the noises \(\xi _t\)’s are distributed symmetrically about zero. Let \({\hat{M}}\) be the solution to problem (7) with
with a constant \(c_0>0\). When \(n>C(L)\cdot c^2_1 pr\log (p+q)\log (q+1)\),
with probability at least \(1-3(p+q)^{-1}\), for some constants \(C_1\), \(c_2\) and \(c_3\) independent of n, p, and q, and C(L) a constant only depending on L.
Theorem 1 is non-asymptotic; \(\gamma\) is chosen based on Lemma 7 in Appendix 2 as twice the upper bound of \(\sigma _1(\nabla {\mathcal {L}}_\eta (M^*))\). In Theorem 1, we only require that the error terms satisfy Assumption 2, namely \(F_\xi (m+\eta )-F_\xi (m-\eta )>\frac{1}{c_1^2}\), for \(\eta >0\) the parameter in the Huber loss (2) and \(|m|<\eta\). Since this assumption is easily satisfied by many heavy-tailed distributions, this result demonstrates the robustness of our method.
We note that in general \(\gamma\) can be
for any constant \(K_1\ge 1\). Under the conditions in Theorem 1, we can also derive the upper bounds of the estimation error in nuclear norm based on (23) in the Appendix:
We now discuss the asymptotic properties of \({{\hat{M}}}\) as \(n \rightarrow \infty\). Matrix completion is a hard problem: one attempts to recover a matrix-valued model parameter from a single incomplete copy from the data generating process. The average estimation error converges to zero in probability as \(n\rightarrow \infty\); that is, when \(rp \log (p+q)\log (q+1)=o(n)\), \((pq)^{-1}\Vert {\hat{\varDelta }}\Vert _F^2 \rightarrow 0\). Intuitively, if the rank of \(M^*\) is r, then the number of free parameters is of order rp. It is hence reasonable to require a sample size of some larger order than rp in order to recover the model parameters consistently.
Without requiring the Gaussian assumption, our error rate is still comparable to the statistical optimum established by Koltchinskii et al. (2011) for matrix completion under a low-rank constraint with Gaussian noises. Compared with the lower bound given in Theorem 6 of Koltchinskii et al. (2011), our upper bound in Theorem 1 differs only in an additional logarithmic term \(\sqrt{\log (p+q)\log (q+1)}\) and the \(\eta\) in the Huber loss.
The assumption in Theorem 1 that the model error is symmetrically distributed around 0 is needed in obtaining the upper bound of \(\sigma _{1}(\nabla {\mathcal {L}}(M^*))\); see the proof of Lemma 7. It ensures that \(\sigma _1({\mathbb {E}}[\nabla {\mathcal {L}}(M^*)])=0\). Similar assumptions are found in Loh (2017). Thanks to the symmetry assumption, the convergence can be established with no strong extra requirement on \(\eta\). Without the symmetry, as shown in Lemma 7 in the Supplementary Material, other conditions are required to control
so that
is stochastically small enough. With this extra term, the upper bound in Theorem 1 becomes
The extra term in (8) may then be viewed as a price paid to achieve robustness against noises with heavy-tailed distributions. It is an impact of applying the robust Huber loss, and a remarkably different feature from studies of matrix completion with the \(\ell _2\)-loss. Nevertheless, it is worth noting that \(\ell _2\)-loss studies commonly assume conditions controlling the tail behavior of the model errors, for example sub-Gaussian distributions. In contrast, our development requires no such assumptions on the tail probabilities, which is the gain in return for applying the Huber loss.
3.2.2 Reduced-rank regression
The problem (5) is also expressed in the Lagrangian form:

$$\hat{C}=\mathop {\text {argmin}}\limits _{C}\; {\mathcal {L}}_\eta (C)+\gamma \Vert C\Vert _*, \qquad (9)$$
where \(\gamma >0\) is a regularization parameter, and \({\mathcal {L}}_\eta (C)\) is defined in Equation (5).
Again, we point out that the estimator belongs to a restricted set. By applying the singular value decomposition to \(C^*\), we have

$$C^*=U\varLambda V^\top ,$$
where \(\varLambda =\textrm{diag}(\sigma _1,\dots , \sigma _q)\) is the diagonal matrix containing all singular values of \(C^*\). For \(r\le \min \{p,q\}\), we define a pair of subspaces of \({\mathbb {R}}^{p\times q}\) as
where \(U_r\) is the subspace spanned by the first r columns of U, and \(V_r\) is the subspace spanned by the first r columns of V. For notational simplicity, we denote \({\mathcal {C}}_r ={\mathcal {C}}_r(U, V)\) and \(\overline{{\mathcal {C}}}_r^\perp =\overline{{\mathcal {C}}}_r^\perp (U,V)\). Note that \({\mathcal {C}}_r\) and \(\overline{{\mathcal {C}}}_r\) are not equal. Lemma 4 indicates that the estimator \({\hat{C}}\) belongs to the set
We assume the following conditions on the random design matrix X.
Assumption 4
\(x_1, x_2, \dots , x_n\) are i.i.d. random vectors sampled from a multivariate normal distribution \({\mathcal {N}} (0, \varSigma )\) and, without loss of generality, are standardized such that \(\Vert x_i\Vert _F\le 1\). Moreover, \(\sigma _1(\Sigma )\ge \sigma _n(\Sigma )>0\), where \(\sigma _1(\Sigma )\) and \(\sigma _n(\Sigma )\) denote the largest and smallest eigenvalues of \(\Sigma\), respectively.
The multivariate normal distribution and its analogues are commonly assumed in the literature (e.g., Negahban & Wainwright, 2011; Sun et al., 2020; Fan et al., 2021). The setting under Assumption 4 facilitates achieving the optimal convergence rate; other types of conditions are possible, at the expense of a slower rate.
Theorem 2 establishes a non-asymptotic upper bound for \(\Vert {\hat{\varDelta }}\Vert _F\).
Theorem 2
For problem (9), suppose that Assumptions 1 and 2 hold and the noises \(\xi _{ij}\)'s are distributed symmetrically about zero. Suppose X satisfies Assumption 4. Let \({\hat{C}}\) be the solution to the optimization problem (9) with
Then for \(n>C_2\frac{\sigma _1(\Sigma )}{\sigma _n(\Sigma )}c_1^2{r(p+q)}\) with probability at least \(1-3e^{-(p+q)}\),
where \(C_2\) and \(C_3\) are constants.
The value for \(\gamma\) is selected based on Lemma 8 in Appendix 3 as twice the upper bound for \(\sigma _1(\nabla {\mathcal {L}}_\eta (C^*))\) according to condition (10). Generally, for any \(K_2\ge 1\) and
our result remains valid and differs only in the constant terms.
Under the same condition, we can establish the error bound in terms of the nuclear norm
When \(r(p+q)=o(n)\), the Frobenius norm of the error satisfies \(\Vert {\hat{\varDelta }}\Vert _F^2 \rightarrow 0\) in probability. Similarly, the robustness of the method is reflected in the fact that only the mild distributional Assumption 2 is required: \(F_\xi (m+\eta )-F_\xi (m-\eta )>\frac{1}{c_1^2}\) for \(|m|<\eta\) and \(\eta >0\). Our estimator achieves a convergence rate comparable to those in Negahban and Wainwright (2011) and Rohde and Tsybakov (2011), with the notable difference due to the \(\eta\) in the Huber loss. Meanwhile, our method does not require the errors to follow normal distributions, which is the case in those studies with the \(\ell _2\) loss. Here, assuming symmetry plays the same role as in Theorem 1. By the same discussion after Theorem 1, if the noises are not symmetrically distributed, an extra term appears in the upper bound.
4 Numerical examples
In this section, we conduct an extensive numerical investigation of the proposed method using both simulated and real data sets. In all cases, we choose the tuning parameters by ten-fold cross-validation. Specifically, for matrix completion problems, we randomly select \(90\%\) of the observed entries as training samples and evaluate the results on the remaining \(10\%\). We repeat the procedure 10 times and choose the best tuning parameter. With extensive studies on simulated and real data sets, our results provide strong empirical evidence that the proposed method is robust under different settings.
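The random 90/10 split of the observed entries used for tuning can be sketched with a small helper; the function name and interface are ours.

```python
import numpy as np

def holdout_split(omega, test_frac=0.1, rng=None):
    """Randomly split the observed index set omega into training and
    validation parts, as in the tuning procedure described above."""
    rng = np.random.default_rng(rng)
    omega = np.asarray(omega)
    perm = rng.permutation(len(omega))
    n_test = int(round(test_frac * len(omega)))
    # first n_test permuted positions form the validation set
    return omega[perm[n_test:]], omega[perm[:n_test]]
```

Repeating this split 10 times and averaging the validation error over the splits gives the cross-validated criterion used to pick the tuning parameter.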
4.1 Jester joke data
We first test our method using the Jester joke data set. This data set contains more than 4.1 million ratings for 100 jokes from 73,421 users. This data set is publicly available through http://www.ieor.berkeley.edu/~goldberg/jester-data/. The whole data set contains three sub-datasets: (1) jester-1: 24,983 users who rated 36 or more jokes; (2) jester-2: 23,500 users who rated 36 or more jokes; (3) jester-3: 24,938 users who rated between 15 and 35 jokes. More detailed descriptions can be found in Toh and Yun (2010) and Chen et al. (2012), where the authors consider nuclear-norm based approaches to matrix completion.
Due to the large number of users, we randomly select \(n_u\) users' ratings from the datasets. Since many entries are unknown, we cannot compute the relative error over all entries. Instead, we adopt the normalized mean absolute error (NMAE) to measure the accuracy of the estimator \({\hat{M}}\):
where \(r_{\min }\) and \(r_{\max }\) denote the lower and upper bounds for the ratings, respectively. In the Jester joke data set, the range is \([-10,10]\). Thus, we have \(r_{\max }-r_{\min } = 20\).
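Under this convention (the mean absolute error over the test entries, divided by \(r_{\max }-r_{\min }\)), the NMAE can be computed as follows; the function name and index layout are our assumptions.

```python
import numpy as np

def nmae(M_hat, M0, idx, r_min=-10.0, r_max=10.0):
    """Normalized mean absolute error over the test index set `idx`
    (a pair of row/column arrays); Jester ratings lie in [-10, 10]."""
    rows, cols = idx
    err = np.abs(M_hat[rows, cols] - M0[rows, cols])
    return err.mean() / (r_max - r_min)
```

For the Jester data, \(r_{\max }-r_{\min } = 20\), so an NMAE of 0.2 corresponds to an average absolute error of 4 rating points.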
In each iteration, we first randomly select \(n_u\) users, and then randomly permute their ratings to generate \(M^0\in {\mathbb {R}}^{n_u\times 100}\). Next, we uniformly sample SR\(\in \{15\%,20\%,25\%\}\) of the entries to generate a set of observed indices \(\varOmega\). Note that we can observe the entry (j, k) only if \((j,k)\in \varOmega\) and \(M^0_{j,k}\) is available; thus, the actual sampling ratio is less than the nominal SR. We consider different settings of \(n_u\) and SR, and we report the averaged NMAE and running times in Table 2 after running each setting 100 times. We compare robust methods with the \(\ell _1\) loss, Huber loss, and Tukey loss against the non-robust \(\ell _2\) loss. From Table 2, we see that the robust matrix completion methods perform promisingly.
4.2 Cameraman image denoising
We test our method using the popular “Cameraman” image, which is widely used in the image processing literature; the image has \(512\times 512\) pixels, as shown in Fig. 1a. We generate random noise by first adding independent Gaussian noise with standard deviation 3 to each pixel. Then, we add heavy-tailed noises by randomly choosing 10% of the pixels and replacing their values with 1000 or \(-1000\). Furthermore, we randomly select 40% or 60% of the pixels as missing entries. We show two typical simulated noisy images in the top row of Fig. 1b, c, and the images recovered using the Tukey approach below them. The recovered images provide visual evidence that our method is robust to heavy-tailed noises in practice. In addition, in Table 3, we report the averaged NMAE with standard deviations of the different approaches after repeating the data generating schemes 100 times. From the effective image recovery and the NMAE values, we conclude that robust matrix completion performs promisingly with partial and noisy observations.
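The corruption scheme above (Gaussian noise, 10% heavy-tailed spikes of \(\pm 1000\), and randomly missing pixels) can be sketched as follows; the function name and the returned observation mask are our conventions.

```python
import numpy as np

def corrupt_image(img, sd=3.0, spike_frac=0.10, miss_frac=0.40, rng=None):
    """Apply the corruption scheme described above to a 2-D array:
    additive Gaussian noise, +/-1000 spikes on spike_frac of the pixels,
    and a Boolean mask marking miss_frac of the pixels as missing."""
    rng = np.random.default_rng(rng)
    noisy = img + rng.normal(0.0, sd, img.shape)
    spikes = rng.random(img.shape) < spike_frac
    noisy[spikes] = rng.choice([-1000.0, 1000.0], size=spikes.sum())
    mask = rng.random(img.shape) >= miss_frac   # True = observed entry
    return noisy, mask
```

The robust completion methods then see only the entries where the mask is True, including any spiked (outlying) pixels that happen to be observed.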
4.3 Simulations
We first consider several simulation settings similar to those described in She and Chen (2017) to compare our method with their robust reduced-rank regression (\(R^4\)) method. In all cases, we focus on testing robustness by artificially introducing data corruption and outliers.
Setting 1: We first consider a low-dimensional case where we set \(n = 100\), \(p = 12\), \(q = 8\) and \(r = 3\) or 5. We construct the design matrix X by independently sampling its n rows from \(N(0,\Sigma _0)\), where we consider highly correlated covariates by setting the diagonal elements of \(\Sigma _0\) to 1 and its off-diagonal elements to 0.5. For the noise matrix \(\varXi\), we sample each row of \(\varXi\) independently from \(N(0,\sigma ^2\Sigma _1)\), where \(\Sigma _1\) is the q-dimensional identity matrix, and \(\sigma\) is set to 1. Next, we construct the coefficient matrix \(C^*\). We generate \(C^* = B_1B_2^\top\), where \(B_1\in {\mathbb {R}}^{p\times r}\), \(B_2\in {\mathbb {R}}^{q\times r}\), and all entries of \(B_1\) and \(B_2\) are independently sampled from N(0, 1). We then add outliers through a matrix \(U^*\) by setting the first \(o\% \cdot n\) rows of \(U^*\) to be nonzero, where \(o \in \{30, 35,\ldots ,50\}\) gives the percentage of outliers, and the j-th entry of each outlier row of \(U^*\) is the product of a Rademacher random variable and a scalar \(\alpha \in \{0.75,1\}\) times the sample standard deviation of the j-th column of \(XC^*\). Finally, we set the response matrix \(Y = XC^* + U^* + \varXi\). We report the mean and standard deviation of the mean squared error (MSE) from 200 runs, where
In addition, we also report the mean and standard deviation of the mean squared estimation error, where
Setting 2: We then test our method under heavy-tailed noise. As in Setting 1, we let \(n = 100\), \(p = 12\), \(q = 8\), and \(r = 2, 3\), or 4, and use the same scheme to construct the design matrix X; we then generate the noise matrix from the heavy-tailed t-distribution with 3 or 5 degrees of freedom. Furthermore, we add outliers using the same scheme as in Setting 1 to generate \(U^*\), with \(\alpha = 0.5, 0.75\), or 1.
Setting 3: We consider a high-dimensional setting where \(n = 100\), \(p = 50\) and \(q = 50\), and \(r = 3\) or 5, where there are \(2,500 > 100\) parameters in the matrix C to be estimated. We consider the same data generating scheme as in Setting 1.
Setting 4: Finally, we consider an ultrahigh-dimensional setting where \(n = 300\), \(p = 100\) and \(q = 400\), and \(r = 3\) or 5, where there are \(40,000\gg 300\) parameters to be estimated. We consider the same data generating scheme as in Setting 1.
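The data-generating scheme shared by the four settings can be sketched as follows for Setting 1; the function name, argument names, and the exact handling of the outlier scale are our assumptions based on the description above.

```python
import numpy as np

def simulate_setting1(n=100, p=12, q=8, r=3, o=0.30, alpha=0.75,
                      rho=0.5, sigma=1.0, rng=None):
    """Sketch of the Setting 1 generator: correlated design, rank-r
    coefficient matrix, Gaussian noise, and Rademacher-signed outliers."""
    rng = np.random.default_rng(rng)
    Sigma0 = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)  # equicorrelation
    X = rng.multivariate_normal(np.zeros(p), Sigma0, size=n)
    C = rng.standard_normal((p, r)) @ rng.standard_normal((q, r)).T  # rank r
    E = sigma * rng.standard_normal((n, q))                 # Gaussian noise
    U = np.zeros((n, q))
    n_out = int(o * n)                                      # outlier rows
    scale = alpha * X.dot(C).std(axis=0, ddof=1)            # column-wise sd
    signs = rng.choice([-1.0, 1.0], size=(n_out, q))        # Rademacher
    U[:n_out] = signs * scale
    Y = X @ C + U + E
    return X, Y, C
```

Settings 2 to 4 only change the dimensions, the noise distribution (t with 3 or 5 degrees of freedom), and the outlier parameters.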
The results are shown in Tables 4, 5, 6, and 7. We compare our method incorporating the Huber and Tukey loss functions with the \(R^4\) method when the latter is applicable. We note that for the high-dimensional Settings 3 and 4, the \(R^4\) method of She and Chen (2017) cannot be applied because one of the iteration steps in their algorithm is not well defined. We also compare our method with another robust method that uses the \(\ell _1\) loss in place of the Huber loss in the objective with the nuclear norm constraint (denoted \(\ell _1\)). In all four settings, both the Huber and Tukey losses achieve very promising performance, and the Tukey loss slightly outperforms the Huber loss in settings with outliers.
5 Intermediate theoretical results
Our estimators (1) and (5) are penalized M-estimators. We exploit the framework of Negahban et al. (2012) to study their statistical properties. Negahban et al. (2012) elaborate on the notion of decomposability associated with a penalty function, which is a key property for establishing the restricted strong convexity (RSC) property and the error bounds of penalized estimators.
For self-completeness, we outline the decomposability of penalizing with the nuclear norm, and then derive the restricted strong convexity property for both models under the Huber loss function.
5.1 Decomposability of nuclear norm
A norm \(\Vert \cdot \Vert\) is decomposable with respect to a pair of subspaces \(({\mathcal {M}},\overline{{\mathcal {M}}}^\perp )\) of \({\mathbb {R}}^{p\times q}\) if all \(A\in {{\mathcal {M}}}\) and \(B\in \overline{{\mathcal {M}}}^\perp\) satisfy
To illustrate the decomposability of nuclear norm, recall
Note that \({\mathcal {M}}_r\ne \overline{{\mathcal {M}}}_r\). Since U and V both have orthogonal columns, the nuclear norm is decomposable with respect to the pair \(({\mathcal {M}}_r,\overline{{\mathcal {M}}}_r^\perp )\). Note that if the rank of \(M^*\) is at most r, then \(U_r\) and \(V_r\) contain the column and row spaces of \(M^*\), respectively, and \(M^*\in {\mathcal {M}}_r(U,V)\).
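The decomposability of the nuclear norm can be checked numerically: for a matrix A whose column and row spaces lie in \(U_r\) and \(V_r\), and a matrix B in the orthogonal pair, the nuclear norm adds up exactly. The construction below is a small sketch of this fact.

```python
import numpy as np

def nuclear_norm(M):
    """Sum of singular values."""
    return np.linalg.svd(M, compute_uv=False).sum()

rng = np.random.default_rng(0)
p, q, r = 6, 5, 2
# Random orthonormal bases for R^p and R^q
U, _ = np.linalg.qr(rng.standard_normal((p, p)))
V, _ = np.linalg.qr(rng.standard_normal((q, q)))
# A has column space in span(U[:, :r]) and row space in span(V[:, :r]);
# B lives in the orthogonal pair built from the remaining columns
A = U[:, :r] @ rng.standard_normal((r, r)) @ V[:, :r].T
B = U[:, r:] @ rng.standard_normal((p - r, q - r)) @ V[:, r:].T
# Decomposability: the nuclear norm splits exactly across the pair
lhs = nuclear_norm(A + B)
rhs = nuclear_norm(A) + nuclear_norm(B)
```

This works because the SVD of A + B is block-diagonal in the bases (U, V), so the singular values of the sum are the union of those of A and B.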
We present key intermediate results as lemmas below. The proofs of the lemmas are given in the Appendix.
5.2 Results for matrix completion
The decomposability leads to the first lemma, which is a special case of Lemma 1 in Negahban et al. (2012). It provides an upper bound for \(\Vert {\hat{\varDelta }}_{\overline{{\mathcal {M}}}_r^\perp }\Vert _*\).
Lemma 1
For any \(\gamma\) satisfying
the error \({\hat{\varDelta }}\) satisfies
Lemma 1 indicates that the estimator \({\hat{M}}\) belongs to the set
Note that if the rank of \(M^*\) is no greater than r, then \(\sum _{k=r+1}^{q}\sigma _k=0\) and the projection of the error on \(\overline{{\mathcal {M}}}_r^\perp\) is solely controlled by the projection of the error on \(\overline{{\mathcal {M}}}_r\), and so is the error itself, since
Now, consider the quantity
For simplicity, we sometimes write \(\delta {\mathcal {L}}_\eta (M,M^*)\) as \(\delta {\mathcal {L}}_\eta\). The next lemma gives a lower bound on \(\delta {\mathcal {L}}_\eta (M,M^*)\), which is used to establish restricted strong convexity (RSC) and the upper bound for the error. The proof relies on Lemma 1 and empirical process techniques.
Lemma 2
(Lower bound of \(\delta {\mathcal {L}}_\eta (M,M^*)\)) Suppose Assumptions 1 and 2 hold, and that the regularization parameter in optimization problem (7) satisfies
Then for any \(x>0\) and \(M \in \{M:\Vert M-M^*\Vert _{\max } \le \eta \} \cap {\mathcal {M}}_0\),
with probability at least \(1-e^{-x}\).
By controlling the negative term, we have the restricted strong convexity property.
Lemma 3
(Restricted Strong Convexity) Suppose that all the conditions in Lemma 2 and Assumption 3 hold. For \(M \in \{M:\Vert M-M^*\Vert _{\max } \le \eta \} \cap {\mathcal {M}}_0\), with probability at least \(1-e^{-(p+q)}\),
for \(n>C(L)\cdot c^2_1 pr\log (p+q)\log (q+1)\), where C(L) is a constant depending only on L.
5.3 Results for reduced-rank regression
Recall
Lemma 1 can be easily extended to \({\hat{C}}\).
Lemma 4
For any \(\gamma\) satisfying
\({\hat{\varDelta }}={\hat{C}}-C^*\) satisfies
Lemma 4 indicates that the estimator \({\hat{C}}\) belongs to the set
The next result is to establish the RSC condition. Consider the quantity
Lemma 5
(Lower bound of \(\delta {\mathcal {L}}_\eta (C,C^*)\)) Consider the reduced-rank regression problem (9). Suppose that Assumptions 1, 2, and 4 hold, and that the noises \(\xi _t\)'s are distributed symmetrically about zero. Suppose the regularization parameter in optimization problem (9) satisfies
Then for any \(x>0\) and \(C\in \{C:\Vert C-C^*\Vert _F\le \eta \}\cap {\mathcal {C}}_0\),
with probability at least \(1-e^{-x}\).
By controlling the negative term and setting the right side to be greater than 0, we have the restricted strong convexity property.
Lemma 6
(Restricted Strong Convexity) Suppose that all the conditions in Lemma 5 hold. Then for \(C\in \{C:\Vert C-C^*\Vert _F\le \eta \}\cap {\mathcal {C}}_0\) and \(n>C_2\frac{\sigma _1(\varSigma )}{\sigma _n(\varSigma )}c_1^2{r(p+q)}\), where \(C_2\) is a constant,
with probability at least \(1-(p+q)^{-1}\).
Data availability
The real data sets to evaluate the performance of the methods in this paper are publicly available. ‘Jester Joke’ data set is available through http://www.ieor.berkeley.edu/~goldberg/jester-data/, and ‘Cameraman image’ data is available through http://ltfat.org/doc/signals/cameraman.html.
Code availability
The MATLAB code is available upon request to the corresponding author.
References
Agarwal, A., Negahban, S., & Wainwright, M. J. (2012). Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, 40(2), 1171–1197.
Anderson, T. W. (2003). An introduction to multivariate statistical analysis. New York: Wiley, 3rd edition.
Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press.
Bousquet, O. (2002). A Bennett concentration inequality and its application to suprema of empirical processes. Comptes Rendus Mathematique, 334(6), 495–500.
Candès, E. J., & Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5), 2053–2080.
Charisopoulos, V., Chen, Y., Davis, D., Díaz, M., Ding, L., & Drusvyatskiy, D. (2021). Low-rank matrix recovery with composite optimization: Good conditioning and rapid convergence. Foundations of Computational Mathematics, 1–89.
Chen, C., He, B., & Yuan, X. (2012). Matrix completion via an alternating direction method. IMA Journal of Numerical Analysis, 32(1), 227–245.
Cook, R. D. (2009). Regression graphics: Ideas for studying regressions through graphics, (Vol. 482). Hoboken: John Wiley & Sons.
Elsener, A., & van de Geer, S. (2018). Robust low-rank matrix estimation. Annals of Statistics, 46(6B), 3481–3509.
Fan, J., Wang, W., & Zhu, Z. (2021). A shrinkage principle for heavy-tailed data: High-dimensional robust low-rank matrix recovery. Annals of Statistics, 49(3), 1239.
Freund, R. M., & Grigas, P. (2016). New analysis and results for the Frank–Wolfe method. Mathematical Programming, 155(1–2), 199–230.
Freund, R. M., Grigas, P., & Mazumder, R. (2017). An extended Frank–Wolfe method with “in-face’’ directions, and its application to low-rank matrix completion. SIAM Journal on Optimization, 27(1), 319–346.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (2011). Robust statistics: The approach based on influence functions (Vol. 196). Hoboken: John Wiley & Sons.
Huber, P. J. (2004). Robust statistics (Vol. 523). Hoboken: John Wiley & Sons.
Jaggi, M. (2013). Revisiting Frank–Wolfe: Projection-free sparse convex optimization. In International conference on machine learning (pp. 427–435). PMLR.
Kerdreux, T., d’Aspremont, A., & Pokutta, S. (2018). Restarting Frank–Wolfe. arXiv preprint arXiv:1810.02429.
Klopp, O. (2014). Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 20(1), 282–303.
Koltchinskii, V., Lounici, K., & Tsybakov, A. B. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5), 2302–2329.
Lacoste-Julien, S., & Jaggi, M. (2015). On the global linear convergence of Frank–Wolfe optimization variants. arXiv preprint arXiv:1511.05932.
Lauritzen, S. L. (1996). Graphical models (Vol. 17). Oxford: Clarendon Press.
Ledoux, M., & Talagrand, M. (2013). Probability in Banach spaces: Isoperimetry and processes. Springer, Berlin.
Loh, P.-L. (2017). Statistical consistency and asymptotic normality for high-dimensional robust \(m\)-estimators. The Annals of Statistics, 45(2), 866–896.
Negahban, S., & Wainwright, M. J. (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39(2), 1069–1097.
Negahban, S., & Wainwright, M. J. (2012). Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. The Journal of Machine Learning Research, 13(1), 1665–1697.
Negahban, S. N., Ravikumar, P., Wainwright, M. J., & Yu, B. (2012). A unified framework for high-dimensional analysis of \(m\)-estimators with decomposable regularizers. Statistical Science, 27(4), 538–557.
Recht, B. (2011). A simpler approach to matrix completion. Journal of Machine Learning Research, 12(12).
Reddi, S. J., Hefny, A., Sra, S., Poczos, B., & Smola, A. (2016). Stochastic variance reduction for nonconvex optimization. In International conference on machine learning (pp. 314–323).
Reinsel, G. C., & Velu, R. (1998). Multivariate reduced rank regression. Berlin: Springer.
Rohde, A., & Tsybakov, A. B. (2011). Estimation of high-dimensional low-rank matrices. The Annals of Statistics, 39(2), 887–930.
She, Y., & Chen, K. (2017). Robust reduced-rank regression. Biometrika, 104(3), 633–647.
Sun, Q., Zhou, W.-X., & Fan, J. (2020). Adaptive Huber regression. Journal of the American Statistical Association, 115(529), 254–265.
Swoboda, P., & Kolmogorov, V. (2019). Map inference via block-coordinate Frank–Wolfe algorithm. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 11146–11155).
Toh, K.-C., & Yun, S. (2010). An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization, 6(3), 615–640.
Wong, R. K., & Lee, T. C. (2017). Matrix completion with noisy entries and outliers. The Journal of Machine Learning Research, 18(1), 5404–5428.
Zhou, W.-X., Bose, K., Fan, J., & Liu, H. (2018). A new perspective on robust m-estimation: Finite sample theory and applications to dependence-adjusted multiple testing. The Annals of Statistics, 46(5), 1904–1931.
Funding
Tang was supported in part by a Subaward of an NIH Grant R01GM140476, and an NSF Grant DMS-2210687. Fang was partially supported by NSF Grants DMS-1820702, DMS-1953196, DMS-2015539, and a Grant from Whitehead foundation.
Author information
Contributions
All authors contributed to the conception and design of the methods, theory, and algorithms. Theoretical development was performed by NJ, and the experimental evaluation was performed by EXF. All authors participated in preparing, reading, and revising the manuscript; all authors approved the manuscript.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Editor: Pradeep Ravikumar.
Appendices
Appendix 1: More on the assumption on the model errors
A key assumption in Theorem 1 and Theorem 2 is that the noises \(\xi\)’s are i.i.d. with zero mean and a distribution function \(F_\xi\) satisfying
for any \(|m|\le \eta\) and some \(\eta >0\), where \(c_1\) is a positive constant depending only on \(\eta\). This is the same as requiring \(Pr( \xi \in [m - \eta , m + \eta ])\) to be bounded away from zero for any \(|m|\le \eta\) and \(\eta >0\). Since \({\mathbb {E}}(\xi )=0\) and \(0\in [m-\eta , m+\eta ]\), this condition holds as long as the probability mass near 0 is not too small, which is satisfied by a large class of distributions including heavy-tailed ones. As an example, Fig. 2 shows the density of a t-distribution with 3 degrees of freedom. The area of the grey part represents \(F_\xi (m+\eta )-F_\xi (m-\eta )\) when \(m=1\) and \(\eta =2\). Since the density near 0 is strictly bounded from below, the required condition (11) holds for any \(\eta >0\).
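Condition (11) for the t(3) example can be verified directly, since the t-distribution with 3 degrees of freedom has a closed-form CDF. The sketch below scans \(|m|\le \eta\) with \(\eta =2\) and confirms the probability stays bounded away from zero; the helper name is ours.

```python
import math

def t3_cdf(x):
    """CDF of Student's t with 3 degrees of freedom (closed form):
    F(x) = 1/2 + (arctan(x/sqrt(3)) + sqrt(3)x/(x^2+3)) / pi."""
    return 0.5 + (math.atan(x / math.sqrt(3))
                  + math.sqrt(3) * x / (x * x + 3)) / math.pi

eta = 2.0
ms = [i / 100 * eta for i in range(-100, 101)]   # grid over |m| <= eta
probs = [t3_cdf(m + eta) - t3_cdf(m - eta) for m in ms]
min_prob = min(probs)   # strictly positive, so condition (11) holds
```

The minimum over the grid is attained at the endpoints \(m=\pm \eta\) and is roughly 0.49, so any \(c_1\) with \(1/c_1^2\) below this value satisfies Assumption 2 for this heavy-tailed error.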
The Huber contamination model also satisfies Assumption 2. Specifically, suppose the errors \(\xi\)'s follow a Huber contamination model \((1-c)F+cG\) with F the distribution function of a normal random variable. Then \(Pr( \xi \in [m - \eta , m + \eta ]) = (1-c)\{F(m + \eta )-F(m - \eta )\}+c\{G(m + \eta )-G(m - \eta )\}\). The first term creates no issue. Assumption 2 is easily met if G is a continuous distribution with zero mean. When G is discrete, its distribution function is a step function, and the second term is either 0 or a positive value bounded above by 1. Overall, Assumption 2 is satisfied.
Appendix 2: Proof for matrix completion
This section presents the proofs related to matrix completion.
Proof of Lemma 1
Note that
Using triangle inequalities and the decomposability of nuclear norm on \({\mathcal {M}}_r\) and \({\mathcal {M}}_r^\perp\),
Thus,
By the convexity of the loss function \({\mathcal {L}}_\eta\), together with the assumption on \(\gamma\) and the definition of the dual norm,
Since \({\hat{M}}\) is the optimizer of problem (7),
Notice that \(\Vert M^*_{\overline{{\mathcal {M}}}_r^\perp }\Vert _*=\sum _{k=r+1}^{q}\sigma _k\), therefore the lemma holds. \(\square\)
For simplicity, let
Before we examine the RSC condition, we first bound the term \(\sigma _{1}(\nabla {\mathcal {L}}_\eta (M^*))\).
Lemma 7
(Upper bound for \(\sigma _1(\nabla {\mathcal {L}}_\eta (M^*))\)) Suppose the noises are i.i.d. with zero mean and are symmetrically distributed around zero. Then for any \(x>0\) and a positive constant \(c_0\),
with probability at least \(1-e^{-x}\).
Proof of Lemma 7
Since \(\sigma _1(\cdot )\) is a norm, the triangle inequality gives
It can be derived from Equation (1) that
Since the noises are symmetrically distributed around zero and \(\frac{\partial l_\eta (X_{ij}-M^*_{ij})}{\partial M_{ij}^*}\) is an odd function of the noise \(\xi _{ij}\), we have \({\mathbb {E}}[\frac{\partial l_\eta (X_{ij}-M^*_{ij})}{\partial M_{ij}^*}]=0\), and thus
To bound \(\sigma _1(\nabla {\mathcal {L}}_\eta (M^*)-{\mathbb {E}}[\nabla {\mathcal {L}}_\eta (M^*)])\), first notice that
Since the errors \(X_{ij}-M_{ij}^*\) are i.i.d.,
by Theorem 2.3 in Bousquet (2002), we have for any \(x>0\)
with probability at least \(1-e^{-x}\).
Moreover, since
we have with probability at least \(1-e^{-x}\)
By the symmetrization inequality in Boucheron et al. (2013),
where \(\epsilon _1,\dots ,\epsilon _n\) are i.i.d. Rademacher variables with distribution \({\mathbb {P}}(\epsilon _t=1)={\mathbb {P}}(\epsilon _t=-1)=\frac{1}{2}\), and are independent of \(\{X_{i_t,j_t}\}_{t=1}^n\) and \(\{J_t\}_{t=1}^n\).
Now, let \({\mathbb {E}}^*\) denote the conditional expectation given \(\{X_{i_t,j_t}, J_t\}_{t=1}^n\). Notice that \(W_{i_t,j_t} \frac{\partial l_\eta (X_{i_t,j_t}-M^*_{i_t,j_t})}{\partial M_{i_t,j_t}^*}\) is an \(\eta\)-Lipschitz function of \(W_{i_t,j_t}\). By Theorem 4.12 in Ledoux and Talagrand (2013), we have
Then take expectation over \(J_t\), and we have for a positive constant \(c_0\),
where the second inequality follows from the definition of dual norm, and the last inequality follows from Proposition 2 in Koltchinskii et al. (2011): it is simple to show that
besides, since \(\sigma _1(\epsilon _tJ_t)=|\epsilon _t|\sigma _1(J_t)\le |\epsilon _t|\), we have
where \(U_Z^{(\alpha )}\) is defined as \(U_Z^{(\alpha )}=\inf \{u>0:{\mathbb {E}} \exp (\frac{\sigma _1(Z)^\alpha }{u^\alpha })\le 2\}\), then by concavity of logarithm, we have
finally, using Proposition 2 in Koltchinskii et al. (2011), we have \(\forall {\tilde{x}}>0\) and a constant \({\tilde{c}}_0\)
Then
since \(\frac{1}{\sqrt{{\tilde{x}}+\log (p+q)}}\le \sqrt{2}\left[ \frac{1}{\sqrt{{\tilde{x}}}}+\frac{1}{\sqrt{\log (p+q)}}\right]\), after simplification, we have
where \(c_0\) is a constant independent of n, p and q.
By Equation (13), together with Equation (14), (15), (16) and (17), we have with probability at least \(1-e^{-x}\)
\(\square\)
Proof of Lemma 2
\(\delta {\mathcal {L}}_\eta (M, M^*)\) can be written as
In the following, we establish the lower bound for \({\mathbb {E}}[\delta {\mathcal {L}}_\eta (M,M^*)]\) and the upper bound for \(|{\mathbb {E}}[\delta {\mathcal {L}}_\eta (M,M^*)]-\delta {\mathcal {L}}_\eta (M,M^*)|\), respectively, for \(M\in \{M:\Vert M-M^*\Vert _{\max }\le \eta \}\cap {\mathcal {M}}_0\).
Given any \(M\in \{M:\Vert M-M^*\Vert _{\max }\le \eta \}\cap {\mathcal {M}}_0\) and \(\varDelta = M-M^*\),
where \(\frac{\partial l_\eta (X_{ij}-M^*_{ij})}{\partial M_{ij}^*}\) is defined in Equation (12).
Since \(l_\eta (X_{ij}-M_{ij})\) and \(\frac{\partial l_\eta (X_{ij}-M_{ij})}{\partial M_{ij}^*}\) are continuous functions of \(M_{ij}\),
where \(F(\cdot )\) is the cdf of \(X_{ij}\), and
Applying Taylor's theorem to \({\mathbb {E}}[l_\eta (X_{ij}-M_{ij})]\), we have for some \(t_{ij}\in (0,1)\)
where the inequality follows from Assumption 2.
Next, we consider the upper bound for \(\left| {\mathbb {E}}[\delta {\mathcal {L}}_\eta ]-\delta {\mathcal {L}}_\eta \right|\). The techniques used here are similar to those in the proof of Lemma 7.
Let \(\delta l_{\eta ,ij}=l_\eta (X_{ij}-{M}_{ij})-l_\eta (X_{ij}-M^*_{ij})-\frac{\partial l_\eta (X_{ij}-M^*_{ij})}{\partial M^*_{ij}}{\varDelta }_{ij}\). Since \(l_\eta\) is \(\eta\)-Lipschitz,
For any \(M \in {\mathcal {M}}_0\), letting \(\varDelta =M-M^*\), we have
Let \(Z_1=\frac{1}{n}\sup _{M\in {\mathbb {R}}^{p\times q}}\left| \sum _{t=1}^n \frac{f_{1t}(M)}{||\varDelta ||_F} \right|\). By Equation (20), \(\frac{f_{1t}(M)}{||\varDelta ||_F}\le 4\eta\) and \({\mathbb {E}}[\frac{f_{1t}^2(M)}{||\varDelta ||_F^2}]\le \frac{\eta ^2}{pq}\) for any \(M\in {\mathbb {R}}^{p\times q}\). Since the errors \(X_{ij}-M_{ij}^*\) are i.i.d., by Theorem 2.3 in Bousquet (2002), for any \(x\ge 0\)
with probability at least \(1-e^{-x}\).
Let \(\epsilon _t\)’s be i.i.d. Rademacher variables. Then, by the symmetrization inequality in Boucheron et al. (2013),
Let \({\mathbb {E}}^*\) denote the conditional expectation given \(\{X_{i_t,j_t}, J_t\}_{t=1}^n\).
By the contraction principle in Theorem 4.4 of Ledoux and Talagrand (2013), since \(|\frac{\delta l_{\eta ,i_tj_t}}{\varDelta _{i_t,j_t}}|\le 2\eta\),
Then
The second inequality follows from the definition of the dual norm.
By Lemma 1, we have for \(M\in {\mathcal {M}}_0\) and \(r>0\),
Note that
Then we have for all \(M\in {\mathbb {R}}^{p\times q}\) and \(\varDelta =M-M^*\)
If \(M^*\) is exactly low-rank with \(\text {rank}(M^*)\le r\), then \(\sum _{k=r+1}^{q} \sigma _k=0\), in this case
where the last inequality follows from Equation (17).
Then, by Equation (22)
with probability at least \(1-e^{-x}\).
Therefore, by Equation (21), with probability at least \(1-e^{-x}\),
Together with Equation (19), we have with probability at least \(1-e^{-x}\),
\(\square\)
Proof of Theorem 1
Construct \(M_t = M^* + t({\hat{M}}-M^*)\) in the following way. If \(\Vert {\hat{M}}-M^*\Vert _{\max }<\eta\), then \(t=1\), otherwise, choose t such that \(\Vert M_t-M^*\Vert _{\max }=\eta\). Let \(\varDelta _t=M_t-M^*=t({\hat{M}}-M^*)=t{\hat{\varDelta }}\). Notice
Since \({\hat{M}}\) is the optimizer of problem (7),
Therefore,
Then by Lemma 2, for any \(x>0\), with probability at least \(1-e^{-x}\)
Dividing both sides of the inequality by \(||\varDelta _t||_F\), we have
with probability at least \(1-e^{-x}\). The second inequality follows from Equation (23) when \(M^*\) has rank smaller than r and the fact that \(\gamma \ge 2\sigma _1(\nabla {\mathcal {L}}(M^*))\) with probability at least \(1-e^{-x}\) by Lemma 7.
Take \(x=\log (p+q)\) and \(n>C(L) \cdot c_1^2\sqrt{2rp\log (p+q)\log (q+1)}\), with C(L) being a constant depending on L; then \(\Vert \varDelta _t\Vert _{\max }\le \frac{L}{\sqrt{pq}}\Vert \varDelta _t\Vert _F<\eta\). Then, by the construction of \(M_t\), \(t=1\). Finally, we have with probability at least \(1-2e^{-x}-e^{-2x}\ge 1-3(p+q)^{-1}\),
where \(C_1\), \(c_2\), \(c_3\) are absolute constants. \(\square\)
Appendix 3: Proof for reduced-rank regression
Lemma 8
(Upper Bound for \(\sigma _1(\nabla {\mathcal {L}}_\eta (C^*))\)) Suppose that \(\xi _{ij}\)’s are i.i.d. with zero mean and symmetrically distributed around zero, then for any \(x>0\), we have with probability at least \(1-e^{-x}\),
Proof of Lemma 8
Note \(\frac{\partial }{\partial c_{kj}}{\mathcal {L}}(C) = \sum _{i=1}^n -\ell _\eta ' (y_{ij}-x_i^\top c_j) x_{ik}\). Let \(g_{ij}=\ell _\eta ' (y_{ij}-x_i^\top c_j)\), \(G=[g_{ij}]_{n\times q}\) and \(G^*\) the value of G when \(C=C^*\), then \(\nabla {\mathcal {L}}_\eta (C^*)=-X^\top G^*\).
Following the proof of Lemma 3 in Negahban and Wainwright (2011) (the proof is given in its supplementary material), we have
It remains to bound \(\frac{1}{n}\langle Xv, G^*u \rangle\). Let
where \(g^*_i\) is the i-th row of \(G^*\). Since \(\xi _{ij}\)’s are symmetrically distributed around zero and \(l'_\eta (x)\) is an odd function, \({\mathbb {E}}[G^*]=0.\) Hence, \({\mathbb {E}}\{\langle v, x_i \rangle \langle u, g^*_i \rangle \}=0\). Further, for k being any positive integer,
By Bernstein’s inequality, for any \(t>0\) and u, v satisfying \(\Vert u\Vert _2=1, \Vert v\Vert _2=1\),
Combining with Equation (25), we have
Take \(t=2(p+q)+x\) for any \(x>0\), and then we have
\(\square\)
Proof of Lemma 5
In the following, we establish the lower bound for \({\mathbb {E}}[\delta {\mathcal {L}}_\eta ]\) and the upper bound for \(\left| \delta {\mathcal {L}}_\eta -{\mathbb {E}}[\delta {\mathcal {L}}_\eta ]\right|\), respectively, for \(C\in {\mathcal {C}}_0\cap \{C:\Vert C-C^*\Vert _F\le \eta \}\).
Given any \(C\in {\mathcal {C}}_0\cap \{C:\Vert C-C^*\Vert _F\le \eta \}\) and \(\varDelta =C-C^*\), for some \(t_{ij} \in (0,1)\)
where the equality follows from Taylor's theorem, and the first inequality follows from Assumptions 2 and 4. For the calculation of \(\frac{\partial ^2{\mathbb {E}}[l_\eta (y_{ij}-\langle Z^{ij},C\rangle )]}{\partial (\langle Z^{ij},C\rangle )^2}\), refer to the calculation of \(\frac{\partial ^2{\mathbb {E}}[l_\eta (X_{ij}-M_{ij})]}{\partial M_{ij}^2}\) for the matrix completion problem.
For any \(i=1,\dots ,n, j=1,\dots ,q\), there exist \(\tau _{ij}\in (0,1)\), such that \(\ell _\eta (y_{ij}-x_i^\top c_j)-\ell _\eta (y_{ij}-x^\top _i c_j^*) = \ell _\eta '(y_{ij}-x_i^\top {\tilde{c}}_j)x_i^\top (c_j^*-c_j),\) where \({\tilde{c}}_j = c_j^* + \tau _{ij} (c_j-c_j^*)\). Therefore,
Then
Following the proof of Lemma 8, we have for any \(x>0\),
with probability at least \(1-e^{-x}\).
Similar to Equation (23), it can be shown that if \(C^*\) has rank smaller than r, then \(\sup _{C\in {\mathcal {C}}_0}\frac{\Vert \varDelta \Vert _*}{\Vert \varDelta \Vert _F}\le 4\sqrt{2r}.\) Hence, for \(C\in {\mathcal {C}}_0\), \(\Vert \varDelta \Vert _* \le 4\sqrt{2r}\Vert \varDelta \Vert _F\). Now we have with probability at least \(1-e^{-x}\),
\(\square\)
Proof of Theorem 2
Construct \(C_t = C^* + t({\hat{C}}-C^*)\) as follows: if \(\Vert {\hat{C}}-C^*\Vert _F < \eta\), set \(t=1\); otherwise, choose t such that \(\Vert C_t-C^*\Vert _F = \eta\). Let \(\varDelta _t=C_t - C^* = t({\hat{C}}-C^*) = t{\hat{\varDelta }}\). Notice
Since \({\hat{C}}\) is the optimizer of problem (9), we have
Therefore,
By Lemma 5, for any \(x>0\), with probability at least \(1-e^{-x}\),
The second inequality follows from Equation (23) when \(C^*\) has rank smaller than r and the fact that the choice of \(\gamma\) satisfies \(\gamma \ge 2\sigma _1(\nabla {\mathcal {L}}(C^*))\) with probability at least \(1-e^{-x}\) by Lemma 8.
Further, by Equation (23) and the fact that \(\gamma \ge 2\sigma _1(\nabla {\mathcal {L}}(C^*))\) with probability at least \(1-e^{-x}\) by Lemma 8, we have
with probability at least \((1-e^{-x})^2\). Taking \(x=p+q\) and \(n > C_2 \cdot \frac{\sigma _1(\varSigma )}{\sigma _n(\varSigma )} c_1^2 (p+q)r\) for some constant \(C_2\), we have \(\Vert \varDelta _t\Vert _F < \eta\). Then, by the construction of \(C_t\), \(t=1\). Finally, we have with probability at least \(1-2e^{-x}-e^{-2x}\ge 1-3e^{-(p+q)}\),
\(\square\)
Cite this article
**g, N., Fang, E.X. & Tang, C.Y. Robust matrix estimations meet Frank–Wolfe algorithm. Mach Learn 112, 2723–2760 (2023). https://doi.org/10.1007/s10994-023-06325-w