Abstract
We consider estimating matrix-valued model parameters with a dedicated focus on their robustness. Our setting concerns large-scale structured data so that a regularization on the matrix’s rank becomes indispensable. Though robust loss functions are expected to be effective, their practical implementations are known difficult due to the non-smooth criterion functions encountered in the optimizations. To meet the challenges, we develop a highly efficient computing scheme taking advantage of the projection-free Frank–Wolfe algorithms that require only the first-order derivative of the criterion function. Our methodological framework is broad, extensively accommodating robust loss functions in conjunction with penalty functions in the context of matrix estimation problems. We establish the non-asymptotic error bounds of the matrix estimations with the Huber loss and nuclear norm penalty in two concrete cases: matrix completion with partial and noisy observations and reduced-rank regressions. Our theory demonstrates the merits from using robust loss functions, so that matrix-valued estimators with good properties are achieved even when heavy-tailed distributions are involved. We illustrate the promising performance of our methods with extensive numerical examples and data analysis.
1 Introduction
Massive data with informative structures from the data collection processes are becoming increasingly available in many data-enabled areas. Examples include those from functional MRI (fMRI), electroencephalogram (EEG), and tick-by-tick financial trading records of many assets. Methodologically, matrix-valued model parameters are analyzed in the core step(s) of many popular approaches to multivariate data analysis, including principal component analysis, canonical correlation analysis (Anderson, 2003), Gaussian graphical models (Lauritzen, 1996), reduced-rank regression (Reinsel & Velu, 1998), sufficient dimension reduction (Cook, 2009), and many others.
Structural information—our foremost consideration in this study—is indispensable in solving many matrix estimation problems with large-scale data. For matrix-valued model parameters, a class of methods imposes restrictions on the rank of the targeted matrix. In matrix completion with partial and noisy observations, for example, without such structural information, successfully recovering the signal is not possible. For multi-response regression problems, structural information is vital for both methodological development and practical implementation for drawing informative conclusions. Constraining the rank of the parameter matrix in multi-response regression leads to the conventional reduced-rank regression (Reinsel & Velu, 1998).
Our primary goal in this study is to investigate robustness when estimating matrices with large-scale data and structural information. Robustness is a foundational concern in current data-enabled investigations. During massive data collection processes, observations of heterogeneous quality are inevitable, and even erroneous records are common. On one hand, due to the huge size of the data in modern large-scale investigations, validations and error corrections become too daunting to be practical; robust statistical methods in these scenarios are thus highly desirable. On the other hand, the criterion functions commonly applied in existing methods, including the squared loss and the negative log-likelihood, are convenient but unfortunately not robust to such violations of the model assumptions.
We are thus motivated to consider robustness in a context where structural information is incorporated by constraining the rank of the matrix-valued model parameters. The foremost challenge in this scenario is the fundamental computational difficulty. One source of the difficulty is that constraining a matrix’s rank results in a non-convex problem. As a rare example, in reduced-rank multivariate regression an analytic solution is available despite the non-convexity; see Reinsel and Velu (1998). Unfortunately, such a convenience generally no longer exists in broader settings, and optimization problems with rank constraints are generally difficult to solve. To meet the challenge, a convex relaxation of the problem leads to regularizing the nuclear norm of the matrix-valued model parameter. From the statistical perspective, numerous works (Candès & Tao, 2010; Negahban & Wainwright, 2011; Agarwal et al., 2012) have studied the theoretical properties of estimators constructed with the nuclear norm relaxation, and have proved that the resulting estimators achieve optimal or near-optimal statistical properties under different settings. In addition to the non-convexity, the consideration of robustness further adds to the computational difficulty. Robust loss functions are a traditional and influential device for establishing more robust statistical methods; see Huber (2004) and Hampel et al. (2011). Though demonstrated effective in conventional statistical analysis, substantial difficulties arise when handling large-scale modern complex data-enabled problems. Computationally, in particular, their applications encounter major challenges because robust loss functions are not smooth: their second-order derivatives do not exist everywhere.
Analytically, establishing the statistical properties of the matrix estimations is challenging in this scenario too, because the impacts from possibly heavy-tailed errors are involved in studying large-scale problems. Existing methods using the squared loss or the negative log-likelihood as the loss functions require the noises to be sub-Gaussian in order to handle high-dimensional data. Robust methods can accommodate noises with heavier tails than sub-Gaussian; meanwhile, the capacity for handling high-dimensional data remains desirable.
There has been active recent development in robust statistical methods for high-dimensional data; see, for example, Loh (2017), Zhou et al. (2018), Sun et al. (2020), and references therein. Recently, there has been increasing interest in robust methods for matrix-valued model parameters. She and Chen (2017) studied robust reduced-rank regression in a scenario concerning outliers. They define the estimator as the minimizer of a non-convex optimization problem, establish theoretical error bounds, and propose an iterative algorithm that alternately solves for two parts of the model parameters in their setting. Due to the non-convexity, their algorithm does not guarantee convergence to the global minimum. Wong and Lee (2017) studied matrix completion with the Huber loss. Their algorithm iteratively projects non-robust matrix estimators, which is computationally demanding owing to the many projection operations required. Elsener and van de Geer (2018) investigated robust matrix completion with the Huber loss function and nuclear norm penalization. Their computational algorithms involve a soft-thresholding step for singular values, which works well when the solution is of exact low rank; when the solution is only of approximately low rank, or of modestly higher rank, such a step becomes computationally demanding. As pointed out in She and Chen (2017), efficient algorithms are desirable for solving optimization problems with rank constraints and robust loss functions.
We attempt our study with a foremost consideration on an efficient computing scheme for solving large-scale statistical problems with robustness. In particular, we aim to develop efficient first-order algorithms by building a scheme with Frank–Wolfe-type algorithms for robust matrix estimation problems. The Frank–Wolfe algorithm is a first-order method and has drawn considerable attention recently (Jaggi, 2013; Lacoste-Julien & Jaggi, 2015; Freund & Grigas, 2016; Freund et al., 2017; Kerdreux et al., 2018; Swoboda & Kolmogorov, 2019). The key advantage of Frank–Wolfe algorithms is their freedom from the projections required by most proximal-type algorithms. In addition, as we shall see in Sect. 2, for matrix estimation problems the Frank–Wolfe algorithm only requires computing the leading pair of singular vectors in each iteration, which can be done efficiently even for huge-size problems. These merits make Frank–Wolfe-type algorithms particularly appealing for solving large-scale robust low-rank matrix estimation problems.
Our study makes two main contributions. Foremost, we develop a new computation scheme for robust matrix estimation and demonstrate that the first-order optimization technique makes solving large-scale robust estimation problems practically convenient. We show extensively that our framework is broadly applicable, covering general robust loss functions including those used in median and quantile regression; see Sect. 2. Second, our theoretical analysis reveals the benefit from using robust loss functions and rank constraints. Our non-asymptotic results demonstrate that our framework can accommodate high-dimensional data. For matrix completion and reduced-rank regression, the resulting matrix-valued estimator works satisfactorily even when the model error distributions are heavy-tailed.
The rest of this article is organized as follows. Section 2 elaborates a concrete framework using the Frank–Wolfe algorithm to solve robust matrix estimation problems. We present matrix completion and reduced-rank regression with various robust loss functions. Section 3 justifies the validity of our method with theory on the algorithm convergence and error bounds of the resulting estimators. Section 4 presents extensive numerical examples demonstrating the promising performance of our methods.
For a generic matrix A, we denote by \(A^\top\) its transpose, \(\sigma _1(A)\) its largest singular value, \(\Vert A\Vert _*\) its nuclear norm, and \(\Vert A\Vert _F\) its Frobenius norm. Let \(\langle A, B\rangle =\text {trace}(A^\top B)\) for \(A, B \in {\mathbb {R}}^{p\times q}\). We denote by \(\varTheta \in {\mathbb {R}} ^{p\times q}\) a generic matrix-valued model parameter. In this study, we focus on two concrete cases. In one case, \(\varTheta =M\) where M is the signal to be recovered in the matrix completion problem with a single copy of partial and noisy observations; the other one is \(\varTheta =C\) where C is the matrix-valued coefficients in the multi-response regression problem. Furthermore, we show that our framework broadly applies in solving a general class of problems.
2 Methodology
2.1 Matrix completion
We consider the matrix completion problem first. In this setting, one observes a noisy subset of all entries of a matrix \(M\in {{\mathbb {R}}}^{p\times q}\), which is the model parameter of interest. Let the set of observed entries be \(\varOmega = \{(i_t,j_t)\}_{t=1}^n\), where \(i_t\in \{1,\dots , p\}; j_t\in \{1,\dots , q\}\), and denote by \(X_{i_t,j_t}\), \((i_t,j_t)\in \varOmega\), the corresponding noisy observations such that

$$X_{i_t,j_t}=M_{i_t,j_t}+\xi _t, \qquad t=1,\dots ,n.$$
We assume that \(\xi _t\)’s are independent and identically distributed random variables with mean zero.
To effectively recover M with a single copy of partial and noisy observations over \(\varOmega\), one popular approach is to assume that the underlying true matrix, denoted by \(M^*\), is of low rank with \(\text {rank}(M^*)\le r\) for some \(r\le \min (p,q)\). Then one can estimate \(M^*\) by minimizing the objective function \((2n)^{-1}\sum _{t=1}^n \ell (X_{i_t,j_t}-M_{i_t,j_t})\) over M, subject to \(\text {rank}(M)\le r\), for some loss function \(\ell (\cdot )\). Since the rank constraint is non-convex, solving the optimization is generally not tractable. To obtain a practical solution, a common strategy is to relax the rank constraint to the convex nuclear norm constraint.
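A quick numerical illustration of why the relaxation helps (a sketch with matrices of our own choosing): the rank function is not convex, while the nuclear norm is.

```python
import numpy as np

# Two rank-one matrices whose midpoint has rank two: the rank function
# violates convexity, so rank constraints cannot be handled directly by
# convex solvers.
A = np.outer([1.0, 0.0], [1.0, 0.0])
B = np.outer([0.0, 1.0], [0.0, 1.0])
mid = 0.5 * (A + B)
rank_mid = np.linalg.matrix_rank(mid)   # exceeds (rank A + rank B) / 2 = 1

# The nuclear norm is convex: it satisfies the defining inequality for the
# same convex combination, making it a tractable surrogate for the rank.
lhs = np.linalg.norm(mid, 'nuc')
rhs = 0.5 * np.linalg.norm(A, 'nuc') + 0.5 * np.linalg.norm(B, 'nuc')
```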
The Huber loss function leads to robust estimators because its design alleviates the excessive contribution from a data point that deviates extremely from the fit. Practically, the Huber loss performs promisingly when handling a substantial portion of noisy observations whose distribution can be heavy-tailed; see Huber (2004).
By applying the Huber loss with a constraint on the nuclear norm, we consider the following robust matrix completion problem:

$$\hat{M} = \mathop {\text {argmin}}\limits _{\Vert M\Vert _*\le \lambda }\; \frac{1}{2n}\sum _{t=1}^n \ell _\eta (X_{i_t,j_t}-M_{i_t,j_t}), \qquad (1)$$
where \(\ell _\eta (\cdot )\) is the classical Huber loss function:

$$\ell _\eta (x)= {\left\{ \begin{array}{ll} x^2/2, &{} |x|\le \eta ,\\ \eta |x|-\eta ^2/2, &{} |x|>\eta . \end{array}\right. } \qquad (2)$$
Here \(\eta\) is the tuning parameter of the Huber loss, and \(\lambda\) is the tuning parameter regularizing the nuclear norm of M. In our numerical studies, we choose both tuning parameters by cross-validation.
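For reference, the Huber loss can be coded in a few lines; this is a sketch with our own function name, and `eta` plays the role of the robustification parameter \(\eta\).

```python
import numpy as np

def huber_loss(r, eta):
    """Elementwise Huber loss: quadratic near zero, linear in the tails,
    so extreme residuals contribute only linearly to the criterion."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= eta,
                    0.5 * r ** 2,                    # least-squares regime
                    eta * (np.abs(r) - 0.5 * eta))   # robust linear regime
```

A small `eta` downweights outliers more aggressively, while a large `eta` recovers the squared loss on most residuals.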
Since \(\ell _\eta\) is not smooth, methods commonly applied in solving \(\ell _2\)-loss problems, which require second-order derivatives, do not directly apply. Solving optimization problem (1) is generally hard; see the discussion in She and Chen (2017). Efficient algorithms for solving (1) are lacking; the primary difficulty is the absence of the second-order derivative of the Huber loss. It is even more challenging to minimize the Huber loss on a restricted low-rank region, and to achieve computational efficiency with large-scale data. More broadly, non-smooth criterion functions are common among general robust loss functions, with prominent examples including the least absolute deviation loss of median regression, the check loss of quantile regression, and Tukey’s biweight loss, besides the aforementioned Huber loss.
To address the computational difficulty in handling large-scale problems with robust loss functions, we propose to apply the Frank–Wolfe algorithm. The Frank–Wolfe algorithm has been particularly powerful for convex optimization. As a first-order approach requiring no second-order derivative of the criterion function, it is well suited to problems with non-smooth loss functions, which is exactly the case for our problem (1). Briefly speaking, the Frank–Wolfe algorithm pursues some constrained approximation of the gradient, the first-order derivative of the criterion function evaluated at a given value. The algorithm runs iteratively, with the optimization proceeding along the direction identified by the approximation of the gradient. The Frank–Wolfe algorithm is therefore practically appealing, as one has the opportunity to best exploit a constrained approximation that can be computed efficiently. For a detailed account of Frank–Wolfe algorithms and recent advances in the area, we refer to Freund and Grigas (2016), Freund et al. (2017), and references therein.
Concretely in our setting, we develop an algorithm that runs iteratively. Specifically, at the \((k+1)\)-th iteration with \(M^{(k)}\) from the previous step, the matrix-valued gradient of (1), \(\nabla {\mathcal {L}}(M^{(k)}) \in {\mathbb {R}}^{p\times q}\), is analytically calculated by

$$\nabla {\mathcal {L}}(M^{(k)})=-\frac{1}{2n}\sum _{t=1}^{n}\Big [(X_{i_t,j_t}-M^{(k)}_{i_t,j_t})\,\text {1}\big (|X_{i_t,j_t}-M^{(k)}_{i_t,j_t}|\le \eta \big )+\eta \,\text {sign}\big (X_{i_t,j_t}-M^{(k)}_{i_t,j_t}\big )\,\text {1}\big (|X_{i_t,j_t}-M^{(k)}_{i_t,j_t}|>\eta \big )\Big ]\, J_t, \qquad (3)$$
where \(J_t\) is a matrix with \(J_{t,i_tj_t}=1\) and all the other entries 0, \(\text {1}(\cdot )\) is the indicator function, and \(\text {sign}(x)=1\) if x is positive and \(-1\) otherwise. Hence, evaluating the gradient can be done efficiently, and it is a scalable process that can be distributed if multiple computing units are available. The Frank–Wolfe algorithm then suggests computing a descent direction in the \((k+1)\)-th iteration:

$$V^{(k+1)}=\mathop {\text {argmin}}\limits _{\Vert V\Vert _*\le \lambda }\,\big \langle \nabla {\mathcal {L}}(M^{(k)}),\,V\big \rangle .$$
In this step, a key observation is that

$$V^{(k+1)}=-\lambda \, u_1v_1^\top ,$$
where \(u_1\) and \(v_1\) are the leading left and right singular vectors of \(\nabla {\mathcal {L}}(M^{(k)})\). The required singular value decomposition can be computed efficiently by an existing algorithm implemented in the standard “PROPACK” package in Matlab. Then, we conduct a descent step to update \(M^{(k)}\) by

$$M^{(k+1)}=(1-\alpha _{k+1})\,M^{(k)}+\alpha _{k+1} V^{(k+1)}, \qquad (4)$$
where \(\alpha _{k+1}\in [0,1]\) is a pre-specified step-size. For example, \(\alpha _{k+1} = 1/(k+3)\) guarantees convergence to an optimal solution. Meanwhile, line search is viable, and there are various ways to further accelerate this algorithm.
Intuitively, the updating direction in Equation (4) is viewed as the best rank-one approximation of the gradient matrix (3). Further, if we view the vector \(u_1\) as the direction corresponding to the first principal component of the columns of M, then formula (4) is essentially a column-wise update along this direction, with the step sizes proportional to the components in the vector \(v_1\). From this perspective, the update formula (4) can also be viewed as a computationally efficient matrix-valued coordinate descent along the direction \(u_1\). Since the objective function (1) is convex, such an update progressing along the gradient direction ensures that the criterion function converges, approaching the minimum.
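The rank-one form of the linear-minimization step can be checked numerically; the snippet below is a sketch (dimensions and names are our own) verifying that \(-\lambda u_1v_1^\top\) attains the smallest inner product with the gradient over the nuclear-norm ball.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))   # stand-in for a gradient matrix
lam = 2.0

# Claimed minimizer of <G, V> over {V : ||V||_* <= lam}
U, s, Vt = np.linalg.svd(G, full_matrices=False)
V_star = -lam * np.outer(U[:, 0], Vt[0, :])
best = float(np.sum(G * V_star))   # equals -lam * sigma_1(G) by norm duality

# Spot check: no random feasible point on the boundary does better
for _ in range(200):
    W = rng.standard_normal((6, 4))
    W *= lam / np.linalg.norm(W, 'nuc')   # scale onto the boundary of the ball
    assert np.sum(G * W) >= best - 1e-9
```

The check relies on the duality between the spectral and nuclear norms, so \(\langle G, V\rangle \ge -\sigma _1(G)\Vert V\Vert _*\) for every feasible V.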
We summarize the algorithm in Algorithm 1.
![Algorithm 1](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10994-023-06325-w/MediaObjects/10994_2023_6325_Figa_HTML.png)
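A minimal executable sketch of Algorithm 1 follows (function names are ours; a full SVD stands in for a PROPACK-style top-singular-pair routine, the step size is the \(1/(k+3)\) choice from the text, and the \(1/(2n)\) scaling follows the objective):

```python
import numpy as np

def frank_wolfe_completion(x_obs, rows, cols, shape, lam, eta, iters=300):
    """Projection-free Frank-Wolfe for Huber-loss matrix completion over
    the nuclear-norm ball {M : ||M||_* <= lam} (a sketch of Algorithm 1)."""
    n = len(x_obs)
    M = np.zeros(shape)
    for k in range(iters):
        # Gradient of the Huber criterion at the observed entries
        r = x_obs - M[rows, cols]
        psi = np.where(np.abs(r) <= eta, r, eta * np.sign(r))  # Huber score
        G = np.zeros(shape)
        np.add.at(G, (rows, cols), -psi / (2 * n))  # scatter-add the J_t terms
        # Linear minimization: argmin over the ball is -lam * u1 v1^T
        U, s, Vt = np.linalg.svd(G, full_matrices=False)
        V = -lam * np.outer(U[:, 0], Vt[0, :])
        # Convex-combination update keeps M inside the nuclear-norm ball
        alpha = 1.0 / (k + 3)
        M = (1 - alpha) * M + alpha * V
    return M
```

For large matrices, the full SVD line would be replaced by an iterative top-singular-pair solver, which is the step that makes the method scale.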
2.2 Reduced-rank regression
In our second concrete problem with matrix-valued model parameters, we consider a multivariate linear regression

$$y_{ij}=x_i^\top c_j+\xi _{ij}, \qquad i=1,\dots ,n;\; j=1,\dots ,q,$$
where \(\xi _{ij}\)’s are model errors. We assume that \(\xi _{ij}\)’s are independent and identically distributed random variables with mean zero. Then, we have in a matrix form

$$Y=XC+\varXi ,$$
where \(Y=[y_{ij}]_{n\times q}, X=[x_{ij}]_{n\times p}=[x_1,\dots , x_n]^\top , C=[c_1,\dots , c_q]\in {\mathbb {R}}^{p\times q}\), and \(\varXi =[\xi _{ij}]_{n\times q}\).
In this setting, one may opt to restrict the rank of C, i.e., \(\text {rank}(C)\le r\) \((r \le \min (p,q))\), leading to the conventional reduced-rank regression (Reinsel & Velu, 1998). Also relaxing the rank constraint with the nuclear norm, we consider the estimation problem

$$\hat{C} = \mathop {\text {argmin}}\limits _{\Vert C\Vert _*\le \lambda }\; \frac{1}{2n}\sum _{i=1}^n\sum _{j=1}^q \ell _\eta (y_{ij}-x_i^\top c_j), \qquad (5)$$
where \(c_j\) denotes the j-th column of C, and \(\ell _\eta (\cdot )\) is the Huber loss function with parameter \(\eta\).
Again, to address the computational challenges, analogous to problem (1), we propose to solve problem (5) by applying the Frank–Wolfe algorithm iteratively, with the steps described as follows. Denote by \(C^{(k)}\) the solution after the k-th iteration. At the \((k+1)\)-th iteration, let \(\nabla {\mathcal {L}}_{\eta }(C^{(k)})\) be the gradient of the loss function at \(C^{(k)}\):

$$\nabla {\mathcal {L}}_\eta (C^{(k)})=-\frac{1}{2n}\sum _{i=1}^{n}\sum _{j=1}^{q}\Big [(y_{ij}-x_i^\top c_j^{(k)})\,\text {1}\big (|y_{ij}-x_i^\top c_j^{(k)}|\le \eta \big )+\eta \,\text {sign}\big (y_{ij}-x_i^\top c_j^{(k)}\big )\,\text {1}\big (|y_{ij}-x_i^\top c_j^{(k)}|>\eta \big )\Big ]\, Z^{ij}, \qquad (6)$$
where \(Z^{ij}\) is a matrix with the j-th column being \(x_i\) and the remaining entries 0. Then, we compute a descent direction from

$$V^{(k+1)}=\mathop {\text {argmin}}\limits _{\Vert V\Vert _*\le \lambda }\,\big \langle \nabla {\mathcal {L}}_\eta (C^{(k)}),\,V\big \rangle ,$$
with the solution

$$V^{(k+1)}=-\lambda \, u_1v_1^\top ,$$
where \(u_1\) and \(v_1\) are the leading left and right singular vectors of \(\nabla {\mathcal {L}}_\eta (C^{(k)})\).
The algorithm follows Algorithm 1, with different input data and the gradient matrix specified by (6).
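For concreteness, the two ingredients of the reduced-rank step, the gradient and the rank-one descent direction, can be sketched as follows (our own function names; the \(1/(2n)\) scaling mirrors the matrix completion objective, and the sum over the matrices \(Z^{ij}\) collapses into a single matrix product):

```python
import numpy as np

def rrr_huber_gradient(C, X, Y, eta):
    """Gradient of the Huber reduced-rank-regression loss at C."""
    R = Y - X @ C                                          # n x q residuals
    psi = np.where(np.abs(R) <= eta, R, eta * np.sign(R))  # Huber score
    return -(X.T @ psi) / (2 * X.shape[0])                 # p x q gradient

def fw_direction(G, lam):
    """Descent direction -lam * u1 v1^T from the top singular pair of G."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return -lam * np.outer(U[:, 0], Vt[0, :])
```

When every residual falls in the quadratic zone, the gradient reduces to the familiar least-squares expression, which offers a quick sanity check.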
2.3 Other robust loss functions
Our framework for developing efficient computation algorithms easily accommodates a broad class of robust loss functions that are not smooth. Examples include the \(\ell _1\)-loss (the least absolute deviation loss), the check loss, Tukey’s biweight loss, and more; see Hampel et al. (2011).
A scheme is developed as follows. The only necessary adjustment relative to Algorithm 1 is calculating the gradient of the loss function, \(\nabla {\mathcal {L}}(\cdot )\). Then, the general updating step is

$$\varTheta ^{(k+1)}=(1-\alpha _{k+1})\,\varTheta ^{(k)}+\alpha _{k+1} V^{(k+1)},$$
where \(\alpha _{k+1}\) is some pre-specified step-size, \(V^{(k+1)} = -\lambda \cdot u_1v_1^\top ,\) with \(u_1\) and \(v_1\) being the first left and right singular vectors of \(\nabla {\mathcal {L}}(\varTheta ^{(k)})\).
Table 1 presents gradients for several common loss functions in the context of matrix completion and reduced-rank regression.
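In code, swapping losses amounts to swapping score (derivative) functions, in the spirit of Table 1. The sketch below uses our own names and subgradient conventions at the kinks; only this score enters the general updating step.

```python
import numpy as np

def robust_score(r, loss, eta=1.345, tau=0.5):
    """Derivative of common robust losses; tau is the quantile level of
    the check loss, eta the threshold of the Huber and Tukey losses."""
    r = np.asarray(r, dtype=float)
    if loss == "huber":
        return np.where(np.abs(r) <= eta, r, eta * np.sign(r))
    if loss == "lad":     # l1 / least absolute deviation
        return np.sign(r)
    if loss == "check":   # quantile check loss rho_tau(r) = r * (tau - 1(r < 0))
        return np.where(r >= 0.0, tau, tau - 1.0)
    if loss == "tukey":   # Tukey's biweight (non-convex, redescending)
        return np.where(np.abs(r) <= eta, r * (1.0 - (r / eta) ** 2) ** 2, 0.0)
    raise ValueError(f"unknown loss: {loss}")
```

Note how Tukey's score vanishes beyond the threshold, so gross outliers exert no pull at all, at the price of non-convexity.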
3 Theory
3.1 Convergence of the algorithms
For self-completeness, we present the theoretical guarantees for the Frank–Wolfe algorithm in the context of robust matrix estimations, together with a simple way to choose the step-sizes.
We prove that, by choosing the step-size properly, the objective functions with the Huber loss in both the matrix completion and reduced-rank regression problems converge to their optima at the rate \({\mathcal {O}}(1/k)\), where k is the iteration counter. The next proposition is for reduced-rank regression problems; the result for the matrix completion problem can be proved similarly.
Proposition 1
Consider the loss function \({\mathcal {L}}_\eta (\cdot ): {\mathbb {R}}^{n \times p} \rightarrow {\mathbb {R}}\) constructed from the Huber loss function (2) with parameter \(\eta\). For the reduced-rank regression problem (5), by the Frank–Wolfe Algorithm with stepsize set as
where \(L_z\) is some positive number. Suppose the diameter of the feasible set is \(D:=\max _{V_1,V_2\in {\mathbb {S}}} \Vert V_1-V_2\Vert _F\), where \({\mathbb {S}}= \{V: \Vert V\Vert _*\le \lambda \}\). Then \({\mathcal {L}}(C^{(k)})\) is monotonically decreasing in k, and we have
Proof
Since the Huber loss function is differentiable everywhere, \(\nabla {\mathcal {L}}_\eta ( C )\) is Lipschitz-continuous. Thus, with \(L_z\) defined above as its Lipschitz constant, the result follows from Theorem 1 of Freund et al. (2017). \(\square\)
We point out that for the matrix completion problem (1), the result holds by the same argument by letting \(L_z = 1\).
Meanwhile, our broad interests include some non-convex losses such as Tukey’s biweight loss. A strategy for handling them is to approximate the loss, to arbitrary precision, by a function with a Lipschitz continuous gradient via simple smoothing techniques. Upon applying the same step-sizes as discussed above, we can show that the algorithm converges to a stationary point at the same rate; see the analysis in the recent work of Reddi et al. (2016).
Recently, Charisopoulos et al. (2021) studied low-rank matrix recovery algorithms with the non-convex rank constraint and non-smooth loss functions. They established optimization convergence rates for a prox-linear method and a subgradient method for matrix completion, and proved that, with a sufficient number of observations and an appropriate initialization, both methods are guaranteed to converge to the truth. The prox-linear method possesses a much faster convergence rate of \({\mathcal {O}}(2^{-k})\) but with a higher computational cost at each iteration from solving a convex subproblem. The subgradient method has a lower cost at each iteration, with a subgradient evaluation step and a projection step onto the desired region, but a slower rate. Compared with their algorithms, our method has a lower computational burden in each iteration, with no projection required, and a relatively slower convergence rate. Directly minimizing a robust loss function under the non-convex constraint is worth studying in the future.
3.2 Statistical properties
We investigate the non-asymptotic error bounds in this section. We first introduce two conditions for both matrix completion and reduced-rank regression models.
Assumption 1
The truths \(M^*\) and \(C^*\) have rank at most \(r,~0<r<\min (p,q)\).
Assumption 2
The noises \(\xi\)’s are i.i.d. with zero mean and a distribution function \(F_\xi\) satisfying

$$F_\xi (m+\eta )-F_\xi (m-\eta )>\frac{1}{c_1^2}$$
for any \(|m|\le \eta\) and \(\eta >0\), where \(c_1=c_1(\eta )\) is a constant depending only on \(\eta\).
Assumption 2 is the key condition on the distribution of the noises. It is very mild: it only requires non-vanishing probability mass of \(\xi\) between \(m-\eta\) and \(m+\eta\) for a positive \(\eta\) and \(|m|\le \eta\), avoiding explicit conditions on the tail probability and/or the existence of moments up to some order. Since the condition holds for \(\eta >0\) as long as the probability mass of \(\xi\) near 0 is not too small, it is easily satisfied by a wide range of distributions, including heavy-tailed ones; see more discussion of this assumption and examples in Appendix 1.
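As a quick numerical illustration (our own choice of example), the standard Cauchy distribution, which has no finite mean, satisfies Assumption 2: the mass \(F_\xi (m+\eta )-F_\xi (m-\eta )\) stays bounded away from zero over \(|m|\le \eta\).

```python
import math

def cauchy_cdf(x):
    """CDF of the standard Cauchy law, a heavy-tailed distribution with
    no finite mean or variance."""
    return 0.5 + math.atan(x) / math.pi

def assumption2_mass(m, eta):
    """The probability mass F(m + eta) - F(m - eta) in Assumption 2."""
    return cauchy_cdf(m + eta) - cauchy_cdf(m - eta)

# Scan m over [-eta, eta]; the mass is minimized at the endpoints |m| = eta,
# so any c_1 with 1 / c_1^2 below that minimum makes Assumption 2 hold.
eta = 1.0
min_mass = min(assumption2_mass(i / 100.0 - 1.0, eta) for i in range(201))
```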
3.2.1 Matrix completion
For any matrix A and a linear subspace \({\mathcal {M}}\) of \({\mathbb {R}}^{p\times q}\), we define \(A_{{\mathcal {M}}}\) as the projection of A onto \({\mathcal {M}}\). We consider without loss of generality that \(p>q>1\). Recall that \(J_t\) \((t=1,\dots , n)\) is a \(p\times q\) random matrix, independent of \(X_{i_t,j_t}\) and \(\xi _t\), with one randomly chosen entry \(J_{t,i_tj_t}\) being 1 and the others 0. \(M_{i_t,j_t}\) can be written as

$$M_{i_t,j_t}=\langle J_t, M\rangle$$
for all \((i_t,j_t)\in \varOmega\). As a working model, we treat \(J_t\) as uniformly distributed over its support. That is, the probability of \(M_{i_t,j_t}\) being the t-th observation is \((p q)^{-1}\). This assumes that the observed entries in the target matrix are uniformly sampled at random (Koltchinskii et al., 2011; Rohde & Tsybakov, 2011; Elsener & van de Geer, 2018), and we refer to Klopp (2014) for more discussions. Recht (2011) analyzed the matrix completion model under this assumption. As pointed out in Recht (2011), this is a sampling with replacement scheme and therefore may appear less realistic as it may result in duplicated entries; however, it has the benefit of simplifying the technical proof and assumptions. Overall, it is a reasonable and informative showcase without requiring any prior information on the sampling scheme. If additional information is available in the sampling process, other models such as the weighted sampling model (Negahban & Wainwright, 2012) can be applied.
We first show that the estimator belongs to a restricted set. We consider the singular value decomposition

$$M^*=U\varLambda V^\top ,$$
where U is a \(p\times q\) matrix, \(\varLambda\) is a \(q\times q\) diagonal matrix with diagonal entries the ordered singular values \(\sigma _1\ge \sigma _2\ge \dots \ge \sigma _{q}\), and V is a \(q\times q\) matrix. For \(k=1,2,\dots ,q\), let \(u_k\) be the k-th column of U, and \(v_k\) the k-th column of V. For any positive integer \(r\le \min \{p,q\}\), let \(U_r\) be the subspace of \({\mathbb {R}}^{p\times q}\) spanned by \(u_1, \dots , u_r\), and \(V_r\) be the subspace spanned by \(v_1, \dots , v_r\). Define a pair of subspaces of \({\mathbb {R}}^{p\times q}\) as
where \(\text {row}(M)\) and \(\text {col}(M)\) denote the row and column spaces of M. For notational simplicity, we use \({\mathcal {M}}_r ={\mathcal {M}}_r(U,V)\) and \(\overline{{\mathcal {M}}}_r^\perp =\overline{{\mathcal {M}}}_r^\perp (U,V)\). Lemma 1 indicates that the estimator \({\hat{M}}\) belongs to the set
To establish the error bounds, we need the following technical assumption.
Assumption 3
For any \(M\in {\mathcal {M}}_0\), there exists a real number \(L>1\) such that

$$\Vert M-M^*\Vert _\infty \le \frac{L}{\sqrt{pq}}\,\Vert M-M^*\Vert _F.$$
Assumptions of this type, referred to as the ‘spikiness condition’, are made in the existing literature on analogous problems, e.g., in Negahban and Wainwright (2012) for matrix completion; see also the recent work of Fan et al. (2021). Intuitively, this assumption requires that for \(M\in {\mathcal {M}}_0\), the entries of \(M-M^*\) are not overly ‘spiky’, or in other words, are relatively evenly distributed, so that the maximum discrepancy is not extremely far from the average discrepancy. We remark that the term \(\frac{1}{\sqrt{pq}}\) relates to the aforementioned uniform sampling scheme, under which each entry is observed with probability \(\frac{1}{pq}\). Hence, it reflects an increasingly more difficult high-dimensional problem due to sparse entries in a single copy of a large matrix. In contrast, if the probability of each entry being observed is a constant independent of p and q, this assumption is not required.
We consider the Lagrangian form of the problem (1):

$$\hat{M}=\mathop {\text {argmin}}\limits _{M}\; \frac{1}{2n}\sum _{t=1}^n \ell _\eta (X_{i_t,j_t}-M_{i_t,j_t})+\gamma \Vert M\Vert _*, \qquad (7)$$
where \(\gamma >0\) is the corresponding regularization tuning parameter. Let \({\hat{\varDelta }}={\hat{M}}-M^*\) and \(\varDelta =M-M^*\). Theorem 1 establishes a non-asymptotic upper bound on the error in estimating a low-rank \(M^*\).
Theorem 1
For problem (7), suppose that Assumptions 1, 2, and 3 hold and the noises \(\xi _t\)’s are distributed symmetrically about zero. Let \({\hat{M}}\) be the solution to problem (7) with
with a constant \(c_0>0\). When \(n>C(L)\cdot c^2_1 pr\log (p+q)\log (q+1)\),
with probability at least \(1-3(p+q)^{-1}\), for some constants \(C_1\), \(c_2\) and \(c_3\) independent of n, p, and q, and C(L) a constant only depending on L.
Theorem 1 is non-asymptotic; \(\gamma\) is chosen based on Lemma 7 in Appendix 2 as twice the upper bound of \(\sigma _1(\nabla {\mathcal {L}}_\eta (M^*))\). In Theorem 1, we only require that the error terms satisfy Assumption 2, namely \(F_\xi (m+\eta )-F_\xi (m-\eta )>\frac{1}{c_1^2}\), for \(\eta >0\) the parameter in the Huber loss (2) and \(|m|<\eta\). Since this assumption is easily satisfied by many heavy-tailed distributions, this result demonstrates the robustness of our method.
We note that in general \(\gamma\) can be
for any constant \(K_1\ge 1\). Under the conditions in Theorem 1, we can also derive the upper bounds of the estimation error in nuclear norm based on (23) in the Appendix:
We now discuss the asymptotic properties of \({{\hat{M}}}\) as \(n \rightarrow \infty\). Matrix completion is a hard problem: one attempts to recover a matrix-valued model parameter from a single incomplete copy from the data generating process. The average estimation error converges to zero in probability as \(n\rightarrow \infty\); that is, when \(rp \log (p+q)\log (q+1)=o(n)\), \((pq)^{-1}\Vert {\hat{\varDelta }}\Vert _F^2 \rightarrow 0\). Intuitively, if the rank of \(M^*\) is r, then the number of free parameters is of order rp. It is hence reasonable to require a sample size of some larger order than rp in order to recover the model parameters consistently.
Without requiring the Gaussian assumption, our error rate is still comparable to the statistical optimum established by Koltchinskii et al. (2011) for matrix completion under a low-rank constraint with Gaussian noises. Compared with the lower bound given in Theorem 6 of Koltchinskii et al. (2011), our upper bound in Theorem 1 differs only in an additional logarithmic term \(\sqrt{\log (p+q)\log (q+1)}\) and the \(\eta\) in the Huber loss.
The assumption in Theorem 1 that the model error is symmetrically distributed around 0 is needed in obtaining the upper bound of \(\sigma _{1}(\nabla {\mathcal {L}}(M^*))\); see the proof of Lemma 7. It ensures that \(\sigma _1({\mathbb {E}}[\nabla {\mathcal {L}}(M^*)])=0\). Similar assumptions are found in Loh (2017). Thanks to the symmetry assumption, the convergence can be established with no strong extra requirement on \(\eta\). Without the symmetry, as shown in Lemma 7 in the Supplementary Material, other conditions are required to control
so that
is stochastically small enough. With this extra term, the upper bound in Theorem 1 becomes
The extra term in (8) may then be viewed as a price paid to achieve robustness against noises with heavy-tailed distributions. It is an impact of applying the robust Huber loss, and a remarkably different feature from studies of matrix completion with the \(\ell _2\)-loss. Nevertheless, it is worth noting that \(\ell _2\)-loss studies commonly assume conditions controlling the tail behavior of the model errors, for example sub-Gaussian distributions. In contrast, our development requires no such assumptions on the tail probabilities, which is the gain in return for applying the Huber loss.
3.2.2 Reduced-rank regression
The problem (5) is also expressed in the Lagrangian form:

$$\hat{C}=\mathop {\text {argmin}}\limits _{C}\; {\mathcal {L}}_\eta (C)+\gamma \Vert C\Vert _*, \qquad (9)$$
where \(\gamma >0\) is a regularization parameter, and \({\mathcal {L}}_\eta (C)\) is defined in Equation (5).
Again, we point out that the estimator belongs to a restricted set. By applying the singular value decomposition to \(C^*\), we have

$$C^*=U\varLambda V^\top ,$$
where \(\varLambda =\textrm{diag}(\sigma _1,\dots , \sigma _q)\) is the diagonal matrix containing all singular values of \(C^*\). For \(r\le \min \{p,q\}\), we define a pair of subspaces of \({\mathbb {R}}^{p\times q}\) as
where \(U_r\) is the subspace spanned by the first r columns of U, and \(V_r\) is the subspace spanned by the first r columns of V. For notational simplicity, we denote \({\mathcal {C}}_r ={\mathcal {C}}_r(U, V)\) and \(\overline{{\mathcal {C}}}_r^\perp =\overline{{\mathcal {C}}}_r^\perp (U,V)\). Note that \({\mathcal {C}}_r\) and \(\overline{{\mathcal {C}}}_r\) are not equal. Lemma 4 indicates that the estimator \({\hat{C}}\) belongs to the set
We assume the following conditions on the random design matrix X.
Assumption 4
\(x_1, x_2, \dots , x_n\) are i.i.d. random vectors sampled from a multivariate normal distribution \({\mathcal {N}} (0, \varSigma )\) and, without loss of generality, are standardized such that \(\Vert x_i\Vert _F\le 1\). Moreover, \(\sigma _1(\Sigma )\ge \sigma _n(\Sigma )>0\), where \(\sigma _1(\Sigma )\) and \(\sigma _n(\Sigma )\) denote the largest and smallest eigenvalues of \(\Sigma\), respectively.
The multivariate normal distribution and its analogues are commonly assumed in the literature (e.g., Negahban & Wainwright, 2011; Sun et al., 2020; Fan et al., 2021). The setting under Assumption 4 facilitates achieving the optimal convergence rate; other types of conditions are possible, at the expense of a slower rate.
Theorem 2 establishes a non-asymptotic upper bound for \(\Vert {\hat{\varDelta }}\Vert _F\).
Theorem 2
For problem (9), suppose that Assumptions 1 and 2 hold and the noises \(\xi _{ij}\)'s are distributed symmetrically about zero. Suppose X satisfies Assumption 4. Let \({\hat{C}}\) be the solution to the optimization problem (9) with
Then for \(n>C_2\frac{\sigma _1(\Sigma )}{\sigma _n(\Sigma )}c_1^2{r(p+q)}\) with probability at least \(1-3e^{-(p+q)}\),
where \(C_2\) and \(C_3\) are constants.
The value for \(\gamma\) is selected based on Lemma 8 in Appendix 3 as twice the upper bound for \(\sigma _1(\nabla {\mathcal {L}}_\eta (C^*))\) according to condition (10). Generally, for any \(K_2\ge 1\) and
our result remains valid and differs only in the constant terms.
Under the same condition, we can establish the error bound in terms of the nuclear norm
When \(r(p+q)=o(n)\), the Frobenius norm of the error satisfies \(\Vert {\hat{\varDelta }}\Vert _F^2 \rightarrow 0\) in probability. Similarly, the robustness of the method is reflected in the fact that only the mild distributional Assumption 2 is required: \(F_\xi (m+\eta )-F_\xi (m-\eta )>\frac{1}{c_1^2}\) for \(|m|<\eta\) and \(\eta >0\). Our estimator achieves a convergence rate comparable to those in Negahban and Wainwright (2011) and Rohde and Tsybakov (2011), with the notable difference due to the \(\eta\) in the Huber loss. Meanwhile, our method does not require the errors to follow normal distributions, which is the case in those studies with the \(\ell _2\) loss. Here, assuming symmetry plays the same role as in Theorem 1. By the same discussion after Theorem 1, if the noises are not symmetrically distributed, an extra term appears in the upper bound.
4 Numerical examples
In this section, we conduct an extensive numerical investigation of the proposed method using both simulated and real data sets. In all cases, we choose the tuning parameters by ten-fold cross-validation. Specifically, for matrix completion problems, we randomly select \(90\%\) of the observed entries as training samples and evaluate the results on the remaining \(10\%\). We repeat the procedure 10 times and choose the best tuning parameter. With extensive studies on simulated and real data sets, our results provide strong empirical evidence that the proposed method is robust under different settings.
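The random 90/10 split of the observed entries used for tuning can be sketched with a small helper; the function name and interface are ours.

```python
import numpy as np

def holdout_split(omega, test_frac=0.1, rng=None):
    """Randomly split the observed index set omega into training and
    validation parts, as in the tuning procedure described above."""
    rng = np.random.default_rng(rng)
    omega = np.asarray(omega)
    perm = rng.permutation(len(omega))
    n_test = int(round(test_frac * len(omega)))
    # first n_test permuted positions form the validation set
    return omega[perm[n_test:]], omega[perm[:n_test]]
```

Repeating this split 10 times and averaging the validation error over the splits gives the cross-validated criterion used to pick the tuning parameter.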
4.1 Jester joke data
We first test our method using the Jester joke data set. This data set contains more than 4.1 million ratings for 100 jokes from 73,421 users. This data set is publicly available through http://www.ieor.berkeley.edu/~goldberg/jester-data/. The whole data set contains three sub-datasets: (1) jester-1: 24,983 users who rated 36 or more jokes; (2) jester-2: 23,500 users who rated 36 or more jokes; (3) jester-3: 24,938 users who rated between 15 and 35 jokes. More detailed descriptions can be found in Toh and Yun (2010) and Chen et al. (2012), where the authors consider nuclear-norm based approaches to matrix completion.
Due to the large number of users, we randomly select \(n_u\) users' ratings from the datasets. Since many entries are unknown, we cannot compute the relative error over all entries. Instead, we adopt the normalized mean absolute error (NMAE) to measure the accuracy of the estimator \({\hat{M}}\):
where \(r_{\min }\) and \(r_{\max }\) denote the lower and upper bounds for the ratings, respectively. In the Jester joke data set, the range is \([-10,10]\). Thus, we have \(r_{\max }-r_{\min } = 20\).
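Under this convention (the mean absolute error over the test entries, divided by \(r_{\max }-r_{\min }\)), the NMAE can be computed as follows; the function name and index layout are our assumptions.

```python
import numpy as np

def nmae(M_hat, M0, idx, r_min=-10.0, r_max=10.0):
    """Normalized mean absolute error over the test index set `idx`
    (a pair of row/column arrays); Jester ratings lie in [-10, 10]."""
    rows, cols = idx
    err = np.abs(M_hat[rows, cols] - M0[rows, cols])
    return err.mean() / (r_max - r_min)
```

For the Jester data, \(r_{\max }-r_{\min } = 20\), so an NMAE of 0.2 corresponds to an average absolute error of 4 rating points.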
In each iteration, we first randomly select \(n_u\) users, and then randomly permute their ratings to generate \(M^0\in {\mathbb {R}}^{n_u\times 100}\). Next, we uniformly sample SR\(\in \{15\%,20\%,25\%\}\) of the entries to generate a set of observed indices \(\varOmega\). Note that we can observe the entry (j, k) only if \((j,k)\in \varOmega\) and \(M^0_{j,k}\) is available; thus, the actual sampling ratio is less than the nominal SR. We consider different settings of \(n_u\) and SR, and we report the averaged NMAE and running times in Table 2 after running each setting 100 times. We compare robust methods with the \(\ell _1\) loss, Huber loss, and Tukey loss against the non-robust \(\ell _2\) loss. From Table 2, we see that the robust matrix completion methods perform promisingly.
4.2 Cameraman image denoising
We test our method using the popular “Cameraman” image, which is widely used in the image processing literature; the image has \(512\times 512\) pixels, as shown in Fig. 1a. We generate random noise by first adding independent Gaussian noise with standard deviation 3 to each pixel. Then, we add heavy-tailed noises by randomly choosing 10% of the pixels and replacing their values with 1000 or \(-1000\). Furthermore, we randomly select 40% or 60% of the pixels as missing entries. We show two typical simulated noisy images in the top row of Fig. 1b, c, and the images recovered using the Tukey approach below them. The recovered images provide visual evidence that our method is robust to heavy-tailed noises in practice. In addition, in Table 3, we report the averaged NMAE with standard deviations of the different approaches after repeating the data generating schemes 100 times. From the effective image recovery and the NMAE values, we conclude that robust matrix completion performs promisingly with partial and noisy observations.
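The corruption scheme above (Gaussian noise, 10% heavy-tailed spikes of \(\pm 1000\), and randomly missing pixels) can be sketched as follows; the function name and the returned observation mask are our conventions.

```python
import numpy as np

def corrupt_image(img, sd=3.0, spike_frac=0.10, miss_frac=0.40, rng=None):
    """Apply the corruption scheme described above to a 2-D array:
    additive Gaussian noise, +/-1000 spikes on spike_frac of the pixels,
    and a Boolean mask marking miss_frac of the pixels as missing."""
    rng = np.random.default_rng(rng)
    noisy = img + rng.normal(0.0, sd, img.shape)
    spikes = rng.random(img.shape) < spike_frac
    noisy[spikes] = rng.choice([-1000.0, 1000.0], size=spikes.sum())
    mask = rng.random(img.shape) >= miss_frac   # True = observed entry
    return noisy, mask
```

The robust completion methods then see only the entries where the mask is True, including any spiked (outlying) pixels that happen to be observed.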
4.3 Simulations
We first consider several simulation settings similar to those described in She and Chen (2017) to compare our method with their robust reduced-rank regression (\(R^4\)) method. In all cases, we focus on testing robustness by artificially introducing data corruption and outliers.
Setting 1: We first consider a low-dimensional case where we set \(n = 100\), \(p = 12\), \(q = 8\) and \(r = 3\) or 5. We construct the design matrix X by independently sampling its n rows from \(N(0,\Sigma _0)\), where we consider highly correlated covariates by setting the diagonal elements of \(\Sigma _0\) to 1 and its off-diagonal elements to 0.5. For the noise matrix \(\varXi\), we sample each row of \(\varXi\) independently from \(N(0,\sigma ^2\Sigma _1)\), where \(\Sigma _1\) is the q-dimensional identity matrix, and \(\sigma\) is set to 1. Next, we construct the coefficient matrix \(C^*\). We generate \(C^* = B_1B_2^\top\), where \(B_1\in {\mathbb {R}}^{p\times r}\), \(B_2\in {\mathbb {R}}^{q\times r}\), and all entries of \(B_1\) and \(B_2\) are independently sampled from N(0, 1). We then add outliers through a matrix \(U^*\) by setting the first \(o\% \cdot n\) rows of \(U^*\) to be nonzero, where \(o \in \{30, 35,\ldots ,50\}\) gives the percentage of outliers, and the j-th entry of each outlier row of \(U^*\) is the product of a Rademacher random variable and a scalar \(\alpha \in \{0.75,1\}\) times the sample standard deviation of the j-th column of \(XC^*\). Finally, we set the response matrix \(Y = XC^* + U^* + \varXi\). We report the mean and standard deviation of the mean squared error (MSE) from 200 runs, where
In addition, we also report the mean and standard deviation of the mean squared estimation error, where
Setting 2: We then test our method under heavy-tailed noise. As in Setting 1, we let \(n = 100\), \(p = 12\), \(q = 8\), and \(r = 2, 3\), or 4, and use the same scheme to construct the design matrix X; we then generate the noise matrix from the heavy-tailed t-distribution with 3 or 5 degrees of freedom. Furthermore, we add outliers using the same scheme as in Setting 1 to generate \(U^*\), with \(\alpha = 0.5, 0.75\), or 1.
Setting 3: We consider a high-dimensional setting where \(n = 100\), \(p = 50\) and \(q = 50\), and \(r = 3\) or 5, where there are \(2,500 > 100\) parameters in the matrix C to be estimated. We consider the same data generating scheme as in Setting 1.
Setting 4: Finally, we consider an ultrahigh-dimensional setting where \(n = 300\), \(p = 100\) and \(q = 400\), and \(r = 3\) or 5, where there are \(40,000\gg 300\) parameters to be estimated. We consider the same data generating scheme as in Setting 1.
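The data-generating scheme shared by the four settings can be sketched as follows for Setting 1; the function name, argument names, and the exact handling of the outlier scale are our assumptions based on the description above.

```python
import numpy as np

def simulate_setting1(n=100, p=12, q=8, r=3, o=0.30, alpha=0.75,
                      rho=0.5, sigma=1.0, rng=None):
    """Sketch of the Setting 1 generator: correlated design, rank-r
    coefficient matrix, Gaussian noise, and Rademacher-signed outliers."""
    rng = np.random.default_rng(rng)
    Sigma0 = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)  # equicorrelation
    X = rng.multivariate_normal(np.zeros(p), Sigma0, size=n)
    C = rng.standard_normal((p, r)) @ rng.standard_normal((q, r)).T  # rank r
    E = sigma * rng.standard_normal((n, q))                 # Gaussian noise
    U = np.zeros((n, q))
    n_out = int(o * n)                                      # outlier rows
    scale = alpha * X.dot(C).std(axis=0, ddof=1)            # column-wise sd
    signs = rng.choice([-1.0, 1.0], size=(n_out, q))        # Rademacher
    U[:n_out] = signs * scale
    Y = X @ C + U + E
    return X, Y, C
```

Settings 2 to 4 only change the dimensions, the noise distribution (t with 3 or 5 degrees of freedom), and the outlier parameters.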
The results are shown in Tables 4, 5, 6, and 7. We compare our method incorporating the Huber and Tukey loss functions with the \(R^4\) method when the latter is applicable. We note that for the high-dimensional Settings 3 and 4, the \(R^4\) method of She and Chen (2017) cannot be applied because one of the iteration steps in their algorithm is not well defined. We also compare our method with another robust method that uses the \(\ell _1\) loss in place of the Huber loss in the objective with the nuclear norm constraint (denoted \(\ell _1\)). In all four settings, both the Huber and Tukey losses achieve very promising performance, and the Tukey loss slightly outperforms the Huber loss in settings with outliers.
5 Intermediate theoretical results
Our estimators (1) and (5) are penalized M-estimators. We exploit the framework of Negahban et al. (2012) to study their statistical properties. Negahban et al. (2012) elaborate on the notion of decomposability associated with a penalty function, which is a key property for establishing the restricted strong convexity (RSC) property and the error bounds of penalized estimators.
For self-completeness, we outline the decomposability of penalizing with the nuclear norm, and then derive the restricted strong convexity property for both models under the Huber loss function.
5.1 Decomposability of nuclear norm
A norm \(\Vert \cdot \Vert\) is decomposable with respect to a pair of subspaces \(({\mathcal {M}},\overline{{\mathcal {M}}}^\perp )\) of \({\mathbb {R}}^{p\times q}\) if all \(A\in {{\mathcal {M}}}\) and \(B\in \overline{{\mathcal {M}}}^\perp\) satisfy
To illustrate the decomposability of nuclear norm, recall
Note that \({\mathcal {M}}_r\ne \overline{{\mathcal {M}}}_r\). Since U and V both have orthogonal columns, the nuclear norm is decomposable with respect to the pair \(({\mathcal {M}}_r,\overline{{\mathcal {M}}}_r^\perp )\). Note that if the rank of \(M^*\) is at most r, then \(U_r\) and \(V_r\) contain the column and row spaces of \(M^*\), respectively, and \(M^*\in {\mathcal {M}}_r(U,V)\).
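The decomposability of the nuclear norm can be checked numerically: for a matrix A whose column and row spaces lie in \(U_r\) and \(V_r\), and a matrix B in the orthogonal pair, the nuclear norm adds up exactly. The construction below is a small sketch of this fact.

```python
import numpy as np

def nuclear_norm(M):
    """Sum of singular values."""
    return np.linalg.svd(M, compute_uv=False).sum()

rng = np.random.default_rng(0)
p, q, r = 6, 5, 2
# Random orthonormal bases for R^p and R^q
U, _ = np.linalg.qr(rng.standard_normal((p, p)))
V, _ = np.linalg.qr(rng.standard_normal((q, q)))
# A has column space in span(U[:, :r]) and row space in span(V[:, :r]);
# B lives in the orthogonal pair built from the remaining columns
A = U[:, :r] @ rng.standard_normal((r, r)) @ V[:, :r].T
B = U[:, r:] @ rng.standard_normal((p - r, q - r)) @ V[:, r:].T
# Decomposability: the nuclear norm splits exactly across the pair
lhs = nuclear_norm(A + B)
rhs = nuclear_norm(A) + nuclear_norm(B)
```

This works because the SVD of A + B is block-diagonal in the bases (U, V), so the singular values of the sum are the union of those of A and B.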
We present key intermediate results as lemmas below. The proofs of the lemmas are given in the Appendix.
5.2 Results for matrix completion
The decomposability leads to the first lemma, which is a special case of Lemma 1 in Negahban et al. (2012). It provides an upper bound for \(\Vert {\hat{\varDelta }}_{\overline{{\mathcal {M}}}_r^\perp }\Vert _*\).
Lemma 1
For any \(\gamma\) satisfying
the error \({\hat{\varDelta }}\) satisfies
Lemma 1 indicates that the estimator \({\hat{M}}\) belongs to the set
Note that if the rank of \(M^*\) is no greater than r, then \(\sum _{k=r+1}^{q}\sigma _k=0\) and the projection of the error on \(\overline{{\mathcal {M}}}_r^\perp\) is solely controlled by the projection of the error on \(\overline{{\mathcal {M}}}_r\), and so is the error itself, since
Now, consider the quantity
For simplicity, we sometimes write \(\delta {\mathcal {L}}_\eta (M,M^*)\) as \(\delta {\mathcal {L}}_\eta\). The next lemma gives a lower bound on \(\delta {\mathcal {L}}_\eta (M,M^*)\), which is used to establish restricted strong convexity (RSC) and the upper bound for the error. The proof relies on Lemma 1 and empirical process techniques.
Lemma 2
(Lower bound of \(\delta {\mathcal {L}}_\eta (M,M^*)\)) Suppose Assumptions 1 and 2 hold, and that the regularization parameter in optimization problem (7) satisfies
Then for any \(x>0\) and \(M \in \{M:\Vert M-M^*\Vert _{\max } \le \eta \} \cap {\mathcal {M}}_0\),
with probability at least \(1-e^{-x}\).
By controlling the negative term, we have the restricted strong convexity property.
Lemma 3
(Restricted Strong Convexity) Suppose that all the conditions in Lemma 2 and Assumption 3 hold. For \(M \in \{M:\Vert M-M^*\Vert _{\max } \le \eta \} \cap {\mathcal {M}}_0\), with probability at least \(1-e^{-(p+q)}\),
for \(n>C(L)\cdot c^2_1 pr\log (p+q)\log (q+1)\), where C(L) is a constant depending only on L.
5.3 Results for reduced-rank regression
Recall
Lemma 1 can be easily extended to \({\hat{C}}\).
Lemma 4
For any \(\gamma\) satisfying
\({\hat{\varDelta }}={\hat{C}}-C^*\) satisfies
Lemma 4 indicates that the estimator \({\hat{C}}\) belongs to the set
The next result is to establish the RSC condition. Consider the quantity
Lemma 5
(Lower bound of \(\delta {\mathcal {L}}_\eta (C,C^*)\)) Consider the reduced-rank regression problem (9). Suppose that Assumptions 1, 2, and 4 hold, and that the noises \(\xi _t\)'s are distributed symmetrically about zero. Suppose the regularization parameter in optimization problem (9) satisfies
Then for any \(x>0\) and \(C\in \{C:\Vert C-C^*\Vert _F\le \eta \}\cap {\mathcal {C}}_0\),
with probability at least \(1-e^{-x}\).
By controlling the negative term and setting the right side to be greater than 0, we have the restricted strong convexity property.
Lemma 6
(Restricted Strong Convexity) Suppose that all the conditions in Lemma 5 hold. Then for \(C\in \{C:\Vert C-C^*\Vert _F\le \eta \}\cap {\mathcal {C}}_0\) and \(n>C_2\frac{\sigma _1(\varSigma )}{\sigma _n(\varSigma )}c_1^2{r(p+q)}\), where \(C_2\) is a constant,
with probability at least \(1-(p+q)^{-1}\).
Data availability
The real data sets to evaluate the performance of the methods in this paper are publicly available. ‘Jester Joke’ data set is available through http://www.ieor.berkeley.edu/~goldberg/jester-data/, and ‘Cameraman image’ data is available through http://ltfat.org/doc/signals/cameraman.html.
Code availability
The MATLAB code is available upon request to the corresponding author.
References
Agarwal, A., Negahban, S., & Wainwright, M. J. (2012). Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, 40(2), 1171–1197.
Anderson, T. W. (2003). An introduction to multivariate statistical analysis. New York: Wiley, 3rd edition.
Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press.
Bousquet, O. (2002). A Bennett concentration inequality and its application to suprema of empirical processes. Comptes Rendus Mathematique, 334(6), 495–500.
Candès, E. J., & Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5), 2053–2080.
Charisopoulos, V., Chen, Y., Davis, D., Díaz, M., Ding, L., & Drusvyatskiy, D. (2021). Low-rank matrix recovery with composite optimization: Good conditioning and rapid convergence. Foundations of Computational Mathematics, 1–89.
Chen, C., He, B., & Yuan, X. (2012). Matrix completion via an alternating direction method. IMA Journal of Numerical Analysis, 32(1), 227–245.
Cook, R. D. (2009). Regression graphics: Ideas for studying regressions through graphics, (Vol. 482). Hoboken: John Wiley & Sons.
Elsener, A., & van de Geer, S. (2018). Robust low-rank matrix estimation. Annals of Statistics, 46(6B), 3481–3509.
Fan, J., Wang, W., & Zhu, Z. (2021). A shrinkage principle for heavy-tailed data: High-dimensional robust low-rank matrix recovery. Annals of Statistics, 49(3), 1239.
Freund, R. M., & Grigas, P. (2016). New analysis and results for the Frank–Wolfe method. Mathematical Programming, 155(1–2), 199–230.
Freund, R. M., Grigas, P., & Mazumder, R. (2017). An extended Frank–Wolfe method with “in-face’’ directions, and its application to low-rank matrix completion. SIAM Journal on Optimization, 27(1), 319–346.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (2011). Robust statistics: The approach based on influence functions (Vol. 196). Hoboken: John Wiley & Sons.
Huber, P. J. (2004). Robust statistics (Vol. 523). Hoboken: John Wiley & Sons.
Jaggi, M. (2013). Revisiting Frank–Wolfe: Projection-free sparse convex optimization. In International conference on machine learning (pp. 427–435). PMLR.
Kerdreux, T., d’Aspremont, A., & Pokutta, S. (2018). Restarting Frank–Wolfe. arXiv preprint arXiv:1810.02429.
Klopp, O. (2014). Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 20(1), 282–303.
Koltchinskii, V., Lounici, K., & Tsybakov, A. B. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5), 2302–2329.
Lacoste-Julien, S., & Jaggi, M. (2015). On the global linear convergence of Frank–Wolfe optimization variants. arXiv preprint arXiv:1511.05932.
Lauritzen, S. L. (1996). Graphical models (Vol. 17). Oxford: Clarendon Press.
Ledoux, M., & Talagrand, M. (2013). Probability in Banach spaces: Isoperimetry and processes. Springer, Berlin.
Loh, P.-L. (2017). Statistical consistency and asymptotic normality for high-dimensional robust \(m\)-estimators. The Annals of Statistics, 45(2), 866–896.
Negahban, S., & Wainwright, M. J. (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39(2), 1069–1097.
Negahban, S., & Wainwright, M. J. (2012). Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. The Journal of Machine Learning Research, 13(1), 1665–1697.
Negahban, S. N., Ravikumar, P., Wainwright, M. J., & Yu, B. (2012). A unified framework for high-dimensional analysis of \(m\)-estimators with decomposable regularizers. Statistical Science, 27(4), 538–557.
Recht, B. (2011). A simpler approach to matrix completion. Journal of Machine Learning Research, 12(12).
Reddi, S. J., Hefny, A., Sra, S., Poczos, B., & Smola, A. (2016). Stochastic variance reduction for nonconvex optimization. In International conference on machine learning (pp. 314–323).
Reinsel, G. C., & Velu, R. (1998). Multivariate reduced rank regression. Berlin: Springer.
Rohde, A., & Tsybakov, A. B. (2011). Estimation of high-dimensional low-rank matrices. The Annals of Statistics, 39(2), 887–930.
She, Y., & Chen, K. (2017). Robust reduced-rank regression. Biometrika, 104(3), 633–647.
Sun, Q., Zhou, W.-X., & Fan, J. (2020). Adaptive Huber regression. Journal of the American Statistical Association, 115(529), 254–265.
Swoboda, P., & Kolmogorov, V. (2019). Map inference via block-coordinate Frank–Wolfe algorithm. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 11146–11155).
Toh, K.-C., & Yun, S. (2010). An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization, 6(3), 615–640.
Wong, R. K., & Lee, T. C. (2017). Matrix completion with noisy entries and outliers. The Journal of Machine Learning Research, 18(1), 5404–5428.
Zhou, W.-X., Bose, K., Fan, J., & Liu, H. (2018). A new perspective on robust m-estimation: Finite sample theory and applications to dependence-adjusted multiple testing. The Annals of Statistics, 46(5), 1904–1931.
Funding
Tang was supported in part by a Subaward of an NIH Grant R01GM140476, and an NSF Grant DMS-2210687. Fang was partially supported by NSF Grants DMS-1820702, DMS-1953196, DMS-2015539, and a Grant from Whitehead foundation.
Author information
Contributions
All authors contributed to the conception and design of the methods, theory, and algorithms. Theoretical development was performed by NJ, and the experimental evaluation was performed by EXF. All authors participated in preparing, reading, and revising the manuscript; all authors approved the manuscript.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Editor: Pradeep Ravikumar.
Appendices
Appendix 1: More on the assumption on the model errors
A key assumption in Theorem 1 and Theorem 2 is that the noises \(\xi\)’s are i.i.d. with zero mean and a distribution function \(F_\xi\) satisfying
for any \(|m|\le \eta\) and some \(\eta >0\), where \(c_1\) is a positive constant depending only on \(\eta\). This is the same as requiring \(Pr( \xi \in [m - \eta , m + \eta ])\) to be bounded away from zero for any \(|m|\le \eta\) and \(\eta >0\). Since \({\mathbb {E}}(\xi )=0\) and \(0\in [m-\eta , m+\eta ]\), this condition holds as long as the probability mass near 0 is not too small, which is satisfied by a large class of distributions including heavy-tailed ones. As an example, Fig. 2 shows the density of a t-distribution with 3 degrees of freedom. The area of the grey part represents \(F_\xi (m+\eta )-F_\xi (m-\eta )\) when \(m=1\) and \(\eta =2\). Since the density near 0 is strictly bounded from below, the required condition (11) holds for any \(\eta >0\).
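Condition (11) for the t(3) example can be verified directly, since the t-distribution with 3 degrees of freedom has a closed-form CDF. The sketch below scans \(|m|\le \eta\) with \(\eta =2\) and confirms the probability stays bounded away from zero; the helper name is ours.

```python
import math

def t3_cdf(x):
    """CDF of Student's t with 3 degrees of freedom (closed form):
    F(x) = 1/2 + (arctan(x/sqrt(3)) + sqrt(3)x/(x^2+3)) / pi."""
    return 0.5 + (math.atan(x / math.sqrt(3))
                  + math.sqrt(3) * x / (x * x + 3)) / math.pi

eta = 2.0
ms = [i / 100 * eta for i in range(-100, 101)]   # grid over |m| <= eta
probs = [t3_cdf(m + eta) - t3_cdf(m - eta) for m in ms]
min_prob = min(probs)   # strictly positive, so condition (11) holds
```

The minimum over the grid is attained at the endpoints \(m=\pm \eta\) and is roughly 0.49, so any \(c_1\) with \(1/c_1^2\) below this value satisfies Assumption 2 for this heavy-tailed error.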
The Huber contamination model also satisfies Assumption 2. Specifically, suppose the errors \(\xi\)'s follow a Huber contamination model \((1-c)F+cG\) with F the distribution function of a normal random variable. Then \(Pr( \xi \in [m - \eta , m + \eta ]) = (1-c)\{F(m + \eta )-F(m - \eta )\}+c\{G(m + \eta )-G(m - \eta )\}\). The first term creates no issue. Assumption 2 is easily met if G is a continuous distribution with zero mean. When G is discrete, its distribution function is a step function, and the second term is either 0 or a positive value bounded above by 1. Overall, Assumption 2 is satisfied.
Appendix 2: Proof for matrix completion
This section presents the proofs related to matrix completion.
Proof of Lemma 1
Note that
Using triangle inequalities and the decomposability of nuclear norm on \({\mathcal {M}}_r\) and \({\mathcal {M}}_r^\perp\),
Thus,
By the convexity of the loss function \({\mathcal {L}}_\eta\), together with the assumption on \(\gamma\) and the definition of the dual norm,
Since \({\hat{M}}\) is the optimizer of problem (7),
Notice that \(\Vert M^*_{\overline{{\mathcal {M}}}_r^\perp }\Vert _*=\sum _{k=r+1}^{q}\sigma _k\), therefore the lemma holds. \(\square\)
For simplicity, let
Before we examine the RSC condition, we first bound the term \(\sigma _{1}(\nabla {\mathcal {L}}_\eta (M^*))\).
Lemma 7
(Upper bound for \(\sigma _1(\nabla {\mathcal {L}}_\eta (M^*))\)) Suppose the noises are i.i.d. with zero mean and are symmetrically distributed around zero. Then for any \(x>0\) and a positive constant \(c_0\),
with probability at least \(1-e^{-x}\).
Proof of Lemma 7
Since \(\sigma _1(\cdot )\) is a norm, the triangle inequality gives
It can be derived from Equation (1) that
Since the noises are symmetrically distributed around zero and \(\frac{\partial l_\eta (X_{ij}-M^*_{ij})}{\partial M_{ij}^*}\) is an odd function of the noise \(\xi _{ij}\), we have \({\mathbb {E}}[\frac{\partial l_\eta (X_{ij}-M^*_{ij})}{\partial M_{ij}^*}]=0\), and thus
To bound \(\sigma _1(\nabla {\mathcal {L}}_\eta (M^*)-{\mathbb {E}}[\nabla {\mathcal {L}}_\eta (M^*)])\), first notice that
Since the errors \(X_{ij}-M_{ij}^*\) are i.i.d.,
by Theorem 2.3 in Bousquet (2002), we have for any \(x>0\)
with probability at least \(1-e^{-x}\).
Moreover, since
we have with probability at least \(1-e^{-x}\)
By the symmetrization inequality in Boucheron et al. (2013),
where \(\epsilon _1,\dots ,\epsilon _n\) are i.i.d. Rademacher variables with distribution \({\mathbb {P}}(\epsilon _t=1)={\mathbb {P}}(\epsilon _t=-1)=\frac{1}{2}\), and are independent of \(\{X_{i_t,j_t}\}_{t=1}^n\) and \(\{J_t\}_{t=1}^n\).
Now, let \({\mathbb {E}}^*\) denote the conditional expectation given \(\{X_{i_t,j_t}, J_t\}_{t=1}^n\). Notice that \(W_{i_t,j_t} \frac{\partial l_\eta (X_{i_t,j_t}-M^*_{i_t,j_t})}{\partial M_{i_t,j_t}^*}\) is an \(\eta\)-Lipschitz function of \(W_{i_t,j_t}\). By Theorem 4.12 in Ledoux and Talagrand (2013), we have
Then take expectation over \(J_t\), and we have for a positive constant \(c_0\),
where the second inequality follows from the definition of dual norm, and the last inequality follows from Proposition 2 in Koltchinskii et al. (2011): it is simple to show that
besides, since \(\sigma _1(\epsilon _tJ_t)=|\epsilon _t|\sigma _1(J_t)\le |\epsilon _t|\), we have
where \(U_Z^{(\alpha )}\) is defined as \(U_Z^{(\alpha )}=\inf \{u>0:{\mathbb {E}} \exp (\frac{\sigma _1(Z)^\alpha }{u^\alpha })\le 2\}\), then by concavity of logarithm, we have
finally, using Proposition 2 in Koltchinskii et al. (2011), we have \(\forall {\tilde{x}}>0\) and a constant \({\tilde{c}}_0\)
Then
since \(\frac{1}{\sqrt{{\tilde{x}}+\log (p+q)}}\le \sqrt{2}\left[ \frac{1}{\sqrt{{\tilde{x}}}}+\frac{1}{\sqrt{\log (p+q)}}\right]\), after simplification, we have
where \(c_0\) is a constant independent of n, p and q.
By Equation (13), together with Equation (14), (15), (16) and (17), we have with probability at least \(1-e^{-x}\)
\(\square\)
Proof of Lemma 2
\(\delta {\mathcal {L}}_\eta (M, M^*)\) can be written as
In the following, we establish the lower bound for \({\mathbb {E}}[\delta {\mathcal {L}}_\eta (M,M^*)]\) and the upper bound for \(|{\mathbb {E}}[\delta {\mathcal {L}}_\eta (M,M^*)]-\delta {\mathcal {L}}_\eta (M,M^*)|\), respectively, for \(M\in \{M:\Vert M-M^*\Vert _{\max }\le \eta \}\cap {\mathcal {M}}_0\).
Given any \(M\in \{M:\Vert M-M^*\Vert _{\max }\le \eta \}\cap {\mathcal {M}}_0\) and \(\varDelta = M-M^*\),
where \(\frac{\partial l_\eta (X_{ij}-M^*_{ij})}{\partial M_{ij}^*}\) is defined in Equation (12).
Since \(l_\eta (X_{ij}-M_{ij})\) and \(\frac{\partial l_\eta (X_{ij}-M_{ij})}{\partial M_{ij}^*}\) are continuous functions of \(M_{ij}\),
where \(F(\cdot )\) is the cdf of \(X_{ij}\), and
Applying Taylor's theorem to \({\mathbb {E}}[l_\eta (X_{ij}-M_{ij})]\), we have for some \(t_{ij}\in (0,1)\)
where the inequality follows from Assumption 2.
Next, we consider the upper bound for \(\left| {\mathbb {E}}[\delta {\mathcal {L}}_\eta ]-\delta {\mathcal {L}}_\eta \right|\). The techniques used here are similar to those in the proof of Lemma 7.
Let \(\delta l_{\eta ,ij}=l_\eta (X_{ij}-{M}_{ij})-l_\eta (X_{ij}-M^*_{ij})-\frac{\partial l_\eta (X_{ij}-M^*_{ij})}{\partial M^*_{ij}}{\varDelta }_{ij}\). Since \(l_\eta\) is \(\eta\)-Lipschitz,
For any \(M \in {\mathcal {M}}_0\), letting \(\varDelta =M-M^*\), we have
Let \(Z_1=\frac{1}{n}\sup _{M\in {\mathbb {R}}^{p\times q}}\left| \sum _{t=1}^n \frac{f_{1t}(M)}{||\varDelta ||_F} \right|\). By Equation (20), \(\frac{f_{1t}(M)}{||\varDelta ||_F}\le 4\eta\) and \({\mathbb {E}}[\frac{f_{1t}^2(M)}{||\varDelta ||_F^2}]\le \frac{\eta ^2}{pq}\) for any \(M\in {\mathbb {R}}^{p\times q}\). Since the errors \(X_{ij}-M_{ij}^*\) are i.i.d., by Theorem 2.3 in Bousquet (2002), for any \(x\ge 0\)
with probability at least \(1-e^{-x}\).
Let \(\epsilon _t\)’s be i.i.d. Rademacher variables. Then, by the symmetrization inequality in Boucheron et al. (2013),
Let \({\mathbb {E}}^*\) denote the conditional expectation given \(\{X_{i_t,j_t}, J_t\}_{t=1}^n\).
By the contraction principle in Theorem 4.4 of Ledoux and Talagrand (2013), since \(|\frac{\delta l_{\eta ,i_tj_t}}{\varDelta _{i_t,j_t}}|\le 2\eta\),
Then
The second inequality follows from the definition of the dual norm.
By Lemma 1, we have for \(M\in {\mathcal {M}}_0\) and \(r>0\),
Note that
Then we have for all \(M\in {\mathbb {R}}^{p\times q}\) and \(\varDelta =M-M^*\)
If \(M^*\) is exactly low-rank with \(\text {rank}(M^*)\le r\), then \(\sum _{k=r+1}^{q} \sigma _k=0\), in this case
where the last inequality follows from Equation (17).
Then, by Equation (22)
with probability at least \(1-e^{-x}\).
Therefore, by Equation (21), with probability at least \(1-e^{-x}\),
Together with Equation (19), we have with probability at least \(1-e^{-x}\),
\(\square\)
Proof of Theorem 1
Construct \(M_t = M^* + t({\hat{M}}-M^*)\) in the following way. If \(\Vert {\hat{M}}-M^*\Vert _{\max }<\eta\), then \(t=1\), otherwise, choose t such that \(\Vert M_t-M^*\Vert _{\max }=\eta\). Let \(\varDelta _t=M_t-M^*=t({\hat{M}}-M^*)=t{\hat{\varDelta }}\). Notice
Since \({\hat{M}}\) is the optimizer of problem (7),
Therefore,
Then by Lemma 2, for any \(x>0\), with probability at least \(1-e^{-x}\)
Dividing both sides of the inequality by \(||\varDelta _t||_F\), we have
with probability at least \(1-e^{-x}\). The second inequality follows from Equation (23) when \(M^*\) has rank smaller than r and the fact that \(\gamma \ge 2\sigma _1(\nabla {\mathcal {L}}(M^*))\) with probability at least \(1-e^{-x}\) by Lemma 7.
Take \(x=\log (p+q)\) and \(n>C(L) \cdot c_1^2\sqrt{2rp\log (p+q)\log (q+1)}\), with C(L) being a constant depending on L; then \(\Vert \varDelta _t\Vert _{\max }\le \frac{L}{\sqrt{pq}}\Vert \varDelta _t\Vert _F<\eta\). Then, by the construction of \(M_t\), \(t=1\). Finally, we have with probability at least \(1-2e^{-x}-e^{-2x}\ge 1-3(p+q)^{-1}\),
where \(C_1\), \(c_2\), \(c_3\) are absolute constants. \(\square\)
Appendix 3: Proof for reduced-rank regression
Lemma 8
(Upper Bound for \(\sigma _1(\nabla {\mathcal {L}}_\eta (C^*))\)) Suppose that \(\xi _{ij}\)’s are i.i.d. with zero mean and symmetrically distributed around zero, then for any \(x>0\), we have with probability at least \(1-e^{-x}\),
Proof of Lemma 8
Note \(\frac{\partial }{\partial c_{kj}}{\mathcal {L}}(C) = \sum _{i=1}^n -\ell _\eta ' (y_{ij}-x_i^\top c_j) x_{ik}\). Let \(g_{ij}=\ell _\eta ' (y_{ij}-x_i^\top c_j)\), \(G=[g_{ij}]_{n\times q}\) and \(G^*\) the value of G when \(C=C^*\), then \(\nabla {\mathcal {L}}_\eta (C^*)=-X^\top G^*\).
Following the proof of Lemma 3 in Negahban and Wainwright (2011) (the proof is given in its supplementary material), we have
It remains to bound \(\frac{1}{n}\langle Xv, G^*u \rangle\). Let
where \(g^*_i\) is the i-th row of \(G^*\). Since \(\xi _{ij}\)’s are symmetrically distributed around zero and \(l'_\eta (x)\) is an odd function, \({\mathbb {E}}[G^*]=0.\) Hence, \({\mathbb {E}}\{\langle v, x_i \rangle \langle u, g^*_i \rangle \}=0\). Further, for k being any positive integer,
By Bernstein’s inequality, for any \(t>0\) and u, v satisfying \(\Vert u\Vert _2=1, \Vert v\Vert _2=1\),
Combining with Equation (25), we have
Take \(t=2(p+q)+x\) for any \(x>0\), and then we have
\(\square\)
Proof of Lemma 5
In the following, we establish the lower bound for \({\mathbb {E}}[\delta {\mathcal {L}}_\eta ]\) and the upper bound for \(\left| \delta {\mathcal {L}}_\eta -{\mathbb {E}}[\delta {\mathcal {L}}_\eta ]\right|\), respectively, for \(C\in {\mathcal {C}}_0\cap \{C:\Vert C-C^*\Vert _F\le \eta \}\).
Given any \(C\in {\mathcal {C}}_0\cap \{C:\Vert C-C^*\Vert _F\le \eta \}\) and \(\varDelta =C-C^*\), for some \(t_{ij} \in (0,1)\)
where the equality follows from Taylor's theorem, and the first inequality follows from Assumptions 2 and 4. For the calculation of \(\frac{\partial ^2{\mathbb {E}}[l_\eta (y_{ij}-\langle Z^{ij},C\rangle )]}{\partial (\langle Z^{ij},C\rangle )^2}\), refer to the calculation of \(\frac{\partial ^2{\mathbb {E}}[l_\eta (X_{ij}-M_{ij})]}{\partial M_{ij}^2}\) for the matrix completion problem.
For any \(i=1,\dots ,n, j=1,\dots ,q\), there exist \(\tau _{ij}\in (0,1)\), such that \(\ell _\eta (y_{ij}-x_i^\top c_j)-\ell _\eta (y_{ij}-x^\top _i c_j^*) = \ell _\eta '(y_{ij}-x_i^\top {\tilde{c}}_j)x_i^\top (c_j^*-c_j),\) where \({\tilde{c}}_j = c_j^* + \tau _{ij} (c_j-c_j^*)\). Therefore,
Then
Following the proof of Lemma 8, we have for any \(x>0\),
with probability at least \(1-e^{-x}\).
Similar to Equation (23), it can be shown that if \(C^*\) has rank smaller than r, then \(\sup _{C\in {\mathcal {C}}_0}\frac{\Vert \varDelta \Vert _*}{\Vert \varDelta \Vert _F}\le 4\sqrt{2r}.\) Hence, for \(C\in {\mathcal {C}}_0\), \(\Vert \varDelta \Vert _* \le 4\sqrt{2r}\Vert \varDelta \Vert _F\). Now we have with probability at least \(1-e^{-x}\),
\(\square\)
Proof of Theorem 2
Construct \(C_t = C^* + t({\hat{C}}-C^*)\) as follows: if \(\Vert {\hat{C}}-C^*\Vert _F < \eta\), set \(t=1\); otherwise, choose t such that \(\Vert C_t-C^*\Vert _F = \eta\). Let \(\varDelta _t=C_t - C^* = t({\hat{C}}-C^*) = t{\hat{\varDelta }}\). Notice
Since \({\hat{C}}\) is the optimizer of problem (9), we have
Therefore,
By Lemma 5, for any \(x>0\), with probability at least \(1-e^{-x}\),
The second inequality follows from Equation (23) when \(C^*\) has rank smaller than r and the fact that the choice of \(\gamma\) satisfies \(\gamma \ge 2\sigma _1(\nabla {\mathcal {L}}(C^*))\) with probability at least \(1-e^{-x}\) by Lemma 8.
Further, by Equation (23) and the fact that \(\gamma \ge 2\sigma _1(\nabla {\mathcal {L}}(C^*))\) with probability at least \(1-e^{-x}\) by Lemma 8, we have
with probability at least \((1-e^{-x})^2\). Taking \(x=p+q\) and \(n > C_2 \cdot \frac{\sigma _1(\varSigma )}{\sigma _n(\varSigma )} c_1^2 (p+q)r\) for some constant \(C_2\), we have \(\Vert \varDelta _t\Vert _F < \eta\). Then, by the construction of \(C_t\), \(t=1\). Finally, we have with probability at least \(1-2e^{-x}-e^{-2x}\ge 1-3e^{-(p+q)}\),
\(\square\)
Cite this article
**g, N., Fang, E.X. & Tang, C.Y. Robust matrix estimations meet Frank–Wolfe algorithm. Mach Learn 112, 2723–2760 (2023). https://doi.org/10.1007/s10994-023-06325-w