1 Introduction

Massive data with informative structures from the data collection processes are becoming increasingly available in many data-enabled areas. Examples include functional magnetic resonance imaging (fMRI), electroencephalogram (EEG), and tick-by-tick financial trading records of many assets. Methodologically, matrix-valued model parameters are analyzed in the core steps of many popular approaches to multivariate data analysis, including principal component analysis, canonical correlation analysis (Anderson, 2003), Gaussian graphical model analysis (Lauritzen, 1996), reduced-rank regression (Reinsel & Velu, 1998), sufficient dimension reduction (Cook, 2009), and many others.

Structural information, our foremost consideration in this study, is indispensable in solving many matrix estimation problems with large-scale data. For matrix-valued model parameters, one class of methods imposes restrictions on the rank of the targeted matrix. In matrix completion with partial and noisy observations, for example, the signal cannot be recovered successfully without such structural information. For multi-response regression problems, structural information is vital both for methodological development and for drawing informative conclusions in practice. Constraining the rank of the parameter matrix in multi-response regression leads to the conventional reduced-rank regression (Reinsel & Velu, 1998).

Our primary goal in this study is to investigate robustness when estimating matrices from large-scale data with structural information. Robustness is a foundational concern in current data-enabled investigations. During massive data collection processes, observations of heterogeneous quality are inevitable, and erroneous records are common. On the one hand, owing to the huge size of the data in modern large-scale investigations, validation and error correction become too daunting to be practical, so robust statistical methods are highly desirable in these scenarios. On the other hand, the criterion functions commonly applied in existing methods, including the squared loss and the negative log-likelihood, are convenient but unfortunately not robust to violations of the model assumptions in the aforementioned practical reality.

We are thus motivated to consider robustness in a context where structural information is incorporated by constraining the rank of the matrix-valued model parameters. The foremost challenge in this scenario is the fundamental computational difficulty. One source of the difficulty is rooted in the fact that constraining a matrix's rank results in a non-convex problem. Reduced-rank multivariate regression is a rare example in which an analytic solution is available despite the non-convexity; see Reinsel and Velu (1998). Unfortunately, when considerations are broader, such a convenience generally no longer exists, and optimization problems with rank constraints are generally difficult to solve. To meet the challenge, a convex relaxation of the problem leads to regularizing the nuclear norm of the matrix-valued model parameter. From the statistical perspective, numerous works (Candès & Tao, 2010; Negahban & Wainwright, 2011; Agarwal et al., 2012) have studied the theoretical properties of estimators constructed with the nuclear norm relaxation, and have proved that the resulting estimators achieve optimal or near-optimal statistical properties under different settings. In addition to the non-convexity, the consideration of robustness further contributes to the computational difficulty. Resorting to robust loss functions is a traditional and influential route to more robust statistical methods; see Huber (2004) and Hampel et al. (2011). Though demonstrably effective in conventional statistical analysis, substantial difficulties arise when such losses are applied to large-scale modern complex data-enabled problems. Computationally, in particular, major challenges arise because robust loss functions are not smooth and their second-order derivatives do not exist. Analytically, establishing the statistical properties of the matrix estimators is also challenging in this scenario, because the impact of possibly heavy-tailed errors must be accounted for in large-scale problems. Existing methods using the squared loss or the negative log-likelihood as the loss function require sub-Gaussian noise in order to handle high-dimensional data. Robust methods can accommodate noise with heavier tails than sub-Gaussian; meanwhile, the capacity for handling high-dimensional data remains desirable.

There has been active recent development of robust statistical methods for high-dimensional data; see, for example, Loh (2017), Zhou et al. (2018), Sun et al. (2020), and references therein. Recently, there has also been increasing interest in robust methods for matrix-valued model parameters. She and Chen (2017) studied robust reduced-rank regression in a scenario concerning outliers. They define the estimator as the minimizer of a non-convex optimization problem, establish theoretical error bounds, and propose an iterative algorithm that alternately solves for the two parts of the model parameters in their setting. Due to the non-convexity, their algorithm does not guarantee convergence to the minimum. Wong and Lee (2017) studied matrix completion with the Huber loss. Their algorithm iteratively projects non-robust matrix estimators, which is computationally demanding because many projection operations are required. Elsener and van de Geer (2018) investigated robust matrix completion with the Huber loss function and nuclear norm penalization. The computational algorithms in Elsener and van de Geer (2018) involve a soft-thresholding step for the singular values. This works well when the solution has exactly low rank. However, when the solution is only approximately of low rank, or of modestly higher rank, such a step becomes computationally demanding. As pointed out in She and Chen (2017), efficient algorithms are desirable for solving optimization problems with rank constraints and robust loss functions.

We undertake our study with efficient computation for solving large-scale robust statistical problems as a foremost consideration. In particular, we aim to develop efficient first-order algorithms by building a scheme around Frank–Wolfe-type algorithms for robust matrix estimation problems. The Frank–Wolfe algorithm is a first-order method that has drawn considerable attention recently (Jaggi, 2013; Lacoste-Julien & Jaggi, 2015; Freund & Grigas, 2016; Freund et al., 2017; Kerdreux et al., 2018; Swoboda & Kolmogorov, 2019). The key advantage of Frank–Wolfe algorithms is that they avoid the projection steps required by most proximal-type algorithms. In addition, as we shall see in our algorithms in Sect. 2, for matrix estimation problems the Frank–Wolfe algorithm only requires computing the leading pair of singular vectors in each iteration, which can be done efficiently even for huge problems. These merits make Frank–Wolfe-type algorithms particularly appealing for solving large-scale robust low-rank matrix estimation problems.

Our study makes two main contributions. Foremost, we develop a new computation scheme for robust matrix estimation and demonstrate that first-order optimization techniques make solving large-scale robust estimation problems practically convenient. We show that our framework is broadly applicable, covering general robust loss functions including those used in median and quantile regression; see Sect. 2. Second, our theoretical analysis reveals the benefits of using robust loss functions and rank constraints. Our non-asymptotic results demonstrate that our framework can accommodate high-dimensional data. For matrix completion and reduced-rank regression, the resulting matrix-valued estimators work satisfactorily even when the model error distributions are heavy-tailed.

The rest of this article is organized as follows. Section 2 elaborates a concrete framework that uses the Frank–Wolfe algorithm to solve robust matrix estimation problems; we present matrix completion and reduced-rank regression with various robust loss functions. Section 3 establishes the validity of our methods with theory on algorithmic convergence and error bounds for the resulting estimators. Section 4 presents extensive numerical examples demonstrating the promising performance of our methods. Section 5 collects intermediate theoretical results used in the analysis.

For a generic matrix A, we denote by \(A^\top\) its transpose, \(\sigma _1(A)\) its largest singular value, \(\Vert A\Vert _*\) its nuclear norm, and \(\Vert A\Vert _F\) its Frobenius norm. Let \(\langle A, B\rangle =\text {trace}(A^\top B)\) for \(A, B \in {\mathbb {R}}^{p\times q}\). We denote by \(\varTheta \in {\mathbb {R}} ^{p\times q}\) a generic matrix-valued model parameter. In this study, we focus on two concrete cases. In one case, \(\varTheta =M\) where M is the signal to be recovered in the matrix completion problem with a single copy of partial and noisy observations; the other one is \(\varTheta =C\) where C is the matrix-valued coefficients in the multi-response regression problem. Furthermore, we show that our framework broadly applies in solving a general class of problems.
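
As a concrete illustration of this notation, the following minimal numpy sketch (with arbitrary array names) computes the quantities defined above.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
B = rng.standard_normal((5, 3))

sigma_1 = np.linalg.norm(A, 2)        # largest singular value sigma_1(A)
nuclear = np.linalg.norm(A, "nuc")    # nuclear norm ||A||_* (sum of singular values)
frobenius = np.linalg.norm(A, "fro")  # Frobenius norm ||A||_F
inner = np.trace(A.T @ B)             # <A, B> = trace(A^T B)
```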

2 Methodology

2.1 Matrix completion

We consider the matrix completion problem first. In this setting, one observes a noisy subset of the entries of a matrix \(M\in {{\mathbb {R}}}^{p\times q}\), which is the model parameter of interest. Let the set of observed entries be \(\varOmega = \{(i_t,j_t)\}_{t=1}^n\), where \(i_t\in \{1,\dots , p\}\) and \(j_t\in \{1,\dots , q\}\), and denote by \(X_{i_t,j_t}\), \((i_t,j_t)\in \varOmega\), the corresponding noisy observations such that

$$\begin{aligned} X_{i_t,j_t} = M_{i_t,j_t} + \xi _{t}, \quad t=1,\dots , n. \end{aligned}$$

We assume that \(\xi _t\)’s are independent and identically distributed random variables with mean zero.

To effectively recover M from a single copy of partial and noisy observations over \(\varOmega\), one popular approach is to assume that the underlying true matrix, denoted by \(M^*\), is of low rank, that is, \(\text {rank}(M^*)\le r\) for some \(r\le \min (p,q)\). Then one can estimate \(M^*\) by minimizing the objective function \((2n)^{-1}\sum _{t=1}^n \ell (X_{i_t,j_t}-M_{i_t,j_t})\) over M, subject to \(\text {rank}(M)\le r\), for some loss function \(\ell (\cdot )\). Since the rank constraint is non-convex, solving this optimization problem is generally intractable. To obtain a practical solution, a common strategy is to relax the rank constraint to the convex nuclear norm constraint.

The Huber loss function leads to robust estimators because it alleviates the excessive contribution of data points that deviate greatly from the fit. Practically, the Huber loss performs promisingly when a substantial portion of the observations is noisy and possibly heavy-tailed; see Huber (2004).

By applying the Huber loss with a constraint on nuclear norm, we consider the following robust matrix completion problem:

$$\begin{aligned} \min _{M \in {\mathbb {R}}^{p\times q}} {\mathcal {L}}_\eta (M):= \frac{1}{2n}\sum _{t=1}^n \ell _{\eta }(X_{i_t,j_t}-M_{i_t,j_t}), \ \text { subject to }\Vert M\Vert _*\le \lambda , \end{aligned}$$
(1)

where \(\ell _\eta (\cdot )\) is the classical Huber loss function:

$$\begin{aligned} \ell _\eta (x) = {\left\{ \begin{array}{ll} \frac{1}{2}x^2&{} \text{ if } |x|\le \eta ,\\ \eta \cdot \big (|x| - \frac{1}{2}\eta \big ) &{} \text{ otherwise. } \end{array}\right. } \end{aligned}$$
(2)

Here \(\eta\) is the tuning parameter of the Huber loss, and \(\lambda\) is the tuning parameter regularizing the nuclear norm of M. In our numerical studies, we choose both tuning parameters by cross-validation.
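
For reference, the following sketch evaluates the Huber loss (2) and its first derivative elementwise; the function names are ours and purely illustrative.

```python
import numpy as np

def huber_loss(x, eta):
    """Elementwise Huber loss (2) with threshold eta."""
    absx = np.abs(x)
    return np.where(absx <= eta, 0.5 * x ** 2, eta * (absx - 0.5 * eta))

def huber_grad(x, eta):
    """Elementwise derivative of the Huber loss: linear inside [-eta, eta], bounded outside."""
    return np.where(np.abs(x) <= eta, x, eta * np.sign(x))
```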

Since \(\ell _\eta\) is not twice differentiable, the methods commonly applied to \(\ell _2\)-loss problems that rely on second-order derivatives do not directly apply. Solving optimization problem (1) is generally hard; see the discussion in She and Chen (2017). Efficient algorithms for solving (1) are lacking, and the primary difficulty is the absence of a second-order derivative of the Huber loss. It is even more challenging to minimize the Huber loss over a restricted low-rank region and to achieve computational efficiency with large-scale data. More broadly, non-smooth criterion functions are the common case for general robust loss functions, with prominent examples including the least absolute deviation loss of median regression, the check loss of quantile regression, and Tukey's biweight loss, besides the aforementioned Huber loss.

To address the computational difficulty of handling large-scale problems with robust loss functions, we propose to apply the Frank–Wolfe algorithm. The Frank–Wolfe algorithm has been particularly powerful for convex optimization. As a first-order approach requiring no second-order derivative of the criterion function, it is well suited to problems with non-smooth loss functions, which is exactly the case for our problem (1). Briefly speaking, in each iteration the Frank–Wolfe algorithm minimizes over the constraint set a linear approximation of the criterion function built from its gradient, the first-order derivative evaluated at the current value, and then moves toward the resulting minimizer. The Frank–Wolfe algorithm is therefore practically appealing, as one can exploit constrained linear subproblems that can be solved efficiently. For a detailed account of Frank–Wolfe algorithms and recent advances in the area, we refer to Freund and Grigas (2016), Freund et al. (2017), and references therein.

Concretely, in our setting we develop an algorithm that runs iteratively. At the \((k+1)\)-th iteration, given \(M^{(k)}\) from the previous step, the matrix-valued gradient of (1), \(\nabla {\mathcal {L}}(M^{(k)}) \in {\mathbb {R}}^{p\times q}\), is calculated analytically as

$$\begin{aligned}&\nabla {\mathcal {L}}(M^{(k)}) =\,\frac{1}{2n} \sum _{t=1}^n J_t [(M^{(k)}_{i_t,j_t}-X_{i_t,j_t})\text {1}( |M^{(k)}_{i_t,j_t}-X_{i_t,j_t}|\le \eta )\nonumber \\&\quad + \eta \text {sign}(M^{(k)}_{i_t,j_t}-X_{i_t,j_t})\text {1}(|M^{(k)}_{i_t,j_t}-X_{i_t,j_t}|> \eta )], \end{aligned}$$
(3)

where \(J_t\) is a matrix with \(J_{t,i_tj_t}=1\) and all other entries 0, \(\text {1}(\cdot )\) is the indicator function, and \(\text {sign}(x)=1\) if x is positive and \(-1\) otherwise. Hence, evaluating the gradient can be done efficiently, and the computation is scalable and can be distributed when multiple computing units are available. Then, the Frank–Wolfe algorithm computes a descent direction in the \((k+1)\)-th iteration:

$$\begin{aligned} V^{(k+1)} \leftarrow \mathop {\textrm{argmin}}_{V} \langle \nabla {\mathcal {L}}(M^{(k)}),V\rangle , \text { subject to }\Vert V\Vert _* \le \lambda . \end{aligned}$$

In this step, a key observation is that

$$\begin{aligned} V^{(k+1)} = -\lambda \cdot u_1v_1^\top , \end{aligned}$$
(4)

where \(u_1\) and \(v_1\) are the leading left and right singular vectors of \(\nabla {\mathcal {L}}(M^{(k)})\); this follows because the spectral norm is the dual of the nuclear norm, so the linear objective is minimized at a rank-one matrix on the boundary of the feasible set. The required leading singular vectors can be computed efficiently by existing algorithms, such as those implemented in the standard PROPACK package in Matlab. Then, we conduct a descent step to update \(M^{(k)}\) by

$$\begin{aligned} M^{(k+1)} \leftarrow M^{(k)} +\alpha _{k+1} \big (V^{(k+1)} - M^{(k)}\big ), \end{aligned}$$

where \(\alpha _{k+1}\in [0,1]\) is a pre-specified step-size. For example, \(\alpha _{k+1} = 1/(k+3)\) guarantees convergence to an optimal solution. Meanwhile, line search is viable, and there are various ways to further accelerate this algorithm.

Intuitively, the updating direction in Equation (4) can be viewed as a scaled best rank-one approximation of the gradient matrix (3). Further, if we view the vector \(u_1\) as the direction corresponding to the first principal component of the columns of the gradient matrix, then formula (4) is essentially a column-wise update along this direction, with step sizes proportional to the components of the vector \(v_1\). From this perspective, the update formula (4) can also be viewed as a computationally efficient matrix-valued coordinate descent along the direction \(u_1\). Since the objective function (1) is convex, such an update along the gradient-based direction ensures that the criterion function converges toward the minimum.

We summarize the algorithm in Algorithm 1.

Algorithm 1 Frank–Wolfe algorithm for robust matrix completion (pseudocode figure)
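
Since the algorithm box is rendered as a figure, we also give a compact Python sketch of Algorithm 1 for problem (1), using the gradient (3), the rank-one direction (4), and the simple step size \(\alpha _{k+1}=1/(k+3)\). The function and variable names, as well as the fixed iteration budget, are illustrative choices rather than part of the paper's implementation.

```python
import numpy as np
from scipy.sparse.linalg import svds

def huber_grad(x, eta):
    # derivative of the Huber loss (2), applied elementwise
    return np.where(np.abs(x) <= eta, x, eta * np.sign(x))

def fw_robust_completion(X_obs, rows, cols, p, q, lam, eta, n_iter=500):
    """Frank-Wolfe sketch for problem (1): Huber loss over the nuclear-norm ball of radius lam.

    X_obs : observed values X_{i_t, j_t}; rows, cols : integer arrays of their indices (length n).
    """
    n = len(X_obs)
    M = np.zeros((p, q))                        # feasible starting point
    for k in range(n_iter):
        # gradient (3): nonzero only on the observed entries
        resid = M[rows, cols] - X_obs
        G = np.zeros((p, q))
        np.add.at(G, (rows, cols), huber_grad(resid, eta) / (2.0 * n))
        # descent direction (4): -lam times the leading singular pair of the gradient
        u, s, vt = svds(G, k=1)
        V = -lam * np.outer(u[:, 0], vt[0, :])
        # Frank-Wolfe update with the pre-specified step size
        alpha = 1.0 / (k + 3.0)
        M = M + alpha * (V - M)
    return M
```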

2.2 Reduced-rank regression

In our second concrete problem with matrix-valued model parameters, we consider a multivariate linear regression

$$\begin{aligned} y_{ij}=x_i^\top c_j+\xi _{ij}, ~~~~~~~~\text {for } i=1,\ldots ,n,~j=1,\dots ,q, \end{aligned}$$

where \(\xi _{ij}\)’s are model errors. We assume that \(\xi _{ij}\)’s are independent and identically distributed random variables with mean zero. Then, we have in a matrix form

$$\begin{aligned} Y=XC+\varXi , \end{aligned}$$

where \(Y=[y_{ij}]_{n\times q}, X=[x_{ij}]_{n\times p}=[x_1,\dots , x_n]^\top , C=[c_1,\dots , c_q]\in {\mathbb {R}}^{p\times q}\), and \(\varXi =[\xi _{ij}]_{n\times q}\).

In this setting, one may opt to restrict the rank of C, say \(\text {rank}(C)\le r\) with \(r \le \min (p,q)\), leading to the conventional reduced-rank regression (Reinsel & Velu, 1998). Relaxing the rank constraint with the nuclear norm as before, we consider the estimation problem

$$\begin{aligned} \min _{C\in {\mathbb {R}}^{p\times q}} {\mathcal {L}}_{\eta }(C):= \sum _{i = 1}^n\sum _{j=1}^q \ell _{\eta } (y_{ij} - x_{i}^\top c_j), \text { subject to }\Vert C\Vert _* \le \lambda , \end{aligned}$$
(5)

where \(c_j\) denotes the j-th column of C, and \(\ell _\eta (\cdot )\) is the Huber loss function with parameter \(\eta\).

Again, to address the computational challenges, and analogously to problem (1), we propose to solve problem (5) by applying the Frank–Wolfe algorithm iteratively, with the steps described as follows. Denote by \(C^{(k)}\) the solution after the k-th iteration. At the \((k+1)\)-th iteration, let \(\nabla {\mathcal {L}}_{\eta }(C^{(k)})\) be the gradient of the loss function at \(C^{(k)}\):

$$\begin{aligned}\nabla {\mathcal {L}}_{\eta }(C^{(k)}) &= \sum _{i=1}^n\sum _{j=1}^q Z^{ij} [(x_i^\top c^{(k)}_j-y_{ij})\text {1}(|x_i^\top c^{(k)}_j-y_{ij}|\le \eta )\nonumber \\&\quad + \,\eta \text { sign}(x_i^\top c^{(k)}_j-y_{ij})\text {1}(| x_i^\top c^{(k)}_j-y_{ij}|>\eta ) ], \end{aligned}$$
(6)

where \(Z^{ij}\) is a matrix with the j-th column being \(x_i\) and the remaining entries 0. Then, we compute a descent direction from

$$\begin{aligned} V^{(k+1)} \leftarrow \mathop {\textrm{argmin}}_{V\in {\mathbb {R}}^{p\times q}} \langle \nabla {\mathcal {L}}(C^{(k)}), V \rangle \text { subject to }\Vert V\Vert _* \le \lambda , \end{aligned}$$

with the solution

$$\begin{aligned} V^{(k+1)} = -\lambda \cdot u_1v_1^\top , \end{aligned}$$

where \(u_1\) and \(v_1\) are the leading left and right singular vectors of \(\nabla {\mathcal {L}}_\eta (C^{(k)})\).

The algorithm follows Algorithm 1, with different input data and the gradient matrix specified by (6).
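
For implementation, the gradient (6) can be assembled in vectorized form: the (a, j) entry of \(\sum _{i,j} Z^{ij}\,\psi _\eta (x_i^\top c_j-y_{ij})\) equals \((X^\top \varPsi )_{aj}\), where \(\varPsi\) collects the elementwise Huber derivatives of the residual matrix \(XC-Y\). A minimal sketch with illustrative names:

```python
import numpy as np

def huber_grad(x, eta):
    return np.where(np.abs(x) <= eta, x, eta * np.sign(x))

def rrr_gradient(X, Y, C, eta):
    """Gradient (6) of the Huber objective in (5), written as X^T Psi,
    where Psi is the elementwise Huber derivative of the residuals X C - Y."""
    residual = X @ C - Y                       # n x q matrix of x_i^T c_j - y_ij
    return X.T @ huber_grad(residual, eta)     # p x q gradient matrix
```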

2.3 Other robust loss functions

Our framework for developing efficient computation algorithms can easily accommodate a broad class of robust loss functions that are not smooth. Examples of such loss functions are the \(\ell _1\)-loss (the least absolute deviation loss), the check loss, Tukey's biweight loss, and more; see Hampel et al. (2011).

A general scheme is as follows. The only necessary adjustment to Algorithm 1 is the calculation of the gradient of the loss function, \(\nabla {\mathcal {L}}(\cdot )\). The general updating step is then

$$\begin{aligned} \varTheta ^{(k+1)} = \varTheta ^{(k)} + \alpha _{k+1} \big ( V^{(k+1)} - \varTheta ^{(k)} \big ), \end{aligned}$$

where \(\alpha _{k+1}\) is some pre-specified step-size, \(V^{(k+1)} = -\lambda \cdot u_1v_1^\top ,\) with \(u_1\) and \(v_1\) being the first left and right singular vectors of \(\nabla {\mathcal {L}}(\varTheta ^{(k)})\).

Table 1 presents gradients for several common loss functions in the context of matrix completion and reduced-rank regression.

Table 1 Gradients under different loss functions for matrix completion (\(\nabla {\mathcal {L}}(M)\)) and reduced-rank regression (\(\nabla {\mathcal {L}}(C)\)), \(d_{ij}=X_{ij}-M_{ij}\) or \(y_{ij}-x_i^\top c_j\) depending on the context
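
To complement Table 1, the sketch below lists the elementwise score functions \(\psi (d)=\mathrm{d}\ell (d)/\mathrm{d}d\) for the loss functions we consider, written in terms of the residual d as defined in the caption (the gradient matrices in the table follow by the chain rule, which contributes a sign flip since d decreases in the parameter). The quantile level tau and the Tukey constant c below are illustrative parameters, and the formulas are the standard textbook forms rather than excerpts from the table.

```python
import numpy as np

def psi_l1(d):
    """l1 (least absolute deviation) loss: psi(d) = sign(d)."""
    return np.sign(d)

def psi_check(d, tau):
    """Check loss of quantile regression at level tau: psi(d) = tau - 1{d < 0}."""
    d = np.asarray(d)
    return tau - (d < 0).astype(float)

def psi_huber(d, eta):
    """Huber loss (2): psi(d) = d if |d| <= eta, eta * sign(d) otherwise."""
    return np.where(np.abs(d) <= eta, d, eta * np.sign(d))

def psi_tukey(d, c):
    """Tukey's biweight: psi(d) = d * (1 - (d/c)^2)^2 if |d| <= c, 0 otherwise."""
    return np.where(np.abs(d) <= c, d * (1.0 - (d / c) ** 2) ** 2, 0.0)
```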

3 Theory

3.1 Convergence of the algorithms

For completeness, we present the theoretical guarantees for the Frank–Wolfe algorithm in the context of robust matrix estimation, together with a simple way to choose the step-sizes.

We prove that, by choosing the stepsize properly, the objective functions based on the Huber loss in both the matrix completion and reduced-rank regression problems converge to their optima at the rate \({\mathcal {O}}(1/k)\), where k is the iteration counter. The next proposition is for the reduced-rank regression problem; the result for the matrix completion problem can be proved similarly.

Proposition 1

Consider the loss function \({\mathcal {L}}_\eta (\cdot ): {\mathbb {R}}^{p \times q} \rightarrow {\mathbb {R}}\) constructed from the Huber loss function (2) with parameter \(\eta\). For the reduced-rank regression problem (5), suppose the Frank–Wolfe algorithm is run with stepsize set as

$$\begin{aligned} \alpha _{k+1} = \min \left \{ \frac{\langle \nabla {\mathcal {L}}_\eta (C^{(k)}), C^{(k)} - V^{(k+1)}\rangle }{L_z \Vert C^{(k)}-V^{(k+1)}\Vert _F^2},1\right \}, \text { for all }k\ge 1, \end{aligned}$$

where \(L_z\) is some positive number, and suppose the diameter of the feasible set is \(D:=\max _{V_1,V_2\in {\mathbb {S}}} \Vert V_1-V_2\Vert _F\), where \({\mathbb {S}}= \{V: \Vert V\Vert _*\le \lambda \}\). Then \({\mathcal {L}}_\eta (C^{(k)})\) is monotonically decreasing in k, and we have

$$\begin{aligned} {\mathcal {L}}_\eta (C^{(k)}) - {\mathcal {L}}_\eta (C^* ) \le \frac{2L_z D^2}{k}. \end{aligned}$$

Proof

The Huber loss function is differentiable everywhere, and \(\nabla {\mathcal {L}}_\eta ( C )\) is Lipschitz continuous. Thus, taking \(L_z\) above to be its Lipschitz constant, the result follows from Theorem 1 of Freund et al. (2017). \(\square\)

We point out that for the matrix completion problem (1), the result holds by the same argument with \(L_z = 1\).
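
As an illustration, the adaptive step size in Proposition 1 can be computed as in the following sketch; here grad, current, and direction stand for \(\nabla {\mathcal {L}}_\eta (C^{(k)})\), \(C^{(k)}\), and \(V^{(k+1)}\), the inner product is the trace inner product, and L_z is assumed to be supplied.

```python
import numpy as np

def fw_stepsize(grad, current, direction, L_z):
    """Adaptive step size of Proposition 1:
    min{ <grad, current - direction> / (L_z * ||current - direction||_F^2), 1 }."""
    diff = current - direction                 # C^(k) - V^(k+1)
    gap = np.sum(grad * diff)                  # trace inner product <grad, diff>
    denom = L_z * np.sum(diff ** 2)            # L_z * squared Frobenius norm
    return min(gap / denom, 1.0) if denom > 0 else 1.0
```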

Meanwhile, our broad interests include some non-convex losses such as Tukey's biweight loss. A strategy for handling them is to approximate the loss to arbitrary precision by a Lipschitz continuous function, for which simple smoothing techniques are applicable. Upon applying the same stepsizes as discussed above, we can show that the algorithm converges to a stationary point at the same rate; see the analysis in the recent work of Reddi et al. (2016).

Recently, Charisopoulos et al. (2021) studied low-rank matrix recovery algorithms with the non-convex rank constraint and non-smooth loss functions. They established optimization convergence rates for a prox-linear method and a subgradient method for matrix completion, and proved that with a sufficient number of observations and an appropriate initialization, both methods are guaranteed to converge to the truth. The prox-linear method possesses a much faster convergence rate of \({\mathcal {O}}(1/2^k)\), but with a higher computational cost at each iteration from solving a convex subproblem. The subgradient method has a lower cost at each iteration, requiring a subgradient evaluation and a projection onto the desired region, but a slower rate. Compared with their algorithms, our method has a lower computational burden in each iteration, with no projection required, and a relatively slower convergence rate. Directly minimizing a robust loss function under the non-convex constraint is worth studying in the future.

3.2 Statistical properties

We investigate the non-asymptotic error bounds in this section. We first introduce two conditions for both matrix completion and reduced-rank regression models.

Assumption 1

The true matrices \(M^*\) and \(C^*\) have rank at most \(r\), where \(0<r<\min (p,q)\).

Assumption 2

The noises \(\xi\)’s are i.i.d. with zero mean and a distribution function \(F_\xi\) satisfying

$$\begin{aligned} F_\xi (m+\eta )-F_\xi (m-\eta )\ge \frac{1}{c_1^2}, \end{aligned}$$

for any \(|m|\le \eta\) and \(\eta >0\), where \(c_1=c_1(\eta )\) is a constant depending only on \(\eta\).

Assumption 2 is the key condition on the distribution of the noises. It is very mild: it only requires that \(\xi\) has non-vanishing probability mass between \(m-\eta\) and \(m+\eta\) for a positive \(\eta\) and \(|m|\le \eta\), and it avoids explicit conditions on the tail probability or on the existence of moments up to some order. Since the condition holds for \(\eta >0\) as long as the probability mass of \(\xi\) near 0 is not too small, it is easily satisfied by a wide range of distributions, including heavy-tailed ones; see more discussion about this assumption and examples in Appendix 1.
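
As a quick numerical illustration (not part of the formal development), even the standard Cauchy distribution, which has no finite mean, satisfies Assumption 2 with a moderate constant: for \(\eta = 1\), the mass \(F_\xi (m+\eta )-F_\xi (m-\eta )\) over \(|m|\le \eta\) is bounded below by roughly 0.35, so \(c_1^2\approx 2.9\) suffices.

```python
import numpy as np
from scipy.stats import cauchy  # standard Cauchy: heavy-tailed, no finite mean

eta = 1.0
ms = np.linspace(-eta, eta, 201)
mass = cauchy.cdf(ms + eta) - cauchy.cdf(ms - eta)
# The minimum over |m| <= eta gives an admissible lower bound 1 / c_1^2 in Assumption 2.
print(mass.min())   # about 0.352, attained at m = +/- eta, so c_1^2 of about 2.9 works for eta = 1
```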

3.2.1 Matrix completion

For any matrix A and some linear subspace \({\mathcal {M}}\) of \({\mathbb {R}}^{p\times q}\), we define \(A_{{\mathcal {M}}}\) as the projection of A onto \({\mathcal {M}}\). We consider without loss of generality that \(p>q>1\). Recall that \(J_t\) \((t=1,\dots , n)\) is a \(p\times q\) random matrix, independent of \(X_{i_t,j_t}\) and \(\xi _t\), with one randomly chosen entry \(J_{t,i_tj_t}\) being 1 and the others 0. \(M_{i_t,j_t}\) can be written as

$$\begin{aligned} M_{i_t,j_t}=\text {tr}(J_t^\top M)=\sum _{i=1}^{p}\sum _{j=1}^{q} J_{t,ij}M_{ij}, \end{aligned}$$

for all \((i_t,j_t)\in \varOmega\). As a working model, we treat \(J_t\) as uniformly distributed over its support. That is, the probability of \(M_{i_t,j_t}\) being the t-th observation is \((p q)^{-1}\). This assumes that the observed entries in the target matrix are uniformly sampled at random (Koltchinskii et al., 2011; Rohde & Tsybakov, 2011; Elsener & van de Geer, 2018), and we refer to Klopp (2014) for more discussions. Recht (2011) analyzed the matrix completion model under this assumption. As pointed out in Recht (2011), this is a sampling with replacement scheme and therefore may appear less realistic as it may result in duplicated entries; however, it has the benefit of simplifying the technical proof and assumptions. Overall, it is a reasonable and informative showcase without requiring any prior information on the sampling scheme. If additional information is available in the sampling process, other models such as the weighted sampling model (Negahban & Wainwright, 2012) can be applied.

We first show that the estimator belongs to a restricted set. We consider the singular value decomposition

$$\begin{aligned} M^* = U \varLambda V^\top , \end{aligned}$$

where U is a \(p\times q\) matrix, \(\varLambda\) is a \(q\times q\) diagonal matrix whose diagonal entries are the ordered singular values \(\sigma _1\ge \sigma _2\ge \dots \ge \sigma _{q}\), and V is a \(q\times q\) matrix. For \(k=1,2,\dots ,q\), let \(u_k\) be the k-th column of U, and \(v_k\) the k-th column of V. For any positive integer \(r\le \min \{p,q\}\), let \(U_r\) be the subspace of \({\mathbb {R}}^{p}\) spanned by \(u_1, \dots , u_r\), and \(V_r\) be the subspace of \({\mathbb {R}}^{q}\) spanned by \(v_1, \dots , v_r\). Define a pair of subspaces of \({\mathbb {R}}^{p\times q}\) as

$$\begin{aligned} {\mathcal {M}}_r(U,V)&:=\{M\in {\mathbb {R}}^{p\times q}\mid \text {row}(M)\subseteq V_r, \text { col}(M)\subseteq U_r\},\\ \overline{{\mathcal {M}}}_r^\perp (U,V)&:=\{M\in {\mathbb {R}}^{p\times q}\mid \text {row}(M)\subseteq V_r^{\perp }, \text { col}(M)\subseteq U_r^{\perp }\}, \end{aligned}$$

where \(\text {row}(M)\) and \(\text {col}(M)\) denote the row and column spaces of M. For notational simplicity, we use \({\mathcal {M}}_r ={\mathcal {M}}_r(U,V)\) and \(\overline{{\mathcal {M}}}_r^\perp =\overline{{\mathcal {M}}}_r^\perp (U,V)\). Lemma 1 (presented in Sect. 5) indicates that the estimator \({\hat{M}}\) belongs to the set

$$\begin{aligned} {\mathcal {M}}_0=\{M\in {\mathbb {R}}^{p\times q}:||\varDelta _{\overline{{\mathcal {M}}}_r^\perp }||_* \le 3||\varDelta _{\overline{{\mathcal {M}}}_r}||_*+4\sum _{k=r+1}^{q}\sigma _k, ~\varDelta =M-M^*\}. \end{aligned}$$

To establish the error bounds, we need the following technical assumption.

Assumption 3

For any \(M\in {\mathcal {M}}_0\), there exists a real number \(L>1\), such that

$$\begin{aligned} \Vert M-M^*\Vert _{\max } \le \frac{L}{\sqrt{pq}} \Vert M-M^*\Vert _F. \end{aligned}$$

Assumptions of this type, often referred to as the ‘spikiness condition’, are assumed in the existing literature on analogous problems, e.g., in Negahban and Wainwright (2012) for matrix completion problems; see also the recent work of Fan et al. (2021). Intuitively, this assumption requires that for \(M\in {\mathcal {M}}_0\), the entries of \(M-M^*\) are not overly ‘spiky’, in other words that they are relatively evenly distributed, so that the maximum discrepancy is not extremely far from the average discrepancy. We remark that here the term \(\frac{1}{\sqrt{pq}}\) relates to the aforementioned uniform sampling scheme, under which each entry is observed with probability \(\frac{1}{pq}\). Hence, it reflects an increasingly difficult high-dimensional problem due to sparsely observed entries in a single copy of a large matrix. If instead the probability of each entry being observed is a constant independent of p and q, this assumption is not required.

We consider the Lagrangian form of the problem (1):

$$\begin{aligned} {\hat{M}} = \mathop {\textrm{argmin}}_{M \in {\mathbb {R}}^{p\times q}} \{{\mathcal {L}}_\eta (M) +\gamma \Vert M\Vert _*\}, \end{aligned}$$
(7)

where \(\gamma >0\) is the corresponding regularization tuning parameter. Let \({\hat{\varDelta }}={\hat{M}}-M^*\) and \(\varDelta =M-M^*\). Theorem 1 establishes a non-asymptotic upper bound on the error of estimating a low-rank \(M^*\).

Theorem 1

For problem (7), suppose that Assumptions 1, 2, and 3 hold and that the noises \(\xi _t\)’s are distributed symmetrically about zero. Let \({\hat{M}}\) be the solution to problem (7) with

$$\begin{aligned} \gamma = 2\eta \left\{ 4c_0\left[ \sqrt{\frac{\log (p+q)}{n q}}+\sqrt{\log (q+1)}\frac{\log (p+q)}{n}\right] +\sqrt{\frac{2\log (p+q)}{n p q}}+\frac{8\log (p+q)}{3n}\right\} , \end{aligned}$$

with a constant \(c_0>0\). When \(n>C(L)\cdot c^2_1 pr\log (p+q)\log (q+1)\),

$$\begin{aligned} \frac{1}{\sqrt{p q }}\Vert {\hat{\varDelta }}\Vert _F\le C_1 c_1^2\eta \sqrt{\frac{p\log (p+q)\log (q+1)}{n}}\left(\sqrt{2r}c_2+c_3\right), \end{aligned}$$

with probability at least \(1-3(p+q)^{-1}\), for some constants \(C_1\), \(c_2\) and \(c_3\) independent of n, p, and q, and C(L) a constant only depending on L.

Theorem 1 is non-asymptotic; \(\gamma\) is chosen based on Lemma 7 in Appendix 2 as twice the upper bound of \(\sigma _1(\nabla {\mathcal {L}}_\eta (M^*))\). In Theorem 1, we only require that the error terms satisfy Assumption 2, namely \(F_\xi (m+\eta )-F_\xi (m-\eta )>\frac{1}{c_1^2}\), for \(\eta >0\) the parameter in the Huber loss (2) and \(|m|<\eta\). Since this assumption is easily satisfied by many heavy-tailed distributions, this result demonstrates the robustness of our method.

We note that in general \(\gamma\) can be

$$\begin{aligned} K_1\cdot 2\eta \left\{ 4c_0\left[ \sqrt{\frac{\log (p+q)}{n q}}+\sqrt{\log (q+1)}\frac{\log (p+q)}{n}\right] +\sqrt{\frac{2\log (p+q)}{n p q}}+\frac{8\log (p+q)}{3n}\right\} , \end{aligned}$$

for any constant \(K_1\ge 1\). Under the conditions in Theorem 1, we can also derive the upper bounds of the estimation error in nuclear norm based on (23) in the Appendix:

$$\begin{aligned} \frac{1}{\sqrt{pq}}\Vert {\hat{\varDelta }}\Vert _*\le 4C_1c_1^2\eta \sqrt{2r}\sqrt{\frac{p\log (p+q)\log (q+1)}{n}}\left(\sqrt{2r}c_2+c_3\right). \end{aligned}$$

We may discuss the asymptotic properties of \({{\hat{M}}}\) as \(n \rightarrow \infty\). Matrix completion is a hard problem that attempts to recover a matrix-valued model parameter from a single incomplete copy of the data generating process. The average estimation error converges to zero in probability as \(n\rightarrow \infty\): when \(rp \log (p+q)\log (q+1)=o(n)\), we have \((pq)^{-1}\Vert {\hat{\varDelta }}\Vert _F^2 \rightarrow 0\). Intuitively, if the rank of \(M^*\) is r, then the number of free parameters is of order rp. Hence it is reasonable to require a sample size of a larger order than rp in order to recover the model parameters consistently.

Without requiring the Gaussian assumption, our error rate is still comparable to the statistical optimum established by Koltchinskii et al. (2011) for matrix completion problems under a low-rank constraint with Gaussian noise. Compared with the lower bound given in Theorem 6 of Koltchinskii et al. (2011), our upper bound in Theorem 1 differs only in an additional logarithmic term \(\sqrt{\log (p+q)\log (q+1)}\) and the \(\eta\) in the Huber loss.

The assumption in Theorem 1 that the model error is symmetrically distributed around 0 is needed to obtain the upper bound of \(\sigma _{1}(\nabla {\mathcal {L}}(M^*))\); see the proof of Lemma 7. It ensures that \(\sigma _1({\mathbb {E}}[\nabla {\mathcal {L}}(M^*)])=0\). Similar assumptions are also found in Loh (2017). Thanks to the symmetry assumption, the convergence can be established with no strong extra requirement on \(\eta\). Without the symmetry, as shown in Lemma 7 in Appendix 2, other conditions are required to control

$$\begin{aligned} {\mathbb {E}}[\frac{\partial l_\eta (X_{ij}-M^*_{ij})}{\partial M_{ij}^*}]=\int _{M^*_{ij}-\eta }^{M^*_{ij}+\eta }F(X_{ij})dX_{ij}-\eta , \end{aligned}$$

so that

$$\begin{aligned} \sigma _{1}\left({\mathbb {E}}[\nabla {\mathcal {L}}(M^*)]\right)&=\sigma _1( \frac{1}{2n}\sum _{t=1}^n \sum _{i=1}^{p}\sum _{j=1}^{q} {\mathbb {E}}[J_tJ_{t,ij}] {\mathbb {E}}\left[\frac{\partial l_\eta (X_{ij}-M^*_{ij})}{\partial M_{ij}^*}\right])\\&=\frac{1}{2pq}\sigma _1(\left[ {\mathbb {E}}[\frac{\partial l_\eta (X_{ij}-M^*_{ij})}{\partial M_{ij}^*}]\right] _{p\times q}) \end{aligned}$$

is stochastically small enough. With this extra term, the upper bound in Theorem 1 becomes

$$\begin{aligned} \frac{1}{\sqrt{p q }}\Vert {\hat{\varDelta }}\Vert _F&\le \frac{\text {Constant} \cdot c_1^2\sqrt{2r}}{\sqrt{pq}}\sigma _1\left( \left[ {\mathbb {E}}[\frac{\partial l_\eta (X_{ij}-M^*_{ij})}{\partial M_{ij}^*}]\right] _{p\times q}\right) \nonumber \\& \quad+C_1c_1^2\eta \sqrt{\frac{p\log (p+q)\log (q+1)}{n}}\left(\sqrt{2r}c_2+c_3\right). \end{aligned}$$
(8)

The extra term in (8) may then be viewed as a price paid to achieve robustness against noise with heavy-tailed distributions. This is an impact of applying the robust Huber loss, and it is a remarkably different feature from studies of matrix completion with the \(\ell _2\)-loss. Nevertheless, it is worth noting that for \(\ell _2\)-loss based studies, conditions are commonly assumed to control the tail behavior of the model errors, for example sub-Gaussian distributions. In contrast, our development does not require such assumptions on the tail probabilities, which is the gain obtained in return for applying the Huber loss.

3.2.2 Reduced-rank regression

The problem (5) is also expressed in the Lagrangian form:

$$\begin{aligned} {\hat{C}} = \mathop {\textrm{argmin}}_{C\in {\mathbb {R}}^{p\times q}} \{{\mathcal {L}}_\eta (C)+\gamma \Vert C\Vert _*\}, \end{aligned}$$
(9)

where \(\gamma >0\) is a regularization parameter, and \({\mathcal {L}}_\eta (C)\) is defined in Equation (5).

Again, we point out that the estimator belongs to a restricted set. By applying the singular value decomposition to \(C^*\), we have

$$\begin{aligned} C^* = U \varLambda V^\top , \end{aligned}$$

where \(\varLambda =\textrm{diag}(\sigma _1,\dots , \sigma _q)\) is the diagonal matrix containing all singular values of \(C^*\). For \(r\le \min \{p,q\}\), we define a pair of subspaces of \({\mathbb {R}}^{p\times q}\) as

$$\begin{aligned} {\mathcal {C}}_r(U,V)&\;=\;\{M\in {\mathbb {R}}^{p\times q}\mid \text {row}(M)\subseteq V_r, \text { col}(M)\subseteq U_r\},\\ \overline{{\mathcal {C}}}_r^\perp (U,V)&\;=\;\{M\in {\mathbb {R}}^{p\times q}\mid \text {row}(M)\subseteq V_r^{\perp }, \text { col}(M)\subseteq U_r^{\perp }\}, \end{aligned}$$

where \(U_r\) is the subspace spanned by the first r columns of U, and \(V_r\) is the subspace spanned by the first r columns of V. For simplicity of notation, we write \({\mathcal {C}}_r ={\mathcal {C}}_r(U, V)\) and \(\overline{{\mathcal {C}}}_r^\perp =\overline{{\mathcal {C}}}_r^\perp (U,V)\). Note that \({\mathcal {C}}_r\) and \(\overline{{\mathcal {C}}}_r\) are not equal. Lemma 4 indicates that the estimator \({\hat{C}}\) belongs to the set

$$\begin{aligned} {\mathcal {C}}_0=\left\{C\in {\mathbb {R}}^{p\times q}:\Vert \varDelta _{\overline{{\mathcal {C}}}_r^\perp }\Vert _* \le 3\Vert \varDelta _{\overline{{\mathcal {C}}}_r}\Vert _*+4\sum _{k=r+1}^{q}\sigma _k, \varDelta =C-C^* \right\}. \end{aligned}$$

We assume the following conditions on the random design matrix X.

Assumption 4

\(x_1, x_2, \dots , x_n\) are i.i.d. random vectors sampled from a multivariate normal distribution \({\mathcal {N}} (0, \varSigma )\) and, without loss of generality, are standardized such that \(\Vert x_i\Vert _F\le 1\). Moreover, \(\sigma _1(\Sigma )\ge \sigma _n(\Sigma )>0\), where \(\sigma _1(\Sigma )\) and \(\sigma _n(\Sigma )\) denote the largest and smallest eigenvalues of \(\Sigma\), respectively.

The multivariate normal distribution and its analogues are commonly assumed in the literature (e.g., Negahban & Wainwright, 2011; Sun et al., 2020; Fan et al., 2021). The setting with Assumption 4 facilitates achieving the optimal convergence rate; other types of conditions are possible, at the expense of a slower convergence rate.

Theorem 2 establishes a non-asymptotic upper bound for \(\Vert {\hat{\varDelta }}\Vert _F\).

Theorem 2

For problem (9), suppose that Assumptions 1 and 2 hold and that the noises \(\xi _{ij}\)’s are distributed symmetrically about zero. Suppose X satisfies Assumption 4. Let \({\hat{C}}\) be the solution to the optimization problem (9) with

$$\begin{aligned} \gamma = 8 \eta \sigma _1(\Sigma )\left(\sqrt{6n(p+q)}+3(p+q)\right). \end{aligned}$$
(10)

Then for \(n>C_2\frac{\sigma _1(\Sigma )}{\sigma _n(\Sigma )}c_1^2{r(p+q)}\) with probability at least \(1-3e^{-(p+q)}\),

$$\begin{aligned} \Vert {\hat{\varDelta }}\Vert _F \le C_3 c_1^2\sqrt{2r}\eta \frac{\sigma _1(\Sigma )}{{{\sigma _{n}(\Sigma )}}}\left(\sqrt{\frac{6(p+q)}{n}}+\frac{3(p+q)}{n}\right), \end{aligned}$$

where \(C_2\) and \(C_3\) are constants.

The value of \(\gamma\) in (10) is selected, based on Lemma 8 in Appendix 3, as twice the upper bound for \(\sigma _1(\nabla {\mathcal {L}}_\eta (C^*))\). Generally, for any \(K_2\ge 1\) and

$$\begin{aligned} \gamma = K_2\cdot 8 \eta \sigma _1(\Sigma )\left(\sqrt{6n(p+q)}+3(p+q)\right), \end{aligned}$$

our result remains valid and only differs in constant terms.

Under the same condition, we can establish the error bound in terms of the nuclear norm

$$\begin{aligned} \Vert \varDelta \Vert _* \le 8C_3 c_1^2 r\eta \frac{\sigma _1(\Sigma )}{\sigma _n(\Sigma )}\left(\sqrt{\frac{6(p+q)}{n}}+\frac{3(p+q)}{n}\right). \end{aligned}$$

When \(r(p+q)=o(n)\), the Frobenius norm of the error satisfies \(\Vert {\hat{\varDelta }}\Vert _F^2 \rightarrow 0\) in probability. Similarly, the robustness of the method follows from the fact that only the mild distributional Assumption 2 is required: \(F_\xi (m+\eta )-F_\xi (m-\eta )>\frac{1}{c_1^2}\) for \(|m|<\eta\) and \(\eta >0\). Our estimator achieves a convergence rate comparable to those in Negahban and Wainwright (2011) and Rohde and Tsybakov (2011), with the notable difference being the \(\eta\) in the Huber loss. Meanwhile, our method does not require the errors to follow normal distributions, which is the case in those studies with the \(\ell _2\) loss. Here the symmetry assumption plays the same role as in Theorem 1. By the same discussion following Theorem 1, if the noises are not symmetrically distributed, then there will be an extra term in the upper bound.

4 Numerical examples

In this section, we conduct an extensive numerical investigation of the proposed method using both simulated and real data sets. In all cases, we choose the tuning parameters by ten-fold cross-validation. Specifically, for matrix completion problems, we first randomly select \(90\%\) of the observed entries as training samples and test the results using the remaining \(10\%\) samples. We repeat the procedure 10 times and choose the best tuning parameter. With extensive studies on simulated and real data sets, our results provide strong empirical evidence that the proposed method provides robustness under different settings.

4.1 Jester joke data

We first test our method using the Jester joke data set. This data set contains more than 4.1 million ratings for 100 jokes from 73,421 users. This data set is publicly available through http://www.ieor.berkeley.edu/~goldberg/jester-data/. The whole data set contains three sub-datasets, which are: (1) jester-1: 24,983 users who rate 36 or more jokes; (2) jester-2: 23,500 users who rate 36 or more jokes; (3) jester-3: 24,938 users who rate between 15 and 35 jokes. More detailed descriptions can be found in Toh and Yun (2010) and Chen et al. (2012), where the authors consider the nuclear-norm based approach to conduct matrix completion.

Due to the large number of users, we randomly select \(n_u\) users’ ratings from the datasets. Since many entries are unknown, we cannot compute the relative error using every entry. Instead, we use the normalized mean absolute error (NMAE) to measure the accuracy of the estimator \({\hat{M}}\):

$$\begin{aligned} \text {NMAE} = \frac{\sum _{(j,k) \in \varOmega } |{\hat{M}}_{jk} - M^0_{jk}| }{|\varOmega | (r_{\max }-r_{\min })}, \end{aligned}$$

where \(r_{\min }\) and \(r_{\max }\) denote the lower and upper bounds for the ratings, respectively. In the Jester joke data set, the range is \([-10,10]\). Thus, we have \(r_{\max }-r_{\min } = 20\).
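
A minimal sketch of the NMAE computation over the observed index set \(\varOmega\), with illustrative variable names:

```python
import numpy as np

def nmae(M_hat, M0, mask, r_min=-10.0, r_max=10.0):
    """Normalized mean absolute error over the observed entries.

    mask : boolean array marking the entries (j, k) in Omega with known ratings.
    """
    abs_err = np.abs(M_hat[mask] - M0[mask])
    return abs_err.sum() / (mask.sum() * (r_max - r_min))
```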

In each iteration, we first randomly select \(n_u\) users, and then randomly permute the ratings from these users to generate \(M^0\in {\mathbb {R}}^{n_u\times 100}\). Next, we uniformly sample a proportion SR of the entries, with SR\(\in \{15\%,20\%,25\%\}\), to generate a set of observed indices \(\varOmega\). Note that we can only observe the entry (j, k) if \((j,k)\in \varOmega\) and \(M^0_{j,k}\) is available; thus, the actual sampling ratio is less than the input SR. We consider different settings of \(n_u\) and SR, and we report the averaged NMAE and running times in Table 2 after running each setting 100 times. We compare robust methods based on the \(\ell _1\), Huber, and Tukey losses with the non-robust \(\ell _2\) loss. From Table 2, we see that the robust matrix completion methods work promisingly.

4.2 Cameraman image denoising

We test our method using the popular Cameraman image, which is widely used in the image processing literature. We consider the “Cameraman” image with \(512\times 512\) pixels, as shown in Fig. 1a. We first add independent Gaussian noise with standard deviation 3 to each pixel. Then, we add heavy-tailed noise by randomly choosing 10% of the pixels and replacing their values with 1000 or \(-1000\). Furthermore, we randomly select 40% or 60% of the pixels as missing entries. We show two typical simulated noisy images in the upper part of Fig. 1b, c, and the images recovered using the Tukey approach below them. The recovered images provide visual evidence that our method is robust to heavy-tailed noise in practice. In addition, in Table 3, we report the averaged NMAE with standard deviations for the different approaches after repeating the data generating schemes 100 times. From the effective image recovery and the NMAE, we conclude that robust matrix completion has promising performance with partial and noisy observations.
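
To convey the flavor of this experiment, the corruption scheme described above can be sketched as follows (Gaussian noise with standard deviation 3, 10% of the entries replaced by \(\pm 1000\), and a 40% or 60% missing mask); the exact implementation details in our experiments may differ.

```python
import numpy as np

def corrupt_image(img, missing_rate=0.4, sd=3.0, outlier_rate=0.10, seed=0):
    """Add Gaussian noise, heavy-tailed outliers (+/-1000), and a missing mask."""
    rng = np.random.default_rng(seed)
    noisy = img.astype(float) + rng.normal(0.0, sd, size=img.shape)
    # replace a random 10% of the pixels with +1000 or -1000
    outliers = rng.random(img.shape) < outlier_rate
    noisy[outliers] = rng.choice([1000.0, -1000.0], size=int(outliers.sum()))
    # mark a random subset of pixels as missing; only noisy[observed] is used for recovery
    observed = rng.random(img.shape) >= missing_rate
    return noisy, observed
```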

Fig. 1

a We test our method on the \(512\times 512\) Cameraman image. b A sample noisy image with heavy-tailed noises and 40% missing entries. c A sample noisy image with heavy-tailed noises and 60% missing entries

Table 2 Averaged normalized mean absolute error with standard deviations in the parentheses for different methods using Jester joke data set under different data generating schemes after 100 runs
Table 3 Averaged normalized mean absolute error with standard deviations in the parentheses for different methods using the Cameraman image after 100 runs

4.3 Simulations

We first consider several simulation settings similar to those described in She and Chen (2017) to compare our method with their robust reduced-rank regression (\(R^4\)) method. In all cases, we focus on testing robustness by artificially introducing data corruption and outliers.

Setting 1: We first consider a low-dimensional case where we set \(n = 100\), \(p = 12\), \(q = 8\) and \(r = 3\) or 5. We construct the design matrix X by generating its n rows independently from \(N(0,\Sigma _0)\), where we consider highly correlated covariates by letting the diagonal elements of \(\Sigma _0\) be 1 and setting its off-diagonal elements to 0.5. For the noise matrix \(\varXi\), we sample each row of \(\varXi\) independently from \(N(0,\sigma ^2\Sigma _1)\), where \(\Sigma _1\) is the q-dimensional identity matrix and \(\sigma\) is set as 1. Next, we construct the coefficient matrix \(C^*\). We generate \(C^* = B_1B_2^\top\), where \(B_1\in {\mathbb {R}}^{p\times r}\), \(B_2\in {\mathbb {R}}^{q\times r}\), and all entries of \(B_1\) and \(B_2\) are independently sampled from N(0, 1). We then add outliers through a matrix \(U^*\) by setting the first \(o\% \cdot n\) rows of \(U^*\) as nonzero, where \(o \in \{30, 35,\ldots ,50\}\) is the percentage of outliers, and the j-th entry of any outlier row of \(U^*\) is the product of a Rademacher random variable and a scalar \(\alpha \in \{0.75,1\}\) times the sample standard deviation of the j-th column of \(XC^*\). Finally, we set the response matrix \(Y = XC^* + U^* + \varXi\). A sketch of this data-generating scheme is given after the MSE definitions below. We report the mean and standard deviation of the mean squared error (MSE) from 200 runs, where

$$\begin{aligned} \text {MSE}(X{\hat{C}}) = \Vert XC^* - X{{\hat{C}}}\Vert _F^2/ (qn). \end{aligned}$$

In addition, we also report the mean and standard deviation of the mean squared estimation error, where

$$\begin{aligned} \text {MSE}({\hat{C}}) = \Vert {\hat{C}}-C^*\Vert _F^2/(qp). \end{aligned}$$
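
The following condensed sketch reflects our reading of the data-generating scheme of Setting 1 and the two error metrics; variable names, the random seed, and minor details are illustrative assumptions.

```python
import numpy as np

def generate_setting1(n=100, p=12, q=8, r=3, o=0.30, alpha=0.75, seed=0):
    rng = np.random.default_rng(seed)
    Sigma0 = 0.5 * np.ones((p, p)) + 0.5 * np.eye(p)      # 1 on the diagonal, 0.5 off-diagonal
    X = rng.multivariate_normal(np.zeros(p), Sigma0, size=n)
    C_star = rng.standard_normal((p, r)) @ rng.standard_normal((q, r)).T
    E = rng.standard_normal((n, q))                       # noise: sigma = 1, Sigma_1 = I_q
    # outliers in the first o*n rows: Rademacher sign times alpha * column sd of X C*
    U = np.zeros((n, q))
    n_out = int(o * n)
    scale = alpha * (X @ C_star).std(axis=0)
    U[:n_out, :] = rng.choice([-1.0, 1.0], size=(n_out, q)) * scale
    Y = X @ C_star + U + E
    return X, Y, C_star

def mse_fit(X, C_hat, C_star):
    n, q = X.shape[0], C_star.shape[1]
    return np.linalg.norm(X @ (C_star - C_hat), "fro") ** 2 / (q * n)

def mse_coef(C_hat, C_star):
    p, q = C_star.shape
    return np.linalg.norm(C_hat - C_star, "fro") ** 2 / (q * p)
```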

Setting 2: We then test our method with heavy-tailed noise. As in Setting 1, we let \(n = 100\), \(p = 12\), \(q = 8\), and \(r = 2, 3\), or 4, and consider the same generating scheme for the design matrix X; we then generate the noise matrix from the heavy-tailed t-distribution with 3 or 5 degrees of freedom. Furthermore, we add outliers using the same generating scheme as in Setting 1 to construct \(U^*\), with \(\alpha = 0.5, 0.75\), or 1.

Setting 3: We consider a high-dimensional setting where \(n = 100\), \(p = 50\) and \(q = 50\), and \(r = 3\) or 5, where there are \(2,500 > 100\) parameters in the matrix C to be estimated. We consider the same data generating scheme as in Setting 1.

Setting 4: Finally, we consider an ultrahigh-dimensional setting where \(n = 300\), \(p = 100\) and \(q = 400\), and \(r = 3\) or 5, where there are \(40,000\gg 300\) parameters to be estimated. We consider the same data generating scheme as in Setting 1.

The results are shown in Tables 4, 5, 6, and 7. We compare our method, incorporating the Huber and Tukey loss functions, with the \(R^4\) method when the latter is applicable. We note that for the high-dimensional Settings 3 and 4, the \(R^4\) method of She and Chen (2017) cannot be applied because one of the iterations in their algorithm is not defined. We also compare our method with another robust method in which the \(\ell _1\) loss is used in place of the Huber loss in the objective with the nuclear norm constraint (denoted \(\ell _1\)). In all four settings, both the Huber and Tukey losses achieve very promising performance, and the Tukey loss slightly outperforms the Huber loss in settings with outliers.

Table 4 Sample average of MSE\((X{\hat{C}})\) and MSE\(({{\hat{C}}})\) for Setting 1 under different settings with sample standard deviation in parentheses after 200 runs
Table 5 Sample average of MSE\((X{\hat{C}})\) and MSE\(({{\hat{C}}})\) for Setting 2 under different settings with sample standard deviation in parentheses after 200 runs
Table 6 Sample average of MSE\((X{\hat{C}})\) and MSE\(({{\hat{C}}})\) for Setting 3 under different settings with sample standard deviation in parentheses after 200 runs
Table 7 Sample average of MSE\((X{\hat{C}})\) and MSE\(({{\hat{C}}})\) for Setting 4 under different settings with sample standard deviation in parentheses after 200 runs

5 Intermediate theoretical results

Our estimators (7) and (9) are penalized M-estimators. We exploit the framework of Negahban et al. (2012) in studying their statistical properties. Negahban et al. (2012) elaborate the notion of decomposability associated with a penalty function, which is a key property for establishing the restricted strong convexity (RSC) property and the error bounds of the penalized estimators.

For completeness, we outline the decomposability of the nuclear norm penalty, and then derive the restricted strong convexity property for both models under the Huber loss function.

5.1 Decomposability of nuclear norm

A norm \(\Vert \cdot \Vert\) is decomposable with respect to a pair of subspaces \(({\mathcal {M}},\overline{{\mathcal {M}}}^\perp )\) of \({\mathbb {R}}^{p\times q}\) if, for all \(A\in {{\mathcal {M}}}\) and \(B\in \overline{{\mathcal {M}}}^\perp\),

$$\begin{aligned} \Vert A+B\Vert =\Vert A\Vert +\Vert B\Vert . \end{aligned}$$

To illustrate the decomposability of nuclear norm, recall

$$\begin{aligned} {\mathcal {M}}_r(U,V)&:=\{M\in {\mathbb {R}}^{p\times q}\mid \text {row}(M)\subseteq V_r, \text { col}(M)\subseteq U_r\},\\ \overline{{\mathcal {M}}}_r^\perp (U,V)&:=\{M\in {\mathbb {R}}^{p\times q}\mid \text {row}(M)\subseteq V_r^{\perp }, \text { col}(M)\subseteq U_r^{\perp }\}. \end{aligned}$$

Note that \({\mathcal {M}}_r\ne \overline{{\mathcal {M}}}_r\). Since U and V both have orthonormal columns, the nuclear norm is decomposable with respect to the pair \(({\mathcal {M}}_r,\overline{{\mathcal {M}}}_r^\perp )\). Note that if the rank of \(M^*\) is equal to or smaller than r, then \(U_r\) and \(V_r\) equal or contain the column and row spaces of \(M^*\), respectively, and \(M^*\in {\mathcal {M}}_r(U,V)\).

We present key intermediate results as lemmas below. The proofs of the lemmas are given in the Appendix.

5.2 Results for matrix completion

The decomposability leads to the first lemma, which is a special case of Lemma 1 in Negahban et al. (2012). It provides an upper bound for \(\Vert {\hat{\varDelta }}_{\overline{{\mathcal {M}}}_r^\perp }\Vert _*\).

Lemma 1

For any \(\gamma\) satisfying

$$\begin{aligned} \gamma \ge 2\sigma _1(\nabla {\mathcal {L}}_\eta (M^*)), \end{aligned}$$

the error \({\hat{\varDelta }}\) satisfies

$$\begin{aligned} \Vert {\hat{\varDelta }}_{\overline{{\mathcal {M}}}_r^\perp }\Vert _* \le 3\Vert {\hat{\varDelta }}_{\overline{{\mathcal {M}}}_r}\Vert _*+4\sum _{k=r+1}^{q}\sigma _k. \end{aligned}$$

Lemma 1 indicates that the estimator \({\hat{M}}\) belongs to the set

$$\begin{aligned} {\mathcal {M}}_0=\{M\in {\mathbb {R}}^{p\times q}:||\varDelta _{\overline{{\mathcal {M}}}_r^\perp }||_* \le 3||\varDelta _{\overline{{\mathcal {M}}}_r}||_*+4\sum _{k=r+1}^{q}\sigma _k, ~\varDelta =M-M^*\}. \end{aligned}$$

Note that if the rank of \(M^*\) is no greater than r, then \(\sum _{k=r+1}^{q}\sigma _k=0\) and the projection of the error on \(\overline{{\mathcal {M}}}_r^\perp\) is solely controlled by the projection of the error on \(\overline{{\mathcal {M}}}_r\), and so is the error itself, since

$$\begin{aligned} \Vert {\hat{\varDelta }}\Vert _*\le \Vert {\hat{\varDelta }}_{\overline{{\mathcal {M}}}_r^\perp }\Vert _* +\Vert {\hat{\varDelta }}_{\overline{{\mathcal {M}}}_r}\Vert _*\le 4\Vert {\hat{\varDelta }}_{\overline{{\mathcal {M}}}_r}\Vert _*. \end{aligned}$$

Now, consider the quantity

$$\begin{aligned} \delta {\mathcal {L}}_\eta (M,M^*)={\mathcal {L}}_\eta (M)-{\mathcal {L}}_\eta (M^*)-\langle \nabla {\mathcal {L}}_\eta (M^*),{\varDelta }\rangle . \end{aligned}$$

For simplicity, we sometimes write \(\delta {\mathcal {L}}_\eta (M,M^*)\) as \(\delta {\mathcal {L}}_\eta\). The next lemma gives a lower bound on \(\delta {\mathcal {L}}_\eta (M,M^*)\), which is used to establish restricted strong convexity (RSC) and the upper bound for the error. The key ingredients in proving this lemma are Lemma 1 and empirical process techniques.

Lemma 2

(Lower bound of \(\delta {\mathcal {L}}_\eta (M,M^*)\)) Suppose Assumptions 1 and 2 hold, and that the regularization parameter in optimization problem (7) satisfies

$$\begin{aligned} \gamma \ge 2 \sigma _1(\nabla {\mathcal {L}}_\eta (M^*)). \end{aligned}$$

Then for any \(x>0\) and \(M \in \{M:\Vert M-M^*\Vert _{\max } \le \eta \} \cap {\mathcal {M}}_0\),

$$\begin{aligned} \delta {\mathcal {L}}_\eta (M,M^*)&\ge \frac{1}{4c_1^2pq}\Vert \varDelta \Vert _F^2\\&~-\{32\sqrt{2r}\eta c_0\left[ \sqrt{\frac{\log (p+q)}{nq}}+\sqrt{\log (q+1)}\frac{\log (p+q)}{n}\right] + \sqrt{\frac{2x\eta ^2}{npq}}+\frac{8x\eta }{3n}\} \Vert \varDelta \Vert _F, \end{aligned}$$

with probability at least \(1-e^{-x}\).

By controlling the negative term, we have the restricted strong convexity property.

Lemma 3

(Restricted Strong Convexity) Suppose that all the conditions in Lemma 2 and Assumption 3 hold. For \(M \in \{M:\Vert M-M^*\Vert _{\max } \le \eta \} \cap {\mathcal {M}}_0\), with probability at least \(1-e^{-(p+q)}\),

$$\begin{aligned} \delta {\mathcal {L}}_\eta (M, M^*) \ge \frac{1}{8c_1^2pq}\Vert \varDelta \Vert _F^2, \end{aligned}$$

for \(n>C(L)\cdot c^2_1 pr\log (p+q)\log (q+1)\), where C(L) is a constant only depending on L.

5.3 Results of reduced-rank regression

Recall

$$\begin{aligned} {\mathcal {C}}_r(U,V)&\;=\;\{M\in {\mathbb {R}}^{p\times q}\mid \text {row}(M)\subseteq V_r, \text { col}(M)\subseteq U_r\},\\ \overline{{\mathcal {C}}}_r^\perp (U,V)&\;=\;\{M\in {\mathbb {R}}^{p\times q}\mid \text {row}(M)\subseteq V_r^{\perp }, \text { col}(M)\subseteq U_r^{\perp }\}. \end{aligned}$$

Lemma 1 can be easily extended to \({\hat{C}}\).

Lemma 4

For any \(\gamma\) satisfying

$$\begin{aligned} \gamma \ge 2\sigma _1(\nabla {\mathcal {L}}_\eta (C^*)), \end{aligned}$$

\({\hat{\varDelta }}={\hat{C}}-C^*\) satisfies

$$\begin{aligned} \Vert {\hat{\varDelta }}_{\overline{{\mathcal {C}}}_r^\perp }\Vert _* \le 3\Vert {\hat{\varDelta }}_{\overline{{\mathcal {C}}}_r}\Vert _*+4\sum _{k=r+1}^{q}\sigma _k. \end{aligned}$$

Lemma 4 indicates that the estimator \({\hat{C}}\) belongs to the set

$$\begin{aligned} {\mathcal {C}}_0=\{C\in {\mathbb {R}}^{p\times q}:\Vert \varDelta _{\overline{{\mathcal {C}}}_r^\perp }\Vert _* \le 3\Vert \varDelta _{\overline{{\mathcal {C}}}_r}\Vert _*+4\sum _{k=r+1}^{q}\sigma _k, \varDelta =C-C^*\}. \end{aligned}$$

The next result is to establish the RSC condition. Consider the quantity

$$\begin{aligned} \delta {\mathcal {L}}_\eta (C,C^*)={\mathcal {L}}_\eta (C)-{\mathcal {L}}_\eta (C^*)-\langle \nabla {\mathcal {L}}_\eta (C^*),{\varDelta }\rangle . \end{aligned}$$

Lemma 5

(Lower bound of \(\delta {\mathcal {L}}_\eta (C,C^*)\)) Consider the reduced-rank regression problem (9). Suppose that Assumptions 1, 2, and 4 hold, and that the noises \(\xi _{ij}\)’s are distributed symmetrically about zero. Suppose the regularization parameter in optimization problem (9) satisfies

$$\begin{aligned} \gamma \ge 2 \sigma _1(\nabla {\mathcal {L}}_\eta (C^*)). \end{aligned}$$

Then for any \(x>0\) and \(C\in \{C:\Vert C-C^*\Vert _F\le \eta \}\cap {\mathcal {C}}_0\),

$$\begin{aligned} \delta {\mathcal {L}}_\eta (C, C^*) \ge \frac{n\sigma _n(\varSigma )}{2c_1^2}\Vert \varDelta \Vert _F^2 - 48\sqrt{2r}\eta \sigma _1(\varSigma )(\sqrt{4n(p+q)+2nx}+2(p+q)+x)\Vert \varDelta \Vert _F, \end{aligned}$$

with probability at least \(1-e^{-x}\).

By controlling the negative term and setting the right side to be greater than 0, we have the restricted strong convexity property.

Lemma 6

(Restricted Strong Convexity) Suppose that all the conditions in Lemma 5 hold. Then, for \(C\in \{C:\Vert C-C^*\Vert _F\le \eta \}\cap {\mathcal {C}}_0\) and \(n>C_2\frac{\sigma _1(\varSigma )}{\sigma _n(\varSigma )}c_1^2{r(p+q)}\), where \(C_2\) is a constant,

$$\begin{aligned} \delta {\mathcal {L}}_\eta (C, C^*) \ge \frac{n\sigma _n(\varSigma )}{4c_1^2}\Vert \varDelta \Vert _F^2, \end{aligned}$$

with probability at least \(1-(p+q)^{-1}\).