1 Introduction and problem statement

We consider a regression problem with n training points whose regression values (also known as responses or the signal) are denoted by the vector \({\mathbf{a}}^*\in {\mathbb{R}}^n\) (see below for examples). An adversary introduces additive corruptions \({\mathbf{b}}^*\in {\mathbb{R}}^n\) so that the responses we actually observe are given by

$${\mathbf{y}}= {\mathbf{a}}^{*} + {\mathbf{b}}^{*}.$$
(1)

Under suitable conditions on \({\mathbf{a}}^* \text{ and } {\mathbf{b}}^*\), can we recover \({\mathbf{a}}^*\) (as well as the parameters of its generative model) despite the corruptions? This paper develops the APIS framework to answer this question in the affirmative.

Key idea: If \({\mathbf{a}}^*\in \mathscr {A}\) and \({\mathbf{b}}^*\in \mathscr {B}\), where \(\mathscr {A} \text{ and } \mathscr {B}\) are unions of low-rank subspaces that are incoherent with respect to each other (see Sect. 1.1 for an introduction to incoherence and Sect. 6.1 for details), then this paper shows that such recovery is indeed possible. Specifically, let \({\mathbf{a}}^*\in \mathscr {A} \text{ and } {\mathbf{b}}^*\in \mathscr {B}\), where \(\mathscr {A}= \bigcup _{i=1}^P A_i\) and \(\mathscr {B}= \bigcup _{j=1}^Q B_j\) are the unions of subspaces, with \({{\,\mathrm{rank}\,}}(A_i) \le s\) for all \(i \in [P]\) and \({{\,\mathrm{rank}\,}}(B_j) \le k\) for all \(j \in [Q]\) for some integers \(s, k > 0\). Then, this paper shows that it is possible to recover \({\mathbf{a}}^*\) consistently using a simple strategy that involves alternating projections onto these unions using projection operators \(\varPi _\mathscr {A}(\cdot ) \text{ and } \varPi _\mathscr {B}(\cdot )\) (see Algorithm 1). As we shall see, such incoherent unions of subspaces implicitly arise in several learning settings, e.g., if \({\mathbf{a}}^* \text{ and } {\mathbf{b}}^*\) are known to have s- and k-sparse representations in two bases that are incoherent to each other. We denote the privileged subspaces within these unions to which \({\mathbf{a}}^*\) and \({\mathbf{b}}^*\) belong as \({A}^*\ni {\mathbf{a}}^* \text{ and } {B}^*\ni {\mathbf{b}}^*\).

Algorithm 1: APIS

What is known to APIS: APIS requires the projection operators \(\varPi _\mathscr {A}(\cdot ), \varPi _\mathscr {B}(\cdot )\) to be efficiently executable at runtime, since its alternating strategy may invoke these operators multiple times. Thus, it needs the unions \(\mathscr {A} \text{ and } \mathscr {B}\) to be (implicitly) known. The discussions in Sects. 2, 5.1 and 5.2 assure that these conditions are indeed satisfied in several interesting learning applications. However, APIS does not require the vectors \({\mathbf{a}}^* \text{ and } {\mathbf{b}}^*\) to be known, nor does it assume that the subspaces \({A}^* \text{ and } {B}^*\) to which they belong are known, nor does it require \({A}^* \text{ and } {B}^*\) to be unique.

1.1 A key to the manuscript

The reader may be curious about several questions that must be answered for the above strategy to make sense. We summarize APIS’s answers to these questions below and provide more details in subsequent discussions, where these answers are underlined and italicized for easy identification.

  1. How are \(\mathscr {A} \text{ and } \mathscr {B}\) known to the algorithm? Sect. 2 shows how, in the case of robust linear regression, the union \(\mathscr {A}\) is implicitly known the moment training data is made available. Section 5.1 shows that the same is true in several other important learning applications. Section 5.2, on the other hand, shows how the union \(\mathscr {B}\) is defined for several interesting corruption models.

  2. How are the projection operators for these unions constructed and efficiently executed? Projection onto unions of spaces can be intractable in general. Nevertheless, Sect. 5 shows how for several interesting learning applications, the projection operators \(\varPi _\mathscr {A}(\cdot ) \text{ and } \varPi _\mathscr {B}(\cdot )\) can be executed efficiently. Moreover, Table 1 gives explicit time complexity for these projection operations in a variety of applications.

  3. What if \(\mathscr {A} \text{ and } \mathscr {B}\) are not incoherent to each other? As discussed in Sects. 3 and 6.4, APIS can exploit local incoherence properties to guarantee recovery even when strict notions of incoherence fail to hold. See Sect. 2 for an introduction to incoherence.

  4. Does the adversary know \(\mathscr {A}\) (or, perhaps even \({\mathbf{a}}^*\)) before deciding \({\mathbf{b}}^*\)? As discussed in Sect. 4, APIS allows a fully adaptive adversary that is permitted to decide the corruption vector \({\mathbf{b}}^*\) with complete knowledge of \({\mathbf{a}}^*, {A}^*\) as well as \(\mathscr {A} \text{ and } \mathscr {B}\).

  5. Is the model in Eq. (1) general enough to capture interesting applications and can the unions \(\mathscr {A} \text{ and } \mathscr {B}\) be data-dependent? The discussion in Sect. 5 shows that the model does indeed capture several statistical estimation and signal processing problems such as low-rank kernel regression, sparse signal transforms, robust sparse recovery, and robust linear regression. In most of these settings, the union \(\mathscr {A}\) is indeed data-dependent.

  6. What if \({\mathbf{a}}^* \text{ and } {\mathbf{b}}^*\) are only approximate members of these unions? As discussed in Sect. 6.4, APIS can readily accommodate compressible signals, where the clean signal \({\mathbf{a}}^*\) does not belong to \(\mathscr {A}\) but is well-approximated by vectors in \(\mathscr {A}\), as well as handle unmodelled errors such as simultaneous sparse corruptions and dense Gaussian noise.

  7. How low-rank must the subspaces in the unions be, i.e., how large can s and k be? The key result in this paper, Theorem 1, guarantees recovery as soon as a certain incoherence requirement is satisfied. Satisfying this requirement yields bounds on s and k for various applications. Table 1 summarizes the signal-corruption pairings for which APIS guarantees perfect recovery and bounds how large s and k can be. Detailed derivations of these results are presented in the appendices and summarized in Sect. 6.

2 A gentle introduction to the intuition behind APIS

Fig. 1: Illustration of the distinction between a pair of incoherent subspaces (a) and a pair of coherent subspaces (b). For the sake of simplicity, the unions \(\mathscr {A} \text{ and } \mathscr {B}\) contain a single subspace each in these examples

We recall our model from Sect. 1. We have \(\mathbf{y}= {\mathbf{a}}^*+ {\mathbf{b}}^*\) with \({\mathbf{a}}^*\in \mathscr {A} \text{ and } {\mathbf{b}}^*\in \mathscr {B}\), where \(\mathscr {A}= \bigcup _{i=1}^P A_i\) and \(\mathscr {B}= \bigcup _{j=1}^Q B_j\) are unions of subspaces, with \({{\,\mathrm{rank}\,}}(A_i) \le s\) for all \(i \in [P]\) and \({{\,\mathrm{rank}\,}}(B_j) \le k\) for all \(j \in [Q]\) for some integers \(s, k > 0\). To present the core ideas behind APIS, we consider a simplified scenario where \(P = 1 = Q\), i.e., the unions consist of a single subspace each: \(\mathscr {A}= \left\{ A\right\} , \mathscr {B}= \left\{ B\right\}\). As Definition 1 states, we say two subspaces \(A, B \subseteq {\mathbb{R}}^n\) (of possibly different ranks) are \(\mu\)-incoherent for some \(\mu > 0\), if \(\forall ~\mathbf{u}\in A\), \(\left\| \varPi _B(\mathbf{u}) \right\| _2^2 \le \mu \cdot \left\| \mathbf{u} \right\| _2^2\) and \(\forall ~\mathbf{v}\in B\), \(\left\| \varPi _A(\mathbf{v}) \right\| _2^2 \le \mu \cdot \left\| \mathbf{v} \right\| _2^2\). As the discussion after Definition 1 shows, an alternate interpretation of this property is that for any two unit vectors \(\mathbf{a}\in A, \mathbf{b}\in B\), we must always have \((\mathbf{a}^\top \mathbf{b})^2 \le \mu\). This means that if \(\mu\) is small, then no two vectors from these two subspaces can be very aligned with each other, and thus the vectors must be nearly orthogonal. Definition 2 extends the concept of incoherence to unions of subspaces.

Figure 1 illustrates this concept using a toy example where A is a rank-2 subspace of \({\mathbb{R}}^3\) and B is a rank-1 subspace of \(\mathbb{R}^3\), i.e., \(s = 2, k = 1\). Notice that in Fig. 1a, the subspaces A and B are highly incoherent, indicating a value of \(\mu \rightarrow 0\). It is not possible for two vectors, one each from A and B, to be very aligned with each other. On the other hand, Fig. 1b illustrates an example of a pair of subspaces that are quite coherent and have a high value of \(\mu \rightarrow 1\). Also shown in Fig. 1b are examples of two vectors \({\mathbf{a}}^*\in A, {\mathbf{b}}^*\in B\) that are extremely aligned with each other, since the subspaces A and B are not incoherent and allow vectors to get very aligned.

Why incoherence helps robust recovery: To appreciate the benefits of incoherence, consider an extreme example where A and B are perfectly incoherent with \(\mu = 0\), as illustrated in Fig. 2. For example, take \(A = \left\{ (x,y,z) \in {\mathbb{R}}^3, x + y + z = 0\right\}\) to be a rank-2 subspace of \({\mathbb{R}}^3\) and \(B = \left\{ (t,t,t) \in {\mathbb{R}}^3, t \in {\mathbb{R}}\right\}\) to be a rank-1 subspace of \({\mathbb{R}}^3\). Clearly, for any \((x,y,z) \in A, (t,t,t) \in B\), we have \(\left\langle (x,y,z),(t,t,t) \right\rangle = t(x + y + z) = 0\). Now suppose the signal and corruption vectors are chosen as \({\mathbf{a}}^*\in A, {\mathbf{b}}^*\in B\) and we are presented with \(\mathbf{y}= {\mathbf{a}}^*+ {\mathbf{b}}^*\). Separating these two components is extremely simple in this case. To extract \({\mathbf{b}}^*\) from \(\mathbf{y}\), we simply project \(\mathbf{y}\) onto the subspace B to get \(\varPi _B(\mathbf{y}) = \varPi _B({\mathbf{a}}^*+ {\mathbf{b}}^*) = \varPi _B({\mathbf{a}}^*) + \varPi _B({\mathbf{b}}^*) = \mathbf{0}+ {\mathbf{b}}^*= {\mathbf{b}}^*\), where \(\varPi _B({\mathbf{b}}^*) = {\mathbf{b}}^*\) since \({\mathbf{b}}^*\) already lies in B (orthonormal projections act as the identity on their own subspace) and \(\varPi _B({\mathbf{a}}^*) = \mathbf{0}\) due to the perfect incoherence between the two subspaces. Having done this, we can recover \({\mathbf{a}}^*\) by simply shaving off the contribution of \({\mathbf{b}}^*\) in \(\mathbf{y}\) and projecting onto A to get \(\varPi _A(\mathbf{y}- {\mathbf{b}}^*) = \varPi _A({\mathbf{a}}^*) = {\mathbf{a}}^*\). It is easy to see that the above two steps are simply a single iteration of Algorithm 1. Thus, perfect incoherence allows straightforward recovery. Theorem 1 shows that APIS assures recovery even when the subspaces are reasonably, but not perfectly, incoherent. Section 6.4 extends this further to show how APIS offers recovery even in cases where only local incoherence is present in the task structure.
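For illustration, the following short numerical sketch (not taken from the paper) constructs orthonormal bases for the two perfectly incoherent subspaces above and checks that a single pair of projections recovers \({\mathbf{a}}^*\) and \({\mathbf{b}}^*\) exactly; the specific vectors and helper names are illustrative.

```python
import numpy as np

# A = {(x, y, z) : x + y + z = 0} (rank 2), B = {(t, t, t) : t real} (rank 1)
basis_A = np.linalg.qr(np.array([[1., -1., 0.], [1., 0., -1.]]).T)[0]  # 3 x 2 orthonormal basis
basis_B = np.array([[1., 1., 1.]]).T / np.sqrt(3.)                     # 3 x 1 orthonormal basis

def proj(U, v):
    """Orthogonal projection of v onto the column span of U (orthonormal columns)."""
    return U @ (U.T @ v)

a_star = np.array([2., -3., 1.])   # lies in A (entries sum to zero)
b_star = np.array([5., 5., 5.])    # lies in B
y = a_star + b_star

b_hat = proj(basis_B, y)           # = b* since Pi_B(a*) = 0 by perfect incoherence
a_hat = proj(basis_A, y - b_hat)   # = a* since a* already lies in A

assert np.allclose(a_hat, a_star) and np.allclose(b_hat, b_star)
```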

Fig. 2: An extreme example illustrating how perfect incoherence (\(\mu = 0\)) allows APIS to perform recovery in a single iteration of Algorithm 1. However, APIS does not require perfect incoherence and guarantees recovery even if we only have \(\mu < \frac{1}{9}\) (see Theorem 1), and even if only local incoherence is assured (see Sect. 6.4)

Why lack of incoherence can make recovery impossible: To see why some form of incoherence is essential in general, consider a case similar to the one in Fig. 1b but taken to the extreme, i.e., where \(\mu = 1\). To present a general case, let us take A and B to be higher-rank spaces. For example, let \(A = \left\{ (a,b,c,d) \in {\mathbb{R}}^4, a + b = 0\right\}\) and \(B = \left\{ (p,q,0,s) \in {\mathbb{R}}^4, p + q = 0\right\}\) be two subspaces of \({\mathbb{R}}^4\) with ranks 3 and 2 respectively. Since the unit vector \(\left( \frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}, 0, 0\right)\) lies in both A and B, the subspaces are coherent with \(\mu = 1\). Suppose we are unlucky enough to have chosen the signal as \({\mathbf{a}}^*= (u, -u, 0, v) \in A\) for some \(u, v \in {\mathbb{R}}\). Note that \({\mathbf{a}}^*\in A \cap B\). Then the adversary can readily choose \({\mathbf{b}}^*= (x, -x, 0, y) \in B\) for some secret values of \(x, y \in {\mathbb{R}}\) that the adversary does not reveal to anybody. Recall that the adversary is allowed to choose \({\mathbf{b}}^*\) having seen the value of \({\mathbf{a}}^*\). Thus, we are presented with \(\mathbf{y}= {\mathbf{a}}^*+ {\mathbf{b}}^*= ((u+x), -(u+x), 0, (v+y))\). However, depending on x, y, this can be an arbitrary vector in the space B. Thus, recovering \({\mathbf{a}}^*\) becomes equivalent to recovering the secret values x, y, which makes recovery impossible.

A real-life example: To make the above intuitions concrete, let us take the example of robust linear regression where we have \({\mathbf{a}}^*= X^\top {\mathbf{w}}^*\in {\mathbb{R}}^n\), where \(X = [\mathbf{x}^1,\ldots ,\mathbf{x}^n] \in {\mathbb{R}}^{d \times n}\) is the covariate matrix of the n data points and \({\mathbf{w}}^*\in {\mathbb{R}}^d\) is the linear model. In this case we always have \({\mathbf{a}}^*\in A {:}{=} \left\{ X^\top \mathbf{w}: \mathbf{w}\in {\mathbb{R}}^d\right\}\), the span of the rows of X, i.e., \(P = 1\) and \(\mathscr {A}= \left\{ A\right\}\). Suppose \(\mathscr {B}= \left\{ B\right\}\) with \(B = A\), i.e., a completely coherent system with \(\mu = 1\). In this case, the adversary can choose an adversarial model \(\tilde{\mathbf{w}} \in {\mathbb{R}}^d\) and set \({\mathbf{b}}^*= X^\top (\tilde{\mathbf{w}} - {\mathbf{w}}^*) \in B\) so that we are presented with \(\mathbf{y}= {\mathbf{a}}^*+ {\mathbf{b}}^*= X^\top \tilde{\mathbf{w}}\). Since \(\tilde{\mathbf{w}}\) is kept secret by the adversary, recovery yet again becomes impossible. On the other hand, as Table 1 and the calculations in Sect. B in the appendix show, if the adversary is restricted to impose only sparse corruptions, specifically \(\mathscr {B}= \left\{ \mathbf{b}\in {\mathbb{R}}^n, \left\| \mathbf{b} \right\| _0 \le k\right\}\) for \(k \le \frac{n}{154}\), then \(\mathscr {A}\) is sufficiently incoherent from \(\mathscr {B}\) and APIS guarantees recovery.

3 Related works and our contributions in context

3.1 Summary of contributions

APIS presents a unified framework for designing robust (non-parametric) regression algorithms based on the principle of successive projections onto incoherent subspaces, and applies it to various settings (see Sect. 5). APIS also offers explicit breakdown points and achieves some of the fastest recovery times in experiments. Below, we survey past works in these settings and place our contributions in context.

3.2 Robust non-parametric regression

Classical results in this area mostly rely on robust estimators such as the Huber, \(L_1\) and median estimators (Cizek and Sadikoglu 2020; Fan et al. 1994), some of which (e.g., those based on Tukey’s depth) are computationally intractable (Du et al. 2018). Please refer to the recent reviews in Cizek and Sadikoglu (2020) and Du et al. (2018) for details. More recent work includes the LBM method (Du et al. 2018), which uses binning and median-based techniques.

Comparison: We compare against all these methods experimentally in Sect. 7. Classical techniques mostly do not offer explicit breakdown points; instead, they analyze the influence function of their estimators (Cizek and Sadikoglu 2020). Classical works and LBM also consider only Huber contamination models where the adversary is essentially stochastic. In contrast, APIS offers explicit breakdown points against a fully adaptive adversary (see Sect. 4). LBM does not scale well with dimension d. Unless it receives \(n = (\varOmega \left( 1\right) )^d\) training points, i.e., exponentially many in d, it has to either settle for coarse bins that increase the bias or face a situation where most bins are unpopulated, affecting recovery. In contrast, APIS only requires kernel ridge regression problems to be solved, for which efficient routines exist even for large d.

3.3 Robust linear regression

Past works adopt various strategies such as robust gradient methods, e.g., SEVER (Diakonikolas et al. 2019) and RGD (Prasad et al. 2018), hard-thresholding techniques, e.g., TORRENT (Bhatia et al. 2015), and reweighting techniques, e.g., STIR (Mukhoty et al. 2019), apart from classical techniques based on robust loss functions such as Tukey’s bisquare, and constrained \(L_1\)-minimization-based morphological component analysis (MCA) (McCoy and Tropp 2014).

Comparison: We compare against all these methods experimentally in Sect. 7. On the theoretical side, APIS offers more attractive guarantees. SEVER requires \(n > d^5\) samples, whereas APIS requires \(n > \varOmega \left( d\log (d)\right)\) samples. RGD offers theoretical guarantees only for Huber and heavy-tailed contamination models where the adversary is essentially stochastic, whereas APIS can tolerate a fully adaptive adversary (see Sect. 4). APIS offers much sharper guarantees (see Sect. 6.4) than TORRENT and STIR in the hybrid corruption case where, apart from sparse corruptions, all points face Gaussian noise. However, TORRENT and STIR offer better breakdown points than APIS.

3.4 Robust Fourier and other signal transforms

Several works offer recovery of Fourier-sparse functions under sparse outliers, with the discrete cube or torus being candidate domains, and propose algorithms based on linear programming (Chen and De 2020; Guruswami and Zuckerman 2016). These offer good theoretical guarantees but are expensive (\(\text {poly}(n)\) runtime) to implement (the authors themselves offer no experimental work). On the other hand, APIS only requires “fast” transforms such as the FFT to be carried out several times (and consequently, APIS offers an \(\mathscr {O}\left( n\log n\right)\) runtime in these settings). Under sparse corruptions, APIS guarantees robust versions of several other transforms, such as robust Hadamard transforms (see Table 1). APIS is also able to handle dense corruptions in special cases (see Sect. 6). The work of Bafna et al. (2018) uses an algorithm proposed by Baraniuk et al. (2010) in the context of performing robust Fourier transforms in the presence of sparse corruptions. However, their RIP-based analysis is restrictive and only applies to transforms such as Fourier, for which every entry of the design matrix is \(\mathscr {O}\left( 1/{\sqrt{n}}\right)\) (see Bafna et al. 2018, Theorem 2.2). This is not true of transforms such as the Haar wavelet, where design matrix entries can be \(\varOmega \left( 1\right)\). APIS continues to give recovery guarantees even in such cases, and it can handle certain cases where corruptions are dense, i.e., \(\left\| {\mathbf{b}}^* \right\| _0 = \varOmega \left( n\right)\), which Bafna et al. (2018) do not consider.

3.5 Use of (local) incoherence in literature

The general principle of alternating projections and the notion of incoherence have been used in prior work. For example, Hegde and Baraniuk (2012) apply this principle to the problem of signal recovery on incoherent manifolds. However, our application of the alternating projection principle to robust non-parametric regression is novel and not addressed by prior work. Notions of incoherence and incoherent bases are also well-established in compressive sensing (Candes and Wakin 2008) and matrix completion (Chen 2015). However, to the best of our knowledge, APIS offers the first application of these notions to robust non-parametric recovery. It is well-known (Krahmer and Ward 2014; Zhou et al. 2016) that (global) incoherence may be unavailable in practical situations (e.g., Fourier and wavelet bases are not incoherent). Nevertheless, several results in compressive sensing (Krahmer and Ward 2014), matrix completion (Chen et al. 2014) and robust PCA (Zhang et al. 2015) assert that local notions of incoherence can still guarantee recovery. Section 6.4 shows that APIS can also exploit local incoherence properties to guarantee recovery in settings where strict notions of incoherence fail to hold.

3.6 Learning incoherent spaces

An interesting line of work has pursued the goal of learning incoherent dictionaries for the task of classification (Schnass and Vandergheynst 2010; Barchiesi and Plumbley 2013, 2015). Specifically, a set of discriminative subspaces (sub-dictionaries) is learnt, one per class, so as to offer a discriminative advantage in supervised classification tasks. However, in the problem setting for APIS, as described after Eq. (1), the subspaces \(\mathscr {A}, \mathscr {B}\) are well-defined once training data has been obtained and the corruption model has been fixed, and thus do not need to be learnt. For this reason, these works do not directly relate to the work in the current paper.

4 APIS: alternating projections onto incoherent subspaces

Adversary model: APIS allows a fully adaptive adversary that is permitted to decide the corruption vector \(\underline{{\mathbf{b}}^*}\) with complete knowledge of \(\underline{{\mathbf{a}}^*, {A}^*}\) as well as \(\underline{\mathscr {A} \text{ and } \mathscr {B}}\).

We note that this is the most potent adversary model considered in the literature. Specifically, given a pair of incoherent unions \(\mathscr {A} \text{ and } \mathscr {B}\), first a subspace \({A}^*\) and \({\mathbf{a}}^*\in {A}^*\) are chosen arbitrarily. The adversary is now told \({A}^*, {\mathbf{a}}^*\) and is then free to choose a subspace \({B}^*\) in the union \(\mathscr {B}\) and \({\mathbf{b}}^*\in B^*\) using its knowledge in any way.

APIS is described in Algorithm 1 and involves alternately projecting onto unions of subspaces \(\mathscr {A}, \mathscr {B}\). For specific applications, the projection steps take on various forms, e.g., solving a (kernel) least-squares problem, a Fourier transform, or hard-thresholding. These are discussed in Sect. 5.
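For illustration, the following Python sketch captures the alternating structure of Algorithm 1; it is a paraphrase rather than the reference implementation. The callables `proj_A` and `proj_B` stand for the application-specific operators \(\varPi _\mathscr {A}(\cdot )\) and \(\varPi _\mathscr {B}(\cdot )\), the update order follows the proof sketch of Theorem 1, and the zero initialization and stopping rule are assumptions made here.

```python
import numpy as np

def apis(y, proj_A, proj_B, n_iters=100, tol=1e-10):
    """Alternating projections onto incoherent (unions of) subspaces.

    y      : observed responses, y = a* + b*
    proj_A : callable implementing the projection onto the signal union A
    proj_B : callable implementing the projection onto the corruption union B
    Returns estimates (a_hat, b_hat) of the clean signal and the corruption.
    """
    a = np.zeros_like(y)                 # assumed initialization a^0 = 0
    b = np.zeros_like(y)
    for _ in range(n_iters):
        b_new = proj_B(y - a)            # corruption estimate b^{t+1}
        a_new = proj_A(y - b_new)        # signal estimate a^{t+1}
        converged = np.linalg.norm(a_new - a) <= tol * max(np.linalg.norm(a), 1.0)
        a, b = a_new, b_new
        if converged:
            break
    return a, b
```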

Notation: For \(\mathbf{v}\in {\mathbb{R}}^d\) and set \(F \subseteq [d]\), let \(\mathbf{v}_F \in {\mathbb{R}}^d\) denote a vector with coordinates in the set F identical to those in \(\mathbf{v}\) and others set to zero. For any matrix \(X \in {\mathbb{R}}^{d \times n}\) and any sets \(S \subseteq [n], F \subseteq [d]\), we let \(X^F_S = [\tilde{x}_{ij}] \in {\mathbb{R}}^{d \times n}\) be a matrix such that \(\tilde{x}_{ij} = x_{ij}\) if \(i \in F, j \in S\) and \(\tilde{x}_{ij} = 0\) otherwise. We similarly let \(X^F = [z_{ij}]\) denote the matrix with entries in the rows in F identical to those in X and other entries zeroed out i.e. \(z_{ij} = x_{ij}\) if \(i \in F\) and \(z_{ij} = 0\) otherwise. \(X_S\) is similarly defined as a matrix with entries in the columns in S identical to those in X and other entries zeroed out.

Projections: For any subspace \(S \subseteq {\mathbb{R}}^n\), \(\varPi _S\) denotes orthonormal projection onto S and \(\varPi _S^\perp\) denotes the orthonormal projection onto the ortho-complement of S so that for any \(\mathbf{v}\in {\mathbb{R}}^n\) and any subspace S, we always have \(\mathbf{v}= \varPi _S(\mathbf{v}) + \varPi _S^\perp (\mathbf{v})\). We abuse notation to extend the projection operator \(\varPi _\cdot (\cdot )\) to unions of low-rank subspaces. Let \(\mathscr {A}= \bigcup _{i=1}^P A_i \subseteq {\mathbb{R}}^n\) be a union of P subspaces, then for any \(\mathbf{v}\in {\mathbb{R}}^n\) we define \(\varPi _{\mathscr {A}}(\mathbf{v}) = \arg \min _{\mathbf{z}= \varPi _{A_i}(\mathbf{v}), i \in [P]}\left\| \mathbf{v}- \mathbf{z} \right\| _2^2\). Projection onto a union of subspaces is expensive in general (requiring time linear in P, the number of subspaces in the union) but will be efficient in all cases we consider (see Table 1).

Hard thresholding: The hard-thresholding operator will be instrumental in allowing efficient projections in APIS. For any \(k < n\), let \(\mathscr {S}^n_k = \left\{ \mathbf{z}\in {\mathbb{R}}^n: \left\| \mathbf{z} \right\| _0 \le k\right\}\) be the set of all k-sparse vectors. For any \(\mathbf{z}\in {\mathbb{R}}^n, k < n\), let \(\text {HT} _k(\mathbf{z}) {:}{=} \varPi _{\mathscr {S}^n_k}(\mathbf{z})\) denote the projection of \(\mathbf{z}\) onto \(\mathscr {S}^n_k\). Note that this operation is possible in \(\mathscr {O}\left( n\log n\right)\) time by sorting all the coordinates by magnitude, retaining the top k coordinates (in magnitude) and setting the rest to 0.
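A minimal NumPy sketch of this operator, matching the sort-based description above, is given below; the function name is illustrative (an argpartition-based variant would avoid the full sort).

```python
import numpy as np

def hard_threshold(z, k):
    """HT_k(z): keep the k largest-magnitude coordinates of z, zero out the rest."""
    out = np.zeros_like(z)
    if k > 0:
        top_k = np.argsort(np.abs(z))[-k:]   # indices of the k largest |z_i|
        out[top_k] = z[top_k]
    return out

# Example: hard_threshold(np.array([3., -0.5, 1., -4., 0.2]), k=2) keeps 3. and -4.
```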

5 Applications and projection details

The signal and corruption model in Eq. (1) does indeed capture several statistical estimation and signal processing problems. In most of these settings, the union \(\underline{\mathscr {A}}\) is indeed data-dependent. The discussion below shows that in each case, the union of subspaces \(\mathscr {A}\) is well-defined once training data is available. On the other hand, the union of subspaces \(\mathscr {B}\) is well-defined once the corruption model has been identified. The projection operators \(\varPi _\mathscr {A}(\cdot ) \text{ and } \varPi _\mathscr {B}(\cdot )\) can be executed in polynomial time (see Table 1 for time complexity details).

5.1 Examples of signal models supported by APIS

Linear regression: Here we have \({\mathbf{a}}^*= X^\top {\mathbf{w}}^*\), where \(X = [\mathbf{x}^1,\ldots ,\mathbf{x}^n] \in {\mathbb{R}}^{d \times n}\) is the covariate matrix of the n data points and \({\mathbf{w}}^*\in {\mathbb{R}}^d\) is the linear model. It is easy to see that Eq. (1) recovers robust linear regression as a special case with \(P = 1\) and \(\mathscr {A}= \left\{ A\right\}\), where \(A = \left\{ X^\top \mathbf{w}: \mathbf{w}\in {\mathbb{R}}^d\right\}\) is the span of the rows of X. Using the SVD \(X = U\varSigma V^\top\), we can project onto \(\mathscr {A}\) simply by solving a least squares problem, i.e., we have \(\varPi _\mathscr {A}(\mathbf{z}) = VV^\top \mathbf{z}= X^\top (XX^\top )^\dagger X\mathbf{z}\).
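For concreteness, a sketch of this projection via a least-squares solve (rather than an explicit SVD) is shown below; combined with the hard-thresholding operator of Sect. 4 as \(\varPi _\mathscr {B}(\cdot )\), it yields a robust linear regression instantiation of Algorithm 1. The function name is illustrative.

```python
import numpy as np

def proj_row_space(X, z):
    """Project z onto {X^T w : w in R^d} (X is d x n) by solving a least-squares problem."""
    w_hat, *_ = np.linalg.lstsq(X.T, z, rcond=None)   # min_w ||X^T w - z||_2
    return X.T @ w_hat                                # projection of z onto the row space of X
```

The linear model estimate itself can be read off as the minimizer `w_hat` from the final projection step.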

Low-rank kernel regression: Consider a Mercer kernel \(K: {\mathbb{R}}^d \times {\mathbb{R}}^d \rightarrow {\mathbb{R}}\) such as the RBF kernel and let \(G \in {\mathbb{R}}^{n \times n}\) be the Gram matrix with \(G_{ij} = K(\mathbf{x}^i,\mathbf{x}^j)\). Low-rank kernel regression corresponds to the case when the uncorrupted signal satisfies \({\mathbf{a}}^*= G{{\boldsymbol{\alpha }}}^*\), where \({{\boldsymbol{\alpha }}}^*\in {\mathbb{R}}^n\) belongs to the span of some s eigenvectors of G. Specifically, consider the eigendecomposition \(G = V\varSigma V^\top\), where \(V = [\mathbf{v}^1,\ldots ,\mathbf{v}^r] \in {\mathbb{R}}^{n \times r}\) is the matrix of eigenvectors (r is the rank of the Gram matrix) and \(\varSigma = {{\,\mathrm{diag}\,}}(s_1,\ldots ,s_r) \in {\mathbb{R}}^{r \times r}\) is the diagonal matrix of strictly positive eigenvalues (assume \(s_1 \ge s_2 \ge \ldots \ge s_r > 0\)). APIS offers the strongest guarantees in the case when \({{\boldsymbol{\alpha }}}^*\in \text {span}(\mathbf{v}^1,\ldots ,\mathbf{v}^s)\), i.e., when \({{\boldsymbol{\alpha }}}^*\) lies in the span of the top s eigenvectors. Note that in this case, \({\mathbf{a}}^*\) too is spanned by the top s eigenvectors of G since \({\mathbf{a}}^*= G{{\boldsymbol{\alpha }}}^*\). We stress that the guarantees continue to hold (see Sect. C in the appendix) but deteriorate if \({{\boldsymbol{\alpha }}}^*, {\mathbf{a}}^*\) are spanned by s eigenvectors that include lower ones as well. This is because Gram matrices corresponding to popular kernels such as the RBF kernel are often very ill-conditioned. Then, we can see that Eq. (1) recovers the robust low-rank kernel regression problem as a special case with \(P = 1\) and \(\mathscr {A}= \left\{ A\right\}\), where \(A = \text {span}(\mathbf{v}^1,\ldots ,\mathbf{v}^s)\). Projection onto \(\mathscr {A}= \left\{ A\right\}\) is given by \(\varPi _\mathscr {A}(\mathbf{z}) = \varPi _A(\mathbf{z}) = V_sV_s^\top \mathbf{z}\), where \(V_s = [\mathbf{v}^1,\ldots ,\mathbf{v}^s] \in {\mathbb{R}}^{n \times s}\). Section 6.5 shows how this restriction of the signal to the span of the top-s eigenvectors does not affect the universality of popular kernels such as the RBF kernel.
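A sketch of this projection is shown below; for brevity it recomputes the eigendecomposition on each call, whereas in practice \(V_s\) would be computed once and reused across all iterations of Algorithm 1.

```python
import numpy as np

def proj_top_eigenspace(G, s, z):
    """Project z onto the span of the top-s eigenvectors of the PSD Gram matrix G."""
    _, eigvecs = np.linalg.eigh(G)   # eigenvalues (and eigenvectors) in ascending order
    V_s = eigvecs[:, -s:]            # eigenvectors of the s largest eigenvalues
    return V_s @ (V_s.T @ z)         # V_s V_s^T z
```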

Sparse signal transforms: Consider signal transforms such as Fourier, Hadamard, wavelet, etc. Sparse signal transforms correspond to the case when \({\mathbf{a}}^*= M^\top {{\boldsymbol{\alpha }}}^*\), where \(M \in {\mathbb{R}}^{n \times n}\) is the design matrix of the transform (for the sake of simplicity, assume M to be orthonormal as is often the case) and \({{\boldsymbol{\alpha }}}^*\in {\mathbb{R}}^n\) is an s-sparse vector, i.e., \(\left\| {{\boldsymbol{\alpha }}}^* \right\| _0 \le s\). It is easy to see that Eq. (1) recovers the robust sparse signal transform problem with \(P = \left( {\begin{array}{c}n\\ s\end{array}}\right)\) and \(\mathscr {A}= \bigcup _{F \subset [n], \left| {F} \right| = s}A_F\) with \(A_F = \text {span}(\left\{ \mathbf{m}^i\right\} _{i \in F})\), where \(\mathbf{m}^i\) is the \(i^\mathrm{th}\) column of the design matrix M. Given that M is orthonormal, projection onto a given subspace \(A_F\) is given by \(\varPi _{A_F}(\mathbf{z}) = M_FM_F^\top \mathbf{z}\). The orthonormality of \(M\) can be further exploited to carry out projection onto the union \(\mathscr {A}\) in \(\mathscr {O}\left( n\log n\right)\) time by using “fast” versions of these transforms followed by a hard-thresholding operation. Specifically, we have \(\varPi _\mathscr {A}(\mathbf{z}) = M\mathbf{v}\), where \(\mathbf{v}= \text {HT} _s(M^\top \mathbf{z})\).
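Below is a sketch of this projection using a dense orthonormal Hadamard design matrix for concreteness; a “fast” implementation would replace the two matrix-vector products with an \(\mathscr {O}\left( n\log n\right)\) fast transform. The use of scipy.linalg.hadamard and the variable names are illustrative choices.

```python
import numpy as np
from scipy.linalg import hadamard

def proj_sparse_in_basis(M, s, z):
    """Project z onto the set of vectors that are s-sparse in the orthonormal
    basis M, i.e. compute M @ HT_s(M^T z)."""
    coeffs = M.T @ z
    keep = np.argsort(np.abs(coeffs))[-s:]        # support of the s largest coefficients
    sparse_coeffs = np.zeros_like(coeffs)
    sparse_coeffs[keep] = coeffs[keep]
    return M @ sparse_coeffs

n = 8
M = hadamard(n) / np.sqrt(n)                      # orthonormal Hadamard design matrix
z = np.random.randn(n)
z_proj = proj_sparse_in_basis(M, s=2, z=z)
```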

Sparse recovery: In the sparse recovery signal model, the uncorrupted signal satisfies \({\mathbf{a}}^*= X^\top {\mathbf{w}}^*\) where \({\mathbf{w}}^*\in {\mathbb{R}}^d\) is an \({s}^*\)-sparse linear model, i.e., \(\left\| {\mathbf{w}}^* \right\| _0 \le {s}^*\) for some \({s}^*< d\). Equation (1) recovers the robust sparse recovery problem as a special case with \(P = \left( {\begin{array}{c}d\\ {s}^*\end{array}}\right)\) and \(\mathscr {A}= \bigcup _{F \subset [d], \left| {F} \right| = {s}^*}A_F\), where the subspace \(A_F = \left\{ (X^F)^\top \mathbf{w}: \mathbf{w}\in {\mathbb{R}}^d\right\} \subseteq {\mathbb{R}}^n\) is the span of the rows of \(X^F\). Projection onto a given subspace \(A_F\) can be easily seen to be \(\varPi _{A_F}(\mathbf{z}) = (X^F)^\top (X^F(X^F)^\top )^\dagger X^F\mathbf{z}\). Projection onto the union \(\mathscr {A}\) can then be shown to reduce to the classical sparse recovery problem, which can be solved efficiently if X satisfies properties such as RIP or RSC (see Agarwal et al. (2012)). Specifically, we have \(\varPi _\mathscr {A}(\mathbf{z}) = X^\top {\hat{\mathbf{w}}}\), where \({\hat{\mathbf{w}}}= \arg \min _{\left\| \mathbf{w} \right\| _0 \le {s}^*}\left\| X^\top \mathbf{w}- \mathbf{z} \right\| _2^2\). The projection step \(\varPi _\mathscr {A}(\cdot )\) can be carried out in \(\mathscr {O}\left( nd\right)\) time here as well by employing projected gradient and iterative hard-thresholding methods (see Agarwal et al. (2012)).
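A minimal iterative hard-thresholding (IHT) sketch of this projection step is shown below; convergence to the true projection requires RIP/RSC-style conditions on X as noted above, and the step size and iteration count here are illustrative defaults.

```python
import numpy as np

def proj_sparse_recovery(X, s, z, n_iters=200):
    """Approximately project z onto {X^T w : ||w||_0 <= s} (X is d x n)
    via iterative hard thresholding on min_w ||X^T w - z||_2^2."""
    d, _ = X.shape
    step = 1.0 / (np.linalg.norm(X, 2) ** 2)   # 1 / ||X||_op^2, a safe step size
    w = np.zeros(d)
    for _ in range(n_iters):
        w = w - step * (X @ (X.T @ w - z))     # gradient step on the least-squares loss
        keep = np.argsort(np.abs(w))[-s:]      # hard-threshold to the s largest entries
        w_sparse = np.zeros(d)
        w_sparse[keep] = w[keep]
        w = w_sparse
    return X.T @ w
```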

5.2 Examples of corruption models supported by APIS

Sparse fully adaptive adversarial corruptions: This is the most widely studied case in the literature and assumes a sparse corruption vector, i.e., \(\left\| {\mathbf{b}}^* \right\| _0 \le k\) for some \(k < n\). The model in Eq. (1) recovers this case with \(Q = \left( {\begin{array}{c}n\\ k\end{array}}\right)\) and \(\mathscr {B}= \bigcup _{T \subset [n], \left| {T} \right| = k}B_T\), with \(B_T\) the subspace of all vectors with support within the set T. Note that the convergence guarantees for APIS do not impose any restrictions on the magnitude of corruptions. Instead, the number of iterations required for recovery merely scales logarithmically with the \(L_2\) norm of the corruption vector, i.e., the runtime scales as \(\log (\left\| {\mathbf{b}}^* \right\| _2)\) (see Theorem 1). It can be easily seen that the hard-thresholding operator \(\text {HT} _k(\cdot )\) (Sect. 4) offers projection onto the union \(\mathscr {B}\).

Dense fully adaptive adversarial corruptions: Unlike several previous works, APIS also allows corruption vectors that are dense, i.e., \(\left\| {\mathbf{b}}^* \right\| _0 = \varOmega \left( n\right)\), so that most points suffer corruption. This is because APIS only requires the unions \(\mathscr {A}, \mathscr {B}\) to be incoherent and does not care whether \(\mathscr {B}\) contains dense vectors. We will see such examples in Sect. 6 with noiselet corruptions, and in Sect. 6.4, where we exploit local incoherence results to guarantee recovery when the signal is Fourier-sparse and the corruptions are wavelet-sparse. In Sect. 7, we establish experimentally that APIS offers recovery in such dense corruption settings as well.

As noted earlier, in both the corruption models, the adversary has full knowledge of \({\mathbf{a}}^*, {A}^*\) before choosing \({\mathbf{b}}^*, {B}^*\) in any manner, i.e., the adversary is fully adaptive.

5.3 Do the subspaces really need to be low-rank? What if this is too strict and \({\mathbf{a}}^*\notin \mathscr {A}\)?

For exact recovery guarantees (which APIS does offer), some low-rank restriction seems to be necessary, especially when working with universal models such as the Gaussian kernel, whose Gram matrix is often full-rank (and ill-conditioned), or the Fourier transform, whose design matrix is also full-rank (but well-conditioned). Given such full-rank designs, unless additional restrictions (e.g., low rank) are imposed, recovery remains an ill-posed problem. However, in Sect. 6.4, we will see that APIS offers non-trivial recovery even if the clean signal \(\underline{{\mathbf{a}}^*}\) does not belong to \(\underline{\mathscr {A}}\) but is well-approximated by vectors in \(\underline{\mathscr {A}}\). Specifically, these are cases when \({\mathbf{a}}^*\notin \mathscr {A}\) but rather \({\mathbf{a}}^*+ {\mathbf{e}}^*\in \mathscr {A}\) and \(\left\| {\mathbf{e}}^* \right\| _2\) is small. It is common in signal processing tasks to consider signals (images, etc.) that are well-approximated by a sparse wavelet/Fourier representation but are not exactly sparse themselves.

6 Recovery, breakdown points, misspecified models and universality

All detailed proofs and derivations are provided in the appendices.

6.1 Incoherence

A key requirement for robust recovery in the model presented in Eq. (1) is that the unions \(\mathscr {A} \text{ and } \mathscr {B}\) be incoherent with respect to each other. We present below a notion of subspace incoherence suitable to our technique. We note that this notion is similar to notions of subspace incoherence prevalent in the literature but is suited to our setting.

Definition 1

(Subspace incoherence) For any \(\mu > 0\), we say two subspaces \(A, B \subseteq {\mathbb{R}}^n\) (of possibly different ranks) are \(\mu\)-incoherent if for all \(\mathbf{u}\in A\), \(\left\| \varPi _B(\mathbf{u}) \right\| _2^2 \le \mu \cdot \left\| \mathbf{u} \right\| _2^2\) and for all \(\mathbf{v}\in B\), \(\left\| \varPi _A(\mathbf{v}) \right\| _2^2 \le \mu \cdot \left\| \mathbf{v} \right\| _2^2\).

Note that the above definition uses the same incoherence constant \(\mu\) for projections both ways. This is justified since \(\max _{\mathbf{u}\in A} \frac{\left\| \varPi _B(\mathbf{u}) \right\| _2^2}{\left\| \mathbf{u} \right\| _2^2} = \max _{\mathbf{v}\in B} \frac{\left\| \varPi _A(\mathbf{v}) \right\| _2^2}{\left\| \mathbf{v} \right\| _2^2}\). To see why, let U and V be orthonormal matrices whose columns span A and B respectively, and notice that \(\mu = \left\| V^\top U \right\| _{\text {op}}^2 = \left\| U^\top V \right\| _{\text {op}}^2\), where \(\left\| \cdot \right\| _{\text {op}}\) is the operator norm. Since orthonormal projections are non-expansive, we always have \(\mu \le 1\). However, our results demand stronger contractions, which we will establish for the application settings discussed in Sect. 5. Before that, we first extend the notion of incoherence to unions of subspaces.
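For illustration, the snippet below computes this constant from orthonormal bases of two subspaces and checks it on the perfectly incoherent pair of Fig. 2; the helper name is illustrative.

```python
import numpy as np

def incoherence(U, V):
    """mu = ||U^T V||_op^2 for subspaces given by orthonormal bases U and V."""
    return np.linalg.norm(U.T @ V, 2) ** 2

# Subspaces from Fig. 2: A = {x + y + z = 0} and B = span{(1, 1, 1)}
U = np.linalg.qr(np.array([[1., -1., 0.], [1., 0., -1.]]).T)[0]
V = np.array([[1., 1., 1.]]).T / np.sqrt(3.)
print(incoherence(U, V))   # ~0 up to floating-point error (perfect incoherence)
```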

Definition 2

(Subspace union (SU) incoherence) For any \(\mu > 0\), we say that a pair of unions of subspaces \(\mathscr {A}= \bigcup _{i=1}^P A_i\) and \(\mathscr {B}= \bigcup _{j=1}^Q B_j\) is \(\mu\)-SU incoherent if for all \(i \in [P] \text{ and all } j \in [Q]\), the subspaces \(A_i \text{ and } B_j\) are \(\mu\)-incoherent.

Theorem 1 states the main claim of this paper. Restrictions on s, k and breakdown points emerge when trying to satisfy the incoherence criterion demanded by Theorem 1.

Theorem 1

Suppose we obtain data as described in Eq. (1) where the two unions \(\mathscr {A} \text{ and } \mathscr {B}\) are \(\mu\)-incoherent with \(\mu < \frac{1}{9}\). Then, for any \(\epsilon > 0\), within \(T = \mathscr {O}\left( \log \frac{\left\| {\mathbf{a}}^* \right\| _2 + \left\| {\mathbf{b}}^* \right\| _2}{\epsilon }\right)\) iterations, APIS ensures \(\left\| \mathbf{a}^T - {\mathbf{a}}^* \right\| _2 \le \epsilon\). Moreover, in the known signal support case when \(P = 1\) (see below), the requirement is relaxed to \(\mu < \frac{1}{3}\).

Proof

(Sketch) We present the main steps in deriving the result for the known signal support case when \(P = 1\). We recall the notation from Algorithm 1 where \(A^{t+1} \ni \mathbf{a}^{t+1}, B^{t+1} \ni \mathbf{b}^{t+1}\), and let \(\mathbf{p}^t = \varPi _A({\mathbf{b}}^*- \mathbf{b}^t)\) and \(\mathbf{p}^{t+1} = \varPi _A({\mathbf{b}}^*- \mathbf{b}^{t+1})\). Thus, we have \(\mathbf{a}^{t+1} = \varPi _A(\mathbf{y}- \mathbf{b}^{t+1}) = {\mathbf{a}}^*+ \varPi _A({\mathbf{b}}^*- \mathbf{b}^{t+1})\) (since \({\mathbf{a}}^*\in A\) and orthonormal projections are idempotent) which gives us \(\left\| \mathbf{a}^{t+1} - {\mathbf{a}}^* \right\| _2 = \left\| \varPi _A({\mathbf{b}}^*- \mathbf{b}^{t+1}) \right\| _2 = \left\| \mathbf{p}^{t+1} \right\| _2\). Let \(\mathfrak {Q}{:}{=} B^{t+1} \cap {B}^*\) denote the meet of the two subspaces, as well as denote the symmetric difference subspaces \(\mathfrak {P}{:}{=} B^{t+1} \cap ({B^*})^\perp\) and \(\mathfrak {R}= {B}^*\cap (B^{t+1})^\perp\) (recall that \(A \ni {\mathbf{a}}^* \text{ and } {B}^*\ni {\mathbf{b}}^*\)).

Below we show that \(\left\| \mathbf{p}^{t+1} \right\| _2 \le 3\mu \cdot \left\| \mathbf{p}^t \right\| _2\), which establishes a linear rate of convergence if \(\mu < \frac{1}{3}\), since it gives \(\left\| \mathbf{a}^{t+1} - {\mathbf{a}}^* \right\| _2 = \left\| \mathbf{p}^{t+1} \right\| _2 \le 3\mu \cdot \left\| \mathbf{p}^t \right\| _2 = 3\mu \cdot \left\| \mathbf{a}^t - {\mathbf{a}}^* \right\| _2\). To show that \(\left\| \mathbf{p}^{t+1} \right\| _2 \le 3\mu \cdot \left\| \mathbf{p}^t \right\| _2\), we note that

$$\begin{aligned} \mathbf{b}^{t+1} = \varPi _{B^{t+1}}({\mathbf{a}}^*+ {\mathbf{b}}^*- \mathbf{a}^t) = \varPi _{B^{t+1}}({\mathbf{b}}^*- \varPi _A({\mathbf{b}}^*- \mathbf{b}^t)) = \varPi _{B^{t+1}}({\mathbf{b}}^*- \mathbf{p}^t), \end{aligned}$$

and thus \({\mathbf{b}}^*- \mathbf{b}^{t+1} = {\mathbf{b}}^*- \varPi _{B^{t+1}}({\mathbf{b}}^*- \mathbf{p}^t) = \varPi _\mathfrak {R}({\mathbf{b}}^*) + \varPi _{B^{t+1}}(\mathbf{p}^t)\). This gives us, by an application of the triangle inequality, \(\left\| \mathbf{p}^{t+1} \right\| _2 = \left\| \varPi _A({\mathbf{b}}^*- \mathbf{b}^{t+1}) \right\| _2 \le \left\| \varPi _A(\varPi _\mathfrak {R}({\mathbf{b}}^*)) \right\| _2 + \left\| \varPi _A(\varPi _{B^{t+1}}(\mathbf{p}^t)) \right\| _2\). Applying incoherence now tells us that, since \(\mathbf{p}^t \in \mathscr {A}\) by projection, we have

$$\begin{aligned} \left\| \varPi _A(\varPi _{B^{t+1}}(\mathbf{p}^t)) \right\| _2 \le \sqrt{\mu }\cdot \left\| \varPi _{B^{t+1}}(\mathbf{p}^t) \right\| _2 \le \mu \cdot \left\| \mathbf{p}^t \right\| _2 \end{aligned}$$

Using other arguments given in the full proof, it can be shown that \(\left\| \varPi _A(\varPi _\mathfrak {R}({\mathbf{b}}^*)) \right\| _2 \le 2\mu \cdot \left\| \mathbf{p}^t \right\| _2\) which gives us \(\left\| \mathbf{p}^{t+1} \right\| _2 \le 3\mu \cdot \left\| \mathbf{p}^t \right\| _2\) concluding the proof sketch.

Section A in the appendix gives the complete proof of this result. APIS offers a stronger guarantee, requiring only \(\mu < \frac{1}{3}\), in the known signal support case (Chen and De 2020). These are cases when the union \(\mathscr {A}\) consists of a single subspace, i.e., \(P = 1\). Note that this is indeed the case (see Sect. 5) for linear regression and low-rank kernel regression. We now derive breakdown points, as well as restrictions on s, k for various applications, that arise when we attempt to satisfy the incoherence requirements of Theorem 1. Table 1 summarizes the signal-corruption pairings for which APIS guarantees perfect recovery and their corresponding breakdown points, essentially bounding how large s, k can be. Detailed derivations of these results are presented in the appendix and summarized below.

Table 1 Some signal and corruption models handled by APIS, and their corresponding breakdown points and per-iteration time complexity

6.2 Cases with sparse corruptions

In this case \(\mathscr {B}= \mathscr {S}^n_k\), the set of all k-sparse vectors. Calculating the incoherence constants then reduces to application-specific derivations which we sketch below.

Linear regression: If the covariate matrix is X, then we get \(\mu \le \max _{\begin{array}{c} S \subset [n]\\ |S| = k \end{array}}\frac{\left\| X_S \right\| _{\text {op}}^2}{\lambda _{\min }(XX^\top )}\), where \(\left\| \cdot \right\| _{\text {op}}\) is the operator norm (see Appendix B for proofs). It turns out that the required incoherence is satisfied in several natural settings. For example, if the covariates are Gaussian, i.e., \(\mathbf{x}^i \sim \mathscr {N}(\mathbf{0}, I_d)\), then \(\mu < \frac{1}{3}\) (as required by Theorem 1) with high probability whenever \(k < \frac{n}{154}\). We stress that our results do not require data points to be sampled from a standard Gaussian per se. The requirement \(\mu < \frac{1}{3}\) is satisfied by other data distributions as well (see Appendix B).
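This bound can be evaluated exhaustively for small n and k, as in the illustrative sketch below (ours, not from the paper); note that the high-probability claim requires n to be much larger than d, so the numbers used here are purely for demonstration.

```python
import numpy as np
from itertools import combinations

def linreg_mu_bound(X, k):
    """Exhaustively evaluate max_{|S|=k} ||X_S||_op^2 / lambda_min(X X^T).
    Only feasible for small n and k."""
    lam_min = np.linalg.eigvalsh(X @ X.T).min()
    n = X.shape[1]
    worst = max(np.linalg.norm(X[:, list(S)], 2) ** 2
                for S in combinations(range(n), k))
    return worst / lam_min

rng = np.random.default_rng(0)
d, n, k = 3, 500, 1
X = rng.standard_normal((d, n))          # Gaussian covariates x^i ~ N(0, I_d)
print(linreg_mu_bound(X, k))             # should comfortably satisfy mu < 1/3
```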

Kernel regression: For a Gram matrix G (calculated on covariates \(\mathbf{x}^i, i \in [n]\) using a PSD kernel), we get \(\mu \le \frac{\Lambda ^{\text {unif}}_k(G)}{\lambda _s(G)}\), where \(\lambda _s\) is the \(s^\mathrm{th}\) largest eigenvalue of G and \(\Lambda ^{\text {unif}}_k\) is the largest eigenvalue of any principal \(k \times k\) submatrix of G. For the special case of the RBF kernel, further calculations show that \(\mu < \frac{1}{9}\) is satisfied, for instance, when covariates are sampled uniformly over the unit sphere and we have \(s \le {\tilde{\mathscr {O}}}\left( \log n\right) , k \le \mathscr {O}\left( \sqrt{n}\right)\). Yet again, these settings (RBF kernel, unit sphere, etc.) are not essential, but merely sufficient conditions under which APIS is guaranteed to succeed.

Signal transforms: For a variety of signal transforms including Fourier, Hadamard and noiselet, we are assured \(\mu < \frac{1}{9}\), as desired by Theorem 1, whenever \(sk < \frac{n}{9}\). This can be realized in several ways, e.g., \(s = \mathscr {O}\left( 1\right) , k = \mathscr {O}\left( n\right)\) or \(s, k = \mathscr {O}\left( \sqrt{n}\right)\), etc. See Table 1 for a summary.

6.3 Cases with dense corruptions

Notably, APIS offers exact recovery even in certain cases where the corruption vector is completely dense, i.e., \(\left\| {\mathbf{b}}^* \right\| _0 = n\). Note that the adversary is still allowed to be completely adaptive. One such case is when the signal is s-sparse in the Fourier or wavelet (Haar or Daubechies D4/D8) bases, and the corruptions are k-sparse in the noiselet basis (Coifman et al. 2001). Since wavelets are known to represent natural signals well, this is a practically useful setting. Note that a vector \({\mathbf{b}}^*\) with a k-sparse noiselet representation, even for \(k = 1\), can be completely dense, i.e., \(\left\| {\mathbf{b}}^* \right\| _0 = n\). APIS also supports dense Gaussian noise in the responses, as discussed below.

6.4 Handling model misspecifications

In certain practical situations, the model outlined in Eq. (1) may not be satisfied. For instance, we could have \({\mathbf{a}}^*\notin \mathscr {A}\) if we have an image that is not entirely (but only approximately) sparse in the wavelet basis. Similarly, the unions \(\mathscr {A} \text{ and } \mathscr {B}\) could fail to be incoherent (as is the case for the Fourier-wavelet pair). In this section, we show how APIS can still offer non-trivial recovery in these settings.

Unmodelled error In this case we modify Eq. (1) to include an unmodelled error term.

$$\begin{aligned} \mathbf{y}= {\tilde{\mathbf{a}}}+ {\mathbf{b}}^*+ {\mathbf{e}}^*, \end{aligned}$$
(2)

where \({\tilde{\mathbf{a}}}\in \mathscr {A}, {\mathbf{b}}^*\in \mathscr {B}\) and \({\mathbf{a}}^*= {\tilde{\mathbf{a}}}+ {\mathbf{e}}^*\notin \mathscr {A}\). We make no assumptions on \({\mathbf{e}}^*\) belonging to any union of subspaces, etc., and allow it to be completely arbitrary. Section E in the appendix shows that if \(\mu < \frac{1}{9}\) is satisfied, then for any \(\epsilon > 0\), within \(T \le \mathscr {O}\left( \log ((\left\| {\mathbf{a}}^* \right\| _2 + \left\| {\mathbf{b}}^* \right\| _2)/\epsilon )\right)\) iterations, APIS guarantees a recovery error of

$$\begin{aligned} \left\| \mathbf{a}^T - {\tilde{\mathbf{a}}} \right\| _2 \le \epsilon + \mathscr {O}\left( \max _{A \in \mathscr {A}}\left\| \varPi _A({\mathbf{e}}^*) \right\| _2 + \max _{B \in \mathscr {B}}\left\| \varPi _B({\mathbf{e}}^*) \right\| _2\right) . \end{aligned}$$

We now look at two applications of this result.

Simultaneous sparse corruptions and dense Gaussian noise. Consider linear regression where, apart from k adversarially corrupted points, all n points get Gaussian noise, i.e., \(\mathbf{y}= X^\top {\mathbf{w}}^*+ {\mathbf{b}}^*+ {\mathbf{e}}^*\), where \({\mathbf{e}}^*\sim \mathscr {N}(\mathbf{0}, \sigma ^2\cdot I_n)\). The above result shows that within \(T = \mathscr {O}\left( \log n\right)\) iterations, APIS guarantees \(\left\| \mathbf{w}^T - {\mathbf{w}}^* \right\| _2^2 \le \mathscr {O}\left( \sigma ^2\left( \frac{(d+k)\ln n}{n}\right) \right)\). As \(n \rightarrow \infty\), the model error behaves as \(\left\| \mathbf{w}^T - {\mathbf{w}}^* \right\| _2^2 \le \mathscr {O}\left( k\log n/n\right)\). This guarantees consistent recovery if \(k\log n/n \rightarrow 0\) as \(n \rightarrow \infty\). This is a sharper result than previous works (Bhatia et al. 2015; Mukhoty et al. 2019) that do not offer consistent estimation even if \(k\log n/n \rightarrow 0\).

Compressible signals. Consider an image \({\mathbf{a}}^*\) that is not itself wavelet-sparse, but is \((s,\epsilon )\)-approximately wavelet-sparse, i.e., there exists an image \({\tilde{\mathbf{a}}}\) that is s-sparse in the wavelet basis with \(\left\| {\mathbf{a}}^*- {\tilde{\mathbf{a}}} \right\| _2 \le \epsilon \cdot \left\| {\mathbf{a}}^* \right\| _2\). In particular, \({\tilde{\mathbf{a}}}\) can be taken to be the best s-sparse wavelet approximation of \({\mathbf{a}}^*\). The above shows that even if \({\mathbf{a}}^*\) is subjected to adversarial corruptions, APIS offers a recovery of \({\tilde{\mathbf{a}}}\) to within \(\mathscr {O}\left( \epsilon \cdot \left\| {\mathbf{a}}^* \right\| _2\right)\) error within \(\mathscr {O}\left( \log (1/\epsilon )\right)\) iterations.

Handling lack of incoherence: Pairs of bases that are not incoherent are well-known (Krahmer and Ward 2014; Zhou et al. 2016), the most famous example being the Fourier-wavelet pair, which can only assure \(\mu \approx 1\) no matter how small s, k are. Thus, Theorem 1, if applied directly, would fail to offer a non-trivial recovery result if the signal is wavelet-sparse and corruptions are Fourier-sparse. However, in Sect. F in the appendix, we show that using local incoherence properties of these two bases [which are also well-studied, e.g., Krahmer and Ward (2014)], APIS can be shown to continue to offer exact recovery if the signal is not just sparse in the wavelet domain, but also anti-concentrated, i.e., it spreads its mass over its wavelet support elements (please see Sect. F in the appendix for details). For this setting, we show that the incoherence constant satisfies \(\mu \le \frac{k^4}{s} + \frac{sk^2}{n}\). Now \(\mu < \frac{1}{9}\) can be ensured if, for example, \(sk^2 \le n/18\) (i.e., s, k are small compared to n, which controls the second term) and \(k^4 < s/18\) (i.e., \(s \gg k\), which controls the first term). We note that some form of signal restriction, for example signal anti-concentration, seems to be necessary since a spike signal having support over a single wavelet-basis element can be irrevocably corrupted by an adaptive Fourier-sparse signal, given that the bases are not incoherent. Also, notice that this is yet another instance of APIS guaranteeing recovery when the corruptions are dense, since a Fourier-sparse vector \({\mathbf{b}}^*\) can still have \(\left\| {\mathbf{b}}^* \right\| _0 = \varOmega \left( n\right)\).

6.5 Does APIS retain universality?

Kernel (ridge) regression with the RBF kernel is known to be a universal estimator (Micchelli et al. 2006). However, APIS requires the signal \({\mathbf{a}}^*\) to have a low-rank representation in terms of the top-s eigenvectors of the Gram matrix. As Table 1 indicates, for the RBF kernel, Theorem 1 allows \(s\) to be as large as \(\mathscr {O}\left( \log n/\log \log n\right)\) for n points. Does this model retain universality despite this restriction? What sort of functions can \({\mathbf{a}}^*\) still approximate? We sketch an argument below that indicates an answer in the affirmative, along with a qualitative outline of the functions that can still be described by this low-rank model.

Several results on random matrix approximation guarantee that if data covariates are chosen from nice distributions then, as the number of covariates \(n \rightarrow \infty\), not only do the eigenvalues of the Gram matrix closely approximate those of the integral operator induced by the PSD kernel (Minh et al. 2006; Rosasco et al. 2010), but the empirical operator also approaches the integral operator in the Hilbert-Schmidt norm. This assures us that the eigenvectors of the Gram matrix closely approximate the eigenfunctions of the integral operator. For instance, Rosasco et al. (2010) offer an explicit two-way relation between the eigenvectors and the eigenfunctions. Now, in the uni-dimensional case (\(d = 1\)), for any \(i \le n\), the \(i^\mathrm{th}\) largest eigenfunction of the integral operator for the RBF kernel is represented in terms of the \(i^\mathrm{th}\)-order Hermite polynomial (Rasmussen and Williams 2006). The \(i^\mathrm{th}\) Hermite polynomial is of degree i, and Hermite polynomials form a universal basis (as they constitute an orthogonal polynomial sequence). In particular, the first s Hermite polynomials span all polynomials of degree less than s. Thus, even with the restriction on \(s\), APIS does allow signals \({\mathbf{a}}^*\) that are (up to vanishing approximation errors) spanned by Hermite polynomials of order up to \(s\). Now, APIS allows \(s \le \mathscr {O}\left( \log n/\log \log n\right)\) and as \(n \rightarrow \infty\), \(s \rightarrow \infty\) as well (albeit slowly). Thus, \({\mathbf{a}}^*\) can represent functions that are (up to approximation errors) arbitrarily high-degree polynomials.

A similar argument holds for multi-dimensional spaces and product kernels, e.g., the RBF kernel, since, for such kernels, the eigenfunctions and eigenvalues in the multi-dimensional case are products of their uni-dimensional counterparts (Fasshauer 2011). Although it would be interesting to make the above arguments rigorous, they nevertheless indicate that APIS offers robust recovery for a model that is still universal in the limit. In Sect. 7, we will see that APIS offers excellent reconstruction for sinusoids, polynomials, as well as their combinations over multi-dimensional spaces, even under adversarial corruptions.

7 Experiments

Extensive experiments were carried out comparing APIS to state-of-the-art competitor algorithms on three key robust regression tasks:

  1. Robust non-parametric regression to learn (multi-dimensional) sinusoidal and polynomial functions (see Figs. 3, 8, and Table 2) and the robust Fourier transform (see Fig. 8);

  2. Robust linear regression (see Fig. 4);

  3. Image denoising on the benchmark Set12 images (see Fig. 5) with sparse adversarial salt-and-pepper corruptions, as well as dense checkerboard-pattern corruptions (see Figs. 6, 7, and Table 3).

7.1 System configuration

Experiments for which runtimes of the various algorithms were recorded were carried out on a 64-bit machine with an Intel® Core™ i7-6500U CPU @ 2.50 GHz, 4 cores, 16 GB RAM and Ubuntu 16.04 OS, except for the deep CNN model from Zhang et al. (2017), which was trained on NVidia K80 GPUs (made available by the Kaggle platform, for which the authors are grateful). All methods were implemented in Python, except those from Gu et al. (2014) and Cizek and Sadikoglu (2020), for which code was made available by the authors themselves, in R and MATLAB, respectively. All figures, e.g., Figs. 3, 4, 6 and 7, show actual predictions by the various algorithms, and all results are reported over a single run of the algorithms.

7.2 Baselines and competitor algorithms

We describe below the state-of-the-art competitor algorithms compared against APIS in the various experiments.

Robust non-parametric kernel regression: Based on the recommendations of the recent survey by Cizek and Sadikoglu (2020), the widely studied Huber and median estimators were chosen as baselines. The Nadaraya-Watson (kernel regression) and kernel ridge regression estimators were also chosen as baselines, as was the recently proposed LBM method (Du et al. 2018). In these experiments, Gaussian noise \(\mathscr {N}(0,0.1^2)\) was added to all points (even clean ones). In the figures, corrupted points are depicted using a red cross and clean points using an empty gray circle. Hyperparameter tuning was done for all methods as described in the main text. 1000 test points were sampled from \(\mathscr {N}(0,1.5^2)\) for estimating the RMSE of the various algorithms. No corruption or Gaussian noise was added to test responses. In all cases, APIS offers the best test RMSE, which is 2 to \(5\times\) smaller than that of the next best method.