Abstract
This paper considers the problem of supervised learning with linear methods when both features and labels can be corrupted, either in the form of heavy-tailed data and/or corrupted rows. We introduce a combination of coordinate gradient descent as a learning algorithm together with robust estimators of the partial derivatives. This leads to robust statistical learning methods that have a numerical complexity nearly identical to non-robust ones based on empirical risk minimization. The main idea is simple: while robust learning with gradient descent requires the computational cost of robustly estimating the whole gradient to update all parameters, a parameter can be updated immediately using a robust estimator of a single partial derivative in coordinate gradient descent. We prove upper bounds on the generalization error of the algorithms derived from this idea, which control both the optimization and statistical errors, with and without a strong convexity assumption on the risk. Finally, we propose an efficient implementation of this approach in a new Python library called linlearn, and demonstrate through extensive numerical experiments that our approach introduces an interesting new compromise between robustness, statistical performance and numerical efficiency for this problem.
Availability of data and materials
All data sets used in the experimental sections of this work are publicly available, except for the synthetic ones, which were randomly generated. Detailed information and sources are given in Appendix A.3.
Code Availability
The experiments were carried out using linlearn, a library developed as part of this project and open-sourced under the BSD-3 License on GitHub at https://github.com/linlearn/linlearn.
Notes
By implicit we mean defined as the \({{\,\mathrm{\hbox {argmin}}\,}}\) of some functional, as opposed to the explicit iterations of an optimization algorithm: an implicit estimator differs from the exact algorithm applied on the data, while an explicit algorithm does not.
Or more generally the centered moment of order \(1+\alpha \) for \(\alpha \in (0,1]\), see below.
We call “\(\eta \)-corruption” the context where the outlier set \({\mathcal {O}}\) in Assumption 2 satisfies \(\vert {\mathcal {O}}\vert = \eta n\) with \(\eta \in [0, 1/2)\).
Indeed, considering strong convexity, optimization converges linearly and the final bound is of the form \(a\exp (-bT) + c T \log (T/\delta )/n\) for some \(a,b,c >0\) and one can see that \(T \sim \log n\) is approximately optimal.
References
Acharya, A., Hashemi, A., Jain, P., Sanghavi, S., Dhillon, I.S., Topcu, U.: Robust training in high dimensions via block coordinate geometric median descent. In: International Conference on Artificial Intelligence and Statistics, pp. 11145–11168 (2022)
Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58(1), 137–147 (1999)
Armijo, L.: Minimization of functions having Lipschitz continuous first partial derivatives. Pac. J. Math. 16(1), 1–3 (1966)
Audibert, J.-Y., Munos, R., Szepesvári, C.: Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoret. Comput. Sci. 410(19), 1876–1902 (2009). (Algorithmic Learning Theory)
Ballester-Ripoll, R., Paredes, E.G., Pajarola, R.: Sobol tensor trains for global sensitivity analysis. Reliab. Eng. Syst. Saf. 183, 311–322 (2019)
Bartlett, P.L., Bousquet, O., Mendelson, S.: Local Rademacher complexities. Ann. Stat. 33(4), 1497–1537 (2005)
Beck, A., Tetruashvili, L.: On the convergence of block coordinate descent type methods. SIAM J. Optim. 23(4), 2037–2060 (2013)
Bhatia, K., Jain, P., Kamalaruban, P., Kar, P.: Consistent robust regression. Adv. Neural. Inf. Process. Syst. 30, 2110–2119 (2017)
Blondel, M., Seki, K., Uehara, K.: Block coordinate descent algorithms for large-scale sparse multiclass classification. Mach. Learn. 93(1), 31–52 (2013)
Boucheron, S., Lugosi, G., Massart, P., Ledoux, M.: Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford (2013)
Brownlees, C., Joly, E., Lugosi, G., et al.: Empirical risk minimization for heavy-tailed losses. Ann. Stat. 43(6), 2507–2536 (2015)
Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends® Mach. Learn. 8(3–4), 231–357 (2015)
Bubeck, S., Cesa-Bianchi, N., Lugosi, G.: Bandits with heavy tail. IEEE Trans. Inf. Theory 59(11), 7711–7717 (2013)
Candanedo, L.M., Feldheim, V.: Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models. Energy Build. 112, 28–39 (2016)
Candanedo, L.M., Feldheim, V., Deramaix, D.: Data driven prediction models of energy use of appliances in a low-energy house. Energy Build. 140, 81–97 (2017)
Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? J. ACM 58(3), 1–37 (2011)
Catoni, O.: Challenging the empirical mean and empirical variance: a deviation study. In: Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, vol. 48, pp. 1148–1185. Institut Henri Poincaré (2012)
Charikar, M., Steinhardt, J., Valiant, G.: Learning from untrusted data. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 47–60 (2017)
Chen, M., Gao, C., Ren, Z.: Robust covariance and scatter matrix estimation under Huber’s contamination model. Ann. Stat. 46(5), 1932–1960 (2018)
Chen, P., Jin, X., Li, X., Xu, L.: A generalized Catoni’s M-estimator under finite \(\alpha \)-th moment assumption with \(\alpha \in (1, 2)\). Electron. J. Stat. 15(2), 5523–5544 (2021)
Chen, Y., Su, L., Xu, J.: Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proc. ACM Meas. Anal. Comput. Syst. 1(2), 1–25 (2017)
Cherapanamjeri, Y., Aras, E., Tripuraneni, N., Jordan, M.I., Flammarion, N., Bartlett, P.L.: Optimal robust linear regression in nearly linear time. arXiv preprint arXiv:2007.08137 (2020)
Cherapanamjeri, Y., Flammarion, N., Bartlett, P.L.: Fast mean estimation with sub-gaussian rates. In: Conference on Learning Theory, pp. 786–806 (2019)
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT Press, Cambridge (2009)
Depersin, J., Lecué, G.: Robust sub-gaussian estimation of a mean vector in nearly linear time. Ann. Stat. 50(1), 511–536 (2022)
Devroye, L., Györfi, L.: Nonparametric Density Estimation: The L1 View. Wiley Interscience Series in Discrete Mathematics. Wiley (1985)
Devroye, L., Lerasle, M., Lugosi, G., Oliveira, R.I.: Sub-gaussian mean estimators. Ann. Stat. 44(6), 2695–2725 (2016)
Diakonikolas, I., Kamath, G., Kane, D., Li, J., Moitra, A., Stewart, A.: Robust estimators in high-dimensions without the computational intractability. SIAM J. Comput. 48(2), 742–864 (2019)
Diakonikolas, I., Kamath, G., Kane, D., Li, J., Steinhardt, J., Stewart, A.: Sever: a robust meta-algorithm for stochastic optimization. In: International Conference on Machine Learning, pp. 1596–1606 (2019b)
Diakonikolas, I., Kane, D.M., Pensia, A.: Outlier robust mean estimation with subgaussian rates via stability. Adv. Neural. Inf. Process. Syst. 33, 1830–1840 (2020)
Diakonikolas, I., Kong, W., Stewart, A.: Efficient algorithms and lower bounds for robust linear regression. In: Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 2745–2754. SIAM (2019)
Dixon, W.J.: Analysis of extreme values. Ann. Math. Stat. 21(4), 488–506 (1950)
Donoho, D.L., Liu, R.C.: The “automatic’’ robustness of minimum distance functionals. Ann. Stat. 16(2), 552–586 (1988)
Dua, D., Graff, C.: UCI machine learning repository (2017)
Edgeworth, F.Y.: On observations relating to several quantities. Hermathena 6(13), 279–285 (1887)
Fanaee-T, H., Gama, J.: Event labeling combining ensemble detectors and background knowledge. Progr Artif Intell 2(2), 113–127 (2014)
Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981)
Friedman, J., Hastie, T., Tibshirani, R.: A note on the group lasso and a sparse group lasso. arXiv preprint arXiv:1001.0736 (2010)
Gao, C., et al.: Robust regression via multivariate regression depth. Bernoulli 26(2), 1139–1170 (2020)
van de Geer, S.A.: Empirical Processes in M-estimation, vol. 6. Cambridge University Press, Cambridge (2000)
Genkin, A., Lewis, D.D., Madigan, D.: Large-scale Bayesian logistic regression for text categorization. Technometrics 49(3), 291–304 (2007)
Chinot, G., Lecué, G., Lerasle, M.: Robust high dimensional learning for Lipschitz and convex losses. J. Mach. Learn. Res. 21 (2020)
Grubbs, F.E.: Procedures for detecting outlying observations in samples. Technometrics 11(1), 1–21 (1969)
Gupta, A., Kohli, S.: An MCDM approach towards handling outliers in web data: a case study using OWA operators. Artif. Intell. Rev. 46(1), 59–82 (2016)
Hampel, F.R.: A general qualitative definition of robustness. Ann. Math. Stat. 42(6), 1887–1896 (1971)
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P., Stahel, W.A.: Robust Statistics: The Approach Based on Influence Functions. Wiley-Interscience, New York (1986)
Hawkins, D.M.: Identification of Outliers, vol. 11. Springer, Berlin (1980)
Hoare, C.A.R.: Algorithm 65: find. Commun. ACM 4(7), 321–322 (1961)
Holland, M.: Robustness and scalability under heavy tails, without strong convexity. In International Conference on Artificial Intelligence and Statistics, pp. 865–873 (2021)
Holland, M., Ikeda, K.: Better generalization with less data using robust gradient descent. In International Conference on Machine Learning, pp. 2761–2770 (2019)
Holland, M.J.: Robust descent using smoothed multiplicative noise. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp. 703–711 (2019)
Holland, M.J., Ikeda, K.: Efficient learning with robust gradient descent. Mach. Learn. 108(8), 1523–1560 (2019)
Hopkins, S.B.: Mean estimation with sub-Gaussian rates in polynomial time. Ann. Stat. 48(2), 1193–1213 (2020)
Hsu, D., Sabato, S.: Loss minimization and parameter estimation with heavy tails. J. Mach. Learn. Res. 17(1), 543–582 (2016)
Huber, P.J.: Robust estimation of a location parameter. Ann. Math. Stat. 35(1), 73–101 (1964)
Huber, P.J.: The 1972 Wald lecture. Robust statistics: a review. Ann. Math. Stat. 43(4), 1041–1067 (1972)
Huber, P.J.: Robust Statistics. Wiley, New York (1981)
Jerrum, M.R., Valiant, L.G., Vazirani, V.V.: Random generation of combinatorial structures from a uniform distribution. Theoret. Comput. Sci. 43, 169–188 (1986)
Juditsky, A., Kulunchakov, A., Tsyntseus, H.: Sparse recovery by reduced variance stochastic approximation. Inf. Inference: J. IMA 12(2), 851–896 (2022)
Klivans, A., Kothari, P.K., Meka, R.: Efficient algorithms for outlier-robust regression. In: Conference On Learning Theory, pp. 1420–1430 (2018)
Klivans, A.R., Long, P.M., Servedio, R.A.: Learning halfspaces with malicious noise. J. Mach. Learn. Res. 10(12) (2009)
Knuth, D.E.: Seminumerical Algorithms (The Art of Computer Programming, vol. 2), pp. 124–125. Addison-Wesley, Reading, MA (1969)
Koklu, M., Ozkan, I.A.: Multiclass classification of dry beans using computer vision and machine learning techniques. Comput. Electron. Agric. 174, 105507 (2020)
Koltchinskii, V.: Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Stat. 34(6), 2593–2656 (2006)
Kuhn, M., Johnson, K.: Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press (2019)
Lai, K.A., Rao, A.B., Vempala, S.: Agnostic estimation of mean and covariance. In: 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pp. 665–674. IEEE (2016)
Lecué, G., Lerasle, M., et al.: Robust machine learning by median-of-means: theory and practice. Ann. Stat. 48(2), 906–931 (2020)
Lecué, G., Lerasle, M., Mathieu, T.: Robust classification via MOM minimization. Mach. Learn. 109(8), 1635–1665 (2020)
Lecué, G., Mendelson, S.: Learning subgaussian classes: upper and minimax bounds. In: Topics in Learning Theory. Société Mathématique de France (S. Boucheron and N. Vayatis, Eds.) (2013)
Ledoux, M., Talagrand, M.: Probability in Banach Spaces: Isoperimetry and Processes. Springer, Berlin (1991)
Lei, Z., Luh, K., Venkat, P., Zhang, F.: A fast spectral algorithm for mean estimation with sub-Gaussian rates. In: Conference on Learning Theory, pp. 2598–2612 (2020)
Li, J.: Robust sparse estimation tasks in high dimensions. arXiv preprint arXiv:1702.05860 (2017)
Li, X., Zhao, T., Arora, R., Liu, H., Hong, M.: On faster convergence of cyclic block coordinate descent-type methods for strongly convex minimization. J Mach. Learn. Res. 18(1), 6741–6764 (2017)
Liu, L., Li, T., Caramanis, C.: High dimensional robust estimation of sparse models via trimmed hard thresholding. arXiv preprint arXiv:1901.08237 (2019)
Liu, L., Shen, Y., Li, T., Caramanis, C.: High dimensional robust sparse regression. In: International Conference on Artificial Intelligence and Statistics, pp. 411–421 (2020)
Liu, T., Tao, D.: Classification with noisy labels by importance reweighting. IEEE Trans. Pattern Anal. Mach. Intell. 38(3), 447–461 (2015)
Lugosi, G., Mendelson, S.: Mean estimation and regression under heavy-tailed distributions: a survey. Found. Comput. Math. 19(5), 1145–1190 (2019)
Lugosi, G., Mendelson, S.: Sub-gaussian estimators of the mean of a random vector. Ann. Stat. 47(2), 783–794 (2019)
Lugosi, G., Mendelson, S.: Robust multivariate mean estimation: the optimality of trimmed mean. Ann. Stat. 49(1), 393–410 (2021)
Massart, P., Nédélec, É.: Risk bounds for statistical learning. Ann. Stat. 34(5), 2326–2366 (2006)
Maurer, A., Pontil, M.: Empirical Bernstein bounds and sample-variance penalization. In: COLT (2009)
Minsker, S., et al.: Geometric median and robust estimation in Banach spaces. Bernoulli 21(4), 2308–2335 (2015)
Minsker, S., et al.: Sub-Gaussian estimators of the mean of a random matrix with heavy-tailed entries. Ann. Stat. 46(6A), 2871–2903 (2018)
Mizera, I., et al.: On depth and deep points: a calculus. Ann. Stat. 30(6), 1681–1736 (2002)
Mnih, V., Szepesvári, C., Audibert, J.-Y.: Empirical Bernstein stopping. In: Proceedings of the 25th International Conference on Machine Learning, pp. 672–679 (2008)
Nemirovskij, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience (1983)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Springer, New York (2004)
Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)
Owen, A.: A robust hybrid of lasso and ridge regression. Contemp. Math. 443(7), 59–72 (2007)
Paul, D., Chakraborty, S., Das, S.: Robust principal component analysis: a median of means approach. arXiv preprint arXiv:2102.03403 (2021)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, É.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Pensia, A., Jog, V., Loh, P.-L.: Robust regression with covariate filtering: heavy tails and adversarial contamination. arXiv preprint arXiv:2009.12976 (2020)
Prasad, A., Balakrishnan, S., Ravikumar, P.: A robust univariate mean estimator is all you need. In: Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, vol. 108, pp. 4034–4044 (2020)
Prasad, A., Suggala, A.S., Balakrishnan, S., Ravikumar, P.: Robust estimation via robust gradient estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82(3), 601–627 (2020)
Shalev-Shwartz, S., Tewari, A.: Stochastic methods for \(\ell _1\)-regularized loss minimization. J. Mach. Learn. Res. 12, 1865–1892 (2011)
Shevade, S.K., Sathiya Keerthi, S.: A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 19(17), 2246–2253 (2003)
Srebro, N., Sridharan, K., Tewari, A.: Optimistic rates for learning with a smooth loss. arXiv preprint arXiv:1009.3896 (2010)
Tu, J., Liu, W., Mao, X., Chen, X.: Variance reduced median-of-means estimator for Byzantine-robust distributed inference. J. Mach. Learn. Res. 22(84), 1–67 (2021)
Tukey, J.W.: A survey of sampling from contaminated distributions. In: Contributions to Probability and Statistics, pp. 448–485 (1960)
van der Vaart, A.W.: Asymptotic Statistics (Cambridge Series in Statistical and Probabilistic Mathematics). Cambridge University Press (1998)
van Erven, T., Sachs, S., Koolen, W.M., Kotlowski, W.: Robust online convex optimization in the presence of outliers. In: Proceedings of Thirty Fourth Conference on Learning Theory, vol. 134, pp. 4174–4194. PMLR (2021)
Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York, NY (1999)
Vardi, Y., Zhang, C.-H.: The multivariate \(L_1\)-median and associated data depth. Proc. Natl. Acad. Sci. 97(4), 1423–1426 (2000)
Vergara, A., Vembu, S., Ayhan, T., Ryan, M.A., Homer, M.L., Huerta, R.: Chemical gas sensor drift compensation using classifier ensembles. Sens. Actuators, B Chem. 166, 320–329 (2012)
Wright, S.J.: Coordinate descent algorithms. Math. Program. 151(1), 3–34 (2015)
Wu, T.T., Lange, K.: Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Stat. 2(1), 224–244 (2008)
Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(1), 49–67 (2006)
Zhang, L., Zhou, Z.-H.: \(\ell _1\)-regression with heavy-tailed distributions. In: Advances in Neural Information Processing Systems, pp. 1084–1094 (2018)
Zhang, T.: Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the Twenty-First International Conference on Machine Learning, p. 116. Association for Computing Machinery, New York (2004)
Zheng, A., Casari, A.: Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O’Reilly Media (2018)
Acknowledgements
This research is supported by the Agence Nationale de la Recherche as part of the “Investissements d’avenir” program (reference ANR-19-P3IA-0001; PRAIRIE 3IA Institute).
Funding
This research is supported by the French Agence Nationale de la Recherche as well as by the PRAIRIE Institute.
Author information
Authors and Affiliations
Contributions
Conceptualization: Stéphane Gaïffas; Methodology: Ibrahim Merad, Stéphane Gaïffas; Formal analysis and investigation: Ibrahim Merad; Writing - original draft preparation: Ibrahim Merad; Writing - review and editing: Stéphane Gaïffas; Funding acquisition, Resources and Supervision: Stéphane Gaïffas.
Corresponding author
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Ethics approval
The authors approve the ethical rules set by the journal and are committed to abiding by them. Declarations relative to conflicts of interest, research involving Human Participants and/or Animals and informed consent are not applicable for this work.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Supplementary theoretical results and details on experiments
1.1 A.1 The Lipschitz constants \(L_j\) are unknown
The step-sizes \((\beta _j)_{j \in \llbracket d \rrbracket }\) used in Theorems 1 and 2 are given by \(\beta _j = 1 / L_j\), where the Lipschitz constants \(L_j\) are defined by (8). This makes them non-observable, since they depend on the unknown distribution \(P_{X_i}\) of the non-corrupted features for \(i \in {\mathcal {I}}\). We cannot use line-search (Armijo 1966) here, since it requires evaluating the objective \(R(\theta )\), which is unknown as well. In order to provide theoretical guarantees similar to those of Theorem 1 without knowing \((L_j)_{j=1}^d\), we use the following approach. First, we use the upper bound
which holds under Assumption 1, and estimate \({\mathbb {E}}[(X^j)^2]\) in order to build a robust estimator of \(U_j\). To obtain an observable upper bound and to control its deviations with large probability, we introduce the following condition.
Definition 2
We say that a real random variable Z satisfies the \(L^{\zeta }\)-\(L^{\xi }\) condition with constant \(C \ge 1\) whenever it satisfies the moment equivalence \({\mathbb {E}}\big [\vert Z\vert ^{\zeta }\big ]^{1/\zeta } \le C\, {\mathbb {E}}\big [\vert Z\vert ^{\xi }\big ]^{1/\xi }\).
Under this condition, the \(\texttt {MOM} \) estimator yields a high probability upper bound on \({\mathbb {E}}[(X^j)^2]\), as stated in the following lemma.
Lemma 5
Grant Assumption 2 with \(\alpha \in (0, 1]\) and suppose that for all \(j\in \llbracket d \rrbracket ,\) the variable \((X^j)^2\) satisfies the \(L^{(1+\alpha )}\)-\(L^1\) condition with a known constant C. For any fixed \(j \in \llbracket d \rrbracket ,\) let \(\widehat{\sigma }^2_j\) be the \(\texttt {MOM} \) estimator of \({\mathbb {E}}[(X^j)^2]\) with K blocks. If \(\vert {\mathcal {O}}\vert \le K / 12,\) we have
If we fix a confidence level \(\delta \in (0, 1)\) and choose \(K:= \lceil 18 \log (1 / \delta ) \rceil ,\) we have
with a probability larger than \(1 - \delta \).
The proof of Lemma 5 is given in Appendix B. Denoting by \(\widehat{U}_j\) the upper bound it provides on \({\mathbb {E}}[(X^j)^2]\), we can readily bound the Lipschitz constants as \(L_j \le \gamma \widehat{U}_j\), which leads to the following statement.
Corollary 2
Grant the same assumptions as in Theorem 1 and Proposition 3. Suppose additionally that for all \(j\in \llbracket d \rrbracket \), the variable \((X^j)^2\) satisfies the \(L^{(1+\alpha )}\)-\(L^1\) condition with a known constant C, and fix \(\delta \in (0, 1)\). Let \(\theta ^{(T)}\) be the output of Algorithm 1 with step-sizes \(\widehat{\beta }_j = 1 / {\overline{L}}_j\), where \({\overline{L}}_j:= \gamma \widehat{U}_j\) and \(\widehat{U}_j\) are the upper bounds from Lemma 5 with confidence \(\delta /(2d)\), an initial iterate \(\theta ^{(0)}\), importance sampling distribution \(p_j = {\overline{L}}_j / \sum _{k \in \llbracket d \rrbracket } {\overline{L}}_{k}\), and estimators of the partial derivatives with error vector \(\epsilon (\cdot )\). Then, we have
with probability at least \(1 - \delta \).
The proof of Corollary 2 is given in Appendix B. It is a direct consequence of Theorem 1 and Lemma 5, and shows that an upper bound similar to that of Theorem 1 can be achieved with observable step-sizes. One may argue that the \(L^{(1+\alpha )}\)-\(L^1\) condition simply bypasses the difficulty of deriving an observable upper bound by arbitrarily assuming that a ratio of moments is known. However, we point out that a hypothesis of this nature is indispensable to obtain bounds such as the one above (to see this, consider a real random variable with an infinitesimal mass drifting towards infinity). In fact, the \(L^{(1+\alpha )}\)-\(L^1\) condition is much weaker than the boundedness requirement (with known range) common to most known empirical bounds (Maurer and Pontil 2009; Audibert et al. 2009; Mnih et al. 2008).
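To make this concrete, here is a minimal numpy sketch of the median-of-means estimate underlying Lemma 5, with the block number \(K = \lceil 18 \log (1/\delta ) \rceil \) used there (the helper names are ours and this is not the linlearn implementation; only the raw \(\texttt {MOM} \) estimate is computed, without the deviation constant from the lemma):

```python
import numpy as np

def mom(values, K, seed=0):
    """Median-of-means: shuffle, split into K equal-size blocks,
    average each block and return the median of the block means."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n = (len(values) // K) * K  # drop the remainder so all blocks have equal size
    block_means = values[rng.permutation(len(values))[:n]].reshape(K, -1).mean(axis=1)
    return float(np.median(block_means))

def second_moment_mom(x_col, delta, seed=0):
    """MOM estimate of E[(X^j)^2] with K = ceil(18 * log(1/delta)) blocks."""
    K = int(np.ceil(18 * np.log(1.0 / delta)))
    return mom(np.asarray(x_col, dtype=float) ** 2, K, seed=seed)
```

As long as the number of corrupted entries stays below a fraction of the number of blocks, the median over block means remains close to the true second moment, which is precisely what Lemma 5 quantifies.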
1.2 A.2 Observable upper bound for the moment \(m_{\alpha , j}\)
Since the moment \(m_{\alpha , j}\) is not observable, we propose in Lemma 6 below an observable high-probability upper bound for it based on \(\texttt {MOM} \). Let us now introduce a robust estimator \(\widehat{m}^{\texttt {MOM} }_{\alpha , j}(\theta )\) of the unknown moment \(m_{\alpha , j}(\theta )\) using the following “two-step” \(\texttt {MOM} \) procedure. First, we compute \(\widehat{g}^{\texttt {MOM} }_j(\theta )\), the \(\texttt {MOM} \) estimator of \(g_j(\theta )\) with K blocks given by (14). Then, we compute a second \(\texttt {MOM} \) estimator over \(\vert g^i_j(\theta ) - \widehat{g}^{\texttt {MOM} }_j(\theta ) \vert ^{1+\alpha }\) for \(i\in \llbracket n \rrbracket \), namely
where
using uniformly sampled blocks \(B_1, \ldots , B_K\) of equal size that form a partition of \(\llbracket n \rrbracket \).
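As an illustration, this two-step procedure can be sketched in numpy as follows, assuming the per-sample gradient coordinates \(g^i_j(\theta )\) are available as an array (the helper names are ours, and the block partition is simplified to a random equal-size split):

```python
import numpy as np

def mom(values, K, seed=0):
    """Median-of-means over K equal-size blocks of a shuffled sample."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n = (len(values) // K) * K  # drop the remainder so all blocks have equal size
    block_means = values[rng.permutation(len(values))[:n]].reshape(K, -1).mean(axis=1)
    return float(np.median(block_means))

def two_step_mom_moment(g_samples, alpha, K, seed=0):
    """Two-step MOM: first MOM-estimate the mean partial derivative,
    then MOM-estimate the (1 + alpha)-th absolute central moment around it."""
    g_hat = mom(g_samples, K, seed=seed)  # step 1: MOM of g_j(theta)
    centered = np.abs(np.asarray(g_samples, dtype=float) - g_hat) ** (1 + alpha)
    return mom(centered, K, seed=seed)    # step 2: MOM of |g^i_j - g_hat|^{1+alpha}
```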
Lemma 6
Grant Assumptions 1 and 2 with \(\alpha \in (0, 1]\) and suppose that for all \(j\in \llbracket d \rrbracket \) and \(\theta \in \Theta \), the partial derivatives \(\ell '(X^\top \theta , Y)X^j\) satisfy the \(L^{(1+\alpha )^2}\)-\(L^{(1+\alpha )}\) condition with a known constant C (see Definition 2). Then, if \(\vert {\mathcal {O}}\vert \le K / 12,\) we have
where \(\kappa = \epsilon + 24 (1 + \alpha ) \big (\frac{(1 + \epsilon ) K}{n} \big )^{\alpha /(1+\alpha )}\) and \(\epsilon = (24 (1 + C^{(1+\alpha )^2} ))^{1/(1+\alpha )} \big (\frac{K}{n}\big )^{\alpha / (1 + \alpha )}\).
The proof of Lemma 6 is given in Appendix B.
1.3 A.3 Experimental details
We provide in this section supplementary information about the numerical experiments conducted in Sect. 6.
1.3.1 A.3.1 Data sets
The main characteristics of the data sets used from the UCI repository are given in Table 3 and their direct URLs are given in Table 4.
1.3.2 A.3.2 Data corruption
For a given corruption rate \(\eta \), we obtain a corrupted version of a data set by replacing an \(\eta \)-fraction of its samples with uninformative elements. For a data set of size n, we choose \({\mathcal {O}}\subset \llbracket n \rrbracket \) which satisfies \(\vert {\mathcal {O}}\vert = \eta n\) up to integer rounding. The corruption is applied prior to any preprocessing, except in the regression case where label scaling is applied before. The affected subset is chosen uniformly at random. Since many data sets contain both continuous and categorical features, we distinguish two corruption mechanisms, which we apply depending on the nature of each feature. The labels are corrupted as continuous or categorical values when the task is respectively regression or classification. Denote by \({\widetilde{{\varvec{X}}}} \in {\mathbb {R}}^{n\times (d+1)}\) the data matrix with the vector of labels added to its columns. Let \({\widetilde{J}} \subset \llbracket d+1 \rrbracket \) denote the set of indices of continuous columns; for \(j \in {{\widetilde{J}}}\), we compute the empirical means \(\widehat{\mu }_j\) and standard deviations \(\widehat{\sigma }_j\) of the corresponding columns. We also sample a random unit vector u of size \(\vert {\widetilde{J}}\vert \).
- For categorical feature columns, for each corrupted index \(i \in {\mathcal {O}}\), we replace \({\varvec{X}}_{i,j}\) with a uniformly sampled value among \(\{{\varvec{X}}_{\cdot ,j}\}\), i.e. among the possible modalities of the categorical feature in question.
- For continuous features, for each corrupted index \(i \in {\mathcal {O}}\), we replace \({\varvec{X}}_{i, {\widetilde{J}}}\) with equal probability with one of the following possibilities:
  - a vector \(\xi \) sampled coordinatewise according to \(\xi _j = r_j + 5 \widehat{\sigma }_j \nu \), where \(r_j\) is a value randomly picked in the column \({\varvec{X}}_{\cdot ,j}\) and \(\nu \) is a sample from the Student distribution with 2.1 degrees of freedom;
  - a vector \(\xi \) sampled coordinatewise according to \(\xi _j = \widehat{\mu }_j + 5\widehat{\sigma }_j u_j + z \), where z is a standard Gaussian;
  - a vector \(\xi \) sampled according to \(\xi = \widehat{\mu }+ 5\widehat{\sigma }\otimes w \), where w is a uniformly sampled unit vector.
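A simplified numpy sketch of these corruption mechanisms, restricted to a purely continuous data matrix (function and variable names are ours; this is an illustration, not the exact experimental code):

```python
import numpy as np

def corrupt_continuous(X, eta, seed=0):
    """Replace an eta-fraction of rows of a continuous data matrix with
    uninformative vectors, choosing one of the three mechanisms at random."""
    rng = np.random.default_rng(seed)
    X0 = np.array(X, dtype=float)  # clean copy, used to pick observed values
    X = X0.copy()
    n, d = X.shape
    outliers = rng.choice(n, size=int(eta * n), replace=False)
    mu, sigma = X0.mean(axis=0), X0.std(axis=0)
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)  # fixed random unit direction, shared by all corrupted rows
    for i in outliers:
        mech = rng.integers(3)  # pick one of the three mechanisms with equal probability
        if mech == 0:
            # observed values (one per column) plus 5*sigma times a Student(2.1) draw
            r = X0[rng.integers(0, n, size=d), np.arange(d)]
            X[i] = r + 5.0 * sigma * rng.standard_t(2.1)
        elif mech == 1:
            # mean shifted by 5*sigma along the fixed direction u, plus Gaussian noise
            X[i] = mu + 5.0 * sigma * u + rng.standard_normal()
        else:
            # mean shifted coordinatewise by 5*sigma times a fresh random unit vector
            w = rng.standard_normal(d)
            X[i] = mu + 5.0 * sigma * (w / np.linalg.norm(w))
    return X, outliers
```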
1.4 A.4 Preprocessing
We apply a minimal amount of preprocessing to the data before running the considered learning algorithms. More precisely, categorical features are one-hot encoded, while centering and standard scaling are applied to the continuous features.
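For illustration, this minimal pipeline can be written in a few lines of numpy (a sketch with hypothetical names, equivalent in spirit to scikit-learn’s OneHotEncoder followed by StandardScaler):

```python
import numpy as np

def preprocess(X, categorical_cols, continuous_cols):
    """One-hot encode the categorical columns and standard-scale
    (center, then divide by the standard deviation) the continuous ones."""
    parts = []
    for j in categorical_cols:
        modalities = np.unique(X[:, j])
        # one indicator column per observed modality
        parts.append((X[:, j][:, None] == modalities[None, :]).astype(float))
    for j in continuous_cols:
        col = X[:, j].astype(float)
        parts.append(((col - col.mean()) / col.std())[:, None])
    return np.hstack(parts)
```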
1.5 A.5 Parameter hyper-optimization
We use the hyperopt library to find optimal hyper-parameters for all algorithms. For each data set, the available samples are split into training, validation and test sets with proportions \(70\%, 15\%, 15\%\). Whenever corruption is applied, it is restricted to the training set. We run 50 rounds of hyper-parameter optimization, each round training on the training set and evaluating on the validation set. Then, we report results on the test set for all hyper-optimized algorithms. For each algorithm, the hyper-parameters are tried out using the following sampling mechanisms (the ones we specify to hyperopt):
- \(\texttt {MOM} \), \(\texttt {GMOM} \), \(\texttt {LLM} \): we optimize the number of blocks K used for the median-of-means computations. This is done through a block_size \(=K/n\) hyper-parameter chosen with a log-uniform distribution over \([10^{-5}, 0.2]\);
- \(\texttt {CH} \) and \(\texttt {CH\, GD} \): we optimize the confidence \(\delta \) used to define the \(\texttt {CH} \) estimator’s scale parameter (see Equation (21)), chosen with a log-uniform distribution over \([e^{-10}, 1]\);
- \(\texttt {TM} \), \(\texttt {HG} \): we optimize the trimming percentage, chosen uniformly in \([10^{-5}, 0.3]\);
- \(\texttt {RANSAC} \): we optimize the min_samples parameter of the scikit-learn implementation, chosen as \(4 + m\) with m an integer chosen uniformly in \(\llbracket 100 \rrbracket \);
- \(\texttt {HUBER} \): we optimize the epsilon parameter of the scikit-learn implementation, chosen uniformly in [1.0, 2.5].
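For reference, these search spaces amount to the following sampling distributions (plain numpy samplers mirroring what we pass to hyperopt; the function names are ours, not hyperopt’s API):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_block_size():       # MOM / GMOM / LLM: block_size = K/n, log-uniform on [1e-5, 0.2]
    return float(np.exp(rng.uniform(np.log(1e-5), np.log(0.2))))

def sample_ch_confidence():    # CH / CH GD: delta, log-uniform on [e^-10, 1]
    return float(np.exp(rng.uniform(-10.0, 0.0)))

def sample_trim_percentage():  # TM / HG: trimming percentage, uniform on [1e-5, 0.3]
    return float(rng.uniform(1e-5, 0.3))

def sample_min_samples():      # RANSAC: 4 + m with m uniform in {1, ..., 100}
    return int(4 + rng.integers(1, 101))

def sample_huber_epsilon():    # HUBER: epsilon, uniform on [1.0, 2.5]
    return float(rng.uniform(1.0, 2.5))
```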
Appendix B: Proofs
1.1 B.1 Proof of Theorem 1
This proof follows, with minor modifications, the proof of Theorem 1 from Wright (2015). Using Definition 1, we obtain
Let us recall that \(e_j\) stands for the j-th canonical basis vector of \({\mathbb {R}}^d\) and that, as described in Algorithm 1, we have
where we use the notations \({\widehat{g}}_t = {\widehat{g}}_{j_t}(\theta ^{(t)})\) and \(g_t = g_{j_t}(\theta ^{(t)})\), and where we recall that \(j_1, \ldots , j_t\) is an i.i.d. sequence with distribution p. We also introduce the shorthand \(\epsilon _j:= \epsilon _j(\delta )\). Using Assumption 3, we obtain
on the event \({\mathcal {E}}\), where we used the choice \(\beta _{j_t} = 1/L_{j_t}\) and the fact that \(\vert {\widehat{g}}_t - g_t \vert \le \epsilon _{j_t}\) on \({\mathcal {E}}\).
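For the reader’s convenience, here is a sketch of the standard coordinate-smoothness computation behind this step, written in the notation above:

```latex
\begin{aligned}
R(\theta^{(t+1)})
  &\le R(\theta^{(t)}) - \beta_{j_t}\, \widehat{g}_t\, g_t
       + \frac{L_{j_t} \beta_{j_t}^2}{2}\, \widehat{g}_t^{\,2}
   = R(\theta^{(t)}) - \frac{1}{2 L_{j_t}} \big( 2\, \widehat{g}_t\, g_t - \widehat{g}_t^{\,2} \big) \\
  &\le R(\theta^{(t)}) - \frac{1}{2 L_{j_t}}\, g_t^2 + \frac{\epsilon_{j_t}^2}{2 L_{j_t}},
\end{aligned}
```

where the last inequality uses \(2 \widehat{g}_t g_t - \widehat{g}_t^{\,2} = g_t^2 - (\widehat{g}_t - g_t)^2 \ge g_t^2 - \epsilon _{j_t}^2\), which holds on \({\mathcal {E}}\).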
Since \(j_1, \ldots , j_t\) is an i.i.d. sequence with distribution p, we have for any \((j_1, \ldots , j_{t-1})\)-measurable and integrable function \(\varphi \) that
where we denote for short the conditional expectation \({\mathbb {E}}_{t-1}[\cdot ] = {\mathbb {E}}_{t-1}[\cdot \vert j_1, \ldots , j_{t-1}]\). So, taking \({\mathbb {E}}_{t-1}[\cdot ]\) on both sides of (B6) leads, whenever \(p_j = L_j / \sum _{k=1}^d L_k\), to
where we introduced \(\Xi := \Vert \epsilon (\delta )\Vert _2^2\), while it leads to
whenever \(p_j = 1 / d\), simply using \(L_{\min } \le L_j \le L_{\max }\). In order to treat both cases simultaneously, consider \({{\bar{L}}} = \sum _{k=1}^d L_k\) and \({{\bar{\epsilon }}} = \Xi / (2 \sum _k L_k)\) whenever \(p_j = L_j / \sum _{k=1}^d L_k\), and \({{\bar{L}}} = d L_{\max }\) and \({{\bar{\epsilon }}} = \Xi / (2 d L_{\min })\) whenever \(p_j = 1 / d\), and continue from the inequality
Introducing \(\phi _t:= {\mathbb {E}}\big [R(\theta ^{(t)})\big ] - R^\star \) and taking the expectation w.r.t. all \(j_1, \dots , j_t\) we obtain
Using Inequality (9) with \(\theta _1 = \theta ^{(t)}\) gives
for any \(\theta _2 \in {\mathbb {R}}^d\), so that minimizing both sides with respect to \(\theta _2\) leads to
namely
by taking the expectation on both sides. Together with (B7) this leads to the following approximate contraction property:
and by iterating \(t=1, \ldots , T\) to
which concludes the proof of Theorem 1. \(\square \)
1.2 B.2 Proof of Theorem 2
This proof reuses ideas from Li et al. (2017) and Beck and Tetruashvili (2013) and adapts them to our context, where the gradient coordinates are replaced with high confidence approximations. Without loss of generality, we assume that the coordinates are cycled over in the natural order. We condition on the event (B5), which holds with probability at least \(1 - \delta \) as in the proof of Theorem 1, and denote \(\epsilon _j = \epsilon _j(\delta )\) and \(\epsilon _{Euc} = \Vert \epsilon (\delta )\Vert \).
Let the iterations be denoted as \(\theta ^{(t)}\) for \(t=0,\dots , T\) and \(\theta ^{(t)}_{i+1} = \theta ^{(t)}_{i} - \beta _{i+1} {\widehat{g}}(\theta ^{(t)}_{i})_{i+1} e_{i+1}\) for \(i=0, \dots , d-1\) with \(\beta _i = 1/L_i\), \(\theta ^{(t)}_0 = \theta ^{(t)}\) and \(\theta ^{(t)}_d = \theta ^{(t+1)}\). With these notations we have
Similarly to (B6) in the proof of Theorem 1 we find:
leading to
We now aim to find a relationship between \(\sum _{i=0}^{d-1} \frac{1}{2L_{i+1}} g(\theta ^{(t)}_{i})_{i+1}^2\) and \(\big \Vert g(\theta ^{(t)})\big \Vert _2^2\), which we do by comparing coordinates. For the first step in a cycle we have \(g(\theta ^{(t)})_1 = g(\theta ^{(t)}_0)_1\) because \(\theta ^{(t)}= \theta ^{(t)}_0\). Let \(j \in \{1,\dots , d-1\}\); by the Mean Value Theorem, there exists \(\gamma ^{(t)}_j \in {\mathbb {R}}^d\) such that:
where we introduced the following quantities: the diagonal matrix \(A = \text {diag}(L_1, \dots , L_d) \in {\mathbb {R}}^{d \times d}\), the vector \(\delta ^{(t)} \in {\mathbb {R}}^d\) such that \(\delta ^{(t)}_j = {\widehat{g}}(\theta ^{(t)}_{j-1})_{j} - g(\theta ^{(t)}_{j-1})_{j}\), which satisfies \(\vert \delta ^{(t)}_{j}\vert \le \epsilon _j\), the matrix \(H = (h_1, \dots , h_d)^\top \) and \({\widetilde{H}} = A^{1/2} + H A^{-1/2} = ( {\widetilde{h}}_1, \dots , {\widetilde{h}}_d)^\top \). In the case \(j=0\), the vector \(h_{j+1} = h_1\) is simply zero. This allows us to obtain the following estimate:
We can bound the spectral norm \(\Vert {\widetilde{H}}\Vert \) as follows:
For \(\Vert H\Vert \), we use the coordinate-wise Lipschitz-smoothness in order to find
Combining the previous inequality with (B8) and (B9), we find:
where the last step uses that \(\frac{dL_{\max }/L_{\min }}{1 + d\frac{L_{\max }}{L_{\min }}} \le 1\). Using \(\lambda \)-strong convexity by choosing \(\theta _1 = \theta ^{(t)}\) in inequality (9) and minimizing both sides w.r.t. \(\theta _2\) we obtain:
which combined with the previous inequality yields the contraction inequality:
and after T iterations we have:
which concludes the proof of Theorem 2. To see that the proof still holds for any choice of coordinates satisfying the conditions in the main claim, notice that the computations leading up to Inequality (B9) work all the same if one were to apply a permutation to the coordinates beforehand.
B.3 Convergence of the parameter error
We state and prove a result about the linear convergence of the parameter under strong convexity.
Theorem 8
Grant Assumptions 1, 3 and 4. Let \(\theta ^{(T)}\) be the output of Algorithm 1 with constant step-size \(\beta = \frac{2}{\lambda + L},\) an initial iterate \(\theta ^{(0)},\) uniform coordinate sampling \(p_j = 1 / d\) and estimators of the partial derivatives with error vector \(\epsilon (\cdot )\). Then, we have
with probability at least \(1 - \delta \), where the expectation is w.r.t. the sampling of the coordinates.
Proof
As in the proof of Theorem 1, let \(({\widehat{g}}_j(\theta ))_{j=1}^d\) be the estimators used and introduce the notations
We also condition on the event (B5) which holds with probability \(1 - \delta \) and use the notations \(\epsilon _{Euc} = \Vert \epsilon (\delta )\Vert _2\) and \(\epsilon _j = \epsilon _j(\delta )\). We denote by \(\Vert \cdot \Vert _{L_2}\) the \(L_2\)-norm w.r.t. the distribution of \(j_t\), i.e., for a random variable \(\xi \) we have \(\Vert \xi \Vert _{L_2} = \sqrt{{\mathbb {E}}_{j_t} \Vert \xi \Vert ^2}\). We compute:
We first treat the first term of (B11). In the case of uniform sampling with equal step-sizes \(\beta _j = \beta \), we have:
By taking the expectation w.r.t. the random coordinate \(j_t\) we find:
The first inequality is obtained by applying inequality (2.1.24) from Nesterov (2004) (see also Bubeck (2015) Lemma 3.11) and the second one is due to the choice of \(\beta \). We can bound the second term as follows:
Combining the latter with the former bound, we obtain the approximate contraction:
By iterating this argument on T rounds we find that:
Finally, the following inequality yields the result in the case of uniform sampling:
\(\square \)
B.4 Proof of Lemma 1
Let \(\theta \in \Theta \), using Assumption 1 we have:
Taking the expectation and using Assumption 2 shows that the risk \(R(\theta )\) is well defined (recall that \(q\le 2\)). Next, since \(1\le q \le 2\), simple algebra gives
Given Assumption 2, it is straightforward that \({\mathbb {E}}\vert X^j\vert ^{1+\alpha } < \infty \) and \({\mathbb {E}}\vert Y^{q-1} X^j\vert ^{1 + \alpha }< \infty \). Moreover, using Hölder's inequality with the conjugate exponents \(a = \frac{q}{q-1}\) and \(b = q\) (the case \(q=1\) is trivial) we find:
which is finite under Assumption 2. This concludes the proof of Lemma 1.
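For concreteness, the Hölder step can be written out as follows; this is our reading of the displayed inequality, with the conjugate exponents \(a = q/(q-1)\) and \(b = q\) applied to \(\vert Y^{q-1} X^j\vert ^{1+\alpha } = \vert Y\vert ^{(q-1)(1+\alpha )}\,\vert X^j\vert ^{1+\alpha }\):

```latex
{\mathbb{E}}\big[\vert Y\vert^{(q-1)(1+\alpha)}\,\vert X^j\vert^{1+\alpha}\big]
\;\le\;
\Big({\mathbb{E}}\vert Y\vert^{q(1+\alpha)}\Big)^{\frac{q-1}{q}}
\Big({\mathbb{E}}\vert X^j\vert^{q(1+\alpha)}\Big)^{\frac{1}{q}},
```

and both expectations on the right-hand side are finite under the moment conditions of Assumption 2.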
B.5 Proof of Lemma 2
This proof follows a standard argument from Lugosi and Mendelson (2019a) and Geoffrey et al. (2020), in which we use a Lemma from Bubeck et al. (2013) in order to control the \((1+\alpha )\)-moment of the block means instead of their variance. Indeed, we know from Lemma 1 that under Assumptions 1 and 2, the gradient coordinates have finite \((1+\alpha )\)-moments, namely \({\mathbb {E}}[\vert \ell '(X^\top \theta , Y) X^j \vert ^{1 + \alpha } ] < +\infty \) for any \(j \in \llbracket d \rrbracket \). Recall that \((\widehat{g}_j^{(k)}(\theta ))_{k \in \llbracket K \rrbracket }\) stands for the block-wise empirical means given by Eq. (15) and introduce the set of non-corrupted block indices \({\mathcal {K}}= \{ k \in \llbracket K \rrbracket \;: \; B_k \cap {\mathcal {O}} = \emptyset \}\). We will initially assume that the number of outliers satisfies \(\vert {\mathcal {O}}\vert \le (1 - \varepsilon ) K / 2\) for some \(0< \varepsilon < 1\). Note that since the samples are i.i.d. in \(B_k\) for \(k \in {\mathcal {K}}\), we have \({\mathbb {E}}\big [\widehat{g}_j^{(k)}(\theta )\big ] = g_j(\theta )\). We use the following Lemma from Bubeck et al. (2013).
Lemma 7
Let \(Z, Z_1, \ldots , Z_n\) be an i.i.d. sequence with \(m_\alpha = {\mathbb {E}}[\vert Z - {\mathbb {E}}Z\vert ^{1 + \alpha }] < +\infty \) for some \(\alpha \in (0, 1]\) and put \({{\bar{Z}}}_n = \frac{1}{n} \sum _{i \in \llbracket n \rrbracket } Z_i\). Then, we have
for any \(\delta \in (0, 1)\), with probability at least \(1 - \delta \).
Lemma 7 entails that
with probability larger than \(1 - 2 \delta '\), for each \(k \in {\mathcal {K}}\), since we have n/K samples in block \(B_k\). Now, recalling that \(\widehat{g}_j(\theta )\) is the median (see (14)), we can upper bound its failure probability as follows:
since at most \(\vert {\mathcal {O}}\vert \) blocks contain one outlier. Since the blocks \(B_k\) are disjoint and contain i.i.d samples for \(k \in {\mathcal {K}}\), we know that
follows a binomial distribution \(\text {Bin}(\vert {\mathcal {K}}\vert , p)\) with \(p \le 2\delta '\). Using the fact that \(\text {Bin}(\vert {\mathcal {K}}\vert , p)\) is stochastically dominated by \(\text {Bin}(\vert {\mathcal {K}}\vert , 2\delta ')\) and that \({\mathbb {E}}[\text {Bin}(\vert {\mathcal {K}}\vert , 2\delta ')] = 2\delta '\vert {\mathcal {K}}\vert \), we obtain, if \(S \sim \text {Bin}(\vert {\mathcal {K}}\vert , 2\delta ')\), that
where we used the fact that \(\vert {\mathcal {O}}\vert \le (1 - \varepsilon ) K / 2\) and \(\vert {\mathcal {K}}\vert \le K\) for the second inequality and the Hoeffding inequality for the last. This concludes the proof of Lemma 2 for the choice \(\varepsilon = 5/6\) and \(\delta ' = 1/8\).
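The estimator analyzed above (block-wise means followed by a median) admits a direct implementation; the following is a minimal sketch on synthetic heavy-tailed, corrupted data, not the linlearn implementation:

```python
import numpy as np

def median_of_means(x, K):
    """Median-of-means: split x into K equal-size blocks, average each block,
    return the median of the K block means."""
    n = (len(x) // K) * K          # drop the remainder so all blocks are equal
    block_means = x[:n].reshape(K, -1).mean(axis=1)
    return float(np.median(block_means))

rng = np.random.default_rng(0)
# Pareto-type sample: finite (1+alpha)-moments only for alpha < 1/2, mean 3.
x = rng.pareto(1.5, size=10_000) + 1.0
x[:20] = 1e6                       # a few corrupted rows

est = median_of_means(x, K=100)    # only one block is contaminated
```

The 20 outliers all land in the first block, so they perturb a single block mean and leave the median essentially untouched, while the plain empirical mean is off by orders of magnitude.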
B.6 Proof of Proposition 3
Step 1 First, we fix \(\theta \in \Theta \) and bound \(\big \vert \widehat{g}_j^{\texttt {MOM} }(\theta ) - g_j(\theta )\big \vert \) in terms of quantities depending only on \({\widetilde{\theta }}\), the closest point to \(\theta \) in an \(\varepsilon \)-net. Recall that \(\Delta \) is the diameter of the parameter set \(\Theta \) and let \(\varepsilon > 0\) be a positive number. There exists an \(\varepsilon \)-net covering \(\Theta \) with cardinality at most \((3\Delta /2\varepsilon )^d\), i.e., a set \(N_{\varepsilon }\) such that for all \(\theta \in \Theta \) there exists \({\widetilde{\theta }} \in N_{\varepsilon }\) with \(\Vert {\widetilde{\theta }} - \theta \Vert \le \varepsilon \). Fix \(\theta \in \Theta \) and \(j\in \llbracket d \rrbracket \); taking \({\widetilde{\theta }} \in N_{\varepsilon }\) such that \(\Vert {\widetilde{\theta }} - \theta \Vert \le \varepsilon \), we write:
where we used the gradient’s coordinate Lipschitz constant to bound the second term. We now focus on the second term. Introducing the notation \(g_j^i(\theta ) = \ell '(\theta ^\top X_i, Y_i)X_i^j\), we have
Let \((B_k)_{k\in \llbracket K \rrbracket }\) be the blocks used to compute the \(\texttt {MOM} \) estimator, with associated block means \(\widehat{g}_j^{(k)}(\theta )\) and \(\widehat{g}_j^{(k)}({\widetilde{\theta }})\). Notice that the \(\texttt {MOM} \) estimator is monotone non-decreasing w.r.t. each of the entries \(g_j^i(\theta )\) when the others are fixed. Without loss of generality, assume that \(\widehat{g}_j^{\texttt {MOM} }(\theta ) - g_j({\widetilde{\theta }}) \ge 0\); then we have:
where the right-hand side is the \(\texttt {MOM} \) estimator obtained using the entries \(\ell '\big ({\widetilde{\theta }}^{\top } X_i, Y_i\big )X_i^j + \varepsilon \gamma \Vert X_i\Vert ^2 = g_j^i({\widetilde{\theta }}) + \varepsilon \gamma \Vert X_i\Vert ^2\) instead of \(g_j^i(\theta )\). Note that this estimator no longer depends on \(\theta \), except through the fact that \({\widetilde{\theta }}\) is chosen in \(N_{\varepsilon }\) so that \(\big \Vert {\widetilde{\theta }} - \theta \big \Vert \le \varepsilon \). Indeed, using the Lipschitz smoothness of the loss function and a Cauchy-Schwarz inequality we find that:
Step 2
We now use the concentration property of \(\texttt {MOM} \) to bound the quantity above, which depends only on \({\widetilde{\theta }}\). The samples \((g_j^i({\widetilde{\theta }}) + \varepsilon \gamma \Vert X_i\Vert ^2)_{i \in \llbracket n \rrbracket }\) are independent and distributed according to the random variable \(\ell '({\widetilde{\theta }}^{\top } X, Y)X^j + \varepsilon \gamma \Vert X\Vert ^2\). Denote \(\overline{L} = \gamma {\mathbb {E}}\Vert X\Vert ^2\) and for \(k \in \llbracket K \rrbracket \) let \(\widehat{g}_j^{(k)}({\widetilde{\theta }}) = \frac{K}{n}\sum _{i\in B_k} g_j^i({\widetilde{\theta }})\) and \(\widehat{L}^{(k)} = \frac{K}{n}\sum _{i\in B_k} \gamma \Vert X_i\Vert ^2\). We use Lemma 7 for each of these pairs of means to obtain that with probability at least \(1 - \delta '/2\):
and with probability at least \(1 - \delta '/2\)
where \(m_{L, \alpha } = {\mathbb {E}}\vert \gamma \Vert X\Vert ^2 - \overline{L} \vert ^{1+\alpha }\). Hence for all \(k \in \llbracket K \rrbracket \)
Now defining the Bernoulli variables
we have just seen that they have success probability \(\le \delta '\); moreover,
since at most \(\vert {\mathcal {O}}\vert \) blocks contain one outlier. Since the blocks \(B_k\) are disjoint and contain i.i.d samples for \(k \in {\mathcal {K}}\), we know that \(\sum _{k \in {\mathcal {K}}} U_k\) follows a binomial distribution \(\text {Bin}(\vert {\mathcal {K}}\vert , p)\) with \(p \le \delta '\). Using the fact that \(\text {Bin}(\vert {\mathcal {K}}\vert , p)\) is stochastically dominated by \(\text {Bin}(\vert {\mathcal {K}}\vert , \delta ')\) and that \({\mathbb {E}}[\text {Bin}(\vert {\mathcal {K}}\vert , \delta ')] = \delta '\vert {\mathcal {K}}\vert \), we obtain, if \(S \sim \text {Bin}(\vert {\mathcal {K}}\vert , \delta ')\), that
where we used the condition \(\vert {\mathcal {O}}\vert \le (1 - \varepsilon ') K / 2\) and \(\vert {\mathcal {K}}\vert \le K\) for the second inequality and the Hoeffding inequality for the last. To conclude, we choose \(\varepsilon ' = 5/6\) and \(\delta ' = 1/4\) and combine (B12), (B13) and the last inequality in which we take \(K = \lceil 18\log (1/\delta ) \rceil \) and use a union bound argument to obtain that with probability at least \(1 - \delta \) for all \(j \in \llbracket d \rrbracket \)
Step 3 We use the \(\varepsilon \)-net to obtain a uniform bound. For \(\theta \in \Theta \), denote by \({\widetilde{\theta }}(\theta ) \in N_{\varepsilon }\) the closest point in \(N_{\varepsilon }\), satisfying in particular \(\Vert {\widetilde{\theta }}(\theta ) - \theta \Vert \le \varepsilon \). Following the previous arguments, we write
Here, we make a union bound argument over \({\widetilde{\theta }} \in N_{\varepsilon }\) for the inequality (B14) and choose \(\varepsilon = n^{-\alpha /(1+\alpha )}\) to obtain the final result concluding the proof of Proposition 3.
B.7 Proof of Proposition 4
This proof reuses arguments from the proof of Theorem 2 in Lecué et al. (2020). We wish to bound \(\big \vert \widehat{g}_j^{\texttt {MOM} }(\theta ) - g_j(\theta ) \big \vert \) with high probability and uniformly over \(\theta \in \Theta \). Fix \(\theta \in \Theta \) and \(j \in \llbracket d \rrbracket \). We have \(\widehat{g}_j^{\texttt {MOM} }(\theta ) = {{\,\textrm{median}\,}}\big ( \widehat{g}_j^{(1)}(\theta ), \dots , \widehat{g}_j^{(K)}(\theta ) \big )\) with \(\widehat{g}_j^{(k)}(\theta ) = \frac{K}{n}\sum _{i\in B_k} g^i_j(\theta )\), where the blocks \(B_1, \dots , B_K\) constitute a partition of \(\llbracket n \rrbracket \).
Define the function \(\phi (t) = (t-1){{{\textbf {1}}}}_{1\le t \le 2} + {{{\textbf {1}}}}_{t > 2}\), let \({\mathcal {K}}= \{ k\in \llbracket K \rrbracket , \,\, B_k \cap {\mathcal {O}}= \emptyset \}\) and \({\mathcal {J}} = \bigcup _{k \in {\mathcal {K}}} B_k\). Thanks to the inequality \(\phi (t) \ge {{{\textbf {1}}}}_{t \ge 2}\), we have:
Besides, the inequality \(\phi (t) \le {{{\textbf {1}}}}_{t \ge 1}\), an application of Markov’s inequality and Lemma 7 yield:
Therefore, recalling that we defined \(M_{\alpha , j}:= \sup _{\theta \in \Theta } m_{\alpha , j}(\theta )\) we have
Now, since \(0\le \phi (t)\le 1 \) for all t, McDiarmid’s inequality implies that, with probability \(\ge 1 - \exp (-2y^2 K)\):
Using a simple symmetrization argument (see for instance Lemma 11.4 in Boucheron et al. (2013)) we find:
where the \(\varepsilon _k\)s are independent Rademacher variables. Since \(\phi \) is 1-Lipschitz and satisfies \(\phi (0)=0\) we can use the contraction principle [see Theorem 11.6 in Boucheron et al. (2013)] followed by another symmetrization step to find
Taking \(\vert {\mathcal {O}}\vert \le (1 - \varepsilon )K/2\), we find that with probability \(\ge 1 - \exp (-2y^2 K)\)
Now by choosing \(y = 1/4 - \vert {\mathcal {O}}\vert /K\) and \(x = \max \Big (\Big (\frac{36M_{\alpha , j}}{(n/K)^\alpha }\Big )^{1/(1+\alpha )}, \frac{64 {\mathcal {R}}_j(\Theta )}{n}\Big )\), we obtain the deviation bound:
where the last inequality comes from the choice \(\varepsilon = 5/6\). A simple union bound argument lets the previous inequality hold for all \(j\in \llbracket d \rrbracket \) with high probability.
Finally, assuming that \(X^j\) has a finite fourth moment for all \(j\in \llbracket d \rrbracket \), we can control the Rademacher complexity. In this part, we assume without loss of generality that \({\mathcal {I}}= \llbracket n \rrbracket \) and first write
Denote \(\phi _i(t) = (\ell '(t, Y_i) - \ell '(0, Y_i))X_i^j\) and notice that \({\mathbb {E}}\big [\sum _{i=1}^n \varepsilon _i \ell '(0, Y_i)X_i^j\big ] = 0\). Notice also that \(\phi _i(0) = 0\) and \(\phi _i\) is \(\gamma \vert X_i^j\vert \)-Lipschitz for all i. We use a variant of the contraction principle adapted to our case in which functions with different Lipschitz constants appear. We use Lemma 11.7 from Boucheron et al. (2013) and adapt the proof of their Theorem 11.6 to make the following estimations:
By iterating the previous argument n times we find:
Now recalling that the diameter of \(\Theta \) is \(\Delta \), we use Lemma 8 below with \(p=1\) to bound the previous quantity as:
where we used a Cauchy–Schwarz inequality in the last step, which concludes the proof of Proposition 4. \(\square \)
Lemma 8
(Khintchine inequality variant) Let \(\alpha \in (0,1]\), \(p>0\) and \(n \in {\mathbb {N}}\), let \((x_i)_{i\in \llbracket n \rrbracket }\) be real numbers and let \((\varepsilon _i)_{i\in \llbracket n \rrbracket }\) be i.i.d. Rademacher random variables. Then we have the inequality:
with the constant \(B_{p,\alpha }:= 2p \Big (\frac{1+\alpha }{\alpha }\Big )^{\alpha p/(1+\alpha ) - 1} \Gamma \Big (\frac{\alpha p}{1+\alpha }\Big )\). Moreover, for \(p=1\) the constant \(B_{1,\alpha }\) is bounded for any \(\alpha \ge 0\).
Proof
This proof is a generalization of Lemma 4.1 from Ledoux and Talagrand (1991) and uses similar methods. For all \(\lambda > 0\) we have:
where we used the inequality \(\cosh (u) \le \exp \Big ( \frac{\vert u\vert ^{1+\alpha }}{1+\alpha }\Big )\) valid for all \(u\in {\mathbb {R}}\) which can be quickly proven. Since both functions are even, fix \(u>0\) and define \(f_u(\alpha )=\exp \Big ( \frac{\vert u\vert ^{1+\alpha }}{1+\alpha }\Big ) - \cosh (u)\), we can show that \(f_u\) is monotonous on [0, 1] separately for \(u\in (0,\sqrt{e})\) and \((e, +\infty )\) and notice that \(f_u(0)\) and \(f_u(1)\) are both non-negative for all \(u >0\) thanks to the famous inequality \(\cosh (u)\le e^{u^2/2}\). Therefore, the inequality holds for \(u\in (0,\sqrt{e})\) and \((e, +\infty )\). Finally, for \(u\in (\sqrt{e}, e)\), the function \(f_u(\alpha )\) reaches a minimum at \(f_u(1/\log (u) - 1) = u^e - \cosh (u)\) and by taking logarithms we have \(u^e \ge \cosh (u) \iff \log (1+e^{2u})\le u + \log (2) + e\log (u)\) but since the derivatives verify \(\frac{2}{1+e^{-2u}}\le 2 \le 1+e/u\) for \(u\in (\sqrt{e}, e)\) and \(e^{e/2}\ge \cosh (\sqrt{e})\) the desired inequality follows by integration.
By homogeneity, we can focus on the case \(\big (\sum _{i=1}^n \vert x_i\vert ^{1+\alpha }\big )^{1/(1+\alpha )} = 1\), we compute:
where we used the previous inequality and chose \(\lambda = (t^{1/p})^{1/\alpha }\) in the last step. This proves the main inequality. Finally, it is easy to see that \(B_{1,\alpha }\) is bounded for large values of \(\alpha \), while for \(\alpha \sim 0\) this is a consequence of the fact that \(\Gamma (x) \sim 1/x\) near 0 and the limit \(x^x \rightarrow 1\) when \(x \rightarrow 0^+\). \(\square \)
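The case \(p = 1\) of Lemma 8 can be checked by exhaustive enumeration over the \(2^n\) Rademacher sign patterns; the constant below is \(B_{1,\alpha }\) as stated in the lemma:

```python
import itertools
import math

def lhs(xs):
    """E |sum_i eps_i * x_i| computed exactly over all sign vectors."""
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=len(xs)):
        total += abs(sum(s * x for s, x in zip(signs, xs)))
    return total / 2 ** len(xs)

def rhs(xs, alpha):
    """B_{1,alpha} * (sum_i |x_i|^(1+alpha))^(1/(1+alpha))."""
    a = alpha / (1.0 + alpha)
    B = 2.0 * ((1.0 + alpha) / alpha) ** (a - 1.0) * math.gamma(a)
    return B * sum(abs(x) ** (1.0 + alpha) for x in xs) ** (1.0 / (1.0 + alpha))

xs = [0.3, -1.2, 2.0, 0.7, -0.5]
for alpha in (0.25, 0.5, 1.0):
    assert lhs(xs) <= rhs(xs, alpha)
```

For \(\alpha = 1\) the constant is \(B_{1,1} = \sqrt{2\pi } \approx 2.51\), so the bound is consistent, up to the constant, with the classical Khintchine inequality.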
B.8 Proof of Lemma 3
As previously, Lemma 1 along with Assumptions 1 and 2 guarantee that the gradient coordinates have finite \((1+\alpha )\)-moments. From here, Lemma 3 is a direct application of Lemma 9 stated and proved below. In the following lemma, for any sequence \((z_i)_{i=1}^N\) of real numbers, \((z_i^*)_{i=1}^N\) denotes a non-decreasing reordering of it.
Lemma 9
Let \({\widetilde{X}}_1, \ldots , {\widetilde{X}}_N, {\widetilde{Y}}_1, \dots , {\widetilde{Y}}_N\) denote an i.i.d. sample, corrupted at rate \(\eta \), from a random variable X with expectation \(\mu = {\mathbb {E}}X\) and finite centered \((1 + \gamma )\)-moment \({\mathbb {E}}\vert X - \mu \vert ^{1+\gamma } =M < \infty \) for some \(0 < \gamma \le 1\). Denote by \({\widehat{\mu }}\) the \(\epsilon \)-trimmed mean estimator computed as \({\widehat{\mu }}=\frac{1}{N}\sum _{i=1}^N \phi _{\alpha , \beta }({\widetilde{X}}_i)\) with \(\phi _{\alpha , \beta }(x) = \max (\alpha , \min (x, \beta ))\) and the thresholds \(\alpha = {\widetilde{Y}}^*_{\epsilon N}\) and \(\beta = {\widetilde{Y}}^*_{(1-\epsilon ) N}\). Let \(1 > \delta \ge e^{-N}/4\). Taking \(\epsilon = 8\eta +12 \frac{\log (4/\delta )}{N},\) we have
with probability at least \(1 - \delta \).
Proof
This proof goes along the lines of the proof of Theorem 1 from Lugosi and Mendelson (2021) with the main difference that only the \((1+\gamma )\)-moment is used instead of the variance. Denote by X the random variable whose expectation \(\mu = {\mathbb {E}}X\) is to be estimated and set \(\overline{X} = X - \mu \). Let \(X_1, \dots , X_N, Y_1, \dots , Y_N\) be the original uncorrupted i.i.d. sample from X and let \({\widetilde{X}}_1, \dots , {\widetilde{X}}_N, {\widetilde{Y}}_1, \dots , {\widetilde{Y}}_N\) denote the corrupted sample with rate \(\eta \). We define the following quantity which will appear in the proof:
Step 1 We first derive confidence bounds on the truncation thresholds. Define the random variable \(U = {{{\textbf {1}}}}_{\overline{X} \ge Q_{1-2\epsilon }(\overline{X})}\). Its standard deviation satisfies \(\sigma _U \le {\mathbb {P}}^{1/2}(\overline{X} \ge Q_{1-2\epsilon }(\overline{X})) = \sqrt{2\epsilon }\). By applying Bernstein’s inequality we find with probability \(\ge 1 - \exp (-\epsilon N/12)\) that:
a similar argument with \(U = {{{\textbf {1}}}}_{\overline{X} > Q_{1 - \epsilon /2}(\overline{X})}\) yields with probability \(\ge 1 - \exp (-\epsilon N /12)\) that:
and similarly with probability \(\ge 1 - \exp (-\epsilon N /12)\) we have:
and with probability \(\ge 1 - \exp (-\epsilon N /12)\):
so that with probability \(\ge 1 - 4\exp (-\epsilon N /12) \ge 1 - \delta /2\) the four previous inequalities hold simultaneously. We call this event E which only depends on the variables \(Y_1, \dots , Y_N\). Since \(\eta \le \epsilon /8\), if \(2\eta N\) samples are corrupted we still have:
and
consequently, the two following bounds hold
and similarly
This provides guarantees on the truncation levels used which are \(\alpha = {\widetilde{Y}}^*_{\epsilon N}\) and \(\beta = {\widetilde{Y}}^*_{(1-\epsilon ) N}\).
Step 2
We first bound the deviation \(\Big \vert \frac{1}{N} \sum _{i=1}^N \phi _{\alpha , \beta }(X_i) - \mu \Big \vert \) in the absence of corruption. We write:
The first term is dominated by:
and lower bounded by:
The sum in (B17) above has terms upper bounded by \(Q_{1 - \epsilon /2}(\overline{X}) + \overline{{\mathcal {E}}}(\epsilon , X)\). We need to work with the knowledge that \({\mathbb {E}}\vert \overline{X}\vert ^{1+\gamma } = M < \infty \) in order to bound their variance:
To control the three terms in the previous expression we mimic the proof of Chebyshev’s inequality to obtain that, when \(Q_{2\epsilon }(\overline{X}) < 0\):
analogously, when \(Q_{1 - \epsilon /2}(\overline{X}) > 0\) we have:
from (B18), we deduce that
and from (B19) we find
In the pathological case where \(Q_{2\epsilon }(\overline{X}) \ge 0\), we use \(Q_{2\epsilon }(\overline{X}) \le Q_{1 - \epsilon /2}(\overline{X})\) (valid for \(\epsilon \le 2/5\)) to deduce \(\vert Q_{2\epsilon }(\overline{X})\vert \le \vert Q_{1 - \epsilon /2}(\overline{X})\vert \), and hence we still have
The case \(Q_{1 - \epsilon /2}(\overline{X})\le 0\) is similarly handled. Moreover, a simple calculation yields
All in all, we have shown the inequality:
which we now use to apply Bernstein’s inequality on the sum in (B17) to find, conditionally on \(Y_1, \dots , Y_N\), with probability at least \(1 - \delta /4\):
where we used (B19), the fact that \(\frac{\log (4/\delta )}{N} \le \epsilon /12\) and the assumption that \(\delta \ge e^{-N}/4\). Using the same argument on the lower tail, we obtain, on the event E, that with probability at least \(1 - \delta /2\)
Step 3
Now we show that \(\Big \vert \frac{1}{N}\sum _{i=1}^N \phi _{\alpha , \beta }(X_i) - \frac{1}{N}\sum _{i=1}^N \phi _{\alpha , \beta }({\widetilde{X}}_i)\Big \vert \) is of the same order as the previous bounds. There are at most \(2\eta N\) indices such that \(X_i \ne {\widetilde{X}}_i\) and for such differences we have the bound:
and since we have \(\eta \le \epsilon /8\) then
where the last step follows from (B18) and (B19). Finally, using similar arguments along with Hölder’s inequality, we show that:
and a similar computation for \({\mathbb {E}}\big [ \vert \overline{X} - Q_{1 - \epsilon /2}(\overline{X})\vert {{{\textbf {1}}}}_{\overline{X} \ge Q_{1 - \epsilon /2}(\overline{X})}\big ]\) leads to
This completes the proof of Lemma 9. \(\square \)
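The two-sample trimmed mean of Lemma 9 is simple to implement: the thresholds \(\alpha , \beta \) are order statistics of the auxiliary sample \({\widetilde{Y}}\), and the \({\widetilde{X}}\) sample is clipped and averaged. The sketch below ignores the precise index conventions and the data-driven choice of \(\epsilon \):

```python
import numpy as np

def trimmed_mean(x, y, eps):
    """Clip x at the eps- and (1 - eps)-order statistics of the auxiliary
    sample y, then average: phi_{alpha,beta}(t) = max(alpha, min(t, beta))."""
    y_sorted = np.sort(y)
    N = len(y_sorted)
    alpha = y_sorted[max(int(eps * N) - 1, 0)]           # ~ Y*_{eps N}
    beta = y_sorted[min(int((1.0 - eps) * N), N - 1)]    # ~ Y*_{(1-eps) N}
    return float(np.clip(x, alpha, beta).mean())

rng = np.random.default_rng(1)
x = rng.standard_t(df=2, size=5000)    # heavy tails: infinite variance, mean 0
y = rng.standard_t(df=2, size=5000)
x[:25] = -1e5                          # corrupted rows at rate eta = 0.5%

est = trimmed_mean(x, y, eps=0.05)
```

The gross outliers are clipped to the lower threshold (about the 5% quantile of the clean distribution), so their influence on the average stays bounded, whereas the plain empirical mean is dragged to roughly \(-500\).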
B.9 Proof of Proposition 5
Step 1. Notice that the \(\texttt {TM} \) estimator is also a monotone non-decreasing function of each of its entries when the others are fixed. This allows us to replicate Step 1 of the proof of Proposition 3. We define an \(\varepsilon \)-net \(N_{\varepsilon }\) on the set \(\Theta \), fix \(\theta \in \Theta \) and let \({\widetilde{\theta }}\) be the closest point in \(N_{\varepsilon }\). We obtain, for all \(j\in \llbracket d \rrbracket \), the inequalities:
where the right-hand side is the \(\texttt {TM} \) estimator obtained from the entries \(\ell '\big ({\widetilde{\theta }}^{\top } X_i, Y_i\big )X_i^j + \varepsilon \gamma \Vert X_i\Vert ^2 = g_j^i({\widetilde{\theta }}) + \varepsilon \gamma \Vert X_i\Vert ^2\).
Step 2
We use the concentration property of the \(\texttt {TM} \) estimator to bound the previous quantity, which depends only on \({\widetilde{\theta }}\). The terms \(\big (g_j^i({\widetilde{\theta }}) + \varepsilon \gamma \Vert X_i\Vert ^2\big )_{i\in \llbracket n \rrbracket }\) are independent and distributed according to \(Z:= \ell '\big ({\widetilde{\theta }}^\top X, Y\big )X^j + \gamma \varepsilon \Vert X\Vert ^2.\) Obviously we have \({\mathbb {E}}\ell '\big ({\widetilde{\theta }}^\top X, Y\big )X^j = g_j({\widetilde{\theta }}).\) Furthermore, let \(\overline{L} = {\mathbb {E}}\gamma \Vert X\Vert ^2\), so that \({\mathbb {E}}\big [g_j^i({\widetilde{\theta }}) + \varepsilon \gamma \Vert X_i\Vert ^2\big ] = g_j({\widetilde{\theta }}) + \varepsilon \overline{L}\). We will apply Lemma 9 to these samples. Before we do so, we need to compute the centered \((1+\alpha )\)-moment of Z. Let \(m_{j,\alpha }({\widetilde{\theta }})\) and \(m_{L,\alpha }\) be the centered \((1+\alpha )\)-moments of \(\ell '({\widetilde{\theta }}^\top X, Y)X^j\) and \(\gamma \Vert X\Vert ^2\) respectively; we have:
Now applying Lemma 9 we find with probability no less than \(1-\delta \)
with \(\epsilon _{\delta } = 8\eta +12 \frac{\log (4/\delta )}{n}\). By combining with (B20) and using a union bound argument, we deduce that with the same probability, we have for all \(j\in \llbracket d \rrbracket \)
Step 3 We use the \(\varepsilon \)-net to obtain a uniform bound. We proceed similarly as in the proof of Proposition 3. For \(\theta \in \Theta \), denote by \({\widetilde{\theta }}(\theta ) \in N_{\varepsilon }\) the closest point in \(N_{\varepsilon }\), satisfying in particular \(\Vert {\widetilde{\theta }}(\theta ) - \theta \Vert \le \varepsilon \). Following the previous arguments, we write
Taking a union bound over \({\widetilde{\theta }} \in N_{\varepsilon }\) for the inequality (B21) and choosing \(\varepsilon = n^{-\alpha /(1+\alpha )}\) concludes the proof of Proposition 5.
B.10 Proof of Corollary 1
We first write the result of Proposition 5 with a big O notation. This tells us that with probability at least \(1 - \delta \) for all \(\theta \in \Theta \), for all \(j\in \llbracket d \rrbracket \) we have:
It only remains to apply Theorem 1 with importance sampling. The main result corresponds to having the second term (the statistical error) dominate the bound given by Theorem 1. This happens as soon as the number of iterations T is high enough so that
From here, it is straightforward to check that the stated number of iterations suffices.
B.11 Proof of Lemma 4
Similarly to the proof of Lemma 2, the assumptions, this time taken with \(\alpha = 1\), imply that the gradient has a second moment so that the existence of \(\sigma _j^2 = {\mathbb {V}}(g_j(\theta ))\) is guaranteed. We apply Lemma 1 from Holland et al. (2019) with \(\delta /2\) to obtain:
with probability at least \(1 - \delta /2\), where C is a constant such that we have:
and one can easily check that our choice of \(\psi \), the Gudermannian function, satisfies the previous inequality for \(C = 1/2\). This, along with the choice of scale s according to (21) and our assumption on \({\widehat{\sigma }}_j\) yields the announced deviation bound by a simple union bound argument.
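The inequality required of \(\psi \) is, in our reading of the missing display, Catoni's classical condition \(-\log (1 - x + C x^2) \le \psi (x) \le \log (1 + x + C x^2)\); under that reading, one can check numerically that the Gudermannian function indeed satisfies it with \(C = 1/2\):

```python
import math

def psi(x):
    """Gudermannian function: psi(x) = 2 * atan(tanh(x / 2)) = atan(sinh(x))."""
    return 2.0 * math.atan(math.tanh(x / 2.0))

# Check -log(1 - x + x^2/2) <= psi(x) <= log(1 + x + x^2/2) on a grid.
# psi is odd and the two bounds swap under x -> -x, so x >= 0 suffices;
# note 1 - x + x^2/2 > 0 for all x, so the lower bound is well defined.
for k in range(1, 301):
    x = 0.01 * k                                       # x in (0, 3]
    assert psi(x) <= math.log(1.0 + x + 0.5 * x * x)
    assert psi(x) >= -math.log(1.0 - x + 0.5 * x * x)
```

Both bounds are tight to order \(x^4/8\) near the origin, which is why a bounded, smooth influence function like the Gudermannian can match them.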
B.12 Proof of Proposition 6
In this proof, for a scale \(s > 0\) and a set of real numbers \((x_i)_{i\in \llbracket n \rrbracket }\), we let \({\bar{x}}=\frac{1}{n} \sum _{i\in \llbracket n \rrbracket } x_i\) be their mean and define the function \(\zeta _s\big ((x_i)_{i\in \llbracket n \rrbracket }\big )\) as the unique x satisfying
Since the function \(\psi \) is increasing, the previous equation has a unique solution. Moreover, for a fixed scale s, the function \(\zeta _s\big ((x_i)_{i\in \llbracket n \rrbracket }\big )\) is monotone non-decreasing w.r.t. each \(x_i\) when the others are fixed.
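Since \(\psi \) is increasing, the value \(\zeta _s\big ((x_i)_{i\in \llbracket n \rrbracket }\big )\) can be computed by bisection. The display defining \(\zeta _s\) is missing from our copy, so the sketch below solves the standard Catoni-type equation \(\sum _i \psi \big ((x_i - x)/s\big ) = 0\); this is an assumption on the exact normalization, which in any case does not change the root:

```python
import math

def psi(x):
    """Gudermannian influence function: bounded, odd and increasing."""
    return 2.0 * math.atan(math.tanh(x / 2.0))

def zeta(xs, s, tol=1e-12):
    """Unique root x of f(x) = sum_i psi((x_i - x) / s), found by bisection.

    f is strictly decreasing in x, and [min(xs), max(xs)] brackets the root."""
    f = lambda x: sum(psi((xi - x) / s) for xi in xs)
    lo, hi = min(xs), max(xs)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

xs = [0.9, 1.1, 1.0, 0.8, 1.2, 50.0]   # five inliers near 1, one gross outlier
est = zeta(xs, s=2.0)
```

Because \(\psi \) is bounded by \(\pi /2\), the outlier can only shift the root by a bounded amount (here the estimate stays close to 1 while the plain mean exceeds 9); the scale s trades bias against robustness and is set according to (21) and (22) in the paper.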
Step 1
We proceed similarly as in the proof of Proposition 3 except that we only use the monotonicity of the \(\texttt {CH} \) estimator with fixed scale. Let \(N_{\varepsilon }\) be an \(\varepsilon \)-net for \(\Theta \) with \(\varepsilon = 1/\sqrt{n}\). We have \(\vert N_{\varepsilon }\vert \le (3\Delta /2\varepsilon )^d\) with \(\Delta \) the diameter of \(\Theta \). Fix a coordinate \(j\in \llbracket d \rrbracket \), a point \(\theta \in \Theta \) and let \({{\widetilde{\theta }}}\) be the closest point to it in the \(\varepsilon \)-net. We wish to bound the difference
where we have the \(\texttt {CH} \) estimator \(\widehat{g}_j^{\texttt {CH} }(\theta ) = \zeta _{s(\theta )}\big ((g_j^i(\theta ))_{i\in \llbracket n \rrbracket }\big )\) with scale \(s(\theta )\) computed according to (21) and (22). Assume, without loss of generality that \(\widehat{g}_j^{\texttt {CH} }(\theta ) - g_j({{\widetilde{\theta }}}) \ge 0\). Using the non-decreasing property of the \(\texttt {CH} \) estimator at a fixed scale, we find that
Indeed, one has
We introduce a shorthand notation for this perturbed estimator, so that:
Step 2
We now use the concentration property of \(\texttt {CH} \) to bound the previous quantity, which depends only on \({\widetilde{\theta }}\). We apply Lemma 1 from Holland et al. (2019) with \(\delta /2\) and scale \(s(\theta )\) to the samples \((g_j^i({{\widetilde{\theta }}}) + \varepsilon \gamma \Vert X_i\Vert ^2)_{i\in \llbracket n \rrbracket }\) which are independent and distributed according to the random variable \(\ell '\big ({{\widetilde{\theta }}}^\top X, Y\big )X^j + \varepsilon \gamma \Vert X\Vert ^2\) with expectation \(g_j({{\widetilde{\theta }}}) + \varepsilon \overline{L}\). Using our assumptions on \(\sigma _L, \sigma _j(\theta ), \sigma _j({{\widetilde{\theta }}}), \widehat{\sigma }_j(\theta )\) and the definition of the scale \(s(\theta )\) according to (21) we find:
A simple union bound yields that for all \(j\in \llbracket d \rrbracket \)
Step 3
We use the \(\varepsilon \)-net to obtain a uniform bound. We proceed similarly to the proof of Proposition 3. For \(\theta \in \Theta \), denote by \({\widetilde{\theta }}(\theta ) \in N_{\varepsilon }\) the closest point in \(N_{\varepsilon }\), satisfying in particular \(\Vert {\widetilde{\theta }}(\theta ) - \theta \Vert \le \varepsilon \). Following the previous arguments, we write
Taking a union bound over \({\widetilde{\theta }} \in N_{\varepsilon }\) for the inequality (B22) and using the choice \(\varepsilon = 1/\sqrt{n}\) concludes the proof of Proposition 6.
B.13 Proof of Corollary 2
Under the assumptions made, the constants \((L_j)_{j\in \llbracket d \rrbracket }\) are estimated using the \(\texttt {MOM} \) estimator and we obtain the bounds \(({\overline{L}}_j)_{j\in \llbracket d \rrbracket }\) which hold with probability at least \(1 - \delta /2\) by a union bound argument. The rest of the proof is the same as that of Theorem 1, using a failure probability \(\delta /2\) instead of \(\delta \) and replacing the constants \((L_j)_{j\in \llbracket d \rrbracket }\) by their upper bounds accordingly. The result then follows after a simple union bound argument.
B.14 Proof of Lemma 5
Let \(B_1, \dots , B_K\) be the blocks used for the estimation, so that \(B_1 \cup \dots \cup B_K = \llbracket n \rrbracket \) and \(B_{k_1} \cap B_{k_2} = \emptyset \) for \(k_1\ne k_2\). Let \({\mathcal {K}}\) denote the uncorrupted block indices \({\mathcal {K}}= \{k \in \llbracket K \rrbracket \text { such that } B_k \cap {\mathcal {O}}= \emptyset \}\) and assume \(\vert {\mathcal {O}}\vert \le (1 - \varepsilon )K/2\). For \(k \in \llbracket K \rrbracket \) let \(\widehat{\sigma }^2_k = \frac{K}{n}\sum _{i\in B_k}X_i^2\) be the block means computed by MOM. Denote \(N = n/K\). By using (a slight generalization of) Lemma 7 and the \(L^{(1+\alpha )}\)-\(L^{1}\) condition satisfied by \(X^2\) with a known constant C, we obtain that with probability at least \(1-\delta \) we have
which implies the inequality
Define the Bernoulli random variables \(U_k = {{{\textbf {1}}}}_{}\Big \{\sigma ^2 > \Big (1 - C\big ( \frac{3}{\delta N^{\alpha }} \big )^{\frac{1}{1+\alpha }}\Big )^{-1} \widehat{\sigma }^2_k\Big \}\) for \(k \in \llbracket K \rrbracket \), which have success probability at most \(\delta \). Denoting \(S = \sum _k U_k\), we can bound the failure probability of the estimator as follows:
where we used the fact that \(\vert {\mathcal {O}}\vert \le (1 - \varepsilon ) K / 2\) and \(\vert {\mathcal {K}}\vert \le K\) for the second inequality and Hoeffding’s inequality for the last. The proof is finished by taking \(\varepsilon = 5/6\) and \(\delta = 1/4.\)
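The block-splitting scheme used in this proof can be illustrated with a short sketch. The function name `mom_estimate` and the data below are purely illustrative (this is not linlearn's API): the sample is shuffled, split into K equal-sized blocks, each block is averaged, and the median of the K block means is returned, so that a few corrupted entries can spoil at most a few blocks.

```python
import numpy as np

def mom_estimate(x, K, seed=0):
    """Median-of-means: shuffle, split x into K equal-sized blocks,
    average each block and return the median of the K block means.
    A minimal sketch with illustrative names, not linlearn's API."""
    rng = np.random.default_rng(seed)
    x = rng.permutation(np.asarray(x, dtype=float))
    n = (len(x) // K) * K        # drop the remainder so blocks are equal
    block_means = x[:n].reshape(K, -1).mean(axis=1)
    return float(np.median(block_means))

# Robust estimate of the second moment E[X^2] from skewed data with a
# few corrupted entries: the median over blocks filters the outliers.
rng = np.random.default_rng(1)
sample = rng.exponential(size=10_000)   # E[X^2] = 2 for Exp(1)
sample[:5] = 1e6                        # corrupted rows
sigma2_hat = mom_estimate(sample ** 2, K=20)
```

Since at most 5 of the 20 blocks can contain a corrupted point, the median of the block means stays close to the true second moment.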
B.15 Proof of Lemma 6
Lemma 6 is a direct consequence of the following result.
Lemma 10
Let \(X_1, \dots , X_n\) be an i.i.d. sample of a random variable X with expectation \({\mathbb {E}}X = \mu \) and \((1+\alpha )\)-moment \({\mathbb {E}}\vert X - \mu \vert ^{1+\alpha } = m_{\alpha }< \infty \). Assume that the variable X satisfies the \(L^{(1+\alpha )^2}\)-\(L^{(1+\alpha )}\) condition with constant \(C >1\). Let \(\widehat{\mu }\) be the median-of-means estimate of \(\mu \) with K blocks and \(\widehat{m}_{\alpha }\) a similarly obtained estimate of \(m_{\alpha }\) from the samples \((\vert X_i - \widehat{\mu }\vert ^{1+\alpha })_{i\in \llbracket n \rrbracket }\). Then, with probability at least \(1 - 2\exp (-K/18)\), we have
with \(\kappa = \epsilon + 24(1+\alpha ) \Big (\frac{1 + \epsilon }{n/K}\Big )^{\frac{\alpha }{1+\alpha }}\) and \(\epsilon = \Big ( \frac{3\times 2^{2+\alpha }(1 + C^{(1+\alpha )^2})}{(n/K)^{\alpha }} \Big )^{\frac{1}{1+\alpha }}\).
Proof
Let \(\widehat{\mu }\) be the MOM estimate of \(\mu \) with K blocks. Using Lemma 2, we have, with probability at least \(1 -\exp (-K/18)\),
Let \(\widehat{m}_{\alpha }\) be the MOM estimate of \(m_{\alpha }\) obtained from the samples \(\big (\vert X_i - \widehat{\mu }\vert ^{1+\alpha }\big )_{i\in \llbracket n \rrbracket }\). Denoting by \(B_1, \dots , B_K\) the blocks used, we have:
for any \(i \in \llbracket n \rrbracket \). Let \(N = n/K\). Using the convexity of the function \(f(x) = \vert x\vert ^{1+\alpha }\), we find that:
where the last step uses Jensen’s inequality. Using Lemma 7 we have, for \(\delta > 0\), the concentration bound
which, using that X satisfies the \(L^{(1+\alpha )^2}\)-\(L^{(1+\alpha )}\) condition, translates to
Replacing \(\epsilon \) with \(\epsilon m_{\alpha }\) we find
Now conditioning on the event (B23) and using the previous bound with \(\epsilon = \Big ( \frac{3\times 2^{\alpha }\big (1 + C^{(1+\alpha )^2}\big )}{N^{\alpha }\delta } \Big )^{\frac{1}{1+\alpha }}\) in (B24), we obtain that
Now define \(U_j\) as the indicator variable of the event in the last probability; we have just seen that it has success probability less than \(\delta \). We can now use the MOM trick: assuming the number of outliers satisfies \(\vert {\mathcal {O}}\vert \le K(1- \varepsilon )/2\) for some \(\varepsilon \in (0,1)\), we have, for \(S = \sum _j U_j\),
Taking \(\varepsilon = 5/6\) and \(\delta = 1/4\) yields that the previous probability is at most \(\exp (-K/18)\). Finally, recall that we conditioned on the event where the deviation \(\vert \mu - \widehat{\mu }\vert \) is bounded as stated previously, and that this event holds with probability at least \(1 - \exp (-K/18)\). Taking this conditioning into account, a union bound argument leads to the fact that the bound
holds with probability at least \(1 - 2\exp (-K/18)\). \(\square \)
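The two-stage construction of Lemma 10 can be sketched as follows. The helper names `mom` and `mom_mean_and_moment` are illustrative, not linlearn's API: the mean is first estimated by median-of-means, then the centered \((1+\alpha )\)-moment is estimated by a second median-of-means pass over the samples \(\vert X_i - \widehat{\mu }\vert ^{1+\alpha }\).

```python
import numpy as np

def mom(values, K, rng):
    """Median of K equal-sized block means, blocks drawn after a shuffle."""
    v = rng.permutation(np.asarray(values, dtype=float))
    n = (len(v) // K) * K
    return float(np.median(v[:n].reshape(K, -1).mean(axis=1)))

def mom_mean_and_moment(x, K, alpha=1.0, seed=0):
    """Two-stage estimation as in Lemma 10: a MOM estimate of the mean,
    then a MOM estimate of the centered (1+alpha)-moment built from the
    samples |X_i - mu_hat|^(1+alpha).  Names are illustrative sketches."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    mu_hat = mom(x, K, rng)
    m_alpha_hat = mom(np.abs(x - mu_hat) ** (1 + alpha), K, rng)
    return mu_hat, m_alpha_hat

rng = np.random.default_rng(2)
x = rng.standard_normal(10_000)
mu_hat, m1_hat = mom_mean_and_moment(x, K=20, alpha=1.0)
# With alpha = 1 the centered moment is the variance.
```

For \(\alpha = 1\) on standard Gaussian data, both estimates concentrate around 0 and 1 respectively, each with the exponential deviation guarantee \(1 - 2\exp (-K/18)\) stated above.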
B.16 Proof of Theorem 7
This proof is inspired by Theorem 5 in Nesterov (2012) and Theorem 1 in Shalev-Shwartz and Tewari (2011), while keeping track of the degradations caused by the errors on the gradient coordinates.
We condition on the event (B5) and denote \(\epsilon _j = \epsilon _j(\delta )\) and \(\epsilon _{Euc} = \Vert \epsilon (\delta )\Vert _2\). We define for all \(\theta \in \Theta \)
and denote by \(\theta ^{(t)}\) the optimization iterates for \(t=0,\dots , T\) and by \(j_t\) the random coordinate sampled at step t, and let \(\widehat{g}_t = \widehat{g}_{j_t}(\theta ^{(t)})\) for brevity. The point \(u_{j_t}(\theta ^{(t)})\) satisfies the following optimality condition
where \(\rho _t = \text {sign}\big (u_{j_t}(\theta ^{(t)}) - \theta ^{(t)}_{j_t}\big )\). Using this condition for \(\vartheta = \theta ^{(t)}_{j_t}\) and the coordinate-wise Lipschitz smoothness property of R we find
Defining the potential \(\Phi (\theta ) = \sum _{j=1}^d L_j(\theta _j - \theta ^\star _j)^2\), we have:
where the first inequality uses the optimality condition with \(\vartheta = \theta ^\star _{j_t}\) and the second one uses (B25). Now, defining \(\Psi (\theta ) = \frac{1}{2} \Phi (\theta ) + R(\theta )\), taking the expectation w.r.t. \(j_t\) and using the convexity of R and a Cauchy-Schwarz inequality, we find
Recall that, according to (B26), we have \(R (\theta ^{(t+1)}) \le R (\theta ^{(t)})\); summing over \(t = 0,\dots , T\), we find:
which yields the result after multiplying by \(\frac{d}{T+1}\). To finish, we show that, conditionally on any choice of \(j_t\), we have \(\Vert \theta ^{(t+1)} - \theta ^\star \Vert _2 \le \Vert \theta ^{(t)} - \theta ^\star \Vert _2\). Indeed, a straightforward computation yields
We need to show that \(\delta _t^2 \le -2 \delta _t \big (\theta ^{(t)}_{j_t} - \theta ^\star _{j_t}\big )\) with \(\delta _t = u_{j_t}(\theta ^{(t)}) - \theta ^{(t)}_{j_t}\). Notice that \(\delta _t\) always has the opposite sign of \(g_{j_t}(\theta ^{(t)})\) (thanks to the thresholding), so by convexity of R along the coordinate \(j_t\) we have \(\delta _t \big (\theta ^{(t)}_{j_t} - \theta ^\star _{j_t}\big ) \le 0\), and it remains to show that \(\vert \delta _t\vert \le 2\big \vert \theta ^{(t)}_{j_t} - \theta ^\star _{j_t}\big \vert \), which can be seen from
which concludes the proof of Theorem 7.
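The mechanism analyzed in Theorem 7 can be illustrated with a simplified sketch for least squares. This is not linlearn's actual implementation: the function names are illustrative, the per-coordinate Lipschitz constants \(L_j\) are crudely taken as the empirical mean of \(X_{\cdot j}^2\), and the thresholding step of the paper's update is omitted. At each iteration, a single coordinate is sampled and updated immediately using a median-of-means estimate of one partial derivative, which is the computational advantage over robustly estimating the whole gradient.

```python
import numpy as np

def mom(values, K, rng):
    """Median of K equal-sized block means, blocks drawn after a shuffle."""
    v = rng.permutation(np.asarray(values, dtype=float))
    m = (len(v) // K) * K
    return float(np.median(v[:m].reshape(K, -1).mean(axis=1)))

def robust_cgd_least_squares(X, y, K=20, n_iter=2000, seed=0):
    """Coordinate gradient descent for least squares where each partial
    derivative is a median-of-means estimate over the per-sample
    gradients (x_i . theta - y_i) * x_ij.  A simplified sketch of the
    idea, not linlearn's implementation."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    L = (X ** 2).mean(axis=0) + 1e-12            # crude L_j estimates
    for _ in range(n_iter):
        j = int(rng.integers(d))                 # sample a coordinate
        residual = X @ theta - y
        g_j = mom(residual * X[:, j], K, rng)    # robust partial derivative
        theta[j] -= g_j / L[j]                   # single-coordinate update
    return theta

# Recover a linear model despite a handful of corrupted labels.
rng = np.random.default_rng(3)
X = rng.standard_normal((1000, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.standard_normal(1000)
y[:5] = 50.0                                     # corrupted rows
theta_hat = robust_cgd_least_squares(X, y)
```

With 5 corrupted labels and K = 20 blocks, at most 5 block means per partial derivative are contaminated, so the median-based estimate stays close to the clean gradient and the iterates converge near \(\theta ^\star \).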
About this article
Cite this article
Merad, I., Gaïffas, S. Robust supervised learning with coordinate gradient descent. Stat Comput 33, 116 (2023). https://doi.org/10.1007/s11222-023-10283-7