Abstract
This chapter reviews diagnostic and robust procedures for detecting outliers and other interesting observations in linear regression. First, we present statistics for detecting single outliers and influential observations and show their limitations when multiple outliers occur in high-leverage situations. Second, we discuss diagnostic procedures designed to avoid masking: they first find a clean subset for estimating the parameters and then enlarge it by incorporating, one by one, new homogeneous observations until a heterogeneous observation is found. We also discuss procedures based on sensitive observations for detecting high-leverage outliers in large data sets using the eigenvectors of a sensitivity matrix. We briefly review robust estimation methods and their relationship with diagnostic procedures. Next, we consider large high-dimensional data sets, where iterative procedures can be slow, and show that the joint use of simple univariate statistics, such as predictive residuals, Cook’s distances, and Peña’s sensitivity statistic, can be a useful diagnostic tool. We also comment on other recent procedures based on regularization and sparse estimation and conclude with a brief analysis of the relationship between outlier detection and cluster analysis. A real-data example and a simulated example illustrate the procedures presented in the chapter.
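As a rough illustration of the univariate statistics named in the abstract, the following Python sketch computes leverages, predictive (leave-one-out) residuals, Cook's distances, and Peña's sensitivity statistic from a single least-squares fit. The function name `regression_diagnostics`, the use of NumPy, and the closed-form expressions are assumptions of this sketch (the usual textbook forms, with Peña's statistic written as S_i = Σ_j h_ij² e_j² / (p s² h_ii (1 − h_jj)²)); the chapter itself may define or scale these quantities differently.

```python
import numpy as np

def regression_diagnostics(X, y):
    """Single-fit diagnostics for the linear model y = X b + u.

    X is an (n, p) design matrix of full column rank (include a column of
    ones if an intercept is wanted); y has length n.
    Returns (leverages, predictive residuals, Cook's D, Pena's S).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    # Hat matrix H = X (X'X)^{-1} X', built from a reduced QR factorization.
    Q, _ = np.linalg.qr(X)
    H = Q @ Q.T
    h = np.diag(H)                      # leverages h_ii

    e = y - H @ y                       # ordinary residuals
    s2 = e @ e / (n - p)                # residual variance estimate

    # Predictive residual: y_i minus its prediction when case i is deleted.
    pred = e / (1.0 - h)

    # Cook's distance: scaled change in all fitted values when case i is deleted.
    cook = e**2 * h / (p * s2 * (1.0 - h) ** 2)

    # Pena's statistic: scaled sum over j of the squared change in fitted
    # value i when case j is deleted, (h_ij * e_j / (1 - h_jj))^2.
    pena = (H**2 @ pred**2) / (p * s2 * h)

    return h, pred, cook, pena
```

Because this sketch forms the full n × n hat matrix, it is only meant for small or moderate n; for large data sets the same quantities would need to be accumulated without storing H. Cases that stand out in any of the three statistics are candidates for closer inspection rather than automatic deletion.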
References
Box, G.E.P.: When Murphy speaks listen. Qual. Prog. 22, 79–84 (1989)
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A.: Robust Statistics. Wiley, New York (1986)
Belsley, D.A., Kuh, E., Welsch, R.E.: Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley, New York (1980)
Hawkins, D.M.: Identification of Outliers. Chapman and Hall, New York (1980)
Cook, R.D., Weisberg, S.: Residuals and Influence in Regression. Chapman and Hall, New York (1982)
Atkinson, A.C.: Plots, Transformations and Regression. Clarendon, Oxford (1985)
Chatterjee, S., Hadi, A.S.: Sensitivity Analysis in Linear Regression. Wiley, New York (1988)
Barnett, V., Lewis, T.: Outliers in Statistical Data, 3rd edn. Wiley, New York (1994)
Atkinson, A.C., Riani, M.: Robust Diagnostic Regression Analysis. Springer, Berlin, Heidelberg, New York (2012)
Carroll, R.J.: Transformation and Weighting in Regression. Routledge (2017)
Cook, R.D.: Detection of influential observations in linear regression. Technometrics 19, 15–18 (1977)
Cook, R.D.: Assessment of local influence (with discussion). J. R. Stat. Soc. B 48(2), 133–169 (1986)
Suárez Rancel, M., González Sierra, M.A.: Regression diagnostic using local influence: A review. Commun. Stat. A 30, 799–813 (2001)
Hartless, G., Booth, J.G., Littell, R.C.: Local influence of predictors in multiple linear regression. Technometrics 45, 326–332 (2003)
Critchley, F., Atkinson, R.A., Lu, G., Biazi, E.: Influence analysis based on the case sensitivity function. J. R. Stat. Soc. B 63(2), 307–323 (2001)
Lawrance, J.: Deletion influence and masking in regression. J. R. Stat. Soc. B 57, 181–189 (1995)
Hawkins, D.M., Bradu, D., Kass, G.V.: Location of several outliers in multiple regression data using elemental sets. Technometrics 26, 197–208 (1984)
Gray, J.B., Ling, R.F.: K-Clustering as a detection tool for influential subsets in regression. Technometrics 26, 305–330 (1984)
Marasinghe, M.G.: A multistage procedure for detecting several outliers in linear regression. Technometrics 27, 395–399 (1985)
Kianifard, F., Swallow, W.: Using recursive residuals calculated in adaptively ordered observations to identify outliers in linear regression. Biometrics 45, 571–585 (1989)
Kianifard, F., Swallow, W.: A Monte Carlo comparison of five procedures for identifying outliers in linear regression. Commun. Stat. (Theory and Methods) 19, 1913–1938 (1990)
Hadi, A.S., Simonoff, J.S.: Procedures for the identification of multiple outliers in linear models. J. Am. Stat. Assoc. 88, 1264–1272 (1993)
Hadi, A.S., Simonoff, J.S.: Improving the estimation and outlier identification properties of the least median of squares and minimum volume ellipsoid estimators. Parisankhyan Samikkha 1, 61–70 (1994)
Atkinson, A.C.: Fast very robust methods for the detection of multiple outliers. J. Am. Stat. Assoc. 89, 1329–1339 (1994)
Swallow, W., Kianifard, F.: Using robust scale estimates in detecting multiple outliers in linear regression. Biometrics 52, 545–556 (1996)
Peña, D., Yohai, V.J.: The detection of influential subsets in linear regression using an influence matrix. J. R. Stat. Soc. B 57, 145–156 (1995)
Peña, D., Yohai, V.J.: A fast procedure for robust estimation and diagnostics in large regression problems. J. Am. Stat. Assoc. 94, 434–445 (1999)
Huber, P.: Between Robustness and Diagnosis. In: Stahel, W., Weisberg, S. (eds.) Directions in Robust Statistics and Diagnosis, pp. 121–130. Springer, Berlin, Heidelberg, New York (1991)
Rousseeuw, P.J., Leroy, A.M.: Robust Regression and Outlier Detection. Wiley, New York (1987)
Maronna, R.A., Martin, R.D., Yohai, V.J., Salibián-Barrera, M.: Robust Statistics, Theory and Methods (with R). Wiley, New York (2019)
Box, G.E.P., Tiao, G.C.: A Bayesian approach to some outlier problems. Biometrika 55, 119–129 (1968)
Peña, D., Guttman, I.: Comparing probabilistic models for outlier detection. Biometrika 80(3), 603–610 (1993)
Berger, J.O., Moreno, E., Pericchi, L.R., Bayarri, M.J., Bernardo, J.M., Cano, J.A., …, Dasgupta, A.: An overview of robust Bayesian analysis. Test 3(1), 5–124 (1994)
Justel, A., Peña, D.: Bayesian unmasking in linear models. Comput. Stat. Data Anal. 36, 69–94 (2001)
Hans, C.: Bayesian lasso regression. Biometrika 96(4), 835–845 (2009)
Peña, D.: A new statistic for influence in linear regression. Technometrics 47(1), 1–12 (2005)
Diaz-Garcia, J.A., Gonzalez-Farias, G.: A note on the Cook distance. J. Stat. Plann. Infer. 120, 119–136 (2004)
Cook, R.D., Peña, D., Weisberg, S.: The likelihood displacement. A unifying principle for influence. Commun. Stat. A 17, 623–640 (1988)
Muller, E.K., Mok, M.C.: The distribution of Cook D statistics. Commun. Stat. A 26, 525–546 (1997)
Atkinson, A.C.: Masking unmasked. Biometrika 73, 533–541 (1986)
Wisnowski, J.W., Montgomery, D.C., Simpson, J.R.: A comparative analysis of multiple outliers detection procedures in the linear regression model. Comput. Stat. Data Anal. 36, 351–382 (2001)
Rousseeuw, P.J.: Least median of squares regression. J. Am. Stat. Assoc. 79, 871–880 (1984)
Kashif, M., Amanullah, M., Aslam, M.: Pena’s statistic for the Liu regression. J. Stat. Comput. Simul. 88(13), 2473–2488 (2018)
Hendry, D.F., Johansen, S., Santos, C.: Automatic selection of indicators in a fully saturated regression. Computational Statistics 23, 317–335 (2008); Erratum, 337–339
Hendry, D.F., Doornik, J.A.: Empirical Model Discovery and Theory Evaluation: Automatic Selection Methods in Econometrics. MIT Press (2014)
Johansen, S., Nielsen, B.: Asymptotic theory of outlier detection algorithms for linear time series regression models. Scand. J. Stat. 43(2), 321–348 (2016)
She, Y., Owen, A.B.: Outlier detection using nonconvex penalized regression. J. Am. Stat. Assoc. 106(494), 626–639 (2011)
Kong, D., Bondell, H., Wu, Y.: Fully efficient robust estimation, outlier detection and variable selection via penalized regression. Statistica Sinica 28, 1031–1052 (2018)
Peña, D., Rodriguez, J., Tiao, G.C.: Identifying mixtures of regression equations by the SAR procedure (with discussion). In: Bernardo et al., Bayesian Statistics, vol. 7, pp. 327–347. Oxford Univ. Press, Oxford (2003)
García-Escudero, L.A., Gordaliza, A., San Martin, R., Van Aelst, S., Zamar, R.: Robust linear clustering. J. Roy. Stat. Soc. Ser. B (Statistical Methodology) 71(1), 301–318 (2009)
García-Escudero, L.A., Gordaliza, A., Mayo-Iscar, A., San Martín, R.: Robust clusterwise linear regression through trimming. Comput. Stat. Data Anal. 54(12), 3057–3069 (2010)
Acknowledgements
This research has been supported by Grant ECO2015-66593-P of MINECO/FEDER/UE, Spain.
Copyright information
© 2023 Springer-Verlag London Ltd., part of Springer Nature
About this chapter
Cite this chapter
Peña, D. (2023). Detecting Outliers and Influential and Sensitive Observations in Linear Regression. In: Pham, H. (eds) Springer Handbook of Engineering Statistics. Springer Handbooks. Springer, London. https://doi.org/10.1007/978-1-4471-7503-2_31
DOI: https://doi.org/10.1007/978-1-4471-7503-2_31
Publisher Name: Springer, London
Print ISBN: 978-1-4471-7502-5
Online ISBN: 978-1-4471-7503-2
eBook Packages: Mathematics and Statistics (R0)