Detecting Outliers and Influential and Sensitive Observations in Linear Regression

Chapter in the Springer Handbook of Engineering Statistics, part of the book series Springer Handbooks (SHB).

Abstract

This chapter reviews diagnostic and robust procedures for detecting outliers and other interesting observations in linear regression. First, we present statistics for detecting single outliers and influential observations and show their limitations for multiple outliers in high-leverage situations. Second, we discuss diagnostic procedures designed to avoid masking by first finding a clean subset for estimating the parameters and then increasing its size by incorporating, one by one, new homogeneous observations until a heterogeneous observation is found. We also discuss procedures based on sensitive observations for detecting high-leverage outliers in large data sets using the eigenvectors of a sensitivity matrix. We briefly review robust estimation methods and their relationship with diagnostic procedures. Next, we consider large high-dimensional data sets, where the application of iterative procedures can be slow, and show that the joint use of simple univariate statistics, such as predictive residuals, Cook’s distances, and Peña’s sensitivity statistic, can be a useful diagnostic tool. We also comment on other recent procedures based on regularization and sparse estimation and conclude with a brief analysis of the relationship between outlier detection and cluster analysis. A real data set and a simulated example illustrate the procedures discussed in the chapter.
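The single-case diagnostics named in the abstract (leverages, predictive residuals, and Cook’s distances) are all simple functions of the hat matrix. The Python sketch below is only an illustration of these standard formulas and is not taken from the chapter: the simulated data, the planted outliers, and the rough 4/n cutoff are assumptions of the example.

  import numpy as np

  # Simulated regression data with a few planted outliers (illustrative only).
  rng = np.random.default_rng(0)
  n, p = 100, 3
  X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design with intercept
  y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)
  y[:3] += 5.0                                  # plant three outliers

  # Hat matrix H = X (X'X)^{-1} X' and the classical single-case diagnostics.
  H = X @ np.linalg.solve(X.T @ X, X.T)
  h = np.diag(H)                                # leverages h_ii
  e = y - H @ y                                 # ordinary residuals
  s2 = e @ e / (n - p)                          # residual variance estimate
  pred = e / (1 - h)                            # predictive (leave-one-out) residuals
  t = e / np.sqrt(s2 * (1 - h))                 # internally studentized residuals
  D = (t ** 2 / p) * (h / (1 - h))              # Cook's distances

  # Flag cases with large Cook's distance; 4/n is a common rough cutoff.
  print("Flagged observations:", np.where(D > 4 / n)[0])

As the abstract notes, such single-deletion statistics can be masked by groups of high-leverage outliers, which is what motivates the clean-subset, sensitivity-matrix, and robust procedures reviewed in the chapter.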


Acknowledgements

This research has been supported by Grant ECO2015-66593-P of MINECO/FEDER/UE, Spain.

Author information

Correspondence to Daniel Peña.


Copyright information

© 2023 Springer-Verlag London Ltd., part of Springer Nature

About this chapter


Cite this chapter

Peña, D. (2023). Detecting Outliers and Influential and Sensitive Observations in Linear Regression. In: Pham, H. (ed.) Springer Handbook of Engineering Statistics. Springer Handbooks. Springer, London. https://doi.org/10.1007/978-1-4471-7503-2_31
