Check your outliers! An introduction to identifying statistical outliers in R with easystats

Thériault, Rémi; Ben-Shachar, Mattan S.; Patil, Indrajeet; Lüdecke, Daniel; Wiernik, Brenton M.; Makowski, Dominique

doi:10.3758/s13428-024-02356-w

Original Manuscript
Published: 25 March 2024

Volume 56, pages 4162–4172, (2024)

Behavior Research Methods

Abstract

Beyond the challenge of keeping up to date with current best practices regarding the diagnosis and treatment of outliers, an additional difficulty arises concerning the mathematical implementation of the recommended methods. Here, we provide an overview of current recommendations and best practices and demonstrate how they can easily and conveniently be implemented in the R statistical computing software, using the {performance} package of the easystats ecosystem. We cover univariate, multivariate, and model-based statistical outlier detection methods, their recommended thresholds, standard output, and plotting methods. We conclude by reviewing the different theoretical types of outliers, whether to exclude or winsorize them, and the importance of transparency. A preprint of this paper is available at: 10.31234/osf.io/bu6nt.

Data availability

This paper first appeared as a preprint (https://doi.org/10.31234/osf.io/bu6nt) and is also available as an online vignette at: https://easystats.github.io/performance/articles/check_outliers. All data used in this paper are included with base R.

Code availability

The performance package is available at the package's official website (https://easystats.github.io/performance), on CRAN (https://cran.r-project.org/package=performance), and on the R-Universe (https://easystats.r-universe.dev/performance). The source code is available on GitHub (https://github.com/easystats/performance/), and the package can be installed from CRAN with install.packages("performance"). The code to reproduce figures and all analyses in this paper is available at https://osf.io/eqja6/.

Notes

Note that check_outliers() only checks numeric variables.
3.29 is an approximation of the two-tailed critical value for p < .001, obtained through qnorm(p = 1 - 0.001 / 2). We chose this threshold for consistency with the thresholds of all our other methods.
Note that univariate outlier detection methods might not be the optimal way of treating reaction time outliers (Ratcliff, 1993; Van Zandt & Ratcliff, 1995).
Our default threshold for the MCD method is defined by stats::qchisq(p = 1 - 0.001, df = ncol(x)), which again is an approximation of the critical value for p < .001 consistent with the thresholds of our other methods.
Our default threshold for the Cook method is defined by stats::qf(0.5, ncol(x), nrow(x) - ncol(x)). Unlike the p < .001 critical values used for our other methods, the value 0.5 here represents the median of the implied F distribution for D, which allows us to flag D values that are “above average”.
Some authors provide much more detailed classifications of outliers; for example, see Table 1 in Aguinis et al. (2013), for 14 different outlier definitions based on a literature review.
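
The default thresholds described in these notes can be recomputed with base R alone. The following is a minimal sketch under stated assumptions, not the package's internal code: the data frame `x` (two columns of `mtcars`) is a hypothetical stand-in chosen only so that ncol(x) and nrow(x) have concrete values, and the variable names are illustrative.

```r
# Minimal sketch: recompute the default thresholds from the notes with base R.
# `x` is a hypothetical example data set (an assumption, used for illustration).
x <- datasets::mtcars[, c("mpg", "hp")]

# Two-tailed normal critical value for p < .001 (the 3.29 value in the notes)
z_threshold <- stats::qnorm(p = 1 - 0.001 / 2)

# MCD: chi-square critical value for p < .001, with df = number of variables
mcd_threshold <- stats::qchisq(p = 1 - 0.001, df = ncol(x))

# Cook's distance: median (0.5 quantile) of the implied F distribution
cook_threshold <- stats::qf(0.5, ncol(x), nrow(x) - ncol(x))

# Show all three thresholds, rounded to two decimals
print(round(c(zscore = z_threshold, mcd = mcd_threshold, cook = cook_threshold), 2))
```

Because the MCD and Cook thresholds depend on ncol(x) and nrow(x), they will differ for data sets with other dimensions; only the normal critical value is fixed.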

References

Aguinis, H., Gottfredson, R. K., & Joo, H. (2013). Best-practice recommendations for defining, identifying, and handling outliers. Organizational Research Methods, 16(2), 270–301. https://doi.org/10.1177/1094428112470848
Anders, R., Alario, F., Van Maanen, L., et al. (2016). The shifted Wald distribution for response time data analysis. Psychological Methods, 21(3), 309. https://doi.org/10.1037/met0000066
Aruguete, M. S., Huynh, H., Browne, B. L., Jurs, B., Flint, E., & McCutcheon, L. E. (2019). How serious is the ‘carelessness’ problem on Mechanical Turk? International Journal of Social Research Methodology, 22(5), 441–449. https://doi.org/10.1080/13645579.2018.1563966
Brown, S. D., & Heathcote, A. (2008). The simplest complete model of choice response time: Linear ballistic accumulation. Cognitive Psychology, 57(3), 153–178. https://doi.org/10.1016/j.cogpsych.2007.12.002
Cao, N., Lin, Y. R., Gotz, D., & Du, F. (2018). Z-Glyph: Visualizing outliers in multivariate data. Information Visualization, 17(1), 22–40. https://doi.org/10.1177/1473871616686635
Chaloner, K., & Brant, R. (1988). A Bayesian approach to outlier detection and residual analysis. Biometrika, 75(4), 651–659. https://doi.org/10.1093/biomet/75.4.651
Ciccione, L., Dehaene, G., & Dehaene, S. (2023). Outlier detection and rejection in scatterplots: Do outliers influence intuitive statistical judgments? Journal of Experimental Psychology: Human Perception and Performance, 49(1), 129–144. https://doi.org/10.1037/xhp0001065
Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19(1), 15–18. https://doi.org/10.1080/00401706.1977.10489493
Curran, P. G. (2016). Methods for the detection of carelessly invalid responses in survey data. Journal of Experimental Social Psychology, 66, 4–19. https://doi.org/10.1016/j.jesp.2015.07.006
Gnanadesikan, R., & Kettenring, J. R. (1972). Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics, 28(1), 81–124. https://doi.org/10.2307/2528963
Goldammer, P., Annen, H., Stöckli, P. L., & Jonas, K. (2020). Careless responding in questionnaire measures: Detection, impact, and remedies. The Leadership Quarterly, 31(4), 101384. https://doi.org/10.1016/j.leaqua.2020.101384
Leys, C., Ley, C., Klein, O., Bernard, P., & Licata, L. (2013). Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. Journal of Experimental Social Psychology, 49(4), 764–766. https://doi.org/10.1016/j.jesp.2013.03.013
Leys, C., Klein, O., Dominicy, Y., & Ley, C. (2018). Detecting multivariate outliers: Use a robust variant of the Mahalanobis distance. Journal of Experimental Social Psychology, 74, 150–156. https://doi.org/10.1016/j.jesp.2017.09.011
Leys, C., Delacre, M., Mora, Y. L., Lakens, D., & Ley, C. (2019). How to classify, detect, and manage univariate and multivariate outliers, with emphasis on pre-registration. International Review of Social Psychology. https://doi.org/10.5334/irsp.289
Lüdecke, D., Ben-Shachar, M. S., Patil, I., Waggoner, P., & Makowski, D. (2021). performance: An R package for assessment, comparison and testing of statistical models. Journal of Open Source Software, 6(60), 3139. https://doi.org/10.21105/joss.03139
Lüdecke, D., Makowski, D., Ben-Shachar, M. S., Patil, I., Wiernik, B. M., Bacher, E., & Thériault, R. (2023). easystats: Streamline model interpretation, visualization, and reporting. R package version 0.7.0. Retrieved February 26, 2024, from https://easystats.github.io/easystats/
McElreath, R. (2020). Statistical rethinking: A Bayesian course with examples in R and stan. CRC Press.
McNeil, D. R. (1977). Interactive Data Analysis: A Practical Primer. Wiley.
Miller, J. (2023). Outlier exclusion procedures for reaction time analysis: The cures are generally worse than the disease. Journal of Experimental Psychology: General. https://doi.org/10.1037/xge0001450
Patil, I., Makowski, D., Ben-Shachar, M. S., Wiernik, B. M., Bacher, E., & Lüdecke, D. (2022). datawizard: An R package for easy data preparation and statistical transformations. Journal of Open Source Software, 7(78), 4684. https://doi.org/10.21105/joss.04684
Ratcliff, R. (1993). Methods for dealing with reaction time outliers. Psychological Bulletin, 114(3), 510. https://doi.org/10.1037/0033-2909.114.3.510
Ratcliff, R., Smith, P. L., Brown, S. D., & McKoon, G. (2016). Diffusion decision model: Current issues and history. Trends in Cognitive Sciences, 20(4), 260–281. https://doi.org/10.1016/j.tics.2016.01.007
Rouder, J. N., Province, J. M., Morey, R. D., Gomez, P., & Heathcote, A. (2015). The lognormal race: A cognitive-process model of choice and latency with desirable psychometric properties. Psychometrika, 80, 491–513. https://doi.org/10.1007/s11336-013-9396-3
Schramm, P., & Rouder, J. N. (2019). Are reaction time transformations really beneficial? PsyArXiv. https://doi.org/10.31234/osf.io/9ksa6
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
Smiti, A. (2020). A critical overview of outlier detection methods. Computer Science Review, 38, 100306. https://doi.org/10.1016/j.cosrev.2020.100306
Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/winsorization 1. Sankhyā: The Indian Journal of Statistics, Series A, 331–352.
Van Zandt, T., & Ratcliff, R. (1995). Statistical mimicking of reaction time data: Single-process models, parameter variability, and mixtures. Psychonomic Bulletin & Review, 2(1), 20–54. https://doi.org/10.3758/BF03214411
Ward, M. K., & Meade, A. W. (2023). Dealing with careless responding in survey data: Prevention, identification, and recommended best practices. Annual Review of Psychology, 74(1), 577–596. https://doi.org/10.1146/annurev-psych-040422-045007
Yentes, R. D., & Wilhelm, F. (2023). careless: Procedures for computing indices of careless responding. R package version 1.2.2. Retrieved February 26, 2024, from https://cran.r-project.org/package=careless
Zijlstra, W. P., van der Ark, L. A., & Sijtsma, K. (2011). Outliers in questionnaire data: Can they be detected and should they be removed? Journal of Educational and Behavioral Statistics, 36(2), 186–212. https://doi.org/10.3102/1076998610366263

Acknowledgements

{performance} is part of the collaborative easystats ecosystem (Lüdecke et al., 2023). Thus, we thank all members of easystats, contributors, and users alike.

Funding

This research received no external funding.

Author information

Authors and Affiliations

Department of Psychology, Université du Québec à Montréal, Succursale Centre-Ville, C.P. 8888, Montréal, Québec, H3C 3P8, Canada
Rémi Thériault
Independent Researcher, Ramat Gan, Israel
Mattan S. Ben-Shachar
Center for Humans and Machines, Max Planck Institute for Human Development, Berlin, Germany
Indrajeet Patil
Institute of Medical Sociology, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
Daniel Lüdecke
Independent Researcher, Tampa, FL, USA
Brenton M. Wiernik
School of Psychology, University of Sussex, Brighton, UK
Dominique Makowski

Contributions

Writing - Original draft preparation: RT. Writing - Reviewing and Editing, Software: RT, MSB-S, IP, DL, BMW, and DM.

Corresponding author

Correspondence to Rémi Thériault.

Ethics declarations

Competing interests

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Thériault, R., Ben-Shachar, M.S., Patil, I. et al. Check your outliers! An introduction to identifying statistical outliers in R with easystats. Behav Res 56, 4162–4172 (2024). https://doi.org/10.3758/s13428-024-02356-w

Accepted: 02 February 2024
Published: 25 March 2024
Issue Date: June 2024
DOI: https://doi.org/10.3758/s13428-024-02356-w
