Abstract
A theory cannot be fully validated unless original results have been replicated, yielding consistent conclusions. Replications are the strongest source of evidence for verifying research findings and knowledge claims. In the social sciences, replication studies often fail, and thus there is a continuing need for replications that confirm tentative facts, expand knowledge to gain new understanding, and verify hypotheses. Failure to replicate in the social and behavioral sciences sometimes arises from dissimilarity between the hypotheses formulated in the original and replication studies. Alternatively, failure to replicate also occurs when the same hypothesis is tested but in the absence of knowledge from previous investigations, as when original study effect sizes are not considered in replication studies. To increase the replicability of research findings, this paper demonstrates that applying two one-sided tests to evaluate a replication question provides a superior means for conducting replications, assuming all other methodological procedures remain as similar as possible. Furthermore, this paper explores the impact of heteroscedasticity and unbalanced designs in replication studies across four paired conditions of variance and sample size. Two Monte Carlo simulations, each with two stages, were conducted to investigate conclusion consistency among different replication procedures and to determine the repeatability of an observed effect. Overall, the proposed approach yielded a higher proportion of successful replications than the conventional approach (testing the original null hypothesis of no effect). Thus, findings can be confirmed by replications; in the absence of such confirmation, no final statement can be made about any theory.
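The two replication criteria contrasted in the abstract can be sketched with a small Monte Carlo simulation. This is only a rough illustration: the effect size, margin, sample sizes, and number of repetitions below are illustrative assumptions and do not reproduce the paper's two-stage simulation design.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
d_true = 0.4   # hypothetical true mean difference (illustrative)
n = 100        # per-group sample size in both studies (illustrative)
margin = 0.5   # hypothetical margin around the original effect
alpha = 0.05
reps = 2000

conventional = proposed = 0
for _ in range(reps):
    # original study: two-group design, treatment mean shifted by d_true
    d_orig = rng.normal(d_true, 1, n).mean() - rng.normal(0, 1, n).mean()
    # replication study drawn from the same population
    x, y = rng.normal(d_true, 1, n), rng.normal(0, 1, n)
    d_rep = x.mean() - y.mean()
    se = np.sqrt(x.var(ddof=1) / n + y.var(ddof=1) / n)
    df = 2 * n - 2  # pooled df; Welch df would guard against heteroscedasticity
    # conventional criterion: one-sided test of the original null of no effect
    if stats.t.sf(d_rep / se, df) < alpha:
        conventional += 1
    # two one-sided tests: is the replication effect within +/- margin of d_orig?
    t_lo = (d_rep - (d_orig - margin)) / se
    t_hi = (d_rep - (d_orig + margin)) / se
    if stats.t.sf(t_lo, df) < alpha and stats.t.cdf(t_hi, df) < alpha:
        proposed += 1

print(f"conventional: {conventional / reps:.3f}, TOST-based: {proposed / reps:.3f}")
```

The key design difference is that the two one-sided tests criterion uses the original study's effect size as its reference point, whereas the conventional criterion ignores it and retests the null of no effect.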
Figures 1–7 are available in the online version of this article.
Notes
Both superiority and non-inferiority tests are special cases of one-sided tests. Under a superiority test, rejecting the null hypothesis implies that a treatment mean exceeds a reference mean \(\left(\delta ={\mu }_{A}-{\mu }_{B}\right)\) by more than the superiority margin \(\left({M}_{s}\right)\), which can be defined as the smallest difference from the reference that is considered meaningful. In contrast, a non-inferiority test aims to show that a treatment mean falls below a reference mean by no more than the non-inferiority margin \(\left({M}_{NI}\right)\), which can be defined as the largest change from the baseline (zero) that is considered trivial. An equivalence test, or two one-sided tests, combines the superiority and non-inferiority tests: rejecting both null hypotheses implies that the mean difference lies between specific limits, with \({M}_{s}\) as the lower limit and \({M}_{NI}\) as the upper limit.
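The two one-sided tests described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Welch standard error and degrees of freedom, the margins, and the simulated data are all assumptions chosen to show the procedure under heteroscedasticity and unbalanced groups.

```python
import numpy as np
from scipy import stats

def tost_two_means(x, y, lower, upper, alpha=0.05):
    """Two one-sided tests (TOST) for equivalence of two independent means.

    Welch's standard error and degrees of freedom are used so the procedure
    remains reasonable under heteroscedasticity and unbalanced group sizes.
    Returns the TOST p value (the larger of the two one-sided p values) and
    a boolean indicating whether equivalence is concluded at level alpha.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    diff = x.mean() - y.mean()
    vx, vy = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
    se = np.sqrt(vx + vy)
    # Welch-Satterthwaite degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    p_lower = stats.t.sf((diff - lower) / se, df)   # H0: diff <= lower
    p_upper = stats.t.cdf((diff - upper) / se, df)  # H0: diff >= upper
    p = max(p_lower, p_upper)
    return p, p < alpha

# Illustrative data: unequal variances and unbalanced sample sizes
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 500)   # treatment group
y = rng.normal(0.0, 1.5, 300)   # reference group
p, equivalent = tost_two_means(x, y, lower=-0.5, upper=0.5)
print(p, equivalent)
```

Equivalence is declared only when both one-sided nulls are rejected, i.e., when the mean difference is shown to lie strictly inside the interval (lower, upper).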
7.5% = (39,267 + 90,880 + 142,894)/(1,083,733 + 1,115,120 + 1,185,105 + 39,267 + 90,880 + 142,894).
Funding
Edward Applegate and Chris Coryn declared that no funds, grants, or other support were received during the preparation of this manuscript. Pedro Mateu has received research support from the Office of the Vice Chancellor for Research at Universidad del Pacífico (Lima, Peru).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declared no conflicts of interest with respect to the authorship and/or publication of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mateu, P., Applegate, B. & Coryn, C.L. Towards more credible conceptual replications under heteroscedasticity and unbalanced designs. Qual Quant 58, 723–751 (2024). https://doi.org/10.1007/s11135-023-01657-0