How (not) to measure replication
Tóm tắt
The replicability crisis refers to the apparent failures to replicate both important and typical positive experimental claims in psychological science and biomedicine, failures which have gained increasing attention in the past decade. In order to provide evidence that there is a replicability crisis in the first place, scientists have developed various measures of replication that help quantify or “count” whether one study replicates another. In this nontechnical essay, I critically examine five types of replication measures used in the landmark article “Estimating the reproducibility of psychological science” (Open Science Collaboration, Science, 349, ac4716, 2015) based on the following techniques: subjective assessment, null hypothesis significance testing, comparing effect sizes, comparing the original effect size with the replication confidence interval, and meta-analysis. The first four, I argue, remain unsatisfactory for a variety of conceptual or formal reasons, even taking into account various improvements. By contrast, at least one version of the meta-analytic measure does not suffer from these problems. It differs from the others in rejecting dichotomous conclusions, the assumption that one study replicates another or not simpliciter. I defend it from other recent criticisms, concluding however that it is not a panacea for all the multifarious problems that the crisis has highlighted.
Tài liệu tham khảo
Anderson, C.J., Bahník, Š., Barnett-Cowan, M., Bosco, F.A., Chandler, J., Chartier, C.R., Cheung, F., Christopherson, C.D., Cordes, A., Cremata, E.J., Della Penna, N., Estel, V., Fedor, A., Fitneva, S.A., Frank, M.C., Grange, J.A., Hartshorne, J. K., Hasselman, F., Henninger, F., van der Hulst, M., Jonas, K.J., Lai, C.K., Levitan, C.A., Miller, J.K., Moore, K.S., Meixner, J.M., Munafò, M. R., Neijenhuijs, K.I., Nilsonne, G., Nosek, B.A., Plessow, F., Prenoveau, J.M., Ricker, A.A., Schmidt, K., Spies, J.R., Stieger, S., Strohminger, N., Sullivan, G.B., van Aert, R.C.M., van Assen, M.A.L.M., Vanpaemel, W., Vianello, M., Voracek, M., & Zuni, K. (2016). Response to comment on “Estimating the reproducibility of psychological science”. Science, 351(6277), 1037–c.
Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452–454.
Begley, C.G., & Ellis, L.M. (2012). Raise standards for preclinical cancer research: Drug development. Nature, 483(7391), 531–533.
Braver, S.L., Thoemmes, F.J., & Rosenthal, R. (2014). Continuously cumulating meta-analysis and replicability. Perspectives on Psychological Science, 9(3), 333–342.
Camerer, C.F., Dreber, A., Holzmeister, F., Ho, T.-H., Huber, J., Johannesson, M., Kirchler, M., Nave, G., Nosek, B.A., Pfeiffer, T., Altmejd, A., Buttrick, N., Chan, T., Chen, Y., Forsell, E., Gampa, A., Heikensten, E., Hummer, L., Imai, T., Isaksson, S., Manfredi, D., Rose, J., Wagenmakers, E.-J., & Wu, H. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2(9), 637–644.
Carter, E.C., Schönbrodt, F.D., Gervais, W.M., & Hilgard, J. (2019). Correcting for bias in psychology: A comparison of meta-analytic methods. Advances in Methods and Practices in Psychological Science, 2(2), 115–144.
Cox, D.R., & Hinkley, D. (1974). Theoretical statistics. London: Chapman and Hall.
Cumming, G. (2013). Understanding the new statistics: Effect sizes, confidence intervals and meta-analysis. London: Routledge.
Douglas, H.E. (2009). Science, policy and the value-free ideal. Pittsburgh: University of Pittsburgh Press.
Dreber, A., Pfeiffer, T., Almenberg, J., Isaksson, S., Wilson, B., Chen, Y., Nosek, B.A., & Johannesson, M. (2015). Using prediction markets to estimate the reproducibility of scientific research. Proceedings of the National Academy of Sciences, 112(50), 15343–15347.
Earp, B.D., & Trafimow, D. (2015). Replication, falsification, and the crisis of confidence in social psychology. Frontiers in Psychology, 6, 621.
Edwards, A. (1972). Likelihood. Cambridge: Cambridge University Press.
Ellis, P.D. (2010). The essential guide to effect sizes: Statistical power, meta-analysis and the interpretation of research results. Cambridge: Cambridge University Press.
Etz, A., & Vanderkerckhove, J. (2016). A Bayesian perspective on the reproducibility project: Psychology. PLOS ONE, 11(2), e0149794.
Fidler, F., & Wilcox, J. (2018). Reproducibility of scientific results. In Zalta, E.N. (Ed.) The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, winter 2018 edition.
Fletcher, S.C., & Mayo-Wilson, C. (2021). Evidence in classical statistics. Written for Routledge Handbook of Evidence, Maria Lasonen-Aarnio and Clayton Littlejohn, eds.
Forsell, E., Viganola, D., Pfeiffer, T., Almenberg, J., Wilson, B., Chen, Y., Nosek, B.A., Johannesson, M., & Dreber, A. (2019). Predicting replication outcomes in the Many Labs 2 study. Journal of Economic Psychology, 75, 102117.
Gilbert, D.T., King, G., Pettigrew, S., & Wilson, T.D. (2016). Comment on “Estimating the reproducibility of psychological science”. Science, 351 (6277), 1037–b.
Graff, D. (2000). Shifting sands: An interest-relative theory of vagueness. Philosophical Topics, 28(1), 45–81.
Graff Fara, D. (2008). Profiling interest-relativity. Analysis, 68 (4), 326–35.
Hacking, I. (1965). The logic of statistical inference. Cambridge: Cambridge University Press.
Harlow, L.L., Mulaik, S.A., Steiger, J.H, & editors. (1997). What if there were no significance tests? Lawrence Erlbaum Associates.
Holman, B. (2019). In defense of meta-analysis. Synthese, 196 (8), 3189–3211.
Hyde, D., Raffman, D., & Sorites paradox. (2018). In Zalta, E.N. (Ed.) The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, summer 2018 edition.
Joint Committee for Guides in Metrology (JCGM). (2012). International vocabulary of metrology — Basic and general concepts and associated terms (VIM), 3rd edition. https://www.bipm.org/en/publications/guides/vim.html.
Jukola, S. (2015). Meta-analysis, ideals of objectivity, and the reliability of medical knowledge. Science & Technology Studies, 28(3), 101–120.
Kahneman, D., & Klein, G. (2009). Conditions for intuitive expertise: A failure to disagree. American Psychologist, 64(6), 515–526.
Klein, R.A., Ratliff, K.A., Vianello, M., Adams, R.B., Bahník, v., Bernstein, M.J., Bocian, K., Brandt, M.J., Brooks, B., Brumbaugh, C.C., Cemalcilar, Z., Chandler, J., Cheong, W., Davis, W.E., Devos, T., Eisner, M., Frankowska, N., Furrow, D., Galliani, E.M., Hasselman, F., Hicks, J.A., Hovermale, J.F., Hunt, S.J., Huntsinger, J.R., IJzerman, H., John, M.-S., Joy-Gaba, J.A., Barry Kappes, H., Krueger, L.E., Kurtz, J., Levitan, C.A., Mallett, R.K., Morris, W.L., Nelson, A.J., Nier, J.A., Packard, G., Pilati, R., Rutchick, A.M., Schmidt, K., Skorinko, J.L., Smith, R., Steiner, T.G., Storbeck, J., Van Swol, L.M., Thompson, D., van’t Veer, A.E., Ann Vaughn, L., Vranka, M., Wichman, A.L., Woodzicka, J.A., & Nosek, B.A. (2014). Investigating variation in replicability. Social Psychology, 45 (3), 142–152.
Kline, R. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, D.C: American Psychological Association.
Kvarven, A., Strømland, E., & Johannesson, M. (2020). Comparing meta-analyses and preregistered multiple-laboratory replication projects. Nature Human Behaviour, 4(4), 423–434.
Lakens, D., Adolfi, F.G., Albers, C.J., Anvari, F., Apps, M.A., Argamon, S.E., Baguley, T., Becker, R.B., Benning, S.D., Bradford, D.E., & et al. (2018). Justify your alpha. Nature Human Behaviour, 2(3), 168.
Larrick, R.P., & Feiler, D.C. (2015). Expertise in decision making. In Keren, G., & Wu, G. (Eds.) The Wiley Blackwell handbook of judgment and decision making (pp. 696–721). West Sussex: Wiley.
Lipsey, M.W., & Wilson, D.B. (2001). Practical meta-analysis. Thousand Oaks, CA: SAGE.
Machery, E. (2020). What is a replication? Philosophy of Science, 87(4), 545–567.
Mayo, D., & Spanos, A. (2004). Methodology in practice: Statistical misspecification testing. Philosophy of Science, 71(5), 1007–1025.
McCloskey, D.N., & Ziliak, S.T. (2008). The cult of statistical significance: How the standard error costs us jobs, justice and lives. Ann Arbor: University of Michigan Press.
Morrison, D., Henkel, R., & editors. (1970). The significance test controversy. London: Aldine Publishing.
Nosek, B.A., & Errington, T.M. (2017). Reproducibility in cancer biology: Making sense of replications. eLife, 6, e23383.
Nosek, B.A., & Errington, T.M. (2020). What is replication? PLoS Biology, 18(3), e3000691.
Nozick, R. (1981). Philosophical explanations. Cambridge: Cambridge University Press.
Open Science Collaboration (OSC). (2015). Estimating the reproducibility of psychological science. Science, 349(6251), ac4716.
Patel, R., & Alahmad, A.J. (2016). Growth-factor reduced Matrigel source influences stem cell derived brain microvascular endothelial cell barrier properties. Fluids Barriers CNS, 13(6), 1–7.
Patil, P., Peng, R.D., & Leek, J.T. (2016). What should researchers expect when they replicate studies? A statistical view of replicability in psychological science. Perspectives on Psychological Science, 11(4), 539–544.
Reiss, J., & Sprenger, J. (2017). Scientific objectivity. In Zalta, E.N. (Ed.) The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab. Stanford University, winter 2017 edition.
Romero, F. (2016). Can the behavioral sciences self-correct? A social epistemic study. Studies in History and Philosophy of Science, 60, 55–69.
Romero, F. (2017). Novelty vs. replicability: Virtues and vices in the reward system of science. Philosophy of Science, 84(5), 1031–1043.
Romero, F. (2019). Philosophy of science and the replicability crisis. Philosophy Compass, 14(11), e12633.
Rosenthal, R. (1991). Meta-analytic procedures for social research. Beverly Hills, CA,: Sage. Revised edition.
Rosnow, R.L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44(10), 1276–1284.
Royall, R. (1997). Scientific evidence: a likelihood paradigm. London: Chapman and Hall.
Schmidt, F.L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 27 (10), 1173–1181.
Schmidt, F.L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1(2), 115–129.
Schmidt, F.L., & Hunter, J.E. (2015). Methods of meta-analysis: Correcting error and bias in research findings, 3rd edn. Thousand Oaks, CA: Sage.
Schmidt, F.L., & Oh, I.-S. (2016). The crisis of confidence in research findings in psychology: Is lack of replication the real problem? Or is it something else? Archives of Scientific Psychology, 4(1), 32–37.
Schmidt, S. (2009). Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Review of General Psychology, 13(2), 90–100.
Shanteau, J. (1992). The psychology of experts: An alternative view. In Wright, G., & Bolger, F. (Eds.) Expertise and decision support (pp. 11–23). New York: Plenum Press.
Simonsohn, U. (2015). Small telescopes: Detectability and the evaluation of replication results. Psychological Science, 26(5), 559–569.
Stefan, A.M., Evans, N.J., & Wagenmakers, E.-J. (2020). Practical challenges and methodological flexibility in prior elicitation. Psychological Methods. https://doi.org/10.1037/met0000354.
Stegenga, J. (2011). Is meta-analysis the platinum standard of evidence? Studies in History and Philosophy of Biological and Biomedical Sciences, 42(4), 497–507.
van Aert, R.C.M., Wicherts, J.M., & van Assen, M.A.L.M. (2016). Conducting meta-analyses based on p values: Reservations and recommendations for applying p-uniform and p-curve. Perspectives on Psychological Science, 11(5), 713–729.
Wolfers, J., & Zitzewitz, E. (2006). Interpreting prediction market prices as probabilities. Technical report, National Bureau of Economic Research.