Practical significance of item misfit and its manifestations in constructs assessed in large-scale studies

Katharina Fährmann¹, Carmen Köhler¹, Johannes Hartig¹, Jörg-Henrik Heine²
¹DIPF, Leibniz Institute for Research and Information in Education, Frankfurt, Germany
²Centre for International Student Assessment (ZIB), Munich, Germany

Abstract

When scaling psychological tests with methods of item response theory, it is necessary to investigate to what extent the responses correspond to the model predictions. In addition to the statistical evaluation of item misfit, the question arises as to its practical significance. Although item removal is undesirable for several reasons, its practical consequences are rarely investigated, and existing studies focus mostly on main survey data with pre-selected items. In this paper, we identify criteria to evaluate practical significance and discuss them with respect to various types of assessments and their particular purposes. We then demonstrate the practical consequences of item misfit using two data examples from the German PISA 2018 field trial study: one with cognitive data and one with non-cognitive/metacognitive data. For the former, we scale the data under the generalized partial credit model (GPCM) with and without the inclusion of misfitting items and investigate how this influences the trait distribution and the allocation to reading competency levels. For the non-cognitive/metacognitive data, we explore the effect of excluding misfitting items on estimated gender differences. Our results indicate minor practical consequences for person allocation and no changes in the estimated gender-difference effects.
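To make the with/without comparison concrete, the following sketch illustrates its general logic in Python. It is a minimal, self-contained illustration, not the study's operational scaling: the GPCM item parameters, the response pattern, the cut scores, and the helper names (gpcm_probs, theta_mle, level) are invented for this example, and the trait estimate is obtained by simple grid-search maximum likelihood rather than the operational estimation methods used in PISA.

import numpy as np

def gpcm_probs(theta, a, b):
    """Category probabilities of the generalized partial credit model (GPCM)
    for one item with discrimination a and step parameters b."""
    # Cumulative logits a * (theta - b_k); the lowest category gets 0.
    z = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(b)))))
    ez = np.exp(z - z.max())  # subtract the max for numerical stability
    return ez / ez.sum()

def theta_mle(responses, items, grid=np.linspace(-4, 4, 401)):
    """Grid-search maximum likelihood estimate of theta for one response pattern.
    Items whose response is None are treated as excluded from scaling."""
    loglik = np.zeros_like(grid)
    for x, (a, b) in zip(responses, items):
        if x is None:
            continue
        loglik += np.log([gpcm_probs(t, a, b)[x] for t in grid])
    return grid[np.argmax(loglik)]

def level(theta, cuts=(-1.0, 0.0, 1.0)):
    """Allocate a theta estimate to a competency level via fixed cut scores."""
    return int(np.digitize(theta, cuts))

# Hypothetical item pool: (discrimination, step parameters).
# The last item plays the role of a misfitting item flagged for removal.
items = [(1.2, [-0.5, 0.4]), (0.8, [0.0]), (1.0, [-1.0, 0.2, 1.1]), (0.3, [0.9])]
pattern = [2, 1, 3, 0]  # one respondent's observed category scores

theta_full = theta_mle(pattern, items)
theta_reduced = theta_mle(pattern[:-1] + [None], items)

print(f"theta with all items:      {theta_full: .2f} -> level {level(theta_full)}")
print(f"theta without misfit item: {theta_reduced: .2f} -> level {level(theta_reduced)}")

The same comparison extends to the non-cognitive example: estimate person scores under both scalings and compare the resulting group differences, for instance a standardized mean difference between girls and boys.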

Keywords

