Practical significance of item misfit and its manifestations in constructs assessed in large-scale studies
References
ACARA. (2013). National assessment program - science literacy technical report 2012. Australian Curriculum, Assessment and Reporting Authority.
Allen, N. L., Carlson, J. E., & Zelenak, C. A. (1999). The NAEP 1996 technical report. National Center for Education Statistics.
Birnbaum, A. (1968). Some latent trait models. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 395–479). Addison-Wesley.
Box, G. E., & Draper, N. R. (1987). Empirical model-building and response surfaces. Wiley.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Routledge. https://doi.org/10.4324/9780203771587
Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155–159.
Crişan, D. R., Tendeiro, J. N., & Meijer, R. R. (2017). Investigating the practical consequences of model misfit in unidimensional IRT models. Applied Psychological Measurement, 41(6), 439–455. https://doi.org/10.1177/0146621617695522
De Ayala, R. J. (2009). The theory and practice of item response theory (Methodology in the social sciences). Guilford Press.
Dorans, N. J., & Feigenbaum, M. D. (1994). Equating issues engendered by changes to the SAT and PSAT/NMSQT. ETS Research Memorandum. Princeton, NJ: Educational Testing Service.
Hambleton, R. K., & Han, N. (2005). Assessing the fit of IRT models to educational and psychological test data: A five step plan and several graphical displays. In W. R. Lenderking & D. Revicki (Eds.), Advances in health outcomes research methods, measurement, statistical analysis, and clinical applications. Degnon Associates.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Springer.
Hartig, J., Frey, A., & Jude, N. (2020). Validity of test value interpretations. In H. Moosbrugger & A. Kelava (Eds.), Testtheorie und Fragebogenkonstruktion. Springer.
Jaccard, P. (1912). The distribution of the flora in the alpine zone. New Phytologist, 11(2), 37–50. https://doi.org/10.1111/j.1469-8137.1912.tb05611.x
Kirk, R. E. (1996). Practical significance: a concept whose time has come. Educational and Psychological Measurement, 56(5), 746–759. https://doi.org/10.1177/0013164496056005002
Köhler, C., & Hartig, J. (2017). Practical significance of item misfit in educational assessments. Applied Psychological Measurement, 41(5), 388–400. https://doi.org/10.1177/0146621617692978
Köhler, C., Robitzsch, A., & Hartig, J. (2020). A bias-corrected RMSD item fit statistic: an evaluation and comparison to alternatives. Journal of Educational and Behavioral Statistics, 45(3), 251–273. https://doi.org/10.3102/1076998619890566
Liang, T., Wells, C. S., & Hambleton, R. K. (2014). An assessment of the nonparametric approach for evaluating the fit of item response models. Journal of Educational Measurement, 51(1), 1–17. https://doi.org/10.1111/jedm.12031
Lüdtke, O., & Robitzsch, A. (2017). An introduction to the plausible values technique for psychological research. Diagnostica, 63(3), 193–205. https://doi.org/10.1026/0012-1924/a000175
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174. https://doi.org/10.1007/BF02296272
Molenaar, I. W. (1997). Lenient or strict application of IRT with an eye on practical consequences. In J. Rost & R. Langeheine (Eds.), Applications of latent trait and latent class models in the social sciences (pp. 38–49). Waxmann Verlag.
Muraki, E. (1992). A generalized partial credit model: application of an EM algorithm. Applied Psychological Measurement, 16(2), 159–176.
OECD. (2015). PISA 2015 field trial analysis report: Outcomes of the cognitive assessment (Meeting of the technical advisory group). Paris: OECD Publishing.
OECD. (2018a). PISA 2015: PISA results in focus. OECD Publishing.
OECD. (2018b). PISA 2018 field trial analysis report for the cognitive assessment. OECD Publishing.
OECD. (2020). PISA 2018 technical report. OECD Publishing.
Peeters, M. J. (2016). Practical significance: Moving beyond statistical significance. Currents in Pharmacy Teaching and Learning, 8(1), 83–89. https://doi.org/10.1016/j.cptl.2015.09.001
R Core Team. (2022). R: A language and environment for statistical computing [Computer software]. R Foundation for Statistical Computing. https://www.R-project.org/
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. (Copenhagen, Danish Institute for Educational Research), expanded edition (1980) with foreword and afterword by B.D. Wright. Chicago: The University of Chicago Press.
Reiss, K., Weis, M., Klieme, E., & Köller, O. (Eds.). (2019). PISA 2018: Grundbildung im internationalen Vergleich [PISA 2018: Basic education in international comparison]. Waxmann.
Robitzsch, A., Kiefer, T., & Wu, M. (2020). TAM: Test analysis modules (R package version 3.5-19) [Computer software].
Rutkowski, L. (2014). Sensitivity of achievement estimation to conditioning model misclassification. Applied Measurement in Education, 27(2), 115–132. https://doi.org/10.1080/08957347.2014.880440
Silva Diaz, J. A., Köhler, C., & Hartig, J. (2022). Performance of infit and outfit confidence intervals calculated via parametric bootstrapping. Applied Measurement in Education. https://doi.org/10.1080/08957347.2022.2067540
Sinharay, S., Haberman, S. J., & Jia, H. (2011). Fit of item response theory models: A survey of data from several operational tests (Research Report No. RR-11-29). Princeton, NJ: Educational Testing Service. https://doi.org/10.1002/j.2333-8504.2011.tb02265.x
Sinharay, S., & Haberman, S. J. (2014). How often is the misfit of item response theory models practically significant? Educational Measurement: Issues and Practice, 33(1), 23–35. https://doi.org/10.1111/emip.12024
Su, Y. H., Sheu, C. F., & Wang, W. C. (2007). Computing CIs of item fit statistics in the family of Rasch models using the bootstrap method. Journal of Applied Measurement, 8(2), 190–203. https://www.ncbi.nlm.nih.gov/pubmed/17440261
Swaminathan, H., Hambleton, R. K., & Rodgers, H. J. (2006). Assessing the fit of item response theory models. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics: Psychometrics (pp. 683–718). Elsevier.
Tendeiro, J. N., & Meijer, R. R. (2015). How serious is IRT misfit for practical decision-making? LSAC Research Report Series, 15(4), 1–22.
Thompson, B. (2007). Effect sizes, confidence intervals, and confidence intervals for effect sizes. Psychology in the Schools, 44(5), 423–432. https://doi.org/10.1002/pits.20234
Tijmstra, J., Bolsinova, M., Liaw, Y.-L., Rutkowski, L., & Rutkowski, D. (2020). Sensitivity of the RMSD for detecting item-level misfit in low-performing countries. Journal of Educational Measurement, 57(4), 566–583. https://doi.org/10.1111/jedm.12263
Van der Linden, W. J., & Hambleton, R. K. (1997). Handbook of modern item response theory. Springer. https://doi.org/10.1007/978-1-4757-2691-6
Van Rijn, P. W., Sinharay, S., Haberman, S. J., & Johnson, M. S. (2016). Assessment of fit of item response theory models used in large-scale educational survey assessments. Large-Scale Assessments in Education, 4(10), 1–23.
Wainer, H., & Thissen, D. (1987). Estimating ability with the wrong model. Journal of Educational Statistics, 12(4), 339–368. https://doi.org/10.3102/10769986012004339
Wu, M. L. (2005). The role of plausible values in large-scale surveys. Studies in Educational Evaluation, 31(2–3), 114–128. https://doi.org/10.1016/j.stueduc.2005.05.005
Zhao, Y. (2016). Impact of IRT item misfit on score estimates and severity classifications: an examination of PROMIS depression and pain interference item banks. Quality of Life Research, 26(3), 555–564. https://doi.org/10.1007/s11136-016-1467-3
Zhao, Y., & Hambleton, R. K. (2017). Practical consequences of item response theory model misfit in the context of test equating with mixed-format test data. Frontiers in Psychology, 8, 1–11. https://doi.org/10.3389/fpsyg.2017.00484