The TIMSS 2019 Item Equivalence Study: examining mode effects for computer-based assessment and implications for measuring trends
Abstract
TIMSS 2019 is the first assessment in the TIMSS transition to a computer-based assessment system, called eTIMSS. The TIMSS 2019 Item Equivalence Study was conducted in 2017, in advance of the field test, to examine the potential for mode effects on the psychometric behavior of the TIMSS mathematics and science trend items induced by the change to computer-based administration. The study employed a counterbalanced, within-subjects design to investigate these potential mode effects. Sample sizes for analysis included 16,894 fourth grade students from 24 countries and 9,164 eighth grade students from 11 countries. Following a review of the differences between the paper and digital formats of the trend items, item statistics were examined item by item and aggregated by subject for paperTIMSS and eTIMSS. The TIMSS scaling methods were then applied to produce achievement scale scores for each mode, which were used to estimate the expected magnitude of the mode effects on student achievement. The results of the study indicate that the mathematics and science constructs assessed by the trend items were mostly unaffected by the transition to eTIMSS at both grades. However, there was an overall mode effect: items were more difficult for students in digital format than on paper, and the effect was larger in mathematics than in science. Because the trend items cannot be expected to be sufficiently equivalent across paperTIMSS and eTIMSS, it was concluded that the usual item calibration model must be modified for TIMSS 2019 to measure trends. Each eTIMSS 2019 trend country will therefore administer paper trend booklets to a nationally representative sample of students, in addition to its usual student sample, to provide a bridge between paperTIMSS and eTIMSS results.
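To illustrate the kind of item-by-item, mode-by-mode comparison described above, the following Python sketch computes per-item percent-correct statistics in each mode and the paper-minus-digital difference, aggregated by subject. It is not the study's actual analysis code; the file name and column names are hypothetical, and the real study used the full TIMSS scaling methodology rather than simple percent-correct contrasts.

```python
import pandas as pd

# Hypothetical input: one row per student-item response, with columns
#   'subject'  in {"mathematics", "science"}
#   'item_id'  trend item identifier
#   'mode'     in {"paperTIMSS", "eTIMSS"}
#   'correct'  scored response (0/1)
responses = pd.read_csv("item_equivalence_responses.csv")

# Per-item difficulty (proportion correct) in each mode.
p_values = (
    responses
    .groupby(["subject", "item_id", "mode"])["correct"]
    .mean()
    .unstack("mode")
)

# Item-level mode effect: positive values mean the item was harder on eTIMSS.
p_values["mode_effect"] = p_values["paperTIMSS"] - p_values["eTIMSS"]

# Aggregate by subject to compare the typical size of the effect.
print(p_values.groupby("subject")["mode_effect"].describe())
```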