Scale length does matter: Recommendations for measurement invariance testing with categorical factor analysis and item response theory approaches
Abstract
In the social sciences, the study of group differences on latent constructs is ubiquitous. These constructs are generally measured with scales composed of ordinal items. To compare such constructs across groups, one crucial requirement is that they are measured equivalently or, in technical terms, that measurement invariance (MI) holds across the groups. This study compared the performance of scale- and item-level approaches based on multiple-group categorical confirmatory factor analysis (MG-CCFA) and multiple-group item response theory (MG-IRT) for testing MI with ordinal data. In general, the simulation results showed that MG-CCFA-based approaches outperformed MG-IRT-based approaches when MI was tested at the scale level, whereas, at the item level, the best-performing approach depended on the parameter tested (i.e., loadings or thresholds). Specifically, when testing the equivalence of loadings, the likelihood ratio test provided the best trade-off between true-positive and false-positive rates, whereas, when testing the equivalence of thresholds, the χ² test outperformed the other testing strategies. In addition, the performance of MG-CCFA fit measures such as RMSEA and CFI seemed to depend largely on scale length, especially when MI was tested at the item level. General caution is therefore recommended when using these measures, particularly when MI is tested for each item individually.
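To make the two families of approaches concrete, the sketch below illustrates how scale-level MI could be tested with MG-CCFA (lavaan) and MG-IRT (mirt) in R. It is a minimal sketch under assumed data, not the simulation code used in the study: the data frame `dat`, the five ordinal items `x1`–`x5`, and the grouping variable `grp` are hypothetical.

```r
# Minimal sketch of scale-level MI testing with ordinal items.
# Assumes a hypothetical data frame `dat` containing ordinal items x1-x5
# and a grouping variable grp; not the study's actual simulation code.
library(lavaan)
library(mirt)

model <- 'F =~ x1 + x2 + x3 + x4 + x5'
items <- paste0("x", 1:5)

# MG-CCFA: configural model, then thresholds, then thresholds + loadings constrained
fit_config   <- cfa(model, data = dat, group = "grp", ordered = items)
fit_thresh   <- cfa(model, data = dat, group = "grp", ordered = items,
                    group.equal = "thresholds")
fit_loadings <- cfa(model, data = dat, group = "grp", ordered = items,
                    group.equal = c("thresholds", "loadings"))
lavTestLRT(fit_config, fit_thresh, fit_loadings)            # chi-square difference tests
fitMeasures(fit_loadings, c("rmsea.scaled", "cfi.scaled"))  # RMSEA and CFI

# MG-IRT: graded response model, free vs. fully constrained item parameters
fit_free  <- multipleGroup(dat[, items], 1, group = dat$grp, itemtype = "graded")
fit_equal <- multipleGroup(dat[, items], 1, group = dat$grp, itemtype = "graded",
                           invariance = c("slopes", "intercepts",
                                          "free_means", "free_var"))
anova(fit_free, fit_equal)                                  # likelihood ratio test
```

Item-level testing proceeds analogously: the loadings or thresholds of one item at a time are constrained or freed, and the resulting nested models are compared with the same likelihood ratio or χ² machinery.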