Constructing an Item Bank Using Item Response Theory: The AMC Linear Disability Score Project

Rebecca Holman¹, Robert Lindeboom¹, Cees A.W. Glas², Marinus Vermeulen³, Rob J. de Haan¹
¹ Department of Clinical Epidemiology and Biostatistics, Academic Medical Center, Amsterdam, The Netherlands
² Department of Research Methodology, Measurement Methods and Data Analysis, University of Twente, Enschede, The Netherlands
³ Department of Neurology, Academic Medical Center, Amsterdam, The Netherlands

Abstract

Patient-relevant outcomes measured using questionnaires, such as cognitive functioning and functional status, have become important endpoints in medical studies. Traditionally, the responses to individual items are simply summed to obtain a score for each patient. Recently, interest has grown in another paradigm, item response theory (IRT), as an alternative to summed scores. The benefits of IRT are greatest when it is used in conjunction with a calibrated item bank: a collection of items that have been presented to large groups of patients, whose responses are used to estimate the measurement properties of the individual items. This article examines the methodology for using IRT to construct and calibrate an item bank. As an illustration, it uses the AMC Linear Disability Score project, which aims to develop an item bank to measure functional status, expressed as the ability to perform activities of daily life.
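The contrast between summed scores and IRT scoring against a calibrated item bank can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the abstract does not name a specific IRT model, so the one-parameter logistic (Rasch) model is assumed here, and the item names, difficulty values, and function names are all hypothetical rather than taken from the AMC Linear Disability Score project.

```python
import math

# Hypothetical calibrated item bank: item name -> difficulty on the latent scale.
# These values are invented for illustration, not estimates from any real study.
ITEM_BANK = {
    "walk_indoors": -2.0,
    "climb_stairs": -0.5,
    "shop_groceries": 0.5,
    "cycle_10km": 2.0,
}

def rasch_prob(theta, difficulty):
    """Probability of a positive response under the assumed Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def summed_score(responses):
    """Traditional scoring: add up the 0/1 item responses."""
    return sum(responses)

def estimate_ability(responses, difficulties, max_iter=50, tol=1e-8):
    """Maximum-likelihood ability estimate via Newton-Raphson.

    Item difficulties are treated as known (already calibrated).
    The estimate is not finite when all responses are 0 or all are 1.
    """
    total = sum(responses)
    if total == 0 or total == len(responses):
        raise ValueError("ML estimate is not finite for all-0 or all-1 patterns")
    theta = 0.0
    for _ in range(max_iter):
        probs = [rasch_prob(theta, b) for b in difficulties]
        gradient = total - sum(probs)                   # first derivative of log-likelihood
        information = sum(p * (1 - p) for p in probs)   # negative second derivative
        step = gradient / information
        theta += step
        if abs(step) < tol:
            break
    return theta

# Two hypothetical patients answer different subsets of the bank.
easy_items = ["walk_indoors", "climb_stairs"]
hard_items = ["shop_groceries", "cycle_10km"]
responses_a = [1, 0]  # patient A: can walk indoors, cannot climb stairs
responses_b = [1, 0]  # patient B: can shop for groceries, cannot cycle 10 km

score_a = summed_score(responses_a)
score_b = summed_score(responses_b)
theta_a = estimate_ability(responses_a, [ITEM_BANK[i] for i in easy_items])
theta_b = estimate_ability(responses_b, [ITEM_BANK[i] for i in hard_items])

print(f"Patient A: summed score {score_a}, estimated ability {theta_a:.2f}")
print(f"Patient B: summed score {score_b}, estimated ability {theta_b:.2f}")
```

Both patients have the same summed score, yet the ability estimates differ because the calibrated difficulties place the two item subsets at different points on a common scale. Under the stated assumptions, this is the sense in which a calibrated item bank allows patients who answered different items to be compared.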
