Using the bootstrap to establish statistical significance for relative validity comparisons among patient-reported outcome measures

Health and Quality of Life Outcomes - Volume 11 - Pages 1-12 - 2013
Nina Deng (1), Jeroan J Allison (1), Hua Julia Fang (1), Arlene S Ash (1), John E Ware (1, 2)
(1) Department of Quantitative Health Sciences, University of Massachusetts Medical School, Worcester, USA
(2) John Ware Research Group, Incorporated, Worcester, USA

Abstract

Relative validity (RV), a ratio of ANOVA F-statistics, is often used to compare the validity of patient-reported outcome (PRO) measures. We used the bootstrap to establish the statistical significance of the RV and to identify key factors affecting its significance.

Based on responses from 453 chronic kidney disease (CKD) patients to 16 CKD-specific and generic PRO measures, RVs were computed to determine how well each measure discriminated across clinically defined groups of patients compared to the most discriminating (reference) measure. The statistical significance of each RV was quantified by its 95% bootstrap confidence interval. Simulations examined the effects of sample size, denominator F-statistic, correlation between comparator and reference measures, and number of bootstrap replicates.

The statistical significance of the RV increased as the magnitude of the denominator F-statistic increased or as the correlation between comparator and reference measures increased. A denominator F-statistic of 57 conveyed sufficient power (80%) to detect an RV of 0.6 for two measures correlated at r = 0.7. Larger denominator F-statistics or higher correlations provided greater power. A larger sample size with a fixed denominator F-statistic, or more bootstrap replicates (beyond 500), had minimal impact.

The bootstrap is valuable for establishing the statistical significance of RV estimates. A reasonably large denominator F-statistic (F > 57) is required for adequate power when using the RV to compare the validity of measures with small or moderate correlations (r < 0.7). Substantially greater power can be achieved when comparing measures that are very highly correlated (r > 0.9).
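
The procedure summarized above can be illustrated with a minimal sketch. It computes the RV as the ratio of one-way ANOVA F-statistics (comparator over reference) and then resamples patients with replacement to obtain a 95% bootstrap confidence interval. Several things here are assumptions rather than details from the abstract: the function names (`f_statistic`, `relative_validity`, `bootstrap_rv_ci`) are hypothetical, the data are simulated rather than the CKD responses, and the interval is a simple percentile interval, which is not necessarily the interval type the authors used.

```python
import numpy as np


def f_statistic(scores, groups):
    """One-way ANOVA F-statistic of `scores` across the levels of `groups`."""
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    grand_mean = scores.mean()
    ss_between = 0.0
    ss_within = 0.0
    for g in labels:
        in_group = scores[groups == g]
        ss_between += in_group.size * (in_group.mean() - grand_mean) ** 2
        ss_within += ((in_group - in_group.mean()) ** 2).sum()
    df_between = labels.size - 1
    df_within = scores.size - labels.size
    return (ss_between / df_between) / (ss_within / df_within)


def relative_validity(comparator, reference, groups):
    """RV = F(comparator) / F(reference), both computed on the same groups."""
    return f_statistic(comparator, groups) / f_statistic(reference, groups)


def bootstrap_rv_ci(comparator, reference, groups, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the RV (assumed interval type).

    Patients are resampled as whole rows, so the correlation between the
    comparator and reference scores is preserved in every replicate.
    """
    comparator = np.asarray(comparator, dtype=float)
    reference = np.asarray(reference, dtype=float)
    groups = np.asarray(groups)
    rng = np.random.default_rng(seed)
    n = groups.size
    rvs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample patient indices with replacement
        rvs[b] = relative_validity(comparator[idx], reference[idx], groups[idx])
    lower, upper = np.quantile(rvs, [alpha / 2, 1 - alpha / 2])
    return lower, upper


# Simulated illustration (not the study data): three clinical groups,
# with the comparator separating the groups less sharply than the reference.
rng = np.random.default_rng(42)
groups = np.repeat([0, 1, 2], 150)
reference = 1.0 * groups + rng.normal(size=groups.size)
comparator = 0.6 * groups + rng.normal(size=groups.size)

rv = relative_validity(comparator, reference, groups)
lo, hi = bootstrap_rv_ci(comparator, reference, groups, n_boot=500)
print(f"RV = {rv:.2f}, 95% bootstrap CI = ({lo:.2f}, {hi:.2f})")
print("Comparator differs significantly from reference:", not (lo <= 1.0 <= hi))
```

Under this sketch, a comparator is judged significantly less (or more) valid than the reference when the interval excludes 1. Resampling whole patients rather than each measure separately is what lets the correlation between the two measures narrow the interval, which is consistent with the abstract's finding that higher correlations give greater power.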
