Can the buck always be passed to the highest level of clustering?

Christian Bottomley¹, Matthew J. Kirby¹, Steve W. Lindsay¹, Neal Alexander¹
¹ MRC Tropical Epidemiology Group, London School of Hygiene & Tropical Medicine, London, UK

Abstract

Clustering commonly affects the uncertainty of parameter estimates in epidemiological studies. Cluster-robust variance estimates (CRVE) are used to construct confidence intervals that account for single-level clustering, and are easily implemented in standard software. When data are clustered at more than one level (e.g. village and household), the level for the CRVE must be chosen. CRVE are consistent when used at the higher level of clustering (village), but since there are fewer clusters at the higher level, and consistency is an asymptotic property, there may be circumstances under which coverage is better from lower- rather than higher-level CRVE. Here we assess the relative importance of adjusting for clustering at the higher and lower level in a logistic regression model. We performed a simulation study in which the coverage of 95% confidence intervals was compared between adjustments at the higher and lower levels. Confidence intervals adjusted for the higher level of clustering had coverage close to 95%, even when there were few clusters, provided that the intra-cluster correlation of the predictor was less than 0.5 for models with a single predictor and less than 0.2 for models with multiple predictors. When there are multiple levels of clustering, it is generally preferable to use confidence intervals that account for the highest level of clustering. This only fails if there are few clusters at this level and the intra-cluster correlation of the predictor is high.
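
The choice between lower- and higher-level CRVE can be illustrated with a minimal sketch. It is not the authors' simulation code: the data-generating model, the parameter values, and the function names `simulate`, `fit_logistic` and `crve_slope_se` are all assumptions made for illustration, and no small-sample correction is applied. It fits a single-predictor logistic regression to data clustered by village and household, then computes the sandwich standard error of the slope with scores summed either within households or within villages.

```python
import math
import random


def simulate(n_villages=20, n_households=5, n_people=4, seed=1):
    """Two-level clustered binary data (all parameter values are illustrative)."""
    rng = random.Random(seed)
    x, y, village, household = [], [], [], []
    for v in range(n_villages):
        u_v = rng.gauss(0, 0.5)   # village random effect on the outcome
        x_v = rng.gauss(0, 1)     # village-level component of the predictor
        for h in range(n_households):
            u_h = rng.gauss(0, 0.5)   # household random effect on the outcome
            for _ in range(n_people):
                # Predictor ICC at village level = 0.25 / 1.25 = 0.2 (assumed)
                xi = 0.5 * x_v + rng.gauss(0, 1)
                eta = -0.5 + 0.5 * xi + u_v + u_h
                p = 1.0 / (1.0 + math.exp(-eta))
                x.append(xi)
                y.append(1 if rng.random() < p else 0)
                village.append(v)
                household.append((v, h))
    return x, y, village, household


def fit_logistic(x, y, iters=25):
    """Newton-Raphson fit of logit P(y=1) = b0 + b1*x, ignoring clustering."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def crve_slope_se(x, y, cluster, b0, b1):
    """Uncorrected sandwich SE of the slope, with scores summed per cluster."""
    h00 = h01 = h11 = 0.0
    scores = {}
    for xi, yi, ci in zip(x, y, cluster):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        w = p * (1.0 - p)
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
        s = scores.setdefault(ci, [0.0, 0.0])
        s[0] += yi - p
        s[1] += (yi - p) * xi
    # "Meat": sum over clusters of the outer product of cluster score totals
    m00 = sum(s[0] * s[0] for s in scores.values())
    m01 = sum(s[0] * s[1] for s in scores.values())
    m11 = sum(s[1] * s[1] for s in scores.values())
    # "Bread": second row of the inverse information matrix
    det = h00 * h11 - h01 * h01
    a01 = -h01 / det
    a11 = h00 / det
    v11 = a01 * (m00 * a01 + m01 * a11) + a11 * (m01 * a01 + m11 * a11)
    return math.sqrt(v11)


x, y, village, household = simulate()
b0, b1 = fit_logistic(x, y)
se_household = crve_slope_se(x, y, household, b0, b1)
se_village = crve_slope_se(x, y, village, b0, b1)
print("slope:", b1)
print("95% CI half-width, household-level CRVE:", 1.96 * se_household)
print("95% CI half-width, village-level CRVE:", 1.96 * se_village)
```

Repeating this over many simulated data sets and recording how often each interval contains the true slope would give an estimate of coverage at each level, which is the kind of comparison the abstract describes. Note that with few villages the uncorrected village-level estimate rests on only a handful of score totals.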
