Statistical Methods in Medical Research
e-ISSN: 1477-0334
ISSN: 0962-2802
United Kingdom
Publisher: SAGE Publications Ltd
Featured articles
The goal of multiple imputation is to provide valid inferences for statistical estimates from incomplete data. To achieve that goal, imputed values should preserve the structure in the data, as well as the uncertainty about this structure, and incorporate any knowledge about the process that generated the missing data. Two approaches for imputing multivariate data exist: joint modeling (JM) and fully conditional specification (FCS). JM is based on parametric statistical theory and leads to imputation procedures whose statistical properties are known. JM is theoretically sound, but the joint model may lack the flexibility needed to represent typical data features, potentially leading to bias. FCS is a semi-parametric and flexible alternative that specifies the multivariate model through a series of conditional models, one for each incomplete variable. FCS provides tremendous flexibility and is easy to apply, but its statistical properties are difficult to establish. Simulation work shows that FCS behaves very well in the cases studied. The present paper reviews and compares both approaches. JM and FCS were applied to pubertal development data from 3801 Dutch girls with missing data on menarche (two categories), breast development (five categories) and pubic hair development (six stages). Imputations for these data were created under two models: a multivariate normal model with rounding and a conditionally specified discrete model. The JM approach introduced biases in the reference curves, whereas FCS did not. The paper concludes that FCS is a useful and easily applied, flexible alternative to JM when no convenient and realistic joint distribution can be specified.
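As a hedged illustration of the FCS idea (not the authors' implementation), the Python sketch below imputes an incomplete continuous variable with scikit-learn's chained-equations IterativeImputer, drawing from the predictive posterior, and pools the analysis-model estimates with Rubin's rules; the data-generating setup and variable names are invented for illustration.

```python
# Minimal FCS-style multiple-imputation sketch (illustrative, not the paper's code).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)
data = np.column_stack([x1, x2, y])

# Make x2 missing at random, more often when x1 is large
miss = rng.random(n) < 1 / (1 + np.exp(-x1))
data_obs = data.copy()
data_obs[miss, 1] = np.nan

m = 20  # number of imputations
estimates, variances = [], []
for k in range(m):
    # Chained-equations imputation with draws from the predictive posterior
    imp = IterativeImputer(sample_posterior=True, max_iter=10, random_state=k)
    d_k = imp.fit_transform(data_obs)
    # Analysis model: regress y on x1 and the (partly imputed) x2
    design = np.column_stack([np.ones(n), d_k[:, 0], d_k[:, 1]])
    beta, *_ = np.linalg.lstsq(design, d_k[:, 2], rcond=None)
    resid = d_k[:, 2] - design @ beta
    cov = np.linalg.inv(design.T @ design) * resid.var(ddof=3)
    estimates.append(beta[2])   # coefficient of x2
    variances.append(cov[2, 2])

# Rubin's rules: pooled point estimate and total variance
q_bar = np.mean(estimates)
u_bar = np.mean(variances)          # within-imputation variance
b = np.var(estimates, ddof=1)       # between-imputation variance
t = u_bar + (1 + 1 / m) * b
print(q_bar, np.sqrt(t))
```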
The era of big data is coming, and evidence-based medicine is attracting increasing attention as a way to improve decision making in medical practice by integrating evidence from well designed and conducted clinical research. Meta-analysis is a statistical technique widely used in evidence-based medicine for analytically combining the findings from independent clinical trials to provide an overall estimate of treatment effectiveness. The sample mean and standard deviation are the two statistics most commonly pooled in meta-analysis, but some trials report their results as the median, the minimum and maximum values, or sometimes the first and third quartiles. Thus, to pool results in a consistent format, researchers need to transform that information back to the sample mean and standard deviation. In this article, we investigate the optimal estimation of the sample mean for meta-analysis from both theoretical and empirical perspectives. A major drawback in the literature is that the sample size, despite its evident importance, is either ignored or used in a stepwise but somewhat arbitrary manner, e.g. in the well-known method proposed by Hozo et al. We address this issue by incorporating the sample size through a smoothly changing weight in the estimators to reach the optimal estimation. Our proposed estimators not only improve on the existing ones significantly but also retain their simplicity. The real-data application indicates that our proposed estimators can serve as “rules of thumb” and will be widely applied in evidence-based medicine.
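The abstract names Hozo et al.'s estimator and proposes a smooth, sample-size-dependent weight. The sketch below contrasts the two for the (minimum, median, maximum, n) scenario; the specific weight 4 / (4 + n^0.75) is a commonly cited form of such a smooth weight and should be read as illustrative rather than as the paper's exact formula.

```python
# Estimating the sample mean from summary statistics (illustrative sketch).
import numpy as np

def mean_hozo(a, m, b):
    """Hozo et al.'s classical estimator from (min, median, max); ignores n."""
    return (a + 2 * m + b) / 4

def mean_smooth_weight(a, m, b, n):
    """Estimator with a smoothly changing, sample-size-dependent weight.
    The weight w = 4 / (4 + n**0.75) is a commonly cited form for this
    scenario; treat the exact constant as illustrative."""
    w = 4 / (4 + n ** 0.75)
    return w * (a + b) / 2 + (1 - w) * m

# Toy check on simulated normal data
rng = np.random.default_rng(1)
x = rng.normal(loc=10, scale=2, size=200)
a, m, b, n = x.min(), np.median(x), x.max(), x.size
print(x.mean(), mean_hozo(a, m, b), mean_smooth_weight(a, m, b, n))
```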
The simplest approach to dealing with missing data is to restrict the analysis to complete cases, i.e. individuals with no missing values. This can induce bias, however. Inverse probability weighting (IPW) is a commonly used method to correct this bias. It is also used to adjust for unequal sampling fractions in sample surveys. This article is a review of the use of IPW in epidemiological research. We describe how the bias in the complete-case analysis arises and how IPW can remove it. IPW is compared with multiple imputation (MI) and we explain why, despite MI generally being more efficient, IPW may sometimes be preferred. We discuss the choice of missingness model and methods such as weight truncation, weight stabilisation and augmented IPW. The use of IPW is illustrated on data from the 1958 British Birth Cohort.
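A minimal sketch of the IPW idea for a complete-case analysis, assuming missingness depends on a fully observed covariate; the data, target quantity (a simple mean) and models are invented for illustration and are not taken from the article.

```python
# Inverse probability weighting for a complete-case mean (illustrative sketch).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)                                 # fully observed covariate
y = 1.0 + 1.0 * x + rng.normal(scale=0.5, size=n)      # outcome, partly missing
p_obs = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))             # observation depends on x
r = rng.random(n) < p_obs                              # True if y is observed

# Step 1: model the probability of being a complete case
miss_fit = sm.Logit(r.astype(float), sm.add_constant(x)).fit(disp=0)
p_hat = miss_fit.predict(sm.add_constant(x))

# Step 2: weight each complete case by the inverse of its observation probability
cc_mean = y[r].mean()                                  # complete-case mean (biased upward here)
ipw_mean = np.average(y[r], weights=1.0 / p_hat[r])    # approximately unbiased
print(y.mean(), cc_mean, ipw_mean)
```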
The Poisson regression model using a sandwich variance estimator has become a viable alternative to the logistic regression model for the analysis of prospective studies with independent binary outcomes. The primary advantage of this approach is that it readily provides covariate-adjusted risk ratios and associated standard errors. In this article, the model is extended to studies with correlated binary outcomes as arise in longitudinal or cluster randomization studies. The key step involves a cluster-level grouping strategy for the computation of the middle term in the sandwich estimator. For a single binary exposure variable without covariate adjustment, this approach results in risk ratio estimates and standard errors that are identical to those found in the survey sampling literature. Simulation results suggest that it is reliable for studies with correlated binary data, provided the total number of clusters is at least 50. Data from observational and cluster randomized studies are used to illustrate the methods.
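A hedged sketch of a modified Poisson analysis for correlated binary outcomes, implemented here as a GEE fit with a Poisson working model, log link, independence working correlation and a cluster-grouped sandwich variance in statsmodels; the clustered data are simulated for illustration and this is not the authors' code.

```python
# Modified Poisson regression with a cluster-robust sandwich variance (illustrative sketch).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n_clusters, m = 60, 8                                     # 60 clusters of 8 subjects
cluster = np.repeat(np.arange(n_clusters), m)
trt = np.repeat(rng.integers(0, 2, n_clusters), m)        # cluster-level binary exposure
u = np.repeat(rng.normal(scale=0.3, size=n_clusters), m)  # shared cluster effect
p = np.clip(0.2 * np.exp(0.4 * trt + u), 0.01, 0.99)      # true risk ratio about exp(0.4)
y = (rng.random(n_clusters * m) < p).astype(float)

X = sm.add_constant(trt)
# Poisson working model with log link; independence working correlation.
# GEE reports the robust (sandwich) covariance grouped at the cluster level.
model = sm.GEE(y, X, groups=cluster,
               family=sm.families.Poisson(),
               cov_struct=sm.cov_struct.Independence())
res = model.fit()
print(np.exp(res.params))   # covariate-adjusted risk ratio estimate
print(res.bse)              # cluster-robust standard errors (log scale)
```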
Identifying and monitoring multiple disease biomarkers and other clinically important factors affecting the course of a disease, behavior or health status is of great clinical relevance. Yet conventional statistical practice generally falls far short of taking full advantage of the information available in multivariate longitudinal data for tracking the course of the outcome of interest. We demonstrate a method called multi-trajectory modeling that is designed to overcome this limitation. The method is a generalization of group-based trajectory modeling. Group-based trajectory modeling is designed to identify clusters of individuals who are following similar trajectories of a single indicator of interest such as post-operative fever or body mass index. Multi-trajectory modeling identifies latent clusters of individuals following similar trajectories across multiple indicators of an outcome of interest (e.g., the health status of chronic kidney disease patients as measured by their eGFR, hemoglobin, and blood CO2 levels). Multi-trajectory modeling is an application of finite mixture modeling. We lay out the underlying likelihood function of the multi-trajectory model and demonstrate its use with two examples.
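As a sketch (not verbatim from the paper), the likelihood of a group-based multi-trajectory model with J latent groups, K indicators and polynomial group-specific trajectories can be written as below, where pi_j is the group-membership probability, f_k an indicator-appropriate density, and mu_jk(t) the group j trajectory of indicator k; the quadratic form of the trajectory is an illustrative choice.

```latex
% Sketch of a group-based multi-trajectory likelihood (notation illustrative):
% J latent groups with membership probabilities \pi_j, K indicators,
% T_i measurement occasions for person i, and polynomial group trajectories.
L(\theta) \;=\; \prod_{i=1}^{N} \sum_{j=1}^{J} \pi_j
  \prod_{k=1}^{K} \prod_{t=1}^{T_i}
  f_k\!\left( y_{ikt} \mid \mu_{jk}(t), \theta_{jk} \right),
\qquad
\mu_{jk}(t) \;=\; \beta_{jk0} + \beta_{jk1}\, t + \beta_{jk2}\, t^{2}
```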
Binary logistic regression is one of the most frequently applied statistical approaches for developing clinical prediction models. Developers of such models often rely on an Events Per Variable criterion (EPV), notably EPV ≥10, to determine the minimal sample size required and the maximum number of candidate predictors that can be examined. We present an extensive simulation study in which we studied the influence of EPV, events fraction, number of candidate predictors, the correlations and distributions of candidate predictor variables, area under the ROC curve, and predictor effects on out-of-sample predictive performance of prediction models. The out-of-sample performance (calibration, discrimination and probability prediction error) of developed prediction models was studied before and after regression shrinkage and variable selection. The results indicate that EPV does not have a strong relation with metrics of predictive performance, and is not an appropriate criterion for (binary) prediction model development studies. We show that out-of-sample predictive performance can better be approximated by considering the number of predictors, the total sample size and the events fraction. We propose that the development of new sample size criteria for prediction models should be based on these three parameters, and provide suggestions for improving sample size determination.
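A minimal sketch, not the paper's simulation design, of how EPV, the events fraction and out-of-sample performance (discrimination and calibration slope) can be computed for a developed logistic model; all sample sizes, effect sizes, the function name simulate_and_validate and the validation scheme are invented for illustration.

```python
# EPV, events fraction and out-of-sample performance of a logistic model (illustrative sketch).
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)

def simulate_and_validate(n_dev, n_val, p, intercept=-2.0):
    """Develop a logistic model on n_dev subjects with p candidate predictors
    and evaluate it on an independent validation sample of n_val subjects."""
    beta = np.r_[intercept, rng.normal(scale=0.3, size=p)]

    def draw(n):
        X = sm.add_constant(rng.normal(size=(n, p)))
        y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta))).astype(float)
        return X, y

    X_dev, y_dev = draw(n_dev)
    X_val, y_val = draw(n_val)
    fit = sm.Logit(y_dev, X_dev).fit(disp=0)

    events = y_dev.sum()
    epv = events / p                                     # events per candidate predictor
    auc = roc_auc_score(y_val, fit.predict(X_val))       # out-of-sample discrimination
    lp = X_val @ fit.params                              # linear predictor in validation data
    slope = sm.Logit(y_val, sm.add_constant(lp)).fit(disp=0).params[1]  # calibration slope
    return epv, events / n_dev, auc, slope

print(simulate_and_validate(n_dev=300, n_val=5000, p=10))
```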
The positivity assumption, also known as the experimental treatment assignment assumption, requires that observed treatment levels vary within confounder strata. This article discusses the positivity assumption in the context of assessing model and parameter-specific identifiability of causal effects. Positivity violations occur when certain subgroups in a sample rarely or never receive some treatments of interest. The resulting sparsity in the data may increase bias with or without an increase in variance and can threaten valid inference. The parametric bootstrap is presented as a tool to assess the severity of such threats and its utility as a diagnostic is explored using simulated and real data. Several approaches for improving the identifiability of parameters in the presence of positivity violations are reviewed. Potential responses to data sparsity include restriction of the covariate adjustment set, use of an alternative projection function to define the target parameter within a marginal structural working model, restriction of the sample, and modification of the target intervention. All of these approaches can be understood as trading off proximity to the initial target of inference for identifiability; we advocate approaching this tradeoff systematically.
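A hedged sketch of the parametric bootstrap used as a positivity diagnostic, simplified to an IPTW estimator with a single confounder: data are re-simulated from fitted treatment and outcome models (so the effect implied by the fitted outcome model is known) and the estimator's average deviation from that value gauges sparsity-driven finite-sample bias. Everything below is an invented illustration, not the authors' exact procedure.

```python
# Parametric bootstrap as a positivity diagnostic for an IPTW estimator (illustrative sketch).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 1000
w = rng.normal(size=n)                                # confounder
p_a = 1 / (1 + np.exp(-3.0 * w))                      # strong confounding: near-violations
a = (rng.random(n) < p_a).astype(float)
y = 1.0 * a + 1.0 * w + rng.normal(size=n)

def iptw_effect(w, a, y):
    """IPTW (Hajek) estimate of the marginal treatment effect E[Y(1)] - E[Y(0)]."""
    ps = sm.Logit(a, sm.add_constant(w)).fit(disp=0).predict(sm.add_constant(w))
    wt = a / ps + (1 - a) / (1 - ps)
    return (np.average(y[a == 1], weights=wt[a == 1])
            - np.average(y[a == 0], weights=wt[a == 0]))

est = iptw_effect(w, a, y)

# Parametric bootstrap: re-simulate from fitted treatment and outcome models,
# where the fitted outcome model implies a known "true" effect, then re-apply
# the estimator; a large average deviation flags sparsity-driven bias.
ps_fit = sm.Logit(a, sm.add_constant(w)).fit(disp=0)
out_fit = sm.OLS(y, sm.add_constant(np.column_stack([a, w]))).fit()
true_effect = out_fit.params[1]                       # coefficient on treatment
bias_draws = []
for _ in range(200):
    a_b = (rng.random(n) < ps_fit.predict(sm.add_constant(w))).astype(float)
    mu_b = out_fit.predict(sm.add_constant(np.column_stack([a_b, w])))
    y_b = mu_b + rng.normal(scale=np.sqrt(out_fit.scale), size=n)
    bias_draws.append(iptw_effect(w, a_b, y_b) - true_effect)

print(est, np.mean(bias_draws))   # sizeable mean bias indicates a positivity problem
```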