Logistic Regression in Rare Events Data
Abstract
We study rare events data, binary dependent variables with dozens to thousands of times fewer ones (events, such as wars, vetoes, cases of political activism, or epidemiological infections) than zeros (“nonevents”). In many literatures, these variables have proven difficult to explain and predict, a problem that seems to have at least two sources. First, popular statistical procedures, such as logistic regression, can sharply underestimate the probability of rare events. We recommend corrections that outperform existing methods and change the estimates of absolute and relative risks by as much as some estimated effects reported in the literature. Second, commonly used data collection strategies are grossly inefficient for rare events data. The fear of collecting data with too few events has led to data collections with huge numbers of observations but relatively few, and poorly measured, explanatory variables, such as in international conflict data with more than a quarter-million dyads, only a few of which are at war. As it turns out, more efficient sampling designs exist for making valid inferences, such as sampling all available events (e.g., wars) and a tiny fraction of nonevents (peace). This enables scholars to save as much as 99% of their (nonfixed) data collection costs or to collect much more meaningful explanatory variables. We provide methods that link these two results, enabling both types of corrections to work simultaneously, and software that implements the methods developed.
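To make the sampling-and-correction idea concrete, here is a minimal sketch (assuming Python with numpy and statsmodels, a simulated population, and a known population event fraction τ) of the case-control design plus the prior correction of the logit intercept; variable names and simulation settings are illustrative only:

```python
# Sketch: sample all events plus a small fraction of nonevents, fit a logit,
# then prior-correct the intercept for selection on Y (tau assumed known).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulate a rare-events population: Pr(Y=1) depends on one covariate.
N = 200_000
x = rng.normal(size=N)
p = 1 / (1 + np.exp(-(-6.0 + 1.0 * x)))   # intercept -6 makes events rare
y = rng.binomial(1, p)
tau = y.mean()                             # population fraction of events

# Case-control sample: all events, a small random fraction of nonevents.
events = np.flatnonzero(y == 1)
nonevents = rng.choice(np.flatnonzero(y == 0), size=5 * events.size, replace=False)
idx = np.concatenate([events, nonevents])
ys, Xs = y[idx], sm.add_constant(x[idx])

fit = sm.Logit(ys, Xs).fit(disp=0)
ybar = ys.mean()                           # fraction of events in the sample

# Prior correction: only the intercept is biased by selection on Y;
# subtract ln[((1 - tau)/tau) * (ybar/(1 - ybar))] from it.
beta = fit.params.copy()
beta[0] -= np.log(((1 - tau) / tau) * (ybar / (1 - ybar)))
print(beta)                                # slope unchanged, intercept corrected
```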
References
Tucker, Richard. 1999. "BTSCS: A Binary Time-Series-Cross-Section Data Analysis Utility," Version 3.0.4. http://www.fas.harvard.edu/~rtucker/programs/btscs/btscs.html.
Tucker, Richard. 1998. "The Interstate Dyad-Year Dataset, 1816–1997," Version 3.0. http://www.fas.harvard.edu/~rtucker/data/dyadyear/.
Smith. 1998. Bayesian Statistics.
Scott. 1986. "Fitting Logistic Models Under Case-Control or Choice Based Sampling." Journal of the Royal Statistical Society, Series B 48: 170. doi:10.1111/j.2517-6161.1986.tb01400.x.
Manski. 1981. Structural Analysis of Discrete Data with Econometric Applications.
Manski. 1999. Nonlinear Statistical Inference: Essays in Honor of Takeshi Amemiya.
King, Gary, and Langche Zeng. 2000b. "Explaining Rare Events in International Relations." International Organization (in press).
King, Gary, and Langche Zeng. 2000a. "Inference in Case-Control Studies with Limited Auxiliary Information" (in press). (Preprint at http://Gking.harvard.edu.)
Bueno de Mesquita, and Lalman. 1992. War and Reason: Domestic and International Imperatives. doi:10.2307/j.ctt1bh4dhm.
Bueno de Mesquita. 1981. The War Trap.
Bennett, D. Scott, and Allan C. Stam III. 1998a. EUGene: Expected Utility Generation and Data Management Program, Version 1.12. http://wizard.ucr.edu/cps/eugene/eugene.html.
Bennett and Stam (1998b) report relative risk in a different format, which we translated to our percentage figure: if r is their measure, ours is 100 × (r − 1).
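To illustrate the conversion (with made-up values of r): a reported relative risk of r = 1.25 corresponds to 100 × (1.25 − 1) = 25% on our scale, and r = 2 corresponds to 100%.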
Deriving π̃i as an approximately unbiased estimator involves some approximations not required for the optimal Bayesian version derived in Appendix E. The problem is that instead of expanding a random πi around a fixed β̂, as in the Bayesian version, we must now expand a random π̂i around a fixed β. Thus, to take the expectation and compute Ci, we need to imagine that, in the correction term, π̂i is a reasonable estimate of πi in this context. This is obviously an undesirable approximation, but it is better than setting it to zero or one (i.e., the equivalent of setting Ci = 0), and as our Monte Carlos show below, π̃i is indeed approximately unbiased.
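As a concrete sketch of how such a correction is applied, assuming the approximate form π̃i = π̂i + Ci with Ci = (0.5 − π̂i) π̂i (1 − π̂i) xi′ V(β̂) xi, where V(β̂) is the estimated covariance matrix of the coefficients; this is a simplified reading of the correction term, not a substitute for the derivations in the appendices:

```python
# Sketch of the probability correction: pi_tilde_i = pi_hat_i + C_i, with
# C_i = (0.5 - pi_hat_i) * pi_hat_i * (1 - pi_hat_i) * x_i' V(beta_hat) x_i.
import numpy as np

def corrected_probs(X, beta_hat, V):
    """X: n-by-k design matrix (with constant); V: k-by-k cov. of beta_hat."""
    pi_hat = 1 / (1 + np.exp(-X @ beta_hat))
    # Quadratic form x_i' V x_i for each row, computed without a loop.
    quad = np.einsum("ij,jk,ik->i", X, V, X)
    C = (0.5 - pi_hat) * pi_hat * (1 - pi_hat) * quad
    return pi_hat + C   # larger than pi_hat whenever pi_hat < 0.5
```

Note that C is positive whenever π̂i < 0.5, so for rare events the correction raises the estimated probability, consistent with the underestimation described in the text.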
Breslow. 1980. Statistical Methods in Cancer Research.
Bennett and Stam (1998b) analyze a data set with 684,000 dyad-years, and they (1998a) have even developed sophisticated software for managing the larger, 1.2 million-dyad data set they distribute.
Rothman. 1998. Modern Epidemiology.
Cosslett. 1981b. Structural Analysis of Discrete Data with Econometric Applications.
Rosenau. 1976. In Search of Global Patterns.
Levy. 1989. Behavior, Society, and Nuclear War, Vol. 1, 2120.
Bennett, D. Scott, and Allan C. Stam III. 1998b. "Theories of Conflict Initiation and Escalation: Comparative Testing, 1816–1980." Presented at the annual meeting of the International Studies Association, Minneapolis.
We analyze the problem of absolute risk directly and then compute relative risk as the ratio of two absolute risks. Although we do not pursue other options here because our estimates of relative risk clearly outperform existing methods, it seems possible that even better methods could be developed that estimate relative risk directly.
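Concretely, for two covariate profiles x1 and x2, the quantity computed is the ratio of absolute risks, RR(x1, x2) = Pr(Y = 1 | x1) / Pr(Y = 1 | x2), with each probability estimated as in the notes above; the percentage format used elsewhere in these notes is then 100 × (RR − 1).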
More formally, suppose P(X | Y = j) = Normal(X | µj, 1), for j = 0, 1. Then the logit model should classify an observation as 1 if the probability is greater than 0.5 or, equivalently, if X > T(µ0, µ1) = [ln(1 − τ) − ln(τ)]/(µ1 − µ0) + (µ0 + µ1)/2. A logit of Y on a constant term and X is fully saturated and hence equivalent to estimating µj with X̄j (the mean of Xi for all i in which Yi = j). However, the estimated classification boundary, T(X̄0, X̄1), will be larger in expectation than T(µ0, µ1) when τ < 0.5 (and thus ln[(1 − τ)/τ] > 0), since, by Jensen's inequality, E[1/(X̄1 − X̄0)] > 1/E(X̄1 − X̄0) = 1/(µ1 − µ0). Hence, the threshold will be too far to the right in Fig. 1 and π̂ will underestimate the probability of a one in finite samples.
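A small simulation makes the Jensen's-inequality point concrete (a sketch with made-up parameter values, drawing the group means directly rather than individual observations):

```python
# Monte Carlo sketch: with tau < 0.5, the estimated classification boundary
# T(xbar0, xbar1) exceeds the true boundary T(mu0, mu1) on average, so the
# threshold sits too far to the right. Parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(1)
mu0, mu1, tau, n = 0.0, 2.0, 0.05, 200

def T(m0, m1):
    return np.log((1 - tau) / tau) / (m1 - m0) + (m0 + m1) / 2

draws = []
for _ in range(10_000):
    n1 = max(rng.binomial(n, tau), 1)          # number of (rare) ones
    xbar1 = rng.normal(mu1, 1 / np.sqrt(n1))   # sample mean of X given Y=1
    xbar0 = rng.normal(mu0, 1 / np.sqrt(n - n1))
    draws.append(T(xbar0, xbar1))

print(np.mean(draws), ">", T(mu0, mu1))        # estimated boundary biased up
```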
McCullagh. 1987. Tensor Methods in Statistics.
Mehta, and Patel. 1997. Exact Inference for Categorical Data.
Greene. 1993. Econometric Analysis.
Achen, Christopher H. 1999. "Retrospective Sampling in International Relations." Presented at the annual meetings of the Midwest Political Science Association, Chicago.
Lancaster. 1996b. "Efficient Estimation and Stratified Sampling." Journal of Econometrics 74: 289. doi:10.1016/0304-4076(95)01756-9.
King. 1994. Designing Social Inquiry: Scientific Inference in Qualitative Research. doi:10.1515/9781400821211.
We have found no discussion in political science of the effects of finite samples and rare events on logistic regression or of most of the methods we discuss that allow selection on Y. There is a brief discussion of one method of correcting selection on Y in asymptotic samples by Bueno de Mesquita and Lalman (1992, Appendix) and in an unpublished paper they cite that has recently become available (Achen 1999).
Cordeiro. 1991. "Bias Correction in Generalized Linear Models." Journal of the Royal Statistical Society, Series B 53: 629. doi:10.1111/j.2517-6161.1991.tb01852.x.
An elegant result due to Firth (1993) shows that bias can also be corrected during the maximization procedure by applying Jeffreys' invariant prior to the logistic likelihood and using the maximum posterior estimate. We have applied this work to weighting and prior correction and run experiments to compare the methods. Consistent with Firth's examples, we find that the methods give answers that are always numerically very close (almost always less than half a percent). An advantage of Firth's procedure is that it gives answers even when the MLE is undefined, as in cases of perfect discrimination; a disadvantage is computational, in that the analytical gradient and Hessian are much more complicated. Another approach to bias reduction is based on jackknife methods, which replace analytical derivations with easy computations, although systematic comparisons by Bull et al. (1997) show that they do not generally work as well as the analytical approaches.
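For readers who want to experiment, here is a minimal sketch of Firth's estimator for the logit model, using the standard modified-score form U*(β) = X′(y − π + h(0.5 − π)), where h is the diagonal of the weighted hat matrix; Fisher scoring on the unmodified information is used for simplicity:

```python
# Minimal sketch of Firth's (1993) bias-corrected logit via Newton/Fisher
# scoring on the modified score U*(b) = X'(y - pi + h*(0.5 - pi)), with h
# the diagonal of the hat matrix under W = diag(pi * (1 - pi)).
import numpy as np

def firth_logit(X, y, tol=1e-8, max_iter=100):
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = 1 / (1 + np.exp(-X @ beta))
        W = pi * (1 - pi)
        XtWX_inv = np.linalg.inv(X.T @ (W[:, None] * X))
        # Diagonal of H = W^(1/2) X (X'WX)^(-1) X' W^(1/2)
        h = W * np.einsum("ij,jk,ik->i", X, XtWX_inv, X)
        score = X.T @ (y - pi + h * (0.5 - pi))
        step = XtWX_inv @ score
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

Unlike the unpenalized MLE, these iterations return finite estimates even under perfect discrimination, which is the advantage noted above.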
The fixed costs involved in gearing up to collect data would be borne with either data collection strategy, and so selecting on the dependent variable as we suggest saves something less in research dollars than the fraction of observations not collected.
King and Zeng (2000a), building on results of Manski (1999), modify the methods in this paper for the situation in which τ is unknown or only partially known. King and Zeng use "robust Bayesian analysis" to specify classes of prior distributions on τ, representing full or partial ignorance. For example, the user can specify that τ is completely unknown or known only to fall, with some probability, within a given interval. The result is classes of posterior distributions (instead of a single posterior) that, in many cases, provide informative estimates of quantities of interest.
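To convey the flavor of interval knowledge about τ, here is a crude sensitivity sweep, not King and Zeng's robust Bayesian procedure: apply the prior correction at each τ in an assumed interval and report the induced range of a predicted probability.

```python
# Crude illustration (not King and Zeng's method): if tau is only known to
# lie in [tau_lo, tau_hi], sweep the intercept prior correction over that
# interval and report the resulting bounds on a predicted probability.
import numpy as np

def prob_given_tau(beta, x0, tau, ybar):
    b = beta.copy()
    b[0] -= np.log(((1 - tau) / tau) * (ybar / (1 - ybar)))
    return 1 / (1 + np.exp(-x0 @ b))

# beta: case-control logit estimates; ybar: sample event fraction; x0: a
# covariate profile of interest (all assumed available from a prior fit).
def prob_bounds(beta, x0, ybar, tau_lo, tau_hi, grid=100):
    taus = np.linspace(tau_lo, tau_hi, grid)
    probs = [prob_given_tau(beta, x0, t, ybar) for t in taus]
    return min(probs), max(probs)
```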
“Exact” tests are a good solution to the problem when all variables are discrete and sufficient (often massive) computational power is available (see Agresti 1992; Mehta and Patel 1997). These tests compute exact finite sample distributions based on permutations of the data tables.
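For instance, with discrete data collapsed to a 2 × 2 table (events vs. nonevents by a binary covariate), an exact test is available off the shelf; the counts below are made up:

```python
# Exact test on a 2x2 table using scipy's Fisher exact test.
from scipy.stats import fisher_exact

table = [[8, 2], [4, 12]]                # rows: X = 0/1; cols: Y = 1/0
odds_ratio, p_value = fisher_exact(table)
print(odds_ratio, p_value)
```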