Logistic Regression in Rare Events Data

Political Analysis - Volume 9, Number 2, pp. 137-163 - 2001
Gary King1,2, Langche Zeng1,2
1Center for Basic Research in the Social Sciences, 34 Kirkland Street, Harvard University, Cambridge, MA 02138
2Department of Political Science, George Washington University, Funger Hall, 2201 G Street NW, Washington, DC 20052

Abstract

We study rare events data, binary dependent variables with dozens to thousands of times fewer ones (events, such as wars, vetoes, cases of political activism, or epidemiological infections) than zeros (“nonevents”). In many literatures, these variables have proven difficult to explain and predict, a problem that seems to have at least two sources. First, popular statistical procedures, such as logistic regression, can sharply underestimate the probability of rare events. We recommend corrections that outperform existing methods and change the estimates of absolute and relative risks by as much as some estimated effects reported in the literature. Second, commonly used data collection strategies are grossly inefficient for rare events data. The fear of collecting data with too few events has led to data collections with huge numbers of observations but relatively few, and poorly measured, explanatory variables, such as in international conflict data with more than a quarter-million dyads, only a few of which are at war. As it turns out, more efficient sampling designs exist for making valid inferences, such as sampling all available events (e.g., wars) and a tiny fraction of nonevents (peace). This enables scholars to save as much as 99% of their (nonfixed) data collection costs or to collect much more meaningful explanatory variables. We provide methods that link these two results, enabling both types of corrections to work simultaneously, and software that implements the methods developed.
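As a concrete illustration of the sampling design described in the abstract, the following sketch fits an ordinary logit to a sample containing all events plus a small random fraction of nonevents, then applies the standard prior correction to the intercept using the known population fraction of events τ. The simulated data, parameter values, and package choices are ours for illustration, not the authors' implementation.

```python
# Minimal sketch: case-control sampling on Y plus prior correction.
# Assumes tau (population event fraction) is known; all values simulated.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated population with roughly 1% events
n = 200_000
X = rng.normal(size=(n, 2))
eta = -5.0 + X @ np.array([1.0, 0.5])
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))
tau = y.mean()                              # population event fraction

# Case-control sample: all events plus 5% of nonevents
keep = (y == 1) | (rng.random(n) < 0.05)
Xs, ys = sm.add_constant(X[keep]), y[keep]
ybar = ys.mean()                            # sample event fraction

res = sm.Logit(ys, Xs).fit(disp=0)

# Slope coefficients are consistent under selection on Y; only the
# intercept needs the prior correction with the known tau:
b0 = res.params[0] - np.log(((1 - tau) / tau) * (ybar / (1 - ybar)))
print("raw intercept:", res.params[0], "corrected:", b0)
```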

References

Verba, 1995, Voice and Equality: Civic Voluntarism in American Politics, 10.2307/j.ctv1pnc1k7

Tucker, Richard. 1999. “BTSCS: A Binary Time-Series-Cross-Section Data Analysis Utility,” Version 3.0.4. http://www.fas.harvard.edu/~rtucker/programs/btscs/btscs.html.

Tucker, Richard. 1998. “The Interstate Dyad-Year Dataset, 1816–1997,” Version 3.0. http://www.fas.harvard.edu/~rtucker/data/dyadyear/.

10.1007/978-1-4612-4024-2

Smith, 1998, Bayesian Statistics

10.1111/0020-8833.00113

10.2307/2585396

Scott, 1986, Fitting Logistic Models Under Case-Control or Choice Based Sampling, Journal of the Royal Statistical Society, B, 48, 170, 10.1111/j.2517-6161.1986.tb01400.x

10.1017/CBO9780511812651

10.1093/biomet/66.3.403

10.1002/sim.4780140806

10.2307/2938740

Manski, 1981, Structural Analysis of Discrete Data with Econometric Applications

10.2307/1914121

Manski, 1999, Nonlinear Statistical Inference: Essays in Honor of Takeshi Amemiya

10.1016/0304-4076(94)01698-4

10.1007/978-1-4899-3242-6

10.2307/2669316

King, Gary, and Langche Zeng. 2000b. “Explaining Rare Events in International Relations.” International Organization (in press).

King, Gary, and Langche Zeng. 2000a. “Inference in Case-Control Studies with Limited Auxiliary Information” (in press). (Preprint at http://Gking.harvard.edu.)

10.2307/2951544

10.1080/01621459.1985.10478165

10.1017/CBO9780511521713

10.1007/978-1-4899-4467-2

10.1093/biomet/80.1.27

10.2307/1912755

Bueno de Mesquita, 1992, War and Reason: Domestic and International Imperatives, 10.2307/j.ctt1bh4dhm

Bueno de Mesquita, 1981, The War Trap

10.1080/01621459.1996.10476660

Bennett, D. Scott, and Allan C. Stam III. 1998a. EUGene: Expected Utility Generation and Data Management Program, Version 1.12. http://wizard.ucr.edu/cps/eugene/eugene.html.

10.2307/1913609

10.1214/ss/1177011454

Bennett and Stam (1998b) report relative risk in a different format, which we translated to our percentage figure: if r is their measure, ours is 100 × (r − 1).

Deriving π̃i as an approximately unbiased estimator involves some approximations not required for the optimal Bayesian version derived in Appendix E. The problem is that instead of expanding a random πi around a fixed β as in the Bayesian version, we now must expand a random π̂i around a fixed β. Thus, to take the expectation and compute Ci, we need to imagine that in the correction term π̂i is a reasonable estimate of πi in this context. This is obviously an undesirable approximation, but it is better than setting it to zero or one (i.e., the equivalent of setting Ci = 0), and as our Monte Carlos show below, π̃i is indeed approximately unbiased.
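For readers who want to see the correction term in operation, here is a minimal numerical sketch, assuming a logit has already been fitted and its coefficient vector and covariance matrix are available. The functional form, Ci = (0.5 − π̂i) π̂i (1 − π̂i) xi V(β̂) xi′ with π̃i = π̂i + Ci, follows the approximate correction this note refers to; all numerical values below are placeholders.

```python
# Sketch of the approximately unbiased probability estimate:
# pi_tilde = pi_hat + C, with
# C = (0.5 - pi_hat) * pi_hat * (1 - pi_hat) * x' V(beta_hat) x.
# beta_hat and V are assumed to come from an already-fitted logit.
import numpy as np

def corrected_probability(x, beta_hat, V):
    """Approximately unbiased event probability for covariate row x."""
    pi_hat = 1.0 / (1.0 + np.exp(-x @ beta_hat))   # plug-in MLE estimate
    C = (0.5 - pi_hat) * pi_hat * (1.0 - pi_hat) * (x @ V @ x)
    return pi_hat + C

# Placeholder fitted quantities (x includes a leading 1 for the intercept)
beta_hat = np.array([-4.0, 1.2])
V = np.array([[0.20, -0.03],
              [-0.03, 0.02]])
x = np.array([1.0, 0.5])
print(corrected_probability(x, beta_hat, V))
```

Note that for rare events π̂i < 0.5, so Ci > 0 and the correction raises the estimated probability, consistent with the underestimation the paper documents.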

Breslow, 1980, Statistical Methods in Cancer Research

Bennett and Stam (1998b) analyze a data set with 684,000 dyad-years and (1998a) have even developed sophisticated software for managing the larger, 1.2 million-dyad data set they distribute.

10.1002/sim.4780020108

Rothman, 1998, Modern Epidemiology

Cosslett, 1981b, Structural Analysis of Discrete Data with Econometric Applications

Rosenau, 1976, In Search of Global Patterns

10.1017/CBO9780511583483

Levy, 1989, Behavior, Society, and Nuclear War, 1, 2120

Bennett, D. Scott, and Allan C. Stam III. 1998b. “Theories of Conflict Initiation and Escalation: Comparative Testing, 1816–1980.” Presented at the annual meeting of the International Studies Association, Minneapolis.

We analyze the problem of absolute risk directly and then compute relative risk as the ratio of two absolute risks. Although we do not pursue other options here because our estimates of relative risk clearly outperform existing methods, it seems possible that even better methods could be developed that estimate relative risk directly.

More formally, suppose P(X | Y = j) = Normal(X | μj, 1), for j = 0, 1. Then the logit model should classify an observation as 1 if the probability is greater than 0.5 or, equivalently, if X > T(μ0, μ1) = [ln(1 − τ) − ln(τ)]/(μ1 − μ0) + (μ0 + μ1)/2. A logit of Y on a constant term and X is fully saturated and hence equivalent to estimating μj with X̄j (the mean of Xi for all i in which Yi = j). However, the estimated classification boundary, T(X̄0, X̄1), will on average be larger than T(μ0, μ1) when τ < 0.5 (and thus ln[(1 − τ)/τ] > 0), since, by Jensen's inequality, E[1/(X̄1 − X̄0)] > 1/E[X̄1 − X̄0] = 1/(μ1 − μ0). Hence, the threshold will be too far to the right in Fig. 1 and will underestimate the probability of a one in finite samples.
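The bias this note describes is easy to reproduce by simulation. The sketch below compares the true boundary T(μ0, μ1) with the average estimated boundary T(X̄0, X̄1); the values μ0 = 0, μ1 = 1, τ = 0.1, and n = 200 are arbitrary choices of ours for the demonstration.

```python
# Monte Carlo illustration of the finite-sample threshold bias,
# assuming X | Y=j ~ Normal(mu_j, 1). All parameter values are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
mu0, mu1, tau, n = 0.0, 1.0, 0.1, 200

def T(m0, m1):
    # Classification boundary: logit predicts 1 when X exceeds this value.
    return (np.log(1 - tau) - np.log(tau)) / (m1 - m0) + (m0 + m1) / 2

est = []
for _ in range(5000):
    n1 = rng.binomial(n, tau)           # number of events in this sample
    x0 = rng.normal(mu0, 1, size=n - n1)
    x1 = rng.normal(mu1, 1, size=n1)
    if n1 >= 2:                         # need events to estimate their mean
        est.append(T(x0.mean(), x1.mean()))

# By Jensen's inequality the estimated boundary is too large on average
# when tau < 0.5, so ones are under-predicted.
print("true threshold:", T(mu0, mu1), "mean estimated:", np.mean(est))
```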

McCullagh, 1987, Tensor Methods in Statistics

Mehta, 1997, Exact Inference for Categorical Data

10.1177/0049124189017003003

10.2307/1957394

10.1017/S0003055400220078

Greene, 1993, Econometric Analysis

Achen, Christopher H. 1999. “Retrospective Sampling in International Relations.” Presented at the annual meeting of the Midwest Political Science Association, Chicago.

10.1016/0378-3758(93)E0091-T

10.1177/0193841X8801200301

Lancaster, 1996b, Efficient Estimation and Stratified Sampling, Journal of Econometrics, 74, 289, 10.1016/0304-4076(95)01756-9

King, 1994, Designing Social Inquiry: Scientific Inference in Qualitative Research, 10.1515/9781400821211

10.1002/(SICI)1097-0258(19970315)16:5<545::AID-SIM421>3.0.CO;2-3

We have found no discussion in political science of the effects of finite samples and rare events on logistic regression or of most of the methods we discuss that allow selection on Y. There is a brief discussion of one method of correcting selection on Y in asymptotic samples by Bueno de Mesquita and Lalman (1992, Appendix) and in an unpublished paper they cite that has recently become available (Achen 1999).

Cordeiro, 1991, Bias Correction in Generalized Linear Models, Journal of the Royal Statistical Society, B, 53, 629, 10.1111/j.2517-6161.1991.tb01852.x

An elegant result due to Firth (1993) shows that bias can also be corrected during the maximization procedure by applying Jeffreys' invariant prior to the logistic likelihood and using the maximum posterior estimate. We have applied this work to weighting and prior correction and run experiments to compare the methods. Consistent with Firth's examples, we find that the methods give answers that are always numerically very close (almost always less than half a percent). An advantage of Firth's procedure is that it gives answers even when the MLE is undefined, as in cases of perfect discrimination; a disadvantage is computational, in that the analytical gradient and Hessian are much more complicated. Another approach to bias reduction is based on jackknife methods, which replace analytical derivations with easy computations, although systematic comparisons by Bull et al. (1997) show that they do not generally work as well as the analytical approaches.
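For concreteness, here is a minimal sketch of Firth's bias-reduced logit, using the known score adjustment for logistic regression: the score becomes X′(y − π + h(0.5 − π)), where h holds the hat-matrix diagonals. The toy data and the deliberately simple Newton iteration are our choices, not Firth's or the authors' code.

```python
# Minimal sketch of Firth's (1993) bias-reduced logit via the
# Jeffreys-prior penalty. Convergence handling is deliberately simple.
import numpy as np

def firth_logit(X, y, tol=1e-8, max_iter=100):
    n, k = X.shape
    beta = np.zeros(k)
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ beta))
        W = pi * (1.0 - pi)                      # weights, as a vector
        XWX = X.T @ (W[:, None] * X)             # Fisher information
        XWX_inv = np.linalg.inv(XWX)
        # Hat-matrix diagonals: h_i = w_i * x_i' (X'WX)^{-1} x_i
        h = W * np.einsum("ij,jk,ik->i", X, XWX_inv, X)
        score = X.T @ (y - pi + h * (0.5 - pi))  # Firth-modified score
        step = XWX_inv @ score
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Toy data with perfect separation, where the ordinary MLE diverges
# but Firth's estimate remains finite:
X = np.column_stack([np.ones(6), [1, 2, 3, 4, 5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])
print(firth_logit(X, y))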

The fixed costs involved in gearing up to collect data would be borne with either data collection strategy, and so selecting on the dependent variable as we suggest saves something less in research dollars than the fraction of observations not collected.

King and Zeng (2000a), building on results of Manski (1999), modify the methods in this paper for situations where τ is unknown or only partially known. King and Zeng use robust Bayesian analysis to specify classes of prior distributions on τ representing full or partial ignorance: for example, the user can specify that τ is completely unknown, or known only to lie in a given interval with some probability. The result is a class of posterior distributions (instead of a single posterior) that, in many cases, provides informative estimates of the quantities of interest.
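A sketch of the bounding idea, assuming only that τ lies in a known interval: because the prior-correction term ln[((1 − τ)/τ)(ȳ/(1 − ȳ))] is monotone in τ, evaluating it at the interval endpoints bounds the corrected event probability. This illustrates the flavor of interval ignorance about τ, not King and Zeng's (2000a) actual robust Bayesian machinery; all fitted values below are placeholders.

```python
# Bounds on the corrected event probability when tau is only known
# to lie in [tau_lo, tau_hi]. beta_hat is from an uncorrected sample logit.
import numpy as np

def prob_bounds(x, beta_hat, ybar, tau_lo, tau_hi):
    """Bounds on Pr(Y=1|x) under interval knowledge of tau."""
    probs = []
    for tau in (tau_lo, tau_hi):
        correction = np.log(((1 - tau) / tau) * (ybar / (1 - ybar)))
        eta = x @ beta_hat - correction      # corrected linear predictor
        probs.append(1.0 / (1.0 + np.exp(-eta)))
    return min(probs), max(probs)

# Placeholder fitted values for illustration
beta_hat = np.array([0.5, 1.0])     # intercept, slope from sample logit
x = np.array([1.0, 0.2])
print(prob_bounds(x, beta_hat, ybar=0.4, tau_lo=0.005, tau_hi=0.02))
```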

“Exact” tests are a good solution to the problem when all variables are discrete and sufficient (often massive) computational power is available (see Agresti 1992; Mehta and Patel 1997). These tests compute exact finite sample distributions based on permutations of the data tables.
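As a small example of this approach, Fisher's exact test computes the exact permutation distribution of a 2 × 2 table of a rare event against a binary covariate; the counts below are made up for illustration.

```python
# Fisher's exact test on a 2x2 table (illustrative counts only).
from scipy.stats import fisher_exact

#                event   nonevent
table = [[  6,    994],   # covariate = 1
         [  1,   1999]]   # covariate = 0
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(odds_ratio, p_value)
```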