Original articles
Stepwise Selection in Small Data Sets: A Simulation Study of Bias in Logistic Regression Analysis

https://doi.org/10.1016/S0895-4356(99)00103-1

Abstract

Stepwise selection methods are widely applied to identify covariables for inclusion in regression models. One of the problems of stepwise selection is biased estimation of the regression coefficients. We illustrate this “selection bias” with logistic regression in the GUSTO-I trial (40,830 patients with an acute myocardial infarction). Random samples were drawn that included 3, 5, 10, 20, or 40 events per variable (EPV). Backward stepwise selection was applied in models containing 8 or 16 pre-specified predictors of 30-day mortality.

We found considerable overestimation of the regression coefficients of selected covariables. The selection bias decreased with increasing EPV. For EPV of 3, 10, or 40, the bias exceeded 25% for 7, 3, and 1 of the 8 predictors, respectively, when a conventional selection criterion was used (α = 0.05). For the same EPV values, the bias was less than 20% for all covariables when no selection was applied. We conclude that stepwise selection may result in substantial bias of the estimated regression coefficients.

Introduction

Multivariable regression analysis is a valuable technique to quantify the relation between two or more covariables and an outcome variable 1, 2, 3. The covariables may be labeled exposures and confounders in the context of causal relations, or predictors in the context of prognostic modeling. Stepwise selection methods are widely applied to identify a limited number of covariables for inclusion in regression models, especially in prediction problems or in situations with a large number of exposure variables. Stepwise selection indicates covariables with a statistically significant effect, simultaneously adjusting for the other covariables in the regression model. The method is discussed in standard textbooks 2, 3, and is available in all major statistical software programs for several types of commonly used epidemiological models, such as linear, logistic, Cox, and Poisson regression.
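The mechanics of backward stepwise selection can be sketched in a few lines. The following is a minimal illustration, not the GUSTO-I analysis: it assumes a Newton-Raphson logistic fit and two-sided Wald p-values as the selection criterion, with toy simulated data whose predictor names and coefficients are hypothetical.

```python
import numpy as np
from math import erf, sqrt


def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic fit by Newton-Raphson.
    Returns (coefficients, Wald standard errors)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities
        W = mu * (1.0 - mu)                      # IRLS weights
        info = X.T @ (X * W[:, None])            # observed information matrix
        beta = beta + np.linalg.solve(info, X.T @ (y - mu))
    se = np.sqrt(np.diag(np.linalg.inv(info)))
    return beta, se


def backward_stepwise(X, y, names, alpha=0.05):
    """Repeatedly drop the covariable with the largest Wald p-value
    until every remaining covariable is significant at `alpha`.
    Column 0 is assumed to be the intercept and is never dropped."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        beta, se = fit_logistic(X[:, keep], y)
        # two-sided p-values from the standard-normal Wald statistic
        pvals = [2.0 * (1.0 - 0.5 * (1.0 + erf(abs(b / s) / sqrt(2.0))))
                 for b, s in zip(beta, se)]
        worst = max(range(1, len(keep)), key=lambda i: pvals[i])
        if pvals[worst] <= alpha:
            break
        del keep[worst]
    return [names[i] for i in keep[1:]]


# Toy data: two real predictors (x1, x2) and two pure-noise predictors (x3, x4)
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.standard_normal((n, 4))])
true_beta = np.array([-1.0, 1.0, 0.8, 0.0, 0.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)
selected = backward_stepwise(X, y, ["intercept", "x1", "x2", "x3", "x4"])
```

With 2000 observations the two real predictors are virtually always retained; at α = 0.05, each noise predictor is still retained in roughly 5% of such runs, which is the multiple-comparisons risk noted above.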

Stepwise selection methods (including forward, backward, combined forward-backward, and all-possible-subsets selection) have been criticized for a number of reasons in the literature 4, 5, 6, 7, 8, 9, 10. The first problem concerns the set of selected covariables 11, 12, 13, 14. Different selections may arise when a relatively small number of patients is added to or deleted from the data set. Also, simulation experiments have shown that stepwise methods have limited power to select important covariables in small data sets. On the other hand, there is a substantial risk that one or more (almost) random covariables are selected, since multiple comparisons are made [7].

The second problem concerns the estimation of the regression coefficients [4]. The variance of the coefficients is usually calculated as if the selection was predetermined [15]. Ignoring model uncertainty causes an underestimation of the variability of the estimated coefficients in the resulting model 5, 13. In addition, stepwise selection causes bias in the estimated regression coefficients 16, 17. This “selection bias” is the main focus of this paper. We may distinguish conditional and unconditional selection bias, which has also been referred to as overfitting and underfitting, respectively [18]. Conditional selection bias (or overfitting) refers to the phenomenon in which the coefficients of selected covariables are biased to more extreme values 8, 15. Unconditional selection bias (or underfitting) refers to the average underestimation when all estimated regression coefficients are considered, including those (statistically non-significant) coefficients which may be considered as set to zero by exclusion of the covariable from the model [15].
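The distinction between conditional and unconditional selection bias can be made concrete with a small simulation. This is a sketch under assumed parameter values (a single covariable with true coefficient 0.5, 100 patients per sample), not the paper's GUSTO-I setup: fit a one-covariable logistic model repeatedly, "select" the covariable only when its Wald test is significant at α = 0.05, and compare the conditional and unconditional averages of the estimated coefficient with the true value.

```python
import numpy as np


def fit_slope(x, y, n_iter=25):
    """Newton-Raphson MLE for logit(p) = b0 + b1*x; returns (b1_hat, se_b1)."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        info = X.T @ (X * (mu * (1.0 - mu))[:, None])  # observed information
        beta = beta + np.linalg.solve(info, X.T @ (y - mu))
    return beta[1], np.sqrt(np.linalg.inv(info)[1, 1])


rng = np.random.default_rng(1)
true_b1, n, reps = 0.5, 100, 2000
kept, overall = [], []
for _ in range(reps):
    x = rng.standard_normal(n)
    p = 1.0 / (1.0 + np.exp(-(-1.0 + true_b1 * x)))
    y = (rng.random(n) < p).astype(float)
    b1, se = fit_slope(x, y)
    if abs(b1 / se) > 1.96:      # covariable selected at alpha = 0.05
        kept.append(b1)
        overall.append(b1)
    else:                        # excluded: coefficient implicitly set to 0
        overall.append(0.0)

conditional_mean = np.mean(kept)       # overestimates true_b1 (overfitting)
unconditional_mean = np.mean(overall)  # underestimates true_b1 (underfitting)
```

Because only estimates extreme enough to pass the significance test are kept, the conditional mean lands above 0.5, while averaging in the zeros of the excluded runs pulls the unconditional mean below it, mirroring the two forms of bias described above.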

In this paper, we study the relation between selection bias and sample size for logistic regression models. We analyze relationships of covariables with 30-day mortality in random subsamples from a large data set of patients with acute myocardial infarction. In particular, we focus on small subsamples, as determined by the number of events per variable (EPV). It has previously been shown that low values of EPV lead to bias in the estimated regression coefficients in a pre-specified model [19]. One might therefore be tempted to use stepwise selection to limit the number of covariables included in the model. It will be shown that this approach worsens rather than solves the problem of biased estimates.

Section snippets

Patients

We applied logistic regression analyses in a data set of patients with acute myocardial infarction (GUSTO-I) [20]. This data set has been used before for regression modeling [21]. In brief, this data set consists of 40,830 patients included from 1082 hospitals in 14 countries. Patients were randomized to one of four thrombolytic regimens. The differences between these regimens were small relative to the effect of predictive covariables and are ignored in the present analyses. Mortality at 30

Results

Table 1 summarizes the main characteristics of the 8 predictors considered in the multivariable logistic regression analysis. The prevalence varied between 2% for shock to around 40% for age and high risk. The multivariable regression coefficients were all smaller than the univariable coefficients, reflecting positive correlations between the predictors. The strongest correlations were observed between age and gender (Pearson's correlation coefficient (r) = 0.22), shock and hypotension (r =

Discussion

We studied the influence of backward stepwise selection on the estimated logistic regression coefficients in small samples. We confirmed that the estimated regression coefficients may be somewhat under- or overestimated when a pre-specified model is fitted in a small sample [19]. For 5 events per variable (EPV), this bias was approximately between −10% and +10% in an 8-predictor model. If one would limit the number of covariables to be included in the model with a backward stepwise procedure,

Acknowledgements

Part of this research was supported by a grant from the Netherlands Organization for Scientific Research (NWO, S96-156). We would like to thank Kerry L. Lee, PhD, Department of Community and Family Medicine, Duke University Medical Center, and the GUSTO-I investigators for making the GUSTO-I data available for analysis; Frank E. Harrell Jr., PhD, Department of Health Evaluation Sciences, University of Virginia, and three anonymous reviewers for helpful comments on previous versions of this

References (40)

  • S. Derksen et al.

    Backward, forward and stepwise automated subset selection algorithms: frequency of obtaining authentic and noise variables

    Br J Math Stat Psychol

    (1992)
  • R. Simon et al.

    Statistical aspects of prognostic factor studies in oncology

    Br J Cancer

    (1994)
  • F. Harrell et al.

    Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors

    Stat Med

    (1996)
  • S.T. Buckland et al.

    Model selection: an integral part of inference

    Biometrics

    (1997)
  • F.E. Harrell et al.

    Regression modelling strategies for improved prognostic prediction

    Stat Med

    (1984)
  • C.-H. Chen et al.

    The bootstrap and identification of prognostic factors via Cox's proportional hazards regression model

    Stat Med

    (1985)
  • D.G. Altman et al.

    Bootstrap investigation of the stability of the Cox regression model

    Stat Med

    (1989)
  • W. Sauerbrei et al.

    A bootstrap resampling procedure for model building: application to the Cox regression model

    Stat Med

    (1992)
  • C. Chatfield

    Model uncertainty, data mining, and statistical inference

    J Roy Statist Soc A

    (1995)
  • Miller AJ. Subset selection in regression. Chapman & Hall, London,...