Stepwise Selection in Small Data Sets: A Simulation Study of Bias in Logistic Regression Analysis
Introduction
Multivariable regression analysis is a valuable technique to quantify the relation between two or more covariables and an outcome variable [1, 2, 3]. The covariables may be labeled exposures and confounders in the context of causal relations, or predictors in the context of prognostic modeling. Stepwise selection methods are widely applied to identify a limited number of covariables for inclusion in regression models, especially in prediction problems or in situations with a large number of exposure variables. Stepwise selection identifies covariables with a statistically significant effect, while simultaneously adjusting for the other covariables in the regression model. The method is discussed in standard textbooks [2, 3] and is available in all major statistical software programs for several types of commonly used epidemiological models, such as linear, logistic, Cox, and Poisson regression.
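As an illustration of the procedure described above, the following sketch implements backward stepwise selection for a logistic regression model in plain NumPy, repeatedly dropping the covariable with the largest Wald p-value until all remaining covariables are significant. The data, variable names, and significance level are hypothetical choices for illustration, not taken from the article.

```python
import math
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson.
    Returns the coefficient vector and its standard errors."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)                       # IRLS weights
        H = X.T @ (X * W[:, None])              # Fisher information
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    se = np.sqrt(np.diag(np.linalg.inv(H)))
    return beta, se

def wald_p(z):
    """Two-sided p-value for a Wald z-statistic (normal approximation)."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

def backward_stepwise(X, y, names, alpha=0.05):
    """Backward elimination: drop the covariable with the largest Wald
    p-value until every remaining p-value is below alpha.
    Column 0 of X is assumed to be the intercept and is never dropped."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        beta, se = fit_logistic(X[:, keep], y)
        pvals = [wald_p(b / s) for b, s in zip(beta, se)]
        worst = max(range(1, len(keep)), key=lambda j: pvals[j])
        if pvals[worst] < alpha:
            break
        del keep[worst]
    return [names[k] for k in keep[1:]]

# Hypothetical data: x1 has a real effect, x2 and x3 are pure noise.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=(n, 3))
logit = -2.0 + 1.5 * x[:, 0]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
X = np.column_stack([np.ones(n), x])
selected = backward_stepwise(X, y, ["intercept", "x1", "x2", "x3"])
print(selected)
```

Note that the significance threshold `alpha` plays the role of the stopping rule; more liberal thresholds retain more covariables and change the selected set, which is one source of the instability discussed below.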
Stepwise selection methods (including forward, backward, combined forward-backward, and all-possible-subsets selection) have been criticized for a number of reasons in the literature [4, 5, 6, 7, 8, 9, 10]. The first problem concerns the set of selected covariables [11, 12, 13, 14]. Different selections may arise when a relatively small number of patients is added to the data set or deleted from it. Also, simulation experiments have shown that stepwise methods have limited power to select important covariables in small data sets. On the other hand, there is a substantial risk that one or more (almost) random covariables are selected, since multiple comparisons are made [7].
The second problem concerns the estimation of the regression coefficients [4]. The variance of the coefficients is usually calculated as if the selection was predetermined [15]. Ignoring model uncertainty causes an underestimation of the variability of the estimated coefficients in the resulting model [5, 13]. In addition, stepwise selection causes bias in the estimated regression coefficients [16, 17]. This “selection bias” is the main focus of this paper. We may distinguish conditional and unconditional selection bias, which have also been referred to as overfitting and underfitting, respectively [18]. Conditional selection bias (or overfitting) refers to the phenomenon in which the coefficients of selected covariables are biased to more extreme values [8, 15]. Unconditional selection bias (or underfitting) refers to the average underestimation when all estimated regression coefficients are considered, including those (statistically non-significant) coefficients which may be considered as set to zero by exclusion of the covariable from the model [15].
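The distinction between conditional and unconditional selection bias can be made concrete with a small simulation. The sketch below (with an arbitrary true coefficient, sample size, and number of replications, none of which come from the article) repeatedly applies significance-based selection to a single covariable: averaging the coefficient over only the replications in which it was selected illustrates overfitting, while averaging over all replications with non-selected coefficients set to zero illustrates underfitting.

```python
import math
import numpy as np

def simulate_selection_bias(true_beta=0.5, n=100, n_sim=500, alpha=0.05, seed=1):
    """Toy illustration of conditional vs. unconditional selection bias for
    a single covariable (all settings are illustrative assumptions, not the
    article's simulation design). Each replication fits a univariable
    logistic model and 'selects' the covariable only if its Wald test is
    significant at level alpha."""
    rng = np.random.default_rng(seed)
    selected_estimates, all_estimates = [], []
    for _ in range(n_sim):
        x = rng.normal(size=n)
        y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_beta * x)))
        X = np.column_stack([np.ones(n), x])
        beta = np.zeros(2)
        for _ in range(25):                       # Newton-Raphson fit
            p = 1.0 / (1.0 + np.exp(-X @ beta))
            H = X.T @ (X * (p * (1.0 - p))[:, None])
            beta = beta + np.linalg.solve(H, X.T @ (y - p))
        se = math.sqrt(np.linalg.inv(H)[1, 1])
        z = beta[1] / se
        pval = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
        if pval < alpha:
            selected_estimates.append(beta[1])    # covariable retained
            all_estimates.append(beta[1])
        else:
            all_estimates.append(0.0)             # excluded => coefficient set to 0
    conditional = float(np.mean(selected_estimates))   # overfitting: too extreme
    unconditional = float(np.mean(all_estimates))      # underfitting: too small
    return conditional, unconditional

conditional, unconditional = simulate_selection_bias()
```

With a modest true effect, the conditional average exceeds the true coefficient while the unconditional average falls short of it, mirroring the overfitting/underfitting distinction above.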
In this paper, we study the relation between selection bias and sample size for logistic regression models. We analyze relationships of covariables with 30-day mortality in random subsamples from a large data set of patients with acute myocardial infarction. In particular, we focus on small subsamples, as characterized by the number of events per variable (EPV). It has previously been shown that low values for EPV lead to bias in the estimated regression coefficients in a pre-specified model [19]. One might hence be tempted to use stepwise selection to limit the number of covariables included in the model. It will be shown that this approach worsens rather than solves the problem of biased estimates.
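The EPV calculation itself is simple arithmetic: the expected number of outcome events divided by the number of candidate covariables. A minimal sketch, assuming a hypothetical 8% event rate (the article's actual mortality rate is not given in this excerpt):

```python
import math

def events_per_variable(n_patients, event_rate, n_covariables):
    """EPV = expected number of outcome events / number of candidate covariables."""
    return (n_patients * event_rate) / n_covariables

def patients_needed(target_epv, event_rate, n_covariables):
    """Smallest sample size that achieves the target EPV."""
    return math.ceil(target_epv * n_covariables / event_rate)

# With 8 candidate predictors and a hypothetical 8% event rate
# (an assumption for illustration, not a figure from the article):
epv_1000 = events_per_variable(1000, 0.08, 8)   # 80 events over 8 predictors
n_for_epv5 = patients_needed(5, 0.08, 8)        # subsample size giving EPV = 5
```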
Patients
We applied logistic regression analyses in a data set of patients with acute myocardial infarction (GUSTO-I) [20]. This data set has been used before for regression modeling [21]. In brief, this data set consists of 40,830 patients included from 1082 hospitals in 14 countries. Patients were randomized to one of four thrombolytic regimens. The differences between these regimens were small relative to the effect of predictive covariables and are ignored in the present analyses. Mortality at 30 days was the outcome variable.
Results
Table 1 summarizes the main characteristics of the 8 predictors considered in the multivariable logistic regression analysis. The prevalence varied from 2% for shock to around 40% for age and high risk. The multivariable regression coefficients were all smaller than the univariable coefficients, reflecting positive correlations between the predictors. The strongest correlations were observed between age and gender (Pearson's correlation coefficient, r = 0.22) and between shock and hypotension.
Discussion
We studied the influence of backward stepwise selection on the estimated logistic regression coefficients in small samples. We confirmed that the estimated regression coefficients may be somewhat under- or overestimated when a pre-specified model is fitted in a small sample [19]. For 5 events per variable (EPV), this bias was approximately between −10% and +10% in an 8-predictor model. If one were to limit the number of covariables included in the model with a backward stepwise procedure, the problem of biased estimates would be worsened rather than solved.
Acknowledgements
Part of this research was supported by a grant from the Netherlands Organization for Scientific Research (NWO, S96-156). We would like to thank Kerry L. Lee, PhD, Department of Community and Family Medicine, Duke University Medical Center, and the GUSTO-I investigators for making the GUSTO-I data available for analysis; Frank E. Harrell Jr., PhD, Department of Health Evaluation Sciences, University of Virginia, and three anonymous reviewers for helpful comments on previous versions of this manuscript.
References
- et al. Importance of events per independent variable in proportional hazards analysis. I. Background, goals, and general strategy. J Clin Epidemiol (1995).
- et al. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol (1996).
- et al. Short-term risk stratification at admission based on simple clinical data in acute myocardial infarction. Am J Cardiol (1988).
- et al. Randomisation and baseline comparison in clinical trials. Lancet (1990).
- Multivariable Analysis (1996).
- et al. Epidemiologic research: principles and quantitative methods (1982).
- Rothman KJ, Greenland S. Modern Epidemiology. 2nd ed. Philadelphia: Lippincott-Raven;...
- et al. Inference based on conditional specification. Int Stat Rev (1977).
- The influence of variable selection: a Bayesian diagnostic perspective. JASA (1995).
- Methods for epidemiologic analyses of multiple exposures: a review and comparative study of maximum-likelihood, preliminary-testing, and empirical-Bayes regression. Stat Med (1993).
- Backward, forward and stepwise automated subset algorithms: frequency of obtaining authentic and noise variables. Br J Math Stat Psych.
- Statistical aspects of prognostic factor studies in oncology. Br J Cancer.
- Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med.
- Model selection: an integral part of inference. Biometrics.
- Regression modelling strategies for improved prognostic prediction. Stat Med.
- The bootstrap and identification of prognostic factors via Cox's proportional hazards regression model. Stat Med.
- Bootstrap investigation of the stability of the Cox regression model. Stat Med.
- A bootstrap resampling procedure for model building: application to the Cox regression model. Stat Med.
- Model uncertainty, data mining, and statistical inference. J Roy Statist Soc A.