Monday, 6 September 2010

Stepwise Regression

Never let the computer decide what your model should be. This is something I always try to stress in my daily work: our statistical models should include the knowledge we have about the field we are working in. The simplest example of leaving all decisions to the computer (or to the statistical program, if you prefer) is running a stepwise regression.

Back in my school days it was just one of the ways to do variable selection in a regression model with many candidate variables. I did not consult the literature, but my feeling is that back then people just were not aware of the problems of stepwise regression. Nowadays, if you try to defend stepwise regression at a table of statisticians, they will shoot you down.

It is not difficult to find papers discussing the problems of stepwise. We have two examples here and here. I think one of the problems with stepwise selection is the one I mentioned above: it takes the responsibility of thinking out of the hands of the researcher. The computer decides what should and should not be in the model, using criteria not always understood by those who run the model. Another problem is overfitting, or capitalization on chance. Stepwise is often used when there are many variables, and these are precisely the cases where some variables will easily enter the model just by chance.

Figure: R-square for 200 simulations where 100 random variables explain a dependent variable in a dataset with 1000 units.

I ran a quick simulation on this. I generated a random dependent variable and 100 random independent variables, then ran a stepwise regression using the random variables to explain the dependent variable. Of course we should not find any association here, since all the variables are independent random variables. The two histograms show the R-square of the resulting models over 200 simulations. If the sample size is 300, the selected model appears to explain between 10% and 35% of the variability in Y. If the sample size is larger, say 1000 cases, the effect of stepwise is less misleading, as the larger sample size protects against overfitting.
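
A rough yardstick for reading these histograms, assuming ordinary least squares with an intercept and normally distributed noise (the simulation uses uniform variables, so this is only approximate): if all p predictors were forced into a model with n cases and there were no true association, the expected R-square would be p/(n-1). Stepwise keeps only a subset of the predictors, so its R-square in any given data set cannot exceed the full-model value. With p = 100:

```latex
E\left[R^2\right] = \frac{p}{n-1},
\qquad
n = 300:\ \frac{100}{299} \approx 0.33,
\qquad
n = 1000:\ \frac{100}{999} \approx 0.10
```

So the same 100 noise variables have roughly three times as much room to fake a fit with 300 cases as with 1000.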

The overfitting is easily seen in Discriminant Analysis as well, which also has variable selection options.

Figure: R-square for 200 simulations where 100 random variables explain a dependent variable in a dataset of 300 units.

There are at least two ways of avoiding this type of overfitting. One would be to include random variables in the data file and stop the stepwise whenever a random variable is selected. The other is to run the variable selection on a subset of the data file and validate the result using the rest. Unfortunately neither of these procedures is easy to apply, since current software does not let you stop the stepwise when variable X enters the model, and usually the sample size is too small to allow for a validation set.
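
Where PROC GLMSELECT is available, both safeguards can at least be approximated. The sketch below is only an illustration under several assumptions: the probe variable names, the number of probes, the seed and the 30% holdout are all made up for the example, and the ODS column name EffectEntered should be checked against your own output (for instance with PROC CONTENTS on the steps data set). It does not literally stop the stepwise; it reports the step at which the first probe entered, so that anything selected from that step onwards can be treated with suspicion.

```sas
/* Sketch only: probe names, seed and sizes are illustrative assumptions. */
%let nind   = 100;   /* candidate predictors, as in the simulation    */
%let nprobe = 20;    /* hypothetical number of random probe variables */
%let ncases = 300;

/* Simulated data: y, the candidate predictors and the probes are all noise. */
data withprobes;
   call streaminit(20100906);
   array x[&nind] x1 - x&nind;
   array probe[&nprobe] probe1 - probe&nprobe;
   do i = 1 to &ncases;
      y = rand('UNIFORM');
      do j = 1 to &nind;
         x[j] = rand('UNIFORM');
      end;
      do j = 1 to &nprobe;
         probe[j] = rand('UNIFORM');   /* known noise, by construction */
      end;
      output;
   end;
   drop i j;
run;

/* Safeguard 1: let the probes compete and capture the selection history. */
ods output SelectionSummary = steps;
proc glmselect data = withprobes;
   model y = x1 - x&nind probe1 - probe&nprobe
         / selection = stepwise(select = sl sle = 0.15 sls = 0.15);
run;

/* Keep the first step at which a probe entered the model. */
data firstprobe;
   set steps;
   /* EffectEntered is the assumed column name; =: is a prefix comparison */
   if upcase(EffectEntered) =: 'PROBE' then do;
      output;
      stop;
   end;
run;

proc print data = firstprobe;
run;

/* Safeguard 2: hold out 30% of the rows and pick the step with the
   smallest validation error instead of the last step reached. */
proc glmselect data = withprobes seed = 20100906;
   partition fraction(validate = 0.3);
   model y = x1 - x&nind
         / selection = stepwise(select = sl sle = 0.15 sls = 0.15 choose = validate);
run;
```

With real data, withprobes would be your own data set with the probe columns added, and the small-sample caveat above still applies to the holdout.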

Depending on the nature of the variables, an interesting approach could be to run an exploratory Factor Analysis or Principal Component Analysis and include all the factors in the model.
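
A minimal sketch of the principal component route, again on simulated data. The data set names, the seed and the choice of 10 components are assumptions made for illustration, not a recommendation.

```sas
/* Sketch only: data set names and the number of components are assumptions. */
%let nind  = 100;
%let ncomp = 10;     /* hypothetical number of components to keep */

data randx;
   call streaminit(20100906);
   array x[&nind] x1 - x&nind;
   do i = 1 to 300;
      y = rand('UNIFORM');
      do j = 1 to &nind;
         x[j] = rand('UNIFORM');
      end;
      output;
   end;
   drop i j;
run;

/* Replace the predictors with their first &ncomp principal components.
   OUT= adds the component scores Prin1, Prin2, ... to the data. */
proc princomp data = randx out = scores n = &ncomp noprint;
   var x1 - x&nind;
run;

/* Regress y on every retained component: no selection step involved. */
proc reg data = scores;
   model y = Prin1 - Prin&ncomp;
run;
quit;
```

The components are built without looking at y, so the number kept is decided by the analyst rather than by how well each one happens to fit the dependent variable.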

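Hopefully stepwise regression will be used with more caution, if used at all. Below is the SAS code I used to generate the simulations for these two histograms. I am not a very good SAS programmer, and I am sure someone can write this in a more elegant way...


/* Run &niter stepwise regressions on pure noise and stack the results in COMB. */
%macro step;
   %do step = 1 %to &niter;

      %put &step;   /* progress indicator in the log */

      /* One data set of &ncases rows: y and x1-x&nind are independent uniforms. */
      data randt;
         array x[&nind] x1 - x&nind;
         do i = 1 to &ncases;
            y = rand('UNIFORM');
            do j = 1 to &nind;
               x[j] = rand('UNIFORM');
            end;
            output;
         end;
      run;

      /* Stepwise selection; the EDF option adds _RSQ_ to the OUTEST= data set. */
      proc reg data = randt outest = est edf noprint;
         model y = x1 - x&nind / selection = stepwise adjrsq;
      run;
      quit;

      /* Append this iteration's estimates to the accumulated results. */
      %if &step > 1 %then %do;
         data comb;
            set comb est;
         run;
      %end;
      %else %do;
         data comb;
            set est;
         run;
      %end;

   %end;
%mend step;


%let nind   = 100;    /* number of random predictors */
%let ncases = 1000;   /* sample size (the second histogram uses 300) */
%let niter  = 200;    /* number of simulated data sets */
%step;


/* Histogram of the R-square values across simulations. */
proc univariate data = Work.Comb noprint;
   var _RSQ_;
   histogram / caxes = BLACK cframe = CXF7E1C2 waxis = 1 cbarline = BLACK
               cfill = BLUE pfill = SOLID vscale = percent hminor = 0 vminor = 0
               name = 'HIST' midpoints = 0 to 0.5 by 0.01;
run;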

2 comments:

Anonymous said...

Hi,

Do you know of any papers that have used the method you suggested to include random variables in the stepwise regression? I've been googling terms such as "stepwise regression include random variable" and cannot find any examples.

Thanks!
