Sunday, 23 January 2011

ARM Chapter 4 - Linear Regression: Before and After Fitting the Model

Today I want to comment on the fourth chapter of the book I am calling ARM. This chapter is about regression analysis, and I just want to single out some main points that I thought were interesting. I want to note that this is a basic chapter on regression analysis, and some of the things here I already knew or had a sense of; others are interesting new ideas. I always find these things worth commenting on, since this book is a lot about modeling and not so much about math. Things we see here are not found in the usual regression books, and many times the only way you will get them is by experience.

1) Linear transformations - these do not change the fit of the model, but they can make interpretation easier and more meaningful. A simple example: if you are measuring distances between cities, use kilometers rather than meters.
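
Just to make the unit point concrete, here is a minimal sketch in Python with made-up distance data: rescaling the predictor changes nothing about the fit, only the scale on which the slope is reported.

import numpy as np

rng = np.random.default_rng(0)                        # made-up data for illustration
dist_m = rng.uniform(1_000, 500_000, size=200)        # distances between cities, in meters
y = 2.0 + 0.00003 * dist_m + rng.normal(0, 1, 200)    # some outcome of interest

slope_m, _ = np.polyfit(dist_m, y, 1)                 # coefficient per meter
slope_km, _ = np.polyfit(dist_m / 1000, y, 1)         # coefficient per kilometer

print(slope_m)    # a tiny, hard-to-read number
print(slope_km)   # the same effect, 1000 times larger and easier to talk about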

2) Standardizing variables - again, helpful for interpretation purposes. The intercept is now interpreted relative to the mean, and not relative to zero, which in many cases is meaningless. For example, if the independent variable is an IQ score, the raw intercept has to be interpreted as what the model predicts when the IQ score is zero, which does not happen. Standardization may also help in interpreting the coefficients of main effects when the model has interactions.
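
A small sketch of the intercept point, again with invented data: with the raw predictor the intercept is the prediction at IQ = 0, while after standardizing it is the prediction for someone at the mean.

import numpy as np

rng = np.random.default_rng(1)
iq = rng.normal(100, 15, size=300)                # hypothetical IQ scores
y = 10 + 0.5 * iq + rng.normal(0, 5, size=300)    # some outcome

# Raw predictor: the intercept is the prediction at IQ = 0, which never occurs
_, intercept_raw = np.polyfit(iq, y, 1)

# Standardized predictor: the intercept is the prediction at the mean IQ
iq_std = (iq - iq.mean()) / iq.std()
_, intercept_std = np.polyfit(iq_std, y, 1)

print(intercept_raw)   # prediction for IQ = 0 (meaningless)
print(intercept_std)   # prediction for an average person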

3) Standardizing variables by dividing by two standard deviations - The authors argue that, in the case of binary variables, this standardization keeps the coefficient meaningful: a binary variable with a roughly 50/50 split has a standard deviation of about 0.5, so dividing by two standard deviations leaves it roughly on its original 0/1 scale, and the coefficient is still approximately the effect of changing the variable by one unit. Continuous predictors scaled the same way then become comparable to the binary ones. While this makes sense in a way, I am not sure I completely got it. There is a paper with more details which I still need to read.
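
Here is a quick numerical check of the two-standard-deviations idea, with a toy example of my own (not from the book or the paper): a 50/50 binary variable has a standard deviation of about 0.5, so dividing by two standard deviations leaves it essentially unchanged, while a continuous predictor scaled the same way ends up on a comparable scale.

import numpy as np

# A 50/50 binary variable: its standard deviation is 0.5, so 2*sd is about 1
x_bin = np.array([0, 1] * 500)
print(x_bin.std())                                    # 0.5

# A continuous predictor divided by 2*sd also ends up with sd = 0.5,
# so its coefficient is on a scale comparable to the binary one
rng = np.random.default_rng(2)
x_cont = rng.normal(50, 10, size=1000)
x_scaled = (x_cont - x_cont.mean()) / (2 * x_cont.std())
print(x_scaled.std())                                 # 0.5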

4) Principal component and regression line - Imagine a scatterplot with X on the horizontal axis and Y on the vertical axis. The regression line is the line that minimizes the sum of squared vertical distances to the line. The principal component line is the line that minimizes the sum of squared perpendicular distances (the shortest distance, not the vertical one). For our purposes the regression line is the right choice, because it does exactly what we want - minimize the error in predicting Y, or graphically, the vertical distance.
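
The difference between the two lines is easy to check numerically; this is a small sketch with simulated data comparing the ordinary regression slope with the slope of the first principal component of (X, Y).

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 500)
y = 2 * x + rng.normal(0, 1, 500)

# Regression line: minimizes the sum of squared *vertical* distances
ols_slope = np.polyfit(x, y, 1)[0]

# First principal component: minimizes the sum of squared *perpendicular* distances
eigvals, eigvecs = np.linalg.eigh(np.cov(x, y))
pc1 = eigvecs[:, np.argmax(eigvals)]    # direction of largest variance
pc_slope = pc1[1] / pc1[0]

print(ols_slope, pc_slope)    # here the principal component line comes out steeper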

5) Logarithmic transformation - The use of the logarithmic (natural, base-e) transformation is widespread in regression analysis. We usually justify it by saying that it makes Y normal. The book does not talk about this, maybe also because, in the previous chapter, it argues that normality is not that important for inference in regression (and what matters is normality of the errors, not of Y, which is another issue altogether). But let's stick to what is in Chapter 4. Logarithmic transformation may help with interpretation, since coefficients on the log scale, when they are small, can be read approximately in terms of percent changes. Another point is that it may make the model better, since some effects are multiplicative rather than additive. For example, suppose income is one of the predictors. On the original scale, changing it from $10K to $20K has the same effect as changing it from $100K to $110K. If it seems reasonable that the former change should in fact have a larger effect, then a logarithmic or square root transformation might work better.
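
A couple of one-line computations make these two points concrete (the numbers are only for illustration):

import numpy as np

# On the raw scale, $10K -> $20K and $100K -> $110K are the same step ($10K).
# On the log scale, equal multiplicative changes become equal steps, so the
# first change counts for much more than the second.
print(np.log(20_000) - np.log(10_000))      # ~0.69 (income doubled)
print(np.log(110_000) - np.log(100_000))    # ~0.10 (income up 10%)

# Small changes on the log scale read approximately as percent changes,
# because log(1 + p) is close to p when p is small.
print(np.log(1.05))                         # ~0.049, i.e. roughly a 5% change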

6) Modeling tips - which variables to include in the model and which to leave out. I have always found this a tricky and maybe dangerous subject. The book argues for keeping the significant variables and also the non-significant ones whose coefficients make sense. We should think hard about significant variables that do not seem to make sense, because maybe they do make sense after all... Well, I really find this a complicated issue. We are in a better position if we are talking about demographics or other things that are usually not highly associated with each other, but things can get messy when multicollinearity plays a role in the model... We should also try interactions when the main effects are large.

7) To exemplify point 6 above, the authors present a model with several independent variables. They end up somewhat unsure about what to keep in the model. I think this exemplifies well the day-to-day experience we have with regression models and, even better, by showing this example the book does not restrict itself to cases with beautiful solutions. However, I am not sure I agree with the way they transform the variables (creating volume, area, and shape variables), because to me the original variables seem clear enough to be kept in the model as they are. It is not necessary to create a "volume" variable from three original variables just to reduce the number of variables in the model. But it is an interesting approach anyway.

8) They also talk briefly about other customized transformations and modeling strategies. For example, if income is your dependent variable and some observations have zero income, you might want to build two models: the first modeling zero income versus non-zero income, and the second modeling the size of the income conditional on it being greater than zero.
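
Here is a rough sketch of that two-part strategy with invented data, using scikit-learn for convenience (this is my own toy version, not the book's code): a logistic model for whether income is positive, and a linear model for log income among those who have any.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Invented data: education predicts both having any income and its size
rng = np.random.default_rng(4)
n = 1000
educ = rng.normal(12, 3, size=n)
has_income = rng.random(n) < 1 / (1 + np.exp(-(educ - 10) / 2))
income = np.where(has_income, np.exp(9 + 0.1 * educ + rng.normal(0, 0.5, n)), 0.0)

X = educ.reshape(-1, 1)
pos = income > 0

# Part 1: zero income vs. positive income
part1 = LogisticRegression().fit(X, pos.astype(int))

# Part 2: size of (log) income, conditional on income > 0
part2 = LinearRegression().fit(X[pos], np.log(income[pos]))

# A combined prediction: P(income > 0) * exp(predicted log income)
# (ignoring the retransformation correction a careful analysis would need)
pred = part1.predict_proba(X)[:, 1] * np.exp(part2.predict(X))
print(pred[:5])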

9) Another interesting reference is this one, on standardization of variables.

I think that is it. And I will make the point again that all of this is quite interesting because it is regression from the viewpoint of modeling and not so much of theoretical mathematics. I think anyone who works with regression analysis in the social sciences will understand the importance of experience; it is not enough to know the theory behind the statistical model well.

Sunday, 16 January 2011

ESP and Statistics

There has been much talk lately in the statistical community about the paper on extrasensory perception recently published by the well-recognized Journal of Personality and Social Psychology. To tell the truth, I did not read the paper, maybe because I think the articles I read about the subject, like this one and this one, tell me all I need to know.

In short, the paper claims to have found evidence of extrasensory perception, which is to say that what happens in the future can influence things now. The world of causality as we define it today would be completely shaken, because we would no longer be able to say that if X causes Y, then X happens before Y. The paper claims to have found significant effects in experiments like this one, from the linked article:


"In another experiment, Dr. Bem had subjects choose which of two curtains on a computer screen hid a photograph; the other curtain hid nothing but a blank screen.
A software program randomly posted a picture behind one curtain or the other — but only after the participant made a choice. Still, the participants beat chance, by 53 percent to 50 percent, at least when the photos being posted were erotic ones. They did not do better than chance on negative or neutral photos."

It is weird to think that people would be able to influence the randomly generated image that they will see in the next moment. In fact, this seemed like nonsense to me, because random numbers generated by a computer are not really random, which is to say that there is no way anyone could influence the random process, since the random sequence is already determined to start with. So to me the experiment does not seem good, but I did not read the details of the experiment, so I don't know.

The point I want to make is that this is likely to be another example of the bad use of significance testing, and maybe the paper's biggest contribution will be as an example of how not to do things. The answer to whether or not the effects are real will come only with replication, that is, when others do the same experiment and get the same results.
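
Just to get a feel for the quoted numbers (this is not the paper's actual analysis), here is how a 53% hit rate against a 50% chance level depends on the number of trials; the same proportion can be unremarkable or "highly significant" depending only on the sample size, which is part of why a single significant result settles so little and replication matters.

import numpy as np

# z-statistic for a 53% hit rate under the null hypothesis of pure 50% chance
for n in (100, 1_000, 10_000):
    hits = round(0.53 * n)
    se = np.sqrt(0.5 * 0.5 / n)    # standard error of the proportion under the null
    z = (hits / n - 0.5) / se
    print(n, round(z, 2))          # the same 53% goes from unremarkable to overwhelming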

It is also interesting to see that such a controversial paper was published in such a high-level journal. I think that when things are controversial the journal should perhaps ask for more evidence and be more careful in judging the paper.

It will be interesting to watch what comes next on this subject...