Sunday 19 December 2010

ARM Chapter 3

Data Analysis Using Regression and Multilevel/Hierarchical Models is one of the books I have liked most. I think it is a different kind of book on regression models, one that focuses on the modeling part more than on the mathematical part. The mathematical foundation is certainly important, even more so now with all these Bayesian models coming around. But when regression is used for causal inference, I feel that, for the most part, it is the lack of modeling expertise rather than mathematical expertise that drives the low quality of many works. I am reading the book for the second time and I will leave here some points I think are important.

Chapter 3 is about the basics of regression models. It does a good job covering basic topics, such as binary predictors and interactions, and it uses scatterplots to make things visual. These are some points I would like to flag:

1 - Think of the intercept as the predicted value of the outcome when X is zero; this has always helped me interpret it. Sometimes X = 0 will not make much sense, and then centering the predictor might help.

2 - One can interpret a regression in a predictive or a counterfactual way. The predictive interpretation focuses on group comparisons, sort of exploring the world out there: what is the difference between folks with X = 0 and folks with X = 1? The counterfactual interpretation is about the effect of changing X by one unit: if we increase X by one unit, how will the world out there change? This is, as I see it, the difference between truly modeling and merely fitting a model.

3 - Interactions should be included in the model when the main effects are large. With binary variables, interactions are a way of fitting different models to different groups.

4 - Always look at residuals versus predicted values as a diagnostic.

5 - Use graphics to display the precision of coefficients. The function sim, from the book's companion arm package, is handy for simulating coefficient uncertainty.

6 - Make sure regression is the right tool to use and that the variables you are using are the correct ones to answer your research question. The book presents the validity of the regression model for the research question as the most important assumption of the model.

7 - Additivity and linearity form the second most important assumption. Use transformations of the predictors if these assumptions are not met.

8 - Normality and equal variance are less important assumptions.

9 - The best way to validate a model is to test it in the real world, with a new set of data. A model is good when its results can be replicated with external data.

For those who like R, the book has plenty of code that beginners can use as a starting point for learning the software. A minimal sketch of a few of the points above follows.
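To make some of these points concrete, here is a minimal sketch in R on simulated data. The data and variable names are made up, and it assumes the arm package (the book's companion package) is installed.

## Minimal sketch of points 1, 3, 4 and 5 above, on simulated (made-up) data
library(arm)   # companion package to the book; provides display() and sim()

set.seed(123)
n     <- 200
x     <- rnorm(n, mean = 50, sd = 10)   # continuous predictor
group <- rbinom(n, 1, 0.5)              # binary predictor
y     <- 10 + 0.5 * x + 3 * group + 1 * group * (x - 50) + rnorm(n, sd = 5)

## Point 1: center the predictor so the intercept is the prediction at the mean of x
x_c <- x - mean(x)

## Point 3: with a binary variable, the interaction fits a different slope to each group
fit <- lm(y ~ x_c * group)
display(fit)

## Point 4: residuals versus predicted values as a basic diagnostic
plot(fitted(fit), resid(fit), xlab = "Predicted values", ylab = "Residuals")
abline(h = 0, lty = 2)

## Point 5: simulate the coefficients to display their uncertainty
sims <- sim(fit, n.sims = 1000)
apply(coef(sims), 2, quantile, probs = c(0.025, 0.975))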

Saturday 11 December 2010

Which model?

It happened just last month: a paper of ours submitted to a journal was rejected, mainly because of several non-methodological problems, but there was one criticism related to our regression model. Since our dependent variable was a kind of count, we were asked to explain why we used a linear regression model instead of a Poisson model.

We had noticed the highly skewed nature of the dependent variable, so we applied a logarithmic transformation before fitting the linear model. To make the coefficients easier to interpret, we exponentiated them and presented the results with confidence intervals.

Upon receiving the reviewer's comment, we decided to go back and refit the model using Poisson regression. It turned out that the Poisson model was not adequate because of overdispersion, so we fitted a negative binomial model instead, with a log link. It worked nicely, but to our surprise the coefficients were nearly identical to the ones from the linear model fitted to the log-transformed dependent variable.
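For concreteness, here is a rough sketch of that kind of comparison in R, on simulated overdispersed counts; the numbers and names are made up and are not from our paper.

## Simulated overdispersed counts, just to illustrate the comparison
library(MASS)   # provides glm.nb

set.seed(1)
n <- 500
x <- rnorm(n)
y <- rnbinom(n, mu = exp(1 + 0.4 * x), size = 1.5)

## Linear model on the log scale (log(y + 1) to handle zero counts),
## with exponentiated coefficients and confidence intervals
fit_lm <- lm(log(y + 1) ~ x)
exp(cbind(estimate = coef(fit_lm), confint(fit_lm)))

## Poisson model and a crude overdispersion check:
## Pearson chi-square over residual df should be near 1 if the Poisson fits
fit_pois <- glm(y ~ x, family = poisson)
sum(residuals(fit_pois, type = "pearson")^2) / fit_pois$df.residual

## Negative binomial model with a log link
fit_nb <- glm.nb(y ~ x)
exp(cbind(estimate = coef(fit_nb), confint(fit_nb)))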

It feels good when you do the right thing and can say "I used a negative binomial model". It makes you look like a professional statistician using high-level, advanced, state-of-the-art models, despite the fact that this model has been around for quite some time, and what makes it look sophisticated is, perhaps, the fact that the linear regression model is almost the only one known outside statistical circles.

But my point is that it is not really clear to me how worthwhile it is to go to the negative binomial model when the conclusions will not differ from those of the simpler linear model. Maybe it makes sense to take the effort and go the most rigorous way in a scientific paper, but there is also the world out there, where we need to be pragmatic. Does it pay to use the Poisson, negative binomial, Gamma or whatever model in the real world, or does it end up being just an academic distraction? Maybe in some cases it does pay off, for example in some financial predictive models, but for the most part my impression is that we do not add value by using more advanced and technically correct statistical methods. This might seem like a word against statistics, but I think it is the opposite. We, as statisticians, should pursue the answer to questions like this, so that we know when it is important to seek more advanced methods and when we might do just as well with simple models...

Crowd Estimates

The latest Chance News has a short article about the challenges of estimating the number of people in a crowd. I haven't seen much about crowd estimation in the statistical literature, but it looks like by far the most accepted way of estimating a crowd is from aerial photos. By using estimates of density (people per unit of area), one can get to an estimate of the size of the crowd.

One problem with aerial photos is that they capture a single moment and therefore only allow us to estimate the crowd at that specific moment. If you arrived at the place and left before the photo was taken, you are not included in the picture. I can see that this is fine for gatherings like Obama's swearing-in, or any event that consists of a single moment, like a speech. But if you think about events that stretch over a period of time, like Carnival in Brazil or Caribana in Toronto, then you need more than photos, and even photos taken at different points in time can be problematic because you don't know how many people appear in, say, two photos.

Some time ago I got involved in a discussion about how to estimate the number of attendees at the Caribana festival. This is a good challenge, one that I am not sure can be met with good accuracy. We thought about a few ways to get to this number.

The most precise way, I think we all agreed, would be to conduct two surveys. One at the event, to estimate the percentage of participants who are from Toronto. This by itself is not a simple thing in terms of sampling design, given that Caribana runs over two or three days in different locations, but let's not worry about the sampling here (maybe in another post). The other survey would be among Torontonians, to estimate the percentage of the city's residents who went to the event. Because we have good estimates of the Toronto population, this survey gives us an estimate of the number of Torontonians who went to the event. From the first survey we know they are X percent of the total, so we can estimate the total attendance; the arithmetic is sketched below.
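As a back-of-the-envelope illustration of that arithmetic in R, with completely made-up numbers:

## Two-survey idea with made-up numbers
toronto_pop         <- 2.6e6   # rough Toronto population, for illustration only
p_resident_attended <- 0.10    # survey of Torontonians: share who went to the event
p_attendee_local    <- 0.40    # survey at the event: share of attendees from Toronto

torontonians_at_event <- toronto_pop * p_resident_attended   # 260,000
total_attendance      <- torontonians_at_event / p_attendee_local
total_attendance                                             # 650,000 with these inputs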

Another way, we thought, would be to distribute something to the crowd, for example a flier. This should be done at random. Then, also with a random sample, we would interview people in the crowd and ask them whether or not they received the flier. Knowing the percentage of the crowd that received a flier and the total number of fliers distributed, we can also get to an estimate of the crowd.

A related way would be to measure the consumption of something. We thought about cans of Coke. If we could get a good estimate of the number of cans of Coke sold in the area, and if we could conduct a survey to estimate the percentage of the crowd that bought a can (or the average number of cans per person), we could then work out the crowd size.

Another way would be to use external information about garbage. It is possible to find estimates of the average amount of garbage produced by an individual in a crowd. Then, if you work with the garbage collectors to weigh the garbage from the event, you can get to an estimate of the crowd. A problem with an external estimate of something like the amount of garbage per person is that it could vary a lot depending on the event, the temperature, the weather, the place, the availability of food and drinks, and so on. So I am not sure this method would work very well; maybe some procedure could be put in place to follow a sample of attendees (or maybe just interview them) to estimate the amount of garbage generated per person. These last three ideas share the same basic arithmetic, sketched below.
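The flier, Coke and garbage ideas all boil down to dividing a known total by a per-person rate estimated from a survey. In R, with made-up numbers:

## Ratio-type estimates with made-up numbers

## Fliers: total distributed divided by the surveyed share who received one
fliers_distributed <- 50000
p_received         <- 0.08
fliers_distributed / p_received   # 625,000

## Cans of Coke: total sold divided by the average number of cans per person
cans_sold       <- 120000
cans_per_person <- 0.20
cans_sold / cans_per_person       # 600,000

## Garbage: total weight divided by the average weight per person
garbage_kg    <- 300000
kg_per_person <- 0.5
garbage_kg / kg_per_person        # 600,000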

Finally, an aerial photo could be combined with a survey on the ground. Photos could be taken at different points in time, and a survey would ask participants when they arrived at the event and when they intend to leave. Of course the time one leaves the event will not be reported very accurately, but hopefully errors would cancel out and, on average, we would be able to estimate the percentage of the crowd that is in two consecutive photos, so that we can account for that in the estimation.
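One simple way to formalize this, assuming the ground survey gives unbiased estimates of the share of all attendees present at each photo time (all numbers made up):

## Combining photo counts with an arrival/departure survey
photo_counts <- c(80000, 95000)   # people counted in photos 1 and 2

## From the survey: share of all attendees whose stay covered each photo time
p_present <- c(0.30, 0.35)

## Each photo count is roughly (total attendance) x (share present at that time),
## so a simple combined estimate of total attendance is
sum(photo_counts) / sum(p_present)   # about 269,000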

I think crowd estimation is an underdeveloped area for these events that go on for a long period of time. There is room for creative and yet technically sound approaches.

Sunday 5 December 2010

The science of status

For quite some time now I have been concerned with the quality of the science that we see in journals in fields like medicine and biology. Statistics plays a fundamental role in the experimental method necessary for the development of much of science, but what we see is the science of publishing rather than the science of evidence. Statistical inference has been disseminated as a tool for getting your paper published rather than for making science. As a result, the bad use of statistics in papers that offer non-replicable results is widespread. Here is an interesting paper about the subject.

Unfortunately, scientific research is not the only thing that loses with the low quality of research and misuse of statistics. Statistics itself loses by being distrusted as a whole by many people, who end up considering the statistical method faulty and inappropriate when its use by non-proficient researchers is what is to blame.

What is our role as statisticians? We need to do the right thing. We should not be blinded by the easy status that comes with having our name on a paper after running a regression analysis. We need to be more than software pilots who press buttons for those who own the data. We need to question, to understand the problem, and to make sure the regression is the right thing to do. If we don't remember the regression class, we need to go back and refresh our memory before using statistical models. We need to start doing the right thing and knowing what we are doing. Then we need to pursue and criticize the publications that use faulty statistics, because we are among the few who can identify them. For the most part, researchers run models that they don't understand and get away with it because nobody else understands them either.

But often we are among them: we are pleased to see our name in publications, and we are happy to take the easiest way.

I decided to write this after a quick argument in a virtual community of statisticians, where I noticed a widely quoted statistician spreading the wrong definition of the p-value (which is the probability, under the null hypothesis, of observing data at least as extreme as what was actually observed). We can't afford to have this kind of thing inside our own community; we need to know what a p-value is...