Saturday 25 August 2012

Statistical Tests and Real Life

The interpretation of statistical tests is perhaps one of the most misunderstood aspects of routine statistical applications. Among the misinterpretations is the one where a test that fails to reject the null hypothesis is taken as evidence in favor of the null. We all know that, in most cases, not rejecting the null does not mean the null is correct. What is interesting is that the same reasoning shows up in daily life, outside the scope of statistical tests, and it is misinterpreted there as well.

A usual example I use when explaining the interpretation of classical hypothesis tests is the courtroom. In short, if someone is not found guilty, it does not mean the person did not commit the crime, only that there is not enough evidence of it. The person is therefore considered innocent, but innocence is usually not proven, since that is not the goal of the process. The right thing to say is that there is not enough evidence that the person committed the crime.

This week the news about Lance Armstrong's refusal to fight USADA's doping charges is all over the media. I was surprised to see this article from a large US newspaper making such a basic mistake: the idea that not testing positive means Armstrong did not use illegal drugs. Here again we have the non-significant statistical test, this time in a non-statistical setting. Yes, statistical reasoning matters in our daily life.

Here the failure to reject the null hypothesis (a negative drug test) does not mean much, for reasons beyond lack of power. Power is still an important one: we do not know, and it is not easy to find out, the "power" of these tests, that is, how much a negative result is actually worth. Part of the problem is that there must be different tests for different drugs, each with its own "power". But a negative result can happen even with a guilty athlete for other, perhaps more relevant, reasons: for example, there may simply be no test for the drug used, or the athlete may have found a way to use illegal drugs that the test is not prepared to detect.
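Just to make the point concrete, here is a minimal Python sketch of how little a negative test may mean when its power is low. All the numbers (sensitivity, specificity, share of doping athletes) are made-up assumptions for illustration, not real figures for any anti-doping test.

```python
# Illustrative sketch only: assumed numbers, not real anti-doping figures.
sensitivity = 0.40   # assumed P(positive test | doping) -- the test's "power"
specificity = 0.99   # assumed P(negative test | clean)
prior_doping = 0.20  # assumed share of doping athletes in the tested pool

# Probability that a doping athlete still tests negative
p_negative_given_doping = 1 - sensitivity  # 0.60: most doping athletes pass the test

# Bayes' rule: probability of doping given a negative test
p_negative = prior_doping * p_negative_given_doping + (1 - prior_doping) * specificity
p_doping_given_negative = prior_doping * p_negative_given_doping / p_negative

print(f"P(negative | doping) = {p_negative_given_doping:.2f}")
print(f"P(doping | negative) = {p_doping_given_negative:.2f}")  # far from zero
```

Under these assumed numbers, a doping athlete passes the test more often than not, which is exactly why a string of negative tests is weak evidence of innocence.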

I always like to look at the comments on such articles (to tell the truth, I could not finish reading this one; the writer's idea that negative tests are proof of innocence drove me away from it), and there you can see people making the point I am making here, trying to correct the fallacy. But others, maybe most readers, will take the article as the truth and use its ideas to argue in Armstrong's favor. This is how even a flawed text like this can spread, making bad science, and perhaps bad statistical analysis, widespread as well. So much so that I found this text because someone retweeted it...

Tuesday 14 August 2012

Randomized Trials and Public Policy

There is here a very interesting paper about using more Randomized Controlled Trials before making policy decisions. The paper is easy to read and easy to understand, as I think it should be in order to spread the idea more widely. Well, it is not really a new idea, or new stuff, or anything of the sort.

It is my perception that RCTs are underutilized in a world where causal analysis is widespread. While causal methods for observational data are all over the place, RCTs seem to be limited to medical research and are forgotten, not considered, sometimes simply unknown, in other fields. I think that just as RCTs are demanded for drug development, the same should be true for policy development.

Saturday 4 August 2012

Cohen's d

An often overlooked issue in statistical analysis is the meaningfulness of effect sizes. Usually, when comparing means or proportions, we run a statistical test and do not even think about how meaningful the observed difference is, whether or not it is significant. Many times, when power is high, practically meaningless effects will be flagged as significant.

Cohen's d is a standardized effect size measure that allows us to assess the magnitude of the observed effect in practical terms. It is simply the difference in means divided by the standard deviation of the data. Notice that we are not talking about the standard error of the mean or of the difference in means, but the standard deviation of the data themselves. The idea is to assess how large the effect is in light of the natural variation observed in the data.

Usually Cohen's d will be lower than 1 in absolute terms, and values around 0.5 and above are taken as practically important. Thinking in broad, approximate terms and considering normally distributed data, the bulk of the data spans about 4 standard deviations. So if an intervention can shift things by one full standard deviation (Cohen's d = 1), it makes sense to regard this as a pretty big effect. And it does not matter much what we are talking about; this holds across different variables and different studies.
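As a quick illustration, here is a minimal Python sketch of the computation, using the pooled standard deviation of the two groups (one common convention); the data are made up:

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

# Made-up example: with a large sample a tiny difference will come out
# "significant", but Cohen's d shows the effect is negligible in practice.
rng = np.random.default_rng(42)
control = rng.normal(loc=100.0, scale=15.0, size=5000)
treated = rng.normal(loc=101.0, scale=15.0, size=5000)  # true d of about 1/15

print(f"Cohen's d = {cohens_d(treated, control):.3f}")  # small despite significance
```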

This kind of effect size calculation is totally missed in Marketing Research and is most common in fields related to medicine. I already have some ideas about trying Cohen's d in segmentation analysis, to understand how segments differ in a more meaningful way.

A short, non-technical paper with some more technical references on the subject is this one.

Tuesday 31 July 2012

Retraction

An interesting blog that reports on flawed papers retracted from journals. I wonder what proportion of published bad science actually ends up retracted. It must be quite small, given that detecting flaws is not so easy and the authors are surely not interested in retracting their own work...

Paying Survey Respondents

Here is an interesting article about surveys in the US and whether or not respondents should be paid.

Most of the time respondents are paid in Marketing Research, and there has long been concern about whether paying respondents will bias the surveys. In official surveys, respondents are truly selected at random and therefore do not choose to participate; they spend their time filling in a survey they did not anticipate, so it seems to make sense that they should be compensated. The argument goes that it is also fair: all taxpayers pay for these official surveys, and the unlucky ones who are selected, the real contributors, should be rewarded, since they are what makes the official survey results possible.

Now, in Marketing Research, and even in opinion polls, things are different. The sample is usually online and respondents choose to participate, often for the reward offered. If the people attracted by these rewards are somehow different from the others, then we may have a bias in the survey, one that is quite complicated to quantify. While the compensation may bring selection bias, it seems fair to assume that no compensation would not be fair either, and the surveys would hardly be possible. Besides, it seems that in any case paid respondents answer surveys more reliably than unpaid ones, as receiving money gives them a sense of commitment.

As the online era takes over and changes sampling theory, we statisticians need to be more and more creative to handle the new challenges involved in releasing reliable results.

Saturday 21 July 2012

Observed Power

I want to comment quickly on an interesting paper about observed power published in The American Statistician in 2001.

Observed power is calculated after the fact, when the sample is being analyzed and the results are at hand. It is calculated just like power is usually calculated, but plugging in the observed difference in means (for a test of means) and the observed variability. Observed power is usually computed when the null hypothesis fails to be rejected, likely because the researcher wants an idea of whether the result can be interpreted as evidence that the null hypothesis is true. In these cases, the higher the observed power, the more one would take the failure to reject as acceptance. As the paper well advises, this type of power calculation is nonsense precisely because it is a direct function of the p-value: the lower the p-value, the higher the observed power. Therefore, if two tests fail to reject the null, the one with the lower p-value (more evidence against the null) will have the higher observed power (more evidence in favor of the null, according to the usual interpretation above). So this type of power calculation is not only useless but leads to misleading conclusions.
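To see the one-to-one relationship, here is a minimal Python sketch (my own illustration, not from the paper) of post hoc power for a two-sided one-sample z-test. The observed effect is treated as if it were the true effect, and observed power comes out as a decreasing function of the p-value:

```python
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)  # critical value of the two-sided z-test

def p_value(z_obs):
    """Two-sided p-value for an observed z statistic."""
    return 2 * norm.sf(abs(z_obs))

def observed_power(z_obs):
    """Post hoc ("observed") power: the observed effect is taken as the true one,
    so the test statistic is assumed to be centered at z_obs."""
    return norm.sf(z_crit - abs(z_obs)) + norm.cdf(-z_crit - abs(z_obs))

# All of these fail to reject at the 5% level, yet the smaller the p-value,
# the larger the observed power -- it adds no information beyond the p-value.
for z in (0.5, 1.0, 1.5, 1.9):
    print(f"z = {z:.1f}  p-value = {p_value(z):.3f}  observed power = {observed_power(z):.3f}")
```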

I have lately gotten involved in some debates about the role of statisticians, especially in teaching statistics, spreading the power of the tools we use, and correcting the many misuses of statistics, whether by statisticians or not. I believe this is the sort of information where we need to make the difference; this is the sort of information that separates those who press buttons from those who do real statistics.