Tuesday 22 October 2013

The science and the system

I think one of the great skills statisticians end up having is the ability to use statistical tools to evaluate scientific research and assess its problems. Well, maybe that is not very accurate. What I mean is that, as a statistician, it seems easier for me to read a paper and see possible challenges to its conclusions. I think the background in stats, along with the experience of reading many papers over the years, helps a lot here. The first case in this article, a study claiming that students who are not supported financially by their parents end up with better grades, is a good example. I find myself reading the conclusion of a study and not believing it so often that it sometimes gives me the impression I am more skeptical than the rest of humanity. It also makes me think that this type of text is misleading and perhaps harmful, so that when it gets published in the mainstream media, as this one was, it is a pretty bad thing. What if we could make sure that whoever does the analysis not only has no stake in the results but also knows the subject well? That is almost never the case.

The linked paper points out other problems, though. It shows that once a paper like this is published in a top scientific journal, it is quite difficult to criticize it. It speculates that even scientific journals may be evaluating a paper by how attractive and surprising its conclusions are, rather than only by its scientific content. And finally, it talks about how difficult it is to get our hands on the data used in such a study so that we can replicate the results.

The system puts pressure not only on the researcher to publish, but, it seems, also on the top journals, to continue being top: attractive, read by as many people as possible. With pressures like these, the goal ends up being not scientific production but the mitigation of the pressure. Or at least that becomes part of the goal. It seems that this concern with the amount of science that actually exists in scientific publications, especially when they go mainstream, is becoming more widespread, so hopefully things will get better...


Friday 18 October 2013

Random Selection and Random Assignment

Randomness is very important in statistics, yet in my opinion it is not given its due space in statistics courses. I have found that people usually have trouble understanding why randomness is so important and what the implications of its absence are in certain contexts. There are two contexts where randomness is very important but plays a different role: experimental design and sampling. In the former we usually want to assign units randomly to treatments/groups, and in the latter we want to select units randomly from a given population.

I also think that this lack of understanding of randomness, or perhaps just not thinking about it enough or taking it seriously enough, makes us turn a blind eye to the many circumstances where lack of randomness is an issue, especially in sampling. In that regard I think we need to get better at model-based approaches that account for the introduced biases, starting by understanding the sources and consequences of the non-randomness. Unfortunately, reality is not very friendly, and random samples and randomized experiments are the exception, not the rule.

In sampling, randomness aims at guaranteeing that the conclusions from the small set of units we select, the sample, generalize to some larger population, because we don't really care about the sampled folks; we care about the population they are supposed to have been sampled from. But sampling in a random way from human populations is next to impossible most of the time. In experimental design, random assignment allows us to control for competing causes of the effect we intend to measure. A randomized experiment allows us to make causal claims about statistical differences, but if the sample was not selected randomly then the causal claim is, in theory, only valid for the units participating in the experiment. Here is where both things come together.
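To make the assignment part concrete, here is a toy simulation of my own (not from any paper, just an illustration): with random assignment, a potential confounder like age ends up balanced across the groups on average, while a non-random assignment rule can leave the groups systematically different.

set.seed(123)
n = 1000
age = rnorm(n, 40, 10)           # a potential confounder
treat = sample(rep(0:1, n/2))    # random assignment to two groups
tapply(age, treat, mean)         # similar means: groups are comparable on age

treat2 = as.numeric(age > median(age))  # non-random assignment based on age
tapply(age, treat2, mean)        # very different means: age now confounds any comparison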

Therefore it is important that students have a good understanding of randomness and of the difference between assigning randomly (experimental design) and selecting randomly (sampling), as well as the possible implications when the randomness fails. This is an interesting paper that talks about this and gives an example of how to engage students with real-world activities related to the application and interpretation of random selection and random assignment. I thought the idea was cool also because it uses roaches, which are not very much liked in our culture, so I guess that helps make the activity memorable...





Friday 4 October 2013

Good Charities

I have always found the decision of giving money to charities a difficult one, because there are so many charities and they do such different things. First you need to be concerned about whether the non-profit organization will do what it says, or even exists. Then you think about the portion of the money that actually goes to the cause instead of being spent on administration and advertising. Finally, you also wonder how they will address the cause, and how efficiently. This article in the last edition of Significance Magazine discusses the "how efficiently" part. The so-called relentless logic is covered in detail in this book and commented on in the article. The "relentless" part comes from the idea of transforming the results of the organization's actions into money, that is, quantifying them in currency values, which is sort of disgusting. But anyway, the argument is that it is valid if it leads to better use of the received funds.

But to me, the very fact that transforming such results into monetary value does not sound right is itself the reason it should not be done. They are such different things that some of the value coming from the results of this type of work, as well as what people gain just by giving, is not quantifiable in any simple way.

Their first example, where I almost stopped reading, was about quantifying the dollar value of bringing high school dropouts back to school. They say that research shows that, on average, high school graduates earn $6,500 per year more than high school dropouts. I see two problems with this value: 1) It is certainly controversial within the scientific community. If the causes of global warming, which has a sort of priority status on the scientific agenda, are still nowhere close to a consensus, then imagine the effects of a high school diploma. 2) Even assuming the figure is correct, there is likely a bias, because the $6,500 refers to high school graduates in general, not to dropouts who ended up re-enrolling and becoming at least high school graduates. You may think I am being unreasonable, but I do believe the latter group may differ in important ways that lead to different earnings than the former.

But in any case, that is not the reason I almost stopped reading. It has more to do with the very idea of doing this type of quantification. For example, maybe I did not have the chance to get a high school diploma because I was forced to drop out to help at home. Nevertheless, I succeeded, financially speaking, and I made a commitment to myself to help the dropout cause. There is a value in that, such that it does not matter if you tell me that, based on your math and your assumptions, helping hungry kids is a more efficient way to give money. Or maybe I was helped by foundation X, which kept me from dropping out of school, and now I want to pay it back. It does not matter if you tell me that foundation Y does the same thing more efficiently.

These may be very specific cases, but even if I don't have personal reasons to help any particular cause, I will still have my own opinion about different causes. When politicians make promises before an election, say individual A promises to do things for the children and individual B promises to combat global warming, people will have their own opinion as to which is most important. Either nobody tries to quantify anything, or such quantifications are next to useless in convincing people, likely because they are far from correct and rest on too many assumptions and approximations.

So, I read the article, but I think it is great that we have such a diversity of opinions among human beings, so that every cause has its supporters. Quantifying effects is important because it forces improvement, but I am not sure it helps much in advising folks on where they should put their money. It should guide improvements in the good organizations, and it should push the not-so-good ones to make an effort to follow the path of the top organizations.

Tuesday 1 October 2013

Birds killed by cats in Canada

I was reading this news article about human-related causes of bird deaths and found it interesting that cats rank at the top, with around 75% of the bird deaths, that is, 200 million per year. That was not so surprising to me, but I immediately wondered how they could estimate such a number, and how precise it could be. I mean, I am not questioning the methodology or the result, but I knew this is not an easy thing to estimate, and as such a precision measure would be important. You cannot interview a random sample of cats and ask them how many birds they kill per year. So, knowing the number of birds killed is hard, but it does not stop there: it is also hard to get a random sample of cats, or to estimate the number of cats, for that matter. My experience was sort of telling me that this is so hard to estimate, and requires so many assumptions, that the results would be imprecise, and the reader should know that. Of course, that does not go well with media stories that want to catch people's attention.

So, I tracked down the actual paper where the estimation was performed. It is a simple paper, mathematically speaking. They use many estimates, from different sources, which makes the estimation work very arduous and adds challenges related to the precision of the estimates. I will go quickly through each of these estimates, although I have to say that I did not go deep into the literature they used.

What follows is about the data used, its quality, and the assumptions made. Just a flavour, though.
The number of domestic cats in Canada, around 8 million, seems consistent, even though it was estimated from a non-random (online) sample and even though my first impression was that the figure was too high. But I can't really say.

About 19% of the Canadian population lives in rural areas, and based on information that rural residents own more cats on average than people living in urban areas, the percentage of cats in rural areas was estimated to be around 1/3. I found the proportion of Canadians living in rural areas sort of high, but it comes from Statistics Canada and should be precise. The thing is that Canada is huge, so even if the density of people in rural areas is low, the total population may add up to a considerable amount. The idea that rural residents own more cats seems fair too. It was taken from a study in Michigan, so not too far from Canada, although we need to remember, again, that Canada is huge.

The number of feral cats is estimated to be between 1.4 and 4.2 million and is based on mostly unknown sources that appear in media reports. The estimate for Toronto, for example, is between 20 and 500 thousand. Such a wide interval is next to useless. This is especially problematic given that feral cats are considered to kill many more birds than domestic cats.

Only cats with outdoor access are supposed to be able to kill birds. That proportion is reported for the US in some papers and seems to have some consistency, although the range used is quite wide: from 40% to 70%.

Now, the number of birds killed per cat is the real problem. It seems that no good source of direct information exists, so the estimate is built from the reported number of birds brought home by cats, using this as the number of preyed birds. There are a few studies that report this sort of estimate, but due to differences in location and methodology, I guess, the imprecision is quite high: 0.6 to 6.7 birds per year. At first even the high end of that range seems low, considering we are talking about a whole year. The range of 2.8 to 14 used for rural cats seems conservative, but I cannot really say. It is very wide, though. The thing is that rural cats are usually very free, and there are reasons to believe they may kill birds very often without anybody seeing. Rural cat owners usually care less about their cats, but this is from my own experience, that is, very anecdotal.

They also use an adjustment factor for the birds that are killed but not brought home. This factor is difficult to estimate, and a range from 2 to 5.8 was considered, based on some studies. These numbers are really hard to judge; they are, again, surely very imprecise.

The estimate of birds killed per feral cat is based on the stomach contents of the cats. It is again a very indirect method, which involves assumptions and may carry large inaccuracies.

I don't mean to criticize by listing these data sources, because the fact is that many times this is all we have. More and more, we need to be able to use this sort of data, understand the assumptions, and model the unknowns. That is fine. But I also think it is important that assumptions and caveats are stated. They are in this paper, but when the findings get translated into the popular language of the news media, all the statistical aspects of the findings get lost. When we say that cats are responsible for 200 million bird deaths in Canada, how precise is this? The usual reader will think it is quite precise; I mean, you will if you just read the CBC article. The paper is pretty simple, so I set out to calculate this precision. It turns out that Figure 2 is all you really need, but I wanted to have some fun, so I decided to replicate it. This is the R code, using just the information in the paper.

nsim = 10000

nPC = rnorm(nsim, 8.5, 0.25)  # Number of Pet Cats (millions)
pRC = runif(nsim, 0.27, 0.33) # % Rural Cats
pOd = runif(nsim, 0.40, 0.70) # % Outdoor Cats
BpU = runif(nsim, 0.6, 6.7)   # Avg Birds brought home per Urban Cat
BpR = runif(nsim, 2.8, 14)    # Avg Birds brought home per Rural Cat
Adjust = runif(nsim, 2, 5.8)  # Adj. for non-Returned Birds
nFC = runif(nsim, 1.4, 4.2)   # Number of Feral Cats (millions)
KpF = runif(nsim, 24, 64)     # Kills per Feral Cat

BKu = nPC*(1-pRC)*pOd*BpU*Adjust
BKr = nPC*pRC*pOd*BpR*Adjust
BKf = nFC*KpF
TotalBird = BKu + BKr + BKf

library(ggplot2)
tb = data.frame(TotalBird = TotalBird)
ggplot(tb, aes(x = TotalBird)) +
  geom_histogram(aes(y = ..density..), binwidth = 5, fill = "darkgreen") +
  ggtitle("Birds Killed by Cats in Canada") +
  xlab("Number of Birds in Millions") +
  ylab("Density")



quantile(TotalBird, c(0.025, 0.975))
2.5%     97.5%
107.3296 357.0416


Despite the difference in graph formatting, I think they are very similar. So, we see that the 200 million birds killed by cats (domestic + feral) in Canada each year is not precisely 200 million: a 95% interval shows it is between around 100 and 360 million birds. It is a pretty wide interval, but it just reflects the uncertainty in the source data.

Another interesting thing in the paper is that they calculate the relative importance of each term in the equation. I thought about this a little, because they use linear regression while the calculation is not linear, as we can see in the equation in the R code above. I ran the regression and got an R2 of 95%, so perhaps with such a high R2 the relative variance decomposition given by linear regression is acceptable. I don't know; this seems to be an interesting question.
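For what it is worth, here is roughly how that regression check can be done, a quick sketch assuming the simulated vectors from the code above are still in the workspace. Since the inputs were simulated independently, they are nearly orthogonal, so the order of the terms in the sequential ANOVA decomposition barely matters here:

# Linear approximation of the (non-linear) simulation equation
fit = lm(TotalBird ~ nPC + pRC + pOd + BpU + BpR + Adjust + nFC + KpF)
summary(fit)$r.squared  # around 0.95 in my run

# Each input's share of the explained sum of squares
ss = anova(fit)[["Sum Sq"]]
names(ss) = rownames(anova(fit))
round(ss[-length(ss)] / sum(ss[-length(ss)]), 3)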

A final point: as we can see in the code above, the terms used in the calculation are all assumed to be independent. I am not sure how to do much better with the available information, but this independence is also an assumption that should be kept in mind.
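One crude thing we can at least do is probe the sensitivity to that assumption. As a toy check of my own (not something from the paper), we can regenerate two inputs that might plausibly move together, say the urban and rural kill rates, with a positive correlation via a Gaussian copula, and see how much the interval changes:

library(MASS)
# Correlated standard normals, transformed to correlated uniforms
z = mvrnorm(nsim, mu = c(0, 0), Sigma = matrix(c(1, 0.7, 0.7, 1), 2, 2))
u = pnorm(z)
BpU2 = qunif(u[, 1], 0.6, 6.7)  # urban kill rate
BpR2 = qunif(u[, 2], 2.8, 14)   # rural kill rate, now correlated with BpU2

TotalBird2 = nPC*(1 - pRC)*pOd*BpU2*Adjust + nPC*pRC*pOd*BpR2*Adjust + nFC*KpF
quantile(TotalBird2, c(0.025, 0.975))  # compare with the interval above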