Tuesday, 22 October 2013

The science and the system

I think one of the great skills statisticians develop is the ability to use statistical tools to evaluate scientific research and assess its problems. Well, maybe that is not quite accurate. What I mean is that, as a statistician, it seems easier for me to read a paper and see possible challenges to its conclusions. The background in statistics, along with the experience of reading many papers over the years, helps a lot here. The first case in this article, a study claiming that students who are not supported financially by their parents end up with better grades, is a good example. I find myself reading the conclusion of a study and not believing it so often that it sometimes gives me the impression I am too skeptical compared to the rest of the human beings. It also makes me think that this kind of text is misleading and perhaps harmful, and when it gets published in the mainstream media, as this one was, that is a pretty bad thing. What if we could make sure that whoever does the analysis not only has no stake in the results but also knows the subject well? That is almost never the case.

The linked paper shows other problems, though. It shows that once a paper like this is published in a top scientific journal, it is quite difficult to criticize it. It speculates on the possibility that even scientific journals evaluate a paper by how attractive and surprising its conclusions are, rather than only by its scientific content. And finally, it talks about how difficult it is to get our hands on the data used in such a study so that the results can be replicated.

The system puts pressure not only on researchers to publish, but it seems also on the top journals to remain at the top: attractive, read by as many people as possible. With pressures like these, the goal ends up being not scientific production but the mitigation of the pressure, or at least that becomes part of the goal. It seems that this concern with how much science actually exists in scientific publications, especially when they go mainstream, is becoming more widespread, so hopefully things get better...


Friday, 18 October 2013

Random Selection and Random Assignment

Randomness is very important in statistics, yet in my opinion it is not given its due space in statistics courses. I have found that people often have trouble understanding why randomness is so important and what the implications of its absence are in certain contexts. There are two contexts where randomness is essential but plays a different role: experimental design and sampling. In the former we usually want to assign units randomly to treatments or groups, and in the latter we want to select units randomly from a given population.

I also think that this lack of understanding of randomness, or perhaps of simply thinking about it and taking it seriously, makes us turn a blind eye to the many circumstances where lack of randomness is an issue, especially in sampling. In that regard I think we need to get better at model-based approaches that account for the biases introduced, starting by understanding the sources and consequences of the non-randomness. Unfortunately reality is not very friendly, and random samples and randomized experiments are the exception, not the rule.

In sampling, randomness aims at guaranteeing that the conclusions drawn from the small set of units we select, the sample, generalize to some larger population, because we don't really care about the sampled folks, we care about the population they are supposed to have been sampled from. But sampling in a random way from human populations is next to impossible most of the time. In experimental design, random assignment allows us to control for competing causes of the effect we intend to measure. A randomized experiment allows us to make causal claims about statistical differences, but if the sample was not selected randomly, then in theory the causal claim is only valid for the units participating in the experiment. Here is where the two things come together.
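Just to make the distinction concrete, here is a tiny sketch in R; the population, the sample size and the group labels are all made up:

set.seed(1)
population <- 1:1000                                   # a hypothetical population of 1000 units
sampled    <- sample(population, 40)                   # random selection: draw 40 units from the population
treatment  <- sample(rep(c("A", "B"), each = 20))      # random assignment: split the 40 selected units into two groups
head(data.frame(unit = sampled, group = treatment))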

Therefore it is important that students have a good understanding of randomness, of the difference between assigning randomly (experimental design) and selecting randomly (sampling), and of the possible implications when the randomness fails. This is an interesting paper that discusses this and gives an example of how to engage students with real-world activities involving the application and interpretation of random selection and random assignment. I thought the idea was neat also because it uses roaches, which are not much liked in our culture, so I guess that helps make the moment memorable...





Friday, 4 October 2013

Good Charities

I have always found the decision of giving money to charities a difficult one, because there are so many charities doing so many different things. First you need to be concerned about whether the non-profit organization will do what it says, or whether it even exists. Then you think about how much of the money actually goes to the cause instead of being spent on administration and advertising. Finally you also wonder how they will address the cause, and how efficiently. This article in the latest edition of Significance Magazine discusses the efficiency part. The so-called relentless logic is covered in detail in this book and commented on in the article. The "relentless" part comes from the idea of translating the results of the organization's actions into money, quantifying them in currency values, which is sort of disgusting. But anyway, the argument is that it is valid if it leads to better use of the funds received.

But to me, the very fact that translating such results into monetary value does not sound right is what explains why it should not be done. They are different things, in the sense that some of the value coming from the results of this type of work, as well as what people gain simply by giving, is not quantifiable in any simple way.

Their first example, where I almost stopped reading, was about quantifying the dollar value of bringing high school dropouts back to school. They say that research shows that, on average, high school graduates earn $6,500 per year more than high school dropouts. I see two problems with this value: 1) it is certainly controversial within the scientific community. If the causes of global warming, which has a sort of priority status on the scientific agenda, are still nowhere close to a consensus, then imagine the effects of high school graduation. 2) Even assuming the figure is correct, there is likely a bias, because the $6,500 refers to high school graduates in general, not to dropouts who ended up re-enrolling and became at least high school graduates. You may think I am being unreasonable, but I do believe the latter may differ in important ways that lead to different earnings compared to the former.

But in any case these are not the reasons I almost stopped reading. It has more to do with the very idea of doing this type of quantification. For example, maybe I did not have the chance to get a high school diploma because I was forced to drop out to help at home. Nevertheless, I succeeded financially and made a commitment to myself that I would help the dropout cause. There is a value to that, such that it does not matter if you tell me that, based on your math and your assumptions, helping hungry kids is a more efficient way to give money. Or maybe I was helped by foundation X, which kept me from dropping out of school, and now I want to pay it back. It does not matter if you tell me that foundation Y does the same thing more efficiently.

These may be very specific cases, but even if I don't have reasons to help any particular cause, I will have my opinion about different causes. When politicians make promises before an election, say individual A promises to do things for the children and individual B promises to fight global warming, we see that people have their own opinions as to which is most important; there is no effort to quantify anything, or such quantifications are next to useless in convincing people, likely because they are far from correct and based on too many assumptions and approximations.

So, I read the paper, but I think it is great that we have such diversity of opinions among human beings, so that every cause has its supporters. Quantifying effects is important because it forces improvement, but I am not sure it helps much in advising folks on where they should put their money. It should guide improvements for the good organizations, and it should push the not-so-good ones to make an effort to follow the path of the top organizations.

Tuesday, 1 October 2013

Birds killed by cats in Canada

I was reading this news article about human-related causes of bird deaths and found it interesting that cats rank at the top, with around 75% of the bird deaths, that is, 200 million per year. That was not so surprising to me, but I immediately wondered how they could estimate such a number, and how precise it could be. I mean, I am not questioning the methodology or the result, but I knew this is not an easy thing to estimate, and as such a measure of precision would be important. You cannot interview a random sample of cats and ask them how many birds they kill per year. So, knowing the number of birds killed is hard, but it does not stop there: it is also hard to get a random sample of cats, or to estimate the number of cats for that matter. My experience was telling me that this is so hard to estimate, and requires so many assumptions, that the results would be imprecise and the reader should know that. Of course, that does not go well with media stories that want to catch people's attention.

So, I tracked down the actual paper where the estimation was performed. It is a simple paper, mathematically speaking. They use many estimates from different sources, which makes the estimation work very arduous and adds challenges related to the precision of the estimates. I will go quickly through each of these inputs, although I have to say that I did not go deep into the literature they used.

  • This part is about the data used, its quality and the assumptions made. Just a flavour, though.
The number of domestic cats in Canada, around 8 million, seems consistent even though it was estimated from a non-random (online) sample, and even though my first impression was that the figure was too high. But I can't really say.

About 19% of the Canadian population lives in rural areas, and based on information that rural residents own more cats on average than people living in urban areas, the proportion of cats in rural areas was estimated to be around 1/3. I found the proportion of Canadians living in rural areas somewhat high, but it comes from Statistics Canada and should be precise. The thing is that Canada is huge, so even if the density of people in rural areas is low, the total population may add up to a considerable amount. The idea that rural residents own more cats seems fair too. It was taken from a study in Michigan, so not too far from Canada, although we need to remember, again, that Canada is huge.

The number of feral cats is estimated to be between 1.4 and 4.2 million and is based on mostly unknown sources appearing in media reports. The estimate for Toronto, for example, is between 20 and 500 thousand. Such a wide interval is next to useless. This is especially problematic given that feral cats are thought to kill many more birds than domestic cats.

Only the cats that have access to the outdoors are assumed to be able to kill birds. That proportion is reported for the US in some papers and seems to have some consistency, although the range used is quite wide: from 40% to 70%.

Now, the number of birds killed per cat is the real problem. It seems that no good source of direct information exists. So the estimate is built from the reported number of birds brought home by cats, used as a proxy for the number of preyed birds. A few studies report this sort of estimate, but due to differences in location and methodology, I guess, the imprecision is quite high: 0.6 to 6.7 birds per year. At first even the high end of that range seems low considering we are talking about a whole year. The range of 2.8 to 14 used for rural cats seems conservative, but I cannot really say. It is very wide, though. The thing is that rural cats usually roam freely, and there are reasons to believe they may kill birds very often without anybody seeing. Rural cat owners usually keep less track of their cats, but this is from my experience, that is, very anecdotal.

They also use an adjustment factor for birds that are killed but not brought home. This factor is difficult to estimate, and a range between 2 and 5.8 was adopted based on some studies. These numbers are really hard to judge; they are, again, surely very imprecise.

The estimate of birds killed per feral cat is based on the stomach contents of the cats. It is again a very indirect method, which involves assumptions and may have large inaccuracies.

I don't mean to criticize by listing these data sources, because the fact is that many times this is all we have. We increasingly need to be able to use this sort of data, understand the assumptions, and model the unknowns. That is fine. But I also think it is important that assumptions and caveats are stated. They are in this paper, but when the findings get translated into the popular language of the news media, they get lost; all the statistical aspects of the findings get lost. When we say that cats are responsible for 200 million bird deaths in Canada, how precise is that? The typical reader will think it is quite precise; I mean, you will if you just read the CBC article. The paper is pretty simple, so I set out to calculate this precision. It turns out that Figure 2 is all you really need, but I wanted to have some fun, so I decided to replicate it. This is the R code, just using the information in the paper.

nsim = 10000

nPC = rnorm(nsim,8.5,0.25) # Number of Pet Cats (millions)
pRC = runif(nsim,0.27,0.33) # % Rural Cats
pOd = runif(nsim,0.40,0.70) # % Outdoor Cats
BpU = runif(nsim,0.6,6.70) # Avg Birds per Urban Cat
BpR = runif(nsim,2.8,14) # Avg Birds per Rural Cat
Adjust = runif(nsim,2,5.8) # Adj. for non-Returned Birds
nFC = runif(nsim,1.4,4.2) # Number of Feral Cats (millions)
KpF = runif(nsim,24,64) # Kills per Feral Cat

BKu = nPC*(1-pRC)*pOd*BpU*Adjust
BKr = nPC*pRC*pOd*BpR*Adjust
BKf = nFC*KpF
TotalBird = BKu + BKr + BKf

library(ggplot2)
tb = data.frame(TotalBird = TotalBird) # put the simulated totals in a data frame for ggplot
ggplot(tb, aes(x=TotalBird))+geom_histogram(aes(y=..density..), binwidth = 5,fill = "darkgreen")+ ggtitle("Birds Killed by cats in Canada")+xlab("Number of Birds in Millions")+ylab("Density")



quantile(TotalBird, c(0.025, 0.975))
2.5%     97.5%
107.3296 357.0416


I think that despite the differences in formatting, the graphs are very similar. So we see that the 200 million birds killed by cats (domestic + feral) in Canada each year is not precisely 200 million; a 95% interval shows it is between around 100 and 360 million birds. That is a pretty wide interval, but it just reflects the uncertainty in the source data.

Another interesting thing in the paper is that they calculate the relative importance of each term in the equation. I thought about this a little, because they use linear regression while the calculation is not linear, as we can see in the equation in the R code above. I ran the regression and got an R2 of 95%, so perhaps with such a high R2 the relative variance decomposition given by linear regression is acceptable. I don't know; it seems to be an interesting question.
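For what it is worth, here is a rough sketch of the kind of check I mean, using the simulated quantities from the code above; this is not necessarily the exact method the paper used:

# regress the simulated totals on the simulated inputs and look at how the explained variance splits up
fit <- lm(TotalBird ~ nPC + pRC + pOd + BpU + BpR + Adjust + nFC + KpF)
summary(fit)$r.squared                                          # should be close to the 95% mentioned above
round(anova(fit)[["Sum Sq"]] / sum(anova(fit)[["Sum Sq"]]), 2)  # share of variance per term (sequential, so order dependent)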

A final point: as we can see in the code above, the terms used in the calculation are all assumed to be independent. I am not sure how to do better, but this independence is also an assumption that should be kept in mind.
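If one wanted to relax that assumption, one rough option would be to draw some of the inputs from correlated uniforms (a Gaussian copula) instead of independent ones; the latent correlation below is purely illustrative:

library(MASS)
Sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2)       # illustrative latent correlation between two of the inputs
z <- mvrnorm(nsim, mu = c(0, 0), Sigma = Sigma)
u <- pnorm(z)                                   # correlated uniforms on (0, 1)
pRC2 <- 0.27 + (0.33 - 0.27) * u[, 1]           # % rural cats, same range as before but now correlated...
BpR2 <- 2.8 + (14 - 2.8) * u[, 2]               # ...with the average birds per rural cat
cor(pRC2, BpR2)                                 # a bit below the latent 0.5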



Friday, 27 September 2013

Optimizing Transportation

The article "Bus, Bikes and Random Journeys: Crowdsourcing and distribution in Ivory Coast" in the latest issue of Significance caught my attention mainly because it highlights an interesting point in times of global warming: our transportation is very inefficient. We have so many big cars with a single passenger, empty buses and trucks going back and forth, just because oil is so cheap. We have developed our communication technology so much, but very little in terms of transportation; I mean, there are so many huge problems for which solutions seem so possible.

So, with so many people travelling around, I have no doubt that this crowdsourced transportation of things is possible from the operational point of view. But when you consider that it also depends on the willingness of folks to engage, which requires incentives... oh well, we are certainly not there yet. It seems we also need changes in our culture.

I saw the model in the article more as a theoretical mathematical model, too much of an ideal world. It involves an area of statistics mixed with math and computation, one I am not that familiar with. But it seemed very far from reality. I recognize, though, that the effort put into developing such models is valid, as is the intention and the idea of crowdsourcing transportation, as is the simple fact that we start talking about this sort of idea.

As a believer in the goodness of people, I thought from the very beginning of the article that a model was strange and not necessary, and that maybe if this ever happens it will not need a model. People's desire to help and do good would alone make the system efficient. It would be like this: I am taking the train from Toronto to Hamilton, so at Union Station there is a place with packages to be delivered in Hamilton. Everybody knows about it. I grab the ones I think I can carry and take them with me. I think many people would not demand any sort of reward for doing this...

The Significance article does not have any math; it just describes the optimization problem. A more detailed description is in one of the references.

Wednesday, 18 September 2013

Gibbs Sampler

The Gibbs sampler is an algorithm used to generate draws from marginal distributions using the conditional ones. It is a special case of the Metropolis-Hastings algorithm. In this post I just want to respond to Subbiah's suggestion in a comment on my post about the Metropolis-Hastings algorithm. He suggests a paper by George Casella and Edward George, which explains the Gibbs sampler in more or less simple terms. I liked it more than the paper I mentioned about the Metropolis-Hastings algorithm, as it is much more intuitive. Sections 2 and 3, which show how the algorithm works in practice, are quite interesting and written in an intelligent and clear way. The paper does not go into Bayesian statistics; it is more about showing how the Markov chain converges to the desired marginal distribution. It seems to me the paper is very helpful if you are planning to learn Bayesian statistics, in which case it will serve well as a prerequisite for the chapters on simulation from the posterior.
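To make the idea concrete, here is a minimal sketch of the bivariate normal case discussed in the paper, where each variable is drawn in turn from its conditional given the current value of the other; the correlation and the burn-in are my own illustrative choices:

set.seed(1)
n_iter <- 5000
rho <- 0.8                                   # correlation of the target bivariate normal (illustrative)
x <- y <- numeric(n_iter)
for (i in 2:n_iter) {
  # each full conditional of a standard bivariate normal is again normal
  x[i] <- rnorm(1, mean = rho * y[i - 1], sd = sqrt(1 - rho^2))
  y[i] <- rnorm(1, mean = rho * x[i], sd = sqrt(1 - rho^2))
}
keep <- 501:n_iter                           # discard a burn-in
c(mean(x[keep]), sd(x[keep]), cor(x[keep], y[keep]))   # close to 0, 1 and rho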

Monday, 16 September 2013

Simpson's Paradox and the teaching of causal inference

Simpson's paradox, the name given to the phenomenon where the association between two variables changes direction when conditioned on a third variable, can be quite puzzling. In my opinion, the worst thing about the paradox is not the reversal of the association itself, but understanding the consequences it has for causal inference and for our daily work as statisticians. It seems we are always at risk, because there could always be some important variable lurking in the background that could change our results.

This paper by Judea Pearl talks about the paradox in a simple but different way compared to what I had seen before. The first time I read it I thought this sort of explanation of the paradox, with this sort of simple example, should be in the curriculum of statistics students, because it is so useful for understanding causal inference. It is an easy and powerful way of promoting causal thinking in statistical modeling, and at the same time it explains the paradox.

To go into a little more detail about the paper: the association between treatment and outcome is reversed when controlled for (conditioned on) gender. The question that comes to mind, at least it always did in my case, is whether controlling for gender is correct, since doing so totally changes the conclusion.

Well, whether or not we condition on gender depends on whether gender is a confounder. At this point we can think of a confounder as a variable that can cause both the treatment and the outcome. Something that is caused by the treatment cannot be a confounder. By affecting both the treatment and the outcome, the confounder has the power to create an association between them that is just association, not causation. Unfortunately, at this point we need to rely on assumptions, on our understanding of reality, to decide whether gender is a confounder or not.

It turns out that the only thing that makes sense is to think that gender can cause both the choice of treatment and the outcome; therefore it is a confounder. It does not make much sense to think that the association between gender and treatment is a causal effect of the treatment, that is, that treatment causes gender. In the example, males tend to choose the drug much more often than females, and they also tend to improve more often. This makes the drug look good overall, even though within each gender it performs worse than no drug. It goes like this: males, for some reason that is not the drug, improve a lot more than females, and they tend to take the drug. So if we don't account for gender, the drug will look good. It is as if you are a mediocre professor, but for some reason the good students tend to choose you as their supervisor; if we don't look at their scores prior to the selection, you will have students who perform better than those of other professors, and you will look good even if you are worse than the others.
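Here is a tiny numerical illustration of this mechanism; the counts are hypothetical, chosen only to reproduce the reversal described above:

tab <- data.frame(
  gender  = rep(c("Male", "Female"), each = 2),
  treat   = rep(c("Drug", "NoDrug"), 2),
  recover = c(180, 19, 10, 110),
  total   = c(200, 20, 20, 200)
)
tab$rate <- tab$recover / tab$total
tab                                                        # within each gender, NoDrug does slightly better (0.95 vs 0.90, 0.55 vs 0.50)
aggregate(cbind(recover, total) ~ treat, data = tab, sum)  # overall, Drug looks much better: 190/220 = 0.86 vs 129/220 = 0.59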

In this case it is clear that gender has to be controlled for, because the only thing that makes sense is that it causes both treatment choice and outcome. Now replace gender with blood pressure and keep exactly the same data: it no longer makes sense to say blood pressure is a confounder, because it can cause the outcome but it cannot cause the treatment. Now it makes sense to say blood pressure is a mediator, and as such we should not control for it. I thought this was a very interesting example that promotes causal thinking, yet one not explored by the literature.

In the book Categorical Data Analysis, by Alan Agresti, p. 48, we have a different example, which at least to my eyes is more complicated. There the goal is the causal effect of the defendant's race on the death penalty, and the victim's race is the potential confounder. The role of the victim's race as a confounder is not as clear as the role of gender, because some may say the victim's race is caused by the defendant's race and is therefore a mediator. I don't intend to solve this here, but right there I think we have an example that is not so appropriate for the beginning student.

The text focuses on conditional association, why it happens mathematically, and how much it can change things. But it misses the opportunity to go beyond the mathematics into the real world, the world of causal inference, and talk about actual modeling. I really like Agresti's book, which I have been using since its first edition (it is in its third now), but my experience is that the statistics literature really lacks this bridge between the mathematics and the real world. It seems to me that it is pretty bad to teach regression analysis in a purely mathematical, predictive way while ignoring the real-world question, which is the challenge of interpreting regression coefficients as effects.

In conversations at JSM 2013 I came to the conclusion that Simpson's paradox may be too challenging and complex to be used in an introductory statistics course, or with people who don't have a background in statistics, or with the physicians and researchers we work with daily. But for students of statistics, with some math and probability background and familiar with the idea of association, I think it would be a good tool to help introduce causality.

Sunday, 15 September 2013

Relaxing

If you need a break from statistical books with lots of math, I recommend the book The Lady Tasting Tea, by David Salsburg.

The book is an easy, fun read on the history of statistics, but it does not intend to go into too much detail, to cover everything, or even to be linear in time. It is a book especially good for those of us for whom many names are already familiar, even though we don't really know much about them. To use the words of a not very pleased review I read on the Amazon website, the book is a little like a collection of tales our grandfather tells us. And sometimes that is what we need.

Besides, as I said, in our daily work we keep running into names that become familiar even though we actually don't know much about them: the Pearson correlation, the Kolmogorov-Smirnov test, Fisher's exact test, the Neyman-Pearson hypothesis test and so on. I thought the book does a very good job of covering these important names in a joyful way. As an example, I found the story of Student, of the Student's t distribution, very interesting.

Good reading!


Wednesday, 11 September 2013

Conflict of Interest and the research in medicine

I just read this article, where the seduction of money (and maybe other things?) led a highly regarded Alzheimer's researcher to get involved in illegal activities. I will not go into the details, as I don't want this post to be too long, but they are in the linked content. The illegality seems to be related to the fact that non-public information about clinical trial results was being released to traders, but I also think of it as a problem of conflict of interest; I mean, to some extent, being involved with that kind of money and with clinical trials, with the money linked to the results of the trials, may lead to, say, bad research.

I am of the opinion that we live in a world that puts us very far from the ideal environment for doing scientific research, because of all the conflicts of interest. It is sort of the elephant in the room. Although a lot can be said about this issue, it is a heavy one, and I just want to mention the cases where researchers do their own statistical analyses, which happens often. To me the issue is the same as blinding in a drug trial, where blinding should be extended to whoever analyzes the data; in particular, researchers listed as authors on the paper should not do the statistical analyses. And now we are not talking only about trials, but about any research. If a researcher who wants or needs to publish papers is the one who analyzes the data, then there is the potential for what has sometimes been called p-hacking, which basically amounts to searching for p-values of interest, even if not very consciously. Besides this, practices like making the data available, making the analysis code available, reviewing privacy policies (so that data can be made available), and replicating results are some of the things I think we could easily improve on.

Sunday, 8 September 2013

The everlasting discussion about Bayesian statistics

I do not consider myself that old, but it amazes me when I think of the difference between how Bayesian statistics was seen in the early 2000s and how it is seen now, a difference that shapes the way current students are introduced to statistics in contrast with then.

Back then very few academics were admittedly Bayesian, and there was a lot of religious and ferocious opposition to those few. Back then, and perhaps even now, I considered myself sort of lost in this debate, but nevertheless willing to understand matters enough to have an opinion. During my master's I was happy to take a course on Bayesian statistics, an opportunity very rarely offered to students.

Nowadays, for a statistician who, like myself, did not have much contact with Bayesian statistics and is involved with practical problems, it seems to be a subject demanding an update; it is everywhere. Rather than positioning ourselves on one side of the debate, the rule now seems to be to know both sides and use whatever is most appropriate for the problem at hand.

It was with that in mind that I read this paper by Andrew Gelman and Christian Robert and the comments that follow. What surprised me a little was that the set of papers seems to make it clear that the debate is not settled: not only are Bayesians fighting in defense of their choice, but there are still frequentists challenging them. Not that I did not know this, but I guess we just base our position and opinion on the fact that nowadays Bayesian analysis is mainstream, and that works to dismiss the still-existing debate.

Regarding these papers, I was left with the impression that the initial paper by Gelman and Robert was a little too picky about things that do not seem that important, as I think the comments by Stigler and Fienberg put it, and perhaps too defensive, and that Mayo's comments raise some interesting and perhaps valid points against the Bayesian paradigm that seem to be forgotten amid the success of practical applications. However, I have to say, I always have trouble understanding Mayo's rather philosophical language. Finally, I found the comments by Johnson interesting for the references therein, which I did not check but which seem to be of great interest for those seeking to learn Bayesian statistics, especially in comparison with frequentist statistics. In that regard I also got interested in Casella's book on inference, mentioned by Mayo. Too much interest for too little time, though...



Metropolis-Hastings and Bayesian Statistics

My statistical training was focused on classical frequentist inference, but soon after I got to the real world of statistics I realized that Bayesian statistics was not only used more than I thought, but that more and more people were using it. So I set a personal goal to learn it.

I found it was not that easy. Maybe I am not that smart, either. As time went by I got the impression that Bayesian statistics is actually simple, maybe simpler than frequentist statistics, but there are some barriers for those like me who did not get the tools or the way of thinking of Bayesian statistics.

For one, we did not have strong training in simulation, programming, or computational statistics. We also did not learn much about Markov processes, especially those on continuous state spaces. So when we see something like the Metropolis-Hastings algorithm, it looks inaccessible.

Nowadays there are many sources from which we can learn Bayesian statistics, but I always wanted to start by learning this MH algorithm, and many times I wasn't successful. Until I found this paper, in The American Statistician. Using the paper, for the first time I was able to create my own simple Markov chain and simulate from a target distribution, and I just published the R code I used here.

The paper is not that simple, and I actually would not recommend it if you want to learn the MH algorithm and do not have a good understanding of the math. I think the paper is somewhat old, and probably any book on Bayesian statistics published more recently will give you a gentler introduction to the MH algorithm. But the paper got me started and I learnt a lot from it. The reason the MH algorithm is so useful is its power to simulate from intractable distributions, but when we are taking our first steps, even the code I used to simulate from a bivariate normal distribution made me proud.

Here I will leave the function I created. The MH algorithm is actually quite simple. I wanted to publish the whole code here instead of just linking to it above, but it is impressive how difficult that is. Even the GitHub option does not work very well (not to mention I could not color-code the code), because if you have many pieces of code, graphs and R outputs, it becomes a lot of work. The alternative is to paste the code here, copy and paste the outputs and insert figures, which is not really better. The knitr package offers an awesome option to create HTML with code and outputs, but I still have not figured out how to publish that in a blog like this.
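In the meantime, here is a minimal sketch of the kind of sampler I mean, a random-walk Metropolis-Hastings targeting a bivariate normal; the correlation and the proposal scale are illustrative choices, not necessarily those from the code linked above:

set.seed(1)
rho <- 0.8                                     # correlation of the target bivariate normal (illustrative)
log_target <- function(xy) -(xy[1]^2 - 2 * rho * xy[1] * xy[2] + xy[2]^2) / (2 * (1 - rho^2))
n_iter <- 10000
chain <- matrix(0, n_iter, 2)
for (i in 2:n_iter) {
  proposal <- chain[i - 1, ] + rnorm(2, sd = 0.5)             # symmetric random-walk proposal
  log_alpha <- log_target(proposal) - log_target(chain[i - 1, ])
  if (log(runif(1)) < log_alpha) chain[i, ] <- proposal else chain[i, ] <- chain[i - 1, ]
}
colMeans(chain[-(1:1000), ])                                   # close to (0, 0) after a burn-in
cor(chain[-(1:1000), 1], chain[-(1:1000), 2])                  # close to rho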




Sunday, 1 September 2013

Teaching Causal Inference

At JSM 2013 I participated in a round table about teaching causal inference, where I had the privilege of meeting Judea Pearl. The reason I was so interested in the subject, besides the fact that causal inference is of great interest to me, is that if I had to name the biggest gap in my statistical training, it would be causal inference for sure.

They told me that regression was a mathematical equation and that the coefficient of X means "if you increase X by one unit, this coefficient is how much Y will increase on average". Right there, in the very explanation of the coefficient, was the idea of causality, yet causality was never talked about.

It turns out that it is not only that. In the real world, everything is about causality. People may not say it, or they may not even know it (!), but what they want when they run a regression model is a measure of a causal effect. Why then, in a top statistics course, do we not hear anything about the causal assumptions we are making? Or about confounders? It seems to me that not only should causal inference be talked about more, it should perhaps be the core of statistical training.

In the experimental design part of the training we are walked through causal inference in a way, but that never gets linked to regression analysis, or SEM, on observational data, which is what we have most of the time. Statisticians learn that everything they do is conditional on the validity of assumptions, yet they don't seem to want to think about, talk about, or make assumptions about causality.

So, that round table was very interesting for what I learnt, for what we discussed and for the folks I met. I have no teaching experience, so I don't know how to do this, but I am contemplating diving a little deeper into it, perhaps by searching the related literature and putting together some material that makes it easier to understand causality and how it relates to statistical models. I could also use many of the ideas from our round table. And perhaps it could be a joint effort. I think such a thing, if it became mainstream, would be quite helpful not only for current students but maybe for the sciences as a whole, in how it would improve the quality of the research currently done.


Thursday, 25 July 2013

Breakfast and Coronary Heart Disease

One of these days I heard on CBC Radio about new research that supposedly found evidence that skipping breakfast is linked to coronary heart disease. Then I found the news here. I tracked down the actual paper, and it was good to see it is available for free.

The first thing I thought when I heard it on the radio was that people who are busy at work, who work more, who are more dedicated and engaged at work, are likely the ones who skip breakfast, and their crazy routine may also cause CHD. The second thing I thought is something I have thought many times before: in the history of human evolution we were hunter-gatherers most of the time, and it seems to me that a hunter-gatherer would not have had fixed meal times. If they happened to kill something in the afternoon, they would eat then. If they did not hunt anything, they would not eat anything. This is just my perception, but I believe it makes some sense. If so, our bodies would have evolved to be used to irregular eating patterns, maybe with long fasting periods and times of food abundance.

So, in my head the research results were not making sense in two ways, and I wanted to read the paper. It turned out that I liked it; I think they did a good job exploring confounders, but I am not sure they addressed my confounder, the crazy-routine thing. They actually included two related pieces of information: full-time employment and stress at work. But I don't think these are the same thing as what I described above. I think among workers, not necessarily full-time, you have those who are more engaged. They are not necessarily stressed, because these are folks who dedicate themselves to work as if work were their life, their pastime. And I say that because I used to wake up and go straight to running and then to work, with no breakfast, because I could not stand the idea of getting there late. Then I would stay late too, always involved with the work. After changing jobs I started having breakfast... Anyway, they did a good job looking at confounders and mediators, and it seems they did what they could, but one can only look at so many variables.

Upon reading the paper, another thing started bothering me: skipping breakfast is, according to the paper, also associated with many other things that are not so good: obesity, smoking, lack of physical activity, full-time work, hypertension, alcohol use. That seems to indicate that skipping breakfast is a sort of indicator of a not-so-good lifestyle, in the sense that it is caused by, or is part of, this lifestyle rather than causing it. It seems to make more sense to think that both skipping breakfast and CHD are caused by this sort of lifestyle construct.

Finally, there was also some association between late-night eating and CHD, which was not in the media, and to me it reinforces the point that this is not a causal relationship. We also have to consider that the sample had only physicians, although a lot of them, and that it is important to replicate these associations before trusting them too much.

I think it is complicated to prove this one way or another, though, and because of that more care should be taken when reporting these things. They emphasize the causal link too much, and as I see it this is questionable. When this sort of thing falls into the hands of the mass media, different people will interpret it in different ways, and it can be turned into some health myth. And speaking of health myths, I have often, and for a long time, heard that eating at night leads to more weight gain because you go to sleep and don't burn what you ate. The paper also says this is a myth, I mean, that there is some research favouring and some opposing such a claim. We need to be careful about what we believe. This is not to say I am criticizing the paper. My opinion, for what it is worth, is that this is one piece of information, other pieces will join it in the future, and the process should help us understand the association, or lack of association, between eating habits and health conditions. But care should be taken when reporting such findings outside the scientific environment, so as not to make things misleading.

Sunday, 21 July 2013

Multivariate Binary Variables

These days I got involved in a problem where a power calculation had to be done for a repeated measures design with a binary endpoint, involving 4 time points and 3 groups. This is not the sort of thing you can do in software like G*Power; I like to do it through simulation, using Generalized Estimating Equations in this case. Basically, you generate the data you expect to get in the experiment, and this involves somehow estimating variation and effects without having any data (but of course possibly having other information, like variation and effect sizes from similar studies, as well as the experience of the researcher and the clinical assessment of what would be a meaningful effect). I am not very experienced with this; you know, statisticians working in marketing research do not see this kind of thing very often. But I can say that at this point I am getting pretty comfortable with power simulations for outcomes assumed Normal.

This one, however, was binary. If you have 4 time points, you will want to generate 4 binary variables with specified proportions p and a specified correlation structure. How do you simulate multivariate binary variates? My first thought was to use normal quantiles: for example, if the desired p is 0.5, you generate a N(0,1) and recode negative values to 0 and positive values to 1. If you do that with a multivariate normal, you end up with multivariate binary data. The correlation structure will not be the same as the latent one, but you can easily get close to what you want by trial and error. So I put that in my back pocket. I really wanted to generate the zeros and ones in a more principled way.
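Just to make that back-pocket idea concrete, here is a minimal sketch of the thresholding approach; the target proportions and the latent correlation are illustrative:

library(MASS)
set.seed(1)
n <- 10000
p <- c(0.5, 0.3, 0.3, 0.6)                  # desired marginal proportions (illustrative)
R <- matrix(0.5, 4, 4); diag(R) <- 1        # latent normal correlation, to be tuned by trial and error
z <- mvrnorm(n, mu = rep(0, 4), Sigma = R)
y <- 1 * (z > matrix(qnorm(1 - p), n, 4, byrow = TRUE))   # 1 when the latent normal exceeds its threshold
colMeans(y)                                  # close to the target proportions
round(cor(y), 2)                             # the binary correlations end up lower than the latent 0.5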

It turns out that there are quite a few references on this, including on using the normal as I mentioned above. There are also two R packages that seem capable of generating multivariate binary variables, but I did not test them. Then I found this interesting paper. It has an algorithm to generate what I wanted, starting from the Poisson distribution. It is an interesting and not very difficult derivation for the bivariate case. And I managed to program it in SAS, after wrestling with the use of numeric values with decimals inside macros. I forced myself to use PROC IML instead of data steps, to try to learn it.



%macro bibernouli(p1,p2,n,rho);
/* Generate n pairs of correlated Bernoulli variables (z1, z2) with marginal
   probabilities p1 and p2 and correlation rho, built from sums of Poissons */

%let a11 = -1*%sysfunc(log(&p1));
%let a22 = -1*%sysfunc(log(&p2));
%let a12 = %sysfunc(log(1+&rho*%sysfunc(sqrt((1/&p1)*(1-&p1)*(1/&p2)*(1-&p2)))));

proc iml;

rmat = j(%eval(&n),7);
call randgen(rmat[,1],'POISSON',%sysevalf(&a11 - &a12));
call randgen(rmat[,2],'POISSON',%sysevalf(&a22 - &a12));
call randgen(rmat[,3],'POISSON',%sysevalf(&a12));
rmat[,4] = rmat[,1]+rmat[,3];   /* y1: Poisson with rate a11 = -log(p1); the shared x3 creates the correlation */
rmat[,5] = rmat[,2]+rmat[,3];   /* y2: Poisson with rate a22 = -log(p2) */
rmat[,6] = (rmat[,4] = 0);      /* z1 = 1 when y1 = 0, so P(z1 = 1) = exp(-a11) = p1 (elementwise comparison) */
rmat[,7] = (rmat[,5] = 0);      /* z2 = 1 when y2 = 0, so P(z2 = 1) = exp(-a22) = p2 */

create databin from rmat[colnames = {"x1","x2","x3","y1","y2","z1","z2"}];
append from rmat;
close databin;
run;
quit;
%mend bibernouli;

%bibernouli(0.75,0.65,10,0.5)



But I needed 4 correlated variables, not only 2. It is not difficult to generalize this process from 2 to 4 variables, with proportions matching what you want and associations close to what you want. But to get the associations exactly right you need to keep reading the paper past the bivariate case. There you find an algorithm, which I can't say is very complicated, because it is not. I read it through some 10 times. But I don't know, there is a part in the last step that I was not quite getting. I tried to apply it to the bivariate case and it was not working. I am sure it was my fault, though. Then, you know, time is not infinite, so I went back to my normal-based simulation to solve my real-world problem, but still thinking about the algorithm...

Monday, 15 July 2013

Standard Error or Standard Deviation?

I was surprised when a colleague asked me for an opinion on a revision of a paper saying they should present the standard deviation instead of the standard error. I went to the website of Psychiatry Research, Guide for Authors, and there it was, the recommendation to use SD instead of SE. It is as dry and enigmatic as this: "Study group variability should be reported as the standard deviation, not the standard error". I am not sure what they mean.

I am not against using the SD when it is appropriate, and I would not dispute that many folks do not understand the difference or even use them incorrectly. But I think a blanket request to use the SD is not reasonable; a more elaborate instruction should be given, perhaps referencing one of the several papers discussing the difference between SD and SE. It turns out that in this specific case I totally disagreed with presenting the SD, because they were presenting tables mostly with percentages, comparing those percentages across three groups, even though the tables were preliminary descriptive analyses giving the reader some background on the variables that were later included in the model. The SD is next to useless for percentages, really, while the SE would give the reader a quick idea of whether the group means differed and how precise they were.
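Just to illustrate the difference for a percentage, with made-up numbers: for a proportion p estimated from n subjects, the SD of the underlying 0/1 variable is sqrt(p(1-p)), which says little about precision, while the SE of the estimated percentage is sqrt(p(1-p)/n).

p <- 0.35; n <- 150           # hypothetical group where 35% of 150 subjects have the outcome
sqrt(p * (1 - p))             # SD of the 0/1 variable: about 0.48, roughly the same for any p near 0.5
sqrt(p * (1 - p) / n)         # SE of the estimated percentage: about 0.04, the precision we usually care about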

And this incident followed a rejection by another journal, a week earlier, of a paper I thought was very good. There were four justifications for the rejection, and all they did was show that whoever wrote them did not understand anything about our model. The first thing I commented to my collaborator was "How can a person review something they know nothing about?" It seemed blatantly unethical to me; it seemed so unlike science.

Then you cannot say much. It is as if they are godlike beings, as if they know more than you, period. I think the system for publication and peer review needs to be different. I would not mind talking to them about our paper, maybe a phone call, maybe talking to their statistician if they have one.

I decided to take a quick look at the Psychiatry Research papers; I wanted to see if they only publish standard deviations. I went to the first paper of the latest issue (well, the latest available to me). In section 2.2, Subjects, they do present some SDs, and I think that is appropriate because they are describing the sample, not making inference. However, the way they present them, with that plus/minus sign, makes things confusing. It looks like a confidence interval, which is an inferential procedure, and therefore makes us think they are presenting SEs (or 1.96*SE). I have always found this plus/minus notation very confusing, not to say misleading or plain wrong.

Then we have Table 3, where four groups are compared in an inferential way; there is an ANOVA there. And they show mean and SD? This is where I disagree. It would be important to have the SE here, maybe confidence intervals. Here we are talking about means and we want to know how precise they are, that is, their SEs.

So, I don't know, I think this instruction to use SD is already causing more harm than good, because there is no such thing as always use this or that; different things are appropriate for different occasions.


Saturday, 13 July 2013

Visualizing Longitudinal Data

Another paper in the most recent issue of The American Statistician shows how to use something they call the triangle plot to visualize some aspects of longitudinal data. It is an interesting idea, and even more interesting because my impression is that dropouts are not analyzed as they should be.

In repeated measures experiments, data are collected at several points in time, and it always happens that some subjects get lost for different reasons; these are the dropouts. So, for example, everybody is observed at time 1, but at time 2 some folks do not show up and we therefore have missing values there. At time 3 some more folks do not show up, and so on. If we are measuring quality of life (borrowing the example from the paper), then it is important to understand whether the folks who drop out of the study have a different quality of life compared to the ones who stay. If they do, the dropouts are said to be informative, and such information should be studied and perhaps used in the analysis in order to avoid possible biases. Basically, dropouts are missing data, and if they are informative, they are not missing at random.

You can check the paper for some neat graphs, but I tried to be creative and make my own triangle plot; in this case the data are a little different, but I think the idea of repeated measures and informative dropouts still holds, sort of.

I wanted to play around with ggplot2 rather than using the lattice code provided in the paper. Nothing against lattice, but forcing myself to use ggplot2 is a way of learning more.

I tried to think about some sort of data I could use for a triangle plot. That was not very easy. But I found some data on the number of medals per country in the Paralympic Games. Let's use these data for a triangle plot.

OlympicData <- read.csv("OlympicData.csv", quote = "")  # the file was read from a local Google Drive folder in the original post
library(ggplot2)
OlympicData[1:20, ]
##                country y2012 y2008 y2004 y2000 y1996 y1992 y1988 Particip
## 1          Afghanistan     0     0     0    NA     0    NA    NA        4
## 2              Albania     0    NA    NA    NA    NA    NA    NA        1
## 3              Algeria    19    15    13     3     7     0    NA        6
## 4              Andorra     0    NA    NA    NA    NA    NA    NA        1
## 5               Angola     2     3     3     0     0    NA    NA        5
## 6  Antigua and Barbuda     0    NA    NA    NA    NA    NA    NA        1
## 7            Argentina     5     6     4     5     9     2     9        7
## 8              Armenia     0     0     0     0     0    NA    NA        5
## 9            Australia    85    79   101   149   106    76    96        7
## 10             Austria    13     6    22    15    22    22    35        7
## 11          Azerbaijan    12    10     4     1     0    NA    NA        5
## 12             Bahamas    NA    NA    NA    NA    NA    NA     0        1
## 13             Bahrain     0     0     1     2     0     1     3        7
## 14          Bangladesh    NA     0     0    NA    NA    NA    NA        2
## 15            Barbados     0     0     0     0    NA    NA    NA        4
## 16             Belarus    10    13    29    23    13    NA    NA        5
## 17             Belgium     7     1     7     9    25    17    41        7
## 18               Benin     0     0     0     0    NA    NA    NA        4
## 19             Bermuda     0     0     0     0     0    NA    NA        5
## 20  Bosnia-Herzegovina     1     1     1     1     0    NA    NA        5


Then we aggregate the data to see the average number of medals by number of participations.

x <- aggregate(x = OlympicData, by = list(OlympicData$Particip), mean, na.rm = T)
x1 <- as.data.frame(x)
x1
##   Group.1 country   y2012   y2008   y2004    y2000  y1996 y1992 y1988
## 1       1      NA  0.0000     NaN  1.0000  0.00000    NaN 13.25 83.00
## 2       2      NA  0.4167  0.2000  0.0000  0.00000  0.000 13.00  0.50
## 3       3      NA  0.1176  0.2353  0.0000  0.14286  2.000  0.00  8.50
## 4       4      NA  0.1111  0.1111  0.1765  0.07692  0.000  0.00  1.00
## 5       5      NA  8.1333  6.7333  6.2759  5.51724  2.741  0.00  0.00
## 6       6      NA  7.5000  7.0455  8.4545  8.50000 10.682 10.55  1.00
## 7       7      NA 22.5306 21.7551 24.3673 26.69388 25.735 25.02 39.55
##   Particip
## 1        1
## 2        2
## 3        3
## 4        4
## 5        5
## 6        6
## 7        7

This does not work as dropouts in repeated measures data, because some countries are not in 1988 but are in 2000, for example, whereas in repeated measures, when someone drops out they do not return. So we need to exclude that part of the data. This way, the 2012 column will only include countries with 7 data points, the 2008 column those with 6 or 7 data points, and so on. Of course, by making this exclusion we are not being as representative anymore… No big deal, we are just playing…

x2 <- x1[, 3:10]
library(reshape)
## Loading required package: plyr
## Attaching package: 'reshape'
## The following object is masked from 'package:plyr':
## 
## rename, round_any
x3 <- subset(melt(x2), variable != "Particip")
## Using as id variables
x3$Participation <- rep(seq(1, 7), 7)
x3$Year <- rep(seq(from = 7, to = 1, by = -1), each = 7)
x4 <- subset(x3, Participation >= Year)
x4
##    variable    value Participation Year
## 7     y2012 22.53061             7    7
## 13    y2008  7.04545             6    6
## 14    y2008 21.75510             7    6
## 19    y2004  6.27586             5    5
## 20    y2004  8.45455             6    5
## 21    y2004 24.36735             7    5
## 25    y2000  0.07692             4    4
## 26    y2000  5.51724             5    4
## 27    y2000  8.50000             6    4
## 28    y2000 26.69388             7    4
## 31    y1996  2.00000             3    3
## 32    y1996  0.00000             4    3
## 33    y1996  2.74074             5    3
## 34    y1996 10.68182             6    3
## 35    y1996 25.73469             7    3
## 37    y1992 13.00000             2    2
## 38    y1992  0.00000             3    2
## 39    y1992  0.00000             4    2
## 40    y1992  0.00000             5    2
## 41    y1992 10.55000             6    2
## 42    y1992 25.02041             7    2
## 43    y1988 83.00000             1    1
## 44    y1988  0.50000             2    1
## 45    y1988  8.50000             3    1
## 46    y1988  1.00000             4    1
## 47    y1988  0.00000             5    1
## 48    y1988  1.00000             6    1
## 49    y1988 39.55102             7    1


At this point we have structured the data so that it is suitable for ggplot2. I am not really an expert in ggplot2, so maybe this is not even the best way of doing it, but I googled around and this is what I found. So, let's do the triangle plot.

p <- ggplot(x4, aes(x = variable, y = Participation, fill = value)) + geom_tile() + theme_bw()
p + xlab("Year") + ylab("Participations") + scale_fill_gradient(low = "lightblue", 
    high = "darkblue", name = "# Medals") + theme(panel.grid.major.x = element_blank(), 
    panel.grid.major.y = element_blank())




That is it! The direction of the chart is opposite to what we see in the paper, and the scales and colors could likely be improved. But that is just formatting, and I did not bother with it.

We can see that there seems to be a higher average number of medals for the countries that participated in more years. The exception is the countries that participated only in 1988 (therefore only one participation), because they don't follow the pattern and have a high average number of medals. When I looked at the data, the reason was clear: the only three countries that participated in 1988 and only then were the USSR and the Federal Republic of Germany, both with a very high number of medals of course, and the Bahamas with 0 medals that year. So those two outliers explain the high average in the bottom right square.


I just need to learn how to publish these things directly from R...


Thursday, 11 July 2013

Nonidentifiability and Bayes

A third paper from The American Statistician compares how nonidentifiability plays out under Bayesian inference and under maximum likelihood inference.

More than the comparison itself, I found that the paper does an interesting job of explaining nonidentifiability with a simple example, without doing away with all the mathematics.

For folks working with structural equation models, identification of the model is a routine concern. A model cannot be fit by maximum likelihood if the maximum of the likelihood is not unique. One cannot find a unique pair (X, Y) that maximizes a function depending on them only through X + Y = 10, to give a commonplace example. In SEM things are much more complicated because no easy rule exists to prove identification, and your only clue is the software yelling at you that it cannot fit the model.
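A toy version of that commonplace example, using a made-up "likelihood" that depends on the parameters only through their sum:

loglik <- function(par) -(par[1] + par[2] - 10)^2            # flat ridge along X + Y = 10
optim(c(0, 0), loglik, control = list(fnscale = -1))$par     # converges to one point on the ridge...
optim(c(8, 0), loglik, control = list(fnscale = -1))$par     # ...while a different start gives a different "maximum" with the same value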

Under Bayesian statistics one does not maximize things, but rather calculates the posterior distribution and uses it for credible intervals, medians, means. I am no expert on this, but after reading the paper I would think nonidentifiability would also be a problem for Bayesians if they (we?) were often interested in the mode of the posterior.

So it is worth reading; it was for me. On the curiosity side, the paper is co-written by some folks from the University of São Paulo, where I had the privilege of studying a while ago. Maybe a little more than a while ago. The example in the paper is a modification of one found in a book by Sheldon Ross, a book that I used in my first probability course and that caused a lot of sweating: we had to solve most of the exercises in the book, and there are so many...


Wednesday, 3 July 2013

Confidence Intervals through simulation

A paper that appeared in the latest edition of The American Statistician discusses the construction of confidence intervals for functions of parameters using simulation. Confidence intervals are becoming more important in the sense that they are seen as giving more information about the parameter than hypothesis tests, and I tend to agree with this trend. I think presenting confidence intervals, especially using graphs, is usually a very nice way of showing inferential results, allowing us even to disregard p-values.

So, it often happens that one needs to estimate such intervals for functions of parameters. For example, in a recent study we needed to estimate an effect in percent in a regression discontinuity model; basically we wanted a confidence interval for the ratio of two regression coefficients.

The first thing that comes to my mind is the delta method, where you use derivatives of the function of the parameter to find a confidence interval for the function. It is fairly easy to apply if the derivatives can be calculated analytically. For example, in the univariate case, Var(g(theta_hat)) is approximately g'(theta)^2 * Var(theta_hat). Of course this is just an approximation and it depends on some assumptions, but it usually works quite well.

The second thing that comes to mind is the bootstrap, which is usually fine, but it comes at the price of some time spent on programming.

The paper talks about a third option, which I had not heard about. The basic idea is that if we want a confidence interval for g(theta), we can estimate theta (say, theta_hat) and its variance from our sample, and, assuming its distribution converges to a Normal, we can generate normally distributed noise with mean zero and variance equal to var(theta_hat). We generate many of these noise values, say n1, n2, n3, ..., nk, where k may be 10000. Then for each simulated noise value ni we calculate g(theta_hat + ni), and the quantiles of these values give us a confidence interval for g(theta) that is usually less expensive computationally than the bootstrap.
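Here is a rough sketch of that idea for the ratio of two regression coefficients mentioned above; the data and the model are made up, and the paper should be consulted for the exact conditions under which this works:

set.seed(1)
# made-up data: we want a 95% interval for the ratio of the coefficients of x1 and x2
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 4 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
b   <- coef(fit)[c("x1", "x2")]
V   <- vcov(fit)[c("x1", "x2"), c("x1", "x2")]

# simulate the coefficients from their approximate normal distribution and take quantiles of the ratio
library(MASS)
k <- 10000
sims  <- mvrnorm(k, mu = b, Sigma = V)
ratio <- sims[, 1] / sims[, 2]
quantile(ratio, c(0.025, 0.975))   # simulated 95% interval, around the true ratio of 0.5 here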

The paper covers the theory and assumptions of the method more thoroughly, as well as cases where it may not work well, so in any case it is worth reading the text, as my description is quite condensed...

Tuesday, 2 July 2013

Exterior Match

I find it useful to write about texts I read, since that usually helps me understand and record the content in my brain (which is sort of a difficult thing, by the way).

This time I found an interesting paper in The American Statistician, which talks about exterior matching. I was familiar with the concept of matching (although I have never really used it), but not with exterior matching.

It is common in causal inference, especially with observational data, to match subjects with others who are similar on some important variables before comparing groups. For example, the paper talks about survival time for black women with breast cancer being lower than for white women. That on its own is quite concerning, but it is important to understand why it happens.

In causal inference we often have a research question like "does treatment X improve survival time for breast cancer?", and if all we have is observational data, we would like to match or control for things like race in order to make sure the apparent effect of treatment X is not actually confounding.

But in this case we have these two races, and we are more interested in why the survival times are different, so that we can possibly do something about it. It could be that black women come to the hospital at more advanced stages, or that black women are older, or that they receive a different treatment, or who knows; it could be many things.

The idea is that we match women according to age: every black woman is matched with a white woman who had breast cancer and is the same age. Then we look at survival time. We then further match each black woman with a white woman, but now on both age and stage of cancer.

The survival times may become more similar after matching on age, because now we are not just comparing black with white women, but also making sure they have the same age. So this is the first comparison, matched on age only. In the second comparison we further make sure they have the same initial stage of cancer. Therefore it is natural and expected, although not guaranteed, that the survival times become closer, to the extent that ages differ between the groups and age influences survival time (speaking of the first comparison).

By testing whether there is a significant difference in survival times between the first and second comparisons, we can gather evidence on whether initial stage of cancer is a factor, after controlling for age. This could be thought of as a simple comparison of the survival times of the white women in the first match with those of the white women in the second match.

The issue is that many times these two groups of white women overlap, that is, the same white woman appears in both matches. Therefore the groups are not independent and cannot be compared with the usual statistical tests, which assume independence.

The paper is about solving this difficulty by excluding the overlaps and working only with the white women who appear only once, matching them in the best possible way in order to end up with three groups (black, white from match 1 and white from match 2) that are independent.

I thought it is an interesting paper, but it seems to me that, especially if the overlap is large, we will end up with concerns about whether the final match is good and efficient. But there is no free lunch; we will always have to live with this sort of assumption.