Monday, 16 September 2013

Simpson's Paradox and the teaching of causal inference

Simpson's paradox, the phenomenon where the association between two variables changes direction when we condition on a third variable, can be quite puzzling. In my opinion, though, one of the worst things about the paradox was not the reversal of the association itself, but understanding its consequences for causal inference and for our daily work as statisticians. It seems we are always at risk, because there could always be some important variable lurking in the background that would change our results.

This paper by Judea Pearl discusses the paradox in a simple way that is different from what I had seen before. The first time I read it, I thought that this sort of explanation, with this sort of simple example, should be in the curriculum of statistics students, because it is so useful for understanding causal inference. It is an easy and powerful way of promoting causal thinking in statistical modeling, and at the same time it explains the paradox.

To go into a little more detail about the paper: the association between treatment and outcome is reversed when we control for (condition on) gender. The question that comes to mind, at least it always did in my case, is whether controlling for gender is correct, since doing so completely changes the conclusion.

Well, whether or not we condition on gender depends on whether gender is a confounder. At this point we can think of a confounder as a variable that can cause both the treatment and the outcome. Something that is caused by the treatment cannot be a confounder. By changing both the treatment and the outcome, the confounder has the power to create an association between them that is just association, not causation. Unfortunately, at this point we need to rely on assumptions, on our understanding of reality, to decide whether gender is a confounder or not.

It turns out that the only thing that makes sense is to think that gender can cause both the choice of treatment and the outcome, and therefore it is a confounder. It does not make much sense to think that the association between gender and treatment is a causal effect of the treatment, that is, that treatment causes gender. In the example, males choose the drug much more often than females, and they also improve more often. This makes the drug look good overall, even though within each gender it performs worse than no drug. It goes like this: males, for some reason that is not the drug, improve a lot more often than females, and they also tend to take the drug. So if we don't account for gender, the drug will look good. It is like being a mediocre professor whom, for some reason, the good students tend to choose as their supervisor: if we don't look at the students' scores before the selection, your students will perform better than those of other professors, and you will look good even if you are worse than the others.
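To make the reversal concrete, here is a minimal Python sketch. The counts are illustrative and are an assumption of this sketch: they were chosen only to reproduce the pattern described above (males take the drug more often and recover more often, while the drug does worse within each gender). The names `counts`, `stratum_rates` and `pooled_rate` are made up for the example.

```python
# Illustrative counts (an assumption of this sketch, chosen to match the
# pattern described in the text): (recovered, total) per gender and arm.
counts = {
    "male":   {"drug": (18, 30), "no_drug": (7, 10)},
    "female": {"drug": (2, 10),  "no_drug": (9, 30)},
}


def stratum_rates(counts, arm):
    """Recovery rate of one treatment arm within each gender."""
    return {
        gender: arms[arm][0] / arms[arm][1]
        for gender, arms in counts.items()
    }


def pooled_rate(counts, arm):
    """Recovery rate of one treatment arm when gender is ignored."""
    recovered = sum(arms[arm][0] for arms in counts.values())
    total = sum(arms[arm][1] for arms in counts.values())
    return recovered / total


for arm in ("drug", "no_drug"):
    print(arm, "by gender:", stratum_rates(counts, arm),
          "pooled:", pooled_rate(counts, arm))
```

With these counts the drug has the lower recovery rate among males (0.6 versus 0.7) and among females (0.2 versus 0.3), yet the higher rate when the two groups are pooled (0.5 versus 0.4), because most of the drug takers are males, who recover more often regardless of treatment.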

In this case it is clear that gender has to be controlled for, because the only thing that makes sense is that it causes both the treatment choice and the outcome. Now let's replace gender with blood pressure and keep exactly the same data. It no longer makes sense to say that blood pressure is a confounder, because it can cause the outcome but not the treatment. Now it makes sense to say that blood pressure is a mediator, that is, something the treatment affects on its way to the outcome, and as such we should not control for it. I think this is a very interesting example that promotes causal thinking, but one that is not explored in the literature.
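The two readings of the same table can be put side by side in a short sketch. It assumes illustrative counts (not asserted to be the paper's data) and two hypothetical helper functions. If the third variable is taken to be a confounder, one standard estimate is the stratum-weighted (standardized) difference in recovery rates; if it is taken to be a mediator and nothing else confounds the treatment, the pooled difference is the one to read. Which of the two applies is a causal assumption that the data cannot settle.

```python
# Illustrative counts (an assumption of this sketch): (recovered, total)
# per stratum of the third variable and per treatment arm.
counts = {
    "stratum_a": {"drug": (18, 30), "no_drug": (7, 10)},
    "stratum_b": {"drug": (2, 10),  "no_drug": (9, 30)},
}


def pooled_difference(counts):
    """Drug minus no-drug recovery rate, ignoring the third variable.

    Appropriate if the third variable is a mediator (e.g. blood pressure)."""
    rates = {}
    for arm in ("drug", "no_drug"):
        recovered = sum(arms[arm][0] for arms in counts.values())
        total = sum(arms[arm][1] for arms in counts.values())
        rates[arm] = recovered / total
    return rates["drug"] - rates["no_drug"]


def adjusted_difference(counts):
    """Stratum-specific differences weighted by stratum size.

    Appropriate if the third variable is a confounder (e.g. gender)."""
    n_total = sum(total for arms in counts.values()
                  for _, total in arms.values())
    diff = 0.0
    for arms in counts.values():
        weight = sum(total for _, total in arms.values()) / n_total
        rate_drug = arms["drug"][0] / arms["drug"][1]
        rate_no_drug = arms["no_drug"][0] / arms["no_drug"][1]
        diff += weight * (rate_drug - rate_no_drug)
    return diff


print("pooled difference:  ", round(pooled_difference(counts), 3))
print("adjusted difference:", round(adjusted_difference(counts), 3))
```

With these counts the pooled difference is positive and the adjusted difference is negative, so the same numbers support opposite conclusions depending on the causal story we are willing to assume.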

In the book Categorical Data Analysis, by Alan Agresti, p. 48, we have a different example, which at least to my eyes is more complicated. There the goal is the causal effect of the defendant's race on the death penalty, and the victim's race is the potential confounder. The role of the victim's race as a confounder is not as clear as the role of gender, because some may say that the victim's race is caused by the defendant's race and is therefore a mediator. I don't intend to settle this here, but I think this is an example that is not so appropriate for the beginning student.

The text focuses on conditional association: why, mathematically, the reversal happens and how much it can change things. But it misses the opportunity to go beyond the mathematics into the real world, the world of causal inference, and to talk about actual modeling. I really like Agresti's book, which I have been using since its first edition (it is in its third now), but my experience is that the statistics literature really lacks this bridge between the mathematics and the real world. It seems to me that it is pretty bad to teach regression analysis in a purely mathematical, predictive way while ignoring the real-world question, which is the challenge of interpreting regression coefficients as effects.

In conversations at JSM 2013 I came to the conclusion that Simpson's paradox may be too challenging and complex to be used in an introductory statistics course, or with people who don't have a background in statistics, or with the physicians and researchers we work with daily. But for students of statistics, with some math and probability background and familiar with the idea of association, I think it would be a good tool to help introduce causality.
