The article "Bus, Bikes and Random Journeys: Crowdsourcing and distribution in Ivory Coast" in the last issue of Significance caught my attention mainly because it brings to light an interesting point in times of global warming - our transportation is very inefficient. We have so many big cars with one single passenger, empty buses and trucks going back and forth, just because oil is so cheap. We have developed so much our communication technology, but very little in terms of transportation, I mean, at least there are so many huge problems for which the solutions seems so possible.
So, with so many people traveling around, I have no doubt that this crowdsourced transportation of goods is feasible from the operational point of view. But when you consider that it also depends on the willingness of people to engage, which requires incentives... well, we are certainly not there yet. It seems we also need changes in our culture.
I saw the model in the article more as a theoretical mathematical exercise, too much of an ideal world. It involves an area of statistics that blends math and computation, one I am not that familiar with, and it seemed far from reality. I recognize, though, that the effort put into developing such models is valid, as is the idea of crowdsourcing transportation, and so is the simple fact that we are starting to talk about this sort of idea.
As a believer in the goodness of people, I thought from the very beginning of the article that a model was strange and perhaps unnecessary; if this ever happens, it may not need a model at all. People's desire to help and do good could alone make the system efficient. It would work like this: I am taking the train from Toronto to Hamilton, so at Union Station there is a place with packages to be delivered in Hamilton. Everybody knows about it. I pick up the ones I think I can carry and take them with me. I believe many people would not demand any sort of reward for doing this...
The Significance article does not contain any math; it just describes the optimization problem. A more detailed description is in one of its references.
Wednesday, 18 September 2013
Gibbs Sampler
The Gibbs sampler is an algorithm used to generate draws from a marginal distribution using the conditional distributions. It is a special case of the Metropolis-Hastings algorithm. In this post I just want to respond to Subbiah's suggestion, made in a comment on my post about the Metropolis-Hastings algorithm. He suggests a paper by George Casella and Edward George which explains the Gibbs sampler in more or less simple terms. I liked it more than the paper I mentioned for the Metropolis-Hastings algorithm, as it is much more intuitive. Sections 2 and 3, which show how the algorithm works in practice, are quite interesting and written in an intelligent and clear way. The paper does not go into Bayesian statistics; it is more about showing how the Markov chain converges to the desired marginal distribution. It seems to me the paper is very helpful if you are planning to learn Bayesian statistics, in which case it will serve well as a prerequisite for the chapters on simulating from the posterior.
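To make the idea concrete, here is a toy example of my own in R (not code from the paper): a Gibbs sampler for a standard bivariate normal with correlation rho, where both full conditionals are known normal distributions, so we can check that the simulated marginal of x comes out standard normal.

# Toy Gibbs sampler for a standard bivariate normal with correlation rho.
# The full conditionals are known exactly:
#   X | Y = y  ~  N(rho * y, 1 - rho^2)
#   Y | X = x  ~  N(rho * x, 1 - rho^2)
gibbs_bvn <- function(n_iter = 10000, rho = 0.8) {
  x <- numeric(n_iter)
  y <- numeric(n_iter)
  x[1] <- 0  # arbitrary starting values
  y[1] <- 0
  for (i in 2:n_iter) {
    x[i] <- rnorm(1, mean = rho * y[i - 1], sd = sqrt(1 - rho^2))
    y[i] <- rnorm(1, mean = rho * x[i],     sd = sqrt(1 - rho^2))
  }
  data.frame(x = x, y = y)
}

draws <- gibbs_bvn()
keep <- draws$x[-(1:1000)]  # drop a burn-in period
c(mean(keep), sd(keep))     # should be close to 0 and 1

The point, as in the paper, is that alternating draws from the two conditionals form a Markov chain whose values eventually behave like draws from the marginal distribution.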
Monday, 16 September 2013
Simpson's Paradox and the teaching of causal inference
Simpson's paradox, the name given to the phenomenon where the association between two variables changes direction when conditioned on a third variable, can be quite puzzling. In my opinion, though, the worst thing about the paradox is not the reversal of the association itself, but understanding its consequences for causal inference and for our daily work as statisticians. It seems we are always at risk, because there could always be some important variable lurking in the background that could change our results.
This paper by Judea Pearl discusses the paradox in a simple but different way compared to what I had seen before. The first time I read it, I thought this sort of explanation, with this sort of simple example, should be in the curriculum of statistics students, because it is so useful for understanding causal inference. It is an easy and powerful way of promoting causal thinking in statistical modeling, and at the same time it explains the paradox.
To go into a little more detail: in the paper's example, the association between treatment and outcome is reversed when controlled for (conditioned on) gender. The question that comes to mind, at least it always did in my case, is whether controlling for gender is correct, since doing so totally changes the conclusion.
Well, whether or not we condition on gender has to do with whether gender is a confounder. At this point we can think of a confounder as a variable that can cause both the treatment and the outcome; something that is caused by the treatment cannot be a confounder. By influencing both the treatment and the outcome, a confounder has the power to create an association between them that is mere association, not causation. Unfortunately, we have to rely on assumptions, on our understanding of reality, to decide whether gender is a confounder or not.
It turns out that the only thing that makes sense is that gender can cause both the choice of treatment and the outcome; therefore it is a confounder. It does not make much sense to read the association between gender and treatment as a causal effect of the treatment, that is, treatment causing gender. In the example, males choose the drug much more often than females, and they also improve more often. This makes the drug look good overall, even though within each gender it performs worse than no drug. It goes like this: males, for some reason that is not the drug, improve a lot more than females, and they also tend to take the drug; so if we don't account for gender, the drug will look good. It is as if you are a mediocre professor, but for some reason the good students tend to choose you as their supervisor: if we don't look at their scores prior to the selection, your students will perform better than other professors' students, and you will look good even if you are worse than the others.
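To make the reversal concrete, here is a small set of made-up counts in R (my own illustrative numbers, not the table from Pearl's paper) with exactly this structure:

# Made-up counts showing Simpson's paradox: within each gender the drug
# does worse than no drug, yet it looks better overall, because males
# both take the drug more often and recover more often.
counts <- data.frame(
  gender    = c("male", "male", "female", "female"),
  treatment = c("drug", "none", "drug", "none"),
  recovered = c(180, 57, 12, 50),
  total     = c(200, 60, 60, 200)
)
counts$rate <- counts$recovered / counts$total
counts
# Within males:   drug 0.90 vs none 0.95  (drug worse)
# Within females: drug 0.20 vs none 0.25  (drug worse)

# Marginal rates, ignoring gender: the direction flips
with(counts, tapply(recovered, treatment, sum) / tapply(total, treatment, sum))
#   drug: 192/260 = 0.74    none: 107/260 = 0.41  (drug looks "better")

Keep these numbers in mind for the next paragraph, where exactly the same data get a different causal reading.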
In this case it is clear that gender has to be controlled for, because the only sensible reading is that it causes both treatment choice and outcome. Now replace gender with blood pressure and keep exactly the same data: it no longer makes sense to call blood pressure a confounder, because it can cause the outcome but not the treatment. Instead it makes sense to call blood pressure a mediator, and as such we should not control for it. I found this a very interesting example for promoting causal thinking, yet one not explored by the literature.
In the book Categorical Data Analysis, by Alan Agresti (p. 48), we have a different example, which at least to my eyes is more complicated. There the goal is the causal effect of the defendant's race on the death penalty, and the victim's race is the potential confounder. The role of the victim's race as a confounder is not as clear as the role of gender, because some may say the victim's race is caused by the defendant's race and is therefore a mediator. I don't intend to settle this here, but right there I think we have an example that is not so appropriate for the beginning student.
The text focuses on conditional association: why, mathematically, it happens and how much it can change things. But it misses the opportunity to go beyond the mathematics into the real world, the world of causal inference, and talk about actual modeling. I really like Agresti's book, which I have been using since its first edition (it is in its third now), but my experience is that the statistics literature really lacks this bridge between the mathematics and the real world. It seems to me that it is pretty bad to teach regression analysis in a purely mathematical, predictive way while ignoring the real-world question, which is the challenge of interpreting regression coefficients as effects.
In conversations at JSM 2013 I came to the conclusion that Simpson's paradox may be too challenging and complex to use in an introductory statistics course, or with people who don't have a background in statistics, such as the physicians and researchers we work with daily. But for statistics students with some math and probability background, already familiar with the idea of association, I think it would be a good tool for introducing causality.
Sunday, 15 September 2013
Relaxing
If you need a break from statistics books with lots of math, I recommend The Lady Tasting Tea, by David Salsburg.
The book is an easy, fun read on the history of statistics, though it does not intend to go into much detail, to cover everything, or even to be linear in time. It is especially good for those of us to whom many names are already familiar but who don't really know much about the people behind them. To use the words of a not-very-pleased review I read on Amazon, the book is a little like a collection of tales that our grandfather tells us. And sometimes that is what we need.
Besides, as I said, in our daily work we keep meeting names that become familiar even though we actually know little about them: the Pearson correlation, the Kolmogorov-Smirnov test, Fisher's exact test, the Neyman-Pearson hypothesis test, and so on. I think the book does very well in covering these important names in a joyful way. As an example, I found the story of Student, of the Student's t distribution, very interesting.
Good reading!
Wednesday, 11 September 2013
Conflict of interest and research in medicine
I just read this article, in which the seduction of money (and maybe other things?) led a highly regarded Alzheimer's researcher to get involved in illegal activities. I will not give the details, as I don't want this post to be too long, but they are in the linked content. The illegality seems related to the fact that non-public information about clinical trial results was being released to traders, but I also see it as a problem of conflict of interest: I think that being involved with that kind of money, and with clinical trials whose results the money is linked to, may to some extent lead to, say, bad research.
I am of the opinion that we live in a world that puts us very far from the ideal environment for doing scientific research, because of all the conflicts of interest. It is sort of the elephant in the room. Although a lot can be said about this issue, it is a heavy one, and I just want to mention the cases where researchers do their own statistical analyses, which happens often. To me the issue is the same as blinding in a drug trial: blinding should be extended to whoever analyzes the data, and in particular, researchers listed as authors on the paper should not do the statistical analyses. And now we are talking not only about trials but about any research. If a researcher who wants or needs to publish papers is the one who analyzes the data, then there is the potential for what has sometimes been called p-hacking, which basically means searching for p-values that are of interest, even if not very consciously. Beyond this, practices like making data available, making analysis code available, reviewing privacy policies (so that data can be made available), and replicating results are some of the things I think we could easily improve on.
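To illustrate the point with a toy simulation of my own (nothing to do with the article): if an analyst measures twenty outcomes that are all pure noise and highlights only the smallest p-value, "significant" findings become the rule rather than the exception.

# Toy p-hacking simulation: 20 outcomes with no real effect, and the
# analyst reports only the smallest p-value from 20 two-sample t-tests.
set.seed(1)
one_study <- function(n_outcomes = 20, n = 30) {
  pvals <- replicate(n_outcomes, t.test(rnorm(n), rnorm(n))$p.value)
  min(pvals)
}
# Chance of at least one p < 0.05 is about 1 - 0.95^20, i.e. roughly 64%
mean(replicate(2000, one_study()) < 0.05)

Nothing here is fraud in the legal sense; it is just what happens when the person with a stake in the result is free to keep looking.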
Sunday, 8 September 2013
The everlasting discussion about Bayesian statistics
I do not consider myself that old, but it amazes me to think of the difference between how Bayesian statistics was seen in the early 2000s and how it is seen now, a difference that shapes the way current students are introduced to statistics compared with back then.
Back then very few academics were admittedly Bayesian, and there was a lot of religious, ferocious opposition to those few. Back then, and perhaps even now, I considered myself somewhat lost in this debate, but nevertheless willing to understand it well enough to have an opinion. During my master's I was happy to take a course on Bayesian statistics, an opportunity very rarely offered to students.
Nowadays, for a statistician who, like myself, did not have much contact with Bayesian statistics and is involved with practical problems, it seems a subject on which updating is mandatory; it is everywhere. Rather than positioning ourselves on one side of the debate, the rule now seems to be to know both sides and use whichever is most appropriate for the problem at hand.
It was with that in mind that I read this paper by Andrew Gelman and Christian Robert and the comments that follow it. What surprised me a little was that the set of papers makes it clear the debate is not settled: not only are Bayesians fighting in defense of their choice, but there are still frequentists challenging them. Not that I did not know this, but I guess we tend to base our position on the fact that Bayesian analysis is now mainstream, and use that to dismiss the still-existing debate.
Regarding these papers, I was left with the impression that the initial paper by Gelman and Robert was a little too picky about things that seem not that important, as I think the comments by Stigler and Fienberg put it, and perhaps too defensive; and that Mayo's comments raise some interesting and perhaps valid points against the Bayesian paradigm, points that seem to be forgotten amid the success of practical applications. However, I have to say I always have trouble understanding Mayo's more philosophical language. Finally, I found Johnson's comments interesting for the references therein, which I did not check but which seem of great interest for those seeking to learn Bayesian statistics, especially in comparison with frequentist statistics. In that regard I also got interested in Casella's book on inference, mentioned by Mayo. Too much interest for too little time, though...
Metropolis-Hastings and Bayesian Statistics
My statistical training focused on classical frequentist inference, but soon after I got to the real world of statistics I realized that Bayesian statistics was not only used more than I thought, but that more and more people were using it. So I set a personal goal to learn it.
I found it was not that easy. Maybe I am also not that smart. As time went by, I got the impression that Bayesian statistics is actually simple, maybe simpler than frequentist statistics, but there are some barriers for those like me who did not get the tools or the way of thinking of Bayesian statistics.
For one, we did not have strong training in simulation, programming, and computational statistics. We also did not learn much about Markov processes, especially those on continuous state spaces. So when we see something like the Metropolis-Hastings algorithm, it looks inaccessible.
Nowadays there are many sources from which to learn Bayesian statistics, but I always wanted to start by learning this MH algorithm, and many times I wasn't successful. Until I found this paper in The American Statistician. Using the paper, for the first time I was able to create my own simple Markov chain and simulate from a target distribution, and I just published the R code I used here.
The paper is not that simple, and I actually would not recommend it if you want to learn the MH algorithm and do not have a good understanding of math. The paper is somewhat old, and probably any more recently published book on Bayesian statistics will give you a gentler introduction to the MH algorithm. But the paper got me started and I learned a lot from it. The MH algorithm is so useful because of its power to simulate from intractable distributions, but when we are taking our first steps, even the code I used to simulate from a bivariate normal distribution made me proud.
Here I would like to leave the function I created; the MH algorithm is actually quite simple. I wanted to publish the whole code here instead of just linking to it above, but it is impressive how difficult that is. Even the GitHub option does not work very well (not to mention I could not color-code the code), because if you have many pieces of code, graphs, and R outputs, it becomes a lot of work. The alternative is to paste the code here, then copy and paste the outputs and insert figures, which is not really better. The knitr package offers an awesome option to create HTML with code and output, but I still have not figured out how to publish that on a blog like this.
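In the meantime, here is a minimal sketch of the kind of function I am talking about: a random-walk Metropolis-Hastings sampler for a bivariate normal target (a rough stand-in, not the exact code linked above).

# Random-walk Metropolis-Hastings for a standard bivariate normal
# target with correlation rho. An unnormalized log-density is enough.
log_target <- function(z, rho = 0.8) {
  -(z[1]^2 - 2 * rho * z[1] * z[2] + z[2]^2) / (2 * (1 - rho^2))
}

mh_sample <- function(n_iter = 10000, step = 1) {
  chain <- matrix(NA_real_, nrow = n_iter, ncol = 2)
  current <- c(0, 0)  # arbitrary starting point
  for (i in 1:n_iter) {
    proposal <- current + rnorm(2, sd = step)  # symmetric proposal
    # Accept with probability min(1, target(proposal) / target(current))
    if (log(runif(1)) < log_target(proposal) - log_target(current)) {
      current <- proposal
    }
    chain[i, ] <- current
  }
  chain
}

draws <- mh_sample()
cor(draws[-(1:1000), ])  # off-diagonal should be near rho = 0.8

Working on the log scale avoids numerical underflow, and because the proposal is symmetric the Hastings correction term drops out of the acceptance ratio.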
Sunday, 1 September 2013
Teaching Causal Inference
At JSM 2013 I participated in a round table about teaching causal inference, where I had the privilege of meeting Judea Pearl. The reason I was so interested in the subject, besides the fact that causal inference is of great interest to me, is that if I had to name the biggest gap in my statistical training, it would certainly be causal inference.
They told me that regression was a mathematical equation and that the coefficient of X means "if you increase X by one unit, this coefficient is how much Y will increase on average". Right there, in the very explanation of the coefficient, was the idea of causality, yet causality was never talked about.
And it is not only that. In the real world, everything is about causality. People may not say it, or they may not even know it (!), but what they want when they run a regression model is a measure of causal effect. Why, then, in a top statistics course do we not hear anything about the causal assumptions we are making? Or about confounders? It seems to me that causal inference should not only be talked about more; it should perhaps be the core of statistical training.
In the experimental design part of the training we are taken through causal inference in a way, but it never gets linked to regression analysis, or SEM, on observational data, which is what we have most of the time. Statisticians learn that everything they do is conditional on the validity of assumptions, yet they don't seem to want to think about, talk about, or make assumptions about causality.
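A tiny simulation of my own (made-up effect sizes, purely for illustration) shows how much the causal reading of a coefficient depends on those assumptions:

# Confounding in regression: z causes both x and y, and the true
# effect of x on y is 1. Omitting z biases the coefficient of x.
set.seed(1)
n <- 10000
z <- rnorm(n)                  # confounder
x <- z + rnorm(n)              # z causes x
y <- 1 * x + 2 * z + rnorm(n)  # true effect of x is 1
coef(lm(y ~ x))["x"]           # about 2: true effect plus confounding bias
coef(lm(y ~ x + z))["x"]       # about 1: adjusting for z recovers the effect

Both models fit the data; only the causal assumption about z tells us which coefficient deserves to be read as an effect.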
So, the round table was very interesting for what I learned, for what we discussed, and for the people I met. I have no experience teaching, so I don't know how to do this, but I am contemplating diving a little deeper into it, perhaps by searching the related literature and putting together some material that makes it easier to understand causality and how it relates to statistical models. I could use many of the ideas from our round table for this. And perhaps it could be a joint effort. I think such a thing, if it became mainstream, would be quite helpful not only for current students but maybe also for the sciences as a whole, in how it would improve the quality of the research currently done.