Thursday 25 July 2013

Breakfast and Coronary Heart Disease

A few days ago I heard on CBC Radio about new research that supposedly found evidence that skipping breakfast is linked with Coronary Heart Disease. Then I found the news here. I tracked down the actual paper and it was good to learn it is available for free.

The first thing I thought when I heard it on the radio was that people who are busy at work, who work more, who are more dedicated and engaged, are likely the ones who skip breakfast, and their crazy routine may also cause CHD. The second thing I thought is something I have thought many times before: for most of human evolution we were hunter-gatherers, and it seems to me that a hunter-gatherer would not have fixed times to eat. If they happened to kill something in the afternoon, they would eat then. If they did not hunt anything, they would not eat anything. This is just my perception, but I believe it makes some sense. If so, our bodies would have evolved to be used to irregular eating patterns, maybe with long fasting periods and times of food abundance.

So, in my head the research results were not making sense in two ways. I wanted to read the paper. It turned out that I liked it; I think they did a good job exploring confounders, but I am not sure they addressed my confounder, the crazy-routine thing. They actually included two related variables: full-time employment and stress at work. But I don't think these capture what I described above. Among workers, not necessarily full-time, you have those who are more engaged. They are not necessarily stressed, because these are the folks who dedicate themselves to work as if work were their life, their pastime. I say that because I used to wake up, go straight running and then to work, with no breakfast, because I could not stand the idea of getting there late. Then I would stay late too, always involved with the work. After changing jobs I started having breakfast... Anyway, they did a good job looking at confounders and mediators, and it seems they did what they could, but one can only look at so many variables.

Upon reading the paper another thing started bothering me: skipping breakfast is, according to the paper, also associated with many other things that are not so good: obesity, smoking, lack of physical activity, full-time employment, hypertension, alcohol use. That seems to indicate that skipping breakfast is an indicator of a not-so-good lifestyle, in the sense that it is caused by, or is part of, this lifestyle rather than causing it. It seems to make more sense to think that both skipping breakfast and CHD are caused by this sort of lifestyle construct.

Finally, there was also some association between late-night eating and CHD, which was not in the media, and to me it reinforces the idea that this is not a causal relationship. We also have to consider that the sample included only physicians, although a lot of them, and that it is important to replicate these associations before trusting them too much.

I think it is complicated to prove one way or another, though, and because of that more care should be taken when reporting these things. The reports emphasize the causal link too much, and as I see it this is questionable. When this sort of thing falls into the hands of the mass media, different people will interpret it in different ways, and it can be turned into a health myth. And speaking of health myths, I have often, and for a long time, heard that eating at night leads to more weight gain because you go to sleep and don't burn what you ate. The paper also says this is a myth, or rather that some studies favour and others oppose such a claim. We need to be careful about what we believe. This is not to say I am criticizing the paper. My opinion, for what it is worth, is that this is one piece of information, other pieces will join it in the future, and the process should help us understand the association, or lack of association, between eating habits and health conditions. But care should be taken when reporting such findings outside the scientific environment, so as not to mislead.

Sunday 21 July 2013

Multivariate Binary Variables

These days I got myself involved in a problem where a power calculation had to be done for a repeated measures design with a binary endpoint, involving 4 time points and 3 groups. This is not the sort of thing you can do in software like G*Power; in this case I like to do it through simulation, using Generalized Estimating Equations. Basically, you generate the data you expect to get in the experiment, and this involves somehow estimating variation and effects without having any data (but possibly having other information, like variation and effect sizes from similar studies, plus the experience of the researcher and a clinical assessment of what would be a meaningful effect). I am not very experienced with this; statisticians working in marketing research will not see this kind of thing very often. But I can say that at this point I am getting pretty comfortable with power simulations for assumed Normal outcomes.
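
Just to make concrete what I mean by power through simulation, here is a minimal sketch for a Normal outcome and a simple two-group comparison (not the repeated measures GEE setting). All the numbers below, group size, assumed difference and assumed SD, are placeholders that in a real exercise would come from similar studies or from the clinical team.

set.seed(2013)
nsim  <- 2000                                  # number of simulated experiments
n_per <- 30                                    # subjects per group (assumed)
delta <- 5                                     # assumed meaningful difference
sdev  <- 10                                    # assumed common standard deviation
pvals <- replicate(nsim, {
  g1 <- rnorm(n_per, mean = 0,     sd = sdev)  # control group
  g2 <- rnorm(n_per, mean = delta, sd = sdev)  # treated group
  t.test(g1, g2)$p.value
})
mean(pvals < 0.05)                             # proportion of significant results = estimated power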

However, this one was binary. If you have 4 time points, you will want to generate 4 binary variables with specific proportions p and a specific correlation structure. How do you simulate multivariate binary variates? My first thought was to use Normal quantiles: for example, if the desired p is 0.5, you generate a N(0,1) variable and recode negative values to 0 and positive values to 1. If you do that with a multivariate Normal you end up with multivariate binary data. The correlation structure will not be the same as the Normal one, but you can easily get close to what you want by trial and error. So I put that in the back pocket. I really wanted to generate the zeros and ones in a more mathematical way.
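
Here is a rough sketch of that back-pocket approach, assuming for illustration 4 time points, some made-up target proportions and an exchangeable latent correlation of 0.4. The dichotomized variables hit the target proportions, but their correlation comes out weaker than the Normal one, which is why the trial and error is needed.

library(MASS)                                     # for mvrnorm
p     <- c(0.5, 0.6, 0.7, 0.4)                    # target proportions at the 4 time points (made-up)
rho   <- 0.4                                      # latent (Normal) correlation, to be tuned
Sigma <- matrix(rho, 4, 4); diag(Sigma) <- 1
z     <- mvrnorm(10000, mu = rep(0, 4), Sigma = Sigma)
y     <- sweep(z, 2, qnorm(1 - p), ">") * 1       # 1 when the Normal exceeds its cutoff
colMeans(y)                                       # close to the target proportions
cor(y)                                            # noticeably weaker than rho: tune by trial and error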

It turns out that there are quite a few references on this, including on using the Normal as I mentioned above. There are also two R packages that seem capable of generating multivariate binary variables, but I did not test them. Then I found this interesting paper. It has an algorithm to generate what I wanted, starting from the Poisson distribution. It is an interesting and not very difficult derivation for the bivariate case. And I managed to program it in SAS, after wrestling with the use of numeric values with decimals inside macros. I forced myself to use Proc IML instead of data steps, to try and learn it.



%macro bibernouli(p1,p2,n,rho);

/* Intensities of the three independent Poisson components. Inner expressions are
   wrapped in %sysevalf so that %sysfunc always receives a plain number. */
%let a11 = %sysevalf(-1*%sysfunc(log(&p1)));
%let a22 = %sysevalf(-1*%sysfunc(log(&p2)));
%let s12 = %sysfunc(sqrt(%sysevalf((1/&p1)*(1-&p1)*(1/&p2)*(1-&p2))));
%let a12 = %sysfunc(log(%sysevalf(1 + &rho*&s12)));

proc iml;

/* Three independent Poisson vectors */
x1 = j(&n,1); x2 = j(&n,1); x3 = j(&n,1);
call randgen(x1,'POISSON',%sysevalf(&a11 - &a12));
call randgen(x2,'POISSON',%sysevalf(&a22 - &a12));
call randgen(x3,'POISSON',&a12);

/* Correlated Poisson counts and the Bernoulli indicators (1 when the count is zero) */
y1 = x1 + x3;
y2 = x2 + x3;
z1 = (y1 = 0);   /* elementwise comparison, one 0/1 per row */
z2 = (y2 = 0);

rmat = x1 || x2 || x3 || y1 || y2 || z1 || z2;
create databin from rmat[colname = {"x1","x2","x3","y1","y2","z1","z2"}];
append from rmat;
close databin;
quit;
%mend bibernouli;

%bibernouli(0.75,0.65,10,0.5)



But I needed 4 correlated variables, not only 2. It is not difficult to generalize this process from 2 to 4 variables, with proportions matching what you want and association close to what you want. But to get the association exactly right you need to keep reading the paper past the bivariate case. You will find an algorithm, which I can't say is very complicated, because it is not. I read it through some 10 times. But I don't know, there is a part in the last step that I was not quite getting. I tried to apply it to the bivariate case and it was not working. I am sure it was my fault, though. Then, you know, time is not infinite, so I went back to my Normal simulation to solve my real-world problem, but still thinking about the algorithm...

Monday 15 July 2013

Standard Error or Standard Deviation?

I was surprised when a colleague asked me for an opinion on a review of a paper saying they should present the Standard Deviation instead of the Standard Error. I went to the website of Psychiatry Research, Guide for Authors, and there it was, the recommendation to use SD instead of SE. It is as dry and enigmatic as this: "Study group variability should be reported as the standard deviation, not the standard error". I am not sure what they mean.

I am not against using SD when it is appropriate, and I would not dispute the fact that many folks may not understand the difference or may even use them incorrectly. But I think a blanket request to use SD is not reasonable; a more elaborate instruction should be given, perhaps referencing one of the several papers on the difference between SD and SE. It turns out that in that specific case I totally disagreed with presenting the SD, because they were presenting tables mostly with percentages, comparing those percentages across three groups, even though the tables were a preliminary descriptive analysis meant to give the reader some background on the variables later included in the model. The SD is next to useless for percentages, really, while the SE would give the reader a quick idea of whether the group means differed and how precise they were.
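
A tiny illustration of why, with made-up numbers: for a binary variable the SD is a function of the proportion itself and adds almost nothing, while the SE tells you how precise the estimated percentage is.

p <- 0.30                      # observed proportion, say 30% (made-up)
n <- 200                       # group size (made-up)
sqrt(p * (1 - p))              # SD of the 0/1 variable: ~0.46, nothing beyond p itself
sqrt(p * (1 - p) / n)          # SE of the proportion: ~0.032, the precision of the 30%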

And this incident followed a rejection, a week earlier, by another journal of a paper I thought was very good. There were 4 justifications for the rejection, and all they did was show that whoever wrote them did not understand anything about our model. The first thing I said to my collaborator was "How can a person review something they know nothing about?" It seemed blatantly unethical to me; it seemed so unlike science.

Then you cannot say much. It is as if they are godlike beings, as if they know more than you, period. I think the system for publication and peer review needs to be different. I would not mind talking to them about our paper, maybe a phone call, maybe talking to their statistician, if they have one.

I decided to take a quick look at the Psychiatry Research papers; I wanted to see if they only publish standard deviations. I went to the first paper of the latest issue (well, the latest available to me). In Section 2.2, Subjects, they do present some SDs, and I think that is appropriate because they are describing the sample, not making inference. However, the way they present them, with that plus/minus sign, makes things confusing. It looks like a confidence interval, which is an inferential procedure and therefore makes us think they are presenting SEs (or 1.96*SE). I always found this plus/minus notation very confusing, not to say misleading or plain wrong.

Then we have Table 3, where 4 groups are compared in an inferential way; there is an ANOVA there. And they show the mean and SD? This is where I disagree. It would be important to have the SE here, maybe confidence intervals. Here we are talking about means and we want to know how precise they are, that is, the SEs.

So, I don't know, I think this instruction to use SD is already causing more harm than good, because there is no such thing as "always use this or that"; different things are appropriate for different occasions.


Saturday 13 July 2013

Visualizing Longitudinal Data

Another paper in the most recent The American Statistician shows how to use something they call the Triangle Plot to visualize some aspects of longitudinal data. It is an interesting idea, and even more interesting because my impression is that dropouts are not analyzed as they should be.

In repeated measures experiments, data are collected at several points in time, and it always happens that some subjects are lost for different reasons; these are the dropouts. So, for example, everybody is observed at time 1, but at time 2 some folks do not show up and we therefore have missing values there. At time 3 some more folks do not show up, and so on. If we are measuring Quality of Life (borrowing the example from the paper), then it is important to understand whether the folks who drop out of the study have a different Quality of Life compared to the ones who stayed. If they do, the dropouts are said to be informative, and this should be studied and perhaps used in the analysis in order to avoid possible biases. Basically, dropouts are missing data, and if they are informative, they are not missing at random.
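
A tiny simulated example of what informative dropout does (my own made-up numbers, not from the paper): if the chance of dropping out at time 2 increases as Quality of Life decreases, the observed time-2 mean is biased upward.

set.seed(42)
n     <- 500
qol1  <- rnorm(n, mean = 50, sd = 10)                         # time 1, fully observed
qol2  <- 0.8 * qol1 + rnorm(n, mean = 10, sd = 5)             # time 2, before dropout
drop2 <- rbinom(n, 1, prob = plogis(-0.15 * (qol2 - 50)))     # lower QoL -> higher dropout chance
qol2_obs <- ifelse(drop2 == 1, NA, qol2)
c(true = mean(qol2), observed = mean(qol2_obs, na.rm = TRUE)) # observed mean is too high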

You can check the paper for some neat graphs, but I tried to be creative and create my own Triangle Plot. In this case the data are a little different, but I think the idea of repeated measures and informative dropouts sort of holds.

I wanted to play around with ggplot2 rather than using the Lattice code provided in the paper. Nothing against lattice, but forcing myself to use ggplot2 is a way of learning more.

I tried to think of some sort of data I could use for a Triangle Plot. That was not very easy. But I found some data on the number of medals per country in the Paralympic Games. Let's use these data for a Triangle Plot.

OlympicData <- read.csv(".../Google Drive/Blog/Papers/Triangle Plot/OlympicData.csv", 
    quote = "")
library(ggplot2)
OlympicData[1:20, ]
##                country y2012 y2008 y2004 y2000 y1996 y1992 y1988 Particip
## 1          Afghanistan     0     0     0    NA     0    NA    NA        4
## 2              Albania     0    NA    NA    NA    NA    NA    NA        1
## 3              Algeria    19    15    13     3     7     0    NA        6
## 4              Andorra     0    NA    NA    NA    NA    NA    NA        1
## 5               Angola     2     3     3     0     0    NA    NA        5
## 6  Antigua and Barbuda     0    NA    NA    NA    NA    NA    NA        1
## 7            Argentina     5     6     4     5     9     2     9        7
## 8              Armenia     0     0     0     0     0    NA    NA        5
## 9            Australia    85    79   101   149   106    76    96        7
## 10             Austria    13     6    22    15    22    22    35        7
## 11          Azerbaijan    12    10     4     1     0    NA    NA        5
## 12             Bahamas    NA    NA    NA    NA    NA    NA     0        1
## 13             Bahrain     0     0     1     2     0     1     3        7
## 14          Bangladesh    NA     0     0    NA    NA    NA    NA        2
## 15            Barbados     0     0     0     0    NA    NA    NA        4
## 16             Belarus    10    13    29    23    13    NA    NA        5
## 17             Belgium     7     1     7     9    25    17    41        7
## 18               Benin     0     0     0     0    NA    NA    NA        4
## 19             Bermuda     0     0     0     0     0    NA    NA        5
## 20  Bosnia-Herzegovina     1     1     1     1     0    NA    NA        5


Then we aggregate the data to get the average number of medals by number of participations.

x <- aggregate(x = OlympicData, by = list(OlympicData$Particip), mean, na.rm = T)
x1 <- as.data.frame(x)
x1
##   Group.1 country   y2012   y2008   y2004    y2000  y1996 y1992 y1988
## 1       1      NA  0.0000     NaN  1.0000  0.00000    NaN 13.25 83.00
## 2       2      NA  0.4167  0.2000  0.0000  0.00000  0.000 13.00  0.50
## 3       3      NA  0.1176  0.2353  0.0000  0.14286  2.000  0.00  8.50
## 4       4      NA  0.1111  0.1111  0.1765  0.07692  0.000  0.00  1.00
## 5       5      NA  8.1333  6.7333  6.2759  5.51724  2.741  0.00  0.00
## 6       6      NA  7.5000  7.0455  8.4545  8.50000 10.682 10.55  1.00
## 7       7      NA 22.5306 21.7551 24.3673 26.69388 25.735 25.02 39.55
##   Particip
## 1        1
## 2        2
## 3        3
## 4        4
## 5        5
## 6        6
## 7        7

This does not work like dropouts in repeated measures data, because some countries are not in 1988 but are in 2000, for example; in repeated measures, when someone drops out they do not return. So we need to exclude that part of the data. This way, only countries with 7 data points appear in the 2012 column, those with 6 or 7 data points in the 2008 column, and so on. Of course, by making this exclusion we are not being as representative anymore… No big deal, we are just playing…

x2 <- x1[, 3:10]
library(reshape)
## Loading required package: plyr
## Attaching package: 'reshape'
## The following object is masked from 'package:plyr':
## 
## rename, round_any
x3 <- melt(subset(x2, select = -Particip))
## Using as id variables
x3$Participation <- rep(seq(1, 7), 7)
x3$Year <- rep(seq(from = 7, to = 1, by = -1), each = 7)
x4 <- subset(x3, Participation >= Year)
x4
##    variable    value Participation Year
## 7     y2012 22.53061             7    7
## 13    y2008  7.04545             6    6
## 14    y2008 21.75510             7    6
## 19    y2004  6.27586             5    5
## 20    y2004  8.45455             6    5
## 21    y2004 24.36735             7    5
## 25    y2000  0.07692             4    4
## 26    y2000  5.51724             5    4
## 27    y2000  8.50000             6    4
## 28    y2000 26.69388             7    4
## 31    y1996  2.00000             3    3
## 32    y1996  0.00000             4    3
## 33    y1996  2.74074             5    3
## 34    y1996 10.68182             6    3
## 35    y1996 25.73469             7    3
## 37    y1992 13.00000             2    2
## 38    y1992  0.00000             3    2
## 39    y1992  0.00000             4    2
## 40    y1992  0.00000             5    2
## 41    y1992 10.55000             6    2
## 42    y1992 25.02041             7    2
## 43    y1988 83.00000             1    1
## 44    y1988  0.50000             2    1
## 45    y1988  8.50000             3    1
## 46    y1988  1.00000             4    1
## 47    y1988  0.00000             5    1
## 48    y1988  1.00000             6    1
## 49    y1988 39.55102             7    1


At this point we have structured the data so that it is suitable for ggplot2. I am not really an expert in ggplot2, so maybe this is not even the best way of doing it, but I googled around and that is what I found. So, let's do the Triangle Plot.

p <- ggplot(x4, aes(variable, Participation, fill = value)) + geom_tile() + theme_bw()
p + xlab("Year") + ylab("Participations") + scale_fill_gradient(low = "lightblue", 
    high = "darkblue", name = "# Medals") + theme(panel.grid.major.x = element_blank(), 
    panel.grid.major.y = element_blank())




That is it! The direction of the chart is the opposite of what we see in the paper, and the scales and colors could probably be improved. But that is just formatting; I did not bother with it.

We can see that there seems to be a higher average number of medals for the countries that participated in more years. The exception is the countries that participated only in 1988 (therefore only one participation), because they do not follow the pattern and have a high average number of medals. When I looked at the data, the reason was clear: the only three countries that participated in 1988 and only then were the USSR and the Federal Republic of Germany, both with very high numbers of medals of course, and the Bahamas, with 0 medals that year. So these two outliers explain the high average in the bottom-right square.


I just need to learn how to publish these things directly from R...


Thursday 11 July 2013

Nonidentifiability and Bayes

A third paper from The American Statistician compares nonidentifiability in the context of Bayesian inference and Maximum Likelihood Inference.

More than the comparison, I found the paper does an interesting job of explaining nonidentifiability with a simple example, even without getting rid of all the mathematics.

For those folks working with Structural Equation Models, identification of the model is a routine concern. A model will not fit under Maximum Likelihood estimation if the maximum of the likelihood is not unique. One cannot find unique values of X and Y knowing only that X + Y = 10, to give a commonplace example. In SEM things are much more complicated because no easy rule exists to prove identification, and your only clue is the software yelling at you that it cannot fit the model.
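
A toy example of my own (not the one in the paper) makes the point: if the data only inform us about the sum mu1 + mu2, the likelihood is flat along that ridge and has no unique maximum.

set.seed(1)
y <- rnorm(100, mean = 5, sd = 1)            # the data only "see" mu1 + mu2 = 5
loglik <- function(mu1, mu2) sum(dnorm(y, mean = mu1 + mu2, sd = 1, log = TRUE))
loglik(2, 3); loglik(0, 5); loglik(4, 1)     # identical values: no unique ML solution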

Under Bayesian statistics one does not maximize things, but rather calculates the posterior distribution and uses it for intervals, medians, means. I am not an expert on this, but after reading the paper I would think nonidentifiability would also be a problem for the Bayesians if they (we?) were often interested in the mode of the posterior.

So it is worth reading; it was for me. On the curiosity side, the paper is co-written by some folks from the University of Sao Paulo, where I had the privilege of studying a while ago. Maybe a little more than a while ago. The example in the paper is a modification of one found in a book by Sheldon Ross, a book I used in my first probability course and which caused a lot of sweating: we had to solve most of the exercises in the book, and there are so many...


Wednesday 3 July 2013

Confidence Intervals through simulation

A paper that appeared in the latest issue of The American Statistician discusses the construction of confidence intervals for functions of parameters using simulation. Confidence intervals are becoming more important in the sense that they are seen as giving more information about the parameter than hypothesis tests, and I tend to agree with this trend. I think presenting confidence intervals, especially in graphs, is usually a very nice way of showing inferential results, sometimes allowing us to disregard p-values altogether.

So, it often happens that one needs to estimate these intervals for functions of parameters. For example, in a recent study we needed to estimate the effect in percentage terms in a Regression Discontinuity model. Basically, we wanted a confidence interval for the ratio of two regression coefficients.

The first thing that comes to my mind is the Delta Method, where you use derivatives of the function of the parameter to find a confidence interval for the function. It is fairly easy to calculate if the derivatives can be obtained analytically. For example, in the univariate case, the variance of g(theta_hat) is approximately g'(theta)^2 * var(theta_hat). Of course this is just an approximation and it depends on some assumptions, but it usually works quite well.
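
A quick univariate illustration with made-up numbers, taking g(theta) = exp(theta):

theta_hat <- 0.4                              # estimate (made-up)
var_theta <- 0.01                             # its estimated variance (made-up)
var_g     <- exp(theta_hat)^2 * var_theta     # [g'(theta)]^2 * var(theta_hat)
exp(theta_hat) + c(-1.96, 1.96) * sqrt(var_g) # approximate 95% CI for exp(theta)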

The second thing that comes to mind is the bootstrap, which is usually fine, but it comes at the price of some time spent on programming.

The paper talks about a third option, which I hadn't heard about. The basic idea is that if we want a confidence interval for g(theta), we can estimate theta (say, theta_hat) from our sample along with its variance, and, assuming its distribution converges to a Normal, we can generate Normally distributed noise with mean zero and variance var(theta_hat). We generate many of these noise values, say n1, n2, n3, ..., nk, where k may be 10000. Then for each simulated noise ni we calculate g(theta_hat + ni), and the distribution of these values gives us a confidence interval for g(theta) that is usually less expensive computationally than the bootstrap.
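
Here is a hedged sketch of how I understand the idea, applied to the ratio of two regression coefficients; the data and the model are made up, and drawing directly from a Normal centered at the estimates is the same as adding mean-zero Normal noise to them.

library(MASS)                                           # for mvrnorm
set.seed(123)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)                  # made-up data
fit <- lm(y ~ x1 + x2)

k     <- 10000
sims  <- mvrnorm(k, mu = coef(fit), Sigma = vcov(fit))  # theta_hat plus Normal noise
ratio <- sims[, "x2"] / sims[, "x1"]                    # g(theta) = beta_x2 / beta_x1
quantile(ratio, c(0.025, 0.975))                        # approximate 95% CI for the ratio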

The paper covers the theory and assumptions of the method more thoroughly, as well as cases where it may not work well, so it is worth reading the text itself, as my description is quite condensed...

Tuesday 2 July 2013

Exterior Match

I find it useful to write about texts that I read, since that usually helps me understand and record the content in my brain (which is sort of a difficult thing, by the way).

This time I found an interesting paper in The American Statistician which talks about exterior matching. I was familiar with the concept of matching (although I never really used it) but not with exterior matching.

It is common in causal inference, especially with observational data, to match subjects with others that are similar on some important variables before comparing groups. For example, the paper talks about survival time for black women with breast cancer being lower than for white women. That on its own is quite concerning, but it is important to understand why it happens.

In causal inference we often have a research question like "does treatment X improve survival time for breast cancer?", and if all we have is observational data we would like to match or control for things like race in order to make sure the apparent effect of treatment X is not actually confounding.

But in this case we have these two races and we are more interested in why the survival times are different, so that we can possibly do something about it. It could be because black women come to the hospital at more advanced stages, or because black women are older, or because they receive different treatment, or who knows, it could be many things.

The idea is that we match women according to age. Every black woman is matched with a white woman who had breast cancer and is the same age. Then we look at survival time. We then further match each black woman with a white woman, but this time on both age and stage of cancer.

The survival times may become more similar after matching on age because now we are not just comparing black with white women, but also making sure they have the same age. So this is the first comparison, matched on age only. In the second comparison we are further making sure they have the same initial stage of cancer. Therefore it is natural and expected, although not necessary, that the survival times become closer, to the extent that age differs between groups and age influences survival time (talking about the first comparison).

By testing whether there is a significant difference in survival times between the first and second comparisons, we can gather evidence on whether initial stage of cancer is a factor, after controlling for age. This could be thought of as a simple comparison of the survival times between the white women in the first match and the white women in the second match.

The issue is that these two groups of white women often overlap, that is, the same white woman is in both matches. Therefore the groups are not independent and cannot be compared with the usual statistical tests, which assume independence.

The paper is about solving this difficulty by excluding the overlaps and working only with the white women that appear in just one match, pairing them in the best possible way so as to end up with three groups (black, white from match 1 and white from match 2) that are independent.

I think it is an interesting paper, but it seems to me that, especially if the overlap is large, we will end up with concerns about whether the final match is good and efficient. But there is no free lunch; we will always have to live with this sort of assumption.