Sunday 21 July 2013

Multivariate Binary Variables

These days I got myself involved in a problem where power calculation had to be done for repeated measure design with binary endpoint, involving 4 time points and 3 groups. This is not the sort of thing you can do in softwares like G-Power; I like to do it through simulations using Generalized Estimating Equations in this case. Basically, you generate the data that you expect to get in the experiment, and this involves estimating somehow variation and effects without having any data (but of course, possibly having other information, like variation and effect sizes from similar studies and also the experience of the research and the clinical assessment of what would be a meaningful effect). I am not very experienced with this, you know, statisticians working in marketing research will not see this kind of thing very often. But I can say that at this point I am getting pretty comfortable with power simulations for assumed Normal outcomes.

However this one was binary. If you have 4 time points, you will want to generate 4 binary variables that have specific proportion p and correlation structure. How to you simulate multivariate binary variates? My first thought was to use the normal quantiles, for example, if the desired p is 0.5 you generate a N(0,1) and recode negative values to 0, positive to ones. If you do that with a multivariate normal you will end up with a multivariate binomial. The correlation structure will not be the same, but you can easily get close to what you want by trial and error. So I put that in the back pocket. I really wanted to generate the zero and ones in a more mathematical way.

It turns out that there are quite a few references on this, even on using Normal as I mentioned above. There are also two R packages that seems capable of generating multivariate binary variables, but I did not test them. Then I found this interesting paper. It has an algorithm to generate what I wanted, starting from the Poisson distribution. It is an interesting and not very difficult derivation for the bivariate case.  And I managed to program it on SAS, after wrestling with the use of numeric values with decimals inside macros. I forced myself to use Proc IML instead of data steps, to try and learn it.



%macro bibernouli(p1,p2,n,rho);

%let a11 = -1*%sysfunc(log(&p1));
%let a22 = -1*%sysfunc(log(&p2));
%let a12 = %sysfunc(log(1+&rho*%sysfunc(sqrt((1/&p1)*(1-&p1)*(1/&p2)*(1-&p2)))));

proc iml;

rmat = j(%eval(&n),7);
call randgen(rmat[,1],'POISSON',%sysevalf(&a11 - &a12));
call randgen(rmat[,2],'POISSON',%sysevalf(&a22 - &a12));
call randgen(rmat[,3],'POISSON',%sysevalf(&a12));
rmat[,4] = rmat[,1]+rmat[,3];
rmat[,5] = rmat[,2]+rmat[,3];
if rmat[,4] = 0 then rmat[,6] = 1; else rmat[,6]=0;
if rmat[,5] = 0 then rmat[,7] = 1; else rmat[,7]=0;

create databin from rmat[colnames = {"x1","x2","x3","y1","y2","z1","z2"}];
append from rmat;
close databin;
run;
quit;
%mend bibernouli;

%bibernouli(0.75,0.65,10,0.5)



But I needed 4 correlated variables, not only 2. It is not difficult to  generalize this process from 2 to 4 variables, with proportions matching what you want and association close to what you want. But to get association exactly you need to keep reading the paper past the bivariate case. You will find an algorithm, which I can't say is very complicated because it is not. I read it through some 10 times. But I don't know, there is a part in the last step that I was not quite getting. I tried to apply it to the bivariate case and it was not working. I am sure it was my fault, though. Then you know, time is not infinite, so I went back to my Normal simulation to solve my real world problem, but still thinking about the algorithm...

No comments: