Saturday, 13 July 2013

Visualizing Longitudinal Data

Another paper in the most recent The American Statistician shows how to use something they call a Triangle Plot to visualize some aspects of longitudinal data. It is an interesting idea, and even more interesting because my impression is that dropouts are not analyzed as they should be.

In repeated measures experiments, data are collected at several points in time, and it always happens that some subjects are lost for different reasons. These are the dropouts. So, for example, everybody is observed at time 1, but at time 2 some folks do not appear and we therefore have missing values there. At time 3 some more folks do not show up, and so on. If we are measuring Quality of Life (borrowing the example from the paper), then it is important to understand whether the folks who drop out of the study have a different Quality of Life compared to the ones who stayed. If they do, the dropouts are said to be informative, and that information should be studied and perhaps used in the analysis in order to avoid possible biases. Basically, dropouts are missing values, and if they are informative, they are not missing at random.
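To make the bias concrete, here is a minimal simulated sketch in base R. Everything in it is made up for illustration (the sample size, the score distribution, and the dropout rule are assumptions, not taken from the paper): subjects with a lower score at time 1 are more likely to be missing at time 2, so the mean among those who stayed overstates the true time 2 mean.

```r
# Simulated data; all numbers here are assumptions for illustration only.
set.seed(1)
n <- 1000
qol_t1 <- rnorm(n, mean = 50, sd = 10)          # quality of life at time 1
qol_t2 <- qol_t1 + rnorm(n, mean = 0, sd = 5)   # true value at time 2

# Informative dropout: the lower the time-1 score, the higher the
# probability of being missing at time 2.
p_drop  <- plogis(-(qol_t1 - 50) / 5)
dropped <- runif(n) < p_drop
qol_t2_obs <- ifelse(dropped, NA, qol_t2)

mean(qol_t2)                    # mean if nobody had dropped out
mean(qol_t2_obs, na.rm = TRUE)  # mean among those who stayed
tapply(qol_t1, dropped, mean)   # time-1 scores: stayers (FALSE) vs. dropouts (TRUE)
```

The last line is the kind of comparison a Triangle Plot shows graphically: earlier measurements split by how long each group stayed in the study.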

You can check the paper for some neat graphs, but I tried to be creative and make my own Triangle Plot. In this case the data are a little different, but I think the idea of repeated measures and informative dropouts still holds, more or less.

I wanted to play around with ggplot2 rather than use the lattice code provided in the paper. Nothing against lattice, but forcing myself to use ggplot2 is a way of learning more.

I tried to think of some data I could use for a Triangle Plot. That was not very easy, but I found some data on the number of medals per country in the Paralympic Games. Let's use these data for a Triangle Plot.

# Adjust the path to wherever the CSV file is stored.
OlympicData <- read.csv("Olympicdata.csv", quote = "")
library(ggplot2)
OlympicData[1:20, ]
##                country y2012 y2008 y2004 y2000 y1996 y1992 y1988 Particip
## 1          Afghanistan     0     0     0    NA     0    NA    NA        4
## 2              Albania     0    NA    NA    NA    NA    NA    NA        1
## 3              Algeria    19    15    13     3     7     0    NA        6
## 4              Andorra     0    NA    NA    NA    NA    NA    NA        1
## 5               Angola     2     3     3     0     0    NA    NA        5
## 6  Antigua and Barbuda     0    NA    NA    NA    NA    NA    NA        1
## 7            Argentina     5     6     4     5     9     2     9        7
## 8              Armenia     0     0     0     0     0    NA    NA        5
## 9            Australia    85    79   101   149   106    76    96        7
## 10             Austria    13     6    22    15    22    22    35        7
## 11          Azerbaijan    12    10     4     1     0    NA    NA        5
## 12             Bahamas    NA    NA    NA    NA    NA    NA     0        1
## 13             Bahrain     0     0     1     2     0     1     3        7
## 14          Bangladesh    NA     0     0    NA    NA    NA    NA        2
## 15            Barbados     0     0     0     0    NA    NA    NA        4
## 16             Belarus    10    13    29    23    13    NA    NA        5
## 17             Belgium     7     1     7     9    25    17    41        7
## 18               Benin     0     0     0     0    NA    NA    NA        4
## 19             Bermuda     0     0     0     0     0    NA    NA        5
## 20  Bosnia-Herzegovina     1     1     1     1     0    NA    NA        5


Then we aggregate the data to see the average number of medals in each year, grouped by number of participations.

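# Average every column within each participation count (1 to 7).
# The country column is not numeric, so its mean comes out as NA.
x <- aggregate(x = OlympicData, by = list(OlympicData$Particip), FUN = mean, na.rm = TRUE)
x1 <- as.data.frame(x)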
x1
##   Group.1 country   y2012   y2008   y2004    y2000  y1996 y1992 y1988
## 1       1      NA  0.0000     NaN  1.0000  0.00000    NaN 13.25 83.00
## 2       2      NA  0.4167  0.2000  0.0000  0.00000  0.000 13.00  0.50
## 3       3      NA  0.1176  0.2353  0.0000  0.14286  2.000  0.00  8.50
## 4       4      NA  0.1111  0.1111  0.1765  0.07692  0.000  0.00  1.00
## 5       5      NA  8.1333  6.7333  6.2759  5.51724  2.741  0.00  0.00
## 6       6      NA  7.5000  7.0455  8.4545  8.50000 10.682 10.55  1.00
## 7       7      NA 22.5306 21.7551 24.3673 26.69388 25.735 25.02 39.55
##   Particip
## 1        1
## 2        2
## 3        3
## 4        4
## 5        5
## 6        6
## 7        7

This does not work as dropout data in a repeated measures setting, because some countries are absent in 1988 but present in 2000, for example. In a repeated measures study, when someone drops out, they do not return. So we need to exclude that part of the data. This way, the 2012 column will contain only countries with 7 data points, the 2008 column those with 6 or 7 data points, and so on. Of course, by making this exclusion the data are no longer as representative… No big deal, we are just playing…
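One way to see the problem is to flag countries whose participations have a gap in the middle. The sketch below uses a few rows copied from the OlympicData printout above; the helper name has_no_gap is made up for this example. Note that "no gap" is a weaker condition than a true dropout pattern, it only catches countries that leave and come back.

```r
# A few rows copied from the OlympicData printout.
medals <- data.frame(
  country = c("Afghanistan", "Algeria", "Argentina", "Bahamas", "Bangladesh"),
  y2012 = c(0, 19, 5, NA, NA),
  y2008 = c(0, 15, 6, NA, 0),
  y2004 = c(0, 13, 4, NA, 0),
  y2000 = c(NA, 3, 5, NA, NA),
  y1996 = c(0, 7, 9, NA, NA),
  y1992 = c(NA, 0, 2, NA, NA),
  y1988 = c(NA, NA, 9, 0, NA)
)

# Hypothetical helper: TRUE when the observed games form one
# uninterrupted run, i.e. there is no missing game between two observed ones.
has_no_gap <- function(v) {
  seen <- which(!is.na(v))
  length(seen) > 0 && all(diff(seen) == 1)
}

medals$no_gap <- apply(medals[, -1], 1, has_no_gap)
medals[, c("country", "no_gap")]
```

Afghanistan is the one flagged here: it is present in 1996, missing in 2000, and back from 2004 on, which is exactly the kind of pattern a dropout analysis does not allow.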

# Keep the seven year columns plus Particip (columns 3 to 10).
x2 <- x1[, 3:10]
library(reshape)
## Loading required package: plyr
## Attaching package: 'reshape'
## The following object is masked from 'package:plyr':
## 
## rename, round_any
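# Long format: one row per (year column, participation group); drop the Particip rows.
x3 <- subset(melt(x2), variable != "Particip")
## Using as id variables
x3$Participation <- rep(1:7, times = 7)
# Year index: 7 for y2012 down to 1 for y1988.
x3$Year <- rep(seq(from = 7, to = 1, by = -1), each = 7)
# Keep only the cells where the group has at least as many participations as the year index.
x4 <- subset(x3, Participation >= Year)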
x4
##    variable    value Participation Year
## 7     y2012 22.53061             7    7
## 13    y2008  7.04545             6    6
## 14    y2008 21.75510             7    6
## 19    y2004  6.27586             5    5
## 20    y2004  8.45455             6    5
## 21    y2004 24.36735             7    5
## 25    y2000  0.07692             4    4
## 26    y2000  5.51724             5    4
## 27    y2000  8.50000             6    4
## 28    y2000 26.69388             7    4
## 31    y1996  2.00000             3    3
## 32    y1996  0.00000             4    3
## 33    y1996  2.74074             5    3
## 34    y1996 10.68182             6    3
## 35    y1996 25.73469             7    3
## 37    y1992 13.00000             2    2
## 38    y1992  0.00000             3    2
## 39    y1992  0.00000             4    2
## 40    y1992  0.00000             5    2
## 41    y1992 10.55000             6    2
## 42    y1992 25.02041             7    2
## 43    y1988 83.00000             1    1
## 44    y1988  0.50000             2    1
## 45    y1988  8.50000             3    1
## 46    y1988  1.00000             4    1
## 47    y1988  0.00000             5    1
## 48    y1988  1.00000             6    1
## 49    y1988 39.55102             7    1
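The filter Participation >= Year is what produces the triangle shape. A minimal sketch on a bare 7 by 7 grid (no medal data, just the two index columns) shows the pattern: 7 cells for year index 1, 6 for index 2, down to a single cell for index 7, which gives the 28 rows seen in x4.

```r
# All 49 combinations of participation count and year index.
grid <- expand.grid(Participation = 1:7, Year = 1:7)

# Same filter as the one used to build x4.
triangle <- subset(grid, Participation >= Year)

nrow(triangle)         # number of cells kept
table(triangle$Year)   # cells kept per year index
```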


At this point we have structured the data so that it is suitable for ggplot2. I am not really an expert in ggplot2, so maybe this is not even the best way of doing it, but I googled around and this is what I found. So, let's do the Triangle Plot.

# One tile per (year, participation group), shaded by the average number of medals.
p <- ggplot(x4, aes(variable, Participation)) + geom_tile(aes(fill = value)) + theme_bw()
p + xlab("Year") + ylab("Participations") + scale_fill_gradient(low = "lightblue", 
    high = "darkblue", name = "# Medals") + theme(panel.grid.major.x = element_blank(), 
    panel.grid.major.y = element_blank())




That is it! The direction of the chart is the opposite of what we see in the paper, and the scales and colors could likely be improved. But that is just formatting, so I did not bother with it.
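If the goal were to have time run from oldest to newest along the x axis (an assumption on my part about what "direction" would need to change; I have not matched it against the paper's figure), one option is to reverse the factor levels of the year variable before plotting. A minimal sketch, using a small stand-in called x4_demo with a few rounded values copied from the x4 printout:

```r
library(ggplot2)

# Small stand-in for x4 (a few rows, values rounded from the printout).
x4_demo <- data.frame(
  variable = factor(c("y2012", "y2008", "y2008", "y2004", "y2004", "y2004"),
                    levels = c("y2012", "y2008", "y2004")),
  Participation = c(7, 6, 7, 5, 6, 7),
  value = c(22.53, 7.05, 21.76, 6.28, 8.45, 24.37)
)

# Reverse the level order so the oldest games are drawn first on the x axis.
x4_demo$variable <- factor(x4_demo$variable,
                           levels = rev(levels(x4_demo$variable)))

p_demo <- ggplot(x4_demo, aes(variable, Participation)) +
  geom_tile(aes(fill = value)) +
  theme_bw()
p_demo
```

ggplot2 draws a discrete axis in factor level order, so changing the levels is enough and the data values themselves stay untouched.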

We can see that there seems to be a higher average number of medals for the countries that participated in more years. The exception is the group of countries that participated only in 1988 (and therefore have only one participation), because they do not follow the pattern and have a high average number of medals. When I looked at the data, the reason was clear: the only three countries that participated in 1988 and only then were the USSR and the Federal Republic of Germany, both with a very high number of medals of course, and Bahamas, with 0 medals in that year. So these two outliers explain the high average in the bottom right square.


I just need to learn how to publish these things directly from R...

