Saturday 6 November 2010

Data visualization

We are always challenged to extract useful information from data. This also involves presenting the information in a way that people can understand. There is no doubt that the easiest way to make sense of numbers is to see them in a chart. Technology evolves every day, and with it new and creative data visualization tools come up.

This site links to a video that caught my attention as a good resource on data visualization. If you watch the videos you will see many creative ways of presenting information and hear about the trends we are facing in data visualization.

I am not by any means an expert on this subject, but I cannot ignore it because to me this is what the future is going to be about. But with the ease of creating graphs and visualizing data comes a concern: the attraction of the visual tools should not override the quality of the information. We have seen many instances where information is unclear, if not misleading, and I think there should be more criteria for how information is shown to the general public, who as a rule have no training in interpreting data.

This is an example of data visualization mentioned in the video above that can be very attractive at first sight but makes the information harder to interpret. The video also talks about BBC News and its approach toward more modern data visualization. Here you have an example of the graphs on their website. I was exploring it. If you click the tabs "In Graphics" and then "Weather", a pie chart appears showing that most of the deaths on the roads happen when the weather is "Fine". What does that mean? It can mean different things or nothing. I have an idea of Britain as a country with grey (overcast) skies, perhaps because I have heard many times about its not so nice weather. If the weather is not usually fine and still the majority of deaths happen in fine weather, then fine weather is associated with deaths on the roads. Maybe people speed more because the weather is good. But maybe my impression is not correct and the weather is fine most of the time; if so, I would expect most of the deaths to happen in fine weather even if there is no association between deaths and weather. So, without knowing the distribution of the weather, it is impossible to extract useful information from this graph. We know that most deaths happen in fine weather, but we don't know whether or not there is an association between deaths and weather. It becomes a very superficial analysis.

We can understand this point easily if we look at the "Sex" tab. Notice that the majority of those who died are male. We know that the distribution of the sexes in the population should be close to 1:1, so we can say that males are much more at risk than females or, in other words, that there is an association between deaths and gender. We can then start thinking about why this happens (is it because males are less careful when driving?), and here we enter the difficult field of causal analysis in statistics. In the "Weather" tab we cannot get to this point because we don't even know if there is an association.
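To make the base-rate argument concrete, here is a minimal Python sketch. All the numbers in it are hypothetical and chosen only for illustration; they are not taken from the BBC graphics, and the function name `representation_ratio` is my own. It divides each group's share of deaths by that group's share of the baseline (for example, the share of days with that weather, which is itself only a crude stand-in for how much driving happens in it). A ratio near 1 suggests no association; a ratio well above 1 suggests the group is over-represented among the deaths.

```python
def representation_ratio(deaths_by_group, baseline_by_group):
    """For each group, return (share of deaths) / (share of baseline).

    A value close to 1 means the group appears among the deaths about as
    often as it appears in the baseline, i.e. no sign of an association.
    """
    total_deaths = sum(deaths_by_group.values())
    total_baseline = sum(baseline_by_group.values())
    ratios = {}
    for group, deaths in deaths_by_group.items():
        death_share = deaths / total_deaths
        baseline_share = baseline_by_group[group] / total_baseline
        ratios[group] = death_share / baseline_share
    return ratios


# HYPOTHETICAL numbers, for illustration only (not BBC data).
deaths = {"fine": 80, "not fine": 20}

# Scenario A: the weather is fine 80% of the time.
mostly_fine_days = {"fine": 80, "not fine": 20}
# Scenario B: the weather is fine only 40% of the time.
mostly_grey_days = {"fine": 40, "not fine": 60}

ratios_if_mostly_fine = representation_ratio(deaths, mostly_fine_days)
ratios_if_mostly_grey = representation_ratio(deaths, mostly_grey_days)

print("Mostly fine weather:", ratios_if_mostly_fine)
print("Mostly grey weather:", ratios_if_mostly_grey)
```

The same pie chart (80% of deaths in fine weather) gives a ratio of about 1 in the first scenario and about 2 in the second, which is why the chart alone cannot tell us whether an association exists. For the "Sex" tab the baseline is roughly known (close to 1:1), so the comparison can actually be made.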

I am OK with data visualization and new advanced tools for it, but at the same time it concerns me that this technology becomes available to, and easily used by, anyone at will. I do not favor strict control over the use of such tools (for example, that only statisticians can make advanced graphics or publish them), but I am starting to think that some kind of control should be in place for information released to the public, as there is for food, for example. Of course, in the age of the internet this seems utopian, but it is nonetheless a concern.

It is amazing how much data is already available online, meaning that one does not need to have data of one's own to play with data visualization tools. I want to finish this post with a link to Google public data, which makes available both data and tools to chart them. You can play a lot there. I will close with a multiple time series chart I created, comparing countries according to their CO2 emissions. I have always been impressed by how much more environmentally concerned Canadians are than Brazilians. But at the same time they drive bigger cars and live in bigger houses. So I was curious to see which country "damages" the environment more through CO2 emissions. It is impressive how Canada is ahead of Brazil despite the Brazilian population being close to 6 times larger. It is easy to fight for the environment if you are not starving, have a big car in the garage and a house 5 times bigger than what you need...