Sunday 24 October 2010

Trends

The new edition of the Journal of Official Statistics has just been released. The first paper is by Sharon Lohr, author of the well-known sampling book Sampling: Design and Analysis, which, to tell the truth, I still have not read. The paper is not technical; rather, it surveys the programs currently offered in statistics courses in the US and discusses how statisticians should learn sampling theory.

I knew that sampling is not a highly rated course in universities in Brazil, and I did not expect it to be different in the US. I am familiar with the difficulty of finding a professor willing to advise you if you want to do a master's or PhD related to sampling theory. All of that contradicts the importance sampling has in our day-to-day work as statisticians, and I think the consequence is that many of us overlook the sampling part of whichever data we are analyzing. But that is not really the point I want to talk about here.

Back in the 80s, when I did not even know what statistics was all about, I had an excellent literature teacher who once said "The history of literature is made of trends. A period is characterized by rational literature, the next is subjective, emotional, the next is rational again and so on until the current period...". Well, I am not sure this was exactly what she said, since it was a long time ago, but Lohr's paper made me think about these alternations in the context of sampling.

Before the 1930s, sampling was done only by convenience. Then probabilistic sampling theory was introduced amid a lot of skepticism. It was really hard to believe that a random sample could give better results than a convenience sample. How would a random sample make sure that not everybody in the sample supports one candidate? Or that it has the correct age distribution? Lohr says the book Sample Survey Methods and Theory by Hansen, Hurwitz, and Madow (1953) had a great influence on the introduction of probability sampling in practical work. The book (actually two volumes) is still well regarded today and is on my wish list.
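One way to see why the worry about a lopsided random sample fades is to work out how likely such a sample is. The sketch below is only an illustration under stated assumptions: simple random sampling with replacement from a large population in which a share p supports one candidate. The function names and the values of p and n are mine, not from Lohr's paper.

```python
import math


def prob_unanimous_sample(p, n):
    """Chance that all n independent random draws back the same candidate.

    Assumes two candidates, with a share p of the population backing the
    first and 1 - p backing the second.
    """
    return p ** n + (1 - p) ** n


def standard_error(p, n):
    """Standard error of the sample proportion under simple random sampling."""
    return math.sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    p = 0.5  # assumed true support for the first candidate
    for n in (10, 100, 1000):
        print(n, prob_unanimous_sample(p, n), standard_error(p, n))
```

Under these assumptions the chance of a one-sided sample shrinks geometrically with the sample size, and the standard error shrinks with the square root of n. Nothing "makes sure" the sample is balanced; it is just very unlikely to be badly unbalanced.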

I guess we can say probability sampling dominated sampling theory and its practice until a while ago, when the internet came into play. For now the internet is not suitable for probability samples, but despite that, its ease and low cost have won out over the reliability offered by sampling theory. Internet-based sampling is definitely taking us back to convenience sampling, which does not necessarily mean that we are going backwards. The challenge, to me, is to adapt ourselves to the new reality.

We still have many statisticians who oppose non-probability samples. But in what instances can we have a truly probabilistic sample if the targets are human beings? Even official institutions, which have good budgets to spend on sampling, have trouble with high non-response rates. In the business world, true probability samples have rarely happened, and the situation might get even worse. Still, surveys are more and more popular. In the business world you simply don't have work if you refuse to work with non-probability samples. We need to understand how to get more reliable results from internet surveys, how to build models that correct possible biases, and how to interpret and release results whose precision is hard to grasp without misleading the users of the information. I think right now we are doing a very poor job in this regard, at least most of us, because the lack of theory and knowledge about non-probability sampling has led to the use of probability sampling theory for non-probability samples. The assumptions needed for this practice are rarely understood.
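To make the idea of a bias-correcting model concrete, here is a minimal sketch of one common adjustment, post-stratification, applied to a made-up internet sample. Everything in it is hypothetical: the two age groups, the population shares, the responses, and the function name. It also leans on exactly the kind of assumption that tends to go unstated, namely that within each group the people who answered resemble the people who did not.

```python
from collections import defaultdict


def poststratified_mean(sample, population_shares):
    """Reweight group means by known population shares.

    sample: list of (group, y) pairs, with y = 1 for "supports" and 0 otherwise.
    population_shares: dict mapping group -> share of the target population.
    Assumes respondents within a group are representative of that group.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, y in sample:
        totals[group] += y
        counts[group] += 1

    # A group with no respondents cannot be reweighted.
    missing = set(population_shares) - set(counts)
    if missing:
        raise ValueError(f"no respondents in groups: {sorted(missing)}")

    return sum(share * totals[g] / counts[g]
               for g, share in population_shares.items())


# Hypothetical internet sample: 80 young and 20 old respondents,
# although the (assumed) population is 30% young and 70% old.
sample = ([("young", 1)] * 48 + [("young", 0)] * 32
          + [("old", 1)] * 6 + [("old", 0)] * 14)
population_shares = {"young": 0.3, "old": 0.7}

raw_mean = sum(y for _, y in sample) / len(sample)
adjusted_mean = poststratified_mean(sample, population_shares)

if __name__ == "__main__":
    print("raw:", raw_mean, "post-stratified:", adjusted_mean)
```

In this toy data the unadjusted estimate is pulled toward the over-represented young respondents, and the reweighted one moves toward the older group. The adjustment only helps to the extent that the within-group assumption holds, and it says nothing by itself about the precision of the result.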

We are in a new age where so much information is available online at such low cost. We just can't afford to ignore it because it comes from methods not recognized by statistical theory; rather, we need to accept the challenge of making the information useful.