Sunday, 25 July 2010

Survey Subject and Non Response Error

Statisticians are usually skeptical about web surveys, the most common type of survey nowadays in North America. That is because there is virtually no sampling theory that can accommodate the "sampling design" of a web survey without making strong assumptions that are well known not to hold in many situations. But in a capitalist world cost drives everything, including the way we draw our samples. If you think about the huge savings a web survey can provide, you will agree that the "flawed" sample can be justified - it is just a trade off between cost and precision. Hopefully those who are buying the survey are aware of that, and this is where the problem lies, as I see it.

The trade off can be very advantageous indeed, because it has been shown that in some cases a web survey can reasonably approximate the results of a probabilistic sample. Experience is crucial for understanding when we should be concerned about web survey results and how to bring them more in line with probability samples. This paper, for example, shows that results from attitudinal questions in a web panel sample are quite close to those of a probability survey. But there is no doubt that we cannot use the mathematical theory that allows us to pinpoint the precision of the study.

The challenges do not stop there, though. In an effort to reduce the burden of the survey on the respondent, researchers are disclosing the subject of the survey before it begins so that the respondent can decide whether or not to participate. The expected length of the survey is another piece of information often released before the survey starts. Now a respondent who does not like the survey subject, or does not feel like answering questions about it, can just send the survey invitation to the trash can. But what are the possible consequences of this?

Non response biases can be huge. Take, for example, a survey about groceries. Pretty much everybody buys groceries and would be eligible to participate in such a survey, but possibly not everybody likes doing it. We can argue that, if the subject is disclosed, someone who likes to shop for groceries is much more likely to participate than someone who thinks it is just a chore. If the incidence of "grocery shopping lovers" in the population is, say, 20%, it will likely be much higher in the sample, without any possibility of correcting it through weighting. Shopping habits and attitudes will likely be associated, to some extent, with how much people like to shop, and this will likely cause a bias in the survey.
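To put a number on the grocery example, here is a minimal Python sketch of how differential response inflates a group's share among respondents. The 20% incidence comes from the text; the response probabilities (30% for shopping lovers, 10% for everybody else) and the function name are assumptions made up for illustration only.

```python
def share_in_sample(pop_share, p_respond_group, p_respond_rest):
    """Expected share of a group among respondents, given its population
    share and the response probabilities of the group and of everyone else."""
    responders_group = pop_share * p_respond_group
    responders_rest = (1.0 - pop_share) * p_respond_rest
    return responders_group / (responders_group + responders_rest)

# 20% incidence is from the post; 0.30 and 0.10 are hypothetical response rates.
lovers_share = share_in_sample(0.20, 0.30, 0.10)
print(round(lovers_share, 3))
```

Under these assumed rates the "lovers" make up about 43% of the respondents instead of 20%. Since liking to shop is not something we have population figures for, there is nothing to weight back to.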

Disclosing the subject in a web survey is almost certainly the same as biasing the results in a way that cannot be corrected or quantified. Yet again the technical aspect of sampling is put aside without fully understanding its consequences. Yet again the accuracy of estimates comes second to business priorities, and yet again we face the challenge of making useful the results of surveys that cross unknown boundaries...

Monday, 19 July 2010

Census - The polemic of the long form

Next year is Census year in Canada. The Canadian Census is conducted every 5 years and used to consist of a long questionnaire and a short questionnaire. Both were mailed to respondents, who had to respond since it was mandatory. While the short form was sent to every household, the long form was sent to only one fifth of the households, so it was a Survey rather than a Census.

Now they have decided that the long questionnaire will be voluntary and will be sent to one third of the households. The main argument is that there are people who consider the long form particularly intrusive, with questions that are too personal. It had 40 pages in 2006. But this argument did not seem to be enough and the subject became a huge polemic. Concerns about the accuracy of results and the continuity of historical trends are among the main points raised by those who are against a voluntary long form.

A good statistical point against the voluntary survey is that it is not only supposed to be a very accurate picture of Canadians but is also a very important base for many other surveys, official or not. Until the results of the next Census in 2016 come along, Statistics Canada itself will be relying on the 2011 long form results as a baseline for designing and weighting its other surveys.

I have to say that I am not sure what precisely "voluntary" means in this case, in particular how much effort will be made to get people to answer the long form. What would the expected response rate be? It seems that Statistics Canada can get pretty good response rates even with voluntary surveys, and if that is so maybe there is no reason for concern. We have to remember, though, that they are increasing the share of questionnaires sent from 1/5 to 1/3, and that seems to mean they are expecting a sizable non response rate. Even with all the weighting developments we have nowadays, I would think they should run a test or try to estimate the non response rate somehow, and the decision as to whether or not to go mandatory would depend on that. I think this particular Survey cannot afford to have a questionable methodology.

Back when I was in graduate school, I had a professor who said a Survey would be as accurate as a Census and cost much less. That was back in Brazil, where Censuses are not mandatory. It seems to me that it could be a Survey as long as it was mandatory or had a very high response rate. One of the arguments against the Census is that it is actually not precise at all, with lots of people not answering it or not telling the truth. I think the biases from this kind of thing would be really minor, much lower than the possible biases in a survey with a low response rate. So I think things are not as simple as replacing censuses with surveys; there are some serious concerns involved. Not to mention that making people answer a survey could in fact increase non response error, since people who don't want to respond could "lie" if forced to respond.

Back to the Canadian polemic, it seems possible that there are political reasons for the changes (rather than only technical ones), which is of concern. Anyway, let's hope scientific concern drives the choice of the method. I am sure there are very good statisticians behind this and I hope they can be heard.

Friday, 16 July 2010

I give up

I could not keep reading the book Statistics Applied to Clinical Trials. I had commented here that I hadn't liked the book since its first chapter, but from chapter 8 to chapter 13 the book becomes a joke. I could not read any more.

Even if the Statistical Concepts were explained correctly, the book is written in a very confusing and repetitive way. Chapters 8 and 9 talk about the same thing, with Chapter 9 repeating lots of things from Chapter 8. The same thing happens again in Chapters 10, 11 and 12.

The authors ignore important concepts in Statistics. For example, they say "The p-value tells us the chance of making a type I error of finding a difference where there is none." The p-value is in fact the probability of finding a difference as big as the one found, or bigger, given H0. P-value = P(Data | H0 true), not P(H0 true | Data). They talk a little bit about not using standards like 5% for significance tests, but they go on and on always using the 5% rule. They say that a p-value that is too high is proof that something is wrong with the study, but in Hypothesis testing we need to define our hypotheses and how to test them, and calculate power and everything else, a priori. If you do not plan for rejection on high p-values before the study, it is not valid to look at the p-value after the fact and say "Well, I was expecting the p-value to be lower than 5% but it turned out to be 99%, so this is too high and I will reject H0 because the chance of a p-value higher than 98% given H0 is too small". Perhaps a high p-value could be used as part of a descriptive analysis, a red flag saying that maybe we should check whether there is some obvious reason for getting results so in line with H0.
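As a small illustration of the definition above, this Python sketch computes a two-sided p-value for a z statistic, that is, the probability of a difference at least as large as the observed one given H0. The z values are hypothetical and the normal approximation is an assumption; nothing here comes from the book.

```python
import math

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z: the probability, given H0,
    of a difference at least as extreme as the one observed."""
    return math.erfc(abs(z) / math.sqrt(2.0))

p_large_diff = two_sided_p_value(1.96)    # the familiar 5% threshold
p_tiny_diff = two_sided_p_value(0.0125)   # data almost perfectly in line with H0
print(round(p_large_diff, 4), round(p_tiny_diff, 4))
```

For a continuous test statistic the p-value is uniformly distributed under H0, so a p-value above 98% is exactly as rare as one below 2%. That makes it a possible red flag, but only a rule fixed before seeing the data turns it into a test.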

They say that Correlation and Regression do not show causality but simple association (they do not use the word association, but correlation). They go on to say that Hypothesis testing in Clinical Trials is the real way to detect causality. But what is the difference between ANOVA in a Clinical Trial and Regression? This is simply a huge confusion between observational data and experimental data: they seem to think Regression is used only with observational data, apparently not knowing that Regression and ANOVA are the same thing. Ok, maybe they don't think Regression is only for observational data, since they go on to use Regression in a cross-over trial. There the dependent variable is the first measure and the independent variable is the second measure, which shows that they do not have a sense of what a dependent variable and an independent variable are. If you want the Regression to make sense here, the independent variable has to be the treatment, and it has to be a Hierarchical Regression with within subject effects. That is, the Regressions they use in their examples are meaningless. Not to mention that they do not interpret the betas.
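The point that Regression and ANOVA are the same thing can be seen in the simplest case: regressing the outcome on a 0/1 treatment indicator gives a slope equal to the difference between the two group means. The sketch below uses made-up outcome values for a simple two-group comparison (not the cross-over design discussed above, which would need within subject effects), and the function name is hypothetical.

```python
from statistics import mean

def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical outcomes for two treatment arms.
placebo = [5.1, 4.8, 5.6, 5.0]
treated = [6.2, 5.9, 6.8, 6.3]

x = [0] * len(placebo) + [1] * len(treated)   # treatment indicator
y = placebo + treated
slope, intercept = ols_slope_intercept(x, y)

print(slope, mean(treated) - mean(placebo))   # same quantity, up to rounding
print(intercept, mean(placebo))               # same quantity, up to rounding
```

Here the beta has a direct interpretation as the treatment effect, which is what makes the treatment the natural independent variable.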

Finally, I want to say that the book is confusing. The reader does not learn how to do a statistical test or a regression because the formulas appear out of nowhere. I had trouble understanding how they use the formulas, so imagine someone who is not a statistician.

Frustrated with the dissemination of bad statistical practice and wrong concepts, I wrote a negative review of the book on the Amazon website. Hopefully others will choose better than I did.

Saturday, 10 July 2010

Interim Analysis

In the world of Clinical Trials there is a type of Statistical Analysis called Interim Analysis. It is concerned with analyzing the results of the trial before its end in order to ensure that the protocols of the study are being followed and to see whether the results at a given point in time already favor or oppose the study hypothesis so strongly that the experiment could stop and save money, perhaps saving lives too. For example, if we have strong evidence that treatment A is better than treatment B, then there is no point in continuing the trial, which gives half of the participants the less effective treatment B.

An Interim Analysis differs a little from the usual statistical analysis of clinical trials, even though it is a type of statistical analysis. Interim Analysis is not as complete as the statistical analyses performed at the end of the trial and it usually does not aim to understand many outcome variables. One of the reasons is that it is not supposed to be too expensive, and another is that it is not supposed to influence the trial. Interim Analysis has to be done preserving the blindness of the trial, and therefore in an almost confidential way. Trials with a great amount of analysis before their completion may be questioned as to whether the external procedures of analyzing the data influenced the trial itself.

As such, an Interim Analysis often focuses on a few main outcome variables at a limited number of points in time. It aims to avoid multiple comparison problems, and its protocol must be very well defined in advance. Strict rules of analysis and stopping rules (i.e. if this comparison is found significant the trial must stop) are defined beforehand and followed.

One of the most used types of Interim Analysis is the Group Sequential Analysis. The number of analyses to be performed during the trial, and the sample size per treatment at the point when each analysis is done, are defined a priori. Based on this, stopping rules can be defined in terms of the level of significance to be attained at each and every analysis.
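A stopping rule of this kind can be sketched in Python as below. This is only an illustration under stated assumptions: it splits the overall significance level equally across the planned looks (a plain Bonferroni split, which is conservative and is not one of the boundaries usually used in practice), and the interim p-values and function name are hypothetical.

```python
def first_stopping_look(p_values, overall_alpha=0.05):
    """Return the 1-based look at which the trial would stop, or None.

    p_values holds the p-value observed at each of the K planned looks;
    each look is tested at overall_alpha / K (Bonferroni split).
    """
    per_look_alpha = overall_alpha / len(p_values)
    for look, p in enumerate(p_values, start=1):
        if p < per_look_alpha:
            return look
    return None

# Hypothetical p-values at 4 planned analyses (threshold 0.05 / 4 = 0.0125).
interim_p_values = [0.20, 0.03, 0.008, 0.001]
stop_at = first_stopping_look(interim_p_values)
print(stop_at)
```

In a real trial the looks after the stopping point would never be observed; the full list is passed here only to keep the example short. Note that the second look (p = 0.03) would count as significant at the usual 5% but does not trigger a stop under the pre-defined rule.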

Another type is called Continuous Sequential Statistical Techniques, which, as I see it, does not go without some criticism. Here, on the arrival of each new result, the entire set of data is reanalyzed and the differences reassessed. Cut offs for the lower and upper bounds of the difference are established which, when reached, cause the trial to stop because a conclusion is already clear - the treatment is either significantly better or significantly worse than the other. This seems to be a good book on this subject.

Sunday, 4 July 2010

Equivalence Tests

Reading the book Statistics Applied to Clinical Trials I learnt about Equivalence tests. This is very interesting indeed. Situations exist where you want to test whether a treatment has the same effect as a placebo or as an existing treatment. This can be readily extended to situations I face in my daily work in Marketing Research.

For example, suppose the client wishes to change the methodology of their survey from telephone to online interviews. This will make things much cheaper, but will it make results from the new methodology incomparable with results from the old methodology? To test that we can simply conduct surveys with both methodologies at the same time. But then, how do you compare them so as to say their results "are the same"?

I soon realized that the usual statistical test of hypothesis is not suitable for this kind of situation. So I quickly thought of something. My approach was to do the usual statistical test, but lowering the confidence level to, say, 70%. I want my test to have more power, so that when I stay with the null hypothesis I am more confident it is true. But to tell the truth I have never even used this, because of the tremendous difficulty of explaining to people why the usual test is not correct and why this would be a better approach. The problem may be more that 70% seems not as good as 95%, at least in the eyes of non statisticians. Then I came up with a second approach, which I have used a few times. Ok, we will do the usual stat test, but we will do it several times. We compare the surveys for 3 or 5 months and this will tell us whether there is a methodology effect, even if it is not statistically significant. The approach seemed to work: I could see in one case that the new method produced slightly lower averages, although they were not statistically significant. Now we have a better idea of how much interview methodology affects survey results and we can even make the decision without relying on statistical tests.

Comparison of interview methodologies is just one thing that requires equivalence testing. Many others exist, especially in product testing, when the company wants to launch a new product that is cheaper to produce and we need to test whether the perception of the consumer will be the same.

The approach for Clinical Trials, as shown in this book, is different and easier in a way. You first define what a significant difference would be from a practical (not statistical) point of view. One might say, for example, that if differences are lower than 3% the results can be considered equal. Now we can construct a confidence interval for the difference of the means, and if this interval lies within the range [-3%;3%] we can consider things equivalent. I have to say, though, that this tolerance range is not easy to define in some cases, to the point that I am not sure this could be applied to my type of problems. But I might try.
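The confidence interval approach described above can be sketched in Python as follows. This is a hedged illustration, not the book's procedure: it assumes two independent simple random samples, compares two proportions (e.g. a phone and an online survey) with a normal approximation, and uses a 90% interval by default, which corresponds to two one-sided tests at 5% each. The sample sizes, proportions and function name are made up.

```python
import math
from statistics import NormalDist

def equivalence_by_ci(p1, n1, p2, n2, tolerance=0.03, confidence=0.90):
    """Confidence interval for p1 - p2 and whether it lies inside
    (-tolerance, +tolerance). Normal approximation, independent samples."""
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    lower, upper = diff - z * se, diff + z * se
    equivalent = (-tolerance < lower) and (upper < tolerance)
    return lower, upper, equivalent

# Hypothetical results: 50% by phone versus 51% online.
small = equivalence_by_ci(0.50, 2000, 0.51, 2000)
large = equivalence_by_ci(0.50, 5000, 0.51, 5000)
print(small)
print(large)
```

With 2,000 interviews per method the interval is too wide to fit inside [-3%;3%], so equivalence cannot be claimed even though the observed difference is only 1 point; with 5,000 per method it can. Unlike the usual test, a small sample here works against declaring the methods "the same".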

I searched the internet for equivalence testing and found that there are other ways of doing the test. The book just explained the simplest way of doing it (as I would expect - read the criticisms I made of the book) and perhaps the oldest way. But the good thing is that it showed me the issue and I can look for it elsewhere...

Saturday, 3 July 2010

How to do it

Following up on the post below, this is how variance estimation under a Simple Random Sample design can be performed. Here we use Taylor Linearization, but other methods are also available.

In R, load the package "survey":

library(survey)
# mydata is a data frame with the columns idnumber, wgtem, varx and agex
mydesign <- svydesign(ids = ~idnumber, weights = ~wgtem, data = mydata)
svymean(~varx, design = mydesign)  # mean and standard error for varx
svyby(~varx, ~agex, mydesign, svymean)  # now by age range

Using Stata:

svyset idnumber [pweight=wgtem], vce(linearized)
svy, vce(linearized): mean varx, over(agex)

agex holds the age ranges, varx is the variable for which we want to calculate the mean, and wgtem is the weighting variable. I have found it interesting to look at the Design Effect as a measure of weight efficiency. If the Design Effect is too high, then maybe we should consider collapsing levels of variables to get less extreme weighting factors. If you want to calculate the Design Effect, make sure your weights add up to the total target population.
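As a quick check on weight efficiency that needs nothing but the weights, one can use Kish's approximation to the design effect due to unequal weighting, n * sum(w^2) / (sum(w))^2. This is a different quantity from the variable-specific Design Effect reported by the survey software above, and it does not depend on the scale of the weights. The Python sketch below uses made-up weights.

```python
def kish_design_effect(weights):
    """Kish's approximate design effect from unequal weights:
    n * sum(w^2) / (sum(w))^2. Equals 1 when all weights are equal."""
    n = len(weights)
    total = sum(weights)
    return n * sum(w * w for w in weights) / (total * total)

# Hypothetical weighting factors for six respondents.
weights = [0.5, 0.5, 1.0, 1.0, 2.0, 3.0]
deff = kish_design_effect(weights)
effective_n = len(weights) / deff
print(deff, effective_n)
```

Here the six interviews are worth roughly four unweighted ones. Collapsing levels to get less extreme weights brings this ratio back towards 1.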

Weights in Non Probabilistic Surveys

The fact that we really do not have a standard procedure to deal with the precision of estimates in Non Probabilistic Samples has bothered me for some time. In the business world there is not a single survey that is truly probabilistic, and few of them are close enough that we feel comfortable making the assumption of a probabilistic sample.

Samples are often weighted to some standard demographic variables, like age, gender and region. And this brings another layer of difficulty, since now precision measures are not only calculated on a Non Probabilistic sample, but are also calculated without accounting for the weights.

Recognizing that Probabilistic Samples are not the reality of our world, we need to be pragmatic and use our knowledge of Sampling Theory to get the most we can from whichever sample we manage to get. If we are comfortable enough to assume that the sample is not strongly biased, then I believe it makes sense to think about variability and margin of error, even if they have to be flagged at the bottom of the page with an asterisk that states our assumptions. But, assuming it is fair to calculate variability, how do we proceed with the variance calculation in the presence of weights?

I have not found much literature on this, likely because Non Probabilistic Samples are mostly dismissed as the practice of people who do not know Sampling Theory (I might comment more on that in a future post). But I do know that the only weights applied to these surveys are post stratification weights, where some demographics (and rarely other things) are corrected to mirror official figures. When weights are based only on Age and Gender, for example, it makes sense to me to analyze the results as a Stratified Random Sample where Age and Gender are interlocked strata. Often, though, they are not weighted jointly, using a two way table, but through their marginal distributions. Still, I think it might make sense to define each different weight factor as a different stratum when calculating variability.
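For the interlocked case, the post stratification weight of a cell is just its population share divided by its sample share. The Python sketch below illustrates this with made-up Age by Gender cells, counts and population shares; the function name is hypothetical too.

```python
def post_stratification_weights(sample_counts, population_shares):
    """Weight per cell = population share / sample share of that cell."""
    n = sum(sample_counts.values())
    return {
        cell: population_shares[cell] / (count / n)
        for cell, count in sample_counts.items()
    }

# Hypothetical interlocked Age x Gender cells.
sample_counts = {
    ("18-44", "F"): 100, ("18-44", "M"): 150,
    ("45+", "F"): 300, ("45+", "M"): 450,
}
population_shares = {
    ("18-44", "F"): 0.15, ("18-44", "M"): 0.15,
    ("45+", "F"): 0.35, ("45+", "M"): 0.35,
}
cell_weights = post_stratification_weights(sample_counts, population_shares)
for cell, weight in cell_weights.items():
    print(cell, round(weight, 3))
```

Each distinct weight here identifies one cell, which is why treating every weight factor as a stratum is natural in this setting. Weights scaled this way add up to the sample size; multiply them by the population total divided by the sample size if they need to add up to the target population.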

More problems arise when we weight the marginals of many variables. Defining different weights as different strata is not doable anymore, since we can have hundreds of strata. In this case I have done some tests using SPSS Complex Samples and Stata, declaring the sample as a Simple Random Sample, as if the weights were simply the inverse of the probability of selection. Variances calculated in this way are for sure much more realistic than the ones SPSS or Stata calculate with what they call frequency weights. Defining post stratification weights as frequency weights in the variance calculation is everything we need to get the wrong figures and fool ourselves.
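The difference between the two treatments of the weights can be sketched in Python. This is a simplified illustration, not what SPSS or Stata do internally: the first function uses the usual linearization formula for a weighted mean under a single stage, with replacement design without strata, and the second treats each weight as a count of identical cases. The data and weights are made up.

```python
import math

def weighted_mean(y, w):
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def se_sampling_weights(y, w):
    """Linearized standard error of the weighted mean, treating the weights
    as inverse selection probabilities (no strata, with replacement)."""
    n, total, m = len(y), sum(w), weighted_mean(y, w)
    ss = sum((wi * (yi - m) / total) ** 2 for wi, yi in zip(w, y))
    return math.sqrt(n / (n - 1) * ss)

def se_frequency_weights(y, w):
    """Standard error when each weight is read as a number of identical cases,
    so the apparent sample size is sum(w)."""
    total, m = sum(w), weighted_mean(y, w)
    s2 = sum(wi * (yi - m) ** 2 for wi, yi in zip(w, y)) / (total - 1)
    return math.sqrt(s2 / total)

# Hypothetical data; the second set of weights adds up to a "population" of 8,000.
y = [3.0, 5.0, 4.0, 8.0, 6.0, 7.0]
w = [0.5, 0.5, 1.0, 1.0, 2.0, 3.0]
w_pop = [1000 * wi for wi in w]

print(se_sampling_weights(y, w), se_sampling_weights(y, w_pop))    # unchanged by scale
print(se_frequency_weights(y, w), se_frequency_weights(y, w_pop))  # shrinks with scale
```

The sampling weight version does not change when the weights are rescaled, while the frequency weight version shrinks as the weights grow, because it believes there are 8,000 interviews instead of 6.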

Usually the Simple Random Sample specification will show how weighting the data increases the variances. This is a big improvement over not weighting or treating the weights as frequencies, even though it might not be the best improvement one can have. To me, we are now down to a more difficult question, which is the validity of the assumptions we need to make in order to calculate variances and take them as good estimates of population parameters. But this can be the subject of another post...

Friday, 2 July 2010

Statistics and Clinical Trials

I am reading the book Statistics Applied to Clinical Trials due to my great interest in the subject. I chose this book because it is a new edition and the index seemed to reflect closely what I expected to learn. But the book has disappointed me to some extent, even though I am still finding it useful. The technical aspect of statistics is sometimes neglected, or at least its presentation does not follow a standard: sometimes we see technical material without much explanation or rigor, and sometimes when we wish to see the math it is not there. Formulas are frequently shown in a messy and difficult to understand way. Even though the authors do a good job of explaining the correct interpretation of hypothesis tests, they are too tied to the 5% and 1% rule, as if it were universal. Sometimes they interpret the p-value in the wrong way.

All of that is bad in a way, because we are not sure we can trust the parts of the book where we have less fluency. It has been an easy read though (it will be for anyone with some background in statistics), and I think that despite the poor quality I have been exposed to, it is worth the time.

I might comment more on specific chapters...



New blog, new ideas

I am starting this blog with a new goal and I hope to be committed to it. I want to leave here things that I learn about statistics and other fields of interest. I have this passion for reading books, navigating the internet and learning, and I want to record it. While comments are always appreciated, my goal is not to be a popular blogger but rather to have the opportunity to write down new things that make me grow every day.

My other blog is more of a personal and informal one. There I write day to day experiences and thoughts. Here it will be more about knowledge, about statistics, about what I read.

Looping Infinito is also written in Portuguese because it is above all a link to my roots, my friends and my language, which is the most beautiful language. This space is different: it is more about specific knowledge, it is more serious, more technical if you will. I don't expect to go crazy on math theory (not sure I could anyway) but I do expect to comment on things that are of interest to few people (if we consider the entire world population). So I thought I was better off writing it in English, this universal language that can reach out to the world. I think these more technical subjects will make writing in English easier for me, given that I have my limitations.

If not many people end up reading what I write, this space will still serve its objective - to make me think, because it is necessary to think more when we need to summarize what we have learnt. And this will make me learn more.