Riding Numbers: Respondent Driven Sampling

I just read this paper: Assessing Respondent-driven Sampling, published in the Proceedings of the National Academy of Sciences. Measuring traits in rare populations has always been a weak spot of existing sampling methods when we think about practical applications.

I faced the challenge once, when I had to design a sample of homeless people in a very large city. But rather than talk about this specific work, I prefer to use it here as an example to show some problems we have with this type of challenge.

If you wanted to estimate the size of the homeless population by means of a traditional sample, you would need to draw quite a large probabilistic sample to get even a few homeless people from whom you could project results to the entire homeless population. The first problem is making sure your sample frame covers the homeless population, because it likely will not. Samples that cover the human population of a country are usually, and ironically, based on households (and we are talking about homeless people), taking advantage of the fact that official institutes like Statistics Canada can give you household counts for small areas, which can then become units in your sample frame. Well, perhaps you can somehow include homeless people in your sample frame by looking for them in the selected areas and weighting afterwards... but I just want to make the point that we start with a difficult task even if we could draw a big probabilistic sample. With that said, I want to move on.
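To get a feel for "quite a large sample", here is a minimal back-of-the-envelope sketch in Python. The prevalence of 0.2% is a made-up number for illustration, not a figure from any study, and the calculation assumes a simple random sample from a frame that actually covers homeless people, which is exactly what the paragraph above says we usually do not have.

```python
import math

def expected_rare_cases(n, prevalence):
    """Expected number of rare-population members in a simple random sample of size n."""
    return n * prevalence

def sample_size_for_expected_cases(target_cases, prevalence):
    """Sample size at which the expected number of rare cases reaches target_cases."""
    return math.ceil(target_cases / prevalence)

def prob_at_least_one(n, prevalence):
    """Chance that a simple random sample of size n contains at least one rare case."""
    return 1.0 - (1.0 - prevalence) ** n

# Hypothetical prevalence, for illustration only.
prevalence = 0.002

print(expected_rare_cases(1000, prevalence))
print(sample_size_for_expected_cases(200, prevalence))
print(prob_at_least_one(1000, prevalence))
```

Under these assumed numbers, a sample of 1,000 people would contain only about 2 homeless respondents on average, and you would need roughly 100,000 interviews to expect 200 of them.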

Usually, estimating the size of a rare population is the least important item on a researcher's list of goals. The second goal would be to have a good sample of the targeted rare population so that we can draw inferences that are specific to that population. For example, we could be interested in the proportion of homeless people with some sort of mental disorder. This requires a large enough sample of homeless people.

One way to get a large sample of homeless people would be by referrals. It is fair to assume that the members of this specific rare population know each other. Once we reach one of them, they can indicate others in a process widely known as Snowball Sampling. Before reading the paper I was familiar with the term and just considered it another type of Convenience Sampling, for which inference always depends on strong assumptions linked to the non-probabilistic nature of the sample. But there is some literature on the subject and, moreover, it seems to be more widely used than I thought.

Respondent Driven Sampling (RDS) seems to be the most modern way of reaching hidden populations. In an RDS, an initial sample is selected and those sampled subjects receive incentives to bring in their peers who also belong to the given hidden population. They get incentives if they answer the survey and if they successfully bring in other subjects. RDS differs from Snowball Sampling because the latter asks subjects to name their peers, not to bring them over to the research site. This is an interesting paper to learn more about it.

Other sampling methods include Targeted Sampling, where the researcher tries to make a preliminary list of eligible subjects and samples from it. Of course, this is usually not very helpful because of the cost and, again, the difficulty of reaching some rare populations. Key Informant Sampling is a design where you select, for example, people who work with the target population. They give you information about the population; that is, you do not really sample subjects from the population but instead make your inferences through secondary sources of information. Time Space Sampling (TSS) is a method where you list places and, within those places, times when the target population can be reached. For example, a list of public eating places could be used to find homeless people, since you know at what times they can be reached there. It is possible to make valid inferences for the population that goes to these places by sampling places and time intervals from this list. You can see more about TSS and find other references in this paper.

Now to the point about the paper I first mentioned: how good is RDS? Unfortunately, the paper does not bring good news. According to the simulations the authors did, the variability of RDS estimates is quite high, with a median design effect of around 10. That means an RDS sample of size n would be only as efficient (if we think about margin of error) as a Simple Random Sample of size n/10. That makes me start questioning the research, although the authors make good and valid points defending the assumptions they need so that their simulations become comparable to a real RDS. I cannot say for sure, but I wonder whether there is a survey out there that has used an RDS sample for several consecutive waves. I think this kind of survey would give us some idea of the variability involved in replications of the survey. Maybe we could look at some traits that are not expected to change from wave to wave and assess their variability.
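The arithmetic behind the design effect can be sketched in a few lines of Python. This uses the textbook margin of error for a proportion with a normal approximation; the sample size of 1,000 and the proportion of 0.5 are my own illustrative choices, and only the design effect of about 10 comes from the paper.

```python
import math

def effective_sample_size(n, design_effect):
    """Size of the simple random sample that matches the precision of a design-based sample of size n."""
    return n / design_effect

def margin_of_error(n, p=0.5, design_effect=1.0, z=1.96):
    """Approximate 95% margin of error for a proportion p, inflated by the design effect."""
    return z * math.sqrt(design_effect * p * (1.0 - p) / n)

n = 1000  # illustrative sample size, not taken from the paper

print(effective_sample_size(n, 10))
print(margin_of_error(n))                    # simple random sample
print(margin_of_error(n, design_effect=10))  # RDS with design effect 10
```

With these numbers, a simple random sample of 1,000 gives a margin of error of about 3.1 percentage points, while an RDS sample of 1,000 with a design effect of 10 gives about 9.8 points, the same as a simple random sample of only 100.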

An interesting property of RDS is that it tends to lose its bias with time. For example, it was noticed that if a black person is asked for referrals, the proportion of black people among the referrals tends to be higher than the actual proportion of black people in the population. This seems to me a natural thing in networks; I mean, people tend to have peers who are similar to themselves. This will happen with some (or maybe most) demographic characteristics. But it looks like the evident bias present in the first wave of referrals disappears around the sixth wave. That seems to me to be a good thing, and if the initial sample cannot be random, maybe we should consider disregarding waves 1 to 5 or so.
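One common way to think about why the bias fades is to treat referrals as a Markov chain over groups. The sketch below is my own toy illustration, not the simulation from the paper: the two groups and the referral probabilities are hypothetical, and the composition the waves settle into is the equilibrium of this referral chain, which need not equal the true population proportion.

```python
def next_wave(composition, transition):
    """Composition of the next referral wave.

    composition[i] is the share of recruiters in group i;
    transition[i][j] is the probability that a recruiter in group i
    brings in someone from group j.
    """
    k = len(composition)
    return [sum(composition[i] * transition[i][j] for i in range(k))
            for j in range(k)]

def wave_compositions(seeds, transition, n_waves):
    """Group composition of the seeds (wave 0) and of each later wave."""
    waves = [list(seeds)]
    for _ in range(n_waves):
        waves.append(next_wave(waves[-1], transition))
    return waves

# Hypothetical referral probabilities: people mostly recruit their own group.
transition = [[0.7, 0.3],   # recruiters in group A
              [0.2, 0.8]]   # recruiters in group B
seeds = [1.0, 0.0]          # worst case: every seed comes from group A

waves = wave_compositions(seeds, transition, 6)
for number, composition in enumerate(waves):
    print(number, round(composition[0], 3))
```

In this toy setup the share of group A starts at 100% among the seeds, drops to 70% in the first wave and 55% in the second, and is about 41% by the sixth wave, close to the 40% equilibrium of the chain. The seeds are essentially forgotten by then, which matches the idea of discarding the early waves.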

Anyway, I think that as of today researchers still face huge technical challenges when the target population is rare. RDS and other methods of reaching these populations have been developed, which helps a lot, but this remains an unresolved issue for statisticians, and a very important one in sampling methodology.

Friday, 10 September 2010
