Secrets, surveys and statistics

Paula Rowińska uses mathematics to answer some awkward questions


Image: Flickr user Wessex Archaeology, CC BY-NC-SA 2.0

Have you ever shoplifted?

Have you ever cheated on your partner?

Have you driven a car under the influence of alcohol or drugs?

If you ask such sensitive questions in a survey, don’t expect honest answers. Participants might worry that you wouldn’t be able to provide full anonymity and their embarrassing responses might end up in the wrong hands. They may skip more sensitive questions or, even worse, just blatantly lie, which would make your survey utterly useless. Luckily, there is a way to design fully anonymous surveys. All you need is money. No, I’m not asking you to bribe your respondents. A simple fair coin you keep in your pocket serves not only as the means to pay for services, but also as an excellent generator of events with some known probability (in this case with probability $1/2$). In fact, a die or coloured balls in an urn would do too; they’d just generate different probabilities.

Binary choices

Let’s say that a question in your survey is a sensitive yes/no question, where ticking ‘yes’ would mean admitting to some embarrassing or even criminal activity. To minimise the incentive to lie, you might want to include the following extra instructions in the survey. Ask the participants to flip a coin before answering the question. If they get tails, they should respond truthfully. However, if they get heads, ask them to toss the coin again and record the result: ‘yes’ for heads and ‘no’ for tails. Since you won’t be able to distinguish between genuine responses to the question and those that simply record the result of a coin flip, survey-takers can rest assured that they won’t be identified. Hopefully this will encourage honesty and give you some data to work with. But… how do you get any meaningful information from this coin-tossing exercise?

I prefer $\tau$ charts.

Scientific studies usually require large samples, so let’s assume that you have managed to recruit a sufficiently large total number of subjects, which we will call $T$. By the end of the study, you will have collected $T$ (total) answers, one from each participant: $Y$ ‘yes’ responses and $N=T-Y$ ‘no’ responses. On average, $T/2$ of these will be truthful answers (since $T/2$ people will get tails in the first coin flip), and $T/2$ will be meaningless results of second coin tosses, of which about $T/4$ will be ‘yes’ and $T/4$ will be ‘no’. This means that out of the $T/2$ people answering the actual question, $Y-T/4$ confirmed and $N-T/4 = 3T/4-Y$ denied the controversial statement. Now you know that on average $(Y-T/4)/(T/2) = 2Y/T-1/2$ of the whole population (assuming that the sample was representative) have engaged in the embarrassing activity in question.
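If you’d rather see this in action than take my word for it, here is a minimal simulation sketch. The function names, the true rate of 20% and the sample size are all made up for illustration; only the coin-flip protocol and the estimate $(Y-T/4)/(T/2)$ come from the argument above.

```python
import random


def simulate_survey(true_rate, total, seed=0):
    """Count the 'yes' answers from `total` respondents following the coin protocol.

    `true_rate` is the (hypothetical) share of people whose honest answer is 'yes'.
    """
    rng = random.Random(seed)
    yes = 0
    for _ in range(total):
        first_flip_is_tails = rng.random() < 0.5
        if first_flip_is_tails:
            # Tails: the respondent answers the sensitive question truthfully.
            answer_is_yes = rng.random() < true_rate
        else:
            # Heads: a second flip decides the answer ('yes' for heads).
            answer_is_yes = rng.random() < 0.5
        yes += answer_is_yes
    return yes


def estimate_rate(yes, total):
    """Estimate the true 'yes' rate as (Y - T/4) / (T/2)."""
    return (yes - total / 4) / (total / 2)


total = 100_000
yes = simulate_survey(true_rate=0.2, total=total)
estimate = estimate_rate(yes, total)
print(f"{yes} 'yes' answers out of {total}; estimated true rate: {estimate:.3f}")
```

With a sample this large the estimate should land close to the made-up 20%, even though no single answer tells you anything about the person who gave it.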

Of course, if instead of a fair coin you decide to generate events with probabilities different to $1/2$ using, for example, a biased coin or a die, you must adjust the numbers accordingly. Since the concept remains more or less the same, I’ll leave this as an exercise for the reader (just because I’ve always wanted to use this most-hated phrase of any PhD student).
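For readers who would rather check their answer than do the exercise, here is one way the adjustment might go. The symbols are my own shorthand, not part of the original argument: $p$ is the probability that a respondent is told to answer truthfully, $q$ is the probability that a randomly generated answer is ‘yes’, and $\pi$ is the true proportion you are after.

$$\frac{Y}{T} \approx p\,\pi + (1-p)\,q \quad\Longrightarrow\quad \pi \approx \frac{Y/T - (1-p)\,q}{p}.$$

Setting $p=q=1/2$ recovers the fair-coin estimate $(Y-T/4)/(T/2)$.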

A range of values

Nice, but what if you want to know the number of times the respondent committed the crime as well as whether they had or not? Or in other words, what if the answer to your question is not just ‘yes’ or ‘no’ but a range of values? Don’t worry, maths will save your survey.

Imagine you want to identify some quantity $X$, which the respondents aren’t too happy to share. If they’re able to generate random numbers with known and positive mean $m$ and variance $v$, you’re safe. You can ask them to generate one such number $Z$ and, without disclosing it, answer the question with $Y=XZ$. Again, you have no way of learning the value they’re so afraid of sharing. However, $X$ and $Z$ are independent variables: the value of $X$ doesn’t influence the generated random number $Z$, and vice versa. Then the average of $Y$ is just the product of averages of $X$ and $Z$, where the latter is equal to $m$. Therefore, you can estimate the average value of $X$ as the average of responses $Y$ divided by $m$. You can even quantify the error of your estimation, but I’ll spare you the details. (If you’re interested, you can find them in the paper Scrambled randomized response methods for obtaining sensitive quantitative data by Eichhorn and Hayre.)
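A small sketch of this scrambling trick, under some assumptions of mine: the respondents’ true values are invented counts between 0 and 10, and the random number $Z$ is drawn uniformly from $[0.5, 1.5]$, so that $m=1$. Any other distribution with a known positive mean would do; the helper name `scramble` is hypothetical.

```python
import random
from statistics import mean

LOW, HIGH = 0.5, 1.5          # assumed range for the scrambling number Z
M = (LOW + HIGH) / 2          # known mean m of Z


def scramble(x, rng):
    """Return the reported value Y = X * Z for a secret value x."""
    z = rng.uniform(LOW, HIGH)  # generated privately by the respondent
    return x * z


rng = random.Random(1)

# Made-up secret values X, e.g. how many times each respondent did the deed.
true_values = [rng.randint(0, 10) for _ in range(50_000)]

# What the surveyor actually sees: only the scrambled responses Y.
responses = [scramble(x, rng) for x in true_values]

# Estimate of the average of X: average of Y divided by m.
estimate = mean(responses) / M
print(f"estimated average: {estimate:.3f}")
print(f"actual average:    {mean(true_values):.3f}")
```

The two printed averages should be close, although the surveyor never sees a single unscrambled value.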

So, please flip a coin. If you get heads, toss it again and give me ‘yes’ for heads and ‘no’ for tails. Otherwise, let me know: have you ever attempted to drink and derive?

As a PhD student at Imperial College, London, Paula uses maths and stats to study the impact of wind energy on electricity prices. The title of her TEDx talk ‘Let’s have a maths party!’ seems to summarise her two favourite activities.
