Sample size and size of effect
I read with interest the latest newsletter from London-based UX agency Foviance, including the article The more the merrier? by Mariana da Silva (the latest newsletter is unfortunately not on their website yet. Update: the full article is now on their website).
Overall this was a fairly well-written discussion of sample sizes in user research, in layman’s terms. But one statement confused me somewhat. I’ve highlighted it below:
With surveys, sample size estimation is also somewhat less straightforward than with standard usability evaluations. Here, the information being collected is attitudinal data, which by its sheer nature can be slightly fuzzy. It all comes down to the size of the effect you intend to detect. Imagine you wanted to know whether people in London are taller than people in New York. If people in London and people in New York are actually pretty much the same height, you will need to measure a high number of citizens of both cities. If, on the other hand, people in London were particularly tall and people in New York were shorter than average, this will be obvious after measuring just a handful of people.
Now, I’m no statistics whizz, but that last bit doesn’t make sense to me. Wouldn’t this only be true if you knew ahead of time that Londoners were “particularly tall”? Otherwise the handful you measured might just be anomalous.
Like I said, I may be missing the point and this is in fact an excellent illustration of a fundamental error in my thinking with regard to sample sizes. Feel free to share your thoughts.

Comments(17)
Well, I’m no statistician but I would have almost zero confidence in *any* hypothesis supported by results on a sample size of a “handful of people” for a city with millions of people.
Don’t say that or we’ll never get sign-off on a research project again :)
Seriously though, there are many instances where small sample sizes are perfectly valid in the work we do, which is what the majority of Mariana’s article talks about. What confuses me is the difference between:
Why does this change the sample size required for research? I’m going to have to go and read my old stats textbook from uni aren’t I? *sigh*
Smaller sample size generally means lower precision and lower hypothesis confidence. I can’t see any way somebody could make claims about averages or any other statistics based on a sample size anywhere near that small.
Anthropology would be a different matter – I would say you can learn something about beliefs from a small sample size because (I assume) there’s an understanding of the population you’re looking at and the sample is already determined as relevant to the whole population you’re investigating.
Even then you still can’t make statistical claims with any degree of confidence with a small sample size.
Of course, I’m no anthropologist either.
Don’t know if this helps or not but here is something I wrote to help discuss the issue of sample size when it comes to qualitative versus quantitative research. (Mr Baty might disagree – let’s see)
Determining the Optimal Number of Users to Test
A common question we get is:
“What is the rationale of only using a small sample of people in research?”
The answer to this question is that in these instances we are conducting “qualitative” research, which is essentially an investigation of why and how people go about completing tasks. Qualitative research sits in a different space to the more common “quantitative” research, which requires larger sample sizes to be valid, because we are seeking to discover different sorts of information.
Qualitative research looks at the how and whys of behaviour but doesn’t answer the question of “how many” very well.
For example, you may find that people feel like the current website looks cluttered but informative, because the information density on the page is high, therefore adding to the credibility of the page. But you can’t give an accurate prediction on the number of people this applies to in the whole online population.
Quantitative research looks at the relationships between two variables across larger populations to answer the “how many” question but has trouble explaining the hows and why of the relationships between numbers.
For example, you might find stats that show that 80% of all business people look at a certain page in a website for a long time, but you don’t know why they look at it (is it because it is packed with information or confusing?)
Ideally in every project we would combine these two methods.
For the purposes of a project and understanding the “implications for the design” of the website we know that “how and why” data helps us think about the end users of the site in a richer and deeper way. It enables us to find nuggets of information that are unobservable through analysis of quantitative data to help us build a better website by understanding the “hows and why’s” and to set about building a site to cater for those user needs.
In practical terms we choose our sample sizes based on several constraints, the most important of which is to “reduce the chances of discovery failure”. By this we mean we choose sample sizes not so we can say 100% of people say this, but to ensure that we have first and foremost reduced the risk of not finding some important information about how and why people are performing a behaviour, or holding a particular view.
Contrary to initial logic, you only have to talk to a reasonably small number of people to discover a “majority held belief” because those beliefs will be held by a large number of the participants. To find more information on a minority held belief your numbers have to increase (to increase the chances of finding the person who holds that belief in the population).
For a more mathematical explanation of how many people you need to talk to in order to find a majority held belief (it’s pretty low: fewer than 10), see: http://www.icology.co.uk/qualitativesamplesize.html
As an aside there has also been statistical analysis for usability testing and the minimum numbers required to find a majority of usability issues with a site (6-9). For an explanation of this please see: http://www.useit.com/alertbox/20000319.html
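The reasoning above about majority versus minority beliefs can be sketched with some simple arithmetic. This is only an illustration, and not necessarily the method used in either linked article: it assumes participants are picked independently and at random, and that a fixed fraction of the population (called `prevalence` here, a name made up for this sketch) holds the belief or hits the issue.

```python
import math

def chance_of_discovery(prevalence: float, participants: int) -> float:
    """Probability that at least one of `participants` people holds a belief
    (or hits an issue) shared by a fraction `prevalence` of the population.
    Assumes independent, random selection of participants."""
    return 1.0 - (1.0 - prevalence) ** participants

def participants_needed(prevalence: float, target: float = 0.95) -> int:
    """Smallest number of participants whose chance of discovery reaches
    `target`. Only meaningful for 0 < prevalence < 1."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - prevalence))

# Hypothetical prevalences: a majority belief down to a rare one.
for prevalence in (0.5, 0.3, 0.1, 0.02):
    print(f"{prevalence:.0%} of population -> "
          f"{participants_needed(prevalence)} participants")
```

Under these assumptions a belief held by half the population turns up with 95% probability within 5 participants, while one held by 10% needs 29. Note that this is only the chance of encountering the belief at least once; it says nothing about how common the belief is.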
This blog is like a honeypot…
Pat, the reason you only need to test a small number of Londoners and New Yorkers is because the height differential you measure will be quite large. So large, in fact, that it couldn’t reasonably (i.e. with a probability below 0.01%) be the result of randomness in the sampling.
Think of each population in terms of the bell-shaped curve representing the height of each person. If the two populations are similar, there will be a large degree of overlap between the two curves. When sampling from the two cities, the greater the overlap, the greater the similarity we can expect in our samples. The greater the similarity, the larger the sample size we need before we can gain confidence that any observed difference is ‘real’.
For the same reason, as Stephen points out, you only need to ask a relatively small sample in order to identify a majority opinion, and a much larger sample to identify minority opinions.
However, it is very important to understand the concepts of statistical sampling when conducting research. On the one hand, you could be drawing conclusions from too small a sample; but on the other hand – you could be wasting an enormous amount of energy trying to sample an unnecessarily large segment of the population.
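To put rough numbers on how the required sample grows as the two bell curves overlap more, here is a minimal sketch using the textbook normal-approximation formula for comparing two means. The standard deviation of 7cm, the 5% significance level and the 80% power are all assumptions chosen for illustration, and the function name is made up for this example.

```python
import math
from statistics import NormalDist

def sample_size_per_group(difference_cm: float, sd_cm: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate people needed per city to detect a true difference in
    mean height, using the normal approximation for a two-sided test.
    The approximation is optimistic for very small samples."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)    # critical value for the test
    z_power = z.inv_cdf(power)            # extra margin for the desired power
    effect_size = difference_cm / sd_cm   # difference in units of spread
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# Assumed spread of heights within each city (not from the discussion).
ASSUMED_SD_CM = 7.0
for difference in (12, 5, 1):
    n = sample_size_per_group(difference, ASSUMED_SD_CM)
    print(f"{difference}cm difference -> about {n} people per city")
```

With these assumed figures a 12cm difference needs only a handful of people per city, while a 1cm difference needs several hundred. The required sample grows with the inverse square of the effect size.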
This comes back to Patrick Lees comment about “I can’t see any way somebody could make claims about averages or any other statistics based on a sample size anywhere near that small.”
There is a whole sub-specialty of statistical analysis that deals with the sampling & analysis needed for quality control in product manufacturing processes. It is a well-defined and mature area of statistical analysis that directly addresses the issue of small sample sizes – because some tests of product quality are destructive (light bulb lifetime; matches; bacteria levels in fruit).
Steve
Thanks for explaining that, Steve.
One more question though. You said “the reason you only need to test a small number of Londoners and New Yorkers is because the height differential you measure will be quite large”. But how do you know the differential will be large?
You won’t. But as soon as you start measuring people the difference will show up, and will be of such a size that the chances it’s due to random sampling will be vanishingly small. You’ll never *know*, but you’ll be extremely confident.
Suppose you start from the premise that the two populations are roughly the same in terms of their height (and distribution of heights), and you begin by measuring 5 people from each city, selected at random. You get an average height for New York of, say, 178cm and an average height for London of 190cm. You measure a further 5 people from each city, and the averages are now 179cm for NY and 190cm for London, with 10 measurements each.
At each point you can calculate mean, standard deviation, and the probability that – for a sample size of 10 – the difference is just dumb luck. Pretty quickly that difference in mean of 11cm will turn from a curiosity into a statistical probability that it’s a real difference you’re seeing. I CBA working it out, but I doubt you’d need more than about 20 or 30 people before you hit 95% certainty – as long as you really are selecting people at random, and not hanging out at London’s basketball courts.
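As a rough check on that calculation, here is a sketch using the averages from the example above. It makes one big assumption that is not in the discussion: that heights within each city vary with a standard deviation of 7cm, which is treated as known. With samples this small a t-test on the measured standard deviations would be the more careful choice and would give somewhat larger probabilities, so treat the output as indicative only.

```python
import math
from statistics import NormalDist

ASSUMED_SD_CM = 7.0  # assumption: the spread of heights is not given above

def p_value_for_difference(mean_a: float, mean_b: float, n_per_city: int,
                           sd: float = ASSUMED_SD_CM) -> float:
    """Two-sided probability of seeing a gap at least this large between two
    sample means if the cities really had the same average height
    (z-test with a known, shared standard deviation)."""
    standard_error = sd * math.sqrt(2.0 / n_per_city)
    z = abs(mean_a - mean_b) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(z))

# The two checkpoints from the example: 5 people, then 10 people, per city.
print(p_value_for_difference(190, 178, 5))
print(p_value_for_difference(190, 179, 10))
```

Under that assumed spread, both checkpoints already fall below the 5% threshold that corresponds to 95% confidence. A wider spread of heights, or a smaller gap, would push the required numbers up.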
Steve
If the confidence intervals are actually calculated then great…
Sorry, my reading of the original example takes a handful to be something like four or five and height differences to be much smaller even for a large difference.
I think it’s dangerous to talk about these things in vague terms numerically whilst giving a salient example of New York and London.
So yeah, my “gripe” is with the example.
At the end of the day, we wouldn’t be trying to answer the question “whether people in London are taller than people in New York” so this discussion is rather academic.
For the sort of research we do, our methods and sample sizes are appropriate, taking into account Stephen’s comments.
I, for one, am much clearer on the matter now.
Actually, to put this example in terms of something you *might* be undertaking, let’s say you were looking at two page designs and trying to determine whether one gives a better click-through on some prime (revenue-generating) content item.
You run that comparison as an A/B test for 15 minutes and check the result. If it’s a big difference, you could stop right there. If it’s a small difference, keep running it and check again at the hour mark. At that point, you probably have a decent sized sample and can crunch some numbers; run a chi-squared test and see what you see. What you’re trying to do at this stage is determine the likelihood that your small observed difference is *real* rather than random.
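For anyone who wants to see what that chi-squared step looks like, here is a minimal sketch for a two-design comparison. The click and view counts are invented for illustration, the function name is made up, and no continuity correction is applied. It also assumes each design received at least one click and at least one non-click overall, otherwise the expected counts would be zero.

```python
import math

def chi_squared_2x2(clicks_a: int, views_a: int,
                    clicks_b: int, views_b: int) -> tuple[float, float]:
    """Pearson chi-squared test (1 degree of freedom, no continuity
    correction) on clicked / did-not-click counts for designs A and B.
    Returns (statistic, p_value)."""
    observed = [
        [clicks_a, views_a - clicks_a],
        [clicks_b, views_b - clicks_b],
    ]
    total = views_a + views_b
    row_totals = [views_a, views_b]
    col_totals = [clicks_a + clicks_b, total - clicks_a - clicks_b]

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom the upper tail has this closed form.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical counts: a large gap and a small gap on the same traffic.
print(chi_squared_2x2(30, 200, 10, 200))  # large difference
print(chi_squared_2x2(22, 200, 18, 200))  # small difference
```

With the same 200 views per design, the large gap gives a very small p-value, while the small gap is entirely consistent with chance and would need much more traffic to resolve.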
Steve
Which is quite counter-intuitive isn’t it? I would naturally think that after 15 minutes, any difference seen might be pure chance, and I’d keep measuring to ensure a “good sample size”. But what you’re saying is this would be a waste of time if the difference was large. Mind blowing!
This explains why I felt “like a deer caught in the headlights” during my stats course at uni :)
Would depend on the certainty you’re after right?
When I had to do this stuff as part of signal processing my intuition was (and apparently still is) wrong on pretty much all of it. That’s why I said it was dangerous to talk about it without actually working out the numbers…
Patrick,
If the question you’re trying to answer is “Is design A better (in real terms) than design B?” then no, it doesn’t really depend on the certainty. If the observed difference between the two designs is relatively large (the relative measurement is important), then the underlying theory says you can be fairly sure.
Do we ever need to write a test report that reads “we can be certain with a 95% degree of confidence that…”? Rarely, if ever. However, the key point I’m trying to make is that – with large observed differences – you reach that 95% level of confidence after far fewer observations than with small relative differences.
BTW: you’re not alone in the lack of intuition with this. One of the smartest mathematicians I ever met completely failed to understand this topic on anything other than a “I don’t get it, but I’ll take your word for it” basis.
PS: I’m in the process of writing this topic up into a column for uxmatters, which I hope will provide extra clarity.
So we sparked off the idea for the UXmatters article? :)
Patrick,
Thanks for your interest in my article. Apologies if it was confusing. I can see why it was. All I was trying to do was to explain what effect size is by giving an example. It’s always difficult to avoid sounding too geeky and losing people’s attention when you talk about statistics and you use terms like “effect size”! So I decided to explain it by coming up with the New York / London example. What I was trying to say is that when you can predict big effect sizes (before you test) you can reduce the number of people you will test (the “handful” term was used colloquially, rather than formally). Usually, sample sizes need to be decided before the test, survey, etc. is run, and one of the things that should be taken into account is the size of the effect you expect to measure. If you predict that the behaviour you will be measuring will only have tenuous variations between people you will need to boost those numbers to be confident that your results will have solid statistical basis. You can, of course, run a power analysis post hoc to determine whether your sample was big enough to give you high confidence levels in your results. But by then it is too late.
I hope this clarifies things (although Steve made my job very easy by explaining this very clearly).
Great blog and comments in general. :-)
Looking forward to reading the UXmatters article.
Thanks for the follow-up Mariana, I think we’ve all learnt something through this string of comments :)
Just to close the loop somewhat, the article I wrote can be viewed here: http://www.uxmatters.com/MT/archives/000352.php