Here’s a story about bad science communication, the latest COVID variant hype, and a nice win for online citizen science. But first, shampoo adverts.
We’ve all seen adverts which make big claims about survey percentages: “73% of people agreed their hair was more manageable”. But then, sometimes our eye is drawn to the small print at the bottom of the screen, where there might be a disclaimer “4 out of 15 people sampled disagree”.
When you see something like that, it’s hard not to be a bit sceptical. On some intuitive level, we know that small samples can be unreliable, and the more people interviewed the better, but it’s worth thinking a bit about exactly why that is.
The issue is that sampling from the population is a random process, and we are unlikely to get “exactly the right answer”. If we toss a fair coin 10,000 times, we would be surprised to get exactly 5,000 heads, even though that’s the most likely outcome. In fact there’s only a 0.8% chance that we’ll see exactly 5,000 - we’ll tend to be somewhere in a range either side of that.
We’re very unlikely to get fewer than 4,800 or more than 5,200 heads (so if we did then we might be justified in asking whether the coin was really fair), but it’s hard to be more specific than that.
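Neither number is magic - both drop straight out of the binomial distribution. A quick Python sketch, using exact binomial probabilities:

```python
from math import comb

n = 10_000
total = 2 ** n  # number of equally likely heads/tails sequences

# Probability of exactly 5,000 heads in 10,000 fair flips.
p_exact = comb(n, 5_000) / total

# Probability of landing anywhere from 4,800 to 5,200 heads.
p_range = sum(comb(n, k) for k in range(4_800, 5_201)) / total

print(f"exactly 5,000 heads: {p_exact:.4f}")  # about 0.008
print(f"4,800-5,200 heads:   {p_range:.6f}")  # very close to 1
```

The single most likely count is individually unlikely, while the 4,800-5,200 band (four standard deviations either side of the mean) captures essentially all of the probability.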
In the same way, we can think of doing an opinion poll of 1,000 people as randomly picking members of the population to be interviewed. Even if exactly 25% of the population vote Labour, we don’t expect that our sample contains exactly 250 Labour voters, and so we are used to the idea that a poll comes with a margin of error (say plus or minus 3%) and that rogue polls and outliers are always possible.
But 10,000 coin flips and 1,000 interviews are relatively large numbers. If, like the shampoo advert, we only interviewed 15 people, then we might reasonably be sceptical about whether we could conclude anything very concrete at all.
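The standard normal-approximation margin of error makes the contrast concrete (a sketch; 1.96 is the usual 95% z-value):

```python
from math import sqrt

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion p estimated from n samples."""
    return z * sqrt(p * (1 - p) / n)

# A 25% share measured in a 1,000-person poll: roughly +/- 2.7 points,
# which is where the familiar "plus or minus 3%" comes from.
moe_poll = margin_of_error(0.25, 1000)

# The same share estimated from just 15 people: roughly +/- 22 points.
moe_shampoo = margin_of_error(0.25, 15)

print(f"n=1000: +/- {100 * moe_poll:.1f} points")
print(f"n=15:   +/- {100 * moe_shampoo:.1f} points")
```

With 15 people, the interval is so wide that almost any headline percentage is compatible with the truth.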
Of course, it’s a bit more complicated than “big sample sizes are good”. We’ve all seen voodoo polls on Twitter where polling a Eurosceptic’s followers can give answers that aren’t reflective of the general population. As well as being large, a sample needs to be representative, which is hard to achieve. However, we can be sure that if a sample is too small, then even if we’ve sampled perfectly the answer can be deceptive.
Which brings me back to COVID variants. On the 23rd March 2025, inews breathlessly reported the latest data and raised the possibility of another wave.
Of course we got used to how this worked in the past. In 2022 and 2023 we saw a succession of new omicron variants coming in with a significant growth advantage. Broadly speaking a variant taking over fast is bad news, because the faster the takeover the bigger the wave tends to be. So it might seem worrying to hear in inews that:
The latest data from the UKHSA showed the combined share of LP.8.1 and LP.8.1.1 subvariant jumped from 22.3 per cent on 9 Feb to 60 per cent on 23 February (the latest period for which reliable data is available) – and it is expected to have risen further since.
However, I’m kind of sceptical and not stocking up on toilet roll yet, and there’s a simple reason: the data is more like the shampoo than the opinion poll, but you have to do some digging to see that. For the hardcore nerds, UKHSA still publish weekly reports on flu, COVID and other respiratory viruses. If we go into Table 10 of the data that inews’s report was based on, we can see that the report was true:
Rows here are fortnightly periods, the columns are LP.8.1 and its cousin LP.8.1.1 respectively, and it’s true: the percentage of the latter had quadrupled in a fortnight! As I posted on Twitter¹ on Wednesday though, I wasn’t convinced, for two reasons (even leaving aside issues of representativeness of samples).
The first is easy to see: sure, LP.8.1.1 had jumped in the last fortnight, but it hadn’t done much before that. If it’s got such a massive growth advantage, why had it only crept from 3.8% to 6.7% between 15th December and 9th February? I don’t doubt it’s growing a bit, but I think a journalist should be a little bit sceptical and look at the long-term picture rather than report one fortnight’s data as headline news.
The second reason is more fun from a maths point of view. All the numbers in the bottom row of the table (6.7%, 13.3%, 26.7% and 33.3%) were multiples of 1/15 rounded to 1 decimal place. All the numbers in the row above (2.2%, 6.7%, 8.9%, 13.3%, 15.6%, 37.8%) were multiples of 1/45. This raises the serious possibility that these are samples of 15 and 45 people respectively - if that’s true, then the 26.7% would really be “4 out of 15 people”, literally the numbers in my shampoo example.
I think people looking at the table should have spotted this possibility. Of course, it could be that the true sample size is a multiple of 15 - but if it were as big as 150 then the counts would have to have come out at exactly 10, 20, 40 and 50, which doesn’t seem likely by chance. Either way, such perfect fractions should be a red flag for people used to working with data.
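This check can be mechanised: for each row, find the smallest sample size whose whole-number counts could round to every percentage in it (a brute-force sketch; the true denominator could of course be any multiple of what it finds):

```python
def smallest_sample_size(percents, max_n=1000):
    """Smallest n such that every percentage in the list equals
    round(100 * k / n, 1) for some whole count k out of n."""
    for n in range(1, max_n + 1):
        if all(any(round(100 * k / n, 1) == p for k in range(n + 1))
               for p in percents):
            return n
    return None

# The two most recent rows of UKHSA's Table 10:
print(smallest_sample_size([2.2, 6.7, 8.9, 13.3, 15.6, 37.8]))  # 45
print(smallest_sample_size([6.7, 13.3, 26.7, 33.3]))            # 15
```

The float comparison is safe here because `round()` returns exactly the same nearest-representable value as the one-decimal literal.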
So on Wednesday, it seemed at least worth investigating the possibility that the sample could just be 15 people. I talked to Dave McNally and Alex Selby, two of the more serious variant trackers, who confirmed that the sample really was that small. At first I was confused, because Dave’s table of variant data talks about 100 or so COVID genomes being sequenced per week. That’s a low number compared to what we saw during the pandemic, but it’s higher than my 15!
However, the issue is that Dave reports UK-wide data, whereas UKHSA report just for England. In the past, this wouldn’t have been a problem, but now the sampling is not uniform across nations - England has 85% of the population, but something like 25% of the recent samples.
So on Wednesday I was happy that I’d probably solved the problem. I did suggest in my thread that UKHSA could solve the problem by publishing absolute numbers of genomes, but I was happy to leave it there and stick my neck out to say:
Isn't it at least somewhat plausible that the 26.7% might be a statistical outlier?
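To put a number on how plausible that is: a confidence interval on “4 out of 15” is enormous. Here’s a rough Wilson score interval sketch (1.96 is the standard 95% z-value):

```python
from math import sqrt

def wilson_interval(k: int, n: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for k successes out of n."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

lo, hi = wilson_interval(4, 15)
print(f"4/15 = 26.7%, 95% interval roughly {100 * lo:.0f}% to {100 * hi:.0f}%")
```

A reported 26.7% from 15 samples is consistent with a true share anywhere from about 11% to about 52%.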
However, I got a pleasant surprise the next day. UKHSA’s next weekly report gave updated data which showed that the percentages were much lower and in line with previous trends (35% in total across the two variants, not “60% and expected to have risen since”).
But they also gave us a new column, giving exactly what I’d asked for only the previous day - absolute numbers of genomes sequenced. No longer would it take calculation to reverse engineer the sample size from the percentages, we could see it directly! And it confirmed that the absolute numbers were very low: 66, 77, 71, 68 and 23 in the various fortnights reported in 2025. So hopefully people will remember to take these percentages with a pinch of salt in future.
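Even the better fortnights leave a lot of noise. A quick sketch of the 95% margin of error on a 35% share at those published sample sizes (same normal approximation as before):

```python
from math import sqrt

# The genome counts from UKHSA's new column, per my reading of the report.
counts = [66, 77, 71, 68, 23]

moes = {n: 1.96 * sqrt(0.35 * 0.65 / n) for n in counts}
for n, moe in moes.items():
    print(f"n={n:3d}: 35% +/- {100 * moe:.0f} points")
```

So even with the improved reporting, a fortnight-to-fortnight swing of ten points or so is entirely unremarkable.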
And of course, a huge thank you to people at UKHSA for reading my thread and fixing the problem so fast - that’s citizen science in action for you.
(While I’m at it, I’m sceptical of Professor Steve Griffin’s claim in the same article that “our population immunity from vaccines is waning because of poor coverage, limited access and the exorbitant price of private immunisation”, because Figure 6 of the latest UKHSA vaccine surveillance report (albeit from November 2024) doesn’t seem to show the position changing much, even in unboosted groups, over the last two years - but maybe that’s a fight for another day!)
Yes, I know that doing so makes me complicit in the downfall of democracy, the invasion of Ukraine and Greenland and arbitrary deportations without due process. Don’t bother to write to tell me that.
The shampoo ad bit reminds me of the classic ad for child roller skates (no pics allowed here, but nicely broken down by Lewis Folkard below):
https://lewisfolkard.co.uk/ad-breakdown-fisher-prices-anti-slip-roller-skates/