What’s Wrong with Hypothesis Tests?
Copyright © 2023 by Stan Brown, BrownMath.com
Summary: The hypothesis test you learned in statistics class is often misused. Even worse, there are serious logical problems with the theory of hypothesis tests. This article lists the main issues and describes what — if anything — you can do about them.
See also: This article talks about inherent problems with the standard hypothesis test. But as with anything people do, it’s also possible to make an outright blunder, making the conclusion worthless. For mistakes often made by students, see Top 10 Mistakes of Hypothesis Tests.
Let’s start with the problems that don’t come from someone making a mistake, but are inherent flaws in hypothesis testing.
Whatever α you pick, that’s the chance that an experiment where the null hypothesis is actually true will still give a p-value below α: a “statistically significant” result that is a false positive, also known as a Type I error. With α = 0.05, the traditional choice, about five out of every 100 such experiments (one in 20) result in a p-value under 0.05 purely by chance. And when you get a p-value under 0.05, you have no way to know whether your result is one of those flukes.
As Robert Matthews says in Chancing It (2017, 163–164) [full citation in “References”, below], the standard hypothesis test is “scarily prone to passing off fluke results as genuine effects. … As a result, a substantial fraction of the countless research claims made over the decades on the basis of ‘statistical significance’ are meaningless nonsense. [Emphasis added.]”
What can you do? You can reduce the problem by picking a lower α before you perform your experiment. But you can’t set α = 0, so some fraction of “statistically significant” research results will always be flukes even if you do everything right.
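A small simulation can make this concrete. The sketch below is only an illustration under assumptions of my own, not anything from the cited books: every experiment draws 30 values from a normal population where the null hypothesis really is true, and a two-sided z-test with known σ is applied. The function names, sample size, and seed are all made up for the example.

```python
import math
import random

def z_test_p_value(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value for H0: mean == mu0, with known sigma."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    # P(|Z| >= |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(alpha=0.05, experiments=10_000, n=30, seed=1):
    """Fraction of experiments with a true null that still give p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(experiments):
        # The null hypothesis really is true: the population mean is 0.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if z_test_p_value(sample) < alpha:
            hits += 1
    return hits / experiments

rate = false_positive_rate()
print(f"share of true-null experiments with p < 0.05: {rate:.3f}")
```

The printed share should land near α. Calling `false_positive_rate(alpha=0.01)` shrinks it, but no positive α brings it to zero.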
Ellenberg (2014, 133, 136) [full citation in “References”, below] calls the logic of the standard hypothesis test a “reductio ad unlikely” and summarizes it this way:
1. Suppose the null hypothesis H is true.
2. It follows from H that a certain outcome O is very improbable (say, less than Fisher’s 0.05 threshold).
3. But O was actually observed.
4. Therefore, H is very improbable.
That’s a fair summary of the process taught in stats classes. But notice: steps 1–2 say if H is true then O is very unlikely: in symbols, P(O|H) is small. But we don’t know that the null hypothesis is true; we just assumed it for the purpose of computing a p-value. Now look at steps 3–4: since O is true, then H is very improbable: in symbols, P(H|O) is small.
The unstated assumption is that if P(O|H) is small then P(H|O) must be small also. That’s the prosecutor’s fallacy, assuming that P(H|O) = P(O|H). But “reversed” conditional probabilities are almost never the same, and they can be wildly different. The first section of Medical False Positives and False Negatives shows an extended example.
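To see how far apart the two conditional probabilities can be, here is a minimal sketch using Bayes’ rule. The three input numbers are hypothetical, chosen only for illustration, and `posterior` is a name I made up.

```python
def posterior(prior_h, p_o_given_h, p_o_given_not_h):
    """P(H|O) from Bayes' rule for a hypothesis H and an observation O."""
    p_o = p_o_given_h * prior_h + p_o_given_not_h * (1 - prior_h)
    return p_o_given_h * prior_h / p_o

# Hypothetical numbers: O is unlikely if H is true (0.04), but H was very
# plausible to begin with (0.90) and O is not much likelier otherwise (0.10).
p_h_given_o = posterior(prior_h=0.90, p_o_given_h=0.04, p_o_given_not_h=0.10)
print(f"P(O|H) = 0.04, but P(H|O) = {p_h_given_o:.3f}")
```

With these made-up inputs, P(O|H) falls below 0.05, yet H is still more likely true than false after O is observed.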
Ellenberg concludes that the standard hypothesis test, the reductio ad unlikely
… is not logically sound in general. It leads us into its own absurdities. Joseph Berkson … offered a famous example demonstrating the pitfalls of the method.
Suppose you have a group of fifty experimental subjects, who you hypothesize (H) are human beings. You observe (O) that one of them is an albino. Now, albinism is extremely rare, affecting no more than one in twenty thousand people. So given that H is correct, the chance you’d find an albino among your fifty subjects is quite small, less than 1 in 400, or 0.0025. So the p-value, the probability of observing O given H, is much lower than 0.05.
We are inexorably led to conclude, with a high degree of statistical confidence, that H is incorrect: the subjects in the sample are not human beings.
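The arithmetic in Berkson’s example is easy to check. This sketch assumes the fifty subjects are independent and uses the quoted upper bound of one in twenty thousand as the albinism rate. The variable names are mine.

```python
albinism_rate = 1 / 20_000   # upper bound quoted in the example
subjects = 50

# P(O|H): chance that at least one of 50 human subjects is an albino
p_o_given_h = 1 - (1 - albinism_rate) ** subjects
print(f"P(at least one albino | all human) = {p_o_given_h:.6f}")
print("less than 1 in 400:", p_o_given_h < 1 / 400)
```

The p-value really is tiny, which is exactly the point: a tiny P(O|H) by itself says nothing about whether H is false.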
What can you do? That’s a tough one. There’s really no way to keep the standard hypothesis test but make it logically valid.
One possible replacement goes by the name of Bayesian inference, but I warn you, it’s controversial and it’s often difficult to apply. Ellenberg (2014) [full citation in “References”, below] devotes 21 pages to it, and Matthews (2017) [full citation in “References”, below] spends about the third quarter of his book on it. A very short summary: you start with one or more alternative hypotheses, each assigned some prior probability of being true; then you run your experiment; and at the end of the computation you have a new probability that each hypothesis is true.
You may wonder where the initial probabilities come from, and that’s indeed a problem, but if the process works as intended then a series of experiments will cause the probabilities of each hypothesis after each experiment to converge to the true probabilities. The main point, though, is that the standard hypothesis test can’t tell us the probability that a hypothesis is true, and Bayesian inference can, in principle. Obviously a process that is laborious but computes what you want to know is better than a process that is easy but doesn’t compute what you want to know.
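Here is a toy sketch of that convergence, not taken from either book. It assumes just two hypotheses about a coin (fair, or biased 70% toward heads) and a made-up run of 200 flips. Two analysts start from different prior probabilities and update after every flip.

```python
def update(priors, likelihoods):
    """One Bayesian update; both arguments are dicts keyed by hypothesis."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}

# Hypothetical setup: the coin is either fair or biased toward heads.
P_HEADS = {"fair": 0.5, "biased": 0.7}

def run(priors, flips):
    """Update the beliefs once per flip and return the final probabilities."""
    beliefs = dict(priors)
    for flip in flips:
        likelihoods = {h: (p if flip == "H" else 1 - p) for h, p in P_HEADS.items()}
        beliefs = update(beliefs, likelihoods)
    return beliefs

flips = "HHHTHHTHHT" * 20   # made-up data: 200 flips, 70% heads
skeptic = run({"fair": 0.95, "biased": 0.05}, flips)
neutral = run({"fair": 0.50, "biased": 0.50}, flips)
print(f"skeptic: P(biased) = {skeptic['biased']:.4f}")
print(f"neutral: P(biased) = {neutral['biased']:.4f}")
```

The starting probabilities differ a lot, but after enough data the two analysts end up in nearly the same place.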
Robert Matthews (2017, 188) [full citation in “References”, below] says, “The whole process is transparent, democratic, and quantitative — and for many types of study, involves simply feeding two numbers into an online calculator.” A footnote to that statement refers the reader to Matthews (2001) [full citation in “References”, below] and to this online calculator.
All too often in real life, people don’t follow the hypothesis testing process carefully. They may be consciously fudging, or they were constrained by external circumstances, or they just goofed.
These problems are preventable in your own hypothesis tests. But when you’re evaluating what a researcher has done, always look for these problems before you rely on the conclusion.
Smith (2014, 21) [full citation in “References”, below] gives an example of this data grubbing:
As researchers sift through the data, looking for patterns, they are explicitly or implicitly doing hundreds of tests. Imagine yourself in their shoes. First, you look at the data as a whole. Then you look at males and females separately. Then you differentiate between children and adults; then between children, teenagers, and adults; then between children, teenagers, adults, and seniors. Then you try different age cutoffs. You let the senior category be 65+, and when that doesn’t work, you try 55+, or 60+, or 70+, or 75+. Eventually, something clicks. … If we knew that the researcher obtained the published results by looking at the data in a hundred different ways, we would surely view the results with suspicion. [Emphasis added.]
This error is closely related to the problem that non-significant results are less likely to get published.
What can you do? Make your hypothesis, then collect your data. That makes it a lot less likely that you’ll be seduced into treating a coincidence as a significant result, though there is no perfect protection.
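A rough calculation shows why looking at the data many ways is so dangerous. The sketch assumes the tests are independent and that there is no real effect anywhere. Real subgroup slices overlap, so treat the numbers as an illustration rather than an exact figure. The function name is made up.

```python
def chance_of_a_fluke(tests, alpha=0.05):
    """P(at least one p-value below alpha) across independent tests of true nulls."""
    return 1 - (1 - alpha) ** tests

for tests in (1, 10, 50, 100):
    print(f"{tests:>3} looks at the data: {chance_of_a_fluke(tests):.3f}")
```

Under those assumptions, a hundred looks at the data make at least one “significant” fluke nearly certain.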
Smith (2014, 36) [full citation in “References”, below] reports:
an HMO survey found that more than 90 percent of its members were satisfied. There are two kinds of survivor bias here, both of which bias the reported satisfaction rate upward: people who left the plan because they were dissatisfied and people who died.
Ellenberg (2014, 95–96) [full citation in “References”, below] tells the tale he calls the Baltimore Stockbroker. Here it is as I’ve condensed it:
You get a letter or email predicting that a certain stock will rise during the coming week, and it does. The next week you get a similar mail about a different stock, and again the prediction is correct. This goes on for 10 weeks in all. In the eleventh week you get another mail, this time asking you to invest your money with him, and pay large commissions too. You say to yourself:
This guy had a 50-50 shot each time, and he was right all 10 times. The probability of that happening, if he knew nothing about stocks, would be 0.5^10, which is just under 0.001. Obviously he must be a stock-market whiz to do so much better than chance.
That conclusion may be obvious, but it’s wrong. Ellenberg explains:
… things look different when you retell the story from the Baltimore stockbroker’s point of view. Here’s what you didn’t see the first time. That first week, you weren’t the only person who got the broker’s newsletter; he sent out 10,240. (Footnote: This story certainly dates back to the days when this process would have involved reproducing and stapling ten thousand physical documents, but is even more realistic now that this kind of mass mailing can be carried out electronically at essentially zero expense.)
But the newsletters weren’t all the same. Half of them were like yours, predicting a rise in the stock. The others predicted exactly the opposite. The 5,120 people who got a dud prediction … never heard from him again. But you, and the 5,119 other people who got your version of the newspaper, get another tip next week. Of those 5,120 newsletters, half say what yours said and half say the opposite. And after that week, there are still 2,560 people who’ve received two correct predictions in a row.
And so on.
After the tenth week, there are going to be ten … people who’ve gotten ten straight winning picks from the Baltimore stockbroker — no matter what the stock market does. [Whether he’s a good stock picker or not,] there are ten newsletter recipients out there to whom he looks like a genius.
Your hypothesis test was bogus because your data set was merely a “survivor” of the scheme. If you had known the full scheme, and computed difference from chance for the set of all 10,240 people who got any mails from the Baltimore stockbroker, you’d have found no difference from chance in his results.
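The scheme’s arithmetic can be traced in a few lines. This is just a sketch of the halving described in the story, starting from the 10,240 newsletters Ellenberg mentions. The variable names are mine.

```python
recipients = 10_240
for week in range(1, 11):
    # Only the half who received the correct prediction hear from him again.
    recipients //= 2
    print(f"after week {week:>2}: {recipients:>5} people have seen only correct picks")

# What a single recipient computes from their own mail alone
p_ten_lucky_guesses = 0.5 ** 10
print(f"0.5^10 = {p_ten_lucky_guesses:.6f}")
```

Ten “survivors” out of 10,240 is exactly what pure chance predicts, since 10,240 × 0.5^10 = 10.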
What can you do? Don’t do your tests after the fact. Instead, take a random sample first, then follow each member forward to learn the outcomes.
For example, the HMO study could have drawn a random sample of members on a particular date in 2021, then polled all of them two years later to determine the level of satisfaction. Or, with the cooperation of the HMO, the study could take place in 2023 by using the HMO’s membership records from two years ago to draw a random sample, and then contacting everyone in the sample.
I have to share with you a terrific dramatization of this principle, putting hypothesis before data. It’s in the excellent and mathematically aware novel The Black Cloud (Harper, 1957) by Fred Hoyle, the astronomer and science fiction writer.
Here’s the background to the conversation I’m quoting. Out of a hundred nuclear missiles from space, three have hit El Paso, Chicago, and Kiev. That’s quite unlikely, since so little of Earth’s area is major cities. The scientists are trying to determine whether those three missiles were aimed at the cities.
“It looks to me as if those perturbations of the rockets must have been deliberately engineered,” began Weichart.
“Why do you say that, Dave?” asked Marlowe.
“Well, the probability of three cities being hit by a hundred odd rockets moving at random is obviously very small. Therefore I conclude that the rockets were not perturbed at random. I think they must have been deliberately guided to give direct hits.” …
There was a derisive laugh from Alexandrov. … “Invent bloody argument, like this. Golfer hits ball. Ball lands on tuft of grass?—so. Probability ball landed on tuft very small, very very small. Million other tufts for ball to land on. Probability very small, very very very small. So golfer did not hit ball, ball deliberately guided on tuft. Is bloody argument. Yes? Like Weichart’s argument.” …
Weichart was not to be budged. When the laugh had subsided he returned to his point. “It seems clear enough to me. If the things were guided they’d be far more likely to hit their targets than if they moved at random. And since they did hit their targets it seems equally clear that they were more probably guided than that they were not.”
Alexandrov waved in a rhetorical gesture. “Is bloody, yes?”
“What Alexis means, I think,” explained Kingsley, “is that we are not justified in supposing that there were any particular targets. The fallacy in the argument about the golfer lies in choosing a particular tuft of grass as a target, when obviously the golfer didn’t think of it in those terms before he made his shot.”
The Russian nodded. “Must say what dam’ target is before shoot, not after shoot. Put shirt on before, not after event.”
Most journals are more likely to publish a result with a p-value below 0.05 than one with a p-value above 0.05. Because of this competition to get published, researchers may not even send in a paper with a failure to reject H0. This publication bias means that we get an incomplete picture of reality, and it often tempts researchers to engage in data grubbing, as illustrated in Randall Munroe’s cartoon.
In the cartoon, 20 experiments are done, each testing for a link between acne and one color of jelly beans. In 19 tests, the p-value is too high; but in the test for green jelly beans the p-value is below 0.05. So they publish the result for green jelly beans but not the other 19.
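How surprising is one “hit” in 20 tests? The sketch below assumes that no color has any real effect and that the 20 tests are independent, so the number of p-values under 0.05 follows a binomial distribution. The function name is made up.

```python
from math import comb

def p_exactly(k, tests=20, alpha=0.05):
    """P(exactly k of the tests give p < alpha) when no color has a real effect."""
    return comb(tests, k) * alpha**k * (1 - alpha) ** (tests - k)

expected_flukes = 20 * 0.05
p_none = p_exactly(0)
print(f"expected number of 'significant' colors: {expected_flukes:.0f}")
print(f"P(no color is significant) = {p_none:.3f}")
print(f"P(exactly one, as in the cartoon) = {p_exactly(1):.3f}")
```

Under those assumptions, getting exactly one significant color is slightly more likely than getting none at all.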
(The cartoon is used by permission of the artist. It can be seen in full size at https://xkcd.com/882/.)
Ellenberg (2014, 155) [full citation in “References”, below] says, “To live or die by the 0.05 is to make a basic category error, treating a continuous variable (how much evidence do we have that the drug works? …) as if it were a binary one (true or false? yes or no?). Scientists should be allowed to report statistically insignificant data. [Emphasis on “should” in original.]”
Ellenberg (2014, 118–120) [full citation in “References”, below] states the objection well:
So: significance. In common language it means something like “important” or “meaningful”. But the significance test that scientists use [which is the one you learn in class] doesn’t measure importance. When we’re testing the effect of a new drug, the null hypothesis is that there is no effect at all; so to reject the null hypothesis is merely to make a judgment that the effect of the drug is not zero. But the effect could still be very small — so small that the drug isn’t effective in any sense that an ordinary non-mathematical Anglophone would call significant.
He goes on to describe a 1995 government warning to doctors, based on a study showing the risk of a venous thrombosis (a possibly fatal blood clot) was doubled by taking certain new types of birth-control pills. 12% of women stopped taking their birth control as a result, and there were 26,000 more pregnancies in England and Wales than the year before. And since many of them were unplanned, there were also 13,600 more abortions. But when you dig deeper, the risk was 1 in 7,000 with the old pills and 2 in 7,000 with the new pills. Yes, the risk was twice as great, but in Ellenberg’s words, “twice a tiny number is a tiny number. [His emphasis.]” Though the study found a statistically significant result, considering the benefits of the newer type pills, almost every woman who was taking them should continue. In other words, the “significant” study result should not actually cause women to change their behavior.
What can you do? Compute a confidence interval instead of a hypothesis test — or in addition to one, if you can’t get out of a significance test. Here’s Ellenberg again (2014, 158) [full citation in “References”, below]:
The confidence interval is the range of hypotheses … that are reasonably consistent with the outcome you actually observed. [In a case when H0 is ‘no change’,] the confidence interval might be the range from +3% to +17%. The fact that zero, the null hypothesis, is not included in the confidence interval is just to say that the results are statistically significant in the sense we described earlier in the chapter.
But the confidence interval tells you a lot more. An interval of +3% to +17% licenses you to be confident that the effect is positive, but not that it’s particularly large. An interval of +9% to +11%, on the other hand, suggests much more strongly that the effect is not only positive but sizable.
The confidence interval is also informative in cases where you don’t get a statistically significant result — that is, where the confidence interval contains zero. If the confidence interval is −0.5% to +0.5%, then the reason you didn’t get statistical significance is because [your data provide] good evidence the intervention doesn’t do anything. If the confidence interval is −20% to +20%, the reason you didn’t get statistical significance is because you have no idea whether the intervention has an effect, or in which direction it goes. These two outcomes look the same from the viewpoint of statistical significance, but have quite different implications for what you should do next.
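To make the four situations in that passage concrete, here is a sketch that computes normal-approximation 95% intervals. The estimates and standard errors are invented so that the intervals come out close to the ones Ellenberg mentions. They are not from any real study, and `confidence_interval` is my own helper.

```python
from statistics import NormalDist

def confidence_interval(estimate, std_error, level=0.95):
    """Normal-approximation interval: estimate +/- z * standard error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error

# Made-up (estimate, standard error) pairs, in percentage points
studies = {
    "wide, positive": (10.0, 3.57),
    "narrow, positive": (10.0, 0.51),
    "narrow around zero": (0.0, 0.255),
    "wide around zero": (0.0, 10.2),
}
for name, (estimate, std_error) in studies.items():
    low, high = confidence_interval(estimate, std_error)
    significant = not (low <= 0.0 <= high)   # zero outside the interval
    print(f"{name:<20} {low:+6.1f}% to {high:+6.1f}%  significant: {significant}")
```

The last two rows both come out “not significant”, but one interval is tight around zero and the other is wide open, which is the distinction a bare p-value hides.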
And in the case of the birth-control study, and studies about any type of risk, don’t just publicize the risk ratio, which was doubled risk in this case, but give the actual risk, 2/7,000 versus 1/7,000.
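The two ways of reporting the same finding can be put side by side. This sketch uses only the 1 in 7,000 and 2 in 7,000 figures quoted above. The variable names are mine.

```python
from fractions import Fraction

old_risk = Fraction(1, 7000)   # old pills, as quoted above
new_risk = Fraction(2, 7000)   # new pills, as quoted above

risk_ratio = new_risk / old_risk        # the headline number
risk_difference = new_risk - old_risk   # extra risk per woman
print(f"risk ratio: {risk_ratio}")
print(f"absolute increase: {float(risk_difference):.5%}")
print(f"one extra case per {1 / risk_difference} women")
```

“Doubled risk” and “one extra case per 7,000 women” describe the same data, but they leave very different impressions.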