# Confidence Intervals for Goodness of Fit

Copyright © 2011–2017 by Stan Brown

**Summary:**
When you test goodness of fit to a model, you’re not
testing your sample against just one number but against *k*
proportions for the *k* categories or possible responses. Can you
go backwards and ask which models could be consistent with your sample
data? Yes, but finding boundaries for *k* proportions in a model
is a little trickier than finding bounds for a single parameter. This
page explains how: first with a concrete example, then for the general
case.

**See also:**
Two Excel workbooks are provided to help with calculations:
the M&Ms example (30 KB), and
the general case for up to 11 categories
(29 KB).

**Plain M&Ms**

| Color | Model | Observed | p̂ |
|---|---|---|---|
| Blue | 24% | 127 | 20% |
| Brown | 13% | 63 | 10% |
| Green | 16% | 122 | 19% |
| Orange | 20% | 147 | 23% |
| Red | 13% | 93 | 15% |
| Yellow | 14% | 76 | 12% |
| Total | 100% | 628 | 100% |

My Spring 2011 evening class counted the colors of plain M&Ms and compared the color distribution to the model on the company’s Web site. Their data are shown in the table above. They computed χ² = 19.44 and p-value = 0.0016, so they rejected the model. But what models would be possible from their data, with 95% confidence?
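The class’s test statistic and p-value are easy to reproduce in a short script. This sketch uses only the Python standard library: the chi-square survival function for 5 degrees of freedom is built from the upper incomplete gamma recurrence, so no statistics package is needed.

```python
from math import erfc, exp, sqrt, pi

# Model proportions and observed counts from the table above
model    = {"Blue": 0.24, "Brown": 0.13, "Green": 0.16,
            "Orange": 0.20, "Red": 0.13, "Yellow": 0.14}
observed = {"Blue": 127, "Brown": 63, "Green": 122,
            "Orange": 147, "Red": 93, "Yellow": 76}

n = sum(observed.values())                        # 628 candies
chi2 = sum((observed[c] - n * model[c]) ** 2 / (n * model[c])
           for c in model)

def chi2_sf_5df(x):
    """P(X > x) for chi-square with 5 df, via the recurrence
    Q(s+1, y) = Q(s, y) + y**s * exp(-y) / gamma(s+1), with y = x/2."""
    y = x / 2
    q = erfc(sqrt(y))                             # Q(1/2, y)
    q += sqrt(y) * exp(-y) / (sqrt(pi) / 2)       # Q(3/2, y)
    q += y ** 1.5 * exp(-y) / (3 * sqrt(pi) / 4)  # Q(5/2, y)
    return q

p_value = chi2_sf_5df(chi2)
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.4f}")
# → chi-square = 19.44, p-value = 0.0016
```

This agrees with the class’s χ² = 19.44 and p-value = 0.0016.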

**Caution**: As always, to apply the analysis techniques
you need a simple random sample, and the class didn’t have that.
The Fun Size M&Ms packs that they analyzed were bought from the
same store on the same day and almost certainly came from one small
part of one production run.
So they didn’t actually prove that
the company’s model is wrong, but for the sake of illustrating
the methods I’m going to proceed as if they had.

At first, you might think a 95% CI for the model is a piece of cake: just do a 95% CI for the proportion of blues, then a 95% CI for the proportion of browns, and so on. But the problem with that is that you’re doing multiple confidence intervals from the same sample.

**Why is that bad?** Well, remember what a 95% confidence
interval means. Ninety-five percent of the time, the CI does contain
the true value of the population parameter, but 5% of the time it does
not. If you do six confidence intervals from the same sample, can you
say you’re 95% confident in *all* of them? No. If
there’s a 95% chance that *each one of them* is right, then clearly
the chance that *all six* are right is less than 95%.
They’re not independent, and so the simple multiplication rule
doesn’t apply exactly, but the *approximate* probability
of all six being right is 0.95^6 ≈ 0.74.

If you want to take six confidence intervals from the same
data, and have 95% confidence that they’re all right, then the
confidence level for each of the six CIs must be the sixth root of
95%, about 99.1%. (Another common calculation
is to use 1−α/*k* as confidence level for each
of the *k* categories; this gives a similar number.)
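As a quick check of that arithmetic (numbers are from the paragraphs above):

```python
k = 6                       # number of categories (colors)
aggregate = 0.95            # desired overall confidence level

individual = aggregate ** (1 / k)          # sixth root of 0.95
bonferroni = 1 - (1 - aggregate) / k       # the 1 - alpha/k alternative

print(f"sixth root of 0.95: {individual:.4f}")   # ≈ 0.9915
print(f"1 - alpha/k:        {bonferroni:.4f}")   # ≈ 0.9917
```

The two choices differ only in the fourth decimal place, which is why the text calls them similar.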

**Confidence Intervals for Six Proportions (99.1% individual, 95% aggregate)**

| Color | Model | Observed | p̂ | CI for p |
|---|---|---|---|---|
| Blue | 24% | 127 | 20% | 16 to 24% |
| Brown | 13% | 63 | 10% | 7 to 13% |
| Green | 16% | 122 | 19% | 15 to 24% |
| Orange | 20% | 147 | 23% | 19 to 28% |
| Red | 13% | 93 | 15% | 11 to 19% |
| Yellow | 14% | 76 | 12% | 9 to 16% |

You can see the calculations in the second section of the accompanying Excel workbook, where you can even change the confidence level if you wish. If you have a TI-83, 84, 89, or 92 calculator, you can do a 1-PropZInt with the confidence level .95^(1/6).
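If you have neither the workbook nor a TI calculator handy, the same Wald intervals can be sketched in a few lines of Python, using `statistics.NormalDist` from the standard library for the z critical value (counts are from the table above):

```python
from math import sqrt
from statistics import NormalDist

observed = {"Blue": 127, "Brown": 63, "Green": 122,
            "Orange": 147, "Red": 93, "Yellow": 76}
n = sum(observed.values())                  # 628

level = 0.95 ** (1 / 6)                     # ≈ 99.1% per color
z = NormalDist().inv_cdf((1 + level) / 2)   # two-sided critical value

wald = {}
for color, count in observed.items():
    p_hat = count / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)   # Wald standard error
    wald[color] = (p_hat - margin, p_hat + margin)
    print(f"{color:6s} {100 * p_hat:4.1f}%  "
          f"{100 * (p_hat - margin):.1f} to {100 * (p_hat + margin):.1f}%")
```

For blue this prints 16.0 to 24.4%, matching the table; the other colors come out the same as in the workbook.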

The previous solution is easy enough to compute, but
it’s a little unsatisfying.
Aside from the issue with choosing a substitute confidence level,
there’s a philosophical problem.
After all, the hypothesis test wasn’t about one color at a time
but about the model considered as a whole.
**Shouldn’t the confidence interval also consider the model as a whole?**

Well, yes, that’s certainly possible. It’s hard to think about, though. The (unknown) possible proportions for the six colors represent a region in six-dimensional space. And even if you can picture 6-D space, how would you express the results? So as a practical matter you still have to come up with lower and upper bounds for each of the six colors.

How can you do that? Well, remember that a confidence
interval is just the flip side of a hypothesis test. If the HT fails
to reject H_{0}, then the parameter from H_{0} is within the
confidence interval. (See
Confidence Interval and Hypothesis Test if you need a
refresher in that concept.) So looking for a CI for goodness of fit
is the same as looking for the possible values of the six model
percentages that would *not* be rejected in a HT with the data
observed in the sample.

Unfortunately, that’s not easy to compute. In
fact, you have to run **twelve optimizations** using Excel Solver or a
similar tool: one each for the minimum and maximum of each of the six
colors’ proportions. In these optimizations, you ask
“what’s the lowest proportion of blue such that, by
adjusting the other five proportions, it’s still possible to
come up with a model that is consistent with the observed sample at
the 0.05 significance level?” Then you ask that same question
about the highest possible proportion of blue, and ask those two
questions about the other colors.
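If you’d rather not use Excel at all, the bounds can be approximated in code. The trick (an observation of mine, not part of the original workbook) is that once one color’s model proportion is pinned at a trial value b, χ² is minimized by making the other five model proportions proportional to their observed counts; that closed form reduces each of the twelve optimizations to a one-dimensional bisection. A sketch for the blue bounds:

```python
observed = {"Blue": 127, "Brown": 63, "Green": 122,
            "Orange": 147, "Red": 93, "Yellow": 76}
n = sum(observed.values())      # 628
CRIT = 11.0705                  # chi-square critical value, alpha = 0.05, 5 df

def min_chi2(color, b):
    """Smallest chi-square over all models that give `color` proportion b.
    Setting the gradient equal along the sum constraint shows the other
    proportions sit proportional to their observed counts."""
    rest = n - observed[color]            # observed count of the other colors
    chi2 = (observed[color] - n * b) ** 2 / (n * b)
    scale = n * (1 - b) / rest            # expected/observed ratio for the rest
    chi2 += (1 - scale) ** 2 / scale * rest
    return chi2

def bound(color, lo, hi):
    """Bisect for the b in [lo, hi] where min_chi2 crosses CRIT."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if (min_chi2(color, lo) > CRIT) == (min_chi2(color, mid) > CRIT):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

p_hat = observed["Blue"] / n
blue_lo = bound("Blue", 0.001, p_hat)    # lower 95% bound for blue
blue_hi = bound("Blue", p_hat, 0.999)    # upper 95% bound for blue
print(f"Blue: {100 * blue_lo:.1f} to {100 * blue_hi:.1f}%")
```

For blue this gives roughly 15.4 to 26.0%, close to what the Evolver optimizations found; repeating the two `bound` calls for each color fills in the rest of the table.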

In theory this should be doable with Excel Solver, but in practice I found that Excel 2010 Solver violated the constraints on several of those optimizations and gave negative percentages, for example when finding the lowest possible proportion of brown. So I used Evolver from Palisade Corporation to do the optimizations.

The third section of the accompanying Excel workbook is set up to do an optimization with either Solver or Evolver. Here are the results of the twelve optimizations with Evolver; Solver got similar results on the ones where it didn’t crash:

**95% Confidence Intervals for the Model**

| Color | Model | Observed | p̂ | CI for p |
|---|---|---|---|---|
| Blue | 24% | 127 | 20% | 15 to 26% |
| Brown | 13% | 63 | 10% | 7 to 15% |
| Green | 16% | 122 | 19% | 15 to 25% |
| Orange | 20% | 147 | 23% | 18 to 29% |
| Red | 13% | 93 | 15% | 11 to 20% |
| Yellow | 14% | 76 | 12% | 8 to 17% |

What’s the difference between the two? Well, first let’s get them next to each other so that you can compare them more easily. I’m also showing one more decimal place:

**95% Confidence Intervals Compared**

| Color | Model | Observed | p̂ | CI for p (first method) | CI for p (second method) |
|---|---|---|---|---|---|
| Blue | 24% | 127 | 20% | 16.0 to 24.4% | 15.4 to 26.0% |
| Brown | 13% | 63 | 10% | 6.9 to 13.2% | 6.7 to 14.7% |
| Green | 16% | 122 | 19% | 15.3 to 23.6% | 14.7 to 25.2% |
| Orange | 20% | 147 | 23% | 19.0 to 27.9% | 18.3 to 29.5% |
| Red | 13% | 93 | 15% | 11.1 to 18.5% | 10.7 to 20.1% |
| Yellow | 14% | 76 | 12% | 8.7 to 15.5% | 8.4 to 17.1% |

Now you can see that the first set of intervals, computed as six separate proportions, are narrower and are symmetric around p̂, the sample proportions. The second set, computed using the flip side of a hypothesis test, are a bit wider, and the sample proportions are not in the centers of the intervals. This is typical with intervals where χ² is involved; see Inferences about One Population Standard Deviation for other examples.

But which one is right? Well, you pays your money and you takes your choice. Computationally they’re both right; philosophically it depends on how you want to think about a confidence interval.

After the M&Ms example, it’s not hard to generalize to any number of categories. In the accompanying workbook, I chose to stop at 11 categories (10 degrees of freedom), because goodness-of-fit tests with more categories are rare. If you have more categories than 11, you can unprotect the worksheet and insert rows where needed.

But most likely you have fewer than 11 categories. In that case, proceed as follows:

- In the first section, clear the unneeded categories at the bottom. For example, if you have eight categories, highlight cells A14:C16 and press your Delete key. Don’t try to delete rows 14 through 16: the worksheet won’t let you, and it’s set up to ignore empty rows.
- Enter your category names, model percentages, and observed counts in columns A through C.
- You can now read off the χ² statistic and the p-value for a hypothesis test.
- In the second section, specify your desired confidence level in cell B23. You can then read off the confidence intervals computed by the first method (individual binomial proportions).
- In the third section, set your desired confidence level in B42. If you have fewer than 11 categories, delete extra numbers in column B, starting with the last one before the total. Adjust the remaining numbers so that they total 100%; it doesn’t matter whether they’re realistic or not.
- For each category, run one Solver or Evolver optimization to find a minimum and one to find a maximum. Before the first one, set the adjustable cells to just the part of column B that actually has numbers. For example, if you have eight categories then the adjustable cells would be B46:B53.

This article grew out of Ray Koopman’s and Rich Ulrich’s responses in the newsgroup sci.stat.edu to my query, “Confidence interval after Chi-squared test?” in December 2010. The thread is archived here. One issue raised by Ray Koopman: the binomial confidence intervals in this article are done by the Wald method, which is what the TI-83/84 does. However, there is a lot to be said for using a Wilson or adjusted Wald calculation instead, which would result in slightly wider intervals.
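For readers who want to try the Wilson calculation that Ray Koopman suggested, here is a sketch for the blue proportion at the same 99.1% individual level. The Wilson score formula is standard; applying it to these data is my illustration, not part of the original workbook.

```python
from math import sqrt
from statistics import NormalDist

count, n = 127, 628                     # blue M&Ms from the table
level = 0.95 ** (1 / 6)                 # same 99.1% individual level
z = NormalDist().inv_cdf((1 + level) / 2)

p_hat = count / n
denom = 1 + z * z / n                   # Wilson denominator
center = (p_hat + z * z / (2 * n)) / denom
margin = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
wilson = (center - margin, center + margin)
print(f"Wilson: {100 * wilson[0]:.1f} to {100 * wilson[1]:.1f}%")
# → Wilson: 16.3 to 24.8%
```

Notice that the Wilson interval is not centered on p̂ = 20.2%; it is pulled slightly toward 50%, which is one reason it behaves better than Wald for proportions near 0 or 1.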

**4 Jan 2016**: Convert workbooks to Excel 2007–2016 format, and change TC3 and OakRoadSystems links to point to BrownMath.com.

(intervening changes suppressed)
**5 June 2011**: New article and workbook.

Because this article helps you, please donate at BrownMath.com/donate.

Updates and new info: https://BrownMath.com/stat/