# Confidence Intervals for Goodness of Fit

Copyright © 2011–2020 by Stan Brown

Can you compute a confidence interval for a goodness-of-fit model? Yes, you can. The multinomial confidence interval is
a more general form of the
binomial. In the binomial
you have two possible outcomes, success and failure; in the
multinomial you have three or more possible outcomes. You could compute
*k* independent confidence intervals for the *k* possible
outcomes using an adjusted confidence level. But that treats the
outcomes as independent, and they’re not: an M&M that is red
can’t also be yellow. A second approach — a better
approach, I think — is to treat the model as a whole.
Then you
**compute which models could be consistent with your sample data.**

This page explains both methods, with examples.

**See also:**
Two Excel workbooks are provided to help with calculations:
the M&Ms example (35 KB), and
the general case for up to 11 categories
(36 KB).

**Plain M&Ms**

| Color | Model | Observed | p̂ |
|---|---|---|---|
| Blue | 24% | 127 | 20% |
| Brown | 13% | 63 | 10% |
| Green | 16% | 122 | 19% |
| Orange | 20% | 147 | 23% |
| Red | 13% | 93 | 15% |
| Yellow | 14% | 76 | 12% |
| Total | 100% | 628 | 100% |

My Spring 2011 evening class counted the colors of plain M&Ms and compared the color distribution to the model on the company’s Web site. Their data are in the table above. They computed χ² = 19.44 and p-value = 0.0016, so they rejected the model. But what models would be possible from their data, with 95% confidence?
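If you'd like to check the class's χ² and p-value without a spreadsheet, here is a minimal Python sketch. It uses only the counts and model percentages from the table. The function names are my own, and the tail-area routine relies on the standard closed-form series for the χ² distribution with whole-number degrees of freedom, so treat it as an illustration rather than a replacement for the workbook.

```python
import math

# Observed counts and model proportions from the Plain M&Ms table.
observed = {"Blue": 127, "Brown": 63, "Green": 122,
            "Orange": 147, "Red": 93, "Yellow": 76}
model = {"Blue": 0.24, "Brown": 0.13, "Green": 0.16,
         "Orange": 0.20, "Red": 0.13, "Yellow": 0.14}


def chi_square_stat(observed, model):
    """Sum of (O - E)^2 / E, where E = n * (model proportion)."""
    n = sum(observed.values())
    return sum((observed[c] - n * model[c]) ** 2 / (n * model[c])
               for c in observed)


def chi_square_sf(x, df):
    """Right-tail area of the chi-squared distribution, integer df >= 1."""
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^j / j! for j = 0 .. df/2 - 1
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2) / j
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc term plus a finite series in x / (1 * 3 * 5 * ...)
    term, total = 1.0, 0.0
    for j in range((df - 1) // 2):
        if j > 0:
            term *= x / (2 * j + 1)
        total += term
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * total)


chi2 = chi_square_stat(observed, model)
df = len(observed) - 1          # six colors, so five degrees of freedom
p_value = chi_square_sf(chi2, df)
print(f"chi-squared = {chi2:.2f}, df = {df}, p-value = {p_value:.4f}")
```

The printed values should agree with the χ² = 19.44 and p-value = 0.0016 quoted above.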

**Caution**: As always, to apply the analysis techniques
you need a simple random sample, and the class didn’t have that.
The Fun Size M&Ms packs that they analyzed were bought from the
same store on the same day and almost certainly came from one small
part of one production run.
So they didn’t actually prove that
the company’s model is wrong, but for the sake of illustrating
the methods I’m going to proceed as if they had.

At first, you might think a 95% CI for the model is a piece of cake: just do a 95% CI for the proportion of blues, then a 95% CI for the proportion of browns, and so on. But the problem with that is that you’re doing multiple confidence intervals from the same sample.

**Why is that bad?** Well, remember what a 95% confidence
interval means. Ninety-five percent of the time, the CI does contain
the true value of the population parameter, but 5% of the time it does
not. If you do six confidence intervals from the same sample, can you
say we’re 95% confident in *all* of them? No. If
there’s a 95% chance that *any one of them* is right, then clearly
the chance that *all six* are right is less than 95%.
They’re not independent, and so the simple multiplication rule
doesn’t apply exactly, but the *approximate* probability
of all six being right is 0.95⁶ ≈ 0.74.

If you want to take six confidence intervals from the same
data, and have 95% confidence that they’re
*all* correct, then the
confidence level for each of the six CIs must be the sixth root of
95%, about 99.1%. (Another common calculation
is to use 1−α/*k* as confidence level for each
of the *k* categories; this gives a similar number.)
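The arithmetic in the last two paragraphs is quick to verify. This short Python sketch (variable names are mine) computes the approximate joint confidence for six 95% intervals, the sixth-root confidence level, and the 1−α/*k* alternative:

```python
k = 6                # number of categories (colors)
overall = 0.95       # confidence level wanted for all k intervals together
alpha = 1 - overall

# Approximate chance that six separate 95% intervals are all right
naive_joint = overall ** k

# Per-interval confidence level so that the product comes back to 95%
per_interval = overall ** (1 / k)

# The 1 - alpha/k alternative mentioned in the text
alternative = 1 - alpha / k

print(f"six 95% intervals together: about {naive_joint:.2f}")
print(f"sixth root of 95%:          {per_interval:.4f}")
print(f"1 - alpha/k:                {alternative:.4f}")
```

The two per-interval levels differ only in the fourth decimal place, which is why either one gives nearly the same intervals.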

**Confidence Intervals for Six Proportions (99.1% individual, 95% aggregate)**

| Color | Model | Observed | p̂ | CI for p |
|---|---|---|---|---|
| Blue | 24% | 127 | 20% | 16 to 24% |
| Brown | 13% | 63 | 10% | 7 to 13% |
| Green | 16% | 122 | 19% | 15 to 24% |
| Orange | 20% | 147 | 23% | 19 to 28% |
| Red | 13% | 93 | 15% | 11 to 19% |
| Yellow | 14% | 76 | 12% | 9 to 16% |

You can see the calculations in the second section of the accompanying Excel workbook, where you can even change the confidence level if you wish. If you have a TI-83, 84, 89, or 92 calculator, you can do a 1-PropZInt with the confidence level .95^(1/6).
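For readers without Excel or a TI calculator, the same intervals can be sketched in Python. This assumes the Wald formula, p̂ ± z·√(p̂(1−p̂)/n), which is what 1-PropZInt computes; the function name `wald_interval` is my own.

```python
import math
from statistics import NormalDist

# Observed counts from the Plain M&Ms table.
observed = {"Blue": 127, "Brown": 63, "Green": 122,
            "Orange": 147, "Red": 93, "Yellow": 76}
n = sum(observed.values())

# Individual confidence level: sixth root of 95%, about 99.1%
level = 0.95 ** (1 / len(observed))
# Two-sided critical z for that level
z = NormalDist().inv_cdf((1 + level) / 2)


def wald_interval(count, n, z):
    """Wald interval for one proportion: p-hat plus or minus z * standard error."""
    p_hat = count / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


intervals = {color: wald_interval(count, n, z)
             for color, count in observed.items()}

for color, (low, high) in intervals.items():
    print(f"{color:<7} {100 * low:5.1f}% to {100 * high:5.1f}%")
```

Change `0.95` in the `level` line if you want a different aggregate confidence level.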

The previous solution is easy enough to compute, but
it’s a little unsatisfying.
Aside from the issue with choosing a substitute confidence level,
there’s a philosophical problem.
After all, the hypothesis test wasn’t about one color at a time
but about the model considered as a whole.
**Shouldn’t the confidence interval also consider the model as a whole?**

Well, yes, that’s certainly possible. It’s hard to think about, though. The (unknown) possible proportions for the six colors represent a region in six-dimensional space. And even if you can picture 6-D space, how would you express the results? So as a practical matter you still have to come up with lower and upper bounds for each of the six colors.

How can you do that? Well, remember that a confidence
interval is just the flip side of a hypothesis test. If the HT fails
to reject H₀, then the parameter from H₀ is within the
confidence interval. (See
Confidence Interval and Hypothesis Test if you need a
refresher in that concept.) So looking for a CI for goodness of fit
is the same as looking for the possible values of the six model
percentages that would *not* be rejected in a HT with the data
observed in the sample.
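To make the "flip side" idea concrete, here is a small Python sketch that asks whether a proposed model would survive the hypothesis test against the class's counts. The critical value 11.0705 is the standard table value of χ² for 5 degrees of freedom at α = 0.05. The `candidate_model` proportions are hypothetical numbers I made up purely for illustration, and the function names are mine.

```python
# Chi-squared critical value for df = 5, alpha = 0.05 (standard table value)
CRITICAL = 11.0705

# Observed counts in the order Blue, Brown, Green, Orange, Red, Yellow
observed = [127, 63, 122, 147, 93, 76]

# The company's published model, same order
company_model = [0.24, 0.13, 0.16, 0.20, 0.13, 0.14]

# A made-up alternative model (hypothetical, for illustration only)
candidate_model = [0.21, 0.11, 0.19, 0.22, 0.14, 0.13]


def chi_square_stat(observed, model):
    """Sum of (O - E)^2 / E with E = n * (model proportion)."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, model))


def is_consistent(observed, model, critical=CRITICAL):
    """True if the test would fail to reject this model at the 0.05 level."""
    return chi_square_stat(observed, model) <= critical


for name, model in [("company", company_model), ("candidate", candidate_model)]:
    stat = chi_square_stat(observed, model)
    verdict = "inside" if is_consistent(observed, model) else "outside"
    print(f"{name:<9} chi-squared = {stat:5.2f} -> {verdict} the 95% region")
```

Every model that passes this check belongs to the confidence region; the hard part, taken up next, is describing that whole region with one lower and one upper bound per color.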

Unfortunately, that’s not easy to compute. In
fact, you have to run **twelve optimizations** using Excel Solver or a
similar tool: one each for the minimum and maximum of each of the six
colors’ proportions. In these optimizations, you ask
“what’s the lowest proportion of blue such that, by
adjusting the other five proportions, it’s still possible to
come up with a model that is consistent with the observed sample at
the 0.05 significance level?” Then you ask that same question
about the highest possible proportion of blue, and ask those two
questions about the other colors.

In theory this should be doable with Excel Solver, but in practice I found that Excel 2010 Solver violated the constraints on several of those optimizations and gave negative percentages, for example when finding the lowest possible proportion of brown. So I used Evolver from Palisade Corporation to do the optimizations.

The third section of the accompanying Excel workbook is set up to do an optimization with either Solver or Evolver. Here are the results of the twelve optimizations with Evolver; Solver got similar results on the ones where it didn’t crash:

**95% Confidence Intervals for the Model**

| Color | Model | Observed | p̂ | CI for p |
|---|---|---|---|---|
| Blue | 24% | 127 | 20% | 15 to 26% |
| Brown | 13% | 63 | 10% | 7 to 15% |
| Green | 16% | 122 | 19% | 15 to 25% |
| Orange | 20% | 147 | 23% | 18 to 29% |
| Red | 13% | 93 | 15% | 11 to 20% |
| Yellow | 14% | 76 | 12% | 8 to 17% |
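If you don't have Solver or Evolver, the twelve optimizations can be approximated in plain Python. This is a sketch under one stated assumption, not the procedure the workbook uses. When one color's model proportion is pinned at some value *t*, χ² is smallest when the remaining proportions are split in proportion to the other colors' observed counts (a Lagrange-multiplier result), and that smallest χ² is O²/(n·t) + (n−O)²/(n·(1−t)) − n. Each bound is then the value of *t* where that smallest χ² reaches the critical value, found here by simple bisection. The function names and the hard-coded table value 11.0705 are my own choices.

```python
# Chi-squared critical value for df = 5, alpha = 0.05 (standard table value)
CRITICAL = 11.0705

# Observed counts from the Plain M&Ms table.
observed = {"Blue": 127, "Brown": 63, "Green": 122,
            "Orange": 147, "Red": 93, "Yellow": 76}
n = sum(observed.values())


def best_chi_square(count, n, t):
    """Smallest chi-squared possible when this color's proportion is fixed at t.

    Assumption: the other colors' proportions are set proportional to their
    observed counts, which minimizes chi-squared for a fixed t.
    """
    return count ** 2 / (n * t) + (n - count) ** 2 / (n * (1 - t)) - n


def bisect(f, lo, hi, iterations=100):
    """Root of f between lo and hi, assuming f changes sign exactly once."""
    f_lo = f(lo)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid     # root lies in the upper half
        else:
            hi = mid                  # root lies in the lower half
    return (lo + hi) / 2


def model_bounds(count, n, critical=CRITICAL):
    """Lowest and highest proportion still consistent with the sample."""
    p_hat = count / n

    def excess(t):
        return best_chi_square(count, n, t) - critical

    lower = bisect(excess, 1e-9, p_hat)       # search below the sample proportion
    upper = bisect(excess, p_hat, 1 - 1e-9)   # search above it
    return lower, upper


bounds = {color: model_bounds(count, n) for color, count in observed.items()}

for color, (low, high) in bounds.items():
    print(f"{color:<7} {100 * low:5.1f}% to {100 * high:5.1f}%")
```

In my check, the bounds agree with the Evolver results in the table to within rounding, but if you change the number of categories or the confidence level you must supply the matching critical value yourself.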

What’s the difference between the previous method and this one? Well, first let’s get them next to each other so that you can compare them more easily. I’m also showing one more decimal place:

**95% Confidence Intervals Compared**

| Color | Model | Observed | p̂ | CI for p (first method) | CI for p (second method) |
|---|---|---|---|---|---|
| Blue | 24% | 127 | 20% | 16.0 to 24.4% | 15.4 to 26.0% |
| Brown | 13% | 63 | 10% | 6.9 to 13.2% | 6.7 to 14.7% |
| Green | 16% | 122 | 19% | 15.3 to 23.6% | 14.7 to 25.2% |
| Orange | 20% | 147 | 23% | 19.0 to 27.9% | 18.3 to 29.5% |
| Red | 13% | 93 | 15% | 11.1 to 18.5% | 10.7 to 20.1% |
| Yellow | 14% | 76 | 12% | 8.7 to 15.5% | 8.4 to 17.1% |

Now you can see that the first set of intervals, computed as six separate proportions, are narrower and are symmetric around p̂, the sample proportions. The second set, computed using the flip side of a hypothesis test, are a bit wider, and the sample proportions are not in the centers of the intervals. This is typical with intervals where χ² is involved; see Inferences about One Population Standard Deviation for other examples.

But which one is right? Well, you pays your money and you takes your choice. Computationally they’re both right; philosophically it depends on how you want to think about a confidence interval.

After the M&Ms example, it’s not hard to generalize to any number of categories. In the accompanying workbook, I chose to stop at 11 categories (10 degrees of freedom), because goodness-of-fit tests with more categories are rare. If you have more categories than 11, you can unprotect the worksheet and insert rows where needed.

But most likely you have fewer than 11 categories. In that case, proceed as follows:

- In the first section, clear the unneeded categories at the bottom. For example, if you have eight categories, highlight cells A14:C16 and press your Delete key. Don’t try to delete rows 14 through 16: the worksheet won’t let you, and it’s set up to ignore empty rows.
- Enter your category names, model percentages, and observed counts in columns A through C.
- You can now read off the χ² statistic and the p-value for a hypothesis test.
- In the second section, specify your desired confidence level in cell B23. You can then read off the confidence intervals computed by the first method (individual binomial proportions).
- In the third section, set your desired confidence level in B42. The workbook will automatically transfer your model and observed counts from the first section to the third section.
- For each category, run one Solver or Evolver optimization to find
a minimum and one to find a maximum. That’s
one to seek a minimum for B46, one to seek a maximum for B46, one for
minimum and one for maximum of B47, and so on.
Before the first one, set the
adjustable cells to just the part of column B that actually has
numbers. For example, if you have eight categories then the
adjustable cells would be B46:B53 and you would
run sixteen optimizations.

Evolver users: If your p-value is <α, Evolver will tell you before the first optimization that a constraint was not met, and will ask to run Constraint Solver. Answer Yes. Constraint Solver will run briefly, and pop up an Optimization Complete window. Click OK, then start the Evolver optimization again. This will happen only before the first optimization, if it happens at all.

Solver users: The worksheet is protected, to keep you from accidentally deleting or changing a formula. Solver can’t cope with a protected sheet, so right-click on the Sheet1 tab at the bottom of the window, and select *Unprotect Sheet*.

This article grew out of Ray Koopman’s and Rich Ulrich’s responses in the newsgroup sci.stat.edu to my query, “Confidence interval after Chi-squared test?” in December 2010. The thread is archived here. One issue raised by Ray Koopman: the binomial confidence intervals in this article are done by the Wald method, which is what the TI-83/84 does. However, there is a lot to be said for using a Wilson or adjusted Wald calculation instead, which would result in slightly wider intervals.

- **7/10 Aug 2018**: Rewrite the Summary to emphasize multinomial versus binomial, and the non-independence of the outcomes. Make numerous improvements in the two workbooks, including automatically copying the numbers from the first to the third section, and adjusting Evolver settings for better performance. Update the workbook instructions.
- (intervening changes suppressed)
- **5 June 2011**: New article and workbook.


Updates and new info: https://BrownMath.com/stat/