Updated 4 Jan 2016

Confidence Intervals for Goodness of Fit

Copyright © 2011–2017 by Stan Brown

Summary: When you test goodness of fit to a model, you’re not testing your sample against just one number but against k proportions for the k categories or possible responses. Can you go backwards and ask which models could be consistent with your sample data? Yes, but finding boundaries for k proportions in a model is a little trickier than finding bounds for a single parameter. This page explains how: first with a concrete example, then for the general case.

See also: Two Excel workbooks are provided to help with calculations: the M&Ms example (30 KB), and the general case for up to 11 categories (29 KB).


CI for M&Ms: Separate Proportions

Plain M&Ms

My Spring 2011 evening class counted the colors of plain M&Ms and compared the color distribution to the model on the company’s Web site. Their data are at the right. They computed χ² = 19.44 and p-value = 0.0016, so they rejected the model. But what models would be possible from their data, with 95% confidence?
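For readers who want to check the test outside a spreadsheet, here is a minimal Python sketch of the goodness-of-fit calculation. The counts and model percentages are the ones tabulated in this article; the helper name `chi2_sf` is my own, and it is a small standard-library stand-in for a χ² upper-tail function, not anything taken from the workbook.

```python
import math

# Observed counts and the company's model, as tabulated in this article.
colors   = ["Blue", "Brown", "Green", "Orange", "Red", "Yellow"]
model    = [0.24, 0.13, 0.16, 0.20, 0.13, 0.14]
observed = [127, 63, 122, 147, 93, 76]

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for chi-squared with df degrees of freedom.

    Hypothetical helper: uses the closed-form sums that exist for integer df.
    """
    h = x / 2
    if df % 2 == 0:
        # Even df: exp(-h) * sum of h^k / k! for k = 0 .. df/2 - 1
        term = total = math.exp(-h)
        for k in range(1, df // 2):
            term *= h / k
            total += term
        return total
    # Odd df: start from the df = 1 case, then add one term per extra 2 df.
    total = math.erfc(math.sqrt(h))
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-h) * h ** (k - 0.5) / math.gamma(k + 0.5)
    return total

n = sum(observed)
expected = [p * n for p in model]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
p_value = chi2_sf(chi2, len(observed) - 1)   # df = categories - 1

print(f"n = {n}, chi-squared = {chi2:.2f}, p-value = {p_value:.4f}")
```

Run as written, this should reproduce the χ² and p-value quoted above, to the precision shown there.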

Caution: As always, to apply the analysis techniques you need a simple random sample, and the class didn’t have that. The Fun Size M&Ms packs that they analyzed were bought from the same store on the same day and almost certainly came from one small part of one production run. So they didn’t actually prove that the company’s model is wrong, but for the sake of illustrating the methods I’m going to proceed as if they had.

At first, you might think a 95% CI for the model is a piece of cake: just do a 95% CI for the proportion of blues, then a 95% CI for the proportion of browns, and so on. But the problem with that is that you’re doing multiple confidence intervals from the same sample.

Why is that bad? Well, remember what a 95% confidence interval means. Ninety-five percent of the time, the CI does contain the true value of the population parameter, but 5% of the time it does not. If you do six confidence intervals from the same sample, can you say you’re 95% confident in all of them? No. If there’s a 95% chance that any one of them is right, then clearly the chance that all six are right is less than 95%. They’re not independent, and so the simple multiplication rule doesn’t apply exactly, but the approximate probability of all six being right is 0.95⁶ ≈ 0.74.

If you want to take six confidence intervals from the same data, and have 95% confidence that they’re all right, then the confidence level for each of the six CIs must be the sixth root of 95%, about 99.1%. (Another common calculation is to use 1−α/k as confidence level for each of the k categories; this gives a similar number.)
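The arithmetic behind those confidence levels is short enough to check directly. This is only an illustration; the variable names are mine, and the 1−α/k rule is the one usually called the Bonferroni correction.

```python
k = 6             # number of categories (M&M colors)
overall = 0.95    # desired confidence that all k intervals are right
alpha = 1 - overall

# Approximate joint confidence if each of the k intervals is a 95% CI
joint_if_each_95 = overall ** k

# Per-interval confidence level by the k-th root rule
per_ci_root = overall ** (1 / k)

# Per-interval confidence level by the 1 - alpha/k rule
per_ci_bonferroni = 1 - alpha / k

print(f"joint confidence with six 95% CIs: {joint_if_each_95:.2f}")
print(f"per-CI level, sixth root:          {per_ci_root:.4f}")
print(f"per-CI level, 1 - alpha/k:         {per_ci_bonferroni:.4f}")
```

The two per-interval levels differ only in the fourth decimal place, which is why either rule gives essentially the same intervals.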

Confidence Intervals for Six Proportions
(99.1% individual, 95% aggregate)

Color     Model   Observed   Observed %   CI for p
Blue      24%     127        20%          16 to 24%
Brown     13%      63        10%           7 to 13%
Green     16%     122        19%          15 to 24%
Orange    20%     147        23%          19 to 28%
Red       13%      93        15%          11 to 19%
Yellow    14%      76        12%           9 to 16%

You can see the calculations in the second section of the accompanying Excel workbook, where you can even change the confidence level if you wish. If you have a TI-83, 84, 89, or 92 calculator, you can do a 1-PropZInt with the confidence level .95^(1/6).
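If you have neither the workbook nor a TI calculator handy, the same intervals can be sketched in Python. This assumes the Wald formula that 1-PropZInt uses, p̂ ± z·√(p̂(1−p̂)/n), with the sixth-root confidence level; the function name `wald_interval` is just a label I chose.

```python
import math
from statistics import NormalDist

colors   = ["Blue", "Brown", "Green", "Orange", "Red", "Yellow"]
observed = [127, 63, 122, 147, 93, 76]
n = sum(observed)

# Individual confidence level so that all six together are about 95%
level = 0.95 ** (1 / len(observed))

# Two-sided critical z for that level
z = NormalDist().inv_cdf(1 - (1 - level) / 2)

def wald_interval(count, n, z):
    """Wald interval for one proportion: p-hat plus or minus z standard errors."""
    p_hat = count / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

intervals = {c: wald_interval(x, n, z) for c, x in zip(colors, observed)}

for c, (lo, hi) in intervals.items():
    print(f"{c:<7} {100 * lo:5.1f}% to {100 * hi:4.1f}%")
```

Changing `0.95` in the `level` line plays the same role as changing the confidence level in the workbook.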

CI for M&Ms: Overall Model

The previous solution is easy enough to compute, but it’s a little unsatisfying. Aside from the issue with choosing a substitute confidence level, there’s a philosophical problem. After all, the hypothesis test wasn’t about one color at a time but about the model considered as a whole. Shouldn’t the confidence interval also consider the model as a whole?

Well, yes, that’s certainly possible. It’s hard to think about, though. The (unknown) possible proportions for the six colors represent a region in six-dimensional space. And even if you can picture 6-D space, how would you express the results? So as a practical matter you still have to come up with lower and upper bounds for each of the six colors.

How can you do that? Well, remember that a confidence interval is just the flip side of a hypothesis test. If the HT fails to reject H0, then the parameter from H0 is within the confidence interval. (See Confidence Interval and Hypothesis Test if you need a refresher in that concept.) So looking for a CI for goodness of fit is the same as looking for the possible values of the six model percentages that would not be rejected in a HT with the data observed in the sample.

Unfortunately, that’s not easy to compute. In fact, you have to run twelve optimizations using Excel Solver or a similar tool: one each for the minimum and maximum of each of the six colors’ proportions. In these optimizations, you ask “what’s the lowest proportion of blue such that, by adjusting the other five proportions, it’s still possible to come up with a model that is consistent with the observed sample at the 0.05 significance level?” Then you ask that same question about the highest possible proportion of blue, and ask those two questions about the other colors.

In theory this should be doable with Excel Solver, but in practice I found that Excel 2010 Solver violated the constraints on several of those optimizations and gave negative percentages, for example when finding the lowest possible proportion of brown. So I used Evolver from Palisade Corporation to do the optimizations.
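Outside a spreadsheet, the twelve optimizations can be sketched in plain Python. This is not how the workbook does it. It rests on one assumption of mine that the article does not state: for a fixed proportion q of one color, χ² is smallest when the other five proportions share the remaining 1−q in proportion to their observed counts (a Lagrange-multiplier result). Under that assumption each optimization collapses to finding where a one-variable function crosses the critical value. The names `min_chi2`, `bisect`, `model_bounds`, and `CRIT` are hypothetical.

```python
colors   = ["Blue", "Brown", "Green", "Orange", "Red", "Yellow"]
observed = [127, 63, 122, 147, 93, 76]
n = sum(observed)

# Standard table value: chi-squared critical value for df = 5, upper tail 0.05
CRIT = 11.0705

def min_chi2(q, count, n):
    """Smallest chi-squared of any model that gives this color proportion q.

    Assumption: the other colors are scaled in proportion to their observed
    counts, so they act like a single "everything else" category.
    """
    this_color = (count - n * q) ** 2 / (n * q)
    all_others = ((n - count) - n * (1 - q)) ** 2 / (n * (1 - q))
    return this_color + all_others

def bisect(f, a, b, tol=1e-9):
    """Root of f between a and b, where f(a) and f(b) have opposite signs."""
    fa = f(a)
    while b - a > tol:
        m = (a + b) / 2
        fm = f(m)
        if (fa < 0) == (fm < 0):
            a, fa = m, fm     # root is in the upper half
        else:
            b = m             # root is in the lower half
    return (a + b) / 2

def model_bounds(count, n, crit):
    """Lowest and highest proportion for which some model is not rejected."""
    p_hat = count / n
    excess = lambda q: min_chi2(q, count, n) - crit
    lower = bisect(excess, 1e-9, p_hat)        # search below the sample proportion
    upper = bisect(excess, p_hat, 1 - 1e-9)    # search above it
    return lower, upper

bounds = {c: model_bounds(x, n, CRIT) for c, x in zip(colors, observed)}

for c, (lo, hi) in bounds.items():
    print(f"{c:<7} {100 * lo:5.1f}% to {100 * hi:4.1f}%")
```

For blue this gives roughly 15.4% to 26.0%. Because the search is one-dimensional and stays strictly between 0 and 1, it cannot wander into negative percentages the way a constrained six-variable optimization can.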

The third section of the accompanying Excel workbook is set up to do an optimization with either Solver or Evolver. Here are the results of the twelve optimizations with Evolver; Solver got similar results on the ones where it didn’t crash:

95% Confidence Intervals for the Model

Color     Model   Observed   Observed %   CI for p
Blue      24%     127        20%          15 to 26%
Brown     13%      63        10%           7 to 15%
Green     16%     122        19%          15 to 25%
Orange    20%     147        23%          18 to 29%
Red       13%      93        15%          11 to 20%
Yellow    14%      76        12%           8 to 17%

What’s the difference between the two? Well, first let’s get them next to each other so that you can compare them more easily. I’m also showing one more decimal place:

95% Confidence Intervals Compared

Color     Model   Observed   Observed %   CI for p (first method)   CI for p (second method)
Blue      24%     127        20%          16.0 to 24.4%             15.4 to 26.0%
Brown     13%      63        10%           6.9 to 13.2%              6.7 to 14.7%
Green     16%     122        19%          15.3 to 23.6%             14.7 to 25.2%
Orange    20%     147        23%          19.0 to 27.9%             18.3 to 29.5%
Red       13%      93        15%          11.1 to 18.5%             10.7 to 20.1%
Yellow    14%      76        12%           8.7 to 15.5%              8.4 to 17.1%

Now you can see that the first set of intervals, computed as six separate proportions, is narrower and is symmetric around p̂, the sample proportions. The second set, computed using the flip side of a hypothesis test, is a bit wider, and the sample proportions are not in the centers of the intervals. This is typical with intervals where χ² is involved; see Inferences about One Population Standard Deviation for other examples.

But which one is right? Well, you pays your money and you takes your choice. Computationally they’re both right; philosophically it depends on how you want to think about a confidence interval.

CI for Any Goodness of Fit

After the M&Ms example, it’s not hard to generalize to any number of categories. In the accompanying workbook, I chose to stop at 11 categories (10 degrees of freedom), because goodness-of-fit tests with more categories are rare. If you have more categories than 11, you can unprotect the worksheet and insert rows where needed.

But most likely you have fewer than 11 categories. In that case, proceed as follows:

  1. In the first section, clear the unneeded categories at the bottom. For example, if you have eight categories, highlight cells A14:C16 and press your Delete key. Don’t try to delete rows 14 through 16: the worksheet won’t let you, and it’s set up to ignore empty rows.
  2. Enter your category names, model percentages, and observed counts in columns A through C.
  3. You can now read off the χ² statistic and the p-value for a hypothesis test.
  4. In the second section, specify your desired confidence level in cell B23. You can then read off the confidence intervals computed by the first method (individual binomial proportions).
  5. In the third section, set your desired confidence level in B42. If you have fewer than 11 categories, delete extra numbers in column B, starting with the last one before the total. Adjust the remaining numbers so that they total 100%; it doesn’t matter whether they’re realistic or not.
  6. For each category, run one Solver or Evolver optimization to find a minimum and one to find a maximum. Before the first one, set the adjustable cells to just the part of column B that actually has numbers. For example, if you have eight categories then the adjustable cells would be B46:B53.
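With a different number of categories or a different confidence level, the cutoff that a model’s χ² must stay under changes, because the degrees of freedom are one less than the number of categories. Here is a hedged standard-library sketch for finding that cutoff; `chi2_sf` and `chi2_critical` are names I made up, and the workbook presumably gets the same numbers from Excel’s built-in functions.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for chi-squared with integer df."""
    h = x / 2
    if df % 2 == 0:
        # Even df: exp(-h) * sum of h^k / k! for k = 0 .. df/2 - 1
        term = total = math.exp(-h)
        for k in range(1, df // 2):
            term *= h / k
            total += term
        return total
    # Odd df: start from the df = 1 case, then add one term per extra 2 df.
    total = math.erfc(math.sqrt(h))
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-h) * h ** (k - 0.5) / math.gamma(k + 0.5)
    return total

def chi2_critical(confidence, df, tol=1e-10):
    """Value of chi-squared whose upper-tail area is 1 - confidence."""
    alpha = 1 - confidence
    lo, hi = 0.0, 1.0
    while chi2_sf(hi, df) > alpha:   # widen until the tail area drops below alpha
        hi *= 2
    while hi - lo > tol:             # bisect: tail area shrinks as x grows
        mid = (lo + hi) / 2
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Cutoffs at 95% confidence for 2 through 11 categories
for categories in range(2, 12):
    df = categories - 1
    print(f"{categories:2d} categories (df = {df:2d}): {chi2_critical(0.95, df):7.3f}")
```

For eight categories, for instance, you would use the df = 7 line as the largest χ² that still counts as consistent with your sample.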


This article grew out of Ray Koopman’s and Rich Ulrich’s responses in the newsgroup to my query, “Confidence interval after Chi-squared test?” in December 2010. The thread is archived here. One issue raised by Ray Koopman: the binomial confidence intervals in this article are done by the Wald method, which is what the TI-83/84 does. However, there is a lot to be said for using a Wilson or adjusted Wald calculation instead, which would result in slightly wider intervals.
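For anyone who wants to try the Wilson alternative, here is a minimal sketch using the standard Wilson score formula. It is my illustration, not part of the workbook, and `wilson_interval` is a hypothetical name. The example applies it to the blue M&Ms at the sixth-root confidence level.

```python
import math
from statistics import NormalDist

def wilson_interval(count, n, confidence):
    """Wilson score interval for one binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = count / n
    denom = 1 + z ** 2 / n
    # The center is pulled from p-hat toward 50%
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

level = 0.95 ** (1 / 6)                       # individual level for six colors
lo, hi = wilson_interval(127, 628, level)     # blue: 127 of 628
print(f"Blue, Wilson: {100 * lo:.1f}% to {100 * hi:.1f}%")
```

Unlike the Wald interval, this one is not centered on the sample proportion, so it is worth comparing both endpoints, not just the width, against the first-method table.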
