# Confidence Intervals for Goodness of Fit

Copyright © 2011–2017 by Stan Brown

**Summary:**
When you test goodness of fit to a model, you’re not
testing your sample against just one number but against *k*
proportions for the *k* categories or possible responses. Can you
go backwards and ask which models could be consistent with your sample
data? Yes, but finding boundaries for *k* proportions in a model
is a little trickier than finding bounds for a single parameter. This
page explains how: first with a concrete example, then for the general
case.

**See also:**
Two Excel workbooks are provided to help with calculations:
the M&Ms example (30 KB), and
the general case for up to 11 categories
(28 KB).

**Plain M&Ms**

Color | Model | Observed | p̂
---|---|---|---
Blue | 24% | 127 | 20%
Brown | 13% | 63 | 10%
Green | 16% | 122 | 19%
Orange | 20% | 147 | 23%
Red | 13% | 93 | 15%
Yellow | 14% | 76 | 12%
Total | 100% | 628 | 100%

My Spring 2011 evening class counted the colors of plain M&Ms and compared the color distribution to the model on the company’s Web site. Their data are in the table above. They computed χ² = 19.44 and p-value = 0.0016, so they rejected the model. But what models would be possible from their data, with 95% confidence?
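For readers who would like to check the test statistic outside Excel, here is a minimal Python sketch using only the standard library. The function names `chi_square_stat` and `chi_square_sf` are made up for this illustration; in practice you would more likely use a statistics package. The p-value uses a standard recurrence for the right tail of the χ² distribution with integer degrees of freedom.

```python
import math

# Counts and model percentages from the M&Ms table.
observed = {"Blue": 127, "Brown": 63, "Green": 122,
            "Orange": 147, "Red": 93, "Yellow": 76}
model = {"Blue": 0.24, "Brown": 0.13, "Green": 0.16,
         "Orange": 0.20, "Red": 0.13, "Yellow": 0.14}


def chi_square_stat(observed, model):
    """Sum of (O - E)^2 / E, where E = n * model proportion."""
    n = sum(observed.values())
    return sum((observed[c] - n * model[c]) ** 2 / (n * model[c])
               for c in observed)


def chi_square_sf(x, df):
    """Right-tail area of the chi-square distribution (integer df >= 1).

    Recurrence: Q(df + 2) = Q(df) + (x/2)^(df/2) * e^(-x/2) / Gamma(df/2 + 1),
    starting from Q(1) = erfc(sqrt(x/2)) or Q(2) = e^(-x/2).
    """
    if df % 2:
        q, k = math.erfc(math.sqrt(x / 2)), 1
    else:
        q, k = math.exp(-x / 2), 2
    while k < df:
        q += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return q


chi2 = chi_square_stat(observed, model)
p_value = chi_square_sf(chi2, df=len(observed) - 1)  # df = categories - 1
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.4f}")
```

Run as written, this should reproduce the χ² and p-value quoted above.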

**Caution**: As always, to apply the analysis techniques
you need a simple random sample, and the class didn’t have that.
The Fun Size M&Ms packs that they analyzed were bought from the
same store on the same day and almost certainly came from one small
part of one production run.
So they didn’t actually prove that
the company’s model is wrong, but for the sake of illustrating
the methods I’m going to proceed as if they had.

At first, you might think a 95% CI for the model is a piece of cake: just do a 95% CI for the proportion of blues, then a 95% CI for the proportion of browns, and so on. But the problem with that is that you’re doing multiple confidence intervals from the same sample.

**Why is that bad?** Well, remember what a 95% confidence
interval means. Ninety-five percent of the time, the CI does contain
the true value of the population parameter, but 5% of the time it does
not. If you do six confidence intervals from the same sample, can you
say you’re 95% confident in *all* of them? No. If
there’s a 95% chance that *any one of them* is right, then clearly
the chance that *all six* are right is less than 95%.
They’re not independent, and so the simple multiplication rule
doesn’t apply exactly, but the *approximate* probability
of all six being right is 0.95⁶ ≈ 0.74.

If you want to take six confidence intervals from the same
data, and have 95% confidence that they’re all right, then the
confidence level for each of the six CIs must be the sixth root of
95%, about 99.1%. (Another common calculation
is to use 1−α/*k* as confidence level for each
of the *k* categories; this gives a similar number.)
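The arithmetic in the last two paragraphs can be checked with a few lines of Python. This is only a sketch of the calculation described above, and the variable names are chosen for illustration.

```python
k = 6            # number of categories (colors)
overall = 0.95   # desired confidence that all k intervals are right

# Chance that all six are right if each is a separate 95% CI
# (treating them as independent, which is only approximately true).
all_right = 0.95 ** k

# Per-interval confidence level from the k-th root of the overall level.
per_interval_root = overall ** (1 / k)

# Alternative: split alpha evenly, 1 - alpha/k.
alpha = 1 - overall
per_interval_split = 1 - alpha / k

print(f"all six right at 95% each: {all_right:.3f}")
print(f"sixth root of 95%:         {per_interval_root:.4f}")
print(f"1 - alpha/k:               {per_interval_split:.4f}")
```

The two per-interval levels differ only in the fourth decimal place, which is why either one is commonly used.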

**Confidence Intervals for Six Proportions (99.1% individual, 95% aggregate)**

Color | Model | Observed | p̂ | CI for p
---|---|---|---|---
Blue | 24% | 127 | 20% | 16 to 24%
Brown | 13% | 63 | 10% | 7 to 13%
Green | 16% | 122 | 19% | 15 to 24%
Orange | 20% | 147 | 23% | 19 to 28%
Red | 13% | 93 | 15% | 11 to 19%
Yellow | 14% | 76 | 12% | 9 to 16%

You can see the calculations in the second section of the accompanying Excel workbook, where you can even change the confidence level if you wish. If you have a TI-83, 84, 89, or 92 calculator, you can do a 1-PropZInt with the confidence level .95^(1/6).
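If you have neither Excel nor a TI calculator handy, the same intervals can be sketched in Python. This assumes the ordinary one-proportion z interval (the Wald method), which is what 1-PropZInt computes; the helper name `wald_interval` is invented for this example.

```python
from math import sqrt
from statistics import NormalDist

observed = {"Blue": 127, "Brown": 63, "Green": 122,
            "Orange": 147, "Red": 93, "Yellow": 76}
n = sum(observed.values())

# Per-interval confidence level: sixth root of 95%.
level = 0.95 ** (1 / len(observed))
# Two-sided critical z for that level.
z = NormalDist().inv_cdf(1 - (1 - level) / 2)


def wald_interval(count, n, z):
    """p-hat plus or minus z * sqrt(p-hat * (1 - p-hat) / n)."""
    p_hat = count / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


intervals = {color: wald_interval(count, n, z)
             for color, count in observed.items()}
for color, (lo, hi) in intervals.items():
    print(f"{color:<7} {lo:6.1%} to {hi:6.1%}")
```

Changing `0.95` in the `level` line plays the same role as changing the confidence level in the workbook.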

The previous solution is easy enough to compute, but
it’s a little unsatisfying.
Aside from the issue with choosing a substitute confidence level,
there’s a philosophical problem.
After all, the hypothesis test wasn’t about one color at a time
but about the model considered as a whole.
**Shouldn’t the confidence interval also consider the model as a whole?**

Well, yes, that’s certainly possible. It’s hard to think about, though. The (unknown) possible proportions for the six colors represent a region in six-dimensional space. And even if you can picture 6-D space, how would you express the results? So as a practical matter you still have to come up with lower and upper bounds for each of the six colors.

How can you do that? Well, remember that a confidence
interval is just the flip side of a hypothesis test. If the HT fails
to reject H₀, then the parameter from H₀ is within the
confidence interval. (See
Confidence Interval and Hypothesis Test if you need a
refresher in that concept.) So looking for a CI for goodness of fit
is the same as looking for the possible values of the six model
percentages that would *not* be rejected in a HT with the data
observed in the sample.

Unfortunately, that’s not easy to compute. In
fact, you have to run **twelve optimizations** using Excel Solver or a
similar tool: one each for the minimum and maximum of each of the six
colors’ proportions. In these optimizations, you ask
“what’s the lowest proportion of blue such that, by
adjusting the other five proportions, it’s still possible to
come up with a model that is consistent with the observed sample at
the 0.05 significance level?” Then you ask that same question
about the highest possible proportion of blue, and ask those two
questions about the other colors.
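As a rough illustration of what one pair of these optimizations does, here is a Python sketch that avoids a general-purpose optimizer. It leans on a shortcut that is not taken from the article's workbook: for any trial value of one color's proportion, the other five proportions that give the smallest χ² are proportional to their observed counts (a Lagrange-multiplier argument shows this). With that assumption, each optimization reduces to a one-dimensional search, done here by bisection. The helper names `best_model` and `bound` are hypothetical, and the critical value is the usual table value for 5 degrees of freedom at α = 0.05.

```python
observed = {"Blue": 127, "Brown": 63, "Green": 122,
            "Orange": 147, "Red": 93, "Yellow": 76}
CRITICAL = 11.0705  # chi-square critical value, df = 5, alpha = 0.05 (table value)


def chi_square_stat(observed, model):
    """Sum of (O - E)^2 / E, where E = n * model proportion."""
    n = sum(observed.values())
    return sum((observed[c] - n * model[c]) ** 2 / (n * model[c])
               for c in observed)


def best_model(observed, target, p):
    """Model with `target` fixed at p and the rest proportional to their counts.

    Assumption of this sketch: this choice of the other proportions
    minimizes chi-square for the given p.
    """
    rest = sum(v for c, v in observed.items() if c != target)
    return {c: p if c == target else (1 - p) * v / rest
            for c, v in observed.items()}


def bound(observed, target, side, tol=1e-9):
    """Bisect for the lowest (side='low') or highest (side='high')
    proportion of `target` that the test would not reject."""
    p_hat = observed[target] / sum(observed.values())
    # `inside` is always consistent with the data; `outside` is always rejected.
    inside = p_hat
    outside = tol if side == "low" else 1 - tol
    while abs(inside - outside) > tol:
        mid = (inside + outside) / 2
        stat = chi_square_stat(observed, best_model(observed, target, mid))
        if stat <= CRITICAL:
            inside = mid
        else:
            outside = mid
    return inside


for color in observed:
    lo, hi = bound(observed, color, "low"), bound(observed, color, "high")
    print(f"{color:<7} {lo:6.1%} to {hi:6.1%}")
```

Each color takes two calls to `bound`, one for the minimum and one for the maximum, which corresponds to the twelve optimizations described above.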

In theory this should be doable with Excel Solver, but in practice I found that Excel 2010 Solver violated the constraints on several of those optimizations and gave negative percentages, for example when finding the lowest possible proportion of brown. So I used Evolver from Palisade Corporation to do the optimizations.

The third section of the accompanying Excel workbook is set up to do an optimization with either Solver or Evolver. Here are the results of the twelve optimizations with Evolver; Solver got similar results on the ones where it didn’t crash:

**95% Confidence Intervals for the Model**

Color | Model | Observed | p̂ | CI for p
---|---|---|---|---
Blue | 24% | 127 | 20% | 15 to 26%
Brown | 13% | 63 | 10% | 7 to 15%
Green | 16% | 122 | 19% | 15 to 25%
Orange | 20% | 147 | 23% | 18 to 29%
Red | 13% | 93 | 15% | 11 to 20%
Yellow | 14% | 76 | 12% | 8 to 17%

What’s the difference between the two? Well, first let’s get them next to each other so that you can compare them more easily. I’m also showing one more decimal place:

**95% Confidence Intervals Compared**

Color | Model | Observed | p̂ | CI for p (first method) | CI for p (second method)
---|---|---|---|---|---
Blue | 24% | 127 | 20% | 16.0 to 24.4% | 15.4 to 26.0%
Brown | 13% | 63 | 10% | 6.9 to 13.2% | 6.7 to 14.7%
Green | 16% | 122 | 19% | 15.3 to 23.6% | 14.7 to 25.2%
Orange | 20% | 147 | 23% | 19.0 to 27.9% | 18.3 to 29.5%
Red | 13% | 93 | 15% | 11.1 to 18.5% | 10.7 to 20.1%
Yellow | 14% | 76 | 12% | 8.7 to 15.5% | 8.4 to 17.1%

Now you can see that the first set of intervals, computed as six separate proportions, are narrower and are symmetric around p̂, the sample proportions. The second set, computed using the flip side of a hypothesis test, are a bit wider, and the sample proportions are not in the centers of the intervals. This is typical with intervals where χ² is involved; see Inferences about One Population Standard Deviation for other examples.

But which one is right? Well, you pays your money and you takes your choice. Computationally they’re both right; philosophically it depends on how you want to think about a confidence interval.

After the M&Ms example, it’s not hard to generalize to any number of categories. In the accompanying workbook, I chose to stop at 11 categories (10 degrees of freedom), because goodness-of-fit tests with more categories are rare. If you have more categories than 11, you can unprotect the worksheet and insert rows where needed.

But most likely you have fewer than 11 categories. In that case, proceed as follows:

- In the first section, clear the unneeded categories at the bottom. For example, if you have eight categories, highlight cells A14:C16 and press your Delete key. Don’t try to delete rows 14 through 16: the worksheet won’t let you, and it’s set up to ignore empty rows.
- Enter your category names, model percentages, and observed counts in columns A through C.
- You can now read off the χ² statistic and the p-value for a hypothesis test.
- In the second section, specify your desired confidence level in cell B23. You can then read off the confidence intervals computed by the first method (individual binomial proportions).
- In the third section, set your desired confidence level in B42. If you have fewer than 11 categories, delete extra numbers in column B, starting with the last one before the total. Adjust the remaining numbers so that they total 100%; it doesn’t matter whether they’re realistic or not.
- For each category, run one Solver or Evolver optimization to find a minimum and one to find a maximum. Before the first one, set the adjustable cells to just the part of column B that actually has numbers. For example, if you have eight categories then the adjustable cells would be B46:B53.

This article grew out of Ray Koopman’s and Rich Ulrich’s responses in the newsgroup sci.stat.edu to my query, “Confidence interval after Chi-squared test?” in December 2010. The thread is archived here. One issue raised by Ray Koopman: the binomial confidence intervals in this article are done by the Wald method, which is what the TI-83/84 does. However, there is a lot to be said for using a Wilson or adjusted Wald calculation instead, which would result in slightly wider intervals.

- **4 Jan 2016**: Convert workbooks to Excel 2007–2016 format, and change TC3 and OakRoadSystems links to point to BrownMath.com.
- (intervening changes suppressed)
- **5 June 2011**: New article and workbook.

Updates and new info: http://BrownMath.com/stat/