Stats without Tears
12. Tests on Counted Data
Updated 21 Jan 2015
Copyright © 2012–2017 by Stan Brown
In Chapter 10 you learned about hypothesis tests, using one sample of numeric or binomial data to test a hypothesis about a population mean or proportion. In Chapter 11, you extended that to inferences about the difference between two numeric or binomial populations. With binomial data, you have counts of success and failure: two categories for one or two populations.
But what if you have more categories or populations? That’s when you use the tests in this chapter. The hypothesis tests will use the same seven steps that you already know and love, but with a new test statistic called chi-squared. Your data will be counts of members of the sample that fall into particular categories of one or two variables.
Suppose you have one population divided into three or more categories — there are ≥ 3 possible non-numeric responses from each subject. For example, instead of monitoring whether each patient had a heart attack or not (two possibilities), you might monitor whether each person had a fatal heart attack, a non-fatal heart attack, or no heart attack (three possibilities).
When there were only two possibilities, you could talk about the proportion of successes in the population, because failure was the only other possibility. If you knew about successes, you knew about failures. The population proportion of successes, p, was the population parameter.
But when you have three or more possibilities, that goes out the window. Knowing the proportion of one category in the population doesn’t tell you the proportions of the others. So instead of testing against a particular proportion, you test against all the proportions at once. You have a probability model in mind, and you perform a goodness-of-fit (GoF) test to compare the data and the model. If the data are too far away from the model, you reject the model. This is a standard hypothesis test, but you’ll learn that you compute the p-value from a new distribution called χ².
As usual, I’ll show you the theory first, and then you’ll do calculations the easy way.
The M&M Mars Web site used to give the color distribution of plain M&Ms as 24% blue, 13% brown, 16% green, 20% orange, 13% red, and 14% yellow. My Spring 2011 class counted the colors of 628 plain M&Ms and computed the sample proportions, as shown at right. Obviously their percentages differ from the company model. But are they different enough to let the class reject the company’s model?
The company’s model is your null hypothesis H0. The alternative hypothesis H1 is that the company model is wrong. Let’s use a significance level of 0.05.
Caution: As always, to apply the analysis techniques you need a simple random sample, and the class didn’t have that. The Fun Size M&Ms packs that they analyzed were bought from the same store on the same day and almost certainly came from one small part of one production run. Although this wasn’t a true random sample, I’m going to proceed as though it was, to show you the method.
By now you know the drill. Samples vary, so just about any real-life sample will be different from the theoretical expectation in H0. The question is always the same: Is pure sample variability enough to account for the difference between H0 and this sample, or is there some real effect here beyond that?
For each data type, you have a method to figure a test statistic and a p-value. The test statistic is a standardized measure of the discrepancy between H0 and the sample, taking sample size into account; the p-value is the probability of getting that sample, or one even further away from H0, if random chance is the only thing going on. (Most software and statistical calculators compute the test statistic and p-value for you at the same time.)
So far you know two test statistics, z and t. Can you use one of them on this problem? There’s the obvious choice of performing six z tests of proportion on the six colors. But in the immortal words of Richard Nixon in Watergate, “We could do that, but it would be wrong.”
Why would it be wrong? Well, you’re doing a hypothesis test at 0.05 significance, right? That means that you can live with a one in twenty chance of a Type I error, calling the model bad when it’s actually good. But if you do a 0.05 significance test of each color, then you have a 0.05 chance of a Type I error on blue, a 0.05 chance of a Type I error on brown, and so forth. Suddenly your real significance level is almost 0.30, which is ridiculously high. (It’s not quite equal to 6×0.05 because you might get Type I errors on more than one color, and also because the colors aren’t independent.)
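To put numbers on that argument, here is a small Python sketch. It assumes, purely for illustration, that the six tests are independent; as noted above, the colors are not, so treat the second figure as a rough guide rather than the true error rate.

```python
# Chance of at least one Type I error across six separate tests at alpha = 0.05.
# ASSUMPTION (illustration only): the six tests are independent,
# which the M&M colors are not.
alpha = 0.05
tests = 6

upper_bound = tests * alpha                  # the simple sum, 6 x 0.05
if_independent = 1 - (1 - alpha) ** tests    # 1 minus the chance of no errors at all

print(f"upper bound:    {upper_bound:.2f}")
print(f"if independent: {if_independent:.3f}")
```

Either way the chance of a false alarm is several times the 0.05 you signed up for.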
Never do multiple tests on the same data, because that makes Type I errors way more likely than you can live with. You must do a single overall test of the model as a whole, and that means a new test statistic. It’s called χ² or chi-squared.
The χ² computation will look a little weird: you have to deal with each category because there are no summary statistics like x̅ and s to help you along. But I’ll walk you through it and you’ll see that it’s not too bad, really.
How to pronounce χ² or chi-squared: The first consonant is roughly a k sound, so you can pronounce χ or chi as Kyle without the l sound. χ² rhymes with “high-chaired”. If you want to get technical — and you know I do — the Greek letter sounds similar to Yiddish ch in l’chaim or Scottish ch in loch. It’s definitely not English ch as in church.
χ is not an X, by the way, even though it looks like one. Greek words beginning with χ are written with ch in English — words like chiropractor and chronology. The Greek letter with the x sound is ξ, spelled xi and pronounced ksee, but it doesn’t figure in this course.
Okay, that’s enough Greek class. We now return you to statistics.
|Color||Model||Observed (O)||Model × sample size =||Expected (E)|
|Blue||24%||127||24% × 628 =||150.7|
|Brown||13%||63||13% × 628 =||81.6|
|Green||16%||122||16% × 628 =||100.5|
|Orange||20%||147||20% × 628 =||125.6|
|Red||13%||93||13% × 628 =||81.6|
|Yellow||14%||76||14% × 628 =||87.9|
The key concept in testing a probability model against data is expected count. Samples never actually match a model, but what would a sample with this same size look like if it did? Well, if colors are supposed to be distributed in 24% blue, 13% brown, and so on, then a perfect match within 628 M&Ms would be distributed in 24% blue, 13% brown, and so on. The expected counts are computed in the table at the right.
The observed column is counts, which means whole numbers. But E’s are averages in a sense — what’s the average number of blues you’d expect if you took many, many samples of size 628 and the company’s 24% is correct? 150.7 — so they don’t need to be whole numbers and typically are not. As you can see, even carrying E’s to one decimal place there’s a slight rounding error, 627.9 versus 628; rounding to whole numbers would give a bigger rounding error. Software and calculators avoid this issue, by carrying more precision internally than they display.
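If you'd like to check the expected counts away from the calculator, the arithmetic can be sketched in a few lines of Python. The variable names are my own; the percentages and the sample size of 628 come from the example.

```python
# Expected counts: sample size times each model percentage.
n = 628  # M&Ms counted by the class
model_pct = {"Blue": 24, "Brown": 13, "Green": 16,
             "Orange": 20, "Red": 13, "Yellow": 14}

expected = {color: n * pct / 100 for color, pct in model_pct.items()}

for color, e in expected.items():
    print(f"{color:7s} {e:7.2f}")

# Carrying full precision, the expected counts add up to the sample size.
print("total:", round(sum(expected.values()), 2))
```

Carrying two decimal places instead of one is enough here to make the total come out to 628.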
In a goodness-of-fit test, your data are counts, just as they were in tests on binomial data. So it’s no surprise that the requirements for GoF are similar to the requirements for binomial data.
You need a random sample (or equivalent) that is less than 10% of the population. But there are more than two categories in the model, so instead of a success/failure condition you have a condition on the expected counts (E): The expected count in every category must be ≥ 5. (Some authors use a looser requirement, that none of the E’s can be below 1, and no more than 20% of them can be below 5.)
By the way, make sure you actually have counted data. (Dave Bock [see “Sources Used” at end of book] calls this the Counted Data Condition.) Sometimes students try to do a chi-squared test on sample means, but the chi-squared distribution is just for counts of categorical data.
What do you do if your E’s are too small? You can combine smaller categories, if the combination seems reasonable. For example, suppose you’re studying some characteristic of people based on their home state. You could combine adjacent small states like Connecticut–Rhode Island and Delaware–Maryland. But it’s best to plan ahead and not get into this position. Your smallest E will come from your sample size times your smallest model category. Just plan for a large enough sample size to make that product ≥5.
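As a planning aid, the "sample size times smallest model category" rule can be turned around to give a minimum sample size. This is only a sketch, using the M&M model's smallest category (13%) as the example.

```python
import math

smallest_pct = 13   # smallest category in the M&M model, in percent
min_expected = 5    # requirement: every expected count must be at least 5

# Smallest whole-number sample size n with n * 13% >= 5
n_needed = math.ceil(min_expected * 100 / smallest_pct)
print(n_needed)   # 39
```

The class's sample of 628 is far above that minimum, which is why none of the expected counts come anywhere near 5.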
Eyeballing the observed and expected numbers doesn’t really tell you much. What you need is a single number that shows the overall badness of fit and can be related to a standardized distribution.
To find this, you take the difference between observed and expected, square it so that it’s always positive, and then divide by expected to scale the effect size by the sample size. Do this for each row, and the result is called the “χ² contribution” for that row. Add up the rows and you have your χ² test statistic. The computations are shown at right, and for this sample and this model you have χ² = 19.42. This is a standardized measure of how far the model and the data disagree.
All these computations are summarized in the formula χ² = ∑(O−E)²/E, where the summation is over categories, not individual data points.
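Here is the same formula as a short Python sketch, applied to the M&M counts. The function name chi_squared is my own choice. It runs the computation twice: once with full-precision expected counts, and once with the one-decimal E's used in the hand calculation.

```python
observed = [127, 63, 122, 147, 93, 76]     # blue, brown, green, orange, red, yellow
model_pct = [24, 13, 16, 20, 13, 14]       # company model, in percent

n = sum(observed)                          # 628
expected = [n * pct / 100 for pct in model_pct]

def chi_squared(obs, exp):
    """Sum of (O - E)^2 / E over the categories."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

exact = chi_squared(observed, expected)
hand = chi_squared(observed, [round(e, 1) for e in expected])

print(f"full precision: {exact:.2f}")      # 19.44
print(f"one-decimal Es: {hand:.2f}")       # 19.42
```

The small gap between the two results is nothing but rounding in the expected counts.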
Let’s pause to talk a little about the χ² distribution.
The chi-squared distribution was developed independently by Ernst Carl Abbe (1840–1895, German) in 1863, by Friedrich Robert Helmert (1843–1917, German) in 1875, and by Karl Pearson (1857–1936, English) in 1900. The name “chi-squared” is due to Pearson, who also invented the goodness-of-fit test.
|χ² distributions, all on the same scale of χ² = 0 to 16|
And it also makes sense in terms of what you’re testing. Higher χ² represent poorer matches between model and data. χ² = 0 would mean that the data match the model exactly, which is extremely rare. Negative χ² would mean that the data and model are better than a perfect match, which obviously can’t happen.
You might be interested to know that the mean of the distribution equals df, the mode is at df−2, and the median is about df−2/3. And then again, you might not.
The TI-83 has the χ² distribution in the DISTR menu, so if you needed to you could compute the p-value as χ²cdf(19.42,10^99,5). But in practice your calculator will give you the p-value automatically, the same way it does in z and t tests.
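If you're curious what that right-tail area looks like without a TI, here is a sketch in standard-library Python. The helper name chi2_right_tail is hypothetical, and it works only for whole-number degrees of freedom: it starts from the known area for df = 1 or df = 2 and steps up two degrees of freedom at a time.

```python
import math

def chi2_right_tail(x, df):
    """Area under the chi-squared curve to the right of x (whole-number df)."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Starting value: df = 2 when df is even, df = 1 when df is odd.
    if df % 2 == 0:
        a, area = 1.0, math.exp(-h)
    else:
        a, area = 0.5, math.erfc(math.sqrt(h))
    # Each pass adds the term that raises df by 2, until the requested df is reached.
    while a < df / 2.0:
        area += h ** a * math.exp(-h) / math.gamma(a + 1)
        a += 1.0
    return area

p_value = chi2_right_tail(19.42, 5)
print(f"p = {p_value:.4f}")   # p = 0.0016
```

That plays the same role as the calculator's χ²cdf with a huge upper bound: all the area from the test statistic out to infinity.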
So much for the theory. But how will you test goodness of fit in practice? This section runs through the complete hypothesis test. There’s still some commentary, but the stuff in boxes is what you’d actually write for a quiz or homework.
With goodness of fit, there’s no single population parameter to test for. (If you want to get technical, the population parameter is a probability distribution.) So you state the hypotheses in words, but usually including the model:
H0: The 24:13:16:20:13:14 color distribution is correct.
H1: The color distribution on the Web site is incorrect.
Nothing new here:
|(2)||α = 0.05|
Here you have a choice. The MATH200A Program is easiest to use, and also saves you work with several other statistics procedures. If you don’t have the program, follow the procedure in Testing Goodness of Fit on TI-83/84.
If you have a calculator in the TI-89 family, please see Testing Goodness of Fit on TI-89.
Put the model numbers in L1, not the total. The model can be percentages or ratios. For example, with the M&Ms you can enter 24, 13, 16, and so on, or .24, .13, .16, and so on; it doesn’t matter as long as you’re consistent. Similarly, if you have a 9:3:3:1 model you can enter 9, 3, 3, 1 or 9/16, 3/16, 3/16, 1/16.
Put the observed counts in L2 — counts, not ratios or percentages. Never enter the total.
Press [PRGM], then the number you see for MATH200A, then [ENTER].
Dismiss the splash screen and press [6] to select the GoF test. The confirmation screen asks you if you’ve entered the two necessary lists. If you have, continue from that screen.
The program performs the computations and graphs the χ² curve, also showing the p-value, test statistic, and degrees of freedom. (In this case the graph looks blank because the p-value is so small.)
You might notice that the test statistic is 19.44, not 19.42 as computed earlier. That’s because the calculator keeps many digits of precision, avoiding problems with rounding.
The program tells you how many of the categories have expected counts below 5. (If any are below 5, it also tells you how many are below 1, but we don’t use that information in this book.) See Requirements Check below.
df=5, χ²=19.44, p=0.0016
The program computes the expected counts, and places them in L3. As discussed above, you must not have any E’s below about 5 to be sure that the test procedure is valid.
I said “about 5”. If the results screen shows one or more E’s below 5, look at L3 to see how far below 5. One expected count just a little below 5 is not necessarily a fatal flaw in the test.
This is the same for every type of hypothesis test.
|(5)||p < α. Reject H0 and accept H1.|
At the 0.05 level of significance, the color distribution on the Web site is incorrect. [Or, “... the color distribution on the Web site is inconsistent with the data.”]
The color distribution on the Web site is inconsistent with the data (p = 0.0016).
If you reject H0, can you say anything about which categories are most “responsible” for the overall deviation from the model? Yes. DeVeaux, Velleman, Bock (2009, 699–700) [see “Sources Used” at end of book] suggest that you can look at the standardized residuals (observed−expected)/√expected. These are essentially z-scores, and you recall that z has only a 5% chance of being outside ±2 if the null hypothesis is true.
MATH200A part 6 already computes the squares of the residuals for you in list L4. The square of ±2 is 4, so when you look at list L4 after running the program, you can be pretty sure that any row with a value above 4 indicates a category that doesn’t match the model. (It’s more complicated, but that’s a decent rule of thumb.)
In this example, brown and green (rows 2 and 3) have squared residuals above 4. Therefore, for those colors, the differences between this sample and the model are probably significant. Remember that L2 is the observed counts in the sample, and L3 is the expected counts from the model for this sample size. You can see that there were significantly fewer browns than expected, and significantly more greens than expected. You might be a little suspicious of blue (row 1) and orange (row 4), but this sample’s differences from the model are probably not significant.
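For readers without the program, the residual check can be sketched in Python. The cutoff of ±2 is the rule of thumb from the text, not an exact test, and the variable names are my own.

```python
import math

colors = ["Blue", "Brown", "Green", "Orange", "Red", "Yellow"]
observed = [127, 63, 122, 147, 93, 76]
model_pct = [24, 13, 16, 20, 13, 14]
n = sum(observed)

flagged = []
for color, o, pct in zip(colors, observed, model_pct):
    e = n * pct / 100
    z = (o - e) / math.sqrt(e)      # standardized residual
    if abs(z) > 2:                  # same as a squared residual above 4
        flagged.append(color)
    print(f"{color:7s} residual {z:6.2f}   squared {z * z:5.2f}")

print("outside ±2:", flagged)       # ['Brown', 'Green']
```

The sign of each residual tells you the direction: negative means fewer than the model expects, positive means more.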
Even so, you can’t simply do 1-PropZTest on each category after rejecting H0 on your GoF test, because that would greatly increase your chance of a Type I error above your stated α. More advanced textbooks will suggest alternatives, such as adjusting the significance level or taking a new sample.
You may be wondering about computing a confidence interval. You can’t just do 1-PropZInt confidence intervals on the category proportions. A confidence interval is the complement of a hypothesis test, so multiple confidence intervals on the same data have the same problem as multiple hypothesis tests.
Confidence intervals can be computed for individual categories or the overall model, but the techniques are beyond the scope of this course. If you’re interested, please look at Confidence Intervals for Goodness of Fit. It shows how to make the calculations and includes an Excel workbook with instructions.
“A problem which frequently arises is that of testing the agreement between observation and hypothesis.” — Bulmer (1979, 154) [see “Sources Used” at end of book]
The 9:3:3:1 ratio for crosses is pretty basic in genetics, when two independent traits are involved. Here the traits are green or red eyes and having wings or not.
Dabes and Janik (1999, 273) [see “Sources Used” at end of book] give some data for the hybrid offspring of fruit flies; see figures at right. The flies were randomly selected. Your task is to determine whether this cross follows the 9:3:3:1 model or not. Use α=0.05.
Suggestion: Stop reading at this point, and try to write out all the steps on your own, using the preceding example as a model if you need to. Then compare your work to what follows.
You can read about the 9:3:3:1 ratio in many places such as Wikipedia’s Mendelian Inheritance [see “Sources Used” at end of book]: scroll down to “Law of Independent Assortment (the ‘Second Law’)”. A Web search for “9:3:3:1” will bring up plenty more.
H0: The fruit flies follow the 9:3:3:1 model.
H1: The fruit flies do not follow the model.
|(2)||α = 0.05|
df=3, χ²=2.45, p=0.4838
|(5)||p > α. Fail to reject H0.|
At the 0.05 level of significance, we can’t say whether the fruit flies follow the 9:3:3:1 model or not.
We can’t say whether the fruit flies follow the 9:3:3:1 model or not (p = 0.4838).
(Again, if you don’t have the program you can follow the procedure in Testing Goodness of Fit on TI-83/84.)
While it’s true that this one experiment gave no conclusion, science wouldn’t stop there. You know that the scientific method calls for experiments to be replicated. Now, when the experiment is repeated, either H0 will be rejected or it will fail to be rejected. Here’s how those possibilities interact with what you’ve learned about writing conclusions.
There’s one caveat, though. Experiments at the 0.05 significance level will reject H0 in about one case in twenty where it’s actually true. So while a “reject H0” deserves a lot of respect, if it’s one result out of dozens we can’t take it on its own as enough to overthrow H0.
How can we do that on the basis of multiple experiments when we can’t do it from one experiment? Well, remember what “fail to reject H0” means: either H0 is actually true, or it’s actually false but this experiment’s sample happened not to show it. If it was actually false, we would expect most experiments to reject it. But as test after test fails to disprove H0, we grow more and more confident that it’s not going to be disproved.
For this reason, in scientific contexts the conclusion after failing to reject H0 is often written in terms like “the data are not inconsistent with the model” or even “we were unable to rule out the model.” The scientists are not accepting the null hypothesis here; they’re writing for a technical audience that understands what a “fail to reject H0” means. When you’re writing for a general audience, stick to neutral language when you fail to reject H0.
A store manager always has to decide how to use limited shelf space or freezer space most effectively. The store currently carries four brands of veggie burgers, and the manager wants to know if customers have a preference. (This is the last store in America that has not computerized its inventory.) She randomly selects a week, and finds the following sales figures: 145 Brand B, 195 Brand G, 189 Brand Q, and 153 Brand V. At the 0.05 level of significance, can you say that customers have equal or different preference for the brands?
You’re not explicitly given a model, so you have to develop one. But “equal preferences” must mean the expected counts are all equal, or in other words the numbers in the model are all equal. You could enter ¼:¼:¼:¼, or 1:1:1:1, or any numbers as long as it’s four equal numbers.
H0: Consumers have equal preference for the four brands.
H1: Consumers have unequal preference for the four brands.
Comment: Students often write these backwards. Remember that H0 is always some variation on “nothin’ special goin’ on here.” Preference for one brand over another would be something, so that must be H1.
|(2)||α = 0.05|
df=3, χ²=11.14, p=0.0110
|(5)||p < α. Reject H0 and accept H1.|
Consumers in general, at this store anyway, do have unequal preferences among the four brands (p = 0.0110).
At the 0.05 level of significance, we can say that consumers in general do have unequal preferences among the four brands.
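The whole veggie-burger calculation can also be sketched in standard-library Python. The names gof_test and chi2_right_tail are hypothetical helpers, not part of any calculator program, and the p-value helper handles only whole-number df.

```python
import math

def chi2_right_tail(x, df):
    """Right-tail area of the chi-squared distribution (whole-number df)."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        a, area = 1.0, math.exp(-h)
    else:
        a, area = 0.5, math.erfc(math.sqrt(h))
    while a < df / 2.0:
        area += h ** a * math.exp(-h) / math.gamma(a + 1)
        a += 1.0
    return area

def gof_test(observed, model):
    """Goodness of fit: the model can be ratios or percentages, not totals."""
    n = sum(observed)
    expected = [n * m / sum(model) for m in model]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return df, chi2, chi2_right_tail(chi2, df), expected

sales = [145, 195, 189, 153]                          # brands B, G, Q, V
df, chi2, p, expected = gof_test(sales, [1, 1, 1, 1]) # equal-preference model
print(f"df={df}, chi-squared={chi2:.2f}, p={p:.4f}")
# df=3, chi-squared=11.14, p=0.0110
print("smallest expected count:", min(expected))      # requirements check
```

Entering the model as 1:1:1:1 works because the function scales the model by its own total, just as you can enter ratios or percentages in L1.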
So what should the manager do? The χ² test shows that brand preferences aren’t equal, and Brand B is clearly the loser in this sample, but is that really enough to throw out Brand B? I wouldn’t. Its χ² contribution is below the threshold discussed in Residuals, above. And it did sell only eight units less than Brand V; that’s just a 5% difference. Maybe in another week it might sell more.
What the manager can do, now that the finger of suspicion is pointed at Brand B, is make another study — this time maybe taking two random weeks — and focus on just Brand B as a proportion of total veggie-burger sales. If they’re all equal, every brand would have 25%, so the manager might want to drop Brand B if a one-proportion test shows that less than say 20% of all sales are Brand B. But again, this would need to be a new sample, not just a 1-PropZTest on the data from this sample. You should never perform multiple significance tests on the same data.
People may not always agree on whether a given situation is a test for independence or homogeneity, but that’s okay because the two tests are identical in every way; it’s just a matter of how you phrase your conclusions to match what you tested.
“In 1970, SCM surveyed 150 office managers in three states to see if typewriter brand preference varies between states.” The quote and the table at right are from Dabes and Janik (1999, 274) [see “Sources Used” at end of book]. They didn’t say, but presumably this was a random sample. They go on to ask, “Do [the] above indicate that brand preference depends on state? ... α = 0.05.”
(A “typewriter” was a Stone Age piece of office equipment, sort of like a keyboard and printer fused into some sort of bizarre hybrid. Believe it or not, in the 1970s every business had several, and many homes had at least one. They were popular gifts for high-school graduation!)
The question seems clear: Does typewriter brand preference vary among states? But be careful in your thinking! The question is not asking whether preference varies among the managers surveyed. Obviously it does: NY has 50%–50%, PA has 63%–37%, and CT has 75%–25%. The question is whether this sample lets us conclude that brand preference varies among all managers in the three states. The first is descriptive statistics; this is inferential statistics.
But how to analyze it? Well, you have three populations, office managers in the three states. And you have one attribute, preferred brand. So you need to do a test of independence.
As always, the first step is to set up your hypotheses. Recall that H0 is always some variant of “Nothin’ goin’ on here” or “no effect”. So your null hypothesis must be that brand preference is independent of state, and the alternative naturally is that brand preference depends on (is associated with, varies by, is not independent of) state.
What do you do to come up with a test statistic and a p-value? As usual, the calculator is your friend. But as usual, first I’ll take you on a little tour so that you understand what you’re testing. ☺
Just like goodness of fit, two-way tables are analyzed using the χ² distribution. So you are once more concerned with the differences between observed and expected, and χ² will be the sum of
(observed − expected)² / expected
just as it was in goodness of fit. But the computation of “expected” is a bit more complicated.
What is meant by “expected” for this two-way table? Well, in the overall sample IBM was preferred 60–40 over SCM: 90/150=60%, 60/150=40%. So if brand preference doesn’t vary by state — if H0 is true — you would expect that same 60–40 split in each state.
Once you’ve got that, it’s just a matter of applying the 60–40 split to each state. In New York, 60% of 70 is 42, and 40% of 70 is 28. (Conventionally, the expected numbers are written in parentheses in the cell, under the observed numbers.) Pennsylvania and Connecticut are filled in the same way in the table at right.
This table just happens to have whole numbers for all the expected counts. But it’s possible and okay for expected numbers to be decimals.
expected = (row total) × (column total) / (grand total)
For example, the expected count for NY IBM is 70×90/150 = 42, and similarly for all the others. This is a neat formula, but then you never get to see the real point, which is that equal percentage split among all the populations.
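The shortcut formula is easy to try out in Python. This sketch uses the typewriter counts, with rows for NY, PA, CT and columns for IBM, SCM; the variable names are my own.

```python
observed = [[35, 35],    # New York:     IBM, SCM
            [25, 15],    # Pennsylvania
            [30, 10]]    # Connecticut

row_totals = [sum(row) for row in observed]          # [70, 40, 40]
col_totals = [sum(col) for col in zip(*observed)]    # [90, 60]
grand_total = sum(row_totals)                        # 150

# expected = (row total) x (column total) / (grand total)
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for row in expected:
    print(row)
```

Each row of the result shows the same 60–40 split, which is exactly what the null hypothesis claims.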
The expected counts are how you test requirements. They’re exactly the same as for goodness of fit: random sample less than 10% of population, with all expected counts at least 5. (Again, some authors require only that none of the E’s can be below 1, and no more than 20% of them can be below 5.)
What do you do if your E’s are too small? You can combine smaller categories or smaller populations, if the combination seems reasonable. For example, if this was a four-row table and included Rhode Island, but the RI expected counts were too low, you could combine CT and RI since they’re adjacent small states in coastal New England. (If you had 50 states, you wouldn’t combine Rhode Island and Wyoming, because they’re geographically and demographically different.) But it’s best to plan ahead and not get into this position. You should make some kind of guess about how the percentages will work out, and then plan a large enough sample in each population based on the percentages you expect.
|State||IBM||SCM|
|New York||(35−42)²/42 = 1.17||(35−28)²/28 = 1.75|
|Pennsylvania||(25−24)²/24 = 0.04||(15−16)²/16 = 0.06|
|Connecticut||(30−24)²/24 = 1.50||(10−16)²/16 = 2.25|
Eyeballing the observed and expected numbers doesn’t really tell you much. You can see that observed and expected are pretty close in PA but further apart in NY and CT. But what you can’t see is whether that difference is too great to be purely the result of random sample selection. For that, you need to compute χ² in each of the six cells and add them up. Those computations are shown in the table here.
Add up those six numbers and you have your test statistic: χ² = 6.77.
Degrees of freedom is a bit different from the goodness-of-fit case. You might expect df to be 6−1=5, but for two-way tables it’s actually
df = (rows−1) × (columns−1)
You don’t count the total row and total column. For this table, df = (3−1)×(2−1) = 2.
Finally, computing χ²cdf(6.77,∞,2) gives p-value = 0.0339.
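Putting the pieces together, here is a standard-library Python sketch of the whole two-way computation for the typewriter survey. The helper chi2_right_tail is a hypothetical name and handles only whole-number df.

```python
import math

def chi2_right_tail(x, df):
    """Right-tail area of the chi-squared distribution (whole-number df)."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        a, area = 1.0, math.exp(-h)
    else:
        a, area = 0.5, math.erfc(math.sqrt(h))
    while a < df / 2.0:
        area += h ** a * math.exp(-h) / math.gamma(a + 1)
        a += 1.0
    return area

observed = [[35, 35], [25, 15], [30, 10]]   # NY, PA, CT by IBM, SCM
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total
        chi2 += (o - e) ** 2 / e            # this cell's contribution

# Totals are not counted as rows or columns.
df = (len(observed) - 1) * (len(observed[0]) - 1)
p = chi2_right_tail(chi2, df)
print(f"df={df}, chi-squared={chi2:.2f}, p={p:.4f}")
# df=2, chi-squared=6.77, p=0.0339
```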
So much for the theory. But how will you perform the test in practice? This section runs through the complete hypothesis test. There’s still some commentary, but the stuff in boxes is what you’d actually write for a quiz or homework.
With independence and homogeneity (two-way tables), there’s no single population parameter to test for. So you state the hypotheses in words. In a test of independence, H0 is always independence, and H1 is always dependence.
H0: Brand preference doesn’t vary by state [or, is independent of state, is not associated with state, etc.].
H1: Brand preference varies by state [or, is dependent on state, is not independent of state, is associated with state, etc.].
Nothing new here:
|(2)||α = 0.05|
Here you have a choice. The MATH200A Program is a bit easier to use and gives more information, but you can also use the native TI-83/84 test called χ²-Test. (The TI-89 family has its own equivalent.) Either way, start by putting the observed numbers into a matrix, as follows:
If your calculator has a [MATRX] key, press it. But you probably don’t, so press [2nd] [x⁻¹] for MATRX. You get the matrix menu, similar to the one shown at right.
Use [►] to reach the EDIT menu, select matrix A, and press [ENTER]. You’re then prompted for the numbers of rows and columns, not including the total row and total column. As you enter the number of rows and number of columns, the matrix changes shape to match.
Enter the observed numbers. When you press the [ENTER] key after entering each number, the calculator automatically moves to the next cell.
After you fill matrix A with the observed numbers, it’s time to perform the calculations.
With the MATH200A program (recommended):
As soon as you make the selection, the program begins computing. (It assumes that you have the observed counts in matrix A; it knows how to compute the expected numbers and puts them in matrix B automatically.) The results look like this:
You’ll notice that the program tells you that it put some information in matrix C. Under Residuals and More, below, we’ll look at that.
If you’re not using the program:
Chances are good that Observed and Expected will already show [A] and [B]. If not, change either or both. Select Calculate and press [ENTER].
In addition to a random sample less than 10% of population, you need all the E’s to be at least 5. Don’t just say this without checking, because sooner or later you’ll have a case where that’s not true.
With the MATH200A program (recommended):
The results screen (above) shows you how many categories had expected counts below 5. A piece of cake! The expected counts are stored in matrix B in case you need to look at them.
(For classes that use the looser requirement, if any E’s are below 5 then the program tells you how many are below 1.)
If you’re not using the program:
The calculator puts the expected counts in matrix B while doing the χ²-Test. View matrix B from the matrix menu. In this case you see that all expected counts are above 5.
This is the same for every type of hypothesis test.
|(5)||p < α. Reject H0 and accept H1.|
At the 0.05 level of significance, brand preference does vary by [or is dependent on, associated with, not independent of] state.
Brand preference does vary by [or is dependent on, associated with, not independent of] state (p = 0.0339).
Just as with the goodness-of-fit test, if you reject the null hypothesis then you can look at the standardized residuals, which are
(observed − expected) / √expected
A standardized residual outside of ±2 is probably significant. The χ² contributions are the squares of the standardized residuals, so a χ² contribution above 4 is probably significant.
The MATH200A program shows you the χ² contributions and more. (If you used the TI’s native χ²-Test and want this information, you have to compute it for yourself.) The additional information is in matrix C. Here’s what it looks like for the typewriter survey. (I’ve pasted screens together to save the effort of scrolling.)
Unfortunately it’s not possible to put captions in a matrix, but here’s your guide to interpreting it. There are three regions: the χ² contributions, the row and column totals, and the row and column percentages. The following paragraphs show and explain those.
The χ² contributions are in the upper left corner of the matrix, with the size and shape of the original matrix. The original matrix had three rows and two columns, so you want to look at the top left 3×2.
As I mentioned at the start of this section, if you’re able to reject H0 then a χ² contribution >4 is probably significant. In this problem, your p-value was <α and you did reject H0, but the χ² contributions are all well under 4. How do you interpret this? There is indeed some variation in brand preference among states, but you can’t tell just where it is. Isn’t this kind of a paradox? Maybe, but I would say instead that the sample was large enough to show that some effect existed, but not large enough to show the details of the effect. If you repeat the survey with a larger sample, you might be able to learn more.
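If you used the native χ²-Test and want the contributions anyway, a few lines of Python will produce them. This is a sketch with my own variable names; the threshold of 4 is the rule of thumb from the text.

```python
import math

states = ["NY", "PA", "CT"]
observed = [[35, 35], [25, 15], [30, 10]]   # columns: IBM, SCM
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

contributions = []
for i, row in enumerate(observed):
    line = []
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total
        residual = (o - e) / math.sqrt(e)    # standardized residual
        line.append(residual ** 2)           # chi-squared contribution
    contributions.append(line)
    print(states[i], [round(c, 2) for c in line])

largest = max(max(line) for line in contributions)
print("largest contribution:", round(largest, 2))   # 2.25
```

The largest contribution is well short of 4, which is why the test can say that some difference exists without saying where it is.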
“But wait!” I hear you say. “Isn’t it obvious? NY was 50–50, PA was just over 60–40, and CT was 75–25. Isn’t it obvious that CT is very different from NY?” Yes, it is — in the sample. But you don’t know whether it’s true in the population. For all you know, the NY sample just happened to under-represent IBM lovers and over-represent SCM lovers, and CT the opposite, so that in another sample the proportions might be reversed. You simply don’t have enough information to draw more detailed conclusions.
The row and column totals are in the next section, in an extra column and an extra row. This particular problem gave them to you, but many problems do not, and you’d be surprised how hard it is to grasp a problem and interpret a result without this information.
Here you see that the three states’ sample sizes were 70, 40, and 40; the overall preference for the two manufacturers was 90 and 60. Of course whether you add down or across, you get the same 150 for overall sample size.
Finally, you have the row and column percentages in the last column and last row. What is this telling you? NY was 46.7% of the whole sample, and PA and CT were each 26.7%. IBM lovers were 60% of the whole sample, and SCM lovers were 40%. The two 0’s are just space fillers; the 100 at the lower right reminds you that the row percentages and column percentages add up to 100% in either direction.
Why do you care about the row and column percentages? Because they explain what the null hypothesis means. The null hypothesis is that brand preference is the same among states. So if the null hypothesis is true, then NY, PA, and CT all have the same 60–40 split between IBM and SCM that the overall sample has. (I figured the percentages by hand in Computing Expected Counts (E’s) above.)
You can read the percentages in the other direction, too. It doesn’t have much use in this particular problem, but you can do it. NY was 46.7% of the whole sample, so if H0 is true then 46.7% of the IBM lovers and 46.7% of the SCM lovers should be in the NY sample. Similarly, if H0 is true then PA and CT should each have 26.7% of the IBM lovers and 26.7% of the SCM lovers. And if you look back at the matrix of expected counts you’ll see that it matches. (Hey, I told you it wasn’t very useful in this situation! But there are others where it can be useful to read the table down or across.)
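The arithmetic behind the totals, the percentages, and the two directions of reading can be sketched in a few lines of Python. This uses only the row and column totals quoted above; the dictionary names are my own, chosen for illustration.

```python
# Totals quoted in the text for the typewriter survey.
row_totals = {"NY": 70, "PA": 40, "CT": 40}
col_totals = {"IBM": 90, "SCM": 60}
n = sum(row_totals.values())  # overall sample size

# Row and column percentages of the whole sample.
row_pct = {state: 100 * t / n for state, t in row_totals.items()}
col_pct = {brand: 100 * t / n for brand, t in col_totals.items()}

# Reading across: under H0 each state has the overall brand split.
expected_across = {
    state: {brand: t * col_pct[brand] / 100 for brand in col_totals}
    for state, t in row_totals.items()
}

# Reading down: under H0 each brand has the overall state split.
expected_down = {
    state: {brand: col_totals[brand] * row_pct[state] / 100
            for brand in col_totals}
    for state in row_totals
}

for state in row_totals:
    print(state, round(row_pct[state], 1), expected_across[state])
```

Either direction gives the same expected counts, because both reduce to row total × column total ÷ overall total.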
If your two-way table has two rows and two columns, you’re testing the proportions in two populations. That’s a case you had back in Chapter 11. In this situation, you can do a χ² test or a 2-proportion z test; you’ll get the same p-value from both. But if you want to know the size of the difference between the two population proportions, you have to do a 2-proportion z interval. (There is such a thing as a confidence interval in the χ² procedure, but it’s pretty gnarly and we don’t study it in this course.)
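Here is a small Python sketch of why the two procedures agree on a 2×2 table. The counts (45 of 100 and 30 of 100) are made up purely for illustration. The sketch assumes a two-tailed z test with a pooled proportion and a χ² test without a continuity correction, and the function names are hypothetical.

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """Two-tailed 2-proportion z test with a pooled standard error."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # area in both tails
    return z, p_value

def chi2_2x2(x1, n1, x2, n2):
    """Chi-squared test on the same data laid out as a 2x2 table."""
    observed = [[x1, n1 - x1], [x2, n2 - x2]]
    row_totals = [n1, n2]
    col_totals = [x1 + x2, (n1 - x1) + (n2 - x2)]
    total = n1 + n2
    chi2 = sum((observed[i][j] - row_totals[i] * col_totals[j] / total) ** 2
               / (row_totals[i] * col_totals[j] / total)
               for i in range(2) for j in range(2))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # right tail, df = 1
    return chi2, p_value

# Made-up counts, for illustration only.
z, p_z = two_prop_z(45, 100, 30, 100)
chi2, p_chi2 = chi2_2x2(45, 100, 30, 100)
print(round(z ** 2, 4), round(chi2, 4))
print(round(p_z, 4), round(p_chi2, 4))
```

The χ² statistic equals z², so the p-values match. What the χ² test can't give you is the direction or size of the difference, which is why you still need the z interval for that.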
It’s a persistent idea that cars manufactured on Monday are of lower quality because the workers are recovering from wild weekends. But is it true? A quality analyst randomly chose 100 records from each weekday over the past year and obtained the following results:
At the 0.05 significance level, are the proportions of defective cars different on different days?
Comment: The way I phrased it, this is a problem of homogeneity, five populations (Monday cars, Tuesday cars, and so on) with one attribute (defectiveness). But the question could just as well be asked whether likelihood of a defect depends on the day of the week when the car was manufactured. In those terms, it’s a test of independence: one population (cars) with two attributes (day manufactured, and defectiveness).
This is a good illustration of what I said in the summary: many situations can be treated equally well as tests of independence or tests of homogeneity. Fortunately, the procedure is identical; you just need to state your hypotheses and your conclusions in terms of what you were asked to test.
(1) H0: Proportions of defective cars are the same on all weekdays.
H1: Proportions of defective cars are different on different weekdays.
(2) α = 0.05
df=4, χ²=8.55, p=0.0735
(5) p > α. Fail to reject H0.
(6) At the 0.05 significance level, it’s impossible to say whether the proportions of defective cars are the same or different on different days. Or,
We can’t determine whether the proportions of defective cars are the same or different on different days (p = 0.0735).
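If you want to check the df and p-value without a calculator, here is a rough Python sketch. It assumes the table has five rows (weekdays) and two columns (defective or not), which is consistent with the df=4 reported above, though the table itself isn't shown here. The function name is hypothetical, and the closed-form tail area it uses works only when df is even.

```python
import math

def chi2_p_value_even_df(chi_squared, df):
    """Right-tail area of the chi-squared distribution, even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = chi_squared / 2
    term = 1.0
    total = 1.0
    # Sum (x/2)^k / k! for k = 0 .. df/2 - 1, then scale by exp(-x/2).
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

rows, cols = 5, 2                 # assumed layout: weekdays by outcome
df = (rows - 1) * (cols - 1)
alpha = 0.05

p = chi2_p_value_even_df(8.55, df)
print("df =", df, " p =", round(p, 4))
print("reject H0" if p < alpha else "fail to reject H0")
```

Because the sketch starts from the rounded χ²=8.55, its p-value can differ from the reported 0.0735 in the last decimal place. The decision is the same either way.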
“The cancer-producing potential of pyrobenzene (a major constituent of cigarette smoke) was tested. Eighty mice were used as a control group with no exposure to pyrobenzene, eighty more were exposed to a low dose, … and another seventy were given a high dose.”
|No tumors||One tumor||Two or more tumors||Total|
Can you prove that pyrobenzene dosage affects production of tumors, at the 0.01 significance level, or “could the above apparent difference be due to random chance?”
Source: Dabes and Janik (1999, 53–54) [see “Sources Used” at end of book]
Please stop here and write out your complete hypothesis test on paper, then check your solution against mine.
(1) H0: Pyrobenzene dosage does not affect tumor production.
H1: Pyrobenzene dosage affects tumor production.
Caution! Students sometimes write their null hypothesis as something like “random chance can account for the observed difference.” Yes, if your p-value is high, it means that random chance could explain the observed sample, but that doesn’t mean that it is the explanation. There’s always the possibility that dosage does affect tumors but this sample just happened not to show it. So write your H0 and H1 as contrasting statements about tumors and dosage, as I’ve done here.
(2) α = 0.01
df=4, χ²=19.25, p=7.012755E-4 or about 0.0007
Caution! You have a 3×3 table, not a 4×4 table. Never enter total rows or total columns in your matrix. (If you got df=9, you made that mistake.)
(5) p < α. Reject H0 and accept H1.
(6) At the 0.01 significance level, we can conclude that pyrobenzene dosage level does affect production of tumors. Or,
Pyrobenzene dosage level does affect production of tumors (p = 0.0007).
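The df caution in this solution comes down to one line of arithmetic, sketched here in Python. The function name is hypothetical; the formula is the usual (rows − 1) × (columns − 1) for a two-way table.

```python
def degrees_of_freedom(n_rows, n_cols):
    """df for a two-way table, counting data rows and columns only."""
    return (n_rows - 1) * (n_cols - 1)

# Three dosage groups by three tumor outcomes: the correct 3x3 table.
print(degrees_of_freedom(3, 3))   # 4

# Mistakenly entering the total row and total column makes it 4x4.
print(degrees_of_freedom(4, 4))   # 9
```

So if your calculator reports df=9, that's a quick sign the totals were entered as data.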
Comment: Why can we make a statement of cause here, rather than the weaker “is associated with”? Answer: This was an experiment, and the mice were either genetically identical or randomly assigned to the three groups, or both.
Every course has to draw the line somewhere and leave out lots of interesting things, and this one is no exception. Inferential Statistics Cases lists quite a few cases that we don’t have time to study in this course. For those who are interested, here are handouts for some of the cases that I had to leave out:
See Confidence Intervals for Goodness of Fit, which includes Excel workbooks to do the calculations.
You can do Inferences about One Population Standard Deviation. That page includes an Excel workbook for the calculations, or you can use MATH200B Program part 5.
It’s possible to test the means of three or more populations, and your calculator already contains the needed command. See One-Way ANOVA.
You can actually do a hypothesis test and confidence interval about the correlation coefficient of a population, as explained in Inferences about Linear Correlation. That page includes an Excel workbook for the calculations (including where the decision point numbers come from), or you could use MATH200B Program part 6.
The slope and intercept you computed for your regression line are actually random variables. If you had a different sample, you would have come up with a different regression line. For hypothesis test and confidence interval about the regression line, see Inferences about Linear Regression. There’s an Excel workbook for calculations, or you can use MATH200B Program part 7.
(The online book has live links.)
The observed counts are usually different from the expected counts. The χ² statistic measures how different, as one number for the whole model. The p-value says how likely this difference (or a greater difference) is if H0 is true.
H0 is that the model is consistent with the data, and H1 is that it’s not consistent. Use MATH200A Program part 6.
H0 is that the variables are independent or the population proportions are all equal; H1 is that the variables are not independent or some population proportions are different from others.
Use MATH200A Program part 7 or
Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.
Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.
Democrats and Republicans (random sample) were surveyed for their opinions on gun control, and the results are shown in the table at right. Based on this sample, does a person’s opinion on gun control depend on party affiliation, at the 0.05 level of significance?
Obviously these particular children preferred some occupations over others. Test whether their preferences reflect a real difference among all first graders, at the 0.05 significance level.
|Age at Menarche||Never||Once a week||2–4 times a week|
|Data were adapted from Kuzma and Bohnenblust (2005, 224) [see “Sources Used” at end of book].|
Test, at the 0.01 level, whether age at menarche is independent of egg consumption.
|≥ 61||19 %||33|
In Jury Selection, George Michailides (n.d.a) [see “Sources Used” at end of book] quotes a study of the age breakdown of grand jurors in Alameda County, California, in 1973.
It’s pretty obvious that this sample doesn’t match the age distribution of the county, but is the discrepancy too great to be random chance? Choose an appropriate significance level.
(This isn’t a random sample, but that’s okay because you’re not asked to generalize to a population. For the same reason, the “≤10% of population” requirement doesn’t apply.)
|Now residing in|
|Raised in||< 10,000||10,000–49,999||≥ 50,000||Total|
|Day −7 to 0||Day 0 to 5||Cold||No|
|CO2 extract||CO2 extract||40||5||45|
|60% extract||60% extract||42||10||52|
|20% extract||20% extract||48||4||52|
Some withdrew or were excluded from the study for another reason. The final results are shown at right. The common cold is annoying but rarely fatal, so use α = 0.01 to determine whether Echinacea makes a difference in the likelihood of catching a cold.