Stats without Tears
11. Inference from Two Samples
Updated 1 Jan 2016
(What’s New?)
Copyright © 2013–2019 by Stan Brown
Intro: In Chapter 10, you looked at hypothesis tests for one population, where you asked whether a population mean or proportion is different from a baseline number. In this chapter, you’ll ask “are these two populations different from each other?” (hypothesis test) and “how large is the difference?” (confidence interval).
Are your data paired or unpaired? That’s the key question when you’re doing inference on numeric data from two samples. Your answer will control how you analyze the data, so let’s look closely at the difference.
Definitions: You have unpaired data when you get one number from each individual in two unrelated groups. The two groups are known as independent samples.
Independent samples result when you take two samples completely independently, or if you take one sample and then randomly assign the members to groups. Randomization always gives you independent samples.
Example 1: What if any is the average difference in time husbands and wives spend on yard work? You randomly select 40 married men and 40 married women and find how much time a week each spends in yard work. There’s no reason to associate Man A with Woman B any more than Woman C; these are independent samples and the data are unpaired.
Example 2: How much “winter weight” does the average adult gain? You randomly select 500 adults and weigh them all during the first week of November. Then during the last week of February you randomly select another 500 adults and weigh them. The data are unpaired, and the samples are independent.
Before you read further, what’s the big problem in the design of those two studies?
Right! Our old enemy, confounding variables. Look at the examples again, and see how many you can identify. For example, what might make a random person in one sample weigh more or less than a random person in the other sample, other than the passage of time? What might make a random woman spend more or less time on yard work than a random man, apart from their genders?
With independent samples, if there’s actually a difference between the two groups, it may be swamped by all the differences within each group.
Definitions: You have paired data when each observational unit gives you two numbers. These can be one number each from a matched pair of individuals, or two numbers from one individual. Paired data come from dependent samples.
Example 3: What if any is the average difference in time husbands and wives spend on yard work? You randomly select 40 couples and find how much time a week each person spends in yard work. Each husband and wife are a matched pair. The samples are dependent because once you’ve chosen a couple you’ve equally specified a member of the “wives” sample and a member of the “husbands” sample.
Example 4: How much “winter weight” does the average adult gain? You randomly select 500 adults and weigh them all during the first week of November, then again during the last week of February. You have paired data in the before and after numbers. The two samples are dependent because they are the same individuals.
Do you see how a design with paired data (dependent samples) overcomes the big problem with unpaired data (independent samples)? You want to study weight gain, and now that’s what you’re measuring directly. You wanted to know whether husband or wife spends more time on yard work, and now you’ve eliminated all the differences between couples.
Paired data are more likely than unpaired to reveal an effect, if there is one. Why? Because a paired-data design minimizes differences within each group that can swamp any difference between groups.
In studying human development and behavior, twins are a prime source of dependent samples. If you have a pair of identical twins who were raised apart (and that’s surprisingly common), you can investigate which differences between people’s behavior are genetic and which are learned. The Minnesota Study of Twins (Bouchard 1990 [see “Sources Used” at end of book]), found that a lot of behaviors that “should” be learned seem to be genetic. The New York Times published a nontechnical account in Major Personality Study Finds That Traits Are Mostly Inherited (Goleman 1986 [see “Sources Used” at end of book]).
Sample type  Dependent  Independent, or randomized 

Numeric data type  Paired Data  Unpaired Data 
How many numbers from each experimental unit?  Two  One 
Can you rearrange★ one sample?  No  Yes 
Problem of confounding variables  Minimal  Severe 
Use this design …  … if you can  … if you must 
★If the data from the sample are arranged in two rows or two columns, can you rearrange one row or column without destroying information? 
Testing new corn versus standard corn for yield. Can you see a problem with the sample in Western New York that’s not a problem with the sample in Central New York?
Adapted from Dabes and Janik (1999, 263) [see “Sources Used” at end of book]
You’re the head of research for the Whizzo Seed Company, and you’ve developed a new type of seed that looks promising. You randomly select three farmers in Western New York to receive new corn, and three to receive your standard product. (Of course you don’t tell them which one they’re getting.) At the end of the season they report their yield figures to you.
What’s wrong with this picture? You can easily think of all sorts of confounding variables here: different soils, different weather, different insects, different irrigation, different farming techniques, and on and on. Those differences can be great enough to hide (confound) a difference between the two types of corn, especially in a small sample.
The following year, you try again in Central New York. This time you send each farmer two stocks of seed corn, with instructions to plant one field with the first stock and another field with the second.
Does that eliminate confounding variables? Maybe not totally, but it reduces them as far as possible. Now, if you see significant differences in yield between two fields planted by the same farmer, it’s almost sure to be due to differences in the seed.
You always want to structure an experiment or observation with paired data (dependent samples) — if you can.
“If you can.” Aye, there’s the rub. Suppose you want to know whether attending kindergarten makes kids do better in first grade. There’s no way to set this up as paired data: how can a given kid both go through kindergarten and not go through kindergarten? Twin studies don’t help you here, because if the twins are raised together the parents will send both of them to kindergarten, or neither; and if the twins are raised apart then there will be too many other differences in their upbringing that could affect their performance in first grade.
If the samples are independent, you can’t pair the data, even if the samples are the same size. If you’re not sure whether you have dependent or independent samples, look back at Paired and Unpaired Data Compared.
You want to determine whether a new synthetic rubber makes tires last longer than the competitor’s product. Can you see how to do this with independent samples (unpaired data) and with dependent samples (paired data)? Think about it before you read on.
For independent samples, you randomly assign drivers to receive four tires with your new rubber or four of the competitor’s tires. For dependent samples, you put two tires of one type on the left side of every driver’s car, and two on the right side of every driver’s car. (You do half the cars one way and half the other, to eliminate differences like the greater likelihood of hitting the curb on the right.)
With the first method, if there’s only a small difference between your rubber and the competitor’s, it may not show up because you’ve also got differences in driving styles, roads, and so forth — confounding variables again. With the second method, those are eliminated.
The hypothesis test is almost exactly like the Case 1 hypothesis test. The difference is that you define a new variable d (difference) in Step 1 and write hypotheses about μ_{d} instead of μ.
For a confidence interval, you’re estimating the average difference, not the average of either population. You need to state both size and direction of the effect.
You’ve probably heard about the “freshman fifteen”, the weight gain many students experience in their first year at college. The Urban Dictionary even talks about the “freshman twenty” (2004) [see “Sources Used” at end of book].
Francine wanted to know if that was a real thing or just an urban legend. During the first week of school, she got the other nine women in her chemistry class at Wossamatta U to agree to help her collect data. (She reasoned that students in any particular class would effectively be a random sample of the school, since class choice is unrelated to weight or other health issues. Of course that would be questionable for a spin class or a cooking class.)
Wossamatta U CHEM101 — Women’s Weights (in pounds)  

Student  A  B  C  D  E  F  G  H  I  J 
Sept.  118  105  123  112  107  130  120  99  119  126 
May  125  114  128  122  106  143  124  103  125  135 
When she had the data, Francine realized she didn’t know what to do next. If she had just one set of numbers, she would do a Student’s t test, since she doesn’t know the population standard deviation (SD). But what to do with two lists?
Then she had a brainstorm. She realized that she’s not trying to find out anything about students’ weights. She wants to know about their weight gain. Looking at their weights, she’d have plenty of lurking variables starting with precollege diet and lifestyle. Looking only at the weight gain minimizes or eliminates those variables, and measures just what happened to each student during freshman year. So she added a third row to her chart:
Wossamatta U CHEM101 — Women’s Weights (in pounds)  

Student  A  B  C  D  E  F  G  H  I  J 
Sept.  118  105  123  112  107  130  120  99  119  126 
May  125  114  128  122  106  143  124  103  125  135 
d = May−Sept.  7  9  5  10  −1  13  4  4  6  9 
Notice the new variable d, the difference between matched pairs. (You know the data must be paired, because each May number is associated with one and only one September number. You can’t rearrange the May numbers and still have everything make sense.) This is the heart of Case 3 in Inferential Statistics: Basic Cases: reducing paired numeric data to a simple t test of a population mean difference.
Here’s what’s new:
Now she’s all set. She has one set of ten numbers, representing the continuous variable “weight gain in freshman year” for a random sample of Wossamatta U women. (Notice with student E, Francine has a negative value for d because May minus September is 106−107 = −1. That student lost weight as a freshman.) Time for a t test!
But first, what will she test? Her original idea was to test the “freshman fifteen”. But a glance at the d’s shows her that no one gained as much as 15 lb. An average can’t be larger than every member of the data set, so there’s no way she could prove a hypothesis that the average gain is above fifteen pounds. She decides instead to try to prove a “freshman five”, μ_{d} > 5, with 0.05 significance.
When you do a confidence interval, you don’t have to make any decision of this kind because you just follow the data where they lead.
Francine subtracted by hand here, but you shouldn’t do that because it’s a rich source of errors and makes it harder to check your work. Instead, follow this procedure on your TI-83/84:
When you press [ENTER], the calculator does all the subtractions, wiping out whatever was in L3 previously.
This isn’t Excel — if you change L1 or L2 after entering the formula for L3, L3 won’t change. You need to reenter the formula for L3 in that case. (You actually can make the calculator behave like Excel by binding a formula to a list, but it’s not worth the hassle.)
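If you'd rather check the subtraction off the calculator, here is a minimal Python sketch of the same idea. The variable names (`sept`, `may`, `d`) are just illustrative; the numbers are copied from Francine's table, and the list `d` plays the role that L3 plays on the calculator.

```python
from statistics import mean, stdev

# Weights in pounds for students A through J, copied from the table above.
sept = [118, 105, 123, 112, 107, 130, 120, 99, 119, 126]
may = [125, 114, 128, 122, 106, 143, 124, 103, 125, 135]

# d = May - Sept., one difference per student (matched pairs stay together).
d = [m - s for m, s in zip(may, sept)]

d_bar = mean(d)   # sample mean difference
s_d = stdev(d)    # sample standard deviation (divides by n - 1)

print(d)
print(round(d_bar, 1), round(s_d, 1))
```

Because `zip` walks the two lists in step, each May weight is subtracted from its own September weight, which is exactly what "paired" means.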
With paired numeric data, your population parameter is the mean difference μ_{d}. The random variable is a difference (in this case, a number of pounds gained from September to May), so the parameter is the mean of all those weight gains.
(1)  d = May−September
H_{0}: μ_{d} = 5, average student gains 5 lb or less
H_{1}: μ_{d} > 5, average student gains more than 5 lb

(2)  α = 0.05 
(RC) 

(3/4) 
This is a regular TTest, number 2 in the STAT TESTS menu. Francine writes down
TTest: 5, L3, 1, >μ_{o}
results: t=1.29, p = 0.1146, d̅=6.6, s=3.9, n=10
The sample mean is d̅ (“dbar”), not x̅, because the data are d’s, not x’s. 
(5)  p > α. Fail to reject H_{0}. 
(6) 
You can’t determine whether the average Wossamatta U woman student gains more than 5 pounds in her freshman year or not (p = 0.1146).
Or, at the 0.05 significance level, you can’t determine whether the average Wossamatta U woman student gains more than 5 pounds in her freshman year or not.
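The TTest numbers in step 3/4 can be approximated without the calculator. This is only a sketch: the helper `t_upper_tail` is a hypothetical name, and it gets the p-value by numerically integrating the Student's t density with Simpson's rule, which is not necessarily how the TI-83/84 does it.

```python
import math
from statistics import mean, stdev


def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for Student's t with df degrees of freedom (t >= 0)."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    # Simpson's rule on [0, t]; steps must be even.
    h = t / steps
    total = pdf(0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - total * h / 3


d = [7, 9, 5, 10, -1, 13, 4, 4, 6, 9]   # weight gains, May - Sept.
mu_0 = 5                                 # hypothesized mean gain from H0
n = len(d)

standard_error = stdev(d) / math.sqrt(n)
t = (mean(d) - mu_0) / standard_error    # one-sample t on the differences
p = t_upper_tail(t, n - 1)               # right-tailed test, H1: mu_d > 5

print(round(t, 2), round(p, 4))
```

The printed values should land close to the calculator's t and p, and p is compared to α exactly as in step 5.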
After a “fail to reject H_{0}”, you always remember to write your conclusion in neutral language, right? Maybe the true average weight gain is greater than 5 pounds but this particular sample just happened not to show it; maybe the true average weight gain really is under 5 pounds. A confidence interval can help you get a handle on the effect size.
When a hypothesis test fails to reach a conclusion, a confidence interval can salvage at least some information. When a hypothesis test does reach a conclusion, a confidence interval can give you more precise information.
If Francine were doing only the confidence interval, she’d have to start off by testing requirements. But she has already tested them as part of the hypothesis test, so she goes right to the TInterval screen.
Which confidence level does she choose? Her one-tailed hypothesis test at α = 0.05 would be equivalent to a two-tailed test at α = 0.10, and that suggests a confidence level of 90%. But she decides since her hypothesis test has already failed to reach a conclusion she’d at least like to get a 95% CI.
TInterval: L3, 1, .95
results: (3.7948, 9.4052)
Conclusion: Francine is 95% confident that the average woman student at Wossamatta U gains 3.8 to 9.4 pounds during her freshman year.
(Francine doesn’t write down d̅, s, and n because she’s already written them in the hypothesis test. She would write them down when she does only a confidence interval.)
Common mistake: Don’t say the average weight is 3.8 to 9.4 pounds. You aren’t estimating the average first-year woman’s weight, but her weight gain.
Always reread your conclusion after you write it, and ask yourself whether it seems reasonable in the context of the problem. That can save you from mistakes like this.
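Francine's interval is just d̅ plus or minus a margin of error, and you can sketch that arithmetic in a few lines. One assumption to flag: the critical value `T_STAR` below is typed in from a standard t table (95% confidence, 9 degrees of freedom) rather than computed.

```python
import math
from statistics import mean, stdev

d = [7, 9, 5, 10, -1, 13, 4, 4, 6, 9]   # weight gains, May - Sept.
n = len(d)

# Assumption: t* for 95% confidence with df = n - 1 = 9, read from a t table.
T_STAR = 2.262

margin = T_STAR * stdev(d) / math.sqrt(n)   # margin of error
low, high = mean(d) - margin, mean(d) + margin

print(round(low, 1), round(high, 1))
```

Both endpoints are weight gains in pounds, not weights, which is the point of the "common mistake" warning above.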
Person  Before  After 

1  78  83 
2  64  66 
3  70  77 
4  71  74 
5  70  75 
6  68  71 
A few years back, a coffee company tried to market drinking coffee as a way to relax — and they weren’t talking about decaf. Jon decided to test this. He randomly selected six adults. He recorded their heart rates, then recorded them again half an hour after each person drank two cups of regular coffee. His data are shown at right. (Data come from Dabes and Janik [1999, 264] [see “Sources Used” at end of book].)
The data are paired, because each person (experimental unit) gives you two numbers, Before and After; because each After is associated with one specific Before; and because you can’t rearrange Before or After and still have the data make sense.
Jon selected the 0.01 significance level. (He tests for difference even though he believes coffee increases heart rate, because it could decrease it.)
Jon could equally well define d as Before−After or After−Before. At least, mathematically he could. But you’ll find it’s easier to interpret results if you always define d as high minus low so that all or most of the d’s will be positive numbers. (You can do this based on your common sense or by looking at the data.) Jon sees that the After numbers are generally larger than the Before numbers, so he chooses d = After−Before.
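To see that the choice really is only a matter of convenience, here is a small sketch that computes the t statistic both ways from Jon's data. The function name `paired_t` is made up for this illustration.

```python
import math
from statistics import mean, stdev

# Heart rates for persons 1 through 6, copied from Jon's table.
before = [78, 64, 70, 71, 70, 68]
after = [83, 66, 77, 74, 75, 71]


def paired_t(first, second, mu_0=0):
    """t statistic for d = first - second, tested against a mean difference mu_0."""
    d = [a - b for a, b in zip(first, second)]
    return (mean(d) - mu_0) / (stdev(d) / math.sqrt(len(d)))


t_after_minus_before = paired_t(after, before)
t_before_minus_after = paired_t(before, after)

print(round(t_after_minus_before, 2), round(t_before_minus_after, 2))
```

The two statistics have the same size and opposite signs, so a two-tailed p-value is the same either way. Only the interpretation gets easier when most of the d's are positive.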
(1)  d = After−Before
H_{0}: μ_{d} = 0, coffee makes no difference to heart rate
H_{1}: μ_{d} ≠ 0, coffee makes a difference to heart rate

(2)  α = 0.01 
(RC)  Jon has a random sample, but the sample size is <30. (The sample of six is obviously less than 10% of coffee drinkers.) He puts the Before figures in L1, After in L2, and then L2−L1 (not L1−L2) in L3. The box-whisker plot of L3 finds no outliers. The normal probability plot shows r=.9638, crit=.8893; r>crit.

(3/4) 
TTest: 0, L3, 1, ≠μ_{o}
results: t=5.56, p = 0.0026, d̅=4.2, s=1.8, n=6 
(5)  p < α. Reject H_{0} and accept H_{1}. 
(6)  Drinking coffee does make a difference in heart rate half an
hour later (p = 0.0026). In fact, coffee increases heart
rate.
Or, Drinking coffee does make a difference in heart rate half an hour later, at the 0.01 significance level. In fact, drinking coffee increases heart rate. 
As usual, when you do a two-tailed test and p < α, you can interpret it in a one-tailed manner. Jon defined d as After−Before, which is the amount of increase in each subject’s heart rate. His sample mean d̅ was positive, so the average outcome in his sample was an increase. Because he proved that the mean difference μ_{d} for all people is nonzero, the sign of his sample mean difference d̅ tells him the sign of the population mean difference μ_{d}.
Jon can’t say that the average increase for people in general is 4.2 beats per minute. That was the mean difference in his sample. If he wants to know the mean difference for all people, he has to construct a confidence interval:
TInterval: L3, 1, .99
result: (1.146, 7.187)
Jon is 99% confident that the average increase in heart rate for all people, half an hour after drinking two cups of coffee, is 1.1 to 7.2 beats per minute.
Caution! The confidence interval expresses a difference, not an absolute number. You are estimating the amount of increase or decrease, not the heart rate. A common mistake would be to say something about the heart rate being 1.1 to 7.2 bpm after coffee. Again, you’re not estimating the heart rate, you’re estimating the change in heart rate.
With paired data, you tested the population mean difference μ_{d} between matched pairs. But suppose you don’t have matched pairs? With unpaired data in independent samples, you test the difference between the means of two populations, μ_{1}−μ_{2}.
This is Case 4 in Inferential Statistics: Basic Cases. Key features:
2SampTTest for hypothesis test, 2SampTInt for confidence interval. Always use Pooled:No with both.
Advice: Take your time when you look at data to decide whether you have paired or unpaired data. If your sample sizes are different, it’s a no-brainer: the data are unpaired. But if the sample sizes are the same, think carefully about whether the data are paired or unpaired. Sometimes students just seem to take a stab in the dark at whether data are paired or unpaired, but if you just stop and think about how the data were taken you can make the right decision every time. Look back at Paired and Unpaired Data at the beginning of this chapter if you need a refresher on the difference.
Prof. Sullivan’s students at Wossamatta U felt that he was a tougher grader than the other speech professors. They decided to test this, at the 0.05 significance level.
Eight of them each took a two-hour shift, assigned randomly at different times and days of the week, and distributed a questionnaire to each student on the main quad. They felt this was a reasonable approximation to a random sample of current students. (They asked students not to take a questionnaire if they had already submitted one.) The questionnaire asked whether the student had taken speech in a previous semester, and if so from which professor and what grade they received. They then divided the questionnaires into three piles, “no speech”, “Sullivan”, and “other prof”.
It would be possible to do an analysis with the categorical data of letter grades. But you should always use numeric data when you can, because p-values are usually lower with numeric data than attribute data, for a given sample size. The students counted an A as 4 points, A-minus as 3.7, and so on. Here is a summary of their findings:
Students of  Mean  Standard Deviation  Sample Size 

Sullivan  2.21  1.44  32 
Other prof  2.68  1.13  54 
In this test, you have unpaired numeric data in two samples. The requirements for each sample are the same as the requirements for the sample in a one-sample t test:
There’s an additional requirement for the two samples:
Here’s the hypothesis test, as performed by Prof. Sullivan’s students:
(1)  pop. 1 = Sullivan students, pop. 2 = other speech profs’
students
H_{0}: μ_{1} = μ_{2}, no difference in average grades
H_{1}: μ_{1} < μ_{2}, Sullivan’s grades lower on average

(2)  α = 0.05 
(RC) 

(3/4) 
2SampTTest: x̅1=2.21, s1=1.44, n1=32, x̅2=2.68, s2=1.13, n2=54, μ_{1}<μ_{2}, Pooled:No
Results: t = −1.58, p = 0.0600, df=53.58
The test statistic is still Student’s t, but adapted for two samples. See the BTW note below for more about that and about the funny number of degrees of freedom.
(5)  p > α. Fail to reject H_{0}. 
(6)  At the 0.05 level of significance, they can’t determine whether Prof. Sullivan is a tougher grader than the other professors or not. 
You’re working with a difference of sample means. The standard error of the mean for the first population is s_{1}/√n_{1} and therefore the variance is s_{1}²/n_{1}, and similarly for the second population. The variance of the sum or difference of independent variables is the sum of their variances, so VAR(x̅_{1}−x̅_{2}) = s_{1}²/n_{1} + s_{2}²/n_{2}. The standard deviation (the standard error of the difference of sample means) is the square root of the variance: √(s_{1}²/n_{1} + s_{2}²/n_{2}).
It turns out that the difference of sample means follows a t distribution, if you choose the right number of degrees of freedom (more on that later). The one-sample test statistic was t = (x̅−μ_{o}) / (s/√n). The two-sample test statistic is analogous, with the differences substituted. The test statistic becomes t = [(x̅_{1}−x̅_{2}) − (μ_{1}−μ_{2})_{o}] / √(s_{1}²/n_{1} + s_{2}²/n_{2}), where (μ_{1}−μ_{2})_{o} is the hypothetical difference of population means. In this course, you’ll just be testing whether one population mean is greater than, less than, or different from the other. In other words, you’ll test against a hypothetical mean difference of 0. That simplifies t a bit: t = (x̅_{1}−x̅_{2}) / √(s_{1}²/n_{1} + s_{2}²/n_{2}).
What about degrees of freedom? You might think df would be n_{1}+n_{2}−1, but it isn’t. The sampling distribution approximately follows a t with df equal to the lower of n_{1}−1 and n_{2}−1. It’s only approximate because the population SDs are usually different. The exact degrees of freedom were computed by B. L. Welch (1938) [see “Sources Used” at end of book], and the horrendous, ugly equation is df = (s_{1}²/n_{1} + s_{2}²/n_{2})² / [ (s_{1}²/n_{1})²/(n_{1}−1) + (s_{2}²/n_{2})²/(n_{2}−1) ]. Fortunately, your TI-83/84 has the computation built in, and you don’t have to worry about it.
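If you're curious, the standard error, the t statistic, and Welch's degrees of freedom can all be sketched from the summary statistics alone. The function name `welch_t` is hypothetical, and the df line uses the usual Welch formula, which is assumed here to be what the calculator applies with Pooled:No.

```python
import math


def welch_t(mean1, s1, n1, mean2, s2, n2):
    """Return (t, df, standard error) for an unpooled two-sample t test of mu1 - mu2 = 0."""
    v1 = s1 ** 2 / n1            # variance of the first sample mean
    v2 = s2 ** 2 / n2            # variance of the second sample mean
    se = math.sqrt(v1 + v2)      # standard error of the difference of means
    t = (mean1 - mean2) / se
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df, se


# Summary statistics from the grade survey: Sullivan first, other profs second.
t, df, se = welch_t(2.21, 1.44, 32, 2.68, 1.13, 54)
print(round(t, 2), round(df, 2))
```

Notice that df comes out between the smaller of n_{1}−1 and n_{2}−1 and the combined sample size, and that it is usually not a whole number.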
What about pooling? Why do you always select Pooled:No on your TI-83/84? Well, if the two populations have the same SD (if they are homoscedastic) you can treat them as a single population (pool the data sets) and use a higher number of degrees of freedom. That in turn means your p-value will be a bit lower, so you’re a bit more likely to be able to reject H_{0}. Sounds good, right? But there are problems:
For these reasons and others, the issue of pooling is controversial. Some books don’t even mention it. It’s best just to use Pooled:No always.
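For the curious, here is a rough sketch of what pooling would change for the grade survey. The pooled variance formula below is the standard textbook one and is not given in this chapter, so treat it as an assumption. It is here only for comparison, not as a recommendation.

```python
import math

# Summary statistics from the grade survey.
mean1, s1, n1 = 2.21, 1.44, 32   # Sullivan
mean2, s2, n2 = 2.68, 1.13, 54   # other profs

# Pooled: one combined variance estimate, with df = n1 + n2 - 2.
pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
t_pooled = (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
df_pooled = n1 + n2 - 2

# Unpooled (Pooled:No): separate variances and Welch's degrees of freedom.
v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
t_unpooled = (mean1 - mean2) / math.sqrt(v1 + v2)
df_unpooled = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

print(round(t_pooled, 2), df_pooled)
print(round(t_unpooled, 2), round(df_unpooled, 2))
```

The pooled version has more degrees of freedom, as described above, but that gain is only legitimate if the two populations really do have the same SD.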
The requirements are exactly the same as the requirements for the hypothesis test. You compute a confidence interval on your TI-83/84 through 2SampTInt.
Since they couldn’t prove that Prof. Sullivan was a tough grader, the students decided to compute a 90% confidence interval for the difference between Prof. Sullivan’s average grades and the other speech profs’ average grades:
pop. 1 = Sullivan students; pop. 2 = other speech profs’
students
Requirements: already covered in hypothesis test.
2SampTInt: x̅1=2.21, s1=1.44, n1=32, x̅2=2.68, s2=1.13, n2=54, CLevel=.9, Pooled:No
Results: (−.9678, .02779)
Interpretation: The TI-83 gives you the bounds for the confidence interval about μ_{1}−μ_{2}. A negative number indicates μ_{1} smaller than μ_{2}, and a positive number indicates μ_{1} larger than μ_{2}. Therefore:
We’re 90% confident that the average student in Prof. Sullivan’s classes receives somewhere between 0.97 of a letter grade lower than the average student in other profs’ speech classes, and 0.03 of a letter grade higher.
Remark: The 90% confidence interval is almost all negative. This reflects the fact that the p-value in the one-tailed test for μ_{1} < μ_{2} was almost as low as 0.05.
The students could have chosen any confidence level they wanted, just for showing an effect size. But for a confidence interval equivalent to their one-tailed hypothesis test that used α = 0.05, the confidence level has to be 1−2×0.05 = 0.90 = 90%.
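The students' 90% interval can be approximated off the calculator as well. This is a sketch under stated assumptions: `t_cdf` and `t_critical` are hypothetical helpers that integrate the t density numerically and then search for the critical value by bisection, and the degrees of freedom come from the usual Welch formula.

```python
import math


def t_cdf(t, df, steps=2000):
    """Approximate P(T <= t) for t >= 0 using Simpson's rule on the t density."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    h = t / steps
    total = pdf(0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + total * h / 3


def t_critical(conf_level, df):
    """Find t* such that P(-t* <= T <= t*) equals conf_level, by bisection."""
    target = 0.5 + conf_level / 2
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Summary statistics: Sullivan is population 1, other profs are population 2.
v1, v2 = 1.44 ** 2 / 32, 1.13 ** 2 / 54
se = math.sqrt(v1 + v2)
df = (v1 + v2) ** 2 / (v1 ** 2 / 31 + v2 ** 2 / 53)

margin = t_critical(0.90, df) * se
diff = 2.21 - 2.68                      # x-bar 1 minus x-bar 2
low, high = diff - margin, diff + margin

print(round(low, 4), round(high, 4))
```

The interval is for μ_{1}−μ_{2}, so a negative endpoint means Sullivan's average is lower and a positive endpoint means it is higher.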
Why do you need a special two-sample t procedure? Can’t you just compute a confidence interval from each sample and then compare them? No, because the standard errors are different. The two-sample standard error takes the sample SD and sample sizes into account. Here’s a simple example, provided by Benjamin Kirk:
A farmer tests two diets for his pigs, randomly assigning 36 pigs to each sample. The Diet A group gained an average 55 lb with SD of 3 lb; that gives a 95% confidence interval 54.0 to 56.0 lb. The Diet B group gained 53 lb on average, with SD of 4 lb; the CI is 51.6 to 54.4 lb. Those intervals overlap slightly, which would not let you conclude that there’s any difference in the diets.
But the 2SampTInt is 0.3 to 3.7 lb in favor of Diet A, which says there is a difference. The issue is that the B group had a lower sample mean, but there was more variation within the group.
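One way to see what is going on in the pig example is to compare standard errors. Checking whether two separate intervals overlap is roughly like comparing the gap between the means to the sum of the two margins of error, while the two-sample procedure uses the standard error of the difference, which is smaller. This rough sketch ignores the small differences among the t critical values.

```python
import math

n = 36                      # pigs per diet
sd_a, sd_b = 3, 4           # sample SDs in pounds

se_a = sd_a / math.sqrt(n)  # standard error of the Diet A mean
se_b = sd_b / math.sqrt(n)  # standard error of the Diet B mean

# Standard error of the difference of means: variances add, SDs do not.
se_diff = math.sqrt(se_a ** 2 + se_b ** 2)

print(round(se_a + se_b, 3), round(se_diff, 3))
```

Because the square root of a sum of squares is always less than the plain sum, comparing two one-sample intervals is more conservative than the two-sample interval, and it can miss a difference that is really there.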
The Alpha Alpha Alpha sorority chapter at Staples University (Yes, corporate sponsorship is getting ridiculous!) has a tradition of putting in extra effort academically. They gave their incoming pledges the task of proving that Alpha Alpha Alpha had higher average GPA than other sororities, at the 0.05 level of significance. The Alphas are a large sorority, with 119 members.
The pledges hacked the campus server and obtained GPAs of ten randomly selected Alphas and ten randomly selected members of other sororities on campus. Do their ill-gotten data prove their point?
Alphas:  2.31  3.36  2.77  2.93  2.27  2.35  3.13  2.20  3.20  2.45 

Other sororities:  1.49  1.74  2.70  2.40  2.17  1.08  1.85  1.96  2.08  1.49 
Since you have independent samples (unpaired data) from two different populations, this is Case 4, difference of population means, in Inferential Statistics: Basic Cases.
Caution: You can’t treat these as paired data just because the sample sizes are equal; that’s a rookie mistake. When deciding between a paired or an unpaired analysis, always ask yourself: “Is data point 1 from the first sample truly associated with data point 1 from the second sample?” In this case, they’re not.
(1)  pop. 1 = Alpha Alpha Alpha; pop. 2 =
other sororities
H_{0}: μ_{1} = μ_{2}, No difference in average GPA
H_{1}: μ_{1} > μ_{2}, Average GPA of all Alphas is higher than other sororities

(2)  α = 0.05 
You check requirements against both samples independently. These samples are both smaller than 30, so you have to check normality and outliers on both. Here are the normality checks:
The first picture doesn’t look much like a straight line, but r is greater than crit, so it’s close enough. (With small data sets like this one, fitting the data to the screen can make differences look larger than they really are.)
The calculator lets you “stack” two or three boxplots on one screen. Not only is this a bit of a labor saver, but it also gives you a good sense of how different the samples are. To do this, select “Compare 2 smpl” on the first box-whisker screen. (You can guess what “Compare 3 smpl” does, but we don’t use it in this course.)
For these samples, the difference is dramatic. Every single Alpha’s GPA (in the sample) is above the third quartile in the sample of other sororities, and the max of other sororities is just barely above the median Alpha.
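You can check those boxplot comparisons directly from the two lists. In this sketch the helper `upper_quartile` is a made-up name, and it takes the third quartile to be the median of the upper half of the sorted data, which is assumed here to match the calculator's convention.

```python
from statistics import median

# GPAs copied from the table above.
alphas = [2.31, 3.36, 2.77, 2.93, 2.27, 2.35, 3.13, 2.20, 3.20, 2.45]
others = [1.49, 1.74, 2.70, 2.40, 2.17, 1.08, 1.85, 1.96, 2.08, 1.49]


def upper_quartile(data):
    """Q3 as the median of the upper half of the sorted data."""
    ordered = sorted(data)
    half = len(ordered) // 2
    return median(ordered[-half:])


q3_others = upper_quartile(others)
every_alpha_above_q3 = min(alphas) > q3_others
max_other_above_alpha_median = max(others) > median(alphas)

print(q3_others, min(alphas), max(others), round(median(alphas), 2))
print(every_alpha_above_q3, max_other_above_alpha_median)
```

Other software may use a slightly different quartile rule, so don't be alarmed if a boxplot drawn elsewhere puts Q3 in a slightly different place.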
With such a big difference, why do the pledges even need to do a hypothesis test? Because they know these are just samples. Maybe the Alphas actually aren’t any better academically, but these particular samples just happened to be far apart. The hypothesis test tells you whether the difference you see is too big to be purely the result of random chance in sample selection.
(RC) 


(3/4) 
2SampTTest: L1, L2, 1, 1, >μ_{2}, Pooled:No
outputs: t = 3.93, p-value = 0.0005, x̅_{1} = 2.70, s1 = 0.43, n1 = 10, x̅_{2} = 1.90, s2 = 0.48, n2 = 10
(5)  p < α. Reject H_{0} and accept H_{1}. 
(6)  The average GPA in Alpha Alpha Alpha is higher than the
average GPA of other sorority members (p = 0.0005).
(Or, at the 0.05 level of significance, the average GPA in Alpha Alpha Alpha is higher than the average GPA of other sorority members.)
Comment: You have to phrase your conclusion carefully. The pledges proved that the average GPA of Alphas is higher than the average GPA of all other sorority members, not all other sororities. What’s the difference? Here’s a simple example. Suppose there are ten other sororities besides the Alphas. The Omegas have an average GPA of 3.66, higher than the Alphas’ average. If the other nine each have an average GPA of 1.70, that could easily produce exactly the sample that the pledges got.
The message here: Aggregating data can lose information. Sometimes that’s okay, but be wary when one population is being compared to an aggregate of multiple other populations.
When you have two samples of binomial data, they represent two populations. Each population has some proportion of successes, p_{1} and p_{2} respectively. You don’t know those true proportions, and in fact you’re not concerned with them. Instead, you’re concerned with the difference between the proportions, p_{1}−p_{2}. You can test whether there is a difference (hypothesis test), or you can estimate the size of the difference (confidence interval).
This is Case 5 in Inferential Statistics: Basic Cases. Key features of Case 5, the difference of proportions:
Advice: take your time with twosample binomial data. You have a lot of p’s and a lot of percentages floating around, and it’s easy to get mixed up if you try to hurry.
Take extra care when writing conclusions. You’re making statements about the difference between the two proportions, not about the individual proportions. And you’re making statements about the difference in proportions between the populations, not between the samples.
Stopped by Traffic Cop  

Ticket Issued  Just a Warning  Total  p̂  
Men  86  11  97  89% 
Women  55  15  70  79% 
One of my students — call him Don — had several traffic tickets, and he knew one more would trigger a suspension. He felt that women stopped by a traffic cop were more likely than men to get off with just a warning, and for his Field Project he set out to prove it, with α = 0.05.
Don quickly realized that he should test whether men and women stopped by a cop are equally likely to get a ticket, not just whether men are more likely. After all, he couldn’t rule out the possibility that women are more likely to get a ticket if stopped.
Don distributed a questionnaire to a systematic sample of TC3 students. (He assumed that any gender-based difference in TC3 students would be representative of college students in general. That seems reasonable.) He asked three questions:
Don disregarded any questionnaires from students who had never been stopped as adults. He wasn’t interested in the likelihood of getting a ticket, but in the likelihood of getting a ticket after being stopped by a cop. You could say that he was interested in the different proportions, for men and women, of stops that lead to tickets.
This is just another variation on the good old Seven Steps of Hypothesis Tests:
You use 2PropZTest, so your test statistic is a z-score.
Here are the requirements for a Case 5 hypothesis test of a difference of proportions:
Actually, that’s an approximation to the real requirement. We use it because it nearly always gives the same answer, and it’s easier to test.
The real requirement is at least 10 successes and 10 failures EXPECTED in each sample. The expected numbers are what you would see in your samples if H_{0} is true and there’s no difference between the two population proportions. In that case, the pooled proportion p̂, which is the overall percentage of success in the combined samples, is an estimator of the true proportion in both populations.
That pooled proportion is p̂ = (x_{1}+x_{2})/(n_{1}+n_{2}), where x_{1} and x_{2} are the numbers of successes in the two samples. (2PropZTest shows you p̂ on the output screen.)
Using that pooled proportion p̂, the expected successes and failures in sample 1 are n_{1}p̂ and n_{1}−n_{1}p̂, and the expected successes and failures in sample 2 are n_{2}p̂ and n_{2}−n_{2}p̂. All four of these must be ≥ 10.
The Gardasil vaccine example, below, shows a situation where you have to use the blended proportion to test requirements.
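The expected-count requirement described above is plain arithmetic, so it is easy to check away from the calculator. Here is a minimal Python sketch. The function name `pooled_requirement` and its arguments are made up for illustration, and the data are Don's traffic-stop counts from the table earlier in this section.

```python
def pooled_requirement(x1, n1, x2, n2, minimum=10):
    """Check the 'official' HT requirement: expected successes and failures
    in each sample, computed from the pooled proportion, are all >= minimum."""
    p_pooled = (x1 + x2) / (n1 + n2)  # overall success rate in the combined samples
    expected = {
        "sample 1 successes": n1 * p_pooled,
        "sample 1 failures": n1 - n1 * p_pooled,
        "sample 2 successes": n2 * p_pooled,
        "sample 2 failures": n2 - n2 * p_pooled,
    }
    return p_pooled, expected, all(v >= minimum for v in expected.values())


# Don's data: 86 of 97 men and 55 of 70 women were ticketed after a stop.
p_pooled, expected, ok = pooled_requirement(86, 97, 55, 70)
print(f"pooled proportion = {p_pooled:.2f}, requirement met: {ok}")
for label, count in expected.items():
    print(f"{label}: {count:.1f}")
```

The smallest of the four expected counts here is the number of failures in sample 2, which comes out a little under 11, so the requirement is met.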
Here is Don’s hypothesis test about the different proportions of men and women that receive tickets after being stopped in traffic.
(1)  population 1 = college men stopped by traffic cops; population 2 = college women stopped by traffic cops
H_{0}: p_{1} = p_{2}, college men and women equally likely to get a ticket after being stopped
H_{1}: p_{1} ≠ p_{2}, college men and women not equally likely to get a ticket after being stopped
(2)  α = 0.05
(RC)
(3/4)  2PropZTest: 86, 97, 55, 70, p1≠p2
Results: z = 1.77, p-value = 0.0760, p̂_{1} = .89, p̂_{2} = .79, p̂ = .84
(5)  p > α. Fail to reject H_{0}.
(6)  At the 0.05 level of significance, Don can’t tell whether men and women stopped by traffic cops are equally likely to get tickets, or not.
If this non-conclusion leaves you non-satisfied, you’re not alone. As usual, the confidence interval (next section) can provide some information.
Why does the “official” requirement use a pooled proportion p̂ instead of testing each sample? In fact, for a confidence interval you always test requirements for each sample. But in a hypothesis test, your H_{0} is always “no difference in population proportions”, and a hypothesis test always starts by assuming H_{0} is true. If the null is true, then there is no difference in the two populations, and you really just have one big sample of size n_{1}+n_{2} and sample proportion p̂. So that’s what you test.
Of course the two-population case is a bit more complicated. You need the key fact that when you add or subtract independent random variables, their variances add. If the two populations have the same proportion p, as H_{0} assumes, then the SD of the sampling distribution of the proportion for population 1 is √[p̂(1−p̂)/n_{1}], and similarly for population 2, where p̂ is the pooled proportion mentioned in the requirements check, above. Square the SDs to get the variances, add them, and take the square root to get the standard error: SE = √[p̂(1−p̂)/n_{1} + p̂(1−p̂)/n_{2}] = √[p̂(1−p̂)(1/n_{1} + 1/n_{2})]. And from this you have the test statistic: z = (p̂_{1}−p̂_{2}) / SE.
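If you would like to see the pooled standard error and test statistic worked out step by step, here is a short Python sketch of the calculation just described. It is not what the calculator does internally, only the same arithmetic. The function name `two_prop_z_test` is an invented label, and the numbers are Don's traffic-stop data.

```python
from math import sqrt
from statistics import NormalDist


def two_prop_z_test(x1, n1, x2, n2):
    """Two-tailed test of H0: p1 = p2 using the pooled proportion."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    # Variances add, and under H0 both samples share the pooled proportion.
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # both tails
    return z, p_value


z, p_value = two_prop_z_test(86, 97, 55, 70)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

With Don's data this should agree with the 2PropZTest results quoted in the hypothesis test above, to the number of decimal places shown there.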
In a confidence interval for the difference of two proportions, some unknown proportion p_{1} of population 1 has some characteristic, and some unknown proportion p_{2} of population 2 has that characteristic. You aren’t concerned with those proportions on their own, but you want to estimate which population has the greater proportion, and by how much.
2PropZInt. The CI estimate is for p_{1}−p_{2}, the true difference between the proportion of success in the two populations. A negative number in the confidence interval means the population 1 proportion is lower than the population 2 proportion, and a positive number means p_{1} is greater than p_{2}.
The requirements for a CI are almost the same as for a HT, but with one subtle difference:
Why is that last requirement different from the “official” requirement for the hypothesis test? With the HT, you assumed H_{0} was true and both populations had the same proportion. That let you use a blended or pooled proportion from your combined samples. But with a CI, you don’t make any such assumption. What would be the point of a confidence interval for the difference if you assume there is no difference?
But despite the difference in theory, as a practical matter you can just test for ≥ 10 successes and ≥ 10 failures in each sample for both HT and CI.
Don has already checked requirements in the hypothesis test, so he moves right to a 2PropZInt:
Don gets a result of −1.4% to +21.6%. How does he interpret that? Well, he can write it as
−1.4% ≤ p_{1}−p_{2} ≤ 21.6% (95% conf.)
Adding p_{2} to all three “sides” gives
p_{2}−1.4% ≤ p_{1} ≤ p_{2}+21.6% (95% conf.)
With 95% confidence, p_{1} is somewhere between 1.4% below p_{2} and 21.6% above p_{2}. You don’t know the numerical value of p_{1}, but out of male students who are stopped by a traffic cop, p_{1} is the proportion who get a ticket, and similarly for p_{2} and women. So Don can write his confidence interval like this:
I’m 95% confident that, out of students stopped by traffic cops, the proportion of men who actually get tickets is somewhere between 1.4 percentage points less than women, and 21.6 percentage points more than women.
If you’re not feelin’ the love with the algebra approach, you can reason it out in words. The confidence interval is the difference in proportions for men minus women. If that’s negative, the proportion for men is less than the proportion for women; if the difference is positive, the proportion for men is greater than the proportion for women.
Why do I say “percentage points” instead of just “percent” or “%”? Well, how do you describe the difference between 1% and 2%? It’s a difference of one percentage point, but it’s a 100% increase, because the second one is 200% of the first. When you subtract two percentages, the difference is a number of percentage points. If you just say “percent”, that means you’re expressing the difference using one of the percentages as a base, even if you don’t mean to.
Getting back to Don’s confidence interval, the −1.4% to +21.6% difference between men and women in traffic tickets is a simple subtraction of men’s rate minus women’s rate, so it is percentage points, not percent.
The standard deviation of the sampling distribution of the proportion for population 1 is √[p̂_{1}(1−p̂_{1})/n_{1}], and similarly for population 2. Square them, add, and take the square root to get the SD of the distribution of differences in sample proportions, also known as the standard error of the difference of proportions: SE = √[p̂_{1}(1−p̂_{1})/n_{1} + p̂_{2}(1−p̂_{2})/n_{2}]. The margin of error E is z_{α/2} times that. The center of the confidence interval is the point estimate, (p̂_{1}−p̂_{2}), so the bounds for the 100(1−α)% confidence interval are
(p̂_{1}−p̂_{2})−E ≤ p_{1}−p_{2} ≤ (p̂_{1}−p̂_{2})+E where E = z_{α/2}·√[p̂_{1}(1−p̂_{1})/n_{1} + p̂_{2}(1−p̂_{2})/n_{2}]
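The same calculation can be sketched in a few lines of Python. This is only an illustration of the formula above, using the unpooled standard error. The function name `two_prop_z_interval` is invented for the example, and the data are Don's counts.

```python
from math import sqrt
from statistics import NormalDist


def two_prop_z_interval(x1, n1, x2, n2, confidence=0.95):
    """Confidence interval for p1 - p2, with no pooling of the samples."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    margin = z_crit * se
    point_estimate = p1_hat - p2_hat
    return point_estimate - margin, point_estimate + margin


low, high = two_prop_z_interval(86, 97, 55, 70)
print(f"{low:.4f} <= p1 - p2 <= {high:.4f} (95% conf.)")
```

Notice that each sample keeps its own p̂ here. That is the difference from the hypothesis test, where both samples share the pooled p̂.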
Just like with numeric data, you have to use the two-sample procedure to compute a correct confidence interval. Here’s an example.
Two candidates are running for city council, so they each commission an exit poll on Election Day. Of 200 voters polled, 110 voted for Mr. X; 90 of a different 200 voted for Ms. Y. The 95% confidence intervals are 48.1% to 61.9% and 38.1% to 51.9%. The intervals overlap, so Ms. Y might still hope for victory. But a 2PropZInt tells a different story. The interval for the difference of proportions, X−Y, is 0.2% to 19.8%, so Mr. X is 95% confident of winning, and the only question is whether it will be a squeaker or a landslide.
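To see why the overlap of two separate intervals can mislead, here is a hedged Python sketch that computes all three intervals from the exit-poll counts. The helper names are invented, and the one-sample interval is the usual p̂ ± z·√[p̂(1−p̂)/n], which I am assuming matches the one-proportion interval from the earlier chapter.

```python
from math import sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # critical z for 95% confidence


def one_prop_interval(x, n):
    """Ordinary one-sample interval for a single proportion."""
    p_hat = x / n
    margin = Z95 * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def diff_interval(x1, n1, x2, n2):
    """Two-sample interval for p1 - p2 (unpooled standard error)."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    margin = Z95 * sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    return (p1_hat - p2_hat) - margin, (p1_hat - p2_hat) + margin


x_low, x_high = one_prop_interval(110, 200)  # Mr. X
y_low, y_high = one_prop_interval(90, 200)   # Ms. Y
d_low, d_high = diff_interval(110, 200, 90, 200)

intervals_overlap = x_low <= y_high
difference_excludes_zero = d_low > 0
print(intervals_overlap, difference_excludes_zero)
```

Both flags come out true: the separate intervals overlap, yet the interval for the difference stays above zero. The margin of error for the difference is smaller than the sum of the two separate margins because variances add, not standard deviations.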
You have a confidence level and a desired margin of error in mind. How large must each sample be?
You may remember with the one-population binomial case, part of the calculation was your prior estimate, or if you had no prior estimate you used 0.5. With two binomial populations, you need a prior estimate (or 0.5) for each one.
The easiest way to compute the necessary sample size is to use MATH200A Program part 5. If you don’t have the program and want to get it, see Getting the Program. If you don’t have the program, you can calculate the necessary size of each sample from the formula n = [p̂_{1}(1−p̂_{1}) + p̂_{2}(1−p̂_{2})]·(z_{α/2}/E)², rounded up to the next whole number.
For a detailed explanation, with worked examples, see How Big a Sample Do I Need?.
Caution! When you’re planning to study the difference between two binomial populations, you have to use the two-population binomial computation of sample size. If you compute one sample size for sample 1 and a separate sample size for sample 2, you’ll come out much too low.
Example 12: Let’s look back once more at Don and his traffic stops. His 95% confidence interval was −0.0141 to +0.21587. That’s a margin of error of (.21587−(−.0141))/2 = 11½ percentage points. How large must his samples be if he wants a margin of error of no more than 5 percentage points but he’s willing to be only 90% confident?
Solution: Don can use his sample proportions as prior estimates. Those were 86/97 ≈ 0.8866 for men and 55/70 ≈ 0.7857 for women.
With the MATH200A program (recommended):
Here’s the output screen from MATH200A Program part 5, 2pop binomial:
If you’re not using the program:
The calculation is a little easier if you break it into chunks. First compute p̂_{1}(1−p̂_{1}) + p̂_{2}(1−p̂_{2}). When you press [Enter], the calculator displays that result.
You want to multiply that by (z_{α/2}/E)². What is z_{α/2}? You did this in How Big a Sample for Binomial Data? in Chapter 9. The confidence level is 1−α = 0.9, so α = 0.1, α/2 = 0.05, and z_{α/2} is invNorm(1−.05).
Caution: You don’t round the sample size. If you don’t get a whole number from the calculation, always go up to the next whole number. A sample size of 291.0255149 or greater gives a margin of error of .05 or less, at 90% confidence. The smallest whole number that is 291.0255149 or greater is 292, not 291.
Answer: Don needs a sample of 292 men and 292 women if he wants 90% confidence in an estimate of the difference with margin of error no more than 5 percentage points.
Rookie mistake: Don’t just say “292”. It’s 292 from each population.
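If you have neither the program nor the calculator handy, the hand calculation in Example 12 can be sketched in Python. The function name `two_pop_sample_size` is made up for this illustration. It uses the same rounded prior estimates as the solution (0.8866 and 0.7857) and always rounds up.

```python
from math import ceil
from statistics import NormalDist


def two_pop_sample_size(p1_prior, p2_prior, margin, confidence):
    """Size needed for EACH sample: [p1(1-p1) + p2(1-p2)] * (z/E)^2, rounded up."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    raw = (p1_prior * (1 - p1_prior) + p2_prior * (1 - p2_prior)) * (z_crit / margin) ** 2
    return raw, ceil(raw)  # never round down


raw, n_each = two_pop_sample_size(0.8866, 0.7857, margin=0.05, confidence=0.90)
print(f"raw = {raw:.4f}, so sample {n_each} from each population")
```

The `ceil` call is the important design choice: ordinary rounding would sometimes give a sample one too small to guarantee the margin of error.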
Why do you need such large samples, even at a confidence level as low as 90%? Part of the answer is that binomial data do need large samples; remember that a single sample of just over a thousand gives you a 3% margin of error at the 95% confidence level. And when you have two populations, you are estimating the difference between two unknown parameter values, p_{1} and p_{2}. If each of those was estimated within a 3% margin of error, the margin of error for their difference would be larger than 3% (about 4.2%, because the variances add), so the samples have to be larger in the two-population binomial case.
Example 13: The Prime Minister knows that his program of tax cuts and reduced social services appeals more to Conservatives than to Labour, but he wants to know how large the difference is. To estimate the difference with 95% confidence, with a margin of error of no more than 3 percentage points, how many members of each party must he survey?
Solution: You’re given no estimate of support within either party, so use 0.5 for p̂_{1} and p̂_{2}. E = 0.03 (not 0.3).
With the MATH200A program (recommended):
MATH200A/sample size/2pop binomial:
If you’re not using the program:
First compute p̂_{1}(1−p̂_{1}) + p̂_{2}(1−p̂_{2}) = 0.5(1−0.5)+0.5(1−0.5). You have to multiply that by (z_{α/2}/E)², where you find z_{α/2} like this: CLevel = 1−α = 0.95 ⇒ α = 1−0.95 = 0.05 ⇒ α/2 = 0.025 ⇒ z_{α/2} = invNorm(1−.025).
Answer: To gauge the difference within a margin of error of 3 percentage points, at the 95% confidence level, the Prime Minister needs to poll 2135 Conservative Party members and 2135 Labour Party members.
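Here is a hedged Python sketch of Example 13 that also illustrates the earlier caution about computing a separate sample size for each sample. The one-population formula used for comparison, p̂(1−p̂)(z_{α/2}/E)², is the standard one, and I am assuming it matches the Chapter 9 procedure. The variable names are invented for illustration.

```python
from math import ceil
from statistics import NormalDist

confidence, margin = 0.95, 0.03
p1_prior = p2_prior = 0.5  # no prior estimates, so use 0.5 for each party

z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Correct two-population computation: size of EACH sample.
n_two_pop = ceil((p1_prior * (1 - p1_prior) + p2_prior * (1 - p2_prior)) * (z_crit / margin) ** 2)

# The mistake the caution warns about: treating each sample on its own.
n_one_pop = ceil(p1_prior * (1 - p1_prior) * (z_crit / margin) ** 2)

print(f"two-population size per sample: {n_two_pop}")
print(f"one-population size (too small for a difference): {n_one_pop}")
```

With equal prior estimates, the two-population requirement is about double the one-population figure, which is why computing the samples separately comes out much too low.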
The Gardasil vaccine is marketed by Merck to prevent cervical cancer. What are the statistics behind it? How do women decide whether to get vaccinated? Should the vaccine be mandatory?
A Cortland Standard story (21 Nov 2002) summarized an article from the New England Journal of Medicine as follows:
A new vaccine can protect against Type 16 of the human papilloma virus, a sexually transmitted virus that causes cervical cancer, a new study shows. An estimated 5.5 million people become infected with a strain of HPV [not necessarily this strain] each year in the United States.
Efficiency rate of vaccine and placebo
Placebo: Group size 765, infection 41
HPV16 vaccine: Group size 768, infection 0
Note: The study included 1533 women with an average age of 20.
(Similar studies were done for the vaccine’s effectiveness against another strain, HPV18. According to the front page of the Wall Street Journal on 16 Apr 2007, HPV16 and 18 between them “are thought to cause 70% of cervical-cancer cases.” The vaccine, developed by Merck, is now marketed as Gardasil.)
The samples certainly show an impressive difference, but is it statistically significant? Could the luck of random sampling be enough to account for that difference in infection rates?
The claim is “the vaccine protects against HPV16.” To translate this into the language of statistics, realize that there are two populations: (1) women who don’t get the vaccine, and (2) women who do get the vaccine.
Notice that the populations are all women, past, present, and future who don’t or do get vaccinated. The 765 and 768 women are samples, not populations. The populations are unvaccinated and vaccinated, not placebo and vaccine. Placebos are administered to members of a sample, but a population doesn’t “get placeboed”.
The data type is attribute (binomial) because the original question or measurement of each participant is the yes/no question: “Did this woman contract the virus?” (“Success” is an HPV16 infection, not a good thing.) Since you’re comparing two populations, this is Case 5, Difference of Two Proportions.
If the vaccine works, then you expect more women without the vaccine to contract the virus, so make them population 1. (That’s not necessary; it just usually makes things a little simpler to call population 1 the one with higher numbers expected.)
Although you hope that the vaccine population will have a lower infection rate, it’s not impossible that they could have a higher rate. Therefore you do a two-tailed test (≠). If p < α, then it’s time to say whether the vaccine makes things better or worse.
Let’s use α = 0.001. You’re talking about cancer in humans, after all. A Type I error would be saying that Gardasil makes a difference when actually it doesn’t. You don’t want women to get vaccinated, and have a false sense of security, if the vaccine actually doesn’t work, so a Type I error is pretty serious.
(1)  population 1 = unvaccinated women; population 2 = vaccinated women
H_{0}: p_{1} = p_{2}, the vaccine makes no difference
H_{1}: p_{1} ≠ p_{2}, the vaccine does make a difference
(2)  α = 0.001
(RC)
(3/4)  2PropZTest: 41, 765, 0, 768, p1≠p2
Results: z = 6.50, p-value = 7.9E−11, p̂_{1} = .0536, p̂_{2} = 0, p̂ = .0267
Pause for a minute to make sure you can keep all those p’s straight. The first one, p = 7.9E−11, is the p-value, the chance of getting such different sample results if the vaccine makes no difference. p̂_{1} and p̂_{2} are those sample results: 5.4% of unvaccinated women and 0% of vaccinated women in the samples contracted HPV16 infections. p̂ without subscript is the pooled proportion: 2.7% of all women in the study contracted HPV16.
(5)  p < α. Reject H_{0} and accept H_{1}.
(6)  The Gardasil vaccine does make a difference to HPV16 infection rates (p = 8×10^{−11}). In fact, it lowers the chance of infection.
Or, At the 0.001 level of significance, the Gardasil vaccine does make a difference to HPV16 infection rates. In fact, it lowers the chance of infection.
It’s worth reviewing what this p-value of 8×10^{−11} means. If the vaccine actually made no difference, there are only 8 chances in a hundred billion of getting the difference between samples that the researchers actually got, or a larger difference.
How do you get from “makes a difference” to “reduces infection rate”? Remember that when p < α in a two-tailed test, you can interpret the result in a one-tailed manner. If the vaccine makes things different, as appears virtually certain, then it must either make them better or make them worse. But in the sample groups, the vaccine group did better than the placebo group. Therefore the vaccine can’t make things worse, and it must make them better.
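The Gardasil test is a good one to redo by hand, because it shows the pooled proportion rescuing the requirements check even though one sample has zero successes. The Python sketch below is an illustration only. The names are invented, and it uses `math.erfc` for the two-tailed p-value because that function stays accurate far out in the tail.

```python
from math import erfc, sqrt


def two_prop_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z test; returns (z, two-tailed p-value)."""
    p_pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    # erfc(|z|/sqrt(2)) equals the area in both tails of the standard normal.
    return z, erfc(abs(z) / sqrt(2))


# Requirements: expected counts come from the pooled proportion, not the raw counts.
n1, n2 = 765, 768
p_pooled = (41 + 0) / (n1 + n2)
expected = [n1 * p_pooled, n1 * (1 - p_pooled), n2 * p_pooled, n2 * (1 - p_pooled)]
requirements_met = all(count >= 10 for count in expected)

z, p_value = two_prop_z_test(41, n1, 0, n2)
print(requirements_met, f"z = {z:.2f}", f"p-value = {p_value:.1e}")
```

About 20 infections would be expected in each group if the vaccine made no difference, so the hypothesis test requirement is met even though the vaccinated sample had none.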
Can you do a confidence interval to estimate how much Gardasil reduces a woman’s risk of HPV16 infection? Unfortunately, you can’t, because the requirements aren’t met: There were zero successes in the second sample. You can’t think like the hypothesis test and use the blended p̂ to meet requirements. Why wouldn’t that make sense? In a confidence interval, you’re specifically trying to estimate the difference between p_{1} and p_{2} (likelihood of infection for unvaccinated and vaccinated women), so you can’t very well assume there is no difference.
In terms of what you’re required to know for the course, you can skip to the next section right now. But if you want to know more, keep reading.
One informal calculation finds a number needed to treat per person actually helped (Simon 2000c [see “Sources Used” at end of book]). The difference in sample proportions is 5.4 percentage points, and 1/.054 ≈ 18.5 is called the number needed to treat. (You may recognize this as the expected value of the geometric distribution with p = 5.4%.) In the long run, for every 18 or 19 women who are vaccinated, one HPV16 infection is prevented.
Caution! 5.4 percentage points is a difference in sample proportions. You can say only that the difference in the population is somewhere in the neighborhood of 5.4 percentage points, not that it is that. The number needed to treat is therefore not exactly 18.5, just somewhere in the neighborhood of 18.5. Even so, this is valuable information for women and their doctors.
Another approach is the rule of three, explained in Confidence Interval with Zero Events (Simon 2010 [see “Sources Used” at end of book]). When there are zero successes in n events, the 95% confidence interval is 0 to 3/n. Here 3/768 = 0.0039, about 0.4%. The 95% confidence interval for the unvaccinated population is 3.8% to 7.0%. So a doctor can tell her patients that about 38 to 70 unvaccinated women in a thousand will be infected with HPV16, but only about four vaccinated women in a thousand.
Each of those is a 95% confidence interval, but the combination isn’t a 95% confidence interval! In the long run, if you do a bunch of 95% CIs, one in 20 of them won’t capture the true population parameter. Here you’re doing two, so there’s only a 95%×95% = 90.3% chance that both of these actually capture the true population proportions.
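The informal calculations in this optional section are all one-liners, and this Python sketch collects them in one place. Treat it as a check on the arithmetic above, not as a recommended analysis. The variable names are invented, and the joint-confidence line simply multiplies the two confidence levels, as the text does.

```python
from math import sqrt
from statistics import NormalDist

# Number needed to treat, using the rounded 5.4-point difference from the text.
nnt = 1 / 0.054

# Rule of three: with 0 successes in n trials, the 95% CI runs from 0 to 3/n.
rule_of_three_upper = 3 / 768

# Ordinary one-sample 95% CI for the unvaccinated (placebo) group.
p_hat = 41 / 765
margin = NormalDist().inv_cdf(0.975) * sqrt(p_hat * (1 - p_hat) / 765)
unvax_low, unvax_high = p_hat - margin, p_hat + margin

# Chance that two separate 95% intervals BOTH capture their parameters.
joint_confidence = 0.95 * 0.95

print(f"number needed to treat: about {nnt:.1f}")
print(f"vaccinated: 0 to {rule_of_three_upper:.4f}")
print(f"unvaccinated: {unvax_low:.3f} to {unvax_high:.3f}")
print(f"joint confidence: {joint_confidence:.4f}")
```

If you use the unrounded sample difference 41/765 in place of 0.054, the number needed to treat comes out a little higher, which is one more reason to describe it as "18 or 19" and not as an exact figure.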
Summary: If you have a confidence interval for the difference of two population means or proportions, you can conclude whether the difference is statistically significant or not, just like the result of a hypothesis test.
Example 15: You’re testing the new drug Effluvium to see whether it makes people drowsy. Your 95% confidence interval for the difference between proportions of drowsiness in people who do and don’t take Effluvium is (0.017, 0.041). That means you’re 95% confident that Effluvium is more likely, by 1.7 to 4.1 percentage points, to cause drowsiness.
There’s the key point. You’re 95% confident that it does increase the chance of drowsiness by something between those two figures. How likely is it that Effluvium doesn’t affect the chance of drowsiness, then? Clearly it’s got to be less than 5%.
When both endpoints of your confidence interval are positive (or both are negative), so that the confidence interval doesn’t include 0, you have a significant difference between the two populations.
Example 16: Now, suppose that confidence interval was (−0.013, 0.011). That means you’re 95% confident that Effluvium is somewhere between 1.3 percentage points less likely and 1.1 more likely to cause drowsiness. Can you now conclude that Effluvium affects the chance of drowsiness? No, because 0 (“no difference”) is inside your confidence interval. Maybe Effluvium makes drowsiness less likely, maybe it has no effect, maybe it makes drowsiness more likely; you can’t tell.
When one endpoint of your confidence interval is negative and one is positive, so that the confidence interval includes 0, you can’t tell whether there’s a significant difference between the two populations or not.
This is exact for numeric data but approximate for binomial data. Why? Because the HT and CI use the same standard error for the numeric data cases, but slightly different standard errors for two-population binomial data. (The two calculations are in BTW paragraphs earlier in the chapter.)
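The rule in this section boils down to checking whether 0 lies inside the interval. Here is a tiny Python sketch of that decision, applied to the two Effluvium intervals from Examples 15 and 16. The function name and the wording of its return values are invented for illustration.

```python
def interpret_difference_ci(low, high):
    """Say what a CI for (population 1 minus population 2) lets you conclude."""
    if low > 0:
        return "significant: population 1 is higher"
    if high < 0:
        return "significant: population 1 is lower"
    return "inconclusive: 0 is inside the interval"


example_15 = interpret_difference_ci(0.017, 0.041)
example_16 = interpret_difference_ci(-0.013, 0.011)
print(example_15)
print(example_16)
```

Note that the third outcome says "inconclusive", not "no difference". You can't tell whether there is a difference, which is not the same as showing there is none.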
Confidence intervals for two populations are easy enough to calculate on your TI-83. But one or both endpoints can be negative, and that means you have to write your interpretation carefully. Don’t just say “difference”; specify which population’s mean or proportion is larger or smaller. You must also distinguish between mean difference (for paired data) and difference in means (for unpaired data).
Study these examples of confidence intervals for two populations, and you’ll learn how to write your interpretations like a pro!
Here’s an example adapted from Johnson and Kuby (2003, 425) [see “Sources Used” at end of book]. Men’s and women’s heights are ND. From this random sample, estimate the difference in average height as a 95% CI.
Sample  Mean, x̅  Standard Deviation, s  Sample Size, n 

Female, pop. 2  63.8"  2.18"  20 
Male, pop. 1  69.8"  1.92"  30 
You have independent samples here: you get one number from each individual. The data type is numeric (height), so you have Case 4, difference of independent means.
With independent means, you check requirements for each sample separately.
All requirements for Case 4 are met.
The TI-83 or TI-84 computes μ_{1}−μ_{2}, so you need to decide which will be population 1 and which will be population 2. I like to avoid negative signs, so unless there’s a good reason to do otherwise I take the sample with the larger mean as sample 1; in this case that’s the men.
Whichever way you decide, write it down: pop 1 = ________, pop 2 = ________.
On your calculator, press [STAT] [◄] and scroll up or down to find 0:2SampTInt. Enter the sample statistics and use Pooled:No. Here are the input and output screens:
Conclusion: With 95% confidence, the average man at that college is between 4.8″ and 7.2″ taller than the average woman, or μ_{M}−μ_{F} = 6.0″±1.2″. (You would probably present one or the other of those forms, not both.)
(6.0 is the difference of sample means and is the center of the confidence interval: x̅_{1}−x̅_{2} = 69.8−63.8 = 6.0.)
Remark: The difference from the case of dependent means is subtle but important. With dependent means (paired data), the CI is about the average difference in measurements of a single randomly selected individual or matched pair. But with independent means (unpaired data), the CI is about the difference between the averages for two different populations.
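For the curious, here is a Python sketch of the unpooled two-sample t interval for the height data. I am assuming that the calculator's Pooled:No option uses the usual Welch approximation for degrees of freedom. Python's standard library has no t distribution, so the critical value 2.026 is an assumption read from a t table for about 37 degrees of freedom at 95% confidence. The function name is invented.

```python
from math import sqrt


def welch_interval(mean1, sd1, n1, mean2, sd2, n2, t_crit):
    """Unpooled CI for mu1 - mu2; also returns the Welch degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2  # variance of each sample mean
    se = sqrt(v1 + v2)                     # variances add
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    diff = mean1 - mean2
    return diff - t_crit * se, diff + t_crit * se, df


# Population 1 = men, population 2 = women; t_crit = 2.026 is a table lookup.
low, high, df = welch_interval(69.8, 1.92, 30, 63.8, 2.18, 20, t_crit=2.026)
print(f"df is about {df:.1f}; {low:.1f} to {high:.1f} inches")
```

The degrees of freedom are not a whole number, which is normal for the unpooled procedure. If you change the data, you need to look up a new critical value to match the new degrees of freedom.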
Now let’s make up new data for the coffee example. (The new d’s are still normally distributed with no outliers.) Again, you’re estimating the mean difference in heart rate due to drinking coffee.
Person  1  2  3  4  5  6 

Before  78  64  70  71  70  68 
After  79  62  73  70  71  67 
d = A−B  1  −2  3  −1  1  −1 
Notice that some heart rates declined after the people drank coffee. Now when you compute a 95% CI you get the results shown at right.
How should you interpret a negative endpoint in the interval? Remember that you are computing a CI for the quantity After−Before. You could follow the earlier pattern and say “With 95% confidence, the mean increase in heart rate for all individuals after drinking coffee is between −1.8 and +2.1 beats per minute,” but only a mathematician would love a statement that talks about an increase being negative. Instead, you draw attention to the fact that the change might be a decrease or an increase, as follows.
Conclusion: With 95% confidence, the mean change in heart rate for all individuals after drinking coffee is between a decrease of 1.8 and an increase of 2.1 beats per minute. Since it’s obviously very important to get the direction right, be sure to check your conclusion against your H_{1} (if any) and your original definition of d.
Remark 1: Though it’s correct to present the CI as a point estimate and margin of error, it’s probably not a good idea because that form is so easy to misinterpret. If you say “With 95% confidence, the mean increase in heart rate for all individuals is 0.2±1.9 beats per minute,” many people won’t notice that the margin of error is bigger than the point estimate, and they’ll come to the false conclusion that you have established an increase in heart rate after drinking coffee. As statistics mavens, we have a responsibility to present our results clearly, so that people draw the right conclusions and not the wrong ones.
Remark 2: Remember that the CI occupies the middle of the distribution while the HT looks at the tails. If 0 is inside the CI, it can’t be in either tail. Therefore, from this confidence interval you know that testing the null hypothesis μ_{d} = 0 at the 0.05 level (0.05 = 1−95%) would fail to reject H_{0}: this experiment failed to find a significant difference in heart rate after drinking coffee. (See Confidence Interval and Hypothesis Test (Two Populations).)
Remember the difference between “no significant difference found” and “no difference exists”. Since 0 is in the CI, you can’t say whether there is a difference. The correct statement, “I don’t know whether there is a difference,” is different from the incorrect “There is no difference.”
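The paired-data interval for the new coffee data can be reproduced with a short Python sketch. It works on the differences d = After − Before, exactly as the text describes. The critical value 2.571 is an assumption taken from a t table for 5 degrees of freedom at 95% confidence, since Python's standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev

before = [78, 64, 70, 71, 70, 68]
after = [79, 62, 73, 70, 71, 67]

# Paired data: analyze the differences, not the original numbers.
d = [a - b for a, b in zip(after, before)]

t_crit = 2.571  # table value for df = n - 1 = 5, 95% confidence
d_bar = mean(d)
margin = t_crit * stdev(d) / sqrt(len(d))
low, high = d_bar - margin, d_bar + margin

print(f"d = {d}")
print(f"mean change: {d_bar:.1f} ± {margin:.1f}, or {low:.1f} to {high:.1f} beats per minute")
```

The printout makes Remark 1 concrete: the margin of error is much larger than the point estimate, so the interval straddles zero.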
The following data are from Dabes and Janik (1999, 269) [see “Sources Used” at end of book]. Men and women were polled in a systematic sample on whether they favored legalized abortion, and the results were as follows:
Sample  Number in Favor, x  Sample Size, n 

Females, pop. 1  60  100 
Males, pop. 2  40  80 
Find a 98% confidence interval for the difference in level of support between women and men.
You have binomial data: each person either supports legalized abortion or not. (Obviously this example is oversimplified.) Binomial data with two populations is Case 5, difference of proportions.
Support among the sample of women is 60/100 = 60%, and among the sample of men is 40/80 = 50%, so let’s define population 1 = women, population 2 = men.
All requirements for a Case 5 CI are met.
On the TI-83 or TI-84, press [STAT] [◄] and scroll up to find B:2PropZInt. The input and output screens look like this:
Twopopulation confidence intervals can be tricky to interpret, particularly when the two endpoints have different signs and particularly for Case 5, two population proportions. You can reason it out in words, or use algebra.
In words, remember that the confidence interval is the estimated difference p_{1}−p_{2}, which is the estimated amount by which the proportion in the first population exceeds the proportion in the second population. So a negative endpoint for your CI means that the first proportion is lower than the second, and a positive endpoint means that the first proportion is larger.
Using algebra, begin with the calculator’s estimate of p_{1}−p_{2}:
−0.0729 ≤ p_{1}−p_{2} ≤ +0.27292 (98% conf.)
Add p_{2} to all three parts of the inequality, and you have
p_{2}−0.0729 ≤ p_{1} ≤ p_{2}+0.27292 (98% conf.)
That’s a little easier to work with. The 98% confidence bounds on p_{1} (level of women’s support) are p_{2}−0.0729 (7.3 percentage points below men’s support) and p_{2}+0.27292 (27.3 percentage points above men’s support).
Conclusion: You are 98% confident that support for legalized abortion is somewhere between 7.3 percentage points lower and 27.3 points higher among females than males.
Remark: It would be equally valid to turn that around and say you’re 98% confident that support is between 27.3 percentage points lower and 7.3 points higher among males than females.
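Turning the endpoints into a sentence is the error-prone step, so here is a hedged Python sketch that computes the 98% interval for the poll data and then words the conclusion according to the signs of the endpoints. The function name and the exact phrasing it produces are inventions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def describe_difference(x1, n1, x2, n2, confidence, name1, name2):
    """CI for p1 - p2 plus a plain-language reading in percentage points."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    margin = NormalDist().inv_cdf(1 - (1 - confidence) / 2) * se
    low, high = (p1_hat - p2_hat) - margin, (p1_hat - p2_hat) + margin
    pts_low, pts_high = abs(low) * 100, abs(high) * 100
    if low < 0 < high:   # endpoints have different signs
        wording = f"between {pts_low:.1f} percentage points lower and {pts_high:.1f} points higher"
    elif low >= 0:       # whole interval positive
        wording = f"between {pts_low:.1f} and {pts_high:.1f} percentage points higher"
    else:                # whole interval negative
        wording = f"between {pts_high:.1f} and {pts_low:.1f} percentage points lower"
    return low, high, f"{confidence:.0%} confident: {wording} among {name1} than {name2}"


low, high, sentence = describe_difference(60, 100, 40, 80, 0.98, "females", "males")
print(sentence)
```

Swapping the two samples in the call flips the signs of the endpoints and the direction of the wording, which matches the remark above about turning the statement around.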
Johnson and Kuby (2003, 427) [see “Sources Used” at end of book] present another example. What is the difference (if any) in academic performance between fraternity members and nonmembers? Forty members of each population were randomly selected, and their cumulative GPA recorded as an indication of performance. The results were as follows:
Sample  x̅  s  n 

Fraternity members, pop. 1  2.03  0.68  40 
Independents, pop. 2  2.21  0.59  40 
Here you have numeric data, two independent samples. (You know it’s independent samples, unpaired data, because each member of the sample gives you just one number.) This is Case 4, difference of independent means.
Each sample was random, and each sample size is >30. We can assume that there are more than 10×40 = 400 fraternity members and 400 independents on campus. All requirements for Case 4 are met.
The CI is −0.46 to +0.10, with 95% confidence. To interpret this, remember that the TI-83 computes a CI for μ_{1}−μ_{2}, and we defined population 1 as fraternity and population 2 as independent. The calculator is telling you that
−0.46 ≤ μ_{1}−μ_{2} ≤ +0.10 (95% conf.)
or, adding μ_{2} to all three parts,
μ_{2}−0.46 ≤ μ_{1} ≤ μ_{2}+0.10 (95% conf.)
Conclusion: The true difference in academic performance, as measured by average GPA, is somewhere between 0.46 worse and 0.10 better for fraternity members relative to nonmembers, with 95% confidence.
You could also write a somewhat longer form: with 95% confidence, the average fraternity member’s academic performance, as measured by GPA, is somewhere between 0.46 worse and 0.10 better than the average independent’s performance.
Remark: Don’t be fooled by the fact that the CI is mostly below zero. You really cannot conclude that fraternity members probably have lower academic performance. Remember that the 95% CI is the result of a process that captures the true population mean (or difference, in this case) 95 times out of 100. But you can’t know where in that interval the true mean (or difference) lies. If you could, there would be no point to having a CI!
Remark 2: Even though zero is within the CI, you must not say that there is no difference in performance between members and nonmembers. The difference might indeed be zero, but it might also be anywhere between 0.46 in one direction and 0.10 in the other. There’s even a 5% chance that the true difference lies outside those limits. Always bear in mind the difference between insufficient evidence for and evidence against. (You may hear that said as “lack of evidence for is not evidence against.”)
This chapter covered confidence intervals and hypothesis tests for two samples, both binomial and numeric data. Instead of testing a population’s μ or p against some baseline number, you test the μ or p of two populations against each other.
If there’s an effect to be found, you’re more likely to find it with a paired-data design than with unpaired data.
Caution! Just having equal sample sizes is not enough for paired data; there has to be an association between each member of one sample and a specific member of the other sample. They can be two tasks performed by the same individual, husband-wife studies, identical-twin studies, etc.
In step 1 of your HT or at the beginning of your CI, write the definition of d, showing which direction you will subtract. Your HT is about μ_{d}.
Do your requirements check on the d’s — you don’t care whether the original numbers pass requirements.
Use plain TTest or TInterval on the differences.
In HT step 1 or at start of CI, identify population 1 and population 2 (not sample 1 and sample 2).
Check requirements on each sample separately.
Use 2SampTTest or 2SampTInt.
In HT step 1 or at start of CI, identify population 1 and population 2 (not sample 1 and sample 2).
Requirements are slightly different between CI and HT, and in fact with HT it’s easier to check requirements after the computations.
Use 2PropZTest or 2PropZInt.
Be able to calculate necessary sample size to keep margin of error below a desired value for a desired confidence level.
With binomial data, the difference is a matter of percentage points, not percent.
Your interpretation should clearly be about the populations, not the samples.
Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.
Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.
You want to determine whether sports fans would pay 20% extra for reserved bleacher seats. You suspect that the answer may be different between people aged under 30 and people 30 years old or older.
(a) How big must your samples be to let you construct a 95% confidence interval for the difference, with a margin of error of 3 percentage points?
(b) You discover a poll done last year, in which 30% of young people and 45% of older people said they would pay extra. Now how large must your samples be?
A researcher wanted to see whether the English like soccer more than the Scots. She randomly selected eight English and eight Scots and asked them to rate their liking for soccer on a numeric scale of 1 (hate) to 10 (love), and she recorded these responses:
English  6.4  5.9  2.9  8.2  7.0  7.1  5.5  9.3 

Scots  5.1  4.0  7.2  6.9  4.4  1.3  2.2  7.7 
(a) From the above data, can the researcher prove that the English have a stronger liking for soccer than the Scots? Use α = 0.05.
(b) Construct a 90% confidence interval for the difference between the average levels of English and Scottish enthusiasm for soccer.
Group  Sample Size  Number of “Yes”

English  150  105 
Scots  200  160 
Another researcher took a different approach. She polled random samples with the question “Do you watch football at least once a week?” (In the UK they call soccer “football”.) She got the results shown in the table above.
(a) At the 0.05 level of significance, are the English and the Scots equally likely to be soccer fans?
(b) Construct a 95% confidence interval for the difference.
(c) Find the margin of error in that interval.
(d) If the researcher repeats her survey, what sample size would she need to reduce the margin of error to 4 percentage points at the same confidence level?
Person  Before Running  After Running 

1  30  35 
2  34  39 
3  36  42 
4  34  33 
5  40  48 
To see if running raises the HDL (“good”) cholesterol level, five female volunteers (randomly selected) had their HDL level measured before they started running and again after each had run regularly for six months, an average of four miles daily.
(a) See if you can prove that the average person’s HDL cholesterol level would be raised after all that running. Use α = 0.05.
(b) Compute and interpret a 90% confidence interval for the change in HDL from running four miles daily for six months.
The Physicians’ Health Study tested the effects of 325 mg aspirin every other day in preventing heart attack in people with no personal history of heart attack. 22,071 doctors were randomized into an aspirin group and a placebo group. Of the 11,037 doctors who received aspirin, 10 had fatal heart attacks and 129 had nonfatal heart attacks; total 139. For the 11,034 doctors in the placebo group, the figures were 26 and 213; total 239.
(a) At the 0.001 level of significance, does this aspirin regimen make a difference to the likelihood of a heart attack?
(b) Find a 95% confidence interval for the reduced risk.
County  Mean  S.D.  n

Cortland Co.  $134,296  $44,800  30 
Broome Co.  $127,139  $61,200  32 
June was planning to relocate to Central New York, considering Binghamton and Cortland. She found an online survey of prices of recently completed house sales, as shown in the table above. The survey was two random samples taken about a month before she looked at the Web site.
(a) Construct a 95% confidence interval for the difference in mean house price in the two counties.
(b) Use that answer to determine which county has a lower average price of houses, at the 0.05 significance level.
The Canter Polling Service conducted two national polls in the same week, one for the Red Party candidate and one for the Blue Party candidate. Each one was a random sample of 1000 likely voters — not the same 1000, of course. (Most national polls have sample size 1000.)
In the first sample, 520 (52%) said they would vote for Red. In the second sample, 480 (48%) said they would vote for Blue. The newspaper reported that Red was leading by 4%. What’s wrong with that? Write a correct statement, at the 95% level of confidence.
Updates and new info: https://BrownMath.com/swt/