
Stats without Tears
11. Inference from Two Samples

Updated 1 Jan 2016 (What’s New?)
Copyright © 2013–2017 by Stan Brown


Intro: In Chapter 10, you looked at hypothesis tests for one population, where you asked whether a population mean or proportion is different from a baseline number. In this chapter, you’ll ask “are these two populations different from each other?” (hypothesis test) and “how large is the difference?” (confidence interval).


11A.  Numeric Data — Paired or Unpaired?

That’s the key question when you’re doing inference on numeric data from two samples. Your answer will control how you analyze the data, so let’s look closely at the difference.

Unpaired Data / Independent Samples

Definitions: You have unpaired data when you get one number from each individual in two unrelated groups. The two groups are known as independent samples.

Independent samples result when you take two samples completely independently, or if you take one sample and then randomly assign the members to groups. Randomization always gives you independent samples.

Example 1: What if any is the average difference in time husbands and wives spend on yard work? You randomly select 40 married men and 40 married women and find how much time a week each spends in yard work. There’s no reason to associate Man A with Woman B any more than Woman C; these are independent samples and the data are unpaired.

Example 2: How much “winter weight” does the average adult gain? You randomly select 500 adults and weigh them all during the first week of November. Then during the last week of February you randomly select another 500 adults and weigh them. The data are unpaired, and the samples are independent.

Before you read further, what’s the big problem in the design of those two studies?

Right! Our old enemy, confounding variables. Look at the examples again, and see how many you can identify. For example, what might make a random person in one sample weigh more or less than a random person in the other sample, other than the passage of time? What might make a random woman spend more or less time on yard work than a random man, apart from their genders?

With independent samples, if there’s actually a difference between the two groups, it may be swamped by all the differences within each group.

Paired Data / Dependent Samples

Definitions: You have paired data when each observational unit gives you two numbers. These can be one number each from a matched pair of individuals, or two numbers from one individual. Paired data come from dependent samples.

Example 3: What if any is the average difference in time husbands and wives spend on yard work? You randomly select 40 couples and find how much time a week each person spends in yard work. Each husband and wife are a matched pair. The samples are dependent because once you’ve chosen a couple you’ve equally specified a member of the “wives” sample and a member of the “husbands” sample.

Example 4: How much “winter weight” does the average adult gain? You randomly select 500 adults and weigh them all during the first week of November, then again during the last week of February. You have paired data in the before and after numbers. The two samples are dependent because they are the same individuals.

Do you see how a design with paired data (dependent samples) overcomes the big problem with unpaired data (independent samples)? You want to study weight gain, and now that’s what you’re measuring directly. You wanted to know whether husband or wife spends more time on yard work, and now you’ve eliminated all the differences between couples.

Paired data are more likely than unpaired to reveal an effect, if there is one. Why? Because a paired-data design minimizes differences within each group that can swamp any difference between groups.

In studying human development and behavior, twins are a prime source of dependent samples. If you have a pair of identical twins who were raised apart (and that’s surprisingly common), you can investigate which differences in people’s behavior are genetic and which are learned. The Minnesota Study of Twins (Bouchard 1990 [see “Sources Used” at end of book]) found that a lot of behaviors that “should” be learned seem to be genetic. The New York Times published a nontechnical account in “Major Personality Study Finds That Traits Are Mostly Inherited” (Goleman 1986 [see “Sources Used” at end of book]).

Paired and Unpaired Data Compared

Sample type | Dependent | Independent, or randomized
Numeric data type | Paired data | Unpaired data
How many numbers from each experimental unit? | Two | One
Can you rearrange★ one sample? | No | Yes
Problem of confounding variables | Minimal | Severe
Use this design … | … if you can | … if you must

★If the data from the sample are arranged in two rows or two columns, can you rearrange one row or column without destroying information?

Example 5: Seed Corn

[Figure: unpaired data/independent samples versus paired data/dependent samples]

Testing new corn versus standard corn for yield. Can you see a problem with the sample in Western New York that’s not a problem with the sample in Central New York?

Adapted from Dabes and Janik (1999, 263) [see “Sources Used” at end of book]

You’re the head of research for the Whizzo Seed Company, and you’ve developed a new type of seed that looks promising. You randomly select three farmers in Western New York to receive new corn, and three to receive your standard product. (Of course you don’t tell them which one they’re getting.) At the end of the season they report their yield figures to you.

What’s wrong with this picture? You can easily think of all sorts of confounding variables here: different soils, different weather, different insects, different irrigation, different farming techniques, and on and on. Those differences can be great enough to hide (confound) a difference between the two types of corn, especially in a small sample.

The following year, you try again in Central New York. This time you send each farmer two stocks of seed corn, with instructions to plant one field with the first stock and another field with the second.

Does that eliminate confounding variables? Maybe not totally, but it reduces them as far as possible. Now, if you see significant differences in yield between two fields planted by the same farmer, it’s almost sure to be due to differences in the seed.

When to Use Paired Data?

You always want to structure an experiment or observation with paired data (dependent samples) — if you can.

“If you can.” Aye, there’s the rub. Suppose you want to know whether attending kindergarten makes kids do better in first grade. There’s no way to set this up as paired data: how can a given kid both go through kindergarten and not go through kindergarten? Twin studies don’t help you here, because if the twins are raised together the parents will send both of them to kindergarten, or neither; and if the twins are raised apart then there will be too many other differences in their upbringing that could affect their performance in first grade.

If the samples are independent, you can’t pair the data, even if the samples are the same size. If you’re not sure whether you have dependent or independent samples, look back at Paired and Unpaired Data Compared.

Example 6: Where the Rubber Meets the Road

You want to determine whether a new synthetic rubber makes tires last longer than the competitor’s product. Can you see how to do this with independent samples (unpaired data) and with dependent samples (paired data)? Think about it before you read on.

For independent samples, you randomly assign drivers to receive four tires with your new rubber or four of the competitor’s tires. For dependent samples, you put two tires of one type on the left side of every driver’s car, and two on the right side of every driver’s car. (You do half the cars one way and half the other, to eliminate differences like the greater likelihood of hitting the curb on the right.)

With the first method, if there’s only a small difference between your rubber and the competitor’s, it may not show up because you’ve also got differences in driving styles, roads, and so forth — confounding variables again. With the second method, those are eliminated.

11B.  Inference with Paired Numeric Data (Case 3)


The hypothesis test is almost exactly like the Case 1 hypothesis test. The difference is that you define a new variable d (difference) in Step 1 and write hypotheses about μd instead of μ.

For a confidence interval, you’re estimating the average difference, not the average of either population. You need to state both size and direction of the effect.

Example 7: The Freshman Fifteen

You’ve probably heard about the “freshman fifteen”, the weight gain many students experience in their first year at college. The Urban Dictionary even talks about the “freshman twenty” (2004) [see “Sources Used” at end of book].

Francine wanted to know if that was a real thing or just an urban legend. During the first week of school, she got the other nine women in her chemistry class at Wossamatta U to agree to help her collect data. (She reasoned that students in any particular class would effectively be a random sample of the school, since class choice is unrelated to weight or other health issues. Of course that would be questionable for a spin class or a cooking class.)

Wossamatta U CHEM101 — Women’s Weights (in pounds)
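Student | A | B | C | D | E | F | G | H | I | J
Sept. | 118 | 105 | 123 | 112 | 107 | 130 | 120 | 99 | 119 | 126
May | 125 | 114 | 128 | 122 | 106 | 143 | 124 | 103 | 125 | 135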

When she had the data, Francine realized she didn’t know what to do next. If she had just one set of numbers, she would do a Student’s t test, since she doesn’t know the population standard deviation (SD). But what to do with two lists?

Then she had a brainstorm. She realized that she’s not trying to find out anything about students’ weights. She wants to know about their weight gain. Looking at their weights, she’d have plenty of lurking variables starting with pre-college diet and lifestyle. Looking only at the weight gain minimizes or eliminates those variables, and measures just what happened to each student during freshman year. So she added a third row to her chart:

Wossamatta U CHEM101 — Women’s Weights (in pounds)
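Student | A | B | C | D | E | F | G | H | I | J
Sept. | 118 | 105 | 123 | 112 | 107 | 130 | 120 | 99 | 119 | 126
May | 125 | 114 | 128 | 122 | 106 | 143 | 124 | 103 | 125 | 135
d = May−Sept. | 7 | 9 | 5 | 10 | −1 | 13 | 4 | 4 | 6 | 9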

Notice the new variable d, the difference between matched pairs. (You know the data must be paired, because each May number is associated with one and only one September number. You can’t rearrange the May numbers and still have everything make sense.) This is the heart of Case 3 in Inferential Statistics: Basic Cases: reducing paired numeric data to a simple t test of a population mean difference.

Here’s what’s new:

Now she’s all set. She has one set of ten numbers, representing the continuous variable “weight gain in freshman year” for a random sample of Wossamatta U women. (Notice with student E, Francine has a negative value for d because May minus September is 106−107 = −1. That student lost weight as a freshman.) Time for a t test!


But first, what will she test? Her original idea was to test the “freshman fifteen”. But a glance at the d’s shows her that no one gained as much as 15 lb. An average can’t be larger than every member of the data set, so there’s no way she could prove a hypothesis that the average gain is above fifteen pounds. She decides instead to try to prove a “freshman five”, μd > 5, with 0.05 significance.

Subtle point here: You never use sample data in a hypothesis, but you can sometimes adjust your hypotheses after you collect your data, especially when it’s obvious that your data won’t prove what you wanted to prove. Another reasonable choice for Francine would be to try to prove simply that the average student gains weight, μd > 0.

When you do a confidence interval, you don’t have to make any decision of this kind because you just follow the data where they lead.

Entering Paired Numeric Data

Francine subtracted by hand here, but you shouldn’t do that because it’s a rich source of errors and makes it harder to check your work. Instead, follow this procedure on your TI-83/84:

  1. Enter the first data set (September, in this case) in L1.
  2. Enter the second data set (May) in L2. Unlike the one-population cases, the order matters.
  3. Check your data entry. Since you entered all of the September figures and then all the May figures, check them the opposite way, first student A September and May, then student B, and so on.
  4. Cursor to L3 — the column heading, not the first number.
  5. Francine defined d as May−Sept., which is L2−L1, so enter that formula. (To subtract in the other direction, enter L1−L2.) As soon as you press [ENTER], the calculator does all the subtractions, wiping out whatever was in L3 previously.

    [Screenshots: TI-83 screen with L1 and L2 entered and the formula L2−L1 in progress for L3; TI-83 screen after applying the formula, with the differences shown in L3]

    This isn’t Excel — if you change L1 or L2 after entering the formula for L3, L3 won’t change. You need to re-enter the formula for L3 in that case. (You actually can make the calculator behave like Excel by binding a formula to a list, but it’s not worth the hassle.)
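For readers who want to double-check the calculator work on a computer, the same list arithmetic can be sketched in Python. This is only an illustration: the variable names `sept`, `may`, and `d` are made up here, and the student labels A through J are assumed from the order of the table.

```python
# Weights in pounds for students A through J (Example 7).
sept = [118, 105, 123, 112, 107, 130, 120, 99, 119, 126]   # plays the role of L1
may = [125, 114, 128, 122, 106, 143, 124, 103, 125, 135]   # plays the role of L2

# Step 3: check data entry pair by pair rather than list by list.
for student, (s, m) in zip("ABCDEFGHIJ", zip(sept, may)):
    print(student, s, m)

# Step 5: d = May - Sept, the analogue of L2 - L1. Order matters.
d = [m - s for s, m in zip(sept, may)]
print(d)
```

As on the calculator, `d` is computed once. If you later correct a number in `sept` or `may`, you have to recompute `d` yourself.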

Hypothesis Test for Mean Difference

With paired numeric data, your population parameter is the mean difference μd. The random variable is a difference (in this case, a number of pounds gained from September to May), so the parameter is the mean of all those weight gains.

(1)d = May−September
H0: μd = 5, average student gains 5 lb or less
H1: μd > 5, average student gains more than 5 lb
(2) α = 0.05
  • Random sample? Yes, effectively. (It’s a random sample of Wossamatta U women frosh, not necessarily those from other colleges.)
  • 10n ≤ N? Yes, because any university has more than 10×10 = 100 women in the freshman class.
  • n = 10 (< 30), so Francine must test for normality and verify absence of outliers. She tests L3, not L1 and L2, because L3 holds her sample data of weight gain:

    [Screenshots: TI-83 normality check; TI-83 boxplot with no outliers]

    r=.9811 and crit=.9179. r>crit, and the box-whisker shows no outliers.

(3/4) This is a regular T-Test, number 2 in the STAT TESTS menu. Francine writes down
T-Test: 5, L3, 1, >μo
results: t=1.29, p = 0.1146, d̅=6.6, s=3.9, n=10

[Screenshots: TI-83 T-Test input screen; TI-83 T-Test results screen]

The sample mean is d̅ (“d-bar”), not x̅, because the data are d’s, not x’s.

(5) p > α. Fail to reject H0.
(6) You can’t determine whether the average Wossamatta U woman student gains more than 5 pounds in her freshman year or not (p = 0.1146).
Or: At the 0.05 significance level, you can’t determine whether the average Wossamatta U woman student gains more than 5 pounds in her freshman year or not.
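To see where the T-Test numbers come from, here is a rough sketch that recomputes them from the ten differences. The helper names `t_pdf` and `right_tail` are invented for this illustration, and the p-value is found by numerically integrating the t density with Simpson's rule. That is a stand-in technique chosen to keep the example self-contained, not a description of how the calculator works internally.

```python
import math
import statistics

d = [7, 9, 5, 10, -1, 13, 4, 4, 6, 9]   # May - Sept, from Example 7
mu0 = 5                                  # hypothesized mean difference in H0

n = len(d)
d_bar = statistics.mean(d)
s = statistics.stdev(d)                  # sample SD (n - 1 in the denominator)
t = (d_bar - mu0) / (s / math.sqrt(n))   # one-sample t statistic on the differences
df = n - 1

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def right_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on the density from 0 to t."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

p = right_tail(t, df)                    # right tail because H1 is mu_d > 5
print(round(d_bar, 1), round(s, 1), round(t, 2), round(p, 4))
```

The printed values should agree with the calculator's d̅, s, t, and p to the precision shown in step 3/4.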

After a “fail to reject H0”, you always remember to write your conclusion in neutral language, right? Maybe the true average weight gain is greater than 5 pounds but this particular sample just happened not to show it; maybe the true average weight gain really is under 5 pounds. A confidence interval can help you get a handle on the effect size.

Confidence Interval for Mean Difference

When a hypothesis test fails to reach a conclusion, a confidence interval can salvage at least some information. When a hypothesis test does reach a conclusion, a confidence interval can give you more precise information.

If Francine was doing only the confidence interval, she’d have to start off by testing requirements. But she has already tested them as part of the hypothesis test, so she goes right to the TINTERVAL screen.

Which confidence level does she choose? Her one-tailed hypothesis test at α = 0.05 would be equivalent to a two-tailed test at α = 0.10, and that suggests a confidence level of 90%. But she decides since her hypothesis test has already failed to reach a conclusion she’d at least like to get a 95% CI.

TInterval: L3, 1, .95
results: (3.7948, 9.4052)

[Screenshots: TI-83 TInterval input screen; TI-83 TInterval results screen]

Conclusion: Francine is 95% confident that the average woman student at Wossamatta U gains 3.8 to 9.4 pounds during her freshman year.

(Francine doesn’t write down d̅, s, and n because she’s already written them in the hypothesis test. She would write them down when she does only a confidence interval.)

Common mistake: Don’t say the average weight is 3.8 to 9.4 pounds. You aren’t estimating the average first-year woman’s weight, but her weight gain.

Always re-read your conclusion after you write it, and ask yourself whether it seems reasonable in the context of the problem. That can save you from mistakes like this.
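The interval from the TInterval screen can be reproduced by hand as d̅ ± t*·s/√n. The sketch below assumes a critical value of 2.2622 for 95% confidence with 9 degrees of freedom, taken from a standard t table rather than computed, and the variable names are invented for illustration.

```python
import math
import statistics

d = [7, 9, 5, 10, -1, 13, 4, 4, 6, 9]   # May - Sept, from Example 7
t_star = 2.2622   # assumed table value: t critical for 95% confidence, df = 9

n = len(d)
d_bar = statistics.mean(d)
margin = t_star * statistics.stdev(d) / math.sqrt(n)   # margin of error
low, high = d_bar - margin, d_bar + margin
print(round(low, 1), round(high, 1))
```

Note that the endpoints are pounds gained, not pounds of body weight, which is exactly the distinction the “common mistake” above warns about.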

Example 8: Coffee and Heart Rate


A few years back, a coffee company tried to market drinking coffee as a way to relax — and they weren’t talking about decaf. Jon decided to test this. He randomly selected six adults. He recorded their heart rates, then recorded them again half an hour after each person drank two cups of regular coffee. His data are shown at right. (Data come from Dabes and Janik [1999, 264] [see “Sources Used” at end of book].)

The data are paired, because each person (experimental unit) gives you two numbers, Before and After; because each After is associated with one specific Before; and because you can’t rearrange Before or After and still have the data make sense.

Jon selected the 0.01 significance level. (He tests for difference even though he believes coffee increases heart rate, because it could decrease it.)

Jon could equally well define d as Before−After or After−Before. At least, mathematically he could. But you’ll find it’s easier to interpret results if you always define d as high minus low so that all or most of the d’s will be positive numbers. (You can do this based on your common sense or by looking at the data.) Jon sees that the After numbers are generally larger than the Before numbers, so he chooses d = After−Before.

(1)d = After−Before
H0: μd = 0, coffee makes no difference to heart rate
H1: μd ≠ 0, coffee makes a difference to heart rate
(2) α = 0.01
(RC)Jon has a random sample, but the sample size is <30. (The sample of six is obviously less than 10% of coffee drinkers.) He puts the Before figures in L1, After in L2, and then L2−L1 (not L1−L2) in L3. The box-whisker plot of L3 finds no outliers. The normal probability plot shows r=.9638, crit=.8893; r>crit.

[Screenshots: box-whisker plot showing no outliers; normality check]

(3/4) T-Test: 0, L3, 1, ≠μo
results: t=5.56, p = 0.0026, d̅=4.2, s=1.8, n=6
[Screenshot: T-Test results screen showing t=5.562426994, p=.0025836914, x-bar=4.166666667, s=1.834847859, n=6]
(5) p < α. Reject H0 and accept H1.
(6)Drinking coffee does make a difference in heart rate half an hour later (p = 0.0026). In fact, coffee increases heart rate.
Or: Drinking coffee does make a difference in heart rate half an hour later, at the 0.01 significance level. In fact, drinking coffee increases heart rate.

As usual, when you do a two-tailed test and p < α, you can interpret it in a one-tailed manner. Jon defined d as After−Before, which is the amount of increase in each subject’s heart rate. His sample mean d̅ was positive, so the average outcome in his sample was an increase. Because he proved that the mean difference μd for all people is nonzero, the sign of his sample mean difference d̅ tells him the sign of the population mean difference μd.

Jon can’t say that the average increase for people in general is 4.2 beats per minute. That was the mean difference in his sample. If he wants to know the mean difference for all people, he has to construct a confidence interval:

TInterval: L3, 1, .99

result: (1.146, 7.187)

Jon is 99% confident that the average increase in heart rate for all people, half an hour after drinking two cups of coffee, is 1.1 to 7.2 beats per minute.

Caution! The confidence interval expresses a difference, not an absolute number. You are estimating the amount of increase or decrease, not the heart rate. A common mistake would be to say something about the heart rate being 1.1 to 7.2 bpm after coffee. Again, you’re not estimating the heart rate, you’re estimating the change in heart rate.

11C.  Inference with Unpaired Numeric Data (Case 4)

With paired data, you tested the population mean difference μd between matched pairs. But suppose you don’t have matched pairs? With unpaired data in independent samples, you test the difference between the means of two populations, μ1−μ2.

This is Case 4 in Inferential Statistics: Basic Cases. Key features:

Advice: Take your time when you look at data to decide whether you have paired or unpaired data. If your sample sizes are different, it’s a no-brainer: the data are unpaired. But if the sample sizes are the same, think carefully about whether the data are paired or unpaired. Sometimes students just seem to take a stab in the dark at whether data are paired or unpaired, but if you just stop and think about how the data were taken you can make the right decision every time. Look back at Paired and Unpaired Data at the beginning of this chapter if you need a refresher on the difference.
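To see why the choice matters, here is a small sketch that analyzes Francine's Example 7 weights both ways. It compares the standard error of the mean difference (the paired analysis) with the standard error you would get from √(s1²/n1 + s2²/n2) if you wrongly treated the two lists as independent samples. The variable names are illustrative only.

```python
import math
import statistics

sept = [118, 105, 123, 112, 107, 130, 120, 99, 119, 126]
may = [125, 114, 128, 122, 106, 143, 124, 103, 125, 135]
n = len(sept)

# Paired analysis: one list of differences, one standard error.
d = [m - s for s, m in zip(sept, may)]
se_paired = statistics.stdev(d) / math.sqrt(n)

# Unpaired analysis (wrong for these data): treat the lists as independent samples.
se_unpaired = math.sqrt(statistics.variance(may) / n + statistics.variance(sept) / n)

print(round(se_paired, 2), round(se_unpaired, 2))
```

Both analyses estimate the same average gain of 6.6 lb, but the unpaired standard error comes out about four times as large, because it includes all the person-to-person variation in weight. That is the variation a paired design removes.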

Example 9: A Tough Grader?

Prof. Sullivan’s students at Wossamatta U felt that he was a tougher grader than the other speech professors. They decided to test this, at the 0.05 significance level.

Eight of them each took a two-hour shift, assigned randomly at different times and days of the week, and distributed a questionnaire to each student on the main quad. They felt this was a reasonable approximation to a random sample of current students. (They asked students not to take a questionnaire if they had already submitted one.) The questionnaire asked whether the student had taken speech in a previous semester, and if so from which professor and what grade they received. They then divided the questionnaires into three piles, “no speech”, “Sullivan”, and “other prof”.

It would be possible to do an analysis with the categorical data of letter grades. But you should always use numerical data when you can, because p-values are usually lower with numeric data than attribute data, for a given sample size. The students counted an A as 4 points, A-minus as 3.7, and so on. Here is a summary of their findings:

Students of | Mean | Standard deviation | Sample size
Sullivan | 2.21 | 1.44 | 32
Other prof | 2.68 | 1.13 | 54

Hypothesis Test for Difference of Means

In this test, you have unpaired numeric data in two samples. The requirements for each sample are the same as the test for the sample in a one-sample t test:

There’s an additional requirement for the two samples:

Here’s the hypothesis test, as performed by Prof. Sullivan’s students:

(1)pop. 1 = Sullivan students, pop. 2 = other speech profs’ students
H0: μ1 = μ2, no difference in average grades
H1: μ1 < μ2, Sullivan’s grades lower on average
(2) α = 0.05
  • Random sample (systematic).
  • Are samples less than 10% of their populations? 10×32 = 320, and 10×54 = 540. At a university there are almost certainly more speech students per professor than that, especially considering multiple years.
  • Both sample sizes > 30.
  • Samples independent (no connection between Sullivan students and non-Sullivan students).
(3/4) 2-SampTTest: x̅1=2.21, s1=1.44, n1=32, x̅2=2.68, s2=1.13, n2=54, μ1<μ2, Pooled:No
Results: t = −1.58, p = 0.0600, df=53.58

[Screenshots: 2-SampTTest input screen (x̅1 = 2.21, Sx1 = 1.44, n1 = 32, x̅2 = 2.68, Sx2 = 1.13, n2 = 54, μ1 < μ2, Pooled: No); 2-SampTTest results screen (t = −1.58036729, p = .0599544103, df = 53.57941422)]

The test statistic is still Student’s t, but adapted for two samples. See the BTW note below for more about that and about the funny number of degrees of freedom.

(5) p > α. Fail to reject H0.
(6)At the 0.05 level of significance, they can’t determine whether Prof. Sullivan is a tougher grader than the other professors or not.
BTW: How does your calculator analyze a difference of independent means? If you remember what you learned about one-sample t tests, all you have to do is extend it.

You’re working with a difference of sample means. The standard error of the mean for the first population is s1/√n1 and therefore the variance is s1²/n1, and similarly for the second population. The variance of the sum or difference of independent variables is the sum of their variances, so VAR(x̅1−x̅2) = s1²/n1 + s2²/n2. The standard deviation (the standard error of the difference of sample means) is the square root of the variance:

$$SE_{\bar{x}_1-\bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

It turns out that the difference of sample means follows a t distribution, if you choose the right number of degrees of freedom (more on that later). The one-sample test statistic was t = (x̅−μo) / (s/√n). The two-sample test statistic is analogous, with the differences substituted:

$$t = \frac{(\bar{x}_1-\bar{x}_2) - (\mu_1-\mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

In this course, you’ll just be testing whether one population mean is greater than, less than, or different from the other. In other words, you’ll test against a hypothetical mean difference of 0. That simplifies t a bit:

$$t = \frac{\bar{x}_1-\bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

What about degrees of freedom? You might think df would be n1+n2−1, but it isn’t. The sampling distribution approximately follows a t with df equal to the lower of n1−1 and n2−1. It’s only approximate because the population SDs are usually different. The exact degrees of freedom were computed by B. L. Welch (1938) [see “Sources Used” at end of book], and the horrendous, ugly equation is:

$$df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1-1} + \dfrac{(s_2^2/n_2)^2}{n_2-1}}$$

Fortunately, your TI-83/84 has the computation built in, and you don’t have to worry about it.
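As a check on the formulas in this note, the sketch below computes the unpooled two-sample t statistic and the Welch degrees of freedom from summary statistics, then applies them to the Example 9 numbers. The function name `welch` and its argument names are invented for this illustration.

```python
import math

def welch(mean1, s1, n1, mean2, s2, n2):
    """Return (t, df) for H0: mu1 = mu2 from summary statistics, unpooled."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2      # variances of the two sample means
    se = math.sqrt(v1 + v2)                  # standard error of the difference
    t = (mean1 - mean2) / se
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Example 9: Sullivan's students versus other speech profs' students.
t, df = welch(2.21, 1.44, 32, 2.68, 1.13, 54)
print(round(t, 2), round(df, 2))
```

The result should match the calculator's t and its “funny” fractional df. The df lands between the smaller of n1−1 and n2−1 (here 31) and n1+n2−2 (here 84).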

What about pooling? Why do you always select Pooled:No on your TI-83/84? Well, if the two populations have the same SD (if they are homoscedastic) you can treat them as a single population (pool the data sets) and use a higher number of degrees of freedom. That in turn means your p-value will be a bit lower, so you’re a bit more likely to be able to reject H0. Sounds good, right? But there are problems:

For these reasons and others, the issue of pooling is controversial. Some books don’t even mention it. It’s best just to use Pooled:No always.

Confidence Interval for Difference of Means

The requirements are exactly the same as the requirements for the hypothesis test. You compute a confidence interval on your TI-83/84 through 2-SampTInt.

Since they couldn’t prove that Prof. Sullivan was a tough grader, the students decided to compute a 90% confidence interval for the difference between Prof. Sullivan’s average grades and the other speech profs’ average grades:

pop. 1 = Sullivan students; pop. 2 = other speech profs’ students
Requirements: already covered in hypothesis test.
2-SampTInt: x̅1=2.21, s1=1.44, n1=32, x̅2=2.68, s2=1.13, n2=54, C-Level=.9, Pooled:No
Results: (−.9678, .02779)

[Screenshots: 2-SampTInt input screen (x̅1 = 2.21, Sx1 = 1.44, n1 = 32, x̅2 = 2.68, Sx2 = 1.13, n2 = 54, C-Level = .9, Pooled: No); 2-SampTInt results screen (−.9678, .02779)]

Interpretation: The TI-83 gives you the bounds for the confidence interval about μ1−μ2. A negative number indicates μ1 smaller than μ2, and a positive number indicates μ1 larger than μ2. Therefore:

We’re 90% confident that the average student in Prof. Sullivan’s classes receives somewhere between 0.97 of a letter grade lower than the average student in other profs’ speech classes, and 0.03 of a letter grade higher.

Remark: The 90% confidence interval is almost all negative. This reflects the fact that the p-value in the one-tailed test for μ1 < μ2 was almost as low as 0.05.

The students could have chosen any confidence level they wanted, just for showing an effect size. But for a confidence interval equivalent to their one-tailed hypothesis test that used α = 0.05, the confidence level has to be 1−2×0.05 = 0.90 = 90%.

Why do you need a special two-sample t procedure? Can’t you just compute a confidence interval from each sample and then compare them? No, because the standard errors are different. The two-sample standard error takes the sample SD and sample sizes into account. Here’s a simple example, provided by Benjamin Kirk:

A farmer tests two diets for his pigs, randomly assigning 36 pigs to each sample. The Diet A group gained an average 55 lb with SD of 3 lb; that gives a 95% confidence interval 54.0 to 56.0 lb. The Diet B group gained 53 lb on average, with SD of 4 lb; the CI is 51.6 to 54.4 lb. Those intervals overlap slightly, which would not let you conclude that there’s any difference in the diets.

But the 2-SampTInt is 0.3 to 3.7 lb in favor of Diet A, which says there is a difference. The issue is that the B group had a lower sample mean, but there was more variation within the group.
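The pig example can be worked through numerically. The sketch below assumes two critical values read from a t table rather than computed: 2.030 for 95% confidence with 35 degrees of freedom, and 1.997 for 95% confidence with the Welch df of about 65 that these numbers produce. The helper `interval` and the variable names are invented for illustration.

```python
import math

n = 36                      # pigs in each diet group
mean_a, sd_a = 55, 3        # Diet A: mean gain and SD, in pounds
mean_b, sd_b = 53, 4        # Diet B

t_one = 2.030   # assumed table value: 95% confidence, df = 35
t_two = 1.997   # assumed table value: 95% confidence, Welch df of about 65

def interval(center, margin):
    """Return (low, high) for center plus or minus margin."""
    return center - margin, center + margin

# Separate one-sample intervals for each diet.
ci_a = interval(mean_a, t_one * sd_a / math.sqrt(n))
ci_b = interval(mean_b, t_one * sd_b / math.sqrt(n))

# Two-sample interval for the difference of means (A minus B).
se_diff = math.sqrt(sd_a ** 2 / n + sd_b ** 2 / n)
ci_diff = interval(mean_a - mean_b, t_two * se_diff)

print([round(x, 1) for x in ci_a])
print([round(x, 1) for x in ci_b])
print([round(x, 1) for x in ci_diff])
```

The two separate margins of error add up to more than the single margin for the difference, because variances add but standard errors do not. That is why the separate intervals can overlap even when the interval for the difference excludes 0.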

Example 10: Sorority Academics

The Alpha Alpha Alpha sorority chapter at Staples University (Yes, corporate sponsorship is getting ridiculous!) has a tradition of putting in extra effort academically. They gave their incoming pledges the task of proving that Alpha Alpha Alpha had higher average GPA than other sororities, at the 0.05 level of significance. The Alphas are a large sorority, with 119 members.

The pledges hacked the campus server and obtained GPAs of ten randomly selected Alphas and ten randomly selected members of other sororities on campus. Do their ill-gotten data prove their point?

Alphas: 2.31 3.36 2.77 2.93 2.27 2.35 3.
Other sororities: 1.49 1.74 2.70 2.40 2.17 1.08 1.85 1.96 2.08 1.49

Since you have independent samples (unpaired data) from two different populations, this is Case 4, difference of population means, in Inferential Statistics: Basic Cases.

Caution: You can’t treat these as paired data just because the sample sizes are equal; that’s a rookie mistake. When deciding between a paired or an unpaired analysis, always ask yourself: “Is data point 1 from the first sample truly associated with data point 1 from the second sample?” In this case, they’re not.

(1)pop. 1 = Alpha Alpha Alpha; pop. 2 = other sororities
H0: μ1 = μ2, No difference in average GPA
H1: μ1 > μ2, Average GPA of all Alphas is higher than other sororities
(2) α = 0.05

You check requirements against both samples independently. These samples are both smaller than 30, so you have to check normality and outliers on both. Here are the normality checks:

[Screenshots: normality check for Alphas' GPAs; normality check for other sororities' GPAs]

The first picture doesn’t look much like a straight line, but r is greater than crit, so it’s close enough. (With small data sets like this one, fitting the data to the screen can make differences look larger than they really are.)

[Screenshot: stacked boxplots of Alphas' and other sororities' GPAs]

The calculator lets you “stack” two or three boxplots on one screen. Not only is this a bit of a labor saver, but it also gives you a good sense of how different the samples are. To do this, select “Compare 2 smpl” on the first box-whisker screen. (You can guess what “Compare 3 smpl” does, but we don’t use it in this course.)

For these samples, the difference is dramatic. Every single Alpha’s GPA (in the sample) is above the third quartile in the sample of other sororities, and the max of other sororities is just barely above the median Alpha.

With such a big difference, why do the pledges even need to do a hypothesis test? Because they know these are just samples. Maybe the Alphas actually aren’t any better academically, but these particular samples just happened to be far apart. The hypothesis test tells you whether the difference you see is too big to be purely the result of random chance in sample selection.

  • Random samples, OK
  • 10% of Alphas is 12, and the sample is smaller than that. We don’t know how many are in all the other sororities combined, but it must be more than 10×10 = 100. OK
  • Normality check, sample 1: r(.9567) > crit(.9179), OK
  • Normality check, sample 2: r(.9946) > crit(.9179), OK
  • Box-whisker: no outliers in either sample, OK
(3/4) 2-SampTTest: L1, L2, 1, 1, >μ2, Pooled:No
outputs: t = 3.93, p-value = 0.0005,
x̄1 = 2.70, s1 = 0.43, n1 = 10
x̄2 = 1.90, s2 = 0.48, n2 = 10
(5) p < α. Reject H0 and accept H1.
(6) The average GPA in Alpha Alpha Alpha is higher than the average GPA of other sorority members (p = 0.0005).
(Or, at the 0.05 level of significance, the average GPA in Alpha Alpha Alpha is higher than the average GPA of other sorority members.)
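If you'd like to double-check the calculator's t statistic away from the TI-83/84, here is a minimal Python sketch. It isn't part of the course, and the function name welch_t is made up for illustration. It works from the rounded sample statistics shown in step 3/4, so it can differ from the calculator in the last digit. It assumes that Pooled:No corresponds to the usual unpooled (Welch) test with Welch-Satterthwaite degrees of freedom, and it stops short of the p-value because Python's standard library has no t distribution.

```python
import math

def welch_t(mean1, s1, n1, mean2, s2, n2):
    """Unpooled two-sample t statistic and approximate degrees of freedom.

    Hypothetical helper for illustration; inputs are summary statistics.
    """
    v1 = s1 ** 2 / n1            # estimated variance of sample mean 1
    v2 = s2 ** 2 / n2            # estimated variance of sample mean 2
    se = math.sqrt(v1 + v2)      # standard error of (mean1 - mean2)
    t = (mean1 - mean2) / se
    # Welch-Satterthwaite approximation (assumed to match Pooled:No)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Rounded summary statistics from the calculator output above
t, df = welch_t(2.70, 0.43, 10, 1.90, 0.48, 10)
print(f"t = {t:.2f}, df = {df:.1f}")
```

The t value should agree with the calculator's 3.93 to two decimal places. The degrees of freedom come out a little under the 18 you would get from pooling, which is typical of the unpooled test.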

Comment: You have to phrase your conclusion carefully. The pledges proved that the average GPA of Alphas is higher than the average GPA of all other sorority members, not all other sororities. What’s the difference? Here’s a simple example. Suppose there are ten other sororities besides the Alphas. The Omegas have an average GPA of 3.66, higher than the Alphas’ average. If the other nine each have an average GPA of 1.70, that could easily produce exactly the sample that the pledges got.

The message here: Aggregating data can lose information. Sometimes that’s okay, but be wary when one population is being compared to an aggregate of multiple other populations.

11D.  Inference on Two Proportions (Case 5)

When you have two samples of binomial data, they represent two populations. Each population has some proportion of successes, p1 and p2 respectively. You don’t know those true proportions, and in fact you’re not concerned with them. Instead, you’re concerned with the difference between the proportions, p1−p2. You can test whether there is a difference (hypothesis test), or you can estimate the size of the difference (confidence interval).

This is Case 5 in Inferential Statistics: Basic Cases. Key features of Case 5, the difference of proportions:

Advice: take your time with two-sample binomial data. You have a lot of p’s and a lot of percentages floating around, and it’s easy to get mixed up if you try to hurry.

Take extra care when writing conclusions. You’re making statements about the difference between the two proportions, not about the individual proportions. And you’re making statements about the difference in proportions between the populations, not between the samples.

Example 11: Traffic Stops and Traffic Tickets

Stopped by Traffic Cop
Just a

One of my students — call him Don — had several traffic tickets, and he knew one more would trigger a suspension. He felt that women stopped by a traffic cop were more likely than men to get off with just a warning, and for his Field Project he set out to prove it, with α = 0.05.

Don quickly realized that he should test whether men and women stopped by a cop are equally likely to get a ticket, not just whether men are more likely. After all, he couldn’t rule out the possibility that women are more likely to get a ticket if stopped.

Don distributed a questionnaire to a systematic sample of TC3 students. (He assumed that any gender-based difference in TC3 students would be representative of college students in general. That seems reasonable.) He asked three questions:

  1. Male or female?
  2. Stopped by a traffic cop since your 18th birthday?
  3. If yes, did you receive a ticket the last time you were stopped?

Don disregarded any questionnaires from students who had never been stopped as adults. He wasn’t interested in the likelihood of getting a ticket, but in the likelihood of getting a ticket after being stopped by a cop. You could say that he was interested in the different proportions, for men and women, of stops that lead to tickets.

Hypothesis Test for Difference of Proportions

This is just another variation on the good old Seven Steps of Hypothesis Tests:

Here are the requirements for a Case 5 hypothesis test of a difference of proportions:

Here is Don’s hypothesis test about the different proportions of men and women that receive tickets after being stopped in traffic.

(1) population 1 = college men stopped by traffic cops; population 2 = college women stopped by traffic cops
H0: p1 = p2, college men and women equally likely to get a ticket after being stopped
H1: p1 ≠ p2, college men and women not equally likely to get a ticket after being stopped
(2) α = 0.05
  • Samples 1 and 2 random? Yes, effectively (systematic).
  • 10n1 = 10×97 = 970, and there have been more than 970 male students (at all colleges) stopped by traffic cops.
  • 10n2 = 10×70 = 700, and there have been more than 700 female students (at all colleges) stopped by traffic cops.
  • Sample 1 has 86 successes and 97−86 = 11 failures; sample 2 has 55 successes and 70−55 = 15 failures.
(3/4) 2-PropZTest: 86, 97, 55, 70, p1≠p2
Results: z = 1.77, p-value = 0.0760, p̂1 = .89, p̂2 = .79, p̂ = .84

2-PropZTest input screen: x 1 = 86, n 1 = 97, x 2 = 55, n 2 = 70, p 1 not equal p 2      2-PropZTest results screen: z=1.774261748, p=.0760197674, p-hat 1 = .8865979381, p-hat 2 = .7857142857, p-hat = .8443113772
There’s a difference of 10 percentage points between the sample proportions, but with Don’s sample sizes that difference is not large enough to be statistically significant. Even if there really is a difference in proportions for college men and women in general, random chance would be enough to explain the difference Don sees in his samples.

(5) p > α. Fail to reject H0.
(6) At the 0.05 level of significance, Don can’t tell whether men and women stopped by traffic cops are equally likely to get tickets, or not.

If this non-conclusion leaves you non-satisfied, you’re not alone. As usual, the confidence interval (next section) can provide some information.

Why does the “official” requirement use a pooled proportion instead of testing each sample? In fact, for a confidence interval you always test requirements for each sample. But in a hypothesis test, your H0 is always “no difference in population proportions”, and a hypothesis test always starts by assuming H0 is true. If the null is true, then there is no difference in the two populations, and you really just have one big sample of size n1+n2 and sample proportion p̂. So that’s what you test.

Why is this a z test? For the same reason that a one-proportion test is a z test: from the population proportion p you know the SD.

Of course the two-population case is a bit more complicated. You need the key fact that when you add or subtract independent random variables, their variances add. If the two populations have the same proportion p, as H0 assumes, then the SD of the sampling distribution of the proportion for population 1 is √[p̂(1−p̂)/n1], and similarly for population 2, where p̂ is the pooled proportion mentioned in the requirements check, above. Square the SDs to get the variances, add them, and take the square root to get the standard error:

$$\mathrm{SE}_{\text{pooled}}(\hat{p}_1-\hat{p}_2) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n_1} + \frac{\hat{p}(1-\hat{p})}{n_2}}$$

And from this you have the test statistic:

$$z = \frac{\hat{p}_1-\hat{p}_2}{\mathrm{SE}_{\text{pooled}}(\hat{p}_1-\hat{p}_2)}$$
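The pooled standard error and test statistic can be sketched in a few lines of Python. This is only an illustration, not the calculator's own routine: the function name two_prop_z_test is invented here, and the sketch assumes Python 3.8 or later for statistics.NormalDist. It is run on Don's counts from the 2-PropZTest above.

```python
import math
from statistics import NormalDist

def two_prop_z_test(x1, n1, x2, n2):
    """Two-tailed z test of H0: p1 = p2, using the pooled proportion.

    Hypothetical helper for illustration only.
    """
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)          # pooled p-hat, assuming H0 is true
    se = math.sqrt(p_pool * (1 - p_pool) / n1 + p_pool * (1 - p_pool) / n2)
    z = (p1_hat - p2_hat) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # area in both tails
    return z, p_value, p_pool

# Don's data: 86 tickets in 97 stops (men), 55 tickets in 70 stops (women)
z, p_value, p_pool = two_prop_z_test(86, 97, 55, 70)
print(f"z = {z:.2f}, p-value = {p_value:.4f}, pooled p-hat = {p_pool:.2f}")
```

The printed z, p-value, and pooled proportion should match the 2-PropZTest results screen to the digits shown there.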

Confidence Interval for Difference of Proportions

In a confidence interval for the difference of two proportions, some unknown proportion p1 of population 1 has some characteristic, and some unknown proportion p2 of population 2 has that characteristic. You aren’t concerned with those proportions on their own, but you want to estimate which population has the greater proportion, and by how much.

The requirements for a CI are almost the same as a HT, but with one subtle difference:

Why is that last requirement different from the “official” requirement for the hypothesis test? With the HT, you assumed H0 was true and both populations had the same proportion. That let you use a blended or pooled proportion from your combined samples. But with a CI, you don’t make any such assumption. What would be the point of a confidence interval for the difference if you assume there is no difference?

But despite the difference in theory, as a practical matter you can just test for ≥ 10 successes and ≥ 10 failures in each sample for both HT and CI.

Don has already checked requirements in the hypothesis test, so he moves right to a 2-PropZInt:

2-PropZInt input screen: x 1 = 86, n 1 = 97, x 2 = 55, n 2 = 70, C-Level = .95      2-PropZInt results screen: minus .0141 to plus .21587

Don gets a result of −1.4% to +21.6%. How does he interpret that? Well, he can write it as

−1.4%   ≤   p1−p2   ≤   21.6%   (95% conf.)

Adding p2 to all three “sides” gives

p2−1.4%   ≤   p1   ≤   p2+21.6%   (95% conf.)

With 95% confidence, p1 is somewhere between 1.4% below p2 and 21.6% above p2. You don’t know the numerical value of p1, but out of male students who are stopped by a traffic cop, p1 is the proportion who get a ticket, and similarly for p2 and women. So Don can write his confidence interval like this:

cartoon about percentage points
used by permission; source: (accessed 2014-10-03)

I’m 95% confident that, out of students stopped by traffic cops, the proportion of men who actually get tickets is somewhere between 1.4 percentage points less than women, and 21.6 percentage points more than women.

If you’re not feelin’ the love with the algebra approach, you can reason it out in words. The confidence interval is the difference in proportions for men minus women. If that’s negative, the proportion for men is less than the proportion for women; if the difference is positive, the proportion for men is greater than the proportion for women.

Why do I say “percentage points” instead of just “percent” or “%”? Well, how do you describe the difference between 1% and 2%? It’s a difference of one percentage point, but it’s a 100% increase, because the second one is 200% of the first. When you subtract two percentages, the difference is a number of percentage points. If you just say “percent”, that means you’re expressing the difference using one of the percentages as a base, even if you don’t mean to.

Getting back to Don’s confidence interval, the −1.4% to +21.6% difference between men and women in traffic tickets is a simple subtraction of men’s rate minus women’s rate, so it is percentage points, not percent.

Where does the confidence interval come from? First you have to find the standard error. Yes, it’s different from the standard error associated with the hypothesis test. Why? That standard error assumed H0 was true and used the pooled p̂. You can’t do that in the confidence interval, because if H0 is true then the difference between the population proportions is zero and you don’t have a confidence interval!

The standard deviation of the sampling distribution of the proportion for population 1 is √[p̂1(1−p̂1)/n1], and similarly for population 2. Square them, add, and take the square root to get the SD of the distribution of differences in sample proportions, also known as the standard error of the difference of proportions:

$$\mathrm{SE}(\hat{p}_1-\hat{p}_2) = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

The margin of error is zα/2 times that. The center of the confidence interval is the point estimate, (p̂1−p̂2), so the bounds for the 100(1−α)% confidence interval are

$$(\hat{p}_1-\hat{p}_2)-E \;\le\; p_1-p_2 \;\le\; (\hat{p}_1-\hat{p}_2)+E \quad\text{where}\quad E = z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$
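Here is the same interval as a short Python sketch, in case you want to see the arithmetic spelled out. The function name two_prop_z_interval is invented for this illustration, and the code assumes Python 3.8 or later. Notice that it uses the unpooled standard error, with each sample's own proportion, unlike the hypothesis test.

```python
import math
from statistics import NormalDist

def two_prop_z_interval(x1, n1, x2, n2, conf_level=0.95):
    """Confidence interval for p1 - p2 (hypothetical helper, unpooled SE)."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    alpha = 1 - conf_level
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # critical z for alpha/2
    margin = z_crit * se                           # margin of error E
    diff = p1_hat - p2_hat                         # point estimate
    return diff - margin, diff + margin

# Don's data: men 86 of 97, women 55 of 70, 95% confidence
lo, hi = two_prop_z_interval(86, 97, 55, 70)
print(f"{lo:.4f} to {hi:.4f}")
```

With Don's numbers the endpoints should agree with the 2-PropZInt screen, about −0.0141 to +0.2159.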

Just like with numeric data, you have to use the two-sample procedure to compute a correct confidence interval. Here’s an example.

Two candidates are running for city council, so they each commission an exit poll on Election Day. Of 200 voters polled, 110 voted for Mr. X; 90 of a different 200 voted for Ms. Y. The 95% confidence intervals are 48.1% to 61.9% and 38.1% to 51.9%. The intervals overlap, so Ms. Y might still hope for victory. But a 2-PropZInt tells a different story. The interval for the difference of proportions, X−Y, is 0.2% to 19.8%, so Mr. X is 95% confident of winning, and the only question is whether it will be a squeaker or a landslide.
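To see why the two separate intervals and the two-sample interval tell different stories, you can compute all three side by side. This is a rough Python sketch with made-up helper names (one_prop_interval, diff_interval), assuming Python 3.8 or later. It only reproduces the arithmetic of the exit-poll example.

```python
import math
from statistics import NormalDist

Z_95 = NormalDist().inv_cdf(0.975)   # critical z for 95% confidence

def one_prop_interval(x, n, z=Z_95):
    """One-proportion z interval (hypothetical helper)."""
    p_hat = x / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def diff_interval(x1, n1, x2, n2, z=Z_95):
    """Two-proportion z interval for p1 - p2 (hypothetical helper)."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    margin = z * math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff - margin, diff + margin

x_lo, x_hi = one_prop_interval(110, 200)      # Mr. X: 110 of 200
y_lo, y_hi = one_prop_interval(90, 200)       # Ms. Y: 90 of a different 200
d_lo, d_hi = diff_interval(110, 200, 90, 200)
overlap = y_hi > x_lo                         # do the separate intervals overlap?

print(f"X: {x_lo:.1%} to {x_hi:.1%}")
print(f"Y: {y_lo:.1%} to {y_hi:.1%}")
print(f"separate intervals overlap: {overlap}")
print(f"X - Y: {d_lo:.1%} to {d_hi:.1%}")
```

The separate intervals overlap, yet the interval for the difference stays above zero. That happens because the margin of error of a difference is smaller than the sum of the two separate margins of error.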

Necessary Sample Size for Confidence Interval

You have a confidence level and a desired margin of error in mind. How large must each sample be?

You may remember with the one-population binomial case, part of the calculation was your prior estimate, or if you had no prior estimate you used 0.5. With two binomial populations, you need a prior estimate (or 0.5) for each one.

The easiest way to compute the necessary sample size is to use MATH200A Program part 5. If you don’t have the program and want to get it, see Getting the Program. You can also calculate necessary sample size by using the formula in the next paragraph, if you don’t have the program.

The formula for sample size is not too difficult. Start with the formula for margin of error. The desired confidence level determines critical z. But when you fill in your desired margin of error E and your prior estimates p̂1 and p̂2, you still have two unknowns, n1 and n2. The simplest assumption is that you’ll make your two samples the same size, so set n1 = n2 and solve:

$$n_1 = n_2 = \left[\hat{p}_1(1-\hat{p}_1) + \hat{p}_2(1-\hat{p}_2)\right]\left(\frac{z_{\alpha/2}}{E}\right)^2$$

For a detailed explanation, with worked examples, see How Big a Sample Do I Need?.

Caution! When you’re planning to study the difference between two binomial populations, you have to use the two-population binomial computation of sample size. If you compute one sample size for sample 1 and a separate sample size for sample 2, you’ll come out much too low.

Example 12: Let’s look back once more at Don and his traffic stops. His 95% confidence interval was −0.0141 to +0.21587. That’s a margin of error of (.21587−(−.0141))/2 = 11½ percentage points. How large must his samples be if he wants a margin of error of no more than 5 percentage points but he’s willing to be only 90% confident?

Solution: Don can use his sample proportions as prior estimates. Those were 86/97 ≈ 0.8866 for men and 55/70 ≈ 0.7857 for women.

With the MATH200A program (recommended):

Here’s the output screen from MATH200A Program part 5, 2-pop binomial:

MATH200A results: p-hat= .8866 and .7857, E less or equal .05, C-Level = .9, z Crit = 1.64, n greater or equal 292 per sample

If you’re not using the program:

The calculation is a little easier if you break it into chunks. First compute p̂1(1−p̂1) + p̂2(1−p̂2). When you press [Enter], the calculator displays that result.

You want to multiply that by (zα/2/E)². Press the [×] key, and the calculator displays Ans*. Then press the opening paren [(], enter the fraction, and square it.

What is zα/2? You did this in How Big a Sample for Binomial Data? in Chapter 9. The confidence level is 1−α = 0.9, so α = 0.1, α/2 = 0.05, and zα/2 is invNorm(1−.05). The margin of error is 5% or .05 (not .5 !).

.8866(1−.8866) + .7857(1−.7857) yields .26891595. Ans*(invNorm(1−.05)/.05)² yields 291.0255149.

Caution: You don’t round the sample size. If you don’t get a whole number from the calculation, always go up to the next whole number. A sample size of 291.0255149 or greater gives a margin of error of .05 or less, at 90% confidence. The smallest whole number that is 291.0255149 or greater is 292, not 291.

Answer: Don needs a sample of 292 men and 292 women if he wants 90% confidence in an estimate of the difference with margin of error no more than 5 percentage points.

Rookie mistake: Don’t just say “292”. It’s 292 from each population.
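If you have neither the program nor a TI calculator handy, the sample-size formula is easy to sketch in Python. The function name sample_size_two_prop and its parameter names are invented for this illustration, the code assumes Python 3.8 or later, and it assumes equal sample sizes just as the formula does. Prior estimates default to 0.5, the value to use when you have no prior estimate.

```python
import math
from statistics import NormalDist

def sample_size_two_prop(margin, conf_level, p1_prior=0.5, p2_prior=0.5):
    """Per-sample size for a two-proportion interval with n1 = n2.

    Hypothetical helper for illustration only.
    """
    alpha = 1 - conf_level
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    raw = (p1_prior * (1 - p1_prior) + p2_prior * (1 - p2_prior)) * (z_crit / margin) ** 2
    return math.ceil(raw)   # never round down: always go up to a whole number

# Example 12: E = .05 (not .5), 90% confidence, Don's sample proportions as priors
n_each = sample_size_two_prop(0.05, 0.90, 0.8866, 0.7857)
print(f"{n_each} from each population")
```

math.ceil does the "always go up" step, so 291.0255 becomes 292 rather than 291.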

Why do you need such large samples, even at a confidence level as low as 90%? Part of the answer is that binomial data do need large samples; remember that a single sample of just over a thousand gives you a 3% margin of error at the 95% confidence level. And when you have two populations, you are estimating the difference between two unknown parameter values, p1 and p2. If each of those was estimated within a 3% margin of error, the margin of error for their difference would be 6%, so the samples have to be larger in the two-population binomial case.

Example 13: The Prime Minister knows that his program of tax cuts and reduced social services appeals more to Conservatives than to Labour, but he wants to know how large the difference is. To estimate the difference with 95% confidence, with a margin of error of no more than 3%, how many members of each party must he survey?

Solution: You’re given no estimate of support within either party, so use 0.5 for p̂1 and p̂2. E = 0.03 (not 0.3).

With the MATH200A program (recommended):

MATH200a/sample size/2-pop binomial:

p-hat = .5 and .5, E <= .03, C-Level = .95, z Crit = 1.96, n = 2135 per sample

If you’re not using the program:

First compute p̂1(1−p̂1) + p̂2(1−p̂2) = 0.5(1−0.5)+0.5(1−0.5), which yields .5. You have to multiply that by (zα/2/E)², and you find zα/2 like this: C-Level = 1−α = 0.95 ⇒ α = 1−0.95 = 0.05 ⇒ α/2 = 0.025 ⇒ zα/2 = invNorm(1−.025). So multiply .5 by the square of the fraction invNorm(1−.025)/.03. The result is about 2134.1, and you go up to the next whole number, 2135.

Answer: To gauge the difference within a 3% margin of error, at the 95% confidence level, the Prime Minister needs to poll 2135 Conservative Party members and 2135 Labour Party members.

Example 14: Gardasil Vaccine

The Gardasil vaccine is marketed by Merck to prevent cervical cancer. What are the statistics behind it? How do women decide whether to get vaccinated? Should the vaccine be mandatory?

A Cortland Standard story (21 Nov 2002) summarized an article from the New England Journal of Medicine as follows:

A new vaccine can protect against Type 16 of the human papilloma virus, a sexually transmitted virus that causes cervical cancer, a new study shows. An estimated 5.5 million people become infected with a strain of HPV [not necessarily this strain] each year in the United States.

Efficiency rate of vaccine and placebo

Placebo:  Group size 765, infection 41

HPV-16 vaccine:  Group size 768, infection 0

Note: The study included 1533 women with an average age of 20.

(Similar studies were done for the vaccine’s effectiveness against another strain, HPV-18. According to the front page of the Wall Street Journal on 16 Apr 2007, HPV-16 and -18 between them “are thought to cause 70% of cervical-cancer cases.” The vaccine, developed by Merck, is now marketed as Gardasil.)

The samples certainly show an impressive difference, but is it statistically significant? Could the luck of random sampling be enough to account for that difference in infection rates?

The claim is “the vaccine protects against HPV-16.” To translate this into the language of statistics, realize that there are two populations: (1) women who don’t get the vaccine, and (2) women who do get the vaccine.

Notice that the populations are all women, past, present, and future who don’t or do get vaccinated. The 765 and 768 women are samples, not populations. The populations are unvaccinated and vaccinated, not placebo and vaccine. Placebos are administered to members of a sample, but a population doesn’t “get placeboed”.

The data type is attribute (binomial) because the original question or measurement of each participant is the yes/no question: “Did this woman contract the virus?” (“Success” is an HPV-16 infection, not a good thing.) Since you’re comparing two populations, this is Case 5, Difference of Two Proportions.

Is the Vaccine Effective?

If the vaccine works, then you expect more women without the vaccine to contract the virus, so make them population 1. (That’s not necessary; it just usually makes things a little simpler to call population 1 the one with higher numbers expected.)

Although you hope that the vaccine population will have a lower infection rate, it’s not impossible that they could have a higher rate. Therefore you do a two-tailed test (≠). If p < α, then it’s time to say whether the vaccine makes things better or worse.

Let’s use α = 0.001. You’re talking about cancer in humans, after all. A Type I error would be saying that Gardasil makes a difference when actually it doesn’t. You don’t want women to get vaccinated, and have a false sense of security, if the vaccine actually doesn’t work, so a Type I error is pretty serious.

(1) population 1 = unvaccinated women; population 2 = vaccinated women
H0: p1 = p2, the vaccine makes no difference
H1: p1 ≠ p2, the vaccine does make a difference
(2) α = 0.001
  • Randomized design? We’re not told in so many words, but this is a high-profile medical study so you can be pretty confident it was done right.
  • Samples less than 10% of population? Yes, since millions of women will get the vaccine (if it’s proved effective) and millions won’t.
  • At least 10 yes and 10 no in each sample? In the placebo group, there were 41 yes and 765−41 = 724 no. In the treatment group, there were no successes at all.

    Does that mean you can’t do the hypothesis test? Remember that “at least 10 yes and 10 no in each sample” is a shortcut for the real requirement, which is “at least 10 yes and 10 no expected in each sample if the null hypothesis is true”. If H0 is true, then the pooled proportion p̂ = 0.0267 is an estimator of the proportions in both populations.

    What would you expect if H0 is true? In the placebo group of 765, you would expect n1p̂ = 765×.0267 ≈ 20 yes and n1−n1p̂ = 765−20 = 745 no. You’d expect about the same in the treatment group of 768, so the “at least 10” requirement is met.

(3/4) 2-PropZTest: 41, 765, 0, 768, p1≠p2
results: z = 6.50, p-value = 7.9E-11, p̂1 = .0536, p̂2 = 0, p̂ = .0267
2-PropZTest 41, 765, 0, 768, p1≠p2      2-PropZTest z=6.503220606, p=7.900655 E minus 11, p-hat 1 = .0535947712, p-hat 2 = 0 p-hat = .026744946
Pause for a minute to make sure you can keep all those p’s straight. The first one, p = 7.9E-11, is the p-value, the chance of getting such different sample results if the vaccine makes no difference. p̂1 and p̂2 are those sample results: 5.4% of unvaccinated women and 0% of vaccinated women in the samples contracted HPV-16 infections. p̂ without subscript is the pooled proportion: 2.7% of all women in the study contracted HPV-16.
(5) p < α. Reject H0 and accept H1.
(6) The Gardasil vaccine does make a difference to HPV-16 infection rates (p = 8×10⁻¹¹). In fact, it lowers the chance of infection.
At the 0.001 level of significance, the Gardasil vaccine does make a difference to HPV-16 infection rates. In fact, it lowers the chance of infection.

It’s worth reviewing what this p-value of 8×10⁻¹¹ means. If the vaccine actually made no difference, there are only 8 chances in a hundred billion of getting the difference between samples that the researchers actually got, or a larger difference.
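The requirements check and the test statistic for the vaccine study can be retraced with a short Python sketch. The variable names are made up for illustration, and this is not how the calculator works internally. One deliberate choice: the two-tailed p-value is computed with math.erfc, since 2·P(Z > |z|) equals erfc(|z|/√2) and that form keeps its precision for a tail area this tiny. The later digits of the p-value may differ slightly from the calculator's display.

```python
import math

x1, n1 = 41, 765    # placebo group: infections, group size
x2, n2 = 0, 768     # vaccine group: infections, group size

# Pooled proportion, the estimate for both populations if H0 is true
p_pool = (x1 + x2) / (n1 + n2)

# Expected yes/no counts in each group under H0 (each should be at least 10)
expected = [n1 * p_pool, n1 * (1 - p_pool), n2 * p_pool, n2 * (1 - p_pool)]

# Pooled standard error and test statistic
se_pooled = math.sqrt(p_pool * (1 - p_pool) / n1 + p_pool * (1 - p_pool) / n2)
z = (x1 / n1 - x2 / n2) / se_pooled

# Two-tailed p-value: 2 * P(Z > |z|) = erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"pooled p-hat = {p_pool:.4f}")
print("expected counts:", [round(e, 1) for e in expected])
print(f"z = {z:.2f}, p-value = {p_value:.1e}")
```

The smallest expected count is about 20, which is why the requirement is met even though the vaccine group had no infections at all.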

How do you get from “makes a difference” to “reduces infection rate”? Remember that when p < α in a two-tailed test, you can interpret the result in a one-tailed manner. If the vaccine makes things different, as appears virtually certain, then it must either make them better or make them worse. But in the sample groups, the vaccine group did better than the placebo group. Therefore the vaccine can’t make things worse, and it must make them better.

How Effective Is the Vaccine?

Can you do a confidence interval to estimate how much Gardasil reduces a woman’s risk of HPV-16 infection? Unfortunately, you can’t, because the requirements aren’t met: There were zero successes in the second sample. You can’t think like the hypothesis test and use the blended p̂ to meet requirements. Why wouldn’t that make sense? In a confidence interval, you’re specifically trying to estimate the difference between p1 and p2 (likelihood of infection for unvaccinated and vaccinated women), so you can’t very well assume there is no difference.

In terms of what you’re required to know for the course, you can skip to the next section right now. But if you want to know more, keep reading.

One informal calculation finds a number needed to treat per person actually helped (Simon 2000c [see “Sources Used” at end of book]). The difference in sample proportions is 5.4 percentage points, and 1/.054 ≈ 18.5 is called the number needed to treat. (You may recognize this as the expected value of the geometric distribution with p = 5.4%.) In the long run, for every 18 or 19 women who are vaccinated, one HPV-16 infection is prevented.

Caution! 5.4 percentage points is a difference in sample proportions. You can say only that the difference in the population is somewhere in the neighborhood of 5.4 percentage points, not that it is that. The number needed to treat is therefore not exactly 18.5, just somewhere in the neighborhood of 18.5. Even so, this is valuable information for women and their doctors.

Another approach is the rule of three, explained in Confidence Interval with Zero Events (Simon 2010 [see “Sources Used” at end of book]). When there are zero successes in n events, the 95% confidence interval is 0 to 3/n. Here 3/768 = 0.0039, about 0.4%. The 95% confidence interval for the unvaccinated population is 3.8% to 7.0%. So a doctor can tell her patients that about 38 to 70 unvaccinated women in a thousand will be infected with HPV-16, but only about four vaccinated women in a thousand.

Each of those is a 95% confidence interval, but the combination isn’t a 95% confidence interval! In the long run, if you do a bunch of 95% CIs, one in 20 of them won’t capture the true population parameter. Here you’re doing two, so there’s only a 95%×95% = 90.3% chance that both of these actually capture the true population proportions.
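The informal calculations in this optional section are simple enough to retrace in Python. This is only a sketch with invented variable names, assuming Python 3.8 or later. It uses the unrounded difference in sample proportions, so the number needed to treat comes out a hair above the 18.5 you get from the rounded 5.4%. The last line treats the two intervals as independent, which is the same assumption the 95%×95% calculation makes.

```python
import math
from statistics import NormalDist

placebo_infected, placebo_n = 41, 765
vaccine_infected, vaccine_n = 0, 768

# Number needed to treat: 1 / (difference in sample proportions)
risk_diff = placebo_infected / placebo_n - vaccine_infected / vaccine_n
nnt = 1 / risk_diff

# Rule of three: with 0 events in n trials, the 95% interval is 0 to 3/n
rule_of_three_upper = 3 / vaccine_n

# Ordinary one-proportion 95% interval for the unvaccinated group
p_hat = placebo_infected / placebo_n
margin = NormalDist().inv_cdf(0.975) * math.sqrt(p_hat * (1 - p_hat) / placebo_n)
lo, hi = p_hat - margin, p_hat + margin

# Chance that two independent 95% intervals both capture their parameters
joint = 0.95 * 0.95

print(f"number needed to treat: about {nnt:.1f}")
print(f"vaccinated: 0 to {rule_of_three_upper:.2%}")
print(f"unvaccinated: {lo:.1%} to {hi:.1%}")
print(f"joint coverage: {joint:.4f}")
```

Either way you do the rounding, the number needed to treat lands between 18 and 19, so the conclusion in the text is unchanged.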

11E.  Confidence Interval and Hypothesis Test (Two Populations)

Summary: If you have a confidence interval for the difference of two population means or proportions, you can conclude whether the difference is statistically significant or not, just like the result of a hypothesis test.

Example 15: You’re testing the new drug Effluvium to see whether it makes people drowsy. Your 95% confidence interval for the difference between proportions of drowsiness in people who do and don’t take Effluvium is (0.017, 0.041). That means you’re 95% confident that Effluvium is more likely, by 1.7 to 4.1 percentage points, to cause drowsiness.

There’s the key point. You’re 95% confident that it does increase the chance of drowsiness by something between those two figures. How likely is it that Effluvium doesn’t affect the chance of drowsiness, then? Clearly it’s got to be less than 5%.

When both endpoints of your confidence interval are positive (or both are negative), so that the confidence interval doesn’t include 0, you have a significant difference between the two populations.

Example 16: Now, suppose that confidence interval was (−0.013, 0.011). That means you’re 95% confident that Effluvium is somewhere between 1.3 percentage points less likely and 1.1 more likely to cause drowsiness. Can you now conclude that Effluvium affects the chance of drowsiness? No, because 0 (“no difference”) is inside your confidence interval. Maybe Effluvium makes drowsiness less likely, maybe it has no effect, maybe it makes drowsiness more likely; you can’t tell.

When one endpoint of your confidence interval is negative and one is positive, so that the confidence interval includes 0, you can’t tell whether there’s a significant difference between the two populations or not.

This is exact for numeric data but approximate for binomial data. Why? Because the HT and CI use the same standard error for the numeric data cases, but slightly different standard errors for two-population binomial data. (The two calculations are in BTW paragraphs earlier in the chapter.)

11F.  More Confidence Intervals for Two Populations


Confidence intervals for two populations are easy enough to calculate on your TI-83. But one or both endpoints can be negative, and that means you have to write your interpretation carefully. Don’t just say “difference”; specify which population’s mean or proportion is larger or smaller. You must also distinguish between mean difference (for paired data) and difference in means (for unpaired data).

Study these examples of confidence intervals for two populations, and you’ll learn how to write your interpretations like a pro!

Example 17: Heights of Men and Women

Here’s an example adapted from Johnson and Kuby (2003, 425) [see “Sources Used” at end of book]. Men’s and women’s heights are ND. From this random sample, estimate the difference in average height as a 95% CI.

Sample            Mean, x̄    Standard Deviation, s    Sample Size, n
Female, pop. 2    63.8"      2.18"                    20
Male, pop. 1      69.8"      1.92"                    30


You have independent samples here: you get one number from each individual. The data type is numeric (height), so you have Case 4, difference of independent means.

Requirements Check

With independent means, you check requirements for each sample separately.

All requirements for Case 4 are met.

Calculation and Conclusion

The TI-83 or TI-84 computes μ1−μ2, so you need to decide which will be population 1 and which will be population 2. I like to avoid negative signs, so unless there’s a good reason to do otherwise I take the sample with the larger mean as sample 1; in this case that’s the men.

Whichever way you decide, write it down: pop 1 = ________, pop 2 = ________.

On your calculator, press [STAT] [◄] and scroll up or down to find 0:2-SampTInt. Enter the sample statistics and use Pooled:No. Here are the input and output screens:

TI-83 2-sample T interval input   TI-83 2-sample T interval output

Conclusion: With 95% confidence, the average man at that college is between 4.8″ and 7.2″ taller than the average woman, or μM−μF = 6.0″±1.2″. (You would probably present one or the other of those forms, not both.)

(6.0 is the difference of sample means and is the center of the confidence interval: x̄1−x̄2 = 69.8−63.8 = 6.0.)

Remark: The difference from the case of dependent means is subtle but important. With dependent means (paired data), the CI is about the average difference in measurements of a single randomly selected individual or matched pair. But with independent means (unpaired data), the CI is about the difference between the averages for two different populations.

Example 18: Coffee and Heart Rate with Negatives

Now let’s make up new data for the coffee example. (The new d’s are still normally distributed with no outliers.) Again, you’re estimating the mean difference in heart rate due to drinking coffee.

TI-83 TInterval output screen
Person      1    2    3    4    5    6
Before      78   64   70   71   70   68
After       79   62   73   70   71   67
d = A−B     1    −2   3    −1   1    −1

Notice that some heart rates declined after the people drank coffee. Now when you compute a 95% CI you get the results shown at right.

How should you interpret a negative endpoint in the interval? Remember that you are computing a CI for the quantity After−Before. You could follow the earlier pattern and say “With 95% confidence, the mean increase in heart rate for all individuals after drinking coffee is between −1.8 and +2.1 beats per minute,” but only a mathematician would love a statement that talks about an increase being negative. Instead, you draw attention to the fact that the change might be a decrease or an increase, as follows.

Conclusion: With 95% confidence, the mean change in heart rate for all individuals after drinking coffee is between a decrease of 1.8 and an increase of 2.1 beats per minute. Since it’s obviously very important to get the direction right, be sure to check your conclusion against your H1 (if any) and your original definition of d.
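For the curious, here is how the paired interval can be retraced in Python from the Before and After rows. It's a sketch, not the calculator's method. Python's standard library has no t distribution, so the critical t for 5 degrees of freedom at 95% confidence (2.571) is simply typed in from a t table, which is an assumption of this sketch. The variable names are invented for illustration.

```python
import math
from statistics import mean, stdev

before = [78, 64, 70, 71, 70, 68]
after = [79, 62, 73, 70, 71, 67]

# Paired data: one difference per person, defined as After minus Before
d = [a - b for a, b in zip(after, before)]

n = len(d)
d_bar = mean(d)                      # point estimate of the mean difference
se = stdev(d) / math.sqrt(n)         # standard error of the mean difference

T_CRIT_DF5_95 = 2.571                # t table value for df = n - 1 = 5, 95% confidence
margin = T_CRIT_DF5_95 * se

lo, hi = d_bar - margin, d_bar + margin
print(f"d = {d}")
print(f"mean change: {lo:.1f} to {hi:.1f} beats per minute")
```

The negative lower endpoint is what the conclusion describes as a possible decrease, and the positive upper endpoint is the possible increase.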

Remark 1: Though it’s correct to present the CI as a point estimate and margin of error, it’s probably not a good idea because that form is so easy to misinterpret. If you say “With 95% confidence, the mean increase in heart rate for all individuals is 0.2±1.9 beats per minute,” many people won’t notice that the margin of error is bigger than the point estimate, and they’ll come to the false conclusion that you have established an increase in heart rate after drinking coffee. As statistics mavens, we have a responsibility to present our results clearly, so that people draw the right conclusions and not the wrong ones.

Remark 2: Remember that the CI occupies the middle of the distribution while the HT looks at the tails. If 0 is inside the CI, it can’t be in either tail. Therefore, from this confidence interval you know that testing the null hypothesis μd = 0 at the 0.05 level (0.05 = 1−95%) would fail to reject H0: this experiment failed to find a significant difference in heart rate after drinking coffee. (See Confidence Interval and Hypothesis Test (Two Populations).)
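A quick way to see that the two views agree is to compute both for the coffee data. The sketch below again assumes the table value 2.5706 for t (df = 5, two-tailed α = 0.05), and the variable names are illustrative only.

```python
import math
import statistics

d = [1, -2, 3, -1, 1, -1]        # After − Before, from the example
n = len(d)
d_bar = statistics.mean(d)
se = statistics.stdev(d) / math.sqrt(n)

# Assumption: critical t for df = 5, two-tailed alpha = 0.05, from a t table
T_STAR = 2.5706

# Confidence-interval view: is 0 inside the 95% CI?
zero_in_ci = (d_bar - T_STAR * se) <= 0 <= (d_bar + T_STAR * se)

# Hypothesis-test view: does the test statistic fall short of the critical value?
t_stat = (d_bar - 0) / se
fail_to_reject = abs(t_stat) < T_STAR

print(zero_in_ci, fail_to_reject)   # prints True True
```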

Remember the difference between “no significant difference found” and “no difference exists”. Since 0 is in the CI, you can’t say whether there is a difference. The correct statement, “I don’t know whether there is a difference,” is different from the incorrect “There is no difference.”

Example 19: Opinion Poll

The following data are from Dabes and Janik (1999, 269) [see “Sources Used” at end of book]. Men and women were polled in a systematic sample on whether they favored legalized abortion, and the results were as follows:

Sample              Number in Favor, x    Sample Size, n
Females, pop. 1             60                 100
Males, pop. 2               40                  80

Find a 98% confidence interval for the difference in level of support between women and men.


You have binomial data: each person either supports legalized abortion or not. (Obviously this example is oversimplified.) Binomial data with two populations is Case 5, difference of proportions.

Requirements Check

Support among the sample of women is 60/100 = 60%, and among the sample of men is 40/80 = 50%, so let’s define population 1 = women, population 2 = men.

All requirements for a Case 5 CI are met.

Calculation and Conclusion

On the TI-83 or TI-84, press [STAT] [◄] and scroll up to find B:2-PropZInt. The input and output screens look like this:

TI-83 2-proportion input screen     TI-83 2-proportion output screen
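For readers working without the calculator, here is a rough Python equivalent. It assumes the usual unpooled standard error for a two-proportion z interval; the function name is hypothetical, and only the Python standard library is used.

```python
import math
from statistics import NormalDist

def two_prop_z_interval(x1, n1, x2, n2, conf):
    """Confidence interval for p1 − p2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_star = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # two-tailed critical z
    margin = z_star * se
    diff = p1 - p2
    return diff - margin, diff + margin

# Population 1 = women (60 of 100), population 2 = men (40 of 80)
lower, upper = two_prop_z_interval(60, 100, 40, 80, 0.98)
print(f"{lower:+.4f} to {upper:+.4f}")
```

The endpoints should agree with the calculator's, apart from rounding in the last digit.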

Two-population confidence intervals can be tricky to interpret, particularly when the two endpoints have different signs and particularly for Case 5, two population proportions. You can reason it out in words, or use algebra.

In words, remember that the confidence interval is the estimated difference p1−p2, which is the estimated amount by which the proportion in the first population exceeds the proportion in the second population. So a negative endpoint for your CI means that the first proportion is lower than the second, and a positive endpoint means that the first proportion is larger.

Using algebra, begin with the calculator’s estimate of p1−p2:

−0.0729 ≤ p1−p2 ≤ +0.27292   (98% conf.)

Add p2 to all three parts of the inequality, and you have

p2−0.0729 ≤ p1 ≤ p2+0.27292   (98% conf.)

That’s a little easier to work with. The 98% confidence bounds on p1 (level of women’s support) are p2−0.0729 (7.3 percentage points below men’s support) and p2+0.27292 (27.3 percentage points above men’s support).

Conclusion: You are 98% confident that support for legalized abortion is somewhere between 7.3 percentage points lower and 27.3 points higher among females than males.

Remark: It would be equally valid to turn that around and say you’re 98% confident that support is between 27.3 percentage points lower and 7.3 points higher among males than females.
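If you'd like to see the bookkeeping behind both wordings, here is a small Python sketch. The helper name and its parameters are invented for illustration; the point is that swapping the two populations negates both endpoints and reverses their order.

```python
def describe_difference(lower, upper, first, second, unit="percentage points"):
    """Phrase a CI for (first − second) without mentioning negative amounts."""
    def side(value):
        direction = "lower" if value < 0 else "higher"
        return f"{abs(value):.1f} {unit} {direction}"
    return f"between {side(lower)} and {side(upper)} among {first} than {second}"

lo, hi = -7.3, 27.3   # 98% CI for p1 − p2 (women minus men), in percentage points

print(describe_difference(lo, hi, "females", "males"))
# Turning the comparison around: the CI for p2 − p1 is (−hi, −lo)
print(describe_difference(-hi, -lo, "males", "females"))
```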

Example 20: GPA of Fraternity Members and Nonmembers

Johnson and Kuby (2003, 427) [see “Sources Used” at end of book] present another example. What is the difference (if any) in academic performance between fraternity members and nonmembers? Forty members of each population were randomly selected, and their cumulative GPA recorded as an indication of performance. The results were as follows:

Sample                         x̄       s      n
Fraternity members, pop. 1    2.03    0.68    40
Independents, pop. 2          2.21    0.59    40


Here you have numeric data, two independent samples. (You know it’s independent samples, unpaired data, because each member of the sample gives you just one number.) This is Case 4, difference of independent means.

Requirements Check

Each sample was random, and each sample size is >30. We can assume that there are more than 10×40 = 400 fraternity members and 400 independents on campus. All requirements for Case 4 are met.

Calculation and Conclusion

TI-83 2-sample T Interval output screen

The CI is −0.46 to +0.10, with 95% confidence. To interpret this, remember that the TI-83 computes a CI for μ1−μ2, and we defined population 1 as fraternity and population 2 as independent. The calculator is telling you that

−0.46 ≤ μ1−μ2 ≤ +0.10   (95% conf.)

or, adding μ2 to all three parts,

μ2−0.46 ≤ μ1 ≤ μ2+0.10   (95% conf.)

Conclusion: The true difference in academic performance, as measured by average GPA, is somewhere between 0.46 worse and 0.10 better for fraternity members relative to nonmembers, with 95% confidence.

You could also write a somewhat longer form: with 95% confidence, the average fraternity member’s academic performance, as measured by GPA, is somewhere between 0.46 worse and 0.10 better than the average independent’s performance.
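The interval in this example can also be approximated from the summary statistics in Python. Treat this as a sketch under stated assumptions rather than the calculator's exact method: it assumes unpooled variances with Welch-Satterthwaite degrees of freedom, and because the standard library has no t distribution, the helper functions (all hypothetical names) find the critical t numerically.

```python
import math

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, using Simpson's rule on [0, x]."""
    h = x / steps
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(conf, df):
    """Two-tailed critical t, found by bisection on the CDF."""
    target = 1 - (1 - conf) / 2
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def two_sample_t_interval(mean1, s1, n1, mean2, s2, n2, conf):
    """CI for mu1 − mu2 from summary statistics, variances not pooled."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    se = math.sqrt(v1 + v2)
    # Welch-Satterthwaite degrees of freedom (usually not a whole number)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    margin = t_critical(conf, df) * se
    diff = mean1 - mean2
    return diff - margin, diff + margin

# Population 1 = fraternity members, population 2 = independents
lower, upper = two_sample_t_interval(2.03, 0.68, 40, 2.21, 0.59, 40, 0.95)
print(f"{lower:+.2f} to {upper:+.2f}")   # prints -0.46 to +0.10
```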

Remark 1: Don’t be fooled by the fact that the CI is mostly below zero. You really cannot conclude that fraternity members probably have lower academic performance. Remember that the 95% CI is the result of a process that captures the true population mean (or difference, in this case) 95 times out of 100. But you can’t know where in that interval the true mean (or difference) lies. If you could, there would be no point to having a CI!

Remark 2: Even though zero is within the CI, you must not say that there is no difference in performance between members and nonmembers. The difference might indeed be zero, but it might also be anywhere between 0.46 in one direction and 0.10 in the other. There’s even a 5% chance that the true difference lies outside those limits. Always bear in mind the difference between insufficient evidence for and evidence against. (You may hear that said as “lack of evidence for is not evidence against.”)

What Have You Learned?

This chapter covered confidence intervals and hypothesis tests for two samples, both binomial and numeric data. Instead of testing a population’s μ or p against some baseline number, you test the μ or p of two populations against each other.

Key ideas:

Exercises for Chapter 11

Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.

Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.


1. You want to determine whether sports fans would pay 20% extra for reserved bleacher seats. You suspect that the answer may be different between people aged under 30 and people 30 years old or older.
(a) How big must your samples be to let you construct a 95% confidence interval for the difference, with a margin of error of 3 percentage points?
(b) You discover a poll done last year, in which 30% of young people and 45% of older people said they would pay extra. Now how large must your samples be?


2. A researcher wanted to see whether the English like soccer more than the Scots. She randomly selected eight English and eight Scots and asked them to rate their liking for soccer on a numeric scale of 1 (hate) to 10 (love), and she recorded these responses:


(a) From the above data, can the researcher prove that the English have a stronger liking for soccer than the Scots? Use α = 0.05.
(b) Construct a 90% confidence interval for the difference in average levels of English and Scottish enthusiasm for soccer.

           Sample Size    Number of “Yes”
English        150             105
Scots          200             160

3. Another researcher took a different approach. She polled random samples with the question “Do you watch football at least once a week?” (In the UK they call soccer “football.”) She got the results shown in the table above.
(a) At the 0.05 level of significance, are the English and the Scots equally fans of soccer?
(b) Construct a 95% confidence interval for the difference.
(c) Find the margin of error in that interval.
(d) If the researcher repeats her survey, what sample size would she need to reduce the margin of error to 4 percentage points at the same confidence level?


4. To see if running raises the HDL (“good”) cholesterol level, five female volunteers (randomly selected) had their HDL level measured before they started running and again after each had run regularly for six months, an average of four miles daily.
(a) See if you can prove that the average person’s HDL cholesterol level would be raised after all that running. Use α=0.05.
(b) Compute and interpret a 90% confidence interval for the change in HDL from running four miles daily for six months.


5. The Physicians’ Health Study tested the effects of 325 mg aspirin every other day in preventing heart attack in people with no personal history of heart attack. 22,071 doctors were randomized into an aspirin group and a placebo group. Of the 11,037 doctors who received aspirin, 10 had fatal heart attacks and 129 had non-fatal heart attacks; total 139. For the 11,034 doctors in the placebo group, the figures were 26 and 213; total 239.
(a) At the 0.001 level of significance, does this aspirin regimen make a difference to the likelihood of a heart attack?
(b) Find a 95% confidence interval for the reduced risk.

County          x̄           s          n
Cortland Co.    $134,296    $44,800    30
Broome Co.      $127,139    $61,200    32

6. June was planning to relocate to Central New York, considering Binghamton and Cortland. She found an online survey of prices of recently completed house sales, shown in the table above. The survey consisted of two random samples taken about a month before she looked at the Web site.
(a) Construct a 95% confidence interval for the difference in mean house price in the two counties.
(b) Use that answer to determine which county has a lower average price of houses, at the 0.05 significance level.


7. The Canter Polling Service conducted two national polls in the same week, one for the Red Party candidate and one for the Blue Party candidate. Each one was a random sample of 1000 likely voters, though not the same 1000, of course. (Most national polls have sample size 1000.)

In the first sample, 520 (52%) said they would vote for Red. In the second sample, 480 (48%) said they would vote for Blue. The newspaper reported that Red was leading by 4%. What’s wrong with that? Write a correct statement, at the 95% level of confidence.

8. You have two independent random samples of yes/no data. Sample 1 has 7 yes out of 28, and sample 2 has 18 yes out of 32. Each sample is smaller than 10% of the population.
(a) Is it valid to use 2-PropZInt to compute a confidence interval? Why or why not?
(b) Is it valid to use 2-PropZTest to compute a p-value? Why or why not?

