
# Stats without Tears: 10. Hypothesis Tests

Updated 21 Aug 2023


Summary: You want to know if something is going on (if there’s some effect). You assume nothing is going on (null hypothesis), and you take a sample. You find the probability of getting your sample if nothing is going on (p-value). If that’s too unlikely, you conclude that something is going on (reject the null hypothesis). If it’s not that unlikely, you can’t reach a conclusion (fail to reject the null).

## 10A.  Testing a Proportion (Binomial Data)

Remember the Swain v. Alabama example? In a county that was 26% African American, Mr. Swain’s jury pool of 100 men had only eight African Americans. In that example, you assumed that selection was not racially biased, and on that basis you computed the probability of getting such a low proportion. You found that it was very unlikely. This disconnect between the data and the claim led you to reject the claim.

You didn’t know it, but you were doing a hypothesis test. This is the standard way to test a claim in statistics: assume nothing is going on, compute the probability of getting your sample, and then draw a conclusion based on that probability. In this chapter, you’ll learn some formal methods for doing that.
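The probability from the Swain example can be sketched in a few lines of Python. This is my own illustration, not part of the original text; it just sums exact binomial probabilities for 0 through 8 successes in 100 trials with p = 0.26:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for a binomial(n, p) random variable: exact left-tail sum."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Swain v. Alabama: a 100-man jury pool drawn from a county that is
# 26% African American. If selection is unbiased (nothing going on),
# how likely are 8 or fewer African Americans in the pool?
prob = binom_cdf(8, 100, 0.26)
print(prob)  # a tiny probability, well under 1 in 10,000
```

That tiny probability is exactly the "very unlikely" that led you to reject the claim of unbiased selection.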

The basic procedure of a hypothesis test or significance test is due to Jerzy Neyman (1894–1981), a Polish American, and Egon Pearson (1895–1980), an Englishman. They published the relevant paper in 1933.

We’re going to take a seven-step approach to hypothesis tests. The first examples will be for binomial data, testing a claim about a population proportion. Later in this chapter you’ll use a similar approach with numeric data to test a claim about a population mean. In later chapters you’ll learn to test other kinds of claims, but all of them will just be variations on this theme.

### 10A1.  Example 1: Swain v. Alabama

#### Step 1: Hypotheses

Your first task is to turn the claim into algebra. The claim may be that nothing is going on, or that something is going on. You always have two statements, called the null and alternative hypotheses.

Definition: The null hypothesis, symbol H0, is the statement that nothing is going on, that there is no effect, “nothin’ to see here. Move along, folks!” It is an equation, saying that p, the proportion in the population (which you don’t know), equals some number.

Definition: The alternative hypothesis, symbol H1, is the statement that something is going on, that there is an effect. It is an inequality, saying that p is different from the number mentioned in H0. (H1 could specify <, >, or just ≠.)

The hypotheses are statements about the population, not about your sample. You never use sample data in your hypotheses. (In real life you can’t make that mistake, since you write your hypotheses before you gather data. But in the textbook and the classroom, you always have sample data up front, so don’t make a rookie mistake.)

You must have the algebra (symbols) in your hypotheses, but it can also be helpful to have some English explaining the ultimate meaning of each hypothesis, or the consequences if each hypothesis is true. Here you want to know whether there’s racial bias in jury selection in the county.

You don’t want to know if the proportion of African Americans in Mr. Swain’s jury pool is less than 26%: obviously it is. You want to know if it’s too different — if the difference is too great to be believable as the result of random chance.

(1) H0: p = 0.26, there’s no racial bias in jury selection
    H1: p < 0.26, there is racial bias in jury selection

Obviously those can’t both be true. How will you choose between them? You’ll compute the probability of getting your sample (or a more unexpected one), assuming that the null hypothesis H0 is true, and one of two things will happen. Maybe the probability will be low. In that case you rule out the possibility that random chance is all that’s happening in jury selection, and you conclude that the alternative hypothesis H1 is true. Or maybe the probability won’t be too low, and you’ll conclude that this sample isn’t unusual (unexpected, surprising) for the claimed population.

The number in your null hypothesis H0, with binomial data, is called po because it’s the proportion as given in H0. (You may want to refer to the Statistics Symbol Sheet to help you keep the symbols straight.)

What exactly is p? Yes, it’s the population proportion being tested, but what’s the population? It can’t be people in the county, or men in the county, or African-American men in the county.

In fact it’s all people serving on Talladega County jury pools past, present and future. If there’s racial bias, then African Americans are less likely to be selected than whites, and — probability of one, proportion of all — therefore the overall population of jury pools has less than 26% African Americans. If there’s no racial bias, then in the long run the overall population of jury pools has the same 26% of African Americans as the county.

Although a hypothesis test is officially about the population, in cases like this one it’s okay to think of it as answering a simpler question: Is the difference between the claim of no racial bias and the reality of this sample significant, or could it be explained away as the result of random chance? The hypotheses are the same either way, the calculations are the same, and the conclusions are the same.

This is why a hypothesis test is also called a significance test or a test of significance.

#### Step 2: Significance Level

Okay, you’re looking to figure out if this sample is inconsistent with the null hypothesis. In other words, is it too unlikely, if the null hypothesis H0 is true? But what do you mean by “too unlikely”? Back in Chapter 5, we talked about unusual events, with a threshold of 5% or 0.05 for such events. We’ll use that idea in hypothesis testing and call it a significance level.

Definition: The significance level, symbol α (the Greek letter alpha), is the chance of being wrong that you can live with. By convention, you write it as a decimal, not a percentage.

(2) α = 0.05

A significance level of 0.05 is standard in business and science. If you can’t tolerate a 5% chance of being wrong — if the consequences are particularly serious — use a lower significance level, 0.01 or 0.001 for example. (0.001 is common if there’s a possibility of death or serious disease or injury.) If the consequences of being wrong are especially minor, you might use a higher significance level, such as 0.10, but this is rare in practice.

In a classroom setting, you’re usually given a significance level α to use.

Later in this chapter, you’ll see that the significance level α is actually concerned with a particular way of being wrong, a Type I error.

#### Step RC: Requirements Check

Back in Chapter 8, you learned the CLT’s requirements for binomial data: a random sample not larger than 10% of the population, and at least 10 successes and 10 failures expected if the null hypothesis is true. You compute expected successes as npo, using po, the number from H0. Expected failures are then sample size minus expected successes, n−npo in symbols. Steps 3 and 4 need the sampling distribution of the proportion to be a ND, so you must check the requirements as part of your hypothesis test.

(RC)
• Random sample? Yes, according to the county. ✔
• 10n = 10×100 = 1000. We don’t know the number of adult males in the county, but it must be greater than 1000, surely. (“I know that, and don’t call me Shirley.”) ✔
• Expected successes npo = 100×.26 = 26; expected failures n−npo = 100−26 = 74; both are ≥ 10. ✔

You might wonder about the first test. “The county may say it’s random, but I don’t believe it. Isn’t that why we’re running this test?” Good question! Answer: Every hypothesis test assumes the null hypothesis is true and computes everything based on that. If you end up deciding that the sample was too unlikely, in effect you’ll be saying “I assumed nothing was going on, but the sample makes that just too hard to believe.”

This same idea — the null hypothesis H0 is innocent till proven guilty — explains why you use 0.26 (po) to figure expected successes and failures, not 0.08 (p̂). Again, the county claims that there’s no racial bias. If that’s true, if there’s no funny business going on, then in the long run 26% of members of jury pools should be African American.

Comment: Usually, if requirements aren’t met you just have to give up. But for one-population binomial data, where the other two requirements are met but expected successes or failures are much under 10, you can use MATH200A part 3 to compute the p-value directly. There’s an example in “Small Samples”, below.
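The three checks in step RC are mechanical enough to script. Here’s a minimal sketch (my own helper function, not anything from the text or the calculator), shown with the Swain numbers:

```python
def check_requirements(n, p0, N_at_least, random_sample):
    """Requirements for a one-proportion z test.

    n: sample size; p0: the proportion claimed in H0;
    N_at_least: a number the population size certainly exceeds;
    random_sample: True if the sample is random (or can be treated as random).
    """
    return {
        "random sample": random_sample,
        "10n <= N": 10 * n <= N_at_least,
        "expected successes n*p0 >= 10": n * p0 >= 10,
        "expected failures n*(1-p0) >= 10": n * (1 - p0) >= 10,
    }

# Swain v. Alabama: n = 100, p0 = 0.26; the county's male population
# is surely in the tens of thousands at least.
for name, ok in check_requirements(100, 0.26, 100_000, True).items():
    print(name, "OK" if ok else "FAILED")
```

Run the same function with n = 80 and p0 = 0.096 (the small-sample example later in this chapter) and the expected-successes check fails, which is your cue to compute the binomial probability directly instead.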

#### Steps 3/4: Test Statistic and p-Value

This is the heart of a hypothesis test. You assume that the null hypothesis is true, and then use what you know about the sampling distribution to ask: How likely is this sample, given that null hypothesis?

Definition: A test statistic is a standardized measure of the discrepancy between your null hypothesis H0 and your sample. It is the number of standard errors that the sample lies above or below H0.

You can think of a test statistic as a measure of unbelievability, of disagreement between H0 and your sample. A sample hardly ever matches your null hypothesis perfectly, but the closer the test statistic is to zero the better the agreement, and the further the test statistic is from 0 the worse the sample and the null hypothesis disagree with each other.

Because you showed that the sampling distribution is normal and the standard error of the proportion is implicitly known, this is a z test. The test statistic is z = (p̂ − po) / σp̂, where σp̂ = √[ po(1−po) / n ], but as you’ll see your calculator computes everything for you.

Definition: The p-value is the probability of getting your sample, or a sample even further from H0, if H0 is true. The smaller the p-value, the stronger the evidence against the null hypothesis.

Inferential Statistics: Basic Cases tells you that binomial data in one population are Case 2. This is a hypothesis test of population proportion, and you use `1-PropZTest` on your calculator.

To get to that menu selection, press [`STAT`] [`◄`] [`5`]. Enter po from the null hypothesis H0, followed by the number of successes x, the sample size n, and the alternative hypothesis H1. Write everything down before you select `Calculate`. When you get to the output screen, check that your alternative hypothesis H1 is shown correctly at the top of the screen, and then write down everything that’s new.

(3/4) 1-PropZTest: po=.26, x=8, n=100, p<po; outputs: p̂ = .08, z = −4.10, p-value = 0.000 020

By convention, we round the test statistic to two decimal places and the p-value to four decimal places.

When the p-value is less than one in ten thousand, you need more than four decimal places. Some authors just write “p < .0001” when the p-value is that small; they figure nobody cares about fine shades of very low probability. Feel free to use that alternative.

Caution! Watch for powers of 10 (E minus whatever) and never write something daft like “p-value = 2.0346”.

What do these outputs of the 1-PropZTest tell you? The sample proportion, p̂ = 0.08, is more than 4 standard errors below the supposed population proportion, po = 0.26. Your test statistic is z = −4.10. Since 95% of samples have z-scores within ±2, this is surprising. How surprising, that’s what the p-value tells you.

How likely is it to get this sample, or one with even a smaller sample proportion, if the null hypothesis H0 is true? The p-value is 0.000 020, so if there’s no racial bias in selection then there are only two chances in a hundred thousand of getting eight or fewer African Americans in a 100-man jury pool. (There’s a lot more about interpreting the p-value later in this chapter.)
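If you don’t have the calculator handy, the same z and p-value can be reproduced with a few lines of Python. This is a sketch of the standard one-proportion z test (not of the TI’s internals), using only the formula given above:

```python
from math import sqrt, erf

def one_prop_z_test(x, n, p0):
    """One-proportion z test against H1: p < p0.

    Returns (sample proportion, test statistic z, one-tailed p-value).
    """
    p_hat = x / n
    se = sqrt(p0 * (1 - p0) / n)          # standard error under H0
    z = (p_hat - p0) / se
    # one-tailed p-value: P(Z <= z), standard normal CDF via erf
    p_value = 0.5 * (1 + erf(z / sqrt(2)))
    return p_hat, z, p_value

p_hat, z, p = one_prop_z_test(8, 100, 0.26)
print(round(z, 2), p)   # z rounds to -4.10; p is about 0.000 020
```

The function name and layout are mine; the arithmetic is just the formula z = (p̂ − po)/σp̂ with the normal CDF applied to it.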

You don’t actually use the z-score, but I want you to understand something about what a test statistic is. Every case you study will have a different test statistic, and in fact choosing a test statistic is the main difference between cases.

Why does one step have two numbers? In the olden days, when dinosaurs roamed the earth and a slide rule was the hot new thing, you had to compute the SEP and then the z-score; that was step 3. Then you had to look up z in a printed table to find the p-value; that was step 4. The TI-83 or TI-84 gives you both at the same time, but I’ve kept the numbering of steps.

#### Step 5: Decision Rule


There are two and only two possibilities, and all you have to do is pick the correct one based on your p-value and your α:

p < α. Reject H0 and accept H1.

or

p > α. Fail to reject H0.

Caution! There are lots of p’s in problems involving population proportions (Case 2), so make sure you select the right one. The p-value is the first `p` on the 1-PropZTest output screen.

You can add the numbers, if you like — p < α (0.000 020 < 0.05) — but the symbols are required.

(5) p < α. Reject H0 and accept H1.

What are you saying here? The p-value was very small, so that means the chance of getting this sample, if there’s no racial bias, was very small. Previously, you set a significance level of 0.05, meaning you would consider this sample too unlikely if its probability was under 5%. Its probability is under 5%, so the sample and the null hypothesis contradict each other. The sample is what it is, so you can’t reject the sample. Therefore you reject H0 and accept H1 — you declare that there is racial bias.

Another way to look at it: Any sample will vary from the population because random selection is always operating to produce sampling error. But the difference between this sample and the supposed population proportion is just too great to be produced by random selection alone. Something else must be going on also. That something else is the alternative hypothesis H1.

Definition: When the p-value is below α, the sample is too unlikely to come from ordinary sample variability alone, and you have a significant result, or your result is statistically significant.

You always select a significance level before you know the p-value. If you could first get the p-value and then specify a significance level, you could get whichever result you wanted, and there would be no point to doing a hypothesis test at all. Choosing α up front keeps you honest.

#### Step 6: Conclusion (in English)

Since you accepted H1 in the previous step, that’s your conclusion. If you have already written it in English as part of the hypotheses, as I did, then most of your work is already done. You do need to add the significance level or the p-value, so your conclusion will look something like one of these:

(6) The 8% proportion of African American men in Mr. Swain’s jury pool is significantly below the expected 26%, and this is evidence at the 0.05 level of significance of racial bias in the selection.

or

(6) The 8% proportion of African American men in Mr. Swain’s jury pool is significantly below the expected 26%, and this is evidence of racial bias in the selection (p = 0.000 020).

If you’re publishing your hypothesis test, you’ll want to write a thorough conclusion that still makes sense if it’s read on its own. But in class exercises you don’t have to write so much. It’s enough to write “At the 0.05 significance level, there is racial bias in jury selection” or “There is racial bias in jury selection (p = 0.000 020)”.

### 10A2.  Example 2: Cancer Screening

The Colorectal Cancer Screening Guidelines (CDC 2014 [see “Sources Used” at end of book]) recommend a colonoscopy every ten years for adults aged 50 to 75. A public-health researcher believes that only a minority are following this recommendation. She interviews a simple random sample of 500 adults aged 50–75 in Metropolis (pop. 6.4 million) and finds that 235 of them have had a colonoscopy in the past ten years. At the 0.05 level of significance, is her belief correct?

Solution: The population is adults aged 50–75 in Metropolis. You want to know whether a minority of them — under 50% — follow the colonoscopy guideline. Each person either does or does not, so you have binomial data, a test of proportion (Case 2 in Inferential Statistics: Basic Cases). Try to write out the hypothesis test yourself before you look at mine below.

Reminder: Even though you already have the sample data in the problem, when you write the hypotheses, ignore the sample. In principle, you write the hypotheses, then plan the study and gather data. If you use any of the sample data in the hypotheses, something is wrong.

You should have written something pretty close to this:

(1) H0: p = 0.5, half the seniors of Metropolis follow the guideline
    H1: p < 0.5, less than half follow the guideline
(2) α = 0.05
(RC) Random sample? Yes. ✔ 10n ≤ N? Yes, 10n = 10×500 = 5000, surely less than the number of adults aged 50–75 in a population of 6,400,000. ✔ At least 10 successes and 10 failures expected? Yes, npo = 500×.5 = 250, and n−npo = 500−250 = 250. ✔
(3/4) 1-PropZTest: po=.5, x=235, n=500, p<po; outputs: p̂ = .47, z = −1.34, p-value = 0.0899
(5) p > α. Fail to reject H0.
(6) At the 0.05 level of significance, it’s impossible to say whether less than half of Metropolis seniors aged 50–75 follow the CDC guideline for a colonoscopy every ten years or not. [Or, It’s impossible to say whether less than half of Metropolis seniors aged 50–75 follow the CDC guideline for a colonoscopy every ten years or not (p = 0.0899).]

Important: When p is greater than α, you fail to reach a conclusion. In this situation, you must use neutral language. You mention both possibilities without giving more weight to either one, and you use words like “impossible to say” or “can’t determine”.

This is unsatisfying, frankly. You go through all the trouble of gathering data and then you end up with a non-conclusion. Can anything be salvaged from this mess?

Yes, you can do a confidence interval. This at least will let you set bounds on what percent of all seniors follow the guidelines. You’ve already tested requirements as part of the hypothesis test, so go right into your calculations and conclusion. You’re free to pick any confidence level you wish, but 95% is most usual.

1-PropZInt, 235, 500, .95

outputs: (.42625, .51375)

42.6% to 51.4% of Metropolis seniors aged 50–75 follow the CDC guideline on screening for colorectal cancer.
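The interval endpoints are easy to check without the calculator. This sketch (mine, not from the text) uses the usual formula p̂ ± z*·√[p̂(1−p̂)/n], which is what 1-PropZInt computes; 1.96 is the approximate critical value for 95% confidence:

```python
from math import sqrt

def one_prop_z_interval(x, n, z_star=1.96):
    """Confidence interval for a population proportion (z* = 1.96 for 95%)."""
    p_hat = x / n
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

lo, hi = one_prop_z_interval(235, 500)
print(round(lo, 5), round(hi, 5))   # matches the calculator: .42625 to .51375
```

Note that the interval uses p̂ in the standard error, unlike the hypothesis test, which uses po; that’s because a confidence interval has no null hypothesis to assume.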

In a classroom setting, or on regular homework, if you’re assigned a hypothesis test do that and don’t feel obligated to do a confidence interval also. But in real life, and on labs and projects for class, you’ll usually want to do both.

### 10A3.  Example 3: Small Samples

What if your sample is so small that expected successes npo or expected failures n−npo are under 10? You can no longer use `1-PropZTest`, which assumes that the sampling distribution of the proportion is ND, but you can compute the binomial probability directly as long as the other two requirements are still met (SRS and 10n ≤ N). Only the calculation of the p-value changes.

Example: In 2001, 9.6% of Fictional County motorists said that fuel efficiency was the most important factor in their choice of a car. For her statistics project, Amber set out to prove that the percentage has increased since then. She interviewed 80 motorists in a systematic sample of those registering vehicles at the DMV, and 13 of them said that fuel efficiency was the most important factor in their choice of a car. Test her hypothesis, at the 0.05 significance level.

(1) H0: p = 0.096, percentage has not increased
    H1: p > 0.096, percentage has increased
(2) α = 0.05
(RC) SRS? A systematic sample can be analyzed like a random sample. ✔ 10n ≤ N? 10×80 = 800, less than the number of car owners in any county. ✔ Expected successes npo = 80×.096 = 7.7, too far below 10 to live with. ✘ The sampling distribution of p̂ doesn’t follow the normal model, so you can’t use `1-PropZTest`. But the other two requirements are met, so you can proceed, calculating the binomial probability directly.
(3/4) MATH200A/Binomial prob: n=80, p=0.096, x=13 to 80; p-value = 0.0410 (If you don’t have the program, use 1−binomcdf(80,0.096,12) = 0.0410.) [Why 13 to 80? H1 contains >, so you test the probability of getting the sample you got, or a larger one, if H0 is true. If H1 contained <, x would be 0 to 13 — the sample you got, or a smaller one. See Surprised? in Chapter 6.]
(5) p < α. Reject H0 and accept H1.
(6) At the 0.05 significance level, the percentage of Fictional County motorists who rate fuel efficiency as most important has increased since 2001. [Or, The percentage of Fictional County motorists who rate fuel efficiency as most important has increased since 2001 (p = 0.0410).]
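The binomcdf computation is an exact binomial tail sum, and you can reproduce it in Python. This sketch (my own, not from the text) adds up the probabilities of 13 through 80 successes:

```python
from math import comb

def binom_tail_at_least(k, n, p):
    """P(X >= k) for a binomial(n, p) variable: exact right-tail sum."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Amber's sample: 13 of 80 motorists; H0 says p = 0.096, H1 says p > 0.096.
p_value = binom_tail_at_least(13, 80, 0.096)
print(p_value)   # the text reports this as 0.0410
```

This is the same number as 1−binomcdf(80, 0.096, 12), since P(X ≥ 13) = 1 − P(X ≤ 12).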

## 10B.  Sharp Points

Hypothesis tests are based on a simple idea, but there are lots of details to think about. This section clarifies some important ideas about the philosophy and practice of a hypothesis test.

See also:
• Is Statistics Hard? (Dallal 2002 [see “Sources Used” at end of book]) offers great help in getting your head around these new concepts.
• HyperStat’s “Logic of Hypothesis Testing” (Lane 2013 [see “Sources Used” at end of book]) covers many of these same “sharp points”. Because this stuff seems so weird at first, I suggest you look at his take on these same issues in addition to mine.

### 10B1.  Type I and Type II Errors

Definition: A Type I error is rejecting the null hypothesis when it’s actually true.

Definition: A Type II error is failing to reject the null hypothesis when it’s actually false.

A Type I error usually causes you to do something you shouldn’t; a Type II error usually represents a missed opportunity.

Example 4: Suppose your alternative hypothesis H1 is that a new headache remedy PainX helps a greater proportion of people than aspirin.

A Type I error — rejecting H0 and accepting H1 when H0 is actually true — would have you announce that PainX helps more people when in fact it doesn’t. People would then buy PainX instead of aspirin, and their headaches would be less likely to be cured. This is a bad thing.

On the other hand, a Type II error — failing to reject H0 when it’s actually false — would mean you announce an inconclusive result. This keeps PainX off the market when it actually would have helped more people than aspirin. This too is a bad thing.

Example 5: You’re on a jury, and you have to decide whether the accused actually committed the murder. What would be Type I and Type II errors?

To answer that you need to identify your null hypothesis H0. Remember that it’s always some form of “nothing going on here.” In this case, H0 would be that the defendant didn’t commit the murder, and H1 would be that he did.

A Type I error would be condemning an innocent man; a Type II error would be letting a guilty man go free. In our legal system, a defendant is not supposed to be found guilty if there is a reasonable doubt; this would correspond to your α. Probably α = 0.05 is not good enough in a serious case like murder, where a Type I error would mean long jail time or execution, so if you’re on a jury you’d want to be more sure than that.

“Okay then,” you say, “I’ll have to be super careful and not make mistakes.” But remember from Chapter 1: In statistics, “errors” aren’t necessarily mistakes. Errors are discrepancies between your results and reality, whatever their cause. Type I and Type II errors are not mistakes in procedure.

Even if you do everything right in your hypothesis test, you can’t be certain of your answer, because you can never get away from sample variability.

How often will these errors occur? This is where your significance level comes into play. If you perform a lot of tests at α = 0.05, then in the long run a Type I error will occur one time in twenty. It’s too big for these pages, but there’s a cartoon at xkcd.com that illustrates this perfectly. The probability of a Type II error has the symbol β (Greek letter beta) and it has to do with the “power” of the test, its ability to find an effect when there’s an effect to be found. β belongs to a more advanced course, and I don’t do anything with it in this book.
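You can watch the “one time in twenty” happen by simulation. This sketch is an illustration I’ve added, with made-up parameters: it draws many samples from a population where H0 really is true (p = 0.5) and counts how often a two-tailed z test at α = 0.05 rejects anyway. Every one of those rejections is a Type I error:

```python
import random
from math import sqrt, erf

def two_tailed_p(x, n, p0):
    """Two-tailed p-value for a one-proportion z test."""
    se = sqrt(p0 * (1 - p0) / n)
    z = (x / n - p0) / se
    one_tail = 0.5 * (1 + erf(-abs(z) / sqrt(2)))
    return 2 * one_tail

random.seed(1)
p0, n, alpha, trials = 0.5, 500, 0.05, 2000
rejections = 0
for _ in range(trials):
    x = sum(random.random() < p0 for _ in range(n))  # H0 is true here
    if two_tailed_p(x, n, p0) < alpha:
        rejections += 1   # a Type I error: H0 was true but got rejected

print(rejections / trials)   # hovers around 0.05 in the long run
```

No mistake in procedure was made in any of those rejections; the samples just happened to land far from the truth, which is exactly what α = 0.05 says will happen about 5% of the time.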

Earlier, I said that your significance level α is the chance of being wrong that you can live with. Now I can be a little more precise. α is not the chance of any error; α is the chance of a Type I error that you can live with. If one Type I error in 20 hypothesis tests is unacceptable, use a lower significance level — but then you make a Type II error more likely. If that’s unacceptable, increase your sample size.

Somebody is making a mint off the following chart. It’s in every stats textbook I’ve seen, so you may as well have it too:

|                  | Reject H0, accept H1 | Fail to reject H0 |
|------------------|----------------------|-------------------|
| H0 actually true | Type I error         | Correct decision  |
| H0 actually false| Correct decision     | Type II error     |

### 10B2.  One-Tailed or Two-Tailed?

Summary:

How do you know whether your H1 should contain “<” or “>” (a one-tailed test) or “≠” (a two-tailed test)? In class, the problem will usually be clear about whether you’re testing for a “difference” (two-tailed) or testing if something is “better”, “larger”, “less than”, etc. (all one-tailed). But which one should you use when you’re on your own?

In general, prefer a two-tailed test unless you have a specific reason to make a one-tailed test.

When a two-tailed test reaches a statistically significant result, you interpret in a one-tailed manner.

#### Pick the Right Hypotheses

There are two main situations where a one-tailed test makes sense: “(a) where there is truly concern for the outcomes in one [direction] only and (b) where it is completely inconceivable that the results could go in the opposite direction.”

—Dubey, quoted by Kuzma and Bohnenblust (2005, 132) [see “Sources Used” at end of book]

With a one-tailed test, say for μ<4.5, you’re saying that you consider “equal to 4.5” and “greater than 4.5” the same thing, that if μ isn’t less than 4.5 then you don’t care whether it’s equal or it’s greater. Sometimes you really don’t care, but very often you do. If the problem statement is ambiguous, or if this is real life and you have to do a hypothesis test, how do you decide whether to do a one-tailed or two-tailed test?

Testing two-tailed doesn’t prejudge a situation. Do a two-tailed test unless you can honestly say, without looking at the data, that only one direction of difference matters, or only one direction is possible.

Example 6: An existing drug cures people in an average of 4.5 days, and you’re testing a new drug. If you test for μ<4.5, you’re saying that it doesn’t matter whether the new drug takes the same time or takes more time. But that’s wrong: it matters very much. You want to test whether the new drug is different (μ≠4.5). Then if it’s different, you can conclude whether it’s faster or slower.

Another way to look at this whole business: a one-tailed test essentially doubles your α — you’re much more likely to reach a conclusion with dicey data. But that means double the risk of being wrong with a Type I error — not a good thing!

Sometimes the same situation can call for a different test, depending on your viewpoint.

Example 7: You’re the county inspector of weights and measures, checking up on a dairy and its half gallons of milk. Legally, half a gallon is 64 fluid ounces. To a government inspector, “Dairylea gives 64.0 ounces in the average half gallon” and “Dairylea gives more than 64.0 ounces in the average half gallon” are the same (legal), and you care only about whether Dairylea gives less (illegal). A one-tailed test (<) is correct.

But now shift your perspective. You’re Dairylea management. You don’t want to short the customers because that’s illegal, but you don’t want to give too much because that’s giving away money. You make a two-tailed test (≠).

#### p < α in Two-Tailed Test: What Does it Tell You?

After a two-tailed test, if p<α then you can interpret the result as one-tailed.

Example 8: You want to test whether your candidate’s approval rating has changed from the previous dismal 40% after a major policy announcement. Your H1 is p ≠ 0.4, and 170 out of a random sample of 500 voters approve (p̂ = 34%). Your p-value is 0.0062, so you reject H0 and accept H1. You conclude that the candidate’s approval rating has changed.

But you can go further and say that her approval rating has dropped. You do this by combining the facts that (a) you’ve proved that approval rating is different, which means it must be either less or more than 40%, and (b) the sample was less than po (40%).

You can phrase your conclusion something like this, first answering the original question then going beyond it: The candidate’s approval rating has changed from 40% after the speech (p = 0.0062). In fact, it has dropped.

Your justification is the relationship between Confidence Interval and Hypothesis Test (later in this chapter), but you don’t actually have to compute the CI. When p < α in a two-tailed test, po is outside the confidence interval (at the matching confidence level).

• When p̂ is above po and the p-value is < α, the whole CI (if you computed it) would be above po, so you know the true proportion is greater than po.
• Conversely, when p̂ is below po and the p-value is < α, the whole CI would be below po, so you know the true proportion is below po.
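Example 8’s numbers illustrate this. In the sketch below (mine, not from the text), the two-tailed p-value comes from the usual z test, and the CI uses p̂ ± 1.96·√[p̂(1−p̂)/n]:

```python
from math import sqrt, erf

x, n, p0 = 170, 500, 0.40
p_hat = x / n                                    # 0.34
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)       # test statistic
p_two_tailed = 2 * 0.5 * (1 + erf(-abs(z) / sqrt(2)))

margin = 1.96 * sqrt(p_hat * (1 - p_hat) / n)    # 95% CI around p_hat
lo, hi = p_hat - margin, p_hat + margin

print(round(p_two_tailed, 4))   # 0.0062: reject H0, the rating has changed
print(hi < p0)                  # True: the whole CI is below 40%,
                                # so the rating has dropped
```

Since p < α and the entire interval sits below po = 0.40, the one-tailed interpretation “her approval rating has dropped” is justified.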

### 10B3.  What Does the p-Value Mean?

Summary: The p-value tells you how likely it is to get the sample you got (or a more extreme sample) if the null hypothesis is true.

Many people are confused about the p-value. They try to read too much into it, or they try to simplify it.

Part of the problem is trying to fit the meaning into the traditional structure of a one-sentence definition, so let’s try a story instead. In your experiment, you got a certain result, a sample mean or sample proportion. Assume that the null hypothesis is true. If H0 is true, the properties of the sampling distribution tell you how likely it is to get this sample result, or one even further away from H0. That likelihood is called the p-value.

The one-tailed p-value is exactly the probability that you computed with `normalcdf` in Chapter 8. When that’s less than 0.5, the two-tailed p-value is exactly double the one-tailed p-value.

If the p-value is small, your results are in conflict with H0, so you reject the null and accept the alternative. If the p-value is larger, your sample is not in conflict with H0 and you fail to reject the null, which is stats-talk for failing to reach any kind of conclusion.

In a nice phrase, Sterne and Smith [see “Sources Used” at end of book] say that p-values “measure the strength of the evidence against the null hypothesis; the smaller the p-value, the stronger the evidence against the null hypothesis.” They also quote R. A. Fisher on interpreting a p-value: “If P is between 0.1 and 0.9 there is certainly no reason to suspect the hypothesis tested. If it is below 0.02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at 0.05.”

The message here is that p-values fall on a continuum; you can’t just arbitrarily divide them into “significant” and “not significant” once and for all.

The p-value is the likelihood, if H0 is actually true, that random chance could give you the results you got, or results even further from H0. It is a conditional probability:

p-value = P(this sample given that H0 is true)

Yes, that seems convoluted — because it is. Alas, there just isn’t any description of a p-value that is both correct and simple.

The p-value is not the probability that either hypothesis is true or false:

• The p-value is not the probability that H0 is true.
• The p-value is not the probability that H0 is false.
• The p-value is not the probability that H1 is true.
• The p-value is not the probability that H1 is false.
• The p-value is not the probability that your results are due to random chance.
• The p-value is not the probability that your results are not due to random chance.

The p-value is not any of the above because they are all plain probabilities. Once again, the p-value is just a measure of how likely your results would be if H0 is true and random chance is the only factor in selecting the sample.

The p-value tells you how unlikely this sample (or a more extreme one) is if the null hypothesis is true. The more unlikely (surprising, unexpected), the lower the p-value, and the more confident you can feel about rejecting H0.

See also:
• Re: P Value for Kids (Moore 2003 [see “Sources Used” at end of book]), a very short article that cuts through all the bs.
• Sifting the Evidence — What’s Wrong with Significance Tests? (Sterne and Smith 2001 [see “Sources Used” at end of book]). This is more advanced, but still readable. They don’t argue against significance tests but do argue against the blind use of 0.05 as a significance level in medical studies.

There’s one other thing: the p-value is not a measure of the size or importance of an effect. That gets into statistical significance versus practical significance.

### 10B4.  Practical and Statistical Significance

If your p-value is less than your significance level α, your result is statistically significant. That low p-value, Wheelan (2013, 11) [see “Sources Used” at end of book] writes, means that your result is “not likely to be the product of chance alone”. That’s all that statistical significance means. But even if a result is statistically significant, it may not be practically significant.

Example 9: Suppose that your p-value for “PainX is more likely to help a person than aspirin” is 0.000 002. You’re pretty darn sure that PainX is better. But to determine whether the result is practically significant, you have to ask not just whether PainX is better, but by how much.

One way to evaluate practical significance is to compute a confidence interval about the effect size. In this case, the 95% confidence interval is that a person is between 1.14 and 2.86 percentage points more likely to be helped by PainX than aspirin. Oh yes, and aspirin costs a buck for 100 tablets, while PainX costs \$29.50 for ten. Most people would say this result has no practical significance. They’re not going to plunk down \$30 for a few pills that are only about 2 percentage points more likely to help them than aspirin.

How can you get such a low p-value when the size of the effect is small? The answer is in extremely large sample sizes. In this made-up case, PainX helped 15,500 people in a sample of 25,000, and aspirin helped 15,000 in a sample of 25,000. When you have really large samples, be especially alert to the issue of statistical versus practical significance.
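With those made-up counts, you can reproduce the confidence interval (to within rounding) from the usual large-sample formula for a difference of two proportions. A Python sketch, assuming the unpooled standard error and z* = 1.96 for 95% confidence:

```python
import math

x1, n1 = 15_500, 25_000      # helped by PainX
x2, n2 = 15_000, 25_000      # helped by aspirin
p1, p2 = x1 / n1, x2 / n2    # 0.62 and 0.60

# unpooled standard error for the difference of two proportions
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z_star = 1.96                # critical value for 95% confidence
diff = p1 - p2               # 2 percentage points
lo, hi = diff - z_star * se, diff + z_star * se
print(f"{lo*100:.2f} to {hi*100:.2f} percentage points")
```

Notice how the huge n makes the standard error tiny, so even a 2-point difference is many standard errors from zero. That's how a small effect earns a microscopic p-value.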

### 10B5.  Conclusions: Write ’em Right!

Summary: As a statistician, you have an ethical obligation to make your results as easy as possible to understand, and as hard as possible to misinterpret.

Avoid common errors when stating conclusions and interpreting them. Make sure you understand what you are doing, and explain it to others in their own language.

used by permission; source: https://xkcd.com/892/ (accessed 2021-11-15)

#### When p < α, you reject H0 and accept H1.

If your p-value is less than your significance level, you have shown that your sample results were unlikely to arise by chance if H0 is true. The data are statistically significant. You therefore reject H0 and accept H1.

Details: Assuming that H0 is true, the sample you got is surprising (unexpected, unusual). The data are inconsistent with the null hypothesis — they can’t both be true. The data are what they are, and if the sample was properly taken you have to believe in it. Therefore, H0 is most likely false. If H0 is false, its opposite H1 is true.

You accept H1, but you haven’t proved it to a certainty. There’s always that p-value chance that the sample results could have occurred when H0 is true. That’s why you say you “accept” H1, not that you have “proved” H1.

Compare to a jury verdict of “guilty”. It means the jury is convinced that the probability (p) that the defendant is innocent is less than a reasonable doubt (significance level, α). It doesn’t mean there is no chance he’s innocent, just that there is very little chance.

Example 10:

Suppose your null H0 is “the average package contains the stated net weight,” your alternative is “the average package contains less than the stated net weight,” and your significance level is 0.05.

If p = 0.0241, which is < α, you reject H0 and accept H1. You conclude “the average package does contain less than the stated net weight (p = 0.0241)” or “the average package does contain less than the stated net weight, at the 0.05 significance level.”

Don’t say the average package “might” be less than the stated weight or “appears to be” less than the stated weight. When you reject H0, state the alternative as a fact within the stated significance level, or preferably with the p-value. (Again, compare to a jury verdict. The jury doesn’t say the defendant “might be guilty”.)

See also: Take published conclusions with a grain of salt. Even professional researchers can misuse hypothesis tests. “Data mining” (first gathering data, then looking for relationships) is one problem, but not the only one. See Why Most Published Research Findings Are False (Ioannidis 2005 [see “Sources Used” at end of book]). If you find the article heavy going, just scroll down to read the example in Box 1 and then the corollaries that follow.

#### When p > α, you fail to reject H0.

If your p-value is greater than your significance level, you have shown that random chance could account for your results if H0 is true. You don’t know that random chance is the explanation, just that it’s a possible explanation. The data are not statistically significant.

You therefore fail to reject H0 (and don’t mention H1 in step 5). The sample you have could have come about by random selection if H0 is true, but it could also have come about by random selection if H0 is false. In other words, you don’t know whether H0 is actually true, or it’s false but the sample data just happened to fall not too far from H0.

Compare to a jury verdict of “not guilty”. That could mean the defendant is actually innocent, or that the defendant is actually guilty but the prosecutor didn’t make a strong enough case.

Example 11: Suppose your null hypothesis is “the average package contains the stated net weight,” your alternative is “the average package contains less than the stated net weight,” and your significance level α is 0.05.

If you compute a p-value of 0.0788, which is > α, you fail to reject H0 in step 5, but how do you state your conclusion in step 6?

There are two kinds of answer, depending on who you talk to. Some people say “there’s insufficient evidence to prove that the average package is underweight”; others say “we can’t tell whether the average package is underweight or not.” Of course there are many ways to write a conclusion in English, but ultimately they boil down to “we can’t prove H1” (or the equivalent “we can’t disprove H0”) versus “we can’t reach a conclusion either way.”

Does it matter? Yes, I think it does.

• “We can’t prove H1” is true, but it’s only part of the truth. The whole truth is that we can’t prove or disprove H0 when the p-value is greater than the significance level.
• Non-technical people are intimidated by statistics and may not realize what is meant. Tell them “we can’t disprove A” and they may think A is true. And if you say “we can’t prove B”, people may think B is false. If you say straightforwardly, “we can’t determine which one is true”, then there’s no risk of people jumping to a false conclusion.
• As a practical matter, we don’t do hypothesis testing just for the fun of it. We want to know something about the real world, not just out of sheer curiosity but because the result will determine what we do. Starting to market a new drug costs mucho dinero, so “we can’t prove that the new drug is better” probably means the drug won’t go to market, and that would be too bad if the new drug actually is better. “We can’t determine whether the new drug is better or not” essentially says “a new study is needed, probably with larger samples.”

Please understand: It’s not that the people writing the conclusions are confused (well, usually not). The problem is confusion among people reading the conclusions.

Advice: It’s the same advice I’ve given before: Tailor your presentation to your audience. If you’re presenting to technical people, the one-sided forms are okay, and you could answer Example 11 with something like “there’s insufficient evidence, at the 0.05 significance level, to show that the average package is underweight” or “… to reject the hypothesis that the average package contains the stated net weight.” (Since the p-value gives more information, you could give that instead of the significance level.)

But if your audience is non-technical people, don’t expect them to understand a two-sided truth from a one-sided conclusion. Instead, use neutral language, such as “We can’t determine from the data whether the average package is underweight or not (p = 0.0788).” (You could state the significance level instead of the p-value.)

##### What if the p-value is very large?

If your p-value is very large, say bigger than 0.5, there’s a good chance you’ve made a mistake. Check carefully whether you should be testing <, ≠, or >. Also check whether you’re testing against the wrong number. For instance, suppose your H1 is that a coin comes up heads more than a third of the time. A few dozen flips will probably yield a p-value very close to 1. This is the statistical equivalent of “Well, duh!”

Sometimes large p-values are correct, but those situations are rare enough that you should be suspicious.

##### Can we never accept the null hypothesis?

Not as a matter of strict logic, no. But there are circumstances where the data do suggest that the null hypothesis is true. The most important of these is when multiple experiments fail to reject H0. Here’s why.

Suppose you do an experiment at the 0.05 significance level, and your p-value is greater than that. Maybe H0 is really true; maybe it’s false but this particular sample happened to be close to H0. You can’t tell — you’ve failed to disprove H0 but that doesn’t mean it’s necessarily true.

But suppose other experimenters also get p-values > 0.05. They can’t all be unlucky in their samples, can they?

If you keep giving the universe opportunities to send you data that contradict the null hypothesis, but you keep getting data that are consistent with the null, then you begin to think that the null hypothesis shouldn’t be rejected, that it’s actually true.

This is why scientists always replicate experiments. If the first experiment fails to reject H0, they don’t know whether H0 is true or they were just unlucky in their sample. But if several experiments fail to reject the null — always assuming the experiments are properly conducted — then they begin to have confidence in the theory.

What if an experiment does reject H0? Is that it, game over? Not necessarily. Remember that even a true H0 will get rejected one time in 20 when tested at the 0.05 level. Once again, the answer is replication. If they get more “reject H0” results, scientists know that the first one wasn’t just a statistical fluke. But if they get a string of “fail to reject H0”, then it’s likely that the first one was just that one in 20, and H0 is actually true.

In How Not to Be Wrong, Jordan Ellenberg (2014, 160–161) [see “Sources Used” at end of book] quotes R. A. Fisher on the need for reproducing results: “A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.” And then, to drive the point home, Ellenberg adds: “Not ‘succeeds once in giving’, but ‘rarely fails to give’.” In other words, that “properly designed experiment” must be performed multiple times, so that it has many opportunities to reject H0. If p > α more often than “rarely” in those many opportunities, you cannot reject H0 and accept H1.

## 10C.  Testing a Mean (Numeric Data)

Summary: Just as you used a `TInterval` in Chapter 9 to make a confidence interval about μ for numeric data, you use a `T-Test` to perform the hypothesis test.

Typically you don’t know σ, the standard deviation (SD) of the population, and therefore you don’t know the standard error σ/√n either. So you estimate the standard error as s/√n, using the known SD of the sample. That means that the test statistic is:

t = (x̅ − μo) / (s/√n)

The t statistic is the estimated number of standard errors between your sample mean and the hypothetical population mean.

You met the t distribution when you computed confidence intervals in Chapter 9. Compared to z, the t distribution is a little flatter and more spread out, especially for small samples, so p-values tend to be larger.

Let’s jump in and do a t test. The numbered steps are almost the same as they were in the examples with binomial data — you just have the necessary variations for working with numeric data. Because I’ll be adding some commentary, I’ve put boxes around what I expect to see from you for a problem like this. (Refer to Seven Steps of Hypothesis Tests if you don’t know the steps very well yet.)

It hardly ever happens, but if you do know the SD of the population you can do a z test instead of a t test. Since the z distribution is a bit less spread out than the t distribution, for very small samples the p-values are typically a bit lower with a z test than with a t. But the difference is rarely enough to change the result — and again, you are quite unlikely to know the SD of the population, so a z test is quite unlikely to be the right one.

### 10C1.  Example 12: Bank Deposits

The management claims that the average cash deposit is \$200.00, and you’ve taken a random sample to test that:

192.68 188.24 152.37 211.73 201.57   167.79 177.19 191.15 209.22 178.49
185.90 226.31 192.38 190.23 156.13   224.07 191.78 203.45 186.40 160.83

At the 0.05 significance level, does this sample show that the average of all cash deposits is different from \$200?

Solution: The data type is numeric, and the population SD σ is unknown, so this is a test of a population mean, Case 1 from Inferential Statistics: Basic Cases. Your hypotheses are:

(1) H0: μ = 200, management’s claim is correct
H1: μ ≠ 200, management’s claim is wrong

Comment: Even though you already have the sample data in the problem, when you write the hypotheses, ignore the sample. In principle, you write the hypotheses, then plan the study and gather data. If you use any of the sample data in the hypotheses, something is wrong.

So you don’t use numbers from the sample in your hypotheses, and you don’t use the sample to help you decide whether the alternative hypothesis H1 should have <, ≠, or >.

The significance level was given in the problem. (Problems will usually give you an α to use.)

(2) α = 0.05

Next is the requirements check. Even though it doesn’t have a number, it’s always necessary. In this case, n = 20, which is less than 30, so you have to test for normality and verify that there are no outliers.

Enter your data in any statistics list (I used `L5`), and check your data entry carefully. Use the MATH200A program “Normality chk” to check for a normal distribution and “Box-whisker” to verify that there are no outliers.

You don’t need to draw the plots, but do write down r and crit and show the comparison, and do check for outliers. (For what to do if you have outliers, see Chapter 3.)

(RC) Random sample: given. 10n = 10×20 = 200, and the bank had better have more deposits than that or it can’t afford to pay you for your work! Normality: yes. From MATH200A part 4, r(0.9864) > crit(0.9503). Outliers: none (MATH200A part 2).

Now it’s time to compute the test statistic (t) and the p-value.

On the `T-Test` screen, you have to choose `Data` or `Stats` just as you did on the `TInterval` screen. You have the actual data, so you select `Data` on the `T-Test` screen, instead of `Stats`. Then the sample mean, sample SD, and sample size are shown on the output screen, so you write them down as part of your results. Always write down x̅, s, and n.

(3/4) T-Test: μo=200, List=L5, Freq=1, μ≠μo
results: t=−2.33, p=0.0311, x̅=189.40, s=20.37, n=20
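If you don’t have the calculator handy, you can compute the same t statistic and two-tailed p-value from scratch. Here is a Python sketch; the numerical tail integration is my own stand-in for the calculator’s built-in t distribution:

```python
import math
from statistics import mean, stdev

def t_tail(t, df, upper=60.0, steps=40_000):
    """P(T > t) for Student's t, by Simpson's rule (textbook accuracy)."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda x: c * (1 + x * x / df) ** (-(df + 1) / 2)
    h = (upper - t) / steps
    total = pdf(t) + pdf(upper)
    for i in range(1, steps):
        total += pdf(t + i * h) * (4 if i % 2 else 2)
    return total * h / 3

deposits = [192.68, 188.24, 152.37, 211.73, 201.57, 167.79, 177.19, 191.15,
            209.22, 178.49, 185.90, 226.31, 192.38, 190.23, 156.13, 224.07,
            191.78, 203.45, 186.40, 160.83]

mu0 = 200.0                              # hypothesized mean from H0
n, xbar, s = len(deposits), mean(deposits), stdev(deposits)
t = (xbar - mu0) / (s / math.sqrt(n))    # test statistic, df = n - 1
p = 2 * t_tail(abs(t), n - 1)            # two-tailed p-value
print(round(t, 2), round(p, 4))          # t ≈ -2.33, p ≈ 0.0311, like T-Test
```

Note that `statistics.stdev` computes the sample SD (dividing by n−1), which is what the t statistic needs.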

The decision rule is the same for every single hypothesis test, regardless of data type. In this case:

(5) p < α. Reject H0 and accept H1.

And as usual, you can write your conclusion with the significance level or the p-value:

(6) At the 0.05 level of significance, management is incorrect and the average of all cash deposits is different from \$200.00. In fact, the true average is lower than \$200.00.

Or,

(6) Management is incorrect, and the average of all cash deposits is different from \$200.00 (p = 0.0311). In fact, the true average is lower than \$200.00.

Remember what happens when you do a two-tailed test (≠ in H1) and p turns out less than α: After you write your “different from” conclusion, you can go on to interpret the direction of the difference. See p < α in Two-Tailed Test.

In a classroom exercise, if you were asked to do a hypothesis test you would do a hypothesis test and only a hypothesis test. But in real life, and in the big labs for class, it makes sense to answer the obvious question: If the true mean is less than \$200.00, what is it?

You don’t have to check requirements for the CI, because you already checked them for the HT.

TInterval L5, 1, .95
outputs: (179.86, 198.93)

With 95% confidence, the average of all cash deposits is between \$179.86 and \$198.93.
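The `TInterval` result can be reproduced by hand as x̅ ± t*·s/√n with df = n−1. A Python sketch follows; the bisection search for t* and the numeric tail integration are my own stand-ins for an inverse-t table:

```python
import math
from statistics import mean, stdev

def t_tail(t, df, upper=60.0, steps=20_000):
    """P(T > t) for Student's t, by Simpson's rule."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda x: c * (1 + x * x / df) ** (-(df + 1) / 2)
    h = (upper - t) / steps
    total = pdf(t) + pdf(upper)
    for i in range(1, steps):
        total += pdf(t + i * h) * (4 if i % 2 else 2)
    return total * h / 3

def t_star(conf, df):
    """Two-sided critical value: P(T > t*) = (1 - conf)/2, by bisection."""
    target = (1 - conf) / 2
    lo, hi = 0.0, 30.0
    for _ in range(50):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if t_tail(mid, df) > target else (lo, mid)
    return (lo + hi) / 2

deposits = [192.68, 188.24, 152.37, 211.73, 201.57, 167.79, 177.19, 191.15,
            209.22, 178.49, 185.90, 226.31, 192.38, 190.23, 156.13, 224.07,
            191.78, 203.45, 186.40, 160.83]

n, xbar, s = len(deposits), mean(deposits), stdev(deposits)
margin = t_star(0.95, n - 1) * s / math.sqrt(n)    # t*(df=19) ≈ 2.093
print(round(xbar - margin, 2), round(xbar + margin, 2))   # ≈ (179.86, 198.93)
```

The same `t_star` value works for any 95% interval with 19 degrees of freedom; only x̅, s, and n change from problem to problem.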

### 10C2.  Example 13: Smokers and Retirement

Here’s an example where you have statistics without the raw data. It’s adapted from Sullivan (2011, 483) [see “Sources Used” at end of book].

According to the Centers for Disease Control, the mean number of cigarettes smoked per day by individuals who are daily smokers is 18.1. Do retired adults who are daily smokers smoke less than the general population of daily smokers?

To answer this question, Sascha obtains a random sample of 40 retired adults who are current daily smokers and records the number of cigarettes smoked on a randomly selected day. The data result in a sample mean of 16.8 cigarettes and an SD of 4.7 cigarettes.

Is there sufficient evidence at the α = 0.01 level of significance to conclude that retired adults who are daily smokers smoke less than the general population of daily smokers?

Solution: Start with the hypotheses. You’re comparing the unknown mean μ for retired smokers to the fixed number 18.1, the known mean for smokers in general. Since the data type is numeric (number of cigarettes smoked), and there’s one population, and you don’t know the SD of the population, this is Case 1, test of population mean, from Inferential Statistics: Basic Cases.

(1) H0: μ = 18.1, retired smokers smoke the same amount as smokers in general
H1: μ < 18.1, retired smokers smoke less than smokers in general

Comment: The claim is a population mean of 18.1, so you use 18.1 in your hypotheses. Using the sample mean of 16.8 in Step 1 is a rookie mistake, one of the Top 10 Mistakes of Hypothesis Tests. Never use sample data in your hypotheses.

Comment: Why does H1 have < instead of ≠? The short answer is: that’s what the problem says to do. In the real world, you would do a two-tailed test (≠) unless there’s a specific reason to do a one-tailed test (< or >); see One-Tailed or Two-Tailed? (earlier in this document). Presumably there’s some reason why they are interested only in the case “retired smokers smoke less” and not in the case “retired smokers smoke more”.

(2) α = 0.01

(RC) Random sample (given). n > 30. 10n = 10×40 = 400, less than the total number of retired smokers. Therefore the sampling distribution is normal.

(3/4) T-Test: μo=18.1, x̅=16.8, s=4.7, n=40, μ<μo
outputs: t=−1.75, p=0.0440

(5) p > α. Fail to reject H0.

(6) At the 0.01 level of significance, we can’t determine whether the average number of cigarettes smoked per day by retired adults who are current smokers is less than the average for all daily smokers or not.

Or,

(6) We can’t tell whether the average number of cigarettes smoked per day by retired adults who are current smokers is less than the average for all daily smokers or not (p = 0.0440).

When you fail to reject H0, you cannot reach any conclusion. You must use neutral language in your non-conclusions. Please review When p > α, you fail to reject H0 earlier in this chapter.

## 10D.  Confidence Interval and Hypothesis Test

Summary: You can use a confidence interval to conclude whether results are statistically significant. A hypothesis test (HT) and confidence interval (CI) are two ways of looking at the same thing: what possibilities for the population mean or proportion are consistent with my sample?

A 95% CI is the flip side of a 0.05 two-tailed HT. More generally, a 1−α CI is the complement of an α two-tailed HT.

Example 14: The baseline rate for heart attacks in diabetes patients is 20.2% in seven years. You have a new diabetes drug, Effluvium, that is effective in treating diabetes. Clinical trials on 89 patients found that 27 (30.3%) had heart attacks. The 95% confidence interval is 20.8% to 39.9% likelihood of heart attack within seven years for diabetes patients taking Effluvium. What does this tell you about the safety of Effluvium?

Solution: Okay, you’re 95% confident that Effluvium takers have a 20.8% to 39.9% chance of a heart attack within seven years. If you’re 95% confident that their chance of heart attack is inside that interval, then there’s only a 5% or 0.05 probability that their chance of heart attack is outside the interval, namely <20.8% or >39.9%.

But 20.2% is outside the interval, so there’s less than a 0.05 chance that the true probability of heart attack with Effluvium is 20.2%.

CI and HT calculations both rely on the sampling distribution. The open curve centered on 20.2% shows the sampling distribution for a hypothetical population proportion of 20.2%. Only a very small part of it extends beyond 30.3%, the proportion of heart attacks you actually found in your sample.

The chance of getting your sample, given a hypothetical proportion po in the population, is the p-value. If po = 20.2%, your sample with p̂ = 30.3% would be unlikely (p-value below 0.05). You would reject the null hypothesis and conclude that Effluvium takers have a different likelihood of heart attack from other diabetes patients, at the 0.05 significance level. Further, the entire confidence interval is above the baseline value, so you know that Effluvium increases the likelihood of heart attack in diabetes patients.
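You can check this correspondence numerically. Here is a Python sketch of the one-proportion z procedures; note that the hypothesis test takes its standard error from po while the confidence interval takes it from p̂, a wrinkle discussed in the Special Note for Binomial Data below:

```python
import math

def normal_cdf(z):
    """P(Z < z) for the standard normal."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

x, n, p0 = 27, 89, 0.202        # heart attacks, sample size, baseline rate
p_hat = x / n                   # sample proportion, about 0.303

# Hypothesis test: standard error from the hypothetical p0
se_ht = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se_ht
p_value = 2 * (1 - normal_cdf(abs(z)))     # two-tailed

# 95% confidence interval: standard error from the sample p-hat
se_ci = math.sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - 1.96 * se_ci, p_hat + 1.96 * se_ci
print(round(p_value, 4), round(lo * 100, 1), round(hi * 100, 1))
# p-value ≈ 0.017 < 0.05, and the CI ≈ 20.8% to 39.9% excludes 20.2%:
# both views give the same verdict about the baseline value
```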

At significance level 0.05, a two-tailed test against any value outside the 95% confidence interval (the shaded curve) would lead to rejecting the null hypothesis. And you can say the same thing for any other significance level α and confidence level 1−α.

What if the interval does include the baseline or hypothetical value? Then you fail to reject the null hypothesis.

Example 15: A machine is supposed to be turning out something with a mean value of 100.00 and SD of 6.00, and you take a random sample of 36 objects produced by the machine. If your sample mean is 98.4 and SD is 5.9, your 95% confidence interval is 96.4 to 100.4.

Now, can you make any conclusion about whether the machine is working properly?

 Solution: Well, you’re 95% confident that the machine’s true mean output is somewhere between 96.4 and 100.4. With this sample, you can rule out a true population mean of <96.4 or >100.4, at the 0.05 significance level; but you can’t rule out a true population mean between 96.4 and 100.4 at α = 0.05. A hypothesis test would fail to reject the hypothesis that μ = 100. You can’t determine whether the true mean output of the machine is equal to 100 or not.
• When μo or po is inside the 1−α CI, the two-tailed p-value is > α. Your sample does not contradict H0 and you fail to reject H0.
• When μo or po is outside the 1−α CI, the two-tailed p-value is < α. Your sample contradicts H0, and you reject H0.

Leaving the symbols aside, when you test a null hypothesis your sample either is surprising (and you reject the null hypothesis) or is not surprising (and you fail to reject the null). Any null hypothesis value inside the confidence interval is close enough to your sample that it would not get rejected, and any null hypothesis value outside the interval is far enough from the sample that it would get rejected.

Jordan Ellenberg (2014, 158) [see “Sources Used” at end of book] explains how the confidence interval gives you more information than any single hypothesis test:

The confidence interval is the range of hypotheses … that are reasonably consistent with the outcome you actually observed. [In a case when H0 is ‘no change’,] the confidence interval might be the range from +3% to +17%. The fact that zero, the null hypothesis, is not included in the confidence interval is just to say that the results are statistically significant in the sense we described earlier in the chapter.

But the confidence interval tells you a lot more. An interval of +3% to +17% licenses you to be confident that the effect is positive, but not that it’s particularly large. An interval of +9% to +11%, on the other hand, suggests much more strongly that the effect is not only positive but sizable.

The confidence interval is also informative in cases where you don’t get a statistically significant result — that is, where the confidence interval contains zero. If the confidence interval is −0.5% to +0.5%, then the reason you didn’t get statistical significance is because [your data provide] good evidence the intervention doesn’t do anything. If the confidence interval is −20% to +20%, the reason you didn’t get statistical significance is because you have no idea whether the intervention has an effect, or in which direction it goes. These two outcomes look the same from the viewpoint of statistical significance, but have quite different implications for what you should do next.

#### Special Note for Binomial Data

For numeric data, the CI and HT are exactly equivalent.

But for binomial data, the CI and HT are only approximately equivalent. Why? Because with binomial data, the HT uses a standard error derived from po in the null hypothesis, but the CI uses a standard error derived from p̂, the sample proportion. Since the standard errors are slightly different, right around the borderline they might get different answers. But when the hypothetical po is a fair distance outside the CI, as it was in the drug example, the p-value will definitely be less than α.

##### What confidence level goes with a one-tailed test?

A confidence interval is symmetric (for the cases you study in this course), so it’s intrinsically two-tailed. A one-tailed HT for < or > at α = 0.01 corresponds to a two-tailed HT for ≠ at α = 0.02, so the CI for a one-tailed HT at α = 0.01 is a 98% CI, not a 99% CI. The confidence level for a one-tailed α is 1−2α, not 1−α.

Correspondence between Significance Level and Confidence Level
| α | tails | C-Level |
|-------|---|-------------------|
| 0.05 | 1 | 1−2×.05 = 90% |
| 0.05 | 2 | 1−.05 = 95% |
| 0.01 | 1 | 1−2×.01 = 98% |
| 0.01 | 2 | 1−.01 = 99% |
| 0.001 | 1 | 1−2×.001 = 99.8% |
| 0.001 | 2 | 1−.001 = 99.9% |
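The table boils down to one small rule. A minimal Python sketch (function name is mine):

```python
def confidence_level(alpha, tails):
    """Confidence level matching a test at significance level alpha.

    Two-tailed test: C = 1 - alpha.  One-tailed test: C = 1 - 2*alpha,
    because a symmetric interval that leaves alpha in one tail must
    leave alpha in the other tail as well.
    """
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    return 1 - alpha if tails == 2 else 1 - 2 * alpha

# Reproduces every row of the table above:
for alpha in (0.05, 0.01, 0.001):
    print(alpha, confidence_level(alpha, 1), confidence_level(alpha, 2))
```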

If the baseline value is outside the confidence interval, you can say (at the appropriate significance level) that the true value of μ or p is different from the baseline, and then go on to say whether it’s bigger or smaller, so you get your one-tailed result.

On the other hand, if the baseline value is inside the confidence interval, you can’t say whether the true μ or p is equal to the baseline or different from it, and if you can’t say whether they’re different then you can’t say which one is bigger than the other.

## 10E.  Testing a Non-Random Sample

Though most hypothesis tests are to find out something about a population, sometimes you just want to know whether this sample is significantly different from a population. In this case, you don’t need a random sample, but the other requirements must still be met.

Example 16: At Wossamatta University, instructors teach the statistics course independently but all sections take the same final exam. (There are several hundred students.) One semester, the mean score on the exam is 74. In one section of 30 students, the mean was 68.2 and the SD was 10.4. The students felt that they had not been adequately prepared for the exam by the instructor. Can they make their case?

Solution: In effect, they are saying that their section performance was significantly below the performance of students in the course overall. This is a testable hypothesis. But the hypothesis is not about the population that these 30 students were drawn from; we already know about that population. Instead, it is a test whether this sample, as a sample, is different from the population.

(1) H0: This section’s mean was no different from the course mean.
H1: This section’s mean was significantly below the course mean.

(2) α = 0.05

(RC) (Omit the requirement for a random sample.) 10n = 10×30 = 300 is less than the “several hundred students” in the course. Sample size is ≥ 30, so the sampling distribution is normal.

(3/4) T-Test: μo = 74, x̅ = 68.2, s = 10.4, n = 30, μ < μo
Outputs: t = −3.05, p-value = 0.0024

(5) p < α. Reject H0 and accept H1.

(6) This section’s average exam score was less than the overall course average (p-value = 0.0024).

Okay, there was a real difference. This section’s mean exam score was not only below the average for the whole course, but too far below for random chance to be enough of an explanation.

But did the students prove their case? Their case was not just that their average score was lower, but that the difference was the result of poor teaching. Statistics can’t answer that question so easily. Maybe it was poor teaching; maybe these were weaker students; maybe it was environmental factors like classroom temperature or the time of day; maybe it was all of the above.

## What Have You Learned?

Key ideas:

(The online book has live links.)

• You don’t know the proportion or mean of a population. You want to test whether it is different from some baseline number. You take a sample, and then compute how likely that sample would be if the true proportion or mean in the population is equal to that baseline. If the sample is too unlikely, you reject the null hypothesis and conclude that the true proportion or mean must be different from that baseline number.
• Know the seven steps of hypothesis tests. Know them by heart, and write them on your cheat sheet if you need to.
• Know whether you have binomial or numeric data. This totally determines which type of test you will do, so think before you act! When you have numeric data, you test for the mean of a population (hypotheses about μ). When you have binomial data in a count of successes, you test for the proportion in a population (hypotheses about p).
• Understand one-tailed versus two-tailed tests. When should you use which one? How do you interpret the results in step 6?
• Understand the significance level α. Know how to pick an appropriate level.
• Understand the p-value. It’s the probability, if H0 is true, of getting the sample you got (or one even further away from H0).
• Know how to write conclusions (if p-value < α) or non-conclusions (if p-value > α).
• Understand Type I and Type II errors. Describe what each one means in specific situations.
• Understand the relationship between a confidence interval and a hypothesis test. How can you relate the endpoints of a CI to whether you do or don’t have a statistically significant result, so that H0 would or wouldn’t be rejected?
Because this textbook helps you, please donate at BrownMath.com/donate.

## Exercises for Chapter 10

Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.

Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

### Problem Set 1

1 List the seven steps of every hypothesis test.
2 Why must you select a significance level before computing a p-value?

3 Explain the p-value in your own words.

4 You’ve tested the hypothesis that the new accelerant makes a difference to the time to dry paint, using α = 0.05. What is wrong with each conclusion, based on the p-value? Write a correct conclusion for that p-value.
(a) p = 0.0214. You conclude, “The accelerant may make a difference, at the 0.05 significance level.”
(b) p = 0.0714. You conclude, “The accelerant makes no difference, at the 0.05 significance level.”

5 You are testing whether the new accelerant makes your paint dry faster. (You have already eliminated the possibility that it makes your paint dry slower.)
(a) What conclusion would be a Type I error? What wrong action would a Type I error lead you to take?
(b) What conclusion would be a Type II error? What wrong action would a Type II error lead you to take?

6 Are Type I and Type II errors actually mistakes? What one thing can you do to prevent both of them, or at least make them both less likely?

7 What can you do to make a Type I error less likely at a given sample size? What’s the unfortunate side effect of that?

8 Explain in your own words the difference between “accept H0” (wrong) and “fail to reject H0” (correct) when your p-value is > α.

9 The engineering department claims that the average battery lifetime is 500 minutes. Write both hypotheses in symbols.

10 Suppose H0 is “the directors are honest” and H1 is “the directors are stealing from the company.” Write conclusions, in Statistics and in English, if …
(a) if p = 0.0405 and α = 0.01
(b) if p = 0.0045 and α = 0.01

11 In your hypothesis test, H0 is “the defendant is innocent” and H1 is “the defendant is guilty”. The crime carries the death penalty. Out of 0.05, 0.01, and 0.001, which is the most appropriate significance level, and why?

12 When Keith read the AAA’s statement that 10% of drivers on Friday and Saturday nights are impaired, he believed the proportion was actually higher for TC3 students. He took a systematic sample of 120 students and, on an anonymous questionnaire, 18 of them admitted being alcohol impaired the last Friday or Saturday night that they drove. Can he prove his point, at the 0.05 significance level?

13 In 2006–2008 there was controversy about creating a sewer district in south Lansing, where residents have had their own septic tanks for years. The Sewer Committee sent out an opinion poll to every household in the proposed sewer district. In a letter to the editor, published 3 Feb 2007 in the Ithaca Journal, John Schabowski wrote, in part:

The Jan. 4 Journal article about the sewer reported that “only” 380 of 1366 households receiving the survey responded, with 232 against it, 119 supporting it, and 29 neutral. … The survey results are statistically valid and accurate for predicting that the sewer project would be voted down by a large margin in an actual referendum.

Can you do a hypothesis test to show that more than half of Lansing households in the proposed district were against the sewer project? (You’re trying to show a majority against, so combine “supporting” and “neutral” since those are not against.)

14 Esperanza wanted to determine whether more than 40% of grocery shoppers — specifically, the primary grocery shoppers in their households — regularly use manufacturers’ coupons. She conducted a random telephone survey and contacted 500 people. (For this exercise, let’s assume that telephone subscribers are representative of grocery shoppers.) Of the 500 she contacted, 325 do the grocery shopping in their households. Of those 325, 182 said they regularly use manufacturers’ coupons.
(a) What is the size of the sample? (Think before you answer!)
(b) What is the population, and how large is it?
(c) What does the number 182 represent?
(d) Don’t do a hypothesis test. But if you did, what would po be?
(e) Is it a source of bias that she considered only each household’s primary grocery shopper?
15 Doubting Thomas remembered the Monty Hall example from Chapter 5, but he didn’t believe the conclusion that switching doors would improve the chance of winning to 2/3. (It’s okay if you don’t remember the example. All the facts you need are right here.)

Thomas watched every Let’s Make a Deal for four weeks. (Though this isn’t a random sample, treat it as one. There’s no reason why the show should operate differently in these four weeks from any others.) In that time, 30 contestants switched doors, and 18 of them won.
(a) At the 0.05 significance level, is it true or false that your chance of winning is 2/3 if you switch doors?
(b) At the 95% confidence level, estimate your chance of winning if you switch doors.
(c) If you don’t switch doors, your chance of winning is 1/3. Using your answer to (b), is switching doors definitely a good strategy, or is there some doubt?

16 Most of us have spam filters on our email. The filter decides whether each incoming piece of mail is spam. Heather trusts her spam filter, and she sets it to just delete spam rather than save it to a folder.
(a) What would Heather’s spam filter do if it makes a Type I error? What would it do if it makes a Type II error?
(b) Which is more serious here, a Type I error or a Type II error? Should the significance level α be set higher or lower?
17 Rosario read in Chapter 6 that 30.4% of US households own cats. She felt like dogs were a lot more visible than cats in Ithaca, so she decided to test whether the true proportion of cat ownership in Ithaca was less than the national proportion. She took a systematic sample of Wegmans shoppers one day, and during the same time period a friend took a systematic sample of Tops shoppers. (They counted groups shopping together, not individual shoppers, so they didn’t have to worry about getting the same household twice.)

Together, they accumulated a sample of 215 households, and of those 54 owned cats. Did she prove her case, at the 0.05 significance level?

### Problem Set 2

18 What is wrong with each pair of hypotheses? Correct the error.

(a) H0 = 14.2;  H1 > 14.2

(b) H0: μ < 25;  H1: μ > 25

(c) You’re testing whether batteries have a mean life of greater than 750 hours. You take a sample, and your sample mean is 762 hours. You write H0:μ=762 hr; H1:μ>762 hr.

(d) Your conventional paint takes 4.3 hours to dry, on average. You’ve developed a drying accelerant and you want to test whether adding it makes a difference to drying time. You write H0: μ=4.3 hr;  H1: μ < 4.3 hr.

19 This year, water pollution readings at State Park Beach seem to be lower than last year. A sample of 10 readings was randomly selected from this year’s daily readings:

3.5   3.9   2.8   3.1   3.1   3.4   3.2   2.5   3.5   3.1

Does this sample provide sufficient evidence, at the 0.01 level, to conclude that the mean of this year’s pollution readings is significantly lower than last year’s mean of 3.8?

20 Dairylea Dairy sells quarts of milk, which by law must contain an average of at least 32 fl. oz. You obtain a random sample of ten quarts and find an average of 31.8 fl. oz. per quart, with SD 0.60 fl. oz. Assuming that the amount delivered in quart containers is normally distributed, does Dairylea have a legal problem? Choose an appropriate significance level and explain your choice.

21 You’re in the research department of StickyCo, and you’re developing a new glue. You want to compare your new glue against StickyCo’s best seller, which has a bond strength of 870 lb/in².

You take 30 samples of your new glue, at random, and you find an average strength of 892.2 lb/in², with SD 56.0. At the 0.05 significance level, is your new glue different in strength?

22 New York Quick Facts from the Census Bureau (2014b) [see “Sources Used” at end of book] says that 32.8% of residents of New York State aged 25 or older had at least a bachelor’s degree in 2008–2012. Let’s assume the figure hasn’t changed today.

You conduct a random sample of 120 residents of Tompkins County aged 25+, and you find that 52 of them have at least a bachelor’s degree.
(a) Construct a 95% confidence interval for the proportion of Tompkins County residents aged 25+ with at least a bachelor’s degree.
(b) Don’t do a full hypothesis test, but use your answer for (a) to determine whether the proportion of bachelor’s degrees in Tompkins County is different from the statewide proportion, at the 0.05 significance level.

23 You’re thinking of buying new Whizzo bungee cords, if the new ones are stronger than your current Stretchie ones. You test a random sample of Whizzo and find these breaking strengths, in pounds:

679   599   678   715   728   678   699   624

At the 0.01 level of significance, is Whizzo stronger on average than Stretchie? (Stretchies have mean strength of 625 pounds.)

24 For her statistics project, Jennifer wanted to prove that TC3 students average more than six hours a week in volunteer work. She gathered a systematic sample of 100 students and found a mean of 6.75 hours and SD of 3.30 hours. Can she make her case, at the 0.05 significance level?

25 As a POW in World War II, John Kerrich flipped a coin 10,000 times and got 5067 heads. At the 0.05 level of significance, was the coin fair?

26 People who take aspirin for headache get relief in an average of 20 minutes (let’s suppose). Your company is testing a new headache remedy, PainX, and in a random sample of 45 headache sufferers you find a mean time to relief of 18 minutes with SD of 8 minutes.
(a) Construct a 95% confidence interval for the mean time to relief of PainX.
(b) Don’t do a full hypothesis test, but use your answer for (a) to determine at the 0.05 significance level whether PainX offers headache relief to the average person in a different time than aspirin.

## What’s New?

• 21 Aug 2023: Added Ellenberg’s quote from R. A. Fisher on the need for replication in science and his connection of hypothesis test with confidence interval.
• 15 Nov 2021: Updated links to xkcd.com.
• 4 Nov 2020: Converted the page from HTML 4.01 to HTML5. Italicized variable names and improved the formatting of radicals.
• (intervening changes suppressed)
• 5–10 Mar 2012: New document, formed by merging eight class handouts.