Stats without Tears
10. Hypothesis Tests
Updated 21 Aug 2023
(What’s New?)
Copyright © 2002–2023 by Stan Brown, BrownMath.com
Summary: You want to know if something is going on (if there’s some effect). You assume nothing is going on (null hypothesis), and you take a sample. You find the probability of getting your sample if nothing is going on (the p-value). If that’s too unlikely, you conclude that something is going on (reject the null hypothesis). If it’s not that unlikely, you can’t reach a conclusion (fail to reject the null).
Remember the Swain v. Alabama example? In a county that was 26% African American, Mr. Swain’s jury pool of 100 men had only eight African Americans. In that example, you assumed that selection was not racially biased, and on that basis you computed the probability of getting such a low proportion. You found that it was very unlikely. This disconnect between the data and the claim led you to reject the claim.
You didn’t know it, but you were doing a hypothesis test. This is the standard way to test a claim in statistics: assume nothing is going on, compute the probability of getting your sample, and then draw a conclusion based on that probability. In this chapter, you’ll learn some formal methods for doing that.
The basic procedure of a hypothesis test or significance test is due to Jerzy Neyman (1894–1981), a Polish American, and Egon Pearson (1895–1980), an Englishman. They published the relevant paper in 1933.
We’re going to take a seven-step approach to hypothesis tests. The first examples will be for binomial data, testing a claim about a population proportion. Later in this chapter you’ll use a similar approach with numeric data to test a claim about a population mean. In later chapters you’ll learn to test other kinds of claims, but all of them will just be variations on this theme.
Your first task is to turn the claim into algebra. The claim may be that nothing is going on, or that something is going on. You always have two statements, called the null and alternative hypotheses.
Definition: The null hypothesis, symbol H_{0}, is the statement that nothing is going on, that there is no effect, “nothin’ to see here. Move along, folks!” It is an equation, saying that p, the proportion in the population (which you don’t know), equals some number.
Definition: The alternative hypothesis, symbol H_{1}, is the statement that something is going on, that there is an effect. It is an inequality, saying that p is different from the number mentioned in H_{0}. (H_{1} could specify <, >, or just ≠.)
The hypotheses are statements about the population, not about your sample. You never use sample data in your hypotheses. (In real life you can’t make that mistake, since you write your hypotheses before you gather data. But in the textbook and the classroom, you always have sample data up front, so don’t make a rookie mistake.)
You must have the algebra (symbols) in your hypotheses, but it can also be helpful to have some English explaining the ultimate meaning of each hypothesis, or the consequences if each hypothesis is true. Here you want to know whether there’s racial bias in jury selection in the county.
You don’t want to know if the proportion of African Americans in Mr. Swain’s jury pool is less than 26%: obviously it is. You want to know if it’s too different — if the difference is too great to be believable as the result of random chance.
Write your hypotheses this way:
(1) 
H_{0}: p = 0.26, there’s no racial bias in jury selection
H_{1}: p < 0.26, there is racial bias in jury selection 

Obviously those can’t both be true. How will you choose between them? You’ll compute the probability of getting your sample (or a more unexpected one), assuming that the null hypothesis H_{0} is true, and one of two things will happen. Maybe the probability will be low. In that case you rule out the possibility that random chance is all that’s happening in jury selection, and you conclude that the alternative hypothesis H_{1} is true. Or maybe the probability won’t be too low, and you’ll conclude that this sample isn’t unusual (unexpected, surprising) for the claimed population.
The number in your null hypothesis H_{0}, with binomial data, is called p_{o} because it’s the proportion as given in H_{0}. (You may want to refer to the Statistics Symbol Sheet to help you keep the symbols straight.)
In fact it’s all people serving on Talladega County jury pools past, present and future. If there’s racial bias, then African Americans are less likely to be selected than whites, and — probability of one, proportion of all — therefore the overall population of jury pools has less than 26% African Americans. If there’s no racial bias, then in the long run the overall population of jury pools has the same 26% of African Americans as the county.
This is why a hypothesis test is also called a significance test or a test of significance.
Okay, you’re looking to figure out if this sample is inconsistent with the null hypothesis. In other words, is it too unlikely, if the null hypothesis H_{0} is true? But what do you mean by “too unlikely”? Back in Chapter 5, we talked about unusual events, with a threshold of 5% or 0.05 for such events. We’ll use that idea in hypothesis testing and call it a significance level.
Definition: The significance level, symbol α (the Greek letter alpha), is the chance of being wrong that you can live with. By convention, you write it as a decimal, not a percentage.
(2)  α = 0.05 

A significance level of 0.05 is standard in business and science. If you can’t tolerate a 5% chance of being wrong — if the consequences are particularly serious — use a lower significance level, 0.01 or 0.001 for example. (0.001 is common if there’s a possibility of death or serious disease or injury.) If the consequences of being wrong are especially minor, you might use a higher significance level, such as 0.10, but this is rare in practice.
In a classroom setting, you’re usually given a significance level α to use.
Later in this chapter, you’ll see that the significance level α is actually concerned with a particular way of being wrong, a Type I error.
Back in Chapter 8, you learned the CLT’s requirements for binomial data: random sample not larger than 10% of population, and at least 10 successes and 10 failures expected if the null hypothesis is true. You compute expected successes as np_{o} by using p_{o}, which is the number from H_{0}. Expected failures are then sample size minus expected successes, n−np_{o} in symbols. Steps 3 and 4 need the sampling distribution of the proportion to be a ND, so you must check the requirements as part of your hypothesis test.
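The requirements check is just arithmetic on n and p_{o}. Here’s a minimal Python sketch, assuming a hypothetical county population of 16,000 eligible jurors (that figure, and the helper name, are mine, not from the book):

```python
def binomial_test_requirements(n, p0, pop_size):
    """Check the CLT requirements for a one-proportion hypothesis test.

    Expected successes and failures are computed from p0 (the value
    claimed in H0), never from the sample proportion p-hat.
    """
    expected_successes = n * p0
    expected_failures = n - expected_successes
    return {
        "sample_small_enough": 10 * n <= pop_size,   # sample <= 10% of population
        "expected_successes": expected_successes,    # need >= 10
        "expected_failures": expected_failures,      # need >= 10
        "normal_ok": expected_successes >= 10 and expected_failures >= 10,
    }

# Swain v. Alabama: n = 100, p0 = 0.26, hypothetical population of 16,000
print(binomial_test_requirements(100, 0.26, 16_000))
```

With n = 100 and p_{o} = 0.26 this reports 26 expected successes and 74 expected failures, so the normal model is justified.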
(RC) 
Random selection? The county claims it, and the test proceeds by assuming H_{0} is true (see the comment below). A jury pool of 100 is less than 10% of all members of Talladega County jury pools, past, present, and future. Expected successes np_{o} = 100×0.26 = 26 and expected failures n−np_{o} = 100−26 = 74, both at least 10.

You might wonder about the first test. “The county may say it’s random, but I don’t believe it. Isn’t that why we’re running this test?” Good question! Answer: Every hypothesis test assumes the null hypothesis is true and computes everything based on that. If you end up deciding that the sample was too unlikely, in effect you’ll be saying “I assumed nothing was going on, but the sample makes that just too hard to believe.”
This same idea — the null hypothesis H_{0} is innocent till proven guilty — explains why you use 0.26 (p_{o}) to figure expected successes and failures, not 0.08 (p̂). Again, the county claims that there’s no racial bias. If that’s true, if there’s no funny business going on, then in the long run 26% of members of jury pools should be African American.
Comment: Usually, if requirements aren’t met you just have to give up. But for one-population binomial data, where the other two requirements are met but expected successes or failures are much under 10, you can use MATH200A part 3 to compute the p-value directly. There’s an example in “Small Samples”, below.
This is the heart of a hypothesis test. You assume that the null hypothesis is true, and then use what you know about the sampling distribution to ask: How likely is this sample, given that null hypothesis?
Definition: A test statistic is a standardized measure of the discrepancy between your null hypothesis H_{0} and your sample. It is the number of standard errors that the sample lies above or below H_{0}.
You can think of a test statistic as a measure of unbelievability, of disagreement between H_{0} and your sample. A sample hardly ever matches your null hypothesis perfectly, but the closer the test statistic is to zero the better the agreement, and the further the test statistic is from 0 the worse the sample and the null hypothesis disagree with each other.
Because you showed that the sampling distribution is normal and the standard error of the proportion is implicitly known, this is a z test. The test statistic is z = (p̂ − p_{o}) / σ_{p̂}, where σ_{p̂} = √( p_{o}(1−p_{o}) / n ), but as you’ll see your calculator computes everything for you.
Definition: The p-value is the probability of getting your sample, or a sample even further from H_{0}, if H_{0} is true. The smaller the p-value, the stronger the evidence against the null hypothesis.
Inferential Statistics: Basic Cases tells you that binomial data in one population are Case 2. This is a hypothesis test of population proportion, and you use 1-PropZTest on your calculator. To get to that menu selection, press [STAT] [◄] [5]. Enter p_{o} from the null hypothesis H_{0}, followed by the number of successes x, the sample size n, and the alternative hypothesis H_{1}. Write everything down before you select Calculate. When you get to the output screen, check that your alternative hypothesis H_{1} is shown correctly at the top of the screen, and then write down everything that’s new.
(3/4) 
1-PropZTest: p_{o}=.26, x=8, n=100, p<p_{o}
outputs: z = −4.10, p-value = 0.000 020, p̂ = 0.08 

By convention, we round the test statistic to two decimal places and the p-value to four decimal places.
When the p-value is less than one in ten thousand, you need more than four decimal places. Some authors just write “p < .0001” when the p-value is that small; they figure nobody cares about fine shades of very low probability. Feel free to use that alternative.
Caution! Watch for powers of 10 (E minus whatever) and never write something daft like “p-value = 2.0346”.
What do these outputs of the 1-PropZTest tell you? The sample proportion, p̂ = 0.08, is more than 4 standard errors below the supposed population proportion, p_{o} = 0.26. Your test statistic is z = −4.10. Since 95% of samples have z-scores within ±2, this is surprising. How surprising? That’s what the p-value tells you.
How likely is it to get this sample, or one with an even smaller sample proportion, if the null hypothesis H_{0} is true? The p-value is 0.000 020, so if there’s no racial bias in selection then there are only two chances in a hundred thousand of getting eight or fewer African Americans in a 100-man jury pool. (There’s a lot more about interpreting the p-value later in this chapter.)
You don’t actually use the z-score, but I want you to understand something about what a test statistic is. Every case you study will have a different test statistic, and in fact choosing a test statistic is the main difference between cases.
Why does one step have two numbers? In the olden days, when dinosaurs roamed the earth and a slide rule was the hot new thing, you had to compute the SEP (standard error of the proportion) and then the z-score; that was step 3. Then you had to look up z in a printed table to find the p-value; that was step 4. The TI-83 or TI-84 gives you both at the same time, but I’ve kept the numbering of steps.
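If you don’t have a TI-83/84 handy, the same two numbers can be reproduced in a few lines of Python with scipy. This is just a sketch; the function name one_prop_ztest is mine, not a library routine:

```python
from math import sqrt
from scipy.stats import norm

def one_prop_ztest(p0, x, n, tail="left"):
    """Replicate the TI-83/84 1-PropZTest: test statistic and p-value."""
    p_hat = x / n
    se = sqrt(p0 * (1 - p0) / n)          # standard error under H0
    z = (p_hat - p0) / se                 # step 3: test statistic
    if tail == "left":                    # H1: p < p0
        p_value = norm.cdf(z)
    elif tail == "right":                 # H1: p > p0
        p_value = norm.sf(z)
    else:                                 # H1: p != p0 (two-tailed)
        p_value = 2 * norm.sf(abs(z))
    return z, p_value

# Swain v. Alabama: p0 = 0.26, x = 8, n = 100, H1: p < p0
z, p = one_prop_ztest(0.26, 8, 100, tail="left")
print(round(z, 2), p)   # z = -4.1, p-value ~ 0.000 020
```

The standard error comes from p_{o}, not p̂, because the whole computation assumes H_{0} is true.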
There are two and only two possibilities, and all you have to do is pick the correct one based on your p-value and your α:
p < α. Reject H_{0} and accept H_{1}.
or
p > α. Fail to reject H_{0}.
Caution! There are lots of p’s in problems involving population proportions (Case 2), so make sure you select the right one. The p-value is the first p on the 1-PropZTest output screen.
You can add the numbers, if you like — p < α (0.000 020 < 0.05) — but the symbols are required.
(5)  p < α. Reject H_{0} and accept H_{1}. 

What are you saying here? The p-value was very small, so that means the chance of getting this sample, if there’s no racial bias, was very small. Previously, you set a significance level of 0.05, meaning you would consider this sample too unlikely if its probability was under 5%. Its probability is under 5%, so the sample and the null hypothesis contradict each other. The sample is what it is, so you can’t reject the sample. Therefore you reject H_{0} and accept H_{1} — you declare that there is racial bias.
Another way to look at it: Any sample will vary from the population because random selection is always operating to produce sampling error. But the difference between this sample and the supposed population proportion is just too great to be produced by random selection alone. Something else must be going on also. That something else is the alternative hypothesis H_{1}.
Definition: When the p-value is below α, the sample is too unlikely to come from ordinary sample variability alone, and you have a significant result, or your result is statistically significant.
You always select a significance level before you know the p-value. If you could first get the p-value and then specify a significance level, you could get whichever result you wanted, and there would be no point to doing a hypothesis test at all. Choosing α up front keeps you honest.
Since you accepted H_{1} in the previous step, that’s your conclusion. If you have already written it in English as part of the hypotheses, as I did, then most of your work is already done. You do need to add the significance level or the pvalue, so your conclusion will look something like one of these:
(6)  The 8% proportion of African American men in Mr. Swain’s jury pool is significantly below the expected 26%, and this is evidence at the 0.05 level of significance of racial bias in the selection. 

or
(6)  The 8% proportion of African American men in Mr. Swain’s jury pool is significantly below the expected 26%, and this is evidence of racial bias in the selection (p = 0.000 020). 

If you’re publishing your hypothesis test, you’ll want to write a thorough conclusion that still makes sense if it’s read on its own. But in class exercises you don’t have to write so much. It’s enough to write “At the 0.05 significance level, there is racial bias in jury selection” or “There is racial bias in jury selection (p = 0.000 020)”.
The Colorectal Cancer Screening Guidelines (CDC 2014 [see “Sources Used” at end of book]) recommend a colonoscopy every ten years for adults aged 50 to 75. A public-health researcher believes that only a minority are following this recommendation. She interviews a simple random sample of 500 adults aged 50–75 in Metropolis (pop. 6.4 million) and finds that 235 of them have had a colonoscopy in the past ten years. At the 0.05 level of significance, is her belief correct?
Solution: The population is adults aged 50–75 in Metropolis. You want to know whether a minority of them — under 50% — follow the colonoscopy guideline. Each person either does or does not, so you have binomial data, a test of proportion (Case 2 in Inferential Statistics: Basic Cases). Try to write out the hypothesis test yourself before you look at mine below.
Reminder: Even though you already have the sample data in the problem, when you write the hypotheses, ignore the sample. In principle, you write the hypotheses, then plan the study and gather data. If you use any of the sample data in the hypotheses, something is wrong.
You should have written something pretty close to this:
(1) 
H_{0}: p = 0.5, half the seniors of Metropolis follow the guideline
H_{1}: p < 0.5, less than half follow the guideline 

(2)  α = 0.05 
(RC) 
Random sample, and 500 is less than 10% of Metropolis’s 6.4 million. Expected successes np_{o} = 500×0.5 = 250 and expected failures 500−250 = 250, both at least 10.

(3/4) 
1-PropZTest: p_{o}=.5, x=235, n=500, p<p_{o}
outputs: z=−1.34, p-value=0.0899, p̂=0.47 
(5)  p > α. Fail to reject H_{0}. 
(6)  At the 0.05 level of significance, it’s impossible to say whether less than half of Metropolis seniors aged 50–75 follow the CDC guideline for a colonoscopy every ten years or not.
[Or, It’s impossible to say whether less than half of Metropolis seniors aged 50–75 follow the CDC guideline for a colonoscopy every ten years or not (p = 0.0899).] 
Important: When p is greater than α, you fail to reach a conclusion. In this situation, you must use neutral language. You mention both possibilities without giving more weight to either one, and you use words like “impossible to say” or “can’t determine”.
This is unsatisfying, frankly. You go through all the trouble of gathering data and then you end up with a non-conclusion. Can anything be salvaged from this mess?
Yes, you can do a confidence interval. This at least will let you set bounds on what percent of all seniors follow the guidelines. You’ve already tested requirements as part of the hypothesis test, so go right into your calculations and conclusion. You’re free to pick any confidence level you wish, but 95% is most usual.
1-PropZInt: 235, 500, .95
outputs: (.42625, .51375)
42.6% to 51.4% of Metropolis seniors aged 50–75 follow the CDC guideline on screening for colorectal cancer.
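The 1-PropZInt arithmetic is easy to reproduce off the calculator too. A minimal Python sketch (the helper name one_prop_zint is mine):

```python
from math import sqrt
from scipy.stats import norm

def one_prop_zint(x, n, conf=0.95):
    """Replicate the TI-83/84 1-PropZInt confidence interval."""
    p_hat = x / n
    z_star = norm.ppf(1 - (1 - conf) / 2)           # critical value, ~1.96 for 95%
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)  # uses p-hat, not p0
    return p_hat - margin, p_hat + margin

lo, hi = one_prop_zint(235, 500)
print(round(lo, 5), round(hi, 5))   # 0.42625 0.51375
```

Notice the interval uses p̂ in its standard error, unlike the hypothesis test, which uses p_{o}; that’s why the two procedures can occasionally disagree near the boundary.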
In a classroom setting, or on regular homework, if you’re assigned a hypothesis test do that and don’t feel obligated to do a confidence interval also. But in real life, and on labs and projects for class, you’ll usually want to do both.
What if your sample is so small that expected successes np_{o} or expected failures n−np_{o} are under 10? You can no longer use 1-PropZTest, which assumes that the sampling distribution of the proportion is ND, but you can compute the binomial probability directly as long as the other two requirements are still met (SRS and 10n≤N). Only the calculation of the p-value changes.
Example: In 2001, 9.6% of Fictional County motorists said that fuel efficiency was the most important factor in their choice of a car. For her statistics project, Amber set out to prove that the percentage has increased since then. She interviewed 80 motorists in a systematic sample of those registering vehicles at the DMV, and 13 of them said that fuel efficiency was the most important factor in their choice of a car. Test her hypothesis, at the 0.05 significance level.
Please write out your hypothesis test before you look at mine.
(1)  H_{0}: p = 0.096, percentage has not increased
H_{1}: p > 0.096, percentage has increased 

(2)  α = 0.05 
(RC) 
Systematic sample, which counts as random here, and 80 is less than 10% of all Fictional County motorists. But expected successes np_{o} = 80×0.096 = 7.68, under 10. The sampling distribution of p̂ doesn’t follow the normal model, so you can’t use 1-PropZTest; compute the binomial probability directly instead.
(3/4) 
MATH200A/Binomial prob:
n=80, p=0.096, x=13 to 80; p-value = 0.0410
(If you don’t have the program, use 1−binomcdf(80,0.096,12) = 0.0410.) [Why 13 to 80? H_{1} contains >, so you test the probability of getting the sample you got, or a larger one, if H_{0} is true. If H_{1} contained <, x would be 0 to 13 — the sample you got, or a smaller one. See Surprised? in Chapter 6.] 
(5)  p < α. Reject H_{0} and accept H_{1}. 
(6)  At the 0.05 significance level, the percentage of Fictional County motorists who rate fuel efficiency as most important has increased since 2001.
[Or, The percentage of Fictional County motorists who rate fuel efficiency as most important has increased since 2001 (p = 0.0410).] 
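If you’d rather not use the calculator program, the same exact binomial tail is one line in Python with scipy (a sketch, equivalent to the 1−binomcdf keystroke above):

```python
from scipy.stats import binom

# H1 contains >, so the p-value is P(X >= 13) when X ~ Binomial(80, 0.096).
# binom.sf(12, ...) = P(X > 12) = P(X >= 13), same as 1 - binomcdf(80, 0.096, 12).
p_value = binom.sf(12, 80, 0.096)
print(round(p_value, 4))   # ~ 0.041
```

For a left-tailed H_{1} you would use binom.cdf(x, n, p0) instead, summing the lower tail.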
Hypothesis tests are based on a simple idea, but there are lots of details to think about. This section clarifies some important ideas about the philosophy and practice of a hypothesis test.
Definition: A Type I error is rejecting the null hypothesis when it’s actually true.
Definition: A Type II error is failing to reject the null hypothesis when it’s actually false.
A Type I error usually causes you to do something you shouldn’t; a Type II error usually represents a missed opportunity.
Example 4: Suppose your alternative hypothesis H_{1} is that a new headache remedy PainX helps a greater proportion of people than aspirin.
A Type I error — rejecting H_{0} and accepting H_{1} when H_{0} is actually true — would have you announce that PainX helps more people when in fact it doesn’t. People would then buy PainX instead of aspirin, and their headaches would be less likely to be cured. This is a bad thing.
On the other hand, a Type II error — failing to reject H_{0} when it’s actually false — would mean you announce an inconclusive result. This keeps PainX off the market when it actually would have helped more people than aspirin. This too is a bad thing.
Example 5: You’re on a jury, and you have to decide whether the accused actually committed the murder. What would be Type I and Type II errors?
To answer that you need to identify your null hypothesis H_{0}. Remember that it’s always some form of “nothing going on here.” In this case, H_{0} would be that the defendant didn’t commit the murder, and H_{1} would be that he did.
A Type I error would be condemning an innocent man; a Type II error would be letting a guilty man go free. In our legal system, a defendant is not supposed to be found guilty if there is a reasonable doubt; this would correspond to your α. Probably α = 0.05 is not good enough in a serious case like murder, where a Type I error would mean long jail time or execution, so if you’re on a jury you’d want to be more sure than that.
“Okay then,” you say, “I’ll have to be super careful and not make mistakes.” But remember from Chapter 1: In statistics, “errors” aren’t necessarily mistakes. Errors are discrepancies between your results and reality, whatever their cause. Type I and Type II errors are not mistakes in procedure.
Even if you do everything right in your hypothesis test, you can’t be certain of your answer, because you can never get away from sample variability.
How often will these errors occur? This is where your significance level comes into play. If you perform a lot of tests at α = 0.05, then in the long run a Type I error will occur one time in twenty. It’s too big for these pages, but there’s a cartoon at xkcd.com that illustrates this perfectly. The probability of a Type II error has the symbol β (Greek letter beta) and it has to do with the “power” of the test, its ability to find an effect when there’s an effect to be found. β belongs to a more advanced course, and I don’t do anything with it in this book.
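You can see that one-in-twenty rate in a quick simulation sketch. Assume H_{0} is true in every test; then a one-tailed z statistic is just a standard normal draw, and about 5% of tests will (wrongly) reject:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Simulate 100,000 one-tailed z tests in a world where H0 is always true:
# the test statistic is then simply a standard normal draw.
z = rng.standard_normal(100_000)
p_values = norm.cdf(z)                     # left-tailed p-values
type_i_rate = np.mean(p_values < 0.05)     # fraction of (wrong) rejections
print(type_i_rate)                         # close to 0.05
```

Lowering α shrinks that rejection rate, at the cost of more Type II errors, just as the text says.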
Earlier, I said that your significance level α is the chance of being wrong that you can live with. Now I can be a little more precise. α is not the chance of any error; α is the chance of a Type I error that you can live with. If one Type I error in 20 hypothesis tests is unacceptable, use a lower significance level — but then you make a Type II error more likely. If that’s unacceptable, increase your sample size.
Somebody is making a mint off the following chart. It’s in every stats textbook I’ve seen, so you may as well have it too:
                                     Reject H_{0}, accept H_{1}   Fail to reject H_{0}
If H_{0} is actually true            Type I error                 Correct decision
If H_{0} is actually false           Correct decision             Type II error
(and H_{1} is true)
How do you know whether your H_{1} should contain “<” or “>” (a one-tailed test) or “≠” (a two-tailed test)? In class, the problem will usually be clear about whether you’re testing for a “difference” (two-tailed) or testing if something is “better”, “larger”, “less than”, etc. (all one-tailed). But which one should you use when you’re on your own?
In general, prefer a twotailed test unless you have a specific reason to make a onetailed test.
When a two-tailed test reaches a statistically significant result, you interpret it in a one-tailed manner.
There are two main situations where a onetailed test makes sense: “(a) where there is truly concern for the outcomes in one [direction] only and (b) where it is completely inconceivable that the results could go in the opposite direction.”
—Dubey, quoted by Kuzma and Bohnenblust (2005, 132) [see “Sources Used” at end of book]
With a one-tailed test, say for μ<4.5, you’re saying that you consider “equal to 4.5” and “greater than 4.5” the same thing, that if μ isn’t less than 4.5 then you don’t care whether it’s equal or greater. Sometimes you really don’t care, but very often you do. If the problem statement is ambiguous, or if this is real life and you have to do a hypothesis test, how do you decide whether to do a one-tailed or two-tailed test?
Testing two-tailed doesn’t prejudge a situation. Do a two-tailed test unless you can honestly say, without looking at the data, that only one direction of difference matters, or only one direction is possible.
Example 6: An existing drug cures people in an average of 4.5 days, and you’re testing a new drug. If you test for μ<4.5, you’re saying that it doesn’t matter whether the new drug takes the same time or takes more time. But that’s wrong: it matters very much. You want to test whether the new drug is different (μ≠4.5). Then if it’s different, you can conclude whether it’s faster or slower.
Another way to look at this whole business: a one-tailed test essentially doubles your α — you’re much more likely to reach a conclusion with dicey data. But that means double the risk of a Type I error — not a good thing!
Sometimes the same situation can call for a different test, depending on your viewpoint.
Example 7: You’re the county inspector of weights and measures, checking up on a dairy and its half gallons of milk. Legally, half a gallon is 64 fluid ounces. To a government inspector, “Dairylea gives 64.0 ounces in the average half gallon” and “Dairylea gives more than 64.0 ounces in the average half gallon” are the same (legal), and you care only about whether Dairylea gives less (illegal). A one-tailed test (<) is correct.
But now shift your perspective. You’re Dairylea management. You don’t want to short the customers because that’s illegal, but you don’t want to give too much because that’s giving away money. You make a two-tailed test (≠).
After a two-tailed test, if p < α then you can interpret the result as one-tailed.
Example 8: You want to test whether your candidate’s approval rating has changed from the previous dismal 40% after a major policy announcement. Your H_{1} is p ≠ 0.4, and 170 out of a random sample of 500 voters approve (p̂ = 34%). Your p-value is 0.0062, so you reject H_{0} and accept H_{1}. You conclude that the candidate’s approval rating has changed.
But you can go further and say that her approval rating has dropped. You do this by combining the facts that (a) you’ve proved that approval rating is different, which means it must be either less or more than 40%, and (b) the sample p̂ was less than p_{o} (40%).
You can phrase your conclusion something like this, first answering the original question then going beyond it: The candidate’s approval rating has changed from 40% after the speech (p = 0.0062). In fact, it has dropped.
Your justification is the relationship between Confidence Interval and Hypothesis Test (later in this chapter), but you don’t actually have to compute the CI. When p < α in a two-tailed test, p_{o} is outside the confidence interval (at the matching confidence level).
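Example 8’s numbers check out in a few lines of Python (a sketch using scipy, following the same z-test formula as before):

```python
from math import sqrt
from scipy.stats import norm

# Example 8: H0: p = 0.4, two-tailed H1: p != 0.4, with 170 of 500 approving.
p0, x, n = 0.4, 170, 500
p_hat = x / n                                  # 0.34
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_two = 2 * norm.sf(abs(z))                    # two-tailed p-value
print(round(z, 2), round(p_two, 4))            # -2.74 0.0062

# p_two < 0.05 and p_hat < p0, so the two-tailed conclusion ("changed")
# can be sharpened one-tailed: the rating has dropped.
```

Note the two-tailed p-value is exactly double the one-tailed tail area, which is why the one-tailed reading comes free once you reject H_{0}.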
Summary: The p-value tells you how likely it is to get the sample you got (or a more extreme sample) if the null hypothesis is true.
Many people are confused about the p-value. They try to read too much into it, or they try to simplify it.
Part of the problem is trying to fit the meaning into the traditional structure of a one-sentence definition, so let’s try a story instead. In your experiment, you got a certain result, a sample mean or sample proportion. Assume that the null hypothesis is true. If H_{0} is true, the properties of the sampling distribution tell you how likely it is to get this sample result, or one even further away from H_{0}. That likelihood is called the p-value.
The one-tailed p-value is exactly the probability that you computed with normalcdf in Chapter 8. When that’s less than 0.5, the two-tailed p-value is exactly double the one-tailed p-value.
If the p-value is small, your results are in conflict with H_{0}, so you reject the null and accept the alternative. If the p-value is larger, your sample is not in conflict with H_{0} and you fail to reject the null, which is stats-talk for failing to reach any kind of conclusion.
In a nice phrase, Sterne and Smith [see “Sources Used” at end of book] say that p-values “measure the strength of the evidence against the null hypothesis; the smaller the p-value, the stronger the evidence against the null hypothesis.” They also quote R. A. Fisher on interpreting a p-value: “If P is between 0.1 and 0.9 there is certainly no reason to suspect the hypothesis tested. If it is below 0.02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at 0.05.”
The message here is that p-values fall on a continuum; you can’t just arbitrarily divide them into “significant” and “not significant” once and for all.
The p-value is the likelihood, if H_{0} is actually true, that random chance could give you the results you got, or results even further from H_{0}. It is a conditional probability:
p-value = P(this sample, or one more extreme, given that H_{0} is true)
Yes, that seems convoluted — because it is. Alas, there just isn’t any description of a p-value that is both correct and simple.
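One way to make the conditional probability concrete is a simulation sketch: build a world where H_{0} really is true, draw lots of samples, and see how often a result as extreme as yours shows up:

```python
import numpy as np

rng = np.random.default_rng(42)

# Assume H0 is true: jury pools are drawn randomly with p = 0.26.
# Simulate a million 100-man pools and count how often we see a result
# as extreme as Mr. Swain's (8 or fewer African Americans).
counts = rng.binomial(n=100, p=0.26, size=1_000_000)
frac = np.mean(counts <= 8)
print(frac)   # a tiny fraction, the same order of magnitude as the p-value
```

The simulated fraction approximates the exact binomial tail; the z-test p-value of 0.000 020 is the normal-model approximation to the same quantity.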
The p-value is not the probability that either hypothesis is true or false:
The p-value is not P(H_{0} is true) and not P(H_{0} is false).
The p-value is not P(H_{1} is true) and not P(H_{1} is false).
The p-value is not any of those because they are all plain probabilities. Once again, the p-value is just a measure of how likely your results would be if H_{0} is true and random chance is the only factor in selecting the sample.
The p-value tells you how unlikely this sample (or a more extreme one) is if the null hypothesis is true. The more unlikely (surprising, unexpected), the lower the p-value, and the more confident you can feel about rejecting H_{0}.
There’s one other thing: the p-value is not a measure of the size or importance of an effect. That gets into statistical significance versus practical significance.
If your p-value is less than your significance level α, your result is statistically significant. That low p-value, Wheelan (2013, 11) [see “Sources Used” at end of book] writes, means that your result is “not likely to be the product of chance alone”. That’s all that statistical significance means. But even if a result is statistically significant, it may not be practically significant.
Example 9: Suppose that your p-value for “PainX is more likely to help a person than aspirin” is 0.000 002. You’re pretty darn sure that PainX is better. But to determine whether the result is practically significant, you have to ask not just whether PainX is better, but by how much.
One way to evaluate practical significance is to compute a confidence interval about the effect size. In this case, the 95% confidence interval is that a person is between 1.14 and 2.86 percentage points more likely to be helped by PainX than aspirin. Oh yes, and aspirin costs a buck for 100 tablets, while PainX costs $29.50 for ten. Most people would say this result has no practical significance. They’re not going to plunk down $30 for a few pills that are only about 2 percentage points more likely to help them than aspirin.
How can you get such a low p-value when the size of the effect is small? The answer lies in extremely large sample sizes. In this made-up case, PainX helped 15,500 people in a sample of 25,000, and aspirin helped 15,000 in a sample of 25,000. When you have really large samples, be especially alert to the issue of statistical versus practical significance.
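Here’s a sketch of those made-up PainX numbers, run through a standard pooled two-proportion z test (a case the book takes up in a later chapter; the point here is just that a huge sample turns a 2-point difference into a tiny p-value):

```python
from math import sqrt
from scipy.stats import norm

# Made-up data: 15,500 of 25,000 helped by PainX vs 15,000 of 25,000 by aspirin.
x1, n1 = 15_500, 25_000
x2, n2 = 15_000, 25_000
p1, p2 = x1 / n1, x2 / n2                  # 0.62 vs 0.60: only 2 points apart
pooled = (x1 + x2) / (n1 + n2)             # pooled proportion under H0
se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_value = norm.sf(z)                       # one-tailed: is PainX better?
print(p_value < 0.0001, round(p1 - p2, 2))   # True 0.02
```

Statistically overwhelming, practically trivial: the effect size is the 0.02, not the p-value.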
Avoid common errors when stating conclusions and interpreting them. Make sure you understand what you are doing, and explain it to others in their own language.
If your p-value is less than your significance level, you have shown that your sample results were unlikely to arise by chance if H_{0} is true. The data are statistically significant. You therefore reject H_{0} and accept H_{1}.
Details: Assuming that H_{0} is true, the sample you got is surprising (unexpected, unusual). The data are inconsistent with the null hypothesis — they can’t both be true. The data are what they are, and if the sample was properly taken you have to believe in it. Therefore, H_{0} is most likely false. If H_{0} is false, its opposite H_{1} is true.
You accept H_{1}, but you haven’t proved it to a certainty. There’s always that pvalue chance that the sample results could have occurred when H_{0} is true. That’s why you say you “accept” H_{1}, not that you have “proved” H_{1}.
Compare to a jury verdict of “guilty”. It means the jury is convinced that the probability (p) that the defendant is innocent is less than a reasonable doubt (significance level, α). It doesn’t mean there is no chance he’s innocent, just that there is very little chance.
Suppose your null H_{0} is “the average package contains the stated net weight,” your alternative is “the average package contains less than the stated net weight,” and your significance level is 0.05.
If p = 0.0241, which is < α, you reject H_{0} and accept H_{1}. You conclude “the average package does contain less than the stated net weight (p = 0.0241)” or “the average package does contain less than the stated net weight, at the 0.05 significance level.”
Don’t say the average package “might” be less than the stated weight or “appears to be” less than the stated weight. When you reject H_{0}, state the alternative as a fact within the stated significance level, or preferably with the pvalue. (Again, compare to a jury verdict. The jury doesn’t say the defendant “might be guilty”.)
See also: Take published conclusions with a grain of salt. Even professional researchers can misuse hypothesis tests. “Data mining” (first gathering data, then looking for relationships) is one problem, but not the only one. See Why Most Published Research Findings Are False (Ioannidis 2005 [see “Sources Used” at end of book]). If you find the article heavy going, just scroll down to read the example in Box 1 and then the corollaries that follow.
If your pvalue is greater than your significance level, you have shown that random chance could account for your results if H_{0} is true. You don’t know that random chance is the explanation, just that it’s a possible explanation. The data are not statistically significant.
You therefore fail to reject H_{0} (and don’t mention H_{1} in step 5). The sample you have could have come about by random selection if H_{0} is true, but it could also have come about by random selection if H_{0} is false. In other words, you don’t know whether H_{0} is actually true, or it’s false but the sample data just happened to fall not too far from H_{0}.
Compare to a jury verdict of “not guilty”. That could mean the defendant is actually innocent, or that the defendant is actually guilty but the prosecutor didn’t make a strong enough case.
Example 11: Suppose your null hypothesis is “the average package contains the stated net weight,” your alternative is “the average package contains less than the stated net weight,” and your significance level α is 0.05.
If you compute a pvalue of 0.0788, which is > α, you fail to reject H_{0} in step 5, but how do you state your conclusion in step 6?
There are two kinds of answer, depending on who you talk to. Some people say “there’s insufficient evidence to prove that the average package is underweight”; others say “we can’t tell whether the average package is underweight or not.” Of course there are many ways to write a conclusion in English, but ultimately they boil down to “we can’t prove H_{1}” (or the equivalent “we can’t disprove H_{0}”) versus “we can’t reach a conclusion either way.”
Does it matter? Yes, I think it does.
Please understand: It’s not that the people writing the conclusions are confused (well, usually not). The problem is confusion among people reading the conclusions.
Advice: It’s the same advice I’ve given before: Tailor your presentation to your audience. If you’re presenting to technical people, the onesided forms are okay, and you could answer Example 11 with something like “there’s insufficient evidence, at the 0.05 significance level, to show that the average package is underweight” or “… to reject the hypothesis that the average package contains the stated net weight.” (Since the pvalue gives more information, you could give that instead of the significance level.)
But if your audience is nontechnical people, don’t expect them to understand a twosided truth from a onesided conclusion. Instead, use neutral language, such as “We can’t determine from the data whether the average package is underweight or not (p = 0.0788).” (You could state the significance level instead of the pvalue.)
If your pvalue is very large, say bigger than 0.5, there’s a good chance you’ve made a mistake. Check carefully whether you should be testing <, ≠, or >. Also check whether you’re testing against the wrong number. For instance, suppose your H_{1} is that a coin comes up heads less than a third of the time. A few dozen flips of an ordinary coin will probably yield a pvalue very close to 1. This is the statistical equivalent of “Well, duh!”
Sometimes large pvalues are correct, but those situations are rare enough that you should be suspicious.
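To see this numerically, here is a small Python sketch (my addition, using only the standard library) that computes an exact binomial pvalue for testing H_{1}: p < 1/3 with an ordinary coin:

```python
from math import comb

def p_value_less(k, n, p0):
    """P(X <= k) for X ~ Binomial(n, p0): the pvalue for H1: p < p0."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k + 1))

# An ordinary coin flipped 48 times typically gives about 24 heads.
# Testing H1 "heads come up less than 1/3 of the time" then yields
# a pvalue very close to 1, a sign the test is asking the wrong question.
p_value = p_value_less(24, 48, 1/3)
print(round(p_value, 4))
```

A pvalue this close to 1 says the data point overwhelmingly in the opposite direction from your H_{1}, which is usually a signal to recheck your setup.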
Not as a matter of strict logic, no. But there are circumstances where the data do suggest that the null hypothesis is true. The most important of these is when multiple experiments fail to reject H_{0}. Here’s why.
Suppose you do an experiment at the 0.05 significance level, and your pvalue is greater than that. Maybe H_{0} is really true; maybe it’s false but this particular sample happened to be close to H_{0}. You can’t tell — you’ve failed to disprove H_{0} but that doesn’t mean it’s necessarily true.
But suppose other experimenters also get pvalues > 0.05. They can’t all be unlucky in their samples, can they?
If you keep giving the universe opportunities to send you data that contradict the null hypothesis, but you keep getting data that are consistent with the null, then you begin to think that the null hypothesis shouldn’t be rejected, that it’s actually true.
This is why scientists always replicate experiments. If the first experiment fails to reject H_{0}, they don’t know whether H_{0} is true or they were just unlucky in their sample. But if several experiments fail to reject the null — always assuming the experiments are properly conducted — then they begin to have confidence in the theory.
What if an experiment does reject H_{0}? Is that it, game over? Not necessarily. Remember that even a true H_{0} will get rejected one time in 20 when tested at the 0.05 level. Once again, the answer is replication. If they get more “reject H_{0}” results, scientists know that the first one wasn’t just a statistical fluke. But if they get a string of “fail to reject H_{0}”, then it’s likely that the first one was just that one in 20, and H_{0} is actually true.
In How Not to Be Wrong, Jordan Ellenberg (2014, 160–161) [see “Sources Used” at end of book] quotes R. A. Fisher on the need for reproducing results: “A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.” And then, to drive the point home, Ellenberg adds: “Not ‘succeeds once in giving’, but ‘rarely fails to give’.” In other words, that “properly designed experiment” must be performed multiple times, so that it has many opportunities to reject H_{0}. If p > α more often than “rarely” in those many opportunities, you cannot reject H_{0} and accept H_{1}.
Summary: Just as you used a TInterval in Chapter 9 to make a confidence interval about μ for numeric data, you use a TTest to perform the hypothesis test.
Typically you don’t know σ, the standard deviation (SD) of the population, and therefore you don’t know the standard error σ/√n either. So you estimate the standard error as s/√n, using the known SD of the sample. That means that the test statistic is:
t = (x̅−μ_{o}) / (s/√n)
The t statistic is the estimated number of standard errors between your sample mean and the hypothetical population mean.
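If you like seeing a formula as code, here is the t statistic as a tiny Python function. This is my own sketch, not part of the book’s TI-83/84 workflow, and the numbers in the example call are made up:

```python
from math import sqrt

def t_statistic(xbar, mu0, s, n):
    """Estimated standard errors between the sample mean xbar and the
    hypothetical population mean mu0, using s/sqrt(n) as the SE."""
    return (xbar - mu0) / (s / sqrt(n))

# Hypothetical illustration: xbar = 103, mu0 = 100, s = 10, n = 25
print(round(t_statistic(103, 100, 10, 25), 2))   # → 1.5
```

A positive t means the sample mean sits above the hypothetical mean; a negative t means it sits below.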
You met the t distribution when you computed confidence intervals in Chapter 9. Compared to z, the t distribution is a little flatter and more spread out, especially for small samples, so pvalues tend to be larger.
Let’s jump in and do a t test. The numbered steps are almost the same as they were in the examples with binomial data — you just have the necessary variations for working with numeric data. Because I’ll be adding some commentary, I’ve put boxes around what I expect to see from you for a problem like this. (Refer to Seven Steps of Hypothesis Tests if you don’t know the steps very well yet.)
It hardly ever happens, but if you do know the SD of the population you can do a z test instead of a t test. Since the z distribution is a bit less spread out than the t distribution, for very small samples the pvalues are typically a bit lower with a z test than with a t. But the difference is rarely enough to change the result — and again, you are quite unlikely to know the SD of the population, so a z test is quite unlikely to be the right one.
The management claims that the average cash deposit is $200.00, and you’ve taken a random sample to test that:
192.68 188.24 152.37 211.73 201.57
167.79 177.19 191.15 209.22 178.49
185.90 226.31 192.38 190.23 156.13
224.07 191.78 203.45 186.40 160.83
At the 0.05 significance level, does this sample show that the average of all cash deposits is different from $200?
Solution: The data type is numeric, and the population SD σ is unknown, so this is a test of a population mean, Case 1 from Inferential Statistics: Basic Cases. Your hypotheses are:
(1) 
H_{0}: μ = 200, management’s claim is correct
H_{1}: μ ≠ 200, management’s claim is wrong 

Comment: Even though you already have the sample data in the problem, when you write the hypotheses, ignore the sample. In principle, you write the hypotheses, then plan the study and gather data. If you use any of the sample data in the hypotheses, something is wrong.
So you don’t use numbers from the sample in your hypotheses, and you don’t use the sample to help you decide whether the alternative hypothesis H_{1} should have <, ≠, or >.
The significance level was given in the problem. (Problems will usually give you an α to use.)
(2)  α = 0.05 

Next is the requirements check. Even though it doesn’t have a number, it’s always necessary. In this case, n = 20, which is less than 30, so you have to test for normality and verify that there are no outliers.
Enter your data in any statistics list (I used L5), and check your data entry carefully. Use the MATH200A program “Normality chk” to check for a normal distribution and “Boxwhisker” to verify that there are no outliers.
You don’t need to draw the plots, but do write down r and crit and show the comparison, and do check for outliers. (For what to do if you have outliers, see Chapter 3.)
(RC) 


Now it’s time to compute the test statistic (t) and the pvalue.
On the TTest screen, you have to choose Data or Stats just as you did on the TInterval screen. You have the actual data, so you select Data on the TTest screen, instead of Stats. Then the sample mean, sample SD, and sample size are shown on the output screen, so you write them down as part of your results. Always write down x̅, s, and n.
(3/4) 
TTest: μ_{o}=200, List=L5, Freq=1, μ≠μ_{o}
results: t=−2.33, p=0.0311, x̅=189.40, s=20.37, n=20 
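If you’d like to verify the calculator output, this Python sketch (my addition, standard library only) recomputes x̅, s, and t from the raw deposits:

```python
import statistics
from math import sqrt

deposits = [192.68, 188.24, 152.37, 211.73, 201.57,
            167.79, 177.19, 191.15, 209.22, 178.49,
            185.90, 226.31, 192.38, 190.23, 156.13,
            224.07, 191.78, 203.45, 186.40, 160.83]

mu0 = 200.00                            # management's claimed mean
xbar = statistics.mean(deposits)
s = statistics.stdev(deposits)          # sample SD (n-1 denominator)
n = len(deposits)
t = (xbar - mu0) / (s / sqrt(n))

print(f"xbar = {xbar:.2f}, s = {s:.2f}, n = {n}, t = {t:.2f}")
```

The printed values match the TTest output above, to two decimal places.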

The decision rule is the same for every single hypothesis test, regardless of data type. In this case:
(5)  p < α. Reject H_{0} and accept H_{1}. 

And as usual, you can write your conclusion with the significance level or the pvalue:
(6)  At the 0.05 level of significance, management is incorrect and the average of all cash deposits is different from $200.00. In fact, the true average is lower than $200.00. 

Or,
(6)  Management is incorrect, and the average of all cash deposits is different from $200.00 (p = 0.0311). In fact, the true average is lower than $200.00. 

Remember what happens when you do a twotailed test (≠ in H_{1}) and p turns out less than α: After you write your “different from” conclusion, you can go on to interpret the direction of the difference. See p < α in TwoTailed Test.
In a classroom exercise, if you were asked to do a hypothesis test you would do a hypothesis test and only a hypothesis test. But in real life, and in the big labs for class, it makes sense to answer the obvious question: If the true mean is less than $200.00, what is it?
You don’t have to check requirements for the CI, because you already checked them for the HT.
With 95% confidence, the average of all cash deposits is between $179.86 and $198.93.
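The same interval can be reproduced from the raw data in a few lines of Python (my sketch; the critical value t* = 2.093 for df = 19 is taken from a t table, since the standard library has no t distribution):

```python
import statistics
from math import sqrt

deposits = [192.68, 188.24, 152.37, 211.73, 201.57,
            167.79, 177.19, 191.15, 209.22, 178.49,
            185.90, 226.31, 192.38, 190.23, 156.13,
            224.07, 191.78, 203.45, 186.40, 160.83]

xbar = statistics.mean(deposits)
s = statistics.stdev(deposits)
n = len(deposits)

t_star = 2.093                      # t critical value, df = 19, 95% (from a t table)
margin = t_star * s / sqrt(n)
print(f"95% CI: {xbar - margin:.2f} to {xbar + margin:.2f}")
```

The endpoints agree with the TInterval result to the penny.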
Here’s an example where you have statistics without the raw data. It’s adapted from Sullivan (2011, 483) [see “Sources Used” at end of book].
According to the Centers for Disease Control, the mean number of cigarettes smoked per day by individuals who are daily smokers is 18.1. Do retired adults who are daily smokers smoke less than the general population of daily smokers?
To answer this question, Sascha obtains a random sample of 40 retired adults who are current daily smokers and records the number of cigarettes smoked on a randomly selected day. The data result in a sample mean of 16.8 cigarettes and an SD of 4.7 cigarettes.
Is there sufficient evidence at the α = 0.01 level of significance to conclude that retired adults who are daily smokers smoke less than the general population of daily smokers?
Solution: Start with the hypotheses. You’re comparing the unknown mean μ for retired smokers to the fixed number 18.1, the known mean for smokers in general. Since the data type is numeric (number of cigarettes smoked), and there’s one population, and you don’t know the SD of the population, this is Case 1, test of population mean, from Inferential Statistics: Basic Cases.
(1) 
H_{0}: μ = 18.1, retired smokers smoke the same amount as smokers in general
H_{1}: μ < 18.1, retired smokers smoke less than smokers in general

Comment: The claim is a population mean of 18.1, so you use 18.1 in your hypotheses. Using the sample mean of 16.8 in Step 1 is a rookie mistake, one of the Top 10 Mistakes of Hypothesis Tests. Never use sample data in your hypotheses.

Comment: Why does H_{1} have < instead of ≠? The short answer is: that’s what the problem says to do. In the real world, you would do a twotailed test (≠) unless there’s a specific reason to do a onetailed test (< or >); see OneTailed or TwoTailed? (earlier in this document). Presumably there’s some reason why they are interested only in the case “retired smokers smoke less” and not in the case “retired smokers smoke more”. 

(2)  α = 0.01 
(RC)  Random sample, and n = 40 ≥ 30. Therefore the sampling distribution is normal. 
(3/4) 
TTest: μ_{o}=18.1, x̅=16.8, s=4.7, n=40, μ<μ_{o} outputs: t=−1.75, p=0.0440 
(5)  p > α. Fail to reject H_{0}. 
(6)  At the 0.01 level of significance, we can’t determine whether the average number of cigarettes smoked per day by retired adults who are current smokers is less than the average for all daily smokers or not.
Or, we can’t tell whether the average number of cigarettes smoked per day by retired adults who are current smokers is less than the average for all daily smokers or not (p = 0.0440). 
When you fail to reject H_{0}, you cannot reach any conclusion. You must use neutral language in your nonconclusions. Please review When p > α, you fail to reject H_{0} earlier in this chapter.
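If you’d like to check the calculator’s t with code, this Python sketch (my addition) recomputes it from the summary statistics and compares it to a critical value taken from a t table, since the Python standard library has no t distribution:

```python
from math import sqrt

mu0, xbar, s, n = 18.1, 16.8, 4.7, 40
t = (xbar - mu0) / (s / sqrt(n))

# one-tailed critical value, df = 39, alpha = 0.01 (from a t table)
t_crit = -2.426
print(f"t = {t:.2f}; reject H0? {t < t_crit}")
```

Because t = −1.75 is not below the critical value, the test fails to reject H_{0}, matching the pvalue comparison (0.0440 > 0.01) above.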
A 95% CI is the flip side of a 0.05 twotailed HT. More generally, a 1−α CI is the complement of an α twotailed HT.
Example 14: The baseline rate for heart attacks in diabetes patients is 20.2% in seven years. You have a new diabetes drug, Effluvium, that is effective in treating diabetes. Clinical trials on 89 patients found that 27 (30.3%) had heart attacks. The 95% confidence interval is 20.8% to 39.9% likelihood of heart attack within seven years for diabetes patients taking Effluvium. What does this tell you about the safety of Effluvium?
Solution: Okay, you’re 95% confident that Effluvium takers have a 20.8% to 39.9% chance of a heart attack within seven years. If you’re 95% confident that their chance of heart attack is inside that interval, then there’s only a 5% or 0.05 probability that their chance of heart attack is outside the interval, namely <20.8% or >39.9%.
But 20.2% is outside the interval, so there’s less than a 0.05 chance that the true probability of heart attack with Effluvium is 20.2%.
CI and HT calculations both rely on the sampling distribution. Picture the sampling distribution for a hypothetical population proportion of 20.2%: only a very small part of it extends beyond 30.3%, the proportion of heart attacks you actually found in your sample.
The chance of getting your sample, given a hypothetical proportion p_{o} in the population, is the pvalue. If p_{o} = 20.2%, your sample with p̂ = 30.3% would be unlikely (pvalue below 0.05). You would reject the null hypothesis and conclude that Effluvium takers have a different likelihood of heart attack from other diabetes patients, at the 0.05 significance level. Further, the entire confidence interval is above the baseline value, so you know that Effluvium increases the likelihood of heart attack in diabetes patients.
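Here is a Python sketch (my addition) of the confidence-interval arithmetic behind Example 14, using the usual p̂ ± 1.96 standard errors:

```python
from math import sqrt

x, n = 27, 89            # heart attacks among Effluvium takers
baseline = 0.202         # baseline seven-year heart-attack rate

p_hat = x / n
se = sqrt(p_hat * (1 - p_hat) / n)       # SE from the sample proportion
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se

print(f"95% CI: {lo:.1%} to {hi:.1%}; baseline inside? {lo <= baseline <= hi}")
```

The baseline 20.2% falls below the interval, which is why a 0.05 twotailed test would reject the null hypothesis.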
At significance level 0.05, a twotailed test against any value outside the 95% confidence interval would lead to rejecting the null hypothesis. And you can say the same thing for any other significance level α and confidence level 1−α.
What if the interval does include the baseline or hypothetical value? Then you fail to reject the null hypothesis.
Example 15: A machine is supposed to be turning out something with a mean value of 100.00 and SD of 6.00, and you take a random sample of 36 objects produced by the machine. If your sample mean is 98.4 and SD is 5.9, your 95% confidence interval is 96.4 to 100.4.
Now, can you make any conclusion about whether the machine is working properly?
Solution: Well, you’re 95% confident that the machine’s true mean output is somewhere between 96.4 and 100.4. With this sample, you can rule out a true population mean of <96.4 or >100.4, at the 0.05 significance level; but you can’t rule out a true population mean between 96.4 and 100.4 at α = 0.05. A hypothesis test would fail to reject the hypothesis that μ = 100. You can’t determine whether the true mean output of the machine is equal to 100 or not. 
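If you want to check the interval arithmetic, here is a minimal Python sketch (my addition; the t critical value is read from a table rather than computed):

```python
from math import sqrt

xbar, s, n = 98.4, 5.9, 36
t_star = 2.030                 # t critical value, df = 35, 95% (from a t table)
margin = t_star * s / sqrt(n)
lo, hi = xbar - margin, xbar + margin

print(f"95% CI: {lo:.1f} to {hi:.1f}")   # the hypothetical mean of 100 is inside
```

Since 100 lies inside the interval, a 0.05 twotailed test would fail to reject μ = 100.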
Leaving the symbols aside, when you test a null hypothesis your sample either is surprising (and you reject the null hypothesis) or is not surprising (and you fail to reject the null). Any null hypothesis value inside the confidence interval is close enough to your sample that it would not get rejected, and any null hypothesis value outside the interval is far enough from the sample that it would get rejected.
Jordan Ellenberg (2014, 158) [see “Sources Used” at end of book] explains how the confidence interval gives you more information than any single hypothesis test:
The confidence interval is the range of hypotheses … that are reasonably consistent with the outcome you actually observed. [In a case when H_{0} is ‘no change’,] the confidence interval might be the range from +3% to +17%. The fact that zero, the null hypothesis, is not included in the confidence interval is just to say that the results are statistically significant in the sense we described earlier in the chapter.
But the confidence interval tells you a lot more. An interval of +3% to +17% licenses you to be confident that the effect is positive, but not that it’s particularly large. An interval of +9% to +11%, on the other hand, suggests much more strongly that the effect is not only positive but sizable.
The confidence interval is also informative in cases where you don’t get a statistically significant result — that is, where the confidence interval contains zero. If the confidence interval is −0.5% to +0.5%, then the reason you didn’t get statistical significance is because [your data provide] good evidence the intervention doesn’t do anything. If the confidence interval is −20% to +20%, the reason you didn’t get statistical significance is because you have no idea whether the intervention has an effect, or in which direction it goes. These two outcomes look the same from the viewpoint of statistical significance, but have quite different implications for what you should do next.
For numeric data, the CI and HT are exactly equivalent.
But for binomial data, the CI and HT are only approximately equivalent. Why? Because with binomial data, the HT uses a standard error derived from p_{o} in the null hypothesis, but the CI uses a standard error derived from p̂, the sample proportion. Since the standard errors are slightly different, right around the borderline they might get different answers. But when the hypothetical p_{o} is a fair distance outside the CI, as it was in the drug example, the pvalue will definitely be less than α.
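To see how different the two standard errors actually are, here is a quick Python sketch (my addition) using the Effluvium numbers from Example 14:

```python
from math import sqrt

n = 89
p0 = 0.202        # hypothetical proportion (used by the HT)
p_hat = 27 / 89   # sample proportion (used by the CI)

se_test = sqrt(p0 * (1 - p0) / n)        # HT standard error, from the hypothesis
se_ci = sqrt(p_hat * (1 - p_hat) / n)    # CI standard error, from the sample

print(f"HT SE = {se_test:.4f}, CI SE = {se_ci:.4f}")  # close, but not equal
```

The two standard errors are close but not identical, which is why HT and CI answers for binomial data can disagree right at the borderline.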
Good question!
A confidence interval is symmetric (for the cases you study in this course), so it’s intrinsically twotailed. A onetailed HT for < or > at α = 0.01 corresponds to a twotailed HT for ≠ at α = 0.02, so the CI for a onetailed HT at α = 0.01 is a 98% CI, not a 99% CI. The confidence level for a onetailed α is 1−2α, not 1−α.
Correspondence between Significance Level and Confidence Level

 α       tails   C-Level
 0.05      1     1−2×.05 = 90%
 0.05      2     1−.05 = 95%
 0.01      1     1−2×.01 = 98%
 0.01      2     1−.01 = 99%
 0.001     1     1−2×.001 = 99.8%
 0.001     2     1−.001 = 99.9%
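The pattern in the table can be written as a one-line function; this Python sketch is my own addition:

```python
def c_level(alpha, tails):
    """Confidence level matching a hypothesis test at significance alpha."""
    return 1 - 2 * alpha if tails == 1 else 1 - alpha

for alpha in (0.05, 0.01, 0.001):
    print(alpha, f"{c_level(alpha, 1):.1%}", f"{c_level(alpha, 2):.1%}")
```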
If the baseline value is outside the confidence interval, you can say (at the appropriate significance level) that the true value of μ or p is different from the baseline, and then go on to say whether it’s bigger or smaller, so you get your onetailed result.
On the other hand, if the baseline value is inside the confidence interval, you can’t say whether the true μ or p is equal to the baseline or different from it, and if you can’t say whether they’re different then you can’t say which one is bigger than the other.
Though most hypothesis tests are to find out something about a population, sometimes you just want to know whether this sample is significantly different from a population. In this case, you don’t need a random sample, but the other requirements must still be met.
Example 16: At Wossamatta University, instructors teach the statistics course independently but all sections take the same final exam. (There are several hundred students.) One semester, the mean score on the exam is 74. In one section of 30 students, the mean was 68.2 and the SD was 10.4. The students felt that they had not been adequately prepared for the exam by the instructor. Can they make their case?
Solution: In effect, they are saying that their section performance was significantly below the performance of students in the course overall. This is a testable hypothesis. But the hypothesis is not about the population that these 30 students were drawn from; we already know about that population. Instead, it is a test whether this sample, as a sample, is different from the population.
(1)  H_{0}: This section’s mean was no different from the course mean.
H_{1}: This section’s mean was significantly below the course mean. 

(2)  α = 0.05 
(RC)  n = 30 ≥ 30, so the sampling distribution of x̅ is normal. 
(3/4)  TTest: μ_{o} = 74, x̅ = 68.2, s = 10.4, n = 30, μ < μ_{o}
Outputs: t = −3.05, pvalue = 0.0024 
(5)  p < α. Reject H_{0} and accept H_{1}. 
(6)  This section’s average exam score was less than the overall course average (pvalue = 0.0024). 
Okay, there was a real difference. This section’s mean exam score was not only below the average for the whole course, but too far below for random chance to be enough of an explanation.
But did the students prove their case? Their case was not just that their average score was lower, but that the difference was the result of poor teaching. Statistics can’t answer that question so easily. Maybe it was poor teaching; maybe these were weaker students; maybe it was environmental factors like classroom temperature or the time of day; maybe it was all of the above.
Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.
Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.
Why must you select a significance level before computing a pvalue?
Explain the pvalue in your own words.
You’ve tested the hypothesis that the new accelerant makes a difference to the time to dry paint, using α = 0.05. What is wrong with each conclusion, based on the pvalue? Write a correct conclusion for that pvalue.
(a) p = 0.0214. You conclude, “The accelerant may make a difference, at the 0.05 significance level.”
(b) p = 0.0714. You conclude, “The accelerant makes no difference, at the 0.05 significance level.”
You are testing whether the new accelerant makes your paint dry faster. (You have already eliminated the possibility that it makes your paint dry slower.)
(a) What conclusion would be a Type I error? What wrong action would a Type I error lead you to take?
(b) What conclusion would be a Type II error? What wrong action would a Type II error lead you to take?
Are Type I and Type II errors actually mistakes? What one thing can you do to prevent both of them, or at least make them both less likely?
What can you do to make a Type I error less likely at a given sample size? What’s the unfortunate side effect of that?
Explain in your own words the difference between “accept H_{0}” (wrong) and “fail to reject H_{0}” (correct) when your pvalue is > α.
The engineering department claims that the average battery lifetime is 500 minutes. Write both hypotheses in symbols.
Suppose H_{0} is “the directors are honest” and H_{1} is “the directors are stealing from the company.” Write conclusions, in Statistics and in English, if …
(a) p = 0.0405 and α = 0.01
(b) p = 0.0045 and α = 0.01
In your hypothesis test, H_{0} is “the defendant is innocent” and H_{1} is “the defendant is guilty”. The crime carries the death penalty. Out of 0.05, 0.01, and 0.001, which is the most appropriate significance level, and why?
When Keith read the AAA’s statement that 10% of drivers on Friday and Saturday nights are impaired, he believed the proportion was actually higher for TC3 students. He took a systematic sample of 120 students and, on an anonymous questionnaire, 18 of them admitted being alcohol impaired the last Friday or Saturday night that they drove. Can he prove his point, at the 0.05 significance level?
In 2006–2008 there was controversy about creating a sewer district in south Lansing, where residents have had their own septic tanks for years. The Sewer Committee sent out an opinion poll to every household in the proposed sewer district. In a letter to the editor, published 3 Feb 2007 in the Ithaca Journal, John Schabowski wrote, in part:
The Jan. 4 Journal article about the sewer reported that “only” 380 of 1366 households receiving the survey responded, with 232 against it, 119 supporting it, and 29 neutral. … The survey results are statistically valid and accurate for predicting that the sewer project would be voted down by a large margin in an actual referendum.
Can you do a hypothesis test to show that more than half of Lansing households in the proposed district were against the sewer project? (You’re trying to show a majority against, so combine “supporting” and “neutral” since those are not against.)
Doubting Thomas remembered the Monty Hall example from Chapter 5, but he didn’t believe the conclusion that switching doors would improve the chance of winning to 2/3. (It’s okay if you don’t remember the example. All the facts you need are right here.)
Thomas watched every Let’s Make a Deal for four weeks. (Though this isn’t a random sample, treat it as one. There’s no reason why the show should operate differently in these four weeks from any others.) In that time, 30 contestants switched doors, and 18 of them won.
(a) At the 0.05 significance level, is it true or false that your chance of winning is 2/3 if you switch doors?
(b) At the 95% confidence level, estimate your chance of winning if you switch doors.
(c) If you don’t switch doors, your chance of winning is 1/3. Using your answer to (b), is switching doors definitely a good strategy, or is there some doubt?
Rosario read in Chapter 6 that 30.4% of US households own cats. She felt like dogs were a lot more visible than cats in Ithaca, so she decided to test whether the true proportion of cat ownership in Ithaca was less than the national proportion. She took a systematic sample of Wegmans shoppers one day, and during the same time period a friend took a systematic sample of Tops shoppers. (They counted groups shopping together, not individual shoppers, so they didn’t have to worry about getting the same household twice.)
Together, they accumulated a sample of 215 households, and of those 54 owned cats. Did she prove her case, at the 0.05 significance level?
(a) H_{0} = 14.2; H_{1} > 14.2
(b) H_{0}: μ < 25; H_{1}: μ > 25
(c) You’re testing whether batteries have a mean life of greater than 750 hours. You take a sample, and your sample mean is 762 hours. You write H_{0}:μ=762 hr; H_{1}:μ>762 hr.
(d) Your conventional paint takes 4.3 hours to dry, on average. You’ve developed a drying accelerant and you want to test whether adding it makes a difference to drying time. You write H_{0}: μ=4.3 hr; H_{1}: μ < 4.3 hr.
This year, water pollution readings at State Park Beach seem to be lower than last year. A sample of 10 readings was randomly selected from this year’s daily readings:
3.5 3.9 2.8 3.1 3.1 3.4 3.2 2.5 3.5 3.1
Does this sample provide sufficient evidence, at the 0.01 level, to conclude that the mean of this year’s pollution readings is significantly lower than last year’s mean of 3.8?
Dairylea Dairy sells quarts of milk, which by law must contain an average of at least 32 fl. oz. You obtain a random sample of ten quarts and find an average of 31.8 fl. oz. per quart, with SD 0.60 fl. oz. Assuming that the amount delivered in quart containers is normally distributed, does Dairylea have a legal problem? Choose an appropriate significance level and explain your choice.
You’re in the research department of StickyCo, and you’re developing a new glue. You want to compare your new glue against StickyCo’s best seller, which has a bond strength of 870 lb/in².
You take 30 samples of your new glue, at random, and you find an average strength of 892.2 lb/in², with SD 56.0. At the 0.05 significance level, is there a difference in your new glue’s strength?
New York Quick Facts from the Census Bureau (2014b) [see “Sources Used” at end of book] says that 32.8% of residents of New York State aged 25 or older had at least a bachelor’s degree in 2008–2012. Let’s assume the figure hasn’t changed today.
You conduct a random sample of 120 residents of Tompkins County aged 25+, and you find that 52 of them have at least a bachelor’s degree.
(a) Construct a 95% confidence interval for the proportion of Tompkins County residents aged 25+ with at least a bachelor’s degree.
(b) Don’t do a full hypothesis test, but use your answer for (a) to determine whether the proportion of bachelor’s degrees in Tompkins County is different from the statewide proportion, at the 0.05 significance level.
You’re thinking of buying new Whizzo bungee cords, if the new ones are stronger than your current Stretchie ones. You test a random sample of Whizzo and find these breaking strengths, in pounds:
679 599 678 715 728 678 699 624
At the 0.01 level of significance, is Whizzo stronger on average than Stretchie? (Stretchies have mean strength of 625 pounds.)
For her statistics project, Jennifer wanted to prove that TC3 students average more than six hours a week in volunteer work. She gathered a systematic sample of 100 students and found a mean of 6.75 hours and SD of 3.30 hours. Can she make her case, at the 0.05 significance level?
As a POW in World War II, John Kerrich flipped a coin 10,000 times and got 5067 heads. At the 0.05 level of significance, was the coin fair?
People who take aspirin for headache get relief in an average of 20 minutes (let’s suppose). Your company is testing a new headache remedy, PainX, and in a random sample of 45 headache sufferers you find a mean time to relief of 18 minutes with SD of 8 minutes.
(a) Construct a 95% confidence interval for the mean time to relief of PainX.
(b) Don’t do a full hypothesis test, but use your answer for (a) to determine at the 0.05 significance level whether PainX offers headache relief to the average person in a different time than aspirin.
Updates and new info: https://BrownMath.com/swt/