Stats without Tears
in One File
Updated 14 Feb 2018
Copyright © 2001–2019 by Stan Brown
STATS
WITHOUT
TEARS
Copyright © 2001–2019 by Stan Brown,
Tompkins Cortland Community College
This book is an alternative to the usual textbooks for a one-semester course in statistics. Whether you’re teaching in a classroom or learning on your own, you’ve come to the right place.
Douglas Adams’ The Hitchhiker’s Guide to the Galaxy bore a “large, friendly label” with the words “Don’t Panic”, and that’s also my message to you.
I don’t see any reason for students to be afraid of statistics. It’s no more difficult than any other technical course, and it’s much more practical than other math courses. The mathematical details are here for those who want them, but I lean heavily on technology to relieve students of the “grunt work”.
You need a TI-83 or TI-84 family calculator to get the most out of this book. For $100 or less, this calculator has amazing capabilities for statistics, and it also supports other math courses up through calculus. I suggest you download my free MATH200A program, which adds some capabilities to the calculator, but this is optional.
Some error conditions on your calculator can be scary when you see them the first time. Don’t panic! See TI-83/84 Troubleshooting.
This textbook grew out of handouts I made for my students at TC3 (Tompkins Cortland Community College in Dryden, New York). The handouts filled gaps and corrected errors in our standard textbook.
As time went on, I found myself replacing whole chapters. Student evaluations showed that they preferred these replacements to the textbook. In Spring 2013 I reached the tipping point: I had replaced more than half of the twelve textbook chapters. In good conscience I didn’t feel I could ask students to buy an expensive textbook that they would use less than half of, so I burned my bridges and announced the required textbook beginning in Summer 2013 as “none”.
In Fall 2013, a second instructor at TC3 adopted this textbook for his class. Benjamin Kirk provided a lot of valuable suggestions and corrections, and I’m very grateful. They have improved the book considerably.
Contact information is at BrownMath.com/about/#Contact.
Please share your reactions, whether positive or negative! If I could explain something better, I’d like to know. If some section works particularly well for you, please tell me. If you find an error, I especially want to know about it. (My own students get extra credit for pointing out errors.)
Being on the Web, this book will get updated frequently, based on your feedback. You can see the revision dates in the chapter list at https://BrownMath.com/swt/pfswt.htm, and a revision history is shown at the end of every chapter.
This eTextbook is a free resource for you. You can read it on line or print any or all chapters. Links to all the chapters are at <https://BrownMath.com/swt/pfswt.htm>. If you print any chapters, you can keep your costs down by choosing black-and-white printing in duplex (two-sided) mode.
Just a word of advice. I’ve tried to make statistics approachable to anyone with high-school math, but it’s still a technical subject. You can’t just read a chapter in one pass from start to end, the way you would a novel or a book of history. Please see How to Read a Math Book for some tips on getting the most out of your time with this book, or any math book.
Some material is marked BTW. This is stuff I find interesting, including mathematical details that some students have asked for, but you can get through the course without it.
Although this is a free resource, it is copyrighted and I would appreciate your asking permission to copy and distribute any of it. My contact information is at BrownMath.com/about/#Contact.
Though you don’t need to ask permission simply to link to this material, I would appreciate knowing about it.
Updated 21 Feb 2016
Summary: We live in an uncertain world. You never have complete information when you make a decision, but you have to make decisions anyway. Statistics helps you make sense of confusing and incomplete information, to decide whether a pattern is meaningful or just coincidence. It’s a way of taming uncertainty, a tool to help you assess the risks and make the best possible decisions.
Statistics is different from any other math course.
Yes, you’ll solve problems. But most will be real-world practical problems: Does aspirin make heart attacks less likely? Was there racial bias in the selection of that jury? What’s your best strategy in a casino? (Most examples will be from business, public policy, and medicine, but we’ll hit other fields too.)
There will be very little use of formulas. Real statisticians don’t do things by hand. They use calculators or software, and so will you. Your TI-83 or TI-84 may seem intimidating at first, but you’ll quickly get to know it and be amazed at how it relieves you of drudgery.
With little grunt work to do, you will focus on what your numbers mean. You’re not just a button-pushing calculator monkey; you have to think about what you’re doing and understand it well enough to explain it. Most of the time your answers will be nontechnical English, not numbers or statistical jargon. That may seem scary and unfamiliar at first, but if you stick with it you’ll love stretching your brain instead of just following a book’s examples by rote.
It may be a required course, so you get that much closer to graduation. ☺ But you can get more than that.
If you do it right, statistics teaches you to think. You become skeptical, unwilling to take everything at face value. Instead, when somebody makes a statement you question how they know that and what they’re not telling you. You can’t be fooled so easily. You become a more thoughtful citizen, a more savvy consumer.
Who knows? You might even have some fun along the way. So— Let’s get started!
Suppose you want to know about the health of athletes who use steroids versus those who don’t. Or you want to know whether people are likely to buy your new type of chips. Or you want to know whether a new type of glue makes boxes less likely to come apart in shipping. How do you answer questions like that?
With most things you want to know, it’s impossible or impractical to examine every member of the group you want to know about, so you examine part of that group and then generalize to the whole group.
In Good Samples, Bad Samples, later in this chapter, you’ll see how samples are actually taken.
The sample is usually a subgroup of the population, but in a census the whole population is the sample.
Example 1: You want to know what proportion of likely voters will vote for your candidate, so you poll 850 people. The people you actually ask are your sample, and the likely voters are the population.
Caution!: Your sample is the 850 people you took data from, not just the subgroup that said they would vote for your candidate. The population is all likely voters, regardless of which candidate they prefer. Yes, you want to know who will vote for your candidate, but everybody’s vote counts, so the group you want to know something about — the population — is all likely voters.
The number of members of your sample is called the sample size or size of the sample (symbol n), and the number of members of the population is called the population size or size of the population (symbol N).
“Sometimes it is not possible to count the units contained in the population. Such a population is called infinite or uncountable.” (Finite and Infinite Population 2014 [see “Sources Used” at end of book]) “Smokers” is an example. There is a definite number of smokers in the world at any moment, but if you try to count them the number changes while you’re counting.
The sample size is always a definite number, since you always know how many individuals you took data from.
Example 2: You’re monitoring quality in a factory that turns out 2400 units an hour, so you test 30 units from each hour’s production.
The units you tested are your sample, and your sample size is 30. All production in that hour is the population, and the population size is 2400.
Isn’t the population the factory’s total production, since you want to know about the overall quality? No! Your sample was all drawn from one hour’s production. A sample from one production run can tell you about that production run, not about overall operations. This is why quality testing never ends.
Example 3: You’re testing a new herpes vaccine. 800 people agree to participate in your study. You divide them randomly into two groups and administer the vaccine to one group and a placebo (something that looks and feels like a vaccine but is medically inactive) to another group. Over the course of the study, a few people drop out, and at the end you have 397 vaccinated individuals and 396 who received the placebo.
You have two samples, individuals who were vaccinated (n₁ = 397) and the control group (n₂ = 396). The corresponding populations are all people who will take this vaccine in the future, and all people who won’t. Both of those populations are uncountable or infinite because more people are being born all the time.
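Here is a minimal sketch of the “divide them randomly into two groups” step from Example 3. It is written in Python rather than for the calculator, and the participant ID numbers, the seed, and the function name `randomly_assign` are my own assumptions for illustration, not part of the study described above.

```python
import random

def randomly_assign(participants, seed=None):
    """Shuffle the participants and split them into two groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(participants)   # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)           # chance, not the investigator, decides the order
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical participant IDs 1..800 (Example 3 starts with 800 volunteers).
vaccine_group, placebo_group = randomly_assign(range(1, 801), seed=2013)
print(len(vaccine_group), len(placebo_group))   # 400 400
```

The groups start out at 400 each; the final sample sizes of 397 and 396 come from dropouts during the study, which no assignment procedure can prevent.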
Sometimes you want to summarize the data from your sample, and other times you want to use the sample to tell you something about the larger population. Those two situations are the two grand branches of statistics.
Definition: Descriptive statistics is summarizing and presenting data that were actually measured. Inferential statistics is making statements about a population based on measurements of a smaller sample.
Example 4: “52.9% of 1000 voters surveyed said they will vote for Candidate A.” That is descriptive statistics because someone actually measured (took responses from) those 1000 people.
Compare: “I’m 95% confident that 49.8% to 56.0% of voters plan to vote for Candidate A.” That is inferential statistics because no one has asked all voters. Instead, a sample of voters was asked, and from that an estimate was made of the feelings of all voters.
Definitions: A statistic is a numerical summary of a sample. A parameter is a numerical summary of a population.
Mnemonic: sample and statistic begin with s; population and parameter both begin with p.
Continuing with Example 4: “52.9% of 1000 voters surveyed plan to vote for Candidate A.” — 52.9% is a statistic because it summarizes the sample.
“I’m 95% confident that 49.8% to 56.0% of voters plan to vote for Candidate A.” — 49.8% to 56.0% is an estimate of a parameter. (The actual parameter is the exact proportion presently planning to vote for A, which you don’t know exactly.)
A statistic is always a statement of descriptive statistics and is always known exactly, because a statistic is a number that summarizes a sample of actual measured data.
A parameter is usually estimated, not known exactly, and therefore is usually a matter of inferential statistics. The exception is a census, in which data are taken from the whole population. In that case, the parameter is known exactly because you have complete data for the population, so the parameter is then descriptive statistics.
| Describing … | The number is … | And the process is … |
|---|---|---|
| Any sample | A statistic | Descriptive statistics |
| A population (usually) | A parameter | Inferential statistics |
| A census (pop. w/ every member surveyed) | Both statistic and parameter | Descriptive statistics |
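To make the statistic-versus-parameter distinction in Example 4 concrete, here is a hedged sketch in Python. The text above doesn’t say how the 49.8% to 56.0% estimate was computed; I am assuming the common normal-approximation interval for a proportion with z = 1.96 for 95% confidence, which happens to reproduce those figures. The function name `proportion_interval` is hypothetical.

```python
import math

def proportion_interval(successes, n, z=1.96):
    """Return the sample proportion and an approximate interval for the population proportion.

    Assumption: normal-approximation interval; z = 1.96 corresponds to 95% confidence.
    """
    p_hat = successes / n                              # the statistic: known exactly
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)    # estimated uncertainty
    return p_hat, p_hat - margin, p_hat + margin

# 52.9% of 1000 voters surveyed is 529 voters.
p_hat, low, high = proportion_interval(529, 1000)
print(f"statistic: {p_hat:.1%}")                          # statistic: 52.9%
print(f"estimate of parameter: {low:.1%} to {high:.1%}")  # estimate of parameter: 49.8% to 56.0%
```

The first number describes the 1000 people who were actually asked. The interval is an inference about all voters, most of whom were never asked.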
You already know that a random sample is a good thing, but did you know that a random sample is actually carefully planned? What if you can’t take a true random sample? What are good and bad ways to gather samples?
All valid samples share one characteristic: they are chosen through probability means, not selected by any decisions made by the person taking the sample. Every valid sample is gathered according to some rule that lets the impersonal operations of probability do the actual selection.
Definition: A probability sample is a sample where the members are chosen by a predetermined process that uses the workings of chance and removes discretion from the investigators. Some of the types of probability samples are discussed below.
See also: For lots of examples of good sampling and (usually) clear presentation of data about the American people, you might want to visit the Pew Research Center and its tech-oriented spinoff, Pew Internet. The venerable Gallup Poll also makes available its snapshots of the American public.
Definition: A random sample (also called a simple random sample) is a sample constructed through a process that gives every member of the population an equal chance of being chosen for the sample.
You always want a random sample, if you can get one. But to create a random sample you need a frame, and in many situations it’s impossible or unreasonably difficult to list all members of the population. The sections below explain alternative types of samples that can lead to statistically valid results.
“Random” doesn’t mean haphazard. Humans think we’re good at constructing random sequences of letters and digits, but actually we’re very bad at it. Try typing 1300 “random” letters on your keyboard. If you do it really randomly, you should get about 1300÷26 = 50 of each letter. (Note: about 50 of each, not exactly 50. To determine whether a particular sample of text is unreasonably far from random letters, see Testing Goodness of Fit to a Model.) But if you’re like most people, the distribution will be very different from that: some letters will occur many more than 50 times, and others many less.
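If you’d like to see what 1300 letters look like when software does the choosing, here is a small Python sketch. The seed is arbitrary and the variable names are mine. Your counts will differ if you change the seed, but each letter should land somewhere near 50, not exactly on it.

```python
import random
import string
from collections import Counter

rng = random.Random(1300)   # arbitrary seed, used only to make the run repeatable
letters = [rng.choice(string.ascii_lowercase) for _ in range(1300)]
counts = Counter(letters)

expected = 1300 / 26        # 50.0 of each letter, on average
print("expected per letter:", expected)
for letter in string.ascii_lowercase:
    # Counter returns 0 for a letter that never appeared.
    print(letter, counts[letter])
```

Compare that with a tally of your own typed “random” letters; the typed version usually strays much further from 50.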
So how do you construct a random sample? You need a frame, plus a random-number generator or random-number table.
The frame need not be a physical list; it can be a computer file — these days it usually is. But it has to be a complete list.
If you have a table of random numbers, the table will come with instructions for use. I’ll show you how to do it with the TI-83/84, but you could also do it with Excel’s RANDBETWEEN( ) function, or with any other software that generates pseudorandom numbers. (The Web site random.org provides true random numbers based on atmospheric noise.)
Random numbers from software or a calculator aren’t really random, but what we call pseudorandom numbers. That means that they are generated by deterministic calculations designed to mimic randomness pretty well but not perfectly. To help them do a better job, you need to “seed” the random number sequence, meaning that you give it a unique starting point so that your sequence of random numbers is different from other people’s.
You seed the random numbers only once. To do this:
1. Press [CLEAR].
2. Enter a number of your own choosing as the seed, then press [STO→], which shows on your screen as →.
3. Press [MATH] [◄] [1] to paste rand to the screen. Press [ENTER].

Again, you need to seed random numbers only once on your calculator.
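The same idea can be sketched in software. This is Python, not the calculator, and the seed values and the helper name `first_draws` are my own choices for illustration. It shows why the seed matters: the same seed reproduces the same “random” sequence, and a different seed gives a different one.

```python
import random

def first_draws(seed, how_many=5):
    """Seed a fresh pseudorandom generator and return its first few numbers."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(how_many)]

same_a = first_draws(20147)
same_b = first_draws(20147)   # same seed as above
other = first_draws(4413)     # a different seed

print(same_a == same_b)   # True: pseudorandom numbers are deterministic
print(same_a == other)    # False: a different starting point, a different sequence
```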
For this you need to know the size of the population, which is the number of individuals in your frame. You will generate a series of random numbers between 1 and the population size, as follows:
1. Press [MATH] [◄] [5] to paste randInt( to your screen.
2. Press [1] [,], enter the population size, and press [)] [ENTER] to generate the first random number. In my case the population size was 20,147 and my first random number was 4413, so the first member of my sample will be the 4413th individual, in order, from the sampling frame.
3. Press [ENTER] to generate the next random number. (The randInt function may or may not be displayed again, depending on your calculator model and settings.) In my case, the next random number is 4949, so the 4949th individual in my frame becomes the second member of my sample.
4. Keep pressing [ENTER] until you have your desired sample size. If you get a duplicate random number, simply ignore it and take the next one. (If your calculator has [8] randIntNoRep, use it instead of plain randInt to prevent duplicates from appearing in the first place.)

Definition: A systematic sample with k = some number is one where you take every kth individual from a representative subset of the population to make up your sample.
Example 5: Standing outside the grocery store all day, you survey every 40th person. That is a systematic sample with k=40.
If properly taken, a systematic sample can be treated like a random sample. Then why do I call it almost as good? Because you have to make one big assumption: that the variable you’re surveying is independent of the order in which individuals appear. In the grocery-store example, you have to assume that shoppers in the time period when you take your survey are representative of all shoppers. That may or may not be true. For example, a high proportion of Wegmans shoppers at lunch time are buying prepared foods to eat there or take back to work. At other times, the mix of groceries purchased is likely to be different.
If you’re pretty unsure of N, you may need to observe that spot without taking the survey, just to get a preliminary count.
If your estimate of N is uncertain, you’ll want to reduce k a bit. This will increase your sample size, but a sample that’s too large (within reason) is better than one that’s too small.
To choose your starting point, press [MATH] [◄] [5] to paste randInt(, then [1] [,]. Enter the value of k and press [)] [ENTER].

Caution: It’s 1 to k, not 1 to N. If you need to survey every 12th person, then you use randInt(1,12). For determining where to start in the first 12 people, randInt(1,95) and randInt(1,1200) are both wrong.
For example, with k=12, suppose the calculator returns 2. Then I will start with the 2nd person and take every 12th person after that: 2, 14, 26, 38, 50, and so on.
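The two selection procedures in this section, the simple random sample and the systematic sample, can be sketched in Python as follows. This is only an illustration under my own assumptions: the function names, the seed, and the expected count of 1200 passers-by are hypothetical. `rng.sample` plays the role of randIntNoRep (no duplicates), and `rng.randint(1, k)` plays the role of randInt(1,k).

```python
import random

def simple_random_sample(population_size, sample_size, rng):
    """Pick positions (counting from 1) in the frame, with no duplicates."""
    return rng.sample(range(1, population_size + 1), sample_size)

def systematic_sample(expected_count, k, rng):
    """Pick a random start from 1 to k (not 1 to N), then take every kth individual."""
    start = rng.randint(1, k)   # randint includes both endpoints
    return list(range(start, expected_count + 1, k))

rng = random.Random(83)         # arbitrary seed, only for repeatability

# Frame of 20,147 individuals, as in the calculator walkthrough; sample size 10 is made up.
srs = simple_random_sample(20147, 10, rng)
print(sorted(srs))

# Every 12th person out of a hypothetical 1200 passers-by.
every_12th = systematic_sample(1200, 12, rng)
print(every_12th[:5], "...", len(every_12th), "people")

# If the random start happens to be 2, the first picks are:
print(list(range(2, 51, 12)))   # [2, 14, 26, 38, 50]
```

Note that the systematic sample needs only one random number, the starting point. Everything after that follows from k.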
Sometimes a true random sample is possible but unreasonably difficult. For example, you could use census records to take a random sample of 1000 adults in the US, but that would mean doing a lot of travel. So instead you take a cluster sample.
“In single-stage cluster sampling, all members of the selected clusters are interviewed. In multi-stage cluster sampling, further subdivisions take place.” (Upton and Cook 2008, 76 [see “Sources Used” at end of book])
Example 6: You want to have 600 representative Americans try your new neck pillow to gauge your potential market. Travel to 600 separate locations across the country would be ridiculously expensive, so you randomly select 30 census tracts and then randomly select 20 individuals within each selected census tract.
A cluster sample makes one big assumption: that the individuals in each cluster are representative of the whole population. You can get away with a slightly weaker assumption, that the individuals in all the selected clusters are representative of the whole population. But it’s still an assumption. For this and other technical reasons, a cluster sample cannot be analyzed in all the same ways as a random sample or systematic sample. Analysis of cluster samples is outside the scope of this course.
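Here is a rough Python sketch of the two-stage selection in Example 6: randomly pick 30 census tracts, then randomly pick 20 individuals within each. The frame below is entirely made up (500 tracts of 1000 residents each, with invented ID strings), and the function name is hypothetical. It shows only how the sample is selected, not how a cluster sample should be analyzed.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, rng):
    """Stage 1: choose clusters at random. Stage 2: choose individuals within each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}

# Hypothetical frame: 500 made-up tracts, each listing 1000 made-up resident IDs.
tracts = {
    f"tract-{t:03d}": [f"{t:03d}-{p:04d}" for p in range(1000)]
    for t in range(500)
}

rng = random.Random(600)   # arbitrary seed
sample = two_stage_cluster_sample(tracts, 30, 20, rng)
total = sum(len(people) for people in sample.values())
print(len(sample), total)  # 30 600
```

You travel to 30 places instead of 600, which is the whole point of the design.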
Sometimes you can identify subgroups of your population and you expect individuals within a subgroup to be more alike than individuals of different subgroups. In such a case, you want to take a stratified sample.
Definition: If you can identify subgroups, called strata (singular: stratum), that have something in common relative to the trait that you’re studying, you want to ensure that your sample has the same mix of those groups as the population. Such a sample is called a stratified sample.
Example 7: You’re studying people’s attitudes toward a proposed change in the immigration laws for a Presidential candidate. You believe that some races are more likely to favor loosening the law and others are more likely to oppose it. If the population is 66% non-Hispanic white, 14% Hispanic, 12% black, 4% Asian, and so on, your sample should have that same composition.
A stratified sample is really a set of mini-samples grouped together.
Example 8: You want to survey attitudes towards sports at a college that is 45% male and 55% female, and you want 400 in your sample. You would take a sample of 45%×400 = 180 male students and 55%×400 = 220 female students to make up your sample of 400. Each mini-sample would be taken by a valid technique like a random sample or systematic sample.
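The arithmetic in Example 8 can be sketched in Python like this. The rosters are invented (4500 male and 5500 female student IDs), the function names are hypothetical, and I am assuming each mini-sample is a simple random sample, which is one of the valid techniques mentioned above.

```python
import random

def proportional_allocation(percents, total):
    """Size of each mini-sample, in proportion to the stratum's share of the population.

    Caution: with awkward percentages the rounded sizes may not add up to exactly `total`.
    """
    return {stratum: round(total * pct / 100) for stratum, pct in percents.items()}

def stratified_sample(frames, percents, total, rng):
    """Take a simple random mini-sample of the allocated size from each stratum."""
    sizes = proportional_allocation(percents, total)
    return {stratum: rng.sample(frames[stratum], sizes[stratum]) for stratum in frames}

percents = {"male": 45, "female": 55}
sizes = proportional_allocation(percents, 400)
print(sizes)   # {'male': 180, 'female': 220}

# Hypothetical rosters for a college of 10,000 students.
frames = {
    "male": [f"M{i:04d}" for i in range(4500)],
    "female": [f"F{i:04d}" for i in range(5500)],
}
rng = random.Random(400)   # arbitrary seed
sample = stratified_sample(frames, percents, 400, rng)
print({stratum: len(members) for stratum, members in sample.items()})
```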
Definition: A census is a sample that contains every member of the population.
In many situations, it’s impossible or highly inconvenient to take a census. But with the near-universal computerization of records, a census is practical in many situations where it never used to be.
Example 9: At the push of a button, a librarian can get the exact average number of times that all library books have been checked out, with no need for sampling and estimation. An apartment manager can tell the exact average number of complaints per tenant. And so forth.
A census is the only sample that perfectly represents the population, because it is the whole population. If you can take a census, you’ve reduced a problem of inferential statistics to one of descriptive statistics. But even today, only a minority of situations are subject to a census. For instance, there’s no way to test a drug on every person with the condition that the drug is meant to treat. It’s totally impractical to interview every potential voter and determine his or her preferences. And so forth.
Any sample where people select the individual members is a bogus sample. That means every sample where people select themselves, and every sample where the interviewer decides whether to include or exclude individual members.
Why is that bad? Remember, a proper sample is a smaller group that is representative of the population. No sample will represent the population perfectly, but you do the best you possibly can.
The good samples listed above can go bad if you make various kinds of mistakes (see “Statistical Errors”, later in this chapter), but a sample that doesn’t depend on the workings of chance is always wrong and cannot be made right. The textbooks will give you names for the types of bad samples (convenience sample, opportunity sample, snowball sample, and so on), but why learn the names when they’re all bogus anyway?
| Good Samples | Bad Samples |
|---|---|
| Chosen through probability methods | Chosen by individual decisions about which persons or things to include |
| Represent the population as well as possible | Do not accurately represent the population |
| Uncertainty can be estimated, and can be reduced by increasing sample size | Uncertainty cannot be estimated, and bigger samples don’t help |
So goodbye to Internet polls and petitions, letter-writing campaigns, “the first 500 volunteers”, and every other form of self-selected sample. If people select themselves for a sample, then by definition they are not representative because they feel more strongly than the people who didn’t select themselves. You can make statements about the people who selected themselves, but that tells you nothing about the much larger number who didn’t select themselves. (More about this in Simon 2001 [see “Sources Used” at end of book], Web Polls.)
Goodbye also to any kind of poll where the pollster selects the individual people. If you set up a rule that depends on the workings of chance and then follow it, that’s okay. But if you decide on the spur of the moment who gets included, that’s bogus.
Why is it bad to just approach people as you see them? Because studies show that you are more likely to approach people that you perceive to be like you, even if you’re not aware of that. Ask yourself if you are truly equally likely to select someone of a different race or sex from yourself, someone who is dressed much richer or poorer than you, someone who seems much more or much less attractive, and so forth. Unless you’re Gandhi, the honest answer is “not equally likely”. It doesn’t make you a bad person, just a bad pollster like everyone else. If you tend to pick people who are more like you, your sample is not representative of the population.
The same principle applies to studies of nonhumans. Here the investigator’s intrinsic biases may be less clear, but unless you choose your sample based on chance you can never be sure that those biases didn’t “skew it up”.
This will be an important topic throughout the course, because different variable types are presented differently in descriptive statistics, and again are analyzed differently in inferential statistics. So before you do anything, you need to think what type of data you’re dealing with.
You can think of the variable as kind of like a question, and the data points as the answers to that question.
If you record one piece of information from each member of the sample, you have univariate data; if you record two pieces of information from each member, you have bivariate data.
Example 10: You record the birth weights of babies born in a certain hospital during the year. The variable is “birth weight”.
Example 11: In April, you ask all the members of your sample whether they had the flu vaccine that year and how many days of work or school they lost because of colds or flu. (Can you see at least two problems with that second question? If not, you will after you read about Nonsampling Errors, later in this chapter.) This is bivariate data. One variable is “flu shot?” and the data points are all yes or no; the other variable is “days lost to colds and flu” and the data points are whole numbers.
Numeric data are subdivided into discrete and continuous data. Discrete data are whole numbers and typically answer the question “how many?” Continuous data can take on any value (or any value within a certain range) and typically answer the question “how much?”
Qualitative data are data that are not numbers. Qualitative data are also called non-numeric data, attribute data or categorical data.
Sometimes we talk about data types, and sometimes about variable types. They’re the same thing. For instance, “weight of a machine part” is a continuous variable, and 61.1 g, 61.4 g, 60.4 g, 61.0 g, and 60.7 g are continuous data.
| Quantitative (numeric) | Qualitative (categorical or non-numeric) |
|---|---|
| You get a number from each member of the sample. | You get a yes/no or a category from each member of the sample. |
| The data have units (inches, pounds, dollars, IQ points, whatever) and can be sorted from low to high. | The data may or may not have units and do not have a definite sort order. |
| It makes sense to average the data. | Your summary is counts or percentages in each category. |
| Examples (discrete): number of children in a family, number of cigarettes smoked per day, age at last birthday. Examples (continuous): height, salary, exact age | Examples: hair color, marital status, gender, country of birth, and opinion for or against a particular issue |
Continuous or discrete data? Sometimes when you have numeric data it’s hard to say whether you have discrete or continuous data. But since you’ll graph them differently, it’s important to be clear on the distinction. Here are two examples of doubtful cases: salary and age.
It’s true that your salary can be only a whole number of pennies. But there are a great many possible values, and the distance between the possible values is quite small, so you call salary a continuous variable. Besides, you don’t ask “how many pennies do you make?” but rather “how much do you make?”
What about age? Well, age at last birthday is clearly discrete since it can be only a whole number: “how many years old were you at your last birthday?” But age now, including years and months and days and fractions of days, would be continuous, again because you can subdivide it as finely as desired.
When you see a summary statement, you have to do a little mental detective work to figure out the data type. Always ask yourself, what was the original measurement taken or question asked?
Example 12: “The average salary at our corporation is $22,471.” The original measurement was the salary of each individual, so this is continuous data.
Example 13: “The average American family has 1.7 children.” Don’t let “1.7” fool you into identifying this as a continuous variable! What was the original question or measurement? “How many children are there in your family?” That’s discrete data.
Example 14: “Four out of five dentists surveyed recommend Trident sugarless gum for their patients who chew gum.” Yes, there are numbers in the summary statement, but the original question asked of each dentist was “Do you recommend Trident?” That is a yes/no question, so the data type is categorical.
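A tiny Python sketch may help with the detective work in Examples 13 and 14. The ten family sizes and five dentist answers below are invented so that the summaries come out to 1.7 and four out of five; they are not real survey data.

```python
from collections import Counter
from statistics import mean

# Hypothetical answers to "How many children are there in your family?"
children = [0, 1, 1, 2, 2, 2, 3, 1, 2, 3]
print(mean(children))                            # 1.7
print(all(isinstance(c, int) for c in children)) # True: every data point is a whole number

# Hypothetical answers to "Do you recommend Trident?"
dentists = ["yes", "yes", "no", "yes", "yes"]
tally = Counter(dentists)
print(tally["yes"], "out of", len(dentists))     # 4 out of 5
```

The summary of discrete data can be a decimal, and the summary of categorical data can be a count or fraction, but the data type is set by the individual answers, not by the summary.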
Summary: In statistics, an error is not necessarily a mistake. This section explores the types of statistical errors and where they come from.
Definition: An error is a discrepancy between your findings and reality. Some errors arise from mistakes, but some are an inevitable part of the sampling process.
Even if you make no mistakes, inevitably samples will vary from each other and any given sample is almost sure to vary from the population. This variability is called sampling error. (It would probably be more helpful to call it sample variability, but we’re stuck with “sampling error”.)
Sampling error “refers to the difference between an estimate for a population based on data from a sample and the ‘true’ value for that population which would result if a census were taken.” (Australian Bureau of Statistics 2013) [see “Sources Used” at end of book]
Except for a census, no sample is a perfect representation of the population. So the sample mean (average), for example, will usually be a bit different from the population mean. Sampling errors are unavoidable, even if you do everything right when you take a random sample. They’re not mistakes, they’re just part of the nature of things.
Although sampling error cannot be eliminated, the size of the error can be estimated, and it can be reduced. For a given population, a larger sample size gives a smaller sampling error. You'll learn more about that when you study sampling distributions.
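To see sampling error shrink as sample size grows, here is a small simulation sketch in Python. Everything in it is an assumption for illustration: a made-up population of 50,000 values, sample sizes of 25 and 400, and 500 repeated samples of each size. It measures how much the sample means vary from one sample to the next.

```python
import random
from statistics import pstdev

rng = random.Random(2016)   # arbitrary seed

# Hypothetical population: 50,000 values centered near 100.
population = [rng.gauss(100, 15) for _ in range(50_000)]

def spread_of_sample_means(sample_size, repetitions=500):
    """Take many random samples of one size; return the std. deviation of their means."""
    means = []
    for _ in range(repetitions):
        sample = rng.sample(population, sample_size)
        means.append(sum(sample) / sample_size)
    return pstdev(means)

small = spread_of_sample_means(25)
large = spread_of_sample_means(400)
print(f"n = 25:  sample means vary by about {small:.2f}")
print(f"n = 400: sample means vary by about {large:.2f}")
```

No mistakes were made in either set of samples. The means still vary, and they vary less with the larger sample size.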
Definition: Nonsampling errors are discrepancies between your sample and reality that are caused by mistakes in planning, in collecting data, or in analyzing data.
Nonsampling errors make your sample unrepresentative of the population and your results questionable if not useless. Unlike sampling errors, nonsampling errors cannot be reduced by taking larger samples, and you can’t even estimate the size of most nonsampling errors. Instead, the mistakes must be corrected, and probably a new sample must be taken.
There are many types of nonsampling errors. Different authors give them different names, but it’s much more important for you to recognize the bad practice than to worry about what to name it. In taking your own samples, and in evaluating what other people are telling you about their samples, always ask yourself: What could go wrong here? Has anything been done that can make this sample unrepresentative of the population? Here are some of the more common types of nonsampling errors. After you read through them, see how many others you can think of.
This is almost always bogus. People who select themselves are by definition different from people who don’t, which means they are not representative. It can be very hard to know whether that difference matters in the context of a particular study. Since you can never be sure, it is safest to avoid the problem and not let people select themselves.
But medical studies all use volunteers. (They have to, ethically.) Why doesn’t that make the sample bogus? They’re volunteers, but usually they’re not self-selected volunteers. For example, researchers may ask doctors and hospitals to suggest patients who meet a particular profile; they use probability techniques to select a sample from that pool.
But things are not always simple. For example, some companies or researchers may advertise and pay volunteers to undergo testing. In this case you have to ask very serious questions about whether the volunteers are representative of the general population. Statistical thinking isn’t a matter of black and white, but some pretty sophisticated judgment can be involved. Your takeaway is: don’t accept anything at face value, but always ask: What important facts are being left out? What does that do to the credibility of the results?
Definition: Sampling bias results from taking a sample in a way that tends to over- or underrepresent some subgroup of the population.
Example 15: If you’re doing a survey on student attitudes toward the cafeteria, and you conduct the survey in the cafeteria, you are systematically underrepresenting students who don’t use the cafeteria. It seems logical that attitudes are more negative among students who don’t use the cafeteria than among students generally, so by excluding them you will report overall attitude as more favorable than it really is.
“Bias” is a good example of the words in statistics that don’t have their ordinary English meaning. You’re not prejudiced against students who dislike the cafeteria. “Bias” in statistics just means that something tends to distort your results in a particular direction.
Example 16: The classic example of sampling bias is the Literary Digest fiasco in predicting that Landon would beat Roosevelt in the 1936 election. The magazine sent questionnaires to all its subscribers, it phoned randomly selected people in telephone books, and it left stacks of questionnaires at car dealerships with instructions to give one to every person who test drove a car. The sample size was in the millions.
This procedure systematically overrepresented people who were well off and systematically underrepresented poorer people. In 1936 the Great Depression still held sway, and most people did not have the disposable income to subscribe to a fancy magazine, let alone a home telephone; the very thought of buying a car would have struck them as ridiculous or insulting. In that era, the Republicans appealed more to the rich and the Democrats more to the working class. So the net effect of the Literary Digest’s procedure was that it made the country look a lot more Republican than it actually was. Since Landon was a Republican and FDR a Democrat, FDR’s actual support was much greater than shown by the poll, and Landon’s was much less.
Notice that a sample size of millions did not overcome sampling bias. A larger sample size is not an answer to nonsampling errors.
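To see why size does not cure bias, here is a minimal simulation sketch in Python. Every number in it is invented for illustration and none comes from the 1936 poll: I assume a population where 60% back candidate A, and a sampling method that reaches A's supporters only half as often as it reaches everyone else. The function name biased_poll and its parameters are hypothetical.

```python
import random

def biased_poll(sample_size, true_support=0.60, reach_ratio=0.5, seed=1):
    """Return the share of respondents who back candidate A.

    Hypothetical setup: `true_support` of the population backs A, but the
    sampling method reaches A's supporters only `reach_ratio` as often as
    it reaches everyone else.
    """
    rng = random.Random(seed)
    backers = 0
    collected = 0
    while collected < sample_size:
        backs_a = rng.random() < true_support
        if backs_a and rng.random() >= reach_ratio:
            continue  # this supporter is never reached by the poll
        collected += 1
        backers += backs_a
    return backers / sample_size

small = biased_poll(1_000)
large = biased_poll(200_000)
print("true support for A: 60.0%")
print(f"poll of   1,000: {small:.1%}")
print(f"poll of 200,000: {large:.1%}")
```

Under these assumptions the poll homes in on 0.30/0.70, or about 43%, not on the true 60%. The bigger sample only gives you a more precise estimate of the wrong number.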
The Digest’s original article can be found in Landon in a Landslide: The Poll That Changed Polling (American Social History Project) [see “Sources Used” at end of book].
While we’re on the subject of presidential elections, different nonsampling errors also led to wrong predictions of a Dewey victory over Truman in 1948. For analyses of both the 1936 and the 1948 statistical mistakes, see Classic Polling Surprises (2004) [see “Sources Used” at end of book] and Introduction to Polling (n.d.) [see “Sources Used” at end of book].
Beyond sampling bias, there are many other bad practices in selecting your sample that can bias the results. Wikipedia’s Selection Bias [see “Sources Used” at end of book] has a good rundown of quite a few.
If you’re taking a mail survey, a significant number of people (probably a majority) won’t respond. Are the responders representative of the nonresponders, or has a bias been introduced by the nonresponse? That’s a tough question, and the answer may not always be clear.
For this reason, mail surveys are often coded so that the investigators can tell who did respond, and follow up with those who didn’t. That followup can be more mail, a phone call, or a visit.
Even with in-person polls, nonresponse is a problem: many people will simply refuse to participate in your survey. Depending on what you’re surveying, that could be unimportant or it could be a fatal flaw.
Definition: Response errors occur when respondents give answers that don’t match objective truth or their own actual opinions.
Poorly worded survey questions are a major source of response errors, and lead to biased results or completely meaningless results. There may not be a perfect survey question, but having several people review the questions against a list of possible problems will greatly reduce the level of response errors.
But response errors can never be completely eliminated. For instance, people tend to shade their answers to make themselves look good in their own eyes or in the interviewer’s eyes. Most people rate themselves as better-than-average drivers, for example, which obviously can’t be true. And self-reporting of objective facts is always suspect because memory is unreliable.
Example 17:
These include mistakes by interviewers in recording respondents’ answers, mistakes by investigators in measuring and recording data, and mistakes in entering the recorded data.
In the second half of the course you’ll learn a number of inferential statistics procedures. Each one is appropriate in some circumstances and inappropriate in others. If you use the wrong form of analysis in a given situation, or you apply it wrongly, your results will be about as good as the results from using a hammer to drive a screw.
Summary: There are two main methods of gathering data, the observational study and the experiment. Learn the differences, and what each one can tell you.
Many, many statistical investigations try to find out whether A causes B. To do this, you have two groups, one with A and one without A, or you have multiple groups with different levels of A. You then ask whether the difference in B among the groups is significant. The two main ways to investigate a possible connection are the observational study and the experiment.
The concepts aren’t hard, but there’s a boatload of vocabulary. Let’s get through the definitions first, and then have some concrete examples to show how the terms are used. Please read the definitions first, then read the first example and refer back to the definitions; repeat for the other examples.
Definition: In an observational study, the investigator simply records what happens (a prospective study) or what has already happened (a retrospective study). In an experiment, the investigator takes a more active role, randomly assigning members of the sample to different groups that get treated differently.
Which is better? Well, in an observational study, you always have the problem of wondering whether the groups are different in some way other than what you are studying. This means that an observational study can never establish cause. The best you can do after an observational study is to say that you found an association between two variables.
How do we establish cause, when for ethical or practical reasons we can’t do an experiment? The nine criteria are listed in Causation (Simon 2000b [see “Sources Used” at end of book]) and were first laid down by Sir Austin Bradford Hill in 1965.
In an observational study or an experiment, there are two or more variables. You want to show that changes in one or more of them, called the explanatory variables, go with changes in one or more response variables.
Explanatory variables are the suspected causes, and response variables are the suspected effects or results.
Example 18: Over the course of a year, you have parents record the number of minutes they spend every day reading to their child, and at the end of the year you record each child’s performance on standard tests. The explanatory variable is parental time spent reading to the child, and the response variable(s) are performance on the standardized test(s).
Definitions: In an experiment, the experimenter manipulates the suspected cause(s), called explanatory variable(s) or factor(s). A specific level of each factor is administered to each group. The level(s) of the explanatory variable(s) in a given group are known as its treatment.
Example 19: To test productivity of factory workers, you randomly assign them to three groups. One group gets an extra hour at lunch, one group gets half-hour breaks in morning and afternoon, and one group gets six 10-minute breaks spaced throughout the day. The explanatory variable or factor is structuring of break time, and the three levels or treatments are as described.
Definitions: In an experiment, each member of a sample is called a unit or an experimental unit. However, when the experiment is performed on people they are called subjects or participants.
In any study or experiment, results will vary for individuals within each group, and results will also vary between the groups as a whole. Some of that variation is due to chance: it is expected statistical variability or sampling error. If the differences between groups are bigger than the variation within groups — and enough bigger, according to some calculations you’ll learn later — then the investigator has a significant result. A significant result is a difference that is too big to be merely the result of normal statistical variability.
I’ll have a lot more to say about significance when you study Hypothesis Tests.
In Example 18, about reading to children, you find generally that the more time parents spend reading to first graders, the better the children tend to do on standard tests of reading level.
Is the reading time responsible for the improved test scores? You can easily think of other possible explanations. Parents who spend more time reading to their children probably spend more time with them in general. They tend to be better off financially — if you’re working two jobs to make ends meet, you probably have little time available for reading to your children. Economic status and time spent with children in general are examples of lurking variables in this study.
Definition: A hidden variable that isn’t measured and isn’t part of your design but affects the outcome is called a lurking variable.
Example 20: In a large elementary school, you schedule half the second grade to do art for an hour, two mornings a week, with the district’s art teacher. The other half does art for an hour, two afternoons a week, with the same teacher, but they are told at the beginning that all their projects will be displayed and prizes given for the best ones.
Can you learn anything from this about whether the chance to win prizes prompts children to do a better job on art projects? The problem is that there’s not just one difference in treatment here, the promised prizes. There’s also the fact that everyone’s project will be on display. And maybe mornings are better (or worse) for doing art than afternoons. Maybe the teacher is a morning person and fades in the afternoon, or is not a morning person and really shines in the afternoon. Even if there’s a difference in quality of the projects, you can’t make any kind of simple statement about the cause, because of these confounding variables.
A confounding variable is “associated in a noncausal way with a factor and affects the response.” (DeVeaux, Velleman, Bock 2009, 346 [see “Sources Used” at end of book])
“Confounding occurs in an experiment when you [can’t] distinguish the effects of different factors.” (Triola 2011, 32 [see “Sources Used” at end of book])
In the art example, you wanted to find out whether promising prizes makes children do better art work. But the promise of prizes wasn’t the only difference between the two groups. Time of day and public display are confounding variables built into the design of this experiment. You know what they are, but you can’t untangle their effect from the effect of what you actually wanted to study.
What’s the difference between lurking variables and confounding variables? Both confuse the issue of whether A causes B.
A lurking variable L is associated with or causes both A and B, so any relationship you see between A and B is just a side effect of the L/A and L/B relationships. For example, counties with more library books tend to have more murders per year. Does reading make people homicidal? Of course not! The lurking variable is population size. High-density urban counties have more books in the library and more murders than low-density rural counties.
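Here is a small simulation sketch of the library-books example, in Python. The county data are generated at random, not real: I assume that books and murders each scale with population (with some noise) and have no connection to each other. The rates 2.0 and 5e-5 and the helper pearson_r are my own illustrative choices.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(42)

# 500 made-up counties; population is the lurking variable.
populations = [rng.uniform(10_000, 1_000_000) for _ in range(500)]

# Books and murders each depend on population, never on each other.
books = [p * 2.0 * rng.uniform(0.5, 1.5) for p in populations]
murders = [p * 5e-5 * rng.uniform(0.5, 1.5) for p in populations]

r_raw = pearson_r(books, murders)

# Divide out the lurking variable and the association should vanish.
books_per_person = [b / p for b, p in zip(books, populations)]
murders_per_person = [m / p for m, p in zip(murders, populations)]
r_per_person = pearson_r(books_per_person, murders_per_person)

print(f"raw counts:  r = {r_raw:.2f}")
print(f"per person:  r = {r_per_person:.2f}")
```

The raw counts show a strong positive correlation even though neither variable influences the other; once you express both per person, the correlation should be close to zero.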
A confounding variable C is typically associated with A but doesn’t cause it, so when you look at B you don’t know whether any effect comes from A, from C, or from both. For example, after a year with a lot of motorcycle deaths, a state passes a strict helmet law, and the next year there are significantly fewer deaths. Was the helmet law responsible? Maybe, but time is a confounding variable here. Were motorcyclists shocked at the high death toll, so that they started driving more carefully or switched to other modes of transit?
Don’t obsess over the difference between lurking and confounding variables. Some authors don’t even make a distinction. You should recognize variables that make results questionable; that’s a lot more important than what you call them.
That said, if you want to see two more takes on the difference, have a look at Confounding and Lurking Variables (Virmani 2012 [see “Sources Used” at end of book]) and Confounding Variables (Velleman 2005 [see “Sources Used” at end of book]).
Lurking and confounding variables are the boogeyman of any statistical work. Lurking variables are the reason that an observational study can show only association, not causation. In experiments, you have the potential to exclude lurking variables, or at least to minimize them, but it takes planning and extra work, and you need to be careful not to create a design with built-in confounding.
Whenever any experiment claims that A causes B, ask yourself what lurking variables there might be, and whether the design of the study has ruled them out. You can’t take this for granted, because even professional researchers sometimes cut corners, knowingly or unknowingly.
Example 21: Does smoking cause lung cancer? Initial studies in the mid-20th century had three or four groups: nonsmokers, light smokers, moderate smokers, and heavy smokers. They looked at the number and severity of lung tumors in the groups to see whether there was a significant difference, and in fact they found one.
This was an observational study. Ethically it had to be: if you suspect smoking is harmful you can’t assign people to smoke.
Explanatory variable: smoking level (none, light, moderate, heavy). Levels or treatments don’t apply to an observational study.
Response variable: tumor production
Because this was an observational study, there was no control for lurking variables, and even with a significant result you can’t say from this study that smoking causes lung cancer. What lurking variables could there be? Well, maybe some genetic factor both makes some people more likely to smoke and makes them more susceptible to lung cancer. This is a problem with every observational study that finds an effect: you can’t rule out lurking variables, and therefore you can’t infer causation, no matter how strong an association you find.
Since you can’t do an experiment on humans that involves possibly harming them, how do you know that smoking causes lung cancer? A good explanation is in Causation (Simon 2000b [see “Sources Used” at end of book]).
Example 22: Does aspirin make heart attacks less likely? Here you can do an experiment, because aspirin is generally recognized as safe. Investigators randomly assigned people to two groups, gave aspirin to one group but not the other, and then monitored the proportion who had heart attacks. They found a significantly lower risk of heart attack in the aspirin group.
This was a designed experiment.
Explanatory variable: aspirin. There were two levels or treatments: yes and no.
Response variable: heart attack (yes/no)
From this experiment, you can say that aspirin reduces the risk of heart attack. How can you be sure there were no lurking variables? By randomly assigning people to the two groups, investigators made each group representative of the whole population. For example, overweight is a risk factor for heart attacks. The random assignment ensures that overweight people form about the same proportion in each group as in the population. And the same is true for any other potential lurking variable. (It helps to have larger samples, and in this study each sample was about 10,000 people.)
Example 23: Does prayer help surgical patients? Here again, no one thinks prayer is harmful, so ethically the experimenters were in the clear to assign cardiac-bypass patients randomly to three groups:
people who knew they were prayed for,
people who were prayed for and didn’t know it, and
people who were not prayed for.
Investigators found no significant difference in frequency of complications between the patients who were prayed for and those who were not prayed for.
This was a designed experiment.
Explanatory variables: receipt of prayer (two levels, yes and no) and knowledge of being prayed for (also two levels, yes and no). There were three treatments: (a) receipt=yes and knowledge=yes, (b) receipt=yes and knowledge=no, (c) receipt=no.
Response variable: occurrence of postsurgical complications (yes/no).
(You can read an abstract of the experiment and its results in Study of the Therapeutic Effects of Intercessory Prayer [Benson 2006 [see “Sources Used” at end of book]]. The full report of the experiment is in Benson 2005 [see “Sources Used” at end of book].)
Because lurking variables can’t be ruled out in an observational study, investigators always prefer an experiment if possible. If ethical or other considerations prevent doing an experiment, an observational study is the only choice. But then the best you can hope for is to show an association between the two variables. Only with an experiment do you have a hope of showing causation.
Okay, so you always have to do an experiment if you want to show that A causes B. Let’s look in more detail at how experiments are conducted, and learn best practices for an experiment.
Caution: Design of Experiments is a specialized field in statistics, and you could take a whole course on just that. This chapter can only give you enough to make you dangerous.☺ While you’re planning your first experiment in real life, it’s a good idea to get help from someone senior or a professional statistician.
R. A. Fisher “virtually invented the subject of experimental design” (Upton and Cook 2008, 144 [see “Sources Used” at end of book]), and pioneered many of the techniques that we use today. He was a great champion of planning: Upton & Cook quote him as saying “To call in the statistician after the experiment is done may be no more than asking him to perform a postmortem examination: he may be able to say what the experiment died of.”
Definitions: Experimenters randomly assign members to the various treatment groups. We say that they have randomized the groups, and this process is called randomization.
Why randomize? Why not just put the first half of the sample in group A and the second half in group B? Because randomization is how you control for lurking variables.
Think about the study with aspirin and heart attacks. You know that different individuals are more or less susceptible to heart attacks. Risk factors include smoking, obesity, lack of exercise, and family history. You want your aspirin group and your nonaspirin group to have the same mix of smokers and nonsmokers as the general population, the same mix of obese and nonobese individuals, and so on. Actually it’s harder than that. There aren’t just “smokers” and “nonsmokers”; people smoke various amounts. There aren’t just “obese” and “fit” people, but people have all levels of fitness.
It would be very laborious to do stratified samples and get the right proportions for a lot of variables. You’d have to have a huge number of strata. And even if you did do those matchups, taking enormous trouble and expense, what about the variables you didn’t think of? You can never be sure that the samples have the same composition as the population.
It really must be random assignments — you can’t just assign test subjects to groups alternately. Steve Simon (2000a) explains why, with examples, in Alternating Treatments [see “Sources Used” at end of book].
Randomization is the indispensable way out. Instead of trying to match everything up yourself — and inevitably failing — you let impersonal random chance do your work for you.
Are you guaranteed that the sample will perfectly represent the population? No, you’re not. Remember sampling error, earlier in this chapter. Samples vary from the population; that’s just the nature of things. But when you randomize, in the long run most of your samples will be representative enough, even though they’re not perfectly representative.
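If you’d like to watch randomization at work, here is a minimal sketch in Python. The sample is hypothetical: 1000 subjects of whom 30% are smokers, split at random into two groups of 500, a thousand times over. The function name and the 5-percentage-point cutoff for “representative enough” are my own assumptions.

```python
import random

def smoker_share_after_randomizing(subjects, rng):
    """Shuffle the subjects, put the first half in the treatment group,
    and return the proportion of smokers that group ended up with."""
    shuffled = subjects[:]
    rng.shuffle(shuffled)
    treatment = shuffled[: len(shuffled) // 2]
    return sum(treatment) / len(treatment)

rng = random.Random(2024)

# Hypothetical sample: True marks a smoker. Nobody looks at this
# when assigning groups; chance alone does the balancing.
subjects = [True] * 300 + [False] * 700

shares = [smoker_share_after_randomizing(subjects, rng) for _ in range(1000)]

# How often is the treatment group within 5 percentage points of 30%?
close = sum(abs(s - 0.30) <= 0.05 for s in shares) / len(shares)
print(f"smallest share of smokers: {min(shares):.1%}")
print(f"largest share of smokers:  {max(shares):.1%}")
print(f"within 5 points of 30%:    {close:.1%}")
```

No single split is guaranteed to be perfect, but nearly all of them should come out close to 30% smokers, and the same balancing happens for variables you never thought to record.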
Notice I said that randomization works in the long run. But in the short run it may not. Suppose you are testing a weight-loss drug on a group of 100 volunteers, 50 men and 50 women. If you completely randomize them, you might end up with 20 men and 30 women in one group, and 30 men and 20 women in the other. (There’s about a 20% chance of this, or a more extreme split.)
Why is this bad? Because you don’t know whether men and women respond differently to the drug. If you see a difference between your 20/30 placebo group and your 30/20 treatment group, you don’t know how much of that is the drug and how much is the difference between men and women. Gender is a confounding variable.
What to do? Create blocks, in this case a block of the 50 men and a block of the 50 women. Then within each block you randomly assign individuals to receive medication or a placebo. Now you can find how the drug affects women and how it affects men. This is called a randomized block design. When you can identify a potentially confounding variable before you perform your experiment, first divide your subjects into blocks according to that variable, and then randomize within each block. Do this, and you have tamed that confounding variable.
R. A. Fisher coined the term “randomized block” in 1926.
In this example, gender would be called a blocking variable because you divide your subjects into blocks according to gender. Now there’s no problem separating the effects of the drug from the effects of gender in your experimental results.
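A randomized block design for the weight-loss example can be sketched in a few lines of Python. This is only an illustration under my own assumptions: subjects are represented as made-up (gender, number) pairs, and the function name randomized_block_assignment is hypothetical.

```python
import random

def randomized_block_assignment(subjects, block_of, treatments, rng):
    """Group subjects into blocks, then randomize within each block so that
    every block is split as evenly as possible across the treatments."""
    blocks = {}
    for subject in subjects:
        blocks.setdefault(block_of(subject), []).append(subject)

    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)                      # chance decides the order
        for position, subject in enumerate(shuffled):
            assignment[subject] = treatments[position % len(treatments)]
    return assignment

rng = random.Random(7)

# Hypothetical volunteers: 50 men and 50 women.
subjects = [("M", n) for n in range(50)] + [("F", n) for n in range(50)]

assignment = randomized_block_assignment(
    subjects,
    block_of=lambda subject: subject[0],           # block on gender
    treatments=["drug", "placebo"],
    rng=rng,
)

men_on_drug = sum(1 for s, t in assignment.items() if s[0] == "M" and t == "drug")
women_on_drug = sum(1 for s, t in assignment.items() if s[0] == "F" and t == "drug")
print(men_on_drug, women_on_drug)
```

Because the shuffling happens inside each block, the drug and placebo groups each get exactly 25 men and 25 women, so a 20/30 split can’t occur.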
When I talked about complete randomization, I said it would be laborious to take strata of a lot of variables, and that complete randomization was the answer. But here I’m suggesting exactly that for men and women in the weight-loss study. Right about now, you might be telling me, “Make up your mind!”
This is where some judgment is needed in making tradeoffs. Men and women typically have different percentages of body fat, and they are known to respond differently to some drugs. It makes sense that a weight-loss drug could have different response from men and women, and therefore you block on gender. But no other factor stands out as both important and measurable. If you tried to block on motivation, for instance, how would you measure it?
“Block what you can, randomize what you cannot” is a good rule, sometimes attributed to George Box. A variable is a candidate for blocking when it seems like it could make a difference, and you can identify and measure it. For other variables, we depend on randomization, either complete randomization or randomization within blocks.
There’s one circumstance where you can be sure that the subgroups are perfectly matched with respect to lurking variables: when you use a matched-pairs design. This is kind of like a randomized block design where each block contains two identical subjects.
Example 24: You want to know whether one form of foreign-language instruction is more effective than another. So you take fifty pairs of identical twins, and assign one twin from each pair to group A and the other twin to group B. Then you know that genetic factors are perfectly balanced between the two groups. And if you restrict yourself to twins raised together, you’ve also controlled for environmental factors.
A special type of matched-pairs design matches each experimental unit to itself.
Example 25: You want to know the effect of caffeine on heart rate. You don’t assemble a sample, give coffee to half of them, and measure the difference in heart rate between the groups. People’s heart rates vary quite a bit, so you would have large variation within each group, and that might swamp the effect you’re looking for.
Instead, you measure each individual’s resting heart rate, then give him or her a cup of coffee to drink, and after a specified time measure the heart rate again. By comparing each individual to himself or herself, you determine what effect caffeine has on each person’s heart rate, and people’s different resting heart rates aren’t an issue.
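The arithmetic behind the caffeine example looks like this in Python. The six pairs of heart rates are invented for illustration; they are not data from any study.

```python
from statistics import mean, stdev

# Hypothetical heart rates in beats per minute, one entry per person.
resting = [58, 64, 71, 77, 83, 90]
after_coffee = [62, 67, 76, 80, 88, 94]

# Matched pairs: compare each person only to himself or herself.
differences = [after - before for before, after in zip(resting, after_coffee)]

mean_change = mean(differences)
spread_of_changes = stdev(differences)
spread_between_people = stdev(resting)

print("changes:", differences)
print("mean change:", mean_change)
print(f"spread of the changes:    {spread_of_changes:.1f}")
print(f"spread of resting rates:  {spread_between_people:.1f}")
```

With these made-up numbers the changes cluster tightly around 4 beats per minute, while the resting rates themselves are spread far more widely. That person-to-person spread is the variation that could swamp the effect in a two-group design.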
See also: Experimental Design in Statistics shows how the same experiment would work out with a completely randomized design, randomized blocks, and matched pairs.
Definition: In an experiment, usually one of the treatments will be no treatment at all. The group that gets no treatment is called the control group.
But “no treatment at all” doesn’t mean just leaving the control group alone. They should be treated the same way as the other groups, except that they get zero of whatever the other groups are getting. If the treatment groups are getting injections, the control group must get injections too. Otherwise you’ve introduced a lurking variable: effects of just getting a needle stick, and in humans the knowledge that they’re not actually getting medicine.
Definition: A placebo is a substance that has no medical activity but that the subjects of the experiment can’t tell from the real thing.
The placebo effect is well known. Sick people tend to get better if they feel like someone is looking after them. So if you gave your treatment group an injection but your control group no injection, you’d be putting them in a different psychological state. Instead, you inject your control group with salt water.
TheProfessorFunk has a fun three-minute YouTube video on the placebo effect (Keogh 2011 [see “Sources Used” at end of book]). Thanks to Benjamin Kirk for drawing this to my attention.
You might think placebos would be unnecessary when experimenting on animals. But if you’ve ever had a pet, you know that some animals are stressed by getting an injection. If the control group didn’t get an injection, you’d have those differing stress levels as a lurking variable. So you administer a placebo.
Example 26: Sometimes, for practical or ethical reasons, you have to get a little bit creative with a control group. Here’s an excellent example from Wheelan (2013, 238) [see “Sources Used” at end of book]:
Suppose a school district requires summer school for struggling students. The district would like to know whether the summer program has any long-term academic value. As usual, a simple comparison between students who attend summer school and those who do not would be worse than useless. The students who attend summer school are there because they are struggling. Even if the summer school program is highly effective, the participating students will probably still do worse in the long run than the students who were not required to take summer school. What we want to know is how the struggling students perform after taking summer school compared with how they would have done if they had not taken summer school. Yes, we could do some kind of controlled experiment in which struggling students are randomly selected to attend summer school or not, but that would involve denying the control group access to a program that we think would be helpful.
Instead, the treatment and control groups are created by comparing those students who just barely fell below the threshold for summer school with those who just barely escaped it. Think about it: the [group of all] students who fail a midterm are appreciably different from [the group of all] students who do not fail a midterm. But students who get a 59 percent (a failing grade) are not appreciably different from those students who get a 60 percent (a passing grade).
Some people do better if they think they’re getting medicine, even if they’re not. To avoid this placebo effect, the standard technique is the double blind.
Definitions: In a double-blind experiment, neither the test subjects nor those who administer the treatments know what treatment each subject is getting. In a single-blind experiment, the test subjects don’t know which treatment they’re getting, but the personnel who administer the treatments do know.
Okay, given that people’s thoughts influence whether they improve, a single blind makes sense. If you let someone know they’re not getting medicine in a trial, they’re less likely to improve. But why isn’t that enough? Why is a double blind necessary?
For one thing, there’s always the risk that a doctor or nurse might tell the subject, accidentally or on purpose. But beyond that, if you’re treating someone who has a terrible disease, you might treat them differently if they’re getting a placebo than if they’re getting real medicine, even if you don’t realize you’re doing it. Why take the risk of introducing another lurking variable? Better to use a double blind and just rule out the possibility.
You might wonder how it’s done in practice. In a drug trial, for instance, each test subject is assigned a code number, and the drug company then packages pills or vaccines with a subject’s code number on each. The doctors and nurses who administer the treatments just match the code number on the pill or vaccine to each subject’s code number. Of course all the pills or vaccines look alike, so the workers who have contact with the subjects don’t know who’s getting medicine and who’s getting a placebo. And what they don’t know, they can’t reveal.
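The bookkeeping for code numbers can be sketched in Python. This is a toy illustration, not how any particular drug company does it: the function blind_assignments, the four-digit codes, and the subject IDs are all hypothetical.

```python
import random

def blind_assignments(subject_ids, rng):
    """Return (labels, sealed_key).

    labels:     subject -> code number; this is all the clinic staff see.
    sealed_key: code number -> "drug" or "placebo"; kept away from the staff.
    """
    subjects = list(subject_ids)
    rng.shuffle(subjects)                                   # random group assignment
    codes = rng.sample(range(1000, 10000), len(subjects))   # unique random codes
    half = len(subjects) // 2

    labels, sealed_key = {}, {}
    for position, (subject, code) in enumerate(zip(subjects, codes)):
        labels[subject] = code
        sealed_key[code] = "drug" if position < half else "placebo"
    return labels, sealed_key

rng = random.Random(11)
labels, sealed_key = blind_assignments(["S%02d" % n for n in range(1, 11)], rng)

# What the doctors and nurses work from: subjects and code numbers only.
for subject in sorted(labels):
    print(subject, labels[subject])
```

The codes are drawn at random, so nothing about a code number hints at which treatment it stands for. Only someone holding sealed_key can tell, and that person has no contact with the subjects.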
You’ll be dealing with numbers through most of this course. Handle them right, and you won’t get burned! There are three issues here: how many digits to round to, how to round to that number of digits, and when to do your rounding.
There are a lot of rules for how many digits you should round to, but we’re not going to be that rigorous in this course. Instead, you’ll use common sense supplemented by a few rules of thumb. What’s common sense? Avoid false precision, and avoid overly rough numbers.
The rules are important, but we have only so much time, and you’ve probably learned them in your science courses. If you want to, look up “significant figures” or “significant digits” in the index of pretty much any science textbook, or look at Significant Figures/Digits [see “Sources Used” at end of book].
Example 27: When you fill your car’s gas tank, the pump shows the number of gallons to three decimal places. You can also describe that as the nearest thousandth of a gallon. How much gas is that? Convert it to teaspoons (Brown 2009 [see “Sources Used” at end of book]): (0.001 gal) × (4 qt/gal) × (4 cup/qt) × (16 Tbsp/cup) × (3 tsp/Tbsp) ≈ 0.8 tsp. You can bet there’s several times that much in the hose when the pump shuts off. Three decimal places at the gas pump is false precision a/k/a spurious accuracy. That third decimal place is just noise, statistical fluctuations without real significance.
On the other hand, suppose the pump showed only whole gallons. This is too rough. You can go along pumping gas for no extra charge (bad for the merchant), and then abruptly the cost jumps by several dollars (bad for you).
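If you want to check the teaspoon conversion from Example 27, here is the same chain of unit factors as a short Python sketch. The constant and function names are mine; the conversion factors are the US customary ones used in the example.

```python
QUARTS_PER_GALLON = 4
CUPS_PER_QUART = 4
TBSP_PER_CUP = 16
TSP_PER_TBSP = 3

def gallons_to_teaspoons(gallons):
    """Multiply through the unit factors: gal -> qt -> cup -> Tbsp -> tsp."""
    return gallons * QUARTS_PER_GALLON * CUPS_PER_QUART * TBSP_PER_CUP * TSP_PER_TBSP

tsp = gallons_to_teaspoons(0.001)
print(round(tsp, 1))   # the pump's third decimal place, in teaspoons
```

One gallon works out to 768 teaspoons, so a thousandth of a gallon is 0.768 teaspoon, or about 0.8 tsp.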
Here are some rules of thumb to supplement your common sense. These are not matters of right and wrong, but conventions to save thinking time:
Round in one step. Say you have a number 1.24789, and you want to round it to one decimal place. Draw a line (mentally or with your pencil) at the spot where you want to round: 1.2|4789. If the first digit to the right of that line is a 0, 1, 2, 3, or 4, throw away everything to the right of the line. It doesn’t matter what digits come after that first digit. Here, the first digit to the right of the line is a 4, so you throw away everything to the right of the line: 1.24789 rounded to one decimal place is 1.2.
Rounding in multiple steps, 1.24789 → 1.2479 → 1.248 → 1.25 → 1.3, is wrong. (Why? Because 1.24789 is 0.05211 units away from 1.3, but only 0.04789 units away from 1.2.) You must round in one step only.
As you know, if the first digit to the right of the line is a 5, 6, 7, 8, or 9, you raise the digit to the left of the line by one and throw away everything to the right of the line. To one decimal place, 1.27489 is 1.2|7489 → 1.3.
You may need to “carry one”. What is 1.97842 to one decimal place? 1.9|7842 needs you to increase that 9 by one. That means it becomes a zero and you have to increase the next digit over: 1.9+0.1 = 2.0. Therefore, 1.97842 rounded to one decimal place is 2.0.
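If you want to check the one-step rule on a computer, here is a minimal Python sketch. It assumes the usual school rule (a 5 always rounds up), which is why it uses the `decimal` module with `ROUND_HALF_UP`; Python’s built-in `round` uses a different tie-breaking rule. The helper name `round_half_up` is mine, made up for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(value: str, places: int) -> Decimal:
    """Round a decimal string to `places` decimal places in ONE step."""
    quantum = Decimal(10) ** -places          # places=1 gives Decimal("0.1")
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)

# Right way: one step
one_step = round_half_up("1.24789", 1)

# Wrong way: peel off one digit at a time (1.2479, 1.248, 1.25, 1.3)
stepwise = Decimal("1.24789")
for places in (4, 3, 2, 1):
    stepwise = stepwise.quantize(Decimal(10) ** -places, rounding=ROUND_HALF_UP)

# "Carry one": the 9 becomes 0 and the next digit over goes up
carried = round_half_up("1.97842", 1)

print(one_step, stepwise, carried)
```

The one-step result and the stepwise result disagree, which is the whole point of the rule.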
Here’s the Big No-No: Never do further calculations with rounded numbers. What’s the right way? Round only after the last step in calculation.
Example 28: True story: In Europe, average body temperature for healthy people was determined to be 36.8°C, as repeated in A Critical Appraisal of 98.6°F (Mackowiak, Wasserman, Levine 1992 [see “Sources Used” at end of book]). Rounding to the nearest degree, the average human body temperature is 37°C. So far so good.
But in the US, thermometers for home use are marked in degrees Fahrenheit. Some nimrod converted 37°C using the good old formula 1.8C+32 and got 98.6°F, and that’s what’s marked on millions of US thermometers as “normal” temperature. If you’ve got one of those, ask for your money back, because it’s wrong.
Why is it wrong? The person who did the conversion committed the Big No-No and did further calculations with a rounded number. For a correct calculation, use the unrounded number, 36.8. (Okay, 36.8 was probably rounded from 36.77 or 36.82 or something. But the point is that it’s the least rounded number available.) 1.8×36.8+32 = 98.24 → 98.2, and that is the average body temperature for healthy humans.
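The same comparison can be sketched in a few lines of Python. The function name `celsius_to_fahrenheit` is just an illustrative label; the formula and the numbers come from the example.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """The good old conversion formula, 1.8C + 32."""
    return 1.8 * celsius + 32

unrounded_c = 36.8   # least-rounded average available

# Wrong: round to 37 first, then convert
wrong_f = round(celsius_to_fahrenheit(round(unrounded_c)), 1)

# Right: convert the unrounded number, round only at the end
right_f = round(celsius_to_fahrenheit(unrounded_c), 1)

print(wrong_f, right_f)
```

Rounding early shifts the answer by 0.4°F, which is a lot when you’re deciding whether somebody has a fever.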
When a calculation results in a number lower than about 0.0005, your calculator will usually present it in the dreaded scientific notation, like 1.99E-4. Be alert for this! Your answer is not 1.99 (or however you want to round it). Your answer is 1.99×10^{-4}.
How do you convert this to a decimal for reading by ordinary humans? (And yes, you should usually do that — definitely, if your work will be read by nontechnical people.)
The exponent (the number after the E minus) tells you how many zeroes the decimal starts with, including the zero before the decimal point. 1.99×10^{-4} is 0.000 199 or 0.0002.
When a decimal starts with a bunch of zeroes, especially if the decimal is long, many people use spaces to separate groups of three digits. This makes the decimal easier to read.
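Here is a small Python sketch of that conversion, assuming the calculator’s display is typed in as text like `1.99E-4`. Both helper names are made up for illustration.

```python
from decimal import Decimal

def plain_decimal(text: str) -> str:
    """Convert calculator-style notation such as '1.99E-4' to an ordinary decimal."""
    return format(Decimal(text), "f")

def group_in_threes(decimal_text: str) -> str:
    """Put a space after every three digits to the right of the decimal point."""
    whole, _, fraction = decimal_text.partition(".")
    if not fraction:
        return whole
    groups = [fraction[i:i + 3] for i in range(0, len(fraction), 3)]
    return whole + "." + " ".join(groups)

print(plain_decimal("1.99E-4"))                    # ordinary decimal
print(group_in_threes(plain_decimal("1.99E-4")))   # same, grouped for reading
```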
Don’t just write down answers; show your work. This is in your own best interest:
“But,” I hear you object, “in the real world, all that matters is getting the right answer.” True enough, but there’s a difference between being in the real world and preparing for the real world. Part of your study is to develop thought and work habits that ensure you will get the right answer when there’s nobody around to check you. You expose your process now, so that problems can be corrected.
How do you show your work?
The general idea is to show enough that someone familiar with the course content can follow what you did.
When evaluating a formula, write down the formula, then on a line below show it with the numbers replacing the letters. Your calculator can handle very complicated formulas in one step, so your next line will be your last line, containing the final answer and any rounding you do. Example:
SEM = σ/√n
SEM = 160/√37
SEM = 26.30383797 → SEM = 26.3
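If you’re working at a computer instead of a calculator, the same three-line pattern (formula, numbers, answer) might look like this in Python. The variable names are my own choice.

```python
import math

sigma = 160   # standard deviation from the example
n = 37        # sample size from the example

sem = sigma / math.sqrt(n)     # SEM = sigma / sqrt(n), evaluated in one step
sem_rounded = round(sem, 1)    # round only after the last step

print(sem, "->", sem_rounded)
```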
You’ll be using a lot of the menus and commands on the TI-83 or TI-84. Here are some tips:
If you use randInt to get five random integers from 1 to 100, write down randInt(1,100,5). That’s the only way your instructor will know that you know how to use that function. If you think the command is randInt(5,100), now is the time to correct that misunderstanding.
If you use a command such as 1-VarStats L1,L2, write that. For pity’s sake, don’t write all the keystrokes, [STAT] [►] [1] [2nd] [L1] [,] [2nd] [L2] [ENTER]. I put them in this book because you’re just learning them. But someone familiar with the course knows how to get the command, and I hate to think of all the time and paper you could waste.
You’ll find that your calculator does the complicated stuff for you, but here and there I’ve scattered formulas in BTW paragraphs in case you want to peek behind the curtain.
Stats formulas usually need to do the same thing to every member of a data set and then add up the results. The Greek letter ∑, a capital sigma, indicates this. This summation notation makes formulas easier to write — easier to read, too, once you get used to it.
Some examples:
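To see what ∑ is asking for, here is a tiny Python sketch. The three data values are hypothetical, made up only to show the pattern: do something to every member of the data set, then add up the results.

```python
data = [3, 5, 8]   # hypothetical data set, for illustration only

sum_x = sum(data)                          # sum of x: add up the values
sum_x_squared = sum(x**2 for x in data)    # sum of x squared: square each, then add
square_of_sum = sum(data) ** 2             # (sum of x) squared: add first, then square

print(sum_x, sum_x_squared, square_of_sum)
```

Notice that squaring each value and then adding is not the same as adding and then squaring.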
(The online book has live links to all of these.)
Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.
Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.
How would you construct a random sample of size 50? a systematic sample? a cluster sample? Which of these is the best balance between statistical purity and practicality?
Each doctor will put up notices in the waiting rooms, and will select the first 30 adult volunteers, assigning the first 15 to the experimental group and the second 15 to the control group. Patients will not be told which group they are in; you supply placebo pills that are identical in appearance to the active medication. Doctors will administer the placebo and medication to the selected groups and report results back to you.
Identify three serious errors in this technique. Are these examples of sampling or nonsampling error?
Now you answer these:
(b) The average dinner check at my restaurant last Friday was $38.23.
(c) 45% of patients taking Effluvium complained of bloating and stomach pain.
(d) The average size of a party at my restaurant last Friday was 2.9 people.
Updated 19 June 2017
Summary: To make sense out of a mass of raw data, make a graph. Nonnumeric data want a bar graph or pie chart; numeric data want a histogram or stemplot. Histograms and bar graphs can show frequency or relative frequency.
Graph Paper for Free: Why buy an expensive pad of graph paper, especially if you only need a few sheets? You can print your own for free using Incompetech’s Plain Graph Paper PDF Generator or Math Worksheets Land’s Graph Paper. Both are sources not just for the ordinary square grid, but for various specialty graph papers.
Any graph of nonnumeric data needs to show two things: the categories and the size of each. Probably you’re already familiar with the two most common types, which are the bar graph and pie chart.
The sizes of categories can be shown as raw counts, called frequencies, or percentages, called relative frequencies. (Relative frequencies can also be shown as decimals, but I think most people respond better to “20%” than “.20”.)
How do you decide whether to show frequencies or relative frequencies? This is a stylistic choice, not a matter of right and wrong. Your choice depends on what’s important, what point you’re trying to make. If your main concern is just with the individuals in your sample, go with frequencies. But if you want to show the relationship of the parts to the whole, show relative frequencies.
How Often Parents Read to Children under Age 12 (n=434)  

How Often  Number of Parents 
Every day  217 
A few times a week  113 
About once a week  39 
A few times a month  26 
Less often  30 
Never  9 
Example 1: In fall 2012, the Pew Research Center (2013a) [see “Sources Used” at end of book] surveyed American adults on their habits of reading to their children. The survey included 434 adults who had at least one child under age 12, and the results are shown in the table.
(Remember, you can’t call the data numeric just because you see numbers in a summary statement. You have to go back to the individual data points, which are categorical: “every day”, “a few times a week”, and so on. If the Pew Center had asked “how many days a week do you read to your child?” and got answers 0, 1, 2, 3, 4, 5, 6, and 7, that would be a set of numeric data.)
Your bar chart or bar graph must follow these rules:
Usually the category axis is horizontal, so the frequency axis and the bars are vertical. But you can also make a horizontal bar chart, where the category axis is vertical and the frequency axis and bars are horizontal.
You can make a bar graph by hand, or use software such as Microsoft Excel. If you make a bar graph by hand, use graph paper and draw the axes and bars with a straightedge — wobbly bars make you look like you had a liquid lunch.
Here’s my bar graph for parents reading to children:
A couple of comments on best practices:
Getting some kind of bar graph out of Excel is easy. But then there’s a lot of fiddling around to reverse some of Excel’s rather strange format choices. Here are instructions for Excel 2010. If you have Excel 2007, 2013, or 2016, you’ll find that they’re pretty similar.
If you prefer a horizontal bar chart, it’s easy to make the change. Click into the chart area, then on the tab on the ribbon click » and select the first one.
Okay, well, nothing is that easy! Excel puts the categories in backwards order, so right-click the category axis and select
The frequency bar graph tells us about the 434 individuals in the Pew Research Center’s sample. But why collect that sample except for what it can tell us about how often parents in general read to their children?
You know from Sampling Error in Chapter 1 that the proportions in the population are probably not the same as the sample, but probably not very far off either. So you compute those proportions and then redraw your graph to show percentages instead of raw counts.
How Often Parents Read to Children under Age 12 (n=434)  

How Often  Number of Parents  Rel. Freq. 
Every day  217  50% 
A few times a week  113  26% 
About once a week  39  9% 
A few times a month  26  6% 
Less often  30  7% 
Never  9  2% 
First, total all the frequencies to get the sample size n = 434. (In this case n is given already, but often it isn’t.) Then convert each frequency into a relative frequency. The formula, if you need one, is f/n. For example, 9 parents never read to their under-12 children. The relative frequency is f/n = 9/434 = 0.021 or 2%: 2% of parents never read to their children. Enter that and the other relative frequencies in the table, as shown at right.
You may see some bar graphs with relative frequencies as decimals. There’s nothing wrong with that for technical audiences, but general audiences usually respond better to percentages.
Your relative frequencies may not add up to exactly 100% (or 1.0000), because of rounding. Don’t change any of the numbers to force a total.
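Here is a minimal Python sketch of the f/n computation for the Pew data. The dictionary layout and variable names are my own; the counts come straight from the table.

```python
frequencies = {
    "Every day": 217,
    "A few times a week": 113,
    "About once a week": 39,
    "A few times a month": 26,
    "Less often": 30,
    "Never": 9,
}

n = sum(frequencies.values())   # sample size: total of all frequencies

# Relative frequency is f/n; show it as a whole-number percentage
percents = {category: round(100 * f / n) for category, f in frequencies.items()}

for category, pct in percents.items():
    print(f"{category:<20} {frequencies[category]:>4} {pct:>4}%")
print("Total of rounded percentages:", sum(percents.values()))
```

With this data set the rounded percentages happen to total 100, but with other data they might not, and you still shouldn’t adjust them.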
Once you have your relative frequencies, you can make your bar graph. Choose round numbers for the tick marks on your relative frequency axis, for example every 5% or every 10%. I won’t inflict another of my sketches on you, but you can see a finished relative-frequency bar graph below.
To my surprise, I found that Excel doesn’t include relative-frequency bar graphs in its repertoire. You have to enter some formulas to compute the relative frequencies, and then create the graph from them. (Of course you could compute the relative frequencies yourself and enter them in Excel as numbers, but whenever possible I like to be lazy and make the computer do the work.)
Now highlight the category and relative-frequency columns, click the tab and the first 2-D column chart, and tweak the graph as you did before. Your result should be something like the one you see here.
On this chart, neither axis really needs a label. The percent signs reinforce the message in the chart title that the bars show relative frequencies. And the category names together with the chart title tell the reader exactly what is being represented.
It’s a judgment call where to place tick marks on the relative-frequency axis, and you really need to look at the data to make a decision. Four categories are under 10%, so it makes sense to show the 5% line and help the reader get a sense of the relative sizes. Of course, if you show 5% then you have to show every 5% increment up to the top of the graph.
You may want to compare two populations: men and women, for instance, or one year versus another year. To do this, a side-by-side bar graph is ideal. A side-by-side bar graph has two bars for each category, and a legend shows the meaning of the bars.
The two populations you’re comparing are almost never the same size. Therefore side-by-side graphs almost always show relative frequencies rather than frequencies.
Example 2: In Educational Attainment, the Census Bureau (2014) [see “Sources Used” at end of book] showed the educational attainment of the population in selected years 1940 to 2012. I chose the years 1992 and 2012 and prepared this graph to show the change over that 20-year period.
What do you see? Comparing 2012 to 1992, the proportion of the population with no college (the first four categories) declined, and the proportion with some college or a college degree increased. You should be able to see why this has to be a relative-frequency chart: in a frequency chart, the larger population in 2012 would make all the bars taller than the 1992 bars, and you’d be hard put to see any kind of trend.
Example 3: Another way to compare two populations is the stacked bar graph. In the side-by-side bar graph, above, each group of bars was one category, and each bar within a group was a population. With the stacked bar graph, you have one bar for each population, and one piece of that bar for each category. (A stacked bar graph is kind of like an unrolled pie chart.)
Here’s a stacked bar graph for the same data set:
What do you see? Look first at the legend that lists the categories, then at the two bars. The top two segments represent some college. In 1992, about 56% of adults had no education beyond high school. But in 2012, only about 42% had a high-school diploma or less, meaning that 58% had at least some college. The proportions of college and no college were reversed in those 20 years.
You can also see that, though the group with four years of high school shrank, it didn’t shrink as much as the group with college grew. In other words, it’s not just more high-school graduates going on to college, it’s a higher proportion of the population entering high school. All the categories without a high-school diploma shrank. In 1992, 20% of adults had less than a high-school diploma and 80% were high-school graduates; in 2012, only about 12% had less than a high-school diploma and 88% had graduated from high school.
What’s the best way to compare two populations? The answer depends on what you’re trying to show. The side-by-side graph seems to be better at showing how each category changed, and the stacked graph is usually better at showing the mix, especially if you want to group the categories mentally. In the side-by-side graph, you can easily see the decline in adults with a fourth-grade education or less, but the shift to a college-educated population is much harder to see. It’s just the opposite with the stacked graph.
As always, get clear in your own mind what you’re trying to show, and then select the type of graph that shows that most clearly.
Did you notice that this stacked bar graph shows relative frequencies? (Maybe you didn’t notice, because it seems like the natural way to go.) A stacked bar graph could show frequencies instead of relative frequencies, if you want to emphasize the different sizes of the populations, but then it becomes harder to compare the mix in the populations.
When you make a stacked bar graph in Excel, there’s no need to precompute the percentages. Just select the third type of 2-D column chart.
Example 4: In the first example, you were given a table of categories and counts. But more likely you’ll just have a mass of data points, like this:
Children’s Favorite Beach Toys  

shovel  dump truck  shovel  bucket  shovel  ball 
ball  bucket  sifter  ball  shovel  shovel 
dump truck  ball  shovel  shovel  bucket  net 
sifter  shovel  bucket  dump truck  bucket  shovel 
ball  shovel  ball  bucket  net  ball 
Before you can make any kind of graph, you need a table to summarize the data. You’re probably tempted to count the number of shovels, the number of balls, and so on, but it’s way too easy to make mistakes that way. Why? Because you have to go over the data set multiple times, and you may count something twice or miss something.
The better procedure is to tally the categories in a table. It’s a win-win: the procedure is faster, and you’re less likely to make a mistake.
Toy         Tallies

shovel      |||
ball        |||
dump truck  ||
sifter      |
bucket      |
Simply go through the data, one item at a time. If you’re seeing a given category for the first time, add it to your list with a tally mark; if that category is already in your table, just add a tally mark. Here’s my table of tallies after going through the first two columns of data:
Please complete your tallies on your own before you look at mine.
After you’ve tallied all the data, count the tallies in each category and total the counts. Of course the total should equal your sample size n. Here’s my complete table:
Toy         Tallies       Frequency

shovel      ||||| |||||   10
ball        ||||| ||      7
dump truck  |||           3
sifter      ||            2
bucket      ||||| |       6
net         ||            2
Total                     30
Always check the total of your frequencies. If it matches the sample size, that’s no guarantee everything is correct; but if it doesn’t match, you know something is wrong.
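If your data are already in a computer, a one-pass tally is easy to sketch in Python with `collections.Counter`, which works the same way as the hand tally: it goes through the data once and adds a mark to the right category. The variable names are mine; the data are the beach-toy responses above, read across the rows.

```python
from collections import Counter

toys = [
    "shovel", "dump truck", "shovel", "bucket", "shovel", "ball",
    "ball", "bucket", "sifter", "ball", "shovel", "shovel",
    "dump truck", "ball", "shovel", "shovel", "bucket", "net",
    "sifter", "shovel", "bucket", "dump truck", "bucket", "shovel",
    "ball", "shovel", "ball", "bucket", "net", "ball",
]

tally = Counter(toys)            # one pass through the data
total = sum(tally.values())      # should equal the sample size n

for toy, count in tally.items():
    print(f"{toy:<12} {count:>3}")
print(f"{'Total':<12} {total:>3}")
```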
Once you’ve got your table, you can make a graph by following the procedures above. If you’re publishing the table itself, give just the category names and sizes and the total, but leave out the tallies.
Where a bar graph tends to emphasize the sizes of categories in relation to each other, a pie chart tends to emphasize the categories as divisions of the whole. This distinction is not hard and fast; it’s just a matter of emphasis.
To make a pie chart, you need a compass, or something else that can draw a circle, and you need a protractor. The angle of each segment of the pie will be 360°×f/n, where f is the frequency of the category and n is the sample size — in other words, it’s 360° times the relative frequency, whether you’re showing frequencies or relative frequencies on the pie chart. But in practice, if you’re going to make a pie chart you’ll use Excel or some other software.
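As a quick illustration of the 360°×f/n formula, here is a Python sketch that computes the segment angles for the data about parents reading to children. The variable names are my own.

```python
frequencies = {
    "Every day": 217,
    "A few times a week": 113,
    "About once a week": 39,
    "A few times a month": 26,
    "Less often": 30,
    "Never": 9,
}

n = sum(frequencies.values())

# Angle of each pie segment: 360 degrees times the relative frequency f/n
angles = {category: 360 * f / n for category, f in frequencies.items()}

for category, angle in angles.items():
    print(f"{category:<20} {angle:6.1f} degrees")
```

“Every day” is exactly half the sample, so it gets exactly half the circle.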
Excel can draw a pie chart for you, but you have to make a bunch of tweaks before it’s usable. There’s one bit of good news: with a pie chart, unlike a bar graph, Excel can compute relative frequencies automatically. I’ll show you how to do that for the data about parents reading to children, for which we made a bar graph earlier.
Many people stop there, but this is an absolutely horrible design. Readers have to keep looking back and forth to match up the colors, and often there are similar colors. Colorblind people are really screwed, and if you print the chart on a black-and-white printer it’s hopeless. Fortunately you can fix this!
For numeric data, you want to show four things: the shape, center, and spread of the distribution plus any outliers. The histogram is the standard way to do this, and it can show frequencies or relative frequencies.
Usually you’ll group the data into classes, but when you have discrete data without too many different values you can make an ungrouped histogram.
For a discrete data set with a moderate number of values and a moderate range, a stemplot is an alternative. With a stemplot, it doesn’t matter how many different data values there are, but the number of data points matters.
How can you draw a picture of numeric data? The answer is a histogram.
The term “histogram” was coined by Karl Pearson in lectures some time before 1895.
Example 5: Let’s use the lengths of some randomly selected iTunes songs:
Lengths of iTunes Songs (seconds)  

113  282  179  594  213  319  245  323  334  526  
395  440  477  240  296  428  407  230  294  152  
242  837  246  135  412  223  275  409  114  604  
170  239  138  505  316  369  298  168  269  398  
433  212  367  255  218  283  179  374  204  227 
How do you make sense of this? As you might expect, the first step is to make a table. But you don’t want to treat each number as its own category, because that would produce a really uninteresting graph. Instead you create categories, except for numeric data you call them classes. The rules for classes are very simple:
Notice that the rules don’t tell you how many classes there must be, or what width a class must have. That’s where your discretion comes in. You want to pick class boundaries that are “nice” numbers, and you don’t want too many classes or too few. In practice, five to nine classes is usually about the right number.
How does that apply to the iTunes songs? Take a look at the data. The lowest number seems to be 113, and the highest is 837. That gives a range in “nice” numbers of about 100–850. If you set class width to 100 you have eight classes, so that seems about right.
Now go ahead and make your tally marks to create the table. Instead of category names, you use class boundaries. You already know how to make tally marks, so I’ll just give you the results:
Lengths of iTunes Songs (seconds)

Class Boundaries  Tallies                  Frequency
100–199           ||||| ||||               9
200–299           ||||| ||||| ||||| |||||  20
300–399           ||||| ||||               9
400–499           ||||| ||                 7
500–599           |||                      3
600–699           |                        1
700–799                                    0
800–899           |                        1
Even though the 700–799 class has no data points, it’s still a class and it will occupy the same width in the histogram as any other class. A bar with zero height shows in the histogram as a gap, and that’s good because it emphasizes that there’s something unusual about the point in the 800–899 class (which was 837 seconds).
If the class width is 100, how come the class bounds are 100–199 and not 100–200? In fact, some authors do write these class bounds as 100–200, 200–300, and so on, with the understanding that if a number is right on the boundary it goes in the upper class. All authors agree that the class width is the difference between the lower bounds of two consecutive classes, not the difference between lower and upper bounds of one class. So whether you write 100–199 for the first class or 100–200, the class width is 200 minus 100, which is 100.
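Here is a minimal Python sketch of grouping the iTunes data into classes of width 100. It identifies each class by its lower bound, which fits the point just made: the width is the gap between consecutive lower bounds. The variable names are mine, and the integer-division trick works here only because the lower bounds are multiples of the class width.

```python
lengths = [
    113, 282, 179, 594, 213, 319, 245, 323, 334, 526,
    395, 440, 477, 240, 296, 428, 407, 230, 294, 152,
    242, 837, 246, 135, 412, 223, 275, 409, 114, 604,
    170, 239, 138, 505, 316, 369, 298, 168, 269, 398,
    433, 212, 367, 255, 218, 283, 179, 374, 204, 227,
]

class_width = 100

# Start every class at zero so that an empty class still shows up in the table
frequencies = {lower: 0 for lower in range(100, 900, class_width)}

for x in lengths:
    lower = (x // class_width) * class_width   # lower bound of the class holding x
    frequencies[lower] += 1

for lower, f in frequencies.items():
    print(f"{lower}-{lower + class_width - 1}: {f}")
```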
Once you have the table, the histogram is straightforward. You can draw the histogram by hand or use Excel. I’ll show Excel later, but here’s my handmade histogram for the iTunes data.
Notice that you label the data bars on their edges: 100, 200, …, 900, not 100–199, 200–299, …. Label the left edge of each bar, and also the right edge of the last bar. The right edge of the last bar is always one class width more than the left edge, so even if you’ve got 800–899 in your table the last bar’s edges are 800 and 900.
Like all histograms, this one is good at showing the shape of the data (skewed right; see below), the center (somewhere in the upper 200s to 300s), and the spread (from 100ish to 800ish seconds, or about two minutes to 13 minutes). In Chapter 3 you’ll learn how to measure center and spread numerically, but there’s always a place for a picture to help people grasp a data set as a whole.
This data set also shows an outlier, located somewhere in the 800–899 class. Not every data set will have an outlier, of course, and a rare sample might have more than one. When an outlier occurs, your first move is to go back to your original data sheets and make sure that it’s not simply a mistake in entering your data. If it’s a real data point, then you can ask what it means. In this case, the message is pretty simple: tunes generally run up to about 11 or 12 minutes (700 seconds), but the occasional one can be several minutes longer.
A histogram is similar to a bar graph, but with the following differences:
                         Histogram                     Bar Graph                     Bar Graph

Data type                Numeric (grouped)             Discrete ungrouped★           Nonnumeric
Order of categories      Numeric order, left to right  Numeric order, left to right  Any order you choose
Do the bars touch?       Yes                           No, they’re spaced            No, they’re spaced
Where are they labeled?  Below the edges               Below the centers             Below the centers
★Some authors treat ungrouped discrete data as numeric and make a histogram. Others, including this book, treat ungrouped discrete data as categories and make a bar graph.
For both histogram and bar graph, the frequencies must start at 0. However, in a histogram the data axis typically doesn’t start at zero. You just leave some space between the frequency axis and the first bar, and the scale of the data axis is considered to start at the first bar.
Though I don’t show it here, you could make a relative-frequency histogram, the same way you made a relative-frequency bar chart. The relative frequencies range from 0 for the 700–799 class to 20/50 = 40% for the 200–299 class.
Believe it or not, out of all the chart types in Excel, the standard histogram was not included until Excel 2016. If you’ve got Excel 2016, click » and select the histogram.
In Excel 2013 and earlier, to make a histogram you have to combine a column chart and a scatterplot (Middleton) [see “Sources Used” at end of book], or download additional software. You can follow the detailed instructions in that document, or you can download the free Better Histogram add-in from TreePlan Software to do the job. (It works in Excel 2007 through 2016.) If you’re using Better Histogram:
To make sense of most data sets, you need to group the data into classes. But sometimes your data have only a few different values. In such cases, you probably want to skip the grouping and just have one histogram bar for each different response. The height of the bar tells you how often that response occurred, as usual.
Example 6: A state park collected data on the number of adults in each vehicle that entered the park in a given time interval:
3 1 1 3 3 3 0 7 3 1 3 6 4 5 3 2 3 4 2 3
0 2 2 4 8 3 3 1 3 3 3 4 1 5 2 2 6 3 4 2
Number of Adults in Vehicles Entering Park

Adults  Tallies            Frequency
0       ||                 2
1       |||||              5
2       ||||| ||           7
3       ||||| ||||| |||||  15
4       |||||              5
5       ||                 2
6       ||                 2
7       |                  1
8       |                  1
Total                      40
There are only nine different values, so it seems a little silly to group them. Instead, just tally the occurrences, as shown at right.
Label ungrouped data under the centers of the bars, just like categorical data, not under the edges. Some authors still make the bars touch because the data are numeric, and others keep the bars separated because the data are ungrouped. I prefer the second approach, but I’ll accept the other. Here’s my histogram:
Caution: This particular data set has at least one occurrence of every value between min and max. But suppose it didn’t; suppose there were no vehicles with 7 adults? In that case, you would draw the histogram exactly the same, except that the bar above “7” would have zero height. The horizontal axis for numeric data must always have a consistent scale for its whole length, so you never close up any gaps.
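To illustrate the caution about gaps, here is a Python sketch that counts the park data and deliberately lists every value from the minimum to the maximum, so that a missing value shows up with frequency 0 instead of disappearing. The helper name `ungrouped_table` is made up for illustration.

```python
from collections import Counter

adults = [
    3, 1, 1, 3, 3, 3, 0, 7, 3, 1, 3, 6, 4, 5, 3, 2, 3, 4, 2, 3,
    0, 2, 2, 4, 8, 3, 3, 1, 3, 3, 3, 4, 1, 5, 2, 2, 6, 3, 4, 2,
]

def ungrouped_table(data):
    """Frequency of every integer from min to max, including values that never occur."""
    counts = Counter(data)
    return {value: counts[value] for value in range(min(data), max(data) + 1)}

table = ungrouped_table(adults)

# Suppose there had been no vehicles with 7 adults: the 7 stays, with zero height
without_sevens = ungrouped_table([a for a in adults if a != 7])

print(table)
print(without_sevens)
```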
You can graph ungrouped discrete data in Excel, if you wish. The key is to fool Excel into treating the data like categorical data:
By the way, you might notice that the tick marks on the vertical axis are every two cars on this graph, but they were every five cars on my hand-drawn histogram. One is not better than the other; it’s a stylistic choice.
You should know the names of the most common shapes of numeric data. Why? It’s easier to talk about data that way, and, as you’ll see in the next chapter, you treat different-shaped distributions a little differently.
The first question is whether the data set is symmetric or skewed. The histogram of a symmetric data set would look pretty much the same in a mirror; a skewed data set’s histogram would look quite different in a mirror.
If a distribution is skewed, you say whether it’s skewed left or skewed right. A distribution that is skewed left, like the first one below, has mostly high scores, and a distribution that is skewed right, like the second one below, has mostly low scores. The direction of skew is away from the bulk of the data, toward the long skinny tail, where there are few data points.
Skewed left or negatively skewed 
Skewed right or positively skewed 

Example 7: Scores on a really easy test would be skewed left: most people get high scores, but a few get low or very low scores.
Lifespan in developed countries is skewed left: there are relatively few infant and child deaths, and most people live into their 60s, 70s, or 80s. (The first graph in Calculus Applied to Probability and Statistics [Waner and Costenoble 1996] [see “Sources Used” at end of book] illustrates this.)
People’s own evaluations of their driving skills and safety are skewed left: few people rate themselves below average and most rate themselves above average. Illusory Superiority [see “Sources Used” at end of book] cites a study by Svenson showing this “Lake Wobegon effect”.
Example 8: People’s departure times after a concert would be skewed right: most people leave shortly before or after the performers finish, but a few straggle out for some time afterward. Skewed-right distributions are more common than skewed-left distributions.
Salaries at almost any corporation are another good example of a distribution that is skewed right: most people make a modest wage, but a few top people make much more.
There are several types of symmetric distributions, but here are the two you’ll meet most often. A uniform distribution is one where all possible values are equally likely to occur. The normal distribution has a precise definition, which you’ll meet in Chapter 7, but for now it’s enough to say that it’s the famous bell curve, with the middle values occurring the most often and the extreme values occurring much less often.
You’ll notice that both of the examples below are “bumpy”. That’s usual. In real life you pretty much never meet an exact match for any distribution, because there are always lurking variables, measurement errors, and so on. And even if a population does perfectly follow a given distribution, like the probability distributions you’ll meet in Chapter 6, still a sample doesn’t perfectly reflect the population it came from: sampling error is always with us. When we say that a data set follows such-and-such a distribution, we mean it’s a close match, not a perfect match.
Uniform  Normal (“bell curve”) 

Example 9: Winning lottery numbers are uniformly distributed. (In the short term some numbers occur more often than others, but over the long run they tend to even out.)
The results of rolling one die many times are uniformly distributed. (But the results of rolling two dice are not uniformly distributed: 7 is the most likely, 2 and 12 are tied for least likely, and the other numbers are intermediate.)
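You can check the dice claim by listing every equally likely outcome. This Python sketch counts the theoretical outcomes rather than simulating rolls, so there’s no bumpiness from sampling error; the variable names are mine.

```python
from collections import Counter
from itertools import product

faces = range(1, 7)

# One die: each face is one outcome, so the distribution is uniform
one_die = Counter(faces)

# Two dice: 36 equally likely pairs, but the totals are not equally likely
two_dice = Counter(a + b for a, b in product(faces, repeat=2))

for total in sorted(two_dice):
    print(f"{total:>2}: {'*' * two_dice[total]}")
```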
The normal distribution or bell curve occurs very often, and in fact many natural and industrial processes produce normal distributions. This happens so often that we often just say or write ND for “normal distribution” or “normally distributed”.
Example 10: Men’s and women’s heights follow separate normal distributions. People’s arrival times at an event are ND. IQ scores, and scores on most tests, are ND. The amount of soda in two-liter bottles is ND. Your commute times on a given route are ND.
Suppose you have a discrete data set with few repetitions. An ungrouped histogram would have most bars at the same low height; a grouped histogram might show a pattern but you’d lose the individual data points.
If your discrete data set isn’t too large (n < 100, give or take), and the range isn’t too great, you can eat your cake and have it too. The stemplot, also known as a stem-and-leaf diagram, is a mutant hybrid between a histogram and a simple list of data.
The idea is that you take all the digits of each data point except the last digit and call that the stem; the last digit is the leaf. For example, consider scores of 113 and 117. They are two leaves 3 and 7 on a common stem 11 (meaning 110).
To construct a stemplot, you look over your data set for the minimum and maximum, then write the stems in a column, from lowest to highest. Just like with a histogram, there are no gaps, so if you have data in the 50s and the 70s but not in the 60s you still need a stem of 6.
However, your stems probably won’t start at 0. Start them with the lowest data point that actually occurred, and end them with the highest data point that actually occurred.
The stemplot was invented by John Tukey in 1970.
Example 11: Here is a set of IQ scores from 50 randomly selected tenth graders:
99 77 83 111 141
89 98 84 93 124
110 73 96 60 102
87 123 120 100 95
100 90 104 85 129
81 119 112 103 76
108 91 94 114 108
92 96 94 88 101
117 106 103 105 113
97 106 109 80 116
To make your stemplot, eyeball the data for the minimum and maximum, which are 60 and 141. Write the stems, 6 to 14, in a column at the left of your paper, starting several lines below the top. Then draw a vertical line just to the right of them.
Now go through the data points, one by one, and add each leaf to the proper stem. During this process, you might find a value outside what you thought were the min and max. That’s no problem. Just add the stem and then the leaf. (Again, the stems can’t have gaps, so if your first stem is 6 and you come across a data point 47, you have to add stems 4 and 5, not just 4.)
Finally, add a title and a legend or key to your stemplot. Here is the result:
IQ Scores
 6 | 0
 7 | 7 3 6
 8 | 3 9 4 7 5 1 8 0
 9 | 9 8 3 6 5 0 1 4 2 6 4 7
10 | 2 0 0 4 3 8 8 1 6 3 5 6 9
11 | 1 0 9 2 4 7 3 6
12 | 4 3 0 9
13 |
14 | 1
key: 11 | 7 = 117
If you lie down and look at this sideways, it looks like a histogram. But the bonus is that you can still see all the actual data points within the groupings of 60–69, 70–79, etc.
A stemplot is great at showing shape, center and spread of distributions plus outliers, but most data sets don’t lend themselves to a stemplot. If your data set is too large, your leaves will run off the edge of the page. If your data set is too sparse (if the range is large for the number of data points), most of your stems won’t have leaves and the plot won’t really show any patterns in the data. But when you have a moderate-sized data set and the data range is moderate, a stemplot is probably better than a histogram because the stemplot gives more information.
One last touch is sorting the leaves. I don’t think that’s important enough to take the extra effort in a homework problem or on a quiz, but if you’re going to be presenting your stemplot to other people then you probably want to sort the leaves. Here’s the same stemplot with sorted leaves:
IQ Scores
 6 | 0
 7 | 3 6 7
 8 | 0 1 3 4 5 7 8 9
 9 | 0 1 2 3 4 4 5 6 6 7 8 9
10 | 0 0 1 2 3 3 4 5 6 6 8 8 9
11 | 0 1 2 3 4 6 7 9
12 | 0 3 4 9
13 |
14 | 1
key: 11 | 7 = 117
A glance at this stemplot shows you quite a lot. The data set is normally distributed, the center is around 100 points, the spread is 60–141, and there’s an outlier at 141.
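The stemplot procedure is mechanical enough to sketch in a few lines of Python. This is only an illustration under stated assumptions: the function name `stemplot` is made up for this example, and it assumes the data are nonnegative whole numbers, so that the last digit is the leaf and everything before it is the stem.

```python
from collections import defaultdict

def stemplot(data, sort_leaves=False):
    """Return the rows of a stemplot: stem = all digits but the last, leaf = last digit."""
    leaves = defaultdict(list)
    for value in data:
        stem, leaf = divmod(value, 10)
        leaves[stem].append(leaf)
    rows = []
    # Stems run from the lowest to the highest with no gaps,
    # so a stem with no data still gets an (empty) row.
    for stem in range(min(leaves), max(leaves) + 1):
        row = sorted(leaves[stem]) if sort_leaves else leaves[stem]
        text = f"{stem:>2} | " + " ".join(str(leaf) for leaf in row)
        rows.append(text.rstrip())
    return rows

iq_scores = [
    99, 77, 83, 111, 141,
    89, 98, 84, 93, 124,
    110, 73, 96, 60, 102,
    87, 123, 120, 100, 95,
    100, 90, 104, 85, 129,
    81, 119, 112, 103, 76,
    108, 91, 94, 114, 108,
    92, 96, 94, 88, 101,
    117, 106, 103, 105, 113,
    97, 106, 109, 80, 116,
]

print("IQ Scores")
for row in stemplot(iq_scores, sort_leaves=True):
    print(row)
print("key: 11 | 7 = 117")
```

Calling `stemplot(iq_scores)` without `sort_leaves=True` keeps the leaves in the order you met them in the data, which is what you get when you build the plot by hand.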
You now know how to make good graphs, so be on the lookout for bad graphs. Sometimes they’re bad just because whoever drew them didn’t know any better, or didn’t think. But some people may deliberately try to deceive you with a graph.
Example 12: File this one under “what were they thinking?” The left-hand graph doesn’t have a title, so you don’t know what “Yes” and “No” mean. You have to look back and forth between the graph and the legend, and anyone with red-green color blindness probably won’t be able to see which segment is which. Oh yes: what percentages of the sample answered “Yes” and “No”? You can guess that it’s around a third versus two thirds, but that’s not very precise.
The right-hand graph cures those problems. It’s now crystal clear which segment is Yes and which is No, and what proportion of the sample gave each answer. This actually lets you show more information in less space, a win-win. (Of course you wouldn’t use a vague term like “Opinions”; that’s just there to remind you to give your graph a title.)
Example 13: There’s no telling whether this one is deliberate deception or just incompetent graphing. An oatmeal company, which shall remain nameless, wanted to show that eating oatmeal for four weeks reduces cholesterol. The first graph makes a strong case — until you look at the scale on the vertical axis. (Don’t even think about wasting your time on a graph with no vertical scale.)
The scale doesn’t start at zero, so it makes differences look much bigger than they are. Your frequency or relative frequency scale must always start at zero (and you must show the zero). The second graph is properly drawn, and now you can see that the drop in cholesterol is only a slight one.
source:
Misleading Graph [see “Sources Used” at end of book]
Example 14: It’s all very well to create visual interest, but not if it makes the reader misinterpret the graph.
In the left-hand graph, you can tell from the scale that B is supposed to be three times as large as A, but since it’s three times as high and three times as wide it’s actually nine times as large, giving the reader a distorted impression of the amount of difference. Even if your “bars” are pictures, they still have to be the same width. The corrected version is shown at right. (It’s still not quite correct, though, because 0 is not shown on the vertical axis.)
If you follow the rules in this chapter, you’ll make good, professional graphs. But there are plenty of other ways to make good graphs, depending on the data you’re trying to show.
There’s a classic picture book that can give you lots of good ideas. Edward Tufte’s The Visual Display of Quantitative Information has been around since 1983, and no one has yet done it any better. (Tufte has produced newer editions.)
Example 15: One famous graph in Tufte’s book is particularly stunning. Charles Minard wanted to present a lot of time-series data about Napoleon’s disastrous campaign in Russia in the winter of 1812–1813: where battles took place, numbers of casualties, temperature, and so forth. He elected to make a kind of stylized map showing just the rivers and the cities where events happened. (Niemen at the left is the Niemen River, Russia’s western border at the time. Moscow, “Moscou” in French, is as far east as Napoleon got.) Across that, Minard showed the army strength as a broad swath at the start that shrank to almost nothing by the end of the retreat westward. Below are dates of events, temperatures, and precipitation. It’s a huge amount of information on one piece of paper.
This tiny rendition doesn’t do it justice, but if you visit http://upload.wikimedia.org/wikipedia/commons/2/29/Minard.png you’ll see it at a better size. (Your browser may still reduce it to fit on your screen. Try clicking into the picture and you should see it at original size, though you’ll have to scroll around to see the details. It sounds like a lot of effort, but I promise you it’s worth it. Or just get the book, because it has plenty more!)
Example 16: Here’s one I ran across in my reading. It’s not the graph of the century like Minard’s, but it’s a cut above the usual. In Bear Attacks: Their Causes and Avoidance (2002), Stephen Herrero had the problem of contrasting bears’ diet in spring, summer, and fall. (Of course in winter they’re not eating.)
He could have drawn three pie charts, or a stacked bar graph, but instead he came up with a great alternative. (A larger version is at https://BrownMath.com/swtpic/chap02_beardiet.jpg.) Each component of diet is clearly labeled right in the graph, not in some legend off to the side, and the contrasting backgrounds make it a little more interesting visually. A stacked bar graph would convey the same information, but I like this presentation because it suggests that “spring”, “summer”, and “fall” are not completely separate but rather transition one into the next.
The vertical axis is clearly labeled, too. There’s no doubt what the numbers are (as opposed to some units of weight, for instance, or something more esoteric like pounds of feed per hundreds of pounds of bear).
He probably could have left the title off the category axis; after all, we know that the seasons are seasons, and the graph title also conveys that information. But that’s a minor point. My only real quibble with this graph is that the overall graph title at the bottom is too small.
Overview: With numeric data, the goal of descriptive stats is to show shape, center, spread, and outliers.
(The online book has live links to all of these.)
Be on the lookout for violations of this rule and other signs of bad graphs.
Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.
Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.
Would You Move to the US?  

Yes, with authorization  154 
Yes, without authorization  204 
No  612 
Don’t know  30 
The Pew Research Center (2013c) [see “Sources Used” at end of book] conducted a poll of 1000 adults in Mexico, asking whether they would move to the US if they had the means and opportunity to move. Draw a relative-frequency bar graph for their responses.
What’s wrong with this graph? (You should be able to see at least two problems, maybe more.)
(source: Misleading Graph [see “Sources Used” at end of book] in Wikipedia)
Professor Marvel had a statistics class of fifteen students, and on one 15-point quiz their scores were
10.5 13.5 8 12 11.3 9 9.5 5 15 2.5 10.5 7 11.5 10 10.5
Construct a frequency table and bar graph for their letter grades on the quiz, where 90% is the minimum for an A, 80% for a B, 70% for a C, and 60% for a D.
Deaths by Horse Kick in 14 Prussian Army Corps, 1875–1894  

Number of Deaths  Frequency 
0  144 
1  91 
2  32 
3  11 
4  2 
Total  280 
Bulmer (1979, 92) [see “Sources Used” at end of book]
quotes an 1898 study of deaths by horse kick
in the Prussian army. Von Bortkiewicz compiled the number of deaths in
14 Prussian Army corps over the 20-year period 1875–1894, as
shown at right. (14 corps over 20 years gives 14×20 = 280
observations.) For example, there were 32 observations in which two
officers died of horse kicks.
(a) What is the type of the variable?
(b) Construct an appropriate graph.
Commuting Distances in km  

5  15  23  12  9 
12  22  26  31  21 
11  19  16  45  12 
8  26  18  17  1 
16  24  15  20  17 
In a GM factory in Brazil, 25 workers were asked their commuting distance in kilometers. Construct a stem-and-leaf plot.
—Adapted from Dabes and Janik (1999, 8) [see “Sources Used” at end of book]
Abigail asked a number of students their major. She found 35 in liberal arts, 10 in criminal justice, 25 in nursing, 45 in business, and 20 in other majors. What was the relative frequency of the nursing group, rounded to the nearest whole percent?
Bert asked his fellow students how many books they read for pleasure in a year. He found that most of them read 0, 1, or 2 books, but some read 3 or more and a very few read as many as 10. (He plotted the histogram shown at right.) Identify the shape of this distribution.
Test scores, x  Frequencies, f 

470.0–479.9  15 
480.0–489.9  22 
490.0–499.9  29 
500.0–509.9  50 
510.0–519.9  38 
At right is a grouped frequency distribution.
(a) Create a frequency histogram. (For a real quiz, you’d
use graph paper, but you can freehand this one.)
(b) Find the class width.
(c) What’s the shape of this distribution?
Updated 1 Feb 2015
For numeric data, the goal of descriptive stats is to show the shape, center, spread, and outliers of a data set. In this chapter, you learn how to find and interpret numbers that do that.
Measures of shape called skewness and kurtosis do exist, but they’re not part of this course. Roughly, skewness tells how this data set differs from a symmetric distribution, and kurtosis tells how it differs from a normal distribution. If you’re interested, you can learn about them in Measures of Shape: Skewness and Kurtosis. The MATH200B Program part 1 can compute those measures of shape for you.
There are three common measures of the center of a data set: mean, median, and mode.
The mean is nothing more than the average that you’ve been computing since elementary school.
The symbol for the mean of a sample is x̅, pronounced “x bar”. The symbol for the mean of a population is the Greek letter μ, pronounced “mew” and spelled “mu” in English. (Don’t write μ as “u”; the letter has a tail at the left.)
You can think of the mean as the center of gravity of the distribution. If you made a wooden cutout of the histogram, you could balance it on a pencil or your finger placed exactly under the mean.
The formula for the mean is x̅ = ∑x/n or μ = ∑x/N, meaning that you add up all the numbers in the data set and then divide by sample size or population size.
The median is the middle number of a sample or population. It is the number that is above and below equal numbers of data points. (Examples are below.)
There’s no one agreed symbol for the median. Different books use M or Med or just “median”.
To find the median by hand, you must put the numbers in order. If the data set has an odd number of data points, counting duplicates, then the median is the middle number. If the data set has an even number of data points, the median is halfway between the two middle numbers. (In the next section, you’ll get the median from your TI calculator, with no need to sort the numbers.)
The mode is the number that occurs most frequently in a data set. If two or more numbers are tied for most frequent, some textbooks say that the data set has no mode, and others say that those numbers are the modes (plural). We’ll follow the second convention.
Most distributions have only one mode, and we call them unimodal. If a distribution has two modes, or even if it has two “frequency peaks” like the one at right, we call it bimodal. (This was students’ final grades in a math course: a lot of low or high grades, and few in the middle.)
There’s no symbol for the mode.
Example 1: You’re interviewing at a company. You ask about the average salary, and the interviewer tells you that it’s $100,000. That sounds pretty good to you. But when you start work, you find that everybody you work with is making $10,000. What went wrong here?
The interviewer told the truth, but left out a key fact: Everybody but the president makes below the average. Eight employees make $10,000 each, the vice president makes $50,000, and the president makes $870,000. Yes, the mean is (8×10,000 + 50,000 + 870,000)/10 = $100,000, but that’s not representative because the president’s salary is an outlier. It pulls the mean away from the rest of the data, and skews the salary distribution toward the right. This graph tells the sad tale:
There was your mistake. Salaries at most companies are strongly skewed right, so most employees make less than the average. When a data set is skewed, the mean is pulled toward the extreme values. (A data set can be skewed without outliers, but when there are outliers the data set is almost certain to be skewed.)
You should have asked for the median salary, not the average (mean) salary. There are 10 employees, and 50% of 10 is 5, so the median is less than or equal to five data points and greater than or equal to five data points. The fifth-highest and sixth-highest salaries are both $10,000, so the median is $10,000.
The median is more representative than the mean when a data set is skewed. The mean is pulled toward an extreme value, but the median is unaffected by extreme values in the data set. We say that the median is resistant.
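To see how far one outlier can drag the mean while leaving the median alone, you can rerun the salary example in Python. This is a small sketch using the standard `statistics` module; the list name `salaries` is just for illustration.

```python
from statistics import mean, median

# Eight employees at $10,000, the vice president at $50,000,
# and the president at $870,000.
salaries = [10_000] * 8 + [50_000, 870_000]

print("mean:  ", mean(salaries))
print("median:", median(salaries))

# Drop the president's salary and see how each measure reacts.
without_president = salaries[:-1]
print("mean without president:  ", mean(without_president))
print("median without president:", median(without_president))
```

Removing the single outlier changes the mean drastically but leaves the median at $10,000, which is what “resistant” means.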
Example 2: What is the median of the data set 8, 15, 4, 1, 2? Put the numbers in order: 1, 2, 4, 8, 15. There are five numbers, and 50% of 5 is 2.5. You need the number that is above 2 data points and below 2 data points; the median is 4.
Example 3: What is the median of the data set 7, 24, 15, 1, 7, 45? There are six data points, and in order they are 1, 7, 7, 15, 24, 45. 50% of 6 is 3; you need the number that is above 3 data points and below 3 data points. It’s clear that the median is between 7 and 15, but where exactly? When the sample size is an even number, the median is the average of the two middle numbers. Therefore the median for this data set is the average of 7 and 15, (7+15)/2 = 11.
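The by-hand rule for the median (sort, then take the middle number or the average of the two middle numbers) can be sketched in Python as follows. The function name `median_by_hand` is made up for this illustration; in practice Python’s built-in `statistics.median` does the same job.

```python
def median_by_hand(values):
    """Median by the sort-and-find-the-middle rule described in the text."""
    ordered = sorted(values)          # the numbers must be in order first
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of data points: the middle number is the median.
        return ordered[middle]
    # Even number of data points: average the two middle numbers.
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median_by_hand([8, 15, 4, 1, 2]))        # Example 2
print(median_by_hand([7, 24, 15, 1, 7, 45]))   # Example 3
```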
When a distribution is symmetric, the mean and median are close together. If it’s unimodal, the mode is close to the mean and median as well.
But have you ever taken a course that was graded on a curve, and one or two “curve wreckers” ruined things for everyone else? What happened? Their high scores raised the class average (mean), so everybody else’s scores looked worse. The class scores were skewed right: low scores occurred frequently, and high scores were rare. (You can see shapes of skewed distributions in Chapter 2.)
When a distribution is skewed, the mean is pulled toward the extreme values. The median is resistant, unaffected by extreme values. And you can reverse that logic too: if the mean is greater than the median, it must be because the distribution is skewed right. From the median to the mean is the direction of skew.
Skewed left, mean < median (usually) 
Skewed right, mean > median (usually) 
For heaven’s sake, don’t memorize that! Instead, just draw a skewed distribution and ask yourself approximately where the mean and median fall on it.
Karl Pearson gives the rule median = (2×mean + mode)/3 for moderately skewed distributions. For more about this, see Empirical Relation between Mean, Median and Mode [see “Sources Used” at end of book].
Caution! All the statements in this section are a rule of thumb, true for most distributions. The logic holds for almost every unimodal continuous distribution, and for discrete distributions with a lot of different values. But it tends to break down on discrete distributions that have only a few different values. For more about this, see von Hippel 2005 [see “Sources Used” at end of book].
Summary:
The 1VarStats
command gives you mean, median, and much
more for any data set. If you have just a
plain list of numbers, enter the name of that list on the
command line. If you have a frequency distribution, enter the
name of the data list and the name of the frequency list on the
command line.
Excel: Excel can do these computations. This isn’t an Excel course, but if you’re an Excel head you can figure out how to get this information. One way is with the Data Analysis tool, part of the Analysis Toolpak add-in that comes with Excel (though you may have to enable it). Another way is to click in a blank cell, click » » and select the appropriate worksheet function.
Example 4: Professor Marvel had a statistics class of fifteen students, and on one quiz their scores were
10.5 13.5 8 12 11.3 9 9.5 5 15 2.5 10.5 7 11.5 10 10.5
Your TI-83 or TI-84 can give you the mean, median, and other numbers that summarize this data set.
Press [CLEAR].
Press [STAT] [ENTER] to get into the edit screen for statistics lists. You can use any list, but let’s use L1 this time. (If you don’t see L1, and pressing the left arrow doesn’t bring it into view, press [STAT] [5] [ENTER] [STAT] [ENTER].)
Move the cursor to the L1 label at the top (not the top number, the column heading) and press [CLEAR] [ENTER] to clear the list.
Enter the numbers, pressing [ENTER] after each one.
If you entered an extra number, press [DEL] to remove it; if you left out a number, press [2nd DEL makes INS] to open a space for it.
Press [STAT] [►] [1] to select 1VarStats.
If you’re prompted for a list, press [2nd 1 makes L1]. For a simple list of numbers like this one, there is no frequency list, so press [DEL].
If instead 1VarStats is pasted to the home screen, then on the same line identify the list that contains your data: [2nd 1 makes L1].
The complete command is 1VarStats L1; press [ENTER] to execute it. The results screen is shown below. A down arrow on the screen says that there is more information if you press [▼], and an up arrow says that there is more information if you press [▲].
Look first at the bottom of the screen. Always check
n
first — if it doesn’t match your
sample or population size, the other numbers are big sacks of
bogosity. In this case a quick count of the original data set shows 15
numbers, which is the right quantity. (Of course, this check
can’t determine if you miskeyed any numbers. Only double and
triple checking can protect you from that kind of mistake.)
What are you seeing on this screen?
A word about rounding: The rules for significant digits and rounding are beyond the scope of this course, but beware of being ridiculously precise. (For example, most gasoline pumps are calibrated in 0.001 gallon units. But 0.001 gallon is two tablespoons, and there’s considerably more gas than that in the hose, so that precision is just silly.)
A good rule of thumb is to report sample statistics and population parameters to one more decimal place than the original data. Then why did I say μ = 9.7 instead of 9.72, since the original data have one decimal place? That’s a valid question, and my answer is that 9.72 would not be wrong but it feels overly precise when there are only fifteen data points, most are whole numbers, and the rest are a whole number plus ½.
Since this data set is a population, select σx = 3.057929583 and write down σ = 3.1.
The name standard deviation was created in 1893 by Karl Pearson. (We might wish that he had chosen something with fewer than six syllables.) He assigned the symbol σ to the standard deviation of a population in 1894.
Showing your work and your results, you write down:
1VarStats L1
μ = 9.7
σ = 3.1
N = 15
min = 2.5
Q1 = 8
Med = 10.5
Q3 = 11.5
max = 15
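If you want to double-check the calculator’s numbers without a calculator, here is a rough Python equivalent. It is a sketch, not a description of the calculator’s internals: the helper name `five_number_summary` is made up, and it assumes the quartiles are the medians of the lower and upper halves of the sorted data (leaving out the middle value when the count is odd), which reproduces the values shown above.

```python
from statistics import mean, median, pstdev

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    with the middle value left out when the number of data points is odd.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

scores = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5, 10.5, 7, 11.5, 10, 10.5]

print("N     =", len(scores))                 # always check the count first
print("mu    =", round(mean(scores), 2))
print("sigma =", round(pstdev(scores), 4))    # population SD, because this class is a population
print("min, Q1, Med, Q3, max =", five_number_summary(scores))
```

Note the use of `pstdev` (population standard deviation) rather than `stdev`, matching the choice of σx on the calculator.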
Number of Adults in Vehicles Entering Park  

Adults in Vehicle  Number of Vehicles  
0  2  
1  5  
2  7  
3  15  
4  5  
5  2  
6  2  
7  1  
8  1  
Total  40 
Example 5: Your TI-83 or TI-84 can also compute statistics of a frequency distribution. Let’s try it with the data from Chapter 2 for number of adults in vehicles entering the park.
Enter the data values in one statistics list, such as L1. Enter the frequencies in a second list, such as L2. Press [STAT] [►] [1] to select 1VarStats.
If you’re prompted for lists, enter [2nd 1 makes L1] for List and [2nd 2 makes L2] for FreqList.
If instead 1VarStats is pasted to the home screen, press [2nd 1 makes L1], then [,] (comma), then [2nd 2 makes L2] and [ENTER]. The data list must come first and the frequency list second.
Caution — rookie mistake: Students often leave off the frequency list. Your calculator is pretty good, but it can’t read your mind. The only way it knows that you have a frequency distribution is if you give it both the frequency list and the data list.
Either way, write down the complete command on your
paper:
1VarStats L1,L2
.
Here are the results:
Again, look at n
first. That protects you
from the rookie mistake of leaving off the frequency list.
If n
is wrong, redo your 1VarStats
command and
this time do it right.
These forty vehicles are obviously not all the vehicles that enter the park, so they are a sample, not a population. You therefore write down the statistics as follows:
1VarStats L1,L2
x̅ = 3.0
s = 1.7 (from Sx = 1.73186575)
n = 40
min = 0
Q1 = 2
Med = 3
Q3 = 4
max = 8
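One way to see what the frequency list does is to expand the frequency distribution back into a plain list of 40 numbers and summarize that. The Python sketch below does exactly this; the variable names are illustrative only.

```python
from statistics import mean, median, stdev

adults = [0, 1, 2, 3, 4, 5, 6, 7, 8]       # data values (like L1)
vehicles = [2, 5, 7, 15, 5, 2, 2, 1, 1]    # frequencies (like L2)

# Repeat each data value as many times as its frequency says.
expanded = [x for x, f in zip(adults, vehicles) for _ in range(f)]

print("n     =", len(expanded))               # should match the table total
print("x-bar =", round(mean(expanded), 3))
print("s     =", round(stdev(expanded), 4))   # sample SD: these vehicles are a sample
print("Med   =", median(expanded))
```

If you forgot the frequencies and summarized `adults` alone, you would get n = 9 instead of 40, which is the same warning sign the calculator gives you.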
Sometimes you take an average where some data points are more important than others. We say that they are weighted more heavily, and the mean that you compute in this way is called a weighted average or weighted mean.
You’re intimately familiar with one example of a weighted average: your GPA or grade point average.
Example 6: The NHTSA’s Corporate Average Fuel Economy or CAFE Rule (NHTSA 2008) [see “Sources Used” at end of book] specifies a corporate average of 34.8 mpg (miles per gallon) for passenger cars. Let’s keep things simple and suppose that ZaZa Motors makes three models of passenger car: the Behemoth gets 22 mpg, the Ferret gets 35 mpg, and the Mosquito gets 50 mpg. Does ZaZa meet the standard?
To answer that, you can’t just average the three models: (22+35+50)/3 ≈ 35.7 mpg. Suppose the company sells one Mosquito and the rest are Behemoths and a sprinkling of Ferrets? You have to take into account the number of cars of each model sold. In effect, you have a frequency distribution with mpg figures and repetition counts. Let’s suppose these are the sales figures:
Auto Sales by ZaZa Motors  

Model  Miles per Gallon  Number Sold 
Behemoth  22  100,000 
Ferret  35  250,000 
Mosquito  50  20,000 
Total  370,000 
Put the miles per gallon in L1 and the frequencies in L2. (How do you know it’s not the other way around? You’re trying to find an average mpg, so the mpg numbers are your data.) You should find:
1VarStats L1,L2
μ = 32.3 mpg
N = 370,000 passenger cars
Even though two of the three models meet the standard, the mix of sales is such that ZaZa Motors’ CAFE is 32.3 mpg, and it’s not in compliance.
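The same weighted average can be worked out directly from the sales table. Here is a minimal Python sketch; the function name `weighted_mean` is made up for this example, and the figures are the ones in the table above.

```python
def weighted_mean(values, weights):
    """Multiply each value by its weight, add the products, divide by the total weight."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

mpg = [22, 35, 50]                    # Behemoth, Ferret, Mosquito
sold = [100_000, 250_000, 20_000]     # number of cars sold of each model

cafe = weighted_mean(mpg, sold)
simple_average = sum(mpg) / len(mpg)  # ignores how many cars of each model were sold

print("weighted mean: ", round(cafe, 1), "mpg")
print("simple average:", round(simple_average, 1), "mpg")
print("meets 34.8 mpg standard?", cafe >= 34.8)
```

The simple average of the three models clears the standard, but the weighted mean does not, because the low-mileage Behemoth outsells the high-mileage Mosquito five to one.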
The formula for the mean of a grouped distribution and the formula for a weighted average are the same formula: μ = ∑xf/N for a population or x̅ = ∑xf/n for a sample. Either way, take each data value times its frequency. Add up all those products, and divide by the population size or sample size. For the notation, see ∑ Means Add ’em Up in Chapter 1.
In a grouped frequency distribution, one number called the class midpoint stands for all the numbers in the class.
Definition: The class midpoint for a given class equals the lower boundary plus half the class width. This is half way between the lower class boundary of this class and the lower class boundary of the next class.
Lengths of iTunes Songs (seconds)  

Class Boundaries  Class Midpoint  Frequency 
100–199  150  9 
200–299  250  20 
300–399  350  9 
400–499  450  7 
500–599  550  3 
600–699  650  1 
700–799  750  0 
800–899  850  1 
Example 7: Let’s revisit the lengths of iTunes songs from the ungrouped histogram in Chapter 2. What is the midpoint of the 300 to 399 class?
The class width equals the difference between lower boundaries: 400−300 = 100. Half the class width is 50, so the midpoint is 300+50 = 350. You could also compute the class midpoint as (300+400)/2 = 350.
However, it is wrong to take (300+399)/2 = 349.5 as class midpoint or 399−300 = 99 as class width. Don’t use the upper boundary in finding the class midpoint.
Of course you don’t have to compute every class midpoint the long way. Once you have the midpoint of the first class, (100+200)/2 = 150, just add the class width repeatedly to get the rest: 250, 350, … 850. The grouped frequency distribution, with the class midpoints, is shown at right.
What good is the class midpoint? It’s a stand-in for all the numbers in its class. Instead of being concerned with the nine different numbers in the 100 to 199 class, twenty different numbers in the 200 to 299 class, and so on, we pretend that the entire data set is nine 150s, twenty 250s, and so on. This means you get approximate statistics, but you get them with a lot less work.
Is this legitimate? How good is the approximation? Usually, quite good. In most data sets, a given class holds about equally many data points below the class midpoint and above the class midpoint, so the errors from making the approximation tend to balance each other out. And the bigger the data set, the more points you have in each class, so the approximation is usually better for a larger data set.
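The midpoint rule (lower boundary plus half the class width) is easy to sketch in Python. The variable names below are illustrative, and the lower boundaries are the ones from the iTunes table.

```python
lower_boundaries = [100, 200, 300, 400, 500, 600, 700, 800]

# Class width = difference between consecutive LOWER boundaries (not upper minus lower).
class_width = lower_boundaries[1] - lower_boundaries[0]

# Midpoint = lower boundary plus half the class width.
midpoints = [lower + class_width / 2 for lower in lower_boundaries]

print("class width:", class_width)
print("midpoints:  ", midpoints)
```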
Procedure:
Enter the class midpoints in one statistics list, such as L1.
Enter the frequencies in another list, such as L2. Enter the
command 1VarStats L1,L2
and write down the
complete command on your paper.
Again, avoid the rookie mistake: include the class-midpoint list and the frequency list in your command.
The results screens are below. As usual, before you look at anything else, check that n matches the size of the data set. 50 is correct, so that’s one less worry.
There’s a problem with
the second screen,
though. Your calculator knows you have a frequency distribution,
because you gave two lists to the 1VarStats
command. But
it doesn’t have the original data, so it doesn’t know the
true minimum (lowest data point). When you read minX=150
,
you interpret that to mean that the lowest data point occurs in the
class whose midpoint is 150; in other words, the minimum is somewhere
between 100 and 199. Your knowledge of the rest of the
five-number summary has the same limitation. For instance, the median
isn’t 250; all you know is that it occurs somewhere between 200
and 299.
Because of these limitations,
you don’t do anything with the second results screen from a grouped distribution.
The mean and standard deviation don’t have this
problem: they’re approximate, but the approximation is good
enough. (n
is exact, not an approximation.)
These 50 iTunes songs are obviously not all the songs there are, not even all the songs in any particular person’s iTunes library. They are a sample, not a population. Therefore you write down your work and results like this:
1VarStats L1,L2
x̅ = 316 (or you could write 316.0)
s = 145.1
n = 50
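You can reproduce these approximate statistics from the class midpoints and frequencies with the formula x̅ = ∑xf/n. The Python sketch below does the arithmetic step by step; the variable names are illustrative only.

```python
from math import sqrt

midpoints = [150, 250, 350, 450, 550, 650, 750, 850]
frequencies = [9, 20, 9, 7, 3, 1, 0, 1]

n = sum(frequencies)                                          # sample size
x_bar = sum(x * f for x, f in zip(midpoints, frequencies)) / n

# Sample standard deviation: squared deviations weighted by frequency,
# divided by n - 1 because these songs are a sample.
sum_sq_dev = sum(f * (x - x_bar) ** 2 for x, f in zip(midpoints, frequencies))
s = sqrt(sum_sq_dev / (n - 1))

print("n     =", n)
print("x-bar =", x_bar)
print("s     =", round(s, 1))
```

Only the mean and standard deviation are computed here, for the same reason given above: the five-number summary of class midpoints isn’t meaningful.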
There are four common measures of the spread of a data set: range, interquartile range or IQR, variance, and standard deviation. (You may also see spread referred to as dispersion, scatter, variation, and similar words.)
Definition: The range of a data set is the distance between the largest and smallest members.
Example 8: If the largest number in a data set is 100 and the smallest is 20, the range is 100−20 = 80, regardless of what numbers lie between them and what shape the distribution might have.
Caution: The range is one number: 80, not “20 to 100”.
Obviously the range has a problem as a measure of spread: It uses only two of the numbers. Since only the two most extreme numbers in the data set get used to compute the range, the range is about as far from resistant as anything can be.
In favor of the range is that it’s easy to compute, and it can be a good rough descriptor for data sets that aren’t too weird. The interquartile range has something of the same idea, but it is resistant.
The interquartile range (IQR) is the distance between the largest and smallest members of the middle 50% of the data points, taking repetitions into account.
Alternative definition: The IQR is the third quartile minus the first quartile, or the 75th percentile minus the 25th percentile.
You’ll learn about percentiles and quartiles in the next section, Measures of Position, but for now let’s just take a quick nontechnical example.
Example 9: Consider the data set 1, 2, 3, 3, 3, 4, 5, 8, 11, 11, 15, 23. There are twelve numbers, and the middle 50% (six numbers) are 3, 3, 4, 5, 8, 11. The interquartile range is 11−3 = 8.
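Here is a small Python sketch of the range and of the “middle 50%” idea used in this example. It is deliberately simplified: the helper name `middle_half` is made up, and it assumes the number of data points is a multiple of 4 so that exactly a quarter of them can be trimmed from each end. (Quartiles, covered in the next section, handle the general case.)

```python
def data_range(values):
    """Range = largest member minus smallest member (one number, not an interval)."""
    return max(values) - min(values)

def middle_half(values):
    """Middle 50% of the data. Assumes len(values) is a multiple of 4."""
    ordered = sorted(values)
    quarter = len(ordered) // 4
    return ordered[quarter:len(ordered) - quarter]

data = [1, 2, 3, 3, 3, 4, 5, 8, 11, 11, 15, 23]

middle = middle_half(data)
iqr = max(middle) - min(middle)

print("range      =", data_range(data))
print("middle 50% =", middle)
print("IQR        =", iqr)
```

Try changing the 23 to 2300: the range explodes, but the IQR doesn’t move, because the extreme value never makes it into the middle half.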
The IQR is a better measure of spread than the range, because it’s resistant to the extreme values. But it still has the problem that it uses only two numbers in the data set. Isn’t there some measure of spread that uses all the numbers in the data set, as the mean does? The answer is yes: the variance and the standard deviation use all the numbers.
Your calculator gives you the standard deviation, as you saw above. The variance is important in a theoretical stats course, but not so much in this practical course. We’ll measure spread with the standard deviation almost exclusively. (To save wear and tear on my keyboard and your printer, I’ll often use the abbreviation SD.)
If you’d like to know how the variance and SD are computed, read the “BTW” section that follows. Otherwise, skip down to “What Good Is the Standard Deviation, Anyway?”
x  x−μ  (x−μ)² 

10.5  0.78  0.6084 
13.5  3.78  14.2884 
8  −1.72  2.9584 
12  2.28  5.1984 
11.3  1.58  2.4964 
9  −0.72  0.5184 
9.5  −0.22  0.0484 
5  −4.72  22.2784 
15  5.28  27.8784 
2.5  −7.22  52.1284 
10.5  0.78  0.6084 
7  −2.72  7.3984 
11.5  1.78  3.1684 
10  0.28  0.0784 
10.5  0.78  0.6084 
Total  0.00  140.2640 
If you want to devise a measure of spread, it seems reasonable to consider spread from the mean, so try subtracting the mean from each quiz score and then adding up all those deviations. You get zero, so obviously “sum of deviations” isn’t a useful measure of spread.
But with the next column you strike gold. Squaring all the deviations changes the negatives to positives, and also weights the larger deviations more heavily. This is progress! Now divide the total of squared deviations by the population size and you have the variance: σ² = 140.2640/15 = 9.3509. (σ is the Greek letter sigma.)
(When computing the variance of a sample, you divide by n−1 rather than n. The reasons are technical and are explained in Steve Simon’s articles Degrees of Freedom (1999a) [see “Sources Used” at end of book] and Degrees of Freedom, Part 2 (2004) [see “Sources Used” at end of book].)
The variance is quite a good measure of spread because it uses all the numbers and combines their differences from the mean in one overall measure. But it’s got one problem. If the data are dollars, the squared deviations will be in square dollars, and therefore the variance will be in square dollars. What’s a square dollar? (No, I don’t know either.) You want a measure of spread that is in the same units as the original data, just like the mean and median are. The simplest solution is to take the square root of the variance, and when you do that you have the standard deviation (SD), σ = √(140.2640/15) = 3.05793, which rounds to 3.1. And because the standard deviation is in the same units as the original data, it can be used as a yardstick, as you’ll see below.
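The same steps can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the table, treating the 15 quiz scores as a whole population (so the division is by N, not n−1); the variable names are my own.

```python
import math

# The 15 quiz scores from the table, treated as a population.
scores = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5, 10.5, 7, 11.5, 10, 10.5]

N = len(scores)
mu = sum(scores) / N                      # population mean
deviations = [x - mu for x in scores]     # these add up to zero (within rounding)
squared = [d ** 2 for d in deviations]    # squaring removes the signs

variance = sum(squared) / N               # divide by N for a population
sd = math.sqrt(variance)                  # square root restores the original units

print(f"mean = {mu:.2f}")
print(f"sum of squared deviations = {sum(squared):.4f}")
print(f"variance = {variance:.4f}, SD = {sd:.5f}")
```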
For lovers of formulas, here they are. The standard deviation of a population, σ, has population size N on the bottom of the fraction; the standard deviation of a sample, s, has sample size n minus 1 on the bottom of the fraction. If you’re not familiar with the ∑ notation (sigma or summation), ∑x² means square every data value and add the squares; ∑x²f means square every data value, multiply by the frequency, and add those products. For the notation, see ∑ Means Add ’em Up in Chapter 1.
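Formulas for Standard Deviation  

  of a List of Numbers  of a Frequency Distribution  
When the data set is the whole population  $\sigma = \sqrt{\dfrac{\sum (x-\mu)^2}{N}} = \sqrt{\dfrac{\sum x^2 - N\mu^2}{N}}$  $\sigma = \sqrt{\dfrac{\sum x^2 f - N\mu^2}{N}}$ 
When the data set is just a sample  $s = \sqrt{\dfrac{\sum (x-\bar{x})^2}{n-1}} = \sqrt{\dfrac{\sum x^2 - n\bar{x}^2}{n-1}}$  $s = \sqrt{\dfrac{\sum x^2 f - n\bar{x}^2}{n-1}}$ 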
Why are there two formulas on each row under “list of numbers”? The first formula is the definition, and the second is a shortcut for faster computations. Of course they’re mathematically equivalent; you could prove that if you wanted to.
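If you'd like to see why, here is a sketch for the population case. It assumes the shortcut is the common form with ∑x² − Nμ² on top of the fraction, and it uses the fact that ∑x = Nμ (the mean is the total divided by N).

```latex
\begin{aligned}
\sum (x-\mu)^2 &= \sum \left(x^2 - 2\mu x + \mu^2\right) \\
               &= \sum x^2 - 2\mu \sum x + N\mu^2 \\
               &= \sum x^2 - 2\mu (N\mu) + N\mu^2 \\
               &= \sum x^2 - N\mu^2
\end{aligned}
\qquad\Longrightarrow\qquad
\sigma = \sqrt{\frac{\sum (x-\mu)^2}{N}} = \sqrt{\frac{\sum x^2 - N\mu^2}{N}}
```

The sample case works the same way, with x̄ in place of μ and n−1 on the bottom of the fraction.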
Sir Ronald Fisher coined the term variance in 1918. He used the symbol σ² for the variance of a population, since Pearson had already assigned σ to the standard deviation, and the variance is the square of the SD.
The standard deviation will be the key to inferential statistics, starting in Chapter 8, but even within the realm of descriptive statistics there are some applications. In addition to this section, you’ll see an application in z-Scores, below.
Working with the quiz scores on your TI83 or TI84, you found that the population mean was μ = 9.7 and the population SD was σ = 3.1. What does this mean?
Just as a concept, the standard deviation gives you an idea of the expected variation from one member of the sample or population to the next. The SD in this example is about a third of the mean, so you expect some variation but not a lot. But can you do better than this? Yes, you can!
You can predict what percentage of the data will be within a certain number of standard deviations above or below the mean. In a normal distribution, 68% of the data are between one SD below the mean and one SD above the mean (μ±σ), 95% are within two SD of the mean (μ±2σ), and 99.7% are within three SD of the mean (μ±3σ).
This is the Empirical Rule or 68–95–99.7 Rule. Caution! It’s good for normal distributions only.
You’ll notice that the 68%, 95%, and 99.7% of data occur within approximately one, two, and three SD of the mean. More accurate figures are shown in the pictures, but for now we’ll just use the simple rule of thumb. You’ll learn how to make precise computations in Chapter 7.
It’s not a traditional part of the Empirical Rule, but another useful rule of thumb is that, in a normal distribution, about 50% of the data are within 2/3 of a SD above and below the mean.
Example 10: Adult women’s heights are normally distributed with μ = 65.5″ and σ = 2.5″. (By the way, different sources give different values for human heights, so don’t be surprised to see different figures elsewhere in this book.) How tall are the middle 95% of women?
Solution: The middle 95% of the normal distribution lies between two SD below and two SD above the mean. 2σ = 2×2.5 = 5″, and 65.5±5 = 60.5″ to 70.5″, so 95% of women are 60.5″ to 70.5″ tall.
Actually there are two interpretations. You can say that 95% of women are 60.5″ to 70.5″ tall, or you can say that if you randomly select one woman the probability that she’s 60.5–70.5″ tall is 95%. Any probability statement can be turned into a proportion statement, and vice versa. You’ll learn about this in Interpreting Probability Statements in Chapter 5.
Example 11: What fraction of women are 65.5″ to 68″ tall?
Solution: 68−65.5 = 2.5, so 68″ is one standard deviation above the mean. You know that 68% of a normal distribution is within μ±σ. You also know that the normal distribution is symmetric, so 68%/2 = 34% of women are within one SD below the mean, and 34% are within one SD above the mean. Therefore 34% of women are 65.5″ to 68″ tall.
You can combine the three diagrams above and show data in regions bounded by each whole number of standard deviations, like this:
Where do these figures come from? For example, how do we know that about 13.5% of the population is between one and two standard deviations below the mean in a normal distribution? Well, 95% is between two SD below and two SD above the mean. Half of 95% is 47.5%, so 47.5% of the population is between the mean and two SD below the mean. Similarly, about 68% is between one SD below and one SD above, so 68/2 = 34% is between the mean and one SD below. But if 47.5% is between μ−2σ and μ (call it Region A) and 34% is between μ−σ and μ, then the part of Region A that is not in the 34% is the part between μ−2σ and μ−σ, and that must be 47.5−34 = 13.5%. If you had an afternoon to kill, you could work out the other seven percentages.
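You don't really need an afternoon. The subtraction argument above can be sketched in Python like this; the dictionary names are just for illustration, and the figures are the rounded 68–95–99.7 values, not exact normal probabilities.

```python
# Empirical Rule percentages within 1, 2, and 3 SD of the mean.
within = {1: 68.0, 2: 95.0, 3: 99.7}

# By symmetry, half of each percentage lies on each side of the mean.
half = {k: pct / 2 for k, pct in within.items()}   # from the mean out to k SD

# Each region is what is left after removing the region inside it.
regions = {
    "mean to 1 SD": half[1],
    "1 SD to 2 SD": half[2] - half[1],
    "2 SD to 3 SD": half[3] - half[2],
    "beyond 3 SD": 50.0 - half[3],
}

for name, pct in regions.items():
    print(f"{name}: {pct:.2f}%")
```

The same four percentages apply on the other side of the mean, which gives all eight regions.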
With this diagram, you can work Example 11 more easily, directly reading off the 34% figure for women between mean height and one SD above the mean. You can also work more complicated examples, like this one.
Example 12: If you randomly select a woman, how likely is it that she’s taller than 70.5″?
Solution: 70.5−65.5 = 5.0, so 70.5″ is two SD above the mean. From the diagram, you see that 2.35+0.15 = 2.5% of the population is more than two SD above the mean. Answer: a randomly selected woman has a 2.5% chance of being more than 70.5″ tall.
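For comparison, here is a sketch of Examples 10 through 12 using Python's standard `statistics.NormalDist` (available in Python 3.8 and later) instead of the rule of thumb. The mean and SD are the ones given in Example 10; the variable names are my own.

```python
from statistics import NormalDist

# Heights of adult women as given in Example 10 (inches).
heights = NormalDist(mu=65.5, sigma=2.5)

# Example 10: proportion between 60.5 and 70.5 (within two SD of the mean).
middle = heights.cdf(70.5) - heights.cdf(60.5)

# Example 11: proportion between the mean and one SD above it.
mean_to_68 = heights.cdf(68) - heights.cdf(65.5)

# Example 12: proportion taller than 70.5.
taller = 1 - heights.cdf(70.5)

print(f"60.5 to 70.5: {middle:.4f}")
print(f"65.5 to 68:   {mean_to_68:.4f}")
print(f"above 70.5:   {taller:.4f}")
```

The results are close to the Empirical Rule's 95%, 34%, and 2.5%, but not identical, because the rule uses rounded figures.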
If you have a normal distribution, the Empirical Rule tells you how much of the population is in each region. What if you don’t have a normal distribution?
As you might expect, the portions of the population in the various regions depend on the shape of the distribution, but Chebyshev’s Inequality (or Chebyshev’s Rule) gives you a “worst case scenario”: no matter how skewed the distribution, at least 75% of the data are within 2 SD of the mean, and at least 89% are within 3 SD of the mean.
More generally, within k SD above and below the mean, you will find at least (1−1/k²)·100% of the data. (If you plug in k = 1, you’ll find that at least 0% of the data lie within one SD of the mean. Distributions where all the data are more than one SD away from the mean are unusual, but they do exist.)
Example 13: For the quiz scores, two standard deviations is 2×3.0579 = 6.1, so you expect at least 1−1/2² = 1−¼ = 75% of the quiz scores to be within the range 9.7±6.1 = 3.6 to 15.8. Remember that this is a worst case. In fact, 14 of the 15 numbers (93%) are within those limits.
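Here is a short Python sketch of Example 13, comparing Chebyshev's guaranteed minimum with what actually happens in the quiz scores. It uses the population SD (`statistics.pstdev`), and the variable names are illustrative.

```python
from statistics import mean, pstdev

scores = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5, 10.5, 7, 11.5, 10, 10.5]

k = 2                                   # number of SD on each side of the mean
mu = mean(scores)
sigma = pstdev(scores)                  # population standard deviation

bound = 1 - 1 / k ** 2                  # Chebyshev: at least this fraction
lower, upper = mu - k * sigma, mu + k * sigma

inside = [x for x in scores if lower <= x <= upper]
actual = len(inside) / len(scores)

print(f"limits: {lower:.1f} to {upper:.1f}")
print(f"Chebyshev guarantees at least {bound:.0%}")
print(f"actual: {len(inside)} of {len(scores)} = {actual:.0%}")
```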
The symbol is P followed by a number. For example, P35 or P_{35} denotes the 35th percentile, the member of the data set that is greater than or equal to 35% of the data.
Percentiles are most often used in measures of human development, like your child’s performance on standardized tests, or an infant’s length or weight.
Example 14: Your daughter takes a standardized reading test, and the report says that she is in the 85th percentile for her grade. Does this make you happy or sad? Solution: 85% of her grade read as well as she does, or less well; only 15% read better than she does. Presumably this makes you happy.
Example 15: Consider the data set 1, 4, 7, 8, 10, 13, 13, 22, 25, 28. (To find percentiles, you have to put the data set in order.)
(a) What is the percentile rank of the number 13? Solution: There are ten numbers in the data set, and seven of those are ≤13. Seven out of ten is 70%, so the percentile rank of 13 is 70, or “13 is at the 70th percentile”, or P70 = 13.
(b) Find P60 for this data set. Solution: What number is greater than or equal to 60% of the numbers in the data set? Counting up six numbers from the beginning, you find … 13 again. So 13 is both P60 and P70.
(Anomalies like this are usual when you have small data sets. It really doesn’t make sense to talk about percentiles unless you have a fairly large data set, typically a population like all third graders or all six-week-old infants.)
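Example 15 can be sketched in Python as follows. The two function names are hypothetical, and the code is only one straightforward reading of the wording above ("the member of the data set that is greater than or equal to p% of the data"); other definitions of percentile can give different answers.

```python
def percentile_rank(data, value):
    """Percent of the data that is less than or equal to value."""
    return 100 * sum(1 for x in data if x <= value) / len(data)

def percentile(data, p):
    """Smallest member of the data that is >= at least p percent of the data."""
    for x in sorted(data):
        if percentile_rank(data, x) >= p:
            return x

data = [1, 4, 7, 8, 10, 13, 13, 22, 25, 28]

print("percentile rank of 13:", percentile_rank(data, 13))
print("P60 =", percentile(data, 60))
print("P70 =", percentile(data, 70))
```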
The different definitions can give very different answers for small data sets. Nobody worries too much about this, because in practice you seldom compute percentiles against small data sets. (What does “18th percentile” mean in a set of only 12 numbers?) All the definitions give pretty much the same answer for larger data sets.
David Lane’s Percentiles (2010) [see “Sources Used” at end of book] gives three definitions of percentile and shows what difference they make. His Definition 2 is the one I use in this book.
To find quartiles by hand, put the data set in order and find the median. If you have an odd number of data points, strike out the median. Q1 is the median of the lower half, and Q3 is the median of the upper half.
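The by-hand procedure can be sketched in Python like this. The function name `quartiles` is my own, and the code follows the steps in the paragraph above (median of each half, leaving out the median itself when the count is odd); it is not meant as the only way to compute quartiles.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves method described above."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]              # everything below the middle position
    upper = values[half + n % 2:]      # skips the median itself when n is odd
    return median(lower), median(values), median(upper)

# Ten numbers (even count): the halves are the first five and the last five.
print(quartiles([1, 4, 7, 8, 10, 13, 13, 22, 25, 28]))

# Fifteen quiz scores (odd count): the median is struck out before splitting.
quiz = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5, 10.5, 7, 11.5, 10, 10.5]
print(quartiles(quiz))
```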
One fourth is 25% and three fourths is 75%, so Q1 = P25 and Q3 is P75. (I chose a definition of percentiles that makes this happen. Some authors use different definitions, which may give slightly different results.)
What, no Q2? There is a Q2, but two quarters is one half, or 50%, so the second quartile is better known as the median: 50% of the data are less than or equal to the 50th %ile, alias Q2, alias the median.
The quartiles and the median divide the data set into four equal parts. We sometimes use the word quartile in a way that reflects this: the “bottom quartile” means the part of the data set that is below Q1, and the “upper quartile” or “top quartile” means the part of the data set that is above Q3.
Q1 and Q3 are part of the five-number summary (later in this chapter). From Measures of Spread, you already know that they’re used to find the interquartile range, and later in this chapter you’ll use the IQR to make a box-whisker plot.
Just like percentiles, quartiles are defined slightly differently by different authors. Dr. Math gives a nice, clear rundown of different ways of computing quartiles in Defining Quartiles in The Math Forum (2002) [see “Sources Used” at end of book]. I follow Moore and McCabe’s method, which is also used by your TI83 or TI84.
You’ll use z-scores more than any other measure of position. (Remember that every measure of position measures the position of one data point within the sample or population that it is part of.)
How do you find out how many SD a number is above or below the mean of its data set? You subtract the mean, and then divide the result by the SD.
z-score within a sample: $z = \dfrac{x - \bar{x}}{s}$
z-score within a population: $z = \dfrac{x - \mu}{\sigma}$
Either way, it’s $z = \dfrac{\text{value} - \text{mean}}{\text{SD}}$
When you compute a z-score, the top and bottom of the fraction are both in the same units as the original data, and therefore the z-score itself has no units. z-scores are pure numbers.
What good are z-scores? You’ll use them in inferential statistics, starting in Chapter 9, but you can also use them in descriptive statistics.
For one thing, a z-score gives you economy in language. Instead of saying “at least 75% of the data in any distribution must lie between two standard deviations below the mean and two standard deviations above the mean”, you can say “at least 75% of the data lie between z = ±2.”
A z-score helps you determine whether a measurement is unusual. For instance, how good is an SAT verbal score of 300? Scores on the SAT verbal are normally distributed (ND) with mean of 500 and SD of 100, so z = (300−500)/100 = −2. The Empirical Rule tells you only 2½% of students score that low or lower.
And z-scores are also good for comparing apples and oranges, as the next example shows.
Example 16: You have two candidates for an entry-level position in your restaurant kitchen. Both have been to chef school, but different schools, and neither one has any experience. Chris presents you with a final exam score of 86, and Fran with a final exam score of 67. Which one do you hire?
At first glance, you’d go with the one who has the higher score. But wait! Maybe Fran with the 67 is actually better, and just went to a tougher school. So you ask about the average scores at the two schools. Chris’s school had a mean score of 76, and Fran’s school had a mean score of 59. Assuming that the students at the two schools had equal innate ability, Fran went to a tougher school than Chris.
Chris scored 10 points above the school average, while Fran scored only 8 points above the school average. Now do you hire Chris? Not yet! Maybe there was more variability in Chris’s class, so 10 points above the average is no big deal, but there was less variability in Fran’s, so 8 points above the mean is a big deal. So you dig further and find that the standard deviations of the two classes were 8 and 4. At this point, you make a table:
Chris  Fran  

Candidate’s score  86  67 
School mean  76  59 
School SD  8  4 
z-score  (86−76)/8 = 1.25  (67−59)/4 = 2.00 
The z-scores tell you that Fran stands higher in Fran’s class than Chris stands in Chris’s class. Assuming that the two classes as a whole were of equal ability, Fran is the stronger candidate.
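The comparison in the table is easy to sketch in Python. The function `z_score` and the dictionary layout are just for illustration; the numbers come from Example 16 and from the SAT example earlier in this section.

```python
def z_score(value, mean, sd):
    """How many standard deviations value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

candidates = {
    "Chris": {"score": 86, "mean": 76, "sd": 8},
    "Fran": {"score": 67, "mean": 59, "sd": 4},
}

z = {name: z_score(c["score"], c["mean"], c["sd"]) for name, c in candidates.items()}
best = max(z, key=z.get)          # candidate who stands highest within their class

for name, value in z.items():
    print(f"{name}: z = {value:.2f}")
print("stronger candidate:", best)

# The SAT verbal example: a score of 300 with mean 500 and SD 100.
sat_z = z_score(300, 500, 100)
print("SAT z =", sat_z)
```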
Definition: The five-number summary of a data set is the minimum value, Q1, median, Q3, and maximum value (in order).
The five-number summary combines measures of center (the median) and spread (the interquartile range and the range). A plot of the five-number summary, called a box-whisker diagram (below), shows you the shape of the data set.
On the TI-83 or TI-84, the five-number summary is the second output screen from 1-Var Stats. Caution! Remember that the second screen is meaningful only for a simple list of numbers or an ungrouped distribution, not for a grouped distribution. To produce a five-number summary, you need all the original data points.
Example 17: Here is the second output screen from the quiz scores earlier in this chapter. The five-number summary is 2.5, 8, 10.5, 11.5, 15.
The median is 10.5, meaning that half the students scored 10.5 or below and half scored 10.5 or above.
The interquartile range is Q3−Q1 = 11.5−8 = 3.5. Half of the students scored between 8 and 11.5.
An outlier is a data value that is well separated from most of the data.
Conventionally, the values Q1−1.5×IQR and Q3+1.5×IQR (first quartile minus 1½ times interquartile range, and third quartile plus 1½ times interquartile range) are called fences, and any data points outside the fences are considered outliers.
Example 18: Here again are the quiz scores from earlier in this chapter:
10.5 13.5 8 12 11.3 9 9.5 5 15 2.5 10.5 7 11.5 10 10.5
Find the outliers, if any.
The five-number summary, above, gave you the quartiles: Q1 = 8 and Q3 = 11.5. The interquartile range is 11.5−8 = 3.5, and 1.5 times that is 5.25. The fences are 8−5.25 = 2.75 and 11.5+5.25 = 16.75. All the data points but one lie within the fences; only 2.5 is outside. Therefore 2.5 is the only outlier in this data set.
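The fence computation in Example 18 looks like this in Python. It is a sketch that takes Q1 and Q3 as given from the five-number summary rather than recomputing them; the variable names are my own.

```python
scores = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5, 10.5, 7, 11.5, 10, 10.5]

q1, q3 = 8, 11.5                  # quartiles from the five-number summary
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences counts as an outlier.
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

print("IQR =", iqr)
print("fences:", lower_fence, "and", upper_fence)
print("outliers:", outliers)
```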
You can find outliers more easily by using your TI83 or TI84; see below.
Why do you care about outliers? First off, an outlier might be a mistake. You should always check all your data carefully, but check your outliers extra carefully.
But if it’s not a mistake, an outlier may be the most interesting part of your data set. Always ask yourself what an outlier may be trying to tell you. For example, does this quiz score represent a student who is trying but needs some extra help, or one who simply didn’t prepare for the quiz?
What do you do with outliers? One thing you definitely don’t do: Don’t just throw outliers away. That can really give a false picture of the situation.
But suppose you have to make some policy decision based on your analysis, or run a hypothesis test (Chapters 10 and 11) and announce whether some claim is true or false?
One way is to do your analysis twice, once with the outliers and once without, and present your results in a two-column table. Anyone who looks at it can judge how much difference the outliers make. If you’re lucky, the two columns are not very different, and whatever decision must be made can be made with confidence.
But maybe the two columns are so different that including or excluding the outliers leads to different decisions or actions. In that case, you may need to start over with a larger sample, change your data collection protocol, or call in a professional statistician.
For more on handling outliers, see Outliers (Simon 2000d) [see “Sources Used” at end of book].
The five-number summary packs a lot of information, but it’s usually easier to grasp a summary through a picture if possible. A graph of the five-number summary is called a boxplot or box-whisker diagram.
The box-whisker diagram was invented by John Tukey in 1970.
A box-whisker diagram has a horizontal axis, which is the number line of the data, and the number line need not start at zero. Either the axis or the chart as a whole needs a title, but there’s usually no need for a title on both. There is no vertical axis.
For the graph itself, first identify any outliers and mark them as squares or crosses. Then draw a box with vertical lines at Q1, the median, and Q3. Lastly, draw whiskers from Q3 to the greatest value in the data set that isn’t an outlier, and from Q1 to the smallest value in the data set that isn’t an outlier.
Example 19: Let’s look at a box-whisker plot of those same quiz scores, which were
10.5 13.5 8 12 11.3 9 9.5 5 15 2.5 10.5 7 11.5 10 10.5
The five-number summary is reproduced at right. You recall from the previous section that there is one outlier, 2.5, so the smallest number in the data set that isn’t an outlier is 5.
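If you want to double-check where the whiskers should end before you draw them, a small Python sketch like this will do it. It assumes the fences 2.75 and 16.75 found for these quiz scores, and the names are illustrative.

```python
scores = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5, 10.5, 7, 11.5, 10, 10.5]

lower_fence, upper_fence = 2.75, 16.75      # fences for the quiz scores

# Whiskers stop at the most extreme values that are still inside the fences.
inside = [x for x in scores if lower_fence <= x <= upper_fence]
left_whisker_end = min(inside)
right_whisker_end = max(inside)

# Points outside the fences are plotted individually.
outliers = sorted(x for x in scores if x < lower_fence or x > upper_fence)

print("left whisker ends at", left_whisker_end)
print("right whisker ends at", right_whisker_end)
print("plot separately:", outliers)
```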
Here’s a plot that I made with StatTools from Palisade Corporation:
The box-whisker plot is almost as good as a histogram for showing you the shape of a distribution. If one whisker is longer than the other, and especially if there are outliers on the same side as the long whisker, the distribution is skewed in that direction. If the whiskers are about the same length and there are no outliers, but one side of the box is longer than the other, that usually indicates skew in that direction as well.
Example 20: In the boxplot of quiz scores, just above, you see an outlier on the left side, and the left side of the box is longer than the right. That indicates that the distribution is left skewed.
You can use your TI-83 or TI-84 to make a box-whisker plot. The calculator comes with that ability (see Box-Whisker Plots on TI-83/84), but it’s easier to use MATH200A Program part 2. See Getting the Program for instructions on getting the program into your calculator.
(If you have a TI-89, see Box-Whisker Plots on TI-89.)
To make a box-whisker plot with the program, begin by entering the numbers into a statistics list, such as L1. (If you have an ungrouped frequency distribution, put the numbers in one list and the frequencies in a second list. You need the original data for a boxplot, so you can’t make a boxplot of a grouped frequency distribution.)
Now press [PRGM]. If you can see MATH200A in the list, press its menu number; otherwise, use the [▼] or [▲] key to get to MATH200A, and press [ENTER].
With the program name on your home screen, press [ENTER] (again) to run the program, and yet again to dismiss the title screen. You’ll then see a menu. Press [2] for box-whisker plot.
The program asks whether you have one, two, or three samples. Select 1, since that’s what you have.
The program wants to know whether you have a plain list of numbers or a grouped frequency distribution. Since you have a plain list, choose 1.
The program needs to know which list holds the numbers to be plotted.
Finally, the program presents the box-whisker plot.
When you have a box-whisker plot on your screen, whether you used MATH200A part 2 or the calculator’s native commands, if you see any outliers press [TRACE] and then [◄] or [►] to find which data points are outliers.
(For the TI-89, see Box-Whisker Plots on TI-89. If you prefer to use Excel to find outliers, see Normality Check and Finding Outliers in Excel.)
After pressing the [TRACE] key, you can get the five-number summary by pressing [◄] or [►] repeatedly. If there are outliers at the left, use the lowest one for the minimum (first number in the five-number summary); if there are outliers at the right, use the highest one for the maximum (last number in the five-number summary).
Overview: With numeric data, the goal of descriptive stats is to show shape, center, spread, and outliers.
(The online book has live links to all of these.)
Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.
Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.
Ages  Frequency 

20 – 29  34 
30 – 39  58 
40 – 49  76 
50 – 59  187 
60 – 69  254 
70 – 79  241 
80 – 89  147 
The grouped frequency distribution at right is the ages reported by a sample of Roman Catholic nuns, from Johnson and Kuby (2004, 67) [see “Sources Used” at end of book].
(a) Approximate the mean and SD of the ages of these nuns, to two decimal places, and find the sample size.
(b) Explain why a boxplot of this distribution is a bad idea.
Course  Credits  Grade 

Statistics  3  A 
Calculus  4  B+ 
Microsoft Word  1  C− 
Microbiology  3  B− 
English Comp  3  C 
Commuting Distances in km  

5  15  23  12  9 
12  22  26  31  21 
11  19  16  45  12 
8  26  18  17  1 
16  24  15  20  17 
Maria took a traditional IQ test and scored 129. On that test, the mean is 100 and the SD is 15.
From the test scores, who is more intelligent? Explain.
Test Scores  Frequencies, f  

470.0–479.9  15  
480.0–489.9  22  
490.0–499.9  29  
500.0–509.9  50  
510.0–519.9  38 
Updated 11 Jan 2015
Intro: When you get two numbers from each member of the sample (bivariate numeric data), you make a plot to look for a relationship between them. If a straight line seems like a good fit for the plotted points, we say that they follow a linear model. In this chapter, you’ll learn when to use a linear model, and how to find the best one.
The chapter intro talks about points following a “linear model”. But what is a linear model, and what does it mean to follow one? Well, since a linear model is one kind of mathematical model, let’s talk a little bit about mathematical models.
You know what a model is in general, right? A copy of the original, usually smaller and with unimportant details left out. Think of model airplanes, or architect’s models of buildings.
A mathematical model is like that. Real Life is Complicated,™ and mathematical models help us manage those complications.
Definition: A mathematical model is a mathematical description of something in the real world. An object or process or data set follows a model if the calculations you do with the model match reality closely enough to be useful.
You’ve already met one model in Chapter 3: the grouped frequency distribution. Instead of dealing with all the data points, you do calculations using the midpoint of each class. That gives you approximate mean and SD, but the approximation is close enough to be useful.
The MathIsFun site has a nice example of modeling the space inside a cardboard box, going beyond the h×w×l formula; see Mathematical Models.
You’ll meet plenty more models in this book: probability models in Chapter 5, several discrete models in Chapter 6, and the normal model in Chapter 7.
But in this chapter we’re concerned with the linear model.
Because the graph of y = ax+b is a straight line, we can also call it a straight-line model, and we say that x and y have a straight-line relationship in the model.
The linear model is a good one if it describes the data well enough to let you make useful calculations.
Usually you have some idea that your x variable can help predict your y variable, so you call x the explanatory variable and y the response variable. (Other names are independent variable and dependent variable.)
Set floating point mode, if you haven’t already.  [MODE] [▼] [ENTER] 
Go to the home screen.  [2nd MODE makes QUIT] [CLEAR] 
Turn on diagnostics with the DiagnosticOn command.  [2nd 0 makes CATALOG] [x⁻¹] Don’t press the [ALPHA] key, because the CATALOG command has already put the calculator in alpha mode. Scroll down to DiagnosticOn and press [ENTER] twice. 
The calculator will remember these settings when you turn it off: next time you can start with Step 1.
Before you even run a regression, you should first plot the points and see whether they seem to lie along a straight line. If the distribution is obviously not a straight line, don’t do a linear regression. (Some other form of regression might still be appropriate, but that is outside the scope of this course.)
Let’s use this example from Sullivan (2011, 179) [see “Sources Used” at end of book]: the distance a golf ball travels versus the speed with which the club head hit it.
Clubhead speed, mph (x)  100  102  103  101  105  100  99  105 

Distance, yards (y)  257  264  274  266  277  263  258  275 
Turn off other plots.  [Y=] Cursor to each highlighted = sign or Plot number and press [ENTER] to deactivate. 
Set the format screen.  Press [2nd ZOOM makes FORMAT]. Just select everything in the left column. 
Enter the numbers in two statistics lists.  [STAT] [1] selects the list-edit screen. Cursor onto the label L1 at top of first column, then [CLEAR] [ENTER] erases the list. Enter the x values. Cursor onto the label L2 at top of second column, then [CLEAR] [ENTER] erases the list. Enter the y values. 
Set up the scatterplot.  [2nd Y= makes STAT PLOT] [1] [ENTER] turns Plot 1 on. [▼] [ENTER] selects scatterplot. [▼] [2nd 1 makes L1] ties list 1 to the x axis. [▼] [2nd 2 makes L2] ties list 2 to the y axis. (Leave the square as the selected mark for plotting.) 
Plot the points.
I have the grid turned on in some of these
pictures, but earlier I told you
to turn it off. That’s simplest. If you want the
grid, you can turn it on, but then you’ll have to adjust the
grid spacing for almost every plot. To adjust grid spacing, press
[ 
[ZOOM ] [9 ] automatically adjusts the window
frame to fit the data.

Check your data entry by tracing the points.  [TRACE] shows you the first (x,y) pair, and then [►] shows you the others. They’re shown in the order you entered them, not necessarily from left to right. 
A scatterplot on paper needs labels (numbers) and titles on both axes; the x and y axes typically won’t start at 0. Here’s the plot for this data set. (The horizontal lines aren’t needed when you plot on graph paper.)
When the same (x,y) pair occurs multiple times, plot the extra ones slightly offset. This is called jitter. In the example at the right, the point (6,6) occurs twice.
If the data points don’t seem to follow a straight line reasonably well, STOP! Your calculator will obey you if you tell it to perform a linear regression, but if the points don’t actually fit a straight line then it’s a case of “garbage in, garbage out.”
For instance, consider this example from DeVeaux, Velleman, Bock (2009, 179) [see “Sources Used” at end of book]. This is a table of recommended f/stops for various shutter speeds for a digital camera:
Shutter speed (x)  1/1000  1/500  1/250  1/125  1/60  1/30  1/15  1/8 

f/stop (y)  2.8  4  5.6  8  11  16  22  32 
If you try plotting these numbers yourself, enter the shutter speeds as fractions for accuracy: don’t convert them to decimals yourself. The calculator will show you only a few decimal places, but it maintains much greater precision internally.
You can see from the plot at right that these data don’t fit a straight line. There is a distinct bend near the left. When you have anything with a curve or bend, linear regression is wrong. You can try other forms of regression in your calculator’s menu, or you can transform the data as described in DeVeaux, Velleman, Bock (2009, ch 10) [see “Sources Used” at end of book] and other textbooks.
Set up to calculate statistics.  [STAT] [►] [4] pastes LinReg(ax+b) to the home screen. 
[2nd 1 makes L1] [,] [2nd 2 makes L2] defines L1 as x values and L2 as y values. If you have the “wizard” interface, leave FreqList blank, or press [DEL] if something is already filled in.  
Set up to store regression equation.  [,] [VARS] [►] [1] [1] pastes Y1 into the LinReg command. 
Show your work! Write down the whole command (LinReg(ax+b) L1,L2,Y1 in this case), not just LinReg or LinReg(ax+b). 
Press [ENTER].  The calculator shows correlation and regression statistics and pastes the regression equation into Y1. 
Your input screen should look like this, for the “wizard” and non-wizard interfaces:
Write down the slope a, the y intercept b, the coefficient of determination R², and the correlation coefficient r. (A decent rule of thumb is four decimal places for slope and intercept, and two for r and R².)
a = 3.1661, b = −55.7966
R² = 0.88, r = 0.94
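If you want to see where those four numbers come from, here is a sketch in plain Python using the standard least-squares formulas (slope = Sxy/Sxx, intercept = ȳ − slope·x̄, r = Sxy/√(Sxx·Syy)). It is not what the calculator does internally, as far as I know, just an independent check on the golf data; the variable names are my own.

```python
import math

speed = [100, 102, 103, 101, 105, 100, 99, 105]        # x, club-head speed in mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]    # y, distance in yards

n = len(speed)
x_bar = sum(speed) / n
y_bar = sum(distance) / n

# Sums of squares and cross-products of deviations from the means.
sxx = sum((x - x_bar) ** 2 for x in speed)
syy = sum((y - y_bar) ** 2 for y in distance)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, distance))

a = sxy / sxx                      # slope
b = y_bar - a * x_bar              # y intercept
r = sxy / math.sqrt(sxx * syy)     # correlation coefficient

print(f"a = {a:.4f}, b = {b:.4f}")
print(f"R² = {r ** 2:.2f}, r = {r:.2f}")
```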
Now let’s take a look in depth at each of those.
“Several sets of (x,y) [pairs], with the correlation coefficient
for each set. Note that correlation reflects the noisiness and
direction of a linear relationship (top row), but not the slope of
that relationship (middle), nor many aspects of nonlinear
relationships (bottom).”
source:
Correlation and Dependence [see “Sources Used” at end of book]
Look first at r, the coefficient of linear correlation. r can range from −1 to +1 and measures the strength of the association between x and y. A positive correlation or positive association means that y tends to increase as x increases, and a negative correlation or negative association means that y tends to decrease as x increases. The closer r is to 1 or −1, the stronger the association. We usually round r to two decimal places.
Karl Pearson developed the formula for the linear correlation coefficient in 1896. The symbol r is due to Sir Francis Galton in 1888.
For real-world data, 0.94 is a pretty strong correlation. But you might wonder whether there’s actually a general association between clubhead speed and distance traveled, as opposed to just the correlation that you see in this sample. Decision Points for Correlation Coefficient, later in this chapter, shows you how to answer that question.
z-scores are pure numbers without units, and therefore r also has no units. You can interchange the x’s and y’s in the formula without changing the result, and therefore r is the same regardless of which variable is x and which is y.
Why is r positive when data points trend up to the right and negative when they trend down to the right? The product (x−x̅)(y−y̅) explains this. When points trend up to the right, most are in the lower left and upper right quadrants of the plot. In the lower left, x and y are both below average, x−x̅ and y−y̅ are both negative, and the product is positive. In the upper right, x and y are both above average, x−x̅ and y−y̅ are both positive, and again the product is positive. The product is positive for most points, and therefore r is positive when the trend is up to the right.
On the other hand, if the data trend down to the right, most points are in the upper left (where x is below average and y is above average, x−x̅ is negative, y−y̅ is positive, and the product is negative) and the lower right (where x−x̅ is positive, y−y̅ is negative, and the product is also negative.) Since the product is negative for most points, r is negative when data trend down to the right.
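If you’d like to see the sign argument in numbers, here is a minimal Python sketch using this chapter’s golf data. It assumes the z-score form of the formula, r = ∑(z_x·z_y)/(n−1); the function name correlation is just an illustrative choice, not anything built into the calculator.

```python
from statistics import mean, stdev

# Golf data from this chapter: clubhead speed (mph) and distance (yards).
speed = [100, 102, 103, 101, 105, 100, 99, 105]
dist = [257, 264, 274, 266, 277, 263, 258, 275]

def correlation(xs, ys):
    """Sample r as the sum of z-score products divided by n - 1 (assumed form)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

r = correlation(speed, dist)

# Count the points whose product (x - x_bar)(y - y_bar) is positive.
positive = sum((x - mean(speed)) * (y - mean(dist)) > 0
               for x, y in zip(speed, dist))

print("r =", round(r, 2))
print(positive, "of", len(speed), "products are positive")
```

Most of the products are positive, which is why r comes out positive for data that trend up to the right. Swapping the two lists gives the same r.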
Be careful in your interpretation! No matter how strong your r might be, say that changes in the y variable are associated with changes in the x variable, not “caused by” it. Correlation is not causation is your mantra.
It’s easy to think of associations where there is no cause. For example, if you make a scatterplot of US cities with x as number of books in the public library and y as number of murders, you’ll see a positive association: number of murders tends to be higher in cities with more library books. Does that mean that reading causes people to commit murder, or that murderers read more than other people? Of course not! There is a lurking variable here: population of the city.
When you have a positive or negative association, there are four possibilities: x might cause changes in y, y might cause changes in x, lurking variables might cause changes in both, or it could just be coincidence, a random sample that happens to show a strong association even though the population does not.
used by permission; source: http://xkcd.com/552/ (accessed 2014-09-15)
If correlation is not causation, then how can we establish causation? For example, how do we know that smoking causes lung cancer in humans? Obviously we can’t perform an experiment, for ethical reasons. Sir Austin Bradford Hill laid down nine criteria for establishing causation in a 1965 paper, The Environment and Disease: Association or Causation? [see “Sources Used” at end of book] Short summaries of the “Bradford Hill criteria” can be found in many places on the Web, including Steve Simon’s (2000b) Causation [see “Sources Used” at end of book].
Write the equation of the line using ŷ (“yhat”), not y, to indicate that this is a prediction. b is the y intercept, and a is the slope. Round both of them to four decimal places, and write the equation of the line as
ŷ = 3.1661x − 55.7966
(Don’t write 3.1661x + −55.7966.)
These numbers can be interpreted pretty easily. Business majors will recognize them as intercept = fixed cost and slope = variable cost, but you can interpret them in nonbusiness contexts just as well.
The slope, a or b_{1} or m, tells how much ŷ increases or decreases for a one-unit increase in x. In this case, your interpretation is “the ball travels about an extra 3.17 yards when the club speed is 1 mph greater.” The slope and the correlation coefficient always have the same sign. (A negative slope would mean that y decreases that many units for every one-unit increase in x.)
The intercept, b or b_{0}, says where the regression line crosses the y axis: it’s the value of ŷ when x is 0. Be careful! The y intercept may or may not be meaningful. In this case, a clubhead speed of zero is not meaningful. In general, when the measured x values don’t include 0 or don’t at least come pretty close to it, you can’t assign a real-world interpretation to the intercept. In this case you’d say something like “the intercept of −55.7966 has no physical interpretation because you can’t hit a golf ball at 0 mph.”
Here’s an example where the y intercept does have a physical meaning. Suppose you measure the gross weight of a UPS truck (y) with various numbers of packages (x) in it, and you get the regression equation ŷ = 2.17x+2463. The slope, 2.17, is the average weight per package, and the y intercept, 2463, is the weight of the empty truck.
For the meaning of ∑, see ∑ Means Add ’em Up in Chapter 1.
Traditionally, calculus is used to come up with those equations, but all that’s really necessary is some algebra. See Least Squares — the Gory Details if you’d like to know more.
The second formula for the slope is kind of neat because it connects the slope, the correlation coefficient, and the SD of the two variables.
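Here is a short Python sketch of both slope formulas, using the golf data from this chapter. It assumes the usual least-squares forms, slope = ∑(x−x̅)(y−y̅) / ∑(x−x̅)² and slope = r·s_y/s_x, with intercept = y̅ − slope·x̅; the variable names are illustrative only.

```python
from statistics import mean, stdev

speed = [100, 102, 103, 101, 105, 100, 99, 105]   # x, mph
dist = [257, 264, 274, 266, 277, 263, 258, 275]   # y, yards

x_bar, y_bar = mean(speed), mean(dist)

# Sums of squares and cross products about the means.
s_xx = sum((x - x_bar) ** 2 for x in speed)
s_yy = sum((y - y_bar) ** 2 for y in dist)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, dist))

# First form: slope straight from the sums.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Second form: slope from r and the two standard deviations.
r = s_xy / (s_xx * s_yy) ** 0.5
slope_from_r = r * stdev(dist) / stdev(speed)

print(round(slope, 4), round(intercept, 4), round(slope_from_r, 4))
```

Both forms give the same slope, and the rounded results match the equation written above, ŷ = 3.1661x − 55.7966.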
The last number to look at (third on the screen) is R², the coefficient of determination. (The calculator displays r², but the capital letter is standard notation.) R² measures the quality of the regression line as a means of predicting ŷ from x: the closer R² is to 1, the better the line. Another way to look at it is that R² measures how much of the total variation in y is predicted by the line.
In this case R² is about 0.88, so your interpretation is “about 88% of the variation in distance traveled is associated with variation in clubhead speed.” Statisticians say that R² tells you how much of the variation in y is “explained” by variation in x, but if you use that word remember that it means a numerical association, not necessarily a cause-and-effect explanation. It’s best to stick with “associated” unless you have done an experiment to show that there is cause and effect.
There’s a subtle difference between r and R², so keep your interpretations straight. r talks about the strength of the association between the variables; R² talks about what part of the variation in the y variable is associated with variation in the x variable, and how well the line predicts y from x. Don’t use any form of the word “correlated” when interpreting R².
Only linear regression will have a correlation coefficient r, but any type of regression — fitting any line or curve to a set of data points — will have a coefficient of determination R² that tells you how well the regression equation predicts y from the independent variable(s). Steve Simon (1999b) gives an example for nonlinear regression in Rsquared [see “Sources Used” at end of book].
In straight-line regression, R² is the square of r, so if you want a formula just compute r and square the result.
Show line with original data points. 
[GRAPH ] 
What is this line, exactly? It’s the one unique line that fits the plotted points best. But what does “best” mean?
The same four points on left and right. The vertical distance
from each measured data point to the line, y−ŷ, is
called the residual for that x value. The line on the right is better
because the residuals are smaller.
source: Dabes and Janik (1999, 179) [see “Sources Used” at end of book]
For each plotted point, there is a residual equal to y−ŷ, the difference between the actual measured y for that x and the value predicted by the line. Residuals are positive if the data point is above the line, or negative if the data point is below the line.
You can think of the residuals as measures of how bad the line is at prediction, so you want them small. For any possible line, there’s a “total badness” equal to taking all the residuals, squaring them, and adding them up. The least squares regression line means the line that is best because it has less of this “total badness” than any other possible line. Obviously you’re not going to try different lines and make those calculations, because the formulas built into your calculator guarantee that there’s one best line and this is it.
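You don’t need to try other lines yourself, but a quick Python sketch can show what “less total badness” means. The two alternative lines below are hypothetical, picked only for comparison, and total_badness is an illustrative name.

```python
speed = [100, 102, 103, 101, 105, 100, 99, 105]   # x, mph
dist = [257, 264, 274, 266, 277, 263, 258, 275]   # y, yards

def total_badness(slope, intercept, xs, ys):
    """Sum of squared residuals (y - y_hat) for the line y_hat = slope*x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# The regression line from this chapter (coefficients rounded to four places).
best = total_badness(3.1661, -55.7966, speed, dist)

# Two hypothetical competitors: a steeper line, and the same line moved up 1 yard.
steeper = total_badness(3.3, -69.4, speed, dist)
shifted = total_badness(3.1661, -54.7966, speed, dist)

print(round(best, 2), round(steeper, 2), round(shifted, 2))
```

Whatever other slope and intercept you try, the sum of squared residuals comes out larger than it does for the least squares line.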
Carl Friedrich Gauss developed the method of least squares in a paper published in 1809.
I would like you to know the material in this section, but it's not part of the MATH200 syllabus so I don’t require it. No homework or quiz problems will draw from this section. You will, however, need to calculate individual residuals; see Finding Residuals, below.
“No regression analysis is complete without a display of the residuals to check that the linear model is reasonable.”
DeVeaux, Velleman, Bock (1999, 227) [see “Sources Used” at end of book]
The residuals are automatically calculated during the regression. All you have to do is plot them on the y axis against your existing x data. This is an important final check on your model of the straight-line relationship.
Turn off other plots.  Press [Y= ]. Cursor to the highlighted = sign next to Y1 and press [ENTER ]. Cursor to PLOT1 and press [ENTER ].
Set up the plot of residuals against the x data.  Set up Plot 2 for the residuals. Press [2nd Y= makes STAT PLOT ] [▼ ] [ENTER ] [ENTER ] to turn on Plot 2. Press [▼ ] [ENTER ] to select a scatterplot. The x’s are still in L1, so press [2nd 1 makes L1 ] [ENTER ]. In this plot, the y’s will be the residuals: press [2nd STAT makes LIST ], cursor up to RESID, and press [ENTER ] [ENTER ].
Display the plot.  [ZOOM ] [9 ] displays the plot.
You want the plot of residuals versus x to be “the most boring scatterplot you’ve ever seen.” (DeVeaux, Velleman, Bock 2009, 203) [see “Sources Used” at end of book] “It shouldn’t have any interesting features, like a direction or shape. It should stretch horizontally, with about the same amount of scatter throughout. It should show no bends, and it should have no outliers. If you see any of these features, find out what the regression model missed.”
Don’t worry about the size of the residuals, because [ZOOM ] [9 ] adjusts the vertical scale so that they take up the full screen.
If the residuals are more or less evenly distributed above and below the axis and show no particular trend, you were probably right to choose linear regression. But if there is a trend, you have probably forced a linear regression on nonlinear data. If your data points looked like they fit a straight line but the residuals show a trend, it probably means that you took data along a small part of a curve.
Here there is no bend and there are no outliers. The scatter is pretty consistent from left to right, so you conclude that distance traveled versus clubhead speed really does fit the straight-line model.
Refer back to the scatterplot of f/stop against shutter speed. I said then that it was not a straight line, so you could not do a linear regression. If you missed the bend in the scatterplot and did a regression anyway, you’d get a correlation coefficient of r = 0.98, which would encourage you to rely on the bad regression. But plotting the residuals (at right) makes it crystal clear that linear regression is the wrong type for this data set.
This is a textbook case (which is why it was in a textbook): there’s a clear curve with a bend, variation on both sides of the x axis is not consistent, and there’s even a likely outlier.
I said in Step 2 that the coefficient of determination measures the variation in measured y that’s associated with the variation in measured x. Now that you understand the residuals, I can make that statement more precise and perhaps a little easier to understand.
The set of measured y values has a spread, which can be measured by the standard deviation or the variance. It turns out to be useful to consider the variation in y’s as their variance. (You remember that the variance is the square of the standard deviation.)
The total variance of the measured y’s has two components: the so-called “explained” variation, which is the variation along the regression line, and the “unexplained” variation, which is the variation away from the regression line. The “explained” variation is simply the variance of the ŷ’s, computing ŷ for every x, and the “unexplained” variation is the variance of the residuals. Those two must add up to the total variance of the measured y’s, which means that as percentages of the variation in y they must add to 100%. So R² is the percent of “explained” variation in the regression, and 100%−R² is the percent of “unexplained” variation.
R² = s²_{ŷ} / s²_{y}  and  1 − R² = s²_{e} / s²_{y}
Now I can restate what you learned in Step 2. R² is 88% because 88% of the variance in y is associated with the regression line, and the other 12% must therefore be the variance in the residuals. This isn’t hard to verify: do a 1-VarStats on the list of measured y’s and square the standard deviation to get the total variance in y, s²_{y} = 59.93. Then do 1-VarStats on the residuals list and square the standard deviation to get the “unexplained” variance, s²_{e} = 7.12. The ratio of those is 7.12/59.93 = 0.12, which is 1−R². Expressing it as a percentage gives 100%−R² = 12%, so 12% of the variation in measured y’s is “unexplained” (due to lurking variables, measurement error, etc.).
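The same check can be sketched in Python instead of on the calculator. This is only an illustration with the chapter’s golf data: it recomputes the line from the data, then compares the variance of the measured y’s, of the ŷ’s, and of the residuals. The variable names are my own choices.

```python
from statistics import mean, variance

speed = [100, 102, 103, 101, 105, 100, 99, 105]   # x, mph
dist = [257, 264, 274, 266, 277, 263, 258, 275]   # y, yards

# Least squares line computed from the data (no rounding).
x_bar, y_bar = mean(speed), mean(dist)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, dist))
         / sum((x - x_bar) ** 2 for x in speed))
intercept = y_bar - slope * x_bar

y_hat = [slope * x + intercept for x in speed]     # predicted values
resid = [y - yh for y, yh in zip(dist, y_hat)]     # residuals y - y_hat

var_y = variance(dist)      # total variance of measured y's
var_hat = variance(y_hat)   # "explained" variance
var_e = variance(resid)     # "unexplained" variance

r_squared = var_hat / var_y
print(round(var_y, 2), round(var_hat, 2), round(var_e, 2))
print(round(r_squared, 2), round(var_e / var_y, 2))
```

The two component variances add up to the total, and the two ratios add up to 1, matching the 88% and 12% above.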
Summary: The regression line represents the model that best fits the data. One important reason for doing the regression in the first place is to answer the question, what average y value does the model predict for a given x? This page shows you two methods of answering that question.
You can make predictions while examining the graph of the regression line on the TI83/84 or TI89.
Advantages to this method: aside from being pretty cool, it avoids rounding errors, and it’s very fast for multiple predictions.
Activate tracing on the regression line.  [TRACE ] 
Look in the upper left corner to make sure that the regression equation is displayed.  If you see P:L1,L2, press [▲ ] to display the regression equation.
Enter the x value.  Press the black-on-white numeric keys including [(−) ] and decimal point if needed. As soon as you press the first number, you’ll see a large X= appear at the bottom left of the screen. Enter any additional digits and press [ENTER ]. The TI83/84 displays the predicted average y value (ŷ) at the bottom right and puts a blinking cursor at that point on the regression line.
Caution: ŷ = 267.1 yards is the predicted or expected average distance for a clubhead speed of 102 mph. But that does not mean any particular golf ball hit at that speed will travel that exact distance. You can think of ŷ as the average travel distance that you’d expect for a whole lot of golf balls hit at that speed.
Caution: A regression equation is valid only within the range of actual measured x values, and a little way left and right of that range. If you try to go too far outside the valid range, the calculator will display ERR:INVALID.
It’s not just being cranky. The line describes the points you measured, so it’s usable between your minimum and maximum x values and maybe a little way outside those limits. But unless you have very solid reasons why the same straight-line model is good beyond that range, you can’t extrapolate.
Take a look at this graph of men’s and women’s winning times in the Olympic 100-meter dash from 1928 to 2012, which I made from data compiled by Mike Rosenbaum [see “Sources Used” at end of book]. (The women’s 100 m dash became an Olympic event in 1928.)
From this you can reasonably guess that if women had run in the 1924 Olympics, the winner would have finished in around 12.2 or 12.3 seconds. And the 2016 winner will probably finish in around 11.5 seconds. But the further you go outside your measured data, the riskier your predictions.
Will men’s and women’s times generally continue to decrease? Probably: training will get better, nutrition will improve, global communications will make it less likely that a stellar runner goes undiscovered. But will the decrease follow a straight line? Certainly not! Think about it for a minute. If times keep decreasing on a straight line, eventually they’ll cross the x axis and go negative. Runners will finish the race before they start it! So obviously the straight-line model breaks down; the only question is where. You don’t know, and you can’t know. All you know is that it’s not safe to extrapolate.
Bogus extrapolations give statistics a bad name and make people say “you can prove anything with statistics.” Here’s an example. I’ve just extended the two trend lines to “prove” that after the 2160 Olympics women will run the 100 meters faster than men. Pretty clearly, the linear model breaks down before then.
It’s not safe to extrapolate to earlier times, either. The intercepts tell you that in the year zero, the fastest man in the world took 31.6 seconds to run 100 m, and the fastest woman took 44.7 seconds. Does that seem believable?
But what if you don’t still have the regression line on your calculator, for instance if you’ve done a different regression? In that case, you can go back to your written-down regression equation and plug in the desired x value.
Advantage of this method: You already know how to substitute into equations. Disadvantages: depending on the specific numbers involved, you may introduce rounding errors. Also, since you’re entering more numbers there’s an increased chance of entering a number wrong.
Example: To find the predicted average y value for x = 102, go back to the regression equation that you wrote down, and substitute 102 for x:
ŷ = 3.1661x − 55.7966
ŷ = 3.1661*102 − 55.7966
ŷ = 267.1456 → 267.1
In this example, the rounding error was very small, and it disappeared when you rounded ŷ to one decimal place. But there will be problems where the rounding error is large enough to affect the final answer, so always use the trace method if you can.
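To see how small the rounding error is here, this Python sketch compares a prediction from the rounded, written-down coefficients with one from coefficients computed at full precision from the golf data. The function name predict is just an illustrative choice.

```python
from statistics import mean

speed = [100, 102, 103, 101, 105, 100, 99, 105]   # x, mph
dist = [257, 264, 274, 266, 277, 263, 258, 275]   # y, yards

def predict(x, slope, intercept):
    """Predicted average y (y-hat) from a straight-line equation."""
    return slope * x + intercept

# Method 2: the written-down equation, coefficients rounded to four places.
rounded = predict(102, 3.1661, -55.7966)

# For comparison: coefficients kept at full precision.
x_bar, y_bar = mean(speed), mean(dist)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, dist))
         / sum((x - x_bar) ** 2 for x in speed))
intercept = y_bar - slope * x_bar
full = predict(102, slope, intercept)

print(round(rounded, 4), round(full, 4), round(abs(full - rounded), 4))
```

The two answers differ only in the fourth decimal place, so both round to 267.1 yards.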
Again, please observe the Cautions above. With this method, the calculator won’t tell you when your x value is outside a reasonable range, so you need to be aware of that issue yourself.
Each measured data point has an associated residual, defined as y−ŷ, the distance of the point above or below the line. To find a residual, the actual y comes from the original data, and the predicted average ŷ comes from one of the methods above.
Example: Find the residual for x = 102.
Solution: From the original data, y = 264. From either of the methods above, ŷ = 267.1. Therefore the residual is y−ŷ = 264−267.1 = −3.1 yards.
If a given x value occurs in more than one data point, you have multiple residuals for that x value.
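As a sketch of the same computation for every data point, the Python below groups the golf residuals by x value, so you can see that x = 100 and x = 105 each have two residuals. The dictionary name residuals_by_x is my own.

```python
from collections import defaultdict
from statistics import mean

speed = [100, 102, 103, 101, 105, 100, 99, 105]   # x, mph
dist = [257, 264, 274, 266, 277, 263, 258, 275]   # y, yards

# Least squares line from the data.
x_bar, y_bar = mean(speed), mean(dist)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, dist))
         / sum((x - x_bar) ** 2 for x in speed))
intercept = y_bar - slope * x_bar

# Residual = actual y minus predicted y-hat, collected per x value.
residuals_by_x = defaultdict(list)
for x, y in zip(speed, dist):
    y_hat = slope * x + intercept
    residuals_by_x[x].append(y - y_hat)

for x in sorted(residuals_by_x):
    print(x, [round(e, 1) for e in residuals_by_x[x]])
```

Negative residuals mark points below the line, and positive residuals mark points above it.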
Summary: After you compute the linear correlation coefficient r of your sample, you may wonder whether this reflects any linear correlation in the population. By comparing r to a critical number or decision point, you either conclude that there is linear correlation in the population, or reach no conclusion. You can never conclude that there’s no correlation in the population.
This page gives a simple mechanical test, but a proper statistical test exists. The optional advanced handout Inferences about Linear Correlation explains how decision points are computed and the theory behind the test. You need to learn about t tests before you can understand all of it, but right now you can use the Excel spreadsheet that you’ll find there. Or you can use MATH200B Program part 6 to do the computations.
The decision points are used to answer the question “From the linear correlation r of my sample, can I rule out chance as an explanation for the correlation I see? Can I infer that there is some correlation in the population?”
To answer that question, temporarily disregard the sign of r. This is the absolute value of r, written |r|. Then compare |r| to the decision point, and obtain one of the only three possible results:
If |r| ≤ d.p., then you cannot say whether there is any linear correlation in the population.
If |r| > d.p. and r is negative, then there is some negative linear correlation in the population.
If |r| > d.p. and r is positive, then there is some positive linear correlation in the population.
Here’s a table of decision points (also known as critical values of r) for various sample sizes.
Decision Points or Critical Numbers for r
(two-tailed test for ρ≠0 at significance level 0.05)

n  d.p.  n  d.p.  n  d.p.  n  d.p.  n  d.p.  
5  .878  10  .632  15  .514  20  .444  30  .361  
6  .811  11  .602  16  .497  22  .423  40  .312  
7  .754  12  .576  17  .482  24  .404  50  .279  
8  .707  13  .553  18  .468  26  .388  60  .254  
9  .666  14  .532  19  .456  28  .374  80  .220  
100  .196 
(If your sample size is not shown, either refer to the Excel workbook or use the next lower number that is shown in the table. Example: n = 35 is not shown, and therefore you will use the decision point for n = 30.)
You survey 50 randomly selected college students about the number of hours they spend playing video games each week and their GPA, and you find r = −0.35. You look up n = 50 in the table and find 0.279 as the decision point. |r| > d.p. (0.35 > 0.279). You conclude that for college students in general, video game play time is negatively associated with GPA, or that GPA tends to decrease as video game playing increases.
You randomly select 21 college students. For the amount they spend on textbooks and their GPA, you find r = +0.20. n = 21 isn’t in the table of decision points, so you select 0.444, the decision point for n = 20. |r| ≤ d.p. (0.20 ≤ 0.444). Therefore, you are unable to make any statement about an association between textbook spending and GPA for college students in general.
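The lookup and the three possible results can be sketched in Python. The table values are copied from the table of decision points above. The names decision_point and conclusion are hypothetical, and for a sample size not shown the sketch simply uses the next lower n that is shown.

```python
# Decision points (critical values of r), two-tailed, significance level 0.05.
DECISION_POINTS = {
    5: .878, 6: .811, 7: .754, 8: .707, 9: .666,
    10: .632, 11: .602, 12: .576, 13: .553, 14: .532,
    15: .514, 16: .497, 17: .482, 18: .468, 19: .456,
    20: .444, 22: .423, 24: .404, 26: .388, 28: .374,
    30: .361, 40: .312, 50: .279, 60: .254, 80: .220,
    100: .196,
}

def decision_point(n):
    """Decision point for sample size n, using the next lower n shown if needed."""
    usable = [size for size in DECISION_POINTS if size <= n]
    if not usable:
        raise ValueError("the table starts at n = 5")
    return DECISION_POINTS[max(usable)]

def conclusion(r, n):
    """One of the three possible results of comparing |r| to the decision point."""
    if abs(r) <= decision_point(n):
        return "cannot say whether there is any linear correlation in the population"
    if r < 0:
        return "some negative linear correlation in the population"
    return "some positive linear correlation in the population"

print(conclusion(-0.35, 50))   # video games and GPA
print(conclusion(0.20, 21))    # textbook spending and GPA
```

Note that the function never returns “no correlation in the population”, because that conclusion isn’t available.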
Be very careful with your interpretation, and don’t say more than the statistics will allow.
The question was simply whether there is some correlation in the population, not how much. The population might have stronger or weaker correlation than your sample; all you know is that it has some. (Though you won’t learn how to do it in this course, it is possible to estimate the correlation coefficient of the population.)
If you conclude there is some correlation in the population, it’s probable, not certain. From a completely uncorrelated population, there’s still one chance in 20 of drawing a sample with |r| greater than the decision point. Because 1/20 is .05, we say that .05 is the significance level.
Even if you conclude that there is some correlation in the population, that’s the start of your investigation, not the end. If there’s a correlation in the population, you can’t just assume that one variable drives the other: correlation is not causation. (Steve Simon’s (2000b) Causation [see “Sources Used” at end of book] gives some hints for investigating causation, using smoking and lung cancer as an example.)
Finally, note that there’s no way to reach the conclusion “there’s no correlation in the population.” Either there (probably) is, or you can’t reach any conclusion. This will be a general pattern in inferential statistics: either you reach a conclusion of significance, or you don’t reach any conclusion at all. (As you’ll see in Chapter 10, you can conclude “something is going on”, you can fail to reach a conclusion, but you can never conclude “nothing is going on”. Lack of evidence for is not evidence against.)
In “Scatterplot, Correlation, and Regression on TI83/84”, earlier in this chapter, you learned the concepts of correlation and regression, and you used a TI83 or TI84 calculator to plot the points and do the computations. The calculator is handy, but calculator screens aren’t great for formal reports. This section tells you how to do the same operations in Microsoft Excel, without repeating the concepts.
I’m using Excel 2010, but Excel 2007 or 2013 should be almost identical.
Here again are the data:
Clubhead speed, mph (x)  100  102  103  101  105  100  99  105 

Distance, yards (y)  257  264  274  266  277  263  258  275 
Excel won’t put r on the chart, but you can compute it in a worksheet cell:
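In an empty cell, type =CORREL( including the = sign and opening parenthesis. Select the y values, type a comma, select the x values, type a closing parenthesis, and press [ENTER ].
(You can get the slope, y intercept, or R² into the worksheet by following the above procedure but substituting SLOPE, INTERCEPT, or RSQ for CORREL.)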
Like your calculator, Excel can find the ŷ value (predicted average y) for any x.
Caution: A regression equation is valid only within the range of actual measured x values, and a little way left and right of that range. If you go outside that range, Excel will happily serve up garbage numbers to you.
On average, how far do you expect a golf ball to travel when hit at 102 mph?
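In an empty cell, type =FORECAST( including the = sign and opening parenthesis. Enter the x value (or click the cell that holds it), type a comma, select the y values, type a comma, select the x values, type a closing parenthesis, and press [ENTER ]. You’ll see the predicted average distance, 267.1 yards.
The prediction formula, like all Excel formulas, is “live”: if you type in a new x Excel will display the corresponding ŷ. If this doesn’t happen, in the Excel ribbon click » » .
(The online book has live links to all of these.)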
Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.
Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.
Year  Power Boat Reg. (1000s)  Manatees Killed 

1977  447  13 
1978  460  21 
1979  481  24 
1980  498  16 
1981  513  24 
1982  512  20 
1983  526  15 
1984  559  34 
1985  585  33 
1986  614  33 
1987  645  39 
1988  675  43 
1989  711  50 
1990  719  47 
1991  716  53 
1992  716  38 
1993  716  35 
1994  735  49 
(a) The two variables are powerboat registrations and manatee deaths. Which should be the explanatory variable, and which should be the response variable?
(b) On paper or on your calculator, make a scatterplot. Do the data seem to follow a straight line, more or less?
(c) Give the symbol and numerical value of the correlation coefficient.
(d) Write down the regression equation for manatee deaths as a function of powerboat registrations.
(e) State and interpret the slope.
(f) State and interpret the y intercept.
(g) Give the coefficient of determination with its symbol, and interpret it.
(h) How many deaths does the regression predict if 559,000 power boats are registered? Use the proper symbol.
(i) Find the residual for x = 559.
(j) How many manatee deaths would you expect for a million powerboat registrations?
Dial  Temp, °F 

0  6 
2  −1 
3  −3 
5  −10 
6  −16 
(a) Make a scatterplot. Does a straightline model seem reasonable here?
(b) What linear equation best describes the relation between dial setting x and temperature y?
(c) State and interpret the slope.
(d) State and interpret the y intercept.
(e) Give the correlation coefficient with its symbol.
(f) Give the coefficient of determination with its symbol, and interpret it.
(g) Predict the temperature for a dial setting of 1.
How much of the variation in children’s IQ is associated with variation in family income?
Updated 13 Jan 2015
Intro: By now you know: There’s no certainty in statistics. When you draw a sample from a population, what you get is a matter of probability. When you use a sample to draw some conclusion about a population, you’re only probably right. It’s time to learn just how probability works.
If you’re learning independently, you can skip the sections marked “Optional” and still understand the chapters that follow. If you’re taking this course with an instructor, s/he may require some or all of those sections. Ask if you’re not sure!
For easy reference, tables used in more than one problem are duplicated at the end of this document.
A trial is any procedure or observation whose result depends at least partly on chance. The result of a trial is called the outcome. We call a group of one or more repeated trials a probability experiment.
Example 1: Ten thousand doctors took aspirin every night for six years, and 104 of them had heart attacks. The relative frequency is 104/10000 = 1.04%, so the probability of heart attack is 1.04% for doctors taking aspirin nightly.
Each doctor represents a trial, and the outcome of each trial is either “heart attack” or “no heart attack”. The group of 10,000 trials is a probability experiment.
Definition: An event is a group of one or more possible outcomes of a trial. Usually those outcomes are related in some way, and the event is named to reflect that.
Example 2: If you draw a card from a deck without looking, there are 52 possible outcomes (assuming the jokers have been removed). “Ace” is an event, representing a group of four outcomes, and the probability of that event is 4/52 or 1/13. “Spade” is an event, representing a group of 13 outcomes, so its probability is 13/52 or 1/4. “Ace of spades” is both an outcome and an event, with a probability of 1/52.
Write probabilities as fractions, decimals, or percentages, like this:
P(event) = number
Example 3: On a coin flip, P(heads) = 0.5, read as the probability of heads is 0.5. “P(0.5)” is wrong. Don’t write P(number); always write P(event) = number.
All probabilities are between 0 and 1 inclusive. A probability of 0 means the event is impossible or cannot happen; a probability of 1 means the event is a certainty or will definitely happen. Probabilities between 0 and 1 are assigned to events that may or may not happen; the more likely the event, the higher its probability.
Definition: When an event is unlikely — when it has a low probability of occurring — you call it an unusual event. Unless otherwise stated, “unlikely” means that the probability is below 0.05.
This will be an important idea in inferential statistics.
Pure thought is enough to give many probabilities: the probability of drawing a spade from a deck of cards, the probability of rolling doubles three times in a row at Monopoly, the probability of getting an all-white jury pool in a county with 26% black population. Any such probability is called a theoretical probability or classical probability.
Theoretical probabilities come ultimately from a sample space, usually with help from some of the laws for combining events. (I’ll tell you about both of these later in this chapter.)
Example 4: A standard die (used in Monopoly or Yahtzee) has six faces, all equally likely to come up. Therefore you know that the probability of rolling a two is 1/6.
On the other hand, some probabilities are impossible to compute that way, because there are too many variables or because you don’t know enough: the probability that weather conditions today will give rise to rain tomorrow, the probability that a given radium nucleus will decay within the next second, the probability that a given candidate will win the next election, the probability that a driver will have a car crash in the next year. To find the probability of an event like that, you do an experiment or rely on past experience, and so it is called an experimental probability or empirical probability.
Example 5: The CDC says that the incidence of TB in the US is 5 cases per 100,000 population. 5/100,000 = 0.005%. Therefore you can say that the probability a randomly selected person has TB is 0.005%.
These two terms describe where a probability came from, but there’s no other difference between experimental and theoretical probabilities. They both obey the same laws and have the same interpretations.
You probably don’t need formulas, but if you want them here they are:
Theoretical or classical:  P(success) = N(success) / N(possible outcomes) 

Empirical or experimental:  P(success) = N(success) / N(trials) 
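Here is a small Python sketch of both formulas. The theoretical part counts outcomes in a 52-card sample space, as in Example 2; the empirical part uses the aspirin figures from Example 1. The helper name probability is an illustrative choice.

```python
from fractions import Fraction
from itertools import product

# Sample space: 52 equally likely cards (jokers removed).
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = list(product(ranks, suits))

def probability(event, sample_space):
    """Theoretical: N(success) / N(possible outcomes), outcomes equally likely."""
    successes = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(successes), len(sample_space))

p_ace = probability(lambda card: card[0] == "A", deck)
p_spade = probability(lambda card: card[1] == "spades", deck)
p_ace_of_spades = probability(lambda card: card == ("A", "spades"), deck)

# Empirical: N(success) / N(trials), from the aspirin example.
p_heart_attack = Fraction(104, 10000)

print(p_ace, p_spade, p_ace_of_spades)
print(f"{float(p_heart_attack):.2%}")
```

Using Fraction keeps the answers exact, so 4/52 is reported in lowest terms as 1/13.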
Every probability statement has two interpretations, probability of one and proportion of all. You use the interpretation that seems most useful in a given situation.
Example 6: For doctors taking aspirin nightly, P(heart attack in six years) = 1.04%. The “probability of one” interpretation is that there’s a 1.04% chance any given doctor taking aspirin will have a heart attack. The “proportion of all” interpretation is that 1.04% of all doctors taking aspirin can be expected to have heart attacks.
Which interpretation is right? They’re both right, but in a given situation you should use the one that feels more natural.
You know that P(boy) is about 50% for live births, but you’re not surprised to see families with two or three girls in a row. Probability is long-term relative frequency; it can’t predict what will happen in any particular case.
This is expressed in the law of large numbers: as you take more and more trials, the relative frequency tends toward the true probability.
The law of large numbers was stated in 1689 by Jacob Bernoulli.
Example 7: For just a few babies, say the four children in one family, it’s quite common to find a proportion of boys very different from 50%, say one in four (25%) or even zero in four. But consider a class of thirty statistics students. The proportion may still be different from 50%, but a very different proportion (more than 70%, say, or less than 30%) would be unusual. And when you look at all babies born in a large hospital in a year, experience tells you that the proportion will be very close to 50%. The more trials you take, the closer the relative frequency is to the true probability — usually.
Trial  Result  Heads so far  rel. freq. 

1  T  0  0.0000 
2  H  1  0.5000 
3  H  2  0.6667 
4  H  3  0.7500 
5  H  4  0.8000 
6  T  4  0.6667 
But the Law of Large Numbers says that the relative frequency tends to the true probability. Probability can’t predict what will happen in any given case. The idea that a particular outcome is “due” is just wrong, and it’s such a classic mistake that it has a name. The Gambler’s Fallacy is the idea that somehow events try to match probabilities.
Example 8: I’ve just flipped a coin a few times, and the results are shown at the right. The first flip was a tail, and after that flip the relative frequency (rf) of heads is 0. The next flip is a head, and after two flips I’ve had one head out of two trials, so the rf is 0.5. The third flip is also a head, so now the rf is 2/3 or about 0.6667. At this point someone might say, “you’re due for a tail, to move the rf back toward the true probability of 0.5.” That’s the Gambler’s Fallacy.
The coin doesn’t know what it did before, and it doesn’t try to make things “right”. In my trials, the fourth flip moves the rf of heads further from 0.5, and the fifth flip moves it further still. True, the sixth flip moves the rf of heads closer to 0.5, but it could just as well have moved it further away, even if the coin is perfectly fair.
I stopped after six trials. I know that if I went on to do ten trials, or a hundred, or a thousand, over time the proportion of heads would almost always move closer to 0.5 — not necessarily on any particular flip, but in the long run.
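If you want to check the relative frequencies in the table for Example 8, this short Python sketch recomputes them from the six recorded flips. The variable names are just for illustration.

```python
flips = "THHHHT"  # the six results recorded in Example 8

heads_so_far = 0
running_rf = []
for trial, result in enumerate(flips, start=1):
    if result == "H":
        heads_so_far += 1
    # relative frequency of heads after this trial, rounded for display
    running_rf.append(round(heads_so_far / trial, 4))

print(running_rf)  # [0.0, 0.5, 0.6667, 0.75, 0.8, 0.6667]
```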
Subconsciously you expect random events not to show a pattern, but you may see patterns along the way. For example, if you flip a fair coin repeatedly, inevitably you will see a run of ten heads or ten tails — about twice in every thousand sequences of ten. If you flip the coin once every two seconds, you can expect to see a run of ten flips the same about once every 17 minutes, on average.
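Here is the arithmetic behind those two figures, as a Python sketch. It assumes that every flip can be the start of a new sequence of ten, which is my reading of the paragraph rather than something it states outright.

```python
from fractions import Fraction

p_same = 2 * Fraction(1, 2) ** 10        # ten heads or ten tails: 2/1024
flips_per_run = 1 / p_same               # one such sequence per 512 starts, on average
minutes = float(flips_per_run) * 2 / 60  # one flip every two seconds

print(p_same, float(p_same))   # 1/512, about 0.002 (twice per thousand)
print(round(minutes, 1))       # 17.1
```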
Here are two more examples of patterns cropping up in processes that are actually random:
Example 9: You have flipped a coin 999 times, and there were 499 heads and 500 tails. What’s the probability of a head on the next flip?
Solution: It is 50%, the same as on any other flip. The Law of Large Numbers tells you that over time you tend to get closer and closer to 50% heads, but it doesn’t tell you anything at all about any particular flip. If you think that the coin is somehow “due for a head”, you’ve fallen into the Gambler’s Fallacy.
At bottom, probability is about counting. Empirical probability is the number of times something did happen, divided by the number of trials. Classical probability is similar, but it makes use of a list or table of all possible outcomes, called a sample space. Technically a sample space is just a list of all possible outcomes, but it’s only useful if you make it a list of all possible equally likely outcomes.
For repeated independent trials — flipping multiple coins, rolling multiple dice, making successive bets at roulette, and so on — the size of the sample space will be the number of outcomes in each trial, raised to the power of the number of trials. For example, if you want to compute probabilities for the number of girls in a family of four children, your sample space will have 2^{4} = 16 entries.
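To see the sample space for the family of four children, you can let Python list it. This sketch assumes boys and girls are equally likely, so that all 16 birth orders are equally likely outcomes.

```python
from collections import Counter
from itertools import product

# Every birth order for four children, e.g. ('B', 'G', 'G', 'B')
families = list(product("BG", repeat=4))
girls = Counter(family.count("G") for family in families)

print(len(families))  # 16, which is 2**4
for k in sorted(girls):
    print(k, "girls:", girls[k], "out of", len(families))
```

The counts come out as 1, 4, 6, 4, 1 for zero through four girls, so for instance P(exactly two girls) is 6/16.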
Example 10: If you roll two dice, what’s the probability you’ll roll a seven? You could list the sample space as
S = { 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 }
but the outcomes are not equally likely. There’s only one way to get a twelve, for instance (double sixes), but there are several ways to get a seven (1–6, 2–5, and so on). So it’s much more useful to list your sample space with equally likely outcomes.
When constructing a sample space, be systematic so that you don’t leave any out or list any twice. Here, you’re rolling two dice, and each die has six equally likely results, so you have 6×6 = 36 equally likely outcomes in your sample space. How can you be systematic? List the outcomes in some regular order, like the picture below. Each row lists all the possibilities with the same outcome for the first die; each column lists all the possibilities with the same outcome for the second die.
image courtesy of Bob Yavits, Tompkins Cortland Community College
Once you have a sample space of equally likely outcomes, finding the probability is simple. There are six ways to roll a seven: 6–1, 5–2, 4–3, 3–4, 2–5, 1–6. There are 36 possible outcomes, all equally likely. Therefore the probability of rolling a seven is 6/36 or 1/6 or about 0.1667. In symbols, P(7) = 6/36 or P(7) = 1/6.
P(7) ≈ 0.1667
Caution: Round your final answer only. Never use a rounded number in further calculations; that’s the Big No-no. Fortunately, your calculator makes it easy to chain calculations, so that you see rounded numbers while it still uses the unrounded numbers in further calculations.
Example 11: Find the probability of rolling craps (two, three, or twelve).
Solution: There’s one way to roll a two, two ways to roll a three, and one way to roll a twelve. P(craps) = (1+2+1)/36 = 4/36 or 1/9.
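Examples 10 and 11 can both be checked by building the 36-outcome sample space and counting. This is a minimal Python sketch; the helper name p_total is made up for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes as (first die, second die)
rolls = list(product(range(1, 7), repeat=2))

def p_total(*totals):
    """Probability that the two dice add up to any of the given totals."""
    hits = [roll for roll in rolls if sum(roll) in totals]
    return Fraction(len(hits), len(rolls))

p_seven = p_total(7)
p_craps = p_total(2, 3, 12)

print(p_seven, round(float(p_seven), 4))  # 1/6 0.1667
print(p_craps)                            # 1/9
```

Using Fraction keeps the exact value until the very end, so the only rounding happens when you display the final answer.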
Often, it’s not practical to construct a sample space and compute probabilities from it. Instead, you construct a probability model. Probability models are yet another kind of mathematical model as introduced in Chapter 4.
Definition: A probability model is a table showing all possible outcomes and their probabilities. Every probability must be 0 to 1 inclusive, and the total of the probabilities must be 1 or 100%.
A probability model can be theoretical or empirical.
Number of Heads on Two Coin Flips  

x  P(x) 
0  1/4 
1  2/4 
2  1/4 
∑  4/4 = 1 
Example 12: Construct a probability model for the number of heads that appear when you flip two coins.
Solution: Start by constructing the sample space. Remember that you need equally likely events if you are going to find probabilities from the sample space. The first coin can be heads or tails, and whatever the first coin is, the second coin can also be heads or tails. So the sample space has 2×2 = 4 outcomes:
S = { HH, HT, TH, TT }
There are four equally likely outcomes, so the denominator (bottom number) on all the probabilities will be 4. The possible outcomes are no heads (one way), one head (two ways), and two heads (one way). The probability model is shown at right. Often a total row is included, as I did, to show that the probabilities add up to 1.
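The same model can be built in a few lines of Python, as a sketch of the steps in Example 12: list the sample space, count the ways to get each number of heads, and divide by the size of the sample space.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

sample_space = ["".join(flips) for flips in product("HT", repeat=2)]
ways = Counter(outcome.count("H") for outcome in sample_space)
model = {x: Fraction(n, len(sample_space)) for x, n in sorted(ways.items())}

print(sample_space)         # ['HH', 'HT', 'TH', 'TT']
for x, p in model.items():
    print(x, p)             # note that Fraction shows 2/4 as 1/2
print(sum(model.values()))  # 1
```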
That was an easy example, so easy that you could just as well work from the sample space. But think about more complex situations, especially with empirical (experimental) probabilities. Constructing a sample space may be impractical, but a probability model is relatively easy to create.
Example 13: (adapted from Sullivan 2011, page 235 problem 40): The CDC asked college students how often they wore a seat belt when driving. 118 answered never, 249 rarely, 345 sometimes, 716 most of the time, 3093 always. Construct a probability model for seatbelt use by college students when driving.
Seat-Belt Use by College Students Driving (sample size: 4521) 


Never  2.61 % 
Rarely  5.51 % 
Sometimes  7.63 % 
Most of the time  15.84 % 
Always  68.41 % 
Total  100.00 % 
Solution: Probability of one is proportion of all, so to get the probabilities you simply calculate the proportions. Sample size was (118+249+345+716+3093) = 4521. The proportions or probabilities are then simply 118/4521, 249/4521, and so on. The probability model is shown at the right.
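If you would rather not do five divisions by hand, here is a small Python sketch that turns the counts from Example 13 into the probability model. The variable names are illustrative.

```python
counts = {
    "Never": 118,
    "Rarely": 249,
    "Sometimes": 345,
    "Most of the time": 716,
    "Always": 3093,
}
n = sum(counts.values())  # sample size
model = {answer: count / n for answer, count in counts.items()}

for answer, p in model.items():
    print(f"{answer:<17}{p:7.2%}")
print(f"{'Total':<17}{sum(model.values()):7.2%}")
```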
Comments: Don’t push this model too far. In this sample, 68.4% of college students reported that they always use a seat belt when driving. There’s no uncertainty about that statement; it’s a completely accurate statistic (summary number for a sample). But can you go further and say that 68.4% of college students always wear a seat belt when driving? No, for two reasons.
First, this is a sample. Even if it’s a perfect random sample, it’s still not the population. There’s always sample variability. A different sample of college students would most likely give different answers — probably not very different, since this was a large sample, but almost certainly not identical. Second, and more serious, this survey depended on self reporting: students weren’t observed, they were just asked. When people report their behavior they tend to shade their responses in the direction of what’s socially approved or what they would like to think about themselves (response bias). How many of those “always” responses should have been “most of the time” or “sometimes”? You have no way to know.
You can find probabilities of simple events by making sample spaces and counting. But life isn’t usually that simple. To find probabilities of more interesting (and complex) events, you need to use rules for combining probabilities.
The rules are the same whether your original probabilities are theoretical or experimental.
Definition: When two events can’t both happen on the same trial, they are called mutually exclusive events or disjoint events.
Example 14: You select a student and ask where she was born. “Born in Cortland” and “born in Ithaca” are mutually exclusive events because they can’t both be true for the same person.
Comment: Obviously it’s possible that neither is true. Disjoint events could both be false, or one might be true, but they can’t both be true in the same trial.
Example 15: You select a student and ask his major. “Major in physics” and “major in music” are nondisjoint events because they could be true of the same student. (It doesn’t matter whether they are both true of the student you asked. They are nondisjoint because they could both be true of the same student — think about double majors.)
Rule: For disjoint events, P(A or B) = P(A)+P(B)
Example 16: You draw a card from a standard 52-card deck. What’s P(ace or face card)? (A face card is a king, queen, or jack.)
Solution: Are the events “ace” and “face card” disjoint? Yes, because a given card can’t be both an ace and a face card. Therefore you can use the rule:
P(ace or face card) = P(ace) + P(face card)
But what are P(ace) and P(face card)? A picture may help.
used by permission; source: http://www.jfitz.com/cards/ accessed 2012-09-26
Now you can see that the deck of 52 cards has four aces and twelve face cards. Therefore
P(ace) = 4/52 and P(face card) = 12/52
Since the events are disjoint,
P(ace or face card) = P(ace) + P(face card)
P(ace or face card) = 4/52 + 12/52 = 16/52
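As a check on Example 16, this Python sketch builds a 52-card deck, confirms that the two events are disjoint, and then adds the probabilities. The rank and suit labels are my own choice of notation.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs

aces = {card for card in deck if card[0] == "A"}
faces = {card for card in deck if card[0] in ("J", "Q", "K")}

assert not (aces & faces)  # no card is in both sets, so the events are disjoint
p_ace_or_face = Fraction(len(aces), len(deck)) + Fraction(len(faces), len(deck))

print(len(aces), len(faces))  # 4 12
print(p_ace_or_face)          # 4/13, the reduced form of 16/52
```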
Reminder: When you need to compute probability of A or B, always ask yourself first, are the events disjoint? Use the simple addition rule only if the events are disjoint. If events are nondisjoint — if it’s possible for both to happen on the same trial — you have to use the general rule, below.
US Marital Status in 2006 (in Millions)  

Men  Women  Totals  
Married  63.6  64.1  127.7 
Widowed  2.6  11.3  13.9 
Divorced  9.7  13.1  22.8 
Never married  30.3  25.0  55.3 
Totals  106.2  113.5  219.7 
Take a look at this table of marital status in 2006, from the US Census Bureau. It’s known as a contingency table or two-way table, because it classifies each member of the sample or population by two variables: in this case, sex and marital status.
Example 17: What’s the probability that a randomly selected person is widowed or divorced?
Solution: Are those events disjoint? Yes, because a given person can’t be listed in both rows of the table. (You might argue that a given person can be both widowed and divorced in his or her lifetime, and that’s true. But the table shows marital status at the time the survey was made, not over each person’s lifetime. The “Widowed” row counts those whose most recent marriage ended with the death of their spouse.) Therefore
P(widowed or divorced) = P(widowed) + P(divorced)
How do you find those probabilities? Remember that probability of one = proportion of all. Find the proportions, and you have the probabilities.
P(widowed or divorced) = 13.9/219.7 + 22.8/219.7
P(widowed or divorced) = 36.7/219.7 ≈ 0.1670
Example 18: Find the probability that a randomly selected man is widowed or divorced.
Solution: Disjoint events? Yes, a given man can’t be in both rows of the table. Again, the probabilities are the proportions, but now you’re looking only at the men:
P(widowed or divorced) = P(widowed) + P(divorced)
P(widowed or divorced) = 2.6/106.2 + 9.7/106.2
P(widowed or divorced) = 12.3/106.2 ≈ 0.1158
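Examples 17 and 18 differ only in which total goes on the bottom of the fraction. This Python sketch stores the table (counts in millions) and computes both; the dictionary layout is just one convenient way to hold the numbers.

```python
# Counts in millions, copied from the table: status -> (men, women)
table = {
    "Married": (63.6, 64.1),
    "Widowed": (2.6, 11.3),
    "Divorced": (9.7, 13.1),
    "Never married": (30.3, 25.0),
}
men_total = sum(men for men, women in table.values())
everyone = sum(men + women for men, women in table.values())

# Example 17: any person, so divide by everyone
p_person = (sum(table["Widowed"]) + sum(table["Divorced"])) / everyone
# Example 18: men only, so divide by the men's total
p_man = (table["Widowed"][0] + table["Divorced"][0]) / men_total

print(round(p_person, 4))  # 0.167
print(round(p_man, 4))     # 0.1158
```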
Now let’s look at a couple of examples of probability “or” for nondisjoint events.
Example 19: Find P(seven or club).
Solution: Are the events “seven” and “club” disjoint? No, because a given card can be both a seven and a club. You can’t use the simple addition rule.
The next section shows you a formula, but in math there’s usually more than one way to approach a problem. Here you can look back at the picture of the cards and count from the sample space. There are thirteen clubs, plus the sevens of spades, hearts, and diamonds, for a total of 16. (You don’t count the seven of clubs when counting sevens, because you already counted it when counting clubs.) Therefore P(seven or club) = 16/52.
Example 20: Find P(woman or divorced).
Solution: Disjoint events? No, a given person can be both. So what do you do? The same thing as in the preceding example: you count up all the women, and all the divorced people who aren’t women, and divide by the number of people:
P(woman or divorced) = 113.5/219.7 + 9.7/219.7 = 123.2/219.7 ≈ 0.5608
Look back at P(seven or club). Those are not disjoint events, so you can’t just add P(seven) and P(club). But what did you do, when counting? You counted the clubs, then you counted the sevens that aren’t clubs. In other words, just adding P(seven) and P(club) would be wrong because that would double count the overlap.
With 52 cards, it’s easy enough just to count. But that’s not practical in every problem, so there’s a rule: go ahead and double count by adding the probabilities, then fix it by subtracting the part you double counted.
Rule: P(A or B) = P(A) + P(B) − P(A and B)
This general addition rule works for all events, disjoint or nondisjoint. (If two events are disjoint, they can’t happen at the same time, P(A and B) is 0, and the general rule becomes the same as the simple rule.)
Let’s redo the last two examples with this new general rule, to see that it gives the same answers.
Example 19 again: Find P(seven or club).
P(seven or club) = P(seven) + P(club) − P(seven and club)
Caution: P(seven and club) doesn’t mean “all the sevens and all the clubs”. It means the probability that one card will be both a seven and a club — in other words, it means the seven of clubs.
P(seven or club) = 4/52 + 13/52 − 1/52
P(seven or club) = 16/52
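You can confirm that counting and the general rule agree with a short Python sketch using sets. The deck construction and labels are my own; the set operators | and & play the roles of “or” and “and”.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = set(product(ranks, suits))

sevens = {card for card in deck if card[0] == "7"}
clubs = {card for card in deck if card[1] == "clubs"}

# Counting from the sample space: cards that are a seven or a club (or both)
by_counting = Fraction(len(sevens | clubs), len(deck))

# General addition rule: add, then subtract the double-counted overlap
by_rule = (Fraction(len(sevens), len(deck))
           + Fraction(len(clubs), len(deck))
           - Fraction(len(sevens & clubs), len(deck)))

print(sevens & clubs)        # {('7', 'clubs')}
print(by_counting, by_rule)  # 4/13 4/13, the reduced form of 16/52
```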
Example 20 again: Using the table of marital status, find P(woman or divorced).
Solution:
P(woman or divorced) = P(woman) + P(divorced) − P(woman and divorced)
P(woman or divorced) = 113.5/219.7 + 22.8/219.7 − 13.1/219.7
P(woman or divorced) = 123.2/219.7 ≈ 0.5608
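Here is the same comparison for Example 20 as a Python sketch, with the numbers (in millions) copied from the table.

```python
women, divorced, everyone = 113.5, 22.8, 219.7  # row and column totals
divorced_women, divorced_men = 13.1, 9.7        # the two cells of the Divorced row

# General rule: P(woman) + P(divorced) - P(woman and divorced)
by_rule = (women + divorced - divorced_women) / everyone
# Counting: all women, plus the divorced people who aren't women
by_counting = (women + divorced_men) / everyone

print(round(by_rule, 4), round(by_counting, 4))  # 0.5608 0.5608
```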
About two thirds of students who register for a math class complete it successfully. What’s the probability that a randomly selected student who registers for a math class will not complete it successfully? Of course you already know it’s 1−(2/3) = 1/3. Let’s formalize this.
Definitions: Two events are complementary if they can’t both occur but one of them must occur. If A is an event from a given sample space, then the complement of A, written A^{C} or not A, is the rest of that sample space.
Describing a complement usually involves using the word “not”. Complementary events (can’t both happen, but one must happen) are a subcategory of disjoint events (can’t both happen).
Example 21: The complement of the event “the student completes the course successfully” is the event “the student does not complete the course successfully.” Obviously the complement need not be a simple event. The complement of “the student completes the course successfully” is “the student never shows up, or attends initially but stops attending, or withdraws, or earns an F, or takes an incomplete but never finishes”, or probably other outcomes I haven’t thought of.
Rule: P(A^{C}) = 1 − P(A)
This comes directly from the definition, and the rule for “or”. A and A^{C} can’t both happen, so they’re disjoint and P(A or A^{C}) = P(A)+P(A^{C}). But one or the other must happen, so P(A or A^{C}) = 1. Therefore P(A)+P(A^{C}) = 1, and P(A^{C}) = 1−P(A).
Example 22: In rolling two dice, “doubles” and “not doubles” are complementary events because they can’t both happen on the same roll, but one of them must happen. “Boxcars” (double sixes) and “snake eyes” (double ones) can’t both happen, so they’re disjoint; but they are not complementary because other outcomes are possible.
The complement rule is useful on its own, but it really shines as a labor-saving device. Very often when a probability problem looks like a lot of tedious computation, the complement is your friend. This really sticks out with “at least” problems (later), but here are a few simpler examples.
Colors of Plain M&Ms  

Blue  24 % 
Orange  20 % 
Green  16 % 
Yellow  14 % 
Brown  13 % 
Red  13 % 
Example 23: The color distribution for plain M&Ms is shown at right. What’s the probability that a randomly selected plain M&M is any color but yellow?
Solution: You could add the probabilities of the five other colors, but of course it’s easier to say
P(Yellow^{C}) = 1 − P(Yellow)
P(Yellow^{C}) = 100% − 14% = 86%
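A quick Python sketch shows that the long way and the complement give the same answer for Example 23. Percentages are kept as whole numbers so that the arithmetic is exact.

```python
# Color distribution of plain M&Ms, in percent
colors = {"Blue": 24, "Orange": 20, "Green": 16,
          "Yellow": 14, "Brown": 13, "Red": 13}

assert sum(colors.values()) == 100  # a valid probability model totals 100%

long_way = sum(pct for color, pct in colors.items() if color != "Yellow")
short_way = 100 - colors["Yellow"]  # complement rule

print(long_way, short_way)  # 86 86
```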
Example 24: Referring again to the table of marital status, what’s the probability that a randomly selected person is not currently married?
Solution: Since the four marital statuses are disjoint, you could add the probabilities for widowed, divorced, and never married. But it’s easier to take the complement of “married”:
P(not currently married) = P(married^{C})
P(not currently married) = 1 − P(married)
P(not currently married) = 1 − 127.7/219.7
P(not currently married) ≈ 0.4188
Definition: Two events are called independent events if the occurrence of one doesn’t change the probability of the other.
Example 25: When you play poker, being dealt a pair in this hand and a pair in the next are independent events because the deck is shuffled between hands. But in casino blackjack, according to Scarne on Cards (Scarne 1965, 144 [see “Sources Used” at end of book]), four decks are used and they aren’t necessarily shuffled between hands. Therefore, getting a natural (ace plus a ten or face card) in this hand and a natural in the next are not independent events, because the cards already dealt change the mix of remaining cards and therefore change the probabilities.
That’s also an example of sampling with replacement (poker) and sampling without replacement (casino blackjack).
Samples drawn with replacement are independent because the sample space is reset to its initial condition between draws. Samples drawn without replacement are usually dependent because what you draw out changes the mix of what is left. However, if you’re drawing from a very large group, the change to the proportions in the mix is very small, so you can treat small samples from a very large group as independent.
Independent events are not disjoint, and disjoint events are not independent. If two events A and B are disjoint, then if A happens B can’t happen, so its probability is zero. One of two disjoint events happening changes the probability of the other, so they can’t be independent.
Rule: For independent events, P(A and B) = P(A) × P(B)
Example 26: In Monopoly, you get an extra roll if you roll doubles, but if you roll doubles three times in a row you have to go to jail. What’s the probability you’ll have to go to jail on any given turn?
Solution: Refer to the picture of the dice. There are six ways out of 36 to get doubles, so P(doubles) = 6/36 or 1/6. Each roll is independent, so the probability of doubles three times in a row is (1/6)×(1/6)×(1/6) or (1/6)^{3} = 1/216, about 0.0046. If you play a lot of Monopoly, you’ll go to jail because of doubles between four and five times per thousand turns.
Example 27: The first traffic light on your morning commute is red 40% of the time, yellow 5%, and green 55%. What’s the probability you’ll hit a green all five mornings in any given week?
Solution: Are the five days independent? Yes, because where you hit that light in its cycle on one morning doesn’t influence where you hit it on the next day. The probability of green is 55% each day regardless of what happens on any other day. Therefore, the probability of five greens on five successive mornings is 55%×55%×55%×55%×55% or (0.55)^{5} ≈ 0.0503. About one week in twenty, that light should be green for you all five mornings.
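Both Example 26 and Example 27 are the multiplication rule applied to repeated independent trials, so each is a single power. A minimal Python sketch:

```python
from fractions import Fraction

# Example 26: doubles on three rolls in a row
p_doubles = Fraction(6, 36)
p_jail = p_doubles ** 3
print(p_jail, round(float(p_jail), 4))  # 1/216 0.0046

# Example 27: green light on five mornings in a row
p_green_week = 0.55 ** 5
print(round(p_green_week, 4))           # 0.0503
```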
Example 28: Refer again to the table of marital status. What’s the probability that a randomly selected person is female and widowed?
Solution: In a two-way table, for probability “and”, you don’t worry about formulas or independence because everything is already laid out for you. 11.3 million persons are female and widowed, out of 219.7 million. Therefore:
P(female and widowed) = 11.3/219.7 ≈ 0.0514.
Example 29: Earlier in this section, I said that samples drawn without replacement are usually dependent, but you can treat them as independent when drawing a small sample from a very large group. Here’s an example. If you randomly select three people, what’s the probability that all three are widowed women?
Solution: From the preceding example, the probability that any one person is a widowed woman was 11.3/219.7. Because three people is a small sample against the millions of people in the census, and the sample is random, you can treat them as independent. If you randomly select one person out of millions, the mix of sex and marital status in the remaining people is so nearly unchanged that you can ignore the difference. Therefore, the probability that all three are widowed women is
(11.3/219.7) × (11.3/219.7) × (11.3/219.7) = (11.3/219.7)^{3} ≈ 0.0001.
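How little does “without replacement” matter here? This Python sketch compares the two calculations for three draws that all land in the table’s female-and-widowed cell. It treats the rounded table values as exact head counts, which is an assumption made only for illustration.

```python
cell = 11_300_000        # female and widowed, from the table
everyone = 219_700_000   # all people in the table

# Treating the three draws as independent (as if with replacement)
with_replacement = (cell / everyone) ** 3

# Exact calculation: each draw removes one person from the pool
without_replacement = (
    cell / everyone
    * (cell - 1) / (everyone - 1)
    * (cell - 2) / (everyone - 2)
)

print(round(with_replacement, 4), round(without_replacement, 4))  # 0.0001 0.0001
print(abs(with_replacement - without_replacement) / with_replacement)
```

The relative difference printed on the last line is far less than one part in a million, which is why you can ignore it.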
There’s no special rule for “at least”, but textbook writers (and quiz writers) love this type of problem, so it’s worth looking at. “At least” problems usually want you to combine several of the probability rules.
Example 30: Think back to that traffic light that’s green 55% of the time, yellow 5%, red 40%. What’s the probability that you’ll catch it red at least one morning in a five-day week?
Solution: You could find the probability of catching it red one morning (five separate probabilities for five separate mornings), or two mornings (ten different ways to hit two mornings out of five), or three, four, or five mornings. This would be incredibly laborious. Remember that the complement is your friend. What’s the complement of “at least one morning”? It’s “no mornings”. So you can find the probability of getting a red on no mornings, subtract that from 1, and have the desired probability of hitting red on at least one morning.
P(at least one red in five) = 1 − P(no red in five)
But the status of the light on each morning is independent of all the others, so
P(no red in five) = P(no red on one)^{5}
What’s the probability of no red on any one morning? It’s 1 minus the probability of red on any one morning:
P(no red on one) = 1 − P(red on one) = 1−0.4
Now put all the pieces together:
P(no red on one) = 1 − P(red on one) = 1−0.4
P(no red in five) = [ P(no red on one) ]^{5} = (1−0.4)^{5}
P(at least one red in five) = 1 − P(no red in five) = 1 − (1−0.4)^{5} ≈ 0.9222
About 92% of weeks, you hit red at least one morning in the week.
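To convince yourself that the complement really gives the same answer as the laborious route, here is a Python sketch that does both. The long way uses math.comb to count the ways to pick which mornings are red (five ways for one morning, ten for two, and so on); the names long_way and short_way are just labels.

```python
from math import comb

p_red, mornings = 0.4, 5

# Long way: add up P(exactly k red mornings) for k = 1, 2, 3, 4, 5
long_way = sum(
    comb(mornings, k) * p_red**k * (1 - p_red)**(mornings - k)
    for k in range(1, mornings + 1)
)

# Short way: complement of "no red on any of the five mornings"
short_way = 1 - (1 - p_red)**mornings

print(round(long_way, 4), round(short_way, 4))  # 0.9222 0.9222
```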
Be careful with your logic! You really do need to work things through step by step, and write down your steps. Some students just seem to subtract things from 1, and multiply other things, and hope for the best. That’s not a very productive approach.
One thing that can help you with these “at least” and “at most” problems is to write down all the possibilities and then cross out the ones that don’t apply, or underline the ones that do apply. For “at least one red in five”, you write 0 1 2 3 4 5 and then either cross out the 0 or underline 1 2 3 4 5. Either way, with this enumeration technique, taught to me by Benjamin Kirk, you can see that the complement of “at least one” is “none”.
A common mistake is computing 1−0.4^{5} for P(none), instead of the correct (1−0.4)^{5}. “None are red” means “all are not-red”: every one of the five is something other than red. Remember that all are not is different from not all are. In ordinary English, people often say “All my friends can’t go to the concert” when they really mean “Some of my friends can go, but not all of them can go.” In math you have to be careful about the distinction. Here’s an example.
Example 31: For the same situation, what’s the probability that you’ll hit a red light no more than four mornings in a five-day week? (This could also be asked as “at most four mornings” or “four mornings at most”.)
Solution: Try enumerating. For “at most four out of five”, you write 0 1 2 3 4 5 and then either cross out the 5 or underline 0 1 2 3 4. The previous example was a “none are” or “all are not”, but this one is a “not all are”.
P(≤ 4 out of 5) = 1 − P(5 out of 5)
P(5 out of 5) = 0.4^{5}
P(≤ 4 out of 5) = 1 − 0.4^{5} ≈ 0.9898
About 99% of weeks, you hit the red light no more than four mornings of the week.
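Putting Examples 30 and 31 side by side in Python makes the “all are not” versus “not all are” distinction concrete. The variable names are chosen to spell out which event each number belongs to.

```python
p_red = 0.4

p_none_red = (1 - p_red) ** 5  # all five mornings are not red
p_all_red = p_red ** 5         # all five mornings are red

p_at_least_one = 1 - p_none_red  # Example 30: complement of "none are red"
p_at_most_four = 1 - p_all_red   # Example 31: complement of "all are red"

print(round(p_at_least_one, 4))  # 0.9222
print(round(p_at_most_four, 4))  # 0.9898
```

The two answers differ only in where the “1 −” sits relative to the exponent, which is exactly where the common mistake creeps in.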
Example 32: You’re throwing a barbecue, and you want to start the grill at 2 PM. Fred and Joe live on opposite sides of town, and they’ve both agreed to bring the charcoal. The problem is that they’re both slackers. Fred is late 40% of the time, and Joe is late 30% of the time. What’s the probability you’ll start the grill by 2 PM?
Solution: This is another “at least” problem for independent events, though this time the independent events don’t have the same probability. To have charcoal by 2 PM, at least one of them has to show up by then. What’s the probability that at least one will be on time? Again, you could compute the probability that they’re both on time, that Fred’s on time but Joe’s late, and that Fred’s late and Joe’s on time — all of those together will be the probability of charcoal on time. But again, the complement is your friend. The complement of “charcoal on time” is “charcoal late”, which happens only if they’re both late.
P(charcoal on time) = 1 − P(charcoal late)
P(charcoal on time) = 1 − P(Fred late and Joe late)
(Fred and Joe live on opposite sides of town, so whether one is late has no connection with whether the other one is late. The events are independent.)
P(charcoal on time) = 1 − P(Fred late) × P(Joe late)
P(charcoal on time) = 1 − 0.4×0.3 = 0.88
You’ve got an 88% chance of starting the grill on schedule.
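Here is a Python sketch of Example 32 done both ways: the complement, and the three separate cases that the solution mentions. It relies on the same independence assumption as the solution.

```python
p_fred_late, p_joe_late = 0.4, 0.3

# Short way: charcoal is late only if both of them are late
short_way = 1 - p_fred_late * p_joe_late

# Long way: add the three disjoint cases where at least one is on time
long_way = (
    (1 - p_fred_late) * (1 - p_joe_late)  # both on time
    + (1 - p_fred_late) * p_joe_late      # Fred on time, Joe late
    + p_fred_late * (1 - p_joe_late)      # Fred late, Joe on time
)

print(round(short_way, 2), round(long_way, 2))  # 0.88 0.88
```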
Example 33: The space shuttle Challenger exploded shortly after launch in the 1980s, when one of six gaskets failed. After the fact, engineers realized that they should have known the design was too risky, but they didn’t think past “each gasket is 97% reliable.” The trouble was that if any gasket failed, the shuttle would explode. If you were asked to evaluate the design while the plans were still on the drawing board, what would you conclude? (Note: The design makes the six gaskets independent.)
Solution: The shuttle will explode if one or more gaskets fail. Here’s another “at least” problem, so enumerate the case you’re interested in: 0 1 2 3 4 5 6, with 1 through 6 underlined.
P(explosion) = P(at least one gasket fails)
The complement of “at least one gasket fails” (hard to compute) is “no gaskets fail” (much easier). What does it mean for no gaskets to fail? All gaskets must hold. Since the gaskets are independent, that’s easy to compute:
P(all six gaskets hold) = 0.97^{6}
The answer you want is the complement of the all-hold or zero-fail case:
P(at least one gasket fails) = 1 − P(all six hold) = 1 − 0.97^{6}
P(explosion) = P(at least one gasket fails) = 1 − 0.97^{6} ≈ 0.1670
Conclusion: There’s about a 17% chance that the shuttle will explode, just considering the gaskets and ignoring all other possible causes of trouble. This is about the same as the odds of shooting yourself in Russian roulette.
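The gasket calculation takes three lines of Python. The Russian roulette figure assumes one round in a six-chamber revolver, which is my assumption for the comparison; the text doesn’t specify.

```python
reliability, gaskets = 0.97, 6

p_all_hold = reliability ** gaskets  # all six independent gaskets hold
p_explosion = 1 - p_all_hold         # at least one gasket fails

p_russian_roulette = 1 / 6           # assumed: one round, six chambers

print(round(p_all_hold, 4))          # 0.833
print(round(p_explosion, 4))         # 0.167
print(round(p_russian_roulette, 4))  # 0.1667
```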
In 2012, the Honda Accord was the most frequently stolen vehicle in the US (Siu 2013 [see “Sources Used” at end of book]). Does that mean that your Honda Accord is more likely to be stolen than another model?
You’re tested for a rare strain of flu, and the result is positive. Your doctor tells you the test is 99% accurate. Does that mean that there’s a 99% chance you have that strain of flu?
In New York City, a rape victim identifies physical characteristics that match only 0.0001% of people. Police find someone with those characteristics and arrest him. Is there only a 0.0001% chance that he’s innocent?
These are examples of conditional probability — the probability of one event under the condition that another event happened. It’s probably the most misunderstood probability topic, but I’m going to demystify it for you.
The definition may seem hard at first. But after you work through the examples you’ll find it makes sense.
Definition: The conditional probability P(B | A), read “the probability of B given A”, is the probability that B happens if A happens.
That’s the “probability of one” interpretation. You might find the “proportion of all” interpretation easier: P(B | A) is the proportion of A’s that are also B.
Either way, the order matters: P(B | A) and P(A | B) mean different things and they’re different numbers.
Example 34: P(truck | Ford) is the probability that a vehicle is a truck if it’s a Ford, or the probability that a Ford is a truck, or the proportion of trucks among Fords. P(Ford | truck) is the probability that a vehicle is a Ford if it’s a truck, or the probability that a truck is a Ford, or the proportion of Fords among trucks.
Example 35: Let’s look first at the suspected rapist. The prosecutor presents evidence that these physical characteristics are found in only 0.0001% of people. The prosecutor therefore claims that there’s only a 0.0001% chance the suspect is innocent.
But the defense points out that there are over 8 million people in New York City. 0.0001% × 8,000,000 = 8, so the suspect is not a unique individual at all, but one of about eight people who match the eyewitness accounts. Seven of them are innocent. If there’s no evidence beyond the physical match to tie him to the crime, the probability that this defendant is innocent isn’t 0.0001%, it’s 7/8 or 87.5%. (And that’s just in the city. If you consider the metro area, or the US, or the world, there are even more people who match, so any one of them is even more likely to be innocent.)
The prosecutor’s fallacy is the false idea that the probability of a random match equals the probability of innocence. You can also describe this fallacy as “consider[ing] the unlikelihood of an event, while neglecting to consider the number of opportunities for that event to occur”, in the words of “The Prosecutor’s Fallacy” on the Poker Sleuth site (Stutzbach 2011 [see “Sources Used” at end of book]).
It’s an easy mistake to make if you just think about low probabilities. To avoid this error, think in whole numbers, as the defense did. 0.0001% is hard to think about; 8 is much easier.
The key to solving conditional-probability problems is your old friend, probability of one equals proportion of all. The probability that this particular matching person is innocent is the same as the proportion of all matching people that are innocent, or the proportion of innocent people among those who match. Probability problems usually get easier when you turn them into problems about numbers of people or numbers of things.
What does this look like in symbols? (Don’t be afraid of symbols! They are your friend, I promise. Words are slippery and confusing, but when you reduce a problem to symbols you make the situation clear and you are half way to solving it.)
In this example, there’s a 0.0001% chance that a random person would match the physical type of the criminal:
P(matching) = 0.0001%
The prosecution wants you to believe that the probability of a matching individual being innocent is the same:
P(innocent | matching) = 0.0001% (WRONG)
This is a conditional probability, the probability that one thing is true if another thing is true. Formally, the whole expression is “the probability of innocent given matching”. But it’s easier to think of as “the probability that a person who matches is innocent” or “the proportion of matching people who are innocent”.
The symbols help you clarify your thinking. “The probability of a match” and “the probability of innocence among those who match” are different symbols, and they’re different concepts. You’d expect them to be different probabilities.
The defense showed the right way to figure the probability of innocence given a match. 0.0001%×8,000,000 = 8 people match, and 7 of them are innocent. The probability that a matching person is innocent — the probability that a person is innocent given that he matches — is 87.5%.
P(innocent | matching) = 87.5% (CORRECT)
Notice what happens with if-then probabilities. You’re considering one group within a subgroup of the population, not one group within the whole population. You’ve reduced your sample space — not all people, but all matching people. The bottom number of your fraction comes from the “given that” part of the conditional probability, because P(innocent | matching) is the proportion of matching people that are also innocent.
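If you’d like to check the defense’s arithmetic, here is a minimal Python sketch. The variable names are made up for illustration, and it assumes (as the example does) that exactly one matching person is guilty and that nothing but the physical match ties anyone to the crime.

```python
from fractions import Fraction

# Numbers from Example 35; names are illustrative, not from the book.
p_match = Fraction(1, 1_000_000)   # 0.0001% of people match the description
population = 8_000_000             # rough population of New York City

matching = p_match * population    # how many people you expect to match
guilty = 1                         # assumption: exactly one matching person did it

# Reduced sample space: only the matching people go in the bottom of the fraction.
p_innocent_given_match = (matching - guilty) / matching

print(f"People who match: {matching}")
print(f"P(innocent | matching) = {float(p_innocent_given_match):.1%}")
```

The bottom number of the last fraction is the count of matching people, not the count of all people, and that is the whole point of the example.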
To explode the prosecutor’s fallacy, you distinguish between a probability in the whole population and a probability in a subgroup. You also have to ask yourself, “which group?” The issue of medical test results is a good example.
Example 36: There’s a rare skin disease, Texter’s Peril (TP), where you become hypersensitive to the buttons on your phone. (Yes, I am making this up.) It affects 0.03% of adults aged 18–30, three in ten thousand. The only cure is to lay off texting for 30 days, no exceptions. Naturally this is about the worst thing that can happen to anyone.
Your doctor has tested you and the test comes up positive. She tells you that the test is 99% accurate. Does that mean you are 99% likely to have TP? You might think so, and sadly many doctors make the same mistake.
You have a positive test result, and you want to know how likely it is that you have Texter’s Peril. In symbols,
P(disease | positive) = ?
Your doctor told you that the test is 99% accurate, meaning that 99% of people who actually have TP get a positive result:
P(positive | disease) = 99%
These are obviously not the same symbol, so the probability you care about, the probability you have the disease, may well be different from 99%. How can you compute it?
Change those probabilities to whole numbers, and make a table. (I got this technique from the book Calculated Risks [Gigerenzer 2002 [see “Sources Used” at end of book]]. The book cites a study showing that doctors routinely confused probabilities when counseling patients about test results.) You’ve already played with a two-way table; now you’re going to make one. It’s a little bit like filling in a puzzle. I hope you like puzzles. ☺
You don’t know the population size, but that’s okay. Just use a large round number, like a million. Start with what you know.
P(disease) = 0.03%
Out of 1,000,000 people, 0.03% = 300 will have TP, and the other 999,700 won’t. That’s the bottom row of the table, the totals row.
P(positive | disease) = 99%
Of the 300 who actually have TP, 99% = 297 will get a correct positive result, and 3 will get a false negative. That’s the first column of the table.
P(negative | disease^{C}) = 99%
(In the real world, a given test may not be equally accurate for positives and negatives, but we’ll overlook that to keep things simple.) Out of 999,700 who don’t have TP, 99% = 989,703 will get a correct negative result, and 9,997 will get a false positive. This is the second column of the table, and now you can fill in the column of totals.
              | Have TP | Don’t Have TP | Total
Positive Test | 297     | 9,997         | 10,294
Negative Test | 3       | 989,703       | 989,706
Total         | 300     | 999,700       | 1,000,000
Take a look at that table, specifically the “Positive Test” row. Do you see the problem? Most of the people with positive test results actually don’t have Texter’s Peril, even though the test is 99% accurate!
It took a while to get here, but it’s better to be correct slowly than to be wrong quickly. You can now compute the probability of having TP given that you have a positive test result. Once again, probability of one equals proportion of all, so this is really the same as the proportion of people with positive test results who actually have TP:
P(disease | positive) = 297 / 10,294 = 2.89%
The test is 99% accurate, but because TP is rare, most of the positive results are false positives, and there’s under a 3% chance that a positive result means you actually have Texter’s Peril. There’s a 1 − 297/10,294 = 97.11% chance that a positive result is a false positive.
Notice again: With conditional probability, you’re not concerned with the whole population. Rather, you focus on a subgroup within a subgroup. P(disease | positive) is the proportion of people who actually have the disease, within the subgroup that received a positive test result.
Example 37: What’s the chance that a negative is a false negative, that given a negative test result you actually have TP? In symbols,
P(disease | negative) = ?
You’ve already got the table, so this is a piece of cake. Out of a million people, 989,706 test negative and 3 of them have the disease. The probability that a negative is a false negative is
P(disease | negative) = 3/989,706 ≈ 0.000 003
which is essentially nil.
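The table-filling in Examples 36 and 37 can also be sketched in a few lines of Python. This is only an illustration: the variable names are invented, and like the text it assumes the test is 99% accurate for both sick and healthy people. Integer arithmetic keeps the counts exact.

```python
population = 1_000_000                 # any large round number works

# Totals row: P(disease) = 0.03%, three in ten thousand.
have_tp = population * 3 // 10_000     # 300
no_tp = population - have_tp           # 999,700

# First column: 99% of people with TP test positive.
true_pos = have_tp * 99 // 100         # 297
false_neg = have_tp - true_pos         # 3

# Second column: 99% of people without TP test negative (assumed, as in the text).
true_neg = no_tp * 99 // 100           # 989,703
false_pos = no_tp - true_neg           # 9,997

# Conditional probabilities: the "given" row supplies the bottom number.
p_disease_given_pos = true_pos / (true_pos + false_pos)
p_disease_given_neg = false_neg / (false_neg + true_neg)

print(f"P(disease | positive) = {p_disease_given_pos:.2%}")
print(f"P(disease | negative) = {p_disease_given_neg:.6f}")
```

Try changing the prevalence from 3 in 10,000 to something larger, and watch how quickly P(disease | positive) climbs.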
Example 38: A lot of Web sites in 2013 trumpeted the news that the Honda Accord was the most frequently stolen model in the US the year before. And that’s true. Out of 721,053 stolen cars and light trucks in 2012, Hot Wheels 2012 tells us that 58,596 were Honda Accords (NICB 2013 [see “Sources Used” at end of book]).
But many Web sites warned Honda owners that they were most at risk. For instance, Honda Accord, Civic Remain Top Targets for Thieves at cars.com (Schmitz 2013 [see “Sources Used” at end of book]) leads with “If you own a Honda Accord or Civic, or a full-size Ford pickup truck, you might want to take a moment to make sure your auto-insurance payments are up to date. You drive one of the top three most-stolen vehicles in the US.”
Do you see what’s wrong here? Think about it for a minute before reading on.
Yes, a lot of Honda Accords were stolen, because there are a lot of them on the road. Too many news organizations are sloppy and think that the likelihood a stolen car is an Accord is the same as the likelihood that an Accord will be stolen. This is the doctor’s mistake from the previous example, all over again.
Let’s clarify. You have 58,596 Accords out of 721,053 thefts, so the probability that a stolen car was an Accord — the probability that a car was an Accord given that it was stolen — the probability of “if stolen then Accord” — is
P(Accord | stolen) = 58,596/721,053 = 8.13%
But that doesn’t tell you doodley-squat about your chance of having your Accord stolen. That would be the probability of a car being stolen given that it is an Accord, “if Accord then stolen”. The top number of that fraction is still 58,596, but the bottom number is the total number of Accords on the road:
P(stolen | Accord) = 58,596/(total Accords on the road in 2012)
Do you see the difference? They’re both conditional probabilities, but they’re different conditions. “If stolen then Accord” is different from “if Accord then stolen”. The first one is about Accord thefts as a proportion of all thefts, and the second one is about Accord thefts as a proportion of all Accords. Those are different numbers.
To find the chance that an Accord will be stolen, you need the number of Accords on the road in 2012. A press release from Experian (2012) says there were “more than 245 million vehicles on US roads” in 2012, and 2.6% of them were Accords.
P(stolen | Accord) = (stolen Accords)/(total Accords on the road in 2012)
P(stolen | Accord) = 58,596/(2.6% of 245 million)
P(stolen | Accord) = 58,596/6,370,000
P(stolen | Accord) = 0.92%
Yes, over 8% of cars stolen in 2012 were Accords, but the chance of a given Accord being stolen was under 1%. P(Accord | stolen) = 8.13%, but P(stolen | Accord) = 0.92%.
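Here is the same pair of computations as a short Python sketch, using only the figures quoted in Example 38. The variable names are illustrative. Because Experian said “more than 245 million”, the count of Accords on the road is a rough floor, so the second probability is approximate.

```python
stolen_total = 721_053           # cars and light trucks stolen in 2012
stolen_accords = 58_596          # of those, Honda Accords
vehicles_on_road = 245_000_000   # "more than 245 million", treated as a round figure
accord_share = 0.026             # 2.6% of vehicles on the road were Accords

accords_on_road = vehicles_on_road * accord_share

# Same top number, different bottom numbers:
p_accord_given_stolen = stolen_accords / stolen_total      # thefts are the group
p_stolen_given_accord = stolen_accords / accords_on_road   # Accords are the group

print(f"P(Accord | stolen) = {p_accord_given_stolen:.2%}")
print(f"P(stolen | Accord) = {p_stolen_given_accord:.2%}")
```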
Rule: P(B | A) = P(A and B) / P(A) or N(A and B) / N(A)
The “N” alternatives remind you that often it’s easier just to count than to find probabilities and then divide. Either way, when you consider P(B | A), remember that you’re interested in the likelihood of B given that A occurs. It’s the B cases within the A group, not all the B cases.
P(A | B) is not the same as P(B | A). You’ll get the probability right if you remember that the second event, the “given that” event, supplies the bottom number of the fraction.
Example 39: Find P(stolen | Accord), the chance that any one Accord will be stolen. Using the numbers from Example 38,
P(stolen | Accord) = N(Accord and stolen) / N(Accord)
P(stolen | Accord) = 58,596/6,370,000 = 0.92%
Example 40: I draw a card from the deck, and I tell you it’s red. What’s the probability that it’s a heart? If you didn’t know anything about the card, you’d write P(heart) = ¼ because a quarter of the cards in the deck are hearts. But what is the probability given that it’s red?
P(heart | red) = P(heart and red) / P(red)
P(heart and red) is the probability of a red heart. A quarter of the cards in the deck are red hearts, so this is just ¼. P(red) is of course ½ because half the cards in the deck are red.
P(heart | red) = (¼) / (½) = (¼) × 2 = ½
This one is probably easier to do by just counting:
P(heart | red) = N(heart and red) / N(red)
P(heart | red) = 13/26 = ½
Either way, you’re concerned with the sub-subgroup of hearts within the subgroup of red cards. P(heart | red) = ½ — half of the red cards are hearts.
Example 41: You know P(heart | red) = ½: given that a card is red, there’s a ½ probability that it’s a heart. But what is P(red | heart), the probability that a card is red given that it’s a heart? You probably already know the answer, but let’s run the formula:
P(red | heart) = N(red and heart) / N(heart)
P(red | heart) = 13/13 = 1 (or 100%)
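The counting version of the rule is easy to try on a computer. The sketch below builds a standard 52-card deck and counts cards; the helper name `conditional` and the two test functions are invented for this illustration.

```python
from fractions import Fraction
from itertools import product

ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
suits = {"hearts": "red", "diamonds": "red", "clubs": "black", "spades": "black"}
deck = list(product(ranks, suits))      # 52 (rank, suit) pairs

def conditional(event, given):
    """P(event | given) = N(event and given) / N(given), by counting cards."""
    given_cards = [card for card in deck if given(card)]
    both = [card for card in given_cards if event(card)]
    return Fraction(len(both), len(given_cards))

def is_heart(card):
    return card[1] == "hearts"

def is_red(card):
    return suits[card[1]] == "red"

print("P(heart | red) =", conditional(is_heart, is_red))   # 13 of 26 red cards
print("P(red | heart) =", conditional(is_red, is_heart))   # 13 of 13 hearts
```

Swapping the two arguments swaps the “given” group, which is why the two answers differ.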
Conditional probabilities often come up in two-way tables.
US Marital Status in 2006 (in Millions)
              | Men   | Women | Totals
Married       | 63.6  | 64.1  | 127.7
Widowed       | 2.6   | 11.3  | 13.9
Divorced      | 9.7   | 13.1  | 22.8
Never married | 30.3  | 25.0  | 55.3
Totals        | 106.2 | 113.5 | 219.7
Example 42: Again using the table of marital status (shown above), what’s the probability that a randomly selected woman is divorced? In other words, given that the person is a woman, what’s the probability that she’s divorced?
Solution: The problem wants P(divorced | woman), the probability that the person is divorced given that she’s a woman.
P(divorced | woman) = N(divorced and woman) / N(woman)
P(divorced | woman) = 13.1/113.5 ≈ 0.1154
Because we have “given woman” or “if woman”, the bottom number is the number of women, 113.5 million.
Remember the definition of independent events? A and B are independent if the occurrence of one doesn’t change the probability of the other. Now that you know about conditional probability, you can define independent events in terms of conditional probability:
Definition: Two events A and B are independent if and only if P(A | B) = P(A).
This makes sense. P(A) is the probability of A without considering whether B happened or not, and P(A | B) is the probability of A given that B happened. If B’s occurrence doesn’t change the probability of A, then those two numbers will be equal.
Example 43: Referring again to the table of marital status (shown above), show that “woman” and “widowed” are dependent (not independent).
Solution:
P(widowed) = 13.9 / 219.7 ≈ 0.0633
P(widowed | woman) = 11.3 / 113.5 ≈ 0.0996
These numbers are different — the probability of “widowed” changes when “woman” is given, or in English the proportion of widowed women is different from the proportion of widowed people. Therefore the events “woman” and “widowed” are not independent.
By the way, if A and B are independent then B and A are independent. So you could just as well compare P(woman) = 113.5/219.7 ≈ 0.5166 to P(woman | widowed) = 11.3/13.9 ≈ 0.8129. Since those are different, you conclude that “woman” and “widowed” are dependent.
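A two-way table translates naturally into a nested dictionary, so the check in Example 43 can be sketched in Python. The dictionary layout and names are just one possible choice. The numbers are the table’s, in millions.

```python
# US marital status in 2006, in millions (from the table).
counts = {
    "married":       {"men": 63.6, "women": 64.1},
    "widowed":       {"men": 2.6,  "women": 11.3},
    "divorced":      {"men": 9.7,  "women": 13.1},
    "never married": {"men": 30.3, "women": 25.0},
}

total = sum(sum(row.values()) for row in counts.values())
women = sum(row["women"] for row in counts.values())
widowed = sum(counts["widowed"].values())

p_widowed = widowed / total                               # whole population
p_widowed_given_woman = counts["widowed"]["women"] / women  # women only

print(f"P(widowed)         = {p_widowed:.4f}")
print(f"P(widowed | woman) = {p_widowed_given_woman:.4f}")
print("Equal, so independent?", round(p_widowed, 4) == round(p_widowed_given_woman, 4))
```

The final comparison rounds to four decimal places, which is an arbitrary choice for this sketch; the two proportions here are far enough apart that the rounding doesn’t matter.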
When events are not independent, to find probability “and” you need to use a conditional probability. Remember the formula for conditional probability: P(B | A) = P(A and B) / P(A). Multiply both sides by P(A) and you have P(A) × P(B | A) = P(A and B), or:
Rule: For all events, P(A and B) = P(A) × P(B | A)
Example 44: You draw two cards from the deck without looking. What’s the probability that they’re both diamonds?
Solution: Are these independent events? No! P(diamond_{1}), the probability that the first card is a diamond, is 13/52 because there are 13 diamonds out of 52. But if the first card is a diamond, the probability that the second card is a diamond is different. Now there are only 12 diamonds left in the deck, out of a total of 51 cards. So P(diamond_{2} | diamond_{1}) = 12/51, which is a bit less than 13/52.
P(diamond_{1} and diamond_{2}) = P(diamond_{1}) × P(diamond_{2} | diamond_{1})
P(diamond_{1} and diamond_{2}) = (13/52) × (12/51) ≈ 0.0588
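As a cross-check on Example 44, this Python sketch computes the answer two ways: with the multiplication rule, and by brute-force counting of every ordered pair of cards. The deck encoding (suit letters D, H, C, S and ranks 1 to 13) is an arbitrary choice for illustration.

```python
from fractions import Fraction
from itertools import permutations

# Multiplication rule: P(A and B) = P(A) x P(B | A)
p_formula = Fraction(13, 52) * Fraction(12, 51)

# Brute force: list every ordered two-card draw without replacement.
deck = [(rank, suit) for suit in "DHCS" for rank in range(1, 14)]
pairs = list(permutations(deck, 2))                  # 52 x 51 ordered pairs
both_diamonds = sum(1 for first, second in pairs
                    if first[1] == "D" and second[1] == "D")
p_counted = Fraction(both_diamonds, len(pairs))

print(p_formula, p_counted, f"{float(p_formula):.4f}")
```

Both routes give the same fraction, which is reassuring: the formula is just a shortcut for counting.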
A lot of probability problems can be solved without using formulas, through the technique of sequences. Here’s the procedure:
Example 45: Suppose a bag contains 6 oatmeal cookies, 4 raisin cookies, and 5 chocolate chip. You are to draw two cookies from the bag without looking (and without replacement, which would be yucky). What is the probability that you will get two chocolate chip cookies?
Solution: To start with, notice that there are 6+4+5 = 15 cookies. There’s only one winning sequence, but this one illustrates an important point: you have to assign each probability in its situation at that point in its sequence.
Example 46: In the same situation, what’s the probability you’ll get one oatmeal and one raisin?
Solution: Even though you don’t care which order they come in, you have to list both orders among your winning sequences. Remember the example of flipping two coins, or the examples with dice: to make probabilities come out right, consider possible orderings.
Example 47: Consider the same bag of 15 cookies, but now what’s the probability you get two cookies the same?
Solution:
Example 48: Your teacher’s policy is to roll a six-sided die and give a quiz if a 2 or less turns up. Otherwise, she rolls again and collects homework if a 3 or less turns up. You haven’t done the homework for today and you’re not ready for a quiz. What is the probability you’ll get caught?
Solution: Though you could do this with formulas, you’ll get the same answer with less pain by following the method of sequences. The “winning sequences” in this case are the sequences that lead to either a quiz or homework.
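One way to check your work on Examples 45 through 48 is to let a computer list every equally likely sequence and count the winning ones. The Python sketch below does that. The helper name `chance` and the cookie labels are invented for illustration, and this is a brute-force stand-in for the method of sequences, not a replacement for drawing them out yourself.

```python
from fractions import Fraction
from itertools import permutations, product

def chance(wins, sequences):
    """Proportion of equally likely sequences that are winning sequences."""
    hits = sum(1 for seq in sequences if wins(*seq))
    return Fraction(hits, len(sequences))

# Examples 45-47: two cookies drawn without replacement from a bag of 15.
bag = ["oatmeal"] * 6 + ["raisin"] * 4 + ["chip"] * 5
draws = [(bag[i], bag[j]) for i, j in permutations(range(len(bag)), 2)]

p_two_chip = chance(lambda a, b: a == b == "chip", draws)
p_oat_raisin = chance(lambda a, b: {a, b} == {"oatmeal", "raisin"}, draws)  # both orders
p_same = chance(lambda a, b: a == b, draws)

# Example 48: quiz if the first roll is 2 or less; otherwise homework is
# collected if the second roll is 3 or less. Listing the second roll even
# when it isn't needed doesn't change the probability.
rolls = list(product(range(1, 7), repeat=2))
p_caught = chance(lambda first, second: first <= 2 or second <= 3, rolls)

print(p_two_chip, p_oat_raisin, p_same, p_caught)
```

Notice that the oatmeal-and-raisin test accepts either order, which is exactly the point of Example 46.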
Sequences let you think through a situation without getting confused about which formula may apply. Sometimes no formula applies. Here’s a famous example.
Example 49: You’re a contestant on Let’s Make a Deal. You have to pick one of three doors, knowing that there’s a new car behind one of them and a “zonk” (something funny but worthless) behind the other two. Let’s say you pick Door #1.
The host, who of course knows where the car is, opens Door #2 and shows you a zonk. He then asks whether you want to stick with your choice of Door #1, or instead take what’s behind Door #3. What should you do, and why?
(I gave specific door numbers to help make this problem less abstract, but the specifics don’t matter. What does matter is that you pick a door at random, and the host reveals that a door you didn’t pick is the wrong one.)
Solution: There’s really no formula for this one, because the host’s actions aren’t governed by probability. Once you realize that, it’s easy.
Therefore, P(right door) = 1/3 and P(wrong door) = 2/3.
If you chose the wrong door and switch doors, you will always win because the host has eliminated the other wrong door.
In the long run, keeping your original choice is the winning strategy 1/3 of the time, and switching is the winning strategy 2/3 of the time.
This is the famous Monty Hall Problem. Monty Hall [see “Sources Used” at end of book] developed Let’s Make a Deal and hosted the show for many years. There was a lot of controversy (Tierney 1991 [see “Sources Used” at end of book]) about the answer. Many people who should have known better thought that Door #1 and Door #3 were equally likely after Door #2 was opened. But they forgot that this is not a pure probability problem. The host knows where the car is and picks a door to open based on that knowledge, and that makes all the difference.
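If you’re still not convinced, you can list every case. This Python sketch walks through each combination of car location and first pick, lets the host open a zonk door, and tallies which strategy wins. It assumes the car is equally likely to be behind any door and that the host always opens a zonk door you didn’t pick. The variable names are illustrative.

```python
from fractions import Fraction

doors = [1, 2, 3]
stay_wins = 0
switch_wins = 0

for car in doors:          # where the car really is
    for pick in doors:     # your first choice
        # The host knows where the car is and opens a zonk door you didn't pick.
        openable = [d for d in doors if d != pick and d != car]
        opened = openable[0]   # if he has two choices, either gives the same tally
        # Switching means taking the one door that is neither picked nor opened.
        switched = next(d for d in doors if d not in (pick, opened))
        stay_wins += (pick == car)
        switch_wins += (switched == car)

cases = len(doors) ** 2
p_stay = Fraction(stay_wins, cases)
p_switch = Fraction(switch_wins, cases)
print("P(win by staying)   =", p_stay)
print("P(win by switching) =", p_switch)
```

The host’s knowledge shows up in the line that builds `openable`: he never opens your door and never opens the car’s door.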
Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.
Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.
(b) Construct that sample space.
(c) Find P(2H), the probability of getting exactly two heads.
Need a hint? Think about the two kinds of probability from the beginning of the chapter.
(a) You’re waiting for a flight at the airport. You fall into conversation with a stranger, and you’re surprised to learn that both of you have been victims of violent crime in the past year. Assuming random selection, what are the chances of that happening?
(b) Explain why you cannot use the same technique to find the probability that both members of a married couple have been victims of violent crime in the past year.
US Marital Status in 2006 (in Millions)
              | Men   | Women | Totals
Married       | 63.6  | 64.1  | 127.7
Widowed       | 2.6   | 11.3  | 13.9
Divorced      | 9.7   | 13.1  | 22.8
Never married | 30.3  | 25.0  | 55.3
Totals        | 106.2 | 113.5 | 219.7
(a) Find P(divorced).
(b) Give two interpretations of that probability.
(c) What type of probability is this: classical, empirical, experimental, theoretical?
(d) Find P(divorced^{C}) and give one interpretation.
(e) Find P(man and married).
(f) Find P(man or married). (Work this with and without the formula.)
(g) Find the probability that a randomly selected male was never married: P(never married | male) = ?
(h) Find P(man | married), and interpret as “____% of ____ were ____.”
(i) Find P(married | man), and interpret as “____% of ____ were ____.”
(a) What’s the chance that she leaves your favorites behind?
(b) What’s the chance that all three of her picks are red?
Colors of Plain M&Ms
Blue   | 24%
Orange | 20%
Green  | 16%
Yellow | 14%
Brown  | 13%
Red    | 13%
(a) Find the probability that all three are red.
(b) Find the probability that none are red.
(c) Find the probability that at least one is green.
(d) Find the probability that exactly one is green.
Seat-Belt Use by College Students Driving (sample size: 4521)
Never            | 2.61%
Rarely           | 5.51%
Sometimes        | 7.63%
Most of the time | 15.84%
Always           | 68.41%
Updated 18 Jan 2017
Intro: In Chapter 5, you looked at the probabilities of specific events. In this chapter, you’ll take a more global view and look at the probabilities of all possible outcomes of a given trial.
The random variable is one of the main concepts of statistics, and we’ll be dealing with random variables from now till the end of the course.
A variable is “the characteristic measured or observed when an experiment is carried out or an observation is made.”
—Upton and Cook (2008, 401) [see “Sources Used” at end of book]
If the results of that procedure depend on chance, completely or partly, you have a random variable. Each outcome of the procedure is a value of the variable. We use a capital letter like X for a variable, and a lowercase letter like x for each value of the variable.
As you learned in Chapter 1, numeric variables can be discrete or continuous. A discrete random variable can have only specific values, typically whole numbers. A continuous random variable can have infinitely many values, either across all the real numbers or within some interval.
In this chapter, you’ll be concerned with discrete random variables. In the next chapter, you’ll look at one particular type of continuous random variable, the normal distribution.
Example 1: You roll three dice. The number of sixes that appear is a random variable, and the total number of spots on the upper faces is another random variable. These are both discrete.
Example 2: You randomly select a household and ask the family income for last year. This is a continuous random variable.
Example 3: You randomly select twelve TC3 students, measure their heights, and take the average. “Height of a student” is a continuous random variable, and “average height in a 12-student sample” is another continuous random variable.
Example 4: You randomly select 40 families and ask the number of children in each. “Number of children in family” is a discrete random variable, and “average number of children in a sample of 40 families” is a continuous random variable.
Definition: A discrete probability distribution or DPD (also known as a discrete probability model) lists all possible values of a discrete random variable and gives their probabilities. The distribution can be shown in a table, a histogram, or a formula. Like any probabilities, the probabilities in a DPD can be determined theoretically or experimentally.
Prize       | Declared Value, x | Chance of Winning, P(x)
Two Camaros | $100,000          | 1 in 5,000,000
Cash        | 10,000            | 1 in 1,000,000
Apple iPad  | 1,000             | 1 in 500,000
Various     | 500               | 1 in 250,000
Gift card   | 5                 | 0.9999928
Example 5: In March 2013, Royal Auto sent me one of those “Win big!” flyers with a fake car key taped to it. The various prizes, and chances of winning, are shown in the table above.
This is a discrete probability distribution. The discrete variable X is “prize value”, and the five possible values of X are $100,000 down to $5.
Remember the two interpretations of probability: probability of one = proportion of all. From the table, you can equally well say that any person’s chance of winning a $500 prize is 1/250,000 = 0.000 004 = 0.0004%, or that in the long run 0.0004% of all the people who participate in the promotion will win a $500 prize.
A discrete probability distribution must list all possible outcomes. The total probability for all possible outcomes in any situation is 1. Therefore, for any discrete probability distribution, the probabilities must add up to 1 or 100%.
Definitions: Suppose you do a probability experiment a lot of times. (For the Royal Auto example, suppose bazillions of people show up to claim prizes.) Each outcome will be a discrete value. The mean of the discrete probability distribution, μ, is the mean of the outcomes from an indefinitely large number of trials, and the standard deviation of the discrete probability distribution, σ, is the standard deviation of the outcomes from an indefinitely large number of trials. The mean of any probability distribution is also called the expected value, because it’s the expected average outcome in the long run.
How do you find the mean and SD of a discrete probability distribution? Well, one interpretation of probability is long-term relative frequency, so you can treat a discrete probability distribution as a relative frequency distribution. (You can also think of the probabilities as weights, with the mean as the weighted average.) On the TI-83/84, that means good old 1-Var Stats, just like in Chapter 3.
μ = ∑ x·P(x)
σ = √[ ∑ (x²·P(x)) − μ² ]
For ∑, see ∑ Means Add ’em Up in Chapter 1.
Example 6: To find the mean and SD of the distribution of winnings in the Royal Auto sweepstakes, put the x’s in one list and the P(x)’s in another list. Caution: When the probability is a fraction, enter the fraction, not an approximate decimal. The calculator will display an approximate decimal, but it will do its calculations on a much more precise value.
After entering the x’s and p’s, press [STAT] [►] [1] and specify your two lists, such as 1-Var Stats L1,L2. (Yes, the order matters: the x list must be first and the P(x) list second.) When you get your results, check n first. In a discrete probability distribution, n represents the total of the probabilities, so it must be exactly 1. If it’s just approximately 1, you made a mistake in entering your probabilities.
The mean of the distribution is μ = $5.03, and the standard deviation is σ = $45.85.
Interpretation: In the long run, the dealership will have to pay out $5.03 per person in prizes. The SD is a little harder to get a grasp on, but notice that it’s more than nine times the mean. This tells you that there is a lot of variability in outcome from one person to the next. In general, the mean tells you the longterm average outcome, and the SD tells you the unpredictability of any particular trial. You can look at the SD as a measure of risk.
A couple of notes about the calculator output: The calculator knows that a DPD is a population, so it gives you σ and not s for the SD. It should give you μ for the mean, but instead it displays x̅, so you need to make the change. I’ve already mentioned that the sum of the probabilities (n) must be exactly 1, not just approximately 1.
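If you don’t have the calculator handy, the two formulas for μ and σ are easy to sketch in Python. The function name `dpd_stats` is made up for this illustration. Entering the probabilities as exact fractions mirrors the caution in Example 6, and the check that they total exactly 1 plays the role of checking n.

```python
from fractions import Fraction
from math import sqrt

def dpd_stats(xs, ps):
    """Return (mu, sigma) for a discrete probability distribution."""
    if sum(ps) != 1:                       # same check as n = 1 on the calculator
        raise ValueError("probabilities must total exactly 1")
    mu = sum(x * p for x, p in zip(xs, ps))                 # mu = sum of x P(x)
    variance = sum(x * x * p for x, p in zip(xs, ps)) - mu ** 2
    return float(mu), sqrt(variance)       # sigma = sqrt(sum of x^2 P(x) - mu^2)

# Royal Auto sweepstakes, Example 5
prizes = [100_000, 10_000, 1_000, 500, 5]
big_prizes = [Fraction(1, 5_000_000), Fraction(1, 1_000_000),
              Fraction(1, 500_000), Fraction(1, 250_000)]
probs = big_prizes + [1 - sum(big_prizes)]   # gift card gets the rest: 0.9999928

mu, sigma = dpd_stats(prizes, probs)
print(f"mu = ${mu:.2f}, sigma = ${sigma:.2f}")
```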
Example 7: When visiting the city, should you park in a lot or on the street? On a quarter of your visits (25%), you park for an hour or less, which costs $10 in a lot; for parking more than an hour they charge a flat $14. If you park on the street, you might receive a simple $30 parking ticket (p = 20%), or a $100 citation for obstruction of traffic (p = 5%), but of course you might get neither. Which should you do?
(Adapted from Paulos 2004 [see “Sources Used” at end of book].)
You have two probability models here, one for the outcomes of parking in a lot, and one for street parking. Begin by putting the two models into tables:


The problem leaves out some things that you can figure for yourself. Remember that every probability model includes all outcomes, and the probabilities add up to 1. If there’s a 25% chance of parking up to an hour, there must be a 100−25 = 75% chance of parking more than an hour. And on the street, if you have a 20+5 = 25% chance of getting some kind of ticket, you have a 100−25 = 75% chance of getting neither. The cost of getting neither ticket is zero.
Now you can fill in the empty cells in the tables.
Parking Lot     | x    | P(x)
Up to one hour  | $10  | 25%
Over one hour   | $14  | 75%
Total           |      | 100%

Street Parking  | x    | P(x)
Parking ticket  | $30  | 20%
Obstruction     | $100 | 5%
Neither         | $0   | 75%
Total           |      | 100%


I showed the total probability to emphasize that it’s 1. Never compute the total of the outcomes (x’s), because that wouldn’t mean anything.
How do these tables help you make up your mind where to park? By themselves, they don’t. But they let you compute μ and σ, and that will help you decide.
I placed the x’s and P(x)’s for the parking lot in L1 and L2, and did 1-Var Stats L1,L2. I placed the x’s and P(x)’s for street parking in L3 and L4 and did 1-Var Stats L3,L4. Here are the results:
Lot: x̅ = 13, σ ≈ 1.73, n = 1
Street: x̅ = 11, σ ≈ 23.64, n = 1
As always, look first at n. If it’s not exactly 1, find your mistake in entering the probabilities.
Now you can interpret these results. Parking in the lot is a bit more expensive in the long run (μ = $13.00 per day versus μ = $11.00 per day). But there are no nasty surprises (σ = $1.73, little variation from day to day). Parking on the street is much riskier (σ = $23.64), meaning that what happens today can be wildly different from what happened yesterday.
So what should you do? Statistics can give you information, but part of your decision is always your own temperament. If you like stability and predictability — if you are risk averse — you’ll opt for the parking lot. If it’s more important to you to save $2 a day on average, and you can accept occasionally getting hit with a nasty fine, you’ll choose to park on the street.
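To get a feel for what “in the long run” means here, you can simulate a great many visits. This Python sketch is only an illustration: the seed and the number of simulated days are arbitrary choices, and the simulated averages will land near μ and σ without matching them exactly.

```python
import random
from statistics import mean, pstdev

random.seed(2013)     # arbitrary seed, so the run is repeatable
days = 100_000        # assumed number of simulated visits

# Street parking: $30 ticket (20%), $100 citation (5%), nothing (75%)
street = random.choices([30, 100, 0], weights=[0.20, 0.05, 0.75], k=days)
# Parking lot: $10 for up to an hour (25%), $14 otherwise (75%)
lot = random.choices([10, 14], weights=[0.25, 0.75], k=days)

street_mean, street_sd = mean(street), pstdev(street)
lot_mean, lot_sd = mean(lot), pstdev(lot)

print(f"Street: average ${street_mean:.2f}, SD ${street_sd:.2f}")
print(f"Lot:    average ${lot_mean:.2f}, SD ${lot_sd:.2f}")
print("First ten street days:", street[:10])
print("First ten lot days:   ", lot[:10])
```

The last two lines show the risk in a way the averages can’t: the lot costs are boringly similar, while the street costs are mostly zero with an occasional nasty surprise.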
The fair price of a game is the price that would make the expected value or mean value of the probability distribution equal to zero, the break-even point.
(“Fair price” is one of those math words that look like English but mean something different. You should expect to pay more than the fair price because the operator of the game — the insurance company or casino or stockbroker — also has to cover selling and administrative expenses.)
There are two ways to compute the fair price:
Die shows | x             | P(x)
1,2,3,4,5 | −$12          | 5/6
6         | $60−12 = $48  | 1/6
Total     |               | 6/6 = 1
Example 8: Take a really simple bar game: a stranger offers to pay you $60 if you roll a 6 with a standard six-sided die, but you have to pay him $12 per roll. Find the fair price of this game.
Method 1: The only prize is $60, and you have a 1/6 chance of winning it. $60×(1/6) = $10.
Method 2: Amounts in L1, probabilities in L2; 1-Var Stats L1,L2. Verify that n=1, and read off the mean of −$2. The actual price is $12, so the fair price is $12 + (−2) = $10.
Naturally, the two methods always give the same answer. Method 2 is easier if you already know the mean of the probability distribution; otherwise Method 1 is easier.
Example 9: A lottery has a $6,000,000 grand prize with probability of winning 1 in 3,000,000. It also has a $10 consolation prize with probability of winning 1 in 1000. What is the fair price of your $5 lottery ticket?
Solution: You don’t need μ, so Method 1 is easier: multiply each prize by its probability and add up the products. $6,000,000×(1/3,000,000) + $10×(1/1000) → fair price is $2.01.
Why does a lottery ticket that is worth $2.01 actually cost $5.00? In effect, the lottery is paying out about 2.01/5.00 ≈ 40% of ticket sales in prizes. Some of the 60% that the lottery commission keeps will cover the lottery’s own expenses, and the rest is paid to the state treasury. This is actually fairly typical: most lotteries pay out in prizes less than half of what they take in. By contrast, the illegal “numbers game” pays out about 70%, or at least it did in the 1980s in Cleveland. (Don’t ask me how I know that!)
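Both methods from Examples 8 and 9 fit in a few lines of Python. The function name `fair_price` is invented for this sketch, and exact fractions are used so that the results match the hand arithmetic.

```python
from fractions import Fraction

def fair_price(prizes):
    """Method 1: multiply each prize by its probability and add the products."""
    return sum(value * prob for value, prob in prizes)

# Example 8: the bar game pays $60 on a roll of 6.
bar_game = [(60, Fraction(1, 6))]
method_1 = fair_price(bar_game)

# Method 2: mean of the net outcomes, plus the actual price of $12.
net_outcomes = [(-12, Fraction(5, 6)), (48, Fraction(1, 6))]
mean_net = sum(x * p for x, p in net_outcomes)
method_2 = 12 + mean_net

# Example 9: the lottery with a $5 ticket.
lottery = [(6_000_000, Fraction(1, 3_000_000)), (10, Fraction(1, 1000))]
lottery_fair = fair_price(lottery)
payout_share = lottery_fair / 5          # share of ticket sales paid out in prizes

print(f"Bar game: Method 1 = ${method_1}, Method 2 = ${method_2}")
print(f"Lottery fair price = ${float(lottery_fair):.2f}, payout = {float(payout_share):.0%}")
```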
In the examples so far of probability models, I’ve had to give you a table of probabilities. But there are many subtypes of discrete probability distribution where the probabilities can be calculated by a formula. The rest of this chapter will look at part of one family, discrete probability distributions that come from Bernoulli trials.
Repeated trials of a process or an event are called Bernoulli trials if they have both of these characteristics:
If the probability of success on each trial is p, then the probability of failure on each trial is 1−p, or q for short.
Bernoulli trials are named after Jacob Bernoulli, a Swiss mathematician. He developed the binomial distribution, which you’ll meet later in this chapter.
Example 10: You randomly interview 30 people to find out which party they will vote for in the next election. These are not Bernoulli trials, because there are more than two possible outcomes. (New York State ballots often have six or more parties listed, though some parties just endorse the Republican or Democratic candidate.)
On reflection, you realize that you don’t care which party a given voter will choose. All you care about is whether they are voting for your candidate or not, so you randomly select 30 registered voters and ask, “Will you be voting for Abe Snake for President?” (Yes, that’s a real thing; here’s a video.) These are Bernoulli trials, because there are only two answers, and the probability of voting for Abe Snake is the same for each randomly selected person. (p equals the proportion of Abe Snake voters in the population. Remember, proportion of all = probability of one.)
Actually, this overlooks the undecided or “swing” voters. These become fewer as the election gets closer, but in real life they can’t be overlooked because they may be a larger proportion than the leading candidate’s lead.
You draw cards from a deck until you get a heart. These are not Bernoulli trials. Although there are only two outcomes, heart and other suit, the probability changes with each draw because you have removed a card from the deck.
Variation: You replace each card and reshuffle the deck before drawing the next card. Then these become Bernoulli trials because the probability of drawing a heart is 25% on every trial.
Variation: You have five decks shuffled together, instead of one 52-card pack. You don’t replace cards after drawing them. You can treat these as Bernoulli trials even without replacement, because you won’t be drawing enough cards to alter the probabilities significantly.
How do I know? Five packs is 260 cards, and 10% of 260 is 26. On the first card, P(heart) = 25%. It’s quite unlikely that you’d have no hearts by the 26th card (0.04% chance), but if you did, the probability of a heart on the 27th card would be: 5×13/(5×52−26) ≈ 27.8%. That’s not much different from the original 25%.
(You don’t have to take my word for these probabilities. Use the sequences method from Chapter 5 to compute them.)
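One way to check those two figures is a short Python sketch of the sequences method, multiplying the chance of a non-heart on each successive draw. The function name `p_no_hearts` is an illustration only; the five-deck setup and the 26-card cutoff are taken from the text.

```python
from fractions import Fraction

HEARTS = 5 * 13      # hearts in five decks shuffled together
CARDS = 5 * 52       # total cards in five decks

def p_no_hearts(draws):
    """P(no hearts in the first `draws` cards), drawing without replacement."""
    prob = Fraction(1)
    for i in range(draws):
        non_hearts_left = (CARDS - HEARTS) - i
        cards_left = CARDS - i
        prob *= Fraction(non_hearts_left, cards_left)
    return prob

p_dry_spell = p_no_hearts(26)                 # no hearts in 26 cards
p_heart_27th = Fraction(HEARTS, CARDS - 26)   # all 65 hearts remain among 234 cards

print(f"P(no hearts in 26 cards) = {float(p_dry_spell):.2%}")     # text: 0.04%
print(f"P(heart on 27th card)    = {float(p_heart_27th):.1%}")    # text: 27.8%
```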
Although this sample without replacement violates independence, it doesn’t violate it by very much, not enough to worry about. This bears out what I said earlier: Trials without replacement can still be treated as independent when the sample is small relative to the population.
Example 13: According to the AVMA (2014) [see “Sources Used” at end of book], 30.4% of US households own one or more cats. Suppose you randomly select some households.
(a) How likely is it that the first time you find cat owners is in the fifth household?
(b) How likely is it that your first cat-owning household will be somewhere in the first five you survey?
Although you could compute these individual probabilities using techniques from Chapter 5, there’s a specific model called the geometric model that makes it a lot easier to compute. Also, using the geometric model you can get an overview of the probabilities for various outcomes, which you’d miss by computing probabilities of specific events using the previous chapter’s techniques. If trials are independent, and you want the probability of a string of failures before your first success, you’re using a geometric model.
The geometric model, also known as the geometric probability distribution, is a kind of discrete probability distribution that applies to Bernoulli trials when you try, and try, and try again until you get a success. P(x) is the probability that your first success will come on your xth attempt, after x−1 failures.
Expanding on the definition of Bernoulli trials, you can say that a geometric model is one where
The probability of success on any given trial, p, completely describes a geometric model.
Here’s a picture of part of the geometric model for cat-owning households, with p = 0.304.
How do you read this? The horizontal axis is x, the number of the trial that gives your first success, and the vertical axis is P(x), the probability of that outcome.
For example, there’s a hair over a 30% chance that you’ll find cat owners in your first household, P(1) = 30.4%. There’s about a 21% chance that the first household won’t own cats but the second household will, P(2) ≈ 21%. Skipping a bit to x = 6, there’s just about a 5% chance that the first five households won’t have cats but the sixth will, P(6) ≈ 5%. And so forth.
x = 1 is always the most likely outcome, and larger x values are successively less and less likely. This is true for every geometric distribution, not just this particular one with p = 0.304.
The geometric model never actually ends. The probabilities eventually get too small to show in the picture, but no matter what x you pick, the probability is still greater than 0.
Your TI-83/84 calculator has two menu selections for the geometric model:
geometpdf(p,x) answers the question “what’s the probability that my first success will come at trial number x?”
geometcdf(p,x) answers the question “what’s the probability that my first success will come at or before trial number x?” (The “c” stands for cumulative, because the cdf functions accumulate the probabilities for a range of outcomes.)
They’re both in the [2nd VARS makes DISTR] menu.
(If you have a calculator in the TI-89 family, use the [F5] Distr menu. Select Geometric Pdf and Geometric Cdf.)
Let’s use the calculator to find the answers for Example 13. Here p, the probability of success in any given household, is 30.4% or 0.304.
Part (a) wants the probability of four failures followed by a success on the fifth try. For that you use geometpdf. Press [2nd VARS makes DISTR] [▲] [▲] to get to geometpdf, and press [ENTER].
With the “wizard” interface: Enter p and x. Press [
With the classic interface: After entering p and x, press [)] [ENTER] to get the answer.
geometpdf(.304,5) = .0713362938 → 0.0713
There’s about a 7% chance you won’t find any cat owners in the first four households but you will in the fifth household.
(You could calculate this the long way. The probability of four failures followed by a success is (1−.304)^{4}×.304. But the geometric model is easier. That’s the point of a model: one general rule works well enough for all cases, so you don’t have to treat each situation as a special case with its own unique methods.)
Part (b) wants the probability of a success occurring anywhere in the first five trials. This is a geometcdf problem. Press [2nd VARS makes DISTR] [▲] to get to geometcdf, and press [ENTER].
With the “wizard” interface: Enter p and x. Press [
With the classic interface: After entering p and x, press [)] [ENTER] to get the answer.
geometcdf(.304,5) = .8366774327 → 0.8367
There’s almost an 84% chance you will find at least one cat-owning household among the first five.
(Doing this the long way, you would use the complement. The complement of “at least one cat-owning household in the first five” is “no cat-owning households in the first five”. The probability that a given household doesn’t own a cat is q = 1−.304 = 0.696, and the probability that five in a row don’t own cats is 0.696^{5}. Therefore the original probability you wanted is 1−(.696^{5}) = 0.8367.)
You don’t actually need formulas for the geometric model, but if you’re curious about what your calculator is doing, here they are:
geometpdf(p,x) = q^{x−1}p
geometcdf(p,x) = 1−q^{x}
where q = 1−p as usual. You can see that the two “long way” paragraphs above actually used those formulas.
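To see those two formulas at work outside the calculator, here is a minimal Python sketch. The function names simply mirror the TI-83/84 menu names; they are not from any Python library, and the sketch assumes nothing beyond the formulas just given.

```python
def geometpdf(p, x):
    """P(first success comes on trial x) = q^(x-1) * p."""
    q = 1 - p
    return q ** (x - 1) * p

def geometcdf(p, x):
    """P(first success comes on or before trial x) = 1 - q^x."""
    q = 1 - p
    return 1 - q ** x

p = 0.304                            # proportion of cat-owning households
print(round(geometpdf(p, 5), 4))     # Example 13(a); the text reports 0.0713
print(round(geometcdf(p, 5), 4))     # Example 13(b); the text reports 0.8367
```

As a sanity check, adding geometpdf(p, x) for x = 1 through 5 gives the same number as geometcdf(p, 5), which is what “cumulative” means.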
The geometric distribution is completely specified by p, so you can compute the mean and standard deviation quite easily:
μ = 1/p
σ = μ √q or (1/p) √(1−p)
Example 14: 30.4% of US households own cats. How many households do you expect you’ll need to visit to find a cat-owning household?
Solution: The expected value of a distribution is the mean. μ = 1/p = 1/.304 = 3.289473684. μ = 3.3. Interpretation: On average, you expect to have to visit between 3 and 4 households to find the first cat owners.
Caution! The expected value (mean) is not the most likely value (mode). Take a look back at the histogram, and you’ll see that the most likely value is 1: you’re more likely to get lucky on the first trial than on any other specific trial. But the distribution is highly skewed right, so the average gets pulled toward the higher numbers.
To compute the SD, just multiply the mean by √q. A handy technique is called chaining calculations. After first calculating the mean, press the [×] key, and the calculator knows you are multiplying the previous answer by something. Here you see that σ = 2.7.
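The same chained calculation can be sketched in a few lines of Python, if you want to confirm the rounding. The function name `geometric_mean_sd` is hypothetical; the formulas are the ones given above.

```python
import math

def geometric_mean_sd(p):
    """Mean and SD of a geometric model: mu = 1/p, sigma = mu * sqrt(q)."""
    q = 1 - p
    mu = 1 / p
    sigma = mu * math.sqrt(q)   # "chaining": reuse the mean just computed
    return mu, sigma

mu, sigma = geometric_mean_sd(0.304)
print(round(mu, 1), round(sigma, 1))   # the text rounds these to 3.3 and 2.7
```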
Interpreting σ is a bit harder. The geometric distribution is a type of discrete probability distribution, so you interpret its standard deviation the same way as for any other DPD. In this particular example, σ is almost as large as μ, so you expect a lot of variability. If you and a lot of coworkers go out independently looking for households with cats, the group average number of visits will be 3.3 households, but there will be a lot of variability between different workers’ experience. You can’t use the Empirical Rule here because the geometric model is not a bell curve, but you can at least say you won’t be surprised to find workers who get lucky on the first house (μ−σ ≈ 0.5), and workers who have to visit six houses or more (μ+σ ≈ 6.0).
Some people find it very hard to make choices because they feel they must consider all the pros and cons of every possibility. Others look at possibilities one at a time and take the first one that’s acceptable. Studies such as The Tyranny of Choice (Roets, Schwartz, Guan 2012 [see “Sources Used” at end of book]) show that the first group may make better choices objectively, but the second group is happier with the items they choose.
Example 15: You have to buy a new sofa. You’d be content with 55% of the sofas out there. Let’s assume that your Web search presents sofas in an order that has nothing to do with your preferences. There are hundreds to choose from, so you decide to adopt the “first one that’s acceptable” strategy. How likely is it that you’d order the third sofa you’d see?
Solution: This is a geometric model, with two failures followed by one success. p = 55%. geometpdf(.55,3) = .111375. There’s about an 11% chance you’d order the third sofa.
Example 16: Larry’s batting average is .260. During which time at bat would he expect to get his first hit of the game? How likely is he to get his first hit within his first four times at bat?
Solution: This is a geometric model with p = 0.260. The mean or expected value is 1/p = 1/.26 = 3.85, about 4. On average, his first hit each game will come on his fourth time at bat.
For the second question, geometcdf(.26,4) = .70013424; there’s about a 70% chance he’ll get his first hit within his first four times at bat.
In the previous section, we looked at the geometric model, where you just keep trying until you get a success. In this section, we’ll look at the binomial model, where you have a fixed number of trials and a varying number of successes.
The binomial model, also known as the binomial probability distribution or BPD, is a kind of discrete probability distribution that applies to Bernoulli trials when you have a fixed number of trials, n.
Expanding on the definition of Bernoulli trials, you can say that a binomial model is one where
Example 17: Cats again! 30.4% of US households own one or more cats. You visit five households, selected randomly.
(a) What’s the chance that no more than two have cats?
(b) What’s the chance that exactly two have cats?
(c) What’s the chance that at least two have cats?
(d) What’s the chance that two to four have cats?
This problem fits the binomial model: n = 5 trials, each household does or does not have cats, and the probability p = 30.4% is the same for each household.
A picture of this binomial distribution is shown at right, and you can see some differences from the picture of the geometric distribution:
How do you read the picture? There’s about a 16% probability that none of the five households will have cats, about 36% that one of the five will have cats, and so on. (Why 36% and not 30.4%? Because there’s a greater chance of “winning” one out of five than one out of one.)
In this book we’re more concerned with computing probabilities, but it can be nice to get an overall picture of a distribution. I made this particular graph by using @RISK from Palisade Corporation, but you can also make histograms of binomial distributions by using MATH200A Program part 1(5).
Here you have a choice. Your TI-83/84 calculator comes with two menu selections for the binomial model, but the MATH200A program gives you a simpler interface. Here’s a quick overview of both, before we start on computations:
With the MATH200A program (recommended):  If you’re not using the program: 

MATH200A Program part 3 gives you one interface for all binomial probability calculations. The program might already be on your calculator from Chapter 3 boxplots, but if it’s not, see Getting the Program for instructions. To find binomial probability with the program, press
[
That puts the program name on your home screen. Press
[ 
These are both in the [

(Got a TI-89 family calculator? Use the [F5] Distr menu. Select Binomial Pdf or Binomial Cdf. The Cdf function can handle any range of successes, not just 0 to x. See Binomial Probability Distribution on TI-89 for full instructions.)
Now let’s use your TI-83/84 to answer the questions in Example 17. You have five trials, so n = 5. The probability of success on any given household is 30.4%, so p = 0.304.
(a) What’s the probability that no more than two of the five randomly selected households have cats?
With the MATH200A program (recommended):  If you’re not using the program: 

Press [ Enter n and p. “No more than two cats” is from 0 to 2 cats, so enter those values when prompted. The program echoes back your inputs and shows the computed probability. To show your work, write down the screen name, the inputs, and the result. Conclusion: P(x ≤ 2) or P(0 ≤ x ≤ 2) = 0.8316. 
The probability that no more than two of your five
households have cats (in other words, the probability that 0 to 2 have
cats) is
If you don’t have the “wizard” interface, or
you have it turned off,
If you have the “wizard” interface, you get a menu
screen, but you enter the same information. Press [ Either way, write down the Conclusion: P(x ≤ 2) or P(0 ≤ x ≤ 2) = 0.8316. 
(b) What’s the probability that exactly two of five randomly selected households are cat owners?
With the MATH200A program (recommended):  If you’re not using the program: 

You need a specific number of successes, instead of a range.
It’s almost exactly the same deal: you
just enter the same number for Conclusion: P(x = 2) or P(2) = 0.3116. 
(a) The probability of exactly two catowner households in
five is (The “wizard” interface screen is the
same as it was for Conclusion: P(x = 2) or P(2) = 0.3116. 
(c) What’s the probability that at least two of the five randomly selected households have cats?
With the MATH200A program (recommended):  If you’re not using the program: 

“At least two”, in a sample of five, means from two to five successes. Enter those values in MATH200A part 3. Here’s the results screen:
Conclusion: P(x ≥ 2) or P(2 ≤ x ≤ 5) = 0.4800. 
This one is a little trickier. You could find P(2), P(3), P(4), and P(5) and add them up by hand, but that’s tedious and error prone, and it can introduce rounding errors. Instead, you’ll make the calculator add them up for you.
First, get all the probabilities for 0 through n successes into a statistics list. To do this, use
After the closing paren, don’t press [
Now you need to sum the desired range of cells. You want 2 ≤ x ≤ 5. But the lowest possible x is 0, and the cells in statistics lists are numbered starting at 1. So to get x from 2 through 5, you need cells 3 through 6. When summing part of a list, add 1 to your desired x values. Press [
Your answer: P(x ≥ 2) or P(2 ≤ x ≤ 5) = 0.4800.
Beware of off-by-one errors when you solve problems with phrases like at least and no more than. Always test the “edge conditions”. “Okay, I need at least 2, and that’s 2 through 5, not 3 through 5. Oh yeah, add 1 for the statistics list in the TI-83, so I’m summing cells 3 through 6, not 2 through 5.”
Alternative solution: Do you remember solving “at least” problems in Chapter 5? What was the lesson there? With laborious probability problems, the complement is your friend. What’s the complement of “at least two”? It’s “fewer than two”, which is the same as “no more than one”. Shaky on the logic of complements? Use the enumeration method from Chapter 5: 0 1 2 3 4 5 or 0 1  2 3 4 5. Find the probability of ≤1 household with cats, and subtract from 1:
P(x ≥ 2) = 1 − P(x ≤ 1)
P(x ≥ 2) = 1 − binomcdf(5, .304, 1)
P(x ≥ 2) = .4799959639 → 0.4800
(d) What’s the chance for two to four cat-owning households in your random sample of five households?
With the MATH200A program (recommended):  If you’re not using the program: 

Nothing new here: just use good old MATH200A part 3. Here’s the results screen:
P(2 ≤ x ≤ 4) = 0.4774. 
You need x from 2 through 4, but remember you always add 1 when summing binomial probabilities from a statistics list, so you put 3 to 5 in your
P(2 ≤ x ≤ 4) = 0.4774.
Alternative solution: You can also do it without summing. If you think about it, the probability for x from 2 to 4 is the probability for x from 0 to 4, with x below 2 (x no more than 1) removed:
P(0 ≤ x ≤ 1) + P(2 ≤ x ≤ 4) = P(0 ≤ x ≤ 4)
and by subtracting that first term you get
P(2 ≤ x ≤ 4) = P(0 ≤ x ≤ 4) − P(0 ≤ x ≤ 1)
Your probability is the result of subtracting two cumulative probabilities, the cdf from 0 to 4 minus the cdf from 0 to 1. It’s shown at right.
This is tricky, I admit. You have to set that x value correctly in the second
You don’t actually need a formula for the binomial model, but if you’re curious about what your calculator is doing, here it is:
binompdf(n,p,x) = _{n}C_{x} · p^{x} q^{n−x}
Why? p^{x} is the probability of getting successes on all of the first x trials. q is the probability of failure on one trial, and therefore q^{n−x} is the probability of failure on the remaining trials, after the x out of n successes. But in a binomial probability model, you care how many successes and failures there are, not in what order they occur. To account for the fact that order doesn’t matter, the formula has to multiply by _{n}C_{x}, “the number of ways to choose x objects out of n”.
(If you want to know more about _{n}C_{x}, search “combinations” at your favorite math site.)
Unlike the geometric case, there’s no simple formula for binomcdf. Your calculator just has to compute probabilities for x = 0, 1, and so on and add them up.
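Here is a minimal Python sketch of that add-them-up approach, applied to all four parts of Example 17. The name `binompdf` mirrors the calculator function, and `binom_range` is a hypothetical helper that plays the role of summing part of a statistics list; neither comes from a Python library.

```python
from math import comb

def binompdf(n, p, x):
    """P(exactly x successes in n trials) = nCx * p^x * q^(n-x)."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)

def binom_range(n, p, lo, hi):
    """P(lo <= x <= hi), found by adding the individual probabilities."""
    return sum(binompdf(n, p, x) for x in range(lo, hi + 1))

n, p = 5, 0.304
print(round(binom_range(n, p, 0, 2), 4))   # (a) no more than two; text: 0.8316
print(round(binom_range(n, p, 2, 2), 4))   # (b) exactly two;      text: 0.3116
print(round(binom_range(n, p, 2, 5), 4))   # (c) at least two;     text: 0.4800
print(round(binom_range(n, p, 2, 4), 4))   # (d) two to four;      text: 0.4774
```

Because Python's range starts wherever you tell it to, there is no "add 1" adjustment here, but the edge conditions still deserve a second look: "at least two" is 2 through 5, not 3 through 5.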
Example 18: Larry’s batting average is .260. How likely is it that he’ll get more than one hit in four times at bat?
Solution: This is a binomial model with n = 4, p = 0.26, x = 2 to 4. You can use MATH200A part 3 or the binompdf sum technique to get .27870128. P(x > 1) = 0.2787 or about 28%. (The program is completely straightforward, so I’m showing only the tricky binompdf sum sequence here.)
Alternative solution: If you don’t have the program, can you see how to use the complement to solve this problem more easily? Check your answer against mine to be sure that your method is correct.
The binomial distribution depends on the proportion in the population (p) and your sample size (n). You can compute the mean and SD quite easily:
μ = np
σ = √[npq]
What are the mean and SD of the number of cat-owning households in a random sample of five households?
μ = np = 5 × 0.304 = 1.52
σ = √[npq] = √[5 × .304 × (1−.304)] = 1.028552381
Conclusion: μ = 1.5 and σ = 1.0.
Interpretation: in a sample of five households, the expected number of cat-owning households is 1.5. Or, if you take a whole lot of samples of five households, on average you will find that 1.5 households per sample own cats. The SD is 1.0. You can’t use the Empirical Rule, but you can say that you expect most of the samples of five to contain μ±2σ = 1.5±2×1.0 = 0 to 3 cat-owning households.
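For readers who want to see the arithmetic spelled out, the following Python sketch computes the same mean and SD. The function name `binomial_mean_sd` is an illustration only. The range is clipped to 0 through n, since a count of households can't fall outside those limits.

```python
import math

def binomial_mean_sd(n, p):
    """Mean and SD of a binomial model: mu = np, sigma = sqrt(npq)."""
    q = 1 - p
    return n * p, math.sqrt(n * p * q)

n, p = 5, 0.304
mu, sigma = binomial_mean_sd(n, p)

# mu +/- 2 sigma, clipped to the possible counts 0..n
low = max(0, mu - 2 * sigma)
high = min(n, mu + 2 * sigma)

print(round(mu, 1), round(sigma, 1))   # the text rounds these to 1.5 and 1.0
print(math.ceil(low), "to", math.floor(high), "households")
```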
Example 19: 30.4% of US households own one or more cats. You visit ten random households and seven of them own cats. Are you surprised at this result?
A result is surprising or unusual or unexpected if it has low probability, given what you think you know about the population in question. The threshold for “low probability” can vary in different problems, but a typical choice is 5%.
When we ask whether a result is surprising (unusual, unexpected), we are really talking about that result or one even further from the expected value.
You think you know that 30.4% of US households own cats. A sample of ten doesn’t seem very large; how do you decide whether seven successes seems reasonable or unreasonable?
First, what’s the expected value? That’s μ = np = 10×.304 = 3.04.
Next, what does “that result or one further from the expected value” mean? The expected value is 3.04, seven is greater than 3.04, so we’re talking about seven or more successes, x = 7 to 10.
Find the probability of that result or one even further from the expected value.
That’s easiest with MATH200A part 3: set n=10, p=.304, x=7 to 10. You can also do it with binomcdf: seven or more successes is the complement of zero to six successes (0 1 2 3 4 5 6 7 8 9 10). Either way, the probability is 0.0115 or just over 1%.
Draw your conclusion. If 30.4% of US households own cats, finding seven or more cat houses in a random sample of ten households is unusual (surprising, unexpected).
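The steps of Example 19 can be sketched in Python as follows. This is only one way to organize the check: the names `tail_probability` and `THRESHOLD` are hypothetical, and the 5% cutoff is the “typical choice” mentioned above, not a fixed rule.

```python
from math import comb

def binompdf(n, p, x):
    """P(exactly x successes in n trials)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def tail_probability(n, p, observed):
    """P(the observed result or one even further from the expected value np)."""
    mu = n * p
    if observed >= mu:
        xs = range(observed, n + 1)   # observed or more successes
    else:
        xs = range(0, observed + 1)   # observed or fewer successes
    return sum(binompdf(n, p, x) for x in xs)

THRESHOLD = 0.05   # typical cutoff for "low probability"

prob = tail_probability(10, 0.304, 7)   # seven or more cat owners out of ten
print(round(prob, 4))                   # the text reports 0.0115
print("surprising" if prob < THRESHOLD else "not surprising")
```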
That was a trivial example. But in real life, when a result is unexpected it can cast doubt on what you’ve been told. Here’s an example.
Example 20: In Talladega County, Alabama, in 1962, an African American man named Robert Swain was accused of rape. 26% of eligible jurors in the county were African American, but the 100-man jury panel for Swain’s trial included only 8 African Americans. (Through exemptions and peremptory challenges, all were excluded from the final 12-man jury.) Swain was convicted and sentenced to death.
Swain’s lawyer appealed, on grounds of racial bias in jury selection. The Supreme Court ruled in 1965 that “The overall percentage disparity has been small and reflects no studied attempt to include or exclude a specified number of blacks.”
—Adapted from Michailides [see “Sources Used” at end of book]
What do you think of that ruling? If 100 men in the county were randomly selected, is eight out of 100 in the jury pool unexpected (unusual, surprising)?
Solution: This is a binomial model: every man in the county either is or is not African American, the sample size is a fixed 100, and in a random sample there’s the same 26% chance that any given man is African American.
To determine whether eight in 100 is unexpected, ask what is expected. For binomial data, μ = np = 100×.26 = 26; in a sample of 100, you expect 26 African Americans.
Okay, 26 is expected, 8 is less than 26, “further away from expected” is less than 8, so you compute the probability for x = 0 to 8.
Use binomcdf(100,.26,8) or MATH200A part 3. Either way you get a probability of 4.734795002E−6, or about 0.000 005, five chances in a million. That is unexpected. It’s so unlikely that we have to question the county’s claim that the selection was random.
Unfortunately, Mr. Swain’s lawyer didn’t consult a statistician.
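If you'd like to reproduce the jury-panel probability without a TI-83/84, here is a minimal Python sketch. The function name `binomcdf` mirrors the calculator function and is not a Python library call; n = 100, p = 0.26, and x = 8 come straight from Example 20.

```python
from math import comb

def binomcdf(n, p, x):
    """P(0 to x successes in n Bernoulli trials with success probability p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(x + 1))

expected = 100 * 0.26             # mu = np, the expected count in the panel
prob = binomcdf(100, 0.26, 8)     # eight or fewer, i.e. x = 0 to 8

print(f"expected count = {expected:.0f}")
print(f"P(x <= 8) = {prob:.2e}")  # the text reports about five in a million
```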
(The online book has live links to all of these.)
binomcdf/binompdf or MATH200A, but MATH200A is less work.
Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.
Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.
Velma the Vampire will drink anything, but she prefers O negative. She doesn’t know a victim’s blood type until she tastes it.
(a) How many does she expect to drain before she gets some O negative?
(b) How likely is it that she’ll find her first O negative within her first ten victims?
(c) How likely is it that exactly two of her first ten victims will be O negative?
x      0       1       2       3       4       5
P(x)   0.0778  0.2591  0.3456  0.2305  0.0768  0.0102
(a) Find and interpret the mean and standard deviation of this probability model.
(b) For an extra challenge, can you use your answer from part (a) to construct a simpler probability model for five flips of this coin?
Updated 19 Dec 2016
You met random variables back in Chapter 6. Any random variable has a single numerical value, determined by chance, for each outcome of a procedure. Discrete random variables are limited to specified values, usually whole numbers. But a continuous random variable can take any value at all, within some interval or across all the real numbers.
Just as discrete probability models are used to model discrete variables, continuous probability models are used to model continuous variables. Of course, because a continuous random variable has infinitely many possible values, you can’t make a table of values and probabilities as you could do for a discrete distribution. Instead, either there’s an equation, or just a density curve (below).
A probability model is often called a distribution, so you can say that a variable “is normally distributed” (ND), that it “is a normal distribution” (also ND), or that it “follows a normal probability model”.
There are lots of specialized continuous distributions, but the normal distribution is most important by a wide margin. Many, many reallife processes follow the normal model, and the ND is also the key to most of our work in inferential statistics.
This section will give you some concepts that are common to all continuous distributions, and the rest of the chapter will talk about special properties of the normal distribution and applications. In Chapter 8, you’ll apply the normal distribution to get a handle on the variation from one sample to the next.
In Chapter 2, you learned to graph continuous data by grouping the data in classes and making a histogram, like the one below left. It shows wait times in a fast-food drive-through, with time in minutes: not whole minutes, which would make a discrete distribution, but minutes and fractional minutes.
Any sample you might take has a finite number of data points, so you set up classes, place the data points in the classes, and then draw a histogram. The height of each bar is proportional to the frequency or relative frequency of that class.
But when you come to consider all the possible values of a continuous variable, you have an infinite number of data points. If you tried to assign them to classes, it would take you forever, literally! Instead, you draw a smooth curve, called a density curve, to show the possible values and how likely they are to occur. An example is shown above right.
The density curve is a picture of a continuous probability model. It doesn’t just represent the data in a particular sample, but all possible data for that variable — along with the probabilities of their occurrence, as you’ll see next.
Up to now, the height of a bar in a histogram has been the number of data points in that class, or the relative frequency of that class. But how do you interpret the height of a density curve?
Answer: you don’t! The height of the curve above any particular point on the x axis just doesn’t lend itself to a simple interpretation. You might think it would be the probability of that value occurring. But with infinitely many possible values, “what’s the likelihood of a wait time of exactly 4 minutes?” just isn’t a meaningful question, because what about 3.99997 minutes or 4.002 minutes?
What is meaningful is the probability within an interval, which equals the area under the curve within that interval. For example, in this illustration, the probability of a wait time of 6.4 to 9.5 minutes is 29.4%. In symbols,
P(6.4 ≤ x ≤ 9.5) = 29.4%
or
P(6.4 < x < 9.5) = 29.4%
That’s right — the probability is the same whether you include or exclude the endpoints of the interval.
This explains why the probability is the same whether you include or exclude either endpoint of the interval. The difference is the area of a “rectangle” whose height is the height of the density curve and whose width is the distance from a to a — which is zero. Thus the area of the “rectangle” is zero, and the probability of the random variable taking any particular value, exactly, is zero.
Since area equals probability, and total probability must be 1, total area must be 1. Every pdf — the height of every density curve — is scaled so that the integral from −∞ to +∞ is 1.
You can also have the probability for an interval with one boundary, < or ≤ some value like the picture at right, or > or ≥ some value. For example, 3.33 minutes is about 3 minutes and 20 seconds, so the probability of waiting up to 3 minutes and 20 seconds is 20.6%: P(x ≤ 3.33) = 20.6%.
The total area under any density curve equals the probability that the random variable will take any one of its possible values, which of course is 1, or 100%. So you can use the complement to say that the probability of waiting 3 minutes and 20 seconds or more (or, more than 3 minutes and 20 seconds) is 100% − 20.6% = 79.4%.
You remember from Interpreting Probability Statements in Chapter 5 that every probability can be interpreted as a probability of one or a proportion of all. For example, P(x > 3.33) = 79.4% can equally well be interpreted in two ways:
Which interpretation you use in a given situation depends on what seems simplest and most natural in the situation. Here, the “proportion of all” interpretation seems simpler. But you’re always free to switch to the other interpretation if it helps you in thinking about a situation.
Area = Probability of One = Proportion of All
Why study the normal distribution?
First, it’s useful on its own. Lots and lots of reallife distributions match the normal model: body temperature or blood pressure of healthy people, scores on most standardized tests, commute times on a given route, lifetimes of batteries or light bulbs, heights of men or women, weights of apples of a particular variety, measurement errors (in many situations), and on and on.
Why is the ND so common? In real life, very few events have just one cause; most things are the result of many factors operating independently. It turns out that if you take a lot of independent random variables and add them up, their sum is approximately ND. For example, your IQ score results from multiple genetic factors, countless occurrences in your education and your family life, even transient factors like how well you slept the night before the test. Most of these are independent of each other, so the result of adding them is close to a ND.
Several mathematicians can claim the discovery of the normal distribution. Abraham de Moivre (1667–1754, French) was probably first, in 1733. But the name of Carl Friedrich Gauss is permanently coupled to the normal distribution — literally. Although Sir Francis Galton coined the term normal distribution in 1889, Karl Pearson called it the Gaussian distribution in 1905, and that’s still a recognized synonym.
Second, through sampling, even nonND populations follow a normal model. You’ll use this model in inferential statistics to make statements about a whole population based on just one sample. You’ll learn about this neat trick in Chapter 8.
The normal distribution (ND) has the properties of other continuous distributions as listed earlier. In particular, area = probability, and the total area under the density curve is the total probability, which is 1. The ND also has these special properties:
A ND is completely described by its mean and SD. The mean locates the center of the curve, but has no effect on the shape. For example, here are three normal curves with μ = 0, 2, and 5 and σ = 4.
The standard deviation determines the shape of the curve, but has no effect on the location. Smaller SD means the data stick closer to the mean, so the peak is higher and the tails are shorter and fatter. Larger SD means the data vary more, so they spread out from the mean: the peak is lower and the tails are longer and thinner. The second picture shows three normal curves with μ = 2 and σ = 2, 4, and 6. (The vertical scale is different from the first picture.)
All of this is the theoretical normal distribution. In fact, nothing in real life is perfectly ND, because nothing in real life has an infinite number of data points. When we say something is ND, we mean it’s a close match, not a perfect match. “Normally distributed” (or ND) is short for “using a normal distribution to model this data set, the calculations will come out close enough to reality.”
This is a lot like what you did in Chapter 3, when you computed the statistics of a grouped distribution. The statistics were only approximate, because of the simplification you introduced by grouping, but the approximation was good enough.
Now let’s get to some applications! There are two main categories: “forward” problems, where you have the boundaries and you have to find the area or probability, and “backward” problems, where you have a probability or area and you have to find the boundaries.
In case you’re interested, the pdf, the height of the density curve above a given x, is $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x-\mu)^2/(2\sigma^2)}$. The cdf, the area to the left of a given x, is the integral of that, just the same as finding the area under any curve to the left of a given x: $F(x) = \int_{-\infty}^{x} f(t)\,dt$. This integral doesn’t have a “closed form”, a finite sequence of basic algebraic operations, so it must be found by successive approximations. That’s what your calculator does with normalcdf and Excel does with NORM.DIST.
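To make "successive approximations" a little less mysterious, here is a rough Python sketch that adds up the areas of many thin rectangles under the density curve and compares the total with Python's built-in statistics.NormalDist. The numbers (mean 100, SD 15, boundary 115) are made up, and starting the sum 10 SD below the mean instead of at minus infinity is my simplifying assumption; real calculators use cleverer methods than this.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Height of the normal density curve above x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf_by_rectangles(x, mu, sigma, steps=100_000):
    """Area to the left of x, approximated with midpoint rectangles.

    Assumption: the area more than 10 SD below the mean is negligible,
    so the sum starts there rather than at minus infinity.
    """
    lo = mu - 10 * sigma
    width = (x - lo) / steps
    heights = (normal_pdf(lo + (i + 0.5) * width, mu, sigma) for i in range(steps))
    return sum(heights) * width

approx = normal_cdf_by_rectangles(115, 100, 15)
library = NormalDist(100, 15).cdf(115)
print(round(approx, 4), round(library, 4))
```

The two numbers should agree to several decimal places, which is the whole point: the area is found by brute-force arithmetic, not by a tidy formula.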
Summary: Make a sketch, estimate the probability (area), then compute it.
TI83/84/89: Use normalcdf(left bound, right bound, mean, SD). I’ll walk you through the TI83/84 keystrokes in the first example below. If you have a TI89, press [CATALOG] [F3] [plain 6 makes N] [ENTER].
Excel: In Excel 2010 or later, use (deep breath here) =NORM.DIST(right bound, mean, SD, TRUE) − NORM.DIST(left bound, mean, SD, TRUE). In Excel 2007 or earlier, it’s NORMDIST rather than NORM.DIST.
Example 1: Heights of human children of a given age and sex are ND. One study found that three-year-old girls’ heights have a mean of 38.72″ and SD of 3.17″. What percentage of three-year-old girls are 35″ to 40″ tall?
Solution: Take the time to make a sketch. It doesn’t have to be beautiful, but you should make it as accurate as you reasonably can. It’s an important safeguard against making boneheaded mistakes. Here’s what should be on your sketch:
Important: When you marked the SD, you set the scale for the sketch. Now you have to honor that and place your boundaries in proportion. For instance, in this problem the mean is 38.72 and the left boundary is 35, which is 3.72 below the mean. Your left boundary therefore needs to be a bit more than one SD (3.17) left of the mean. The right bound is 40, which is 1.28 above the mean, so your line needs to be just over a third of a SD to the right of the mean.
(Students often put in more numbers and lines, like the values of 1, 2, and 3 SD above and below the mean. That’s not wrong, but it’s usually not helpful, and it definitely clutters up the sketch.)
From my sketch, I estimate an area of 50%–60%. If it’s 45% or 70% I won’t be terribly surprised, but if it’s 5% or 99% I’ll know something is wrong.
If you wish, add that number to your sketch — not below the axis, please. Write it within the shaded area, if there’s room, or as a callout to the left or right of the diagram, the way I did here.
On a TI83 or TI84, press [2nd VARS makes DISTR] [2] to select normalcdf. Enter the left boundary (35), right boundary (40), mean (38.72), and SD (3.17).
(If you have a TI89 or you’re using Excel, see above.)
With the “wizard” interface: Press [
With the classic interface: After entering the standard deviation, press [)] [ENTER] to get the answer.
You always need to show your work, so write down normalcdf(35,40,38.72,3.17) before you proceed to the answer. (There’s no need to write down the keystrokes you used.)
In this book, I round probabilities to four decimal places, or two decimal places if expressed as a percentage. The probability is
P(35 ≤ x ≤ 40) = 0.5365
That number matches my estimate of 50%–60%.
But the problem asked for a percentage. (Always, always, always look back at the problem and make sure you’re answering the question that was actually asked.) The answer: 53.65% of three-year-old girls are 35″ to 40″ tall.
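If you have Python handy instead of a TI or Excel, the same computation can be sketched with the standard library's statistics.NormalDist. The helper name normalcdf and its argument order are my own choice, picked to mirror the calculator function; Python has no built-in function by that name.

```python
from statistics import NormalDist

def normalcdf(left, right, mean, sd):
    """Area under the normal curve between left and right.

    Hypothetical helper: same argument order as the TI function,
    computed as (area to left of right) minus (area to left of left),
    which is also what the Excel formula does.
    """
    dist = NormalDist(mean, sd)
    return dist.cdf(right) - dist.cdf(left)

# Example 1: girls' heights, mean 38.72 in, SD 3.17 in.
p = normalcdf(35, 40, 38.72, 3.17)
print(f"P(35 <= x <= 40) = {p:.4f}, or {p:.2%}")
```

The result should agree with the 0.5365 found above, and with the 50%–60% estimate from the sketch.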
Example 2: A three-year-old girl is randomly chosen. Would it be unusual (unexpected, surprising) if she’s over 45″ tall?
In Chapter 5 you learned to call a low-probability event unusual (a/k/a surprising or unexpected). The standard definition of unusual events is a probability below 0.05, so really this problem is just asking you to find the probability and compare it to 0.05.
Solution: The sketch is at right, and obviously the probability should be small. The left boundary is 45, but what’s the right boundary? The normal distribution never quite ends, so the right boundary is ∞ (infinity). TI89s have a key for ∞, but TI83s and TI84s don’t and Excel doesn’t, so use 10^99 instead. (That’s 10 to the 99th power; the [^] key on your TI calculator is between [CLEAR] and [÷].)
Show your work:
P(x > 45) = normalcdf(45,10^99,38.72,3.17) = 0.0238
That’s rounded from 0.0237914986, and it’s in line with my estimate of “small”. Now answer the question: There’s only a 2.38% chance that a randomly selected three-year-old girl will be over 45″ tall, so that would be unusual.
Example 3: For the same population, find and interpret P(x < 33).
Solution: The sketch is at right, and again the expected probability is small. The right boundary is 33, but what’s the left boundary? You might want to use 0, since no one can be under 0″ tall, but you could make the same argument for 1″ or 5″, so that can’t be right.
To locate the left boundary, remember that you’re using a normal model to approximate the data, and the normal distribution runs right out to ±∞. Therefore, the left boundary is minus ∞ on a TI89, or minus 10^99 on a TI83/84. (Use the [(−)] key, not the [−] subtraction key.)
P(x < 33) = normalcdf(−10^99,33,38.72,3.17) = 0.0356
The proportion of three-year-old girls under 33″ tall is 0.0356 or 3.56%; or, 3.56% of three-year-old girls are under 33″ tall. The other interpretation: the chance that a randomly selected three-year-old girl is under 33″ tall is 0.0356 or 3.56%.
Example 4: What’s the percentile rank of a three-year-old girl who is 33″ tall?
Solution: Long ago, in a galaxy called Numbers about Numbers, you learned the definition of percentiles. The percentile rank of a data point is the percentage of the data set that is ≤ that data point. So you need P(x ≤ 33). But that’s exactly what you computed in the previous example: 3.56%. So the 33″-tall girl is between the third and fourth percentiles for her age group.
“That was P(x < 33), and for a percentile I need P(x ≤ 33)!” I hear you yell. But those two are equal. When we talked about density curves, near the beginning of this chapter, you learned that the area and probability are the same whether you include or exclude the boundary.
And this is why it doesn’t make much difference whether you define a percentile rank in terms of < or ≤, because the probability in a continuous distribution is the same either way.
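The one-tailed problems in Examples 2 through 4 can be sketched in Python the same way, again using 1e99 to stand in for infinity just as on the TI83/84. The variable names here are mine, and statistics.NormalDist is Python's standard-library normal model.

```python
from statistics import NormalDist

heights = NormalDist(mu=38.72, sigma=3.17)  # girls' heights in inches
BIG = 1e99                                  # stand-in for infinity, as on the TI83/84

def area_between(left, right, dist):
    """Area under dist's density curve between left and right."""
    return dist.cdf(right) - dist.cdf(left)

p_over_45 = area_between(45, BIG, heights)    # Example 2: right tail
p_under_33 = area_between(-BIG, 33, heights)  # Example 3: left tail

print(f"P(x > 45) = {p_over_45:.4f}; unusual? {p_over_45 < 0.05}")
print(f"P(x < 33) = {p_under_33:.4f}")
print(f"Percentile rank of a 33-inch girl: {p_under_33:.2%}")  # Example 4
```

Notice that the percentile rank needs no new computation: it is the same left-tail area, just reported as a percentage.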
Summary: Make a sketch, estimate the value(s), then compute the value(s).
TI83/84/89: Use invNorm(area to left, mean, SD). I’ll walk you through the TI83/84 keystrokes in the first example below. If you have a TI89, press [CATALOG] [F3] [plain 9 makes I] [▼ 3 times] [ENTER].
Excel: In Excel 2010 or later, use =NORM.INV(area to left, mean, SD). In Excel 2007 or earlier, it’s NORMINV rather than NORM.INV.
Example 5: Blood pressure is stated as two numbers, systolic over diastolic. The World Health Organization’s MONICA Project (Kuulasmaa 1998 [see “Sources Used” at end of book]) reported these parameters for the US:
Systolic: μ = 120, σ = 15
Diastolic: μ = 75, σ = 11
Blood pressure in the population is normally distributed. The lowest 5% is considered “hypotensive”, according to Kuzma and Bohnenblust (2005, 103) [see “Sources Used” at end of book]. What systolic blood pressure would be considered hypotensive?
Solution: Always make a sketch for these problems. Your sketch is similar to the ones you made for the first group of problems, except that you use a symbol like x_{1} or “?” for the unknown boundary, and you write in the known area.
Always estimate your answer to guard against at least some errors. In the sketch, x_{1} looks like it’s not quite two SD left of the mean, so I’ll estimate a pressure of 95 to 100. (Okay, I cheated by using my calculator to make my “sketch”. But even with a real pencil-and-paper sketch, you ought to be in the right ballpark.)
Now you’re ready to calculate.
TI89 or Excel users, please see the instructions above. On your TI83 or TI84, press [2nd VARS makes DISTR] [3] to select invNorm. Enter the area to the left of the point you’re interested in (.05), the mean (120), and the SD (15).
With the “wizard” interface: Press [
With the classic interface: After the standard deviation, press [)] [ENTER] to get the answer.
Show your work! Write down invNorm(.05,120,15) before you proceed to the answer. (There’s no need to write down the keystrokes you used.)
Answer: Systolic blood pressure (first number) under 95 would be considered hypotensive.
Example 6: The same source considers the top 5% “hypertensive”. What is the minimum systolic blood pressure that is hypertensive?
Solution: My “sketch” is at right. It’s mostly straightforward — the x_{1} boundary is between the 5% tail and the rest of the distribution.
But what’s up with the 1−0.05? The problem asks you about the upper 5%, which is the area to the right of the unknown boundary. But invNorm on the calculator, and NORM.INV in Excel, need area to left of the desired boundary. The area to the left is the probability of “not hypertensive”, and area is probability, so the area to left is 1 minus the area to right, in this case 1−0.05.
Could you just write down 0.95? Sure, that would be correct. But if the area to right was 0.1627 you’d probably make the calculator compute 1 minus that for you, so why not be consistent?
x_{1} = invNorm(1−.05,120,15) = 144.6728044 → 145
(That’s actually a little liberal. Several sources that I’ve seen give 140 as the threshold.)
Example 7: Kuzma and Bohnenblust describe the middle 80% as “normal”. What is that range of systolic blood pressure?
This problem wants you to find two boundaries, lower and upper. You have to convert the 80% middle into two areas to left. Here’s how. If the middle is 80%, then the two tails combined must be 100−80% = 20%. But the curve is symmetric, so each tail must be 20/2 = 10%. Strictly speaking, I probably should have written that computation on the diagram, instead of just a laconic “0.1”, but it would take up a lot of space and the computation was easy enough. You’ll probably do the same — just be careful.
Once you have the areas squared away, the computation is simple enough:
x_{1} = invNorm(.1,120,15) = 100.7767265 → 101
x_{2} = invNorm(1−.1,120,15) = 139.2232735 → 139
Check: The boundaries of the middle 80% (or the middle any percent) should be equal distances from the mean. (100.7767265+139.2232735)/2 = 120, so at least it’s consistent. Answer: Systolic b.p. of 101 to 139 is considered normal.
Example 8: What’s the 40th percentile for systolic blood pressure?
Sometimes the gods smile on us. The kth percentile is the value that is ≥ k% of the population, so k% is exactly the area to left that you need.
P_{40} = invNorm(.4,120,15) = 116.1997935 → 116
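For readers working in Python, here is a sketch of the four "backward" blood-pressure problems. It leans on the inv_cdf method of statistics.NormalDist, which, like invNorm and NORM.INV, takes an area to left. The wrapper name inv_norm is a hypothetical helper of mine, chosen to echo the calculator function.

```python
from statistics import NormalDist

systolic = NormalDist(mu=120, sigma=15)

def inv_norm(area_to_left, dist):
    """Boundary with the given area to its left (hypothetical wrapper)."""
    return dist.inv_cdf(area_to_left)

hypotensive_max = inv_norm(0.05, systolic)       # Example 5: lowest 5%
hypertensive_min = inv_norm(1 - 0.05, systolic)  # Example 6: top 5% is area to right
normal_low = inv_norm(0.10, systolic)            # Example 7: middle 80% leaves
normal_high = inv_norm(1 - 0.10, systolic)       #   10% in each tail
p40 = inv_norm(0.40, systolic)                   # Example 8: 40th percentile

for label, value in [
    ("hypotensive below", hypotensive_max),
    ("hypertensive above", hypertensive_min),
    ("normal from", normal_low),
    ("normal to", normal_high),
    ("40th percentile", p40),
]:
    print(f"{label}: {value:.1f}")

# Same symmetry check as in Example 7: the two bounds average to the mean.
print((normal_low + normal_high) / 2)
```

The only thinking in any of these is converting the area you're given into an area to left; after that, every problem is one function call.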
Definition: The standard normal distribution is a normal distribution with a mean of 0 and standard deviation of 1, sometimes written N(0,1).
The standard normal distribution is a picture of z-scores of any possible real-world ND (more about that later).
The standard normal distribution lets you make computations that apply to all normal models, not just a particular model. You’ll see some examples shortly, but first —
The main point about the standard normal distribution is that it’s a stand-in for every ND from real life. How does this work? Well, if you take any real data set and subtract the mean from every data point, the mean of the new data set is 0. And if you then divide that data set by the standard deviation (which doesn’t change when you subtract a constant from every data point), then the SD of the new-new data set is 1.
But all you did with those manipulations was replace the numbers with z-scores. Remember the formula: $z = (x-\mu)/\sigma$. The standard normal distribution is what you get when you convert any normal model to z-scores.
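Here is a small Python sketch of those two manipulations, so you can see the mean go to 0 and the SD go to 1. The data values are made up purely for illustration, and I use the sample mean and sample SD since this is a data set rather than a model.

```python
from statistics import mean, stdev

data = [12, 15, 9, 20, 14, 18, 11, 13]  # made-up numbers, for illustration only
xbar, s = mean(data), stdev(data)

# Step 1: subtract the mean. The center moves to 0; the spread is unchanged.
shifted = [x - xbar for x in data]

# Step 2: divide by the SD. The spread becomes 1.
z_scores = [d / s for d in shifted]

print(f"original: mean {xbar:.3f}, SD {s:.3f}")
print(f"shifted:  mean {mean(shifted):.3f}, SD {stdev(shifted):.3f}")
print(f"z-scores: mean {mean(z_scores):.3f}, SD {stdev(z_scores):.3f}")
```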
The need to do normal computations the hard way has gone the way of the dinosaurs, but I think this history is why many stats books still use tables to do their computations. Inertia is a powerful force in textbooks!
The pdf and cdf functions for the standard normal distribution are what you get when you set μ=0 and σ=1 in the general equations for the ND: $f(z) = \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}$ and $F(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\,e^{-t^2/2}\,dt$. Again, the integral must be found by successive approximations. That’s where the tables in books come from, and it’s what your calculator does with normalcdf and Excel does with NORM.DIST.
I said above that the standard normal distribution lets you make statements about all normal models. What sort of statements? Well, the Empirical Rule for one.
Example 9: The Empirical Rule says that 68% of the population in a normal model lies within one SD of the mean. How good is the rule? In other words, what’s the actual proportion?
Solution: As usual, you start with a sketch. This is the standard ND, so the axis is z, not x. There’s no need to mark the mean or SD, because the z label identifies this as a standard normal distribution and therefore μ = 0 and σ = 1. Just label the boundaries.
Compute the probability the same way you’ve already learned. (Both Excel and the TIs have special procedures available for the standard normal distribution, but it’s not worth taking brain cells to learn them, when the regular procedures for the ND work just fine with N(0,1).)
P(−1 ≤ z ≤ 1) = normalcdf(−1,1,0,1) = .6826894809 → 68.27%
The Empirical Rule says 68% of the data are within z = ±1. Actually it’s about 68¼%, close enough.
Example 10: How many standard deviations must you go above and below the mean to take in the middle 50% of the data in a normal model?
Solution: This is similar to finding the middle 80% of blood pressures earlier, except now you’re making a statement about all normal models, not just a particular one.
Shading the middle 50% leaves 100−50 = 50% in the two tails combined, so each tail is 50/2 = 25%.
z_{1} = invNorm(.25,0,1) = −.6744897495 → −0.67
By symmetry, z_{2} must be numerically equal to z_{1} but have the opposite sign: z_{2} = 0.67.
50% of the data in any normal model are within about 2/3 of a SD of the mean. Since the bounds of the middle 50% of the data are Q1 and Q3, the IQR of any normal distribution is twice that, about one and a third standard deviations. More precisely, the IQR is 2×0.674 ≈ 1.35 times the SD.
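Examples 9 and 10 can be checked with a few lines of Python. This is only a sketch: it relies on statistics.NormalDist, which defaults to mean 0 and SD 1 when you give it no arguments, and the variable names are mine.

```python
from statistics import NormalDist

Z = NormalDist()  # no arguments: mean 0, SD 1, the standard normal distribution

# Example 9: actual proportion within one SD of the mean.
within_1_sd = Z.cdf(1) - Z.cdf(-1)

# Example 10: boundaries of the middle 50%, i.e. 25% in each tail.
z1 = Z.inv_cdf(0.25)
z2 = Z.inv_cdf(1 - 0.25)
iqr_in_sds = z2 - z1

print(f"within 1 SD: {within_1_sd:.4%}")
print(f"middle 50%: z = {z1:.4f} to {z2:.4f}")
print(f"IQR = {iqr_in_sds:.3f} standard deviations")
```

Because these are z-scores, the answers hold for every normal model, whatever its mean and SD.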
There’s one special notation you’ll use when you compute confidence intervals in Chapter 9.
Definition: z_{area} or z(area), also known as critical z, is the z-score that divides the standard normal distribution such that the right-hand tail has the indicated area.
This may seem a little weird, but really it’s just a recipe to specify a number. Compare with the square root of 48. That is the positive number such that, if you multiply it by itself, you get 48. Or consider π: the number that you get when you divide the circumference of a perfect circle by its diameter. Math is full of numbers that are specified as recipes. An example will make things clearer.
Example 11: Find z_{0.025}.
Solution: The problem is diagrammed at right. Caution! 0.025 is an area, not a z-score, so you don’t write 0.025 on the number line (the z axis). z_{0.025} is a z-score (though you don’t know its value yet), so it goes on the number line.
Once you have your sketch, the computation is straightforward.
Have area (probability), compute boundary.
The area is 0.025, but it’s an area to right, and invNorm needs an area to left, so you subtract from 1 as usual:
z_{0.025} = invNorm(1−.025, 0, 1) = 1.959963986 → 1.96
Caution! You’re computing a boundary for the right-hand tail. If you get a negative number, that can’t possibly be right.
z_{0.025} = 1.96 makes sense, if you think about it. If you also shaded in the left-hand tail with an area of 0.025, the two tails together would total 5%, leaving 95% in the middle. The Empirical Rule says that 95% of data are within 2 SD above and below the mean, and 1.96 is approximately 2.
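The critical-z recipe fits in a one-line Python function. This is a sketch using the standard library; the name z_critical is hypothetical, not something built into Python.

```python
from statistics import NormalDist

def z_critical(area_to_right):
    """z-score that cuts off the given area in the right-hand tail."""
    # inv_cdf wants area to left, so subtract the right-tail area from 1.
    return NormalDist().inv_cdf(1 - area_to_right)

z = z_critical(0.025)
print(f"z(0.025) = {z:.2f}")

# Sanity check from the text: with 0.025 in each tail, 95% is in the middle.
middle = NormalDist().cdf(z) - NormalDist().cdf(-z)
print(f"area between -z and +z = {middle:.2%}")
```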
How do you know whether a normal model is appropriate? How do you know whether your data are normally distributed? A histogram can rule out skewed data, or data with more than one peak.
But what if your data are unimodal and not obviously skewed? Is that enough to justify a normal model? No, it’s not. You need to perform a test called a normal probability plot. You’ll need this procedure in Chapters 8 through 11, whenever you have a small sample of numeric data.
That’s the bare outline, and you’ll get a little bit more with the examples. For those who want the full theory, it’s marked optional at the end of this section.
Testing for normality can be automated partly or completely, depending on what technology you have:
Example 12: Consider these vehicle weights (in pounds):
2500, 3250, 4000, 3500, 2900, 4500, 3800, 3000, 5000, 2200
Do they fit a normal model?
Solution:
Put the data in any statistics list, then press [PRGM], scroll down to MATH200A, and press [ENTER] twice. Select Normality chk.
The program makes the plot, and you can look at the points to determine whether they seem to be pretty much on a straight line. At least, that’s the theory. In practice, most data sets are a lot less clear cut than this one. It can be hard to tell whether the points fit a line, particularly if you have only a few of them. The plot takes up the whole screen, so deviations can look bigger than they really are.
Fortunately, there’s a test for whether points lie on a straight line. As you know from Chapter 4, the closer the correlation coefficient r is to 1, the closer the points are to a straight line.
The program computes r for you, and it also computes a critical value★ to help you determine if the points are close enough to a straight line. (For technical reasons, the critical value is different from the decision points of Chapter 4.) If r≥crit, it’s close enough to 1, the points are close enough to a straight line, and you can use a normal model. If r<crit, it’s too far from 1, the points are too far from a straight line, and you can’t use a normal model.
For this data set, r > crit, and therefore these vehicle weights fit the normal model.
★The “classic TI83” (non-“Plus” model) doesn’t compute the critical value, so you have to do it yourself. See the formula in item 4 in the next section.
Example 13: Here’s a random sample of the lengths (in seconds) of tunes in my iTunes library:
120 219 242 134 129 105 275 76 412 268 486 199 651 291 126 210 151 98 100 92 305 231 734 468 410 313 644 117 451 375
Do they fit a normal model?
Solution: I entered them in a statistics list and then ran MATH200A Program part 4. The result was the plot at the right.
You can see that the plot is curved. This is reinforced by comparing r=0.9473 to crit=0.9639. r < crit. The points diverge too far from a straight line, and therefore I cannot use a normal model for the lengths of my iTunes songs.
The basic idea isn’t too bad. You make an xy scatterplot where the x’s are the data points, sorted in ascending order, and the y’s are the expected z scores for a normal distribution.
Why would you expect that to be a straight line? Recall the formula for a z score: z = (x−x̅)/s. Breaking the one fraction into two, you have z = x/s−x̅/s. That’s just a linear equation, with slope 1/s and intercept −x̅/s. So an xz plot of any theoretical ND, plotting each data point’s z score against the actual data value, would be a straight line.
Further, if your actual data points are ND, then their actual z scores will match their expected-for-a-normal-distribution z scores, and therefore a scatterplot of expected z scores against actual data values will also be a straight line.
Now, in real life no data set is ever exactly a ND, so you won’t ever see a perfectly straight line. Instead, you say that the closer the points are to a straight line, the closer the data set is to normal. If the data points are too far from a straight line — if their correlation coefficient r is lower than some critical value — then you reject the idea that the data set is ND.
Okay, so you have to plot the data points against what their z-scores should be if this is a ND, and specifically for a sample of n points from a ND, where n is your sample size. This must be built up in a sequence of steps:
1.0071−0.1371/√n−0.3682/n+0.7780/n² at α=0.10
0.9963−0.0211/√n−1.4106/n+3.1791/n² at α=0.01
The closer the points are to a straight line, the closer the data set is to fitting a normal model. In other words, a larger r indicates a ND, and a smaller r indicates a non-ND. You can draw one of two conclusions:
(If you haven’t studied hypothesis testing yet, another way to say it is that you’re pretty sure the data set doesn’t fit the normal model because there’s less than a 5% probability that it does.)
This doesn’t mean you are certain it does, merely that you can’t rule it out. Technically you don’t know either way, but practically it doesn’t matter. Remember (or you will learn later) that inferential statistics procedures like t tests are robust, meaning that they still work even if the data are moderately non-normal. But if your data were extremely non-normal, r would be less than the critical value. When r is greater than the critical value, you don’t know whether the data set comes from normal data or moderately non-normal data, but either way your inferential statistics procedures are okay.
So the bottom line is, if r > CRIT, treat the data as normal, and if r < CRIT, don’t.
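If you want to try the r-versus-CRIT comparison without a TI, here is a rough Python sketch. Two pieces of it are my assumptions rather than anything quoted above: the expected z-score of the i-th smallest of n points is taken as invNorm((i − 0.375)/(n + 0.25)), a common plotting-position formula, and the critical value at α = 0.05 uses the published Ryan-Joiner coefficients, chosen because they reproduce the crit = 0.9639 reported for the 30 tunes. The MATH200A program may differ in detail.

```python
import math
from statistics import NormalDist, mean

def normality_check(data):
    """Return (r, crit) for a normal probability plot of data."""
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()
    # ASSUMPTION: plotting position (i - 0.375)/(n + 0.25) for expected z-scores.
    zs = [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    # Correlation coefficient r between the data values and expected z-scores.
    mx, mz = mean(xs), mean(zs)
    sxz = sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
    sxx = sum((x - mx) ** 2 for x in xs)
    szz = sum((z - mz) ** 2 for z in zs)
    r = sxz / math.sqrt(sxx * szz)
    # ASSUMPTION: Ryan-Joiner critical value at alpha = 0.05.
    crit = 1.0063 - 0.1288 / math.sqrt(n) - 0.6118 / n + 1.3505 / n ** 2
    return r, crit

vehicles = [2500, 3250, 4000, 3500, 2900, 4500, 3800, 3000, 5000, 2200]
tunes = [120, 219, 242, 134, 129, 105, 275, 76, 412, 268, 486, 199, 651, 291, 126,
         210, 151, 98, 100, 92, 305, 231, 734, 468, 410, 313, 644, 117, 451, 375]

for name, data in [("vehicle weights", vehicles), ("tune lengths", tunes)]:
    r, crit = normality_check(data)
    verdict = "normal model okay" if r > crit else "don't use a normal model"
    print(f"{name}: r = {r:.4f}, crit = {crit:.4f} -> {verdict}")
```

If the assumptions match what the calculator program does, the verdicts should agree with Examples 12 and 13: the vehicle weights pass and the tune lengths fail.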
The normal probability plot is just one of many possible ways to determine whether a data set fits the normal model. Another method, the D’Agostino-Pearson test, uses numerical measures of the shape of a data set called skewness and kurtosis to test for normality. For details, see Assessing Normality in Measures of Shape: Skewness and Kurtosis.
(The online book has live links to all of these.)
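For “forward” problems, where you have the boundaries and need the area or probability, use normalcdf.
For “backward” problems, where you have the area or probability and need the boundaries, use invNorm. That function needs area to left, so if the problem gives area to right you have to use 1 minus that area.
To find critical z, or z_{area}, use invNorm(1−area).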
Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.
Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.
You’ll need this information for several of the problems:
Source: “Is Human Height Bimodal?” (Schilling 2002 [see “Sources Used” at end of book]).
78 66 98 90 74 70 70 76 72 86 62 84 66 70 68
0.3 8.8 11.5 12 12.3 12.5 13 13.5 14.8
Updated 25 June 2015
Inferential statistics says, “I’ve got this sample. What does it tell me about the population it came from?” Eventually, you’ll estimate a population mean or proportion from a sample and use a sample to test a claim about a population. In essence, you’re reasoning backward from known sample to unknown population. But how? This chapter lays the groundwork.
First you have to reason forward. Way back in Chapter 1, you learned that samples vary because no one sample perfectly represents its population. In this chapter, you’ll put some numbers on that variation. You’ll learn about sampling distributions, and you’ll calculate the likelihood of getting a particular sample from a known population. That will be the basis for all your inferential statistics, starting in Chapter 9.
Acknowledgements: The approach I take to this material was suggested by What Is a pValue Anyway? (Vickers 2010, ch 10 [see “Sources Used” at end of book]), though of course any problems with this chapter are my responsibility and not Vickers’.
The software used to prepare most of the graphs and all of the simulations for this chapter is @RISK from Palisade Corporation.
Lengths of 30 Tunes  

mm:ss  seconds 
2:00  120 
3:39  219 
4:02  242 
2:14  134 
2:09  129 
1:45  105 
4:35  275 
1:16  76 
6:52  412 
4:28  268 
8:06  486 
3:19  199 
10:51  651 
4:51  291 
2:06  126 
3:30  210 
2:31  151 
1:38  98 
1:40  100 
1:32  92 
5:05  305 
3:51  231 
12:14  734 
7:48  468 
6:50  410 
5:13  313 
10:44  644 
1:57  117 
7:31  451 
6:15  375 
Having time on my hands, I was curious about the lengths of tunes in the Apple Store. Being lazy, I decided to look instead at the lengths of tunes in my iTunes library. There are 10113 of them, and I’m going to assume that they are representative. (That’s my story, and I’m sticking to it.)
I set Shuffle to Songs and then took the first 30, which gave me the times you see at right for a random sample of size 30.
Here is a histogram of the data. The tune times are moderately skewed right. That makes sense: most tunes run around two to five minutes, but a few are longer.
The mean of this sample is 280.9 seconds, and the standard deviation is 181.7 seconds. But you know that there’s always sampling error. No sample can represent the population perfectly, so if you take another sample from the same population you’d expect to see a different mean, but not very different. This chapter is all about what differences you should expect.
First, ask yourself: Why should you expect the mean of a second sample to be “different, but not very different” from the mean of the first sample? The samples are independent, so why should they relate to each other at all?
Answer: because they come from the same population. In a given sample, you would naturally expect some data points below the population mean μ, and others above μ. You’d expect that the points below μ and the points above μ would more or less cancel each other out, so that the mean of a sample should be in the neighborhood of μ, the mean of the population.
And if you think a little further about it, you’ll probably imagine that this canceling effect works better for larger samples. If you have a sample of four data points, you wouldn’t be much surprised if they’re all above μ or all below μ. If you have a sample of 100 data points, having them all on one side of μ would surprise you as much as flipping a coin 100 times and getting 100 heads. So you expect that the means of large samples tend to stick closer to μ than the means of small samples do. That’s absolutely true, as you’ll find out in this chapter.
To get a handle on “different, but not very different”, take a look at a second sample of 30 from the same population. This one has x̅ = 349.1, s = 204.2 seconds. From its histogram, you can see it’s a bit more strongly skewed than the first sample.
The two sample means differ by 349.14−280.93 ≈ 68.2 seconds. That might seem like a lot, but it’s only about a quarter of the first sample mean and under a fifth of the second sample mean. Also, it’s a lot less than the standard deviations of the two samples, meaning that the difference between samples is much less than the variability within samples.
There’s an element of hand waving in that paragraph. Sure, it seems plausible that the two sample means are “different, but not very different”; but you could just as well construct an argument in words that the two means are very different. Without numbers to go on, how much of a difference is reasonable? In statistics, we like to use numbers to decide whether a thing is reasonable or not. How can we make a numerical argument about the difference between samples? Well, put on your thinking cap, because I’m about to blow your mind.
The key to sample variability is the sampling distribution.
Notice that n is the size of each sample, not the number of samples. There’s no symbol for the number of samples, because it’s indefinitely large.
The sampling distribution is a new level of abstraction. It exists only in our minds: nobody ever takes a whole lot of samples of the same size from a given population. You can think of the sampling distribution as a “what if?” — if you took a whole lot of samples of a given size from the same population, and computed the means of all those samples, and then took those means as a new set of data for a histogram, what would that distribution look like?
Why ask such an abstract question? Simply this: if you know how samples from a known population are distributed, you can work backward from a single sample to make some estimates about an unknown population. In this chapter, I work from a population of tunes with known mean and standard deviation, and I ask what distribution of sample means I can expect to get. In the following chapters, I’ll turn that around: looking at one sample, we’ll ask what that says about the mean and standard deviation of the population that the sample came from.
What does a sampling distribution look like? Well, I used a computer simulation with @RISK from Palisade Corporation to take a thousand samples of 30 tunes each — the same n as before — and this is what I got:
“Big whoop!” I hear you say. I agree, it’s not too impressive at first glance. But let’s compare this distribution of sample means to the population those samples come from.
(In real life, you wouldn’t know what the population looks like. But in this chapter I work from a known population to explore what the distribution of its samples looks like. Starting in the next chapter, I’ll turn that around and use one sample to explore what the population probably looks like.)
Look at the two histograms below. The left-hand plot shows the individual lengths of all the tunes in the population: it’s a histogram of the original population. The right-hand plot shows the means of a whole lot of samples, 30 tunes per sample: it’s a histogram of the sampling distribution of the mean. That right-hand plot is the same as the plot I showed you a couple of paragraphs above, just rescaled to match the left-hand plot for easier comparison.
Now, what can you see?
                        Population (indiv. tunes)    Sampling Distribution
Values                  50 to 1000s (*)              200 to 400
Middle 95% of values    98.0 to 696.3                244.6 to 359.1
Standard deviation      158.6                        29.0
(*) I cut off the right tail of the population graph to save space.
At this point, you’re probably wondering if similar things are true for other numeric populations. The answer is a resounding YES.
When you describe a distribution of continuous data, you give the center, spread, and shape. Let’s look at those in some detail, because this will be key to everything you do in inferential statistics.
Before I get into the properties of the sampling distribution, I’d like to tell you about two Web apps that let you play with sampling distributions in real time. (I’m grateful to Benjamin Kirk for suggesting these.)
If you possibly can, try out these apps, especially the second one. Sampling distributions are new and strange to you, and playing with them in real time will really help you to understand the text that follows.
The mean of the sampling distribution of x̅ equals the mean of the population: μ_{x̅} = μ.
This is true regardless of the shape of the original population and regardless of sample size.
Why is this true? Well, you already know that when you take a sample, usually you have some data points that are higher than the population mean and some that are lower. Usually the highs and lows come pretty close to canceling each other out, so the mean of each sample is close to μ — closer than the individual data points, that is.
When you take a distribution of sample means, the same thing happens at the second level. Some of the sample means x̅ are above μ and some are below. The highs and lows tend to cancel, so the average of the averages is pretty darn close to the population mean.
The standard deviation of the sampling distribution of x̅ has a special name: standard error of the mean or SEM; its symbol is σ_{x̅}. The standard error of the mean for sample size n equals the standard deviation of the population divided by the square root of n: SEM or σ_{x̅} = σ/√n.
This is true regardless of the shape of the original population and regardless of sample size.
Okay, the sample is n random values drawn from a population with a variance of σ². The total of those n values in the sample is a random variable with a variance of σ²n, and therefore the standard deviation of the total is √(σ²n) = σ√n. Now divide the sample total by n to get the sample mean. x̅ is a random variable with a standard deviation of (σ√n)/n = σ/√n. QED — which is Latin for “told ya so!”
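If you would rather check the center and spread rules by experiment than by algebra, here is a minimal Python sketch. It is only an illustration, not the @RISK simulation behind the graphs in this chapter: the population is a made-up skewed (exponential) one with mean 300, and names such as `sample_means` are invented for this example.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Assumed stand-in population: exponential with mean 300, so its SD is also 300.
MU = 300.0
SIGMA = 300.0
N = 30              # size of each sample
NUM_SAMPLES = 5000  # the real sampling distribution has indefinitely many samples

sample_means = []
for _ in range(NUM_SAMPLES):
    sample = [random.expovariate(1 / MU) for _ in range(N)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)  # should land near mu
sd_of_means = statistics.stdev(sample_means)    # should land near sigma/sqrt(n)
sem = SIGMA / math.sqrt(N)

print(f"mean of sample means: {mean_of_means:.1f}  (mu = {MU})")
print(f"SD of sample means:   {sd_of_means:.1f}  (sigma/sqrt(n) = {sem:.1f})")
```

With only 5000 samples the two simulated numbers won't match μ and σ/√n exactly, but they should be close.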
Summary: If the original population is normally distributed (ND), the sampling distribution of the mean is ND. If the original population is not ND, still the sampling distribution is nearly ND if sample size is ≥ 30 or so but not more than about 10% of population size.
You can probably see that if you take a bunch of samples from a ND population and compute their means, the sample means will be ND also. But why should the means of samples from a skewed population be ND as well?
The answer should be called the Fundamental Theorem of Statistics, but instead it’s called the Central Limit Theorem. (The name was given by Richard Martin Edler von Mises in a 1919 article, but the theorem itself is due to the Marquis de Laplace, in his Théorie analytique des probabilités [1812].) The CLT is the only theorem in this whole course. There is a mathematical way to state and prove it, but we’ll go for just a conceptual understanding.
The sampling distribution of the mean approaches the normal distribution, and does so more closely at larger sample sizes.
An equivalent form of the theorem says that if you take a selection of independent random variables, and add up their values, the more independent variables there are, the closer their sum will be to a ND.
The second form of the theorem explains why so many real-life distributions are bell curves: Most things don’t have a single cause, but many independent causes.
Example: Lots of independent variables affect when you leave the house and your travel time every day. That means that any person’s commute times are ND, and so are people’s arrival times at an event. The same sorts of variables affect when buses arrive, so wait times are ND. Most things in nature have their growth rate affected by a lot of independent variables, so most things in nature are ND.
But it’s the first form of the theorem that we’ll use in this chapter. If samples are randomly chosen, or chosen by another valid sampling technique, then they will be independent and the Central Limit Theorem will apply.
The further the population is from a ND, the bigger the sample you need to take advantage of the CLT. Be careful! It’s the size of each sample that matters, not the number of samples. The number of samples is always large but unspecified, since the sampling distribution is just a construct in our heads. As a rule of thumb, n=30 is enough for most populations in real life. And if the population is close to normal (symmetric, with most data near the middle), you can get away with smaller samples.
On the other hand, the sample can’t be too large. For samples drawn without replacement (which is most samples), the sample shouldn’t be more than about 10% of the population. In symbols, n ≤ 0.1N. Suppose you don’t know the population size, N? Multiply left and right by 10 and rewrite the requirement as 10n ≤ N. You always know the sample size, and if you can make a case that the population is at least ten times that size then you’re good to go.
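The two size requirements are simple arithmetic, so they are easy to put in a small helper. This is a sketch only: the function name `check_mean_requirements` and its arguments are invented for illustration, and the library size of 10,000 tunes is a made-up figure. Random selection can't be checked by arithmetic, so the function doesn't try.

```python
def check_mean_requirements(n, population_is_normal, population_at_least):
    """Check the size requirements for the sampling distribution of the mean.

    n                    -- size of the one sample
    population_is_normal -- True if the population is known to be ND
    population_at_least  -- a population size you can defend as a lower bound
    """
    big_enough = population_is_normal or n >= 30  # rule of thumb from the text
    not_too_big = 10 * n <= population_at_least   # same as n <= 0.1N
    return {
        "normal model OK": big_enough,
        "sample under 10% of population": not_too_big,
    }

# 30 tunes from a skewed library of at least 10,000 tunes (hypothetical size)
result = check_mean_requirements(30, False, 10_000)
print(result)
```

Passing a lower bound for the population size mirrors the 10n ≤ N form of the requirement: you rarely know N exactly, but you can usually argue that it is at least some number.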
You’ll remember that the population of tune times was highly skewed, but the sampling distribution for n=30 was pretty nearly bell shaped. To show how larger sample size moves the sampling distribution closer to normal, I ran some simulations of 1000 samples for some other sample sizes. Remember that the sampling distribution is an indefinitely large number of samples; you’re still seeing some lumpiness because I ran only 1000 samples in each simulation.
The means of 3-tune samples are still fairly well skewed, though the range is less than the population range. Increasing sample size to 10, the skew is already much less. 20-tune samples are pretty close to a bell curve except for the extreme right-hand tail. Finally, with a sample size of 100, we’re darn close to a bell curve. Yes, there’s still some lumpiness, but that’s because the histogram contains only 1000 sample means.
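You can watch the skew fade without drawing histograms by computing a skewness number for each sample size. The sketch below is an illustration under assumptions: the tune population isn't available here, so a made-up exponential population with mean 300 stands in for it, and `skewness` and `simulate_sample_means` are names invented for this example.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

def skewness(values):
    """Average cubed z-score: about 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

def simulate_sample_means(n, num_samples=1000, mean=300.0):
    """Means of num_samples samples of size n from a skewed (exponential) population."""
    means = []
    for _ in range(num_samples):
        sample = [random.expovariate(1 / mean) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means

skew_by_n = {n: skewness(simulate_sample_means(n)) for n in (3, 10, 20, 100)}
for n, skew in skew_by_n.items():
    print(f"n = {n:3d}: skewness of sample means = {skew:.2f}")
```

The skewness should shrink toward 0 as n grows, though with only 1000 samples per run the numbers will be a bit lumpy, just like the histograms.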
The requirements mentioned in this chapter will be your “ticket of admission” to everything you do in the rest of the course. If you don’t check the requirements, the calculator will happily calculate numbers for you, they’ll be completely bogus, and your conclusions will be wrong but you won’t know it. Always check the requirements for any type of inference before you perform the calculations.
I talk about “requirements”. By now you’ve probably noticed that I think very highly of DeVeaux, Velleman, and Bock’s Intro Stats (2009) [see “Sources Used” at end of book]. They test the same things in practice, but they talk about “assumptions” and “conditions”. Assumptions are things that must be true for inference to work, and conditions are ways that you test those assumptions in practice.
You might like their approach better. It’s the same content, just a different way of looking at it. And sampling distributions are so weird and abstract that the more ways you can look at them the better! Following DeVeaux pages 591–593, here’s another way to think about the requirements.
Independence Assumption: Always look at the overall situation and try to see if there’s any way that different members of the sample can affect each other. If they seem to be independent, you’ll then test these conditions:
These conditions must always be met, but they’re a supplement to the Independence Assumption, not a substitute for it. If you can see any way in which individuals are not independent, it’s game over regardless of the conditions.
Normal Population Assumption: For numeric data, the sampling distribution must be ND or you’re dead in the water. There are two conditions to check this:
The Normal Population Assumption and the Nearly Normal Condition or Large Sample Condition are for numeric data and only numeric data. We’ll have a separate set of requirements, assumptions, and conditions for binomial data later in this chapter.
See also: Is That an Assumption or a Condition? is a very nice summary by Bock [see “Sources Used” at end of book] of all assumptions and conditions. It puts all of our requirements for all procedures into context. (Just ignore the language directed at instructors.)
Ultimately, you’ll use sampling distributions to estimate the population mean or proportion from one sample, or to test claims about a population. That’s the next four chapters, covering confidence intervals and hypothesis tests. But before that, you can still do some useful computations.
For all problems involving sampling distributions and probability of samples, follow these steps:
Compute the probability using normalcdf.
Caution! Don’t use rounded numbers in this calculation.
Example 1: You are auditing a bank. The bank managers have told you that the average cash deposit is $200.00, with standard deviation $45.00. You plan to take a random sample of 50 cash deposits. (a) Describe the distribution of sample means for n = 50. (b) Assuming the given parameters are correct, how likely is a sample mean of $189.56 or below?
Solution (a): Recall that describing a distribution means giving its center, its spread, and its shape.
Solution (b): Please refer to How to Work Problems, above. You’ve already described the distribution, so the next step is to make the sketch. You may be tempted to skip this step, but it’s an important reality check on the numerical answer you get from your calculator.
The sketch for this problem is shown at right. Please observe these key points when sketching sampling distribution problems:
Next, compute the probability on your calculator.
Press [2nd] [VARS makes DISTR] [2] to select normalcdf.
Fill in the arguments, either on the wizard interface or in the function itself. Either way, you need four arguments, in this order: left boundary, right boundary, mean, and standard deviation of the sampling distribution (the standard error). normalcdf calculations are particularly sensitive to rounding errors, especially when one or both boundaries are out in the tails, so use the exact value: 45/√50.
With the “wizard” interface: The wizard prompts you for a standard deviation σ. Don’t enter the SD of the population. Do enter the SD of the sampling distribution, which is the standard error. After entering the standard error, press [
With the classic interface: After entering the standard error, press [)] [ENTER]. You’ll have two closing parentheses, one for the square root and one for normalcdf.
Always show your work. There’s no need to write down all your keystrokes, but do write down the function and its arguments:
normalcdf(−10^99, 189.56, 200, 45/√50)
Answer: P(x̅ ≤ 189.56) = 0.0505
Comment: Here you see the power of sampling. With a standard deviation of $45.00, an individual deposit of $189.56 or lower can be expected almost 41% of the time. But a sample mean under $189.56 with n=50 is much less likely, only a little over 5%.
This is one reason you should take the trouble to make your sketch reasonably close to scale. If you enter the standard deviation, 45, instead of the standard error, 45/√50 — a common mistake — you’ll get 0.4083. A glance at your sketch will tell you that can’t possibly be right, so you then know to find and fix your mistake.
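If you want to double-check the calculator away from the calculator, the same computation can be sketched in Python. This is an illustration, not part of the TI-83/84 procedure: `NormalDist` from Python's standard `statistics` module stands in for normalcdf, and the variable names are invented here.

```python
import math
from statistics import NormalDist

mu, sigma, n = 200.00, 45.00, 50
sem = sigma / math.sqrt(n)  # standard error of the mean, kept unrounded

# Sampling distribution of the mean: P(xbar <= 189.56)
p_sample_mean = NormalDist(mu, sem).cdf(189.56)

# The common mistake: using the population SD instead of the standard error.
# That answers a question about one individual deposit (treated as ND).
p_individual = NormalDist(mu, sigma).cdf(189.56)

print(f"SEM = {sem:.4f}")
print(f"P(xbar <= 189.56) = {p_sample_mean:.4f}")
print(f"P(x <= 189.56)    = {p_individual:.4f}")
```

The first probability should agree with the 0.0505 above, and the second with the 0.4083 you get from the common mistake.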
Example 2: US women’s heights are normally distributed (ND), with mean 65.5″ and standard deviation 2.5″. You visit a small club on a Thursday evening, and 25 women are there. (Let’s assume they are a representative sample.) Your pickup line is that you’re a statistics student and you need to measure their heights for class. Amazingly, this works, and you get all 25 heights. How likely is it that the average height is between 65″ and 66″?
Solution: First, get the characteristics of the sampling distribution:
If the SEM is 0.5″, then 65″ and 66″ equal the mean ± one standard error. The Empirical Rule (68–95–99.7 Rule) tells you that about 68% of the data fall between those bounds. In this problem, the sketch is a really good guide to the answer you expect.
This is the distribution of sample means, so you expect 68% of them to fall between those bounds. But do the computation anyway, because the Empirical Rule is approximate and now you’re able to be precise. Also, the SEM of 0.5″ is an exact number, but still I put the whole computation into the calculator just to be consistent.
The chance that the sample mean is between 65″ and 66″ is
P(65 ≤ x̅ ≤ 66) = 0.6827
Remember the difference between the distribution of sample means and the distribution of individual heights. From the computation at the right, you expect to see under 16% of women’s heights between 65″ and 66″, versus over 68% of sample mean heights (for n=25) between 65″ and 66″. That’s the whole point of this chapter: sample means stick much closer to the population mean.
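The contrast between sample means and individual heights can be reproduced with a short Python sketch. It is an illustration only: `prob_between` is a helper invented here to play the role of normalcdf with two finite boundaries.

```python
import math
from statistics import NormalDist

mu, sigma, n = 65.5, 2.5, 25
sem = sigma / math.sqrt(n)  # 2.5/5 = 0.5 inch

def prob_between(low, high, mean, sd):
    """Area under the normal curve between low and high."""
    dist = NormalDist(mean, sd)
    return dist.cdf(high) - dist.cdf(low)

p_means = prob_between(65, 66, mu, sem)          # sample means, n = 25
p_individuals = prob_between(65, 66, mu, sigma)  # individual heights

print(f"P(65 <= xbar <= 66) = {p_means:.4f}")
print(f"P(65 <= x <= 66)    = {p_individuals:.4f}")
```

The only difference between the two calls is the last argument: the standard error for sample means, the population SD for individuals.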
Example 3: Suppose hotel guests who take elevators weigh on average 150 pounds with standard deviation of 35 pounds. An engineer is designing a large elevator, to lift 50 people. If she designs it to lift 4 tons (8000 pounds), what is the chance a random group of 50 people will overload it?
Need a hint? This is a problem in sample total. You haven’t studied that kind of problem, but you have studied problems in sample means. In math, when you have an unfamiliar type of problem, it’s always good to ask: Can I change this into some type of problem I do know how to solve? In this case, how do you change a problem about the total number of pounds in a sample (∑x) into a problem about the average number of pounds per person (x̅)?
Please stop and think about that before you read further.
Solution: To convert a problem in sums into a problem in averages, divide by the sample size. If the total weight of a sample of 50 people is 8000 lb, then the average weight of the 50 people in the sample is 8000/50 = 160 lb. So the desired probability is P(x̅ > 160):
P(∑x > 8000 for n = 50) = P(x̅ > 160)
And you know how to find the second one.
What does the sampling distribution of the mean look like for μ = 150, σ = 35, n = 50? The mean is μ_{x̅} = 150 lb, and the standard error is 35/√50 ≈ 4.9 lb. That’s all you need to draw the sketch at right. Samples are random, 10×50 = 500 is less than the number of people (or potential hotel guests) in the world, and n = 50 ≥ 30, so the sampling distribution follows a normal model.
Now make your calculation. This time the left boundary is a definite number and the right boundary is pseudo infinity, 10^99. And again, you want the standard error, not the SD of the original population.
With the “wizard” interface: After entering the standard error, press [
With the classic interface: After entering the standard error, press [)] [ENTER].
Show your work: normalcdf(160, 10^99, 150, 35/√50).
There’s a 0.0217 chance that any given load of 50 people will overload the elevator. That’s not 2% of all loads, but 2% of loads of 50 people. Still, it’s an unacceptable level of risk.
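Here is the elevator computation as a Python sketch, worked two ways: as a problem in sample means, and directly as a problem in sample totals, using the fact that the SD of a total of n values is σ√n. This is an illustration only; `NormalDist` stands in for normalcdf and the variable names are invented here.

```python
import math
from statistics import NormalDist

mu, sigma, n = 150, 35, 50
capacity = 8000  # pounds

# Route 1: turn the total into a mean, then use the standard error.
threshold_mean = capacity / n  # 160 lb per person
sem = sigma / math.sqrt(n)
p_overload = 1 - NormalDist(mu, sem).cdf(threshold_mean)

# Route 2: work with the total itself: mean n*mu, SD sigma*sqrt(n).
p_overload_total = 1 - NormalDist(n * mu, sigma * math.sqrt(n)).cdf(capacity)

print(f"P(xbar > 160)   = {p_overload:.4f}")
print(f"P(total > 8000) = {p_overload_total:.4f}")
```

Both routes give the same probability because dividing the total and its SD by n doesn't change the z-score.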
Is there an inconsistency here? Back in Chapter 5, I said that an unusual event was one that had a low probability of occurring, typically under 5%. Since 2% is less than 5%, doesn’t that mean that an overloaded elevator is an unusual event, and therefore it can be ignored?
Yes, it’s unusual. But no, fifty people plunging to a terrible death can’t be ignored. The issue is acceptable risk. Yes, there’s some risk any time you step in an elevator that it will be your last journey. But it’s a small risk, and it’s one you’re willing to accept. (The risk is much greater every time you get into a car.) Without knowing exact figures, you can be sure it’s much, much less than 2%; otherwise every big city would see many elevator deaths every day.
In Chapter 10, you’ll meet the significance level, which is essentially the risk of being wrong that you can live with. The worse the consequences of being wrong, the lower the acceptable risk. With an elevator, 5% is much too risky — you want crashes to be a lot more unusual than that.
Binomial data are yes/no or success/failure data. Each sample yields a count of successes. (A reminder: “success” isn’t necessarily good; it’s just the name for the condition or response that you’re counting, and the other one is called “failure”.)
Need a refresher on the binomial model? Please refer back to Chapter 6.
The summary statistic or parameter is a proportion, rather than a mean. In fact, the proportion of success (p) is all there is to know about a binomial population.
In Chapter 6 you computed probabilities of specific numbers of successes. Now you’ll look more at the proportions of success in all possible samples from a binomial population, using the normal distribution (ND) as an approximation.
Here’s a reminder of the symbols used with binomial data:
p  The proportion in the population. Example: If 83% of US households have at least one cell phone, then p = 0.83. Remember “proportion of all equals probability of one”, so p is also the probability that any randomly selected response from the population will be a success.
q  = 1−p is therefore the proportion of failure or the chance that any given response will be a failure.
n  The sample size.
x  The number of successes in the sample. Example: if 45 households in your sample have at least one cell phone, then x = 45.
p̂  “p-hat”, the proportion in the sample, equal to x/n. Example: If you survey 60 households and 45 of them have at least one cell phone, then p̂ = 45/60 = 0.75 or 75%.
The sampling distribution of the proportion is the same idea as the sampling distribution of the mean, and there are a lot of parallels between the two. (A table at the end of this chapter summarizes them.)
As before, n is the size of each sample, not the number of samples. There’s no symbol for the number of samples, because it’s indefinitely large.
One change from the sampling distribution of x̅ is that the sampling distribution of p̂ is a different data type from the population. The original data are nonnumeric (yeses and noes), but the distribution of p̂ is numeric because the p̂’s are numbers. Each p̂ says “so many percent of this sample were successes.”
The mean of the sampling distribution of p̂ equals the proportion of the population: μ_{p̂} = p (“mu sub p-hat equals p”).
This is true regardless of the proportion in the original population and regardless of sample size.
Why is this true? The reasons are similar to the reasons in Center of the Sampling Distribution of x̅. p̂ for a given sample may be higher or lower than p of the population, but if you take a whole lot of samples then the high and low p̂’s will tend to cancel each other out, more or less.
The standard deviation of the sampling distribution of p̂ has a special name: standard error of the proportion or SEP; its symbol is σ_{p̂} (“sigma-sub-p-hat”). The standard error of the proportion for sample size n equals the square root of this quantity: the population proportion, times 1 minus the population proportion, divided by the sample size. In symbols, SEP or σ_{p̂} = √[pq/n].
This is true regardless of the proportion in the original population and regardless of sample size.
Why is this true? For a binomial distribution with sample size n, the standard deviation is √[npq]. That is the SD of the random variable x, the number of successes in a sample of size n. The sample proportion, random variable p̂, is x divided by n, and therefore the SD of p̂ is the SD of random variable x, also divided by n. In symbols, σ_{p̂} = √[npq] / n = √[npq/n²] = √[pq/n].
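A quick simulation can back up the algebra. The sketch below is an illustration under assumptions: it borrows p = 0.83 from the cell-phone example, picks n = 60 households per sample, and draws 10,000 samples; the variable names are invented here.

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the run is repeatable

p, n = 0.83, 60       # population proportion and size of each sample
num_samples = 10_000  # the real sampling distribution has indefinitely many

p_hats = []
for _ in range(num_samples):
    # Each individual is a success with probability p; count the successes.
    successes = sum(random.random() < p for _ in range(n))
    p_hats.append(successes / n)

mean_p_hat = statistics.fmean(p_hats)  # should land near p
sd_p_hat = statistics.stdev(p_hats)    # should land near sqrt(pq/n)
sep = math.sqrt(p * (1 - p) / n)

print(f"mean of p-hats: {mean_p_hat:.4f}  (p = {p})")
print(f"SD of p-hats:   {sd_p_hat:.4f}  (sqrt(pq/n) = {sep:.4f})")
```

The simulated mean and SD of the p̂'s should sit very close to p and √[pq/n].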
If np and nq are both ≥ about 10, and 10n ≤ N, the normal model is good enough for the sampling distribution.
Let’s look at some sampling distributions of p̂. First I’ll show you the effect of the population’s proportion of success p, and then the effect of the sample size n.
Using @RISK from Palisade Corporation, I simulated all of the sampling distributions shown here. The mathematical sampling distribution has an indefinitely large number of samples, but I stopped at 10,000.
These first three graphs show the sampling distributions for samples of size n = 4 from three populations with different proportions of successes.
Reminder: these are not graphs of the population — they’re not responses from individuals. They are graphs of the sampling distributions, showing the proportion of successes (p̂) found in a lot of samples.
How do you read these? For example, look at the first graph. This shows the sampling distribution of the proportion for a whole lot of samples, each of size 4, where the probability of success on any one individual is 0.1. You can see that about 67% of all samples have p̂ = 0 (no successes out of four), about 29% have p̂ = .25 (one success out of four), about 4% have p̂ = .50 (two successes out of four), and so on.
Why the large gaps between the bars? With n = 4, each sample can have only 0, 1, 2, 3, or 4 successes, so the only possible proportions for those samples are 0, 25%, 50%, 75%, and 100%.
But let’s not obsess over the details of these graphs. I’m more interested in the shapes of the sampling distributions.
What do you see? If you take many samples of size 4 from a population with p = 0.1 (10% successes and 90% failures), the sampling distribution of the proportion is highly skewed. Now look at the second graph. When p = .25 (25% successes and 75% failures in the population), again with n = 4 individuals in each sample, the sample proportions are still skewed, but less so. And in the third graph, where the population has p = 0.5 (success and failure equally likely), then the sampling distribution is symmetric even with these small samples.
For a given sample size n, it looks like the closer the population p is to 0.5, the closer the sampling distribution is to symmetric. And in fact that’s true. That’s your takeaway from these three graphs.
Now let’s look at sampling distributions using different sample sizes from the same population. I’ll use a population with 10% probability of success for each individual (p = 0.1).
You’ve already seen the graph of the sampling distribution when n = 4. The three graphs here show the sampling distribution of p̂ for progressively larger samples. (Remember always that n is the number of individuals in one sample. The number of samples is indefinitely large, though in fact I took 10,000 samples for each graph.)
What do you see here? The distribution of p̂’s from samples of 50 individuals is still noticeably skewed, though a lot less than the graph for n = 4. If I take samples of size 100, the graph is starting to look nearly symmetric, though still slightly skewed. And if I take samples of 500 individuals, the distribution of p̂ looks like a bell curve.
What do you conclude from these graphs? First, even if p is far from 0.5 (if the population is quite unbalanced), with large enough samples, the sampling distribution of p̂ is a normal distribution. Second, you need big samples for binomial data. Remember that 30 is usually good enough for numeric data. For binomial data, it looks like you need bigger samples.
Okay, let’s put it together. If the size of each sample is large enough, the sampling distribution is close enough to normal. How large a sample is large enough? It depends on how skewed the original population is, which means it depends on the proportion of successes in the population. The further p is from 0.5, the more unbalanced the population and the larger n must be.
How big a sample is big enough? Here’s what some authors say:
Why the disagreements? They can’t all be right, can they?
Actually, they can. The question is, what’s close enough to a ND? That’s a judgment call, and different statisticians are a little bit more or less strict about what they consider close enough. Fortunately, with samples bigger than a hundred or so, which are customary, all the conditions are usually met with room to spare.
We’ll follow DeVeaux and his buddies and use np ≥ 10 and nq ≥ 10. This is easy to remember: at least ten “yes” and at least ten “no” expected in a sample. (You can compute the expected number of noes as nq = n(1−p) or simply n−np, sample size minus the expected number of yeses.)
How does this work out in practice? Look at the next-to-last graph, with n=100 and p=0.1. It’s close to a bell curve, but has just a little bit of skew. (It’s easier to see the skew if you cover the top part of the graph.)
Check the requirements: np = 100×.1 = 10, and nq = 100−10 = 90. In a sample of 100, 10 successes and 90 failures are expected, on average. This just meets requirements. And that matches the graph: you can see that it’s close to a bell curve but not perfect. If it were a little more skewed, the normal model wouldn’t be a good enough fit.
De Veaux and friends (page 440) give a nice justification for choosing ≥ 10 yeses and noes. Briefly, the ND has tails that go out to ±infinity, but proportions are between 0 and 1. They chose their “success/failure condition”, at least ten of each, so that the mismatch between the binomial model and the normal model is only in the rare cases.
But there’s an additional condition: the individuals in the sample must be independent. This translates to a requirement that the sample can’t be too large, or drawing without replacement would break the binomial model. Big surprise (not!): Authors disagree about this too. For example, De Veaux and Johnson & Kuby say sample can’t be bigger than 10% of population (n ≤ 0.1N); Sullivan says 5%.
We’ll use n ≤ 0.1N, just like with numeric data. And just as before, you can think of that as 10n ≤ N when you don’t have the exact size of the population.
Example 4: You asked 300 randomly selected adult residents of Ithaca a yes-or-no question. Is the sample too large to assume independence? You may not know the population of Ithaca, but you can compute 10×300 = 3000 and be confident that there are more than 3000 adult Ithacans. Therefore your sample is not too large.
Don’t just claim 10n ≤ N. Show the computation, and identify the population you’re referring to, like this: “10n = 10×300 = 3000 ≤ number of adult Ithacans.”
Remember to check your conditions: np ≥ about 10, nq ≥ about 10, and 10n ≤ N. And of course your sample must be random.
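The three numeric conditions can be bundled into one small check. This is a sketch only: the function name `check_proportion_requirements` and its arguments are invented for illustration, and the population size of 5000 in the example call is a made-up figure. Randomness of the sample still has to be judged by you, not by arithmetic.

```python
def check_proportion_requirements(n, p, population_at_least):
    """Check np >= 10, nq >= 10, and 10n <= N for binomial data.

    n                   -- size of the one sample
    p                   -- population proportion of successes
    population_at_least -- a population size you can defend as a lower bound
    """
    # Round to sidestep tiny floating-point errors in products like 100 * 0.1.
    expected_yes = round(n * p, 9)
    expected_no = round(n * (1 - p), 9)
    return {
        "np": expected_yes,
        "nq": expected_no,
        "np >= 10": expected_yes >= 10,
        "nq >= 10": expected_no >= 10,
        "10n <= N": 10 * n <= population_at_least,
    }

# n = 100, p = 0.1 as in the graph discussed earlier; population size is hypothetical
result = check_proportion_requirements(100, 0.1, 5000)
print(result)
```

For n = 100 and p = 0.1 the success count just meets the requirement, while a sample of 4 from the same population fails it badly.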
Just like with numeric data, you might find it helpful to name the requirements for binomial data. These are the same requirements that I just gave you, but presented differently. I’m following DeVeaux, Velleman, Bock (2009, 493) [see “Sources Used” at end of book].
Independence Condition, Randomization Condition, 10% Condition: These are the same for every sampling distribution and every procedure in inferential stats. I’ve already talked about them under numeric data earlier in the chapter. In practice, the 10% Condition comes into play more often for binomial data than numeric data, because binomial samples are usually much larger.
Sample Size Assumption: For binomial data, the sample is like Goldilocks and porridge — it can’t be too big and it can’t be too small. (Maybe it was beds or chairs and not porridge? And what the heck is porridge?) “Too big” is checked by the 10% Condition; “too small” is checked by the
See also: Is That an Assumption or a Condition? (Bock [see “Sources Used” at end of book]). Again, these are the same requirements you see in this textbook, just presented differently.
Working with the sampling distribution of p̂, the technique is exactly the same as for problems involving the sampling distribution of x̅. Follow these steps:
Compute the probability using normalcdf.
Caution! Don’t use rounded numbers in this calculation.
The binomial distribution is discrete (a sample can have only a whole number of successes), but the normal model is continuous. Because of this, some authors apply a continuity correction to make the normal model a better fit for the binomial. This means extending the range by half a unit in each direction. For example, if n = 100 and p = 0.20, and you’re finding the probability of 10 to 15 successes, MATH200A part 3 gives a probability of 0.1262. The normal model with standard error √[.20×(1−.20)/100] = 0.04 gives
normalcdf(10/100, 15/100, .2, .04) = 0.0994
With the continuity correction, you compute the probability for 9½ to 15½ successes. Then the normal model gives a probability of
normalcdf(9.5/100, 15.5/100, .2, .04) = 0.1260
This is a better match to the exact binomial probability. Why use the normal model at all, then? Why not just compute the exact binomial probability? Because there’s only a noticeable discrepancy far from the center, and only when the sample is on the small side. (100 is a small sample for binomial data, as you’ll see in the next two chapters.) You can apply the continuity correction if you want, but many authors don’t because it usually doesn’t make enough difference to matter.
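To see all three numbers side by side, here is a Python sketch that computes the exact binomial probability, the plain normal approximation, and the continuity-corrected one. It is an illustration only: `binomial_pmf` is a helper written for this example (it is not MATH200A), and `NormalDist` stands in for normalcdf.

```python
import math
from statistics import NormalDist

n, p = 100, 0.20
low, high = 10, 15  # number of successes, inclusive

def binomial_pmf(k, n, p):
    """Exact probability of exactly k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

exact = sum(binomial_pmf(k, n, p) for k in range(low, high + 1))

sep = math.sqrt(p * (1 - p) / n)  # standard error of the proportion
model = NormalDist(p, sep)
plain = model.cdf(high / n) - model.cdf(low / n)
corrected = model.cdf((high + 0.5) / n) - model.cdf((low - 0.5) / n)

print(f"exact binomial:             {exact:.4f}")
print(f"normal, no correction:      {plain:.4f}")
print(f"normal, with correction:    {corrected:.4f}")
```

The corrected value should land much closer to the exact binomial probability than the uncorrected one does.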
Example 5: 1965. Talladega County, Alabama. An African American man named Robert Swain is accused of rape. The 100-man jury panel includes 8 African Americans, but through exemptions and peremptory challenges none are on the final jury. Swain is convicted and sentenced to death. (Juries are all male. 26% of men in the county are African American.)
(a) In a county that is 26% African American, is it unexpected to get a 100-man jury panel with only eight African Americans?
Solution: “Unexpected”, “unusual”, or “surprising” describes an event with low probability, typically under 5%. This problem is asking you to find the probability of getting that sample and compare that probability to 0.05.
This is binomial data: each member of the sample either is or is not African American. Putting the problem into the language of the sampling distribution, your population proportion is p = 0.26. You’re asked to find the probability of getting 8 or fewer successes in a sample of 100, so n = 100 and your sample proportion must be p̂ ≤ 8/100 or p̂ ≤ 0.08.
Why “8 or fewer”? p̂ = 8% for the questionable sample, and you think it’s too low since it’s below the expected 26%. Therefore, in determining how likely or unlikely such a sample is, you’ll compute the probability of p̂ ≤ 0.08.
First, describe the sampling distribution of the proportion:
Therefore the sampling distribution of the proportion is a ND.
Next, make the sketch and estimate the answer. I’ve numbered the key points in the sketch at right, but if you need a refresher please refer back to the sketch under Example 1.
From this sketch, you’d expect the probability to be very small, and indeed it is.
Compute the probability using normalcdf as before. Be careful with the last argument, which is the standard deviation of the sampling distribution. Don’t use a rounded number for the standard error, because it can make a large difference in the probability.
With the “wizard” interface: The standard error expression, √(.26*(1−.26)/100), scrolls off the screen as you type it in, so be extra careful! Press [
With the classic interface: Press [)] [ENTER] after entering the standard error. You’ll have two closing parentheses, one for the square root and one for normalcdf.
Always show your work — not keystrokes but the function and its arguments:
normalcdf(−10^99, .08, .26, √(.26*(1−.26)/100))
The SEP is a nasty expression, and you have to enter it twice in every problem. You might like to save some keystrokes by computing it once and then storing it in a variable, as I did at the right. When you’re drawing the sketch and need the standard error, compute it as usual but before pressing [ENTER] press [STO→] [x,T,θ,n]. Then when you need the standard error in normalcdf, in the wizard or the classic interface, just press the [x,T,θ,n] key instead of reentering the whole SEP expression.
The probability is naturally the same whether you use the shortcut or not.
P(p̂ ≤ 0.08) = 2.0×10^{−5}, or P(x ≤ 8) = 0.000 020. There are only 20 chances in a million of getting a 100-man jury pool with so few African Americans by random selection from that county’s population. This is highly unexpected, so unlikely that it raises the gravest doubts about the county’s claim that jury pools were selected without racial bias.
You might remember that in Chapter 6 you computed this binomial probability as 0.000 005, five chances in a million. If the ND is a good approximation, why does it give a probability that’s four times the correct probability? Answer: The normal approximation gets a little dicier as you move further out the tails, and this sample is pretty far out (z = −4.10). But is the approximation really that bad? Sure, the relative error is large, but the absolute error is only 0.000 015, 15 chances in a million. Either way, the message is “This is extremely unlikely to be the product of random chance.”
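The comparison between the normal approximation and the exact binomial probability for part (a) can be sketched in Python. This is an illustration only: `NormalDist` stands in for normalcdf, the exact probability is summed directly from the binomial formula, and the variable names are invented here.

```python
import math
from statistics import NormalDist

n, p = 100, 0.26
sep = math.sqrt(p * (1 - p) / n)  # standard error of the proportion, unrounded

# Normal model: P(p-hat <= 0.08)
normal_prob = NormalDist(p, sep).cdf(8 / n)
z = (8 / n - p) / sep  # how far out in the tail the sample is

# Exact binomial: P(x <= 8), summing P(x = k) for k = 0 through 8
exact_prob = sum(
    math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(0, 9)
)

print(f"z = {z:.2f}")
print(f"normal approximation: {normal_prob:.6f}")
print(f"exact binomial:       {exact_prob:.6f}")
print(f"absolute error:       {normal_prob - exact_prob:.6f}")
```

The relative error looks large, but the absolute error is only a few chances in a hundred thousand.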
(b) From 1950 to 1965, as cited in the Supreme Court’s decision, every 100-man jury pool in the county had 15 or fewer African Americans. How likely is that, if they were randomly selected?
Solution: 15 out of 100 is 15%. You know how to compute the probability that one jury pool would be ≤15% African American, so start with that. You’ve already described the sampling distribution, so all you have to do is make the sketch and then the calculation. Everything’s the same, except your right boundary is 0.15 instead of 0.08.
If you use my little shortcut:  Otherwise: 



Either way, P(p̂ ≤ 0.15) = 0.0061. The Talladega County jury panels are multiple samples with n = 100 in each, so the “proportion of all” interpretation makes sense: In the long run, you expect 0.61% of jury panels to have 15% or fewer African Americans, if they’re randomly selected.
But actually 100% of those jury panels had 15% or fewer African Americans. How unlikely is that? Well, we don’t know how many juries there were in the county in those 16 years, but surely it must have been at least one a year, or a total of 16 or more. The probability that 16 independent jury pools would all have 15% or fewer African Americans, just by chance, is 0.0060745591^{16} ≈ 3E−36, effectively zip. And if there was more than one jury a year, as there probably was, the probability would be even lower. Something is definitely fishy.
The binomial probability is 0.0061 also. This is still pretty far out in the left-hand tail (z = −2.51), but the normal approximation is excellent. The message here is that the normal approximation is pretty darn close except where the probabilities are so small that exactness isn’t needed anyway.
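The numbers in part (b) can be reproduced with a short Python sketch. It assumes the same n = 100 and p = 0.26, and it takes the 16 pools to be independent, as the text does; the variable names are my own.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.26
sep = sqrt(p * (1 - p) / n)

# Probability that ONE randomly selected pool is 15% or less African American.
one_pool_normal = NormalDist(mu=p, sigma=sep).cdf(0.15)
one_pool_binom = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(16))

# Probability that 16 independent pools ALL come out that way.
all_sixteen = one_pool_normal ** 16

print(f"normal approximation, one pool: {one_pool_normal:.4f}")
print(f"exact binomial, one pool:       {one_pool_binom:.4f}")
print(f"all 16 pools:                   {all_sixteen:.1e}")
```

Note the negative exponent on the last line of output: the probability is a decimal point followed by 35 zeroes before the first significant digit.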
Here’s a side-by-side summary of sampling distributions of the mean (numeric data) and sampling distributions of the proportion (binomial data). Always check requirements for the type of data you actually have!
Item | Numeric Data | Binomial Data
What each individual provides | Each individual in sample provides a number. | Each individual in sample provides a success or failure, and you count successes.
Statistic of one sample | mean x̅ = ∑x/n | proportion p̂ = x/n
Parameter of population | mean μ | proportion p
Sampling distribution of the ... | Sampling distribution of the mean (sampling distribution of x̅) | Sampling distribution of the proportion (sampling distribution of p̂)
Mean of sampling distribution | μ_{x̅} = μ | μ_{p̂} = p
Standard deviation of sampling distribution | SEM = standard error of the mean, σ_{x̅} = σ/√n | SEP = standard error of the proportion, σ_{p̂} = √[pq/n]
Sampling distribution is close enough to normal if ... 


NOTE: n is number of individuals per sample. Number of samples is indefinitely large and has no symbol. 
Key ideas:
Use normalcdf to compute probability. In normalcdf, the fourth argument is the unrounded standard error, not the population standard deviation.
Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.
Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.
(a) If the manufacturer’s claim is true, is a sample mean of 780 hours surprising? (Hint: Think about whether you need the probability of x̅ ≤ 780 or x̅ ≥ 780.)
(b) Would you accept the manufacturer’s claim?
(a) Describe the sampling distribution of the proportion who believe in angels in samples of 500 Americans.
(b) Use the normal approximation to compute the probability of finding that 350 to 370 in a sample of 500 believe in angels. Reminder: You can’t use the sample counts directly; you have to convert them to sample proportions.
(a) One way beginners play is to bet on red or black. If the ball comes up that color, they double their money; if it comes up any other color, they lose their money. Construct a probability model for the outcome of a $10 bet on red from the player’s point of view.
(b) Find the mean and SD for the outcome of $10 bets on red, and write a sentence interpreting the mean.
(c) Now take the casino’s point of view. A large casino can have hundreds of thousands of bets placed in a day. Obviously they won’t all be the same, but it doesn’t take many days to see a whole lot of any given bet. Describe the sampling distribution of the mean for a sample of 10,000 $10 bets on red.
(d) How much does the casino expect to earn on 10,000 $10 bets on red?
(e) What’s the chance that the casino will lose money on those 10,000 $10 bets on red?
(f) What’s the casino’s chance of making at least $2000 on those 10,000 $10 bets?
What is the probability of this happening, if the day’s mean is 5.00 pounds and SD 0.05 pounds?
(a) If you randomly pick one cabbage, what is the probability that its weight is more than 43.0 ounces?
(b) If you randomly pick 14 cabbages, what is the probability that their average weight is more than 43.0 ounces?
Heart Attack  No Attack  Total  p̂  

Placebo  189  10845  11034  1.71% 
Aspirin  104  10933  11037  0.94% 
The heart attack rate among aspirin takers was 0.94%, which looks like an impressive difference. Is there any chance that aspirin makes no difference, and this was just the result of random selection? In other words, how likely is it for that second sample to have p̂ = 0.94% if the true proportion of heart attacks in adult male aspirin takers is actually 1.71%, no different from adult males who don’t take aspirin?
(Hint: The 5% is the tails, the part of the sampling distribution that is not in the middle 95%.)
In a population where 45% have an unfavorable view of the Tea Party, how likely is a sample of 1504 where 737 or more have an unfavorable view? Can you draw any conclusions from that probability?
Updated 24 Dec 2017
In Chapter 8, you learned what sort of samples to expect from a known population. In the rest of the course, you’ll learn how to use a sample to make statements about an unknown population. This is inferential statistics.
In inferential statistics, there are two types of things you want to do: test whether some claim is true, and estimate the size of some effect. In this chapter you’ll construct confidence intervals that estimate population means and proportions; Chapter 10 starts you on testing claims.
In the Physicians’ Health Study [see “Sources Used” at end of book], about 22,000 male doctors were randomly assigned to take aspirin or a placebo every night. Of 11,037 in the treatment group, 104 had heart attacks and 10,933 did not. Can you say how likely it is for people in general (or, at least, male doctors in general) to have a heart attack if they take aspirin nightly?
As always, probability of one equals proportion of all. So you could just as well ask, what proportion of people who take aspirin would be expected to have heart attacks?
Before statistics class, you would divide 104/11037 = 0.0094 and say that 0.94% of people taking nightly aspirin would be expected to have heart attacks. This is known as a point estimate.
But you are in statistics class. You know that a sample can’t perfectly represent the population, and therefore all you can say is that the true proportion of heart attacks in the population of aspirin takers is around 0.94%. Can you be more specific?
Yes, you can. You can compute a confidence interval for the proportion of heart attacks to be expected among aspirin takers, based on your sample, and that’s the subject of this chapter. We’ll get back to the doctors and their aspirin later, but first, let’s do an example with M&Ms.
Example 1: You take a random sample of 605 plain M&Ms, and 87 of them are red. What can you say about the proportion of reds in all plain M&Ms?
A point estimate of a population parameter is the single best available number, and in fact it’s nothing more than the corresponding sample statistic.
In this example, your point estimate for population proportion is sample proportion, 87/605 = 14.4%, and you conclude “Somewhere around 14.4% of all plain M&Ms are red.”
The sample proportion is a point estimate of the proportion in the population, the sample mean is a point estimate of the mean of the population, the sample standard deviation is a point estimate of the standard deviation of the population, and so on.
A confidence interval estimate of a population parameter is a statement of bounds on that parameter and includes your level of confidence that the parameter actually falls within those bounds.
For instance, you could say “I’m 95% confident that 11.6% to 17.2% of plain M&Ms are red.” 95% is your confidence level (symbol: 1−α, “one minus alpha”), and 11.6% and 17.2% are the boundaries of your estimate or the endpoints of the interval.
As an alternative to endpoint form, you could write a confidence interval as a point estimate and a margin of error, like this: “I’m 95% confident that the proportion of red in plain M&Ms is 14.4% ± 2.8%.” 14.4% is your point estimate, and 2.8% is your margin of error (symbol: E), also known as the maximum error of the estimate. Since the confidence interval extends one margin of error below the point estimate and one margin of error above the point estimate, the margin of error is half the width of the confidence interval.
For all the cases you’ll study in this course, the point estimate — the mean or proportion of your sample — is at the middle of the confidence interval. But that’s not true for some other cases, such as estimating the standard deviation of a population. For those cases, computing the margin of error is uglier.
As you might expect, your TI83/84 and lots of statistical packages can compute confidence intervals for you. But before doing it the easy way, let’s take a minute to understand what’s behind computing a confidence interval.
You can compute an interval to any level of confidence you desire, but 95% is most common by far, so let’s start there. How do you use those 87 reds in a sample of 605 M&Ms to estimate the proportion of reds in the population, and have 95% confidence in your answer?
In Chapter 8, you learned how to find the sampling distribution of p̂. Given the true proportion p in the population, you could then determine how likely it is to get a sample proportion p̂ within various intervals. To find a confidence interval, you simply run that backward.
You don’t know the proportion of reds in all plain M&Ms, so call it p. You know that, if the sample size is large enough, sample proportions are ND and there’s a 95% chance that any given sample proportion will be within 2 standard errors on either side of p, whatever p is.
The standard error of the proportion is σ_{p̂} = √[pq/n]. You don’t know p — that’s what you’re trying to find. Are you stuck? No, you have an estimate for p. Your point estimate for the population proportion p is the sample proportion p̂ = 87/605. You can estimate the standard error of the proportion (the SEP) by using the statistics of your sample:
σ_{p̂} ≈ √[(87/605)(1−87/605)/605] = 0.0142656 or about 1.4%
Two standard errors is 0.0285312 → 0.029 or 2.9%.
How good is this estimate? For decent-sized samples, it’s quite good. For example, suppose the true population proportion p is 50% or 0.5. For a sample of n = 625, the SEP is √[.5(1−.5)/625] = 0.0200 or 2.00%. Your sample proportion is very, very, very unlikely to be as far away as 40% or 0.4, but even if it is then you would estimate the SEP as √[.4(1−.4)/625] = 0.0196 or 1.96%, which is extremely close.
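Here is a small Python sketch of those standard-error calculations, in case you want to try other values. The function name sep is just my label for √[pq/n]; nothing else is assumed beyond the numbers in the text.

```python
from math import sqrt

def sep(p, n):
    """Standard error of the proportion, sqrt(p*q/n), where q = 1 - p."""
    return sqrt(p * (1 - p) / n)

# M&M sample: estimate the SEP from the sample proportion 87/605.
est = sep(87 / 605, 605)
two_se = 2 * est
print(f"estimated SEP = {est:.7f}, two standard errors = {two_se:.3f}")

# How much does a wrong p matter? Compare p = 0.5 with p = 0.4 for n = 625.
true_sep = sep(0.5, 625)
off_sep = sep(0.4, 625)
print(f"SEP with p = 0.5: {true_sep:.4f}   SEP with p = 0.4: {off_sep:.4f}")
```

Even a sample proportion that misses by ten percentage points barely moves the standard error, because p(1−p) changes slowly near 0.5.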
Different authors use the term “standard error” slightly differently. Some use it only for the standard deviation of the sampling distribution, which you never know exactly because you never know the population parameters exactly. Others use it only for the estimate based on sample statistics, which I computed just above. Still others use it for either computation. In practice it doesn’t make a lot of difference. I don’t see much point to getting too fussy about the terminology, given that only one of them can be computed anyway.
Any given sample proportion is 95% likely to be within two standard errors or 2.9% of the population proportion:
p−0.029 ≤ p̂ ≤ p+0.029 (probability = 95%)
Now the magic reverso: Given a sample proportion, you’re 95% confident that the population proportion is within 2.9% of that sample proportion:
p̂−0.029 ≤ p ≤ p̂+0.029 (95% confidence)
In this case, your sample proportion is 87/605 ≈ 0.144:
0.144−0.029 ≤ p ≤ 0.144+0.029 (95% confidence)
0.115 ≤ p ≤ 0.173 (95% confidence)
So your 95% confidence interval is 0.115 to 0.173, or 11.5% to 17.3%.
p−0.029 ≤ p̂ ≤ p+0.029
Multiply by −1. When you multiply by a negative, you have to reverse the inequality signs.
−p+0.029 ≥ −p̂ ≥ −p−0.029
Rewrite in conventional order, from smallest to largest.
−p−0.029 ≤ −p̂ ≤ −p+0.029
Now add p+p̂ to all three “sides”.
p̂−0.029 ≤ p ≤ p̂+0.029
You might have noticed that I changed from 95% probability to 95% confidence. What’s up with that? Well, the sample proportion is a random variable — different samples will have different sample proportions p̂, and you can compute the probability of getting p̂ in any particular range.
But the population proportion p is not a random variable. It has one definite value, even though you don’t know what that definite value is. Probability statements about a definite number make about as much sense as discussing the probability of precipitation for yesterday. The population proportion is what it is, and you have some level of confidence that your estimated range includes that true value.
What does “95% confident” mean, then? Simply this: In the long run, when you do everything right, 95% of your 95% intervals will actually include the population proportion, and the other 5% won’t. 5% is 5/100 = 1/20, so in the long run about one in 20 of your 95% confidence intervals will be wrong, just because of sample variability.
Probability of one = proportion of all, so there’s one chance in twenty that this interval is wrong, meaning that it doesn’t contain the true population proportion, even if you did everything right. If that makes you too nervous, you can use a higher confidence level, but you can never reach 100% confidence.
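You can watch the “95% of your 95% intervals” idea play out in a simulation. The sketch below is only an illustration: the true proportion of 0.15, the seed, and the number of trials are values I made up, and the interval uses the critical z of about 1.96 rather than the rough figure of 2.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)                      # fixed seed so the run is repeatable
TRUE_P = 0.15                       # hypothetical "true" proportion of reds
N = 605                             # sample size, as in the M&M example
TRIALS = 2000                       # number of simulated samples
z = NormalDist().inv_cdf(0.975)     # critical z for 95% confidence

hits = 0
for _ in range(TRIALS):
    # Draw one sample of N candies and count the reds.
    reds = sum(random.random() < TRUE_P for _ in range(N))
    p_hat = reds / N
    margin = z * sqrt(p_hat * (1 - p_hat) / N)
    # Does this sample's interval capture the true proportion?
    if p_hat - margin <= TRUE_P <= p_hat + margin:
        hits += 1

coverage = hits / TRIALS
print(f"{coverage:.1%} of the simulated 95% intervals contain the true p")
```

Each individual interval either contains 0.15 or it doesn't; the 95% describes the long-run success rate of the procedure.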
There’s one more wrinkle. That margin of error of 0.029 was 2σ_{p̂}, two standard errors. The figure of 2 standard errors for the middle 95% of a ND comes from the Empirical Rule or 68–95–99.7 Rule, so it’s only approximately right.
But you can be a little more precise. In Chapter 7 you learned to find the boundaries of the middle of a ND for any percentage, and that lets you generalize to any confidence level:
Quantity | This Example | General Case
Confidence level (middle area of the ND) | 95% | 1−α
Area in the two tails combined | 100%−95% = 5% or 0.05 | 1−(1−α) = α
Area in each tail | 0.05/2 = 0.025 | α/2
The boundaries are | ±z_{0.025} = invNorm(1−0.025) = 1.9600 | ±z_{α/2}
The margin of error is | E = 1.96σ_{p̂} | E = z_{α/2} σ_{p̂}
And you compute it as | E = 1.96 √[p̂(1−p̂)/n] | E = z_{α/2} √[p̂(1−p̂)/n]
The margin of error on a 1−α confidence interval is z_{α/2} standard errors. (This will be important when you determine necessary sample size, below.)
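The invNorm step translates directly into Python's standard library, where NormalDist().inv_cdf plays the role of invNorm. This is a sketch with a helper name, z_crit, that I've chosen for illustration.

```python
from statistics import NormalDist

def z_crit(conf_level):
    """Critical z for a confidence level: the z-score with area alpha/2 to its right."""
    alpha = 1 - conf_level
    # inv_cdf wants area to the LEFT, so ask for 1 - alpha/2.
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_crit(level):.4f}")
```

Higher confidence levels push the boundaries farther out, which is why they cost you a wider interval.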
The margin of error on a 95% confidence interval is close to 2σ_{p̂}, but more accurately it’s 1.96σ_{p̂}. For the proportion of red M&Ms, where the SEP was σ_{p̂} = 0.0142656, the margin of error is 1.96σ_{p̂} = 0.0279606 → 0.028 or 2.8%. Since the point estimate was 14.4%, you’re 95% confident that the proportion of reds in plain M&Ms is within 14.4%±2.8%, or 11.6% to 17.2%.
You’ve seen that there are two ways to state a confidence interval: from ____ to ____ with ____% confidence, or ____ ± ____ with _____% confidence. Mathematically these are equivalent, but psychologically they’re very different. The first form is better than the second.
What’s wrong with the ____ ± ____ form? It’s easy to misinterpret.
If you say “I’m 95% confident that the proportion of reds in plain M&Ms is within 14.4%±2.8%”, some people will read 14.4% and stop — they’ll think that the population proportion is 14.4%. And even people who get past that will probably think that there’s something special about 14.4%, that somehow it’s more likely to be the true proportion of reds among all plain M&Ms. But 14.4% is just a value of a random variable, namely the proportion of reds in this sample. Another sample would almost certainly have a different p̂ and therefore a different midpoint for the interval.
It’s much better to use the endpoint form, because the endpoint form is harder to misinterpret. When you say “I’m 95% confident that the proportion of reds in plain M&Ms is 11.6% to 17.2%”, you lead the reader, even the nontechnical reader, to understand that the proportion could be anything in that range, and even that there’s a slight chance that it’s outside that range.
Requirements check (RC): This is an essential step — do it before you compute the confidence interval. Computing the CI assumes that the sampling distribution of p̂ is a ND, but “assumes” in statistics means you don’t assume, you check it.
The requirements are stated in Chapter 8 as simple random sample (or equivalent), np and nq both ≥ about 10, 10n ≤ N. You don’t know p, but for binomial data it’s okay to use p̂ as an estimate. But np̂ is just the number of yeses or successes in your sample, and nq̂ is just the number of noes or failures in your sample, so you really don’t need to do any multiplications.
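The countable parts of that check are easy to express in code. The sketch below is my own illustration: the function name and its optional population_size argument are hypothetical, and no program can verify for you that the sample was random.

```python
def proportion_ci_requirements(successes, n, population_size=None, minimum=10):
    """Return the countable requirement checks for a one-proportion interval.

    Randomness of the sample cannot be tested here; that one is on you.
    """
    failures = n - successes
    checks = {
        "at least about 10 successes": successes >= minimum,
        "at least about 10 failures": failures >= minimum,
    }
    # Only check 10n <= N when the population size is actually known.
    if population_size is not None:
        checks["10n <= N"] = 10 * n <= population_size
    return checks

mm = proportion_ci_requirements(87, 605)        # 87 reds in 605 plain M&Ms
for name, ok in mm.items():
    print(f"{name}: {'OK' if ok else 'NOT MET'}")
```

For the M&Ms the population is effectively unlimited, so the 10n ≤ N check is left out rather than guessed at.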
Here’s how you check the requirements:
Your TI83 or TI84 can easily compute confidence intervals for a population proportion. With binomial data, this is Case 2 in Inferential Statistics: Basic Cases. (Excel can do it too, but it’s significantly harder in Excel.)
Example 2:
Let’s do the red M&Ms, since you already know the answer.
See the requirements check above.
Press [STAT] [◄] to get to the STAT TESTS menu, and scroll up or down to find 1PropZInt. (Caution: you don’t want 1PropZTest. That’s reserved for Chapter 10.)
Enter the number of successes in the sample, the sample size, and the confidence level. Easy-peasy!
Write down the screen name and your inputs, then proceed to the output screen and write down just the new stuff:
Here’s how you show your work:
1PropZInt 87, 605, .95 (not PropZInt, please!)
(.11584, .17176), p̂ = .1438016529
There’s no need to write n=605 because you already wrote it down from the input screen.
Interpretation: I’m 95% confident that 11.6% to 17.2% of plain M&Ms are red.
You can vary that in several ways. For instance, some people like to put the confidence level last: 11.6% to 17.2% of plain M&Ms are red (95% confidence). Or they may choose more formal language: We’re 95% confident that the true proportion of reds in plain M&Ms is 11.6% to 17.2%.
I’ve already pooh-poohed the margin-of-error form, but sometimes you have to write it that way, for instance if your boss or your thesis advisor demands it. You can get it easily from the TI83/84 output screen.
The center of the interval, the point estimate, is given: 14.38%. To find the margin of error, subtract that from the upper bound of the interval, or subtract the lower bound from it: .17176−.1438 = .02796, or .1438−.11584 = .02796. Either way it’s 2.8%. You can then express the CI as 14.4%±2.8% with 95% confidence.
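For readers without the calculator handy, here is a Python sketch of the same computation. It is not the calculator's own code, just the formula E = z_{α/2}·√[p̂(1−p̂)/n] applied directly; the function name is mine.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_interval(x, n, conf_level):
    """Confidence interval for a proportion: returns (low, high, p_hat, margin)."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)   # critical z
    margin = z * sqrt(p_hat * (1 - p_hat) / n)           # z times estimated SEP
    return p_hat - margin, p_hat + margin, p_hat, margin

low, high, p_hat, margin = one_prop_z_interval(87, 605, 0.95)
print(f"endpoint form: ({low:.5f}, {high:.5f})")
print(f"margin-of-error form: {p_hat:.1%} ± {margin:.1%}")
```

Both forms come from the same two numbers, the point estimate and the margin of error, so you can always convert one into the other.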
Example 3: What about the male doctors who started this section? 104 out of 11037 of the doctors taking nightly aspirin had heart attacks. Assuming that male doctors are representative of adults in general, in terms of heart-attack risk, what can you say about the chance of heart attack for anyone who takes aspirin nightly? Use confidence level 1−α = 95%.
Solution: Requirements check (RC):
1PropZInt 104, 11037, .95
(.00762, .01123), p̂ = .0094228504
Conclusion: People who take nightly aspirin have a 0.76% to 1.12% chance of heart attack (95% confidence).
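Example: Suppose that your sample was only 50 doctors, and none of them had heart attacks. With no successes in the sample, the usual interval doesn’t apply, but an approximate 95% upper bound is 3/n (often called the “rule of three”). Here 3/50 = 6%, so you would be 95% confident that people who take nightly aspirin have a zero to 6% chance of heart attack.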
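The 3/50 in that example is the shortcut often called the rule of three. One way to see where it comes from: if the true proportion were p, the chance of zero successes in n tries would be (1−p)^n, and the largest p for which that chance is still 5% solves (1−p)^n = 0.05. The sketch below compares that exact bound with 3/n; the variable names are mine.

```python
n = 50                  # sample size with zero successes
confidence = 0.95

# Solve (1 - p)^n = 1 - confidence for p.
exact_upper = 1 - (1 - confidence) ** (1 / n)

# The shortcut: 3/n.
rule_of_three = 3 / n

print(f"exact upper bound: {exact_upper:.4f}")
print(f"rule of three:     {rule_of_three:.4f}")
```

The shortcut comes out slightly larger than the exact bound, so it errs on the cautious side.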
The equation for margin of error is packed with information: E = z_{α/2} · √[p̂(1−p̂)/n]
You can see that a larger sample size n means a narrower confidence interval, but the sample size is inside the square-root sign so you don’t get as much benefit as you might hope for. If you take a sample four times as big, the square root of 4 is 2 and so your interval is half as wide, not ¼ as wide.
You can see also that you get a narrower interval if you’re willing to live with a lower confidence level. The lower your confidence level, the smaller z_{α/2} will be, and therefore the narrower your confidence interval.
The bottom line is that there’s a three-way tension among sample size, confidence level, and margin of error. You can choose any two of those, but then you have to live with the third. (p̂ doesn’t come into it. Although p̂ does contribute to the standard error and therefore to the margin of error, you can’t choose what p̂ you’re going to get in a sample.)
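A few lines of Python make that trade-off concrete. This sketch reuses the M&M proportion 87/605 purely as an illustration, and the function name is my own.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(p_hat, n, conf_level):
    """E = z_(alpha/2) * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return z * sqrt(p_hat * (1 - p_hat) / n)

p_hat = 87 / 605
base = margin_of_error(p_hat, 605, 0.95)            # the original sample
four_times = margin_of_error(p_hat, 4 * 605, 0.95)  # four times the sample size
lower_conf = margin_of_error(p_hat, 605, 0.90)      # same sample, less confidence

print(f"n = 605,  95%: E = {base:.4f}")
print(f"n = 2420, 95%: E = {four_times:.4f}")
print(f"n = 605,  90%: E = {lower_conf:.4f}")
```

Quadrupling the sample only halves the margin of error, while dropping to 90% confidence narrows the interval without collecting any more data.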
If you want to get a confidence interval at your preferred confidence level with (no more than) a specified margin of error, how big a sample do you need? MATH200A Program part 5 will compute this for you, but let’s look at the formula first.
(See Getting the Program for instructions on getting the MATH200A program into your calculator.)
The equation at the start of this section shows the margin of error you get for a given sample size and confidence level. You can solve for the sample size n, like this:
E = z_{α/2} · √[p̂(1−p̂)/n] ⇒ n = p̂(1−p̂) · [z_{α/2} / E]²
In the formula, p̂ is your prior estimate if you have one. This can be the result of a past study, or a reasonable estimate if it has some logical basis. If you don’t have a prior estimate, use 0.5.
Using .5 as your prior estimate, you’re guaranteed that your sample won’t be too small, though it may be larger than necessary. Why not just use .5 all the time? Because taking samples always costs time and usually costs money, so you don’t want a larger sample than necessary.
Example 4: In a sample of 605 plain M&Ms, 87 were red. The 95% confidence interval had a 2.8% margin of error. How big a sample would you need to reduce the margin of error to 2%?
With the MATH200A program (recommended):
Run MATH200A part 5 (sample size) and choose the binomial case. The next screen wants your estimated p, your desired margin of error, and your desired confidence level. Your prior estimate is 87/605, from your earlier study. Your margin of error is 2% = .02 (not .2 !), and your confidence level is 95% = .95. The output screen echoes back your inputs, in case you forgot to write down the input screen, and then tells you that the sample size must be at least 1183 M&Ms. Notice the inequalities: for a margin of error of .02 (2%) or less, you need a sample size 1183 or more. (z Crit is critical z or z_{α/2}, the number of standard errors associated with your chosen confidence level.)
If you’re not using the program:
Marshal your data: prior estimate p̂ = 87/605, desired margin of error E = 0.02, and confidence level 1−α = 0.95.
You need z_{α/2}. Get α/2 from the confidence level: 1−α = 0.95 ⇒ α = 0.05 ⇒ α/2 = 0.025.
z_{α/2} is the z-score such that the area to the right is α/2. In this problem, α/2 = 0.025, so you’re computing z_{0.025}. You’ll use invNorm, but invNorm wants area to left and 0.025 is an area to right, so you compute invNorm(1−0.025).
Now, to avoid reentering that z value, chain your calculations. The formula says you need to divide by E, so divide the result by .02. Then square the fraction. Finally, multiply by p̂ and (1−p̂). You get 1182.4…, and therefore your required sample size is 1183.
Caution! Your answer is 1183, not 1182. You don’t round the result of a sample-size calculation. If it comes out to a whole number (unusual), that’s your answer. Otherwise, you round up to the next whole number. Why? Smaller sample size makes larger margin of error. n = 1182.4… corresponds to E = 0.02 exactly. A sample of 1182 would be just slightly under 1182.4…, and your margin of error would be just slightly over 0.02. But 0.02 was the maximum acceptable margin of error, so 1182.4… is the minimum acceptable sample size. You can’t take a fraction of an M&M in your sample, so you have to go up to the next whole number.
There’s no requirements check in sample-size problems. These are planning how to take your sample; requirements apply to your sample once you have it.
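The sample-size steps can be sketched in Python as follows. This is not the MATH200A program, only the formula n = p̂(1−p̂)·(z_{α/2}/E)² with the round-up rule applied; the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_proportion(p_prior, margin, conf_level):
    """Return (unrounded n, required n) for a desired margin of error."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)   # critical z
    raw = p_prior * (1 - p_prior) * (z / margin) ** 2
    # Always round UP: a smaller sample would exceed the allowed margin of error.
    return raw, ceil(raw)

raw, needed = sample_size_for_proportion(87 / 605, 0.02, 0.95)
print(f"unrounded: {raw:.1f}, required sample size: {needed}")

# With no prior estimate, 0.5 is the safe choice because p(1-p) is largest there.
raw_safe, needed_safe = sample_size_for_proportion(0.5, 0.02, 0.95)
print(f"with p = 0.5: required sample size: {needed_safe}")
```

Comparing the two results shows what a prior estimate buys you: the same margin of error and confidence level, with roughly half as many M&Ms to count.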
Example 5: You’re taking the first political poll of the season, and you’d like to know what fraction of adults favor your candidate. You decide you can live with a 90% confidence level and a 3% margin of error. How many adults do you need in your random sample?
Solution: Since you have no prior estimate for p, make p̂ = .5.
With the MATH200A program (recommended):
MATH200A/sample size/binomial
p̂=.5, E=.03, CLevel=.9, n ≥ 752
If you’re not using the program:
1−α = .9, E = .03, p̂ = .5
1−α = .9 ⇒ α = 0.1 ⇒ α/2 = 0.05
z_{0.05} = invNorm(1−.05)
Divide by E, which is .03.
Square the result.
Multiply by p̂ times (1−p̂).
n = 751.5… → 752
Numeric data are pretty much the same deal as binomial data, though there are a couple of wrinkles:
The second one is a problem, because you almost never know the standard deviation of the population. Therefore, we won’t be working any problems for this case. Instead, I’ll give you a little more theory to lay the groundwork for the next section, which explains how we get around this knowledge gap.
If you know the standard deviation of the population — and you hardly ever do — then your confidence interval is
x̅ − z_{α/2} · σ/√n ≤ μ ≤ x̅ + z_{α/2} · σ/√n
If you’re ever in this situation, you can compute a confidence interval on your TI83/84 by choosing ZInterval in the STAT TESTS menu.
The margin of error is E = z_{α/2} · σ/√n, so the required sample size for a margin of error E with confidence level 1−α is
n = [ z_{α/2} · σ / E]²
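Since the text works no problems for this case, the sketch below uses made-up numbers purely to show the formulas in action: a hypothetical population standard deviation σ = 15, a desired margin of error of 2, and a hypothetical sample mean of 100. The function names are mine.

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, conf_level):
    """Confidence interval for a mean when the population SD sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

def sample_size_for_mean(sigma, margin, conf_level):
    """n = (z * sigma / E)^2, rounded up to the next whole number."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return ceil((z * sigma / margin) ** 2)

# Hypothetical planning problem: sigma = 15, want E <= 2 at 95% confidence.
n_needed = sample_size_for_mean(15, 2, 0.95)
low, high = z_interval(100, 15, n_needed, 0.95)   # hypothetical sample mean of 100
print(f"required n = {n_needed}")
print(f"interval with that n: ({low:.2f}, {high:.2f})")
```

With the required sample size, the interval's half-width comes out just under the margin of error you asked for.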
(You can also use MATH200A part 5.)
“Houston, we have a problem!” A confidence interval is founded on the sampling distribution of the mean or proportion. Everything in Chapter 8 on the sampling distribution of the mean was based on knowing the standard deviation of the population. But you almost never know the standard deviation of the population. How to resolve this?
The solution comes from William Gosset, who worked for Guinness in Dublin as a brewer. (I swear I am not making this up.) In 1908 he published a paper called The Probable Error of a Mean [see “Sources Used” at end of book]. For competitive reasons, the Guinness company wouldn’t let him use his own name, and he chose the pen name “Student”. The t distribution that he described in his paper has been known as Student’s t ever since.
While looking for Gosset’s original paper, I stumbled on Probable Error of a Mean, The (“Student”) (Moulton [see “Sources Used” at end of book]). It’s a fascinating look at what Gosset did and didn’t accomplish, and how this classic paper was virtually ignored for years. Things didn’t start to happen till Gosset sent a copy of his tables to R. A. Fisher with the remark that Fisher was the only one who would ever use them! It was Fisher who really got the whole world using Student’s t distribution.
Gosset knew that the standard error of the mean is σ/√n, but he didn’t know σ. He wondered what would happen if he estimated the standard error as s/√n, and did some experiments to answer that question. Since s varies from one sample to the next, this new t distribution spreads out more than the ND. Its peak is shallower, and its tails are fatter.
Actually, there’s no such thing as “the” t distribution. There’s a different t for each sample size. The larger the sample, the closer that t distribution is to a normal distribution, but it’s never quite normal.
For technical reasons, t distributions aren’t identified by sample size, but rather by degrees of freedom (symbol df or Greek ν, “nu”). df = n−1. Here are two t distributions:
Solid: standard normal distribution Line: Student’s t for df = 4, n = 5 
Solid: standard normal distribution Line: Student’s t for df = 29, n = 30 
What do you see? Student’s t for 4 degrees of freedom is quite a bit more spread out than the ND: 12.2% of sample means are more than about two standard errors from the mean (outside ±1.96, the boundaries of the middle 95% of the ND), versus only 5% for the ND.
At this scale, Student’s t for 29 degrees of freedom looks identical to the ND, but it’s not quite the same. You can see that 6% of sample means are beyond those same boundaries, versus 5% for the ND.
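Those tail areas can be checked numerically. Python's standard library has no Student's t, so this sketch integrates the t density with Simpson's rule; that numerical method is my choice for illustration, not anything the calculator is known to do. The boundaries are taken as ±1.96, the cutoffs that leave exactly 5% in the tails of the ND.

```python
from math import gamma, pi, sqrt
from statistics import NormalDist

def t_pdf(t, df):
    """Density of Student's t with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_area_between(a, b, df, steps=2000):
    """Area under Student's t between a and b by Simpson's rule (steps must be even)."""
    h = (b - a) / steps
    total = t_pdf(a, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(a + i * h, df)
    return total * h / 3

z = NormalDist().inv_cdf(0.975)              # about 1.96
tails_4 = 1 - t_area_between(-z, z, 4)       # df = 4, n = 5
tails_29 = 1 - t_area_between(-z, z, 29)     # df = 29, n = 30

print(f"df = 4:  {tails_4:.1%} beyond ±{z:.2f}")
print(f"df = 29: {tails_29:.1%} beyond ±{z:.2f}")
```

The fatter tails are exactly why a t interval must reach farther out than a z interval to capture the same 95%.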
You don’t really need a list of properties of Student’s t, because your calculator is going to do the work for you. It’s enough to know this:
The logic of confidence intervals for numeric data is the same whether you know the standard deviation of the population or not. Even the requirements are the same. The only difference is between using a z and a t.
x̅ − t_{α/2} · s/√n ≤ μ ≤ x̅ + t_{α/2} · s/√n (1−α confidence)
(It’s understood that you have to use the right number of degrees of freedom, df = n−1, in finding critical t.)
Example 6: You’re auditing a bank. You take a random sample of 50 cash deposits and find a mean of $189.56 and standard deviation of $42.17.
(a) Estimate the mean of all cash deposits, with 95% confidence.
(b) The bank’s accounting department tells you that the average cash deposit is over $210.00. Is that believable?
Solution: You want to compute a confidence interval about the mean of all deposits. You have numeric data, and you don’t know the standard deviation of the population, σ. This is Case 1 in Inferential Statistics: Basic Cases. In your sample, n = 50, x̅ = 189.56, and s = 42.17.
First, check the requirements (RC):
(Since the sample is large enough, there’s no need to verify normality or check for outliers.)
Now calculate the interval.
On your TI83/84, in the STAT TESTS menu, select 8:TInterval. The difference between Data and Stats is whether you have all the data points, or just summary statistics. In this case you have only the stats, so cursor onto Stats and press [ENTER]. (The lower part of the screen may change.)
Enter your sample statistics and your desired confidence level. Write down your inputs before you select Calculate:
TInterval 189.56, 42.17, 50, .95
Proceed to the output screen, and write down everything new. There isn’t much:
(177.58, 201.54)
Finally, write your interpretation. I’m 95% confident that the average of all cash deposits is between $177.58 and $201.54.
Caution! Don’t say anything like “95% of deposits are between $177.58 and $201.54.” Your confidence interval is an estimate of the true average of all deposits, and it’s not about the individual deposits. With a standard deviation of $42 and change, you would predict that 95% of deposits are within 2×42.17 = $84.34 either side of the mean, which is a much wider interval.
Now turn to part (b). Management claims that the average of all cash deposits is > $210.00. Is that believable? Well, it’s not impossible, but it’s unlikely. You’re 95% confident that the average of all deposits is between $177.58 and $201.54, which means you’re 95% confident that it’s not < $177.58 or > $201.54. But they’re claiming $210, which is outside your confidence interval. Again, they’re unlikely to be correct — there’s less than a 5% likelihood (100%−95% = 5%).
Example 7: In a random sample from the 237 vehicles on a used-car lot, the following weights in pounds were found:
2500 3250 4000 3500 2900 4500 3800 3000 5000 2200
Estimate the average weight of vehicles on the lot, with 90% confidence.
Solution: Check the requirements first. You have a small sample (n < 30), so you have to verify that the data are ND and there are no outliers. Here are the results of normality check and box-whisker plot in MATH200A:
There’s not much you can write for the box-whisker plot, but you can show the normality test numerically:
Now proceed to your TInterval. This time you have the actual data, so you choose Data on the screen. Specify your data list. Freq (frequency) should already be set to 1; if not, first press the [ALPHA] key once, and then [1] [ENTER].
Enter your confidence level, and write down your inputs:
TInterval L1, 1, .90
When you have raw data, everything on the output screen is new:
(2956, 3974)
x̅ = 3465, s = 878.1, n = 10
You’re 90% confident that the average weight of all vehicles on the lot is between 2956 and 3974 pounds.
Again, this is an estimate of the average weight of the population (the 237 cars on the lot). In your interpretation, you can’t say anything about the weights of individual vehicles, because you don’t know anything about the weights of individual vehicles, apart from your sample.
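If you'd like to double-check the TInterval away from the calculator, here is a minimal Python sketch using only the standard library. It is an illustration, not a description of how the TI-83/84 or MATH200A works internally: the helper names `t_cdf`, `t_crit`, and `t_interval` are hypothetical, and critical t is found by numerically integrating the t density and then bisecting, with an assumed upper search bound of 100.

```python
import math
import statistics

def t_cdf(t, df, steps=2000):
    """P(T <= t) for Student's t with df degrees of freedom, for t >= 0.
    Uses Simpson's rule on the density between 0 and t (steps must be even)."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    h = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + total * h / 3

def t_crit(conf, df):
    """Critical t for a two-sided interval at confidence level conf, by bisection.
    The search bound of 100 is an assumption that covers the cases in this chapter."""
    target = 1 - (1 - conf) / 2
    lo, hi = 0.0, 100.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def t_interval(data, conf):
    """Confidence interval for the population mean from raw data."""
    n = len(data)
    xbar = statistics.mean(data)
    s = statistics.stdev(data)          # sample standard deviation (n - 1 divisor)
    margin = t_crit(conf, n - 1) * s / math.sqrt(n)
    return xbar - margin, xbar + margin

weights = [2500, 3250, 4000, 3500, 2900, 4500, 3800, 3000, 5000, 2200]
low, high = t_interval(weights, 0.90)
print(round(low), round(high))
```

Rounded to whole pounds, the endpoints should agree with the calculator's (2956, 3974).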
Why do you have to check for outliers? If your sample passes the normality check, isn’t that enough? No! If a sample passes the normality check, it still might have outliers.
No sample is perfectly normal, so you’re not actually deciding “is it normal or not?” Instead, you’re finding the strength of evidence against normality. The smaller r is, the stronger the evidence against a ND. If r < crit, the evidence is so strong that you say the data are nonnormal. But if r > crit, you can’t say that the data are definitely normal, only that you can’t rule out a ND based on this test. But outliers make the evidence against the normal model too strong, so if outliers are present then you can’t treat the data as normal.
This “fail to prove” is similar to what you saw in Chapter 4 with decision points: you could prove that the correlation was nonzero, but you couldn’t prove that it was zero. Starting in Chapter 10, you’ll see that this is how inferential statistics works whenever you’re testing some proposition.
Why are outliers a problem? Well, your confidence interval depends on the mean and standard deviation of your sample. But x̅ and s are sensitive to outliers. (That sensitivity goes down as sample size goes up, so you don’t have to worry with samples bigger than about 30.)
To make this clearer, let’s look at an example. I drew these 15 points from a moderately skewed population:
157 171 182 189 201 208 217 219
229 242 247 252 265 279 375
The normality test shows r > crit. So far so good. But the box plot shows a big honkin’ outlier:
How big a difference does it make? Quite a lot, unfortunately. Here are the 95% confidence intervals for the original sample, and the sample with the outlier removed. The means are different, the standard deviations are really different, and the high ends of the confidence intervals are pretty different too. (The screens don’t show the margins of error, but they too are quite different: (258.45−199.28)/2 = 29.6 and (239.36−197.5)/2 = 20.9.)
Do you say that the outlier increased the mean by almost 5% and the SD by almost 50%, moved the confidence interval and made it wider? That’s not really fair — the sample is what it is (assuming you’ve ruled out a mistake in data entry). If you start throwing out points, you no longer have a random sample. On the other hand, that one point does seem to carry an awful lot of weight, and it doesn’t seem right to have results depend so heavily on one point.
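To see how much weight that one point carries, you can compare the mean and standard deviation with and without it. This is a small Python sketch using the standard `statistics` module; the variable names are just for illustration.

```python
import statistics

sample = [157, 171, 182, 189, 201, 208, 217, 219,
          229, 242, 247, 252, 265, 279, 375]
trimmed = [x for x in sample if x != 375]    # same data without the outlier

# Ratios of the statistics with the outlier to those without it
mean_ratio = statistics.mean(sample) / statistics.mean(trimmed)
sd_ratio = statistics.stdev(sample) / statistics.stdev(trimmed)

print(f"mean with outlier is {mean_ratio - 1:.1%} higher")
print(f"SD with outlier is {sd_ratio - 1:.1%} higher")
```

The increases should come out just under 5% for the mean and just under 50% for the SD, matching the figures quoted above.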
So what do you do? If you can, you take another sample, preferably a larger one. Larger samples are less likely to have outliers in the first place, and outliers that do occur have less influence on the results.
But taking a new sample may not be practical. An alternative — not really great, but better than nothing — is to do the analysis both ways, once with the full sample and once with the outlier(s) excluded. That will at least give a sense of how much the outliers affect the results.
With the MATH200A program (recommended):  If you’re not using the program: 

Example 8: For the vehicle weights, your margin of error in a 90% CI was 3974−3465 = 509 pounds. How many vehicles would you need in your sample to get a 95% confidence interval with a margin of error of 500 pounds?
Solution: In MATH200A part 5, select
When you enter the last piece of information, you’ll notice that the calculator takes several seconds to come up with an answer; this is normal because it has to do an iterative calculation (fancy words for trial and error).
Critical t for a 95% CI with 14 degrees of freedom (n = 15) is 2.14, larger than critical z of 1.96 because the t distribution is more spread out. But of course what you really care about is the bottom line: to keep margin of error no greater than 500 pounds in a 95% CI, you need to sample at least 15 vehicles.
How is this computed? Start with the margin of error and solve for sample size:
E = t_{α/2}·s/√n ⇒ n = [t_{α/2}·s/E]²
The problem here is that t_{α/2} depends on df, which depends on n, so you haven’t really isolated sample size on the left side. The only way to solve this equation precisely is by a process of trial and error, and that’s what MATH200A does.
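The trial-and-error idea can be sketched in a few lines of Python. This is only an illustration of the approach, not the MATH200A code: the function names are hypothetical, critical t is computed by numeric integration and bisection with the standard library, and the search simply tries n = 2, 3, 4, … until the margin of error is small enough.

```python
import math

def t_cdf(t, df, steps=1000):
    """P(T <= t) for Student's t, t >= 0, by Simpson's rule (steps must be even)."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    h = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + total * h / 3

def t_crit(conf, df):
    """Two-sided critical t by bisection (upper search bound of 100 is an assumption)."""
    target = 1 - (1 - conf) / 2
    lo, hi = 0.0, 100.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def sample_size_t(s, E, conf):
    """Smallest n whose margin of error t*s/sqrt(n) is no greater than E."""
    n = 2
    while t_crit(conf, n - 1) * s / math.sqrt(n) > E:
        n += 1                      # margin still too wide: try a bigger sample
    return n

print(sample_size_t(878.1, 500, 0.95))
```

With s = 878.1, E = 500, and 95% confidence, the search should stop at n = 15.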
What if you don’t have the program? Since t is not super different from the normal distribution, you can alter the above formula and use z in place of t: n = [z_{α/2}·s/E]². But the t distribution is more spread out than the normal (z) distribution, so your answer may be smaller than the actual necessary sample size. If you do that and you get more than about 30, it’s probably nearly right for the t distribution. If your answer is small, you should increase it so that the TInterval doesn’t come out with too large a margin of error.
You calculate z_{α/2} exactly as you did in the sample-size formula for a confidence interval about a proportion. For example, with a 95% CI, 1−α = 0.95, α = 1−0.95 = 0.05, and α/2 = 0.025. z_{α/2} = z_{0.025} = invNorm(1−.025) = 1.9600. So using z for t you compute sample size [1.96·878.1/500]² = 11.8… → 12. That’s well under 30, so you want to bump it up a bit.
I’m deliberately glossing over this, because the program is a lot easier. But if you want more, check out Case 1 in How Big a Sample Do I Need? That page gives you all the details of the method, with worked-out examples.
At first glance, this procedure is less precise than the successive approximations done by MATH200A. But in fairness, there’s one more source of imprecision that neither method can avoid. Unlike binomial data, where small variations in the prior estimate p̂ made little difference to the computed sample size, for numeric data variations in the standard deviation do make a difference in computed sample size. Since s is squared in the formula, it can be a big difference. This can swamp any pettifogging details about t versus z.
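The z shortcut is easy to sketch in Python with the standard library's `NormalDist`, whose `inv_cdf` plays the role of invNorm. The function name `sample_size_z` is made up for this illustration.

```python
import math
from statistics import NormalDist

def sample_size_z(s, E, conf):
    """Approximate sample size using critical z in place of critical t."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # same as invNorm(1 - alpha/2)
    return math.ceil((z * s / E) ** 2)             # always round up

print(sample_size_z(878.1, 500, 0.95))   # 12
```

Because 12 is well under 30, treat it as a lower bound rather than the final answer.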
(The online book has live links.)
1PropZInt to compute a CI estimate of the population proportion p. See Inferential Statistics: Basic Cases for the requirements.
TInterval to compute a CI estimate of the population mean μ. See Inferential Statistics: Basic Cases for requirements.
Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.
Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.
The Neveready Company tested 40 randomly selected A-cell batteries to see how long they would operate a wireless mouse. They found a mean of 1756 minutes (29 hours, 16 minutes) and standard deviation (SD) 142 minutes. With 95% confidence, what’s the average life of all Neveready A cells in wireless mice?
You’re planning to conduct a poll about people’s attitudes toward a hot political issue, and you have absolutely no idea what proportion will be in favor and what proportion will be opposed. If you want a margin of error no more than 3.5% at 95% confidence, how large must your sample be?
The Department of Veterans Affairs is under fire for slow processing of veterans’ claims. An investigator for the Nightly Show randomly selected 100 claims (out of 68,917 at one office) and found that 40 of them had been open for more than a year. Find a 90% confidence interval for the proportion of all claims that have been open for more than a year.
For her statistics project, Sandra kept track of her commute times for 40 consecutive mornings (8 weeks). Treat this as a random sample. Her mean commute time was 17.7 minutes and her SD was 1.8 minutes. Find a 95% confidence interval for her average time on all commutes, not just this sample of 40.
Fifteen women in their 20s were randomly selected for health screening. As part of this, their heights in inches were recorded:
62.5 63 67 63.5 62 63 65 64.5 66.5 64.5 62.5 62 61.5 64.5 67.5
Construct a 95% confidence interval for the average height of women aged 20–29.
For his statistics project, Fred measured the body temperature of 18 randomly selected healthy male students. Here are his figures in °F:
98.3 97.7 98.6 98.5 97.5 98.6 98.2 96.9 97.9
96.9 97.8 99.3 98.6 99.2 96.9 97.8 97.9 98.3
(a) Write a 90% confidence interval for the average body temperature of healthy male students.
(b) What does this say about the famous “normal” temperature of 98.6°?
(c) What is his margin of error?
(d) To get an answer to within 0.1° with 95% confidence, how many students would he have to sample?
The Colorectal Cancer Screening Guidelines (CDC 2014 [see “Sources Used” at end of book]) recommend a colonoscopy every ten years for adults aged 50 to 75. A public-health researcher interviews a simple random sample of 500 adults aged 50–75 in Metropolis (pop. 6.4 million) and finds that 219 of them have had a colonoscopy in the past ten years.
(a) What proportion of all Metropolis adults in that age range have had a colonoscopy in the past ten years, at the 90% level of confidence?
(b) Still at the 90% confidence level, what sample size would be required to get an estimate within a margin of error of 2%, if she uses her sample proportion as a prior estimate?
The next year, you go back to audit the bank again. This time, you take a random sample of 20 cash deposits. Here are the amounts:
192.68 188.24 152.37 211.73 201.57 167.79 177.19 191.15 209.22 178.49 185.90 226.31 192.38 190.23 156.13 224.07 191.78 203.45 186.40 160.83
Construct a 95% confidence interval for the average of all cash deposits at the bank.
Not wanting to wait for the official results, Abe Snake commissioned an exit poll of voters. In a systematic sample of 1000 voters, 520 (52%) said they voted for Abe Snake. (14,000 people voted in the election.) That sounds good, but can he be confident of victory, at the 95% level?
Updated 1 Jan 2016
Summary: You want to know if something is going on (if there’s some effect). You assume nothing is going on (null hypothesis), and you take a sample. You find the probability of getting your sample if nothing is going on (p-value). If that’s too unlikely, you conclude that something is going on (reject the null hypothesis). If it’s not that unlikely, you can’t reach a conclusion (fail to reject the null).
Remember the Swain v. Alabama example? In a county that was 26% African American, Mr. Swain’s jury pool of 100 men had only eight African Americans. In that example, you assumed that selection was not racially biased, and on that basis you computed the probability of getting such a low proportion. You found that it was very unlikely. This disconnect between the data and the claim led you to reject the claim.
You didn’t know it, but you were doing a hypothesis test. This is the standard way to test a claim in statistics: assume nothing is going on, compute the probability of getting your sample, and then draw a conclusion based on that probability. In this chapter, you’ll learn some formal methods for doing that.
The basic procedure of a hypothesis test or significance test is due to Jerzy Neyman (1894–1981), a Polish American, and Egon Pearson (1895–1980), an Englishman. They published the relevant paper in 1933.
We’re going to take a seven-step approach to hypothesis tests. The first examples will be for binomial data, testing a claim about a population proportion. Later in this chapter you’ll use a similar approach with numeric data to test a claim about a population mean. In later chapters you’ll learn to test other kinds of claims, but all of them will just be variations on this theme.
Your first task is to turn the claim into algebra. The claim may be that nothing is going on, or that something is going on. You always have two statements, called the null and alternative hypotheses.
Definition: The null hypothesis, symbol H_{0}, is the statement that nothing is going on, that there is no effect, “nothin’ to see here. Move along, folks!” It is an equation, saying that p, the proportion in the population (which you don’t know), equals some number.
Definition: The alternative hypothesis, symbol H_{1}, is the statement that something is going on, that there is an effect. It is an inequality, saying that p is different from the number mentioned in H_{0}. (H_{1} could specify <, >, or just ≠.)
The hypotheses are statements about the population, not about your sample. You never use sample data in your hypotheses. (In real life you can’t make that mistake, since you write your hypotheses before you gather data. But in the textbook and the classroom, you always have sample data up front, so don’t make a rookie mistake.)
You must have the algebra (symbols) in your hypotheses, but it can also be helpful to have some English explaining the ultimate meaning of each hypothesis, or the consequences if each hypothesis is true. Here you want to know whether there’s racial bias in jury selection in the county.
You don’t want to know if the proportion of African Americans in Mr. Swain’s jury pool is less than 26%: obviously it is. You want to know if it’s too different — if the difference is too great to be believable as the result of random chance.
Write your hypotheses this way:
(1) 
H_{0}: p = 0.26, there’s no racial bias in jury selection
H_{1}: p < 0.26, there is racial bias in jury selection 

Obviously those can’t both be true. How will you choose between them? You’ll compute the probability of getting your sample (or a more unexpected one), assuming that the null hypothesis H_{0} is true, and one of two things will happen. Maybe the probability will be low. In that case you rule out the possibility that random chance is all that’s happening in jury selection, and you conclude that the alternative hypothesis H_{1} is true. Or maybe the probability won’t be too low, and you’ll conclude that this sample isn’t unusual (unexpected, surprising) for the claimed population.
The number in your null hypothesis H_{0}, with binomial data, is called p_{o} because it’s the proportion as given in H_{0}. (You may want to refer to the Statistics Symbol Sheet to help you keep the symbols straight.)
In fact it’s all people serving on Talladega County jury pools past, present and future. If there’s racial bias, then African Americans are less likely to be selected than whites, and — probability of one, proportion of all — therefore the overall population of jury pools has less than 26% African Americans. If there’s no racial bias, then in the long run the overall population of jury pools has the same 26% of African Americans as the county.
This is why a hypothesis test is also called a significance test or a test of significance.
Okay, you’re looking to figure out if this sample is inconsistent with the null hypothesis. In other words, is it too unlikely, if the null hypothesis H_{0} is true? But what do you mean by “too unlikely”? Back in Chapter 5, we talked about unusual events, with a threshold of 5% or 0.05 for such events. We’ll use that idea in hypothesis testing and call it a significance level.
Definition: The significance level, symbol α (the Greek letter alpha), is the chance of being wrong that you can live with. By convention, you write it as a decimal, not a percentage.
(2)  α = 0.05 

A significance level of 0.05 is standard in business and science. If you can’t tolerate a 5% chance of being wrong — if the consequences are particularly serious — use a lower significance level, 0.01 or 0.001 for example. (0.001 is common if there’s a possibility of death or serious disease or injury.) If the consequences of being wrong are especially minor, you might use a higher significance level, such as 0.10, but this is rare in practice.
In a classroom setting, you’re usually given a significance level α to use.
Later in this chapter, you’ll see that the significance level α is actually concerned with a particular way of being wrong, a Type I error.
Back in Chapter 8, you learned the CLT’s requirements for binomial data: random sample not larger than 10% of population, and at least 10 successes and 10 failures expected if the null hypothesis is true. You compute expected successes as np_{o} by using p_{o}, which is the number from H_{0}. Expected failures are then sample size minus expected successes, n−np_{o} in symbols. Steps 3 and 4 need the sampling distribution of the proportion to be a ND, so you must check the requirements as part of your hypothesis test.
(RC) 


You might wonder about the first test. “The county may say it’s random, but I don’t believe it. Isn’t that why we’re running this test?” Good question! Answer: Every hypothesis test assumes the null hypothesis is true and computes everything based on that. If you end up deciding that the sample was too unlikely, in effect you’ll be saying “I assumed nothing was going on, but the sample makes that just too hard to believe.”
This same idea — the null hypothesis H_{0} is innocent till proven guilty — explains why you use 0.26 (p_{o}) to figure expected successes and failures, not 0.08 (p̂). Again, the county claims that there’s no racial bias. If that’s true, if there’s no funny business going on, then in the long run 26% of members of jury pools should be African American.
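The expected-count part of the requirements check is simple arithmetic, sketched here in Python. The function names are hypothetical, and the threshold of 10 is the one stated above; the random-sample and 10%-of-population requirements still have to be checked separately.

```python
def expected_counts(n, p0):
    """Expected successes and failures if H0 (p = p0) is true."""
    successes = n * p0
    failures = n - successes
    return successes, failures

def large_enough(n, p0, minimum=10):
    """True when both expected counts reach the minimum (10 by default)."""
    successes, failures = expected_counts(n, p0)
    return successes >= minimum and failures >= minimum

# Jury pool: n = 100, p_o = 0.26 from the null hypothesis (not p-hat = 0.08)
print(expected_counts(100, 0.26))
print(large_enough(100, 0.26))
```

For the jury pool this gives about 26 expected successes and 74 expected failures, so that requirement is met. A sample of 80 with p_{o} = 0.096 expects only 7.68 successes, so `large_enough(80, 0.096)` is false.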
Comment: Usually, if requirements aren’t met you just have to give up. But for one-population binomial data, where the other two requirements are met but expected successes or failures are much under 10, you can use MATH200A part 3 to compute the p-value directly. There’s an example in “Small Samples”, below.
This is the heart of a hypothesis test. You assume that the null hypothesis is true, and then use what you know about the sampling distribution to ask: How likely is this sample, given that null hypothesis?
Definition: A test statistic is a standardized measure of the discrepancy between your null hypothesis H_{0} and your sample. It is the number of standard errors that the sample lies above or below H_{0}.
You can think of a test statistic as a measure of unbelievability, of disagreement between H_{0} and your sample. A sample hardly ever matches your null hypothesis perfectly, but the closer the test statistic is to zero the better the agreement, and the further the test statistic is from 0 the worse the sample and the null hypothesis disagree with each other.
Because you showed that the sampling distribution is normal and the standard error of the proportion is implicitly known, this is a z test. The test statistic is z = (p̂−p_{o}) / σ_{p̂} where σ_{p̂} = √[p_{o}(1−p_{o})/n], but as you’ll see your calculator computes everything for you.
Definition: The p-value is the probability of getting your sample, or a sample even further from H_{0}, if H_{0} is true. The smaller the p-value, the stronger the evidence against the null hypothesis.
Inferential Statistics: Basic Cases tells you that binomial data in one population are Case 2. This is a hypothesis test of population proportion, and you use 1PropZTest on your calculator. To get to that menu selection, press [STAT] [◄] [5]. Enter p_{o} from the null hypothesis H_{0}, followed by the number of successes x, the sample size n, and the alternative hypothesis H_{1}. Write everything down before you select Calculate. When you get to the output screen, check that your alternative hypothesis H_{1} is shown correctly at the top of the screen, and then write down everything that’s new.
(3/4) 
1PropZTest .26, 8, 100, <p_{o}
outputs: z = −4.10, p-value = 0.000 020, p̂ = 0.08 

By convention, we round the test statistic to two decimal places and the p-value to four decimal places.
When the p-value is less than one in ten thousand, you need more than four decimal places. Some authors just write “p < .0001” when the p-value is that small; they figure nobody cares about fine shades of very low probability. Feel free to use that alternative.
Caution! Watch for powers of 10 (E minus whatever) and never write something daft like “p-value = 2.0346”.
What do these outputs of the 1PropZTest tell you? The sample proportion, p̂ = 0.08, is more than 4 standard errors below the supposed population proportion, p_{o} = 0.26. Your test statistic is z = −4.10. Since 95% of samples have z scores within ±2, this is surprising. How surprising? That’s what the p-value tells you.
How likely is it to get this sample, or one with an even smaller sample proportion, if the null hypothesis H_{0} is true? The p-value is 0.000 020, so if there’s no racial bias in selection then there are only two chances in a hundred thousand of getting eight or fewer African Americans in a 100-man jury pool. (There’s a lot more about interpreting the p-value later in this chapter.)
You don’t actually use the z-score, but I want you to understand something about what a test statistic is. Every case you study will have a different test statistic, and in fact choosing a test statistic is the main difference between cases.
Why does one step have two numbers? In the olden days, when dinosaurs roamed the earth and a slide rule was the hot new thing, you had to compute the SEP and then the z-score; that was step 3. Then you had to look up z in a printed table to find the p-value; that was step 4. The TI83 or TI84 gives you both at the same time, but I’ve kept the numbering of steps.
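Here is what old-style steps 3 and 4 look like as a short Python sketch, assuming the usual standard error of the proportion under H_{0}, √[p_{o}(1−p_{o})/n]. The function name and the `tail` argument are hypothetical; this is only meant to mirror the numbers on the 1PropZTest output screen, not to reproduce the calculator's internals.

```python
import math
from statistics import NormalDist

def one_prop_z_test(p0, x, n, tail):
    """One-proportion z test. tail is '<', '>' or '!=', matching H1."""
    p_hat = x / n
    se = math.sqrt(p0 * (1 - p0) / n)    # standard error, computed from p0
    z = (p_hat - p0) / se                # step 3: test statistic
    left = NormalDist().cdf(z)           # step 4: area under the normal curve
    if tail == "<":
        p_value = left
    elif tail == ">":
        p_value = 1 - left
    else:
        p_value = 2 * min(left, 1 - left)
    return z, p_value, p_hat

z, p_value, p_hat = one_prop_z_test(0.26, 8, 100, "<")
print(f"z = {z:.2f}, p-value = {p_value:.6f}, p-hat = {p_hat:.2f}")
# z = -4.10, p-value = 0.000020, p-hat = 0.08
```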
There are two and only two possibilities, and all you have to do is pick the correct one based on your p-value and your α:
p < α. Reject H_{0} and accept H_{1}.
or
p > α. Fail to reject H_{0}.
Caution! There are lots of p’s in problems involving population proportions (Case 2), so make sure you select the right one. The p-value is the first p on the 1PropZTest output screen.
You can add the numbers, if you like — p < α (0.000 020 < 0.05) — but the symbols are required.
(5)  p < α. Reject H_{0} and accept H_{1}. 

What are you saying here? The p-value was very small, so that means the chance of getting this sample, if there’s no racial bias, was very small. Previously, you set a significance level of 0.05, meaning you would consider this sample too unlikely if its probability was under 5%. Its probability is under 5%, so the sample and the null hypothesis contradict each other. The sample is what it is, so you can’t reject the sample. Therefore you reject H_{0} and accept H_{1}: you declare that there is racial bias.
Another way to look at it: Any sample will vary from the population because random selection is always operating to produce sampling error. But the difference between this sample and the supposed population proportion is just too great to be produced by random selection alone. Something else must be going on also. That something else is the alternative hypothesis H_{1}.
Definition: When the p-value is below α, the sample is too unlikely to come from ordinary sample variability alone, and you have a significant result, or your result is statistically significant.
You always select a significance level before you know the p-value. If you could first get the p-value and then specify a significance level, you could get whichever result you wanted, and there would be no point to doing a hypothesis test at all. Choosing α up front keeps you honest.
Since you accepted H_{1} in the previous step, that’s your conclusion. If you have already written it in English as part of the hypotheses, as I did, then most of your work is already done. You do need to add the significance level or the p-value, so your conclusion will look something like one of these:
(6)  The 8% proportion of African American men in Mr. Swain’s jury pool is significantly below the expected 26%, and this is evidence at the 0.05 level of significance of racial bias in the selection. 

or
(6)  The 8% proportion of African American men in Mr. Swain’s jury pool is significantly below the expected 26%, and this is evidence of racial bias in the selection (p = 0.000 020). 

If you’re publishing your hypothesis test, you’ll want to write a thorough conclusion that still makes sense if it’s read on its own. But in class exercises you don’t have to write so much. It’s enough to write “At the 0.05 significance level, there is racial bias in jury selection” or “There is racial bias in jury selection (p = 0.000 020)”.
The Colorectal Cancer Screening Guidelines (CDC 2014 [see “Sources Used” at end of book]) recommend a colonoscopy every ten years for adults aged 50 to 75. A public-health researcher believes that only a minority are following this recommendation. She interviews a simple random sample of 500 adults aged 50–75 in Metropolis (pop. 6.4 million) and finds that 235 of them have had a colonoscopy in the past ten years. At the 0.05 level of significance, is her belief correct?
Solution: The population is adults aged 50–75 in Metropolis. You want to know whether a minority of them — under 50% — follow the colonoscopy guideline. Each person either does or does not, so you have binomial data, a test of proportion (Case 2 in Inferential Statistics: Basic Cases). Try to write out the hypothesis test yourself before you look at mine below.
Reminder: Even though you already have the sample data in the problem, when you write the hypotheses, ignore the sample. In principle, you write the hypotheses, then plan the study and gather data. If you use any of the sample data in the hypotheses, something is wrong.
You should have written something pretty close to this:
(1) 
H_{0}: p = 0.5, half the seniors of Metropolis follow the guideline
H_{1}: p < 0.5, less than half follow the guideline 

(2)  α = 0.05 
(RC) 

(3/4) 
1PropZTest: p_{o}=.5, x=235, n=500, p<p_{o}
outputs: z=−1.34, pval=0.0899, p̂=0.47 
(5)  p > α. Fail to reject H_{0}. 
(6)  At the 0.05 level of significance, it’s impossible to say whether less than half of Metropolis seniors aged 50–75 follow the CDC guideline for a colonoscopy every ten years or not.
[Or, It’s impossible to say whether less than half of Metropolis seniors aged 50–75 follow the CDC guideline for a colonoscopy every ten years or not (p = 0.0899).] 
Important: When p is greater than α, you fail to reach a conclusion. In this situation, you must use neutral language. You mention both possibilities without giving more weight to either one, and you use words like “impossible to say” or “can’t determine”.
This is unsatisfying, frankly. You go through all the trouble of gathering data and then you end up with a nonconclusion. Can anything be salvaged from this mess?
Yes, you can do a confidence interval. This at least will let you set bounds on what percent of all seniors follow the guidelines. You’ve already tested requirements as part of the hypothesis test, so go right into your calculations and conclusion. You’re free to pick any confidence level you wish, but 95% is most usual.
1PropZInt, 235, 500, .95
outputs: (.42625, .51375)
You’re 95% confident that 42.6% to 51.4% of Metropolis seniors aged 50–75 follow the CDC guideline on screening for colorectal cancer.
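The interval can be checked with a short Python sketch. The function name is hypothetical, and the formula assumed is the usual one for a proportion: p̂ plus or minus critical z times √[p̂(1−p̂)/n]. Note that the interval uses the sample proportion p̂ in the standard error, where the hypothesis test used p_{o}.

```python
import math
from statistics import NormalDist

def one_prop_z_interval(x, n, conf):
    """Confidence interval for a population proportion."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)       # critical z
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)    # margin of error
    return p_hat - margin, p_hat + margin

low, high = one_prop_z_interval(235, 500, 0.95)
print(f"({low:.5f}, {high:.5f})")   # (0.42625, 0.51375)
```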
In a classroom setting, or on regular homework, if you’re assigned a hypothesis test do that and don’t feel obligated to do a confidence interval also. But in real life, and on labs and projects for class, you’ll usually want to do both.
What if your sample is so small that expected successes np_{o} or expected failures n−np_{o} are under 10? You can no longer use 1PropZTest, which assumes that the sampling distribution of the proportion is ND, but you can compute the binomial probability directly as long as the other two requirements are still met (SRS and 10n≤N). Only the calculation of the p-value changes.
Example: In 2001, 9.6% of Fictional County motorists said that fuel efficiency was the most important factor in their choice of a car. For her statistics project, Amber set out to prove that the percentage has increased since then. She interviewed 80 motorists in a systematic sample of those registering vehicles at the DMV, and 13 of them said that fuel efficiency was the most important factor in their choice of a car. Test her hypothesis, at the 0.05 significance level.
Please write out your hypothesis test before you look at mine.
(1)  H_{0}: p = 0.096, percentage has not increased
H_{1}: p > 0.096, percentage has increased 

(2)  α = 0.05 
(RC) 
The sampling distribution of p̂ doesn’t follow the
normal model, so you can’t use 
(3/4) 
MATH200A/Binomial prob:
n=80, p=0.096, x=13 to 80; p-value = 0.0410
(If you don’t have the program, use 1−binomcdf(80,0.096,12) = 0.0410.) [Why 13 to 80? H_{1} contains >, so you test the probability of getting the sample you got, or a larger one, if H_{0} is true. If H_{1} contained <, x would be 0 to 13: the sample you got, or a smaller one. See Surprised? in Chapter 6.] 
(5)  p < α. Reject H_{0} and accept H_{1}. 
(6)  At the 0.05 significance level, the percentage of Fictional County motorists who rate fuel efficiency as most important has increased since 2001.
[Or, The percentage of Fictional County motorists who rate fuel efficiency as most important has increased since 2001 (p = 0.0410).] 
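The direct binomial computation in step 3/4 of this example can be sketched in Python with `math.comb`. The function name `binom_upper_tail` is made up for this illustration; it adds up the probabilities of x = 13 through 80, which is the same quantity as 1−binomcdf(80,0.096,12).

```python
from math import comb

def binom_upper_tail(n, p, x):
    """P(X >= x) when X is binomial with n trials and success probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

# H1 contains ">", so the p-value is the chance of 13 or more successes
p_value = binom_upper_tail(80, 0.096, 13)
print(f"p-value = {p_value:.4f}")
```

If H_{1} contained < instead, you would sum from 0 up to the observed count.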
Hypothesis tests are based on a simple idea, but there are lots of details to think about. This section clarifies some important ideas about the philosophy and practice of a hypothesis test.
Definition: A Type I error is rejecting the null hypothesis when it’s actually true.
Definition: A Type II error is failing to reject the null hypothesis when it’s actually false.
A Type I error usually causes you to do something you shouldn’t; a Type II error usually represents a missed opportunity.
Example 4: Suppose your alternative hypothesis H_{1} is that a new headache remedy PainX helps a greater proportion of people than aspirin.
A Type I error (rejecting H_{0} and accepting H_{1} when H_{0} is actually true) would have you announce that PainX helps more people when in fact it doesn’t. People would then buy PainX instead of aspirin, and their headaches would be less likely to be cured. This is a bad thing.
On the other hand, a Type II error — failing to reject H_{0} when it’s actually false — would mean you announce an inconclusive result. This keeps PainX off the market when it actually would have helped more people than aspirin. This too is a bad thing.
Example 5: You’re on a jury, and you have to decide whether the accused actually committed the murder. What would be Type I and Type II errors?
To answer that you need to identify your null hypothesis H_{0}. Remember that it’s always some form of “nothing going on here.” In this case, H_{0} would be that the defendant didn’t commit the murder, and H_{1} would be that he did.
A Type I error would be condemning an innocent man; a Type II error would be letting a guilty man go free. In our legal system, a defendant is not supposed to be found guilty if there is a reasonable doubt; this would correspond to your α. Probably α = 0.05 is not good enough in a serious case like murder, where a Type I error would mean long jail time or execution, so if you’re on a jury you’d want to be more sure than that.
“Okay then,” you say, “I’ll have to be super careful and not make mistakes.” But remember from Chapter 1: In statistics, errors aren’t necessarily mistakes. Errors are discrepancies between your results and reality, whatever their cause. Type I and Type II errors are not mistakes in procedure.
Even if you do everything right in your hypothesis test, you can’t be certain of your answer, because you can never get away from sample variability.
How often will these errors occur? This is where your significance level comes into play. If you perform a lot of tests at α = 0.05, then in the long run a Type I error will occur in one out of every twenty tests where H_{0} is actually true. It’s too big for these pages, but there’s a cartoon at xkcd.com that illustrates this perfectly. The probability of a Type II error has the symbol β (Greek letter beta) and it has to do with the “power” of the test, its ability to find an effect when there’s an effect to be found. β belongs to a more advanced course, and I don’t do anything with it in this book.
Earlier, I said that your significance level α is the chance of being wrong that you can live with. Now I can be a little more precise. α is not the chance of any error; α is the chance of a Type I error that you can live with. If one Type I error in 20 hypothesis tests is unacceptable, use a lower significance level — but then you make a Type II error more likely. If that’s unacceptable, increase your sample size.
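If you’d like to watch that one-in-twenty figure emerge, here is a minimal simulation sketch in Python (it is not part of the TI-83/84 workflow this book uses). The assumptions are mine, not the book’s: a hypothetical normal population with μ = 100 and σ = 15, samples of 25, and a two-tailed z test with σ known, chosen only because it needs nothing beyond Python’s standard library. H_{0} is true in every simulated test, so every rejection counted here is a Type I error.

```python
import random
from statistics import NormalDist, fmean

def type_i_error_rate(trials=4000, n=25, alpha=0.05, seed=1):
    """Run many z tests on samples drawn while H0 is TRUE; return the share rejected."""
    rng = random.Random(seed)
    mu0, sigma = 100.0, 15.0      # hypothetical population; H0: mu = 100 really is true
    std_normal = NormalDist()
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / (sigma / n ** 0.5)
        p_value = 2 * std_normal.cdf(-abs(z))   # two-tailed p-value
        if p_value < alpha:
            rejections += 1                     # rejecting a true H0 = Type I error
    return rejections / trials

rate = type_i_error_rate()
print(f"Share of true null hypotheses rejected: {rate:.3f}")
```

The printed share should land near 0.05, and changing alpha in the call should move it accordingly.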
Somebody is making a mint off the following chart. It’s in every stats textbook I’ve seen, so you may as well have it too:
Reality  Reject H_{0}, accept H_{1}  Fail to reject H_{0}
If H_{0} is actually true  Type I error  Correct decision
If H_{0} is actually false (and H_{1} is true)  Correct decision  Type II error
How do you know whether your H_{1} should contain “<” or “>” (a onetailed test) or “≠” (a twotailed test)? In class, the problem will usually be clear about whether you’re testing for a “difference” (twotailed) or testing if something is “better”, “larger”, “less than”, etc. (all onetailed). But which one should you use when you’re on your own?
In general, prefer a twotailed test unless you have a specific reason to make a onetailed test.
When a twotailed test reaches a statistically significant result, you interpret in a onetailed manner.
There are two main situations where a onetailed test makes sense: “(a) where there is truly concern for the outcomes in one [direction] only and (b) where it is completely inconceivable that the results could go in the opposite direction.”
—Dubey, quoted by Kuzma and Bohnenblust (2005, 132) [see “Sources Used” at end of book]
With a onetailed test, say for μ<4.5, you’re saying that you consider “equal to 4.5” and “greater than 4.5” the same thing, that if μ isn’t less than 4.5 then you don’t care whether it’s equal or it’s greater. Sometimes you really don’t care, but very often you do. If the problem statement is ambiguous, or if this is real life and you have to do a hypothesis test, how do you decide whether to do a onetailed or twotailed test?
Testing twotailed doesn’t prejudge a situation. Do a twotailed test unless you can honestly say, without looking at the data, that only one direction of difference matters, or only one direction is possible.
Example 6: An existing drug cures people in an average of 4.5 days, and you’re testing a new drug. If you test for μ<4.5, you’re saying that it doesn’t matter whether the new drug takes the same time or takes more time. But that’s wrong: it matters very much. You want to test whether the new drug is different (μ≠4.5). Then if it’s different, you can conclude whether it’s faster or slower.
Another way to look at this whole business: a onetailed test essentially doubles your α — you’re much more likely to reach a conclusion with dicey data. But that means double the risk of being wrong with a Type I error — not a good thing!
Sometimes the same situation can call for a different test, depending on your viewpoint.
Example 7: You’re the county inspector of weights and measures, checking up on a dairy and its half gallons of milk. Legally, half a gallon is 64 fluid ounces. To a government inspector, “Dairylea gives 64.0 ounces in the average half gallon” and “Dairylea gives more than 64.0 ounces in the average half gallon” are the same (legal), and you care only about whether Dairylea gives less (illegal). A onetailed test (<) is correct.
But now shift your perspective. You’re Dairylea management. You don’t want to short the customers because that’s illegal, but you don’t want to give too much because that’s giving away money. You make a twotailed test (≠).
After a twotailed test, if p<α then you can interpret the result as onetailed.
Example 8: You want to test whether your candidate’s approval rating has changed from the previous dismal 40% after a major policy announcement. Your H_{1} is p ≠ 0.4, and 170 out of a random sample of 500 voters approve (p̂ = 34%). Your pvalue is 0.0062, so you reject H_{0} and accept H_{1}. You conclude that the candidate’s approval rating has changed.
But you can go further and say that her approval rating has dropped. You do this by combining the facts that (a) you’ve proved that approval rating is different, which means it must be either less or more than 40%, and (b) the sample p̂ was less than p_{o} (40%).
You can phrase your conclusion something like this, first answering the original question then going beyond it: The candidate’s approval rating has changed from 40% after the speech (p = 0.0062). In fact, it has dropped.
Your justification is the relationship between Confidence Interval and Hypothesis Test (later in this chapter), but you don’t actually have to compute the CI. When p < α in a twotailed test, p_{o} is outside the confidence interval (at the matching confidence level).
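The numbers in Example 8 can be checked with a short Python sketch. It assumes the usual one-proportion z test, with the standard error computed from p_{o} in the null hypothesis; the function name one_prop_z_test is my own, used only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_test(successes, n, p0):
    """Two-tailed one-proportion z test; returns (p_hat, z, p_value)."""
    p_hat = successes / n
    se = sqrt(p0 * (1 - p0) / n)        # standard error under H0
    z = (p_hat - p0) / se
    p_value = 2 * NormalDist().cdf(-abs(z))
    return p_hat, z, p_value

p0, alpha = 0.40, 0.05
p_hat, z, p_value = one_prop_z_test(170, 500, p0)
print(f"p-hat = {p_hat:.2f}, z = {z:.2f}, two-tailed p-value = {p_value:.4f}")

if p_value < alpha:
    # Two-tailed test was significant, so the sample tells you the direction.
    direction = "dropped" if p_hat < p0 else "risen"
    print(f"Approval has changed; in fact, it has {direction}.")
```

Half of the two-tailed p-value is the one-tailed p-value you would have had for H_{1}: p < 0.4.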
Summary: The pvalue tells you how likely it is to get the sample you got (or a more extreme sample) if the null hypothesis is true.
Many people are confused about the pvalue. They try to read too much into it, or they try to simplify it.
Part of the problem is trying to fit the meaning into the traditional structure of a onesentence definition, so let’s try a story instead. In your experiment, you got a certain result, a sample mean or sample proportion. Assume that the null hypothesis is true. If H_{0} is true, the properties of the sampling distribution tell you how likely it is to get this sample result, or one even further away from H_{0}. That likelihood is called the pvalue.
The one-tailed p-value is exactly the probability that you computed with normalcdf in Chapter 8. When that’s less than 0.5, the two-tailed p-value is exactly double the one-tailed p-value.
If the pvalue is small, your results are in conflict with H_{0}, so you reject the null and accept the alternative. If the pvalue is larger, your sample is not in conflict with H_{0} and you fail to reject the null, which is statstalk for failing to reach any kind of conclusion.
In a nice phrase, Sterne and Smith [see “Sources Used” at end of book] say that pvalues “measure the strength of the evidence against the null hypothesis; the smaller the pvalue, the stronger the evidence against the null hypothesis.” They also quote R. A. Fisher on interpreting a pvalue: “If P is between 0.1 and 0.9 there is certainly no reason to suspect the hypothesis tested. If it is below 0.02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at 0.05.”
The message here is that pvalues fall on a continuum; you can’t just arbitrarily divide them into “significant” and “not significant” once and for all.
The pvalue is the likelihood, if H_{0} is actually true, that random chance could give you the results you got, or results even further from H_{0}. It is a conditional probability:
pvalue = P(this sample given that H_{0} is true)
Yes, that seems convoluted — because it is. Alas, there just isn’t any description of a pvalue that is both correct and simple.
The pvalue is not the probability that either hypothesis is true or false:
The pvalue is not any of the above because they are all plain probabilities. Once again, the pvalue is just a measure of how likely your results would be if H_{0} is true and random chance is the only factor in selecting the sample.
The pvalue tells you how unlikely this sample (or a more extreme one) is if the null hypothesis is true. The more unlikely (surprising, unexpected), the lower the pvalue, and the more confident you can feel about rejecting H_{0}.
There’s one other thing: the pvalue is not a measure of the size or importance of an effect. That gets into statistical significance versus practical significance.
If your pvalue is less than your significance level α, your result is statistically significant. That low pvalue, Wheelan (2013, 11) [see “Sources Used” at end of book] writes, means that your result is “not likely to be the product of chance alone”. That’s all that statistical significance means. But even if a result is statistically significant, it may not be practically significant.
Example 9: Suppose that your pvalue for “PainX is more likely to help a person than aspirin” is 0.000 002. You’re pretty darn sure that PainX is better. But to determine whether the result is practically significant, you have to ask not just whether PainX is better, but by how much.
One way to evaluate practical significance is to compute a confidence interval about the effect size. In this case, the 95% confidence interval is that a person is between 1.14 and 2.86 percentage points more likely to be helped by PainX than aspirin. Oh yes, and aspirin costs a buck for 100 tablets, where PainX costs $29.50 for ten. Most people would say this result has no practical significance. They’re not going to plunk down $30 for a few pills that are only about 2 percentage points more likely to help them than aspirin.
How can you get such a low pvalue when the size of the effect is small? The answer is in extremely large sample sizes. In this madeup case, PainX helped 15,500 people in a sample of 25,000, and aspirin helped 15,000 in a sample of 25,000. When you have really large samples, be especially alert to the issue of statistical versus practical significance.
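Here is a rough Python check of Example 9. It is a sketch that assumes the standard two-proportion z procedures (a pooled standard error for the test, an unpooled one for the interval), which this chapter has not introduced; two_prop_summary is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_summary(x1, n1, x2, n2, conf=0.95):
    """One-tailed test of p1 > p2 plus a CI for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    pooled = (x1 + x2) / (n1 + n2)
    se_test = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))  # under H0: p1 = p2
    z = diff / se_test
    p_value = NormalDist().cdf(-z)                             # H1: p1 > p2
    se_ci = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)      # no H0 assumed
    z_star = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return diff, p_value, (diff - z_star * se_ci, diff + z_star * se_ci)

# PainX helped 15,500 of 25,000; aspirin helped 15,000 of 25,000
diff, p_value, (low, high) = two_prop_summary(15_500, 25_000, 15_000, 25_000)
print(f"difference = {diff:.3f}, one-tailed p-value = {p_value:.7f}")
print(f"95% CI for the difference: {low:.4f} to {high:.4f}")
```

The tiny p-value and the narrow interval around two percentage points both come from the same large samples: statistically significant, but a small effect.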
Avoid common errors when stating conclusions and interpreting them. Make sure you understand what you are doing, and explain it to others in their own language.
If your pvalue is less than your significance level, you have shown that your sample results were unlikely to arise by chance if H_{0} is true. The data are statistically significant. You therefore reject H_{0} and accept H_{1}.
Details: Assuming that H_{0} is true, the sample you got is surprising (unexpected, unusual). The data are inconsistent with the null hypothesis — they can’t both be true. The data are what they are, and if the sample was properly taken you have to believe in it. Therefore, H_{0} is most likely false. If H_{0} is false, its opposite H_{1} is true.
You accept H_{1}, but you haven’t proved it to a certainty. There’s always that pvalue chance that the sample results could have occurred when H_{0} is true. That’s why you say you “accept” H_{1}, not that you have “proved” H_{1}.
Compare to a jury verdict of “guilty”. It means the jury is convinced that the probability (p) that the defendant is innocent is less than a reasonable doubt (significance level, α). It doesn’t mean there is no chance he’s innocent, just that there is very little chance.
Example 10: Suppose your null H_{0} is “the average package contains the stated net weight,” your alternative is “the average package contains less than the stated net weight,” and your significance level is 0.05.
If p = 0.0241, which is < α, you reject H_{0} and accept H_{1}. You conclude “the average package does contain less than the stated net weight (p = 0.0241)” or “the average package does contain less than the stated net weight, at the 0.05 significance level.”
Don’t say the average package “might” be less than the stated weight or “appears to be” less than the stated weight. When you reject H_{0}, state the alternative as a fact within the stated significance level, or preferably with the pvalue. (Again, compare to a jury verdict. The jury doesn’t say the defendant “might be guilty”.)
See also: Take published conclusions with a grain of salt. Even professional researchers can misuse hypothesis tests. “Data mining” (first gathering data, then looking for relationships) is one problem, but not the only one. See Why Most Published Research Findings Are False (Ioannidis 2005 [see “Sources Used” at end of book]). If you find the article heavy going, just scroll down to read the example in Box 1 and then the corollaries that follow.
If your pvalue is greater than your significance level, you have shown that random chance could account for your results if H_{0} is true. You don’t know that random chance is the explanation, just that it’s a possible explanation. The data are not statistically significant.
You therefore fail to reject H_{0} (and don’t mention H_{1} in step 5). The sample you have could have come about by random selection if H_{0} is true, but it could also have come about by random selection if H_{0} is false. In other words, you don’t know whether H_{0} is actually true, or it’s false but the sample data just happened to fall not too far from H_{0}.
Compare to a jury verdict of “not guilty”. That could mean the defendant is actually innocent, or that the defendant is actually guilty but the prosecutor didn’t make a strong enough case.
Example 11: Suppose your null hypothesis is “the average package contains the stated net weight,” your alternative is “the average package contains less than the stated net weight,” and your significance level α is 0.05.
If you compute a pvalue of 0.0788, which is > α, you fail to reject H_{0} in step 5, but how do you state your conclusion in step 6?
There are two kinds of answer, depending on who you talk to. Some people say “there’s insufficient evidence to prove that the average package is underweight”; others say “we can’t tell whether the average package is underweight or not.” Of course there are many ways to write a conclusion in English, but ultimately they boil down to “we can’t prove H_{1}” (or the equivalent “we can’t disprove H_{0}”) versus “we can’t reach a conclusion either way.”
Does it matter? Yes, I think it does.
Please understand: It’s not that the people writing the conclusions are confused (well, usually not). The problem is confusion among people reading the conclusions.
Advice: It’s the same advice I’ve given before: Tailor your presentation to your audience. If you’re presenting to technical people, the onesided forms are okay, and you could answer Example 11 with something like “there’s insufficient evidence, at the 0.05 significance level, to show that the average package is under weight” or “… to reject the hypothesis that the average package contains the stated net weight.” (Since the pvalue gives more information, you could give that instead of the significance level.)
But if your audience is nontechnical people, don’t expect them to understand a twosided truth from a onesided conclusion. Instead, use neutral language, such as “We can’t determine from the data whether the average package is underweight or not (p = 0.0788).” (You could state the significance level instead of the pvalue.)
If your p-value is very large, say bigger than 0.5, there’s a good chance you’ve made a mistake. Check carefully whether you should be testing <, ≠, or >. Also check whether you’re testing against the wrong number. For instance, suppose your H_{1} is that a coin comes up heads less than a third of the time. A few dozen flips of an ordinary coin will probably yield a p-value very close to 1. This is the statistical equivalent of “Well, duh!”
Sometimes large pvalues are correct, but those situations are rare enough that you should be suspicious.
Not as a matter of strict logic, no. But there are circumstances where the data do suggest that the null hypothesis is true. The most important of these is when multiple experiments fail to reject H_{0}. Here’s why.
Suppose you do an experiment at the 0.05 significance level, and your pvalue is greater than that. Maybe H_{0} is really true; maybe it’s false but this particular sample happened to be close to H_{0}. You can’t tell — you’ve failed to disprove H_{0} but that doesn’t mean it’s necessarily true.
But suppose other experimenters also get pvalues > 0.05. They can’t all be unlucky in their samples, can they?
If you keep giving the universe opportunities to send you data that contradict the null hypothesis, but you keep getting data that are consistent with the null, then you begin to think that the null hypothesis shouldn’t be rejected, that it’s actually true.
This is why scientists always replicate experiments. If the first experiment fails to reject H_{0}, they don’t know whether H_{0} is true or they were just unlucky in their sample. But if several experiments fail to reject the null — always assuming the experiments are properly conducted — then they begin to have confidence in the theory.
What if an experiment does reject H_{0}? Is that it, game over? Not necessarily. Remember that even a true H_{0} will get rejected one time in 20 when tested at the 0.05 level. Once again, the answer is replication. If they get more “reject H_{0}”, scientists know that the first one wasn’t just a statistical fluke. But if they get a string of “fail to reject H_{0}”, then it’s likely that the first one was just that one in 20, and H_{0} is actually true.
Summary: Just as you used a TInterval in Chapter 9 to make a confidence interval about μ for numeric data, you use a TTest to perform the hypothesis test.
Typically you don’t know σ, the standard deviation (SD) of the population, and therefore you don’t know the standard error σ/√n either. So you estimate the standard error as s/√n, using the known SD of the sample. That means that the test statistic is:
t = (x̅−μ_{o}) / (s/√n)
The t statistic is the estimated number of standard errors between your sample mean and the hypothetical population mean.
You met the t distribution when you computed confidence intervals in Chapter 9. Compared to z, the t distribution is a little flatter and more spread out, especially for small samples, so pvalues tend to be larger.
Let’s jump in and do a t test. The numbered steps are almost the same as they were in the examples with binomial data — you just have the necessary variations for working with numeric data. Because I’ll be adding some commentary, I’ve put boxes around what I expect to see from you for a problem like this. (Refer to Seven Steps of Hypothesis Tests if you don’t know the steps very well yet.)
It hardly ever happens, but if you do know the SD of the population you can do a z test instead of a t test. Since the z distribution is a bit less spread out than the t distribution, for very small samples the pvalues are typically a bit lower with a z test than with a t. But the difference is rarely enough to change the result — and again, you are quite unlikely to know the SD of the population, so a z test is quite unlikely to be the right one.
The management claims that the average cash deposit is $200.00, and you’ve taken a random sample to test that:
192.68 188.24 152.37 211.73 201.57
167.79 177.19 191.15 209.22 178.49
185.90 226.31 192.38 190.23 156.13
224.07 191.78 203.45 186.40 160.83
At the 0.05 significance level, does this sample show that the average of all cash deposits is different from $200?
Solution: The data type is numeric, and the population SD σ is unknown, so this is a test of a population mean, Case 1 from Inferential Statistics: Basic Cases. Your hypotheses are:
(1) 
H_{0}: μ = 200, management’s claim is correct
H_{1}: μ ≠ 200, management’s claim is wrong 

Comment: Even though you already have the sample data in the problem, when you write the hypotheses, ignore the sample. In principle, you write the hypotheses, then plan the study and gather data. If you use any of the sample data in the hypotheses, something is wrong.
So you don’t use numbers from the sample in your hypotheses, and you don’t use the sample to help you decide whether the alternative hypothesis H_{1} should have <, ≠, or >.
The significance level was given in the problem. (Problems will usually give you an α to use.)
(2)  α = 0.05 

Next is the requirements check. Even though it doesn’t have a number, it’s always necessary. In this case, n = 20, which is less than 30, so you have to test for normality and verify that there are no outliers.
Enter your data in any statistics list (I used L5), and check your data entry carefully. Use the MATH200A program “Normality chk” to check for a normal distribution and “Boxwhisker” to verify that there are no outliers.
You don’t need to draw the plots, but do write down r and crit and show the comparison, and do check for outliers. (For what to do if you have outliers, see Chapter 3.)
(RC) 


Now it’s time to compute the test statistic (t) and the pvalue.
On the TTest screen, you have to choose Data or Stats just as you did on the TInterval screen. You have the actual data, so you select Data on the TTest screen, instead of Stats. Then the sample mean, sample SD, and sample size are shown on the output screen, so you write them down as part of your results. Always write down x̅, s, and n.
(3/4) 
TTest: μ_{o}=200, List=L5, Freq=1, μ≠μ_{o}
results: t=−2.33, p=0.0311, x̅=189.40, s=20.37, n=20 

The decision rule is the same for every single hypothesis test, regardless of data type. In this case:
(5)  p < α. Reject H_{0} and accept H_{1}. 

And as usual, you can write your conclusion with the significance level or the pvalue:
(6)  At the 0.05 level of significance, management is incorrect and the average of all cash deposits is different from $200.00. In fact, the true average is lower than $200.00. 

Or,
(6)  Management is incorrect, and the average of all cash deposits is different from $200.00 (p = 0.0311). In fact, the true average is lower than $200.00. 

Remember what happens when you do a twotailed test (≠ in H_{1}) and p turns out less than α: After you write your “different from” conclusion, you can go on to interpret the direction of the difference. See p < α in TwoTailed Test.
In a classroom exercise, if you were asked to do a hypothesis test you would do a hypothesis test and only a hypothesis test. But in real life, and in the big labs for class, it makes sense to answer the obvious question: If the true mean is less than $200.00, what is it?
You don’t have to check requirements for the CI, because you already checked them for the HT.
With 95% confidence, the average of all cash deposits is between $179.86 and $198.93.
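If you want to check the calculator’s numbers for the cash deposits without a TI-83/84, the sketch below redoes the test and the interval in Python. Python’s standard library has no t distribution, so t_pdf, t_cdf, and t_critical are my own illustrative stand-ins (Simpson’s rule for the area, bisection for the critical value). They are assumptions of this sketch, not anything taken from the calculator or from MATH200A.

```python
import math
import statistics

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x): Simpson's rule on [0, |x|], then symmetry about 0 (steps is even)."""
    a = abs(x)
    if a == 0:
        return 0.5
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area

def t_critical(conf, df):
    """Critical t* with P(-t* < T < t*) = conf, found by bisection."""
    target = 1 - (1 - conf) / 2
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

deposits = [192.68, 188.24, 152.37, 211.73, 201.57,
            167.79, 177.19, 191.15, 209.22, 178.49,
            185.90, 226.31, 192.38, 190.23, 156.13,
            224.07, 191.78, 203.45, 186.40, 160.83]
mu0 = 200.0                                  # management's claim, from H0

n = len(deposits)
x_bar = statistics.fmean(deposits)
s = statistics.stdev(deposits)               # sample SD (divides by n - 1)
se = s / math.sqrt(n)                        # estimated standard error
t = (x_bar - mu0) / se
p_value = 2 * t_cdf(-abs(t), n - 1)          # two-tailed, df = n - 1
t_star = t_critical(0.95, n - 1)
ci = (x_bar - t_star * se, x_bar + t_star * se)

print(f"x-bar = {x_bar:.2f}, s = {s:.2f}, n = {n}")
print(f"t = {t:.2f}, p-value = {p_value:.4f}")
print(f"95% CI: {ci[0]:.2f} to {ci[1]:.2f}")
```

Notice that the hypothetical mean of 200 lies outside the interval, which agrees with rejecting H_{0} at the 0.05 level.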
Here’s an example where you have statistics without the raw data. It’s adapted from Sullivan (2011, 483) [see “Sources Used” at end of book].
According to the Centers for Disease Control, the mean number of cigarettes smoked per day by individuals who are daily smokers is 18.1. Do retired adults who are daily smokers smoke less than the general population of daily smokers?
To answer this question, Sascha obtains a random sample of 40 retired adults who are current daily smokers and records the number of cigarettes smoked on a randomly selected day. The data result in a sample mean of 16.8 cigarettes and a SD of 4.7 cigarettes.
Is there sufficient evidence at the α = 0.01 level of significance to conclude that retired adults who are daily smokers smoke less than the general population of daily smokers?
Solution: Start with the hypotheses. You’re comparing the unknown mean μ for retired smokers to the fixed number 18.1, the known mean for smokers in general. Since the data type is numeric (number of cigarettes smoked), and there’s one population, and you don’t know the SD of the population, this is Case 1, test of population mean, from Inferential Statistics: Basic Cases.
(1) 
H_{0}: μ = 18.1, retired smokers smoke the same amount as smokers in general
H_{1}: μ < 18.1, retired smokers smoke less than smokers in general
Comment: The claim is a population mean of 18.1, so you use 18.1 in your hypotheses. Using the sample mean of 16.8 in Step 1 is a rookie mistake, one of the Top 10 Mistakes of Hypothesis Tests. Never use sample data in your hypotheses.
Comment: Why does H_{1} have < instead of ≠? The short answer is: that’s what the problem says to do. In the real world, you would do a two-tailed test (≠) unless there’s a specific reason to do a one-tailed test (< or >); see One-Tailed or Two-Tailed? (earlier in this document). Presumably there’s some reason why they are interested only in the case “retired smokers smoke less” and not in the case “retired smokers smoke more”.

(2)  α = 0.01 
(RC) 
Therefore the sampling distribution is normal. 
(3/4) 
TTest: μ_{o}=18.1, x̅=16.8, s=4.7, n=40, μ<μ_{o} outputs: t=−1.75, p=0.0440 
(5)  p > α. Fail to reject H_{0}. 
(6) 
At the 0.01 level of significance, we can’t determine whether the average number of cigarettes smoked per day by retired adults who are current smokers is less than the average for all daily smokers or not.
Or,
We can’t tell whether the average number of cigarettes smoked per day by retired adults who are current smokers is less than the average for all daily smokers or not (p = 0.0440).
When you fail to reject H_{0}, you cannot reach any conclusion. You must use neutral language in your nonconclusions. Please review When p > α, you fail to reject H_{0} earlier in this chapter.
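When you have only summary statistics, as in the smokers problem, the same t formula applies directly. The Python sketch below is one way to check the result. The helper names t_pdf, t_cdf, and t_test_from_stats are hypothetical, and the area under the t curve is approximated with Simpson’s rule because Python’s standard library has no t distribution.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x): Simpson's rule on [0, |x|], then symmetry about 0 (steps is even)."""
    a = abs(x)
    if a == 0:
        return 0.5
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area

def t_test_from_stats(x_bar, s, n, mu0, tail):
    """One-sample t test from summary statistics; tail is '<', '>' or '!='."""
    t = (x_bar - mu0) / (s / math.sqrt(n))
    df = n - 1
    if tail == "<":
        p = t_cdf(t, df)
    elif tail == ">":
        p = 1 - t_cdf(t, df)
    else:
        p = 2 * t_cdf(-abs(t), df)
    return t, p

alpha = 0.01
t, p_value = t_test_from_stats(x_bar=16.8, s=4.7, n=40, mu0=18.1, tail="<")
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t:.2f}, p-value = {p_value:.4f}: {decision}")
```

With α = 0.01 the decision is “fail to reject”; at α = 0.05 the same p-value would have led to rejection, which is one reason to report the p-value itself.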
A 95% CI is the flip side of a 0.05 twotailed HT. More generally, a 1−α CI is the complement of an α twotailed HT.
Example 14: The baseline rate for heart attacks in diabetes patients is 20.2% in seven years. You have a new diabetes drug, Effluvium, that is effective in treating diabetes. Clinical trials on 89 patients found that 27 (30.3%) had heart attacks. The 95% confidence interval is 20.8% to 39.9% likelihood of heart attack within seven years for diabetes patients taking Effluvium. What does this tell you about the safety of Effluvium?
Solution: Okay, you’re 95% confident that Effluvium takers have a 20.8% to 39.9% chance of a heart attack within seven years. If you’re 95% confident that their chance of heart attack is inside that interval, then there’s only a 5% or 0.05 probability that their chance of heart attack is outside the interval, namely <20.8% or >39.9%.
But 20.2% is outside the interval, so there’s less than a 0.05 chance that the true probability of heart attack with Effluvium is 20.2%.
CI and HT calculations both rely on the sampling distribution. The open curve centered on 20.2% shows the sampling distribution for a hypothetical population proportion of 20.2%. Only a very small part of it extends beyond 30.3%, the proportion of heart attacks you actually found in your sample.
The chance of getting your sample, given a hypothetical proportion p_{o} in the population, is the pvalue. If p_{o} = 20.2%, your sample with p̂ = 30.3% would be unlikely (pvalue below 0.05). You would reject the null hypothesis and conclude that Effluvium takers have a different likelihood of heart attack from other diabetes patients, at the 0.05 significance level. Further, the entire confidence interval is above the baseline value, so you know that Effluvium increases the likelihood of heart attack in diabetes patients.
At significance level 0.05, a twotailed test against any value outside the 95% confidence interval (the shaded curve) would lead to rejecting the null hypothesis. And you can say the same thing for any other significance level α and confidence level 1−α.
What if the interval does include the baseline or hypothetical value? Then you fail to reject the null hypothesis.
Example 15: A machine is supposed to be turning out something with a mean value of 100.00 and SD of 6.00, and you take a random sample of 36 objects produced by the machine. If your sample mean is 98.4 and SD is 5.9, your 95% confidence interval is 96.4 to 100.4.
Now, can you make any conclusion about whether the machine is working properly?
Solution: Well, you’re 95% confident that the machine’s true mean output is somewhere between 96.4 and 100.4. With this sample, you can rule out a true population mean of <96.4 or >100.4, at the 0.05 significance level; but you can’t rule out a true population mean between 96.4 and 100.4 at α = 0.05. A hypothesis test would fail to reject the hypothesis that μ = 100. You can’t determine whether the true mean output of the machine is equal to 100 or not. 
Leaving the symbols aside, when you test a null hypothesis your sample either is surprising (and you reject the null hypothesis) or is not surprising (and you fail to reject the null). Any null hypothesis value inside the confidence interval is close enough to your sample that it would not get rejected, and any null hypothesis value outside the interval is far enough from the sample that it would get rejected.
For numeric data, the CI and HT are exactly equivalent.
But for binomial data, the CI and HT are only approximately equivalent. Why? Because with binomial data, the HT uses a standard error derived from p_{o} in the null hypothesis, but the CI uses a standard error derived from p̂, the sample proportion. Since the standard errors are slightly different, right around the borderline they might get different answers. But when the hypothetical p_{o} is a fair distance outside the CI, as it was in the drug example, the pvalue will definitely be less than α.
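To see the two standard errors side by side, here is a small Python sketch using the Effluvium numbers from Example 14. It assumes the usual normal-approximation formulas for a one-proportion interval and test; the variable names are mine.

```python
from math import sqrt
from statistics import NormalDist

x, n = 27, 89          # heart attacks among Effluvium patients in the trial
p0 = 0.202             # baseline rate, the value in H0
p_hat = x / n
std_normal = NormalDist()

# Confidence interval: standard error built from the sample proportion p-hat
se_ci = sqrt(p_hat * (1 - p_hat) / n)
z_star = std_normal.inv_cdf(0.975)             # 95% confidence
ci = (p_hat - z_star * se_ci, p_hat + z_star * se_ci)

# Hypothesis test: standard error built from the hypothetical p0
se_ht = sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se_ht
p_value = 2 * std_normal.cdf(-abs(z))          # two-tailed

print(f"SE for CI = {se_ci:.4f}, SE for HT = {se_ht:.4f}")
print(f"95% CI: {ci[0]:.1%} to {ci[1]:.1%}")
print(f"z = {z:.2f}, two-tailed p-value = {p_value:.4f}")
```

The two standard errors differ, yet here both approaches agree: 20.2% is below the whole interval and the p-value is below 0.05.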
Good question!
A confidence interval is symmetric (for the cases you study in this course), so it’s intrinsically twotailed. A onetailed HT for < or > at α = 0.01 corresponds to a twotailed HT for ≠ at α = 0.02, so the CI for a onetailed HT at α = 0.01 is a 98% CI, not a 99% CI. The confidence level for a onetailed α is 1−2α, not 1−α.
Correspondence between Significance Level and Confidence Level
α  tails  CLevel
0.05  1  1−2×.05 = 90%
0.05  2  1−.05 = 95%
0.01  1  1−2×.01 = 98%
0.01  2  1−.01 = 99%
0.001  1  1−2×.001 = 99.8%
0.001  2  1−.001 = 99.9%
If the baseline value is outside the confidence interval, you can say (at the appropriate significance level) that the true value of μ or p is different from the baseline, and then go on to say whether it’s bigger or smaller, so you get your onetailed result.
On the other hand, if the baseline value is inside the confidence interval, you can’t say whether the true μ or p is equal to the baseline or different from it, and if you can’t say whether they’re different then you can’t say which one is bigger than the other.
Though most hypothesis tests are to find out something about a population, sometimes you just want to know whether this sample is significantly different from a population. In this case, you don’t need a random sample, but the other requirements must still be met.
Example 16: At Wossamatta University, instructors teach the statistics course independently but all sections take the same final exam. (There are several hundred students.) One semester, the mean score on the exam is 74. In one section of 30 students, the mean was 68.2 and the SD was 10.4. The students felt that they had not been adequately prepared for the exam by the instructor. Can they make their case?
Solution: In effect, they are saying that their section performance was significantly below the performance of students in the course overall. This is a testable hypothesis. But the hypothesis is not about the population that these 30 students were drawn from; we already know about that population. Instead, it is a test whether this sample, as a sample, is different from the population.
(1)  H_{0}: This section’s mean was no different from the course mean.
H_{1}: This section’s mean was significantly below the course mean.

(2)  α = 0.05 
(RC) 

(3/4)  TTest: μ_{o} = 74, x̅ = 68.2, s = 10.4, n = 30, μ < μ_{o}
Outputs: t = −3.05, p-value = 0.0024
(5)  p < α. Reject H_{0} and accept H_{1}. 
(6)  This section’s average exam score was less than the overall course average (pvalue = 0.0024). 
Okay, there was a real difference. This section’s mean exam score was not only below the average for the whole course, but too far below for random chance to be enough of an explanation.
But did the students prove their case? Their case was not just that their average score was lower, but that the difference was the result of poor teaching. Statistics can’t answer that question so easily. Maybe it was poor teaching; maybe these were weaker students; maybe it was environmental factors like classroom temperature or the time of day; maybe it was all of the above.
Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.
Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.
Why must you select a significance level before computing a p-value?
Explain the p-value in your own words.
You’ve tested the hypothesis that the new accelerant makes a difference to the time to dry paint, using α = 0.05. What is wrong with each conclusion, based on the p-value? Write a correct conclusion for that p-value.
(a) p = 0.0214. You conclude, “The accelerant may make a difference, at the 0.05 significance level.”
(b) p = 0.0714. You conclude, “The accelerant makes no difference, at the 0.05 significance level.”
You are testing whether the new accelerant makes your paint dry faster.
(You have already eliminated the possibility that it makes your paint dry
slower.)
(a) What conclusion would be a Type I error? What wrong
action would a Type I error lead you to take?
(b) What conclusion would be a Type II error? What wrong
action would a Type II error lead you to take?
Are Type I and Type II errors actually mistakes? What one thing can you do to prevent both of them, or at least make them both less likely?
What can you do to make a Type I error less likely at a given sample size? What’s the unfortunate side effect of that?
Explain in your own words the difference between “accept H_{0}” (wrong) and “fail to reject H_{0}” (correct) when your p-value is > α.
The engineering department claims that the average battery lifetime is 500 minutes. Write both hypotheses in symbols.
Suppose H_{0} is “the directors are honest” and H_{1} is “the directors are stealing from the company.” Write conclusions, in Statistics and in English, if …
(a) p = 0.0405 and α = 0.01
(b) p = 0.0045 and α = 0.01
In your hypothesis test, H_{0} is “the defendant is innocent” and H_{1} is “the defendant is guilty”. The crime carries the death penalty. Out of 0.05, 0.01, and 0.001, which is the most appropriate significance level, and why?
When Keith read the AAA’s statement that 10% of drivers on Friday and Saturday nights are impaired, he believed the proportion was actually higher for TC3 students. He took a systematic sample of 120 students and, on an anonymous questionnaire, 18 of them admitted being alcohol impaired the last Friday or Saturday night that they drove. Can he prove his point, at the 0.05 significance level?
In 2006–2008 there was controversy about creating a sewer district in south Lansing, where residents have had their own septic tanks for years. The Sewer Committee sent out an opinion poll to every household in the proposed sewer district. In a letter to the editor, published 3 Feb 2007 in the Ithaca Journal, John Schabowski wrote, in part:
The Jan. 4 Journal article about the sewer reported that “only” 380 of 1366 households receiving the survey responded, with 232 against it, 119 supporting it, and 29 neutral. ... The survey results are statistically valid and accurate for predicting that the sewer project would be voted down by a large margin in an actual referendum.
Can you do a hypothesis test to show that more than half of Lansing households in the proposed district were against the sewer project? (You’re trying to show a majority against, so combine “supporting” and “neutral” since those are not against.)
Doubting Thomas remembered the Monty Hall example from Chapter 5, but he didn’t believe the conclusion that switching doors would improve the chance of winning to 2/3. (It’s okay if you don’t remember the example. All the facts you need are right here.)
Thomas watched every Let’s Make a Deal for four weeks. (Though this isn’t a random sample, treat it as one. There’s no reason why the show should operate differently in these four weeks from any others.) In that time, 30 contestants switched doors, and 18 of them won.
(a) At the 0.05 significance level, is it true or false that your
chance of winning is 2/3 if you switch doors?
(b) At the 95% confidence level, estimate your chance of winning
if you switch doors.
(c) If you don’t switch doors, your chance of winning is
1/3. Using your answer to (b), is switching doors definitely a good
strategy, or is there some doubt?
Rosario read in Chapter 6 that 30.4% of US households own cats. She felt like dogs were a lot more visible than cats in Ithaca, so she decided to test whether the true proportion of cat ownership in Ithaca was less than the national proportion. She took a systematic sample of Wegmans shoppers one day, and during the same time period a friend took a systematic sample of Tops shoppers. (They counted groups shopping together, not individual shoppers, so they didn’t have to worry about getting the same household twice.)
Together, they accumulated a sample of 215 households, and of those 54 owned cats. Did she prove her case, at the 0.05 significance level?
(a) H_{0} = 14.2; H_{1} > 14.2
(b) H_{0}: μ < 25; H_{1}: μ > 25
(c) You’re testing whether batteries have a mean life of greater than 750 hours. You take a sample, and your sample mean is 762 hours. You write H_{0}:μ=762 hr; H_{1}:μ>762 hr.
(d) Your conventional paint takes 4.3 hours to dry, on average. You’ve developed a drying accelerant and you want to test whether adding it makes a difference to drying time. You write H_{0}: μ=4.3 hr; H_{1}: μ < 4.3 hr.
This year, water pollution readings at State Park Beach seem to be lower than last year. A sample of 10 readings was randomly selected from this year’s daily readings:
3.5 3.9 2.8 3.1 3.1 3.4 3.2 2.5 3.5 3.1
Does this sample provide sufficient evidence, at the 0.01 level, to conclude that the mean of this year’s pollution readings is significantly lower than last year’s mean of 3.8?
Dairylea Dairy sells quarts of milk, which by law must contain an average of at least 32 fl. oz. You obtain a random sample of ten quarts and find an average of 31.8 fl. oz. per quart, with SD 0.60 fl. oz. Assuming that the amount delivered in quart containers is normally distributed, does Dairylea have a legal problem? Choose an appropriate significance level and explain your choice.
You’re in the research department of StickyCo, and you’re developing a new glue. You want to compare your new glue against StickyCo’s best seller, which has a bond strength of 870 lb/in².
You take 30 samples of your new glue, at random, and you find an average strength of 892.2 lb/in², with SD 56.0. At the 0.05 significance level, is there a difference in your new glue’s strength?
New York Quick Facts from the Census Bureau (2014b) [see “Sources Used” at end of book] says that 32.8% of residents of New York State aged 25 or older had at least a bachelor’s degree in 2008–2012. Let’s assume the figure hasn’t changed today.
You conduct a random sample of 120 residents
of Tompkins County aged 25+, and you find that 52 of them have at least a
bachelor’s degree.
(a) Construct a 95% confidence interval for the proportion of
Tompkins County residents aged 25+ with at least a bachelor’s degree.
(b) Don’t do a full hypothesis test, but use your answer for
(a) to determine whether the proportion of bachelor’s degrees in
Tompkins County is different from the statewide proportion, at the 0.05
significance level.
You’re thinking of buying new Whizzo bungee cords, if the new ones are stronger than your current Stretchie ones. You test a random sample of Whizzo and find these breaking strengths, in pounds:
679 599 678 715 728 678 699 624
At the 0.01 level of significance, is Whizzo stronger on average than Stretchie? (Stretchies have mean strength of 625 pounds.)
For her statistics project, Jennifer wanted to prove that TC3 students average more than six hours a week in volunteer work. She gathered a systematic sample of 100 students and found a mean of 6.75 hours and SD of 3.30 hours. Can she make her case, at the 0.05 significance level?
As a POW in World War II, John Kerrich flipped a coin 10,000 times and got 5067 heads. At the 0.05 level of significance, was the coin fair?
People who take aspirin for headache get relief in an average of 20
minutes (let’s suppose).
Your company is testing a new headache remedy, PainX, and in a
random sample of 45 headache sufferers you find a mean time to relief
of 18 minutes with SD of 8 minutes.
(a) Construct a 95% confidence interval for the mean time to relief
of PainX.
(b) Don’t do a full hypothesis test, but use your answer for
(a) to determine at the 0.05 significance level whether PainX offers
headache relief to the average person in a different time than aspirin.
Updated 1 Jan 2016
Intro: In Chapter 10, you looked at hypothesis tests for one population, where you asked whether a population mean or proportion is different from a baseline number. In this chapter, you’ll ask “are these two populations different from each other?” (hypothesis test) and “how large is the difference?” (confidence interval).
Are your data paired or unpaired? That’s the key question when you’re doing inference on numeric data from two samples. Your answer will control how you analyze the data, so let’s look closely at the difference.
Definitions: You have unpaired data when you get one number from each individual in two unrelated groups. The two groups are known as independent samples.
Independent samples result when you take two samples completely independently, or if you take one sample and then randomly assign the members to groups. Randomization always gives you independent samples.
Example 1: What if any is the average difference in time husbands and wives spend on yard work? You randomly select 40 married men and 40 married women and find how much time a week each spends in yard work. There’s no reason to associate Man A with Woman B any more than Woman C; these are independent samples and the data are unpaired.
Example 2: How much “winter weight” does the average adult gain? You randomly select 500 adults and weigh them all during the first week of November. Then during the last week of February you randomly select another 500 adults and weigh them. The data are unpaired, and the samples are independent.
Before you read further, what’s the big problem in the design of those two studies?
Right! Our old enemy, confounding variables. Look at the examples again, and see how many you can identify. For example, what might make a random person in one sample weigh more or less than a random person in the other sample, other than the passage of time? What might make a random woman spend more or less time on yard work than a random man, apart from their genders?
With independent samples, if there’s actually a difference between the two groups, it may be swamped by all the differences within each group.
Definitions: You have paired data when each observational unit gives you two numbers. These can be one number each from a matched pair of individuals, or two numbers from one individual. Paired data come from dependent samples.
Example 3: What if any is the average difference in time husbands and wives spend on yard work? You randomly select 40 couples and find how much time a week each person spends in yard work. Each husband and wife are a matched pair. The samples are dependent because once you’ve chosen a couple you’ve equally specified a member of the “wives” sample and a member of the “husbands” sample.
Example 4: How much “winter weight” does the average adult gain? You randomly select 500 adults and weigh them all during the first week of November, then again during the last week of February. You have paired data in the before and after numbers. The two samples are dependent because they are the same individuals.
Do you see how a design with paired data (dependent samples) overcomes the big problem with unpaired data (independent samples)? You want to study weight gain, and now that’s what you’re measuring directly. You wanted to know whether husband or wife spends more time on yard work, and now you’ve eliminated all the differences between couples.
Paired data are more likely than unpaired to reveal an effect, if there is one. Why? Because a paired-data design minimizes differences within each group that can swamp any difference between groups.
In studying human development and behavior, twins are a prime source of dependent samples. If you have a pair of identical twins who were raised apart (and that’s surprisingly common), you can investigate which differences between people’s behavior are genetic and which are learned. The Minnesota Study of Twins (Bouchard 1990 [see “Sources Used” at end of book]), found that a lot of behaviors that “should” be learned seem to be genetic. The New York Times published a nontechnical account in Major Personality Study Finds That Traits Are Mostly Inherited (Goleman 1986 [see “Sources Used” at end of book]).
Sample type  Dependent  Independent, or randomized 

Numeric data type  Paired Data  Unpaired Data 
How many numbers from each experimental unit?  Two  One 
Can you rearrange★ one sample?  No  Yes 
Problem of confounding variables  Minimal  Severe 
Use this design …  … if you can  … if you must 
★If the data from the sample are arranged in two rows or two columns, can you rearrange one row or column without destroying information? 
Testing new corn versus standard corn for yield. Can you see a problem with the sample in Western New York that’s not a problem with the sample in Central New York?
Adapted from Dabes and Janik (1999, 263) [see “Sources Used” at end of book]
You’re the head of research for the Whizzo Seed Company, and you’ve developed a new type of seed that looks promising. You randomly select three farmers in Western New York to receive new corn, and three to receive your standard product. (Of course you don’t tell them which one they’re getting.) At the end of the season they report their yield figures to you.
What’s wrong with this picture? You can easily think of all sorts of confounding variables here: different soils, different weather, different insects, different irrigation, different farming techniques, and on and on. Those differences can be great enough to hide (confound) a difference between the two types of corn, especially in a small sample.
The following year, you try again in Central New York. This time you send each farmer two stocks of seed corn, with instructions to plant one field with the first stock and another field with the second.
Does that eliminate confounding variables? Maybe not totally, but it reduces them as far as possible. Now, if you see significant differences in yield between two fields planted by the same farmer, it’s almost sure to be due to differences in the seed.
You always want to structure an experiment or observation with paired data (dependent samples) — if you can.
“If you can.” Aye, there’s the rub. Suppose you want to know whether attending kindergarten makes kids do better in first grade. There’s no way to set this up as paired data: how can a given kid both go through kindergarten and not go through kindergarten? Twin studies don’t help you here, because if the twins are raised together the parents will send both of them to kindergarten, or neither; and if the twins are raised apart then there will be too many other differences in their upbringing that could affect their performance in first grade.
If the samples are independent, you can’t pair the data, even if the samples are the same size. If you’re not sure whether you have dependent or independent samples, look back at 11A5. Paired and Unpaired Data Compared.
You want to determine whether a new synthetic rubber makes tires last longer than the competitor’s product. Can you see how to do this with independent samples (unpaired data) and with dependent samples (paired data)? Think about it before you read on.
For independent samples, you randomly assign drivers to receive four tires with your new rubber or four of the competitor’s tires. For dependent samples, you put two tires of one type on the left side of every driver’s car, and two on the right side of every driver’s car. (You do half the cars one way and half the other, to eliminate differences like the greater likelihood of hitting the curb on the right.)
With the first method, if there’s only a small difference between your rubber and the competitor’s, it may not show up because you’ve also got differences in driving styles, roads, and so forth — confounding variables again. With the second method, those are eliminated.
The hypothesis test is almost exactly like the Case 1 hypothesis test. The difference is that you define a new variable d (difference) in Step 1 and write hypotheses about μ_{d} instead of μ.
For a confidence interval, you’re estimating the average difference, not the average of either population. You need to state both size and direction of the effect.
You’ve probably heard about the “freshman fifteen”, the weight gain many students experience in their first year at college. The Urban Dictionary even talks about the “freshman twenty” (2004) [see “Sources Used” at end of book].
Francine wanted to know if that was a real thing or just an urban legend. During the first week of school, she got the other nine women in her chemistry class at Wossamatta U to agree to help her collect data. (She reasoned that students in any particular class would effectively be a random sample of the school, since class choice is unrelated to weight or other health issues. Of course that would be questionable for a spin class or a cooking class.)
Wossamatta U CHEM101 — Women’s Weights (in pounds)  

Student  A  B  C  D  E  F  G  H  I  J 
Sept.  118  105  123  112  107  130  120  99  119  126 
May  125  114  128  122  106  143  124  103  125  135 
When she had the data, Francine realized she didn’t know what to do next. If she had just one set of numbers, she would do a Student’s t test, since she doesn’t know the population standard deviation (SD). But what to do with two lists?
Then she had a brainstorm. She realized that she’s not trying to find out anything about students’ weights. She wants to know about their weight gain. Looking at their weights, she’d have plenty of lurking variables starting with precollege diet and lifestyle. Looking only at the weight gain minimizes or eliminates those variables, and measures just what happened to each student during freshman year. So she added a third row to her chart:
Wossamatta U CHEM101 — Women’s Weights (in pounds)  

Student  A  B  C  D  E  F  G  H  I  J 
Sept.  118  105  123  112  107  130  120  99  119  126 
May  125  114  128  122  106  143  124  103  125  135 
d = May−Sept.  7  9  5  10  −1  13  4  4  6  9 
Notice the new variable d, the difference between matched pairs. (You know the data must be paired, because each May number is associated with one and only one September number. You can’t rearrange the May numbers and still have everything make sense.) This is the heart of Case 3 in Inferential Statistics: Basic Cases: reducing paired numeric data to a simple t test of a population mean difference.
Here’s what’s new:
Now she’s all set. She has one set of ten numbers, representing the continuous variable “weight gain in freshman year” for a random sample of Wossamatta U women. (Notice with student E, Francine has a negative value for d because May minus September is 106−107 = −1. That student lost weight as a freshman.) Time for a t test!
But first, what will she test? Her original idea was to test the “freshman fifteen”. But a glance at the d’s shows her that no one gained as much as 15 lb. An average can’t be larger than every member of the data set, so there’s no way she could prove a hypothesis that the average gain is above fifteen pounds. She decides instead to try to prove a “freshman five”, μ_{d} > 5, with 0.05 significance.
When you do a confidence interval, you don’t have to make any decision of this kind because you just follow the data where they lead.
Francine subtracted by hand here, but you shouldn’t do that because it’s a rich source of errors and makes it harder to check your work. Instead, let your TI-83/84 do the subtraction: with the two data rows in L1 and L2, enter the difference as a formula for L3. When you press [ENTER], the calculator does all the subtractions, wiping out whatever was in L3 previously.
This isn’t Excel — if you change L1 or L2 after entering the formula for L3, L3 won’t change. You need to reenter the formula for L3 in that case. (You actually can make the calculator behave like Excel by binding a formula to a list, but it’s not worth the hassle.)
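The same idea carries over to any tool: keep the two original lists and let the software do the subtraction. Here is a small Python sketch using Francine’s data. The list names sept and may are just illustrative stand-ins for the calculator’s lists.

```python
# Francine's data, students A through J
sept = [118, 105, 123, 112, 107, 130, 120, 99, 119, 126]   # September weights
may  = [125, 114, 128, 122, 106, 143, 124, 103, 125, 135]  # May weights

# One subtraction per matched pair: d = May - Sept.
d = [m - s for s, m in zip(sept, may)]

print(d)   # [7, 9, 5, 10, -1, 13, 4, 4, 6, 9]
```

Like the calculator list, d here is a snapshot: if you change sept or may afterward, you have to recompute d.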
With paired numeric data, your population parameter is the mean difference μ_{d}. The random variable is a difference (in this case, a number of pounds gained from September to May), so the parameter is the mean of all those weight gains.
(1)  d = May−September
H_{0}: μ_{d} = 5, average student gains 5 lb or less
H_{1}: μ_{d} > 5, average student gains more than 5 lb 

(2)  α = 0.05 
(RC) 

(3/4) 
This is a regular T-Test, number 2 in the STAT TESTS menu. Francine writes down
T-Test: 5, L3, 1, >μ_{o}
results: t=1.29, p = 0.1146, d̅=6.6, s=3.9, n=10
The sample mean is d̅ (“dbar”), not x̅, because the data are d’s, not x’s. 
(5)  p > α. Fail to reject H_{0}. 
(6) 
You can’t determine whether the average Wossamatta U woman student gains more than 5 pounds in her freshman year or not (p = 0.1146).
Or, At the 0.05 significance level, you can’t determine whether the average Wossamatta U woman student gains more than 5 pounds in her freshman year or not. 
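To see where the calculator’s d̅, s, and t come from, here is a minimal sketch using Python’s standard statistics module. It only reproduces the arithmetic of the test statistic; it is not a replacement for the requirements check or the p-value.

```python
import math
import statistics

d = [7, 9, 5, 10, -1, 13, 4, 4, 6, 9]   # weight gains, May - Sept
mu0 = 5                                  # hypothesized mean gain from H0

n = len(d)
d_bar = statistics.mean(d)               # sample mean difference
s = statistics.stdev(d)                  # sample SD (divides by n - 1)
t = (d_bar - mu0) / (s / math.sqrt(n))   # one-sample t on the differences

print(f"d-bar = {d_bar:.1f}, s = {s:.1f}, n = {n}, t = {t:.2f}")
```

The output should match the d̅ = 6.6, s = 3.9, n = 10, and t = 1.29 that Francine wrote down.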
After a “fail to reject H_{0}”, you always remember to write your conclusion in neutral language, right? Maybe the true average weight gain is greater than 5 pounds but this particular sample just happened not to show it; maybe the true average weight gain really is under 5 pounds. A confidence interval can help you get a handle on the effect size.
When a hypothesis test fails to reach a conclusion, a confidence interval can salvage at least some information. When a hypothesis test does reach a conclusion, a confidence interval can give you more precise information.
If Francine were doing only the confidence interval, she’d have to start off by testing requirements. But she has already tested them as part of the hypothesis test, so she goes right to the TInterval screen.
Which confidence level does she choose? Her one-tailed hypothesis test at α = 0.05 would be equivalent to a two-tailed test at α = 0.10, and that suggests a confidence level of 90%. But she decides since her hypothesis test has already failed to reach a conclusion she’d at least like to get a 95% CI.
TInterval: L3, 1, .95
results: (3.7948, 9.4052)
Conclusion: Francine is 95% confident that the average woman student at Wossamatta U gains 3.8 to 9.4 pounds during her freshman year.
(Francine doesn’t write down d̅, s, and n because she’s already written them in the hypothesis test. She would write them down when she does only a confidence interval.)
Common mistake: Don’t say the average weight is 3.8 to 9.4 pounds. You aren’t estimating the average first-year woman’s weight, but her weight gain.
Always reread your conclusion after you write it, and ask yourself whether it seems reasonable in the context of the problem. That can save you from mistakes like this.
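Francine’s data also make a handy illustration of why pairing matters. The sketch below is a hypothetical comparison, not something Francine did: it computes the t statistic for “no average change” twice, once correctly from the paired differences and once by (wrongly) treating September and May as two independent samples. The unpaired version uses the usual two-sample standard error, √(s_{1}²/n_{1} + s_{2}²/n_{2}).

```python
import math
import statistics

sept = [118, 105, 123, 112, 107, 130, 120, 99, 119, 126]
may  = [125, 114, 128, 122, 106, 143, 124, 103, 125, 135]
n = len(sept)

# Paired analysis: one list of differences, tested against a mean of 0.
d = [m - s for s, m in zip(sept, may)]
t_paired = statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n))

# Unpaired analysis (wrong for these data; shown only for comparison).
se_unpaired = math.sqrt(statistics.variance(sept) / n
                        + statistics.variance(may) / n)
t_unpaired = (statistics.mean(may) - statistics.mean(sept)) / se_unpaired

print(f"paired t = {t_paired:.2f}, unpaired t = {t_unpaired:.2f}")
```

The mean difference is 6.6 lb either way, but the paired t comes out around 5.3 and the unpaired t only around 1.3. The variation in weight from one student to the next swamps the gain unless you look at each student’s own change.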
Person  Before  After 

1  78  83 
2  64  66 
3  70  77 
4  71  74 
5  70  75 
6  68  71 
A few years back, a coffee company tried to market drinking coffee as a way to relax (and they weren’t talking about decaf). Jon decided to test this. He randomly selected six adults. He recorded their heart rates, then recorded them again half an hour after each person drank two cups of regular coffee. His data are shown in the table above. (Data come from Dabes and Janik [1999, 264] [see “Sources Used” at end of book].)
The data are paired, because each person (experimental unit) gives you two numbers, Before and After; because each After is associated with one specific Before; and because you can’t rearrange Before or After and still have the data make sense.
Jon selected the 0.01 significance level. (He tests for difference even though he believes coffee increases heart rate, because it could decrease it.)
Jon could equally well define d as Before−After or After−Before. At least, mathematically he could. But you’ll find it’s easier to interpret results if you always define d as high minus low so that all or most of the d’s will be positive numbers. (You can do this based on your common sense or by looking at the data.) Jon sees that the After numbers are generally larger than the Before numbers, so he chooses d = After−Before.
(1)  d = After−Before
H_{0}: μ_{d} = 0, coffee makes no difference to heart rate
H_{1}: μ_{d} ≠ 0, coffee makes a difference to heart rate 

(2)  α = 0.01 
(RC)  Jon has a random sample, but the sample size is <30. (The sample of six is obviously less than 10% of coffee drinkers.) He puts the Before figures in L1, After in L2, and then L2−L1 (not L1−L2) in L3. The box-whisker plot of L3 finds no outliers. The normal probability plot shows r=.9638, crit=.8893; r>crit.

(3/4) 
T-Test: 0, L3, 1, ≠μ_{o}
results: t=5.56, p = 0.0026, d̅=4.2, s=1.8, n=6 
(5)  p < α. Reject H_{0} and accept H_{1}. 
(6)  Drinking coffee does make a difference in heart rate half an hour later (p = 0.0026). In fact, coffee increases heart rate.
Or, Drinking coffee does make a difference in heart rate half an hour later, at the 0.01 significance level. In fact, drinking coffee increases heart rate. 
As usual, when you do a two-tailed test and p < α, you can interpret it in a one-tailed manner. Jon defined d as After−Before, which is the amount of increase in each subject’s heart rate. His sample mean d̅ was positive, so the average outcome in his sample was an increase. Because he proved that the mean difference μ_{d} for all people is nonzero, the sign of his sample mean difference d̅ tells him the sign of the population mean difference μ_{d}.
Jon can’t say that the average increase for people in general is 4.2 beats per minute. That was the mean difference in his sample. If he wants to know the mean difference for all people, he has to construct a confidence interval:
TInterval: L3, 1, .99
result: (1.146, 7.187)
Jon is 99% confident that the average increase in heart rate for all people, half an hour after drinking two cups of coffee, is 1.1 to 7.2 beats per minute.
Caution! The confidence interval expresses a difference, not an absolute number. You are estimating the amount of increase or decrease, not the heart rate. A common mistake would be to say something about the heart rate being 1.1 to 7.2 bpm after coffee. Again, you’re not estimating the heart rate, you’re estimating the change in heart rate.
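Jon’s test statistic and interval can be retraced with a few lines of Python. This is a sketch, not the calculator’s internal method. The critical value 4.032 is an assumption read from a t table (df = 5, 99% confidence), and the name T_STAR is made up for this illustration.

```python
import math
import statistics

before = [78, 64, 70, 71, 70, 68]
after  = [83, 66, 77, 74, 75, 71]

d = [a - b for b, a in zip(before, after)]   # d = After - Before
n = len(d)
d_bar = statistics.mean(d)
se = statistics.stdev(d) / math.sqrt(n)      # standard error of d-bar

t = d_bar / se                               # H0: mu_d = 0

T_STAR = 4.032   # assumed table value: t critical for df = 5, 99% confidence
low, high = d_bar - T_STAR * se, d_bar + T_STAR * se

print(f"t = {t:.2f}, 99% CI for mu_d = ({low:.3f}, {high:.3f})")
```

Both endpoints are differences in beats per minute (After minus Before), not heart rates, which is exactly the point of the caution above.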
With paired data, you tested the population mean difference μ_{d} between matched pairs. But suppose you don’t have matched pairs? With unpaired data in independent samples, you test the difference between the means of two populations, μ_{1}−μ_{2}.
This is Case 4 in Inferential Statistics: Basic Cases. Key features:
2-SampTTest for hypothesis test, 2-SampTInt for confidence interval. Always use Pooled:No with both.
Advice: Take your time when you look at data to decide whether you have paired or unpaired data. If your sample sizes are different, it’s a no-brainer: the data are unpaired. But if the sample sizes are the same, think carefully about whether the data are paired or unpaired. Sometimes students just seem to take a stab in the dark at whether data are paired or unpaired, but if you just stop and think about how the data were taken you can make the right decision every time. Look back at Paired and Unpaired Data at the beginning of this chapter if you need a refresher on the difference.
Prof. Sullivan’s students at Wossamatta U felt that he was a tougher grader than the other speech professors. They decided to test this, at the 0.05 significance level.
Eight of them each took a two-hour shift, assigned randomly at different times and days of the week, and distributed a questionnaire to each student on the main quad. They felt this was a reasonable approximation to a random sample of current students. (They asked students not to take a questionnaire if they had already submitted one.) The questionnaire asked whether the student had taken speech in a previous semester, and if so from which professor and what grade they received. They then divided the questionnaires into three piles, “no speech”, “Sullivan”, and “other prof”.
It would be possible to do an analysis with the categorical data of letter grades. But you should always use numerical data when you can, because p-values are usually lower with numeric data than attribute data, for a given sample size. The students counted an A as 4 points, A-minus as 3.7, and so on. Here is a summary of their findings:
Students of  Mean  Standard Deviation  Sample Size 

Sullivan  2.21  1.44  32 
Other prof  2.68  1.13  54 
In this test, you have unpaired numeric data in two samples. The requirements for each sample are the same as the test for the sample in a onesample t test:
There’s an additional requirement for the two samples:
Here’s the hypothesis test, as performed by Prof. Sullivan’s students:
(1)  pop. 1 = Sullivan students, pop. 2 = other speech profs’ students
H_{0}: μ_{1} = μ_{2}, no difference in average grades
H_{1}: μ_{1} < μ_{2}, Sullivan’s grades lower on average 

(2)  α = 0.05 
(RC) 

(3/4) 
2-SampTTest: x̅1=2.21, s1=1.44, n1=32, x̅2=2.68, s2=1.13, n2=54, μ_{1}<μ_{2}, Pooled:No
Results: t = −1.58, p = 0.0600, df=53.58
The test statistic is still Student’s t, but adapted for two samples. See the BTW note below for more about that and about the funny number of degrees of freedom. 
(5)  p > α. Fail to reject H_{0}. 
(6)  At the 0.05 level of significance, they can’t determine whether Prof. Sullivan is a tougher grader than the other professors or not. 
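If you’re curious how t and the odd-looking df arise from the summary numbers, here is a hedged Python sketch. It assumes the unpooled (Welch) two-sample t statistic and the Welch-Satterthwaite degrees of freedom, which is what Pooled:No appears to produce here; the function name welch_t is hypothetical.

```python
import math

def welch_t(x1, s1, n1, x2, s2, n2):
    """Unpooled two-sample t statistic and Welch-Satterthwaite df."""
    v1, v2 = s1**2 / n1, s2**2 / n2          # variance of each sample mean
    t = (x1 - x2) / math.sqrt(v1 + v2)       # tested against a difference of 0
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# pop. 1 = Sullivan students, pop. 2 = other speech profs' students
t, df = welch_t(2.21, 1.44, 32, 2.68, 1.13, 54)

print(f"t = {t:.2f}, df = {df:.2f}")
```

These should agree with the calculator’s t = −1.58 and df = 53.58. Notice that df falls between the smaller of n_{1}−1 and n_{2}−1 (here 31) and the total sample size.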
You’re working with a difference of sample means. The standard error of the mean for the first population is s_{1}/√n_{1} and therefore the variance is s_{1}²/n_{1}, and similarly for the second population. The variance of the sum or difference of independent variables is the sum of their variances, so VAR(x̅_{1}−x̅_{2}) = s_{1}²/n_{1} + s_{2}²/n_{2}. The standard deviation (the standard error of the difference of sample means) is the square root of the variance: SE = √(s_{1}²/n_{1} + s_{2}²/n_{2}).
It turns out that the difference of sample means follows a t distribution, if you choose the right number of degrees of freedom (more on that later). The one-sample test statistic was t = (x̅−μ_{o}) / (s/√n). The two-sample test statistic is analogous, with the differences substituted. The test statistic becomes t = [(x̅_{1}−x̅_{2}) − (μ_{1}−μ_{2})_{o}] / √(s_{1}²/n_{1} + s_{2}²/n_{2}). In this course, you’ll just be testing whether one population mean is greater than, less than, or different from the other. In other words, you’ll test against a hypothetical mean difference of 0. That simplifies t a bit: t = (x̅_{1}−x̅_{2}) / √(s_{1}²/n_{1} + s_{2}²/n_{2}).
What about degrees of freedom? You might think df would be n_{1}+n_{2}−1, but it isn’t. The sampling distribution approximately follows a t with df equal to the lower of n_{1}−1 and n_{2}−1. It’s only approximate because the population SDs are usually different. The exact degrees of freedom were computed by B. L. Welch (1938) [see “Sources Used” at end of book], and the equation is horrendous and ugly. Fortunately, your TI-83/84 has the computation built in, and you don’t have to worry about it.
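For reference, here is the form usually given for Welch’s degrees of freedom (often called the Welch-Satterthwaite formula), written in this section’s notation. Treat it as the standard textbook form rather than a reproduction of the original page’s figure.

```latex
\[
df \;=\;
\frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^{2}}
     {\dfrac{\left(s_1^2/n_1\right)^{2}}{n_1 - 1}
      + \dfrac{\left(s_2^2/n_2\right)^{2}}{n_2 - 1}}
\]
```

With the grading data (s_{1} = 1.44, n_{1} = 32, s_{2} = 1.13, n_{2} = 54) this works out to about 53.58, the same df the calculator reported.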
What about pooling? Why do you always select Pooled:No on your TI-83/84? Well, if the two populations have the same SD (if they are homoscedastic) you can treat them as a single population (pool the data sets) and use a higher number of degrees of freedom. That in turn means your p-value will be a bit lower, so you’re a bit more likely to be able to reject H_{0}. Sounds good, right? But there are problems:
For these reasons and others, the issue of pooling is controversial. Some books don’t even mention it. It’s best just to use Pooled:No always.
The requirements are exactly the same as the requirements for the hypothesis test. You compute a confidence interval on your TI83/84 through 2SampTInt.
Since they couldn’t prove that Prof. Sullivan was a tough grader, the students decided to compute a 90% confidence interval for the difference between Prof. Sullivan’s average grades and the other speech profs’ average grades:
pop. 1 = Sullivan students; pop. 2 = other speech profs’ students
Requirements: already covered in hypothesis test.
2SampTInt: x̅1=2.21, s1=1.44, n1=32, x̅2=2.68, s2=1.13, n2=54, CLevel=.9, Pooled:No
Results: (−.9678, .02779)
Interpretation: The TI83 gives you the bounds for the confidence interval about μ_{1}−μ_{2}. A negative number indicates μ_{1} smaller than μ_{2}, and a positive number indicates μ_{1} larger than μ_{2}. Therefore:
We’re 90% confident that the average student in Prof. Sullivan’s classes receives somewhere between 0.97 of a letter grade lower than the average student in other profs’ speech classes, and 0.03 of a letter grade higher.
Remark: The 90% confidence interval is almost all negative. This reflects the fact that the p-value in the one-tailed test for μ_{1} < μ_{2} was almost as low as 0.05.
The students could have chosen any confidence level they wanted, just for showing an effect size. But for a confidence interval equivalent to their one-tailed hypothesis test that used α = 0.05, the confidence level has to be 1−2×0.05 = 0.90 = 90%.
Why do you need a special two-sample t procedure? Can’t you just compute a confidence interval from each sample and then compare them? No, because the standard errors are different. The two-sample standard error takes both sample SDs and both sample sizes into account. Here’s a simple example, provided by Benjamin Kirk:
A farmer tests two diets for his pigs, randomly assigning 36 pigs to each sample. The Diet A group gained an average 55 lb with SD of 3 lb; that gives a 95% confidence interval 54.0 to 56.0 lb. The Diet B group gained 53 lb on average, with SD of 4 lb; the CI is 51.6 to 54.4 lb. Those intervals overlap slightly, which would not let you conclude that there’s any difference in the diets.
But the 2SampTInt is 0.3 to 3.7 lb in favor of Diet A, which says there is a difference. The issue is that the B group had a lower sample mean, but there was more variation within the group.
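One way to see why the overlap check misleads is to compare standard errors. The sketch below uses the pig-diet numbers; the variable names are just for illustration. Checking whether two separate intervals overlap amounts to comparing the difference in means against the sum of the two margins of error, but the margin for the difference is built on the square root of the sum of squared standard errors, which is always smaller than the plain sum.

```python
import math

n = 36                      # pigs per diet group
se_a = 3 / math.sqrt(n)     # standard error of the Diet A mean
se_b = 4 / math.sqrt(n)     # standard error of the Diet B mean

# Variances add for independent samples, so the standard error of the
# difference is the square root of the sum of squares, not the plain sum.
se_diff = math.sqrt(se_a**2 + se_b**2)

print(f"SE(A) = {se_a:.3f}, SE(B) = {se_b:.3f}")
print(f"SE of difference = {se_diff:.3f}, sum of SEs = {se_a + se_b:.3f}")
```

Because the standard error of the difference is well below the sum of the individual standard errors, the two-sample interval can exclude 0 even when the separate intervals overlap.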
The Alpha Alpha Alpha sorority chapter at Staples University (Yes, corporate sponsorship is getting ridiculous!) has a tradition of putting in extra effort academically. They gave their incoming pledges the task of proving that Alpha Alpha Alpha had higher average GPA than other sororities, at the 0.05 level of significance. The Alphas are a large sorority, with 119 members.
The pledges hacked the campus server and obtained GPAs of ten randomly selected Alphas and ten randomly selected members of other sororities on campus. Do their ill-gotten data prove their point?
Alphas:  2.31  3.36  2.77  2.93  2.27  2.35  3.13  2.20  3.20  2.45 

Other sororities:  1.49  1.74  2.70  2.40  2.17  1.08  1.85  1.96  2.08  1.49 
Since you have independent samples (unpaired data) from two different populations, this is Case 4, difference of population means, in Inferential Statistics: Basic Cases.
Caution: You can’t treat these as paired data just because the sample sizes are equal; that’s a rookie mistake. When deciding between a paired or an unpaired analysis, always ask yourself: “Is data point 1 from the first sample truly associated with data point 1 from the second sample?” In this case, they’re not.
(1)  pop. 1 = Alpha Alpha Alpha; pop. 2 = other sororities
H_{0}: μ_{1} = μ_{2}, No difference in average GPA
H_{1}: μ_{1} > μ_{2}, Average GPA of all Alphas is higher than other sororities

(2)  α = 0.05 
You check requirements against both samples independently. These samples are both smaller than 30, so you have to check normality and outliers on both. Here are the normality checks:
The first picture doesn’t look much like a straight line, but r is greater than crit, so it’s close enough. (With small data sets like this one, fitting the data to the screen can make differences look larger than they really are.)
The calculator lets you “stack” two or three boxplots on one screen. Not only is this a bit of a labor saver, but it also gives you a good sense of how different the samples are. To do this, select “Compare 2 smpl” on the first box-whisker screen. (You can guess what “Compare 3 smpl” does, but we don’t use it in this course.)
For these samples, the difference is dramatic. Every single Alpha’s GPA (in the sample) is above the third quartile in the sample of other sororities, and the max of other sororities is just barely above the median Alpha.
With such a big difference, why do the pledges even need to do a hypothesis test? Because they know these are just samples. Maybe the Alphas actually aren’t any better academically, but these particular samples just happened to be far apart. The hypothesis test tells you whether the difference you see is too big to be purely the result of random chance in sample selection.
(RC) 


(3/4) 
2SampTTest: L1, L2, 1, 1, >μ_{2}, Pooled:No
outputs: t = 3.93, p-value = 0.0005, x̅_{1} = 2.70, s1 = 0.43, n1 = 10, x̅_{2} = 1.90, s2 = 0.48, n2 = 10
(5)  p < α. Reject H_{0} and accept H_{1}. 
(6)  The average GPA in Alpha Alpha Alpha is higher than the average GPA of other sorority members (p = 0.0005).
[Or, at the 0.05 level of significance, the average GPA in Alpha Alpha Alpha is higher than the average GPA of other sorority members.]
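If you’d like to check the sample statistics and the test statistic away from the calculator, this short Python sketch recomputes them from the two GPA lists using only the standard library. The function name two_sample_t is an illustrative choice, and the sketch stops at t: the p-value needs the t distribution, which Python’s standard library doesn’t include.

```python
import math
from statistics import mean, stdev

alphas = [2.31, 3.36, 2.77, 2.93, 2.27, 2.35, 3.13, 2.20, 3.20, 2.45]
others = [1.49, 1.74, 2.70, 2.40, 2.17, 1.08, 1.85, 1.96, 2.08, 1.49]

def two_sample_t(sample1, sample2):
    """Unpooled two-sample t statistic for H0: mu1 = mu2."""
    var_of_mean1 = stdev(sample1) ** 2 / len(sample1)
    var_of_mean2 = stdev(sample2) ** 2 / len(sample2)
    return (mean(sample1) - mean(sample2)) / math.sqrt(var_of_mean1 + var_of_mean2)

t_stat = two_sample_t(alphas, others)
print(f"means: {mean(alphas):.2f} vs {mean(others):.2f}")
print(f"SDs:   {stdev(alphas):.2f} vs {stdev(others):.2f}")
print(f"t = {t_stat:.2f}")
```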
Comment: You have to phrase your conclusion carefully. The pledges proved that the average GPA of Alphas is higher than the average GPA of all other sorority members, not all other sororities. What’s the difference? Here’s a simple example. Suppose there are ten other sororities besides the Alphas. The Omegas have an average GPA of 3.66, higher than the Alphas’ average. If the other nine each have an average GPA of 1.70, that could easily produce exactly the sample that the pledges got.
The message here: Aggregating data can lose information. Sometimes that’s okay, but be wary when one population is being compared to an aggregate of multiple other populations.
When you have two samples of binomial data, they represent two populations. Each population has some proportion of successes, p_{1} and p_{2} respectively. You don’t know those true proportions, and in fact you’re not concerned with them. Instead, you’re concerned with the difference between the proportions, p_{1}−p_{2}. You can test whether there is a difference (hypothesis test), or you can estimate the size of the difference (confidence interval).
This is Case 5 in Inferential Statistics: Basic Cases. Key features of Case 5, the difference of proportions:
Advice: take your time with two-sample binomial data. You have a lot of p’s and a lot of percentages floating around, and it’s easy to get mixed up if you try to hurry.
Take extra care when writing conclusions. You’re making statements about the difference between the two proportions, not about the individual proportions. And you’re making statements about the difference in proportions between the populations, not between the samples.
Stopped by Traffic Cop
         Ticket Issued   Just a Warning   Total   p̂ (ticketed)
Men      86              11               97      89%
Women    55              15               70      79%
One of my students — call him Don — had several traffic tickets, and he knew one more would trigger a suspension. He felt that women stopped by a traffic cop were more likely than men to get off with just a warning, and for his Field Project he set out to prove it, with α = 0.05.
Don quickly realized that he should test whether men and women stopped by a cop are equally likely to get a ticket, not just whether men are more likely. After all, he couldn’t rule out the possibility that women are more likely to get a ticket if stopped.
Don distributed a questionnaire to a systematic sample of TC3 students. (He assumed that any gender-based difference in TC3 students would be representative of college students in general. That seems reasonable.) He asked three questions:
Don disregarded any questionnaires from students who had never been stopped as adults. He wasn’t interested in the likelihood of getting a ticket, but in the likelihood of getting a ticket after being stopped by a cop. You could say that he was interested in the different proportions, for men and women, of stops that lead to tickets.
This is just another variation on the good old Seven Steps of Hypothesis Tests. The TI83/84 procedure is 2PropZTest, so your test statistic is a z-score.
Here are the requirements for a Case 5 hypothesis test of a difference of proportions:
Actually, that’s an approximation to the real requirement. We use it because it nearly always gives the same answer, and it’s easier to test.
The real requirement is at least 10 successes and 10 failures EXPECTED in each sample. The expected numbers are what you would see in your samples if H_{0} is true and there’s no difference between the two population proportions. In that case, the pooled proportion p̂, which is the overall percentage of success in the combined samples, is an estimator of the true proportion in both populations.
That pooled proportion is p̂ = (x_{1}+x_{2}) / (n_{1}+n_{2}), where x_{1} and x_{2} are the numbers of successes in the two samples. (2PropZTest shows you p̂ on the output screen.) Using that pooled proportion p̂, the expected successes and failures in sample 1 are n_{1}p̂ and n_{1}−n_{1}p̂, and the expected successes and failures in sample 2 are n_{2}p̂ and n_{2}−n_{2}p̂. All four of these must be ≥ 10.
The Gardasil vaccine example, below, shows a situation where you have to use the blended proportion to test requirements.
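The “official” requirement check is easy to script. Here is a minimal sketch in Python, applied to Don’s traffic-stop counts (86 of 97 men and 55 of 70 women ticketed); the function name expected_counts is an illustrative choice, not anything on the calculator.

```python
def expected_counts(x1, n1, x2, n2):
    """Expected successes and failures in each sample if H0 (p1 = p2) is true."""
    pooled = (x1 + x2) / (n1 + n2)           # pooled proportion
    return [n1 * pooled, n1 - n1 * pooled,   # sample 1: successes, failures
            n2 * pooled, n2 - n2 * pooled]   # sample 2: successes, failures

# Don's traffic-stop data: 86 of 97 men and 55 of 70 women were ticketed.
counts = expected_counts(86, 97, 55, 70)
print([round(c, 1) for c in counts])
print("requirement met:", all(c >= 10 for c in counts))
```

With Don’s data the simple check (actual successes and failures in each sample) and the official check agree, as they nearly always do.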
Here is Don’s hypothesis test about the different proportions of men and women that receive tickets after being stopped in traffic.
(1)  population 1 = college men stopped by traffic cops; population 2 = college women stopped by traffic cops
H_{0}: p_{1} = p_{2}, college men and women equally likely to get a ticket after being stopped
H_{1}: p_{1} ≠ p_{2}, college men and women not equally likely to get a ticket after being stopped

(2)  α = 0.05 
(RC) 

(3/4) 
2PropZTest : 86, 97, 55, 70, p1≠p2
Results: z=1.77, p-value = 0.0760, p̂_{1}=.89, p̂_{2}=.79, p̂=.84

(5)  p > α. Fail to reject H_{0}. 
(6)  At the 0.05 level of significance, Don can’t tell whether men and women stopped by traffic cops are equally likely to get tickets, or not. 
If this non-conclusion leaves you non-satisfied, you’re not alone. As usual, the confidence interval (next section) can provide some information.
Why does the “official” requirement use a pooled proportion p̂ instead of testing each sample? In fact, for a confidence interval you always test requirements for each sample. But in a hypothesis test, your H_{0} is always “no difference in population proportions”, and a hypothesis test always starts by assuming H_{0} is true. If the null is true, then there is no difference in the two populations, and you really just have one big sample of size n_{1}+n_{2} and sample proportion p̂. So that’s what you test.
Of course the two-population case is a bit more complicated. You need the key fact that when you add or subtract independent random variables, their variances add. If the two populations have the same proportion p, as H_{0} assumes, then the SD of the sampling distribution of the proportion for population 1 is √[p̂(1−p̂)/n_{1}], and similarly for population 2, where p̂ is the pooled proportion mentioned in the requirements check, above. Square the SDs to get the variances, add them, and take the square root to get the standard error: √[p̂(1−p̂)/n_{1} + p̂(1−p̂)/n_{2}]. And from this you have the test statistic: z = (p̂_{1}−p̂_{2}) / √[p̂(1−p̂)/n_{1} + p̂(1−p̂)/n_{2}].
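The pooled standard error and z statistic described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, checked against Don’s 2PropZTest output; the function name two_prop_z_test is made up here, and NormalDist from Python’s standard library stands in for the calculator’s normal distribution.

```python
import math
from statistics import NormalDist

def two_prop_z_test(x1, n1, x2, n2):
    """z statistic and two-tailed p-value for H0: p1 = p2."""
    p_hat1, p_hat2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)       # pooled proportion under H0
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p_hat1 - p_hat2) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two tails
    return z, p_value

z, p_value = two_prop_z_test(86, 97, 55, 70)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

For a one-tailed test you would drop the factor of 2 and use the tail that matches H_{1}.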
In a confidence interval for the difference of two proportions, some unknown proportion p_{1} of population 1 has some characteristic, and some unknown proportion p_{2} of population 2 has that characteristic. You aren’t concerned with those proportions on their own, but you want to estimate which population has the greater proportion, and by how much.
The TI83/84 procedure is 2PropZInt. The CI estimate is for p_{1}−p_{2}, the true difference between the proportions of success in the two populations. A negative number in the confidence interval means the population 1 proportion is lower than the population 2 proportion, and a positive number means p_{1} is greater than p_{2}.
The requirements for a CI are almost the same as for a HT, but with one subtle difference:
Why is that last requirement different from the “official” requirement for the hypothesis test? With the HT, you assumed H_{0} was true and both populations had the same proportion. That let you use a blended or pooled proportion from your combined samples. But with a CI, you don’t make any such assumption. What would be the point of a confidence interval for the difference if you assume there is no difference?
But despite the difference in theory, as a practical matter you can just test for ≥ 10 successes and ≥ 10 failures in each sample for both HT and CI.
Don has already checked requirements in the hypothesis test, so he moves right to a 2PropZInt with the same counts (86 of 97 and 55 of 70) at the 95% confidence level.
Don gets a result of −1.4% to +21.6%. How does he interpret that? Well, he can write it as
−1.4% ≤ p_{1}−p_{2} ≤ 21.6% (95% conf.)
Adding p_{2} to all three “sides” gives
p_{2}−1.4% ≤ p_{1} ≤ p_{2}+21.6% (95% conf.)
With 95% confidence, p_{1} is somewhere between 1.4% below p_{2} and 21.6% above p_{2}. You don’t know the numerical value of p_{1}, but out of male students who are stopped by a traffic cop, p_{1} is the proportion who get a ticket, and similarly for p_{2} and women. So Don can write his confidence interval like this:
I’m 95% confident that, out of students stopped by traffic cops, the proportion of men who actually get tickets is somewhere between 1.4 percentage points less than women, and 21.6 percentage points more than women.
If you’re not feelin’ the love with the algebra approach, you can reason it out in words. The confidence interval is the difference in proportions for men minus women. If that’s negative, the proportion for men is less than the proportion for women; if the difference is positive, the proportion for men is greater than the proportion for women.
Why do I say “percentage points” instead of just “percent” or “%”? Well, how do you describe the difference between 1% and 2%? It’s a difference of one percentage point, but it’s a 100% increase, because the second one is 200% of the first. When you subtract two percentages, the difference is a number of percentage points. If you just say “percent”, that means you’re expressing the difference using one of the percentages as a base, even if you don’t mean to.
Getting back to Don’s confidence interval, the −1.4% to +21.6% difference between men and women in traffic tickets is a simple subtraction of men’s rate minus women’s rate, so it is percentage points, not percent.
The standard deviation of the sampling distribution of the proportion for population 1 is √[p̂_{1}(1−p̂_{1})/n_{1}], and similarly for population 2. Square them, add, and take the square root to get the SD of the distribution of differences in sample proportions, also known as the standard error of the difference of proportions: √[p̂_{1}(1−p̂_{1})/n_{1} + p̂_{2}(1−p̂_{2})/n_{2}]. The margin of error is z_{α/2} times that. The center of the confidence interval is the point estimate, (p̂_{1}−p̂_{2}), so the bounds for the 100(1−α)% confidence interval are
(p̂_{1}−p̂_{2})−E ≤ p_{1}−p_{2} ≤ (p̂_{1}−p̂_{2})+E where E = z_{α/2}·√[p̂_{1}(1−p̂_{1})/n_{1} + p̂_{2}(1−p̂_{2})/n_{2}]
Just like with numeric data, you have to use the two-sample procedure to compute a correct confidence interval. Here’s an example.
Two candidates are running for city council, so they each commission an exit poll on Election Day. Of 200 voters polled, 110 voted for Mr. X; 90 of a different 200 voted for Ms. Y. The 95% confidence intervals are 48.1% to 61.9% and 38.1% to 51.9%. The intervals overlap, so Ms. Y might still hope for victory. But a 2PropZInt tells a different story. The interval for the difference of proportions, X−Y, is 0.2% to 19.8%, so Mr. X is 95% confident of winning, and the only question is whether it will be a squeaker or a landslide.
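Both two-proportion intervals in this section can be reproduced with a short Python sketch. It assumes the unpooled standard error (each sample’s own proportion, as a confidence interval requires); the function name two_prop_z_interval is made up for the illustration, and NormalDist().inv_cdf plays the role of invNorm.

```python
import math
from statistics import NormalDist

def two_prop_z_interval(x1, n1, x2, n2, conf_level=0.95):
    """Confidence interval for p1 - p2, using unpooled sample proportions."""
    p_hat1, p_hat2 = x1 / n1, x2 / n2
    std_err = math.sqrt(p_hat1 * (1 - p_hat1) / n1 + p_hat2 * (1 - p_hat2) / n2)
    z_crit = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)   # z_{alpha/2}
    margin = z_crit * std_err
    diff = p_hat1 - p_hat2                                    # point estimate
    return diff - margin, diff + margin

don_low, don_high = two_prop_z_interval(86, 97, 55, 70)
poll_low, poll_high = two_prop_z_interval(110, 200, 90, 200)
print(f"Don:       {don_low:+.4f} to {don_high:+.4f}")
print(f"Exit poll: {poll_low:+.4f} to {poll_high:+.4f}")
```

Don’s interval straddles 0, while the exit-poll interval sits entirely above 0, matching the conclusions in the text.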
You have a confidence level and a desired margin of error in mind. How large must each sample be?
You may remember with the one-population binomial case, part of the calculation was your prior estimate, or if you had no prior estimate you used 0.5. With two binomial populations, you need a prior estimate (or 0.5) for each one.
The easiest way to compute the necessary sample size is to use MATH200A Program part 5. If you don’t have the program and want to get it, see Getting the Program. If you don’t have the program, you can also calculate the necessary size of each sample with this formula: n = [p̂_{1}(1−p̂_{1}) + p̂_{2}(1−p̂_{2})]·(z_{α/2}/E)².
For a detailed explanation, with worked examples, see How Big a Sample Do I Need?.
Caution! When you’re planning to study the difference between two binomial populations, you have to use the two-population binomial computation of sample size. If you compute one sample size for sample 1 and a separate sample size for sample 2, you’ll come out much too low.
Example 12: Let’s look back once more at Don and his traffic stops. His 95% confidence interval was −0.0141 to +0.21587. That’s a margin of error of (.21587−(−.0141))/2 = 11½ percentage points. How large must his samples be if he wants a margin of error of no more than 5 percentage points but he’s willing to be only 90% confident?
Solution: Don can use his sample proportions as prior estimates. Those were 86/97 ≈ 0.8866 for men and 55/70 ≈ 0.7857 for women.
With the MATH200A program (recommended):
Here’s the output screen from MATH200A Program part 5, 2pop binomial:

If you’re not using the program:
The calculation is a little easier if you break it into chunks. First compute p̂_{1}(1−p̂_{1}) + p̂_{2}(1−p̂_{2}). When you press [Enter], the calculator displays that result. You want to multiply that by (z_{α/2}/E)².
What is z_{α/2}? You did this in How Big a Sample for Binomial Data? in Chapter 9. The confidence level is 1−α = 0.9, so α = 0.1, α/2 = 0.05, and z_{α/2} is invNorm(1−.05).
Caution: You don’t round the sample size. If you don’t get a whole number from the calculation, always go up to the next whole number. A sample size of 291.0255149 or greater gives a margin of error of .05 or less, at 90% confidence. The smallest whole number that is 291.0255149 or greater is 292, not 291. 
Answer: Don needs a sample of 292 men and 292 women if he wants 90% confidence in an estimate of the difference with margin of error no more than 5%.
Rookie mistake: Don’t just say “292”. It’s 292 from each population.
Why do you need such large samples, even at a confidence level as low as 90%? Part of the answer is that binomial data do need large samples; remember that a single sample of just over a thousand gives you a 3% margin of error at the 95% confidence level. And when you have two populations, you are estimating the difference between two unknown parameter values, p_{1} and p_{2}. If each of those was estimated within a 3% margin of error, the margin of error for their difference would be 6%, so the samples have to be larger in the two-population binomial case.
Example 13: The Prime Minister knows that his program of tax cuts and reduced social services appeals more to Conservatives than to Labour, but he wants to know how large the difference is. To estimate the difference with 95% confidence, with a margin of error of no more than 3%, how many members of each party must he survey?
Solution: You’re given no estimate of support within either party, so use 0.5 for p̂_{1} and p̂_{2}. E = 0.03 (not 0.3).
With the MATH200A program (recommended):
MATH200A/sample size/2pop binomial:

If you’re not using the program:
First compute p̂_{1}(1−p̂_{1}) + p̂_{2}(1−p̂_{2}) = 0.5(1−0.5)+0.5(1−0.5). You have to multiply that by (z_{α/2}/E)², and you find z_{α/2} like this: CLevel = 1−α = 0.95 ⇒ α = 1−0.95 = 0.05 ⇒ α/2 = 0.025 ⇒ z_{α/2} = invNorm(1−.025).

Answer: To gauge the difference within a 3% margin of error, at the 95% confidence level, the Prime Minister needs to poll 2135 Conservative Party members and 2135 Labour Party members.
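Both sample-size answers can be checked with a short Python sketch of the by-hand method: compute p̂_{1}(1−p̂_{1}) + p̂_{2}(1−p̂_{2}), multiply by (z_{α/2}/E)², and round up. The function name two_pop_binomial_sample_size is invented for this illustration and is not part of MATH200A.

```python
import math
from statistics import NormalDist

def two_pop_binomial_sample_size(p1_prior, p2_prior, margin, conf_level):
    """Required size of EACH sample for estimating p1 - p2 within the margin."""
    z_crit = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)   # z_{alpha/2}
    spread = p1_prior * (1 - p1_prior) + p2_prior * (1 - p2_prior)
    raw = spread * (z_crit / margin) ** 2
    return math.ceil(raw)    # always go UP to the next whole number

n_don = two_pop_binomial_sample_size(86 / 97, 55 / 70, 0.05, 0.90)
n_pm = two_pop_binomial_sample_size(0.5, 0.5, 0.03, 0.95)
print(n_don, n_pm)
```

Using math.ceil rather than round builds the “always go up” rule into the calculation.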
The Gardasil vaccine is marketed by Merck to prevent cervical cancer. What are the statistics behind it? How do women decide whether to get vaccinated? Should the vaccine be mandatory?
A Cortland Standard story (21 Nov 2002) summarized an article from the New England Journal of Medicine as follows:
A new vaccine can protect against Type 16 of the human papilloma virus, a sexually transmitted virus that causes cervical cancer, a new study shows. An estimated 5.5 million people become infected with a strain of HPV [not necessarily this strain] each year in the United States.
Efficiency rate of vaccine and placebo
Placebo: Group size 765, infection 41
HPV16 vaccine: Group size 768, infection 0
Note: The study included 1533 women with an average age of 20.
(Similar studies were done for the vaccine’s effectiveness against another strain, HPV18. According to the front page of the Wall Street Journal on 16 Apr 2007, HPV16 and 18 between them “are thought to cause 70% of cervical-cancer cases.” The vaccine, developed by Merck, is now marketed as Gardasil.)
The samples certainly show an impressive difference, but is it statistically significant? Could the luck of random sampling be enough to account for that difference in infection rates?
The claim is “the vaccine protects against HPV16.” To translate this into the language of statistics, realize that there are two populations: (1) women who don’t get the vaccine, and (2) women who do get the vaccine.
Notice that the populations are all women, past, present, and future who don’t or do get vaccinated. The 765 and 768 women are samples, not populations. The populations are unvaccinated and vaccinated, not placebo and vaccine. Placebos are administered to members of a sample, but a population doesn’t “get placeboed”.
The data type is attribute (binomial) because the original question or measurement of each participant is the yes/no question: “Did this woman contract the virus?” (“Success” is an HPV16 infection, not a good thing.) Since you’re comparing two populations, this is Case 5, Difference of Two Proportions.
If the vaccine works, then you expect more women without the vaccine to contract the virus, so make them population 1. (That’s not necessary; it just usually makes things a little simpler to call population 1 the one with higher numbers expected.)
Although you hope that the vaccine population will have a lower infection rate, it’s not impossible that they could have a higher rate. Therefore you do a two-tailed test (≠). If p < α, then it’s time to say whether the vaccine makes things better or worse.
Let’s use α = 0.001. You’re talking about cancer in humans, after all. A Type I error would be saying that Gardasil makes a difference when actually it doesn’t. You don’t want women to get vaccinated, and have a false sense of security, if the vaccine actually doesn’t work, so a Type I error is pretty serious.
(1)  population 1 = unvaccinated women; population 2 = vaccinated women
H_{0}: p_{1} = p_{2}, the vaccine makes no difference
H_{1}: p_{1} ≠ p_{2}, the vaccine does make a difference

(2)  α = 0.001 
(RC) 

(3/4) 
2PropZTest: 41, 765, 0, 768, p1≠p2
results: z=6.50, p-value = 7.9E−11, p̂_{1}=.0536, p̂_{2}=0, p̂=.0267
Pause for a minute to make sure you can keep all those p’s straight. The first one, p = 7.9E−11, is the p-value, the chance of getting such different sample results if the vaccine makes no difference. p̂_{1} and p̂_{2} are those sample results: 5.4% of unvaccinated women and 0% of vaccinated women in the samples contracted HPV16 infections. p̂ without subscript is the pooled proportion: 2.7% of all women in the study contracted HPV16.
(5)  p < α. Reject H_{0} and accept H_{1}. 
(6)  The Gardasil vaccine does make a difference to HPV16 infection rates (p = 8×10^{−11}). In fact, it lowers the chance of infection.
Or, at the 0.001 level of significance, the Gardasil vaccine does make a difference to HPV16 infection rates. In fact, it lowers the chance of infection.
It’s worth reviewing what this p-value of 8×10^{−11} means. If the vaccine actually made no difference, there are only 8 chances in a hundred billion of getting the difference between samples that the researchers actually got, or a larger difference.
How do you get from “makes a difference” to “reduces infection rate”? Remember that when p < α in a two-tailed test, you can interpret the result in a one-tailed manner. If the vaccine makes things different, as appears virtually certain, then it must either make them better or make them worse. But in the sample groups, the vaccine group did better than the placebo group. Therefore the vaccine can’t make things worse, and it must make them better.
Can you do a confidence interval to estimate how much Gardasil reduces a woman’s risk of HPV16 infection? Unfortunately, you can’t, because the requirements aren’t met: There were zero successes in the second sample. You can’t think like the hypothesis test and use the blended p̂ to meet requirements. Why wouldn’t that make sense? In a confidence interval, you’re specifically trying to estimate the difference between p_{1} and p_{2} (likelihood of infection for unvaccinated and vaccinated women), so you can’t very well assume there is no difference.
In terms of what you’re required to know for the course, you can skip to the next section right now. But if you want to know more, keep reading.
One informal calculation finds a number needed to treat per person actually helped (Simon 2000c [see “Sources Used” at end of book]). The difference in sample proportions is 5.4 percentage points, and 1/.054 ≈ 18.5 is called the number needed to treat. (You may recognize this as the expected value of the geometric distribution with p = 5.4%.) In the long run, for every 18 or 19 women who are vaccinated, one HPV16 infection is prevented.
Caution! 5.4 percentage points is a difference in sample proportions. You can say only that the difference in the population is somewhere in the neighborhood of 5.4 percentage points, not that it is that. The number needed to treat is therefore not exactly 18.5, just somewhere in the neighborhood of 18.5. Even so, this is valuable information for women and their doctors.
Another approach is the rule of three, explained in Confidence Interval with Zero Events (Simon 2010 [see “Sources Used” at end of book]). When there are zero successes in n events, the 95% confidence interval is 0 to 3/n. Here 3/768 = 0.0039, about 0.4%. The 95% confidence interval for the unvaccinated population is 3.8% to 7.0%. So a doctor can tell her patients that about 38 to 70 unvaccinated women in a thousand will be infected with HPV16, but only about four vaccinated women in a thousand.
Each of those is a 95% confidence interval, but the combination isn’t a 95% confidence interval! In the long run, if you do a bunch of 95% CIs, one in 20 of them won’t capture the true population parameter. Here you’re doing two, so there’s only a 95%×95% = 90.3% chance that both of these actually capture the true population proportions.
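The informal calculations in this optional section are simple enough to script. The Python sketch below uses the study counts directly, so the number needed to treat comes out near 18.7 rather than the 18.5 you get from the rounded 5.4%; either way it is “18 or 19”. The variable names are illustrative, and the last line assumes the two intervals come from independent samples.

```python
placebo_infected, placebo_n = 41, 765
vaccine_infected, vaccine_n = 0, 768

# Number needed to treat: reciprocal of the difference in sample proportions.
risk_difference = placebo_infected / placebo_n - vaccine_infected / vaccine_n
number_needed_to_treat = 1 / risk_difference

# Rule of three: with zero events in n trials, the 95% CI is roughly 0 to 3/n.
rule_of_three_upper = 3 / vaccine_n

# Chance that two independent 95% intervals both capture their parameters.
both_capture = 0.95 * 0.95

print(f"number needed to treat: about {number_needed_to_treat:.1f}")
print(f"rule-of-three upper bound: {rule_of_three_upper:.4f}")
print(f"both intervals capture: {both_capture:.4f}")
```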
Summary: If you have a confidence interval for the difference of two population means or proportions, you can conclude whether the difference is statistically significant or not, just like the result of a hypothesis test.
Example 15: You’re testing the new drug Effluvium to see whether it makes people drowsy. Your 95% confidence interval for the difference between proportions of drowsiness in people who do and don’t take Effluvium is (0.017, 0.041). That means you’re 95% confident that Effluvium is more likely, by 1.7 to 4.1 percentage points, to cause drowsiness.
There’s the key point. You’re 95% confident that it does increase the chance of drowsiness by something between those two figures. How likely is it that Effluvium doesn’t affect the chance of drowsiness, then? Clearly it’s got to be less than 5%.
When both endpoints of your confidence interval are positive (or both are negative), so that the confidence interval doesn’t include 0, you have a significant difference between the two populations.
Example 16: Now, suppose that confidence interval was (−0.013, 0.011). That means you’re 95% confident that Effluvium is somewhere between 1.3 percentage points less likely and 1.1 more likely to cause drowsiness. Can you now conclude that Effluvium affects the chance of drowsiness? No, because 0 (“no difference”) is inside your confidence interval. Maybe Effluvium makes drowsiness less likely, maybe it has no effect, maybe it makes drowsiness more likely; you can’t tell.
When one endpoint of your confidence interval is negative and one is positive, so that the confidence interval includes 0, you can’t tell whether there’s a significant difference between the two populations or not.
This is exact for numeric data but approximate for binomial data. Why? Because the HT and CI use