
Stats without Tears
in One File

Updated 1 Jan 2016
Copyright © 2001–2017 by Stan Brown




Copyright © 2001–2017 by Stan Brown,
Tompkins Cortland Community College


Help » About This Book

This book is an alternative to the usual textbooks for a one-semester course in statistics. Whether you’re teaching in a classroom or learning on your own, you’ve come to the right place.


Don’t panic!

Douglas Adams’ The Hitchhiker’s Guide to the Galaxy bore a “large, friendly label” with those words, and that’s also my message to you.

I don’t see any reason for students to be afraid of statistics. It’s no more difficult than any other technical course, and it’s much more practical than other math courses. The mathematical details are here for those who want them, but I lean heavily on technology to relieve students of the “grunt work”.


You need a TI-83 or TI-84 family calculator to get the most out of this book. For $100 or less, this calculator has amazing capabilities for statistics, and it also supports other math courses up through calculus. I suggest you download my free MATH200A program, which adds some capabilities to the calculator, but this is optional.

Some error conditions on your calculator can be scary when you see them the first time. Don’t panic! See TI-83/84 Troubleshooting.

View or print:
These pages change automatically for your screen or printer. Underlined text, printed URLs, and the table of contents become live links on screen; and you can use your browser’s commands to change the size of the text or search for key words. If you print, I suggest black-and-white, two-sided printing.
History of this book:

This textbook grew out of handouts I made for my students at TC3 (Tompkins Cortland Community College in Dryden, New York). The handouts filled gaps and corrected errors in our standard textbook.

As time went on, I found myself replacing whole chapters. Student evaluations showed that they preferred these replacements to the textbook. In Spring 2013 I reached the tipping point: I had replaced more than half of the twelve textbook chapters. In good conscience I didn’t feel I could ask students to buy an expensive textbook that they would use less than half of, so I burned my bridges and announced the required textbook beginning in Summer 2013 as “none”.

In Fall 2013, a second instructor at TC3 adopted this textbook for his class. Benjamin Kirk provided a lot of valuable suggestions and corrections, and I’m very grateful. They have improved the book considerably.


Contact information is at

Please share your reactions, whether positive or negative! If I could explain something better, I’d like to know. If some section works particularly well for you, please tell me. If you find an error, I especially want to know about it. (My own students get extra credit for pointing out errors.)

Being on the Web, this book will get updated frequently, based on your feedback. You can see the revision dates in the chapter list above, and a revision history is shown at the end of every chapter.


This eTextbook is a free resource for you. You can read it on line or print any or all chapters. Links to all the chapters are at <>. If you print any chapters, you can keep your costs down by choosing black-and-white printing in duplex (two-sided) mode.

Just a word of advice. I’ve tried to make statistics approachable to anyone with high-school math, but it’s still a technical subject. You can’t just read a chapter in one pass from start to end, the way you would a novel or a book of history. Please see How to Read a Math Book for some tips on getting the most out of your time with this book, or any math book.

Some material is marked BTW. This is stuff I find interesting, including mathematical details that some students have asked for, but you can get through the course without it.


Although this is a free resource, it is copyrighted and I would appreciate your asking permission to copy and distribute any of it. My contact information is at

Though you don’t need to ask permission simply to link to this material, I would appreciate knowing about it.

1. Statistics!

Updated 21 Feb 2016 (What’s New?)


1A.  Statistics? What’s That?

Summary: We live in an uncertain world. You never have complete information when you make a decision, but you have to make decisions anyway. Statistics helps you make sense of confusing and incomplete information, to decide whether a pattern is meaningful or just coincidence. It’s a way of taming uncertainty, a tool to help you assess the risks and make the best possible decisions.

1A1.  What Should You Expect?

Statistics is different from any other math course.

Yes, you’ll solve problems. But most will be real-world practical problems: Does aspirin make heart attacks less likely? Was there racial bias in the selection of that jury? What’s your best strategy in a casino? (Most examples will be from business, public policy, and medicine, but we’ll hit other fields too.)

There will be very little use of formulas. Real statisticians don’t do things by hand. They use calculators or software, and so will you. Your TI-83 or TI-84 may seem intimidating at first, but you’ll quickly get to know it and be amazed at how it relieves you of drudgery.

With little grunt work to do, you will focus on what your numbers mean. You’re not just a button-pushing calculator monkey; you have to think about what you’re doing and understand it well enough to explain it. Most of the time your answers will be non-technical English, not numbers or statistical jargon. That may seem scary and unfamiliar at first, but if you stick with it you’ll love stretching your brain instead of just following a book’s examples by rote.

1A2.  What Do You Get From the Course?

It may be a required course, so you get that much closer to graduation. ☺ But you can get more than that.

If you do it right, statistics teaches you to think. You become skeptical, unwilling to take everything at face value. Instead, when somebody makes a statement you question how they know that and what they’re not telling you. You can’t be fooled so easily. You become a more thoughtful citizen, a more savvy consumer.

Who knows? You might even have some fun along the way. So let’s get started!

1A3.  Sample and Population

Suppose you want to know about the health of athletes who use steroids versus those who don’t. Or you want to know whether people are likely to buy your new type of chips. Or you want to know whether a new type of glue makes boxes less likely to come apart in shipping. How do you answer questions like that?

With most things you want to know, it’s impossible or impractical to examine every member of the group you want to know about, so you examine part of that group and then generalize to the whole group.

Definitions: A sample is the group you actually take data from. The population is the group you want to know something about.

In Good Samples, Bad Samples, later in this chapter, you’ll see how samples are actually taken.

The sample is usually a subgroup of the population, but in a census the whole population is the sample.

Example 1: You want to know what proportion of likely voters will vote for your candidate, so you poll 850 people. The people you actually ask are your sample, and the likely voters are the population.

Caution!: Your sample is the 850 people you took data from, not just the subgroup that said they would vote for your candidate. The population is all likely voters, regardless of which candidate they prefer. Yes, you want to know who will vote for your candidate, but everybody’s vote counts, so the group you want to know something about — the population — is all likely voters.


The number of members of your sample is called the sample size or size of the sample (symbol n), and the number of members of the population is called the population size or size of the population (symbol N).

“Sometimes it is not possible to count the units contained in the population. Such a population is called infinite or uncountable.” (Finite and Infinite Population 2014 [see “Sources Used” at end of book]) “Smokers” is an example. There is a definite number of smokers in the world at any moment, but if you try to count them the number changes while you’re counting.

The sample size is always a definite number, since you always know how many individuals you took data from.

Example 2: You’re monitoring quality in a factory that turns out 2400 units an hour, so you test 30 units from each hour’s production.

The units you tested are your sample, and your sample size is 30. All production in that hour is the population, and the population size is 2400.

Isn’t the population the factory’s total production, since you want to know about the overall quality? No! Your sample was all drawn from one hour’s production. A sample from one production run can tell you about that production run, not about overall operations. This is why quality testing never ends.

Example 3: You’re testing a new herpes vaccine. 800 people agree to participate in your study. You divide them randomly into two groups and administer the vaccine to one group and a placebo (something that looks and feels like a vaccine but is medically inactive) to another group. Over the course of the study, a few people drop out, and at the end you have 397 vaccinated individuals and 396 who received the placebo.

You have two samples, individuals who were vaccinated (n₁ = 397) and the control group (n₂ = 396). The corresponding populations are all people who will take this vaccine in the future, and all people who won’t. Both of those populations are uncountable or infinite because more people are being born all the time.

1A4.  Descriptive and Inferential Statistics

Sometimes you want to summarize the data from your sample, and other times you want to use the sample to tell you something about the larger population. Those two situations are the two grand branches of statistics.

Definition: Descriptive statistics is summarizing and presenting data that were actually measured. Inferential statistics is making statements about a population based on measurements of a smaller sample.

Example 4: “52.9% of 1000 voters surveyed said they will vote for Candidate A.” That is descriptive statistics because someone actually measured (took responses from) those 1000 people.

Compare: “I’m 95% confident that 49.8% to 56.0% of voters plan to vote for Candidate A.” That is inferential statistics because no one has asked all voters. Instead, a sample of voters was asked, and from that an estimate was made of the feelings of all voters.
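If you’re curious where an interval like that can come from, here is a minimal Python sketch. It assumes the common normal-approximation formula for a proportion, which the text above does not spell out, and the function name proportion_interval is made up for illustration. Treat it as one plausible way to get from the statistic to the estimate, not as the method behind the quoted poll.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(p_hat, n, confidence=0.95):
    """Normal-approximation confidence interval for a population proportion.

    Assumption: this formula is not given in the text above; it is one
    common way such an interval is computed.
    """
    # Critical z value for the chosen confidence level (about 1.96 for 95%)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Standard error of the sample proportion
    std_error = sqrt(p_hat * (1 - p_hat) / n)
    margin = z * std_error
    return p_hat - margin, p_hat + margin

# Descriptive statistics: the proportion actually measured in the sample
p_hat, n = 0.529, 1000

# Inferential statistics: an estimate for all voters
low, high = proportion_interval(p_hat, n)
print(f"Statistic: {p_hat:.1%} of {n} voters surveyed")
print(f"Estimate of parameter: {low:.1%} to {high:.1%}")   # 49.8% to 56.0%
```

The first printed line only describes the 1000 people surveyed. The second makes a statement about all voters, which is why it comes with a confidence level.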

1A5.  Statistic and Parameter

Definitions: A statistic is a numerical summary of a sample. A parameter is a numerical summary of a population.

Mnemonic: sample and statistic begin with s; population and parameter both begin with p.

Continuing with Example 4: “52.9% of 1000 voters surveyed plan to vote for Candidate A.” — 52.9% is a statistic because it summarizes the sample.

“I’m 95% confident that 49.8% to 56.0% of voters plan to vote for Candidate A.” — 49.8% to 56.0% is an estimate of a parameter. (The actual parameter is the exact proportion presently planning to vote for A, which you don’t know exactly.)

A statistic is always a statement of descriptive statistics and is always known exactly, because a statistic is a number that summarizes a sample of actual measured data.

A parameter is usually estimated, not known exactly, and therefore is usually a matter of inferential statistics. The exception is a census, in which data are taken from the whole population. In that case, the parameter is known exactly because you have complete data for the population, so the parameter is then descriptive statistics.

Describing … | The number is … | And the process is …
Any sample | A statistic | Descriptive statistics
A population (usually) | A parameter | Inferential statistics
A census (pop. w/ every member surveyed) | Both statistic and parameter | Descriptive statistics

1B.  Good Samples, Bad Samples

Summary: A good sample is a smaller group that is representative of the population. A bad sample does a bad job of representing the population.

You already know that a random sample is a good thing, but did you know that a random sample is actually carefully planned? What if you can’t take a true random sample? What are good and bad ways to gather samples?

All valid samples share one characteristic: they are chosen through probability means, not selected by any decisions made by the person taking the sample. Every valid sample is gathered according to some rule that lets the impersonal operations of probability do the actual selection.

Definition: A probability sample is a sample where the members are chosen by a predetermined process that uses the workings of chance and removes discretion from the investigators. Some of the types of probability samples are discussed below.

See also: For lots of examples of good sampling and (usually) clear presentation of data about the American people, you might want to visit the Pew Research Center and its tech-oriented spinoff, Pew Internet. The venerable Gallup Poll also makes available its snapshots of the American public.

1B1.  The Gold Standard: Random Samples

Definition: A random sample (also called a simple random sample) is a sample constructed through a process that gives every member of the population an equal chance of being chosen for the sample.

You always want a random sample, if you can get one. But to create a random sample you need a frame (a complete list of the population, defined below), and in many situations it’s impossible or unreasonably difficult to list all members of the population. The sections below explain alternative types of samples that can lead to statistically valid results.

“Random” doesn’t mean haphazard. Humans think we’re good at constructing random sequences of letters and digits, but actually we’re very bad at it. Try typing 1300 “random” letters on your keyboard. If you do it really randomly, you should get about 1300÷26 = 50 of each letter. (Note: about 50 of each, not exactly 50. To determine whether a particular sample of text is unreasonably far from random letters, see Testing Goodness of Fit to a Model.) But if you’re like most people, the distribution will be very different from that: some letters will occur many more than 50 times, and others many fewer.
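To see what “about 50 of each” looks like when software does the typing, you could try a small Python sketch like this one. The function name letter_counts is made up for illustration, and Python’s pseudo-random generator stands in for a truly random typist.

```python
import random
import string
from collections import Counter

def letter_counts(total=1300, rng=random):
    """Count how often each letter appears among `total` pseudo-random letters."""
    letters = rng.choices(string.ascii_lowercase, k=total)
    return Counter(letters)

counts = letter_counts()
expected = 1300 / 26  # 50 of each letter on average
print(f"Expected about {expected:.0f} of each letter")
print(f"Fewest: {min(counts.values())}, most: {max(counts.values())}")
```

Each run gives different counts, and they scatter around 50 instead of landing exactly on it. You can compare that scatter with the counts from your own keyboard mashing.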

So how do you construct a random sample? You need a frame, plus a random-number generator or random-number table.

Definition: A sampling frame, or simply a frame, is a list of all members of the population in a way that lets you assign a unique number to each one.

The frame need not be a physical list; it can be a computer file — these days it usually is. But it has to be a complete list.

If you have a table of random numbers, the table will come with instructions for use. I’ll show you how to do it with the TI-83/84, but you could also do it with Excel’s RANDBETWEEN( ) function, or with any other software that generates pseudo-random numbers. (The Web site provides true random numbers based on atmospheric noise.)

Seeding the Random-Number Generator

Random numbers from software or a calculator aren’t really random, but what we call pseudo-random numbers. That means that they are generated by deterministic calculations designed to mimic randomness pretty well but not perfectly. To help them do a better job, you need to “seed” the random number sequence, meaning that you give it a unique starting point so that your sequence of random numbers is different from other people’s.

You seed the random numbers only once. To do this:

  1. Turn on the calculator and press [CLEAR].
  2. Come up with a number through some means other than choosing it. For instance, select the first number you see in the newspaper or in a book that you let fall open where it will. Type this number into the calculator. (Eyes closed, I tapped the financial page with a pen and used the number that the pen touched.)
  3. Press [STO→], which shows on your screen as →.
  4. Press [MATH], use the arrow keys to reach the PRB menu, and press [1] to paste rand to the screen. Press [ENTER]. [Figure: TI-84 screen showing random number seed]

Again, you need to seed random numbers only once on your calculator.
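The idea of a seed carries over to software. Here is a rough Python sketch; the seed values are arbitrary, the function name first_numbers is made up for illustration, and the population size is the 20,147 that I use in my own example. Unlike the calculator, Python picks its own starting point if you never seed it, so this is only to show what a seed does.

```python
import random

def first_numbers(seed, how_many=3, population_size=20147):
    """Return the first few pseudo-random integers produced from a given seed."""
    rng = random.Random(seed)   # a generator with its own starting point
    return [rng.randint(1, population_size) for _ in range(how_many)]

# The same seed always reproduces the same "random" sequence ...
print(first_numbers(4413) == first_numbers(4413))   # True
# ... which is why each person should start from a different seed.
print(first_numbers(4413), first_numbers(2016))
```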

Selecting Members of the Sample

For this you need to know the size of the population, which is the number of individuals in your frame. You will generate a series of random numbers between 1 and the population size, as follows:

  1. Press [MATH], use the arrow keys to reach the PRB menu, and press [5] to paste randInt( to your screen. [Figure: TI-84 screen showing random number generation]
  2. Press [1] [,], enter the population size, and press [)] [ENTER] to generate the first random number. In my case the population size was 20,147 and my first random number was 4413, so the first member of my sample will be the 4413th individual, in order, from the sampling frame.
  3. Press [ENTER] to generate the next random number. (The randInt function may or may not be displayed again, depending on your calculator model and settings.) In my case, the next random number is 4949, so the 4949th individual in my frame becomes the second member of my sample.
  4. Continue pressing [ENTER] until you have your desired sample size. If you get a duplicate random number, simply ignore it and take the next one. (If your calculator has [8] randIntNoRep, use it instead of plain randInt to prevent duplicates from appearing in the first place.)
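If you’d rather use software than the calculator, the same selection can be sketched in Python. This is a minimal illustration, not the only way to do it: the function name simple_random_sample is made up, and random.sample plays the role that randIntNoRep plays on the calculator.

```python
import random

def simple_random_sample(population_size, sample_size, rng=random):
    """Pick `sample_size` distinct positions from 1..population_size."""
    # random.sample never repeats a value, so no duplicates need to be skipped
    return rng.sample(range(1, population_size + 1), sample_size)

# Positions in the sampling frame of the individuals to include
positions = simple_random_sample(population_size=20147, sample_size=10)
print(sorted(positions))
```

Each number is a position in the frame, so a result of 4413 means the 4413th individual in the list.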

1B2.  Almost as Good: Systematic Samples

Definition: A systematic sample with k = some number is one where you take every kth individual from a representative subset of the population to make up your sample.

Example 5: Standing outside the grocery store all day, you survey every 40th person. That is a systematic sample with k=40.

If properly taken, a systematic sample can be treated like a random sample. Then why do I call it almost as good? Because you have to make one big assumption: that the variable you’re surveying is independent of the order in which individuals appear. In the grocery-store example, you have to assume that shoppers in the time period when you take your survey are representative of all shoppers. That may or may not be true. For example, a high proportion of Wegmans shoppers at lunch time are buying prepared foods to eat there or take back to work. At other times, the mix of groceries purchased is likely to be different.

Taking a Systematic Sample

  1. Estimate the number of individuals you will be sampling from, and call this N. (Here your sampling frame is smaller than the population.) In the grocery-store example, estimate how many shoppers will pass the point where you will stand during the time you’re standing there. If you estimate 1200 shoppers during the six hours when you’ll take your survey, then N=1200.

    If you’re pretty unsure of N, you may need to observe that spot without taking the survey, just to get a preliminary count.

  2. Decide how large a sample you want. Divide N by your desired sample size, rounding down, and call the result k. If you want 95 grocery shoppers in your sample, then k = N/95 = 1200/95 = 12.63 → k=12.

    If your estimate of N is uncertain, you’ll want to reduce k a bit. This will increase your sample size, but a sample that’s too large (within reason) is better than one that’s too small.

  3. If you have never seeded the random-number generator, do it now. See Seeding the Random-Number Generator, above.
  4. Take a random number from 1 to k to determine which person will be first in your sample. To do this, press [MATH], use the arrow keys to reach the PRB menu, and press [5] to paste randInt(, then [1] [,]. Enter the value of k and press [)] [ENTER]. [Figure: TI-84 screen for starting a systematic sample]

    Caution: It’s 1 to k, not 1 to N. If you need to survey every 12th person, then you use randInt(1,12). For determining where to start in the first 12 people, randInt(1,95) and randInt(1,1200) are both wrong.

    At right you see an illustration with k=12. The calculator has determined that I will start with the 2nd person and take every 12th person after that: 2, 14, 26, 38, 50, and so on.

  5. If you reach your desired sample size sooner than expected, keep going for the originally planned time. Why? Because you don’t know whether the individuals that appear early are different from those that appear late. The good news is that the larger sample will give you more accurate results, always a good thing.
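The arithmetic in the steps above can be sketched in Python, using the grocery-store numbers (N = 1200, desired sample size 95). The function name systematic_positions is made up for illustration, and the code assumes your estimate of N is good.

```python
import random

def systematic_positions(estimated_n, sample_size, rng=random):
    """Return k, the random start, and the positions to survey."""
    k = estimated_n // sample_size          # round down, as in step 2
    start = rng.randint(1, k)               # 1 to k, not 1 to N
    positions = list(range(start, estimated_n + 1, k))
    return k, start, positions

k, start, positions = systematic_positions(estimated_n=1200, sample_size=95)
print(f"k = {k}; start with person {start}, then every {k}th person")
print(positions[:5])
```

Because k is rounded down to 12, the list holds 100 positions, a little more than the 95 you wanted. That matches the advice that a sample slightly too large is better than one too small.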

1B3.  Good but Hard: Cluster Samples

Sometimes a true random sample is possible but unreasonably difficult. For example, you could use census records to take a random sample of 1000 adults in the US, but that would mean doing a lot of travel. So instead you take a cluster sample.

Definition: In a cluster sample, you first subdivide the population into a large number of subunits, called clusters, and then you construct a random sample from the clusters.

“In single-stage cluster sampling, all members of the selected clusters are interviewed. In multi-stage cluster sampling, further subdivisions take place.” (Upton and Cook 2008, 76 [see “Sources Used” at end of book])

Example 6: You want to have 600 representative Americans try your new neck pillow to gauge your potential market. Travel to 600 separate locations across the country would be ridiculously expensive, so you randomly select 30 census tracts and then randomly select 20 individuals within each selected census tract.
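Here is a rough Python sketch of the two-stage selection in Example 6. Everything about the data is made up: the tract names, the number of tracts, and the residents are placeholders standing in for real census records, and the function name two_stage_cluster_sample is hypothetical.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, rng=random):
    """Randomly pick clusters, then randomly pick members within each one.

    `clusters` maps a cluster name to the list of its members.
    """
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}

# Made-up frame: 200 hypothetical census tracts of 50 residents each
tracts = {f"tract-{t}": [f"t{t}-person{p}" for p in range(50)]
          for t in range(200)}

sample = two_stage_cluster_sample(tracts, n_clusters=30, n_per_cluster=20)
print(len(sample), "tracts,", sum(len(m) for m in sample.values()), "people")
# prints: 30 tracts, 600 people
```

Notice that chance is used twice, once to choose the tracts and once to choose people within each tract. Nobody decides by hand who gets included.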

A cluster sample makes one big assumption: that the individuals in each cluster are representative of the whole population. You can get away with a slightly weaker assumption, that the individuals in all the selected clusters are representative of the whole population. But it’s still an assumption. For this and other technical reasons, a cluster sample cannot be analyzed in all the same ways as a random sample or systematic sample. Analysis of cluster samples is outside the scope of this course.

1B4.  Stratified Samples

Sometimes you can identify subgroups of your population and you expect individuals within a subgroup to be more alike than individuals of different subgroups. In such a case, you want to take a stratified sample.

Definition: If you can identify subgroups, called strata (singular: stratum), that have something in common relative to the trait that you’re studying, you want to ensure that your sample has the same mix of those groups as the population. Such a sample is called a stratified sample.

Example 7: You’re studying people’s attitudes toward a proposed change in the immigration laws for a Presidential candidate. You believe that some races are more likely to favor loosening the law and others are more likely to oppose it. If the population is 66% non-Hispanic white, 14% Hispanic, 12% black, 4% Asian, and so on, your sample should have that same composition.

A stratified sample is really a set of mini-samples grouped together.

Example 8: You want to survey attitudes towards sports at a college that is 45% male and 55% female, and you want 400 in your sample. You would take a sample of 45%×400 = 180 male students and 55%×400 = 220 female students to make up your sample of 400. Each mini-sample would be taken by a valid technique like a random sample or systematic sample.
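The allocation in Example 8 is simple enough to check in a few lines of Python. The function name stratum_sizes is made up for illustration. With other percentages the rounded sizes might not add up exactly to the total, so you would want to check the sum.

```python
def stratum_sizes(shares, total):
    """Split a total sample size among strata in proportion to their shares."""
    return {name: round(share * total) for name, share in shares.items()}

sizes = stratum_sizes({"male": 0.45, "female": 0.55}, total=400)
print(sizes)   # {'male': 180, 'female': 220}
```

Each of those mini-samples would then be drawn separately by a valid method such as a random or systematic sample.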

1B5.  Census

Definition: A census is a sample that contains every member of the population.

In many situations, it’s impossible or highly inconvenient to take a census. But with the near-universal computerization of records, a census is practical in many situations where it never used to be.

Example 9: At the push of a button, a librarian can get the exact average number of times that all library books have been checked out, with no need for sampling and estimation. An apartment manager can tell the exact average number of complaints per tenant. And so forth.

A census is the only sample that perfectly represents the population, because it is the whole population. If you can take a census, you’ve reduced a problem of inferential statistics to one of descriptive statistics. But even today, only a minority of situations are subject to a census. For instance, there’s no way to test a drug on every person with the condition that the drug is meant to treat. It’s totally impractical to interview every potential voter and determine his or her preferences. And so forth.

1B6.  Bogus Samples


Any sample where people select the individual members is a bogus sample. That means every sample where people select themselves, and every sample where the interviewer decides whether to include or exclude individual members.

Why is that bad? Remember, a proper sample is a smaller group that is representative of the population. No sample will represent the population perfectly, but you do the best you possibly can.

The good samples listed above can go bad if you make various kinds of mistakes (see “Statistical Errors”, later in this chapter), but a sample that doesn’t depend on the workings of chance is always wrong and cannot be made right. The textbooks will give you names for the types of bad samples (convenience sample, opportunity sample, snowball sample, and so on), but why learn the names when they’re all bogus anyway?

Good Samples | Bad Samples
Chosen through probability methods | Chosen by individual decisions about which persons or things to include
Represent the population as well as possible | Do not accurately represent the population
Uncertainty can be estimated, and can be reduced by increasing sample size | Uncertainty cannot be estimated, and bigger samples don’t help

So goodbye to Internet polls and petitions, letter-writing campaigns, “the first 500 volunteers”, and every other form of self-selected sample. If people select themselves for a sample, then by definition they are not representative because they feel more strongly than the people who didn’t select themselves. You can make statements about the people who selected themselves, but that tells you nothing about the much larger number who didn’t select themselves. (More about this in Simon 2001 [see “Sources Used” at end of book], Web Polls.)

Goodbye also to any kind of poll where the pollster selects the individual people. If you set up a rule that depends on the workings of chance and then follow it, that’s okay. But if you decide on the spur of the moment who gets included, that’s bogus.

Why is it bad to just approach people as you see them? Because studies show that you are more likely to approach people that you perceive to be like you, even if you’re not aware of that. Ask yourself if you are truly equally likely to select someone of a different race or sex from yourself, someone who is dressed much richer or poorer than you, someone who seems much more or much less attractive, and so forth. Unless you’re Gandhi, the honest answer is “not equally likely”. It doesn’t make you a bad person, just a bad pollster like everyone else. If you tend to pick people who are more like you, your sample is not representative of the population.

The same principle applies to studies of non-humans. Here the investigator’s intrinsic biases may be less clear, but unless you choose your sample based on chance you can never be sure that those biases didn’t “skew it up”.

1C.  Data and Variables

Summary: Statistics is all about data and variables, but what exactly do those terms mean? What are the types of data and variables?

This will be an important topic throughout the course, because different variable types are presented differently in descriptive statistics, and again are analyzed differently in inferential statistics. So before you do anything, you need to think about what type of data you’re dealing with.

1C1.  What Are Data? What Are Variables?

Definitions: Variables are the characteristics you’re studying. Data are the values of those characteristics that you record, and the value recorded from any given member of the sample is called a data point or datum.

You can think of the variable as kind of like a question, and the data points as the answers to that question.

If you record one piece of information from each member of the sample, you have univariate data; if you record two pieces of information from each member, you have bivariate data.

Example 10: You record the birth weights of babies born in a certain hospital during the year. The variable is “birth weight”.

Example 11: In April, you ask all the members of your sample whether they had the flu vaccine that year and how many days of work or school they lost because of colds or flu. (Can you see at least two problems with that second question? If not, you will after you read about Nonsampling Errors, later in this chapter.) This is bivariate data. One variable is “flu shot?” and the data points are all yes or no; the other variable is “days lost to colds and flu” and the data points are whole numbers.

1C2.  Quantitative or Qualitative?

Definitions: Quantitative data are data that are numbers. Quantitative data are also called numeric data.

Numeric data are subdivided into discrete and continuous data. Discrete data are whole numbers and typically answer the question “how many?” Continuous data can take on any value (or any value within a certain range) and typically answer the question “how much?”

Qualitative data are data that are not numbers. Qualitative data are also called non-numeric data, attribute data or categorical data.


Sometimes we talk about data types, and sometimes about variable types. They’re the same thing. For instance, “weight of a machine part” is a continuous variable, and 61.1 g, 61.4 g, 60.4 g, 61.0 g, and 60.7 g are continuous data.

Quantitative (numeric) | Qualitative (categorical or non-numeric)
You get a number from each member of the sample. | You get a yes/no or a category from each member of the sample.
The data have units (inches, pounds, dollars, IQ points, whatever) and can be sorted from low to high. | The data may or may not have units and do not have a definite sort order.
It makes sense to average the data. | Your summary is counts or percentages in each category.
Examples (discrete): number of children in a family, number of cigarettes smoked per day, age at last birthday. Examples (continuous): height, salary, exact age | Examples: hair color, marital status, gender, country of birth, and opinion for or against a particular issue

Continuous or discrete data? Sometimes when you have numeric data it’s hard to say whether you have discrete or continuous data. But since you’ll graph them differently, it’s important to be clear on the distinction. Here are two examples of doubtful cases: salary and age.

It’s true that your salary can be only a whole number of pennies. But there are a great many possible values, and the distance between the possible values is quite small, so you call salary a continuous variable. Besides, you don’t ask “how many pennies do you make?” but rather “how much do you make?”

What about age? Well, age at last birthday is clearly discrete since it can be only a whole number: “how many years old were you at your last birthday?” But age now, including years and months and days and fractions of days, would be continuous, again because you can subdivide it as finely as desired.

1C3.  Summary Statements

When you see a summary statement, you have to do a little mental detective work to figure out the data type. Always ask yourself, what was the original measurement taken or question asked?

Example 12: “The average salary at our corporation is $22,471.” The original measurement was the salary of each individual, so this is continuous data.

Example 13: “The average American family has 1.7 children.” Don’t let “1.7” fool you into identifying this as a continuous variable! What was the original question or measurement? “How many children are there in your family?” That’s discrete data.

Example 14: “Four out of five dentists surveyed recommend Trident sugarless gum for their patients who chew gum.” Yes, there are numbers in the summary statement, but the original question asked of each dentist was “Do you recommend Trident?” That is a yes/no question, so the data type is categorical.

1D.  Statistical Errors

Summary: In statistics, an error is not necessarily a mistake. This section explores the types of statistical errors and where they come from.

Definition: An error is a discrepancy between your findings and reality. Some errors arise from mistakes, but some are an inevitable part of the sampling process.

1D1.  Sampling Error


Even if you make no mistakes, inevitably samples will vary from each other and any given sample is almost sure to vary from the population. This variability is called sampling error. (It would probably be more helpful to call it sample variability, but we’re stuck with “sampling error”.)

Sampling error “refers to the difference between an estimate for a population based on data from a sample and the ‘true’ value for that population which would result if a census were taken.” (Australian Bureau of Statistics 2013) [see “Sources Used” at end of book]

Except for a census, no sample is a perfect representation of the population. So the sample mean (average), for example, will usually be a bit different from the population mean. Sampling errors are unavoidable, even if you do everything right when you take a random sample. They’re not mistakes, they’re just part of the nature of things.

Although sampling error cannot be eliminated, the size of the error can be estimated, and it can be reduced. For a given population, a larger sample size gives a smaller sampling error. You’ll learn more about that when you study sampling distributions.
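To see the effect of sample size for yourself, here is a minimal simulation sketch in Python. Everything in it is made up for illustration: the population of 10,000 heights, the function name typical_sampling_error, and the number of trials. It draws many simple random samples of each size and reports how far, on average, the sample mean lands from the population mean.

```python
import random
import statistics

def typical_sampling_error(population, sample_size, trials=500, seed=1):
    """Average absolute gap between the sample mean and the population mean."""
    rng = random.Random(seed)
    true_mean = statistics.fmean(population)
    gaps = []
    for _ in range(trials):
        sample = rng.sample(population, sample_size)  # simple random sample
        gaps.append(abs(statistics.fmean(sample) - true_mean))
    return statistics.fmean(gaps)

# Hypothetical population: 10,000 invented heights in inches.
maker = random.Random(0)
population = [maker.gauss(67, 4) for _ in range(10_000)]

for n in (10, 100, 1000):
    print("sample size", n, "-> typical error",
          round(typical_sampling_error(population, n), 3))
```

When you run it, the typical gap should shrink as the sample size grows, though it never reaches zero unless you take a census.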

1D2.  Nonsampling Errors

Definition: Nonsampling errors are discrepancies between your sample and reality that are caused by mistakes in planning, in collecting data, or in analyzing data.

Nonsampling errors make your sample unrepresentative of the population and your results questionable if not useless. Unlike sampling errors, nonsampling errors cannot be reduced by taking larger samples, and you can’t even estimate the size of most nonsampling errors. Instead, the mistakes must be corrected, and probably a new sample must be taken.

There are many types of nonsampling errors. Different authors give them different names, but it’s much more important for you to recognize the bad practice than to worry about what to name it. In taking your own samples, and in evaluating what other people are telling you about their samples, always ask yourself: What could go wrong here? Has anything been done that can make this sample unrepresentative of the population? Here are some of the more common types of nonsampling errors. After you read through them, see how many others you can think of.

Self-Selected Samples

A self-selected sample is almost always bogus. People who select themselves are by definition different from people who don’t, which means they are not representative. It can be very hard to know whether that difference matters in the context of a particular study. Since you can never be sure, it is safest to avoid the problem and not let people select themselves.

But medical studies all use volunteers. (They have to, ethically.) Why doesn’t that make the sample bogus? They’re volunteers, but usually they’re not self-selected volunteers. For example, researchers may ask doctors and hospitals to suggest patients who meet a particular profile; they use probability techniques to select a sample from that pool.

But things are not always simple. For example, some companies or researchers may advertise and pay volunteers to undergo testing. In this case you have to ask very serious questions about whether the volunteers are representative of the general population. Statistical thinking isn’t a matter of black and white; some pretty sophisticated judgment can be involved. Your take-away is: don’t accept anything at face value, but always ask: What important facts are being left out? What does that do to the credibility of the results?

Sampling Bias

Definition: Sampling bias results from taking a sample in a way that tends to over- or under-represent some subgroup of the population.

Example 15: If you’re doing a survey on student attitudes toward the cafeteria, and you conduct the survey in the cafeteria, you are systematically under-representing students who don’t use the cafeteria. It seems logical that attitudes are more negative among students who don’t use the cafeteria than among students generally, so by excluding them you will report overall attitude as more favorable than it really is.

“Bias” is a good example of the words in statistics that don’t have their ordinary English meaning. You’re not prejudiced against students who dislike the cafeteria. “Bias” in statistics just means that something tends to distort your results in a particular direction.

Example 16: The classic example of sampling bias is the Literary Digest fiasco in predicting that Landon would beat Roosevelt in the 1936 election. The magazine sent questionnaires to all its subscribers, it phoned randomly selected people in telephone books, and it left stacks of questionnaires at car dealerships with instructions to give one to every person who test drove a car. The sample size was in the millions.

This procedure systematically over-represented people who were well off and systematically under-represented poorer people. In 1936 the Great Depression still held sway, and most people did not have the disposable income to subscribe to a fancy magazine, let alone pay for a home telephone; the very thought of buying a car would have struck them as ridiculous or insulting. In that era, the Republicans appealed more to the rich and the Democrats more to the working class. So the net effect of the Literary Digest’s procedure was that it made the country look a lot more Republican than it actually was. Since Landon was a Republican and FDR a Democrat, FDR’s actual support was much greater than shown by the poll, and Landon’s was much less.

Notice that a sample size of millions did not overcome sampling bias. A larger sample size is not an answer to nonsampling errors.
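Here is a small simulation sketch of that point in Python. It does not use the 1936 data: the population of 200,000 voters, the share who are well off, and the support percentages are all invented, and share_for_a is just an illustrative helper. The sketch compares a large sample that favors well-off voters against a much smaller simple random sample.

```python
import random

rng = random.Random(1936)  # seed chosen only so the run is repeatable

# Hypothetical population of 200,000 voters; all percentages are invented.
population = []
for _ in range(200_000):
    well_off = rng.random() < 0.25
    # Assumption: well-off voters favor candidate A far more often.
    favors_a = rng.random() < (0.70 if well_off else 0.35)
    population.append((well_off, favors_a))

def share_for_a(voters):
    """Proportion of the given voters who favor candidate A."""
    return sum(favors for _, favors in voters) / len(voters)

# Biased procedure: well-off voters are ten times as likely to be polled.
biased_sample = [v for v in population
                 if rng.random() < (0.50 if v[0] else 0.05)]

# Unbiased procedure: a much smaller simple random sample.
random_sample = rng.sample(population, 1_000)

print("true share for A:", round(share_for_a(population), 3))
print("biased sample, n =", len(biased_sample),
      "->", round(share_for_a(biased_sample), 3))
print("random sample, n =", len(random_sample),
      "->", round(share_for_a(random_sample), 3))
```

Under these made-up assumptions, the biased sample is many times larger than the random one yet lands much farther from the true share. Extra size only makes the biased answer more precisely wrong.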

The Digest’s original article can be found in Landon in a Landslide: The Poll That Changed Polling (American Social History Project) [see “Sources Used” at end of book].

While we’re on the subject of presidential elections, different nonsampling errors also led to wrong predictions of a Dewey victory over Truman in 1948. For analyses of both the 1936 and the 1948 statistical mistakes, see Classic Polling Surprises (2004) [see “Sources Used” at end of book] and Introduction to Polling (n.d.) [see “Sources Used” at end of book].

Selection Bias

Beyond sampling bias, there are many other bad practices in selecting your sample that can bias the results. Wikipedia’s Selection Bias [see “Sources Used” at end of book] has a good rundown of quite a few.

Non-Response Errors

If you’re taking a mail survey, a significant number of people (probably a majority) won’t respond. Are the responders representative of the non-responders, or has a bias been introduced by the non-response? That’s a tough question, and the answer may not always be clear.

For this reason, mail surveys are often coded so that the investigators can tell who did respond, and follow up with those who didn’t. That follow-up can be more mail, a phone call, or a visit.

Even with in-person polls, non-response is a problem: many people will simply refuse to participate in your survey. Depending on what you’re surveying, that could be unimportant or it could be a fatal flaw.

Response Errors

Definition: Response errors occur when respondents give answers that don’t match objective truth or their own actual opinions.

Poorly worded survey questions are a major source of response errors, and lead to biased results or completely meaningless results. There may not be a perfect survey question, but having several people review the questions against a list of possible problems will greatly reduce the level of response errors.

But response errors can never be completely eliminated. For instance, people tend to shade their answers to make themselves look good in their own eyes or in the interviewer’s eyes. Most people rate themselves as better-than-average drivers, for example, which obviously can’t be true. And self-reporting of objective facts is always suspect because memory is unreliable.

Example 17:

Data Errors

These include mistakes by interviewers in recording respondents’ answers, mistakes by investigators in measuring and recording data, and mistakes in entering the recorded data.

Inappropriate Analysis

In the second half of the course you’ll learn a number of inferential statistics procedures. Each one is appropriate in some circumstances and inappropriate in others. If you use the wrong form of analysis in a given situation, or you apply it wrongly, your results will be about as good as the results from using a hammer to drive a screw.

1E.  Observation and Experiment

Summary: There are two main methods of gathering data, the observational study and the experiment. Learn the differences, and what each one can tell you.

1E1.  Observational Study Versus Designed Experiment

Many, many statistical investigations try to find out whether A causes B. To do this, you have two groups, one with A and one without A, or you have multiple groups with different levels of A. You then ask whether the difference in B among the groups is significant. The two main ways to investigate a possible connection are the observational study and the experiment.

The concepts aren’t hard, but there’s a boatload of vocabulary. Let’s get through the definitions first, and then have some concrete examples to show how the terms are used. Please read the definitions first, then read the first example and refer back to the definitions; repeat for the other examples.

Definition: In an observational study, the investigator simply records what happens (a prospective study) or what has already happened (a retrospective study). In an experiment, the investigator takes a more active role, randomly assigning members of the sample to different groups that get treated differently.

Which is better? Well, in an observational study, you always have the problem of wondering whether the groups are different in some way other than what you are studying. This means that an observational study can never establish cause. The best you can do after an observational study is to say that you found an association between two variables.

How do we establish cause, when for ethical or practical reasons we can’t do an experiment? The nine criteria are listed in Causation (Simon 2000b [see “Sources Used” at end of book]) and were first laid down by Sir Austin Bradford Hill in 1965.


In an observational study or an experiment, there are two or more variables. You want to show that changes in one or more of them, called the explanatory variables, go with changes in one or more response variables.

Explanatory variables are the suspected causes, and response variables are the suspected effects or results.

Example 18: Over the course of a year, you have parents record the number of minutes they spend every day reading to their child, and at the end of the year you record each child’s performance on standard tests. The explanatory variable is parental time spent reading to the child, and the response variable(s) are performance on the standardized test(s).

Definitions: In an experiment, the experimenter manipulates the suspected cause(s), called explanatory variable(s) or factor(s). A specific level of each factor is administered to each group. The level(s) of the explanatory variable(s) in a given group are known as its treatment.

Example 19: To test productivity of factory workers, you randomly assign them to three groups. One group gets an extra hour at lunch, one group gets half-hour breaks in morning and afternoon, and one group gets six 10-minute breaks spaced throughout the day. The explanatory variable or factor is structuring of break time, and the three levels or treatments are as described.

Definitions: In an experiment, each member of a sample is called a unit or an experimental unit. However, when the experiment is performed on people they are called subjects or participants.


In any study or experiment, results will vary for individuals within each group, and results will also vary between the groups as a whole. Some of that variation is due to chance: it is expected statistical variability or sampling error. If the differences between groups are bigger than the variation within groups — and enough bigger, according to some calculations you’ll learn later — then the investigator has a significant result. A significant result is a difference that is too big to be merely the result of normal statistical variability.

I’ll have a lot more to say about significance when you study Hypothesis Tests.

Confounding and Lurking Variables

In Example 18, about reading to children, you find generally that the more time parents spend reading to first graders, the better the children tend to do on standard tests of reading level.

Is the reading time responsible for the improved test scores? You can easily think of other possible explanations. Parents who spend more time reading to their children probably spend more time with them in general. They tend to be better off financially — if you’re working two jobs to make ends meet, you probably have little time available for reading to your children. Economic status and time spent with children in general are examples of lurking variables in this study.

Definition: A hidden variable that isn’t measured and isn’t part of your design but affects the outcome is called a lurking variable.

Example 20: In a large elementary school, you schedule half the second grade to do art for an hour, two mornings a week, with the district’s art teacher. The other half does art for an hour, two afternoons a week, with the same teacher, but they are told at the beginning that all their projects will be displayed and prizes given for the best ones.

Can you learn anything from this about whether the chance to win prizes prompts children to do a better job on art projects? The problem is that there’s not just one difference in treatment here, the promised prizes. There’s also the fact that everyone’s project will be on display. And maybe mornings are better (or worse) for doing art than afternoons. Maybe the teacher is a morning person and fades in the afternoon, or is not a morning person and really shines in the afternoon. Even if there’s a difference in quality of the projects, you can’t make any kind of simple statement about the cause, because of these confounding variables.


A confounding variable is “associated in a non-causal way with a factor and affects the response.” (DeVeaux, Velleman, Bock 2009, 346 [see “Sources Used” at end of book])

“Confounding occurs in an experiment when you [can’t] distinguish the effects of different factors.” (Triola 2011, 32 [see “Sources Used” at end of book])

In the art example, you wanted to find out whether promising prizes makes children do better art work. But the promise of prizes wasn’t the only difference between the two groups. Time of day and public display are confounding variables built into the design of this experiment. You know what they are, but you can’t untangle their effect from the effect of what you actually wanted to study.

What’s the difference between lurking variables and confounding variables? Both confuse the issue of whether A causes B.

A lurking variable L is associated with or causes both A and B, so any relationship you see between A and B is just a side effect of the L/A and L/B relationships. For example, counties with more library books tend to have more murders per year. Does reading make people homicidal? Of course not! The lurking variable is population size. High-density urban counties have more books in the library and more murders than low-density rural counties.
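A short Python sketch can make the library example concrete. The 300 counties, the rates, and the helper pearson_r are all invented for illustration. Books and murders are each generated from population alone and never from each other, yet the raw counts come out strongly correlated. Dividing by population (books per person, murders per person) takes the lurking variable out, and the association should all but vanish.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(42)
counties = []
for _ in range(300):
    population = rng.uniform(10_000, 1_000_000)           # lurking variable L
    books = population * 2 * rng.uniform(0.7, 1.3)        # A depends only on L
    murders = population * 5e-5 * rng.uniform(0.5, 1.5)   # B depends only on L
    counties.append((population, books, murders))

# Raw counts: books versus murders.
r_counts = pearson_r([b for _, b, _ in counties], [m for _, _, m in counties])

# Per-person rates: the effect of population size is removed.
r_rates = pearson_r([b / p for p, b, _ in counties],
                    [m / p for p, _, m in counties])

print("correlation of raw counts:      ", round(r_counts, 2))
print("correlation of per-person rates:", round(r_rates, 2))
```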

A confounding variable C is typically associated with A but doesn’t cause it, so when you look at B you don’t know whether any effect comes from A, from C, or from both. For example, after a year with a lot of motorcycle deaths, a state passes a strict helmet law, and the next year there are significantly fewer deaths. Was the helmet law responsible? Maybe, but time is a confounding variable here. Were motorcyclists shocked at the high death toll, so that they started driving more carefully or switched to other modes of transit?

Don’t obsess over the difference between lurking and confounding variables. Some authors don’t even make a distinction. You should recognize variables that make results questionable; that’s a lot more important than what you call them.

That said, if you want to see two more takes on the difference, have a look at Confounding and Lurking Variables (Virmani 2012 [see “Sources Used” at end of book]) and Confounding Variables (Velleman 2005 [see “Sources Used” at end of book]).

Lurking and confounding variables are the boogeyman of any statistical work. Lurking variables are the reason that an observational study can show only association, not causation. In experiments, you have the potential to exclude lurking variables, or at least to minimize them, but it takes planning and extra work, and you need to be careful not to create a design with built-in confounding.

Whenever any experiment claims that A causes B, ask yourself what lurking variables there might be, and whether the design of the study has ruled them out. You can’t take this for granted, because even professional researchers sometimes cut corners, knowingly or unknowingly.

Extended Examples

Example 21: Does smoking cause lung cancer?
Initial studies in the mid-20th century had three or four groups: non-smokers, light smokers, moderate smokers, and heavy smokers. They looked at the number and severity of lung tumors in the groups to see whether there was a significant difference, and in fact they found one.

This was an observational study. Ethically it had to be: if you suspect smoking is harmful you can’t assign people to smoke.

Explanatory variable: smoking level (none, light, moderate, heavy). Levels or treatments don’t apply to an observational study.

Response variable: tumor production

Because this was an observational study, there was no control for lurking variables, and even with a significant result you can’t say from this study that smoking causes lung cancer. What lurking variables could there be? Well, maybe some genetic factor both makes some people more likely to smoke and makes them more susceptible to lung cancer. This is a problem with every observational study that finds an effect: you can’t rule out lurking variables, and therefore you can’t infer causation, no matter how strong an association you find.

Since you can’t do an experiment on humans that involves possibly harming them, how do you know that smoking causes lung cancer? A good explanation is in Causation (Simon 2000b [see “Sources Used” at end of book]).

Example 22: Does aspirin make heart attacks less likely?
Here you can do an experiment, because aspirin is generally recognized as safe. Investigators randomly assigned people to two groups, gave aspirin to one group but not the other, and then monitored the proportion who had heart attacks. They found a significantly lower risk of heart attack in the aspirin group.

This was a designed experiment.

Explanatory variable: aspirin. There were two levels or treatments: yes and no.

Response variable: heart attack (yes/no)

From this experiment, you can say that aspirin reduces the risk of heart attack. How can you be sure there were no lurking variables? By randomly assigning people to the two groups, investigators made each group representative of the whole population. For example, overweight is a risk factor for heart attacks. The random assignment ensures that overweight people form about the same proportion in each group as in the population. And the same is true for any other potential lurking variable. (It helps to have larger samples, and in this study each sample was about 10,000 people.)

Example 23: Does prayer help surgical patients?
Here again, no one thinks prayer is harmful, so ethically the experimenters were in the clear to assign cardiac-bypass patients randomly to three groups: people who knew they were prayed for, people who were prayed for and didn’t know it, and people who were not prayed for. Investigators found no significant difference in frequency of complications between the patients who were prayed for and those who were not prayed for.

This was a designed experiment.

Explanatory variables: receipt of prayer (two levels, yes and no) and knowledge of being prayed for (also two levels, yes and no). There were three treatments: (a) receipt=yes and knowledge=yes, (b) receipt=yes and knowledge=no, (c) receipt=no.

Response variable: occurrence of post-surgical complications (yes/no).

(You can read an abstract of the experiment and its results in Study of the Therapeutic Effects of Intercessory Prayer [Benson 2006 [see “Sources Used” at end of book]]. The full report of the experiment is in Benson 2005 [see “Sources Used” at end of book].)

Because lurking variables can’t be ruled out in an observational study, investigators always prefer an experiment if possible. If ethical or other considerations prevent doing an experiment, an observational study is the only choice. But then the best you can hope for is to show an association between the two variables. Only with an experiment do you have a hope of showing causation.

1E2.  Experimental Techniques

Okay, so you always have to do an experiment if you want to show that A causes B. Let’s look in more detail at how experiments are conducted, and learn best practices for an experiment.

Caution: Design of Experiments is a specialized field in statistics, and you could take a whole course on just that. This chapter can only give you enough to make you dangerous.☺ While you’re planning your first experiment in real life, it’s a good idea to get help from someone senior or a professional statistician.

R. A. Fisher “virtually invented the subject of experimental design” (Upton and Cook 2008, 144 [see “Sources Used” at end of book]), and pioneered many of the techniques that we use today. He was a great champion of planning: Upton & Cook quote him as saying “To call in the statistician after the experiment is done may be no more than asking him to perform a postmortem examination: he may be able to say what the experiment died of.”

Completely Randomized Design

Definitions: Experimenters randomly assign members to the various treatment groups. We say that they have randomized the groups, and this process is called randomization.

Why randomize? Why not just put the first half of the sample in group A and the second half in group B? Because randomization is how you control for lurking variables.

Think about the study with aspirin and heart attacks. You know that different individuals are more or less susceptible to heart attacks. Risk factors include smoking, obesity, lack of exercise, and family history. You want your aspirin group and your non-aspirin group to have the same mix of smokers and non-smokers as the general population, the same mix of obese and non-obese individuals, and so on. Actually it’s harder than that. There aren’t just “smokers” and “non-smokers”; people smoke various amounts. There aren’t just “obese” and “fit” people, but people have all levels of fitness.

It would be very laborious to do stratified samples and get the right proportions for a lot of variables. You’d have to have a huge number of strata. And even if you did do those matchups, taking enormous trouble and expense, what about the variables you didn’t think of? You can never be sure that the samples have the same composition as the population.

It really must be random assignment: you can’t just assign test subjects to groups alternately. Steve Simon (2000a) explains why, with examples, in Alternating Treatments [see “Sources Used” at end of book].

Randomization is the indispensable way out. Instead of trying to match everything up yourself — and inevitably failing — you let impersonal random chance do your work for you.

Are you guaranteed that the sample will perfectly represent the population? No, you’re not. Remember sampling error, earlier in this chapter. Samples vary from the population; that’s just the nature of things. But when you randomize, in the long run most of your samples will be representative enough, even though they’re not perfectly representative.
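As a rough illustration, here is a Python sketch of a completely randomized design. The 20,000 volunteers, their traits, and the percentages are all invented, and completely_randomize and share are hypothetical helper names. The point to watch is the "unmeasured" trait: the assignment never looks at it, yet random chance should still balance it between the groups about as well as the traits you did think of.

```python
import random

def completely_randomize(subjects, seed=None):
    """Shuffle the subjects and split them into two equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def share(group, trait):
    """Proportion of a group having the given yes/no trait."""
    return sum(person[trait] for person in group) / len(group)

# Hypothetical volunteers; the traits and percentages are invented.
maker = random.Random(0)
volunteers = [{"smoker": maker.random() < 0.20,
               "family_history": maker.random() < 0.30,
               "unmeasured": maker.random() < 0.10}  # a variable nobody thought of
              for _ in range(20_000)]

aspirin, no_aspirin = completely_randomize(volunteers, seed=1)
for trait in ("smoker", "family_history", "unmeasured"):
    print(trait, round(share(aspirin, trait), 3),
          round(share(no_aspirin, trait), 3))
```

With groups this large the two columns should agree closely. With small groups they can differ noticeably.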

Randomized Block Design

Notice I said that randomization works in the long run. But in the short run it may not. Suppose you are testing a weight-loss drug on a group of 100 volunteers, 50 men and 50 women. If you completely randomize them, you might end up with 20 men and 30 women in one group, and 30 men and 20 women in the other. (There’s about a 20% chance of this, or a more extreme split.)
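How likely is a split that lopsided? The answer depends on exactly how the randomizing is done, so here is a Python sketch under two stated assumptions; both function names are made up for illustration. The first assumes each man lands in the first group by an independent fair coin flip, which gives a figure of roughly 20%. The second assumes exactly 50 of the 100 volunteers are drawn for the first group, which makes so uneven a split less likely, though still far from impossible.

```python
from math import comb

def coin_flip_split(n_men=50, low=20, high=30):
    """Chance that the first group gets <= low or >= high of the men,
    assuming each man is assigned by an independent fair coin flip."""
    ways = sum(comb(n_men, k) for k in range(n_men + 1)
               if k <= low or k >= high)
    return ways / 2 ** n_men

def fixed_size_split(n_men=50, n_women=50, group_size=50, low=20, high=30):
    """Same chance, assuming exactly group_size volunteers are drawn
    at random for the first group (a hypergeometric model)."""
    fewest = max(0, group_size - n_women)
    most = min(n_men, group_size)
    ways = sum(comb(n_men, k) * comb(n_women, group_size - k)
               for k in range(fewest, most + 1)
               if k <= low or k >= high)
    return ways / comb(n_men + n_women, group_size)

print("coin flip for each person:", round(coin_flip_split(), 3))
print("two groups of exactly 50: ", round(fixed_size_split(), 3))
```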

Why is this bad? Because you don’t know whether men and women respond differently to the drug. If you see a difference between your 20/30 placebo group and your 30/20 treatment group, you don’t know how much of that is the drug and how much is the difference between men and women. Gender is a confounding variable.

What to do? Create blocks, in this case a block of the 50 men and a block of the 50 women. Then within each block you randomly assign individuals to receive medication or a placebo. Now you can find how the drug affects women and how it affects men. This is called a randomized block design. When you can identify a potentially confounding variable before you perform your experiment, first divide your subjects into blocks according to that variable, and then randomize within each block. Do this, and you have tamed that confounding variable.
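The procedure can be sketched in a few lines of Python. This is only an illustration under assumptions: the function randomized_block_assignment, the subject records, and the treatment labels are hypothetical. Subjects are first sorted into blocks by the blocking variable, then shuffled within each block, and treatments are dealt out in turn so that every block is split evenly.

```python
import random
from collections import Counter, defaultdict

def randomized_block_assignment(subjects, block_of,
                                treatments=("drug", "placebo"), seed=None):
    """Randomly assign treatments separately within each block."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for subject in subjects:
        blocks[block_of(subject)].append(subject)   # form the blocks
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)                        # randomize within the block
        for position, subject in enumerate(members):
            assignment[subject["id"]] = treatments[position % len(treatments)]
    return assignment

# Hypothetical volunteers: 50 men and 50 women.
volunteers = [{"id": i, "gender": "M" if i < 50 else "F"} for i in range(100)]
assignment = randomized_block_assignment(
    volunteers, block_of=lambda s: s["gender"], seed=2)

counts = Counter((s["gender"], assignment[s["id"]]) for s in volunteers)
for key in sorted(counts):
    print(key, counts[key])
```

Each gender ends up with 25 people on the drug and 25 on the placebo, so a 20/30 split cannot happen.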

R. A. Fisher coined the term “randomized block” in 1926.

In this example, gender would be called a blocking variable because you divide your subjects into blocks according to gender. Now there’s no problem separating the effects of the drug from the effects of gender in your experimental results.

When I talked about complete randomization, I said it would be laborious to take strata of a lot of variables, and that complete randomization was the answer. But here I’m suggesting exactly that for men and women in the weight-loss study. Right about now, you might be telling me, “Make up your mind!”

This is where some judgment is needed in making tradeoffs. Men and women typically have different percentages of body fat, and they are known to respond differently to some drugs. It makes sense that a weight-loss drug could produce a different response in men than in women, and therefore you block on gender. But no other factor stands out as both important and measurable. If you tried to block on motivation, for instance, how would you measure it?

“Block what you can, randomize what you cannot” is a good rule, sometimes attributed to George Box. A variable is a candidate for blocking when it seems like it could make a difference, and you can identify and measure it. For other variables, we depend on randomization, either complete randomization or randomization within blocks.

Matched Pairs

There’s one circumstance where you can be sure that the subgroups are perfectly matched with respect to lurking variables: when you use a matched-pairs design. This is kind of like a randomized block design where each block contains two identical subjects.

Example 24: You want to know whether one form of foreign-language instruction is more effective than another. So you take fifty pairs of identical twins, and assign one twin from each pair to group A and the other twin to group B. Then you know that genetic factors are perfectly balanced between the two groups. And if you restrict yourself to twins raised together, you’ve also controlled for environmental factors.

A special type of matched-pairs design matches each experimental unit to itself.

Example 25: You want to know the effect of caffeine on heart rate. You don’t assemble a sample, give coffee to half of them, and measure the difference in heart rate between the groups. People’s heart rates vary quite a bit, so you would have large variation within each group, and that might swamp the effect you’re looking for.

Instead, you measure each individual’s resting heart rate, then give him or her a cup of coffee to drink, and after a specified time measure the heart rate again. By comparing each individual to himself or herself, you determine what effect caffeine has on each person’s heart rate, and people’s different resting heart rates aren’t an issue.
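A small simulation sketch in Python shows why comparing each person to himself or herself helps. All the numbers are invented: 40 people, resting rates that vary widely, and an assumed caffeine effect of 3 beats per minute (ASSUMED_EFFECT is a made-up constant, not a research finding).

```python
import random
import statistics

rng = random.Random(7)
ASSUMED_EFFECT = 3.0   # invented: beats per minute added by one cup of coffee

# Hypothetical resting heart rates for 40 people; they differ a lot.
resting = [rng.gauss(72, 10) for _ in range(40)]
# Each person's rate after coffee: own resting rate + effect + a little noise.
after = [r + ASSUMED_EFFECT + rng.gauss(0, 1.5) for r in resting]

# Matched pairs: subtract each person's own resting rate.
differences = [a - r for a, r in zip(after, resting)]

print("SD of resting rates (person to person):",
      round(statistics.stdev(resting), 1))
print("SD of the paired differences:          ",
      round(statistics.stdev(differences), 1))
print("mean paired difference:                ",
      round(statistics.fmean(differences), 1))
```

The person-to-person spread is several times the size of the assumed effect, which is what would swamp a two-group comparison. The paired differences have a much smaller spread, so the effect stands out.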

See also: Experimental Design in Statistics shows how the same experiment would work out with a completely randomized design, randomized blocks, and matched pairs.

Control Group and Placebo

Definition: In an experiment, usually one of the treatments will be no treatment at all. The group that gets no treatment is called the control group.

But “no treatment at all” doesn’t mean just leaving the control group alone. They should be treated the same way as the other groups, except that they get zero of whatever the other groups are getting. If the treatment groups are getting injections, the control group must get injections too. Otherwise you’ve introduced a lurking variable: effects of just getting a needle stick, and in humans the knowledge that they’re not actually getting medicine.

Definition: A placebo is a substance that has no medical activity but that the subjects of the experiment can’t tell from the real thing.

The placebo effect is well known. Sick people tend to get better if they feel like someone is looking after them. So if you gave your treatment group an injection but your control group no injection, you’d be putting them in a different psychological state. Instead, you inject your control group with salt water.

TheProfessorFunk has a fun three-minute YouTube video on the placebo effect (Keogh 2011 [see “Sources Used” at end of book]). Thanks to Benjamin Kirk for drawing this to my attention.

You might think placebos would be unnecessary when experimenting on animals. But if you’ve ever had a pet, you know that some animals are stressed by getting an injection. If the control group didn’t get an injection, you’d have those differing stress levels as a lurking variable. So you administer a placebo.

Example 26: Sometimes, for practical or ethical reasons, you have to get a little bit creative with a control group. Here’s an excellent example from Wheelan (2013, 238) [see “Sources Used” at end of book]:

Suppose a school district requires summer school for struggling students. The district would like to know whether the summer program has any long-term academic value. As usual, a simple comparison between students who attend summer school and those who do not would be worse than useless. The students who attend summer school are there because they are struggling. Even if the summer school program is highly effective, the participating students will probably still do worse in the long run than the students who were not required to take summer school. What we want to know is how the struggling students perform after taking summer school compared with how they would have done if they had not taken summer school. Yes, we could do some kind of controlled experiment in which struggling students are randomly selected to attend summer school or not, but that would involve denying the control group access to a program that we think would be helpful.

Instead, the treatment and control groups are created by comparing those students who just barely fell below the threshold for summer school with those who just barely escaped it. Think about it: the [group of all] students who fail a midterm are appreciably different from [the group of all] students who do not fail a midterm. But students who get a 59 percent (a failing grade) are not appreciably different from those students who get a 60 percent (a passing grade).

Double Blind

Some people do better if they think they’re getting medicine, even if they’re not. To avoid this placebo effect, the standard technique is the double blind.

Definitions: In a double-blind experiment, neither the test subjects nor those who administer the treatments know what treatment each subject is getting. In a single-blind experiment, the test subjects don’t know which treatment they’re getting, but the personnel who administer the treatments do know.

Okay, given that people’s thoughts influence whether they improve, a single blind makes sense. If you let someone know they’re not getting medicine in a trial, they’re less likely to improve. But why isn’t that enough? Why is a double blind necessary?

For one thing, there’s always the risk that a doctor or nurse might tell the subject, accidentally or on purpose. But beyond that, if you’re treating someone who has a terrible disease, you might treat them differently if they’re getting a placebo than if they’re getting real medicine, even if you don’t realize you’re doing it. Why take the risk of introducing another lurking variable? Better to use a double blind and just rule out the possibility.

You might wonder how it’s done in practice. In a drug trial, for instance, each test subject is assigned a code number, and the drug company then packages pills or vaccines with a subject’s code number on each. The doctors and nurses who administer the treatments just match the code number on the pill or vaccine to each subject’s code number. Of course all the pills or vaccines look alike, so the workers who have contact with the subjects don’t know who’s getting medicine and who’s getting a placebo. And what they don’t know, they can’t reveal.
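If you like to see a procedure spelled out, here is a minimal Python sketch of the coding idea. It is only an illustration: the function name `assign_codes`, the even split between drug and placebo, and the use of a random shuffle are my assumptions, not a description of how any particular drug company does it.

```python
import random

def assign_codes(code_numbers, seed=None):
    """Return a sealed key mapping each subject's code number to a treatment.

    Hypothetical sketch: half the codes get "drug" and the rest "placebo",
    chosen by a random shuffle. Only the packager keeps this key; the
    doctors and nurses see nothing but the code numbers.
    """
    rng = random.Random(seed)
    shuffled = list(code_numbers)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    sealed_key = {}
    for position, code in enumerate(shuffled):
        sealed_key[code] = "drug" if position < half else "placebo"
    return sealed_key

sealed_key = assign_codes(range(1, 11), seed=1)   # ten made-up code numbers
clinic_view = sorted(sealed_key)                  # what the clinic staff get
print(clinic_view)
```

The point of the design is that the list the clinic sees carries no treatment information at all, so there is nothing for the staff to reveal.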

1F.  Sharp Points

1F1.  Rounding and Significant Digits

You’ll be dealing with numbers through most of this course. Handle them right, and you won’t get burned! There are three issues here: how many digits to round to, how to round to that number of digits, and when to do your rounding.

How Many Digits?

There are a lot of rules for how many digits you should round to, but we’re not going to be that rigorous in this course. Instead, you’ll use common sense supplemented by a few rules of thumb. What’s common sense? Avoid false precision, and avoid overly rough numbers.

The rules are important, but we have only so much time, and you’ve probably learned them in your science courses. If you want to, look up “significant figures” or “significant digits” in the index of pretty much any science textbook, or look at Significant Figures/Digits [see “Sources Used” at end of book].

Example 27: When you fill your car’s gas tank, the pump shows the number of gallons to three decimal places. You can also describe that as the nearest thousandth of a gallon. How much gas is that? Convert it to teaspoons (Brown 2009 [see “Sources Used” at end of book]): (0.001 gal) × (4 qt/gal) × (4 cup/qt) × (16 Tbsp/cup) × (3 tsp/Tbsp) ≈ 0.8 tsp. You can bet there’s several times that much in the hose when the pump shuts off. Three decimal places at the gas pump is false precision a/k/a spurious accuracy. That third decimal place is just noise, statistical fluctuations without real significance.
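If you want to check that conversion on a computer, here is a small Python sketch. The conversion factors are the ones in the example; the function name `gallons_to_teaspoons` is just a label I made up for illustration.

```python
# Conversion factors from the example above (US customary units).
QT_PER_GAL = 4
CUP_PER_QT = 4
TBSP_PER_CUP = 16
TSP_PER_TBSP = 3

def gallons_to_teaspoons(gallons):
    """Convert gallons to teaspoons by chaining the unit factors."""
    return gallons * QT_PER_GAL * CUP_PER_QT * TBSP_PER_CUP * TSP_PER_TBSP

# One unit in the third decimal place of the pump display.
smallest_step = gallons_to_teaspoons(0.001)
print(round(smallest_step, 1))  # 0.8
```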

On the other hand, suppose the pump showed only whole gallons. This is too rough. You can go along pumping gas for no extra charge (bad for the merchant), and then abruptly the cost jumps by several dollars (bad for you).

Here are some rules of thumb to supplement your common sense. These are not matters of right and wrong, but conventions to save thinking time:

How to Round Numbers

Round in one step. Say you have a number 1.24789, and you want to round it to one decimal place. Draw a line — mentally or with your pencil — at the spot where you want to round: 1.2|4789. If the first digit to the right of that line is a 0, 1, 2, 3, or 4, throw away everything to the right of the line. It doesn’t matter what digits come after that first digit. Here, the first digit to the right of the line is a 4, so you throw away everything to the right of the line: 1.24789 rounded to one decimal place is 1.2.

Rounding in multiple steps, 1.24789 → 1.2479 → 1.248 → 1.25 → 1.3, is wrong. (Why? Because 1.24789 is 0.05211 units away from 1.3, but only 0.04789 units away from 1.2.) You must round in one step only.

As you know, if the first digit to the right of the line is a 5, 6, 7, 8, or 9, you raise the digit to the left of the line by one and throw away everything to the right of the line. To one decimal place, 1.27489 is 1.2|7489 → 1.3.

You may need to “carry one”. What is 1.97842 to one decimal place? 1.9|7842 needs you to increase that 9 by one. That means it becomes a zero and you have to increase the next digit over: 1.9+0.1 = 2.0. Therefore, 1.97842 rounded to one decimal place is 2.0.
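Here is a short Python sketch of one-step versus multi-step rounding, in case you want to experiment. I use the `decimal` module with `ROUND_HALF_UP` because it matches the digit rule above; Python's built-in `round()` works on binary floats and rounds ties to the even digit, so it can disagree with hand rounding. The helper name `round_half_up` is my own.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(text, places):
    """Round a decimal string to `places` decimal places in a single step."""
    quantum = Decimal(1).scaleb(-places)   # places=1 gives Decimal('0.1')
    return Decimal(text).quantize(quantum, rounding=ROUND_HALF_UP)

# The right way: one step.
one_step = round_half_up("1.24789", 1)

# The wrong way: round to 4 places, then 3, then 2, then 1.
multi_step = Decimal("1.24789")
for places in (4, 3, 2, 1):
    multi_step = round_half_up(str(multi_step), places)

print(one_step)                       # 1.2
print(multi_step)                     # 1.3
print(round_half_up("1.27489", 1))    # 1.3
print(round_half_up("1.97842", 1))    # 2.0
```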

When to Round Numbers

Here’s the Big No-No: Never do further calculations with rounded numbers. What’s the right way? Round only after the last step in calculation.

Example 28: True story: In Europe, average body temperature for healthy people was determined to be 36.8°C, as repeated in A Critical Appraisal of 98.6°F (Mackowiak, Wasserman, Levine 1992 [see “Sources Used” at end of book]). Rounding to the nearest degree, the average human body temperature is 37°C. So far so good.

But in the US, thermometers for home use are marked in degrees Fahrenheit. Some nimrod converted 37°C using the good old formula 1.8C+32 and got 98.6°F, and that’s what’s marked on millions of US thermometers as “normal” temperature. If you’ve got one of those, ask for your money back, because it’s wrong.

Why is it wrong? The person who did the conversion committed the Big No-No and did further calculations with a rounded number. For a correct calculation, use the unrounded number, 36.8. (Okay, 36.8 was probably rounded from 36.77 or 36.82 or something. But the point is that it’s the least rounded number available.) 1.8×36.8+32 = 98.24 → 98.2, and that is the average body temperature for healthy humans.
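You can see the Big No-No in action with a few lines of Python. This is only a sketch of the arithmetic in the example; the variable names are mine.

```python
def c_to_f(celsius):
    """Convert Celsius to Fahrenheit with the formula 1.8C + 32."""
    return 1.8 * celsius + 32

measured_c = 36.8   # the least rounded value available

# Right: convert the unrounded number, then round once at the end.
right = round(c_to_f(measured_c), 1)

# Wrong: round to the nearest degree first, then do further calculation.
wrong = round(c_to_f(round(measured_c)), 1)

print(right)  # 98.2
print(wrong)  # 98.6
```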

1F2.  Powers of 10 from Your Calculator

calculation ending in 1.994017946E-4

When a calculation results in a number lower than about 0.0005, your calculator will usually present it in the dreaded scientific notation, like this example. Be alert for this! Your answer is not 1.99 (or however you want to round it). Your answer is 1.99×10⁻⁴.

How do you convert this to a decimal for reading by ordinary humans? (And yes, you should usually do that — definitely, if your work will be read by non-technical people.)

The exponent (the number after the E minus) tells you how many zeroes the decimal starts with, including the zero before the decimal point. 1.99×10⁻⁴ is 0.000 199 or 0.0002.

When a decimal starts with a bunch of zeroes, especially if the decimal is long, many people use spaces to separate groups of three digits. This makes the decimal easier to read.
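If you ever need to do this conversion on a computer instead of in your head, here is a minimal Python sketch. It assumes you have typed the calculator display as text; the helper `group_decimals` is a made-up name for the digit-grouping idea just described.

```python
from decimal import Decimal

display = "1.994017946E-4"   # the calculator display from the example
value = Decimal(display)

def group_decimals(text):
    """Put a space after every third digit to the right of the decimal point."""
    whole, frac = text.split(".")
    groups = [frac[i:i + 3] for i in range(0, len(frac), 3)]
    return whole + "." + " ".join(groups)

plain = format(value, "f")                                   # ordinary decimal
rounded = format(value.quantize(Decimal("0.000001")), "f")   # six decimal places

print(plain)                    # 0.0001994017946
print(rounded)                  # 0.000199
print(group_decimals(rounded))  # 0.000 199
```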

1F3.  Show Your Work!

Don’t just write down answers; show your work. This is in your own best interest:

How do you show your work?

The general idea is to show enough that someone familiar with the course content can follow what you did.

When evaluating a formula, write down the formula, then on a line below show it with the numbers replacing the letters. Your calculator can handle very complicated formulas in one step, so your next line will be your last line, containing the final answer and any rounding you do. Example:

SEM = σ/√n

SEM = 160/√37

SEM = 26.30383797 → SEM = 26.3
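The same three lines translate directly into Python, if you'd like to confirm the calculator's result. This is just a sketch of the example above: formula, then numbers, then the answer with rounding at the very end.

```python
import math

sigma = 160   # σ from the example
n = 37        # sample size from the example

sem = sigma / math.sqrt(n)   # SEM = σ/√n, kept at full precision
print(sem)
print(round(sem, 1))         # 26.3
```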

You’ll be using a lot of the menus and commands on the TI-83 or TI-84. Here are some tips:

1F4.  Optional:  ∑ Means Add ’em Up

You’ll find that your calculator does the complicated stuff for you, but here and there I’ve scattered formulas in BTW paragraphs in case you want to peek behind the curtain.

Stats formulas usually need to do the same thing to every member of a data set and then add up the results. The Greek letter ∑, a capital sigma, indicates this. This summation notation makes formulas easier to write — easier to read, too, once you get used to it.
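If you know a little programming, it may help to think of ∑ as a loop that adds things up. Here is a small Python sketch using a made-up data set of three numbers (hypothetical values, chosen only to keep the arithmetic easy to check by hand).

```python
data = [3, 5, 8]   # hypothetical data set, for illustration only

sum_x = sum(data)                          # ∑x: add up the values
sum_x_squared = sum(x**2 for x in data)    # ∑x²: square each value, then add
square_of_sum = sum(data) ** 2             # (∑x)²: add first, then square

print(sum_x, sum_x_squared, square_of_sum)  # 16 98 256
```

Notice that ∑x² and (∑x)² are different numbers: the order of squaring and adding matters.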

Some examples:

What Have You Learned?

Key ideas:

(The online book has live links to all of these.)

Study aids:


Exercises for Chapter 1

Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.

Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

1 Briefly distinguish sampling error from nonsampling error. Which one represents avoidable mistakes? The other type can’t be eliminated, but what can you do to reduce it?
2 A gynecologist wants to study pregnant women’s use of prenatal vitamins. One month, she randomly selects one of her first five patients. For the rest of that month, she records data on every fifth pregnant patient that she sees.
(a) What type of sample is this?
(b) Is it a good sample or a bad sample? Why?
(c) Is the gynecologist performing an observational study or an experiment?
3 To test Gro-Mor plant food, investigators randomly divide 150 bulbs into three groups. They are planted in a greenhouse under identical conditions, except that one group gets no plant food, one group gets Gro-Mor, and one group gets Magi-Grow, a competitor’s product. The height of each plant is measured at the end of each week for 13 weeks. Identify the following:
(a) Type of experimental design.
(b) Factor(s).
(c) Treatments or levels.
(d) Response variable(s).
(e) Experimental units.
(f) Explanatory variable(s).
(g) Which is the control group?
4 The National Census of Borgovia released the statement, “The average number of children in Borgovian families is 2.1.” (a) Identify the variable. (b) State the specific variable type. (c) Is the number 2.1 a statistic or a parameter?
5 You’re taste-testing your new formula for Whoopsie Cola against your old formula. You assemble a focus group of 80 people and give them each a small cup of each drink. (Half the group gets old cola, water, new cola; the other half gets new, water, old. Of course you don’t tell them what they’re getting.) 55 of the people in the focus group like the new formula better.
(a) Describe the sample.
(b) What is the sample size?
(c) Describe the population.
(d) What is the population size?
6 No sample can perfectly represent the population, so no two samples will be the same, even if your sampling technique is perfect.
(a) What is the name for this variation?
(b) What can be done to reduce this variation?
7 “Have you ever left an infant alone in the house while you went to the store?” Explain how response bias might operate with this question.
8 You want to survey attitudes of resident students toward the cafeteria food. (There are 2000 resident students, and about 1500 of them eat in the cafeteria on a given day. The dorms have two students per room.)

How would you construct a random sample of size 50? a systematic sample? a cluster sample? Which of these is the best balance between statistical purity and practicality?

9 Two studies, Misinformation and the 2010 Election (Ramsay 2010 [see “Sources Used” at end of book]) at the University of Maryland and Some News Leaves People Knowing Less (Fairleigh Dickinson University 2011 [see “Sources Used” at end of book]), have shown that Fox News viewers know less about the world than people who watch no news at all. Can you conclude that this is because they watch Fox News? Why or why not?
10 You’re conducting a survey to determine Tompkins County voters’ willingness to pay for expanded bus routes. You randomly select twenty bus trips on each day of one week, and on each selected bus you or your associate hands a questionnaire to each person who gets on the bus.
(a) What is the most serious problem with this survey technique?
(b) What is the technical term for this type of mistake?
11 Sandy said, “I took a random sample by walking around the halls at lunch time and just asking random people to take my survey.” What is wrong with this statement? What type of sample did Sandy actually take?
12 “42% of my sample said that they have at least one device in the house that can stream video.”
(a) What is the data type?
(b) Is this an example of descriptive or inferential statistics?
(c) Is the number 42% a statistic or a parameter?
13 You want to test the effectiveness of a new medication for a condition that was previously untreatable. You randomly select thirty doctors from state lists of licensed doctors, and all of them agree to help.

Each doctor will put up notices in the waiting rooms, and will select the first 30 adult volunteers, assigning the first 15 to the experimental group and the second 15 to the control group. Patients will not be told which group they are in; you supply placebo pills that are identical in appearance to the active medication. Doctors will administer the placebo and medication to the selected groups and report results back to you.

Identify three serious errors in this technique. Are these examples of sampling or nonsampling error?

14 Which is larger, 0.0004 or 2.145E-4? Explain.
15 You survey 87 randomly selected households and find a total of 163 children. Dividing, you announce that the average number of children is 163/87 = 1.87356. What’s wrong with that, and how do you fix it?
16 Identify the type of each variable as discrete, continuous, or non-numeric:
(a) Telephone area code.
(b) Volume of a soap bubble.
(c) Number of times a comment gets retweeted.
(d) Ownership of a dog.
(e) Level of pain, from “none” to “unbearable”.
(f) Level of pain, from 0 to 10.
17 Here are some statements summarizing data. (I made all of them up.) State the original question asked or measurement taken from each member of the sample, and identify each data type as discrete, continuous, or non-numeric. The first one is done for you as an example.
(a) The average weight loss in rats sent to space was 3.4 g.
Answer — Measurement: weight loss of each rat. Continuous.

Now you answer these:
(b) The average dinner check at my restaurant last Friday was $38.23.
(c) 45% of patients taking Effluvium complained of bloating and stomach pain.
(d) The average size of a party at my restaurant last Friday was 2.9 people.

Solutions → 



2. Graphing Your Data

Updated 19 Jan 2017 (What’s New?)

Summary: To make sense out of a mass of raw data, make a graph. Non-numeric data want a bar graph or pie chart; numeric data want a histogram or stemplot. Histograms and bar graphs can show frequency or relative frequency.


Graph Paper for Free:
Why buy an expensive pad of graph paper, especially if you only need a few sheets? You can print your own for free using Incompetech’s Plain Graph Paper PDF Generator or Math Worksheets Land’s Graph Paper. Both are sources not just for the ordinary square grid, but for various specialty graph papers.

2A.  Graphing Non-Numeric Data

Any graph of non-numeric data needs to show two things: the categories and the size of each. Probably you’re already familiar with the two most common types, which are the bar graph and pie chart.

The sizes of categories can be shown as raw counts, called frequencies, or percentages, called relative frequencies. (Relative frequencies can also be shown as decimals, but I think most people respond better to “20%” than “.20”.)

How do you decide whether to show frequencies or relative frequencies? This is a stylistic choice, not a matter of right and wrong. Your choice depends on what’s important, what point you’re trying to make. If your main concern is just with the individuals in your sample, go with frequencies. But if you want to show the relationship of the parts to the whole, show relative frequencies.

2A1.  Bar Graph

How Often Parents Read to Children under Age 12 (n=434)
How Often              Number of Parents
Every day              217
A few times a week     113
About once a week      39
A few times a month    26
Less often             30
Never                  9

Example 1: In fall 2012, the Pew Research Center (2013a) [see “Sources Used” at end of book] surveyed American adults on their habits of reading to their children. The survey included 434 adults who had at least one child under age 12, and the results are shown in the table.

(Remember, you can’t call the data numeric just because you see numbers in a summary statement. You have to go back to the individual data points, which are categorical: “every day”, “a few times a week”, and so on. If the Pew Center had asked “how many days a week do you read to your child?” and got answers 0, 1, 2, 3, 4, 5, 6, and 7, that would be a set of numeric data.)

Your bar chart or bar graph must follow these rules:

Usually the category axis is horizontal, so the frequency axis and the bars are vertical. But you can also make a horizontal bar chart, where the category axis is vertical and the frequency axis and bars are horizontal.

You can make a bar graph by hand, or use software such as Microsoft Excel. If you make a bar graph by hand, use graph paper and draw the axes and bars with a straightedge — wobbly bars make you look like you had a liquid lunch.

Here’s my bar graph for parents reading to children:

frequency bar graph on graph paper

A couple of comments on best practices:

Optional:  Bar Graph in Excel

Getting some kind of bar graph out of Excel is easy. But then there’s a lot of fiddling around to reverse some of Excel’s rather strange format choices. Here are instructions for Excel 2010. If you have Excel 2007, 2013, or 2016, you’ll find that they’re pretty similar.

  1. Get your categories into one column and your frequencies into the next column. The first row of each column should be the column headings from the table. Don’t enter a total row.
    Table data in two columns in Excel
  2. With your mouse, highlight all rows and columns of your chart. (It doesn’t matter whether you include the column heads.) Click the Insert tab and then Charts » Column, and select the first 2-D column chart.
    First draft bar graph in Excel
  3. Right-click the useless legend at the right, “Series1” or “Number of Parents”, and select Delete.
  4. When you right-clicked the legend, three Chart Tools tabs appeared. On the Layout tab of the ribbon click Chart Title » Above Chart. Click into the chart title and type a better one. (Maybe Excel already gave your chart a title, but “Number of Parents” is the proper title of the frequency axis, not the whole chart.)
  5. Click Axis Titles » Primary Vertical Axis » Rotated Title. Click on the words “Axis Title” that appear in the chart, and type the new title “Number of Parents” for your frequency axis.
  6. If your category axis needs a title, click Axis Titles » Primary Horizontal Axis » Title Below Axis and enter the axis title.
  7. For some reason, the chart has tick marks between the categories. Right-click one of them, select Format Axis, and change Major tick mark type to None. That gives the chart you see here.
    Final bar graph in Excel
  8. You may have to tweak the formatting of the graph further; here are some suggestions. (If you try something and don’t like the result, press Ctrl-Z to undo the change.)

If you prefer a horizontal bar chart, it’s easy to make the change. Click into the chart area, then on the Design tab on the ribbon click Change Chart Type » Bar and select the first one.

Okay, well, nothing is that easy! Excel puts the categories in backwards order, so right-click the category axis and select Format Axis » Axis Options » Categories in reverse order. Still on the Axis Options dialog, click Horizontal axis crosses at maximum category.
Horizontal bar graph in Excel

Bar Graph with Relative Frequencies

The frequency bar graph tells us about the 434 individuals in the Pew Research Center’s sample. But why collect that sample except for what it can tell us about how often parents in general read to their children?

You know from Sampling Error in Chapter 1 that the proportions in the population are probably not the same as the sample, but probably not very far off either. So you compute those proportions and then redraw your graph to show percentages instead of raw counts.

How Often Parents Read to Children under Age 12 (n=434)
How Often              Number of Parents    Rel. Freq.
Every day              217                  50%
A few times a week     113                  26%
About once a week      39                   9%
A few times a month    26                   6%
Less often             30                   7%
Never                  9                    2%

First, total all the frequencies to get the sample size n = 434. (In this case n is given already, but often it isn’t.) Then convert each frequency into a relative frequency. The formula, if you need one, is f/n. For example, 9 parents never read to their under-12 children. The relative frequency is f/n = 9/434 = 0.021 or 2%: 2% of parents never read to their children. Enter that and the other relative frequencies in the table, as shown at right.

You may see some bar graphs with relative frequencies as decimals. There’s nothing wrong with that for technical audiences, but general audiences usually respond better to percentages.

Your relative frequencies may not add up to exactly 100% (or 1.0000), because of rounding. Don’t change any of the numbers to force a total.
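Here is a minimal Python sketch of the f/n calculation for the parents-reading data. The counts come from the table and from the “never” figure in the paragraph above; the variable names are mine. Notice that it rounds only for display and never adjusts a number to force a 100% total.

```python
counts = {
    "Every day": 217,
    "A few times a week": 113,
    "About once a week": 39,
    "A few times a month": 26,
    "Less often": 30,
    "Never": 9,
}

n = sum(counts.values())   # sample size: total of all the frequencies

# Relative frequency of each category is f/n, kept unrounded.
rel_freq = {category: f / n for category, f in counts.items()}

for category, p in rel_freq.items():
    # Round only when displaying, as a whole-number percentage.
    print(f"{category:<20} {counts[category]:>4} {p:>5.0%}")
```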

Once you have your relative frequencies, you can make your bar graph. Choose round numbers for the tick marks on your relative frequency axis, for example every 5% or every 10%. I won’t inflict another of my sketches on you, but you can see a finished relative-frequency bar graph below.

Optional:  Relative Frequencies in Excel

To my surprise, I found that Excel doesn’t include relative-frequency bar graphs in its repertoire. You have to enter some formulas to compute the relative frequencies, and then create the graph from them. (Of course you could compute the relative frequencies yourself and enter them in Excel as numbers, but whenever possible I like to be lazy and make the computer do the work.)

  1. Enter the categories in a column, leave a blank column, and enter the frequencies. If you already have the categories and frequencies in adjacent columns, right-click on the letter at the top of the frequency column and select Insert.
  2. Click into the cell below the last frequency, and type “=sum(” (without the quotes). Then with your mouse select the frequencies. Finally, type a closing parenthesis and hit the Enter key.
    Excel with partial formula =sum(C2:C7) in cell C8
  3. In the address box just above the first column of the spreadsheet, type a unique name such as TOTPARENTS and press the Enter key. This makes it easier to refer to this total cell.
    Excel with TOTPARENTS in the name area
  4. Click into the empty relative-frequency cell for the first category. Type an = sign, then click on the first frequency cell. (In the illustration, the relative-frequency cells are in column B and the frequency cells are in column C.) Type /TOTPARENTS (including / mark for division) and press the Enter key.
    Excel with formula =C2/TOTPARENTS in cell B2
  5. Grab the “handle” at the lower right of the cell you just typed into, and drag it down to fill the Relative Frequency column.
    Excel with relative frequencies in cells B2 through B7
  6. Click the % sign in the ribbon to change the decimals to percentages. (The % sign is near the middle of the ribbon, on the Home tab.)

Now highlight the category and relative-frequency columns, click the Insert tab and the first 2-D column chart, and tweak the graph as you did before. Your result should be something like the one you see here.

Excel relative frequency bar graph

On this chart, neither axis really needs a label. The percent signs reinforce the message in the chart title that the bars show relative frequencies. And the category names together with the chart title tell the reader exactly what is being represented.

It’s a judgment call where to place tick marks on the relative-frequency axis, and you really need to look at the data to make a decision. Four categories are under 10%, so it makes sense to show the 5% line and help the reader get a sense of the relative sizes. Of course, if you show 5% then you have to show every 5% increment up to the top of the graph.

Side-by-Side Bar Graph

You may want to compare two populations: men and women, for instance, or one year versus another year. To do this, a side-by-side bar graph is ideal. A side-by-side bar graph has two bars for each category, and a legend shows the meaning of the bars.

The two populations you’re comparing are almost never the same size. Therefore side-by-side graphs almost always show relative frequencies rather than frequencies.

Example 2: In Educational Attainment, the Census Bureau (2014) [see “Sources Used” at end of book] showed the educational attainment of the population in selected years 1940 to 2012. I chose the years 1992 and 2012 and prepared this graph to show the change over that 20-year period.

side-by-side bar graph for educational attainment

What do you see? Comparing 2012 to 1992, the proportion of the population with no college (the first four categories) declined, and the proportion with some college or a college degree increased. You should be able to see why this has to be a relative-frequency chart: in a frequency chart, the larger population in 2012 would make all the bars taller than the 1992 bars, and you’d be hard put to see any kind of trend.

Stacked Bar Graph

Example 3: Another way to compare two populations is the stacked bar graph. In the side-by-side bar graph, above, each group of bars was one category, and each bar within a group was a population. With the stacked bar graph, you have one bar for each population, and one piece of that bar for each category. (A stacked bar graph is kind of like an unrolled pie chart.)

Here’s a stacked bar graph for the same data set:

stacked bar graph, as described in text

What do you see? Look first at the legend that lists the categories, then at the two bars. The top two segments represent some college. In 1992, about 56% of adults had no education beyond high school. But in 2012, only about 42% had a high-school diploma or less, meaning that 58% had at least some college. The proportions of college and no college were reversed in those 20 years.

You can also see that, though the group with four years of high school shrank, it didn’t shrink as much as the group with college grew. In other words, it’s not just more high-school graduates going on to college, it’s a higher proportion of the population entering high school. All the categories without a high-school diploma shrank. In 1992, 20% of adults had less than a high-school diploma and 80% were high-school graduates; in 2012, only about 12% had less than a high-school diploma and 88% had graduated from high school.

What’s the best way to compare two populations? The answer depends on what you’re trying to show. The side-by-side graph seems to be better at showing how each category changed, and the stacked graph is usually better at showing the mix, especially if you want to group the categories mentally. In the side-by-side graph, you can easily see the decline in adults with a fourth-grade education or less, but the shift to a college-educated population is much harder to see. It’s just the opposite with the stacked graph.

As always, get clear in your own mind what you’re trying to show, and then select the type of graph that shows that most clearly.

Did you notice that this stacked bar graph shows relative frequencies? (Maybe you didn’t notice, because it seems like the natural way to go.) A stacked bar graph could show frequencies instead of relative frequencies, if you want to emphasize the different sizes of the populations, but then it becomes harder to compare the mix in the populations.

When you make a stacked bar graph in Excel, there’s no need to pre-compute the percentages. Just select the third type of 2-D column chart, 100% Stacked Column.

2A2.  Making a Table from Scratch

Example 4: In the first example, you were given a table of categories and counts. But more likely you’ll just have a mass of data points, like this:

Children’s Favorite Beach Toys
shovel        dump truck    shovel    bucket        shovel    ball
dump truck    ball          shovel    shovel        bucket    net
sifter        shovel        bucket    dump truck    bucket    shovel

Before you can make any kind of graph, you need a table to summarize the data. You’re probably tempted to count the number of shovels, the number of balls, and so on, but it’s way too easy to make mistakes that way. Why? Because you have to go over the data set multiple times, and you may count something twice or miss something.

The better procedure is to tally the categories in a table. It’s a win-win: the procedure is faster, and you’re less likely to make a mistake.

Simply go through the data, one item at a time. If you’re seeing a given category for the first time, add it to your list with a tally mark; if that category is already in your table, just add a tally mark. Here’s my table of tallies after going through the first two columns of data:

dump truck    ||

Please complete your tallies on your own before you look at mine.

After you’ve tallied all the data, count the tallies in each category and total the counts. Of course the total should equal your sample size n. Here’s my complete table:

shovel        |||| ||||    10
ball          |||| ||      7
dump truck    |||          3
bucket        |||| |       6

Always check the total of your frequencies. If it matches the sample size, that’s no guarantee everything is correct; but if it doesn’t match, you know something is wrong.
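If your data are already in a computer, the tallying can be done in one pass there too. Here is a minimal Python sketch using `collections.Counter`. The list of toys is a short hypothetical one that I made up for illustration; it is not the data set above.

```python
from collections import Counter

# Hypothetical responses, for illustration only.
toys = ["shovel", "ball", "shovel", "bucket", "shovel", "ball"]

# One pass through the data: each item adds a "tally mark" to its category.
tally = Counter(toys)

for category, count in tally.most_common():
    print(f"{category:<10} {count}")

# Always check that the frequencies total the sample size.
total = sum(tally.values())
print(total == len(toys))  # True
```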

Once you’ve got your table, you can make a graph by following the procedures above. If you’re publishing the table itself, give just the category names and sizes and the total, but leave out the tallies.

2A3.  Pie Chart

Where a bar graph tends to emphasize the sizes of categories in relation to each other, a pie chart tends to emphasize the categories as divisions of the whole. This distinction is not hard and fast; it’s just a matter of emphasis.

To make a pie chart, you need a compass, or something else that can draw a circle, and you need a protractor. The angle of each segment of the pie will be 360°×f/n, where f is the frequency of the category and n is the sample size — in other words, it’s 360° times the relative frequency, whether you’re showing frequencies or relative frequencies on the pie chart. But in practice, if you’re going to make a pie chart you’ll use Excel or some other software.
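If you do want to draw a pie chart by hand, this Python sketch computes the angle of each segment with the 360°×f/n formula. It uses the parents-reading counts from earlier in the chapter, including the 9 parents who never read to their children; the variable names are my own.

```python
counts = {
    "Every day": 217,
    "A few times a week": 113,
    "About once a week": 39,
    "A few times a month": 26,
    "Less often": 30,
    "Never": 9,
}

n = sum(counts.values())   # sample size

# Angle of each segment: 360 degrees times the relative frequency f/n.
angles = {category: 360 * f / n for category, f in counts.items()}

for category, angle in angles.items():
    print(f"{category:<20} {angle:6.1f} degrees")
```

The angles should total 360°, give or take rounding, which is a handy check before you pick up the protractor.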

Optional:  Pie Chart in Excel

Excel can draw a pie chart for you, but you have to make a bunch of tweaks before it’s usable. There’s one bit of good news: with a pie chart, unlike a bar graph, Excel can compute relative frequencies automatically. I’ll show you how to do that for the data about parents reading to children, for which we made a bar graph earlier.

  1. Highlight the categories and frequencies, but not the total. Click the Insert tab and then Pie, and choose the first 2-D pie. You see the result at right.
    first cut at an Excel pie chart

    Many people stop there, but this is an absolutely horrible design. Readers have to keep looking back and forth to match up the colors, and often there are similar colors. Color-blind people are really screwed, and if you print the chart on a black-and-white printer it’s hopeless. Fortunately you can fix this!

  2. You’re going to put the category names with the pie segments, so right-click the legend (the list of categories at the right) and select Delete.
  3. Click on the “Number of Parents” title and type in a better one, such as “How Often Parents Read to Children”. (Don’t type the quotes in the title, of course.)
  4. In the ribbon, on the Layout tab, click Data Labels » More Data Label Options. Under Label Contains, select Category Name, select either Value or Percentage, and select Show Leader Lines. Under Label Position, select Best Fit. Click Close.
  5. You may want to resize the graph to make the labels less crowded, depending on the sizes of the segments. Drag a handle with your mouse, as you did before.

final pie graph in Excel

2B.  Graphing Numeric Data


For numeric data, you want to show four things: the shape, center, and spread of the distribution plus any outliers. The histogram is the standard way to do this, and it can show frequencies or relative frequencies.

Usually you’ll group the data into classes, but when you have discrete data without too many different values you can make an ungrouped histogram.

For a discrete data set with a moderate number of values and a moderate range, a stemplot is an alternative. With a stemplot, it doesn’t matter how many different data values there are, but the number of data points matters.

2B1.  Histogram for Numeric Data

How can you draw a picture of numeric data? The answer is a histogram.

The term “histogram” was coined by Karl Pearson in lectures some time before 1895.

Example 5: Let’s use the lengths of some randomly selected iTunes songs:

Lengths of iTunes Songs (seconds)
113 282 179 594 213   319 245 323 334 526
395 440 477 240 296   428 407 230 294 152
242 837 246 135 412   223 275 409 114 604
170 239 138 505 316   369 298 168 269 398
433 212 367 255 218   283 179 374 204 227

How do you make sense of this? As you might expect, the first step is to make a table. But you don’t want to treat each number as its own category, because that would produce a really uninteresting graph. Instead you create categories, except for numeric data you call them classes. The rules for classes are very simple:

Notice that the rules don’t tell you how many classes there must be, or what width a class must have. That’s where your discretion comes in. You want to pick class boundaries that are “nice” numbers, and you don’t want too many classes or too few. In practice, five to nine classes is usually about the right number.

How does that apply to the iTunes songs? Take a look at the data. The lowest number seems to be 113, and the highest is 837. That gives a range in “nice” numbers of about 100–850. If you set class width to 100 you have eight classes, so that seems about right.

Now go ahead and make your tally marks to create the table. Instead of category names, you use class boundaries. You already know how to make tally marks, so I’ll just give you the results:

Lengths of iTunes Songs (seconds)
Class      Tallies                Frequency
100–199    |||| ||||                   9
200–299    |||| |||| |||| ||||        20
300–399    |||| ||||                   9
400–499    |||| ||                     7
500–599    |||                         3
600–699    |                           1
700–799                                0
800–899    |                           1
Total                                 50

Even though the 700–799 class has no data points, it’s still a class and it will occupy the same width in the histogram as any other class. A bar with zero height shows in the histogram as a gap, and that’s good because it emphasizes that there’s something unusual about the point in the 800–899 class (which was 837 seconds).
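If you would rather let a computer do the tallying, the following Python sketch sorts the 50 song lengths into classes of width 100. The variable names are my own, and the choice of 100 as the first lower bound and 8 as the number of classes simply follows the reasoning above; it is one reasonable setup, not the only one.

```python
songs = [113, 282, 179, 594, 213, 319, 245, 323, 334, 526,
         395, 440, 477, 240, 296, 428, 407, 230, 294, 152,
         242, 837, 246, 135, 412, 223, 275, 409, 114, 604,
         170, 239, 138, 505, 316, 369, 298, 168, 269, 398,
         433, 212, 367, 255, 218, 283, 179, 374, 204, 227]

width = 100      # class width
low = 100        # lower bound of the first class
n_classes = 8    # 100-199 up through 800-899

# Start every class at zero so an empty class still appears in the table
counts = [0] * n_classes
for x in songs:
    counts[(x - low) // width] += 1

for i, f in enumerate(counts):
    lower = low + i * width
    print(f"{lower}-{lower + width - 1}: {f}")

# Always check the total against the sample size
print("total:", sum(counts), "sample size:", len(songs))
```

Because the list of counts is created with a zero for every class, the empty 700–799 class is kept rather than silently dropped, which is exactly what the histogram needs.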

If the class width is 100, how come the class bounds are 100–199 and not 100–200? In fact, some authors do write these class bounds as 100–200, 200–300, and so on, with the understanding that if a number is right on the boundary it goes in the upper class. All authors agree that the class width is the difference between the lower bounds of two consecutive classes, not the difference between lower and upper bounds of one class. So whether you write 100–199 for the first class or 100–200, the class width is 200 minus 100, which is 100.

Once you have the table, the histogram is straightforward. You can draw the histogram by hand or use Excel. I’ll show Excel later, but here’s my hand-made histogram for the iTunes data.

histogram for lengths of iTunes songs

Notice that you label the data bars on their edges: 100, 200, …, 900, not 100–199, 200–299, …. Label the left edge of each bar, and also the right edge of the last bar. The right edge of the last bar is always one class width more than the left edge, so even if you’ve got 800–899 in your table the last bar’s edges are 800 and 900.

Like all histograms, this one is good at showing the shape of the data (skewed right; see below), the center (somewhere in the upper 200s to 300s), and the spread (from 100ish to 800ish seconds, or about two minutes to 13 minutes). In Chapter 3 you’ll learn how to measure center and spread numerically, but there’s always a place for a picture to help people grasp a data set as a whole.

This data set also shows an outlier, located somewhere in the 800–899 class. Not every data set will have an outlier, of course, and a rare sample might have more than one. When an outlier occurs, your first move is to go back to your original data sheets and make sure that it’s not simply a mistake in entering your data. If it’s a real data point, then you can ask what it means. In this case, the message is pretty simple: tunes generally run up to about 11 or 12 minutes (700 seconds), but the occasional one can be several minutes longer.

Histogram Versus Bar Graph

A histogram is similar to a bar graph, but with the following differences:

                          Histogram            Bar Graph
Data type                 Numeric (grouped)    Discrete ungrouped★    Non-numeric
Order of categories       Numeric order,       Numeric order,         Any order you choose
                          left to right        left to right
Do the bars touch?        Yes                  No, they’re spaced
Where are they labeled?   Below the edges      Below the centers
★Some authors treat ungrouped discrete data as numeric and make a histogram. Others, including this book, treat ungrouped discrete data as categories and make a bar graph.

For both histogram and bar graph, the frequencies must start at 0. However, in a histogram the data axis typically doesn’t start at zero. You just leave some space between the frequency axis and the first bar, and the scale of the data axis is considered to start at the first bar.

Relative-Frequency Histogram

Though I don’t show it here, you could make a relative-frequency histogram, the same way you made a relative-frequency bar chart. The relative frequencies range from 0 for the 700–799 class to 20/50 = 40% for the 200–299 class.
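Here is a small Python sketch of that conversion. The class frequencies are the ones you get by tallying the 50 song lengths listed earlier in this section; the variable names are my own.

```python
classes = ["100-199", "200-299", "300-399", "400-499",
           "500-599", "600-699", "700-799", "800-899"]
freqs = [9, 20, 9, 7, 3, 1, 0, 1]

n = sum(freqs)                   # sample size
rel = [f / n for f in freqs]     # relative frequency = f / n

for c, r in zip(classes, rel):
    print(f"{c}: {r:.0%}")

# Relative frequencies must add up to 100% (allowing for rounding)
print(f"total: {sum(rel):.0%}")
```

The bars of a relative-frequency histogram have the same proportions as the bars of the frequency histogram; only the labels on the vertical axis change.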

Optional:  Histogram in Excel

Believe it or not, out of all the chart types in Excel, the standard histogram is not included. To make one, you have to combine a column chart and a scatterplot (Middleton) [see “Sources Used” at end of book], or download additional software. You can follow the detailed instructions in that document, or you can download the free Better Histogram add-in from TreePlan Software to do the job. (It works in Excel 2007 through 2016.) If you’re using Better Histogram:

  1. Enter all the original numbers in a column in Excel.
  2. Double-click the downloaded ZIP file, and within it double-click Better-Histogram-2007. You will have to enable macros.
  3. Click the Add-Ins tab in the ribbon, and then Better Histogram.
  4. Better Histogram will create a new sheet in your workbook with a frequency table and histogram. Click on the chart title and enter a new title. Click on the horizontal axis title and either delete it or change it to more appropriate text. The result is shown at right.
    histogram created by Better Histogram, following instructions in the text
  5. Optional: You might wish to jazz up the chart visually. If so, click on the Design tab of Excel’s ribbon and choose a design. Color is fine, but don’t choose different colors for different bars because that can make bars look larger or smaller than they actually are. Here’s what I got from clicking the blue theme.
    applying blue theme to the above histogram

2B2.  Ungrouped Discrete Data

To make sense of most data sets, you need to group the data into classes. But sometimes your data have only a few different values. In such cases, you probably want to skip the grouping and just have one histogram bar for each different response. The height of the bar tells you how often that response occurred, as usual.

Example 6: A state park collected data on the number of adults in each vehicle that entered the park in a given time interval:

3 1 1 3 3 3 0 7 3 1    3 6 4 5 3 2 3 4 2 3
0 2 2 4 8 3 3 1 3 3    3 4 1 5 2 2 6 3 4 2

Number of Adults
in Vehicles Entering Park
  Adults    Tallies            Frequency
  0         ||                      2
  1         ||||                    5
  2         |||| ||                 7
  3         |||| |||| ||||         15
  4         ||||                    5
  5         ||                      2
  6         ||                      2
  7         |                       1
  8         |                       1
  Total                            40

There are only nine different values, so it seems a little silly to group them. Instead, just tally the occurrences, as shown at right.

Label ungrouped data under the centers of the bars, just like categorical data, not under the edges. Some authors still make the bars touch because the data are numeric, and others keep the bars separated because the data are ungrouped. I prefer the second approach, but I’ll accept the other. Here’s my histogram:

ungrouped histogram for adults in vehicles entering park

Caution: This particular data set has at least one occurrence of every value between min and max. But suppose it didn’t; suppose there were no vehicles with 7 adults? In that case, you would draw the histogram exactly the same, except that the bar above “7” would have zero height. The horizontal axis for numeric data must always have a consistent scale for its whole length, so you never close up any gaps.
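The tallying, and the "never close up gaps" rule, can be sketched in a few lines of Python. This is only an illustration under my own naming (`ungrouped_table` is a made-up helper); it assumes the data are whole numbers.

```python
from collections import Counter

adults = [3, 1, 1, 3, 3, 3, 0, 7, 3, 1, 3, 6, 4, 5, 3, 2, 3, 4, 2, 3,
          0, 2, 2, 4, 8, 3, 3, 1, 3, 3, 3, 4, 1, 5, 2, 2, 6, 3, 4, 2]

def ungrouped_table(data):
    """Frequency of every whole number from min to max, zeros included."""
    tally = Counter(data)
    # Counter returns 0 for a value that never occurred, so walking the
    # whole range keeps a zero-height bar instead of closing up the gap.
    return {v: tally[v] for v in range(min(data), max(data) + 1)}

table = ungrouped_table(adults)
for value, f in table.items():
    print(f"{value}: {f}")
print("total:", sum(table.values()))

# The "what if" from the caution: no vehicles with 7 adults
no_sevens = ungrouped_table([x for x in adults if x != 7])
print("bar above 7 now has height", no_sevens[7])
```

The second table still has an entry for 7, with frequency 0, so the horizontal axis keeps a consistent scale.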

Optional:  Ungrouped Discrete Histogram in Excel

You can graph ungrouped discrete data in Excel, if you wish. The key is to fool Excel into treating the data like categorical data:

  1. Type the unique values in one column. But as you type each number, type an apostrophe (') first. Don’t put 0, 1, 2 and so on in the cells, but '0, '1, '2. The apostrophe won’t appear, but it tells Excel to treat the numbers like text. (You may notice that Excel left-justifies those numbers.)
    two-column layout in Excel as described in text
  2. Type the frequencies in a second column.
  3. Highlight the numbers in both columns, and on the Insert tab click Column. Select the first 2-D column.
    initial frequency histogram in Excel
  4. Make all the same adjustments you made for the bar graph, above.
    frequency histogram in Excel after cleanup

    By the way, you might notice that the tick marks on the vertical axis are every two cars on this graph, but they were every five cars on my hand-drawn histogram. One is not better than the other; it’s a stylistic choice.

  5. Optional: If you want to make the bars touch, right-click on the graph, select Format Data Series, and under Series Options change Gap Width to 0%. Then click Border Color and select Solid Line with a color of white.
    frequency histogram in Excel with no gaps between bars

2B3.  Shapes of Data Sets

You should know the names of the most common shapes of numeric data. Why? It’s easier to talk about data that way, and — as you’ll see in the next chapter — you treat different-shaped distributions a little differently.

The first question is whether the data set is symmetric or skewed. The histogram of a symmetric data set would look pretty much the same in a mirror; a skewed data set’s histogram would look quite different in a mirror.

If a distribution is skewed, you say whether it’s skewed left or skewed right. A distribution that is skewed left, like the first one below, has mostly high scores, and a distribution that is skewed right, like the second one below, has mostly low scores. The direction of skew is away from the bulk of the data, toward the long skinny tail, where there are few data points.

a distribution that is skewed left      a distribution that is skewed right
Skewed left or negatively skewed        Skewed right or positively skewed

Example 7: Scores on a really easy test would be skewed left: most people get high scores, but a few get low or very low scores.

Lifespan in developed countries is skewed left: there are relatively few infant and child deaths, and most people live into their 60s, 70s, or 80s. (The first graph in Calculus Applied to Probability and Statistics [Waner and Costenoble 1996] [see “Sources Used” at end of book] illustrates this.)

People’s own evaluations of their driving skills and safety are skewed left: few people rate themselves below average and most rate themselves above average. Illusory Superiority [see “Sources Used” at end of book] cites a study by Svenson showing this “Lake Wobegon effect”.

Example 8: People’s departure times after a concert would be skewed right: most people leave shortly before or after the performers finish, but a few straggle out for some time afterward. Skewed-right distributions are more common than skewed-left distributions.

Salaries at almost any corporation are another good example of a distribution that is skewed right: most people make a modest wage, but a few top people make much more.

There are several types of symmetric distributions, but here are the two you’ll meet most often. A uniform distribution is one where all possible values are equally likely to occur. The normal distribution has a precise definition, which you’ll meet in Chapter 7, but for now it’s enough to say that it’s the famous bell curve, with the middle values occurring the most often and the extreme values occurring much less often.

You’ll notice that both of the examples below are “bumpy”. That’s usual. In real life you pretty much never meet an exact match for any distribution, because there are always lurking variables, measurement errors, and so on. And even if a population does perfectly follow a given distribution, like the probability distributions you’ll meet in Chapter 6, still a sample doesn’t perfectly reflect the population it came from: sampling error is always with us. When we say that a data set follows such-and-such a distribution, we mean it’s a close match, not a perfect match.

a uniform distribution      a normal distribution
Uniform                     Normal (“bell curve”)

Example 9: Winning lottery numbers are uniformly distributed. (In the short term some numbers occur more often than others, but over the long run they tend to even out.)

The results of rolling one die many times are uniformly distributed. (But the results of rolling two dice are not uniformly distributed: 7 is the most likely, 2 and 12 are tied for least likely, and the other numbers are intermediate.)
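You can convince yourself of the two-dice claim by listing all 36 ways two dice can land and counting how many give each total. Here is a minimal Python sketch; it assumes fair six-sided dice, so that the 36 combinations are equally likely.

```python
from collections import Counter
from itertools import product

# Every (first die, second die) pair, 6 x 6 = 36 in all
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))

for total in range(2, 13):
    print(f"{total:2d}: {'*' * ways[total]} ({ways[total]} of 36)")
```

The rows of asterisks form a little sideways bar graph: it rises to a peak at 7 and falls off evenly on both sides, which is symmetric but clearly not uniform.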

The normal distribution or bell curve occurs very often, and in fact many natural and industrial processes produce normal distributions. This happens so often that we often just say or write ND for “normal distribution” or “normally distributed”.

Example 10: Men’s and women’s heights follow separate normal distributions. People’s arrival times at an event are ND. IQ scores, and scores on most tests, are ND. The amount of soda in two-liter bottles is ND. Your commute times on a given route are ND.

2B4.  Stem Plot

Suppose you have a discrete data set with few repetitions. An ungrouped histogram would have most bars at the same low height; a grouped histogram might show a pattern but you’d lose the individual data points.

If your discrete data set isn’t too large (n < 100, give or take), and the range isn’t too great, you can eat your cake and have it too. The stemplot, also known as a stem-and-leaf diagram, is a mutant hybrid between a histogram and a simple list of data.

The idea is that you take all the digits of each data point except the last digit and call that the stem; the last digit is the leaf. For example, consider scores of 113 and 117. They are two leaves 3 and 7 on a common stem 11 (meaning 110).

To construct a stemplot, you look over your data set for the minimum and maximum, then write the stems in a column, from lowest to highest. Just like with a histogram, there are no gaps, so if you have data in the 50s and the 70s but not in the 60s you still need a stem of 6.

However, your stems probably won’t start at 0. Start them with the lowest data point that actually occurred, and end them with the highest data point that actually occurred.

The stemplot was invented by John Tukey in 1970.

Example 11: Here is a set of IQ scores from 50 randomly-selected tenth graders:

 99  77  83 111 141      89  98  84  93 124
110  73  96  60 102      87 123 120 100  95
100  90 104  85 129      81 119 112 103  76
108  91  94 114 108      92  96  94  88 101
117 106 103 105 113      97 106 109  80 116

To make your stemplot, eyeball the data for the minimum and maximum, which are 60 and 141. Write the stems, 6 to 14, in a column at the left of your paper, starting several lines below the top. Then draw a vertical line just to the right of them.

Now go through the data points, one by one, and add each leaf to the proper stem. During this process, you might find a value outside what you thought were the min and max. That’s no problem. Just add the stem and then the leaf. (Again, the stems can’t have gaps, so if your first stem is 6 and you come across a data point 47, you have to add stems 4 and 5, not just 4.)

Finally, add a title and a legend or key to your stemplot. Here is the result:

            IQ Scores
 6 | 0
 7 | 7 3 6
 8 | 3 9 4 7 5 1 8 0
 9 | 9 8 3 6 5 0 1 4 2 6 4 7
10 | 2 0 0 4 3 8 8 1 6 3 5 6 9
11 | 1 0 9 2 4 7 3 6
12 | 4 3 0 9
13 |
14 | 1
                    key: 11 | 7 = 117

If you lie down and look at this sideways, it looks like a histogram. But the bonus is that you can still see all the actual data points within the groupings of 60–69, 70–79, etc.

A stemplot is great at showing shape, center and spread of distributions plus outliers, but most data sets don’t lend themselves to a stemplot. If your data set is too large, your leaves will run off the edge of the page. If your data set is too sparse — if the range is large for the number of data points — most of your stems won’t have leaves and the plot won’t really show any patterns in the data. But when you have a moderate-sized data set and the data range is moderate, a stemplot is probably better than a histogram because the stemplot gives more information.

One last touch is sorting the leaves. I don’t think that’s important enough to take the extra effort in a homework problem or on a quiz, but if you’re going to be presenting your stemplot to other people then you probably want to sort the leaves. Here’s the same stemplot with sorted leaves:

            IQ Scores
 6 | 0
 7 | 3 6 7
 8 | 0 1 3 4 5 7 8 9
 9 | 0 1 2 3 4 4 5 6 6 7 8 9
10 | 0 0 1 2 3 3 4 5 6 6 8 8 9
11 | 0 1 2 3 4 6 7 9
12 | 0 3 4 9
13 |
14 | 1
                    key: 11 | 7 = 117

A glance at this stemplot shows you quite a lot. The data set is normally distributed, the center is around 100 points, the spread is 60–141, and there’s an outlier at 141.
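The construction procedure for a stemplot is mechanical enough to sketch in Python. This is only an illustration: `stemplot` is a name I made up, and the function assumes non-negative whole numbers with the last digit as the leaf.

```python
from collections import defaultdict

iq = [ 99,  77,  83, 111, 141,  89,  98,  84,  93, 124,
      110,  73,  96,  60, 102,  87, 123, 120, 100,  95,
      100,  90, 104,  85, 129,  81, 119, 112, 103,  76,
      108,  91,  94, 114, 108,  92,  96,  94,  88, 101,
      117, 106, 103, 105, 113,  97, 106, 109,  80, 116]

def stemplot(data):
    """Return the rows of a stemplot with sorted leaves, one string per stem."""
    leaves = defaultdict(list)
    for x in sorted(data):
        stem, leaf = divmod(x, 10)  # stem: all digits but the last; leaf: last digit
        leaves[stem].append(str(leaf))
    rows = []
    # Every stem from lowest to highest gets a row, even one with no leaves
    for stem in range(min(leaves), max(leaves) + 1):
        joined = " ".join(leaves[stem])
        rows.append(f"{stem:2d} | {joined}".rstrip())
    return rows

print("IQ Scores")
print("\n".join(stemplot(iq)))
print("key: 11 | 7 = 117")
```

Sorting the data first means the leaves come out sorted for free, and looping over the full range of stems keeps the empty stem 13 in the plot.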

2C.  Bad Graphs

You now know how to make good graphs, so be on the lookout for bad graphs. Sometimes they’re bad just because whoever drew them didn’t know any better, or didn’t think. But some people may deliberately try to deceive you with a graph.

Example 12: File this one under “what were they thinking?” The left-hand graph doesn’t have a title, so you don’t know what “Yes” and “No” mean. You have to look back and forth between the graph and the legend, and anyone with red-green color blindness probably won’t be able to see which segment is which. Oh yes — what percentages of the sample answered “Yes” and “No”? You can guess that it’s around a third versus two thirds, but that’s not very precise.

The right-hand graph cures those problems. It’s now crystal clear which segment is Yes and which is No, and what proportion of the sample gave each answer. This actually lets you show more information in less space, a win-win. (Of course you wouldn’t use a vague term like “Opinions” — that’s just there to remind you to give your graph a title.)

a bad pie chart and a good pie chart, as described

Example 13: There’s no telling whether this one is deliberate deception or just incompetent graphing. An oatmeal company, which shall remain nameless, wanted to show that eating oatmeal for four weeks reduces cholesterol. The first graph makes a strong case — until you look at the scale on the vertical axis. (Don’t even think about wasting your time on a graph with no vertical scale.)

The scale doesn’t start at zero, so it makes differences look much bigger than they are. Your frequency or relative frequency scale must always start at zero (and you must show the zero). The second graph is properly drawn, and now you can see that the drop in cholesterol is only a slight one.

bad graph: frequency scale doesn't start at zero      better graph: frequency scale starts at zero

improperly and properly scaled graphs; see text
source: Misleading Graph [see “Sources Used” at end of book]

Example 14: It’s all very well to create visual interest, but not if it makes the reader misinterpret the graph.

In the left-hand graph, you can tell from the scale that B is supposed to be three times as large as A, but since it’s three times as high and three times as wide it’s actually nine times as large, giving the reader a distorted impression of the amount of difference. Even if your “bars” are pictures, they still have to be the same width. The corrected version is shown at right. (It’s still not quite correct, though, because 0 is not shown on the vertical axis.)

2D.  Really Good Graphs

If you follow the rules in this chapter, you’ll make good, professional graphs. But there are plenty of other ways to make good graphs, depending on the data you’re trying to show.

There’s a classic picture book that can give you lots of good ideas. Edward Tufte’s The Visual Display of Quantitative Information has been around since 1983, and no one has yet done it any better. (Tufte has produced newer editions.)

Example 15: One famous graph in Tufte’s book is particularly stunning. Charles Minard wanted to present a lot of time-series data about Napoleon’s disastrous campaign in Russia in the winter of 1812–1813: where battles took place, numbers of casualties, temperature, and so forth. He elected to make a kind of stylized map showing just the rivers and the cities where events happened. (Niemen at the left is the Niemen River, Russia’s western border at the time. Moscow, “Moscou” in French, is as far east as Napoleon got.) Across that, Minard showed the army strength as a broad swath at the start that shrank to almost nothing by the end of the retreat westward. Below are dates of events, temperatures, and precipitation. It’s a huge amount of information on one piece of paper.

Charles Minard's famous figurative map of Napoleon’s Moscow campaign

This tiny rendition doesn’t do it justice, but if you click on it you’ll see it at a better size. (Your browser may still reduce it to fit on your screen. Try clicking into the picture and you should see it at original size, though you’ll have to scroll around to see the details. It sounds like a lot of effort, but I promise you it’s worth it. Or just get the book, because it has plenty more!)

Example 16: Here’s one I ran across in my reading. It’s not the graph of the century like Minard’s, but it’s a cut above the usual. In Bear Attacks: Their Causes and Avoidance (2002), Stephen Herrero had the problem of contrasting bears’ diet in spring, summer, and fall. (Of course in winter they’re not eating.)

Stephen Herrero three-way pie graph

He could have drawn three pie charts, or a stacked bar graph, but instead he came up with a great alternative. (You can click on the picture to enlarge it.) Each component of diet is clearly labeled right in the graph, not in some legend off to the side, and the contrasting backgrounds make it a little more interesting visually. A stacked bar graph would convey the same information, but I like this presentation because it suggests that “spring”, “summer”, and “fall” are not completely separate but rather transition one into the next.

The vertical axis is clearly labeled, too. There’s no doubt what the numbers are (as opposed to some units of weight, for instance, or something more esoteric like pounds of feed per hundreds of pounds of bear).

He probably could have left the title off the category axis; after all, we know that the seasons are seasons, and the graph title also conveys that information. But that’s a minor point. My only real quibble with this graph is that the overall graph title at the bottom is too small.

What Have You Learned?

Overview: With numeric data, the goal of descriptive stats is to show shape, center, spread, and outliers.

Key ideas:

(The online book has live links to all of these.)


Exercises for Chapter 2

Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.

Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

Would You Move to the US?
Yes, with authorization       154
Yes, without authorization    204
Don’t know                     30

The Pew Research Center (2013c) [see “Sources Used” at end of book] conducted a poll of 1000 adults in Mexico, asking whether they would move to the US if they had the means and opportunity to move. Draw a relative-frequency bar graph for their responses.


What’s wrong with this graph? (You should be able to see at least two problems, maybe more.)

banana, apple, cherry

(source: Misleading Graph [see “Sources Used” at end of book] in Wikipedia)


Professor Marvel had a statistics class of fifteen students, and on one 15-point quiz their scores were

10.5   13.5   8   12   11.3   9   9.5   5   15   2.5   10.5   7   11.5   10   10.5

Construct a frequency table and bar graph for their letter grades on the quiz, where 90% is the minimum for an A, 80% for a B, 70% for a C, and 60% for a D.

Deaths by Horse Kick in
14 Prussian Army Corps, 1875–1894
Number of Deaths    Frequency

Bulmer (1979, 92) [see “Sources Used” at end of book] quotes an 1898 study of deaths by horse kick in the Prussian army. Von Bortkiewicz compiled the number of deaths in 14 Prussian Army corps over the 20-year period 1875–1894, as shown at right. (14 corps over 20 years gives 14×20 = 280 observations.) For example, there were 32 observations in which two officers died of horse kicks.
(a) What is the type of the variable?
(b) Construct an appropriate graph.

Commuting Distances in km

In a GM factory in Brazil, 25 workers were asked their commuting distance in kilometers. Construct a stem-and-leaf plot.

—Adapted from Dabes and Janik (1999, 8) [see “Sources Used” at end of book]


Abigail asked a number of students their major. She found 35 in liberal arts, 10 in criminal justice, 25 in nursing, 45 in business, and 20 in other majors. What was the relative frequency of the nursing group, rounded to the nearest whole percent?

7 (a) Name three types of graph used for ungrouped discrete data. Which type do you use when?
(b) Name the type of graph used for grouped numeric data.
(c) Name two types of graph used for qualitative data.

Bert asked his fellow students how many books they read for pleasure in a year. He found that most of them read 0, 1, or 2 books, but some read 3 or more and a very few read as many as 10. (He plotted the histogram shown at right.) Identify the shape of this distribution.

histogram of data given in this question

9 (a) In making a histogram, how do you decide whether to group data?
(b) What are the two rules for classes when you group data?
Test scores, x    Frequencies, f

At right is a grouped frequency distribution.
(a) Create a frequency histogram. (For a real quiz, you’d use graph paper, but you can freehand this one.)
(b) Find the class width.
(c) What’s the shape of this distribution?



3. Numbers about Numbers

Updated 1 Feb 2015 (What’s New?)


For numeric data, the goal of descriptive stats is to show the shape, center, spread, and outliers of a data set. In this chapter, you learn how to find and interpret numbers that do that.

Measures of shape called skewness and kurtosis do exist, but they’re not part of this course. Roughly, skewness tells how this data set differs from a symmetric distribution, and kurtosis tells how it differs from a normal distribution. If you’re interested, you can learn about them in Measures of Shape: Skewness and Kurtosis. The MATH200B Program part 1 can compute those measures of shape for you.


3A.  Measures of Center

3A1.  The Three M’s: Mean, Median, Mode

There are three common measures of the center of a data set: mean, median, and mode.


The mean is nothing more than the average that you’ve been computing since elementary school.

The symbol for the mean of a sample is x̄, pronounced “x bar”. The symbol for the mean of a population is the Greek letter μ, pronounced “mew” and spelled “mu” in English. (Don’t write μ as “u”; the letter has a tail at the left.)

You can think of the mean as the center of gravity of the distribution. If you made a wooden cutout of the histogram, you could balance it on a pencil or your finger placed exactly under the mean.

The formula for the mean is x̄ = ∑x/n or μ = ∑x/N, meaning that you add up all the numbers in the data set and then divide by sample size or population size.


The median is the middle number of a sample or population. It is the number that is above and below equal numbers of data points. (Examples are below.)

There’s no one agreed symbol for the median. Different books use M or Med or just “median”.

To find the median by hand, you must put the numbers in order. If the data set has an odd number of data points, counting duplicates, then the median is the middle number. If the data set has an even number of data points, the median is half way between the two middle numbers. (In the next section, you’ll get the median from your TI calculator, with no need to sort the numbers.)


The mode is the number that occurs most frequently in a data set. If two or more numbers are tied for most frequent, some textbooks say that the data set has no mode, and others say that those numbers are the modes (plural). We’ll follow the second convention.

Most distributions have only one mode, and we call them unimodal. If a distribution has two modes, or even if it has two “frequency peaks” like the one at right, we call it bimodal. (These were students’ final grades in a math course: a lot of low or high grades, and few in the middle.)

bimodal distribution of student grades

There’s no symbol for the mode.
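Here is a short Python sketch of finding the mode under the convention this book follows, where tied values are all reported as modes. The function name `modes` and the two tiny data sets are invented for illustration.

```python
from collections import Counter

def modes(data):
    """Return every value tied for the highest frequency, in sorted order."""
    counts = Counter(data)
    top = max(counts.values())  # the highest frequency in the data set
    return sorted(v for v, f in counts.items() if f == top)

# Hypothetical data sets, made up for this example
print(modes([5, 1, 5, 2]))        # one value occurs most often
print(modes([1, 2, 2, 3, 3, 4]))  # two values are tied
```

Python's standard library offers `statistics.multimode` (Python 3.8 and later), which follows the same convention but lists the modes in the order they first appear rather than sorted.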

Example 1: You’re interviewing at a company. You ask about the average salary, and the interviewer tells you that it’s $100,000. That sounds pretty good to you. But when you start work, you find that everybody you work with is making $10,000. What went wrong here?

The interviewer told the truth, but left out a key fact: Everybody but the president makes below the average. Eight employees make $10,000 each, the vice president makes $50,000, and the president makes $870,000. Yes, the mean is (8×10,000 + 50,000 + 870,000)/10 = $100,000, but that’s not representative because the president’s salary is an outlier. It pulls the mean away from the rest of the data, and skews the salary distribution toward the right. This graph tells the sad tale:

distribution of salaries with median and mean, as described in text

There was your mistake. Salaries at most companies are strongly skewed right, so most employees make less than the average. When a data set is skewed, the mean is pulled toward the extreme values. (A data set can be skewed without outliers, but when there are outliers the data set is almost certain to be skewed.)

You should have asked for the median salary, not the average (mean) salary. There are 10 employees, and 50% of 10 is 5, so the median is less than or equal to five data points and greater than or equal to five data points. The fifth-highest and sixth-highest salaries are both $10,000, so the median is $10,000.

The median is more representative than the mean when a data set is skewed. The mean is pulled toward an extreme value, but the median is unaffected by extreme values in the data set. We say that the median is resistant.
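
To see resistance in action, here is a small Python sketch using the salaries from Example 1. The doubled president’s salary in the second half is a made-up “what if”, not part of the example.

```python
from statistics import mean, median

# Salaries from Example 1: eight employees, the vice president, the president.
salaries = [10_000] * 8 + [50_000, 870_000]

print(mean(salaries))    # pulled up by the president's salary
print(median(salaries))  # unaffected by the extreme value

# Hypothetical what-if: double the president's salary.
bigger = salaries[:-1] + [1_740_000]
print(mean(bigger), median(bigger))  # the mean moves; the median does not
```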

Example 2: What is the median of the data set 8, 15, 4, 1, 2? Put the numbers in order: 1, 2, 4, 8, 15. There are five numbers, and 50% of 5 is 2.5. You need the number that is above 2 data points and below 2 data points; the median is 4.

Example 3: What is the median of the data set 7, 24, 15, 1, 7, 45? There are six data points, and in order they are 1, 7, 7, 15, 24, 45. 50% of 6 is 3; you need the number that is above 3 data points and below 3 data points. It’s clear that the median is between 7 and 15, but where exactly? When the sample size is an even number, the median is the average of the two middle numbers. Therefore the median for this data set is the average of 7 and 15, (7+15)/2 = 11.
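
The by-hand procedure in Examples 2 and 3 can be sketched in a few lines of Python. The function name median_by_hand is just a label for this illustration.

```python
def median_by_hand(data):
    """Sort the data, then take the middle value (odd count)
    or the average of the two middle values (even count)."""
    ordered = sorted(data)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median_by_hand([8, 15, 4, 1, 2]))       # Example 2
print(median_by_hand([7, 24, 15, 1, 7, 45]))  # Example 3
```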

3A2.  Mean, Median, Mode, and the Shape of a Data Set

When a distribution is symmetric, the mean and median are close together. If it’s unimodal, the mode is close to the mean and median as well.

But have you ever taken a course that was graded on a curve, and one or two “curve wreckers” ruined things for everyone else? What happened? Their high scores raised the class average (mean), so everybody else’s scores looked worse. The class scores were skewed right: low scores occurred frequently, and high scores were rare. (You can see shapes of skewed distributions in Chapter 2.)

When a distribution is skewed, the mean is pulled toward the extreme values. The median is resistant, unaffected by extreme values. And you can reverse that logic too: if the mean is greater than the median, it must be because the distribution is skewed right. From the median to the mean is the direction of skew.

Skewed left: mean < median (usually)
Skewed right: mean > median (usually)

For heaven’s sake, don’t memorize that! Instead, just draw a skewed distribution and ask yourself approximately where the mean and median fall on it.

Karl Pearson gives the rule median = (2×mean + mode)/3 for moderately skewed distributions. For more about this, see Empirical Relation between Mean, Median and Mode [see “Sources Used” at end of book].
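
You may also meet this rule written in terms of differences. The rearrangement below is plain algebra, offered as a check; the ≈ sign is a reminder that it’s a rule of thumb, not an identity.

```latex
\text{median} \approx \frac{2\,\text{mean} + \text{mode}}{3}
\;\Longleftrightarrow\;
3\,\text{median} \approx 2\,\text{mean} + \text{mode}
\;\Longleftrightarrow\;
\text{mean} - \text{mode} \approx 3\,(\text{mean} - \text{median})
```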

Caution! All the statements in this section are a rule of thumb, true for most distributions. The logic holds for almost every unimodal continuous distribution, and for discrete distributions with a lot of different values. But it tends to break down on discrete distributions that have only a few different values. For more about this, see von Hippel 2005 [see “Sources Used” at end of book].

3B.  Summary Numbers on the TI-83 …

Summary: The 1-VarStats command gives you mean, median, and much more for any data set. If you have just a plain list of numbers, enter the name of that list on the command line. If you have a frequency distribution, enter the name of the data list and the name of the frequency list on the command line.

Excel: Excel can do these computations. This isn’t an Excel course, but if you’re an Excel head you can figure out how to get this information. One way is with the Data Analysis tool, part of the Analysis Toolpak add-in that comes with Excel (though you may have to enable it). Another way is to click in a blank cell, click Formulas » More Functions » Statistical and select the appropriate worksheet function.

3B1.  … from a List of Numbers

Example 4: Professor Marvel had a statistics class of fifteen students, and on one quiz their scores were

10.5   13.5   8   12   11.3   9   9.5   5   15   2.5   10.5   7   11.5   10   10.5

Your TI-83 or TI-84 can give you the mean, median, and other numbers that summarize this data set.

  1. If you have any partial commands visible, press [CLEAR].
  2. Press [STAT] [ENTER] to get into the edit screen for statistics lists. You can use any list, but let’s use L1 this time. (If you don’t see L1, and pressing the left arrow doesn’t bring it into view, press [STAT] [5] [ENTER] [STAT] [ENTER].)
  3. Cursor to the L1 label at the top — not the top number, the column heading — and press [CLEAR] [ENTER] to clear the list.
  4. Enter your numbers, pressing [ENTER] after each one. (Screen shot: data entry in TI-83 stats editor.)
  5. After entering the last number, check all the numbers carefully and make any needed corrections. If you duplicated a number, press [DEL] to remove it; if you left out a number, press [2nd DEL makes INS] to open a space for it.
  6. Press [STAT] [►] [1] to select 1-VarStats.
  7. After writing down the complete command on your paper — 1-VarStats L1 — press [ENTER] to execute it.

The results screen is shown below. A down arrow on the screen says that there is more information if you press [▼], and an up arrow says that there is more information if you press [▲].

TI-83 results screen for 1-VarStats, screen 1 of 2      TI-83 results screen for 1-VarStats, screen 2 of 2

Look first at the bottom of the screen. Always check n first — if it doesn’t match your sample or population size, the other numbers are big sacks of bogosity. In this case a quick count of the original data set shows 15 numbers, which is the right quantity. (Of course, this check can’t determine if you miskeyed any numbers. Only double and triple checking can protect you from that kind of mistake.)

What are you seeing on this screen?

Showing your work and your results, you write down:

1-VarStats L1

μ = 9.7

σ = 3.1

N = 15

min = 2.5

Q1 = 8

Med = 10.5

Q3 = 11.5

max = 15
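
If you don’t have the calculator handy, the same summary can be approximated in Python. This is only a sketch: one_var_stats is a made-up name, it treats the list as a whole population, and it finds quartiles as medians of the lower and upper halves (the by-hand method described later in this chapter).

```python
from statistics import mean, median, pstdev

scores = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5,
          10.5, 7, 11.5, 10, 10.5]

def one_var_stats(data):
    """Population summary of a plain list, quartiles as medians of halves."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # values below the median position
    upper = ordered[half + n % 2:]    # skip the median itself when n is odd
    return {
        "mean": mean(ordered),
        "sd": pstdev(ordered),        # population SD (divides by N)
        "n": n,
        "min": ordered[0],
        "Q1": median(lower),
        "Med": median(ordered),
        "Q3": median(upper),
        "max": ordered[-1],
    }

for name, value in one_var_stats(scores).items():
    print(name, round(value, 4))
```

As with the calculator, check n first: if it isn’t 15, a score was mistyped or dropped.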

3B2.  … from an Ungrouped Distribution

Number of Adults
in Vehicles Entering Park
Adults in
Number of

Example 5: Your TI-83 or TI-84 can also compute statistics of a frequency distribution. Let’s try it with the data from Chapter 2 for number of adults in vehicles entering the park.

Enter the data values in one statistics list, such as L1. Enter the frequencies in a second list, such as L2. Press [STAT] [►] [1] to select 1-VarStats.

Either way, write down the complete command on your paper: 1-VarStats L1,L2.

Here are the results:


TI-83 output screen 1 of 2      TI-83 output screen 2 of 2

Again, look at n first. That protects you from the rookie mistake of leaving off the frequency list. If n is wrong, redo your 1-VarStats command and this time do it right.

These forty vehicles are obviously not all the vehicles that enter the park, so they are a sample, not a population. You therefore write down the statistics as follows:

1-VarStats L1,L2

x̄ = 3.0

s = 1.7 (from Sx = 1.73186575)

n = 40

min = 0

Q1 = 2

Med = 3

Q3 = 4

max = 8

Weighted Average

Sometimes you take an average where some data points are more important than others. We say that they are weighted more heavily, and the mean that you compute in this way is called a weighted average or weighted mean.

You’re intimately familiar with one example of a weighted average: your GPA or grade point average.

Example 6: The NHTSA’s Corporate Average Fuel Economy or CAFE Rule (NHTSA 2008) [see “Sources Used” at end of book] specifies a corporate average of 34.8 mpg (miles per gallon) for passenger cars. Let’s keep things simple and suppose that ZaZa Motors makes three models of passenger car: the Behemoth gets 22 mpg, the Ferret gets 36 mpg, and the Mosquito gets 50 mpg. Does ZaZa meet the standard?

To answer that, you can’t just average the three models: (22+36+50)/3 = 36 mpg. Suppose the company sells one Mosquito and the rest are Behemoths and a sprinkling of Ferrets? You have to take into account the number of cars of each model sold. In effect, you have a frequency distribution with mpg figures and repetition counts. Let’s suppose these are the sales figures:

Auto Sales by ZaZa Motors
Model      Miles per Gallon      Number Sold
Total                            370,000

Put the miles per gallon in L1 and the frequencies in L2. (How do you know it’s not the other way around? You’re trying to find an average mpg, so the mpg numbers are your data.) You should find:

1-VarStats L1,L2

μ = 32.3 mpg

N = 370,000 passenger cars

Even though two of the three models meet the standard, the mix of sales is such that ZaZa Motors’ CAFE is 32.3 mpg, and it’s not in compliance.

The formula for the mean of a grouped distribution and the formula for a weighted average are the same formula: μ = ∑xf/N for a population or x̄ = ∑xf/n for a sample. Either way, take each data value times its frequency. Add up all those products, and divide by the population size or sample size. For the notation, see ∑ Means Add ’em Up in Chapter 1.
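
Here is the ∑xf/n recipe as a short Python sketch, applied to a GPA. The grade points and credit hours are hypothetical numbers invented for this illustration, and weighted_mean is just a convenient name.

```python
def weighted_mean(values, weights):
    """Sum of value*weight divided by the sum of the weights."""
    total = sum(x * f for x, f in zip(values, weights))
    return total / sum(weights)

# Hypothetical transcript: an A in a 3-credit course, a B in a
# 4-credit course, and a C in a 1-credit course.
grade_points = [4.0, 3.0, 2.0]
credits = [3, 4, 1]

print(weighted_mean(grade_points, credits))    # weighted by credit hours
print(sum(grade_points) / len(grade_points))   # naive average ignores credits
```

The two answers differ because the naive average treats the 1-credit course as if it counted as much as the 4-credit course.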

3B3.  … from a Grouped Distribution

In a grouped frequency distribution, one number called the class midpoint stands for all the numbers in the class.

Definition: The class midpoint for a given class equals the lower boundary plus half the class width. This is half way between the lower class boundary of this class and the lower class boundary of the next class.

Lengths of iTunes Songs (seconds)
Class Boundaries      Class Midpoint      Frequency

Example 7: Let’s revisit the lengths of iTunes songs from the grouped histogram in Chapter 2. What is the midpoint of the 300 to 399 class?

The class width equals the difference between lower boundaries: 400−300 = 100. Half the class width is 50, so the midpoint is 300+50 = 350. You could also compute the class midpoint as (300+400)/2 = 350.

However, it is wrong to take (300+399)/2 = 349.5 as class midpoint or 399−300 = 99 as class width. Don’t use the upper boundary in finding the class midpoint.

Of course you don’t have to compute every class midpoint the long way. Once you have the midpoint of the first class, (100+200)/2 = 150, just add the class width repeatedly to get the rest: 250, 350, … 850. The grouped frequency distribution, with the class midpoints, is shown at right.
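
The midpoint rule is easy to sketch in Python. This version assumes every class has the same width and takes only the lower boundaries, so there is no temptation to use an upper boundary; class_midpoints is a made-up name.

```python
def class_midpoints(lower_boundaries):
    """Midpoint = lower boundary + half the class width, where the width
    is the difference between consecutive lower boundaries."""
    width = lower_boundaries[1] - lower_boundaries[0]
    return [low + width / 2 for low in lower_boundaries]

# Lower boundaries 100, 200, ..., 800 (classes 100-199 through 800-899).
lowers = list(range(100, 900, 100))
print(class_midpoints(lowers))
```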

What good is the class midpoint? It’s a stand-in for all the numbers in its class. Instead of being concerned with the nine different numbers in the 100 to 199 class, twenty different numbers in the 200 to 299 class, and so on, we pretend that the entire data set is nine 150s, twenty 250s, and so on. This means you get approximate statistics, but you get them with a lot less work.

Is this legitimate? How good is the approximation? Usually, quite good. In most data sets, a given class holds about equally many data points below the class midpoint and above the class midpoint, so the errors from making the approximation tend to balance each other out. And the bigger the data set, the more points you have in each class, so the approximation is usually better for a larger data set.

TI-83 stats edit screen

Procedure: Enter the class midpoints in one statistics list, such as L1. Enter the frequencies in another list, such as L2. Enter the command 1-VarStats L1,L2 and write down the complete command on your paper.

Again, avoid the rookie mistake: include the class-midpoint list and the frequency list in your command.

The results screens are below. As usual, before you look at anything else, check that n matches the size of the data set. 50 is correct, so that’s one less worry.

TI-83 output screen 1 of 2      TI-83 output screen 2 of 2

There’s a problem with the second screen, though. Your calculator knows you have a frequency distribution, because you gave two lists to the 1-VarStats command. But it doesn’t have the original data, so it doesn’t know the true minimum (lowest data point). When you read minX=150, you interpret that to mean that the lowest data point occurs in the class whose midpoint is 150; in other words, the minimum is somewhere between 100 and 199. Your knowledge of the rest of the five-number summary has the same limitation. For instance, the median isn’t 250; all you know is that it occurs somewhere between 200 and 299.

Because of these limitations, you don’t do anything with the second results screen from a grouped distribution. The mean and standard deviation don’t have this problem: they’re approximate, but the approximation is good enough. (n is exact, not an approximation.)

These 50 iTunes songs are obviously not all the songs there are, not even all the songs in any particular person’s iTunes library. They are a sample, not a population. Therefore you write down your work and results like this:

1-VarStats L1,L2

x̄ = 316 (or you could write 316.0)

s = 145.1

n = 50

3C.  Measures of Spread

There are four common measures of the spread of a data set: range, interquartile range or IQR, variance, and standard deviation. (You may also see spread referred to as dispersion, scatter, variation, and similar words.)

3C1.  Range and IQR (Interquartile Range)

Definition: The range of a data set is the distance between the largest and smallest members.

Example 8: If the largest number in a data set is 100 and the smallest is 20, the range is 100−20 = 80, regardless of what numbers lie between them and what shape the distribution might have.

Caution: The range is one number: 80, not “20 to 100”.

Obviously the range has a problem as a measure of spread: It uses only two of the numbers. Since only the two most extreme numbers in the data set get used to compute the range, the range is about as far from resistant as anything can be.

In favor of the range is that it’s easy to compute, and it can be a good rough descriptor for data sets that aren’t too weird. The interquartile range has something of the same idea, but it is resistant.


The interquartile range (IQR) is the distance between the largest and smallest members of the middle 50% of the data points, taking repetitions into account.

Alternative definition: The IQR is the third quartile minus the first quartile, or the 75th percentile minus the 25th percentile.

You’ll learn about percentiles and quartiles in the next section, Measures of Position, but for now let’s just take a quick non-technical example.

Example 9: Consider the data set 1, 2, 3, 3, 3, 4, 5, 8, 11, 11, 15, 23. There are twelve numbers, and the middle 50% (six numbers) are 3, 3, 4, 5, 8, 11. The interquartile range is 11−3 = 8.
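
Here is Example 9 as a Python sketch, with the range alongside for comparison. It assumes the median-of-halves way of finding quartiles that this chapter describes under Measures of Position; other definitions can give slightly different quartiles.

```python
from statistics import median

def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves
    (the median itself is left out when the count is odd)."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    return median(ordered[:half]), median(ordered[half + n % 2:])

data = [1, 2, 3, 3, 3, 4, 5, 8, 11, 11, 15, 23]
q1, q3 = quartiles(data)
print("range =", max(data) - min(data))
print("IQR =", q3 - q1)
```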

3C2.  Standard Deviation

The IQR is a better measure of spread than the range, because it’s resistant to the extreme values. But it still has the problem that it uses only two numbers in the data set. Isn’t there some measure of spread that uses all the numbers in the data set, as the mean does? The answer is yes: the variance and the standard deviation use all the numbers.

Your calculator gives you the standard deviation, as you saw above. The variance is important in a theoretical stats course, but not so much in this practical course. We’ll measure spread with the standard deviation almost exclusively. (To save wear and tear on my keyboard and your printer, I’ll often use the abbreviation SD.)

If you’d like to know how the variance and SD are computed, read the “BTW” section that follows. Otherwise, skip down to “What Good Is the Standard Deviation, Anyway?”

To see how the variance is computed, let’s go back to Professor Marvel’s quiz scores. We computed the mean as 9.7, or to use the unrounded value, μ = 9.72. (Never round numbers if you’re going to use them in further calculation; that’s the Big No-no.)

If you want to devise a measure of spread, it seems reasonable to consider spread from the mean, so try subtracting the mean from each quiz score and then adding up all those deviations. You get zero, so obviously “sum of deviations” isn’t a useful measure of spread.

But with the next column you strike gold. Squaring all the deviations changes the negatives to positives, and also weights the larger deviations more heavily. This is progress! Now divide the total of squared deviations by the population size and you have the variance: σ² = 140.2640/15 = 9.3509. (σ is the Greek letter sigma.)

(When computing the variance of a sample, you divide by n−1 rather than n. The reasons are technical and are explained in Steve Simon’s articles Degrees of Freedom (1999a) [see “Sources Used” at end of book] and Degrees of Freedom, Part 2 (2004) [see “Sources Used” at end of book].)

The variance is quite a good measure of spread because it uses all the numbers and combines their differences from the mean in one overall measure. But it’s got one problem. If the data are dollars, the squared deviations will be in square dollars, and therefore the variance will be in square dollars. What’s a square dollar? (No, I don’t know either.) You want a measure of spread that is in the same units as the original data, just like the mean and median are. The simplest solution is to take the square root of the variance, and when you do that you have the standard deviation (SD), σ = √(140.2640/15) = 3.05793, which rounds to 3.1. And because the standard deviation is in the same units as the original data, it can be used as a yardstick, as you’ll see below.
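
The whole computation for Professor Marvel’s quiz scores can be sketched in Python, one step at a time. It treats the fifteen scores as a population, so it divides by N.

```python
from math import sqrt

scores = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5,
          10.5, 7, 11.5, 10, 10.5]

N = len(scores)
mu = sum(scores) / N                      # unrounded mean

deviations = [x - mu for x in scores]     # each score minus the mean
squared = [d ** 2 for d in deviations]    # squared deviations

sum_dev = sum(deviations)                 # zero, apart from float round-off
sum_sq = sum(squared)                     # total of squared deviations
variance = sum_sq / N                     # population variance
sd = sqrt(variance)                       # population standard deviation

print(round(sum_dev, 10), round(sum_sq, 4), round(variance, 4), round(sd, 5))
```

Notice that the unrounded mean is used throughout, in keeping with the Big No-no.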

For lovers of formulas, here they are. The standard deviation of a population, σ, has population size N on the bottom of the fraction; the standard deviation of a sample, s, has sample size n minus 1 on the bottom of the fraction. If you’re not familiar with the ∑ notation (sigma or summation), ∑x² means square every data value and add the squares; ∑x²f means square every data value, multiply by the frequency, and add those products. For the notation, see ∑ Means Add ’em Up in Chapter 1.

Formulas for Standard Deviation
of a List of Numbers      of a Frequency Distribution
When the data set is the whole population
When the data set is just a sample

Why are there two formulas on each row under “list of numbers”? The first formula is the definition, and the second is a shortcut for faster computations. Of course they’re mathematically equivalent; you could prove that if you wanted to.
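
For reference, here are the four cases written out in LaTeX, following the description above (N or n−1 on the bottom, ∑x² and ∑x²f in the shortcuts). The definition forms are standard. The exact arrangement of the shortcut forms is an assumption, since several equivalent arrangements exist.

```latex
% List of numbers, whole population: definition = shortcut
\sigma = \sqrt{\frac{\sum (x-\mu)^2}{N}}
       = \sqrt{\frac{\sum x^2 - (\sum x)^2/N}{N}}

% List of numbers, sample: definition = shortcut
s = \sqrt{\frac{\sum (x-\bar{x})^2}{n-1}}
  = \sqrt{\frac{\sum x^2 - (\sum x)^2/n}{n-1}}

% Frequency distribution, whole population (N = \sum f)
\sigma = \sqrt{\frac{\sum x^2 f - (\sum x f)^2/N}{N}}

% Frequency distribution, sample (n = \sum f)
s = \sqrt{\frac{\sum x^2 f - (\sum x f)^2/n}{n-1}}
```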

Sir Ronald Fisher coined the term variance in 1918. He used the symbol σ² for the variance of a population, since Pearson had already assigned σ to the standard deviation, and the variance is the square of the SD.

What Good Is the Standard Deviation, Anyway?

The standard deviation will be the key to inferential statistics, starting in Chapter 8, but even within the realm of descriptive statistics there are some applications. In addition to this section, you’ll see an application in z-Scores, below.

Working with the quiz scores on your TI-83 or TI-84, you found that the population mean was μ = 9.7 and the population SD was σ = 3.1. What does this mean?

Just as a concept, the standard deviation gives you an idea of the expected variation from one member of the sample or population to the next. The SD in this example is about a third of the mean, so you expect some variation but not a lot. But can you do better than this? Yes, you can!

The Empirical Rule for Normal Distributions

You can predict what percentage of the data will be within a certain number of standard deviations above or below the mean. In a normal distribution, 68% of the data are between one SD below the mean and one SD above the mean (μ±σ), 95% are within two SD of the mean (μ±2σ), and 99.7% are within three SD of the mean (μ±3σ).

This is the Empirical Rule or 68–95–99.7 Rule. Caution! It’s good for normal distributions only.

In a normal distribution, 68.0% of data are within 0.99 standard deviations above and below the mean.     In a normal distribution, 95.0% of data are within 1.96 standard deviations above and below the mean.     In a normal distribution, 99.7% of data are within 2.97 standard deviations above and below the mean.

You’ll notice that the 68%, 95%, and 99.7% of data occur within approximately one, two, and three SD of the mean. More accurate figures are shown in the pictures, but for now we’ll just use the simple rule of thumb. You’ll learn how to make precise computations in Chapter 7.

It’s not a traditional part of the Empirical Rule, but another useful rule of thumb is that, in a normal distribution, about 50% of the data are within 2/3 of a SD above and below the mean.

Example 10: Adult women’s heights are normally distributed with μ = 65.5″ and σ = 2.5″. (By the way, different sources give different values for human heights, so don’t be surprised to see different figures elsewhere in this book.) How tall are the middle 95% of women?

Solution: The middle 95% of the normal distribution lies between two SD below and two SD above the mean. 2σ = 2×2.5 = 5″, and 65.5±5 = 60.5″ to 70.5″, so 95% of women are 60.5″ to 70.5″ tall.

Actually there are two interpretations. You can say that 95% of women are 60.5″ to 70.5″ tall, or you can say that if you randomly select one woman the probability that she’s 60.5–70.5″ tall is 95%. Any probability statement can be turned into a proportion statement, and vice versa. You’ll learn about this in Interpreting Probability Statements in Chapter 5.

Example 11: What fraction of women are 65.5″ to 68″ tall?

Solution: 68−65.5 = 2.5, so 68″ is one standard deviation above the mean. You know that 68% of a normal distribution is within μ±σ. You also know that the normal distribution is symmetric, so 68%/2 = 34% of women are within one SD below the mean, and 34% are within one SD above the mean. Therefore 34% of women are 65.5″ to 68″ tall.
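
If you’re curious how close the rule of thumb comes, Python’s standard library (version 3.8 or later) has a NormalDist class that can compute the areas directly. This sketch is only a preview; it assumes the heights are exactly normal with the mean and SD from Example 10.

```python
from statistics import NormalDist

heights = NormalDist(mu=65.5, sigma=2.5)   # women's heights from Example 10

# Example 10: proportion within two SD of the mean
within_2sd = heights.cdf(70.5) - heights.cdf(60.5)

# Example 11: proportion between the mean and one SD above the mean
mean_to_1sd = heights.cdf(68) - heights.cdf(65.5)

print(round(within_2sd, 4), round(mean_to_1sd, 4))
```

The results land close to 95% and 34%, which is why the Empirical Rule is good enough for quick work.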

You can combine the three diagrams above and show data in regions bounded by each whole number of standard deviations, like this:

Percentage of population in regions bounded by whole numbers of standard deviations

Where do these figures come from? For example, how do we know that about 13.5% of the population is between one and two standard deviations below the mean in a normal distribution? Well, 95% is between two SD below and two SD above the mean. Half of 95% is 47.5%, so 47.5% of the population is between the mean and two SD below the mean. Similarly, about 68% is between one SD below and one SD above, so 68/2 = 34% is between the mean and one SD below. But if 47.5% is between μ−2σ and μ (call it Region A) and 34% is between μ−σ and μ, then the part of Region A that is not in the 34% is the part between μ−2σ and μ−σ, and that must be 47.5−34 = 13.5%. If you had an afternoon to kill, you could work out the other seven percentages.
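
Or you could let Python kill the afternoon for you. This sketch starts from nothing but the 68–95–99.7 figures and works out the percentage in each region on one side of the mean; by symmetry the other side is the same.

```python
# Empirical Rule percentages within 1, 2, and 3 SD of the mean.
within = [68.0, 95.0, 99.7]

# Half of each figure lies between the mean and k SD on one side.
one_side = [p / 2 for p in within]        # 34, 47.5, 49.85

regions = {
    "mean to 1 SD": one_side[0],
    "1 SD to 2 SD": one_side[1] - one_side[0],
    "2 SD to 3 SD": one_side[2] - one_side[1],
    "beyond 3 SD": 50 - one_side[2],
}

for name, pct in regions.items():
    print(name, round(pct, 2))
```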

With this diagram, you can work Example 11 more easily, directly reading off the 34% figure for women between mean height and one SD above the mean. You can also work more complicated examples, like this one.

Example 12: If you randomly select a woman, how likely is it that she’s taller than 70.5″?

Solution: 70.5−65.5 = 5.0, so 70.5″ is two SD above the mean. From the diagram, you see that 2.35+0.15 = 2.5% of the population is more than two SD above the mean. Answer: a randomly selected woman has a 2.5% chance of being more than 70.5″ tall.

Optional:  Chebyshev’s Inequality

If you have a normal distribution, the Empirical Rule tells you how much of the population is in each region. What if you don’t have a normal distribution?

As you might expect, the portions of the population in the various regions depend on the shape of the distribution, but Chebyshev’s Inequality (or Chebyshev’s Rule) gives you a “worst case scenario”: no matter how skewed the distribution, at least 75% of the data are within 2 SD of the mean, and at least 89% are within 3 SD of the mean.

More generally, within k SD above and below the mean, you will find at least (1−1/k²)·100% of the data. (If you plug in k = 1, you’ll find that at least 0% of the data lie within one SD of the mean. Distributions where all the data are more than one SD away from the mean are unusual, but they do exist.)

Example 13: For the quiz scores, two standard deviations is 2×3.0579 = 6.1, so you expect at least 1−1/2² = 1−¼ = 75% of the quiz scores to be within the range 9.7±6.1 = 3.6 to 15.8. Remember that this is a worst case. In fact, 14 of the 15 numbers (93%) are within those limits.
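
Example 13 can be checked with a short Python sketch. The function name chebyshev_bound is made up for this illustration, and the scores are treated as a population, as earlier in the chapter.

```python
from statistics import mean, pstdev

def chebyshev_bound(k):
    """Minimum fraction of data within k SD of the mean (k > 0)."""
    return 1 - 1 / k ** 2

scores = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5,
          10.5, 7, 11.5, 10, 10.5]

mu, sigma = mean(scores), pstdev(scores)
low, high = mu - 2 * sigma, mu + 2 * sigma
inside = [x for x in scores if low <= x <= high]

print(chebyshev_bound(2))                # guaranteed minimum fraction
print(len(inside), "of", len(scores))    # what this data set actually has
```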

3D.  Measures of Position

Summary: The measures of center and spread that you’ve studied are properties of the data set as a whole. Now we look at measures of position, which consider how a given data point stands in relation to the whole sample or population that it’s part of.

3D1.  Percentiles

Definition: The percentile rank of a data point is the percentage of the data set that is equal to or less than the data point. We say that the data point is at the __th percentile or %ile for short.

The symbol is P followed by a number. For example, P35 or P₃₅ denotes the 35th percentile, the member of the data set that is greater than or equal to 35% of the data.

Percentiles are most often used in measures of human development, like your child’s performance on standardized tests, or an infant’s length or weight.

Example 14: Your daughter takes a standardized reading test, and the report says that she is in the 85th percentile for her grade. Does this make you happy or sad? Solution: 85% of her grade read as well as she does, or less well; only 15% read better than she does. Presumably this makes you happy.

Example 15: Consider the data set 1, 4, 7, 8, 10, 13, 13, 22, 25, 28. (To find percentiles, you have to put the data set in order.)

(a) What is the percentile rank of the number 13? Solution: There are ten numbers in the data set, and seven of those are ≤13. Seven out of ten is 70%, so the percentile rank of 13 is 70, or “13 is at the 70th percentile”, or P70 = 13.

(b) Find P60 for this data set. Solution: What number is greater than or equal to 60% of the numbers in the data set? Counting up six numbers from the beginning, you find … 13 again. So 13 is both P60 and P70.

(Anomalies like this are usual when you have small data sets. It really doesn’t make sense to talk about percentiles unless you have a fairly large data set, typically a population like all third graders or all six-week-old infants.)
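
Here is Example 15 as a Python sketch, using the “less than or equal to” definition from this section. The function names are made up, and percentile expects a whole-number k from 1 to 100; it uses integer arithmetic so that round-off can’t push the count up by one.

```python
def percentile_rank(data, value):
    """Percent of the data set that is less than or equal to value."""
    return 100 * sum(1 for x in data if x <= value) / len(data)

def percentile(data, k):
    """Smallest member that is >= at least k percent of the data
    (k is a whole number from 1 to 100)."""
    ordered = sorted(data)
    count_needed = (k * len(ordered) + 99) // 100   # integer ceiling of k% of n
    return ordered[count_needed - 1]

data = [1, 4, 7, 8, 10, 13, 13, 22, 25, 28]
print(percentile_rank(data, 13))                     # part (a)
print(percentile(data, 60), percentile(data, 70))    # part (b)
```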

Everybody agrees on the idea of a percentile, but different authors have different ways to compute it. For example, some authors say a percentile rank is the percent of data less than the data point, instead of less than or equal to as I did. By their definition there is a 0th percentile but no 100th percentile; by my definition there is no 0th percentile but there is a 100th percentile. And some define percentiles in such a way that the percentile (like the mean) need not be a member of the data set.

The different definitions can give very different answers for small data sets. Nobody worries too much about this, because in practice you seldom compute percentiles against small data sets. (What does “18th percentile” mean in a set of only 12 numbers?) All the definitions give pretty much the same answer for larger data sets.

David Lane’s Percentiles (2010) [see “Sources Used” at end of book] gives three definitions of percentile and shows what difference they make. His Definition 2 is the one I use in this book.

3D2.  Quartiles

Definitions: The first quartile (Q1) is the member of the data set that is greater than or equal to a quarter of the data points. The third quartile (Q3) is the member of the data set that is greater than or equal to three quarters of the data points.

To find quartiles by hand, put the data set in order and find the median. If you have an odd number of data points, strike out the median. Q1 is the median of the lower half, and Q3 is the median of the upper half.

One fourth is 25% and three fourths is 75%, so Q1 = P25 and Q3 is P75. (I chose a definition of percentiles that makes this happen. Some authors use different definitions, which may give slightly different results.)

What, no Q2? There is a Q2, but two quarters is one half, or 50%, so the second quartile is better known as the median: 50% of the data are less than or equal to the 50th %ile, alias Q2, alias the median.

The quartiles and the median divide the data set into four equal parts. We sometimes use the word quartile in a way that reflects this: the “bottom quartile” means the part of the data set that is below Q1, and the “upper quartile” or “top quartile” means the part of the data set that is above Q3.

Q1 and Q3 are part of the five-number summary (later in this chapter). From Measures of Spread, you already know that they’re used to find the interquartile range, and later in this chapter you’ll use the IQR to make a box-whisker plot.

Just like percentiles, quartiles are defined slightly differently by different authors. Dr. Math gives a nice, clear rundown of different ways of computing quartiles in Defining Quartiles in The Math Forum (2002) [see “Sources Used” at end of book]. I follow Moore and McCabe’s method, which is also used by your TI-83 or TI-84.

3D3.  z-Scores

You’ll use z-scores more than any other measure of position. (Remember that every measure of position measures the position of one data point within the sample or population that it is part of.)

Definition: The z-score of a data point is how many standard deviations it lies above or below the mean. (A z-score is sometimes called a standard score.)

How do you find out how many SD a number is above or below the mean of its data set? You subtract the mean, and then divide the result by the SD.
Within a sample: z = (x − x̄)/s      Within a population: z = (x − μ)/σ
Either way, z equals the data point minus the mean, all divided by the standard deviation.

When you compute a z-score, the top and bottom of the fraction are both in the same units as the original data, and therefore the z-score itself has no units. z-scores are pure numbers.

What good are z-scores? You’ll use them in inferential statistics, starting in Chapter 9, but you can also use them in descriptive statistics.

For one thing, a z-score gives you economy in language. Instead of saying “at least 75% of the data in any distribution must lie between two standard deviations below the mean and two standard deviations above the mean”, you can say “at least 75% of the data lie between z = ±2.”

A z-score helps you determine whether a measurement is unusual. For instance, how good is an SAT verbal score of 300? Scores on the SAT verbal are normally distributed with a mean of 500 and an SD of 100, so z = (300−500)/100 = −2. The Empirical Rule tells you only 2½% of students score that low or lower.

And z-scores are also good for comparing apples and oranges, as the next example shows.

Example 16: You have two candidates for an entry-level position in your restaurant kitchen. Both have been to chef school, but different schools, and neither one has any experience. Chris presents you with a final exam score of 86, and Fran with a final exam score of 67. Which one do you hire?

At first glance, you’d go with the one who has the higher score. But wait! Maybe Fran with the 67 is actually better, and just went to a tougher school. So you ask about the average scores at the two schools. Chris’s school had a mean score of 76, and Fran’s school had a mean score of 59. Assuming that the students at the two schools had equal innate ability, Fran went to a tougher school than Chris.

Chris scored 10 points above the school average, while Fran scored only 8 points above the school average. Now do you hire Chris? Not yet! Maybe there was more variability in Chris’s class, so 10 points above the average is no big deal, but there was less variability in Fran’s, so 8 points above the mean is a big deal. So you dig further and find that the standard deviations of the two classes were 8 and 4. At this point, you make a table:

                      Chris                 Fran
Candidate’s score     86                    67
School mean           76                    59
School SD             8                     4
z-score               (86−76)/8 = 1.25      (67−59)/4 = 2.00

The z-scores tell you that Fran stands higher in Fran’s class than Chris stands in Chris’s class. Assuming that the two classes as a whole were of equal ability, Fran is the stronger candidate.
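If you'd like to check z-scores on a computer instead of the calculator, here is a minimal Python sketch. The function name z_score is just my label for illustration; the numbers are the ones from Example 16 and from the SAT example above.

```python
def z_score(x, mean, sd):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

chris = z_score(86, 76, 8)      # Chris's exam score within Chris's school
fran = z_score(67, 59, 4)       # Fran's exam score within Fran's school
sat = z_score(300, 500, 100)    # SAT verbal score of 300

print(f"Chris z = {chris:.2f}, Fran z = {fran:.2f}, SAT z = {sat:.2f}")
```

The units (exam points, SAT points) cancel in the division, which is why the results can be compared directly.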

3E.  Five-Number Summary

Definition: The five-number summary of a data set is the minimum value, Q1, median, Q3, and maximum value (in order).

The five-number summary combines measures of center (the median) and spread (the interquartile range and the range). A plot of the five-number summary, called a box-whisker diagram (below), shows you the shape of the data set.

On the TI-83 or TI-84, the five-number summary is the second output screen from 1-VarStats. Caution! Remember that the second screen is meaningful only for a simple list of numbers or an ungrouped distribution, not for a grouped distribution. To produce a five-number summary, you need all the original data points.

five-number summary for quiz scores; see text Example 17: Here is the second output screen from the quiz scores earlier in this chapter. The five-number summary is 2.5, 8, 10.5, 11.5, 15.

The median is 10.5, meaning that half the students scored 10.5 or below and half scored 10.5 or above.

The interquartile range is Q3−Q1 = 11.5−8 = 3.5. Half of the students scored between 8 and 11.5.
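Here is a rough Python sketch of a five-number summary for the quiz scores. It assumes one particular quartile convention: Q1 and Q3 are the medians of the lower and upper halves of the sorted data, leaving out the middle value when the count is odd. That convention reproduces the numbers in Example 17, but other software may use a different rule and give slightly different quartiles. The function name is my own.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a list of at least two numbers.

    Assumed convention: quartiles are medians of the lower and upper halves,
    excluding the middle value when the number of data points is odd.
    """
    if len(data) < 2:
        raise ValueError("need at least two data points")
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]     # values below the median position
    upper = xs[-half:]    # values above the median position
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])

quiz = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5, 10.5, 7, 11.5, 10, 10.5]
summary = five_number_summary(quiz)
iqr = summary[3] - summary[1]
print(summary, "IQR =", iqr)
```

As with the calculator's second screen, this needs the original data points; it makes no sense for a grouped distribution.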

3E1.  Outliers


An outlier is a data value that is well separated from most of the data.

Conventionally, the values Q1−1.5×IQR and Q3+1.5×IQR (first quartile minus 1½ times interquartile range, and third quartile plus 1½ times interquartile range) are called fences, and any data points outside the fences are considered outliers.

Example 18: Here again are the quiz scores from earlier in this chapter:

10.5   13.5   8   12   11.3   9   9.5   5   15   2.5   10.5   7   11.5   10   10.5

Find the outliers, if any.

The five-number summary, above, gave you the quartiles: Q1 = 8 and Q3 = 11.5. The interquartile range is 11.5−8 = 3.5, and 1.5 times that is 5.25. The fences are 8−5.25 = 2.75 and 11.5+5.25 = 16.75. All the data points but one lie within the fences; only 2.5 is outside. Therefore 2.5 is the only outlier in this data set.
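The fence arithmetic in Example 18 can be sketched in a few lines of Python. The quartiles 8 and 11.5 are taken from the text; the function name fences is only an illustrative choice.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) using the 1.5 x IQR convention."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

quiz = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5, 10.5, 7, 11.5, 10, 10.5]

low, high = fences(8, 11.5)    # Q1 and Q3 from the five-number summary
outliers = [x for x in quiz if x < low or x > high]

print("fences:", low, high)
print("outliers:", outliers)
```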

You can find outliers more easily by using your TI-83 or TI-84; see below.

Why do you care about outliers? First off, an outlier might be a mistake. You should always check all your data carefully, but check your outliers extra carefully.

But if it’s not a mistake, an outlier may be the most interesting part of your data set. Always ask yourself what an outlier may be trying to tell you. For example, does this quiz score represent a student who is trying but needs some extra help, or one who simply didn’t prepare for the quiz?

What do you do with outliers? One thing you definitely don’t do: Don’t just throw outliers away. That can really give a false picture of the situation.

But suppose you have to make some policy decision based on your analysis, or run a hypothesis test (Chapters 10 and 11) and announce whether some claim is true or false?

One way is to do your analysis twice, once with the outliers and once without, and present your results in a two-column table. Anyone who looks at it can judge how much difference the outliers make. If you’re lucky, the two columns are not very different, and whatever decision must be made can be made with confidence.

But maybe the two columns are so different that including or excluding the outliers leads to different decisions or actions. In that case, you may need to start over with a larger sample, change your data collection protocol, or call in a professional statistician.

For more on handling outliers, see Outliers (Simon 2000d) [see “Sources Used” at end of book].

3E2.  Box-Whisker Diagrams

The five-number summary packs a lot of information, but it’s usually easier to grasp a summary through a picture if possible. A graph of the five-number summary is called a boxplot or box-whisker diagram.

The box-whisker diagram was invented by John Tukey in 1970.

A box-whisker diagram has a horizontal axis, which is the number line of the data, and the number line need not start at zero. Either the axis or the chart as a whole needs a title, but there’s usually no need for a title on both. There is no vertical axis.

For the graph itself, first identify any outliers and mark them as squares or crosses. Then draw a box with vertical lines at Q1, the median, and Q3. Lastly, draw whiskers from Q3 to the greatest value in the data set that isn’t an outlier, and from Q1 to the smallest value in the data set that isn’t an outlier.

Example 19: Let’s look at a box-whisker plot of those same quiz scores, which were

10.5   13.5   8   12   11.3   9   9.5   5   15   2.5   10.5   7   11.5   10   10.5

five-number summary for quiz scores; see text The five-number summary is reproduced at right. You recall from the previous section that there is one outlier, 2.5, so the smallest number in the data set that isn’t an outlier is 5.
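To see where the whiskers end, you can keep only the data points inside the fences and take the smallest and largest of those. This is just a sketch; the fences 2.75 and 16.75 come from Example 18, and the function name is made up for illustration.

```python
def whisker_ends(data, low_fence, high_fence):
    """Smallest and largest data values that are not outliers."""
    inside = [x for x in data if low_fence <= x <= high_fence]
    return min(inside), max(inside)

quiz = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5, 10.5, 7, 11.5, 10, 10.5]
left_end, right_end = whisker_ends(quiz, 2.75, 16.75)
print("whiskers run from", left_end, "to", right_end)
```

The left whisker stops at 5, not at 2.5, because 2.5 is plotted separately as an outlier.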

Here’s a plot that I made with StatTools from Palisade Corporation:

box-whisker plot of quiz scores

Box-Whisker Plot, and Shape of a Data Set

The box-whisker plot is almost as good as a histogram for showing you the shape of a distribution. If one whisker is longer than the other, and especially if there are outliers on the same side as the long whisker, the distribution is skewed in that direction. If the whiskers are about the same length and there are no outliers, but one side of the box is longer than the other, that usually indicates skew in that direction as well.

Example 20: In the boxplot of quiz scores, just above, you see an outlier on the left side, and the left side of the box is longer than the right. That indicates that the distribution is left skewed.

Box-Whisker Plot on TI-83/84/89

You can use your TI-83 or TI-84 to make a box-whisker plot. The calculator comes with that ability — see Box-Whisker Plots on TI-83/84 — but it’s easier to use MATH200A Program part 2. See Getting the Program for instructions on getting the program into your calculator.

(If you have a TI-89, see Box-Whisker Plots on TI-89.)

To make a box-whisker plot with the program, begin by entering the numbers into a statistics list, such as L1. (If you have an ungrouped frequency distribution, put the numbers in one list and the frequencies in a second list. You need the original data for a boxplot, so you can’t make a boxplot of a grouped frequency distribution.)

Now press [PRGM]. If you can see MATH200A in the list, press its menu number; otherwise, use the [▲] or [▼] key to get to MATH200A, and press [ENTER].

With the program name on your home screen, press [ENTER] (again) to run the program, and yet again to dismiss the title screen. You’ll then see a menu. Press [2] for box-whisker plot.

MATH200A splash screen      MATH200A menu screen

MATH200A box-whisker: choosing number of samples The program asks whether you have one, two, or three samples. Select 1, since that’s what you have.

MATH200A box-whisker: choosing data arrangement The program wants to know whether you have a plain list of numbers or an ungrouped frequency distribution. Since you have a plain list, choose 1.

MATH200A box-whisker: specifying which list The program needs to know which list holds the numbers to be plotted.

MATH200A box-whisker plot Finally, the program presents the box-whisker plot.

Finding Outliers with the TI-83/84/89 or Excel

MATH200A box-whisker plot When you have a box-whisker plot on your screen, whether you used MATH200A part 2 or the calculator’s native commands, if you see any outliers press [TRACE] and then [◄] or [►] to find which data points are outliers.

(For the TI-89, see Box-Whisker Plots on TI-89. If you prefer to use Excel to find outliers, see Normality Check and Finding Outliers in Excel.)

Five-Number Summary from TI-83/84/89 Boxplot

After pressing the [TRACE] key, you can get the five-number summary by pressing [◄] or [►] repeatedly. If there are outliers at the left, use the lowest one for the minimum (first number in the five-number summary); if there are outliers at the right, use the highest one for the maximum (last number in the five-number summary).

What Have You Learned?

Overview: With numeric data, the goal of descriptive stats is to show shape, center, spread, and outliers.


Exercises for Chapter 3

Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.

Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

1 When is the mean not the best choice for a measure of center? What would you use instead?
2 Your doctor tells you that you’re in the 15th percentile for cholesterol. Should you be concerned, or should you go out and celebrate with bacon-wrapped shrimp? (Give your reason, not just an answer.)
3 Consider these questions about measures of spread:
(a) What’s the biggest problem with the range?
(b) What makes the interquartile range a better measure of spread?
(c) Why is the variance better than both?
(d) What makes the standard deviation (SD) better than the variance? (Give two reasons.)
4 Contrast (a) s and σ, (b) μ and x̄, (c) N and n.
5 Your smart-alecky statistics prof distributes quiz results as z-scores rather than raw scores. It’s a large class, and quiz scores were normally distributed. Your z-score was +1.87. How did you do relative to the class?
6 Weights of apples (of a particular type) are normally distributed. In a large shipment, you find that nearly all the apples weigh between 4.50 and 8.50 ounces. Estimate the SD of the weight of apples in that shipment.
7 The grouped frequency distribution below is the ages reported by a sample of Roman Catholic nuns, from Johnson and Kuby (2004, 67) [see “Sources Used” at end of book].
Ages       Frequency
20 – 29    34
30 – 39    58
40 – 49    76
50 – 59    187
60 – 69    254
70 – 79    241
80 – 89    147
(a) Approximate the mean and SD of the ages of these nuns, to two decimal places, and find the sample size.
(b) Explain why a boxplot of this distribution is a bad idea.

8 You took the courses shown below. On the usual scale of A = 4.0, A− = 3.7, B+ = 3.3, and so forth, compute your GPA. (Your GPA, grade point average, is the average of your course grades, weighted by number of credits in each course.)
Course            Credits   Grade
Microsoft Word    1         C−
English Comp      3         C
9 Your prof has a policy that you can skip the final exam if your quiz average is 87% or better. After ten quizzes, your average is 86%. One quiz remains. Is it still possible for you to skip the final, and if so what percentage score do you need on that last quiz?
10 In a GM factory in Brazil, 25 workers were asked their commuting distance in kilometers. The data, from Dabes and Janik (1999, 8) [see “Sources Used” at end of book], are shown at right.
Commuting Distances in km
(a) Construct a grouped frequency distribution for 0–9 km, 10–19 km, and so on. (You made a stemplot for a homework problem in Chapter 2, so use that answer to save yourself some work.)
(b) What is the class width? What are the class midpoints?
(c) Use your grouped distribution to approximate the mean and SD of the commuting distances.
(d) Now compute the mean, median, and SD of the original data set.
(e) Construct a box-whisker plot from the original data set. (Never make a boxplot from grouped data.) Suggestion: do it on your calculator, then transfer it to paper where you have already drawn and labeled the number line.
(f) Which is the most appropriate measure of center for this sample? Why?
(g) Give the five-number summary and identify any outliers.
11 SAT verbal scores are normally distributed, with a mean of 500 and SD of 100. You randomly select a test taker. What’s the probability that s/he scored between 500 and 700?
12 Mensa, the largest high-IQ society, accepts SAT scores as indicating intelligence. Assume that the mean combined SAT score is 1500, with SD 300. Jacinto scored a combined 2070.

Maria took a traditional IQ test and scored 129. On that test, the mean is 100 and the SD is 15.

From the test scores, who is more intelligent? Explain.

13 At right is a sample shown as a grouped frequency distribution. Compute the following quantities and label each with its proper symbol: (a) sample size, (b) mean, (c) standard deviation. Round to two decimal places. Use any valid method, but show your work. (Begin by filling in the third column including column heading.)
Test Scores   Frequencies, f
14 In a particular data set (continuous data), the mean is around 8700 and the median is around 5000. What if anything can you say about the shape of the distribution?



4. Linked Variables

Updated 11 Jan 2015 (What’s New?)

Intro: When you get two numbers from each member of the sample (bivariate numeric data), you make a plot to look for a relationship between them. If a straight line seems like a good fit for the plotted points, we say that they follow a linear model. In this chapter, you’ll learn when to use a linear model, and how to find the best one.


4A.  Mathematical Models

The chapter intro talks about points following a “linear model”. But what is a linear model, and what does it mean to follow one? Well, since a linear model is one kind of mathematical model, let’s talk a little bit about mathematical models.

You know what a model is in general, right? A copy of the original, usually smaller and with unimportant details left out. Think of model airplanes, or architect’s models of buildings.

A mathematical model is like that. Real Life is Complicated,™ and mathematical models help us manage those complications.

Definition: A mathematical model is a mathematical description of something in the real world. An object or process or data set follows a model if the calculations you do with the model match reality closely enough to be useful.

You’ve already met one model in Chapter 3: the grouped frequency distribution. Instead of dealing with all the data points, you do calculations using the midpoint of each class. That gives you approximate mean and SD, but the approximation is close enough to be useful.

The MathIsFun site has a nice example of modeling the space inside a cardboard box, going beyond the h×w×l formula; see Mathematical Models.

You’ll meet plenty more models in this book: probability models in Chapter 5, several discrete models in Chapter 6, and the normal model in Chapter 7.

But in this chapter we’re concerned with the linear model.

Definition: The linear model uses the linear equation y = ax+b to model the relationship between two numeric variables x and y. In any particular model, a and b are constants.

Because the graph of y = ax+b is a straight line, we can also call it a straight-line model, and we say that x and y have a straight-line relationship in the model.

The linear model is a good one if it describes the data well enough to let you make useful calculations.

4B.  Scatterplot, Correlation, and Regression on TI-83/84

Summary: When you have a set of (x,y) data points and want to find the best equation to describe them, you are performing a regression. You will learn how to find the strength of the association between your two variables (correlation coefficient), and how to find the line of best fit (least squares regression line).

Usually you have some idea that your x variable can help predict your y variable, so you call x the explanatory variable and y the response variable. (Other names are independent variable and dependent variable.)


4B1.  Step 0. Setup

Set floating point mode, if you haven’t already. [MODE] [▼] [ENTER]
Go to the home screen [2nd MODE makes QUIT] [CLEAR]
Turn on diagnostics with the [DiagnosticOn] command. [2nd 0 makes CATALOG] [x⁻¹]
Don’t press the [ALPHA] key, because the CATALOG command has already put the calculator in alpha mode.
Scroll down to DiagnosticOn and press [ENTER] twice.

The calculator will remember these settings when you turn it off: next time you can start with Step 1.

4B2.  Step 1. Make the Scatterplot

Before you even run a regression, you should first plot the points and see whether they seem to lie along a straight line. If the distribution is obviously not a straight line, don’t do a linear regression. (Some other form of regression might still be appropriate, but that is outside the scope of this course.)

Let’s use this example from Sullivan (2011, 179) [see “Sources Used” at end of book]: the distance a golf ball travels versus the speed with which the club head hit it.

Club-head speed, mph (x)   100   102   103   101   105   100    99   105
Distance, yards (y)        257   264   274   266   277   263   258   275
Turn off other plots. [Y=]
Cursor to each highlighted = sign or Plot number and press [ENTER] to deactivate.
Set the format screen. TI-83/84 format screen Press [2nd ZOOM makes FORMAT]. Just select everything in the left column.
Enter the numbers in two statistics lists. [STAT] [1] selects the list-edit screen.
Cursor onto the label L1 at top of first column, then [CLEAR] [ENTER] erases the list. Enter the x values.
Cursor onto the label L2 at top of second column, then [CLEAR] [ENTER] erases the list. Enter the y values.
Set up the scatterplot.
TI-83/84 setup screen for scatterplot
[2nd Y= makes STAT PLOT] [1] [ENTER] turns Plot 1 on.
[▼] [ENTER] selects scatterplot.
[▼] [2nd 1 makes L1] ties list 1 to the x axis.
[▼] [2nd 2 makes L2] ties list 2 to the y axis.
(Leave the square as the selected mark for plotting.)
Plot the points.

I have the grid turned on in some of these pictures, but earlier I told you to turn it off. That’s simplest. If you want the grid, you can turn it on, but then you’ll have to adjust the grid spacing for almost every plot. To adjust grid spacing, press [WINDOW], set Xscl and Yscl to appropriate values for your data, and press [GRAPH] to see the result.

TI-83/84 scatterplot [ZOOM] [9] automatically adjusts the window frame to fit the data.
Check your data entry by tracing the points. TI-83/84 scatterplot with trace for checking data entry [TRACE] shows you the first (x,y) pair, and then [►] shows you the others. They’re shown in the order you entered them, not necessarily from left to right.

A scatterplot on paper needs labels (numbers) and titles on both axes; the x and y axes typically won’t start at 0. Here’s the plot for this data set. (The horizontal lines aren’t needed when you plot on graph paper.)

scatterplot for this data set, showing axis labels and titles

scatterplot showing jitter When the same (x,y) pair occurs multiple times, plot the extra ones slightly offset. This is called jitter. In the example at the right, the point (6,6) occurs twice.

If the data points don’t seem to follow a straight line reasonably well, STOP! Your calculator will obey you if you tell it to perform a linear regression, but if the points don’t actually fit a straight line then it’s a case of “garbage in, garbage out.”

For instance, consider this example from DeVeaux, Velleman, Bock (2009, 179) [see “Sources Used” at end of book]. This is a table of recommended f/stops for various shutter speeds for a digital camera:

Shutter speed (x)   1/1000   1/500   1/250   1/125   1/60   1/30   1/15   1/8
f/stop (y)          2.8      4       5.6     8       11     16     22     32

scatterplot of the above numbers, showing non-linear trend If you try plotting these numbers yourself, enter the shutter speeds as fractions for accuracy: don’t convert them to decimals yourself. The calculator will show you only a few decimal places, but it maintains much greater precision internally.

You can see from the plot at right that these data don’t fit a straight line. There is a distinct bend near the left. When you have anything with a curve or bend, linear regression is wrong. You can try other forms of regression in your calculator’s menu, or you can transform the data as described in DeVeaux, Velleman, Bock (2009, ch 10) [see “Sources Used” at end of book] and other textbooks.

4B3.  Step 2. Perform the Regression

Set up to calculate statistics. [STAT] [►] [4] pastes LinReg(ax+b) to the home screen.
  [2nd 1 makes L1] [,] [2nd 2 makes L2] defines L1 as x values and L2 as y values.
If you have the “wizard” interface, leave FreqList blank, or press [DEL] if something is already filled in.
Set up to store regression equation. [,] [VARS] [►] [1] [1] pastes Y1 into the LinReg command.
Show your work! Write down the whole command, LinReg(ax+b) L1,L2,Y1 in this case, not just LinReg or LinReg(ax+b). Press [ENTER]. The calculator shows correlation and regression statistics and pastes the regression equation into Y1.

Your input screen should look like this, for the “wizard” and non-wizard interfaces:

wizard interface command for linear regression       non-wizard interface command for linear regression

TI-83/84 regression output screen Write down the slope a, the y intercept b, the coefficient of determination R², and the correlation coefficient r. (A decent rule of thumb is four decimal places for slope and intercept, and two for r and R².)

a = 3.1661, b = −55.7966

R² = 0.88, r = 0.94
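If you want to double-check the calculator's output, the same four numbers can be computed in plain Python. This is only a sketch using the textbook least-squares formulas in terms of deviations from the means; I'm not claiming it is how the calculator works internally, and the names linreg, speed, and distance are just labels I picked.

```python
from math import sqrt

speed = [100, 102, 103, 101, 105, 100, 99, 105]         # x, club-head speed in mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]     # y, distance in yards

def linreg(xs, ys):
    """Return slope a, intercept b, and correlation r for the line y-hat = a*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)                        # spread of the x's
    syy = sum((y - y_bar) ** 2 for y in ys)                        # spread of the y's
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))   # how x and y vary together
    a = sxy / sxx
    b = y_bar - a * x_bar
    r = sxy / sqrt(sxx * syy)
    return a, b, r

a, b, r = linreg(speed, distance)
print(f"a = {a:.4f}, b = {b:.4f}, R^2 = {r * r:.2f}, r = {r:.2f}")
```

The rounded results should agree with the values written down from the calculator screen.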

Now let’s take a look in depth at each of those.

Correlation Coefficient, r

scatterplots for various correlation coefficients

“Several sets of (x,y) [pairs], with the correlation coefficient for each set. Note that correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom).”
source: Correlation and Dependence [see “Sources Used” at end of book]

Look first at r, the coefficient of linear correlation. r can range from −1 to +1 and measures the strength and direction of the linear association between x and y. A positive correlation or positive association means that y tends to increase as x increases, and a negative correlation or negative association means that y tends to decrease as x increases. The closer r is to 1 or −1, the stronger the association. We usually round r to two decimal places.

Karl Pearson developed the formula for the linear correlation coefficient in 1896. The symbol r is due to Sir Francis Galton in 1888.

For real-world data, 0.94 is a pretty strong correlation. But you might wonder whether there’s actually a general association between club-head speed and distance traveled, as opposed to just the correlation that you see in this sample. Decision Points for Correlation Coefficient, later in this chapter, shows you how to answer that question.

Though nobody ever computes r by hand any more, the formula explains the properties of r. Here are two equivalent forms. In the first form, you compute the z score of each x within just the x’s and the z score of each y within just the y’s. The second formula is easier if you already have the means and SD of the x’s and y’s. For the meaning of ∑, see ∑ Means Add ’em Up in Chapter 1.

formulas for linear correlation coefficient

z-scores are pure numbers without units, and therefore r also has no units. You can interchange the x’s and y’s in the formula without changing the result, and therefore r is the same regardless of which variable is x and which is y.
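Here is a small Python sketch of the z-score form of r, written in one common way: multiply each point's two z-scores, add up the products, and divide by n−1. I'm assuming sample standard deviations for the z-scores. The sketch also checks the two properties just mentioned, using the golf data: swapping x and y doesn't change r, and neither does changing units (mph to km/h here, my own choice of illustration).

```python
from statistics import mean, stdev

def r_from_z(xs, ys):
    """Correlation as sum(z_x * z_y) / (n - 1), with sample SDs in the z-scores."""
    x_bar, y_bar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / sx) * ((y - y_bar) / sy) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

speed = [100, 102, 103, 101, 105, 100, 99, 105]         # mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]     # yards

r_xy = r_from_z(speed, distance)
r_yx = r_from_z(distance, speed)              # x and y interchanged
speed_kmh = [x * 1.609344 for x in speed]     # same speeds in different units
r_kmh = r_from_z(speed_kmh, distance)

print(f"r = {r_xy:.2f}, swapped = {r_yx:.2f}, km/h = {r_kmh:.2f}")
```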

Why is r positive when data points trend up to the right and negative when they trend down to the right? The product (x−x̄)(y−ȳ) explains this. When points trend up to the right, most are in the lower left and upper right quadrants of the plot. In the lower left, x and y are both below average, x−x̄ and y−ȳ are both negative, and the product is positive. In the upper right, x and y are both above average, x−x̄ and y−ȳ are both positive, and again the product is positive. The product is positive for most points, and therefore r is positive when the trend is up to the right.

On the other hand, if the data trend down to the right, most points are in the upper left (where x is below average and y is above average, x−x̄ is negative, y−ȳ is positive, and the product is negative) and the lower right (where x−x̄ is positive, y−ȳ is negative, and the product is also negative). Since the product is negative for most points, r is negative when data trend down to the right.

Be careful in your interpretation! No matter how strong your r might be, say that changes in the y variable are associated with changes in the x variable, not “caused by” it. Correlation is not causation is your mantra.

It’s easy to think of associations where there is no cause. For example, if you make a scatterplot of US cities with x as number of books in the public library and y as number of murders, you’ll see a positive association: number of murders tends to be higher in cities with more library books. Does that mean that reading causes people to commit murder, or that murderers read more than other people? Of course not! There is a lurking variable here: population of the city.

When you have a positive or negative association, there are four possibilities: x might cause changes in y, y might cause changes in x, lurking variables might cause changes in both, or it could just be coincidence, a random sample that happens to show a strong association even though the population does not.

correlation and causation cartoon
used by permission; source: (accessed 2014-09-15)

If correlation is not causation, then how can we establish causation? For example, how do we know that smoking causes lung cancer in humans? Obviously we can’t perform an experiment, for ethical reasons. Sir Austin Bradford Hill laid down nine criteria for establishing causation in a 1965 paper, The Environment and Disease: Association or Causation? [see “Sources Used” at end of book] Short summaries of the “Bradford Hill criteria” can be found in many places on the Web, including Steve Simon’s (2000b) Causation [see “Sources Used” at end of book].

Regression Line, ŷ = ax+b

Write the equation of the line using ŷ (“y-hat”), not y, to indicate that this is a prediction. b is the y intercept, and a is the slope. Round both of them to four decimal places, and write the equation of the line as

ŷ = 3.1661x − 55.7966

(Don’t write 3.1661x + −55.7966.)

These numbers can be interpreted pretty easily. Business majors will recognize them as intercept = fixed cost and slope = variable cost, but you can interpret them in non-business contexts just as well.

The slope, a or b1 or m, tells how much ŷ increases or decreases for a one-unit increase in x. In this case, your interpretation is “the ball travels about an extra 3.17 yards when the club speed is 1 mph greater.” The slope and the correlation coefficient always have the same sign. (A negative slope would mean that y decreases that many units for every one unit increase in x.)

The intercept, b or b0, says where the regression line crosses the y axis: it’s the value of ŷ when x is 0. Be careful! The y intercept may or may not be meaningful. In this case, a club-head speed of zero is not meaningful. In general, when the measured x values don’t include 0 or don’t at least come pretty close to it, you can’t assign a real-world interpretation to the intercept. In this case you’d say something like “the intercept of −55.7966 has no physical interpretation because you can’t hit a golf ball at 0 mph.”

Here’s an example where the y intercept does have a physical meaning. Suppose you measure the gross weight of a UPS truck (y) with various numbers of packages (x) in it, and you get the regression equation ŷ = 2.17x+2463. The slope, 2.17, is the average weight per package, and the y intercept, 2463, is the weight of the empty truck.

The slope (a or m or b1) and y intercept (b or b0) of the regression line can be calculated from formulas, if you have a lot of time on your hands:

standard regression equations

For the meaning of ∑, see ∑ Means Add ’em Up in Chapter 1.

Traditionally, calculus is used to come up with those equations, but all that’s really necessary is some algebra. See Least Squares — the Gory Details if you’d like to know more.

The second formula for the slope is kind of neat because it connects the slope, the correlation coefficient, and the SD of the two variables.
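That connection is easy to try out. The sketch below assumes the relationship is slope = r × (SD of y) ÷ (SD of x), and it uses the standard fact that the least squares line passes through the point of means, so intercept = ȳ − slope × x̄. The variable names are mine, and the data are the golf numbers from earlier in this chapter.

```python
from statistics import mean, stdev

speed = [100, 102, 103, 101, 105, 100, 99, 105]         # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]     # y, yards

n = len(speed)
x_bar, y_bar = mean(speed), mean(distance)
sx, sy = stdev(speed), stdev(distance)                  # sample standard deviations

# Correlation from the means and SDs
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, distance)) / ((n - 1) * sx * sy)

slope = r * sy / sx                   # slope from r and the two SDs
intercept = y_bar - slope * x_bar     # the line goes through (x-bar, y-bar)

print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
```

One thing this form makes visible: since the SDs are never negative, the slope has to have the same sign as r.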

Coefficient of Determination, R²


The last number to look at (third on the screen) is R², the coefficient of determination. (The calculator displays r², but the capital letter is standard notation.) R² measures the quality of the regression line as a means of predicting y from x: the closer R² is to 1, the better the line. Another way to look at it is that R² measures how much of the total variation in y is predicted by the line.

In this case R² is about 0.88, so your interpretation is “about 88% of the variation in distance traveled is associated with variation in club-head speed.” Statisticians say that R² tells you how much of the variation in y is “explained” by variation in x, but if you use that word remember that it means a numerical association, not necessarily a cause-and-effect explanation. It’s best to stick with “associated” unless you have done an experiment to show that there is cause and effect.

There’s a subtle difference between r and R², so keep your interpretations straight. r talks about the strength of the association between the variables; R² talks about what part of the variation in the y variable is associated with variation in the x variable, and how well the line predicts y from x. Don’t use any form of the word “correlated” when interpreting R².

Only linear regression will have a correlation coefficient r, but any type of regression — fitting any line or curve to a set of data points — will have a coefficient of determination R² that tells you how well the regression equation predicts y from the independent variable(s). Steve Simon (1999b) gives an example for non-linear regression in R-squared [see “Sources Used” at end of book].

In straight-line regression, R² is the square of r, so if you want a formula just compute r and square the result.
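You can also see the "share of variation" idea directly. The sketch below computes R² as 1 minus (variation left over around the line) divided by (total variation in y around its mean), which is a standard general definition, and compares it with r squared for the golf data. Treat it as an illustration under that definition; the names are my own.

```python
from math import sqrt

speed = [100, 102, 103, 101, 105, 100, 99, 105]         # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]     # y, yards

n = len(speed)
x_bar = sum(speed) / n
y_bar = sum(distance) / n
sxx = sum((x - x_bar) ** 2 for x in speed)
syy = sum((y - y_bar) ** 2 for y in distance)           # total variation in y
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, distance))

a = sxy / sxx                  # least squares slope
b = y_bar - a * x_bar          # least squares intercept
r = sxy / sqrt(sxx * syy)      # correlation coefficient

# Variation in y that the line does NOT account for
leftover = sum((y - (a * x + b)) ** 2 for x, y in zip(speed, distance))
r_squared = 1 - leftover / syy

print(f"R^2 = {r_squared:.4f}, r squared = {r * r:.4f}")
```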

4B4.  Step 3. Display the Regression Line

Show line with original data points. TI-83/84 plotted points and regression line [GRAPH]

What is this line, exactly? It’s the one unique line that fits the plotted points best. But what does “best” mean?

the same four points with bad and good regression lines

The same four points on left and right. The vertical distance from each measured data point to the line, y−ŷ, is called the residual for that x value. The line on the right is better because the residuals are smaller.
source: Dabes and Janik (1999, 179) [see “Sources Used” at end of book]

For each plotted point, there is a residual equal to y−ŷ, the difference between the actual measured y for that x and the value predicted by the line. Residuals are positive if the data point is above the line, or negative if the data point is below the line.

You can think of the residuals as measures of how bad the line is at prediction, so you want them small. For any possible line, there’s a “total badness” equal to taking all the residuals, squaring them, and adding them up. The least squares regression line means the line that is best because it has less of this “total badness” than any other possible line. Obviously you’re not going to try different lines and make those calculations, because the formulas built into your calculator guarantee that there’s one best line and this is it.
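If you're curious, here is a short Python sketch of that "total badness" for the golf data. It compares the regression line from this chapter (with the rounded slope and intercept) against one made-up line that also looks plausible on the plot. The second line, ŷ = 3x − 39, is purely my invention for comparison.

```python
speed = [100, 102, 103, 101, 105, 100, 99, 105]         # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]     # y, yards

def residuals(a, b):
    """Residual y - y-hat at every data point, for the line y-hat = a*x + b."""
    return [y - (a * x + b) for x, y in zip(speed, distance)]

def total_badness(a, b):
    """Sum of the squared residuals for the line y-hat = a*x + b."""
    return sum(e ** 2 for e in residuals(a, b))

best = total_badness(3.1661, -55.7966)    # least squares line, rounded coefficients
other = total_badness(3.0, -39.0)         # hypothetical competing line

print(f"least squares line: {best:.2f}")
print(f"made-up line:       {other:.2f}")
```

The made-up line comes close, but the least squares line still has the smaller total, as it must.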

Carl Friedrich Gauss developed the method of least squares in a paper published in 1809.

4B5.  Optional:  Display the Residuals

I would like you to know the material in this section, but it's not part of the MATH200 syllabus so I don’t require it. No homework or quiz problems will draw from this section. You will, however, need to calculate individual residuals; see Finding Residuals, below.

“No regression analysis is complete without a display of the residuals to check that the linear model is reasonable.”

DeVeaux, Velleman, Bock (1999, 227) [see “Sources Used” at end of book]

The residuals are automatically calculated during the regression. All you have to do is plot them on the y axis against your existing x data. This is an important final check on your model of the straight-line relationship.

Turn off other plots. Press [Y=]. Cursor to the highlighted = sign next to Y1 and press [ENTER]. Cursor to PLOT1 and press [ENTER].
Set up the plot of residuals against the x data. Set up Plot 2 for the residuals. Press [2nd Y= makes STAT PLOT] [] [ENTER] [ENTER] to turn on Plot 2. Press [] [ENTER] to select a scatterplot.
The x’s are still in L1, so press [2nd 1 makes L1] [ENTER]. In this plot, the y’s will be the residuals: press [2nd STAT makes LIST], cursor up to RESID, and press [ENTER] [ENTER].
Display the plot. [ZOOM] [9] displays the plot.

You want the plot of residuals versus x to be “the most boring scatterplot you’ve ever seen.” (DeVeaux, Velleman, Bock 2009, 203) [see “Sources Used” at end of book] “It shouldn’t have any interesting features, like a direction or shape. It should stretch horizontally, with about the same amount of scatter throughout. It should show no bends, and it should have no outliers. If you see any of these features, find out what the regression model missed.”

Don’t worry about the size of the residuals, because [ZOOM] [9] adjusts the vertical scale so that they take up the full screen.

If the residuals are more or less evenly distributed above and below the axis and show no particular trend, you were probably right to choose linear regression. But if there is a trend, you have probably forced a linear regression on non-linear data. If your data points looked like they fit a straight line but the residuals show a trend, it probably means that you took data along a small part of a curve.

Here there is no bend and there are no outliers. The scatter is pretty consistent from left to right, so you conclude that distance traveled versus club-head speed really does fit the straight-line model.

Residual Plot Showing Problems

Refer back to the scatterplot of f/stop against shutter speed. I said then that it was not a straight line, so you could not do a linear regression. If you missed the bend in the scatterplot and did a regression anyway, you’d get a correlation coefficient of r = 0.98, which would encourage you to rely on the bad regression. But plotting the residuals (at right) makes it crystal clear that linear regression is the wrong type for this data set.

This is a textbook case (which is why it was in a textbook): there’s a clear curve with a bend, variation on both sides of the x axis is not consistent, and there’s even a likely outlier.

Optional Advanced:  Residuals and R²

I said in Step 2 that the coefficient of determination measures the variation in measured y that’s associated with the variation in measured x. Now that you understand the residuals, I can make that statement more precise and perhaps a little easier to understand.

The set of measured y values has a spread, which can be measured by the standard deviation or the variance. It turns out to be useful to consider the variation in y’s as their variance. (You remember that the variance is the square of the standard deviation.)

The total variance of the measured y’s has two components: the so-called “explained” variation, which is the variation along the regression line, and the “unexplained” variation, which is the variation away from the regression line. The “explained” variation is simply the variance of the ŷ’s, computing ŷ for every x, and the “unexplained” variation is the variance of the residuals. Those two must add up to the total variance of the measured y’s, which means that as percentages of the variation in y they must add to 100%. So R² is the percent of “explained” variation in the regression, and 100%−R² is the percent of “unexplained” variation.

(variance of ŷ) / (variance of y) = R²   and   (variance of residuals) / (variance of y) = 1 − R²

Now I can restate what you learned in Step 2. R² is 88% because 88% of the variance in y is associated with the regression line, and the other 12% must therefore be the variance in the residuals. This isn’t hard to verify: do a 1-VarStats on the list of measured y’s and square the standard deviation to get the total variance in y, s²y = 59.93. Then do 1-VarStats on the residuals list and square the standard deviation to get the “unexplained” variance, s²e = 7.12. The ratio of those is 7.12/59.93 = 0.12, which is 1−R². Expressing it as a percentage gives 100%−R² = 12%, so 12% of the variation in measured y’s is “unexplained” (due to lurking variables, measurement error, etc.).

4C.  Finding ŷ from a Regression on TI-83/84

Summary: The regression line represents the model that best fits the data. One important reason for doing the regression in the first place is to answer the question, what average y value does the model predict for a given x? This page shows you two methods of answering that question.

4C1.  Method 1: Trace on the Regression Line Graph (preferred)

You can make predictions while examining the graph of the regression line on the TI-83/84 or TI-89.

Advantages to this method: aside from being pretty cool, it avoids rounding errors, and it’s very fast for multiple predictions.

Activate tracing on the regression line. [TRACE]
Look in the upper left corner to make sure that the regression equation is displayed. If you see P:L1,L2, press [] to display the regression equation.
Enter the x value. Press the black-on-white numeric keys, including [(−)] and decimal point if needed.
As soon as you press the first number, you’ll see a large X= appear at the bottom left of the screen. Enter any additional digits and press [ENTER].
The TI-83/84 displays the predicted average y value (ŷ) at the bottom right and puts a blinking cursor at that point on the regression line.

Caution: ŷ = 267.1 yards is the predicted or expected average distance for a club-head speed of 102 mph. But that does not mean any particular golf ball hit at that speed will travel that exact distance. You can think of ŷ as the average travel distance that you’d expect for a whole lot of golf balls hit at that speed.

Extrapolation: Just Say No (Usually)

Caution: A regression equation is valid only within the range of actual measured x values, and a little way left and right of that range. If you try to go too far outside the valid range, the calculator will display ERR:INVALID.

It’s not just being cranky. The line describes the points you measured, so it’s usable between your minimum and maximum x values and maybe a little way outside those limits. But unless you have very solid reasons why the same straight-line model is good beyond that range, you can’t extrapolate.

Take a look at this graph of men’s and women’s winning times in the Olympic 100-meter dash from 1928 to 2012, which I made from data compiled by Mike Rosenbaum [see “Sources Used” at end of book]. (The women’s 100 m dash became an Olympic event in 1928.)

women, y = −0.0169x + 44.663; men, y = −0.0109x + 31.605

From this you can reasonably guess that if women had run in the 1924 Olympics, the winner would have finished in around 12.2 or 12.3 seconds. And the trend line puts the 2016 winner at around 10.6 seconds. But the further you go outside your measured data, the riskier your predictions.

Will men’s and women’s times generally continue to decrease? Probably: training will get better, nutrition will improve, global communications will make it less likely that a stellar runner goes undiscovered. But will the decrease follow a straight line? Certainly not! Think about it for a minute. If times keep decreasing on a straight line, eventually they’ll cross the x axis and go negative. Runners will finish the race before they start it! So obviously the straight-line model breaks down — the only question is where. You don’t know, and you can’t know. All you know is that it’s not safe to extrapolate.

Bogus extrapolations give statistics a bad name and make people say “you can prove anything with statistics.” Here’s an example. I’ve just extended the two trend lines to “prove” that after the 2160 Olympics women will run the 100 meters faster than men. Pretty clearly, the linear model breaks down before then.

extension of trend lines shows women outrunning men after 2160

It’s not safe to extrapolate to earlier times, either. The intercepts tell you that in the year zero, the fastest man in the world took 31.6 seconds to run 100 m, and the fastest woman took 44.7 seconds. Does that seem believable?
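
You can push the two trend lines to absurd places with a few lines of Python. This sketch assumes the rounded equations shown with the graph above; the function names are made up for the example. It evaluates each line at year zero and finds the year in which each line would predict a winning time of zero seconds.

```python
def women_time(year):
    """Women's trend line, using the rounded coefficients from the graph."""
    return -0.0169 * year + 44.663

def men_time(year):
    """Men's trend line, using the rounded coefficients from the graph."""
    return -0.0109 * year + 31.605

# Extrapolating backward: "winning times" in the year zero
print(round(men_time(0), 1), round(women_time(0), 1))  # 31.6 44.7

# Extrapolating forward: the year each line reaches a time of zero seconds
women_zero_year = 44.663 / 0.0169
men_zero_year = 31.605 / 0.0109
print(round(women_zero_year), round(men_zero_year))

# Past those years the lines predict negative times
print(women_time(3000) < 0, men_time(3000) < 0)  # True True
```

Nothing in the arithmetic warns you that these answers are nonsense. Knowing where the model stops being valid is your job, not the equation’s.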

4C2.  Method 2: Use Calculated Regression Equation (if necessary)

But what if you don’t still have the regression line on your calculator, for instance if you’ve done a different regression? In that case, you can go back to your written-down regression equation and plug in the desired x value.

Advantage of this method: You already know how to substitute into equations.  Disadvantages: depending on the specific numbers involved, you may introduce rounding errors. Also, since you’re entering more numbers there’s an increased chance of entering a number wrong.

Example: To find the predicted average y value for x = 102, go back to the regression equation that you wrote down, and substitute 102 for x:

ŷ = 3.1661x − 55.7966

ŷ = 3.1661*102 − 55.7966

ŷ = 267.1456 → 267.1

In this example, the rounding error was very small, and it disappeared when you rounded ŷ to one decimal place. But there will be problems where the rounding error is large enough to affect the final answer, so always use the trace method if you can.

Again, please observe the Cautions above. With this method, the calculator won’t tell you when your x value is outside a reasonable range, so you need to be aware of that issue yourself.

4C3.  Finding Residuals

Each measured data point has an associated residual, defined as y−ŷ, the distance of the point above or below the line. To find a residual, the actual y comes from the original data, and the predicted average ŷ comes from one of the methods above.

Example: Find the residual for x = 102.
Solution: From the original data, y = 264. From either of the methods above, ŷ = 267.1. Therefore the residual is y−ŷ = 264−267.1 = −3.1 yards.

If a given x value occurs in more than one data point, you have multiple residuals for that x value.
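
As a sketch of that last point, the Python below finds every residual for a given x, using the written-down regression equation ŷ = 3.1661x − 55.7966 and the club-head data from this chapter. The function name `residuals_at` is hypothetical.

```python
# Club-head speed (mph) and distance (yards) from this chapter's example
speed = [100, 102, 103, 101, 105, 100, 99, 105]
distance = [257, 264, 274, 266, 277, 263, 258, 275]

def residuals_at(x_value, xs, ys):
    """All residuals y - y-hat for data points whose x equals x_value."""
    y_hat = 3.1661 * x_value - 55.7966  # written-down regression equation
    return [y - y_hat for x, y in zip(xs, ys) if x == x_value]

# x = 102 occurs once, so there is one residual
print([round(e, 1) for e in residuals_at(102, speed, distance)])  # [-3.1]

# x = 100 occurs twice, so there are two residuals
print([round(e, 1) for e in residuals_at(100, speed, distance)])  # [-3.8, 2.2]
```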

4D.  Decision Points for Correlation Coefficient

Summary: After you compute the linear correlation coefficient r of your sample, you may wonder whether this reflects any linear correlation in the population. By comparing r to a critical number or decision point, you either conclude that there is linear correlation in the population, or reach no conclusion. You can never conclude that there’s no correlation in the population.

This page gives a simple mechanical test, but a proper statistical test exists. The optional advanced handout Inferences about Linear Correlation explains how decision points are computed and the theory behind the test. You need to learn about t tests before you can understand all of it, but right now you can use the Excel spreadsheet that you’ll find there. Or you can use MATH200B Program part 6 to do the computations.

4D1.  Procedure

The decision points are used to answer the question “From the linear correlation r of my sample, can I rule out chance as an explanation for the correlation I see? Can I infer that there is some correlation in the population?”

To answer that question, temporarily disregard the sign of r. This is the absolute value of r, written | r |. Then compare | r | to the decision point, and obtain one of the only three possible results:

If | r | ≤ d.p., then you cannot say whether there is any linear correlation in the population.
If | r | > d.p. and r is negative, then there is some negative linear correlation in the population.
If | r | > d.p. and r is positive, then there is some positive linear correlation in the population.

Here’s a table of decision points (also known as critical values of r) for various sample sizes.

Decision Points or Critical Numbers for r
(two-tailed test for ρ≠0 at significance level 0.05)
 n   d.p.      n   d.p.      n   d.p.      n   d.p.      n   d.p.
 5  0.878     10  0.632     15  0.514     20  0.444     30  0.361
 6  0.811     11  0.602     16  0.497     22  0.423     40  0.312
 7  0.754     12  0.576     17  0.482     24  0.404     50  0.279
 8  0.707     13  0.553     18  0.468     26  0.388     60  0.254
 9  0.666     14  0.532     19  0.456     28  0.374     80  0.220

(If your sample size is not shown, either refer to the Excel workbook or use the next lower number that is shown in the table. Example: n = 35 is not shown, and therefore you will use the decision point for n = 30.)

4D2.  Examples

You survey 50 randomly selected college students about the number of hours they spend playing video games each week and their GPA, and you find r = −0.35. You look up n = 50 in the table and find 0.279 as the decision point. |r|>d.p. (0.35 > 0.279). You conclude that for college students in general, video game play time is negatively associated with GPA, or that GPA tends to decrease as video-game playing increases.

You randomly select 21 college students. For the amount they spend on textbooks and their GPA, you find r = +0.20. n=21 isn’t in the table of decision points, so you select 0.444, the decision point for n=20. |r|≤d.p. (0.20 ≤ 0.444). Therefore, you are unable to make any statement about an association between textbook spending and GPA for college students in general.
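
The whole procedure is mechanical enough to sketch in a few lines of Python. This is just an illustration of the steps above, not a replacement for the Excel workbook or MATH200B; the function names are my own, and the dictionary simply copies the table of decision points.

```python
# Decision points copied from the table above (two-tailed, significance level 0.05)
DECISION_POINTS = {
    5: 0.878, 6: 0.811, 7: 0.754, 8: 0.707, 9: 0.666,
    10: 0.632, 11: 0.602, 12: 0.576, 13: 0.553, 14: 0.532,
    15: 0.514, 16: 0.497, 17: 0.482, 18: 0.468, 19: 0.456,
    20: 0.444, 22: 0.423, 24: 0.404, 26: 0.388, 28: 0.374,
    30: 0.361, 40: 0.312, 50: 0.279, 60: 0.254, 80: 0.220,
}

def decision_point(n):
    """Decision point for sample size n, falling back to the next lower n in the table."""
    eligible = [size for size in DECISION_POINTS if size <= n]
    if not eligible:
        raise ValueError("sample size is below the smallest n in the table")
    return DECISION_POINTS[max(eligible)]

def interpret(r, n):
    """One of the three possible results of comparing |r| to the decision point."""
    if abs(r) <= decision_point(n):
        return "no conclusion"
    if r < 0:
        return "some negative linear correlation"
    return "some positive linear correlation"

print(interpret(-0.35, 50))  # some negative linear correlation
print(interpret(0.20, 21))   # no conclusion
```

Notice that "no correlation" is not among the values the function can return.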

4D3.  Interpretation

Be very careful with your interpretation, and don’t say more than the statistics will allow.

The question was simply whether there is some correlation in the population, not how much. The population might have stronger or weaker correlation than your sample; all you know is that it has some. (Though you won’t learn how to do it in this course, it is possible to estimate the correlation coefficient of the population.)

If you conclude there is some correlation in the population, it’s probable, not certain. From a completely uncorrelated population, there’s still one chance in 20 of drawing a sample with | r | greater than the decision point. Because 1/20 is .05, we say that .05 is the significance level.
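
You can watch that one-in-20 chance happen with a small simulation. The Python sketch below is built on assumptions of my own: samples of n = 10, both variables drawn independently from a normal distribution (so the population truly has no correlation), 20,000 repetitions, and an arbitrary random seed. It counts how often | r | beats the decision point 0.632 anyway.

```python
import random
from math import sqrt

def correlation(xs, ys):
    """Linear correlation coefficient r (hypothetical helper name)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

random.seed(1)           # arbitrary seed, only so the run is repeatable
n = 10                   # sample size
decision_point = 0.632   # from the table, for n = 10
trials = 20_000

exceed = 0
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(n)]
    ys = [random.gauss(0, 1) for _ in range(n)]  # independent of xs
    if abs(correlation(xs, ys)) > decision_point:
        exceed += 1

fraction = exceed / trials
print(f"fraction of uncorrelated samples with |r| > d.p.: {fraction:.3f}")
```

The fraction should land close to 0.05, which is where the significance level comes from.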

Even if you conclude that there is some correlation in the population, that’s the start of your investigation, not the end. If there’s a correlation in the population, you can’t just assume that one variable drives the other: correlation is not causation. (Steve Simon’s (2000b) Causation [see “Sources Used” at end of book] gives some hints for investigating causation, using smoking and lung cancer as an example.)

Finally, note that there’s no way to reach the conclusion “there’s no correlation in the population.” Either there (probably) is, or you can’t reach any conclusion. This will be a general pattern in inferential statistics: either you reach a conclusion of significance, or you don’t reach any conclusion at all. (As you’ll see in Chapter 10, you can conclude “something is going on”, you can fail to reach a conclusion, but you can never conclude “nothing is going on”. Lack of evidence for is not evidence against.)

4E.  Optional:  Scatterplot, Correlation, and Regression in Excel


In “Scatterplot, Correlation, and Regression on TI-83/84”, earlier in this chapter, you learned the concepts of correlation and regression, and you used a TI-83 or TI-84 calculator to plot the points and do the computations. The calculator is handy, but calculator screens aren’t great for formal reports. This section tells you how to do the same operations in Microsoft Excel, without repeating the concepts.

I’m using Excel 2010, but Excel 2007 or 2013 should be almost identical.

4E1.  Plot the Points

Here again are the data:

Club-head speed, mph (x)   100  102  103  101  105  100   99  105
Distance, yards (y)        257  264  274  266  277  263  258  275

  1. Enter the x-y pairs in rows or columns; row or column heads are optional.
  2. With your mouse, highlight the data but not the headers. Click Insert. In the Charts section, click Scatter and choose the first scatterplot type.
  3. Right-click the useless “Series1” legend and click Delete.
  4. This time I got lucky, but sometimes Excel puts too much white space at the left or bottom of the chart. If this happens to you, right-click the axis numbers and select Format Axis. Change Minimum to Fixed and type in a sensible value.
  5. In the Excel ribbon, click Layout » Axis Titles » Primary Horizontal Axis Title » Title Below Axis and type the axis title. Include units if any. In this case, you have club-head speed in miles per hour.
  6. Click Axis Titles » Primary Vertical Axis Title » Rotated Title and type the axis title, including units if any. In this case, you have distance traveled in yards.
  7. Click Chart Title » Above Chart and type your chart title.
  8. For a neater appearance, you can right-click the horizontal axis, select Format Axis, and change Major tick mark type to None. Repeat for the vertical axis. Your chart should look like this:

    Scatterplot with tidier axes

4E2.  Show the Regression Line

  1. In the Excel ribbon, click Layout. In the Analysis group, click Trendline » More Trendline Options.
  2. In the dialog box that appears, click Trendline Options at the left. At the top right, select Linear. At the bottom right, select Display Equation on chart and Display R-squared value on chart.
  3. Click and drag the regression equation and R² value so that they’re not covering any data points or any part of the line. Then right-click them and select Format Trendline Label. Click Fill at the left, then at the right click Solid Fill and change the color to white. (This keeps the gridlines from running through the text.) If you wish, click Border Color » Solid Line. Here’s the result:

    Scatterplot with regression line and equation plus R²

4E3.  Show the Correlation Coefficient

Excel won’t put r on the chart, but you can compute it in a worksheet cell:

  1. Click into an empty worksheet cell. Type =CORREL( including the = sign and opening parenthesis.
  2. Highlight your y list with your mouse — numbers only, not the header — and type a comma.
  3. Highlight your x list with your mouse — again, just the numbers. Type a closing parenthesis and press [ENTER].

(You can get the slope, y intercept, or R² into the worksheet by following the above procedure but substituting SLOPE, INTERCEPT, or RSQ for CORREL.)

4E4.  Predict the Average y

Like your calculator, Excel can find the ŷ value (predicted average y) for any x.

Caution: A regression equation is valid only within the range of actual measured x values, and a little way left and right of that range. If you go outside that range, Excel will happily serve up garbage numbers to you.

On average, how far do you expect a golf ball to travel when hit at 102 mph?

  1. Type your x value, 102, in an empty cell.
  2. Click into an empty worksheet cell. Type =FORECAST( including the = sign and opening parenthesis.
  3. Click into the cell that holds your x value, and type a comma.
  4. Highlight your y list with your mouse — numbers only, not the header — and type a comma.
  5. Highlight your x list with your mouse — again, just the numbers. Type a closing parenthesis and hit [ENTER]. You’ll see the predicted average distance, 267.1 yards.

The prediction formula, like all Excel formulas, is “live”: if you type in a new x Excel will display the corresponding ŷ. If this doesn’t happen, in the Excel ribbon click Formulas » Calculation Options » Automatic.

What Have You Learned?

Key ideas:

(The online book has live links to all of these.)

Exercises for Chapter 4

Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.

Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

1 A researcher performed a regression on x = age and y = salary for all employees at MegaGrandeEnormoCorp (doing business as “Gramma’s Kitchen”). She found R² = 0.64. How would you explain this to a friend who doesn’t understand any math more complicated than percentages?
[Table: Year; Power Boat Reg. (1000s)]
2 Manatees or “sea cows” are large, slow-moving mammals that live in coastal waters. They’re an endangered species. Sharyn O’Halloran (n.d., slide 4) [see “Sources Used” at end of book] quotes yearly figures from the US Fish & Wildlife Service for the number of power-boat registrations and number of manatees killed by power boats in Florida coastal waters.

(a) The two variables are power-boat registrations and manatee deaths. Which should be the explanatory variable, and which should be the response variable?

(b) On paper or on your calculator, make a scatterplot. Do the data seem to follow a straight line, more or less?

(c) Give the symbol and numerical value of the correlation coefficient.

(d) Write down the regression equation for manatee deaths as a function of power-boat registrations.

(e) State and interpret the slope.

(f) State and interpret the y intercept.

(g) Give the coefficient of determination with its symbol, and interpret it.

(h) How many deaths does the regression predict if 559,000 power boats are registered? Use the proper symbol.

(i) Find the residual for x = 559.

(j) How many manatee deaths would you expect for a million power-boat registrations?

3 Sascha randomly selected 10 TC3 students and asked how many hours of TV they watched on an average day and what was their GPA. The correlation was −0.57. What if anything can you say about TV watching and GPA for all TC3 students?
[Table: Dial; Temp, °F]
4 Your deep freezer has a dial to regulate temperature, but it’s just numbered 0 to 8 with no indication of temperature. So you try various dial settings, allowing 24 hours for temperature to stabilize after each change. The results are shown at right.

(a) Make a scatterplot. Does a straight-line model seem reasonable here?

(b) What linear equation best describes the relation between dial setting x and temperature y?

(c) State and interpret the slope.

(d) State and interpret the y intercept.

(e) Give the correlation coefficient with its symbol.

(f) Give the coefficient of determination with its symbol, and interpret it.

(g) Predict the temperature for a dial setting of 1.

5 A statistics professor asked students to write on their final exam the number of hours they had spent studying. After scoring the exams, she randomly selected 12 of them and plotted exam score against hours of study, with the result r = 0.85. What if anything can you say about the relation between study time and exam score for statistics students in general, assuming that this class is representative of all classes?
6 A public-school administrator with too much time on his hands studied shoe size and reading ability and found a correlation coefficient of 0.81. Are big feet a sign of intelligence?
7 [Figure: scatterplot, roughly the shape of an upside-down hanging chain] A scatterplot is shown at right. Would the value of r be strongly positive, near zero, or strongly negative? Briefly explain your answer.
8 “In a large study of twins, the Minnesota Twin study found a correlation of +.71 between the IQ scores of identical twins. Another study found that family income is correlated +.30 with the IQ of children.” (Source: Pearson’s 2001 [see “Sources Used” at end of book] in the McGraw-Hill Statistical Primer.)

How much of the variation in children’s IQ is associated with variation in family income?

5. Probability

Updated 13 Jan 2015 (What’s New?)

Intro: By now you know: There’s no certainty in statistics. When you draw a sample from a population, what you get is a matter of probability. When you use a sample to draw some conclusion about a population, you’re only probably right. It’s time to learn just how probability works.


If you’re learning independently, you can skip the sections marked “Optional” and still understand the chapters that follow. If you’re taking this course with an instructor, s/he may require some or all of those sections. Ask if you’re not sure!

For easy reference, tables used in more than one problem are duplicated at the end of this document.

5A.  Probability Basics

5A1.  What Is Probability?

Definitions: Probability can be defined two ways: the long-term relative frequency of an event, or the likelihood that an event will occur.

A trial is any procedure or observation whose result depends at least partly on chance. The result of a trial is called the outcome. We call a group of one or more repeated trials a probability experiment.

Example 1: Ten thousand doctors took aspirin every night for six years, and 104 of them had heart attacks. The relative frequency is 104/10000 = 1.04%, so the probability of heart attack is 1.04% for doctors taking aspirin nightly.

Each doctor represents a trial, and the outcome of each trial is either “heart attack” or “no heart attack”. The group of 10,000 trials is a probability experiment.

Definition: An event is a group of one or more possible outcomes of a trial. Usually those outcomes are related in some way, and the event is named to reflect that.

Example 2: If you draw a card from a deck without looking, there are 52 possible outcomes (assuming the jokers have been removed). “Ace” is an event, representing a group of four outcomes, and the probability of that event is 4/52 or 1/13. “Spade” is an event, representing a group of 13 outcomes, so its probability is 13/52 or 1/4. “Ace of spades” is both an outcome and an event, with a probability of 1/52.
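
If you like to see things counted out, here is a short Python sketch of Example 2. It builds the 52 equally likely outcomes and counts how many fall in each event; the names `deck` and `probability` are made up for the illustration.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = list(product(ranks, suits))  # 52 outcomes, jokers removed

def probability(event):
    """P(event), where event is a function that is True for outcomes in the event."""
    favorable = sum(1 for card in deck if event(card))
    return Fraction(favorable, len(deck))

p_ace = probability(lambda card: card[0] == "A")
p_spade = probability(lambda card: card[1] == "spades")
p_ace_of_spades = probability(lambda card: card == ("A", "spades"))

print(p_ace, p_spade, p_ace_of_spades)  # 1/13 1/4 1/52
```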

Write probabilities as fractions, decimals, or percentages, like this:

P(event) = number

Example 3: On a coin flip, P(heads) = 0.5, read as the probability of heads is 0.5. “P(0.5)” is wrong. Don’t write P(number); always write P(event) = number.

All probabilities are between 0 and 1 inclusive. A probability of 0 means the event is impossible or cannot happen; a probability of 1 means the event is a certainty or will definitely happen. Probabilities between 0 and 1 are assigned to events that may or may not happen; the more likely the event, the higher its probability.

Definition: When an event is unlikely — when it has a low probability of occurring — you call it an unusual event. Unless otherwise stated, “unlikely” means that the probability is below 0.05.

This will be an important idea in inferential statistics.

5A2.  Where Do You Get Probabilities?

Pure thought is enough to give many probabilities: the probability of drawing a spade from a deck of cards, the probability of rolling doubles three times in a row at Monopoly, the probability of getting an all-white jury pool in a county with 26% black population. Any such probability is called a theoretical probability or classical probability.

Theoretical probabilities come ultimately from a sample space, usually with help from some of the laws for combining events. (I’ll tell you about both of these later in this chapter.)

Example 4: A standard die (used in Monopoly or Yahtzee) has six faces, all equally likely to come up. Therefore you know that the probability of rolling a two is 1/6.

On the other hand, some probabilities are impossible to compute that way, because there are too many variables or because you don’t know enough: the probability that weather conditions today will give rise to rain tomorrow, the probability that a given radium nucleus will decay within the next second, the probability that a given candidate will win the next election, the probability that a driver will have a car crash in the next year. To find the probability of an event like that, you do an experiment or rely on past experience, and so it is called an experimental probability or empirical probability.

Example 5: The CDC says that the incidence of TB in the US is 5 cases per 100,000 population. 5/100,000 = 0.005%. Therefore you can say that the probability a randomly selected person has TB is 0.005%.

These two terms describe where a probability came from, but there’s no other difference between experimental and theoretical probabilities. They both obey the same laws and have the same interpretations.

You probably don’t need formulas, but if you want them here they are:

Theoretical or classical:  P(success) = N(success) / N(possible outcomes)
Empirical or experimental: P(success) = N(success) / N(trials)

5A3.  Interpreting Probability Statements

Every probability statement has two interpretations, probability of one and proportion of all. You use the interpretation that seems most useful in a given situation.

Example 6: For doctors taking aspirin nightly, P(heart attack in six years) = 1.04%. The “probability of one” interpretation is that there’s a 1.04% chance any given doctor taking aspirin will have a heart attack. The “proportion of all” interpretation is that 1.04% of all doctors taking aspirin can be expected to have heart attacks.

Which interpretation is right? They’re both right, but in a given situation you should use the one that feels more natural.

5A4.  Law of Large Numbers

You know that P(boy) is about 50% for live births, but you’re not surprised to see families with two or three girls in a row. Probability is long-term relative frequency; it can’t predict what will happen in any particular case.

This is expressed in the law of large numbers: as you take more and more trials, the relative frequency tends toward the true probability.

The law of large numbers was stated in 1689 by Jacob Bernoulli.

Example 7: For just a few babies, say the four children in one family, it’s quite common to find a proportion of boys very different from 50%, say one in four (25%) or even zero in four. But consider a class of thirty statistics students. The proportion may still be different from 50%, but a very different proportion (more than 70%, say, or less than 30%) would be unusual. And when you look at all babies born in a large hospital in a year, experience tells you that the proportion will be very close to 50%. The more trials you take, the closer the relative frequency is to the true probability — usually.

But the Law of Large Numbers says that the relative frequency tends to the true probability. Probability can’t predict what will happen in any given case. The idea that a particular outcome is “due” is just wrong, and it’s such a classic mistake that it has a name. The Gambler’s Fallacy is the idea that somehow events try to match probabilities.

Example 8: I’ve just flipped a coin a few times, and the results are shown at the right. The first flip was a tail, and after that flip the relative frequency (rf) of heads is 0. The next flip is a head, and after two flips I’ve had one head out of two trials, so the rf is 0.5. The third flip is also a head, so now the rf is 2/3 or about 0.6667. At this point someone might say, “you’re due for a tail, to move the rf back toward the true probability of 0.5.” That’s the Gambler’s Fallacy.

The coin doesn’t know what it did before, and it doesn’t try to make things “right”. In my trials, the fourth flip moves the rf of heads further from 0.5, and the fifth flip moves it further still. True, the sixth flip moves the rf of heads closer to 0.5, but it could just as well have moved it further away, even if the coin is perfectly fair.
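
Here is a small Python sketch that tracks the running relative frequency of heads for those six flips. The flip sequence T, H, H, H, H, T is my reconstruction from the description in Example 8 (flips four and five must be heads and flip six a tail for the rf to move as described), so treat it as an assumption.

```python
flips = "THHHHT"  # inferred from the description in Example 8, not copied from a table

heads = 0
running_rf = []
for count, flip in enumerate(flips, start=1):
    if flip == "H":
        heads += 1
    running_rf.append(heads / count)
    print(f"flip {count}: {flip}  heads so far = {heads}  rel. freq. = {heads / count:.4f}")
```

The rf of heads goes 0, 0.5, 0.6667, 0.75, 0.8, and then back to 0.6667. At no point did the coin "try" to get back to 0.5.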

I stopped after six trials. I know that if I went on to do ten trials, or a hundred, or a thousand, over time the proportion of heads would almost always move closer to 0.5 — not necessarily on any particular flip, but in the long run.
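
If you'd like to watch the long run happen, here is a minimal Python sketch that simulates a fair coin and reports the relative frequency of heads at a few checkpoints. The function name `running_rel_freq`, the seed, and the checkpoints are illustrative choices of mine, not anything from the text, and a different seed will give different numbers along the way.

```python
import random

def running_rel_freq(n_flips, p_heads=0.5, seed=1):
    """Return the relative frequency of heads after each flip."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    heads = 0
    freqs = []
    for flip in range(1, n_flips + 1):
        if rng.random() < p_heads:   # this flip is a head
            heads += 1
        freqs.append(heads / flip)
    return freqs

freqs = running_rel_freq(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"after {n:>7} flips: rf of heads = {freqs[n - 1]:.4f}")
```

The early checkpoints can wander well away from 0.5, while the later ones should sit close to it. No single flip is "correcting" anything; the early flips simply count for less and less of the total.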

Subconsciously you expect random events not to show a pattern, but you may see patterns along the way. For example, if you flip a fair coin repeatedly, inevitably you will see a run of ten heads or ten tails — about twice in every thousand sequences of ten. If you flip the coin once every two seconds, you can expect to see a run of ten flips the same about once every 17 minutes, on average.

Here are two more examples of patterns cropping up in processes that are actually random:

Example 9: You have flipped a coin 999 times, and there were 499 heads and 500 tails. What’s the probability of a head on the next flip?

Solution: It is 50%, the same as on any other flip. The Law of Large Numbers tells you that over time you tend to get closer and closer to 50% heads, but it doesn’t tell you anything at all about any particular flip. If you think that the coin is somehow “due for a head”, you’ve fallen into the Gambler’s Fallacy.

5A5.  Sample Space

At bottom, probability is about counting. Empirical probability is the number of times something did happen, divided by the number of trials. Classical probability is similar, but it makes use of a list or table of all possible outcomes, called a sample space. Technically a sample space is just a list of all possible outcomes, but it’s only useful if you make it a list of all possible equally likely outcomes.

For repeated independent trials (flipping multiple coins, rolling multiple dice, making successive bets at roulette, and so on), the size of the sample space will be the number of outcomes in each trial, raised to the power of the number of trials. For example, if you want to compute probabilities for the number of girls in a family of four children, your sample space will have 2⁴ = 16 entries.
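
As a quick check on the family-of-four count, this Python sketch lists every boy/girl sequence and confirms there are 16 of them. It assumes, for illustration only, that each child is equally likely to be a boy or a girl; the variable names are my own.

```python
from itertools import product

# Each child is a boy (B) or a girl (G); four children means four trials.
outcomes_per_trial = ["B", "G"]
n_trials = 4

# Every possible sequence of four children, oldest to youngest.
sample_space = list(product(outcomes_per_trial, repeat=n_trials))

# Size of the sample space: outcomes per trial raised to the number of trials.
assert len(sample_space) == len(outcomes_per_trial) ** n_trials
print(len(sample_space))          # 16
print("".join(sample_space[0]))   # BBBB
```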

Example 10: If you roll two dice, what’s the probability you’ll roll a seven? You could list the sample space as

S = { 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 }

but the outcomes are not equally likely. There’s only one way to get a twelve, for instance (double sixes), but there are several ways to get a seven (1–6, 2–5, and so on). So it’s much more useful to list your sample space with equally likely outcomes.

When constructing a sample space, be systematic so that you don’t leave any out or list any twice. Here, you’re rolling two dice, and each die has six equally likely results, so you have 6×6 = 36 equally likely outcomes in your sample space. How can you be systematic? List the outcomes in some regular order, like the picture below. Each row lists all the possibilities with the same outcome for the first die; each column lists all the possibilities with the same outcome for the second die.

36 equally likely outcomes from rolling two dice
image courtesy of Bob Yavits, Tompkins Cortland Community College

Once you have a sample space of equally likely outcomes, finding the probability is simple. There are six ways to roll a seven: 6-1, 5-2, 4-3, 3-4, 2-5, 1-6. There are 36 possible outcomes, all equally likely. Therefore the probability of rolling a seven is 6/36 or 1/6 or about 0.1667. In symbols, P(7) = 6/36 or P(7) = 1/6.

There’s no need to reduce fractions to lowest terms. If a decimal is not exactly equal to a fraction, it’s probably better to keep the fraction. But if the fraction is complex or you’re comparing fractions, round to four decimal places and use the “approximately equal” sign, like this:

P(7) ≈ 0.1667

Caution: Round your final answer only. Never use a rounded number in further calculations; that’s the Big No-no. Fortunately, your calculator makes it easy to chain calculations so that you can see rounded numbers but it still uses the unrounded numbers for further calculations.

Example 11: Find the probability of rolling craps (two, three, or twelve).

Solution: There’s one way to roll a two, two ways to roll a three, and one way to roll a twelve. P(craps) = (1+2+1)/36 = 4/36 or 1/9.
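
The counting in the last two examples can also be done by brute force. This sketch builds the 36-outcome sample space and counts favorable outcomes; the helper name `prob` is hypothetical, and Python's `Fraction` reduces to lowest terms automatically even though you don't have to.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) outcomes.
sample_space = list(product(range(1, 7), repeat=2))

def prob(event):
    """Classical probability: favorable outcomes over total outcomes."""
    favorable = [roll for roll in sample_space if event(roll)]
    return Fraction(len(favorable), len(sample_space))

p_seven = prob(lambda roll: sum(roll) == 7)
p_craps = prob(lambda roll: sum(roll) in (2, 3, 12))

print(p_seven, round(float(p_seven), 4))   # 1/6 0.1667
print(p_craps)                             # 1/9
```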

5A6.  Probability Models

Often, it’s not practical to construct a sample space and compute probabilities from it. Instead, you construct a probability model. Probability models are yet another kind of mathematical model as introduced in Chapter 4.

Definition: A probability model is a table showing all possible outcomes and their probabilities. Every probability must be 0 to 1 inclusive, and the total of the probabilities must be 1 or 100%.

A probability model can be theoretical or empirical.

Number of Heads on Two Coin Flips
Heads    Probability
0        1/4
1        2/4
2        1/4
Total    4/4 = 1

Example 12: Construct a probability model for the number of heads that appear when you flip two coins.

Solution: Start by constructing the sample space. Remember that you need equally likely events if you are going to find probabilities from the sample space. The first coin can be heads or tails, and whatever the first coin is, the second coin can also be heads or tails. So the sample space has 2×2 = 4 outcomes:

S = { HH, HT, TH, TT }

There are four equally likely outcomes, so the denominator (bottom number) on all the probabilities will be 4. The possible outcomes are no heads (one way), one head (two ways), and two heads (one way). The probability model is shown at right. Often a total row is included, as I did, to show that the probabilities add up to 1.
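
Here is one way to build the same model by machine, as a minimal sketch: list the sample space, count the ways to get each number of heads, and divide by the size of the sample space. The names `ways` and `model` are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

sample_space = list(product("HT", repeat=2))   # HH, HT, TH, TT

# Count how many equally likely outcomes give 0, 1, or 2 heads.
ways = Counter(outcome.count("H") for outcome in sample_space)

# Probability model: number of heads -> probability.
model = {heads: Fraction(n, len(sample_space)) for heads, n in sorted(ways.items())}

for heads in model:
    print(heads, f"{ways[heads]}/{len(sample_space)}")
print("Total", sum(model.values()))   # Total 1
```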

That was an easy example, so easy that you could just as well work from the sample space. But think about more complex situations, especially with empirical (experimental) probabilities. Constructing a sample space may be impractical, but a probability model is relatively easy to create.

Example 13: (adapted from Sullivan 2011, page 235 problem 40): The CDC asked college students how often they wore a seat belt when driving. 118 answered never, 249 rarely, 345 sometimes, 716 most of the time, 3093 always. Construct a probability model for seat-belt use by college students when driving.

Seat-Belt Use by College Students Driving
(sample size: 4521)
Never                2.61 %
Rarely               5.51 %
Sometimes            7.63 %
Most of the time    15.84 %
Always              68.41 %
Total              100.00 %

Solution: Probability of one is proportion of all, so to get the probabilities you simply calculate the proportions. Sample size was (118+249+345+716+3093) = 4521. The proportions or probabilities are then simply 118/4521, 249/4521, and so on. The probability model is shown at the right.
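
The same arithmetic takes only a few lines of Python. This is a sketch using the survey counts quoted in the example; the dictionary layout is my own choice.

```python
# Survey counts from the example.
counts = {
    "Never": 118,
    "Rarely": 249,
    "Sometimes": 345,
    "Most of the time": 716,
    "Always": 3093,
}

sample_size = sum(counts.values())   # 4521

# Probability of one = proportion of all.
model = {answer: n / sample_size for answer, n in counts.items()}

for answer, p in model.items():
    print(f"{answer:<17}{p:7.2%}")
print(f"{'Total':<17}{sum(model.values()):7.2%}")
```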

Comments: Don’t push this model too far. In this sample, 68.4% of college students reported that they always use a seat belt when driving. There’s no uncertainty about that statement; it’s a completely accurate statistic (summary number for a sample). But can you go further and say that 68.4% of college students always wear a seat belt when driving? No, for two reasons.

First, this is a sample. Even if it’s a perfect random sample, it’s still not the population. There’s always sample variability. A different sample of college students would most likely give different answers — probably not very different, since this was a large sample, but almost certainly not identical. Second, and more serious, this survey depended on self reporting: students weren’t observed, they were just asked. When people report their behavior they tend to shade their responses in the direction of what’s socially approved or what they would like to think about themselves (response bias). How many of those “always” responses should have been “most of the time” or “sometimes”? You have no way to know.

5B.  Combining Probabilities

You can find probabilities of simple events by making sample spaces and counting. But life isn’t usually that simple. To find probabilities of more interesting (and complex) events, you need to use rules for combining probabilities.

The rules are the same whether your original probabilities are theoretical or experimental.

5B1.  Probability “or” for Disjoint Events

Definition: When two events can’t both happen on the same trial, they are called mutually exclusive events or disjoint events.

Example 14: You select a student and ask where she was born. “Born in Cortland” and “born in Ithaca” are mutually exclusive events because they can’t both be true for the same person.

Comment: Obviously it’s possible that neither is true. Disjoint events could both be false, or one might be true, but they can’t both be true in the same trial.

Example 15: You select a student and ask his major. “Major in physics” and “major in music” are non-disjoint events because they could be true of the same student. (It doesn’t matter whether they are both true of the student you asked. They are non-disjoint because they could both be true of the same student — think about double majors.)

Rule: For disjoint events, P(A or B) = P(A)+P(B)

Example 16: You draw a card from a standard 52-card deck. What’s P(ace or face card)? (A face card is a king, queen, or jack.)

Solution: Are the events “ace” and “face card” disjoint? Yes, because a given card can’t be both an ace and a face card. Therefore you can use the rule:

P(ace or face card) = P(ace) + P(face card)

But what are P(ace) and P(face card)? A picture may help.

52 cards in a standard deck
used by permission; source: accessed 2012-09-26

Now you can see that the deck of 52 cards has four aces and twelve face cards. Therefore

P(ace) = 4/52 and P(face card) = 12/52

Since the events are disjoint,

P(ace or face card) = P(ace) + P(face card)

P(ace or face card) = 4/52 + 12/52 = 16/52

Reminder: When you need to compute probability of A or B, always ask yourself first, are the events disjoint? Use the simple addition rule only if the events are disjoint. If events are non-disjoint — if it’s possible for both to happen on the same trial — you have to use the general rule, below.

US Marital Status in 2006 (in Millions)
                  Men    Women   Totals
Married          63.6     64.1    127.7
Widowed           2.6     11.3     13.9
Divorced          9.7     13.1     22.8
Never married    30.3     25.0     55.3
Totals          106.2    113.5    219.7

Take a look at this table of marital status in 2006, from the US Census Bureau. It’s known as a contingency table or two-way table, because it classifies each member of the sample or population by two variables — in this case, sex and marital status.

Example 17: What’s the probability that a randomly selected person is widowed or divorced?

Solution: Are those events disjoint? Yes, because a given person can’t be listed in both rows of the table. (You might argue that a given person can be both widowed and divorced in his or her lifetime, and that’s true. But the table shows marital status at the time the survey was made, not over each person’s lifetime. The “Widowed” row counts those whose most recent marriage ended with the death of their spouse.) Therefore

P(widowed or divorced) = P(widowed) + P(divorced)

How do you find those probabilities? Remember that probability of one = proportion of all. Find the proportions, and you have the probabilities.

P(widowed or divorced) = 13.9/219.7 + 22.8/219.7

P(widowed or divorced) = 36.7/219.7 ≈ 0.1670

Example 18: Find the probability that a randomly selected man is widowed or divorced.

Solution: Disjoint events? Yes, a given man can’t be in both rows of the table. Again, the probabilities are the proportions, but now you’re looking only at the men:

P(widowed or divorced) = P(widowed) + P(divorced)

P(widowed or divorced) = 2.6/106.2 + 9.7/106.2

P(widowed or divorced) = 12.3/106.2 ≈ 0.1158
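
Examples 17 and 18 follow the same recipe with different denominators, which a short sketch makes explicit. The counts are the ones quoted in the two solutions; the function name `p_either_disjoint` is hypothetical.

```python
# Counts in millions, as quoted in Examples 17 and 18.
total_people = 219.7
widowed_all, divorced_all = 13.9, 22.8
total_men = 106.2
widowed_men, divorced_men = 2.6, 9.7

def p_either_disjoint(count_a, count_b, group_size):
    """Simple addition rule; valid only when no one is in both categories."""
    return count_a / group_size + count_b / group_size

p_all = p_either_disjoint(widowed_all, divorced_all, total_people)
p_men = p_either_disjoint(widowed_men, divorced_men, total_men)

print(f"{p_all:.4f}")   # 0.1670
print(f"{p_men:.4f}")   # 0.1158
```

The only thing that changes between the two examples is the group you divide by: all people in the first, men only in the second.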

Now let’s look at a couple of examples of probability “or” for non-disjoint events.

Example 19: Find P(seven or club).

Solution: Are the events “seven” and “club” disjoint? No, because a given card can be both a seven and a club. You can’t use the simple addition rule.

The next section shows you a formula, but in math there’s usually more than one way to approach a problem. Here you can look back at the picture of the deck and count from the sample space. There are thirteen clubs, plus the sevens of spades, hearts, and diamonds, for a total of 16. (You don’t count the seven of clubs when counting sevens, because you already counted it when counting clubs.) Therefore P(seven or club) = 16/52.

Example 20: Find P(woman or divorced).

Solution: Disjoint events? No, a given person can be both. So what do you do? The same thing as in the preceding example: you count up all the women, and all the divorced people who aren’t women, and divide by the number of people:

P(woman or divorced) = 113.5/219.7 + 9.7/219.7 = 123.2/219.7 ≈ 0.5608

5B2.  Probability “or” for All Events


Look back at P(seven or club). Those are not disjoint events, so you can’t just add P(seven) and P(club). But what did you do, when counting? You counted the clubs, then you counted the sevens that aren’t clubs. In other words, just adding P(seven) and P(club) would be wrong because that would double count the overlap.

With 52 cards, it’s easy enough just to count. But that’s not practical in every problem, so there’s a rule: go ahead and double count by adding the probabilities, then fix it by subtracting the part you double counted.

Rule: P(A or B) = P(A) + P(B) − P(A and B)

This general addition rule works for all events, disjoint or non-disjoint. (If two events are disjoint, they can’t happen at the same time, P(A and B) is 0, and the general rule becomes the same as the simple rule.)

Let’s redo the last two examples with this new general rule, to see that it gives the same answers.

Example 19 again: Find P(seven or club).

P(seven or club) = P(seven) + P(club) − P(seven and club)

Caution: P(seven and club) doesn’t mean “all the sevens and all the clubs”. It means the probability that one card will be both a seven and a club — in other words, it means the seven of clubs.

P(seven or club) = 4/52 + 13/52 − 1/52

P(seven or club) = 16/52
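
To convince yourself that the general rule and direct counting agree, here is a sketch that builds a 52-card deck and computes P(seven or club) both ways. The deck representation and the helper names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(ranks, suits))   # 52 (rank, suit) pairs

def prob(event):
    """Proportion of cards in the deck for which event(card) is true."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_seven(card):
    return card[0] == "7"

def is_club(card):
    return card[1] == "clubs"

# General addition rule: add, then subtract the double-counted overlap.
by_rule = prob(is_seven) + prob(is_club) - prob(lambda c: is_seven(c) and is_club(c))

# Direct count of cards that are a seven, a club, or both.
by_count = prob(lambda c: is_seven(c) or is_club(c))

print(by_rule, by_count)   # 4/13 4/13, which is 16/52 in lowest terms
```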


Example 20 again: Using the table of marital status, find P(woman or divorced).


P(woman or divorced) = P(woman) + P(divorced) − P(woman and divorced)

P(woman or divorced) = 113.5/219.7 + 22.8/219.7 − 13.1/219.7

P(woman or divorced) = 123.2/219.7 ≈ 0.5608


5B3.  Probability “not” — Complements

About two thirds of students who register for a math class complete it successfully. What’s the probability that a randomly selected student who registers for a math class will not complete it successfully? Of course you already know it’s 1−(2/3) = 1/3. Let’s formalize this.

Definitions: Two events are complementary if they can’t both occur but one of them must occur. If A is an event from a given sample space, then the complement of A, written A^C or not A, is the rest of that sample space.

Describing a complement usually involves using the word “not”. Complementary events (can’t both happen, but one must happen) are a subcategory of disjoint events (can’t both happen).

Example 21: The complement of the event “the student completes the course successfully” is the event “the student does not complete the course successfully.” Obviously the complement need not be a simple event. The complement of “the student completes the course successfully” is “the student never shows up, or attends initially but stops attending, or withdraws, or earns an F, or takes an incomplete but never finishes”, or probably other outcomes I haven’t thought of.

Rule: P(A^C) = 1 − P(A)

This comes directly from the definition, and the rule for “or”. A and A^C can’t both happen, so they’re disjoint and P(A or A^C) = P(A)+P(A^C). But one or the other must happen, so P(A or A^C) = 1. Therefore P(A)+P(A^C) = 1, and P(A^C) = 1−P(A).

Example 22: In rolling two dice, “doubles” and “not doubles” are complementary events because they can’t both happen on the same roll, but one of them must happen. “Boxcars” (double sixes) and “snake eyes” (double ones) can’t both happen, so they’re disjoint; but they are not complementary because other outcomes are possible.

The complement rule is useful on its own, but it really shines as a labor-saving device. Very often when a probability problem looks like a lot of tedious computation, the complement is your friend. This really sticks out with “at least” problems (later), but here are a few simpler examples.

Colors of Plain M&Ms
Blue      24 %
Orange    20 %
Green     16 %
Yellow    14 %
Brown     13 %
Red       13 %

Example 23: The color distribution for plain M&Ms is shown at right. What’s the probability that a randomly selected plain M&M is any color but yellow?

Solution: You could add the probabilities of the five other colors, but of course it’s easier to say

P(Yellow^C) = 1 − P(Yellow)

P(Yellow^C) = 100% − 14% = 86%
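
A short sketch shows that the long way and the complement give the same answer. The percentages come from the color table above; the variable names are mine.

```python
# Color distribution for plain M&Ms, in percent, from the table above.
colors = {"Blue": 24, "Orange": 20, "Green": 16, "Yellow": 14, "Brown": 13, "Red": 13}
assert sum(colors.values()) == 100   # a probability model must total 100%

# Long way: add the five colors that are not yellow.
long_way = sum(pct for color, pct in colors.items() if color != "Yellow")

# Complement: everything except yellow.
short_way = 100 - colors["Yellow"]

print(long_way, short_way)   # 86 86
```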


Example 24: Referring again to the table of marital status, what’s the probability that a randomly selected person is not currently married?

Solution: Since the four marital statuses are disjoint, you could add the probabilities for widowed, divorced, and never married. But it’s easier to take the complement of “married”:

P(not currently married) = P(married^C)

P(not currently married) = 1 − P(married)

P(not currently married) = 1 − 127.7/219.7

P(not currently married) ≈ 0.4188


5B4.  Probability “and” for Independent Events

Definition: Two events are called independent events if the occurrence of one doesn’t change the probability of the other.

Example 25: When you play poker, being dealt a pair in this hand and a pair in the next are independent events because the deck is shuffled between hands. But in casino blackjack, according to Scarne on Cards (Scarne 1965, 144 [see “Sources Used” at end of book]), four decks are used and they aren’t necessarily shuffled between hands. Therefore, getting a natural (ace plus a ten or face card) in this hand and a natural in the next are not independent events, because the cards already dealt change the mix of remaining cards and therefore change the probabilities.

That’s also an example of sampling with replacement (poker) and sampling without replacement (casino blackjack).

Samples drawn with replacement are independent because the sample space is reset to its initial condition between draws. Samples drawn without replacement are usually dependent because what you draw out changes the mix of what is left. However, if you’re drawing from a very large group, the change to the proportions in the mix is very small, so you can treat small samples from a very large group as independent.

Independent events are not disjoint, and disjoint events are not independent. If two events A and B are disjoint, then if A happens B can’t happen, so its probability is zero. One of two disjoint events happening changes the probability of the other, so they can’t be independent.

Rule: For independent events, P(A and B) = P(A) × P(B)

Example 26: In Monopoly, you get an extra roll if you roll doubles, but if you roll doubles three times in a row you have to go to jail. What’s the probability you’ll have to go to jail on any given turn?

Solution: Refer to the picture of the dice. There are six ways out of 36 to get doubles, so P(doubles) = 6/36 or 1/6. Each roll is independent, so the probability of doubles three times in a row is (1/6)×(1/6)×(1/6) or (1/6)³ = 1/216, about 0.0046. If you play a lot of Monopoly, you’ll go to jail because of doubles between four and five times per thousand turns.

Example 27: The first traffic light on your morning commute is red 40% of the time, yellow 5%, and green 55%. What’s the probability you’ll hit a green all five mornings in any given week?

Solution: Are the five days independent? Yes, because where you hit that light in its cycle on one morning doesn’t influence where you hit it on the next day. The probability of green is 55% each day regardless of what happens on any other day. Therefore, the probability of five greens on five successive mornings is 55%×55%×55%×55%×55% or (0.55)⁵ ≈ 0.0503. About one week in twenty, that light should be green for you all five mornings.
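
Both the Monopoly example and the traffic-light example multiply the same probability by itself several times, so one small function covers them. This is a sketch only; `p_all_independent` is a hypothetical name, and the function is valid only when the trials really are independent.

```python
from fractions import Fraction

def p_all_independent(p_each, n_trials):
    """P(the event happens on every one of n independent trials) = p ** n."""
    return p_each ** n_trials

# Doubles three times in a row, P(doubles) = 1/6 on each roll.
p_jail = p_all_independent(Fraction(1, 6), 3)

# Green light on five mornings, P(green) = 0.55 each morning.
p_five_greens = p_all_independent(0.55, 5)

print(p_jail, f"{float(p_jail):.4f}")   # 1/216 0.0046
print(f"{p_five_greens:.4f}")           # 0.0503
```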


Example 28: Refer again to the table of marital status. What’s the probability that a randomly selected person is female and widowed?

Solution: In a two-way table, for probability “and”, you don’t worry about formulas or independence because everything is already laid out for you. 11.3 million persons are female and widowed, out of 219.7 million. Therefore:

P(female and widowed) = 11.3/219.7 ≈ 0.0514.

Example 29: Earlier in this section, I said that samples drawn without replacement are usually dependent, but you can treat them as independent when drawing a small sample from a very large group. Here’s an example. If you randomly select three women, what’s the probability that all three are widowed?

Solution: The table shows 11.3 million widowed women out of 113.5 million women, so the probability that any one randomly selected woman is widowed is 11.3/113.5. (The 11.3/219.7 in the preceding example answered a different question: the probability that a randomly selected person is both female and widowed.) Because three women is a small sample against the millions of women in the census, and the sample is random, you can treat them as independent. If you randomly select one woman out of millions, the mix of marital status in the remaining women is so nearly unchanged that you can ignore the difference. Therefore, the probability that all three women are widowed is

(11.3/113.5) × (11.3/113.5) × (11.3/113.5) = (11.3/113.5)³ ≈ 0.0010.
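
You can check numerically how little is lost by treating a small sample from a very large group as independent. The sketch below compares the exact without-replacement probability with the independent shortcut, for a small group and a huge one. The group sizes and the 10% proportion are hypothetical numbers chosen only to show the effect, and the function names are mine.

```python
from fractions import Fraction

def p_all_without_replacement(n_with_trait, group_size, sample_size):
    """Exact probability that every member of the sample has the trait."""
    p = Fraction(1)
    for i in range(sample_size):
        # After i successful draws, both counts have shrunk by i.
        p *= Fraction(n_with_trait - i, group_size - i)
    return p

def p_all_independent(n_with_trait, group_size, sample_size):
    """Shortcut that treats the draws as independent."""
    return Fraction(n_with_trait, group_size) ** sample_size

for group_size in (100, 100_000_000):      # hypothetical group sizes
    n_with_trait = group_size // 10        # hypothetical: 10% have the trait
    exact = p_all_without_replacement(n_with_trait, group_size, 3)
    approx = p_all_independent(n_with_trait, group_size, 3)
    rel_error = float(abs(approx - exact) / exact)
    print(f"group of {group_size:>11,}: exact {float(exact):.6f}, "
          f"independent {float(approx):.6f}, relative error {rel_error:.2e}")
```

In the group of 100 the shortcut is noticeably off; in the group of 100 million the two answers are the same to many decimal places.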


5B5.  Probability “at least” for Independent Events

There’s no special rule for “at least”, but textbook writers (and quiz writers) love this type of problem, so it’s worth looking at. “At least” problems usually want you to combine several of the probability rules.

Example 30: Think back to that traffic light that’s green 55% of the time, yellow 5%, red 40%. What’s the probability that you’ll catch it red at least one morning in a five-day week?

Solution: You could find the probability of catching it red one morning (five separate probabilities for five separate mornings), or two mornings (ten different ways to hit two mornings out of five), or three, four, or five mornings. This would be incredibly laborious. Remember that the complement is your friend. What’s the complement of “at least one morning”? It’s “no mornings”. So you can find the probability of getting a red on no mornings, subtract that from 1, and have the desired probability of hitting red on at least one morning.

P(at least one red in five) = 1 − P(no red in five)

But the status of the light on each morning is independent of all the others, so

P(no red in five) = [ P(no red on one) ]⁵

What’s the probability of no red on any one morning? It’s 1 minus the probability of red on any one morning:

P(no red on one) = 1 − P(red on one) = 1−0.4

Now put all the pieces together:

P(no red on one) = 1 − P(red on one) = 1−0.4

P(no red in five) = [ P(no red on one) ]⁵ = (1−0.4)⁵

P(at least one red in five) = 1 − P(no red in five) = 1 − (1−0.4)⁵ ≈ 0.9222

About 92% of weeks, you hit red at least one morning in the week.
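
If you want to see that the complement really does match the laborious approach, this sketch computes the answer both ways. The long way uses the standard binomial formula, which this section does not cover, so treat `p_exactly` as an assumption brought in only for the comparison.

```python
from math import comb

p_red = 0.4
n_mornings = 5

def p_exactly(k, n, p):
    """Binomial probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# The laborious way: add up exactly 1, 2, 3, 4, and 5 red mornings.
long_way = sum(p_exactly(k, n_mornings, p_red) for k in range(1, n_mornings + 1))

# The complement: 1 minus the probability of no red on any morning.
short_way = 1 - (1 - p_red) ** n_mornings

print(f"{long_way:.4f} {short_way:.4f}")   # 0.9222 0.9222
```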

Be careful with your logic! You really do need to work things through step by step, and write down your steps. Some students just seem to subtract things from 1, and multiply other things, and hope for the best. That’s not a very productive approach.

One thing that can help you with these “at least” and “at most” problems is to write down all the possibilities and then cross out the ones that don’t apply, or underline the ones that do apply. For “at least one red in five”, you have 0 1 2 3 4 5 with the 0 crossed out, or 0 | 1 2 3 4 5. Either way, with this enumeration technique, taught to me by Benjamin Kirk, you can see that the complement of “at least one” is “none”.

A common mistake is computing 1−0.4⁵ for P(none), instead of the correct (1−0.4)⁵. “None are red” means “all are not-red”: every one of the five is something other than red. Remember that all are not is different from not all are. In ordinary English, people often say “All my friends can’t go to the concert” when they really mean “Some of my friends can go, but not all of them can go.” In math you have to be careful about the distinction. Here’s an example.

Example 31: For the same situation, what’s the probability that you’ll hit a red light no more than four mornings in a five-day week? (This could also be asked as “at most four mornings” or “four mornings at most”.)

Solution: Try enumerating. “At most four out of five” looks like this: 0 1 2 3 4 5 with the 5 crossed out, or 0 1 2 3 4 | 5. The previous example was a “none are” or “all are not”, but this one is a “not all are”.

P(≤ 4 out of 5) = 1 − P(5 out of 5)

P(5 out of 5) = 0.4⁵

P(≤ 4 out of 5) = 1 − 0.4⁵ ≈ 0.9898

About 99% of weeks, you hit the red light no more than four mornings of the week.
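
Putting the last two examples side by side makes the "all are not" versus "not all are" distinction concrete. This is a minimal sketch with variable names of my own choosing.

```python
p_red = 0.4
n = 5

p_none_red = (1 - p_red) ** n      # "all are not red": every morning is not-red
p_all_red = p_red ** n             # "all are red": every morning is red

p_at_least_one_red = 1 - p_none_red    # complement of "none"
p_at_most_four_red = 1 - p_all_red     # complement of "all five"

print(f"{p_at_least_one_red:.4f}")   # 0.9222
print(f"{p_at_most_four_red:.4f}")   # 0.9898
```

Note where the subtraction from 1 happens in each case: inside the power for "none are red", outside the power for "not all are red".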

Example 32: You’re throwing a barbecue, and you want to start the grill at 2 PM. Fred and Joe live on opposite sides of town, and they’ve both agreed to bring the charcoal. The problem is that they’re both slackers. Fred is late 40% of the time, and Joe is late 30% of the time. What’s the probability you’ll start the grill by 2 PM?

Solution: This is another “at least” problem for independent events, though this time the independent events don’t have the same probability. To have charcoal by 2 PM, at least one of them has to show up by then. What’s the probability that at least one will be on time? Again, you could compute the probability that they’re both on time, that Fred’s on time but Joe’s late, and that Fred’s late and Joe’s on time — all of those together will be the probability of charcoal on time. But again, the complement is your friend. The complement of “charcoal on time” is “charcoal late”, which happens only if they’re both late.

P(charcoal on time) = 1 − P(charcoal late)

P(charcoal on time) = 1 − P(Fred late and Joe late)

(Fred and Joe live on opposite sides of town, so whether one is late has no connection with whether the other one is late. The events are independent.)

P(charcoal on time) = 1 − P(Fred late) × P(Joe late)

P(charcoal on time) = 1 − 0.4×0.3 = 0.88

You’ve got an 88% chance of starting the grill on schedule.
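
Here is a sketch that checks the shortcut against the long way, by listing all four late/on-time combinations and adding up the ones where at least one person is on time. It relies on the independence stated in the example; the data layout is illustrative.

```python
from itertools import product

p_late = {"Fred": 0.4, "Joe": 0.3}

# Long way: add the probabilities of every case where at least one is on time.
long_way = 0.0
for fred_late, joe_late in product([True, False], repeat=2):
    p_fred = p_late["Fred"] if fred_late else 1 - p_late["Fred"]
    p_joe = p_late["Joe"] if joe_late else 1 - p_late["Joe"]
    if not (fred_late and joe_late):   # skip the one case where both are late
        long_way += p_fred * p_joe     # independent, so multiply

# Short way: complement of "both late".
short_way = 1 - p_late["Fred"] * p_late["Joe"]

print(f"{long_way:.2f} {short_way:.2f}")   # 0.88 0.88
```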

Example 33: The space shuttle Challenger exploded shortly after launch in the 1980s, when one of six gaskets failed. After the fact, engineers realized that they should have known the design was too risky, but they didn’t think past “each gasket is 97% reliable.” The trouble was that if any gasket failed, the shuttle would explode. If you were asked to evaluate the design while the plans were still on the drawing board, what would you conclude? (Note: The design makes the six gaskets independent.)

Solution: The shuttle will explode if one or more gaskets fail. Here’s another “at least” problem, so enumerate the case you’re interested in: 0 | 1 2 3 4 5 6.

P(explosion) = P(at least one gasket fails)

The complement of “at least one gasket fails” (hard to compute) is “no gaskets fail” (much easier). What does it mean for no gaskets to fail? All gaskets must hold. Since the gaskets are independent, that’s easy to compute:

P(all six gaskets hold) = 0.97⁶

The answer you want is the complement of the all-hold or zero-fail case:

P(at least one gasket fails) = 1 − P(all six hold) = 1 − 0.97⁶

P(explosion) = P(at least one gasket fails) = 1 − 0.97⁶ ≈ 0.1670
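
The same complement calculation can be wrapped in a small function, which also lets you see how the risk builds up as more parts must all hold. This is a sketch that assumes independent parts with equal reliability, as the example states; the function name is hypothetical.

```python
def p_at_least_one_failure(reliability, n_parts):
    """P(at least one of n independent parts fails) = 1 - P(all hold)."""
    return 1 - reliability ** n_parts

p_explosion = p_at_least_one_failure(0.97, 6)
print(f"{p_explosion:.4f}")   # 0.1670

# The risk grows with every additional part that must hold.
for n_parts in range(1, 7):
    print(n_parts, f"{p_at_least_one_failure(0.97, n_parts):.4f}")
```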

Conclusion: There’s about a 17% chance that the shuttle will explode, just considering the gaskets and ignoring all other possible causes of trouble. This is about the same as the odds of shooting yourself in Russian roulette.

5B6.  Conditional Probability

In 2012, the Honda Accord was the most frequently stolen vehicle in the US (Siu 2013 [see “Sources Used” at end of book]). Does that mean that your Honda Accord is more likely to be stolen than another model?

You’re tested for a rare strain of flu, and the result is positive. Your doctor tells you the test is 99% accurate. Does that mean that there’s a 99% chance you have that strain of flu?

In New York City, a rape victim identifies physical characteristics that match only 0.0001% of people. Police find someone with those characteristics and arrest him. Is there only a 0.0001% chance that he’s innocent?

These are examples of conditional probability — the probability of one event under the condition that another event happened. It’s probably the most misunderstood probability topic, but I’m going to demystify it for you.

The definition may seem hard at first. But after you work through the examples you’ll find it makes sense.

Definition: The conditional probability of B given A, written P(B | A), is the probability of B under the condition that A occurs. Read B | A as “B given A” or “if A then B”.

That’s the “probability of one” interpretation. You might find the “proportion of all” interpretation easier: P(B | A) is the proportion of A’s that are also B.

Either way, the order matters — P(B | A) and P(A | B) mean different things and they’re different numbers.

Example 34: P(truck | Ford) is the probability that a vehicle is a truck if it’s a Ford, or the probability that a Ford is a truck, or the proportion of trucks among Fords. P(Ford | truck) is the probability that a vehicle is a Ford if it’s a truck, or the probability that a truck is a Ford, or the proportion of Fords among trucks.
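
To see with real numbers that the order matters, here is a sketch using three figures quoted earlier in this chapter from the 2006 marital-status data: 11.3 million widowed women, 113.5 million women, and 13.9 million widowed people. The function name `conditional` is my own.

```python
# Counts in millions, quoted from the 2006 marital-status data in this chapter.
widowed_women = 11.3
all_women = 113.5
all_widowed = 13.9

def conditional(count_both, count_given):
    """P(B | A): the proportion of A's that are also B."""
    return count_both / count_given

p_widowed_given_woman = conditional(widowed_women, all_women)
p_woman_given_widowed = conditional(widowed_women, all_widowed)

print(f"P(widowed | woman) = {p_widowed_given_woman:.4f}")   # 0.0996
print(f"P(woman | widowed) = {p_woman_given_widowed:.4f}")   # 0.8129
```

The top number is the same in both fractions. Only the bottom number, the "given" group, changes, and that is enough to turn about 10% into about 81%.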

Example 35: Let’s look first at the suspected rapist. The prosecutor presents evidence that these physical characteristics are found in only 0.0001% of people. The prosecutor therefore claims that there’s only a 0.0001% chance the suspect is innocent.

But the defense points out that there are over 8 million people in New York City. 0.0001% × 8,000,000 = 8, so the suspect is not a unique individual at all, but one of about eight people who match the eyewitness accounts. Seven of them are innocent. If there’s no evidence beyond the physical match to tie him to the crime, the probability that this defendant is innocent isn’t 0.0001%, it’s 7/8 or 87.5%. (And that’s just in the city. If you consider the metro area, or the US, or the world, there are even more people who match, so any one of them is even more likely to be innocent.)

The prosecutor’s fallacy is the false idea that the probability of a random match equals the probability of innocence. You can also describe this fallacy as “consider[ing] the unlikelihood of an event, while neglecting to consider the number of opportunities for that event to occur”, in the words of “The Prosecutor’s Fallacy” on the Poker Sleuth site (Stutzbach 2011 [see “Sources Used” at end of book]).

It’s an easy mistake to make if you just think about low probabilities. To avoid this error, think in whole numbers, as the defense did. 0.0001% is hard to think about; 8 is much easier.

The key to solving conditional-probability problems is your old friend, probability of one equals proportion of all. The probability that this particular matching person is innocent is the same as the proportion of all matching people that are innocent, or the proportion of innocent people among those who match. Probability problems usually get easier when you turn them into problems about numbers of people or numbers of things.

What does this look like in symbols? (Don’t be afraid of symbols! They are your friend, I promise. Words are slippery and confusing, but when you reduce a problem to symbols you make the situation clear and you are half way to solving it.)

In this example, there’s a 0.0001% chance that a random person would match the physical type of the criminal:

P(matching) = 0.0001%

The prosecution wants you to believe that the probability of a matching individual being innocent is the same:

P(innocent | matching) = 0.0001%    (WRONG)

This is a conditional probability, the probability that one thing is true if another thing is true. Formally, the whole expression is “the probability of innocent given matching”. But it’s easier to think of as “the probability that a person who matches is innocent” or “the proportion of matching people who are innocent”.

The symbols help you clarify your thinking. “The probability of a match” and “the probability of innocence among those who match” are different symbols, and they’re different concepts. You’d expect them to be different probabilities.

The defense showed the right way to figure the probability of innocence given a match. 0.0001%×8,000,000 = 8 people match, and 7 of them are innocent. The probability that a matching person is innocent — the probability that a person is innocent given that he matches — is 87.5%.

P(innocent | matching) = 87.5%    (CORRECT)
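
If you like to check this kind of whole-number reasoning by machine, here is a minimal Python sketch of the defense's count. The variable names are my own, and the sketch assumes, as the example does, that exactly one of the matching people is the criminal.

```python
from fractions import Fraction

# Figures from the example; the variable names are illustrative only.
p_match = Fraction(1, 1_000_000)   # 0.0001% chance that a random person matches
population = 8_000_000             # people in New York City, rounded

matching = p_match * population    # expected number of people who match
innocent = matching - 1            # assumption: exactly one matching person is guilty

# Probability of one = proportion of all: innocent matches / all matches
p_innocent_given_match = innocent / matching
print(matching, float(p_innocent_given_match))
```

Notice that the bottom number of the fraction is the count of matching people, not the whole population.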

Notice what happens with if-then probabilities. You’re considering one group within a subgroup of the population, not one group within the whole population. You’ve reduced your sample space — not all people, but all matching people. The bottom number of your fraction comes from the “given that” part of the conditional probability, because P(innocent | matching) is the proportion of matching people that are also innocent.

To explode the prosecutor’s fallacy, you distinguish between a probability in the whole population and a probability in a subgroup. You also have to ask yourself, “which group?” The issue of medical test results is a good example.

Example 36: There’s a rare skin disease, Texter’s Peril (TP), where you become hypersensitive to the buttons on your phone. (Yes, I am making this up.) It affects 0.03% of adults aged 18–30, three in ten thousand. The only cure is to lay off texting for 30 days, no exceptions. Naturally this is about the worst thing that can happen to anyone.

Your doctor has tested you and the test comes up positive. She tells you that the test is 99% accurate. Does that mean you are 99% likely to have TP? You might think so, and sadly many doctors make the same mistake.

You have a positive test result, and you want to know how likely it is that you have Texter’s Peril. In symbols,

P(disease | positive) = ?

Your doctor told you that the test is 99% accurate, meaning that 99% of people who actually have TP get a positive result:

P(positive | disease) = 99%

These are obviously not the same symbol, so the probability you care about, the probability you have the disease, may well be different from 99%. How can you compute it?

Change those probabilities to whole numbers, and make a table. (I got this technique from the book Calculated Risks [Gigerenzer 2002 [see “Sources Used” at end of book]]. The book cites a study showing that doctors routinely confused probabilities when counseling patients about test results.) You’ve already played with a two-way table; now you’re going to make one. It’s a little bit like filling in a puzzle. I hope you like puzzles. ☺

You don’t know the population size, but that’s okay. Just use a large round number, like a million. Start with what you know.

P(disease) = 0.03%

Out of 1,000,000 people, 0.03% = 300 will have TP, and the other 999,700 won’t. That’s the bottom row of the table, the totals row.

P(positive | disease) = 99%

Of the 300 who actually have TP, 99% = 297 will get a correct positive result, and 3 will get a false negative. That’s the first column of the table.

P(negative | disease^C) = 99%

(In the real world, a given test may not be equally accurate for positives and negatives, but we’ll overlook that to keep things simple.) Out of 999,700 who don’t have TP, 99% = 989,703 will get a correct negative result, and 9,997 will get a false positive. This is the second column of the table, and now you can fill in the column of totals.

                 Have TP    Don’t Have TP        Total
Positive Test        297            9,997       10,294
Negative Test          3          989,703      989,706
Total                300          999,700    1,000,000

Take a look at that table, specifically the “Positive Test” row. Do you see the problem? Most of the people with positive test results actually don’t have Texter’s Peril, even though the test is 99% accurate!

It took a while to get here, but it’s better to be correct slowly than to be wrong quickly. You can now compute the probability of having TP given that you have a positive test result. Once again, probability of one equals proportion of all, so this is really the same as the proportion of people with positive test results who actually have TP:

P(disease | positive) = 297 / 10,294 = 2.89%

The test is 99% accurate, but because TP is rare, most of the positive results are false positives, and there’s under a 3% chance that a positive result means you actually have Texter’s Peril. There’s a 1 − 297/10,294 = 97.11% chance that a positive result is a false positive.

Notice again: With conditional probability, you’re not concerned with the whole population. Rather, you focus on a subgroup within a subgroup. P(disease | positive) is the proportion of people who actually have the disease, within the subgroup that received a positive test result.

Example 37: What’s the chance that a negative is a false negative, that given a negative test result you actually have TP? In symbols,

P(disease | negative) = ?

You’ve already got the table, so this is a piece of cake. Out of a million people, 989,706 test negative and 3 of them have the disease. The probability that a negative is a false negative is

P(disease | negative) = 3/989,706 ≈ 0.000 003

which is essentially nil.
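
The table-building technique from Examples 36 and 37 can be sketched in a few lines of Python. This is only an illustration: the names `sensitivity` and `specificity` are standard medical terms that I am borrowing for P(positive | disease) and P(negative | no disease), and the population of one million is the same convenient round number used above.

```python
# Build the two-way table of whole numbers, then read conditional
# probabilities off it. All names here are illustrative.
population = 1_000_000
prevalence = 0.0003      # P(disease) = 0.03%
sensitivity = 0.99       # P(positive | disease)
specificity = 0.99       # P(negative | no disease), assumed equal for simplicity

sick = round(population * prevalence)
healthy = population - sick
true_pos = round(sick * sensitivity)
false_neg = sick - true_pos
true_neg = round(healthy * specificity)
false_pos = healthy - true_neg

# The "given" group supplies the bottom number of each fraction.
p_disease_given_pos = true_pos / (true_pos + false_pos)
p_disease_given_neg = false_neg / (false_neg + true_neg)
print(f"{p_disease_given_pos:.2%}  {p_disease_given_neg:.6f}")
```

Try changing `prevalence` to 0.10 and watch P(disease | positive) jump: the rarer the disease, the more the false positives swamp the true positives.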

Example 38: A lot of Web sites in 2013 trumpeted the news that the Honda Accord was the most frequently stolen model in the US the year before. And that’s true. Out of 721,053 stolen cars and light trucks in 2012, Hot Wheels 2012 tells us that 58,596 were Honda Accords (NICB 2013 [see “Sources Used” at end of book]).

But many Web sites warned Honda owners that they were most at risk. For instance, Honda Accord, Civic Remain Top Targets for Thieves at (Schmitz 2013 [see “Sources Used” at end of book]) leads with “If you own a Honda Accord or Civic, or a full-size Ford pickup truck, you might want to take a moment to make sure your auto-insurance payments are up to date. You drive one of the top three most-stolen vehicles in the US.”

Do you see what’s wrong here? Think about it for a minute before reading on.

Yes, a lot of Honda Accords were stolen, because there are a lot of them on the road. Too many news organizations are sloppy and think that the likelihood a stolen car is an Accord is the same as the likelihood that an Accord will be stolen. This is the doctor’s mistake from the previous example, all over again.

Let’s clarify. You have 58,596 Accords out of 721,053 thefts, so the probability that a stolen car was an Accord — the probability that a car was an Accord given that it was stolen — the probability of “if stolen then Accord” — is

P(Accord | stolen) = 58,596/721,053 = 8.13%

But that doesn’t tell you doodley-squat about your chance of having your Accord stolen. That would be the probability of a car being stolen given that it is an Accord, “if Accord then stolen”. The top number of that fraction is still 58,596, but the bottom number is the total number of Accords on the road:

P(stolen | Accord) = 58,596/(total Accords on the road in 2012)

Do you see the difference? They’re both conditional probabilities, but they’re different conditions. “If stolen then Accord” is different from “if Accord then stolen”. The first one is about Accord thefts as a proportion of all thefts, and the second one is about Accord thefts as a proportion of all Accords. Those are different numbers.

To find the chance that an Accord will be stolen, you need the number of Accords on the road in 2012. A press release from Experian (2012) says there were “more than 245 million vehicles on US roads” in 2012, and 2.6% of them were Accords.

P(stolen | Accord) = (stolen Accords)/(total Accords on the road in 2012)

P(stolen | Accord) = 58,596/(2.6% of 245 million)

P(stolen | Accord) = 58,596/6,370,000

P(stolen | Accord) = 0.92%

Yes, over 8% of cars stolen in 2012 were Accords, but the chance of a given Accord being stolen was under 1%. P(Accord | stolen) = 8.13%, but P(stolen | Accord) = 0.92%.
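
Here is the same comparison as a short Python sketch, using only the figures quoted in this example. The variable names are mine. The point is that the top number is the same in both fractions and only the bottom number changes.

```python
# Figures quoted in Example 38; names are illustrative.
stolen_total = 721_053          # cars and light trucks stolen in 2012
stolen_accords = 58_596         # of those, Honda Accords
vehicles_on_road = 245_000_000  # "more than 245 million", so treat as approximate
accord_share = 0.026            # 2.6% of vehicles were Accords

accords_on_road = vehicles_on_road * accord_share

p_accord_given_stolen = stolen_accords / stolen_total     # bottom: all thefts
p_stolen_given_accord = stolen_accords / accords_on_road  # bottom: all Accords
print(f"{p_accord_given_stolen:.2%}  {p_stolen_given_accord:.2%}")
```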

Optional:  Conditional Probability Formula

Rule: P(B | A) = P(A and B) / P(A) or N(A and B) / N(A)

The “N” alternatives remind you that often it’s easier just to count than to find probabilities and then divide. Either way, when you consider P(B | A), remember that you’re interested in the likelihood of B given that A occurs. It’s the B cases within the A group, not all the B cases.

P(A | B) is not the same as P(B | A). You’ll get the probability right if you remember that the second event, the “given that” event, supplies the bottom number of the fraction.

Example 39: Find P(stolen | Accord), the chance that any one Accord will be stolen. Using the numbers from Example 38,

P(stolen | Accord) = N(Accord and stolen) / N(Accord)

P(stolen | Accord) = 58,596/6,370,000 = 0.92%

Example 40: I draw a card from the deck, and I tell you it’s red. What’s the probability that it’s a heart? If you didn’t know anything about the card, you’d write P(heart) = ¼ because a quarter of the cards in the deck are hearts. But what is the probability given that it’s red?

P(heart | red) = P(heart and red) / P(red)

P(heart and red) is the probability of a red heart. A quarter of the cards in the deck are red hearts, so this is just ¼. P(red) is of course ½ because half the cards in the deck are red.

P(heart | red) = (¼) / (½) = (¼) × 2 = ½

This one is probably easier to do by just counting:

P(heart | red) = N(heart and red) / N(red)

P(heart | red) = 13/26 = ½

Either way, you’re concerned with the sub-subgroup of hearts within the subgroup of red cards. P(heart | red) = ½ — half of the red cards are hearts.

Example 41: You know P(heart | red) = ½: given that a card is red, there’s a ½ probability that it’s a heart. But what is P(red | heart), the probability that a card is red given that it’s a heart? You probably already know the answer, but let’s run the formula:

P(red | heart) = N(red and heart) / N(heart)

P(red | heart) = 13/13 = 1 (or 100%)
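
If you want to see the counting version of the formula at work, this Python sketch builds a 52-card deck and counts. The helper name `conditional` is hypothetical, something I made up for this illustration.

```python
from fractions import Fraction
from itertools import product

ranks = "A 2 3 4 5 6 7 8 9 10 J Q K".split()
suits = {"hearts": "red", "diamonds": "red", "clubs": "black", "spades": "black"}
deck = [(rank, suit) for rank, suit in product(ranks, suits)]  # 52 cards

def conditional(event, given):
    """P(event | given) = N(event and given) / N(given), by counting cards."""
    given_cards = [card for card in deck if given(card)]
    both = [card for card in given_cards if event(card)]
    return Fraction(len(both), len(given_cards))

def is_heart(card):
    return card[1] == "hearts"

def is_red(card):
    return suits[card[1]] == "red"

print(conditional(is_heart, is_red))  # P(heart | red)
print(conditional(is_red, is_heart))  # P(red | heart)
```

Swapping the two arguments swaps which group supplies the bottom number, which is why the two answers differ.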

Conditional probabilities often come up in two-way tables.

US Marital Status in 2006 (in Millions)
                  Men    Women    Totals
Never married    30.3     25.0      55.3

Example 42: Again using the table of marital status (reprinted on the last page), what’s the probability that a randomly selected woman is divorced? In other words, given that the person is a woman, what’s the probability that she’s divorced?

Solution: The problem wants P(divorced | woman), the probability that the person is divorced given that she’s a woman.

P(divorced | woman) = N(divorced and woman) / N(woman)

P(divorced | woman) = 13.1/113.5 ≈ 0.1154

Because we have “given woman” or “if woman”, the bottom number is the number of women, 113.5 million.

5B7.  Optional:  Checking Independence

Remember the definition of independent events? A and B are independent if the occurrence of one doesn’t change the probability of the other. Now that you know about conditional probability, you can define independent events in terms of conditional probability:

Definition: Two events A and B are independent if and only if P(A|B) = P(A).

This makes sense. P(A) is the probability of A without considering whether B happened or not, and P(A|B) is the probability of A given that B happened. If B’s occurrence doesn’t change the probability of A, then those two numbers will be equal.


Example 43: Referring again to the table of marital status (reprinted on the last page), show that “woman” and “widowed” are dependent (not independent).


P(widowed) = 13.9 / 219.7 ≈ 0.0633

P(widowed | woman) = 11.3 / 113.5 ≈ 0.0996

These numbers are different — the probability of “widowed” changes when “woman” is given, or in English the proportion of widowed women is different from the proportion of widowed people. Therefore the events “woman” and “widowed” are not independent.

By the way, if A and B are independent then B and A are independent. So you could just as well compare P(woman) = 113.5/219.7 ≈ 0.5166 to P(woman|widowed) = 11.3/13.9 ≈ 0.8129. Since those are different, you conclude that “woman” and “widowed” are dependent.
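
The independence check is just a comparison of two proportions, as this Python sketch shows. The counts are the ones quoted in Example 43. The `independent` helper and its tolerance are my own assumptions, since real-world data rarely give exactly equal proportions and the definition itself says nothing about how close is close enough.

```python
# Counts in millions, as quoted in Example 43.
total = 219.7
women = 113.5
widowed = 13.9
widowed_women = 11.3

p_widowed = widowed / total                  # P(widowed)
p_widowed_given_woman = widowed_women / women  # P(widowed | woman)

def independent(p_a, p_a_given_b, tol=0.005):
    """Hypothetical helper: call A and B independent if P(A|B) is within tol of P(A)."""
    return abs(p_a - p_a_given_b) <= tol

print(round(p_widowed, 4), round(p_widowed_given_woman, 4))
print(independent(p_widowed, p_widowed_given_woman))
```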

5B8.  Optional:  Probability “and” for All Events

When events are not independent, to find probability “and” you need to use a conditional probability. Remember the formula for conditional probability: P(B | A) = P(A and B) / P(A). Multiply both sides by P(A) and you have P(A) × P(B | A) = P(A and B), or:

Rule: For all events, P(A and B) = P(A) × P(B | A)

Example 44: You draw two cards from the deck without looking. What’s the probability that they’re both diamonds?

Solution: Are these independent events? No! P(diamond1), the probability that the first card is a diamond, is 13/52 because there are 13 diamonds out of 52. But if the first card is a diamond, the probability that the second card is a diamond is different. Now there are only 12 diamonds left in the deck, out of a total of 51 cards. So P(diamond2 | diamond1) = 12/51, which is a bit less than 13/52.

P(diamond1 and diamond2) = P(diamond1) × P(diamond2 | diamond1)

P(diamond1 and diamond2) = (13/52) × (12/51) ≈ 0.0588
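
As a quick check, here is the general “and” rule in Python with exact fractions. The variable names are mine, and the last line shows what you would get if you wrongly treated the two draws as independent.

```python
from fractions import Fraction

p_first = Fraction(13, 52)               # P(diamond1)
p_second_given_first = Fraction(12, 51)  # P(diamond2 | diamond1): 12 diamonds left in 51 cards
p_both = p_first * p_second_given_first  # P(diamond1 and diamond2)

# The mistake to avoid: ignoring that the first draw changes the deck.
p_wrong = Fraction(13, 52) ** 2

print(p_both, float(p_both))
print(p_wrong, float(p_wrong))
```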

5C.  Sequences instead of Formulas

A lot of probability problems can be solved without using formulas, through the technique of sequences. Here’s the procedure:

  1. Write down the “winning sequences”, the sequences that lead to the desired outcome.
  2. Assign probabilities to each event in each sequence, from start to end.
  3. Multiply the probabilities within each sequence, and then add up the probabilities of all the sequences.

Example 45: Suppose a bag contains 6 oatmeal cookies, 4 raisin cookies, and 5 chocolate chip. You are to draw two cookies from the bag without looking (and without replacement, which would be yucky). What is the probability that you will get two chocolate chip cookies?

Solution: To start with, notice that there are 6+4+5 = 15 cookies. There’s only one winning sequence, but this one illustrates an important point: you have to assign each probability in its situation at that point in its sequence.

  1. Sequence: CC1 and CC2
  2. Probabilities: 5/15 and 4/14.
    You compute the probability CC2 at this point in the sequence: it’s the probability of a second CC if the first cookie was CC. You don’t care about the probabilities if the first cookie was anything else, because the sequence starts with a CC cookie. That means that, when you are looking for the probability of a second CC cookie, the bag now contains only 14 cookies, and only 4 of them are CC.
  3. Arithmetic: (5/15)×(4/14) ≈ 0.0952

Example 46: In the same situation, what’s the probability you’ll get one oatmeal and one raisin?

Solution: Even though you don’t care which order they come in, you have to list both orders among your winning sequences. Remember the example of flipping two coins, or the examples with dice: to make probabilities come out right, consider possible orderings.

  1. Sequences: (A) O1 and R2; (B) R1 and O2
  2. Probabilities: (A) 6/15 and 4/14; (B) 4/15 and 6/14
  3. Arithmetic: (6/15)×(4/14) + (4/15)×(6/14) ≈ 0.2286

Example 47: Consider the same bag of 15 cookies, but now what’s the probability you get two cookies the same?


  1. Sequences: (A) O1 and O2; (B) R1 and R2; (C) CC1 and CC2
  2. Probabilities — again, the probability for the second cookie takes into account the first cookie that was drawn.
    (A) 6/15 and 5/14; (B) 4/15 and 3/14; (C) 5/15 and 4/14
  3. Arithmetic: (6/15)×(5/14) + (4/15)×(3/14) + (5/15)×(4/14) ≈ 0.2952
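
One way to double-check sequence answers for the bag of 6 oatmeal, 4 raisin, and 5 chocolate chip cookies is brute force: list every ordered pair of cookies you could draw and count the winners. This Python sketch does that. It is a cross-check I am adding, not the method of sequences itself, and the helper name `prob` is hypothetical.

```python
from fractions import Fraction
from itertools import permutations

bag = ["O"] * 6 + ["R"] * 4 + ["CC"] * 5   # 15 cookies

# Every ordered pair of two different cookies is equally likely: 15 x 14 = 210 pairs.
draws = list(permutations(range(len(bag)), 2))

def prob(wins):
    """Proportion of ordered draws (first, second) that satisfy the condition."""
    hits = sum(1 for i, j in draws if wins(bag[i], bag[j]))
    return Fraction(hits, len(draws))

p_two_cc = prob(lambda a, b: a == b == "CC")            # two chocolate chip
p_oat_raisin = prob(lambda a, b: {a, b} == {"O", "R"})  # one oatmeal, one raisin, either order
p_same = prob(lambda a, b: a == b)                      # two of the same kind

for p in (p_two_cc, p_oat_raisin, p_same):
    print(p, round(float(p), 4))
```

Because the pairs are ordered, “oatmeal then raisin” and “raisin then oatmeal” are counted separately, just as they are listed separately among the winning sequences.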

Example 48: Your teacher’s policy is to roll a six-sided die and give a quiz if a 2 or less turns up. Otherwise, she rolls again and collects homework if a 3 or less turns up. You haven’t done the homework for today and you’re not ready for a quiz. What is the probability you’ll get caught?

Solution: Though you could do this with formulas, you’ll get the same answer with less pain by following the method of sequences. The “winning sequences” in this case are the sequences that lead to either a quiz or homework.

  1. There are two sequences: (A) quiz (and stop, without deciding about homework); (B) no quiz, but homework
    Notice that you start each sequence from the same starting point. Notice also that you don’t consider the possible sequence “no quiz and no homework” because in that sequence you don’t get caught.
  2. P(quiz) = 2/6 = 1/3. P(no quiz) = 1−1/3 = 2/3. P(homework on the second roll) = 3/6 = 1/2.
    (A) 1/3 (B) 2/3 and 1/2
  3. (1/3) + (2/3)×(1/2) = (1/3)+(1/3)= 2/3
    There’s a 2/3 probability of a quiz or homework.
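
The three steps of the method of sequences translate almost word for word into Python. In this sketch each winning sequence is a list of step probabilities; the function name `sequence_total` is my own invention.

```python
from fractions import Fraction

p_quiz = Fraction(2, 6)        # first roll is 2 or less
p_no_quiz = 1 - p_quiz
p_homework = Fraction(3, 6)    # second roll is 3 or less (rolled only if there was no quiz)

# Step 1 and 2: the winning sequences, with a probability for each step.
sequences = [
    [p_quiz],                  # (A) quiz
    [p_no_quiz, p_homework],   # (B) no quiz, then homework
]

def sequence_total(seqs):
    """Step 3: multiply within each sequence, then add across sequences."""
    total = Fraction(0)
    for seq in seqs:
        p = Fraction(1)
        for step in seq:
            p *= step
        total += p
    return total

p_caught = sequence_total(sequences)
print(p_caught)
```

As a sanity check, the one losing sequence, “no quiz and no homework”, has probability (2/3)×(1/2) = 1/3, and 1 − 1/3 gives the same answer.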

Sequences let you think through a situation without getting confused about which formula may apply. Sometimes no formula applies. Here’s a famous example.

Example 49: You’re a contestant on Let’s Make a Deal. You have to pick one of three doors, knowing that there’s a new car behind one of them and a “zonk” (something funny but worthless) behind the other two. Let’s say you pick Door #1.

The host, who of course knows where the car is, opens Door #2 and shows you a zonk. He then asks whether you want to stick with your choice of Door #1, or instead take what’s behind Door #3. What should you do, and why?

(I gave specific door numbers to help make this problem less abstract, but the specifics don’t matter. What does matter is that you pick a door at random, and the host reveals that a door you didn’t pick is the wrong one.)

Solution: There’s really no formula for this one, because the host’s actions aren’t governed by probability. Once you realize that, it’s easy.

  1. In the long run, 1/3 of contestants will choose the correct door, whichever one it is, and 2/3 will choose one of the two wrong doors. Why? The show’s producers have to make sure that prizes are equally distributed among the three doors over the long haul. If they favored one door over the others, people would notice and would start picking that door.

    Therefore, P(right door) = 1/3 and P(wrong door) = 2/3.

  2. If you chose the right door, the host opens one of the two wrong doors, but obviously you would not benefit by switching.
  3. If you chose the wrong door, the host opens the other wrong door and offers you the chance to switch doors. The host has eliminated the other wrong door, and the third door must be the winning door. You should switch.

    If you chose the wrong door and switch doors, you will always win because the host has eliminated the other wrong door.

  4. The probability that you chose the right door initially, and will lose if you switch, is 1/3. The probability that you chose the wrong door initially, and will win if you switch, is 2/3.

    In the long run, keeping your original choice is the winning strategy 1/3 of the time, and switching is the winning strategy 2/3 of the time.

  5. Switching doors doubles your chance of winning.

This is the famous Monty Hall Problem. Monty Hall [see “Sources Used” at end of book] developed Let’s Make a Deal and hosted the show for many years. There was a lot of controversy (Tierney 1991 [see “Sources Used” at end of book]) about the answer. Many people who should have known better thought that Door #1 and Door #3 were equally likely after Door #2 was opened. But they forgot that this is not a pure probability problem. The host knows where the car is and picks a door to open based on that knowledge, and that makes all the difference.
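
If the argument still feels wrong, you can play the game a hundred thousand times in Python and count. This simulation is only a sketch under stated assumptions: the car is placed at random, you pick at random, and when you happen to pick the car the host chooses at random between the two zonk doors. The function names are mine.

```python
import random

def play(switch, rng):
    """Play one game; return True if the contestant ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host knows where the car is: he opens a door that is
    # neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

def win_rate(switch, trials=100_000, seed=1):
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    return sum(play(switch, rng) for _ in range(trials)) / trials

print("stay:  ", win_rate(switch=False))
print("switch:", win_rate(switch=True))
```

The two win rates should come out near 1/3 and 2/3. The line that computes `opened` is where the host's knowledge enters, and that is what makes the two remaining doors unequal.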

What Have You Learned?

Key ideas:

(The online book has live links to all of these.)


Exercises for Chapter 5

Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.

Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

Problem Set 1

1 You toss three coins.
(a) How many entries do you expect in the sample space of equally likely events?

(b) Construct that sample space.

(c) Find P(2H), the probability of getting exactly two heads.

2 In 2003 a federal government survey estimated that 58.2% of US households had both a cell phone and a landline, 2.8% had only cell service, and 1.6% had no phone service at all.
(a) Construct a probability model for type of phone service to US households. (Hint: You’re going to have to add a fourth case.)
(b) Supposedly, polling agencies try not to call cell phones, because consumers object to paying for the calls. What proportion of US households could be reached by a landline in 2003?
3 According to (2014) [see “Sources Used” at end of book], the probability of being struck by lightning in a given year is about 1 in 1,000,000. A blog post by Tara Parker-Pope (2007) [see “Sources Used” at end of book] says that the probability of suffering a shark attack in 2003 was about 1 in 4,691,000. Can you add these two numbers to find the probability of being struck by lightning or attacked by a shark in 2003 as 1/1,000,000 + 1/4,691,000? Briefly, why or why not?
4 P(A), the probability of event A, is 0.7. A and B are complementary events. Find (a) P(not A); (b) P(B); (c) P(A and B).
If any of them cannot be determined from the information given, say so.
5 A blog post by Tara Parker-Pope (2007) [see “Sources Used” at end of book] reported that your lifetime risk of dying of heart disease is 1/5, and your lifetime risk of dying of cancer is 1/7. Can you add these two numbers to find the probability of dying of heart disease or cancer? Briefly, why or why not?
6 Explain the difference between P(divorced | man) and P(man | divorced).
7 A company analyzed all 412 customer complaints that were received in January 2013. None of them were for unresolved billing disputes. Therefore the probability that a randomly selected complaint from January 2013 was for an unresolved billing dispute is zero. We’re used to interpreting a probability of zero as impossible, but obviously it is possible for a complaint to be about an unresolved billing dispute. How do you resolve this paradox?

Need a hint? Think about the two kinds of probability from the beginning of the chapter.

8 You shuffle a standard 52-card deck well and deal five cards. What is the probability that the fifth card is a spade?
9 Write out the sample space for flipping two coins, and use it to answer these questions.
(a) If you are told that at least one of the flips came up heads, what is the probability that both are heads?
(b) If you are told that the first coin came up heads, what is the probability that both are heads?
10 The chance of being a victim of violent crime in a given year varies by age and sex, according to What are my chances of being a victim of violent crime? [see “Sources Used” at end of book] Take 17.1 per thousand, or 1.71%, as the average.

(a) You’re waiting for a flight at the airport. You fall into conversation with a stranger, and you’re surprised to learn that both of you have been victims of violent crime in the past year. Assuming random selection, what are the chances of that happening?

(b) Explain why you cannot use the same technique to find the probability that both members of a married couple have been victims of violent crime in the past year.

11 For this problem, please use the table of marital status (reprinted on the last page).

(a) Find P(divorced).

(b) Give two interpretations of that probability.

(c) What type of probability is this: classical, empirical, experimental, theoretical?

(d) Find P(divorced^C) and give one interpretation.

(e) Find P(man and married).

(f) Find P(man or married). (Work this with and without the formula.)

(g) Find the probability that a randomly selected male was never married:
P(never married | male) = ?

(h) Find P(man | married), and interpret as “____% of ____ were ____.”

(i) Find P(married | man), and interpret as “____% of ____ were ____.”

12 In five-card draw poker, you are dealt five cards and then during the betting you can discard some in hopes that the replacements will improve your hand. You have a pat hand if the first five cards are good enough that you don’t need to discard. What’s the probability you’ll be dealt a diamond flush (five diamonds) as a pat hand?
13 There are 20 M&Ms left in the dish: 5 blue, 4 orange, 3 green, 3 yellow, 3 brown, and 2 red. The yellows are your favorites. Your friend takes three M&Ms without looking.

(a) What’s the chance that she leaves your favorites behind?

(b) What’s the chance that all three of her picks are red?

14 Tom Turkey invested in two risky startup companies, A and W. There is a 0.90 probability that company A will go bankrupt, and a 0.80 probability that company W will go bankrupt. Assuming the two companies have no connection, find the probabilities that (a) both will go bankrupt; (b) one of them, but not both, will go bankrupt; (c) neither will go bankrupt.
Colors of Plain M&Ms
Blue      24%
Orange    20%
Green     16%
Yellow    14%
Brown     13%
Red       13%
15 Without looking, you take three M&Ms from a new three-pound bag. (The bag contains over a thousand M&Ms.) Use the probability model of plain M&M colors (reprinted on the last page) to answer these questions.

(a) Find the probability that all three are red.

(b) Find the probability that none are red.

(c) Find the probability that at least one is green.

(d) Find the probability that exactly one is green.

16 A poll found that 45% of baseball fans had attended a game in person within the past year. Of five randomly selected baseball fans, find the probability that at least one fan had not attended a game within the past year.
17 Without looking, Grace Underfire takes two sourballs from a bowl that contains 11 cherry and 9 orange flavor. What is the probability that she will get one of each flavor?
18 An annual church raffle offers one chance in 500 of winning something. Find the chance that you win at least once if you play five years in a row.
19 Butch will miss an important TV program while taking his statistics exam, so he sets both his DVRs to record it. The first one records 70% of the time, and the second one records 60% of the time. (Their performance is independent.) What is the probability that he gets home after the exam and finds
(a) No copies of his program?
(b) One copy of his program?
(c) Two copies of his program?

Problem Set 2

20 Police plan to enforce speed limits during the morning rush hour on four different routes into the city. The traps on routes A, B, C, and D are operated 40%, 30%, 20%, and 30% of the time, respectively. Biff always speeds to work, and he has probability 0.2, 0.1, 0.5, and 0.2 of using those routes.
(a) What’s the probability that he’ll get a ticket on any one morning?
(b) What’s the probability he’ll go five mornings without a ticket?
(Hint: His choice of a route, and whether there’s a speed trap on that route, are independent.)
21 For this problem, please use the table of marital status (reprinted on the last page). Show that the events “man” and “divorced” are not independent.
22 I remarked that if you flip a fair coin repeatedly, you’ll see a run of ten heads or ten tails. Show why this should happen about twice in every thousand flips.
23 (adapted from Dabes and Janik 1999 [see “Sources Used” at end of book], page 24)  The probability that a certain door is locked is 0.5. The key to the door is one of five unidentified keys hanging on a rack. You select two keys before going to the door. Find the probability that you can open the door without returning for another key.


US Marital Status in 2006 (in Millions)
                  Men    Women    Totals
Never married    30.3     25.0      55.3

Colors of Plain M&Ms
Blue      24%
Orange    20%
Green     16%
Yellow    14%
Brown     13%
Red       13%

Seat-Belt Use by College Students Driving (sample size: 4521)
Never                2.61%
Rarely               5.51%
Sometimes            7.63%
Most of the time    15.84%
Always              68.41%

[Figure: 52 cards in a standard deck]

[Figure: 36 equally likely outcomes from rolling two dice]


6. Discrete Probability Models

Updated 18 Jan 2017

Intro: In Chapter 5, you looked at the probabilities of specific events. In this chapter, you’ll take a more global view and look at the probabilities of all possible outcomes of a given trial.


6A.  Random Variables

The random variable is one of the main concepts of statistics, and we’ll be dealing with random variables from now till the end of the course.


A variable is “the characteristic measured or observed when an experiment is carried out or an observation is made.”

Upton and Cook (2008, 401) [see “Sources Used” at end of book]

If the results of that procedure depend on chance, completely or partly, you have a random variable. Each outcome of the procedure is a value of the variable. We use a capital letter like X for a variable, and a lower-case letter like x for each value of the variable.

As you learned in Chapter 1, numeric variables can be discrete or continuous. A discrete random variable can have only specific values, typically whole numbers. A continuous random variable can have infinitely many values, either across all the real numbers or within some interval.

In this chapter, you’ll be concerned with discrete random variables. In the next chapter, you’ll look at one particular type of continuous random variable, the normal distribution.

Example 1: You roll three dice. The number of sixes that appear is a random variable, and the total number of spots on the upper faces is another random variable. These are both discrete.

Example 2: You randomly select a household and ask the family income for last year. This is a continuous random variable.

Example 3: You randomly select twelve TC3 students, measure their heights, and take the average. “Height of a student” is a continuous random variable, and “average height in a 12-student sample” is another continuous random variable.

Example 4: You randomly select 40 families and ask the number of children in each. “Number of children in family” is a discrete random variable, and “average number of children in a sample of 40 families” is a continuous random variable.

6B.  Discrete Probability Distributions

Definition: A discrete probability distribution or DPD (also known as a discrete probability model) lists all possible values of a discrete random variable and gives their probabilities. The distribution can be shown in a table, a histogram, or a formula. Like any probabilities, the probabilities in a DPD can be determined theoretically or experimentally.

Prize          Value, x    Chance of Winning, P(x)
Two Camaros    $100,000    1 in 5,000,000
Cash             10,000    1 in 1,000,000
Apple iPad        1,000    1 in 500,000
Various             500    1 in 250,000
Gift card             5    0.9999928

Example 5: In March 2013, Royal Auto sent me one of those “Win big!” flyers with a fake car key taped to it. The various prizes, and chances of winning, are shown in the table above.

This is a discrete probability distribution. The discrete variable X is “prize value”, and the five possible values of X are $100,000 down to $5.

Remember the two interpretations of probability: probability of one = proportion of all. From the table, you can equally well say that any person’s chance of winning a $500 prize is 1/250,000 = 0.000 004 = 0.0004%, or that in the long run 0.0004% of all the people who participate in the promotion will win a $500 prize.

A discrete probability distribution must list all possible outcomes. The total probability for all possible outcomes in any situation is 1. Therefore, for any discrete probability distribution, the probabilities must add up to 1 or 100%.
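
You can check the “adds up to 1” requirement for the Royal Auto distribution with exact fractions in Python. This is a minimal sketch; the dictionary name `dpd` is my own, and it maps each prize value x to its probability P(x).

```python
from fractions import Fraction

# Royal Auto prizes: value x -> probability P(x), kept as exact fractions.
dpd = {
    100_000: Fraction(1, 5_000_000),
    10_000: Fraction(1, 1_000_000),
    1_000: Fraction(1, 500_000),
    500: Fraction(1, 250_000),
    5: Fraction("0.9999928"),
}

total = sum(dpd.values())
print(total)   # a valid distribution must total exactly 1
```

Exact fractions matter here: with rounded decimals the total could come out as 0.9999999 or 1.0000001, and you couldn't tell a rounding problem from a data-entry mistake.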

6B1.  Mean and Standard Deviation of a DPD

Definitions: Suppose you do a probability experiment a lot of times. (For the Royal Auto example, suppose bazillions of people show up to claim prizes.) Each outcome will be a discrete value. The mean of the discrete probability distribution, μ, is the mean of the outcomes from an indefinitely large number of trials, and the standard deviation of the discrete probability distribution, σ, is the standard deviation of the outcomes from an indefinitely large number of trials. The mean of any probability distribution is also called the expected value, because it’s the expected average outcome in the long run.

How do you find the mean and SD of a discrete probability distribution? Well, one interpretation of probability is long-term relative frequency, so you can treat a discrete probability distribution as a relative frequency distribution. (You can also think of the probabilities as weights, with the mean as the weighted average.) On the TI-83/84, that means good old 1-Var Stats, just like in Chapter 3.

Textbooks all list the formulas, so if you want to know them here they are. But in fact everybody uses software except in the simplest cases.

μ = ∑ x·P(x)     σ = √[ ∑ (x²·P(x)) − μ²]

For ∑, see ∑ Means Add ’em Up in Chapter 1.

Example 6: To find the mean and SD of the distribution of winnings in the Royal Auto sweepstakes, put the x’s in one list and the P(x)’s in another list. Caution: When the probability is a fraction, enter the fraction, not an approximate decimal. The calculator will display an approximate decimal, but it will do its calculations on a much more precise value.

After entering the x’s and p’s, press [STAT] [►] [1] and specify your two lists, such as 1-Var Stats L1,L2. (Yes, the order matters: the x list must be first and the P(x) list second.) When you get your results, check n first. In a discrete probability distribution, n represents the total of the probabilities, so it must be exactly 1. If it’s just approximately 1, you made a mistake in entering your probabilities.

TI-83 Stat Edit input screen      TI-83 Stat Edit output screen; see text

The mean of the distribution is μ = $5.03, and the standard deviation is σ = $45.85.

Interpretation: In the long run, the dealership will have to pay out $5.03 per person in prizes. The SD is a little harder to get a grasp on, but notice that it’s more than nine times the mean. This tells you that there is a lot of variability in outcome from one person to the next. In general, the mean tells you the long-term average outcome, and the SD tells you the unpredictability of any particular trial. You can look at the SD as a measure of risk.

A couple of notes about the calculator output: The calculator knows that a DPD is a population, so it gives you σ and not s for the SD. It should give you μ for the mean, but instead it displays x̄, so you need to make the change. I’ve already mentioned that the sum of the probabilities (n) must be exactly 1, not just approximately 1.
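If you’d like to check Example 6 away from the calculator, here is a minimal Python sketch of the two formulas above. Python isn’t part of this book, and the variable names are made up for illustration. It keeps the probabilities as exact fractions, in the same spirit as entering fractions rather than rounded decimals on the TI-83/84.

```python
from fractions import Fraction
from math import sqrt

# Prize values x and probabilities P(x) from the Royal Auto flyer.
# Fractions keep the probabilities exact instead of rounded.
prizes = [
    (100_000, Fraction(1, 5_000_000)),
    (10_000, Fraction(1, 1_000_000)),
    (1_000, Fraction(1, 500_000)),
    (500, Fraction(1, 250_000)),
    (5, Fraction("0.9999928")),
]

total_p = sum(p for _, p in prizes)                    # plays the role of n; must be exactly 1
mu = sum(x * p for x, p in prizes)                     # mean: sum of x * P(x)
variance = sum(x**2 * p for x, p in prizes) - mu**2    # sum of x^2 * P(x), minus mean^2
sigma = sqrt(variance)                                 # SD is the square root of that

print(f"total probability = {total_p}")
print(f"mu = {float(mu):.2f}, sigma = {sigma:.2f}")
```

The first printed line is the same sanity check as looking at n on the calculator: if it isn’t exactly 1, a probability was entered wrong.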

6B2.  Comparing DPDs: Parking Choices

Example 7: When visiting the city, should you park in a lot or on the street? On a quarter of your visits (25%), you park for an hour or less, which costs $10 in a lot; for parking more than an hour they charge a flat $14. If you park on the street, you might receive a simple $30 parking ticket (p = 20%), or a $100 citation for obstruction of traffic (p = 5%), but of course you might get neither. Which should you do?

(Adapted from Paulos 2004 [see “Sources Used” at end of book].)

You have two probability models here, one for the outcomes of parking in a lot, and one for street parking. Begin by putting the two models into tables:

Parking in lot        x      P(x)
≤ 1 hour              $10    0.25
> 1 hour              $14

Parking on street     x      P(x)
Parking ticket        $30    0.20
Obstruction ticket    $100   0.05
No ticket

The problem leaves out some things that you can figure for yourself. Remember that every probability model includes all outcomes, and the probabilities add up to 1. If there’s a 25% chance of parking up to an hour, there must be a 100−25 = 75% chance of parking more than an hour. And on the street, if you have a 20+5 = 25% chance of getting some kind of ticket, you have a 100−25 = 75% chance of getting neither. The cost of getting neither ticket is zero.

Now you can fill in the empty cells in the tables.

Parking in lot        x      P(x)
≤ 1 hour              $10    0.25
> 1 hour              $14    0.75
Total                        1.00

Parking on street     x      P(x)
Parking ticket        $30    0.20
Obstruction ticket    $100   0.05
No ticket             $0     0.75
Total                        1.00

I showed the total probability to emphasize that it’s 1. Never compute the total of the outcomes (x’s), because that wouldn’t mean anything.

How do these tables help you make up your mind where to park? By themselves, they don’t. But they let you compute μ and σ, and that will help you decide.

I placed the x’s and P(x)’s for the parking lot in L1 and L2, and did 1-Var Stats L1,L2. I placed the x’s and P(x)’s for street parking in L3 and L4 and did 1-Var Stats L3,L4. Here are the results:

Lot: parameters for parking lot: mean=13, SD=1.732050808      Street: parameters for street parking: mean=11, SD=23.64318084

As always, look first at n. If it’s not exactly 1, find your mistake in entering the probabilities.

Now you can interpret these results. Parking in the lot is a bit more expensive in the long run (μ = $13.00 per day versus μ = $11.00 per day). But there are no nasty surprises (σ = $1.73, little variation from day to day). Parking on the street is much riskier (σ = $23.64), meaning that what happens today can be wildly different from what happened yesterday.
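For readers who want to reproduce the two sets of parameters without a TI-83/84, here is a rough Python sketch. The function name mean_and_sd is hypothetical, not something from the calculator or the MATH200A program. It applies the μ and σ formulas from earlier in this chapter to each parking model and refuses to continue if the probabilities don’t add up to 1.

```python
from math import sqrt

def mean_and_sd(model):
    """Return (mu, sigma) for a discrete distribution given as (x, P(x)) pairs."""
    total = sum(p for _, p in model)
    if abs(total - 1) > 1e-9:          # same idea as checking n on the calculator
        raise ValueError(f"probabilities add up to {total}, not 1")
    mu = sum(x * p for x, p in model)
    sigma = sqrt(sum(x**2 * p for x, p in model) - mu**2)
    return mu, sigma

# (cost, probability) pairs taken from the completed tables
lot = [(10, 0.25), (14, 0.75)]
street = [(30, 0.20), (100, 0.05), (0, 0.75)]

lot_mu, lot_sigma = mean_and_sd(lot)
street_mu, street_sigma = mean_and_sd(street)
print(f"lot:    mu = {lot_mu:.2f}, sigma = {lot_sigma:.2f}")
print(f"street: mu = {street_mu:.2f}, sigma = {street_sigma:.2f}")
```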

So what should you do? Statistics can give you information, but part of your decision is always your own temperament. If you like stability and predictability — if you are risk averse — you’ll opt for the parking lot. If it’s more important to you to save $2 a day on average, and you can accept occasionally getting hit with a nasty fine, you’ll choose to park on the street.

6B3.  Fair Price of a Game

Definitions: The fair price of a game is the price that would make all parties come out even in the long run. (We’re not just talking traditional games here. A game is any activity where the participants stand to gain or lose money or something else of value. Usually chance contributes to the outcome, but not necessarily.)

The fair price of a game is the price that would make the expected value or mean value of the probability distribution equal to zero, the break-even point.

(“Fair price” is one of those math words that look like English but mean something different. You should expect to pay more than the fair price because the operator of the game — the insurance company or casino or stockbroker — also has to cover selling and administrative expenses.)

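There are two ways to compute the fair price:

  1. Multiply each prize by its probability and add up the products.
  2. Find the mean μ of the probability distribution at the actual price, and add μ to the actual price.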

Die shows    x                 P(x)
6            $60−12 = $48      1/6
Other        −$12              5/6
Total                          6/6 = 1

Example 8: Take a really simple bar game: a stranger offers to pay you $60 if you roll a 6 with a standard six-sided die, but you have to pay him $12 per roll. Find the fair price of this game.

Method 1: The only prize is $60, and you have a 1/6 chance of winning it. $60×(1/6) = $10.

Method 2: Amounts in L1, probabilities in L2; 1-Var Stats L1,L2. Verify that n=1, and read off the mean of −$2. The actual price is $12, so the fair price is $12 + (−2) = $10.

Naturally, the two methods always give the same answer. Method 2 is easier if you already know the mean of the probability distribution; otherwise Method 1 is easier.
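To see the two methods agree on Example 8, here is a small Python sketch. The variable names are illustrative only, and exact fractions stand in for the calculator’s lists.

```python
from fractions import Fraction

prize = 60                 # what the stranger pays if you roll a 6
price = 12                 # what you pay per roll
p_win = Fraction(1, 6)     # chance of rolling a 6

# Method 1: multiply each prize by its probability and add.
fair_price_1 = prize * p_win

# Method 2: find the mean net outcome at the actual price,
# then add that mean to the actual price.
outcomes = [(prize - price, p_win), (-price, 1 - p_win)]
mu = sum(x * p for x, p in outcomes)
fair_price_2 = price + mu

print(fair_price_1, mu, fair_price_2)   # prints: 10 -2 10
```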

Example 9: A lottery has a $6,000,000 grand prize with probability of winning 1 in 3,000,000. It also has a $10 consolation prize with probability of winning 1 in 1000. What is the fair price of your $5 lottery ticket?

Solution: You don’t need μ, so Method 1 is easier: multiply each prize by its probability and add up the products. $6,000,000×(1/3,000,000) + $10×(1/1000) → fair price is $2.01.

Why does a lottery ticket that is worth $2.01 actually cost $5.00? In effect, the lottery is paying out about 2.01/5.00 ≈ 40% of ticket sales in prizes. Some of the 60% that the lottery commission keeps will cover the lottery’s own expenses, and the rest is paid to the state treasury. This is actually fairly typical: most lotteries pay out in prizes less than half of what they take in. By contrast, the illegal “numbers game” pays out about 70%, or at least it did in the 1980s in Cleveland. (Don’t ask me how I know that!)
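The lottery arithmetic in Example 9 can be sketched the same way. This is only an illustration with made-up variable names: it applies Method 1 to both prizes and then works out what share of the $5 ticket price comes back as prizes.

```python
from fractions import Fraction

ticket_price = 5
# (prize value, probability of winning it)
prizes = [
    (6_000_000, Fraction(1, 3_000_000)),
    (10, Fraction(1, 1000)),
]

fair_price = sum(value * p for value, p in prizes)   # Method 1
payout_share = fair_price / ticket_price             # fraction of sales paid out in prizes

print(f"fair price = ${float(fair_price):.2f}")
print(f"share of ticket sales paid out = {float(payout_share):.1%}")
```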

6C.  Bernoulli Trials

In the examples so far of probability models, I’ve had to give you a table of probabilities. But there are many subtypes of discrete probability distribution where the probabilities can be calculated by a formula. The rest of this chapter will look at part of one family, discrete probability distributions that come from Bernoulli trials.


Repeated trials of a process or an event are called Bernoulli trials if they have both of these characteristics:

  1. Each trial has only two possible outcomes. We call those “success” and “failure”. However, “success” is not necessarily a desirable outcome. Success simply means the outcome you’re interested in, and failure is the other outcome.
  2. The probability of success, denoted p, is the same for every trial. This is another way of saying that the trials are independent. (Even if they’re not independent, you can usually treat the trials as independent if the sample is a small part of the population, not more than about 10%.)

If the probability of success on each trial is p, then the probability of failure on each trial is 1−p, or q for short.

Bernoulli trials are named after Jacob Bernoulli, a Swiss mathematician. He developed the binomial distribution, which you’ll meet later in this chapter.

Example 10: You randomly interview 30 people to find out which party they will vote for in the next election. These are not Bernoulli trials, because there are more than two possible outcomes. (New York State ballots often have six or more parties listed, though some parties just endorse the Republican or Democratic candidate.)

Example 11: On reflection, you realize that you don’t care which party a given voter will choose. All you care about is whether they are voting for your candidate or not, so you randomly select 30 registered voters and ask, “Will you be voting for Abe Snake for President?” (Yes, that’s a real thing; here’s a video.) These are Bernoulli trials, because there are only two answers, and the probability of voting for Abe Snake is the same for each randomly selected person. (p equals the proportion of Abe Snake voters in the population. Remember, proportion of all = probability of one.)

Actually, this overlooks the undecided or “swing” voters. These become fewer as the election gets closer, but in real life they can’t be overlooked because they may be a larger proportion than the leading candidate’s lead.

Example 12: You draw cards from a deck until you get a heart. These are not Bernoulli trials. Although there are only two outcomes, heart and other suit, the probability changes with each draw because you have removed a card from the deck.

Variation: You replace each card and reshuffle the deck before drawing the next card. Then these become Bernoulli trials because the probability of drawing a heart is 25% on every trial.

Variation: You have five decks shuffled together, instead of one 52-card pack. You don’t replace cards after drawing them. You can treat these as Bernoulli trials even without replacement, because you won’t be drawing enough cards to alter the probabilities significantly.

How do I know? Five packs is 260 cards, and 10% of 260 is 26. On the first card, P(heart) = 25%. It’s quite unlikely that you’d have no hearts by the 26th card (0.04% chance), but if you did, the probability of a heart on the 27th card would be: 5×13/(5×52−26) ≈ 27.8%. That’s not much different from the original 25%.

(You don’t have to take my word for these probabilities. Use the sequences method from Chapter 5 to compute them.)
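If you’d rather let a computer multiply out the sequence, here is a hedged Python sketch of the five-deck calculation. It assumes exactly the setup described above (260 cards, 65 hearts, 26 draws without replacement); the variable names are invented for illustration.

```python
from fractions import Fraction

decks = 5
cards = 52 * decks       # 260 cards in the combined pack
hearts = 13 * decks      # 65 of them are hearts
draws = 26               # 10% of the pack

# Probability that none of the first 26 cards is a heart:
# multiply the chance of a non-heart on each successive draw.
p_no_hearts = Fraction(1)
for i in range(draws):
    p_no_hearts *= Fraction(cards - hearts - i, cards - i)

# If that happened, all 65 hearts are still among the 234 cards left.
p_heart_27th = Fraction(hearts, cards - draws)

print(f"P(no hearts in first 26) = {float(p_no_hearts):.4%}")
print(f"P(heart on 27th, given none so far) = {float(p_heart_27th):.1%}")
```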

Although this sample without replacement violates independence, it doesn’t violate it by very much, not enough to worry about. This bears out what I said earlier: Trials without replacement can still be treated as independent when the sample is small relative to the population.

6D.  The Geometric Model

Example 13: According to the AVMA (2014) [see “Sources Used” at end of book] 30.4% of US households own one or more cats. Suppose you randomly select some households.
(a) How likely is it that the first time you find cat owners is in the fifth household?
(b) How likely is it that your first cat-owning household will be somewhere in the first five you survey?

Although you could compute these individual probabilities using techniques from Chapter 5, there’s a specific model called the geometric model that makes it a lot easier to compute. Also, using the geometric model you can get an overview of the probabilities for various outcomes, which you’d miss by computing probabilities of specific events using the previous chapter’s techniques. If trials are independent, and you want the probability of a string of failures before your first success, you’re using a geometric model.


The geometric model, also known as the geometric probability distribution, is a kind of discrete probability distribution that applies to Bernoulli trials when you try, and try, and try again until you get a success. P(x) is the probability that your first success will come on your xth attempt, after x−1 failures.

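Expanding on the definition of Bernoulli trials, you can say that a geometric model is one where each trial has only two possible outcomes, the probability of success p is the same on every trial, and you keep making trials until you get your first success.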

The probability of success on any given trial, p, completely describes a geometric model.

geometric distribution for p=0.304

Here’s a picture of part of the geometric model for cat-owning households, with p = 0.304.

How do you read this? The horizontal axis is x, the number of the trial that gives your first success, and the vertical axis is P(x), the probability of that outcome.

For example, there’s a hair over a 30% chance that you’ll find cat owners in your first household, P(1) = 30.4%. There’s about a 21% chance that the first household won’t own cats but the second household will, P(2) ≈ 21%. Skipping a bit to x = 6, there’s just about a 5% chance that the first five households won’t have cats but the sixth will, P(6) ≈ 5%. And so forth.

x = 1 is always the most likely outcome, and larger x values are successively less and less likely. This is true for every geometric distribution, not just this particular one with p = 0.304.

The geometric model never actually ends. The probabilities eventually get too small to show in the picture, but no matter what x you pick, the probability is still greater than 0.
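The numbers behind the picture can be tabulated with a short Python sketch. This is an illustration only: it assumes the households are independent, so the chance of x−1 failures followed by a success is found by multiplying, and it stops at x = 10 even though the model itself goes on forever.

```python
p = 0.304        # probability that a household owns cats
q = 1 - p        # probability that it doesn't

# P(x) for x = 1 through 10: x-1 failures, then the first success
probs = [q ** (x - 1) * p for x in range(1, 11)]

running_total = 0.0
for x, prob in enumerate(probs, start=1):
    running_total += prob
    print(f"x = {x:2d}   P(x) = {prob:.4f}   cumulative = {running_total:.4f}")
```

Each probability is smaller than the one before but never reaches 0, and the cumulative column creeps toward 1 without getting there.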

6D1.  Computing Probabilities

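Your TI-83/84 calculator has two menu selections for the geometric model:

  • geometpdf(p,x) answers the question “what’s the probability that my first success comes on trial number x?”
  • geometcdf(p,x) answers the question “what’s the probability that my first success comes on or before trial number x?”

They’re both in the [2nd VARS makes DISTR] menu.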

(If you have a calculator in the TI-89 family, use the [F5] Distr menu. Select Geometric Pdf and Geometric Cdf.)

Let’s use the calculator to find the answers for Example 13. Here p, the probability of success in any given household, is 30.4% or 0.304.

Part (a) wants the probability of four failures followed by a success on the fifth try. For that you use geometpdf. Press [2nd VARS makes DISTR] [▲] [▲] to get to geometpdf, and press [ENTER].

With the “wizard” interface:

Enter p and x.

geometpdf 'wizard' screen, with p=.304 and x=5

Press [ENTER] twice, and your screen will look like the one below.

With the classic interface:

After entering p and x, press [)] [ENTER] to get the answer.

geometpdf of .304 and 5 yields .0713362938

geometpdf(.304,5) = .0713362938 → 0.0713

There’s about a 7% chance you won’t find any cat owners in the first four households but you will in the fifth household.

(You could calculate this the long way. The probability of four failures followed by a success is (1−.304)^4×.304. But the geometric model is easier. That’s the point of a model: one general rule works well enough for all cases, so you don’t have to treat each situation as a special case with its own unique methods.)

Part (b) wants the probability of a success occurring anywhere in the first five trials. This is a geometcdf problem. Press [2nd VARS makes DISTR] [▲] to get to geometcdf, and press [ENTER].

With the “wizard” interface:

Enter p and x.

geometcdf 'wizard' on TI-84, showing p=.304 and x=5

Press [ENTER] twice, and your screen will look like the one below.

With the classic interface:

After entering p and x, press [)] [ENTER] to get the answer.

geometcdf of .304 and 5 yields .8366774327

geometcdf(.304,5) = .8366774327 → 0.8367

There’s almost an 84% chance you will find at least one cat-owning household among the first five.

(Doing this the long way, you would use the complement. The complement of “at least one cat-owning household in the first five” is “no cat-owning households in the first five”. The probability that a given household doesn’t own a cat is q = 1−.304 = 0.696, and the probability that five in a row don’t own cats is 0.696^5. Therefore the original probability you wanted is 1−(.696^5) = 0.8367.)

You don’t actually need formulas for the geometric model, but if you’re curious about what your calculator is doing, here they are:
       geometpdf(p,x) = q^(x−1)·p     geometcdf(p,x) = 1−q^x
where q = 1−p as usual. You can see that the two “long way” paragraphs above actually used those formulas.
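Those two formulas translate almost word for word into Python. The sketch below borrows the calculator’s function names purely for readability; these are ordinary Python functions written for illustration, not anything supplied by TI.

```python
def geometpdf(p, x):
    """Probability that the first success comes on trial x."""
    q = 1 - p
    return q ** (x - 1) * p

def geometcdf(p, x):
    """Probability that the first success comes on or before trial x."""
    q = 1 - p
    return 1 - q ** x

# Example 13, parts (a) and (b)
print(f"{geometpdf(0.304, 5):.4f}")   # 0.0713
print(f"{geometcdf(0.304, 5):.4f}")   # 0.8367

# The cdf is the pdf values for x = 1 through 5 added up.
print(f"{sum(geometpdf(0.304, k) for k in range(1, 6)):.4f}")   # 0.8367
```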

6D2.  Mean and Standard Deviation of a Geometric Distribution

The geometric distribution is completely specified by p, so you can compute the mean and standard deviation quite easily:

μ = 1/p          σ = μ √q  or  (1/p) √(1−p)

Example 14: 30.4% of US households own cats. How many households do you expect you’ll need to visit to find a cat-owning household?

Solution: The expected value of a distribution is the mean. μ = 1/p = 1/.304 = 3.289473684. μ = 3.3. Interpretation: On average, you expect to have to visit between 3 and 4 households to find the first cat owners.

Caution! The expected value (mean) is not the most likely value (mode). Take a look back at the histogram, and you’ll see that the most likely value is 1: you’re more likely to get lucky on the first trial than on any other specific trial. But the distribution is highly skewed right, so the average gets pulled toward the higher numbers.

mean and standard deviation on TI-84; see text for numbers

To compute the SD, just multiply the mean by √q. A handy technique is called chaining calculations. After first calculating the mean, press the [×] key, and the calculator knows you are multiplying the previous answer by something. Here you see that σ = 2.7.

Interpreting σ is a bit harder. The geometric distribution is a type of discrete probability distribution, so you interpret its standard deviation the same way as for any other DPD. In this particular example, σ is almost as large as μ, so you expect a lot of variability. If you and a lot of co-workers go out independently looking for households with cats, the group average number of visits will be 3.3 households, but there will be a lot of variability between different workers’ experience. You can’t use the Empirical Rule here because the geometric model is not a bell curve, but you can at least say you won’t be surprised to find workers who get lucky on the first house (μ−σ ≈ 0.5), and workers who have to visit six houses or more (μ+σ ≈ 6.0).
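As a cross-check on the shortcut formulas, here is a hedged Python sketch. It computes μ = 1/p and σ = μ√q directly, then compares them with the general DPD formulas from section 6B1 applied to the first 500 terms of the distribution. The cutoff of 500 is an arbitrary choice for illustration; the terms beyond it are far too small to affect the result.

```python
from math import sqrt

p = 0.304
q = 1 - p

# Shortcut formulas for the geometric model
mu = 1 / p
sigma = mu * sqrt(q)

# General DPD formulas, summed over x = 1 to 500
xs = range(1, 501)
mu_check = sum(x * q ** (x - 1) * p for x in xs)
sigma_check = sqrt(sum(x**2 * q ** (x - 1) * p for x in xs) - mu_check**2)

print(f"shortcut: mu = {mu:.4f}, sigma = {sigma:.4f}")
print(f"summed:   mu = {mu_check:.4f}, sigma = {sigma_check:.4f}")
```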

6D3.  Making a Decision

Some people find it very hard to make choices because they feel they must consider all the pros and cons of every possibility. Others look at possibilities one at a time and take the first one that’s acceptable. Studies such as The Tyranny of Choice (Roets, Schwartz, Guan 2012 [see “Sources Used” at end of book]) show that the first group may make better choices objectively, but the second group is happier with the items they choose.

Example 15: You have to buy a new sofa. You’d be content with 55% of the sofas out there. Let’s assume that your Web search presents sofas in an order that has nothing to do with your preferences. There are hundreds to choose from, so you decide to adopt the “first one that’s acceptable” strategy. How likely is it that you’d order the third sofa you’d see?

Solution: This is a geometric model, with two failures followed by one success. p = 55%. geometpdf(.55,3) = .111375. There’s about an 11% chance you’d order the third sofa.

6D4.  Baseball

Example 16: Larry’s batting average is .260. During which time at bat would he expect to get his first hit of the game? How likely is he to get his first hit within his first four times at bat?

Solution: This is a geometric model with p = 0.260. The mean or expected value is 1/p = 1/.26 = 3.85, about 4. On average, his first hit each game will come on his fourth time at bat. For the second question, geometcdf(.26,4) = .70013424; there’s about a 70% chance he’ll get his first hit within his first four times at bat.

6E.  The Binomial Model

In the previous section, we looked at the geometric model, where you just keep trying until you get a success. In this section, we’ll look at the binomial model, where you have a fixed number of trials and a varying number of successes.


The binomial model, also known as the binomial probability distribution or BPD, is a kind of discrete probability distribution that applies to Bernoulli trials when you have a fixed number of trials, n.

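Expanding on the definition of Bernoulli trials, you can say that a binomial model is one where each trial has only two possible outcomes, the probability of success p is the same on every trial, and the number of trials n is fixed in advance.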

Example 17: Cats again! 30.4% of US households own one or more cats. You visit five households, selected randomly.
(a) What’s the chance that no more than two have cats?
(b) What’s the chance that exactly two have cats?
(c) What’s the chance that at least two have cats?
(d) What’s the chance that two to four have cats?

binomial distribution for n = 5, p = 0.304

This problem fits the binomial model: n = 5 trials, each household does or does not have cats, and the probability p = 30.4% is the same for each household.

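A picture of this binomial distribution is shown at right, and you can see some differences from the picture of the geometric distribution: x starts at 0 rather than 1, the distribution stops at x = n = 5 instead of going on forever, and the most likely outcome is not the first one.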

How do you read the picture? There’s about a 16% probability that none of the five households will have cats, about 36% that one of the five will have cats, and so on. (Why 36% and not 30.4%? Because there’s a greater chance of “winning” one out of five than one out of one.)

In this book we’re more concerned with computing probabilities, but it can be nice to get an overall picture of a distribution. I made this particular graph by using @RISK from Palisade Corporation, but you can also make histograms of binomial distributions by using MATH200A Program part 1(5).

6E1.  Computing Probabilities

Here you have a choice. Your TI-83/84 calculator comes with two menu selections for the binomial model, but the MATH200A program gives you a simpler interface. Here’s a quick overview of both, before we start on computations:

With the MATH200A program (recommended):

MATH200A Program part 3 gives you one interface for all binomial probability calculations. The program might already be on your calculator from Chapter 3 boxplots, but if it’s not, see Getting the Program for instructions.

To find binomial probability with the program, press [PRGM]. If you see MATH200A in the list, press its menu number; otherwise, press [▲] or [▼] to get to MATH200A, and press [ENTER].

MATH200A menu screen

That puts the program name on your home screen. Press [ENTER] again to run the program, and yet again to dismiss the title screen. You’ll then see a menu. Press [3] for binomial probability.

If you’re not using the program:

The two menu selections are both in the [2nd VARS makes DISTR] menu:

  • binomcdf(n,p,x) answers the question “what’s the probability of no more than x successes in n trials (0 to x successes)?” (The “cdf” stands for cumulative distribution function, because the cdf functions accumulate the probabilities for a range of outcomes.)
  • binompdf(n,p,x) answers the question “what’s the probability of exactly x successes in n trials?” (The “pdf” stands for probability distribution function, because the probability for any particular number of successes is a function of [determined by] that number.)
(Got a TI-89 family calculator? Use the [F5] Distr menu. Select Binomial Pdf or Binomial Cdf. The Cdf function can handle any range of successes, not just 0 to x. See Binomial Probability Distribution on TI-89 for full instructions.)

Now let’s use your TI-83/84 to answer the questions in Example 17. You have five trials, so n = 5. The probability of success on any given household is 30.4%, so p = 0.304.

(a) What’s the probability that no more than two of the five randomly selected households have cats?

With the MATH200A program (recommended):

Press [PRGM], select MATH200A, and press [3] in the MATH200A menu.

MATH200A binomial input screen: n=5, p=.304, from=0, to=2
MATH200A binomial results screen: n=5, p=.304, x=0 to 2, probability=.8315878479

Enter n and p. “No more than two cats” is from 0 to 2 cats, so enter those values when prompted. The program echoes back your inputs and shows the computed probability. To show your work, write down the screen name, the inputs, and the result.

Conclusion: P(x ≤ 2) or P(0 ≤ x ≤ 2) = 0.8316.

If you’re not using the program:

The probability that no more than two of your five households have cats (in other words, the probability that 0 to 2 have cats) is binomcdf(5,.304,2). Press [2nd VARS makes DISTR] and scroll up to binomcdf.

binomcdf(5, .304, 2) = .8315878479

If you don’t have the “wizard” interface, or you have it turned off, binomcdf( will appear on your screen. Enter n, p, and the desired maximum number of successes, in that order, then the closing paren and [ENTER].

’Wizard’ interface input screen: trials=5, p=.304, x value=2

If you have the “wizard” interface, you get a menu screen, but you enter the same information. Press [ENTER] once on Paste and then again when the command is pasted to your home screen.

Either way, write down the binomcdf command and the argument numbers to show your work.

Conclusion: P(x ≤ 2) or P(0 ≤ x ≤ 2) = 0.8316.

(b) What’s the probability that exactly two of five randomly selected households are cat owners?

With the MATH200A program (recommended):

MATH200A binomial input screen: n=5, p=.304, number of successes from 2 to 2
MATH200A binomial results screen: n=5, p=.304, x=2 to 2, probability=.3115838118

You need a specific number of successes, instead of a range. It’s almost exactly the same deal: you just enter the same number for from and to. In this example, to get the probability of exactly two successes, enter number of successes from 2 to 2.

Conclusion: P(x = 2) or P(2) = 0.3116.

If you’re not using the program:

binompdf(5, .304, 2) = .3115838118

The probability of exactly two cat-owner households in five is binompdf(5,.304,2). Press [2nd VARS makes DISTR] and then press [▲] several times to get to binompdf. (Caution! pdf, not cdf.) Press [ENTER], type in the numbers, and press [)] [ENTER].

(The “wizard” interface screen is the same as it was for binomcdf.)

Conclusion: P(x = 2) or P(2) = 0.3116.

(c) What’s the probability that at least two of the five randomly selected households have cats?

With the MATH200A program (recommended):

“At least two”, in a sample of five, means from two to five successes. Enter those values in MATH200A part 3. Here’s the results screen:

MATH200A binomial results screen: n=5, p=.304, x=2 to 5, probability=.4799959639

Conclusion: P(x ≥ 2) or P(2 ≤ x ≤ 5) = 0.4800.

If you’re not using the program:

This one is a little trickier. You could find P(2), P(3), P(4), and P(5) and add them up by hand, but that’s tedious and error prone, and it can introduce rounding errors. Instead, you’ll make the calculator add them up for you.

'wizard' interface for binompdf with trials=5, p=.304, and x-value left blank

First, get all the probabilities for 0 through n successes into a statistics list. To do this, use binompdf (not cdf) but with only the n and p arguments. (If you have the “wizard” interface, leave x value blank.)

After the closing paren, don’t press [ENTER] just yet. Instead, press the [STO→] key and select a statistics list, such as [2nd 6 makes L6]. Then press [ENTER]. This puts the probabilities for 0 successes, 1 success, and so on to 5 successes into L6. (If you want, you could examine them with [►], or on the [STAT] edit screen.)

binompdf(5, .304) stored to L6; sum(L6, 3, 6) = .4799959639

Now you need to sum the desired range of cells. You want 2 ≤ x ≤ 5. But the lowest possible x is 0, and the cells in statistics lists are numbered starting at 1. So to get x from 2 through 5, you need cells 3 through 6. When summing part of a list, add 1 to your desired x values.

Press [2nd STAT makes LIST] [◄] [5] to paste sum(, then [2nd 6 makes L6] [,] 3 [,] 6 [)] [ENTER].

Your answer: P(x ≥ 2) or P(2 ≤ x ≤ 5) = 0.4800.

Beware of off-by-one errors when you solve problems with phrases like at least and no more than. Always test the “edge conditions”. “Okay, I need at least 2, and that’s 2 through 5, not 3 through 5. Oh yeah, add 1 for the statistics list in the TI-83, so I’m summing cells 3 through 6, not 2 through 5.”

Alternative solution: Do you remember solving “at least” problems in Chapter 5? What was the lesson there? With laborious probability problems, the complement is your friend. What’s the complement of “at least two”? It’s “fewer than two”, which is the same as “no more than one”.

Shaky on the logic of complements? Use the enumeration method from Chapter 5: 0 1 2 3 4 5 or 0 1 | 2 3 4 5.

Find the probability of ≤1 household with cats, and subtract from 1:

P(x ≥ 2) = 1 − P(x ≤ 1)

P(x ≥ 2) = 1 − binomcdf(5, .304, 1)

P(x ≥ 2) = .4799959639 → 0.4800
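Here is a rough Python sketch of both routes to part (c), the direct sum and the complement. It builds the list of probabilities from the binomial formula given at the end of this section, using math.comb for nCx; the variable names are illustrative. One difference from the calculator is worth noticing: Python lists are numbered from 0, so the list position equals x and no “add 1” adjustment is needed.

```python
from math import comb

n, p = 5, 0.304
q = 1 - p

# Probabilities for 0, 1, ..., 5 successes, like binompdf(5,.304) stored in a list
pdf = [comb(n, x) * p**x * q ** (n - x) for x in range(n + 1)]

# Direct sum for x = 2 through 5 (positions 2 to 5, because Python counts from 0;
# the slice end 6 is not included)
at_least_two = sum(pdf[2:6])

# Complement: 1 minus the probability of 0 or 1 successes
via_complement = 1 - sum(pdf[0:2])

print(f"direct sum: {at_least_two:.4f}")
print(f"complement: {via_complement:.4f}")
```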

(d) What’s the chance for two to four cat-owning households in your random sample of five households?

With the MATH200A program (recommended):

Nothing new here: just use good old MATH200A part 3. Here’s the results screen:

MATH200A binomial results screen: n=5, p=.304, x=2 to 4, probability=.4773995859

P(2 ≤ x ≤ 4) = 0.4774.

If you’re not using the program:

sum(L6, 3, 5) = .4773995859

You need x from 2 through 4, but remember you always add 1 when summing binomial probabilities from a statistics list, so you put 3 to 5 in your sum command. (You’re still using the same distribution, so there’s no need to repeat binompdf.)

P(2 ≤ x ≤ 4) = 0.4774.

Alternative solution: You can also do it without summing. If you think about it, the probability for x from 2 to 4 is the probability for x from 0 to 4, with x below 2 (x no more than 1) removed: 0 1 2 3 4. In symbols,

P(0 ≤ x ≤ 1) + P(2 ≤ x ≤ 4) = P(0 ≤ x ≤ 4)

and by subtracting that first term you get

P(2 ≤ x ≤ 4) = P(0 ≤ x ≤ 4) − P(0 ≤ x ≤ 1)

binomcdf of 5, .304, 4 minus binomcdf of 5, .304, 1 yields .4773995859

Your probability is the result of subtracting two cumulative probabilities, the cdf from 0 to 4 minus the cdf from 0 to 1. It’s shown above.

This is tricky, I admit. You have to set that x value correctly in the second binomcdf, so this method is not much better than the other one. About all it has going for it is that it avoids storing values in a list and then using sum.

You don’t actually need a formula for the binomial model, but if you’re curious about what your calculator is doing, here it is:
       binompdf(n,p,x) = nCx · p^x · q^(n−x)
Why? p^x is the probability of getting successes on all of the first x trials. q is the probability of failure on one trial, and therefore q^(n−x) is the probability of failure on the remaining trials, after the x out of n successes. But in a binomial probability model, you care how many successes and failures there are, not in what order they occur. To account for the fact that order doesn’t matter, the formula has to multiply by nCx, “the number of ways to choose x objects out of n”. (If you want to know more about nCx, search “combinations” at your favorite math site.)

Unlike the geometric case, there’s no simple formula for binomcdf. Your calculator just has to compute probabilities for x = 0, 1, and so on and add them up.
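
If you’d like to check these numbers in software rather than on the calculator, here is a minimal Python sketch of the formula and of both methods for P(2 ≤ x ≤ 4). The names binom_pdf and binom_cdf are illustrative stand-ins for the calculator’s binompdf and binomcdf, not functions from any library; only Python’s standard math module is assumed.

```python
from math import comb

def binom_pdf(n, p, x):
    """Probability of exactly x successes in n trials: nCx * p^x * q^(n-x)."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

def binom_cdf(n, p, x):
    """Probability of 0 through x successes, found by adding up pdf values."""
    return sum(binom_pdf(n, p, k) for k in range(x + 1))

n, p = 5, 0.304  # five households, 30.4% own cats

# Method 1: add the individual probabilities for x = 2, 3, 4.
by_sum = sum(binom_pdf(n, p, x) for x in range(2, 5))

# Method 2: cumulative probability through 4, minus cumulative through 1.
by_cdf = binom_cdf(n, p, 4) - binom_cdf(n, p, 1)

print(round(by_sum, 4), round(by_cdf, 4))
```

Notice that Python’s range(2, 5) stops before 5, so it covers x = 2 through 4. That is a different off-by-one trap from the calculator’s list positions, but the same kind of thing to watch for.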

6E2.  Baseball Again!

Example 18: Larry’s batting average is .260. How likely is it that he’ll get more than one hit in four times at bat?

binompdf(4, .26) stored to L6; sum(L6, 3, 5) = .27870128 Solution: This is a binomial model with n = 4, p = 0.26, x = 2 to 4. You can use MATH200A part 3 or the binompdf-sum technique to get .27870128. P(x > 1) = 0.2787 or about 28%. (The program is completely straightforward, so I’m showing only the tricky binompdf-sum sequence here.)

Alternative solution: If you don’t have the program, can you see how to use the complement to solve this problem more easily? Check your answer against mine to be sure that your method is correct.

6E3.  Mean and Standard Deviation of a Binomial Distribution

The binomial distribution depends on the proportion in the population (p) and your sample size (n). You can compute the mean and SD quite easily:

μ = np          σ = √[npq]

What are the mean and SD of the number of cat-owning households in a random sample of five households?

μ = np = 5 × 0.304 = 1.52

σ = √[npq] = √[5 × .304 × (1−.304)] = 1.028552381

Conclusion: μ = 1.5 and σ = 1.0.

Interpretation: in a sample of five households, the expected number of cat-owning households is 1.5. Or, if you take a whole lot of samples of five households, on average you will find that 1.5 households per sample own cats. The SD is 1.0. You can’t use the Empirical Rule, but you can say that you expect most of the samples of five to contain μ±2σ = 1.5±2×1.0 = 0 to 3 cat-owning households.
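
The same arithmetic can be sketched in Python, in case you want to try other values of n and p. The function name binomial_mean_sd is made up for this illustration, and clipping the μ±2σ range to whole-number counts between 0 and n is my reading of the interpretation above, not a separate rule.

```python
from math import ceil, floor, sqrt

def binomial_mean_sd(n, p):
    """Return (mean, SD) of the number of successes in n trials."""
    q = 1 - p
    return n * p, sqrt(n * p * q)

n, p = 5, 0.304
mu, sigma = binomial_mean_sd(n, p)

# Whole-number counts that lie within two SD of the mean,
# keeping in mind that a count can't be below 0 or above n.
low_count = max(0, ceil(mu - 2 * sigma))
high_count = min(n, floor(mu + 2 * sigma))

print(round(mu, 2), round(sigma, 4), low_count, high_count)
```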

6E4.  Surprised?

Example 19: 30.4% of US households own one or more cats. You visit ten random households and seven of them own cats. Are you surprised at this result?


A result is surprising or unusual or unexpected if it has low probability, given what you think you know about the population in question. The threshold for “low probability” can vary in different problems, but a typical choice is 5%.

When we ask whether a result is surprising (unusual, unexpected), we are really talking about that result or one even further from the expected value.

You think you know that 30.4% of US households own cats. A sample of ten doesn’t seem very large; how do you decide whether seven successes seems reasonable or unreasonable?

First, what’s the expected value? That’s μ = np = 10×.304 = 3.04.

Next, what does “that result or one further from the expected value” mean? The expected value is 3.04, seven is greater than 3.04, so we’re talking about seven or more successes, x = 7 to 10.

1 minus binomcdf(10, .304, 6) = .0114590334 MATH200A output screen: n=10, p=.304, x=7 to 10, probability=.0114590334 Find the probability of that result or one even further from the expected value. That’s easiest with MATH200A part 3: set n=10, p=.304, x=7 to 10. You can also do it with binomcdf: seven or more successes is the complement of zero to six successes (0 1 2 3 4 5 6 7 8 9 10). Either way, the probability is 0.0115 or just over 1%.

Draw your conclusion. If 30.4% of US households own cats, finding seven or more cat houses in a random sample of ten households is unusual (surprising, unexpected).

That was a trivial example. But in real life, when a result is unexpected it can cast doubt on what you’ve been told. Here’s an example.

6E5.  A Life-or-Death Example

Example 20: In Talladega County, Alabama, in 1962, an African American man named Robert Swain was accused of rape. 26% of eligible jurors in the county were African American, but the 100-man jury panel for Swain’s trial included only 8 African Americans. (Through exemptions and peremptory challenges, all were excluded from the final 12-man jury.) Swain was convicted and sentenced to death.

Swain’s lawyer appealed, on grounds of racial bias in jury selection. The Supreme Court ruled in 1965 that “The overall percentage disparity has been small and reflects no studied attempt to include or exclude a specified number of blacks.”

What do you think of that ruling? If 100 men in the county were randomly selected, is eight out of 100 in the jury pool unexpected (unusual, surprising)?

Solution: This is a binomial model: every man in the county either is or is not African American, the sample size is a fixed 100, and in a random sample there’s the same 26% chance that any given man is African American.

To determine whether eight in 100 is unexpected, ask what is expected. For binomial data, μ = np = 100×.26 = 26; in a sample of 100, you expect 26 African Americans.

MATH200A output screen: n=100, p=.26, x=0 to 8, probability=4.734795002E minus 6 Okay, 26 is expected, 8 is less than 26, so “that result or one further from expected” means 8 or fewer, and you compute the probability for x = 0 to 8.

Use binomcdf(100,.26,8) or MATH200A part 3. Either way you get a probability of 4.734795002E-6, or about 0.000 005, five chances in a million. That is unexpected. It’s so unlikely that we have to question the county’s claim that the selection was random.

Unfortunately, Mr. Swain’s lawyer didn’t consult a statistician.
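
The “surprised?” procedure from Examples 19 and 20 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the 5% threshold is the typical choice mentioned earlier, and treating an observed count exactly equal to np as belonging to the upper tail is an arbitrary choice of mine.

```python
from math import comb

def binom_pdf(n, p, x):
    """Probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def tail_probability(n, p, observed):
    """P(the observed count, or one even further from the expected value np)."""
    expected = n * p
    if observed >= expected:
        xs = range(observed, n + 1)   # observed or more
    else:
        xs = range(0, observed + 1)   # observed or fewer
    return sum(binom_pdf(n, p, x) for x in xs)

def is_surprising(n, p, observed, threshold=0.05):
    """True if the tail probability falls below the chosen threshold."""
    return tail_probability(n, p, observed) < threshold

cats = tail_probability(10, 0.304, 7)   # Example 19: x = 7 to 10
jury = tail_probability(100, 0.26, 8)   # Example 20: x = 0 to 8
print(f"{cats:.4f}", f"{jury:.3e}")
print(is_surprising(10, 0.304, 7), is_surprising(100, 0.26, 8))
```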

What Have You Learned?

Key ideas:

(The online book has live links to all of these.)

Study aids:

Chapter 7 WHYL → ← Chapter 5 WHYL

Exercises for Chapter 6

Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.

Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

1 You roll five dice and count the number of twos that appear.
(a) List the possible values of the discrete random variable, X = “number of twos in five dice”.
(b) What type of probability model is appropriate? Why?
2 A lottery has a 1 in 10 million chance of paying $10,000,000, a 1 in 125 chance of paying $100, and a 1 in 20 chance of paying $10. A ticket costs $5, and you do not get that money back if you win a prize.
(a) Construct a discrete probability distribution.
(b) Is this a good deal or a bad deal for you? Explain.
3 Blood Types [see “Sources Used” at end of book] at the Stanford School of Medicine’s Web site lists the relative frequencies of blood types in the US. (There’s also a nice chart of what blood types you can safely receive, based on your own blood type.) Only 6.6% of the US have O negative blood.

Velma the Vampire will drink anything, but she prefers O negative. She doesn’t know a victim’s blood type until she tastes it.
(a) How many does she expect to drain before she gets some O negative?
(b) How likely is it that she’ll find her first O negative within her first ten victims?
(c) How likely is it that exactly two of her first ten victims will be O negative?

4 In January 2013, a CBS News story by Sarah Dutton and others [see “Sources Used” at end of book] reported poll results: 92% of American adults favored universal background checks for gun buyers.
(a) If TC3 students are representative of American adults when the poll was taken, what’s the chance that you’ll have to ask three TC3 students before the third one opposes universal background checks?
(b) How likely is it that you’d find a student opposing universal background checks somewhere in the first three you ask, not necessarily in third position?
5 Suppose 80% of students who register for Elizabethan Sonnets complete the course successfully.
(a) Imagine taking many, many samples of seven people, with replacement. What would be the expected number and standard deviation of the number of people that would finish successfully, per sample of seven?
(b) At the end of the semester, imagine a random group of seven students who originally registered for the course. Find the probability that four to six of them completed it successfully.
(c) What’s the chance that, when you ask each person in turn, the third person you ask is the first one who successfully completed the course?
(d) What’s the chance that the first person that you find who successfully completed the course is one of the first two you ask?
6 In a June 2013 poll, the Pew Research Center (2013b) [see “Sources Used” at end of book] found that 49% of American adults approved of President Obama’s job performance. In a random sample of 40 American adults, taken at the same time, would you be surprised if 13 approved his performance? Why or why not?
7 According to the Social Security Administration (2010) [see “Sources Used” at end of book], 0.1304% of 22-year-old males are expected to die in the next year.
(a) What is the fair price of a $100,000 one-year term life insurance policy on a 22-year-old male? (To keep things simple, assume that the company will charge the same price to every 22-year-old male, without regard to lifestyle or health factors.)
(b) The company actually charges $180.00 for this policy, more than the fair price. Is this unfair? Explain.
8 A coin is weighted — the chance of heads is not 50%. On five flips of that coin, the probability of various numbers of heads is shown by this model:
x      0       1       2       3       4       5
P(x)   0.0778  0.2591  0.3456  0.2305  0.0768  0.0102

(a) Find and interpret the mean and standard deviation of this probability model.
(b) For an extra challenge, can you use your answer from part (a) to construct a simpler probability model for five flips of this coin?

9 Long experience shows that a particular drug will help 70% of the people who take it.
(a) If you take a random sample of five people, what is the probability that the drug helps at least three?
(b) If you take many samples of 10 people, what’s the average number of people per sample that the drug will help?
(c) In a random sample of 10 people, would you be surprised if the drug helps only five? Why or why not?
10 In April 2013, the Pew Research Center [see “Sources Used” at end of book] released poll results for the question “Which of the following best describes how you feel about doing your taxes?” Surprisingly (to me, anyway), 34% said they like or love doing their taxes.
(a) How many Americans would you expect to have to ask to find one who likes or loves doing her taxes?
(b) If you ask five random Americans, what’s the probability that none of them will say they like or love doing their taxes?
11 In a sentence or two, write down the difference between the geometric and binomial models. (Write it, don’t just think it. It’s easy to tell yourself you understand something, but the rubber meets the road when you have to put your understanding into words on paper.)
12 In a sentence or two, write down the difference between pdf and cdf.

Solutions → 

What’s New


7. Normal Distributions

Updated 19 Dec 2016 (What’s New?)

Summary: The normal distribution (ND) is important for two reasons. First, many natural and artificial processes are ND. You’ll look at some of those in this chapter. Second, any process can be treated as a ND through sampling. That will be the subject of Chapter 8, and it’s also the foundation of the inferential statistics you’ll do in Chapters 9 through 11.

7A.  Continuous Random Variables

You met random variables back in Chapter 6. Any random variable has a single numerical value, determined by chance, for each outcome of a procedure. Discrete random variables are limited to specified values, usually whole numbers. But a continuous random variable can take any value at all, within some interval or across all the real numbers.

Just as discrete probability models are used to model discrete variables, continuous probability models are used to model continuous variables. Of course, because a continuous random variable has infinitely many possible values, you can’t make a table of values and probabilities as you could do for a discrete distribution. Instead, either there’s an equation, or just a density curve (below).

A probability model is often called a distribution, so you can say that a variable “is normally distributed” (ND), that it “is a normal distribution” (also ND), or that it “follows a normal probability model”.

There are lots of specialized continuous distributions, but the normal distribution is most important by a wide margin. Many, many real-life processes follow the normal model, and the ND is also the key to most of our work in inferential statistics.

This section will give you some concepts that are common to all continuous distributions, and the rest of the chapter will talk about special properties of the normal distribution and applications. In Chapter 8, you’ll apply the normal distribution to get a handle on the variation from one sample to the next.

7A1.  Density Curves

In Chapter 2, you learned to graph continuous data by grouping the data in classes and making a histogram, like the one below left. This is wait times in a fast-food drive-through, with time in minutes — not whole minutes, which would make a discrete distribution, but minutes and fractional minutes.

Any sample you might take has a finite number of data points, so you set up classes, place the data points in the classes, and then draw a histogram. The height of each bar is proportional to the frequency or relative frequency of that class.

histogram      density curve


But when you come to consider all the possible values of a continuous variable, you have an infinite number of data points. If you tried to assign them to classes, it would take you forever, literally! Instead, you draw a smooth curve, called a density curve, to show the possible values and how likely they are to occur. An example is shown above right.

The density curve is a picture of a continuous probability model. It doesn’t just represent the data in a particular sample, but all possible data for that variable — along with the probabilities of their occurrence, as you’ll see next.

7A2.  Probability and Continuous Distributions

density curve showing probability 29.4% between x=6.40 and x=9.50 Up to now, the height of a bar in a histogram has been the number of data points in that class, or the relative frequency of that class. But how do you interpret the height of a density curve?

Answer: you don’t! The height of the curve above any particular point on the x axis just doesn’t lend itself to a simple interpretation. You might think it would be the probability of that value occurring. But with infinitely many possible values, “what’s the likelihood of a wait time of exactly 4 minutes?” just isn’t a meaningful question, because what about 3.99997 minutes or 4.002 minutes?

Area = Probability

What is meaningful is the probability within an interval, which equals the area under the curve within that interval. For example, in this illustration, the probability of a wait time of 6.4 to 9.5 minutes is 29.4%. In symbols,

P(6.4 ≤ x ≤ 9.5) = 29.4%

or

P(6.4 < x < 9.5) = 29.4%

That’s right — the probability is the same whether you include or exclude the endpoints of the interval.

Okay, I lied. The height of the curve is meaningful, but only if you’ve had some calculus. The curve is the graph of a probability density function or pdf. The integral of that curve from a to b is the area between x=a and x=b and is the probability that the random variable will have a value between a and b.

This explains why the probability is the same whether you include or exclude either endpoint of the interval. The difference is the area of a “rectangle” whose height is the height of the density curve and whose width is the distance from a to a — which is zero. Thus the area of the “rectangle” is zero, and the probability of the random variable taking any particular value, exactly, is zero.

Since area equals probability, and total probability must be 1, total area must be 1. Every pdf — the height of every density curve — is scaled so that the integral from −∞ to +∞ is 1.

density curve showing probability 20.6% to left of x=3.33 You can also have the probability for an interval with one boundary, < or ≤ some value like the picture at right, or > or ≥ some value. For example, 3.33 minutes is about 3 minutes and 20 seconds, so the probability of waiting up to 3 minutes and 20 seconds is 20.6%: P(x ≤ 3.33) = 20.6%.

The total area under any density curve equals the probability that the random variable will take any one of its possible values, which of course is 1, or 100%. So you can use the complement to say that the probability of waiting 3 minutes and 20 seconds or more (or, more than 3 minutes and 20 seconds) is 100% − 20.6% = 79.4%.

Two Interpretations of Probability

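You remember from Interpreting Probability Statements in Chapter 5 that every probability can be interpreted as a probability of one or a proportion of all. For example, P(x > 3.33) = 79.4% can equally well be interpreted in two ways:

Probability of one: there’s a 79.4% chance that any randomly selected wait time will be longer than 3.33 minutes.

Proportion of all: 79.4% of all wait times are longer than 3.33 minutes.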

Which interpretation you use in a given situation depends on what seems simplest and most natural in the situation. Here, the “proportion of all” interpretation seems simpler. But you’re always free to switch to the other interpretation if it helps you in thinking about a situation.

Area = Probability of One = Proportion of All

7B.  The Normal Model

Why study the normal distribution?

First, it’s useful on its own. Lots and lots of real-life distributions match the normal model: body temperature or blood pressure of healthy people, scores on most standardized tests, commute times on a given route, lifetimes of batteries or light bulbs, heights of men or women, weights of apples of a particular variety, measurement errors (in many situations), and on and on.

Why is the ND so common? In real life, very few events have just one cause; most things are the result of many factors operating independently. It turns out that if you take a lot of independent random variables and add them up, their sum is approximately ND. For example, your IQ score results from multiple genetic factors, countless occurrences in your education and your family life, even transient factors like how well you slept the night before the test. Most of these are independent of each other, so the result of adding them is a ND.

Several mathematicians can claim the discovery of the normal distribution. Abraham de Moivre (1667–1754, French) was probably first, in 1733. But the name of Carl Friedrich Gauss is permanently coupled to the normal distribution — literally. Although Sir Francis Galton coined the term normal distribution in 1889, Karl Pearson called it the Gaussian distribution in 1905, and that’s still a recognized synonym.

Second, through sampling, even non-ND populations follow a normal model. You’ll use this model in inferential statistics to make statements about a whole population based on just one sample. You’ll learn about this neat trick in Chapter 8.

7B1.  Properties of the Normal Distribution

normal curves with standard deviation 4 and means 0, 2, and 5
normal curves with mean 2 and standard deviations 2, 4, and 6

The normal distribution (ND) has the properties of other continuous distributions as listed earlier. In particular, area = probability, and the total area under the density curve is the total probability, which is 1. The ND also has these special properties:

All of this is the theoretical normal distribution. In fact, nothing in real life is perfectly ND, because nothing in real life has an infinite number of data points. When we say something is ND, we mean it’s a close match, not a perfect match. “Normally distributed” (or ND) is short for “using a normal distribution to model this data set, the calculations will come out close enough to reality.”

This is a lot like what you did in Chapter 3, when you computed the statistics of a grouped distribution. The statistics were only approximate, because of the simplification you introduced by grouping, but the approximation was good enough.

Now let’s get to some applications! There are two main categories: “forward” problems, where you have the boundaries and you have to find the area or probability, and “backward” problems, where you have a probability or area and you have to find the boundaries.

In case you’re interested, the pdf, the height of the density curve above a given x, is

       f(x) = e^(−(x−μ)²/(2σ²)) / (σ√(2π))

The cdf, the area to the left of a given x, is the integral of that, just the same as finding the area under any curve to the left of a given x:

       1/(σ√(2π)) · ∫ from −∞ to x of e^(−(t−μ)²/(2σ²)) dt

This integral doesn’t have a “closed form”, a finite sequence of basic algebraic operations, so it must be found by successive approximations. That’s what your calculator does with normalcdf and Excel does with NORM.DIST.
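
To see “area = probability” and approximation in action, here is a rough Python sketch that adds up thin trapezoids under the normal pdf and compares the total with the standard library’s NormalDist. The trapezoid method is just one simple way to approximate an integral, chosen for illustration; I’m not claiming it is the method your calculator or Excel uses. The function names and the number of slices are assumptions of this sketch.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal density curve above x."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def area_under_curve(a, b, mu=0.0, sigma=1.0, slices=10_000):
    """Approximate the area from a to b by adding up thin trapezoids."""
    width = (b - a) / slices
    heights = [normal_pdf(a + i * width, mu, sigma) for i in range(slices + 1)]
    return width * (sum(heights) - (heights[0] + heights[-1]) / 2)

standard = NormalDist(0, 1)
approx = area_under_curve(-1, 1)                 # within 1 SD of the mean
library = standard.cdf(1) - standard.cdf(-1)     # same area from the library
total = area_under_curve(-10, 10)                # nearly the whole curve

print(round(approx, 4), round(library, 4), round(total, 4))
```

The two areas should agree to several decimal places, and the area from 10 SD below the mean to 10 SD above should be 1 for all practical purposes.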

7B2.  From Boundaries, Find Probability

Summary: Make a sketch, estimate the probability (area), then compute it.

TI-83/84/89: Use normalcdf(left bound, right bound, mean, SD). I’ll walk you through the TI-83/84 keystrokes in the first example below. If you have a TI-89, press [CATALOG] [F3] [plain 6 makes N] [ENTER].

Excel: In Excel 2010 or later, use (deep breath here) =NORM.DIST(right bound, mean, SD, TRUE) − NORM.DIST(left bound, mean, SD, TRUE). In Excel 2007 or earlier, it’s NORMDIST rather than NORM.DIST.

Example 1: Heights of human children of a given age and sex are ND. One study found that three-year-old girls’ heights have a mean of 38.72″ and SD of 3.17″. What percentage of three-year-old girls are 35″ to 40″ tall?

Solution: Take the time to make a sketch. It doesn’t have to be beautiful, but you should make it as accurate as you reasonably can. It’s an important safeguard against making boneheaded mistakes. Here’s what should be on your sketch:

  1. Draw the axis line. sketched normal distribution for this example
  2. Label the axis, x or z as appropriate. x is the symbol for real-world data points, and z is the symbol for z-scores in the standard normal distribution, below.
  3. Draw a vertical line in the middle of the distribution and write the numerical value of the mean below the axis where that central line meets it. (If necessary, offset it with a tick mark, as I did.)
  4. Draw a horizontal line at about the right spot and show the numerical value of the standard deviation.
  5. Draw a line and show the value for each boundary.

    Important: When you marked the SD, you set the scale for the sketch. Now you have to honor that and place your boundaries in proportion. For instance, in this problem the mean is 38.72 and the left boundary is 35, which is 3.72 below the mean. Your left boundary therefore needs to be a bit more than one SD (3.17) left of the mean. The right bound is 40, which is 1.28 above the mean, so your line needs to be just over a third of a SD to the right of the mean.

    (Students often put in more numbers and lines, like the values of 1, 2, and 3 SD above and below the mean. That’s not wrong, but it’s usually not helpful, and it definitely clutters up the sketch.)

  6. Shade the area you’re trying to find.
  7. Look at your sketch and estimate the area before you pull out your calculator. That way, if you make a mistake that leads to a ridiculous answer, you’ll recognize it as ridiculous and fix it.

    From my sketch, I estimate an area of 50%–60%. If it’s 45% or 70% I won’t be terribly surprised, but if it’s 5% or 99% I’ll know something is wrong.

  8. Compute the area (below).

    If you wish, add that number to your sketch — not below the axis, please. Write it within the shaded area, if there’s room, or as a callout to the left or right of the diagram, the way I did here.

Computing the Area

On a TI-83 or TI-84, press [2nd VARS makes DISTR] [2] to select normalcdf. Enter the left boundary (35), right boundary (40), mean (38.72), and SD (3.17).

(If you have a TI-89 or you’re using Excel, see above.)

With the “wizard” interface:

’wizard’ interface for normalcdf in this problem

Press [ENTER] twice, and your screen will look like the one at right.

With the classic interface:

After entering the standard deviation, press [)] [ENTER] to get the answer.

traditional interface for normalcdf in this problem

You always need to show your work, so write down normalcdf(35,40,38.72,3.17) before you proceed to the answer. (There’s no need to write down the keystrokes you used.)

In this book, I round probabilities to four decimal places, or two decimal places if expressed as a percentage. The probability is

P(35 ≤ x ≤ 40) = 0.5365

That number matches my estimate of 50%–60%.

But the problem asked for a percentage. (Always, always, always look back at the problem and make sure you’re answering the question that was actually asked.) The answer: 53.65% of three-year-old girls are 35″ to 40″ tall.
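
If you have Python handy, the same computation can be sketched with the standard library’s statistics.NormalDist. The wrapper name normalcdf below simply mimics the calculator’s argument order; it is not a built-in Python function.

```python
from statistics import NormalDist

def normalcdf(left, right, mean, sd):
    """Area under the normal curve between left and right."""
    dist = NormalDist(mean, sd)
    return dist.cdf(right) - dist.cdf(left)

# Example 1: three-year-old girls 35 to 40 inches tall
prob = normalcdf(35, 40, 38.72, 3.17)
print(f"{prob:.4f}  ({prob:.2%})")
```

This is the same subtraction of two areas-to-left that the Excel formula above spells out, and the result should agree with the calculator’s 0.5365.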

Example 2: A three-year-old girl is randomly chosen. Would it be unusual (unexpected, surprising) if she’s over 45″ tall?

normal distribution, mean=38.72, SD=3.17, shaded 45 to the right edge In Chapter 5 you learned to call a low-probability event unusual (a/k/a surprising or unexpected). The standard definition of unusual events is a probability below 0.05, so really this problem is just asking you to find the probability and compare it to 0.05.

Solution: The sketch is at right, and obviously the probability should be small. The left boundary is 45, but what’s the right boundary? The normal distribution never quite ends, so the right boundary is ∞ (infinity). TI-89s have a key for ∞, but TI-83s and TI-84s don’t and Excel doesn’t, so use 10^99 instead. (That’s 10 to the 99th power; the [^] key on your TI calculator is between [CLEAR] and [÷].)

Show your work:

P(x > 45) = normalcdf(45,10^99,38.72,3.17) = 0.0238

That’s rounded from 0.0237914986, and it’s in line with my estimate of “small”. Now answer the question: There’s only a 2.38% chance that a randomly selected three-year-old girl will be over 45″ tall, so that would be unusual.

Example 3: For the same population, find and interpret P(x < 33).

normal distribution, mean=38.72, SD=3.17, shaded 33 to the left edge Solution: The sketch is at right, and again the expected probability is small. The right boundary is 33, but what’s the left boundary? You might want to use 0, since no one can be under 0″ tall, but you could make the same argument for 1″ or 5″, so that can’t be right.

To locate the left boundary, remember that you’re using a normal model to approximate the data, and the normal distribution runs right out to ±∞. Therefore, the left boundary is minus ∞ on a TI-89, or minus 10^99 on a TI-83/84. (Use the [(-)] key, not the [−] subtraction key.)

P(x < 33) = normalcdf(-10^99,33,38.72,3.17) = 0.0356

The proportion of three-year-old girls under 33″ tall is 0.0356 or 3.56%; or, 3.56% of three-year-old girls are under 33″ tall. The other interpretation: the chance that a randomly selected three-year-old girl is under 33″ tall is 0.0356 or 3.56%.


Example 4: What’s the percentile rank of a three-year-old girl who is 33″ tall?

Solution: Long ago, in a galaxy called Numbers about Numbers, you learned the definition of percentiles. The percentile rank of a data point is the percentage of the data set that is ≤ that data point. So you need P(x ≤ 33). But that’s exactly what you computed in the previous example: 3.56%. So the 33″-tall girl is between the third and fourth percentiles for her age group.

“That was P(x < 33), and for a percentile I need P(x ≤ 33)!” I hear you yell. But those two are equal. When we talked about density curves, near the beginning of this chapter, you learned that the area and probability are the same whether you include or exclude the boundary.

And this is why it doesn’t make much difference whether you define a percentile rank in terms of < or ≤, because the probability in a continuous distribution is the same either way.

7B3.  From Probability, Find Boundaries

Summary: Make a sketch, estimate the value(s), then compute the value(s).

TI-83/84/89: Use invNorm(area to left, mean, SD). I’ll walk you through the TI-83/84 keystrokes in the first example below. If you have a TI-89, press [CATALOG] [F3] [plain 9 makes I] [ 3 times] [ENTER].

Excel: In Excel 2010 or later, use =NORM.INV(area to left, mean, SD). In Excel 2007 or earlier, it’s NORMINV rather than NORM.INV.

Example 5: Blood pressure is stated as two numbers, systolic over diastolic. The World Health Organization’s MONICA Project (Kuulasmaa 1998 [see “Sources Used” at end of book]) reported these parameters for the US:

Systolic: μ = 120, σ = 15

Diastolic: μ = 75, σ = 11

Blood pressure in the population is normally distributed. The lowest 5% is considered “hypotensive”, according to Kuzma and Bohnenblust (2005, 103) [see “Sources Used” at end of book]. What systolic blood pressure would be considered hypotensive?

Solution: Always make a sketch for these problems. normal distribution, mu=120, sigma=15, left 5% shaded Your sketch is similar to the ones you made for the first group of problems, except that you use a symbol like x1 or “?” for the unknown boundary, and you write in the known area.

Always estimate your answer to guard against at least some errors. In the sketch, x1 looks like it’s not quite two SD left of the mean, so I’ll estimate a pressure of 95 to 100. (Okay, I cheated by using my calculator to make my “sketch”. But even with a real pencil-and-paper sketch, you ought to be in the right ballpark.)

Now you’re ready to calculate. TI-89 or Excel users, please see the instructions above. On your TI-83 or TI-84, press [2nd VARS makes DISTR] [3] to select invNorm. Enter the area to the left of the point you’re interested in (.05), the mean (120), and the SD (15).

With the “wizard” interface:

’wizard’ interface for invNorm in this problem

Press [ENTER] twice, and your screen will look like the one at right.

With the classic interface:

After the standard deviation, press [)] [ENTER] to get the answer.

traditional interface for invNorm in this problem

Show your work! Write down invNorm(.05,120,15) before you proceed to the answer. (There’s no need to write down the keystrokes you used.)

Answer: Systolic blood pressure (first number) under 95 would be considered hypotensive.
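
A “backward” problem like this one can also be sketched in Python. The standard library’s NormalDist.inv_cdf takes an area to the left, just as invNorm does; the wrapper name inv_norm is only there to mirror the calculator’s argument order and is not a built-in function.

```python
from statistics import NormalDist

def inv_norm(area_to_left, mean, sd):
    """Boundary that has the given area (probability) to its left."""
    return NormalDist(mean, sd).inv_cdf(area_to_left)

# Example 5: lowest 5% of systolic blood pressure
x1 = inv_norm(0.05, 120, 15)
print(round(x1, 1))

# Sanity check: going "forward" from x1 should give back the 5% area.
print(round(NormalDist(120, 15).cdf(x1), 4))
```

The boundary comes out a little above 95, which is where the rounded answer of 95 comes from and is in line with the estimate of 95 to 100 from the sketch.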

Example 6: The same source considers the top 5% “hypertensive”. What is the minimum systolic blood pressure that is hypertensive?

normal curve with mean 120, SD 15, shaded 0.05 at right and unshaded area marked 1−0.05 Solution: My “sketch” is at right. It’s mostly straightforward — the x1 boundary is between the 5% tail and the rest of the distribution.

But what’s up with the 1−0.05? The problem asks you about the upper 5%, which is the area to the right of the unknown boundary. But invNorm on the calculator, and NORM.INV in Excel, need area to left of the desired boundary. The area to the left is the probability of “not hypertensive”, and area is probability, so the area to left is 1 minus the area to right, in this case 1−0.05.

Could you just write down 0.95? Sure, that would be correct. But if the area to right was 0.1627 you’d probably make the calculator compute 1 minus that for you, so why not be consistent?

x1 = invNorm(1−.05,120,15) = 144.6728044 → 145

(That’s actually a little liberal. Several sources that I’ve seen give 140 as the threshold.)

normal curve with mu=120, sigma=15, middle 80% shaded, tails marked as 0.1 each Example 7: Kuzma and Bohnenblust describe the middle 80% as “normal”. What is that range of systolic blood pressure?

This problem wants you to find two boundaries, lower and upper. You have to convert the 80% middle into two areas to left. Here’s how. If the middle is 80%, then the two tails combined must be 100%−80% = 20%. But the curve is symmetric, so each tail must be 20%/2 = 10%. Strictly speaking, I probably should have written that computation on the diagram, instead of just a laconic “0.1”, but it would take up a lot of space and the computation was easy enough. You’ll probably do the same; just be careful.

Once you have the areas squared away, the computation is simple enough:

x1 = invNorm(.1,120,15) = 100.7767265 → 101

x2 = invNorm(1−.1,120,15) = 139.2232735 → 139

Check: The boundaries of the middle 80% (or the middle any percent) should be equal distances from the mean. (100.7767265+139.2232735)/2 = 120, so at least it’s consistent. Answer: Systolic b.p. of 101 to 139 is considered normal.
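
The tail arithmetic is easy to get wrong, so here is a small sketch that does it for any middle percentage. It assumes Python's statistics.NormalDist (Python 3.8+); the function name middle_bounds is made up for this illustration.

```python
from statistics import NormalDist

def middle_bounds(middle, mu, sigma):
    """Return (lower, upper) boundaries of the central `middle` proportion."""
    tail = (1 - middle) / 2          # the two tails split what is left over
    dist = NormalDist(mu, sigma)
    # inv_cdf needs area to left: `tail` for the lower bound,
    # 1 - tail for the upper bound
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)

x1, x2 = middle_bounds(0.80, 120, 15)
print(round(x1), round(x2))

# Check: the boundaries should be equally far from the mean
print(round((x1 + x2) / 2, 6))
```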

Percentiles Again

Example 8: What’s the 40th percentile for systolic blood pressure?

(Figure: normal curve with mu=120, sigma=15, left 40% shaded)

Sometimes the gods smile on us. The kth percentile is the value that is ≥ k% of the population, so k% is exactly the area to left that you need.

P40 = invNorm(.4,120,15) = 116.1997935 → 116

7C.  The Standard Normal Distribution

Definition: The standard normal distribution is a normal distribution with a mean of 0 and standard deviation of 1, sometimes written N(0,1).

The standard normal distribution is a picture of z-scores of any possible real-world ND — more about that later.

The standard normal distribution lets you make computations that apply to all normal models, not just a particular model. You’ll see some examples shortly, but first —

7C1.  “Normal” and “Standard Normal”

The main point about the standard normal distribution is that it’s a stand-in for every ND from real life. How does this work? Well, if you take any real data set and subtract the mean from every data point, the mean of the new data set is 0. And if you then divide that data set by the standard deviation (which doesn’t change when you subtract a constant from every data point), then the SD of the new-new data set is 1.

But all you did with those manipulations was replace the numbers with z-scores. Remember the formula: z = (x − μ)/σ. The standard normal distribution is what you get when you convert any normal model to z-scores.
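
You can watch those two manipulations work on a small data set. The sketch below uses made-up numbers (they are not from any example in this book) and the sample mean and SD from Python's statistics module.

```python
from statistics import mean, stdev

# Hypothetical data values, for illustration only
data = [2500, 3250, 4000, 3500, 2900]

m, s = mean(data), stdev(data)

# Subtract the mean, then divide by the SD: that is the z-score formula
z_scores = [(x - m) / s for x in data]

# Apart from tiny rounding error, the new mean is 0 and the new SD is 1
print(round(mean(z_scores), 10), round(stdev(z_scores), 10))
```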

Long ago, when dinosaurs ruled the earth — okay, up through the early 1980s — a “computer” was a person who used a slide rule to make computations. (I swear I am not making this up.) There were no statistical calculators and no Excel. The only way for most people to make computations on a normal model was to look up probabilities in printed tables. But obviously a book couldn’t print tables for every normal model. So the printed tables were for the standard normal distribution. If you had boundaries and wanted the probability of the interval, you converted your real-world numbers to z-scores, looked up the probabilities in the table, and subtracted them. If you had a probability and needed a boundary, you looked up the z-score in the table and then converted it to a raw score using the mean and SD of your data set.

The need to do normal computations the hard way has gone the way of the dinosaurs, but I think this history is why many stats books still use tables to do their computations. Inertia is a powerful force in textbooks!

The pdf and cdf functions for the standard normal distribution are what you get when you set μ=0 and σ=1 in the general equations for the ND:

pdf: f(z) = e^(−z²/2) / √(2π)

cdf: P(Z ≤ z) = (1/√(2π)) × ∫ from −∞ to z of e^(−t²/2) dt

Again, the integral must be found by successive approximations. That’s where the tables in books come from, and it’s what your calculator does with normalcdf and Excel does with NORM.DIST.
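
To give you a feel for “successive approximations”, here is a sketch that approximates the area to left with Simpson’s rule and compares it with Python’s built-in error function. Simpson’s rule is my choice for illustration; I’m not claiming it’s the method inside your calculator or Excel. The function names are made up.

```python
import math

def pdf(z):
    """Standard normal density: exp(-z^2/2) / sqrt(2*pi)."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def cdf_simpson(z, steps=1000):
    """Approximate the area to left of z with Simpson's rule.

    Integrates the density from 0 to z and adds the 0.5 that lies
    left of the mean, which avoids the infinite lower limit.
    """
    h = z / steps                      # steps must be even
    total = pdf(0) + pdf(z)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + total * h / 3

def cdf_erf(z):
    """Same area written with the error function, for comparison."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

for z in (-1.0, 0.5, 1.96):
    print(z, round(cdf_simpson(z), 6), round(cdf_erf(z), 6))
```

The two columns should agree to many decimal places, which is the whole point: the integral has no neat formula, but a simple numerical recipe pins it down as closely as you like.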

7C2.  Applying the Standard Normal Distribution

I said above that the standard normal distribution lets you make statements about all normal models. What sort of statements? Well, the Empirical Rule for one.

Example 9: The Empirical Rule says that 68% of the population in a normal model lies within one SD of the mean. How good is the rule? In other words, what’s the actual proportion?

(Figure: standard normal curve shaded between z = ±1)

Solution: As usual, you start with a sketch. This is the standard ND, so the axis is z, not x. There’s no need to mark the mean or SD, because the z label identifies this as a standard normal distribution and therefore μ = 0 and σ = 1. Just label the boundaries.

Compute the probability the same way you’ve already learned. (Both Excel and the TIs have special procedures available for the standard normal distribution, but it’s not worth taking brain cells to learn them, when the regular procedures for the ND work just fine with N(0,1).)

P(−1 ≤ z ≤ 1) = normalcdf(−1,1,0,1) = .6826894809 → 68.27%

The Empirical Rule says 68% of the data are within z = ±1. Actually it’s about 68¼%, close enough.
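
The same check works for the rest of the Empirical Rule. This sketch assumes Python’s statistics.NormalDist and uses its cdf method (area to left) in place of normalcdf; the helper name within is mine.

```python
from statistics import NormalDist

std_normal = NormalDist(0, 1)

def within(k):
    """Proportion of a normal model within k SD of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * within(k), 2))
```

The rule’s usual 68%, 95%, and 99.7% are all rounded versions of these proportions.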

Example 10: How many standard deviations must you go above and below the mean to take in the middle 50% of the data in a normal model?

(Figure: standard normal curve with middle 50% shaded)

Solution: This is similar to finding the middle 80% of blood pressures earlier, except now you’re making a statement about all normal models, not just a particular one.

Shading the middle 50% leaves 100%−50% = 50% in the two tails combined, so each tail is 50%/2 = 25%.

z1 = invNorm(.25,0,1) = −.6744897495 → −0.67

By symmetry, z2 must be numerically equal to z1 but have the opposite sign: z2 = 0.67.

50% of the data in any normal model are within about 2/3 of a SD of the mean. Since the bounds of the middle 50% of the data are Q1 and Q3, the IQR of any normal distribution is twice that, about one and a third standard deviations. More precisely, the IQR is 2×0.674 ≈ 1.35 times the SD.
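
Here is a quick sketch to convince yourself that this really is a statement about every normal model. It assumes Python’s statistics.NormalDist; the function name iqr_in_sds is made up for the illustration.

```python
from statistics import NormalDist

def iqr_in_sds(mu, sigma):
    """Interquartile range of a normal model, measured in standard deviations."""
    dist = NormalDist(mu, sigma)
    return (dist.inv_cdf(0.75) - dist.inv_cdf(0.25)) / sigma

print(round(iqr_in_sds(0, 1), 3))      # standard normal
print(round(iqr_in_sds(120, 15), 3))   # blood pressure model from Example 7
```

Both lines print the same ratio, because changing μ and σ only shifts and stretches the curve.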

7C3.  The z Function (Critical z)

There’s one special notation you’ll use when you compute confidence intervals in Chapter 9.

Definition: zarea (z with the area written as a subscript) or z(area), also known as critical z, is the z-score that divides the standard normal distribution such that the right-hand tail has the indicated area.

This may seem a little weird, but really it’s just a recipe to specify a number. Compare with the square root of 48. That is the positive number such that, if you multiply it by itself, you get 48. Or consider π: the number that you get when you divide the circumference of a perfect circle by its diameter. Math is full of numbers that are specified as recipes. An example will make things clearer.

Example 11: Find z0.025.

(Figure: standard normal curve with right-hand 0.025 shaded)

Solution: The problem is diagrammed at right. Caution! 0.025 is an area, not a z-score, so you don’t write 0.025 on the number line (the z axis). z0.025 is a z-score (though you don’t know its value yet), so it goes on the number line.

Once you have your sketch, the computation is straightforward. Have area (probability), compute boundary. The area is 0.025, but it’s an area to right, and invNorm needs an area to left, so you subtract from 1 as usual:

z0.025 = invNorm(1−.025, 0, 1) = 1.959963986 → 1.96

Caution! You’re computing a boundary for the right-hand tail. If you get a negative number, that can’t possibly be right.

z0.025 = 1.96 makes sense, if you think about it. If you also shaded in the left-hand tail with an area of 0.025, the two tails together would total 5%, leaving 95% in the middle. The Empirical Rule says that 95% of data are within 2 SD above and below the mean, and 1.96 is approximately 2.
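
Since critical z is just a recipe, it fits in a one-line function. This sketch assumes Python’s statistics.NormalDist; the name critical_z is mine, not a standard function.

```python
from statistics import NormalDist

def critical_z(area_right):
    """z-score that leaves `area_right` in the right-hand tail of N(0,1)."""
    # inv_cdf wants area to left, so subtract from 1 as usual
    return NormalDist(0, 1).inv_cdf(1 - area_right)

print(round(critical_z(0.025), 2))
```

For any right-hand tail area under 0.5 the result is positive, which matches the caution above.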

7D.  Checking for Normality

How do you know whether a normal model is appropriate? How do you know whether your data are normally distributed? A histogram can rule out skewed data, or data with more than one peak.

But what if your data are unimodal and not obviously skewed? Is that enough to justify a normal model? No, it’s not. You need to perform a test called a normal probability plot. You’ll need this procedure in Chapters 8 through 11, whenever you have a small sample of numeric data.

Summary: To check whether a normal model can represent your sample, make a normal probability plot. This plots the actual data points against the z-scores you would expect for a normally distributed sample of the same size. If the plot is close to a straight line, a normal model is appropriate; if the plot is far from a straight line, a normal model is not appropriate.

That’s the bare outline, and you’ll get a little bit more with the examples. For those who want the full theory, it’s marked optional at the end of this section.


Testing for normality can be automated partly or completely, depending on what technology you have:

7D1.  Checking Data Sets

Example 12: Consider these vehicle weights (in pounds):

2500, 3250, 4000, 3500, 2900, 4500, 3800, 3000, 5000, 2200

Do they fit a normal model?

Solution: Put the data in any statistics list, then press [PRGM], scroll down to MATH200A, and press [ENTER] twice. Select Normality chk.

(Screen: TI-83/84 normal probability plot showing points nearly on a straight line)

The program makes the plot, and you can look at the points to determine whether they seem to be pretty much on a straight line. At least, that’s the theory. In practice, most data sets are a lot less clear cut than this one. It can be hard to tell whether the points fit a line, particularly if you have only a few of them. The plot takes up the whole screen, so deviations can look bigger than they really are.

Fortunately, there’s a test for whether points lie on a straight line. As you know from Chapter 4, the closer the correlation coefficient r is to 1, the closer the points are to a straight line.

The program computes r for you, and it also computes a critical value★ to help you determine if the points are close enough to a straight line. (For technical reasons, the critical value is different from the decision points of Chapter 4.) If r≥crit, it’s close enough to 1, the points are close enough to a straight line, and you can use a normal model. If r<crit, it’s too far from 1, the points are too far from a straight line, and you can’t use a normal model.

For this data set, r > crit, and therefore these vehicle weights fit the normal model.

★The “classic TI-83” (non-“Plus” model) doesn’t compute the critical value, so you have to do it yourself. See the formula in item 4 in the next section.

Example 13: Here’s a random sample of the lengths (in seconds) of tunes in my iTunes library:

 120  219  242  134  129     105  275   76  412  268
 486  199  651  291  126     210  151   98  100   92
 305  231  734  468  410     313  644  117  451  375

Do they fit a normal model?

(Screen: normal probability plot for 30 data points, showing r=.9473 and crit=.9639)

Solution: I entered them in a statistics list and then ran MATH200A Program part 4. The result was the plot at the right.

You can see that the plot is curved. This is reinforced by comparing r=0.9473 to crit=0.9639. r < crit. The points diverge too far from a straight line, and therefore I cannot use a normal model for the lengths of my iTunes songs.

7D2.  Optional:  How Normal Probability Plots Work

The basic idea isn’t too bad. You make an xy scatterplot where the x’s are the data points, sorted in ascending order, and the y’s are the expected z scores for a normal distribution.

Why would you expect that to be a straight line? Recall the formula for a z score: z = (x−x̄)/s. Breaking the one fraction into two, you have z = x/s − x̄/s. That’s just a linear equation, with slope 1/s and intercept −x̄/s. So an xz plot of any theoretical ND, plotting each data point’s z score against the actual data value, would be a straight line.

Further, if your actual data points are ND, then their actual z scores will match their expected-for-a-normal-distribution z scores, and therefore a scatterplot of expected z scores against actual data values will also be a straight line.

Now, in real life no data set is ever exactly a ND, so you won’t ever see a perfectly straight line. Instead, you say that the closer the points are to a straight line, the closer the data set is to normal. If the data points are too far from a straight line — if their correlation coefficient r is lower than some critical value — then you reject the idea that the data set is ND.

Okay, so you have to plot the data points against what their z-scores should be if this is a ND, and specifically for a sample of n points from a ND, where n is your sample size. This must be built up in a sequence of steps:

  1. Divide the normal curve (mentally) into n regions of equal probability and take one probability from each region. For technical reasons, the probability number you use for region i is (i−.375)/(n+.25). This formula is in many textbooks, and also in Normal Probability Plots and Tests for Normality (Ryan and Joiner 1976 [see “Sources Used” at end of book]).
  2. Compute the expected z scores for those probabilities. Working with the calculator, that’s just invNorm of (i−.375)/(n+.25).
  3. Plot those expected z scores against the data values. This xy plot (or xz plot) has a correlation coefficient r, computed just like any other correlation coefficient.
  4. Compare the r for your data set to the critical value for the size of your data set. Ryan and Joiner determined that the critical value for sample size n, at the 0.05 significance level, is 1.0063−.1288/√n−.6118/n+1.3505/n². To make it a little easier on the calculator I rearranged it as 1.0063−.6118/n+1.3505/n²−.1288/√n.
    In the same paper, they gave formulas for critical values at other significance levels:

    1.0071−0.1371/√n−0.3682/n+0.7780/n² at α=0.10

    0.9963−0.0211/√n−1.4106/n+3.1791/n² at α=0.01

The closer the points are to a straight line, the closer the data set is to fitting a normal model. In other words, a larger r indicates a ND, and a smaller r indicates a non-ND. You can draw one of two conclusions:

So the bottom line is, if r > CRIT, treat the data as normal, and if r < CRIT, don’t.
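
The four steps above can be sketched in Python as follows. This is my own illustration of the procedure as described here, not the source code of MATH200A. It assumes statistics.NormalDist for invNorm and uses the 0.05-level critical value formula from step 4; the function name normality_check is made up.

```python
from math import sqrt
from statistics import NormalDist, mean

def normality_check(data):
    """Return (r, crit) for a normal probability plot of `data`.

    r is the correlation between the sorted data and the expected
    z-scores; crit is the Ryan-Joiner critical value at the 0.05 level.
    """
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist(0, 1)
    # Steps 1 and 2: probability for region i, then its expected z-score
    zs = [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    # Step 3: ordinary correlation coefficient of the (x, z) pairs
    mx, mz = mean(xs), mean(zs)
    sxz = sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
    sxx = sum((x - mx) ** 2 for x in xs)
    szz = sum((z - mz) ** 2 for z in zs)
    r = sxz / sqrt(sxx * szz)
    # Step 4: critical value for sample size n
    crit = 1.0063 - 0.1288 / sqrt(n) - 0.6118 / n + 1.3505 / n ** 2
    return r, crit

# Vehicle weights from Example 12
weights = [2500, 3250, 4000, 3500, 2900, 4500, 3800, 3000, 5000, 2200]
r, crit = normality_check(weights)
print(round(r, 4), round(crit, 4),
      "normal model OK" if r >= crit else "normal model not OK")
```

Feeding in the 30 tune lengths from Example 13 instead should reproduce the r and crit shown on that example’s screen, with r below crit.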

The normal probability plot is just one of many possible ways to determine whether a data set fits the normal model. Another method, the D’Agostino-Pearson test, uses numerical measures of the shape of a data set called skewness and kurtosis to test for normality. For details, see Assessing Normality in Measures of Shape: Skewness and Kurtosis.

What Have You Learned?

Key ideas:

(The online book has live links to all of these.)

Study aids:


Exercises for Chapter 7

Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.

Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

You’ll need this information for several of the problems:

Source: “Is Human Height Bimodal?” (Schilling 2002 [see “Sources Used” at end of book]).

1 Suppose that variable X is Chantal’s commute time between home and school, in minutes. Give two interpretations of the statement P(x < 17) = 0.0900.
2 A male co-worker is “six foot four and a half” — 76.5″ tall. How unusual is that? (Give two interpretations of your number.)
3 What proportion of women are 64″ to 67″ tall?
4 What heights for men would be considered unusual (less than 5% likely)? Hint: Your answer will be in the form “under ____ inches or over ____ inches”.
5 To enter the Pennsyltuckey Police Academy, you have to be at or above the 15th percentile in height. How tall is that, for a man?
6 (a) Find the 25th and 75th %iles for women’s heights.
(b) Find the interquartile range.
(c) Example 10 found that, in a normal distribution, the interquartile range equals 1.35 standard deviations. Does your computed IQR match that prediction?
7 Determine whether this sample of diastolic blood pressures fits the normal model:

78  66  98  90  74  70  70  76  72  86  62  84  66  70  68

8 Scores on the math SAT are ND with a mean of 500 and standard deviation of 100. What percentile is represented by a score of 735?
9 To join Mensa, you must be in the top 2% of the population on a recognized intelligence test. Mensa accepts the SAT as a qualifying test for membership. The mean on the combined three parts is 1500 and the SD is 300. What’s the minimum combined score to qualify you for Mensa?
10 Find z0.01.
11 For men’s heights, find P(x < 60″) and write two interpretations.
12 Test scores are supposed to be ND, but this is questionable on small tests. Here are scores from a recent quiz; do they fit the normal model?

0.3  8.8  11.5  12  12.3  12.5  13  13.5  14.8 

13 A small shop decided to stock formal wear for men and women in the middle 90% of height. How tall must men and women be to shop there?


8. How Samples Vary

Updated 25 June 2015


Inferential statistics says, “I’ve got this sample. What does it tell me about the population it came from?” Eventually, you’ll estimate a population mean or proportion from a sample and use a sample to test a claim about a population. In essence, you’re reasoning backward from known sample to unknown population. But how? This chapter lays the groundwork.

First you have to reason forward. Way back in Chapter 1, you learned that samples vary because no one sample perfectly represents its population. In this chapter, you’ll put some numbers on that variation. You’ll learn about sampling distributions, and you’ll calculate the likelihood of getting a particular sample from a known population. That will be the basis for all your inferential statistics, starting in Chapter 9.


Acknowledgements: The approach I take to this material was suggested by What Is a p-Value Anyway? (Vickers 2010, ch 10 [see “Sources Used” at end of book]), though of course any problems with this chapter are my responsibility and not Vickers’.

The software used to prepare most of the graphs and all of the simulations for this chapter is @RISK from Palisade Corporation.

8A.  Numeric Data / Means of Samples

8A1.  One Sample and Its Mean

(Table: Lengths of 30 Tunes, with columns mm:ss and seconds)

Having time on my hands, I was curious about the lengths of tunes in the Apple Store. Being lazy, I decided to look instead at the lengths of tunes in my iTunes library. There are 10113 of them, and I’m going to assume that they are representative. (That’s my story, and I’m sticking to it.)

I set Shuffle to Songs and then took the first 30, which gave me the times you see at right for a random sample of size 30.

Here is a histogram of the data. The tune times are moderately skewed right. That makes sense: most tunes run around two to five minutes, but a few are longer.

histogram showing median=230.8, mean=280.9

The mean of this sample is 280.9 seconds, and the standard deviation is 181.7 seconds. But you know that there’s always sampling error. No sample can represent the population perfectly, so if you take another sample from the same population you’d expect to see a different mean, but not very different. This chapter is all about what differences you should expect.

First, ask yourself: Why should you expect the mean of a second sample to be “different, but not very different” from the mean of the first sample? The samples are independent, so why should they relate to each other at all?

Answer: because they come from the same population. In a given sample, you would naturally expect some data points below the population mean μ, and others above μ. You’d expect that the points below μ and the points above μ would more or less cancel each other out, so that the mean of a sample should be in the neighborhood of μ, the mean of the population.

And if you think a little further about it, you’ll probably imagine that this canceling effect works better for larger samples. If you have a sample of four data points, you wouldn’t be much surprised if they’re all above μ or all below μ. If you have a sample of 100 data points, having them all on one side of μ would surprise you as much as flipping a coin 100 times and getting 100 heads. So you expect that the means of large samples tend to stick closer to μ than the means of small samples do. That’s absolutely true, as you’ll find out in this chapter.

To get a handle on “different, but not very different”, take a look at a second sample of 30 from the same population. This one has x̄ = 349.1, s = 204.2 seconds. From its histogram, you can see it’s a bit more strongly skewed than the first sample.

histogram showing median=267.1, mean=349.1

The two sample means differ by 349.14−280.93 ≈ 68.2 seconds. That might seem like a lot, but it’s only about a quarter of the first sample mean and under a fifth of the second sample mean. Also, it’s a lot less than the standard deviations of the two samples, meaning that the difference between samples is much less than the variability within samples.

There’s an element of hand waving in that paragraph. Sure, it seems plausible that the two sample means are “different, but not very different”; but you could just as well construct an argument in words that the two means are different. Without numbers to go on, how much of a difference is reasonable? In statistics, we like to use numbers to decide whether a thing is reasonable or not. How can we make a numerical argument about the difference between samples? Well, put on your thinking cap, because I’m about to blow your mind.

8A2.  Meet the Sampling Distribution of x̄

The key to sample variability is the sampling distribution.

Definition: Imagine you take a whole lot of samples, each sample with n data points, and you compute the sample mean x̄ of each of them. All those x̄’s form a new data set, which can be called the distribution of sample means, or the sampling distribution of the mean, or the sampling distribution of x̄, for sample size n.

Notice that n is the size of each sample, not the number of samples. There’s no symbol for the number of samples, because it’s indefinitely large.

The sampling distribution is a new level of abstraction. It exists only in our minds: nobody ever takes a whole lot of samples of the same size from a given population. You can think of the sampling distribution as a “what if?” — if you took a whole lot of samples of a given size from the same population, and computed the means of all those samples, and then took those means as a new set of data for a histogram, what would that distribution look like?

Why ask such an abstract question? Simply this: if you know how samples from a known population are distributed, you can work backward from a single sample to make some estimates about an unknown population. In this chapter, I work from a population of tunes with known mean and standard deviation, and I ask what distribution of sample means I can expect to get. In the following chapters, I’ll turn that around: looking at one sample, we’ll ask what that says about the mean and standard deviation of the population that the sample came from.

What does a sampling distribution look like? Well, I used a computer simulation with @RISK from Palisade Corporation to take a thousand samples of 30 tunes each — the same n as before — and this is what I got:

histogram showing sampling distribution

“Big whoop!” I hear you say. I agree, it’s not too impressive at first glance. But let’s compare this distribution of sample means to the population those samples come from.

(In real life, you wouldn’t know what the population looks like. But in this chapter I work from a known population to explore what the distribution of its samples looks like. Starting in the next chapter, I’ll turn that around and use one sample to explore what the population probably looks like.)

Look at the two histograms below. The left-hand plot shows the individual lengths of all the tunes in the population — it’s a histogram of the original population. The right-hand plot shows the means of a whole lot of samples, 30 tunes per sample — it’s a histogram of the sampling distribution of the mean. That right-hand plot is the same as the plot I showed you a couple of paragraphs above, just rescaled to match the left-hand plot for easier comparison.

histogram of population histogram showing sampling distribution

Now, what can you see?

At this point, you’re probably wondering if similar things are true for other numeric populations. The answer is a resounding YES.

8A3.  Properties of the Sampling Distribution of x̄

When you describe a distribution of continuous data, you give the center, spread, and shape. Let’s look at those in some detail, because this will be key to everything you do in inferential statistics.

There’s an App for That

Before I get into the properties of the sampling distribution, I’d like to tell you about two Web apps that let you play with sampling distributions in real time. (I’m grateful to Benjamin Kirk for suggesting these.)

If you possibly can, try out these apps, especially the second one. Sampling distributions are new and strange to you, and playing with them in real time will really help you to understand the text that follows.

Center of the Sampling Distribution of x̄

The mean of the sampling distribution of x̄ equals the mean of the population: μ_x̄ = μ.

This is true regardless of the shape of the original population and regardless of sample size.

Why is this true? Well, you already know that when you take a sample, usually you have some data points that are higher than the population mean and some that are lower. Usually the highs and lows come pretty close to canceling each other out, so the mean of each sample is close to μ — closer than the individual data points, that is.

When you take a distribution of sample means, the same thing happens at the second level. Some of the sample means are above μ and some are below. The highs and lows tend to cancel, so the average of the averages is pretty darn close to the population mean.

Spread of the Sampling Distribution of x̄

The standard deviation of the sampling distribution of x̄ has a special name: standard error of the mean or SEM; its symbol is σ_x̄. The standard error of the mean for sample size n equals the standard deviation of the population divided by the square root of n: SEM or σ_x̄ = σ/√n.

This is true regardless of the shape of the original population and regardless of sample size.

Why is this true? Each member of the sample is a random variable, all drawn from the same population with a SD of σ and therefore a variance of σ². If you combine random variables — independent random variables — their variances add.

Okay, the sample is n random values drawn from a population with a variance of σ². The total of those n values in the sample is a random variable with a variance of σ²n, and therefore the standard deviation of the total is √(σ²n) = σ√n. Now divide the sample total by n to get the sample mean x̄. Dividing a random variable by n divides its standard deviation by n, so x̄ is a random variable with a standard deviation of (σ√n)/n = σ/√n. QED, which is Latin for “told ya so!”
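
If you’d rather see it than prove it, a small simulation can check both the center and the spread. This is a sketch under loud assumptions: the population here is a made-up right-skewed one (exponential, with mean and SD both 280 seconds), not my actual tune library, and 2000 samples stand in for “a whole lot”. It uses only Python’s standard library.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(2016)            # fixed seed so runs are repeatable

MU = SIGMA = 280.0                   # an exponential population has mean = SD
N = 30                               # size of each sample
NUM_SAMPLES = 2000                   # stand-in for "a whole lot of samples"

def sample_mean(n):
    """Mean of one random sample of size n from the stand-in population."""
    return mean(rng.expovariate(1 / MU) for _ in range(n))

sample_means = [sample_mean(N) for _ in range(NUM_SAMPLES)]

print("mean of sample means:", round(mean(sample_means), 1), "vs mu =", MU)
print("SD of sample means:  ", round(stdev(sample_means), 1),
      "vs sigma/sqrt(n) =", round(SIGMA / sqrt(N), 1))
```

The simulated numbers won’t match the theory exactly, because 2000 samples is not an indefinitely large number of samples, but they should land close.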

Shape of the Sampling Distribution of x̄

Summary: If the original population is normally distributed (ND), the sampling distribution of the mean is ND. If the original population is not ND, still the sampling distribution is nearly ND if sample size is ≥ 30 or so but not more than about 10% of population size.

You can probably see that if you take a bunch of samples from a ND population and compute their means, the sample means will be ND also. But why should the means of samples from a skewed population be ND as well?

The answer should be called the Fundamental Theorem of Statistics, but instead it’s called the Central Limit Theorem. (The name was given by Richard Martin Edler von Mises in a 1919 article, but the theorem itself is due to the Marquis de Laplace, in his Théorie analytique des probabilités [1812].) The CLT is the only theorem in this whole course. There is a mathematical way to state and prove it, but we’ll go for just a conceptual understanding.


The sampling distribution of the mean approaches the normal distribution, and does so more closely at larger sample sizes.

An equivalent form of the theorem says that if you take a selection of independent random variables, and add up their values, the more independent variables there are, the closer their sum will be to a ND.

The second form of the theorem explains why so many real-life distributions are bell curves: Most things don’t have a single cause, but many independent causes.

Example: Lots of independent variables affect when you leave the house and your travel time every day. That means that any person’s commute times are ND, and so are people’s arrival times at an event. The same sorts of variables affect when buses arrive, so wait times are ND. Most things in nature have their growth rate affected by a lot of independent variables, so most things in nature are ND.

But it’s the first form of the theorem that we’ll use in this chapter. If samples are randomly chosen, or chosen by another valid sampling technique, then they will be independent and the Central Limit Theorem will apply.

The further the population is from a ND, the bigger the sample you need to take advantage of the CLT. Be careful! It’s size of each sample that matters, not number of samples. The number of samples is always large but unspecified, since the sampling distribution is just a construct in our heads. As a rule of thumb, n=30 is enough for most populations in real life. And if the population is close to normal (symmetric, with most data near the middle), you can get away with smaller samples.

On the other hand, the sample can’t be too large. For samples drawn without replacement (which is most samples), the sample shouldn’t be more than about 10% of the population. In symbols, n ≤ 0.1N. Suppose you don’t know the population size, N? Multiply left and right by 10 and rewrite the requirement as 10n ≤ N. You always know the sample size, and if you can make a case that the population is at least ten times that size then you’re good to go.
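
As a worked example of these two rules of thumb, here is a tiny checker. The function and parameter names are hypothetical, and the thresholds (30 and 10%) are just the rules of thumb stated above, not hard laws.

```python
def clt_requirements_met(n, population_size, population_is_normal=False):
    """Rule-of-thumb check that the sampling distribution of the mean is ND.

    population_size can be the true N or the smallest size you can
    make a case for.
    """
    big_enough = population_is_normal or n >= 30
    small_enough = 10 * n <= population_size     # same as n <= 0.1N
    return big_enough and small_enough

# 30 tunes out of a library of 10113: 10 * 30 = 300 <= 10113
print(clt_requirements_met(30, 10113))
```

Remember that a passing check here doesn’t cover independence; you still have to think about how the sample was taken.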

You’ll remember that the population of tune times was highly skewed, but the sampling distribution for n=30 was pretty nearly bell shaped. To show how larger sample size moves the sampling distribution closer to normal, I ran some simulations of 1000 samples for some other sample sizes. Remember that the sampling distribution is an indefinitely large number of samples; you’re still seeing some lumpiness because I ran only 1000 samples in each simulation.

histogram showing sampling distribution for n=3 histogram showing sampling distribution for n=10 histogram showing sampling distribution for n=20 histogram showing sampling distribution for n=100

The means of 3-tune samples are still fairly well skewed, though the range is less than the population range. Increasing sample size to 10, the skew is already much less. 20-tune samples are pretty close to a bell curve except for the extreme right-hand tail. Finally, with a sample size of 100, we’re darn close to a bell curve. Yes, there’s still some lumpiness, but that’s because the histogram contains only 1000 sample means.
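
You can put a number on “how skewed” with a similar simulation. Again this is only a sketch: the population is a made-up exponential one with mean 280 seconds (not my tune library), and skewness is computed with the simple moment formula, where 0 means symmetric.

```python
import random
from statistics import mean

rng = random.Random(8)               # fixed seed so runs are repeatable

def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution."""
    m = mean(values)
    m2 = mean((v - m) ** 2 for v in values)
    m3 = mean((v - m) ** 3 for v in values)
    return m3 / m2 ** 1.5

def simulated_sample_means(n, num_samples=1000):
    """Means of num_samples samples of size n from the stand-in population."""
    return [mean(rng.expovariate(1 / 280) for _ in range(n))
            for _ in range(num_samples)]

skew_by_n = {n: skewness(simulated_sample_means(n)) for n in (3, 10, 20, 100)}
for n, sk in skew_by_n.items():
    print(f"n = {n:3d}: skewness of sample means = {sk:.2f}")
```

The skewness should shrink toward 0 as n grows, with some lumpiness from using only 1000 samples per sample size.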

Requirements, Assumptions, and Conditions

The requirements mentioned in this chapter will be your “ticket of admission” to everything you do in the rest of the course. If you don’t check the requirements, the calculator will happily calculate numbers for you, they’ll be completely bogus, and your conclusions will be wrong but you won’t know it. Always check the requirements for any type of inference before you perform the calculations.

I talk about “requirements”. By now you’ve probably noticed that I think very highly of DeVeaux, Velleman, and Bock’s Intro Stats (2009) [see “Sources Used” at end of book]. They test the same things in practice, but they talk about “assumptions” and “conditions”. Assumptions are things that must be true for inference to work, and conditions are ways that you test those assumptions in practice.

You might like their approach better. It’s the same content, just a different way of looking at it. And sampling distributions are so weird and abstract that the more ways you can look at them the better! Following DeVeaux pages 591–593, here’s another way to think about the requirements.

Independence Assumption: Always look at the overall situation and try to see if there’s any way that different members of the sample can affect each other. If they seem to be independent, you’ll then test these conditions:

These conditions must always be met, but they’re a supplement to the Independence Assumption, not a substitute for it. If you can see any way in which individuals are not independent, it’s game over regardless of the conditions.

Normal Population Assumption: For numeric data, the sampling distribution must be ND or you’re dead in the water. There are two conditions to check this:

The Normal Population Assumption and the Nearly Normal Condition or Large Sample Condition are for numeric data and only numeric data. We’ll have a separate set of requirements, assumptions, and conditions for binomial data later in this chapter.

See also: Is That an Assumption or a Condition? is a very nice summary by Bock [see “Sources Used” at end of book] of all assumptions and conditions. It puts all of our requirements for all procedures into context. (Just ignore the language directed at instructors.)

8A4.  Applications


Ultimately, you’ll use sampling distributions to estimate the population mean or proportion from one sample, or to test claims about a population. That’s the next four chapters, covering confidence intervals and hypothesis tests. But before that, you can still do some useful computations.

How to Work Problems

For all problems involving sampling distributions and probability of samples, follow these steps:

  1. Determine center, spread, and shape of the sampling distribution, even if you’re not explicitly asked to describe the distribution.
  2. If you can’t show that the sampling distribution is ND, stop!
  3. Sketch the curve, and estimate the answer. (See examples below.)
  4. Compute the probability (area) using normalcdf. Caution! Don’t use rounded numbers in this calculation.

Example 1: Bank Deposits

You are auditing a bank. The bank managers have told you that the average cash deposit is $200.00, with standard deviation $45.00. You plan to take a random sample of 50 cash deposits. (a) Describe the distribution of sample means for n = 50. (b) Assuming the given parameters are correct, how likely is a sample mean of $189.56 or below?

Solution (a): Recall that describing a distribution means giving its center, its spread, and its shape.

Solution (b): Please refer to How to Work Problems, above. You’ve already described the distribution, so the next step is to make the sketch. You may be tempted to skip this step, but it’s an important reality check on the numerical answer you get from your calculator.

[Sketch for this problem; see text for features]

The sketch for this problem is shown at right. Please observe these key points when sketching sampling distribution problems:

  1. Draw the axis line.
  2. Label the axis x̄ or p̂, as appropriate.
  3. Draw a vertical line in the middle of the distribution and show the numerical value of the mean. Caution! This is the mean of the sampling distribution, equal to the population mean, not the sample mean.
  4. Draw a horizontal line at about the right spot and show the numerical value of the SEM, not σ of the original population. (For Binomial Data, below, you’ll use the SEP instead of the SEM.)
  5. Draw a line and show the value for each boundary.
  6. Shade the area you’re trying to find, and estimate it. (From the sketch for this problem, I estimated a few percent, definitely under 10%.)
  7. (optional) After you find the area, show its value.

Next, compute the probability on your calculator.

Press [2nd VARS makes DISTR] [2] to select normalcdf. Fill in the arguments, either on the wizard interface or in the function itself. Either way, you need four arguments, in this order:

With the “wizard” interface:

The wizard prompts you for a standard deviation σ. Don’t enter the SD of the population. Do enter the SD of the sampling distribution, which is the standard error.

[Calculator screen: wizard interface showing inputs to normalcdf]

After entering the standard error, press [ENTER] twice and your screen will look like the one at right.

With the classic interface:

After entering the standard error, press [)] [ENTER]. You’ll have two closing parentheses, one for the square root and one for normalcdf.

[Calculator screen: normalcdf function showing result of .05054518842]

Always show your work. There’s no need to write down all your keystrokes, but do write down the function and its arguments:

normalcdf(−10^99, 189.56, 200, 45/√50)

Answer: P(x̄ ≤ 189.56) = 0.0505

Comment: Here you see the power of sampling. With a standard deviation of $45.00, an individual deposit of $189.56 or lower can be expected almost 41% of the time. But a sample mean under $189.56 with n=50 is much less likely, only a little over 5%.

This is one reason you should take the trouble to make your sketch reasonably close to scale. If you enter the standard deviation, 45, instead of the standard error, 45/√50 — a common mistake — you’ll get 0.4083. A glance at your sketch will tell you that can’t possibly be right, so you then know to find and fix your mistake.
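If you want to double-check the calculator away from the calculator, here’s a minimal Python sketch of the same computation. The helper name normalcdf is my own stand-in that mimics the TI function; it is built on NormalDist from Python’s standard statistics module, and −1E99 plays the role of minus infinity just as on the calculator. The sketch computes both the right answer (using the standard error) and the common mistake (using the population SD).

```python
from math import sqrt
from statistics import NormalDist

def normalcdf(lower, upper, mean, sd):
    """Area under the normal curve between lower and upper (mimics the TI function)."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)

mu, sigma, n = 200.00, 45.00, 50
sem = sigma / sqrt(n)  # standard error of the mean, unrounded

# Right: the spread of the sampling distribution is the standard error.
p_sample_mean = normalcdf(-1e99, 189.56, mu, sem)
# The common mistake: using the SD of individual deposits instead.
p_single_deposit = normalcdf(-1e99, 189.56, mu, sigma)

print(f"P(sample mean <= 189.56)    = {p_sample_mean:.4f}")
print(f"P(single deposit <= 189.56) = {p_single_deposit:.4f}")
```

The two results should match the 0.0505 and 0.4083 quoted above.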

Example 2: Women’s Heights

US women’s heights are normally distributed (ND), with mean 65.5″ and standard deviation 2.5″. You visit a small club on a Thursday evening, and 25 women are there. (Let’s assume they are a representative sample.) Your pickup line is that you’re a statistics student and you need to measure their heights for class. Amazingly, this works, and you get all 25 heights. How likely is it that the average height is between 65″ and 66″?

Solution: First, get the characteristics of the sampling distribution:

[Sketch for this problem; see text]

If the SEM is 0.5″, then 65″ and 66″ equal the mean ± one standard error. The Empirical Rule (68–95–99.7 Rule) tells you that about 68% of the data fall between those bounds. In this problem, the sketch is a really good guide to the answer you expect.

[Calculator screen: normalcdf of 65, 66, 65.5, and 2.5 over root 25; answer .6826894809]

This is the distribution of sample means, so you expect 68% of them to fall between those bounds. But do the computation anyway, because the Empirical Rule is approximate and now you’re able to be precise. Also, the SEM of 0.5″ is an exact number, but still I put the whole computation into the calculator just to be consistent.

The chance that the sample mean is between 65″ and 66″ is

P(65 ≤ x̄ ≤ 66) = 0.6827

[Calculator screen: normalcdf of 65, 66, 65.5, and 2.5; answer .1585193755]

Remember the difference between the distribution of sample means and the distribution of individual heights. From the computation at the right, you expect to see under 16% of women’s heights between 65″ and 66″, versus over 68% of sample mean heights (for n=25) between 65″ and 66″. That’s the whole point of this chapter: sample means stick much closer to the population mean.
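Here’s the same comparison as a short Python sketch, in case seeing the two computations side by side helps. The helper area_between is a made-up name; it uses NormalDist from Python’s standard library in place of the calculator’s normalcdf.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 65.5, 2.5, 25
sem = sigma / sqrt(n)  # standard error of the mean, 0.5 inch

def area_between(low, high, mean, sd):
    """Area under the normal curve from low to high."""
    dist = NormalDist(mean, sd)
    return dist.cdf(high) - dist.cdf(low)

# Sample means (n = 25) use the standard error as the spread.
p_sample_means = area_between(65, 66, mu, sem)
# Individual heights use the population SD.
p_individuals = area_between(65, 66, mu, sigma)

print(f"Sample means between 65 and 66:       {p_sample_means:.4f}")
print(f"Individual heights between 65 and 66: {p_individuals:.4f}")
```

The only difference between the two lines is the fourth argument, and that one change moves the answer from under 16% to over 68%.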

Example 3: Elevator Load Limit

Suppose hotel guests who take elevators weigh on average 150 pounds with standard deviation of 35 pounds. An engineer is designing a large elevator, to lift 50 people. If she designs it to lift 4 tons (8000 pounds), what is the chance a random group of 50 people will overload it?

Need a hint? This is a problem in sample total. You haven’t studied that kind of problem, but you have studied problems in sample means. In math, when you have an unfamiliar type of problem, it’s always good to ask: Can I change this into some type of problem I do know how to solve? In this case, how do you change a problem about the total number of pounds in a sample (∑x) into a problem about the average number of pounds per person (x̄)?

Please stop and think about that before you read further.

Solution: To convert a problem in sums into a problem in averages, divide by the sample size. If the total weight of a sample of 50 people is 8000 lb, then the average weight of the 50 people in the sample is 8000/50 = 160 lb. So the desired probability is P(x̄ > 160):

[Sketch for this problem; see text]

P(∑x > 8000 for n = 50) = P(x̄ > 160)

And you know how to find the second one.

What does the sampling distribution of the mean look like for μ = 150, σ = 35, n = 50? The mean is μ = 150 lb, and the standard error is 35/√50 ≈ 4.9 lb. That’s all you need to draw the sketch at right. Samples are random, 10×50 = 500 is less than the number of people (or potential hotel guests) in the world, and n = 50 ≥ 30, so the sampling distribution follows a normal model.

Now make your calculation. This time the left boundary is a definite number and the right boundary is pseudo infinity, 10^99. And again, you want the standard error, not the SD of the original population.

With the “wizard” interface:

[Calculator screen: normalcdf wizard interface with 160, 10^99, 150, 35 over root 50]

After entering the standard error, press [ENTER] twice, and your screen will look like the one at right.

With the classic interface:

After entering the standard error, press [)] [ENTER].

[Calculator screen: normalcdf of 160, 10^99, 150, 35 over root 50 equals 0.0217]

Show your work: normalcdf(160, 10^99, 150, 35/√50).

There’s a 0.0217 chance that any given load of 50 people will overload the elevator. That’s not 2% of all loads, but 2% of loads of 50 people. Still, it’s an unacceptable level of risk.
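As a cross-check on the “turn the total into a mean” trick, here’s a minimal Python sketch that works the problem both ways. Route 1 is the method in the text. Route 2 is an addition of mine, not something from this chapter: it models the total directly as normal with mean nμ and standard deviation σ√n, which is the standard result for a sum of n independent values. The variable names are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 150, 35, 50
limit_total = 8000  # pounds the elevator is designed to lift

# Route 1 (the text's method): convert the total to a mean per person.
limit_mean = limit_total / n  # 160 lb
sem = sigma / sqrt(n)
p_by_mean = 1 - NormalDist(mu, sem).cdf(limit_mean)

# Route 2 (added check): model the total weight of the 50 people directly.
p_by_total = 1 - NormalDist(n * mu, sigma * sqrt(n)).cdf(limit_total)

print(f"P(mean  > {limit_mean:.0f}) = {p_by_mean:.4f}")
print(f"P(total > {limit_total}) = {p_by_total:.4f}")
```

Both routes land on the same z-score, so they give the same probability.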

Is there an inconsistency here? Back in Chapter 5, I said that an unusual event was one that had a low probability of occurring, typically under 5%. Since 2% is less than 5%, doesn’t that mean that an overloaded elevator is an unusual event, and therefore it can be ignored?

Yes, it’s unusual. But no, fifty people plunging to a terrible death can’t be ignored. The issue is acceptable risk. Yes, there’s some risk any time you step in an elevator that it will be your last journey. But it’s a small risk, and it’s one you’re willing to accept. (The risk is much greater every time you get into a car.) Without knowing exact figures, you can be sure it’s much, much less than 2%; otherwise every big city would see many elevator deaths every day.

In Chapter 10, you’ll meet the significance level, which is essentially the risk of being wrong that you can live with. The worse the consequences of being wrong, the lower the acceptable risk. With an elevator, 5% is much too risky — you want crashes to be a lot more unusual than that.

8B.  Binomial Data / Proportions of Samples

Binomial data are yes/no or success/failure data. Each sample yields a count of successes. (A reminder: “success” isn’t necessarily good; it’s just the name for the condition or response that you’re counting, and the other one is called “failure”.)

Need a refresher on the binomial model? Please refer back to Chapter 6.

The summary statistic or parameter is a proportion, rather than a mean. In fact, the proportion of success (p) is all there is to know about a binomial population.

In Chapter 6 you computed probabilities of specific numbers of successes. Now you’ll look more at the proportions of success in all possible samples from a binomial population, using the normal distribution (ND) as an approximation.

Here’s a reminder of the symbols used with binomial data:

  • p: The proportion in the population. Example: If 83% of US households have at least one cell phone, then p = 0.83. Remember “proportion of all equals probability of one”, so p is also the probability that any randomly selected response from the population will be a success.
  • q: q = 1−p is therefore the proportion of failure or the chance that any given response will be a failure.
  • n: The sample size.
  • x: The number of successes in the sample. Example: if 45 households in your sample have at least one cell phone, then x = 45.
  • p̂: “p-hat”, the proportion in the sample, equal to x/n. Example: If you survey 60 households and 45 of them have at least one cell phone, then p̂ = 45/60 = 0.75 or 75%.

8B1.  Sampling Distribution of p̂

The sampling distribution of the proportion is the same idea as the sampling distribution of the mean, and there are a lot of parallels between the two. (A table at the end of this chapter summarizes them.)

Definition: Imagine you take a whole lot of samples from the same population. Each sample has n success/failure data points, and you compute the sample proportion p̂ of each of them. All those p̂’s form a new data set, which can be called the distribution of sample proportions, or the sampling distribution of the proportion, or the sampling distribution of p̂, for sample size n.

As before, n is the size of each sample, not the number of samples. There’s no symbol for the number of samples, because it’s indefinitely large.

One change from the sampling distribution of x̄ is that the sampling distribution of p̂ is a different data type from the population. The original data are non-numeric (yeses and noes), but the distribution of p̂ is numeric because the p̂’s are numbers. Each p̂ says “so many percent of this sample were successes.”

Center of the Sampling Distribution of p̂

The mean of the sampling distribution of p̂ equals the proportion of the population: μ_p̂ = p (“mu sub p-hat equals p”).

This is true regardless of the proportion in the original population and regardless of sample size.

Why is this true? The reasons are similar to the reasons in Center of the Sampling Distribution of x̄. The p̂ for a given sample may be higher or lower than p of the population, but if you take a whole lot of samples then the high and low p̂’s will tend to cancel each other out, more or less.

Spread of the Sampling Distribution of p̂

The standard deviation of the sampling distribution of p̂ has a special name: standard error of the proportion or SEP; its symbol is σ_p̂ (“sigma-sub-p-hat”). The standard error of the proportion for sample size n equals the square root of the population proportion, times 1 minus the population proportion, divided by the sample size: SEP or σ_p̂ = √[pq/n].

This is true regardless of the proportion in the original population and regardless of sample size.

Why is this true? For a binomial distribution with sample size n, the standard deviation is √[npq]. That is the SD of the random variable x, the number of successes in a sample of size n. The sample proportion, random variable p̂, is x divided by n, and therefore the SD of p̂ is the SD of random variable x, also divided by n. In symbols, σ_p̂ = √[npq] / n = √[npq/n²] = √[pq/n].
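If you’d rather see the center and spread formulas in action than take the algebra on faith, here’s a minimal Python simulation sketch. It borrows p = 0.83 and n = 60 from the cell-phone example in the symbol list; the number of simulated samples (10,000) and the random seed are my own assumptions. It builds a pile of sample proportions, then compares their mean to p and their standard deviation to √[pq/n].

```python
import random
from math import sqrt
from statistics import fmean, pstdev

p = 0.83               # population proportion (cell-phone example)
n = 60                 # size of each sample
num_samples = 10_000   # assumed number of simulated samples

rng = random.Random(2016)  # fixed seed so the run is repeatable

# One sample proportion per sample: count successes, then divide by n.
p_hats = [sum(rng.random() < p for _ in range(n)) / n
          for _ in range(num_samples)]

simulated_mean = fmean(p_hats)
simulated_sd = pstdev(p_hats)
sep = sqrt(p * (1 - p) / n)  # the formula: square root of pq/n

print(f"mean of the sample proportions = {simulated_mean:.4f}  (p = {p})")
print(f"SD of the sample proportions   = {simulated_sd:.4f}  (formula SEP = {sep:.4f})")
```

The simulated values won’t match the formulas exactly, because 10,000 samples is a lot but not an indefinitely large number.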

Shape of the Sampling Distribution of p̂


If np and nq are both ≥ about 10, and 10n ≤ N, the normal model is good enough for the sampling distribution.

[Graphs: sampling distribution of p-hat for n=4 with p=0.1, p=0.25, and p=0.5]

Let’s look at some sampling distributions of p̂. First I’ll show you the effect of the population’s proportion of success p, and then the effect of the sample size n.

Using @RISK from Palisade Corporation, I simulated all of the sampling distributions shown here. The mathematical sampling distribution has an indefinitely large number of samples, but I stopped at 10,000.

These first three graphs show the sampling distributions for samples of size n = 4 from three populations with different proportions of successes.

Reminder: these are not graphs of the population; they’re not responses from individuals. They are graphs of the sampling distributions, showing the proportion of successes (p̂) found in a lot of samples.

How do you read these? For example, look at the first graph. This shows the sampling distribution of the proportion for a whole lot of samples, each of size 4, where the probability of success on any one individual is 0.1. You can see that about 67% of all samples have p̂ = 0 (no successes out of four), about 29% have p̂ = .25 (one success out of four), about 4% have p̂ = .50 (two successes out of four), and so on.

Why the large gaps between the bars? With n = 4, each sample can have only 0, 1, 2, 3, or 4 successes, so the only possible proportions for those samples are 0, 25%, 50%, 75%, and 100%.

But let’s not obsess over the details of these graphs. I’m more interested in the shapes of the sampling distributions.

What do you see? If you take many samples of size 4 from a population with p = 0.1 (10% successes and 90% failures), the sampling distribution of the proportion is highly skewed. Now look at the second graph. When p = .25 (25% successes and 75% failures in the population), again with n = 4 individuals in each sample, the sample proportions are still skewed, but less so. And in the third graph, where the population has p = 0.5 (success and failure equally likely), then the sampling distribution is symmetric even with these small samples.

For a given sample size n, it looks like the closer the population p is to 0.5, the closer the sampling distribution is to symmetric. And in fact that’s true. That’s your take-away from these three graphs.
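The three graphs came from simulation, but with n = 4 you can also get the exact bar heights from the binomial formula of Chapter 6. Here’s a minimal Python sketch (the function name is made up for illustration). Don’t expect the exact values to agree perfectly with what you read off the simulated graphs; a simulation of 10,000 samples wobbles a little.

```python
from math import comb

def p_hat_distribution(n, p):
    """Map each possible sample proportion x/n to its exact binomial probability."""
    return {x / n: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

distributions = {p: p_hat_distribution(4, p) for p in (0.1, 0.25, 0.5)}

for p, dist in distributions.items():
    bars = ", ".join(f"{p_hat:.2f}: {prob:.4f}" for p_hat, prob in dist.items())
    print(f"p = {p}: {bars}")
```

For p = 0.5 the probabilities mirror each other around 0.50, which is the symmetry you see in the third graph.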

[Graphs: sampling distribution of p-hat for p=0.1 with n=50, n=100, and n=500]

Now let’s look at sampling distributions using different sample sizes from the same population. I’ll use a population with 10% probability of success for each individual (p = 0.1).

You’ve already seen the graph of the sampling distribution when n = 4. The three graphs here show the sampling distribution of p̂ for progressively larger samples. (Remember always that n is the number of individuals in one sample. The number of samples is indefinitely large, though in fact I took 10,000 samples for each graph.)

What do you see here? The distribution of p̂’s from samples of 50 individuals is still noticeably skewed, though a lot less than the graph for n = 4. If I take samples of size 100, the graph is starting to look nearly symmetric, though still slightly skewed. And if I take samples of 500 individuals, the distribution of p̂ looks like a bell curve.

What do you conclude from these graphs? First, even if p is far from 0.5 (if the population is quite unbalanced), with large enough samples, the sampling distribution of p̂ is approximately a normal distribution. Second, you need big samples for binomial data. Remember that 30 is usually good enough for numeric data. For binomial data, it looks like you need bigger samples.


Okay, let’s put it together. If the size of each sample is large enough, the sampling distribution is close enough to normal. How large a sample is large enough? It depends on how skewed the original population is, which means it depends on the proportion of successes in the population. The further p is from 0.5, the more unbalanced the population and the larger n must be.

How big a sample is big enough? Here’s what some authors say:

Why the disagreements? They can’t all be right, can they?

Actually, they can. The question is, what’s close enough to a ND? That’s a judgment call, and different statisticians are a little bit more or less strict about what they consider close enough. Fortunately, with samples bigger than a hundred or so, which are customary, all the conditions are usually met with room to spare.

We’ll follow DeVeaux and his buddies and use np ≥ 10 and nq ≥ 10. This is easy to remember: at least ten “yes” and at least ten “no” expected in a sample. (You can compute the expected number of noes as nq = n(1−p) or simply n − np, sample size minus the expected number of yeses.)

How does this work out in practice? Look at the next-to-last graph, with n=100 and p=0.1. It’s close to a bell curve, but has just a little bit of skew. (It’s easier to see the skew if you cover the top part of the graph.)

Check the requirements: np = 100×.1 = 10, and nq = 100−10 = 90. In a sample of 100, 10 successes and 90 failures are expected, on average. This just meets requirements. And that matches the graph: you can see that it’s not a perfect bell curve, but close; but if it was a little more skewed then the normal model wouldn’t be a good enough fit.

De Veaux and friends (page 440) give a nice justification for choosing ≥ 10 yeses and noes. Briefly, the ND has tails that go out to ±infinity, but proportions are between 0 and 1. They chose their “success/failure condition”, at least ten of each, so that the mismatch between the binomial model and the normal model is only in the rare cases.

But there’s an additional condition: the individuals in the sample must be independent. This translates to a requirement that the sample can’t be too large, or drawing without replacement would break the binomial model. Big surprise (not!): Authors disagree about this too. For example, De Veaux and Johnson & Kuby say the sample can’t be bigger than 10% of the population (n ≤ 0.1N); Sullivan says 5%.

We’ll use n ≤ 0.1N, just like with numeric data. And just as before, you can think of that as 10n ≤ N when you don’t have the exact size of the population.

Example 4: You asked 300 randomly selected adult residents of Ithaca a yes-or-no question. Is the sample too large to assume independence? You may not know the population of Ithaca, but you can compute 10×300 = 3000 and be confident that there are more than 3000 adult Ithacans. Therefore your sample is not too large.

Don’t just claim 10n ≤ N. Show the computation, and identify the population you’re referring to, like this: “10n = 10×300 = 3000 ≤ number of adult Ithacans.”

Remember to check your conditions: np ≥ about 10, nq ≥ about 10, and 10n ≤ N. And of course your sample must be random.
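The three numeric checks are simple enough to write as a little function. Here’s a minimal Python sketch; the function name, its parameters, and the population figure of 3000 in the usage lines are all made up for illustration. Notice what it can’t do: it has no way to tell whether your sample was random, so that check is still on you.

```python
def check_binomial_requirements(n, p, population_at_least):
    """Return each numeric requirement for binomial data with True or False.

    population_at_least is a size you can defend as a lower bound for N.
    Randomness of the sample can't be computed; check that separately.
    """
    expected_yes = n * p
    expected_no = n - expected_yes
    tolerance = 1e-9  # guards against floating-point round-off at exactly 10
    return {
        "np >= 10": expected_yes >= 10 - tolerance,
        "nq >= 10": expected_no >= 10 - tolerance,
        "10n <= N": 10 * n <= population_at_least,
    }

# n = 100, p = 0.1 just meets the requirements; n = 4 does not.
borderline = check_binomial_requirements(n=100, p=0.1, population_at_least=3000)
too_small = check_binomial_requirements(n=4, p=0.1, population_at_least=3000)

print("n = 100:", borderline)
print("n = 4:  ", too_small)
```

The thresholds are hard-coded to the “about 10” and 10% rules this book uses; an author with stricter or looser rules would change those two numbers.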

Requirements, Assumptions, and Conditions

Just like with numeric data, you might find it helpful to name the requirements for binomial data. These are the same requirements that I just gave you, but presented differently. I’m following DeVeaux, Velleman, Bock (2009, 493) [see “Sources Used” at end of book].

Independence Condition, Randomization Condition, 10% Condition: These are the same for every sampling distribution and every procedure in inferential stats. I’ve already talked about them under numeric data earlier in the chapter. In practice, the 10% Condition comes into play more often for binomial data than numeric data, because binomial samples are usually much larger.

Sample Size Assumption: For binomial data, the sample is like Goldilocks and porridge: it can’t be too big and it can’t be too small. (Maybe it was beds or chairs and not porridge? And what the heck is porridge?) “Too big” is checked by the 10% Condition; “too small” is checked by the Success/Failure Condition (at least ten successes and at least ten failures expected: np ≥ 10 and nq ≥ 10).

See also: Is That an Assumption or a Condition? (Bock [see “Sources Used” at end of book]). Again, these are the same requirements you see in this textbook, just presented differently.

8B2.  Applications

How to Work Problems

Working with the sampling distribution of p̂, the technique is exactly the same as for problems involving the sampling distribution of x̄. Follow these steps:

  1. Determine center, spread, and shape of the sampling distribution, even if you’re not explicitly asked to describe the distribution.
  2. If you can’t show that the sampling distribution is ND, stop!
  3. Sketch the curve, and estimate the answer. (See example below.)
  4. Compute the probability (area) using normalcdf. Caution! Don’t use rounded numbers in this calculation.
The ND is continuous and goes out to ±infinity, but the binomial distribution is discrete and bounded by 0 and n. If the requirements are met (at least 10 successes and 10 failures expected), the normal model is a good fit near the middle of the distribution. The fit is usually good enough in the tails, but not as good as it is in the middle.

Because of this, some authors apply a continuity correction to make the normal model a better fit for the binomial. This means extending the range by half a unit in each direction. For example, if n = 100 and p = 0.20, and you’re finding the probability of 10 to 15 successes, MATH200A part 3 gives a probability of 0.1262. The normal model with standard error √[.20×(1−.20)/100] = 0.04 gives

normalcdf(10/100, 15/100, .2, .04) = 0.0994

With the continuity correction, you compute the probability for 9½ to 15½ successes. Then the normal model gives a probability of

normalcdf(9.5/100, 15.5/100, .2, .04) = 0.1260

This is a better match to the exact binomial probability. Why use the normal model at all, then? Why not just compute the exact binomial probability? Because there’s only a noticeable discrepancy far from the center, and only when the sample is on the small side. (100 is a small sample for binomial data, as you’ll see in the next two chapters.) You can apply the continuity correction if you want, but many authors don’t because it usually doesn’t make enough difference to matter.
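Here’s a minimal Python sketch of that three-way comparison, using only the standard library: the exact binomial probability (the number MATH200A part 3 reports), the plain normal model, and the normal model with the continuity correction. The variable names are made up for illustration.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.20
low, high = 10, 15  # probability of 10 to 15 successes

# Exact binomial probability: add up P(x) for x = 10 through 15.
exact = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(low, high + 1))

# Normal model for the sample proportion, with the SEP as its spread.
model = NormalDist(p, sqrt(p * (1 - p) / n))
uncorrected = model.cdf(high / n) - model.cdf(low / n)
# Continuity correction: widen the range by half a unit at each end.
corrected = model.cdf((high + 0.5) / n) - model.cdf((low - 0.5) / n)

print(f"exact binomial            = {exact:.4f}")
print(f"normal, no correction     = {uncorrected:.4f}")
print(f"normal, with correction   = {corrected:.4f}")
```

The three results should line up with the 0.1262, 0.0994, and 0.1260 quoted above.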

Example 5: Swain v. Alabama

1965. Talladega County, Alabama. An African American man named Robert Swain is accused of rape. The 100-man jury panel includes 8 African Americans, but through exemptions and peremptory challenges none are on the final jury. Swain is convicted and sentenced to death. (Juries are all male. 26% of men in the county are African American.)

(a) In a county that is 26% African American, is it unexpected to get a 100-man jury panel with only eight African Americans?

Solution: “Unexpected”, “unusual”, or “surprising” describes an event with low probability, typically under 5%. This problem is asking you to find the probability of getting that sample and compare that probability to 0.05.

This is binomial data: each member of the sample either is or is not African American. Putting the problem into the language of the sampling distribution, your population proportion is p = 0.26. You’re asked to find the probability of getting 8 or fewer successes in a sample of 100, so n = 100 and your sample proportion must be p̂ ≤ 8/100 or p̂ ≤ 0.08.

Why “8 or fewer”? p̂ = 8% for the questionable sample, and you think it’s too low since it’s below the expected 26%. Therefore, in determining how likely or unlikely such a sample is, you’ll compute the probability of p̂ ≤ 0.08.

First, describe the sampling distribution of the proportion:

[Sketch: normal curve shaded left of .08; probability=.00002]

Next, make the sketch and estimate the answer. I’ve numbered the key points in the sketch at right, but if you need a refresher please refer back to the sketch under Example 1.

From this sketch, you’d expect the probability to be very small, and indeed it is.

Compute the probability using normalcdf as before. Be careful with the last argument, which is the standard deviation of the sampling distribution. Don’t use a rounded number for the standard error, because it can make a large difference in the probability.

With the “wizard” interface:

The standard error expression, √(.26*(1−.26)/100), scrolls off the screen as you type it in, so be extra careful!

[Calculator screen: normalcdf wizard interface with minus infinity, .08, .26, and .26 times 1 minus .26 all over 100]

Press [ENTER] twice, and your screen will look like the one at right.

With the classic interface:

Press [)] [ENTER] after entering the standard error. You’ll have two closing parentheses, one for the square root and one for normalcdf.

[Calculator screen: normalcdf of minus infinity, .08, .26, and the square root of quantity .26 times .74 over 100, yielding 2.03455044 e minus 5]

Always show your work — not keystrokes but the function and its arguments:

normalcdf(−10^99, .08, .26, √(.26*(1−.26)/100))

[Calculator screen: shortcut method of computation; see text]

The SEP is a nasty expression, and you have to enter it twice in every problem. You might like to save some keystrokes by computing it once and then storing it in a variable, as I did at the right. When you’re drawing the sketch and need the standard error, compute it as usual but before pressing [ENTER] press [STO→] [x,T,θ,n]. Then when you need the standard error in normalcdf, in the wizard or the classic interface, just press the [x,T,θ,n] key instead of re-entering the whole SEP expression. The probability is naturally the same whether you use the shortcut or not.

P(p̂ ≤ 0.08) = 2.0×10^−5, or P(x ≤ 8) = 0.000 020. There are only 20 chances in a million of getting a 100-man jury pool with so few African Americans by random selection from that county’s population. This is highly unexpected, so unlikely that it raises the gravest doubts about the county’s claim that jury pools were selected without racial bias.

You might remember that in Chapter 6 you computed this binomial probability as 0.000 005, five chances in a million. If the ND is a good approximation, why does it give a probability that’s four times the correct probability? Answer: The normal approximation gets a little dicier as you move further out the tails, and this sample is pretty far out (z = −4.10). But is the approximation really that bad? Sure, the relative error is large, but the absolute error is only 0.000 015, 15 chances in a million. Either way, the message is “This is extremely unlikely to be the product of random chance.”
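To see the normal approximation and the exact binomial probability side by side, here’s a minimal Python sketch using only the standard library. The variable names are made up for illustration; the exact probability is just the binomial formula from Chapter 6 added up for 0 through 8 successes.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.26
sep = sqrt(p * (1 - p) / n)  # standard error of the proportion, unrounded

# How far out in the tail is a sample proportion of 0.08?
z = (0.08 - p) / sep

# Normal model: area to the left of 0.08.
normal_approx = NormalDist(p, sep).cdf(0.08)

# Exact binomial: P(x <= 8), adding up x = 0 through 8.
exact = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(0, 9))

print(f"z = {z:.2f}")
print(f"normal approximation = {normal_approx:.2e}")
print(f"exact binomial       = {exact:.2e}")
```

Both numbers are a few chances in a million, which is the point: the two models disagree on the digits but not on the message.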

[Sketch: normal curve shaded left of .15; probability=.0061]

(b) From 1950 to 1965, as cited in the Supreme Court’s decision, every 100-man jury pool in the county had 15 or fewer African Americans. How likely is that, if they were randomly selected?

Solution: 15 out of 100 is 15%. You know how to compute the probability that one jury pool would be ≤15% African American, so start with that. You’ve already described the sampling distribution, so all you have to do is make the sketch and then the calculation. Everything’s the same, except your right boundary is 0.15 instead of 0.08.

If you use my little shortcut:

[Calculator screen: shortcut method of computation; see text]

Otherwise:

[Calculator screen: long method of computation; see text. Answer is .0060745591]

Either way, P(p̂ ≤ 0.15) = 0.0061. The Talladega County jury panels are multiple samples with n = 100 in each, so the “proportion of all” interpretation makes sense: In the long run, you expect 0.61% of jury panels to have 15% or fewer African Americans, if they’re randomly selected.

But actually 100% of those jury panels had 15% or fewer African Americans. How unlikely is that? Well, we don’t know how many juries there were in the county in those 16 years, but surely it must have been at least one a year, or a total of 16 or more. The probability that 16 independent jury pools would all have 15% or fewer African Americans, just by chance, is 0.0060745591^16 ≈ 3E-36, effectively zip. And if there was more than one jury a year, as there probably was, the probability would be even lower. Something is definitely fishy.

The binomial probability is 0.0061 also. This is still pretty far out in the left-hand tail (z = −2.51), but the normal approximation is excellent. The message here is that the normal approximation is pretty darn close except where the probabilities are so small that exactness isn’t needed anyway.
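Here’s a minimal Python sketch of the part (b) computations, again with only the standard library and made-up variable names: the normal approximation for one jury pool, the exact binomial probability for comparison, and the chance that 16 independent pools would all come out that way.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.26
sep = sqrt(p * (1 - p) / n)  # standard error of the proportion, unrounded

# One jury pool with 15% or fewer African Americans, by the normal model.
normal_approx = NormalDist(p, sep).cdf(0.15)

# The same event by the exact binomial model: P(x <= 15).
exact = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(0, 16))

# Sixteen independent pools all at 15% or fewer: multiply the probabilities.
all_sixteen = normal_approx ** 16

print(f"normal approximation, one pool = {normal_approx:.4f}")
print(f"exact binomial, one pool       = {exact:.4f}")
print(f"all 16 pools                   = {all_sixteen:.1e}")
```

Sixteen is the minimum number of pools assumed in the text; raise the exponent and the probability only gets smaller.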

8C.  Summary of Sampling Distributions

Here’s a side-by-side summary of sampling distributions of the mean (numeric data) and sampling distributions of the proportion (binomial data). Always check requirements for the type of data you actually have!

| | Numeric Data | Binomial Data |
|---|---|---|
| | Each individual in sample provides a number. | Each individual in sample provides a success or failure, and you count successes. |
| Statistic of one sample | mean x̄ = ∑x/n | proportion p̂ = x/n |
| Parameter of population | mean μ | proportion p |
| Sampling distribution of the ... | Sampling distribution of the mean (sampling distribution of x̄) | Sampling distribution of the proportion (sampling distribution of p̂) |
| Mean of sampling distribution | μ_x̄ = μ | μ_p̂ = p |
| Standard deviation of sampling distribution | SEM = standard error of the mean, σ_x̄ = σ/√n | SEP = standard error of the proportion, σ_p̂ = √[pq/n] |
| Sampling distribution is close enough to normal if ... | Random sample; 10n ≤ N; population is ND or n ≥ about 30 | Random sample; 10n ≤ N; np ≥ about 10 and nq ≥ about 10 |

NOTE: n is number of individuals per sample. Number of samples is indefinitely large and has no symbol.

What Have You Learned?

Key ideas:


Exercises for Chapter 8

Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.

Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

1 Household incomes in the country Freedonia are a skewed distribution with mean $48,000 and standard deviation (SD) $2,000. You take a random sample of size 64 and compute the mean of the sample. That is one sample mean out of the distribution of all possible sample means. Describe the sampling distribution of the mean, including all symbols and formulas.
2 A manufacturer of light bulbs claims a mean life of 800 hours with SD 50 hours. You take a random sample of 100 bulbs and find a sample mean of 780 hours.

(a) If the manufacturer’s claim is true, is a sample mean of 780 hours surprising? (Hint: Think about whether you need the probability of x̄ ≤ 780 or x̄ ≥ 780.)

(b) Would you accept the manufacturer’s claim?

3 Suppose 72% of Americans believe in angels, and you take a simple random sample of 500 Americans.

(a) Describe the sampling distribution of the proportion who believe in angels in samples of 500 Americans.

(b) Use the normal approximation to compute the probability of finding that 350 to 370 in a sample of 500 believe in angels. Reminder: You can’t use the sample counts directly; you have to convert them to sample proportions.

4 In a town with 100,000 households, the last census showed a mean income of $32,400 with SD $19,000. The city manager believes that average income has fallen since the census. Students at the local community college randomly survey 1000 households and find a sample mean income of $31,000. What’s the chance of getting a sample mean ≤$31,000 if the true mean and SD are still what the census found?
5 Roulette is a popular casino game. The croupier spins the wheel in one direction and spins a white ball in the other direction along the rim, and the ball drops into one of the slots. In the US, roulette wheels have 38 slots: 18 red, 18 black, and 2 green. (In Monte Carlo, the wheels have 37 slots because there’s only one green.)

(a) One way beginners play is to bet on red or black. If the ball comes up that color, they double their money; if it comes up any other color, they lose their money. Construct a probability model for the outcome of a $10 bet on red from the player’s point of view.

(b) Find the mean and SD for the outcome of $10 bets on red, and write a sentence interpreting the mean.

(c) Now take the casino’s point of view. A large casino can have hundreds of thousands of bets placed in a day. Obviously they won’t all be the same, but it doesn’t take many days to see a whole lot of any given bet. Describe the sampling distribution of the mean for a sample of 10,000 $10 bets on red.

(d) How much does the casino expect to earn on 10,000 $10 bets on red?

(e) What’s the chance that the casino will lose money on those 10,000 $10 bets on red?

(f) What’s the casino’s chance of making at least $2000 on those 10,000 $10 bets?

6 A sugar company packages sugar in 5-pound bags. The amount of sugar per bag varies according to a normal distribution. A random sample of 15 bags is selected from the day’s production. If the total weight of the sample is more than 75.6 pounds, the machine is packing too much per bag and must be adjusted.

What is the probability of this happening, if the day’s mean is 5.00 pounds and SD 0.05 pounds?

7 The weights of cabbages in a shipment are normally distributed, with a mean of 38.0 ounces and SD of 5.1 ounces.

(a) If you randomly pick one cabbage, what is the probability that its weight is more than 43.0 ounces?

(b) If you randomly pick 14 cabbages, what is the probability that their average weight is more than 43.0 ounces?

8 Suppose the average household consumes 12.5 kW of electric power at peak time, with SD 3.5 kW. A particular substation in a typical neighborhood serves 1000 households and has a capacity of 12,778 kW at peak time. (That’s 12 thousand and some, not 12 point something.) Find the probability that the substation will fail to supply enough power.
9 In the Physicians’ Health Study, about 22,000 male doctors were randomly assigned to take aspirin daily or a placebo daily. (Of course the study was double blind.) In the placebo group, 1.71% of doctors had heart attacks over the course of the study. Let’s take 1.71% as the rate of heart attacks in an adult male population that doesn’t take aspirin.

The heart attack rate among aspirin takers was 0.94%, which looks like an impressive difference. Is there any chance that aspirin makes no difference, and this was just the result of random selection? In other words, how likely is it for that second sample to have p̂ = 0.94% if the true proportion of heart attacks in adult male aspirin takers is actually 1.71%, no different from adult males who don’t take aspirin?

10 Men’s heights are normally distributed with mean 69.3″ and SD 2.92″. If a random sample of 16 men is taken, what values of the sample mean would be surprising? In other words, what values of x̄ are in the 5% of the sampling distribution furthest away from the population mean?

(Hint: The 5% is the tails, the part of the sampling distribution that is not in the middle 95%.)

11 In June 2013, the Pew Research Center found that 45% of Americans had an unfavorable view of the Tea Party. In the second week of October 2013, according to Tea Party’s Image Turns More Negative (Pew Research Center 2013e [see “Sources Used” at end of book]), 737 adults in a random sample of 1504 had an unfavorable view of the Tea Party.

In a population where 45% have an unfavorable view of the Tea Party, how likely is a sample of 1504 where 737 or more have an unfavorable view? Can you draw any conclusions from that probability?



9. Estimating Population Parameters

Updated 1 Jan 2016 (What’s New?)


In Chapter 8, you learned what sort of samples to expect from a known population. In the rest of the course, you’ll learn how to use a sample to make statements about an unknown population. This is inferential statistics.

In inferential statistics, there are two types of things you want to do: test whether some claim is true, and estimate the size of some effect. In this chapter you’ll construct confidence intervals that estimate population means and proportions; Chapter 10 starts you on testing claims.


9A.  Estimating Population Proportion p

In the Physicians’ Health Study [see “Sources Used” at end of book], about 22,000 male doctors were randomly assigned to take aspirin or a placebo every night. Of 11,037 in the treatment group, 104 had heart attacks and 10,933 did not. Can you say how likely it is for people in general (or, at least, male doctors in general) to have a heart attack if they take aspirin nightly?

As always, probability of one equals proportion of all. So you could just as well ask, what proportion of people who take aspirin would be expected to have heart attacks?

Before statistics class, you would divide 104/11037 = 0.0094 and say that 0.94% of people taking nightly aspirin would be expected to have heart attacks. This is known as a point estimate.

But you are in statistics class. You know that a sample can’t perfectly represent the population, and therefore all you can say is that the true proportion of heart attacks in the population of aspirin takers is around 0.94%. Can you be more specific?

Yes, you can. You can compute a confidence interval for the proportion of heart attacks to be expected among aspirin takers, based on your sample, and that’s the subject of this chapter. We’ll get back to the doctors and their aspirin later, but first, let’s do an example with M&Ms.

9A1.  Confidence Interval for p (Binomial Data)

Example 1: You take a random sample of 605 plain M&Ms, and 87 of them are red. What can you say about the proportion of reds in all plain M&Ms?


A point estimate of a population parameter is the single best available number, and in fact it’s nothing more than the corresponding sample statistic.

In this example, your point estimate for population proportion is sample proportion, 87/605 = 14.4%, and you conclude “Somewhere around 14.4% of all plain M&Ms are red.”

The sample proportion is a point estimate of the proportion in the population, the sample mean is a point estimate of the mean of the population, the sample standard deviation is a point estimate of the standard deviation of the population, and so on.


A confidence interval estimate of a population parameter is a statement of bounds on that parameter and includes your level of confidence that the parameter actually falls within those bounds.

For instance, you could say “I’m 95% confident that 11.6% to 17.2% of plain M&Ms are red.” 95% is your confidence level (symbol: 1−α, “one minus alpha”), and 11.6% and 17.2% are the boundaries of your estimate or the endpoints of the interval.

As an alternative to endpoint form, you could write a confidence interval as a point estimate and a margin of error, like this: “I’m 95% confident that the proportion of red in plain M&Ms is 14.4% ± 2.8%.” 14.4% is your point estimate, and 2.8% is your margin of error (symbol: E), also known as the maximum error of the estimate. Since the confidence interval extends one margin of error below the point estimate and one margin of error above the point estimate, the margin of error is half the width of the confidence interval.

For all the cases you’ll study in this course, the point estimate — the mean or proportion of your sample — is at the middle of the confidence interval. But that’s not true for some other cases, such as estimating the standard deviation of a population. For those cases, computing the margin of error is uglier.

Computing a Confidence Interval

As you might expect, your TI-83/84 and lots of statistical packages can compute confidence intervals for you. But before doing it the easy way, let’s take a minute to understand what’s behind computing a confidence interval.

You can compute an interval to any level of confidence you desire, but 95% is most common by far, so let’s start there. How do you use those 87 reds in a sample of 605 M&Ms to estimate the proportion of reds in the population, and have 95% confidence in your answer?

In Chapter 8, you learned how to find the sampling distribution of p̂. Given the true proportion p in the population, you could then determine how likely it is to get a sample proportion within various intervals. To find a confidence interval, you simply run that backward.

You don’t know the proportion of reds in all plain M&Ms, so call it p. You know that, if the sample size is large enough, sample proportions are ND and there’s a 95% chance that any given sample proportion will be within 2 standard errors on either side of p, whatever p is.

The standard error of the proportion is σ_p̂ = √[pq/n]. You don’t know p; that’s what you’re trying to find. Are you stuck? No, you have an estimate for p. Your point estimate for the population proportion p is the sample proportion p̂ = 87/605. You can estimate the standard error of the proportion (the SEP) by using the statistics of your sample:

σ_p̂ ≈ √[(87/605)(1−87/605)/605] = 0.0142656 or about 1.4%

Two standard errors is 0.0285312 → 0.029 or 2.9%.

How good is this estimate? For decent-sized samples, it’s quite good. For example, suppose the true population proportion p is 50% or 0.5. For a sample of n = 625, the SEP is √[.5(1−.5)/625] = 0.0200 or 2.00%. Your sample proportion is very, very, very unlikely to be as far away as 40% or 0.4, but even if it is then you would estimate the SEP as √[.4(1−.4)/625] = 0.0196 or 1.96%, which is extremely close.
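To see for yourself how little the SEP changes when the sample proportion is off, here is a short Python sketch of the comparison just described. The function name `sep` is my own, not something from the TI-83/84.

```python
import math

def sep(p, n):
    """Standard error of the proportion, sqrt(p*q/n) with q = 1 - p."""
    return math.sqrt(p * (1 - p) / n)

n = 625
true_sep = sep(0.5, n)        # using the true population proportion
estimated_sep = sep(0.4, n)   # using a sample proportion that is far off
print(f"{true_sep:.4f} versus {estimated_sep:.4f}")
```

Even a sample proportion ten percentage points away from the truth moves the standard error only in the fourth decimal place.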

Different authors use the term “standard error” slightly differently. Some use it only for the standard deviation of the sampling distribution, which you never know exactly because you never know the population parameters exactly. Others use it only for the estimate based on sample statistics, which I computed just above. Still others use it for either computation. In practice it doesn’t make a lot of difference. I don’t see much point to getting too fussy about the terminology, given that only one of them can be computed anyway.

Any given sample proportion is 95% likely to be within two standard errors or 2.9% of the population proportion:

p−0.029 ≤ p̂ ≤ p+0.029 (probability = 95%)

Now the magic reverso: Given a sample proportion, you’re 95% confident that the population proportion is within 2.9% of that sample proportion:

p̂−0.029 ≤ p ≤ p̂+0.029 (95% confidence)

In this case, your sample proportion is 87/605 ≈ 0.144:

0.144−0.029 ≤ p ≤ 0.144+0.029 (95% confidence)

0.115 ≤ p ≤ 0.173 (95% confidence)

So your 95% confidence interval is 0.115 to 0.173, or 11.5% to 17.3%.
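The arithmetic of this rough “two standard errors” interval can be sketched in a few lines of Python. This is only a check of the hand calculation, and it rounds the same way the text does (p̂ to 0.144, two standard errors to 0.029); the variable names are mine.

```python
import math

successes, n = 87, 605
p_hat = successes / n

# Estimated standard error of the proportion, from the sample itself.
sep = math.sqrt(p_hat * (1 - p_hat) / n)

# Round as the text does: point estimate 0.144, two standard errors 0.029.
center = round(p_hat, 3)
margin = round(2 * sep, 3)

low, high = center - margin, center + margin
print(f"{low:.3f} <= p <= {high:.3f}  (about 95% confidence)")
```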

If the magic reverso seems like cheating, it’s not. Suppose you’re 95% sure that Cortland is within 12 miles of Dryden; aren’t you equally sure that Dryden is within 12 miles of Cortland? But you can also prove it with algebra. Here was our starting point:

p−0.029 ≤ p̂ ≤ p+0.029

Multiply by −1. When you multiply by a negative, you have to reverse the inequality signs.

−p+0.029 ≥ −p̂ ≥ −p−0.029

Rewrite in conventional order, from smallest to largest.

−p−0.029 ≤ −p̂ ≤ −p+0.029

Now add p+p̂ to all three “sides”.

p̂−0.029 ≤ p ≤ p̂+0.029

You might have noticed that I changed from 95% probability to 95% confidence. What’s up with that? Well, the sample proportion is a random variable: different samples will have different sample proportions p̂, and you can compute the probability of getting p̂ in any particular range.

But the population proportion p is not a random variable. It has one definite value, even though you don’t know what that definite value is. Probability statements about a definite number make about as much sense as discussing the probability of precipitation for yesterday. The population proportion is what it is, and you have some level of confidence that your estimated range includes that true value.

What does “95% confident” mean, then? Simply this: In the long run, when you do everything right, 95% of your 95% intervals will actually include the population proportion, and the other 5% won’t. 5% is 5/100 = 1/20, so in the long run about one in 20 of your 95% confidence intervals will be wrong, just because of sample variability.

Probability of one = proportion of all, so there’s one chance in twenty that this interval is wrong, meaning that it doesn’t contain the true population proportion, even if you did everything right. If that makes you too nervous, you can use a higher confidence level, but you can never reach 100% confidence.
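You can watch the “95% of your 95% intervals” idea happen in a simulation. The Python sketch below is an illustration under assumptions of my own choosing: a made-up true proportion of 0.15, samples of 605, 2000 repetitions, and a fixed random seed so the run is repeatable. It builds an interval from each simulated sample and counts how many intervals contain the true proportion.

```python
import math
import random
from statistics import NormalDist

def coverage_rate(true_p, n, confidence=0.95, trials=2000, seed=1):
    """Fraction of simulated confidence intervals that contain true_p."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    hits = 0
    for _ in range(trials):
        # Draw one random sample and count the successes in it.
        successes = sum(rng.random() < true_p for _ in range(n))
        p_hat = successes / n
        margin = z_crit * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= true_p <= p_hat + margin:
            hits += 1
    return hits / trials

# Hypothetical population: 15% successes, samples of 605.
rate = coverage_rate(true_p=0.15, n=605)
print(f"{rate:.1%} of the simulated intervals contain the true proportion")
```

The rate should land near 95% but not exactly on it, because the simulation is itself a sample.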

There’s one more wrinkle. That margin of error of 0.029 was 2σ_p̂, two standard errors. The figure of 2 standard errors for the middle 95% of a ND comes from the Empirical Rule or 68–95–99.7 Rule, so it’s only approximately right.

But you can be a little more precise. In Chapter 7 you learned to find the middle any percent, and that lets you generalize to any confidence level:

| | This Example | General Case |
| Confidence level (middle area of the ND) | 95% | 1−α |
| Area in the two tails combined | 100%−95% = 5% or 0.05 | 1−(1−α) = α |
| Area in each tail | 0.05/2 = 0.025 | α/2 |
| The boundaries are | ±z_{0.025} = invNorm(1−0.025) = 1.9600 | ±z_{α/2} |
| The margin of error is | E = 1.96σ_p̂ | E = z_{α/2}·σ_p̂ |
| And you compute it as | E = 1.96·√[p̂(1−p̂)/n] | E = z_{α/2}·√[p̂(1−p̂)/n] |

The margin of error on a 1−α confidence interval is z_{α/2} standard errors. (This will be important when you determine necessary sample size, below.)

The margin of error on a 95% confidence interval is close to 2σ_p̂, but more accurately it’s 1.96σ_p̂. For the proportion of red M&Ms, where the SEP was σ_p̂ = 0.0142656, the margin of error is 1.96σ_p̂ = 0.0279606 → 0.028 or 2.8%. Since the point estimate was 14.4%, you’re 95% confident that the proportion of reds in plain M&Ms is within 14.4%±2.8%, or 11.6% to 17.2%.
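If you have Python handy, the same calculation can be sketched for any confidence level. This is a minimal illustration, not a replacement for the requirements check. `NormalDist().inv_cdf` from Python’s standard library plays the role of invNorm here, and the function name `proportion_ci` is my own.

```python
import math
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Return (low, high, margin) for a one-proportion z interval."""
    p_hat = successes / n
    alpha = 1 - confidence
    # Critical z: the z-score with area alpha/2 to its right.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z_crit * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin, margin

low, high, margin = proportion_ci(87, 605, 0.95)
print(f"{low:.3f} to {high:.3f}, margin of error {margin:.3f}")
```

Changing the third argument to, say, 0.90 or 0.99 shows how the interval narrows or widens with the confidence level.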

Interpreting a Confidence Interval

You’ve seen that there are two ways to state a confidence interval: from ____ to ____ with ____% confidence, or ____ ± ____ with _____% confidence. Mathematically these are equivalent, but psychologically they’re very different. The first form is better than the second.

What’s wrong with the ____ ± ____ form? It’s easy to misinterpret.

If you say “I’m 95% confident that the proportion of reds in plain M&Ms is within 14.4%±2.8%”, some people will read 14.4% and stop: they’ll think that the population proportion is 14.4%. And even people who get past that will probably think that there’s something special about 14.4%, that somehow it’s more likely to be the true proportion of reds among all plain M&Ms. But 14.4% is just a value of a random variable, namely the proportion of reds in this sample. Another sample would almost certainly have a different p̂ and therefore a different midpoint for the interval.

It’s much better to use the endpoint form, because the endpoint form is harder to misinterpret. When you say “I’m 95% confident that the proportion of reds in plain M&Ms is 11.6% to 17.2%”, you lead the reader, even the non-technical reader, to understand that the proportion could be anything in that range, and even that there’s a slight chance that it’s outside that range.

Requirements check (RC): This is an essential step: do it before you compute the confidence interval. Computing the CI assumes that the sampling distribution of p̂ is a ND, but “assumes” in statistics means you don’t assume, you check it.

The requirements are stated in Chapter 8 as simple random sample (or equivalent), np and nq both ≥ about 10, 10n ≤ N. You don’t know p, but for binomial data it’s okay to use p̂ as an estimate. But np̂ is just the number of yeses or successes in your sample, and nq̂ = n(1−p̂) is just the number of noes or failures in your sample, so you really don’t need to do any multiplications.

Here’s how you check the requirements:
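As a rough illustration, the requirements described above can be written as a small Python checklist. This is a sketch under my own assumptions. The function and its parameter names are hypothetical, whether the sample is random is something only you can judge (so it’s passed in as a flag), and a population size of `None` stands for “too large to worry about”, as with all plain M&Ms.

```python
def check_proportion_requirements(successes, n, population_size=None,
                                  random_sample=True):
    """Return a dict of requirement name -> True/False for binomial data."""
    failures = n - successes
    return {
        "random sample": random_sample,
        # With no population size given, assume the population is huge.
        "10n <= N": population_size is None or 10 * n <= population_size,
        "about 10 or more successes": successes >= 10,
        "about 10 or more failures": failures >= 10,
    }

for name, ok in check_proportion_requirements(87, 605).items():
    print(f"{name}: {'OK' if ok else 'NOT MET'}")
```

For the M&Ms there are 87 successes and 605−87 = 518 failures, both comfortably above 10.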

Easy CIs with TI-83/84

Your TI-83 or TI-84 can easily compute confidence intervals for a population proportion. With binomial data, this is Case 2 in Inferential Statistics: Basic Cases. (Excel can do it too, but it’s significantly harder in Excel.)

Example 2: Let’s do the red M&Ms, since you already know the answer. See the requirements check above. Press [STAT] [◄] to get to the STAT TESTS menu, and scroll up or down to find 1-PropZInt. (Caution: you don’t want 1-PropZTest. That’s reserved for Chapter 10.) Enter the number of successes in the sample, the sample size, and the confidence level. Easy-peasy! Write down the screen name and your inputs, then proceed to the output screen and write down just the new stuff:

[Screenshots: TI-83/84 input screen for 1-PropZInt; TI-83/84 output screen for 1-PropZInt]

Here’s how you show your work:

1-PropZInt 87, 605, .95 (not PropZInt, please!)

(.11584, .17176), p̂ = .1438016529

There’s no need to write n=605 because you already wrote it down from the input screen.

Interpretation: I’m 95% confident that 11.6% to 17.2% of plain M&Ms are red.

You can vary that in several ways. For instance, some people like to put the confidence level last: 11.6% to 17.2% of plain M&Ms are red (95% confidence). Or they may choose more formal language: We’re 95% confident that the true proportion of reds in plain M&Ms is 11.6% to 17.2%.

I’ve already pooh-poohed the margin-of-error form, but sometimes you have to write it that way, for instance if your boss or your thesis advisor demands it. You can get it easily from the TI-83/84 output screen.

The center of the interval, the point estimate, is given: 14.38%. To find the margin of error, subtract that from the upper bound of the interval, or subtract the lower bound from it: .17176−.1438 = .02796, or .1438−.11584 = .02796. Either way it’s 2.8%. You can then express the CI as 14.4%±2.8% with 95% confidence.

Example 3: What about the male doctors who started this section? 104 out of 11037 of the doctors taking nightly aspirin had heart attacks. Assuming that male doctors are representative of adults in general, in terms of heart-attack risk, what can you say about the chance of heart attack for anyone who takes aspirin nightly? Use confidence level 1−α = 95%.

Solution: Requirements check (RC):

1-PropZInt 104, 11037, .95

(.00762, .01123), p̂ = .0094228504

Conclusion: People who take nightly aspirin have a 0.76% to 1.12% chance of heart attack (95% confidence).

An interesting special case occurs when you have no successes. Although you can’t do the regular calculation, because 0 successes doesn’t meet the requirement, you can use an approximate procedure called the Rule of Three. In Confidence Intervals with Zero Events, Steve Simon (2010) [see “Sources Used” at end of book] explains, “zero to 3/n is an approximate 95% confidence interval for a data set where we observed 0 events in n patients.”

Example: Suppose that your sample was only 50 doctors, and none of them had heart attacks. 3/50 = 6%, so you would be 95% confident that people who take nightly aspirin have a zero to 6% chance of heart attack.
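Here is a short Python sketch of the Rule of Three for that example. For comparison it also computes a more exact upper bound, the largest p for which seeing zero events in n trials still has a 5% chance, found by solving (1−p)^n = 0.05. That comparison is my addition and is not part of Simon’s quoted statement.

```python
def rule_of_three_upper(n):
    """Approximate upper end of a 95% interval when 0 events are seen in n."""
    return 3 / n

def exact_upper_zero_events(n, alpha=0.05):
    """Largest p for which 0 events in n trials still has probability alpha."""
    return 1 - alpha ** (1 / n)

n = 50
approx = rule_of_three_upper(n)
exact = exact_upper_zero_events(n)
print(f"Rule of Three: 0 to {approx:.1%}; solving (1-p)^n = 0.05: 0 to {exact:.1%}")
```

The two upper bounds agree closely, which is why the simple 3/n shortcut is good enough for a quick estimate.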

9A2.  How Big a Sample for Binomial Data?

The equation for margin of error is packed with information:  E = z_{α/2}·√[p̂(1−p̂)/n]

You can see that a larger sample size n means a narrower confidence interval, but the sample size is inside the square-root sign so you don’t get as much benefit as you might hope for. If you take a sample four times as big, the square root of 4 is 2 and so your interval is half as wide, not ¼ as wide.

You can see also that you get a narrower interval if you’re willing to live with a lower confidence level. The lower your confidence level, the smaller z_{α/2} will be, and therefore the narrower your confidence interval.

The bottom line is that there’s a three-way tension among sample size, confidence level, and margin of error. You can choose any two of those, but then you have to live with the third. (p̂ doesn’t come into it. Although p̂ does contribute to the standard error and therefore to the margin of error, you can’t choose what p̂ you’re going to get in a sample.)
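To make that three-way tension concrete, here is a minimal Python sketch that holds p̂ fixed at the M&M value and varies the other ingredients. The function name `margin_of_error` is mine, and `NormalDist().inv_cdf` stands in for invNorm.

```python
import math
from statistics import NormalDist

def margin_of_error(p_hat, n, confidence=0.95):
    """Margin of error for a proportion: critical z times the estimated SEP."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z_crit * math.sqrt(p_hat * (1 - p_hat) / n)

p_hat = 87 / 605
e_original = margin_of_error(p_hat, 605)           # 95%, n = 605
e_four_times = margin_of_error(p_hat, 4 * 605)     # sample four times as big
e_lower_conf = margin_of_error(p_hat, 605, 0.90)   # lower confidence level

print(f"n = 605:   E = {e_original:.4f}")
print(f"n = 2420:  E = {e_four_times:.4f} (half as wide, not a quarter)")
print(f"90% level: E = {e_lower_conf:.4f} (narrower, but less confidence)")
```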

If you want to get a confidence interval at your preferred confidence level with (no more than) a specified margin of error, how big a sample do you need? MATH200A Program part 5 will compute this for you, but let’s look at the formula first.

(See Getting the Program for instructions on getting the MATH200A program into your calculator.)

The equation at the start of this section shows the margin of error you get for a given sample size and confidence level. You can solve for the sample size n, like this:

E = z_{α/2}·√[p̂(1−p̂)/n]    ⇒    n = (z_{α/2}/E)²·p̂(1−p̂)

In the formula, p̂ is your prior estimate if you have one. This can be the result of a past study, or a reasonable estimate if it has some logical basis. If you don’t have a prior estimate, use 0.5.

.5 or 50% is the conservative choice. It gives the largest possible sample size for a given E and C-Level. Why is that? Because the formula contains a multiplication by p̂(1−p̂), and that product takes on its largest value when p̂ = 0.5.

Using .5 as your prior estimate, you’re guaranteed that your sample won’t be too small, though it may be larger than necessary. Why not just use .5 all the time? Because taking samples always costs time and usually costs money, so you don’t want a larger sample than necessary.

Example 4: In a sample of 605 plain M&Ms, 87 were red. The 95% confidence interval had a 2.8% margin of error. How big a sample would you need to reduce the margin of error to 2%?

With the MATH200A program (recommended):

Press [PRGM], select MATH200A, and press [ENTER] twice. Dismiss the title screen and you’ll see a menu. Press [5] for sample size. Then select your data type, binomial.

[Screenshots: MATH200A menu screen for sample size application; MATH200A menu screen for binomial data type]

The next screen wants your estimated p, your desired margin of error, and your desired confidence level. Your prior estimate is 87/605, from your earlier study. Your margin of error is 2% = .02 (not .2 !), and your confidence level is 95% = .95.

[Screenshot: MATH200A input screen for binomial sample size]

The output screen echoes back your inputs, in case you forgot to write down the input screen, and then tells you that the sample size must be at least 1183 M&Ms. Notice the inequalities: for a margin of error of .02 (2%) or less, you need a sample size 1183 or more.

[Screenshot: MATH200A output screen for binomial sample size]

(z Crit is critical z or z_{α/2}, the number of standard errors associated with your chosen confidence level.)

If you’re not using the program:

Marshal your data: prior estimate p̂ = 87/605, desired margin of error E = 0.02, and confidence level 1−α = 0.95.

You need z_{α/2}. Get α/2 from the confidence level:

1−α = 0.95 ⇒ α = 0.05 ⇒ α/2 = 0.025

z_{α/2} is the z-score such that the area to the right is α/2. In this problem, α/2 = 0.025, so you’re computing z_{0.025}. You’ll use invNorm, but invNorm wants the area to the left and 0.025 is an area to the right, so you compute invNorm(1−.025).

Now, to avoid re-entering that z value, chain your calculations. The formula says you need to divide by E, so simply press [/] and type .02, the desired margin of error, then press [ENTER]. Notice how the calculator displays Ans as soon as you press the [/] key, to confirm that you’re continuing the previous calculation.

To square the fraction, press [x²] [ENTER].

Finally, multiply by p̂ and (1−p̂). You get 1182.4…, and therefore your required sample size is 1183.

[Screenshot: computation of sample size, described in text]

Caution! Your answer is 1183, not 1182. You don’t round the result of a sample-size calculation. If it comes out to a whole number (unusual), that’s your answer. Otherwise, you round up to the next whole number. Why? Smaller sample size makes larger margin of error. n = 1182.4… corresponds to E = 0.02 exactly. A sample of 1182 would be just slightly under 1182.4…, and your margin of error would be just slightly over 0.02. But 0.02 was the maximum acceptable margin of error, so 1182.4… is the minimum acceptable sample size. You can’t take a fraction of an M&M in your sample, so you have to go up to the next whole number.

There’s no requirements check in sample-size problems. These are planning how to take your sample; requirements apply to your sample once you have it.
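For readers who’d like to see the sample-size formula outside the calculator, here is a minimal Python sketch. It’s an illustration of the formula n = (z_{α/2}/E)²·p̂(1−p̂) with rounding up, not the MATH200A program itself; the function name is my own.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(p_estimate, max_margin, confidence=0.95):
    """Smallest n whose margin of error is no more than max_margin."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n_exact = (z_crit / max_margin) ** 2 * p_estimate * (1 - p_estimate)
    # Always round UP: a smaller sample would exceed the allowed margin.
    return math.ceil(n_exact)

# Example 4: prior estimate 87/605, margin of error 2%, 95% confidence.
n_with_prior = sample_size_for_proportion(87 / 605, 0.02, 0.95)

# Same goal with no prior estimate: the conservative choice 0.5.
n_conservative = sample_size_for_proportion(0.5, 0.02, 0.95)

print(n_with_prior, n_conservative)
```

Comparing the two results shows what the prior estimate buys you: without it, the conservative choice of 0.5 roughly doubles the sample you’d have to take.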

Example 5: You’re taking the first political poll of the season, and you’d like to know what fraction of adults favor your candidate. You decide you can live with a 90% confidence level and a 3% margin of error. How many adults do you need in your random sample?

Solution: Since you have no prior estimate for p, make p̂ = .5.

With the MATH200A program (recommended):

MATH200A/sample size/binomial

p̂=.5, E=.03, C-Level=.9, n ≥ 752

If you’re not using the program:

1−α = .9, E = .03, p̂ = .5

1−α = .9 ⇒ α = 0.1 ⇒ α/2 = 0.05

z_{0.05} = invNorm(1−.05)

Divide by E, which is .03.

Square the result.

Multiply by p̂ times (1−p̂).

[Screenshot: computation of sample size, described in text]

n = 751.5… → 752

9B.  Estimating Population Mean μ When You Know σ

Numeric data are pretty much the same deal as binomial data, though there are a couple of wrinkles:

The second one is a problem, because you almost never know the standard deviation of the population. Therefore, we won’t be working any problems for this case. Instead, I’ll give you a little more theory to lay the groundwork for the next section, which explains how we get around this knowledge gap.

9B1.  Confidence Interval

If you know the standard deviation of the population — and you hardly ever do — then your confidence interval is

x̄ − z_{α/2} · σ/√n  ≤  μ  ≤  x̄ + z_{α/2} · σ/√n

If you’re ever in this situation, you can compute a confidence interval on your TI-83/84 by choosing ZInterval in the STAT TESTS menu.

9B2.  How Big a Sample Do You Need?

The margin of error is E = z_{α/2} · σ/√n, so the required sample size for a margin of error E with confidence level 1−α is

n = [ z_{α/2} · σ / E ]²

(You can also use MATH200A part 5.)
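Since this section works no problems, here is a small Python sketch of both formulas so you can see them in action. The numbers are hypothetical, made up purely for illustration (a sample mean of 100, a known σ of 15, and so on), and the function names are mine.

```python
import math
from statistics import NormalDist

def z_critical(confidence):
    """z-score with area (1 - confidence)/2 to its right."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def z_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population SD is known."""
    margin = z_critical(confidence) * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

def sample_size_for_mean(sigma, max_margin, confidence=0.95):
    """Smallest n giving a margin of error of max_margin or less."""
    return math.ceil((z_critical(confidence) * sigma / max_margin) ** 2)

# Hypothetical numbers: x-bar = 100, known sigma = 15, n = 36.
low, high = z_interval(100, 15, 36, 0.95)
needed = sample_size_for_mean(sigma=15, max_margin=2, confidence=0.95)
print(f"{low:.2f} to {high:.2f}; n needed for E <= 2: {needed}")
```

As with binomial data, the sample size is rounded up, never down.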

9C.  Estimating Population Mean μ When You Don’t Know σ

“Houston, we have a problem!” A confidence interval is founded on the sampling distribution of the mean or proportion. Everything in Chapter 8 on the sampling distribution of the mean was based on knowing the standard deviation of the population. But you almost never know the standard deviation of the population. How to resolve this?

The solution comes from William Gosset, who worked for Guinness in Dublin as a brewer. (I swear I am not making this up.) In 1908 he published a paper called The Probable Error of a Mean [see “Sources Used” at end of book]. For competitive reasons, the Guinness company wouldn’t let him use his own name, and he chose the pen-name “Student”. The t distribution that he described in his paper has been known as Student’s t ever since.

While looking for Gosset’s original paper, I stumbled on Probable Error of a Mean, The (“Student”) (Moulton [see “Sources Used” at end of book]). It’s a fascinating look at what Gosset did and didn’t accomplish, and how this classic paper was virtually ignored for years. Things didn’t start to happen till Gosset sent a copy of his tables to R. A. Fisher with the remark that Fisher was the only one who would ever use them! It was Fisher who really got the whole world using Student’s t distribution.

9C1.  Student’s t Distribution


Gosset knew that the standard error of the mean is σ/√n, but he didn’t know σ. He wondered what would happen if he estimated the standard error as s/√n, and did some experiments to answer that question. Since s varies from one sample to the next, this new t distribution spreads out more than the ND. Its peak is shallower, and its tails are fatter.

Actually, there’s no such thing as “the” t distribution. There’s a different t for each sample size. The larger the sample, the closer that t distribution is to a normal distribution, but it’s never quite normal.

For technical reasons, t distributions aren’t identified by sample size, but rather by degrees of freedom (symbol df or Greek ν, “nu”). df = n−1. Here are two t distributions:

[Figure: standard normal distribution (solid) overlaid by Student’s t (line) for df = 4, n = 5]
[Figure: standard normal distribution (solid) overlaid by Student’s t (line) for df = 29, n = 30]

What do you see? Student’s t for 4 degrees of freedom is quite a bit more spread out than the ND: 12.2% of sample means are more than two standard errors from the mean, versus only 5% for the ND.

At this scale, Student’s t for 29 degrees of freedom looks identical to the ND, but it’s not quite the same. You can see that 6% of sample means are more than two standard errors from the mean, versus 5% for the ND.
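If you’re curious where those tail percentages come from, the Python sketch below integrates the t density numerically with Simpson’s rule. Two assumptions are mine: the numerical method, which is just one convenient way to get the areas, and the cutoff of 1.96 standard errors. That is the boundary that puts exactly 5% in the tails of the ND, and with it the sketch gives about 12.2% for df = 4 and about 6% for df = 29.

```python
import math

def t_pdf(t, df):
    """Density of Student's t with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_two_tail_area(cutoff, df, steps=2000):
    """Area beyond +/-cutoff, via Simpson's rule on [0, cutoff] (steps even)."""
    h = cutoff / steps
    total = t_pdf(0, df) + t_pdf(cutoff, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_cutoff = total * h / 3
    # The curve is symmetric, so the middle is twice that area.
    return 1 - 2 * area_0_to_cutoff

for df in (4, 29):
    print(f"df = {df}: {t_two_tail_area(1.96, df):.1%} of the area is in the tails")
```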

You don’t really need a list of properties of Student’s t, because your calculator is going to do the work for you. It’s enough to know this:

9C2.  Confidence Interval for μ (Numeric Data)

The logic of confidence intervals for numeric data is the same whether you know the standard deviation of the population or not. Even the requirements are the same. The only difference is between using a z and a t.

The confidence interval formula for numeric data with unknown σ looks a lot like the one for known σ. You just replace σ by s and z by t:

x̄ − tα/2 · s/√n   ≤   μ   ≤   x̄ + tα/2 · s/√n     (1−α confidence)

(It’s understood that you have to use the right number of degrees of freedom, df = n−1, in finding critical t.)

Example 6: You’re auditing a bank. You take a random sample of 50 cash deposits and find a mean of $189.56 and standard deviation of $42.17.
(a) Estimate the mean of all cash deposits, with 95% confidence.
(b) The bank’s accounting department tells you that the average cash deposit is over $210.00. Is that believable?

Solution: You want to compute a confidence interval about the mean of all deposits. You have numeric data, and you don’t know the standard deviation of the population, σ. This is Case 1 in Inferential Statistics: Basic Cases. In your sample, n = 50, x̄ = 189.56, and s = 42.17.

First, check the requirements (RC):

Since the sample is large enough, there’s no need to verify normality or check for outliers.

Now calculate the interval.

On your TI-83/84, in the STAT TESTS menu, select 8:TInterval. The difference between Data and Stats is whether you have all the data points, or just summary statistics. In this case you have only the stats, so cursor onto Stats and press [ENTER]. (The lower part of the screen may change.)

Enter your sample statistics and your desired confidence level. Write down your inputs before you select Calculate:

TInterval 189.56, 42.17, 50, .95

Proceed to the output screen, and write down everything new. There isn’t much:

(177.58, 201.54)

TI-83/84 input screen for TInterval      TI-83/84 output screen for TInterval

Finally, write your interpretation. I’m 95% confident that the average of all cash deposits is between $177.58 and $201.54.
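For readers who want to see roughly what TInterval does with summary statistics, here is a Python sketch of the formula x̄ ± tα/2·s/√n. It is not the calculator's actual algorithm: t_cdf, t_crit, and t_interval are hypothetical helper names, and critical t is found by numerical integration plus bisection only because Python's standard library has no t distribution.

```python
import math

def t_cdf(x, df, steps=1000):
    """P(T <= x) for Student's t, by Simpson's rule on the density."""
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2))
    const /= math.sqrt(df * math.pi)
    a = abs(x)
    h = a / steps
    total = const + const * (1 + a * a / df) ** (-(df + 1) / 2)
    for i in range(1, steps):
        u = i * h
        total += (4 if i % 2 else 2) * const * (1 + u * u / df) ** (-(df + 1) / 2)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_crit(conf, df):
    """Critical t that leaves a central area of `conf` (bisection search)."""
    target = 1 - (1 - conf) / 2
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < target:      # widen until the answer is bracketed
        hi *= 2
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def t_interval(mean, s, n, conf):
    """Confidence interval for the population mean when sigma is unknown."""
    margin = t_crit(conf, n - 1) * s / math.sqrt(n)
    return mean - margin, mean + margin

# Inputs from Example 6: x-bar, s, n, confidence level
low, high = t_interval(189.56, 42.17, 50, 0.95)
print(f"({low:.2f}, {high:.2f})")
```

With the Example 6 inputs, the printed endpoints should agree with the calculator's interval to within about a cent.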

Caution! Don’t say anything like “95% of deposits are between $177.58 and $201.54.” Your confidence interval is an estimate of the true average of all deposits, and it’s not about the individual deposits. With a standard deviation of $42 and change, you would predict that 95% of deposits are within 2×42.17 = $84.34 either side of the mean, which is a much wider interval.

Now turn to part (b). Management claims that the average of all cash deposits is > $210.00. Is that believable? Well, it’s not impossible, but it’s unlikely. You’re 95% confident that the average of all deposits is between $177.58 and $201.54, which means you’re 95% confident that it’s not < $177.58 or > $201.54. But they’re claiming $210, which is outside your confidence interval. Again, they’re unlikely to be correct — there’s less than a 5% likelihood (100%−95% = 5%).

Example 7: In a random sample from the 237 vehicles on a used-car lot, the following weights in pounds were found:

2500  3250  4000  3500  2900  4500  3800  3000  5000  2200

Estimate the average weight of vehicles on the lot, with 90% confidence.

Solution: Check the requirements first. You have a small sample (n < 30), so you have to verify that the data are ND and there are no outliers. Here are the results of normality check and box-whisker plot in MATH200A:

normality check showing r=.9936, crit=.9179      box-whisker plot showing no outliers

There’s not much you can write for the box-whisker plot, but you can show the normality test numerically: r = .9936 > crit = .9179.

Now proceed to your TInterval. This time you have the actual data, so you choose Data on the screen. Specify your data list. Freq (frequency) should already be set to 1; if not, first press the [ALPHA] key once, and then [1] [ENTER]. Enter your confidence level, and write down your inputs:

TInterval L1, 1, .90

TI-83/84 input screen for TInterval      TI-83/84 output screen for TInterval

When you have raw data, everything on the output screen is new:

(2956, 3974)

x̄ = 3465, s = 878.1, n = 10

You’re 90% confident that the average weight of all vehicles on the lot is between 2956 and 3974 pounds.

Again, this is an estimate of the average weight of the population (the 237 cars on the lot). In your interpretation, you can’t say anything about the weights of individual vehicles, because you don’t know anything about the weights of individual vehicles, apart from your sample.
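The raw-data version of the computation can be sketched in Python too: find x̄ and s from the list, then apply the same formula. As a caution, this is an illustration and not the calculator's own method; t_cdf and t_crit are hypothetical helpers that find critical t by numerical integration and bisection.

```python
import math
import statistics

def t_cdf(x, df, steps=1000):
    """P(T <= x) for Student's t, by Simpson's rule on the density."""
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2))
    const /= math.sqrt(df * math.pi)
    a = abs(x)
    h = a / steps
    total = const + const * (1 + a * a / df) ** (-(df + 1) / 2)
    for i in range(1, steps):
        u = i * h
        total += (4 if i % 2 else 2) * const * (1 + u * u / df) ** (-(df + 1) / 2)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_crit(conf, df):
    """Critical t that leaves a central area of `conf` (bisection search)."""
    target = 1 - (1 - conf) / 2
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < target:
        hi *= 2
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Vehicle weights in pounds, from Example 7
weights = [2500, 3250, 4000, 3500, 2900, 4500, 3800, 3000, 5000, 2200]
n = len(weights)
xbar = statistics.mean(weights)
s = statistics.stdev(weights)          # sample standard deviation (n - 1 divisor)
margin = t_crit(0.90, n - 1) * s / math.sqrt(n)
low, high = xbar - margin, xbar + margin
print(f"x-bar = {xbar}, s = {s:.1f}, n = {n}")
print(f"90% CI: ({low:.0f}, {high:.0f})")
```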

9C3.  The Trouble with Outliers

Why do you have to check for outliers? If your sample passes the normality check, isn’t that enough? No! If a sample passes the normality check, it still might have outliers.

How can this be? How can a sample that contains outliers still pass the normality check? Well, back in Chapter 7 I said that if r > crit you can use the normal model and if r < crit you can’t. But that simple rule hides a more complicated truth.

No sample is perfectly normal, so you’re not actually deciding “is it normal or not?” Instead, you’re finding the strength of evidence against normality. The smaller r is, the stronger the evidence against a ND. If r < crit, the evidence is so strong that you say the data are non-normal. But if r > crit, you can’t say that the data are definitely normal, only that you can’t rule out a ND based on this test. But outliers make the evidence against the normal model too strong, so if outliers are present then you can’t treat the data as normal.

This “fail to prove” is similar to what you saw in Chapter 4 with decision points: you could prove that the correlation was non-zero, but you couldn’t prove that it was zero. Starting in Chapter 10, you’ll see that this is how inferential statistics works whenever you’re testing some proposition.

Why are outliers a problem? Well, your confidence interval depends on the mean and standard deviation of your sample. But x̄ and s are sensitive to outliers. (That sensitivity goes down as sample size goes up, so you don’t have to worry with samples bigger than about 30.)

To make this clearer, let’s look at an example. I drew these 15 points from a moderately skewed population:

 157   171   182   189   201   208   217   219 
 229   242   247   252   265   279   375 

The normality test shows r > crit. So far so good. But the box plot shows a big honkin’ outlier:

TI-83/84 output screen for normality check    TI-83/84 box-whisker plot

How big a difference does it make? Quite a lot, unfortunately. Here are the 95% confidence intervals for the original sample, and the sample with the outlier removed. The means are different, the standard deviations are really different, and the high ends of the confidence intervals are pretty different too. (The screens don’t show the margins of error, but they too are quite different: (258.45-199.28)/2 = 29.6 and (239.36-197.5)/2 = 20.9.)

TI-83/84 confidence interval from whole sample: (199.28,258.45), x-bar=228.9, s=53.4, n=15
95% CI from full sample
TI-83/84 confidence interval after excluding outlier: (197.5,239.36), x-bar=218.4, s=36.2, n=14
95% CI excluding outlier

Do you say that the outlier increased the mean by almost 5% and the SD by almost 50%, moved the confidence interval and made it wider? That’s not really fair — the sample is what it is (assuming you’ve ruled out a mistake in data entry). If you start throwing out points, you no longer have a random sample. On the other hand, that one point does seem to carry an awful lot of weight, and it doesn’t seem right to have results depend so heavily on one point.
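The comparison above can be reproduced with a short Python sketch that uses only the standard library. Two assumptions are built in: outliers are flagged with the usual 1.5 × IQR fences of a box-whisker plot, and quartiles are taken as medians of the lower and upper halves of the sorted data. The function name ti_style_quartiles is made up for illustration.

```python
import statistics

sample = [157, 171, 182, 189, 201, 208, 217, 219,
          229, 242, 247, 252, 265, 279, 375]

def ti_style_quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves.

    When n is odd, the middle value belongs to neither half (assumed convention).
    """
    xs = sorted(data)
    half = len(xs) // 2
    return statistics.median(xs[:half]), statistics.median(xs[-half:])

q1, q3 = ti_style_quartiles(sample)
iqr = q3 - q1
fence_low, fence_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in sample if x < fence_low or x > fence_high]
trimmed = [x for x in sample if x not in outliers]

mean_full, sd_full = statistics.mean(sample), statistics.stdev(sample)
mean_trim, sd_trim = statistics.mean(trimmed), statistics.stdev(trimmed)
pct_mean = mean_full / mean_trim - 1   # relative increase caused by the outlier
pct_sd = sd_full / sd_trim - 1

print("outliers:", outliers)
print(f"full sample: mean {mean_full:.1f}, SD {sd_full:.1f}")
print(f"outlier removed: mean {mean_trim:.1f}, SD {sd_trim:.1f}")
print(f"mean up {pct_mean:.1%}, SD up {pct_sd:.1%}")
```

Running both versions side by side is the same "do the analysis both ways" idea, just applied to the summary statistics.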

So what do you do? If you can, you take another sample, preferably a larger one. Larger samples are less likely to have outliers in the first place, and outliers that do occur have less influence on the results.

But taking a new sample may not be practical. An alternative — not really great, but better than nothing — is to do the analysis both ways, once with the full sample and once with the outlier(s) excluded. That will at least give a sense of how much the outliers affect the results.

9C4.  How Big a Sample for Numeric Data?

With the MATH200A program (recommended): If you’re not using the program:

Example 8: For the vehicle weights, your margin of error in a 90% CI was 3974−3465 = 509 pounds. How many vehicles would you need in your sample to get a 95% confidence interval with a margin of error of 500 pounds?

Solution: In MATH200A part 5, select 2:Num unknown σ since you don’t know the standard deviation of the population. You’re first prompted for the estimated standard deviation s, which is based on your sample. Enter that, then the desired margin of error E and the desired confidence level.

Figure: sample size computation, as described in the text

When you enter the last piece of information, you’ll notice that the calculator takes several seconds to come up with an answer; this is normal because it has to do an iterative calculation (fancy words for trial and error).

Critical t for a 95% CI with 14 degrees of freedom (n = 15) is 2.14, larger than critical z of 1.96 because the t distribution is more spread out. But of course what you really care about is the bottom line: to keep margin of error no greater than 500 pounds in a 95% CI, you need to sample at least 15 vehicles.

How is this computed? Start with the margin of error and solve for sample size:

E = tα/2·s/√n   ⇒   n = [tα/2·s/E]²

The problem here is that tα/2 depends on df, which depends on n, so you haven’t really isolated sample size on the left side. The only way to solve this equation precisely is by a process of trial and error, and that’s what MATH200A does.
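The trial-and-error idea can be sketched in a few lines of Python: start with a small n and keep increasing it until the margin of error tα/2·s/√n drops to E or below. This only illustrates the idea and is not necessarily the procedure MATH200A follows internally; the helper names are hypothetical, and critical t is found by numerical integration and bisection.

```python
import math

def t_cdf(x, df, steps=1000):
    """P(T <= x) for Student's t, by Simpson's rule on the density."""
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2))
    const /= math.sqrt(df * math.pi)
    a = abs(x)
    h = a / steps
    total = const + const * (1 + a * a / df) ** (-(df + 1) / 2)
    for i in range(1, steps):
        u = i * h
        total += (4 if i % 2 else 2) * const * (1 + u * u / df) ** (-(df + 1) / 2)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_crit(conf, df):
    """Critical t that leaves a central area of `conf` (bisection search)."""
    target = 1 - (1 - conf) / 2
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < target:
        hi *= 2
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def sample_size_unknown_sigma(s, E, conf):
    """Smallest n whose margin of error t*s/sqrt(n) is no greater than E."""
    n = 2                              # df = n - 1 must be at least 1
    while t_crit(conf, n - 1) * s / math.sqrt(n) > E:
        n += 1
    return n

# Example 8: s = 878.1 pounds, E = 500 pounds, 95% confidence
n_needed = sample_size_unknown_sigma(878.1, 500, 0.95)
print(n_needed)
```

Because critical t changes every time n changes, the loop has to recompute it on each pass, which is why the calculator takes a few seconds.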

What if you don’t have the program? Since t is not super different from the normal distribution, you can alter the above formula and use z in place of t: n = [zα/2·s/E]². But the t distribution is more spread out than the normal (z) distribution, so your answer may be smaller than the actual necessary sample size. If you do that and you get > about 30, it’s probably nearly right for the t distribution. If your answer is small, you should increase it so that the TInterval doesn’t come out with too large a margin of error.

You calculate zα/2 exactly as you did in the sample-size formula for a confidence interval about a proportion. For example, with a 95% CI, 1−α = 0.95, α = 0.05, and α/2 = 0.025. zα/2 = z0.025 = invNorm(1−.025) = 1.9600. So using z for t you compute sample size [1.96·878.1/500]² = 11.8… → 12. That’s well under 30, so you want to bump it up a bit.
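The z-based shortcut is easy to check in Python, since the standard library's NormalDist plays the role of invNorm. This is a small sketch; the function name sample_size_using_z is made up for illustration.

```python
import math
from statistics import NormalDist

def sample_size_using_z(s, E, conf):
    """Approximate sample size n = [z * s / E]^2, rounded up."""
    alpha = 1 - conf
    z = NormalDist().inv_cdf(1 - alpha / 2)   # same role as invNorm(1 - alpha/2)
    return math.ceil((z * s / E) ** 2)

# Vehicle weights: s = 878.1 pounds, E = 500 pounds, 95% confidence
n_approx = sample_size_using_z(878.1, 500, 0.95)
print(n_approx)
```

The result is a starting point only; for answers this small the t-based requirement is larger.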

I’m deliberately glossing over this, because the program is a lot easier. But if you want more, check out Case 1 in How Big a Sample Do I Need? That page gives you all the details of the method, with worked-out examples.

At first glance, this procedure is less precise than the successive approximations done by MATH200A. But in fairness, there’s one more source of un-preciseness that neither method can avoid. Unlike binomial data, where small variations in the prior estimate made little difference to the computed sample size, for numeric data variations in the standard deviation do make a difference in computed sample size. Since s is squared in the formula, it can be a big difference. This can swamp any pettifogging details about t versus z.

What Have You Learned?

Key ideas:

(The online book has live links.)

Study aids:


Exercises for Chapter 9

Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.

Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

1 For a confidence interval, people sometimes say, “There’s a 95% chance that the mean height of all trees in the forest is …” Why is that not correct — why must you say “I’m 95% confident” and not “there’s a 95% chance”?
2 Simple Simon took a random sample of 40 TC3 students and computed a 90% confidence interval of (45.20,60.14) for weekly food expense per student. (He checked all requirements, and his computations were correct.) He wrote, “I’m 90% confident that TC3 students spend between $45.20 and $60.14 a week for food.” There’s a huge mistake in that conclusion. Identify it, and write a correct conclusion.
3 Silly Sally took a random sample of 150 TC3 students and found that 50 of them “usually” or “always” prepare their own food instead of buying from the cafeteria. She computed a 90% confidence interval of (.27002, .39664) and reported “With 90% confidence, on average 27% to 40% of TC3 students usually or always prepare their own food instead of buying from the cafeteria.” What is the biggest mistake in her conclusion?

4 The Neveready Company tested 40 randomly selected A-cell batteries to see how long they would operate a wireless mouse. They found a mean of 1756 minutes (29 hours, 16 minutes) and standard deviation (SD) 142 minutes. With 95% confidence, what’s the average life of all Neveready A cells in wireless mice?

5 In World War II, a prisoner of war flipped a coin 10,000 times and recorded 5,067 heads.
(a) Find the point estimate for proportion of heads.
(b) What is the sample size? What is the population size?

6 You’re planning to conduct a poll about people’s attitudes toward a hot political issue, and you have absolutely no idea what proportion will be in favor and what proportion will be opposed. If you want a margin of error no more than 3.5% at 95% confidence, how large must your sample be?


7 The Department of Veterans Affairs is under fire for slow processing of veterans’ claims. An investigator for the Nightly Show randomly selected 100 claims (out of 68,917 at one office) and found that 40 of them had been open for more than a year. Find a 90% confidence interval for the proportion of all claims that have been open for more than a year.


8 For her statistics project, Sandra kept track of her commute times for 40 consecutive mornings (8 weeks). Treat this as a random sample. Her mean commute time was 17.7 minutes and her SD was 1.8 minutes. Find a 95% confidence interval for her average time on all commutes, not just this sample of 40.


9 Fifteen women in their 20s were randomly selected for health screening. As part of this, their heights in inches were recorded:

62.5  63  67  63.5  62  63  65  64.5  66.5  64.5  62.5  62  61.5  64.5  67.5

Construct a 95% confidence interval for the average height of women aged 20–29.


10 For his statistics project, Fred measured the body temperature of 18 randomly selected healthy male students. Here are his figures in °F:

98.3  97.7  98.6  98.5  97.5  98.6  98.2  96.9  97.9
96.9  97.8  99.3  98.6  99.2  96.9  97.8  97.9  98.3

(a) Write a 90% confidence interval for the average body temperature of healthy male students.

(b) What does this say about the famous “normal” temperature of 98.6°?

(c) What is his margin of error?

(d) To get an answer to within 0.1° with 95% confidence, how many students would he have to sample?


11 The Colorectal Cancer Screening Guidelines (CDC 2014 [see “Sources Used” at end of book]) recommend a colonoscopy every ten years for adults aged 50 to 75. A public-health researcher interviews a simple random sample of 500 adults aged 50–75 in Metropolis (pop. 6.4 million) and finds that 219 of them have had a colonoscopy in the past ten years.
(a) What proportion of all Metropolis adults in that age range have had a colonoscopy in the past ten years, at the 90% level of confidence?
(b) Still at the 90% confidence level, what sample size would be required to get an estimate within a margin of error of 2%, if she uses her sample proportion as a prior estimate?


12 The next year, you go back to audit the bank again. This time, you take a random sample of 20 cash deposits. Here are the amounts:

192.68  188.24  152.37  211.73  201.57
167.79  177.19  191.15  209.22  178.49
185.90  226.31  192.38  190.23  156.13
224.07  191.78  203.45  186.40  160.83

Construct a 95% confidence interval for the average of all cash deposits at the bank.


13 Not wanting to wait for the official results, Abe Snake commissioned an exit poll of voters. In a systematic sample of 1000 voters, 520 (52%) said they voted for Abe Snake. (14,000 people voted in the election.) That sounds good, but can he be confident of victory, at the 95% level?



10. Hypothesis Tests

Updated 1 Jan 2016 (What’s New?)

Summary: You want to know if something is going on (if there’s some effect). You assume nothing is going on (null hypothesis), and you take a sample. You find the probability of getting your sample if nothing is going on (p-value). If that’s too unlikely, you conclude that something is going on (reject the null hypothesis). If it’s not that unlikely, you can’t reach a conclusion (fail to reject the null).


10A.  Testing a Proportion (Binomial Data)

Remember the Swain v. Alabama example? In a county that was 26% African American, Mr. Swain’s jury pool of 100 men had only eight African Americans. In that example, you assumed that selection was not racially biased, and on that basis you computed the probability of getting such a low proportion. You found that it was very unlikely. This disconnect between the data and the claim led you to reject the claim.

You didn’t know it, but you were doing a hypothesis test. This is the standard way to test a claim in statistics: assume nothing is going on, compute the probability of getting your sample, and then draw a conclusion based on that probability. In this chapter, you’ll learn some formal methods for doing that.

The basic procedure of a hypothesis test or significance test is due to Jerzy Neyman (1894–1981), a Polish American, and Egon Pearson (1895–1980), an Englishman. They published the relevant paper in 1933.

We’re going to take a seven-step approach to hypothesis tests. The first examples will be for binomial data, testing a claim about a population proportion. Later in this chapter you’ll use a similar approach with numeric data to test a claim about a population mean. In later chapters you’ll learn to test other kinds of claims, but all of them will just be variations on this theme.

10A1.  Example 1: Swain v. Alabama

Step 1: Hypotheses

Your first task is to turn the claim into algebra. The claim may be that nothing is going on, or that something is going on. You always have two statements, called the null and alternative hypotheses.

Definition: The null hypothesis, symbol H0, is the statement that nothing is going on, that there is no effect, “nothin’ to see here. Move along, folks!” It is an equation, saying that p, the proportion in the population (which you don’t know), equals some number.

Definition: The alternative hypothesis, symbol H1, is the statement that something is going on, that there is an effect. It is an inequality, saying that p is different from the number mentioned in H0. (H1 could specify <, >, or just ≠.)

The hypotheses are statements about the population, not about your sample. You never use sample data in your hypotheses. (In real life you can’t make that mistake, since you write your hypotheses before you gather data. But in the textbook and the classroom, you always have sample data up front, so don’t make a rookie mistake.)

You must have the algebra (symbols) in your hypotheses, but it can also be helpful to have some English explaining the ultimate meaning of each hypothesis, or the consequences if each hypothesis is true. Here you want to know whether there’s racial bias in jury selection in the county.

You don’t want to know if the proportion of African Americans in Mr. Swain’s jury pool is less than 26%: obviously it is. You want to know if it’s too different — if the difference is too great to be believable as the result of random chance.

Write your hypotheses this way:

(1) H0: p = 0.26, there’s no racial bias in jury selection
H1: p < 0.26, there is racial bias in jury selection

Obviously those can’t both be true. How will you choose between them? You’ll compute the probability of getting your sample (or a more unexpected one), assuming that the null hypothesis H0 is true, and one of two things will happen. Maybe the probability will be low. In that case you rule out the possibility that random chance is all that’s happening in jury selection, and you conclude that the alternative hypothesis H1 is true. Or maybe the probability won’t be too low, and you’ll conclude that this sample isn’t unusual (unexpected, surprising) for the claimed population.

The number in your null hypothesis H0, with binomial data, is called po because it’s the proportion as given in H0. (You may want to refer to the Statistics Symbol Sheet to help you keep the symbols straight.)

What exactly is p? Yes, it’s the population proportion being tested, but what’s the population? It can’t be people in the county, or men in the county, or African-American men in the county.

In fact it’s all people serving on Talladega County jury pools past, present and future. If there’s racial bias, then African Americans are less likely to be selected than whites, and — probability of one, proportion of all — therefore the overall population of jury pools has less than 26% African Americans. If there’s no racial bias, then in the long run the overall population of jury pools has the same 26% of African Americans as the county.

Although a hypothesis test is officially about the population, in cases like this one it’s okay to think of it as answering a simpler question: Is the difference between the claim of no racial bias and the reality of this sample significant, or could it be explained away as the result of random chance? The hypotheses are the same either way, the calculations are the same, and the conclusions are the same.

This is why a hypothesis test is also called a significance test or a test of significance.

Step 2: Significance Level

Okay, you’re looking to figure out if this sample is inconsistent with the null hypothesis. In other words, is it too unlikely, if the null hypothesis H0 is true? But what do you mean by “too unlikely”? Back in Chapter 5, we talked about unusual events, with a threshold of 5% or 0.05 for such events. We’ll use that idea in hypothesis testing and call it a significance level.

Definition: The significance level, symbol α (the Greek letter alpha), is the chance of being wrong that you can live with. By convention, you write it as a decimal, not a percentage.

(2) α = 0.05

A significance level of 0.05 is standard in business and science. If you can’t tolerate a 5% chance of being wrong — if the consequences are particularly serious — use a lower significance level, 0.01 or 0.001 for example. (0.001 is common if there’s a possibility of death or serious disease or injury.) If the consequences of being wrong are especially minor, you might use a higher significance level, such as 0.10, but this is rare in practice.

In a classroom setting, you’re usually given a significance level α to use.

Later in this chapter, you’ll see that the significance level α is actually concerned with a particular way of being wrong, a Type I error.

Step RC: Requirements Check

Back in Chapter 8, you learned the CLT’s requirements for binomial data: random sample not larger than 10% of population, and at least 10 successes and 10 failures expected if the null hypothesis is true. You compute expected successes as npo by using po, which is the number from H0. Expected failures are then sample size minus expected successes, n − npo in symbols. Steps 3 and 4 need the sampling distribution of the proportion to be a ND, so you must check the requirements as part of your hypothesis test.

  • Random sample? Yes, according to the county. ✔
  • 10n = 10×100 = 1000. We don’t know the number of adult males in the county, but it must be greater than 1000, surely. (“I know that, and don’t call me Shirley.”) ✔
  • Expected successes = npo = 100×.26 = 26; expected failures are 100−26 = 74; both are ≥ 10. ✔
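The arithmetic in this requirements check is simple enough to script. The sketch below is a hypothetical Python helper (binomial_requirements is not a name from this book or from the calculator). It cannot judge whether the sample is random, so that answer is passed in by hand, and the population size is optional because here it is unknown.

```python
def binomial_requirements(n, p0, random_sample, population=None):
    """Return the requirement checks for a one-proportion z test."""
    expected_successes = n * p0            # computed from H0, not from the sample
    expected_failures = n - expected_successes
    return {
        "random sample": random_sample,
        # None means the population size is unknown; judge by common sense
        "10n <= N": None if population is None else 10 * n <= population,
        "expected successes": expected_successes,
        "expected failures": expected_failures,
        "at least 10 of each": min(expected_successes, expected_failures) >= 10,
    }

# Swain v. Alabama: n = 100, po = 0.26, sample said to be random
checks = binomial_requirements(n=100, p0=0.26, random_sample=True)
for name, value in checks.items():
    print(f"{name}: {value}")
```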

You might wonder about the first test. “The county may say it’s random, but I don’t believe it. Isn’t that why we’re running this test?” Good question! Answer: Every hypothesis test assumes the null hypothesis is true and computes everything based on that. If you end up deciding that the sample was too unlikely, in effect you’ll be saying “I assumed nothing was going on, but the sample makes that just too hard to believe.”

This same idea (the null hypothesis H0 is innocent till proven guilty) explains why you use 0.26 (po) to figure expected successes and failures, not 0.08 (p̂). Again, the county claims that there’s no racial bias. If that’s true, if there’s no funny business going on, then in the long run 26% of members of jury pools should be African American.

Comment: Usually, if requirements aren’t met you just have to give up. But for one-population binomial data, where the other two requirements are met but expected successes or failures are much under 10, you can use MATH200A part 3 to compute the p-value directly. There’s an example in “Small Samples”, below.

Steps 3/4: Test Statistic and p-Value

This is the heart of a hypothesis test. You assume that the null hypothesis is true, and then use what you know about the sampling distribution to ask: How likely is this sample, given that null hypothesis?

Definition: A test statistic is a standardized measure of the discrepancy between your null hypothesis H0 and your sample. It is the number of standard errors that the sample lies above or below H0.

You can think of a test statistic as a measure of unbelievability, of disagreement between H0 and your sample. A sample hardly ever matches your null hypothesis perfectly, but the closer the test statistic is to zero the better the agreement, and the further the test statistic is from 0 the worse the sample and the null hypothesis disagree with each other.

Because you showed that the sampling distribution is normal and the standard error of the proportion is implicitly known, this is a z test. The test statistic is z = (p̂ − po) / σp̂, where σp̂ = √[po(1−po)/n], but as you’ll see your calculator computes everything for you.

TI-83 input screen for 1-PropZTest
TI-83 output screen for 1-PropZTest

Definition: The p-value is the probability of getting your sample, or a sample even further from H0, if H0 is true. The smaller the p-value, the stronger the evidence against the null hypothesis.

Inferential Statistics: Basic Cases tells you that binomial data in one population are Case 2. This is a hypothesis test of population proportion, and you use 1-PropZTest on your calculator.

To get to that menu selection, press [STAT] [◄] [5]. Enter po from the null hypothesis H0, followed by the number of successes x, the sample size n, and the alternative hypothesis H1. Write everything down before you select Calculate. When you get to the output screen, check that your alternative hypothesis H1 is shown correctly at the top of the screen, and then write down everything that’s new.

(3/4) 1-PropZTest .26, 8, 100, <po
outputs: z = −4.10, p-value = 0.000 020, p̂ = 0.08
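To see where those outputs come from, here is a minimal Python sketch of a one-proportion z test, built from the test statistic z = (p̂ − po) / σp̂ with σp̂ = √[po(1−po)/n]. The function name one_prop_z_test and its tail argument are inventions for this illustration, not the calculator's interface.

```python
import math
from statistics import NormalDist

def one_prop_z_test(p0, x, n, tail):
    """Return (z, p-value, p-hat); tail is '<', '>' or '!=' as in H1."""
    p_hat = x / n
    std_error = math.sqrt(p0 * (1 - p0) / n)   # uses po from H0, not p-hat
    z = (p_hat - p0) / std_error
    left = NormalDist().cdf(z)                 # area to the left of z
    if tail == "<":
        p_value = left
    elif tail == ">":
        p_value = 1 - left
    else:
        p_value = 2 * min(left, 1 - left)      # two-tailed
    return z, p_value, p_hat

# Swain v. Alabama: po = .26, 8 successes in a sample of 100, H1: p < po
z, p_value, p_hat = one_prop_z_test(0.26, 8, 100, "<")
print(f"z = {z:.2f}, p-value = {p_value:.6f}, p-hat = {p_hat}")
```

Printing the p-value with six decimal places avoids the powers-of-10 display that can trip you up on the calculator.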

By convention, we round the test statistic to two decimal places and the p-value to four decimal places.

When the p-value is less than one in ten thousand, you need more than four decimal places. Some authors just write “p <.0001” when the p-value is that small; they figure nobody cares about fine shades of very low probability. Feel free to use that alternative.

Caution! Watch for powers of 10 (E minus whatever) and never write something daft like “p-value = 2.0346”.

What do these outputs of the 1-PropZTest tell you? The sample proportion, p̂ = 0.08, is more than 4 standard errors below the supposed population proportion, po = 0.26. Your test statistic is z = −4.10. Since 95% of samples have z scores within ±2, this is surprising. How surprising, that’s what the p-value tells you.

How likely is it to get this sample, or one with even a smaller sample proportion, if the null hypothesis H0 is true? The p-value is 0.000 020, so if there’s no racial bias in selection then there are only two chances in a hundred thousand of getting eight or fewer African Americans in a 100-man jury pool. (There’s a lot more about interpreting the p-value later in this chapter.)

You don’t actually use the z-score, but I want you to understand something about what a test statistic is. Every case you study will have a different test statistic, and in fact choosing a test statistic is the main difference between cases.

Why does one step have two numbers? In the olden days, when dinosaurs roamed the earth and a slide rule was the hot new thing, you had to compute the SEP and then the z-score; that was step 3. Then you had to look up z in a printed table to find the p-value; that was step 4. The TI-83 or TI-84 gives you both at the same time, but I’ve kept the numbering of steps.

Step 5: Decision Rule


There are two and only two possibilities, and all you have to do is pick the correct one based on your p-value and your α:

p < α. Reject H0 and accept H1.


p > α. Fail to reject H0.

Caution! There are lots of p’s in problems involving population proportions (Case 2), so make sure you select the right one. The p-value is the first p on the 1-PropZTest output screen.

You can add the numbers, if you like — p < α (0.000 020 < 0.05) — but the symbols are required.

(5) p < α. Reject H0 and accept H1.

What are you saying here? The p-value was very small, so that means the chance of getting this sample, if there’s no racial bias, was very small. Previously, you set a significance level of 0.05, meaning you would consider this sample too unlikely if its probability was under 5%. Its probability is under 5%, so the sample and the null hypothesis contradict each other. The sample is what it is, so you can’t reject the sample. Therefore you reject H0 and accept H1 — you declare that there is racial bias.
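The decision rule is mechanical enough to write as a tiny Python function. This is only a sketch with a made-up name (decide). The rule in this book lists p < α and p > α; treating a tie p = α as "fail to reject" is an assumption added here.

```python
def decide(p_value, alpha=0.05):
    """Apply the step 5 decision rule to a p-value and significance level."""
    if p_value < alpha:
        return "Reject H0 and accept H1"
    # p >= alpha: the sample is not unlikely enough to rule out H0
    return "Fail to reject H0"

# Swain v. Alabama: p-value = 0.000 020, alpha = 0.05
decision = decide(0.000020, alpha=0.05)
print(decision)
```

Note that alpha is an input chosen ahead of time; the function never adjusts it after seeing the p-value.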

Another way to look at it: Any sample will vary from the population because random selection is always operating to produce sampling error. But the difference between this sample and the supposed population proportion is just too great to be produced by random selection alone. Something else must be going on also. That something else is the alternative hypothesis H1.

Definition: When the p-value is below α, the sample is too unlikely to come from ordinary sample variability alone, and you have a significant result, or your result is statistically significant.

You always select a significance level before you know the p-value. If you could first get the p-value and then specify a significance level, you could get whichever result you wanted, and there would be no point to doing a hypothesis test at all. Choosing α up front keeps you honest.

Step 6: Conclusion (in English)

Since you accepted H1 in the previous step, that’s your conclusion. If you have already written it in English as part of the hypotheses, as I did, then most of your work is already done. You do need to add the significance level or the p-value, so your conclusion will look something like one of these:

(6) The 8% proportion of African American men in Mr. Swain’s jury pool is significantly below the expected 26%, and this is evidence at the 0.05 level of significance of racial bias in the selection.


(6)The 8% proportion of African American men in Mr. Swain’s jury pool is significantly below the expected 26%, and this is evidence of racial bias in the selection (p = 0.000 020).

If you’re publishing your hypothesis test, you’ll want to write a thorough conclusion that still makes sense if it’s read on its own. But in class exercises you don’t have to write so much. It’s enough to write “At the 0.05 significance level, there is racial bias in jury selection” or “There is racial bias in jury selection (p = 0.000 020)”.

10A2.  Example 2: Cancer Screening

The Colorectal Cancer Screening Guidelines (CDC 2014 [see “Sources Used” at end of book]) recommend a colonoscopy every ten years for adults aged 50 to 75. A public-health researcher believes that only a minority are following this recommendation. She interviews a simple random sample of 500 adults aged 50–75 in Metropolis (pop. 6.4 million) and finds that 235 of them have had a colonoscopy in the past ten years. At the 0.05 level of significance, is her belief correct?

Solution: The population is adults aged 50–75 in Metropolis. You want to know whether a minority of them — under 50% — follow the colonoscopy guideline. Each person either does or does not, so you have binomial data, a test of proportion (Case 2 in Inferential Statistics: Basic Cases). Try to write out the hypothesis test yourself before you look at mine below.

Reminder: Even though you already have the sample data in the problem, when you write the hypotheses, ignore the sample. In principle, you write the hypotheses, then plan the study and gather data. If you use any of the sample data in the hypotheses, something is wrong.

You should have written something pretty close to this:

(1) H0: p = 0.5, half the seniors of Metropolis follow the guideline
H1: p < 0.5, less than half follow the guideline
(2) α = 0.05
  • Random sample? Yes.
  • 10n ≤ N? Yes, 10n = 10×500 = 5000, surely less than the number of adults aged 50–75 in a population of 6,400,000.
  • At least 10 successes and 10 failures expected? Yes, npo = 500×.5 = 250, and n − npo = 500−250 = 250.
(3/4) 1-PropZTest: po=.5, x=235, n=500, p<po
outputs: z=−1.34, pval=0.0899, p̂=0.47
(5) p > α. Fail to reject H0.
(6) At the 0.05 level of significance, it’s impossible to say whether less than half of Metropolis seniors aged 50–75 follow the CDC guideline for a colonoscopy every ten years or not.
[Or, It’s impossible to say whether less than half of Metropolis seniors aged 50–75 follow the CDC guideline for a colonoscopy every ten years or not (p = 0.0899).]
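If you’d like to check the 1-PropZTest numbers without a calculator, here is a minimal Python sketch of the same one-proportion z test. It uses only the standard library, and the function name one_prop_z_test is made up for this illustration; it is not a TI command or a library routine.

```python
from math import sqrt
from statistics import NormalDist


def one_prop_z_test(x, n, p0, tail):
    """One-proportion z test; tail is '<', '>' or '!=' (the symbol in H1)."""
    p_hat = x / n
    se = sqrt(p0 * (1 - p0) / n)      # standard error, computed assuming H0 is true
    z = (p_hat - p0) / se
    cdf = NormalDist().cdf(z)         # area to the left of z under the standard normal curve
    if tail == "<":
        p_value = cdf
    elif tail == ">":
        p_value = 1 - cdf
    else:
        p_value = 2 * min(cdf, 1 - cdf)
    return z, p_value, p_hat


# Example 2: 235 of 500 adults screened, H1: p < 0.5
z, p_value, p_hat = one_prop_z_test(x=235, n=500, p0=0.5, tail="<")
print(f"z = {z:.2f}, p-value = {p_value:.4f}, p-hat = {p_hat:.2f}")
```

The p-value comes out above 0.05, which is why step 5 reads “fail to reject H0”.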

Important: When p is greater than α, you fail to reach a conclusion. In this situation, you must use neutral language. You mention both possibilities without giving more weight to either one, and you use words like “impossible to say” or “can’t determine”.

This is unsatisfying, frankly. You go through all the trouble of gathering data and then you end up with a non-conclusion. Can anything be salvaged from this mess?

Yes, you can do a confidence interval. This at least will let you set bounds on what percent of all seniors follow the guidelines. You’ve already tested requirements as part of the hypothesis test, so go right into your calculations and conclusion. You’re free to pick any confidence level you wish, but 95% is most usual.

1-PropZInt, 235, 500, .95

outputs: (.42625, .51375)

42.6% to 51.4% of Metropolis seniors aged 50–75 follow the CDC guideline on screening for colorectal cancer.
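The same interval can be sketched in Python. This assumes the usual normal-approximation formula, p̂ ± z*·√(p̂(1−p̂)/n); the helper name one_prop_z_interval is hypothetical, chosen only to echo the calculator command.

```python
from math import sqrt
from statistics import NormalDist


def one_prop_z_interval(x, n, conf_level):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = x / n
    # critical z: leaves (1 - conf_level)/2 in each tail
    z_star = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


low, high = one_prop_z_interval(235, 500, 0.95)
print(f"({low:.5f}, {high:.5f})")
```

Notice that the interval contains 0.5, which fits with the test’s failure to reject H0.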

In a classroom setting, or on regular homework, if you’re assigned a hypothesis test do that and don’t feel obligated to do a confidence interval also. But in real life, and on labs and projects for class, you’ll usually want to do both.

10A3.  Example 3: Small Samples

What if your sample is so small that expected successes npo or expected failures n − npo are under 10? You can no longer use 1-PropZTest, which assumes that the sampling distribution of the proportion is ND, but you can compute the binomial probability directly as long as the other two requirements are still met (SRS and 10n≤N). Only the calculation of the p-value changes.

Example: In 2001, 9.6% of Fictional County motorists said that fuel efficiency was the most important factor in their choice of a car. For her statistics project, Amber set out to prove that the percentage has increased since then. She interviewed 80 motorists in a systematic sample of those registering vehicles at the DMV, and 13 of them said that fuel efficiency was the most important factor in their choice of a car. Test her hypothesis, at the 0.05 significance level.

Please write out your hypothesis test before you look at mine.

(1) H0: p = 0.096, percentage has not increased
H1: p > 0.096, percentage has increased
(2) α = 0.05
  • SRS? Systematic sample can be analyzed like a random sample. ✔
  • 10n≤N? 10×80 = 800, less than number of car owners in any county. ✔
  • Expected successes are npo = 80×.096 = 7.7, too far below 10 to live with. ✘

The sampling distribution of p̂ doesn’t follow the normal model, so you can’t use 1-PropZTest. But the other two requirements are met, so you can proceed, calculating the binomial probability directly.

(3/4) MATH200A/Binomial prob: n=80, p=0.096, x=13 to 80; p-value = 0.0410

(If you don’t have the program, use 1−binomcdf(80,0.096,12) = 0.0410.)

[Why 13 to 80? H1 contains >, so you test the probability of getting the sample you got, or a larger one, if H0 is true. If H1 contained <, x would be 0 to 13, the sample you got or a smaller one. See Surprised? in Chapter 6.]

(5) p < α. Reject H0 and accept H1.
(6) At the 0.05 significance level, the percentage of Fictional County motorists who rate fuel efficiency as most important has increased since 2001.
[Or, The percentage of Fictional County motorists who rate fuel efficiency as most important has increased since 2001 (p = 0.0410).]
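Here is a short Python sketch of the exact binomial calculation in step 3/4, for readers who want to see what “13 to 80” means arithmetically. The helper binomial_tail is a made-up name; it just adds up binomial probabilities over a range of x values.

```python
from math import comb


def binomial_tail(n, p, lo, hi):
    """P(lo <= X <= hi) for X ~ binomial(n, p), summed term by term."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(lo, hi + 1))


n, p0, x = 80, 0.096, 13

# H1 contains ">", so the p-value is P(X >= 13) if H0 is true
p_value = binomial_tail(n, p0, x, n)

# Same number by the complement rule, mirroring 1 - binomcdf(80, 0.096, 12)
complement = 1 - binomial_tail(n, p0, 0, x - 1)

print(f"p-value = {p_value:.4f}  (complement form: {complement:.4f})")
```

Both forms should agree with each other and can be compared with the 0.0410 shown above.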

10B.  Sharp Points

Hypothesis tests are based on a simple idea, but there are lots of details to think about. This section clarifies some important ideas about the philosophy and practice of a hypothesis test.

10B1.  Type I and Type II Errors

Definition: A Type I error is rejecting the null hypothesis when it’s actually true.

Definition: A Type II error is failing to reject the null hypothesis when it’s actually false.

A Type I error usually causes you to do something you shouldn’t; a Type II error usually represents a missed opportunity.

Example 4: Suppose your alternative hypothesis H1 is that a new headache remedy PainX helps a greater proportion of people than aspirin.

A Type I error — rejecting H0 and accepting H1 when H0 is actually true — would have you announce that PainX helps more people when in fact it doesn’t. People would then buy PainX instead of aspirin, and their headache would less likely be cured. This is a bad thing.

On the other hand, a Type II error — failing to reject H0 when it’s actually false — would mean you announce an inconclusive result. This keeps PainX off the market when it actually would have helped more people than aspirin. This too is a bad thing.

Example 5: You’re on a jury, and you have to decide whether the accused actually committed the murder. What would be Type I and Type II errors?

To answer that you need to identify your null hypothesis H0. Remember that it’s always some form of “nothing going on here.” In this case, H0 would be that the defendant didn’t commit the murder, and H1 would be that he did.

A Type I error would be condemning an innocent man; a Type II error would be letting a guilty man go free. In our legal system, a defendant is not supposed to be found guilty if there is a reasonable doubt; this would correspond to your α. Probably α = 0.05 is not good enough in a serious case like murder, where a Type I error would mean long jail time or execution, so if you’re on a jury you’d want to be more sure than that.

“Okay then,” you say, “I’ll have to be super careful and not make mistakes.” But remember from Chapter 1: In statistics, errors aren’t necessarily mistakes. Errors are discrepancies between your results and reality, whatever their cause. Type I and Type II errors are not mistakes in procedure.

Even if you do everything right in your hypothesis test, you can’t be certain of your answer, because you can never get away from sample variability.

How often will these errors occur? This is where your significance level comes into play. If you perform a lot of tests at α = 0.05, then in the long run a Type I error will occur one time in twenty. It’s too big for these pages, but there’s a cartoon at that illustrates this perfectly. The probability of a Type II error has the symbol β (Greek letter beta) and it has to do with the “power” of the test, its ability to find an effect when there’s an effect to be found. β belongs to a more advanced course, and I don’t do anything with it in this book.

Earlier, I said that your significance level α is the chance of being wrong that you can live with. Now I can be a little more precise. α is not the chance of any error; α is the chance of a Type I error that you can live with. If one Type I error in 20 hypothesis tests is unacceptable, use a lower significance level — but then you make a Type II error more likely. If that’s unacceptable, increase your sample size.
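To see the “one time in twenty” idea in action, here is a small simulation sketch in Python. Everything in it is an assumption made for illustration: a population where H0 (p = 0.5) really is true, samples of 500, a left-tailed z test at α = 0.05, and 2000 repetitions with a fixed random seed.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)        # fixed seed so the run is repeatable
ALPHA = 0.05
P0 = 0.5              # H0 is TRUE in this simulated population
N = 500               # sample size for each simulated study
TRIALS = 2000         # number of simulated hypothesis tests


def left_tail_p_value(x, n, p0):
    """p-value of a one-proportion z test with H1: p < p0."""
    z = (x / n - p0) / sqrt(p0 * (1 - p0) / n)
    return NormalDist().cdf(z)


rejections = 0
for _ in range(TRIALS):
    # draw one random sample from a population where p really equals P0
    successes = sum(random.random() < P0 for _ in range(N))
    if left_tail_p_value(successes, N, P0) < ALPHA:
        rejections += 1   # H0 rejected even though it is true: a Type I error

type_1_rate = rejections / TRIALS
print(f"Type I error rate in {TRIALS} tests: {type_1_rate:.3f}")
```

The rate should land near 0.05, not exactly on it, because the simulated samples vary just as real ones do.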

Somebody is making a mint off the following chart. It’s in every stats textbook I’ve seen, so you may as well have it too:

                           Reject H0, accept H1     Fail to reject H0
If H0 is actually true     Type I error             Correct decision
If H0 is actually false    Correct decision         Type II error
(and H1 is true)

10B2.  One-Tailed or Two-Tailed?


How do you know whether your H1 should contain “<” or “>” (a one-tailed test) or “≠” (a two-tailed test)? In class, the problem will usually be clear about whether you’re testing for a “difference” (two-tailed) or testing if something is “better”, “larger”, “less than”, etc. (all one-tailed). But which one should you use when you’re on your own?

In general, prefer a two-tailed test unless you have a specific reason to make a one-tailed test.

When a two-tailed test reaches a statistically significant result, you interpret in a one-tailed manner.

Pick the Right Hypotheses

There are two main situations where a one-tailed test makes sense: “(a) where there is truly concern for the outcomes in one [direction] only and (b) where it is completely inconceivable that the results could go in the opposite direction.”

—Dubey, quoted by Kuzma and Bohnenblust (2005, 132) [see “Sources Used” at end of book]

With a one-tailed test, say for μ<4.5, you’re saying that you consider “equal to 4.5” and “greater than 4.5” the same thing, that if μ isn’t less than 4.5 then you don’t care whether it’s equal or it’s greater. Sometimes you really don’t care, but very often you do. If the problem statement is ambiguous, or if this is real life and you have to do a hypothesis test, how do you decide whether to do a one-tailed or two-tailed test?

Testing two-tailed doesn’t prejudge a situation. Do a two-tailed test unless you can honestly say, without looking at the data, that only one direction of difference matters, or only one direction is possible.

Example 6: An existing drug cures people in an average of 4.5 days, and you’re testing a new drug. If you test for μ<4.5, you’re saying that it doesn’t matter whether the new drug takes the same time or takes more time. But that’s wrong: it matters very much. You want to test whether the new drug is different (μ≠4.5). Then if it’s different, you can conclude whether it’s faster or slower.

Another way to look at this whole business: a one-tailed test essentially doubles your α — you’re much more likely to reach a conclusion with dicey data. But that means double the risk of being wrong with a Type I error — not a good thing!
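A quick numeric sketch of that point, in Python. The test statistic z = −1.80 is hypothetical, picked only because it sits between the one-tailed and two-tailed cutoffs; it doesn’t come from any example in this chapter.

```python
from statistics import NormalDist

ALPHA = 0.05
z = -1.80   # hypothetical test statistic, chosen for illustration

one_tailed = NormalDist().cdf(z)   # H1 contains "<": area in the left tail only
two_tailed = 2 * one_tailed        # H1 contains "≠": both tails count

print(f"one-tailed p = {one_tailed:.4f}, significant: {one_tailed < ALPHA}")
print(f"two-tailed p = {two_tailed:.4f}, significant: {two_tailed < ALPHA}")
```

The same data are “significant” in the one-tailed test and not in the two-tailed test, which is exactly the extra risk described above.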

Sometimes the same situation can call for a different test, depending on your viewpoint.

Example 7: You’re the county inspector of weights and measures, checking up on a dairy and its half gallons of milk. Legally, half a gallon is 64 fluid ounces. To a government inspector, “Dairylea gives 64.0 ounces in the average half gallon” and “Dairylea gives more than 64.0 ounces in the average half gallon” are the same (legal), and you care only about whether Dairylea gives less (illegal). A one-tailed test (<) is correct.

But now shift your perspective. You’re Dairylea management. You don’t want to short the customers because that’s illegal, but you don’t want to give too much because that’s giving away money. You make a two-tailed test (≠).

p < α in Two-Tailed Test: What Does it Tell You?

After a two-tailed test, if p<α then you can interpret the result as one-tailed.

Example 8: You want to test whether your candidate’s approval rating has changed from the previous dismal 40% after a major policy announcement. Your H1 is p ≠ 0.4, and 170 out of a random sample of 500 voters approve (p̂ = 34%). Your p-value is 0.0062, so you reject H0 and accept H1. You conclude that the candidate’s approval rating has changed.

But you can go further and say that her approval rating has dropped. You do this by combining the facts that (a) you’ve proved that the approval rating is different, which means it must be either less or more than 40%, and (b) the sample proportion p̂ (34%) was less than po (40%).

You can phrase your conclusion something like this, first answering the original question then going beyond it: The candidate’s approval rating has changed from 40% after the speech (p = 0.0062). In fact, it has dropped.

Your justification is the relationship between Confidence Interval and Hypothesis Test (later in this chapter), but you don’t actually have to compute the CI. When p < α in a two-tailed test, po is outside the confidence interval (at the matching confidence level).
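Here is Example 8 worked as a Python sketch: the two-tailed p-value, the direction of the difference, and the 95% confidence interval for comparison. The variable names are mine. One caution: for proportions the test uses po in its standard error while the interval uses p̂, so the match between “p < α” and “po outside the interval” is close but not exact in borderline cases.

```python
from math import sqrt
from statistics import NormalDist

x, n, p0 = 170, 500, 0.40
p_hat = x / n

# Two-tailed test of H1: p != 0.4
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = 2 * NormalDist().cdf(-abs(z))   # both tails

# Direction comes from comparing the sample proportion with p0
direction = "dropped" if p_hat < p0 else "risen"

# 95% confidence interval, the matching confidence level for alpha = 0.05
z_star = NormalDist().inv_cdf(0.975)
margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
low, high = p_hat - margin, p_hat + margin

print(f"z = {z:.2f}, two-tailed p = {p_value:.4f}, rating has {direction}")
print(f"95% CI: ({low:.4f}, {high:.4f}); contains 0.40? {low <= p0 <= high}")
```

In this example the whole interval lies below 40%, which is the justification for saying the rating has dropped.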

10B3.  What Does the p-Value Mean?

Summary: The p-value tells you how likely it is to get the sample you got (or a more extreme sample) if the null hypothesis is true.

Many people are confused about the p-value. They try to read too much into it, or they try to simplify it.

Part of the problem is trying to fit the meaning into the traditional structure of a one-sentence definition, so let’s try a story instead. In your experiment, you got a certain result, a sample mean or sample proportion. Assume that the null hypothesis is true. If H0 is true, the properties of the sampling distribution tell you how likely it is to get this sample result, or one even further away from H0. That likelihood is called the p-value.

The one-tailed p-value is exactly the probability that you computed with normalcdf in Chapter 8. When that’s less than 0.5, the two-tailed p-value is exactly double the one-tailed p-value.

If the p-value is small, your results are in conflict with H0, so you reject the null and accept the alternative. If the p-value is larger, your sample is not in conflict with H0 and you fail to reject the null, which is stats-talk for failing to reach any kind of conclusion.

In a nice phrase, Sterne and Smith [see “Sources Used” at end of book] say that p-values “measure the strength of the evidence against the null hypothesis; the smaller the p-value, the stronger the evidence against the null hypothesis.” They also quote R. A. Fisher on interpreting a p-value: “If P is between 0.1 and 0.9 there is certainly no reason to suspect the hypothesis tested. If it is below 0.02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at 0.05.”

The message here is that p-values fall on a continuum; you can’t just arbitrarily divide them into “significant” and “not significant” once and for all.

The p-value is the likelihood, if H0 is actually true, that random chance could give you the results you got, or results even further from H0. It is a conditional probability:

p-value = P(this sample given that H0 is true)

Yes, that seems convoluted — because it is. Alas, there just isn’t any description of a p-value that is both correct and simple.

The p-value is not the probability that either hypothesis is true or false. Those would all be plain probabilities, with no “given that” attached. Once again, the p-value is just a measure of how likely your results would be if H0 is true and random chance is the only factor in selecting the sample.

The p-value tells you how unlikely this sample (or a more extreme one) is if the null hypothesis is true. The more unlikely (surprising, unexpected), the lower the p-value, and the more confident you can feel about rejecting H0.

There’s one other thing: the p-value is not a measure of the size or importance of an effect. That gets into statistical significance versus practical significance.

10B4.  Practical and Statistical Significance

If your p-value is less than your significance level α, your result is statistically significant. That low p-value, Wheelan (2013, 11) [see “Sources Used” at end of book] writes, means that your result is “not likely to be the product of chance alone”. That’s all that statistical significance means. But even if a result is statistically significant, it may not be practically significant.

Example 9: Suppose that your p-value for “PainX is more likely to help a person than aspirin” is 0.000 002. You’re pretty darn sure that PainX is better. But to determine whether the result is practically significant, you have to ask not just whether PainX is better, but by how much.

One way to evaluate practical significance is to compute a confidence interval about the effect size. In this case, the 95% confidence interval is that a person is between 1.14 and 2.86 percentage points more likely to be helped by PainX than aspirin. Oh yes, and aspirin costs a buck for 100 tablets, where PainX costs $29.50 for ten. Most people would say this result has no practical significance. They’re not going to plunk down $30 for a few pills that are only about 2 percentage points more likely to help them than aspirin.

How can you get such a low p-value when the size of the effect is small? The answer is in extremely large sample sizes. In this made-up case, PainX helped 15,500 people in a sample of 25,000, and aspirin helped 15,000 in a sample of 25,000. When you have really large samples, be especially alert to the issue of statistical versus practical significance.
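The made-up PainX numbers can be checked with a short Python sketch. It assumes the standard two-proportion z test with a pooled standard error for the p-value, and the unpooled standard error for the confidence interval; that choice is my assumption here, and it matches how two-sample proportion procedures are usually set up.

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 15_500, 25_000   # helped by PainX
x2, n2 = 15_000, 25_000   # helped by aspirin

p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2            # effect size: difference in proportions

# Test of H1: p1 > p2, using the pooled proportion because H0 says p1 = p2
pooled = (x1 + x2) / (n1 + n2)
se_test = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = diff / se_test
p_value = NormalDist().cdf(-z)          # right-tail area, computed stably

# 95% confidence interval for the difference (unpooled standard error)
se_ci = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z_star = NormalDist().inv_cdf(0.975)
low, high = diff - z_star * se_ci, diff + z_star * se_ci

print(f"difference = {diff:.3f}, z = {z:.2f}, p-value = {p_value:.7f}")
print(f"95% CI for the difference: ({low:.4f}, {high:.4f})")
```

A tiny p-value and a tiny effect size can sit side by side: the first says the difference is real, the second says how little it amounts to.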

10B5.  Conclusions: Write ’em Right!

Summary: As a statistician, you have an ethical obligation to make your results as easy as possible to understand, and as hard as possible to misinterpret.

Avoid common errors when stating conclusions and interpreting them. Make sure you understand what you are doing, and explain it to others in their own language.

[Cartoon about disproving the null hypothesis; used by permission (accessed 2014-10-12)]

When p < α, you reject H0 and accept H1.

If your p-value is less than your significance level, you have shown that your sample results were unlikely to arise by chance if H0 is true. The data are statistically significant. You therefore reject H0 and accept H1.

Details: Assuming that H0 is true, the sample you got is surprising (unexpected, unusual). The data are inconsistent with the null hypothesis — they can’t both be true. The data are what they are, and if the sample was properly taken you have to believe in it. Therefore, H0 is most likely false. If H0 is false, its opposite H1 is true.

You accept H1, but you haven’t proved it to a certainty. There’s always that p-value chance that the sample results could have occurred when H0 is true. That’s why you say you “accept” H1, not that you have “proved” H1.

Compare to a jury verdict of “guilty”. It means the jury is convinced that the probability (p) that the defendant is innocent is less than a reasonable doubt (significance level, α). It doesn’t mean there is no chance he’s innocent, just that there is very little chance.

Example 10:

Suppose your null H0 is “the average package contains the stated net weight,” your alternative is “the average package contains less than the stated net weight,” and your significance level is 0.05.

If p = 0.0241, which is < α, you reject H0 and accept H1. You conclude “the average package does contain less than the stated net weight (p = 0.0241)” or “the average package does contain less than the stated net weight, at the 0.05 significance level.”

Don’t say the average package “might” be less than the stated weight or “appears to be” less than the stated weight. When you reject H0, state the alternative as a fact within the stated significance level, or preferably with the p-value. (Again, compare to a jury verdict. The jury doesn’t say the defendant “might be guilty”.)

See also: Take published conclusions with a grain of salt. Even professional researchers can misuse hypothesis tests. “Data mining” (first gathering data, then looking for relationships) is one problem, but not the only one. See Why Most Published Research Findings Are False (Ioannidis 2005 [see “Sources Used” at end of book]). If you find the article heavy going, just scroll down to read the example in Box 1 and then the corollaries that follow.

When p > α, you fail to reject H0.

If your p-value is greater than your significance level, you have shown that random chance could account for your results if H0 is true. You don’t know that random chance is the explanation, just that it’s a possible explanation. The data are not statistically significant.

You therefore fail to reject H0 (and don’t mention H1 in step 5). The sample you have could have come about by random selection if H0 is true, but it could also have come about by random selection if H0 is false. In other words, you don’t know whether H0 is actually true, or it’s false but the sample data just happened to fall not too far from H0.

Compare to a jury verdict of “not guilty”. That could mean the defendant is actually innocent, or that the defendant is actually guilty but the prosecutor didn’t make a strong enough case.

Example 11: Suppose your null hypothesis is “the average package contains the stated net weight,” your alternative is “the average package contains less than the stated net weight,” and your significance level α is 0.05.

If you compute a p-value of 0.0788, which is > α, you fail to reject H0 in step 5, but how do you state your conclusion in step 6?

There are two kinds of answer, depending on who you talk to. Some people say “there’s insufficient evidence to prove that the average package is underweight”; others say “we can’t tell whether the average package is underweight or not.” Of course there are many ways to write a conclusion in English, but ultimately they boil down to “we can’t prove H1” (or the equivalent “we can’t disprove H0”) versus “we can’t reach a conclusion either way.”

Does it matter? Yes, I think it does.

Please understand: It’s not that the people writing the conclusions are confused (well, usually not). The problem is confusion among people reading the conclusions.

Advice: It’s the same advice I’ve given before: Tailor your presentation to your audience. If you’re presenting to technical people, the one-sided forms are okay, and you could answer Example 11 with something like “there’s insufficient evidence, at the 0.05 significance level, to show that the average package is underweight” or “… to reject the hypothesis that the average package contains the stated net weight.” (Since the p-value gives more information, you could give that instead of the significance level.)

But if your audience is non-technical people, don’t expect them to understand a two-sided truth from a one-sided conclusion. Instead, use neutral language, such as “We can’t determine from the data whether the average package is underweight or not (p = 0.0788).” (You could state the significance level instead of the p-value.)

What if the p-value is very large?

If your p-value is very large, say bigger than 0.5, there’s a good chance you’ve made a mistake. Check carefully whether you should be testing <, ≠, or >. Also check whether you’re testing against the wrong number. For instance, suppose your H1 is that an ordinary coin comes up heads less than a third of the time. A few dozen flips will probably yield a p-value very close to 1. This is the statistical equivalent of “Well, duh!”

Sometimes large p-values are correct, but those situations are rare enough that you should be suspicious.

Can we never accept the null hypothesis?

Not as a matter of strict logic, no. But there are circumstances where the data do suggest that the null hypothesis is true. The most important of these is when multiple experiments fail to reject H0. Here’s why.

Suppose you do an experiment at the 0.05 significance level, and your p-value is greater than that. Maybe H0 is really true; maybe it’s false but this particular sample happened to be close to H0. You can’t tell — you’ve failed to disprove H0 but that doesn’t mean it’s necessarily true.

But suppose other experimenters also get p-values > 0.05. They can’t all be unlucky in their samples, can they?

If you keep giving the universe opportunities to send you data that contradict the null hypothesis, but you keep getting data that are consistent with the null, then you begin to think that the null hypothesis shouldn’t be rejected, that it’s actually true.

This is why scientists always replicate experiments. If the first experiment fails to reject H0, they don’t know whether H0 is true or they were just unlucky in their sample. But if several experiments fail to reject the null — always assuming the experiments are properly conducted — then they begin to have confidence in the theory.

What if an experiment does reject H0? Is that it, game over? Not necessarily. Remember that even a true H0 will get rejected one time in 20 when tested at the 0.05 level. Once again, the answer is replication. If they get more “reject H0”, scientists know that the first one wasn’t just a statistical fluke. But if they get a string of “fail to reject H0”, then it’s likely that the first one was just that one in 20, and H0 is actually true.

10C.  Testing a Mean (Numeric Data)

Summary: Just as you used a TInterval in Chapter 9 to make a confidence interval about μ for numeric data, you use a T-Test to perform the hypothesis test.

Typically you don’t know σ, the standard deviation (SD) of the population, and therefore you don’t know the standard error σ/√n either. So you estimate the standard error as s/√n, using the known SD of the sample. That means that the test statistic is:

t = (x̄ − μo) / (s/√n)

The t statistic is the estimated number of standard errors between your sample mean and the hypothetical population mean.

You met the t distribution when you computed confidence intervals in Chapter 9. Compared to z, the t distribution is a little flatter and more spread out, especially for small samples, so p-values tend to be larger.

Let’s jump in and do a t test. The numbered steps are almost the same as they were in the examples with binomial data — you just have the necessary variations for working with numeric data. Because I’ll be adding some commentary, I’ve put boxes around what I expect to see from you for a problem like this. (Refer to Seven Steps of Hypothesis Tests if you don’t know the steps very well yet.)

It hardly ever happens, but if you do know the SD of the population you can do a z test instead of a t test. Since the z distribution is a bit less spread out than the t distribution, for very small samples the p-values are typically a bit lower with a z test than with a t. But the difference is rarely enough to change the result — and again, you are quite unlikely to know the SD of the population, so a z test is quite unlikely to be the right one.

10C1.  Example 12: Bank Deposits

The management claims that the average cash deposit is $200.00, and you’ve taken a random sample to test that:

 192.68 188.24 152.37 211.73 201.57   167.79 177.19 191.15 209.22 178.49 
 185.90 226.31 192.38 190.23 156.13   224.07 191.78 203.45 186.40 160.83 

At the 0.05 significance level, does this sample show that the average of all cash deposits is different from $200?

Solution: The data type is numeric, and the population SD σ is unknown, so this is a test of a population mean, Case 1 from Inferential Statistics: Basic Cases. Your hypotheses are:

(1) H0: μ = 200, management’s claim is correct
H1: μ ≠ 200, management’s claim is wrong

Comment: Even though you already have the sample data in the problem, when you write the hypotheses, ignore the sample. In principle, you write the hypotheses, then plan the study and gather data. If you use any of the sample data in the hypotheses, something is wrong.

So you don’t use numbers from the sample in your hypotheses, and you don’t use the sample to help you decide whether the alternative hypothesis H1 should have <, ≠, or >.

The significance level was given in the problem. (Problems will usually give you an α to use.)

(2) α = 0.05

Next is the requirements check. Even though it doesn’t have a number, it’s always necessary. In this case, n = 20, which is less than 30, so you have to test for normality and verify that there are no outliers.

Enter your data in any statistics list (I used L5), and check your data entry carefully. Use the MATH200A program “Normality chk” to check for a normal distribution and “Box-whisker” to verify that there are no outliers.

[Figures: TI-83 probability plot showing normality (see text below); TI-83 box-whisker plot showing no outliers]

You don’t need to draw the plots, but do write down r and crit and show the comparison, and do check for outliers. (For what to do if you have outliers, see Chapter 3.)

  • Random sample: given.
  • 10n = 10×20 = 200, and the bank had better have more deposits than that or it can’t afford to pay you for your work!
  • Normality: yes. From MATH200A part 4, r(0.9864) > crit(0.9503).
  • Outliers: none (MATH200A part 2).

Now it’s time to compute the test statistic (t) and the p-value.

On the T-Test screen, you have to choose Data or Stats just as you did on the TInterval screen. You have the actual data, so you select Data on the T-Test screen, instead of Stats. Then the sample mean, sample SD, and sample size are shown on the output screen, so you write them down as part of your results. Always write down x̄, s, and n.

[Figures: TI-83 input and output screens for T-Test; for numbers, see text]

(3/4) T-Test: μo=200, List=L5, Freq=1, μ≠μo

results: t=−2.33, p=0.0311, x̄=189.40, s=20.37, n=20

The decision rule is the same for every single hypothesis test, regardless of data type. In this case:

(5) p < α. Reject H0 and accept H1.

And as usual, you can write your conclusion with the significance level or the p-value:

(6) At the 0.05 level of significance, management is incorrect and the average of all cash deposits is different from $200.00. In fact, the true average is lower than $200.00.


(6) Management is incorrect, and the average of all cash deposits is different from $200.00 (p = 0.0311). In fact, the true average is lower than $200.00.

Remember what happens when you do a two-tailed test (≠ in H1) and p turns out less than α: After you write your “different from” conclusion, you can go on to interpret the direction of the difference. See p < α in Two-Tailed Test.

In a classroom exercise, if you were asked to do a hypothesis test you would do a hypothesis test and only a hypothesis test. But in real life, and in the big labs for class, it makes sense to answer the obvious question: If the true mean is less than $200.00, what is it?

You don’t have to check requirements for the CI, because you already checked them for the HT.

TInterval L5, 1, .95
outputs: (179.86, 198.93)

With 95% confidence, the average of all cash deposits is between $179.86 and $198.93.
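For the curious, here is a Python sketch that reproduces the T-Test and TInterval arithmetic for the bank deposits. Python’s standard library has no Student’s t distribution, so this sketch integrates the t density numerically (Simpson’s rule) and finds the critical t by bisection. Those two helpers, and all the function names, are my own stand-ins and not how the calculator does it internally.

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev

deposits = [192.68, 188.24, 152.37, 211.73, 201.57, 167.79, 177.19, 191.15,
            209.22, 178.49, 185.90, 226.31, 192.38, 190.23, 156.13, 224.07,
            191.78, 203.45, 186.40, 160.83]


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def t_central_area(t, df, steps=2000):
    """P(-|t| < T < |t|), by Simpson's rule on [0, |t|] and symmetry."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3


def t_critical(conf_level, df):
    """Critical t whose central area equals conf_level, found by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_central_area(mid, df) < conf_level:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


n = len(deposits)
x_bar, s = mean(deposits), stdev(deposits)   # sample mean and sample SD
se = s / sqrt(n)                             # estimated standard error
t = (x_bar - 200) / se                       # test statistic for H0: mu = 200
p_value = 1 - t_central_area(t, n - 1)       # two-tailed, since H1 has "≠"

t_star = t_critical(0.95, n - 1)
ci = (x_bar - t_star * se, x_bar + t_star * se)

print(f"x-bar = {x_bar:.2f}, s = {s:.2f}, n = {n}")
print(f"t = {t:.2f}, p-value = {p_value:.4f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The numbers should agree with the calculator output above to the precision shown; small differences in the last digit would come from the numerical integration.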

10C2.  Example 13: Smokers and Retirement

Here’s an example where you have statistics without the raw data. It’s adapted from Sullivan (2011, 483) [see “Sources Used” at end of book].

According to the Centers for Disease Control, the mean number of cigarettes smoked per day by individuals who are daily smokers is 18.1. Do retired adults who are daily smokers smoke less than the general population of daily smokers?

To answer this question, Sascha obtains a random sample of 40 retired adults who are current daily smokers and records the number of cigarettes smoked on a randomly selected day. The data result in a sample mean of 16.8 cigarettes and a SD of 4.7 cigarettes.

Is there sufficient evidence at the α = 0.01 level of significance to conclude that retired adults who are daily smokers smoke less than the general population of daily smokers?

Solution: Start with the hypotheses. You’re comparing the unknown mean μ for retired smokers to the fixed number 18.1, the known mean for smokers in general. Since the data type is numeric (number of cigarettes smoked), and there’s one population, and you don’t know the SD of the population, this is Case 1, test of population mean, from Inferential Statistics: Basic Cases.

(1) H0: μ = 18.1, retired smokers smoke the same amount as smokers in general
H1: μ < 18.1, retired smokers smoke less than smokers in general

Comment: The claim is a population mean of 18.1, so you use 18.1 in your hypotheses. Using the sample mean of 16.8 in Step 1 is a rookie mistake, one of the Top 10 Mistakes of Hypothesis Tests. Never use sample data in your hypotheses.

Comment: Why does H1 have < instead of ≠? The short answer is: that’s what the problem says to do. In the real world, you would do a two-tailed test (≠) unless there’s a specific reason to do a one-tailed test (< or >); see One-Tailed or Two-Tailed? (earlier in this document). Presumably there’s some reason why they are interested only in the case “retired smokers smoke less” and not in the case “retired smokers smoke more”.

(2) α = 0.01
  • Random sample (given).
  • n > 30.
  • 10n = 10×40 = 400, less than the total number of retired smokers.

Therefore the sampling distribution is normal.

[Figures: TI-83 input and output screens for T-Test]

(3/4) T-Test: μo=18.1, x̄=16.8, s=4.7, n=40, μ<μo

outputs: t=−1.75, p=0.0440

(5) p > α. Fail to reject H0.
(6) At the 0.01 level of significance, we can’t determine whether the average number of cigarettes smoked per day by retired adults who are current smokers is less than the average for all daily smokers or not.
We can’t tell whether the average number of cigarettes smoked per day by retired adults who are current smokers is less than the average for all daily smokers or not (p = 0.0440).

When you fail to reject H0, you cannot reach any conclusion. You must use neutral language in your non-conclusions. Please review When p > α, you fail to reject H0 earlier in this chapter.
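For anyone who wants to see where the t in step 3/4 of this example comes from, here is the arithmetic, assuming the usual one-sample t statistic with n − 1 degrees of freedom:

```latex
t = \frac{\bar{x} - \mu_o}{s/\sqrt{n}}
  = \frac{16.8 - 18.1}{4.7/\sqrt{40}}
  \approx \frac{-1.3}{0.743}
  \approx -1.75,
\qquad df = n - 1 = 39
```

The p-value of 0.0440 is the area to the left of t = −1.75 under the t curve with 39 degrees of freedom. Since 0.0440 is greater than α = 0.01, the sample isn’t surprising enough to reject H0.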

10D.  Confidence Interval and Hypothesis Test

Summary: You can use a confidence interval to conclude whether results are statistically significant. A hypothesis test (HT) and confidence interval (CI) are two ways of looking at the same thing: what possibilities for the population mean or proportion are consistent with my sample?

A 95% CI is the flip side of a 0.05 two-tailed HT. More generally, a 1−α CI is the complement of an α two-tailed HT.

Example 14: The baseline rate for heart attacks in diabetes patients is 20.2% in seven years. You have a new diabetes drug, Effluvium, that is effective in treating diabetes. Clinical trials on 89 patients found that 27 (30.3%) had heart attacks. The 95% confidence interval is 20.8% to 39.9% likelihood of heart attack within seven years for diabetes patients taking Effluvium. What does this tell you about the safety of Effluvium?

Solution: Okay, you’re 95% confident that Effluvium takers have a 20.8% to 39.9% chance of a heart attack within seven years. If you’re 95% confident that their chance of heart attack is inside that interval, then there’s only a 5% or 0.05 probability that their chance of heart attack is outside the interval, namely <20.8% or >39.9%.

[Figure: confidence interval, 20.8% to 39.9%, with hypothesis test]

But 20.2% is outside the interval, so there’s less than a 0.05 chance that the true probability of heart attack with Effluvium is 20.2%.

CI and HT calculations both rely on the sampling distribution. The open curve centered on 20.2% shows the sampling distribution for a hypothetical population proportion of 20.2%. Only a very small part of it extends beyond 30.3%, the proportion of heart attacks you actually found in your sample.

The chance of getting your sample, given a hypothetical proportion po in the population, is the p-value. If po = 20.2%, your sample with p̂ = 30.3% would be unlikely (p-value below 0.05). You would reject the null hypothesis and conclude that Effluvium takers have a different likelihood of heart attack from other diabetes patients, at the 0.05 significance level. Further, the entire confidence interval is above the baseline value, so you know that Effluvium increases the likelihood of heart attack in diabetes patients.

At significance level 0.05, a two-tailed test against any value outside the 95% confidence interval (the shaded curve) would lead to rejecting the null hypothesis. And you can say the same thing for any other significance level α and confidence level 1−α.
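Here is a minimal Python sketch of both calculations for Example 14. It assumes the usual normal-approximation formulas for a one-proportion confidence interval and a one-proportion z test, and the function names are invented for this illustration. The point is that the 95% CI excludes 20.2% and the two-tailed p-value lands below 0.05: the same conclusion reached two ways.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_ci(x, n, conf=0.95):
    """Confidence interval for a proportion; the standard error uses p-hat."""
    p_hat = x / n
    z_star = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def one_prop_z_test(x, n, p0):
    """Two-tailed z test of H0: p = p0; the standard error uses p0."""
    p_hat = x / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# 27 heart attacks among 89 patients, baseline rate 20.2%
low, high = one_prop_ci(27, 89)
z, p_value = one_prop_z_test(27, 89, p0=0.202)
print(f"95% CI: {low:.1%} to {high:.1%}")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```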

What if the interval does include the baseline or hypothetical value? Then you fail to reject the null hypothesis.

Example 15: A machine is supposed to be turning out something with a mean value of 100.00 and SD of 6.00, and you take a random sample of 36 objects produced by the machine. If your sample mean is 98.4 and SD is 5.9, your 95% confidence interval is 96.4 to 100.4.

Now, can you make any conclusion about whether the machine is working properly?

[Figure: confidence interval, 96.4 to 100.4, with hypothesis test]

Solution: Well, you’re 95% confident that the machine’s true mean output is somewhere between 96.4 and 100.4. With this sample, you can rule out a true population mean of <96.4 or >100.4, at the 0.05 significance level; but you can’t rule out a true population mean between 96.4 and 100.4 at α = 0.05. A hypothesis test would fail to reject the hypothesis that μ = 100. You can’t determine whether the true mean output of the machine is equal to 100 or not.

Leaving the symbols aside, when you test a null hypothesis your sample either is surprising (and you reject the null hypothesis) or is not surprising (and you fail to reject the null). Any null hypothesis value inside the confidence interval is close enough to your sample that it would not get rejected, and any null hypothesis value outside the interval is far enough from the sample that it would get rejected.
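That rule is simple enough to write as a few lines of Python. This is only a sketch with a made-up function name; it assumes the interval was computed at confidence level 1−α and that the test is two-tailed.

```python
def decide_from_ci(low, high, baseline):
    """Read a two-tailed test decision off a confidence interval."""
    if baseline < low:
        # whole interval lies above the baseline
        return "reject H0: true value is higher than baseline"
    if baseline > high:
        # whole interval lies below the baseline
        return "reject H0: true value is lower than baseline"
    return "fail to reject H0"

print(decide_from_ci(20.8, 39.9, baseline=20.2))   # Example 14, in percent
print(decide_from_ci(96.4, 100.4, baseline=100))   # Example 15
```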

Special Note for Binomial Data

For numeric data, the CI and HT are exactly equivalent.

But for binomial data, the CI and HT are only approximately equivalent. Why? Because with binomial data, the HT uses a standard error derived from po in the null hypothesis, but the CI uses a standard error derived from p̂, the sample proportion. Since the standard errors are slightly different, right around the borderline they might get different answers. But when the hypothetical po is a fair distance outside the CI, as it was in the drug example, the p-value will definitely be less than α.
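To see the size of the discrepancy, here are the two standard errors for the Effluvium numbers from Example 14, assuming the standard formulas for the standard error of a proportion:

```latex
SE_{\text{HT}} = \sqrt{\frac{p_o(1-p_o)}{n}}
  = \sqrt{\frac{0.202 \times 0.798}{89}} \approx 0.0426
\qquad
SE_{\text{CI}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
  = \sqrt{\frac{0.303 \times 0.697}{89}} \approx 0.0487
```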

What about One-Tailed Tests?

Good question!

A confidence interval is symmetric (for the cases you study in this course), so it’s intrinsically two-tailed. A one-tailed HT for < or > at α = 0.01 corresponds to a two-tailed HT for ≠ at α = 0.02, so the CI for a one-tailed HT at α = 0.01 is a 98% CI, not a 99% CI. The confidence level for a one-tailed α is 1−2α, not 1−α.

Correspondence between Significance Level and Confidence Level

Significance Level α    Tails in HT    Confidence Level
0.05                    1              1−2×.05 = 90%
0.05                    2              1−.05 = 95%
0.01                    1              1−2×.01 = 98%
0.01                    2              1−.01 = 99%
0.001                   1              1−2×.001 = 99.8%
0.001                   2              1−.001 = 99.9%

If the baseline value is outside the confidence interval, you can say (at the appropriate significance level) that the true value of μ or p is different from the baseline, and then go on to say whether it’s bigger or smaller, so you get your one-tailed result.

On the other hand, if the baseline value is inside the confidence interval, you can’t say whether the true μ or p is equal to the baseline or different from it, and if you can’t say whether they’re different then you can’t say which one is bigger than the other.
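The correspondence between significance level and confidence level boils down to one small rule, sketched here in Python. The function name is made up for this illustration.

```python
def confidence_level(alpha, tails):
    """Confidence level of the CI that matches a test at significance alpha."""
    if tails == 1:
        return 1 - 2 * alpha    # one-tailed: alpha sits in each tail of the CI
    if tails == 2:
        return 1 - alpha        # two-tailed: alpha is split between the tails
    raise ValueError("tails must be 1 or 2")

for alpha in (0.05, 0.01, 0.001):
    for tails in (1, 2):
        level = confidence_level(alpha, tails)
        print(f"alpha = {alpha}, {tails}-tailed test: {level:.1%} CI")
```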

10E.  Testing a Non-Random Sample

Though most hypothesis tests are to find out something about a population, sometimes you just want to know whether this sample is significantly different from a population. In this case, you don’t need a random sample, but the other requirements must still be met.

Example 16: At Wossamatta University, instructors teach the statistics course independently but all sections take the same final exam. (There are several hundred students.) One semester, the mean score on the exam is 74. In one section of 30 students, the mean was 68.2 and the SD was 10.4. The students felt that they had not been adequately prepared for the exam by the instructor. Can they make their case?

Solution: In effect, they are saying that their section performance was significantly below the performance of students in the course overall. This is a testable hypothesis. But the hypothesis is not about the population that these 30 students were drawn from; we already know about that population. Instead, it is a test of whether this sample, as a sample, is different from the population.

(1) H0: This section’s mean was no different from the course mean.
H1: This section’s mean was significantly below the course mean.
(2) α = 0.05
  • (Omit the requirement for a random sample.)
  • 10n = 10×30 = 300 is less than the “several hundred students” in the course.
  • Sample size is ≥30, so the sampling distribution is normal.
(3/4) T-Test: μo = 74, x̄ = 68.2, s = 10.4, n = 30, μ < μo
Outputs: t = −3.05, p-value = 0.0024
(5) p < α. Reject H0 and accept H1.
(6) This section’s average exam score was less than the overall course average (p-value = 0.0024).

Okay, there was a real difference. This section’s mean exam score was not only below the average for the whole course, but too far below for random chance to be enough of an explanation.

But did the students prove their case? Their case was not just that their average score was lower, but that the difference was the result of poor teaching. Statistics can’t answer that question so easily. Maybe it was poor teaching; maybe these were weaker students; maybe it was environmental factors like classroom temperature or the time of day; maybe it was all of the above.

What Have You Learned?

Key ideas:

(The online book has live links.)


Exercises for Chapter 10

Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.

Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

Problem Set 1

1 List the seven steps of every hypothesis test.

2 Why must you select a significance level before computing a p-value?


3 Explain the p-value in your own words.


4 You’ve tested the hypothesis that the new accelerant makes a difference to the time to dry paint, using α = 0.05. What is wrong with each conclusion, based on the p-value? Write a correct conclusion for that p-value.
(a) p = 0.0214. You conclude, “The accelerant may make a difference, at the 0.05 significance level.”
(b) p = 0.0714. You conclude, “The accelerant makes no difference, at the 0.05 significance level.”


5 You are testing whether the new accelerant makes your paint dry faster. (You have already eliminated the possibility that it makes your paint dry slower.)
(a) What conclusion would be a Type I error? What wrong action would a Type I error lead you to take?
(b) What conclusion would be a Type II error? What wrong action would a Type II error lead you to take?


6 Are Type I and Type II errors actually mistakes? What one thing can you do to prevent both of them, or at least make them both less likely?


7 What can you do to make a Type I error less likely at a given sample size? What’s the unfortunate side effect of that?


8 Explain in your own words the difference between “accept H0” (wrong) and “fail to reject H0” (correct) when your p-value is > α.


9 The engineering department claims that the average battery lifetime is 500 minutes. Write both hypotheses in symbols.


10 Suppose H0 is “the directors are honest” and H1 is “the directors are stealing from the company.” Write conclusions, in Statistics and in English, if …
(a) p = 0.0405 and α = 0.01
(b) p = 0.0045 and α = 0.01


11 In your hypothesis test, H0 is “the defendant is innocent” and H1 is “the defendant is guilty”. The crime carries the death penalty. Out of 0.05, 0.01, and 0.001, which is the most appropriate significance level, and why?


12 When Keith read the AAA’s statement that 10% of drivers on Friday and Saturday nights are impaired, he believed the proportion was actually higher for TC3 students. He took a systematic sample of 120 students and, on an anonymous questionnaire, 18 of them admitted being alcohol impaired the last Friday or Saturday night that they drove. Can he prove his point, at the 0.05 significance level?


13 In 2006–2008 there was controversy about creating a sewer district in south Lansing, where residents have had their own septic tanks for years. The Sewer Committee sent out an opinion poll to every household in the proposed sewer district. In a letter to the editor, published 3 Feb 2007 in the Ithaca Journal, John Schabowski wrote, in part:

The Jan. 4 Journal article about the sewer reported that “only” 380 of 1366 households receiving the survey responded, with 232 against it, 119 supporting it, and 29 neutral. ... The survey results are statistically valid and accurate for predicting that the sewer project would be voted down by a large margin in an actual referendum.

Can you do a hypothesis test to show that more than half of Lansing households in the proposed district were against the sewer project? (You’re trying to show a majority against, so combine “supporting” and “neutral” since those are not against.)

14 Esperanza wanted to determine whether more than 40% of grocery shoppers — specifically, the primary grocery shoppers in their households — regularly use manufacturers’ coupons. She conducted a random telephone survey and contacted 500 people. (For this exercise, let’s assume that telephone subscribers are representative of grocery shoppers.) Of the 500 she contacted, 325 do the grocery shopping in their households. Of those 325, 182 said they regularly use manufacturers’ coupons.
(a) What is the size of the sample? (Think before you answer!)
(b) What is the population, and how large is it?
(c) What does the number 182 represent?
(d) Don’t do a hypothesis test. But if you did, what would po be?
(e) Is it a source of bias that she considered only each household’s primary grocery shopper?

15 Doubting Thomas remembered the Monty Hall example from Chapter 5, but he didn’t believe the conclusion that switching doors would improve the chance of winning to 2/3. (It’s okay if you don’t remember the example. All the facts you need are right here.)

Thomas watched every Let’s Make a Deal for four weeks. (Though this isn’t a random sample, treat it as one. There’s no reason why the show should operate differently in these four weeks from any others.) In that time, 30 contestants switched doors, and 18 of them won.
(a) At the 0.05 significance level, is it true or false that your chance of winning is 2/3 if you switch doors?
(b) At the 95% confidence level, estimate your chance of winning if you switch doors.
(c) If you don’t switch doors, your chance of winning is 1/3. Using your answer to (b), is switching doors definitely a good strategy, or is there some doubt?

16 Most of us have spam filters on our email. The filter decides whether each incoming piece of mail is spam. Heather trusts her spam filter, and she sets it to just delete spam rather than save it to a folder.
(a) What would Heather’s spam filter do if it makes a Type I error? What would it do if it makes a Type II error?
(b) Which is more serious here, a Type I error or a Type II error? Should the significance level α be set higher or lower?

17 Rosario read in Chapter 6 that 30.4% of US households own cats. She felt like dogs were a lot more visible than cats in Ithaca, so she decided to test whether the true proportion of cat ownership in Ithaca was less than the national proportion. She took a systematic sample of Wegmans shoppers one day, and during the same time period a friend took a systematic sample of Tops shoppers. (They counted groups shopping together, not individual shoppers, so they didn’t have to worry about getting the same household twice.)

Together, they accumulated a sample of 215 households, and of those 54 owned cats. Did she prove her case, at the 0.05 significance level?

Solutions → 

Problem Set 2

18 What is wrong with each pair of hypotheses? Correct the error.

(a) H0 = 14.2;  H1 > 14.2

(b) H0: μ < 25;  H1: μ > 25

(c) You’re testing whether batteries have a mean life of greater than 750 hours. You take a sample, and your sample mean is 762 hours. You write H0:μ=762 hr; H1:μ>762 hr.

(d) Your conventional paint takes 4.3 hours to dry, on average. You’ve developed a drying accelerant and you want to test whether adding it makes a difference to drying time. You write H0: μ=4.3 hr;  H1: μ < 4.3 hr.


19 This year, water pollution readings at State Park Beach seem to be lower than last year. A sample of 10 readings was randomly selected from this year’s daily readings:

3.5   3.9   2.8   3.1   3.1   3.4   3.2   2.5   3.5   3.1

Does this sample provide sufficient evidence, at the 0.01 level, to conclude that the mean of this year’s pollution readings is significantly lower than last year’s mean of 3.8?


20 Dairylea Dairy sells quarts of milk, which by law must contain an average of at least 32 fl. oz. You obtain a random sample of ten quarts and find an average of 31.8 fl. oz. per quart, with SD 0.60 fl. oz. Assuming that the amount delivered in quart containers is normally distributed, does Dairylea have a legal problem? Choose an appropriate significance level and explain your choice.


21 You’re in the research department of StickyCo, and you’re developing a new glue. You want to compare your new glue against StickyCo’s best seller, which has a bond strength of 870 lb/in².

You take 30 samples of your new glue, at random, and you find an average strength of 892.2 lb/in², with SD 56.0. At the 0.05 significance level, is there a difference in your new glue’s strength?


22 New York Quick Facts from the Census Bureau (2014b) [see “Sources Used” at end of book] says that 32.8% of residents of New York State aged 25 or older had at least a bachelor’s degree in 2008–2012. Let’s assume the figure hasn’t changed today.

You conduct a random sample of 120 residents of Tompkins County aged 25+, and you find that 52 of them have at least a bachelor’s degree.
(a) Construct a 95% confidence interval for the proportion of Tompkins County residents aged 25+ with at least a bachelor’s degree.
(b) Don’t do a full hypothesis test, but use your answer for (a) to determine whether the proportion of bachelor’s degrees in Tompkins County is different from the statewide proportion, at the 0.05 significance level.


23 You’re thinking of buying new Whizzo bungee cords, if the new ones are stronger than your current Stretchie ones. You test a random sample of Whizzo and find these breaking strengths, in pounds:

679   599   678   715   728   678   699   624

At the 0.01 level of significance, is Whizzo stronger on average than Stretchie? (Stretchies have mean strength of 625 pounds.)


24 For her statistics project, Jennifer wanted to prove that TC3 students average more than six hours a week in volunteer work. She gathered a systematic sample of 100 students and found a mean of 6.75 hours and SD of 3.30 hours. Can she make her case, at the 0.05 significance level?


25 As a POW in World War II, John Kerrich flipped a coin 10,000 times and got 5067 heads. At the 0.05 level of significance, was the coin fair?


26 People who take aspirin for headache get relief in an average of 20 minutes (let’s suppose). Your company is testing a new headache remedy, PainX, and in a random sample of 45 headache sufferers you find a mean time to relief of 18 minutes with SD of 8 minutes.
(a) Construct a 95% confidence interval for the mean time to relief of PainX.
(b) Don’t do a full hypothesis test, but use your answer for (a) to determine at the 0.05 significance level whether PainX offers headache relief to the average person in a different time than aspirin.

Solutions → 



11. Inference from Two Samples

Updated 1 Jan 2016 (What’s New?)

Intro: In Chapter 10, you looked at hypothesis tests for one population, where you asked whether a population mean or proportion is different from a baseline number. In this chapter, you’ll ask “are these two populations different from each other?” (hypothesis test) and “how large is the difference?” (confidence interval).


11A.  Numeric Data — Paired or Unpaired?

That’s the key question when you’re doing inference on numeric data from two samples. Your answer will control how you analyze the data, so let’s look closely at the difference.

11A1.  Unpaired Data / Independent Samples

Definitions: You have unpaired data when you get one number from each individual in two unrelated groups. The two groups are known as independent samples.

Independent samples result when you take two samples completely independently, or if you take one sample and then randomly assign the members to groups. Randomization always gives you independent samples.
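Random assignment is easy to do by machine. Here is a minimal Python sketch that shuffles one sample and splits it into two groups; the function name, the list of 20 volunteers, and the seed are all made up for this illustration.

```python
import random

def randomly_assign(members, seed=None):
    """Shuffle one sample and split it into two groups of (nearly) equal size."""
    rng = random.Random(seed)      # a seed makes the assignment repeatable
    shuffled = list(members)       # copy, so the original list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

volunteers = [f"person {i}" for i in range(1, 21)]   # hypothetical sample of 20
treatment, control = randomly_assign(volunteers, seed=1)
print("treatment:", treatment)
print("control:  ", control)
```

Because chance alone decides who lands in which group, nothing about one group’s members tells you anything about the other’s, and that’s what makes the samples independent.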

Example 1: What if any is the average difference in time husbands and wives spend on yard work? You randomly select 40 married men and 40 married women and find how much time a week each spends in yard work. There’s no reason to associate Man A with Woman B any more than Woman C; these are independent samples and the data are unpaired.

Example 2: How much “winter weight” does the average adult gain? You randomly select 500 adults and weigh them all during the first week of November. Then during the last week of February you randomly select another 500 adults and weigh them. The data are unpaired, and the samples are independent.

Before you read further, what’s the big problem in the design of those two studies?

Right! Our old enemy, confounding variables. Look at the examples again, and see how many you can identify. For example, what might make a random person in one sample weigh more or less than a random person in the other sample, other than the passage of time? What might make a random woman spend more or less time on yard work than a random man, apart from their genders?

With independent samples, if there’s actually a difference between the two groups, it may be swamped by all the differences within each group.

11A2.  Paired Data / Dependent Samples

Definitions: You have paired data when each observational unit gives you two numbers. These can be one number each from a matched pair of individuals, or two numbers from one individual. Paired data come from dependent samples.

Example 3: What if any is the average difference in time husbands and wives spend on yard work? You randomly select 40 couples and find how much time a week each person spends in yard work. Each husband and wife are a matched pair. The samples are dependent because once you’ve chosen a couple you’ve equally specified a member of the “wives” sample and a member of the “husbands” sample.

Example 4: How much “winter weight” does the average adult gain? You randomly select 500 adults and weigh them all during the first week of November, then again during the last week of February. You have paired data in the before and after numbers. The two samples are dependent because they are the same individuals.

Do you see how a design with paired data (dependent samples) overcomes the big problem with unpaired data (independent samples)? You want to study weight gain, and now that’s what you’re measuring directly. You wanted to know whether husband or wife spends more time on yard work, and now you’ve eliminated all the differences between couples.

Paired data are more likely than unpaired to reveal an effect, if there is one. Why? Because a paired-data design minimizes differences within each group that can swamp any difference between groups.

In studying human development and behavior, twins are a prime source of dependent samples. If you have a pair of identical twins who were raised apart (and that’s surprisingly common), you can investigate which differences between people’s behavior are genetic and which are learned. The Minnesota Study of Twins (Bouchard 1990 [see “Sources Used” at end of book]) found that a lot of behaviors that “should” be learned seem to be genetic. The New York Times published a nontechnical account in Major Personality Study Finds That Traits Are Mostly Inherited (Goleman 1986 [see “Sources Used” at end of book]).

11A3.  Paired and Unpaired Data Compared

Sample type                                      Dependent       Independent, or randomized
Numeric data type                                Paired Data     Unpaired Data
How many numbers from each experimental unit?    Two             One
Can you rearrange★ one sample?                   No              Yes
Problem of confounding variables                 Minimal         Severe
Use this design …                                … if you can    … if you must

★ If the data from the sample are arranged in two rows or two columns, can you rearrange one row or column without destroying information?

11A4.  Example 5: Seed Corn

[Figure: unpaired data/independent samples versus paired data/dependent samples]

Testing new corn versus standard corn for yield. Can you see a problem with the sample in Western New York that’s not a problem with the sample in Central New York?

Adapted from Dabes and Janik (1999, 263) [see “Sources Used” at end of book]

You’re the head of research for the Whizzo Seed Company, and you’ve developed a new type of seed that looks promising. You randomly select three farmers in Western New York to receive new corn, and three to receive your standard product. (Of course you don’t tell them which one they’re getting.) At the end of the season they report their yield figures to you.

What’s wrong with this picture? You can easily think of all sorts of confounding variables here: different soils, different weather, different insects, different irrigation, different farming techniques, and on and on. Those differences can be great enough to hide (confound) a difference between the two types of corn, especially in a small sample.

The following year, you try again in Central New York. This time you send each farmer two stocks of seed corn, with instructions to plant one field with the first stock and another field with the second.

Does that eliminate confounding variables? Maybe not totally, but it reduces them as far as possible. Now, if you see significant differences in yield between two fields planted by the same farmer, it’s almost sure to be due to differences in the seed.

When to Use Paired Data?

You always want to structure an experiment or observation with paired data (dependent samples) — if you can.

“If you can.” Aye, there’s the rub. Suppose you want to know whether attending kindergarten makes kids do better in first grade. There’s no way to set this up as paired data: how can a given kid both go through kindergarten and not go through kindergarten? Twin studies don’t help you here, because if the twins are raised together the parents will send both of them to kindergarten, or neither; and if the twins are raised apart then there will be too many other differences in their upbringing that could affect their performance in first grade.

If the samples are independent, you can’t pair the data, even if the samples are the same size. If you’re not sure whether you have dependent or independent samples, look back at 11A3.  Paired and Unpaired Data Compared.

11A6.  Example 6: Where the Rubber Meets the Road

You want to determine whether a new synthetic rubber makes tires last longer than the competitor’s product. Can you see how to do this with independent samples (unpaired data) and with dependent samples (paired data)? Think about it before you read on.

For independent samples, you randomly assign drivers to receive four tires with your new rubber or four of the competitor’s tires. For dependent samples, you put two tires of one type on the left side of every driver’s car, and two on the right side of every driver’s car. (You do half the cars one way and half the other, to eliminate differences like the greater likelihood of hitting the curb on the right.)

With the first method, if there’s only a small difference between your rubber and the competitor’s, it may not show up because you’ve also got differences in driving styles, roads, and so forth — confounding variables again. With the second method, those are eliminated.

11B.  Inference with Paired Numeric Data (Case 3)


The hypothesis test is almost exactly like the Case 1 hypothesis test. The difference is that you define a new variable d (difference) in Step 1 and write hypotheses about μd instead of μ.

For a confidence interval, you’re estimating the average difference, not the average of either population. You need to state both size and direction of the effect.

11B1.  Example 7: The Freshman Fifteen

You’ve probably heard about the “freshman fifteen”, the weight gain many students experience in their first year at college. The Urban Dictionary even talks about the “freshman twenty” (2004) [see “Sources Used” at end of book].

Francine wanted to know if that was a real thing or just an urban legend. During the first week of school, she got the other nine women in her chemistry class at Wossamatta U to agree to help her collect data. (She reasoned that students in any particular class would effectively be a random sample of the school, since class choice is unrelated to weight or other health issues. Of course that would be questionable for a spin class or a cooking class.)

Wossamatta U CHEM101 — Women’s Weights (in pounds)
Student          A    B    C    D    E    F    G    H    I    J
Sept.          118  105  123  112  107  130  120   99  119  126
May            125  114  128  122  106  143  124  103  125  135

When she had the data, Francine realized she didn’t know what to do next. If she had just one set of numbers, she would do a Student’s t test, since she doesn’t know the population standard deviation (SD). But what to do with two lists?

Then she had a brainstorm. She realized that she’s not trying to find out anything about students’ weights. She wants to know about their weight gain. Looking at their weights, she’d have plenty of lurking variables starting with pre-college diet and lifestyle. Looking only at the weight gain minimizes or eliminates those variables, and measures just what happened to each student during freshman year. So she added a third row to her chart:

Wossamatta U CHEM101 — Women’s Weights (in pounds)
Student          A    B    C    D    E    F    G    H    I    J
Sept.          118  105  123  112  107  130  120   99  119  126
May            125  114  128  122  106  143  124  103  125  135
d = May−Sept.    7    9    5   10   −1   13    4    4    6    9

Notice the new variable d, the difference between matched pairs. (You know the data must be paired, because each May number is associated with one and only one September number. You can’t rearrange the May numbers and still have everything make sense.) This is the heart of Case 3 in Inferential Statistics: Basic Cases: reducing paired numeric data to a simple t test of a population mean difference.
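To see why rearranging the May numbers destroys information, try it. In this minimal Python sketch (variable names are made up for the illustration), the May weights are sorted so they no longer line up with the right students. The average difference doesn’t change, but the individual differences turn into nonsense and their SD balloons.

```python
from statistics import mean, stdev

sept = [118, 105, 123, 112, 107, 130, 120, 99, 119, 126]
may = [125, 114, 128, 122, 106, 143, 124, 103, 125, 135]

# Properly paired: each student's May weight minus her own September weight.
d_paired = [m - s for s, m in zip(sept, may)]

# Pairing destroyed: the same May weights, sorted, matched to the wrong students.
d_scrambled = [m - s for s, m in zip(sept, sorted(may))]

for label, d in (("paired", d_paired), ("scrambled", d_scrambled)):
    print(f"{label:9s} d = {d}  mean = {mean(d):.1f}  SD = {stdev(d):.2f}")
```

The much larger SD of the scrambled differences is the within-group variation that a paired design is meant to remove.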

Here’s what’s new:

Now she’s all set. She has one set of ten numbers, representing the continuous variable “weight gain in freshman year” for a random sample of Wossamatta U women. (Notice with student E, Francine has a negative value for d because May minus September is 106−107 = −1. That student lost weight as a freshman.) Time for a t test!

But first, what will she test? Her original idea was to test the “freshman fifteen”. But a glance at the d’s shows her that no one gained as much as 15 lb. An average can’t be larger than every member of the data set, so there’s no way she could prove a hypothesis that the average gain is above fifteen pounds. She decides instead to try to prove a “freshman five”, μd > 5, with 0.05 significance.

Subtle point here: You never use sample data in a hypothesis, but you can sometimes adjust your hypotheses after you collect your data, especially when it’s obvious that your data won’t prove what you wanted to prove. Another reasonable choice for Francine would be to try to prove simply that the average student gains weight, μd > 0.

When you do a confidence interval, you don’t have to make any decision of this kind because you just follow the data where they lead.

Entering Paired Numeric Data

Francine subtracted by hand here, but you shouldn’t do that because it’s a rich source of errors and makes it harder to check your work. Instead, follow this procedure on your TI-83/84:

  1. Enter the first data set (September, in this case) in L1.
  2. Enter the second data set (May) in L2. Unlike the one-population cases, the order matters.
  3. Check your data entry. Since you entered all of the September figures and then all the May figures, check them the opposite way, first student A September and May, then student B, and so on.
  4. Cursor to L3 — the column heading, not the first number.
  5. Francine defined d as May−Sept., which is L2−L1, so enter that formula. (To subtract in the other direction, enter L1−L2.) As soon as you press [ENTER], the calculator does all the subtractions, wiping out whatever was in L3 previously.

    TI-83 screen with L1 and L2 entered, formula L2 minus L1 in progress for L3      TI-83 screen after applying formula; differences shown in L3

    This isn’t Excel — if you change L1 or L2 after entering the formula for L3, L3 won’t change. You need to re-enter the formula for L3 in that case. (You actually can make the calculator behave like Excel by binding a formula to a list, but it’s not worth the hassle.)
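If you want to double-check the calculator's subtraction away from the TI-83/84, here is a minimal Python sketch of the same list arithmetic. It is only an illustration, not part of the calculator procedure, and the names `sept`, `may`, and `d` are made up for this example.

```python
# Francine's data in student order A through J (pounds).
sept = [118, 105, 123, 112, 107, 130, 120, 99, 119, 126]
may = [125, 114, 128, 122, 106, 143, 124, 103, 125, 135]

# d = May - Sept. for each matched pair, like entering L2-L1 as the formula for L3.
d = [m - s for s, m in zip(sept, may)]
print(d)
```

As with the calculator lists, `d` is computed once: if you edit `sept` or `may` afterward, you have to recompute `d`.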

Hypothesis Test for Mean Difference

With paired numeric data, your population parameter is the mean difference μd. The random variable is a difference (in this case, a number of pounds gained from September to May), so the parameter is the mean of all those weight gains.

(1) d = May−September
H0: μd = 5, average student gains 5 lb or less
H1: μd > 5, average student gains more than 5 lb
(2) α = 0.05
  • Random sample? Yes, effectively. (It’s a random sample of Wossamatta U women frosh, not necessarily those from other colleges.)
  • 10n ≤ N? Yes, because any university has more than 10×10 = 100 women in the freshman class.
  • n = 10 (< 30), so Francine must test for normality and verify absence of outliers. She tests L3, not L1 and L2, because L3 holds her sample data of weight gain:

    TI-83 screen with normality check; see text      TI-83 boxplot with no outliers

    r=.9811 and crit=.9179. r>crit, and the box-whisker shows no outliers.

(3/4) This is a regular T-Test, number 2 in the STAT TESTS menu. Francine writes down
T-Test: 5, L3, 1, >μo
results: t=1.29, p = 0.1146, d̅=6.6, s=3.9, n=10

TI-83 T-Test input screen; see text      TI-83 T-Test results screen; see text

The sample mean is d̅ (“d-bar”), not x̅, because the data are d’s, not x’s.

(5) p > α. Fail to reject H0.
(6) You can’t determine whether the average Wossamatta U woman student gains more than 5 pounds in her freshman year or not (p = 0.1146).
Or: At the 0.05 significance level, you can’t determine whether the average Wossamatta U woman student gains more than 5 pounds in her freshman year or not.

After a “fail to reject H0”, you always remember to write your conclusion in neutral language, right? Maybe the true average weight gain is greater than 5 pounds but this particular sample just happened not to show it; maybe the true average weight gain really is under 5 pounds. A confidence interval can help you get a handle on the effect size.
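To see where the T-Test numbers come from, here is a rough Python sketch of the test statistic for Francine's ten differences. It uses only the standard library, so it stops at t. The p-value needs the t distribution with df = 9, which the calculator supplies (p = 0.1146). The variable names are my own.

```python
import math
import statistics

d = [7, 9, 5, 10, -1, 13, 4, 4, 6, 9]   # weight gains, May - Sept. (pounds)
mu0 = 5                                  # hypothesized mean gain from H0

n = len(d)
d_bar = statistics.mean(d)               # sample mean difference
s = statistics.stdev(d)                  # sample SD (divides by n - 1)
se = s / math.sqrt(n)                    # standard error of the mean difference
t = (d_bar - mu0) / se                   # t statistic with df = n - 1

print(f"d-bar={d_bar:.1f}, s={s:.1f}, n={n}, t={t:.2f}")
```

The printed values should agree with the T-Test screen to the rounding shown there.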

Confidence Interval for Mean Difference

When a hypothesis test fails to reach a conclusion, a confidence interval can salvage at least some information. When a hypothesis test does reach a conclusion, a confidence interval can give you more precise information.

If Francine were doing only the confidence interval, she’d have to start off by testing requirements. But she has already tested them as part of the hypothesis test, so she goes right to the TINTERVAL screen.

Which confidence level does she choose? Her one-tailed hypothesis test at α = 0.05 would be equivalent to a two-tailed test at α = 0.10, and that suggests a confidence level of 90%. But she decides since her hypothesis test has already failed to reach a conclusion she’d at least like to get a 95% CI.

TInterval: L3, 1, .95
results: (3.7948, 9.4052)

TI-83 TInterval input screen; see text      TI-83 TInterval results screen; see text

Conclusion: Francine is 95% confident that the average woman student at Wossamatta U gains 3.8 to 9.4 pounds during her freshman year.

(Francine doesn’t write down d̅, s, and n because she’s already written them in the hypothesis test. She would write them down when she does only a confidence interval.)
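The TInterval bounds are just the sample mean difference plus or minus a margin of error. Here is a minimal sketch of that arithmetic. The critical value `t_star = 2.262` is an assumption taken from a t table (95% confidence, df = 9); the calculator uses more digits, so the last decimal place can differ slightly.

```python
import math
import statistics

d = [7, 9, 5, 10, -1, 13, 4, 4, 6, 9]   # weight gains, May - Sept. (pounds)
t_star = 2.262                           # assumed critical t: 95% confidence, df = 9

d_bar = statistics.mean(d)
se = statistics.stdev(d) / math.sqrt(len(d))   # standard error of the mean difference
margin = t_star * se                           # margin of error E

low, high = d_bar - margin, d_bar + margin
print(f"95% CI: {low:.1f} to {high:.1f} lb gained")
```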

Common mistake: Don’t say the average weight is 3.8 to 9.4 pounds. You aren’t estimating the average first-year woman’s weight, but her weight gain.

Always re-read your conclusion after you write it, and ask yourself whether it seems reasonable in the context of the problem. That can save you from mistakes like this.

11B2.  Example 8: Coffee and Heart Rate


A few years back, a coffee company tried to market drinking coffee as a way to relax — and they weren’t talking about decaf. Jon decided to test this. He randomly selected six adults. He recorded their heart rates, then recorded them again half an hour after each person drank two cups of regular coffee. His data are shown at right. (Data come from Dabes and Janik [1999, 264] [see “Sources Used” at end of book].)

The data are paired, because each person (experimental unit) gives you two numbers, Before and After; because each After is associated with one specific Before; and because you can’t rearrange Before or After and still have the data make sense.

Jon selected the 0.01 significance level. (He tests for difference even though he believes coffee increases heart rate, because it could decrease it.)

Jon could equally well define d as Before−After or After−Before. At least, mathematically he could. But you’ll find it’s easier to interpret results if you always define d as high minus low so that all or most of the d’s will be positive numbers. (You can do this based on your common sense or by looking at the data.) Jon sees that the After numbers are generally larger than the Before numbers, so he chooses d = After−Before.

(1) d = After−Before
H0: μd = 0, coffee makes no difference to heart rate
H1: μd ≠ 0, coffee makes a difference to heart rate
(2) α = 0.01
(RC) Jon has a random sample, but the sample size is <30. (The sample of six is obviously less than 10% of coffee drinkers.) He puts the Before figures in L1, After in L2, and then L2−L1 (not L1−L2) in L3. The box-whisker plot of L3 finds no outliers. The normal probability plot shows r=.9638, crit=.8893; r>crit.

box-whisker plot showing no outliers      normality check; see text

(3/4) T-Test: 0, L3, 1, ≠μo
results: t=5.56, p = 0.0026, d̅=4.2, s=1.8, n=6

T-Test results screen: t=5.562426994, p=.0025836914, x-bar=4.166666667, s=1.834847859, n=6

(5) p < α. Reject H0 and accept H1.
(6) Drinking coffee does make a difference in heart rate half an hour later (p = 0.0026). In fact, coffee increases heart rate.
Or: Drinking coffee does make a difference in heart rate half an hour later, at the 0.01 significance level. In fact, drinking coffee increases heart rate.

As usual, when you do a two-tailed test and p < α, you can interpret it in a one-tailed manner. Jon defined d as After−Before, which is the amount of increase in each subject’s heart rate. His sample mean d̅ was positive, so the average outcome in his sample was an increase. Because he proved that the mean difference μd for all people is nonzero, the sign of his sample mean difference d̅ tells him the sign of the population mean difference μd.

Jon can’t say that the average increase for people in general is 4.2 beats per minute. That was the mean difference in his sample. If he wants to know the mean difference for all people, he has to construct a confidence interval:

TInterval: L3, 1, .99

result: (1.146, 7.187)

Jon is 99% confident that the average increase in heart rate for all people, half an hour after drinking two cups of coffee, is 1.1 to 7.2 beats per minute.

Caution! The confidence interval expresses a difference, not an absolute number. You are estimating the amount of increase or decrease, not the heart rate. A common mistake would be to say something about the heart rate being 1.1 to 7.2 bpm after coffee. Again, you’re not estimating the heart rate, you’re estimating the change in heart rate.

11C.  Inference with Unpaired Numeric Data (Case 4)

With paired data, you tested the population mean difference μd between matched pairs. But suppose you don’t have matched pairs? With unpaired data in independent samples, you test the difference between the means of two populations, μ1−μ2.

This is Case 4 in Inferential Statistics: Basic Cases. Key features:

Advice: Take your time when you look at data to decide whether you have paired or unpaired data. If your sample sizes are different, it’s a no-brainer: the data are unpaired. But if the sample sizes are the same, think carefully about whether the data are paired or unpaired. Sometimes students just seem to take a stab in the dark at whether data are paired or unpaired, but if you just stop and think about how the data were taken you can make the right decision every time. Look back at Paired and Unpaired Data at the beginning of this chapter if you need a refresher on the difference.

Example 9: A Tough Grader?

Prof. Sullivan’s students at Wossamatta U felt that he was a tougher grader than the other speech professors. They decided to test this, at the 0.05 significance level.

Eight of them each took a two-hour shift, assigned randomly at different times and days of the week, and distributed a questionnaire to each student on the main quad. They felt this was a reasonable approximation to a random sample of current students. (They asked students not to take a questionnaire if they had already submitted one.) The questionnaire asked whether the student had taken speech in a previous semester, and if so from which professor and what grade they received. They then divided the questionnaires into three piles, “no speech”, “Sullivan”, and “other prof”.

It would be possible to do an analysis with the categorical data of letter grades. But you should always use numerical data when you can, because p-values are usually lower with numeric data than attribute data, for a given sample size. The students counted an A as 4 points, A-minus as 3.7, and so on. Here is a summary of their findings:

Students of    Mean    Standard Deviation    Sample Size
Sullivan       2.21    1.44                  32
Other prof     2.68    1.13                  54

Hypothesis Test for Difference of Means

In this test, you have unpaired numeric data in two samples. The requirements for each sample are the same as for the sample in a one-sample t test:

There’s an additional requirement for the two samples:

Here’s the hypothesis test, as performed by Prof. Sullivan’s students:

(1) pop. 1 = Sullivan students, pop. 2 = other speech profs’ students
H0: μ1 = μ2, no difference in average grades
H1: μ1 < μ2, Sullivan’s grades lower on average
(2) α = 0.05
  • Random sample (systematic).
  • Are samples less than 10% of their populations? 10×32 = 320, and 10×54 = 540. At a university there are almost certainly more speech students per professor than that, especially considering multiple years.
  • Both sample sizes > 30.
  • Samples independent (no connection between Sullivan students and non-Sullivan students).
(3/4) 2-SampTTest: x̅1=2.21, s1=1.44, n1=32, x̅2=2.68, s2=1.13, n2=54, μ1<μ2, Pooled:No
Results: t = −1.58, p = 0.0600, df=53.58

2-SampTTest input screen: x-bar 1 = 2.21, sx 1 = 1.44, n 1 = 32, x-bar 2 = 2.68, sx 2 = 1.13, n 2 = 54, mu 1 less than mu 2, pooled no      2-SampTTest results screen: t = minus 1.58036729, p = .0599544103, df = 53.57941422

The test statistic is still Student’s t, but adapted for two samples. See the BTW note below for more about that and about the funny number of degrees of freedom.

(5) p > α. Fail to reject H0.
(6) At the 0.05 level of significance, they can’t determine whether Prof. Sullivan is a tougher grader than the other professors or not.
How does your calculator analyze a difference of independent means? If you remember what you learned about one-sample t tests, all you have to do is extend it.

You’re working with a difference of sample means. The standard error of the mean for the first population is s1/√n1 and therefore the variance is s1²/n1, and similarly for the second population. The variance of the sum or difference of independent variables is the sum of their variances, so VAR(x̅1−x̅2) = s1²/n1 + s2²/n2. The standard deviation (the standard error of the difference of sample means) is the square root of the variance:

$$SE_{\bar{x}_1-\bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

It turns out that the difference of sample means follows a t distribution, if you choose the right number of degrees of freedom (more on that later). The one-sample test statistic was t = (x̅−μo) / (s/√n). The two-sample test statistic is analogous, with the differences substituted. The test statistic becomes

$$t = \frac{(\bar{x}_1-\bar{x}_2) - (\mu_1-\mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

In this course, you’ll just be testing whether one population mean is greater than, less than, or different from the other. In other words, you’ll test against a hypothetical mean difference of 0. That simplifies t a bit:

$$t = \frac{\bar{x}_1-\bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

What about degrees of freedom? You might think df would be n1+n2−1, but it isn’t. The sampling distribution approximately follows a t with df equal to the lower of n1−1 and n2−1. It’s only approximate because the population SDs are usually different. The exact degrees of freedom were computed by B. L. Welch (1938) [see “Sources Used” at end of book], and the horrendous, ugly equation is

$$df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1-1} + \dfrac{(s_2^2/n_2)^2}{n_2-1}}$$

Fortunately, your TI-83/84 has the computation built in, and you don’t have to worry about it.
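If you're curious how the pieces fit together, here is a minimal Python sketch that computes the unpooled two-sample t statistic and the Welch degrees of freedom from summary statistics. The function name `welch_t` is my own invention, and the df line uses the standard Welch-Satterthwaite formula as I understand it. The p-value still needs a t distribution, which the calculator provides.

```python
import math

def welch_t(mean1, s1, n1, mean2, s2, n2):
    """Return (t, df) for a two-sample t test of mu1 - mu2 = 0, not pooled."""
    v1 = s1**2 / n1                      # variance of the first sample mean
    v2 = s2**2 / n2                      # variance of the second sample mean
    se = math.sqrt(v1 + v2)              # standard error of the difference
    t = (mean1 - mean2) / se
    # Welch-Satterthwaite approximation for degrees of freedom
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# Summary statistics from the tough-grader example
t, df = welch_t(2.21, 1.44, 32, 2.68, 1.13, 54)
print(f"t = {t:.2f}, df = {df:.2f}")
```

With the Sullivan data this should reproduce the t and df on the 2-SampTTest results screen, to rounding.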

What about pooling? Why do you always select Pooled:No on your TI-83/84? Well, if the two populations have the same SD (if they are homoscedastic) you can treat them as a single population (pool the data sets) and use a higher number of degrees of freedom. That in turn means your p-value will be a bit lower, so you’re a bit more likely to be able to reject H0. Sounds good, right? But there are problems:

For these reasons and others, the issue of pooling is controversial. Some books don’t even mention it. It’s best just to use Pooled:No always.

Confidence Interval for Difference of Means

The requirements are exactly the same as the requirements for the hypothesis test. You compute a confidence interval on your TI-83/84 through 2-SampTInt.

Since they couldn’t prove that Prof. Sullivan was a tough grader, the students decided to compute a 90% confidence interval for the difference between Prof. Sullivan’s average grades and the other speech profs’ average grades:

pop. 1 = Sullivan students; pop. 2 = other speech profs’ students
Requirements: already covered in hypothesis test.
2-SampTInt: x̅1=2.21, s1=1.44, n1=32, x̅2=2.68, s2=1.13, n2=54, C-Level=.9, Pooled:No
Results: (−.9678, .02779)

2-SampTInt input screen: x-bar 1 = 2.21, sx 1 = 1.44, n 1 = 32, x-bar 2 = 2.68, sx 2 = 1.13, n 2 = 54, C-Level = .9, pooled no      2-SampTInt results screen: −.9678, .02779

Interpretation: The TI-83 gives you the bounds for the confidence interval about μ1−μ2. A negative number indicates μ1 smaller than μ2, and a positive number indicates μ1 larger than μ2. Therefore:

We’re 90% confident that the average student in Prof. Sullivan’s classes receives somewhere between 0.97 of a letter grade lower than the average student in other profs’ speech classes, and 0.03 of a letter grade higher.

Remark: The 90% confidence interval is almost all negative. This reflects the fact that the p-value in the one-tailed test for μ1 < μ2 was almost as low as 0.05.

The students could have chosen any confidence level they wanted, just for showing an effect size. But for a confidence interval equivalent to their one-tailed hypothesis test that used α = 0.05, the confidence level has to be 1−2×0.05 = 0.90 = 90%.

Why do you need a special two-sample t procedure? Can’t you just compute a confidence interval from each sample and then compare them? No, because the standard errors are different. The two-sample standard error takes the sample SD and sample sizes into account. Here’s a simple example, provided by Benjamin Kirk:

A farmer tests two diets for his pigs, randomly assigning 36 pigs to each sample. The Diet A group gained an average 55 lb with SD of 3 lb; that gives a 95% confidence interval 54.0 to 56.0 lb. The Diet B group gained 53 lb on average, with SD of 4 lb; the CI is 51.6 to 54.4 lb. Those intervals overlap slightly, which would not let you conclude that there’s any difference in the diets.

But the 2-SampTInt is 0.3 to 3.7 lb in favor of Diet A, which says there is a difference. The issue is that the B group had a lower sample mean, but there was more variation within the group.
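Here is a rough sketch of the arithmetic behind the pig example, showing why the two separate intervals overlap while the two-sample interval excludes zero. The critical values are assumptions read from a t table (2.030 for df = 35, and 1.997 for the Welch df of about 65); the function and variable names are mine.

```python
import math

def mean_ci(mean, sd, n, t_star):
    """One-sample confidence interval for a mean: mean plus or minus t* times s/sqrt(n)."""
    e = t_star * sd / math.sqrt(n)
    return mean - e, mean + e

T_ONE_SAMPLE = 2.030   # assumed critical t: 95% confidence, df = 35
T_TWO_SAMPLE = 1.997   # assumed critical t: 95% confidence, Welch df about 65

ci_a = mean_ci(55, 3, 36, T_ONE_SAMPLE)          # Diet A
ci_b = mean_ci(53, 4, 36, T_ONE_SAMPLE)          # Diet B

# Two-sample interval for mu_A - mu_B uses the standard error of the difference.
se_diff = math.sqrt(3**2 / 36 + 4**2 / 36)
e_diff = T_TWO_SAMPLE * se_diff
ci_diff = (55 - 53 - e_diff, 55 - 53 + e_diff)

overlap = ci_b[1] > ci_a[0]          # top of B's interval is above bottom of A's
excludes_zero = ci_diff[0] > 0       # whole difference interval is positive

print(ci_a, ci_b, ci_diff, overlap, excludes_zero)
```

The standard error of the difference (about 0.83) is smaller than the sum of the two separate standard errors (0.50 + 0.67), which is why eyeballing overlap between two separate intervals is too conservative.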

11C1.  Example 10: Sorority Academics

The Alpha Alpha Alpha sorority chapter at Staples University (Yes, corporate sponsorship is getting ridiculous!) has a tradition of putting in extra effort academically. They gave their incoming pledges the task of proving that Alpha Alpha Alpha had higher average GPA than other sororities, at the 0.05 level of significance. The Alphas are a large sorority, with 119 members.

The pledges hacked the campus server and obtained GPAs of ten randomly selected Alphas and ten randomly selected members of other sororities on campus. Do their ill-gotten data prove their point?

Alphas:            2.31  3.36  2.77  2.93  2.27  2.35  3.
Other sororities:  1.49  1.74  2.70  2.40  2.17  1.08  1.85  1.96  2.08  1.49

Since you have independent samples (unpaired data) from two different populations, this is Case 4, difference of population means, in Inferential Statistics: Basic Cases.

Caution: You can’t treat these as paired data just because the sample sizes are equal; that’s a rookie mistake. When deciding between a paired or an unpaired analysis, always ask yourself: “Is data point 1 from the first sample truly associated with data point 1 from the second sample?” In this case, they’re not.

(1) pop. 1 = Alpha Alpha Alpha; pop. 2 = other sororities
H0: μ1 = μ2, No difference in average GPA
H1: μ1 > μ2, Average GPA of all Alphas is higher than other sororities
(2) α = 0.05

You check requirements against both samples independently. These samples are both smaller than 30, so you have to check normality and outliers on both. Here are the normality checks:

normality check for Alphas' GPAs      normality check for other sororities' GPAs

The first picture doesn’t look much like a straight line, but r is greater than crit, so it’s close enough. (With small data sets like this one, fitting the data to the screen can make differences look larger than they really are.)

stacked boxplots of Alphas' and other sororities' GPAs

The calculator lets you “stack” two or three boxplots on one screen. Not only is this a bit of a labor saver, but it also gives you a good sense of how different the samples are. To do this, select “Compare 2 smpl” on the first box-whisker screen. (You can guess what “Compare 3 smpl” does, but we don’t use it in this course.)

For these samples, the difference is dramatic. Every single Alpha’s GPA (in the sample) is above the third quartile in the sample of other sororities, and the max of other sororities is just barely above the median Alpha.

With such a big difference, why do the pledges even need to do a hypothesis test? Because they know these are just samples. Maybe the Alphas actually aren’t any better academically, but these particular samples just happened to be far apart. The hypothesis test tells you whether the difference you see is too big to be purely the result of random chance in sample selection.

  • Random samples, OK
  • 10% of Alphas is 12, and the sample is smaller than that. We don’t know how many are in all the other sororities combined, but it must be more than 10×10 = 100. OK
  • Normality check, sample 1: r(.9567) > crit(.9179), OK
  • Normality check, sample 2: r(.9946) > crit(.9179), OK
  • Box-whisker: no outliers in either sample, OK
(3/4) 2-SampTTest: L1, L2, 1, 1, >μ2, Pooled:No
outputs: t = 3.93, p-value = 0.0005,
x̅1 = 2.70, s1 = 0.43, n1 = 10
x̅2 = 1.90, s2 = 0.48, n2 = 10

output screen for 2-sample t test, described in the text

(5) p < α. Reject H0 and accept H1.
(6) The average GPA in Alpha Alpha Alpha is higher than the average GPA of other sorority members (p = 0.0005).
[Or, at the 0.05 level of significance, the average GPA in Alpha Alpha Alpha is higher than the average GPA of other sorority members.]

Comment: You have to phrase your conclusion carefully. The pledges proved that the average GPA of Alphas is higher than the average GPA of all other sorority members, not all other sororities. What’s the difference? Here’s a simple example. Suppose there are ten other sororities besides the Alphas. The Omegas have an average GPA of 3.66, higher than the Alphas’ average. If the other nine each have an average GPA of 1.70, that could easily produce exactly the sample that the pledges got.

The message here: Aggregating data can lose information. Sometimes that’s okay, but be wary when one population is being compared to an aggregate of multiple other populations.

11D.  Inference on Two Proportions (Case 5)

When you have two samples of binomial data, they represent two populations. Each population has some proportion of successes, p1 and p2 respectively. You don’t know those true proportions, and in fact you’re not concerned with them. Instead, you’re concerned with the difference between the proportions, p1−p2. You can test whether there is a difference (hypothesis test), or you can estimate the size of the difference (confidence interval).

This is Case 5 in Inferential Statistics: Basic Cases. Key features of Case 5, the difference of proportions:

Advice: take your time with two-sample binomial data. You have a lot of p’s and a lot of percentages floating around, and it’s easy to get mixed up if you try to hurry.

Take extra care when writing conclusions. You’re making statements about the difference between the two proportions, not about the individual proportions. And you’re making statements about the difference in proportions between the populations, not between the samples.

11D1.  Example 11: Traffic Stops and Traffic Tickets

Stopped by Traffic Cop
Just a

One of my students — call him Don — had several traffic tickets, and he knew one more would trigger a suspension. He felt that women stopped by a traffic cop were more likely than men to get off with just a warning, and for his Field Project he set out to prove it, with α = 0.05.

Don quickly realized that he should test whether men and women stopped by a cop are equally likely to get a ticket, not just whether men are more likely. After all, he couldn’t rule out the possibility that women are more likely to get a ticket if stopped.

Don distributed a questionnaire to a systematic sample of TC3 students. (He assumed that any gender-based difference in TC3 students would be representative of college students in general. That seems reasonable.) He asked three questions:

  1. Male or female?
  2. Stopped by a traffic cop since your 18th birthday?
  3. If yes, did you receive a ticket the last time you were stopped?

Don disregarded any questionnaires from students who had never been stopped as adults. He wasn’t interested in the likelihood of getting a ticket, but in the likelihood of getting a ticket after being stopped by a cop. You could say that he was interested in the different proportions, for men and women, of stops that lead to tickets.

Hypothesis Test for Difference of Proportions

This is just another variation on the good old Seven Steps of Hypothesis Tests:

Here are the requirements for a Case 5 hypothesis test of a difference of proportions:

Here is Don’s hypothesis test about the different proportions of men and women that receive tickets after being stopped in traffic.

(1) population 1 = college men stopped by traffic cops; population 2 = college women stopped by traffic cops
H0: p1 = p2, college men and women equally likely to get a ticket after being stopped
H1: p1 ≠ p2, college men and women not equally likely to get a ticket after being stopped
(2) α = 0.05
  • Samples 1 and 2 random? Yes, effectively (systematic).
  • 10n1 = 10×97 = 970, and there have been more than 970 male students (at all colleges) stopped by traffic cops.
  • 10n2 = 10×70 = 700, and there have been more than 700 female students (at all colleges) stopped by traffic cops.
  • Sample 1 has 86 successes and 97−86 = 11 failures; sample 2 has 55 successes and 70−55 = 15 failures.
(3/4) 2-PropZTest: 86, 97, 55, 70, p1≠p2
Results: z=1.77, p-value = 0.0760, p̂1=.89, p̂2=.79, p̂=.84

2-PropZTest input screen: x 1 = 86, n 1 = 97, x 2 = 55, n 2 = 70, p 1 not equal p 2      2-PropZTest results screen: z=1.774261748, p=.0760197674, p-hat 1 = .8865979381, p-hat 2 = .7857142857, p-hat = .8443113772
There’s a difference of 10 percentage points between the sample proportions, but with Don’s sample sizes that difference is not large enough to be statistically significant. Even if there really is a difference in proportions for college men and women in general, random chance would be enough to explain the difference Don sees in his samples.

(5) p > α. Fail to reject H0.
(6)At the 0.05 level of significance, Don can’t tell whether men and women stopped by traffic cops are equally likely to get tickets, or not.

If this non-conclusion leaves you non-satisfied, you’re not alone. As usual, the confidence interval (next section) can provide some information.

Why does the “official” requirement use a pooled proportion instead of testing each sample? In fact, for a confidence interval you always test requirements for each sample. But in a hypothesis test, your H0 is always “no difference in population proportions”, and a hypothesis test always starts by assuming H0 is true. If the null is true, then there is no difference in the two populations, and you really just have one big sample of size n1+n2 and sample proportion p̂. So that’s what you test.

Why is this a z test? For the same reason that a one-proportion test is a z test: from the population proportion p you know the SD.

Of course the two-population case is a bit more complicated. You need the key fact that when you add or subtract independent random variables, their variances add. If the two populations have the same proportion p, as H0 assumes, then the SD of the sampling distribution of the proportion for population 1 is √[p̂(1−p̂)/n1], and similarly for population 2, where p̂ is the pooled proportion mentioned in the requirements check, above. Square the SDs to get the variances, add them, and take the square root to get the standard error:

$$SE_{\text{pooled}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n_1} + \frac{\hat{p}(1-\hat{p})}{n_2}}$$

And from this you have the test statistic:

$$z = \frac{\hat{p}_1-\hat{p}_2}{SE_{\text{pooled}}}$$
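The pooled standard error and z statistic can be sketched in a few lines of Python. This is only an illustration of the arithmetic, using the standard library's `statistics.NormalDist` for the normal curve; the function name `two_prop_z_test` is made up here and is not the calculator's internal routine.

```python
import math
from statistics import NormalDist

def two_prop_z_test(x1, n1, x2, n2):
    """Two-tailed test of H0: p1 = p2, using the pooled sample proportion."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                  # pooled proportion under H0
    se = math.sqrt(p_pool * (1 - p_pool) / n1 + p_pool * (1 - p_pool) / n2)
    z = (p1_hat - p2_hat) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # area in both tails
    return z, p_value

# Don's data: 86 of 97 men and 55 of 70 women got tickets
z, p_value = two_prop_z_test(86, 97, 55, 70)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

For Don's samples this should match the 2-PropZTest results screen to rounding.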

Confidence Interval for Difference of Proportions

In a confidence interval for the difference of two proportions, some unknown proportion p1 of population 1 has some characteristic, and some unknown proportion p2 of population 2 has that characteristic. You aren’t concerned with those proportions on their own, but you want to estimate which population has the greater proportion, and by how much.

The requirements for a CI are almost the same as a HT, but with one subtle difference:

Why is that last requirement different from the “official” requirement for the hypothesis test? With the HT, you assumed H0 was true and both populations had the same proportion. That let you use a blended or pooled proportion from your combined samples. But with a CI, you don’t make any such assumption. What would be the point of a confidence interval for the difference if you assume there is no difference?

But despite the difference in theory, as a practical matter you can just test for ≥ 10 successes and ≥ 10 failures in each sample for both HT and CI.
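That practical check is easy to express in code. Here is a tiny sketch, with a made-up function name, applied to Don's two samples:

```python
def enough_successes_and_failures(x, n, minimum=10):
    """True if a sample of size n with x successes has at least `minimum` of each outcome."""
    failures = n - x
    return x >= minimum and failures >= minimum

men_ok = enough_successes_and_failures(86, 97)     # 86 tickets, 11 non-tickets
women_ok = enough_successes_and_failures(55, 70)   # 55 tickets, 15 non-tickets
print(men_ok, women_ok)
```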

Don has already checked requirements in the hypothesis test, so he moves right to a 2-PropZInt:

2-PropZInt input screen: x 1 = 86, n 1 = 97, x 2 = 55, n 2 = 70, C-Level = .95      2-PropZInt results screen: minus .0141 to plus .21587

Don gets a result of −1.4% to +21.6%. How does he interpret that? Well, he can write it as

−1.4%   ≤   p1−p2   ≤   21.6%   (95% conf.)

Adding p2 to all three “sides” gives

p2−1.4%   ≤   p1   ≤   p2+21.6%   (95% conf.)

With 95% confidence, p1 is somewhere between 1.4% below p2 and 21.6% above p2. You don’t know the numerical value of p1, but out of male students who are stopped by a traffic cop, p1 is the proportion who get a ticket, and similarly for p2 and women. So Don can write his confidence interval like this:

cartoon about percentage points
used by permission; source: (accessed 2014-10-03)

I’m 95% confident that, out of students stopped by traffic cops, the proportion of men who actually get tickets is somewhere between 1.4 percentage points less than women, and 21.6 percentage points more than women.

If you’re not feelin’ the love with the algebra approach, you can reason it out in words. The confidence interval is the difference in proportions for men minus women. If that’s negative, the proportion for men is less than the proportion for women; if the difference is positive, the proportion for men is greater than the proportion for women.

Why do I say “percentage points” instead of just “percent” or “%”? Well, how do you describe the difference between 1% and 2%? It’s a difference of one percentage point, but it’s a 100% increase, because the second one is 200% of the first. When you subtract two percentages, the difference is a number of percentage points. If you just say “percent”, that means you’re expressing the difference using one of the percentages as a base, even if you don’t mean to.
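The two ways of describing the gap between 1% and 2% come from two different calculations. A minimal sketch, with variable names of my own choosing:

```python
first, second = 1.0, 2.0    # two percentages: 1% and 2%

# Simple subtraction gives percentage points.
point_difference = second - first

# Dividing by the starting value gives a relative change, in percent.
relative_increase = (second - first) / first * 100

print(f"{point_difference:.0f} percentage point, {relative_increase:.0f}% increase")
```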

Getting back to Don’s confidence interval, the −1.4% to +21.6% difference between men and women in traffic tickets is a simple subtraction of men’s rate minus women’s rate, so it is percentage points, not percent.

Where does the confidence interval come from? First you have to find the standard error. Yes, it’s different from the standard error associated with the hypothesis test. Why? That standard error assumed H0 was true and used the pooled p̂. You can’t do that in the confidence interval, because if H0 is true then the difference between the population proportions is zero and you don’t have a confidence interval!

The standard deviation of the sampling distribution of the proportion for population 1 is √[p̂1(1−p̂1)/n1], and similarly for population 2. Square them, add, and take the square root to get the SD of the distribution of differences in sample proportions, also known as the standard error of the difference of proportions:

$$SE_{\hat{p}_1-\hat{p}_2} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

The margin of error is zα/2 times that. The center of the confidence interval is the point estimate, (p̂1−p̂2), so the bounds for the (1−α)% confidence interval are

$$(\hat{p}_1-\hat{p}_2) - E \;\le\; p_1 - p_2 \;\le\; (\hat{p}_1-\hat{p}_2) + E \qquad\text{where}\qquad E = z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$
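If you'd rather check the arithmetic off the calculator, here is a minimal Python sketch of that interval. It uses only the standard library. The function name `two_prop_z_interval` is made up for this illustration, and `NormalDist().inv_cdf` plays the role of invNorm. The sample counts are Don's traffic-stop data from Example 12.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_z_interval(x1, n1, x2, n2, conf_level=0.95):
    """Confidence interval for p1 - p2, using the unpooled standard error."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    alpha = 1 - conf_level
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # like invNorm(1 - alpha/2)
    margin = z_crit * se                           # E in the formula above
    diff = p1_hat - p2_hat                         # point estimate
    return diff - margin, diff + margin

# Don's samples: 86 of 97 men and 55 of 70 women
low, high = two_prop_z_interval(86, 97, 55, 70)
print(f"{low:+.4f} to {high:+.4f}")
```

The printed bounds should agree, to rounding, with Don's interval of −0.0141 to +0.21587.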

Just like with numeric data, you have to use the two-sample procedure to compute a correct confidence interval. Here’s an example.

Two candidates are running for city council, so they each commission an exit poll on Election Day. Of 200 voters polled, 110 voted for Mr. X; 90 of a different 200 voted for Ms. Y. The 95% confidence intervals are 48.1% to 61.9% and 38.1% to 51.9%. The intervals overlap, so Ms. Y might still hope for victory. But a 2-PropZInt tells a different story. The interval for the difference of proportions, X−Y, is 0.2% to 19.8%, so Mr. X is 95% confident of winning, and the only question is whether it will be a squeaker or a landslide.
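Here is a small Python sketch of the exit-poll comparison, in case you want to see both kinds of interval computed together. The helper names are invented for this illustration, and the one-proportion interval uses the usual p̂ ± z·√[p̂(1−p̂)/n] formula, which I'm assuming from the one-population case.

```python
from math import sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)   # critical z for 95% confidence

def one_prop_interval(x, n):
    """Separate 95% interval for a single proportion."""
    p_hat = x / n
    margin = Z95 * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def diff_interval(x1, n1, x2, n2):
    """95% interval for the difference of two proportions."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    margin = Z95 * sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    return (p1_hat - p2_hat) - margin, (p1_hat - p2_hat) + margin

x_low, x_high = one_prop_interval(110, 200)        # Mr. X
y_low, y_high = one_prop_interval(90, 200)         # Ms. Y
d_low, d_high = diff_interval(110, 200, 90, 200)   # X minus Y

separate_intervals_overlap = y_high > x_low
difference_excludes_zero = d_low > 0
print(separate_intervals_overlap, difference_excludes_zero)
```

Both flags come out true: the separate intervals overlap, yet the interval for the difference stays above zero.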

Necessary Sample Size for Confidence Interval

You have a confidence level and a desired margin of error in mind. How large must each sample be?

You may remember with the one-population binomial case, part of the calculation was your prior estimate, or if you had no prior estimate you used 0.5. With two binomial populations, you need a prior estimate (or 0.5) for each one.

The easiest way to compute the necessary sample size is to use MATH200A Program part 5. If you don’t have the program and want to get it, see Getting the Program. If you’d rather not use the program, you can calculate the necessary sample size with the formula in the next paragraph.

The formula for sample size is not too difficult. Start with the formula for margin of error. The desired confidence level determines critical z. But when you fill in your desired margin of error E and your prior estimates p̂1 and p̂2, you still have two unknowns, n1 and n2. The simplest assumption is that you’ll make your two samples the same size, so set n1 = n2 and solve:

$$n_1 = n_2 = \left[\hat{p}_1(1-\hat{p}_1) + \hat{p}_2(1-\hat{p}_2)\right]\left(\frac{z_{\alpha/2}}{E}\right)^2$$

For a detailed explanation, with worked examples, see How Big a Sample Do I Need?.

Caution! When you’re planning to study the difference between two binomial populations, you have to use the two-population binomial computation of sample size. If you compute one sample size for sample 1 and a separate sample size for sample 2, you’ll come out much too low.

Example 12: Let’s look back once more at Don and his traffic stops. His 95% confidence interval was −0.0141 to +0.21587. That’s a margin of error of (.21587−(−.0141))/2 = 11½ percentage points. How large must his samples be if he wants a margin of error of no more than 5 percentage points but he’s willing to be only 90% confident?

Solution: Don can use his sample proportions as prior estimates. Those were 86/97 ≈ 0.8866 for men and 55/70 ≈ 0.7857 for women.

With the MATH200A program (recommended):

Here’s the output screen from MATH200A Program part 5, 2-pop binomial:

MATH200A results: p̂ = .8866 and .7857, E ≤ .05, C-Level = .9, z Crit = 1.64, n ≥ 292 per sample

If you’re not using the program:

The calculation is a little easier if you break it into chunks. First compute p̂1(1−p̂1) + p̂2(1−p̂2). When you press [Enter], the calculator displays that result.

You want to multiply that by (zα/2/E)². Press the [×] key, and the calculator displays Ans*. Then press the opening paren [(], enter the fraction, and square it.

What is zα/2? You did this in How Big a Sample for Binomial Data? in Chapter 9. The confidence level is 1−α = 0.9, so α = 0.1, α/2 = 0.05, and zα/2 is invNorm(1−.05). The margin of error is 5% or .05 (not .5 !).

Calculator screen: .8866(1−.8866)+.7857(1−.7857) yields .26891595; Ans*(invNorm(1−.05)/.05)² yields 291.0255149

Caution: You don’t round the sample size. If you don’t get a whole number from the calculation, always go up to the next whole number. A sample size of 291.0255149 or greater gives a margin of error of .05 or less, at 90% confidence. The smallest whole number that is 291.0255149 or greater is 292, not 291.

Answer: Don needs a sample of 292 men and 292 women if he wants 90% confidence in an estimate of the difference with margin of error no more than 5%.
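For readers working away from the calculator, here is a minimal Python sketch of the same sample-size calculation. The function name `sample_size_two_prop` is made up for this illustration, and it assumes equal sample sizes, just as the formula does.

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_prop(p1_prior, p2_prior, margin, conf_level):
    """Smallest equal per-sample size with margin of error <= margin."""
    alpha = 1 - conf_level
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # like invNorm(1 - alpha/2)
    chunk = p1_prior * (1 - p1_prior) + p2_prior * (1 - p2_prior)
    raw = chunk * (z_crit / margin) ** 2
    return ceil(raw)   # always go up to the next whole number

# Don's prior estimates, 5-percentage-point margin, 90% confidence
n = sample_size_two_prop(0.8866, 0.7857, margin=0.05, conf_level=0.90)
print(n, "per sample")
```

Using `ceil` rather than `round` is the point of the caution about rounding: 291.03 has to become 292.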

Rookie mistake: Don’t just say “292”. It’s 292 from each population.

Why do you need such large samples, even at a confidence level as low as 90%? Part of the answer is that binomial data do need large samples; remember that a single sample of just over a thousand gives you a 3% margin of error at the 95% confidence level. And when you have two populations, you are estimating the difference between two unknown parameter values, p1 and p2. If each of those was estimated within a 3% margin of error, the margin of error for their difference could be as much as 6%, so the samples have to be larger in the two-population binomial case.

Example 13: The Prime Minister knows that his program of tax cuts and reduced social services appeals more to Conservatives than to Labour, but he wants to know how large the difference is. To estimate the difference with 95% confidence, with a margin of error of no more than 3%, how many members of each party must he survey?

Solution: You’re given no estimate of support within either party, so use 0.5 for p̂1 and p̂2. E = 0.03 (not 0.3).

With the MATH200A program (recommended):

MATH200A/sample size/2-pop binomial:

p̂ = .5 and .5, E ≤ .03, C-Level = .95, z Crit = 1.96, n = 2135 per sample

If you’re not using the program:

First compute p̂1(1−p̂1) + p̂2(1−p̂2) = 0.5(1−0.5)+0.5(1−0.5). You have to multiply that by (zα/2/E)², and you find zα/2 like this: C-Level = 1−α = 0.95 ⇒ α = 1−0.95 = 0.05 ⇒ α/2 = 0.025 ⇒ zα/2 = invNorm(1−.025).

Calculator screen: .5(1−.5)+.5(1−.5) yields .5; multiply by (invNorm(1−.025)/.03)². The result is about 2134.1, so go up to the next whole number, 2135.

Answer: To gauge the difference within a 3% margin of error, at the 95% confidence level, the Prime Minister needs to poll 2135 Conservative Party members and 2135 Labour Party members.
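To see why the two-population computation matters, here is a short Python sketch that puts it next to the one-population figure for the same margin and confidence level. The one-population formula, p̂(1−p̂)(zα/2/E)², is my assumption carried over from the one-sample case. It is shown only for contrast.

```python
from math import ceil
from statistics import NormalDist

z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)   # 95% confidence
E = 0.03                                      # 3 percentage points, not 0.3

# Two-population formula, both prior estimates 0.5
two_pop = ceil((0.5 * (1 - 0.5) + 0.5 * (1 - 0.5)) * (z_crit / E) ** 2)

# One-population formula (assumed), for contrast only
one_pop = ceil(0.5 * (1 - 0.5) * (z_crit / E) ** 2)

print(two_pop, "per party with the two-population formula")
print(one_pop, "if you wrongly sized each sample on its own")
```

Sizing each sample separately would give only about half of what the Prime Minister actually needs from each party.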

11D2.  Example 14: Gardasil Vaccine

The Gardasil vaccine is marketed by Merck to prevent cervical cancer. What are the statistics behind it? How do women decide whether to get vaccinated? Should the vaccine be mandatory?

A Cortland Standard story (21 Nov 2002) summarized an article from the New England Journal of Medicine as follows:

A new vaccine can protect against Type 16 of the human papilloma virus, a sexually transmitted virus that causes cervical cancer, a new study shows. An estimated 5.5 million people become infected with a strain of HPV [not necessarily this strain] each year in the United States.

Efficiency rate of vaccine and placebo

Placebo:  Group size 765, infection 41

HPV-16 vaccine:  Group size 768, infection 0

Note: The study included 1533 women with an average age of 20.

(Similar studies were done for the vaccine’s effectiveness against another strain, HPV-18. According to the front page of the Wall Street Journal on 16 Apr 2007, HPV-16 and -18 between them “are thought to cause 70% of cervical-cancer cases.” The vaccine, developed by Merck, is now marketed as Gardasil.)

The samples certainly show an impressive difference, but is it statistically significant? Could the luck of random sampling be enough to account for that difference in infection rates?

The claim is “the vaccine protects against HPV-16.” To translate this into the language of statistics, realize that there are two populations: (1) women who don’t get the vaccine, and (2) women who do get the vaccine.

Notice that the populations are all women, past, present, and future who don’t or do get vaccinated. The 765 and 768 women are samples, not populations. The populations are unvaccinated and vaccinated, not placebo and vaccine. Placebos are administered to members of a sample, but a population doesn’t “get placeboed”.

The data type is attribute (binomial) because the original question or measurement of each participant is the yes/no question: “Did this woman contract the virus?” (“Success” is an HPV-16 infection, not a good thing.) Since you’re comparing two populations, this is Case 5, Difference of Two Proportions.

Is the Vaccine Effective?

If the vaccine works, then you expect more women without the vaccine to contract the virus, so make them population 1. (That’s not necessary; it just usually makes things a little simpler to call population 1 the one with higher numbers expected.)

Although you hope that the vaccine population will have a lower infection rate, it’s not impossible that they could have a higher rate. Therefore you do a two-tailed test (≠). If p < α, then it’s time to say whether the vaccine makes things better or worse.

Let’s use α = 0.001. You’re talking about cancer in humans, after all. A Type I error would be saying that Gardasil makes a difference when actually it doesn’t. You don’t want women to get vaccinated, and have a false sense of security, if the vaccine actually doesn’t work, so a Type I error is pretty serious.

(1) population 1 = unvaccinated women; population 2 = vaccinated women
H0: p1 = p2, the vaccine makes no difference
H1: p1 ≠ p2, the vaccine does make a difference
(2) α = 0.001
  • Randomized design? We’re not told in so many words, but this is a high-profile medical study so you can be pretty confident it was done right.
  • Samples less than 10% of population? Yes, since millions of women will get the vaccine (if it’s proved effective) and millions won’t.
  • At least 10 yes and 10 no in each sample? In the placebo group, there were 41 yes and 765−41 = 724 no. In the treatment group, there were no successes at all.

    Does that mean you can’t do the hypothesis test? Remember that “at least 10 yes and 10 no in each sample” is a shortcut for the real requirement, which is “at least 10 yes and 10 no expected in each sample if the null hypothesis is true”. If H0 is true, then the pooled proportion p̂ = 41/1533 ≈ 0.0267 is an estimator of the proportions in both populations.

    What would you expect if H0 is true? In the placebo group of 765, you would expect n1p̂ = 765×.0267 ≈ 20 yes and n1−n1p̂ = 765−20 = 745 no. You’d expect about the same in the treatment group of 768, so the “at least 10” requirement is met.

(3/4) 2-PropZTest: 41, 765, 0, 768, p1≠p2
results: z=6.50, p-value = 7.9E-11, p̂1=.0536, p̂2=0, p̂=.0267
Calculator screens: 2-PropZTest 41, 765, 0, 768, p1≠p2; output z=6.503220606, p=7.900655E−11, p̂1=.0535947712, p̂2=0, p̂=.026744946
Pause for a minute to make sure you can keep all those p’s straight. The first one, p = 7.9E-11, is the p-value, the chance of getting such different sample results if the vaccine makes no difference. p̂1 and p̂2 are those sample results: 5.4% of unvaccinated women and 0% of vaccinated women in the samples contracted HPV-16 infections. p̂ without subscript is the pooled proportion: 2.7% of all women in the study contracted HPV-16.
(5) p < α. Reject H0 and accept H1.
(6) The Gardasil vaccine does make a difference to HPV-16 infection rates (p = 8×10⁻¹¹). In fact, it lowers the chance of infection.
Or, at the 0.001 level of significance, the Gardasil vaccine does make a difference to HPV-16 infection rates. In fact, it lowers the chance of infection.
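If you want to reproduce the 2-PropZTest numbers without a TI-83/84, here is a minimal Python sketch. The function name `two_prop_z_test` is made up for this illustration. It uses the pooled proportion in the standard error, as the test assumes H0 is true, and gets the two-tailed p-value from `math.erfc`.

```python
from math import sqrt, erfc

def two_prop_z_test(x1, n1, x2, n2):
    """Two-tailed test of H0: p1 = p2, using the pooled proportion."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = erfc(abs(z) / sqrt(2))   # area in both tails of the normal curve
    return z, p_value, pooled

# Placebo group: 41 of 765 infected; vaccine group: 0 of 768 infected
z, p_value, pooled = two_prop_z_test(41, 765, 0, 768)

# Requirements check: expected "yes" counts in each sample if H0 is true
expected_yes = (765 * pooled, 768 * pooled)

alpha = 0.001
print(f"z = {z:.2f}, p-value = {p_value:.1e}, pooled = {pooled:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

The z statistic, p-value, and pooled proportion should match the calculator screen to rounding, and both expected counts come out around 20, comfortably above 10.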

It’s worth reviewing what this p-value of 8×10⁻¹¹ means. If the vaccine actually made no difference, there are only 8 chances in a hundred billion of getting the difference between samples that the researchers actually got, or a larger difference.

How do you get from “makes a difference” to “reduces infection rate”? Remember that when p < α in a two-tailed test, you can interpret the result in a one-tailed manner. If the vaccine makes things different, as appears virtually certain, then it must either make them better or make them worse. But in the sample groups, the vaccine group did better than the placebo group. Therefore the vaccine can’t make things worse, and it must make them better.

How Effective Is the Vaccine?

Can you do a confidence interval to estimate how much Gardasil reduces a woman’s risk of HPV-16 infection? Unfortunately, you can’t, because the requirements aren’t met: There were zero successes in the second sample. You can’t think like the hypothesis test and use the blended p̂ to meet requirements. Why wouldn’t that make sense? In a confidence interval, you’re specifically trying to estimate the difference between p1 and p2 (likelihood of infection for unvaccinated and vaccinated women), so you can’t very well assume there is no difference.

In terms of what you’re required to know for the course, you can skip to the next section right now. But if you want to know more, keep reading.

One informal calculation finds a number needed to treat per person actually helped (Simon 2000c [see “Sources Used” at end of book]). The difference in sample proportions is 5.4 percentage points, and 1/.054 ≈ 18.5 is called the number needed to treat. (You may recognize this as the expected value of the geometric distribution with p = 5.4%.) In the long run, for every 18 or 19 women who are vaccinated, one HPV-16 infection is prevented.

Caution! 5.4 percentage points is a difference in sample proportions. You can say only that the difference in the population is somewhere in the neighborhood of 5.4 percentage points, not that it is that. The number needed to treat is therefore not exactly 18.5, just somewhere in the neighborhood of 18.5. Even so, this is valuable information for women and their doctors.
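The number needed to treat is quick to compute from the raw counts. In this minimal Python sketch the function name `number_needed_to_treat` is made up for the illustration. Because it uses the unrounded sample proportions rather than .054, it gives a slightly different figure from the 18.5 above.

```python
def number_needed_to_treat(x_control, n_control, x_treated, n_treated):
    """1 / (difference in sample proportions), control minus treated."""
    risk_difference = x_control / n_control - x_treated / n_treated
    return 1 / risk_difference

# Placebo group: 41 of 765 infected; vaccine group: 0 of 768 infected
nnt = number_needed_to_treat(41, 765, 0, 768)
print(f"about {nnt:.1f} women vaccinated per infection prevented")
```

Either way the answer lands between 18 and 19, which is all the precision a sample-based estimate deserves.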

Another approach is the rule of three, explained in Confidence Interval with Zero Events (Simon 2010 [see “Sources Used” at end of book]). When there are zero successes in n events, the 95% confidence interval is 0 to 3/n. Here 3/768 = 0.0039, about 0.4%. The 95% confidence interval for the unvaccinated population is 3.8% to 7.0%. So a doctor can tell her patients that about 38 to 70 unvaccinated women in a thousand will be infected with HPV-16, but no more than about four vaccinated women in a thousand.

Each of those is a 95% confidence interval, but the combination isn’t a 95% confidence interval! In the long run, if you do a bunch of 95% CIs, one in 20 of them won’t capture the true population parameter. Here you’re doing two, so there’s only a 95%×95% = 90.3% chance that both of these actually capture the true population proportions.
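Here is a minimal Python sketch of those three calculations: the rule of three for the vaccinated group, an ordinary one-proportion interval for the unvaccinated group, and the chance that both intervals hit their targets. The one-proportion formula is the usual p̂ ± z·√[p̂(1−p̂)/n], assumed from the one-population case, and the last line treats the two samples as independent.

```python
from math import sqrt
from statistics import NormalDist

z95 = NormalDist().inv_cdf(0.975)   # critical z for 95% confidence

# Vaccinated sample: 0 infections in 768, so use the rule of three
vaccinated_upper = 3 / 768

# Unvaccinated sample: 41 infections in 765
p_hat = 41 / 765
margin = z95 * sqrt(p_hat * (1 - p_hat) / 765)
unvaccinated = (p_hat - margin, p_hat + margin)

# Chance that both 95% intervals capture their true proportions
joint_coverage = 0.95 * 0.95

print(f"vaccinated: 0 to {vaccinated_upper:.4f}")
print(f"unvaccinated: {unvaccinated[0]:.3f} to {unvaccinated[1]:.3f}")
print(f"joint coverage: {joint_coverage:.4f}")
```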

11E.  Confidence Interval and Hypothesis Test (Two Populations)

Summary: If you have a confidence interval for the difference of two population means or proportions, you can conclude whether the difference is statistically significant or not, just like the result of a hypothesis test.

Example 15: You’re testing the new drug Effluvium to see whether it makes people drowsy. Your 95% confidence interval for the difference between proportions of drowsiness in people who do and don’t take Effluvium is (0.017, 0.041). That means you’re 95% confident that Effluvium is more likely, by 1.7 to 4.1 percentage points, to cause drowsiness.

There’s the key point. You’re 95% confident that it does increase the chance of drowsiness by something between those two figures. How likely is it that Effluvium doesn’t affect the chance of drowsiness, then? Clearly it’s got to be less than 5%.

When both endpoints of your confidence interval are positive (or both are negative), so that the confidence interval doesn’t include 0, you have a significant difference between the two populations.

Example 16: Now, suppose that confidence interval was (−0.013, 0.011). That means you’re 95% confident that Effluvium is somewhere between 1.3 percentage points less likely and 1.1 percentage points more likely to cause drowsiness. Can you now conclude that Effluvium affects the chance of drowsiness? No, because 0 (“no difference”) is inside your confidence interval. Maybe Effluvium makes drowsiness less likely, maybe it has no effect, maybe it makes drowsiness more likely; you can’t tell.

When one endpoint of your confidence interval is negative and one is positive, so that the confidence interval includes 0, you can’t tell whether there’s a significant difference between the two populations or not.

This is exact for numeric data but approximate for binomial data. Why? Because the HT and CI use the same standard error for the numeric data cases, but slightly different standard errors for two-population binomial data. (The two calculations are in BTW paragraphs earlier in the chapter.)
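The decision rule in this section fits in a few lines of Python. This is only a sketch. The function name `interpret_interval` is made up, and it takes the endpoints of a confidence interval for a difference that you have already computed.

```python
def interpret_interval(low, high):
    """Read a confidence interval for a difference as a significance result."""
    if low > 0 or high < 0:
        # Both endpoints on the same side of 0: the interval excludes 0
        return "significant difference"
    # The interval includes 0: no conclusion either way
    return "can't tell"

print(interpret_interval(0.017, 0.041))    # Example 15
print(interpret_interval(-0.013, 0.011))   # Example 16
```

Example 15 comes back as a significant difference, and Example 16 as "can't tell".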

11F.  More Confidence Intervals for Two Populations


Confidence intervals for two populations are easy enough to calculate on your TI-83. But one or both endpoints can be negative, and that means you have to write your interpretation carefully