Summary: We live in an uncertain world. You never have complete information when you make a decision, but you have to make decisions anyway. Statistics helps you make sense of confusing and incomplete information, to decide whether a pattern is meaningful or just coincidence. It’s a way of taming uncertainty, a tool to help you assess the risks and make the best possible decisions.
Statistics is different from any other math course.
Yes, you’ll solve problems. But most will be real-world practical problems: Does aspirin make heart attacks less likely? Was there racial bias in the selection of that jury? What’s your best strategy in a casino? (Most examples will be from business, public policy, and medicine, but we’ll hit other fields too.)
There will be very little use of formulas. Real statisticians don’t do things by hand. They use calculators or software, and so will you. Your TI-83 or TI-84 may seem intimidating at first, but you’ll quickly get to know it and be amazed at how it relieves you of drudgery.
With little grunt work to do, you will focus on what your numbers mean. You’re not just a button-pushing calculator monkey; you have to think about what you’re doing and understand it well enough to explain it. Most of the time your answers will be non-technical English, not numbers or statistical jargon. That may seem scary and unfamiliar at first, but if you stick with it you’ll love stretching your brain instead of just following a book’s examples by rote.
It may be a required course, so you get that much closer to graduation. ☺ But you can get more than that.
If you do it right, statistics teaches you to think. You become skeptical, unwilling to take everything at face value. Instead, when somebody makes a statement you question how they know that and what they’re not telling you. You can’t be fooled so easily. You become a more thoughtful citizen, a more savvy consumer.
Who knows? You might even have some fun along the way. So— Let’s get started!
Suppose you want to know about the health of athletes who use steroids versus those who don’t. Or you want to know whether people are likely to buy your new type of chips. Or you want to know whether a new type of glue makes boxes less likely to come apart in shipping. How do you answer questions like that?
With most things you want to know, it’s impossible or impractical to examine every member of the group you want to know about, so you examine part of that group and then generalize to the whole group.
In Good Samples, Bad Samples, later in this chapter, you’ll see how samples are actually taken.
The sample is usually a subgroup of the population, but in a census the whole population is the sample.
Example 1: You want to know what proportion of likely voters will vote for your candidate, so you poll 850 people. The people you actually ask are your sample, and the likely voters are the population.
Caution!: Your sample is the 850 people you took data from, not just the subgroup that said they would vote for your candidate. The population is all likely voters, regardless of which candidate they prefer. Yes, you want to know who will vote for your candidate, but everybody’s vote counts, so the group you want to know something about — the population — is all likely voters.
The number of members of your sample is called the sample size or size of the sample (symbol n), and the number of members of the population is called the population size or size of the population (symbol N).
“Sometimes it is not possible to count the units contained in the population. Such a population is called infinite or uncountable.” (Finite and Infinite Population 2014 [see “Sources Used” at end of book]) “Smokers” is an example. There is a definite number of smokers in the world at any moment, but if you try to count them the number changes while you’re counting.
The sample size is always a definite number, since you always know how many individuals you took data from.
Example 2: You’re monitoring quality in a factory that turns out 2400 units an hour, so you test 30 units from each hour’s production.
The units you tested are your sample, and your sample size is 30. All production in that hour is the population, and the population size is 2400.
Isn’t the population the factory’s total production, since you want to know about the overall quality? No! Your sample was all drawn from one hour’s production. A sample from one production run can tell you about that production run, not about overall operations. This is why quality testing never ends.
Example 3: You’re testing a new herpes vaccine. 800 people agree to participate in your study. You divide them randomly into two groups and administer the vaccine to one group and a placebo (something that looks and feels like a vaccine but is medically inactive) to another group. Over the course of the study, a few people drop out, and at the end you have 397 vaccinated individuals and 396 who received the placebo.
You have two samples, individuals who were vaccinated (n1 = 397) and the control group (n2 = 396). The corresponding populations are all people who will take this vaccine in the future, and all people who won’t. Both of those populations are uncountable or infinite because more people are being born all the time.
Sometimes you want to summarize the data from your sample, and other times you want to use the sample to tell you something about the larger population. Those two situations are the two grand branches of statistics.
Definition: Descriptive statistics is summarizing and presenting data that were actually measured. Inferential statistics is making statements about a population based on measurements of a smaller sample.
Example 4: “52.9% of 1000 voters surveyed said they will vote for Candidate A.” That is descriptive statistics because someone actually measured (took responses from) those 1000 people.
Compare: “I’m 95% confident that 49.8% to 56.0% of voters plan to vote for Candidate A.” That is inferential statistics because no one has asked all voters. Instead, a sample of voters was asked, and from that an estimate was made of the feelings of all voters.
Definitions: A statistic is a numerical summary of a sample. A parameter is a numerical summary of a population.
Mnemonic: sample and statistic begin with s; population and parameter both begin with p.
Continuing with Example 4: “52.9% of 1000 voters surveyed plan to vote for Candidate A.” — 52.9% is a statistic because it summarizes the sample.
“I’m 95% confident that 49.8% to 56.0% of voters plan to vote for Candidate A.” — 49.8% to 56.0% is an estimate of a parameter. (The actual parameter is the exact proportion presently planning to vote for A, which you don’t know exactly.)
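If you're curious where an interval like "49.8% to 56.0%" can come from, here is a minimal sketch. It assumes the common normal-approximation formula for a proportion, with 1.96 as the 95% critical value. The text quotes only the result, so the method is an assumption, and the function name `proportion_interval` is made up for illustration.

```python
import math

def proportion_interval(p_hat, n, z=1.96):
    """Normal-approximation interval for a population proportion.

    p_hat: sample proportion (a statistic)
    n:     sample size
    z:     critical value (1.96 is the usual choice for 95% confidence)
    """
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Statistic from Example 4: 52.9% of 1000 voters surveyed
low, high = proportion_interval(0.529, 1000)
print(f"{low:.1%} to {high:.1%}")  # prints 49.8% to 56.0%
```

The input (52.9%) is a statistic, known exactly; the output is an estimate of a parameter.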
A statistic is always a statement of descriptive statistics and is always known exactly, because a statistic is a number that summarizes a sample of actual measured data.
A parameter is usually estimated, not known exactly, and therefore is usually a matter of inferential statistics. The exception is a census, in which data are taken from the whole population. In that case, the parameter is known exactly because you have complete data for the population, so the parameter is then descriptive statistics.
|Describing …||The number is …||And the process is …|
|Any sample||A statistic||Descriptive statistics|
|A population (usually)||A parameter||Inferential statistics|
|A census (pop. w/ every member surveyed)||A parameter||Descriptive statistics|
You already know that a random sample is a good thing, but did you know that a random sample is actually carefully planned? What if you can’t take a true random sample? What are good and bad ways to gather samples?
All valid samples share one characteristic: they are chosen through probability means, not selected by any decisions made by the person taking the sample. Every valid sample is gathered according to some rule that lets the impersonal operations of probability do the actual selection.
Definition: A probability sample is a sample where the members are chosen by a predetermined process that uses the workings of chance and removes discretion from the investigators. Some of the types of probability samples are discussed below.
See also: For lots of examples of good sampling and (usually) clear presentation of data about the American people, you might want to visit the Pew Research Center and its tech-oriented spinoff, Pew Internet. The venerable Gallup Poll also makes available its snapshots of the American public.
Definition: A random sample (also called a simple random sample) is a sample constructed through a process that gives every member of the population an equal chance of being chosen for the sample.
You always want a random sample, if you can get one. But to create a random sample you need a frame, and in many situations it’s impossible or unreasonably difficult to list all members of the population. The sections below explain alternative types of samples that can lead to statistically valid results.
“Random” doesn’t mean haphazard. Humans think we’re good at constructing random sequences of letters and digits, but actually we’re very bad at it. Try typing 1300 “random” letters on your keyboard. If you do it really randomly, you should get about 1300÷26 = 50 of each letter. (Note: about 50 of each, not exactly 50. To determine whether a particular sample of text is unreasonably far from random letters, see Testing Goodness of Fit to a Model.) But if you’re like most people, the distribution will be very different from that: some letters will occur many more than 50 times, and others many fewer.
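To see what "about 50 of each" looks like, you can let software type the 1300 letters for you. This is only a sketch using Python's pseudo-random generator; the seed value is arbitrary, chosen so that the run can be repeated.

```python
import random
import string
from collections import Counter

rng = random.Random(2024)  # arbitrary seed, used only to make the run repeatable

# 1300 letters, each of the 26 letters equally likely on every draw
letters = rng.choices(string.ascii_lowercase, k=1300)
counts = Counter(letters)

# Expect counts near 1300 / 26 = 50, but not exactly 50
for letter in string.ascii_lowercase:
    print(letter, counts[letter])
```

Compare those counts with the counts from your own typed "random" letters.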
So how do you construct a random sample? You need a frame, plus a random-number generator or random-number table.
The frame need not be a physical list; it can be a computer file — these days it usually is. But it has to be a complete list.
If you have a table of random numbers, the table will come with instructions for use. I’ll show you how to do it with the TI-83/84, but you could also do it with Excel’s RANDBETWEEN( ) function, or with any other software that generates pseudo-random numbers. (The Web site random.org provides true random numbers based on atmospheric noise.)
Random numbers from software or a calculator aren’t really random, but what we call pseudo-random numbers. That means that they are generated by deterministic calculations designed to mimic randomness pretty well but not perfectly. To help them do a better job, you need to “seed” the random number sequence, meaning that you give it a unique starting point so that your sequence of random numbers is different from other people’s.
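You seed the random numbers only once. To do this:
Enter a whole number of your own choosing to serve as the seed.
Press [STO→], which shows on your screen as →.
Open the [MATH] menu, move to its probability submenu (PRB), and press [1] to paste rand to the screen. Press [ENTER].
Again, you need to seed random numbers only once on your calculator.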
For this you need to know the size of the population, which is the number of individuals in your frame. You will generate a series of random numbers between 1 and the population size, as follows:
Open the [MATH] menu, move to its probability submenu (PRB), and press [5] to paste randInt( to your screen.
Enter 1, press [,], enter the population size, and press [ENTER] to generate the first random number. In my case the population size was 20,147 and my first random number was 4413, so the first member of my sample will be the 4413th individual, in order, from the sampling frame.
Press [ENTER] to generate the next random number. (The randInt function may or may not be displayed again, depending on your calculator model and settings.) In my case, the next random number is 4949, so the 4949th individual in my frame becomes the second member of my sample.
Keep pressing [ENTER] until you have your desired sample size. If you get a duplicate random number, simply ignore it and take the next one. (If your calculator has randIntNoRep, use it instead of plain randInt to prevent duplicates from appearing in the first place.)
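The same procedure can be sketched in software. The example below uses Python's `random` module, which is also a pseudo-random generator. The population size 20,147 comes from the example above; the sample size of 10 and the seed are assumptions made up for illustration.

```python
import random

population_size = 20147  # size of the frame, as in the calculator example
sample_size = 10         # hypothetical; pick whatever your study needs

rng = random.Random(8675309)  # seed once; any number of your own will do

# Calculator-style approach: draw one position at a time, skipping duplicates
chosen = []
while len(chosen) < sample_size:
    position = rng.randint(1, population_size)  # both endpoints are included
    if position not in chosen:
        chosen.append(position)

# Shortcut: sample without replacement, so duplicates never appear
no_repeats = rng.sample(range(1, population_size + 1), sample_size)

print(chosen)
print(no_repeats)
```

Each number is a position in the frame: a result of 4413 would mean the 4413th individual, in order, joins your sample.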
Definition: A systematic sample with k = some number is one where you take every kth individual from a representative subset of the population to make up your sample.
Example 5: Standing outside the grocery store all day, you survey every 40th person. That is a systematic sample with k=40.
If properly taken, a systematic sample can be treated like a random sample. Then why do I call it almost as good? Because you have to make one big assumption: that the variable you’re surveying is independent of the order in which individuals appear. In the grocery-store example, you have to assume that shoppers in the time period when you take your survey are representative of all shoppers. That may or may not be true. For example, a high proportion of Wegmans shoppers at lunch time are buying prepared foods to eat there or take back to work. At other times, the mix of groceries purchased is likely to be different.
If you’re pretty unsure of N, you may need to observe that spot without taking the survey, just to get a preliminary count.
If your estimate of N is uncertain, you’ll want to reduce k a bit. This will increase your sample size, but a sample that’s too large (within reason) is better than one that’s too small.
To choose your starting point, open the [MATH] menu, move to its probability submenu (PRB), and press [5] to paste randInt(. Enter 1, then press [,]. Enter the value of k and press [ENTER].
Caution: It’s 1 to k, not 1 to N. If you need to survey every 12th person, then you use randInt(1,12) for determining where to start in the first 12 people. Any other upper limit, such as randInt(1,1200), is wrong.
In an illustration with k=12, the calculator has determined that I will start with the 2nd person and take every 12th person after that: 2, 14, 26, 38, 50, and so on.
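Here is a minimal sketch of the same idea in Python: pick a random start between 1 and k, then take every kth position after it. The function name `systematic_positions` and the population size of 60 are hypothetical, chosen only to keep the output short.

```python
import random

def systematic_positions(population_size, k, start=None, rng=random):
    """Positions to survey: a random start in 1..k, then every kth after it."""
    if start is None:
        start = rng.randint(1, k)  # 1 to k, not 1 to N
    return list(range(start, population_size + 1, k))

# With k=12 and a start of 2, as in the illustration above
print(systematic_positions(60, 12, start=2))  # [2, 14, 26, 38, 50]
```

Leave out `start` and the function draws the starting point for you.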
Sometimes a true random sample is possible but unreasonably difficult. For example, you could use census records to take a random sample of 1000 adults in the US, but that would mean doing a lot of travel. So instead you take a cluster sample.
“In single-stage cluster sampling, all members of the selected clusters are interviewed. In multi-stage cluster sampling, further subdivisions take place.” (Upton and Cook 2008, 76 [see “Sources Used” at end of book])
Example 6: You want to have 600 representative Americans try your new neck pillow to gauge your potential market. Travel to 600 separate locations across the country would be ridiculously expensive, so you randomly select 30 census tracts and then randomly select 20 individuals within each selected census tract.
A cluster sample makes one big assumption: that the individuals in each cluster are representative of the whole population. You can get away with a slightly weaker assumption, that the individuals in all the selected clusters are representative of the whole population. But it’s still an assumption. For this and other technical reasons, a cluster sample cannot be analyzed in all the same ways as a random sample or systematic sample. Analysis of cluster samples is outside the scope of this course.
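The two stages in Example 6 can be sketched as follows. The frame here is entirely made up (100 tracts of 200 residents each); real census tracts differ in number and size, so treat this as an illustration of the selection steps only.

```python
import random

rng = random.Random(42)  # arbitrary seed for a repeatable run

# Hypothetical frame: 100 census tracts with 200 residents each
tracts = {
    f"tract-{t}": [f"tract-{t}/person-{p}" for p in range(200)]
    for t in range(100)
}

# Stage 1: randomly select 30 tracts (the clusters)
selected_tracts = rng.sample(sorted(tracts), 30)

# Stage 2: randomly select 20 individuals within each selected tract
sample = [
    person
    for tract in selected_tracts
    for person in rng.sample(tracts[tract], 20)
]

print(len(sample))  # 600
```

Chance does the choosing at both stages; the investigator only sets the rule.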
Sometimes you can identify subgroups of your population and you expect individuals within a subgroup to be more alike than individuals of different subgroups. In such a case, you want to take a stratified sample.
Definition: If you can identify subgroups, called strata (singular: stratum), that have something in common relative to the trait that you’re studying, you want to ensure that your sample has the same mix of those groups as the population. Such a sample is called a stratified sample.
Example 7: You’re studying people’s attitudes toward a proposed change in the immigration laws for a Presidential candidate. You believe that some races are more likely to favor loosening the law and others are more likely to oppose it. If the population is 66% non-Hispanic white, 14% Hispanic, 12% black, 4% Asian, and so on, your sample should have that same composition.
A stratified sample is really a set of mini-samples grouped together.
Example 8: You want to survey attitudes towards sports at a college that is 45% male and 55% female, and you want 400 in your sample. You would take a sample of 45%×400 = 180 male students and 55%×400 = 220 female students to make up your sample of 400. Each mini-sample would be taken by a valid technique like a random sample or systematic sample.
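The arithmetic in Example 8 is simple enough to do by hand, but a short sketch shows the pattern for any number of strata. The function name `proportional_allocation` is made up for illustration. Note one assumption: with less tidy percentages, rounding can make the mini-samples add up to slightly more or less than the total you wanted.

```python
def proportional_allocation(shares, total_n):
    """Split total_n across strata in proportion to each stratum's share."""
    return {stratum: round(share * total_n) for stratum, share in shares.items()}

# Example 8: a college that is 45% male and 55% female, sample of 400
print(proportional_allocation({"male": 0.45, "female": 0.55}, 400))
# {'male': 180, 'female': 220}
```

Each mini-sample would still have to be drawn by a valid technique, such as a random or systematic sample.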
Definition: A census is a sample that contains every member of the population.
In many situations, it’s impossible or highly inconvenient to take a census. But with the near-universal computerization of records, a census is practical in many situations where it never used to be.
Example 9: At the push of a button, a librarian can get the exact average number of times that all library books have been checked out, with no need for sampling and estimation. An apartment manager can tell the exact average number of complaints per tenant. And so forth.
A census is the only sample that perfectly represents the population, because it is the whole population. If you can take a census, you’ve reduced a problem of inferential statistics to one of descriptive statistics. But even today, only a minority of situations are subject to a census. For instance, there’s no way to test a drug on every person with the condition that the drug is meant to treat. It’s totally impractical to interview every potential voter and determine his or her preferences. And so forth.
Any sample where people select the individual members is a bogus sample. That means every sample where people select themselves, and every sample where the interviewer decides whether to include or exclude individual members.
Why is that bad? Remember, a proper sample is a smaller group that is representative of the population. No sample will represent the population perfectly, but you do the best you possibly can.
The good samples listed above can go bad if you make various kinds of mistakes (see “Statistical Errors”, later in this chapter), but a sample that doesn’t depend on the workings of chance is always wrong and cannot be made right. The textbooks will give you names for the types of bad samples (convenience sample, opportunity sample, snowball sample, and so on), but why learn the names when they’re all bogus anyway?
|Good Samples||Bad Samples|
|Chosen through probability methods||Chosen by individual decisions about which persons or things to include|
|Represent the population as well as possible||Do not accurately represent the population|
|Uncertainty can be estimated, and can be reduced by increasing sample size||Uncertainty cannot be estimated, and bigger samples don’t help|
So goodbye to Internet polls and petitions, letter-writing campaigns, “the first 500 volunteers”, and every other form of self-selected sample. If people select themselves for a sample, then by definition they are not representative because they feel more strongly than the people who didn’t select themselves. You can make statements about the people who selected themselves, but that tells you nothing about the much larger number who didn’t select themselves. (More about this in Simon 2001 [see “Sources Used” at end of book], Web Polls.)
Goodbye also to any kind of poll where the pollster selects the individual people. If you set up a rule that depends on the workings of chance and then follow it, that’s okay. But if you decide on the spur of the moment who gets included, that’s bogus.
Why is it bad to just approach people as you see them? Because studies show that you are more likely to approach people that you perceive to be like you, even if you’re not aware of that. Ask yourself if you are truly equally likely to select someone of a different race or sex from yourself, someone who is dressed much richer or poorer than you, someone who seems much more or much less attractive, and so forth. Unless you’re Gandhi, the honest answer is “not equally likely”. It doesn’t make you a bad person, just a bad pollster like everyone else. If you tend to pick people who are more like you, your sample is not representative of the population.
The same principle applies to studies of non-humans. Here the investigator’s intrinsic biases may be less clear, but unless you choose your sample based on chance you can never be sure that those biases didn’t “skew it up”.
This will be an important topic throughout the course, because different variable types are presented differently in descriptive statistics, and again are analyzed differently in inferential statistics. So before you do anything, you need to think what type of data you’re dealing with.
You can think of the variable as kind of like a question, and the data points as the answers to that question.
If you record one piece of information from each member of the sample, you have univariate data; if you record two pieces of information from each member, you have bivariate data.
Example 10: You record the birth weights of babies born in a certain hospital during the year. The variable is “birth weight”.
Example 11: In April, you ask all the members of your sample whether they had the flu vaccine that year and how many days of work or school they lost because of colds or flu. (Can you see at least two problems with that second question? If not, you will after you read about Nonsampling Errors, later in this chapter.) This is bivariate data. One variable is “flu shot?” and the data points are all yes or no; the other variable is “days lost to colds and flu” and the data points are whole numbers.
Numeric data are subdivided into discrete and continuous data. Discrete data are whole numbers and typically answer the question “how many?” Continuous data can take on any value (or any value within a certain range) and typically answer the question “how much?”
Qualitative data are data that are not numbers. Qualitative data are also called non-numeric data, attribute data or categorical data.
Sometimes we talk about data types, and sometimes about variable types. They’re the same thing. For instance, “weight of a machine part” is a continuous variable, and 61.1 g, 61.4 g, 60.4 g, 61.0 g, and 60.7 g are continuous data.
|Quantitative (numeric)||Qualitative (categorical or non-numeric)|
|You get a number from each member of the sample.||You get a yes/no or a category from each member of the sample.|
|The data have units (inches, pounds, dollars, IQ points, whatever) and can be sorted from low to high.||The data may or may not have units and do not have a definite sort order.|
|It makes sense to average the data.||Your summary is counts or percentages in each category.|
|Examples (discrete): number of children in a family, number of cigarettes smoked per day, age at last birthday. Examples (continuous): height, salary, exact age||Examples: hair color, marital status, gender, country of birth, and opinion for or against a particular issue|
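The difference in how the two types get summarized shows up clearly in a short sketch. The weights are the machine-part measurements quoted earlier; the hair colors are made-up data for illustration.

```python
from collections import Counter
from statistics import mean

# Continuous data: machine-part weights in grams
weights_g = [61.1, 61.4, 60.4, 61.0, 60.7]
print(round(mean(weights_g), 2))  # 60.92

# Categorical data: hypothetical hair colors; an average would make no sense
hair = ["brown", "black", "brown", "blond", "brown", "red", "black"]
counts = Counter(hair)
for color, count in counts.most_common():
    print(f"{color}: {count} ({count / len(hair):.0%})")
```

Numeric data get averaged; categorical data get counts or percentages in each category.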
Continuous or discrete data? Sometimes when you have numeric data it’s hard to say whether you have discrete or continuous data. But since you’ll graph them differently, it’s important to be clear on the distinction. Here are two examples of doubtful cases: salary and age.
It’s true that your salary can be only a whole number of pennies. But there are a great many possible values, and the distance between the possible values is quite small, so you call salary a continuous variable. Besides, you don’t ask “how many pennies do you make?” but rather “how much do you make?”
What about age? Well, age at last birthday is clearly discrete since it can be only a whole number: “how many years old were you at your last birthday?” But age now, including years and months and days and fractions of days, would be continuous, again because you can subdivide it as finely as desired.
When you see a summary statement, you have to do a little mental detective work to figure out the data type. Always ask yourself, what was the original measurement taken or question asked?
Example 12: “The average salary at our corporation is $22,471.” The original measurement was the salary of each individual, so this is continuous data.
Example 13: “The average American family has 1.7 children.” Don’t let “1.7” fool you into identifying this as a continuous variable! What was the original question or measurement? “How many children are there in your family?” That’s discrete data.
Example 14: “Four out of five dentists surveyed recommend Trident sugarless gum for their patients who chew gum.” Yes, there are numbers in the summary statement, but the original question asked of each dentist was “Do you recommend Trident?” That is a yes/no question, so the data type is categorical.
Summary: In statistics, an error is not necessarily a mistake. This section explores the types of statistical errors and where they come from.
Definition: An error is a discrepancy between your findings and reality. Some errors arise from mistakes, but some are an inevitable part of the sampling process.
Even if you make no mistakes, inevitably samples will vary from each other and any given sample is almost sure to vary from the population. This variability is called sampling error. (It would probably be more helpful to call it sample variability, but we’re stuck with “sampling error”.)
Sampling error “refers to the difference between an estimate for a population based on data from a sample and the ‘true’ value for that population which would result if a census were taken.” (Australian Bureau of Statistics 2013) [see “Sources Used” at end of book]
Except for a census, no sample is a perfect representation of the population. So the sample mean (average), for example, will usually be a bit different from the population mean. Sampling errors are unavoidable, even if you do everything right when you take a random sample. They’re not mistakes, they’re just part of the nature of things.
Although sampling error cannot be eliminated, the size of the error can be estimated, and it can be reduced. For a given population, a larger sample size gives a smaller sampling error. You'll learn more about that when you study sampling distributions.
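You can watch sampling error shrink as the sample grows with a small simulation. Everything here is an assumption made for illustration: the population is 20,000 invented values spread evenly between 0 and 100, and `typical_error` is a hypothetical helper that measures how far sample means tend to land from the true population mean.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(1)  # arbitrary seed for a repeatable run

# Hypothetical population: 20,000 values spread evenly between 0 and 100
population = [rng.uniform(0, 100) for _ in range(20_000)]
true_mean = fmean(population)

def typical_error(sample_size, repeats=200):
    """Typical distance between a sample mean and the population mean."""
    means = [fmean(rng.sample(population, sample_size)) for _ in range(repeats)]
    return pstdev(means, mu=true_mean)  # spread measured around the true mean

small = typical_error(25)
large = typical_error(2500)
print(f"n=25: {small:.2f}   n=2500: {large:.2f}")
```

Every sample here is a proper random sample, yet none matches the population exactly; the larger samples just miss by less.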
Definition: Nonsampling errors are discrepancies between your sample and reality that are caused by mistakes in planning, in collecting data, or in analyzing data.
Nonsampling errors make your sample unrepresentative of the population and your results questionable if not useless. Unlike sampling errors, nonsampling errors cannot be reduced by taking larger samples, and you can’t even estimate the size of most nonsampling errors. Instead, the mistakes must be corrected, and probably a new sample must be taken.
There are many types of nonsampling errors. Different authors give them different names, but it’s much more important for you to recognize the bad practice than to worry about what to name it. In taking your own samples, and in evaluating what other people are telling you about their samples, always ask yourself: What could go wrong here? Has anything been done that can make this sample unrepresentative of the population? Here are some of the more common types of nonsampling errors. After you read through them, see how many others you can think of.
This is almost always bogus. People who select themselves are by definition different from people who don’t, which means they are not representative. It can be very hard to know whether that difference matters in the context of a particular study. Since you can never be sure, it is safest to avoid the problem and not let people select themselves.
But medical studies all use volunteers. (They have to, ethically.) Why doesn’t that make the sample bogus? They’re volunteers, but usually they’re not self-selected volunteers. For example, researchers may ask doctors and hospitals to suggest patients who meet a particular profile; they use probability techniques to select a sample from that pool.
But things are not always simple. For example, some companies or researchers may advertise and pay volunteers to undergo testing. In this case you have to ask very serious questions about whether the volunteers are representative of the general population. Statistical thinking isn’t a matter of black and white, but some pretty sophisticated judgment can be involved. Your take-away is: don’t accept anything at face value, but always ask: What important facts are being left out? What does that do to the credibility of the results?
Definition: Sampling bias results from taking a sample in a way that tends to over- or under-represent some subgroup of the population.
Example 15: If you’re doing a survey on student attitudes toward the cafeteria, and you conduct the survey in the cafeteria, you are systematically under-representing students who don’t use the cafeteria. It seems logical that attitudes are more negative among students who don’t use the cafeteria than among students generally, so by excluding them you will report overall attitude as more favorable than it really is.
“Bias” is a good example of the words in statistics that don’t have their ordinary English meaning. You’re not prejudiced against students who dislike the cafeteria. “Bias” in statistics just means that something tends to distort your results in a particular direction.
Example 16: The classic example of sampling bias is the Literary Digest fiasco in predicting that Landon would beat Roosevelt in the 1936 election. The magazine sent questionnaires to all its subscribers, it phoned randomly selected people in telephone books, and it left stacks of questionnaires at car dealerships with instructions to give one to every person who test drove a car. The sample size was in the millions.
This procedure systematically over-represented people who were well off and systematically under-represented poorer people. In 1936 the Great Depression still held sway, and most people did not have the disposable income to subscribe to a fancy magazine, let alone a home telephone; the very thought of buying a car would have struck them as ridiculous or insulting. In that era, the Republicans appealed more to the rich and the Democrats more to the working class. So the net effect of the Literary Digest’s procedure was that it made the country look a lot more Republican than it actually was. Since Landon was a Republican and FDR a Democrat, FDR’s actual support was much greater than shown by the poll, and Landon’s was much less.
Notice that a sample size of millions did not overcome sampling bias. A larger sample size is not an answer to nonsampling errors.
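A deliberately simplified simulation makes the same point. This is not a reconstruction of the 1936 poll: the electorate, the percentages, and the candidate label "F" are all invented for illustration. The sketch compares a huge sample drawn only from well-off voters with a small random sample of everyone.

```python
import random

rng = random.Random(1936)  # arbitrary seed for a repeatable run

# Hypothetical electorate (all numbers invented):
# 25% are well off, and 35% of them back candidate F;
# the other 75% back candidate F at a rate of 70%.
def make_voter():
    well_off = rng.random() < 0.25
    backs_f = rng.random() < (0.35 if well_off else 0.70)
    return well_off, backs_f

population = [make_voter() for _ in range(200_000)]
true_share = sum(backs for _, backs in population) / len(population)

# Biased frame: only well-off voters can be reached, but the sample is huge
reachable = [voter for voter in population if voter[0]]
biased = rng.sample(reachable, 40_000)
biased_share = sum(backs for _, backs in biased) / len(biased)

# Small random sample drawn from the whole population
fair = rng.sample(population, 1_000)
fair_share = sum(backs for _, backs in fair) / len(fair)

print(f"truth {true_share:.1%}, biased n=40000 {biased_share:.1%}, "
      f"random n=1000 {fair_share:.1%}")
```

The biased sample is forty times larger, yet it misses the truth by far more than the small random sample does.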
The Digest’s original article can be found in Landon in a Landslide: The Poll That Changed Polling (American Social History Project) [see “Sources Used” at end of book].
While we’re on the subject of presidential elections, different nonsampling errors also led to wrong predictions of a Dewey victory over Truman in 1948. For analyses of both the 1936 and the 1948 statistical mistakes, see Classic Polling Surprises (2004) [see “Sources Used” at end of book] and Introduction to Polling (n.d.) [see “Sources Used” at end of book].
Beyond sampling bias, there are many other bad practices in selecting your sample that can bias the results. Wikipedia’s Selection Bias [see “Sources Used” at end of book] has a good rundown of quite a few.
If you’re taking a mail survey, a significant number of people (probably a majority) won’t respond. Are the responders representative of the non-responders, or has a bias been introduced by the non-response? That’s a tough question, and the answer may not always be clear.
For this reason, mail surveys are often coded so that the investigators can tell who did respond, and follow up with those who didn’t. That follow-up can be more mail, a phone call, or a visit.
Even with in-person polls, non-response is a problem: many people will simply refuse to participate in your survey. Depending on what you’re surveying, that could be unimportant or it could be a fatal flaw.
Definition: Response errors occur when respondents give answers that don’t match objective truth or their own actual opinions.
Poorly worded survey questions are a major source of response errors, and lead to biased results or completely meaningless results. There may not be a perfect survey question, but having several people review the questions against a list of possible problems will greatly reduce the level of response errors.
But response errors can never be completely eliminated. For instance, people tend to shade their answers to make themselves look good in their own eyes or in the interviewer’s eyes. Most people rate themselves as better-than-average drivers, for example, which obviously can’t be true. And self-reporting of objective facts is always suspect because memory is unreliable.
Data errors include mistakes by interviewers in recording respondents’ answers, mistakes by investigators in measuring and recording data, and mistakes in entering the recorded data.
In the second half of the course you’ll learn a number of inferential statistics procedures. Each one is appropriate in some circumstances and inappropriate in others. If you use the wrong form of analysis in a given situation, or you apply it wrongly, your results will be about as good as the results from using a hammer to drive a screw.
Summary: There are two main methods of gathering data, the observational study and the experiment. Learn the differences, and what each one can tell you.
Many, many statistical investigations try to find out whether A causes B. To do this, you have two groups, one with A and one without A, or you have multiple groups with different levels of A. You then ask whether the difference in B among the groups is significant. The two main ways to investigate a possible connection are the observational study and the experiment.
The concepts aren’t hard, but there’s a boatload of vocabulary. Let’s get through the definitions first, and then have some concrete examples to show how the terms are used. Please read the definitions first, then read the first example and refer back to the definitions; repeat for the other examples.
Definition: In an observational study, the investigator simply records what happens (a prospective study) or what has already happened (a retrospective study). In an experiment, the investigator takes a more active role, randomly assigning members of the sample to different groups that get treated differently.
Which is better? Well, in an observational study, you always have the problem of wondering whether the groups are different in some way other than what you are studying. This means that an observational study can never establish cause. The best you can do after an observational study is to say that you found an association between two variables.
How do we establish cause, when for ethical or practical reasons we can’t do an experiment? The nine criteria are listed in Causation (Simon 2000b [see “Sources Used” at end of book]) and were first laid down by Sir Austin Bradford Hill in 1965.
In an observational study or an experiment, there are two or more variables. You want to show that changes in one or more of them, called the explanatory variables, go with changes in one or more response variables.
Explanatory variables are the suspected causes, and response variables are the suspected effects or results.
Example 18: Over the course of a year, you have parents record the number of minutes they spend every day reading to their child, and at the end of the year you record each child’s performance on standard tests. The explanatory variable is parental time spent reading to the child, and the response variable(s) are performance on the standardized test(s).
Definitions: In an experiment, the experimenter manipulates the suspected cause(s), called explanatory variable(s) or factor(s). A specific level of each factor is administered to each group. The level(s) of the explanatory variable(s) in a given group are known as its treatment.
Example 19: To test productivity of factory workers, you randomly assign them to three groups. One group gets an extra hour at lunch, one group gets half-hour breaks in morning and afternoon, and one group gets six 10-minute breaks spaced throughout the day. The explanatory variable or factor is structuring of break time, and the three levels or treatments are as described.
Definitions: In an experiment, each member of a sample is called a unit or an experimental unit. However, when the experiment is performed on people they are called subjects or participants.
In any study or experiment, results will vary for individuals within each group, and results will also vary between the groups as a whole. Some of that variation is due to chance: it is expected statistical variability or sampling error. If the differences between groups are bigger than the variation within groups — and enough bigger, according to some calculations you’ll learn later — then the investigator has a significant result. A significant result is a difference that is too big to be merely the result of normal statistical variability.
I’ll have a lot more to say about significance when you study Hypothesis Tests.
In Example 18, about reading to children, you find generally that the more time parents spend reading to first graders, the better the children tend to do on standard tests of reading level.
Is the reading time responsible for the improved test scores? You can easily think of other possible explanations. Parents who spend more time reading to their children probably spend more time with them in general. They tend to be better off financially — if you’re working two jobs to make ends meet, you probably have little time available for reading to your children. Economic status and time spent with children in general are examples of lurking variables in this study.
Definition: A hidden variable that isn’t measured and isn’t part of your design but affects the outcome is called a lurking variable.
Example 20: In a large elementary school, you schedule half the second grade to do art for an hour, two mornings a week, with the district’s art teacher. The other half does art for an hour, two afternoons a week, with the same teacher, but they are told at the beginning that all their projects will be displayed and prizes given for the best ones.
Can you learn anything from this about whether the chance to win prizes prompts children to do a better job on art projects? The problem is that there’s not just one difference in treatment here, the promised prizes. There’s also the fact that everyone’s project will be on display. And maybe mornings are better (or worse) for doing art than afternoons. Maybe the teacher is a morning person and fades in the afternoon, or is not a morning person and really shines in the afternoon. Even if there’s a difference in quality of the projects, you can’t make any kind of simple statement about the cause, because of these confounding variables.
A confounding variable is “associated in a non-causal way with a factor and affects the response.” (DeVeaux, Velleman, Bock 2009, 346 [see “Sources Used” at end of book])
“Confounding occurs in an experiment when you [can’t] distinguish the effects of different factors.” (Triola 2011, 32 [see “Sources Used” at end of book])
In the art example, you wanted to find out whether promising prizes makes children do better art work. But the promise of prizes wasn’t the only difference between the two groups. Time of day and public display are confounding variables built into the design of this experiment. You know what they are, but you can’t untangle their effect from the effect of what you actually wanted to study.
What’s the difference between lurking variables and confounding variables? Both confuse the issue of whether A causes B.
A lurking variable L is associated with or causes both A and B, so any relationship you see between A and B is just a side effect of the L/A and L/B relationships. For example, counties with more library books tend to have more murders per year. Does reading make people homicidal? Of course not! The lurking variable is population size. High-density urban counties have more books in the library and more murders than low-density rural counties.
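To watch a lurking variable manufacture an association, here is a small simulation sketch in Python. Every number in it is made up for illustration: the county populations, the books-per-person and murders-per-person rates, and the function names `pearson_r` and `simulate_counties` are all assumptions, not data from any real study. It measures association with a correlation coefficient r, a number between -1 and 1 that is near 0 when two variables are unrelated.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def simulate_counties(n_counties=500, seed=1):
    """Invent counties in which population drives both books and murders."""
    rng = random.Random(seed)
    populations, books, murders = [], [], []
    for _ in range(n_counties):
        pop = rng.uniform(10_000, 1_000_000)   # hypothetical county size
        populations.append(pop)
        # Each count depends on population plus its own unrelated noise.
        books.append(pop * 2.0 * (1 + rng.uniform(-0.2, 0.2)))
        murders.append(pop * 0.00005 * (1 + rng.uniform(-0.3, 0.3)))
    return populations, books, murders

populations, books, murders = simulate_counties()

# Raw counts: strongly associated, because both track population.
r_raw = pearson_r(books, murders)

# Per-person rates: population is divided out, and the association fades.
books_rate = [b / p for b, p in zip(books, populations)]
murder_rate = [m / p for m, p in zip(murders, populations)]
r_rate = pearson_r(books_rate, murder_rate)

print(f"raw counts: r = {r_raw:.2f}")
print(f"per person: r = {r_rate:.2f}")
```

In this invented setup books have no effect on murders at all, yet the raw counts are strongly correlated. Once you account for population, the apparent relationship all but vanishes.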
A confounding variable C is typically associated with A but doesn’t cause it, so when you look at B you don’t know whether any effect comes from A, from C, or from both. For example, after a year with a lot of motorcycle deaths, a state passes a strict helmet law, and the next year there are significantly fewer deaths. Was the helmet law responsible? Maybe, but time is a confounding variable here. Were motorcyclists shocked at the high death toll, so that they started driving more carefully or switched to other modes of transit?
Don’t obsess over the difference between lurking and confounding variables. Some authors don’t even make a distinction. You should recognize variables that make results questionable; that’s a lot more important than what you call them.
That said, if you want to see two more takes on the difference, have a look at Confounding and Lurking Variables (Virmani 2012 [see “Sources Used” at end of book]) and Confounding Variables (Velleman 2005 [see “Sources Used” at end of book]).
Lurking and confounding variables are the boogeymen of any statistical work. Lurking variables are the reason that an observational study can show only association, not causation. In experiments, you have the potential to exclude lurking variables, or at least to minimize them, but it takes planning and extra work, and you need to be careful not to create a design with built-in confounding.
Whenever any experiment claims that A causes B, ask yourself what lurking variables there might be, and whether the design of the study has ruled them out. You can’t take this for granted, because even professional researchers sometimes cut corners, knowingly or unknowingly.
Does smoking cause lung cancer?
Initial studies in the mid-20th century had three or four groups: non-smokers, light smokers, moderate smokers, and heavy smokers. They looked at the number and severity of lung tumors in the groups to see whether there was a significant difference, and in fact they found one.
This was an observational study. Ethically it had to be: if you suspect smoking is harmful you can’t assign people to smoke.
Explanatory variable: smoking level (none, light, moderate, heavy). Levels or treatments don’t apply to an observational study.
Response variable: tumor production
Because this was an observational study, there was no control for lurking variables, and even with a significant result you can’t say from this study that smoking causes lung cancer. What lurking variables could there be? Well, maybe some genetic factor both makes some people more likely to smoke and makes them more susceptible to lung cancer. This is a problem with every observational study that finds an effect: you can’t rule out lurking variables, and therefore you can’t infer causation, no matter how strong an association you find.
Since you can’t do an experiment on humans that involves possibly harming them, how do you know that smoking causes lung cancer? A good explanation is in Causation (Simon 2000b [see “Sources Used” at end of book]).
Does aspirin make heart attacks less likely?
Here you can do an experiment, because aspirin is generally recognized as safe. Investigators randomly assigned people to two groups, gave aspirin to one group but not the other, and then monitored the proportion who had heart attacks. They found a significantly lower risk of heart attack in the aspirin group.
This was a designed experiment.
Explanatory variable: aspirin. There were two levels or treatments: yes and no.
Response variable: heart attack (yes/no)
From this experiment, you can say that aspirin reduces the risk of heart attack. How can you be sure there were no lurking variables? By randomly assigning people to the two groups, investigators made each group representative of the whole population. For example, overweight is a risk factor for heart attacks. The random assignment ensures that overweight people form about the same proportion in each group as in the population. And the same is true for any other potential lurking variable. (It helps to have larger samples, and in this study each sample was about 10,000 people.)
Does prayer help surgical patients?
Here again, no one thinks prayer is harmful, so ethically the experimenters were in the clear to assign cardiac-bypass patients randomly to three groups: people who knew they were prayed for, people who were prayed for and didn’t know it, and people who were not prayed for. Investigators found no significant difference in frequency of complications between the patients who were prayed for and those who were not prayed for.
This was a designed experiment.
Explanatory variables: receipt of prayer (two levels, yes and no) and knowledge of being prayed for (also two levels, yes and no). There were three treatments: (a) receipt=yes and knowledge=yes, (b) receipt=yes and knowledge=no, (c) receipt=no.
Response variable: occurrence of post-surgical complications (yes/no).
(You can read an abstract of the experiment and its results in Study of the Therapeutic Effects of Intercessory Prayer [Benson 2006 [see “Sources Used” at end of book]]. The full report of the experiment is in Benson 2005 [see “Sources Used” at end of book].)
Because lurking variables can’t be ruled out in an observational study, investigators always prefer an experiment if possible. If ethical or other considerations prevent doing an experiment, an observational study is the only choice. But then the best you can hope for is to show an association between the two variables. Only with an experiment do you have a hope of showing causation.
Okay, so you always have to do an experiment if you want to show that A causes B. Let’s look in more detail at how experiments are conducted, and learn best practices for an experiment.
Caution: Design of Experiments is a specialized field in statistics, and you could take a whole course on just that. This chapter can only give you enough to make you dangerous.☺ While you’re planning your first experiment in real life, it’s a good idea to get help from someone senior or a professional statistician.
R. A. Fisher “virtually invented the subject of experimental design” (Upton and Cook 2008, 144 [see “Sources Used” at end of book]), and pioneered many of the techniques that we use today. He was a great champion of planning: Upton & Cook quote him as saying “To call in the statistician after the experiment is done may be no more than asking him to perform a postmortem examination: he may be able to say what the experiment died of.”
Definitions: Experimenters randomly assign members to the various treatment groups. We say that they have randomized the groups, and this process is called randomization.
Why randomize? Why not just put the first half of the sample in group A and the second half in group B? Because randomization is how you control for lurking variables.
Think about the study with aspirin and heart attacks. You know that different individuals are more or less susceptible to heart attacks. Risk factors include smoking, obesity, lack of exercise, and family history. You want your aspirin group and your non-aspirin group to have the same mix of smokers and non-smokers as the general population, the same mix of obese and non-obese individuals, and so on. Actually it’s harder than that. There aren’t just “smokers” and “non-smokers”; people smoke various amounts. There aren’t just “obese” and “fit” people, but people have all levels of fitness.
It would be very laborious to do stratified samples and get the right proportions for a lot of variables. You’d have to have a huge number of strata. And even if you did do those matchups, taking enormous trouble and expense, what about the variables you didn’t think of? You can never be sure that the samples have the same composition as the population.
It really must be random assignments — you can’t just assign test subjects to groups alternately. Steve Simon (2000a) explains why, with examples, in Alternating Treatments [see “Sources Used” at end of book].
Randomization is the indispensable way out. Instead of trying to match everything up yourself — and inevitably failing — you let impersonal random chance do your work for you.
Are you guaranteed that the sample will perfectly represent the population? No, you’re not. Remember sampling error, earlier in this chapter. Samples vary from the population; that’s just the nature of things. But when you randomize, in the long run most of your samples will be representative enough, even though they’re not perfectly representative.
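Complete randomization is easy to do in software. The following Python sketch is one possible way, not a prescribed procedure; the subject IDs and the function name `randomize` are hypothetical. It shuffles the whole sample and then deals subjects into groups, so chance alone decides who lands where.

```python
import random

def randomize(subjects, n_groups=2, seed=None):
    """Randomly split subjects into n_groups groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(subjects)   # copy, so the caller's list is untouched
    rng.shuffle(shuffled)       # the shuffle supplies all the randomness
    # Deal the shuffled subjects out like cards, one group at a time.
    return [shuffled[i::n_groups] for i in range(n_groups)]

subjects = [f"S{i:03d}" for i in range(1, 21)]   # hypothetical subject IDs
aspirin_group, control_group = randomize(subjects, seed=42)
print("aspirin:", aspirin_group)
print("control:", control_group)
```

Dealing out alternately is harmless here only because the list was shuffled first. Alternating through the subjects in their original order would not be random assignment.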
Notice I said that randomization works in the long run. But in the short run it may not. Suppose you are testing a weight-loss drug on a group of 100 volunteers, 50 men and 50 women. If you completely randomize them, you might end up with 20 men and 30 women in one group, and 30 men and 20 women in the other. (There’s about a 20% chance of this, or a more extreme split.)
Why is this bad? Because you don’t know whether men and women respond differently to the drug. If you see a difference between your 20/30 placebo group and your 30/20 treatment group, you don’t know how much of that is the drug and how much is the difference between men and women. Gender is a confounding variable.
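You can check the chance of a lopsided split yourself. The answer depends on exactly how the randomizing is done, and the Python sketch below tries two models; both are assumptions, since the text doesn’t say which procedure is meant. In the first, each man is assigned by an independent coin flip. In the second, exactly 50 of the 100 volunteers are drawn at random for one group. The function names are hypothetical.

```python
from math import comb

def prob_split_independent(n_men=50, cutoff=20):
    """Chance that one group gets <= cutoff or >= n_men - cutoff of the men
    when each man is assigned by an independent fair coin flip."""
    extreme = sum(comb(n_men, k) for k in range(n_men + 1)
                  if k <= cutoff or k >= n_men - cutoff)
    return extreme / 2 ** n_men

def prob_split_fixed_groups(n_men=50, n_women=50, group_size=50, cutoff=20):
    """Same question when exactly group_size volunteers are drawn for group A."""
    total = comb(n_men + n_women, group_size)
    extreme = sum(comb(n_men, k) * comb(n_women, group_size - k)
                  for k in range(n_men + 1)
                  if 0 <= group_size - k <= n_women
                  and (k <= cutoff or k >= n_men - cutoff))
    return extreme / total

print(f"independent coin flips: {prob_split_independent():.3f}")
print(f"fixed groups of 50:     {prob_split_fixed_groups():.3f}")
```

The coin-flip model gives roughly 20%, which appears to match the figure quoted above. Forcing the two groups to be exactly 50 each makes a split this lopsided less likely, closer to 7%. Either way it is far from impossible.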
What to do? Create blocks, in this case a block of the 50 men and a block of the 50 women. Then within each block you randomly assign individuals to receive medication or a placebo. Now you can find how the drug affects women and how it affects men. This is called a randomized block design. When you can identify a potentially confounding variable before you perform your experiment, first divide your subjects into blocks according to that variable, and then randomize within each block. Do this, and you have tamed that confounding variable.
R. A. Fisher coined the term “randomized block” in 1926.
In this example, gender would be called a blocking variable because you divide your subjects into blocks according to gender. Now there’s no problem separating the effects of the drug from the effects of gender in your experimental results.
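Here is a minimal Python sketch of randomizing within blocks for the weight-loss example. The volunteer IDs and the function name `randomize_within_blocks` are hypothetical, and a real trial would follow its own protocol. The sketch only shows the idea: split by the blocking variable first, then shuffle inside each block.

```python
import random

def randomize_within_blocks(subjects, block_of, seed=None):
    """Assign half of each block to 'drug' and half to 'placebo' at random.

    subjects: list of subject IDs
    block_of: dict mapping subject ID -> block label (here, gender)
    """
    rng = random.Random(seed)
    blocks = {}
    for s in subjects:
        blocks.setdefault(block_of[s], []).append(s)

    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)            # randomize inside this block only
        half = len(members) // 2
        for s in members[:half]:
            assignment[s] = "drug"
        for s in members[half:]:
            assignment[s] = "placebo"
    return assignment

# Hypothetical volunteers: 50 men (M01..M50) and 50 women (W01..W50).
block_of = {f"M{i:02d}": "male" for i in range(1, 51)}
block_of.update({f"W{i:02d}": "female" for i in range(1, 51)})
assignment = randomize_within_blocks(list(block_of), block_of, seed=7)

men_on_drug = sum(1 for s, t in assignment.items()
                  if t == "drug" and block_of[s] == "male")
print("men on drug:", men_on_drug)
```

Because the split happens inside each block, each treatment group gets 25 men and 25 women every time, whatever the shuffle does.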
When I talked about complete randomization, I said it would be laborious to take strata of a lot of variables, and that complete randomization was the answer. But here I’m suggesting exactly that for men and women in the weight-loss study. Right about now, you might be telling me, “Make up your mind!”
This is where some judgment is needed in making tradeoffs. Men and women typically have different percentages of body fat, and they are known to respond differently to some drugs. It makes sense that a weight-loss drug could have different response from men and women, and therefore you block on gender. But no other factor stands out as both important and measurable. If you tried to block on motivation, for instance, how would you measure it?
“Block what you can, randomize what you cannot” is a good rule, sometimes attributed to George Box. A variable is a candidate for blocking when it seems like it could make a difference, and you can identify and measure it. For other variables, we depend on randomization, either complete randomization or randomization within blocks.
There’s one circumstance where you can be sure that the subgroups are perfectly matched with respect to lurking variables: when you use a matched-pairs design. This is kind of like a randomized block design where each block contains two identical subjects.
Example 24: You want to know whether one form of foreign-language instruction is more effective than another. So you take fifty pairs of identical twins, and assign one twin from each pair to group A and the other twin to group B. Then you know that genetic factors are perfectly balanced between the two groups. And if you restrict yourself to twins raised together, you’ve also controlled for environmental factors.
A special type of matched-pairs design matches each experimental unit to itself.
Example 25: You want to know the effect of caffeine on heart rate. You don’t assemble a sample, give coffee to half of them, and measure the difference in heart rate between the groups. People’s heart rates vary quite a bit, so you would have large variation within each group, and that might swamp the effect you’re looking for.
Instead, you measure each individual’s resting heart rate, then give him or her a cup of coffee to drink, and after a specified time measure the heart rate again. By comparing each individual to himself or herself, you determine what effect caffeine has on each person’s heart rate, and people’s different resting heart rates aren’t an issue.
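The arithmetic of comparing each person to himself or herself is short enough to sketch in Python. The heart rates below are invented for illustration, not measurements from any study. Notice how much the resting rates vary from person to person and how little the individual changes do.

```python
# Hypothetical heart rates in beats per minute, one entry per person.
before = [62, 75, 58, 81, 70, 66]    # resting
after = [68, 79, 65, 84, 77, 70]     # a set time after one cup of coffee

# Pair each person's two readings and keep only the change.
differences = [a - b for a, b in zip(after, before)]
mean_difference = sum(differences) / len(differences)

spread_resting = max(before) - min(before)
spread_change = max(differences) - min(differences)

print("changes:", differences)
print(f"mean change: {mean_difference:.1f} bpm")
print(f"spread of resting rates: {spread_resting} bpm; spread of changes: {spread_change} bpm")
```

In this made-up data the resting rates span 23 beats per minute but the changes span only 4. That is why the matched-pairs design can detect an effect that person-to-person variation would otherwise swamp.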
See also: Experimental Design in Statistics shows how the same experiment would work out with a completely randomized design, randomized blocks, and matched pairs.
Definition: In an experiment, usually one of the treatments will be no treatment at all. The group that gets no treatment is called the control group.
But “no treatment at all” doesn’t mean just leaving the control group alone. They should be treated the same way as the other groups, except that they get zero of whatever the other groups are getting. If the treatment groups are getting injections, the control group must get injections too. Otherwise you’ve introduced a lurking variable: effects of just getting a needle stick, and in humans the knowledge that they’re not actually getting medicine.
Definition: A placebo is a substance that has no medical activity but that the subjects of the experiment can’t tell from the real thing.
The placebo effect is well known. Sick people tend to get better if they feel like someone is looking after them. So if you gave your treatment group an injection but your control group no injection, you’d be putting them in a different psychological state. Instead, you inject your control group with salt water.
TheProfessorFunk has a fun three-minute YouTube video on the placebo effect (Keogh 2011 [see “Sources Used” at end of book]). Thanks to Benjamin Kirk for drawing this to my attention.
You might think placebos would be unnecessary when experimenting on animals. But if you’ve ever had a pet, you know that some animals are stressed by getting an injection. If the control group didn’t get an injection, you’d have those differing stress levels as a lurking variable. So you administer a placebo.
Example 26: Sometimes, for practical or ethical reasons, you have to get a little bit creative with a control group. Here’s an excellent example from Wheelan (2013, 238) [see “Sources Used” at end of book]:
Suppose a school district requires summer school for struggling students. The district would like to know whether the summer program has any long-term academic value. As usual, a simple comparison between students who attend summer school and those who do not would be worse than useless. The students who attend summer school are there because they are struggling. Even if the summer school program is highly effective, the participating students will probably still do worse in the long run than the students who were not required to take summer school. What we want to know is how the struggling students perform after taking summer school compared with how they would have done if they had not taken summer school. Yes, we could do some kind of controlled experiment in which struggling students are randomly selected to attend summer school or not, but that would involve denying the control group access to a program that we think would be helpful.
Instead, the treatment and control groups are created by comparing those students who just barely fell below the threshold for summer school with those who just barely escaped it. Think about it: the [group of all] students who fail a midterm are appreciably different from [the group of all] students who do not fail a midterm. But students who get a 59 percent (a failing grade) are not appreciably different from those students who get a 60 percent (a passing grade).
Some people do better if they think they’re getting medicine, even if they’re not. To avoid this placebo effect, the standard technique is the double blind.
Definitions: In a double-blind experiment, neither the test subjects nor those who administer the treatments know what treatment each subject is getting. In a single-blind experiment, the test subjects don’t know which treatment they’re getting, but the personnel who administer the treatments do know.
Okay, given that people’s thoughts influence whether they improve, a single blind makes sense. If you let someone know they’re not getting medicine in a trial, they’re less likely to improve. But why isn’t that enough? Why is a double blind necessary?
For one thing, there’s always the risk that a doctor or nurse might tell the subject, accidentally or on purpose. But beyond that, if you’re treating someone who has a terrible disease, you might treat them differently if they’re getting a placebo than if they’re getting real medicine, even if you don’t realize you’re doing it. Why take the risk of introducing another lurking variable? Better to use a double blind and just rule out the possibility.
You might wonder how it’s done in practice. In a drug trial, for instance, each test subject is assigned a code number, and the drug company then packages pills or vaccines with a subject’s code number on each. The doctors and nurses who administer the treatments just match the code number on the pill or vaccine to each subject’s code number. Of course all the pills or vaccines look alike, so the workers who have contact with the subjects don’t know who’s getting medicine and who’s getting a placebo. And what they don’t know, they can’t reveal.
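The code-number scheme can be sketched in a few lines of Python. This is only an illustration under assumed details: the four-digit codes, the subject names, and the functions `make_blinding_key` and `labels_for_clinic` are all hypothetical, and real trials use far more careful procedures. The point is that the full key and the list the clinic sees are two different things.

```python
import random

def make_blinding_key(subjects, seed=None):
    """Build the sponsor's secret key: code number -> (subject, treatment)."""
    rng = random.Random(seed)
    codes = rng.sample(range(1000, 10000), len(subjects))  # unique 4-digit codes
    order = list(subjects)
    rng.shuffle(order)               # random assignment to drug or placebo
    half = len(order) // 2
    key = {}
    for position, (subject, code) in enumerate(zip(order, codes)):
        treatment = "drug" if position < half else "placebo"
        key[code] = (subject, treatment)
    return key

def labels_for_clinic(key):
    """What doctors and nurses receive: subject and code number, no treatment."""
    return sorted((subject, code) for code, (subject, _treatment) in key.items())

key = make_blinding_key(["Ann", "Bob", "Cal", "Dee", "Eve", "Fay"], seed=3)
for subject, code in labels_for_clinic(key):
    print(subject, code)
```

Only whoever holds `key` can tell drug from placebo. The clinic’s list carries no treatment information, so there is nothing for the staff to reveal.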
You’ll be dealing with numbers through most of this course. Handle them right, and you won’t get burned! There are three issues here: how many digits to round to, how to round to that number of digits, and when to do your rounding.
There are a lot of rules for how many digits you should round to, but we’re not going to be that rigorous in this course. Instead, you’ll use common sense supplemented by a few rules of thumb. What’s common sense? Avoid false precision, and avoid overly rough numbers.
The rules are important, but we have only so much time, and you’ve probably learned them in your science courses. If you want to, look up “significant figures” or “significant digits” in the index of pretty much any science textbook, or look at Significant Figures/Digits [see “Sources Used” at end of book].
Example 27: When you fill your car’s gas tank, the pump shows the number of gallons to three decimal places. You can also describe that as the nearest thousandth of a gallon. How much gas is that? Convert it to teaspoons (Brown 2018 [see “Sources Used” at end of book]): (0.001 gal) × (4 qt/gal) × (4 cup/qt) × (16 Tbsp/cup) × (3 tsp/Tbsp) ≈ 0.8 tsp. You can bet there’s several times that much in the hose when the pump shuts off. Three decimal places at the gas pump is false precision a/k/a spurious accuracy. That third decimal place is just noise, statistical fluctuations without real significance.
On the other hand, suppose the pump showed only whole gallons. This is too rough. You can go along pumping gas for no extra charge (bad for the merchant), and then abruptly the cost jumps by several dollars (bad for you).
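If you’d like to check the gas-pump conversion, here is the same chain of unit factors as a Python sketch. The constant and function names are just illustrative choices; the factors themselves are the ones used in the example.

```python
QT_PER_GAL = 4
CUP_PER_QT = 4
TBSP_PER_CUP = 16
TSP_PER_TBSP = 3

def gallons_to_teaspoons(gallons):
    """Convert US gallons to teaspoons by chaining the unit factors."""
    return gallons * QT_PER_GAL * CUP_PER_QT * TBSP_PER_CUP * TSP_PER_TBSP

smallest_step = gallons_to_teaspoons(0.001)   # one unit in the third decimal place
print(f"0.001 gal is about {smallest_step:.1f} tsp")
```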
Here are some rules of thumb to supplement your common sense. These are not matters of right and wrong, but conventions to save thinking time:
Round in one step. Say you have a number 1.24789, and you want to round it to one decimal place. Draw a line — mentally or with your pencil — at the spot where you want to round: 1.2|4789. If the first digit to the right of that line is a 0, 1, 2, 3, or 4, throw away everything to the right of the line. It doesn’t matter what digits come after that first digit. Here, the first digit to the right of the line is a 4, so you throw away everything to the right of the line: 1.24789 rounded to one decimal place is 1.2.
Rounding in multiple steps, 1.24789 → 1.2479 → 1.248 → 1.25 → 1.3, is wrong. (Why? Because 1.24789 is 0.05211 units away from 1.3, but only 0.04789 units away from 1.2.) You must round in one step only.
As you know, if the first digit to the right of the line is a 5, 6, 7, 8, or 9, you raise the digit to the left of the line by one and throw away everything to the right of the line. To one decimal place, 1.27489 is 1.2|7489 → 1.3.
You may need to “carry one”. What is 1.97842 to one decimal place? 1.9|7842 needs you to increase that 9 by one. That means it becomes a zero and you have to increase the next digit over: 1.9+0.1 = 2.0. Therefore, 1.97842 rounded to one decimal place is 2.0.
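The rounding rules above can be tried out in Python. One caution: Python’s built-in `round` works on binary floating-point numbers and rounds exact halves to the nearest even digit, so this sketch uses the standard `decimal` module with `ROUND_HALF_UP` to match the rule as stated here. The helper name `round_half_up` is a made-up convenience, not a library function.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(value, places):
    """Round a decimal (given as a string or Decimal) to `places` decimal
    places, raising the last kept digit when the next digit is 5 or more."""
    step = Decimal(10) ** -places          # places=1 -> Decimal('0.1')
    return Decimal(value).quantize(step, rounding=ROUND_HALF_UP)

# Right: round in one step.
one_step = round_half_up("1.24789", 1)

# Wrong: round one digit at a time, 4 places, then 3, then 2, then 1.
multi_step = Decimal("1.24789")
for places in (4, 3, 2, 1):
    multi_step = round_half_up(multi_step, places)

print("one step:  ", one_step)       # 1.2
print("multi step:", multi_step)     # 1.3
print("carry one: ", round_half_up("1.97842", 1))   # 2.0
```

The multi-step version drifts upward at every stage and ends at 1.3, even though 1.24789 is closer to 1.2.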
Here’s the Big No-No: Never do further calculations with rounded numbers. What’s the right way? Round only after the last step in calculation.
Example 28: True story: In Europe, average body temperature for healthy people was determined to be 36.8°C, as repeated in A Critical Appraisal of 98.6°F (Mackowiak, Wasserman, Levine 1992 [see “Sources Used” at end of book]). Rounding to the nearest degree, the average human body temperature is 37°C. So far so good.
But in the US, thermometers for home use are marked in degrees Fahrenheit. Some nimrod converted 37°C using the good old formula 1.8C + 32 and got 98.6°F, and that’s what’s marked on millions of US thermometers as “normal” temperature. If you’ve got one of those, ask for your money back, because it’s wrong.
Why is it wrong? The person who did the conversion committed the Big No-No and did further calculations with a rounded number. For a correct calculation, use the unrounded number, 36.8. (Okay, 36.8 was probably rounded from 36.77 or 36.82 or something. But the point is that it’s the least rounded number available.) 1.8 × 36.8 + 32 = 98.24 → 98.2, and that is the average body temperature for healthy humans.
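Here is the Big No-No and its cure side by side, as a short Python sketch. The function name `celsius_to_fahrenheit` is illustrative; the formula and the 36.8°C figure come from the example above.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a Celsius temperature with the usual formula 1.8C + 32."""
    return 1.8 * celsius + 32

average_c = 36.8

# Big No-No: round first (36.8 -> 37), then calculate.
rounded_too_early = round(celsius_to_fahrenheit(round(average_c)), 1)

# Right way: calculate with the unrounded number, round only at the end.
rounded_at_end = round(celsius_to_fahrenheit(average_c), 1)

print("rounded too early:", rounded_too_early)   # 98.6
print("rounded at end:   ", rounded_at_end)      # 98.2
```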
Randall Munroe shows what can happen if you carry rounding-in-the-middle-of-a-calculation to extremes:
used by permission; source: https://xkcd.com/2585/
When a calculation results in a number lower than about 0.0005, your calculator will usually present it in the dreaded scientific notation, such as 1.99E-4. Be alert for this! Your answer is not 1.99 (or however you want to round it). Your answer is 1.99×10⁻⁴.
How do you convert this to a decimal for reading by ordinary humans? (And yes, you should usually do that — definitely, if your work will be read by non-technical people.)
The exponent (the number after the E minus) tells you how many zeroes the decimal starts with, including the zero before the decimal point. 1.99×10⁻⁴ is 0.000 199 or 0.0002.
When a decimal starts with a bunch of zeroes, especially if the decimal is long, many people use spaces to separate groups of three digits. This makes the decimal easier to read.
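If you ever need software to do this conversion for you, here is one way to sketch it in Python with the standard `decimal` module. The helper names `to_plain_decimal` and `group_digits` are hypothetical; they just mirror the two steps described above.

```python
from decimal import Decimal

def to_plain_decimal(text):
    """Turn calculator-style scientific notation into an ordinary decimal string."""
    return format(Decimal(text), "f")

def group_digits(plain):
    """Put a space after every three digits to the right of the decimal point."""
    whole, _, frac = plain.partition(".")
    groups = [frac[i:i + 3] for i in range(0, len(frac), 3)]
    return whole + "." + " ".join(groups) if frac else whole

plain = to_plain_decimal("1.99E-4")
print(plain)                  # 0.000199
print(group_digits(plain))    # 0.000 199
```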
Don’t just write down answers; show your work. This is in your own best interest:
“But,” I hear you object, “in the real world, all that matters is getting the right answer.” True enough, but there’s a difference between being in the real world and preparing for the real world. Part of your study is to develop thought and work habits that ensure you will get the right answer when there’s nobody around to check you. You expose your process now, so that problems can be corrected.
How do you show your work?
The general idea is to show enough that someone familiar with the course content can follow what you did.
When evaluating a formula, write down the formula, then on a line below show it with the numbers replacing the letters. Your calculator can handle very complicated formulas in one step, so your next line will be your last line, containing the final answer and any rounding you do. Example:
SEM = σ/√n
SEM = 160/√37
SEM = 26.30383797 → SEM = 26.3
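The same three-line pattern (formula, numbers substituted, result rounded at the very end) carries over if you check the calculation in software. A minimal Python sketch, using the numbers from the example above:

```python
from math import sqrt

sigma = 160   # population standard deviation, from the example
n = 37        # sample size, from the example

sem = sigma / sqrt(n)          # SEM = sigma / sqrt(n)
print(sem)                     # unrounded result
print(round(sem, 1))           # rounded only at the end: 26.3
```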
You’ll be using a lot of the menus and commands on the TI-83 or TI-84. Here are some tips:
If you use randInt to get five random integers from 1 to 100, write down randInt(1,100,5). That’s the only way your instructor will know that you know how to use that function. If you think the command is randInt(5,100), now is the time to correct that misunderstanding.
If you use 1-VarStats L1,L2, write that. For pity’s sake, don’t write out every keystroke down to the final [ENTER]. I put them in this book because you’re just learning them. But someone familiar with the course knows how to get the command, and I hate to think of all the time and paper you could waste.
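To see why the order of arguments matters, here is a rough Python analogue of randInt(1,100,5). It is a sketch only: rand_int is a hypothetical helper written for this illustration, not a calculator command, and it simply mirrors the lower, upper, count order described above.

```python
import random

def rand_int(low, high, count):
    """Return `count` random integers from low to high, inclusive at both ends."""
    return [random.randint(low, high) for _ in range(count)]

# Five random integers from 1 to 100, like randInt(1,100,5)
sample = rand_int(1, 100, 5)
print(sample)
```

The output changes on every run, which is one more reason to write down the command you used and not just the numbers it gave you.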
You’ll find that your calculator does the complicated stuff for you, but here and there I’ve scattered formulas in BTW paragraphs in case you want to peek behind the curtain.
Stats formulas usually need to do the same thing to every member of a data set and then add up the results. The Greek letter ∑, a capital sigma, indicates this. This summation notation makes formulas easier to write — easier to read, too, once you get used to it.
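As a small illustration of what ∑ asks for, this Python sketch does the same thing to every member of a data set and adds up the results. The four data values are made up purely for the example.

```python
data = [3, 5, 8, 12]   # made-up numbers, for illustration only

# ∑x: add up every value
total = sum(data)

# The mean is that total divided by the number of values
mean = total / len(data)

# ∑(x − mean)²: subtract the mean from each value, square it, then add up
sum_sq_dev = sum((x - mean) ** 2 for x in data)

print(total)       # 28
print(mean)        # 7.0
print(sum_sq_dev)  # 46.0
```

Each sum line corresponds to one ∑ in a formula: the expression inside tells you what to do to each value, and the sum adds up the results.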
Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.
Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.
How would you construct a random sample of size 50? a systematic sample? a cluster sample? Which of these is the best balance between statistical purity and practicality?
Each doctor will put up notices in the waiting rooms, and will select the first 30 adult volunteers, assigning the first 15 to the experimental group and the second 15 to the control group. Patients will not be told which group they are in; you supply placebo pills that are identical in appearance to the active medication. Doctors will administer the placebo and medication to the selected groups and report results back to you.
Identify three serious errors in this technique. Are these examples of sampling or nonsampling error?
Now you answer these:
(b) The average dinner check at my restaurant last Friday was $38.23.
(c) 45% of patients taking Effluvium complained of bloating and stomach pain.
(d) The average size of a party at my restaurant last Friday was 2.9 people.