Stats without Tears
3. Numbers about Numbers
Updated 1 Feb 2015
(What’s New?)
Copyright © 2013–2017 by Stan Brown
For numeric data, the goal of descriptive stats is to show the shape, center, spread, and outliers of a data set. In this chapter, you learn how to find and interpret numbers that do that.
Measures of shape called skewness and kurtosis do exist, but they’re not part of this course. Roughly, skewness tells how this data set differs from a symmetric distribution, and kurtosis tells how it differs from a normal distribution. If you’re interested, you can learn about them in Measures of Shape: Skewness and Kurtosis. The MATH200B Program part 1 can compute those measures of shape for you.
There are three common measures of the center of a data set: mean, median, and mode.
The mean is nothing more than the average that you’ve been computing since elementary school.
The symbol for the mean of a sample is x̅, pronounced “x bar”. The symbol for the mean of a population is the Greek letter μ, pronounced “mew” and spelled “mu” in English. (Don’t write μ as “u”; the letter has a tail at the left.)
You can think of the mean as the center of gravity of the distribution. If you made a wooden cutout of the histogram, you could balance it on a pencil or your finger placed exactly under the mean.
The formula for the mean is x̅ = ∑x/n or μ = ∑x/N, meaning that you add up all the numbers in the data set and then divide by sample size or population size.
The median is the middle number of a sample or population. It is the number that is above and below equal numbers of data points. (Examples are below.)
There’s no one agreed symbol for the median. Different books use M or Med or just “median”.
To find the median by hand, you must put the numbers in order. If the data set has an odd number of data points, counting duplicates, then the median is the middle number. If the data set has an even number of data points, the median is half way between the two middle numbers. (In the next section, you’ll get the median from your TI calculator, with no need to sort the numbers.)
The mode is the number that occurs most frequently in a data set. If two or more numbers are tied for most frequent, some textbooks say that the data set has no mode, and others say that those numbers are the modes (plural). We’ll follow the second convention.
Most distributions have only one mode, and we call them unimodal. If a distribution has two modes, or even if it has two “frequency peaks” like the one at right, we call it bimodal. (This was students’ final grades in a math course: a lot of low or high grades, and few in the middle.)
There’s no symbol for the mode.
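If you’d like to see the three measures of center side by side, here is a minimal sketch in Python using the standard statistics module. The data set is made up for this illustration (it isn’t from the chapter), and multimode is used because it matches the second convention above: values tied for most frequent are all reported as modes.

```python
from statistics import mean, median, multimode

# Hypothetical data set, invented for this illustration.
data = [2, 3, 3, 5, 7, 7, 9]

x_bar = mean(data)        # add up the numbers, divide by how many there are
med = median(data)        # middle number after sorting
modes = multimode(data)   # every value tied for most frequent

print(x_bar, med, modes)
```

Here 3 and 7 each occur twice, so under the second convention the data set has two modes.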
Example 1: You’re interviewing at a company. You ask about the average salary, and the interviewer tells you that it’s $100,000. That sounds pretty good to you. But when you start work, you find that everybody you work with is making $10,000. What went wrong here?
The interviewer told the truth, but left out a key fact: Everybody but the president makes below the average. Eight employees make $10,000 each, the vice president makes $50,000, and the president makes $870,000. Yes, the mean is (8×10,000 + 50,000 + 870,000)/10 = $100,000, but that’s not representative because the president’s salary is an outlier. It pulls the mean away from the rest of the data, and skews the salary distribution toward the right. This graph tells the sad tale:
There was your mistake. Salaries at most companies are strongly skewed right, so most employees make less than the average. When a data set is skewed, the mean is pulled toward the extreme values. (A data set can be skewed without outliers, but when there are outliers the data set is almost certain to be skewed.)
You should have asked for the median salary, not the average (mean) salary. There are 10 employees, and 50% of 10 is 5, so the median is less than or equal to five data points and greater than or equal to five data points. The fifth-highest and sixth-highest salaries are both $10,000, so the median is $10,000.
The median is more representative than the mean when a data set is skewed. The mean is pulled toward an extreme value, but the median is unaffected by extreme values in the data set. We say that the median is resistant.
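To watch the outlier pull the mean while the median stays put, you can rerun Example 1 in a few lines of Python. This is only a sketch with the salaries from the example; the variable names are my own.

```python
from statistics import mean, median

# Salaries from Example 1: eight employees, the vice president, the president.
salaries = [10_000] * 8 + [50_000, 870_000]

mean_salary = mean(salaries)      # pulled upward by the president's salary
median_salary = median(salaries)  # resistant to the outlier

print(f"mean   = ${mean_salary:,.0f}")
print(f"median = ${median_salary:,.0f}")
```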
Example 2: What is the median of the data set 8, 15, 4, 1, 2? Put the numbers in order: 1, 2, 4, 8, 15. There are five numbers, and 50% of 5 is 2.5. You need the number that is above 2 data points and below 2 data points; the median is 4.
Example 3: What is the median of the data set 7, 24, 15, 1, 7, 45? There are six data points, and in order they are 1, 7, 7, 15, 24, 45. 50% of 6 is 3; you need the number that is above 3 data points and below 3 data points. It’s clear that the median is between 7 and 15, but where exactly? When the sample size is an even number, the median is the average of the two middle numbers. Therefore the median for this data set is the average of 7 and 15, (7+15)/2 = 11.
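The by-hand procedure in Examples 2 and 3 can be sketched in Python as follows. The function name median_by_hand is hypothetical; Python’s own statistics.median does the same job and is used here only as a cross-check.

```python
from statistics import median

def median_by_hand(data):
    """Sort, then take the middle value or average the two middle values."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                       # odd count: one middle number
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average two

example_2 = [8, 15, 4, 1, 2]
example_3 = [7, 24, 15, 1, 7, 45]

print(median_by_hand(example_2), median(example_2))
print(median_by_hand(example_3), median(example_3))
```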
When a distribution is symmetric, the mean and median are close together. If it’s unimodal, the mode is close to the mean and median as well.
But have you ever taken a course that was graded on a curve, and one or two “curve wreckers” ruined things for everyone else? What happened? Their high scores raised the class average (mean), so everybody else’s scores looked worse. The class scores were skewed right: low scores occurred frequently, and high scores were rare. (You can see shapes of skewed distributions in Chapter 2.)
When a distribution is skewed, the mean is pulled toward the extreme values. The median is resistant, unaffected by extreme values. And you can reverse that logic too: if the mean is greater than the median, it must be because the distribution is skewed right. From the median to the mean is the direction of skew.
Skewed left: mean < median (usually)
Skewed right: mean > median (usually)
For heaven’s sake, don’t memorize that! Instead, just draw a skewed distribution and ask yourself approximately where the mean and median fall on it.
Karl Pearson gives the rule median = (2×mean + mode)/3 for moderately skewed distributions. For more about this, see Empirical Relation between Mean, Median and Mode [see “Sources Used” at end of book].
Caution! All the statements in this section are a rule of thumb, true for most distributions. The logic holds for almost every unimodal continuous distribution, and for discrete distributions with a lot of different values. But it tends to break down on discrete distributions that have only a few different values. For more about this, see von Hippel 2005 [see “Sources Used” at end of book].
Summary:
The 1-VarStats command gives you mean, median, and much more for any data set. If you have just a plain list of numbers, enter the name of that list on the command line. If you have a frequency distribution, enter the name of the data list and the name of the frequency list on the command line.
Excel: Excel can do these computations. This isn’t an Excel course, but if you’re an Excel head you can figure out how to get this information. One way is with the Data Analysis tool, part of the Analysis Toolpak add-in that comes with Excel (though you may have to enable it). Another way is to click in a blank cell, click » » and select the appropriate worksheet function.
Example 4: Professor Marvel had a statistics class of fifteen students, and on one quiz their scores were
10.5 13.5 8 12 11.3 9 9.5 5 15 2.5 10.5 7 11.5 10 10.5
Your TI-83 or TI-84 can give you the mean, median, and other numbers that summarize this data set.
Press [CLEAR]. Then press [STAT] [ENTER] to get into the edit screen for statistics lists. You can use any list, but let’s use L1 this time. (If you don’t see L1, and pressing the left arrow doesn’t bring it into view, press [STAT] [5] [ENTER] [STAT] [ENTER].)
Cursor to the L1 label at the top (not the top number, the column heading) and press [CLEAR] [ENTER] to clear the list.
Enter the numbers, pressing [ENTER] after each one.
If you entered a number that doesn’t belong, press [DEL] to remove it; if you left out a number, press [2nd DEL makes INS] to open a space for it.
Press [STAT] [►] [1] to select 1-VarStats.
If your calculator prompts you for the list, enter [2nd 1 makes L1]. For a simple list of numbers like this one, there is no frequency list, so press [DEL].
Otherwise, the calculator pastes 1-VarStats to the home screen. On the same line, identify the list that contains your data: [2nd 1 makes L1].
Either way, write down the complete command, 1-VarStats L1, and press [ENTER] to execute it. The results screen is shown below. A down arrow on the screen says that there is more information if you press [▼], and an up arrow says that there is more information if you press [▲].
Look first at the bottom of the screen. Always check n first: if it doesn’t match your sample or population size, the other numbers are big sacks of bogosity. In this case a quick count of the original data set shows 15 numbers, which is the right quantity. (Of course, this check can’t determine if you miskeyed any numbers. Only double and triple checking can protect you from that kind of mistake.)
What are you seeing on this screen?
A word about rounding: The rules for significant digits and rounding are beyond the scope of this course, but beware of being ridiculously precise. (For example, most gasoline pumps are calibrated in 0.001 gallon units. But 0.001 gallon is two tablespoons, and there’s considerably more gas than that in the hose, so that precision is just silly.)
A good rule of thumb is to report sample statistics and population parameters to one more decimal place than the original data. Then why did I say μ = 9.7 instead of 9.72, since the original data have one decimal place? That’s a valid question, and my answer is that 9.72 would not be wrong but it feels overly precise when there are only fifteen data points, most are whole numbers, and the rest are a whole number plus ½.
Since this data set is a population, select σx = 3.057929583 and write down σ = 3.1.
The name standard deviation was created in 1893 by Karl Pearson. (We might wish that he had chosen something with fewer than six syllables.) He assigned the symbol σ to the standard deviation of a population in 1894.
Showing your work and your results, you write down:
1-VarStats L1
μ = 9.7
σ = 3.1
N = 15
min = 2.5
Q1 = 8
Med = 10.5
Q3 = 11.5
max = 15
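If you don’t have the calculator handy, a rough Python equivalent of this 1-VarStats run might look like the sketch below. It uses the standard statistics module; pstdev is the population SD (it divides by N). Quartiles are left out on purpose, because quartile definitions vary and a library default may not match the calculator.

```python
from statistics import mean, median, pstdev

# Quiz scores from Example 4.
scores = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5, 10.5, 7, 11.5, 10, 10.5]

n = len(scores)          # always check this first
mu = mean(scores)        # population mean
sigma = pstdev(scores)   # population SD: divides by N

print(f"N = {n}")
print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"min = {min(scores)}, Med = {median(scores)}, max = {max(scores)}")
```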
Number of Adults in Vehicles Entering Park
Adults in Vehicle | Number of Vehicles |
---|---|
0 | 2 |
1 | 5 |
2 | 7 |
3 | 15 |
4 | 5 |
5 | 2 |
6 | 2 |
7 | 1 |
8 | 1 |
Total | 40 |
Example 5: Your TI-83 or TI-84 can also compute statistics of a frequency distribution. Let’s try it with the data from Chapter 2 for number of adults in vehicles entering the park.
Enter the data values in one statistics list, such as L1. Enter the frequencies in a second list, such as L2. Press [STAT] [►] [1] to select 1-VarStats.
If your calculator prompts you for the lists, enter [2nd 1 makes L1] for List and [2nd 2 makes L2] for FreqList.
Otherwise, on the home screen enter [2nd 1 makes L1], then [,] (comma), then [2nd 2 makes L2] and [ENTER]. The data list must come first and the frequency list second.
Caution — rookie mistake: Students often leave off the frequency list. Your calculator is pretty good, but it can’t read your mind. The only way it knows that you have a frequency distribution is if you give it both the frequency list and the data list.
Either way, write down the complete command on your paper: 1-VarStats L1,L2.
Here are the results:
Again, look at n first. That protects you from the rookie mistake of leaving off the frequency list. If n is wrong, redo your 1-VarStats command and this time do it right.
These forty vehicles are obviously not all the vehicles that enter the park, so they are a sample, not a population. You therefore write down the statistics as follows:
1-VarStats L1,L2
x̅ = 3.0
s = 1.7 (from Sx = 1.73186575)
n = 40
min = 0
Q1 = 2
Med = 3
Q3 = 4
max = 8
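Here is a sketch of the same frequency-distribution computation in Python, in case you want to check the calculator. The two lists play the roles of L1 and L2; the variable names are my own. Because the forty vehicles are a sample, the SD divides by n−1.

```python
from math import sqrt
from statistics import median

# Frequency distribution from Example 5.
adults = [0, 1, 2, 3, 4, 5, 6, 7, 8]       # data values (the L1 list)
vehicles = [2, 5, 7, 15, 5, 2, 2, 1, 1]    # frequencies (the L2 list)

n = sum(vehicles)                                          # sample size
x_bar = sum(x * f for x, f in zip(adults, vehicles)) / n   # sample mean
ss = sum(f * (x - x_bar) ** 2 for x, f in zip(adults, vehicles))
s = sqrt(ss / (n - 1))                                     # sample SD

# Expanding the table into a plain list lets us take the median directly.
expanded = [x for x, f in zip(adults, vehicles) for _ in range(f)]
med = median(expanded)

print(f"n = {n}, x-bar = {x_bar:.1f}, s = {s:.1f}, Med = {med}")
```

If you forget the frequencies and average the adults list alone, n comes out as 9 instead of 40, which is exactly the rookie mistake that checking n catches.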
Sometimes you take an average where some data points are more important than others. We say that they are weighted more heavily, and the mean that you compute in this way is called a weighted average or weighted mean.
You’re intimately familiar with one example of a weighted average: your GPA or grade point average.
Example 6: The NHTSA’s Corporate Average Fuel Economy or CAFE Rule (NHTSA 2008) [see “Sources Used” at end of book] specifies a corporate average of 34.8 mpg (miles per gallon) for passenger cars. Let’s keep things simple and suppose that ZaZa Motors makes three models of passenger car: the Behemoth gets 22 mpg, the Ferret gets 35 mpg, and the Mosquito gets 50 mpg. Does ZaZa meet the standard?
To answer that, you can’t just average the three models: (22+35+50)/3 ≈ 35.7 mpg. What if the company sells one Mosquito, and the rest are Behemoths and a sprinkling of Ferrets? You have to take into account the number of cars of each model sold. In effect, you have a frequency distribution with mpg figures and repetition counts. Let’s suppose these are the sales figures:
Auto Sales by ZaZa Motors
Model | Miles per Gallon | Number Sold |
---|---|---|
Behemoth | 22 | 100,000 |
Ferret | 35 | 250,000 |
Mosquito | 50 | 20,000 |
Total | | 370,000 |
Put the miles per gallon in L1 and the frequencies in L2. (How do you know it’s not the other way around? You’re trying to find an average mpg, so the mpg numbers are your data.) You should find:
1-VarStats L1,L2
μ = 32.3 mpg
N = 370,000 passenger cars
Even though two of the three models meet the standard, the mix of sales is such that ZaZa Motors’ CAFE is 32.3 mpg, and it’s not in compliance.
The formula for the mean of a grouped distribution and the formula for a weighted average are the same formula: μ = ∑xf/N for a population or x̅ = ∑xf/n for a sample. Either way, take each data value times its frequency. Add up all those products, and divide by the population size or sample size. For the notation, see ∑ Means Add ’em Up in Chapter 1.
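The formula μ = ∑xf/N translates almost word for word into code. This sketch applies it to the ZaZa Motors figures; weighted_mean is a hypothetical helper name, not a library function.

```python
def weighted_mean(values, weights):
    """Each value times its weight, added up, divided by the total weight."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

mpg = [22, 35, 50]                   # Behemoth, Ferret, Mosquito
sold = [100_000, 250_000, 20_000]    # number of each model sold
standard = 34.8                      # CAFE requirement for passenger cars

cafe = weighted_mean(mpg, sold)
print(f"CAFE = {cafe:.1f} mpg; meets the standard? {cafe >= standard}")
```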
In a grouped frequency distribution, one number called the class midpoint stands for all the numbers in the class.
Definition: The class midpoint for a given class equals the lower boundary plus half the class width. This is half way between the lower class boundary of this class and the lower class boundary of the next class.
Lengths of iTunes Songs (seconds)
Class Boundaries | Class Midpoint | Frequency |
---|---|---|
100–199 | 150 | 9 |
200–299 | 250 | 20 |
300–399 | 350 | 9 |
400–499 | 450 | 7 |
500–599 | 550 | 3 |
600–699 | 650 | 1 |
700–799 | 750 | 0 |
800–899 | 850 | 1 |
Example 7: Let’s revisit the lengths of iTunes songs from the ungrouped histogram in Chapter 2. What is the midpoint of the 300 to 399 class?
The class width equals the difference between lower boundaries: 400−300 = 100. Half the class width is 50, so the midpoint is 300+50 = 350. You could also compute the class midpoint as (300+400)/2 = 350.
However, it is wrong to take (300+399)/2 = 349.5 as class midpoint or 399−300 = 99 as class width. Don’t use the upper boundary in finding the class midpoint.
Of course you don’t have to compute every class midpoint the long way. Once you have the midpoint of the first class, (100+200)/2 = 150, just add the class width repeatedly to get the rest: 250, 350, … 850. The grouped frequency distribution, with the class midpoints, is shown at right.
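The midpoint rule (lower boundary plus half the class width) can be sketched in a few lines of Python. The lower boundaries are those of the iTunes classes; the variable names are my own.

```python
# Lower class boundaries for the iTunes song lengths: 100, 200, ..., 800.
lower_boundaries = list(range(100, 900, 100))

# Class width is the difference between consecutive LOWER boundaries.
class_width = lower_boundaries[1] - lower_boundaries[0]

# Midpoint = lower boundary plus half the class width.
midpoints = [low + class_width / 2 for low in lower_boundaries]

print(class_width, midpoints)
```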
What good is the class midpoint? It’s a stand-in for all the numbers in its class. Instead of being concerned with the nine different numbers in the 100 to 199 class, twenty different numbers in the 200 to 299 class, and so on, we pretend that the entire data set is nine 150s, twenty 250s, and so on. This means you get approximate statistics, but you get them with a lot less work.
Is this legitimate? How good is the approximation? Usually, quite good. In most data sets, a given class holds about equally many data points below the class midpoint and above the class midpoint, so the errors from making the approximation tend to balance each other out. And the bigger the data set, the more points you have in each class, so the approximation is usually better for a larger data set.
Procedure:
Enter the class midpoints in one statistics list, such as L1. Enter the frequencies in another list, such as L2. Enter the command 1-VarStats L1,L2 and write down the complete command on your paper.
Again, avoid the rookie mistake: include the class-midpoint list and the frequency list in your command.
The results screens are below. As usual, before you look at anything else, check that n matches the size of the data set. 50 is correct, so that’s one less worry.
There’s a problem with the second screen, though. Your calculator knows you have a frequency distribution, because you gave two lists to the 1-VarStats command. But it doesn’t have the original data, so it doesn’t know the true minimum (lowest data point). When you read minX=150, you interpret that to mean that the lowest data point occurs in the class whose midpoint is 150; in other words, the minimum is somewhere between 100 and 199. Your knowledge of the rest of the five-number summary has the same limitation. For instance, the median isn’t 250; all you know is that it occurs somewhere between 200 and 299.
Because of these limitations, you don’t do anything with the second results screen from a grouped distribution. The mean and standard deviation don’t have this problem: they’re approximate, but the approximation is good enough. (n is exact, not an approximation.)
These 50 iTunes songs are obviously not all the songs there are, not even all the songs in any particular person’s iTunes library. They are a sample, not a population. Therefore you write down your work and results like this:
1-VarStats L1,L2
x̅ = 316 (or you could write 316.0)
s = 145.1
n = 50
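As a cross-check on the calculator, here is a sketch of the grouped computation in Python. Each class midpoint stands in for every song in its class, so the mean and SD are approximations; the songs are a sample, so the SD divides by n−1.

```python
from math import sqrt

# Grouped distribution of iTunes song lengths (seconds).
midpoints = [150, 250, 350, 450, 550, 650, 750, 850]   # the L1 list
frequencies = [9, 20, 9, 7, 3, 1, 0, 1]                # the L2 list

n = sum(frequencies)
x_bar = sum(m * f for m, f in zip(midpoints, frequencies)) / n
ss = sum(f * (m - x_bar) ** 2 for m, f in zip(midpoints, frequencies))
s = sqrt(ss / (n - 1))   # sample SD

print(f"n = {n}, x-bar = {x_bar:.1f}, s = {s:.1f}")
```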
There are four common measures of the spread of a data set: range, interquartile range or IQR, variance, and standard deviation. (You may also see spread referred to as dispersion, scatter, variation, and similar words.)
Definition: The range of a data set is the distance between the largest and smallest members.
Example 8: If the largest number in a data set is 100 and the smallest is 20, the range is 100−20 = 80, regardless of what numbers lie between them and what shape the distribution might have.
Caution: The range is one number: 80, not “20 to 100”.
Obviously the range has a problem as a measure of spread: It uses only two of the numbers. Since only the two most extreme numbers in the data set get used to compute the range, the range is about as far from resistant as anything can be.
In favor of the range is that it’s easy to compute, and it can be a good rough descriptor for data sets that aren’t too weird. The interquartile range has something of the same idea, but it is resistant.
The interquartile range (IQR) is the distance between the largest and smallest members of the middle 50% of the data points, taking repetitions into account.
Alternative definition: The IQR is the third quartile minus the first quartile, or the 75th percentile minus the 25th percentile.
You’ll learn about percentiles and quartiles in the next section, Measures of Position, but for now let’s just take a quick non-technical example.
Example 9: Consider the data set 1, 2, 3, 3, 3, 4, 5, 8, 11, 11, 15, 23. There are twelve numbers, and the middle 50% (six numbers) are 3, 3, 4, 5, 8, 11. The interquartile range is 11−3 = 8.
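Here is Example 9 as a short Python sketch, with the range thrown in for comparison. It takes the “middle 50%” literally by dropping one quarter of the data points from each end, which works cleanly here only because twelve is divisible by four; that shortcut is my assumption, not a general quartile method.

```python
data = sorted([1, 2, 3, 3, 3, 4, 5, 8, 11, 11, 15, 23])

# Range: largest minus smallest. It uses only the two most extreme numbers.
data_range = data[-1] - data[0]

# Middle 50%: drop one quarter of the points from each end.
quarter = len(data) // 4                    # 12 numbers, so 3 on each end
middle_half = data[quarter:len(data) - quarter]
iqr = middle_half[-1] - middle_half[0]

print(data_range, middle_half, iqr)
```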
The IQR is a better measure of spread than the range, because it’s resistant to the extreme values. But it still has the problem that it uses only two numbers in the data set. Isn’t there some measure of spread that uses all the numbers in the data set, as the mean does? The answer is yes: the variance and the standard deviation use all the numbers.
Your calculator gives you the standard deviation, as you saw above. The variance is important in a theoretical stats course, but not so much in this practical course. We’ll measure spread with the standard deviation almost exclusively. (To save wear and tear on my keyboard and your printer, I’ll often use the abbreviation SD.)
If you’d like to know how the variance and SD are computed, read the “BTW” section that follows. Otherwise, skip down to “What Good Is the Standard Deviation, Anyway?”
x | x−μ | (x−μ)² |
---|---|---|
10.5 | 0.78 | 0.6084 |
13.5 | 3.78 | 14.2884 |
8 | -1.72 | 2.9584 |
12 | 2.28 | 5.1984 |
11.3 | 1.58 | 2.4964 |
9 | -0.72 | 0.5184 |
9.5 | -0.22 | 0.0484 |
5 | -4.72 | 22.2784 |
15 | 5.28 | 27.8784 |
2.5 | -7.22 | 52.1284 |
10.5 | 0.78 | 0.6084 |
7 | -2.72 | 7.3984 |
11.5 | 1.78 | 3.1684 |
10 | 0.28 | 0.0784 |
10.5 | 0.78 | 0.6084 |
Total | 0.00 | 140.2640 |
If you want to devise a measure of spread, it seems reasonable to consider spread from the mean, so try subtracting the mean from each quiz score and then adding up all those deviations. You get zero, so obviously “sum of deviations” isn’t a useful measure of spread.
But with the next column you strike gold. Squaring all the deviations changes the negatives to positives, and also weights the larger deviations more heavily. This is progress! Now divide the total of squared deviations by the population size and you have the variance: σ² = 140.2640/15 = 9.3509. (σ is the Greek letter sigma.)
(When computing the variance of a sample, you divide by n−1 rather than n. The reasons are technical and are explained in Steve Simon’s articles Degrees of Freedom (1999a) [see “Sources Used” at end of book] and Degrees of Freedom, Part 2 (2004) [see “Sources Used” at end of book].)
The variance is quite a good measure of spread because it uses all the numbers and combines their differences from the mean in one overall measure. But it’s got one problem. If the data are dollars, the squared deviations will be in square dollars, and therefore the variance will be in square dollars. What’s a square dollar? (No, I don’t know either.) You want a measure of spread that is in the same units as the original data, just like the mean and median are. The simplest solution is to take the square root of the variance, and when you do that you have the standard deviation (SD), σ = √(140.2640/15) = 3.05793, which rounds to 3.1. And because the standard deviation is in the same units as the original data, it can be used as a yardstick, as you’ll see below.
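If you’d rather let a computer fill in the table of deviations, this Python sketch follows the same steps for the quiz scores: deviations, squared deviations, variance, then the square root. The variable names are my own.

```python
from math import sqrt

# Quiz scores, treated as a population.
scores = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5, 10.5, 7, 11.5, 10, 10.5]

N = len(scores)
mu = sum(scores) / N

deviations = [x - mu for x in scores]      # these always add up to zero
squared = [d ** 2 for d in deviations]     # no negatives; big ones count more

variance = sum(squared) / N                # in squared units
sigma = sqrt(variance)                     # back in the original units

print(f"sum of deviations = {sum(deviations):.4f}")
print(f"sum of squares = {sum(squared):.4f}")
print(f"variance = {variance:.4f}, SD = {sigma:.5f}")
```

For a sample you would divide by N−1 instead of N in the variance line.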
For lovers of formulas, here they are. The standard deviation of a population, σ, has population size N on the bottom of the fraction; the standard deviation of a sample, s, has sample size n minus 1 on the bottom of the fraction. If you’re not familiar with the ∑ notation (sigma or summation), ∑x² means square every data value and add the squares; ∑x²f means square every data value, multiply by the frequency, and add those products. For the notation, see ∑ Means Add ’em Up in Chapter 1.
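Formulas for Standard Deviation
 | of a List of Numbers | of a Frequency Distribution |
---|---|---|
When the data set is the whole population | σ = √(∑(x−μ)²/N) = √((∑x² − (∑x)²/N)/N) | σ = √((∑x²f − (∑xf)²/N)/N) |
When the data set is just a sample | s = √(∑(x−x̅)²/(n−1)) = √((∑x² − (∑x)²/n)/(n−1)) | s = √((∑x²f − (∑xf)²/n)/(n−1)) |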
Why are there two formulas on each row under “list of numbers”? The first formula is the definition, and the second is a shortcut for faster computations. Of course they’re mathematically equivalent; you could prove that if you wanted to.
Sir Ronald Fisher coined the term variance in 1918. He used the symbol σ² for the variance of a population, since Pearson had already assigned σ to the standard deviation, and the variance is the square of the SD.
The standard deviation will be the key to inferential statistics, starting in Chapter 8, but even within the realm of descriptive statistics there are some applications. In addition to this section, you’ll see an application in z-Scores, below.
Working with the quiz scores on your TI-83 or TI-84, you found that the population mean was μ = 9.7 and the population SD was σ = 3.1. What does this mean?
Just as a concept, the standard deviation gives you an idea of the expected variation from one member of the sample or population to the next. The SD in this example is about a third of the mean, so you expect some variation but not a lot. But can you do better than this? Yes, you can!
You can predict what percentage of the data will be within a certain number of standard deviations above or below the mean. In a normal distribution, 68% of the data are between one SD below the mean and one SD above the mean (μ±σ), 95% are within two SD of the mean (μ±2σ), and 99.7% are within three SD of the mean (μ±3σ).
This is the Empirical Rule or 68–95–99.7 Rule. Caution! It’s good for normal distributions only.
You’ll notice that the 68%, 95%, and 99.7% of data occur within approximately one, two, and three SD of the mean. More accurate figures are shown in the pictures, but for now we’ll just use the simple rule of thumb. You’ll learn how to make precise computations in Chapter 7.
It’s not a traditional part of the Empirical Rule, but another useful rule of thumb is that, in a normal distribution, about 50% of the data are within 2/3 of a SD above and below the mean.
Example 10: Adult women’s heights are normally distributed with μ = 65.5″ and σ = 2.5″. (By the way, different sources give different values for human heights, so don’t be surprised to see different figures elsewhere in this book.) How tall are the middle 95% of women?
Solution: The middle 95% of the normal distribution lies between two SD below and two SD above the mean. 2σ = 2×2.5 = 5″, and 65.5±5 = 60.5″ to 70.5″, so 95% of women are 60.5″ to 70.5″ tall.
Actually there are two interpretations. You can say that 95% of women are 60.5″ to 70.5″ tall, or you can say that if you randomly select one woman the probability that she’s 60.5–70.5″ tall is 95%. Any probability statement can be turned into a proportion statement, and vice versa. You’ll learn about this in Interpreting Probability Statements in Chapter 5.
Example 11: What fraction of women are 65.5″ to 68″ tall?
Solution: 68−65.5 = 2.5, so 68″ is one standard deviation above the mean. You know that 68% of a normal distribution is within μ±σ. You also know that the normal distribution is symmetric, so 68%/2 = 34% of women are within one SD below the mean, and 34% are within one SD above the mean. Therefore 34% of women are 65.5″ to 68″ tall.
You can combine the three diagrams above and show data in regions bounded by each whole number of standard deviations, like this:
Where do these figures come from? For example, how do we know that about 13.5% of the population is between one and two standard deviations below the mean in a normal distribution? Well, 95% is between two SD below and two SD above the mean. Half of 95% is 47.5%, so 47.5% of the population is between the mean and two SD below the mean. Similarly, about 68% is between one SD below and one SD above, so 68/2 = 34% is between the mean and one SD below. But if 47.5% is between μ−2σ and μ (call it Region A) and 34% is between μ−σ and μ, then the part of Region A that is not in the 34% is the part between μ−2σ and μ−σ, and that must be 47.5−34 = 13.5%. If you had an afternoon to kill, you could work out the other seven percentages.
With this diagram, you can work Example 11 more easily, directly reading off the 34% figure for women between mean height and one SD above the mean. You can also work more complicated examples, like this one.
Example 12: If you randomly select a woman, how likely is it that she’s taller than 70.5″?
Solution: 70.5−65.5 = 5.0, so 70.5″ is two SD above the mean. From the diagram, you see that 2.35+0.15 = 2.5% of the population is more than two SD above the mean. Answer: a randomly selected woman has a 2.5% chance of being more than 70.5″ tall.
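If you’re curious how close the rule-of-thumb figures are, Python’s standard statistics.NormalDist class can compute the areas for Examples 10 through 12. This is only a sketch, assuming the heights really are normal with μ = 65.5″ and σ = 2.5″; the results come out slightly different from the rounded 95%, 34%, and 2.5%.

```python
from statistics import NormalDist

# Adult women's heights, assumed normal as in Example 10.
heights = NormalDist(mu=65.5, sigma=2.5)

# Example 10: proportion within two SD of the mean (60.5" to 70.5").
middle = heights.cdf(70.5) - heights.cdf(60.5)

# Example 11: proportion between the mean and one SD above (65.5" to 68").
mean_to_plus_one = heights.cdf(68) - heights.cdf(65.5)

# Example 12: proportion more than two SD above the mean (over 70.5").
above_two = 1 - heights.cdf(70.5)

print(f"{middle:.4f} {mean_to_plus_one:.4f} {above_two:.4f}")
```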
If you have a normal distribution, the Empirical Rule tells you how much of the population is in each region. What if you don’t have a normal distribution?
As you might expect, the portions of the population in the various regions depend on the shape of the distribution, but Chebyshev’s Inequality (or Chebyshev’s Rule) gives you a “worst case scenario”: no matter how skewed the distribution, at least 75% of the data are within 2 SD of the mean, and at least 89% are within 3 SD of the mean.
More generally, within k SD above and below the mean, you will find at least (1−1/k²)·100% of the data. (If you plug in k = 1, you’ll find that at least 0% of the data lie within one SD of the mean. Distributions where all the data are more than one SD away from the mean are unusual, but they do exist.)
Example 13: For the quiz scores, two standard deviations is 2×3.0579 = 6.1, so you expect at least 1−1/2² = 1−¼ = 75% of the quiz scores to be within the range 9.7±6.1 = 3.6 to 15.8. Remember that this is a worst case. In fact, 14 of the 15 numbers (93%) are within those limits.
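Example 13 can be checked with a short Python sketch. The function chebyshev_bound is a hypothetical name for the formula 1−1/k², and the mean and SD are the values found earlier for the quiz scores.

```python
def chebyshev_bound(k):
    """Minimum fraction of any data set within k SD of the mean."""
    return 1 - 1 / k ** 2

scores = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5, 10.5, 7, 11.5, 10, 10.5]
mu, sigma = 9.72, 3.0579    # population mean and SD of the quiz scores

k = 2
low, high = mu - k * sigma, mu + k * sigma
inside = sum(low <= x <= high for x in scores)   # count scores within 2 SD
fraction_inside = inside / len(scores)

print(f"guaranteed at least {chebyshev_bound(k):.0%}")
print(f"actually {inside} of {len(scores)} = {fraction_inside:.0%}")
```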
The symbol is P followed by a number. For example, P35 or P_{35} denotes the 35th percentile, the member of the data set that is greater than or equal to 35% of the data.
Percentiles are most often used in measures of human development, like your child’s performance on standardized tests, or an infant’s length or weight.
Example 14: Your daughter takes a standardized reading test, and the report says that she is in the 85th percentile for her grade. Does this make you happy or sad? Solution: 85% of her grade read as well as she does, or less well; only 15% read better than she does. Presumably this makes you happy.
Example 15: Consider the data set 1, 4, 7, 8, 10, 13, 13, 22, 25, 28. (To find percentiles, you have to put the data set in order.)
(a) What is the percentile rank of the number 13? Solution: There are ten numbers in the data set, and seven of those are ≤13. Seven out of ten is 70%, so the percentile rank of 13 is 70, or “13 is at the 70th percentile”, or P70 = 13.
(b) Find P60 for this data set. Solution: What number is greater than or equal to 60% of the numbers in the data set? Counting up six numbers from the beginning, you find … 13 again. So 13 is both P60 and P70.
(Anomalies like this are usual when you have small data sets. It really doesn’t make sense to talk about percentiles unless you have a fairly large data set, typically a population like all third graders or all six-week-old infants.)
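Both parts of Example 15 can be sketched in Python. The two function names are hypothetical, and percentile assumes the definition used in the example: count up p% of the way through the sorted data and take the number you land on. Other definitions can give different answers on a data set this small.

```python
from math import ceil

def percentile_rank(data, value):
    """Percent of the data set that is less than or equal to value."""
    return 100 * sum(x <= value for x in data) / len(data)

def percentile(data, p):
    """Smallest member that is greater than or equal to p% of the data."""
    ordered = sorted(data)
    k = max(1, ceil(p * len(ordered) / 100))   # how many numbers to count up
    return ordered[k - 1]

data = [1, 4, 7, 8, 10, 13, 13, 22, 25, 28]

print(percentile_rank(data, 13))                 # part (a)
print(percentile(data, 60), percentile(data, 70))   # part (b)
```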
The different definitions can give very different answers for small data sets. Nobody worries too much about this, because in practice you seldom compute percentiles against small data sets. (What does “18th percentile” mean in a set of only 12 numbers?) All the definitions give pretty much the same answer for larger data sets.
David Lane’s Percentiles (2010) [see “Sources Used” at end of book] gives three definitions of percentile and shows what difference they make. His Definition 2 is the one I use in this book.
To find quartiles by hand, put the data set in order and find the median. If you have an odd number of data points, strike out the median. Q1 is the median of the lower half, and Q3 is the median of the upper half.
One fourth is 25% and three fourths is 75%, so Q1 = P25 and Q3 = P75. (I chose a definition of percentiles that makes this happen. Some authors use different definitions, which may give slightly different results.)
What, no Q2? There is a Q2, but two quarters is one half, or 50%, so the second quartile is better known as the median: 50% of the data are less than or equal to the 50th %ile, alias Q2, alias the median.
The quartiles and the median divide the data set into four equal parts. We sometimes use the word quartile in a way that reflects this: the “bottom quartile” means the part of the data set that is below Q1, and the “upper quartile” or “top quartile” means the part of the data set that is above Q3.
Q1 and Q3 are part of the five-number summary (later in this chapter). From Measures of Spread, you already know that they’re used to find the interquartile range, and later in this chapter you’ll use the IQR to make a box-whisker plot.
Just like percentiles, quartiles are defined slightly differently by different authors. Dr. Math gives a nice, clear rundown of different ways of computing quartiles in Defining Quartiles in The Math Forum (2002) [see “Sources Used” at end of book]. I follow Moore and McCabe’s method, which is also used by your TI-83 or TI-84.
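The by-hand quartile method described above (find the median, strike it out if the count is odd, then take the median of each half) can be sketched in Python like this. The function name quartiles is my own, and the sketch assumes at least two data points; it deliberately avoids library quantile functions, whose default definitions may differ.

```python
from statistics import median

def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves of the data."""
    ordered = sorted(data)
    half = len(ordered) // 2                  # size of each half
    lower = ordered[:half]                    # points below the median position
    upper = ordered[len(ordered) - half:]     # points above it
    # With an odd count, the middle point lands in neither half.
    return median(lower), median(upper)

quiz = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5, 10.5, 7, 11.5, 10, 10.5]
example_9 = [1, 2, 3, 3, 3, 4, 5, 8, 11, 11, 15, 23]

q1, q3 = quartiles(quiz)
print(q1, q3, q3 - q1)          # Q1, Q3, and the IQR for the quiz scores
print(quartiles(example_9))
```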
You’ll use z-scores more than any other measure of position. (Remember that every measure of position measures the position of one data point within the sample or population that it is part of.)
How do you find out how many SDs a number is above or below the mean of its data set? You subtract the mean, and then divide the result by the SD.
z-score within a sample: z = (x − x̅)/s
z-score within a population: z = (x − μ)/σ
Either way, it’s z = (value − mean)/SD.
When you compute a z-score, the top and bottom of the fraction are both in the same units as the original data, and therefore the z-score itself has no units. z-scores are pure numbers.
What good are z-scores? You’ll use them in inferential statistics, starting in Chapter 9, but you can also use them in descriptive statistics.
For one thing, a z-score gives you economy in language. Instead of saying “at least 75% of the data in any distribution must lie between two standard deviations below the mean and two standard deviations above the mean”, you can say “at least 75% of the data lie between z = ±2.”
A z-score helps you determine whether a measurement is unusual. For instance, how good is an SAT verbal score of 300? Scores on the SAT verbal are ND with mean of 500 and SD of 100, so z = −2. The Empirical Rule tells you only 2½% of students score that low or lower.
And z-scores are also good for comparing apples and oranges, as the next example shows.
Example 16: You have two candidates for an entry-level position in your restaurant kitchen. Both have been to chef school, but different schools, and neither one has any experience. Chris presents you with a final exam score of 86, and Fran with a final exam score of 67. Which one do you hire?
At first glance, you’d go with the one who has the higher score. But wait! Maybe Fran with the 67 is actually better, and just went to a tougher school. So you ask about the average scores at the two schools. Chris’s school had a mean score of 76, and Fran’s school had a mean score of 59. Assuming that the students at the two schools had equal innate ability, Fran went to a tougher school than Chris.
Chris scored 10 points above the school average, while Fran scored only 8 points above the school average. Now do you hire Chris? Not yet! Maybe there was more variability in Chris’s class, so 10 points above the average is no big deal, but there was less variability in Fran’s, so 8 points above the mean is a big deal. So you dig further and find that the standard deviations of the two classes were 8 and 4. At this point, you make a table:
Chris | Fran | |
---|---|---|
Candidate’s score | 86 | 67 |
School mean | 76 | 59 |
School SD | 8 | 4 |
z-score | (86−76)/8 = 1.25 | (67−59)/4 = 2.00 |
The z-scores tell you that Fran stands higher in Fran’s class than Chris stands in Chris’s class. Assuming that the two classes as a whole were of equal ability, Fran is the stronger candidate.
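The arithmetic in the table is easy to check with a few lines of Python. This is only a sketch; `z_score` is a made-up helper name, and the numbers are the ones from Example 16 and the SAT illustration above.

```python
def z_score(x, mean, sd):
    """How many SDs is x above (+) or below (-) the mean?"""
    return (x - mean) / sd

chris = z_score(86, 76, 8)      # Chris's exam score within Chris's school
fran = z_score(67, 59, 4)       # Fran's exam score within Fran's school
sat = z_score(300, 500, 100)    # SAT verbal score of 300

print(chris, fran, sat)   # 1.25 2.0 -2.0
```

Because each z-score is a pure number, you can compare Chris’s 1.25 directly with Fran’s 2.00 even though the two exams were different.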
Definition: The five-number summary of a data set is the minimum value, Q1, median, Q3, and maximum value (in order).
The five-number summary combines measures of center (the median) and spread (the interquartile range and the range). A plot of the five-number summary, called a box-whisker diagram (below), shows you the shape of the data set.
On the TI-83 or TI-84, the five-number summary is the second output screen from 1-VarStats. Caution! Remember that the second screen is meaningful only for a simple list of numbers or an ungrouped distribution, not for a grouped distribution. To produce a five-number summary, you need all the original data points.
Example 17: Here is the second output screen from the quiz scores earlier in this chapter. The five-number summary is 2.5, 8, 10.5, 11.5, 15.
The median is 10.5, meaning that half the students scored 10.5 or below and half scored 10.5 or above.
The interquartile range is Q3−Q1 = 11.5−8 = 3.5. Half of the students scored between 8 and 11.5.
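If you don’t have the calculator handy, a short Python sketch can produce the same five numbers. It assumes the median-of-halves definition of quartiles used in this book and at least two data points; `five_number_summary` is an illustrative name, not a built-in function.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using median-of-halves quartiles."""
    xs = sorted(data)
    half = len(xs) // 2   # the middle point is left out of both halves when n is odd
    return (xs[0],
            median(xs[:half]),
            median(xs),
            median(xs[len(xs) - half:]),
            xs[-1])

scores = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5, 10.5, 7, 11.5, 10, 10.5]
summary = five_number_summary(scores)
iqr = summary[3] - summary[1]   # Q3 minus Q1

print(summary)   # (2.5, 8, 10.5, 11.5, 15)
print(iqr)       # 3.5
```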
An outlier is a data value that is well separated from most of the data.
Conventionally, the values Q1−1.5×IQR and Q3+1.5×IQR (first quartile minus 1½ times interquartile range, and third quartile plus 1½ times interquartile range) are called fences, and any data points outside the fences are considered outliers.
Example 18: Here again are the quiz scores from earlier in this chapter:
10.5 13.5 8 12 11.3 9 9.5 5 15 2.5 10.5 7 11.5 10 10.5
Find the outliers, if any.
The five-number summary, above, gave you the quartiles: Q1 = 8 and Q3 = 11.5. The interquartile range is 11.5−8 = 3.5, and 1.5 times that is 5.25. The fences are 8−5.25 = 2.75 and 11.5+5.25 = 16.75. All the data points but one lie within the fences; only 2.5 is outside. Therefore 2.5 is the only outlier in this data set.
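The fence computation in Example 18 can be sketched in Python as follows. The helper name `find_outliers` is hypothetical, and the quartiles are again found by the median-of-halves method. A point exactly on a fence is not counted as an outlier here, because the text says outliers lie outside the fences.

```python
from statistics import median

def find_outliers(data):
    """Return (lower fence, upper fence, list of outliers) by the 1.5 x IQR rule."""
    xs = sorted(data)
    half = len(xs) // 2
    q1 = median(xs[:half])
    q3 = median(xs[len(xs) - half:])
    step = 1.5 * (q3 - q1)                  # 1.5 times the interquartile range
    low_fence, high_fence = q1 - step, q3 + step
    outliers = [x for x in xs if x < low_fence or x > high_fence]
    return low_fence, high_fence, outliers

scores = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5, 10.5, 7, 11.5, 10, 10.5]
low_fence, high_fence, outliers = find_outliers(scores)

print(low_fence, high_fence)   # 2.75 16.75
print(outliers)                # [2.5]
```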
You can find outliers more easily by using your TI-83 or TI-84; see below.
Why do you care about outliers? First off, an outlier might be a mistake. You should always check all your data carefully, but check your outliers extra carefully.
But if it’s not a mistake, an outlier may be the most interesting part of your data set. Always ask yourself what an outlier may be trying to tell you. For example, does this quiz score represent a student who is trying but needs some extra help, or one who simply didn’t prepare for the quiz?
What do you do with outliers? One thing you definitely don’t do: Don’t just throw outliers away. That can really give a false picture of the situation.
But suppose you have to make some policy decision based on your analysis, or run a hypothesis test (Chapters 10 and 11) and announce whether some claim is true or false?
One way is to do your analysis twice, once with the outliers and once without, and present your results in a two-column table. Anyone who looks at it can judge how much difference the outliers make. If you’re lucky, the two columns are not very different, and whatever decision must be made can be made with confidence.
But maybe the two columns are so different that including or excluding the outliers leads to different decisions or actions. In that case, you may need to start over with a larger sample, change your data collection protocol, or call in a professional statistician.
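Here is one way the two-column comparison might look in Python. This is only a sketch. The choice of statistics (mean, median, and sample SD) is my assumption about what you would want to compare, and `with_and_without` is a made-up name. You pass in the outliers you have already identified.

```python
from statistics import mean, median, stdev

def with_and_without(data, outliers):
    """For each statistic, return (value with all data, value with outliers removed)."""
    trimmed = [x for x in data if x not in outliers]
    rows = {}
    for name, stat in (("mean", mean), ("median", median), ("SD", stdev)):
        rows[name] = (stat(data), stat(trimmed))
    return rows

scores = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5, 10.5, 7, 11.5, 10, 10.5]
rows = with_and_without(scores, outliers=[2.5])

# Print a small two-column table
print(f"{'':<8}{'all data':>10}{'no outliers':>13}")
for name, (all_pts, no_out) in rows.items():
    print(f"{name:<8}{all_pts:>10.2f}{no_out:>13.2f}")
```

For the quiz scores the median is the same in both columns, while the mean and SD shift somewhat. That is the kind of difference a reader of your table would want to see.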
For more on handling outliers, see Outliers (Simon 2000d) [see “Sources Used” at end of book].
The five-number summary packs a lot of information, but it’s usually easier to grasp a summary through a picture if possible. A graph of the five-number summary is called a boxplot or box-whisker diagram.
The box-whisker diagram was invented by John Tukey in 1970.
A box-whisker diagram has a horizontal axis, which is the number line of the data, and the number line need not start at zero. Either the axis or the chart as a whole needs a title, but there’s usually no need for a title on both. There is no vertical axis.
For the graph itself, first identify any outliers and mark them as squares or crosses. Then draw a box with vertical lines at Q1, the median, and Q3. Lastly, draw whiskers from Q3 to the greatest value in the data set that isn’t an outlier, and from Q1 to the smallest value in the data set that isn’t an outlier.
Example 19: Let’s look at a box-whisker plot of those same quiz scores, which were
10.5 13.5 8 12 11.3 9 9.5 5 15 2.5 10.5 7 11.5 10 10.5
The five-number summary is reproduced at right. You recall from the previous section that there is one outlier, 2.5, so the smallest number in the data set that isn’t an outlier is 5.
Here’s a plot that I made with StatTools from Palisade Corporation:
The box-whisker plot is almost as good as a histogram for showing you the shape of a distribution. If one whisker is longer than the other, and especially if there are outliers on the same side as the long whisker, the distribution is skewed in that direction. If the whiskers are about the same length and there are no outliers, but one side of the box is longer than the other, that usually indicates skew in that direction as well.
Example 20: In the boxplot of quiz scores, just above, you see an outlier on the left side, and the left side of the box is longer than the right. That indicates that the distribution is left skewed.
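To see where each piece of the plot comes from, here is a Python sketch that computes the parts of a box-whisker diagram without drawing it. The function `boxplot_parts` and its dictionary keys are illustrative names only. The quartiles use the median-of-halves method, and the sketch assumes that at least one data point lies within the fences.

```python
from statistics import median

def boxplot_parts(data):
    """Compute the box, whisker ends, and outliers for a box-whisker diagram."""
    xs = sorted(data)
    half = len(xs) // 2
    q1, med, q3 = median(xs[:half]), median(xs), median(xs[len(xs) - half:])
    step = 1.5 * (q3 - q1)
    inside = [x for x in xs if q1 - step <= x <= q3 + step]   # points within the fences
    return {
        "outliers": [x for x in xs if x < q1 - step or x > q3 + step],
        "whiskers": (inside[0], inside[-1]),   # extreme values that are not outliers
        "box": (q1, med, q3),
    }

scores = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5, 10.5, 7, 11.5, 10, 10.5]
parts = boxplot_parts(scores)
q1, med, q3 = parts["box"]

print(parts["outliers"])       # [2.5]
print(parts["whiskers"])       # (5, 15)
print(med - q1, q3 - med)      # 2.5 1.0  (left and right sides of the box)
```

The left side of the box (2.5) is longer than the right side (1.0), and the only outlier is on the left, which matches the reading of left skew in Example 20.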
You can use your TI-83 or TI-84 to make a box-whisker plot. The calculator comes with that ability — see Box-Whisker Plots on TI-83/84 — but it’s easier to use MATH200A Program part 2. See Getting the Program for instructions on getting the program into your calculator.
(If you have a TI-89, see Box-Whisker Plots on TI-89.)
To make a box-whisker plot with the program, begin by entering the numbers into a statistics list, such as L1. (If you have an ungrouped frequency distribution, put the numbers in one list and the frequencies in a second list. You need the original data for a boxplot, so you can’t make a boxplot of a grouped frequency distribution.)
Now press [PRGM]. If you can see MATH200A in the list, press its menu number; otherwise, use the [▼] or [▲] key to get to MATH200A, and press [ENTER].
With the program name on your home screen, press [ENTER] (again) to run the program, and yet again to dismiss the title screen. You’ll then see a menu. Press [2] for box-whisker plot.
The program asks whether you have one, two, or three samples. Select 1, since that’s what you have.
The program wants to know whether you have a plain list of numbers or an ungrouped frequency distribution. Since you have a plain list, choose 1.
The program needs to know which list holds the numbers to be plotted.
Finally, the program presents the box-whisker plot.
When you have a box-whisker plot on your screen, whether you used MATH200A part 2 or the calculator’s native commands, if you see any outliers press [TRACE] and then [◄] or [►] to find which data points are outliers.
(For the TI-89, see Box-Whisker Plots on TI-89. If you prefer to use Excel to find outliers, see Normality Check and Finding Outliers in Excel.)
After pressing the [TRACE] key, you can get the five-number summary by pressing [◄] or [►] repeatedly. If there are outliers at the left, use the lowest one for the minimum (first number in the five-number summary); if there are outliers at the right, use the highest one for the maximum (last number in the five-number summary).
Overview: With numeric data, the goal of descriptive stats is to show shape, center, spread, and outliers.
(The online book has live links to all of these.)
Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.
Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.
Ages | Frequency |
---|---|
20 – 29 | 34 |
30 – 39 | 58 |
40 – 49 | 76 |
50 – 59 | 187 |
60 – 69 | 254 |
70 – 79 | 241 |
80 – 89 | 147 |
The grouped frequency distribution at right is the ages reported by a sample of Roman Catholic nuns, from Johnson and Kuby (2004, 67) [see “Sources Used” at end of book].
(a) Approximate the mean and SD of the ages of these nuns, to two decimal places, and find the sample size.
(b) Explain why a boxplot of this distribution is a bad idea.
Course | Credits | Grade |
---|---|---|
Statistics | 3 | A |
Calculus | 4 | B+ |
Microsoft Word | 1 | C− |
Microbiology | 3 | B− |
English Comp | 3 | C |
Commuting Distances in km | ||||
---|---|---|---|---|
5 | 15 | 23 | 12 | 9 |
12 | 22 | 26 | 31 | 21 |
11 | 19 | 16 | 45 | 12 |
8 | 26 | 18 | 17 | 1 |
16 | 24 | 15 | 20 | 17 |
Maria took a traditional IQ test and scored 129. On that test, the mean is 100 and the SD is 15.
From the test scores, who is more intelligent? Explain.
Test Scores | Frequencies, f | |
---|---|---|
470.0–479.9 | 15 | |
480.0–489.9 | 22 | |
490.0–499.9 | 29 | |
500.0–509.9 | 50 | |
510.0–519.9 | 38 | |
Updates and new info: https://BrownMath.com/swt/