
# Stats without Tears: 7. Normal Distributions

Updated 1 Aug 2019

Summary: The normal distribution (ND) is important for two reasons. First, many natural and artificial processes are ND. You’ll look at some of those in this chapter. Second, any process can be treated as a ND through sampling. That will be the subject of Chapter 8, and it’s also the foundation of the inferential statistics you’ll do in Chapters 9 through 11.

## 7A.  Continuous Random Variables

You met random variables back in Chapter 6. Any random variable has a single numerical value, determined by chance, for each outcome of a procedure. Discrete random variables are limited to specified values, usually whole numbers. But a continuous random variable can take any value at all, within some interval or across all the real numbers.

Just as discrete probability models are used to model discrete variables, continuous probability models are used to model continuous variables. Of course, because a continuous random variable has infinitely many possible values, you can’t make a table of values and probabilities as you could do for a discrete distribution. Instead, either there’s an equation, or just a density curve (below).

A probability model is often called a distribution, so you can say that a variable “is normally distributed” (ND), that it “is a normal distribution” (also ND), or that it “follows a normal probability model”.

There are lots of specialized continuous distributions, but the normal distribution is most important by a wide margin. Many, many real-life processes follow the normal model, and the ND is also the key to most of our work in inferential statistics.

This section will give you some concepts that are common to all continuous distributions, and the rest of the chapter will talk about special properties of the normal distribution and applications. In Chapter 8, you’ll apply the normal distribution to get a handle on the variation from one sample to the next.

### 7A1.  Density Curves

In Chapter 2, you learned to graph continuous data by grouping the data in classes and making a histogram, like the one below left. This is wait times in a fast-food drive-through, with time in minutes — not whole minutes, which would make a discrete distribution, but minutes and fractional minutes.

Any sample you might take has a finite number of data points, so you set up classes, place the data points in the classes, and then draw a histogram. The height of each bar is proportional to the frequency or relative frequency of that class.

But when you come to consider all the possible values of a continuous variable, you have an infinite number of data points. If you tried to assign them to classes, it would take you forever, literally! Instead, you draw a smooth curve, called a density curve, to show the possible values and how likely they are to occur. An example is shown above right.

The density curve is a picture of a continuous probability model. It doesn’t just represent the data in a particular sample, but all possible data for that variable — along with the probabilities of their occurrence, as you’ll see next.

### 7A2.  Probability and Continuous Distributions

Up to now, the height of a bar in a histogram has been the number of data points in that class, or the relative frequency of that class. But how do you interpret the height of a density curve?

Answer: you don’t! The height of the curve above any particular point on the x axis just doesn’t lend itself to a simple interpretation. You might think it would be the probability of that value occurring. But with infinitely many possible values, “what’s the likelihood of a wait time of exactly 4 minutes?” just isn’t a meaningful question, because what about 3.99997 minutes or 4.002 minutes?

#### Area = Probability

What is meaningful is the probability within an interval, which equals the area under the curve within that interval. For example, in this illustration, the probability of a wait time of 6.4 to 9.5 minutes is 29.4%. In symbols,

P(6.4 ≤ x ≤ 9.5) = 29.4%

or

P(6.4 < x < 9.5) = 29.4%

That’s right — the probability is the same whether you include or exclude the endpoints of the interval.

Okay, I lied. The height of the curve is meaningful, but only if you’ve had some calculus. The curve is the graph of a probability density function or pdf. The integral of that curve from a to b is the area between x=a and x=b and is the probability that the random variable will have a value between a and b.

This explains why the probability is the same whether you include or exclude either endpoint of the interval. The difference is the area of a “rectangle” whose height is the height of the density curve and whose width is the distance from a to a — which is zero. Thus the area of the “rectangle” is zero, and the probability of the random variable taking any particular value, exactly, is zero.

Since area equals probability, and total probability must be 1, total area must be 1. Every pdf — the height of every density curve — is scaled so that the integral from −∞ to +∞ is 1.
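If you’d like to see “area = probability” in numbers, here is a minimal Python sketch. It assumes a standard normal model purely for illustration (it is not the wait-time curve pictured above), and the helper `area_under` is a made-up name that adds up thin trapezoids under the curve.

```python
from statistics import NormalDist

def area_under(pdf, a, b, steps=10_000):
    """Approximate the area under pdf between a and b with the trapezoid rule."""
    if a == b:
        return 0.0  # zero width means zero area, whatever the height
    width = (b - a) / steps
    total = 0.5 * (pdf(a) + pdf(b))
    for i in range(1, steps):
        total += pdf(a + i * width)
    return total * width

# ASSUMPTION: an illustrative normal model, not the drive-through data.
model = NormalDist(mu=0, sigma=1)

whole = area_under(model.pdf, -10, 10)   # stands in for minus infinity to plus infinity
piece = area_under(model.pdf, 0.5, 1.5)  # probability of landing in an interval
point = area_under(model.pdf, 1.0, 1.0)  # probability of one exact value

print(round(whole, 6))                                  # total area, very close to 1
print(round(piece, 4), round(model.cdf(1.5) - model.cdf(0.5), 4))
print(point)                                            # exactly zero
```

The interval area agrees with the difference of the two built-in `cdf` values, and the “interval” from 1.0 to 1.0 has no area at all, which is why including or excluding an endpoint changes nothing.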

You can also have the probability for an interval with one boundary, < or ≤ some value like the picture at right, or > or ≥ some value. For example, 3.33 minutes is about 3 minutes and 20 seconds, so the probability of waiting up to 3 minutes and 20 seconds is 20.6%: P(x ≤ 3.33) = 20.6%.

The total area under any density curve equals the probability that the random variable will take any one of its possible values, which of course is 1, or 100%. So you can use the complement to say that the probability of waiting 3 minutes and 20 seconds or more (or, more than 3 minutes and 20 seconds) is 100% − 20.6% = 79.4%.

#### Two Interpretations of Probability

You remember from Interpreting Probability Statements in Chapter 5 that every probability can be interpreted as a probability of one or a proportion of all. For example, P(x > 3.33) = 79.4% can equally well be interpreted in two ways:

• Probability of one: “Any randomly selected person has a 79.4% chance of waiting more than 3 minutes and 20 seconds.”
• Proportion of all: “79.4% of people will wait more than 3 minutes and 20 seconds.”

Which interpretation you use in a given situation depends on what seems simplest and most natural in the situation. Here, the “proportion of all” interpretation seems simpler. But you’re always free to switch to the other interpretation if it helps you in thinking about a situation.

Area = Probability of One = Proportion of All

## 7B.  The Normal Model

Why study the normal distribution?

First, it’s useful on its own. Lots and lots of real-life distributions match the normal model: body temperature or blood pressure of healthy people, scores on most standardized tests, commute times on a given route, lifetimes of batteries or light bulbs, heights of men or women, weights of apples of a particular variety, measurement errors (in many situations), and on and on.

Second, through sampling, even non-ND populations follow a normal model. You’ll use this model in inferential statistics to make statements about a whole population based on just one sample. Look forward to learning this neat trick in Chapter 8.

Why is the ND so common? In real life, very few events have just one cause; most things are the result of many factors operating independently. It turns out that if you take a lot of independent random variables and add them up, their sum is approximately ND, and the approximation gets better as you add more of them. For example, your IQ score results from multiple genetic factors, countless occurrences in your education and your family life, even transient factors like how well you slept the night before the test. Most of these are independent of each other, so the result of adding them is close to a ND.
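You can watch this happen in a small simulation. The sketch below is only an illustration under stated assumptions: each “factor” is a uniform random number between 0 and 1 (definitely not normal), twelve of them are added to make one total, and the seed is fixed so the run is repeatable. None of these choices comes from real data.

```python
import random
from statistics import mean, pstdev

random.seed(7)  # fixed seed so the run is repeatable

def one_total(parts=12):
    """Add up several independent factors, each uniform on 0..1 (not normal)."""
    return sum(random.random() for _ in range(parts))

totals = [one_total() for _ in range(20_000)]
m, s = mean(totals), pstdev(totals)

# Share of totals within 1 and 2 standard deviations of their mean.
within_1sd = sum(1 for t in totals if abs(t - m) <= s) / len(totals)
within_2sd = sum(1 for t in totals if abs(t - m) <= 2 * s) / len(totals)

print(round(within_1sd, 3), round(within_2sd, 3))
```

The two proportions should land near 68% and 95%, which is what a normal model predicts, even though no single factor looks anything like a bell curve.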

### 7B1.  Properties of the Normal Distribution

The normal distribution (ND) has the properties of other continuous distributions as listed earlier. In particular, area = probability, and the total area under the density curve is the total probability, which is 1. The ND also has these special properties:

• A ND is completely described by its mean and SD. The mean locates the center of the curve, but has no effect on the shape. For example, here are three normal curves with μ = 0, 2, and 5 and σ = 4.

The standard deviation determines the shape of the curve, but has no effect on the location. Smaller SD means the data stick closer to the mean, so the peak is higher and the tails are shorter and fatter. Larger SD means the data vary more, so they spread out from the mean: the peak is lower and the tails are longer and thinner. The second picture shows three normal curves with μ = 2 and σ = 2, 4, and 6. (The vertical scale is different from the first picture.)

• The ND is symmetric — left and right sides are mirror images of each other. This implies that the mean, median and mode are all equal.
• In principle, the tails of the normal curve run out to ±∞. However, data points more than 3 standard deviations from the mean are rare. (This is part of the Empirical Rule from Chapter 3.)
• The books all say that inflection points are one SD above and below the mean. Inflection points, if you haven’t had calculus, are where the curve transitions between concave up and concave down. The books don’t tell you that those points are far from obvious visually. Just do the best you can when making sketches.

All of this is the theoretical normal distribution. In fact, nothing in real life is perfectly ND, because nothing in real life has an infinite number of data points. When we say something is ND, we mean it’s a close match, not a perfect match. “Normally distributed” (or ND) is short for “using a normal distribution to model this data set, the calculations will come out close enough to reality.”

This is a lot like what you did in Chapter 3, when you computed the statistics of a grouped distribution. The statistics were only approximate, because of the simplification you introduced by grouping, but the approximation was good enough.

Now let’s get to some applications! There are two main categories: “forward” problems, where you have the boundaries and you have to find the area or probability, and “backward” problems, where you have a probability or area and you have to find the boundaries.

Who invented the normal distribution? Abraham de Moivre (1667–1754, French) was probably first, in 1733, though several other mathematicians contributed.

Wikipedia has a decent short summary of the history. In Jenny Kenkle’s talk on the normal distribution, slide 18 shows de Moivre’s approximation to the binomial distribution for large n, and how to get from that to the ND. And if you want an exhaustive treatment of the history, see Saul Stahl’s The Evolution of the Normal Distribution, originally published in Mathematics magazine, April 2006.

The name of Carl Friedrich Gauss is permanently coupled to the normal distribution — literally. Although Sir Francis Galton coined the term normal distribution in 1889, Karl Pearson called it the Gaussian distribution in 1905, and that’s still a recognized synonym.

In case you’re interested, the pdf, the height of the density curve above a given x, is $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$.

The cdf, the area to the left of a given x, is the integral of that, just the same as finding the area under any curve to the left of a given x: $F(x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{1}{2}\left(\frac{t-\mu}{\sigma}\right)^2}\,dt$. This integral doesn’t have a “closed form”, a finite sequence of basic algebraic operations, so it must be found by successive approximations. That’s what your calculator does with normalcdf and Excel does with NORM.DIST.
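For the curious, here is a short Python sketch of both functions. The pdf is typed straight from the standard formula for the normal density; the cdf leans on `math.erf`, the error function, which Python itself evaluates by numerical approximation. The function names and the parameters μ = 100, σ = 15 are hypothetical, chosen only to have something to plug in, and the built-in `NormalDist` serves as a cross-check.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Height of the normal density curve above x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu, sigma):
    """Area to the left of x, expressed through the error function erf."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# ASSUMPTION: illustrative parameters, not taken from any example in this chapter.
mu, sigma = 100, 15
check = NormalDist(mu, sigma)

print(normal_pdf(110, mu, sigma), check.pdf(110))   # the two heights should agree
print(normal_cdf(110, mu, sigma), check.cdf(110))   # the two areas should agree
```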

### 7B2.  From Boundaries, Find Probability

Summary: Make a sketch, estimate the probability (area), then compute it.

TI-83/84/89: Use `normalcdf(`left bound, right bound, mean, SD`)`. I’ll walk you through the TI-83/84 keystrokes in the first example below. If you have a TI-89, press [`CATALOG`] [`F3`] [plain `6` makes `N`] [`ENTER`].

Excel: In Excel 2010 or later, use (deep breath here) `=NORM.DIST(`right bound, mean, SD, `TRUE) − NORM.DIST(`left bound, mean, SD, `TRUE)`. In Excel 2007 or earlier, it’s `NORMDIST` rather than `NORM.DIST`.

Example 1: Heights of human children of a given age and sex are ND. One study found that three-year-old girls’ heights have a mean of 38.72″ and SD of 3.17″. What percentage of three-year-old girls are 35″ to 40″ tall?

Solution: Take the time to make a sketch. It doesn’t have to be beautiful, but you should make it as accurate as you reasonably can. It’s an important safeguard against making boneheaded mistakes. Here’s what should be on your sketch:

1. Draw the axis line.
2. Label the axis, x or z as appropriate. x is the symbol for real-world data points, and z is the symbol for z-scores in the standard normal distribution, below.
3. Draw a vertical line in the middle of the distribution and write the numerical value of the mean below the axis where that central line meets it. (If necessary, offset it with a tick mark, as I did.)
4. Draw a horizontal line at about the right spot and show the numerical value of the standard deviation.
5. Draw a line and show the value for each boundary.

Important: When you marked the SD, you set the scale for the sketch. Now you have to honor that and place your boundaries in proportion. For instance, in this problem the mean is 38.72 and the left boundary is 35, which is 3.72 below the mean. Your left boundary therefore needs to be a bit more than one SD (3.17) left of the mean. The right bound is 40, which is 1.28 above the mean, so your line needs to be just over a third of a SD to the right of the mean.

(Students often put in more numbers and lines, like the values of 1, 2, and 3 SD above and below the mean. That’s not wrong, but it’s usually not helpful, and it definitely clutters up the sketch.)

6. Shade the area you’re trying to find.
7. Look at your sketch and estimate the area before you pull out your calculator. That way, if you make a mistake that leads to a ridiculous answer, you’ll recognize it as ridiculous and fix it.

From my sketch, I estimate an area of 50%–60%. If it’s 45% or 70% I won’t be terribly surprised, but if it’s 5% or 99% I’ll know something is wrong.

8. Compute the area (below).

If you wish, add that number to your sketch — not below the axis, please. Write it within the shaded area, if there’s room, or as a callout to the left or right of the diagram, the way I did here.

#### Computing the Area

On a TI-83 or TI-84, press [`2nd` `VARS` makes `DISTR`] [`2`] to select `normalcdf`. Enter the left boundary (35), right boundary (40), mean (38.72), and SD (3.17).

(If you have a TI-89 or you’re using Excel, see above.)

With the “wizard” interface: Press [`ENTER`] twice, and your screen will look like the one at right.

With the classic interface: After entering the standard deviation, press [`)`] [`ENTER`] to get the answer.

You always need to show your work, so write down `normalcdf(35,40,38.72,3.17)` before you proceed to the answer. (There’s no need to write down the keystrokes you used.)

In this book, I round probabilities to four decimal places, or two decimal places if expressed as a percentage. The probability is

P(35 ≤ x ≤ 40) = 0.5365

That number matches my estimate of 50%–60%.

But the problem asked for a percentage. (Always, always, always look back at the problem and make sure you’re answering the question that was actually asked.) The answer: 53.65% of three-year-old girls are 35″ to 40″ tall.
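If you have Python handy instead of a TI or Excel, the same computation can be sketched with the standard library’s `statistics.NormalDist`. The wrapper name `normalcdf` is my own, chosen to mirror the calculator; it is not a built-in Python function.

```python
from statistics import NormalDist

def normalcdf(left, right, mean, sd):
    """Area under the normal curve between two boundaries (mimics the TI function)."""
    model = NormalDist(mean, sd)
    return model.cdf(right) - model.cdf(left)

# Example 1: heights of three-year-old girls, 35 to 40 inches.
p = normalcdf(35, 40, 38.72, 3.17)
print(f"{p:.4f}")   # as a probability, four decimal places
print(f"{p:.2%}")   # the same value as a percentage
```

Just like the Excel formula, this is “area to the left of the right boundary, minus area to the left of the left boundary”.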

Example 2: A three-year-old girl is randomly chosen. Would it be unusual (unexpected, surprising) if she’s over 45″ tall?

In Chapter 5 you learned to call a low-probability event unusual (a/k/a surprising or unexpected). The standard definition of unusual events is a probability below 0.05, so really this problem is just asking you to find the probability and compare it to 0.05.

Solution: The sketch is at right, and obviously the probability should be small. The left boundary is 45, but what’s the right boundary? The normal distribution never quite ends, so the right boundary is ∞ (infinity). TI-89s have a key for ∞, but TI-83s and TI-84s don’t and Excel doesn’t, so use 10^99 instead. (That’s 10 to the 99th power; the [`^`] key on your TI calculator is between [`CLEAR`] and [`÷`].)

P(x > 45) = `normalcdf(45,10^99,38.72,3.17)` = 0.0238

That’s rounded from 0.0237914986, and it’s in line with my estimate of “small”. Now answer the question: There’s only a 2.38% chance that a randomly selected three-year-old girl will be over 45″ tall, so that would be unusual.

Example 3: For the same population, find and interpret P(x < 33).

Solution: The sketch is at right, and again the expected probability is small. The right boundary is 33, but what’s the left boundary? You might want to use 0, since no one can be under 0″ tall, but you could make the same argument for 1″ or 5″, so that can’t be right.

To locate the left boundary, remember that you’re using a normal model to approximate the data, and the normal distribution runs right out to ±∞. Therefore, the left boundary is minus ∞ on a TI-89, or minus 10^99 on a TI-83/84. (Use the [`(-)`] key, not the [`−`] subtraction key.)

P(x < 33) = `normalcdf(-10^99,33,38.72,3.17)` = 0.0356

The proportion of three-year-old girls under 33″ tall is 0.0356 or 3.56%; or, 3.56% of three-year-old girls are under 33″ tall. The other interpretation: the chance that a randomly selected three-year-old girl is under 33″ tall is 0.0356 or 3.56%.
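In Python you don’t need the 10^99 workaround, because `math.inf` is a real infinity. Here is a hedged sketch of Examples 2 and 3; `area_between` is a hypothetical helper name, and its default boundaries are simply “no boundary on that side”.

```python
import math
from statistics import NormalDist

heights = NormalDist(38.72, 3.17)   # three-year-old girls, in inches

def area_between(model, left=-math.inf, right=math.inf):
    """Area under the model between two boundaries; omit one for a tail."""
    return model.cdf(right) - model.cdf(left)

p_over_45 = area_between(heights, left=45)    # Example 2: right-hand tail
p_under_33 = area_between(heights, right=33)  # Example 3: left-hand tail

print(round(p_over_45, 4), "unusual" if p_over_45 < 0.05 else "not unusual")
print(round(p_under_33, 4))
```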

#### Percentiles

Example 4: What’s the percentile rank of a three-year-old girl who is 33″ tall?


Solution: Long ago, in a galaxy called Numbers about Numbers, you learned the definition of percentiles. The percentile rank of a data point is the percentage of the data set that is ≤ that data point. So you need P(x ≤ 33). But that’s exactly what you computed in the previous example: 3.56%. So the 33″-tall girl is between the third and fourth percentiles for her age group.

“That was P(x < 33), and for a percentile I need P(x ≤ 33)!” I hear you yell. But those two are equal. When we talked about density curves, near the beginning of this chapter, you learned that the area and probability are the same whether you include or exclude the boundary.

And this is why it doesn’t make much difference whether you define a percentile rank in terms of < or ≤, because the probability in a continuous distribution is the same either way.

### 7B3.  From Probability, Find Boundaries

Summary: Make a sketch, estimate the value(s), then compute the value(s).

TI-83/84/89: Use `invNorm(`area to left, mean, SD`)`. I’ll walk you through the TI-83/84 keystrokes in the first example below. If you have a TI-89, press [`CATALOG`] [`F3`] [plain `9` makes `I`] [`▼` 3 times] [`ENTER`].

Excel: In Excel 2010 or later, use `=NORM.INV(`area to left, mean, SD`)`. In Excel 2007 or earlier, it’s `NORMINV` rather than `NORM.INV`.

Example 5: Blood pressure is stated as two numbers, systolic over diastolic. The World Health Organization’s MONICA Project (Kuulasmaa 1998 [see “Sources Used” at end of book]) reported these parameters for the US:

Systolic: μ = 120, σ = 15

Diastolic: μ = 75, σ = 11

Blood pressure in the population is normally distributed. The lowest 5% is considered “hypotensive”, according to Kuzma and Bohnenblust (2005, 103) [see “Sources Used” at end of book]. What systolic blood pressure would be considered hypotensive?

Solution: Always make a sketch for these problems. Your sketch is similar to the ones you made for the first group of problems, except that you use a symbol like x1 or “?” for the unknown boundary, and you write in the known area.

Always estimate your answer to guard against at least some errors. In the sketch, x1 looks like it’s not quite two SD left of the mean, so I’ll estimate a pressure of 95 to 100. (Okay, I cheated by using my calculator to make my “sketch”. But even with a real pencil-and-paper sketch, you ought to be in the right ballpark.)

Now you’re ready to calculate. TI-89 or Excel users, please see the instructions above. On your TI-83 or TI-84, press [`2nd` `VARS` makes `DISTR`] [`3`] to select `invNorm`. Enter the area to the left of the point you’re interested in (.05), the mean (120), and the SD (15).

With the “wizard” interface: Press [`ENTER`] twice, and your screen will look like the one at right.

With the classic interface: After the standard deviation, press [`)`] [`ENTER`] to get the answer.

Show your work! Write down `invNorm(.05,120,15)` before you proceed to the answer. (There’s no need to write down the keystrokes you used.)

Answer: Systolic blood pressure (first number) under 95 would be considered hypotensive.
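For a Python cross-check, `NormalDist.inv_cdf` takes an area to the left and returns the boundary, which is the same job `invNorm` does. The wrapper name below is mine, borrowed from the calculator for familiarity.

```python
from statistics import NormalDist

def invNorm(area_to_left, mean, sd):
    """Boundary that has the given area to its left (mimics the TI function)."""
    return NormalDist(mean, sd).inv_cdf(area_to_left)

# Example 5: lowest 5% of systolic blood pressure.
x1 = invNorm(0.05, 120, 15)
print(round(x1, 2))

# Round trip: the area to the left of x1 should come back as 0.05.
print(round(NormalDist(120, 15).cdf(x1), 4))
```

The round trip at the end is a cheap sanity check: going from area to boundary and back to area should return the number you started with.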

Example 6: The same source considers the top 5% “hypertensive”. What is the minimum systolic blood pressure that is hypertensive?

Solution: My “sketch” is at right. It’s mostly straightforward — the x1 boundary is between the 5% tail and the rest of the distribution.

But what’s up with the 1−0.05? The problem asks you about the upper 5%, which is the area to the right of the unknown boundary. But `invNorm` on the calculator, and `NORM.INV` in Excel, need area to left of the desired boundary. The area to the left is the probability of “not hypertensive”, and area is probability, so the area to left is 1 minus the area to right, in this case 1−0.05.

Could you just write down 0.95? Sure, that would be correct. But if the area to right was 0.1627 you’d probably make the calculator compute 1 minus that for you, so why not be consistent?

x1 = invNorm(1−.05,120,15) = 144.6728044 → 145

(That’s actually a little liberal. Several sources that I’ve seen give 140 as the threshold.)

Example 7: Kuzma and Bohnenblust describe the middle 80% as “normal”. What is that range of systolic blood pressure?

This problem wants you to find two boundaries, lower and upper. You have to convert the 80% middle into two areas to left. Here’s how. If the middle is 80%, then the two tails combined must be 100% − 80% = 20%. But the curve is symmetric, so each tail must be 20%/2 = 10%. Strictly speaking, I probably should have written that computation on the diagram, instead of just a laconic “0.1”, but it would take up a lot of space and the computation was easy enough. You’ll probably do the same; just be careful.

Once you have the areas squared away, the computation is simple enough:

x1 = invNorm(.1,120,15) = 100.7767265 → 101

x2 = invNorm(1−.1,120,15) = 139.2232735 → 139

Check: The boundaries of the middle 80% (or the middle any percent) should be equal distances from the mean. (100.7767265+139.2232735)/2 = 120, so at least it’s consistent. Answer: Systolic b.p. of 101 to 139 is considered normal.
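The “split the leftover area into two equal tails” step is easy to get wrong by hand, so here is a small Python sketch that does it for any middle percentage. `middle_range` is a hypothetical helper name, not something from a library.

```python
from statistics import NormalDist

def middle_range(middle, mean, sd):
    """Boundaries enclosing the central `middle` proportion of a normal model."""
    tail = (1 - middle) / 2            # each tail gets half of what is left over
    model = NormalDist(mean, sd)
    return model.inv_cdf(tail), model.inv_cdf(1 - tail)

# Example 7: middle 80% of systolic blood pressure.
low, high = middle_range(0.80, 120, 15)
print(round(low, 2), round(high, 2))

# Symmetry check: the midpoint of the two boundaries should be the mean.
print(round((low + high) / 2, 6))
```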

#### Percentiles Again

Example 8: What’s the 40th percentile for systolic blood pressure?

Sometimes the gods smile on us. The kth percentile is the value that is ≥ k% of the population, so k% is exactly the area to left that you need.

P40 = invNorm(.4,120,15) = 116.1997935 → 116
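Percentile rank and percentile are the forward and backward problems in disguise, so in Python they are just `cdf` and `inv_cdf` with a factor of 100. A minimal sketch, with function names of my own choosing:

```python
from statistics import NormalDist

def percentile_rank(x, model):
    """Percentage of the population at or below x (a forward problem)."""
    return 100 * model.cdf(x)

def percentile(k, model):
    """Value that is at or above k% of the population (a backward problem)."""
    return model.inv_cdf(k / 100)

systolic = NormalDist(120, 15)
p40 = percentile(40, systolic)
print(round(p40, 1))                              # 40th percentile of systolic pressure
print(round(percentile_rank(p40, systolic), 1))   # and back again to 40
```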

## 7C.  The Standard Normal Distribution

Definition: The standard normal distribution is a normal distribution with a mean of 0 and standard deviation of 1, sometimes written N(0,1).

The standard normal distribution is a picture of z-scores of any possible real-world ND — more about that later.

The standard normal distribution lets you make computations that apply to all normal models, not just a particular model. You’ll see some examples shortly, but first —

### 7C1.  “Normal” and “Standard Normal”

The main point about the standard normal distribution is that it’s a stand-in for every ND from real life. How does this work? Well, if you take any real data set and subtract the mean from every data point, the mean of the new data set is 0. And if you then divide that data set by the standard deviation (which doesn’t change when you subtract a constant from every data point), then the SD of the new-new data set is 1.

But all you did with those manipulations was replace the numbers with z-scores. Remember the formula: $z = \frac{x-\mu}{\sigma}$. The standard normal distribution is what you get when you convert any normal model to z-scores.
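Here is that two-step manipulation as a Python sketch. The data values are made up for illustration, and I use the population SD (`pstdev`) throughout; with the sample SD the idea is the same.

```python
from statistics import mean, pstdev

# ASSUMPTION: made-up data, just to have numbers to standardize.
data = [61.0, 64.5, 66.0, 68.5, 70.0, 72.5, 75.0]

m = mean(data)
s = pstdev(data)

shifted = [x - m for x in data]   # step 1: subtract the mean from every point
z = [d / s for d in shifted]      # step 2: divide every point by the SD

print(round(mean(shifted), 10), round(pstdev(shifted), 4), round(s, 4))
print(round(mean(z), 10), round(pstdev(z), 10))
```

After step 1 the mean is 0 and the SD is unchanged; after step 2 the mean is still 0 and the SD is 1, apart from tiny floating-point rounding.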

Long ago, when dinosaurs ruled the earth — okay, up through the early 1980s — a “computer” was a person who used a slide rule to make computations. (I swear I am not making this up.) There were no statistical calculators and no Excel. The only way for most people to make computations on a normal model was to look up probabilities in printed tables. But obviously a book couldn’t print tables for every normal model. So the printed tables were for the standard normal distribution. If you had boundaries and wanted the probability of the interval, you converted your real-world numbers to z-scores, looked up the probabilities in the table, and subtracted them. If you had a probability and needed a boundary, you looked up the z-score in the table and then converted it to a raw score using the mean and SD of your data set.

The need to do normal computations the hard way has gone the way of the dinosaurs, but I think this history is why many stats books still use tables to do their computations. Inertia is a powerful force in textbooks!

The pdf and cdf functions for the standard normal distribution are what you get when you set μ=0 and σ=1 in the general equations for the ND: $f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$ and $F(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^2/2}\,dt$. Again, the integral must be found by successive approximations. That’s where the tables in books come from, and it’s what your calculator does with normalcdf and Excel does with NORM.DIST.

### 7C2.  Applying the Standard Normal Distribution

I said above that the standard normal distribution lets you make statements about all normal models. What sort of statements? Well, the Empirical Rule for one.

Example 9: The Empirical Rule says that 68% of the population in a normal model lies within one SD of the mean. How good is the rule? In other words, what’s the actual proportion?

Solution: As usual, you start with a sketch. This is the standard ND, so the axis is z, not x. There’s no need to mark the mean or SD, because the z label identifies this as a standard normal distribution and therefore μ = 0 and σ = 1. Just label the boundaries.

Compute the probability the same way you’ve already learned. (Both Excel and the TIs have special procedures available for the standard normal distribution, but it’s not worth taking brain cells to learn them, when the regular procedures for the ND work just fine with N(0,1).)

P(−1 ≤ z ≤ 1) = normalcdf(−1,1,0,1) = .6826894809 → 68.27%

The Empirical Rule says 68% of the data are within z = ±1. Actually it’s about 68¼%, close enough.
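The same check is easy to repeat for two and three standard deviations. This Python sketch compares the usual rounded figures of the Empirical Rule (68%, 95%, 99.7%) with what the standard normal model gives; `within` is a hypothetical helper name.

```python
from statistics import NormalDist

std = NormalDist(0, 1)   # the standard normal distribution, N(0,1)

def within(k):
    """Proportion of a normal model within k standard deviations of the mean."""
    return std.cdf(k) - std.cdf(-k)

for k, rule in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    print(f"within {k} SD: rule says {rule}, model gives {within(k):.2%}")
```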

Example 10: How many standard deviations must you go above and below the mean to take in the middle 50% of the data in a normal model?

Solution: This is similar to finding the middle 80% of blood pressures earlier, except now you’re making a statement about all normal models, not just a particular one.

Shading the middle 50% leaves 100% − 50% = 50% in the two tails combined, so each tail is 50%/2 = 25%.

z1 = invNorm(.25,0,1) = −.6744897495 → −0.67

By symmetry, z2 must be numerically equal to z1 but have the opposite sign: z2 = 0.67.

50% of the data in any normal model are within about 2/3 of a SD of the mean. Since the bounds of the middle 50% of the data are Q1 and Q3, the IQR of any normal distribution is twice that, about one and a third standard deviations. More precisely, the IQR is 2×0.674 ≈ 1.35 times the SD.
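As a quick Python sketch of that IQR claim: find the two quartile z-scores on N(0,1), then confirm that a particular normal model (I picked μ = 120, σ = 15 only as an illustration) has an IQR that is the same multiple of its SD.

```python
from statistics import NormalDist

std = NormalDist(0, 1)
z_q1 = std.inv_cdf(0.25)       # z-score of the first quartile
z_q3 = std.inv_cdf(0.75)       # z-score of the third quartile
iqr_in_sds = z_q3 - z_q1       # IQR measured in standard deviations

print(round(z_q1, 4), round(z_q3, 4), round(iqr_in_sds, 3))

# ASSUMPTION: an illustrative model to show the multiple carries over.
model = NormalDist(120, 15)
iqr = model.inv_cdf(0.75) - model.inv_cdf(0.25)
print(round(iqr / 15, 3))      # should match iqr_in_sds
```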

### 7C3.  The z Function (Critical z)

There’s one special notation you’ll use when you compute confidence intervals in Chapter 9.

Definition: zarea (z with the area written as a subscript, as in z0.025) or z(area), also known as critical z, is the z-score that divides the standard normal distribution such that the right-hand tail has the indicated area.

This may seem a little weird, but really it’s just a recipe to specify a number. Compare with the square root of 48. That is the positive number such that, if you multiply it by itself, you get 48. Or consider π: the number that you get when you divide the circumference of a perfect circle by its diameter. Math is full of numbers that are specified as recipes. An example will make things clearer.

Example 11: Find z0.025.

Solution: The problem is diagrammed at right. Caution! 0.025 is an area, not a z-score, so you don’t write 0.025 on the number line (the z axis). z0.025 is a z-score (though you don’t know its value yet), so it goes on the number line.

Once you have your sketch, the computation is straightforward. Have area (probability), compute boundary. The area is 0.025, but it’s an area to right, and `invNorm` needs an area to left, so you subtract from 1 as usual:

z0.025 = invNorm(1−.025, 0, 1) = 1.959963986 → 1.96

Caution! You’re computing a boundary for the right-hand tail. If you get a negative number, that can’t possibly be right.

z0.025 = 1.96 makes sense, if you think about it. If you also shaded in the left-hand tail with an area of 0.025, the two tails together would total 5%, leaving 95% in the middle. The Empirical Rule says that 95% of data are within 2 SD above and below the mean, and 1.96 is approximately 2.
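Since critical z is just a recipe, it fits in a one-line Python function. This is a sketch with a name of my own choosing; the extra areas 0.05 and 0.005 are there only to show the pattern.

```python
from statistics import NormalDist

def critical_z(right_tail_area):
    """z-score that leaves the given area in the right-hand tail of N(0,1)."""
    return NormalDist(0, 1).inv_cdf(1 - right_tail_area)

for area in (0.05, 0.025, 0.005):
    print(area, round(critical_z(area), 3))

# A right-hand tail smaller than 50% always gives a positive critical z.
print(critical_z(0.025) > 0)
```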

## 7D.  Checking for Normality

How do you know whether a normal model is appropriate? How do you know whether your data are normally distributed? A histogram can rule out skewed data, or data with more than one peak.

But what if your data are unimodal and not obviously skewed? Is that enough to justify a normal model? No, it’s not. You need to perform a test called a normal probability plot. You’ll need this procedure in Chapters 8 through 11, whenever you have a small sample of numeric data.

Summary: To check whether a normal model can represent your sample, make a normal probability plot. This plots the actual data points, against the z-scores you would expect for this number of points that are ND. If the plot is close to a straight line, a normal model is appropriate; if the plot is far from a straight line, a normal model is not appropriate.

That’s the bare outline, and you’ll get a little bit more with the examples. For those who want the full theory, it’s marked optional at the end of this section.

Technology: Testing for normality can be automated partly or completely, depending on what technology you have.

### 7D1.  Checking Data Sets

Example 12: Consider these vehicle weights (in pounds):

2500, 3250, 4000, 3500, 2900, 4500, 3800, 3000, 5000, 2200

Do they fit a normal model?

Solution: Put the data in any statistics list, then press [`PRGM`], scroll down to `MATH200A`, and press [`ENTER`] twice. Select `Normality chk`.

The program makes the plot, and you can look at the points to determine whether they seem to be pretty much on a straight line. At least, that’s the theory. In practice, most data sets are a lot less clear cut than this one. It can be hard to tell whether the points fit a line, particularly if you have only a few of them. The plot takes up the whole screen, so deviations can look bigger than they really are.

Fortunately, there’s a test for whether points lie on a straight line. As you know from Chapter 4, the closer the correlation coefficient r is to 1, the closer the points are to a straight line.

The program computes r for you, and it also computes a critical value★ to help you determine if the points are close enough to a straight line. (For technical reasons, the critical value is different from the decision points of Chapter 4.) If r≥crit, it’s close enough to 1, the points are close enough to a straight line, and you can use a normal model. If r<crit, it’s too far from 1, the points are too far from a straight line, and you can’t use a normal model.

For this data set, r > crit, and therefore these vehicle weights fit the normal model.

★The “classic TI-83” (non-“Plus” model) doesn’t compute the critical value, so you have to do it yourself. See the formula in item 4 in the next section.

Example 13: Here’s a random sample of the lengths (in seconds) of tunes in my iTunes library:

```
120  219  242  134  129     105  275   76  412  268
486  199  651  291  126     210  151   98  100   92
305  231  734  468  410     313  644  117  451  375
```

Do they fit a normal model?

Solution: I entered them in a statistics list and then ran MATH200A Program part 4. The result was the plot at the right.

You can see that the plot is curved. This is reinforced by comparing r=0.9473 to crit=0.9639. r < crit. The points diverge too far from a straight line, and therefore I cannot use a normal model for the lengths of my iTunes songs.

### 7D2.  Optional:  How Normal Probability Plots Work

The basic idea isn’t too bad. You make an xy scatterplot where the x’s are the data points, sorted in ascending order, and the y’s are the expected z-scores for a normal distribution.

Why would you expect that to be a straight line? Recall the formula for a z-score: z = (x − x̄)/s. Breaking the one fraction into two, you have z = x/s − x̄/s. That’s just a linear equation, with slope 1/s and intercept −x̄/s. So an xz plot of any theoretical ND, plotting each data point’s z-score against the actual data value, would be a straight line.

Further, if your actual data points are ND, then their actual z-scores will match their expected-for-a-normal-distribution z-scores, and therefore a scatterplot of expected z-scores against actual data values will also be a straight line.
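If you’d like to see the straight-line argument in numbers, here is a minimal sketch in Python (my choice for illustration; the book itself works on the TI calculator). It takes the vehicle weights from Example 12, computes each z-score from the definition z = (x − x̄)/s, and compares it with the value from the line with slope 1/s and intercept −x̄/s. The variable names are made up for this example.

```python
from statistics import mean, stdev

# Vehicle weights (pounds) from Example 12.
weights = [2500, 3250, 4000, 3500, 2900, 4500, 3800, 3000, 5000, 2200]

x_bar = mean(weights)   # sample mean
s = stdev(weights)      # sample standard deviation

# The z-score formula rewritten as a line: z = (1/s) * x + (-x_bar/s)
slope = 1 / s
intercept = -x_bar / s

for x in sorted(weights):
    z_direct = (x - x_bar) / s        # z-score straight from the definition
    z_line = slope * x + intercept    # same point read off the line
    print(f"{x:5d}  {z_direct:8.4f}  {z_line:8.4f}")
```

The last two columns agree on every row, which is all that “z is a linear function of x” means.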

Now, in real life no data set is ever exactly a ND, so you won’t ever see a perfectly straight line. Instead, you say that the closer the points are to a straight line, the closer the data set is to normal. If the data points are too far from a straight line — if their correlation coefficient r is lower than some critical value — then you reject the idea that the data set is ND.

Okay, so you have to plot the data points against what their z-scores should be if this is a ND, and specifically for a sample of n points from a ND, where n is your sample size. This must be built up in a sequence of steps:

1. Divide the normal curve (mentally) into n regions of equal probability and take one probability from each region. For technical reasons, the probability number you use for region i is (i−.375)/(n+.25). This formula is in many textbooks, and also in Normal Probability Plots and Tests for Normality (Ryan and Joiner 1976 [see “Sources Used” at end of book]).
2. Compute the expected z-scores for those probabilities. Working with the calculator, that’s just `invNorm` of (i−.375)/(n+.25).
3. Plot those expected z-scores against the data values. This xy plot (or xz plot) has a correlation coefficient r, computed just like any other correlation coefficient.
4. Compare the r for your data set to the critical value for the size of your data set. Ryan and Joiner determined that the critical value for sample size n, at the 0.05 significance level, is 1.0063 − .1288/√n − .6118/n + 1.3505/n². To make it a little easier on the calculator I rearranged it as 1.0063 − .6118/n + 1.3505/n² − .1288/√n.
In the same paper, they gave formulas for critical values at other significance levels:

1.0071 − 0.1371/√n − 0.3682/n + 0.7780/n² at α=0.10

0.9963 − 0.0211/√n − 1.4106/n + 3.1791/n² at α=0.01
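The four steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the MATH200A program itself: the function names are hypothetical, `NormalDist().inv_cdf` from Python’s standard library stands in for the calculator’s `invNorm`, and the data are the tune lengths from Example 13.

```python
from math import sqrt
from statistics import NormalDist, mean

# Ryan-Joiner coefficients (constant, 1/sqrt(n), 1/n, 1/n^2) by significance level,
# copied from the formulas in step 4.
RJ_COEFFS = {
    0.10: (1.0071, -0.1371, -0.3682, 0.7780),
    0.05: (1.0063, -0.1288, -0.6118, 1.3505),
    0.01: (0.9963, -0.0211, -1.4106, 3.1791),
}

def expected_z_scores(n):
    """Steps 1-2: probability (i - .375)/(n + .25) for i = 1..n, then inverse normal."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

def correlation(xs, ys):
    """Step 3: ordinary correlation coefficient r of the (x, z) points."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def critical_value(n, alpha=0.05):
    """Step 4: critical value for sample size n at the chosen significance level."""
    a, b, c, d = RJ_COEFFS[alpha]
    return a + b / sqrt(n) + c / n + d / n ** 2

def normality_check(data, alpha=0.05):
    """Return (r, crit) for the normal probability plot of the data."""
    xs = sorted(data)                 # data values in ascending order
    zs = expected_z_scores(len(xs))   # already ascending, so they pair up with xs
    return correlation(xs, zs), critical_value(len(xs), alpha)

# Tune lengths (seconds) from Example 13.
tunes = [120, 219, 242, 134, 129, 105, 275, 76, 412, 268,
         486, 199, 651, 291, 126, 210, 151, 98, 100, 92,
         305, 231, 734, 468, 410, 313, 644, 117, 451, 375]

r, crit = normality_check(tunes)
print(f"r = {r:.4f}, crit = {crit:.4f}")
print("normal model OK" if r >= crit else "normal model rejected")
```

For n = 30 the critical value works out to 0.9639, matching Example 13. The r from this sketch should land close to the 0.9473 reported there, assuming the program uses the same probability formula as step 1.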

The closer the points are to a straight line, the closer the data set is to fitting a normal model. In other words, a larger r indicates a ND, and a smaller r indicates a non-ND. You can draw one of two conclusions:

• If r is less than the critical value, reject the hypothesis of normality at the 0.05 significance level and say that the data set is not ND.

(If you haven’t studied hypothesis testing yet, another way to say it is that you’re pretty sure the data set doesn’t fit the normal model, because a sample that really came from a ND would produce an r this low less than 5% of the time.)

• If r is greater than the critical value, fail to reject the hypothesis that the data set comes from a ND.

This doesn’t mean you are certain it does, merely that you can’t rule it out. Technically you don’t know either way, but practically it doesn’t matter. Remember (or you will learn later) that inferential statistics procedures like t tests are robust, meaning that they still work even if the data are moderately non-normal. But if your data were extremely non-normal, r would be less than the critical value. When r is greater than the critical value, you don’t know whether the data set comes from normal data or moderately non-normal data, but either way your inferential statistics procedures are okay.

So the bottom line is, if r > CRIT, treat the data as normal, and if r < CRIT, don’t.

The normal probability plot is just one of many possible ways to determine whether a data set fits the normal model. Another method, the D’Agostino-Pearson test, uses numerical measures of the shape of a data set called skewness and kurtosis to test for normality. For details, see Assessing Normality in Measures of Shape: Skewness and Kurtosis.

## What Have You Learned?

Key ideas:

(The online book has live links to all of these.)

Study aids:

## Exercises for Chapter 7

Write out your solutions to these exercises, making a sketch and showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.

Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

You’ll need this information for several of the problems:

• US men’s heights: ND with μ = 69.3″, σ = 2.92″
• US women’s heights: ND with μ = 64.1″, σ = 2.75″

Source: “Is Human Height Bimodal?” (Schilling 2002 [see “Sources Used” at end of book]).

1 Suppose that variable X is Chantal’s commute time between home and school, in minutes. Give two interpretations of the statement P(x < 17) = 0.0900.
2 A male co-worker is “six foot four and a half” — 76.5″ tall. How unusual is that? (Give two interpretations of your number.)
3 What proportion of women are 64″ to 67″ tall?
4 What heights for men would be considered unusual (less than 5% likely)? Hint: Your answer will be in the form “under ____ inches or over ____ inches”.
5 To enter the Pennsyltuckey Police Academy, you have to be at or above the 15th percentile in height. How tall is that, for a man?
6 (a) Find the 25th and 75th %iles for women’s heights.
(b) Find the interquartile range.
(c) Example 10 found that, in a normal distribution, the interquartile range equals 1.35 standard deviations. Does your computed IQR match that prediction?
7 Determine whether this sample of diastolic blood pressures fits the normal model:

78  66  98  90  74  70  70  76  72  86  62  84  66  70  68

8 Scores on the math SAT are ND with a mean of 500 and standard deviation of 100. What percentile is represented by a score of 735?
9 To join Mensa, you must be in the top 2% of the population on a recognized intelligence test. Mensa accepts the SAT as a qualifying test for membership. The mean on the combined three parts is 1500 and the SD is 300. What’s the minimum combined score to qualify you for Mensa?
10 Find z0.01.
11 For men’s heights, find P(x < 60″) and write two interpretations.
12 Test scores are supposed to be ND, but this is questionable on small tests. Here are scores from a recent quiz; do they fit the normal model?

0.3  8.8  11.5  12  12.3  12.5  13  13.5  14.8

13 A small shop decided to stock formal wear for men and women in the middle 90% of height. How tall must men and women be to shop there?

## What’s New?

• 1 Aug 2019: Added links to several treatments of the history of the ND.
• 19 Dec 2016: Added critical values for the Ryan-Joiner test at significance levels 0.10 and 0.01.
• 30 Mar 2016: For students with the “TI-83 classic”, added a pointer to the formula for the critical number for normality check.
• (intervening changes suppressed)
• 19 June 2013: New document.