# Stats without Tears: 4. Linked Variables

Updated 21 Aug 2023

Intro: When you get two numbers from each member of the sample (bivariate numeric data), you make a plot to look for a relationship between them. If a straight line seems like a good fit for the plotted points, we say that they follow a linear model. In this chapter, you’ll learn when to use a linear model, and how to find the best one.

## 4A.  Mathematical Models

The chapter intro talks about points following a “linear model”. But what is a linear model, and what does it mean to follow one? Well, since a linear model is one kind of mathematical model, let’s talk a little bit about mathematical models.

You know what a model is in general, right? A copy of the original, usually smaller and with unimportant details left out. Think of model airplanes, or architect’s models of buildings.

A mathematical model is like that. Real Life is Complicated,™ and mathematical models help us manage those complications.

Definition: A mathematical model is a mathematical description of something in the real world. An object or process or data set follows a model if the calculations you do with the model match reality closely enough to be useful.

You’ve already met one model in Chapter 3: the grouped frequency distribution. Instead of dealing with all the data points, you do calculations using the midpoint of each class. That gives you approximate mean and SD, but the approximation is close enough to be useful.

The MathIsFun site has a nice example of modeling the space inside a cardboard box, going beyond the h×w×l formula; see Mathematical Models.

You’ll meet plenty more models in this book: probability models in Chapter 5, several discrete models in Chapter 6, and the normal model in Chapter 7.

But in this chapter we’re concerned with the linear model.

Definition: The linear model uses the linear equation y = ax+b to model the relationship between two numeric variables x and y. In any particular model, a and b are constants.

Because the graph of y = ax+b is a straight line, we can also call it a straight-line model, and we say that x and y have a straight-line relationship in the model.

The linear model is a good one if it describes the data well enough to let you make useful calculations.

## 4B.  Scatterplot, Correlation, and Regression on TI-83/84

Summary: When you have a set of (x,y) data points and want to find the best equation to describe them, you are performing a regression. You will learn how to find the strength of the association between your two variables (correlation coefficient), and how to find the line of best fit (least squares regression line).

Usually you have some idea that your x variable can help predict your y variable, so you call x the explanatory variable and y the response variable. (Other names are independent variable and dependent variable.)

### Step 0. Setup

• Set floating point mode, if you haven’t already. [`MODE`] [`▼`] [`ENTER`]
• Go to the home screen. [`2nd` `MODE` makes `QUIT`] [`CLEAR`]
• Turn on diagnostics with the `DiagnosticOn` command. [`2nd` `0` makes `CATALOG`] [`x-1`] Don’t press the [`ALPHA`] key, because the `CATALOG` command has already put the calculator in alpha mode. Scroll down to `DiagnosticOn` and press [`ENTER`] twice.

The calculator will remember these settings when you turn it off: next time you can start with Step 1.

### Step 1. Make the Scatterplot

Before you even run a regression, you should first plot the points and see whether they seem to lie along a straight line. If the distribution is obviously not a straight line, don’t do a linear regression. (Some other form of regression might still be appropriate, but that is outside the scope of this course.)

Let’s use this example from Sullivan (2011, 179) [see “Sources Used” at end of book]: the distance a golf ball travels versus the speed with which the club head hit it.

| Club-head speed, mph (x) | Distance, yards (y) |
|---|---|
| 100 | 257 |
| 102 | 264 |
| 103 | 274 |
| 101 | 266 |
| 105 | 277 |
| 100 | 263 |
| 99 | 258 |
| 105 | 275 |

• Turn off other plots. [`Y=`] Cursor to each highlighted = sign or Plot number and press [`ENTER`] to deactivate.
• Set the format screen. Press [`2nd` `ZOOM` makes `FORMAT`]. Just select everything in the left column.
• Enter the numbers in two statistics lists. [`STAT`] [`1`] selects the list-edit screen. Cursor onto the label `L1` at top of first column, then [`CLEAR`] [`ENTER`] erases the list. Enter the x values. Cursor onto the label `L2` at top of second column, then [`CLEAR`] [`ENTER`] erases the list. Enter the y values.
• Set up the scatterplot. [`2nd` `Y=` makes `STAT PLOT`] [`1`] [`ENTER`] turns Plot 1 on. [`▼`] [`ENTER`] selects scatterplot. [`▼`] [`2nd` `1` makes `L1`] ties list 1 to the x axis. [`▼`] [`2nd` `2` makes `L2`] ties list 2 to the y axis. (Leave the square as the selected mark for plotting.)
• Plot the points. [`ZOOM`] [`9`] automatically adjusts the window frame to fit the data.
• Check your data entry by tracing the points. [`TRACE`] shows you the first (x,y) pair, and then [`►`] shows you the others. They’re shown in the order you entered them, not necessarily from left to right.

I have the grid turned on in some of these pictures, but earlier I told you to turn it off. That’s simplest. If you want the grid, you can turn it on, but then you’ll have to adjust the grid spacing for almost every plot. To adjust grid spacing, press [`WINDOW`], set `Xscl` and `Yscl` to appropriate values for your data, and press [`GRAPH`] to see the result.

A scatterplot on paper needs labels (numbers) and titles on both axes; the x and y axes typically won’t start at 0. Here’s the plot for this data set. (The horizontal lines aren’t needed when you plot on graph paper.)  When the same (x,y) pair occurs multiple times, plot the extra ones slightly offset. This is called jitter. In the example at the right, the point (6,6) occurs twice.

If the data points don’t seem to follow a straight line reasonably well, STOP! Your calculator will obey you if you tell it to perform a linear regression, but if the points don’t actually fit a straight line then it’s a case of “garbage in, garbage out.”

For instance, consider this example from DeVeaux, Velleman, Bock (2009, 179) [see “Sources Used” at end of book]. This is a table of recommended f/stops for various shutter speeds for a digital camera:

| Shutter speed (x) | f/stop (y) |
|---|---|
| 1/1000 | 2.8 |
| 1/500 | 4 |
| 1/250 | 5.6 |
| 1/125 | 8 |
| 1/60 | 11 |
| 1/30 | 16 |
| 1/15 | 22 |
| 1/8 | 32 |

If you try plotting these numbers on your calculator, enter the shutter speeds as fractions for accuracy: don’t convert them to decimals yourself. The calculator will show you only a few decimal places, but it maintains much greater precision internally.

You can see from the plot at right that these data don’t fit a straight line. There is a distinct bend near the left. When you have anything with a curve or bend, linear regression is wrong. You can try other forms of regression in your calculator’s menu, or you can transform the data as described in DeVeaux, Velleman, Bock (2009, ch 10) [see “Sources Used” at end of book] and other textbooks.

Tempted to just go ahead with the regression, without verifying the straight-lineness of the data? Heed this warning from Ellenberg (2014, 53) [see “Sources Used” at end of book]:

Linear regression is a marvelous tool, versatile, scalable, and … easy to execute. … And it works on any data set at all. That’s a weakness as well as a strength. You can do linear regression without thinking about whether the phenomenon you’re modeling is actually close to linear. But you shouldn’t. I said linear regression was like a screwdriver, and that’s true; but in another sense, it’s more like a table saw. If you use it without paying careful attention to what you’re doing, the results can be gruesome. [Emphasis in original.]

### Step 2. Perform the Regression

• Set up to calculate statistics. [`STAT`] [`►`] [`4`] pastes `LinReg(ax+b)` to the home screen.
• Name the lists. [`2nd` `1` makes `L1`] [`,`] [`2nd` `2` makes `L2`] defines L1 as x values and L2 as y values. If you have the “wizard” interface, leave `FreqList` blank, or press [`DEL`] if something is already filled in.
• Set up to store regression equation. [`,`] [`VARS`] [`►`] [`1`] [`1`] pastes `Y1` into the `LinReg` command.
• Show your work! Write down the whole command (`LinReg(ax+b) L1,L2,Y1` in this case), not just LinReg or LinReg(ax+b).
• Press [`ENTER`]. The calculator shows correlation and regression statistics and pastes the regression equation into `Y1`.

Your input screen should look like this, for the “wizard” and non-wizard interfaces:

Write down the slope a, the y intercept b, the coefficient of determination R², and the correlation coefficient r. (A decent rule of thumb is four decimal places for slope and intercept, and two for r and R².)

a = 3.1661, b = −55.7966

R² = 0.88, r = 0.94
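If you want to check the calculator's output off-calculator, here is a minimal Python sketch. It assumes nothing beyond the golf data above; the function name `linreg` and the variable names are mine, not part of the TI-83/84 procedure.

```python
from statistics import mean

# Golf data from the table in Step 1
speed = [100, 102, 103, 101, 105, 100, 99, 105]          # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]      # y, yards

def linreg(xs, ys):
    """Return slope a, intercept b, and correlation r for y-hat = ax + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)               # spread of x
    syy = sum((y - y_bar) ** 2 for y in ys)               # spread of y
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = y_bar - a * x_bar
    r = sxy / (sxx * syy) ** 0.5
    return a, b, r

a, b, r = linreg(speed, distance)
print(f"a = {a:.4f}, b = {b:.4f}")        # a = 3.1661, b = -55.7966
print(f"R² = {r * r:.2f}, r = {r:.2f}")   # R² = 0.88, r = 0.94
```

The rounding in the `print` calls follows the rule of thumb above: four decimal places for slope and intercept, two for r and R².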

Now let’s take a look in depth at each of those.

#### Correlation Coefficient, r

“Several sets of (x,y) [pairs], with the correlation coefficient for each set. Note that correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom).”
source: Correlation and Dependence [see “Sources Used” at end of book]

Look first at r, the coefficient of linear correlation. r can range from −1 to +1 and measures the strength of the association between x and y. A positive correlation or positive association means that y tends to increase as x increases, and a negative correlation or negative association means that y tends to decrease as x increases. The closer r is to 1 or −1, the stronger the association. We usually round r to two decimal places.

Karl Pearson developed the formula for the linear correlation coefficient in 1896. The symbol r is due to Sir Francis Galton in 1888.

For real-world data, 0.94 is a pretty strong correlation. But you might wonder whether there’s actually a general association between club-head speed and distance traveled, as opposed to just the correlation that you see in this sample. Decision Points for Correlation Coefficient, later in this chapter, shows you how to answer that question.

Though nobody ever computes r by hand any more, the formula explains the properties of r. Here are two equivalent forms.

$$ r = \frac{1}{n-1} \sum \left(\frac{x-\bar{x}}{s_x}\right)\left(\frac{y-\bar{y}}{s_y}\right) \qquad\qquad r = \frac{\sum (x-\bar{x})(y-\bar{y})}{(n-1)\,s_x\,s_y} $$

In the first form, you compute the z-score of each x within just the x’s and the z-score of each y within just the y’s. The second formula is easier if you already have the means and SD of the x’s and y’s. For the meaning of ∑, see ∑ Means Add ’em Up in Chapter 1.

z-scores are pure numbers without units, and therefore r also has no units. You can interchange the x’s and y’s in the formula without changing the result, and therefore r is the same regardless of which variable is x and which is y.
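The z-score form of r, and the two properties just mentioned, can be checked with a short Python sketch. It uses the golf data from Step 1; the helper names `z_scores` and `corr` are mine, and the yards-to-meters factor is only there to illustrate that r has no units.

```python
from statistics import mean, stdev

speed = [100, 102, 103, 101, 105, 100, 99, 105]          # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]      # y, yards

def z_scores(values):
    """z-score of each value within its own list (sample SD)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def corr(xs, ys):
    """r = sum of products of paired z-scores, divided by n - 1."""
    n = len(xs)
    return sum(zx * zy for zx, zy in zip(z_scores(xs), z_scores(ys))) / (n - 1)

r = corr(speed, distance)
r_swapped = corr(distance, speed)                         # x and y interchanged
r_meters = corr(speed, [d * 0.9144 for d in distance])    # y in meters instead of yards

print(round(r, 2), round(r_swapped, 2), round(r_meters, 2))
```

All three values agree, because swapping the variables or changing the units leaves every z-score product unchanged.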

Why is r positive when data points trend up to the right and negative when they trend down to the right? The product (x−x̄)(y−ȳ) explains this. When points trend up to the right, most are in the lower left and upper right quadrants of the plot. In the lower left, x and y are both below average, x−x̄ and y−ȳ are both negative, and the product is positive. In the upper right, x and y are both above average, x−x̄ and y−ȳ are both positive, and again the product is positive. The product is positive for most points, and therefore r is positive when the trend is up to the right.

On the other hand, if the data trend down to the right, most points are in the upper left (where x is below average and y is above average, x−x̄ is negative, y−ȳ is positive, and the product is negative) and the lower right (where x−x̄ is positive, y−ȳ is negative, and the product is also negative). Since the product is negative for most points, r is negative when data trend down to the right.

Be careful in your interpretation! No matter how strong your r might be, say that changes in the y variable are associated with changes in the x variable, not “caused by” it. Correlation is not causation is your mantra.

It’s easy to think of associations where there is no cause. For example, if you make a scatterplot of US cities with x as number of books in the public library and y as number of murders, you’ll see a positive association: number of murders tends to be higher in cities with more library books. Does that mean that reading causes people to commit murder, or that murderers read more than other people? Of course not! There is a lurking variable here: population of the city.

When you have a positive or negative association, there are four possibilities: x might cause changes in y, y might cause changes in x, lurking variables might cause changes in both, or it could just be coincidence, a random sample that happens to show a strong association even though the population does not.

Cartoon used by permission; source: https://xkcd.com/552/ (accessed 2021-11-15)

If correlation is not causation, then how can we establish causation? For example, how do we know that smoking causes lung cancer in humans? Obviously we can’t perform an experiment, for ethical reasons. Sir Austin Bradford Hill laid down nine criteria for establishing causation in a 1965 paper, The Environment and Disease: Association or Causation? [see “Sources Used” at end of book] Short summaries of the “Bradford Hill criteria” appear in many places on the Web, including Steve Simon’s (2000b) Causation [see “Sources Used” at end of book].

#### Regression Line, ŷ = ax+b

Write the equation of the line using ŷ (“y-hat”), not y, to indicate that this is a prediction. b is the y intercept, and a is the slope. Round both of them to four decimal places, and write the equation of the line as

ŷ = 3.1661x − 55.7966

(Don’t write 3.1661x + −55.7966.)

These numbers can be interpreted pretty easily. Business majors will recognize them as intercept = fixed cost and slope = variable cost, but you can interpret them in non-business contexts just as well.

The slope, a or b1 or m, tells how much ŷ increases or decreases for a one-unit increase in x. In this case, your interpretation is “the ball travels about an extra 3.17 yards when the club speed is 1 mph greater.” The slope and the correlation coefficient always have the same sign. (A negative slope would mean that y decreases that many units for every one unit increase in x.)

The intercept, b or b0, says where the regression line crosses the y axis: it’s the value of ŷ when x is 0. Be careful! The y intercept may or may not be meaningful. In this case, a club-head speed of zero is not meaningful. In general, when the measured x values don’t include 0 or don’t at least come pretty close to it, you can’t assign a real-world interpretation to the intercept. In this case you’d say something like “the intercept of −55.7966 has no physical interpretation because you can’t hit a golf ball at 0 mph.”

Here’s an example where the y intercept does have a physical meaning. Suppose you measure the gross weight of a UPS truck (y) with various numbers of packages (x) in it, and you get the regression equation ŷ = 2.17x+2463. The slope, 2.17, is the average weight per package, and the y intercept, 2463, is the weight of the empty truck.

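The slope (a or m or b1) and y intercept (b or b0) of the regression line can be calculated from formulas, if you have a lot of time on your hands:

$$ a = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2} = r\,\frac{s_y}{s_x} \qquad\qquad b = \bar{y} - a\,\bar{x} $$

For the meaning of ∑, see ∑ Means Add ’em Up in Chapter 1.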

Traditionally, calculus is used to come up with those equations, but all that’s really necessary is some algebra. See if you’d like to know more.

The second formula for the slope is kind of neat because it connects the slope, the correlation coefficient, and the SD of the two variables.
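Here is a small Python sketch of that connection, using the golf data from Step 1. It takes the slope as r times the ratio of the two SDs, and the intercept from the standard fact that the least-squares line passes through the point of means (x̄, ȳ). The variable names are mine.

```python
from statistics import mean, stdev

speed = [100, 102, 103, 101, 105, 100, 99, 105]          # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]      # y, yards

n = len(speed)
x_bar, y_bar = mean(speed), mean(distance)
s_x, s_y = stdev(speed), stdev(distance)

# Correlation coefficient from the products of deviations
r = sum((x - x_bar) * (y - y_bar)
        for x, y in zip(speed, distance)) / ((n - 1) * s_x * s_y)

a = r * s_y / s_x          # slope from r and the two SDs
b = y_bar - a * x_bar      # the line passes through (x-bar, y-bar)

sign = "−" if b < 0 else "+"
print(f"ŷ = {a:.4f}x {sign} {abs(b):.4f}")   # ŷ = 3.1661x − 55.7966
```

Because s_x and s_y are always positive, this form also shows why the slope and r always have the same sign.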

#### Coefficient of Determination, R²

The last number to look at (third on the screen) is R², the coefficient of determination. (The calculator displays `r²`, but the capital letter is standard notation.) R² measures the quality of the regression line as a means of predicting ŷ from x: the closer R² is to 1, the better the line. Another way to look at it is that R² measures how much of the total variation in y is predicted by the line.

In this case R² is about 0.88, so your interpretation is “about 88% of the variation in distance traveled is associated with variation in club-head speed.” Statisticians say that R² tells you how much of the variation in y is “explained” by variation in x, but if you use that word remember that it means a numerical association, not necessarily a cause-and-effect explanation. It’s best to stick with “associated” unless you have done an experiment to show that there is cause and effect.

There’s a subtle difference between r and R², so keep your interpretations straight. r talks about the strength of the association between the variables; R² talks about what part of the variation in the y variable is associated with variation in the x variable, and how well the line predicts y from x. Don’t use any form of the word “correlated” when interpreting R².

Only linear regression will have a correlation coefficient r, but any type of regression — fitting any line or curve to a set of data points — will have a coefficient of determination R² that tells you how well the regression equation predicts y from the independent variable(s). Steve Simon (1999b) gives an example for non-linear regression in R-squared [see “Sources Used” at end of book].

In straight-line regression, R² is the square of r, so if you want a formula just compute r and square the result.

### Step 3. Display the Regression Line

 Show line with original data points. [`GRAPH`]

What is this line, exactly? It’s the one unique line that fits the plotted points best. But what does “best” mean?

Figure: The same four points on left and right. The vertical distance from each measured data point to the line, y−ŷ, is called the residual for that x value. The line on the right is better because the residuals are smaller.
source: Dabes and Janik (1999, 179) [see “Sources Used” at end of book]

For each plotted point, there is a residual equal to y−ŷ, the difference between the actual measured y for that x and the value predicted by the line. Residuals are positive if the data point is above the line, or negative if the data point is below the line.

You can think of the residuals as measures of how bad the line is at prediction, so you want them small. For any possible line, there’s a “total badness” equal to taking all the residuals, squaring them, and adding them up. The least squares regression line means the line that is best because it has less of this “total badness” than any other possible line. Obviously you’re not going to try different lines and make those calculations, because the formulas built into your calculator guarantee that there’s one best line and this is it.
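To see the “total badness” idea in numbers, here is a hedged Python sketch with the golf data. The function name `total_badness` and the three rival lines are my own choices for illustration; any other line would also come out worse than the least squares line.

```python
from statistics import mean

speed = [100, 102, 103, 101, 105, 100, 99, 105]          # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]      # y, yards

def total_badness(a, b, xs, ys):
    """Sum of squared residuals (y - y-hat)^2 for the line y-hat = ax + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# Least squares slope and intercept
x_bar, y_bar = mean(speed), mean(distance)
a = (sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, distance))
     / sum((x - x_bar) ** 2 for x in speed))
b = y_bar - a * x_bar

best = total_badness(a, b, speed, distance)

# Three arbitrary rival lines, chosen only for comparison
rivals = {
    "steeper": total_badness(a + 0.1, b, speed, distance),
    "shifted up 1 yard": total_badness(a, b + 1, speed, distance),
    "flatter, through the point of means":
        total_badness(a - 0.5, y_bar - (a - 0.5) * x_bar, speed, distance),
}

print(f"least squares line: {best:.2f}")
for name, badness in rivals.items():
    print(f"{name}: {badness:.2f}")
```

Every rival prints a larger total than the least squares line, which is what “best” means here.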

Carl Friedrich Gauss developed the method of least squares in a paper published in 1809.

### Optional:  Display the Residuals

I would like you to know the material in this section, but it's not part of the MATH200 syllabus so I don’t require it. No homework or quiz problems will draw from this section. You will, however, need to calculate individual residuals; see Finding Residuals, below.

“No regression analysis is complete without a display of the residuals to check that the linear model is reasonable.”

DeVeaux, Velleman, Bock (1999, 227) [see “Sources Used” at end of book]

The residuals are automatically calculated during the regression. All you have to do is plot them on the y axis against your existing x data. This is an important final check on your model of the straight-line relationship.

• Turn off other plots. Press [`Y=`]. Cursor to the highlighted = sign next to `Y1` and press [`ENTER`]. Cursor to `PLOT1` and press [`ENTER`].
• Set up the plot of residuals against the x data, using Plot 2 for the residuals. Press [`2nd` `Y=` makes `STAT PLOT`] [`▼`] [`ENTER`] [`ENTER`] to turn on Plot 2. Press [`▼`] [`ENTER`] to select a scatterplot. The x’s are still in `L1`, so press [`2nd` `1` makes `L1`] [`ENTER`]. In this plot, the y’s will be the residuals: press [`2nd` `STAT` makes `LIST`], cursor up to `RESID`, and press [`ENTER`] [`ENTER`].
• Display the plot. [`ZOOM`] [`9`] displays the plot.

You want the plot of residuals versus x to be “the most boring scatterplot you’ve ever seen.” (DeVeaux, Velleman, Bock 2009, 203) [see “Sources Used” at end of book] “It shouldn’t have any interesting features, like a direction or shape. It should stretch horizontally, with about the same amount of scatter throughout. It should show no bends, and it should have no outliers. If you see any of these features, find out what the regression model missed.”

Don’t worry about the size of the residuals, because [`ZOOM`] [`9`] adjusts the vertical scale so that they take up the full screen.

If the residuals are more or less evenly distributed above and below the axis and show no particular trend, you were probably right to choose linear regression. But if there is a trend, you have probably forced a linear regression on non-linear data. If your data points looked like they fit a straight line but the residuals show a trend, it probably means that you took data along a small part of a curve.

Here there is no bend and there are no outliers. The scatter is pretty consistent from left to right, so you conclude that distance traveled versus club-head speed really does fit the straight-line model.

#### Residual Plot Showing Problems

Refer back to the scatterplot of f/stop against shutter speed. I said then that it was not a straight line, so you could not do a linear regression. If you missed the bend in the scatterplot and did a regression anyway, you’d get a correlation coefficient of r = 0.98, which would encourage you to rely on the bad regression. But plotting the residuals (at right) makes it crystal clear that linear regression is the wrong type for this data set.

This is a textbook case (which is why it was in a textbook): there’s a clear curve with a bend, variation on both sides of the x axis is not consistent, and there’s even a likely outlier.
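A rough Python sketch shows the same trap in numbers, using the shutter-speed table from Step 1. The variable names are mine. The code fits a straight line anyway, then prints r and the sign of each residual in order of increasing shutter speed.

```python
from statistics import mean

# Shutter speed (x) entered as fractions, f/stop (y), from the table in Step 1
shutter = [1/1000, 1/500, 1/250, 1/125, 1/60, 1/30, 1/15, 1/8]
f_stop = [2.8, 4, 5.6, 8, 11, 16, 22, 32]

x_bar, y_bar = mean(shutter), mean(f_stop)
sxx = sum((x - x_bar) ** 2 for x in shutter)
syy = sum((y - y_bar) ** 2 for y in f_stop)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(shutter, f_stop))

a = sxy / sxx                      # slope of the (inappropriate) straight line
b = y_bar - a * x_bar              # intercept
r = sxy / (sxx * syy) ** 0.5       # looks reassuringly high

residuals = [y - (a * x + b) for x, y in zip(shutter, f_stop)]
signs = "".join("+" if e > 0 else "-" for e in residuals)

print(f"r = {r:.2f}")
print("residual signs, left to right:", signs)
```

The residuals run negative, then positive, then negative again instead of scattering randomly around zero. That pattern is the signature of a curve, no matter how high r is.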

#### Optional Advanced:  Residuals and R²

I said in Step 2 that the coefficient of determination measures the variation in measured y that’s associated with the variation in measured x. Now that you understand the residuals, I can make that statement more precise and perhaps a little easier to understand.

The set of measured y values has a spread, which can be measured by the standard deviation or the variance. It turns out to be useful to consider the variation in y’s as their variance. (You remember that the variance is the square of the standard deviation.)

The total variance of the measured y’s has two components: the so-called “explained” variation, which is the variation along the regression line, and the “unexplained” variation, which is the variation away from the regression line. The “explained” variation is simply the variance of the ŷ’s, computing ŷ for every x, and the “unexplained” variation is the variance of the residuals. Those two must add up to the total variance of the measured y’s, which means that as percentages of the variation in y they must add to 100%. So R² is the percent of “explained” variation in the regression, and 100%−R² is the percent of “unexplained” variation:

$$ R^2 = \frac{s_{\hat{y}}^2}{s_y^2} \qquad\text{and}\qquad 1 - R^2 = \frac{s_e^2}{s_y^2} $$

Now I can restate what you learned in Step 2. R² is 88% because 88% of the variance in y is associated with the regression line, and the other 12% must therefore be the variance in the residuals.

This isn’t hard to verify: do a 1-VarStats on the list of measured y’s and square the standard deviation to get the total variance in y, s²y = 59.93. Then do 1-VarStats on the residuals list and square the standard deviation to get the “unexplained” variance, s²e = 7.12. The ratio of those is 7.12/59.93 = 0.12, which is 1−R². Expressing it as a percentage gives 100%−R² = 12%, so 12% of the variation in measured y’s is “unexplained” (due to lurking variables, measurement error, etc.).
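The same verification can be sketched in Python instead of 1-VarStats. It uses the golf data from Step 1, with `statistics.variance` standing in for squaring the sample standard deviation; the variable names are mine.

```python
from statistics import mean, variance

speed = [100, 102, 103, 101, 105, 100, 99, 105]          # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]      # y, yards

# Least squares line
x_bar, y_bar = mean(speed), mean(distance)
a = (sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, distance))
     / sum((x - x_bar) ** 2 for x in speed))
b = y_bar - a * x_bar

predicted = [a * x + b for x in speed]                   # the y-hats
residuals = [y - y_hat for y, y_hat in zip(distance, predicted)]

var_y = variance(distance)        # total variance of measured y's
var_pred = variance(predicted)    # "explained" variance
var_resid = variance(residuals)   # "unexplained" variance

print(f"total = {var_y:.2f}, explained = {var_pred:.2f}, unexplained = {var_resid:.2f}")
print(f"R² = {var_pred / var_y:.2f}, 1 − R² = {var_resid / var_y:.2f}")
```

The explained and unexplained variances add up to the total, and the two ratios come out to 0.88 and 0.12.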

## 4C.  How to Find ŷ from a Regression on TI-83/84

Summary: The regression line represents the model that best fits the data. One important reason for doing the regression in the first place is to answer the question, what average y value does the model predict for a given x? This page shows you two methods of answering that question.

See also:
• A separate version of these instructions for the TI-89
• A separate version of these instructions for Excel (later in this chapter)

### Method 1: Trace on the Regression Line Graph (preferred)

You can make predictions while examining the graph of the regression line on the TI-83/84 or TI-89.

Advantages to this method: aside from being pretty cool, it avoids rounding errors, and it’s very fast for multiple predictions.

• Activate tracing on the regression line. [`TRACE`] Look in the upper left corner to make sure that the regression equation is displayed. If you see `P:L1,L2`, press [`▲`] to display the regression equation.
• Enter the x value. Press the black-on-white numeric keys including [`(−)`] and decimal point if needed. As soon as you press the first number, you’ll see a large `X=` appear at the bottom left of the screen. Enter any additional digits and press [`ENTER`]. The TI-83/84 displays the predicted average y value (ŷ) at the bottom right and puts a blinking cursor at that point on the regression line.

Caution: ŷ = 267.1 yards is the predicted or expected average distance for a club-head speed of 102 mph. But that does not mean any particular golf ball hit at that speed will travel that exact distance. You can think of ŷ as the average travel distance that you’d expect for a whole lot of golf balls hit at that speed.

#### Extrapolation: Just Say No (Usually)

Caution: A regression equation is valid only within the range of actual measured x values, and a little way left and right of that range. If you try to go too far outside the valid range, the calculator will display `ERR:INVALID`.

It’s not just being cranky. The line describes the points you measured, so it’s usable between your minimum and maximum x values and maybe a little way outside those limits. But unless you have very solid reasons why the same straight-line model is good beyond that range, you can’t extrapolate.

Take a look at this graph of men’s and women’s winning times in the Olympic 100-meter dash from 1928 to 2012, which I made from data compiled by Mike Rosenbaum [see “Sources Used” at end of book]. (The women’s 100 m dash became an Olympic event in 1928.) From this you can reasonably guess that if women had run in the 1924 Olympics, the winner would have finished in around 12.2 or 12.3 seconds. And the 2016 winner will probably finish in around 11.5 seconds. But the further you go outside your measured data, the riskier your predictions.

Will men’s and women’s times generally continue to decrease? Probably: training will get better, nutrition will improve, global communications will make it less likely that a stellar runner goes undiscovered. But will the decrease follow a straight line? Certainly not! Think about it for a minute. If times keep decreasing on a straight line, eventually they’ll cross the x axis and go negative. Runners will finish the race before they start it! So obviously the straight-line model breaks down — the only question is where. You don’t know, and you can’t know. All you know is that it’s not safe to extrapolate.

Bogus extrapolations give statistics a bad name and make people say “you can prove anything with statistics.” Here’s an example. I’ve just extended the two trend lines to “prove” that after the 2160 Olympics women will run the 100 meters faster than men. Pretty clearly, the linear model breaks down before then. It’s not safe to extrapolate to earlier times, either. The intercepts tell you that in the year zero, the fastest man in the world took 31.6 seconds to run 100 m, and the fastest woman took 44.7 seconds. Does that seem believable?

### Method 2: Use Calculated Regression Equation (if necessary)

But what if you don’t still have the regression line on your calculator, for instance if you’ve done a different regression? In that case, you can go back to your written-down regression equation and plug in the desired x value.

Advantage of this method: You already know how to substitute into equations.  Disadvantages: depending on the specific numbers involved, you may introduce rounding errors. Also, since you’re entering more numbers there’s an increased chance of entering a number wrong.

Example: To find the predicted average y value for x = 102, go back to the regression equation that you wrote down, and substitute 102 for x:

ŷ = 3.1661x − 55.7966

ŷ = 3.1661*102 − 55.7966

ŷ = 267.1456 → 267.1

In this example, the rounding error was very small, and it disappeared when you rounded ŷ to one decimal place. But there will be problems where the rounding error is large enough to affect the final answer, so always use the trace method if you can.
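A small Python sketch makes the size of the rounding error visible for this example. It compares the written-down equation with coefficients kept at full precision; the variable names are mine.

```python
from statistics import mean

speed = [100, 102, 103, 101, 105, 100, 99, 105]          # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]      # y, yards

# Full-precision coefficients, as the calculator keeps them internally
x_bar, y_bar = mean(speed), mean(distance)
a = (sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, distance))
     / sum((x - x_bar) ** 2 for x in speed))
b = y_bar - a * x_bar

x = 102
y_hat_full = a * x + b                    # like tracing on the regression line
y_hat_rounded = 3.1661 * x - 55.7966      # written-down equation

print(f"full precision:   {y_hat_full:.4f}")
print(f"rounded equation: {y_hat_rounded:.4f}")
print(f"both round to     {y_hat_full:.1f}")
```

Here the two results differ only in the fourth decimal place, so both give 267.1 yards. With a steeper slope or a larger x, the same rounding could change the reported answer.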

Again, please observe the Cautions above. With this method, the calculator won’t tell you when your x value is outside a reasonable range, so you need to be aware of that issue yourself.
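If you do the substitution in software rather than by hand, you can build that awareness into the calculation. The Python sketch below is one possible guard, not the rule the calculator uses: the function name `predict_in_range` and the 10% `margin` are hypothetical choices of mine for “a little way left and right” of the measured x values.

```python
speed = [100, 102, 103, 101, 105, 100, 99, 105]   # measured x values, mph
A, B = 3.1661, -55.7966                           # written-down slope and intercept

def predict_in_range(x, xs=speed, margin=0.10):
    """Return y-hat for x, refusing x values far outside the measured range.

    margin is a hypothetical allowance: 10% of the x range on each side.
    """
    lo, hi = min(xs), max(xs)
    slack = margin * (hi - lo)
    if not (lo - slack <= x <= hi + slack):
        raise ValueError(f"x = {x} is outside the measured range {lo} to {hi}")
    return A * x + B

print(round(predict_in_range(102), 1))    # inside the range: 267.1

try:
    predict_in_range(0)                   # far outside the range
except ValueError as err:
    print("refused:", err)
```

How much slack to allow is a judgment call that depends on how well you trust the straight-line model beyond your data.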

### Finding Residuals

Each measured data point has an associated residual, defined as y−ŷ, the distance of the point above or below the line. To find a residual, the actual y comes from the original data, and the predicted average ŷ comes from one of the methods above.

Example: Find the residual for x = 102.
Solution: From the original data, y = 264. From either of the methods above, ŷ = 267.1. Therefore the residual is y−ŷ = 264−267.1 = −3.1 yards.

If a given x value occurs in more than one data point, you have multiple residuals for that x value.
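The residual calculation, including the case of repeated x values, can be sketched in Python with the golf data. The dictionary name `residuals_at` is mine.

```python
from collections import defaultdict
from statistics import mean

speed = [100, 102, 103, 101, 105, 100, 99, 105]          # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]      # y, yards

# Least squares line
x_bar, y_bar = mean(speed), mean(distance)
a = (sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, distance))
     / sum((x - x_bar) ** 2 for x in speed))
b = y_bar - a * x_bar

# Collect every residual y - y-hat under its x value
residuals_at = defaultdict(list)
for x, y in zip(speed, distance):
    residuals_at[x].append(y - (a * x + b))

for x in sorted(residuals_at):
    print(x, [round(e, 1) for e in residuals_at[x]])
```

In this data set, x = 100 and x = 105 each occur twice, so each of them gets two residuals; x = 102 gets the single residual of about −3.1 yards found above.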

## 4D.  Decision Points for Correlation Coefficient

Summary: After you compute the linear correlation coefficient r of your sample, you may wonder whether this reflects any linear correlation in the population. By comparing r to a critical number or decision point, you either conclude that there is linear correlation in the population, or reach no conclusion. You can never conclude that there’s no correlation in the population.

This page gives a simple mechanical test, but a proper statistical test exists. The optional advanced handout Inferences about Linear Correlation explains how decision points are computed and the theory behind the test. You need to learn about t tests before you can understand all of it, but right now you can use the Excel spreadsheet that you’ll find there. Or you can use MATH200B Program part 6 to do the computations.

### Procedure

The decision points are used to answer the question “From the linear correlation r of my sample, can I rule out chance as an explanation for the correlation I see? Can I infer that there is some correlation in the population?”

To answer that question, temporarily disregard the sign of r. This is the absolute value of r, written | r |. Then compare | r | to the decision point, and obtain one of the only three possible results:

• If | r | ≤ d.p., then you cannot say whether there is any linear correlation in the population.
• If | r | > d.p. and r is negative, then there is some negative linear correlation in the population.
• If | r | > d.p. and r is positive, then there is some positive linear correlation in the population.

Here’s a table of decision points (also known as critical values of r) for various sample sizes.

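Decision Points or Critical Numbers for r
(two-tailed test for ρ≠0 at significance level 0.05)

| n | d.p. | n | d.p. | n | d.p. | n | d.p. | n | d.p. |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 0.878 | 10 | 0.632 | 15 | 0.514 | 20 | 0.444 | 30 | 0.361 |
| 6 | 0.811 | 11 | 0.602 | 16 | 0.497 | 22 | 0.423 | 40 | 0.312 |
| 7 | 0.754 | 12 | 0.576 | 17 | 0.482 | 24 | 0.404 | 50 | 0.279 |
| 8 | 0.707 | 13 | 0.553 | 18 | 0.468 | 26 | 0.388 | 60 | 0.254 |
| 9 | 0.666 | 14 | 0.532 | 19 | 0.456 | 28 | 0.374 | 80 | 0.220 |
|  |  |  |  |  |  |  |  | 100 | 0.196 |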

(If your sample size is not shown, either refer to the Excel workbook or use the next lower number that is shown in the table. Example: n = 35 is not shown, and therefore you will use the decision point for n = 30.)
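The lookup and comparison are mechanical enough to write as a few lines of code. Here is a minimal Python sketch, assuming the decision points in the table above; the names `DECISION_POINTS`, `decision_point`, and `interpret` are made up for illustration and are not part of the Excel workbook or MATH200B.

```python
# Decision points copied from the table above (two-tailed test, significance level 0.05).
DECISION_POINTS = {
    5: 0.878, 6: 0.811, 7: 0.754, 8: 0.707, 9: 0.666,
    10: 0.632, 11: 0.602, 12: 0.576, 13: 0.553, 14: 0.532,
    15: 0.514, 16: 0.497, 17: 0.482, 18: 0.468, 19: 0.456,
    20: 0.444, 22: 0.423, 24: 0.404, 26: 0.388, 28: 0.374,
    30: 0.361, 40: 0.312, 50: 0.279, 60: 0.254, 80: 0.220,
    100: 0.196,
}


def decision_point(n):
    """Return the decision point for sample size n.

    If n is not in the table, fall back to the next lower n that is shown.
    """
    usable = [size for size in DECISION_POINTS if size <= n]
    if not usable:
        raise ValueError("the table starts at n = 5")
    return DECISION_POINTS[max(usable)]


def interpret(r, n):
    """Compare |r| to the decision point and return one of the three possible results."""
    if abs(r) <= decision_point(n):
        return "no conclusion about linear correlation in the population"
    if r < 0:
        return "some negative linear correlation in the population"
    return "some positive linear correlation in the population"


if __name__ == "__main__":
    print(decision_point(35))    # n = 35 is not listed, so the entry for n = 30 is used
    print(interpret(-0.35, 50))
    print(interpret(0.20, 21))
```

Notice that the function never returns “no correlation”: the only outcomes are the three listed in the procedure.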

### Examples

You survey 50 randomly selected college students about the number of hours they spend playing video games each week and their GPA, and you find r = −0.35. You look up n = 50 in the table and find 0.279 as the decision point. |r|>d.p. (0.35 > 0.279). You conclude that for college students in general, video game play time is negatively associated with GPA, or that GPA tends to decrease as video-game playing increases.

You randomly select 21 college students. For the amount they spend on textbooks and their GPA, you find r = +0.20. n=21 isn’t in the table of decision points, so you select 0.444, the decision point for n=20. |r|≤d.p. (0.20 ≤ 0.444). Therefore, you are unable to make any statement about an association between textbook spending and GPA for college students in general.

### Interpretation

Be very careful with your interpretation, and don’t say more than the statistics will allow.

The question was simply whether there is some correlation in the population, not how much. The population might have stronger or weaker correlation than your sample; all you know is that it has some. (Though you won’t learn how to do it in this course, it is possible to estimate the correlation coefficient of the population.)

If you conclude there is some correlation in the population, it’s probable, not certain. From a completely uncorrelated population, there’s still one chance in 20 of drawing a sample with | r | greater than the decision point. Because 1/20 is .05, we say that .05 is the significance level.
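You can check the “one chance in 20” claim for yourself with a small simulation. The Python sketch below is only an illustration under stated assumptions: it takes the population to be two independent normal variables (so the true correlation is zero), draws many samples of size 10, and counts how often | r | exceeds 0.632, the decision point for n = 10. The function names, the number of trials, and the random seed are arbitrary choices of mine.

```python
import math
import random


def pearson_r(xs, ys):
    """Linear correlation coefficient r of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def false_alarm_rate(n, dp, trials=20000, seed=1):
    """Fraction of samples from an uncorrelated population with |r| above dp."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        # x and y are generated independently, so the population correlation is 0.
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        if abs(pearson_r(xs, ys)) > dp:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(false_alarm_rate(n=10, dp=0.632))
```

The printed fraction should land close to 0.05, though it varies a little with the seed and the number of trials.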

Even if you conclude that there is some correlation in the population, that’s the start of your investigation, not the end. If there’s a correlation in the population, you can’t just assume that one variable drives the other: correlation is not causation. Steve Simon’s (2000b) Causation [see “Sources Used” at end of book] gives some hints for investigating causation, using smoking and lung cancer as an example.

Finally, note that there’s no way to reach the conclusion “there’s no correlation in the population.” Either there (probably) is, or you can’t reach any conclusion. This will be a general pattern in inferential statistics: either you reach a conclusion of significance, or you don’t reach any conclusion at all. (As you’ll see in Chapter 10, you can conclude “something is going on”, you can fail to reach a conclusion, but you can never conclude “nothing is going on”. Lack of evidence for is not evidence against.)

## 4E.  Optional:  Scatterplot, Correlation, and Regression in Excel

Summary: In “Scatterplot, Correlation, and Regression on TI-83/84”, earlier in this chapter, you learned the concepts of correlation and regression, and you used a TI-83 or TI-84 calculator to plot the points and do the computations. The calculator is handy, but calculator screens aren’t great for formal reports. This section tells you how to do the same operations in Microsoft Excel, without repeating the concepts.

I’m using Excel 2010, but Excel 2007 or 2013 should be almost identical.

### Plot the Points

Here again are the data:

| Club-head speed, mph (x) | Distance, yards (y) |
|---|---|
| 100 | 257 |
| 102 | 264 |
| 103 | 274 |
| 101 | 266 |
| 105 | 277 |
| 100 | 263 |
| 99 | 258 |
| 105 | 275 |

1. Enter the x-y pairs in rows or columns; row or column heads are optional.
2. With your mouse, highlight the data but not the headers. Click . In the section, click and choose the first scatterplot type.
3. Right-click the useless “Series1” legend and click .
4. This time I got lucky, but sometimes Excel puts too much white space at the left or bottom of the chart. If this happens to you, right-click the axis numbers and select . Change to Fixed and type in a sensible value.
5. In the Excel ribbon, click  »  »  » and type the axis title. Include units if any. In this case, you have club-head speed in miles per hour.
6. Click  »  » and type the axis title, including units if any. In this case, you have distance traveled in yards.
7. Click  » and type your chart title.
8. For a neater appearance, you can right-click the horizontal axis, select , and change to None. Repeat for the vertical axis. Your chart should look like this:

### Show the Regression Line

1. In the Excel ribbon, click . In the group, click  » .
2. In the dialog box that appears, click at the left. At the top right, select . At the bottom right, select and .
3. Click and drag the regression equation and R² value so that they’re not covering any data points or any part of the line. Then right-click them and select . Click at the left, then at the right click and change the color to white. (This keeps the gridlines from running through the text.) If you wish, click  » . Here’s the result:

### Show the Correlation Coefficient

Excel won’t put r on the chart, but you can compute it in a worksheet cell:

1. Click into an empty worksheet cell. Type `=CORREL(` including the = sign and opening parenthesis.
2. Highlight your y list with your mouse — numbers only, not the header — and type a comma.
3. Highlight your x list with your mouse — again, just the numbers. Type a closing parenthesis and press [`ENTER`].

(You can get the slope, y intercept, or R² into the worksheet by following the above procedure but substituting SLOPE, INTERCEPT, or RSQ for CORREL.)
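If you don’t have Excel handy, the same four numbers can be computed from the textbook formulas. The Python sketch below is a hedged illustration, not Excel’s own code: the lowercase functions `correl`, `slope`, `intercept`, and `rsq` are hypothetical stand-ins that take the y list first and the x list second, in the same order you highlighted them in the steps above.

```python
import math

# Golf data from this chapter: club-head speed in mph (x), distance in yards (y).
speeds = [100, 102, 103, 101, 105, 100, 99, 105]
distances = [257, 264, 274, 266, 277, 263, 258, 275]


def _sums(ys, xs):
    """Return (mean_x, mean_y, Sxx, Syy, Sxy) for paired lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return mean_x, mean_y, sxx, syy, sxy


def correl(ys, xs):
    """Linear correlation coefficient r."""
    _, _, sxx, syy, sxy = _sums(ys, xs)
    return sxy / math.sqrt(sxx * syy)


def slope(ys, xs):
    """Slope of the least-squares regression line."""
    _, _, sxx, _, sxy = _sums(ys, xs)
    return sxy / sxx


def intercept(ys, xs):
    """y intercept of the least-squares regression line."""
    mean_x, mean_y, sxx, _, sxy = _sums(ys, xs)
    return mean_y - (sxy / sxx) * mean_x


def rsq(ys, xs):
    """Coefficient of determination R², the square of r."""
    return correl(ys, xs) ** 2


if __name__ == "__main__":
    print("r =", round(correl(distances, speeds), 4))
    print("slope =", round(slope(distances, speeds), 4))
    print("intercept =", round(intercept(distances, speeds), 4))
    print("R² =", round(rsq(distances, speeds), 4))
```

For the golf data this should give r of about 0.94, a slope of about 3.17 yards per mph, a y intercept of about −55.8 yards, and R² of about 0.88.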

### Predict the Average y

Like your calculator, Excel can find the ŷ value (predicted average y) for any x.

Caution: A regression equation is valid only within the range of actual measured x values, and a little way left and right of that range. If you go outside that range, Excel will happily serve up garbage numbers to you.

On average, how far do you expect a golf ball to travel when hit at 102 mph?

1. Type your x value, 102, in an empty cell.
2. Click into an empty worksheet cell. Type `=FORECAST(` including the = sign and opening parenthesis.
3. Click into the cell that holds your x value, and type a comma.
4. Highlight your y list with your mouse — numbers only, not the header — and type a comma.
5. Highlight your x list with your mouse — again, just the numbers. Type a closing parenthesis and hit [`ENTER`]. You’ll see the predicted average distance, 267.1 yards.

The prediction formula, like all Excel formulas, is “live”: if you type in a new x Excel will display the corresponding ŷ. If this doesn’t happen, in the Excel ribbon click  »  » .
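Here is the same prediction as a minimal Python sketch, with the caution about the range of x built in. It is an illustration only: `forecast` is a hypothetical function of mine that borrows Excel’s argument order (x, then the y list, then the x list), and the `margin` parameter, which says how far outside the measured x values you are willing to go, is an assumption with an arbitrary default of zero.

```python
# Golf data from this chapter: club-head speed in mph (x), distance in yards (y).
speeds = [100, 102, 103, 101, 105, 100, 99, 105]
distances = [257, 264, 274, 266, 277, 263, 258, 275]


def forecast(x, ys, xs, margin=0.0):
    """Return y-hat, the predicted average y at x, from the least-squares line.

    Raises ValueError if x lies more than `margin` outside the measured x values,
    since the regression equation is not trustworthy out there.
    """
    if x < min(xs) - margin or x > max(xs) + margin:
        raise ValueError(f"x = {x} is outside the range of the measured x values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    slope = sxy / sxx
    # The regression line always passes through the point (mean_x, mean_y).
    return mean_y + slope * (x - mean_x)


if __name__ == "__main__":
    print(round(forecast(102, distances, speeds), 1))  # 267.1
```

Unlike Excel, this sketch refuses to extrapolate: asking for a speed such as 120 mph raises an error instead of quietly returning a number.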

## What Have You Learned?

Key ideas:

(The online book has live links to all of these.)

Study aids:

## Exercises for Chapter 4

Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.

Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

1 A researcher performed a regression on x = age and y = salary for all employees at MegaGrandeEnormoCorp (doing business as “Gramma’s Kitchen”). She found R² = 0.64. How would you explain this to a friend who doesn’t understand any math more complicated than percentages?

2 Manatees or “sea cows” are large, slow-moving mammals that live in coastal waters. They’re an endangered species. Sharyn O’Halloran (n.d., slide 4) [see “Sources Used” at end of book] quotes yearly figures from the US Fish & Wildlife Service for the number of power-boat registrations and number of manatees killed by power boats in Florida coastal waters.

| Year | Power Boat Reg. (1000s) | Manatees Killed |
|---|---|---|
| 1977 | 447 | 13 |
| 1978 | 460 | 21 |
| 1979 | 481 | 24 |
| 1980 | 498 | 16 |
| 1981 | 513 | 24 |
| 1982 | 512 | 20 |
| 1983 | 526 | 15 |
| 1984 | 559 | 34 |
| 1985 | 585 | 33 |
| 1986 | 614 | 33 |
| 1987 | 645 | 39 |
| 1988 | 675 | 43 |
| 1989 | 711 | 50 |
| 1990 | 719 | 47 |
| 1991 | 716 | 53 |
| 1992 | 716 | 38 |
| 1993 | 716 | 35 |
| 1994 | 735 | 49 |

(a) The two variables are power-boat registrations and manatee deaths. Which should be the explanatory variable, and which should be the response variable?

(b) On paper or on your calculator, make a scatterplot. Do the data seem to follow a straight line, more or less?

(c) Give the symbol and numerical value of the correlation coefficient.

(d) Write down the regression equation for manatee deaths as a function of power-boat registrations.

(e) State and interpret the slope.

(f) State and interpret the y intercept.

(g) Give the coefficient of determination with its symbol, and interpret it.

(h) How many deaths does the regression predict if 559,000 power boats are registered? Use the proper symbol.

(i) Find the residual for x = 559.

(j) How many manatee deaths would you expect for a million power-boat registrations?

3 Sascha randomly selected 10 TC3 students and asked how many hours of TV they watched on an average day and what was their GPA. The correlation was −0.57. What if anything can you say about TV watching and GPA for all TC3 students?

4 Your deep freezer has a dial to regulate temperature, but it’s just numbered 0 to 8 with no indication of temperature. So you try various dial settings, allowing 24 hours for temperature to stabilize after each change. The results are shown in the table below.

| Dial | Temp, °F |
|---|---|
| 0 | 6 |
| 2 | −1 |
| 3 | −3 |
| 5 | −10 |
| 6 | −16 |

(a) Make a scatterplot. Does a straight-line model seem reasonable here?

(b) What linear equation best describes the relation between dial setting x and temperature y?

(c) State and interpret the slope.

(d) State and interpret the y intercept.

(e) Give the correlation coefficient with its symbol.

(f) Give the coefficient of determination with its symbol, and interpret it.

(g) Predict the temperature for a dial setting of 1.

5 A statistics professor asked students to write on their final exam the number of hours they had spent studying. After scoring the exams, she randomly selected 12 of them and plotted exam score against hours of study, with the result r = 0.85. What if anything can you say about the relation between study time and exam score for statistics students in general, assuming that this class is representative of all classes?

6 A public-school administrator with too much time on his hands studied shoe size and reading ability and found a correlation coefficient of 0.81. Are big feet a sign of intelligence?

7 A scatterplot is shown at right. Would the value of r be strongly positive, near zero, or strongly negative? Briefly explain your answer.

8 “In a large study of twins, the Minnesota Twin study found a correlation of +.71 between the IQ scores of identical twins. Another study found that family income is correlated +.30 with the IQ of children.” (Source: Pearson’s 2001 [see “Sources Used” at end of book] in the McGraw-Hill Statistical Primer.)

How much of the variation in children’s IQ is associated with variation in family income?

## What’s New?

• 21 Aug 2023: In step 1, added a suggestion to display residuals if the scatterplot isn’t clearly linear or nonlinear, and added Ellenberg’s warning.
• 15 Nov 2021: Updated the link to xkcd.
• (intervening changes suppressed)
• 10 Mar 2012: New document formed out of separate documents on regression, prediction, and decision points.