Updated 19 Nov 2021

# Scatterplot, Correlation, and Regression on TI-89

Copyright © 2007–2024 by Stan Brown, BrownMath.com

Summary: When you have a set of (x,y) data points and want to find the best equation to describe them, you are performing a regression. You will learn how to find the strength of the association between your two variables (correlation coefficient), and how to find the line of best fit (least squares regression line).

Usually you have some idea that your x variable can help predict your y variable, so you call x the explanatory variable and y the response variable. (Other names are independent variable and dependent variable.)


## Step 0. Setup

 Set floating point mode, if you haven’t already. [`MODE`] [`▼`] [`▼`] [`►`] [`ALPHA` `÷` makes `E`] [`ENTER`]

The calculator will remember this setting when you turn it off: next time you can start with Step 1.

## Step 1. Make the Scatterplot

Before you even run a regression, you should first plot the points and see whether they seem to lie along a straight line. If the distribution is obviously not a straight line, don’t do a linear regression. (Some other form of regression might still be appropriate, but that is outside the scope of this course.)

Let’s use this example from Sullivan 2011 [full citation at https://BrownMath.com/swt/sources.htm#so_Sullivan2011], page 179: the distance a golf ball travels versus the speed with which the club head hit it.

| Club-head speed, mph (x) | Distance, yards (y) |
|---|---|
| 100 | 257 |
| 102 | 264 |
| 103 | 274 |
| 101 | 266 |
| 105 | 277 |
| 100 | 263 |
| 99 | 258 |
| 105 | 275 |

1. Turn off other plots. [`◆`] [`APPS`] and select `Stats/List Editor`. [`F2`] [`3`] [`F2`] [`4`] turns off all plots and functions.
2. Enter the numbers in two statistics lists. You will use two named lists for the x’s and y’s. Any names are possible, but I’ll use `lx` and `ly` because they’re short.
   - If those lists already exist, highlight the `lx` name and press [`CLEAR`] [`ENTER`] to erase previous entries.
   - If `lx` isn’t there yet, move to an empty list heading and press [`L`] [`X`]. (L is above the 4 key. When you press 4 while naming a list, it will change to L automatically.)
   - Enter the x numbers, then clear list `ly` (or create it) and enter the y numbers.
   - Note: You can hide an unwanted list by cursoring to the list name and pressing [`◆` `←` makes `DEL`]. The list remains in memory until you use [`2nd` `−` makes `VARLINK`] to delete it.
3. Set up the scatterplot. [`F2`] [`1`] [`F1`] opens a dialog box. You want these settings:
   - Plot type: Scatter
   - Mark: anything except dot (because a data dot looks just like a dot on the grid)
   - X: [`alpha`] [`L`] [`X`]
   - Y: [`alpha`] [`L`] [`Y`]
   - Use Freq and categories: NO

   Press [`ENTER`] to complete the definition.
4. Plot the points. [`F5`] automatically adjusts the window frame to fit the data.
5. (optional) You can adjust the grid to look better. [`◆` `F2` makes `WINDOW`], set `Xscl`=1 and `Yscl`=5, then [`◆` `F3` makes `GRAPH`] to redisplay it. Appropriate values of `Xscl` and `Yscl` may be different for other problems. Pick the values that make the graph look best to you.
6. Check your data entry by tracing the points. [`F3`] shows you the first (x,y) pair, and then [`►`] shows you the others. They’re shown in the order you entered them, not necessarily from left to right.

A scatterplot on paper needs labels (numbers) and titles on both axes; the x and y axes typically won’t start at 0. Here’s the plot for this data set. (The horizontal lines aren’t needed when you plot on graph paper.)

When the same (x,y) pair occurs multiple times, plot the second one slightly offset. This is called jitter. An example will be shown in class.

If the data points don’t seem to follow a straight line reasonably well, STOP! Your calculator will obey you if you tell it to perform a linear regression, but if the points don’t actually fit a straight line then it’s a case of “garbage in, garbage out.”

For instance, consider this example from DeVeaux, Velleman and Bock 2009 [full citation at https://BrownMath.com/swt/sources.htm#so_DeVeaux2009], page 179. This is a table of recommended f/stops for various shutter speeds for a digital camera:

| Shutter speed (x) | f/stop (y) |
|---|---|
| 1/1000 | 2.8 |
| 1/500 | 4 |
| 1/250 | 5.6 |
| 1/125 | 8 |
| 1/60 | 11 |
| 1/30 | 16 |
| 1/15 | 22 |
| 1/8 | 32 |

If you try plotting these numbers yourself, enter the shutter speeds as fractions for accuracy: don’t convert them to decimals yourself. The calculator will show you only a few decimal places, but it maintains much greater precision internally.

You can see from the plot at right that these data don’t fit a straight line. There is a distinct bend near the left. When you have anything with a curve or bend, linear regression is wrong. You can try other forms of regression in your calculator’s menu, or you can transform the data as described in DeVeaux 2009 [full citation at https://BrownMath.com/swt/sources.htm#so_DeVeaux2009], Chapter 10, and other textbooks.

## Step 2. Perform the Regression

1. Set up to calculate statistics. [`◆`] [`APPS`] and select `Stats/List Editor`. [`F4`] [`3`] [`2`] brings up the LinReg(ax+b) dialog box. You want these settings:
   - X list: [`alpha`] [`L`] [`X`]
   - Y list: [`alpha`] [`L`] [`Y`]
   - Store RegEqn to: [`►`] and select `y1(x)`
   - Freq: 1
   - Category List: (leave blank)
   - Include Categories: (leave blank)
2. Press [`ENTER`] to perform the regression and paste the regression equation into Y1.

Show your work! Write `LinReg(ax+b)` plus the two lists and the y-variable that you’re using. Just “LinReg” isn’t enough.

Write down a (slope), b (y intercept), R² (coefficient of determination), and r (correlation coefficient). (Four decimal places for slope and intercept, and two for r and R², is a decent rule of thumb.)

a = 3.1661, b = −55.7966

R² = 0.88, r = 0.94
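
If you want to double-check the calculator's output away from the TI-89, the same numbers can be reproduced in a few lines of Python. This is only a sketch: the function name `linreg` and the list names `speed` and `distance` are made up for this illustration, and the arithmetic is the usual textbook least squares calculation, not necessarily how the calculator does it internally.

```python
from math import sqrt

# Golf data from the table in Step 1
speed = [100, 102, 103, 101, 105, 100, 99, 105]      # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]  # y, yards

def linreg(xs, ys):
    """Return (a, b, r) for the least squares line y-hat = a*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squared and cross deviations from the means
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx               # slope
    b = y_bar - a * x_bar       # y intercept
    r = sxy / sqrt(sxx * syy)   # correlation coefficient
    return a, b, r

a, b, r = linreg(speed, distance)
print(f"a = {a:.4f}, b = {b:.4f}")
print(f"R^2 = {r * r:.2f}, r = {r:.2f}")
```

Rounded the same way as above, the printed values should agree with what you wrote down from the calculator screen.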

### Correlation Coefficient, r

Figure: “Several sets of (x,y) [pairs], with the correlation coefficient for each set. Note that correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom).”

Look first at r, the coefficient of linear correlation. r can range from −1 to +1 and measures the strength of the association between x and y. A positive correlation or positive association means that y tends to increase as x increases, and a negative correlation or negative association means that y tends to decrease as x increases. The closer r is to 1 or −1, the stronger the association. We usually round r to two decimal places.

For real-world data, the 0.94 that we got is a pretty strong correlation. But you might wonder whether there’s actually an association between club-head speed and distance traveled, as opposed to just an apparent correlation in this sample. Decision Points for Correlation Coefficient shows you how to answer that question.

Be careful in your interpretation! No matter how strong your r might be, say that changes in the y variable are associated with changes in the x variable, not “caused by” it. “Correlation is not causation” is your mantra.

It’s easy to think of associations where there is no cause. For example, if you make a scatterplot of US cities with x as number of books in the public library and y as number of murders, you’ll see a positive association: number of murders tends to be higher in cities with more library books. Does that mean that reading causes people to commit murder, or that murderers read more than other people? Of course not! There is a lurking variable here: population of the city.

When you have a positive or negative association, there are four possibilities: x might cause changes in y, y might cause changes in x, lurking variables might cause changes in both, or it could just be coincidence, a random sample that happens to show a strong association even though the population does not.


Though nobody ever computes r by hand any more, the formula explains the properties of r. To compute r, find the z-scores of all the x’s and y’s, multiply z_x times z_y for each data point, add up all the products, and divide the total by n−1. The second formula is equivalent but a little easier: Find the means and standard deviations of the set of x’s and the set of y’s. For each data point, multiply (x − x̄) by (y − ȳ). Add up those products and divide by n−1 times the product of the two standard deviations.
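
The formula images did not survive in this copy of the page. Written out from the verbal description above, the two equivalent forms would look something like this; the notation (z_x, z_y, s_x, s_y) is my assumption and may differ from the original figure:

```latex
r = \frac{1}{n-1}\sum z_x z_y
  = \frac{\sum (x-\bar{x})(y-\bar{y})}{(n-1)\, s_x\, s_y},
\qquad
z_x = \frac{x-\bar{x}}{s_x},\quad
z_y = \frac{y-\bar{y}}{s_y}
```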

z-scores are pure numbers without units, and therefore r also has no units. You can interchange the x’s and y’s in the formula without changing the result, and therefore r is the same regardless of which variable is x and which is y.

Why is r positive when data points trend up to the right and negative when they trend down to the right? The product (x − x̄)(y − ȳ) explains this. When points trend up to the right, most are in the lower left and upper right quadrants of the plot. In the lower left, x and y are both below average, x − x̄ and y − ȳ are both negative, and the product is positive. In the upper right, x and y are both above average, x − x̄ and y − ȳ are both positive, and the product is positive. The product is positive for most points, and therefore r is positive when the trend is up to the right.

On the other hand, if the data trend down to the right, most points are in the upper left (where x is below average and y is above average, x − x̄ is negative, y − ȳ is positive, and the product is negative) and the lower right (where x − x̄ is positive, y − ȳ is negative, and the product is negative). Since the product is negative for most points, r is negative when data trend down to the right.
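
Here is a small Python sketch of the z-score recipe, using the golf data from Step 1. It is meant only to illustrate the three properties just discussed (how r is built up, why swapping x and y changes nothing, and why the sign follows the trend). The helper names `z_scores` and `corr` are hypothetical, not anything built into the calculator.

```python
from statistics import mean, stdev

speed = [100, 102, 103, 101, 105, 100, 99, 105]      # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]  # y, yards

def z_scores(values):
    """Convert raw values to z-scores using the sample SD (n-1 divisor)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def corr(xs, ys):
    """r = sum of z_x * z_y over all data points, divided by n-1."""
    products = [zx * zy for zx, zy in zip(z_scores(xs), z_scores(ys))]
    return sum(products) / (len(xs) - 1)

r_xy = corr(speed, distance)
r_yx = corr(distance, speed)   # x and y interchanged

# Count the points whose product of z-scores is positive
products = [zx * zy for zx, zy in zip(z_scores(speed), z_scores(distance))]
positive = sum(1 for p in products if p > 0)

print(f"r with x=speed: {r_xy:.4f}; r with x=distance: {r_yx:.4f}")
print(f"{positive} of {len(products)} products are positive")
```

Most of the products come out positive for this data set, which is why r is positive here.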

### Regression Line, ŷ = ax+b

Write the equation of the line using ŷ, not y, to indicate that this is a prediction, not actual measured data. b is the y intercept, and a is the slope. We’ll round both of them to four decimal places, so write the equation of the line as

ŷ = 3.1661x − 55.7966

(Don’t write 3.1661x + −55.7966.)

These numbers can be interpreted pretty easily. Business majors will recognize them as intercept = fixed cost and slope = variable cost, but you can interpret them in non-business contexts just as well.

The slope, a, tells how much ŷ increases or decreases for a one-unit increase in x. In this case, your interpretation is “the ball travels about an extra 3.17 yards when the club speed is 1 mph greater.” The sign of a is always the same as the sign of r. (A negative slope would mean that ŷ decreases that many units for every one unit increase in x.)

The intercept, b, says where the regression line crosses the y axis: it’s the value of ŷ when x is 0. Be careful! The y intercept may or may not be meaningful. In this case, a club-head speed of zero is not meaningful. In general, when the measured x values don’t include 0 or don’t at least come pretty close to it, you can’t assign a real-world interpretation to the intercept. In this case you’d say something like “the intercept of −55.7966 has no physical interpretation because a club-head speed of zero is meaningless for striking a golf ball.”

Here’s an example where the y intercept does have a physical meaning. Suppose you measure the gross weight of a UPS truck (y) with various numbers of packages (x) in it, and you get the regression equation ŷ = 2.17x+2463. The slope, 2.17, is the average weight per package, and the y intercept, 2463, is the weight of the empty truck.
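
To make the slope and intercept interpretations concrete, here is a minimal Python sketch that plugs values into the two regression equations mentioned above. The helper `make_line` is a made-up name for this illustration.

```python
def make_line(slope, intercept):
    """Return a function that computes y-hat = slope*x + intercept."""
    def predict(x):
        return slope * x + intercept
    return predict

golf = make_line(3.1661, -55.7966)   # y-hat in yards, x in mph
truck = make_line(2.17, 2463)        # y-hat = gross weight, x = number of packages

# Slope: change in y-hat for a one-unit increase in x
extra_yards = golf(101) - golf(100)

# Intercept: y-hat at x = 0, meaningful for the truck but not for the golf ball
empty_truck = truck(0)

print(f"1 mph more club-head speed adds about {extra_yards:.2f} yards")
print(f"predicted weight of the empty truck: {empty_truck}")
print(f"predicted weight with 100 packages: {truck(100):.0f}")
```

Nothing stops you from computing `golf(0)`, but as explained above that number has no physical interpretation, because a speed of 0 is far outside the measured x values.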

The slope (a or m or b₁) and y intercept (b or b₀) of the regression line can be calculated from formulas, if you have a lot of time on your hands:

For the meaning of ∑, see ∑ Means Add ’em Up.

Traditionally, calculus is used to come up with those equations, but all that’s really necessary is some algebra. See if you’d like to know more.

The second formula for the slope is kind of neat because it connects the slope, the correlation coefficient, and the SD of the two variables.
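
The formula images are missing from this copy, so the sketch below uses the common textbook forms, which may not match the original notation exactly: slope a = ∑(x − x̄)(y − ȳ) / ∑(x − x̄)², the alternative slope a = r·s_y / s_x, and intercept b = ȳ − a·x̄. The variable names are my own. The point is simply to check on the golf data that both slope formulas give the same answer.

```python
from statistics import mean, stdev

speed = [100, 102, 103, 101, 105, 100, 99, 105]      # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]  # y, yards

n = len(speed)
x_bar, y_bar = mean(speed), mean(distance)
s_x, s_y = stdev(speed), stdev(distance)   # sample standard deviations

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, distance))
sxx = sum((x - x_bar) ** 2 for x in speed)

# Correlation coefficient from the deviation form
r = sxy / ((n - 1) * s_x * s_y)

slope_sums = sxy / sxx                 # first form: ratio of sums
slope_from_r = r * s_y / s_x           # second form: r times SD ratio
intercept = y_bar - slope_sums * x_bar

print(f"slope (sums)    = {slope_sums:.4f}")
print(f"slope (r*sy/sx) = {slope_from_r:.4f}")
print(f"intercept       = {intercept:.4f}")
```

The second form also shows why the sign of the slope always matches the sign of r: the standard deviations are never negative.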

### Coefficient of Determination, R²

The last number we look at (third on the screen) is R², the coefficient of determination. (The calculator displays r², but the capital letter is standard notation.) R² measures the quality of the regression line as a means of predicting ŷ from x: the closer R² is to 1, the better the line. Another way to look at it is that R² measures how much of the total variation in y is predicted by the line.

In this case R² is about 0.88, so your interpretation is “about 88% of the variation in distance traveled is associated with variation in club-head speed.” Statisticians say that R² tells you how much of the variation in y is “explained” by variation in x, but if you use that word remember that it means a numerical association, not necessarily a cause-and-effect explanation. It’s best to stick with “associated” unless you have done an experiment to show that there is cause and effect.

There’s a subtle difference between r and R², so keep your interpretations straight. r talks about the strength of the association between the variables; R² talks about what part of the variation in the y variable is associated with variation in the x variable. Your interpretation of R² should not use any form of the word “correlated”.

Only linear regression will have a correlation coefficient r, but any type of regression (fitting any line or curve to a set of data points) will have a coefficient of determination R² that tells you how well the regression equation predicts y from the independent variable(s). Steve Simon gives an example for non-linear regression in R-squared.

## Step 3. Display the Regression Line

 Show line with original data points. [`◆` `F3` makes `GRAPH`]

What is this line, exactly? It’s the one unique line that fits the plotted points best. But what does “best” mean?

The same four points on left and right. The vertical distance from each measured data point to the line, y − ŷ, is called the residual for that x value. The line on the right is better because the residuals are smaller.
source: Dabes & Janik [full citation at https://BrownMath.com/swt/sources.htm#so_Dabes1999]

For each plotted point, there is a residual equal to y − ŷ, the difference between the actual measured y for that x and the value predicted by the line. Residuals are positive if the data point is above the line, or negative if the data point is below the line.

You can think of the residuals as measures of how bad the line is at prediction, so you want them small. For any possible line, there’s a “total badness” equal to taking all the residuals, squaring them, and adding them up. The least squares regression line means the line that is best because it has less of this “total badness” than any other possible line. Obviously you’re not going to try different lines and make those calculations, because the formulas built into your calculator guarantee that there’s one best line and this is it.
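
You really don't need to try other lines, but a quick experiment can make the idea of “total badness” less abstract. The Python sketch below computes the sum of squared residuals for the regression line and for a few other lines that I picked arbitrarily for comparison; the function name `sse` and the comparison lines are assumptions made for this illustration.

```python
speed = [100, 102, 103, 101, 105, 100, 99, 105]      # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]  # y, yards

def sse(a, b):
    """'Total badness': sum of squared residuals for the line y-hat = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(speed, distance))

# The least squares line from Step 2
best = sse(3.1661, -55.7966)

# A few other candidate lines (arbitrary choices, for comparison only)
others = {
    "flatter, through the mean point": sse(3.0, -38.875),
    "steeper, through the mean point": sse(3.3, -69.4375),
    "same slope, shifted up": sse(3.1661, -55.0),
}

print(f"regression line: {best:.2f}")
for label, value in others.items():
    print(f"{label}: {value:.2f}")
```

Each of the other lines has a larger total than the regression line, and the same would be true of any other line you tried.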

See also: Once you have the regression line, you can use the calculator to predict the y value for any x in the model.

## Appendix: Display the Residuals

Some profs want you to plot or compute residuals, and some don’t. Even if your prof doesn’t require this, it’s good to plot the residuals anyway, because that’s an important check on whether the linear model is actually a good choice for your data set. If you need to calculate individual residuals, see the last section of How to Find ŷ from a Regression on TI-89.

“No regression analysis is complete without a display of the residuals to check that the linear model is reasonable.”

DeVeaux 2009 [full citation at https://BrownMath.com/swt/sources.htm#so_DeVeaux2009], page 227

The residuals are automatically calculated during the regression, and stored in a `resid` list in your Stats/List Editor. All you have to do is plot them on the y axis against your existing x data. This is an important final check on your model of the straight-line relationship.

1. Return to the editor; notice that a `resid` list has appeared and contains the residuals. [`◆`] [`APPS`] and select `Stats/List Editor`.
2. Turn off other plots. [`F2`] [`3`] [`F2`] [`4`]
3. Set up the plot of residuals against the x data. [`F2`] [`1`] [`▼`] [`F1`] selects Plot 2 and opens a dialog box. You want these settings:
   - Plot type: Scatter
   - Mark: anything except dot (because a data dot looks just like a dot on the grid)
   - X: [`alpha`] [`L`] [`X`]
   - Y: To get `statvars\resid`, press [`2nd` `−` makes `VARLINK`] and scroll down to `STATVARS`. Press [`►`] to expand it if necessary. Scroll down to `resid` and press [`ENTER`].
   - Use Freq and categories: NO

   Press [`ENTER`] to complete the definition.
4. Display the plot. [`F5`] displays the plot.
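
If you'd like to see what should be in the `resid` list, the residuals can be sketched in Python as follows. This assumes the usual definition, residual = y − ŷ; the variable names are mine, and the values are computed from the unrounded slope and intercept, so they may differ in the last digit from anything you compute with the rounded equation.

```python
speed = [100, 102, 103, 101, 105, 100, 99, 105]      # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]  # y, yards

n = len(speed)
x_bar = sum(speed) / n
y_bar = sum(distance) / n

# Least squares slope and intercept (unrounded)
a = (sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, distance))
     / sum((x - x_bar) ** 2 for x in speed))
b = y_bar - a * x_bar

# Residual = measured y minus predicted y-hat, one per data point
resid = [y - (a * x + b) for x, y in zip(speed, distance)]

for x, e in zip(speed, resid):
    print(f"x = {x:3d}   residual = {e:+.2f}")

above = sum(1 for e in resid if e > 0)   # points above the line
print(f"{above} above the line, {n - above} below; sum = {sum(resid):.6f}")
```

The residuals of a least squares line always add up to zero (apart from rounding), so a residual plot is always balanced around the x axis; what you're checking for is a pattern.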

You want the plot of residuals versus x to be “the most boring scatterplot you’ve ever seen”, in De Veaux’s words (page 203). “It shouldn’t have any interesting features, like a direction or shape. It should stretch horizontally, with about the same amount of scatter throughout. It should show no bends, and it should have no outliers. If you see any of these features, find out what the regression model missed.”

Don’t worry about the size of the residuals, because [`F5`] fits the window to the data, adjusting the vertical scale so that they take up the full screen.

If the residuals are more or less evenly distributed above and below the axis and show no particular trend, you were probably right to choose linear regression. But if there is a trend, you have probably forced a linear regression on non-linear data. If your data points looked like they fit a straight line but the residuals show a trend, it probably means that you took data along a small part of a curve.

Here there is no bend and there are no outliers. The scatter is pretty consistent from left to right, so you conclude that distance traveled versus club-head speed really does fit the straight-line model.

### Residual Plot Showing Problems

Refer back to the scatterplot of f/stop against shutter speed. I said then that it was not a straight line, so you could not do a linear regression. If you missed the bend in the scatterplot and did a regression anyway, you’d get a correlation coefficient of r = 0.98, which would encourage you to rely on the bad regression. But plotting the residuals (at right) makes it crystal clear that linear regression is the wrong type for this data set.

This is a textbook case (which is why it was in a textbook): there’s a clear curve with a bend, variation on both sides of the x axis is not consistent, and there’s even a likely outlier.
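
The following Python sketch repeats that mistaken regression on the f/stop data, just to show how a high r can coexist with a clearly patterned set of residuals. The variable names are my own, and printing only the signs of the residuals is a crude stand-in for the residual plot.

```python
from math import sqrt

shutter = [1/1000, 1/500, 1/250, 1/125, 1/60, 1/30, 1/15, 1/8]  # x
f_stop = [2.8, 4, 5.6, 8, 11, 16, 22, 32]                       # y

n = len(shutter)
x_bar = sum(shutter) / n
y_bar = sum(f_stop) / n
sxx = sum((x - x_bar) ** 2 for x in shutter)
syy = sum((y - y_bar) ** 2 for y in f_stop)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(shutter, f_stop))

a = sxy / sxx                 # slope of the (inappropriate) line
b = y_bar - a * x_bar         # intercept
r = sxy / sqrt(sxx * syy)     # correlation coefficient

# Residuals in order of increasing shutter time, reduced to their signs
resid = [y - (a * x + b) for x, y in zip(shutter, f_stop)]
signs = "".join("+" if e > 0 else "-" for e in resid)

print(f"r = {r:.2f}")
print("residual signs, left to right:", signs)
```

The residuals run negative, then positive, then negative again, instead of scattering randomly above and below zero. That arch is the bend that the high r hides.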

### Advanced: Residuals and R²

I said in Step 2 that the coefficient of determination measures the variation in the measured y associated with the measured x. Now that we have the residuals, we can make that statement more precise and perhaps a little easier to understand.

The set of measured y values has a spread, which can be measured by the standard deviation or the variance. It turns out to be useful to consider the variation in y’s as their variance. (You remember that the variance is the square of the standard deviation.)

The total variance of the measured y’s has two components: the so-called “explained” variation, which is the variation along the regression line, and the “unexplained” variation, which is the variation away from the regression line. The “explained” variation is simply the variance of the ŷ’s, computing ŷ for every x, and the “unexplained” variation is the variance of the residuals. Those two must add up to the total variance of the measured y’s, which means that if we express them as percentages of the variation in y then the percentages must add to 100%. So R² is the percent of “explained” variation in the regression, and 100%−R² is the percent of “unexplained” variation.

R² = s²ŷ / s²y and 1 − R² = s²e / s²y

Now I can restate what you learned in Step 2. R² is 88% because 88% of the variance in y is associated with the regression line, and the other 12% must therefore be the variance in the residuals. This isn’t hard to verify: do a 1-VarStats on the list of measured y’s and square the standard deviation to get the total variance in y, s²y = 59.93. Then do 1-VarStats on the residuals list and square the standard deviation to get the “unexplained” variance, s²e = 7.12. The ratio of those is 7.12/59.93 = 0.12, which is 1−R². Expressing it as a percentage gives 100%−R² = 12%, so 12% of the variation in measured y’s is “unexplained” (due to lurking variables, measurement error, etc.).
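
The same verification can be sketched in Python, with `statistics.variance` standing in for “1-VarStats, then square the standard deviation.” The variable names are assumptions made for this example.

```python
from statistics import mean, variance

speed = [100, 102, 103, 101, 105, 100, 99, 105]      # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]  # y, yards

x_bar, y_bar = mean(speed), mean(distance)
a = (sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, distance))
     / sum((x - x_bar) ** 2 for x in speed))
b = y_bar - a * x_bar

y_hat = [a * x + b for x in speed]                   # predicted values
resid = [y - yh for y, yh in zip(distance, y_hat)]   # residuals

var_y = variance(distance)   # total variance of the measured y's
var_hat = variance(y_hat)    # "explained" variance
var_e = variance(resid)      # "unexplained" variance

print(f"total       = {var_y:.2f}")
print(f"explained   = {var_hat:.2f}  ({var_hat / var_y:.0%})")
print(f"unexplained = {var_e:.2f}  ({var_e / var_y:.0%})")
```

The explained and unexplained variances add up to the total, and their shares of the total are R² and 1 − R².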

## What’s New?

Updates and new info: https://BrownMath.com/ti83/