# Inferences about Linear Correlation

Copyright © 2002–2020 by Stan Brown

**Summary:**
From the sample correlation it is possible to
**estimate the population correlation**.
This page previews the techniques of
**inferential statistics** that can estimate the correlation
coefficient of the population.

- A downloadable Excel workbook does all these calculations.
- On TI-83 and TI-84 calculators, the downloadable program MATH200B part 6 computes confidence intervals and hypothesis tests.

After you have done a correlation analysis on your sample of points, there are three questions you might ask:

- **Is there any correlation in the population**, or is the sample correlation just the luck of the draw?
- What is the estimated **correlation coefficient of the population**?
- Is there any **causal relationship** between the two variables?

The first two questions are matters of inferential statistics, and this page will explore them. (Some of the details may not make much sense till you’ve studied inference in the second half of the course.)

The third question is not a statistical
one. If you find a correlation, that suggests that a cause-and-effect
relationship may be worth looking for. But the mere fact that two
variables march more or less in step **does not let you conclude** that
one causes the other. For example, consider the variables “number of
murders per year” and “number of books in the public library”. If you
gather those values for many US cities, you will find a good
correlation. But one certainly does not cause the other; instead, they
are both explained by city population. Remember,
**“correlation is not causation”**.

Spiegel, Murray R., and Larry J. Stephens. 1999. Theory and Problems of Statistics. 3d ed. McGraw-Hill. See “Sampling Theory of Correlation” on page 317, and the solved problems on pages 336–337.

Some textbooks use *r*^{*} for the sample correlation
coefficient and *r* for the population correlation coefficient; others
use *r* for the sample and *ρ* (rho, pronounced “roe”) for
the population.

I’ll use *r* for the sample and *ρ* for the
population, consistent with the convention “Roman
letters for sample statistics, Greek letters for population
parameters”.

Why does this page exist?

Until summer 2007, the textbook for my
Statistics class was
Dabes & Janik 1999 [full citation at https://BrownMath.com/swt/sources.htm#so_Dabes1999]. That book presented decision points
based on the sample size and correlation coefficient. If
| *r* | was greater than the decision point for the particular
sample size, then you could say that there was correlation in the
population. The decision points were critical values at
α = 0.05 for a two-tailed test of the hypothesis
“population correlation coefficient *ρ* is
nonzero.”

I began to wonder where these numbers came from, and
couldn’t find any answers on the Web, so I created this page for
students who might also wonder where the numbers came from. I
programmed the accompanying Excel workbook
at first to check my numbers against Dabes & Janik’s,
but then expanded it to compute hypothesis tests and confidence
intervals for *ρ*.

You have some correlation coefficient *r* in your
sample, and you wonder whether there’s any correlation in the
population, or if the population has no correlation and this sample *r*
is just normal sample variability.
This is a **hypothesis test**. It asks,
**is there some correlation in the population?** In other words,
**is there a linear relationship between the two variables?** In
symbols, you want to know:
**is ρ ≠ 0?**

As always, to perform a hypothesis test you begin with the null
hypothesis. **H_{o} is ρ = 0**, meaning
that there is no
linear correlation in the population, no linear relationship between X
and Y.

Remember, any hypothesis test asks, “is my sample far
enough from H_{o} that I can rule out random chance?” You get a
handle on “far enough” by computing a p-value: the chance of
getting a sample this far from H_{o} if H_{o} is actually true.
To do that, you have to know what the
distribution of sample statistics looks like. In this case
you’re concerned with the **distribution of sample r**.
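If the true population
correlation *ρ* is 0, then the statistic

$$t = r\sqrt{\frac{n-2}{1-r^2}} \tag{1}$$

follows a Student’s t distribution with *df* = *n*−2 degrees of freedom.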

From the t statistic you can compute a p-value, which tells
you how likely it is that you would get at least the sample *r* that you
did, just by chance, if there is no correlation at all in the
population. If that p-value is very small (usually chosen with a
threshold of 0.05), you reject H_{o} and conclude that the sample
correlation is *not* due to chance and the population does have
some correlation.
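If “just by chance” feels abstract, a simulation can make it concrete. The following is a minimal Python sketch, not part of the Excel workbook or the MATH200B program. It draws many samples from a population with no correlation at all and counts how often |*r*| comes out at least as large as an observed value. The function names, the use of normally distributed data, the number of trials, and the inputs *n* = 20 and *r* = 0.49 are all assumptions chosen for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Sample linear correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulated_p_value(r_observed, n, trials=20000, seed=1):
    """Fraction of uncorrelated samples whose |r| is at least |r_observed|.

    Assumption: x and y are independent standard normal values, so the
    population correlation really is 0 (H0 is true by construction).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        if abs(pearson_r(xs, ys)) >= abs(r_observed):
            hits += 1
    return hits / trials

# Illustrative inputs: how often does a sample of 20 points from an
# uncorrelated population show |r| of 0.49 or more?
print(simulated_p_value(0.49, 20))
```

The printed fraction is an estimate of the two-tailed p-value, so it wobbles a little with the seed and the number of trials. The t statistic gives the same answer without simulation.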

(The first block of the accompanying Excel workbook will do these calculations for you. You can also use
MATH200B Program part 6,
or `LinRegTTest` on your TI-83/84 or TI-89.)

Before you can make any inference (hypothesis test or confidence interval) about correlation or regression in the population, check these requirements:

- The data are a **simple random sample**.
- The **plot of residuals versus *x* is featureless**: no bending, no thickening or thinning trend from left to right, and no outliers.
- The **residuals are normally distributed**. You can check this with a normal probability plot, available in most statistics packages and in MATH200A part 4. Since the test statistic is a t, and the t test is robust, moderate departures from normality are okay.

You measure *x*,*y* pairs from 20 individuals and find a correlation
coefficient of 0.49. Can you conclude that a correlation
exists in the population, at the 0.05 level of significance?

(To keep things simple, I’m just giving you the sample
size and *r* and skipping over the requirements check.)

**Solution**: Begin by writing down the hypotheses:

H_{o}: *ρ* = 0,
there is no linear correlation between the variables in the population.

H_{a}:
*ρ* ≠ 0, there is some linear correlation in the population.

This is a two-tailed test, and α = 0.05.

With *n* = 20 and *r* = 0.49,
use equation 1 to compute the test statistic
*t* = 2.38 with *df* = 18.
From a table or a calculator find two-tailed p = 0.0283.
This is < α, so you reject H_{o} and accept H_{a},
concluding that *ρ* ≠ 0 in the population.
Furthermore, since the sample correlation *r* is
> 0 you can conclude that the population correlation *ρ*
is also > 0.
(See p < α in Two-Tailed Test: What Does It Tell You?)
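If you would rather check those numbers in code than with a table, here is a minimal Python sketch using only the standard library. It assumes the test statistic has the usual form *t* = *r*√((*n*−2)/(1−*r*²)) and gets the p-value by integrating the t density numerically with Simpson’s rule. The function names and the step count are assumptions for this sketch, not anything taken from the workbook or the calculator program.

```python
import math

def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), with df = n - 2."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value: the area outside -|t| .. +|t|.

    Simpson's rule (steps must be even) integrates the density from 0 to |t|;
    doubling that gives the central area, and the p-value is what is left.
    """
    a = abs(t)
    if a == 0.0:
        return 1.0
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2.0 * total * h / 3.0
    return 1.0 - central_area

# The worked example: n = 20 pairs, sample r = 0.49.
n, r = 20, 0.49
t = t_statistic(r, n)
p = two_tailed_p(t, n - 2)
print(f"t = {t:.2f}, df = {n - 2}, two-tailed p = {p:.4f}")
```

With these inputs the sketch should reproduce the *t* = 2.38 and p ≈ 0.0283 found above. A statistics library would normally supply the t distribution directly; the hand-rolled integration is here only to keep the example self-contained.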

**Caution**: From this hypothesis test you
don’t know how large the population’s linear correlation
is, only that it’s greater than 0. You also don’t know
whether one variable drives the other, some third factor drives both,
or it’s just a remarkable coincidence.

You can precompute a decision point
(also known as critical value) for any sample size *n*
and any two-tailed significance level α.
If the absolute value of sample *r* is greater than that decision
point, there is some nonzero correlation in the population.

In fact, back in lesson 4 I presented Decision Points for Correlation Coefficient with no explanation of where they came from.

The decision points are found by working the previous
problem backward. First solve equation 1 for *r*:

$$r = \frac{t}{\sqrt{t^2 + n - 2}} \tag{2}$$
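In case the algebra isn’t obvious, here is one way to do it, as a sketch. It assumes the test statistic has the standard form *t* = *r*√((*n*−2)/(1−*r*²)). Square both sides, collect the *r*² terms, and take the square root:

$$
\begin{aligned}
t^2 &= \frac{r^2\,(n-2)}{1-r^2} \\
t^2\,(1-r^2) &= r^2\,(n-2) \\
t^2 &= r^2\,(t^2 + n - 2) \\
r &= \frac{t}{\sqrt{t^2 + n - 2}}
\end{aligned}
$$

The square root keeps the sign of *t*, because *t* and *r* always have the same sign.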

Then from the sample size *n*,
use MATH200B Program part 3 to find the critical *t* for
*df* = *n*−2 and α/2. Plug *n* and critical *t* into
the equation to find the critical *r*.

(The second block of the accompanying Excel workbook will do these calculations for you.)

For a sample size of *n* = 30, what is the decision
point at the 0.05 level of significance?

**Solution**: If *n* = 30 then *df* = 28.
Divide α by 2 to get 0.025, the one-tailed significance
level.
Then the critical *t* is *t*(28,0.025) = 2.048.
(You can find this using MATH200B Program part 3 in your TI-83/84, with the
Stats/List editor application on your TI-89,
with Excel, or with tables.)
Substituting *n* and *t* in
equation 2 yields critical *r* or a decision point of 0.361.

Conclusion: In a sample of 30 points,
at the 0.05 significance level,
if *r* > 0.361 then the
linear correlation coefficient of the population, *ρ*, is
positive. If
*r* < −0.361 then
*ρ* is negative. If *r* is between −0.361 and +0.361, you
can’t tell anything about *ρ*.
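The same steps can be sketched in Python for any sample size. This is only an illustration under stated assumptions: it uses the standard library alone, finds the critical *t* by bisection on a numerically integrated t distribution rather than by table lookup, and then applies *r* = *t*/√(*t*² + *n* − 2). The function names, the search range of 0 to 100, and the step counts are choices made for this sketch.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on the interval 0..t."""
    if t == 0.0:
        return 0.5
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3.0

def critical_t(df, alpha):
    """Critical t for a two-tailed test: upper-tail area equals alpha / 2."""
    lo, hi = 0.0, 100.0  # assumed search range; ample for ordinary alpha
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if upper_tail(mid, df) > alpha / 2.0:
            lo = mid   # tail still too big, so the critical t is larger
        else:
            hi = mid
    return (lo + hi) / 2.0

def critical_r(n, alpha=0.05):
    """Decision point for |r| with sample size n (needs n >= 3)."""
    t = critical_t(n - 2, alpha)
    return t / math.sqrt(t * t + n - 2)

print(f"n = 30: critical t = {critical_t(28, 0.05):.3f}, "
      f"critical r = {critical_r(30):.3f}")
```

For *n* = 30 this should agree with the critical *t* of 2.048 and the decision point of 0.361 worked out above. Looping `critical_r` over a range of *n* values gives a table of decision points.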

This business of “different from zero” is all very well
theoretically, but wouldn’t it be more useful to estimate the
linear correlation coefficient of the population (*ρ*) from the
linear correlation coefficient of the sample (*r*) in a confidence
interval?

Indeed it would, but there’s a catch. The t distribution from
equation 1 worked only when testing for a population correlation
coefficient of 0, which would give a symmetrical sampling distribution
of *r*. But when computing a confidence interval for *ρ*, you
can’t also assume that *ρ* = 0 and therefore you
can’t use that t statistic.

To resolve this paradox, use Fisher’s Z transformation, which is defined like this:

$$Z = \frac{1}{2}\,\ln\!\left(\frac{1+r}{1-r}\right) \tag{3}$$

where “ln” is the natural or base-e logarithm. (Your calculator has
a key for it, and you use LN( ) in Excel.)
Fisher’s Z is a bit nasty to compute, but it is approximately normally
distributed no matter what the population *ρ* might be.
Its standard deviation is 1/√(*n*−3).

To compute a confidence interval for *ρ*, transform *r* to
*Z* and compute the confidence interval of *Z* as you would
for any normal distribution with
σ = 1/√(*n*−3).
To transform the *Z* interval to an interval about *ρ*, you need to
solve equation 3 for *r*, like this:

$$r = \frac{e^{2Z} - 1}{e^{2Z} + 1} \tag{4}$$
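For anyone who wants to see the intermediate steps, here is a sketch of the algebra. It assumes Fisher’s transformation in its usual form, *Z* = ½ ln((1+*r*)/(1−*r*)). Double both sides, exponentiate, and collect the *r* terms:

$$
\begin{aligned}
2Z &= \ln\!\left(\frac{1+r}{1-r}\right) \\
e^{2Z} &= \frac{1+r}{1-r} \\
e^{2Z}\,(1-r) &= 1+r \\
e^{2Z} - 1 &= r\,\left(e^{2Z} + 1\right) \\
r &= \frac{e^{2Z} - 1}{e^{2Z} + 1}
\end{aligned}
$$

This is the same function as the hyperbolic tangent, tanh *Z*, if your calculator or software offers it.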

Plug the low *Z* into equation 4 to compute the lower limit on
*ρ*, then plug in the high *Z* to compute the higher limit on
*ρ*.

(The third block of the accompanying Excel workbook will do these calculations for you. You can also use MATH200B Program part 3 on your TI-83 or TI-84.)

A sample of 25 points shows a linear correlation coefficient of
0.84. What is the 95% confidence interval for the correlation
coefficient in the population? (Again,
to keep things simple I’m giving you the sample statistics
instead of the raw data, and we’ll assume that the requirements are met. But in real life,
**always check the requirements** before computing a confidence
interval.)

The solution is a wild ride; hang on! (The Excel spreadsheet shows the intermediate steps.)

(a) From 1−α = 0.95,
find α/2 = 0.025.
Use a table, use TI-83/84/89 invNorm, or use Excel NORMSINV( ) to find that
the 95% confidence interval is bounded by *z* = ±1.96.

(b) That critical *z* of ±1.96 bounds the confidence interval in the
*standard*
normal distribution with σ=1; for *this* one you must
multiply by the standard deviation of the Fisher Z, which is
1/√(*n*−3).
For *n* = 25 points that is
σ = 1/√(25−3) = 0.213.
Multiplying by the 1.96 from (a) gives
E = 0.418. E is the error of
the estimate, which is half the width of the confidence
interval for Fisher’s transformed Z.

(c) Now use equation 3 and *r* = 0.84
to compute *Z* = 1.221. This is the Fisher Z for this
particular sample.
Using the result from (b),
the confidence interval for the transformed *Z* is
1.221 ± 0.418, which is
0.803 to 1.639.

(d) Plug those Fisher-Z endpoints into equation 4.
*Z* = 0.803 yields *ρ* = 0.666, and
*Z* = 1.639 yields *ρ* = 0.927.

Conclusion: If 25 points have a linear correlation coefficient
of 0.84, then
**you’re 95% confident that the population’s linear correlation coefficient is between 0.666 and 0.927.**

Remark: The sample statistic 0.84 is not at the middle of the
confidence interval, because the sample *r* values have a skewed
distribution around the population correlation coefficient
*ρ*.
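Steps (a) through (d) can be strung together in a few lines of Python. This is a minimal sketch rather than the workbook’s method: it assumes Fisher’s transformation *Z* = ½ ln((1+*r*)/(1−*r*)) and its inverse *r* = (e^{2*Z*} − 1)/(e^{2*Z*} + 1), and it uses `statistics.NormalDist` from the standard library (Python 3.8 or later) in place of a table or `invNorm`. The function names are made up for the example.

```python
import math
from statistics import NormalDist

def fisher_z(r):
    """Fisher's Z for a sample correlation r; the same as math.atanh(r)."""
    return 0.5 * math.log((1 + r) / (1 - r))

def inverse_fisher_z(z):
    """Back-transform a Fisher Z to a correlation; the same as math.tanh(z)."""
    e = math.exp(2 * z)
    return (e - 1) / (e + 1)

def rho_confidence_interval(r, n, confidence=0.95):
    """Confidence interval for the population correlation (needs n > 3)."""
    alpha = 1 - confidence
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # step (a): about 1.96
    margin = z_crit / math.sqrt(n - 3)             # step (b): E
    z = fisher_z(r)                                # step (c): Fisher Z
    # Step (d): transform both endpoints back to the correlation scale.
    return inverse_fisher_z(z - margin), inverse_fisher_z(z + margin)

low, high = rho_confidence_interval(0.84, 25)
print(f"95% CI for rho: {low:.3f} to {high:.3f}")
```

For *r* = 0.84 and *n* = 25 the result should match the interval of 0.666 to 0.927 found above. Notice that the two endpoints are not the same distance from 0.84, which is the skewness mentioned in the remark.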

A downloadable Excel workbook (21 KB) will do all of these calculations for you:

- Hypothesis test to ask whether the population is correlated.
- Computation of critical values of *r* (decision points) for those hypotheses.
- Estimating *ρ*, the correlation coefficient of the population.

**Caution:** Depending on your Excel settings,
you may get a warning about protected mode, or macro content, or both.
Look for this below the Excel ribbon and above the worksheet. If you
save the workbook to your computer, you should only have to click
through those warnings once for
each workbook.

- **20 Oct 2020**: Converted the page from HTML 4.01 to HTML5, improved the formatting of radicals, and italicized the variable names.
- **3 Jan 2016**: Converted the Excel workbook to Excel 2007–2016 format. In the workbook, changed links to point to BrownMath.com instead of Oak Road Systems, and added a link to this page.
- **25 Dec 2009**: Rewrote the introductory section for the hypothesis test and added a requirements section; clarified the one-tailed interpretation of the two-tailed test; added a reference to the requirements in the CI example.
- (intervening changes suppressed)
- **15 June 2002**: New article.

Because this article helps you,

please donate at

BrownMath.com/donate.

Updates and new info: https://BrownMath.com/stat/