# Least Squares — the Gory Details

How Do We Find the Line of Best Fit?

Copyright © 2002–2020 by Stan Brown

How Do We Find the Line of Best Fit?

Copyright © 2002–2020 by Stan Brown

You have a set of observed points (*x*,*y*). You’ve plotted
them and they seem to be pretty much linear. How do you find the line
that best fits those points?

“Don’t be silly,” you say. “Put them into a TI-83 or Excel and look at the answer.”

Okay, you got me. I’ll ask a deeper question: How does the calculator find the answer? What are the underlying equations?

Because this article helps you,

please click to donate!Because this article helps you,

please donate at

BrownMath.com/donate.

please click to donate!Because this article helps you,

please donate at

BrownMath.com/donate.

**See also:**
The least-squares method involves summations. If you’re
shaky on your ∑ (sigma) notation, see
“∑ Means Add
’em Up”.

To answer that question, first we have to agree on **what we mean by the “best fit”** of a line to a set
of points. Why do we say that the line on the left fits the points
better than the line on the right? And can we say that some other line
might fit them better still?

Intuitively, we think of a *close* fit as a
*good* fit. We look for a line with little space between the
line and the points it’s supposed to fit. We would say that the
best fitting line is the one that has **the least
space between itself and the data points**, which represent
actual measurements.

Okay, what do we mean by “least space”? There are three ways to
measure the space between a point and a line: vertically in the *y*
direction, horizontally in the *x* direction, and on a perpendicular to
the line. We choose to **measure the space
vertically**. Why? Because our whole purpose in making a
regression line is to use it to predict the *y* value for a given *x*, and
the vertical distances are how far off the predictions would be for
the points we actually measured.

That vertical deviation, or prediction error, is
called the **residual**, *y*−*ŷ*. Since the line
predicts a *y* value (symbol *ŷ*) for every *x* value, and there’s an
actual measured *y* value for every *x* value, there is a residual for
every *x* value in the data set.

For example, suppose the line is *y*=3*x*+2 and we have
a measured data point (2,9).
The actual *y* is 9 and the line predicts *ŷ*=3·2+2=8.
Subtracting, we can say that the residual for *x*=2, or the residual for
the point (2,9), is 9−8 = 1.
This is a positive number because the actual value is greater than the
predicted value, and the line passes below the data point (2,9).
In general, between any given point (*x*,*y*)
and the line *y*=*m**x*+*b* the residual (vertical gap) is *y*−(*m**x*+*b*).

But each residual could be negative or positive, depending on
whether the line passes above or below that point. We can’t simply add
up residuals, because then a line would be considered good if it fell
way below some points as long as it fell way above others. To prevent
that, we’ll **square each residual**, and add
up the squares. (This also has the desirable effect that a few small
deviations are more tolerable than one or two big ones.)

And at long last we can say exactly what we mean by the line of
best fit. If we compute the residual for every point, square each
one, and add up the squares, we say the line of best fit is the line
for which that sum is the least. Since it’s a sum of squares, the
method is called the **method of least
squares**.

It’s always a giant step in finding something to get clear on what
it is you’re looking for, and we’ve done that. The best-fit line, as
we have decided, is the line that **minimizes the
sum of squares of residuals**. For any
given line *y*=*m**x*+*b*, we can write that sum as

*E*(*m*,*b*) = ∑(*y* − *ŷ*)²

where *ŷ* is the predicted value for a given *x*,
namely *m**x*+*b*, and *y* is the actual value measured for that given *x*.

*E* is a function of *m* and *b* because the
sum of squared residuals is different for different lines *y*=*m**x*+*b*. *E* is
*not* a function of *x* and *y* because the data points are what
they are, and don’t change within any given problem.

Do we just try a bunch of lines, compute their *E* values, and pick
the line with the lowest *E* value?
No, it would be a lot of work without proving
anything — a lose-lose — because
we could never be sure that
there wasn’t some other line with still a lower *E*.

Instead, we use a **powerful and common
trick** in mathematics: We *assume* we know the line,
and use its properties to help us find its identity. Here’s how that
works.

What is the line of best fit? It’s *y*=*m**x*+*b*, because *any*
line (except a vertical one) is *y*=*m**x*+*b*. We happen not to know *m* and *b*
just yet, but we can use the properties of the line to find them.

What is the **chief property of the
line**? It is that *E* is less for this line than for any other
line that might pass through the same set of points. In other words,
** E(m,b) is minimized by varying m and b**. Let’s
look at how we can write an expression for

The squared residual for any one point follows from the definition I gave earlier:

residual² = (*y* − *ŷ*)² =
(*y* − (*m**x*+*b*))²

Since (A−B)² = (B−A)², let’s reverse the subtraction to get rid of a layer of parentheses:

residual² = (*y* − *ŷ*)² =
(*m**x* + *b* − *y*)²

Now multiply:

residual² =
(*y* − *ŷ*)² =
*m*²*x*² + 2*b**m**x* + *b*² − 2*m**x**y* − 2*b**y* + *y*²

The sum of squared residuals for a line *y*=*m**x*+*b* is found by
summing over all points:

*E*(*m*,*b*) = ∑residual² =
∑(*y* − *ŷ*)²

*E*(*m*,*b*) = ∑(*m*²*x*² + 2*b**m**x* + *b*² − 2*m**x**y*
− 2*b**y* + *y*²)

*E*(*m*,*b*) = *m*²∑*x*² + 2*b**m*∑*x* + *n**b*² − 2*m*∑*x**y* − 2*b*∑*y* + ∑*y*²

Once we find the *m* and *b* that minimize *E*(*m*,*b*), we’ll know
the exact equation of the line of best fit.

Incidentally, why is there no ∑
in the third term of the final expression for *E*(*m*,*b*)? Because *b*² in
the previous line is a property of the line that we’re looking
for and doesn’t vary from point to point. Adding up *b*² once
for each of the *n* points gives *n**b*².

As soon as you hear “minimize”, you think
“calculus”.
(Well, you do if you’ve *taken*
calculus!)
And indeed
calculus can find *m* and *b*. Surprisingly, we can also find *m* and *b*
using plain algebra.

It’s not entirely clear who invented the method of least squares. Most authors attach it to the name of Karl Friedrich Gauss (1777–1855), who first published on the subject in 1809.

But the Frenchman Adrien Marie Legendre (1752–1833) “published a clear explanation of the method, with a worked example, in 1805” according to Stephen Stigler in Statistics on the Table (Cambridge, Massachusetts; Harvard University Press, 1999; see Chapter 17). In setting up the new metric system of measurement, the meter was to be fixed at a ten-millionth of the distance from the North Pole through Paris to the Equator. Surveyors had measured portions of that arc, and Legendre invented the method of least squares to get the best measurement for the whole arc.

Using calculus, a function has its **minimum
where the derivative is 0**. Since we need to adjust both *m* and
*b*, we take the **partial derivative** of *E* with respect to *m*, and
separately with respect to *b*, and set both to 0:

*E _{m}* =
2

and

*E _{b}* =
2m∑

Each equation then gets divided by the common
factor 2, and the terms not involving *m* or *b* are moved to the other
side. With a little thought you can recognize the result as two
simultaneous equations in *m* and *b*, namely:

(∑*x*²)*m* + (∑*x*)*b* = ∑*x**y*
and
(∑*x*)*m* + *n**b* = ∑*y*

The summation expressions are all just numbers,
the results of summing *x* and *y* in various combinations.

These simultaneous equations can be solved like any others: by
substitution or by linear combination. Let’s try substitution. The
second equation looks easy to solve for *b*:

*b* = (∑*y* − *m*∑*x*) / *n*

Substitute that in the other equation and you eventually come up with

*m* = ( *n*∑*x**y* − (∑*x*)(∑*y*) ) / ( *n*∑*x*² − (∑*x*)² )

*b* = ( (∑*x*²)(∑*y*) − (∑*x*)(∑*x**y*) ) / ( *n*∑*x*² − (∑*x*)² )

And that is very probably what your calculator (or Excel) does: Add
up all the *x*’s, all the *x*², all the *x**y*, and so on, and compute
the coefficients. It’s tedious, but not hard. (Usually these equations
are presented in the shortcut form shown
later.)

How do we know that this *m* and *b* will
give us a minimum *E* and not a maximum or saddle point?
It’s obvious that no matter how badly a
line fits, no matter how large its *E*(*m*,*b*), it’s always possible
to find a worse line, one that is further away from all the points. It
seems reasonable that there should be a best line, one that
can’t be improved on, but can we prove that?

There is a second derivative test for two variables, but it’s
more complicated than the second derivative test for one variable. You
have a minimum *E* for particular values of *m* and *b* if three conditions
are all met:

(a) The first partial derivatives *E _{m}*
and

(b) The **determinant of the Hessian matrix** must be
positive. For independent variables *m* and *b*, that determinant is
defined in terms of second partial derivatives as

*D* =
*E _{mm}*

which evaluates to

*D* = (2∑*x*²)(2*n*) − (2∑*x*)²

*D* = 4*n*∑*x*² − 4(∑*x*)² =
4[*n*∑*x*² − (∑*x*)²]

The average of the *x*’s is * x̅ =
∑x/n*, so

*D* = 4(*n*∑*x*² − *n*²*x̅*²) =
4*n*(∑*x*² − *n**x̅*²)

Remember, we need to show that this is positive in order to be
sure that our *m* and *b* minimize the sum of squared residuals *E*(*m*,*b*).
To show that, consider the sum of the squares of
deviations between each *x* value and the average of all *x*’s:

∑(*x*−*x̅*)² =
∑(*x*² − 2*x**x̅* + *x̅*²)

∑(*x*−*x̅*)² =
∑*x*² − 2*x̅*∑*x* + *n**x̅*²

But ∑*x*/*n* = *x̅*, so
∑*x* = *n**x̅* and

∑(*x*−*x̅*)² =
∑*x*² − 2*n**x̅*² + *n**x̅*²

∑(*x*−*x̅*)² =
∑*x*² − *n**x̅*²

Look back at *D* = 4*n*(∑*x*²−*n**x̅*²).
4*n* is positive, since the number of points *n* is positive. The quantity in
parentheses must be positive because it equals
*∑( x−x̅)²*, which is a sum of squares.

(c) The second partial derivative with respect to either
variable must be positive. We have
* E_{bb} = 2n*, which is
positive, and therefore this condition is met. (It doesn’t matter which
variable you use; if condition (b) is met, then either both
second derivatives are positive or both are negative.)

Thus all three conditions are met, apart from pathological
cases like all points having the same *x* value, and the *m* and *b* you get
from solving the equations do minimize the total of the squared
residuals, *E*(*m*,*b*).

But you **don’t need calculus** to solve
every minimum or maximum problem. Look back again at the equation for
*E*, which is the quantity we want to minimize:

*E*(*m*,*b*) = *m*²∑*x*² + 2*b**m*∑*x* + *n**b*² − 2*m*∑*x**y* − 2*b*∑*y* + ∑*y*²

Now that may look intimidating, but remember that all
the sigmas are just constants, formed by adding up various
combinations of the (*x*,*y*) of the original points. In fact, collecting
like terms reveals that *E* is really **just a
parabola** with respect to *m* or *b*:

*E*(*m*) = (∑*x*²)*m*² + (2*b*∑*x* − 2∑*x**y*)*m* +
(*n**b*² − 2*b*∑*y* + ∑*y*²)

*E*(*b*) = *n**b*² + (2*m*∑*x* − 2∑*y*)*b* +
(*m*²∑*x*² − 2*m*∑*x**y* + ∑*y*²)

Both these parabolas are **open
upward**. (Why? because the coefficients of the *m*² and
*b*² terms are positive. The sum of *x*² must be positive unless
all *x*’s are 0; and of course *n*, the number of points, is positive.)
Since the parabolas are open upward, each one has a **minimum at its vertex**.

Where is the vertex for each of these parabolas? Well, recall
that a parabola *y*=*p**x*²+*q**x*+*r* has its vertex at -*q*/2*p*.
These are parabolas in *m* and *b*, not in *x*, but you can find the vertex
of each one the same way:

The vertex of *E*(*m*) is at *m* = ( −2*b*∑*x* + 2∑*x**y* ) /
2∑*x*² = ( ∑*x**y* − *b*∑*x* ) /
∑*x*²

The vertex of *E*(*b*) is at *b* = ( −2*m*∑*x* + 2∑*y* ) /
2*n* = ( ∑*y* − *m*∑*x* ) / *n*

Now there are two equations in *m* and *b*. Substitute one into the
other one, perhaps the second into the first, and the solution is

*m* = ( *n*∑*x**y* − (∑*x*)(∑*y*) ) / ( *n*∑*x*² − (∑*x*)² )

*b* = ( (∑*x*²)(∑*y*) − (∑*x*)(∑*x**y*) ) / ( *n*∑*x*² − (∑*x*)² )

These are exactly the equations obtained by the calculus method.

The formula for *m* is bad enough, and the formula for
*b* is a monstrosity. But if you compute *m* first, then it’s easier
to compute *b* using *m*:

*m* = ( *n*∑*x**y* − (∑*x*)(∑*y*) ) / ( *n*∑*x*² − (∑*x*)² )

*b* = ( ∑*y* − *m*∑*x* ) / *n*

Just to make things more concrete, here’s an example. Suppose that
*x* is dial settings in your freezer, and *y* is the resulting temperature
in °F. Here’s the full calculation:

n=5 | x | y | x² | xy |
---|---|---|---|---|

0 | 6 | 0 | 0 | |

2 | −1 | 4 | −2 | |

3 | −3 | 9 | −9 | |

5 | −10 | 25 | −50 | |

6 | −16 | 36 | −96 | |

∑ | 16 | −24 | 74 | −157 |

Now compute *m* and *b*:

5(−157) − (16)(−24) m = ------------------- 5(74) − 16² −401 m = ---- 114 m = −3.5175 −24 − (−3.5175)(16) b = -------------------- 5 32.28 b = ----- = 6.456 5

“These values agree precisely with the regression equation calculated by a TI-83 for the same data,” he said smugly.

Some authors give a different form of the solutions for *m* and *b*, such as:

*m* = ∑(*x*−*x̅*)(*y*−*y̅*) /
∑(*x*−*x̅*)²
and
*b* = *y̅* − *m**x̅*

where *x̅* and *y̅*
are the average of all *x*’s and average of all *y*’s.

These formulas are equivalent to the ones we derived earlier.
(Can you prove that? Remember that *n**x̅* is
∑ *x*, and
similarly for *y*.) While the *m* formula looks
simpler, it requires you to compute mean *x* and mean *y* first. If you do
that, here’s how the numbers work out:

n=5 | x | y |
x−x̅ | y−y̅ |
(x−x̅)² |
(x−x̅)(y−y̅) |
---|---|---|---|---|---|---|

0 | 6 | −3.2 | 10.8 | 10.24 | −34.56 | |

2 | −1 | −1.2 | 3.8 | 1.44 | −4.56 | |

3 | −3 | −0.2 | 1.8 | −0.04 | −0.36 | |

5 | −10 | 1.8 | −5.2 | 3.24 | −9.36 | |

6 | −16 | 2.8 | −11.2 | 7.84 | −31.36 | |

∑ | 16 | −24 | 0 | 0 | 22.8 | −80.2 |

mean | 3.2 | −4.8 |

Whew! Once you’ve got through that, *m* and *b* are only a little more work:

*m* = −80.2 / 22.8 = −3.5175

*b* = −4.8 − (−3.5175)(3.2) = 6.456

The simplicity of the alternative formulas is definitely deceptive.

**22 Oct 2020**: Converted page from HTML 4.01 to HTML5, italicized the variable names, and cleaned up table formatting. Made a few small textual changes for clarity.**13 May 2018**:- Replaced “deviations” with the standard term
*residual*starting here. - Instead of just
*E*, called the sum of squared residuals for a given line*E*(*m*,*b*). - Used subscript notation for partial derivatives instead of ∂ fractions, here and here.
- Improved the proof of condition (b) for the second partial derivative test.
- Changed equation pictures to text, here, here, here, here, here, here, here, here, here, here, here, and here.
- Replaced a bunch of en dashes U+2013 with minus signs U+2212, the proper character.

- Replaced “deviations” with the standard term
**6 May 2018**: Following a query from John Stallone, corrected a BTW that not only computed the second derivatives incorrectly but used an incomplete second derivative test.**14 Apr 2018**: Prompted by a reader’s query, explained why ∑*b*² =*n**b*².**9 Aug 2007**: New article.

Because this article helps you,

please click to donate!Because this article helps you,

please donate at

BrownMath.com/donate.

please click to donate!Because this article helps you,

please donate at

BrownMath.com/donate.

Updates and new info: https://BrownMath.com/stat/