# Normality Check and Finding Outliers in Excel

Copyright © 2015–2017 by Stan Brown

Copyright © 2015–2017 by Stan Brown

**Summary:**
In inferential statistics, you need to know whether small
data sets are normally distributed and free of outliers.
This page shows you how to do it in Excel.

**Caution!** Any time you change any numbers in the
Sample box, you must click the *Go!* button. This runs a necessary
macro that sorts the data and computes the quartiles.

The sheet is protected so that you can’t accidentally change a formula, but you can look at all the formulas by clicking into cells, and you can examine the Go! macro by clicking View » Macros.

**On a TI-83/84**, you have two choices: Normality Check on TI-83/84, or the MATH200A program.-
**On a TI-89**, you have to do the plot and the computations yourself. See the step-by-step procedure in Normality Check on TI-89.

Consider these vehicle weights (in pounds):

2950 | 4000 | 3300 | 3350 | 3500 | 3550 | 3500 | 2900 | 3250 | 3350 |

The question is whether they are nearly normally distributed, and free of outliers. As you see, the numbers are already entered in the Excel sheet.

To test for outliers, the usual tool is a boxplot or box-whisker diagram. Boxplots are quite difficult to do in Excel; see for example Box Plot and Whisker Plots in Excel 2007. (That page actually works for later Excels too.)

But the boxplot is based
on calculations that are easy in Excel: find the interquartile range
(IQR), then set **fences** equal to Q1−1.5×IQR
and Q3+1.5×IQR. Any values outside the fences are outliers.

In the workbook, this is done in the top half of columns I and J. You can see that the first and third quartiles are 3250 and 3500 pounds, so the IQR is 250. The fences are 2875 and 3875 pounds. There are no outliers on the low end, but there’s one on the high end. By looking at the sorted data in column F, you can identify 4000 pounds as the outlier.

The main tool for testing normality
is a **normal probability plot**.
Actually, no real-life data set is exactly normal, but you use
that plot to test whether a data set is
**close enough to normally distributed**.
The closer the data set is
to normal, the closer the plot will be to a straight line.

Just looking at a plot, you may not be sure whether
it’s **“close enough” to a straight line**,
especially with smaller data sets. Most of the time, you
need to make some fairly gnarly computations to answer that
question: see Appendix —
The Theory.

But the Excel workbook handles all that for you. All you have to do, if the normal probability plot is inconclusive, is look at r and the critical number CRIT. If r > CRIT, you’re safe to treat the sample as normally distributed; if r < CRIT, you can’t treat the sample as normally distributed or use it for inferential statistics.

As you see in the middle section of columns I and J, r = 0.9599 and CRIT = 0.9179. r > CRIT, so the data are near enough to a normal distribution.

Feel free to explore by changing, adding or removing numbers.
Just remember to click the *Go!* button after any changes.

When you click the *Go!* button, a macro
runs to sort the data and update the normal probability plot. If
you’re interested in the VBA code that does this, press
Alt+F11 to open the Visual Basic Editor.

There are some sample data sets in Sheet2; you can copy them into the Sample box with Edit » Paste Special » Values.

Many of these data sets are nearly normal but still have outliers. The accompanying workbook started out as a simple simulation using @RISK from Palisade Corporation to see whether such a thing was possible, but once I had the data sets I didn’t want to throw them away. ☺)

The basic idea isn’t too bad. You make an
** xy scatterplot where the x’s are the data points, sorted in ascending order, and the y’s are the expected z scores for a normal distribution**.
(I’m going to
abbreviate “normally distributed” or “normal
distribution” as ND to save wear and tear on my keyboard and your
eyes.)

Why would you expect that to
be a straight line? Recall the formula for a
z score: z = (x−x̅)/s. Breaking the one
fraction into two, you have z = x/s−x̅/s.
That’s just a linear equation, with slope 1/s and intercept
x̅/s.
So an *xz* plot of any theoretical ND, plotting each data
point’s z score against the actual data value, would be
a straight line.

Further, if your actual data points are ND, then their actual z scores will match their expected-for-a-normal-distribution z scores, and therefore a scatterplot of expected z scores against actual data values will also be a straight line.

Now, in real life no data set is ever exactly a ND, so
you won’t ever see a perfectly straight line.
Instead, you say that the closer the points are to a straight line, the
closer the data set is to normal. If the data points
are too far from a straight line — if their correlation
coefficient *r* is lower than some critical value — then
you reject the idea that the data set is ND.

Okay, so you have to plot the data points against what their
z-scores should be if this is a ND, and specifically for a
sample of *n* points from a ND, where *n* is your sample
size.
This must be built
up in a sequence of steps:

**Divide the normal curve (mentally) into**For technical reasons, the probability number you use for region*n*regions of equal probability and take one probability from each region.*i*is (*i*−.375)/(*n*+.25). This formula is in many textbooks, and also in Ryan and Joiner’s paper Normal Probability Plots and Tests for Normality [full citation at https://BrownMath.com/swt/sources.htm#so_Ryan1976].**Compute the expected z scores for those probabilities.**Working with the calculator, that’s just`invNorm`of (*i*−.375)/(*n*+.25).**Plot those expected z scores against the data values.**This*xy*plot (or*xz*plot) has a correlation coefficient*r*, computed just like any other correlation coefficient.-
**Compare the**Ryan and Joiner determined that the critical value for sample size*r*for your data set to the critical value for the size of your data set.*n*, at the 0.05 significance level,, is 1.0063−.1288/√*n*−.6118/*n*+1.3505/*n*². To make it a little easier on the calculator I rearranged it as 1.0063−.6118/*n*+1.3505/*n*²−.1288/√*n*.In the same paper, they gave formulas for critical values at other significance levels:1.0071−0.1371/√

*n*−0.3682/*n*+0.7780/n² at α=0.100.9963−0.0211/√

*n*−1.4106/*n*+3.1791/*n*² at α=0.01

The closer the points are to a straight line, the
closer the data set is to fitting a normal model.
In other words, a larger *r*
indicates a ND, and a smaller *r* indicates a
non-ND. You can draw one of two conclusions:

**If**, reject the hypothesis of normality at the 0.05 significance level and say that the data set is not ND.*r*is less than the critical value(If you haven’t studied hypothesis testing yet, another way to say it is that you’re pretty sure the data set doesn’t fit the normal model because there’s less than a 5% probability that it does.)

**If**, fail to reject the hypothesis that the data set comes from a ND.*r*is greater than the critical valueThis doesn’t mean you are certain it

*does*, merely that you can’t rule it out. Technically you don’t know either way, but practically it doesn’t matter. Remember (or you will learn later) that inferential statistics procedures like t tests are**robust**, meaning that they still work even if the data are moderately non-normal. But if your data were extremely non-normal,*r*would be less than the critical value. When*r*is greater than the critical value, you don’t know whether the data set comes from normal data or moderately non-normal data, but either way your inferential statistics procedures are okay.

So **the bottom line** is, if *r* > CRIT,
treat the data as normal, and if *r* < CRIT,
don’t.

The normal probability plot is just one of many possible ways to determine whether a data set fits the normal model. Another method, the D’Agostino-Pearson test, uses numerical measures of the shape of a data set called skewness and kurtosis to test for normality. For details, see Assessing Normality in Measures of Shape: Skewness and Kurtosis.

**19 Dec 2016**: Add critical values for the Ryan-Joiner test at significance levels 0.10 and 0.01.**3 Jan 2016**: In the Excel workbook, rearrange some cells for better flow, improve some headings, clean up pointers to Oak Road Systems or TC3, and add a “donate” button.- (intervening changes suppressed)
**25 Jan 2015**: New article and workbook.

Because this article helps you,

please click to donate!Because this article helps you,

please donate at

BrownMath.com/donate.

please click to donate!Because this article helps you,

please donate at

BrownMath.com/donate.

Updates and new info: https://BrownMath.com/stat/