
Stats without Tears
2. Graphing Your Data

Updated 19 Jan 2017 (What’s New?)
Copyright © 2013–2017 by Stan Brown


Summary: To make sense out of a mass of raw data, make a graph. Non-numeric data want a bar graph or pie chart; numeric data want a histogram or stemplot. Histograms and bar graphs can show frequency or relative frequency.


Graph Paper for Free: Why buy an expensive pad of graph paper, especially if you only need a few sheets? You can print your own for free using Incompetech’s Plain Graph Paper PDF Generator or Math Worksheets Land’s Graph Paper. Both are sources not just for the ordinary square grid, but for various specialty graph papers.

2A.  Graphing Non-Numeric Data

Any graph of non-numeric data needs to show two things: the categories and the size of each. Probably you’re already familiar with the two most common types, which are the bar graph and pie chart.

The sizes of categories can be shown as raw counts, called frequencies, or percentages, called relative frequencies. (Relative frequencies can also be shown as decimals, but I think most people respond better to “20%” than “.20”.)

How do you decide whether to show frequencies or relative frequencies? This is a stylistic choice, not a matter of right and wrong. Your choice depends on what’s important, what point you’re trying to make. If your main concern is just with the individuals in your sample, go with frequencies. But if you want to show the relationship of the parts to the whole, show relative frequencies.

2A1.  Bar Graph

How Often Parents Read to Children under Age 12 (n=434)
How Often              Number of Parents
Every day              217
A few times a week     113
About once a week      39
A few times a month    26
Less often             30
Never                  9

Example 1: In fall 2012, the Pew Research Center (2013a) [see “Sources Used” at end of book] surveyed American adults on their habits of reading to their children. The survey included 434 adults who had at least one child under age 12, and the results are shown in the table.

(Remember, you can’t call the data numeric just because you see numbers in a summary statement. You have to go back to the individual data points, which are categorical: “every day”, “a few times a week”, and so on. If the Pew Center had asked “how many days a week do you read to your child?” and got answers 0, 1, 2, 3, 4, 5, 6, and 7, that would be a set of numeric data.)

Your bar chart or bar graph must follow these rules:

Usually the category axis is horizontal, so the frequency axis and the bars are vertical. But you can also make a horizontal bar chart, where the category axis is vertical and the frequency axis and bars are horizontal.

You can make a bar graph by hand, or use software such as Microsoft Excel. If you make a bar graph by hand, use graph paper and draw the axes and bars with a straightedge — wobbly bars make you look like you had a liquid lunch.

Here’s my bar graph for parents reading to children:

frequency bar graph on graph paper

A couple of comments on best practices:

Optional:  Bar Graph in Excel

Getting some kind of bar graph out of Excel is easy. But then there’s a lot of fiddling around to reverse some of Excel’s rather strange format choices. Here are instructions for Excel 2010. If you have Excel 2007, 2013, or 2016, you’ll find that they’re pretty similar.

  1. Get your categories into one column and your frequencies into the next column. The first row of each column should be the column headings from the table. Don’t enter a total row.
    Table data in two columns in Excel
  2. With your mouse, highlight all rows and columns of your table. (It doesn’t matter whether you include the column heads.) Click the Insert tab and then Charts » Column, and select the first 2-D column chart.
    First draft bar graph in Excel
  3. Right-click the useless legend at the right, “Series1” or “Number of Parents”, and select Delete.
  4. When you right-clicked the legend, three Chart Tools tabs appeared. On the Layout tab of the ribbon click Chart Title » Above Chart. Click into the chart title and type a better one. (Maybe Excel already gave your chart a title, but “Number of Parents” is the proper title of the frequency axis, not the whole chart.)
  5. Click Axis Titles » Primary Vertical Axis » Rotated Title. Click on the words “Axis Title” that appear in the chart, and type the new title “Number of Parents” for your frequency axis.
  6. If your category axis needs a title, click Axis Titles » Primary Horizontal Axis » Title Below Axis and enter the axis title.
  7. For some reason, the chart has tick marks between the categories. Right-click one of them, select Format Axis, and change Major tick mark type to None. That gives the chart you see here.
    Final bar graph in Excel
  8. You may have to tweak the formatting of the graph further; here are some suggestions. (If you try something and don’t like the result, press Ctrl-Z to undo the change.)

If you prefer a horizontal bar chart, it’s easy to make the change. Click into the chart area, then on the Design tab on the ribbon click Change Chart Type » Bar and select the first one.

Okay, well, nothing is that easy! Excel puts the categories in backwards order, so right-click the category axis and select Format Axis » Axis Options » Categories in reverse order. Still on the Axis Options dialog, click Horizontal axis crosses at maximum category.
Horizontal bar graph in Excel

Bar Graph with Relative Frequencies

The frequency bar graph tells us about the 434 individuals in the Pew Research Center’s sample. But why collect that sample except for what it can tell us about how often parents in general read to their children?

You know from Sampling Error in Chapter 1 that the proportions in the population are probably not the same as in the sample, but probably not very far off either. So you compute those proportions and then redraw your graph to show percentages instead of raw counts.

How Often Parents Read to Children under Age 12 (n=434)
How Often              Number of Parents    Rel. Freq.
Every day              217                  50%
A few times a week     113                  26%
About once a week      39                   9%
A few times a month    26                   6%
Less often             30                   7%
Never                  9                    2%

First, total all the frequencies to get the sample size n = 434. (In this case n is given already, but often it isn’t.) Then convert each frequency into a relative frequency. The formula, if you need one, is f/n. For example, 9 parents never read to their under-12 children. The relative frequency is f/n = 9/434 = 0.021 or 2%: 2% of parents never read to their children. Enter that and the other relative frequencies in the table, as shown at right.

You may see some bar graphs with relative frequencies as decimals. There’s nothing wrong with that for technical audiences, but general audiences usually respond better to percentages.

Your relative frequencies may not add up to exactly 100% (or 1.0000), because of rounding. Don’t change any of the numbers to force a total.
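If you'd rather let a computer do the arithmetic, the f/n calculation above can be sketched in a few lines of Python. Python is my assumption here (this chapter itself uses only hand work and Excel), and the variable names are made up for illustration. The frequencies are the ones from the table.

```python
# Frequencies from the table of how often parents read to children.
frequencies = {
    "Every day": 217,
    "A few times a week": 113,
    "About once a week": 39,
    "A few times a month": 26,
    "Less often": 30,
    "Never": 9,
}

# Sample size n is the total of all the frequencies.
n = sum(frequencies.values())

# Relative frequency of each category is f/n (kept unrounded).
relative = {category: f / n for category, f in frequencies.items()}

# Round to whole percentages for display only.
percents = {category: round(100 * rf) for category, rf in relative.items()}

for category, pct in percents.items():
    print(f"{category:<22}{frequencies[category]:>5}{pct:>5}%")
print(f"{'Total':<22}{n:>5}{sum(percents.values()):>5}%")
```

For this data set the rounded percentages happen to total exactly 100%. With other data they may not, and the sketch makes no attempt to force them to.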

Once you have your relative frequencies, you can make your bar graph. Choose round numbers for the tick marks on your relative frequency axis, for example every 5% or every 10%. I won’t inflict another of my sketches on you, but you can see a finished relative-frequency bar graph below.

Optional:  Relative Frequencies in Excel

To my surprise, I found that Excel doesn’t include relative-frequency bar graphs in its repertoire. You have to enter some formulas to compute the relative frequencies, and then create the graph from them. (Of course you could compute the relative frequencies yourself and enter them in Excel as numbers, but whenever possible I like to be lazy and make the computer do the work.)

  1. Enter the categories in a column, leave a blank column, and enter the frequencies. If you already have the categories and frequencies in adjacent columns, right-click on the letter at the top of the frequency column and select Insert.
    Excel with partial formula =sum(C2:C7) in cell C8
  2. Click into the cell below the last frequency, and type “=sum(” (without the quotes). Then with your mouse select the frequencies. Finally, type a closing parenthesis and hit the Enter key.
  3. In the address box just above the first column of the spreadsheet, type a unique name such as TOTPARENTS and press the Enter key. This makes it easier to refer to this total cell.
    Excel with TOTPARENTS in the name area
  4. Click into the empty relative-frequency cell for the first category. Type an = sign, then click on the first frequency cell. (In the illustration, the relative-frequency cells are in column B and the frequency cells are in column C.) Type /TOTPARENTS (including / mark for division) and press the Enter key.
    Excel with formula =C2/TOTPARENTS in cell B2
  5. Grab the “handle” at the lower right of the cell you just typed into, and drag it down to fill the Relative Frequency column.
    Excel with relative frequencies in cells B2 through B7
  6. Click the % sign in the ribbon to change the decimals to percentages. (The % sign is near the middle of the ribbon, on the Home tab.)

Now highlight the category and relative-frequency columns, click the Insert tab and the first 2-D column chart, and tweak the graph as you did before. Your result should be something like the one you see here.

Excel relative frequency bar graph

On this chart, neither axis really needs a label. The percent signs reinforce the message in the chart title that the bars show relative frequencies. And the category names together with the chart title tell the reader exactly what is being represented.

It’s a judgment call where to place tick marks on the relative-frequency axis, and you really need to look at the data to make a decision. Four categories are under 10%, so it makes sense to show the 5% line and help the reader get a sense of the relative sizes. Of course, if you show 5% then you have to show every 5% increment up to the top of the graph.

Side-by-Side Bar Graph

You may want to compare two populations: men and women, for instance, or one year versus another year. To do this, a side-by-side bar graph is ideal. A side-by-side bar graph has two bars for each category, and a legend shows the meaning of the bars.

The two populations you’re comparing are almost never the same size. Therefore side-by-side graphs almost always show relative frequencies rather than frequencies.

Example 2: In Educational Attainment, the Census Bureau (2014) [see “Sources Used” at end of book] showed the educational attainment of the population in selected years 1940 to 2012. I chose the years 1992 and 2012 and prepared this graph to show the change over that 20-year period.

side-by-side bar graph for educational attainment

What do you see? Comparing 2012 to 1992, the proportion of the population with no college (the first four categories) declined, and the proportion with some college or a college degree increased. You should be able to see why this has to be a relative-frequency chart: in a frequency chart, the larger population in 2012 would make all the bars taller than the 1992 bars, and you’d be hard put to see any kind of trend.

Stacked Bar Graph

Example 3: Another way to compare two populations is the stacked bar graph. In the side-by-side bar graph, above, each group of bars was one category, and each bar within a group was a population. With the stacked bar graph, you have one bar for each population, and one piece of that bar for each category. (A stacked bar graph is kind of like an unrolled pie chart.)

Here’s a stacked bar graph for the same data set:

stacked bar graph, as described in text


What do you see? Look first at the legend that lists the categories, then at the two bars. The top two segments represent some college. In 1992, about 56% of adults had no education beyond high school. But in 2012, only about 42% had a high-school diploma or less, meaning that 58% had at least some college. The proportions of college and no college were reversed in those 20 years.

You can also see that, though the group with four years of high school shrank, it didn’t shrink as much as the group with college grew. In other words, it’s not just more high-school graduates going on to college, it’s a higher proportion of the population entering high school. All the categories without a high-school diploma shrank. In 1992, 20% of adults had less than a high-school diploma and 80% were high-school graduates; in 2012, only about 12% had less than a high-school diploma and 88% had graduated from high school.

What’s the best way to compare two populations? The answer depends on what you’re trying to show. The side-by-side graph seems to be better at showing how each category changed, and the stacked graph is usually better at showing the mix, especially if you want to group the categories mentally. In the side-by-side graph, you can easily see the decline in adults with a fourth-grade education or less, but the shift to a college-educated population is much harder to see. It’s just the opposite with the stacked graph.

As always, get clear in your own mind what you’re trying to show, and then select the type of graph that shows that most clearly.

Did you notice that this stacked bar graph shows relative frequencies? (Maybe you didn’t notice, because it seems like the natural way to go.) A stacked bar graph could show frequencies instead of relative frequencies, if you want to emphasize the different sizes of the populations, but then it becomes harder to compare the mix in the populations.

When you make a stacked bar graph in Excel, there’s no need to pre-compute the percentages. Just select the third type of 2-D column chart, 100% Stacked Column.

2A2.  Making a Table from Scratch

Example 4: In the first example, you were given a table of categories and counts. But more likely you’ll just have a mass of data points, like this:

Children’s Favorite Beach Toys
shovel      dump truck  shovel      bucket      shovel      ball
ball        bucket      sifter      ball        shovel      shovel
dump truck  ball        shovel      shovel      bucket      net
sifter      shovel      bucket      dump truck  bucket      shovel
ball        shovel      ball        bucket      net         ball

Before you can make any kind of graph, you need a table to summarize the data. You’re probably tempted to count the number of shovels, the number of balls, and so on, but it’s way too easy to make mistakes that way. Why? Because you have to go over the data set multiple times, and you may count something twice or miss something.

The better procedure is to tally the categories in a table. It’s a win-win: the procedure is faster, and you’re less likely to make a mistake.

Simply go through the data, one item at a time. If you’re seeing a given category for the first time, add it to your list with a tally mark; if that category is already in your table, just add a tally mark. Here’s my table of tallies after going through the first two columns of data:

Toy         Tallies
shovel      |||
ball        |||
dump truck  ||
sifter      |
bucket      |

Please complete your tallies on your own before you look at mine.

After you’ve tallied all the data, count the tallies in each category and total the counts. Of course the total should equal your sample size n. Here’s my complete table:

Toy         Tallies      Frequency
shovel      |||| ||||    10
ball        |||| ||      7
dump truck  |||          3
sifter      ||           2
bucket      |||| |       6
net         ||           2
Total                    30

Always check the total of your frequencies. If it matches the sample size, that’s no guarantee everything is correct; but if it doesn’t match, you know something is wrong.
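The same tallying procedure can be sketched in Python, if you have it handy (Python is my assumption; the book's own tools are pencil and Excel). The standard library's `Counter` makes one pass through the data, adding a category the first time it appears and bumping its count after that, which is exactly the tally-mark method. The list below is the beach-toy data, read row by row.

```python
from collections import Counter

# Children's favorite beach toys, one entry per child, read row by row.
toys = [
    "shovel", "dump truck", "shovel", "bucket", "shovel", "ball",
    "ball", "bucket", "sifter", "ball", "shovel", "shovel",
    "dump truck", "ball", "shovel", "shovel", "bucket", "net",
    "sifter", "shovel", "bucket", "dump truck", "bucket", "shovel",
    "ball", "shovel", "ball", "bucket", "net", "ball",
]

# One pass through the data: each item adds one "tally mark" to its category.
tally = Counter(toys)

for toy, frequency in tally.items():
    print(f"{toy:<12}{frequency:>3}")

# Check the total of the frequencies against the sample size.
total = sum(tally.values())
print(f"{'Total':<12}{total:>3}")
if total != len(toys):
    print("Something is wrong: the total does not match the sample size.")
```

The categories come out in the order they were first seen, so the order differs from my hand-made table (I went column by column), but the frequencies are the same.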

Once you’ve got your table, you can make a graph by following the procedures above. If you’re publishing the table itself, give just the category names and sizes and the total, but leave out the tallies.

2A3.  Pie Chart

Where a bar graph tends to emphasize the sizes of categories in relation to each other, a pie chart tends to emphasize the categories as divisions of the whole. This distinction is not hard and fast; it’s just a matter of emphasis.

To make a pie chart, you need a compass, or something else that can draw a circle, and you need a protractor. The angle of each segment of the pie will be 360°×f/n, where f is the frequency of the category and n is the sample size — in other words, it’s 360° times the relative frequency, whether you’re showing frequencies or relative frequencies on the pie chart. But in practice, if you’re going to make a pie chart you’ll use Excel or some other software.
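If you do want to draw a pie chart by hand, you need the angle for each segment. Here is a minimal Python sketch of the 360°×f/n formula, using the data about parents reading to children. Python and the variable names are my assumptions, not anything this chapter requires.

```python
# Frequencies for how often parents read to children (n = 434).
frequencies = {
    "Every day": 217,
    "A few times a week": 113,
    "About once a week": 39,
    "A few times a month": 26,
    "Less often": 30,
    "Never": 9,
}

n = sum(frequencies.values())  # sample size

# Angle of each pie segment: 360 degrees times the relative frequency f/n.
angles = {category: 360 * f / n for category, f in frequencies.items()}

for category, angle in angles.items():
    print(f"{category:<22}{angle:6.1f} degrees")

# The angles should total a full circle, apart from floating-point fuzz.
print(f"{'Total':<22}{sum(angles.values()):6.1f} degrees")
```

“Every day” is exactly half the sample, so its segment is exactly half the circle, 180°.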

Optional:  Pie Chart in Excel

Excel can draw a pie chart for you, but you have to make a bunch of tweaks before it’s usable. There’s one bit of good news: with a pie chart, unlike a bar graph, Excel can compute relative frequencies automatically. I’ll show you how to do that for the data about parents reading to children, for which we made a bar graph earlier.

  1. Highlight the categories and frequencies, but not the total. Click the Insert tab and then Pie, and choose the first 2-D pie. You see the result at right.
    first cut at an Excel pie chart

    Many people stop there, but this is an absolutely horrible design. Readers have to keep looking back and forth to match up the colors, and often there are similar colors. Color-blind people are really screwed, and if you print the chart on a black-and-white printer it’s hopeless. Fortunately you can fix this!

  2. You’re going to put the category names with the pie segments, so right-click the legend (the list of categories at the right) and select Delete.
  3. Click on the “Number of Parents” title and type in a better one, such as “How Often Parents Read to Children”. (Don’t type the quotes in the title, of course.)
  4. In the ribbon, on the Layout tab, click Data Labels » More Data Label Options. Under Label Contains, select Category Name, select either Value or Percentage, and select Show Leader Lines. Under Label Position, select Best Fit. Click Close.
  5. You may want to resize the graph to make the labels less crowded, depending on the sizes of the segments. Drag a handle with your mouse, as you did before.

final pie graph in Excel

2B.  Graphing Numeric Data

Summary:

For numeric data, you want to show four things: the shape, center, and spread of the distribution plus any outliers. The histogram is the standard way to do this, and it can show frequencies or relative frequencies.

Usually you’ll group the data into classes, but when you have discrete data without too many different values you can make an ungrouped histogram.

For a discrete data set with a moderate number of data points and a moderate range, a stemplot is an alternative. With a stemplot, it doesn’t matter how many different data values there are, but the number of data points matters.

2B1.  Histogram for Numeric Data

How can you draw a picture of numeric data? The answer is a histogram.

The term “histogram” was coined by Karl Pearson in lectures some time before 1895.

Example 5: Let’s use the lengths of some randomly selected iTunes songs:

Lengths of iTunes Songs (seconds)
113  282  179  594  213     319  245  323  334  526
395  440  477  240  296     428  407  230  294  152
242  837  246  135  412     223  275  409  114  604
170  239  138  505  316     369  298  168  269  398
433  212  367  255  218     283  179  374  204  227

How do you make sense of this? As you might expect, the first step is to make a table. But you don’t want to treat each number as its own category, because that would produce a really uninteresting graph. Instead you create categories, except for numeric data you call them classes. The rules for classes are very simple:

Notice that the rules don’t tell you how many classes there must be, or what width a class must have. That’s where your discretion comes in. You want to pick class boundaries that are “nice” numbers, and you don’t want too many classes or too few. In practice, five to nine classes is usually about the right number.

How does that apply to the iTunes songs? Take a look at the data. The lowest number seems to be 113, and the highest is 837. That gives a range in “nice” numbers of about 100–850. If you set class width to 100 you have eight classes, so that seems about right.

Now go ahead and make your tally marks to create the table. Instead of category names, you use class boundaries. You already know how to make tally marks, so I’ll just give you the results:

Lengths of iTunes Songs (seconds)
Class Boundaries    Tallies                Frequency
100–199             |||| ||||              9
200–299             |||| |||| |||| ||||    20
300–399             |||| ||||              9
400–499             |||| ||                7
500–599             |||                    3
600–699             |                      1
700–799                                    0
800–899             |                      1

Even though the 700–799 class has no data points, it’s still a class and it will occupy the same width in the histogram as any other class. A bar with zero height shows in the histogram as a gap, and that’s good because it emphasizes that there’s something unusual about the point in the 800–899 class (which was 837 seconds).

If the class width is 100, how come the class bounds are 100–199 and not 100–200? In fact, some authors do write these class bounds as 100–200, 200–300, and so on, with the understanding that if a number is right on the boundary it goes in the upper class. All authors agree that the class width is the difference between the lower bounds of two consecutive classes, not the difference between lower and upper bounds of one class. So whether you write 100–199 for the first class or 100–200, the class width is 200 minus 100, which is 100.
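Here is one way to sketch the grouping in Python (my choice of language; the helper name `class_lower_bound` is made up for this illustration). It assumes, as in this example, that the class boundaries are multiples of the class width, so integer division finds the class. A value right on a boundary lands in the upper class, and empty classes such as 700–799 are kept with a frequency of 0.

```python
# Lengths of the 50 iTunes songs, in seconds.
lengths = [
    113, 282, 179, 594, 213, 319, 245, 323, 334, 526,
    395, 440, 477, 240, 296, 428, 407, 230, 294, 152,
    242, 837, 246, 135, 412, 223, 275, 409, 114, 604,
    170, 239, 138, 505, 316, 369, 298, 168, 269, 398,
    433, 212, 367, 255, 218, 283, 179, 374, 204, 227,
]

class_width = 100


def class_lower_bound(x):
    """Lower bound of the class containing x (boundaries are multiples of the width)."""
    return (x // class_width) * class_width


# Create every class from the lowest to the highest, so empty classes appear too.
first = class_lower_bound(min(lengths))
last = class_lower_bound(max(lengths))
counts = {lower: 0 for lower in range(first, last + class_width, class_width)}

# Tally each data point into its class.
for x in lengths:
    counts[class_lower_bound(x)] += 1

for lower, frequency in counts.items():
    print(f"{lower}-{lower + class_width - 1}{frequency:>5}")

# Class width = difference between lower bounds of consecutive classes.
lower_bounds = list(counts)
print("class width:", lower_bounds[1] - lower_bounds[0])
```

If your classes don't start at a multiple of the class width, you would have to subtract the first lower bound before dividing; this sketch doesn't handle that case.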

Once you have the table, the histogram is straightforward. You can draw the histogram by hand or use Excel. I’ll show Excel later, but here’s my hand-made histogram for the iTunes data.

histogram for lengths of iTunes songs

Notice that you label the data bars on their edges: 100, 200, …, 900, not 100–199, 200–299, …. Label the left edge of each bar, and also the right edge of the last bar. The right edge of the last bar is always one class width more than the left edge, so even if you’ve got 800–899 in your table the last bar’s edges are 800 and 900.

Like all histograms, this one is good at showing the shape of the data (skewed right; see below), the center (somewhere in the upper 200s to 300s), and the spread (from 100ish to 800ish seconds, or about two minutes to 13 minutes). In Chapter 3 you’ll learn how to measure center and spread numerically, but there’s always a place for a picture to help people grasp a data set as a whole.

This data set also shows an outlier, located somewhere in the 800–899 class. Not every data set will have an outlier, of course, and a rare sample might have more than one. When an outlier occurs, your first move is to go back to your original data sheets and make sure that it’s not simply a mistake in entering your data. If it’s a real data point, then you can ask what it means. In this case, the message is pretty simple: tunes generally run up to about 11 or 12 minutes (700 seconds), but the occasional one can be several minutes longer.

Histogram Versus Bar Graph

A histogram is similar to a bar graph, but with the following differences:

                          Histogram            Bar Graph
Data type                 Numeric (grouped)    Discrete ungrouped★    Non-numeric
Order of categories       Numeric order,       Numeric order,         Any order you choose
                          left to right        left to right
Do the bars touch?        Yes                  No, they’re spaced
Where are they labeled?   Below the edges      Below the centers

★Some authors treat ungrouped discrete data as numeric and make a histogram. Others, including this book, treat ungrouped discrete data as categories and make a bar graph.

For both histogram and bar graph, the frequencies must start at 0. However, in a histogram the data axis typically doesn’t start at zero. You just leave some space between the frequency axis and the first bar, and the scale of the data axis is considered to start at the first bar.

Relative-Frequency Histogram

Though I don’t show it here, you could make a relative-frequency histogram, the same way you made a relative-frequency bar chart. The relative frequencies range from 0 for the 700–799 class to 20/50 = 40% for the 200–299 class.
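As a rough sketch of the numbers such a histogram would show, here are the relative frequencies computed in Python from the frequency table for the iTunes songs. Python is my assumption, and the dictionary keys are simply the lower class bounds.

```python
# Frequency of each class, keyed by its lower bound (class width is 100).
class_counts = {100: 9, 200: 20, 300: 9, 400: 7, 500: 3, 600: 1, 700: 0, 800: 1}

n = sum(class_counts.values())  # sample size

# Relative frequency of each class is f/n, just as for a bar graph.
relative = {lower: f / n for lower, f in class_counts.items()}

for lower, rf in relative.items():
    print(f"{lower}-{lower + 99}{rf:>7.0%}")

# The smallest and largest relative frequencies set the range of the vertical axis.
print(f"lowest: {min(relative.values()):.0%}, highest: {max(relative.values()):.0%}")
```

The shape of the histogram doesn't change; only the scale on the vertical axis does.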

Optional:  Histogram in Excel

Believe it or not, out of all the chart types in Excel, the standard histogram is not included. To make one, you have to combine a column chart and a scatterplot (Middleton) [see “Sources Used” at end of book], or download additional software. You can follow the detailed instructions in that document, or you can download the free Better Histogram add-in from TreePlan Software to do the job. (It works in Excel 2007 through 2016.) If you’re using Better Histogram:

  1. Enter all the original numbers in a column in Excel.
  2. Double-click the downloaded ZIP file, and within it double-click Better-Histogram-2007. You will have to enable macros.
  3. Click the Add-Ins tab in the ribbon, and then Better Histogram.
  4. Better Histogram will create a new sheet in your workbook with a frequency table and histogram. Click on the chart title and enter a new title. Click on the horizontal axis title and either delete it or change it to more appropriate text. The result is shown at right.
    histogram created by Better Histogram, following instructions in the text
  5. Optional: You might wish to jazz up the chart visually. If so, click on the Design tab of Excel’s ribbon and choose a design. Color is fine, but don’t choose different colors for different bars because that can make bars look larger or smaller than they actually are. Here’s what I got from clicking the blue theme.
    applying blue theme to the above histogram

2B2.  Ungrouped Discrete Data

To make sense of most data sets, you need to group the data into classes. But sometimes your data have only a few different values. In such cases, you probably want to skip the grouping and just have one histogram bar for each different response. The height of the bar tells you how often that response occurred, as usual.

Example 6: A state park collected data on the number of adults in each vehicle that entered the park in a given time interval:

3 1 1 3 3 3 0 7 3 1    3 6 4 5 3 2 3 4 2 3
0 2 2 4 8 3 3 1 3 3    3 4 1 5 2 2 6 3 4 2

Number of Adults in Vehicles Entering Park
Adults    Tallies           Frequency
0         ||                2
1         ||||              5
2         |||| ||           7
3         |||| |||| ||||    15
4         ||||              5
5         ||                2
6         ||                2
7         |                 1
8         |                 1
Total                       40

There are only nine different values, so it seems a little silly to group them. Instead, just tally the occurrences, as shown at right.

Label ungrouped data under the centers of the bars, just like categorical data, not under the edges. Some authors still make the bars touch because the data are numeric, and others keep the bars separated because the data are ungrouped. I prefer the second approach, but I’ll accept the other. Here’s my histogram:

ungrouped histogram for adults in vehicles entering park

Caution: This particular data set has at least one occurrence of every value between min and max. But suppose it didn’t; suppose there were no vehicles with 7 adults? In that case, you would draw the histogram exactly the same, except that the bar above “7” would have zero height. The horizontal axis for numeric data must always have a consistent scale for its whole length, so you never close up any gaps.
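The point about never closing up gaps can be sketched in Python (my assumption as a language; the names `frequency_table` and `without_seven` are made up for the illustration). The table gets a row for every whole number from the minimum to the maximum, whether or not that value occurred.

```python
from collections import Counter

# Number of adults in each of the 40 vehicles entering the park.
adults = [
    3, 1, 1, 3, 3, 3, 0, 7, 3, 1, 3, 6, 4, 5, 3, 2, 3, 4, 2, 3,
    0, 2, 2, 4, 8, 3, 3, 1, 3, 3, 3, 4, 1, 5, 2, 2, 6, 3, 4, 2,
]


def frequency_table(data):
    """Frequency of every integer from min to max, including values that never occur."""
    tally = Counter(data)  # a Counter reports 0 for values it never saw
    return {value: tally[value] for value in range(min(data), max(data) + 1)}


table = frequency_table(adults)
for value, frequency in table.items():
    print(f"{value:>2}  {'#' * frequency:<15}  {frequency}")

# Suppose no vehicle had 7 adults: the row for 7 stays, with frequency 0.
without_seven = [a for a in adults if a != 7]
table_without_seven = frequency_table(without_seven)
print("frequency of 7:", table_without_seven[7])
```

The row of # marks is just a crude sideways picture of the bar heights, not a substitute for a proper graph.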

Optional:  Ungrouped Discrete Histogram in Excel

You can graph ungrouped discrete data in Excel, if you wish. The key is to fool Excel into treating the data like categorical data:

  1. Type the unique values in one column. But as you type each number, type an apostrophe (') first. Don’t put 0, 1, 2 and so on in the cells, but '0, '1, '2. The apostrophe won’t appear, but it tells Excel to treat the numbers like text. (You may notice that Excel left justifies those numbers.)
    two-column layout in Excel as described in text
  2. Type the frequencies in a second column.
  3. Highlight the numbers in both columns, and on the Insert tab click Column. Select the first 2-D column.
    initial frequency histogram in Excel
  4. Make all the same adjustments you made for the bar graph, above.
    frequency histogram in Excel after cleanup

    By the way, you might notice that the tick marks on the vertical axis are every two cars on this graph, but they were every five cars on my hand-drawn histogram. One is not better than the other; it’s a stylistic choice.

  5. Optional: If you want to make the bars touch, right-click on the graph, select Format Data Series, and under Series Options change Gap Width to 0%. Then click Border Color and select Solid Line with a color of white.
    frequency histogram in Excel with no gaps between bars

2B3.  Shapes of Data Sets

You should know the names of the most common shapes of numeric data. Why? It’s easier to talk about data that way, and — as you’ll see in the next chapter — you treat different-shaped distributions a little differently.

The first question is whether the data set is symmetric or skewed. The histogram of a symmetric data set would look pretty much the same in a mirror; a skewed data set’s histogram would look quite different in a mirror.

If a distribution is skewed, you say whether it’s skewed left or skewed right. A distribution that is skewed left, like the first one below, has mostly high scores, and a distribution that is skewed right, like the second one below, has mostly low scores. The direction of skew is away from the bulk of the data, toward the long skinny tail, where there are few data points.

a distribution that is skewed left      a distribution that is skewed right
Skewed left or negatively skewed        Skewed right or positively skewed

Example 7: Scores on a really easy test would be skewed left: most people get high scores, but a few get low or very low scores.

Lifespan in developed countries is skewed left: there are relatively few infant and child deaths, and most people live into their 60s, 70s, or 80s. (The first graph in Calculus Applied to Probability and Statistics [Waner and Costenoble 1996] [see “Sources Used” at end of book] illustrates this.)

People’s own evaluations of their driving skills and safety are left skewed: few people rate themselves below average and most rate themselves above average. Illusory Superiority [see “Sources Used” at end of book] cites a study by Svenson showing this “Lake Wobegon effect”.

Example 8: People’s departure times after a concert would be skewed right: most people leave shortly before or after the performers finish, but a few straggle out for some time afterward. Skewed-right distributions are more common than skewed-left distributions.

Salaries at almost any corporation are another good example of a distribution that is skewed right: most people make a modest wage, but a few top people make much more.

There are several types of symmetric distributions, but here are the two you’ll meet most often. A uniform distribution is one where all possible values are equally likely to occur. The normal distribution has a precise definition, which you’ll meet in Chapter 7, but for now it’s enough to say that it’s the famous bell curve, with the middle values occurring the most often and the extreme values occurring much less often.

You’ll notice that both of the examples below are “bumpy”. That’s usual. In real life you pretty much never meet an exact match for any distribution, because there are always lurking variables, measurement errors, and so on. And even if a population does perfectly follow a given distribution, like the probability distributions you’ll meet in Chapter 6, still a sample doesn’t perfectly reflect the population it came from: sampling error is always with us. When we say that a data set follows such-and-such a distribution, we mean it’s a close match, not a perfect match.

a uniform distribution      a normal distribution
Uniform                     Normal (“bell curve”)

Example 9: Winning lottery numbers are uniformly distributed. (In the short term some numbers occur more often than others, but over the long run they tend to even out.)

The results of rolling one die many times are uniformly distributed. (But the results of rolling two dice are not uniformly distributed: 7 is the most likely, 2 and 12 are tied for least likely, and the other numbers are intermediate.)
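
The two-dice claim is easy to check by listing all 36 equally likely outcomes. This Python sketch (the function name is just mine) counts how many outcomes give each total:

```python
from collections import Counter
from itertools import product

def two_dice_counts():
    """Count the ways to roll each total with two fair six-sided dice."""
    return Counter(a + b for a, b in product(range(1, 7), repeat=2))

dice_counts = two_dice_counts()
for total in sorted(dice_counts):
    ways = dice_counts[total]
    # Each of the 36 outcomes is equally likely, so probability = ways/36.
    print(f"{total:2d} | {'#' * ways}  {ways}/36")
```

The counts rise from 1 way to roll a 2, peak at 6 ways to roll a 7, and fall back to 1 way to roll a 12.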

The normal distribution or bell curve occurs very often: many natural and industrial processes produce normal distributions. It comes up so often that we usually just say or write ND for “normal distribution” or “normally distributed”.

Example 10: Men’s and women’s heights follow separate normal distributions. People’s arrival times at an event are ND. IQ scores, and scores on most tests, are ND. The amount of soda in two-liter bottles is ND. Your commute times on a given route are ND.

2B4.  Stem Plot

Suppose you have a discrete data set with few repetitions. An ungrouped histogram would have most bars at the same low height; a grouped histogram might show a pattern but you’d lose the individual data points.

If your discrete data set isn’t too large (n < 100, give or take), and the range isn’t too great, you can eat your cake and have it too. The stemplot, also known as a stem-and-leaf diagram, is a mutant hybrid between a histogram and a simple list of data.

The idea is that you take all the digits of each data point except the last digit and call that the stem; the last digit is the leaf. For example, consider scores of 113 and 117. They are two leaves 3 and 7 on a common stem 11 (meaning 110).

To construct a stemplot, you look over your data set for the minimum and maximum, then write the stems in a column, from lowest to highest. Just like with a histogram, there are no gaps, so if you have data in the 50s and the 70s but not in the 60s you still need a stem of 6.

However, your stems probably won’t start at 0. Start them with the stem of the lowest data point that actually occurred, and end them with the stem of the highest data point that actually occurred.

The stemplot was invented by John Tukey in 1970.

Example 11: Here is a set of IQ scores from 50 randomly-selected tenth graders:

 99  77  83 111 141      89  98  84  93 124
110  73  96  60 102      87 123 120 100  95
100  90 104  85 129      81 119 112 103  76
108  91  94 114 108      92  96  94  88 101
117 106 103 105 113      97 106 109  80 116

To make your stemplot, eyeball the data for the minimum and maximum, which are 60 and 141. Write the stems, 6 to 14, in a column at the left of your paper, starting several lines below the top. Then draw a vertical line just to the right of them.

Now go through the data points, one by one, and add each leaf to the proper stem. During this process, you might find a value outside what you thought were the min and max. That’s no problem. Just add the stem and then the leaf. (Again, the stems can’t have gaps, so if your first stem is 6 and you come across a data point 47, you have to add stems 4 and 5, not just 4.)

Finally, add a title and a legend or key to your stemplot. Here is the result:

            IQ Scores
 6 | 0
 7 | 7 3 6
 8 | 3 9 4 7 5 1 8 0
 9 | 9 8 3 6 5 0 1 4 2 6 4 7
10 | 2 0 0 4 3 8 8 1 6 3 5 6 9
11 | 1 0 9 2 4 7 3 6
12 | 4 3 0 9
13 |
14 | 1
                    key: 11 | 7 = 117

If you lie down and look at this sideways, it looks like a histogram. But the bonus is that you can still see all the actual data points within the groupings of 60–69, 70–79, etc.
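
If you’d rather let a computer do the bookkeeping, the same procedure can be sketched in a few lines of Python. This is only one way to do it, and it assumes whole-number data with the last digit as the leaf; the name stemplot_lines is my own, not a library function.

```python
def stemplot_lines(data):
    """Return the rows of a stemplot for non-negative whole numbers.

    The last digit of each value is the leaf; the other digits are the stem.
    Leaves stay in the order the data points are encountered.
    """
    leaves = {}
    for value in data:
        stem, leaf = divmod(value, 10)  # 117 -> stem 11, leaf 7
        leaves.setdefault(stem, []).append(leaf)
    rows = []
    # Stems run from lowest to highest with no gaps, so an empty stem
    # (13 in this data set) still gets a row.
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem:2d} | " + " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(row.rstrip())
    return rows

iq_scores = [
     99,  77,  83, 111, 141,  89,  98,  84,  93, 124,
    110,  73,  96,  60, 102,  87, 123, 120, 100,  95,
    100,  90, 104,  85, 129,  81, 119, 112, 103,  76,
    108,  91,  94, 114, 108,  92,  96,  94,  88, 101,
    117, 106, 103, 105, 113,  97, 106, 109,  80, 116,
]

print("            IQ Scores")
for row in stemplot_lines(iq_scores):
    print(row)
print("                    key: 11 | 7 = 117")
```

Run on the 50 IQ scores, this prints the same rows as the hand-made stemplot above.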

A stemplot is great at showing shape, center and spread of distributions plus outliers, but most data sets don’t lend themselves to a stemplot. If your data set is too large, your leaves will run off the edge of the page. If your data set is too sparse — if the range is large for the number of data points — most of your stems won’t have leaves and the plot won’t really show any patterns in the data. But when you have a moderate-sized data set and the data range is moderate, a stemplot is probably better than a histogram because the stemplot gives more information.

One last touch is sorting the leaves. I don’t think that’s important enough to take the extra effort in a homework problem or on a quiz, but if you’re going to be presenting your stemplot to other people then you probably want to sort the leaves. Here’s the same stemplot with sorted leaves:

            IQ Scores
 6 | 0
 7 | 3 6 7
 8 | 0 1 3 4 5 7 8 9
 9 | 0 1 2 3 4 4 5 6 6 7 8 9
10 | 0 0 1 2 3 3 4 5 6 6 8 8 9
11 | 0 1 2 3 4 6 7 9
12 | 0 3 4 9
13 |
14 | 1
                    key: 11 | 7 = 117

A glance at this stemplot shows you quite a lot. The data set is normally distributed, the center is around 100 points, the spread is 60–141, and there’s an outlier at 141.
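
You can cross-check the center and spread that you read off the stemplot with a short Python sketch. I’m using the median as the measure of center here, which is my choice for illustration; the variable names are mine too.

```python
import statistics

iq_scores = [
     99,  77,  83, 111, 141,  89,  98,  84,  93, 124,
    110,  73,  96,  60, 102,  87, 123, 120, 100,  95,
    100,  90, 104,  85, 129,  81, 119, 112, 103,  76,
    108,  91,  94, 114, 108,  92,  96,  94,  88, 101,
    117, 106, 103, 105, 113,  97, 106, 109,  80, 116,
]

center = statistics.median(iq_scores)    # middle of the sorted data
low, high = min(iq_scores), max(iq_scores)
two_highest = sorted(iq_scores)[-2:]     # shows the gap below the top value

print("center (median):", center)
print("spread:", low, "to", high)
print("two highest values:", two_highest)
```

The median comes out at 100, the values run from 60 to 141, and the two highest values are 129 and 141, which matches the empty stem 13 just below the outlier.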

2C.  Bad Graphs

You now know how to make good graphs, so be on the lookout for bad graphs. Sometimes they’re bad just because whoever drew them didn’t know any better, or didn’t think. But some people may deliberately try to deceive you with a graph.

Example 12: File this one under “what were they thinking?” The left-hand graph doesn’t have a title, so you don’t know what “Yes” and “No” mean. You have to look back and forth between the graph and the legend, and anyone with red-green color blindness probably won’t be able to see which segment is which. Oh yes — what percentages of the sample answered “Yes” and “No”? You can guess that it’s around a third versus two thirds, but that’s not very precise.

The right-hand graph cures those problems. It’s now crystal clear which segment is Yes and which is No, and what proportion of the sample gave each answer. This actually lets you show more information in less space, a win-win. (Of course you wouldn’t use a vague term like “Opinions” — that’s just there to remind you to give your graph a title.)

[Figure: a bad pie chart and a good pie chart, as described]

Example 13: There’s no telling whether this one is deliberate deception or just incompetent graphing. An oatmeal company, which shall remain nameless, wanted to show that eating oatmeal for four weeks reduces cholesterol. The first graph makes a strong case — until you look at the scale on the vertical axis. (Don’t even think about wasting your time on a graph with no vertical scale.)

The scale doesn’t start at zero, so it makes differences look much bigger than they are. Your frequency or relative frequency scale must always start at zero (and you must show the zero). The second graph is properly drawn, and now you can see that the drop in cholesterol is only a slight one.

[Figures: improperly and properly scaled graphs; see text. Bad graph: frequency scale doesn’t start at zero (left). Better graph: frequency scale starts at zero (right).]
source: Misleading Graph [see “Sources Used” at end of book]
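
To put a number on how much a truncated scale exaggerates, compare the drawn bar heights with the true values. The figures in this Python sketch are invented for illustration (they are not the oatmeal company’s data), and the function name is mine.

```python
def drawn_height_ratio(first, second, axis_start):
    """How many times taller the first bar is drawn than the second,
    when the vertical scale begins at axis_start."""
    return (first - axis_start) / (second - axis_start)

# Hypothetical cholesterol readings, made up for this example only.
before, after = 210.0, 200.0

truncated = drawn_height_ratio(before, after, axis_start=195.0)  # scale starts at 195
honest = drawn_height_ratio(before, after, axis_start=0.0)       # scale starts at 0

print(f"scale starting at 195: first bar drawn {truncated:.2f} times as tall")
print(f"scale starting at 0:   first bar drawn {honest:.2f} times as tall")
```

With these made-up numbers, the truncated scale draws the first bar three times as tall as the second, though the first value is only 5% larger than the second.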

Example 14: It’s all very well to create visual interest, but not if it makes the reader misinterpret the graph.

In the left-hand graph, you can tell from the scale that B is supposed to be three times as large as A, but since it’s three times as high and three times as wide it’s actually nine times as large, giving the reader a distorted impression of the amount of difference. Even if your “bars” are pictures, they still have to be the same width. The corrected version is shown at right. (It’s still not quite correct, though, because 0 is not shown on the vertical axis.)
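
The arithmetic behind that distortion is just height times width. A tiny Python sketch, with a function name of my own choosing:

```python
def area_ratio(height_scale, width_scale=1):
    """How many times larger a picture's area becomes when its height and
    width are multiplied by the given scale factors."""
    return height_scale * width_scale

# B is supposed to be three times as large as A.
stretched_both_ways = area_ratio(3, 3)  # three times as high AND three times as wide
same_width = area_ratio(3)              # three times as high, same width

print("scaled in both directions:", stretched_both_ways, "times the area")
print("scaled in height only:", same_width, "times the area")
```

Scaling both directions gives 9 times the area; keeping the width fixed gives the intended 3 times.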

2D.  Really Good Graphs

If you follow the rules in this chapter, you’ll make good, professional graphs. But there are plenty of other ways to make good graphs, depending on the data you’re trying to show.

There’s a classic picture book that can give you lots of good ideas. Edward Tufte’s The Visual Display of Quantitative Information has been around since 1983, and no one has yet done it any better. (Tufte has produced newer editions.)

Example 15: One famous graph in Tufte’s book is particularly stunning. Charles Minard wanted to present a lot of time-series data about Napoleon’s disastrous campaign in Russia in the winter of 1812–1813: where battles took place, numbers of casualties, temperature, and so forth. He elected to make a kind of stylized map showing just the rivers and the cities where events happened. (Niemen at the left is the Niemen River, Russia’s western border at the time. Moscow, “Moscou” in French, is as far east as Napoleon got.) Across that, Minard showed the army strength as a broad swath at the start that shrank to almost nothing by the end of the retreat westward. Below are dates of events, temperatures, and precipitation. It’s a huge amount of information on one piece of paper.

[Figure: Charles Minard’s famous figurative map of Napoleon’s Moscow campaign]

This tiny rendition doesn’t do it justice, but if you visit http://upload.wikimedia.org/wikipedia/commons/2/29/Minard.png you’ll see it at a better size. (Your browser may still reduce it to fit on your screen. Try clicking into the picture and you should see it at original size, though you’ll have to scroll around to see the details. It sounds like a lot of effort, but I promise you it’s worth it. Or just get the book, because it has plenty more!)

Example 16: Here’s one I ran across in my reading. It’s not the graph of the century like Minard’s, but it’s a cut above the usual. In Bear Attacks: Their Causes and Avoidance (2002), Stephen Herrero had the problem of contrasting bears’ diet in spring, summer, and fall. (Of course in winter they’re not eating.)

[Figure: Stephen Herrero three-way pie graph]

He could have drawn three pie charts, or a stacked bar graph, but instead he came up with a great alternative. (A larger version is at http://BrownMath.com/swtpic/chap02_beardiet.jpg.) Each component of diet is clearly labeled right in the graph, not in some legend off to the side, and the contrasting backgrounds make it a little more interesting visually. A stacked bar graph would convey the same information, but I like this presentation because it suggests that “spring”, “summer”, and “fall” are not completely separate but rather transition one into the next.

The vertical axis is clearly labeled, too. There’s no doubt what the numbers are (as opposed to some units of weight, for instance, or something more esoteric like pounds of feed per hundreds of pounds of bear).

He probably could have left the title off the category axis: after all, we know that the seasons are seasons, and the graph title also conveys that information. But that’s a minor point. My only real quibble with this graph is that the overall graph title at the bottom is too small.

What Have You Learned?

Overview: With numeric data, the goal of descriptive stats is to show shape, center, spread, and outliers.

Key ideas:

(The online book has live links to all of these.)


Exercises for Chapter 2

Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.

Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

1

Would You Move to the US?
Yes, with authorization       154
Yes, without authorization    204
No                            612
Don’t know                     30

The Pew Research Center (2013c) [see “Sources Used” at end of book] conducted a poll of 1000 adults in Mexico, asking whether they would move to the US if they had the means and opportunity to move. Draw a relative-frequency bar graph for their responses.

2

[Figure: banana, apple, cherry]

What’s wrong with this graph? (You should be able to see at least two problems, maybe more.)

(source: Misleading Graph [see “Sources Used” at end of book] in Wikipedia)

3

Professor Marvel had a statistics class of fifteen students, and on one 15-point quiz their scores were

10.5   13.5   8   12   11.3   9   9.5   5   15   2.5   10.5   7   11.5   10   10.5

Construct a frequency table and bar graph for their letter grades on the quiz, where 90% is the minimum for an A, 80% for a B, 70% for a C, and 60% for a D.

4
Deaths by Horse Kick in 14 Prussian Army Corps, 1875–1894
Number of Deaths    Frequency
0                   144
1                    91
2                    32
3                    11
4                     2
Total               280

Bulmer (1979, 92) [see “Sources Used” at end of book] quotes an 1898 study of deaths by horse kick in the Prussian army. Von Bortkiewicz compiled the number of deaths in 14 Prussian Army corps over the 20-year period 1875–1894, as shown at right. (14 corps over 20 years gives 14×20 = 280 observations.) For example, there were 32 observations in which two officers died of horse kicks.
(a) What is the type of the variable?
(b) Construct an appropriate graph.

5
Commuting Distances in km
 5  15  23  12   9
12  22  26  31  21
11  19  16  45  12
 8  26  18  17   1
16  24  15  20  17

In a GM factory in Brazil, 25 workers were asked their commuting distance in kilometers. Construct a stem-and-leaf plot.

—Adapted from Dabes and Janik (1999, 8) [see “Sources Used” at end of book]

6

Abigail asked a number of students their major. She found 35 in liberal arts, 10 in criminal justice, 25 in nursing, 45 in business, and 20 in other majors. What was the relative frequency of the nursing group, rounded to the nearest whole percent?

7

(a) Name three types of graph used for ungrouped discrete data. Which type do you use when?
(b) Name the type of graph used for grouped numeric data.
(c) Name two types of graph used for qualitative data.

8

[Figure: histogram of data given in this question]

Bert asked his fellow students how many books they read for pleasure in a year. He found that most of them read 0, 1, or 2 books, but some read 3 or more and a very few read as many as 10. (He plotted the histogram shown at right.) Identify the shape of this distribution.

9

(a) In making a histogram, how do you decide whether to group data?
(b) What are the two rules for classes when you group data?

10
Test scores, x    Frequencies, f
470.0–479.9       15
480.0–489.9       22
490.0–499.9       29
500.0–509.9       50
510.0–519.9       38

At right is a grouped frequency distribution.
(a) Create a frequency histogram. (For a real quiz, you’d use graph paper, but you can freehand this one.)
(b) Find the class width.
(c) What’s the shape of this distribution?


What’s New

Updates and new info: http://BrownMath.com/swt/
