
# Stats without Tears: 2. Graphing Your Data

Updated 19 June 2017


Summary: To make sense out of a mass of raw data, make a graph. Non-numeric data want a bar graph or pie chart; numeric data want a histogram or stemplot. Histograms and bar graphs can show frequency or relative frequency.

Graph Paper
Why buy an expensive pad of graph paper, especially if you only need a few sheets? You can print your own for free with Incompetech’s Plain Graph Paper PDF Generator or Math Worksheets Land’s Graph Paper. Both are sources not just for the ordinary square grid, but for various specialty graph papers.

## 2A.  Graphing Non-Numeric Data

Any graph of non-numeric data needs to show two things: the categories and the size of each. Probably you’re already familiar with the two most common types, which are the bar graph and pie chart.

The sizes of categories can be shown as raw counts, called frequencies, or percentages, called relative frequencies. (Relative frequencies can also be shown as decimals, but I think most people respond better to “20%” than “.20”.)

How do you decide whether to show frequencies or relative frequencies? This is a stylistic choice, not a matter of right and wrong. Your choice depends on what’s important, what point you’re trying to make. If your main concern is just with the individuals in your sample, go with frequencies. But if you want to show the relationship of the parts to the whole, show relative frequencies.

### 2A1.  Bar Graph

How Often Parents Read to Children under Age 12 (n=434)

| How Often | Number of Parents |
|---|---|
| Every day | 217 |
| A few times a week | 113 |
| A few times a month | 26 |
| Less often | 30 |
| Never | 9 |

Example 1: In fall 2012, the Pew Research Center (2013a) [see “Sources Used” at end of book] surveyed American adults on their habits of reading to their children. The survey included 434 adults who had at least one child under age 12, and the results are shown in the table.

(Remember, you can’t call the data numeric just because you see numbers. You have to go back to the individual data points, which are categorical: “every day”, “a few times a week”, and so on. If the Pew Center had asked “how many days a week do you read to your child?” and got answers 0, 1, 2, 3, 4, 5, 6, and 7, that would be a set of numeric data.)

• The bars have equal width and equal spacing; they do not touch. Each bar is labeled with its category below the axis.
• Typically for non-numeric data, there’s no One True Order for the categories. Try to find an order that feels natural. If you prefer, you can order the categories from the tallest to the shortest bar; that is called a Pareto chart.
• The frequency or relative-frequency axis (usually the vertical axis) starts at 0, and you need to show the 0 label. That axis is a number line, so tick marks are equally spaced and represent consistent numbering. (The frequency 0 goes next to the horizontal axis. Don’t offset it downward.)
• The height or length of each bar is proportional to the number or percentage of individuals in that category. You can write frequencies or percentages at the top of every bar, but this is optional because you’re labeling your frequency axis.
• The frequency axis always needs a title. The category axis may or may not need a title, depending on whether the graph title and category names make the chart easy enough to understand.
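To make the Pareto ordering from the list above concrete, here is a minimal Python sketch. It is only an illustration: the names `parents`, `pareto_order`, and `text_bars` are my own, and the text "bars" are a rough stand-in for a properly drawn graph.

```python
# Frequencies from the table above (categorical data: how often parents read).
parents = {
    "Every day": 217,
    "A few times a week": 113,
    "A few times a month": 26,
    "Less often": 30,
    "Never": 9,
}

def pareto_order(freqs):
    """Return (category, frequency) pairs sorted from tallest bar to shortest."""
    return sorted(freqs.items(), key=lambda pair: pair[1], reverse=True)

def text_bars(pairs, per_mark=10):
    """Build one text bar per category; each '#' stands for `per_mark` individuals."""
    lines = []
    for category, freq in pairs:
        bar = "#" * round(freq / per_mark)
        lines.append(f"{category:<20} {bar} {freq}")
    return lines

for line in text_bars(pareto_order(parents)):
    print(line)
```

Every bar uses the same scale and starts from zero, so the lengths stay proportional to the frequencies (up to rounding to whole marks).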

Usually the category axis is horizontal, so the frequency axis and the bars are vertical. But you can also make a horizontal bar chart, where the category axis is vertical and the frequency axis and bars are horizontal.

You can make a bar graph by hand, or use software such as Microsoft Excel. If you make a bar graph by hand, use graph paper and draw the axes and bars with a straightedge — wobbly bars make you look like you had a liquid lunch.

Here’s my bar graph for parents reading to children:

A couple of comments on best practices:

• Notice that I made one square on the vertical axis equal 10 people, or five squares equal 50. That way when I have numbers like 113 or 39 I know how high to draw my bars. If you pick three or four squares per 50 people, you have a much harder job to draw the bars at the correct heights because you have to figure things like “if 50 is 3 squares, then 113 must be 113/(50/3) = about 6.8 squares.” Always pick “nice” numbers for your numeric scales.
• Notice also that I drew horizontal lines at the major milestones. These “gridlines” help the reader assess the heights of the bars more accurately.
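The arithmetic behind the "nice numbers" advice can be checked with a couple of lines of Python. This is just a sketch, and the function name `bar_height_in_squares` is mine.

```python
def bar_height_in_squares(frequency, people_per_square):
    """Height of a bar, in graph-paper squares, for a given vertical scale."""
    return frequency / people_per_square

# Scale used in the hand-drawn graph: 1 square = 10 people.
nice = bar_height_in_squares(113, 10)

# Awkward scale from the text: 3 squares = 50 people, so 50/3 people per square.
awkward = bar_height_in_squares(113, 50 / 3)

print(f"10 per square:   {nice:.1f} squares")
print(f"50 per 3 squares: {awkward:.1f} squares")
```

With 10 people per square you can read the height straight off the number (113 is 11.3 squares); with the other scale you need a division for every bar.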

#### Optional:  Bar Graph in Excel

Getting some kind of bar graph out of Excel is easy. But then there’s a lot of fiddling around to reverse some of Excel’s rather strange format choices. Here are instructions for Excel 2010. If you have Excel 2007, 2013, or 2016, you’ll find that they’re pretty similar.

1. Get your categories into one column and your frequencies into the next column. The first row of each column should be the column headings from the table. Don’t enter a total row.
2. With your mouse, highlight all rows and columns of your chart. (It doesn’t matter whether you include the column heads.) Click the tab and then  » , and select the first 2-D column chart.
3. Right-click the useless legend at the right, “Series1” or “Number of Parents”, and select Delete.
4. When you right-clicked the legend, three tabs appeared. On the tab of the ribbon click  » . Click into the chart title and type a better one. (Maybe Excel already gave your chart a title, but “Number of Parents” is the proper title of the frequency axis, not the whole chart.)
5. Click  »  » . Click on the words “Axis Title” that appear in the chart, and type the new title “Number of Parents” for your frequency axis.
6. If your category axis needs a title, click  »  » and enter the axis title.
7. For some reason, the chart has tick marks between the categories. Right-click one of them, select , and change to . That gives the chart you see here.
8. You may have to tweak the formatting of the graph further; here are some suggestions. (If you try something and don’t like the result, press Ctrl-Z to undo the change.)
• If the category names are long, try shifting them to vertical alignment: right-click on any of them and select  » . In , select .
• You may need to resize the whole graph to improve spacing or to make the bars’ heights show better contrast. Look carefully at the frame and you’ll see handles in each corner and the middle of each side. To resize the graph, drag any of the handles.
• To change fonts of the axis labels or titles, click the element, then click the tab in the ribbon. Change font or font size as desired.

If you prefer a horizontal bar chart, it’s easy to make the change. Click into the chart area, then on the tab on the ribbon click  » and select the first one.

Okay, well, nothing is that easy! Excel puts the categories in backwards order, so right-click the category axis and select  »  » . Still on the dialog, click .

#### Bar Graph with Relative Frequencies

The frequency bar graph tells us about the 434 individuals in the Pew Research Center’s sample. But why collect that sample except for what it can tell us about how often parents in general read to their children?

You know from Sampling Error in Chapter 1 that the proportions in the population are probably not the same as the sample, but probably not very far off either. So you compute those proportions and then redraw your graph to show percentages instead of raw counts.

How Often Parents Read to Children under Age 12 (n=434)

| How Often | Number of Parents | Rel. Freq. |
|---|---|---|
| Every day | 217 | 50% |
| A few times a week | 113 | 26% |
| A few times a month | 26 | 6% |
| Less often | 30 | 7% |
| Never | 9 | 2% |

First, total all the frequencies to get the sample size n = 434. (In this case n is given already, but often it isn’t.) Then convert each frequency into a relative frequency. The formula, if you need one, is f/n. For example, 9 parents never read to their under-12 children. The relative frequency is f/n = 9/434 = 0.021 or 2%: 2% of parents never read to their children. Enter that and the other relative frequencies in the table, as shown above.

You may see some bar graphs with relative frequencies as decimals. There’s nothing wrong with that for technical audiences, but general audiences usually respond better to percentages.

Your relative frequencies may not add up to exactly 100% (or 1.0000), because of rounding. Don’t change any of the numbers to force a total.
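If you would rather let a computer do the division, the f/n calculation might look like this in Python. Treat it as a sketch: the helper name `to_percents` is mine, n is taken from the table title, and the three-way split at the end is a made-up example added only to show the rounding effect.

```python
def to_percents(freqs, n):
    """Relative frequency f/n for each category, rounded to a whole percent."""
    return {category: round(100 * f / n) for category, f in freqs.items()}

parents = {
    "Every day": 217,
    "A few times a week": 113,
    "A few times a month": 26,
    "Less often": 30,
    "Never": 9,
}
parent_percents = to_percents(parents, n=434)  # n=434 as stated in the table title
for category, pct in parent_percents.items():
    print(f"{category:<20} {pct}%")

# Hypothetical data (not from the survey): three equal categories.
# Each rounds to 33%, so the rounded percentages total 99%, not 100%.
thirds = to_percents({"A": 1, "B": 1, "C": 1}, n=3)
print(sum(thirds.values()))
```

The second result illustrates why you shouldn't force a total: each 33% is the correct rounded value, even though the three don't add up to 100%.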

Once you have your relative frequencies, you can make your bar graph. Choose round numbers for the tick marks on your relative frequency axis, for example every 5% or every 10%. I won’t inflict another of my sketches on you, but you can see a finished relative-frequency bar graph below.

#### Optional:  Relative Frequencies in Excel

To my surprise, I found that Excel doesn’t include relative-frequency bar graphs in its repertoire. You have to enter some formulas to compute the relative frequencies, and then create the graph from them. (Of course you could compute the relative frequencies yourself and enter them in Excel as numbers, but whenever possible I like to be lazy and make the computer do the work.)

1. Enter the categories in a column, leave a blank column, and enter the frequencies. If you already have the categories and frequencies in adjacent columns, right-click on the letter at the top of the frequency column and select Insert.
2. Click into the cell below the last frequency, and type “=sum(” (without the quotes). Then with your mouse select the frequencies. Finally, type a closing parenthesis and hit the Enter key.
3. In the address box just above the first column of the spreadsheet, type a unique name such as TOTPARENTS and press the Enter key. This makes it easier to refer to this total cell.
4. Click into the empty relative-frequency cell for the first category. Type an = sign, then click on the first frequency cell. (In the illustration, the relative-frequency cells are in column B and the frequency cells are in column C.) Type /TOTPARENTS (including / mark for division) and press the Enter key.
5. Grab the “handle” at the lower right of the cell you just typed into, and drag it down to fill the Relative Frequency column.
6. Click the % sign in the ribbon to change the decimals to percentages. (The % sign is near the middle of the ribbon, on the tab.)

Now highlight the category and relative-frequency columns, click the tab and the first 2-D column chart, and tweak the graph as you did before. Your result should be something like the one you see here.

On this chart, neither axis really needs a label. The percent signs reinforce the message in the chart title that the bars show relative frequencies. And the category names together with the chart title tell the reader exactly what is being represented.

It’s a judgment call where to place tick marks on the relative-frequency axis, and you really need to look at the data to make a decision. Four categories are under 10%, so it makes sense to show the 5% line and help the reader get a sense of the relative sizes. Of course, if you show 5% then you have to show every 5% increment up to the top of the graph.

#### Side-by-Side Bar Graph

You may want to compare two populations: men and women, for instance, or one year versus another year. To do this, a side-by-side bar graph is ideal. A side-by-side bar graph has two bars for each category, and a legend shows the meaning of the bars.

The two populations you’re comparing are almost never the same size. Therefore side-by-side graphs almost always show relative frequencies rather than frequencies.

Example 2: In Educational Attainment, the Census Bureau (2014) [see “Sources Used” at end of book] showed the educational attainment of the population in selected years 1940 to 2012. I chose the years 1992 and 2012 and prepared this graph to show the change over that 20-year period.

What do you see? Comparing 2012 to 1992, the proportion of the population with no college (the first four categories) declined, and the proportion with some college or a college degree increased. You should be able to see why this has to be a relative-frequency chart: in a frequency chart, the larger population in 2012 would make all the bars taller than the 1992 bars, and you’d be hard put to see any kind of trend.

#### Stacked Bar Graph

Example 3: Another way to compare two populations is the stacked bar graph. In the side-by-side bar graph, above, each group of bars was one category, and each bar within a group was a population. With the stacked bar graph, you have one bar for each population, and one piece of that bar for each category. (A stacked bar graph is kind of like an unrolled pie chart.)

Here’s a stacked bar graph for the same data set:


What do you see? Look first at the legend that lists the categories, then at the two bars. The top two segments represent some college. In 1992, about 56% of adults had no education beyond high school. But in 2012, only about 42% had a high-school diploma or less, meaning that 58% had at least some college. The proportions of college and no college were reversed in those 20 years.

You can also see that, though the group with four years of high school shrank, it didn’t shrink as much as the group with college grew. In other words, it’s not just more high-school graduates going on to college, it’s a higher proportion of the population entering high school. All the categories without a high-school diploma shrank. In 1992, 20% of adults had less than a high-school diploma and 80% were high-school graduates; in 2012, only about 12% had less than a high-school diploma and 88% had graduated from high school.

What’s the best way to compare two populations? The answer depends on what you’re trying to show. The side-by-side graph seems to be better at showing how each category changed, and the stacked graph is usually better at showing the mix, especially if you want to group the categories mentally. In the side-by-side graph, you can easily see the decline in adults with a fourth-grade education or less, but the shift to a college-educated population is much harder to see. It’s just the opposite with the stacked graph.

As always, get clear in your own mind what you’re trying to show, and then select the type of graph that shows that most clearly.

Did you notice that this stacked bar graph shows relative frequencies? (Maybe you didn’t notice, because it seems like the natural way to go.) A stacked bar graph could show frequencies instead of relative frequencies, if you want to emphasize the different sizes of the populations, but then it becomes harder to compare the mix in the populations.

When you make a stacked bar graph in Excel, there’s no need to pre-compute the percentages. Just select the third type of 2-D column chart, .

### 2A2.  Making a Table from Scratch

Example 4: In the first example, you were given a table of categories and counts. But more likely you’ll just have a mass of data points, like this:

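Children’s Favorite Beach Toys

| | | | | | |
|---|---|---|---|---|---|
| shovel | dump truck | shovel | bucket | shovel | ball |
| ball | bucket | sifter | ball | shovel | shovel |
| dump truck | ball | shovel | shovel | bucket | net |
| sifter | shovel | bucket | dump truck | bucket | shovel |
| ball | shovel | ball | bucket | net | ball |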

Before you can make any kind of graph, you need a table to summarize the data. You’re probably tempted to count the number of shovels, the number of balls, and so on, but it’s way too easy to make mistakes that way. Why? Because you have to go over the data set multiple times, and you may count something twice or miss something.

The better procedure is to tally the categories in a table. It’s a win-win: the procedure is faster, and you’re less likely to make a mistake.

Simply go through the data, one item at a time. If you’re seeing a given category for the first time, add it to your list with a tally mark; if that category is already in your table, just add a tally mark. Here’s my table of tallies after going through the first two columns of data:

| Toy | Tallies |
|---|---|
| shovel | \|\|\| |
| ball | \|\|\| |
| dump truck | \|\| |
| sifter | \| |
| bucket | \| |

After you’ve tallied all the data, count the tallies in each category and total the counts. Of course the total should equal your sample size n. Here’s my complete table:

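| Toy | Tallies | Frequency |
|---|---|---|
| shovel | \|\|\|\| \|\|\|\| | 10 |
| ball | \|\|\|\| \|\| | 7 |
| dump truck | \|\|\| | 3 |
| sifter | \|\| | 2 |
| bucket | \|\|\|\| \| | 6 |
| net | \|\| | 2 |
| Total | | 30 |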

Always check the total of your frequencies. If it matches the sample size, that’s no guarantee everything is correct; but if it doesn’t match, you know something is wrong.
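The same one-pass tallying idea can be sketched in Python with the standard library's `collections.Counter`. The list name `toys` is mine; the values are the 30 data points from the table of favorite beach toys, read row by row.

```python
from collections import Counter

# The 30 data points, read row by row from the table of favorite beach toys.
toys = [
    "shovel", "dump truck", "shovel", "bucket", "shovel", "ball",
    "ball", "bucket", "sifter", "ball", "shovel", "shovel",
    "dump truck", "ball", "shovel", "shovel", "bucket", "net",
    "sifter", "shovel", "bucket", "dump truck", "bucket", "shovel",
    "ball", "shovel", "ball", "bucket", "net", "ball",
]

# One pass over the data: each item adds one "tally mark" to its category.
tally = Counter(toys)

# Check the total against the sample size, as recommended above.
total = sum(tally.values())
print(tally)
print("Total:", total, "matches n" if total == len(toys) else "does NOT match n")
```

As with the hand tally, a matching total doesn't prove every count is right, but a mismatch would tell you something went wrong.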

Once you’ve got your table, you can make a graph by following the procedures above. If you’re publishing the table itself, give just the category names and sizes and the total, but leave out the tallies.

### 2A3.  Pie Chart

Where a bar graph tends to emphasize the sizes of categories in relation to each other, a pie chart tends to emphasize the categories as divisions of the whole. This distinction is not hard and fast; it’s just a matter of emphasis.

To make a pie chart, you need a compass, or something else that can draw a circle, and you need a protractor. The angle of each segment of the pie will be 360°×f/n, where f is the frequency of the category and n is the sample size — in other words, it’s 360° times the relative frequency, whether you’re showing frequencies or relative frequencies on the pie chart. But in practice, if you’re going to make a pie chart you’ll use Excel or some other software.
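Here is a small Python sketch of the angle formula 360°×f/n, applied to the beach-toy frequencies from the tally table earlier in this chapter. The function name `pie_angles` is my own.

```python
def pie_angles(freqs):
    """Angle in degrees of each pie segment: 360 times the relative frequency f/n."""
    n = sum(freqs.values())
    return {category: 360 * f / n for category, f in freqs.items()}

# Frequencies from the beach-toy table (n = 30).
toy_counts = {
    "shovel": 10,
    "ball": 7,
    "dump truck": 3,
    "sifter": 2,
    "bucket": 6,
    "net": 2,
}

angles = pie_angles(toy_counts)
for category, angle in angles.items():
    print(f"{category:<12} {angle:6.1f} degrees")
print("Total:", sum(angles.values()))
```

The angles add up to a full circle, which is a handy check before you pick up the protractor.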

#### Optional:  Pie Chart in Excel

Excel can draw a pie chart for you, but you have to make a bunch of tweaks before it’s usable. There’s one bit of good news: with a pie chart, unlike a bar graph, Excel can compute relative frequencies automatically. I’ll show you how to do that for the data about parents reading to children, for which we made a bar graph earlier.

1. Highlight the categories and frequencies, but not the total. Click the tab and then , and choose the first 2-D pie. You see the result at right.

Many people stop there, but this is an absolutely horrible design. Readers have to keep looking back and forth to match up the colors, and often there are similar colors. Color-blind people are really screwed, and if you print the chart on a black-and-white printer it’s hopeless. Fortunately you can fix this!

2. You’re going to put the category names with the pie segments, so right-click the legend (the list of categories at the right) and select .
3. Click on the “Number of Parents” title and type in a better one, such as “How Often Parents Read to Children”. (Don’t type the quotes in the title, of course.)
4. In the ribbon, on the tab, click  » . Under , select , select either or , and select . Under , select . Click .
5. You may want to resize the graph to make the labels less crowded, depending on the sizes of the segments. Drag a handle with your mouse, as you did before.

## 2B.  Graphing Numeric Data

Summary:

For numeric data, you want to show four things: the shape, center, and spread of the distribution plus any outliers. The histogram is the standard way to do this, and it can show frequencies or relative frequencies.

Usually you’ll group the data into classes, but when you have discrete data without too many different values you can make an ungrouped histogram.

For a discrete data set with a moderate number of values and a moderate range, a stemplot is an alternative. With a stemplot, it doesn’t matter how many different data values there are, but the number of data points matters.

### 2B1.  Histogram for Numeric Data

How can you draw a picture of numeric data? The answer is a histogram.

The term “histogram” was coined by Karl Pearson in lectures some time before 1895.

Example 5: Let’s use the lengths of some randomly selected iTunes songs:

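Lengths of iTunes Songs (seconds)

| | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| 113 | 282 | 179 | 594 | 213 | 319 | 245 | 323 | 334 | 526 |
| 395 | 440 | 477 | 240 | 296 | 428 | 407 | 230 | 294 | 152 |
| 242 | 837 | 246 | 135 | 412 | 223 | 275 | 409 | 114 | 604 |
| 170 | 239 | 138 | 505 | 316 | 369 | 298 | 168 | 269 | 398 |
| 433 | 212 | 367 | 255 | 218 | 283 | 179 | 374 | 204 | 227 |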

How do you make sense of this? As you might expect, the first step is to make a table. But you don’t want to treat each number as its own category, because that would produce a really uninteresting graph. Instead you create categories, except for numeric data you call them classes. The rules for classes are very simple:

• The classes must cover all the data points.
• They must all be the same width.
• There must be no gaps between classes.

Notice that the rules don’t tell you how many classes there must be, or what width a class must have. That’s where your discretion comes in. You want to pick class boundaries that are “nice” numbers, and you don’t want too many classes or too few. In practice, five to nine classes is usually about the right number.

How does that apply to the iTunes songs? Take a look at the data. The lowest number seems to be 113, and the highest is 837. That gives a range in “nice” numbers of about 100–850. If you set class width to 100 you have eight classes, so that seems about right.

Now go ahead and make your tally marks to create the table. Instead of category names, you use class boundaries. You already know how to make tally marks, so I’ll just give you the results:

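Lengths of iTunes Songs (seconds)

| Class Boundaries | Tallies | Frequency |
|---|---|---|
| 100–199 | \|\|\|\| \|\|\|\| | 9 |
| 200–299 | \|\|\|\| \|\|\|\| \|\|\|\| \|\|\|\| | 20 |
| 300–399 | \|\|\|\| \|\|\|\| | 9 |
| 400–499 | \|\|\|\| \|\| | 7 |
| 500–599 | \|\|\| | 3 |
| 600–699 | \| | 1 |
| 700–799 | | 0 |
| 800–899 | \| | 1 |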

Even though the 700–799 class has no data points, it’s still a class and it will occupy the same width in the histogram as any other class. A bar with zero height shows in the histogram as a gap, and that’s good because it emphasizes that there’s something unusual about the point in the 800–899 class (which was 837 seconds).

If the class width is 100, how come the class bounds are 100–199 and not 100–200? In fact, some authors do write these class bounds as 100–200, 200–300, and so on, with the understanding that if a number is right on the boundary it goes in the upper class. All authors agree that the class width is the difference between the lower bounds of two consecutive classes, not the difference between lower and upper bounds of one class. So whether you write 100–199 for the first class or 100–200, the class width is 200 minus 100, which is 100.
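To see the class rules in action, here is a rough Python sketch that sorts the iTunes song lengths into classes of width 100 starting at 100. The function `class_counts` and its parameter names are mine, not part of any library. A value that falls exactly on a boundary, such as 200, goes into the upper class, which matches the convention described above.

```python
# The 50 song lengths (seconds) from the table above.
songs = [
    113, 282, 179, 594, 213, 319, 245, 323, 334, 526,
    395, 440, 477, 240, 296, 428, 407, 230, 294, 152,
    242, 837, 246, 135, 412, 223, 275, 409, 114, 604,
    170, 239, 138, 505, 316, 369, 298, 168, 269, 398,
    433, 212, 367, 255, 218, 283, 179, 374, 204, 227,
]

def class_counts(data, lower, width, n_classes):
    """Count data points per class; class i covers lower + i*width up to the next lower bound."""
    counts = [0] * n_classes
    for x in data:
        index = (x - lower) // width
        if not 0 <= index < n_classes:
            raise ValueError(f"{x} is not covered by any class")
        counts[index] += 1
    return counts

counts = class_counts(songs, lower=100, width=100, n_classes=8)
for i, count in enumerate(counts):
    low = 100 + 100 * i
    print(f"{low}-{low + 99}: {count}")
```

The `ValueError` enforces the first rule (the classes must cover all the data points), and the empty 700–799 class still appears in the output with a count of 0.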

Once you have the table, the histogram is straightforward. You can draw the histogram by hand or use Excel. I’ll show Excel later, but here’s my hand-made histogram for the iTunes data.

Notice that you label the data bars on their edges: 100, 200, …, 900, not 100–199, 200–299, …. Label the left edge of each bar, and also the right edge of the last bar. The right edge of the last bar is always one class width more than the left edge, so even if you’ve got 800–899 in your table the last bar’s edges are 800 and 900.

Like all histograms, this one is good at showing the shape of the data (skewed right; see below), the center (somewhere in the upper 200s to 300s), and the spread (from 100ish to 800ish seconds, or about two minutes to 13 minutes). In Chapter 3 you’ll learn how to measure center and spread numerically, but there’s always a place for a picture to help people grasp a data set as a whole.

This data set also shows an outlier, located somewhere in the 800–899 class. Not every data set will have an outlier, of course, and a rare sample might have more than one. When an outlier occurs, your first move is to go back to your original data sheets and make sure that it’s not simply a mistake in entering your data. If it’s a real data point, then you can ask what it means. In this case, the message is pretty simple: tunes generally run up to about 11 or 12 minutes (700 seconds), but the occasional one can be several minutes longer.

#### Histogram Versus Bar Graph

A histogram is similar to a bar graph, but with the following differences:

| | Histogram | Bar Graph | Bar Graph |
|---|---|---|---|
| Data type | Numeric (grouped) | Discrete ungrouped★ | Non-numeric |
| Order of bars | Numeric order, left to right | Numeric order, left to right | Any order you choose |
| Do the bars touch? | Yes | No, they’re spaced | No, they’re spaced |
| Labels | Below the edges | Below the centers | Below the centers |

★ Some authors treat ungrouped discrete data as numeric and make a histogram. Others, including this book, treat ungrouped discrete data as categories and make a bar graph.

For both histogram and bar graph, the frequencies must start at 0. However, in a histogram the data axis typically doesn’t start at zero. You just leave some space between the frequency axis and the first bar, and the scale of the data axis is considered to start at the first bar.

#### Relative-Frequency Histogram

Though I don’t show it here, you could make a relative-frequency histogram, the same way you made a relative-frequency bar chart. The relative frequencies range from 0 for the 700–799 class to 20/50 = 40% for the 200–299 class.
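The conversion from class frequencies to relative frequencies is the same f/n calculation as before. Here is a short Python sketch using the frequency table for the iTunes data; the variable names are mine.

```python
# Lower class bounds and frequencies from the iTunes frequency table.
lower_bounds = [100, 200, 300, 400, 500, 600, 700, 800]
frequencies = [9, 20, 9, 7, 3, 1, 0, 1]

n = sum(frequencies)  # sample size, 50 songs

# Relative frequency of each class, expressed as a percentage.
percents = [100 * f / n for f in frequencies]

for low, pct in zip(lower_bounds, percents):
    print(f"{low}-{low + 99}: {pct:.0f}%")
```

The bars of the relative-frequency histogram would have exactly the same shape as the frequency histogram; only the labels on the vertical axis change.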

#### Optional:  Histogram in Excel

Believe it or not, out of all the chart types in Excel, the standard histogram was not included until Excel 2016. If you’ve got Excel 2016, click  » and select the histogram.

In Excel 2013 and earlier, to make a histogram you have to combine a column chart and a scatterplot (Middleton) [see “Sources Used” at end of book], or download additional software. You can follow the detailed instructions in that document, or you can download the free Better Histogram add-in from TreePlan Software to do the job. (It works in Excel 2007 through 2016.) If you’re using Better Histogram:

1. Enter all the original numbers in a column in Excel.
2. Double-click the downloaded ZIP file, and within it double-click . You will have to enable macros.
3. Click the tab in the ribbon, and then .
• : Click the “_” button at right and highlight your numbers.
• : The lower bound of your first class, not the lowest number in the data.
• : The right-hand edge of the last class, which in this example is 900 (not 899).
4. Better Histogram will create a new sheet in your workbook with a frequency table and histogram. Click on the chart title and enter a new title. Click on the horizontal axis title and either delete it or change it to more appropriate text. The result is shown at right.
5. Optional: You might wish to jazz up the chart visually. If so, click on the tab of Excel’s ribbon and choose a design. Color is fine, but don’t choose different colors for different bars because that can make bars look larger or smaller than they actually are. Here’s what I got from clicking the blue theme.

### 2B2.  Ungrouped Discrete Data

To make sense of most data sets, you need to group the data into classes. But sometimes your data have only a few different values. In such cases, you probably want to skip the grouping and just have one histogram bar for each different response. The height of the bar tells you how often that response occurred, as usual.

Example 6: A state park collected data on the number of adults in each vehicle that entered the park in a given time interval:

3 1 1 3 3 3 0 7 3 1    3 6 4 5 3 2 3 4 2 3
0 2 2 4 8 3 3 1 3 3    3 4 1 5 2 2 6 3 4 2

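There are only nine different values, so it seems a little silly to group them. Instead, just tally the occurrences, as shown in this table:

Adults in Vehicles Entering Park

| Adults | Tallies | Frequency |
|---|---|---|
| 0 | \|\| | 2 |
| 1 | \|\|\|\| | 5 |
| 2 | \|\|\|\| \|\| | 7 |
| 3 | \|\|\|\| \|\|\|\| \|\|\|\| | 15 |
| 4 | \|\|\|\| | 5 |
| 5 | \|\| | 2 |
| 6 | \|\| | 2 |
| 7 | \| | 1 |
| 8 | \| | 1 |
| Total | | 40 |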

Label ungrouped data under the centers of the bars, just like categorical data, not under the edges. Some authors still make the bars touch because the data are numeric, and others keep the bars separated because the data are ungrouped. I prefer the second approach, but I’ll accept the other. Here’s my histogram:

Caution: This particular data set has at least one occurrence of every value between min and max. But suppose it didn’t; suppose there were no vehicles with 7 adults? In that case, you would draw the histogram exactly the same, except that the bar above “7” would have zero height. The horizontal axis for numeric data must always have a consistent scale for its whole length, so you never close up any gaps.
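The caution above can be sketched in Python: count every value from the minimum to the maximum, so that a value with no occurrences still gets a bar of zero height. The function name `ungrouped_counts` is mine, and the list `without_7` is a hypothetical variation on the park data, built only to show the zero-height case.

```python
from collections import Counter

# Number of adults in each of the 40 vehicles, from the data above.
adults = [
    3, 1, 1, 3, 3, 3, 0, 7, 3, 1, 3, 6, 4, 5, 3, 2, 3, 4, 2, 3,
    0, 2, 2, 4, 8, 3, 3, 1, 3, 3, 3, 4, 1, 5, 2, 2, 6, 3, 4, 2,
]

def ungrouped_counts(data):
    """Frequency of every integer from min to max, including values that never occur."""
    tally = Counter(data)
    # Counter returns 0 for a missing key, so gaps become zero-height bars.
    return {value: tally[value] for value in range(min(data), max(data) + 1)}

counts = ungrouped_counts(adults)
print(counts)

# Hypothetical: suppose no vehicle had 7 adults. The 7 still keeps its place.
without_7 = [x for x in adults if x != 7]
print(ungrouped_counts(without_7))
```

Because the keys come from a range rather than from the data, the horizontal scale stays consistent even when a value is missing.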

#### Optional:  Ungrouped Discrete Histogram in Excel

You can graph ungrouped discrete data in Excel, if you wish. The key is to fool Excel into treating the data like categorical data:

1. Type the unique values in one column. But as you type each number, type an apostrophe (') first. Don’t put 0, 1, 2 and so on in the cells, but '0, '1, '2. The apostrophe won’t appear, but it tells Excel to treat the numbers like text. (You may notice that Excel left justifies those numbers.)
2. Type the frequencies in a second column.
3. Highlight the numbers in both columns, and on the tab click . Select the first 2-D column.
4. Make all the same adjustments you made for the bar graph, above.

By the way, you might notice that the tick marks on the vertical axis are every two cars on this graph, but they were every five cars on my hand-drawn histogram. One is not better than the other; it’s a stylistic choice.

5. Optional: If you want to make the bars touch, right-click on the graph, select , and under change to 0%. Then click and select with a color of white.

### 2B3.  Shapes of Data Sets

You should know the names of the most common shapes of numeric data. Why? It’s easier to talk about data that way, and — as you’ll see in the next chapter — you treat different-shaped distributions a little differently.

The first question is whether the data set is symmetric or skewed. The histogram of a symmetric data set would look pretty much the same in a mirror; a skewed data set’s histogram would look quite different in a mirror.

If a distribution is skewed, you say whether it’s skewed left or skewed right. A distribution that is skewed left, like the first one below, has mostly high scores, and a distribution that is skewed right, like the second one below, has mostly low scores. The direction of skew is away from the bulk of the data, toward the long skinny tail, where there are few data points.

Example 7: Scores on a really easy test would be skewed left: most people get high scores, but a few get low or very low scores.

Lifespan in developed countries is skewed left: there are relatively few infant and child deaths, and most people live into their 60s, 70s, or 80s. (The first graph in Calculus Applied to Probability and Statistics [Waner and Costenoble 1996] [see “Sources Used” at end of book] illustrates this.)

People’s own evaluations of their driving skills and safety are skewed left: few people rate themselves below average and most rate themselves above average. Illusory Superiority [see “Sources Used” at end of book] cites a study by Svenson showing this “Lake Wobegon effect”.

Example 8: People’s departure times after a concert would be skewed right: most people leave shortly before or after the performers finish, but a few straggle out for some time afterward. Skewed-right distributions are more common than skewed-left distributions.

Salaries at almost any corporation are another good example of a distribution that is skewed right: most people make a modest wage, but a few top people make much more.

There are several types of symmetric distributions, but here are the two you’ll meet most often. A uniform distribution is one where all possible values are equally likely to occur. The normal distribution has a precise definition, which you’ll meet in Chapter 7, but for now it’s enough to say that it’s the famous bell curve, with the middle values occurring the most often and the extreme values occurring much less often.

You’ll notice that both of the examples below are “bumpy”. That’s usual. In real life you pretty much never meet an exact match for any distribution, because there are always lurking variables, measurement errors, and so on. And even if a population does perfectly follow a given distribution, like the probability distributions you’ll meet in Chapter 6, still a sample doesn’t perfectly reflect the population it came from: sampling error is always with us. When we say that a data set follows such-and-such a distribution, we mean it’s a close match, not a perfect match.

Example 9: Winning lottery numbers are uniformly distributed. (In the short term some numbers occur more often than others, but over the long run they tend to even out.)

The results of rolling one die many times are uniformly distributed. (But the results of rolling two dice are not uniformly distributed: 7 is the most likely, 2 and 12 are tied for least likely, and the other numbers are intermediate.)
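If you’d like to check the two-dice claim for yourself, here is a minimal Python sketch that lists all 36 equally likely outcomes and tallies the sums. It assumes two fair six-sided dice; the variable names are just for illustration.

```python
from collections import Counter
from itertools import product

# Enumerate all 36 equally likely outcomes of two fair six-sided dice
# and tally how many outcomes produce each sum.
sums = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Print a sideways bar graph of the tallies.
for total in sorted(sums):
    print(f"{total:2d}: {'*' * sums[total]} ({sums[total]}/36)")
```

The row for 7 is the longest, the rows for 2 and 12 are the shortest, and the rest fall in between, so the distribution is symmetric but not uniform.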

The normal distribution or bell curve occurs very often, and in fact many natural and industrial processes produce normal distributions. This happens so often that we often just say or write ND for “normal distribution” or “normally distributed”.

Example 10: Men’s and women’s heights follow separate normal distributions. People’s arrival times at an event are ND. IQ scores, and scores on most tests, are ND. The amount of soda in two-liter bottles is ND. Your commute times on a given route are ND.

### 2B4.  Stem Plot

Suppose you have a discrete data set with few repetitions. An ungrouped histogram would have most bars at the same low height; a grouped histogram might show a pattern but you’d lose the individual data points.

If your discrete data set isn’t too large (n < 100, give or take), and the range isn’t too great, you can eat your cake and have it too. The stemplot, also known as a stem-and-leaf diagram, is a mutant hybrid between a histogram and a simple list of data.

The idea is that you take all the digits of each data point except the last digit and call that the stem; the last digit is the leaf. For example, consider scores of 113 and 117. They are two leaves 3 and 7 on a common stem 11 (meaning 110).
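In code, splitting a data point into stem and leaf is just a division by 10 with remainder. This is only a sketch, assuming non-negative whole-number data; the function name `split_stem_leaf` is made up for illustration.

```python
def split_stem_leaf(value):
    """Return (stem, leaf): the leaf is the last digit, the stem is all the digits before it."""
    return divmod(value, 10)

for score in (113, 117):
    stem, leaf = split_stem_leaf(score)
    print(f"{score}: stem {stem}, leaf {leaf}")
```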

To construct a stemplot, you look over your data set for the minimum and maximum, then write the stems in a column, from lowest to highest. Just like with a histogram, there are no gaps, so if you have data in the 50s and the 70s but not in the 60s you still need a stem of 6.

However, your stems probably won’t start at 0. Start them with the lowest data point that actually occurred, and end them with the highest data point that actually occurred.

The stemplot was invented by John Tukey in 1970.

Example 11: Here is a set of IQ scores from 50 randomly-selected tenth graders:

99  77  83 111 141      89  98  84  93 124
110  73  96  60 102      87 123 120 100  95
100  90 104  85 129      81 119 112 103  76
108  91  94 114 108      92  96  94  88 101
117 106 103 105 113      97 106 109  80 116

To make your stemplot, eyeball the data for the minimum and maximum, which are 60 and 141. Write the stems, 6 to 14, in a column at the left of your paper, starting several lines below the top. Then draw a vertical line just to the right of them.

Now go through the data points, one by one, and add each leaf to the proper stem. During this process, you might find a value outside what you thought were the min and max. That’s no problem. Just add the stem and then the leaf. (Again, the stems can’t have gaps, so if your first stem is 6 and you come across a data point 47, you have to add stems 4 and 5, not just 4.)

Finally, add a title and a legend or key to your stemplot. Here is the result:

```
            IQ Scores
 6 | 0
 7 | 7 3 6
 8 | 3 9 4 7 5 1 8 0
 9 | 9 8 3 6 5 0 1 4 2 6 4 7
10 | 2 0 0 4 3 8 8 1 6 3 5 6 9
11 | 1 0 9 2 4 7 3 6
12 | 4 3 0 9
13 |
14 | 1
key: 11 | 7 = 117
```

If you lie down and look at this sideways, it looks like a histogram. But the bonus is that you can still see all the actual data points within the groupings of 60–69, 70–79, etc.

A stemplot is great at showing shape, center and spread of distributions plus outliers, but most data sets don’t lend themselves to a stemplot. If your data set is too large, your leaves will run off the edge of the page. If your data set is too sparse — if the range is large for the number of data points — most of your stems won’t have leaves and the plot won’t really show any patterns in the data. But when you have a moderate-sized data set and the data range is moderate, a stemplot is probably better than a histogram because the stemplot gives more information.

One last touch is sorting the leaves. I don’t think that’s important enough to take the extra effort in a homework problem or on a quiz, but if you’re going to be presenting your stemplot to other people then you probably want to sort the leaves. Here’s the same stemplot with sorted leaves:

```
            IQ Scores
 6 | 0
 7 | 3 6 7
 8 | 0 1 3 4 5 7 8 9
 9 | 0 1 2 3 4 4 5 6 6 7 8 9
10 | 0 0 1 2 3 3 4 5 6 6 8 8 9
11 | 0 1 2 3 4 6 7 9
12 | 0 3 4 9
13 |
14 | 1
key: 11 | 7 = 117
```
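If you’d rather let a computer do the bookkeeping, the steps for building a stemplot can be sketched in a few lines of Python. This is one possible approach, not the only one: it assumes non-negative whole-number data, and the function name `stemplot` and its `sort_leaves` option are inventions for this example.

```python
from collections import defaultdict

# The 50 IQ scores from Example 11, in their original order.
iq_scores = [
    99, 77, 83, 111, 141, 89, 98, 84, 93, 124,
    110, 73, 96, 60, 102, 87, 123, 120, 100, 95,
    100, 90, 104, 85, 129, 81, 119, 112, 103, 76,
    108, 91, 94, 114, 108, 92, 96, 94, 88, 101,
    117, 106, 103, 105, 113, 97, 106, 109, 80, 116,
]

def stemplot(data, title, sort_leaves=True):
    """Return a stemplot of non-negative integers as a multi-line string."""
    leaves = defaultdict(list)
    for value in data:
        stem, leaf = divmod(value, 10)   # last digit is the leaf
        leaves[stem].append(leaf)
    width = len(str(max(leaves)))        # so the stems line up in a column
    lines = [title]
    # No gaps: list every stem from lowest to highest, even one with no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = leaves.get(stem, [])
        if sort_leaves:
            row = sorted(row)
        lines.append(f"{stem:>{width}} | {' '.join(map(str, row))}".rstrip())
    return "\n".join(lines)

plot = stemplot(iq_scores, "IQ Scores")
print(plot)
print("key: 11 | 7 = 117")
```

Calling `stemplot(iq_scores, "IQ Scores", sort_leaves=False)` keeps the leaves in the order they were encountered, which is what you get when you build the plot by hand.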

A glance at this stemplot shows you quite a lot. The data set is normally distributed, the center is around 100 points, the spread is 60–141, and there’s an outlier at 141.

## 2C.  Bad Graphs

You now know how to make good graphs, so be on the lookout for bad graphs. Sometimes they’re bad just because whoever drew them didn’t know any better, or didn’t think. But some people may deliberately try to deceive you with a graph.

Example 12: File this one under “what were they thinking?” The left-hand graph doesn’t have a title, so you don’t know what “Yes” and “No” mean. You have to look back and forth between the graph and the legend, and anyone with red-green color blindness probably won’t be able to see which segment is which. Oh yes — what percentages of the sample answered “Yes” and “No”? You can guess that it’s around a third versus two thirds, but that’s not very precise.

The right-hand graph cures those problems. It’s now crystal clear which segment is Yes and which is No, and what proportion of the sample gave each answer. This actually lets you show more information in less space, a win-win. (Of course you wouldn’t use a vague term like “Opinions” — that’s just there to remind you to give your graph a title.)

Example 13: There’s no telling whether this one is deliberate deception or just incompetent graphing. An oatmeal company, which shall remain nameless, wanted to show that eating oatmeal for four weeks reduces cholesterol. The first graph makes a strong case — until you look at the scale on the vertical axis. (Don’t even think about wasting your time on a graph with no vertical scale.)

The scale doesn’t start at zero, so it makes differences look much bigger than they are. Your frequency or relative frequency scale must always start at zero (and you must show the zero). The second graph is properly drawn, and now you can see that the drop in cholesterol is only a slight one.
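To see how much a truncated scale can exaggerate, compare the ratio of the two bar heights as drawn with the ratio of the actual values. The numbers below are hypothetical, picked only to illustrate the effect (they are not the oatmeal company’s data), and the function name `apparent_ratio` is made up for this sketch.

```python
def apparent_ratio(before, after, axis_start=0):
    """Ratio of the two bar heights as drawn when the vertical axis starts at axis_start."""
    return (before - axis_start) / (after - axis_start)

# Hypothetical values, for illustration only.
before, after = 210, 200

print(apparent_ratio(before, after, axis_start=0))    # axis starts at zero
print(apparent_ratio(before, after, axis_start=195))  # truncated axis
```

With the axis starting at zero, the first bar is drawn only 5% taller than the second. Starting the axis at 195 draws it three times as tall, even though the data haven’t changed.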

source: Misleading Graph [see “Sources Used” at end of book]

Example 14: It’s all very well to create visual interest, but not if it makes the reader misinterpret the graph.

In the left-hand graph, you can tell from the scale that B is supposed to be three times as large as A, but since it’s three times as high and three times as wide it’s actually nine times as large, giving the reader a distorted impression of the amount of difference. Even if your “bars” are pictures, they still have to be the same width. The corrected version is shown at right. (It’s still not quite correct, though, because 0 is not shown on the vertical axis.)
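The arithmetic behind that distortion is simple enough to sketch in code. The function name `drawn_area_ratio` is invented for this example; it assumes the reader judges size by the area of each picture.

```python
def drawn_area_ratio(value_ratio, scale_width_too):
    """How many times larger the bigger picture looks, judged by area.

    If the picture is stretched in height only, area grows in proportion to
    the value. If it is stretched in height and width, area grows with the square.
    """
    height_factor = value_ratio
    width_factor = value_ratio if scale_width_too else 1
    return height_factor * width_factor

print(drawn_area_ratio(3, scale_width_too=True))   # height and width both tripled
print(drawn_area_ratio(3, scale_width_too=False))  # same width, height tripled
```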

## 2D.  Really Good Graphs

If you follow the rules in this chapter, you’ll make good, professional graphs. But there are plenty of other ways to make good graphs, depending on the data you’re trying to show.

There’s a classic picture book that can give you lots of good ideas. Edward Tufte’s The Visual Display of Quantitative Information has been around since 1983, and no one has yet done it any better. (Tufte has produced newer editions.)

Example 15: One famous graph in Tufte’s book is particularly stunning. Charles Minard wanted to present a lot of time-series data about Napoleon’s disastrous campaign in Russia in the winter of 1812–1813: where battles took place, numbers of casualties, temperature, and so forth. He elected to make a kind of stylized map showing just the rivers and the cities where events happened. (Niemen at the left is the Niemen River, Russia’s western border at the time. Moscow, “Moscou” in French, is as far east as Napoleon got.) Across that, Minard showed the army strength as a broad swath at the start that shrank to almost nothing by the end of the retreat westward. Below are dates of events, temperatures, and precipitation. It’s a huge amount of information on one piece of paper.

This tiny rendition doesn’t do it justice, but if you visit http://upload.wikimedia.org/wikipedia/commons/2/29/Minard.png you’ll see it at a better size. (Your browser may still reduce it to fit on your screen. Try clicking into the picture and you should see it at original size, though you’ll have to scroll around to see the details.) It sounds like a lot of effort, but I promise you it’s worth it. Or just get the book, because it has plenty more!

Example 16: Here’s one I ran across in my reading. It’s not the graph of the century like Minard’s, but it’s a cut above the usual. In Bear Attacks: Their Causes and Avoidance (2002), Stephen Herrero had the problem of contrasting bears’ diet in spring, summer, and fall. (Of course in winter they’re not eating.)

He could have drawn three pie charts, or a stacked bar graph, but instead he came up with a great alternative. (A larger version is at https://BrownMath.com/swtpic/chap02_beardiet.jpg.) Each component of diet is clearly labeled right in the graph, not in some legend off to the side, and the contrasting backgrounds make it a little more interesting visually. A stacked bar graph would convey the same information, but I like this presentation because it suggests that “spring”, “summer”, and “fall” are not completely separate but rather transition one into the next.

The vertical axis is clearly labeled, too. There’s no doubt what the numbers are (as opposed to some units of weight, for instance, or something more esoteric like pounds of feed per hundreds of pounds of bear).

He probably could have left the title off the category axis. After all, we know that the seasons are seasons, and the graph title also conveys that information. But that’s a minor point. My only real quibble with this graph is that the overall graph title at the bottom is too small.

## What Have You Learned?

Overview: With numeric data, the goal of descriptive stats is to show shape, center, spread, and outliers.

Key ideas:

(The online book has live links to all of these.)

• When you have a mass of data and need frequencies, don’t pass through the data repeatedly, counting a different category each time. Instead, use the tally system.
• Relative frequency for any class or category is the number of data points in that class, divided by total sample size.
• For non-numeric data, make a bar graph or pie chart. Place categories in any order that seems reasonable to you. Side-by-side bar graphs and stacked bar graphs can be useful for comparing populations.
• Numeric data: Group continuous data in classes, tally them, and make a grouped histogram. Bars must touch, and you label them under the edges, not the middles. Do the same with discrete data that have a lot of different values.
• Present discrete data without too many different values in one bar for each different value. Label them under their middles. It’s a matter of taste whether the bars touch (ungrouped histogram) or not (bar graph).
• For bar graphs and histograms, show scale on the frequency or relative-frequency axis, and show scale or category name on the data axis. Usually, each axis has a title, with a separate chart title at the top. But you can omit an axis title when it would be redundant information.
• In every bar graph or histogram, the frequency or relative-frequency axis must start at 0 and have a consistent scale for its whole length. Be on the lookout for violations of this rule and other signs of bad graphs.

• Know the most common shapes of numeric distributions: uniform, bell curve, skewed left, and skewed right.
• The stemplot (stem-and-leaf diagram) is also an option for discrete data with moderate range and ≤ about 100 data points.
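The first two key ideas, tallying in a single pass and dividing by the sample size, can be sketched in Python with the IQ scores from Example 11. The choice of classes 60–69, 70–79, and so on matches the stems of the stemplot; the variable names are only for illustration.

```python
from collections import Counter

# The 50 IQ scores from Example 11.
iq_scores = [
    99, 77, 83, 111, 141, 89, 98, 84, 93, 124,
    110, 73, 96, 60, 102, 87, 123, 120, 100, 95,
    100, 90, 104, 85, 129, 81, 119, 112, 103, 76,
    108, 91, 94, 114, 108, 92, 96, 94, 88, 101,
    117, 106, 103, 105, 113, 97, 106, 109, 80, 116,
]

# One pass through the data: tally each score under the lower bound
# of its class (60 for 60–69, 70 for 70–79, and so on).
tally = Counter(score // 10 * 10 for score in iq_scores)
n = len(iq_scores)

# Relative frequency = class frequency divided by total sample size.
for lower in range(min(tally), max(tally) + 10, 10):
    freq = tally[lower]
    print(f"{lower}–{lower + 9}: {freq:2d}  {freq / n:.0%}")
```

The loop steps through every class from lowest to highest, so a class with no data points (130–139 here) still shows up with a frequency of 0.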

## Exercises for Chapter 2

Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.

Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

1
Would You Move to the US?

| Response | Frequency |
|---|---|
| Yes, with authorization | 154 |
| Yes, without authorization | 204 |
| No | 612 |
| Don’t know | 30 |

The Pew Research Center (2013c) [see “Sources Used” at end of book] conducted a poll of 1000 adults in Mexico, asking whether they would move to the US if they had the means and opportunity to move. Draw a relative-frequency bar graph for their responses.

2

What’s wrong with this graph? (You should be able to see at least two problems, maybe more.)

(source: Misleading Graph [see “Sources Used” at end of book] in Wikipedia)

3

Professor Marvel had a statistics class of fifteen students, and on one 15-point quiz their scores were

10.5   13.5   8   12   11.3   9   9.5   5   15   2.5   10.5   7   11.5   10   10.5

Construct a frequency table and bar graph for their letter grades on the quiz, where 90% is the minimum for an A, 80% for a B, 70% for a C, and 60% for a D.

4
Deaths by Horse Kick in 14 Prussian Army Corps, 1875–1894

| Number of Deaths | Frequency |
|---|---|
| 0 | 144 |
| 1 | 91 |
| 2 | 32 |
| 3 | 11 |
| 4 | 2 |
| Total | 280 |

Bulmer (1979, 92) [see “Sources Used” at end of book] quotes an 1898 study of deaths by horse kick in the Prussian army. Von Bortkiewicz compiled the number of deaths in 14 Prussian Army corps over the 20-year period 1875–1894, as shown at right. (14 corps over 20 years gives 14×20 = 280 observations.) For example, there were 32 observations in which two officers died of horse kicks.
(a) What is the type of the variable?
(b) Construct an appropriate graph.

5
Commuting Distances in km

 5  15  23  12   9
12  22  26  31  21
11  19  16  45  12
 8  26  18  17   1
16  24  15  20  17

In a GM factory in Brazil, 25 workers were asked their commuting distance in kilometers. Construct a stem-and-leaf plot.

—Adapted from Dabes and Janik (1999, 8) [see “Sources Used” at end of book]

6

Abigail asked a number of students their major. She found 35 in liberal arts, 10 in criminal justice, 25 in nursing, 45 in business, and 20 in other majors. What was the relative frequency of the nursing group, rounded to the nearest whole percent?

7 (a) Name three types of graph used for ungrouped discrete data. Which type do you use when?
(b) Name the type of graph used for grouped numeric data.
(c) Name two types of graph used for qualitative data.
8

Bert asked his fellow students how many books they read for pleasure in a year. He found that most of them read 0, 1, or 2 books, but some read 3 or more and a very few read as many as 10. (He plotted the histogram shown at right.) Identify the shape of this distribution.

9 (a) In making a histogram, how do you decide whether to group data?
(b) What are the two rules for classes when you group data?
10
| Test scores, x | Frequencies, f |
|---|---|
| 470.0–479.9 | 15 |
| 480.0–489.9 | 22 |
| 490.0–499.9 | 29 |
| 500.0–509.9 | 50 |
| 510.0–519.9 | 38 |

At right is a grouped frequency distribution.
(a) Create a frequency histogram. (For a real quiz, you’d use graph paper, but you can freehand this one.)
(b) Find the class width.
(c) What’s the shape of this distribution?