Stats without Tears
2. Graphing Your Data
Updated 19 Jan 2017
(What’s New?)
Copyright © 2013–2017 by Stan Brown
Summary: To make sense out of a mass of raw data, make a graph. Non-numeric data want a bar graph or pie chart; numeric data want a histogram or stemplot. Histograms and bar graphs can show frequency or relative frequency.
Graph Paper for Free: Why buy an expensive pad of graph paper, especially if you only need a few sheets? You can print your own for free using Incompetech’s Plain Graph Paper PDF Generator or Math Worksheets Land’s Graph Paper. Both are sources not just for the ordinary square grid, but for various specialty graph papers.
Any graph of non-numeric data needs to show two things: the categories and the size of each. Probably you’re already familiar with the two most common types, which are the bar graph and pie chart.
The sizes of categories can be shown as raw counts, called frequencies, or percentages, called relative frequencies. (Relative frequencies can also be shown as decimals, but I think most people respond better to “20%” than “.20”.)
How do you decide whether to show frequencies or relative frequencies? This is a stylistic choice, not a matter of right and wrong. Your choice depends on what’s important, what point you’re trying to make. If your main concern is just with the individuals in your sample, go with frequencies. But if you want to show the relationship of the parts to the whole, show relative frequencies.
How Often Parents Read to Children under Age 12 (n=434) | |
---|---|
How Often | Number of Parents |
Every day | 217 |
A few times a week | 113 |
About once a week | 39 |
A few times a month | 26 |
Less often | 30 |
Never | 9 |
Example 1: In fall 2012, the Pew Research Center (2013a) [see “Sources Used” at end of book] surveyed American adults on their habits of reading to their children. The survey included 434 adults who had at least one child under age 12, and the results are shown in the table.
(Remember, you can’t call the data numeric just because you see numbers in a summary statement. You have to go back to the individual data points, which are categorical: “every day”, “a few times a week”, and so on. If the Pew Center had asked “how many days a week do you read to your child?” and got answers 0, 1, 2, 3, 4, 5, 6, and 7, that would be a set of numeric data.)
Your bar chart or bar graph must follow these rules:
Usually the category axis is horizontal, so the frequency axis and the bars are vertical. But you can also make a horizontal bar chart, where the category axis is vertical and the frequency axis and bars are horizontal.
You can make a bar graph by hand, or use software such as Microsoft Excel. If you make a bar graph by hand, use graph paper and draw the axes and bars with a straightedge — wobbly bars make you look like you had a liquid lunch.
Here’s my bar graph for parents reading to children:
A couple of comments on best practices:
Getting some kind of bar graph out of Excel is easy. But then there’s a lot of fiddling around to reverse some of Excel’s rather strange format choices. Here are instructions for Excel 2010. If you have Excel 2007, 2013, or 2016, you’ll find that they’re pretty similar.
If you prefer a horizontal bar chart, it’s easy to make the change. Click into the chart area, then on the tab on the ribbon click » and select the first one.
Okay, well, nothing is that easy! Excel puts the categories in backwards order, so right-click the category axis and select
The frequency bar graph tells us about the 434 individuals in the Pew Research Center’s sample. But why collect that sample except for what it can tell us about how often parents in general read to their children?
You know from Sampling Error in Chapter 1 that the proportions in the population are probably not the same as the sample, but probably not very far off either. So you compute those proportions and then redraw your graph to show percentages instead of raw counts.
How Often Parents Read to Children under Age 12 (n=434) | ||
---|---|---|
How Often | Number of Parents | Rel. Freq. |
Every day | 217 | 50% |
A few times a week | 113 | 26% |
About once a week | 39 | 9% |
A few times a month | 26 | 6% |
Less often | 30 | 7% |
Never | 9 | 2% |
First, total all the frequencies to get the sample size n = 434. (In this case n is given already, but often it isn’t.) Then convert each frequency into a relative frequency. The formula, if you need one, is f/n. For example, 9 parents never read to their under-12 children. The relative frequency is f/n = 9/434 = 0.021 or 2%: 2% of parents never read to their children. Enter that and the other relative frequencies in the table, as shown above.
You may see some bar graphs with relative frequencies as decimals. There’s nothing wrong with that for technical audiences, but general audiences usually respond better to percentages.
Your relative frequencies may not add up to exactly 100% (or 1.0000), because of rounding. Don’t change any of the numbers to force a total.
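The f/n computation is easy to hand to a computer. Here is a minimal sketch in Python (my choice of language, not something this chapter relies on); the function name `relative_frequencies` is made up for the illustration, and the counts are the ones from the table above.

```python
# Frequencies from the table of parents reading to children (n = 434).
frequencies = {
    "Every day": 217,
    "A few times a week": 113,
    "About once a week": 39,
    "A few times a month": 26,
    "Less often": 30,
    "Never": 9,
}

def relative_frequencies(freqs):
    """Return each category's share of the total, f/n, as a fraction."""
    n = sum(freqs.values())  # sample size
    return {category: f / n for category, f in freqs.items()}

rel = relative_frequencies(frequencies)
for category, share in rel.items():
    # Round only for display; the stored values stay unrounded.
    print(f"{category:<20} {frequencies[category]:>4} {share:>5.0%}")
```

Rounding happens only when printing, which fits the advice above: the displayed percentages may not total exactly 100%, and nothing is adjusted to force a total.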
Once you have your relative frequencies, you can make your bar graph. Choose round numbers for the tick marks on your relative frequency axis, for example every 5% or every 10%. I won’t inflict another of my sketches on you, but you can see a finished relative-frequency bar graph below.
To my surprise, I found that Excel doesn’t include relative-frequency bar graphs in its repertoire. You have to enter some formulas to compute the relative frequencies, and then create the graph from them. (Of course you could compute the relative frequencies yourself and enter them in Excel as numbers, but whenever possible I like to be lazy and make the computer do the work.)
Now highlight the category and relative-frequency columns, click the
tab and the first 2-D column chart, and tweak the graph as you did before. Your result should be something like the one you see here.
On this chart, neither axis really needs a label. The percent signs reinforce the message in the chart title that the bars show relative frequencies. And the category names together with the chart title tell the reader exactly what is being represented.
It’s a judgment call where to place tick marks on the relative-frequency axis, and you really need to look at the data to make a decision. Four categories are under 10%, so it makes sense to show the 5% line and help the reader get a sense of the relative sizes. Of course, if you show 5% then you have to show every 5% increment up to the top of the graph.
You may want to compare two populations: men and women, for instance, or one year versus another year. To do this, a side-by-side bar graph is ideal. A side-by-side bar graph has two bars for each category, and a legend shows the meaning of the bars.
The two populations you’re comparing are almost never the same size. Therefore side-by-side graphs almost always show relative frequencies rather than frequencies.
Example 2: In Educational Attainment, the Census Bureau (2014) [see “Sources Used” at end of book] showed the educational attainment of the population in selected years 1940 to 2012. I chose the years 1992 and 2012 and prepared this graph to show the change over that 20-year period.
What do you see? Comparing 2012 to 1992, the proportion of the population with no college (the first four categories) declined, and the proportion with some college or a college degree increased. You should be able to see why this has to be a relative-frequency chart: in a frequency chart, the larger population in 2012 would make all the bars taller than the 1992 bars, and you’d be hard put to see any kind of trend.
Example 3: Another way to compare two populations is the stacked bar graph. In the side-by-side bar graph, above, each group of bars was one category, and each bar within a group was a population. With the stacked bar graph, you have one bar for each population, and one piece of that bar for each category. (A stacked bar graph is kind of like an unrolled pie chart.)
Here’s a stacked bar graph for the same data set:
What do you see? Look first at the legend that lists the categories, then at the two bars. The top two segments represent some college. In 1992, about 56% of adults had no education beyond high school. But in 2012, only about 42% had a high-school diploma or less, meaning that 58% had at least some college. The proportions of college and no college were reversed in those 20 years.
You can also see that, though the group with four years of high school shrank, it didn’t shrink as much as the group with college grew. In other words, it’s not just more high-school graduates going on to college, it’s a higher proportion of the population entering high school. All the categories without a high-school diploma shrank. In 1992, 20% of adults had less than a high-school diploma and 80% were high-school graduates; in 2012, only about 12% had less than a high-school diploma and 88% had graduated from high school.
What’s the best way to compare two populations? The answer depends on what you’re trying to show. The side-by-side graph seems to be better at showing how each category changed, and the stacked graph is usually better at showing the mix, especially if you want to group the categories mentally. In the side-by-side graph, you can easily see the decline in adults with a fourth-grade education or less, but the shift to a college-educated population is much harder to see. It’s just the opposite with the stacked graph.
As always, get clear in your own mind what you’re trying to show, and then select the type of graph that shows that most clearly.
Did you notice that this stacked bar graph shows relative frequencies? (Maybe you didn’t notice, because it seems like the natural way to go.) A stacked bar graph could show frequencies instead of relative frequencies, if you want to emphasize the different sizes of the populations, but then it becomes harder to compare the mix in the populations.
When you make a stacked bar graph in Excel, there’s no need to pre-compute the percentages. Just select the third type of 2-D column chart.
Example 4: In the first example, you were given a table of categories and counts. But more likely you’ll just have a mass of data points, like this:
Children’s Favorite Beach Toys | |||||
---|---|---|---|---|---|
shovel | dump truck | shovel | bucket | shovel | ball |
ball | bucket | sifter | ball | shovel | shovel |
dump truck | ball | shovel | shovel | bucket | net |
sifter | shovel | bucket | dump truck | bucket | shovel |
ball | shovel | ball | bucket | net | ball |
Before you can make any kind of graph, you need a table to summarize the data. You’re probably tempted to count the number of shovels, the number of balls, and so on, but it’s way too easy to make mistakes that way. Why? Because you have to go over the data set multiple times, and you may count something twice or miss something.
The better procedure is to tally the categories in a table. It’s a win-win: the procedure is faster, and you’re less likely to make a mistake.
Simply go through the data, one item at a time. If you’re seeing a given category for the first time, add it to your list with a tally mark; if that category is already in your table, just add a tally mark. Here’s my table of tallies after going through the first two columns of data:
Toy | Tallies |
---|---|
shovel | ||| |
ball | ||| |
dump truck | || |
sifter | | |
bucket | | |
Please complete your tallies on your own before you look at mine.
After you’ve tallied all the data, count the tallies in each category and total the counts. Of course the total should equal your sample size n. Here’s my complete table:
Toy | Tallies | Frequency |
---|---|---|
shovel | |||| |||| | 10 |
ball | |||| || | 7 |
dump truck | ||| | 3 |
sifter | || | 2 |
bucket | |||| | | 6 |
net | || | 2 |
Total | 30 |
Always check the total of your frequencies. If it matches the sample size, that’s no guarantee everything is correct; but if it doesn’t match, you know something is wrong.
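The same one-pass tally can be sketched in Python, assuming the 30 responses are typed in row by row from the table above. `Counter` is part of Python’s standard library; the variable names are just for illustration.

```python
from collections import Counter

# The 30 responses from the beach-toy table, read row by row.
toys = [
    "shovel", "dump truck", "shovel", "bucket", "shovel", "ball",
    "ball", "bucket", "sifter", "ball", "shovel", "shovel",
    "dump truck", "ball", "shovel", "shovel", "bucket", "net",
    "sifter", "shovel", "bucket", "dump truck", "bucket", "shovel",
    "ball", "shovel", "ball", "bucket", "net", "ball",
]

tally = Counter()
for toy in toys:     # a single pass through the data
    tally[toy] += 1  # a category seen for the first time starts at 0

n = len(toys)
for toy, count in tally.items():
    print(f"{toy:<12} {count:>2}")
print(f"{'Total':<12} {sum(tally.values()):>2}")

# The check described above: the frequencies must total the sample size.
assert sum(tally.values()) == n
```

As with hand tallies, a matching total doesn’t prove the counts are right, but a mismatch proves something is wrong.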
Once you’ve got your table, you can make a graph by following the procedures above. If you’re publishing the table itself, give just the category names and sizes and the total, but leave out the tallies.
Where a bar graph tends to emphasize the sizes of categories in relation to each other, a pie chart tends to emphasize the categories as divisions of the whole. This distinction is not hard and fast; it’s just a matter of emphasis.
To make a pie chart, you need a compass, or something else that can draw a circle, and you need a protractor. The angle of each segment of the pie will be 360°×f/n, where f is the frequency of the category and n is the sample size — in other words, it’s 360° times the relative frequency, whether you’re showing frequencies or relative frequencies on the pie chart. But in practice, if you’re going to make a pie chart you’ll use Excel or some other software.
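If you do want to draw the pie by hand, the 360°×f/n rule can be sketched in a few lines of Python. The function name `pie_angles` is hypothetical, and the counts are the parents-reading frequencies from earlier in the chapter.

```python
# Frequencies for parents reading to children (n = 434).
frequencies = {
    "Every day": 217,
    "A few times a week": 113,
    "About once a week": 39,
    "A few times a month": 26,
    "Less often": 30,
    "Never": 9,
}

def pie_angles(freqs):
    """Return the angle in degrees of each pie segment: 360 * f / n."""
    n = sum(freqs.values())
    return {category: 360 * f / n for category, f in freqs.items()}

angles = pie_angles(frequencies)
for category, angle in angles.items():
    print(f"{category:<20} {angle:6.1f} degrees")
```

“Every day” is exactly half the sample, so its segment is a half circle of 180°, and the segments together make the full 360°.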
Excel can draw a pie chart for you, but you have to make a bunch of tweaks before it’s usable. There’s one bit of good news: with a pie chart, unlike a bar graph, Excel can compute relative frequencies automatically. I’ll show you how to do that for the data about parents reading to children, for which we made a bar graph earlier.
Many people stop there, but this is an absolutely horrible design. Readers have to keep looking back and forth to match up the colors, and often there are similar colors. Color-blind people are really screwed, and if you print the chart on a black-and-white printer it’s hopeless. Fortunately you can fix this!
For numeric data, you want to show four things: the shape, center, and spread of the distribution plus any outliers. The histogram is the standard way to do this, and it can show frequencies or relative frequencies.
Usually you’ll group the data into classes, but when you have discrete data without too many different values you can make an ungrouped histogram.
For a discrete data set with a moderate number of values and a moderate range, a stemplot is an alternative. With a stemplot, it doesn’t matter how many different data values there are, but the number of data points matters.
How can you draw a picture of numeric data? The answer is a histogram.
The term “histogram” was coined by Karl Pearson in lectures some time before 1895.
Example 5: Let’s use the lengths of some randomly selected iTunes songs:
Lengths of iTunes Songs (seconds) | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
113 | 282 | 179 | 594 | 213 | 319 | 245 | 323 | 334 | 526 | |
395 | 440 | 477 | 240 | 296 | 428 | 407 | 230 | 294 | 152 | |
242 | 837 | 246 | 135 | 412 | 223 | 275 | 409 | 114 | 604 | |
170 | 239 | 138 | 505 | 316 | 369 | 298 | 168 | 269 | 398 | |
433 | 212 | 367 | 255 | 218 | 283 | 179 | 374 | 204 | 227 |
How do you make sense of this? As you might expect, the first step is to make a table. But you don’t want to treat each number as its own category, because that would produce a really uninteresting graph. Instead you create categories, except for numeric data you call them classes. The rules for classes are very simple:
Notice that the rules don’t tell you how many classes there must be, or what width a class must have. That’s where your discretion comes in. You want to pick class boundaries that are “nice” numbers, and you don’t want too many classes or too few. In practice, five to nine classes is usually about the right number.
How does that apply to the iTunes songs? Take a look at the data. The lowest number seems to be 113, and the highest is 837. That gives a range in “nice” numbers of about 100–850. If you set class width to 100 you have eight classes, so that seems about right.
Now go ahead and make your tally marks to create the table. Instead of category names, you use class boundaries. You already know how to make tally marks, so I’ll just give you the results:
Lengths of iTunes Songs (seconds) | ||
---|---|---|
Class Boundaries | Tallies | Frequency |
100–199 | |||| |||| | 9 |
200–299 | |||| |||| |||| |||| | 20 |
300–399 | |||| |||| | 9 |
400–499 | |||| || | 7 |
500–599 | ||| | 3 |
600–699 | | | 1 |
700–799 | | 0 |
800–899 | | | 1 |
Even though the 700–799 class has no data points, it’s still a class and it will occupy the same width in the histogram as any other class. A bar with zero height shows in the histogram as a gap, and that’s good because it emphasizes that there’s something unusual about the point in the 800–899 class (which was 837 seconds).
If the class width is 100, how come the class bounds are 100–199 and not 100–200? In fact, some authors do write these class bounds as 100–200, 200–300, and so on, with the understanding that if a number is right on the boundary it goes in the upper class. All authors agree that the class width is the difference between the lower bounds of two consecutive classes, not the difference between lower and upper bounds of one class. So whether you write 100–199 for the first class or 100–200, the class width is 200 minus 100, which is 100.
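Here is one way the grouping might be sketched in Python. The function `class_frequencies` and its parameters are made up for this illustration, and it assumes every class has the same width, with a value on a boundary going into the upper class as described above.

```python
# Lengths of the 50 iTunes songs, in seconds, read row by row.
song_lengths = [
    113, 282, 179, 594, 213, 319, 245, 323, 334, 526,
    395, 440, 477, 240, 296, 428, 407, 230, 294, 152,
    242, 837, 246, 135, 412, 223, 275, 409, 114, 604,
    170, 239, 138, 505, 316, 369, 298, 168, 269, 398,
    433, 212, 367, 255, 218, 283, 179, 374, 204, 227,
]

def class_frequencies(data, lower, width, num_classes):
    """Count the data points in each equal-width class.

    Class i runs from lower + i*width up to, but not including,
    lower + (i+1)*width, so a boundary value lands in the upper class.
    """
    counts = [0] * num_classes
    for x in data:
        index = (x - lower) // width
        if not 0 <= index < num_classes:
            raise ValueError(f"{x} falls outside the classes")
        counts[index] += 1
    return counts

lower, width = 100, 100
counts = class_frequencies(song_lengths, lower, width, num_classes=8)
for i, f in enumerate(counts):
    start = lower + i * width  # lower bound of this class
    print(f"{start}-{start + width - 1}  {f:>2}")
```

The class width shows up as the difference between consecutive lower bounds, and the empty 700–799 class still gets its own row with a count of 0.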
Once you have the table, the histogram is straightforward. You can draw the histogram by hand or use Excel. I’ll show Excel later, but here’s my hand-made histogram for the iTunes data.
Notice that you label the data bars on their edges: 100, 200, …, 900, not 100–199, 200–299, …. Label the left edge of each bar, and also the right edge of the last bar. The right edge of the last bar is always one class width more than the left edge, so even if you’ve got 800–899 in your table the last bar’s edges are 800 and 900.
Like all histograms, this one is good at showing the shape of the data (skewed right; see below), the center (somewhere in the upper 200s to 300s), and the spread (from 100ish to 800ish seconds, or about two minutes to 13 minutes). In Chapter 3 you’ll learn how to measure center and spread numerically, but there’s always a place for a picture to help people grasp a data set as a whole.
This data set also shows an outlier, located somewhere in the 800–899 class. Not every data set will have an outlier, of course, and a rare sample might have more than one. When an outlier occurs, your first move is to go back to your original data sheets and make sure that it’s not simply a mistake in entering your data. If it’s a real data point, then you can ask what it means. In this case, the message is pretty simple: tunes generally run up to about 11 or 12 minutes (700 seconds), but the occasional one can be several minutes longer.
A histogram is similar to a bar graph, but with the following differences:
 | Histogram | Bar Graph | Bar Graph |
---|---|---|---|
Data type | Numeric (grouped) | Discrete ungrouped★ | Non-numeric |
Order of categories | Numeric order, left to right | Numeric order, left to right | Any order you choose |
Do the bars touch? | Yes | No, they’re spaced | No, they’re spaced |
Where are they labeled? | Below the edges | Below the centers | Below the centers |

★Some authors treat ungrouped discrete data as numeric and make a histogram. Others, including this book, treat ungrouped discrete data as categories and make a bar graph.
For both histogram and bar graph, the frequencies must start at 0. However, in a histogram the data axis typically doesn’t start at zero. You just leave some space between the frequency axis and the first bar, and the scale of the data axis is considered to start at the first bar.
Though I don’t show it here, you could make a relative-frequency histogram, the same way you made a relative-frequency bar chart. The relative frequencies range from 0 for the 700–799 class to 20/50 = 40% for the 200–299 class.
Believe it or not, out of all the chart types in Excel, the standard histogram is not included. To make one, you have to combine a column chart and a scatterplot (Middleton) [see “Sources Used” at end of book], or download additional software. You can follow the detailed instructions in that document, or you can download the free Better Histogram add-in from TreePlan Software to do the job. (It works in Excel 2007 through 2016.) If you’re using Better Histogram:
To make sense of most data sets, you need to group the data into classes. But sometimes your data have only a few different values. In such cases, you probably want to skip the grouping and just have one histogram bar for each different response. The height of the bar tells you how often that response occurred, as usual.
Example 6: A state park collected data on the number of adults in each vehicle that entered the park in a given time interval:
3 1 1 3 3 3 0 7 3 1 3 6 4 5 3 2 3 4 2 3
0 2 2 4 8 3 3 1 3 3 3 4 1 5 2 2 6 3 4 2
Number of Adults in Vehicles Entering Park | ||
---|---|---|
Adults | Tallies | Frequency |
0 | || | 2 |
1 | |||| | 5 |
2 | |||| || | 7 |
3 | |||| |||| |||| | 15 |
4 | |||| | 5 |
5 | || | 2 |
6 | || | 2 |
7 | | | 1 |
8 | | | 1 |
Total | 40 |
There are only nine different values, so it seems a little silly to group them. Instead, just tally the occurrences, as shown in the table above.
Label ungrouped data under the centers of the bars, just like categorical data, not under the edges. Some authors still make the bars touch because the data are numeric, and others keep the bars separated because the data are ungrouped. I prefer the second approach, but I’ll accept the other. Here’s my histogram:
Caution: This particular data set has at least one occurrence of every value between min and max. But suppose it didn’t; suppose there were no vehicles with 7 adults? In that case, you would draw the histogram exactly the same, except that the bar above “7” would have zero height. The horizontal axis for numeric data must always have a consistent scale for its whole length, so you never close up any gaps.
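The “never close up gaps” rule can be sketched in Python by listing every value from the minimum to the maximum, whether or not it occurred. The function name `ungrouped_frequencies` is hypothetical, and the second data set (with the 7 removed) is the what-if from the caution above, not real data.

```python
from collections import Counter

# Number of adults in each of the 40 vehicles entering the park.
adults = [
    3, 1, 1, 3, 3, 3, 0, 7, 3, 1, 3, 6, 4, 5, 3, 2, 3, 4, 2, 3,
    0, 2, 2, 4, 8, 3, 3, 1, 3, 3, 3, 4, 1, 5, 2, 2, 6, 3, 4, 2,
]

def ungrouped_frequencies(data):
    """Return (value, frequency) pairs for every integer from min to max.

    A Counter reports 0 for a value that never occurred, so a missing
    value keeps its place on the axis with a zero-height bar.
    """
    counts = Counter(data)
    return [(v, counts[v]) for v in range(min(data), max(data) + 1)]

table = ungrouped_frequencies(adults)
for value, f in table:
    print(f"{value} | {'#' * f} {f}")

# Hypothetical variation: suppose no vehicle had 7 adults.
without_sevens = [a for a in adults if a != 7]
table_without_sevens = ungrouped_frequencies(without_sevens)
```

In the what-if version the row for 7 is still there, with a frequency of 0, so the scale stays consistent from 0 to 8.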
You can graph ungrouped discrete data in Excel, if you wish. The key is to fool Excel into treating the data like categorical data:
By the way, you might notice that the tick marks on the vertical axis are every two cars on this graph, but they were every five cars on my hand-drawn histogram. One is not better than the other; it’s a stylistic choice.
You should know the names of the most common shapes of numeric data. Why? It’s easier to talk about data that way, and — as you’ll see in the next chapter — you treat different-shaped distributions a little differently.
The first question is whether the data set is symmetric or skewed. The histogram of a symmetric data set would look pretty much the same in a mirror; a skewed data set’s histogram would look quite different in a mirror.
If a distribution is skewed, you say whether it’s skewed left or skewed right. A distribution that is skewed left, like the first one below, has mostly high scores, and a distribution that is skewed right, like the second one below, has mostly low scores. The direction of skew is away from the bulk of the data, toward the long skinny tail, where there are few data points.
[Figures: a histogram skewed left (negatively skewed) and a histogram skewed right (positively skewed)]
Example 7: Scores on a really easy test would be skewed left: most people get high scores, but a few get low or very low scores.
Lifespan in developed countries is skewed left: there are relatively few infant and child deaths, and most people live into their 60s, 70s, or 80s. (The first graph in Calculus Applied to Probability and Statistics [Waner and Costenoble 1996] [see “Sources Used” at end of book] illustrates this.)
People’s own evaluations of their driving skills and safety are left skewed: few people rate themselves below average and most rate themselves above average. Illusory Superiority [see “Sources Used” at end of book] cites a study by Svenson showing this “Lake Wobegon effect”.
Example 8: People’s departure times after a concert would be skewed right: most people leave shortly before or after the performers finish, but a few straggle out for some time afterward. Skewed-right distributions are more common than skewed-left distributions.
Salaries at almost any corporation are another good example of a distribution that is skewed right: most people make a modest wage, but a few top people make much more.
There are several types of symmetric distributions, but here are the two you’ll meet most often. A uniform distribution is one where all possible values are equally likely to occur. The normal distribution has a precise definition, which you’ll meet in Chapter 7, but for now it’s enough to say that it’s the famous bell curve, with the middle values occurring the most often and the extreme values occurring much less often.
You’ll notice that both of the examples below are “bumpy”. That’s usual. In real life you pretty much never meet an exact match for any distribution, because there are always lurking variables, measurement errors, and so on. And even if a population does perfectly follow a given distribution, like the probability distributions you’ll meet in Chapter 6, still a sample doesn’t perfectly reflect the population it came from: sampling error is always with us. When we say that a data set follows such-and-such a distribution, we mean it’s a close match, not a perfect match.
[Figures: a uniform distribution and a normal distribution (“bell curve”)]
Example 9: Winning lottery numbers are uniformly distributed. (In the short term some numbers occur more often than others, but over the long run they tend to even out.)
The results of rolling one die many times are uniformly distributed. (But the results of rolling two dice are not uniformly distributed: 7 is the most likely, 2 and 12 are tied for least likely, and the other numbers are intermediate.)
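To see why one die is uniform but two dice are not, you can list all the equally likely outcomes. This is a minimal sketch in Python that assumes fair six-sided dice; it counts theoretical outcomes rather than simulating rolls.

```python
from collections import Counter
from itertools import product

faces = range(1, 7)

# One die: each face is one outcome out of 6, so every count is the same.
one_die = Counter(faces)

# Two dice: 36 equally likely (first, second) pairs, grouped by their sum.
two_dice = Counter(a + b for a, b in product(faces, repeat=2))

for total in range(2, 13):
    ways = two_dice[total]
    print(f"{total:>2} | {'#' * ways} {ways}/36")
```

Six of the 36 pairs add up to 7, but only one pair gives 2 and only one gives 12, which is why the two-dice totals pile up in the middle instead of being flat.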
The normal distribution or bell curve occurs very often, and in fact many natural and industrial processes produce normal distributions. This happens so often that we often just say or write ND for “normal distribution” or “normally distributed”.
Example 10: Men’s and women’s heights follow separate normal distributions. People’s arrival times at an event are ND. IQ scores, and scores on most tests, are ND. The amount of soda in two-liter bottles is ND. Your commute times on a given route are ND.
Suppose you have a discrete data set with few repetitions. An ungrouped histogram would have most bars at the same low height; a grouped histogram might show a pattern but you’d lose the individual data points.
If your discrete data set isn’t too large (n < 100, give or take), and the range isn’t too great, you can eat your cake and have it too. The stemplot, also known as a stem-and-leaf diagram, is a mutant hybrid between a histogram and a simple list of data.
The idea is that you take all the digits of each data point except the last digit and call that the stem; the last digit is the leaf. For example, consider scores of 113 and 117. They are two leaves 3 and 7 on a common stem 11 (meaning 110).
To construct a stemplot, you look over your data set for the minimum and maximum, then write the stems in a column, from lowest to highest. Just like with a histogram, there are no gaps, so if you have data in the 50s and the 70s but not in the 60s you still need a stem of 6.
However, your stems probably won’t start at 0. Start them with the stem of the lowest data point that actually occurred, and end them with the stem of the highest data point that actually occurred.
The stemplot was invented by John Tukey in 1970.
Example 11: Here is a set of IQ scores from 50 randomly-selected tenth graders:
99 77 83 111 141
89 98 84 93 124
110 73 96 60 102
87 123 120 100 95
100 90 104 85 129
81 119 112 103 76
108 91 94 114 108
92 96 94 88 101
117 106 103 105 113
97 106 109 80 116
To make your stemplot, eyeball the data for the minimum and maximum, which are 60 and 141. Write the stems, 6 to 14, in a column at the left of your paper, starting several lines below the top. Then draw a vertical line just to the right of them.
Now go through the data points, one by one, and add each leaf to the proper stem. During this process, you might find a value outside what you thought were the min and max. That’s no problem. Just add the stem and then the leaf. (Again, the stems can’t have gaps, so if your first stem is 6 and you come across a data point 47, you have to add stems 4 and 5, not just 4.)
Finally, add a title and a legend or key to your stemplot. Here is the result:
IQ Scores
 6 | 0
 7 | 7 3 6
 8 | 3 9 4 7 5 1 8 0
 9 | 9 8 3 6 5 0 1 4 2 6 4 7
10 | 2 0 0 4 3 8 8 1 6 3 5 6 9
11 | 1 0 9 2 4 7 3 6
12 | 4 3 0 9
13 |
14 | 1
key: 11 | 7 = 117
If you lie down and look at this sideways, it looks like a histogram. But the bonus is that you can still see all the actual data points within the groupings of 60–69, 70–79, etc.
A stemplot is great at showing shape, center and spread of distributions plus outliers, but most data sets don’t lend themselves to a stemplot. If your data set is too large, your leaves will run off the edge of the page. If your data set is too sparse — if the range is large for the number of data points — most of your stems won’t have leaves and the plot won’t really show any patterns in the data. But when you have a moderate-sized data set and the data range is moderate, a stemplot is probably better than a histogram because the stemplot gives more information.
One last touch is sorting the leaves. I don’t think that’s important enough to take the extra effort in a homework problem or on a quiz, but if you’re going to be presenting your stemplot to other people then you probably want to sort the leaves. Here’s the same stemplot with sorted leaves:
IQ Scores
 6 | 0
 7 | 3 6 7
 8 | 0 1 3 4 5 7 8 9
 9 | 0 1 2 3 4 4 5 6 6 7 8 9
10 | 0 0 1 2 3 3 4 5 6 6 8 8 9
11 | 0 1 2 3 4 6 7 9
12 | 0 3 4 9
13 |
14 | 1
key: 11 | 7 = 117
A glance at this stemplot shows you quite a lot. The data set is normally distributed, the center is around 100 points, the spread is 60–141, and there’s an outlier at 141.
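The stem-and-leaf construction described earlier can be sketched in Python. The function name `stemplot` and its `sort_leaves` option are made up for this illustration, and the sketch assumes non-negative whole-number data, with the last digit as the leaf.

```python
from collections import defaultdict

# The 50 IQ scores, read row by row.
iq_scores = [
    99, 77, 83, 111, 141, 89, 98, 84, 93, 124,
    110, 73, 96, 60, 102, 87, 123, 120, 100, 95,
    100, 90, 104, 85, 129, 81, 119, 112, 103, 76,
    108, 91, 94, 114, 108, 92, 96, 94, 88, 101,
    117, 106, 103, 105, 113, 97, 106, 109, 80, 116,
]

def stemplot(data, sort_leaves=True):
    """Return the rows of a stemplot as a list of strings."""
    leaves = defaultdict(list)
    for x in data:
        stem, leaf = divmod(x, 10)  # leaf is the last digit
        leaves[stem].append(leaf)
    lines = []
    # Every stem from lowest to highest gets a row, even with no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = sorted(leaves[stem]) if sort_leaves else leaves[stem]
        line = f"{stem:>2} | " + " ".join(str(leaf) for leaf in row)
        lines.append(line.rstrip())
    return lines

print("IQ Scores")
for line in stemplot(iq_scores):
    print(line)
print("key: 11 | 7 = 117")
```

Calling `stemplot(iq_scores, sort_leaves=False)` keeps the leaves in the order the data were read, like the first stemplot above. Either way, stem 13 gets a row even though it has no leaves.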
You now know how to make good graphs, so be on the lookout for bad graphs. Sometimes they’re bad just because whoever drew them didn’t know any better, or didn’t think. But some people may deliberately try to deceive you with a graph.
Example 12: File this one under “what were they thinking?” The left-hand graph doesn’t have a title, so you don’t know what “Yes” and “No” mean. You have to look back and forth between the graph and the legend, and anyone with red-green color blindness probably won’t be able to see which segment is which. Oh yes — what percentages of the sample answered “Yes” and “No”? You can guess that it’s around a third versus two thirds, but that’s not very precise.
The right-hand graph cures those problems. It’s now crystal clear which segment is Yes and which is No, and what proportion of the sample gave each answer. This actually lets you show more information in less space, a win-win. (Of course you wouldn’t use a vague term like “Opinions” — that’s just there to remind you to give your graph a title.)
Example 13: There’s no telling whether this one is deliberate deception or just incompetent graphing. An oatmeal company, which shall remain nameless, wanted to show that eating oatmeal for four weeks reduces cholesterol. The first graph makes a strong case — until you look at the scale on the vertical axis. (Don’t even think about wasting your time on a graph with no vertical scale.)
The scale doesn’t start at zero, so it makes differences look much bigger than they are. Your frequency or relative frequency scale must always start at zero (and you must show the zero). The second graph is properly drawn, and now you can see that the drop in cholesterol is only a slight one.
source: Misleading Graph [see “Sources Used” at end of book]
Example 14: It’s all very well to create visual interest, but not if it makes the reader misinterpret the graph.
In the left-hand graph, you can tell from the scale that B is supposed to be three times as large as A, but since it’s three times as high and three times as wide it’s actually nine times as large, giving the reader a distorted impression of the amount of difference. Even if your “bars” are pictures, they still have to be the same width. The corrected version is shown at right. (It’s still not quite correct, though, because 0 is not shown on the vertical axis.)
If you follow the rules in this chapter, you’ll make good, professional graphs. But there are plenty of other ways to make good graphs, depending on the data you’re trying to show.
There’s a classic picture book that can give you lots of good ideas. Edward Tufte’s The Visual Display of Quantitative Information has been around since 1983, and no one has yet done it any better. (Tufte has produced newer editions.)
Example 15: One famous graph in Tufte’s book is particularly stunning. Charles Minard wanted to present a lot of time-series data about Napoleon’s disastrous campaign in Russia in the winter of 1812–1813: where battles took place, numbers of casualties, temperature, and so forth. He elected to make a kind of stylized map showing just the rivers and the cities where events happened. (Niemen at the left is the Niemen River, Russia’s western border at the time. Moscow, “Moscou” in French, is as far east as Napoleon got.) Across that, Minard showed the army strength as a broad swath at the start that shrank to almost nothing by the end of the retreat westward. Below are dates of events, temperatures, and precipitation. It’s a huge amount of information on one piece of paper.
This tiny rendition doesn’t do it justice, but if you visit http://upload.wikimedia.org/wikipedia/commons/2/29/Minard.png you’ll see it at a better size. (Your browser may still reduce it to fit on your screen. Try clicking into the picture and you should see it at original size, though you’ll have to scroll around to see the details. It sounds like a lot of effort, but I promise you it’s worth it. Or just get the book, because it has plenty more!)
Example 16: Here’s one I ran across in my reading. It’s not the graph of the century like Minard’s, but it’s a cut above the usual. In Bear Attacks: Their Causes and Avoidance (2002), Stephen Herrero had the problem of contrasting bears’ diet in spring, summer, and fall. (Of course in winter they’re not eating.)
He could have drawn three pie charts, or a stacked bar graph, but instead he came up with a great alternative. (A larger version is at http://BrownMath.com/swtpic/chap02_beardiet.jpg.) Each component of diet is clearly labeled right in the graph, not in some legend off to the side, and the contrasting backgrounds make it a little more interesting visually. A stacked bar graph would convey the same information, but I like this presentation because it suggests that “spring”, “summer”, and “fall” are not completely separate but rather transition one into the next.
The vertical axis is clearly labeled, too. There’s no doubt what the numbers are (as opposed to some units of weight, for instance, or something more esoteric like pounds of feed per hundreds of pounds of bear).
He probably could have left the title off the category axis; after all, we know that the seasons are seasons, and the graph title also conveys that information. But that’s a minor point. My only real quibble with this graph is that the overall graph title at the bottom is too small.
Overview: With numeric data, the goal of descriptive stats is to show shape, center, spread, and outliers.
(The online book has live links to all of these.)
Be on the lookout for violations of this rule and other signs of bad graphs.
Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.
Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.
Would You Move to the US? | |
---|---|
Yes, with authorization | 154 |
Yes, without authorization | 204 |
No | 612 |
Don’t know | 30 |
The Pew Research Center (2013c) [see “Sources Used” at end of book] conducted a poll of 1000 adults in Mexico, asking whether they would move to the US if they had the means and opportunity to move. Draw a relative-frequency bar graph for their responses.
What’s wrong with this graph? (You should be able to see at least two problems, maybe more.)
(source: Misleading Graph [see “Sources Used” at end of book] in Wikipedia)
Professor Marvel had a statistics class of fifteen students, and on one 15-point quiz their scores were
10.5 13.5 8 12 11.3 9 9.5 5 15 2.5 10.5 7 11.5 10 10.5
Construct a frequency table and bar graph for their letter grades on the quiz, where 90% is the minimum for an A, 80% for a B, 70% for a C, and 60% for a D.
Deaths by Horse Kick in 14 Prussian Army Corps, 1875–1894 | |
---|---|
Number of Deaths | Frequency |
0 | 144 |
1 | 91 |
2 | 32 |
3 | 11 |
4 | 2 |
Total | 280 |
Bulmer (1979, 92) [see “Sources Used” at end of book] quotes an 1898 study of deaths by horse kick in the Prussian army. Von Bortkiewicz compiled the number of deaths in 14 Prussian Army corps over the 20-year period 1875–1894, as shown at right. (14 corps over 20 years gives 14×20 = 280 observations.) For example, there were 32 observations in which two officers died of horse kicks.
(a) What is the type of the variable?
(b) Construct an appropriate graph.
Commuting Distances in km | ||||
---|---|---|---|---|
5 | 15 | 23 | 12 | 9 |
12 | 22 | 26 | 31 | 21 |
11 | 19 | 16 | 45 | 12 |
8 | 26 | 18 | 17 | 1 |
16 | 24 | 15 | 20 | 17 |
In a GM factory in Brazil, 25 workers were asked their commuting distance in kilometers. Construct a stem-and-leaf plot.
—Adapted from Dabes and Janik (1999, 8) [see “Sources Used” at end of book]
Abigail asked a number of students their major. She found 35 in liberal arts, 10 in criminal justice, 25 in nursing, 45 in business, and 20 in other majors. What was the relative frequency of the nursing group, rounded to the nearest whole percent?
Bert asked his fellow students how many books they read for pleasure in a year. He found that most of them read 0, 1, or 2 books, but some read 3 or more and a very few read as many as 10. (He plotted the histogram shown at right.) Identify the shape of this distribution.
Test scores, x | Frequencies, f |
---|---|
470.0–479.9 | 15 |
480.0–489.9 | 22 |
490.0–499.9 | 29 |
500.0–509.9 | 50 |
510.0–519.9 | 38 |
At right is a grouped frequency distribution.
(a) Create a frequency histogram. (For a real quiz, you’d use graph paper, but you can freehand this one.)
(b) Find the class width.
(c) What’s the shape of this distribution?
Updates and new info: http://BrownMath.com/swt/