Mean vs. Median: A Beginner’s Guide
This post is adapted from material in my book Practical SQL: A Beginner’s Guide to Storytelling With Data from No Starch Press.
A common way to summarize a group of numbers — one most of us learned in grade school — is to find its mean, commonly called the average. But it’s not always the best measure. Often, the median is better.
Let’s say six kids go on a field trip, ages 10, 11, 10, 9, 13 and 12. It’s easy to add the ages and divide by six to get the group’s average age:
(10 + 11 + 10 + 9 + 13 + 12) / 6 ≈ 10.8
Because all the ages are close, the average of 10.8 gives us a good picture of the group as a whole. But averages are less helpful when the values are skewed toward one end or include outliers.
For example, what if we add a much older chaperone to our field trip? With ages of 10, 11, 10, 9, 13, 12 and 46, the average age of the group rises considerably:
(10 + 11 + 10 + 9 + 13 + 12 + 46) / 7 ≈ 15.9
Now the mean is no longer an accurate representation of the group: six of the seven people on the trip are younger than 15.9. The outlier skews the average, and no journalist should feel comfortable reporting it.
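To check both averages above, here is a minimal sketch in Python using the standard library's statistics module. The variable names are my own, chosen for illustration, and the results are rounded to one decimal place to match the figures in the text.

```python
from statistics import mean

# Ages of the six kids on the field trip
kids = [10, 11, 10, 9, 13, 12]

# The same group with the 46-year-old chaperone added
with_chaperone = kids + [46]

# Round to one decimal place, as in the text
kids_mean = round(mean(kids), 1)
group_mean = round(mean(with_chaperone), 1)

print(kids_mean)   # 10.8
print(group_mean)  # 15.9
```

A single added value moves the average by about five years, even though none of the kids got any older.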
This is where calculating a median is handy. The median is the midpoint in an ordered list of values — the point at which half the values are higher and half lower. If the median household income in East Middletownburg is $50,000, then half the households earn more and half less.
Using our field trip, we order the ages from lowest to highest:
9, 10, 10, 11, 12, 13, 46
The middle value is 11, and that’s the median. Half the values are higher, and half lower. If there had been an even number of values, we’d average the two middle values to find the median. For larger sets of numbers, you can use the MEDIAN function in Microsoft Excel.
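If you'd rather script it than use a spreadsheet, the same steps can be sketched in Python. This is only an illustration with variable names of my own; statistics.median handles the sorting and the even-count case for you.

```python
from statistics import median

# All seven ages, including the chaperone
ages = [10, 11, 10, 9, 13, 12, 46]

# Ordering the list makes the midpoint easy to see
ordered = sorted(ages)
print(ordered)  # [9, 10, 10, 11, 12, 13, 46]

# Odd number of values: the median is the single middle value
odd_median = median(ages)
print(odd_median)  # 11

# Even number of values (kids only): the median is the
# average of the two middle values, 10 and 11
even_median = median([10, 11, 10, 9, 13, 12])
print(even_median)  # 10.5
```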
Given this group, the median of 11 is a much better representation of the typical age than the average of 15.9. That’s what makes the median such a useful statistical measure. Scan financial news, and you’ll see medians reported frequently. Reports on housing prices often use medians because a few sales of McMansions in an otherwise modest ZIP code can make averages useless. The same goes for sports salaries, where one or two superstars can skew the results.
A good test: calculate the average and the median for a group of values. If they’re close, then the values are probably spread fairly evenly around the middle (as in the familiar bell curve), and the average is useful. If they’re far apart, then the values are skewed or include outliers, and the median is the better representation.
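Here is one way that test might look in Python, applied to the field trip ages. The helper name mean_median_gap is hypothetical, and the sketch deliberately sets no cutoff for "close": how big a gap matters depends on the data and is a judgment call.

```python
from statistics import mean, median

def mean_median_gap(values):
    """Return (mean, median, mean minus median) for a list of numbers.

    Hypothetical helper for illustration, not a standard function.
    """
    avg = mean(values)
    mid = median(values)
    return avg, mid, avg - mid

kids = [10, 11, 10, 9, 13, 12]
with_chaperone = kids + [46]

kids_avg, kids_mid, kids_gap = mean_median_gap(kids)
all_avg, all_mid, all_gap = mean_median_gap(with_chaperone)

# Kids only: the two measures nearly agree
print(round(kids_gap, 1))  # 0.3

# With the chaperone: the mean is pulled far above the median
print(round(all_gap, 1))   # 4.9
```

A large positive gap like the second one means a few high values are dragging the average up, which is the cue to report the median instead.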