Mean vs. Median: A Beginner’s Guide

This post is adapted from material in my book Practical SQL: A Beginner’s Guide to Storytelling With Data from No Starch Press.


A common way to summarize a group of numbers — one most of us learned in grade school — is to find its mean, commonly called the average. But it’s not always the best measure.

Let’s say six kids go on a field trip, ages 10, 11, 10, 9, 13 and 12. It’s easy to add the ages and divide by six to get the group’s average age:
 

(10 + 11 + 10 + 9 + 13 + 12) / 6 = 10.8

Because all the ages are close, the average of 10.8 gives us a good picture of the group as a whole. But averages are less helpful when the values are skewed toward one end or if they include outliers.

For example, what if we add a much older chaperone to our field trip? With ages of 10, 11, 10, 9, 13, 12 and 46, the average age of the group rises considerably:
 

(10 + 11 + 10 + 9 + 13 + 12 + 46) / 7 = 15.9

Now the mean is not an accurate representation of the group: at 15.9, it is higher than the age of every child on the trip. The outlier skews the average, and no journalist should feel comfortable reporting it.

This is where calculating a median is handy. The median is the midpoint in an ordered list of values — the point at which half the values are higher and half lower. If the median household income in East Middletownburg is $50,000, then half the households earn more and half less.
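For readers who want to check the arithmetic, here is a minimal sketch in Python using the standard library's statistics module. The variable names are my own, and the ages are the ones from the field trip example.

```python
from statistics import mean, median

# Ages from the field trip example
kids = [10, 11, 10, 9, 13, 12]
with_chaperone = kids + [46]

# The mean jumps once the 46-year-old chaperone joins the group
print(round(mean(kids), 1))            # 10.8
print(round(mean(with_chaperone), 1))  # 15.9

# The median barely moves, because it depends on the middle of the
# sorted list rather than on the size of the largest value
print(median(kids))            # 10.5
print(median(with_chaperone))  # 11
```

With an even number of values, median() averages the two middle values, which is why the six kids come out at 10.5.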

Using our field trip, we order the ages from lowest to highest:
 

Continue…

Story hunting in birth, death data

Tracking the U.S. government’s annual count of births and deaths is one of my little obsessions. I keep annual totals in a spreadsheet and look forward to observing the trends with each new year of data.

This most basic of demographics can tell us much about a nation’s past—and its unfolding future.

For example, the CDC’s provisional 2018 U.S. birth data released in May 2019 showed that births in the U.S. dropped for a fourth year in a row, to the lowest level in 32 years. In a story for The Wall Street Journal, we encapsulated the trend and what demographers point to as likely causes: sharp declines in the teen birth rate, increased use of longer-acting contraceptives, and more women in the workforce delaying childbirth, among others.

Continue…

Packing The Stands at Nationals Park

Bryce Harper’s two home runs and Stephen Strasburg’s seven scoreless innings rightly earned the headlines in Monday’s opening-day win for the Washington Nationals. But the data journalist in me couldn’t help but want to apply a little percent change action to the proceedings.

So, I did, especially after I noticed in the boxscore that Nationals Park, on a Monday, was jammed to 108% capacity. A few minutes of research and Excel later, I had some findings:

  • The Nats’ opening day attendance of 45,274 was 11% higher than the team drew at last year’s home opener, when they beat the Reds 3-2 in 10 innings.
  • The attendance wasn’t a record for Nationals Park, but it was close — about 700 below the record set on the last, heartbreaking (if you’re a Nats fan) game of the 2012 NL division series against the St. Louis Cardinals.
  • The day was, however, a regular-season record for the park, which opened in 2008.
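
The percent change behind the first finding follows the usual formula: new value minus old value, divided by the old value. A minimal sketch in Python follows. The function name is hypothetical, and the figures are made up for illustration rather than taken from actual attendance counts.

```python
def percent_change(old, new):
    """Return the percent change from old to new."""
    if old == 0:
        raise ValueError("percent change is undefined when the old value is 0")
    return (new - old) / old * 100


# Made-up figures for illustration only, not real attendance data
print(round(percent_change(40_000, 44_400), 1))  # 11.0
```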

After I figured this out, I went a-Googling to see if anyone else had the same scoop. Didn’t find the percent change, but I did see a mention of the attendance record in a post on We Love DC and a mention on a MASN Sports blog that didn’t qualify it against the post-season record. Nothing from major sports media (please comment below if I missed some).

I’m not about to play a baseball writer (it’s one of the few jobs I haven’t had in journalism), but the basics of ballpark attendance get too little attention, I think. And yet the money flowing through the turnstiles means a lot for a team and a city, as does the mental boost for the players who hear the cheers.

Can the Nats keep it up? I’ll be watching this chart at baseball-reference.com.

 

Excel: Extract text with FIND and MID

Data analysis begins with usable data, and that means arranging every item in a data set into its own field where we can count, sort and otherwise test it out.

But what if you get a spreadsheet where the pieces of data are all packed in one field? Say, something like this (which I cobbled together from Major League Baseball data in honor of the Nationals’ first playoff appearance):

NAME: Sean Burnett POS: RP AGE: 30 WT: 200 BORN: Dunedin, FL SALARY: 2350000
NAME: Tyler Clippard POS: RP AGE: 27 WT: 200 BORN: Lexington, KY SALARY: 1650000
NAME: Ross Detwiler POS: SP AGE: 26 WT: 174 BORN: St. Louis, MO SALARY: 485000
NAME: Christian Garcia POS: RP AGE: 27 WT: 215 BORN: Miami, FL SALARY: N/A
NAME: Gio Gonzalez POS: SP AGE: 27 WT: 205 BORN: Hialeah, FL SALARY: 3335000
NAME: Mike Gonzalez POS: RP AGE: 34 WT: 215 BORN: Robstown, TX SALARY: N/A
NAME: Ryan Mattheus POS: RP AGE: 28 WT: 215 BORN: Sacramento, CA SALARY: 481000
NAME: Craig Stammen POS: RP AGE: 28 WT: 200 BORN: Coldwater, OH SALARY: 485000
NAME: Drew Storen POS: RP AGE: 25 WT: 180 BORN: Indianapolis, IN SALARY: 498750
NAME: Jordan Zimmermann POS: SP AGE: 26 WT: 218 BORN: Auburndale, WI SALARY: 2300000

Let’s say you want to extract the city of each player’s birth into a separate column. The varying length of each player’s name means the birth place isn’t always in the same position in the string, so a typical text-to-columns operation won’t work. So, how to do it?

The answer lies in two very handy Excel functions: FIND and MID.

FIND locates the characters you specify and returns their numeric position in the string.

MID returns a set number of characters from a string, beginning at a position you specify.

For example, we can locate the position where each city name begins by using FIND to locate the string “BORN:” in each cell. The city name itself always starts six characters after the position of that string, so we add six to the result:

=FIND("BORN:",A2)+6

In the first row above, the function returns 50. In the second row, 52. We’ll feed that value to the MID function as the starting point for our extraction.

MID takes three arguments: Text or cell, position to start, number of characters to return. So, we use the above FIND function as the second argument and, for now, extract 10 characters:

=MID(A2,FIND("BORN:",A2)+6,10)

That gets us part of the way there. We’re starting in the right spot, but 10 characters isn’t always the length of the city and state, so it leads to choppy results:

Dunedin, F
Lexington,
St. Louis,
Miami, FL 
Hialeah, F
Robstown, 
Sacramento
Coldwater,
Indianapol
Auburndale
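
For readers working outside Excel, the same two steps can be sketched in Python. The helpers find and mid below are hypothetical stand-ins I wrote to mimic Excel's 1-based FIND and MID; they are not part of any library.

```python
def find(needle, text):
    """Mimic Excel FIND: return the 1-based position of needle in text."""
    pos = text.find(needle)  # Python positions are 0-based; -1 means not found
    if pos == -1:
        raise ValueError(f"{needle!r} not found")
    return pos + 1


def mid(text, start, num_chars):
    """Mimic Excel MID: num_chars characters beginning at 1-based start."""
    return text[start - 1 : start - 1 + num_chars]


row = "NAME: Sean Burnett POS: RP AGE: 30 WT: 200 BORN: Dunedin, FL SALARY: 2350000"

start = find("BORN:", row) + 6
print(start)                # 50
print(mid(row, start, 10))  # Dunedin, F
```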

What we need to do is tell MID the exact number of characters to return each time even though the length of the city varies. We can figure this out using FIND again.

The city name is always followed by the word “SALARY”. So, if we search for the position of that word and subtract the position of “BORN,” we’ll get the length of what’s between the two. The ultimate formula looks like spaghetti but works just fine:
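
The subtraction idea can be sketched in Python as well. This is an illustration of the approach described above, not a translation of the Excel formula, and the function name birthplace is my own.

```python
def birthplace(row):
    """Return the text between the BORN: and SALARY labels."""
    start = row.index("BORN:") + 6       # 0-based index where the city begins
    salary = row.index("SALARY", start)  # 0-based index of the next label
    length = salary - start              # includes the space before SALARY
    return row[start : start + length].strip()


rows = [
    "NAME: Sean Burnett POS: RP AGE: 30 WT: 200 BORN: Dunedin, FL SALARY: 2350000",
    "NAME: Tyler Clippard POS: RP AGE: 27 WT: 200 BORN: Lexington, KY SALARY: 1650000",
    "NAME: Ross Detwiler POS: SP AGE: 26 WT: 174 BORN: St. Louis, MO SALARY: 485000",
]

for row in rows:
    print(birthplace(row))
# Dunedin, FL
# Lexington, KY
# St. Louis, MO
```

Because the length is recomputed for every row, names and cities of any length come out whole.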

Continue…

Which web browsers do journalists favor?

After I started playing with Internet Explorer 9 tonight — and knowing that most developers, including Microsoft, want to wean the world from IE6 as soon as possible — I grew curious about the browsers favored by my site’s visitors. A quick dig into Google Analytics gave me the data for the last few months, and the Google Charts API let me build a quick pie:

Site visits by browser, November 2010-March 2011

I can’t know for sure, but I suspect that most people who read my site are journalists or developers. Most traffic comes from links I post on Twitter or via search keywords that tend toward journalism, data, math and, lately, the Census.

Generally, you’re not an IE-centric crowd: just 12%. That’s lower than overall metrics, which tend to place Internet Explorer at 40% or more of the overall market.

Oh, and the percent using IE6? Less than 0.4%.