Mean vs. Median: A Beginner’s Guide

This post is adapted from material in my book Practical SQL: A Beginner’s Guide to Storytelling With Data from No Starch Press.


A common way to summarize a group of numbers — one most of us learned in grade school — is to find its mean, commonly called the average. But it’s not always the best measure.

Let’s say six kids go on a field trip, ages 10, 11, 10, 9, 13 and 12. It’s easy to add the ages and divide by six to get the group’s average age:
 

(10 + 11 + 10 + 9 + 13 + 12) / 6 = 10.8

Because all the ages are close, the average of 10.8 gives us a good picture of the group as a whole. But averages are less helpful when the values are skewed toward one end or if they include outliers.

For example, what if we add a much older chaperone to our field trip? With ages of 10, 11, 10, 9, 13, 12 and 46, the average age of the group rises considerably:
 

(10 + 11 + 10 + 9 + 13 + 12 + 46) / 7 = 15.9

Now the mean is not an accurate representation. The outlier skews the average, and no journalist should feel comfortable reporting it.

This is where calculating a median is handy. The median is the midpoint in an ordered list of values — the point at which half the values are higher and half lower. If the median household income in East Middletownburg is $50,000, then half the households earn more and half less.

Using our field trip, we order the ages from lowest to highest:
 

Continue…

JSON from Excel using Python openpyxl

Powering an application from a collection of static files rather than a call to an API or database server is well worth considering. Removing a production database and data access layer from the mix eliminates complexity and trims development time. Static files aren’t right for every situation, but for small apps they may be all you need.

In my days building back ends for interactive news applications, most of our apps consumed JSON. Simple apps ran off one or two static JSON files. The more complex used hundreds. Updating content was as easy as generating new JSON files and uploading them to a file server.

I’ve written about how to use Python to generate JSON files from a SQL database. But it’s just as easy to use Python to make JSON out of Excel spreadsheets. The key ingredient is the Python library openpyxl, which offers a collection of functions for reading and writing Excel .xlsx workbooks. You can add it to your Python environment by running pip install openpyxl.

Let’s look at how it works, then use with some Python to generate JSON.

Basic openpyxl functions

Let’s say we have an Excel workbook containing a small table repeated over three worksheets. The table in each case looks like this:

Here are some snippets of code — just scratching the surface — to interact with it programmatically:

# Demonstrates basic openpyxl functions for working with Excel files
# (.xlsx format)

from openpyxl import load_workbook

# load the workbook
wb = load_workbook('excel-xlrd-sample.xlsx')

# Print the sheet names
print(wb.sheetnames)

# Get a sheet by name
sheet = wb['Cars20']

# Iterate through sheet rows, returning each as a tuple:
for row in sheet.values:
    print(row)

# If you just want the first column:
for cell in sheet['A']:
    print(cell.value)

# Index individual cells:
cell_c4 = sheet['C4'].value
print(cell_c4)

# Or you can use:
cell_c4 = sheet.cell(row=4, column=3).value
print(cell_c4)

# Let's say you want the same cell every sheet in a workbook:
for sheet in wb.sheetnames:
    open_sheet = wb[sheet]
    cell_x = open_sheet['D2'].value
    print(cell_x)

From Excel to JSON

Pretty cool stuff. Now, let’s convert our sample spreadsheet to JSON. I’ll borrow some of the techniques I discussed when outlining how to use Python to build JSON from a SQL database:

import json
from collections import OrderedDict
from itertools import islice
from openpyxl import load_workbook

# Open the workbook and select a worksheet
wb = load_workbook('excel-xlrd-sample.xlsx')
sheet = wb['Cars20']

# List to hold dictionaries
cars_list = []

# Iterate through each row in worksheet and fetch values into dict
for row in islice(sheet.values, 1, sheet.max_row):
    cars = OrderedDict()
    cars['car-id'] = row[0]
    cars['make'] = row[1]
    cars['model'] = row[2]
    cars['miles'] = row[3]
    cars_list.append(cars)

# Serialize the list of dicts to JSON
j = json.dumps(cars_list)

# Write to file
with open('data.json', 'w') as f:
    f.write(j)

Here’s the breakdown: We open the workbook, select a sheet, and set up a blank list to hold rows of data. Then we use Python’s islice tool to iterate through the rows in the sheet, skipping the header and stopping at the max_row property of the sheet.

For each row, we add each cell to a key/value pair in a dictionary, then add each dictionary to a list. Then we dump the list to JSON and write to a file.

Of course, a spreadsheet this simple doesn’t need a Python script to make its way to JSON. Just use Mr. Data Converter. But as soon as your JSON requirements gain complexity, the ability to use Python to nest nodes, build strings and transform data on the fly make this approach very appealing.


Like what you see here? Please check out my book, Practical SQL: A Beginner’s Guide to Storytelling With Data, available from No Starch Press and through all fine booksellers.

Get JSON from Excel using Python, xlrd

Author’s note: This post is out of date! The xlrd library it covers is no longer maintained. Please see the post JSON from Excel using Python openpyxl for a more current approach.


Powering interactive news applications off flat files rather than a call to a database server is an option worth considering. Cutting a production database and data access layer out of the mix eliminates a whole slice of complexity and trims development time. Flat files aren’t right for every situation, but for small apps they’re often all you need.

These days, most of the apps I help build at Gannett Digital consume JSON. Simpler apps — such as the table/modal displays we deployed in February for our Oscar Scorecard and Princeton Review Best Value Colleges — run off one or two JSON files. The more complex — e.g., last year’s Ghost Factories: Poison in the Ground — run off hundreds of them. Updating content is as easy as generating new JSON and uploading it to our file server.

I wrote last year how to use Python to generate JSON files from a SQL database. Lately, I’ve been using Python to make JSON out of Excel spreadsheets.

The key ingredient is the Python library xlrd. It’s part of a suite of Excel-related tools available from www.python-excel.org, including the xlwt library to create Excel files.

(Another choice is openpyxl, which has similar features and works with newer .xlsx formatted Excel files. I’ve used both with equal success. Take your pick.)

Basic xlrd operations

Let’s say we have an Excel workbook containing a small table repeated over three worksheets. The table in each case looks like this:

table

Here are some snippets of code — just scratching the surface — to interact with it programmatically:

# Demonstrates basic xlrd functions for working with Excel files
# (Excel 97-2003)

import xlrd

# Open the workbook
wb = xlrd.open_workbook('excel-xlrd-sample.xls')

# Print the sheet names
print wb.sheet_names()

# Get the first sheet either by index or by name
sh = wb.sheet_by_index(0)

# Iterate through rows, returning each as a list that you can index:
for rownum in range(sh.nrows):
    print sh.row_values(rownum)

# If you just want the first column:
first_column = sh.col_values(0)
print first_column

# Index individual cells:
cell_c4 = sh.cell(3, 2).value
# Or you can use:
#cell_c4 = sh.cell(rowx=3, colx=2).value
print cell_c4

# Let's say you want the same cell from x identical sheets in a workbook:
x = 2
while x >= 0:
    sh = wb.sheet_by_index(x)
    cell_x = sh.cell(2, 3).value
    print cell_x
    x = x - 1

From Excel to JSON

Pretty cool stuff. Now, let’s convert our sample spreadsheet to JSON. I’ll borrow some of the techniques I discussed when outlining how to use Python to build JSON from a SQL database:

import xlrd
from collections import OrderedDict
import simplejson as json

# Open the workbook and select the first worksheet
wb = xlrd.open_workbook('excel-xlrd-sample.xls')
sh = wb.sheet_by_index(0)

# List to hold dictionaries
cars_list = []

# Iterate through each row in worksheet and fetch values into dict
for rownum in range(1, sh.nrows):
    cars = OrderedDict()
    row_values = sh.row_values(rownum)
    cars['car-id'] = row_values[0]
    cars['make'] = row_values[1]
    cars['model'] = row_values[2]
    cars['miles'] = row_values[3]

    cars_list.append(cars)

# Serialize the list of dicts to JSON
j = json.dumps(cars_list)

# Write to file
with open('data.json', 'w') as f:
    f.write(j)

Here’s the breakdown: We open the workbook, select the sheet and iterate through the available rows (which xlrd conveniently counts using its nrows method).

Add each cell to a key/value pair in a dictionary, then add each dictionary to a list. Dump the list to JSON and write to a file.

Of course, a spreadsheet this simple doesn’t need a Python script to make its way to JSON. Just use Mr. Data Converter for something like this. But as soon as your JSON requirements gain complexity, the ability to use Python to nest nodes, build strings and transform data on the fly make this approach very appealing.

Learn data analysis with SQL!
If you like what you read here, check out my book Practical SQL: A Beginner’s Guide to Storytelling with Data from No Starch Press.

Excel: Extract text with FIND and MID

Data analysis begins with usable data, and that means arranging every item in a data set into its own field where we can count, sort and otherwise test it out.

But what if you get a spreadsheet where the pieces of data are all packed in one field? Say, something like this (which I cobbled together from Major League Baseball data in honor of the Nationals’ first playoff appearance):

NAME: Sean Burnett POS: RP AGE: 30 WT: 200 BORN: Dunedin, FL SALARY: 2350000
NAME: Tyler Clippard POS: RP AGE: 27 WT: 200 BORN: Lexington, KY SALARY: 1650000
NAME: Ross Detwiler POS: SP AGE: 26 WT: 174 BORN: St. Louis, MO SALARY: 485000
NAME: Christian Garcia POS: RP AGE: 27 WT: 215 BORN: Miami, FL SALARY: N/A
NAME: Gio Gonzalez POS: SP AGE: 27 WT: 205 BORN: Hialeah, FL SALARY: 3335000
NAME: Mike Gonzalez POS: RP AGE: 34 WT: 215 BORN: Robstown, TX SALARY: N/A
NAME: Ryan Mattheus POS: RP AGE: 28 WT: 215 BORN: Sacramento, CA SALARY: 481000
NAME: Craig Stammen POS: RP AGE: 28 WT: 200 BORN: Coldwater, OH SALARY: 485000
NAME: Drew Storen POS: RP AGE: 25 WT: 180 BORN: Indianapolis, IN SALARY: 498750
NAME: Jordan Zimmermann POS: SP AGE: 26 WT: 218 BORN: Auburndale, WI SALARY: 2300000

Let’s say you want to extract the city of each player’s birth into a separate column. The varying length of each player’s name means the birth place isn’t always in the same position in the string, so a typical text-to-columns operation won’t work. So, how to do it?

The answer lies in two very handy Excel functions: FIND and MID.

FIND locates characters you specify and returns its numeric place in the string.

MID returns X characters from a string beginning at a location X you specify.

For example, we can locate the position where each city name begins by using FIND to locate the string “BORN:” in each cell. The city name itself always starts six characters after the position of that string, so we add six to the result:

=FIND("BORN:",A2)+6

In the first row above, the functions returns 50. In the second row, 52. We’ll feed that value to the MID function as the starting point for our extraction.

MID takes three arguments: Text or cell, position to start, number of characters to return. So, we use the above FIND function as the second argument and, for now, extract 10 characters:

=MID(A2,FIND("BORN:",A2)+6,10)

That gets us part of the way there. We’re starting in the right spot, but 10 characters isn’t always the length of the city and state, so it leads to choppy results:

Dunedin, F
Lexington,
St. Louis,
Miami, FL 
Hialeah, F
Robstown, 
Sacramento
Coldwater,
Indianapol
Auburndale

What we need to do is tell MID the exact number of characters to return each time even though the length of the city varies. We can figure this out using FIND again.

The city name is always followed by the word “SALARY”. So, if we search for the position of that word and subtract the position of “BORN,” we’ll get the length of what’s between the two. The ultimate formula looks like spaghetti but works just fine:

Continue…

Sorting Data in Excel: Simple Analysis

Sorting a data set helps answer a basic question journalists like to ask: “Which ____ has the highest (or lowest) ______?”

Excel (and other spreadsheets such as the open source Calc) make sorting data easy. In fact, I often make sorting my first step when “interviewing” data because it quickly reveals high and low values and often highlights some that may seem questionable.

Let’s work through a simple sort in Excel. I’ll be using Excel 2007, but older versions have similar functions. Start by downloading the file “sorting.xls” and saving it to your computer. Open it and follow along:

1. We have a table of Census data from the 2006-2008 American Community Survey. It shows the median age of the population for each of 79 school districts in Virginia plus the state itself.

We want to know which district has the oldest and youngest populations. Let’s sort it!

2. Click once on one cell anywhere in the table. This will help Excel auto-discover your table in the next step.

Continue…