Mean vs. Median: A Beginner’s Guide

This post is adapted from material in my book Practical SQL: A Beginner’s Guide to Storytelling With Data from No Starch Press.


A common way to summarize a group of numbers — one most of us learned in grade school — is to find its mean, commonly called the average. But it’s not always the best measure.

Let’s say six kids go on a field trip, ages 10, 11, 10, 9, 13 and 12. It’s easy to add the ages and divide by six to get the group’s average age:
 

(10 + 11 + 10 + 9 + 13 + 12) / 6 = 10.8

Because all the ages are close, the average of 10.8 gives us a good picture of the group as a whole. But averages are less helpful when the values are skewed toward one end or if they include outliers.

For example, what if we add a much older chaperone to our field trip? With ages of 10, 11, 10, 9, 13, 12 and 46, the average age of the group rises considerably:
 

(10 + 11 + 10 + 9 + 13 + 12 + 46) / 7 = 15.9

Now the mean is not an accurate representation. The outlier skews the average, and no journalist should feel comfortable reporting it.

This is where calculating a median is handy. The median is the midpoint in an ordered list of values — the point at which half the values are higher and half lower. If the median household income in East Middletownburg is $50,000, then half the households earn more and half less.

Using our field trip, we order the ages from lowest to highest:
 

Continue…

JSON from Excel using Python openpyxl

Powering an application from a collection of static files rather than a call to an API or database server is well worth considering. Removing a production database and data access layer from the mix eliminates complexity and trims development time. Static files aren’t right for every situation, but for small apps they may be all you need.

In my days building back ends for interactive news applications, most of our apps consumed JSON. Simple apps ran off one or two static JSON files. The more complex used hundreds. Updating content was as easy as generating new JSON files and uploading them to a file server.

I’ve written about how to use Python to generate JSON files from a SQL database. But it’s just as easy to use Python to make JSON out of Excel spreadsheets. The key ingredient is the Python library openpyxl, which offers a collection of functions for reading and writing Excel .xlsx workbooks. You can add it to your Python environment by running pip install openpyxl.

Let’s look at how it works, then use with some Python to generate JSON.

Basic openpyxl functions

Let’s say we have an Excel workbook containing a small table repeated over three worksheets. The table in each case looks like this:

Here are some snippets of code — just scratching the surface — to interact with it programmatically:

# Demonstrates basic openpyxl functions for working with Excel files
# (.xlsx format)

from openpyxl import load_workbook

# load the workbook
wb = load_workbook('excel-xlrd-sample.xlsx')

# Print the sheet names
print(wb.sheetnames)

# Get a sheet by name
sheet = wb['Cars20']

# Iterate through sheet rows, returning each as a tuple:
for row in sheet.values:
    print(row)

# If you just want the first column:
for cell in sheet['A']:
    print(cell.value)

# Index individual cells:
cell_c4 = sheet['C4'].value
print(cell_c4)

# Or you can use:
cell_c4 = sheet.cell(row=4, column=3).value
print(cell_c4)

# Let's say you want the same cell every sheet in a workbook:
for sheet in wb.sheetnames:
    open_sheet = wb[sheet]
    cell_x = open_sheet['D2'].value
    print(cell_x)

From Excel to JSON

Pretty cool stuff. Now, let’s convert our sample spreadsheet to JSON. I’ll borrow some of the techniques I discussed when outlining how to use Python to build JSON from a SQL database:

import json
from collections import OrderedDict
from itertools import islice
from openpyxl import load_workbook

# Open the workbook and select a worksheet
wb = load_workbook('excel-xlrd-sample.xlsx')
sheet = wb['Cars20']

# List to hold dictionaries
cars_list = []

# Iterate through each row in worksheet and fetch values into dict
for row in islice(sheet.values, 1, sheet.max_row):
    cars = OrderedDict()
    cars['car-id'] = row[0]
    cars['make'] = row[1]
    cars['model'] = row[2]
    cars['miles'] = row[3]
    cars_list.append(cars)

# Serialize the list of dicts to JSON
j = json.dumps(cars_list)

# Write to file
with open('data.json', 'w') as f:
    f.write(j)

Here’s the breakdown: We open the workbook, select a sheet, and set up a blank list to hold rows of data. Then we use Python’s islice tool to iterate through the rows in the sheet, skipping the header and stopping at the max_row property of the sheet.

For each row, we add each cell to a key/value pair in a dictionary, then add each dictionary to a list. Then we dump the list to JSON and write to a file.

Of course, a spreadsheet this simple doesn’t need a Python script to make its way to JSON. Just use Mr. Data Converter. But as soon as your JSON requirements gain complexity, the ability to use Python to nest nodes, build strings and transform data on the fly make this approach very appealing.


Like what you see here? Please check out my book, Practical SQL: A Beginner’s Guide to Storytelling With Data, available from No Starch Press and through all fine booksellers.

Generate JSON From SQL Using Python

Author’s note: This post has been updated from its original 2012 version to use PostgreSQL and the Python psycopg2 library. Also, if you like what you read here, please check out my book Practical SQL: A Beginner’s Guide to Storytelling with Data from No Starch Press.


Let’s say you want to generate a few hundred — or even a thousand — flat JSON files from a SQL database. Maybe you want to power a data visualization but have neither the time nor the desire to spin up a server to dynamically generate the data. So, you want flat files, each one small for quick loading. And a lot of them.

A few lines of Python is all you need.

I’ve gone this route for a few data-driven visuals, creating JSON files out of large database tables. Python works well for this, with its JSON encoder/decoder offering a flexible set of tools for converting Python objects to JSON. Let’s see how this works.

The Data

I’m using PostgreSQL for this tutorial. Here’s a script that creates a table and fills it with two rows of data:

CREATE TABLE students_test (
    id integer,
    first_name text,
    last_name text,
    street text,
    city text,
    st text,
    zip text
);
    
INSERT INTO students_test VALUES
(1, 'Samantha', 'Baker', '9 Main St.', 'Hyde Park', 'NY', '12538'),
(2, 'Mark', 'Salomon', '3 Stavesacre Terrace', 'Dallas', 'TX', '75204');

The Script

Here’s an example Pythonscript that generates two JSON files from that query. One file contains JSON row arrays, and the other has JSON key-value objects. Below, we’ll walk through it step-by-step.

Continue…

Story hunting in birth, death data

Tracking the U.S. government’s annual count of births and deaths is one of my little obsessions. I keep annual totals in a spreadsheet and look forward to observing the trends with each new year of data.

This most basic of demographics can tell us much about a nation’s past—and its unfolding future.

For example, the CDC’s provisional 2018 U.S. birth data released in May 2019 showed that births in the U.S. dropped for a fourth year in a row, to the lowest level in 32 years. In a story for The Wall Street Journal, we encapsulated the trend and what demographers point to as likely causes: sharp declines in the teen birth rate, increased use of longer-acting contraceptives, and more women in the workforce delaying childbirth, among others.

Continue…

Setting up Python in Windows 10

Installing Python under Windows 10 is fairly easy as long as you set up your system environment correctly. Here’s your quick guide:

  1. Visit the official Python download page and grab the Windows installer for the latest version of Python 3.
  2. Navigate to the folder where you saved the installation file. Right-click the installer and select “Run as Administrator.” Click “Yes” when Windows asks if you want the program to make changes to your computer. 
  3. The next dialog asks whether you want to “Install Now” or “Customize Installation.” First, check the boxes that say “Install launcher for all users” and “Add Python 3.8 to PATH”. Then, choose “Customize Installation.”
  4. On the next screen, check all boxes under “Optional Features.” Click next.
  5. Under “Advanced Options,” set the location where you want to install Python. I like to use a location I can easily find, such as:
C:\Python38

That will remind me that I’ve installed Python 3.8. (Note that you may need to make a folder if you want to use a location that doesn’t exist yet.)

Also under “Advanced Options,” check the following boxes:

  • “Install for all users”
  • “Associate files with Python”
  • “Create shortcuts for installed applications”
  • “Add Python to environment variables”
  • “Precompile standard library”
  1. Click through the dialog to finish the installation and exit the installer.

Testing Your Install

Now, to launch the Python interpreter from the command line, you can open a command prompt (Start Menu > Windows System > Command Prompt) and type:

python

That will load the Python interpreter:

Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:43:08) [MSC v.1926 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>

Because of the settings you chose during the install, you can now run this interpreter — and, more important, a script — from any directory on your system.

Type exit() and hit Return to exit the interpreter and get back to a C: prompt.

Optional: Set up useful Python packages

Python 3 comes with the package installer pip already in place, which makes it super easy to add useful packages to your Python installation. The syntax is this (replace some_package with a package name you want to install):

pip install package_name_here

Let’s add a couple of must-have utilities for web scraping: Requests and BeautifulSoup. You can use pip to install them all with one command:

pip install beautifulsoup4 requests

csvkit, which I covered here, is a great tool for dealing with comma-delimited text files. Add it:

pip install csvkit

You’re now set to get started using and learning Python under Windows 10. If you’re looking for a guide, start with the Official Python tutorial.


Like What You Read Here?
My book Practical SQL: A Beginner’s Guide to Storytelling with Data from No Starch Press offers a step-by-step guide learning SQL and working with relational databases. Learn how to wrangle the everyday data you encounter to gain meaningful insights.