From Demo to Production at the WSJ

The Wall Street Journal this month launched Talk2020, an app that collects transcripts of speeches and other election-related material and makes it all easily searchable. The app’s journey began on my work Macbook, when I coded up a demo Django/PostgreSQL Talk2020 app during bus rides home from the office after writing about rally speeches. It’s been fantastic watching that demo take on new life in the hands of colleagues.

You can learn more about how Talk2020 went from demo to production via this Medium post, which members of the WSJ Digital Experiences and Strategy team wrote to recap the process:

Mean vs. Median: A Beginner’s Guide

This post is adapted from material in my book Practical SQL: A Beginner’s Guide to Storytelling With Data from No Starch Press.


A common way to summarize a group of numbers — one most of us learned in grade school — is to find its mean, commonly called the average. But it’s not always the best measure.

Let’s say six kids go on a field trip, ages 10, 11, 10, 9, 13 and 12. It’s easy to add the ages and divide by six to get the group’s average age:
 

(10 + 11 + 10 + 9 + 13 + 12) / 6 = 10.8

Because all the ages are close, the average of 10.8 gives us a good picture of the group as a whole. But averages are less helpful when the values are skewed toward one end or if they include outliers.

For example, what if we add a much older chaperone to our field trip? With ages of 10, 11, 10, 9, 13, 12 and 46, the average age of the group rises considerably:
 

(10 + 11 + 10 + 9 + 13 + 12 + 46) / 7 = 15.9

Now the mean is not an accurate representation. The outlier skews the average, and no journalist should feel comfortable reporting it.

This is where calculating a median is handy. The median is the midpoint in an ordered list of values — the point at which half the values are higher and half lower. If the median household income in East Middletownburg is $50,000, then half the households earn more and half less.

Using our field trip, we order the ages from lowest to highest:
 

Continue…

JSON from Excel using Python openpyxl

Powering an application from a collection of static files rather than a call to an API or database server is well worth considering. Removing a production database and data access layer from the mix eliminates complexity and trims development time. Static files aren’t right for every situation, but for small apps they may be all you need.

In my days building back ends for interactive news applications, most of our apps consumed JSON. Simple apps ran off one or two static JSON files. The more complex used hundreds. Updating content was as easy as generating new JSON files and uploading them to a file server.

I’ve written about how to use Python to generate JSON files from a SQL database. But it’s just as easy to use Python to make JSON out of Excel spreadsheets. The key ingredient is the Python library openpyxl, which offers a collection of functions for reading and writing Excel .xlsx workbooks. You can add it to your Python environment by running pip install openpyxl.

Let’s look at how it works, then use with some Python to generate JSON.

Basic openpyxl functions

Let’s say we have an Excel workbook containing a small table repeated over three worksheets. The table in each case looks like this:

Here are some snippets of code — just scratching the surface — to interact with it programmatically:

# Demonstrates basic openpyxl functions for working with Excel files
# (.xlsx format)

from openpyxl import load_workbook

# load the workbook
wb = load_workbook('excel-xlrd-sample.xlsx')

# Print the sheet names
print(wb.sheetnames)

# Get a sheet by name
sheet = wb['Cars20']

# Iterate through sheet rows, returning each as a tuple:
for row in sheet.values:
    print(row)

# If you just want the first column:
for cell in sheet['A']:
    print(cell.value)

# Index individual cells:
cell_c4 = sheet['C4'].value
print(cell_c4)

# Or you can use:
cell_c4 = sheet.cell(row=4, column=3).value
print(cell_c4)

# Let's say you want the same cell every sheet in a workbook:
for sheet in wb.sheetnames:
    open_sheet = wb[sheet]
    cell_x = open_sheet['D2'].value
    print(cell_x)

From Excel to JSON

Pretty cool stuff. Now, let’s convert our sample spreadsheet to JSON. I’ll borrow some of the techniques I discussed when outlining how to use Python to build JSON from a SQL database:

import json
from collections import OrderedDict
from itertools import islice
from openpyxl import load_workbook

# Open the workbook and select a worksheet
wb = load_workbook('excel-xlrd-sample.xlsx')
sheet = wb['Cars20']

# List to hold dictionaries
cars_list = []

# Iterate through each row in worksheet and fetch values into dict
for row in islice(sheet.values, 1, sheet.max_row):
    cars = OrderedDict()
    cars['car-id'] = row[0]
    cars['make'] = row[1]
    cars['model'] = row[2]
    cars['miles'] = row[3]
    cars_list.append(cars)

# Serialize the list of dicts to JSON
j = json.dumps(cars_list)

# Write to file
with open('data.json', 'w') as f:
    f.write(j)

Here’s the breakdown: We open the workbook, select a sheet, and set up a blank list to hold rows of data. Then we use Python’s islice tool to iterate through the rows in the sheet, skipping the header and stopping at the max_row property of the sheet.

For each row, we add each cell to a key/value pair in a dictionary, then add each dictionary to a list. Then we dump the list to JSON and write to a file.

Of course, a spreadsheet this simple doesn’t need a Python script to make its way to JSON. Just use Mr. Data Converter. But as soon as your JSON requirements gain complexity, the ability to use Python to nest nodes, build strings and transform data on the fly make this approach very appealing.


Like what you see here? Please check out my book, Practical SQL: A Beginner’s Guide to Storytelling With Data, available from No Starch Press and through all fine booksellers.

Generate JSON From SQL Using Python

Author’s note: This post has been updated from its original 2012 version to use PostgreSQL and the Python psycopg2 library. Also, if you like what you read here, please check out my book Practical SQL: A Beginner’s Guide to Storytelling with Data from No Starch Press.


Let’s say you want to generate a few hundred — or even a thousand — flat JSON files from a SQL database. Maybe you want to power a data visualization but have neither the time nor the desire to spin up a server to dynamically generate the data. So, you want flat files, each one small for quick loading. And a lot of them.

A few lines of Python is all you need.

I’ve gone this route for a few data-driven visuals, creating JSON files out of large database tables. Python works well for this, with its JSON encoder/decoder offering a flexible set of tools for converting Python objects to JSON. Let’s see how this works.

The Data

I’m using PostgreSQL for this tutorial. Here’s a script that creates a table and fills it with two rows of data:

CREATE TABLE students_test (
    id integer,
    first_name text,
    last_name text,
    street text,
    city text,
    st text,
    zip text
);
    
INSERT INTO students_test VALUES
(1, 'Samantha', 'Baker', '9 Main St.', 'Hyde Park', 'NY', '12538'),
(2, 'Mark', 'Salomon', '3 Stavesacre Terrace', 'Dallas', 'TX', '75204');

The Script

Here’s an example Pythonscript that generates two JSON files from that query. One file contains JSON row arrays, and the other has JSON key-value objects. Below, we’ll walk through it step-by-step.

Continue…

Story hunting in birth, death data

Tracking the U.S. government’s annual count of births and deaths is one of my little obsessions. I keep annual totals in a spreadsheet and look forward to observing the trends with each new year of data.

This most basic of demographics can tell us much about a nation’s past—and its unfolding future.

For example, the CDC’s provisional 2018 U.S. birth data released in May 2019 showed that births in the U.S. dropped for a fourth year in a row, to the lowest level in 32 years. In a story for The Wall Street Journal, we encapsulated the trend and what demographers point to as likely causes: sharp declines in the teen birth rate, increased use of longer-acting contraceptives, and more women in the workforce delaying childbirth, among others.

Continue…