Lessons From a Census Factory
After two months of processing Census data and writing about it here, I’m ready for a nice break. But before I go off to explore other topics, I thought I’d wrap this episode of Census 2010 with a look at how my teammates and I processed the data. My deepest thanks to my colleagues for doing such a great job. And many thanks to the journalists across the U.S. who offered encouragement as we shared our work with the journalism community.
* * * *
On a Thursday afternoon in the first week of February, three of us from our newsroom’s database team gathered at my computer and tried our best to subdue the butterflies swarming in our stomachs. What we were about to do, we hoped, would not only help us cover the year’s biggest demographic story but also help journalists across the country do the same.
That’s because weeks earlier, somewhere in the midst of poring over Census technical manuals and writing a few thousand lines of SAS code, we’d had a bright idea:
Let’s share this.
Let’s share the data we’re downloading, parsing and packaging up for ourselves. Let’s give it to Investigative Reporters and Editors to give to its members — our colleagues in the broader world of data journalism. Why? Because we knew all too well — having lived through one or more decennial Census reports — just how complex, painful and time consuming it could be to do a good job with the data. We could help. We’d consider this a goodwill gesture from our newsroom and a blessing for every journalist who’s been dreading having to deal with the Census on deadline. Our editors said yes.
So, there we were on a February afternoon: Me, Paul Overberg (who’s driven our Census coverage for years), and Barbara Hansen, our data team colleague who’s also a frequent collaborator on Census analysis. Four states were coming — Mississippi, Louisiana, Virginia and New Jersey. We had reporters, editors and graphic artists in our newsroom waiting. We had colleagues in all four states waiting.
I opened FileZilla and pulled the New Jersey file, unzipped it and fired up SAS.
“Ready?” I said.
Fast-forward to about two hours later — we had spreadsheets, we had maps, we had an interactive. And we had files up on IRE’s site and distributed to reporters in all four states. By the second day of releases, we were averaging about 20 minutes to process a state’s files and distribute them to each end point.
Two months later, feeling spent but happy, we’d ground through 50 states and D.C. and heard from enough people to believe we’d done some good.
In the process of preparing and executing our plan, we learned a whole lot about crafting a strategy for handling a complex data release. Here’s a little about how we did it and lessons I took away.
What we affectionately called our “Census Factory” was a series of applications and scripts. From end to end:
FileZilla: Free, open-source FTP client to download the zipped text files posted by the Census Bureau. (You can’t do anything till you download the files!)
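A graphical client like FileZilla works fine, but the download step itself is simple enough to script. Here’s a hedged sketch in Python’s standard ftplib — the host, path and filename are placeholders, not the Bureau’s actual directory layout, and the `ftp_factory` parameter exists only to make the function testable:

```python
from ftplib import FTP

def fetch_state_file(host, remote_path, local_path, ftp_factory=FTP):
    """Download one zipped state file over anonymous FTP.

    host and remote_path are illustrative placeholders -- use the
    directory layout the Census Bureau actually posts.
    """
    ftp = ftp_factory(host)  # connects on construction
    ftp.login()              # anonymous login
    try:
        with open(local_path, "wb") as f:
            ftp.retrbinary("RETR " + remote_path, f.write)
    finally:
        ftp.quit()
```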
SAS: Business analytics software to script the import, slicing and export of the data to CSV, Excel and Access formats. It’s pricey, but we are fortunate to have a license. The advantages of SAS are several: It’s very, very fast; you can embed SQL in your scripts; it has a function that easily merges the three text files Census provided; and it handily exports to multiple formats with little effort.
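That three-file merge is the heart of the job: each state’s release pairs a geographic header file with data segments, all sharing a logical record number (LOGRECNO). For readers without a SAS license, the same idea can be sketched in plain Python — note that the field position of LOGRECNO below is illustrative, not the real layout, which lives in the technical documentation:

```python
import csv

def load_segment(path, key_index):
    """Read one delimited Census file into a dict keyed by LOGRECNO.

    key_index is the position of LOGRECNO in that file's layout --
    illustrative here; the Census technical manual has the real ones.
    """
    rows = {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            rows[row[key_index]] = row
    return rows

def merge_on_logrecno(geo, seg1, seg2):
    """Join the geographic header with both data segments, keeping
    only records present in all three files."""
    merged = []
    for logrecno, geo_row in geo.items():
        if logrecno in seg1 and logrecno in seg2:
            merged.append(geo_row + seg1[logrecno] + seg2[logrecno])
    return merged
```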
The SAS scripts produced the CSV and Excel files we shared with journalists. They also provided the output for the next steps needed for our coverage:
ArcView: Mapping software by ESRI that we used to generate shapefiles for print maps and the interactive Flash map our colleague Juan Thomassie built. Paul also used it to build a 2000 tract-level data set comparable to the new 2010 tract boundaries. To load maps in ArcView on deadline, we used a SAS script to export a table to Microsoft Access, and Paul’s prebuilt maps pulled data from there.
SQL Server: We built a fully normalized data warehouse to store our processed data and to serve our Flash graphic and data tables via an API. Again, we used a SAS script to output, this time to Excel. For some reason, I’ve had better success importing Excel into SQL Server than plain text files, so I went with it.
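The post doesn’t detail the warehouse schema, but a minimal normalized layout of the kind described might look like this — sketched here with Python’s built-in SQLite rather than SQL Server, with made-up table and column names:

```python
import sqlite3

def build_warehouse(conn):
    """Create a toy normalized layout: one table of geographies and
    one long table of measures keyed to it. A stand-in for the real
    SQL Server schema, which the post doesn't describe."""
    conn.executescript("""
        CREATE TABLE geography (
            logrecno TEXT PRIMARY KEY,
            state    TEXT NOT NULL,
            name     TEXT NOT NULL
        );
        CREATE TABLE measure (
            logrecno TEXT NOT NULL REFERENCES geography(logrecno),
            field    TEXT NOT NULL,
            value    INTEGER NOT NULL,
            PRIMARY KEY (logrecno, field)
        );
    """)
```

Keeping the measures in one long, narrow table means new Census fields arrive as rows, not schema changes — handy when feedback (see below) has you adding fields mid-stream.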
We chose our tools for a few reasons — among them availability, organizational integration and familiarity. For example, we’d used SAS for the 2000 Census and saved our scripts, which we simply modified for the 2010 data. Nevertheless, everything could have been done (and was done by others) with open-source options such as Python for data parsing, QGIS for mapping, and various web frameworks (Django/Rails) for interactivity. I tend to avoid software holy wars. If you like it, use it.
What we learned
We moved a lot of data, and we moved it quickly. Here’s what helped us:
Diagramming the process: Early on, we sketched out a logical data flow. Who were our “customers,” and what would each need out of the hundreds of pieces of data available? I drew up a flow chart; it helped to visualize which parts of the process would have prior dependencies, and it helped us divide up tasks for preparing code and for each day’s processing.
Reading the documentation: The Census technical manuals were invaluable. Writing queries to join tables or select geographies was only possible by understanding the intricacies of the data, and that’s where we found the answers. Many of the struggles I saw journalists face during this round of the Census had their roots in skipping this step.
Writing documentation: Because we were sharing data outside our newsroom, we needed to document the output files’ fields. This helped us produce better files — it forced us to organize the output in a logical manner and weed out fields that were unneeded by most journalists.
Scripting tasks: Wherever possible, we wrote code to replace manual tasks, from dropping files in certain folders to updating server tables. In retrospect, I see several places in the data flow where I could have automated more and saved myself some keystrokes. Next time …
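The “dropping files in certain folders” step is a good example of a manual task worth scripting. A few lines of Python can fan processed files out to every destination — the folder names you’d pass in are your own; nothing here reflects our actual directory layout:

```python
import shutil
from pathlib import Path

def distribute(files, destinations):
    """Copy every processed file to every destination folder,
    creating folders as needed -- a scripted replacement for the
    drag-and-drop step."""
    for dest in destinations:
        d = Path(dest)
        d.mkdir(parents=True, exist_ok=True)
        for f in files:
            shutil.copy2(f, d / Path(f).name)  # copy2 keeps timestamps
```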
Getting feedback early and often: As soon as we had output files ready, we sent copies to some database editor friends and Census veterans. They gave valuable feedback on the content, and we wound up adding fields at their request. Once the releases began, we listened carefully to the journalists using our work. Occasionally, they identified gaffes in our documentation or, in one case, an issue with a state’s tract file. One colleague helped us identify a problem with our data import about an hour before the first state dropped. Having your work fact-checked by the brightest in the business is a huge benefit to sharing data.
Enlisting fresh eyes: After spending two months coding scripts and creating the factory, Paul and I asked our colleague Barbara Hansen to be a fresh set of eyes on our work. She ran our scripts from end to end, double-checked every calculation and file, and generally helped us avoid looking foolish. It was a huge help.
Even with all the prep, my palms were a tad moist that first day of processing in February. They grew moister still when I discovered mid-stream that one of my table joins needed an extra qualifier to work right. But all the prep made even that easy to fix. We all heaved a sigh of relief when the data streamed onto our web site.