Analyzing Shapefile Data with PostgreSQL
This post, on how to analyze an Esri shapefile with PostgreSQL, is adapted from material in my book Practical SQL: A Beginner’s Guide to Storytelling with Data from No Starch Press.
Spend some time digging into geographic information systems (GIS) and soon enough you’ll encounter a shapefile. It’s a GIS file type developed by mapping software firm Esri for use in its popular ArcGIS platform. A shapefile contains the geometric information to describe a shape—a river, road, lake, or town boundary, for example—plus metadata about the shape, such as its name.
Because the shapefile has become a de facto standard for publishing GIS data, other applications and software libraries use shapefiles too, such as the open source QGIS.
While researching GIS topics for a chapter in my book, Practical SQL, I learned that it’s easy to import a shapefile into a PostGIS-enabled PostgreSQL database. The information that describes each shape is stored in a column of data type
geometry, and so you can run spatial queries to calculate area, distances, intersections of objects, and more.
Here’s a quick exercise, adapted from the book.
(I’m assuming you’re familiar with databases and SQL. To follow along, you’ll need to have installed PostgreSQL and the PostGIS extension.)
Setting up a PostGIS Database in PostgreSQL
To analyze an Esri shapefile with PostgreSQL, let’s start by creating a database and enabling PostGIS on it. Make sure PostgreSQL is running on your system and that you’re connected to the default
postgres database either via the command-line app
or in a graphical interface such as pgAdmin.
Run the following command to create a database:
CREATE DATABASE shapefile_test;
Once that executes, disconnect from the
database and connect to the new
shapefile_test database. Once connected, run this command:
CREATE EXTENSION postgis;
Congrats — now you have a GIS-ready database!
Loading a Shapefile using shp2pgsql on the Command Line
For this exercise, we’ll use a shapefile from the U.S. Census Bureau that contains the boundaries of all counties in the nation. You can download it from my GitHub repository that contains the code and data for Chapter 14 of Practical SQL. Download the file tl_2010_us_county10.zip and decompress it into a directory on your computer.
Inside the zip file, you’ll find five files, each with the same prefix but a different extension:
tl_2010_us_county10.dbf tl_2010_us_county10.prj tl_2010_us_county10.shp tl_2010_us_county10.shp.xml tl_2010_us_county10.shx
I cover the purpose of each file in Chapter 14 of Practical SQL; you can also read about shapefile components here. For now, we’ll concentrate on loading the data into a table, which will contain an indexed column that holds the geometry data describing the shape of each county. The table also will have columns containing metadata about each county.
To load the table, we’ll use the command-line utility
shp2pgsql, which should be included in your installation of PostGIS, and
psql, the standard command-line utility included with PostgreSQL.
shp2pgsql takes arguments (or “flags”) that let you specify numerous options for handling the shapefile and the import. Here’s our complete command-line statement:
shp2pgsql -I -s 4269 -W Latin1 tl_2010_us_county10.shp us_counties_2010_shp | psql -d shapefile_test -U postgres
Even though it wraps into two lines in this blog post, this is one continuous command. Here’s a breakdown of what it all means:
- After the utility name, the
-Iflag tells the command to create an index on the geometry column.
- The flag
-s 4269sets the Spatial Reference Identifier (SRID) to a value of 4269. The SRID specifies the coordinate system used for plotting the data; the Census Bureau often uses 4269, which specifies the NAD 83 system.
-W Latin1specifies the encoding for the data in the incoming shapefile. We’re using Latin-1 here because some of the nation’s counties contain characters with special characters in their names.
- We follow these flags with the name of the shapefile to import,
tl_2010_us_county10.shp, and the name of the target table, which the utility will create:
- Finally, we use a Linux pipe symbol to send the output of
shp2pgsqlinto the PostgreSQL psql utility
psql. We specify the name of the database with the
-dflag and the user name with the
-Uflag. You’ll receive a prompt to supply a database password once you run the utility.
Enter the complete statement on your computer’s command line and run it. You should see a bunch of
INSERT 0 1 statements scroll by before the operation completes. Now, your database should have a table named
us_counties_2010_shp with a column named
geom that contains the data describing the shape of each county. Now, let’s explore it.
Querying the Shapefile
Here are two quick examples of how you can query the geometry data in the shapefile you just imported, in the table
1. Find the largest counties in square miles using
ST_Area()function returns the area of a Polygon or MultiPolygon object. If you’re working with a
geography data type,
ST_Area()returns the result in square meters. With a
geometry data type, the function returns the area in SRID units. Typically, the units are not useful for practical analysis, but you can cast the
geometry data to
geography to obtain square meters. That’s what I do in this query:
SELECT name10, statefp10 AS st, round( ( ST_Area(geom::geography) / 2589988.110336 )::numeric, 2 ) AS square_miles FROM us_counties_2010_shp ORDER BY square_miles DESC LIMIT 5;
The geom column is data type
geometry, so to find the area in square meters, we cast the
geom column as a
geography data type using the double-colon syntax. Then, to get square miles, we divide the area by 2589988.110336, the number of square meters in a square mile.
I’ve wrapped the results it in a
round() function and named the resulting column
square_miles. This makes the results easier to read. Finally, we list the results in descending order from the largest area to the smallest and use
LIMIT 5 to show only the first five results:
┌──────────────────┬────┬──────────────┐ │ name10 │ st │ square_miles │ ├──────────────────┼────┼──────────────┤ │ Yukon-Koyukuk │ 02 │ 147805.08 │ │ North Slope │ 02 │ 94796.21 │ │ Bethel │ 02 │ 45504.36 │ │ Northwest Arctic │ 02 │ 40748.95 │ │ Valdez-Cordova │ 02 │ 40340.08 │ └──────────────────┴────┴──────────────┘
The five counties with the largest areas are all in Alaska, denoted by the state FIPS code 02.
2. Finding a county by longitude and latitude with
Much of the current mobile phone experience revolves around geolocation services using various means, such as your phone’s GPS, to find your longitude and latitude. Once your coordinates are known, they can be used in a spatial query to find which geography contains that point—and thus where you are.
You can do the same using your census shapefile and the
ST_Within() function, which returns
true if one geometry is inside another. Here’s an example using the longitude and latitude of downtown Hollywood:
SELECT name10, statefp10 FROM us_counties_2010_shp WHERE ST_Within('SRID=4269;POINT(-118.3419063 34.0977076)'::geometry, geom);
ST_Within() function inside the
WHERE clause requires two geometry inputs and checks whether the first is inside the second. For the function to work properly, both geometry inputs must have the same SRID.
In this example, the first input is an extended well-known text representation of a Point that includes the SRID 4269 (same as the census data), which is then cast as a
geometry type. The
ST_Within() function doesn’t accept a separate SRID input, so to set it for the supplied well-known text, you must prefix it to the string.
The second input is the
geomcolumn from the table. Run the query; you should see the following result:
┌─────────────┬───────────┐ │ name10 │ statefp10 │ ├─────────────┼───────────┤ │ Los Angeles │ 06 │ └─────────────┴───────────┘
The query shows that the Point you supplied is within Los Angeles county in California (state FIPS 06). This information is very handy, because by joining additional data to this table you can tell a person about demographics or points of interest near them. Try supplying your own longitude and latitude pairs to see which U.S. county they fall in.
There are many, many more ways to analyze data in shapefiles in PostGIS. One typical procedure involves joining two shapefiles to find places where objects intersect. In my book Practical SQL, for example, we join shapefiles containing rivers and roads to find where the two objects cross.
If you’ve enjoyed this brief tutorial, Practical SQL includes more on working with spatial data, including additional shapefile examples as well as general background on building, indexing, and query spatial data. Thanks for reading!