Analyzing Shapefile Data with PostgreSQL

This is one in a series of posts adapted from material in the book Practical SQL.

Spend some time digging into geographic information systems (GIS) and soon enough you’ll encounter a shapefile. It’s a GIS file type developed by mapping software firm Esri for use in its popular ArcGIS platform. A shapefile contains the geometric information to describe a shape—a river, road, lake, or town boundary, for example—plus metadata about the shape, such as its name.

Because the shapefile has become a de facto standard for publishing GIS data, other applications and software libraries use shapefiles too, such as the open source QGIS.

While researching GIS topics for a chapter in my book, Practical SQL, I learned that it’s easy to import a shapefile into a PostGIS-enabled PostgreSQL database. The information that describes each shape is stored in a column of data type geometry, and so you can run spatial queries to calculate area, distances, intersections of objects, and more.

Here’s a quick exercise, adapted from the book.

(I’m assuming you’re familiar with databases and SQL. To follow along, you’ll need to have installed PostgreSQL and the PostGIS extension.)

Setting up a PostGIS Database in PostgreSQL

Let’s start by creating a database and enabling PostGIS on it. Make sure PostgreSQL is running on your system and that you’re connected to the default postgres database either via the command-line app psql or in a graphical interface such as pgAdmin.

Run the following command to create a database:

CREATE DATABASE shapefile_test;

Once that executes, disconnect from the postgres database and connect to the new shapefile_test database. Once connected, run this command:

CREATE EXTENSION postgis;

Congrats — now you have a GIS-ready database!

Loading a Shapefile using shp2pgsql on the Command Line

For this exercise, we’ll use a shapefile from the U.S. Census Bureau that contains the boundaries of all counties in the nation. You can download it from my GitHub repository that contains the code and data for Chapter 14 of Practical SQL. Download the file tl_2010_us_county10.zip and decompress it into a directory on your computer.

Inside the zip file, you’ll find five files, each with the same prefix but a different extension:

tl_2010_us_county10.dbf
tl_2010_us_county10.prj
tl_2010_us_county10.shp
tl_2010_us_county10.shp.xml
tl_2010_us_county10.shx

I cover the purpose of each file in Chapter 14 of Practical SQL; you can also read about shapefile components here. For now, we’ll concentrate on loading the data into a table, which will contain an indexed column that holds the geometry data describing the shape of each county. The table also will have columns containing metadata about each county.

To load the table, we’ll use the command-line utility shp2pgsql, which should be included in your installation of PostGIS, and psql, the standard command-line utility included with PostgreSQL. shp2pgsql takes arguments (or “flags”) that let you specify numerous options for handling the shapefile and the import. Here’s our complete command-line statement:

shp2pgsql -I -s 4269 -W Latin1 tl_2010_us_county10.shp us_counties_2010_shp | psql -d shapefile_test -U postgres

Even though it wraps into two lines in this blog post, this is one continuous command. Here’s a breakdown of what it all means:

  • After the utility name, the -I flag tells the command to create an index on the geometry column.
  • The flag -s 4269 sets the Spatial Reference Identifier (SRID) to a value of 4269. The SRID specifies the coordinate system used for plotting the data; the Census Bureau often uses 4269, which specifies the NAD 83 system.
  • Next, -W Latin1 specifies the encoding for the data in the incoming shapefile. We’re using Latin-1 here because some of the nation’s counties contain characters with special characters in their names.
  • We follow these flags with the name of the shapefile to import, tl_2010_us_county10.shp, and the name of the target table, which the utility will create: us_counties_2010_shp
  • Finally, we use a Linux pipe symbol to send the output of shp2pgsql into the PostgreSQL psql utility psql. We specify the name of the database with the -d flag and the user name with the -U flag. You’ll receive a prompt to supply a database password once you run the utility.

Enter the complete statement on your computer’s command line and run it. You should see a bunch of INSERT 0 1 statements scroll by before the operation completes. Now, your database should have a table named us_counties_2010_shp with a column named geom that contains the data describing the shape of each county.  Now, let’s explore it.

Querying the Shapefile

Here are two quick examples of how you can query the geometry data in the shapefile you just imported, in the table us_counties_2010_shp:

1. Find the largest counties in square miles using ST_Area():

The ST_Area()function returns the area of a Polygon or MultiPolygon object. If you’re working with a geography data type, ST_Area()returns the result in square meters. With a geometry data type, the function returns the area in SRID units. Typically, the units are not useful for practical analysis, but you can cast the geometry data to geography to obtain square meters. That’s what I do in this query:

SELECT name10,
       statefp10 AS st,
       round(
             ( ST_Area(geom::geography) / 2589988.110336 )::numeric, 2
            )  AS square_miles
FROM us_counties_2010_shp
ORDER BY square_miles DESC
LIMIT 5;

The geom column is data type geometry, so to find the area in square meters, we cast the geom column as a geography data type using the double-colon syntax. Then, to get square miles, we divide the area by 2589988.110336, the number of square meters in a square mile.

I’ve wrapped the results it in a round() function and named the resulting column square_miles. This makes the results easier to read. Finally, we list the results in descending order from the largest area to the smallest and use LIMIT 5 to show only the first five results:

┌──────────────────┬────┬──────────────┐
│      name10      │ st │ square_miles │
├──────────────────┼────┼──────────────┤
│ Yukon-Koyukuk    │ 02 │    147805.08 │
│ North Slope      │ 02 │     94796.21 │
│ Bethel           │ 02 │     45504.36 │
│ Northwest Arctic │ 02 │     40748.95 │
│ Valdez-Cordova   │ 02 │     40340.08 │
└──────────────────┴────┴──────────────┘

The five counties with the largest areas are all in Alaska, denoted by the state FIPS code 02.

2. Finding a county by longitude and latitude with ST_Within():

Much of the current mobile phone experience revolves around geolocation services using various means, such as your phone’s GPS, to find your longitude and latitude. Once your coordinates are known, they can be used in a spatial query to find which geography contains that point—and thus where you are.

You can do the same using your census shapefile and the ST_Within() function, which returns true if one geometry is inside another. Here’s an example using the longitude and latitude of downtown Hollywood:

SELECT name10,
       statefp10
FROM us_counties_2010_shp
WHERE ST_Within('SRID=4269;POINT(-118.3419063 34.0977076)'::geometry, geom);

The ST_Within() function inside the WHERE clause requires two geometry inputs and checks whether the first is inside the second. For the function to work properly, both geometry inputs must have the same SRID.

In this example, the first input is an extended well-known text representation of a Point that includes the SRID 4269 (same as the census data), which is then cast as a geometry type. The ST_Within() function doesn’t accept a separate SRID input, so to set it for the supplied well-known text, you must prefix it to the string.

The second input is the geomcolumn from the table. Run the query; you should see the following result:

┌─────────────┬───────────┐
│   name10    │ statefp10 │
├─────────────┼───────────┤
│ Los Angeles │ 06        │
└─────────────┴───────────┘

The query shows that the Point you supplied is within Los Angeles county in California (state FIPS 06). This information is very handy, because by joining additional data to this table you can tell a person about demographics or points of interest near them. Try supplying your own longitude and latitude pairs to see which U.S. county they fall in.

Wrap Up

There are many, many more ways to analyze data in shapefiles in PostGIS. One typical procedure involves joining two shapefiles to find places where objects intersect. In my book Practical SQL, for example, we join shapefiles containing rivers and roads to find where the two objects cross.

If you’ve enjoyed this brief tutorial, Practical SQL includes more on working with spatial data, including additional shapefile examples as well as general background on building, indexing, and query spatial data. Thanks for reading!

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.