Using the US Census Geocoder


Introduction

What is Geocoding?

Hint

The act of determining a specific, canonical location based on some input data.

What we typically know about a specific location or geographical area is fuzzy. We might know part of the address, or refer to the address with abbreviations, or describe a general area, etc. That ambiguity makes it challenging to get specific, canonical, and precise data about that geographic location, which is where the process of geocoding comes into play.

Geocoding is the process of getting a specific, precise, and canonical determination of a geographical location (a place or geographic feature) or of a geographical area (encompassing multiple places or geographic features).

A canonical determination of a geographical location or geographical area is defined by the meta-data that is returned for that location/area. The canonical address and the various characteristics of the geographical area, for example, represent the “canonical” information about that location/area.

The process of geocoding returns exactly that kind of canonical / official / unambiguous meta-data about one or more geographical locations and areas based on a set of inputs. Some inputs may be expected to be imprecise or partial (e.g. addresses, typically used for forward geocoding) while others are expected to be precise but with incomplete information (e.g. longitude and latitude coordinates used in reverse geocoding).

Why the Census Geocoder?

Geocoding is used for many things, but the Census Geocoder API in particular is meant to provide the US Census Bureau’s canonical meta-data about identified locations and areas. This meta-data is then typically used when executing more in-depth analysis on data published by the US Census Bureau and other departments of the US federal and state governments.

Because the US government uses a very complicated and overlapping hierarchy of geographic areas, it is essential when working with US government data to start from the precise identification of the geographic areas and locations of interest.

But using the Census Geocoder API to get this information is non-trivial. That’s both because the API has limited documentation and because its syntax is non-pythonic and requires extensive familiarity with the internals of the (complicated) datasets that the US Census Bureau manages/publishes.

The Census Geocoder library is meant to simplify all of that, by providing an easy-to-use, batteries-included, pythonic wrapper around the Census Geocoder API.

Census Geocoder vs. Alternatives

While we’re partial to the US Census Geocoder as our primary means of interacting with the Census Geocoder API, there are obviously alternatives for you to consider. Some might be better for your specific use cases, so here’s how we think about them:

The Census Geocoder API is a straightforward RESTful API, which means that you can execute your own HTTP requests against it, retrieve the JSON results, and work with the resulting data entirely yourself. This is what I did for years, until I got tired of repeating the same patterns over and over again and decided to build the Census Geocoder instead.

For a super-simple use case, this is probably the most expedient approach. More robust use cases, however, require your own scaffolding with built-in retry logic, object representation, error handling, etc., which becomes non-trivial.
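
A minimal sketch of the roll-your-own approach, using only the Python standard library. The base URL, the query parameter names, and the benchmark identifier 'Public_AR_Current' are assumptions drawn from the Census Bureau's public API documentation rather than from this guide, and build_request_url and fetch_json are hypothetical helper names:

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of the Census Geocoder REST API (verify against the
# Census Bureau's own documentation before relying on it).
BASE_URL = 'https://geocoding.geo.census.gov/geocoder'


def build_request_url(address, benchmark='Public_AR_Current'):
    """Build the URL for a one-line address lookup that returns JSON."""
    query = urllib.parse.urlencode({
        'address': address,
        'benchmark': benchmark,
        'format': 'json',
    })
    return f'{BASE_URL}/locations/onelineaddress?{query}'


def fetch_json(url, timeout=30):
    """Execute the HTTP GET request and decode the JSON body (no retries)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))


url = build_request_url('4600 Silver Hill Rd, Washington, DC 20233')
print(url)
# payload = fetch_json(url)  # uncomment to call the live API
```

Everything beyond this (retries, error handling, turning the JSON into objects) is the scaffolding you would have to write and maintain yourself.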

Why not use a library with batteries included?

Tip

When to use it?

In practice, I find that rolling my own solution is great when it’s an extremely simple use case, or a one-time operation (e.g. in a Jupyter Notebook) with no business logic to speak of. It’s a “quick-and-dirty” solution, where I’m trading rapid implementation (yay!) for less flexibility/functionality (boo!).

Considering how easy the Census Geocoder is to use, however, I find that I never really roll my own scaffolding when working with the Census Geocoder API.


Census Geocoder Features

  • Easy to adopt. Just install and import the library, and you can be forward geocoding and reverse geocoding with just two lines of code.

  • Extensive documentation. One of the main limitations of the Geocoder API is that its documentation is scattered across the different datasets released by the Census Bureau, making it hard to navigate and understand. We’ve tried to fix that.

  • Location Search

    • Using Geographic Coordinates (reverse geocoding)

    • Using a One-line Address

    • Using a Parametrized Address

    • Using Batched Addresses

  • Geography Search

    • Using Geographic Coordinates (reverse geocoding)

    • Using a One-line Address

    • Using a Parametrized Address

    • Using Batched Addresses

  • Supports all available benchmarks, vintages, and layers.

  • Simplified syntax for indicating benchmarks, vintages, and layers.

  • No more hard-to-interpret field names. The library uses simplified (read: human-understandable) names for location and geography properties.


Overview

How the Census Geocoder Works

The Census Geocoder works with the Census Geocoder API by providing a thin Python wrapper around the API’s functionality. Rather than having to construct your own HTTP requests against the API itself, you can instead work with Python objects and functions the way you normally would.

In other words, the process is very straightforward:

  1. Install the Census Geocoder library. (see 1. Installing the Census Geocoder)

  2. Import the geocoder. (see 2. Import the Census Geocoder)

  3. Geocode something - either locations or geographies. (see 3. Geocoding)

  4. Work with your geocoded locations or geographical areas. (see 4. Working with Results)

And that’s it! Once you’ve done the steps above, you can easily geocode one-off requests or batch many requests into a single transaction.


1. Installing the Census Geocoder

To install the US Census Geocoder, just execute:

$ pip install census-geocoder

2. Import the Census Geocoder

Importing the Census Geocoder is very straightforward. You can either import its components precisely (see API Reference) or simply import the entire module:

# Import the entire module.
import census_geocoder as geocoder

result = geocoder.location.from_address('4600 Silver Hill Rd, Washington, DC 20233')
result = geocoder.geography.from_address('4600 Silver Hill Rd, Washington, DC 20233')

# Import precise components.
from census_geocoder import Location, Geography

result = Location.from_address('4600 Silver Hill Rd, Washington, DC 20233')
result = Geography.from_address('4600 Silver Hill Rd, Washington, DC 20233')

3. Geocoding

Geocoding a location means retrieving canonical meta-data about that location. Think of it as getting the “official” details for a given place. Using the Census Geocoder, you can geocode locations given:

  • A single-line address (whole or partial)

  • A parametrized address where you know its component parts

  • A set of longitude and latitude coordinates

  • A batch file in CSV or TXT format

However, the Census Geocoder API provides two different sets of meta-data for any canonical location:

  • Location Data. Think of it as the canonical address for a given location/place.

  • Geographic Area Data. Think of it as canonical information about the (different) areas that contain the given location/place.

Using the Census Geocoder library you can retrieve both types of information.

Hint

When retrieving geographic area data, you also get location data.

Getting Location Data

Retrieving data about canonical locations is very straightforward. You have four different ways to get this information, depending on what information you have about the location you want to geocode. The example below uses a single-line address:

import census_geocoder as geocoder

result = geocoder.location.from_address('4600 Silver Hill Rd, Washington, DC 20233')
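
For the batch-file route, the file has to be prepared first. A minimal sketch using only the standard library, assuming the column order that the Census Geocoder API documents for batch uploads (unique ID, street address, city, state, ZIP code, with no header row). The helper name write_batch_file and the file name are hypothetical:

```python
import csv
import os
import tempfile

# Illustrative rows: (unique ID, street address, city, state, ZIP code).
rows = [
    ('1', '4600 Silver Hill Rd', 'Washington', 'DC', '20233'),
]


def write_batch_file(path, rows):
    """Write address rows to a CSV batch file without a header row."""
    with open(path, 'w', newline='', encoding='utf-8') as handle:
        csv.writer(handle).writerows(rows)
    return path


# Hypothetical file name, written to the system's temporary directory.
batch_path = write_batch_file(
    os.path.join(tempfile.gettempdir(), 'census-batch.csv'),
    rows,
)
print(batch_path)
```

The unique ID in the first column is what lets you match each returned record back to the row you submitted.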

Getting Geographic Area Data

Retrieving data about the geographic areas that contain a given location/place is just as straightforward as getting location data. In fact, the syntax is almost identical. Just swap out the word 'location' for 'geography' and you’re done!

Here’s how to do it:

import census_geocoder as geocoder

result = geocoder.geography.from_address('4600 Silver Hill Rd, Washington, DC 20233')

Benchmarks and Vintages

The data returned by the Census Geocoder API differs from that of typical geocoding services in that it is time-sensitive. A geocoding service like the Google Maps API or Here.com only cares about the current location, but the US Census Bureau’s information is inherently linked to the statistical data that the Bureau collected at particular moments in time.

Thus, when making requests against the Census Geocoder API you are always asking for geographic location data or geographic area data as of a particular date. You might think “geographies don’t change”, but in actuality they are constantly evolving: congressional districts, school districts, town lines, county lines, street names, house numbers, etc. all change over time. To ensure that the statistical data is tied to the locations properly, that alignment is maintained through two key concepts:

The benchmark is the time period when geographic information was snapshotted for use / publication in the Census Geocoder API. This is typically done twice per year, and represents the “geographic definitions as of the time period indicated by the benchmark”.

The vintage is the census or survey data that the geographies are linked to. Thus, the geographic identifiers or statistical data associated with locations or geographic areas within a given benchmark are also linked to a particular vintage of census/survey data. Trying to use those identifiers or statistical data with a different vintage of data may produce inaccurate results.

The Census Geocoder API supports a variety of benchmarks and vintages, and they are unfortunately poorly documented and difficult to interpret. Therefore, the Census Geocoder has been designed to streamline and simplify their usage.

Each vintage is only available for particular benchmarks. The table below provides guidance on the vintages and benchmarks supported by the Census Geocoder:

BENCHMARK     VINTAGES

Current       Current, Census2020, ACS2019, ACS2018, ACS2017, Census2010

Census2020    Census2020, Census2010

When using the Census Geocoder, you can supply the benchmark and vintage directly when executing your geocoding request:

import census_geocoder as geocoder

result = geocoder.location.from_address('4600 Silver Hill Rd, Washington, DC 20233',
                                        benchmark = 'Current',
                                        vintage = 'ACS2019')

result = geocoder.geography.from_address('4600 Silver Hill Rd, Washington, DC 20233',
                                         benchmark = 'Current',
                                         vintage = 'ACS2019')

Hint

Several important things to be aware of when it comes to benchmarks and vintages in the Census Geocoder library:

Unless overridden by the CENSUS_GEOCODER_BENCHMARK or CENSUS_GEOCODER_VINTAGE environment variables, the benchmark and vintage both default to 'Current'.

The benchmark and vintage are case-insensitive. This means that you can supply 'Current', 'CURRENT', or 'current' and it will all work the same.

If you want to set a different default benchmark or vintage, you can do so by setting CENSUS_GEOCODER_BENCHMARK and CENSUS_GEOCODER_VINTAGE environment variables to the defaults you want to use.
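
The default and case-insensitivity rules above can be illustrated with a small standard-library sketch. The helpers resolve_default and same_setting are hypothetical stand-ins that mimic the described behaviour; they are not the library's implementation. Whether the library reads these variables at import time or at call time is not stated here, so setting them before importing it is the cautious choice:

```python
import os


def resolve_default(env_var, fallback='Current'):
    """Return the value of env_var, or the fallback if it is unset or empty."""
    return os.environ.get(env_var) or fallback


def same_setting(first, second):
    """Compare two benchmark/vintage names case-insensitively."""
    return first.strip().casefold() == second.strip().casefold()


# Set a different default vintage for this process; leave the benchmark unset.
os.environ['CENSUS_GEOCODER_VINTAGE'] = 'ACS2019'
os.environ.pop('CENSUS_GEOCODER_BENCHMARK', None)

default_benchmark = resolve_default('CENSUS_GEOCODER_BENCHMARK')
default_vintage = resolve_default('CENSUS_GEOCODER_VINTAGE')
print(default_benchmark, default_vintage)
print(same_setting('CURRENT', 'current'))
```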

Layers

When working with the Census Geocoder API (particularly when getting geographic area data), you have the ability to control which types of geographic area get returned. These types of geographic area are called “layers”.

An example of two different “layers” might be “State” and “County”. These are two different types of geographic area, one of which (County) may be encompassed by the other (State). In general, geographic areas within the same layer cannot and do not overlap. However, different layers can and do overlap: one area (a State) may contain multiple areas from another layer (Counties), or one area (a Metropolitan Statistical Area) may partially overlap multiple areas within a different layer (States).

When using the Census Geocoder you can easily specify the layers of data that you want returned. Unless overridden by the CENSUS_GEOCODER_LAYERS environment variable, the layers returned will always default to 'all'.

Which layers are available is ultimately determined by the vintage of the data you are retrieving. The following represents the list of layers available in each vintage:

Note

You may notice that there are (logical) duplicate layers in the lists above, for example “2010 Census PUMAs” and “2010 Census Public Use Microdata Areas”. This is because there are multiple ways that users of Census data may refer to particular layers in their work. This duplication is purely for the convenience of Census Geocoder users, since the Census Geocoder API actually uses numerical identifiers for the layers returned.

When geocoding data, you can simply supply the layers you want using the layers keyword argument as below:

import census_geocoder as geocoder

result = geocoder.location.from_address('4600 Silver Hill Rd, Washington, DC 20233',
                                        benchmark = 'Current',
                                        vintage = 'ACS2019',
                                        layers = 'Census Tracts, States, CDPs, Divisions')

result = geocoder.geography.from_address('4600 Silver Hill Rd, Washington, DC 20233',
                                         benchmark = 'Current',
                                         vintage = 'ACS2019',
                                         layers = 'Census Tracts, States, CDPs, Divisions')

Hint

When using the Census Geocoder to return geographic area data, you can request multiple layers’ worth of data by passing them in a comma-delimited string. This will return separate data for each layer indicated. The comma-delimited string can include white-space for easy readability, which means that the following two values are considered identical:

  • layers = 'Census Tracts, States, CDPs, Divisions'

  • layers = 'Census Tracts,States,CDPs,Divisions'

To retrieve all available layers that have data for a given location, you can submit 'all'. Unless you have set the CENSUS_GEOCODER_LAYERS environment variable to a different value, 'all' is the default set of layers that will be returned.

Note that layer names in the Census Geocoder are case-insensitive.
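
To make the equivalence rules above concrete, here is a minimal sketch of how such a layers string could be normalized. The function normalize_layers is a hypothetical illustration of the described behaviour (comma-delimited, white-space tolerant, case-insensitive), not the library's own code:

```python
def normalize_layers(layers):
    """Split a comma-delimited layers string into trimmed, case-folded names."""
    return [name.strip().casefold() for name in layers.split(',') if name.strip()]


spaced = normalize_layers('Census Tracts, States, CDPs, Divisions')
compact = normalize_layers('census tracts,states,cdps,divisions')

# Both spellings reduce to the same list of layer names.
print(spaced)
print(spaced == compact)
```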


4. Working with Results

Now that you’ve geocoded some data using the Census Geocoder, you probably want to work with your data. Well, that’s pretty easy since the Census Geocoder returns native Python objects containing your location or geographical area data.

Shared Methods

Most of what you will do with your results is read properties from them so as to consume or use the canonical location/geographic meta-data in your application. However, there are a number of methods that are shared between both location data and geographic area data that may prove helpful:

inspect(as_census_fields=False)

Returns a list of the properties that are populated with values in the object.

Parameters: as_census_fields (bool) – If True, returns the properties using the Census field name rather than the Census Geocoder (user-friendly) property name. Defaults to False.

Return type: list of str

to_dict()

Serializes the data for the location/geographic area into a dict that conforms directly to the output from the Census Geocoder API.

Return type: dict

to_json()

Serializes the data for the location/geographic area into a str containing a JSON object that conforms directly to the output from the Census Geocoder API.

Return type: str

Location Data

When working with location data, there are two principal sets of meta-data made available:

  • Input. This is the input that was submitted to the Census Geocoder API, and it includes:

    • The address that you submitted.

    • The benchmark requested.

    • The vintage requested.

  • Matched Addresses. This is a collection of addresses that the Census Geocoder API returned as the canonical addresses for your inputs.

Each matched address exposes its key meta-data, including:

  • The address components in a parametrized form.

  • The address in a single-line form.

  • The TIGER/Line identifier information for the address.

  • The side of the street where the address can be found, per the TIGER/Line data.
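
As a rough picture of how the input and matched-address meta-data fit together, the sketch below parses an illustrative payload shaped like a raw one-line address response from the Census Geocoder API. The field names are assumptions based on the API's JSON output, the values are examples only, and summarize_matches is a hypothetical helper. The library itself exposes these values through its own, friendlier property names:

```python
import json

# Illustrative payload shaped like a raw API response (values are examples only).
SAMPLE_RESPONSE = '''
{
  "result": {
    "input": {
      "address": {"address": "4600 Silver Hill Rd, Washington, DC 20233"},
      "benchmark": {"benchmarkName": "Public_AR_Current"}
    },
    "addressMatches": [
      {
        "matchedAddress": "4600 SILVER HILL RD, WASHINGTON, DC, 20233",
        "coordinates": {"x": -76.92744, "y": 38.845985},
        "tigerLine": {"tigerLineId": "76355984", "side": "L"},
        "addressComponents": {"city": "WASHINGTON", "state": "DC", "zip": "20233"}
      }
    ]
  }
}
'''


def summarize_matches(payload):
    """Return (single-line address, TIGER/Line ID, street side) for each match."""
    matches = payload['result']['addressMatches']
    return [
        (m['matchedAddress'], m['tigerLine']['tigerLineId'], m['tigerLine']['side'])
        for m in matches
    ]


payload = json.loads(SAMPLE_RESPONSE)
submitted = payload['result']['input']['address']['address']
summary = summarize_matches(payload)
print(submitted)
print(summary)
```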

Geographical Area Data

Geographical area data is always returned within the context of a MatchedAddress instance, which itself is always contained within a Location instance. That matched address will have a .geographies property, which will contain a GeographyCollection. That .geographies property is what contains the detailed geographical area meta-data for all geographical areas returned in response to your API request.

Each layer requested is contained in a property of the GeographyCollection. For example, the relevant regions would be contained in the .regions property, while the relevant census tracts would be contained in the .tracts property.

See also

For a full list of the properties/layers that are available within a GeographyCollection, please see the detailed API reference.

If a layer is not requested (or is irrelevant for a given benchmark / vintage), then its corresponding property in the GeographyCollection will be None.

Within each layer/property, you will find a collection of Geography instances (technically, layer-specific sub-class instances). Each of these instances represents a geographical area returned by the Census Geocoder API, and their properties will contain the meta-data returned by that API.

Because different types of geographical area return different meta-data, there is a useful .inspect() method that will tell you what meta-data properties are available / have data.

The most universal properties (and the ones that are going to prove most useful when working with other Census Bureau datasets) are:

  • .geoid which contains the GEOID (unique consolidated identifier for the geographical area)

  • .name which contains the human-readable name of the geographical area

  • .geography_type which contains a human-readable label for the instance’s geographical area/layer type

  • .functional_status which contains a human-readable indication of the geographical area’s functional status
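
The GEOID is "consolidated" in the sense that it concatenates the codes of the areas that contain the geography. As a small sketch, assuming the Census Bureau's standard layout for census tract GEOIDs (2-digit state code, 3-digit county code, 6-digit tract code), a tract GEOID can be split into its parts. The helper split_tract_geoid and the sample value are illustrative only:

```python
def split_tract_geoid(geoid):
    """Split an 11-digit census tract GEOID into state, county, and tract codes."""
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f'expected an 11-digit tract GEOID, got {geoid!r}')
    return {
        'state': geoid[:2],    # 2-digit state FIPS code
        'county': geoid[2:5],  # 3-digit county FIPS code
        'tract': geoid[5:],    # 6-digit census tract code
    }


# Illustrative GEOID value; substitute the .geoid of a tract you have geocoded.
parts = split_tract_geoid('24033802405')
print(parts)
```

Other layers use different lengths and components, so this split applies to census tracts only.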