Building the Government Data Toolkit

Image courtesy of Flickr user bitterbuick

We live in a time when people outside of government have better tools for building things with government data and extracting insights from it than governments themselves.

These tools are more plentiful, more powerful, more flexible, and less expensive than pretty much everything government employees currently have at their disposal. Governments may have existing relationships with huge tech companies like Microsoft, IBM, Esri and others that offer an array of different data tools — it doesn’t really matter.

In the race for better data tools, the general public isn’t just beating out the public sector, it’s already won the race and is taking a Jenner-esque victory lap.

This isn’t a new trend.

Read More

GovTech is Not Broken

When we talk about the challenges that face governments in acquiring and implementing new technology, the conversation eventually winds around to the procurement process.

That’s when things usually get ugly. “It’s broken,” they say. “It just doesn’t work.”

What most people who care about this issue fail to recognize, however, is that while the procurement process for technology may not work well for governments or prospective vendors (particularly smaller, younger companies), it is not broken.

It works exactly as it was designed to work.

Read More

Command Line Data Science

When it comes to deriving useful results about the operation of government from open data sets, we have an enormous array of tools at our disposal. Often, we do not need sophisticated or expensive tools to produce useful results.

In this post, I want to use command line tools that are available on most laptops, and others that can be downloaded for free, to derive meaningful insights from a real government open data set. The following examples will leverage *nix-based tools like tail, grep, sort, uniq and sed as well as open source tools that can be invoked from the command line like csvkit and MySQL.

The data used in this post is from the NY State Open Data Portal for traffic tickets issued in New York State from 2008 – 2012.
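To give a flavor of the approach (the file name and column header below are assumptions; adjust them to match the actual export from the portal), a single pipeline can count tickets by county:

~$ csvcut -c "Violation County" traffic-tickets.csv | tail -n +2 | sort | uniq -c | sort -rn | head

Here csvcut (part of csvkit) pulls out one column by its header name, tail -n +2 drops the header row, and the sort/uniq combination produces a ranked count of tickets per county.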

Read More

Better Licensing For Open Data

It’s really interesting to see so many governments start to use GitHub as a platform for sharing both code and data. One of the things I find interesting, though, is how infrequently governments use standard licenses with their data and app releases on GitHub.

Why no licenses?

I’m as guilty as anyone of pushing government data and apps to GitHub without proper terms of use or a standard license. Adding these to a repo can be a pain – more often than not, I used to find myself rooting around in older repos for a set of terms I could copy into the repo I wanted to create. This isn’t a terrible way to ensure that terms of use for government data and apps stay consistent, but I think we can do better.

Before leaving the City of Philadelphia, I began experimenting with a new approach. I created a stand-alone repository for our most commonly used set of terms & conditions. Then, I added the license to a new project as a submodule. With this approach, we can ensure that every time a set of terms & conditions is included with a repo containing city data or apps, the language is up to date and consistent with what is being used in other repos.

Adding the terms of use to a new repo before making it public is easy:

~$ git submodule add git://github.com/CityOfPhiladelphia/terms-of-use.git license

This adds a new subdirectory in the parent repo named ‘license’ that contains a reference to the repo holding the license language. Any user cloning the repo to use the data or app simply does the following (for purposes of demonstration, using this repo):

~$ git clone https://github.com/CityOfPhiladelphia/phl-polling-loctions
~$ git submodule init
~$ git submodule update

Running git submodule update checks out the license version pinned by the parent repo; to pull in the very latest license language, which can change from time to time, use the --remote flag.
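For example, to pull the newest commit of the license submodule (the --remote flag requires Git 1.8.2 or later):

~$ git submodule update --remote license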

GitHub is an amazing platform for governments to use in sharing open data and fostering collaboration through releasing applications as open source projects.

I think it also provides some powerful facilities for associating licenses and terms & conditions with these releases – something every open source project needs to be sustainable and successful.

Some Tips on API Stewardship

Following up on my last post, and a recent trip to St. Paul, Minnesota, for the NAGW Annual Conference to talk about open data APIs, I wanted to provide a few insights on proper API stewardship for any government looking to get started with open data, or those that already have an open data program underway.

Implementing an API for your open data is not a trivial undertaking, and even if this is a function that you outsource to a vendor or partner it’s useful to understand some of the issues and challenges involved.

This is something that the open data team in the City of Philadelphia researched extensively during my time there, and this issue continues to be among the most important for any government embarking on an open data program.

In no particular order, here are some of the things that I think are important for proper API stewardship.

Implement Rate Limiting

APIs are shared resources, and one consumer’s use of an API can potentially impact another consumer. Implementing rate limiting ensures that one consumer doesn’t crowd out others by trying to obtain large amounts of data through your API (that’s what bulk downloads are for).

If you want to start playing around with rate limiting for your API, have a look at Nginx – an open source web server and reverse proxy that makes it super easy to implement rate limits on your API. I use Nginx as a reverse proxy for pretty much every public-facing API I work on. It’s got a ton of great features that make it ideal for front-ending your APIs.
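As a minimal sketch of what this can look like (the zone name, rate, and upstream address are placeholders, and these directives belong inside the http block of your Nginx configuration), a per-client limit of five requests per second might be:

limit_req_zone $binary_remote_addr zone=api_limit:10m rate=5r/s;

server {
    location / {
        # allow short bursts, reject anything beyond that
        limit_req zone=api_limit burst=10 nodelay;
        # hand permitted requests to the API application
        proxy_pass http://127.0.0.1:8080;
    }
}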

Depending on the user base for your API, you may also want to consider using pricing as a mechanism for managing access to your API.

Provide Bulk Data

If the kind of data you are serving through your API is also the kind that consumers are going to want to get in bulk, you should make it available as a static – but regularly updated – download (in addition to making it available through your API).

In my experience, APIs are a lousy way to get bulk data – consumers would much rather get it as a compressed file they can download and use without fuss, and making consumers get bulk data through your API simply burdens it with unneeded traffic and ties up resources that can affect other consumers’ experience using your API.
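Generating that static file doesn’t have to be complicated. As a sketch (the database, table, and output path here are purely illustrative), a nightly cron job could be as simple as:

~$ mysql -B -e "SELECT * FROM traffic_tickets" opendata | gzip > /var/www/downloads/traffic_tickets.tsv.gz

Run on a schedule, this keeps a compressed, regularly refreshed download sitting next to the API without pushing any bulk traffic through it.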

If you’re serving up open data through your API, here are some additional reasons that you should also make this data available in bulk.

Use a Proxy Cache

A proxy cache sits in between your API and those using it, and caches responses that are frequently requested. Depending on the nature of the data you are serving through your API, it might be desirable to cache responses for some period of time – even up to 24 hours.

For example, an API serving property data might only be updated when property values are adjusted – either through a reassessment or an appeal by a homeowner. An API serving tax data might only be updated on a weekly basis. The caching strategy you employ with your open data API should be a good fit for the frequency with which the data behind it is updated.

If the data is only updated on a weekly basis, there is little sense in serving every single request to your API through a fresh call down the stack to the application and database running it. It’s more beneficial for the API owner, and the API consumer, if these requests are served out of cache.

There are lots of good choices for standing up a proxy cache like Varnish or Squid. These tools are open source, easy to use and can make a huge difference in the performance of your API.
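A quick way to confirm the cache is doing its job is to look at the response headers; HTTP caches like Varnish add an Age header to responses they serve from cache (the URL below is hypothetical):

~$ curl -sI https://api.example.gov/v1/properties/1234 | grep -i -E '^(age|x-cache)'

A non-zero Age value (or an X-Cache: HIT header, if your cache is configured to add one) means the request never touched the application or database behind the API.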

Always Send Caching Instructions to API Consumers

If your API supports CORS or JSONP, then it will serve data directly to web browsers. An extension of the caching strategy discussed above should address the cache headers that are returned to browser-based apps consuming data from your API.

There are lots of good resources providing details on how to effectively employ cache headers. Use them.
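As a quick sanity check, you can inspect what your API is actually telling browsers to do (the URL below is hypothetical):

~$ curl -sI https://api.example.gov/v1/properties/1234 | grep -i -E '^(cache-control|expires)'

For data that only changes daily, you would hope to see something like Cache-Control: public, max-age=86400 rather than no-cache or nothing at all.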

Evaluate tradeoffs of using ETags

ETags are related to the caching discussion detailed above. In a nutshell, ETags enable your API consumers to make “conditional” requests for data.

When ETags are in use, API responses are returned to consumers with a unique identifier for the current version of a resource (an ETag). When the resource changes – i.e., is updated – the ETag for that resource will change. A client can make subsequent requests for the same resource and include the original ETag in a special HTTP header (If-None-Match). If the resource has changed since the last request, the API will return the updated resource (with an HTTP 200 response, and the new ETag). This ensures that the API consumer always gets the latest version of a resource.

If the resource hasn’t changed since the last request, the API will instead return a response indicating that the resource was not modified (an HTTP 304 response). When the API sends back this response to the consumer, the content of the resource is not included, meaning the transaction is less “expensive” because what is actually sent back as a response from the API is smaller in size. This does not, however, mean that your API doesn’t expend resources when ETags are used.
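A conditional request is easy to demonstrate from the command line (the URL is hypothetical, and "abc123" stands in for whatever ETag value the first call returns):

~$ curl -sI https://api.example.gov/v1/properties/1234 | grep -i etag
~$ curl -s -o /dev/null -w "%{http_code}\n" -H 'If-None-Match: "abc123"' https://api.example.gov/v1/properties/1234

If the resource hasn’t changed, the second call prints 304 and no resource body crosses the wire.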

Generating ETags and checking them against those sent with each API call will consume resources and can be rather expensive depending on how your API implements ETags. Even if what gets sent over the wire is more compact, the client response will be slowed down by the need to match ETags submitted with API calls, and this response will probably always be slower than sending a response from a proxy cache or simply dipping into local cache (in instances where a browser is making the API call).

Also, if you are rate limiting your API, do responses that generate an HTTP 304 count against an individual consumer’s limit? Some APIs work this way.

Some examples of how ETags work using CouchDB – which has a pretty easy-to-understand ETag implementation – can be found here.

Discussion

Did I miss something? Feel free to add a comment about what you think is important in API stewardship below.

Open Data: Beyond the Portal

One of the most visible statements a government embarking on a new open data program can make is the selection of an “open data portal.”

An open data portal provides a central location for listing or storing data released by a government for use by outside consumers, making such data more easily discoverable. A portal also has value as a more concrete manifestation of a government’s intentions for open government.

Governments that have data portals as the centerpiece of their open government agendas make a public statement about the importance of data to being transparent and collaborative.

But open data portals are much more than just data directories or repositories – when implemented and managed successfully, they are also the centerpiece for the community that generates value from publicly released government data.

The community around an open data portal is a direct contributor to the success of an open data program – and this community includes both people inside government (data producers) and outside (data consumers – developers, journalists, researchers, civic activists, etc.)

This fact helps underscore some important considerations government officials should keep in mind when evaluating different options for an open data portal, and also highlights work that must be done beyond the selection of an open data portal to ensure the success of government transparency efforts.

The Community Inside

The internal community around an open data portal is made up of data stewards and producers inside government.

This community uses an open data portal in a very specific way. A subset of this community may be involved in the maintenance or management of the underlying software platform that supports the open data portal, but most will contribute data (or information about data) to the portal in some way.

But before this specific touch point, where internal community members contribute data to a portal, there is a series of decisions and actions that determine which data gets put into a portal, and what format that data will take.

All governments operate under an explicit set of rules about the kinds of data that can and should be released for public consumption. But beyond this binary evaluation of public vs. non-public, there is a set of (often complex) factors that need to be considered:

  • Which data sets have a higher “value” relative to others? What should be focused on first?
  • What is the current state of the data – is it accurate and up to date?
  • Does it require meta information, to assist users in understanding what it is and how it may be most effectively used?
  • Where is the data currently housed? Are there any technical barriers that might make it difficult to stage it for public release?
  • What specific steps are needed to take data from a backend system or data store and stage it for public release?
  • Who is responsible for each step? One person? Many?
  • What is the appropriate refresh cycle for such data? Does it change often enough to warrant frequent updates?
  • What is the appropriate format to release a data set in? Should more than one format be used?

(Another good source of information for data producers to take into consideration is the 8 Principles of Open Data.)

The process by which governments work through these issues (and others) is the foundation on which a successful open data program operates. The process that is used to identify, review, release, update and maintain information in an open data portal – regardless of what kind of portal it is – is what turns the wheels of open government.

The work to develop this process (or set of processes) must be done regardless of which open data portal a government elects to use.

This is not meant to suggest that picking the right data portal doesn’t have value, just that much work remains to be done to build a successful open data portal beyond simply picking which one to use.

The Community Outside

Governments must also work to build the external community of users around an open data portal – this external community will use an open data portal very differently than their internal counterparts. These users will be direct consumers of the data provided by governments, and may also provide ideas for new data to release and feedback on the quality of existing open data.

To properly serve the external community of users of open data, governments must ensure that the portal they select (or build) has the features required to interact with this community.

Providing a forum for discussion, feedback mechanisms, the ability to rate the quality of data and suggest new kinds of data are all important functions. There are a number of both commercial and open source data portal options that do each of these things quite well.

Selecting an open source alternative for an open data portal might be perceived as a daunting task for some governments, but there are several well developed and (increasingly) widely used open source options to choose from.

One of the primary considerations for a government considering an open source option for their data portal is the technology stack used to build it. Often, a mismatch between the technology used in one of these open source options and the government’s own technology infrastructure may raise concerns.

There are, however, some great examples of open source data portals that have been implemented with the assistance and direct involvement of members of the external data community, many of whom are software developers. The OpenDataPhilly.org data portal is a good example of this, as are its sister sites in San Diego and Chattanooga, TN.

Leveraging a local community of technologists and developers to help stand up, manage and improve a government’s data portal by using open source software may be an effective way of engaging and building the external community of data consumers.

In this way, an open source data portal may have an advantage over a commercial offering – the external community of users is directly invested in the data portal itself, and has a way to contribute to it and make it better.

Building the internal and external communities around an open data portal is important work that must be done to ensure the success of a government’s open data and transparency program.

Selecting a specific open data portal to use doesn’t de-obligate governments from this important, foundational work.

And sometimes, selecting the right data portal can make building an external community around open data easier.