Following up on my last post, and a recent trip to St. Paul Minnesota for the NAGW Annual Conference to talk about open data APIs, I wanted to provide a few insights for proper API stewardship for any government looking to get started with open data, or those that already have an open data program underway.
Implementing an API for your open data is not a trivial undertaking, and even if this is a function that you outsource to a vendor or partner it’s useful to understand some of the issues and challenges involved.
This is something that the open data team in the City of Philadelphia researched extensively during my time there, and this issue continues to be among the most important for any government embarking on an open data program.
In no particular order, here are some of the things that I think are important for proper API stewardship.
Implement Rate Limiting
APIs are shared resources, and one consumer’s use of an API can potentially impact anther consumer. Implementing rate limiting ensures that one consumer doesn’t crowd out others by trying to obtain large amounts of data through your API (that’s what bulk downloads are for).
If you want to start playing around with rate limiting for your API, have a look at Nginx – an open source web proxy that makes it super easy to implement rate limits on your API. I use Nginx as a reverse proxy for pretty much every public facing API I work on. It’s got a ton of great features that make it ideal for front ending your APIs.
Depending on the user base for your API, you may also want to consider using pricing as a mechanism for managing access to your API.
Provide Bulk Data
If the kind of data you are serving through your API is also the kind that consumers are going to want to get in bulk, you should make it available as a static – but regularly updated – download (in addition to making it available through your API).
In my experience, APIs are a lousy way to get bulk data – consumers would much rather get it as a compressed file they can download and use without fuss, and making consumers get bulk data through your API simply burdens it with unneeded traffic and ties up resources that can affect other consumers’ experience using your API.
If your serving up open data through your API, here are some additional reasons that you should also make this data available in bulk.
Use a Proxy Cache
A proxy cache sits in between your API and those using it, and caches responses that are frequently requested. Depending on the nature of the data you are serving through your API, it might be desirable to cache responses for some period of time – even up to 24 hours.
For example, an API serving property data might only be updated when property values are adjusted – either through a reassessment or an appeal by a homeowner. An API serving tax data might only be updated on a weekly basis. The caching strategy you employ with your open data API should be a good fit for the frequency with which the data behind it is updated.
If the data is only updated on a weekly basis, there is little sense in serving every single request to your API through a fresh call down the stack to the application and database running it. It’s more beneficial for the API owner, and the API consumer, if these requests are served out of cache.
There are lots of good choices for standing up a proxy cache like Varnish or Squid. These tools are open source, easy to use and can make a huge difference in the performance of your API.
Always Send Caching Instructions to API Consumers
If your API supports CORS or JSONP then it will serve data directly to web browsers. An extension of the cacheing strategy discussed above should address cache headers that are returned to browser-based apps that will consume data from your API.
There are lots of good resources providing details of how to effectively employ cache headers like this and this. Use them.
Evaluate tradeoffs of using ETags
ETags are related to the cacheing discussion detailed above. In a nutshell, ETags enable your API consumers to make “conditional” requests for data.
When ETags are in use, API responses are returned to consumers with a unique representation of a resource (an ETag). When the resource changes – i.e., is updated – the ETag for that resource will change. A client can make subsequent requests for the same resource and include the original ETag in a special HTTP header. If the resource has changed since the last request, the API will return the updated resource (with an HTTP 200 response, and the new ETag). This ensures that the API consumer always gets the latest version of a resource.
If the resource hasn’t changed since the last request, the API will instead return a response indicating that the resource was not modified (an HTTP 304 response). When the API sends back this response to the consumer, the content of the resource is not included, meaning the transaction is less “expensive” because what is actually sent back as a response from the API is smaller in size. This does not, however, meant that your API doesn’t expend resources when ETgas are used.
Generating ETags and checking them against those sent with each API call will consume resources and can be rather expensive depending on how your API implements ETags. Even if what gets sent over the wire is more compact, the client response will be slowed down by the need to match ETags submitted with API calls, and this response will probably always be slower than sending a response from a proxy cache or simply dipping into local cache (in instances where a browser is making the API call).
Also, if you are rate limiting your API does responses that generate an HTTP 304 count against an individual API consumer’s limit? Some APIs work this way.
Some examples of how ETags work using CouchDB – which has a pretty easy to understand ETags implementation – can be found here.
Did I miss something? Feel free to add a comment about what you think is important in API stewardship below.