Some Tips on API Stewardship

Following up on my last post, and a recent trip to St. Paul, Minnesota for the NAGW Annual Conference to talk about open data APIs, I wanted to share a few insights on proper API stewardship for any government looking to get started with open data, or for those that already have an open data program underway.

Implementing an API for your open data is not a trivial undertaking, and even if this is a function that you outsource to a vendor or partner it’s useful to understand some of the issues and challenges involved.

This is something that the open data team in the City of Philadelphia researched extensively during my time there, and this issue continues to be among the most important for any government embarking on an open data program.

In no particular order, here are some of the things that I think are important for proper API stewardship.

Implement Rate Limiting

APIs are shared resources, and one consumer’s use of an API can potentially impact another. Implementing rate limiting ensures that one consumer doesn’t crowd out others by trying to pull large amounts of data through your API (that’s what bulk downloads are for).

If you want to start playing around with rate limiting for your API, have a look at Nginx – an open source web proxy that makes it easy to implement rate limits. I use Nginx as a reverse proxy for pretty much every public-facing API I work on. It’s got a ton of great features that make it ideal for fronting your APIs.
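As a sketch of what this looks like in Nginx (the zone name, rate, and backend address here are illustrative, not anything specific to your setup):

```nginx
# Track request rates per client IP in a shared-memory zone
# (10 MB of state, 10 requests/second per IP -- illustrative values).
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

server {
    listen 80;

    location /api/ {
        # Allow short bursts, then reject clients over the limit with 429.
        limit_req zone=api_limit burst=20 nodelay;
        limit_req_status 429;

        # Reverse-proxy to the actual API backend (hypothetical address).
        proxy_pass http://127.0.0.1:8080;
    }
}
```

Returning HTTP 429 (Too Many Requests) gives well-behaved clients a clear signal to back off and retry later.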

Depending on the user base for your API, you may also want to consider using pricing as a mechanism for managing access to your API.

Provide Bulk Data

If the kind of data you are serving through your API is also the kind that consumers are going to want to get in bulk, you should make it available as a static – but regularly updated – download (in addition to making it available through your API).

In my experience, APIs are a lousy way to get bulk data. Consumers would much rather get it as a compressed file they can download and use without fuss, and making them pull bulk data through your API simply burdens it with unneeded traffic and ties up resources, affecting other consumers’ experience with your API.

If you’re serving up open data through your API, there are plenty of additional reasons to also make this data available in bulk.
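A minimal sketch of the idea – a scheduled job that writes a compressed, regularly refreshed extract alongside the API (the file name and sample rows here are made up for illustration; in practice the first step would be a database export):

```shell
# Write the bulk extract (illustrative sample data standing in for a real export).
printf 'address,assessed_value\n100 Main St,250000\n' > properties.csv

# Compress it for download; -k keeps the original, -f overwrites prior runs.
gzip -kf properties.csv

# properties.csv.gz is what you publish for bulk consumers,
# typically regenerated on a schedule (e.g., a nightly cron job).
```

Serving that one static file costs almost nothing compared to thousands of paginated API calls retrieving the same records.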

Use a Proxy Cache

A proxy cache sits in between your API and those using it, and caches responses that are frequently requested. Depending on the nature of the data you are serving through your API, it might be desirable to cache responses for some period of time – even up to 24 hours.

For example, an API serving property data might only be updated when property values are adjusted – either through a reassessment or an appeal by a homeowner. An API serving tax data might only be updated on a weekly basis. The caching strategy you employ with your open data API should be a good fit for the frequency with which the data behind it is updated.

If the data is only updated on a weekly basis, there is little sense in serving every single request to your API through a fresh call down the stack to the application and database running it. It’s more beneficial for the API owner, and the API consumer, if these requests are served out of cache.

There are lots of good choices for standing up a proxy cache, like Varnish or Squid. These tools are open source, easy to use, and can make a huge difference in the performance of your API.
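For example, a minimal Varnish VCL sketch that caches API responses for 24 hours (the backend address, URL pattern, and TTL are all illustrative – match the TTL to how often your data actually changes):

```vcl
vcl 4.0;

# The API backend Varnish sits in front of (hypothetical address).
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_backend_response {
    # Cache API responses for 24 hours -- appropriate only when the
    # underlying data changes infrequently (e.g., weekly updates).
    if (bereq.url ~ "^/api/") {
        set beresp.ttl = 24h;
    }
}
```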

Always Send Caching Instructions to API Consumers

If your API supports CORS or JSONP then it will serve data directly to web browsers. An extension of the caching strategy discussed above should address the cache headers returned to browser-based apps that consume data from your API.

There are lots of good resources providing details on how to effectively employ cache headers. Use them.
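As a simple illustration, a response like the following tells browsers (and any intermediate caches) that they may reuse the payload for an hour without re-requesting it (the header values are illustrative):

```http
HTTP/1.1 200 OK
Content-Type: application/json
Cache-Control: public, max-age=3600
```

For data updated weekly, a much longer `max-age` would be reasonable; for volatile data, `no-cache` forces revalidation on every use.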

Evaluate the Tradeoffs of Using ETags

ETags are related to the caching discussion detailed above. In a nutshell, ETags enable your API consumers to make “conditional” requests for data.

When ETags are in use, API responses are returned to consumers with a unique representation of a resource (an ETag). When the resource changes – i.e., is updated – the ETag for that resource will change. A client can make subsequent requests for the same resource and include the original ETag in a special HTTP header. If the resource has changed since the last request, the API will return the updated resource (with an HTTP 200 response, and the new ETag). This ensures that the API consumer always gets the latest version of a resource.

If the resource hasn’t changed since the last request, the API will instead return a response indicating that the resource was not modified (an HTTP 304 response). When the API sends back this response to the consumer, the content of the resource is not included, meaning the transaction is less “expensive” because what is actually sent back as a response from the API is smaller in size. This does not, however, mean that your API doesn’t expend resources when ETags are used.

Generating ETags and checking them against those sent with each API call consumes resources, and can be rather expensive depending on how your API implements ETags. Even if what gets sent over the wire is more compact, the response will be slowed by the need to match ETags submitted with API calls, and will probably always be slower than serving a response from a proxy cache or simply dipping into local cache (in instances where a browser is making the API call).

Also, if you are rate limiting your API, do responses that generate an HTTP 304 count against an individual API consumer’s limit? Some APIs work this way.

CouchDB – which has a pretty easy to understand ETag implementation – is a good place to look for examples of how ETags work in practice.
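The conditional-request flow described above can be sketched in a few lines. This is an illustrative simulation of the server-side logic, not any particular framework’s ETag implementation; the hashing scheme and record contents are assumptions for the example:

```python
import hashlib

def make_etag(body):
    # One common approach: derive the ETag from a hash of the representation.
    return '"%s"' % hashlib.sha256(body).hexdigest()[:16]

def handle_request(body, if_none_match=None):
    etag = make_etag(body)
    if if_none_match == etag:
        # Client's copy is current: 304 Not Modified, no body over the wire.
        return 304, None, etag
    # Resource is new or changed: full 200 response with the fresh ETag.
    return 200, body, etag

record = b'{"parcel": "1234-A", "assessed_value": 250000}'

status, payload, etag = handle_request(record)       # first request
status2, payload2, _ = handle_request(record, etag)  # conditional request
print(status, status2)  # 200 304
```

Note that even the 304 path still computes and compares the ETag, which is the server-side cost discussed above.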

Discussion

Did I miss something? Feel free to add a comment about what you think is important in API stewardship below.

6 comments

  1. kinlane (Kin Lane) · September 13, 2014

    Nice post, Mark. I like the ETag reference. Something I should explore more, and help folks understand. You are focusing on some interesting and important aspects of the technical layer of stewardship, and I would add the use of integration and monitoring tools like Runscope, API Tools, or others to help you see your API from the consumer side.

    You can also extend these tools to your consumers, to help them make sense of integration. Adding to your rate limiting point: if you provide tools that give users visibility into their usage, it helps. I’ve seen everything from additional API endpoints that provide stats on your usage, all the way to dashboards and visualizations.

    Overall, I think you provide some valuable stuff here for API stewards, and I like that title too. Makes it a much more meaningful process, and helps people take it seriously.

  2. mheadd · September 13, 2014

    Another good resource – White House API standards:

    https://github.com/WhiteHouse/api-standards

  3. mheadd · September 13, 2014

    Thanks, Kin. Appreciate the feedback – so much more to this issue and government API stewards need all the info they can get.

  4. sean metrivk · September 14, 2014

    Mark, this is just the kind of information I need to be learning as a government employee interested in open data. Please help us learn!

  5. Stephane Guidoin (@Hoedic) · September 19, 2014

    Hi Mark,

    Thank you for this good post!

    If you have the opportunity, I would be curious to hear your thoughts on this topic as applied to “standardized” APIs (like Open311, for example): do you think the guidelines you are providing should be documented in a standard API’s documentation?

    Ideally, standardized APIs should be plug and play, so an existing client should work with all implementations. If not specified in the spec, the client won’t necessarily know which features to use. But specifying things like ETags and other features that have an impact on the client tends to move toward overspecification, making the standard API heavy to implement (mainly for smaller players).

    The advantage of using well-known features (mainly based on HTTP, like ETags or making throttling trigger HTTP errors) is that the API spec could make these features optional (or best practice) rather than mandatory, something like: “when implementing the standard, you can implement throttling; if the throttling is triggered, the API should return an HTTP error with code 429.”

  6. mheadd · September 19, 2014

    Stephane,

    Thanks for your comments. I agree that making things like ETags and throttling optional would be a good place to go, especially for standardized APIs like Open311.

    I’ve often thought that ETags would be ideal for an API like Open311 where caching isn’t really an option (because an update may occur to a service request making the cached version obsolete), and where semi-frequent polling is most likely used.

    Thinking about it a bit more, for cases like Open311 it might be useful to implement WebSockets or even allow a client to register a Webhook URL for updates. When I worked in Philly, we implemented a WebSocket endpoint for one of our APIs. More on that here.
