Well, not really – but I do dislike certain things about most open data portals. Even the ones that I work with every day or that I have been involved with in the past.
Don’t get me wrong – I’m a true believer in the power of open data. I love that every day there are more and more governments posting open data to specialized sites meant to make their data available to external (and, increasingly, internal) users. But there are things about the way that most open data portals are structured and used that bother me – I think we can do better. And I think a lot of people will agree with me.
This post began life as a private GitHub gist that I used to keep track of the little annoyances I ran into when using open data portals, or watching what others were doing.
After adding to this list over time I thought it warranted a full blog post. So – in no particular order – here are my gripes with government open data portals.
No One Eats Their Own Dog Food
Governments can spend a good bit of money on open data portals.
True, there are open source options, but they still require time and energy to set up and maintain. And the effort that goes into publishing data – regardless of what platform is used – can be significant. Every single commercially available open data portal product (and most of the open source options) provides API access to data. Some come with fancy developer portals with documentation on how to leverage these APIs to build open data apps.
And the sum total of governments that are using the data they put in their own open data portals to support official government apps? Zero. Zilch. Bupkis. (Update: it still seems very rare, but Chicago reports some impressive use of its open data portal – see the comments below – and Los Angeles and New Orleans are making use of their portals as well.)
In a way, governments are assigning a second-class status to the data in open data portals, and to the APIs that expose this data. Sure, it’s important and the APIs are awesome – but we’re not going to use them ourselves.
When I worked for the City of Philadelphia, we developed our own set of custom APIs to make data available for civic developers. We then built official city apps on top of these same APIs. We ate our own dog food – if the data being exposed by these APIs was incomplete or inaccurate, or if an issue affected the availability of these APIs, then we felt the pain. It provided a really nice incentive to ensure that these APIs were working as they should be, and the civic technology community in Philadelphia benefits from that.
If you want to be further outraged at the fact that governments are spending many thousands of dollars on open data portals that they themselves won’t use to build apps, go read Anthea Watson Strong’s post on this subject. It’s an excellent post well worth the read.
Get Out of the Way and Give Me My Data!
One of the key selling points of an open data portal is the set of built-in tools it provides for sorting, grouping, graphing and visualizing data.
Both commercial and open source open data products come with lots of built-in bells and whistles, and these features tend to appeal to government officials. Reviewing a detailed feature list can seem like a good way to ensure that governments are getting their money’s worth when spending many thousands of dollars on an open data portal. But often the needs of data users diverge from the built-in features of many open data portals. Governments are likely paying a premium for features that most users of their data don’t care about and will never use.
Serious data scientists or app developers will want to extract the data from an open data portal and load it into another environment for their work. Some app developers might be attracted to open data APIs, but since these APIs are often relegated to second-class status (see above) it is far more likely that they will opt to populate their own data store to support their app – particularly since the cost and effort required to do so is getting lower by the day.
This means that simply downloading the data made available through open data portals should be easy. It should be, but it often isn’t – go to a government open data portal and try downloading a data set with a nontrivial number of rows. Say a few million or more. Go ahead, give it a try. I’ll wait…
Things are so bad on this front that open source developers are building their own tools to improve static downloads from government open data portals.
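For the sake of illustration, here’s roughly what that workaround looks like in practice – a quick Python sketch that pages through a portal API in chunks and stitches the results into a local CSV. The endpoint and paging parameters are hypothetical; swap in whatever the portal you’re pulling from actually exposes.

```python
# A minimal sketch of the workaround many data users end up writing:
# page through a portal's API in chunks and assemble a local CSV.
# The endpoint and paging parameters are hypothetical.
import csv
import requests

BASE_URL = "https://data.example.gov/api/datasets/311-requests/rows"  # hypothetical
PAGE_SIZE = 50_000

def download_all(out_path="311_requests.csv"):
    offset = 0
    writer = None
    with open(out_path, "w", newline="") as f:
        while True:
            resp = requests.get(BASE_URL, params={"limit": PAGE_SIZE, "offset": offset})
            resp.raise_for_status()
            rows = resp.json()  # assumes the API returns a JSON array of records
            if not rows:
                break  # no more pages
            if writer is None:
                writer = csv.DictWriter(f, fieldnames=rows[0].keys())
                writer.writeheader()
            writer.writerows(rows)
            offset += PAGE_SIZE

if __name__ == "__main__":
    download_all()
```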
Lack of Support for Realtime Data
As discussed above, every modern open data portal provides a way to programmatically access data that is housed in it – data is accessed via an API by making an HTTP request (with the required information – e.g., authentication – in the request) and getting a response back (typically in either JSON or XML format). This data access paradigm fits well with the way that most of the data in municipal open data portals is updated – usually not more frequently than daily.
If data updates happen frequently – or if a data consumer wants to check and see if data has changed since the last time it was accessed – a consumer application can poll the API for changes at set intervals. And though this approach works acceptably well for data that doesn’t change all that often, it is far from acceptable for data that does (or could) change more frequently. In fact, the closer data updates get to realtime, the less optimal this approach is, because it places a heavier burden on consumers (who must poll the API for data changes more frequently) and on the data portal itself (which must handle and respond to more frequent requests from API consumers).
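To make the tradeoff concrete, here’s a rough sketch of the polling pattern in Python. The endpoint and the last_updated field are hypothetical, but the shape of the problem is the same everywhere: every consumer pays the cost of asking “has anything changed?” over and over.

```python
# A sketch of the polling pattern described above: hit the API on a fixed
# interval and compare against the last response to detect changes.
# The endpoint and the "last_updated" field are hypothetical.
import time
import requests

API_URL = "https://data.example.gov/api/datasets/transit-alerts"  # hypothetical
POLL_INTERVAL_SECONDS = 60

def poll_forever():
    last_seen = None
    while True:
        resp = requests.get(API_URL)
        resp.raise_for_status()
        payload = resp.json()
        stamp = payload.get("last_updated")
        if stamp != last_seen:
            last_seen = stamp
            print(f"Data changed at {stamp}")  # hand off to your app here
        time.sleep(POLL_INTERVAL_SECONDS)  # every consumer pays this cost
```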
Other – more efficient – approaches to accessing data can be used when data updates occur more frequently. These approaches – like server-sent events (part of the HTML5 specification), WebSockets, or registering a callback URL (a webhook) – benefit both the data consumer and the data producer.
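For comparison, here’s roughly what consuming a server-sent event stream could look like if a portal offered one – the server pushes updates down a single long-lived connection instead of being asked repeatedly. The /events endpoint below is, of course, hypothetical, which is exactly the problem.

```python
# A sketch of what a server-sent events alternative could look like: the
# server pushes updates over one long-lived HTTP connection instead of
# being polled. The /events endpoint is hypothetical.
import json
import requests

STREAM_URL = "https://data.example.gov/api/datasets/transit-alerts/events"  # hypothetical

def listen():
    with requests.get(STREAM_URL, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            # SSE payloads arrive as lines prefixed with "data:"
            if line and line.startswith("data:"):
                event = json.loads(line[len("data:"):].strip())
                print("Update pushed by the server:", event)
```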
The number of open data portals that support these more efficient methods of consuming realtime data? Zero. Zilch. Bupkis.
No Revision History
One of the most important things that governments publishing open data can do is to ensure that it is timely and accurate. These values not only make eminent sense, they are enshrined in the guiding principles behind the open data movement. With this in mind, it’s a very good thing that we see governments updating their data sets frequently – particularly when these updates are driven by users who surface issues of accuracy or completeness that a government corrects with a subsequent data release. Feedback loops, y’all – they work.
However, for all of the money being spent on open data platforms, none of them make it easy to track the revision history of a data set over time. There are cases where this can be one of the most important aspects of a data set – how it has been enhanced and updated over time. Diffing things is such a common practice in the world of technology that the complete absence of this functionality in commercial open data portals is pretty frustrating.
More and more governments are using GitHub for their data, and GitHub makes it super easy to track the revision history of a file – that’s what GitHub does. Hopefully open data vendors will take a cue from GitHub and build this functionality into their platforms.
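In the meantime, data users are left rolling their own. Here’s a rough sketch of the kind of diff a portal could surface natively – compare two snapshots of the same data set and report what was added, removed or changed. The file names and key column are made up for illustration.

```python
# A minimal sketch of a data set diff: compare two snapshots keyed on an
# ID column and report added, removed, and changed rows. File names and
# the key column are hypothetical.
import csv

def load_rows(path, key="id"):
    with open(path, newline="") as f:
        return {row[key]: row for row in csv.DictReader(f)}

def diff_snapshots(old_path, new_path):
    old, new = load_rows(old_path), load_rows(new_path)
    added = new.keys() - old.keys()
    removed = old.keys() - new.keys()
    changed = [k for k in old.keys() & new.keys() if old[k] != new[k]]
    print(f"{len(added)} added, {len(removed)} removed, {len(changed)} changed")

if __name__ == "__main__":
    diff_snapshots("permits_2015-01.csv", "permits_2015-02.csv")
```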
Rant: over
I realize I’m being picky, and I’m largely ignoring the fact that more and more governments are standing up open data portals. Let’s not kid ourselves – this is awesome.
But the amount of money spent by governments on open data portals can be significant – these products should meet the needs of both governments and data users.
Got a gripe with open data portals? Agree or disagree with a point I’ve made here? I’d love if you added a comment below.
Thanks for reading.