I Hate Open Data Portals

Well, not really – but I do dislike certain things about most open data portals, even the ones that I work with every day or have been involved with in the past.

Don’t get me wrong – I’m a true believer in the power of open data. I love that every day there are more and more governments posting open data to specialized sites meant to make their data available to external (and, increasingly, internal) users. But there are things about the way that most open data portals are structured and used that bother me – I think we can do better. And I think a lot of people will agree with me.

This post began life as a private GitHub gist that I used to keep track of the little annoyances I ran into when using open data portals, or watching what others were doing.

After adding to this list over time I thought it warranted a full blog post. So – in no particular order – here are my gripes with government open data portals.

No One Eats Their Own Dog food

Governments can spend a good bit of money on open data portals.

True, there are open source options, but they still require time and energy to set up and maintain. And the effort that goes into publishing data – regardless of what platform is used – can be significant. Every single commercially available open data portal product (and most of the open source options) provides API access to data. Some come with fancy developer portals with documentation on how to leverage these APIs to build open data apps.

And the sum total of governments that are using the data they put in their own open data portals to support official government apps? Zero. Zilch. Bupkis. (Update: it still seems very rare, but Chicago reports some impressive use of its open data portal – see comments below. Los Angeles and New Orleans are making use of their portals as well.)

In a way, governments are assigning a second-class status to the data in open data portals, and to the APIs that expose this data. Sure, it’s important and the APIs are awesome – but we’re not going to use them ourselves.

When I worked for the City of Philadelphia, we developed our own set of custom APIs to make data available for civic developers. We then built official city apps on top of these same APIs. We ate our own dog food – if the data being exposed by these APIs was incomplete or inaccurate, or if an issue affected the availability of these APIs then we felt the pain. It provided a really nice incentive to ensure that these APIs were working as they should be and the civic technology community in Philadelphia benefits from that.

If you want to be further outraged at the fact that governments are spending many thousands of dollars on open data portals that they themselves won’t use to build apps, go read Anthea Watson Strong’s post on this subject. It’s an excellent post well worth the read.

Get Out of the Way and Give Me My Data!

One of the key selling points of an open data portal is the built-in tools that are provided for sorting, grouping, graphing and visualizing data.

Both commercial and open source open data products come with lots of built-in bells and whistles, and these features tend to appeal to government officials. Reviewing a detailed features list can seem like a good way to ensure that governments are getting their money’s worth when spending many thousands of dollars on an open data portal. But often the needs of data users can diverge from the built-in features of many open data portals. Governments are likely paying a premium for features that most users of their data don’t care about and will never use.

Serious data scientists or app developers will want to extract the data from an open data portal and load it into another environment for their work. Some app developers might be attracted to open data APIs, but since these APIs are often relegated to second class status (see above) it is far more likely that they will opt to populate their own data store to support their app – particularly since the cost and effort required to do so is getting lower by the day.

This means that simply downloading the data made available through open data portals should be easy. It should be, but it often isn’t – go to a government open data portal and try downloading a data set with a nontrivial number of rows. Say a few million or more. Go ahead, give it a try. I’ll wait…

Things are so bad on this front that open source developers are building their own tools to improve static downloads from government open data portals.
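For what it’s worth, the usual workaround looks something like the sketch below: page through the API and stream rows to disk, rather than asking the portal for one giant file. This is only an illustration – `fetch_page` is a hypothetical stand-in for whatever paged request a given portal supports, and the page size is an arbitrary assumption.

```python
import csv
from typing import Callable, Dict, Iterator, List, TextIO

Row = Dict[str, str]
# Hypothetical transport: fetch_page(offset, limit) returns up to `limit` rows.
# A real version would make an HTTP request against the portal's API.
FetchPage = Callable[[int, int], List[Row]]


def iter_rows(fetch_page: FetchPage, page_size: int = 1000) -> Iterator[Row]:
    """Yield every row, one page at a time, until a short page signals the end."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:
            return
        offset += page_size


def write_csv(rows: Iterator[Row], out: TextIO, fieldnames: List[str]) -> int:
    """Stream rows to `out` as CSV and return how many rows were written."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count
```

Because rows are written as they arrive, memory use stays flat no matter how many million rows the data set has. The cost is many round trips, which is exactly the kind of thing a plain static download would avoid.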

Lack of Support for Realtime Data

As discussed above, every modern open data portal provides a way to programmatically access data that is housed in it – data is accessed via an API by making an HTTP request (with the required information – e.g., authentication – in the request) and getting a response back (typically in either JSON or XML format). This data access paradigm fits well with the way that most of the data in municipal open data portals is updated – usually not more frequently than daily.

If data updates happen frequently – or if a data consumer wants to check and see if data has changed since the last time it was accessed – a consumer application can poll the API for changes at set intervals. And though this approach works acceptably well for data that doesn’t change all that often, it is far from acceptable for data that does (or could) change more frequently. In fact, the closer updates to data get to realtime changes, the less optimal this approach is because it places a heavier burden on consumers (who must poll the API for data changes more frequently) and on the data portal itself (which must handle and respond to more frequent requests from API consumers).
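To make the cost of polling concrete, here is a rough sketch of a polling loop. It assumes the API supports standard HTTP conditional requests (an ETag plus a 304 Not Modified response), which not every portal does. The `fetch` callable is hypothetical and stands in for the actual HTTP call.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical transport: takes the last seen ETag (or None) and returns
# (http_status, etag, body). A real one would send an If-None-Match header.
Fetch = Callable[[Optional[str]], Tuple[int, Optional[str], Optional[bytes]]]


def poll_once(fetch: Fetch, last_etag: Optional[str]) -> Tuple[Optional[str], Optional[bytes]]:
    """Make one request; return (etag, body), with body None if nothing changed."""
    status, etag, body = fetch(last_etag)
    if status == 304:  # Not Modified: keep the ETag we already have
        return last_etag, None
    return etag, body


def poll(fetch: Fetch, on_change: Callable[[bytes], None],
         interval_seconds: float, max_polls: int) -> int:
    """Poll at a fixed interval; return the total number of requests made."""
    etag: Optional[str] = None
    requests_made = 0
    for _ in range(max_polls):
        etag, body = poll_once(fetch, etag)
        requests_made += 1
        if body is not None:
            on_change(body)
        time.sleep(interval_seconds)
    return requests_made
```

Note that every pass through the loop costs a request, whether or not anything changed. Shrinking the interval to get closer to realtime multiplies those mostly wasted requests for both sides.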

Other – more efficient – approaches to accessing data can be used when data updates occur more frequently. These approaches – like server-sent events and WebSockets (which are both part of the HTML5 specification), or registering a callback URL (or webhook) – benefit both the data consumer and the data producer.

The number of open data portals that support these more efficient methods of consuming realtime data? Zero. Zilch. Bupkis.
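For a sense of how little the consumer side of server-sent events requires, here is a simplified sketch of a parser for the event stream format. It handles only the `event` and `data` fields and comment lines, and it ignores `id` and `retry`. The event name and payloads used with it are made up for illustration; no portal is known to emit them.

```python
from typing import Dict, Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Turn the lines of a text/event-stream response into event dicts."""
    event = "message"  # default event type when no "event:" field is sent
    data = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event; dispatch it if it has data.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is dropped
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)
```

The consumer opens one long-lived connection and the server pushes an event only when something actually changes, so nobody pays for requests that return nothing new.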

No Revision History

One of the most important things that governments publishing open data can do is to ensure that it is timely and accurate. These values not only make eminent sense, they are enshrined in the guiding principles behind the open data movement. With this in mind, it’s a very good thing that we see governments updating their data sets frequently – particularly when these updates are driven by users who surface issues of accuracy or completeness that a government corrects with a subsequent data release. Feedback loops, y’all – they work.

However, for all of the money being spent on open data platforms, none of them make it easy to track the revision history of a data set over time. There are cases where this can be one of the most important aspects of a data set – how has it been enhanced and updated over time? Diff’ing things is such a common practice in the world of technology that the complete absence of this functionality in commercial open data portals is pretty frustrating.

More and more governments are using GitHub for their data, and GitHub makes it super easy to track the revision history of a file – that’s what GitHub does. Hopefully open data vendors will take a cue from GitHub and build this functionality into their platforms.
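In the meantime, data users who keep their own copies of successive releases can approximate a revision history themselves. A minimal sketch, assuming each release is a CSV with a stable unique key column (the `id` and `status` column names in the usage below are invented for illustration):

```python
import csv
import io
from typing import Dict, List


def load_by_key(csv_text: str, key: str) -> Dict[str, Dict[str, str]]:
    """Index the rows of a CSV document by the value in the `key` column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}


def diff_revisions(old_csv: str, new_csv: str, key: str) -> Dict[str, List[str]]:
    """Report which keys were added, removed, or changed between two releases."""
    old = load_by_key(old_csv, key)
    new = load_by_key(new_csv, key)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


old_release = "id,status\n1,open\n2,open\n3,closed\n"
new_release = "id,status\n1,open\n2,closed\n4,open\n"
print(diff_revisions(old_release, new_release, "id"))
```

This only works if you happened to save the earlier release, which is the whole problem: the portal is the one place that sees every version, so it is the natural place to keep the history.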

Rant: over

I realize I’m being picky, and I’m largely ignoring the fact that more and more governments are standing up open data portals. Let’s not kid ourselves – this is awesome.

But the amount of money spent by governments on open data portals can be significant – these products should meet the needs of both governments and data users.

Got a gripe with open data portals? Agree or disagree with a point I’ve made here? I’d love it if you added a comment below.

Thanks for reading.

16 comments

  1. Eve Ahearn · April 1, 2015

    As long as it is also possible to download the data itself, I support on-site visualizers. They provide a way for the non-technical to use the data themselves, as opposed to only interacting with open data indirectly through apps. What I am against though, is then that it is often difficult to *export* the visualizations made this way. Why give people a tool to chart the data, if they can’t then export the chart?

  2. I think open data movement and dogfooding is under-reported by a hefty margin. In general, tracking usage is pretty hard and, coincidentally, seems harder for dogfooding. In my case, it seems there is a better path for civic developers to let me know how they’ve used the portal, since the usage is more prominent and there is an eagerness to share. For dogfooding, there isn’t the same eagerness and incentives to share–city users simply use it and then move on. All of this tracks back to limited documentation on dogfooding, but I hope these examples are illustrative for not only Chicago, but for other municipalities:

    + Ad Hoc Analysis: The most frequent type of dogfooding, I suspect, is the use of the portal to conduct ad-hoc analysis. Even for city users, sometimes the portal is the easiest path to obtain data. This is certainly the case for shapefiles, but also for items like 311 data, business permits, etc. On dozens of occasions, staff throughout the city have used data to help supplement analysis, make maps, etc. Some departments, who are able to see the most distilled data on the portal, have made graphs they’ve never seen before (e.g., breakdown of vehicles by fuel type).

    + Administrative process: To drive a cab, you need to be licensed. Chicago released the list of all public chauffeurs on the data portal (https://data.cityofchicago.org/Community-Economic-Development/Public-Chauffeurs/97wa-y6ff), which is now being used by cab companies to verify if their drivers are allowed to drive. When a cab driver shows up to drive the cab, both the business affairs department and cab companies use the portal to check eligibility. That is, the portal is part of a city business process.

    + Data source: Chicago developed and uses WindyGrid (http://datasmart.ash.harvard.edu/news/article/chicagos-windygrid-taking-situational-awareness-to-a-new-level-259), which combines multiple data sources. To power this application, we ingest three datasets from the portal: Parks (https://data.cityofchicago.org/Parks-Recreation/Parks-Locations/wwy2-k7b3?) and Schools (https://data.cityofchicago.org/Education/Schools/kqmn-byj8?). That is, the data portal sometimes is the best original source of data to pull into other systems.

    These are some quick examples so far, but tracking and documenting use cases is rare. There are a lot of instances, yet, where BI solutions could be pointing to the portal instead of backend databases, there is more training that could be done to make use of, for instance, OData portal to pipe into Excel (a common tool for any dogfooding).

  3. Chris Mathews · April 1, 2015

    My biggest gripe with all the portals is that the visualization tools bite the big one. If some could seamlessly integrate d3.js or something higher up, it would fix the dog food problem and the power user taking the data and running. I’m not sure the last two are strictly the fault of open data portals – version control is a joke where I work and some of our government systems are so old or living in silos that people have built a career out of protecting that getting daily or weekly feeds takes an act from elected officials.

  4. I think one problem with the dogfood narrative–or the lack of it–is the absence of capturing use-cases around the practice. In my experience, civic developers have better means and a greater desire on how they’ve used open data for some purpose. With dogfooding, city users may not have the opportunity to share their stories (e.g., at hack nights, blogs, etc.) nor incentives to share–they simply do the work and move on. But I do think there are some stories where dogfooding is present:

    * Ad-hoc analytics: I think the most frequent dog fooding is ad-hoc analysis. Sometimes the portal, primarily using Excel downloads, is the best and easiest place to find data, even for gov employees. I’ve seen the visualization tools used a bit and certainly interactive maps–especially maps. Portals provide the most distilled information, which has made it easier to focus on graphs and maps.

    * Administrative use: To drive a cab, taxi companies must verify a driver is eligible to drive. After partnering with the business enforcement agency, the list of eligible drivers is posted on the portal (https://data.cityofchicago.org/Community-Economic-Development/Public-Passenger-Vehicle-Licenses/tfm3-3j95). Now, the city and cab companies look at the portal to operationalize their business process. I don’t think this type of case gets associated with dogfooding, but do think it’s an important use case.

    * Operational use: Chicago built and uses WindyGrid (http://datasmart.ash.harvard.edu/news/article/chicagos-windygrid-taking-situational-awareness-to-a-new-level-259), which combines multiple data sources. Often, for performance, it’s easier to reference the source DB. However, we do reference the portal when the portal is the most authoritative source. In this case, we use a list of schools and a list of parks on the portal and feed it into WindyGrid.

    But more can be done in this area. There are a lot of instances where the BI capabilities could be stronger on portals or internal BI applications reference portals. Though, a lot of dogfooding stories don’t seem to be captured. Despite these stories, I suspect I am more aware of how the public have used the portal than how city employees have used it.

  5. mheadd · April 2, 2015

    Tom – thanks for those examples. Chicago is definitely ahead of the curve in its utilization of its open data portal.

    Though I think I’d challenge the assertion that the issue is simply underreporting – government employees might not be incentivized to surface these examples and make noise about them, but the companies that sell open data portal software sure do. Why aren’t Socrata, Junar, et al. publicizing these instances if they are occurring in meaningful numbers?

    But beyond just the examples you’ve shared – which are awesome, particularly the taxi company example – my comment about dogfooding is meant to refer to the API access to data provided by open data portals. This is one of the features that gets touted most often by companies selling portal software as a game changer (I’ve been on the receiving end of many of these), and yet almost no one is using these APIs for open data to power government apps – at least not that I can find.

    I get the examples of downloading spreadsheets or shapefiles to support the operations of different agencies, but why do governments need an open data portal for this? A standard FTP server, or a shared directory on a government’s network could serve this purpose – at a fraction of the cost.

    If you haven’t read Anthea Watson Strong’s post on dogfooding, I’d highly recommend it – I think it accurately captures where we are currently in the open data movement.

  6. The points are good. It seems the emphasis on portals are resoundingly on public engagement, moreso than internal usage. It seems the leading thinkers–those who are comfortable with portals–are wondering around dogfooding. Thus, my impression is the vendor stories are about community engagement–but this is debatable and

    The API discussion is important. Often, I’d point to technical capability to use APIs. Often, our users wouldn’t know how to leverage an API–save for OData, which is an easier, more baked-in protocol–and thus don’t use it. Our more advanced users who can use APIs are also the group who have direct access to systems. I’ve faced this choice at times, which equates to the choice:
    * A: Data source -> ETL -> Portal -> Application
    * B: Data source -> Application.
    I do think (B) is the more pragmatic choice. But there are caveats I noted above. Sometimes, the portal is the most authoritative source, especially when the system is managed offsite or by outside agencies.

    Discoverability and usability is a key difference between a portal and flat-file methods. Searching Google yields portal results, searching the portal itself. I have often found myself using the portal to access shapefiles, even though I have access to an ArcGIS server.

    I think there are many pragmatic choices that lead us down this road. But I should note the narrowness of my remarks to mainly municipal government. Anthea is discussing federal government, where my note in paragraph 2 is far less applicable. But, I often think there are two fundamental lines of discussion around open data: national programs versus local (city/state) initiatives.

  7. Eric Roche (@KansasCityEric) · April 2, 2015

    I can definitely say that city staff use the data portal for Kansas City (Data.kcmo.org). For example, the Office of Performance Management gets most of their data for analysis directly from the portal. We built it this way to make sure that the data is useful, accurate, and recent. We want staff to rely on it, because it ensures that it will be kept up to date.

    The KCSTAT program and internal performance management meetings rely heavily on this data for analysis. When we present our visualizations, we let people know that the data is available to them too. This data may reveal problems that we then address via process changes. Most of the data that the Mayor, City Council, and City Manager see come directly from the data portal. Socrata has published some information about these types of efforts at http://www.socrata.com/newsroom-article/socrata-launches-open-data-tv-odtv/.

    The City has also built out a series of dashboards to monitor progress on the City Council’s Goals and Priorities. You can visit these at kcstat.kcmo.org. Additionally, budget.kcmo.gov has only recently been launched, but is already being used by Departments and Budget Analysts to get a quick view of what their budget looks like. This process used to require advanced knowledge of our financial system. Therefore, the ability for any staff member to check out their budget with a few clicks saves a lot of staff time.

    I agree with Tom that tracking staff use of Open Data is really challenging. The platform Open Data provides is extremely useful – and is something staff seems to rely on once they learn how to use it efficiently. For example, the ability to upload a spreadsheet and map it without having to use GIS software is a fantastic resource for analysts across the organization. It also frees up our GIS experts for more complicated projects. The visualizations that you can generate may not be stellar, but they are effective at communicating information. The Open Data portal has made it much easier for staff to use and communicate data effectively and efficiently.

  8. mheadd · April 2, 2015

    Eric – thanks for your comments, really appreciate you sharing this information.

    I’d echo to you what I said to Tom in an earlier response – most of my concern about dogfooding relates to governments leveraging the APIs in their portals to build officially branded, externally facing apps. Essentially, doing from inside government what civic hackers are doing from outside

    KCStat and the Open Budget site are nice, but those are both products that are provided by Socrata that are tightly coupled to their open data portal software. They are (effectively) extensions of the Socrata platform, and don’t really represent the city independently consuming the APIs it is making available to outside developers – at least not to my way of thinking.

  9. Tim Nolan (@plotboy) · April 2, 2015

    Amen, Brother.

    I believe that Open Data vendors are inflating the need and value of their service. Many gov’t agencies already provide data for free. Why should we spend a lot of $$$ on dumping it into a portal when it’s already available? Open Data vendors have created a false market for themselves.

  10. Joel Natividad · April 2, 2015

    Great post Mark! I share the same frustrations with open data and I couldn’t agree more.

    And you’re right, the dirty secret of data portals is that they’re primarily used as fancy FTP servers. API usage is pitifully low because the data is not readily usable.

    At Ontodia, we aim to address some of these issues by “productizing” out of the box functionality beyond an FTP server – making it easier to create customizable, embeddable, production-ready maps (not just map viewers), to publish clean data, to foster discussions around datasets, etc.

    We even bundle a dog-fooding geocoder that geocodes uploaded datasets using previously uploaded spatial datasets (i.e. if you upload a neighborhood association GeoJSON, you can geocode against that).

    As a CKAN specialist, I’m a tad biased but remain hopeful. A lot of these very issues are being addressed by the community as we collectively scratch our itches.

    I think we’re still in the early days of Open Data. Remember AOL, Compuserve, Prodigy and all the proprietary networks before the Internet? What about all the proprietary CMSes before WordPress, Drupal, et. al.?

    So how can we move to the next stage? Some thoughts:

    * open standards – real-world standards that emulate the GTFS example that address a real-world need
    * reference implementation of those standards – create apps that consume the standard that can be quickly implemented in another jurisdiction
    * APIs everywhere – with CDOs, CAOs, CDSs, CTOs now becoming more common in government at every level, service oriented architecture should become more common in time as we work towards Government as a Platform. This will take a while
    * Regional Portals – let jurisdictions pool resources, talent and users
    * Aggregators – a lot of this data was wrangled, aggregated, and repackaged in the private sector, even before data portals. Regional portals, undoubtedly, will perform some of this, but the private sector, specifically in the CivicTech sector, will still play a role

  11. Jerry Hall · April 4, 2015

    I think there’s a lot of people inside gov and outside (advocacy, media, community groups, citizens) that could use a wishlist board to post their needs. Not only for the datasets themselves but, for the output. The individual visualization, map, table or output that would help them do a better job – ultimately helping make life better. It seems if you’re seeing this apathy then the solution is to get the line-workers inside government, and active citizens, the ability to nudge leaders in a forward direction.

    Perhaps such wishlist items could be rated and commented on by individual subscribers? Such feedback wouldn’t push development because of the relative ease in an interested party to load up ‘interest’ eg. an aggressive anti-any-change group engaging its social media audience to stymie real insight into problematic conditions.

    It seems too what’s needed is the opportunity for the ‘wish’ creator to be able to do so within an anonymized system where they can ask for solutions and contribute responses to questions anonymously. The reason being that insiders can outline creative ways to expose or otherwise display data that would help others to see elements for the first time or from a different perspective. Without anonymity, many insiders may be too politically or career intimidated to make such suggestions publicly.

    Ideally such a tool would spawn more civic-stakeholder engagement and ideally some actual change by government internally. Since posts are public, entrepreneurs, existing civic-tech enterprises, civic hacks (from Code For America Brigades to independent hacks inside and outside government) could review and select projects to complete.

    Such a portal would be valuable if subscribers could view solutions others have utilized and request deployment in their community or some modification.

    Finally, more value would be created if all stakeholders had an ability to present deployments, challenges, opportunities, testimonials etc. that could help future communities engage without paying the heavy price of death by apathy or poor execution.

  12. Steven De Costa · April 6, 2015

    Heya Mark,

    I’ve thought the same on some of the points you have made, but have also been thinking about various solutions.

    No One Eats Their Own Dog food: With CKAN 2.3 the resource views should allow for extensions to be developed that allow visualizations to be embedded within other Govt sites. With such a platform I’m hoping we’ll see a few things emerge:

    1. Govt directly embedding the viz from a portal’s dataset view into their own agency site to help communicate with various audiences.

    2. Govt directly embedding the data capture tools into their own agency sites to help collaborate with citizens. (general forms, requests for consultation, surveys etc)

    3. Data First Web Design – I think this will be a thing before too long 🙂

    4. Govt understanding that their data portal is useful infrastructure that can be used as the backbone for all kinds of solutions.

    Get Out of the Way and Give Me My Data!: I agree and believe the primary users of a data portal are machines, the secondary users are custodians and the tertiary users are anyone else (which includes the casual hacker or two browsing for data they’ll connect to machines 😉).

    Lack of Support for Realtime Data: I’m hoping this will be pushed forward pretty quickly with more cities connecting up sensors to capture data. I think there will be some exciting applications for those working with realtime data and I think portal software such as CKAN is ‘ready’ for the challenge. I can see there is a good install base and an increasingly active developer community behind the project.

    Personally I’m looking for realtime data projects to push the boundaries of CKAN and hope to have some news on that soon.

    No Revision History: I’ve looked into this for CKAN with regard to custodian use cases (rolling back after an update to a resource, etc). My company was waiting until the 2.3 release (now out) before running at some of this stuff too hard. But, I think you’re right with regard to diffs and requests for optimization. Will be thinking on this more 🙂

    Hoots!

  13. Steve Bennett (@stevage1) · April 23, 2015

    G’day,
    Nice rant! 🙂 A couple of comments:

    You complain that open data portals have useless bells and whistles (sorting, grouping, graphing, visualising) yet lack features (realtime support, revision history etc). Basically you just want more. Ok.

    IMHO (having worked with quite a few government bodies on open data stuff now), the most important thing government can be doing is releasing data. There are other groups (civic hackers etc) who can build nice features, but no one else can release that data. So it’s ok if “open data portals suck” if resources are being spent instead on actual data release. If.

    Realtime data. Well, it’s hard, it’s new, and it’s niche. There’s really not much genuinely “realtime” data. Frequently by the time it makes it to the open data portal, it’s minutes or hours old. Is better access to this “realtime” data really a concern for *now*?

    And finally revision history? Sounds exactly like the kind of value added service that the open data community can add, and I don’t see much wrong with that.

  14. Mark,

    I’ve been waiting to reply until we had some more written up about the Housing Data Hub (http://housing.datasf.org), so jumping in a little late to the game. But basically, we’ve been working for the past couple of months on building this and very actively working toward using our open data APIs in it.

    There’s much more about the process of building here: http://datasf.org/blog/raising-digital-barn Mostly, this gives the overview of our approach, but I’ll have deeper reflections/technical points on how we used the portal APIs and what we learned in doing so later on (likely as part of the code documentation).

    The high point is we’re making API calls to power a set of visuals across a number of datasets, providing context to those interested in policies that affect housing affordability in San Francisco. This is part of our broader approach of doing strategic (or thematic) releases.

    Note, the Hub is not about accessing affordable housing, just understanding the policy context with supporting administrative datasets. But, while I’m on that distinction, let me air my broader concern around dogfooding and open data portals. I think we are still at very early stages of maturity in the open data world around processes, methods and technologies. This is why I wouldn’t build a product to access affordable housing using the open data portal as a backend. The current portal products are by their nature broad and apply to many different data and needs, but sometimes the requirements of the product may not be met by an open data portal. For example, handling private data, complex relational data, or maintaining SLAs at much higher levels than open data portals can currently provide. The APIs being inaccessible for a visual on eviction notices won’t stop business in the City, but it very well could if critical services are built off of them. (Tom gets at this earlier in his response regarding making pragmatic choices).

    I’m not saying you’re suggesting using open data portals for all government services, but I just wanted to take a moment to reiterate here that I think broadly we need to use the appropriate technology for the job, and consciously design open data access into those approaches so that surfacing the public data is straightforward. This will be a critical mark of a maturing open data program.

    All that said, I totally agree with the notion of dogfooding and what it can bring to government, but it is one piece of a broader strategy to institutionalize open data approaches at the City. The housing data hub is an example where we’re dogfooding, and you’ll see more in a couple of months. It is decidedly wonky, but for an outward and inward facing audience. Dogfooding has allowed us to understand the challenges and opportunities with API access in a hands on way. The things we are learning will then be shared back to the world through our website and github repos, and internally through our City training program, the Data Academy (http://datasf.org/academy).

    Thank you for the engaging post! My hope is that we can all grow and learn together and I’m encouraged by the broader community and trajectory of all the great work that cities, states and the Federal government continue to do.

  15. Andrew Clarke · June 5, 2015

    Hi Mark
    Forgive me if I’ve got this wrong but I’m pretty sure the UK Gov’s Department for International Development is eating its own dogfood through DevTracker: http://devtracker.dfid.gov.uk/ It’s populated from its IATI data feed. Here’s their data architect describing their approach: https://dfid.blog.gov.uk/2013/06/05/eating-our-own-dog-food/

    But I think your point still stands 🙂

  16. Dave · July 29, 2015

    Found this post very interesting.

    However, as someone who works in local government as a developer, and exports data on a regular basis to a city data portal, I’ve found no reason to use the data portal internally. For two primary reasons:

    1. The data portal provides an external means for citizens to access data where the master copy is held within internal systems. If we’re developing apps we’ll be able to go direct to the source – something that is generally much harder to make available to local developers. It’s not that ‘we won’t eat it ourselves’, the data portal is perfectly good and useful, it just wouldn’t make architectural sense for any internal app to go to data that has been exported where it could go direct.

    2. A benefit of Open Data is that citizens are able to create apps with that data. The biggest challenge with open data has been finding datasets that are able to be released under a complete open data licence. Once those data sets are found and released I’d far rather leave them to the community to create apps with, focussing on creating useful solutions with the datasets that can’t be made open. That seems in general a better use of local government time and gives better ‘coverage’ of the data.
