Governments make a wealth of data available on their public websites.
Forward-thinking governments have taken steps to ensure that more and more of this information is available in open formats. But the majority of the data available on government websites – even those with formal open data policies, like Philadelphia – is not open. It is merely “public.”
What is Public Data?
Public data is available for viewing, typically as an HTML page (or series of pages), a PDF document or in some similar format suitable for publishing on the web. It is meant to be consumed by eyeballs, not by computer programs or applications.
It is possible to access public data programmatically, using the scraping tools available in most programming languages, services like ScraperWiki, or even something as simple as Google Docs.
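To make the idea concrete, here is a minimal sketch of a scraper that uses only Python's standard library. The page markup, column names and dollar amounts are made up for illustration; a real page (such as a revenue department's balance lookup) would have its own structure, and the scraper would need to be adapted to it.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row being parsed, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return table rows as dictionaries keyed by the header row."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *records = parser.rows
    return [dict(zip(header, record)) for record in records]


# Hypothetical markup standing in for a "public" record page.
SAMPLE_PAGE = """
<html><body>
<table>
  <tr><th>Year</th><th>Principal</th><th>Total Due</th></tr>
  <tr><td>2012</td><td>$1,200.00</td><td>$1,450.75</td></tr>
  <tr><td>2013</td><td>$1,250.00</td><td>$1,310.20</td></tr>
</table>
</body></html>
"""

records = scrape_table(SAMPLE_PAGE)
for record in records:
    print(record["Year"], record["Total Due"])
```

Note how much of this code is tied to the page layout: if the table gains a column or the header text changes, the scraper has to be revisited.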
However, writing web scrapers – even though the technology options for doing so have advanced tremendously in recent years – is tedious and the notion is unappealing to most data users. In addition, there are a number of reasons why making important data available solely as public data is undesirable.
An Example: Scraping Tax Balance Data
I’ve never met the person who recorded the screencast above, but data on property tax balances in Philadelphia was important enough to them to invest the time to write a data scraper. (This data currently exists only as public data – viewable one record at a time – on the web site of the Philadelphia Department of Revenue.)
The screencast this individual recorded and posted to YouTube serves as a strong indicator that this data should be prioritized for release by the City of Philadelphia as open data. Until it is made available in this way, more people will be forced to obtain this important data by writing scrapers, or through alternate channels.
Property tax data is an ideal example to illustrate the problems of public data in the context of a city that has a formal open data policy. It meets every conceivable definition of the term “high value” data – the fact that external users go out of their way to write data scrapers should be enough to confirm this.
Moreover, there is a strong correlation between tax delinquency and vacant properties, and research is currently underway at Rutgers University’s School of Criminal Justice on the relationship between tax delinquency and the incidence of serious crimes in Philadelphia.
This is data that people want so that they may better understand how their city works, and may explore ways that they can make it work better.
High-value data that is only made available as public data presents a number of problems, both for data users and for the governments that produce it.
Forcing people to employ web scrapers places an unneeded barrier in front of what should be open data. Furthermore, it limits the number of users who can access such data – not all data users are technically sophisticated enough to create and use web scrapers. Even for those who are, writing scrapers is almost never desirable – it’s tedious, can be very time-consuming, and most scrapers are brittle pieces of technology that get thrown away once the desired data is obtained.
For governments, the problems created by web scraping can be significant. Scrapers can place an unnecessary burden on IT infrastructure – disrupting normal operations and causing problems for more mainstream data users (those that may want to view data one record at a time). When inelegantly designed scrapers send non-trivial amounts of web traffic to a government site, it can create problems that IT staff must identify and troubleshoot. This wastes valuable staff time on problems that are easily avoided.
In addition, by failing to make data available in more easily consumable formats, governments might be missing opportunities to use such data to improve their own operations. What if every interaction with city government by a taxpayer (a citizen or a business – residents and nonresidents alike) could include an API call to check for outstanding property taxes?
Open data can make such an operational improvement a reality – by making data more accessible and more usable for the public, governments also make data more usable by their own operating units.
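As a rough sketch of what such a check might look like, the snippet below queries a property tax API and totals what is owed. Everything specific here is an assumption made for illustration: the endpoint URL, the `account` query parameter and the `balances` / `total_due` fields in the JSON response are hypothetical and do not describe any real City of Philadelphia service.

```python
import json
from decimal import Decimal
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint, used only for illustration.
API_BASE = "https://api.example.gov/property-tax/balances"


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def outstanding_balance(account_number, fetch=fetch_json):
    """Sum the amounts due across all tax years for one account."""
    url = f"{API_BASE}?{urlencode({'account': account_number})}"
    payload = fetch(url)
    # Assumed response shape: {"balances": [{"year": ..., "total_due": "..."}]}
    return sum(
        (Decimal(item["total_due"]) for item in payload["balances"]),
        Decimal("0"),
    )


def has_outstanding_taxes(account_number, fetch=fetch_json):
    """True if the account owes anything at all."""
    return outstanding_balance(account_number, fetch) > 0


def fake_fetch(url):
    """Stand-in for the network call so the sketch runs offline."""
    return {
        "balances": [
            {"year": 2012, "total_due": "1450.75"},
            {"year": 2013, "total_due": "0.00"},
        ]
    }


# "ACCT-001" is a made-up account identifier.
print(has_outstanding_taxes("ACCT-001", fetch=fake_fetch))
```

The `fetch` argument is there so that an internal application, or a test, can swap in its own transport without changing the lookup logic.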
But the strongest criticisms of public data derive from the principles that guide most open data policies – including the City of Philadelphia’s policy.
…agencies shall publish information on line (in addition to other planned or mandated publication methods), and in an open format. The open format will provide data in a form that can be retrieved, downloaded, indexed, searched and reused by commonly used web search applications and software.
Making valuable data “public” but not “open” is a departure from the principles outlined in most open data policies. Some more cynical observers have suggested that it is a way for government officials to lay claim to the ideals of open data while ensuring that information remains difficult to use for serious analysis. This may be why the vast majority of campaign finance and lobbying information published by governments is made available in PDF format only.
Any government that has a formal open data policy should target and prioritize data that is being scraped for release as open data. In the absence of such a prioritization, we may struggle to meet our objectives for openness and government transparency, we may miss real opportunities to improve government operations and we are likely creating avoidable problems for our IT staff and technology infrastructure.
Clean, well-documented, public APIs accompanied by static data downloads should be the standard we strive for when releasing high-value data.
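For comparison with scraping, here is a minimal sketch of how little code a static download requires once data is published in an open format. The CSV layout, column names, account identifiers and amounts are all invented for illustration and are not taken from any actual city data release.

```python
import csv
import io
from decimal import Decimal

# Made-up sample standing in for a downloaded CSV file.
SAMPLE_CSV = """account,year,total_due
ACCT-001,2012,1450.75
ACCT-001,2013,1310.20
ACCT-002,2013,0.00
"""


def total_due_by_account(csv_file):
    """Add up the amount due for each account across all years."""
    totals = {}
    for row in csv.DictReader(csv_file):
        account = row["account"]
        totals[account] = totals.get(account, Decimal("0")) + Decimal(row["total_due"])
    return totals


totals = total_due_by_account(io.StringIO(SAMPLE_CSV))
delinquent = sorted(account for account, due in totals.items() if due > 0)
print(delinquent)
```

With a real download, `io.StringIO(SAMPLE_CSV)` would simply be replaced by an open file handle.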
[Note – In December 2015 (almost two years after I wrote this post and began working inside the City of Philadelphia to release property tax balance data), the city released this data as open data.]