Creating open data standards for cities is really, really hard. It’s also really, really important.
Data standardization across cities is a critical milestone that must be reached to advance the open data movement and to fully realize the potential benefits of openly publishing government data. More and more people are starting to recognize the importance of this milestone, and more and more energy will be devoted to creating new standards for city data in the months and years ahead.
The best example of what is possible when governments publish open data that conforms to a specific standard is the General Transit Feed Specification (GTFS). Developed by Google in partnership with the Tri-County Metropolitan Transportation District of Oregon (TriMet), GTFS is a data specification that is used by dozens of transit and transportation authorities across the country, and it has all of the qualities that open data advocates hope to replicate in other data standards for cities.
Transit authorities that publish GTFS data see an immediate, tangible benefit because their transit information is available in Google Transit. Making this information more widely available benefits both transit agencies and transit riders, but the immediacy with which transit agencies can see this benefit makes GTFS particularly valuable. Data standardization is an easier sell to government officials when tangible benefits are quickly realized.
The GTFS standard is relatively easy to use – it’s a collection of comma-delimited text files packaged in a zip archive. This is a pretty low bar for transit agencies being asked to produce GTFS data, and it’s an eminently usable format for consumers of GTFS data. In fact, the ease of use of GTFS has spawned a cottage industry of transit applications in cities across the country, and GTFS data continues to be the bedrock set of information for transit app developers.
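To show how low that bar is, here is a minimal sketch of reading one table from a GTFS feed using only the Python standard library. The file name `stops.txt` and the columns `stop_id`, `stop_name`, `stop_lat` and `stop_lon` come from the GTFS specification; the function names and the two sample stops are made up for illustration and do not describe any real agency's feed.

```python
import csv
import io
import zipfile


def read_gtfs_table(feed, table_name):
    """Return the rows of one GTFS table (e.g. "stops.txt") as a list of dicts.

    `feed` is a path to a GTFS zip file or a binary file-like object.
    """
    with zipfile.ZipFile(feed) as archive:
        with archive.open(table_name) as raw:
            # utf-8-sig tolerates the byte-order mark that some feeds include.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            return list(csv.DictReader(text))


def build_sample_feed():
    """Build a tiny in-memory feed with made-up stops, for illustration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "stops.txt",
            "stop_id,stop_name,stop_lat,stop_lon\n"
            "S1,Example St & 1st Ave,45.5200,-122.6800\n"
            "S2,Example St & 2nd Ave,45.5210,-122.6790\n",
        )
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # Print each stop's identifier, name and coordinates.
    for stop in read_gtfs_table(build_sample_feed(), "stops.txt"):
        print(stop["stop_id"], stop["stop_name"], stop["stop_lat"], stop["stop_lon"])
```

Because every agency publishes the same file names and column names, the same few lines should work against any feed by passing the path to its zip file instead of the sample buffer.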
And perhaps most importantly, GTFS has given open data advocates a benchmark to use to advance other data standardization efforts. In many ways, GTFS made standards like Open311 possible.
So if data standardization is the future, and we’ve got at least one really good example to demonstrate the benefits to stakeholders and advance the concept, then what’s next? What’s the next data standard that will be adopted by multiple governments?
For the past year or so, there has been widespread interest in developing a shared data standard for food safety inspection data. On its face, this seems like a good candidate data source to standardize across cities. Most cities (certainly all large cities) conduct regular inspections of establishments that serve food to the public. The results can be (but are not always) fairly succinct – usually a letter grade or numerical score – and can easily be delivered to an end user on a number of different platforms and channels. For many reasons, focusing on food safety inspection data as the next data set to standardize across cities makes a lot of sense.
Just recently, the joint efforts of several different groups culminated in an announcement by the City of San Francisco and Yelp to deliver standardized food safety inspection data through the Yelp platform.
I was involved in the discussions about a data standard for food safety inspections, though the City I work for will not be adopting the newly developed standard (at least not yet). The process of developing the new food safety inspections data standard was illuminating. There are some important lessons we can take away from this work – lessons we can put to use as we work to identify additional municipal data sets for standardization.
For me, the biggest lesson learned from the work that went into standardizing food safety inspection data is understanding when applying a data standard might obscure important differences in how data is collected, or in what data means. By way of example, a data standard like GTFS does not obscure differences in the underlying data across different jurisdictions. A transit schedule broken down to its essence is about location and time – when will my bus be at a specific stop on a specific route. There is nothing inherently different about this information from jurisdiction to jurisdiction. Time and place mean the same thing everywhere.
But this is not always the case with food safety inspection data – particularly when this data is distilled into digestible (pun intended) scores or rankings. The methods for conducting food safety inspections can vary widely from city to city, and these differences mean that the same conditions in a restaurant can produce very different scores depending on which city did the inspecting.
Daniel E. Ho, a professor at Stanford University, conducted an in-depth study of the restaurant inspection systems in New York City and San Diego and found that the way in which inspection regimes are implemented can produce data that is not directly comparable across cities.
“While San Diego, for example, has a single violation for vermin, New York records separate violations for evidence of rats or live rats; evidence of mice or live mice; live roaches; and flies — each scored at 5, 6, 7, 8 or 28 points, depending on the evidence. Thirty ‘fresh mice droppings in one area’ result in 6 points, but 31 droppings result in 7 points.”
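To make the point concrete, here is a rough sketch of how two inspection regimes could record the same observation. Only the 30-versus-31 droppings boundary (6 points versus 7 points) comes from the passage quoted above. The function names, the treatment of every count from 1 through 30 as 6 points, and the idea that any evidence triggers the single San Diego-style violation are simplifying assumptions of mine, not the actual inspection rules of either city.

```python
def new_york_style_points(fresh_mice_droppings):
    """Points for fresh mice droppings on a New York-style scale (simplified)."""
    if fresh_mice_droppings <= 0:
        return 0
    # From the quoted passage: 30 droppings -> 6 points, 31 droppings -> 7 points.
    # Assumption: every count from 1 through 30 is treated as 6 points here.
    return 6 if fresh_mice_droppings <= 30 else 7


def san_diego_style_violation(fresh_mice_droppings):
    """Single yes/no vermin violation (assumes any evidence counts)."""
    return fresh_mice_droppings > 0


if __name__ == "__main__":
    # Compare how each regime records 30 and 31 droppings.
    for count in (30, 31):
        print(count, new_york_style_points(count), san_diego_style_violation(count))
```

Under these assumptions, one extra dropping changes the New York-style score while the San Diego-style record stays the same. A shared "score" field in a data standard would hide exactly that kind of difference.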
There also appears to be some debate in the medical community about the effectiveness of simplified grading for food establishments – i.e., using a letter grade or a numerical score. As noted in Professor Ho’s report – “…a single indicator has not been developed that summarizes all the relevant factors into one measure of [food] safety.”
All that said, if we’re going to advance the work of creating data standards across cities we need to identify the right data sets to standardize. These candidate data sets should have the same qualities as GTFS – demonstrating immediate benefits to data producers and data users, ease of use – but not have some of the less desirable qualities of food safety inspection data – obscuring differences in data collection and data quality across jurisdictions.
Lately, I’ve been trying to advance the idea that data about the locations where flu shots (or any other form of inoculation) are administered could be standardized across cities. I’ve gotten some great input from data advocates and from other cities, including Chicago and Baltimore.
I’m hoping to continue pushing this idea in the months ahead, leading up to the next flu season. If this most recent flu season has shown us anything, it’s that data matters – I think there could be enormous benefit in having cities use a standard data format for this information before the onset of the next really bad flu season.
But whether it’s flu shot locations or some other data set, the future of open data lies in building standards that multiple cities and governments can adhere to. This is the next great milestone in the open data movement.
Advancing the movement toward this goal will be the most important work of the open data community in the months and years ahead.
[Note - photo courtesy of the San Diego International Airport]