Every data wrangler has their own list of favorites – the go-to tools they reach for when they need to work with data.
If you need to clean, transform, or mash up data, or if you are working with a data set that will form the basis for an application, here is a list of tools that can make life easier for you.
- OpenRefine – I don’t think there is a better tool for cleaning messy data than OpenRefine. One of my favorite features is the ability to add new columns to a data set based on data in an external web service.
- jq – I see a lot of JSON in my job, and it’s exceptionally easy to work with JSON data using a tool like this one. For example, here is a simple jq recipe for extracting a list of licensed pawn shops in Philadelphia to a CSV file.
- csvkit – CSV is another format I see almost every day, and csvkit makes working with it simple. My favorite utility – though I don’t use it often – is csvsql. Use this handy utility to generate SQL insert statements and easily create a relational database from a CSV file.
- Unix shell – jq and csvkit are both command-line tools, and the Unix shell is the place where I spend a lot of time working with data. Without getting into a Windows vs. *nix war, there is simply no better collection of utilities for working with text files than those that can be accessed via the shell. Tools like curl, grep, sed, awk, cut and a host of others are enormously useful on their own, or in combination with tools like jq and csvkit.
- CartoDB – pretty much the easiest way to create a web-based map from an open data set. There’s even an API for building apps on top of the data you have in your CartoDB account. Enough said.
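
If you want to see the shape of the JSON-to-CSV and CSV-to-database steps mentioned in the jq and csvkit items without installing anything, here is a rough sketch using only the Python standard library. It is not jq or csvsql, just the same two transformations written out by hand. The sample records, the field names (`name`, `address`, `zip`), and the `pawn_shops` table name are all made up for illustration, and every column is loaded as text.

```python
import csv
import io
import json
import sqlite3

# Made-up placeholder records for illustration; not real licensing data.
SAMPLE_JSON = """
[
  {"name": "Example Pawn A", "address": "100 Sample St", "zip": "00001"},
  {"name": "Example Pawn B", "address": "200 Sample Ave", "zip": "00002"}
]
"""

FIELDS = ["name", "address", "zip"]  # hypothetical field names


def json_to_csv(json_text, fields):
    """Flatten a JSON array of objects into CSV text with a header row."""
    records = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        # Missing keys become empty cells instead of raising an error.
        writer.writerow({field: record.get(field, "") for field in fields})
    return buffer.getvalue()


def csv_to_sqlite(csv_text, table, connection):
    """Create a table from the CSV header and insert every data row as text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    connection.execute(f'CREATE TABLE "{table}" ({columns})')
    connection.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)
    connection.commit()
    return len(data)


if __name__ == "__main__":
    csv_text = json_to_csv(SAMPLE_JSON, FIELDS)
    print(csv_text)
    conn = sqlite3.connect(":memory:")
    print(csv_to_sqlite(csv_text, "pawn_shops", conn), "rows loaded")
```

In practice the command-line tools do this in a line or two and handle the awkward cases (type inference, nested JSON, odd quoting) far better, which is exactly why they are on this list.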
Note that my background is in software development, so the list of favorites above probably reflects my own professional biases. Someone who works primarily as a data scientist might have a completely different list of favorite tools.
What’s your favorite tool for working with data?