Command Line Data Science

When it comes to deriving useful results about the operation of government from open data sets, we have an enormous array of tools at our disposal that we can make use of. Often, we do not need sophisticated or expensive tools to produce useful results.

In this post, I want to use command line tools that are available on most laptops, and others that can be downloaded for free, to derive meaningful insights from a real government open data set. The following examples will leverage *nix-based tools like tail, grep, sort, uniq and sed as well as open source tools that can be invoked from the command line like csvkit and MySQL.

The data used in this post is from the NY State Open Data Portal for traffic tickets issued in New York State from 2008 – 2012.

I chose to download select portions of the data to my laptop in part so that I could play around with command line utilities, but also because I found the tools and options available on the data portal a bit clunky and difficult to use. This dataset is fairly large – around 18M rows – and manipulating it on the portal often seemed to take a long time (not really sure why). In addition, I felt like the data itself was so useful and compelling, I didn’t really need anything more than what I already had available to me in my own data toolkit.

I’ve packaged up all of the data used in the following examples and saved them to a public Dropbox folder – you can get the data files here, or simply download them directly from the NY Open Data Portal. (Note – the structure of the samples files in the Dropbox folder differs slightly from those that can be downloaded from the portal. For convenience, I separated records for specific years into separate files and removed the column titled Violation Year.)

With that in mind, if you decide to follow these examples you should read the metadata and make sure you understand the limitations of the data. Not all tickets – specifically for some violations in NYC – are included in this data set. Once we understand what the data includes, we can start to ask (and hopefully answer) some interesting questions.

Q1: Are Men Better Drivers Than Women?

There is a long held stereotype that men are better drivers than women.

The data on traffic tickets can help us to find out if there is any factual basis for such a claim, and once we have the data in hand we can use simple command line tools to start to work with it.

First, let’s count up the number of traffic tickets issued in NY State in 2012. To do this, we’ll use tail and wc.

~$ tail +2 violations2012.csv | wc -l
 3524692

By using the +2 with tail, we’re telling it to output the contents of the entire file starting on the second line in the file (to avoid counting the header row). We then pipe the output into wc and use the -l flag to indicate that we want to count the number of lines. Since each line in the file represents a ticket issued, we can see that there were more than 3.5 million tickets issued in NY State in 2012.

Now let’s repeat the process for 2011.

~$ tail +2 violations2011.csv | wc -l
 3643061

Interestingly, it appears that there were slightly fewer tickets issued in 2012 than in 2011 across the entire state (about 3.2% fewer in 2012).

Now let’s count up the number of tickets issued based on the gender of the person ticketed. To do this, we’ll add some additional tools. We’ll use csvcut to separate out the 7th column in each line outputted by tail (which holds the value for gender), and then we’ll use sort and uniq to total the number of unique values. When we use the -c flag with uniq it will tally up the counts for each unique value that we pipe to it.

~$ tail +2 violations2012 | csvcut -c 7 | sort | uniq -c

17090 C
962450 F
2542324 M
2828 U

In this case, we can see that in addition to a value of M for males and F for females we have C and U. According to the data dictionary from the NY Open Data Portal, a value of C indicates that the ticketed entity was an organization and a U indicates that the gender was unknown or not collected. Even if we assumed that all of the C and U rows were attributable to female drivers (which the most likely aren’t) we can see that the number of tickets issued to male drivers far outpaced females in 2012.

According to the NY State DMV, in 2012 female drivers accounted for about 49 percent of all licensed drivers in the state. Our results show that female drivers accounted for only about 27% of all tickets issued in 2012, while men accounted for a whopping 73% during the same time period.

One issue with our data that we may want to correct before finalizing this observation is that it includes non-NY State drivers. We know what the proportion of men and women drivers is in NY State, but not for the population that received tickets in NY State in 2012 (which includes drivers from a large number of different states and territories). To find out how many of those drivers ticketed in 2012 were licensed in NY State, we need to introduce a new tool – grep.

$ tail +2 violations2012.csv | csvcut -c 6 | grep -c "NEW YORK"

2902325

By following our previous example and using csvcut to extract the appropriate column (the one containing the state in which the driver is licensed), we can specify that we only want to count those rows where the value is equal to “NEW YORK”. We do this using the -c flag of grep, which works much like the one we used for uniq above.

Now we can combine both approaches and count up the number of tickets issued by the gender of the ticket recipient for drivers licensed in NY State.

~$ tail +2 violations2012.csv \
> | csvcut -c 6,7 \
> | grep "NEW YORK" \
> | csvcut -c 2 \
> | sort \
> | uniq -c

 620 C
829695 F
2070547 M
1463 U

Note that we use csvcut twice in this example. We first use it to extract the columns that contain the gender of the recipient and the state in which they are licensed (we do this by using the -c flag and indicating the two columns we are interested in). We then pipe the output of that operation to grep to select only those that have “NEW YORK” and then again pass it to csvcut to single out the column holding the value for gender.

Our results don’t change all that much, and we can conclude with some confidence – if getting ticketed for a traffic violation is a good proxy for driving skill – that women are much better drivers than men.

Now it’s time to delve deeper into this data.

Q2: Which Police Agencies Issue the Most Tickets?

To find out which police agencies issue the most tickets – and, maybe more interestingly, how the number of tickets issued changed from 2011 to 2012 – we’ll introduce a relational database to the mix

For the following examples I’ll use MySQL and a very simple import script to bring the data from the CSV files we have been working with into a database. But you can use any database tool that you want to do something similar. If you don’t have MySQL installed, you can download it here.

First, let’s create the MySQL database and import our CSV files.

~$ mysql -u username < path/to/violations.sql

Next, we log into the MySQL CLI and start using the new violations database that we just created.

~$ mysql -u username
mysql> use violations;

Let’s sum up the number of tickets issued by police agency in 2012 for the top 10 agencies.

mysql> SELECT Police_Agency AS 'Agency', SUM(1) AS 'Number' FROM _2012 
    -> GROUP BY Police_Agency 
    -> ORDER BY Number DESC 
    -> LIMIT 10;
+----------------------------+---------+
| Agency                     | Number  |
+----------------------------+---------+
| NYC PD                     | 1041037 |
| COUNTY OF SUFFOLK  PD      |  177551 |
| COUNTY OF NASSAU  PD       |  148919 |
| CITY OF ROCHESTER PD       |   33765 |
| FARMINGDALE STATE POLICE   |   32842 |
| COUNTY OF SUFFOLK  SHERIFF |   31972 |
| MIDDLETOWN STATE POLICE    |   31563 |
| COUNTY OF MONROE SHERIFF   |   30799 |
| ROCHESTER STATE POLICE     |   30354 |
| CITY OF SYRACUSE PD        |   30247 |
+----------------------------+---------+
10 rows in set (3.00 sec)

These results aren’t all that surprising – in terms of raw numbers, we would expect to see the NYC police, larger downstate counties and the big cities upstate in the top 10 list. We can also do the same thing for 2011.

mysql> SELECT Police_Agency AS 'Agency', SUM(1) AS 'Number' FROM _2011 
    -> GROUP BY Police_Agency 
    -> ORDER BY Number DESC 
    -> LIMIT 10;
+-------------------------------+---------+
| Agency                        | Number  |
+-------------------------------+---------+
| NYC PD                        | 1060588 |
| COUNTY OF SUFFOLK  PD         |  161276 |
| COUNTY OF NASSAU  PD          |  146046 |
| CITY OF ROCHESTER PD          |   34443 |
| MIDDLETOWN STATE POLICE       |   33702 |
| CITY OF SYRACUSE PD           |   33415 |
| TRIBOROUGH BRIDGE TUNNEL AUTH |   32214 |
| COUNTY OF SUFFOLK  SHERIFF    |   31788 |
| FARMINGDALE STATE POLICE      |   31187 |
| COUNTY OF MONROE SHERIFF      |   30386 |
+-------------------------------+---------+
10 rows in set (3.19 sec)

Interesting to note that the City of Syracuse PD has seen pretty dramatic drop off in tickets from 2011 to 2012 (a drop of almost 10 percent). But what I really want to do is to look at all police agencies and compare 2012 to the previous year to see how the number of tickets issued has changed for each.

The most convenient tool to make this kind of comparison and to sort results would probably be a spreadsheet (Excel or Google Docs). To import into either of those, we’ll need a CSV file with the total number of tickets issued by each police agency for both years.

To generate this CSV file, we can leverage MySQL by passing in a simple SQL query to the MySQL CLI and then use the sed tool to filter the results. Specifically, when the MySQL CLI outputs the results they will be tab delimited. We want to replace the tabs with commas and then direct the results to an output file. We use the following command to generate the results for 2012.

echo "use violations; select Police_Agency as 'Agency', SUM(1) as '2012 Number' from _2012 group by Agency;" \
| mysql -u username \
| sed 's/	/,/g' \
> agency_counts_1.csv

And we repeat the process for 2011 results.

echo "use violations; select Police_Agency as 'Agency', SUM(1) as '2011 Number' from _2011 group by Agency;" \
| mysql -u username \
| sed 's/	/,/g' \
> agency_counts_2.csv

Next we’ll use csvjoin to combine the files and match up the rows holding results for police agencies from 2012 and 2011 into a file called agency_counts.csv.

csvjoin -c "Agency,Agency" agency_counts_1.csv agency_counts_2.csv > agency_counts.csv

We can now open this file in Excel or Google Docs and refine it a bit.

When we joined the two CSV files above, it created a superfluous column called Agency – we can remove this column, calculate the difference in tickets issued between years, sort them and we get results that look like this.

Interpreting the Results

By sorting this spreadsheet on the # Difference column we can see some interesting things. Overall, tickets issued in NY State went down in 2012 over 2011, but for some police agencies the number of tickets issued went up dramatically. In Western New York, both of the Buffalo area police agencies saw dramatic increases in the number of tickets issued. On the eastern side of the state, the Albany PD saw a dramatic increase in tickets, and Downstate the suffolk County PD saw a healthy increase.

What caused these results? Was there something significant in the type of traffic violations issued by these agencies? Was there something important about the time of year (or the time of day) that they were issued? To find out, we’ll need to do more analysis, which is beyond the scope of this post.

What I hoped to show is that some of the most powerful data tools are widely available, easy to use command line utilities. When it comes to deriving useful results about the operation of government from data sets, we have an enormous array of tools at our disposal to use. Often, we do not need sophisticated or expensive tools to produce useful results.

We just need good data.

2 responses to “Command Line Data Science”

Jeroen Janssens (@jeroenhjanssens)

February 5, 2015 at 10:33 am

Great post and nice use of open data. Instead of manually importing the data into MySQL, you can also use `csvsql` [1], which comes with `csvkit`. As the name may imply, this tool allows you to apply SQL queries directly on CSV files.

[1] http://csvkit.readthedocs.org/en/latest/scripts/csvsql.html
Lauren Ancona (@laurenancona)

February 8, 2015 at 4:28 pm

This a great tutorial. I’d like to point out one inherent caveat in this type of data: it is representative of *enforcement* only — useful perhaps for determining trends or bias in ticket issuance, but not technically indicative of driver quality, etc.

Civic Innovations

Leave a comment