Thoughts on: Code and Data for the Social Sciences

Via my departmental mailing list, I came across this rather useful guide by Matthew Gentzkow and Jesse Shapiro at Chicago: Code and Data for the Social Sciences: A Practitioner’s Guide. They write:

Though we all write code for a living, few of the economists, political scientists, psychologists, sociologists, or other empirical researchers we know have any formal training in computer science.

Social science and computer science is a great combination—it happens to be my background. And since one of my current projects is looking at what new-fangled ‘data scientists’ do, I thought I would assemble here some additional ‘lessons for economists from software engineering’ to complement Gentzkow and Shapiro’s.

1. Tools

Rule: Be tool agnostic. Corollary: Excel is not evil.

Economists tend to be too wedded to particular tools (like Stata), maybe because most undergrad and grad courses teach only one or two econometrics packages. (In the graduate classes I taught, I used Stata, OxMetrics and R, but I’m pretty sure that’s unusual.)

Software engineers and computer scientists, on the other hand, tend to have more of a hacker mindset: use whatever tool will get the job done. If that’s Java or C++ then great; but if an assortment of Unix command-line utilities (sort, uniq, grep, awk, sed, wc…) will work, even better.

Of course, the downside is that if you use lots of different tools, it’s hard to be an expert in any of them. As a computer scientist I’ve used, at various times: C, C++, Java, Basic, COBOL, Python, Perl, Haskell, Prolog, Ruby and various assembly languages. As an economist I’ve used Stata, OxMetrics, SPSS, R and Matlab. Whenever I start a project, I have to remind myself of both the syntax and the libraries of whatever I’m using. But as with foreign languages, the more you know, the easier this relearning is.

The corollary to this is that Excel is not evil. Despite the gasps you’ll hear when you mention linest. Despite Reinhart and Rogoff. Excel is a superb way to prototype analysis. People will occasionally mutter something vague about ‘numerical issues’ with Excel, and if you’re doing space science or particle physics that’s fair. But in econ? Where agreeing to one significant figure is great, and even finding an expected sign is a result? I think you’ll be ok.

2. Checks and tests

Rule: test everything.

Good software engineers test everything. They incorporate sanity checks and assertions directly into their code. This is because they have little control over the inputs to their software, and it needs to work right, or fail sensibly, regardless. The situation is somewhat different in empirical economics, because your code will probably only run against one dataset, and you can control that. But the testing mentality should still transfer, where appropriate. Exactly what you can test will vary by project; simple examples include:

validate the number of data rows after importing, and after any other operation that shouldn’t (but might) change it
likewise, validate totals etc. after reshaping
(in a spreadsheet) include total rows and columns, and make sure they agree (cough Reinhart cough Rogoff)
check the signs of important variables
check bounds: make sure proportions are in [0, 1], Likert responses are in 1…5, etc.
use simple methods to check more complex ones (Your fancy nonparametric estimator looks totally different from OLS? Might want to investigate that.)
check that the moments of the data (mean, variance, skewness) make sense

And if these tests fail, make sure to output a message and exit, instead of failing silently. In fact it is probably better to avoid ‘fail silent’ operations altogether (this is particularly a problem with, say, missing values).
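A few of the checks above can be sketched in plain Python. This is only an illustration, not a recipe: the column names (income, share_employed, satisfaction), the plausible range for mean income, and the function name check_data are all made up for the example, and a real project would pick checks to suit its own data.

```python
import math
import statistics


def check_data(rows, expected_rows):
    """Raise ValueError with a clear message if any sanity check fails.

    `rows` is a list of dicts; the keys used below are hypothetical.
    """
    # Row count: an import or merge should not silently add or drop rows.
    if len(rows) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(rows)}")

    for i, row in enumerate(rows):
        # Missing values: stop with a message rather than skip them silently.
        for name, value in row.items():
            if value is None or (isinstance(value, float) and math.isnan(value)):
                raise ValueError(f"row {i}: missing value in '{name}'")
        # Bounds: proportions in [0, 1], Likert responses in 1..5.
        if not 0.0 <= row["share_employed"] <= 1.0:
            raise ValueError(f"row {i}: share_employed outside [0, 1]")
        if row["satisfaction"] not in {1, 2, 3, 4, 5}:
            raise ValueError(f"row {i}: satisfaction outside 1..5")
        # Sign: income is assumed to be non-negative in this made-up dataset.
        if row["income"] < 0:
            raise ValueError(f"row {i}: negative income")

    # Moments: a crude plausibility check on the mean (range is an assumption).
    mean_income = statistics.mean([row["income"] for row in rows])
    if not 1_000 <= mean_income <= 1_000_000:
        raise ValueError(f"mean income {mean_income:.0f} looks implausible")
    return True


sample = [
    {"income": 30000.0, "share_employed": 0.6, "satisfaction": 4},
    {"income": 45000.0, "share_employed": 0.8, "satisfaction": 2},
]
check_data(sample, expected_rows=2)
```

Raising an exception, rather than printing a warning and carrying on, is what makes the script stop loudly when a check fails.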

3. Automation

Modified rule: Automate judiciously.

Gentzkow and Shapiro suggest you ‘automate everything that can be automated.’ But I’ve tried that, and it’s not always the right thing, so I would nuance this advice a little. There’s an upfront cost to automating, particularly when it comes to outputs like fancy tables and charts. How big is that cost? How many times will you need to do this? Does the eventual labour saved outweigh the cost of automating? To be sure, err on the side of automating: you’ll end up redoing things more than you expect. But some things are just easier to do manually.

Case in point: my PhD thesis was roughly a four-year process, with numerous edits and rewrites along the way. I invested substantial upfront time in automating things, so that I could press run on my simulations and wake up the next morning to find PDF charts and LaTeX tables waiting to go. Automation made sense, right up until my examiners wanted small but complicated changes to my final table layout. At that point, with only one iteration left, I abandoned automation. Rather than spend half a day working out how to get my table exporting routines to produce this very particular layout, I just did it by hand, in about half an hour.
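The cost-benefit question can be put as back-of-envelope arithmetic. The sketch below is only that: the function name break_even_runs is made up, ‘half a day’ is taken to mean four hours (an assumption), and the automated run is assumed to cost no hands-on time at all.

```python
def break_even_runs(automation_hours, manual_hours_per_run, automated_hours_per_run=0.0):
    """Number of repetitions at which the upfront automation cost is repaid."""
    saving_per_run = manual_hours_per_run - automated_hours_per_run
    if saving_per_run <= 0:
        return float("inf")  # automating never pays for itself
    return automation_hours / saving_per_run


# Table layout change: assume 4 hours to automate versus 0.5 hours by hand.
runs_needed = break_even_runs(automation_hours=4.0, manual_hours_per_run=0.5)
print(runs_needed)  # 8.0
```

On those assumed numbers the change would have to be repeated eight times before automating it broke even, and only one iteration was left.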

4. Keys

Rule for bonus points: learn SQL and the relational model.

Gentzkow and Shapiro have a nice discussion about why you might want to put your data in more than one table. They offer some useful rules of thumb. But to really understand this stuff, it is worth learning about the relational model, the various normal forms and SQL (‘structured query language’). This is how data is stored in the real world. If you ever have to deal with administrative or customer data, knowing a bit of SQL will pay off handsomely.
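To give a flavour of what keys and multiple tables look like in practice, here is a minimal sketch using Python’s built-in sqlite3 module. The tables (county, county_year), their columns and the figures in them are invented for illustration and are not taken from the guide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this is switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- One row per county: facts that do not change from year to year.
    CREATE TABLE county (
        county_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        state     TEXT NOT NULL
    );
    -- One row per county and year; (county_id, year) is the key.
    CREATE TABLE county_year (
        county_id  INTEGER NOT NULL REFERENCES county(county_id),
        year       INTEGER NOT NULL,
        population INTEGER NOT NULL,
        PRIMARY KEY (county_id, year)
    );
""")
conn.executemany("INSERT INTO county VALUES (?, ?, ?)",
                 [(1, "Alpha", "AA"), (2, "Beta", "AA"), (3, "Gamma", "BB")])
conn.executemany("INSERT INTO county_year VALUES (?, ?, ?)",
                 [(1, 2010, 100), (2, 2010, 200), (3, 2010, 50),
                  (1, 2011, 110), (2, 2011, 210), (3, 2011, 55)])

# Join the two tables on the key and aggregate up to state-year level.
state_totals = conn.execute("""
    SELECT c.state, y.year, SUM(y.population) AS population
    FROM county_year AS y
    JOIN county AS c ON c.county_id = y.county_id
    GROUP BY c.state, y.year
    ORDER BY c.state, y.year
""").fetchall()

# The keys also act as a built-in test: a row for an unknown county is refused.
try:
    conn.execute("INSERT INTO county_year VALUES (99, 2010, 1)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

The state name is stored once per county rather than once per county-year, and the database itself rejects rows that break the key structure.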

It will also add another tool to your repertoire. In several projects recently I have used a data pipeline something like:

Python (web scraping) → SQLite database (cleaning, sorting, summarising) → R (statistical analysis on summarised data)

Relational databases are really incredibly powerful if you know how to use them.
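The middle stage of a pipeline like this might look roughly as follows. It is a sketch under stated assumptions, not the code from those projects: the scraped records are hard-coded stand-ins, and the table name raw_price and its columns are hypothetical.

```python
import csv
import io
import sqlite3

# Stand-in for scraper output: hypothetical (product, price) strings.
scraped = [
    ("  widget ", "9.50"),
    ("widget", "10.50"),
    ("gadget", "20.00"),
    ("gadget", ""),  # price was missing on the page
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_price (product TEXT, price TEXT)")
conn.executemany("INSERT INTO raw_price VALUES (?, ?)", scraped)

# Clean (trim names, drop empty prices, cast to numbers) and summarise in SQL.
summary = conn.execute("""
    SELECT TRIM(product) AS product_name,
           COUNT(*) AS n_obs,
           AVG(CAST(price AS REAL)) AS mean_price
    FROM raw_price
    WHERE price <> ''
    GROUP BY TRIM(product)
    ORDER BY product_name
""").fetchall()

# Write the summarised data as CSV text for the statistical analysis stage.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["product_name", "n_obs", "mean_price"])
writer.writerows(summary)
csv_text = out.getvalue()
```

In a real project the CSV would be written to a file and read into R with something like read.csv, so only the small summarised table ever reaches the statistics package.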

(Postscript: you may have heard about Big Data and NoSQL and Hadoop, and assumed SQL and relational databases are dead. Nothing could be further from the truth: they still rule the world.)
