How the west was really won: by manipulative data visualization.

One of the most well-known historical theories about America is the “frontier thesis,” advanced by Frederick Jackson Turner in 1893. It states that the long existence of a “frontier,” a zone between the settled and unsettled regions of the United States, is responsible for distinctive aspects of the American character: independence, self-reliance, egalitarianism, a certain disdain for high culture and learning, etc. The United States was, at independence, a fairly narrow strip of land east of the Mississippi, but the frontier was pushed continually west until eventually it reached the Pacific and vanished altogether.

You can see this movement in a sequence of population density maps, based on decennial census results, published in the 1898 Statistical atlas of the United States. Here are the maps from 1790 to 1860, when they didn’t even bother showing the west:

[Image: population density maps, 1790-1860]

(more…)

Optimisation in data science: processing billions of GPS coordinates

This post is now many months old. I didn’t publish it at the time because it just got way too long, and to paraphrase Mark Twain, I didn’t have time to make it shorter. The other reservation I had is that it’s pretty hard to explain why I didn’t (1) use a routing library like Valhalla, or (2) calculate road topology. I speculate about this a little at the end, but in retrospect I think it was mostly a kind of “greedy optimization”: I started with existing code, made a small tweak at each point, and ended up in a local, but not global, optimum.

Still, it may be interesting to some people anyway, so with some free time on my hands, I’ve made a few edits and hit publish.

I spent part of this week helping some geospatial colleagues process GPS data: without being too specific, they have daily log data for around 1.5 million commercial vehicles in a large country, with each observation spaced a couple of seconds apart. Needless to say, this is a lot of data: around 1TB per day.

The project is exploratory, so at this point they were just interested in a “toy” application: traffic counts for OpenStreetMap ways (roads, roughly). As is pretty typical in the life of a data scientist, they had a prototype analysis running, but there was no way it would scale to the volume of data they had. Initially they came to me to see if I could help them run it on our internal cluster.
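To give a rough sense of what that kind of toy analysis involves, here is a minimal sketch, not their actual pipeline: it assumes you already have the way geometries in a GeoDataFrame, uses geopandas to snap each GPS point to its nearest way, and all the ids and coordinates are made up for illustration.

    import geopandas as gpd
    from shapely.geometry import LineString, Point

    # Toy "ways": two road segments with made-up OSM-style ids.
    ways = gpd.GeoDataFrame(
        {"way_id": [101, 102]},
        geometry=[LineString([(0, 0), (100, 0)]), LineString([(0, 50), (100, 50)])],
    )

    # Toy GPS observations, in the same planar coordinate system as the ways.
    points = gpd.GeoDataFrame(
        geometry=[Point(10, 1), Point(20, -2), Point(30, 49), Point(40, 52)]
    )

    # Snap each point to the nearest way within 10 units, then count points per way.
    matched = gpd.sjoin_nearest(points, ways, how="inner", max_distance=10)
    traffic_counts = matched.groupby("way_id").size()
    print(traffic_counts)

At real scale you would also want a projected CRS, a spatial index, and proper map-matching rather than a nearest-way snap, which is where routing libraries like Valhalla come in.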

Usually when people have speed & scaling issues, my advice is to take their current code and run it on as powerful a machine as possible (faster CPU, more memory and – if their underlying code can take advantage – more cores). This point cannot be retweeted enough:

(more…)

What is big data?

People often ask me

What is big data?

The answer I usually give is

Any data too large to process using your normal tools & techniques.

That’s a very context-dependent answer: last week, in one case it meant one million records, too big for Stata on a laptop, and in another it meant a dataset growing at about 1TB per day. To put it another way, big data is anything where you have to think about the engineering side of data science: where you can’t just open up R and run lm(), because that would take a day and need a terabyte of memory.
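To make that engineering side concrete, here is a minimal sketch of one common workaround, with a made-up file name and column names: instead of handing lm() the whole dataset, stream it in chunks and accumulate the sufficient statistics (X'X and X'y) for the same least-squares fit.

    import numpy as np
    import pandas as pd

    features = ["x1", "x2"]  # hypothetical predictor columns
    target = "y"             # hypothetical response column
    k = len(features) + 1    # +1 for the intercept

    XtX = np.zeros((k, k))
    Xty = np.zeros(k)

    # Read the file a million rows at a time so memory use stays flat.
    for chunk in pd.read_csv("huge_dataset.csv", chunksize=1_000_000):
        X = np.column_stack([np.ones(len(chunk)), chunk[features].to_numpy()])
        y = chunk[target].to_numpy()
        XtX += X.T @ X
        Xty += X.T @ y

    # Solve the normal equations: the same coefficients lm(y ~ x1 + x2) would give.
    beta = np.linalg.solve(XtX, Xty)
    print(dict(zip(["intercept"] + features, beta)))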

(more…)