Optimisation in data science: processing billions of GPS coordinates

This post is now many months old. I didn’t publish it at the time because it just got way too long, and to paraphrase Mark Twain, I didn’t have time to make it shorter. The other reservation I had, is that it’s pretty hard to explain why I didn’t (1) use a routing library like Valhalla, or (2) calculate road topology. I speculate about this a little at the end, but in retrospect I think it was mostly a kind of “greedy optimization”: I started with existing code, at each point made a small tweak, and ended up in a local, but not global, optimum.

Still, it may be interesting to some people anyway, so with some free time on my hands, I’ve made a few edits and hit publish.

I spent a part of this week helping some geospatial colleagues process GPS data: without being too specific, they have daily log data for around 1.5 million commercial vehicles in a large country, with each observation spaced a couple of seconds apart. Needless to say, this is a lot of data: around 1TB of data per day.

The project is exploratory, so at this point they were just interested in a “toy” application: traffic counts for Open Street Map ways (roads, roughly). As is pretty typical in the life of a data scientist, they had a prototype analysis running, but there was no way it would scale to the volume of data they had. Initially they came to me to see if I could help them run it on our internal cluster.

Usually when people have speed & scaling issues, my advice is to take their current code and run it on as powerful a machine as possible (faster CPU, more memory and – if their underlying code can take advantage – more cores). This point cannot be retweeted enough: