Roll-your-own geocoding with OpenStreetMap Nominatim on Amazon EC2
Sometimes you need to geocode a few addresses, and while Google is obviously the gold standard, the Google Maps API conditions are quite strict - you are supposed only to geocode addresses you will be displaying in conjunction with a Google map. That’s no use for bulk / backend geocoding, the kind you might do for analysis purposes.
It’s not as accurate, but the OpenStreetMap project has a fairly serviceable global geocoder subproject called Nominatim. You can read the API docs here. Note that official OSM Nominatim site also has a fairly restrictive usage policy (summed up as ‘No heavy uses’ but effectively no parallel requests). The next step up is to use the Mapquest instance of Nominatim. It seems like you can in principle be a heavy user of this service, but in that case they have to approve your request. I didn’t try this so I don’t know what their terms or restrictions might be. In any case geocoding across the internet to a public server incurs a degree of latency, which may not be desirable if you have a really large number of addresses to code.
In my case, I have potentially millions of addresses. What’s more many of them are extremely low-quality, and Nominatim does not handle low quality addresses well at all in my experience (unlike - sigh - Google). To deal with this I use a pre-coding stage to attempt to guess the 5-10 most likely variants of the address data to geocode. But that means I’m geocoding 10 million+ addresses, which might stretch even Mapquest’s generous free service.
The great advantage of OSM, of course, is that’s an open project so you can, if you wish, or if you need to, replicate the entire thing on your own hardware. Which if course, means Amazon’s hardware.
I was slightly surprised to find there is no pre-existing AMI image with OSM/Nominatim pre-installed, so I had to install from scratch. The instructions are quite complete, but not specific to Amazon Linux and the EC2 environment, so I had to do quite a bit of adapting and trial-and-error to get things to work. All in all it took about 7 days runtime, which on the EC2 machine I used (r3.2xlarge - $0.70/hr) cost about $120. Whether that’s a small or large upfront cost depends on the project, but once you’ve done that you can geocode to your heart’s content for 70 cents per hour. In fact after installation I actually downgraded my instance to r3.xlarge ($0.35/hr) with no performance degradation, so there’s probably scope to do this even more cheaply.
Anyway, in case you want to try this yourself, I kept a reasonably complete (but probably not perfect, let me know any corrections) log of the install, which you can find in this gist: