Roll-your-own geocoding with OpenStreetMap Nominatim on Amazon EC2

Sometimes you need to geocode a few addresses, and while Google is obviously the gold standard, the Google Maps API conditions are quite strict: you are only supposed to geocode addresses that you will be displaying in conjunction with a Google map. That’s no use for bulk / backend geocoding, the kind you might do for analysis purposes.

It’s not as accurate, but the OpenStreetMap project has a fairly serviceable global geocoder subproject called Nominatim. You can read the API docs here. Note that the official OSM Nominatim site also has a fairly restrictive usage policy (summed up as ‘No heavy uses’, which in practice means no parallel requests). The next step up is to use the Mapquest instance of Nominatim. It seems that you can in principle be a heavy user of this service, but in that case they have to approve your request. I didn’t try this, so I don’t know what their terms or restrictions might be. In any case, geocoding across the internet against a public server incurs a degree of latency, which may not be desirable if you have a really large number of addresses to code.
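For illustration, a free-form lookup against a Nominatim server might look something like the sketch below. It is a minimal example rather than anything from my install log: the `User-Agent` string and the function names are made up, the base URL is a placeholder you would swap for whichever instance you are using, and the exact path can differ between versions (older self-hosted installs expose `search.php`, for example). Requests are sent one at a time with a pause in between, in keeping with the ‘no parallel requests’ spirit of the public usage policy.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumption: base URL of the Nominatim instance to query. Swap in your own server.
BASE_URL = "https://nominatim.openstreetmap.org"


def build_search_url(address, base_url=BASE_URL, limit=1):
    """Return a Nominatim search URL for a free-form address query."""
    params = urllib.parse.urlencode({"q": address, "format": "json", "limit": limit})
    return f"{base_url}/search?{params}"


def geocode(address, base_url=BASE_URL, user_agent="my-geocoding-test/0.1"):
    """Return (lat, lon) as floats for the best match, or None if nothing was found."""
    request = urllib.request.Request(
        build_search_url(address, base_url),
        headers={"User-Agent": user_agent},  # hypothetical identifier for your application
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        results = json.load(response)
    if not results:
        return None
    # Nominatim returns coordinates as strings in its JSON output.
    return float(results[0]["lat"]), float(results[0]["lon"])


def geocode_all(addresses, delay_seconds=1.0, **kwargs):
    """Geocode addresses sequentially, pausing between requests."""
    coded = {}
    for address in addresses:
        coded[address] = geocode(address, **kwargs)
        time.sleep(delay_seconds)
    return coded
```

Calling `geocode_all(["10 Downing Street, London"])` would return a dictionary mapping each address to a coordinate pair or `None`. Against your own server you could set `delay_seconds=0`, since the rate limit only exists to protect the shared public instance.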

In my case, I have potentially millions of addresses. What’s more, many of them are extremely low-quality, and in my experience Nominatim does not handle low-quality addresses well at all (unlike – sigh – Google). To deal with this I use a pre-coding stage that attempts to guess the 5-10 most likely variants of each address to geocode. But that means I’m geocoding 10 million+ addresses, which might stretch even Mapquest’s generous free service.
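To give a flavour of what such a pre-coding stage could look like, here is a deliberately simple sketch. It is not the actual logic I use: the function names and the cleaning rules (collapsing whitespace, dropping unit numbers, falling back to coarser address parts) are assumptions chosen purely for illustration.

```python
import re

# Hypothetical pattern for unit designators such as "Flat 2" or "Apt 4B".
UNIT_PATTERN = re.compile(r"\b(?:apt|unit|flat|suite)\.?\s*#?\d+\w*\b", re.IGNORECASE)


def address_variants(raw, max_variants=10):
    """Return candidate query strings for one messy address, most specific first."""
    cleaned = re.sub(r"\s+", " ", raw).strip(" ,")
    parts = [p.strip() for p in cleaned.split(",") if p.strip()]
    candidates = [", ".join(parts)]

    # Variant without unit designators, which often confuse the geocoder.
    no_unit = [UNIT_PATTERN.sub("", p).strip() for p in parts]
    no_unit = [p for p in no_unit if p]
    candidates.append(", ".join(no_unit))

    # Progressively drop leading components, keeping at least two parts.
    for start in range(1, len(no_unit) - 1):
        candidates.append(", ".join(no_unit[start:]))

    # Remove empty strings and duplicates while preserving order.
    unique = []
    for candidate in candidates:
        if candidate and candidate not in unique:
            unique.append(candidate)
    return unique[:max_variants]


def first_match(raw, geocode):
    """Try each variant with the supplied geocode callable; return the first hit."""
    for variant in address_variants(raw):
        result = geocode(variant)
        if result is not None:
            return variant, result
    return None
```

Here `geocode` is any callable that takes a query string and returns a result or `None`, so the same loop works against a public server or your own. Each raw address can cost several lookups, which is how a few million addresses turn into 10 million+ requests.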

The great advantage of OSM, of course, is that it’s an open project, so you can, if you wish or if you need to, replicate the entire thing on your own hardware. Which of course means Amazon’s hardware.

I was slightly surprised to find there is no pre-existing AMI with OSM/Nominatim pre-installed, so I had to install from scratch. The instructions are quite complete, but not specific to Amazon Linux and the EC2 environment, so I had to do quite a bit of adapting and trial-and-error to get things to work. All in all it took about 7 days of runtime, which on the EC2 machine I used (r3.2xlarge – $0.70/hr) cost about $120. Whether that’s a small or large upfront cost depends on the project, but once you’ve done that you can geocode to your heart’s content for 70 cents per hour. In fact, after installation I downgraded my instance to r3.xlarge ($0.35/hr) with no performance degradation, so there’s probably scope to do this even more cheaply.
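The arithmetic behind those figures is simple enough to check. The snippet below uses the hourly prices quoted above; the 30-day, always-on month for the running cost is an assumption added for illustration.

```python
INSTALL_DAYS = 7
R3_2XLARGE_PER_HOUR = 0.70  # price quoted above for the install instance
R3_XLARGE_PER_HOUR = 0.35   # price quoted above for the downgraded instance

install_hours = INSTALL_DAYS * 24
install_cost = install_hours * R3_2XLARGE_PER_HOUR

# Assumption: the downgraded instance runs non-stop for a 30-day month.
monthly_running_cost = 30 * 24 * R3_XLARGE_PER_HOUR

print(f"Install: {install_hours} h -> ${install_cost:.2f}")              # Install: 168 h -> $117.60
print(f"r3.xlarge for 30 days: ${monthly_running_cost:.2f}")             # r3.xlarge for 30 days: $252.00
```

So the "about $120" is 168 hours at $0.70, and stopping the instance whenever you are not geocoding is the obvious way to keep the ongoing bill down.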

Anyway, in case you want to try this yourself, I kept a reasonably complete (but probably not perfect, let me know any corrections) log of the install, which you can find in this gist:

15 comments

  1. Hi, This is amazing, but before I try to follow your directions I have a question. You did not happen to create an EC2 image from this did you? It would be fantastic to just boot this up and run it at cost, rather than having to reinstall.

    1. Unfortunately I don’t have one handy any more – but it’s a good idea. The other thing worth remembering is if you don’t need worldwide geocoding you can use a part of the OSM map and it will build much quicker.

  2. Andrew,
    Is there a reason you install postgresql 9.3? I have run into difficulty installing 9.3 on amazon-linux (it looks like there is confusion if RHEL6 or 7 repos should be used) and am contemplating installing 9.4 but don’t want to run into issues down the road.

      1. I will try and dig up their recommendations but wanted to leave a note here for anyone trying to implement these instructions:
        starting with 9.3 PostgreSQL has added a repo for amazon-linux; find the build that you want here: http://yum.postgresql.org/repopackages.php

        It’s also worth noting that changing the repo can change the future values (ex: postgresql-9.3 changed to postgresql93 in my case).

  3. Hi,

    fantastic tutorial, well done.

    For those that don’t want to go through all the effort though may I suggest the OpenCage Geocoder: https://geocoder.opencagedata.com which also uses OpenStreetMap data (along with other open data sources) via nominatim.

    We provide a simple, well-documented API for forward and reverse geocoding. There are libraries for all common programming languages, and a free testing level or affordable paid plans if you need more. We do a bit more than nominatim in that we provide well-formatted addresses and annotations (things like timezone, etc)

    But we love nominatim too, we regularly contribute new features and improvements.

  4. I have a problem importing Luxembourg into my PostgreSQL…

    My OS is:
    CentOS release 6.7 (Final)

    My version of Postgres:
    PostgreSQL 9.3.11 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16), 64-bit

    Version of Nominatim:
    nominatim 2.3.0

    Problem:

    When I try to import a small .pbf, I get an error executing the command “createdb -E UTF-8 -p 5432 nominatim”: when it creates the database, it says that the database “nominatim” already exists…

    bash-4.1$ ./utils/setup.php --osm-file ../luxembourg-latest.osm.pbf --all --osm2pgsql-cache 1024
    Create DB
    createdb: database creation failed: ERROR:  database "nominatim" already exists
    ERROR: Error executing external command: createdb -E UTF-8 -p 5432 nominatim
    Error executing external command: createdb -E UTF-8 -p 5432 nominatim
    

    I’m desperate, I don’t know what to do.
    Please, can someone explain this to me…

  5. What’s the process to bulk geo-code using Nominatim? Can this only be achieved by firing a bunch of parallel requests?

  6. Have you created an AMI from your original installation and noticed it has performance issues unlike the original? I get errors when doing large generic searches like wal mart which I do not get on the original server.

  7. Hi, Great post, it is very useful!
    Having had trouble finding Postgres 9.3, I moved to 9.6 and to PostGIS 2.3.
    Now the map is loading properly, but a URL like {domain}/nominatim/reverse.php?format=json&lat=29.9&lon=-90.1&debug=1 gives: ERROR: relation “placex” does not exist.
    This was an issue several years ago, when osm2pgsql was not updated.
    Updating it did not help. Do you have any idea what else could be the reason?
