Roll-your-own geocoding with OpenStreetMap Nominatim on Amazon EC2

Sometimes you need to geocode a few addresses, and while Google is obviously the gold standard, the Google Maps API conditions are quite strict – you are supposed only to geocode addresses you will be displaying in conjunction with a Google map. That’s no use for bulk / backend geocoding, the kind you might do for analysis purposes.

It’s not as accurate, but the OpenStreetMap project has a fairly serviceable global geocoder subproject called Nominatim. You can read the API docs here. Note that the official OSM Nominatim site also has a fairly restrictive usage policy (summed up as ‘No heavy uses’ but effectively no parallel requests). The next step up is to use the Mapquest instance of Nominatim. It seems like you can in principle be a heavy user of this service, but in that case they have to approve your request. I didn’t try this so I don’t know what their terms or restrictions might be. In any case, geocoding across the internet to a public server incurs a degree of latency, which may not be desirable if you have a really large number of addresses to code.

In my case, I have potentially millions of addresses. What’s more, many of them are extremely low-quality, and Nominatim does not handle low-quality addresses well at all in my experience (unlike – sigh – Google). To deal with this I use a pre-coding stage to attempt to guess the 5-10 most likely variants of the address data to geocode. But that means I’m geocoding 10 million+ addresses, which might stretch even Mapquest’s generous free service.

The great advantage of OSM, of course, is that it’s an open project, so you can, if you wish or need to, replicate the entire thing on your own hardware. Which, of course, means Amazon’s hardware.

I was slightly surprised to find there is no pre-existing AMI with OSM/Nominatim pre-installed, so I had to install from scratch. The instructions are quite complete, but not specific to Amazon Linux and the EC2 environment, so I had to do quite a bit of adapting and trial-and-error to get things to work. All in all it took about 7 days of runtime, which on the EC2 machine I used (r3.2xlarge – $0.70/hr) cost about $120. Whether that’s a small or large upfront cost depends on the project, but once you’ve done that you can geocode to your heart’s content for 70 cents per hour. In fact after installation I actually downgraded my instance to r3.xlarge ($0.35/hr) with no performance degradation, so there’s probably scope to do this even more cheaply.
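
The cost figures here are just hours multiplied by the hourly rate. As a rough sketch (the helper name `ec2_cost` is made up for illustration, the rates are the on-demand prices quoted above, and EBS storage and data transfer charges are ignored):

```shell
#!/bin/sh
# ec2_cost DAYS HOURLY_RATE: print the cost in dollars of running one
# instance continuously for DAYS days at HOURLY_RATE dollars per hour.
# (Hypothetical helper; instance charges only, no EBS or bandwidth.)
ec2_cost() {
    awk -v days="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", days * 24 * rate }'
}

ec2_cost 7 0.70    # the install on r3.2xlarge
ec2_cost 30 0.35   # a month of geocoding on r3.xlarge
```

The first call prints 117.60, which is where the "about $120" comes from; the second prints 252.00.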

Anyway, in case you want to try this yourself, I kept a reasonably complete (but probably not perfect, let me know any corrections) log of the install, which you can find in this gist:

Installing Nominatim on Amazon Linux / EC2

  1. Introduction

The official instructions for installing Nominatim are complete, but brief in places, and several steps must be changed in the Amazon Linux environment (which is roughly CentOS / Redhat). The steps below are a rough record of what I did to get it working, but I didn’t keep perfect track so you shouldn’t rely on them as a shell script. Just follow each step, make sure it worked, and hopefully you’ll need to adapt very little (version numbers, for one thing). (I also skip in and out of root, but you can be more careful if you like.)

  1. Setting up the EC2 instance

There’s plenty of information on setting up Amazon EC2 instances elsewhere. I chose an r3.2xlarge machine (61 GB memory, 8 vCPUs, $0.70/hour), based on several-year-old suggestions that you need at least 32 GB of memory for the install, and the assumption that OSM has grown since then. I attached 2 x 750GB EBS volumes (as /dev/sd[f,g], eventually in RAID0 striping), again based on old size estimates plus an allowance for growth. The root volume can be relatively small, but you might want to allow, say, 10-20GB for source data files. Or you can store them on the large EBS volumes, as I ended up having to because I left the root at the default 8GB.

The total install time was reasonable, but I’m sure this isn’t an optimal configuration. I considered other storage options like provisioned IOPS EBS and instance-attached storage, which may have sped up disk-bound tasks, but decided to stick with plain vanilla EBS in the end.

Log in to your running EC2 instance. You may want to invoke screen or equivalent so nothing quits if you get disconnected, as several commands will run for days.

  1. Setting up disk storage

These commands will construct a RAID0 striping volume over the two EBS volumes, create the filesystem and arrange for it to be mounted at boot time as /vol.

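sudo su
mdadm --create --verbose /dev/md0 --level=stripe --raid-devices=2 /dev/sdf /dev/sdg
mkfs.ext4 /dev/md0
mkdir /vol
# Mount the array created above
mount -t ext4 /dev/md0 /vol
cp /etc/fstab /etc/fstab.orig
# Note: after a reboot the array may come up under another name (the df output
# further down shows /dev/md127); if so, adjust this entry to match
echo "/dev/md0    /vol    ext4  defaults,nofail   0  2" >> /etc/fstab
mount -a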

Reference: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/raid-config.html
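
One thing to watch: if you re-run the `echo ... >> /etc/fstab` line you end up with duplicate entries. A small sketch of a guard against that, assuming an fstab-style file where the second field is the mount point (`add_fstab_entry` is a hypothetical helper of my own, not part of any tool):

```shell
#!/bin/sh
# add_fstab_entry FSTAB DEVICE MOUNTPOINT: append an ext4 entry for
# MOUNTPOINT to the file FSTAB unless one is already listed, so the step
# can be repeated safely. (Hypothetical helper for illustration.)
add_fstab_entry() {
    fstab=$1 device=$2 mountpoint=$3
    # Field 2 of an fstab line is the mount point; comment lines are skipped.
    if awk -v mp="$mountpoint" '$1 !~ /^#/ && $2 == mp { found = 1 } END { exit !found }' "$fstab"; then
        echo "$mountpoint already listed in $fstab, leaving it alone"
    else
        printf '%s    %s    ext4  defaults,nofail   0  2\n' "$device" "$mountpoint" >> "$fstab"
    fi
}

# Dry run against a scratch file rather than the real /etc/fstab.
scratch=$(mktemp)
add_fstab_entry "$scratch" /dev/md0 /vol
add_fstab_entry "$scratch" /dev/md0 /vol   # second call changes nothing
cat "$scratch"
rm -f "$scratch"
```

To use it for real you would run it as root with `/etc/fstab` as the first argument, after taking the backup copy.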

  1. Install postgres + postgis

The standard repository packages won’t work, so you’ll need to get other packages and compile some things from source. It’s pretty straightforward though.

Edit /etc/yum.repos.d/amzn-main.repo and add the following line to the block [amzn-main]:

exclude=postgresql*

Then install postgres.

cd ~/
wget http://yum.postgresql.org/9.3/redhat/rhel-6-x86_64/pgdg-redhat93-9.3-1.noarch.rpm
rpm -ivh pgdg-redhat93-9.3-1.noarch.rpm 
yum install postgresql93 postgresql93-server postgresql93-devel postgresql93-contrib

# I just symlinked the existing data directory to my mounted volume
rm -r /var/lib/pgsql/9.3/data/
ln -s /vol /var/lib/pgsql/9.3/data

# Set filepermissions to postgres user
chown postgres:postgres /vol
chmod 700 /vol

# Initialize the db
service postgresql-9.3 initdb

# Start the service
service postgresql-9.3 start

# exit from su
exit 

And install postgis and dependencies.

sudo yum install gcc make gcc-c++ libtool libxml2-devel libpng libtiff

cd ~/

# Download GEOS and install
wget http://download.osgeo.org/geos/geos-3.4.2.tar.bz2
tar xjf geos-3.4.2.tar.bz2 
cd geos-3.4.2
./configure 
make
sudo make install 

# Download Proj.4 and install
cd ~/
wget http://download.osgeo.org/proj/proj-4.8.0.tar.gz
tar xzf proj-4.8.0.tar.gz
cd proj-4.8.0

### If you don't want python bindings
./configure

### Or if you do want python bindings (which you need to import US street number data)
sudo yum install python26-devel.x86_64
./configure --with-python

make
sudo make install

# Download and install GDAL
cd ~/
wget http://download.osgeo.org/gdal/1.10.1/gdal-1.10.1.tar.gz
tar -xvzf gdal-1.10.1.tar.gz
cd gdal-1.10.1
./configure 
make
sudo make install

# Download and install JSON-C library
cd ~/
wget https://s3.amazonaws.com/json-c_releases/releases/json-c-0.11.tar.gz
tar -xvzf json-c-0.11.tar.gz
cd json-c-0.11
./configure
make
sudo make install

# Download and install PostGIS 
cd ~/
wget http://download.osgeo.org/postgis/source/postgis-2.1.2.tar.gz
tar -xvzf postgis-2.1.2.tar.gz
cd postgis-2.1.2
./configure --with-pgconfig=/usr/pgsql-9.3/bin/pg_config --with-geosconfig=/usr/local/bin/geos-config --with-gdalconfig=/usr/local/bin/gdal-config
make
sudo make install

# update your libraries
sudo su
echo /usr/local/lib >> /etc/ld.so.conf
ldconfig

Reference: http://overtronic.com/2013/12/how-to-install-postgresql-with-postgis-on-amazon-ec2-linux/

  1. Nominatim dependencies

yum --enablerepo=epel install git make automake gcc gcc-c++ libtool
yum --enablerepo=epel install php-pgsql php php-pear php-pear-DB libpqxx-devel 
yum --enablerepo=epel install bzip2-devel libxml2-devel protobuf-c-devel lua-devel
#  These were installed from source above: proj-devel geos-devel proj-epsg

  1. Postgres config for install

Edit /vol/postgresql.conf and make the following changes. Here I just took the examples from the official install guide and increased them a bit to reflect the larger memory size.

shared_buffers (4GB)
maintenance_work_mem (16GB/10GB)
work_mem (50MB)
effective_cache_size (24GB)
synchronous_commit = off
checkpoint_segments = 100
checkpoint_timeout = 10min
checkpoint_completion_target = 0.9

For the initial import, I also set:

fsync = off
full_page_writes = off

The benefit of this last one seems less certain, but I also had no problems with:

autovacuum = off

These last three changes will be reverted after installation per the official instructions.
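
Since these settings get flipped twice (off for the import, back on afterwards), it can be handy to script the edit. A minimal sketch, assuming GNU sed (as on Amazon Linux) and simple `key = value` lines; `set_pg_conf` is a hypothetical helper, not a Postgres tool:

```shell
#!/bin/sh
# set_pg_conf FILE KEY VALUE: set "KEY = VALUE" in a postgresql.conf-style
# file. An existing line for KEY (commented out or not) is rewritten in
# place; otherwise a new line is appended. (Hypothetical helper; assumes
# GNU sed and values without "|" or "&" in them.)
set_pg_conf() {
    conf=$1 key=$2 value=$3
    if grep -Eq "^[#[:space:]]*${key}[[:space:]]*=" "$conf"; then
        sed -i -E "s|^[#[:space:]]*${key}[[:space:]]*=.*|${key} = ${value}|" "$conf"
    else
        printf '%s = %s\n' "$key" "$value" >> "$conf"
    fi
}

# Demonstration on a scratch file instead of the real /vol/postgresql.conf.
scratch=$(mktemp)
printf '#fsync = on\nautovacuum = on\n' > "$scratch"
for kv in fsync=off full_page_writes=off autovacuum=off; do
    set_pg_conf "$scratch" "${kv%%=*}" "${kv#*=}"
done
cat "$scratch"
rm -f "$scratch"
```

Pointed at the real config file (as a user who can write to it), the same loop with `on` instead of `off` undoes the change later. Postgres needs a restart to pick up the new values.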

  1. Nominatim main installation

As ec2-user:

cd ~/
wget http://www.nominatim.org/release/Nominatim-2.3.0.tar.bz2
tar xvf Nominatim-2.3.0.tar.bz2

cd Nominatim-2.3.0
./configure --with-postgresql=/usr/pgsql-9.3/bin/pg_config
make

Edit settings/local.php and copy in the following:

<?php
 // Paths
 @define('CONST_Postgresql_Version', '9.3');
 @define('CONST_Postgis_Version', '2.1');
 @define('CONST_Path_Postgresql_Contrib', '/usr/pgsql-9.3/share/contrib');

Download some optional data (I wanted everything):

wget --output-document=data/wikipedia_article.sql.bin http://www.nominatim.org/data/wikipedia_article.sql.bin
wget --output-document=data/wikipedia_redirect.sql.bin http://www.nominatim.org/data/wikipedia_redirect.sql.bin
wget --output-document=data/gb_postcode_data.sql.gz http://www.nominatim.org/data/gb_postcode_data.sql.gz

cd /
sudo -u postgres createuser -s ec2-user
createuser -SDR www-data

cd ~/
chmod +x ~
chmod +x ~/Nominatim-2.3.0
chmod +x ~/Nominatim-2.3.0/module

sudo su
mkdir /vol/planet
chown ec2-user /vol/planet
chmod a+rx /vol

# exit from su, the import below runs as ec2-user
exit

Then you can do a test run with a small country:

### Test Luxembourg
wget --output-document=/vol/planet/luxembourg-latest.osm.pbf http://download.geofabrik.de/europe/luxembourg-latest.osm.pbf
cd ~/Nominatim-2.3.0
./utils/setup.php --osm-file /vol/planet/luxembourg-latest.osm.pbf --all --osm2pgsql-cache 18000 2>&1 | tee setup.log

# If all is good, then start over
dropdb nominatim

Before proceeding to the full install:

wget --output-document=/vol/planet/planet-latest.osm.pbf http://download.bbbike.org/osm/planet/planet-latest.osm.pbf
wget --output-document=/vol/planet/planet-latest.osm.pbf.md5 http://download.bbbike.org/osm/planet/planet-latest.osm.pbf.md5

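# Check the md5 checksum to ensure we downloaded ok
# (run from /vol/planet, since md5sum looks for the listed file relative to the current directory)
(cd /vol/planet && md5sum --check planet-latest.osm.pbf.md5)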

Warning: this next command takes days.

time ./utils/setup.php --osm-file /vol/planet/planet-latest.osm.pbf --all --osm2pgsql-cache 18000 2>&1 | tee setup.log

On my EC2 configuration, it took just under 5 days of wall-clock time. Rank 28 and Rank 30 indexing took the longest.

real    6997m2.441s
user    409m31.204s
sys     88m19.924s
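
To turn the `real` figure into something more readable, a quick sketch using plain shell arithmetic (the function name `minutes_to_dhm` is made up for illustration):

```shell
#!/bin/sh
# minutes_to_dhm MINUTES: print a whole number of minutes as days, hours
# and minutes. (Hypothetical helper.)
minutes_to_dhm() {
    total=$1
    printf '%dd %dh %dm\n' $((total / 1440)) $((total % 1440 / 60)) $((total % 60))
}

minutes_to_dhm 6997   # the "real" time above, dropping the seconds
```

This prints 4d 20h 37m.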

At the end of this, 800GB was used in total across my RAID0 volume.

Then you can install the extras:

# Add special phrases
./utils/specialphrases.php --countries > specialphrases_countries.sql
psql -d nominatim -f specialphrases_countries.sql

./utils/specialphrases.php --wiki-import > specialphrases.sql
psql -d nominatim -f specialphrases.sql

And set up the website:

# Set up website
sudo mkdir -m 755 /var/www/nominatim
sudo chown nginx /var/www/nominatim
./utils/setup.php --create-website /var/www/nominatim

Edit settings/local.php and add/edit:

@define('CONST_Website_BaseURL', '/nominatim/');

I used nginx as the HTTP server:

sudo yum install nginx
sudo yum install php-fpm

psql -d nominatim -c 'ALTER USER "www-data" RENAME TO "nginx"'

As root, edit /etc/php-fpm.d/www.conf to include:

; Comment out the tcp listener and add the unix socket
;listen = 127.0.0.1:9000
listen = /var/run/php5-fpm.sock
; Ensure that the daemon runs as the correct user
listen.owner = nginx
listen.group = nginx
listen.mode = 0666

As root, edit /etc/nginx/nginx.conf:

# Edit to include, within the http { ... server { ... } } block that is defined
index   index.html index.htm index.php;

#root         /usr/share/nginx/html;
root         /var/www;
#location / {
#}
location ~ [^/]\.php(/|$) {
    fastcgi_split_path_info ^(.+?\.php)(/.*)$;
    if (!-f $document_root$fastcgi_script_name) {
        return 404;
    }
    fastcgi_pass unix:/var/run/php5-fpm.sock;
    fastcgi_index index.php;
    # Note this next line is super important or you'll get empty responses with no error message!
    include fastcgi.conf;
}

And then hopefully you can run:

sudo /etc/init.d/php-fpm start
sudo /etc/init.d/nginx start

At this point, all going well, you should be able to connect to http://yourhost/nominatim and see the OSM Nominatim web page.
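
Beyond loading the page in a browser, it is worth trying a query from the command line. Here is a sketch that builds a search URL by hand; it assumes the site is served under /nominatim/ as configured above and uses the `search.php` endpoint with the `q` and `format=json` parameters. Both function names are my own inventions.

```shell
#!/bin/sh
# urlencode STRING: percent-encode every byte except the unreserved
# characters A-Z a-z 0-9 - . _ ~ (works byte by byte, so UTF-8 is fine).
urlencode() {
    printf '%s' "$1" | od -An -v -tx1 | tr -s ' ' '\n' | while read -r hex; do
        [ -n "$hex" ] || continue
        case "$hex" in
            2d|2e|5f|7e|3[0-9]|4[1-9a-f]|5[0-9a]|6[1-9a-f]|7[0-9a])
                # Unreserved byte: print it unchanged via its octal escape.
                printf "\\$(printf '%03o' "0x$hex")" ;;
            *)
                # Anything else becomes %XX with upper-case hex digits.
                printf '%%%s' "$hex" | tr 'a-f' 'A-F' ;;
        esac
    done
}

# nominatim_search_url BASE QUERY: print a JSON search URL for a Nominatim
# site rooted at BASE. (Hypothetical helper.)
nominatim_search_url() {
    printf '%s/search.php?format=json&q=%s\n' "${1%/}" "$(urlencode "$2")"
}

url=$(nominatim_search_url http://localhost/nominatim "10 Downing Street, London")
echo "$url"
# Once nginx and php-fpm are running, fetch it with:
#   curl -s "$url"
```

An empty JSON list (`[]`) in the reply means the server is up but found no match; an empty response body usually points back at the fastcgi include mentioned above.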

  1. TIGER files for US street numbers (optional)

This apparently helps Nominatim geocode street numbers more accurately. Unlike the other options above, this takes a substantial amount of time and space to run.

cd ~/Nominatim-2.3.0
mkdir -p data/TIGER2013/EDGES

# raw files are about 10GB, but will eventually expand to quite a bit more in SQL statements
# (the URL is quoted so that wget, not the shell, expands the wildcard)
wget -P data/TIGER2013/EDGES 'ftp://ftp2.census.gov/geo/tiger/TIGER2013/EDGES/*'

# These next two steps took 24 hours together
./utils/imports.php --parse-tiger-2011 data/TIGER2013/EDGES/
./utils/setup.php --import-tiger-data

psql -d nominatim -c 'GRANT SELECT ON location_property_tiger TO "nginx"'

At this stage df looked like this, and my generous 1.5TB of EBS was looking like a good choice.

Filesystem      1K-blocks       Used Available Use% Mounted on
/dev/xvda1        8123812    4931312   3092252  62% /
devtmpfs         15701944         72  15701872   1% /dev
tmpfs            15710020          0  15710020   0% /dev/shm
/dev/md127     1548045540 1180848836 288537172  81% /vol
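
If you want to keep an eye on this during the long imports, the Use% column is easy to pull out of `df -P`. A small sketch (the helper `mount_use_pct` is hypothetical), demonstrated on the /vol line from the listing above rather than a live system:

```shell
#!/bin/sh
# mount_use_pct MOUNTPOINT: read `df -P` output on stdin and print the
# Use% figure, without the % sign, for MOUNTPOINT. (Hypothetical helper.)
mount_use_pct() {
    awk -v mp="$1" 'NR > 1 && $NF == mp { sub(/%$/, "", $(NF-1)); print $(NF-1) }'
}

# On the server itself this would be:  df -P /vol | mount_use_pct /vol
# Here the header and /vol line from the listing are replayed instead.
printf '%s\n' \
    'Filesystem      1K-blocks       Used Available Use% Mounted on' \
    '/dev/md127     1548045540 1180848836 288537172  81% /vol' |
    mount_use_pct /vol
```

This prints 81; wrapping it in a loop with `sleep` and a threshold check would give a crude disk-space alarm.
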

  1. Post install configuration

Revert some of the changes in /vol/postgresql.conf:

fsync = on
full_page_writes = on
autovacuum = on

And then run:

# Postgres gets upset otherwise
sudo chmod go-rx /vol

sudo chkconfig --add postgresql-9.3
sudo chkconfig --add php-fpm
sudo chkconfig --add nginx

sudo service php-fpm start
sudo service postgresql-9.3 start
sudo service nginx start

At this point you should be good to go. I haven’t set up automatic updating, so if you proceed with that you’ll have to follow the official Nominatim guide and adapt as necessary.
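
For the bulk geocoding that motivated all this, a simple loop over a file of addresses is enough to get started. The sketch below is only one way to do it and makes several assumptions: the instance is reachable at http://localhost/nominatim, the input has one address per line, and curl is installed (its `-G` and `--data-urlencode` options handle the URL encoding). The function names are my own.

```shell
#!/bin/sh
# Base URL of the local instance. An assumption; override via the environment.
NOMINATIM_BASE=${NOMINATIM_BASE:-http://localhost/nominatim}

# geocode_one ADDRESS: query search.php for a single address and print the
# raw JSON reply on one line.
geocode_one() {
    curl -sS -G "$NOMINATIM_BASE/search.php" \
        --data-urlencode "q=$1" \
        --data-urlencode "format=json" | tr -d '\n'
}

# geocode_file FILE: geocode every non-empty line of FILE and print one
# "address<TAB>json" line per address.
geocode_file() {
    while IFS= read -r address; do
        [ -n "$address" ] || continue
        printf '%s\t%s\n' "$address" "$(geocode_one "$address")"
    done < "$1"
}
```

Usage would be something like `geocode_file addresses.txt > results.tsv`, where `addresses.txt` is a placeholder for your own input. This runs one request at a time; since it is your own server, there is nothing stopping you from splitting the input and running several copies in parallel.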


16 comments

  1. Hi, This is amazing, but before I try to follow your directions I have a question. You did not happen to create an EC2 image from this did you? It would be fantastic to just boot this up and run it at cost, rather than having to reinstall.

    1. Unfortunately I don’t have one handy any more – but it’s a good idea. The other thing worth remembering is if you don’t need worldwide geocoding you can use a part of the OSM map and it will build much quicker.

  2. Andrew,
    Is there a reason you install postgresql 9.3? I have run into difficulty installing 9.3 on amazon-linux (it looks like there is confusion if RHEL6 or 7 repos should be used) and am contemplating installing 9.4 but don’t want to run into issues down the road.

      1. I will try and dig up their recommendations but wanted to leave a note here for anyone trying to implement these instructions:
        starting with 9.3 PostgreSQL has added a repo for amazon-linux; find the build that you want here: http://yum.postgresql.org/repopackages.php

        It’s also worth noting that changing the repo can change the future values (ex: postgresql-9.3 changed to postgresql93 in my case).

  3. Hi,

    fantastic tutorial, well done.

    For those that don’t want to go through all the effort though may I suggest the OpenCage Geocoder: https://geocoder.opencagedata.com which also uses OpenStreetMap data (along with other open data sources) via nominatim.

    We provide a simple, well-documented API for forward and reverse geocoding. There are libraries for all common programming languages, and a free testing level or affordable paid plans if you need more. We do a bit more than nominatim in that we provide well-formatted addresses and annotations (things like timezone, etc)

    But we love nominatim too, we regularly contribute new features and improvements.

  4. I have a problem importing Luxembourg into my PostgreSQL…

    My OS is:
    CentOS release 6.7 (Final)

    My version of Postgres:
    PostgreSQL 9.3.11 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16), 64-bit

    Version of Nominatim:
    nominatim 2.3.0

    Problem:

    When I try to import a small .pbf, I get an error executing the command “createdb -E UTF-8 -p 5432 nominatim”; when it tries to create the database, it says that database “nominatim” already exists…

    bash-4.1$ ./utils/setup.php --osm-file ../luxembourg-latest.osm.pbf --all --osm2pgsql-cache 1024
    Create DB
    createdb: database creation failed: ERROR:  database "nominatim" already exists
    ERROR: Error executing external command: createdb -E UTF-8 -p 5432 nominatim
    Error executing external command: createdb -E UTF-8 -p 5432 nominatim
    

    I’m desperate, I don’t know what to do.
    Please, could someone explain…

  5. What’s the process to bulk geo-code using Nominatim? Can this only be achieved by firing a bunch of parallel requests?

  6. Have you created an AMI from your original installation and noticed it has performance issues unlike the original? I get errors when doing large generic searches like wal mart which I do not get on the original server.

  7. Hi, Great post, it is very useful!
    Having trouble finding Postgres 9.3, I moved to 9.6 and to PostGIS 2.3.
    Now the map is loading properly, but a URL like {domain}/nominatim/reverse.php?format=json&lat=29.9&lon=-90.1&debug=1 gives: ERROR: relation “placex” does not exist.
    This was an issue several years ago, when osm2pgsql was not updated.
    Updating it did not help. Do you have any idea what else could be the reason?
