The right country codes are ISO alpha-3

If you’re storing countries in a database, you should use codes of some kind. Country names might seem okay, but they:

There are many standardized code sets—R’s countrycode package lists around 30—but only two are commonly used: 2 and 3 letter ISO codes.

Of these, you should always use the 3 letter codes (technically ISO 3166-1 alpha-3).

Excerpt of ISO 3166-1 alpha-3 from Wikipedia
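As a rough sketch of what storing alpha-3 codes might look like, here is an in-memory SQLite table with a shape check on the code column. The table and column names (`customer`, `country_code`) and the helper functions are made up for illustration, and the constraint only checks that the value is three uppercase ASCII letters, not that it is an allocated ISO code.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical customer table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE customer (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            -- Shape check only: exactly three uppercase ASCII letters.
            country_code TEXT NOT NULL
                CHECK (country_code GLOB '[A-Z][A-Z][A-Z]')
        )
        """
    )
    return conn


def try_insert(conn: sqlite3.Connection, name: str, code: str) -> bool:
    """Return True if the row was accepted, False if the CHECK rejected it."""
    try:
        conn.execute(
            "INSERT INTO customer (name, country_code) VALUES (?, ?)",
            (name, code),
        )
        return True
    except sqlite3.IntegrityError:
        return False


conn = make_db()
print(try_insert(conn, "Asha", "IND"))  # three uppercase letters: accepted
print(try_insert(conn, "Asha", "IN"))   # two letters: rejected by the CHECK
```

A stricter design would probably use a foreign key to a reference table of allocated codes; the check above only catches values of the wrong shape.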

Why

There are only 26 English letters, so the space of 2 letter combinations has only 26 × 26 = 676 codes. With ~250 allocated, it’s crowded, so there are a lot of potential collisions. It’s even more crowded than that, because these are not randomly assigned codes; they’re meant to sound like the country name. So in practice a code like XQ isn’t actually useful (in fact it’s part of a region reserved for user assignment).
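To make the crowding concrete, this small sketch computes the size of the code space for 2 and 3 letter codes and the share that would be taken up by roughly 250 allocations. The figure of 250 is the approximate count used in the text, not an exact tally, and `code_space` is just an illustrative helper.

```python
from string import ascii_uppercase

ALLOCATED = 250  # approximate number of allocated codes, as in the text


def code_space(length: int) -> int:
    """Number of distinct codes of the given length over A-Z."""
    return len(ascii_uppercase) ** length


for length in (2, 3):
    total = code_space(length)
    share = ALLOCATED / total
    print(f"{length}-letter codes: {total} possible, {share:.1%} allocated")
```

With these numbers, a bit over a third of the 2 letter space is in use, against well under 2% of the 3 letter space.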

Three letters, on the other hand, give you 26 × 26 × 26 = 17,576 possibilities, so this is a much more sparsely populated space, and therefore has more redundancy and is more robust to errors. That gives two concrete advantages:

3-letter codes are easier for humans to guess at

You might think that country codes should always be translated into country names for presentation, but in practice that’s not always the case. To take one example, domain names show untranslated not-quite-2-letter-ISO country codes to end users (admittedly, sometimes divorced from the country semantics).

Two letters is not really enough to unambiguously establish a country name. Take these examples:

Even when codes are unambiguous, two letters are often insufficient to easily bring a country name to mind:

Of course even with 3-letter codes, it’s hard to remember that IND is India and IDN is Indonesia, or that Australia is AUS and Austria AUT, or that ZMB is Zambia and not Zimbabwe (ZWE).¹

3-letter codes are harder for machines to misinterpret

For the same reasons, two letter codes are more likely to collide with other non-country identifiers. There are some well-known examples:

Obviously an appropriate data format will avoid these issues, but sometimes it’s hard to control how data will go out into the world, and three letter codes are just that bit more robust.
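One commonly cited kind of collision is with the null-like or boolean-like tokens that loosely typed parsers interpret specially. The sketch below checks the 2 and 3 letter codes for Namibia and Norway against such a token set. The set itself is an illustrative assumption and does not reproduce any particular tool's list.

```python
# Tokens that some loosely typed parsers may treat as null or boolean.
# Illustrative assumption only, not the list used by any specific tool.
SPECIAL_TOKENS = {"NA", "NAN", "NULL", "NO", "YES", "ON", "OFF", "TRUE", "FALSE"}

# (alpha-2, alpha-3) codes for two countries.
COUNTRIES = {
    "Namibia": ("NA", "NAM"),
    "Norway": ("NO", "NOR"),
}


def collides(code: str) -> bool:
    """True if the code would be mistaken for one of the special tokens."""
    return code.upper() in SPECIAL_TOKENS


for name, (alpha2, alpha3) in COUNTRIES.items():
    print(f"{name}: {alpha2} collides={collides(alpha2)}, "
          f"{alpha3} collides={collides(alpha3)}")
```

Under this assumed token set, both 2 letter codes collide while neither 3 letter code does.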

  1. My rule of thumb for disambiguating is that the more populous country gets the more obvious prefix-style code (e.g. INDia), while the less populous country gets some other rule, such as “first letter of each syllable” or prefix-suffix (e.g. InDoNesia). This works for many examples, e.g. IRN vs IRQ, but probably not all.

Comments (1)

Hey Andrew!

Came upon this article when searching for the preferred way to set country codes.

Your reasoning is strong, and very valid. However, how would you handle a database that has both country codes, and currency codes?

Currency codes are 3-letter, and the first 2 letters come from the ISO alpha-2 country code. In this case, I think it would be better to store country codes in alpha-2, so you can easily associate with the corresponding currency.

What do you think?

Gonçalo Dias
