The right country codes are ISO alpha-3
If you’re storing countries in a database, you should use codes of some kind. Country names might seem okay, but they:
- vary by language (Germany, Allemagne, Deutschland),
- have synonyms even within one language (United States, United States of America),
- can be ambiguous (Korea, China, Sudan, Congo) or confused with subnational regions (Ireland, Macedonia), and
- are not permanent, and change more than you might expect under circumstances violent (Cambodia, Myanmar) or peaceful (Czechia, Eswatini, North Macedonia).
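As a rough sketch of what storing codes buys you, the snippet below folds several spellings of a name into one alpha-3 code. The `ALIASES` table and the `to_alpha3` helper are hypothetical names made up for illustration; a real system would more likely lean on a maintained dataset than a hand-written dictionary.

```python
from typing import Optional

# Hypothetical alias table: the spellings come from the examples above,
# and the values are the ISO 3166-1 alpha-3 codes for those countries.
ALIASES = {
    "germany": "DEU",
    "allemagne": "DEU",
    "deutschland": "DEU",
    "united states": "USA",
    "united states of america": "USA",
}

def to_alpha3(name: str) -> Optional[str]:
    """Return the alpha-3 code for a known spelling, or None if unrecognized."""
    # Case-fold and collapse whitespace so trivial variations still match.
    key = " ".join(name.casefold().split())
    return ALIASES.get(key)

rows = ["Germany", "Deutschland", "  United  States ", "Korea"]
print([to_alpha3(r) for r in rows])  # ['DEU', 'DEU', 'USA', None]
```

The ambiguous "Korea" is deliberately left unmapped, so it comes back as `None` rather than being guessed at.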
There are many standardized code sets (R’s countrycode package lists around 30), but only two are commonly used: the 2- and 3-letter ISO codes.
Of these, you should always use the 3 letter codes (technically ISO 3166-1 alpha-3).
There are only 26 English letters, so the space of 2-letter combinations has only 26 × 26 = 676 codes. With ~250 allocated, it’s crowded, so there are a lot of potential collisions. It’s even more crowded than that, because these are not randomly assigned codes: they’re meant to sound like the country name. So in practice a code like XQ isn’t actually useful (in fact it’s part of a region reserved for user assignment).
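To put numbers on the crowding, here is a small sketch comparing how full the code space is for two letters versus three. `ALLOCATED` is the rough ~250 figure used above rather than an exact count, and `occupancy` is just an illustrative helper name.

```python
ALLOCATED = 250  # rough figure from the text, not an exact count

def occupancy(length: int) -> float:
    """Fraction of all possible codes of this length that are in use."""
    return ALLOCATED / 26 ** length

for length in (2, 3):
    space = 26 ** length
    print(f"{length} letters: {space} possible codes, {occupancy(length):.1%} in use")
# 2 letters: 676 possible codes, 37.0% in use
# 3 letters: 17576 possible codes, 1.4% in use
```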
Three letters, on the other hand, give you 26 × 26 × 26 = 17,576 possibilities, so this is a much more sparsely populated space, and therefore has more redundancy and is more robust to errors. That gives two concrete advantages:
3-letter codes are easier for humans to guess at
You might think that country codes should always be translated into country names for presentation, but in practice that’s not always the case. To take one example, domain names show untranslated not-quite-2-letter-ISO country codes to end users (admittedly, sometimes divorced from the country semantics).
Two letters is not really enough to unambiguously establish a country name. Take these examples:
- BD is Bangladesh, not to be confused with Burundi (BI), which itself shouldn’t be confused with
- CA is Canada, not Cameroon (CM), which narrowly avoids Cambodia (KH) for historical reasons (vs
- FI is Finland, not Fiji (FJ).
- Ukraine is UA (what? why is A the second choice here?) not UK, even though the official code of the United Kingdom is GB.
- UAE skirts the whole mess by using AE.
Even when codes are unambiguous, two letters are often insufficient to easily bring a country name to mind:
- AO is not so obviously Angola as AGO.
- Because of how it’s pronounced (in English, at least), GE does not bring to mind Georgia, whereas GEO at least has a fighting chance.
- Two-letter codes are usually a substring of the three-letter code, but in Ireland’s case that’s not true. And IRL is far more obvious than IE.
Of course even with 3-letter codes, it’s hard to remember that IND is India and IDN is Indonesia, or that Australia is AUS and Austria AUT, or that ZMB is Zambia and not Zimbabwe (ZWE).
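One rough way to see how much closer together the 2-letter codes sit is to count pairs that differ in a single letter. The sketch below does that for a handful of countries mentioned in this post. The `SAMPLE` list and the `one_letter_apart` helper are illustrative assumptions, and a small hand-picked sample like this only hints at the trend rather than measuring it.

```python
from itertools import combinations

# (alpha-2, alpha-3) ISO 3166-1 codes for some countries mentioned in this post.
SAMPLE = [
    ("BD", "BGD"), ("BI", "BDI"), ("CA", "CAN"), ("CM", "CMR"),
    ("KH", "KHM"), ("FI", "FIN"), ("FJ", "FJI"), ("UA", "UKR"),
    ("IN", "IND"), ("ID", "IDN"), ("AU", "AUS"), ("AT", "AUT"),
    ("ZM", "ZMB"), ("ZW", "ZWE"),
]

def one_letter_apart(codes):
    """Return pairs of equal-length codes that differ in exactly one position."""
    pairs = []
    for a, b in combinations(sorted(codes), 2):
        if len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1:
            pairs.append((a, b))
    return pairs

alpha2_pairs = one_letter_apart(a2 for a2, _ in SAMPLE)
alpha3_pairs = one_letter_apart(a3 for _, a3 in SAMPLE)

print(len(alpha2_pairs), "close alpha-2 pairs, e.g.", alpha2_pairs[:3])
print(len(alpha3_pairs), "close alpha-3 pairs:", alpha3_pairs)
```

In this sample the only 3-letter pair one letter apart is AUS/AUT, while the 2-letter codes produce many such pairs. IND/IDN does not show up because it is a transposition, which changes two positions at once.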
3-letter codes are harder for machines to misinterpret
For the same reasons, two letter codes are more likely to collide with other non-country identifiers. There are some well-known examples:
- In R, Namibia (NA) might get interpreted as “not available.”
- In YAML, Norway (NO) will get interpreted as “false.”
Obviously an appropriate data format will avoid these issues, but sometimes it’s hard to control how data will go out into the world, and three letter codes are just that bit more robust.
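If you want to check a list of codes for this kind of collision before it leaves your hands, a minimal sketch might look like the following. The token sets are my assumption of the usual suspects (the unquoted boolean and null spellings from YAML 1.1, and R’s default missing-value string), and `collisions` is a hypothetical helper, not part of any library.

```python
# Unquoted scalars that YAML 1.1 resolves to booleans or null.
YAML11_SPECIAL = {
    "y", "Y", "yes", "Yes", "YES", "n", "N", "no", "No", "NO",
    "true", "True", "TRUE", "false", "False", "FALSE",
    "on", "On", "ON", "off", "Off", "OFF",
    "~", "null", "Null", "NULL",
}
# The string R's read.table/read.csv treat as missing by default.
R_MISSING = {"NA"}

def collisions(codes):
    """Map each risky code to the reasons it could be misread."""
    found = {}
    for code in codes:
        reasons = []
        if code in YAML11_SPECIAL:
            reasons.append("YAML 1.1 boolean/null")
        if code in R_MISSING:
            reasons.append("R missing value")
        if reasons:
            found[code] = reasons
    return found

print(collisions(["NA", "NO", "NAM", "NOR"]))
# {'NA': ['R missing value'], 'NO': ['YAML 1.1 boolean/null']}
```

The 3-letter equivalents, NAM and NOR, pass through untouched.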
My rule of thumb for disambiguating is that the more populous country gets the more obvious prefix-style code (e.g. INDia), while the less populous country gets some other rule, “first letter of each syllable” or prefix-suffix (e.g. InDoNesia). This works for many examples, e.g. IRQ. But probably not all.