The right country codes are ISO alpha-3
If you’re storing countries in a database, you should use codes of some kind. Country names might seem okay, but they:
- vary by language (Germany, Allemagne, Deutschland),
- have synonyms even within one language (United States, United States of America),
- can be ambiguous (Korea, China, Sudan, Congo) or confused with subnational regions (Ireland, Macedonia), and
- are not permanent, and change more than you might expect under circumstances violent (Cambodia, Myanmar) or peaceful (Czechia, Eswatini, North Macedonia).
There are many standardized code sets (R’s countrycode package lists around 30), but only two are commonly used: 2 and 3 letter ISO codes.
Of these, you should always use the 3 letter codes (technically ISO 3166-1 alpha-3).
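As a rough illustration of what storing codes rather than names looks like, here is a minimal sketch in Python. The table `NAME_TO_ALPHA3` and the helper `to_alpha3` are hypothetical names made up for this example, and the table is a tiny hand-picked sample, not a complete country list.

```python
from typing import Optional

# Hypothetical lookup table: a small hand-picked sample of name variants,
# all lowercased so that lookups can ignore case.
NAME_TO_ALPHA3 = {
    "germany": "DEU",
    "allemagne": "DEU",
    "deutschland": "DEU",
    "united states": "USA",
    "united states of america": "USA",
}


def to_alpha3(name: str) -> Optional[str]:
    """Return the alpha-3 code for a known name variant, or None if unknown."""
    return NAME_TO_ALPHA3.get(name.strip().casefold())


print(to_alpha3("Deutschland"))     # DEU
print(to_alpha3(" United States"))  # USA
print(to_alpha3("Atlantis"))        # None
```

The point is only that many spellings collapse to one stored value; a real system would presumably lean on a maintained dataset rather than a hand-written dictionary.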
Why
There are only 26 English letters, so the space of 2 letter combinations has only 26 × 26 = 676 codes. With ~250 allocated, it’s crowded, so there are a lot of potential collisions. It’s even more crowded than that, because these are not randomly assigned codes: they’re meant to sound like the country name. So in practice a code like XQ isn’t actually useful (in fact it’s part of a region reserved for user assignment).
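The crowding is easy to check with a few lines of Python. This is a back-of-the-envelope sketch: the figure of 250 allocated codes is the approximate number quoted above, and the helper name `occupancy` is made up for this example. It also computes the same ratio for three-letter codes for comparison.

```python
import string

LETTERS = len(string.ascii_uppercase)  # 26 letters in the English alphabet
ALLOCATED = 250  # approximate number of allocated codes, as quoted in the text


def occupancy(code_length: int, allocated: int = ALLOCATED) -> float:
    """Fraction of all possible codes of this length that are in use."""
    return allocated / LETTERS ** code_length


for n in (2, 3):
    print(n, LETTERS ** n, f"{occupancy(n):.1%}")
# 2 676 37.0%
# 3 17576 1.4%
```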
Three letters, on the other hand, gives you 26 × 26 × 26 ≈ 17,000 possibilities, so this is a much more sparsely populated space, and therefore has more redundancy and is more robust to errors. That gives two concrete advantages:
3-letter codes are easier for humans to guess at
You might think that country codes should always be translated into country names for presentation, but in practice that’s not always the case. To take one example, domain names show untranslated not-quite-2-letter-ISO country codes to end users (admittedly, sometimes divorced from the country semantics).
Two letters is not really enough to unambiguously establish a country name. Take these examples:
- BD is Bangladesh, not to be confused with Burundi (BI), which itself shouldn’t be confused with BN (Brunei) (vs BGD, BDI, BRN).
- CA is Canada, not Cameroon (CM), which narrowly avoids Cambodia (KH) for historical reasons (vs CAN, CMR, KHM).
- FI is Finland, not Fiji (FJ) (vs FIN, FJI).
- Ukraine is UA (what? why is A the second choice here?), not UK, even though the official code of the United Kingdom is GB. The UAE skirts the whole mess by using AE.
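One way to put a number on "easy to confuse" is to count pairs of codes that differ in exactly one letter. The sketch below does that for some of the countries in the list above. The names `SAMPLE`, `hamming`, and `near_misses` are made up for this example, and the sample is deliberately tiny, so treat the result as an illustration rather than a measurement over the full code list.

```python
from itertools import combinations

# Only countries named in the list above: (alpha-2, alpha-3).
SAMPLE = {
    "Bangladesh": ("BD", "BGD"),
    "Burundi": ("BI", "BDI"),
    "Brunei": ("BN", "BRN"),
    "Canada": ("CA", "CAN"),
    "Cameroon": ("CM", "CMR"),
    "Finland": ("FI", "FIN"),
    "Fiji": ("FJ", "FJI"),
}


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def near_misses(codes):
    """All pairs of codes that are a single-letter typo apart."""
    return [(a, b) for a, b in combinations(sorted(codes), 2) if hamming(a, b) == 1]


alpha2 = [two for two, _ in SAMPLE.values()]
alpha3 = [three for _, three in SAMPLE.values()]

print(len(near_misses(alpha2)), near_misses(alpha2))
print(len(near_misses(alpha3)), near_misses(alpha3))
```

In this sample the 2-letter codes produce six one-letter-apart pairs, while the 3-letter codes produce none.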
Even when codes are unambiguous, two letters are often insufficient to easily bring a country name to mind:
- AO is not so obviously Angola as AGO is.
- Because of how it’s pronounced (in English, at least), GE does not bring to mind Georgia, whereas GEO at least has a fighting chance.
- Two letter codes are usually a substring of the three letter code, but in Ireland’s case that’s not true. And IRL is far more obvious than IE.
Of course even with 3-letter codes, it’s hard to remember that IND is India and IDN is Indonesia, or that Australia is AUS and Austria AUT, or that ZMB is Zambia and not Zimbabwe (ZWE).1
3-letter codes are harder for machines to misinterpret
For the same reasons, two letter codes are more likely to collide with other non-country identifiers. There are some well-known examples:
- In R, Namibia (NA) might get interpreted as “not available.”
- In YAML, Norway (NO) will get interpreted as “false” by parsers that follow the YAML 1.1 boolean rules.
Obviously an appropriate data format will avoid these issues, but sometimes it’s hard to control how data will go out into the world, and three letter codes are just that bit more robust.
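To make the two collisions above concrete, here is a small sketch that flags codes which some tool might read as something other than a string. It does not call a real YAML or R parser. The names `YAML11_BOOLEANS`, `R_MISSING`, and `ambiguous_readings` are made up for this example. The token set is the list of boolean literals from the YAML 1.1 type definitions, and the R entry assumes the data is read with R’s default missing-value string.

```python
# Plain (unquoted) scalars that YAML 1.1 resolves to booleans.
YAML11_BOOLEANS = {
    "y", "Y", "yes", "Yes", "YES",
    "n", "N", "no", "No", "NO",
    "true", "True", "TRUE",
    "false", "False", "FALSE",
    "on", "On", "ON",
    "off", "Off", "OFF",
}

# Assumption: R's readers treat the bare string NA as missing by default.
R_MISSING = {"NA"}


def ambiguous_readings(code: str) -> list:
    """List the non-string interpretations a bare code could be given."""
    readings = []
    if code in YAML11_BOOLEANS:
        readings.append("YAML 1.1 boolean")
    if code in R_MISSING:
        readings.append("R missing value")
    return readings


for code in ("NO", "NOR", "NA", "NAM"):
    print(code, ambiguous_readings(code))
```

The 2-letter codes for Norway and Namibia are flagged, while their 3-letter counterparts NOR and NAM come back with an empty list.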
- My rule of thumb for disambiguating is that the more populous country gets the more obvious prefix-style code (e.g. INDia), while the less populous country gets some other rule, “first letter of each syllable” or prefix-suffix (e.g. InDoNesia). This works for many examples, e.g. IRN vs IRQ. But probably not all.
Comments (1)
Hey Andrew!
Came upon this article when searching for the preferred way to set country codes.
Your reasoning is strong, and very valid. However, how would you handle a database that has both country codes, and currency codes?
Currency codes are 3-letter, and the first 2 letters come from the ISO alpha-2 country code. In this case, I think it would be better to store country codes in alpha-2, so you can easily associate with the corresponding currency.
What do you think?