Fuzzy matching multiword strings & sentences in R


Some “large green plantains”, courtesy of Wikipedia user Daegis.

A colleague asked me about fuzzy matching of string data, which is a problem that can come up when linking datasets. I figured I might as well reproduce my comments here since this is such a common problem, and many of the built-in algorithms are well suited to word matching but not to multiword strings.

  • For simple words, some kind of edit distance, usually Levenshtein distance, works well. So “Andrew” and “Andre” match with a distance of 1, and you can set a threshold to accept a match. It can help to scale the Levenshtein distance by the word length, or the max of the two word lengths, since missing by one letter is a much more confident match for “Andrew”/”Andre” than for “Do”/”Dog”. (There’s a quick sketch of this after the list.)
  • For names, the “Soundex” algorithm is practically antique but works well. It renders “Maurice” and “Morris” the same, which is nice if you’re dealing with messy name data. I’m not sure how it works on non-names or non-English text – it’s probably too loose. There are a bunch of descendants of this method too. (See the second sketch after the list.)
  • For sentences, I’ve generally made something up in the past, because it starts to get quite domain-specific. Some kind of combination of Levenshtein distance and a sort of bag-of-words works quite well. For example, for sentences A and B:
    • Get rid of all punctuation and break each sentence into words
    • Consider two words a match if they have a Levenshtein distance <= 1 (stemming the words might help too)
    • Calculate how many (approximately) matching words there are between the two strings
    • The score is that count divided by max(wordcount(A), wordcount(B)) – so a higher score is better
    • So for example “Green Plantain (Large)” and “Large Plantains (Green)” would be a perfect match (1), whereas “Green Plantain” and “Large Plantain” would get 0.5.
    • I have a version of this implemented in R from an old project, sketched below. It adds a word-frequency part, so that missing a match on a “the” or an “a” isn’t nearly so bad as missing on a rare word. This is just one really simple approach – it worked for one project I worked on, but I wouldn’t pretend it generalises to every such problem.
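
First, the word-level illustration. Base R’s `adist()` computes Levenshtein distance directly; scaling by the longer word’s length (my own convention here, not the only sensible one) makes “Do”/“Dog” look like a much weaker match than “Andrew”/“Andre”, even though both pairs are one edit apart:

```r
# Plain Levenshtein distance with base R's adist(): both pairs are 1 edit apart
adist("Andrew", "Andre")  # 1
adist("Do", "Dog")        # 1

# Scale by the longer word's length to make short-word matches count for less
norm_dist <- function(a, b) adist(a, b) / pmax(nchar(a), nchar(b))
norm_dist("Andrew", "Andre")  # 1/6 ~ 0.17 - a confident match
norm_dist("Do", "Dog")        # 1/3 ~ 0.33 - much shakier
```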
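For Soundex, the stringdist package (one of several implementations, if memory serves) exposes the codes via `phonetic()`:

```r
# Soundex codes via the stringdist package - both names collapse to "M620"
library(stringdist)
phonetic(c("Maurice", "Morris"))  # "M620" "M620"
```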
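And here is a minimal sketch of the sentence-level scorer in base R. It isn’t the exact code from that old project – the function name and details are illustrative – but it follows the recipe above, with the frequency weighting exposed as an optional `word_freq` argument (a named vector of corpus counts):

```r
fuzzy_sentence_score <- function(a, b, max_word_dist = 1, word_freq = NULL) {
  # Strip punctuation, lower-case, and break each sentence into words
  tokenize <- function(s) {
    w <- unlist(strsplit(tolower(gsub("[[:punct:]]", " ", s)), "\\s+"))
    w[nchar(w) > 0]
  }
  wa <- tokenize(a)
  wb <- tokenize(b)
  if (length(wa) == 0 || length(wb) == 0) return(0)

  # Make wa the longer list, so the denominator is max(wordcount(A), wordcount(B))
  if (length(wb) > length(wa)) { tmp <- wa; wa <- wb; wb <- tmp }

  # A word is matched if some word in the other sentence is within
  # max_word_dist Levenshtein edits (adist gives the full distance matrix)
  matched <- apply(adist(wa, wb), 1, min) <= max_word_dist

  # Optional inverse-frequency weights: missing a common word like "the"
  # costs little; missing a rare word costs a lot
  w <- rep(1, length(wa))
  if (!is.null(word_freq)) {
    f <- word_freq[wa]
    f[is.na(f)] <- 1          # words not in the frequency table count as rare
    w <- 1 / log1p(f)
  }

  sum(w[matched]) / sum(w)
}

fuzzy_sentence_score("Green Plantain (Large)", "Large Plantains (Green)")  # 1
fuzzy_sentence_score("Green Plantain", "Large Plantain")                   # 0.5
```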

In any case, if you expect a perfect 1:1 correspondence between the two lists, then you can rank possible matches from best to worst to take the best match available.
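Something like the following (a hypothetical sketch building on `fuzzy_sentence_score` above – note it is greedy per item in A rather than a strict one-to-one assignment, which would rank all pairs globally and assign them best-first):

```r
# For each item in A, pick its best-scoring candidate in B
best_matches <- function(A, B) {
  scores <- outer(A, B, Vectorize(fuzzy_sentence_score))  # all pairwise scores
  best   <- max.col(scores)                               # best column per row
  data.frame(a = A, b = B[best],
             score = scores[cbind(seq_along(A), best)])
}

best_matches(c("Green Plantain (Large)", "Ripe Banana"),
             c("Large Plantains (Green)", "Bananas, ripe", "Red Onion"))
```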

2 comments

  1. Hi Andrew, thanks for the piece of code. I have customized it to my requirements. However, it seems to take an awful lot of time on large datasets – e.g. dataframe A has close to 15k rows and B has 82k. The code works fine on smaller datasets, though!

    1. That’s a fair point. I used it on datasets which, though large, had a relatively small number of unique values on each side, so scalability wasn’t an issue. If you actually have a large number of unique values, then any method that compares every item in A with every item in B will be slow (N^2 comparisons). Perhaps you can break them up into subsets (even overlapping subsets) somehow – sacrificing some matching accuracy for speed – see the sketch below. If you find a good approach, feel free to post back here!
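
One way to sketch that subsetting idea (the blocking key here – first letter – is purely illustrative, and this reuses `best_matches` from above; you’d pick a key suited to your data):

```r
# Only compare items that share a cheap "block key" (here: first letter),
# turning one huge N x M comparison into many small ones. Recall drops
# whenever true matches land in different blocks.
block_key <- function(x) substr(tolower(trimws(x)), 1, 1)

match_blocked <- function(A, B) {
  keys <- intersect(unique(block_key(A)), unique(block_key(B)))
  do.call(rbind, lapply(keys, function(k) {
    best_matches(A[block_key(A) == k], B[block_key(B) == k])
  }))
}
```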
