Fuzzy matching multiword strings and sentences in R
A colleague asked me about fuzzy matching of string data, which is a problem that can come up when linking datasets. I figured I might as well reproduce my comments here since this is such a common problem, and many of the built-in algorithms are well suited to word matching but not to multiword strings.
For simple words #
Some kind of edit distance, usually Levenshtein distance, works well. So “Andrew” and “Andre” match with a distance of 1, and you can set a threshold to accept a match. It can help to scale the Levenshtein distance by the word length, or the max of the two word lengths, or something, since missing by one letter is a much more confident match for “Andrew”/”Andre” than for “Do”/”Dog”.
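As a quick sketch of what that scaling looks like, here’s the idea using base R’s adist(), which computes Levenshtein distance (the lev_score name is just for illustration):

```r
# Levenshtein distance scaled by the longer word's length, for a single
# pair of words; lower means a closer match.
lev_score <- function(a, b) {
  drop(adist(a, b)) / max(nchar(a), nchar(b))
}

lev_score("Andrew", "Andre")  # 1/6 -- a fairly confident match
lev_score("Do", "Dog")        # 1/3 -- same edit distance, weaker evidence
```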
For names #
The “Soundex” algorithm is practically antique but works well. It renders “Maurice” and “Morris” the same, which is nice if you’re dealing with messy name data. Not sure how it works on non-names or non-English – it’s probably too loose. There are a bunch of descendants of this method too.
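In R, the stringdist package (assuming it’s installed) has a phonetic() helper that returns Soundex codes, so you can see the effect directly:

```r
library(stringdist)

# Both names encode to the same Soundex code ("M620"), so they'd match
phonetic(c("Maurice", "Morris"))

# Equivalently, stringdist() with method = "soundex" returns 0 for a match
stringdist("Maurice", "Morris", method = "soundex")
```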
For sentences #
I’ve generally made something up in the past, because it starts to get quite domain specific. Some kind of combination of Levenshtein distance and a sort of bag-of-words works quite well. For example, for sentences A and B:
- Get rid of all punctuation and break each sentence into words
- Consider two words a match if they have a Levenshtein distance <= 1 (stemming the words might help too)
- Count how many (approximately) matching words there are between the two strings
- The score is that count divided by max(wordcount(A), wordcount(B)) – so a higher score is better
So for example “Green Plantain (Large)” and “Large Plantains (Green)” would be a perfect match (1), whereas “Green Plantain” and “Large Plantain” would get 0.5.
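As a rough illustration of that scoring (the version linked below adds the word-frequency weighting), here’s a minimal sketch in base R; the fuzzy_sentence_score name is just for this example:

```r
# Minimal sketch of the scoring above (no word-frequency weighting).
# Strip punctuation, split into words, count approximate word matches
# (Levenshtein distance <= 1), and divide by the longer word count.
fuzzy_sentence_score <- function(a, b) {
  tokenize <- function(s) {
    s <- tolower(gsub("[[:punct:]]", " ", s))
    strsplit(trimws(s), "\\s+")[[1]]
  }
  wa <- tokenize(a)
  wb <- tokenize(b)
  # count words in A that have at least one near-match in B
  matches <- sum(sapply(wa, function(w) any(adist(w, wb) <= 1)))
  matches / max(length(wa), length(wb))
}

fuzzy_sentence_score("Green Plantain (Large)", "Large Plantains (Green)")  # 1
fuzzy_sentence_score("Green Plantain", "Large Plantain")                   # 0.5
```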
I have a version of this implemented in R from an old project, below. It adds a word frequency part, so that missing a match by a “the” or an “a” isn’t nearly so bad as missing by a rare word. This is just one really simple approach - it worked for one project I worked on, but I wouldn’t pretend it generalises to every such problem.
In any case, if you expect a perfect 1:1 correspondence between the two lists, then you can rank the possible matches from best to worst and take the best available match for each item.
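Something along these lines (assuming the fuzzy_sentence_score() sketch above) scores every pair and keeps the best-scoring candidate in B for each item in A:

```r
# Score every A-B pair and keep the best candidate in B for each item in A.
# Note this doesn't force each B item to be used only once.
best_match <- function(A, B) {
  scores <- outer(A, B, Vectorize(fuzzy_sentence_score))
  data.frame(a     = A,
             b     = B[apply(scores, 1, which.max)],
             score = apply(scores, 1, max))
}
```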
https://gist.github.com/econandrew/a9930d812eb420b20358
Comments (2)
Hi Andrew, Thanks for the piece of code. I have customized the code as per my requirements. However, it seems to take an awful lot of time on large datasets, e.g. dataframe A has close to 15k rows and B has 82k. The code works fine on smaller datasets though!
That’s a fair point. I used it on datasets which, though large, had a relatively small number of unique values on each side, so scalability wasn’t an issue. If you actually have a large number of unique values, then I would think any method that compares every item in A with every item in B will be slow (N^2 comparisons). Perhaps you can break them up into subsets (even overlapping subsets) somehow - sacrificing some matching accuracy for speed. If you find a good approach, feel free to post back here!
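Roughly, I mean something like this (just a sketch, blocking on the first letter as a cheap key and only comparing within blocks):

```r
# Block on a cheap key (first letter) and only compare within each block;
# this misses matches whose keys differ, trading accuracy for speed.
blocked_subsets <- function(A, B) {
  key <- function(x) toupper(substr(trimws(x), 1, 1))
  a_split <- split(A, key(A))
  b_split <- split(B, key(B))
  common  <- intersect(names(a_split), names(b_split))
  # list of A/B subsets that still need the full pairwise comparison
  lapply(setNames(common, common),
         function(k) list(a = a_split[[k]], b = b_split[[k]]))
}
```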
Hi. Where can I find your code? I tried GitHub but could not find this R code. Thanks.
It’s a github gist - you should see it inline above, where you can click “View Raw”. But if that’s not working, this link should: https://gist.github.com/econandrew/a9930d812eb420b20358/raw/0e8bcd89d2610e0ffe8d1ab05c15eb369c9a4158/fuzzymatch.R
Check out fuzzyjoin::stringdist_join, which includes Levenshtein, Soundex, and other methods: https://github.com/dgrtwo/fuzzyjoin/blob/master/R/stringdist_join.R
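For reference, a minimal example of how that might look (assuming fuzzyjoin is installed, and that both data frames happen to have a name column to join on):

```r
library(fuzzyjoin)

a <- data.frame(name = c("Green Plantain", "Andrew"))
b <- data.frame(name = c("Green Plantains", "Andre"))

# Keep pairs within Levenshtein distance 2; "dist" records the distance
stringdist_join(a, b, by = "name", method = "lv",
                max_dist = 2, mode = "inner", distance_col = "dist")
```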