Fuzzy matching multiword strings and sentences in R

A colleague asked me about fuzzy matching of string data, which is a problem that can come up when linking datasets. I figured I might as well reproduce my comments here since this is such a common problem, and many of the built-in algorithms are well suited to word matching but not to multiword strings.

For simple words

Some kind of edit distance, usually Levenshtein distance, works well. So “Andrew” and “Andre” match with a distance of 1, and you can set a threshold to accept a match. It can help to scale the Levenshtein distance by the word length (say, the longer of the two word lengths), since missing by one letter is a much more confident match for “Andrew”/“Andre” than for “Do”/“Dog”.
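
For instance, in base R (norm_lev is just an illustrative name, and the 0.25 threshold is an assumption you would tune):

```r
# Levenshtein distance via base R's adist(), scaled by the longer
# string so a fixed threshold means the same thing for short and
# long words.
norm_lev <- function(a, b) {
  drop(adist(a, b)) / pmax(nchar(a), nchar(b))
}

norm_lev("Andrew", "Andre")  # 1/6 ~ 0.17 -> accept at a 0.25 threshold
norm_lev("Do", "Dog")        # 1/3 ~ 0.33 -> reject: much weaker evidence
```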

For names

The “Soundex” algorithm is practically antique but works well. It renders “Maurice” and “Morris” the same, which is nice if you’re dealing with messy name data. I’m not sure how it works on non-names or non-English text; it’s probably too loose. There are a bunch of descendants of this method too.
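
A quick check with the stringdist package (one of several R packages that offer Soundex):

```r
library(stringdist)

# phonetic() computes Soundex codes; equal codes count as a match.
phonetic(c("Maurice", "Morris"))
#> [1] "M620" "M620"

phonetic("Maurice") == phonetic("Morris")  # TRUE
```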

For sentences

I’ve generally made something up in the past, because it starts to get quite domain specific. Some kind of combination of Levenshtein distance and a sort of bag-of-words works quite well. For example, for sentences A and B, you might split each into words, find each word of A’s closest counterpart in B by Levenshtein distance, and combine those per-word distances into a sentence-level score, as in the sketch below.

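A minimal sketch of that kind of combination (sentence_dist is an illustrative name; the gist linked below implements its own variant):

```r
# For each word of A, find its closest word in B by normalized
# Levenshtein distance, then average those best distances.
# 0 = identical sentences, 1 = no letters shared at all.
sentence_dist <- function(a, b) {
  wa <- strsplit(tolower(a), "\\s+")[[1]]
  wb <- strsplit(tolower(b), "\\s+")[[1]]
  d <- adist(wa, wb)                           # word-by-word edit distances
  d <- d / outer(nchar(wa), nchar(wb), pmax)   # scale by the longer word
  mean(apply(d, 1, min))                       # best counterpart per word of A
}

sentence_dist("the quick brown fox", "a quick brown dog")
```

Note the score is asymmetric (it only asks how well A’s words are covered by B); averaging sentence_dist(a, b) and sentence_dist(b, a) makes it symmetric.
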
In any case, if you expect a perfect 1:1 correspondence between the two lists, you can rank candidate matches from best to worst and take the best available match for each item.

https://gist.github.com/econandrew/a9930d812eb420b20358

Comments (5)

Hi Andrew, thanks for the piece of code. I have customized the code to my requirements. However, it seems to take an awful lot of time on large datasets, e.g. where dataframe A has close to 15k rows and B has 82k. The code works on smaller datasets, though!

Jason Born

That’s a fair point. I used it on datasets which, though large, had a relatively small number of unique values on each side, so scalability wasn’t an issue. If you actually have a large number of unique values, then I would think any method that compares every item in A with every item in B will be slow (N^2 comparisons). Perhaps you can break them up into subsets (even overlapping subsets) somehow, sacrificing some matching accuracy for speed. If you find a good approach, feel free to post back here!
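
For instance, a rough, untested sketch of that blocking idea, comparing only strings that share a first letter:

```r
# Block on the first letter so adist() only runs within blocks,
# instead of comparing all of A against all of B.
block_key <- function(x) tolower(substr(x, 1, 1))

match_blocked <- function(A, B) {
  keys <- intersect(block_key(A), block_key(B))
  do.call(rbind, lapply(keys, function(k) {
    a <- A[block_key(A) == k]
    b <- B[block_key(B) == k]
    d <- adist(a, b)
    data.frame(a = a,
               best = b[apply(d, 1, which.min)],
               dist = apply(d, 1, min))
  }))
}
```

Anything whose first letter was mistyped will be missed, which is the accuracy-for-speed trade-off.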

Andrew Whitby

Hi. Where can I find your code? I tried GitHub but could not find this R code. Thanks.

Michal

It’s a GitHub gist; you should see it inline above, where you can click “View Raw”. But if that’s not working, this link should: https://gist.github.com/econandrew/a9930d812eb420b20358/raw/0e8bcd89d2610e0ffe8d1ab05c15eb369c9a4158/fuzzymatch.R

Andrew Whitby

Check out fuzzyjoin::stringdist_join, which includes Levenshtein, Soundex, and other methods: https://github.com/dgrtwo/fuzzyjoin/blob/master/R/stringdist_join.R
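
For example, with made-up toy data frames:

```r
library(fuzzyjoin)

A <- data.frame(name = c("Andrew", "Maurice"))
B <- data.frame(name = c("Andre", "Morris"))

# Levenshtein join: keeps pairs within 2 edits (Andrew/Andre).
stringdist_inner_join(A, B, by = "name", method = "lv",
                      max_dist = 2, distance_col = "dist")

# Soundex join: matches on identical codes (Maurice/Morris too).
stringdist_inner_join(A, B, by = "name", method = "soundex")
```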

Arthur
