A colleague asked me about fuzzy matching of string data, which is a problem that can come up when linking datasets. I figured I might as well reproduce my comments here since this is such a common problem, and many of the built-in algorithms are well suited to word matching but not to multiword strings.
- For single words, some kind of edit distance, usually Levenshtein distance, works well. So “Andrew” and “Andre” match with a distance of 1, and you can set a threshold to accept a match. It helps to scale the Levenshtein distance by the word length (or the max of the two word lengths), since missing by one letter is a much more confident match for “Andrew”/“Andre” than for “Do”/“Dog”.
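As a concrete illustration, here is a minimal Python sketch of the length-scaled idea (the function names are my own; the distance itself is the standard dynamic-programming recurrence):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Scale by the longer word's length, so one wrong letter in a long
    word scores better than one wrong letter in a short word."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))
```

With this scaling, “Andrew”/“Andre” scores about 0.83 while “Do”/“Dog” scores about 0.67, even though both are one edit apart.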
- For names, the “Soundex” algorithm is practically antique but works well. It encodes “Maurice” and “Morris” the same, which is handy if you’re dealing with messy name data. I’m not sure how it behaves on non-names or non-English text; it’s probably too loose. There are a bunch of descendants of this method too.
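For the curious, a compact Python sketch of classic American Soundex, assuming the standard rules (keep the first letter, encode consonant groups as digits, drop h/w, collapse adjacent identical codes, drop vowels, pad to four characters):

```python
def soundex(name: str) -> str:
    """Classic American Soundex code for a single name."""
    codes = {}
    for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in group:
            codes[ch] = digit
    letters = [c for c in name.lower() if c.isalpha()]
    # h and w are dropped entirely; vowels encode to "0" and act as separators
    encoded = [codes.get(c, "0") for c in letters if c not in "hw"]
    if not encoded:
        return (letters[0].upper() + "000")[:4]
    # collapse runs of the same code
    out = [encoded[0]]
    for d in encoded[1:]:
        if d != out[-1]:
            out.append(d)
    # keep the first letter itself, drop the vowel separators from the rest
    tail = [d for d in out[1:] if d != "0"]
    return (letters[0].upper() + "".join(tail) + "000")[:4]
```

Both “Maurice” and “Morris” come out as M620, which is exactly the behaviour described above.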
- For sentences, I’ve generally made something up in the past, because it starts to get quite domain-specific. Some combination of Levenshtein distance and a bag-of-words comparison works quite well. For example, for sentences A and B:
- Get rid of all punctuation and break each sentence into words
- Consider two words a match if they have a Levenshtein distance <= 1 (stemming the words first might help too)
- Calculate how many (approximately) matching words there are between the two strings
- The score is that count divided by max(wordcount(A), wordcount(B)), so a higher score is better
- So for example “Green Plantain (Large)” and “Large Plantains (Green)” would be a perfect match (1), whereas “Green Plantain” and “Large Plantain” would get 0.5.
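The steps above can be sketched in Python as follows (the author’s R version mentioned below additionally weights matches by word frequency; this plain version just counts them):

```python
import re

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def fuzzy_sentence_score(a: str, b: str) -> float:
    """Strip punctuation, split into words, count words matching within
    edit distance 1, and divide by the longer sentence's word count."""
    tokenize = lambda s: re.sub(r"[^\w\s]", " ", s.lower()).split()
    words_a, words_b = tokenize(a), tokenize(b)
    if not words_a and not words_b:
        return 1.0
    remaining = list(words_b)   # each word in B can be matched at most once
    matches = 0
    for w in words_a:
        for cand in remaining:
            if levenshtein(w, cand) <= 1:
                remaining.remove(cand)
                matches += 1
                break
    return matches / max(len(words_a), len(words_b))
```

This reproduces the example scores: the two plantain strings score 1.0 (“Plantain”/“Plantains” match within distance 1), and “Green Plantain” vs “Large Plantain” scores 0.5.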
- I have a version of this implemented in R from an old project, below. It adds a word-frequency component, so that missing a match on a “the” or an “a” isn’t nearly so bad as missing on a rare word. This is just one really simple approach; it worked for one project, but I wouldn’t pretend it generalises to every such problem.
In any case, if you expect a perfect 1:1 correspondence between the two lists, you can rank the candidate pairs from best to worst and take the best available match for each item.
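One simple way to do that ranking is a greedy assignment: score every pair, then repeatedly take the highest-scoring pair whose members are both still unmatched. A Python sketch, assuming a generic `score(a, b)` function (here `difflib`’s `SequenceMatcher` ratio is used as a stand-in scorer, not the scheme above):

```python
from difflib import SequenceMatcher
from itertools import product

def best_matches(list_a, list_b, score):
    """Greedy 1:1 assignment: take the highest-scoring remaining pair
    until one list is exhausted. (The Hungarian algorithm gives the
    optimal assignment, but greedy is often good enough for linkage.)"""
    scored = sorted(((score(a, b), a, b) for a, b in product(list_a, list_b)),
                    reverse=True)
    used_a, used_b, matches = set(), set(), []
    for s, a, b in scored:
        if a not in used_a and b not in used_b:
            matches.append((a, b, s))
            used_a.add(a)
            used_b.add(b)
    return matches

# Example usage with a stand-in string scorer
pairs = best_matches(["Andrew", "Green Plantain"],
                     ["Andre", "Large Plantain"],
                     lambda a, b: SequenceMatcher(None, a, b).ratio())
```

Greedy assignment can make globally suboptimal choices when scores are close, which is why the hedging about a perfect 1:1 correspondence matters.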