# Fast fuzzy searching of a string in a list

I hadn't worked on the Sift string distance algorithm for a while, but I was reminded of it when someone wanted to use it to suggest corrections to user input, something like Google's "Did you mean...?" or an autocomplete application. That got me thinking of ways to use Sift for bulk searching. I am still thinking about it, but in the meantime this can be achieved using the Sift4 algorithm, with up to a **40% improvement** in speed over the naïve approach of comparing with each item in the list.

Testing this solution, I've realized that the *maxDistance* parameter did not work correctly. I apologize. The code is now fixed on the algorithm's blog post, so go and get it.

So what is this solution for mass search? We can use two pieces of knowledge about the problem space:

- the minimum possible distance between two strings of lengths *l1* and *l2* will always be *abs(l1-l2)*
  - it's very easy to understand the intuition behind it: one cannot generate a string of size 5 from a string of size 3 without at least adding two new letters, so the minimum distance would be 2
- as we advance through the list of strings, we have a best distance value that we keep updating
  - this molds very well on the *maxDistance* option of Sift4
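The first observation can be checked in a couple of lines. This is only a minimal sketch: `minPossibleDistance` and `canSkip` are hypothetical helper names used for illustration and are not part of Sift4.

```javascript
// Lower bound on the distance between two strings: every extra
// character in the longer string costs at least one insertion.
function minPossibleDistance(s1, s2) {
  return Math.abs(s1.length - s2.length);
}

// A candidate can be skipped without running the distance algorithm
// when even its best-case distance is worse than the best found so far.
function canSkip(search, candidate, bestDistance) {
  return minPossibleDistance(search, candidate) > bestDistance;
}
```

For a string of size 3 and a string of size 5, `minPossibleDistance` returns 2, so the longer string is skipped as soon as a match at distance 1 has been found.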

Thus armed, we can find the best matches for our string from a list using the following steps:

- set a *bestDistance* variable to a very large value
- set a *matches* variable to an empty list
- for each of the strings in the list:
  - compare the *minimum distance* between the search string and the string in the list (*abs(l1-l2)*) to *bestDistance*
    - if the *minimum distance* is larger than *bestDistance*, ignore the string and move to the next
  - use Sift4 to get the distance between the search string and the string in the list, using *bestDistance* as the *maxDistance* parameter
    - if the algorithm reaches a *temporary distance* that is larger than *bestDistance*, it will break early and report the *temporary distance*, which we will ignore
  - if *distance* < *bestDistance*, then clear the *matches* list and add the string to it, updating *bestDistance* to *distance*
  - if *distance* = *bestDistance*, then add the string to the list of *matches*

When using the common Sift4 version, which doesn't compute transpositions, the list of matches is retrieved 40% faster on average than by simply going through the list of strings and updating the distance (about 15% faster with transpositions). Considering that Sift4 is already a lot faster than Levenshtein, this method allows searching through hundreds of thousands of strings really fast. The time gained can be used to further refine the matches list using a slower but more precise algorithm, like Levenshtein, on a much smaller set of possible matches.
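As an illustration of that refinement step, here is a rough sketch that re-ranks a short list of candidates with a textbook Levenshtein implementation. The names `levenshtein` and `refineMatches` are my own for this example, and the assumption is that the list passed in is the small *matches* list produced by the fast search.

```javascript
// Classic Levenshtein distance, keeping only two rows of the matrix.
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  let curr = new Array(b.length + 1);
  for (let i = 1; i <= a.length; i++) {
    curr[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution
      );
    }
    [prev, curr] = [curr, prev]; // the row just computed becomes "prev"
  }
  return prev[b.length];
}

// Re-rank the (small) list of candidates by exact Levenshtein distance.
function refineMatches(search, matches) {
  return matches
    .map(word => ({ word, distance: levenshtein(search, word) }))
    .sort((x, y) => x.distance - y.distance);
}
```

Because this only runs on the handful of strings that survived the first pass, the quadratic cost of Levenshtein should matter much less than it would on the full list.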

Here is a sample written in JavaScript, where we search for a random string in the list of English words:

```
const search = getRandomString(); // this is the search string
let matches = []; // the list of found matches
let bestDistance = 1000000; // the smallest distance to our search found so far
const maxOffset = 5; // a common value for searching similar strings
const l = search.length; // the length of the search string
for (const word of english) {
  const minDist = Math.abs(l - word.length); // minimum possible distance
  if (minDist > bestDistance) continue; // if too large, skip this word
  const dist = sift4(search, word, maxOffset, bestDistance);
  if (dist < bestDistance) {
    matches = [word]; // new array with a single item
    bestDistance = dist;
    if (bestDistance === 0) break; // if an exact match, we can exit (optional)
  } else if (dist === bestDistance) {
    matches.push(word); // add the match to the list
  }
}
```

There are further optimizations that can be added, beyond the scope of this post:

- words can be grouped by length, and the minimum distance check can be done on entire buckets of strings of the same length
- words can be sorted, and when a string is rejected as a match, all strings with the same prefix can be rejected as well
  - this requires an update of the Sift algorithm to return the offset at which it stopped (to which the maxOffset must be added)
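To give an idea of the first optimization, here is a hedged sketch of bucketing by length. Everything in it is an assumption made for illustration: `bucketByLength` and `searchBuckets` are hypothetical names, and `distanceFn(search, word, maxDistance)` stands for any distance function with early exit (for Sift4 it could be a small wrapper that also passes the maxOffset). Visiting the buckets closest in length first is my own design choice, so that *bestDistance* shrinks early.

```javascript
// Group words by length once, ahead of any search.
function bucketByLength(words) {
  const buckets = new Map();
  for (const word of words) {
    if (!buckets.has(word.length)) buckets.set(word.length, []);
    buckets.get(word.length).push(word);
  }
  return buckets;
}

// distanceFn(search, word, maxDistance) is an assumed signature,
// not the actual Sift4 one.
function searchBuckets(search, buckets, distanceFn) {
  let matches = [];
  let bestDistance = Infinity;
  // Order bucket lengths by how far they are from the search length.
  const lengths = [...buckets.keys()].sort(
    (a, b) => Math.abs(a - search.length) - Math.abs(b - search.length)
  );
  for (const len of lengths) {
    // One check rejects this bucket and, because of the ordering,
    // every bucket that comes after it.
    if (Math.abs(len - search.length) > bestDistance) break;
    for (const word of buckets.get(len)) {
      const dist = distanceFn(search, word, bestDistance);
      if (dist < bestDistance) {
        matches = [word];
        bestDistance = dist;
      } else if (dist === bestDistance) {
        matches.push(word);
      }
    }
  }
  return { matches, bestDistance };
}
```

The buckets only need to be built once for a constant list, so their cost is spread over all the searches.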

I am still thinking of performance improvements. The transposition table gives more control over the precision of the search, but it's rather inefficient and resource consuming, not to mention that it adds code complexity and makes the algorithm harder to read. If I can't find a way to simplify and speed up the use of transpositions, I might give up on the concept entirely. Also, some sort of data structure could be built in advance, regardless of how much time and space that requires, assuming that the list of strings to search is large and constant and that the number of searches will be very large.

Let me know what you think in the comments!
