by Henk-Jan Lebbink | May 8, 2023
Accelerating Fuzzy Search using AVX-512
In this post, we will explore our SQL fuzzy string compare and contains functionality that allows multi GiB/s/core without the need for preprocessing or indexing.
Fuzzy string comparison is a valuable technique used for comparing strings that may be similar but not exactly the same. It is particularly useful when working with natural language data or when users make spelling errors. SQL provides several methods for implementing fuzzy string search, but we will focus on a method that works well with massive datasets, including those that are terabyte or petabyte in size. In such large-scale scenarios, we need methods that do not require preprocessing or indexing and only need to read data once. We at Sneller are interested in simplified planning and maintenance of schema-less terabyte storage systems by making preprocessing redundant.
Fuzzy search methods often use an edit distance metric. In our approach, we will use an estimation approach based on the Damerau-Levenshtein distance. One might wonder why we use estimation instead of calculating the true edit distance. As we will see, calculating true distances can be extremely difficult and may not be practical for fuzzy functions that inherently involve some degree of imprecision.
The term “edit distance” refers to a mathematical concept used to describe string comparison systems that are closed under specific string operations. There are several different distance measures that can be constructed using these operations, including:
- Substitution, which results in Hamming distance
- Transposition, which results in Jaro-Winkler distance
- Substitution, deletion, and insertion, which results in Levenshtein distance
- Substitution, deletion, insertion, and transposition, which results in Damerau-Levenshtein distance.
The Sneller query tool supports two SQL functions: the
EQUALS_FUZZY function checks if a given data string matches a provided string literal by allowing a certain number of string edits. The function estimates the minimum number of edits required to transform the data string into the literal string. If the estimated number of edits is below a specified threshold, the two strings are considered to have a fuzzy match. In other words, the function checks if the data string can be transformed into the literal string with no more than a certain number of edits. Similarly, the
CONTAINS_FUZZY function checks if a string literal is present in a data string by allowing a certain number of edits.
When searching through terabytes of data, the term “minimal” in “minimal edit distance” should be cause for concern. The number of possible edit sequences can be prohibitively large, and determining whether a particular sequence is minimal often requires examining every solution in the set. It is likely that computing the minimal edit distance is an NP-complete problem, and the most efficient algorithm available may require quadratic time. However, for functions as inherently imprecise as fuzzy search, finding exact edit distances may not be necessary. An approximation approach can meet our needs and enable very efficient implementations.
The strings ‘abcde’ and ‘acfee’ can have multiple edit sequences. To provide an example, one possible sequence involves substituting the second character of the second string ‘b’ with ‘c’ at position 2, replacing ‘c’ with ‘f’ at position 3, and substituting ’d’ with ’e’ at position 4. Another sequence could involve deleting ‘b’ at position 2, inserting ‘f’ at position 3, and substituting ’d’ with ’e’ at position 4. Although these are small strings, they still have a significant number of possible solutions, and we need to find the minimal sequence among them.
Edit Distance Approximation
Our approximation approach involves examining three consecutive Unicode characters at a time. More precisely, we first analyze the current character, and then we look ahead two characters to determine the minimum number of string operations needed to match three characters in the data string with three characters in the given string literal.
Zero-character Look-ahead Approximation
To explain how these approximations work, it is easiest to first consider a scenario where we do not look ahead any characters. We have two strings and two current positions in both strings. We have an accumulated edit distance of the strings before the current position, and we want to compute how much the distance increases if we consider the current position. Since we do not look beyond the current position, we cannot determine if the edit is a transposition, deletion, or insertion; we can only see substitutions. A transposition occurs when two characters that are next to each other in one string are swapped in the other string. With this zero-character look-ahead, we compute the Hamming distance between the two strings. Here is the pseudocode for this approach:
WHILE (DATA not empty) OR (NEEDLE not empty) DO D0 := DATA N0 := NEEDLE IF (D0==N0) // the first characters match THEN editDistance += 0; advanceData += 1; advanceNeedle += 1 ELSE editDistance += 1; advanceData += 1; advanceNeedle += 1 ENDIF ENDDO
One-character Look-ahead Approximation
If the approximation considers a one-character look-ahead, it can observe at the current position: single insertions, deletions, transpositions, and two substitutions. For example, if the data is
abc and the needle is
bac, the minimal edit distance would be 1. It would be classified as a transposition because the first character of the data (
D0) equals the second character of the needle (
N1), and vice versa.
WHILE (DATA not empty) OR (NEEDLE not empty) DO D0 := DATA N0 := NEEDLE IF (D0==N0) // the first characters match THEN editDistance += 0; advanceData += 1; advanceNeedle += 1 ELSE D1 := DATA N1 := NEEDLE // character is deleted in data IF (D1!=N0) && (D0==N1) THEN editDistance += 1; advanceData += 1; advanceNeedle += 2 ENDIF // character is inserted in data IF (D1==N0) && (D0!=N1) THEN editDistance += 1; advanceData += 2; advanceNeedle += 1 ENDIF // characters are transposed in data IF (D1==N0) && (D0==N1) THEN editDistance += 1; advanceData += 2; advanceNeedle += 2 ENDIF // all remaining situations: character is substituted in data IF (D1!=N0) && (D0!=N1) THEN editDistance += 1; advanceData += 1; advanceNeedle += 1 ENDIF ENDIF ENDDO
Note that we may observe two substitutions, but we only process the first one. This is because our approximation method yields better results if we handle the second substitution in the context of the next “current position”. For instance, the distance between the two strings may be more accurately classified with an insertion when it is considered alongside the subsequent characters.
Two-character Look-ahead Approximation
As we increase our look-ahead to two characters, our approximation method becomes more complex. At the current position, we can detect one or two deletions, one or two insertions, one transposition, or one transposition interleaved with one insertion. All other differences between the strings are considered as a single substitution.
Although there could be other combinations of operations, such as a transposition followed by a substitution, we have found that the following pseudocode produces the most accurate estimate of the actual edit distance when using a two-character look-ahead. To find this optimal estimation method, we generated all possible strings of length less than 8 using an alphabet of 4 characters, calculated the estimated edit distance, and subtracted the true edit distance. The total of all differences represents the cumulative error of the estimation method. We selected the method with the lowest cumulative error from the entire range of estimation methods. Only several percentages of the estimated edit distances were overestimated while the remainder was estimated correctly.
WHILE (DATA not empty) OR (NEEDLE not empty) DO D0 := DATA N0 := NEEDLE IF (D0==N0) // the first characters match THEN editDistance += 0; advanceData += 1; advanceNeedle += 1 ELSE D1 := DATA D2 := DATA N1 := NEEDLE N2 := NEEDLE // two characters are transposed in data IF (N0!=D0) && (N0==D1) && (N1==D0) THEN editDistance += 1; advanceData += 2; advanceNeedle += 2 ENDIF // one character is deleted in data IF (N0!=D0) && (N0!=D1) && (N1==D0) && (N2==D1) THEN editDistance += 1; advanceData += 1; advanceNeedle += 2 ENDIF // two characters are deleted in data: IF (N0!=D0) && (N1!=D1) && (N2!=D2) && (N2==D0) && (N0!=D1) && (N1!=D2) && (N1!=D0) && (N2!=D1) && (N0!=D2) THEN editDistance += 2; advanceData += 1; advanceNeedle += 3 ENDIF // one character is inserted in data IF (N0!=D0) && (N0==D1) && (N1==D2) && (N1!=D0) THEN editDistance += 1; advanceData += 2; advanceNeedle += 1 ENDIF // two characters are inserted in data IF (N0!=D0) && (N1!=D1) && (N2!=D2) && (N2!=D0) && (N0!=D1) && (N1!=D2) && (N1!=D0) && (N2!=D1) && (N0==D2) THEN editDistance += 2; advanceData += 3; advanceNeedle += 1 ENDIF // transposition and insertion ab -> bca: IF (N0!=D0) && (N1!=D1) && (N2!=D2) && (N2!=D0) && (N0!=D1) && (N1!=D2) && (N1==D0) && (N2!=D1) && (N0==D2) THEN editDistance += 2;; advanceData += 3; advanceNeedle += 2 ENDIF // substitution in all remaining situations: ELSE editDistance += 1; advanceData += 1; advanceNeedle += 1 ENDIF ENDDO
This approximation only overestimate the edit distance between two strings, which means that sometimes the distance is estimated to be three when it is actually only two. As a result, when searching for strings with a maximum edit distance of three, some results with a true edit distance of three may be missed because they were estimated to have a distance of four. While this may appear to be a problem, the benefit of this approximation is that it allows us to find high edit distances at speeds comparable to string comparisons. We will discuss this more in detail later.
Fuzzy Match vs. Fuzzy Contains
The difference between a fuzzy match and a fuzzy contains is that in the former two strings are compared for a match, while in the latter a substring is matched at all positions in another string. Broadly speaking, a fuzzy contains is an iterated fuzzy match, making a fuzzy contains significantly more computationally expensive compared to a fuzzy match. Additionally, we have two versions of the implementation: one for ASCII-only fuzzy matching and another for Unicode-aware fuzzy matching.
-- string literal 'cache' is treated as a byte sequence EQUALS_FUZZY('Cash', 'cache', 1) -> FALSE EQUALS_FUZZY('Cash', 'cache', 2) -> TRUE EQUALS_FUZZY('Cash', 'cache', 3) -> TRUE -- string literal 'Straße' is treated as a unicode sequence EQUALS_FUZZY_UNICODE('strasse', 'Straße', 1) -> FALSE EQUALS_FUZZY_UNICODE('strasse', 'Straße', 2) -> TRUE -- string literal 'Straße' is treated as a byte sequence -- (note that `ß` is a 2-byte sequence) EQUALS_FUZZY('strasse', 'Straße', 1) -> FALSE EQUALS_FUZZY('strasse', 'Straße', 2) -> TRUE -- string literal 'Fox Jumps' has a distance of 3 to 'foks jums' CONTAINS_FUZZY('The quick brown foks jums over the lazy dog', 'Fox Jumps', 3) -> TRUE
Too much talk and little code!
Let’s take a look at some actual code that demonstrates how to implement the pseudocode in a branchless manner using 16 parallel lanes and the instruction set of a SkylakeX. As mentioned earlier, we have two versions: one for matching only ASCII characters and another for matching Unicode characters. Although they are conceptually similar, the ASCII version is much simpler because we can gather three ASCII bytes for 16 lanes with a single
VPGATHERDD instruction. In contrast, the Unicode version requires three
VPGATHERDD instructions to load the first three Unicode values. In this blog post, we will focus on the ASCII implementation.
Step 1: Gather data and needle
Step 1 consists of loading four ASCIIs from the lanes that still need to be processed for both the data and the needle.
loop: VPCMPD $6, Z11, Z3, K2, K4 ; K4 := K2 & (data_len>0) ;K2=lane_todo; Z3=data_len; Z11=0 VPCMPD $6, Z11, Z13, K2, K5 ; K5 := K2 & (needle_len>0) ;K2=lane_todo; Z13=needle_len; Z11=0 KMOVW K4, K3 ; K3 := K4 ; VPGATHERDD (SI)(Z2*1),K3, Z8 ; gather data ;Z8=data; SI=data_ptr; Z2=data_off VPGATHERDD (R14)(Z5*1),K5, Z9 ; gather needle ;Z9=needle; R14=needle_ptr; Z5=needle_off
Step 2: Remove Tail
Since we use a two-character look-ahead, meaning that we only consider the first three characters, we’re setting the last byte of each lane to
0xFF. To discard the tail and set bytes that do not belong to the data to
0xFF, we calculate the minimum value between three and the remaining byte length of the data (
Z3), which serves as a key to retrieve the mask using a
VPERMD lookup. Then, we use an OR operation to apply the mask to the data, which is the fastest method to achieve this. It’s worth noting that we don’t have to handle the tail of the needle because our Go code pads the needle with sufficient
VPMINSD Z3, Z24, Z26 ; scratch1 := min(3, data_len) ;Z26=scratch1; Z24=3; Z3=data_len VPERMD Z18, Z26, Z26 ; retrieve the tail-mask ;Z26=scratch1; Z18=tail_mask_data VPORD Z8, Z26, K4, Z8 ; data |= scratch1 ;Z8=data; K4=scratch1; Z26=scratch1
Step 3: Put ASCII into Uppercase
We consider ASCII values that are equivalent under case-folding to have an edit distance of zero, meaning that lowercase and uppercase ASCII values are not distinguished from each other. To convert a lowercase ASCII character to uppercase, we check if its value is greater than or equal to ASCII
a and less than or equal to
z. If this condition is true, we need to set the 6th bit to 1, which can be done using the
VPTERNLOGD instruction. However, in the Unicode-aware version of the code, code-points that are equal under case-folding are not treated as having a zero edit distance, as implementing such functionality leads to unsatisfiable register pressure.
VPCMPB $5, Z16, Z8, K3 ; K3 := (data>=char_a) ;K3=tmp_mask; Z8=data; Z16=char_a VPCMPB $2, Z17, Z8, K3, K3 ; K3 &= (data<=char_z) ;K3=tmp_mask; Z8=data; Z17=char_z VPMOVM2B K3, Z26 ; mask with selected chars ;Z26=scratch1; K3=tmp_mask VPTERNLOGD $0b01001100,Z15, Z8, Z26 ; ;Z26=scratch1; Z8=data; Z15=c_0b00100000
Step 4: Shuffle and Compare
We perform two shuffles on the needle to create three versions of it: the original ‘012’, as well as ‘120’ and ‘201’. We refer to these bytes as
N2, respectively. We then compare these with the data ‘012’, which we refer to as
D2. The three mask registers (
K5) contain the results of the comparisons:
K3 shows whether
N2==D2. K4 shows whether
K5 shows whether
VPSHUFB Z25, Z9, Z27 ; ;Z27=scratch2; Z9=needle; Z25=ror_a3 VPSHUFB Z25, Z27, Z28 ; ;Z28=scratch3; Z27=scratch2; Z25=ror_a3 VPCMPB $0, Z26, Z9, K3 ; K3 := (needle==scratch1) ;K3=tmp_mask; Z9=needle; Z26=scratch1 VPCMPB $0, Z26, Z27, K4 ; K4 := (scratch2==scratch1) ;K4=scratch1; Z27=scratch2; Z26=scratch1 VPCMPB $0, Z26, Z28, K5 ; K5 := (scratch3==scratch1) ;K5=scratch2; Z28=scratch3; Z26=scratch1
Step 5: Fuzzy Match Logic
With these three mask registers, we can determine the edit distance increase and the number of bytes to advance for both the data and needle. We accomplish this by merging the bits of the three mask registers into a 9-bit key and using
VPGATHERDD to find the required values. For those interested in a challenge, we use a clever technique to shuffle bits ‘a’ to ‘g’ in the mask registers into the 9 least significant bits in register
VMOVDQU8.Z Z19, K3, Z26 ; 0000_0000 0000_000a 0000_000b 0000_000c; K3=tmp_mask; Z19=0x10101 VMOVDQU8.Z Z20, K4, Z27 ; 0000_0000 0000_00d0 0000_00e0 0000_00f0; K4=scratch1; Z20=0x20202 VMOVDQU8.Z Z21, K5, Z29 ; 0000_0000 0000_0g00 0000_0h00 0000_0i00; K5=scratch2; Z21=0x40404 VPTERNLOGD $0b11111110,Z29, Z27, Z26 ; 0000_0000 0000_0gda 0000_0heb 0000_0ifc; VPMADDUBSW Z26, Z22, Z26 ; 0000_0000 0000_0gda 0000_0000 00he_bifc; Z22=0x10801 VPMADDWD Z26, Z23, Z26 ; 0000_0000 0000_0000 0000_000g dahe_bifc; Z23=0x400001
Our Go code will prepare a pointer that contains the values needed for the fuzzy match logic. For example, consider the case where the data is “bcd” and the needle is “abc”. This is a situation where one character has been deleted from the data, as per the pseudocode.
// one character is deleted in data IF (N0!=D0) && (N0!=D1) && (N1==D0) && (N2==D1) THEN editDistance += 1; advanceData += 1; advanceNeedle += 2
(N0!=D0) && (N0!=D1) && (N1==D0) && (N2==D1) results in a specific 9-bit key, and at the corresponding index of the key,
editDistance += 1; advanceData += 1; advanceNeedle += 2 is stored.
KMOVW K2, K3 ; tmp_mask := lane_todo ;K3=tmp_mask; K2=lane_todo; VPGATHERDD (R15)(Z26*1),K3, Z27 ; retrieve how to update ;Z27=scratch2; K3=tmp_mask; R15=update_ptr; Z26=key;
Step 6: Unpack Results
It is quite straightforward to extract the deltas for the edit distance (
Z28), advance needle (
Z9) and advance data (
Z8) from the previously gathered data (
VPSRLD $4, Z27, Z28 ; ed_delta := scratch2>>4 ;Z28=ed_delta; Z27=scratch2 VPSRLD $2, Z27, Z9 ; adv_needle := scratch2>>2 ;Z9=adv_needle; Z27=scratch2 VPANDD Z24, Z28, Z28 ; ed_delta &= 3 ;Z28=ed_delta; Z24=3 VPANDD Z24, Z27, Z8 ; adv_data := scratch2 & 3 ;Z8=adv_data; Z27=scratch2; Z24=3 VPANDD Z24, Z9, Z9 ; adv_needle &= 3 ;Z9=adv_needle; Z24=3
Step 7: Update
The next step is quite simple: update the edit distance, advance data, and advance needle using the deltas that were extracted in the previous step.
VPADDD Z28, Z4, K2, Z4 ; edit_dist += ed_delta ;Z4=edit_dist; K2=lane_todo; Z28=ed_delta VPSUBD Z8, Z3, K2, Z3 ; data_len -= adv_data ;Z3=data_len; K2=lane_todo; Z8=adv_data VPADDD Z8, Z2, K2, Z2 ; data_off += adv_data ;Z2=data_off; K2=lane_todo; Z8=adv_data VPSUBD Z9, Z13, K2, Z13 ; needle_len -= adv_needle ;Z13=needle_len; K2=lane_todo; Z9=adv_needle VPADDD Z9, Z5, K2, Z5 ; needle_off += adv_needle ;Z5=needle_off; K2=lane_todo; Z9=adv_needle
Step 8: Are we done?
If the edit distance exceeds the provided threshold value (
Z14), the corresponding lanes are labeled as dead and are subsequently removed from the active lanes.
VPCMPD $2, Z14, Z4, K2, K2 ; K2 &= (edit_dist<=threshold) ;K2=lane_todo; Z4=edit_dist; Z14=threshold
In the final step, if there are no more characters left in both the needle and the data, and the lane is still in the to-do list, it means we have found a fuzzy match. We add this lane to the list of active lanes (
K1) and remove it from the lanes to-do (
K2). If there are still lanes left to process, we go back to step 1 to collect new data and continue with the matching procedure.
VPCMPD $5, Z13, Z11, K2, K3 ; K3 := K2 & (0>=needle_len) ;K3=tmp_mask; K2=lane_todo; Z13=needle_len VPCMPD $5, Z3, Z11, K3, K3 ; K3 &= (0>=data_len) ;K3=tmp_mask; Z11=0; Z3=data_len KORW K1, K3, K1 ; lane_active |= tmp_mask ;K1=lane_active; K3=tmp_mask KANDNW K2, K3, K2 ; lane_todo &= ~tmp_mask ;K2=lane_todo; K3=tmp_mask KTESTW K2, K2 ; any lanes still todo? ;K2=lane_todo JNZ loop ; yes, then loop; jump if not zero (ZF = 0)
The code can be found in our GitHub repository.
Next, we will present some basic benchmarking results. We did not compare our system with other fuzzy search systems such as PostgreSQL with trigram indexes because we do not create indexes. Creating a trigram index for 227GB of text just to check for spelling errors is no fun. Instead, we run speed tests and compare results to disk I/O and compression speeds. Specifically, we use the dataset provided by ClickHouse for their ClickBench benchmark, which is a compressed JSON file hits.json.gz which is 21.1GB in size and contains 227GB of uncompressed text.
We have compressed the data (227GB) with the Sneller open-source self-describing compressed table format (
.zion) with the following command:
sdb pack -o=clickbench.iguana.zion -c=zion+iguana_v0 hits.json.gz. This results in a file
clickbench.iguana.zion of 11.5GB. To obtain statistics, we use the
-v flag with the
sdb query command.
-- scenario 1 $ sdb query -v -fmt=json "SELECT COUNT(Title) FROM read_file('clickbench.iguana.zion') WHERE Title = ' biden '" 22.8GiB/s (scanned 61984473088 B (57.73 GB) in 2.6588645s)
The Sneller query engine only needs to decompress the Title column which is 57.73 GB of scanned data which is only a fragment of the total 227GB of uncompressed data. It could process the query at 22.7GiB of data per second, with 16 threads on an 8 core Intel 11900K with 64GB DDR4-3200. The total time to finish the query is 2.65 seconds.
-- scenario 2 $ sdb query -v -fmt=json "SELECT COUNT(Title) FROM read_file('clickbench.iguana.zion') WHERE EQUALS_FUZZY(Title, ' biden ', 1)" 22.8GiB/s (scanned 61984473088 B (57.73 GB) in 2.6530886s)
Note that the fuzzy equality has the same run time as the string equality comparison from scenario 1; the difference is not significant.
-- scenario 3 $ sdb query -v -fmt=json "SELECT COUNT(Title) FROM read_file('clickbench.iguana.zion') WHERE Title ILIKE '% biden %'" 22.7GiB/s (scanned 61984473088 B (57.73 GB) in 2.6628499s)
-- scenario 4 $ sdb query -fmt=json -v "SELECT COUNT(Title) FROM read_file('clickbench.iguana.zion') WHERE CONTAINS_FUZZY(Title, ' biden ', 1)" 11.4GiB/s (scanned 61984473088 B (57.73 GB) in 5.3130959s)
The case-insensitive ASCII-only version of
CONTAINS_FUZZY (scenario 4) treats ASCII characters that are equal under case-folding as not requiring an edit, which means we can use
CONTAINS_FUZZY(Title, ' biden ', 0) to compare its performance against
Title ILIKE '% biden %' (scenario 3), which produces the exact same results. We see that loading 11.5GB, decompressing 57.73GB of the table Title, and comparing doing a unicode string contains (scenario 3) takes the same time as the most simple byte wise memory compare (scenario 1), which means that the computations in scenario 3 are still memory bound.
As the table below demonstrates, the highly optimized Unicode-aware ILIKE function is 40% faster than CONTAIN_FUZZY, which is still a respectable speed considering the non-trivial fuzzy matching capability it offers.
Execution times for thresholds 0 to 5:
Reasonable threshold values do not affect the runtime of fuzzy equals, whereas for fuzzy contains, there is a linear relationship between threshold and runtime. Fuzzy equals makes comparisons at the start of the data string and up to the threshold, and thus the number of data strings affects the runtime, but not the length of the data string. Fuzzy contains starts at the first position and if there’s no match, restarts from the next position, considering all positions of the data string if the pattern is not found. Thus, the length of the data string significantly affects the runtime.
Moreover, the runtime of ASCII contains is approximately 1.8 times faster than Unicode contains. Increasing the threshold by 1 leads to a 70% increase in runtime of the contains functions.
In this post we have provided details about our implementation of a fuzzy match and fuzzy contains functionality that does not need any indexes. We see a linear relation between the length of the search string and the runtime of the contains functionality. We also see a linear relation between the edits threshold and the runtime. And although fuzzy contains functionality is not a memory bound computation we still observe good performance for common and acceptable thresholds.
Try Sneller for Free
You can try Sneller right now on your own data for free through our playground, or you can sign up for Sneller Cloud, which gives you access to automatic ingest from your S3 buckets and hundreds of CPU cores on which to run your queries.
If you’re a developer interested in the details of how Sneller works under the hood, you’re in luck: Sneller is open source software!