by Henk-Jan Lebbink | April 24, 2023
Accelerating Regular Expressions with AVX-512 at 1.5 GB/s/core
In this article, we will examine the inner workings of a high-performance regular expression engine for our SQL VM, which uses a novel approach with 16 parallel lanes to perform acceptance checks without relying on branching and backtracking, resulting in matching times close to memory speeds. We will also compare this engine to the ILIKE operator we discussed in a previous post. The engine is written in AVX-512 assembly and designed for the Intel Icelake processor.
Why-oh-why?
Why would you guys go through the effort of rewriting existing SQL functionality with handwritten vectorized assembly? This is a question that I got. A simplistic answer would be that we aim for faster tools because we can.
However, the benefits of efficient code extend beyond speed: we are interested in simplified planning and maintenance of schema-less large storage systems. For example, complex text searches on terabyte-scale data require huge search indices that need careful maintenance, and once present, existing indices may limit unforeseen new use-cases.
If having indexed data does not provide speed advantages over systems that are simply very fast, then complex storage architectures can be greatly simplified, by replacing them with systems that are simply put, very fast.
Automata and Regular Expressions
There is a lot to explore when it comes to regular expressions and automata, but for our current purpose, it’s important to note that regular expressions can be represented as Nondeterministic Finite Automata (NFA). From an NFA we can construct a Deterministic Finite Automaton (DFA), and these DFAs can be transformed to enable acceptance checks without the need for backtracking. This feature allows us to write a regex engine that traverses data just once. We have Go code available to do just this: generate such a DFA from any regular expression that is supported by the Go programming language.
Sneller VM
Some basic knowledge about our VM architecture will be helpful to understand the text. We will be utilizing various capabilities of this VM, such as the ability to handle 16 data streams concurrently. The S register serves as a virtual register and stores information related to the 16 strings being matched. It is mapped to two physical 512-bit vector registers, in the examples here, ZMM2
and ZMM3
. Specifically, ZMM2
contains 16 rows of 32-bit offsets that are relative to our data pointer RSI
, while ZMM3
holds the 16 byte lengths for each string.
We use “predicated execution”, which means that we pass a one-bit predicate to each operation in order to indicate whether its execution should be short-circuited into a no-op, this allows us to emulate diverging control between lanes without using branches. To handle this, the K virtual register is used and contains 16 predicate arguments for our regex instruction. In the subsequent example, the K register is mapped to the physical 16-bit mask register K1
. This same K register is also used to return the outcome of our regex expression: whether it matches or not.
Example Regex
We use Go code to construct, based on a regex, a specially crafted DFA, and we load this DFA into four vector registers, namely, ZMM21
, ZMM22
, ZMM23
, and ZMM25
. Let us first look at three regular expression patterns and the corresponding DFAs:
- a simple IP4 matching regex,
- a more complex IP4 matching regex, plus
- a case-insensitive search pattern for the name “Biden”.
-
^(?:[0-9]{1,3}\.){3}[0-9]{1,3}
-
^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
-
(?i)biden
Note that the graph of DFA1
shows 14 states, while the special fail-state is not shown. A fail-state is the state that the DFA transitions to when a character is read that the regex cannot match. These 15 states can be encoded using four bits. Out of the 19 edges, only two are distinct: ‘0..9’ representing the character range from ASCII ‘0’ to ‘9’, and ‘0x2E’ for ASCII ‘dot’. A special fail-edge that transitions the state to the fail-state is also not shown. These three distinct edges can be encoded with just two bits. As we will see shortly, the total of six bits is used in a map to transition to the next (4-bit) state in what we call a 6-bit DFA implementation.
Similar numbers are observed for DFA2
: 21 states (including the fail-state) which is encoded using five bits, and seven distinct edges (including the fail-edge) which is encoded using three bits. For the total of eight bits we will need an implementation that can handle a translation from 8 bits to the next (5-bit) state.
In DFA3
we have seven states and seven distinct edges, thus also a total of six bits. We will use this case-insensitive search pattern while discussing the code examples. The astute reader would simplify a regex ‘(?i)biden’
to ILIKE ‘%biden%’
, but we will use the regex anyway and compare the speed differences at the end.
Overview of the implementation
If only we had a fast way to map the current state (3 bits) and the next character (3 bits) to the next state of our automaton. This is where the VPERMB
instruction comes in handy. Using VPERMB
, we can translate a 6-bit key into an 8-bit value or, equivalently, map sixteen 6-bit values to sixteen 8-bit values with a single instruction. Another useful observation is that we can use a single VPERMI2B
to translate 64 ASCII values in a ZMM register into what we call a “character group”.
One typographic observation is that Go employs an unconventional assembler syntax derived from the Plan 9 operating system. While it shares similarities with other assembly languages, it also has its own unique syntax. For instance, register names differ from those in more common assemblers; for example, ZMM10
is referred to as Z10
, and RSI
is denoted by SI
. Additionally, the information flows from left to right, so the destination is generally the last register (such as in at&t syntax).
Steps for an 6-bit DFA implementation
To summarize, the process involves the following steps:
- Gather four ASCII values for 16 lanes.
- Translate the 64 ASCII values into 64 character groups.
- Merge the current state with the character group into a 6-bit lookup key.
- Find the next state using the 6-bit key.
- Repeat steps 3 and 4 for the other three ASCIIs in the lane.
- Check whether we are in an accepting state and jump if necessary to step 1 to iterate and gather more data for lanes that are still processing.
It is important to note that while DFA1
has a 6-bit key that maps to the next state, DFA2
requires an 8-bit lookup key. The code in step 4 can be adjusted to handle 7 or 8-bit alternatives.
Step 0: Init
An input for the following code is the mask register K1
with its bits set for active lanes, and we use K1
as an output to identify the lanes for which the DFA is accepting. We use the register Z10
to store the value 1 for all sixteen lanes, while K2
stores the lanes that are yet to be processed and are neither accepting nor failing, that is, these lanes are ‘todo’. We clear the accepting lanes by resetting K1
, as we have not identified any lane as accepting at this point. Additionally, we set the current state of the automaton to its initial state, which is 1, by storing this value in register Z7
.
KMOVW K1, K2 ;lane_todo := lane_active ;K2=lane_todo; K1=lane_active
KXORW K1, K1, K1 ;lane_active := 0 ;K1=lane_active
VMOVDQU32 Z10, Z7 ;curr_state := 1 ;Z7=curr_state; Z10=1
Step 1: Gather data
Step 1 consists of loading four characters from the lanes that still need to be processed. To achieve this, we gather the next four ASCII values for each such lane. The offset of our data is stored in SI
, while Z2
holds the individual offsets for the sixteen lanes.
loop:
KMOVW K2, K3 ;copy eligible lanes ;K3=scratch; K2=lane_todo
VPGATHERDD (SI)(Z2*1),K3, Z8 ;gather data ;Z8=data; K3=scratch; SI=data_ptr; Z2=data_off
Step 2: Translate to character groups
Step 2 involves translating the four ASCII values in register Z8
for the active lanes into 64 character groups. To achieve this, we use the VPERMI2B
instruction, which allows us to map a 7-bit key to an 8-bit value. As a result, the 7-bit ASCIIs can be transformed into any 8-bit value using one instruction. After the permutation, Z8
will contain the 64 character groups.
VPERMI2B Z22, Z21, Z8 ;classify into character groups ;Z8=data; Z21=char_table1; Z22=char_table2
It can be confusing to figure out the contents of registers Z21
and Z22
required to implement the DFA. Thankfully, the Sneller query engine has options to display the necessary register contents. The following dump shows the contents of the maps: The map on the left has two columns, and it illustrates how 128 ASCII values are mapped to an 8-bit byte value. These 8-bit values are our character group values. For example, ASCII ‘B’ and ‘b’ are mapped to 00001'000, ASCII ‘I’ and ‘i’ are mapped to 00011'000.
regex: (?i)biden
char-group map (3 bit) | update map (3 bit) |
char group char group | key next-node | node type
Z21 Z22 | Z23 |
00h -> 00010'000; '@' -> 00010'000; | 00000'000 -> 00000'000; | 00000'000 (fail-state)
01h -> 00010'000; 'A' -> 00010'000; | 00000'001 -> 00000'000; | 00000'001 (start-state)
02h -> 00010'000; 'B' -> 00001'000; | 00000'010 -> 00000'000; | 00000'010
03h -> 00010'000; 'C' -> 00010'000; | 00000'011 -> 00000'011; | 00000'011 (accept-state)
04h -> 00010'000; 'D' -> 00110'000; | 00000'100 -> 00000'000; | 00000'100
05h -> 00010'000; 'E' -> 00100'000; | 00000'101 -> 00000'000; | 00000'101
06h -> 00010'000; 'F' -> 00010'000; | 00000'110 -> 00000'000; | 00000'110
07h -> 00010'000; 'G' -> 00010'000; | 00000'111 -> 00000'000; |
08h -> 00010'000; 'H' -> 00010'000; | 00001'000 -> 00000'000; |
09h -> 00010'000; 'I' -> 00011'000; | 00001'001 -> 00000'010; |
0Ah -> 00010'000; 'J' -> 00010'000; | 00001'010 -> 00000'010; |
0Bh -> 00010'000; 'K' -> 00010'000; | 00001'011 -> 00000'011; |
0Ch -> 00010'000; 'L' -> 00010'000; | 00001'100 -> 00000'010; |
0Dh -> 00010'000; 'M' -> 00010'000; | 00001'101 -> 00000'010; |
0Eh -> 00010'000; 'N' -> 00101'000; | 00001'110 -> 00000'010; |
0Fh -> 00010'000; 'O' -> 00010'000; | 00001'111 -> 00000'000; |
10h -> 00010'000; 'P' -> 00010'000; | 00010'000 -> 00000'000; |
11h -> 00010'000; 'Q' -> 00010'000; | 00010'001 -> 00000'001; |
12h -> 00010'000; 'R' -> 00010'000; | 00010'010 -> 00000'001; |
13h -> 00010'000; 'S' -> 00010'000; | 00010'011 -> 00000'011; |
14h -> 00010'000; 'T' -> 00010'000; | 00010'100 -> 00000'001; |
15h -> 00010'000; 'U' -> 00010'000; | 00010'101 -> 00000'001; |
16h -> 00010'000; 'V' -> 00010'000; | 00010'110 -> 00000'001; |
17h -> 00010'000; 'W' -> 00010'000; | 00010'111 -> 00000'000; |
18h -> 00010'000; 'X' -> 00010'000; | 00011'000 -> 00000'000; |
19h -> 00010'000; 'Y' -> 00010'000; | 00011'001 -> 00000'001; |
1Ah -> 00010'000; 'Z' -> 00010'000; | 00011'010 -> 00000'110; |
1Bh -> 00010'000; '[' -> 00010'000; | 00011'011 -> 00000'011; |
1Ch -> 00010'000; '\' -> 00010'000; | 00011'100 -> 00000'001; |
1Dh -> 00010'000; ']' -> 00010'000; | 00011'101 -> 00000'001; |
1Eh -> 00010'000; '^' -> 00010'000; | 00011'110 -> 00000'001; |
1Fh -> 00010'000; '_' -> 00010'000; | 00011'111 -> 00000'000; |
20h -> 00010'000; '`' -> 00010'000; | 00100'000 -> 00000'000; |
'!' -> 00010'000; 'a' -> 00010'000; | 00100'001 -> 00000'001; |
'"' -> 00010'000; 'b' -> 00001'000; | 00100'010 -> 00000'001; |
'#' -> 00010'000; 'c' -> 00010'000; | 00100'011 -> 00000'011; |
'$' -> 00010'000; 'd' -> 00110'000; | 00100'100 -> 00000'101; |
'%' -> 00010'000; 'e' -> 00100'000; | 00100'101 -> 00000'001; |
'&' -> 00010'000; 'f' -> 00010'000; | 00100'110 -> 00000'001; |
''' -> 00010'000; 'g' -> 00010'000; | 00100'111 -> 00000'000; |
'(' -> 00010'000; 'h' -> 00010'000; | 00101'000 -> 00000'000; |
')' -> 00010'000; 'i' -> 00011'000; | 00101'001 -> 00000'001; |
'*' -> 00010'000; 'j' -> 00010'000; | 00101'010 -> 00000'001; |
'+' -> 00010'000; 'k' -> 00010'000; | 00101'011 -> 00000'011; |
',' -> 00010'000; 'l' -> 00010'000; | 00101'100 -> 00000'001; |
'-' -> 00010'000; 'm' -> 00010'000; | 00101'101 -> 00000'011; |
'.' -> 00010'000; 'n' -> 00101'000; | 00101'110 -> 00000'001; |
'/' -> 00010'000; 'o' -> 00010'000; | 00101'111 -> 00000'000; |
'0' -> 00010'000; 'p' -> 00010'000; | 00110'000 -> 00000'000; |
'1' -> 00010'000; 'q' -> 00010'000; | 00110'001 -> 00000'001; |
'2' -> 00010'000; 'r' -> 00010'000; | 00110'010 -> 00000'001; |
'3' -> 00010'000; 's' -> 00010'000; | 00110'011 -> 00000'011; |
'4' -> 00010'000; 't' -> 00010'000; | 00110'100 -> 00000'001; |
'5' -> 00010'000; 'u' -> 00010'000; | 00110'101 -> 00000'001; |
'6' -> 00010'000; 'v' -> 00010'000; | 00110'110 -> 00000'100; |
'7' -> 00010'000; 'w' -> 00010'000; | 00110'111 -> 00000'000; |
'8' -> 00010'000; 'x' -> 00010'000; | 00111'000 -> 00000'000; |
'9' -> 00010'000; 'y' -> 00010'000; | 00111'001 -> 00000'000; |
':' -> 00010'000; 'z' -> 00010'000; | 00111'010 -> 00000'000; |
';' -> 00010'000; '{' -> 00010'000; | 00111'011 -> 00000'000; |
'<' -> 00010'000; '|' -> 00010'000; | 00111'100 -> 00000'000; |
'=' -> 00010'000; '}' -> 00010'000; | 00111'101 -> 00000'000; |
'>' -> 00010'000; '~' -> 00010'000; | 00111'110 -> 00000'000; |
'?' -> 00010'000; 7Fh -> 00010'000; | 00111'111 -> 00000'000; |
Note: To obtain register dumps, use the -g2
flag, and for Graphviz dumps, use the -g3
flag. For example, the command sneller -g3 "SELECT * FROM 'file' WHERE column ~ '(?i)biden'"
generates the dump shown above. To generate graphs and register dumps with identical identifiers, use both -g2
and -g3
flags simultaneously.
If the data contains non-ASCII values that do not match our regex, we exclude these values by replacing the value in Z8
with a zero.
VPMOVB2M Z8, K3 ;extract non-ASCII mask ;K3=scratch; Z8=data
VPERMI2B Z22, Z21, Z8 ;classify into character groups ;Z8=data; Z21=char_table1; Z22=char_table2
VMOVDQU8 Z11, K3, Z8 ;set non-ASCII to zero ;Z8=data; K3=scratch; Z11=0
Step 3: Merge current state with character group into a lookup key
We opted for character groups that can be combined with the current state using a straightforward OR operation to create a lookup key.
VPORD Z8, Z7, Z9 ;merge into lookup_key ;Z7=curr_state; Z8=data; Z9=lookup_key
Step 4: Lookup the next state
Given that the lookup key occupies only the lower 6 bits of a byte, we can utilize VPERMB
to retrieve the next state.
VPERMB Z23, Z9, Z9 ;lookup_key to next_state ;Z9=next_state; Z23=trans_table1
The register dump above shows the contents of Z23
, which contains the translation table. Each line in the table represents a mapping from a character group and current state to the next state. The character group is represented before the quote ‘, and the state identifier is represented after the quote. For instance, the line 00001'001 -> 00000'010; indicates that the character group 00001 (which represents the ASCII ‘B’ and ‘b’) and the current state 001 (which is the start state node s1) are mapped to the next state 010, corresponding to node s2 in the graph.
Next we need to copy the value in Z9
(the next-state) into Z7
(the current-state) for valid values. However, we only want to overwrite the current-state with the next-state when there is at least one character remaining in the string we are matching against.
VPCMPD $5, Z10, Z3, K4 ;K4 := (data_len>=1) ;K4=char1_valid; Z3=data_len; Z10=1
VMOVDQU32 Z9, K4, Z7 ;curr_state := next_state ;Z7=curr_state; K4=char1_valid; Z9=next_state
Perform steps 3 and 4 again for the remaining three ASCII values.
VPORD Z13, Z7, Z9 ;merge into lookup_key ;Z7=curr_state; Z8=data; Z9=lookup_key
VPERMB Z23, Z9, Z9 ;lookup_key to next_state ;Z9=next_state; Z23=trans_table1
VPCMPD $5, Z26, Z3, K5 ;K5 := (data_len>=2) ;K5=char2_valid; Z3=data_len; Z26=2
VMOVDQU32 Z9, K5, Z7 ;curr_state := next_state ;Z7=curr_state; K5=char2_valid; Z9=next_state
VPORD Z14, Z7, Z9 ;merge into lookup_key ;Z7=curr_state; Z8=data; Z9=lookup_key
VPERMB Z23, Z9, Z9 ;lookup_key to next_state ;Z9=next_state; Z23=trans_table1
VPCMPD $5, Z27, Z3, K6 ;K6 := (data_len>=3) ;K6=char3_valid; Z3=data_len; Z27=3
VMOVDQU32 Z9, K6, Z7 ;curr_state := next_state ;Z7=curr_state; K6=char3_valid; Z9=next_state
VPORD Z15, Z7, Z9 ;merge into lookup_key ;Z7=curr_state; Z8=data; Z9=lookup_key
VPERMB Z23, Z9, Z9 ;lookup_key to next_state ;Z9=next_state; Z23=trans_table1
VPCMPD $5, Z20, Z3, K7 ;K7 := (data_len>=4) ;K7=char4_valid; Z3=data_len; Z20=4
VMOVDQU32 Z9, K7, Z7 ;curr_state := next_state ;Z7=curr_state; K7=char4_valid; Z9=next_state
Step 5: Update loop variables
Move the start position of the string being matched forward by 4 characters, and decrease the remaining length of the string by 4.
VPADDD Z20, Z2, Z2 ;data_off += 4 ;Z2=data_off; Z20=4
VPSUBD Z20, Z3, Z3 ;data_len -= 4 ;Z3=data_len; Z20=4
Step 6: Accept
To determine whether we are in an accepting state, we use register Z25
. Each state can have a value of 0xFF for an accepting state or 0 for a non-accepting state. In the above register dump, state 011 is accepting, indicated by a byte of 0xFF in position 0b011 (node 3) of Z25
. We translate the current-state into a byte of 0xFF for only those states specified in Z25. The resulting byte vector in Z26
will contain 0xFF for lanes that are accepted and 0 for lanes that are not. We store the accepted lanes that are still ’todo’ in K3
. Later, we subtract these newly accepted lanes from the ’todo’ lanes and add them to the ‘active’ lanes in our returning result K1
. The ’todo’ lanes need to be restricted to those lanes that have not reached the fail-state 0 and those lanes for which the string has not reached its end yet.
VPERMB Z25, Z7, Z6 ;translate state to accept_id ;Z6=accept_state; Z7=curr_state; Z25=accept_table
VPCMPUD $0, Z6, Z19, K2, K3 ;K3 := K2 & (0xFF==accept_state) ;K2=lane_todo; Z19=0xFF; Z6=accept_state
VPCMPUD $4, Z7, Z11, K2, K2 ;K2 &= (0!=curr_state) ;K2=lane_todo; Z11=0; Z7=curr_state
VPCMPD $1, Z3, Z11, K2, K2 ;K2 &= (0<data_len) ;K2=lane_todo; Z11=0; Z3=data_len
KANDNW K2, K3, K2 ;lane_todo &= ~scratch ;K2=lane_todo
KORW K1, K3, K1 ;lane_active |= scratch ;K1=lane_active
Are we done scanning?
Finally, we check if there are any remaining lanes left to be processed. If there are, we return to step 1 to collect fresh data and proceed with the matching process.
KTESTW K2, K2 ;any lane still todo? ;K2=lane_todo
JNZ loop ;jump if not zero (ZF = 0)
If we are done scanning, K1
contains the matching lanes.
The full source code for the 6-bit DFA implementation can be found in our GitHub repository.
Simple Benchmarking
A more detailed comparison of regex handling will be provided in a later blog post. But a brief comparison to case-insensitive string compare is also interesting.
ClickBench Dataset
We use the ClickBench dataset used by Clickhouse in their benchmark suite. The original data (hits.json.gz) is compressed 21.1GB, and uncompressed 227GB. We convert the dataset to the Sneller open-source self-describing compressed table format (.zion
), with the following command:
$ sdb pack -o=clickbench.iguana.zion -c=zion+iguana_v0 hits.json.gz
This results in a file clickbench.iguana.zion
of 11.5GB.
Sneller Query Tool
The sdb
tool will read the compressed data, decompress columns that it needs, and run the query. The -fmt=json
flag is to present the results as JSON, and the -v
is to print some statistics.
$ sdb query -v -fmt=json "SELECT COUNT(Title) FROM read_file('clickbench.iguana.zion') WHERE Title ILIKE '%biden%'"
23.1GiB/s (scanned 61984473088 B (57.73 GB) in 2.6245143s)
These results are with 16 threads on an 8 core Intel 11900K with 64GB DDR4-3200. For sake of comparison, we have switched off the query tool’s ability to simplify regex (?i)biden
to equivalent LIKE '%biden%'
.
$ sdb query -v -fmt=json "SELECT COUNT(Title) FROM read_file('clickbench.iguana.zion') WHERE Title ~ '(?i)biden'"
22.2GiB/s (scanned 61984473088 B (57.73 GB) in 2.7274303s)
The findings indicate that the performance of the regex is comparable to an equivalent ILIKE query, although the ILIKE is 4 percent faster. This may seem surprising, but the reason is that the ILIKE query also loads four ASCII characters for 16 lanes and performs skipping and matching operations, much like the regex. As a result, the performance differences between the two approaches are obscured by the data loading process. However, if the data (which is several GB uncompressed) could fit into the L1 cache, the ILIKE query would be around four times faster than the regex approach because it can scan four bytes and skip if the ‘b’ or ‘B’ character is not present.
Wrap up
We have devised an approach to create a regex engine that can handle 16 strings at once while reading data only once. The engine can determine if the strings are accepted, rejected, or require additional characters, without using branching and with a fixed set of assembly instructions. The size of the graph, including the nodes and edges, is not significant, as long as each node and unique edge can be encoded with 6, 7, or 8 bits. Moreover, the execution time to evaluate the regex on a string is only dependent on the length of the string and is comparable to the execution time of a ILIKE operation. A quick test demonstrates that the regex engine is not significantly slower than an equivalent ILIKE query.
In a followup blog post we will share details of a comparison with ripgrep, and we share details of fuzzy string search.
Try Sneller for Free
You can try Sneller right now on your own data for free through our playground.
If you’re a developer interested in the details of how Sneller works under the hood, you’re in luck: Sneller is open source software!