by Henk-Jan Lebbink | May 10, 2023
Branchless Code with AVX-512
Single-instruction multiple data (SIMD) programming involves using a single instruction to operate on wide registers that can simultaneously contain vectors of values. To avoid using conditional branches in code, we programmers can use branchless programming. Instead of relying on if/else statements or loops to implement conditional logic, branchless programming leverages arithmetic and logical operations to achieve the same result.
Predicated Instruction Execution
In AVX-512, predication is a key feature that enables branchless programming. By allowing instructions to be conditionally executed based on a mask register (K register), predication makes it possible to implement conditional logic without branching. This can lead to improved performance by reducing branch mispredictions and pipeline stalls.
The K register is a 64-bit register that contains 1 bit for each element in a vector. If the corresponding bit in the K register is set to 1, the element is considered “active” and the instruction is executed on that element. If the corresponding bit is set to 0, the element is considered “inactive” and the instruction is skipped for that element. Colloquially we would say that a lane is either dead or alive. Generally, the Sneller code uses 16 lanes, this allows the code to load upto 16 data values from different locations with a single instruction, and do all sort of computations with these 16 lanes. Some lanes can become inactive while computing and comparing stuff, and once in a while the code optionally compresses threads with many dead lanes and fills up available lane slots.
An example can help clarify things. We will use the assembler used by the Go programming language, which has an unconventional syntax derived from the Plan 9 operating system. For example, ZMM8
is referred to as Z8
, and RSI
is denoted by SI
. In addition, the information flows from left to right, so the destination is generally the last register (similar to the at&t syntax).
16 Lanes
Below are some code snippets and an explanation of how they work. We will be creating code that converts lowercase ASCII values to uppercase, which is a function that we use frequently in our codebase. First, we load 16 values of 32 bits each into data register Z8
. Each 32-bit value contains 4 ASCII bytes for all 16 lanes. Next, we check whether each of the 64 ASCII bytes is a lowercase character. If a byte is lowercase, we convert it to uppercase and store the result in Z26
.
KMOVW K1, K3 ; copy eligible lanes ;K3=scratch; K1=lane_active
VPGATHERDD (SI)(Z2*1),K3, Z8 ; gather data ;Z8=data; K3=scratch; SI=data_ptr; Z2=data_off
Assuming that Z2
contains the 16x 32-bit offsets of the data that we want to process, and SI
is a 64-bit pointer to this data, we can use the VPGATHERDD
instruction to load the content at address SI
+ the value in Z2
for each active lane in K3
. The results of the operation are stored in Z8
, and K3
is overwritten with error codes. To avoid losing the contents of K1
, we made a copy of it with the KMOVW
instruction. If all 16 lanes in K1
are active, we will have 64 freshly loaded ASCII values in Z8
. However, if a lane is inactive, Z8
will contain stale content for that lane, but that should not matter for the following code.
64 Lanes
VPCMPB $5, Z16, Z8, K3 ; K3 := (data>=char_a) ;K3=tmp_mask; Z8=data; Z16=char_a
VPCMPB $2, Z17, Z8, K3, K3 ; K3 &= (data<=char_z) ;K3=tmp_mask; Z8=data; Z17=char_z
Let’s assume that Z16
contains 64 ASCII ‘a’ values, and Z17
contains 64 ASCII ‘z’ values. The first VPCMPB
instruction compares each byte of the data with the character ‘a’, and stores the results in the 64-bit mask register K3
. Note that loading the data uses 16 lanes, while the byte comparison interprets the data as having 64 lanes. The second VPCMPB
instruction computes the conjunction of the first comparison with whether the data is smaller or equal to character ‘z’. The resulting mask register K3
has a value of 1 for every lane with a lowercase ASCII character, and 0 for all other values.
Next, we want to set the 6th bit to 0 for all active lanes in K3
. One way to achieve this would be to use a hypothetical VPANDB
instruction, which would perform a predicated AND-operation with byte granularity. Specifically, every active lane would be updated with the result of AND-masking with 0b1101'1111
.
; HYPOTHETICAL VPANDB!
VPANDB Z15, Z8, K3, Z8 ; Z8 &= Z15 ;Z8=data; Z15=c_b11011111
512 Lanes
We will use a VPERNLOGD
instruction instead of VPANDB
since the latter doesn’t exist. To broadcast the 64 bits in K3
to the 64 bytes in Z26
, we will use a VPMOVM2B
instruction. This means that each set bit in K3
will become a 0xFF
byte in the corresponding lane, and a cleared bit will become a 0x00
byte. The next step may seem complex, but it simply involves applying boolean logic.
VPMOVM2B K3, Z26 ; mask with selected chars ;Z26=scratch_Z26; K3=tmp_mask
VPTERNLOGD $0b01001100,Z15, Z8, Z26 ; dark art ;Z26=scratch_Z26; Z8=data; Z15=c_0b00100000
During code review, I am often asked how I arrived at these magic numbers 0b01001100
. I use a truth table, which describes when to set a bit to 0. It’s worth noting that for bitwise operations, each bit of an 8-bit byte can be considered a lane by itself. In this particular case, we aim to solve the problem of converting ASCII characters to uppercase using 512x 1-bit lanes. The VPERNLOGD
instruction takes Z26
as input, and replaces those values with the computed results. The third column Z26
is the input column, the right most column Z26
is the output column.
Z15 | Z8 | Z26 | result: Z26 | |
---|---|---|---|---|
0 | 0 | 0 | 0 | |
0 | 0 | 1 | 0 | |
0 | 1 | 0 | 1 | |
0 | 1 | 1 | 1 | |
1 | 0 | 0 | 0 | |
1 | 0 | 1 | 0 | |
1 | 1 | 0 | 1 | |
1 | 1 | 1 | 0 | (change from lower to upper) |
Consider the first row of the truth table. If the bit in Z15
is 0, then the corresponding lane does not correspond to the 6th bit in the byte. In this case, we want to leave the bit in data Z8
unchanged, which means that the output column Z26
will simply equal the bit from data Z8
. The same is true for the 2nd, 3rd, and 4th rows of the truth table.
In the 5th and 6th rows, the bit in Z15
is set, which means we may want to clear the corresponding data bit. However, if the data bit is already cleared, the result will also be 0.
In the 7th row, we would normally clear the bit (since Z15
is 0) if the corresponding data bit is set (Z8
is 1). However, if the lane does not correspond to a lowercase ASCII character (since Z26
is 0), we will leave the data bit unchanged (output Z26
is 1).
Finally, in the last row, all three conditions hold: the lane corresponds to the 6th bit (Z15
is 1), the corresponding data bit is set (Z8
is 1), and the lane corresponds to a lowercase ASCII character (Z26
is 1). In this case, the result is a bit set to 0.
The output column 0b01001100
holds the 8-bit immediate value that is required for the specific binary function being implemented.
Wrap-up
The six lines of assembly code shown above enable the loading of 64 ASCII values from 16 lanes and their subsequent conversion to upper case. Although further processing is needed to make use of this transformed data, such as performing a case-insensitive string comparison, it is noteworthy that this was achieved without the need for loops or conditional branching. However, it should be noted that the most time-consuming aspect of this process is typically getting the data into the 16 lanes, which in the Sneller code involves a VPGATHERDD
operation with a latency of around 30 cycles. Once the data is properly arranged within the vector registers, predicated computation can quickly make up for the initial investment of gathering the data. We at Sneller achieve multi GB/s/core throughput with complex string functions such as regular expressions and Unicode aware case-insensitive searching.
Try Sneller for Free
You can try Sneller right now on your own data for free through our playground.
If you’re a developer interested in the details of how Sneller works under the hood, you’re in luck: Sneller is open source software!