Sneller Regex vs Ripgrep

by Henk-Jan Lebbink | May 1, 2023

Benchmarking Ripgrep and Sneller Regex

In this post we will briefly compare the raw speed of ripgrep with Sneller’s SQL regular expression engine.

Ripgrep is a popular and powerful command-line search tool that recursively searches directories for lines of text that match a given regular expression pattern. It is designed to be fast and efficient, even when searching through large files or directories with many files. On the other hand, the Sneller query engine is different, although it can be used for comparable tasks. The query engine has a command-line SQL query tool designed to processing very large compressed JSON files, and in this comparison, we will be using the ~ POSIX SQL regex operator.

When dealing with large amounts of data, having a single file of over 100GB can be inconvenient. With ripgrep you would organize your files in a directory tree that you can query easily. This is not different for Sneller: you would organize a petabyte of storage with, for example, millions of 1GB S3 objects; it will be treated as single storage unit. However, for the sake of convenience and reproducibility on a laptop, we will be performing our testing on a single 12GB text file.

What is Sneller’s regex engine and why is it fast?

The Sneller regular expression engine leverages the parallel processing capabilities of AVX-512 by utilizing 16 parallel lanes to perform acceptance checks without the need for branching. This enables 16 separate match processes to be executed concurrently on a single core. In addition, we eliminated the need for backtracking during acceptance checks, resulting in data only needing to be traversed once. This is achieved by transforming the underlying Deterministic Finite Automaton (DFA) such that backtracking is not necessary.

Lastly, Sneller’s regex engine is optimized for the Intel Icelake/Zen4 processor and is implemented entirely in AVX-512 assembly language. However, there is also a highly efficient implementation available for older Intel SkylakeX (or compatible AMD) hardware, which is only around 20% slower than the Icelake implementation. We have previously published a blog post that delves into the technical aspects of our engine’s regex implementation.

Existing comparison of RipGrep and other grep tools

An existing small comparison exists on the ripgrep GitHub repository in which ripgrep is compared to other grep tools such as GNU grep. We will use the same regex and the same dataset to reproduce the ripgrep results. The test data, en.txt.gz is about 12.8GB of uncompressed text, and gzipped about 3.6GB. The speed results as provided on the GitHub repository suggest that the data was unzipped, so we reproduce the following results (with 16 threads on an 8 core Intel 11900K with 64GB DDR4-3200, ripgrep version 13.0.0, and sneller build with date: 2023-05-17T18:05:14Z):

$ rg --stats -q 'Sherlock [A-Z]\w+' c:\en.txt
7909 matches
7882 matched lines
1 files contained matches
1 files searched
0 bytes printed
13113340782 bytes searched
2.871474 seconds spent searching
3.899706 seconds

Out of the total spend time of 3.89 seconds, the search process accounts for 2.87 seconds. I always assumed that the remaining time (here one second) was spent on file I/O. These results were obtained using an SSD (Seagate FireCuda 530, 2TB).

To evaluate the impact of slower storage, the search was repeated using a Seagate Exos X16 16TB, 512e, SATA 6Gb/s spinning disk. The following equal results are obtained:

$ rg --stats -q 'Sherlock [A-Z]\w+' d:\en.txt
[...]
2.889316 seconds spent searching
3.908904 seconds

Interestingly, the additional one-second time spent beyond searching does not seem to be related to file I/O.

Another observation is that ripgrep does not use multi-threading for single files. This might have gone unnoticed because ripgrep’s typical use-case involves crawling directories with multiple threads to find matches. This makes it challenging to make fair comparisons, particularly since Sneller is designed to run on high-core count nodes and in clusters.

Another interesting observation is that ripgrep’s performance suffers significantly when handling compressed data. For example:

$ rg --stats -z -q 'Sherlock [A-Z]\w+' c:\en.txt.gz
[...]
60.929102 seconds spent searching
60.943493 seconds

This could also have gone unnoticed if users keep all their files uncompressed, or use transparent disk compression available with ZFS or Windows NTFS drive compression. However, I personally prefer not to compress everything just for convenience. It should be a conscious decision when the costs of compression and decompression outweigh the benefits of saving space. But if someone’s use-case is to frequently search a source tree with a regex (?i)(TODO|BUG), this limitation may not be a deal-breaker.

The example regex with the Sneller regex engine

Let’s take a closer look at the regex we are using. To generate a graphviz graph for the regular expression Sherlock [A-Z]\w+, we use the Sneller query tool by adding the -g3 flag to the command sneller -g3 "SELECT * FROM 'file' WHERE column ~ 'Sherlock [A-Z]\\w+'". Note that we need to escape the backslash in PowerShell, thus \\.

Although the resulting .dot and .svg files may be somewhat clunky, we can still observe from the graph that the number of nodes and edges are small enough to use the branchless IceLake implementation. In this particular case, we only need 8 bits to encode the number of nodes and the number of distinct edges, enabling the tool to use (what we call) the 8-bit DFA implementation. For more details on how this works, see our post on regex implementations.

Comparing Sneller to Ripgrep

The Sneller query engine supports native compression support. Although it may seem unfair to enable compression since ripgrep does not require decompression overhead, we always prefer to use compression. This is because at petabyte scale (in the cloud) storage space and network capacity are the limiting resources. In clouds, compression reduces end-to-end times. Thus, we will compare Sneller loading the compressed file, decompressing it, and running the regex, whereas ripgrep will load the normal size (uncompressed) file, and run the regex.

Preparing the Data

Sneller uses JSON as its default format, so we convert the data into a JSON file, wrapping each line of the original text into a row with a single field named “Line”. This file en.json is compressed to a file named en.iguana.zion with a size of 2.5GB. The .zion extension is the Sneller open-source self-describing compressed table format. It’s worth noting that en.json contains more bytes (the ones used for the brackets, for the string "Line", etc.) and yet en.iguana.zion is still 30% smaller than the original compressed .gz file. However, other compressed formats can also be used.

For completeness, with the command $ type en.txt | jq --raw-input '{"Line": .} >> en.json', the file en.json is created with content:

{
  "Line":"Presented by IM Pictures,"
}
{
  "Line":"Produced by Shin Cine"
}
[...]

And we compress en.json with the pack command to file en.iguana.zion:

$ sdb pack -o=en.iguana.zion -c=zion+iguana_v0 en.json

Speed Comparison Multi-threaded

As shown above, we have the following results for ripgrep, run with an SSD.

$ rg --stats -q 'Sherlock [A-Z]\w+' c:\en.txt
[...]
2.871474 seconds spent searching
3.899706 seconds

For the sdb query tool, to estimate the disk access time without compression, we run a COUNT(*) on the entire dataset. The COUNT(*) is optimized to count rows without the need to decompress the data.

$ sdb query -fmt=json -v "SELECT COUNT(*) FROM read_file('c:\en.iguana.zion')"
{"count": 441450449}
25.9GiB/s (scanned 14078181376 B (13.11 GB) in 531.5734ms)

To estimate the disk-access time and the time required to decompress the Line column, we execute a COUNT(Line) query.

$ sdb query -fmt=json -v "SELECT COUNT(Line) FROM read_file('c:\en.iguana.zion')"
{"count": 441450449}
12GiB/s (scanned 14078181376 B (13.11 GB) in 1.1459875s)

Finally, we execute the entire query, which includes running the regular expression.

$ sdb query -fmt=json -v "SELECT COUNT(Line) FROM read_file('c:\en.iguana.zion') WHERE Line ~ 'Sherlock [A-Z]\\w+'"
{"count": 7882}
10.1GiB/s (scanned 14078181376 B (13.11 GB) in 1.3575434s)

Due to an optimization of COUNT(*), data does not need to be decompressed to retrieve the number of lines, and we use this feature to estimate that disk I/O took about 0.53 seconds. However, when counting the number of rows with a non-null value in the Line column, the content of that column does need to be decompressed, which took an additional 0.61 seconds.

The entire process took 1.35 seconds, leaving us with an approximate processing time of 0.74 seconds for the regex. Sneller is 2.8x faster than ripgrep.

Although Sneller had to decompress 13GB of JSON data, it used 16 threads on 8 cores not just for decompression, but also for regex handling, while ripgrep only used a single thread for the regex. Despite these differences, the execution speed advantage of Sneller remains significant.

Speed Comparison Multi-threaded and Single-threaded

Next we are going to compare three regexes, and a fair question is how Sneller fares with only one single thread. Regex2 matches when a simple IP4 address is present in the prefix of the data, while Regex3 matches when an IP4 address is contained in the data.

Regex1: Sherlock [A-Z]\\w+
Regex2: ^(?:[0-9]{1,3}\\.){3}[0-9]{1,3}
Regex3: (?:[0-9]{1,3}\\.){3}[0-9]{1,3}

	Regex1	Regex2	Regex3
ripgrep	3.899s	23.983s	21.931s
Sneller 16-threads	1.357s	1.330s	2.051s
Sneller 1-thread	10.024s	8.317s	21.044s

First obvious observation is that multi-threading offers clear performance benefits over single-threaded ripgrep, but the situation for single-threaded performance is more complex.

For Regex1, Sneller is estimated to take 1.45 seconds for disk I/O, 5.32 seconds for decompression, and 3.25 seconds for regex handling with a single thread, resulting in a total time of 10.02 seconds. Compared to ripgrep, which takes around 60 seconds to handle compressed input, Sneller’s performance is impressive. However, when using a single thread, ripgrep appears to be slightly faster. The reason for this is that ripgrep uses the Boyer-Moore string search algorithm (See footnote ¹), which is a pattern matching algorithm that is often used for searching for substrings within larger strings. It is particularly efficient when the pattern being searched for is relatively long and the alphabet of characters being searched over is relatively small. Sneller does not use this substring search algorithm and as a result is slower than ripgrep with substrings. However, when long substrings are not present, Sneller outperforms ripgrep.

For Regex2, assuming that Sneller takes 6.4 seconds for I/O and decompression with a single thread, only 1.9 seconds are left for regex handling. With 16 threads, Sneller only needs 0.2 seconds for regex handling, and these numbers scale approximately with the number of threads. In contrast, ripgrep needs almost 24 seconds for Regex2.

For Regex3, the total time needed is almost equal for single threads, but considering that Sneller needs 5.0 seconds for decompression, Sneller still outperforms ripgrep significantly.

Wrap-up

We observed a significant (10x faster) and moderate (2x faster) performance improvement with Sneller compared to ripgrep when using 16 threads for the provided regexes. However, the performance comparison becomes more complex with single threads. Sneller outperforms ripgrep for regexes without substrings, while with single threads regexes with long substrings perform better with ripgrep. We do not have any knowledge regarding why the dataset could be specifically problematic for ripgrep, nor especially favorable for Sneller. We believe that the dataset and the regular expression are suitable for the use-cases of both systems.

Why do the two tools have different speeds? There could be several reasons related to their intended use-cases. Ripgrep is designed to search large directory trees for files, whereas Sneller is meant to search multi terabyte-sized datasets of deeply nested semi-structured JSON. Ripgrep parallelizes well when crawling directories but not when searching small files (less than 10MB). Ripgrep uses only one thread when searching a single file, while Sneller distributes the workload over processors and uses all available threads. Additionally, ripgrep uses AVX2 and does not take advantage of AVX-512 instruction sets, but this can be forgiven given the specialized skills required for handcrafting for SkylakeX and Icelake/Zen4 processors.

Comparing the raw execution speed between Sneller and ripgrep is difficult because ripgrep does not support decompression at competitive performance. This raises the question of whether ripgrep is well-suited for searching large amounts of text, such as tens of terabytes distributed over multiple objects directly out of object storage. These systems are typically organized with a specific goal, such as storage, and if the goal is search optimization, storing text files uncompressed may be too expensive.

In summary, Sneller demonstrates faster performance with text files exceeding several gigabytes, thanks to its ability to leverage multi-threading and optimized hardware utilization, despite the performance penalty needed for decompressing data on-the-fly.

The ripgrep author let us know that ripgrep does not use Boyer-Moore (anymore), but uses heuristic approaches.) ↩︎

Try Sneller for Free

You can try Sneller right now on your own data for free through our playground.

If you’re a developer interested in the details of how Sneller works under the hood, you’re in luck: Sneller is open source software!

Intro

SQL

Onboarding

Cloud

The Hard Way