Semantic Search with SQL

by Phil Hofer | June 21, 2023

AI-powered applications are having a bit of a moment, and one area that is of interest to us as a query service is “vector databases.” Developers can use vector databases to search for semantically similar entities within a data-set, where semantic similarity is determined in advance by an AI model and encoded in vectors of numbers.

Over the course of a few weeks we were able to integrate “exact” vector search functionality into Sneller SQL. Our SQL engine is already designed around a high-performance pure-assembly virtual machine, so it wasn’t a stretch for us architecturally to use this virtual machine to compute distance measures on “vectors” (in our case, lists of numbers). The API we use for exposing this functionality is modeled on the Postgres pgvector extension: we implement the inner_product, cosine_distance, l1_distance, and l2_distance functions.

An Example: Semantic Similarity Search with Glove-6B

The GloVe project out of Stanford has some similarity data we can use for an easy demo. The vectors in this dataset were produced by having an AI model try to produce vectors such that the log-likelihood of co-occurence of two words in a text is equal to the dot product of their vectors. Thus, we can use the inner_product function to compute a “score” for the likelihood of co-occurrence of a pair of words.

We’ll use glove.6B.200d.txt for this demo since it’ll download reasonably quickly even on a slow internet connection.

$ wget
$ unzip glove.6B.200d.txt

The first line of glove.6B.200d.txt starts with

the -0.071549 0.093459 0.023738 -0.090339 0.056123 0.32547 -0.39796 -0.092139 0.061181 -0.1895 0.13061 0.14349 0.011479 ...

The word that begins the line (the) is the word associated with the 200-element number vector that follows it. The easiest way to create a Sneller table out of this data is to convert it to JSON using jq and then run sdb pack to build a table.

$ go install
$ jq -R 'split(" ")| {word: .[0], embedding: .[1:] | [.[] | tonumber]}' < glove.6B.200d.txt > glove.6B.200d.json
$ sdb pack -o glove.6B.zion -c 'zion+iguana_v0' glove.6B.200d.json

Since the GloVe models were constructed such that similar words have a larger dot-product, we can use order by inner_product(...) desc to produce a list of similar candidates. For example, here’s a query that looks up the embedding for simd and then produces a list of the next 10 most-similar words in the dataset. (We’re using OFFSET 1 because the most-similar word to any particular word is itself, so the first result is uninteresting.)

SELECT word,
       inner_product(embedding, (SELECT embedding
                                 FROM read_file('glove.6B.200d.zion') t2
                                 WHERE t2.word = 'simd' LIMIT 1)) AS similarity
FROM read_file('glove.6B.200d.zion')

In vector database terms, we’re performing an “exact search” here. Sneller doesn’t have support for approximate-nearest-neighbor vector clustering, so there’s no point in performing an inexact search.

If you’ve got an AVX-512-capable CPU, you can run a query like the one above directly from your terminal:

$ sdb query -fmt=json "select word from read_file('glove.6B.200d.zion') order by inner_product(embedding, (select embedding from read_file('glove.6B.200d.zion') t2 where
 t2.word = 'simd' limit 1)) desc limit 10 offset 1"
{"word": "floating-point"}
{"word": "x86"}
{"word": "64-bit"}
{"word": "32-bit"}
{"word": "integer"}
{"word": "sql"}
{"word": "architectures"}
{"word": "cpus"}
{"word": "superscalar"}
{"word": "mmx"}

GloVe’s model determined that technical terms like x86 and mmx are likely to co-occur with simd in text fragments in its training data.

Practical Applications

Specialized “vector databases” are often constrained in terms of the sorts of filtering that can be applied in advance of performing (expensive) vector search operations; in contrast, Sneller gives you the full expressive power of SQL. For example, it’s easy to combine semantic search with other features like geospatial search and fuzzy text search for complex real-world applications. An application that implements facial recognition using vector search could also use geospatial search to further bound the set of candidate matches.

Try Sneller for Free

You can try Sneller right now on your own data for free through our playground.

If you’re a developer interested in the details of how Sneller works under the hood, you’re in luck: Sneller is open source software!