
Semantic Search in PostgreSQL: An Overview

03.2025

PostgreSQL offers advanced capabilities through extensions like pgvector, which enable semantic search at a level and quality never achieved before. Unlike traditional text search, which mostly relies on trivial string comparison, semantic search in PostgreSQL goes beyond keywords by understanding context and
meaning, enhancing relevance.

With extensions such as pgvector and pgai, it is possible to combine the power of a relational database with the cutting-edge features of modern AI models.

Benefits of Semantic Search in PostgreSQL

The benefits of semantic search are plentiful:

  • Improved search accuracy and user experience
  • Better handling of ambiguous queries
  • Content recommendations
  • Personalized content

pgvector: Create a table for text data

After our brief introduction, it is time to take a look at the technical aspects and see how this can be done for real. For this demo we have used the excellent data set available here. It contains 3.6 million Amazon reviews in an easy-to-use CSV format.

The first thing we have to do is to load the extension into our database and create a table that we can use to store the data we want to query:
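The snippet did not survive in this copy of the article; it presumably looked similar to the following (the table and column names are assumptions; 384 matches the dimensionality of the model used later):

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE t_review (
    id        serial PRIMARY KEY,
    title     text,
    body      text,
    embedding vector(384)
);
```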

There are two important things to note here. The first observation is that the extension has been loaded into the database. Note that the name we have to use here is "vector", not "pgvector" as one might expect. The second noteworthy thing is that a new data type called "vector" is used here. However, the extension offers more data types:

  • vector: up to 2,000 dimensions
  • halfvec: up to 4,000 dimensions
  • sparsevec: up to 1,000 non-zero elements

For our use case, we need a fixed-size vector with 384 entries, which is what our model is going to return.

Importing and embedding data

The key thing to understand is the concept of "embedding". What does it mean? It is actually fairly simple: we take a string and transform it into a vector (= a list of numbers). It is essential to understand this when using the extension: pgvector by itself does not turn text into vectors, so we have to come up with some code to do this for us.

Python is a simple way to do exactly that. The following code snippet shows how this can be done:
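The listing itself is missing from this copy; a sketch of such an import script might look like this. The file name, the table layout (t_review), and the model (all-MiniLM-L6-v2, a common choice that returns 384-dimension vectors) are assumptions, not the original code:

```python
import csv
import sys

import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer


def main(csv_file: str) -> None:
    model = SentenceTransformer("all-MiniLM-L6-v2")  # 384 dimensions
    conn = psycopg.connect("dbname=demo")
    register_vector(conn)  # add vector support to psycopg
    cur = conn.cursor()
    with open(csv_file, newline="") as f:
        for i, (label, title, body) in enumerate(csv.reader(f)):
            # turn the review text into a vector and store it
            embedding = model.encode(title + " " + body)
            cur.execute(
                "INSERT INTO t_review (title, body, embedding) "
                "VALUES (%s, %s, %s)",
                (title, body, embedding),
            )
            if i % 10000 == 0:
                conn.commit()  # periodic COMMIT to watch the progress
    conn.commit()


if __name__ == "__main__":
    main(sys.argv[1])
```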

What is this code doing, exactly? Let us go through it step by step. First, we import a couple of libraries, such as psycopg, the Python library we use to talk to PostgreSQL.
Then we have to run:
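The snippet is missing here; the line in question is presumably the driver registration shipped with the pgvector Python package:

```python
from pgvector.psycopg import register_vector

register_vector(conn)  # conn is an open psycopg connection
```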

Now, this line is important because it adds vector support to psycopg; otherwise it is not possible to handle vectors nicely. Once this is done, we need a model to transform the inputs. For this, HuggingFace is the place to go: it hosts models that can be used with the SentenceTransformer class we have just imported.

Next, we can register the additional data types with the driver, open a cursor, and we are ready to go. We can simply open the CSV file and loop over it; each line contains one review in a convenient format. The real magic happens here:
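The original lines are not preserved; they presumably combine the encoding step with the INSERT (the table and column names are assumptions):

```python
# title + body -> vector of 384 numbers, then straight into the table
embedding = model.encode(title + " " + body)
cur.execute(
    "INSERT INTO t_review (title, body, embedding) VALUES (%s, %s, %s)",
    (title, body, embedding),
)
```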

In this line, we feed the title and the body to our HuggingFace model and get back a vector of numbers, which is then simply inserted into the table. Every couple of thousand lines, we issue a COMMIT. Strictly speaking, this is not needed, but it allows us to run some basic tests on the already-imported data while the embedding is still in progress. Note that most of the time is consumed not by the database but by the model in use, so doing some work in parallel is definitely helpful.

The following snippets show how we can run the code:
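The original snippets are not preserved here; invoking the script could look like this (the script name embed_reviews.py is an assumption):

```shell
# single-core run, timed
$ time python3 embed_reviews.py reviews.csv

# or split the file and embed the chunks in parallel
$ split -n l/4 reviews.csv part_
$ for f in part_*; do python3 embed_reviews.py "$f" & done; wait
```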

Note that the execution time is fairly high - it can therefore make sense to run those things in parallel to make use of all the CPUs on your machine. In this case, everything was running on a single core.

pgvector: What do vectors and similarity mean?

Running the script takes a fair amount of time. The question many people ask is: What does the result actually look like? Well, here is an example:
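The original output is not preserved in this copy; a quick way to inspect a stored review together with the start of its embedding is a query such as this (the table and column names are assumptions):

```sql
SELECT id, title, left(embedding::text, 60) AS embedding_prefix
FROM   t_review
LIMIT  1;
```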

Note that the input data contains typos. We will get back to this later. We cannot rely on all input data always being 100% correct.

The important part is: each text is represented as a vector of numbers. The goal is to find vectors that are close to each other. But what does "close" mean in this context?

First of all, there are various ways to calculate distance. Here are some options:

Semantic Search blog graphic 1

"dot product" and "cosine distance" might be something you have seen already. The image above shows what it actually means.
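As a quick illustration of these measures, here is how the dot product, cosine distance, and Euclidean (L2) distance can be computed for two small vectors in plain Python:

```python
import math


def dot(a, b):
    # inner product: large when vectors point in the same direction
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    # 1 - cosine similarity: 0 for parallel vectors, 2 for opposite ones
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return 1.0 - dot(a, b) / norm


def l2_distance(a, b):
    # Euclidean (straight-line) distance
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a, b = [1.0, 0.0], [0.0, 1.0]
print(dot(a, b))              # 0.0
print(cosine_distance(a, b))  # 1.0 (orthogonal vectors)
print(l2_distance(a, b))      # 1.4142...
```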

Let us focus our attention on the idea of "similarity". Here is a simple introduction to the concept:

Semantic Search blog graphic 2

Obviously, this all happens in hyperspace, but it is quite useful to have a simple idea in mind.

Running similarity search in PostgreSQL

After importing the data and preparing everything, it is time for the big moment: We can write a program that will be turning a real-world search string into a query. The following piece of code returns the best results, given a certain search string:
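The program itself did not survive in this copy; a sketch could look as follows. The model name (all-MiniLM-L6-v2), the table layout (t_review), and the error message are assumptions, not the original code:

```python
import sys

import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer


def main() -> None:
    if len(sys.argv) != 2:
        sys.exit("lost in hyperspace: please pass exactly one search string")
    model = SentenceTransformer("all-MiniLM-L6-v2")  # must match the import script
    conn = psycopg.connect("dbname=demo")
    register_vector(conn)
    cur = conn.cursor()
    embedding = model.encode(sys.argv[1])  # search string -> vector
    cur.execute(
        """SELECT title, body
           FROM t_review
           ORDER BY embedding <=> %s
           LIMIT 10""",
        (embedding,),
    )
    for title, body in cur.fetchall():
        print(title, "-", body[:80])


if __name__ == "__main__":
    main()
```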

Much of this code is just like before. We define the model, connect to the database, fetch a command line argument, and issue an insanely cool error message in case no argument is passed to our Python program. But where does the real magic happen? Here it is:
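The key lines are presumably equivalent to this (the identifiers are assumptions):

```python
embedding = model.encode(search_string)  # the SAME model as during the import
cur.execute(
    "SELECT title, body FROM t_review ORDER BY embedding <=> %s LIMIT 10",
    (embedding,),
)
```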

We use the SAME model as before to turn the search string into a vector. Then we can run what is generally known as "KNN search" (k-nearest-neighbor search): sort by the distance to the desired element and return the top results. This is exactly what the loop towards the end does.

One question people ask is: What is this strange "<=>" operator in the ORDER BY clause? We mentioned the idea of "distance" before. The following operators are available and provided by pgvector:

  • L2 distance: <->
  • (negative) inner product: <#>
  • cosine distance: <=>
  • L1 distance: <+>

In this case, we went for cosine distance.

Trying it all out

After this lengthy and hopefully useful introduction, it is time to run the code and see what it can do for us. In my showcase, I want to find all reviews that match the string "mooney airplane navigation literature". The idea is to find all reviews that might be related to aircraft made by the Mooney Aircraft company and have something to do with books on navigation. The result looks as follows:

Note that not a single one of those entries is actually related to Mooney aircraft in particular. This makes sense, because navigation is not tied to a brand at all. What we can also see is that the word "literature" was not found - however, the system has understood that "literature" seems to be related to "books", and
so in this context we get a ton of book reviews (which is what exists in the data). In short: The result is relevant and far superior to what standard FTS (= Full Text Search) would have given us.

Speeding things up with vector indexes

The problem at the moment is that PostgreSQL has to scan an entire table containing 3.6 million rows to find those results. Obviously, this is not too efficient, so it makes sense to create an index:
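The statement itself is missing from this copy; with pgvector, an HNSW index for cosine distance can be created like this (the table and column names are assumptions):

```sql
SET maintenance_work_mem = '64GB';

CREATE INDEX ON t_review USING hnsw (embedding vector_cosine_ops);
```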

My machine is big enough to build the index quickly, using 64 GB of maintenance_work_mem, the parameter that controls how much memory is available for index creation. Once the index is deployed, we should see significantly improved database performance. Keep in mind that the Python script will still need some time to handle things (loading libraries, models, and so on), but the database side will be extremely fast.

© 2025 CYBERTEC PostgreSQL International GmbH