PostgreSQL offers advanced capabilities through extensions such as pgvector, which bring semantic search to the database at a quality that was previously hard to achieve. Unlike traditional text search, which mostly relies on plain string comparison, semantic search goes beyond keywords: it takes context and meaning into account and therefore returns more relevant results.
With extensions such as pgvector and pgai it is possible to combine the power of a relational database with cutting-edge features of modern AI models.
The benefits of semantic search are plentiful.
After this brief introduction, it is time to take a look at the technical aspects and see how this can be done for real. For the purpose of this demo we have used the excellent data set available here. It contains 3.6 million Amazon reviews in an easy-to-use CSV format.
The first thing we have to do is to load the extension into our database and create a table that we can use to store the data we want to query:
cybertec=# CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION
cybertec=# CREATE TABLE t_document (
    id         serial PRIMARY KEY,
    polarity   float,
    title      text,
    body       text,
    embedding  vector(384)
);
CREATE TABLE
There are two important things to note here. The first observation is that the extension has been loaded into the database. Note that the name we have to use here is "vector", not "pgvector" as one might expect. The second noteworthy thing is that a new data type called "vector" is used. The extension provides additional data types as well, but for our use case we need a fixed-length vector with 384 entries, which is what our model is going to return.
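To get a feeling for the data type, here is a minimal sketch (the numbers are made up purely for illustration). pgvector enforces the declared number of dimensions, so a value with the wrong length is rejected:

-- a valid three-dimensional vector
SELECT '[1,2,3]'::vector(3);

-- fails, because only 3 dimensions are provided instead of 384
SELECT '[1,2,3]'::vector(384);

-- the same check applies to INSERTs into t_document,
-- since the embedding column was declared as vector(384)

This is exactly why the table definition above uses vector(384): the column has to match the output size of the embedding model.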
The key thing to understand is the concept of an "embedding". What does it mean? It is actually fairly simple: we take a string and transform it into a vector (a list of numbers). It is essential to understand this when using the extension: pgvector itself does not turn text into vectors inside PostgreSQL, so we have to come up with some code that does this for us.
Python is a simple way to do exactly that. The following code snippet shows how this can be done:
#!/usr/bin/python3

import os
import csv

from pgvector.psycopg import register_vector
import psycopg
from sentence_transformers import SentenceTransformer

# -----------------------------------------
# EDIT HERE
file = '/path/amazon_reviews/train.csv'
model_name = "multi-qa-MiniLM-L6-cos-v1"

# Set up database connection parameters
db_params = {
    'host': 'localhost',
    'database': 'cybertec',
    'user': 'your_user',
    'password': 'your_password',
    'port': '5432'
}
# -----------------------------------------

def main():
    # create model for embedding
    model = SentenceTransformer(model_name)

    # Create the connection
    conn = psycopg.connect(
        host=db_params['host'],
        dbname=db_params['database'],
        user=db_params['user'],
        password=db_params['password'],
        port=db_params['port']
    )
    register_vector(conn)
    cur = conn.cursor()

    lines = 0
    with open(file, 'r', encoding='utf-8') as csvfile:
        csv_reader = csv.reader(csvfile, delimiter=',')

        for row in csv_reader:
            lines += 1
            # turn title + body into a 384-dimensional vector
            emb = model.encode(row[1] + " " + row[2])
            cur.execute("INSERT INTO t_document (polarity, title, body, embedding) VALUES (%s, %s, %s, %s)",
                        (row[0], row[1], row[2], emb))

            # commit every 1000 rows to make progress visible
            if lines % 1000 == 0:
                print("rows imported: %s" % (lines))
                conn.commit()

    conn.commit()

if __name__ == "__main__":
    main()
What is this code doing, exactly? Let us go through it step by step. First, we have to import a couple of libraries, among them psycopg, the Python driver we use to talk to PostgreSQL.
Then we have to run:
from pgvector.psycopg import register_vector
This line is important because it adds vector support to psycopg; without it, the driver cannot handle the vector type nicely. Once this is done, we need a model to transform the inputs. Hugging Face is the place to go: it hosts models that can be used with the SentenceTransformer class we have just imported.
Next, we can register the additional data types with the driver, open a cursor, and we are ready to go. We can simply open the CSV file and loop over it. Each line contains one review in a convenient format. The real magic happens here:
emb = model.encode(row[1] + " " + row[2])
In this line, we feed the title and the body to our Hugging Face model and get back a vector of numbers, which is then simply inserted into the table. Every thousand lines, we issue a COMMIT. Strictly speaking, this is not needed, but it lets us watch the import progress and run some basic tests while the embedding is still going on. Note that most of the time is not spent in the database but in the model, so doing some of the work in parallel is definitely helpful.
The following snippet shows how we can run the code:
$ time ./handle_comments.py
rows imported: 1000
rows imported: 2000
...
rows imported: 3599000
rows imported: 3600000

real    325m35.287s
user    311m50.151s
sys     2m11.664s
Note that the execution time is fairly high, so it can make sense to run the import in parallel to make use of all CPUs in your machine. In this case, everything was running on a single core.
The question many people ask is: what does the result actually look like? Well, here is an example:
cybertec=# \x
Expanded display is on.
cybertec=# SELECT * FROM t_document LIMIT 1;
-[ RECORD 1 ]-----------------------------------------------------------------
id        | 1
polarity  | 2
title     | Stuning even for the non-gamer
body      | This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen!
embedding | [-0.051104408,-0.06477496,0.053947136,-0.090758994,-0.11248113,
            0.021052703,0.040592767,0.033487927,0.008018044,0.030972324,
            -0.09552986,-0.041627012,-0.032203916,-0.08294443,-0.0076782834,
            0.039493423,0.06395606,-0.014506877,0.023361415,-0.022605361,
            ...
            0.0133066205,-0.003375273,-0.003283798,-0.08727879,0.02891936]
Note that the input data contains typos. We will get back to this later. We cannot rely on all input data always being 100% correct.
The important part is: each text is represented as a vector of numbers. The goal of semantic search is to find vectors that are close to each other. But what does "close" mean in this context?
First of all, there are various ways to calculate the distance between two vectors: Euclidean (L2) distance, the inner ("dot") product, and cosine distance are the most common options. "Dot product" and "cosine distance" in particular might be something you have seen already.
Let us focus our attention on the idea of "similarity". A simple way to picture it: two texts are similar if their vectors point in roughly the same direction, so the angle between them (and therefore the cosine distance) is small. Obviously, this all happens in a high-dimensional space, but it is quite useful to have a simple picture in mind.
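To make "close" tangible, here is a tiny sketch using pgvector's cosine distance operator <=> (which we will meet again below) on hand-picked two-dimensional vectors:

SELECT '[1,0]'::vector <=> '[2,0]'::vector  AS same_direction,  -- 0: vectors point the same way
       '[1,0]'::vector <=> '[0,1]'::vector  AS orthogonal,      -- 1: unrelated
       '[1,0]'::vector <=> '[-1,0]'::vector AS opposite;        -- 2: vectors point in opposite directions

Cosine distance ranges from 0 (same direction) to 2 (opposite direction), so the embeddings of similar texts end up with small values.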
After importing the data and preparing everything, it is time for the big moment: we can write a program that turns a real-world search string into a query. The following piece of code returns the best results, given a certain search string:
#!/usr/bin/python3

import os, sys
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

# -----------------------------------------
# EDIT HERE
model_name = "multi-qa-MiniLM-L6-cos-v1"

# Set up database connection parameters
db_params = {
    'host': 'localhost',
    'database': 'cybertec',
    'user': 'your_user',
    'password': 'your_password',
    'port': '5432'
}
# -----------------------------------------

def main():
    if len(sys.argv) < 2:
        print("Usage: please provide a query string")
        sys.exit(1)

    query_string = sys.argv[1]
    print("Querying database for: '%s'" % (query_string))

    # create model for embedding
    model = SentenceTransformer(model_name)

    # Create the connection
    conn = psycopg.connect(
        host=db_params['host'],
        dbname=db_params['database'],
        user=db_params['user'],
        password=db_params['password'],
        port=db_params['port']
    )
    register_vector(conn)
    cur = conn.cursor()

    # transform query string into vector
    emb = model.encode(query_string)

    # fetch the 5 closest documents by cosine distance
    result = conn.execute('SELECT embedding <=> %s AS distance, title, body '
                          'FROM t_document '
                          'ORDER BY embedding <=> %s '
                          'LIMIT 5',
                          (emb, emb)).fetchall()

    for row in result:
        print("Result item: ")
        print("  Distance: " + str(row[0]))
        print("  Title: : " + str(row[1]))
        print("  Body : " + str(row[2]))
        print("\n")

if __name__ == "__main__":
    main()
Roughly the first 40 lines of code are just like before. We define the model, connect to the database, fetch a command line argument, and issue an insanely cool error message in case no argument is passed to our Python program. But where does the real magic happen? Here it is:
emb = model.encode(query_string)
We use the SAME model as before to turn the search string into a vector. Finally, we can run something that is generally known as a "KNN search" (k nearest neighbor search). The idea is to sort by the distance to the search vector and return the top results, which is exactly what the query does; the loop towards the end merely prints them.
One question people ask is: what is this strange "<=>" operator in the ORDER BY clause? We mentioned the idea of "distance" before. pgvector provides an operator for each distance metric: "<->" for Euclidean (L2) distance, "<#>" for the negative inner product, and "<=>" for cosine distance.
In this case, we went for cosine distance.
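To illustrate the difference between the operators, here is a small sketch comparing all three distances against the document we inspected earlier (id = 1 is just an arbitrary pick):

SELECT d.id,
       d.embedding <-> q.embedding AS l2_distance,
       d.embedding <#> q.embedding AS neg_inner_product,
       d.embedding <=> q.embedding AS cosine_distance
FROM   t_document AS d,
       (SELECT embedding FROM t_document WHERE id = 1) AS q
ORDER BY cosine_distance
LIMIT  5;

Note that the operator used in ORDER BY has to match the operator class of the index we create later (vector_cosine_ops corresponds to <=>), otherwise the index cannot be used.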
After this lengthy and hopefully useful introduction, it is time to run the code and see what it can do for us. In my showcase, I want to find all reviews that match the string "mooney airplane navigation literature". So, the idea is to find all reviews that might be related to aircraft made by the Mooney Aircraft company and have something to do with books on navigation. The result looks as follows:
$ ./query_comments.py "mooney airplane navigation literature"
Querying database for: 'mooney airplane navigation literature'
Result item: 
  Distance: 0.3748793229388776
  Title: : Great Book!
  Body : I gave this book as a gift to a person who has an interest in historic aircraft. The recipient of the gift was very, very appeciative and thought it was an excellent reference.

Result item: 
  Distance: 0.3833693082144648
  Title: : Decent airplane book but nothing more
  Body : I would not recommend anyone buying this book. If you are curious, check it out from the library. The character development is shallow and the story is generally not well written. The readers who have enjoyed this must be very young, and therefore not exposed to the better writers of the past. I put a lot of blame on the editor of this book, as it is his job to assist the writer in preparing the manuscript. I will not read this author again.

Result item: 
  Distance: 0.38802472913338903
  Title: : Super, Great Information
  Body : If you are an airplane lover, you should order this book. Great pictures and facts.

Result item: 
  Distance: 0.3974898514887534
  Title: : Wonderfully done book
  Body : This book is wonderful, I love seeing aircraft put to sleep in the boneyards! Philip D. Chinnery has an awesome book here! I would recommend it to all aviation enthusiasts to buy, it's well worth it!

Result item: 
  Distance: 0.39884170286903886
  Title: : Good book
  Body : Very good way to adventure on this Plane but the content is a little bit on the less exciting part but a good book over all
Note that not a single one of those entries is actually related to Mooney aircraft in particular. This makes sense, because navigation is not tied to a brand at all. What we can also see is that the word "literature" was not found - however, the system has understood that "literature" seems to be related to "books", and so in this context we get a ton of book reviews (which is what exists in the data). In short: the result is relevant and far superior to what standard FTS (= Full Text Search) would have given us.
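For comparison, here is a sketch of what a traditional full text search for the same words could look like. Because plainto_tsquery requires all of the words to be present, a query like this will most likely return nothing at all, since hardly any review literally contains "mooney", "airplane", "navigation" and "literature" at the same time:

SELECT title
FROM   t_document
WHERE  to_tsvector('english', title || ' ' || body)
       @@ plainto_tsquery('english', 'mooney airplane navigation literature');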
The problem at the moment is that PostgreSQL has to scan an entire table containing 3.6 million rows to find those results. Obviously, this is not too efficient, so it makes sense to create an index:
cybertec=# SET maintenance_work_mem TO '64 GB';
SET
cybertec=# CREATE INDEX ON t_document USING ivfflat (embedding vector_cosine_ops);
CREATE INDEX
My machine is big enough to build the index quickly, using 64 GB of maintenance_work_mem, which controls how much memory maintenance operations such as index creation are allowed to use. Once the index is in place, we should see significantly improved query performance. Keep in mind that the Python script will still need some time for other things (loading libraries, the model and so on), but the database side will be extremely fast.
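Two knobs are worth knowing about when working with ivfflat indexes; the values below are only illustrative and have to be adapted to the data set:

-- "lists" controls how many clusters ivfflat builds for the table;
-- the pgvector documentation suggests roughly rows / 1000 for smaller tables
-- and sqrt(rows) for larger ones.
CREATE INDEX ON t_document USING ivfflat (embedding vector_cosine_ops) WITH (lists = 2000);

-- At query time, "probes" decides how many of those lists are searched.
-- Higher values improve recall at the cost of speed (the default is 1).
SET ivfflat.probes = 10;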