PostgreSQL offers advanced capabilities through extensions such as pgvector, which bring semantic search to the database at a quality that was previously hard to achieve. Unlike traditional text search, which mostly relies on plain string comparison, semantic search goes beyond keywords: it takes context and meaning into account and therefore returns more relevant results.
With extensions such as pgvector and pgai it is possible to combine the power of a relational database with cutting-edge features of modern AI models.
The benefits of semantic search are plentiful.
After this brief introduction, it is time to take a look at the technical aspects and see how this can be done for real. For the purpose of this demo we have used the excellent data set available here. It contains 3.6 million Amazon reviews in an easy-to-use CSV format.
The first thing we have to do is to load the extension into our database and create a table that we can use to store the data we want to query:
cybertec=# CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION
cybertec=# CREATE TABLE t_document (
    id         serial PRIMARY KEY,
    polarity   float,
    title      text,
    body       text,
    embedding  vector(384)
);
CREATE TABLE
There are two important things to note here. The first observation is that the extension has been loaded into the database. Note that the name we have to use here is "vector", not "pgvector" as one might expect. The second noteworthy thing is that a new data type called "vector" is used. The extension provides additional data types as well, but for our use case we need a fixed-length vector with 384 entries, which is what our model is going to return.
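To get a feeling for the data type, here is a minimal sketch (the numbers are made up purely for illustration). pgvector enforces the declared number of dimensions, so a value with the wrong length is rejected:

-- a valid three-dimensional vector
SELECT '[1,2,3]'::vector(3);

-- fails, because only 3 dimensions are provided instead of 384
SELECT '[1,2,3]'::vector(384);

-- the same check applies to INSERTs into t_document,
-- since the embedding column was declared as vector(384)

This is exactly why the table definition above uses vector(384): the column has to match the output size of the embedding model.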
The key thing to understand is the concept of an "embedding". What does it mean? It is actually fairly simple: we take a string and transform it into a vector (a list of numbers). It is essential to understand this when using the extension: pgvector itself does not turn text into vectors inside PostgreSQL, so we have to come up with some code that does this for us.
Python is a simple way to do exactly that. The following code snippet shows how this can be done:
#!/usr/bin/python3

import os
import csv

from pgvector.psycopg import register_vector
import psycopg
from sentence_transformers import SentenceTransformer

# -----------------------------------------
# EDIT HERE
file = '/path/amazon_reviews/train.csv'
model_name = "multi-qa-MiniLM-L6-cos-v1"

# Set up database connection parameters
db_params = {
    'host': 'localhost',
    'database': 'cybertec',
    'user': 'your_user',
    'password': 'your_password',
    'port': '5432'
}
# -----------------------------------------

def main():
    # create model for embedding
    model = SentenceTransformer(model_name)

    # Create the connection
    conn = psycopg.connect(
        host=db_params['host'],
        dbname=db_params['database'],
        user=db_params['user'],
        password=db_params['password'],
        port=db_params['port']
    )
    register_vector(conn)
    cur = conn.cursor()

    lines = 0
    with open(file, 'r', encoding='utf-8') as csvfile:
        csv_reader = csv.reader(csvfile, delimiter=',')

        for row in csv_reader:
            lines += 1
            # turn title + body into a 384-dimensional vector
            emb = model.encode(row[1] + " " + row[2])
            cur.execute("INSERT INTO t_document (polarity, title, body, embedding) VALUES (%s, %s, %s, %s)",
                        (row[0], row[1], row[2], emb))

            # commit every 1000 rows to make progress visible
            if lines % 1000 == 0:
                print("rows imported: %s" % (lines))
                conn.commit()

    conn.commit()

if __name__ == "__main__":
    main()
What is this code doing, exactly? Let us go through it step by step. First, we have to import a couple of libraries, among them psycopg, the Python driver we use to talk to PostgreSQL.
Then we have to run:
from pgvector.psycopg import register_vector
This line is important because it adds vector support to psycopg; without it, the driver cannot handle the vector type nicely. Once this is done, we need a model to transform the inputs. Hugging Face is the place to go: it hosts models that can be used with the SentenceTransformer class we have just imported.
Next, we can register the additional data types with the driver, open a cursor, and we are ready to go. We can simply open the CSV file and loop over it. Each line contains one review in a convenient format. The real magic happens here:
emb = model.encode(row[1] + " " + row[2])
In this line, we feed the title and the body to our Hugging Face model and get back a vector of numbers, which is then simply inserted into the table. Every thousand lines, we issue a COMMIT. Strictly speaking, this is not needed, but it lets us watch the import progress and run some basic tests while the embedding is still going on. Note that most of the time is not spent in the database but in the model, so doing some of the work in parallel is definitely helpful.
The following snippet shows how we can run the code:
$ time ./handle_comments.py
rows imported: 1000
rows imported: 2000
...
rows imported: 3599000
rows imported: 3600000

real    325m35.287s
user    311m50.151s
sys     2m11.664s
Note that the execution time is fairly high, so it can make sense to run the import in parallel to make use of all CPUs in your machine. In this case, everything was running on a single core.
The question many people ask is: what does the result actually look like? Well, here is an example:
cybertec=# \x
Expanded display is on.
cybertec=# SELECT * FROM t_document LIMIT 1;
-[ RECORD 1 ]-----------------------------------------------------------------
id        | 1
polarity  | 2
title     | Stuning even for the non-gamer
body      | This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen!
embedding | [-0.051104408,-0.06477496,0.053947136,-0.090758994,-0.11248113,
            0.021052703,0.040592767,0.033487927,0.008018044,0.030972324,
            -0.09552986,-0.041627012,-0.032203916,-0.08294443,-0.0076782834,
            0.039493423,0.06395606,-0.014506877,0.023361415,-0.022605361,
            ...
            0.0133066205,-0.003375273,-0.003283798,-0.08727879,0.02891936]
Note that the input data contains typos. We will get back to this later. We cannot rely on all input data always being 100% correct.
The important part is: each text is represented as a vector of numbers. The goal of semantic search is to find vectors that are close to each other. But what does "close" mean in this context?
First of all, there are various ways to calculate the distance between two vectors: Euclidean (L2) distance, the inner ("dot") product, and cosine distance are the most common options. "Dot product" and "cosine distance" in particular might be something you have seen already.
Let us focus our attention on the idea of "similarity". A simple way to picture it: two texts are similar if their vectors point in roughly the same direction, so the angle between them (and therefore the cosine distance) is small. Obviously, this all happens in a high-dimensional space, but it is quite useful to have a simple picture in mind.
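To make "close" tangible, here is a tiny sketch using pgvector's cosine distance operator <=> (which we will meet again below) on hand-picked two-dimensional vectors:

SELECT '[1,0]'::vector <=> '[2,0]'::vector  AS same_direction,  -- 0: vectors point the same way
       '[1,0]'::vector <=> '[0,1]'::vector  AS orthogonal,      -- 1: unrelated
       '[1,0]'::vector <=> '[-1,0]'::vector AS opposite;        -- 2: vectors point in opposite directions

Cosine distance ranges from 0 (same direction) to 2 (opposite direction), so the embeddings of similar texts end up with small values.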
After importing the data and preparing everything, it is time for the big moment: we can write a program that turns a real-world search string into a query. The following piece of code returns the best results, given a certain search string:
#!/usr/bin/python3

import os, sys
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

# -----------------------------------------
# EDIT HERE
model_name = "multi-qa-MiniLM-L6-cos-v1"

# Set up database connection parameters
db_params = {
    'host': 'localhost',
    'database': 'cybertec',
    'user': 'your_user',
    'password': 'your_password',
    'port': '5432'
}
# -----------------------------------------

def main():
    if len(sys.argv) < 2:
        print("Usage: please provide a query string")
        sys.exit(1)

    query_string = sys.argv[1]
    print("Querying database for: '%s'" % (query_string))

    # create model for embedding
    model = SentenceTransformer(model_name)

    # Create the connection
    conn = psycopg.connect(
        host=db_params['host'],
        dbname=db_params['database'],
        user=db_params['user'],
        password=db_params['password'],
        port=db_params['port']
    )
    register_vector(conn)
    cur = conn.cursor()

    # transform query string into vector
    emb = model.encode(query_string)

    # fetch the 5 closest documents by cosine distance
    result = conn.execute('SELECT embedding <=> %s AS distance, title, body '
                          'FROM t_document '
                          'ORDER BY embedding <=> %s '
                          'LIMIT 5',
                          (emb, emb)).fetchall()

    for row in result:
        print("Result item: ")
        print("  Distance: " + str(row[0]))
        print("  Title: : " + str(row[1]))
        print("  Body : " + str(row[2]))
        print("\n")

if __name__ == "__main__":
    main()
Roughly the first 40 lines of code are just like before. We define the model, connect to the database, fetch a command line argument, and issue an insanely cool error message in case no argument is passed to our Python program. But where does the real magic happen? Here it is:
emb = model.encode(query_string)
We use the SAME model as before to turn the search string into a vector. Finally, we can run something that is generally known as a "KNN search" (k nearest neighbor search). The idea is to sort by the distance to the search vector and return the top results, which is exactly what the query does; the loop towards the end merely prints them.
One question people ask is: what is this strange "<=>" operator in the ORDER BY clause? We mentioned the idea of "distance" before. pgvector provides an operator for each distance metric: "<->" for Euclidean (L2) distance, "<#>" for the negative inner product, and "<=>" for cosine distance.
In this case, we went for cosine distance.
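To illustrate the difference between the operators, here is a small sketch comparing all three distances against the document we inspected earlier (id = 1 is just an arbitrary pick):

SELECT d.id,
       d.embedding <-> q.embedding AS l2_distance,
       d.embedding <#> q.embedding AS neg_inner_product,
       d.embedding <=> q.embedding AS cosine_distance
FROM   t_document AS d,
       (SELECT embedding FROM t_document WHERE id = 1) AS q
ORDER BY cosine_distance
LIMIT  5;

Note that the operator used in ORDER BY has to match the operator class of the index we create later (vector_cosine_ops corresponds to <=>), otherwise the index cannot be used.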
After this lengthy and hopefully useful introduction, it is time to run the code and see what it can do for us. In my showcase, I want to find all reviews that match the string "mooney airplane navigation literature". So, the idea is to find all reviews that might be related to aircraft made by the Mooney Aircraft company and have something to do with books on navigation. The result looks as follows:
$ ./query_comments.py "mooney airplane navigation literature"
Querying database for: 'mooney airplane navigation literature'
Result item: 
  Distance: 0.3748793229388776
  Title: : Great Book!
  Body : I gave this book as a gift to a person who has an interest in historic aircraft. The recipient of the gift was very, very appeciative and thought it was an excellent reference.

Result item: 
  Distance: 0.3833693082144648
  Title: : Decent airplane book but nothing more
  Body : I would not recommend anyone buying this book. If you are curious, check it out from the library. The character development is shallow and the story is generally not well written. The readers who have enjoyed this must be very young, and therefore not exposed to the better writers of the past. I put a lot of blame on the editor of this book, as it is his job to assist the writer in preparing the manuscript. I will not read this author again.

Result item: 
  Distance: 0.38802472913338903
  Title: : Super, Great Information
  Body : If you are an airplane lover, you should order this book. Great pictures and facts.

Result item: 
  Distance: 0.3974898514887534
  Title: : Wonderfully done book
  Body : This book is wonderful, I love seeing aircraft put to sleep in the boneyards! Philip D. Chinnery has an awesome book here! I would recommend it to all aviation enthusiasts to buy, it's well worth it!

Result item: 
  Distance: 0.39884170286903886
  Title: : Good book
  Body : Very good way to adventure on this Plane but the content is a little bit on the less exciting part but a good book over all
Note that not a single one of those entries is actually related to Mooney aircraft in particular. This makes sense, because navigation is not tied to a brand at all. What we can also see is that the word "literature" was not found - however, the system has understood that "literature" seems to be related to "books", and so in this context we get a ton of book reviews (which is what exists in the data). In short: the result is relevant and far superior to what standard FTS (= Full Text Search) would have given us.
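For comparison, here is a sketch of what a traditional full text search for the same words could look like. Because plainto_tsquery requires all of the words to be present, a query like this will most likely return nothing at all, since hardly any review literally contains "mooney", "airplane", "navigation" and "literature" at the same time:

SELECT title
FROM   t_document
WHERE  to_tsvector('english', title || ' ' || body)
       @@ plainto_tsquery('english', 'mooney airplane navigation literature');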
The problem at the moment is that PostgreSQL has to scan an entire table containing 3.6 million rows to find those results. Obviously, this is not too efficient, so it makes sense to create an index:
cybertec=# SET maintenance_work_mem TO '64 GB';
SET
cybertec=# CREATE INDEX ON t_document USING ivfflat (embedding vector_cosine_ops);
CREATE INDEX
My machine is big enough to build the index quickly, using 64 GB of maintenance_work_mem, which controls how much memory maintenance operations such as index creation are allowed to use. Once the index is in place, we should see significantly improved query performance. Keep in mind that the Python script will still need some time for other things (loading libraries, the model and so on), but the database side will be extremely fast.
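Two knobs are worth knowing about when working with ivfflat indexes; the values below are only illustrative and have to be adapted to the data set:

-- "lists" controls how many clusters ivfflat builds for the table;
-- the pgvector documentation suggests roughly rows / 1000 for smaller tables
-- and sqrt(rows) for larger ones.
CREATE INDEX ON t_document USING ivfflat (embedding vector_cosine_ops) WITH (lists = 2000);

-- At query time, "probes" decides how many of those lists are searched.
-- Higher values improve recall at the cost of speed (the default is 1).
SET ivfflat.probes = 10;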