pgvector is a widely adopted PostgreSQL extension that is most often used for semantic search and other AI-related workloads, and countless posts and talks deal with it in that context. Recently, we have posted about semantic search in PostgreSQL (see post).
However, pgvector is much more than that - a vector can represent anything, so the technology can also be applied to other fields, such as timeseries analysis. This article explains how this works and how the technology can be leveraged to get a handle on timeseries data.
For the purpose of this example, we will use a historical timeseries showing the development of the German stock market (DAX):
cybertec=# SELECT * FROM stock_data ORDER BY d DESC;
     d      | symbol |      open       |      high       |       low       |      close      |  volume
------------+--------+-----------------+-----------------+-----------------+-----------------+-----------
 2025-03-20 | ^GDAXI |         23009.5 | 23315.490234375 |  22842.94921875 | 23295.720703125 |         0
 2025-03-19 | ^GDAXI | 23288.060546875 | 23372.080078125 |   23136.5390625 | 23272.150390625 |  79641400
 2025-03-18 | ^GDAXI |  23380.69921875 | 23476.009765625 | 23240.560546875 | 23266.650390625 |  80663300
 2025-03-17 | ^GDAXI |   23154.5703125 |   23154.5703125 |   22933.5703125 | 22998.529296875 |  67152000
 2025-03-14 | ^GDAXI |   22986.8203125 |  23049.48046875 | 22465.119140625 |  22501.33984375 |  93287400
 2025-03-13 | ^GDAXI |    22567.140625 | 22752.830078125 |  22417.51953125 | 22578.099609375 |  78955600
 2025-03-12 | ^GDAXI |  22676.41015625 |  22813.83984375 |  22461.76953125 | 22525.740234375 |  80929100
 2025-03-11 | ^GDAXI |  22328.76953125 | 22835.099609375 |  22258.30078125 |    22680.390625 |  97374800
 2025-03-10 | ^GDAXI |  22620.94921875 | 23164.240234375 |   22519.2109375 | 23163.779296875 | 108707000
...
The data goes all the way back to 1987 and ends in March 2025. To get a better handle on the data, we can count how many rows we have per decade. Yes, this can be done with a GROUP BY clause. Something that is not widely known is that in PostgreSQL we can group by an expression and not just by plain columns. Here is how this works:
cybertec=# SELECT date_trunc('decade', d) AS year, count(*)
           FROM stock_data
           GROUP BY 1
           ORDER BY 1;
          year          | count
------------------------+-------
 1980-01-01 00:00:00+01 |   502
 1990-01-01 00:00:00+01 |  2506
 2000-01-01 00:00:00+01 |  2542
 2010-01-01 00:00:00+01 |  2531
 2020-01-01 00:00:00+01 |  1331
(5 rows)
The date_trunc function allows us to cut off years, months, days and so on - in this case it leaves us with the decade we want to group and count by.
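As a side note, GROUP BY 1 is merely positional shorthand for the first entry in the SELECT list; the expression can also be spelled out explicitly, and any precision date_trunc understands will work. A quick sketch against the same table, this time per year:

SELECT date_trunc('year', d) AS year,
       count(*)
FROM   stock_data
GROUP  BY date_trunc('year', d)   -- grouping by the expression itself
ORDER  BY 1;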
What do we want to achieve? Actually, a timeseries can be seen as a vector as well. So, why not use vectors to find anomalies in timeseries? One way to do just that is to take a look at the changes in the data.
In PostgreSQL, we can use a window function to calculate the change between two subsequent rows:
cybertec=# SELECT d, close,
                  close / lag(close) OVER (ORDER BY d) AS diff
           FROM stock_data;
     d      |       close        |          diff
------------+--------------------+------------------------
 1987-12-30 | 1005.1900024414062 |
 1988-01-04 |   956.489990234375 | 0.95155143595862615489
 1988-01-05 |  996.0999755859375 |     1.0414118137732488
 1988-01-06 |  1006.010009765625 |     1.0099488348786055
 1988-01-07 |  1014.469970703125 |     1.0084094202397359
 1988-01-08 |   1026.68994140625 |     1.0120456702081141
 1988-01-11 |    987.52001953125 | 0.96184834359889682645
...
We divide the current row's close by the previous row's value and display the result as an additional column.
Analyzing single rows and their change relative to the previous row is already fairly interesting, but by far not good enough. Often, important events do not happen within a single day but over a period of time. So why not take the changes happening over a couple of days (in our case 6) and put them into a single vector for later analysis?
The array_agg function provides a way to turn values into an array. In this case, we want the current row as well as the 5 previous rows inside the same array. pgvector provides us with the ability to cast an array to a vector directly. However, there is a catch:
cybertec=# SELECT *,
                  (array_agg(diff) OVER (ORDER BY d
                                         ROWS BETWEEN 5 PRECEDING
                                              AND CURRENT ROW))::vector AS vec
           FROM (SELECT d, close,
                        close / lag(close) OVER (ORDER BY d) AS diff
                 FROM stock_data) AS x
           ORDER BY d
           OFFSET 6;
ERROR:  array must not contain nulls
It is important to make sure that all values inside the vector are valid - NULL entries are not allowed. Otherwise, the type cast will error out as shown in the previous listing.
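The restriction is easy to reproduce in isolation. The following little sketch is not needed for the analysis; it merely demonstrates what the cast does when a NULL sneaks into the array:

-- casting an array that contains a NULL to vector fails
-- with: ERROR:  array must not contain nulls
SELECT (ARRAY[1.0, NULL, 3.0]::float8[])::vector;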
Avoiding this error is easy: we simply do the type cast later in the query, once we can guarantee that there is not a single NULL value left inside our array:
cybertec=# CREATE VIEW v_analysis AS
           SELECT d,
                  round(close::numeric, 2) AS close,
                  round(diff::numeric, 8) AS diff,
                  vec::vector
           FROM (SELECT *,
                        array_agg(diff) OVER (ORDER BY d
                                              ROWS BETWEEN 5 PRECEDING
                                                   AND CURRENT ROW) AS vec
                 FROM (SELECT d, close,
                              close / lag(close) OVER (ORDER BY d) AS diff
                       FROM stock_data) AS x
                 ORDER BY d
                 OFFSET 6) AS y;
CREATE VIEW
We will need those vectors for all later operations, so it is quite convenient to create a view that helps us simplify the SQL we want to run on top of this data.
The data looks as follows:
cybertec=# SELECT * FROM v_analysis ORDER BY d LIMIT 10;
     d      |  close  |    diff    |                               vec
------------+---------+------------+-----------------------------------------------------------------
 1988-01-11 |  987.52 | 0.96184834 | [0.95155144,1.0414118,1.0099488,1.0084094,1.0120456,0.9618483]
 1988-01-12 |  986.89 | 0.99936203 | [1.0414118,1.0099488,1.0084094,1.0120456,0.9618483,0.99936205]
 1988-01-13 |  965.77 | 0.97859944 | [1.0099488,1.0084094,1.0120456,0.9618483,0.99936205,0.9785994]
 1988-01-14 |  974.46 | 1.00899800 | [1.0084094,1.0120456,0.9618483,0.99936205,0.9785994,1.008998]
 1988-01-15 |  952.46 | 0.97742339 | [1.0120456,0.9618483,0.99936205,0.9785994,1.008998,0.97742337]
 1988-01-18 | 1003.13 | 1.05319907 | [0.9618483,0.99936205,0.9785994,1.008998,0.97742337,1.053199]
 1988-01-19 |  980.18 | 0.97712160 | [0.99936205,0.9785994,1.008998,0.97742337,1.053199,0.9771216]
 1988-01-20 |  960.44 | 0.97986085 | [0.9785994,1.008998,0.97742337,1.053199,0.9771216,0.97986084]
 1988-01-21 |  949.24 | 0.98833867 | [1.008998,0.97742337,1.053199,0.9771216,0.97986084,0.98833865]
 1988-01-22 |  966.48 | 1.01816189 | [0.97742337,1.053199,0.9771216,0.97986084,0.98833865,1.0181619]
(10 rows)
What you can see here is that a vector in pgvector is basically a list of floating-point values.
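To illustrate the point in isolation (this toy example is unrelated to the stock data), a vector literal is simply a bracketed list of floats, and helper functions such as vector_dims expose its structure:

-- a vector literal is just a bracketed list of floating point values
SELECT '[1.5, 2.5, 3.5]'::vector AS v,
       vector_dims('[1.5, 2.5, 3.5]'::vector) AS dimensions;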
One interesting aspect is that there is actually such a thing as an "average vector". There is indeed an incarnation of the "avg" function that works for vectors:
cybertec=# SELECT avg(vec) FROM v_analysis;
                              avg
---------------------------------------------------------------
 [1.0004238,1.0004286,1.0004265,1.0004267,1.0004258,1.0004246]
(1 row)
Now this is interesting. What we see here is basically the average daily change in the stock market. It is not surprising that all those values are slightly above 1 - in other words, the average daily change is slightly positive but close to zero. Over the past decades, the markets have gone up on average, and this is exactly what this vector tells us.
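As a quick plausibility check (a sketch, not part of the original analysis), each component of the average vector should land very close to the plain average of the diff column, since every component is just the same daily ratio shifted by a few days:

-- compare the scalar average of the daily ratio with the components above
SELECT avg(diff) FROM v_analysis;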
Often one uses AI to create complicated models to find anomalies in timeseries. However, this might not even be necessary. Consider the following example:
cybertec=# SELECT *
           FROM v_analysis
           ORDER BY vec <=> (SELECT avg(vec) FROM v_analysis) DESC;
     d      |  close   |    diff    |                                vec
------------+----------+------------+-------------------------------------------------------------------
 2008-10-16 |  4675.90 | 0.90139738 | [0.96520793,0.90892595,1.0103363,1.1062297,1.010175,0.9013974]
 2008-10-17 |  4757.98 | 1.01755386 | [0.90892595,1.0103363,1.1062297,1.010175,0.9013974,1.0175538]
 2008-10-14 |  5135.14 | 1.10622964 | [0.97222835,0.9608706,0.96520793,0.90892595,1.0103363,1.1062297]
 2020-03-25 |  9987.37 | 1.08064334 | [0.942259,0.98636895,1.0688034,0.9400737,1.0826737,1.0806433]
 2008-10-15 |  5187.39 | 1.01017499 | [0.9608706,0.96520793,0.90892595,1.0103363,1.1062297,1.010175]
 1989-10-20 |  1520.01 | 0.99850224 | [0.99229485,0.86856604,1.0658485,1.0164639,1.013853,0.99850225]
 1989-10-19 |  1522.29 | 1.01385294 | [0.9984663,0.99229485,0.86856604,1.0658485,1.0164639,1.013853]
 1989-10-23 |  1518.24 | 0.99883552 | [0.86856604,1.0658485,1.0164639,1.013853,0.99850225,0.9988355]
 1989-10-18 |  1501.49 | 1.01646388 | [0.97981346,0.9984663,0.99229485,0.86856604,1.0658485,1.0164639]
 1989-10-17 |  1477.17 | 1.06584844 | [1.0045961,0.97981346,0.9984663,0.99229485,0.86856604,1.0658485]
 2008-10-21 |  4882.80 | 0.99816836 | [1.1062297,1.010175,0.9013974,1.0175538,1.028117,0.99816835]
...
What this statement does is take the average vector and order all rows by their distance to it in descending order (the <=> operator calculates the cosine distance), so the most unusual vectors come first. In other words: Where can we find data that is as far away from the average as possible?
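For reference, <=> is pgvector's cosine distance operator; the extension also ships <-> for Euclidean (L2) distance and <#> for negative inner product. A tiny sketch on toy vectors (unrelated to the stock data) shows how these operators are used:

-- cosine distance: 0 for vectors pointing in the same direction
SELECT '[1, 1, 1]'::vector <=> '[2, 2, 2]'::vector AS cosine_distance;

-- Euclidean distance and negative inner product work the same way
SELECT '[1, 2, 3]'::vector <-> '[3, 2, 1]'::vector AS l2_distance,
       '[1, 2, 3]'::vector <#> '[3, 2, 1]'::vector AS neg_inner_product;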
The results of these simple queries are nothing short of stunning. Let us take a look at those dates and ask ourselves: What happened in October 2008 and October 1989? Here are the results:
"The Friday the 13th mini-crash, or Black Friday, was a stock market crash that occurred on Friday, October 13, 1989. The crash was apparently caused by a reaction to a news story of the breakdown of a $6.75 billion leveraged buyout deal for UAL Corporation, the parent company of United Airlines."
"October 6–10, 2008: From October 6–10, 2008, the Dow Jones Industrial Average (DJIA) closed lower in all five sessions. Volume levels were record-breaking. The DJIA fell 1,874.19 points, or 18.2%, in its worst weekly decline ever on both a points and percentage basis. The S&P 500 fell more than 20%."
Wow, we have just identified two of the most important events in recent financial history using nothing more than a window function, an array, and a bit of vector magic.
The most important takeaway here is that pgvector is certainly the go-to extension if you are looking into artificial intelligence, semantic search, and a lot more. However, there is more to it than just the fancy stuff - even basic vector operations can already be highly beneficial.