When dealing with AI, one of the key questions is: Where can I find data to experiment with? While there are many places on the internet to satisfy the need for test data, Wikipedia is certainly one of the most interesting candidates to get you started. So, how can we import Wikipedia into PostgreSQL and make use of it? Here is how it works.
By far one of the most useful extensions out there is pgai, a massive toolbox for artificial intelligence in PostgreSQL. Luckily, it also contains an easy way to load Wikipedia into the database.
Once you have installed pgai on your system or launched the ready-made containers as shown on GitHub, you can enable the extension inside your database:
cybertec=# CREATE EXTENSION IF NOT EXISTS ai CASCADE;
NOTICE:  installing required extension "vector"
NOTICE:  installing required extension "plpython3u"
CREATE EXTENSION
All we have to do here is call the ai.load_dataset function, which has a ton of useful parameters we can utilize:
cybertec=# \x
Expanded display is on.
cybertec=# \df+ ai.load_dataset
List of functions
-[ RECORD 1 ]-------+-------------------------------------------------
Schema              | ai
Name                | load_dataset
Result data type    | bigint
Argument data types | name text, config_name text DEFAULT NULL::text, split text DEFAULT NULL::text, schema_name name DEFAULT 'public'::name, table_name name DEFAULT NULL::name, if_table_exists text DEFAULT 'error'::text, field_types jsonb DEFAULT NULL::jsonb, batch_size integer DEFAULT 5000, max_batches integer DEFAULT NULL::integer, kwargs jsonb DEFAULT '{}'::jsonb
Type                | func
Volatility          | volatile
Parallel            | unsafe
Owner               | hs
Security            | invoker
Access privileges   | hs=X*/hs                +
                    | pg_database_owner=X*/hs
Language            | plpython3u
Internal name       |
Description         |
The first parameter tells the function which data set to aim for. In our case, we go straight for Wikipedia ('wikimedia/wikipedia') and its English snapshot from November 2023 ('20231101.en'). Those first two parameters are the important ones. Optionally, we can define which target table we want to load the data into and how to behave when that table already exists. Also, we often don't need lots and lots of rows, so batch_size and max_batches let us control how much data is loaded. Obviously, fetching all of this data takes a fair amount of time, and you might not always need a big data set to get going with Wikipedia and PostgreSQL.
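If you only want a small sample to experiment with, those parameters can be used to stop the import early. Here is a minimal sketch under that assumption (the table name 'wiki_sample' and the batch settings are just examples, not taken from the run below):

-- sketch: load only a tiny slice of English Wikipedia for quick experiments
SELECT ai.load_dataset(
    'wikimedia/wikipedia',
    '20231101.en',
    table_name  => 'wiki_sample',  -- hypothetical target table
    batch_size  => 100,            -- rows fetched per batch
    max_batches => 10              -- stop after roughly 1000 rows
);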
So, let us try it out and see what happens. The following snippet contains a sample call that already imports some data:
cybertec=# SELECT ai.load_dataset(
               'wikimedia/wikipedia',
               '20231101.en',
               table_name=>'wiki',
               batch_size=>1000,
               max_batches=>100000,
               if_table_exists=>'append'
           );
If this command executes successfully, we can already enjoy our freshly created table:
cybertec=# SELECT count(*) FROM wiki;
  count
---------
 6407814
(1 row)

cybertec=# \d wiki
              Table "public.wiki"
 Column | Type | Collation | Nullable | Default
--------+------+-----------+----------+---------
 id     | text |           | not null |
 url    | text |           |          |
 title  | text |           |          |
 text   | text |           |          |
Indexes:
    "wiki_pkey" PRIMARY KEY, btree (id)
As you can see, we have around 6.4 million rows, each containing a URL, a title, and the text of the article.
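A quick peek at a couple of rows is a good sanity check; a simple query along these lines will do (the LIMIT is arbitrary):

-- inspect a handful of imported articles
SELECT w.title,
       w.url,
       length(w.text) AS text_length
FROM   wiki AS w
LIMIT  5;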
Now, all this data is interesting for various purposes. However, we are far from done. If you want to engage in machine learning, the next step is to use something known as "embedding", which is also nicely taken care of by pgai.
As already shown in our previous posting about semantic search, the texts have to be turned into vectors that can be understood by the famous and highly adored pgvector extension. pgai provides us with a nice way to do this. Embedding millions of documents is not something that can be done in a couple of seconds, so pgai breaks the work into chunks.
In order to do this, we have to run a command:
cybertec=# SELECT ai.create_vectorizer(
               'wiki'::regclass,
               destination => 'demo_wiki_emb',
               embedding => ai.embedding_ollama('all-minilm', 384),
               chunking => ai.chunking_recursive_character_text_splitter('text')
           );
 create_vectorizer
-------------------
                 6
(1 row)

Time: 24135.802 ms (00:24.136)
What we have done here is invoke the ai.create_vectorizer function and ask it to embed the content of the "wiki" table. The vector data will be accessible through a view named "demo_wiki_emb". The next step is to tell the system which model to use for the embedding. In our case, I have used a fairly small model called "all-minilm". However, there are various other models that can be used as well.
More information can be found on the Ollama website. Just decide what works best for you and run it locally on your Ollama server.
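If you want to try a different model, the vectorizer call can simply point to it. The following is only a sketch: it assumes that 'nomic-embed-text' (a 768-dimensional embedding model) has already been pulled onto your local Ollama server, and the destination name is made up:

-- sketch: create a second vectorizer using a different Ollama model
SELECT ai.create_vectorizer(
    'wiki'::regclass,
    destination => 'demo_wiki_emb_nomic',  -- hypothetical name
    embedding   => ai.embedding_ollama('nomic-embed-text', 768),
    chunking    => ai.chunking_recursive_character_text_splitter('text')
);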
The next thing we can do now is to track the progress of our embedding process. Here is how it works:
cybertec=# select * from ai.vectorizer_status;
 id | source_table |         target_table         |          view          |    pending_items    | disabled
----+--------------+------------------------------+------------------------+---------------------+----------
  6 | public.wiki  | public.wiki_embeddings_store | public.wiki_embeddings | 9223372036854775807 | f
(1 row)

cybertec=# select ai.vectorizer_queue_pending(6, exact_count=>true);
 vectorizer_queue_pending
--------------------------
                  6407814
(1 row)
There are two ways to inspect what is going on inside PostgreSQL: the ai.vectorizer_status view gives an overview including the number of pending items, while ai.vectorizer_queue_pending can return an exact count. Personally, I prefer the exact version, but both options are fine.
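If you want both numbers side by side, the exact pending count and the number of source rows can be combined in a single query; a small sketch (vectorizer id 6 is the one reported by ai.vectorizer_status above):

-- compare the exact number of pending work items with the number of source rows
SELECT ai.vectorizer_queue_pending(6, exact_count => true) AS pending_items,
       (SELECT count(*) FROM wiki)                         AS source_rows;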
However, there is a problem here: Nothing happens. No progress is made. The reason is that we have to launch a Python program that actually does most of the heavy lifting. The easiest way to do this works as follows:
$ /usr/local/bin/pgai vectorizer worker -d "dbname=cybertec host=localhost"
2025-03-21 20:16:04 [info     ] running vectorizer             vectorizer_id=6
This is the simplest way to get the process going, and we can already see some progress. For small amounts of data this is fine, but running it on huge data sets can take, well, long - really long.
One way to approach this is to do things in parallel:
$ /usr/local/bin/pgai vectorizer worker --help
Usage: pgai vectorizer worker [OPTIONS]

Options:
  --version                       Show the version and exit.
  -d, --db-url TEXT               The database URL to connect to  [default:
                                  postgres://postgres@localhost:5432/postgres]
  -i, --vectorizer-id INTEGER     Only fetch work from the given vectorizer
                                  ids. If not provided, all vectorizers will
                                  be fetched.
  --log-level [DEBUG|INFO|WARN|ERROR|FATAL|CRITICAL]
  --poll-interval TIME DURATION   The interval, in duration string or integer
                                  (seconds), to wait before checking for new
                                  work after processing all available work in
                                  the queue.  [default: 5m]
  --once                          Exit after processing all available work
                                  (implies --exit-on-error).
  -c, --concurrency INTEGER RANGE
                                  [x>=1]
  --exit-on-error BOOLEAN         Exit immediately when an error occurs.
  --help                          Show this message and exit.
The solution to the problem is the "-c" (--concurrency) option, which adds some parallelism to the picture. If you want to run this type of operation faster, this is definitely helpful.
After some time our queue should be empty:
cybertec=# select ai.vectorizer_queue_pending(6, exact_count=>true);
 vectorizer_queue_pending
--------------------------
                        0
(1 row)
Once the embeddings have been generated, we can check the result:
cybertec=# \d+
                                        List of relations
 Schema |        Name         | Type  | Owner | Persistence | Access method |  Size   | Description
--------+---------------------+-------+-------+-------------+---------------+---------+-------------
 public | demo_wiki_emb       | view  | hs    | permanent   |               | 0 bytes |
 public | demo_wiki_emb_store | table | hs    | permanent   | heap          | 92 GB   |
 public | wiki                | table | hs    | permanent   | heap          | 13 GB   |
(3 rows)
What we see here is that pgai has produced a table and a view that allow us to conveniently read the data (raw data plus vectors). We can also see that the amount of data generated by our model is substantial: 92 GB of embeddings compared to 13 GB of raw text.
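The same numbers can also be fetched with a plain catalog query, which is handy if you want to monitor growth over time; a small sketch using standard PostgreSQL functions:

-- check the on-disk sizes of the raw data and the embedding store
SELECT c.relname,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM   pg_class AS c
WHERE  c.relname IN ('wiki', 'demo_wiki_emb_store');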
The data produced by the process can be found in this table (as defined by the earlier function call):
cybertec=# \d demo_wiki_emb_store
               Table "public.demo_wiki_emb_store"
     Column     |    Type     | Collation | Nullable |      Default
----------------+-------------+-----------+----------+-------------------
 embedding_uuid | uuid        |           | not null | gen_random_uuid()
 id             | text        |           | not null |
 chunk_seq      | integer     |           | not null |
 chunk          | text        |           | not null |
 embedding      | vector(384) |           | not null |
Indexes:
    "demo_wiki_emb_store_pkey" PRIMARY KEY, btree (embedding_uuid)
    "demo_wiki_emb_store_id_chunk_seq_key" UNIQUE CONSTRAINT, btree (id, chunk_seq)
Foreign-key constraints:
    "demo_wiki_emb_store_id_fkey" FOREIGN KEY (id) REFERENCES wiki(id) ON DELETE CASCADE
What we can see here is that this table references the wiki table via a foreign key on id, so deleting an article automatically removes its embeddings (ON DELETE CASCADE).
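To see what the chunking actually did to a single article, a query like the following can be used (a sketch; it assumes an article with this exact title exists in the dump):

-- list the chunks belonging to one article, in order
SELECT w.title,
       e.chunk_seq,
       left(e.chunk, 80) AS chunk_start
FROM   demo_wiki_emb_store AS e
       JOIN wiki AS w ON w.id = e.id
WHERE  w.title = 'PostgreSQL'
ORDER  BY e.chunk_seq;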
To make access to the data easier, pgai has created a simple-to-use view:
cybertec=# \d+ demo_wiki_emb
                           View "public.demo_wiki_emb"
     Column     |    Type     | Collation | Nullable | Default | Storage  | Description
----------------+-------------+-----------+----------+---------+----------+-------------
 embedding_uuid | uuid        |           |          |         | plain    |
 chunk_seq      | integer     |           |          |         | plain    |
 chunk          | text        |           |          |         | extended |
 embedding      | vector(384) |           |          |         | external |
 id             | text        |           |          |         | extended |
 url            | text        |           |          |         | extended |
 title          | text        |           |          |         | extended |
 text           | text        |           |          |         | extended |
View definition:
 SELECT t.embedding_uuid,
    t.chunk_seq,
    t.chunk,
    t.embedding,
    t.id,
    s.url,
    s.title,
    s.text
   FROM demo_wiki_emb_store t
     LEFT JOIN wiki s ON t.id = s.id;
What it does is join the raw data with those embeddings and expose everything as a single view, which makes interaction really straightforward. But let me draw your attention to another detail here:
cybertec=# explain SELECT count(*) FROM demo_wiki_emb;
                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=1105977.00..1105977.01 rows=1 width=8)
   ->  Gather  (cost=1105976.78..1105976.99 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=1104976.78..1104976.79 rows=1 width=8)
               ->  Parallel Index Only Scan using demo_wiki_emb_store_id_chunk_seq_key on demo_wiki_emb_store t  (cost=0.56..1062313.83 rows=17065183 width=0)
(5 rows)
What can we see here? In the previous listing, the view definition contains a join. So why is there no join in this execution plan? The answer is "join pruning": the optimizer has figured out that it can safely remove the "wiki" table from this operation, because the LEFT JOIN hits the primary key of "wiki" and therefore can never change the number of rows, so counting only requires the embedding store. If you want to learn more about this, one of our older blog posts explains the concept.
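To give you a first taste of what this view is ultimately good for, here is a minimal, hedged sketch of a similarity search. It assumes that the ai.ollama_embed function is available in your pgai installation and that the Ollama server hosting 'all-minilm' is reachable; the search phrase is arbitrary:

-- sketch: find the five chunks closest to a search phrase (cosine distance)
SELECT title,
       left(chunk, 120) AS snippet
FROM   demo_wiki_emb
ORDER  BY embedding <=> ai.ollama_embed('all-minilm', 'history of the steam engine')
LIMIT  5;

Without an index, a query like this has to scan all embeddings, which is exactly why indexing is one of the topics of the upcoming postings.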
In this post you have learned how to load Wikipedia into PostgreSQL quickly and how to turn those articles into vectors. The next postings will build on this and show which indexing options we have, how we can index quickly, and how we can actually make use of the data.