When dealing with AI, one of the key questions is: Where can I find data to experiment with? While there are many places on the internet to satisfy the need for test data, Wikipedia is certainly one of the most interesting candidates to get you started. So, how can we import Wikipedia into PostgreSQL and make use of it? Here is how it works.
By far one of the most useful extensions out there is pgai, a massive toolbox for artificial intelligence in PostgreSQL. Luckily, it also contains an easy way to load Wikipedia into the database.
Once you have installed pgai on your system or launched the ready-made containers as shown on GitHub, you can enable the extension inside your database:
cybertec=# CREATE EXTENSION IF NOT EXISTS ai CASCADE;
NOTICE:  installing required extension "vector"
NOTICE:  installing required extension "plpython3u"
CREATE EXTENSION
All we have to do here is call the ai.load_dataset function, which has a ton of useful parameters we can utilize:
cybertec=# \x
Expanded display is on.
cybertec=# \df+ ai.load_dataset
List of functions
-[ RECORD 1 ]-------+-------------------------------------------------
Schema              | ai
Name                | load_dataset
Result data type    | bigint
Argument data types | name text, config_name text DEFAULT NULL::text, split text DEFAULT NULL::text, schema_name name DEFAULT 'public'::name, table_name name DEFAULT NULL::name, if_table_exists text DEFAULT 'error'::text, field_types jsonb DEFAULT NULL::jsonb, batch_size integer DEFAULT 5000, max_batches integer DEFAULT NULL::integer, kwargs jsonb DEFAULT '{}'::jsonb
Type                | func
Volatility          | volatile
Parallel            | unsafe
Owner               | hs
Security            | invoker
Access privileges   | hs=X*/hs                +
                    | pg_database_owner=X*/hs
Language            | plpython3u
Internal name       |
Description         |
The first parameter tells the function which data set to aim for. In our case, we go straight for Wikipedia ('wikimedia/wikipedia') and its English snapshot from November 2023 ('20231101.en'). Those first two parameters are the important ones. Optionally, we can define which target table we want to load the data into and how to behave when that table already exists. Also, we often don't need lots and lots of rows, so batch_size and max_batches let us control how much data is loaded. Obviously, fetching all of this data takes a fair amount of time, and you might not always need a big data set to get going with Wikipedia and PostgreSQL.
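If you only want a small sample to experiment with, those parameters can be used to stop the import early. Here is a minimal sketch under that assumption (the table name 'wiki_sample' and the batch settings are just examples, not taken from the run below):

-- sketch: load only a tiny slice of English Wikipedia for quick experiments
SELECT ai.load_dataset(
    'wikimedia/wikipedia',
    '20231101.en',
    table_name  => 'wiki_sample',  -- hypothetical target table
    batch_size  => 100,            -- rows fetched per batch
    max_batches => 10              -- stop after roughly 1000 rows
);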
So, let us try it out and see what happens. The following snippet contains a sample call that already imports some data:
cybertec=# SELECT ai.load_dataset(
               'wikimedia/wikipedia',
               '20231101.en',
               table_name=>'wiki',
               batch_size=>1000,
               max_batches=>100000,
               if_table_exists=>'append'
           );
If this command executes successfully, we can already enjoy our freshly created table:
cybertec=# SELECT count(*) FROM wiki;
  count
---------
 6407814
(1 row)

cybertec=# \d wiki
              Table "public.wiki"
 Column | Type | Collation | Nullable | Default
--------+------+-----------+----------+---------
 id     | text |           | not null |
 url    | text |           |          |
 title  | text |           |          |
 text   | text |           |          |
Indexes:
    "wiki_pkey" PRIMARY KEY, btree (id)
As you can see, we have around 6.4 million rows, each containing a URL, a title, and the text of the article.
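A quick peek at a couple of rows is a good sanity check; a simple query along these lines will do (the LIMIT is arbitrary):

-- inspect a handful of imported articles
SELECT w.title,
       w.url,
       length(w.text) AS text_length
FROM   wiki AS w
LIMIT  5;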
Now, all this data is interesting for various purposes. However, we are far from done. If you want to engage in machine learning, the next step is to use something known as "embedding", which is also nicely taken care of by pgai.
As already shown in our previous posting about semantic search, the texts have to be turned into vectors that can be understood by the famous and highly adored pgvector extension. pgai provides us with a nice way to do this. Embedding millions of documents is not something that can be done in a couple of seconds, so pgai breaks the work into chunks.
In order to do this, we have to run a command:
cybertec=# SELECT ai.create_vectorizer(
               'wiki'::regclass,
               destination => 'demo_wiki_emb',
               embedding => ai.embedding_ollama('all-minilm', 384),
               chunking => ai.chunking_recursive_character_text_splitter('text')
           );
 create_vectorizer
-------------------
                 6
(1 row)

Time: 24135.802 ms (00:24.136)
What we have done here is invoke the ai.create_vectorizer function and ask it to embed the content of the "wiki" table. The vector data will be accessible through a view named "demo_wiki_emb". The next step is to tell the system which model to use for the embedding. In our case, I have used a fairly small model called "all-minilm". However, there are various other models that can be used as well.
More information can be found on the Ollama website. Just decide what works best for you and run it locally on your Ollama server.
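If you want to try a different model, the vectorizer call can simply point to it. The following is only a sketch: it assumes that 'nomic-embed-text' (a 768-dimensional embedding model) has already been pulled onto your local Ollama server, and the destination name is made up:

-- sketch: create a second vectorizer using a different Ollama model
SELECT ai.create_vectorizer(
    'wiki'::regclass,
    destination => 'demo_wiki_emb_nomic',  -- hypothetical name
    embedding   => ai.embedding_ollama('nomic-embed-text', 768),
    chunking    => ai.chunking_recursive_character_text_splitter('text')
);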
The next thing we can do now is to track the progress of our embedding process. Here is how it works:
cybertec=# select * from ai.vectorizer_status;
 id | source_table |         target_table         |          view          |    pending_items    | disabled
----+--------------+------------------------------+------------------------+---------------------+----------
  6 | public.wiki  | public.wiki_embeddings_store | public.wiki_embeddings | 9223372036854775807 | f
(1 row)

cybertec=# select ai.vectorizer_queue_pending(6, exact_count=>true);
 vectorizer_queue_pending
--------------------------
                  6407814
(1 row)
There are two ways to inspect what is going on inside PostgreSQL: the ai.vectorizer_status view gives an overview including the number of pending items, while ai.vectorizer_queue_pending can return an exact count. Personally, I prefer the exact version, but both options are fine.
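If you want both numbers side by side, the exact pending count and the number of source rows can be combined in a single query; a small sketch (vectorizer id 6 is the one reported by ai.vectorizer_status above):

-- compare the exact number of pending work items with the number of source rows
SELECT ai.vectorizer_queue_pending(6, exact_count => true) AS pending_items,
       (SELECT count(*) FROM wiki)                         AS source_rows;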
However, there is a problem here: Nothing happens. No progress is made. The reason is that we have to launch a Python program that actually does most of the heavy lifting. The easiest way to do this works as follows:
$ /usr/local/bin/pgai vectorizer worker -d "dbname=cybertec host=localhost"
2025-03-21 20:16:04 [info     ] running vectorizer             vectorizer_id=6
This is the simplest way to get the process going, and we can already see some progress. For small amounts of data this is fine, but running it on huge data sets can take, well, long - really long.
One way to approach this is to do things in parallel:
$ /usr/local/bin/pgai vectorizer worker --help
Usage: pgai vectorizer worker [OPTIONS]

Options:
  --version                       Show the version and exit.
  -d, --db-url TEXT               The database URL to connect to  [default:
                                  postgres://postgres@localhost:5432/postgres]
  -i, --vectorizer-id INTEGER     Only fetch work from the given vectorizer
                                  ids. If not provided, all vectorizers will
                                  be fetched.
  --log-level [DEBUG|INFO|WARN|ERROR|FATAL|CRITICAL]
  --poll-interval TIME DURATION   The interval, in duration string or integer
                                  (seconds), to wait before checking for new
                                  work after processing all available work in
                                  the queue.  [default: 5m]
  --once                          Exit after processing all available work
                                  (implies --exit-on-error).
  -c, --concurrency INTEGER RANGE
                                  [x>=1]
  --exit-on-error BOOLEAN         Exit immediately when an error occurs.
  --help                          Show this message and exit.
The solution to the problem is the "-c" (--concurrency) option, which adds some parallelism to the picture. If you want to run this type of operation faster, this is definitely helpful.
After some time our queue should be empty:
cybertec=# select ai.vectorizer_queue_pending(6, exact_count=>true);
 vectorizer_queue_pending
--------------------------
                        0
(1 row)
Once the embeddings have been generated, we can check the result:
cybertec=# \d+
                                        List of relations
 Schema |        Name         | Type  | Owner | Persistence | Access method |  Size   | Description
--------+---------------------+-------+-------+-------------+---------------+---------+-------------
 public | demo_wiki_emb       | view  | hs    | permanent   |               | 0 bytes |
 public | demo_wiki_emb_store | table | hs    | permanent   | heap          | 92 GB   |
 public | wiki                | table | hs    | permanent   | heap          | 13 GB   |
(3 rows)
What we see here is that pgai has produced a table and a view that allow us to conveniently read the data (raw data plus vectors). We can also see that the amount of data generated by our model is substantial: 92 GB of embeddings compared to 13 GB of raw text.
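The same numbers can also be fetched with a plain catalog query, which is handy if you want to monitor growth over time; a small sketch using standard PostgreSQL functions:

-- check the on-disk sizes of the raw data and the embedding store
SELECT c.relname,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM   pg_class AS c
WHERE  c.relname IN ('wiki', 'demo_wiki_emb_store');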
The data produced by the process can be found in this table (as defined by the earlier function call):
cybertec=# \d demo_wiki_emb_store
               Table "public.demo_wiki_emb_store"
     Column     |    Type     | Collation | Nullable |      Default
----------------+-------------+-----------+----------+-------------------
 embedding_uuid | uuid        |           | not null | gen_random_uuid()
 id             | text        |           | not null |
 chunk_seq      | integer     |           | not null |
 chunk          | text        |           | not null |
 embedding      | vector(384) |           | not null |
Indexes:
    "demo_wiki_emb_store_pkey" PRIMARY KEY, btree (embedding_uuid)
    "demo_wiki_emb_store_id_chunk_seq_key" UNIQUE CONSTRAINT, btree (id, chunk_seq)
Foreign-key constraints:
    "demo_wiki_emb_store_id_fkey" FOREIGN KEY (id) REFERENCES wiki(id) ON DELETE CASCADE
What we can see here is that this table references the wiki table via a foreign key on id, so deleting an article automatically removes its embeddings (ON DELETE CASCADE).
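To see what the chunking actually did to a single article, a query like the following can be used (a sketch; it assumes an article with this exact title exists in the dump):

-- list the chunks belonging to one article, in order
SELECT w.title,
       e.chunk_seq,
       left(e.chunk, 80) AS chunk_start
FROM   demo_wiki_emb_store AS e
       JOIN wiki AS w ON w.id = e.id
WHERE  w.title = 'PostgreSQL'
ORDER  BY e.chunk_seq;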
To make access to the data easier, pgai has created a simple-to-use view:
cybertec=# \d+ demo_wiki_emb
                           View "public.demo_wiki_emb"
     Column     |    Type     | Collation | Nullable | Default | Storage  | Description
----------------+-------------+-----------+----------+---------+----------+-------------
 embedding_uuid | uuid        |           |          |         | plain    |
 chunk_seq      | integer     |           |          |         | plain    |
 chunk          | text        |           |          |         | extended |
 embedding      | vector(384) |           |          |         | external |
 id             | text        |           |          |         | extended |
 url            | text        |           |          |         | extended |
 title          | text        |           |          |         | extended |
 text           | text        |           |          |         | extended |
View definition:
 SELECT t.embedding_uuid,
    t.chunk_seq,
    t.chunk,
    t.embedding,
    t.id,
    s.url,
    s.title,
    s.text
   FROM demo_wiki_emb_store t
     LEFT JOIN wiki s ON t.id = s.id;
What it does is join the raw data with those embeddings and expose everything as a single view, which makes interaction really straightforward. But let me draw your attention to another detail here:
cybertec=# explain SELECT count(*) FROM demo_wiki_emb;
                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=1105977.00..1105977.01 rows=1 width=8)
   ->  Gather  (cost=1105976.78..1105976.99 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=1104976.78..1104976.79 rows=1 width=8)
               ->  Parallel Index Only Scan using demo_wiki_emb_store_id_chunk_seq_key on demo_wiki_emb_store t  (cost=0.56..1062313.83 rows=17065183 width=0)
(5 rows)
What can we see here? In the previous listing, the view definition contains a join. So why is there no join in this execution plan? The answer is "join pruning": the optimizer has figured out that it can safely remove the "wiki" table from this operation, because the LEFT JOIN hits the primary key of "wiki" and therefore can never change the number of rows, so counting only requires the embedding store. If you want to learn more about this, one of our older blog posts explains the concept.
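To give you a first taste of what this view is ultimately good for, here is a minimal, hedged sketch of a similarity search. It assumes that the ai.ollama_embed function is available in your pgai installation and that the Ollama server hosting 'all-minilm' is reachable; the search phrase is arbitrary:

-- sketch: find the five chunks closest to a search phrase (cosine distance)
SELECT title,
       left(chunk, 120) AS snippet
FROM   demo_wiki_emb
ORDER  BY embedding <=> ai.ollama_embed('all-minilm', 'history of the steam engine')
LIMIT  5;

Without an index, a query like this has to scan all embeddings, which is exactly why indexing is one of the topics of the upcoming postings.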
In this post you have learned how to load Wikipedia into PostgreSQL quickly and how to turn those articles into vectors. The next postings will build on this and show which indexing options we have, how we can index quickly, and how we can actually make use of the data.