
pgai: Importing Wikipedia into PostgreSQL

04.2025

When dealing with AI, one of the key questions is: where can I find data to experiment with? While there are many places on the internet to satisfy the need for test data, Wikipedia is certainly one of the most interesting candidates to get you started. So: how can we import Wikipedia into PostgreSQL and make use of it? Here is how it works.

Using pgai to load Wikipedia

By far one of the most useful extensions out there is pgai, a massive toolbox for Artificial Intelligence in PostgreSQL. Luckily, it also contains an easy way to load Wikipedia into the database.

Once you have installed pgai on your system or launched the ready-made containers as shown on GitHub, you can enable the extension inside your database:
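A minimal sketch, assuming a standard pgai installation where the extension is simply called "ai" and CASCADE pulls in its dependencies such as pgvector:

    -- enable pgai; CASCADE also installs required extensions such as pgvector
    CREATE EXTENSION IF NOT EXISTS ai CASCADE;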

All we have to do here is call the ai.load_dataset function, which has a ton of useful parameters we can utilize:
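The exact parameter list can vary between pgai releases, so the quickest way to see what your installation offers is to ask psql directly:

    -- show the signature of the loader function shipped with your pgai version
    \df ai.load_dataset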

The first parameter tells the function which data set to aim for, and the second one selects the incarnation of it (in the case of Wikipedia, a specific snapshot and language). Those first two parameters are really important. Optionally, we can define which target table we want to load the data into and how to behave when that table already exists. Also, we often don't need lots and lots of rows, so we can limit how much data we want to load. Obviously, it takes a fair amount of time to fetch all of this data, and you might not always need a big data set to get going with Wikipedia and PostgreSQL.

So, let us try it out and see what happens. The following snippet contains a sample call that already imports some data:
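Treat the following as a sketch rather than a literal recipe: the 'wikimedia/wikipedia' data set with the '20231101.en' configuration is the English Wikipedia dump on Hugging Face, and the parameter names follow pgai's documented interface, which may differ slightly between versions:

    SELECT ai.load_dataset(
        'wikimedia/wikipedia',          -- which data set to aim for
        '20231101.en',                  -- which incarnation (snapshot / language)
        table_name      => 'wiki',      -- target table to create and fill
        if_table_exists => 'append',    -- behaviour if the table is already there
        batch_size      => 10000        -- rows fetched and committed per batch
        -- max_batches  => 100          -- uncomment to cap the amount of data loaded
    );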

If this command executes successfully, we can already enjoy our freshly created table:
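A quick, hypothetical sanity check might look like this (the column names follow the description below):

    -- how many articles did we load, and what do they look like?
    SELECT count(*) FROM wiki;
    SELECT url, title, substr(text, 1, 80) AS snippet FROM wiki LIMIT 3;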

As you can see, we have around 6.4 million rows that contain a URL, a title, and the text of the site.

Now, all this data is interesting for various purposes. However, we are far from done. If you want to engage in machine learning, the next step is to use something known as "embedding", which is also nicely taken care of by pgai.

As already shown in our previous posting about semantic search, it is important to turn those texts into vectors, which can be understood by the famous and highly adored pgvector extension. pgai provides us with a nice way to do this. Embedding millions of documents is not something that can be done in a couple of seconds, so pgai runs this work in chunks.

In order to do this, we have to run a command:
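The vectorizer API has changed a bit between pgai releases, so the following is a sketch rather than the definitive call; the destination name demo_wiki_emb and the all-minilm model match the description below, while the loading and chunking settings are assumptions:

    SELECT ai.create_vectorizer(
        'wiki'::regclass,                                       -- table to embed
        destination => 'demo_wiki_emb',                         -- name of the resulting view
        loading     => ai.loading_column('text'),               -- column that holds the text
        embedding   => ai.embedding_ollama('all-minilm', 384),  -- model and vector dimension
        chunking    => ai.chunking_recursive_character_text_splitter()
    );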

What we have done is to invoke the ai.create_vectorizer function. It is asked to embed the content of the "wiki" table. The vector data is supposed to be accessible through a view named "demo_wiki_emb". The next step is to tell the system which model to use for the embedding. In our case, I have used a fairly small model called "all-minilm". However, there are various such models that can be invoked:
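For example, the following embedding models are available through Ollama at the time of writing and can simply be pulled onto your local server before use (examples only; check the Ollama model library for the current list):

    # pull a couple of embedding models onto the local Ollama server
    ollama pull all-minilm
    ollama pull nomic-embed-text
    ollama pull mxbai-embed-large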

More information can be found on the Ollama website. Just decide what works best for you and run it locally on your Ollama server.

The next thing we can do now is to track the progress of our embedding process. Here is how it works:
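Assuming the standard catalog objects pgai creates, there is an approximate and an exact way to check (names and signatures as documented for recent pgai releases):

    -- approximate overview of all vectorizers and their pending items
    SELECT * FROM ai.vectorizer_status;

    -- exact number of items still waiting in the queue of vectorizer 1
    SELECT ai.vectorizer_queue_pending(1, exact_count => true);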

There are two ways to actually inspect what is going on inside PostgreSQL. Personally, I prefer the exact version, but both options are fine. 

However, there is a problem here: Nothing happens. No progress is made. The reason is that we have to launch a Python program that actually does most of the heavy lifting. The easiest way to do this works as follows:
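A minimal sketch, assuming the pgai command line tool is installed; the connection string is a placeholder and flag spellings may differ slightly between releases:

    # run the vectorizer worker against the target database
    pgai vectorizer worker -d "postgres://user:password@localhost:5432/demo"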

This is the most simplistic way to get the process going and we can already see some progress. For small amounts of data this should be fine, but running this on huge data sets can take, well, long - really long.

One way to approach this is to do things in parallel: 
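Sticking with the hypothetical connection string from above, the same worker can be started with several concurrent embedding processes:

    # -c controls how many items are processed concurrently
    pgai vectorizer worker -d "postgres://user:password@localhost:5432/demo" -c 8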

The solution to the problem is found in the "-c" option, which adds some parallelism to the process. If you want to run this type of operation faster, this is definitely helpful.

After some time our queue should be empty:
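Checking again with the exact variant from before should eventually report zero pending items:

    SELECT ai.vectorizer_queue_pending(1, exact_count => true);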

Inspecting the result

Once the data has been imported, we can check the result:
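In psql, one way to do that is to look at the objects pgai created and their size on disk; the _store suffix is pgai's usual naming convention, so the exact names are an assumption based on the destination chosen above:

    -- the backing table holding the embeddings, including its size
    \dt+ demo_wiki_emb_store

    -- the convenience view exposing raw data plus vectors
    \dv demo_wiki_emb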

What we see here is that pgai has produced a table and a view that allow us to conveniently read the data (= raw data + vectors). We can also see that the amount of data generated by our model is substantial.

The data returned by the process can be found in this table (as defined by the function call before):
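Assuming the naming convention above, its structure can be inspected like this:

    -- typically contains the chunk, its embedding vector, and a key
    -- pointing back to the wiki table
    \d demo_wiki_emb_store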

What we can see here is that this table references the wiki table.
To make access to the data easier, pgai has created a simple-to-use view:
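Again assuming the names from above, its definition can be displayed directly in psql:

    -- show the view definition; it joins demo_wiki_emb_store with wiki
    \d+ demo_wiki_emb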

What it does is join the raw data with those embeddings and expose everything as a single view, which makes interaction really straightforward. But let me focus your attention on another detail here:
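Consider the execution plan of a query that only asks for the embedding side of the view (the column names here follow pgai's usual view layout and are assumptions):

    -- only columns coming from the store table are requested, so the
    -- resulting plan scans demo_wiki_emb_store alone - the wiki table is gone
    EXPLAIN SELECT chunk, embedding FROM demo_wiki_emb LIMIT 10;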

What can we see here? In the previous listing, the view definition shows a join. But why can we not see a join in this execution plan? The answer is "join pruning": the optimizer has figured out that it can safely remove the "wiki" table from this operation. If you want to learn more about this, one of our older blog posts explains the concept.

Further reading

In this post you have learned how to quickly import Wikipedia into PostgreSQL and how to turn those articles into vectors. The next postings will build on this and show which indexing options we have, how we can index quickly, and how we can actually make use of the data.
