Indexing Wikipedia

10.2015

Wikipedia is one of the most relevant collections of text data on the Internet today. It consists of countless entries covering pretty much every conceivable topic, which makes it an interesting collection of text to play with. To prepare for my talk at the PostgreSQL conference in Vienna, we decided to import Wikipedia into PostgreSQL. The goal is to see which settings are most suitable for indexing Wikipedia, using the time needed to create a GIN index as the benchmark.

The structure of the table is pretty simple: we store the title, the raw content, and a filtered version of the content that no longer contains mark-up. Overall, the size of the table adds up to around 27 GB.
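
A table matching this description could look roughly like the sketch below. The table name, the column names and the surrogate key are hypothetical and chosen only for illustration; they are not taken from the original setup.

```sql
-- Hypothetical layout: names and the id column are assumptions.
CREATE TABLE wikipedia (
    id          bigserial PRIMARY KEY,  -- surrogate key (assumed)
    title       text NOT NULL,          -- article title
    raw_content text,                   -- article text including wiki mark-up
    filtered    text                    -- same text with mark-up stripped
);
```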

The size of the filtered column is as follows:
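
One way to measure this, sketched under the assumption that the table is called wikipedia and the column filtered (hypothetical names). pg_column_size reports the bytes actually stored, which reflects any TOAST compression, while octet_length reports the uncompressed length of the text.

```sql
-- Hypothetical table/column names; both sums scan the whole table.
SELECT pg_size_pretty(sum(pg_column_size(filtered))) AS stored_size,
       pg_size_pretty(sum(octet_length(filtered)))   AS uncompressed_size
FROM wikipedia;
```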

Wikipedia statistics

Given the very special nature of this corpus, it seems interesting to inspect the distribution of values inside those texts. PostgreSQL has some simple means to gather this information:
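
One built-in facility that fits this description is ts_stat, which takes a query returning a single tsvector column and reports, per lexeme, the number of documents and the number of occurrences. The sketch below assumes a hypothetical table wikipedia with a filtered column and the english text search configuration; the query that was actually measured may have differed.

```sql
-- Hypothetical names; 'english' is an assumed text search configuration.
-- ts_stat runs the inner query and aggregates lexeme statistics.
SELECT word, ndoc, nentry
FROM ts_stat($$SELECT to_tsvector('english', filtered) FROM wikipedia$$)
ORDER BY nentry DESC
LIMIT 10;
```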

While this query runs, a single CPU core working at full load reads only about 4-7 MB/sec from disk. This is far lower than what the test hardware can provide, so the bottleneck is the CPU, not the disk. The operation is surprisingly expensive.

Here is what vmstat says:

Indexing speed

Let's do some speed tests now. The first observation is that building the index directly on a function (an expression index) is incredibly slow.
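
An index on a function is typically declared as in the following sketch. The table, column and index names are hypothetical, and the english configuration is an assumption. With this form, to_tsvector has to be evaluated for every row while the index is being built.

```sql
-- Hypothetical names; the two-argument to_tsvector is immutable,
-- which is what allows it to be used in an index expression.
CREATE INDEX idx_wikipedia_fn
    ON wikipedia
    USING gin (to_tsvector('english', filtered));
```

In psql, switching on \timing before running the statement is a simple way to record how long the build takes.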

After materializing the tsvector column, things are a lot faster. Calculating the stemmed representation of the input accounts for a major share of the build time.
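
Materializing the tsvector means computing it once, storing it in its own column, and indexing that column. A minimal sketch, again with hypothetical table, column and index names and an assumed english configuration:

```sql
-- Hypothetical names. The UPDATE pays the stemming cost once;
-- the index build afterwards only has to read the stored tsvector.
ALTER TABLE wikipedia ADD COLUMN ts tsvector;

UPDATE wikipedia
   SET ts = to_tsvector('english', filtered);

CREATE INDEX idx_wikipedia_ts
    ON wikipedia
    USING gin (ts);
```

The trade-off is additional storage for the extra column, plus the need to keep it in sync when the text changes.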

Increasing maintenance_work_mem certainly helps when building a GIN index.

NOTE: This is not necessarily the case if you are using btree indexes: /adjusting-maintenance_work_mem/

Here is some performance data:

maintenance_work_mem    Index creation
1 MB                    4478 seconds
16 MB                   2307 seconds
64 MB                   1410 seconds
256 MB                  1173 seconds
1024 MB                 1171 seconds
4096 MB                 1160 seconds
12 GB                   1160 seconds

The build process speeds up dramatically with more memory: going from 1 MB to 256 MB cuts the build time from 4478 to 1173 seconds, almost a factor of four. However, it is interesting to see that the curve flattens out pretty soon. At 256 MB we already see pretty much the best we can expect. Beyond that, even 12 GB shaves off only another 13 seconds.
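
A test series like this can be repeated by changing the setting per session before each build. The sketch below assumes a hypothetical table wikipedia with an already materialized tsvector column ts and a hypothetical index name; the value 256MB is simply the point where the curve above flattens.

```sql
-- Hypothetical names; assumes column ts (tsvector) already exists.
DROP INDEX IF EXISTS idx_wikipedia_ts;

SET maintenance_work_mem = '256MB';  -- affects this session only

CREATE INDEX idx_wikipedia_ts
    ON wikipedia
    USING gin (ts);

RESET maintenance_work_mem;          -- back to the configured default
```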

It is interesting to note that we are CPU bound rather than I/O bound, so a faster I/O system would not speed things up any further.

The size of the index in PostgreSQL is roughly 3 GB:
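
The on-disk size of an index can be checked with pg_relation_size. The index name below is hypothetical; in psql, \di+ shows the same information in a listing.

```sql
-- Hypothetical index name; the string is cast to regclass implicitly.
SELECT pg_size_pretty(pg_relation_size('idx_wikipedia_ts')) AS index_size;
```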

In case you need any assistance, please feel free to contact us.