CYBERTEC PostgreSQL Logo

Trying out Postgres Bloom indexes

12.2016 / Category: / Tags: |

By Kaarel Moppel - First - if the word “Bloom” doesn’t yet resonate with you in Postgres context, I’d recommend to read my previous introductory post on the topic here. This one is a follow-up. I aim to set up a simple demo use case to exemplify “fields of application” for Bloom, and to gauge pros/cons/performance in comparison to B-tree.

So to get started, we first need some kind of “close to real life” test case to visualize and remember the problem better in our heads. As explained in the previous post - for Bloom to make sense, we need a query/table with many columns which Postgres queries in random combinations... but with only integer/text data types and equality operators, the options are a bit limited. After scrolling around a bit, I couldn’t spot any suitable test sets (for example from here), so I decided to look at the ceiling and pictured a simple table from an online car dealership domain, where Bloom would be a good fit, in addition to being easily digestible for our brains.

The online car dealership

You'll find the schema (for a single table) below, with the amount of distinct values in the comment. Choosing a domain where values are not too distinct sets a good example, since B-tree indexes are not too effective there. We hope to benefit from Bloom. Also note it is a well-normalized table - we’re only dealing with numeric type codes and natural values so we can generate test data easily with random brand_codes from 0..199 etc. As size for test data - I decided to generate 5 million test rows (cars listed for sale, which could be a realistic number for bigger countries), giving us a 365 MB table which fits nicely into shared buffers, and removes disk access factor from the tests.

Building the indexes

Again – idea for our “application” was that users issue queries using equality filters on many columns, meaning all columns need to be indexed for fast response times. So we index separately all columns with B-tree and then one Bloom index for all columns.

And the resulting index sizes:

All B-tree indexes yield 107 MB, totalling to 963 MB, and the Bloom index yields 77 MB. Pretty impressive already - remember, they should basically provide comparable outcomes for our queries!

Running a test query

Now for the interesting part – will queries profit from the Bloom index? Note that here I’m imposing some behaviour on our “users” for my contrived use case -  I will not use all 10 columns and leave out brand/model from the query as together they’re too selective, probably giving no chance to the “lossy” Bloom index. This could be translated so that our “users” are being pragmatic, knowing what  features they would want for their cars and set on utility first, not a flashy brand name 🙂 Note that selected code values are randomly chosen, but since we have a random distribution with limited distinct values, playing with different numbers didn’t change the outcome.

Let's see what happens:

* Run 1.

To my surprise, the B-tree indexes were still used with a 130 ms average runtime for 5 runs.

Hmm. Here we can see that Postgres is smart enough to pick the most selective indexes and does a bitmap merge on only three indexes. Ok, it seems that to test out Bloom we still need to disable/delete the normal indexes.

* Run 2.

Hurray, finally some progress after we disabled the indexes! Seems that Bloom in our use case actually is a better choice – it is 2x faster and ~10x more space efficient! The only sad thing is that Postgres cannot correctly estimate that yet, so we need to point it out ourselves.

But why is that?

Because of the cost estimation mechanics of the Postgres planner. We can see that total cost (penalty points calculated for every operation) is bigger for the Bloom case (29K vs 139K)... albeit Bloom is faster in real time. And main costs come from Bitmap Index Scan on the Bloom index and the fact that currently the whole index needs to be scanned and all the 5 million signatures compared! One could of course fine tune parameters like cpu_tuple_cost etc. so that the planner would give Bloom the advantage, but that’s another topic and could heavily affect other well-working queries. For comparison by the way, a full table scan for our query (set enable_bitmapscan to off) takes about 500ms.

Tuning Bloom signature length

So after the first “feedback” from the Bloom index, I decided to see what the practical outcome of changing the signature length is. I tried out a relatively small value of 32 and a bit bigger one of 128, which yielded sizes of 48 MB and 106 MB (5 Mio rows). Note that the index size really doesn't change linearly with the signature size, since physical location info (6 bytes) takes a considerable toll for every Bloom index entry.

After executing the same EXPLAIN ANALYZE from above (not including output here for brevity), Postgres funnily enough decided to use the less precise (32 bits) index here, although we have a better (more precise) one available. To test my theory of improvement, I dropped the 32-bit one and indeed - runtime was decreased 2x, 137ms vs 65ms! Less heap pages to visit and tuples to recheck. Again Postgres could be a bit cleverer here, probably it just uses the smallest available index covering the queried columns as with B-tree.

Summary

Our example use case was not too exhaustive, but hopefully close enough to real life so that you get an idea when to consider Bloom, what to take into account, and what the pros and cons are.

A list of things to remember about Bloom indexes:

  • Perfect for “checkbox” type of applications – lots of columns with little varying data (and lots of data rows). One could draw parallels to Oracle bitmap index use cases.
  • Index is lossy, meaning it may give false positives and instead of real column values, in the index there are signatures linked to tables via CTID-s as usual.
  • Be ready for some trial and error with Bloom - finding out optimal columns to include and signature length. But when applied appropriately one can easily gain 10x benefit (aggregating space and speed).
  • Currently only equality operations and int4 and text (a.k.a. varchar as they're the same) data types are supported.
  • For 5m rows with 10 columns the default signature length of 80 bits was sufficient (limited nr. of  distinct values) and performed better than individual B-tree indexes -- at the same time, it was a lot smaller!
  • There’s probably some room for improvement in planner cost estimates for Bloom, so currently there’s no point to have both B-tree and Bloom indexes, since B-tree is preferred.
  • What about unique values? ERROR:  access method "bloom" does not support unique indexes…
  • What about NULL-s? NULL-s are not indexed so all queries like “WHERE col IS NULL” can’t benefit from Bloom indexes.
  • Bloom is WAL logged and crash-safe (unlike Hash indexes), so you can rely on it.
  • Changing individual bit lengths is probably not worth the effort. It only makes sense for special cases where one of the columns has a lot of distinct values and others very few. Then in theory increasing bits for the distinct column should give you less rows thrown away during rechecking.
  • When specifying index signature bit length (... WITH (length=X)), bits are actually rounded up to full words (16 bits). Thus “length=33” bits ends up as “length=48” bits.

 

2 responses to “Trying out Postgres Bloom indexes”

    • Hi, hard to say for sure as "data warehouse" is a very wide term...but definitely worth the investigation in my opinion - it could save you a lot of disk space on indexes:)

Leave a Reply

Your email address will not be published. Required fields are marked *

CYBERTEC Logo white
Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram