Indexing is the key to good performance. However, people often ask: Is there an alternative to btree indexing? Can we make indexes in PostgreSQL smaller? Can we create indexes in PostgreSQL faster? And how can we index in a data warehouse?
This blog will answer all those questions and show which options you have to index tables in a data warehouse. We're going to focus on the epic battle between btree (AKA B-tree) and BRIN indexes.
Unlike many other relational databases, PostgreSQL supports more than just one index type. The reason is that not all index types are suitable for the same kind of operation. An index that works fine for names does not necessarily work for GIS data, and vice versa. That's why PostgreSQL lets you choose which index type to use.
Here's how it works:
test=# SELECT * FROM pg_am;
 oid  | amname |      amhandler       | amtype
------+--------+----------------------+--------
    2 | heap   | heap_tableam_handler | t
  403 | btree  | bthandler            | i
  405 | hash   | hashhandler          | i
  783 | gist   | gisthandler          | i
 2742 | gin    | ginhandler           | i
 4000 | spgist | spghandler           | i
 3580 | brin   | brinhandler          | i
(7 rows)
The pg_am system table reveals which index types are supported by your system. Note that additional index access methods can be loaded as extensions, which means that the list you see is not necessarily constant - you might see more entries.
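If you want to see that in action, here is a minimal sketch using the bloom extension that ships with contrib (assuming contrib is installed on your system) - after loading it, an additional access method shows up in pg_am:

-- the bloom extension from contrib registers its own index access method
CREATE EXTENSION IF NOT EXISTS bloom;

-- list all index access methods known to this database
SELECT amname FROM pg_am WHERE amtype = 'i';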
Before we dive into the details of btree and BRIN indexes, it makes sense to load some sample data. The easiest way to do that is to make use of generate_series. It is the Swiss army knife of functions and what most PostgreSQL consultants use to generate huge tables for testing purposes:
test=# CREATE TABLE t_demo (id_sorted int8, id_random int8);
CREATE TABLE
test=# INSERT INTO t_demo
          SELECT id, hashtext(id::text)
          FROM generate_series(1, 5000000000) AS id;
INSERT 0 5000000000
In this case, we created 5 billion rows, which leads to a 206 GB table. The table has two columns: the first one is ascending - it contains numbers ranging from 1 to 5 billion.
test=# \d+
                                      List of relations
 Schema |  Name  | Type  | Owner | Persistence | Access method |  Size  | Description
--------+--------+-------+-------+-------------+---------------+--------+-------------
 public | t_demo | table | hs    | permanent   | heap          | 206 GB |
(1 row)
The second column contains a hash value. Take a look:
test=# SELECT * FROM t_demo LIMIT 10 OFFSET 50;
 id_sorted |  id_random
-----------+-------------
        51 |   342243648
        52 | -1711900017
        53 |   213436769
        54 |  2112769913
        55 | -1318351987
        56 |   -19937162
        57 |  -299365904
        58 | -1228416573
        59 |    93548776
        60 |  1491665190
(10 rows)
The point here is that the hash value is pretty much random and should be evenly distributed. It gives us the chance to see how various index types such as btree and BRIN behave in these two quite common use cases.
Let's create a simple btree. As a first step, we use the PostgreSQL default configuration (the only change I made was to set max_wal_size to 100 GB - all other settings are at their defaults):
test=# CREATE INDEX idx_id ON t_demo (id_sorted);
CREATE INDEX
Time: 2109651,552 ms (35:09,652)
However, we can do better. If we drop the index again, pump maintenance_work_mem up to 1 GB and increase the number of worker processes PostgreSQL is allowed to use, we can speed things up:
test=# SHOW maintenance_work_mem;
 maintenance_work_mem
----------------------
 64MB
(1 row)

Time: 0,170 ms
test=# SET maintenance_work_mem TO '1 GB';
SET
Time: 0,181 ms
test=# SET max_parallel_maintenance_workers TO 4;
SET
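The drop step mentioned above is not shown in the session output; it is nothing more than the following (the index name matches the one we created before):

-- remove the btree built with default settings so it can be rebuilt
DROP INDEX idx_id;

After that, the CREATE INDEX statement is simply executed again with the new settings. This is what the system looks like while it is working: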
[hs@hanspc user_test]$ vmstat 2
procs -----------memory---------- ---swap-- -----io---- -system--- ------cpu-----
 r  b   swpd   free   buff    cache   si   so      bi      bo    in    cs us sy id wa st
 5  0 450048 235408   1800 25689472    0    2    2964    6838    21    41  9  2 88  1  0
 5  0 450048 261876   1800 25650596    0    0 1225912       8 19425  8512 39  7 54  0  0
 7  0 450048 257548   1800 25660116    0    0  941458  277776 19766  7638 32  7 53  7  0
 1  5 450048 232476   1800 25690880    0    0  857176  411110 17791  7603 34  7 50  9  0
 5  0 450048 260136   1800 25663332    0    0  756074  485380 16005  7811 28  6 53 13  0
 3  4 450048 242340   1800 25679688    0    0  722454  666192 16785  8842 26 10 49 15  0
When looking at vmstat, we see that PostgreSQL starts to read hundreds of MB per second and, at the same time, spills hundreds of MB per second to disk. Since we cannot sort everything in memory, this is exactly what we expected to happen.
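If you prefer to observe the spill from inside PostgreSQL rather than via vmstat, one option - just a sketch, and note that the counters are cumulative for the whole database - is to watch the temporary-file statistics:

-- temporary files created by operations (such as sorts) that exceed work memory
SELECT temp_files, pg_size_pretty(temp_bytes) AS temp_size
FROM pg_stat_database
WHERE datname = 'test';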
During the first phase, the progress monitoring infrastructure reveals exactly what is happening:
test=# \x
Expanded display is on.
test=# SELECT * FROM pg_stat_progress_create_index;
-[ RECORD 1 ]------+-------------------------------
pid                | 23147
datid              | 24576
datname            | test
relid              | 24583
index_relid        | 0
command            | CREATE INDEX
phase              | building index: scanning table
lockers_total      | 0
lockers_done       | 0
current_locker_pid | 0
blocks_total       | 27027028
blocks_done        | 8577632
tuples_total       | 0
tuples_done        | 0
partitions_total   | 0
partitions_done    | 0
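If all you want is a rough percentage, a small query on top of the same view does the trick (a sketch; blocks_total refers to the current phase):

-- progress of the running CREATE INDEX as a percentage of blocks processed
SELECT pid,
       phase,
       round(100.0 * blocks_done / nullif(blocks_total, 0), 1) AS pct_done
FROM pg_stat_progress_create_index;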
In the process table we can see that all desired cores are working at full speed to make this happen:
  PID USER   PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+   COMMAND
38830 hs     20   0  594336 219780  12152 R  99,0  0,7    0:06.65 postgres
38831 hs     20   0  594336 219748  12120 R  99,0  0,7    0:06.65 postgres
38833 hs     20   0  594336 219920  12288 R  98,7  0,7    0:06.64 postgres
38832 hs     20   0  594336 219916  12284 R  98,3  0,7    0:06.63 postgres
23147 hs     20   0  514964 276408 149056 R  97,7  0,8  106:48.69 postgres
Remember, the table is too big to make this happen in memory, so we had to spill to disk:
-[ RECORD 1 ]------+-------------------------------------
pid                | 23147
datid              | 24576
datname            | test
relid              | 24583
index_relid        | 0
command            | CREATE INDEX
phase              | building index: sorting live tuples
lockers_total      | 0
lockers_done       | 0
current_locker_pid | 0
blocks_total       | 27027028
blocks_done        | 27027028
tuples_total       | 0
tuples_done        | 0
partitions_total   | 0
partitions_done    | 0
Once sorting is done, the final phase loads the sorted tuples into the tree:

-[ RECORD 1 ]------+---------------------------------------
pid                | 23147
datid              | 24576
datname            | test
relid              | 24583
index_relid        | 0
command            | CREATE INDEX
phase              | building index: loading tuples in tree
lockers_total      | 0
lockers_done       | 0
current_locker_pid | 0
blocks_total       | 0
blocks_done        | 0
tuples_total       | 5000000000
tuples_done        | 2005424786
partitions_total   | 0
partitions_done    | 0
With more memory and additional parallel workers, the same index is built considerably faster than with the default settings:

test=# CREATE INDEX idx_id ON t_demo (id_sorted);
CREATE INDEX
Time: 1212601,300 ms (20:12,601)
If you want to know more about indexes, I can also recommend one of my other blog posts dealing with faster index creation in PostgreSQL.
What is noteworthy here is the size of the index we've just created:
test=# \di+
                                          List of relations
 Schema |  Name  | Type  | Owner | Table  | Persistence | Access method |  Size  | Description
--------+--------+-------+-------+--------+-------------+---------------+--------+-------------
 public | idx_id | index | hs    | t_demo | permanent   | btree         | 105 GB |
(1 row)
105 GB is pretty massive. However, it basically confirms the following rule of thumb: usually an index needs around 25 bytes per index entry - assuming we do not have any duplicate entries.
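To check the rule of thumb against the index we just built, you can divide the index size by the number of entries (a sketch - the exact figure also depends on page overhead and fillfactor):

-- average number of bytes of index space per indexed row
SELECT pg_relation_size('idx_id') / 5000000000.0 AS bytes_per_entry;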
test=# CREATE INDEX idx_random ON t_demo (id_random);
CREATE INDEX
Time: 1731088,437 ms (28:51,088)
In this case, we have indexed the randomized column. As our input stream is not sorted anymore it takes a bit longer to create this index.
test=# SELECT count(*)
       FROM t_demo
       WHERE id_sorted BETWEEN 10000000 AND 20000000;
  count
----------
 10000001
(1 row)

Time: 358,804 ms
We managed to extract 10 million rows in just 358 milliseconds. Let's look deeper into the execution plan and see what happens:
test=# explain (analyze, buffers) SELECT count(*)
       FROM t_demo
       WHERE id_sorted BETWEEN 10000000 AND 20000000;
                                 QUERY PLAN
---------------------------------------------------------------------
 Finalize Aggregate  (cost=296509.46..296509.47 rows=1 width=8) (actual time=496.450..500.609 rows=1 loops=1)
   Buffers: shared hit=11 read=27325
   ->  Gather  (cost=296509.24..296509.45 rows=2 width=8) (actual time=496.379..500.603 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=11 read=27325
         ->  Partial Aggregate  (cost=295509.24..295509.25 rows=1 width=8) (actual time=492.621..492.622 rows=1 loops=3)
               Buffers: shared hit=11 read=27325
               ->  Parallel Index Only Scan using idx_id on t_demo  (cost=0.58..284966.32 rows=4217169 width=0) (actual time=0.074..340.666 rows=3333334 loops=3)
                     Index Cond: ((id_sorted >= 10000000) AND (id_sorted <= 20000000))
                     Heap Fetches: 0
                     Buffers: shared hit=11 read=27325
 Planning:
   Buffers: shared hit=7 read=3
 Planning Time: 0.233 ms
 Execution Time: 500.662 ms
(16 rows)
The plan shows a parallel index-only scan, and that is good news: it is enough to scan the index - there is no need to fetch the rows from the underlying table. All PostgreSQL has to verify is that the rows are actually visible, which is in this case done through the visibility map (hence Heap Fetches: 0).
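How many heap fetches an index-only scan needs depends on how much of the table is marked all-visible in the visibility map. A quick way to get a feeling for that - a sketch based on planner statistics, so the numbers are only estimates - is:

-- estimated fraction of heap pages that are marked all-visible
SELECT relpages,
       relallvisible,
       round(100.0 * relallvisible / nullif(relpages, 0), 1) AS pct_all_visible
FROM pg_class
WHERE relname = 't_demo';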
Note that I reduced the range a bit to ensure that we also get roughly 10 million rows: hashtext() returns a 32-bit integer, so the random values are spread over a range of about 4.3 billion, and a window 9 million wide therefore matches roughly 5 billion × 9 million / 4.3 billion ≈ 10.5 million rows. The size of the result set is thus roughly identical:
test=# explain SELECT count(*)
       FROM t_demo
       WHERE id_random BETWEEN 10000000 AND 19000000;
                             QUERY PLAN
-------------------------------------------------------------------
 Aggregate  (cost=22871994.27..22871994.28 rows=1 width=8)
   ->  Index Only Scan using idx_random on t_demo  (cost=0.58..22846435.29 rows=10223592 width=0)
         Index Cond: ((id_random >= 10000000) AND (id_random <= 19000000))
(3 rows)
The parallel plan using more than one core is gone. But it gets even worse - see what happens during the query:
[hs@hanspc user_test]$ vmstat 2
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b    swpd   free   buff    cache   si   so     bi    bo    in    cs us sy id wa st
 0  1 1377280 228168   1032 27286216    0    0 120584   314 17933 30276  2  5 86  7  0
 1  0 1377280 231608   1032 27286224    0    0 121102  1069 17839 30854  2  3 88  7  0
 1  1 1377280 251400   1032 27270360    0    0 121168   869 17669 30177  1  4 88  7  0
 0  1 1377280 263096   1032 27254396    0    0 121674   514 17780 30695  2  2 88  7  0
 0  1 1377280 236788   1032 27275712    0    0 121630   260 18403 31970  2  2 88  7  0
Wow, we see 120 MB/sec of sustained I/O, which translates to a horrible runtime.
Note that the amount of data is the same - all that has changed is the distribution of data on disk:
test=# SELECT count(*)
       FROM t_demo
       WHERE id_random BETWEEN 10000000 AND 19000000;
  count
----------
 10481155
(1 row)

Time: 322747,613 ms (05:22,748)
We can figure out why:
test=# explain (analyze, buffers) SELECT count(*)
       FROM t_demo
       WHERE id_random BETWEEN 10000000 AND 19000000;
                             QUERY PLAN
-------------------------------------------------------------------
 Aggregate  (cost=22871994.27..22871994.28 rows=1 width=8) (actual time=292381.841..292381.843 rows=1 loops=1)
   Buffers: shared hit=6611938 read=3884860
   ->  Index Only Scan using idx_random on t_demo  (cost=0.58..22846435.29 rows=10223592 width=0) (actual time=0.515..291336.908 rows=10481155 loops=1)
         Index Cond: ((id_random >= 10000000) AND (id_random <= 19000000))
         Heap Fetches: 3866504
         Buffers: shared hit=6611938 read=3884860
 Planning Time: 1.080 ms
 Execution Time: 292381.934 ms
(8 rows)

Time: 292384,903 ms (04:52,385)
Data is spread all over the place, so we need millions of 8k pages to answer the query: the plan shows roughly 6.6 million buffer hits plus 3.9 million reads for just 10.5 million rows - close to one page per row - and in each of those blocks we only need a small subset of the data. That is what causes the substantial runtime.
BRIN indexes are often seen as a savior for warehousing applications. However, you should not be so sure about that.
test=# CREATE INDEX idx_brin ON t_demo USING brin (id_sorted);
CREATE INDEX
Time: 710549,637 ms (11:50,550)
The index is created really quickly. What is also extremely convincing is the size of the BRIN index:
test=# SELECT pg_size_pretty(pg_relation_size('idx_brin'));
 pg_size_pretty
----------------
 7064 kB
(1 row)
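A BRIN index stores only one summary entry per range of heap pages (128 pages by default), which is why it stays this small. If you want to trade a somewhat larger index for more precise ranges, the pages_per_range storage parameter can be lowered - the following is just a sketch with a hypothetical index name, not something benchmarked in this post:

-- smaller ranges mean more index entries, but fewer heap blocks to recheck per match
CREATE INDEX idx_brin_fine ON t_demo USING brin (id_sorted) WITH (pages_per_range = 32);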
What we can also see during index creation is that the process looks slightly different:
test=# SELECT * FROM pg_stat_progress_create_index;
-[ RECORD 1 ]------+---------------
pid                | 23147
datid              | 24576
datname            | test
relid              | 24583
index_relid        | 0
command            | CREATE INDEX
phase              | building index
lockers_total      | 0
lockers_done       | 0
current_locker_pid | 0
blocks_total       | 27027028
blocks_done        | 16898015
tuples_total       | 0
tuples_done        | 0
partitions_total   | 0
partitions_done    | 0
Building a BRIN index is not as extensive a sorting exercise as building a normal btree. This is important because it translates to less temporary I/O and faster index creation.
test=# explain (analyze, buffers) SELECT count(*)
       FROM t_demo
       WHERE id_sorted BETWEEN 10000000 AND 20000000;
                                    QUERY PLAN
------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=49930967.53..49930967.54 rows=1 width=8) (actual time=664.605..671.519 rows=1 loops=1)
   Buffers: shared hit=311 read=55025
   ->  Gather  (cost=49930967.32..49930967.53 rows=2 width=8) (actual time=664.387..671.502 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=311 read=55025
         ->  Partial Aggregate  (cost=49929967.32..49929967.33 rows=1 width=8) (actual time=653.953..653.955 rows=1 loops=3)
               Buffers: shared hit=311 read=55025
               ->  Parallel Bitmap Heap Scan on t_demo  (cost=5390.21..49920339.39 rows=3851169 width=0) (actual time=56.944..489.563 rows=3333334 loops=3)
                     Recheck Cond: ((id_sorted >= 10000000) AND (id_sorted <= 20000000))
                     Rows Removed by Index Recheck: 5546
                     Heap Blocks: lossy=19046
                     Buffers: shared hit=311 read=55025
                     ->  Bitmap Index Scan on idx_brin  (cost=0.00..3079.51 rows=9258865 width=0) (actual time=65.293..65.294 rows=541440 loops=1)
                           Index Cond: ((id_sorted >= 10000000) AND (id_sorted <= 20000000))
                           Buffers: shared hit=311 read=881
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.210 ms
 Execution Time: 671.623 ms
(20 rows)
The btree managed to run the query with just 27,000 blocks - now we need around 55,000 - which translates to more runtime.
While a BRIN index is usually a good idea for such use cases, it's not always a magic bullet. It does need significantly less space and it can be built faster, but it does not magically fix all cases under all circumstances.
Please keep in mind that I AM NOT SAYING that BRIN is bad - it's usually a speedup compared to btree indexes. What I am saying is that you should compare various setups and decide what is best. Also: Keep in mind that correlation does matter.
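Physical correlation is exactly what the planner's statistics expose, so you can check it for our two columns directly (a sketch; it assumes the table has been analyzed, and values close to 1 or -1 mean the column is stored in almost sorted order, while values close to 0 mean the data is scattered across the table):

-- correlation between physical row order and logical column order
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 't_demo';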
To find out about reducing B-tree index bloat, check out Laurenz' post.
If you want to learn more about PostgreSQL and if you want to read something about GIS data right now I can highly recommend Florian Nadler’s post dealing with historical data and MobilityDB.
Thanks for this article.
I am not clear why using idx_random requires more disk reads than idx_id, considering the keys are inserted already sorted. Also, how does the planner know whether it can do a parallel index scan or not?