A “composite index”, also known as “concatenated index” or a combined index, is an index on multiple columns in a table. Many people wonder what is more beneficial: Using separate or using composite indexes? Whenever we do training, consulting or support this question is high up on the agenda and many people keep asking this question. Therefore, I decided to shed some light on this question.
Table of Contents
To discuss the topic on a more practical level, I created a table consisting of three columns. Then I loaded 1 million rows and added a composite index covering all three columns:
1 2 3 4 5 6 7 8 9 10 |
test=# CREATE TABLE t_data (a int, b int, c int); CREATE TABLE test=# INSERT INTO t_data SELECT random()*100000, random()*100000, random()*100000 FROM generate_series(1, 1000000); INSERT 0 1000000 test=# CREATE INDEX idx_data ON t_data(a, b, c); CREATE INDEX |
The layout of the table is therefore as follows:
1 2 3 4 5 6 7 8 9 |
test=# d t_data Table 'public.t_data' Column | Type | Collation | Nullable | Default --------+---------+-----------+----------+--------- a | integer | | | b | integer | | | c | integer | | | Indexes: 'idx_data' btree (a, b, c) |
Now let's run ANALYZE
to ensure that optimizer statistics are there. Usually, autovacuum will kick in and create statistics for your table -- but to make absolutely sure, running ANALYZE
does not hurt in this case.
1 2 |
test=# ANALYZE t_data; ANALYZE |
The first important thing to observe is that PostgreSQL will try to arrange the filters in your query for you. The following query will filter on all indexed columns:
1 2 3 4 5 6 7 8 9 10 11 |
test=# explain SELECT * FROM t_data WHERE c = 10 AND b = 20 AND a = 10; QUERY PLAN --------------------------------------------------- Index Only Scan using idx_data on t_data (cost=0.42..4.45 rows=1 width=12) Index Cond: ((a = 10) AND (b = 20) AND (c = 10)) (2 rows) |
As you can see, we filtered for c, b, & a, but the optimizer changed the order of those conditions and turned it into a, b, c to make sure that the index we created suits the query. There are some important things to learn here:
Also, keep in mind that PostgreSQL knows that equality is transitive and can infer conditions from that:
1 2 3 4 5 6 7 8 9 10 11 |
test=# explain SELECT * FROM t_data WHERE c = 10 AND b = a AND a = c; QUERY PLAN --------------------------------------------------- Index Only Scan using idx_data on t_data (cost=0.42..4.45 rows=1 width=12) Index Cond: ((a = 10) AND (b = 10) AND (c = 10)) (2 rows) |
What you can see here is that PostgreSQL figured out automatically that a, b and c are actually the same.
However, if you have a composite index, it isn't necessary to filter on all columns. It is also ok to use just the first or both the first and second field to filter on. Here is an example:
1 2 3 4 5 6 7 8 9 |
test=# explain SELECT * FROM t_data WHERE a = 10; QUERY PLAN ------------------------------------------ Index Only Scan using idx_data on t_data (cost=0.42..4.62 rows=11 width=12) Index Cond: (a = 10) (2 rows) |
As you can see, PostgreSQL can still use the same index. An index is a sorted list which happens to be ordered by three fields. In multi-column indexes, this ordering is a so-called &ldauo;lexicographical ordering”: the rows are intially sorted by the first index column. Rows with the same first column are sorted by the second column, and so on. It is perfectly fine to just make use of the first columns. Talking about sorted lists:
1 2 3 4 5 6 7 8 9 10 |
test=# explain SELECT * FROM t_data WHERE a = 10 ORDER BY b, c; QUERY PLAN ------------------------------------------- Index Only Scan using idx_data on t_data (cost=0.42..4.62 rows=11 width=12) Index Cond: (a = 10) (2 rows) |
An index can also provide you with sorted data. In this case, we filter on "a" and order by the remaining two columns, "b" and "c".
When DOESN'T a composite index help to speed up a query? Here is an example:
1 2 3 4 5 6 7 8 9 10 11 |
test=# explain SELECT * FROM t_data WHERE b = 10; QUERY PLAN --------------------------------------------------- Gather (cost=1000.00..11615.43 rows=11 width=12) Workers Planned: 2 -> Parallel Seq Scan on t_data (cost=0.00..10614.33 rows=5 width=12) Filter: (b = 10) (4 rows) |
In this case, we filter on the second column. Remember: A b-tree index is nothing more than a sorted list. In our case, it's sorted by a, b, c. So naturally, we cannot usefully filter on this field. However, in some rare cases it may be that you will still see an index scan. Some people see that as proof “that it does indeed work”. But you will see an index scan for the wrong reason: PostgreSQL is able to read an index completely – not to filter but to reduce the amount of I/O it takes to read a table. If a table has many columns, it might be faster to read the index than to digest the table.
1 2 |
test=# SET seq_page_cost TO 10000; SET |
By telling PostgreSQL that sequential I/O is more expensive than it really is, you will see that the optimizer decides on an index scan:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
test=# explain analyze SELECT * FROM t_data WHERE b = 10; QUERY PLAN -------------------------------------------------- Index Only Scan using idx_data on t_data (cost=0.42..22892.53 rows=11 width=12) (actual time=7.626..63.087 rows=8 loops=1) Index Cond: (b = 10) Heap Fetches: 0 Planning Time: 0.121 ms Execution Time: 63.122 ms (5 rows) |
However, the execution time is 63 milliseconds, which is A LOT more than if we had done this for the first column in the index.
Note that “Index Cond: (b = 10)
” means something different here than in the previous examples: while before we had index scan conditions, here we have an index filter condition. It's not ideal: the two look the same in the EXPLAIN
output.
PostgreSQL is able to use more than one index at the same time. This is especially important if you are using OR as shown in the next example:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
test=# SET seq_page_cost TO default; SET test=# explain SELECT * FROM t_data WHERE a = 4 OR a = 23232; QUERY PLAN ---------------------------------------------------- Bitmap Heap Scan on t_data (cost=9.03..93.15 rows=22 width=12) Recheck Cond: ((a = 4) OR (a = 23232)) -> BitmapOr (cost=9.03..9.03 rows=22 width=0) -> Bitmap Index Scan on idx_data (cost=0.00..4.51 rows=11 width=0) Index Cond: (a = 4) -> Bitmap Index Scan on idx_data (cost=0.00..4.51 rows=11 width=0) Index Cond: (a = 23232) (7 rows) |
What PostgreSQL will do is to decide on a bitmap scan. The idea is to first consult all the indexes to compile a list of rows / blocks, which then have to be fetched from the table (= heap). This is usually a lot better than a sequential scan. In my example, the same index is even used twice. However, in real life you might be more likely to see bitmap scans involving a variety of indexes.
In many cases, using too many indexes can even be counterproductive. The planner will make sure that enough indexes are chosen, but it won’t take too many. Let's take a look at the following example:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
test=# DROP INDEX idx_data; DROP INDEX test=# CREATE INDEX idx_a ON t_data (a); CREATE INDEX test=# CREATE INDEX idx_b ON t_data (b); CREATE INDEX test=# CREATE INDEX idx_c ON t_data (c); CREATE INDEX test=# d t_data Table 'public.t_data' Column | Type | Collation | Nullable | Default --------+---------+-----------+----------+--------- a | integer | | | b | integer | | | c | integer | | | Indexes: 'idx_a' btree (a) 'idx_b' btree (b) 'idx_c' btree (c) |
The following query filters on all columns again. This time, I didn't use a single combined index. Instead, I decided to use separate indexes, which is usually more inefficient. Here is what happens:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
test=# explain SELECT * FROM t_data WHERE a = 10 AND b = 20 AND c = 30; QUERY PLAN ---------------------------------------------------- Bitmap Heap Scan on t_data (cost=9.27..13.28 rows=1 width=12) Recheck Cond: ((c = 30) AND (b = 20)) Filter: (a = 10) -> BitmapAnd (cost=9.27..9.27 rows=1 width=0) -> Bitmap Index Scan on idx_c (cost=0.00..4.51 rows=11 width=0) Index Cond: (c = 30) -> Bitmap Index Scan on idx_b (cost=0.00..4.51 rows=11 width=0) Index Cond: (b = 20) (8 rows) |
Let's take a closer look at the plan. The query filters on 3 columns BUT PostgreSQL only decided on two indexes. Why is that the case? We have imported 1 million rows. Each column contains 100.000 distinct values, which means that every value occurs 10 times. What is the point in fetching a couple of rows from every index even if two indexes already narrow down the result sufficiently enough? This is exactly what happens.
Indexes are not only about filtering. It will also help you to mind the lowest and the highest value in a column as shown by the following SQL statement:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
test=# explain SELECT min(a), max(b) FROM t_data; QUERY PLAN ----------------------------------------------------------------------- Result (cost=0.91..0.92 rows=1 width=8) InitPlan 1 (returns $0) -> Limit (cost=0.42..0.45 rows=1 width=4) -> Index Only Scan using idx_a on t_data (cost=0.42..28496.42 rows=1000000 width=4) Index Cond: (a IS NOT NULL) InitPlan 2 (returns $1) -> Limit (cost=0.42..0.45 rows=1 width=4) -> Index Only Scan Backward using idx_b on t_data t_data_1 (cost=0.42..28496.42 rows=1000000 width=4) Index Cond: (b IS NOT NULL) (9 rows) |
As you can see, PostgreSQL is pretty sophisticated. If you are looking for good performance. it certainly makes sense to see how PostgreSQL handles indexes and review your code in order to speed thing up. If you want to learn more about performance, consider checking out Laurenz Albe’s post about speeding up count(*). Also, if you are not sure, why your database is slow, check out my post on PostgreSQL database performance, which explains, how to find slow queries.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.
+43 (0) 2622 93022-0
office@cybertec.at
You are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Hello,
Thank you for this interesting article!
I saw a small typo into paragraph: Using a subset of indexes in a single SQL statement
test=# CREATE INDEX idx_a ON t_data (b);
CREATE INDEX
test=# CREATE INDEX idx_a ON t_data (c);
CREATE INDEX
instead of
test=# CREATE INDEX idx_b ON t_data (b);
CREATE INDEX
test=# CREATE INDEX idx_c ON t_data (c);
CREATE INDEX
Thank you so much, your posts are so clear.
Hey there!
I've got a question about how a “composite index” works when trying to query with a subset of the columns in order. For example
CREATE INDEX idx_data ON t_data(a, b, c);
Then a query like...
SELECT *
FROM t_data
WHERE a = 10
AND c = 10;
Would that still benefit from the index created? (excuse me if this was already covered and I misunderstood.)
The query could use the index for the condition
a = 10
, but not forc = 10
, because there is no condition onb
.Imagine a phone book sorted by (last name, first name): you cannot use it efficiently to search for all people with first name "Matt".
Thanks a lot!