A lot has been written about effective_cache_size in postgresql.conf and about PostgreSQL performance in general. However, few people know what this famous parameter really does. Let me share some more insights.
The idea behind SQL is actually quite simple: The end user sends a query and the optimizer is supposed to find the best strategy to execute it. The output of the optimizer is what people call an “execution plan”. The question now is: What makes one execution plan better than some other plan? In PostgreSQL everything boils down to the concept of “costs”. The planner assigns costs to every operation, and at the end of the day the cheapest plan is selected and executed.
The magic is therefore in the way the optimizer handles costs. That is exactly what effective_cache_size is all about.
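If you want to see those costs with your own eyes, simply prepend EXPLAIN to any query. Here is a minimal sketch using a hypothetical throwaway table t (the numbers are simply what a default installation estimates for a brand-new, never-analyzed table):

test=# CREATE TABLE t (id int);
CREATE TABLE
test=# EXPLAIN SELECT * FROM t;
                     QUERY PLAN
-----------------------------------------------------
 Seq Scan on t  (cost=0.00..35.50 rows=2550 width=4)
(1 row)

The first number is the startup cost (the cost incurred before the first row can be returned), the second one is the total cost of the plan - and it is those totals that are compared when the cheapest plan is picked.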
To achieve good performance it is important to figure out whether to use an index or not. A question often asked is: Why not always use an index? Traversing an index might not be cheap at all, and using an index does not mean that there is no need to touch the table as well. The optimizer therefore has to decide whether to go for an index or not.
The way costs are estimated depends on various factors: the amount of I/O needed, the number of operators called, the number of tuples processed, selectivity, and a lot more. However, what is the cost of I/O? Obviously it makes a difference whether data is already in cache or has to be read from disk. That brings us to the idea behind effective_cache_size, which tells the optimizer how much cache to expect in the system. The important part is that “cache” is not only the amount of memory PostgreSQL itself knows about (this part is pretty clear). The system also has to consider the size of the filesystem cache and so on. effective_cache_size is the sum of all those caching components. What you will learn in this post is how the optimizer uses this kind of information.
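As a quick aside: you can check and change the parameter at any time - no restart is required. On a stock installation the default is a fairly conservative 4GB:

test=# SHOW effective_cache_size;
 effective_cache_size
----------------------
 4GB
(1 row)

test=# SET effective_cache_size TO '8 GB';  -- session-level change, handy for experiments
SET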
Before we lose ourselves in theoretical explanations it makes sense to dig into a practical example. For the purpose of this blog post I have generated two tables:
test=# CREATE TABLE t_random AS
           SELECT id, random() AS r
           FROM generate_series(1, 1000000) AS id
           ORDER BY random();
SELECT 1000000
test=# CREATE TABLE t_ordered AS
           SELECT id, random() AS r
           FROM generate_series(1, 1000000) AS id;
SELECT 1000000
test=# CREATE INDEX idx_random ON t_random (id);
CREATE INDEX
test=# CREATE INDEX idx_ordered ON t_ordered (id);
CREATE INDEX
test=# VACUUM ANALYZE;
VACUUM
Mind that both tables contain the same set of data. One table is stored in order on disk - the other one is not. Let us set effective_cache_size to a really small value; the optimizer will then assume that there is really not much memory in the system:
test=# SET effective_cache_size TO '1 MB';
SET
test=# SET enable_bitmapscan TO off;
SET
test=# explain SELECT * FROM t_random WHERE id < 1000;
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Index Scan using idx_random on t_random  (cost=0.42..3611.96 rows=909 width=12)
   Index Cond: (id < 1000)
(2 rows)
Normally PostgreSQL would go for a bitmap index scan here, but we want to see what happens in case of a plain index scan. Therefore we turn bitmap scans off (= making them insanely expensive to the optimizer, so they are practically never chosen).
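Just for reference: with bitmap scans still enabled, the plan for the same query would look roughly like this (the cost numbers shown here are merely illustrative - they will differ on your machine):

test=# SET enable_bitmapscan TO on;
SET
test=# explain SELECT * FROM t_random WHERE id < 1000;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Bitmap Heap Scan on t_random  (cost=19.47..2163.00 rows=909 width=12)
   Recheck Cond: (id < 1000)
   ->  Bitmap Index Scan on idx_random  (cost=0.00..19.24 rows=909 width=0)
         Index Cond: (id < 1000)
(4 rows)
test=# SET enable_bitmapscan TO off;  -- back to the setup used in this post
SET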
Let us now raise effective_cache_size to a really large value and compare the resulting plan with the one we have just seen:
test=# SET effective_cache_size TO '1000 GB';
SET
test=# explain SELECT * FROM t_random WHERE id < 1000;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Index Scan using idx_random on t_random  (cost=0.42..3383.99 rows=909 width=12)
   Index Cond: (id < 1000)
(2 rows)
As you can see, the cost of the index scan has gone down. Why is that relevant? Costs have to be seen as “relative”: the absolute number is not important - what matters is how expensive a plan is compared to some other plan. If the cost of a sequential scan stays the same and the cost of an index scan goes down relative to it, PostgreSQL will favor index scans more often than it otherwise would. This is exactly what effective_cache_size is all about at its core: making index scans more likely if there is a lot of RAM around.
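You can easily convince yourself that the sequential scan side of the equation is untouched: effective_cache_size only enters the cost model for index scans. A full scan of t_random is estimated at the same cost under both settings (the exact numbers below are illustrative and depend on the physical size of the table):

test=# EXPLAIN SELECT * FROM t_random;
                            QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on t_random  (cost=0.00..15406.00 rows=1000000 width=12)
(1 row)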
When people talk about postgresql.conf and effective_cache_size they are often not aware of the fact that the parameter does not always work miracles. The following scenario shows when there is no impact:
test=# SET effective_cache_size TO '1 MB';
SET
test=# explain SELECT * FROM t_ordered WHERE id < 1000;
                                     QUERY PLAN
----------------------------------------------------------------------------------
 Index Scan using idx_ordered on t_ordered  (cost=0.42..39.17 rows=1014 width=12)
   Index Cond: (id < 1000)
(2 rows)

test=# SET effective_cache_size TO '1000 GB';
SET
test=# explain SELECT * FROM t_ordered WHERE id < 1000;
                                     QUERY PLAN
----------------------------------------------------------------------------------
 Index Scan using idx_ordered on t_ordered  (cost=0.42..39.17 rows=1014 width=12)
   Index Cond: (id < 1000)
(2 rows)
The table statistics used by the optimizer contain information about physical “correlation”. If correlation is 1 (= all data is sorted perfectly on disk), effective_cache_size will NOT change anything.
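You can look at that statistic yourself in the pg_stats system view. For our two tables it shows the expected picture - t_ordered was written in sorted order, t_random was not (the exact value for t_random will vary around zero):

test=# SELECT tablename, attname, correlation
           FROM pg_stats
           WHERE tablename IN ('t_random', 't_ordered')
                 AND attname = 'id';
 tablename | attname | correlation
-----------+---------+-------------
 t_random  | id      |      0.0031
 t_ordered | id      |           1
(2 rows)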
The same is true if there is only one column, as shown in the next example:
test=# ALTER TABLE t_random DROP COLUMN r;
ALTER TABLE
test=# SET effective_cache_size TO '1 MB';
SET
test=# explain SELECT * FROM t_ordered WHERE id < 1000;
                                     QUERY PLAN
----------------------------------------------------------------------------------
 Index Scan using idx_ordered on t_ordered  (cost=0.42..39.17 rows=1014 width=12)
   Index Cond: (id < 1000)
(2 rows)

test=# SET effective_cache_size TO '1000 GB';
SET
test=# explain SELECT * FROM t_ordered WHERE id < 1000;
                                     QUERY PLAN
----------------------------------------------------------------------------------
 Index Scan using idx_ordered on t_ordered  (cost=0.42..39.17 rows=1014 width=12)
   Index Cond: (id < 1000)
(2 rows)
That comes as a surprise to most users and therefore I considered it worth mentioning.
I found it useful to use a simple formula to get a rough estimate for a good setting:
effective_cache_size = RAM * 0.7
Some people have also used 0.8 successfully. Of course this is only true when we are talking about a dedicated database server. Feel free to experiment.
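To make this concrete: on a dedicated box with, say, 64 GB of RAM the rule of thumb yields roughly 45 GB. A persistent way to apply such a value - no restart needed, a reload is enough - looks like this:

test=# ALTER SYSTEM SET effective_cache_size = '45GB';
ALTER SYSTEM
test=# SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)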
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.