PostgreSQL 12 is just around the corner, so we already want to present some of the new features we like. One important addition gives users and devops the chance to control the behavior of the PostgreSQL optimizer. Prepared plans are a recurring concern (people moving from Oracle seem to be especially worried), so it makes sense to discuss the way plans are handled in PostgreSQL 12.
To start I will create a simple table consisting of just two fields:
db12=# CREATE TABLE t_sample (id serial, name text);
CREATE TABLE
Then some data is loaded:
db12=# INSERT INTO t_sample (name)
       SELECT 'hans'
       FROM generate_series(1, 1000000);
INSERT 0 1000000
db12=# INSERT INTO t_sample (name)
       SELECT 'paul'
       FROM generate_series(1, 2);
INSERT 0 2
Note that 1 million names are identical (“hans”) and just two people are called “paul”. The distribution of data is therefore quite special, which has a major impact as you will see later in this post.
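If you want to double-check the skew, a simple aggregate does the trick. The counts below simply restate the two INSERT statements from above:

db12=# SELECT name, count(*) FROM t_sample GROUP BY name ORDER BY 2 DESC;
 name |  count
------+---------
 hans | 1000000
 paul |       2
(2 rows)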
To show how plans can change depending on the setting, an index on “name” is defined as shown in the next listing:
db12=# CREATE INDEX idx_name ON t_sample (name);
CREATE INDEX
Let us run a simple query and see what happens:
db12=# explain SELECT count(*) FROM t_sample WHERE name = 'hans';
                              QUERY PLAN
------------------------------------------------------------------
 Finalize Aggregate  (cost=12656.23..12656.24 rows=1 width=8)
   ->  Gather  (cost=12656.01..12656.22 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=11656.01..11656.02 rows=1 width=8)
               ->  Parallel Seq Scan on t_sample  (cost=0.00..10614.34 rows=416668 width=0)
                     Filter: (name = 'hans'::text)
(6 rows)
In this case PostgreSQL decided to ignore the index and go for a sequential scan. It even noticed that the table is fairly large and opted for a parallel query. Still, what we see is a sequential scan: all data in the table has to be processed. Why is that? Remember: most people in the table have the same name. It is faster to read the entire table and discard the few non-matching rows than to read almost the entire index and then the table on top of it. The planner figures (correctly) that running a sequential scan will be faster.
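If you want to convince yourself that the sequential scan really is the cheaper option, you can temporarily discourage it and compare the estimated costs. This is just a quick sanity check, not something for production code:

db12=# BEGIN;
db12=# SET LOCAL enable_seqscan = off;
db12=# explain SELECT count(*) FROM t_sample WHERE name = 'hans';
-- the planner is now pushed towards the index, and the estimated total cost
-- should come out higher than the 12656.24 of the parallel sequential scan
-- above, which is exactly why the optimizer avoided the index in the first place
db12=# ROLLBACK;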
What you can take away from this example is that an index is not used just because it exists: PostgreSQL only uses an index when it actually makes sense. If we search for the less frequent value, PostgreSQL will decide to use the index and gives us the optimal plan shown in the next listing:
db12=# explain SELECT count(*) FROM t_sample WHERE name = 'paul';
                   QUERY PLAN
------------------------------------------------
 Aggregate  (cost=4.45..4.46 rows=1 width=8)
   ->  Index Only Scan using idx_name on t_sample  (cost=0.42..4.44 rows=1 width=0)
         Index Cond: (name = 'paul'::text)
(3 rows)
If you are looking for good performance, keeping an eye on optimizer statistics is definitely a good idea. The main question is: which data does the optimizer actually keep? The pg_stats view contains information about every column:
db12=# \x
Expanded display is on.
db12=# SELECT * FROM pg_stats WHERE tablename = 't_sample' AND attname = 'name';
-[ RECORD 1 ]----------+---------
schemaname             | public
tablename              | t_sample
attname                | name
inherited              | f
null_frac              | 0
avg_width              | 5
n_distinct             | 1
most_common_vals       | {hans}
most_common_freqs      | {1}
histogram_bounds       |
correlation            | 1
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |
PostgreSQL keeps track of the percentage of NULL entries in a column (null_frac), the average width of the column (avg_width), and the estimated number of distinct values (n_distinct): are all values different, or are they all the same? In addition, PostgreSQL keeps a list of the most common entries together with their frequencies. The histogram_bounds column contains the statistical distribution of the data. In our example you will only find entries in this field for the "id" column, because with only two distinct names keeping a histogram for "name" is basically pointless. The correlation column tells us about the physical order of rows on disk. This field can be pretty important because it helps the optimizer estimate the amount of I/O.
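To see a populated histogram, you can look at the statistics for the "id" column instead. The output is omitted here, because the exact bounds depend on the random sample ANALYZE happens to take:

db12=# SELECT n_distinct, histogram_bounds, correlation
       FROM pg_stats
       WHERE tablename = 't_sample' AND attname = 'id';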
When you send a query to PostgreSQL, it is usually planned at the time it is sent. However, if you explicitly want to prepare a query, you can make use of the PREPARE / EXECUTE commands. Here is how it works:
db12=# PREPARE myplan(text) AS
       SELECT count(*)
       FROM t_sample
       WHERE name = $1;
PREPARE
db12=# EXECUTE myplan('paul');
 count
-------
     2
(1 row)
As you can see, the following query gives us an index scan:
db12=# explain EXECUTE myplan('paul');
                    QUERY PLAN
-------------------------------------------------
 Aggregate  (cost=4.45..4.46 rows=1 width=8)
   ->  Index Only Scan using idx_name on t_sample  (cost=0.42..4.44 rows=1 width=0)
         Index Cond: (name = 'paul'::text)
(3 rows)
If we switch back to the more common value, we again get a parallel sequential scan:
db12=# explain EXECUTE myplan('hans');
                          QUERY PLAN
-------------------------------------------------------------------
 Finalize Aggregate  (cost=12656.23..12656.24 rows=1 width=8)
   ->  Gather  (cost=12656.01..12656.22 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=11656.01..11656.02 rows=1 width=8)
               ->  Parallel Seq Scan on t_sample  (cost=0.00..10614.34 rows=416668 width=0)
                     Filter: (name = 'hans'::text)
(6 rows)
In PostgreSQL, life is not quite so straightforward. Even if we prepare explicitly, we still get a "fresh plan" for a while before a generic plan is created. What is a generic plan? A generic plan is built without looking at the actual parameter values, so it can be reused for any parameters. The idea is to keep that plan around and execute it many times, in the hope that overall performance goes up because of the lower planning overhead. Up to PostgreSQL 11 this process was a bit "obscure" to most people.
Here is how the "obscure" part works in detail. There are two ways PostgreSQL can execute a prepared statement: it can build a custom plan for every execution, which takes the actual parameter values into account, or it can build a generic plan once that does not depend on the parameter values and can be reused.
By default, PostgreSQL chooses a "middle road": it will generate a "custom plan" during the first 5 executions of the prepared statement that takes the parameter values into account. From the sixth execution on, it will check if the generic plan would have performed as well (by comparing the estimated execution costs of the custom and the generic plan). If it thinks that the generic plan would have done just as well, the prepared statement will always use the generic plan from that point on.
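You can watch this happen yourself. The following sketch uses a freshly prepared statement (the name watchplan is just for illustration); whether and when PostgreSQL actually switches to the generic plan depends on the cost comparison described above:

db12=# PREPARE watchplan(text) AS
       SELECT count(*) FROM t_sample WHERE name = $1;
PREPARE
db12=# explain EXECUTE watchplan('paul');
-- during the first five executions the plan is a custom plan and the
-- condition shows the literal value:
--   Index Cond: (name = 'paul'::text)
db12=# explain EXECUTE watchplan('paul');
db12=# explain EXECUTE watchplan('paul');
db12=# explain EXECUTE watchplan('paul');
db12=# explain EXECUTE watchplan('paul');
db12=# explain EXECUTE watchplan('paul');
-- from the sixth execution on PostgreSQL may settle on the generic plan,
-- which you can recognize by the parameter symbol in the condition:
--   Filter: (name = $1)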
PostgreSQL 12 introduces a new configuration parameter, plan_cache_mode, which allows users to control this behavior explicitly. Let us try the same thing again and enforce a generic plan:
db12=# SET plan_cache_mode = 'force_generic_plan';
SET
db12=# PREPARE newplan(text) AS
       SELECT count(*)
       FROM t_sample
       WHERE name = $1;
PREPARE
db12=# explain EXECUTE newplan('hans');
                         QUERY PLAN
-----------------------------------------------------------
 Finalize Aggregate  (cost=12656.23..12656.24 rows=1 width=8)
   ->  Gather  (cost=12656.01..12656.22 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=11656.01..11656.02 rows=1 width=8)
               ->  Parallel Seq Scan on t_sample  (cost=0.00..10614.34 rows=416668 width=0)
                     Filter: (name = $1)
(6 rows)

db12=# explain EXECUTE newplan('paul');
                         QUERY PLAN
-------------------------------------------------------------------
 Finalize Aggregate  (cost=12656.23..12656.24 rows=1 width=8)
   ->  Gather  (cost=12656.01..12656.22 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=11656.01..11656.02 rows=1 width=8)
               ->  Parallel Seq Scan on t_sample  (cost=0.00..10614.34 rows=416668 width=0)
                     Filter: (name = $1)
(6 rows)
What you see here is that the plan stays the same and PostgreSQL does not attempt to replan. Planning time is cut, but that does not necessarily mean you always win: you save the CPU cycles needed to optimize the query, yet the plan you end up using is not necessarily optimal for your parameters, as the "paul" example above shows.
If you want to play around with plan_cache_mode you can try the following values:
db12=# SET plan_cache_mode = 'force_custom_plan';
SET
db12=# SET plan_cache_mode = 'force_generic_plan';
SET
db12=# SET plan_cache_mode = 'auto';
SET
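If you are ever unsure which mode is currently active, you can simply ask; after the statements above, the answer is the value that was set last:

db12=# SHOW plan_cache_mode;
 plan_cache_mode
-----------------
 auto
(1 row)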
"auto", which is the default value, resembles the traditional behavior of letting
PostgreSQL choose whether to use a generic plan or not.
You might ask what good it could be to use a prepared statement with "force_custom_plan".
The main reason is that using prepared statements is the best way to prevent SQL injection
attacks, so it may be worth using them even if you don't save on planning time.
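For completeness, here is a sketch of what "force_custom_plan" does to the prepared statement from before. It assumes the newplan statement is still around; the plans you should see are simply the ones we already got for the non-prepared queries:

db12=# SET plan_cache_mode = 'force_custom_plan';
SET
db12=# explain EXECUTE newplan('paul');
-- planned with the actual parameter value again, so we are back to the
-- index only scan from earlier:
--   Index Cond: (name = 'paul'::text)
db12=# explain EXECUTE newplan('hans');
-- while the frequent value still (correctly) gets the parallel sequential scan:
--   Filter: (name = 'hans'::text)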
If you want to learn more about PostgreSQL 12, consider checking out our blog post about optimizer support functions.