Histograms are a common way for a relational database to store statistical information about its data. What is this kind of statistical information good for? Well, statistics are the rocket fuel behind performance. If statistics are bad, the optimizer will make bad decisions and poor execution plans will be the result. The PostgreSQL optimizer is a highly sophisticated piece of software and in general it provides end users with good plans – still, without statistics even the best SQL optimizer will yield bad performance.
The good news is that when estimates go wrong, the end user can influence the way PostgreSQL stores statistics: the ALTER TABLE ... SET STATISTICS command allows users to define the size of the internal histograms holding the statistical information.
Before we dig into those details, let us create a simple table and add some data to it:
test=# CREATE TABLE t_test AS SELECT * FROM generate_series(1, 1000000) AS x;
SELECT 1000000
This will create a table containing one million numbers. To make sure that the optimizer will know about the content of the table, we can run ANALYZE (a command in charge of creating statistics):
test=# ANALYZE ;
ANALYZE
Whenever PostgreSQL has to plan a query, it will rely on the statistics we have just created. Let us take a look at a very simple plan:
test=# explain SELECT * FROM t_test WHERE x < 10000;
                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on t_test  (cost=0.00..16925.00 rows=9109 width=4)
   Filter: (x < 10000)
(2 rows)
We try to find all values lower than 10,000. The planner estimates that we will find roughly 9,100 rows. Overall this is pretty good and totally sufficient to plan the query efficiently. Remember, estimating statistics is not about having a final answer right away – it is about coming up with a reasonably good estimate.
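To see how close the estimate really is, we can simply count the matching rows ourselves. This is just a quick sanity check; since the table was filled with generate_series(1, 1000000), the true count is exactly 9,999:

test=# SELECT count(*) FROM t_test WHERE x < 10000;
 count
-------
  9999
(1 row)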
Let us see which statistics PostgreSQL has gathered. To do so, we can take a look at a system view called pg_stats:
test=# \d pg_stats
          View "pg_catalog.pg_stats"
         Column         |   Type   | Modifiers
------------------------+----------+-----------
 schemaname             | name     |
 tablename              | name     |
 attname                | name     |
 inherited              | boolean  |
 null_frac              | real     |
 avg_width              | integer  |
 n_distinct             | real     |
 most_common_vals       | anyarray |
 most_common_freqs      | real[]   |
 histogram_bounds       | anyarray |
 correlation            | real     |
 most_common_elems      | anyarray |
 most_common_elem_freqs | real[]   |
 elem_count_histogram   | real[]   |
If you are interested in the content of pg_stats, we recommend taking a look at the PostgreSQL documentation, which outlines nicely how things work in this area: http://www.postgresql.org/docs/9.3/static/view-pg-stats.html
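For instance, the number of histogram bounds stored for our column can be checked directly in pg_stats. This is just a quick sketch; with the default statistics target of 100 we would expect roughly 101 bounds (i.e. 100 buckets):

test=# SELECT array_length(histogram_bounds, 1)
       FROM pg_stats
       WHERE tablename = 't_test' AND attname = 'x';
 array_length
--------------
          101
(1 row)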
Sometimes it can be quite beneficial to change the size of a column's histogram. Reducing the size of the histogram will make estimates a lot less precise, but it will reduce the time needed to plan a query. If you are dealing with a table with 1 billion rows and all you do is hit the primary key to fetch exactly one row at a time, a narrow histogram is definitely not a big issue.
However, if you are dealing with more complex operations, keeping the default of 100 or even raising this value can be quite beneficial.
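The per-column setting overrides the instance-wide default, which is governed by the default_statistics_target parameter (100 by default). A minimal sketch of how to inspect it and change it for the current session:

test=# SHOW default_statistics_target;
 default_statistics_target
---------------------------
 100
(1 row)

test=# SET default_statistics_target = 500;
SET

A SET like this only affects ANALYZE runs issued in the same session; to change the default permanently, adjust the parameter in postgresql.conf.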
This is how you can change the size of the histogram of a specific column:
test=# ALTER TABLE t_test ALTER COLUMN x SET STATISTICS 10;
ALTER TABLE
Once this has been done, we can recreate the stats:
test=# ANALYZE t_test;
ANALYZE
Querying the stats table reveals what PostgreSQL knows about the column:
test=# \x
Expanded display is on.
test=# SELECT * FROM pg_stats WHERE tablename = 't_test' AND attname = 'x';
-[ RECORD 1 ]----------+----------------------------------------------------------------------------
schemaname             | public
tablename              | t_test
attname                | x
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | -1
most_common_vals       |
most_common_freqs      |
histogram_bounds       | {492,100971,194099,290435,398419,485780,587410,692588,793607,898079,999686}
correlation            | 1
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |
Note that the histogram consists of just 11 values now (which gives us 10 buckets).
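With those 10 buckets, the planner's estimate for x < 10000 can be reconstructed by hand. Each bucket covers roughly a tenth of the table, and within the first bucket (bounds 492 to 100971) the planner interpolates linearly. The following is only a back-of-the-envelope simplification of what the selectivity code does internally; the actual planner output may differ slightly:

test=# SELECT round(1000000 * ((10000 - 492.0) / (100971 - 492)) / 10) AS estimated_rows;
 estimated_rows
----------------
           9463
(1 row)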
We can toy around with this setting now and see what happens to the plan made by PostgreSQL once the statistics target has been raised to 1,000:
test=# ALTER TABLE t_test ALTER COLUMN x SET STATISTICS 1000;
ALTER TABLE
test=# ANALYZE ;
ANALYZE
test=# explain SELECT * FROM t_test WHERE x < 10000;
                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on t_test  (cost=0.00..16925.00 rows=9921 width=4)
   Filter: (x < 10000)
(2 rows)
In our example the estimate is a little more precise than before. However, this comes at a cost: planning time has risen, even though the plan itself is still the same (because there is no other choice for the planner anyway).
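If you want to see the difference in planning time yourself, psql's \timing switch is a simple way to do so. Since EXPLAIN does not execute the query, the "Time:" line printed after the plan is dominated by planning overhead (the exact numbers depend entirely on your machine):

test=# \timing
Timing is on.
test=# explain SELECT * FROM t_test WHERE x < 10000;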
By the way, keep in mind that these values should not be changed on a daily basis; in many cases it is not even necessary to change them at all. Still, it is nice to have the opportunity to do so when it is beneficial.
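If you want to undo an experiment, setting the statistics target to -1 reverts the column to the system default again:

test=# ALTER TABLE t_test ALTER COLUMN x SET STATISTICS -1;
ALTER TABLE
test=# ANALYZE t_test;
ANALYZE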
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.