PostgreSQL 10 will provide end users with countless new features. One of them relates to “Foreign Data Wrappers” and is generally known as “aggregate pushdown”. To be honest: this is one of my favorite new features of PostgreSQL 10, so it is worth sharing it with a broader audience. If you are interested in remote aggregation, distributed queries, distributed aggregates and aggregate pushdown in PostgreSQL, keep reading.
To show what the optimizer is already capable of, we need two databases:
iMac:~ hs$ createdb db01
iMac:~ hs$ createdb db02
Then we can deploy some simple test data in db02:
BEGIN;

CREATE TABLE t_test (id serial, name text);

INSERT INTO t_test (name)
    SELECT 'dummy'
    FROM generate_series(1, 1000000);

ANALYZE;

COMMIT;
The script generates 1 million rows, all of which carry the same name (“dummy”).
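Before moving on, a quick sanity query on db02 can confirm the row count and the single distinct name (a minimal check, not part of the original script):

```sql
-- Run in db02: verify that the test data looks as expected.
SELECT count(*)             AS total_rows,
       count(DISTINCT name) AS distinct_names
FROM   t_test;
-- Expected: 1000000 total rows, 1 distinct name.
```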
For many years now, PostgreSQL has provided means to access remote data sources using “Foreign Data Wrappers” (FDWs):
CREATE EXTENSION postgres_fdw;

CREATE SERVER pgserver
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (dbname 'db02', host 'localhost');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER pgserver
    OPTIONS (user 'hs');

IMPORT FOREIGN SCHEMA public
    FROM SERVER pgserver INTO public;
The script shown here loads the postgres_fdw extension, which allows us to connect to a remote PostgreSQL database. Then a virtual server pointing to db02 is created in db01. Finally, a user mapping is created and the foreign schema is imported: all tables found in the “public” schema of the remote database are linked and made visible in db01.
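To verify that the import worked, one can simply query the linked table from within db01; a Foreign Scan node in the plan confirms that the data really lives in db02 (a quick check, not part of the original setup):

```sql
-- Run in db01: the table physically resides in db02,
-- but is now visible as a foreign table locally.
EXPLAIN SELECT * FROM t_test LIMIT 3;
-- The plan should contain a "Foreign Scan on t_test" node.
```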
Once the test data is in place, we can give PostgreSQL a try and see how it behaves in the case of aggregates. Here is an example:
db01=# explain (analyze, verbose)
        SELECT name, count(*)
        FROM   t_test
        GROUP BY 1;
                              QUERY PLAN
------------------------------------------------------------------------------
 Foreign Scan (cost=107.31..137.93 rows=200 width=40)
        (actual time=192.244..192.245 rows=1 loops=1)
   Output: name, (count(*))
   Relations: Aggregate on (public.t_test)
   Remote SQL: SELECT name, count(*) FROM public.t_test GROUP BY name
 Planning time: 0.063 ms
 Execution time: 192.581 ms
(6 rows)
The most important observation here is that PostgreSQL is able to push down the complete aggregate. As you can see, the remote SQL is basically the same as the local query. The main advantage is that by pushing down the aggregate, PostgreSQL can drastically reduce the load on the local machine as well as the amount of data that has to be sent over the network.
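Pushdown only happens when every expression in the query is safe to ship: postgres_fdw sends built-in functions (and functions from extensions whitelisted via the server's extensions option), but not locally defined ones. A query involving a hypothetical local function, here called local_label, would therefore typically force PostgreSQL to fetch the raw rows and aggregate locally:

```sql
-- Hypothetical local function (illustration only): postgres_fdw
-- cannot assume it exists on the remote side, so it is not shipped.
CREATE FUNCTION local_label(text) RETURNS text
    AS $$ SELECT 'label: ' || $1 $$ LANGUAGE sql;

-- EXPLAIN VERBOSE should now show a plain
-- "Remote SQL: SELECT name FROM public.t_test"
-- plus a local aggregate, instead of a pushed-down GROUP BY.
EXPLAIN (verbose)
SELECT local_label(name), count(*)
FROM   t_test
GROUP BY 1;
```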
However, at this point a word of caution is necessary: yes, aggregates can be pushed down to a remote server. The thing is: joins happen before the aggregate. In other words, PostgreSQL has to transfer all the data from the remote host in the following case:
db01=# explain (analyze, verbose)
        SELECT a.name, count(*)
        FROM   t_test AS a, t_test AS b
        WHERE  a.id = b.id
        GROUP BY 1
        ORDER BY 2, 1;
                              QUERY PLAN
-------------------------------------------------------------------------------------------------
 Sort (cost=1003.90..1004.40 rows=200 width=40) (actual time=4012.290..4012.290 rows=1 loops=1)
   Output: a.name, (count(*))
   Sort Key: (count(*)), a.name
   Sort Method: quicksort Memory: 25kB
   -> HashAggregate (cost=994.26..996.26 rows=200 width=40) (actual time=4012.283..4012.283 rows=1 loops=1)
      Output: a.name, count(*)
      Group Key: a.name
      -> Merge Join (cost=588.18..894.45 rows=19963 width=32) (actual time=3382.674..3848.202 rows=1000000 loops=1)
         Output: a.name
         Merge Cond: (a.id = b.id)
         -> Sort (cost=222.03..225.44 rows=1365 width=36) (actual time=1691.089..1788.210 rows=1000000 loops=1)
            Output: a.name, a.id
            Sort Key: a.id
            Sort Method: external sort Disk: 21528kB
            -> Foreign Scan on public.t_test a (cost=100.00..150.95 rows=1365 width=36) (actual time=0.311..1232.045 rows=1000000 loops=1)
               Output: a.name, a.id
               Remote SQL: SELECT id, name FROM public.t_test
         -> Sort (cost=366.15..373.46 rows=2925 width=4) (actual time=1691.579..1785.666 rows=1000000 loops=1)
            Output: b.id
            Sort Key: b.id
            Sort Method: external sort Disk: 17616kB
            -> Foreign Scan on public.t_test b (cost=100.00..197.75 rows=2925 width=4) (actual time=0.400..1203.757 rows=1000000 loops=1)
               Output: b.id
               Remote SQL: SELECT id FROM public.t_test
 Planning time: 0.105 ms
 Execution time: 4071.736 ms
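Note also the row estimates in the plan (1365 and 2925 rows, while 1 million rows actually come back): by default postgres_fdw relies on local statistics for foreign tables. Running ANALYZE on the foreign table, or enabling the server option use_remote_estimate, can give the planner more realistic numbers (a tuning sketch; it does not change the fundamental join-before-aggregate problem):

```sql
-- Collect statistics for the foreign table in the local database ...
ANALYZE t_test;

-- ... or ask the remote server for estimates during planning.
ALTER SERVER pgserver
    OPTIONS (ADD use_remote_estimate 'true');
```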
For PostgreSQL 11.0 we are working on a patch, which will hopefully make it into core. It allows PostgreSQL to perform many aggregations before the join happens, which makes joining cheaper because less data ends up in the join. Many more improvements are possible and may be added to the planner in the near future.
However, as of PostgreSQL 10 a large step forward has been made already to allow PostgreSQL to dynamically distribute queries in a cluster.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.