In PostgreSQL, a parallel aggregate refers to a way of processing aggregate functions (such as SUM
, AVG
, MAX
, MIN
, etc.) on large amounts of data in a parallel and distributed manner, thereby making the query execution faster.
Table of Contents
When executing an aggregate query, the database system automatically breaks up the result set into smaller pieces and distributes the work among the available resources, and then combines the results to produce the final output. This approach can significantly improve the performance of aggregate queries on large datasets, but it requires sufficient resources and may not always be faster than serial execution.
Not all aggregate functions support so-called "Partial Mode", which indicates the aggregate is eligible to participate in various optimizations, such as parallel aggregation. And that was true for array_agg()
and string_agg()
.
The first one collects all the input values, including nulls, into an array, while the second concatenates the non-null input values into a string.
David Rowley implemented a new functionality for parallel aggregates on string_agg()
and array_agg()
functions. The patch was reviewed by Andres Freund, Tomas Vondra, Stephen Frost and Tom Lane. Committed by David Rowley. The commit message is:
1 2 3 4 5 6 7 8 9 10 11 |
This adds combine, serial and deserial functions for the array_agg() and string_agg() aggregate functions, thus allowing these aggregates to partake in partial aggregations. This allows both parallel aggregation to take place when these aggregates are present and also allows additional partition-wise aggregation plan shapes to include plans that require additional aggregation once the partially aggregated results from the partitions have been combined. Author: David Rowley Reviewed-by: Andres Freund, Tomas Vondra, Stephen Frost, Tom Lane Discussion: https://postgr.es/m/CAKJS1f9sx_6GTcvd6TMuZnNtCh0VhBzhX6FZqw17TgVFH-ga_A@mail.gmail.com |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 |
test=# select version(); version ---------------------------------------------------------------------------------------------------------- PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit (1 row) test=# create table pagg_test (x int, y int); CREATE TABLE test=# insert into pagg_test test-# select (case x % 4 when 1 then null else x end), x % 10 test-# from generate_series(1,5000) x; INSERT 0 5000 test=# set parallel_setup_cost TO 0; SET test=# set parallel_tuple_cost TO 0; SET test=# set parallel_leader_participation TO 0; SET test=# set min_parallel_table_scan_size = 0; SET test=# explain select test-# y, test-# string_agg(x::text, ',') AS t, test-# string_agg(x::text::bytea, ',') AS b, test-# array_agg(x) AS a, test-# array_agg(ARRAY[x]) AS aa test-# from pagg_test test-# group by y; QUERY PLAN ---------------------------------------------------------------------------------------------- Finalize GroupAggregate (cost=117.14..118.02 rows=10 width=132) Group Key: y -> Gather Merge (cost=117.14..117.37 rows=20 width=132) Workers Planned: 2 -> Sort (cost=117.12..117.14 rows=10 width=132) Sort Key: y -> Partial HashAggregate (cost=116.75..116.95 rows=10 width=132) Group Key: y -> Parallel Seq Scan on pagg_test (cost=0.00..48.00 rows=2500 width=8) (9 rows) |
And to be absolutely sure, let's try the same test on a PostreSQL 13 cluster. The EXPLAIN
output will be the same for PostgreSQL 14 and 15.
1 2 3 4 5 6 7 8 9 |
... QUERY PLAN ---------------------------------------------------------------------------------- HashAggregate (cost=185.50..185.70 rows=10 width=132) Group Key: y -> Gather (cost=0.00..48.00 rows=5000 width=8) Workers Planned: 2 -> Parallel Seq Scan on pagg_test (cost=0.00..48.00 rows=2500 width=8) (5 rows) |
We cannot compare timing but we can compare costs. As you can see, the final cost in the upcoming PostgreSQL 16 with parallel aggregate implemented is 118.02! The final execution cost for previous versions was 185.70.
Let me remind you what the cost means in the PostgreSQL's EXPLAIN
output. Cost is an estimated measure of the query execution time in arbitrary units, representing the processing power required to execute a particular step in the query plan. It generally represents a combination of CPU, I/O, and memory usage and helps the query planner to choose the fastest execution plan.
See "Handling Bonus Programs in SQL" blog post by Hans-Jürgen Schönig to check how to efficiently code bonus programs in SQL. There is a real life example of using array_agg
aggregate function. Or check out my "ER diagrams with SQL and Mermaid" post to see an example on how to use string_agg()
function.
Check out out the “what’s new” post series to know more about upcoming PostgreSQL versions.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.
You need to load content from reCAPTCHA to submit the form. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Leave a Reply