In my previous post I have shared some GPU performance data, which were mostly related to aggregation. Given the information I have found on the pgstrom wiki page, I decided to give joins a try to see how much speed we can gain by offloading some of the work PostgreSQL has to do to the graphics card. Let's see how PostgreSQL does on a GPU with joins.
Table of Contents
First I created some test data:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
test=# CREATE TABLE t_test AS SELECT id, id % 10 AS ten, id % 20 AS twenty FROM generate_series(1, 25000000) AS id ORDER BY id; SELECT 25000000 test=# CREATE TABLE t_join AS SELECT * FROM t_test ORDER BY random() LIMIT 1000000; SELECT 1000000 test=# SELECT count(*) FROM t_join; count --------- 1000000 (1 row) test=# SELECT count(*) FROM t_test; count ---------- 25000000 (1 row) |
25 million rows should be joined with 1 million rows, which are a 100% subset of the original data.
A first test shows a very nice improvement. First two queries on my CPU:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b WHERE a.id = b.id GROUP BY a.ten; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------- HashAggregate (cost=724919.18..724919.28 rows=10 width=4) (actual time=11283.769..11283.772 rows=10 loops=1) Group Key: a.ten -> Hash Join (cost=31813.00..719919.18 rows=1000000 width=4) (actual time=340.218..11061.192 rows=1000000 loops=1) Hash Cond: (a.id = b.id) -> Seq Scan on t_test a (cost=0.00..385135.40 rows=24999940 width=8) (actual time=0.046..3590.336 rows=25000000 loops=1) -> Hash (cost=15406.00..15406.00 rows=1000000 width=4) (actual time=338.516..338.516 rows=1000000 loops=1) Buckets: 131072 Batches: 16 Memory Usage: 3226kB -> Seq Scan on t_join b (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.031..142.085 rows=1000000 loops=1) Planning time: 0.587 ms Execution time: 11284.411 ms (10 rows) test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b WHERE a.id = b.id GROUP BY a.twenty; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------- HashAggregate (cost=724919.18..724919.38 rows=20 width=4) (actual time=10991.615..10991.619 rows=20 loops=1) Group Key: a.twenty -> Hash Join (cost=31813.00..719919.18 rows=1000000 width=4) (actual time=317.089..10766.779 rows=1000000 loops=1) Hash Cond: (a.id = b.id) -> Seq Scan on t_test a (cost=0.00..385135.40 rows=24999940 width=8) (actual time=0.050..3636.188 rows=25000000 loops=1) -> Hash (cost=15406.00..15406.00 rows=1000000 width=4) (actual time=316.321..316.321 rows=1000000 loops=1) Buckets: 131072 Batches: 16 Memory Usage: 3226kB -> Seq Scan on t_join b (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.026..141.551 rows=1000000 loops=1) Planning time: 0.240 ms Execution time: 10992.203 ms (10 rows) |
The CPU in use here is an “AMD FX(tm)-8350 Eight-Core Processor” running at 4 Ghz.
Let us try the same thing on the GPU:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 |
test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b WHERE a.id = b.id GROUP BY a.ten; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------- HashAggregate (cost=175445.25..175445.35 rows=10 width=4) (actual time=2556.159..2556.161 rows=10 loops=1) Group Key: a.ten -> Custom Scan (GpuPreAgg) (cost=27659.66..174601.72 rows=22 width=8) (actual time=2538.558..2556.123 rows=30 loops=1) Bulkload: Off Reduction: Local + Global -> Custom Scan (GpuJoin) (cost=23659.66..170445.25 rows=1000000 width=4) (actual time=1392.571..2437.772 rows=1000000 loops=1) Bulkload: On (density: 100.00%) Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id) Nrows (in:25000000 out:1000000, 4.00% planned 4.00%), KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1) Inner Buffer: (82.25MB), DMA nums: 13, size: 1069.32MB -> Custom Scan (BulkScan) on t_test a (cost=0.00..385137.60 rows=25000160 width=8) (actual time=16.137..1333.313 rows=25000000 loops=1) -> Seq Scan on t_join b (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.018..109.725 rows=1000000 loops=1) Planning time: 1.720 ms Execution time: 3264.747 ms (14 rows) test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b WHERE a.id = b.id GROUP BY a.twenty; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------- HashAggregate (cost=175445.25..175445.45 rows=20 width=4) (actual time=2532.113..2532.121 rows=20 loops=1) Group Key: a.twenty -> Custom Scan (GpuPreAgg) (cost=27659.66..174601.94 rows=44 width=8) (actual time=2508.512..2531.991 rows=120 loops=1) Bulkload: Off Reduction: Local + Global -> Custom Scan (GpuJoin) (cost=23659.66..170445.25 rows=1000000 width=4) (actual time=1373.296..2410.850 rows=1000000 loops=1) Bulkload: On (density: 100.00%) Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id) Nrows (in:25000000 out:1000000, 4.00% planned 4.00%), KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1) Inner Buffer: (82.25MB), DMA nums: 11, size: 904.81MB -> Custom Scan (BulkScan) on t_test a (cost=0.00..385137.60 rows=25000160 width=8) (actual time=12.961..1308.733 rows=25000000 loops=1) -> Seq Scan on t_join b (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.018..109.893 rows=1000000 loops=1) Planning time: 0.539 ms Execution time: 3225.901 ms (14 rows) |
What we see here is a nice improvement. The query is several times faster.
The interesting part is to see, what happens if indexes are added to both sides of the join:
1 2 3 4 5 |
test=# CREATE INDEX idx_id ON t_test (id); CREATE INDEX test=# CREATE INDEX idx_id2 ON t_join (id); CREATE INDEX |
The standard query, shown before, does not show any difference because too much data is needed inside the join. So, the test is repeated with a reasonable filter:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b WHERE a.id = b.id AND a.id < 100000 GROUP BY a.ten; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------- HashAggregate (cost=32541.70..32541.71 rows=1 width=4) (actual time=45.845..45.846 rows=10 loops=1) Group Key: a.ten -> Merge Join (cost=1.49..32519.75 rows=4390 width=4) (actual time=0.076..44.423 rows=4137 loops=1) Merge Cond: (a.id = b.id) -> Index Scan using idx_id on t_test a (cost=0.44..3722.05 rows=109749 width=8) (actual time=0.025..27.556 rows=99999 loops=1) Index Cond: (id < 100000) -> Index Only Scan using idx_id2 on t_join b (cost=0.42..25980.42 rows=1000000 width=4) (actual time=0.022..0.931 rows=4138 loops=1) Heap Fetches: 0 Planning time: 0.449 ms Execution time: 45.946 ms (10 rows) |
The CPU version is pretty fast. PostgreSQL scans the indexes, makes use of the sorted input to perform a nice merge join then. The aggregation is pretty fast as well.
In the GPU enabled case the plan does not look as efficient in the default setting:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b WHERE a.id = b.id AND a.id < 100000 GROUP BY a.ten; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------ HashAggregate (cost=13404.21..13404.22 rows=1 width=4) (actual time=118.154..118.156 rows=10 loops=1) Group Key: a.ten -> Custom Scan (GpuJoin) (cost=9649.48..13383.34 rows=4174 width=4) (actual time=70.993..117.099 rows=4137 loops=1) Bulkload: On (density: 100.00%) Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id) Nrows (in:1000000 out:4137, 0.41% planned 0.42%), KDS-Hash (size: 5859.39KB planned 8.58MB, nbatches: 1 planned 1) Inner Buffer: (8.58MB), DMA nums: 1, size: 8.58MB -> Custom Scan (BulkScan) on t_join b (cost=0.00..15406.00 rows=1000000 width=4) (actual time=7.164..24.709 rows=1000000 loops=1) -> Index Scan using idx_id on t_test a (cost=0.44..3542.55 rows=104349 width=8) (actual time=0.018..21.684 rows=99999 loops=1) Index Cond: (id < 100000) Planning time: 0.858 ms Execution time: 553.543 ms (12 rows) |
Even if the query is slower in this case, it should not be a major issue. If the cost models are adjusted properly and if the planner is told how to decide on the right plan, the system should be able to figure out that the CPU version is the faster one. So the loss in speed is really not an issue here. As soon as all infrastructure is in place, balancing the cost model and adjusting the planner here and there does not seem like a showstopper (at least not given the achievements already made).
Let us just add more tables to the join to see what happens. To make sure that CPU and GPU get a fair chance, I have removed my two indexes again:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 |
test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b, t_test AS c, t_join AS d, t_test AS e, t_join AS f WHERE a.id = b.id AND b.id = c.id AND c.id = d.id AND d.id = e.id AND e.id = f.id; QUERY PLAN ------------------------------------------------------------------------------------------------------------ Aggregate (cost=1745125.48..1745125.49 rows=1 width=0) (actual time=34234.209..34234.209 rows=1 loops=1) -> Hash Join (cost=1266218.99..1745121.48 rows=1600 width=0) (actual time=23054.940..34133.602 rows=1000000 loops=1) Hash Cond: (e.id = a.id) -> Seq Scan on t_test e (cost=0.00..385136.36 rows=25000036 width=4) (actual time=0.056..2728.388 rows=25000000 loops=1) -> Hash (cost=1266198.99..1266198.99 rows=1600 width=20) (actual time=23054.441..23054.441 rows=1000000 loops=1) Buckets: 131072 (originally 2048) Batches: 32 (originally 1) Memory Usage: 4205kB -> Hash Join (cost=787296.49..1266198.99 rows=1600 width=20) (actual time=12247.386..22803.559 rows=1000000 loops=1) Hash Cond: (c.id = a.id) -> Seq Scan on t_test c (cost=0.00..385136.36 rows=25000036 width=4) (actual time=0.004..2727.218 rows=25000000 loops=1) -> Hash (cost=787276.49..787276.49 rows=1600 width=16) (actual time=12246.958..12246.958 rows=1000000 loops=1) Buckets: 131072 (originally 2048) Batches: 16 (originally 1) Memory Usage: 3960kB -> Hash Join (cost=768104.49..787276.49 rows=1600 width=16) (actual time=11330.994..12044.482 rows=1000000 loops=1) Hash Cond: (f.id = a.id) -> Seq Scan on t_join f (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.030..111.371 rows=1000000 loops=1) -> Hash (cost=767604.49..767604.49 rows=40000 width=12) (actual time=11330.687..11330.687 rows=1000000 loops=1) Buckets: 131072 (originally 65536) Batches: 16 (originally 1) Memory Usage: 3716kB -> Hash Join (cost=63626.00..767604.49 rows=40000 width=12) (actual time=620.425..11124.981 rows=1000000 loops=1) Hash Cond: (a.id = d.id) -> Hash Join (cost=31813.00..719920.49 rows=1000000 width=8) (actual time=306.078..10215.246 rows=1000000 loops=1) Hash Cond: (a.id = b.id) -> Seq Scan on t_test a (cost=0.00..385136.36 rows=25000036 width=4) (actual time=0.002..3430.792 rows=25000000 loops=1) -> Hash (cost=15406.00..15406.00 rows=1000000 width=4) (actual time=305.466..305.466 rows=1000000 loops=1) Buckets: 131072 Batches: 16 Memory Usage: 3226kB -> Seq Scan on t_join b (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.002..137.925 rows=1000000 loops=1) -> Hash (cost=15406.00..15406.00 rows=1000000 width=4) (actual time=314.051..314.051 rows=1000000 loops=1) Buckets: 131072 Batches: 16 Memory Usage: 3226kB -> Seq Scan on t_join d (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.006..137.796 rows=1000000 loops=1) Planning time: 4.833 ms Execution time: 34238.964 ms (29 rows) |
As expected the CPU has a hard time joining all the stuff together. It is a pretty nasty join. At the end 34 seconds are needed.
The GPU can really shine under those circumstances. Expensive joins really seem to be what this is all about after all:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 |
test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b, t_test AS c, t_join AS d, t_test AS e, t_join AS f WHERE a.id = b.id AND b.id = c.id AND c.id = d.id AND d.id = e.id AND e.id = f.id; NOTICE: GpuJoin(0x204508860) DataStoreNoSpace retry=1 [0.00%] src_nitems: 312465 max_space: 390581=>781162 nrooms: 101578=>390581 Nthreads: (312465, 499=>312465) inners: (0, 1000000) results: [312465, 312465] NOTICE: GpuJoin(0x204506060) DataStoreNoSpace retry=1 [0.00%] src_nitems: 312465 max_space: 390581=>781162 nrooms: 101578=>390581 Nthreads: (312465, 499=>312465) inners: (0, 1000000) results: [312465, 312465] NOTICE: GpuJoin(0x204505860) DataStoreNoSpace retry=1 [0.00%] src_nitems: 312465 max_space: 390581=>781162 nrooms: 101578=>390581 Nthreads: (312465, 499=>312465) inners: (0, 1000000) results: [312465, 312465] QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=471554.15..471554.16 rows=1 width=0) (actual time=7270.748..7270.748 rows=1 loops=1) -> Custom Scan (GpuJoin) (cost=334846.07..471550.15 rows=1600 width=0) (actual time=6212.393..7184.973 rows=1000000 loops=1) Bulkload: On (density: 100.00%) Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id) Nrows (in:25000000 out:1000000, 4.00% planned 0.01%), KDS-Hash (size: 64.85MB planned 134.86KB, nbatches: 1 planned 1) Inner Buffer: (124.00MB), DMA nums: 9, size: 1116.01MB -> Custom Scan (BulkScan) on t_test e (cost=0.00..385136.36 rows=25000036 width=4) (actual time=13.106..1103.447 rows=25000000 loops=1) -> Custom Scan (GpuJoin) (cost=192385.58..329089.66 rows=1600 width=20) (actual time=4240.682..5277.841 rows=1000000 loops=1) Bulkload: On (density: 100.00%) Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id) Nrows (in:25000000 out:1000000, 4.00% planned 0.01%), KDS-Hash (size: 57.22MB planned 119.23KB, nbatches: 1 planned 1) Inner Buffer: (62.00MB), DMA nums: 15, size: 930.01MB -> Custom Scan (BulkScan) on t_test c (cost=0.00..385136.36 rows=25000036 width=4) (actual time=13.160..1093.463 rows=25000000 loops=1) -> Custom Scan (GpuJoin) (cost=182921.05..186629.17 rows=1600 width=16) (actual time=3320.939..344.434 rows=1000000 loops=1) Bulkload: On (density: 100.00%) Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id) Nrows (in:1000000 out:1000000, 100.00% planned 0.16%), KDS-Hash (size: 57.22MB planned 2978.59KB, nbatches: 1 planned 1) Inner Buffer: (62.00MB), DMA nums: 1, size: 62.00MB -> Custom Scan (BulkScan) on t_join f (cost=0.00..15406.00 rows=1000000 width=4) (actual time=13.220..45.445 rows=1000000 loops=1) -> Custom Scan (GpuJoin) (cost=41543.18..176974.98 rows=40000 width=12) (actual time=1736.740..2818.735 rows=1000000 loops=1) Bulkload: On (density: 100.00%) Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id) Nrows (in:25000000 out:1000000, 4.00% planned 4.00%), KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1) Depth 2: GpuHashJoin, HashKeys: (id), JoinQual: (id = id) Nrows (in:1000000 out:1000000, 100.00% planned 4.00%), KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1) Inner Buffer: (82.25MB)x(82.25MB), DMA nums: 10, size: 1645.10MB -> Custom Scan (BulkScan) on t_test a (cost=0.00..385136.36 rows=25000036 width=4) (actual time=16.505..1599.395 rows=25000000 loops=1) -> Seq Scan on t_join b (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.022..111.602 rows=1000000 loops=1) -> Seq Scan on t_join d (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.005..95.175 rows=1000000 loops=1) Planning time: 11.106 ms Execution time: 8051.245 ms (31 rows) |
Given the number of rows involved, this is really a nice number to see. 8 seconds just rocks and shows the real potential of a GPU.
Of course: You got to keep in mind that I am running arbitrary SQL statements, which are not too realistic in many cases. I am yet to test pgstrom on a real workload involving more every-day queries you would typically see in PostgreSQL.
You need to load content from reCAPTCHA to submit the form. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
You also should keep in the mind that you're comparing running SQL in single thread of CPU and running same SQL in multiple threads of GPU. I can suppose that pgstrom utilize the full GPU. You could run your test queries in parallel using pgbench. Then you could figure out real average performance.
i actually planned some larger scale pgbench tests but I did not get around do it yet. i am also yet to figure out, where CPU time is burned. we are definitely planning some more tests as soon as there is enough time.
This article (and your previous one) were really useful. I recently finished up some performance bench-marking for a production system with various DB implementations. One was PGStrom and your examples let me quickly find what wasn't available for GPUPreAggr calls. It'd be great if decide to post another blog on PGStrom where you pointed out where the library is still lacking/what's needed to get it to production capability. i.e. Can't aggregate involving Numeric/UUIDS, or GPU worker limits, using the debug logger to identify decision trees, etc. Also more content on bottlenecks and more on types of queries that will be sub-optimal would be awesome. I found, for example, that being unable to stage data on the Graphics card memory for later execution made all queries have an underlying latency regardless of size that made my use-case less ideal.