Finally I got around to taking a more extensive look at pgstrom (a module that makes use of GPUs). The entire GPU topic fascinates me, and I was excited to see the first real performance data.
Here is some simple test data:
test=# CREATE TABLE t_test AS
           SELECT x, 'a'::char(100) AS y, 'b'::char(100) AS z
           FROM generate_series(1, 5000000) AS x
           ORDER BY random();
SELECT 5000000
5 million rows should be enough to get a first impression of what is going on.
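Since the decisive factor later on will be how much data has to be shipped to the GPU, it is worth checking how big the table actually is. pg_total_relation_size and pg_size_pretty are stock PostgreSQL functions, so this works with or without pgstrom; with two char(100) columns per row, 5 million rows end up on the order of a gigabyte:

-- check how much data the benchmark has to push around
SELECT pg_size_pretty(pg_total_relation_size('t_test'));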
To make sure that a real difference can actually be observed, I decided to use no indexes at all. In real life this is not realistic, because performance would suffer horribly; however, pgstrom has not been made to speed up index lookups anyway, so this should not be an issue.
The first thing I tried was to filter and group some data on the CPU:
test=# explain analyze SELECT count(*) FROM t_test WHERE sqrt(x) > 0 GROUP BY y;
                                 QUERY PLAN
--------------------------------------------------------------------------------------
 HashAggregate  (cost=242892.24..242892.25 rows=1 width=101) (actual time=3965.362..3965.362 rows=1 loops=1)
   Group Key: y
   ->  Seq Scan on t_test  (cost=0.00..234558.91 rows=1666665 width=101) (actual time=0.033..1417.593 rows=5000000 loops=1)
         Filter: (sqrt((x)::double precision) > '0'::double precision)
 Planning time: 0.459 ms
 Execution time: 3965.495 ms
My box (4 GHz AMD) can do that in just under 4 seconds. Note that I am using the standard PostgreSQL storage manager here (no column store or the like).
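For the GPU runs I simply flipped pgstrom on in my session. pg_strom.enabled is the GUC used by the version I tested (it also shows up in the comments below), so take this as a sketch of how I switched between the two plan types:

-- enable GPU execution for this session
SET pg_strom.enabled = on;

-- and back to plain CPU plans for the comparison runs
SET pg_strom.enabled = off;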
Let us try the same thing on the GPU:
test=# explain analyze SELECT count(*) FROM t_test WHERE sqrt(x) > 0 GROUP BY y;
                                 QUERY PLAN
------------------------------------------------------------------------------------------
 HashAggregate  (cost=176230.88..176230.89 rows=1 width=101) (actual time=2004.355..2004.356 rows=1 loops=1)
   Group Key: y
   ->  Custom Scan (GpuPreAgg)  (cost=11929.24..171148.30 rows=75 width=108) (actual time=1151.310..2003.868 rows=76 loops=1)
         Bulkload: On (density: 100.00%)
         Reduction: Local + Global
         Device Filter: (sqrt((x)::double precision) > '0'::double precision)
         ->  Custom Scan (BulkScan) on t_test  (cost=8929.24..167897.55 rows=5000001 width=101) (actual time=12.956..1152.273 rows=5000000 loops=1)
 Planning time: 0.550 ms
 Execution time: 2299.633 ms
(9 rows)
We see a nice improvement here. The speedup is substantial (roughly 1.7x), especially when taking into consideration that merely reading the data already takes more than a second. Moving work out to the GPU definitely pays off in this case.
The interesting thing to notice is that the real improvement comes from the GROUP BY clause. A plain filter does not show a benefit:
test=# explain analyze SELECT count(*) FROM t_test WHERE sqrt(x) > 0;
                                 QUERY PLAN
--------------------------------------------------------------------------------------------
 Aggregate  (cost=238725.58..238725.59 rows=1 width=0) (actual time=1762.829..1762.829 rows=1 loops=1)
   ->  Seq Scan on t_test  (cost=0.00..234558.91 rows=1666665 width=0) (actual time=0.055..1311.220 rows=5000000 loops=1)
         Filter: (sqrt((x)::double precision) > '0'::double precision)
 Planning time: 0.074 ms
 Execution time: 1762.875 ms
(5 rows)

test=# explain analyze SELECT count(*) FROM t_test WHERE sqrt(x) > 0;
                                 QUERY PLAN
-------------------------------------------------------------------------------------------------
 Aggregate  (cost=172064.21..172064.22 rows=1 width=0) (actual time=1411.036..1411.036 rows=1 loops=1)
   ->  Custom Scan (GpuPreAgg)  (cost=11929.24..171148.30 rows=75 width=4) (actual time=650.590..1410.837 rows=76 loops=1)
         Bulkload: On (density: 100.00%)
         Reduction: NoGroup
         Device Filter: (sqrt((x)::double precision) > '0'::double precision)
         ->  Custom Scan (BulkScan) on t_test  (cost=8929.24..167897.55 rows=5000001 width=0) (actual time=16.034..1160.592 rows=5000000 loops=1)
 Planning time: 1.634 ms
 Execution time: 1745.605 ms
(8 rows)
It certainly makes sense that there is no improvement in this case, because moving the data around is simply too expensive to make a difference. Remember: GPUs only make sense if things can be done in parallel and if the data arrives fast enough. sqrt alone is not complicated enough to justify the effort of shipping the data to the device, and PostgreSQL cannot provide the data fast enough.
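If you want to probe the break-even point on your own hardware, making the device-side expression more expensive is an easy experiment. The query below is only an illustration of the idea, not part of my benchmark; the outcome will depend entirely on your GPU:

-- a deliberately more expensive filter to give the GPU more to chew on
explain analyze
SELECT count(*)
FROM t_test
WHERE sin(sqrt(x)) + cos(sqrt(x)) > 0.5
GROUP BY y;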
It is important to mention that many queries won't benefit from the GPU at all. In fact, I would expect that the majority of queries in a typical system will not behave differently.
Here is an example of a query which is actually slower with pgstrom:
test=# explain analyze SELECT *, sum(x) OVER () FROM t_test WHERE sqrt(x) > 0;
                                 QUERY PLAN
------------------------------------------------------------------------------------------------
 WindowAgg  (cost=8929.24..188730.88 rows=1666667 width=206) (actual time=4449.716..5848.233 rows=5000000 loops=1)
   ->  Custom Scan (GpuScan) on t_test  (cost=8929.24..167897.55 rows=1666667 width=206) (actual time=616.879..1899.651 rows=5000000 loops=1)
         Device Filter: (sqrt((x)::double precision) > '0'::double precision)
 Planning time: 0.142 ms
 Execution time: 6260.461 ms
(5 rows)

test=# explain analyze SELECT *, sum(x) OVER () FROM t_test WHERE sqrt(x) > 0;
                                 QUERY PLAN
--------------------------------------------------------------------------------------------------
 WindowAgg  (cost=0.00..255392.23 rows=1666665 width=206) (actual time=3610.914..4895.518 rows=5000000 loops=1)
   ->  Seq Scan on t_test  (cost=0.00..234558.91 rows=1666665 width=206) (actual time=0.038..1389.645 rows=5000000 loops=1)
         Filter: (sqrt((x)::double precision) > '0'::double precision)
 Planning time: 0.324 ms
 Execution time: 5187.048 ms
(5 rows)
In this case the GPU is a net loss; at least no benefit is to be expected at this stage.
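If a workload is dominated by such queries, GPU execution can be switched off per transaction instead of globally. SET LOCAL is stock PostgreSQL; pg_strom.enabled is the GUC mentioned above:

BEGIN;
-- fall back to a CPU plan just for this statement
SET LOCAL pg_strom.enabled = off;
SELECT *, sum(x) OVER () FROM t_test WHERE sqrt(x) > 0;
COMMIT;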
According to the main developer of pgstrom, sorting is not yet as good as he wants it to be, so I skipped the sort part for now. As sorts are key to many queries, this is pgstrom functionality I am really looking forward to.
I assume that sorts can greatly benefit from a GPU, because there is a lot of intrinsic parallelism in sorting algorithms, so sorting on the GPU could be highly beneficial. The speedup we can expect is hard to predict, but I firmly believe it can be quite substantial.
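Once GPU-based sorting is ready, a query along the following lines would be my first test candidate; this is just a sketch of the experiment, not something I measured here:

-- a sort-heavy query with no index to help:
-- exactly the kind of plan a GPU sort node should target
explain analyze
SELECT x, y
FROM t_test
ORDER BY y, x;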
What stunned me is that I did not encounter a single segmentation fault during my tests. I definitely did not expect that. My assumption was that there would be more loose ends, but things actually worked as expected most of the time. Given the stage of the project, I am pretty excited. pgstrom certainly feels like the future ...
> The interesting thing to notice is that the real improvement can be seen because of the GROUP BY clause.
Having done a bit of programming in OpenCL, I'd say the improvement comes from the GROUP BY implemented as parallel reduction, a pattern where GPGPUs really shine.
Unfortunately, the choice of CUDA limits the range of supported devices. As you mentioned, transferring data between host and device memory can easily kill performance, so integrated GPGPUs like Intel's IRIS/Pro often show comparatively good performance: they use the same slow main memory as the CPU, but in return they do not need those transfers. In addition, this allows the work to be spread over both the CPU and the GPGPU.
Since recent GPGPUs allow multiple concurrent kernels, combining GPGPU parallel processing with columnar data stores also seems very promising. Ah, so many options, so little time...
this seems to be the issue here. i am planning to run some more tests with more grouping and so on. it seems grouping is where pgstrom really excels. in addition, i am really looking forward to seeing what sorts can do once they are implemented the way they are planned. we have interesting times ahead.
> Unfortunately, the choice of CUDA limits the range of supported devices.
The previous version of PG-Strom used OpenCL; however, I went back to CUDA because of driver quality and debugging support.
CUDA has a widespread user base and thus gives us a stable run-time environment. On the other hand, I faced some strange behavior in *ntel's driver when PG-Strom used the OpenCL implementation. It was a hard time for me...
what you have achieved is definitely beyond incredible. i was stunned when i tested things. not a single segfault ... despite the size of the code.
If you give me the backtrace of a crash, it may help me fix it.
Also, I merged cumulative bugfixes around the GpuJoin code. If you can, please retry with the latest master branch.
Oh, this was no criticism. GPGPU acceleration for PostgreSQL is way cool, no matter what API. 🙂
I just tried pg_strom:
LOG: CUDA Runtime version: 6.5.0
LOG: NVIDIA driver version: 340.76
LOG: GPU0 Quadro K1100M (384 CUDA cores, 705MHz), L2 256KB, RAM 2047MB (128bits, 1400MHz), capability 3.0
LOG: NVRTC - CUDA Runtime Compilation vertion 7.5
But when I try to run a query with pg_strom.enabled = on:
ERROR: failed on cuModuleLoadData (CUDA_ERROR_NO_BINARY_FOR_GPU - no kernel image is available for execution on the device)
Any hints what causes this?
Seems to be a driver conflict. But on Ubuntu 14.04 the latest official driver is 346, and with that pg_strom does not compile:
src/cuda_control.c:2522:4: error: ‘CU_DEVICE_ATTRIBUTE_GLOBAL_L1_CACHE_SUPPORTED’ undeclared (first use in this function)
{CU_DEVICE_ATTRIBUTE_GLOBAL_L1_CACHE_SUPPORTED,
^
src/cuda_control.c:2522:4: note: each undeclared identifier is reported only once for each function it appears in
src/cuda_control.c:2524:4: error: ‘CU_DEVICE_ATTRIBUTE_LOCAL_L1_CACHE_SUPPORTED’ undeclared (first use in this function)
{CU_DEVICE_ATTRIBUTE_LOCAL_L1_CACHE_SUPPORTED,
^
and so on...
Ok, works now
I guess you are using an unsupported CUDA version (6.5); please ensure CUDA 7.0 or later is installed.
LOG: CUDA Runtime version: 6.5.0
LOG: NVIDIA driver version: 340.76
Also, PG-Strom should have a version check here.
Thanks for your feedback. If you can, please file your troubles in the project's GitHub issues:
https://github.com/pg-strom/devel/issues
Could you revisit the OpenCL implementation in the current landscape (2016)?
Maybe things have improved.
Worst case, ignore *ntel and focus on AMD/Nvidia.
No, what I have to do "first" is provide working, valuable and stable software for users.
The overall landscape might change; however, I have already built many things on CUDA.
Switching the platform would delay v1.0. Sorry.
HIP: C++ Heterogeneous-Compute Interface for Portability
http://gpuopen.com/compute-product/hip-convert-cuda-to-portable-c-code/
OpenCL for AMD
CUDA for NVIDIA
WIN / WIN
From the link: "To further reduce the learning curve when moving from Cuda to HIP, we developed the hipify tool to automate your application’s core conversion."
Maybe for future consideration?
For future reference: https://github.com/pg-strom/devel/issues/245
GPUOpen is being pushed a lot:
http://gpuopen.com/
Hi! Very interesting article. Is it possible to use PG-Strom with PostgreSQL on Windows 10 Pro? Alternatively, would it be straightforward to rebuild it for Windows, or does it depend on Linux-specific libraries?
we did not dare to run that on Windows.
Why not, is it not possible? I've downloaded PG-Strom and am trying to figure out if/how to compile/build it using Visual Studio or any other appropriate toolchain.
Any help/advice would be kindly accepted! 🙂
Hi Hans-Jürgen, can you please provide an update to this article with testing of PG-Strom 2.0? I'm interested to know if the performance has improved.
PG-Strom v2.0 features highlight (from the PG-Strom v2.0 Release Technical Brief, 17-Apr-2018):
▌Storage Enhancement
SSD-to-GPU Direct SQL Execution
In-memory Columnar Cache
GPU memory store (gstore_fdw)
▌Advanced SQL Infrastructure
PostgreSQL v9.6/v10 support, CPU + GPU Hybrid Parallel
SCAN + JOIN + GROUP BY combined GPU kernel
Utilization of demand paging of GPU device memory
▌Miscellaneous
PL/CUDA related enhancement
New data type support
Documentation and Packaging