Table of Contents
Sometimes a PostgreSQL query takes forever. Usually, it is easy to interrupt (cancel) such a statement, but sometimes it can turn into a problem, and a query cannot be canceled. This article discusses what might be the cause. I'll also show you a trick how to solve the problem (not for the faint of heart!)
The PostgreSQL protocol has provisions for interrupting a running statement. This is done by opening a new connection and sending a CancelRequest
message with a secret key. That secret key was sent by the server during the start of the original connection. Without that key, everybody could cancel your query, which would be an inacceptable security problem.
The C library libpq
provides the functions PQgetCancel()
and PQcancel()
to cancel queries, and other database APIs should have similar provisions. In an interactive psql
session, you can simply hit Ctrl+C to send a cancel request, and GUI clients usually have a button for that purpose.
But it is also possible to cancel somebody else's query by calling the database function pg_cancel_backend()
. Another option is pg_terminate_backend()
, which goes one step further and terminates someone else's database session. To be able to use these functions, you must either be a superuser or a member of the default role pg_signal_backend
(the next section will explain why), or you must have connected to the database as the same database user as the session you want to torpedo (you are allowed to cancel your own statements).
Inter-process communication in PostgreSQL relies a lot on signals.
When the postmaster process receives a CancelRequest
, it sends the signal SIGINT
to the backend process of the corresponding database session. This is also what the function pg_cancel_backend()
does. pg_terminate_backend()
sends the signal SIGTERM
.
Now each PostgreSQL process has a signal handler that processes these signals when they are received. This signal handler does not immediately interrupt the backend process, but it sets global variables for the process. SIGINT
will set QueryCancelPending
and SIGTERM
will set ProcDiePending
. These variables act as flags, and it is the responsibility of the backend process to react to them as soon as is convenient. This ensures that no process is interrupted at an inconvenient time, when it would for example leave shared memory in an inconsistent state.
Calls to the CHECK_FOR_INTERRUPTS()
macro, which invoke the ProcessInterrupts()
function, are sprinkled all over the PostgreSQL code at safe places. This function will then throw the error that cancels the current statement or it will terminate the backend process, depending on which flag was set.
There are several possible causes:
CHECK_FOR_INTERRUPTS()
. That would be a PostgreSQL bug, and the fix would be to add another call to the macro.kill -9
unless you are desperateIt is perfectly fine to use plain kill
on a PostgreSQL backend process. That will send the SIGTERM
signal, so it is the same as calling pg_terminate_backend()
for that backend. If that has no effect, it is tempting to use kill -9
, which will send SIGKILL
. That signal cannot be caught and immediately terminates the process. The problem is that the postmaster detects if one of its child processes did not shut down cleanly. It will then kill all other PostgreSQL processes and undergo crash recovery, which causes an outage of the whole database that can take seconds to minutes.
Note that while using kill -9
on a backend causes a short down time, kill -9
on the postmaster process itself has even worse effects and should be avoided at any cost. It opens a time window during which a new postmaster could be started while some of the old postmaster's children are still alive, which is likely to lead to corruption of the data on disk. Never, ever, kill the postmaster process with kill -9
!
Sometimes even kill -9
won't be able to kill a PostgreSQL backend. That means that the backend is stuck in an uninterruptible system call, for exampling performing I/O on network attached storage that is no longer available. If that condition persists, the only way to get rid of the process is a reboot of the operating system.
Sometimes you can avoid crash recovery and an outage by proceeding as follows. This example uses the GNU debugger on Linux; you'll have to adapt it to other environments as necessary.
We write this simple C function (source file loop.c
):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
#include 'postgres.h' #include 'fmgr.h' #include <unistd.h> PG_MODULE_MAGIC; PG_FUNCTION_INFO_V1(loop); Datum loop(PG_FUNCTION_ARGS) { /* an endless loop */ while(1) sleep(2); } |
Build the shared library like this (change the include path as appropriate):
1 2 |
gcc -I /usr/pgsql-14/include/server -fPIC -shared -o loop.so loop.c |
and copy the file into the PostgreSQL shared library directory (which you can get with “pg_config --libdir
”).
Define the function in SQL as superuser:
1 2 |
CREATE FUNCTION loop() RETURNS void LANGUAGE c AS 'loop'; |
Then, as an arbitrary user, call the function:
1 |
SELECT loop(); |
Execution will hang. You can try to cancel the query, but it will keep running.
Open another database connection with the same database user. Then find out the process ID of the background process, which identifies the database session:
1 2 3 |
SELECT pid, query FROM pg_stat_activity WHERE query LIKE '%loop%'; |
Once you know the process ID, send the process a SIGTERM
:
1 |
SELECT pg_terminate_backend(12345); |
Of course, the argument should be the process ID obtained in the previous step. The function returns TRUE
, since the signal was sent, but the query continues to execute.
Make sure that the GNU debugger gdb
is installed. To get a readable stack trace, you should have the debugging symbols for the PostgreSQL server installed, although that is not necessary for the trick I will show you. Log into the database server machine as the PostgreSQL user, which is typically called postgres
. Call gdb
like this (using the proper path to the postgres
executable and the proper process ID):
1 |
gdb /usr/pgsql-14/bin/postgres 12345 |
As soon as you get the prompt “(gdb)
”, you generate a stack trace with the command “bt
”. That will look similar to the following:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
#0 __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7ffdaf61cde0, rem=rem@entry=0x7ffdaf61cde0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:71 #1 0x00007f113d864897 in __GI___nanosleep (req=req@entry=0x7ffdaf61cde0, rem=rem@entry=0x7ffdaf61cde0) at ../sysdeps/unix/sysv/linux/nanosleep.c:25 #2 0x00007f113d8647ce in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55 #3 0x00007f113e623139 in loop () from /usr/pgsql-14/lib/loop.so #4 0x00000000006d71fb in ExecInterpExpr (state=0x13837b8, econtext=0x13834e0, isnull=<optimized out>) at executor/execExprInterp.c:1260 #5 0x000000000070e391 in ExecEvalExprSwitchContext (isNull=0x7ffdaf61ced7, econtext=0x13834e0, state=0x13837b8) at executor/../../../src/include/executor/executor.h:339 #6 ExecProject (projInfo=0x13837b0) at executor/../../../src/include/executor/executor.h:373 #7 ExecResult (pstate=<optimized out>) at executor/nodeResult.c:136 #8 0x00000000006da8b2 in ExecProcNode (node=0x13833d0) at executor/../../../src/include/executor/executor.h:257 #9 ExecutePlan (execute_once=<optimized out>, dest=0x137f4c0, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x13833d0, estate=0x13831a8) at executor/execMain.c:1551 [...] |
The stack trace is a valuable help for determining where the problem is. Include it if you report a bug to PostgreSQL!
If you don't want to perform the next step, you can chicken out and enter “detach
” to detach the debugger from the process and allow it to continue.
The above stack trace shows that execution currently is not inside PostgreSQL code, but in a custom function (in loop () from /usr/pgsql-14/lib/loop.so
). That means that it is pretty safe to let the process exit. If execution is somewhere inside the PostgreSQL server, there is a small risk that PostgreSQL may be in the middle of modifying shared state, holding a spinlock or something similar. If you know the PostgreSQL source, a look at the call stack will help you to assess that risk. Now, if you dare, call ProcessInterrupts()
, which will cause the process to exit, since ProcDiePending
is set:
1 2 3 4 5 6 |
(gdb) print ProcessInterrupts() [Inferior 1 (process 12345) exited with code 01] The program being debugged exited while in a function called from GDB. Evaluation of the expression containing the function (ProcessInterrupts) will be abandoned. (gdb) quit |
To improve the situation, the function code should be modified like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
#include 'postgres.h' #include 'fmgr.h' #include 'miscadmin.h' #include <unistd.h> PG_MODULE_MAGIC; PG_FUNCTION_INFO_V1(loop); Datum loop(PG_FUNCTION_ARGS) { /* an endless loop */ while(1) { CHECK_FOR_INTERRUPTS(); sleep(2); } } |
The above changes will cause the function to check for interrupts every two seconds, so that execution can be canceled safely.
Canceling queries works by sending the backend a SIGINT
signal. If neither that nor sending SIGTERM
can interrupt a backend, you can attach to the hanging backend with gdb
and call ProcessInterrupts()
directly to make it exit.
If you found this article interesting, perhaps you will also like my article about how to use TCP keepalive to keep abandoned queries from running forever, and to keep idle sessions from closing.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.
+43 (0) 2622 93022-0
office@cybertec.at
You are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Nice Article. I have one question here:
How a simple infinite loop c script (run from OS) can be terminated using SIGTERM signal from OS side. Why Postgres ignores the SIGTERM signal?
That is not really related to the article. A C program will terminate if you send it
SIGTERM
unless you programmed it to catch and handle that signal.