Table of Contents
(Updated 2023-04-07) Even if you understand what a deadlock is, debugging deadlocks can be tricky. This article shows some techniques on how to figure out the cause of a deadlock.
We will test our techniques for debugging deadlocks with the following example:
1 2 3 4 5 6 7 8 9 10 11 12 |
CREATE TABLE parent ( pid bigint PRIMARY KEY, pdata text NOT NULL ); CREATE TABLE child ( cid bigint PRIMARY KEY, pid bigint REFERENCES parent NOT NULL, cdata text NOT NULL ); INSERT INTO parent VALUES (1, 'no children yet'); |
To provoke the deadlock, we run the following transactions in parallel.
Each transaction adds a row to the child
table and then tries to modify the corresponding row in parent
.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
-- session 1 BEGIN; INSERT INTO child VALUES (100, 1, 'new child'); -- session 2 BEGIN; INSERT INTO child VALUES (101, 1, 'another child'); SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE; -- session 2 blocks -- session 1 SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE; -- session 1 blocks, but after one second, it receives an error ERROR: deadlock detected DETAIL: Process 19674 waits for ShareLock on transaction 1109; blocked by process 19718. Process 19718 waits for ShareLock on transaction 1108; blocked by process 19674. HINT: See server log for query details. CONTEXT: while locking tuple (0,1) in relation 'parent' -- session 2 is unblocked and receives a result pdata ═════════════ no children (1 row) |
The PostgreSQL log reveals more details:
1 2 3 4 5 6 7 8 |
ERROR: deadlock detected DETAIL: Process 19674 waits for ShareLock on transaction 1109; blocked by process 19718. Process 19718 waits for ShareLock on transaction 1108; blocked by process 19674. Process 19674: SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE; Process 19718: SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE; HINT: See server log for query details. CONTEXT: while locking tuple (0,1) in relation 'parent' STATEMENT: SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE; |
The error message sent to the client does not contain any meaningful detail. That is because the SQL statement from the other sessions participating in the deadlock may contain data that you have no right to see. The PostgreSQL log contains more helpful data, but also not enough to understand the deadlock. To understand the cause of the deadlock, you would have to know all statements from the involved transactions, because locks are always held until the end of a transaction. The two statements from the error message alone could never cause a deadlock. So some earlier statement in these transactions must already have taken locks that are part of the problem. However, PostgreSQL does not retain any memory of these previous statements, so we have to find some other way to get the whole picture.
Usually you need help from the application's developers. Perhaps you are lucky, the SQL statements from the error message are enough to identify the database transactions involved, and it is easy to reconstruct the SQL statements that were issued from each transaction. By examining the locks taken by these statements, you should be able to reconstruct the cause of the deadlock.
If we know the statements from our transactions in the example above, we can deduce the following:
INSERT
statements take a FOR KEY SHARE
lock on the referenced rows in parent
. That is necessary so that no concurrent transaction can delete the row referenced by our not yet committed INSERT
. As you can see in the compatibility matrix, these locks don't conflict with each other.FOR UPDATE
row locks from the SELECT ... FOR UPDATE
statements conflict with the FOR KEY SHARE
locks from the other transaction, which leads to the deadlock.Once we know that, the solution is simple: change “SELECT ... FOR UPDATE
” to “SELECT ... FOR NO KEY UPDATE
”, and there will be no more lock conflicts and consequently no deadlock.
Note that a FOR UPDATE
row lock is only required if you want to DELETE
a row, or if you intend to update a column with a unique or primary key constraint on it (a key). The lock name FOR UPDATE
is counter-intuitive, since it suggests that this is the correct lock for an UPDATE
, and it is a common trap for PostgreSQL beginners.. See my article about row locks for more detail.
Great, with our complete knowledge we could fix the problem. But how do we get the statements from the transactions?
There are several approaches:
If the application logs information about each error and each database transaction started, it is easy to determine the transactions that caused the problem. However, that requires that you either have control over the application code, or that the application has sufficiently well-designed tracing facilities. You are not always in that lucky position.
This is also a method that requires that you can modify the application. The trick here is to add a comment to the query, like
1 2 |
SELECT /* function 'add_data', source file 'app.c', line 1234 */ pdata FROM parent WHERE pid = 1 FOR UPDATE; |
The information can be anything that allows you to tell which transaction the statement belongs to.
In order to enable us to trace statements, we have to make sure that the process ID and transaction number are logged with each statement. With CSV logging, that is always the case, but with stderr
logging, you have to change log_line_prefix
:
1 |
log_line_prefix = '%m [pid=%p] %q[txid=%x] %u@%d ' |
Also, you have to log all statements:
1 |
log_statement = 'all' |
Now the amount and performance impact of logging all statements may be forbidding for busy applications. PostgreSQL v12 introduced log_transaction_sample_rate
, which allows you to log only a certain percentage of all transactions. However, in many cases that won't be good enough, because you'll typically only see the statements from one of the transactions involved.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
2022-06-20 17:17:50.888 CEST [pid=19718] [txid=0] laurenz@test LOG: statement: BEGIN; 2022-06-20 17:17:52.988 CEST [pid=19718] [txid=0] laurenz@test LOG: statement: INSERT INTO child VALUES (100, 1, 'new child'); 2022-06-20 17:17:58.215 CEST [pid=19674] [txid=0] laurenz@test LOG: statement: BEGIN; 2022-06-20 17:18:03.626 CEST [pid=19674] [txid=0] laurenz@test LOG: statement: INSERT INTO child VALUES (101, 1, 'another child'); 2022-06-20 17:18:18.979 CEST [pid=19674] [txid=1108] laurenz@test LOG: statement: SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE; 2022-06-20 17:18:31.550 CEST [pid=19718] [txid=1109] laurenz@test LOG: statement: SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE; 2022-06-20 17:18:32.551 CEST [pid=19718] [txid=1109] laurenz@test ERROR: deadlock detected 2022-06-20 17:18:32.551 CEST [pid=19718] [txid=1109] laurenz@test DETAIL: Process 19718 waits for ShareLock on transaction 1108; blocked by process 19674. Process 19674 waits for ShareLock on transaction 1109; blocked by process 19718. Process 19718: SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE; Process 19674: SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE; 2022-06-20 17:18:32.551 CEST [pid=19718] [txid=1109] laurenz@test HINT: See server log for query details. 2022-06-20 17:18:32.551 CEST [pid=19718] [txid=1109] laurenz@test CONTEXT: while locking tuple (0,1) in relation 'parent' 2022-06-20 17:18:32.551 CEST [pid=19718] [txid=1109] laurenz@test STATEMENT: SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE; |
Note that the transaction ID is logged as 0 until after the first data modifying statement in the transaction. This is because PostgreSQL assigns transaction IDs lazily, only when they are needed.
deadlock_timeout
The PostgreSQL deadlock detector doesn't kick in immediately, but waits until after deadlock_timeout
has passed. This is because the deadlock detector is a computationally expensive component, so we only want to activate it if there is very likely a problem. By default, the parameter is set to one second. Unfortunately, one second is much too short to debug the problem. Now if you can afford to let deadlocked sessions “hang” until you can debug them, you can increase the parameter. The maximum allowed value is almost 600 days, which is longer than you need. With proper monitoring, you could for example set deadlock_timeout
to 15 minutes to catch deadlocks in the act.
Once you have an active deadlock, you can find client_addr
and client_port
from pg_stat_activity
for the hanging sessions:
1 2 3 4 5 6 7 8 9 10 11 |
SELECT client_addr, client_port, query FROM pg_stat_activity WHERE state = 'active' AND wait_event_type = 'Lock' ORDER BY state_change; client_addr │ client_port │ query ═══════════════╪═════════════╪════════════════════════════════════════════════════ 192.168.2.120 │ 49578 │ SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE; 192.168.2.122 │ 47432 │ SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE; (2 rows) |
We sort by state_change
, because other sessions might later get stuck behind the sessions that originally caused the deadlock. For that reason, you want to focus on the oldest entries.
Next, log into the client machine and find out which process has a TCP connection on port 49578:
1 2 |
# lsof -P | grep 49578 myapp 3974 laurenz [...] TCP localhost:49578->dbserver:5432 (ESTABLISHED) |
Now you know that the client process is 3974, running the program myapp
. You can now attach to the program with a debugger and see the current execution stack. That should give you a clue as to what is being executed, so you can figure out the SQL statements run in that transaction.
Needless to say, this is yet another method that requires control over the application code.
We have seen several methods for debugging deadlocks. Most of them require that you have control over the application code. Not all techniques are useful in every case: for example, you may not be able to afford logging all statements or leaving deadlocks hanging for a longer time.
For more information on deadlocks, see Hans' blog on Understanding Deadlocks.
Please leave your comments below. In order to receive regular updates on important changes in PostgreSQL, subscribe to the CYBERTEC newsletter, or follow us on Twitter, Facebook, or LinkedIn
You need to load content from reCAPTCHA to submit the form. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Laurenz,
Could you explain this part about:
You've probably talked about it in the blog, I'm searching it as we speak.
I only briefly glossed over it, but my article about row locks describes it in some more detail.