This is a short war story from a customer problem. It serves as a warning that there are special considerations when running software in a Docker container.
The customer is running PostgreSQL in Docker containers. They are not using the “official” image, but their own.
Sometimes, under conditions of high load, PostgreSQL crashes with
LOG: server process (PID 84799) was terminated by signal 13: Broken pipe
LOG: terminating any other active server processes
This causes PostgreSQL to undergo crash recovery, during which the service is not available.
SIGPIPE (signal 13 on Linux) is a rather harmless signal: the kernel sends that signal to a process that tries to write to a pipe if the process at the other end of the pipe no longer exists. Crash recovery seems like a somewhat excessive reaction to that. If you look at the log entry, the message level is LOG and not PANIC (an error condition that PostgreSQL cannot recover from).
The reason for this excessive reaction is that PostgreSQL does not expect a child process to die from signal 13. Careful scrutiny of the PostgreSQL code shows that all PostgreSQL processes ignore SIGPIPE. So if one of these processes dies from that signal, something must be seriously out of order.
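For illustration (this example has nothing to do with the customer's system), here is what death by SIGPIPE looks like in a shell, and how the behavior changes when the writer ignores the signal the way PostgreSQL server processes do:

# "yes" writes into a pipe whose reader exits immediately and is killed by signal 13
yes | head -n 1 > /dev/null
echo "${PIPESTATUS[0]}"      # prints 141 = 128 + 13 in bash: death by SIGPIPE

# if the writer ignores SIGPIPE, the write merely fails with EPIPE and the
# process can handle the error instead of dying
( trap '' PIPE; exec yes ) 2>/dev/null | head -n 1 > /dev/null
echo "${PIPESTATUS[0]}"      # now an ordinary error exit, not a signal death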
In PostgreSQL, the postmaster process (the parent of all server processes) listens for incoming connections and starts new server processes. It takes good care of its children: it respawns background processes that terminated in a controlled fashion, and it watches out for children that died from “unnatural causes”. Any such event is alarming, because all PostgreSQL processes use shared buffers, the shared memory segment that contains the authoritative copy of the table data. If a server process runs amok, it can scribble over these shared data and corrupt the database. Also, something could interrupt a server process in the middle of a “critical section” and leave the database in an inconsistent state. To prevent that from happening, the postmaster treats any irregular process termination as a sign of danger. Since shared buffers might be affected, the safe course is to interrupt processing and to restore consistency by performing crash recovery from the latest checkpoint.
(If this behavior strikes you as oversensitive, and you are less worried about data integrity, you might prefer more cavalier database systems like Oracle, where a server crash – euphemistically called ORA-00600 – does not trigger such a reaction.)
To understand and fix the problem, it was important to know which server process died from signal 13. All we knew was the process ID from the error message. We searched the log files for messages from this process, which is easy if you log the process ID with each entry. However, that process never left any trace in the log, even when we cranked up log_min_messages to debug3.
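For reference, a logging configuration along these lines does the trick; the exact prefix is a matter of taste:

# postgresql.conf (sketch): tag every log line with the process ID and log
# with maximum detail
log_line_prefix = '%m [%p] '     # %p is the backend process ID, %m a timestamp
log_min_messages = debug3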
An added difficulty was that the error condition could not be reproduced on demand. All we could do was increase the load on the system by starting a backup, in the hope that the problem would manifest itself.
The next idea was to take regular “ps” snapshots in the hope of catching the offending process red-handed. The process remained elusive. Finally, the customer increased the frequency of those snapshots to one per second, and in the end we got a mug shot of our adversary.
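A simple loop like the following is enough for such snapshots (the output file name is just an example):

# append a process listing to a file once per second
while sleep 1; do
    date
    ps -eo pid,stime,user,args
done >> /var/tmp/ps_snapshots.log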
The process turned out not to be a server process at all. Rather, it was a psql process that gets started inside the container to run a monitoring query on the database. Now psql is a client program that does not ignore SIGPIPE, so that mystery is solved. But how can psql be a PostgreSQL server process?
The snapshot in question looked like this:
  PID STIME USER     COMMAND
    1 Mar13 postgres /usr/lib/postgresql/14/bin/postgres -D /postgresqldata/14
  332 Mar13 postgres postgres: logger
84507 11:09 postgres postgres: checkpointer
84508 11:09 postgres postgres: background writer
84509 11:09 postgres postgres: walwriter
84510 11:09 postgres postgres: autovacuum launcher
84511 11:09 postgres postgres: archiver
84512 11:09 postgres postgres: stats collector
84513 11:09 postgres postgres: logical replication launcher
84532 11:09 postgres postgres: logrep_user mydb 10.0.0.42(36270) idle
84557 11:09 postgres postgres: postgres postgres 10.0.0.232(37434) idle
84756 11:10 postgres postgres: appuser mydb 10.0.0.12(38600) idle
84773 11:10 postgres postgres: appuser mydb 10.0.0.12(38610) idle
84799 11:10 root     /usr/lib/postgresql/14/bin/psql -U postgres -t -c SELECT 1=1
The last line is the offending process, which is about to receive signal 13. This is very clearly not a server process; among other things, it is owned by the root user instead of postgres. Unfortunately, the snapshot does not include the parent process ID. However, since the postmaster (in the first line) recognized the rogue process as its child, it must be the parent.
The key observation is that the process ID of the postmaster is 1. In Unix, process 1 is a special process: it is the first user land process that the kernel starts. This process then starts other processes to bring the system up. It is the ancestor of all other processes, and every other process has a parent process. There is another special property of process 1: if the parent process of a process dies, the kernel automatically assigns process 1 as parent to the orphaned process. Process 1 has to “adopt” all orphans.
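You can watch this adoption happen with a small experiment; the commands are purely illustrative and have nothing to do with PostgreSQL:

# start a process from a subshell that exits immediately, which orphans it
( sleep 300 & )
sleep 1
# inside a container (or on any system without a child subreaper), the PPID
# column now shows 1: process 1 has adopted the orphan
ps -eo pid,ppid,user,args | grep '[s]leep 300'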
Normally, process 1 is a special init executable specifically designed for this purpose. But in a Docker container, process 1 is the process that you executed to start the container. As you can see, that was the postmaster. The postmaster handles one of the tasks of the init process admirably: it waits for its children and collects the exit status when one of them dies. This keeps zombie processes from lingering for any length of time. However, the postmaster is less suited to handle another init task: remain stoic if one of its children dies horribly. That is what caused our problem.
Once we understood the problem, the solution was simple: don't start the container with PostgreSQL directly. Rather, start a different process, which in turn starts the postmaster. Either write your own or use an existing solution like dumb-init. The official PostgreSQL Docker image does it right.
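Here is a sketch of what that can look like in a Dockerfile; the package name and paths are assumptions about the customer's image, and running the container with "docker run --init", which injects Docker's bundled init as process 1, is an alternative:

# Dockerfile fragment (sketch): let a minimal init run as process 1 and
# start the postmaster as its child
RUN apt-get update && apt-get install -y dumb-init
ENTRYPOINT ["/usr/bin/dumb-init", "--"]
CMD ["/usr/lib/postgresql/14/bin/postgres", "-D", "/postgresqldata/14"]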
This problem also could not have occurred if the psql process hadn't been started inside the container. It is good practice to consider a container running a service as a closed unit: you shouldn't start jobs or interactive sessions inside the container. I can understand the appeal of using the container's psql executable to avoid having to install the PostgreSQL client anywhere else, but it is a shortcut that you shouldn't take.
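Run the monitoring query from the outside instead, against the container's exposed port; the host name here is made up:

# connect over TCP from the monitoring host rather than exec'ing into the container
psql -h db-container.example.com -p 5432 -U postgres -t -c 'SELECT 1=1'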
It turned out that the cause of our problem was that the postmaster served as process 1 in the Docker container. The psql process that ran a monitoring query died from a SIGPIPE under high load. The postmaster, which had inadvertently inherited that process, noticed this unusual process termination and underwent crash recovery to stay on the safe side.
While running a program in a Docker container is, in most respects, not very different from running it outside, there are some differences that you have to be aware of if you want your systems to run stably.
You can read more about PostgreSQL and Docker in the article "Running Postgres in Docker - Why and How".
Reading this sentence, one might think that Oracle is more prone to data inconsistency. That is not the case.
Oracle has mechanisms to ensure that memory changes left by crashed sessions are cleaned up successfully before other sessions can touch the same data.
May I suggest removing the sentence or changing it to something more technically correct?
Feel free to contact me if you have questions about it (I work for Oracle).
I admit that I have somewhat hard feelings against Oracle when it comes to data integrity. Cases in point are their willingness to accept corrupt strings if client encoding = server encoding or "invalid views", but this is of course a matter of taste. So I may have been slightly unfair.
However, I doubt your assertion: if a server process encounters a bad pointer (software bug, faulty memory, memory overrun caused by the user, ...), it cannot be excluded that it writes over a section of shared memory (SGA) it has no business to modify. You may claim otherwise, but as long as I cannot see Oracle source code, we cannot really settle that question.
It looks like you are OK with spreading uncertainty based on what "cannot be excluded". If that's your intention, I can't do anything against it. At least I tried 🙂
Fair enough. In a half-hearted attempt at being fair, let me state clearly that I do not think that it is very likely that a crashing Oracle server process will corrupt your database.