Table of Contents
(Updated 2023-06-22) If you've heard about TCP keepalive but aren't sure what that is, read on. If you've ever been surprised by error messages like:
server closed the connection unexpectedly
SSL SYSCALL error: EOF detected
unexpected EOF on client connection
could not receive data from client: Connection reset by peer
then this article is for you.
There are several possible causes for broken connections:
The first two messages in the above list can be the consequence of a PostgreSQL server problem. If the server crashes for whatever reason, you'll see a message like that. To investigate whether there is a server problem, you should first look into the PostgreSQL log and see if you can find a matching crash report.
We won't deal with that case in the following, since it isn't a network problem.
If the client exits without properly closing the database connection, the server will get an end-of-file or an error while communicating on the network socket. With the new session statistics introduced in v14, you can track the number of such “abandoned” database connections in pg_stat_database.sessions_abandoned
.
For example, if an application server fails and is restarted, it typically won't close the connections to the database server. This isn't alarming, and the database server will quickly detect it when the server tries to send data to the client. But if the database session is idle, the server process is waiting for the client to send the next statement (you can see the wait_event
“ClientRead
” in pg_stat_activity
). Then the server won't immediately notice that the client is no longer there! Such lingering backend processes occupy a process slot and can cause you to exceed max_connections
.
PostgreSQL v14 has introduced a new parameter idle_session_timeout
which closes idle connections after a while. But that will terminate “healthy” idle connections as well, so it isn't a very good solution. TCP keepalive provides a much better solution to this problem.
Sometimes both ends of the database connection experience the same problem: each sees that the other end “hung up on them”. In that case, the problem lies somewhere between the database client and the server.
Network connections can get disconnected if there is a real connectivity problem. There's nothing you can do to change that on the software level. But very often, disconnections are caused by the way firewalls or routers are configured. The network component may have to “memorize” the state of each open connection, and the resources for that are limited. So it can seem expedient to “forget” and drop connections that have been idle for a longer time.
Since a lot of today's TCP traffic is via HTTP, and HTTP is stateless, that's not normally a problem. If your HTTP connection is broken, you simply establish a new connection for your next request, which isn't very expensive. But databases are different:
This is where TCP keepalive comes in handy as a way to keep idle connections open.
Keepalive is a functionality of the TCP protocol. When you set the SO_KEEPALIVE
option on a TCP network socket, a timer will start running as soon as the socket becomes idle. When the keepalive idle time has expired without further activity on the socket, the kernel will send a “keepalive packet” to the communication partner. If the partner answers, the connection is considered good, and the timer starts running again.
If there is no answer, the kernel waits for the keepalive interval before sending another keepalive packet. This process is repeated until the number of keepalive packets sent reaches the keepalive count. After that, the connection is considered dead, and attempts to use the network socket will result in an error.
Note that it is the operating system kernel, not the application (database server or client) that sends keepalive messages. The application is not aware of this process.
TCP keepalive serves two purposes:
The default values for the keepalive parameters vary from operating system to operating system. On Linux and Windows, the default values are:
Thanks to Vahid Saber for the MacOS settings!
To keep firewalls and routers from closing an idle connection, we need a much lower setting for the keepalive idle time. Then keepalive packets get sent before the connection is closed. This will trick the offending network component into believing that the connection isn't idle, even if neither database client nor server send any data.
For this use case, keepalive count and keepalive interval are irrelevant. All we need is for the first keepalive packet to be sent early enough.
For this use case, reducing the keepalive idle time is often not enough. If the server sends nine keepalive packets with an interval of 75 seconds, it will take more than 10 minutes before a dead connection is detected. So we'll also reduce the keepalive count, or the keepalive interval, or both - as in this case.
There is still one missing piece to the puzzle: even if the operating system detects that a network connection is broken, the database server won't notice, unless it tries to use the network socket. If it's waiting for a request from the client, that will happen immediately. But if the server is busy executing a long-running SQL statement, it won't notice the dead connection until the query is done and it tries to send the result back to the client! To prevent this from happening, PostgreSQL v14 has introduced the new parameter client_connection_check_interval
, which is currently only supported on Linux. Setting this parameter causes the server to “poll” the socket regularly, even if it has nothing to send yet. That way, it can detect a closed connection and interrupt the execution of the SQL statement.
The PostgreSQL server always sets SO_KEEPALIVE
on TCP sockets to detect broken connections, but the default idle timeout of two hours is very long.
You can set the configuration parameters tcp_keepalives_idle
, tcp_keepalives_interval
and tcp_keepalives_count
(the last one is not supported on Windows) to change the settings for all server sockets.
This is the most convenient way to configure TCP keepalive for all database connections, regardless of the client used.
The PostgreSQL client shared library libpq
has the connection parameters keepalives_idle
, keepalives_interval
and keepalives_count
(again, the latter is not supported on Windows) to configure keepalive on the client side.
These parameters can be used in PostgreSQL connection strings with all client interfaces that link with libpq
, for example, Psycopg or PHP.
The PostgreSQL JDBC driver, which does not use libpq
, only has a connection parameter tcpKeepAlive
to enable TCP keepalive (it is disabled by default), but no parameter to configure the keepalive idle time and other keepalive settings.
Instead of configuring keepalive settings specifically for PostgreSQL connections, you can change the operating system default values for all TCP connections - which can be useful, if you are using a PostgreSQL client application that doesn't allow you to set keepalive connection parameters.
On Linux, this is done by editing the /etc/sysctl.conf
file:
1 2 3 4 |
# detect dead connections after 70 seconds net.ipv4.tcp_keepalive_time = 60 net.ipv4.tcp_keepalive_intvl = 5 net.ipv4.tcp_keepalive_probes = 3 |
To activate the settings without rebooting the machine, run
1 |
sysctl -p |
On older MacOS versions, you can also edit /etc/sysctl.conf
, but the parameters are different:
1 2 3 4 |
# detect dead connections after 70 seconds net.inet.tcp.keepidle = 60000 net.inet.tcp.keepintvl = 5000 net.inet.tcp.keepcnt = 3 |
On newer MacOS versions (tested on 13), create the file /Library/LaunchDaemons/sysctl.plist
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
<?xml version='1.0' encoding='UTF-8'?> <!DOCTYPE plist PUBLIC '-//Apple//DTD PLIST 1.0//EN' 'http://www.apple.com/DTDs/PropertyList-1.0.dtd'> <plist version='1.0'> <dict> <key>Label</key> <string>sysctl</string> <key>Program</key> <string>/usr/sbin/sysctl</string> <key>ProgramArguments</key> <array> <string>/usr/sbin/sysctl</string> <string>net.inet.tcp.keepidle=60000</string> <string>net.inet.tcp.keepintvl=5000</string> <string>inet.inet.tcp.keepcnt=3</string> </array> <key>RunAtLoad</key> <true/> </dict> </plist> |
You will have to reboot to activate the changes.
On Windows, you change the TCP keepalive settings by adding these registry keys:
1 2 |
HKEY_LOCAL_MACHINESystemCurrentControlSetServicesTcpipParametersKeepAliveTime HKEY_LOCAL_MACHINESystemCurrentControlSetServicesTcpipParametersKeepAliveInterval |
As noted above, there is no setting for the number of keepalive probes, which is hard-coded to 10. The registry keys must be of type DWORD
, and the values are in milliseconds rather than in seconds.
After changing these keys, restart Windows to activate them.
Configuring TCP keepalive can improve your PostgreSQL experience, either by keeping idle database connections open, or through the timely detection of broken connections. You can do configure keepalive on the PostgreSQL client, the server, or on the operating system.
In addition to configuring keepalive, set the new parameter client_connection_check_interval
to cancel long-running queries when the client has abandoned the session.
To learn more about terminating database connections, see our blog post here.
If you would like to learn about connections settings, see my post about max_connections.
You need to load content from reCAPTCHA to submit the form. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Hello, if a haproxy exists between the application server and the postgres server, does the tcp keepalive settings still work
if you can control haproxy then you can consider haproxy as a client and set it accirdingly
MacOS settings are:
bcd0741786e8 ~ % sysctl -a |grep net.inet.tcp.keep
net.inet.tcp.keepidle: 7200000
net.inet.tcp.keepintvl: 75000
net.inet.tcp.keepinit: 75000
net.inet.tcp.keepcnt: 8
version: Ventura 13.2.1 as of Feb 2023
Thanks a lot! I have updated the article with that information. Could you check if everything is correct?
Looks good
Some firewall default policy is to block keepalive packets. Beware that there isn't any of these between the postgres server and clients. If so, you should allow keepalive packet for postgres port.
Interesting information. Yet another way to configure a network to break database connections.
client_connection_check_interval shoud be grater or lower than tcp_keepalives_idle (tcp_keepalives_interval*tcp_keepalives_count) ?
What is recommended value?
I don't think it matters that much. The point is only that a query whose client has died doesn't keep running forever.
from some reasons (e.g. generating additional network traffic) AWS sets on RDS and Aurora Postgresql tcp_keepalives_idle = 300 seconds, as default
client_connection_check_interval is not set yet, probably because it's the new parameter (postgresql >=14).
This i s why I'm wondering how often postgresql server shloud check client connection (5sec or 300sec, 900sec)
Also could you explain what is the difference between keepalive parameter and client_connection_check_interval in case to detect dead connection?
Why I should set client_connection_check_interval when I have keepalive on?
If client abandon query that is still running on the server, and then the postgresql server trigger keepalive then the server should know that connection is dead and finish query and connection?
I am not going to explain the keepalive parameters and
client_connection_check_interval
again. That is described in the article.If keepalive closes the socket on a dead connection after 400 seconds, and you set
client_connection_check_interval = '100s'
, you can be certain that any running query is canceled at most 500 seconds after the client went away.A running query does not notice immediately that the client is gone, because it does not try to read from or write to the socket. So even if the socket is already known dead, PostgreSQL doesn't realize that immediately.
thanks for the replies and great article 🙂
Hi LAURENZ,
We have AWS rds postgres and lambda. I observe errors "could not receive data from client: Connection reset by peer"
occasionally each 5-20 min. I tried to adjust tcp_keepalives_idle, tcp_keepalives_interval, tcp_keepalives_count and client_connection_check_interval parameters on BD side but no luck. The errors aren't gone. The default DB values are
tcp_keepalives_idle = 300
tcp_keepalives_interval = 30
tcp_keepalives_count = 2
client_connection_check_interval = 0
On lambda(Psycopg) side
keepalive_kwargs = {
'keepalives': 1,
'keepalives_idle': 30,
'keepalives_interval': 5,
'keepalives_count': 5,
}
Looks like I'm doing something wrong, please advise.
I don't know Amazon's network topology or the network components between you and Amazon, so I cannot tell you what is going on.
may i know what's difference between below two errors?
"
could not receive data from client: Connection reset by peer"
"could not receive data from client: Connection timed out".
are both same issue due to keepalive related?
No, only the first error is related to keepalive: The peer (the other end of the TCP connection) has terminated the connection. The second error means that nobody responded to a TCP connection request within certain time. This is typical if you are trying to connect to a machine that is shut down, does not exist or is blocked by a firewall.