CYBERTEC PostgreSQL Logo

TCP keepalive for a better PostgreSQL experience

10.2021 / Category: / Tags: | |
Before there was TCP keepalive
© Laurenz Albe 2021

 

(Updated 2023-06-22) If you've heard about TCP keepalive but aren't sure what that is, read on. If you've ever been surprised by error messages like:

  • server closed the connection unexpectedly
  • SSL SYSCALL error: EOF detected
  • unexpected EOF on client connection
  • could not receive data from client: Connection reset by peer

then this article is for you.

Reasons for broken connections

There are several possible causes for broken connections:

Database server crashes

The first two messages in the above list can be the consequence of a PostgreSQL server problem. If the server crashes for whatever reason, you'll see a message like that. To investigate whether there is a server problem, you should first look into the PostgreSQL log and see if you can find a matching crash report.

We won't deal with that case in the following, since it isn't a network problem.

Connections abandoned by the client

If the client exits without properly closing the database connection, the server will get an end-of-file or an error while communicating on the network socket. With the new session statistics introduced in v14, you can track the number of such “abandoned” database connections in pg_stat_database.sessions_abandoned.

For example, if an application server fails and is restarted, it typically won't close the connections to the database server. This isn't alarming, and the database server will quickly detect it when the server tries to send data to the client. But if the database session is idle, the server process is waiting for the client to send the next statement (you can see the wait_eventClientRead” in pg_stat_activity). Then the server won't immediately notice that the client is no longer there! Such lingering backend processes occupy a process slot and can cause you to exceed max_connections.

PostgreSQL v14 has introduced a new parameter idle_session_timeout which closes idle connections after a while. But that will terminate “healthy” idle connections as well, so it isn't a very good solution. TCP keepalive provides a much better solution to this problem.

Connections closed by a network component

Sometimes both ends of the database connection experience the same problem: each sees that the other end “hung up on them”. In that case, the problem lies somewhere between the database client and the server.

Network connections can get disconnected if there is a real connectivity problem. There's nothing you can do to change that on the software level. But very often, disconnections are caused by the way firewalls or routers are configured. The network component may have to “memorize” the state of each open connection, and the resources for that are limited. So it can seem expedient to “forget” and drop connections that have been idle for a longer time.

Since a lot of today's TCP traffic is via HTTP, and HTTP is stateless, that's not normally a problem. If your HTTP connection is broken, you simply establish a new connection for your next request, which isn't very expensive. But databases are different:

  • it is expensive to establish a database connection
  • database connections are not stateless; for example, with a closed connection you lose open transactions, temporary tables and prepared statements
  • it is normal for database sessions to be idle for a longer time, for example if you are using a connection pool, or when the client is waiting for the result from a long-running analytical query

This is where TCP keepalive comes in handy as a way to keep idle connections open.

What is TCP keepalive?

Keepalive is a functionality of the TCP protocol. When you set the SO_KEEPALIVE option on a TCP network socket, a timer will start running as soon as the socket becomes idle. When the keepalive idle time has expired without further activity on the socket, the kernel will send a “keepalive packet” to the communication partner. If the partner answers, the connection is considered good, and the timer starts running again.

If there is no answer, the kernel waits for the keepalive interval before sending another keepalive packet. This process is repeated until the number of keepalive packets sent reaches the keepalive count. After that, the connection is considered dead, and attempts to use the network socket will result in an error.

Note that it is the operating system kernel, not the application (database server or client) that sends keepalive messages. The application is not aware of this process.

TCP keepalive serves two purposes:

  • keep network connections from being idle
  • detect if the other communication end has left without closing the network connection
    (The name “keepalive” does not describe that well – “detectdead” would be more to the point).

TCP keepalive default settings

The default values for the keepalive parameters vary from operating system to operating system. On Linux and Windows, the default values are:

  • keepalive idle time: 2 hours on Linux, MacOS and Windows
  • keepalive interval: 75 seconds on Linux and MacOS, 1 second on Windows
  • keepalive count: 8 on MacOS, 9 on Linux and 10 on Windows (this value cannot be changed on Windows)

Thanks to Vahid Saber for the MacOS settings!

Using TCP keepalive to keep an idle database session alive

To keep firewalls and routers from closing an idle connection, we need a much lower setting for the keepalive idle time. Then keepalive packets get sent before the connection is closed. This will trick the offending network component into believing that the connection isn't idle, even if neither database client nor server send any data.

For this use case, keepalive count and keepalive interval are irrelevant. All we need is for the first keepalive packet to be sent early enough.

Using TCP keepalive to detect dead connections

For this use case, reducing the keepalive idle time is often not enough. If the server sends nine keepalive packets with an interval of 75 seconds, it will take more than 10 minutes before a dead connection is detected. So we'll also reduce the keepalive count, or the keepalive interval, or both - as in this case.

There is still one missing piece to the puzzle: even if the operating system detects that a network connection is broken, the database server won't notice, unless it tries to use the network socket. If it's waiting for a request from the client, that will happen immediately. But if the server is busy executing a long-running SQL statement, it won't notice the dead connection until the query is done and it tries to send the result back to the client! To prevent this from happening, PostgreSQL v14 has introduced the new parameter client_connection_check_interval, which is currently only supported on Linux. Setting this parameter causes the server to “poll” the socket regularly, even if it has nothing to send yet. That way, it can detect a closed connection and interrupt the execution of the SQL statement.

Setting TCP keepalive parameters on the PostgreSQL server

The PostgreSQL server always sets SO_KEEPALIVE on TCP sockets to detect broken connections, but the default idle timeout of two hours is very long.

You can set the configuration parameters tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count (the last one is not supported on Windows) to change the settings for all server sockets.

This is the most convenient way to configure TCP keepalive for all database connections, regardless of the client used.

Setting TCP keepalive parameters on the PostgreSQL client

The PostgreSQL client shared library libpq has the connection parameters keepalives_idle, keepalives_interval and keepalives_count (again, the latter is not supported on Windows) to configure keepalive on the client side.

These parameters can be used in PostgreSQL connection strings with all client interfaces that link with libpq, for example, Psycopg or PHP.

The PostgreSQL JDBC driver, which does not use libpq, only has a connection parameter tcpKeepAlive to enable TCP keepalive (it is disabled by default), but no parameter to configure the keepalive idle time and other keepalive settings.

Setting TCP keepalive parameters on the operating system

Instead of configuring keepalive settings specifically for PostgreSQL connections, you can change the operating system default values for all TCP connections - which can be useful, if you are using a PostgreSQL client application that doesn't allow you to set keepalive connection parameters.

On Linux, this is done by editing the /etc/sysctl.conf file:

To activate the settings without rebooting the machine, run

On older MacOS versions, you can also edit /etc/sysctl.conf, but the parameters are different:

On newer MacOS versions (tested on 13), create the file /Library/LaunchDaemons/sysctl.plist

You will have to reboot to activate the changes.

On Windows, you change the TCP keepalive settings by adding these registry keys:

As noted above, there is no setting for the number of keepalive probes, which is hard-coded to 10. The registry keys must be of type DWORD, and the values are in milliseconds rather than in seconds.

After changing these keys, restart Windows to activate them.

Conclusion

Configuring TCP keepalive can improve your PostgreSQL experience, either by keeping idle database connections open, or through the timely detection of broken connections. You can do configure keepalive on the PostgreSQL client, the server, or on the operating system.

In addition to configuring keepalive, set the new parameter client_connection_check_interval to cancel long-running queries when the client has abandoned the session.

 

To learn more about terminating database connections, see our blog post here.

If you would like to learn about connections settings, see my post about max_connections.

16 responses to “TCP keepalive for a better PostgreSQL experience”

  1. Hello, if a haproxy exists between the application server and the postgres server, does the tcp keepalive settings still work

  2. MacOS settings are:
    bcd0741786e8 ~ % sysctl -a |grep net.inet.tcp.keep
    net.inet.tcp.keepidle: 7200000
    net.inet.tcp.keepintvl: 75000
    net.inet.tcp.keepinit: 75000
    net.inet.tcp.keepcnt: 8

    version: Ventura 13.2.1 as of Feb 2023

  3. Some firewall default policy is to block keepalive packets. Beware that there isn't any of these between the postgres server and clients. If so, you should allow keepalive packet for postgres port.

  4. client_connection_check_interval shoud be grater or lower than tcp_keepalives_idle (tcp_keepalives_interval*tcp_keepalives_count) ?
    What is recommended value?

    • I don't think it matters that much. The point is only that a query whose client has died doesn't keep running forever.

      • from some reasons (e.g. generating additional network traffic) AWS sets on RDS and Aurora Postgresql tcp_keepalives_idle = 300 seconds, as default
        client_connection_check_interval is not set yet, probably because it's the new parameter (postgresql >=14).

        This i s why I'm wondering how often postgresql server shloud check client connection (5sec or 300sec, 900sec)

        Also could you explain what is the difference between keepalive parameter and client_connection_check_interval in case to detect dead connection?
        Why I should set client_connection_check_interval when I have keepalive on?

        If client abandon query that is still running on the server, and then the postgresql server trigger keepalive then the server should know that connection is dead and finish query and connection?

        • I am not going to explain the keepalive parameters and client_connection_check_interval again. That is described in the article.

          If keepalive closes the socket on a dead connection after 400 seconds, and you set client_connection_check_interval = '100s', you can be certain that any running query is canceled at most 500 seconds after the client went away.

          A running query does not notice immediately that the client is gone, because it does not try to read from or write to the socket. So even if the socket is already known dead, PostgreSQL doesn't realize that immediately.

  5. Hi LAURENZ,
    We have AWS rds postgres and lambda. I observe errors "could not receive data from client: Connection reset by peer"
    occasionally each 5-20 min. I tried to adjust tcp_keepalives_idle, tcp_keepalives_interval, tcp_keepalives_count and client_connection_check_interval parameters on BD side but no luck. The errors aren't gone. The default DB values are
    tcp_keepalives_idle = 300
    tcp_keepalives_interval = 30
    tcp_keepalives_count = 2
    client_connection_check_interval = 0
    On lambda(Psycopg) side
    keepalive_kwargs = {
      'keepalives': 1,
      'keepalives_idle': 30,
      'keepalives_interval': 5,
      'keepalives_count': 5,

    Looks like I'm doing something wrong, please advise.

    • I don't know Amazon's network topology or the network components between you and Amazon, so I cannot tell you what is going on.

  6. may i know what's difference between below two errors?
    "could not receive data from client: Connection reset by peer"
    "could not receive data from client: Connection timed out".

    are both same issue due to keepalive related?

    • No, only the first error is related to keepalive: The peer (the other end of the TCP connection) has terminated the connection. The second error means that nobody responded to a TCP connection request within certain time. This is typical if you are trying to connect to a machine that is shut down, does not exist or is blocked by a firewall.

Leave a Reply

Your email address will not be published. Required fields are marked *

CYBERTEC Logo white
Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram