CYBERTEC PostgreSQL Logo

Reproducible builds: a PostgreSQL query optimization example

07.2023 / Category: , / Tags: | |

This post shows how to optimize a slow query that came out of the Reproducible Builds project. The Reproducible Builds initiative aims to make software compilation entirely deterministic, with no
variation in the output when the build is run again. This makes software supply chain attacks much harder, but has also advantages for software quality assurance. While initially started by Debian, reproducible builds are now part of many other Linux distributions and software projects like Alpine, Apache Maven, Arch, Buildroot, coreboot, Fedora, F-Droid, FreeBSD, GNU Guix, NetBSD, NixOS, openSUSE, OpenWrt, Qubes, Tails, Tor Browser, and many others.

Reproducible Builds logo

Background - Reproducible Builds and queries

To check if some given software package is reproducible, rebuilders run that build with some environment variations, and compare the build results. For example, the time on one of the build machines runs ahead, to check if some "built on this date" stamp is embedded in the build result. Naturally, this machinery includes a PostgreSQL database which stores the "is package X reproducible?" information. Rebuilds are triggered periodically, and again, this involves SQL queries.

The problem query

Some time ago, I was asked by a fellow Debian Developer to look at a query which performed particularly badly: determining which packages to test next took around 45 minutes.

The query had already been found:

I loaded up the provided database dump and looked at the execution plan:

The 9-digit estimated plan cost and subsequently the 45min query run time come from the O(N²) behavior of the two SubPlans in there.

Query analysis

The problem is so common that it got a place on PostgreSQL's famous Don't Do This list:

Don't use NOT IN

.

"...performance can look good in small-scale tests but then slow down by 5 or more orders of magnitude once a size threshold is crossed; you do not want this to happen."

How to fix the query

As Laurenz explained in his subqueries in PostgreSQL blog post, the optimizer cannot rewrite NOT IN (subquery) to something else, since that would change the query semantics. But since we know we do not care about NULLs here, we can rewrite the query manually to use NOT EXISTS (subquery):

100ms looks much better. Compared to the original 45min runtime, that's a 27,000x speedup.

The new runtime: 100 ms... and it has an effect

In practice, the job running this query now runs much faster:

Reproducible build job runtimes


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.

Leave a Reply

Your email address will not be published. Required fields are marked *

CYBERTEC Logo white
Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram