The power of open source in PostgreSQL

Superhero Slonik - the power of open source in PostgreSQL
© Laurenz Albe 2024

Table of Contents

During a training session on tuning autovacuum, a participant asked me,
“Why is autovacuum_vacuum_scale_factor set to 0.2?”
Rather than replying with a guess or a wisecrack (“Because that's the way somebody implemented it!” or “If you don't like it, submit a patch!”), I decided to use the opportunity to demonstrate the true power of open source. That was such a nice experience that I decided to share it with a wider audience.

The dreadful “why” question

Everybody who has children has learned to dread “why” questions. Very often, they are hard or impossible to answer, and after a while you learn that seeing an adult struggle for words is one of the reasons for children to ask them. Over the years, I have come to the conclusion that when children ask “why”, they usually mean something else; namely:

Tell me more about this! It is interesting, and I don't want to go to bed yet!
This is really annoying, and it should be different!

As a mathematician and computer engineer, I tend to take questions too literally. Knowing the above interpretation has helped me come up with more appropriate answers. Adults are not that much different from children. So when people start asking “why” instead of “how”, I try to figure out what they really mean. In the case I related in the introduction, the “why” was rather of the interested kind, which is why I decided to pursue the question.

The true power of open source

I hadn't spent a lot of thought on the power of open source before I heard Simon Phipps speak on the European PGDay 2010. In this section, he is the giant on whose shoulders I am standing!

Is cheapness the main reason for using open source software?

Many people think that the main advantage of open source software is that you don't have to pay a license fee. This can be a substantial advantage. However, the total cost of ownership is not zero when you run open source software:

you need to integrate the software into your landscape; at least set up monitoring and backup
you have to train people or buy external knowledge to configure, operate and maintain the software
you have to get support in case you run into problems you cannot solve yourself

The true benefits of open source software

The bigger benefits of open source software are not that you get it for free. Here are a few of the more salient advantages:

You don't have to open a service request to figure out what an error message means. You can search for it in the source code, and even if you are not a programmer, you'll usually understand the cause of the error better.
You can have as many test and development installations as you want without having to worry about compliance with the license you bought. What good is a Kubernetes cluster that allows you to create a database instance at the push of a button, if you need to perform a license compliance check first?
For support, you are not at the mercy of a single vendor. Others can have access to the source code and provide third-level support to you. This healthy competition is good for the customer.
If the vendor abandons the software, you are not left high and dry (think of Oracle Forms). If the software is sufficiently popular, someone else will probably pick it up and maintain it. At the very least, you can pay developers to keep the software alive.
The software quality is often better. If more eyes are on the software, they are more likely to find bugs and security vulnerabilities. This also leads to raised awareness on the side of the developers: I'm much more likely to build a quick and dirty hack into closed source software. But I'd think twice before submitting such a patch to PostgreSQL, where it would be driven out of town and stoned in public.
For knowledge about the internal workings of the software, you are not at the mercy of consultants who talked to someone at a conference who happens to know somebody who has inside information. Closed source software is often surrounded by a surprising amount of myths and rumors, and not all of them are true.

I decided to demonstrate the last point when I set out to investigate the motivation behind the default value of autovacuum_vacuum_scale_factor with my pupils.

Getting the source code

Since I work with the PostgreSQL source code a lot, I already had it on my machine. But it is not hard to get the source code. Appendix I of the PostgreSQL documentation points you to the PostgreSQL Wiki. There we learn how to clone the PostgreSQL source code repository. All you need is to install Git. Then you can run

git clone https://git.postgresql.org/git/postgresql.git

1	git clone https://git.postgresql.org/git/postgresql.git

This creates a directory postgresql that contains the complete history of the PostgreSQL source code since 1996. Change into the directory, and you are ready to go.

Searching the commit messages

Once we are in the directory that contains the PostgreSQL source repository, we can look at the commit messages. That is often enough to get an answer!

git log

git log

Since this is a very long list, Git invokes the pager. If you have a good pager, you can search the commit messages, so I searched for “autovacuum_vacuum_scale_factor”. I only found a single commit from 2020 that fixed something in the documentation. So this seems to be a dead end.

However, a commit message often is all you need to answer your question. Naturally, the usefulness of this approach depends on the quality of the commit messages. Nowadays, PostgreSQL's commit messages are usually excellent: they describe the problem solved and the approach taken, and they point you to the relevant discussion (more about that later). If you go further back in time, you may find less useful commit messages like “Row count patch from Bruce”. But those “fast and loose” days are long gone. Anyway, perhaps you may want to stop and think the next time you are about to write a commit message like “Fix a couple of bugs”. After all, a few years down the line, you may be grateful for a useful description of the mess you hacked up back then.

Spotting the right place in the open source code

Since the commit log didn't turn up anything useful, we have to dig deeper. The first thing we need to do is to find the place in the source code that defines the default value. If you are not familiar with the PostgreSQL source, that can be complicated. But looking for the name of a parameter is not so hard:

git grep autovacuum_vacuum_scale_factor

1	git grep autovacuum_vacuum_scale_factor

This gives us a fairly short list:

doc/src/sgml/config.sgml:     
doc/src/sgml/config.sgml:      autovacuum_vacuum_scale_factor (floating point)
doc/src/sgml/config.sgml:       autovacuum_vacuum_scale_factor
doc/src/sgml/ref/create_table.sgml:   
doc/src/sgml/ref/create_table.sgml:    autovacuum_vacuum_scale_factor, toast.autovacuum_vacuum_scale_factor
doc/src/sgml/ref/create_table.sgml:     autovacuum_vacuum_scale_factor 
src/backend/access/common/reloptions.c:         "autovacuum_vacuum_scale_factor",
src/backend/access/common/reloptions.c:     {"autovacuum_vacuum_scale_factor", RELOPT_TYPE_REAL,
src/backend/postmaster/autovacuum.c: * autovacuum_vacuum_scale_factor GUC variable.  Ditto for analyze.
src/backend/utils/misc/guc_tables.c:        {"autovacuum_vacuum_scale_factor", PGC_SIGHUP, AUTOVACUUM,
src/backend/utils/misc/postgresql.conf.sample:#autovacuum_vacuum_scale_factor = 0.2 # fraction of table size before vacuum
src/bin/psql/tab-complete.c:    "autovacuum_vacuum_scale_factor",
src/bin/psql/tab-complete.c:    "toast.autovacuum_vacuum_scale_factor",

doc/src/sgml/config.sgml:

doc/src/sgml/config.sgml: autovacuum_vacuum_scale_factor (floating point)

doc/src/sgml/config.sgml: autovacuum_vacuum_scale_factor

doc/src/sgml/ref/create_table.sgml:

doc/src/sgml/ref/create_table.sgml: autovacuum_vacuum_scale_factor, toast.autovacuum_vacuum_scale_factor

doc/src/sgml/ref/create_table.sgml: autovacuum_vacuum_scale_factor

src/backend/access/common/reloptions.c: "autovacuum_vacuum_scale_factor",

src/backend/access/common/reloptions.c: {"autovacuum_vacuum_scale_factor", RELOPT_TYPE_REAL,

src/backend/postmaster/autovacuum.c: * autovacuum_vacuum_scale_factor GUC variable. Ditto for analyze.

src/backend/utils/misc/guc_tables.c: {"autovacuum_vacuum_scale_factor", PGC_SIGHUP, AUTOVACUUM,

src/backend/utils/misc/postgresql.conf.sample:#autovacuum_vacuum_scale_factor = 0.2 # fraction of table size before vacuum

src/bin/psql/tab-complete.c: "autovacuum_vacuum_scale_factor",

src/bin/psql/tab-complete.c: "toast.autovacuum_vacuum_scale_factor",

The “sgml” files are the documentation, which is part of the source code in PostgreSQL. It won't take long to look at the remaining occurrences. If you know what “GUC” stands for, you need not search. If not, you can look up that piece of PostgreSQL jargon in the acronym list in appendix L of the PostgreSQL documentation.

The relevant piece of code in guc_tables.c is the following:

struct config_real ConfigureNamesReal[] =
{
    [...]
    {
        {"autovacuum_vacuum_scale_factor", PGC_SIGHUP, AUTOVACUUM,
            gettext_noop("Number of tuple updates or deletes prior to vacuum as a fraction of reltuples."),
            NULL
        },
        &autovacuum_vac_scale,
        0.2, 0.0, 100.0,
        NULL, NULL, NULL
    },
    [...]

struct config_real ConfigureNamesReal[] =

{

[...]

{

{"autovacuum_vacuum_scale_factor", PGC_SIGHUP, AUTOVACUUM,

gettext_noop("Number of tuple updates or deletes prior to vacuum as a fraction of reltuples."),

NULL

&autovacuum_vac_scale,

0.2, 0.0, 100.0,

NULL, NULL, NULL

[...]

Now we know where in the source code we can find the default value of 0.2 for autovacuum_vacuum_scale_factor. But how did it get there?

Traveling back in time in the open source repository

Back to 2022

We can use “git blame” to see where individual source lines originate:

git blame src/backend/utils/misc/guc_tables.c

1	git blame src/backend/utils/misc/guc_tables.c

That shows us:

0a20ff54f5e6 (Tom Lane           2022-09-13 11:05:07 -0400 3938)        {"autovacuum_vacuum_scale_factor", PGC_SIGHUP, AUTOVACUUM,
0a20ff54f5e6 (Tom Lane           2022-09-13 11:05:07 -0400 3939)            gettext_noop("Number of tuple updates or deletes prior to vacuum as a fraction of reltuples."),
0a20ff54f5e6 (Tom Lane           2022-09-13 11:05:07 -0400 3940)            NULL
0a20ff54f5e6 (Tom Lane           2022-09-13 11:05:07 -0400 3941)        },
0a20ff54f5e6 (Tom Lane           2022-09-13 11:05:07 -0400 3942)        &autovacuum_vac_scale,
0a20ff54f5e6 (Tom Lane           2022-09-13 11:05:07 -0400 3943)        0.2, 0.0, 100.0,
0a20ff54f5e6 (Tom Lane           2022-09-13 11:05:07 -0400 3944)        NULL, NULL, NULL

0a20ff54f5e6 (Tom Lane 2022-09-13 11:05:07 -0400 3938) {"autovacuum_vacuum_scale_factor", PGC_SIGHUP, AUTOVACUUM,

0a20ff54f5e6 (Tom Lane 2022-09-13 11:05:07 -0400 3939) gettext_noop("Number of tuple updates or deletes prior to vacuum as a fraction of reltuples."),

0a20ff54f5e6 (Tom Lane 2022-09-13 11:05:07 -0400 3940) NULL

0a20ff54f5e6 (Tom Lane 2022-09-13 11:05:07 -0400 3941) },

0a20ff54f5e6 (Tom Lane 2022-09-13 11:05:07 -0400 3942) &autovacuum_vac_scale,

0a20ff54f5e6 (Tom Lane 2022-09-13 11:05:07 -0400 3943) 0.2, 0.0, 100.0,

0a20ff54f5e6 (Tom Lane 2022-09-13 11:05:07 -0400 3944) NULL, NULL, NULL

What was commit 0a20ff54f5e6? We could look it up on the Git web interface, but on the command line, we simply use

git show 0a20ff54f5e6

1	git show 0a20ff54f5e6

We see the following:

commit 0a20ff54f5e66158930d5328f89f087d4e9ab400
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Tue Sep 13 11:05:07 2022 -0400

    Split up guc.c for better build speed and ease of maintenance.
    
    guc.c has grown to be one of our largest .c files, making it
    a bottleneck for compilation.  It's also acquired a bunch of
    knowledge that'd be better kept elsewhere, because of our not
    very good habit of putting variable-specific check hooks here.
    Hence, split it up along these lines:
    
    * guc.c itself retains just the core GUC housekeeping mechanisms.
    * New file guc_funcs.c contains the SET/SHOW interfaces and some
      SQL-accessible functions for GUC manipulation.
    * New file guc_tables.c contains the data arrays that define the
      built-in GUC variables, along with some already-exported constant
      tables.
[...]

commit 0a20ff54f5e66158930d5328f89f087d4e9ab400

Author: Tom Lane <tgl@sss.pgh.pa.us>

Date: Tue Sep 13 11:05:07 2022 -0400

Split up guc.c for better build speed and ease of maintenance.

guc.c has grown to be one of our largest .c files, making it

a bottleneck for compilation. It's also acquired a bunch of

knowledge that'd be better kept elsewhere, because of our not

very good habit of putting variable-specific check hooks here.

Hence, split it up along these lines:

* guc.c itself retains just the core GUC housekeeping mechanisms.

* New file guc_funcs.c contains the SET/SHOW interfaces and some

SQL-accessible functions for GUC manipulation.

* New file guc_tables.c contains the data arrays that define the

built-in GUC variables, along with some already-exported constant

tables.

[...]

So that was just a commit that refactored the code; prior to that commit, the code was in guc.c. To proceed, we travel back in time to the commit before that:

git checkout 0a20ff54f5e6~1

1	git checkout 0a20ff54f5e6~1

Back to 2011

Let's use git blame again:

git blame src/backend/utils/misc/guc.c

1	git blame src/backend/utils/misc/guc.c

This time we see

29094193f526 (Tom Lane           2005-07-14 05:13:45 +0000  3885)       {"autovacuum_vacuum_scale_factor", PGC_SIGHUP, AUTOVACUUM,
29094193f526 (Tom Lane           2005-07-14 05:13:45 +0000  3886)           gettext_noop("Number of tuple updates or deletes prior to vacuum as a fraction of reltuples."),
29094193f526 (Tom Lane           2005-07-14 05:13:45 +0000  3887)           NULL
29094193f526 (Tom Lane           2005-07-14 05:13:45 +0000  3888)       },
29094193f526 (Tom Lane           2005-07-14 05:13:45 +0000  3889)       &autovacuum_vac_scale,
2594cf0e8c04 (Tom Lane           2011-04-07 00:11:01 -0400  3890)       0.2, 0.0, 100.0,
2594cf0e8c04 (Tom Lane           2011-04-07 00:11:01 -0400  3891)       NULL, NULL, NULL

29094193f526 (Tom Lane 2005-07-14 05:13:45 +0000 3885) {"autovacuum_vacuum_scale_factor", PGC_SIGHUP, AUTOVACUUM,

29094193f526 (Tom Lane 2005-07-14 05:13:45 +0000 3886) gettext_noop("Number of tuple updates or deletes prior to vacuum as a fraction of reltuples."),

29094193f526 (Tom Lane 2005-07-14 05:13:45 +0000 3887) NULL

29094193f526 (Tom Lane 2005-07-14 05:13:45 +0000 3888) },

29094193f526 (Tom Lane 2005-07-14 05:13:45 +0000 3889) &autovacuum_vac_scale,

2594cf0e8c04 (Tom Lane 2011-04-07 00:11:01 -0400 3890) 0.2, 0.0, 100.0,

2594cf0e8c04 (Tom Lane 2011-04-07 00:11:01 -0400 3891) NULL, NULL, NULL

Rinse and repeat: let's look at commit 2594cf0e8c04

git show 2594cf0e8c04

1	git show 2594cf0e8c04

This is another commit that changed the API, not the values:

commit 2594cf0e8c04406ffff19b1651c5a406d376657c
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Thu Apr 7 00:11:01 2011 -0400

    Revise the API for GUC variable assign hooks.

[...]
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2151fde3618..5e4904aeb7f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
[...]
@@ -2297,7 +2479,8 @@ static struct config_real ConfigureNamesReal[] =
            NULL
        },
        &autovacuum_vac_scale,
-       0.2, 0.0, 100.0, NULL, NULL
+       0.2, 0.0, 100.0,
+       NULL, NULL, NULL
    },
    {
        {"autovacuum_analyze_scale_factor", PGC_SIGHUP, AUTOVACUUM,

commit 2594cf0e8c04406ffff19b1651c5a406d376657c

Author: Tom Lane <tgl@sss.pgh.pa.us>

Date: Thu Apr 7 00:11:01 2011 -0400

Revise the API for GUC variable assign hooks.

[...]

diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c

index 2151fde3618..5e4904aeb7f 100644

--- a/src/backend/utils/misc/guc.c

+++ b/src/backend/utils/misc/guc.c

[...]

@@ -2297,7 +2479,8 @@ static struct config_real ConfigureNamesReal[] =

NULL

&autovacuum_vac_scale,

- 0.2, 0.0, 100.0, NULL, NULL

+ 0.2, 0.0, 100.0,

+ NULL, NULL, NULL

{

{"autovacuum_analyze_scale_factor", PGC_SIGHUP, AUTOVACUUM,

Again, we travel back in time to the commit prior to that:

git checkout 2594cf0e8c04~1

1	git checkout 2594cf0e8c04~1

Finding the open source code modification in 2006

Using git blame src/backend/utils/misc/guc.c again, we wee

29094193f526 (Tom Lane           2005-07-14 05:13:45 +0000 2295)        {"autovacuum_vacuum_scale_factor", PGC_SIGHUP, AUTOVACUUM,
29094193f526 (Tom Lane           2005-07-14 05:13:45 +0000 2296)            gettext_noop("Number of tuple updates or deletes prior to vacuum as a fraction of reltuples."),
29094193f526 (Tom Lane           2005-07-14 05:13:45 +0000 2297)            NULL
29094193f526 (Tom Lane           2005-07-14 05:13:45 +0000 2298)        },
29094193f526 (Tom Lane           2005-07-14 05:13:45 +0000 2299)        &autovacuum_vac_scale,
e0938c3f5b03 (Bruce Momjian      2006-09-02 23:12:16 +0000 2300)        0.2, 0.0, 100.0, NULL, NULL

29094193f526 (Tom Lane 2005-07-14 05:13:45 +0000 2295) {"autovacuum_vacuum_scale_factor", PGC_SIGHUP, AUTOVACUUM,

29094193f526 (Tom Lane 2005-07-14 05:13:45 +0000 2296) gettext_noop("Number of tuple updates or deletes prior to vacuum as a fraction of reltuples."),

29094193f526 (Tom Lane 2005-07-14 05:13:45 +0000 2297) NULL

29094193f526 (Tom Lane 2005-07-14 05:13:45 +0000 2298) },

29094193f526 (Tom Lane 2005-07-14 05:13:45 +0000 2299) &autovacuum_vac_scale,

e0938c3f5b03 (Bruce Momjian 2006-09-02 23:12:16 +0000 2300) 0.2, 0.0, 100.0, NULL, NULL

This time, we have hit gold; commit e0938c3f5b03 is the one we were looking for:

commit e0938c3f5b03e48bca32fe903f057e5777d43df8
Author: Bruce Momjian <bruce@momjian.us>
Date:   Sat Sep 2 23:12:16 2006 +0000

    Make autovacuum behavior more agressive, per discussion on hackers list
    --- was part of autovacuum default 'on' patch that was reverted, but we
    want this part.
    
    Peter Eisentraut

[...]
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fb12bd7adfc..f985c9da423 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
[...]
@@ -1738,7 +1738,7 @@ static struct config_real ConfigureNamesReal[] =
            NULL
        },
        &autovacuum_vac_scale,
-       0.4, 0.0, 100.0, NULL, NULL
+       0.2, 0.0, 100.0, NULL, NULL
    },
    {
        {"autovacuum_analyze_scale_factor", PGC_SIGHUP, AUTOVACUUM,

commit e0938c3f5b03e48bca32fe903f057e5777d43df8

Author: Bruce Momjian <bruce@momjian.us>

Date: Sat Sep 2 23:12:16 2006 +0000

Make autovacuum behavior more agressive, per discussion on hackers list

--- was part of autovacuum default 'on' patch that was reverted, but we

want this part.

Peter Eisentraut

[...]

diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c

index fb12bd7adfc..f985c9da423 100644

--- a/src/backend/utils/misc/guc.c

+++ b/src/backend/utils/misc/guc.c

[...]

@@ -1738,7 +1738,7 @@ static struct config_real ConfigureNamesReal[] =

NULL

&autovacuum_vac_scale,

- 0.4, 0.0, 100.0, NULL, NULL

+ 0.2, 0.0, 100.0, NULL, NULL

{

{"autovacuum_analyze_scale_factor", PGC_SIGHUP, AUTOVACUUM,

So in 2006, autovacuum_vacuum_scale_factor got reduced from 0.4 to 0.2.

Finding the mailing list discussion

In order to understand the motivation behind the change, it is a good idea to read the mailing list discussion. The PostgreSQL project archives the mailing lists, so we should be able to find the discussion. Note that finding the discussion has become easier today, because it has become a habit to reference the relevant mailing list discussion in the commit message.

I used the advanced archive search for the best results. I narrowed the search down to the “pgsql-hackers” mailing list, searched for “autovacuum aggressive”, set “Date: anytime” and “Sort by: Reverse date”. This produced plenty of results, but on the second page I already found this thread from August 2006, which must be the correct one.

To sum up the arguments: Rod Taylor argued that the scale factor should become more aggressive and suggested a value of 0.1. Matthew T. O'Connor and Jim C. Nasby supported that, and Jim suggested a value of 0.2. ITAGAKI Takahiro suggested a value below 0.1 to reduce index page splits. Peter Eisentraut proposed a value of 0.08, which seemed too big a jump to Matthew T. O'Connor. Peter suggested 0.2 as a compromise. Josh Berkus considered that too aggressive and wanted to stay on 0.4. Finally Bruce Momjian settled the dispute by simply committing Peter's proposal.

To give you some historical context: Before version 8.1 (released in 2005), autovacuum was a “contrib module” (there were no extensions back then). Version 8.3 from 2008 enabled autovacuum by default. So these were the times when autovacuum was still somewhat experimental, and many people still used cron jobs to schedule regular VACUUM runs. These were the bad old days indeed.

Conclusion

In PostgreSQL, both the source code and the discussions that lead the developers to a certain decision are there for everyone to see. I demonstrated this by uncovering the reasoning behind the current default value for autovacuum_vacuum_scale_factor. With open source software like PostgreSQL, you are not reduced to being a user. If you feel that 18 years of experience and the changing requirements are enough to suggest a change, join the hackers mailing list and make your point. Welcome to the community!

The power of open source in PostgreSQL

The dreadful “why” question

The true power of open source

Is cheapness the main reason for using open source software?

The true benefits of open source software

Getting the source code

Searching the commit messages

Spotting the right place in the open source code

Traveling back in time in the open source repository

Back to 2022

Back to 2011

Finding the open source code modification in 2006

Finding the mailing list discussion

Conclusion

Leave a Reply Cancel reply

Laurenz Albe

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts