CYBERTEC PostgreSQL Logo

Insert-only data modeling with PostgreSQL?

08.2016 / Category: / Tags: |

By Kaarel Moppel - In recent years, there has been quite a lot of buzz about “insert-only” approaches and some database-like products (Datomic, Apache Samza) have emerged, being inspired by the idea of having an immutable data store. In light of cheap storage and powerful hardware, I see the idea definitely having potential for certain use cases. So why not to apply the concept using a PostgreSQL database? There are no inherent limitations there. One should just find a suitable use case and design the data model accordingly.

My own interest was more towards performance of the approach in comparison to the standard “update” way, as in theory according to my understanding “insert-only” approach should provide some performance benefits as Postgres implementation of UPDATE is actually a DELETE+INSERT together with the possible update of all index entries in the worst-case scenario.

Pros and Cons

When doing only INSERT’s instead I would reckon that altogether fewer IO operations would be needed, bloat would be minimized (meaning no AUTOVACUUM hiccups), mostly sequential writing would happen (index pages could still split), a lot less full page images need to be written into XLOG. Plus on the logical level one would get the full history of changes making audit tables unnecessary.

Of course there would also be downsides – more disk space and additional indexing would be needed, queries could easily get unwieldy for joins etc, limited foreign key usage, for really big amounts of data one could need custom batch jobs for cleaning up old versions of rows still at some point, to name a few considerations.

Finding a suitable use case

So after some thinking about a good but simple test case for my theoretical hunch about possible performance benefits, I decided to model my “insert-only” schema to match the short version (using the –skip-some-updates flag) of the default “pgbench” TPC-B scenario. What the short version does, is that it leaves out balance updates on the smaller “pgbench_branches” and “pgbench_tellers” tables (where locking could occur, thus not maxing out IO), meaning that my customized “insert-only” script would become very simple, only doing inserts into the “pgbench_accounts”, leaving out also the insert on “pgbench_history” as we would already have the full history in the accounts table itself anyways.

Now to the course of testing itself. Files for converting the schema to insert-only mode and the custom “pgbench” test script can be found here.

Test results

After looking at the TPS numbers for both runs we see that the performance benefit expressed in TPS was +43% here! And behind the scenes I actually also measured the amount of WAL generated, the difference there was even bigger with +135%.

To sum it up - although a very simplistic example here - “insert-only” data modelling provided a nice TPS boost for our test scenario, while providing exactly the same (and even more, full history) information. So in short it could be a worthwhile investigation when one has to deal with sudden peaks for example, as it should give you more predictable IO when applied correctly. A nice exercise in thought in any case.

You also may be interested in my post about insert-only data modelling to smooth peaks on slow disks. 

4 responses to “Insert-only data modeling with PostgreSQL?”

Leave a Reply

Your email address will not be published. Required fields are marked *

CYBERTEC Logo white
Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram