By Kaarel Moppel
Recently, I was asked for advice on how to reasonably implement a common task: table change tracking - that is, a so-called “audit trail” of all row changes over time. The keyword “compliance” might also ring a bell in this context, at least for those who work in finance or for government institutions. But as always with technology, there are a couple of different approaches with different benefits / drawbacks to choose from; let’s lay them out for the Internet Citizens! There’s a summary of pros and cons down below if you’re in a hurry.
TL;DR: sadly, there are no real standard built-in tools for these types of tasks, but in real life it mostly boils down to creating some “shadow” tables and writing some triggers.
The server parameter log_statement has 4 possible values - none, ddl, mod, all. By default, it’s configured conservatively (with ‘none’), as PostgreSQL parameters generally are, in order to avoid accidentally wasting resources. That way, normal SELECT or DML statements don’t make it to the server log - only warnings and errors do. But when set to ‘mod’ (or ‘all’, which is a superset of ‘mod’), all top-level data-changing SQL statements (INSERT, UPDATE, DELETE, TRUNCATE) will be stored in the server log!
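Enabling it is just a matter of flipping the parameter, either globally or per database / role. A minimal sketch (the database and role names below are purely illustrative):

ALTER SYSTEM SET log_statement = 'mod';  -- cluster-wide
SELECT pg_reload_conf();                 -- no restart needed, log_statement is reloadable

-- or scope it to a single database / role only
ALTER DATABASE appdb SET log_statement = 'mod';
ALTER ROLE app_user SET log_statement = 'mod';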
Make sure to note the “top level” part though - this means that if you have some business logic exercising triggers on your tables and a chain of changes is triggered, you’ll not see any traces of these subsequent changes in the server log! Also, log_statement doesn’t work well if you rely on stored procedures to change data - since they’re usually called via normal SELECT-s, which don’t get flagged as data-changing operations with the ‘mod’ log_statement level. Worse, if you need to do some dynamic SQL within your stored procedures - even the ‘all’ level won’t catch them!
In short - the use cases for the whole approach are somewhat limited to basic CRUD patterns, and log_statement is not necessarily suitable for enterprise requirements.
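To illustrate that blind spot, here is a hypothetical function (the function and table names are made up) that modifies data via dynamic SQL. At ‘mod’ the call is not logged at all, since it arrives as a SELECT; even at ‘all’ only the outer SELECT shows up, never the EXECUTE-d UPDATE:

CREATE OR REPLACE FUNCTION sneaky_update() RETURNS void
LANGUAGE plpgsql AS
$$
BEGIN
  -- the dynamically built statement below never appears in the server log
  EXECUTE 'UPDATE account SET balance = balance + 100 WHERE id = 1';
END;
$$;

SELECT sneaky_update();  -- logged only at log_statement = 'all', and only as this SELECT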
PS - also note that superusers can change the log_statement setting on the fly, thus bypassing the auditing and doing stuff secretly, without any traces left! To remind you - “by design”, it’s never really possible to guard a PostgreSQL system against a malicious superuser; some bypass methods just require a bit more work. In short, be careful to whom you hand out superuser rights. Preferably, do not even allow remote superuser access - but that’s another topic; see here for more info if you’re interested in security.
In short, pgaudit is a 3rd-party PostgreSQL extension that tries to improve on the quite limited default PostgreSQL text-based auditing / logging capabilities. It has been around for ages, so it’s stable enough to use, and there are even packages provided by PGDG repos for Debian / RH-based distros.
Its main drawback is the same as with the previous method, though - everything goes to the same server log as normal status / error messages. There’s no clear separation of “concerns”, so searching will be a bit of work, and for fast “trail” access you’ll probably need to parse the files and store them in some other system, properly indexed. It’s the same story for the generated volume of logs: at default settings (when just enabling everything), it’s far more write-heavy than the log_statement-based approach, so be sure to tune the plentiful parameters to your needs. The project README also warns about that quite nicely: “… be sure to assess the performance impact while testing and allocate plenty of space on the log volume.”
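For reference, a minimal setup sketch - pgaudit has to be preloaded and is then configured via its pgaudit.* parameters; which statement classes you log is of course up to your requirements:

-- append 'pgaudit' to any existing shared_preload_libraries entries; a restart is required
ALTER SYSTEM SET shared_preload_libraries = 'pgaudit';
-- after the restart, in the target database:
CREATE EXTENSION pgaudit;
-- log data-changing and DDL statements only, not reads
ALTER SYSTEM SET pgaudit.log = 'write, ddl';
SELECT pg_reload_conf();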
Custom audit tables and triggers must be the most classic / common approach to auditing, and all those working with RDBMS systems for a longer period have surely seen or implemented something like these features. The main idea - create a “shadow” history tracking / audit table for all relevant tables, where all changes to the main table will be logged, via triggers. However, since triggers are considered black magic by a lot of developers these days, I’ve also seen implementations via application code...but this can’t be recommended, as only in a non-existent perfect world are there no ad-hoc manual data changes.
The setup process here would look something like what you see below for every target table X, where we want to track who changed what rows / columns and when:
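A minimal sketch of such a per-table setup, for a hypothetical invoice table (all table and column names are purely illustrative; on PostgreSQL 10 or below you’d write EXECUTE PROCEDURE instead of EXECUTE FUNCTION):

-- the table being audited
CREATE TABLE invoice (
    invoice_id  int PRIMARY KEY,
    amount      numeric NOT NULL,
    status      text NOT NULL
);

-- the "shadow" history table: audited columns plus audit metadata
CREATE TABLE invoice_audit (
    audit_id    bigserial PRIMARY KEY,
    mtime       timestamptz NOT NULL DEFAULT now(),
    action      char NOT NULL CHECK (action IN ('I', 'U', 'D')),
    username    text NOT NULL,
    invoice_id  int NOT NULL,
    amount      numeric,
    status      text
);

CREATE OR REPLACE FUNCTION invoice_audit_trigger() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO invoice_audit (action, username, invoice_id, amount, status)
        VALUES ('D', session_user, OLD.invoice_id, OLD.amount, OLD.status);
    ELSE  -- INSERT or UPDATE: log the new row version
        INSERT INTO invoice_audit (action, username, invoice_id, amount, status)
        VALUES (left(TG_OP, 1), session_user, NEW.invoice_id, NEW.amount, NEW.status);
    END IF;
    RETURN NULL;
END;
$$;

CREATE TRIGGER log_invoice AFTER INSERT OR UPDATE OR DELETE ON invoice
    FOR EACH ROW EXECUTE FUNCTION invoice_audit_trigger();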
On a high level, this method is very similar to the previous one; the only change being that instead of having custom audit tables / trigger code for all “business” tables, we create a generic trigger function that can be used on all target tables, and that also logs all changes into a single table! By doing that, we’ve minimized the amount of table / code duplication - which could be of real value for big and complex systems - remember, DRY!
And how, you may wonder, would one best implement it? Well, the best way to achieve such generic behaviour is to utilize the superb JSON functions of PostgreSQL, preferably the JSONB data type (available since v9.4), due to its space savings and faster search capabilities. BTW, if you happen to be running some earlier version, you should really think about upgrading, as versions 9.3 and below are not officially supported any more...and soon (February 13, 2020) PostgreSQL 9.4 will stop receiving fixes as well.
Since this approach is relatively unknown to the wider public, a piece of sample code probably wouldn’t hurt - see below. Note, however, that fiddling with JSONB - which is basically a NoSQL type of storage - is not quite as efficient as normal tables / columns, so you’ll pay a small performance and storage premium for this “generic” convenience.
CREATE TABLE generic_log (
    mtime       timestamptz NOT NULL DEFAULT now(),
    action      char NOT NULL CHECK (action IN ('I', 'U', 'D')),
    username    text NOT NULL,
    table_name  text NOT NULL,
    row_data    jsonb NOT NULL
);

CREATE INDEX ON generic_log USING brin (mtime);
-- note: the ->> operator extracts the value as text, since JSONB itself can't be indexed with a B-tree
CREATE INDEX ON generic_log ((row_data->>'my_pk')) WHERE row_data->>'my_pk' IS NOT NULL;
CREATE EXTENSION IF NOT EXISTS btree_gin;
CREATE INDEX ON generic_log USING gin (table_name);  -- GIN is better for lots of repeating values

CREATE OR REPLACE FUNCTION public.generic_log_trigger()
  RETURNS trigger
  LANGUAGE plpgsql
AS $function$
BEGIN
  IF TG_OP = 'DELETE' THEN
    INSERT INTO generic_log VALUES (now(), 'D', session_user, TG_TABLE_NAME, to_json(OLD));
  ELSE
    -- the cast to char truncates 'INSERT' / 'UPDATE' to 'I' / 'U'
    INSERT INTO generic_log VALUES (now(), TG_OP::char, session_user, TG_TABLE_NAME, to_json(NEW));
  END IF;
  RETURN NULL;
END;
$function$;

CREATE TRIGGER log_generic AFTER INSERT OR UPDATE OR DELETE ON some_table
  FOR EACH ROW EXECUTE FUNCTION generic_log_trigger();
PS - note again, that with both of these trigger-based methods, superusers (or table owners) can temporarily disable the triggers and thus bypass our audit trail.
Logical replication, also known as “pub-sub” replication, is a relatively new thing in PostgreSQL (introduced in v10) and was originally not really an auditing feature, but rather a near-zero-downtime upgrade mechanism (read this blog post for more details on that).
It can also be “abused” for auditing or CDC (Change Data Capture) purposes...and actually quite well! The biggest benefit - storage of any extra auditing data can be “outsourced” to an external system, and so-called “write amplification” can be avoided - meaning generally better performance, since extra writing to the disk happens somewhere else.
You need to choose between 2 implementation options, though - the PostgreSQL-native way or the custom application way.
PostgreSQL-native logical replication means that you build up a second, independent “master”-type server structured similarly to the original, re-adjust the schema a bit - dropping PKs / unique constraints, creating some triggers that tweak or throw away uninteresting data or store it in “shadow” tables (just as with the normal trigger-based approaches) - and then configure data streaming with the CREATE PUBLICATION / CREATE SUBSCRIPTION commands.
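The wiring itself is just a couple of commands (the connection details and object names below are only placeholders):

-- on the source / publisher
CREATE PUBLICATION audit_pub FOR ALL TABLES;

-- on the audit host / subscriber, after the (adjusted) schema has been created there
CREATE SUBSCRIPTION audit_sub
    CONNECTION 'host=source-db.example.com dbname=appdb user=replicator password=secret'
    PUBLICATION audit_pub;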
As usual, some constraints still apply - you might need to alter the schema slightly to get going, and large objects (up to 4TB blobs) are not supported. With default settings, you’d also only be getting the primary key and the changed column data, i.e. not the latest “row image”. It’s also generally more hassle to set up and run - an extra node plus monitoring is needed, since the publisher and subscriber will be sort of “physically coupled”, and there are operational risks for the publisher (source server): if the subscriber goes on a “vacation” for too long, the publisher might run out of disk space, as all data changes will be retained as transaction logs (WAL) until they’re fetched (or the slot is dropped). The latter risk also applies to the “custom application” approach, so you should definitely spend a minute in careful consideration before jumping into an implementation.
On a positive note from the security side - the “PostgreSQL native” way can actually be configured in such a way that it’s not even possible for superusers on the source system to disable / bypass the auditing process and change something secretly (e.g. by temporarily leaving some tables out of the replication stream so that the subscriber doesn’t notice)! However, this only works with the standard (for upgrades, at least) FOR ALL TABLES setup.
The “application way” means using some programming language (C, Python, Java, etc.) where the PostgreSQL driver supports logical decoding. You’ll be streaming the changes as they happen and can then inspect or stash away the data in your favourite format, or propagate it into some other database system altogether. See here for a sample PostgreSQL implementation (pg_recvlogical) that can also easily be tested on the command line. To simplify it a bit - you can live-stream JSON change-sets out of PostgreSQL and do whatever you like with the data.
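As a rough sketch of the plumbing involved (the slot name is illustrative), the same thing pg_recvlogical does can also be tried out in plain SQL with the built-in test_decoding output plugin; for actual JSON change-sets you’d swap in an output plugin such as wal2json, mentioned in the comments below:

-- requires wal_level = 'logical' and a free max_replication_slots entry
SELECT * FROM pg_create_logical_replication_slot('audit_slot', 'test_decoding');

-- ... after some data changes have happened ...

-- peek at the decoded change-sets without consuming them
SELECT * FROM pg_logical_slot_peek_changes('audit_slot', NULL, NULL);
-- or consume (and thereby discard) them
SELECT * FROM pg_logical_slot_get_changes('audit_slot', NULL, NULL);

-- don't forget to drop the slot when done, or WAL will pile up
SELECT pg_drop_replication_slot('audit_slot');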
A quick summary of the pros and cons of the approaches discussed:

Approach | Pros | Cons
log_statement=’mod’ | Simplest way for basic requirements - just flip the built-in switch, even during runtime. | Text-based: volume and search can be problematic. Captures only top-level statements issued by users. Does not capture bulk update details. No table-level configuration.
pgaudit extension | Options to configure processing according to operation type and object / role. Also logs dynamically generated statements. | Text-based: volume and search can be problematic. Does not capture bulk update details. A 3rd-party extension.
Explicit audit tables and triggers for all (relevant) tables | Fewer resources burnt than with text-based approaches. Fast search. Easily customizable per table. | Write amplification. Lots of code to manage and some structure duplication introduced.
A single audit table and trigger for all (relevant) tables | Fewer resources burnt than with text-based approaches. Still fast search. Customizable per table. | Write amplification. Audit search queries might need some JSON skills.
Logical replication | Least amount of resources burnt on the source system. Highly customizable at the object level. Can be well secured to guard data integrity. Fast search. | Complex setup. Needs extra hardware / a custom application. Typically requires some schema changes and extra care when the schema evolves.
Hope you got some ideas for your next auditing project with PostgreSQL!
"One generic audit trigger and table for all audited tables"
To speed up this solution, it's possible to use REFERENCING and FOR EACH STATEMENT:
CREATE TRIGGER paired_items_update
AFTER UPDATE ON paired_items
REFERENCING NEW TABLE AS newtab OLD TABLE AS oldtab
FOR EACH STATEMENT
EXECUTE FUNCTION check_matching_pairs();
https://debezium.io/ seems promising for this sort of purpose.
Nice article with a good summary of the different options.
Do you have some example for the "Logical replication – with custom applications" option? I don't see any link here.
@disqus_G8YEnTLjiA:disqus ah true, good catch. The link is here - https://www.postgresql.org/docs/12/app-pgrecvlogical.html
There is also a solution based on decoding WAL files. It looks like a custom logical replication application, but not exactly - because it does not use the stream, but simply the files.
Decoding WAL files does not create the dangerous situation when the subscriber fails (the one you described in this article) - of course, only as long as the WALs are shipped off the master machine. But using replication slots is probably less resource-consuming if you only audit a small part of the table set.
Very nice article - indeed.
I only want to add that when building such a custom logical replication application, the wal2json extension can be very helpful.
@tomaszcwajda:disqus good point about the wal2json 3rd-party extension - it helps to avoid parsing, and it's also whitelisted for use even on some managed offerings like RDS
I've been using an audit91 plus audit trigger with JSONB support.
I also modified it on PG11 with monthly partitioning, and it's stable in production.
https://github.com/cmabastar/audit-trigger
I'd like to highlight Debezium (https://debezium.io). It works flawlessly using logical decoding, and on PG >= 10 it doesn't need an external logical decoding plugin - it uses pgoutput.
It also covers PostGIS types.
Great article. On version 9.6 (EXECUTE FUNCTION is only accepted from PG 11 onwards), I had to change EXECUTE FUNCTION to EXECUTE PROCEDURE, i.e. from
CREATE TRIGGER log_generic AFTER INSERT OR UPDATE OR DELETE ON your_table FOR EACH ROW EXECUTE FUNCTION public.generic_log_trigger();
to
CREATE TRIGGER log_generic AFTER INSERT OR UPDATE OR DELETE ON your_table FOR EACH ROW EXECUTE PROCEDURE public.generic_log_trigger();