
A quick pg_stat_statements troubleshooting hack

08.2020

By Kaarel Moppel

How can I use pg_stat_statements for troubleshooting? First, an introductory story (with some slight “drama” added for fun) from those good old days of on-site consulting 🙂 So... I’m at a client where, among other things, the database is not behaving nicely... what a crappy DB product indeed, I hear - it gets cranky every morning although there should be a constant workload, saturating the CPU and making some important queries time out!

Hmm, very interesting, I think - Postgres usually doesn’t tend to care too much about mornings or time in general... Okay, well, let’s sit down and look at the metrics, I say... oh, surprise-surprise - we discover that there’s no monitoring framework in place! Sounds very bad and rare, right? Well, it’s actually quite common, as many people rely only on the server log files... which I personally wouldn’t recommend. But it was not too tragic in this case - luckily their DBA had at least read somewhere that it’s good to install the pg_stat_statements extension when taking an instance into use - which is of course absolutely true and a highly recommended thing to do!

Giving hope to DBAs

So I say - well, let’s just quickly install some monitoring tool, have a coffee and talk about other issues while the metrics that will help us get to the bottom of this are being gathered. And then I hear that distinct audible “sigh”, and they look at me with despair in their eyes - you know, we just can’t install software like that... first, there’s no Internet access... and second, it needs to go through managerial and security acceptance processes X, Y and Z, which would take a week at best.

Yoda and direct console access

OK, that doesn’t make life easier, for sure... the poor DBA is also already feeling bad that their organisation is not exactly a candidate for the “Agile company of the year” award... But then I ask... you know, we do still have direct console access to the database, right? Yes, we do! And then I can finally use my wise Yoda quote: well, great - this is pretty much the same as a monitoring tool, it’s all about the data anyway! The DBA starts to look at me in a very suspicious and puzzled way... what the hell am I talking about? How can a plain text-based query interface be nearly as useful as some modern graphical monitoring tool for troubleshooting?

Ad-hoc setup of pg_stat_statements monitoring

So let’s shed some light on what I mean. A disclaimer - it’s not the most “humane” method of course, but it will definitely gather all the necessary information and thus do the trick. For most users who have the possibility, though, I’d recommend installing some real monitoring tool - like pgwatch2 or pghero, which are both dead simple to use and have a graphical interface for pinpointing problems.

Step 1 - prerequisites

Make sure the pg_stat_statements extension is loaded via the shared_preload_libraries parameter and that the extension is created in some database - it doesn’t even have to be the problematic one, as the extension works on the cluster level. For new installs a restart of the PostgreSQL server is needed though, which might be problematic in some cases - so again: it’s a really highly recommended step when performing the initial server setup.
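On a typical setup that could look roughly like this (just a sketch - the parameter can of course also be set directly in postgresql.conf):

-- careful: this overwrites any previously configured preload libraries,
-- and it only takes effect after a server restart
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';

-- after the restart, in any convenient database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;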

Step 2 - “snapshot” table setup

Log into the database where pg_stat_statements is available and create a table to hold the periodic snapshots of the tracking information the extension is providing. And this is the crux of the “trick” actually - we want to regularly store snapshots of the detailed query execution statistics so that we can later calculate the so-called “deltas” for each query and see who was burning the most resources within a certain timespan! This is exactly what most “continuous monitoring” tools are doing in a nicer coat - just storing snapshots of the internal PostgreSQL performance counters for various objects or queries and then calculating deltas or anomalies based on that data and presenting it in some nice and digestible, usually graphical, format.

A DDL statement to create a table for such “snapshot” storing purposes would look something like this:
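Something along these lines, for example (the table name is just illustrative - the column list of the pg_stat_statements view is simply copied over and a snapshot timestamp added):

CREATE TABLE stat_statements_snapshots AS
  SELECT now() AS created_on, *
  FROM pg_stat_statements
  WHERE false;  -- copies the column definitions only, no rows yet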

Step 3 - metrics gathering

Set up periodic gathering of pg_stat_statements data into our “snapshot” table. In the simplest form, on the “psql” console, this can be done with the built-in “\watch” command. It re-executes the previous query at a regular interval - by default every 2 seconds... which is mostly too frequent for general pg_stat_statements monitoring (but might again be too infrequent for pg_locks or pg_stat_activity monitoring). In short it would look something like this:
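For example, assuming the stat_statements_snapshots table from the previous step and a 60-second interval:

-- run the snapshot INSERT once ...
INSERT INTO stat_statements_snapshots
  SELECT now(), *
  FROM pg_stat_statements;

-- ... and then let psql re-execute it every 60 seconds
\watch 60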

Remember - “psql” + “\watch” is just one option; a simple Bash loop or a Cron job with the same SQL would be just as good, and maybe even better if you plan to keep it running for a day or so, as network interruptions or restarts might terminate your “psql” session.

Analyzing the data

Now that the data collection has been running for a while and our target time range is covered, we obviously need to analyze the data somehow. And there are quite a few ways to do that - there are many resource usage columns to choose from, depending on the detected problem type, and many SQL query techniques could be used. To give you a general idea, below are some simpler examples of standard questions and the resulting queries for such ad-hoc troubleshooting sessions... but as already said, there are many different paths here, and the queries provided are by far not the only ones.

The burning question

Which top 3 queries were burning the most CPU time between times t1 and t2 and approximately how much of the total Postgres runtime did that represent? Let's see:
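A possible query sketch, assuming the example stat_statements_snapshots table from above and placeholder timestamps standing in for t1 / t2. On PostgreSQL 13+ the column is called total_exec_time instead of total_time, and the per-user / per-database split of pg_stat_statements rows is simply summed up here:

WITH snaps AS (
  SELECT created_on, queryid,
         sum(total_time) AS total_time,   -- collapse per-user / per-db rows
         sum(calls)      AS calls
  FROM stat_statements_snapshots
  WHERE created_on BETWEEN '2020-08-20 08:00' AND '2020-08-20 09:00'   -- t1 .. t2
  GROUP BY created_on, queryid
), deltas AS (
  SELECT queryid,
         max(total_time) - min(total_time) AS total_time_delta,  -- counters only grow
         max(calls)      - min(calls)      AS calls_delta
  FROM snaps
  GROUP BY queryid
)
SELECT d.queryid,
       d.calls_delta,
       round(d.total_time_delta::numeric, 1) AS total_ms,
       round((100 * d.total_time_delta / nullif(sum(d.total_time_delta) OVER (), 0))::numeric, 1) AS approx_pct_of_total,
       (SELECT left(max(s.query), 60)
          FROM stat_statements_snapshots s
         WHERE s.queryid = d.queryid) AS query_start
FROM deltas d
ORDER BY d.total_time_delta DESC
LIMIT 3;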

What queries had the biggest (absolute) execution time change compared to the hour before?
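One way to sketch that is to compare the per-query total_time delta of the current hour bucket with the previous one (hour buckets may of course be partial, and again total_exec_time on PostgreSQL 13+):

WITH hourly AS (
  SELECT queryid,
         date_trunc('hour', created_on) AS ts_hour,
         max(total_time) - min(total_time) AS total_ms
  FROM stat_statements_snapshots
  WHERE created_on >= now() - interval '2 hours'
  GROUP BY queryid, date_trunc('hour', created_on)
), changes AS (
  SELECT queryid, ts_hour, total_ms,
         total_ms - lag(total_ms) OVER (PARTITION BY queryid ORDER BY ts_hour) AS change_ms
  FROM hourly
)
SELECT queryid,
       ts_hour,
       round(total_ms::numeric, 1)  AS total_ms,
       round(change_ms::numeric, 1) AS change_ms
FROM changes
WHERE change_ms IS NOT NULL
ORDER BY abs(change_ms) DESC
LIMIT 10;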

Another warning on the above queries

The results can only be considered useful if there was no Postmaster crash or user-invoked statistics reset in between. If that was the case, your best bet is almost always to go all in on the awesome Window Functions!
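One possible reset-tolerant sketch: calculate per-snapshot deltas with lag() and, whenever a counter goes backwards (crash or stats reset), fall back to the raw counter value:

SELECT created_on,
       queryid,
       CASE
         WHEN total_time >= lag(total_time) OVER w
           THEN total_time - lag(total_time) OVER w
         ELSE total_time   -- counter went backwards, i.e. stats were reset
       END AS total_ms_delta
FROM stat_statements_snapshots
WINDOW w AS (PARTITION BY queryid ORDER BY created_on);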

Improving resource usage

In some cases, as in the above described fully utilized CPU case, one doesn’t really want to add any unnecessary load on the database server with monitoring activities (especially if you use shorter intervals), as it might actually be the last straw that breaks the camel's back... so how do you minimize such risks a bit?

First performance hint

We could (and mostly should) use the UNLOGGED table modifier for our snapshot gathering table, to bypass the transaction log completely for our frequently inserted data - thus reducing disk flushing and network load (remember - WAL is sent also to the replicas). This modifier of course has the downside that the table is not effectively crash safe and we would lose our data in case of a PostgreSQL crash or a power outage! Nevertheless, this is especially recommended when the whole monitoring operation is planned to be relatively short-lived, as it mostly should, or if the strange “peaks” are appearing very frequently and we could just re-start our capturing any time and still get usable “leads”. I actually use “UNLOGGED” almost always for such ad-hoc troubleshooting - better to be safe than sorry and not to strain the system under monitoring.
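With the example table from step 2, the only change is the UNLOGGED keyword:

CREATE UNLOGGED TABLE stat_statements_snapshots AS
  SELECT now() AS created_on, *
  FROM pg_stat_statements
  WHERE false;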

Second performance hint

We could leave out the longest info (byte-wise) from the pg_stat_statements view snapshots - the actual query texts! This is OK to do because the internal query IDs are static and good enough to look up the real query texts later with a JOIN, once we have identified some suspicious query IDs.
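A sketch of that idea (the table name and the column selection are just examples):

CREATE UNLOGGED TABLE stmt_snaps_slim AS
  SELECT now() AS created_on, queryid, calls, total_time, shared_blks_read
  FROM pg_stat_statements
  WHERE false;

-- later, once some suspicious query IDs have been identified:
SELECT queryid, left(query, 100) AS query_start
FROM pg_stat_statements
WHERE queryid IN (1234567890, 987654321);  -- placeholder IDs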

Third performance hint

We don’t actually need to store the pg_stat_statements statistics on each interval for all queries... as probably only a subset of them are “active” in the monitored time range. So basically we could add some WHERE filter condition that looks up the very first entry for a specific query and dismisses it if the execution counter has not progressed for the current timestamp - in my experience this usually reduces the data volume by at least a factor of two. A hint though - for fast lookups on that first entry, you might want to also throw some index on the “snapshot” table, so that we again burn a bit less CPU time.
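As a rough sketch (again using the example snapshot table; the very first run stores everything as a baseline, and the per-user / per-database split is ignored for simplicity):

CREATE INDEX ON stat_statements_snapshots (queryid, calls);  -- keeps the per-query lookup cheap

INSERT INTO stat_statements_snapshots
SELECT now(), s.*
FROM pg_stat_statements s
WHERE s.calls > coalesce(
        (SELECT min(o.calls)                  -- the first recorded counter value
           FROM stat_statements_snapshots o
          WHERE o.queryid = s.queryid),
        0);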

Another hint

By the way, exactly the same approach can be used for all the other internal statistics views that are cumulative in nature, as most of them are - the first that comes to mind is the pg_stat_user_tables view, which lets you determine at which times some tables are getting a lot of sequential scans. There I’d also add some minimum table size limit into the mix, as scans on tiny tables, unless in the billions per day, are not usually a thing to worry about.
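The same snapshotting pattern applied to sequential scans could look something like this (the 10 MB size floor is an arbitrary example value):

CREATE UNLOGGED TABLE table_stat_snaps AS
  SELECT now() AS created_on, *
  FROM pg_stat_user_tables
  WHERE pg_relation_size(relid) > 10 * 1024 * 1024;   -- ignore tiny tables

-- then periodically re-run the matching INSERT ... SELECT, exactly as with pg_stat_statements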

Hopefully, that wasn’t too “hacky”, and you have a new trick up your sleeve now.

Read more about putting pg_stat_statements to good use in Hans' blog Find and Fix a Missing PostgreSQL Index.
