There are various compelling reasons to use universally unique identifiers (UUIDs) as primary keys. Two examples are that they can be generated on the client side without a round trip to the database, and that they do not reveal how many rows a table contains.
However, like everything good in life, UUIDs come with their own downsides. The most important of these is a loss of temporal locality.
The most common way to generate a UUID is to pick a random 128-bit number. (As an aside, for those worried about collisions: you should take up the lottery, since winning the jackpot twice in a row is much more likely than your system ever generating two identical random 128-bit numbers.) There is a standard for doing this, called UUID v4. That standard requires fixed values for 6 bits, leaving only 122 bits of randomness, but that is still plenty for all practical purposes. There is even a convenient PostgreSQL function called gen_random_uuid() for generating these values.
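To see it in action: gen_random_uuid() is built in since PostgreSQL 13 and needs no extension. The value shown here is made up for illustration, but note that the third group of a v4 UUID always starts with the version nibble 4:

locality=# SELECT gen_random_uuid();
           gen_random_uuid
--------------------------------------
 5b0c46ac-3b4e-4a8f-9d2e-7c1f0a9b6e21
(1 row)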
The trouble with picking random values is that they lose any correlation between the order of inserts and the order of the ids. A B-tree index on UUID values is ordered lexicographically, which means that two rows inserted at approximately the same time end up in completely different places in the index. By contrast, ids generated from a sequence end up close to each other in the index. Since workloads are commonly interested in recently inserted rows, this means that a random-UUID index needs to keep essentially the whole index in cache, whereas a sequential-id index can make do with only a few recent pages.
But this relationship goes the other way, too. With random ids, values that are near each other in the index were inserted at totally different times and live in totally different places in the table. This is a story of how that can become important.
To start with, we'll create a table of records with a traditional integer id and a random UUID. Let's also add a timestamp-based UUID (using the soon-to-be-standardized UUID v7 format) to see whether the problem lies with the UUID type itself or with the data ordering. We will also add 100 bytes of filler to every row, both to be realistic and to make the planner pick the plans that I want. This table will have 10 million rows in it.
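The uuid_generate_v7() function used in the transcript below does not ship with PostgreSQL; the test assumes some implementation of it is installed. A minimal plain-SQL sketch, along the lines of implementations circulating in the community (the exact bit layout here is illustrative), could look like this:

-- Sketch of a UUID v7 generator in plain SQL: 48-bit Unix timestamp in
-- milliseconds, followed by random bits. Setting bits 52 and 53 turns the
-- version nibble inherited from gen_random_uuid() (0100 = v4) into 0111 = v7;
-- the variant bits from gen_random_uuid() are already correct.
CREATE FUNCTION uuid_generate_v7() RETURNS uuid
LANGUAGE sql VOLATILE AS $$
  SELECT encode(
           set_bit(
             set_bit(
               overlay(uuid_send(gen_random_uuid())
                       PLACING substring(int8send((extract(epoch FROM clock_timestamp()) * 1000)::bigint) FROM 3)
                       FROM 1 FOR 6),
               52, 1),
             53, 1),
           'hex')::uuid;
$$;

Any generator that puts the timestamp in the most significant bits will do for the locality argument; the precise field layout only matters for interoperability.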
locality=# create table records (id int8 not null, uuid_v4 uuid not null, uuid_v7 uuid not null, filler text);
CREATE TABLE
Time: 98.515 ms
locality=# insert into records select id, gen_random_uuid(), uuid_generate_v7(), repeat(' ', 100) from generate_series(1, 10000000) id;
INSERT 0 10000000
Time: 28050.965 ms (00:28.051)
We will add an index on each of the id columns, like a normal table would have, and run VACUUM ANALYZE on it, just so we don't have to wait for autovacuum to get around to this table.
locality=# create index on records (id);
CREATE INDEX
Time: 1689.437 ms (00:01.689)
locality=# create index on records (uuid_v4);
CREATE INDEX
Time: 2378.215 ms (00:02.378)
locality=# create index on records (uuid_v7);
CREATE INDEX
Time: 1956.945 ms (00:01.957)
locality=# vacuum analyze records;
VACUUM
Time: 281.858 ms
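As a quick sanity check of the locality argument (this query is not part of the original transcript), pg_stats exposes a correlation column that measures how well the physical row order matches the logical order of a column's values. After the VACUUM ANALYZE above, we would expect id and uuid_v7 to be close to 1 and uuid_v4 to be close to 0:

-- Physical vs. logical ordering correlation, as estimated by ANALYZE
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'records' AND attname IN ('id', 'uuid_v4', 'uuid_v7');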
Let's check the sizes of the table and indexes we ended up with, and make sure they are cached:
locality=# select relname, pg_size_pretty(pg_relation_size(oid)), pg_prewarm(oid) from pg_class where relname like 'records%';
       relname       | pg_size_pretty | pg_prewarm
---------------------+----------------+------------
 records             | 1662 MB        |     212766
 records_id_idx      | 214 MB         |      27422
 records_uuid_v4_idx | 301 MB         |      38506
 records_uuid_v7_idx | 301 MB         |      38506
(4 rows)

Time: 283.849 ms
Now let's run a simple COUNT over each column, which PostgreSQL can satisfy with an index-only scan:
locality=# SELECT COUNT(id) FROM records;
  count
----------
 10000000
(1 row)

Time: 526.486 ms
locality=# SELECT COUNT(uuid_v4) FROM records;
  count
----------
 10000000
(1 row)

Time: 1213.813 ms (00:01.214)
The UUID is about twice as slow, but the values in it are also twice as big, so that makes sense, right?
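A quick way to confirm the size difference (not part of the original test) is pg_column_size(), which reports 8 bytes for an int8 and 16 bytes for a uuid:

SELECT pg_column_size(1::int8) AS int8_bytes,
       pg_column_size(gen_random_uuid()) AS uuid_bytes;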
locality=# SELECT COUNT(uuid_v7) FROM records;
  count
----------
 10000000
(1 row)

Time: 541.322 ms
Huh? So it’s the random ordering that is causing us issues, not the UUID type itself. We will have to dig into the explain plans to see what is going on.
locality=# EXPLAIN (BUFFERS, ANALYZE, TIMING OFF) SELECT COUNT(id) FROM records;
                                                              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=202422.47..202422.48 rows=1 width=8) (actual rows=1 loops=1)
   Buffers: shared hit=27332
   ->  Index Only Scan using records_id_idx on records  (cost=0.43..177422.46 rows=10000002 width=8) (actual rows=10000000 loops=1)
         Heap Fetches: 0
         Buffers: shared hit=27332
 Planning Time: 0.056 ms
 Execution Time: 777.764 ms
(7 rows)

locality=# EXPLAIN (BUFFERS, ANALYZE, TIMING OFF) SELECT COUNT(uuid_v4) FROM records;
                                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=213506.47..213506.48 rows=1 width=8) (actual rows=1 loops=1)
   Buffers: shared hit=8562960
   ->  Index Only Scan using records_uuid_v4_idx on records  (cost=0.43..188506.46 rows=10000002 width=16) (actual rows=10000000 loops=1)
         Heap Fetches: 0
         Buffers: shared hit=8562960
 Planning Time: 0.058 ms
 Execution Time: 1430.684 ms
(7 rows)
Holy behemoth buffer count, Batman! What is going on with the number of buffers referenced by the random UUID query? To understand this, we have to talk a little bit about how index-only scans work.
PostgreSQL indexes do not include the transaction metadata used to determine who gets to see what. So normally, when we want to check whether an index row is visible to us, we have to go fetch the table row and see if the transaction that added it has committed and is visible to us, and also make sure the row hasn't since been deleted or updated. As you can imagine, that takes quite a bit of work, and for most bigger tables the answer will almost always be yes.
There is a way to get around this. Vacuuming keeps track of which table pages need no further cleanup work because everything on them is visible to everyone, a state known as the page being all-visible. This is recorded in the visibility map, which is just a bitmap with 2 bits per 8kB page. That makes it 32'768 times smaller than the table, or in other words, tiny and really easy to cache. Index-only scans make use of it: if we only need values from the index and the visibility map tells us that the row's page is all-visible, we can skip looking up the row, saving a bunch of work.
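If you want to look at the visibility map yourself (not something the original test does), the pg_visibility extension that ships with PostgreSQL can summarize it; after the VACUUM above, all_visible should cover essentially all of the table's pages:

CREATE EXTENSION IF NOT EXISTS pg_visibility;
-- Number of heap pages marked all-visible and all-frozen in the visibility map
SELECT * FROM pg_visibility_map_summary('records');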
But both of our scans need to do this visibility map lookup, so what is going on up there? The answer is that looking up a page in shared buffers is not free: it involves grabbing a lock, running a lookup in a relatively large hash table, and doing a memory write to mark the page as in use. So the index-only scan hangs on to a reference to the visibility map page it looked at last time, and if the next index tuple happens to need the same page, it can skip all that work and just read the bit it needs off of it. Some of you might see where this is going…
Each visibility map page covers 8192*8/2 = 32'768 table pages, or 256MiB of table data. In the sequential id case, consecutive id values practically always land on the same visibility map page, so the optimization works perfectly. Our 1662 MB table spans only about 6.5 visibility map pages, so with random UUIDs there is only about a 1 in 7 chance that two consecutive index entries need the same one. That results in roughly 6/7 * 10M ~= 8.5M extra buffer lookups for visibility checks, each of which is counted as one buffer hit in the explain plan. A cache hit is cheap, but not free, and doing it 8.5M times over adds up.
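You can get that estimate straight from the catalog with a back-of-the-envelope query (illustrative, not from the original transcript): each visibility map page covers 32'768 heap pages of 8kB, i.e. 256MiB of table.

-- How many visibility map pages does the records table span?
-- 1662 MB / 256 MiB per VM page ≈ 6.5, hence the "about 1 in 7" above.
SELECT pg_relation_size('records') / (32768.0 * 8192) AS vm_pages;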
Just to verify this, we can check that the sequential UUID v7 case does not have that issue, and that its buffer hits only grow in proportion to the larger index size.
locality=# EXPLAIN (BUFFERS, ANALYZE, TIMING OFF) SELECT COUNT(uuid_v7) FROM records;
                                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=213506.47..213506.48 rows=1 width=8) (actual rows=1 loops=1)
   Buffers: shared hit=39126
   ->  Index Only Scan using records_uuid_v7_idx on records  (cost=0.43..188506.46 rows=10000002 width=16) (actual rows=10000000 loops=1)
         Heap Fetches: 0
         Buffers: shared hit=39126
 Planning Time: 0.058 ms
 Execution Time: 790.872 ms
(7 rows)
The moral of the story is that data locality matters, and it can pop up in the most surprising of places. Using random values is just about the worst thing you can do for locality, so if you want to use UUIDs, try to use a sequential variant. UUID v7 is a good option, and hopefully it's coming to PostgreSQL 17. In the meantime, there are already a couple of implementations out there, even a plain SQL one. Your cache hit rates will thank you.
See Laurenz' blog about auto-generated primary keys for more info on this topic.
See Hans' blog about Hint Bits if you would like to read more about PostgreSQL performance under the hood.
I would argue that the "traditional" UUID is variant 0x8, version 1, described in RFC 4122 section 4.2.1. By the time PostgreSQL introduced a native UUID type, that had been widely deprecated in favor of version 4 (same RFC, section 4.4); but there was already a "timestamp" version. The first field in the sort order is time_low, which wraps around about every 429½ seconds (roughly 7 minutes).
While v1 UUIDs are indeed timestamp based, because of that wrap-around behavior they do not preserve temporal locality, and as far as performance is concerned they are just as bad as the random variant.