What's in an xmax?

04.2018 / Category: How To / Tags: concurrency | lock | transactions

xmax is a PostgreSQL system column that is used to implement Multiversion Concurrency Control (MVCC). The documentation is somewhat terse:

Table of Contents

The identity (transaction ID) of the deleting transaction, or zero for an undeleted row version. It is possible for this column to be nonzero in a visible row version. That usually indicates that the deleting transaction hasn't committed yet, or that an attempted deletion was rolled back.

While this is true, the presence of “weasel words” like “usually” indicates that there is more to the picture. This is what I want to explore in this article.

The two meanings of `xmax`

I'll follow the PostgreSQL convention to use the word “tuple” for “row version” (remember that PostgreSQL implements MVCC by holding several versions of a row in the table).

xmax is actually used for two purposes in PostgreSQL:

It stores the transaction ID (“xid”) of the transaction that deleted the tuple, like the documentation says. Remember that UPDATE also deletes a tuple in PostgreSQL!
It stores row locks on the tuple.

This is possible, because a tuple cannot be locked and deleted at the same time: normal locks are only held for the duration of the transaction, and a tuple is deleted only after the deleting transaction has committed.

Storing row locks on the tuple itself has one vast advantage: it avoids overflows of the “lock table”. The lock table is a fixed-size area that is allocated in shared memory during server startup and could easily be too small to hold all the row locks from a bigger transaction. To cope with this, you'd need techniques like “lock escalation” that are difficult to implement, impact concurrency and lead to all kinds of nasty problems.

There is also a downside to storing row locks in the tuple: each row lock modifies the table, and the modified blocks have to be written back to persistent storage. This means that row locks lead to increased I/O load.

But a number of questions remain:

How can you tell which of the two meanings xmax has in a tuple?
How can I tell if xmax is valid or not?
How exactly are row locks stored?

We will dive deeper in the rest of this article to answer these questions.

An example

In the following, I'll use a simple schema for demonstration. I am using PostgreSQL v10, but this hasn't changed in the last couple of releases.

CREATE TABLE parent(
   p_id integer PRIMARY KEY,
   p_val text
);

CREATE TABLE child(
   c_id integer PRIMARY KEY,
   p_id integer REFERENCES parent(p_id),
   c_val text
);

INSERT INTO parent (p_id, p_val)
   VALUES (42, 'parent');

1

2

3

4

5

6

7

8

9

10

11

12

13

CREATE TABLE parent(

p_id integer PRIMARY KEY,

p_val text

);

CREATE TABLE child(

c_id integer PRIMARY KEY,

p_id integer REFERENCES parent(p_id),

c_val text

);

INSERT INTO parent (p_id, p_val)

VALUES (42, 'parent');

Now let's look at the relevant system columns:

session1=# SELECT ctid, xmin, xmax, p_id, p_val FROM parent;

 ctid  | xmin  | xmax | p_id | p_val  
-------+-------+------+------+--------
 (0,1) | 53163 |    0 |   42 | parent
(1 row)

1

2

3

4

5

6

session1=# SELECT ctid, xmin, xmax, p_id, p_val FROM parent;

ctid | xmin | xmax | p_id | p_val

-------+-------+------+------+--------

(0,1) | 53163 | 0 | 42 | parent

(1 row)

This is the simple view we expect to see: ctid is the physical location of the tuple (Block 0, item 1), xmin contains the ID of the inserting transaction, and xmax is zero because the row is alive.

Now let's start a transaction in session 1 and delete the row:

session1=# BEGIN;
session1=# DELETE FROM parent WHERE p_id = 42;

1 2	session1=# BEGIN; session1=# DELETE FROM parent WHERE p_id = 42;

Then session 2 can see that xmax has changed:

session2=# SELECT ctid, xmin, xmax, p_id, p_val FROM parent;

 ctid  | xmin  | xmax  | p_id | p_val  
-------+-------+-------+------+--------
 (0,1) | 53163 | 53165 |   42 | parent
(1 row)

1

2

3

4

5

6

session2=# SELECT ctid, xmin, xmax, p_id, p_val FROM parent;

ctid | xmin | xmax | p_id | p_val

-------+-------+-------+------+--------

(0,1) | 53163 | 53165 | 42 | parent

(1 row)

But wait, we change our mind in session 1 and undo the change:

session1=# ROLLBACK;

1	session1=# ROLLBACK;

To find out what xmax means in this case, let's call in the cavalry.

`pageinspect` comes to the rescue

PostgreSQL comes with a “contrib” module called pageinspect that can be used to examine the actual contents of table blocks. It is installed with

CREATE EXTENSION pageinspect;

1	CREATE EXTENSION pageinspect;

We'll use two of its functions:

get_raw_page: reads one 8kB block from the table's data file
heap_page_item_attrs: for each tuple in a data block, this returns the tuple metadata and data

Needless to say, these functions are superuser only.

heap_page_item_attrs returns an integer field named t_infomask that contains several flags, some of which tell us the meaning of xmax. To get the full story, you'll have to read the code in src/include/access/htup_details.h.

Let's have a look at table block 0, which contains our tuple:

session2=# SELECT lp, 
       t_ctid AS ctid,
       t_xmin AS xmin,
       t_xmax AS xmax,
       (t_infomask & 128)::boolean AS xmax_is_lock,
       (t_infomask & 1024)::boolean AS xmax_committed,
       (t_infomask & 2048)::boolean AS xmax_rolled_back,
       (t_infomask & 4096)::boolean AS xmax_multixact,
       t_attrs[1] AS p_id,
       t_attrs[2] AS p_val
FROM heap_page_item_attrs(
        get_raw_page('parent', 0), 
        'parent'
     );  

-[ RECORD 1 ]----+-----------------
lp               | 1
ctid             | (0,1)
xmin             | 53163
xmax             | 53165
xmax_is_lock     | f
xmax_committed   | f
xmax_rolled_back | f
xmax_multixact   | f
p_id             | x2a000000
p_val            | x0f706172656e74

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

session2=# SELECT lp,

t_ctid AS ctid,

t_xmin AS xmin,

t_xmax AS xmax,

(t_infomask & 128)::boolean AS xmax_is_lock,

(t_infomask & 1024)::boolean AS xmax_committed,

(t_infomask & 2048)::boolean AS xmax_rolled_back,

(t_infomask & 4096)::boolean AS xmax_multixact,

t_attrs[1] AS p_id,

t_attrs[2] AS p_val

FROM heap_page_item_attrs(

get_raw_page('parent', 0),

'parent'

);

-[ RECORD 1 ]----+-----------------

lp | 1

ctid | (0,1)

xmin | 53163

xmax | 53165

xmax_is_lock | f

xmax_committed | f

xmax_rolled_back | f

xmax_multixact | f

p_id | x2a000000

p_val | x0f706172656e74

The attributes p_id and p_val are displayed in binary form.

The information in the tuple doesn't tell us whether the transaction that set xmax has been committed or rolled back, so we (and PostgreSQL when it inspects the tuple) still don't know what to make of xmax. That is because PostgreSQL does not update the tuple when a transaction ends.

To resolve that uncertainty, we'd have to look at the commit log that stores the state of each transaction. The commit log is persisted in the pg_xact subdirectory of the PostgreSQL data directory (pg_clog in older versions).

A `SELECT` that modifies data

We cannot examine the commit log from SQL, but when any database transaction reads the tuple and looks up the commit log, it will persist the result in the tuple so that the next reader does not have to do it again (this is called “setting the hint bits”).

So all we have to do is to read the tuple:

session2=# SELECT ctid, xmin, xmax, p_id, p_val FROM parent;

 ctid  | xmin  | xmax  | p_id | p_val  
-------+-------+-------+------+--------
 (0,1) | 53163 | 53165 |   42 | parent
(1 row)

1

2

3

4

5

6

session2=# SELECT ctid, xmin, xmax, p_id, p_val FROM parent;

ctid | xmin | xmax | p_id | p_val

-------+-------+-------+------+--------

(0,1) | 53163 | 53165 | 42 | parent

(1 row)

This changes the information stored in the tuple. Let's have another look with pageinspect:

-[ RECORD 1 ]----+-----------------
lp               | 1
ctid             | (0,1)
xmin             | 53163
xmax             | 53165
xmax_is_lock     | f
xmax_committed   | f
xmax_rolled_back | t
xmax_multixact   | f
p_id             | x2a000000
p_val            | x0f706172656e74

1

2

3

4

5

6

7

8

9

10

11

-[ RECORD 1 ]----+-----------------

lp | 1

ctid | (0,1)

xmin | 53163

xmax | 53165

xmax_is_lock | f

xmax_committed | f

xmax_rolled_back | t

xmax_multixact | f

p_id | x2a000000

p_val | x0f706172656e74

The SELECT statement has set the flags on the tuple, and now we can see that xmax is from a transaction that was rolled back and should be ignored.

As an aside, that means that the first reader of a tuple modifies the tuple, causing surprising write I/O. This is annoying, but it is the price we pay for instant COMMIT and ROLLBACK. It is also the reason why it is a good idea to either use COPY … (FREEZE) to bulk load data or to VACUUM the data after loading.

Now we know how to determine if xmax is from a valid transaction or not, but what about row locks?

Row locks and `xmax`

Rows are locked by data modifying statements, but there is a simple way to lock a row without inserting or deleting tuples:

session1=# BEGIN;
session1=# SELECT * FROM parent WHERE p_id = 42 FOR UPDATE;

 p_id | p_val  
------+--------
   42 | parent
(1 row)

1

2

3

4

5

6

7

session1=# BEGIN;

session1=# SELECT * FROM parent WHERE p_id = 42 FOR UPDATE;

p_id | p_val

------+--------

42 | parent

(1 row)

Now what does pageinspect tell us?

-[ RECORD 1 ]----+-----------------
lp               | 1
ctid             | (0,1)
xmin             | 53163
xmax             | 53166
xmax_is_lock     | t
xmax_committed   | f
xmax_rolled_back | f
xmax_multixact   | f
p_id             | x2a000000
p_val            | x0f706172656e74

1

2

3

4

5

6

7

8

9

10

11

-[ RECORD 1 ]----+-----------------

lp | 1

ctid | (0,1)

xmin | 53163

xmax | 53166

xmax_is_lock | t

xmax_committed | f

xmax_rolled_back | f

xmax_multixact | f

p_id | x2a000000

p_val | x0f706172656e74

We see that the row is locked. In this case, it is a FOR UPDATE lock, but the query does not distinguish between the lock modes for simplicity's sake. You'll notice that xmax again is neither committed nor rolled back, but we don't care because we know it is a row lock.

xmax is set to 53166, which is the transaction ID of the locking transaction. Let's close that transaction to continue:

session1=# COMMIT;

1	session1=# COMMIT;

PostgreSQL does not have to set hint bits here — if xmax contains a row lock, the row is active, no matter what the state of the locking transaction is.

If you think you have seen it all, you are in for a surprise.

Multiple locks on a single row

In the previous example we have seen that PostgreSQL stores the transaction ID of the locking transaction in xmax. This works fine as long as only a single transaction holds a lock on that tuple. With exclusive locks like the one that SELECT … FOR UPDATE takes, this is always the case.

But PostgreSQL also knows other row locks, for example the FOR KEY SHARE lock that is taken on the destination of a foreign key constraint to prevent concurrent modification of the keys in that row. Let's insert some rows in the child table:

session1=# BEGIN;
session1=# INSERT INTO child (c_id, p_id, c_val)
   VALUES (1, 42, 'first');

session2=# BEGIN;
session2=# INSERT INTO child (c_id, p_id, c_val)
   VALUES (2, 42, 'second');

1

2

3

4

5

6

7

session1=# BEGIN;

session1=# INSERT INTO child (c_id, p_id, c_val)

VALUES (1, 42, 'first');

session2=# BEGIN;

session2=# INSERT INTO child (c_id, p_id, c_val)

VALUES (2, 42, 'second');

Now let's look at our parent row again:

-[ RECORD 1 ]----+-----------------
lp               | 1
ctid             | (0,1)
xmin             | 53163
xmax             | 3
xmax_is_lock     | t
xmax_committed   | f
xmax_rolled_back | f
xmax_multixact   | t
p_id             | x2a000000
p_val            | x0f706172656e74

1

2

3

4

5

6

7

8

9

10

11

-[ RECORD 1 ]----+-----------------

lp | 1

ctid | (0,1)

xmin | 53163

xmax | 3

xmax_is_lock | t

xmax_committed | f

xmax_rolled_back | f

xmax_multixact | t

p_id | x2a000000

p_val | x0f706172656e74

That “3” in xmax cannot be a transaction ID (they keep counting up), and the xmax_multixact flag is set.

This is the ID of a “multiple transaction object”, called “mulitxact” in PostgreSQL jargon for lack of a better word. Such objects are created whenever more than one transaction locks a row, and their IDs are also counted up (you can tell that this database needs few of them). Multixacts are persisted in the pg_multixact subdirectory of the data directory.

You can get information about the members of a multixact with the undocumented pg_get_multixact_members function:

session2=# SELECT * FROM pg_get_multixact_members('3');
  xid  | mode  
-------+-------
 53167 | keysh
 53168 | keysh
(2 rows)

1

2

3

4

5

6

session2=# SELECT * FROM pg_get_multixact_members('3');

xid | mode

-------+-------

53167 | keysh

53168 | keysh

(2 rows)

Now you really know what is in an xmax!

In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.

9 responses to “What's in an xmax?”

David Aldridge says:

April 23, 2018 at 8:40 am

Very interesting, and similar in many ways to Oracle's "delayed block cleanout" mechanism. Thanks for the write-up.

Reply
- laurenz says:
  
  April 23, 2018 at 10:47 am
  
  Setting the hint bits is indeed similar in many respects, except
  a) it is about rows, not blocks
  b) it is about visibility information (which Oracle does not need), not about locks
  c) it happens to all rows, not only to some of the blocks in a large transaction
  d) you cannot run into an ORA-01555 error :^) - commit log entries won't be removed when they are still needed
  
  But thanks for the comment, it introduced me to this Oracle concept that I didn't know before.
  
  Reply
Clint Dunn says:

February 7, 2023 at 10:49 pm

Thanks for the great explanation! I know this is super old, but I'm still curious about this:

PostgreSQL does not have to set hint bits here — if xmax contains a row lock, the row is active, no matter what the state of the locking transaction is.

Doesn't postgres need to do some cleanup after the transaction is committed? E.g. set the hint for xmax_is_lock to false? It seems like the lock would be in effect from the perspective of other transactions until that happens.

Reply
- laurenz says:
  
  February 8, 2023 at 6:52 am
  
  The difference is the state of the transaction. The reader will look at the commit log (or a hint bit set by a previous reader) and determine that the transaction in xmax is committed. As a closed transaction, it cannot hold a lock, so xmax can be ignored.
  
  Reply
  - Clint Dunn says:
    
    February 8, 2023 at 5:07 pm
    
    Ahh that makes sense thanks.
    
    Reply
Fabrice Chapuis says:

May 9, 2023 at 10:24 am

Thank you for this excellent article, do you know the mechanism used for the drop table in a transaction.

Reply
- laurenz says:
  
  May 9, 2023 at 12:23 pm
  
  I don't see the connection to the article, but it is not difficult: The metadata are just transactional, and you wait with deleting the underlying files until COMMIT.
  
  Reply
Fabrice Chapuis says:

August 17, 2023 at 8:54 am

Thanks for these excellent explanations about xmax
I used pg_inspect and I noticed that it returned two same values for ctid. I do not understand how 2 rows could be in the same place in the block?
5 │ (0,6) │ 743391 │ 743392 │ f │ f │ f │ f │ x02000000 │ │
6 │ (0,6) │ 743392 │ 0 f │ f │ t │ f │ x01000000 │ │

Reply
- laurenz says:
  
  August 28, 2023 at 7:50 am
  
  The ctid attribute of an updated row is modified to point to the new version.
  
  Reply

Leave a Reply Cancel reply

Stay tuned with our

NEWSLETTER

CYBERTEC PostgreSQL International GmbH
Römerstraße 19
2752 Wöllersdorf
Austria

+43 (0) 2622 93022-0

office@cybertec-postgresql.com

ISO_27001_Badge

Customer Support

Support Platform

Services

Support CYBERTEC Partner PostgreSQL Books

Company

STAY TUNED WITH OUR NEWSLETTER

Get the newest PostgreSQL Info & Tools

Data Protection Policy Terms and conditions Terms of Service Imprint

©

2025

CYBERTEC PostgreSQL International GmbH