Keyset pagination with descending order

Keyset pagination is the most performant way to retrieve a large result set page by page. However, the neat trick with composite type comparison doesn't always work. This article explains why and how you can work around that shortcoming.

Table of Contents

An example table for paginated queries

We create a table with a million rows:

CREATE TABLE sortme (
   id bigint PRIMARY KEY,
   val1 integer NOT NULL,
   val2 timestamp with time zone NOT NULL,
   val3 text NOT NULL
);

-- deterministic, but sort of random values
INSERT INTO sortme
SELECT i,
       abs(hashint8(i)) % 200 + 1,
       TIMESTAMPTZ '2024-01-01 00:00:00' +
          (abs(hashint8(-i)) % 200) * INTERVAL '1 hour',
       substr(md5(i::text), 1, 2)
FROM generate_series(1, 1000000) AS i;

-- get statistics and set hint bits
VACUUM (ANALYZE) sortme;

CREATE TABLE sortme (

id bigint PRIMARY KEY,

val1 integer NOT NULL,

val2 timestamp with time zone NOT NULL,

val3 text NOT NULL

);

-- deterministic, but sort of random values

INSERT INTO sortme

SELECT i,

abs(hashint8(i)) % 200 + 1,

TIMESTAMPTZ '2024-01-01 00:00:00' +

(abs(hashint8(-i)) % 200) * INTERVAL '1 hour',

substr(md5(i::text), 1, 2)

FROM generate_series(1, 1000000) AS i;

-- get statistics and set hint bits

VACUUM (ANALYZE) sortme;

Normal keyset pagination

If we want to ORDER BY val1, val2, the first query would look like

SELECT val1, val2, val3, id
FROM sortme
ORDER BY val1, val2, id
LIMIT 50;

SELECT val1, val2, val3, id

FROM sortme

ORDER BY val1, val2, id

LIMIT 50;

The primary key id was added to the result list and the ORDER BY clause to provide uniqueness and thereby guarantee a stable order. Assuming that the last row returned from that query was (1, '2024-01-01 01:00:00+01', 'e3', 920198), the query for the next page would be

SELECT val1, val2, val3, id
FROM sortme
WHERE (val1, val2, id) > (1, '2024-01-01 01:00:00+01', 920198)
ORDER BY val1, val2, id
LIMIT 50;

SELECT val1, val2, val3, id

FROM sortme

WHERE (val1, val2, id) > (1, '2024-01-01 01:00:00+01', 920198)

ORDER BY val1, val2, id

LIMIT 50;

With a multi-column index on (val1, val2, id), both queries would be lightning fast, as would be the queries for all following pages.

The problem with keyset pagination in descending and mixed order

Matters become more complicated as soon as we need descending sort order. A fairly simple case is if we need to sort the result by a single colum (for example, val3). Again, we need the unique column id as a tie-breaker. The solution is to use descending sort order for both columns:

-- first page
SELECT val1, val2, val3, id
FROM sortme
ORDER BY val3 DESC, id DESC
LIMIT 50;

-- next page
SELECT val1, val2, val3, id
FROM sortme
WHERE (val3, id) < ('ff', 986821)
ORDER BY val3 DESC, id DESC
LIMIT 50;

-- first page

SELECT val1, val2, val3, id

FROM sortme

ORDER BY val3 DESC, id DESC

LIMIT 50;

-- next page

SELECT val1, val2, val3, id

FROM sortme

WHERE (val3, id) < ('ff', 986821)

ORDER BY val3 DESC, id DESC

LIMIT 50;

This would scan an index on (val3, id) in reverse order and be just as efficient as before.

However, if we want to ORDER BY val3, val1 DESC, the above trick will no longer work. Finding the first 50 results is no problem, but we hit an impasse when we want to fetch the second page. We'd have to compare the triple (val3, val1, id) with the last row from the first page, but neither comparison operator will do: for the first element, we'd need to compare with >, but for the second element we need <.

-- how to get the next page?
SELECT val1, val2, val3, id
FROM sortme
WHERE (val3, val1, id) ??? ('00', 198, 219956)
ORDER BY val3, val1 DESC, id
LIMIT 50;

-- how to get the next page?

SELECT val1, val2, val3, id

FROM sortme

WHERE (val3, val1, id) ??? ('00', 198, 219956)

ORDER BY val3, val1 DESC, id

LIMIT 50;

It seems that we cannot use keyset pagination in a case like this. But, as the following sections will show, there are ways to get keyset pagination to work with mixed sorting order.

Keyset pagination using the negative value

The simplest trick to get the above example working is to use -val1 to turn the descending sort order into an ascending search order:

-- first page
SELECT val1, val2, val3, id
FROM sortme
ORDER BY val3, -val1, id
LIMIT 50;

-- next page
SELECT val1, val2, val3, id
FROM sortme
WHERE (val3, -val1, id) > ('00', -198, 219956)
ORDER BY val3, -val1, id
LIMIT 50;

-- first page

SELECT val1, val2, val3, id

FROM sortme

ORDER BY val3, -val1, id

LIMIT 50;

-- next page

SELECT val1, val2, val3, id

FROM sortme

WHERE (val3, -val1, id) > ('00', -198, 219956)

ORDER BY val3, -val1, id

LIMIT 50;

The index you need to support this query is

CREATE INDEX ON sortme (val3, (-val1), id);

1	CREATE INDEX ON sortme (val3, (-val1), id);

The above trick is simple, but will only work for data types for which you can turn descending order into ascending order. Numbers are easy. To make the trick work with val2, which is a timestamp with time zone, we have to work harder.

“Negative value” for timestamps

There is no “unary minus” operator for timestamp with time zone, but we can extract the number of seconds since the Unix epoch:

-- first page
SELECT val1, val2, val3, - EXTRACT (epoch FROM val2), id 
FROM sortme
ORDER BY val3, - EXTRACT (epoch FROM val2), id
LIMIT 50;

-- next page
SELECT val1, val2, val3, - EXTRACT (epoch FROM val2), id 
FROM sortme
WHERE (val3, - EXTRACT (epoch FROM val2), id) >
      ('00', -1704772800.000000, 204901)
ORDER BY val3, - EXTRACT (epoch FROM val2), id
LIMIT 50;

-- first page

SELECT val1, val2, val3, - EXTRACT (epoch FROM val2), id

FROM sortme

ORDER BY val3, - EXTRACT (epoch FROM val2), id

LIMIT 50;

-- next page

SELECT val1, val2, val3, - EXTRACT (epoch FROM val2), id

FROM sortme

WHERE (val3, - EXTRACT (epoch FROM val2), id) >

('00', -1704772800.000000, 204901)

ORDER BY val3, - EXTRACT (epoch FROM val2), id

LIMIT 50;

This looks ugly, but is fairly straightforward. However, we'll get a surprise if we try to create the necessary index:

CREATE INDEX ON sortme (val3, (- EXTRACT (epoch FROM val2)), id);
ERROR:  functions in index expression must be marked IMMUTABLE

1 2	CREATE INDEX ON sortme (val3, (- EXTRACT (epoch FROM val2)), id); ERROR: functions in index expression must be marked IMMUTABLE

The reason why EXTRACT for timestamp with time zone isn't immutable is that for some arguments, the result depends on the current setting of timezone. However, extracting the Unix epoch is immutable, so we can safely work around the problem by defining an IMMUTABLE function for that purpose:

CREATE FUNCTION get_epoch(timestamp with time zone) RETURNS numeric
   IMMUTABLE
   RETURN EXTRACT (epoch FROM $1);

CREATE INDEX ON sortme (val3, (- get_epoch(val2)), id);

CREATE FUNCTION get_epoch(timestamp with time zone) RETURNS numeric

IMMUTABLE

RETURN EXTRACT (epoch FROM $1);

CREATE INDEX ON sortme (val3, (- get_epoch(val2)), id);

To make use of the index, we have to use the function in our queries as well.

Getting the negative value to turn the comparison operator from < to > is a good trick, but it won't work for all data types. For example, you cannot get a “negative value” for a string, so our trick won't work for the column val3.

Keyset pagination by splitting up the query

A second approach to solving the problem of descending sort order is to avoid composite value comparison at all. For example, if you want to ORDER BY val2, val3 DESC, you could write the query for the next page as

SELECT val1, val2, val3, id 
FROM sortme
WHERE val2 > '2024-01-01 00:00:00+01'
   OR val2 = '2024-01-01 00:00:00+01'
      AND val3 < 'fc'
   OR val2 = '2024-01-01 00:00:00+01'
      AND val3 = 'fc'
      AND id > 173511
ORDER BY val2, val3 DESC, id
LIMIT 50;

SELECT val1, val2, val3, id

FROM sortme

WHERE val2 > '2024-01-01 00:00:00+01'

OR val2 = '2024-01-01 00:00:00+01'

AND val3 < 'fc'

OR val2 = '2024-01-01 00:00:00+01'

AND val3 = 'fc'

AND id > 173511

ORDER BY val2, val3 DESC, id

LIMIT 50;

That query is technically correct, but the WHERE condition contains OR, which is notoriously bad for performance. The query won't use an index defined on the ORDER BY clause effectively. But we can split the query into three simpler subqueries:

  (SELECT val1, val2, val3, id
   FROM sortme
   WHERE val2 > '2024-01-01 00:00:00+01'
   ORDER BY val2, val3 DESC, id
   LIMIT 50)
UNION ALL
  (SELECT val1, val2, val3, id
   FROM sortme
   WHERE val2 = '2024-01-01 00:00:00+01'
     AND val3 < 'fc'
   ORDER BY val2, val3 DESC, id
   LIMIT 50)
UNION ALL
  (SELECT val1, val2, val3, id
   FROM sortme
   WHERE val2 = '2024-01-01 00:00:00+01'
     AND val3 = 'fc' 
     AND id > 173511
   ORDER BY val2, val3 DESC, id
   LIMIT 50)
ORDER BY val2, val3 DESC, id
LIMIT 50;

(SELECT val1, val2, val3, id

FROM sortme

WHERE val2 > '2024-01-01 00:00:00+01'

ORDER BY val2, val3 DESC, id

LIMIT 50)

UNION ALL

(SELECT val1, val2, val3, id

FROM sortme

WHERE val2 = '2024-01-01 00:00:00+01'

AND val3 < 'fc'

ORDER BY val2, val3 DESC, id

LIMIT 50)

UNION ALL

(SELECT val1, val2, val3, id

FROM sortme

WHERE val2 = '2024-01-01 00:00:00+01'

AND val3 = 'fc'

AND id > 173511

ORDER BY val2, val3 DESC, id

LIMIT 50)

ORDER BY val2, val3 DESC, id

LIMIT 50;

This query looks ugly, but each of the three branches fetches 50 rows using an index on (val2, val3 DESC, id), so the query will be quite fast. Here is the execution plan:

 Limit
   ->  Merge Append
         Sort Key: sortme.val2, sortme.val3 DESC, sortme.id
         ->  Limit
               ->  Index Scan using sortme_val2_val3_id_idx on sortme
                     Index Cond: (val2 > '2024-01-01 00:00:00+01'::timestamp with 
time zone)
         ->  Sort
               Sort Key: sortme_1.val2, sortme_1.val3 DESC, sortme_1.id
               ->  Limit
                     ->  Index Scan using sortme_val2_val3_id_idx on sortme sortme_1
                           Index Cond: ((val2 = '2024-01-01 00:00:00+01'::timestamp 
with time zone) AND (val3 < 'fc'::text))
         ->  Sort
               Sort Key: sortme_2.val2, sortme_2.val3 DESC, sortme_2.id
               ->  Limit
                     ->  Index Scan using sortme_val2_val3_id_idx on sortme sortme_2
                           Index Cond: ((val2 = '2024-01-01 00:00:00+01'::timestamp 
with time zone) AND (val3 = 'fc'::text) AND (id > 173511))

Limit

-> Merge Append

Sort Key: sortme.val2, sortme.val3 DESC, sortme.id

-> Limit

-> Index Scan using sortme_val2_val3_id_idx on sortme

Index Cond: (val2 > '2024-01-01 00:00:00+01'::timestamp with

time zone)

-> Sort

Sort Key: sortme_1.val2, sortme_1.val3 DESC, sortme_1.id

-> Limit

-> Index Scan using sortme_val2_val3_id_idx on sortme sortme_1

Index Cond: ((val2 = '2024-01-01 00:00:00+01'::timestamp

with time zone) AND (val3 < 'fc'::text))

-> Sort

Sort Key: sortme_2.val2, sortme_2.val3 DESC, sortme_2.id

-> Limit

-> Index Scan using sortme_val2_val3_id_idx on sortme sortme_2

Index Cond: ((val2 = '2024-01-01 00:00:00+01'::timestamp

with time zone) AND (val3 = 'fc'::text) AND (id > 173511))

Keyset pagination by defining a custom data type

Finally, you can solve the problem of the descending sorting order by defining a data type that is just like a normal data type, but the comparison operators are switched around. For that, you need to define a new base type. You don't need to write any C code, since we are just reusing the functions from the standard data types. That requires fairly deep knowledge of PostgreSQL and superuser access, so it is not something you can do in a hosted database. I doubt that many users will want to go that way, but it is a nice exercise of PostgreSQL's extensibility features.

For our experiment, we want to use the same ORDER BY clause as in the previous section, so we will define a data type invtext that sorts inversely to text.

Defining the data type

Since the type input and output functions reference the data type and vice versa, we have to define a shell type first.

-- the real type definition will come later
CREATE TYPE invtext;

CREATE FUNCTION invtextin(cstring) RETURNS invtext
   LANGUAGE internal IMMUTABLE STRICT PARALLEL SAFE
AS 'textin';

CREATE FUNCTION invtextout(invtext) RETURNS cstring
   LANGUAGE internal IMMUTABLE STRICT PARALLEL SAFE
AS 'textout';

CREATE FUNCTION invtextrecv(internal) RETURNS invtext
   LANGUAGE internal STABLE STRICT PARALLEL SAFE
AS 'textrecv';

CREATE FUNCTION invtextsend(invtext) RETURNS bytea
   LANGUAGE internal STABLE STRICT PARALLEL SAFE
AS 'textsend';

-- now we can create the type for real
CREATE TYPE invtext (
   INPUT          = invtextin,
   OUTPUT         = invtextout,
   RECEIVE        = invtextrecv,
   SEND           = invtextsend,
   INTERNALLENGTH = VARIABLE,
   STORAGE        = extended,
   CATEGORY       = 'S',
   PREFERRED      = false,
   COLLATABLE     = true
);

CREATE CAST (text AS invtext) WITHOUT FUNCTION;
CREATE CAST (invtext AS text) WITHOUT FUNCTION;
CREATE CAST (character varying AS invtext) WITHOUT FUNCTION;

-- the real type definition will come later

CREATE TYPE invtext;

CREATE FUNCTION invtextin(cstring) RETURNS invtext

LANGUAGE internal IMMUTABLE STRICT PARALLEL SAFE

AS 'textin';

CREATE FUNCTION invtextout(invtext) RETURNS cstring

LANGUAGE internal IMMUTABLE STRICT PARALLEL SAFE

AS 'textout';

CREATE FUNCTION invtextrecv(internal) RETURNS invtext

LANGUAGE internal STABLE STRICT PARALLEL SAFE

AS 'textrecv';

CREATE FUNCTION invtextsend(invtext) RETURNS bytea

LANGUAGE internal STABLE STRICT PARALLEL SAFE

AS 'textsend';

-- now we can create the type for real

CREATE TYPE invtext (

INPUT = invtextin,

OUTPUT = invtextout,

RECEIVE = invtextrecv,

SEND = invtextsend,

INTERNALLENGTH = VARIABLE,

STORAGE = extended,

CATEGORY = 'S',

PREFERRED = false,

COLLATABLE = true

);

CREATE CAST (text AS invtext) WITHOUT FUNCTION;

CREATE CAST (invtext AS text) WITHOUT FUNCTION;

CREATE CAST (character varying AS invtext) WITHOUT FUNCTION;

So far, our type is just like text. The type casts in the end allow us to convert other string data types to invchar. I won't bother to define a cast from character, since that is a data type you shouldn't use anyway. The cast from invtext to text is only needed for the b-tree comparison function further down.

Defining comparison operators for the data type

This is where invtext differs from text. We are reusing the PostgreSQL core functions from the data type text, but we switch around the order of comparison:

CREATE FUNCTION invtexteq(invtext, invtext) RETURNS boolean
   LANGUAGE internal IMMUTABLE STRICT PARALLEL SAFE LEAKPROOF
AS 'texteq';

CREATE FUNCTION invtextne(invtext, invtext) RETURNS boolean
   LANGUAGE internal IMMUTABLE STRICT PARALLEL SAFE LEAKPROOF
AS 'textne';

CREATE FUNCTION invtext_lt(invtext, invtext) RETURNS boolean
   LANGUAGE internal IMMUTABLE STRICT PARALLEL SAFE LEAKPROOF
AS 'text_gt';

CREATE FUNCTION invtext_le(invtext, invtext) RETURNS boolean
   LANGUAGE internal IMMUTABLE STRICT PARALLEL SAFE LEAKPROOF
AS 'text_ge';

CREATE FUNCTION invtext_gt(invtext, invtext) RETURNS boolean
   LANGUAGE internal IMMUTABLE STRICT PARALLEL SAFE LEAKPROOF
AS 'text_lt';

CREATE FUNCTION invtext_ge(invtext, invtext) RETURNS boolean
   LANGUAGE internal IMMUTABLE STRICT PARALLEL SAFE LEAKPROOF
AS 'text_le';

CREATE OPERATOR = (
   LEFTARG    = invtext,
   RIGHTARG   = invtext,
   COMMUTATOR = =,
   NEGATOR    = <>,
   PROCEDURE  = invtexteq,
   RESTRICT   = eqsel,
   JOIN       = eqjoinsel,
   HASHES,
   MERGES
);

CREATE OPERATOR <> (
   LEFTARG    = invtext,
   RIGHTARG   = invtext,
   NEGATOR    = =,
   COMMUTATOR = <>,
   PROCEDURE  = invtextne,
   RESTRICT   = neqsel,
   JOIN       = neqjoinsel
);

CREATE OPERATOR < (
    LEFTARG    = invtext,
    RIGHTARG   = invtext,
    NEGATOR    = >=,
    COMMUTATOR = >,
    PROCEDURE  = invtext_lt,
    RESTRICT   = scalargtsel,
    JOIN       = scalargtjoinsel
);

CREATE OPERATOR <= (
    LEFTARG    = invtext,
    RIGHTARG   = invtext,
    NEGATOR    = >,
    COMMUTATOR = >=,
    PROCEDURE  = invtext_le,
    RESTRICT   = scalargtsel,
    JOIN       = scalargtjoinsel
);

CREATE OPERATOR >= (
    LEFTARG    = invtext,
    RIGHTARG   = invtext,
    NEGATOR    = <,
    COMMUTATOR = <=,
    PROCEDURE  = invtext_ge,
    RESTRICT   = scalarltsel,
    JOIN       = scalarltjoinsel
);

CREATE OPERATOR > (
    LEFTARG    = invtext,
    RIGHTARG   = invtext,
    NEGATOR    = <=,
    COMMUTATOR = <,
    PROCEDURE  = invtext_gt,
    RESTRICT   = scalarltsel,
    JOIN       = scalarltjoinsel
);

CREATE FUNCTION invtexteq(invtext, invtext) RETURNS boolean