Keyset pagination is the most performant way to retrieve a large result set page by page. However, the neat trick with composite type comparison doesn't always work. This article explains why and how you can work around that shortcoming.
Table of Contents
We create a table with a million rows:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
CREATE TABLE sortme ( id bigint PRIMARY KEY, val1 integer NOT NULL, val2 timestamp with time zone NOT NULL, val3 text NOT NULL ); -- deterministic, but sort of random values INSERT INTO sortme SELECT i, abs(hashint8(i)) % 200 + 1, TIMESTAMPTZ '2024-01-01 00:00:00' + (abs(hashint8(-i)) % 200) * INTERVAL '1 hour', substr(md5(i::text), 1, 2) FROM generate_series(1, 1000000) AS i; -- get statistics and set hint bits VACUUM (ANALYZE) sortme; |
If we want to ORDER BY val1, val2
, the first query would look like
1 2 3 4 |
SELECT val1, val2, val3, id FROM sortme ORDER BY val1, val2, id LIMIT 50; |
The primary key id
was added to the result list and the ORDER BY
clause to provide uniqueness and thereby guarantee a stable order. Assuming that the last row returned from that query was (1, '2024-01-01 01:00:00+01', 'e3', 920198)
, the query for the next page would be
1 2 3 4 5 |
SELECT val1, val2, val3, id FROM sortme WHERE (val1, val2, id) > (1, '2024-01-01 01:00:00+01', 920198) ORDER BY val1, val2, id LIMIT 50; |
With a multi-column index on (val1, val2, id)
, both queries would be lightning fast, as would be the queries for all following pages.
Matters become more complicated as soon as we need descending sort order. A fairly simple case is if we need to sort the result by a single colum (for example, val3
). Again, we need the unique column id
as a tie-breaker. The solution is to use descending sort order for both columns:
1 2 3 4 5 6 7 8 9 10 11 12 |
-- first page SELECT val1, val2, val3, id FROM sortme ORDER BY val3 DESC, id DESC LIMIT 50; -- next page SELECT val1, val2, val3, id FROM sortme WHERE (val3, id) < ('ff', 986821) ORDER BY val3 DESC, id DESC LIMIT 50; |
This would scan an index on (val3, id)
in reverse order and be just as efficient as before.
However, if we want to ORDER BY val3, val1 DESC
, the above trick will no longer work. Finding the first 50 results is no problem, but we hit an impasse when we want to fetch the second page. We'd have to compare the triple (val3, val1, id)
with the last row from the first page, but neither comparison operator will do: for the first element, we'd need to compare with >
, but for the second element we need <
.
1 2 3 4 5 6 |
-- how to get the next page? SELECT val1, val2, val3, id FROM sortme WHERE (val3, val1, id) ??? ('00', 198, 219956) ORDER BY val3, val1 DESC, id LIMIT 50; |
It seems that we cannot use keyset pagination in a case like this. But, as the following sections will show, there are ways to get keyset pagination to work with mixed sorting order.
The simplest trick to get the above example working is to use -val1
to turn the descending sort order into an ascending search order:
1 2 3 4 5 6 7 8 9 10 11 12 |
-- first page SELECT val1, val2, val3, id FROM sortme ORDER BY val3, -val1, id LIMIT 50; -- next page SELECT val1, val2, val3, id FROM sortme WHERE (val3, -val1, id) > ('00', -198, 219956) ORDER BY val3, -val1, id LIMIT 50; |
The index you need to support this query is
1 |
CREATE INDEX ON sortme (val3, (-val1), id); |
The above trick is simple, but will only work for data types for which you can turn descending order into ascending order. Numbers are easy. To make the trick work with val2
, which is a timestamp with time zone
, we have to work harder.
There is no “unary minus” operator for timestamp with time zone
, but we can extract the number of seconds since the Unix epoch:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
-- first page SELECT val1, val2, val3, - EXTRACT (epoch FROM val2), id FROM sortme ORDER BY val3, - EXTRACT (epoch FROM val2), id LIMIT 50; -- next page SELECT val1, val2, val3, - EXTRACT (epoch FROM val2), id FROM sortme WHERE (val3, - EXTRACT (epoch FROM val2), id) > ('00', -1704772800.000000, 204901) ORDER BY val3, - EXTRACT (epoch FROM val2), id LIMIT 50; |
This looks ugly, but is fairly straightforward. However, we'll get a surprise if we try to create the necessary index:
1 2 |
CREATE INDEX ON sortme (val3, (- EXTRACT (epoch FROM val2)), id); ERROR: functions in index expression must be marked IMMUTABLE |
The reason why EXTRACT
for timestamp with time zone
isn't immutable is that for some arguments, the result depends on the current setting of timezone
. However, extracting the Unix epoch is immutable, so we can safely work around the problem by defining an IMMUTABLE
function for that purpose:
1 2 3 4 5 |
CREATE FUNCTION get_epoch(timestamp with time zone) RETURNS numeric IMMUTABLE RETURN EXTRACT (epoch FROM $1); CREATE INDEX ON sortme (val3, (- get_epoch(val2)), id); |
To make use of the index, we have to use the function in our queries as well.
Getting the negative value to turn the comparison operator from <
to >
is a good trick, but it won't work for all data types. For example, you cannot get a “negative value” for a string, so our trick won't work for the column val3
.
A second approach to solving the problem of descending sort order is to avoid composite value comparison at all. For example, if you want to ORDER BY val2, val3 DESC
, you could write the query for the next page as
1 2 3 4 5 6 7 8 9 10 |
SELECT val1, val2, val3, id FROM sortme WHERE val2 > '2024-01-01 00:00:00+01' OR val2 = '2024-01-01 00:00:00+01' AND val3 < 'fc' OR val2 = '2024-01-01 00:00:00+01' AND val3 = 'fc' AND id > 173511 ORDER BY val2, val3 DESC, id LIMIT 50; |
That query is technically correct, but the WHERE
condition contains OR
, which is notoriously bad for performance. The query won't use an index defined on the ORDER BY
clause effectively. But we can split the query into three simpler subqueries:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
(SELECT val1, val2, val3, id FROM sortme WHERE val2 > '2024-01-01 00:00:00+01' ORDER BY val2, val3 DESC, id LIMIT 50) UNION ALL (SELECT val1, val2, val3, id FROM sortme WHERE val2 = '2024-01-01 00:00:00+01' AND val3 < 'fc' ORDER BY val2, val3 DESC, id LIMIT 50) UNION ALL (SELECT val1, val2, val3, id FROM sortme WHERE val2 = '2024-01-01 00:00:00+01' AND val3 = 'fc' AND id > 173511 ORDER BY val2, val3 DESC, id LIMIT 50) ORDER BY val2, val3 DESC, id LIMIT 50; |
This query looks ugly, but each of the three branches fetches 50 rows using an index on (val2, val3 DESC, id)
, so the query will be quite fast. Here is the execution plan:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
Limit -> Merge Append Sort Key: sortme.val2, sortme.val3 DESC, sortme.id -> Limit -> Index Scan using sortme_val2_val3_id_idx on sortme Index Cond: (val2 > '2024-01-01 00:00:00+01'::timestamp with time zone) -> Sort Sort Key: sortme_1.val2, sortme_1.val3 DESC, sortme_1.id -> Limit -> Index Scan using sortme_val2_val3_id_idx on sortme sortme_1 Index Cond: ((val2 = '2024-01-01 00:00:00+01'::timestamp with time zone) AND (val3 < 'fc'::text)) -> Sort Sort Key: sortme_2.val2, sortme_2.val3 DESC, sortme_2.id -> Limit -> Index Scan using sortme_val2_val3_id_idx on sortme sortme_2 Index Cond: ((val2 = '2024-01-01 00:00:00+01'::timestamp with time zone) AND (val3 = 'fc'::text) AND (id > 173511)) |
Finally, you can solve the problem of the descending sorting order by defining a data type that is just like a normal data type, but the comparison operators are switched around. For that, you need to define a new base type. You don't need to write any C code, since we are just reusing the functions from the standard data types. That requires fairly deep knowledge of PostgreSQL and superuser access, so it is not something you can do in a hosted database. I doubt that many users will want to go that way, but it is a nice exercise of PostgreSQL's extensibility features.
For our experiment, we want to use the same ORDER BY
clause as in the previous section, so we will define a data type invtext
that sorts inversely to text
.
Since the type input and output functions reference the data type and vice versa, we have to define a shell type first.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 |
-- the real type definition will come later CREATE TYPE invtext; CREATE FUNCTION invtextin(cstring) RETURNS invtext LANGUAGE internal IMMUTABLE STRICT PARALLEL SAFE AS 'textin'; CREATE FUNCTION invtextout(invtext) RETURNS cstring LANGUAGE internal IMMUTABLE STRICT PARALLEL SAFE AS 'textout'; CREATE FUNCTION invtextrecv(internal) RETURNS invtext LANGUAGE internal STABLE STRICT PARALLEL SAFE AS 'textrecv'; CREATE FUNCTION invtextsend(invtext) RETURNS bytea LANGUAGE internal STABLE STRICT PARALLEL SAFE AS 'textsend'; -- now we can create the type for real CREATE TYPE invtext ( INPUT = invtextin, OUTPUT = invtextout, RECEIVE = invtextrecv, SEND = invtextsend, INTERNALLENGTH = VARIABLE, STORAGE = extended, CATEGORY = 'S', PREFERRED = false, COLLATABLE = true ); CREATE CAST (text AS invtext) WITHOUT FUNCTION; CREATE CAST (invtext AS text) WITHOUT FUNCTION; CREATE CAST (character varying AS invtext) WITHOUT FUNCTION; |
So far, our type is just like text
. The type casts in the end allow us to convert other string data types to invchar
. I won't bother to define a cast from character
, since that is a data type you shouldn't use anyway. The cast from invtext
to text
is only needed for the b-tree comparison function further down.
This is where invtext
differs from text
. We are reusing the PostgreSQL core functions from the data type text
, but we switch around the order of comparison:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 |
CREATE FUNCTION invtexteq(invtext, invtext) RETURNS boolean LANGUAGE internal IMMUTABLE STRICT PARALLEL SAFE LEAKPROOF AS 'texteq'; CREATE FUNCTION invtextne(invtext, invtext) RETURNS boolean LANGUAGE internal IMMUTABLE STRICT PARALLEL SAFE LEAKPROOF AS 'textne'; CREATE FUNCTION invtext_lt(invtext, invtext) RETURNS boolean LANGUAGE internal IMMUTABLE STRICT PARALLEL SAFE LEAKPROOF AS 'text_gt'; CREATE FUNCTION invtext_le(invtext, invtext) RETURNS boolean LANGUAGE internal IMMUTABLE STRICT PARALLEL SAFE LEAKPROOF AS 'text_ge'; CREATE FUNCTION invtext_gt(invtext, invtext) RETURNS boolean LANGUAGE internal IMMUTABLE STRICT PARALLEL SAFE LEAKPROOF AS 'text_lt'; CREATE FUNCTION invtext_ge(invtext, invtext) RETURNS boolean LANGUAGE internal IMMUTABLE STRICT PARALLEL SAFE LEAKPROOF AS 'text_le'; CREATE OPERATOR = ( LEFTARG = invtext, RIGHTARG = invtext, COMMUTATOR = =, NEGATOR = <>, PROCEDURE = invtexteq, RESTRICT = eqsel, JOIN = eqjoinsel, HASHES, MERGES ); CREATE OPERATOR <> ( LEFTARG = invtext, RIGHTARG = invtext, NEGATOR = =, COMMUTATOR = <>, PROCEDURE = invtextne, RESTRICT = neqsel, JOIN = neqjoinsel ); CREATE OPERATOR < ( LEFTARG = invtext, RIGHTARG = invtext, NEGATOR = >=, COMMUTATOR = >, PROCEDURE = invtext_lt, RESTRICT = scalargtsel, JOIN = scalargtjoinsel ); CREATE OPERATOR <= ( LEFTARG = invtext, RIGHTARG = invtext, NEGATOR = >, COMMUTATOR = >=, PROCEDURE = invtext_le, RESTRICT = scalargtsel, JOIN = scalargtjoinsel ); CREATE OPERATOR >= ( LEFTARG = invtext, RIGHTARG = invtext, NEGATOR = <, COMMUTATOR = <=, PROCEDURE = invtext_ge, RESTRICT = scalarltsel, JOIN = scalarltjoinsel ); CREATE OPERATOR > ( LEFTARG = invtext, RIGHTARG = invtext, NEGATOR = <=, COMMUTATOR = <, PROCEDURE = invtext_gt, RESTRICT = scalarltsel, JOIN = scalarltjoinsel ); |
Finally, we need a comparison function and an operator class so that we can use b-tree indexes on the new data type. Again, we reverse the direction of comparison.
1 2 3 4 5 6 7 8 9 10 11 12 |
CREATE FUNCTION btinvtextcmp(invtext, invtext) RETURNS integer LANGUAGE SQL STRICT IMMUTABLE PARALLEL SAFE LEAKPROOF COST 10 RETURN pg_catalog.bttextcmp($2::text, $1::text); CREATE OPERATOR CLASS invtext_ops DEFAULT FOR TYPE invtext USING btree AS OPERATOR 1 <(invtext, invtext), OPERATOR 2 <=(invtext, invtext), OPERATOR 3 =(invtext, invtext), OPERATOR 4 >=(invtext, invtext), OPERATOR 5 >(invtext, invtext), FUNCTION 1 btinvtextcmp(invtext, invtext); |
Using our new data type, keyset pagination with the order val2, val3 DESC
becomes simple:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
-- first page SELECT val1, val2, val3, id FROM sortme ORDER BY val2, val3::invtext, id LIMIT 50; -- second page SELECT val1, val2, val3, id FROM sortme WHERE (val2, val3::invtext, id) > ('2024-01-01 00:00:00+01', 'fc', 173511) ORDER BY val2, val3::invtext, id LIMIT 50; |
The index to support these queries is
1 |
CREATE INDEX ON sortme (val2, (val3::invtext), id); |
We have seen three tricks to get keyset pagination to work efficiently if some columns are sorted in ascending order and others in descending order:
The first trick is the simpleast, but works only with some data types. The second trick is a bit more complicated, but works everywhere. The third trick is cute, but perhaps too much black magic for practical use.
+43 (0) 2622 93022-0
office@cybertec.at
You are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Leave a Reply