© Laurenz Albe 2023
Table of Contents
Unless you use the binary collation, creating a b-tree index to support a LIKE
condition in PostgreSQL is not straightforward. This keeps surprising Oracle users, who claim that a regular b-tree index will of course always support LIKE
. I decided to explore the differences between Oracle and PostgreSQL when it comes to indexing LIKE
conditions. Is Oracle really smarter here?
LIKE
conditionsIt is clear that a pattern must not start with a wildcard if we want to support it with a b-tree index. A b-tree index is ordered, and a pattern like %smith
could be anywhere in the ordered list. Consider searching a phone book for all names ending in “smith”: you'd have to search all entries to make sure you find all “Blacksmith”, “Hammersmith” and who knows what other “smiths” might be in there. The phone book is a good example for an index: the database could use the b-tree index, but it would have to scan the whole index, and that is not very efficient.
If a pattern does not start with a wildcard, the case should be different. After all, the phone book is efficient if you are searching for all names that start with “black”: you may have to scan several pages, but “Black”, “Blackthorn” and “Blacksmith” are right next to each other in an alphabetically sorted list. So a b-tree index should be helpful in this case.
So what is the difficulty with indexing LIKE
conditions in PostgreSQL?
LIKE
in PostgreSQLWe create a table with three columns. The columns will contain the same strings, but I define them with different collations:
1 2 3 4 5 6 |
CREATE TABLE likeme ( id integer PRIMARY KEY, s_binary text COLLATE 'C' NOT NULL, s_libc text COLLATE 'cs_CZ.utf8' NOT NULL, s_icu text COLLATE 'cs-CZ-x-icu' NOT NULL ); |
The three collations I used are
C
, also known as POSIX
. It compares strings byte by byte, so it is very fast and never changes.Remember that (with the exception of the C
collation) PostgreSQL does not implement collations itself. Rather, it uses the collations from the system's C library of the ICU library. See this article for problems that can arise from this design choice.
We inserts a few rows of data:
1 2 3 4 5 6 7 8 9 10 |
INSERT INTO likeme (id, s_binary, s_libc, s_icu) VALUES (1, 'abc', 'abc', 'abc'), (2, 'abč', 'abč', 'abč'), (3, 'abb', 'abb', 'abb'), (4, 'abd', 'abd', 'abd'), (5, 'abcm', 'abcm', 'abcm'), (6, 'abch', 'abch', 'abch'), (7, 'abh', 'abh', 'abh'), (8, 'abz', 'abz', 'abz'), (9, 'Nesmysl', 'Nesmysl', 'Nesmysl'); |
Then, we'll fill up the table with enough unimportant data so that the optimizer will prefer an index scan if possible:
1 2 3 |
INSERT INTO likeme (id, s_binary, s_libc, s_icu) SELECT i, 'z'||i, 'z'||i, 'z'||i FROM generate_series(10, 10000) AS i; |
Finally, let's run VACUUM
and ANALYZE
to set hint bits and gather statistics:
1 |
VACUUM (ANALYZE) likeme; |
C
(binary) collationIndexing LIKE
is trivial: just create a regular b-tree index:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
CREATE INDEX binary_ind ON likeme (s_binary); EXPLAIN (COSTS OFF) SELECT s_binary FROM likeme WHERE s_binary LIKE 'abc%'; QUERY PLAN ════════════════════════════════════════════════════════════════════════ Index Scan using binary_ind on likeme Index Cond: ((s_binary >= 'abc'::text) AND (s_binary < 'abd'::text)) Filter: (s_binary ~~ 'abc%'::text) (3 rows) |
PostgreSQL scans the index for everything between 'abc'
and 'abd'
and eliminates false positive results with a filter. (There are no false positive results in our case, but there could be characters after the wildcard.)
While this works great and just as we want it to, the result of ORDER BY
is rather poor:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
SELECT s_binary FROM likeme ORDER BY s_binary FETCH FIRST 9 ROWS ONLY; s_binary ══════════ Nesmysl abb abc abch abcm abd abh abz abč (9 rows) |
Blech! Not only does the upper-case “N” sort before the lower-case “a”, but “č” sorts after all ASCII characters. So if you need natural language sorting, the C
collation is not acceptable. However, if you don't need to sort strings, it is the best collation to use.
Let's look at s_icu
(s_libc
will behave just the same). First, PostgreSQL now sorts the strings correctly:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
SELECT s_icu FROM likeme ORDER BY s_icu FETCH FIRST 9 ROWS ONLY; s_icu ═════════ abb abc abcm abč abd abh abch abz Nesmysl (9 rows) |
Please turn your attention to the position of 'abch'
: it appears after 'abh'
! This is the correct way to sort the digraph “ch” in Czech (and that is why I chose Czech for my example). You see that natural language collations can be quite complicated. For details, read Peter Eisentraut's excellent blog series how collation works, how collation of punctuation and whitespace works and overview of ICU collation settings.
This peculiarity has no influence on how LIKE
works (which the SQL standard defines in ISO/IEC 9075-2, 8.5 (
):
1 2 3 4 5 6 7 8 |
SELECT s_icu FROM likeme WHERE s_icu LIKE 'abc%'; s_icu ═══════ abc abch abcm (3 rows) |
Let's create an index on the column and see if it speeds up the LIKE
condition:
1 2 3 4 5 6 7 8 9 10 11 12 |
CREATE INDEX icu_ind ON likeme (s_icu); EXPLAIN (COSTS OFF) SELECT s_icu FROM likeme WHERE s_icu LIKE 'abc%'; QUERY PLAN ═══════════════════════════════════ Seq Scan on likeme Filter: (s_icu ~~ 'abc%'::text) (2 rows) |
PostgreSQL doesn't (and cannot) use the index. Can you guess the reason? If we scanned the index in the same manner as before, we would not find 'abch'
, because it is sorted elsewhere! With some natural language collations, you cannot find a string in the index if you only know a prefix. Not all collations are as complicated as that, but since PostgreSQL has no deeper insight into the collations (which are defined by external libraries), it has to stay on the safe side and avoid using a natural language index.
LIKE
in PostgreSQLIf you want to both have your cake and eat it, that is, if you want the benefits of a natural language collation and still like to index LIKE
conditions, you have to use a PostgreSQL feature. In PostgreSQL you can specify an operator class in an index definition. That way, you can specify the set of comparison operators that the index supports. Now there is a special operator class text_pattern_ops
, whose operators compare strings character for character. That is exactly what we need for indexing LIKE
:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
CREATE INDEX icu_like_idx ON likeme (s_icu text_pattern_ops); EXPLAIN (COSTS OFF) SELECT s_icu FROM likeme WHERE s_icu LIKE 'abc%'; QUERY PLAN ══════════════════════════════════════════════════════════════════════ Index Only Scan using icu_like_idx on likeme Index Cond: ((s_icu ~>=~ 'abc'::text) AND (s_icu ~<~ 'abd'::text)) Filter: (s_icu ~~ 'abc%'::text) (3 rows) |
You can see the “character-wise comparison operators” in the execution plan. But this index can do more than speed up LIKE
conditions: you can also use it to speed up equality comparisons. In fact, such an index is typically the only index you need on a string column, unless you have to index ORDER BY
, which clearly needs the plain natural language index.
We have seen how to index LIKE
in PostgreSQL, now what about Oracle?
LIKE
in OracleWe create a table similar to the test table in PostgreSQL. We have to define the string columns as NOT NULL
, otherwise Oracle won't be able to speed up ORDER BY
with an index (Oracle does not store index entries where all columns are NULL values):
1 2 3 4 5 6 |
CREATE TABLE likeme ( id NUMBER(5) CONSTRAINT likeme_pkey PRIMARY KEY, s_binary VARCHAR2(100 CHAR) NOT NULL, s_czech VARCHAR2(100 CHAR) COLLATE CZECH NOT NULL, s_xczech VARCHAR2(100 CHAR) COLLATE XCZECH NOT NULL ); |
Oracle uses the binary collation by default. This already goes a long way to explain why people claim that a “simple index” is sufficient for indexing LIKE
! But we want to dig deeper, that's why we define two additional columns with the two natural language collations Oracle supports:
CZECH
is a “poor man's natural language collation” that only supports part of the functionalityXCZECH
is the collation that behavs as a natural language collation shouldIn the words of the Oracle documentation:
Some monolingual collations have an extended version that handles special linguistic cases. The name of the extended version is prefixed with the letter
X
. These special cases typically mean that one character is sorted like a sequence of two characters or a sequence of two characters is sorted as one character. For example,ch
andll
are treated as a single character inXSPANISH
. Extended monolingual collations may also define special language-specific uppercase and lowercase rules that override standard rules of a character set.
To insert data into Oracle, we can simply use pg_dump
to export the PostgreSQL data and use that as an SQL script for Oracle. That guarantees that we have the same data in both systems:
1 |
pg_dump --inserts --data-only --table=likeme --file=import.sql dbname |
After importing the data into Oracle, we calculate optimizer statistics:
1 |
ANALYZE TABLE likeme COMPUTE STATISTICS; |
ORDER BY
With the CZECH
collation, the result is not satisfactory:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
SELECT s_czech FROM likeme ORDER BY s_czech FETCH FIRST 9 ROWS ONLY; S_CZECH ------- abb abc abch abcm abč abd abh abz Nesmysl 9 rows selected. |
Like the Oracle documentation stated, Oracle does not sort 'abch'
correctly. We need the XCZECH
collation for that:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
SELECT s_xczech FROM likeme ORDER BY s_xczech FETCH FIRST 9 ROWS ONLY; S_XCZECH -------- abb abc abcm abč abd abh abch abz Nesmysl 9 rows selected. |
If we create an index on s_xczech
, that index will speed up the query:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
CREATE INDEX xczech_ind ON likeme (s_xczech); Index created. SET AUTOTRACE TRACEONLY SELECT s_xczech FROM likeme ORDER BY s_xczech FETCH FIRST 9 ROWS ONLY; Execution Plan -------------- --------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | --------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 9 | 3753 | | 1 | SORT ORDER BY | | 9 | 3753 | |* 2 | VIEW | | 9 | 3753 | |* 3 | WINDOW NOSORT STOPKEY | | 10000 | 17M| | 4 | TABLE ACCESS BY INDEX ROWID| LIKEME | 10000 | 17M| | 5 | INDEX FULL SCAN | XCZECH_IND | 10000 | | --------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - filter('from$_subquery$_002'.'rowlimit_$$_rownumber'<=9) 3 - filter(ROW_NUMBER() OVER ( ORDER BY 'S_XCZECH')<=9) |
I have omitted some parts for brevity. Oracle seems to implement FETCH FIRST 9 ROWS ONLY
using a window function, but I guess it is smart enough not to read the whole index. Oracle calls an index that uses a collation different from the binary one a linguistic index.
LIKE
in OracleThe suspense is mounting: this is what I originally wanted to investigate. Let's see if Oracle uses the linguistic index:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
SELECT s_xczech FROM likeme WHERE s_xczech LIKE 'abc%'; Execution Plan -------------- -------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | -------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 1809 | |* 1 | TABLE ACCESS BY INDEX ROWID BATCHED| LIKEME | 1 | 1809 | |* 2 | INDEX RANGE SCAN | XCZECH_IND | 45 | | -------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter(INSTR('S_XCZECH','abc',1,1)=1) 2 - access(NLSSORT('S_XCZECH','nls_sort=''XCZECH''')>=HEXTORAW('14191E0001010100') AND NLSSORT('S_XCZECH','nls_sort=''XCZECH''')<HEXTORAW('14191F0001010100')) |
Oracle actually uses the index! But how is that possible? How can it find 'abch'
that way? The mystery is solved as soon as we look at the actual query result:
1 2 3 4 5 6 7 8 9 10 |
SET AUTOTRACE OFF SELECT s_xczech FROM likeme WHERE s_xczech LIKE 'abc%'; S_XCZECH -------- abc abcm |
Ouch! Oracle gave us the wrong result! 'abch'
is missing. With the CZECH
collation we get the correct result:
1 2 3 4 5 6 7 8 9 |
SELECT s_czech FROM likeme WHERE s_czech LIKE 'abc%'; S_CZECH ------- abc abch abcm |
So Oracle gives us a choice: either get a correct result for ORDER BY
and a wrong result for LIKE
using the XCZECH
collation, or get a wrong result for ORDER BY
and a correct result for LIKE
using the CZECH
collation. To get the correct result, you'd have to add an explicit COLLATE
clause:
1 2 3 |
SELECT s_xczech FROM likeme WHERE (s_xczech COLLATE CZECH) LIKE 'abc%'; |
But I found no way to create an index in Oracle that can support that query.
We have seen how to index LIKE
in PostgreSQL. We have also verified that LIKE
really never needs a special index in Oracle. However, that is not because Oracle is smarter than PostgreSQL, but because it does not care all that much about correct results.
To see more posts about the topic of collations, check out the following:
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.
+43 (0) 2622 93022-0
office@cybertec.at
You are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
PostgreSQLs support of trigram indexes, which even support searches where the search string starts with a wildcards, might also be worth mentioning here.
As far as I'm aware, Oracle has nothing similar.