CYBERTEC PostgreSQL Logo

Indexing "LIKE" in PostgreSQL and Oracle

08.2023 / Category: , / Tags: | |

Connecting through a time rift into the AI future: in PostgreSQL v27, LIKE can search for books with similar content (indexing LIKE won't become any easier)
© Laurenz Albe 2023

Unless you use the binary collation, creating a b-tree index to support a LIKE condition in PostgreSQL is not straightforward. This keeps surprising Oracle users, who claim that a regular b-tree index will of course always support LIKE. I decided to explore the differences between Oracle and PostgreSQL when it comes to indexing LIKE conditions. Is Oracle really smarter here?

Indexing LIKE conditions

It is clear that a pattern must not start with a wildcard if we want to support it with a b-tree index. A b-tree index is ordered, and a pattern like %smith could be anywhere in the ordered list. Consider searching a phone book for all names ending in “smith”: you'd have to search all entries to make sure you find all “Blacksmith”, “Hammersmith” and who knows what other “smiths” might be in there. The phone book is a good example for an index: the database could use the b-tree index, but it would have to scan the whole index, and that is not very efficient.

If a pattern does not start with a wildcard, the case should be different. After all, the phone book is efficient if you are searching for all names that start with “black”: you may have to scan several pages, but “Black”, “Blackthorn” and “Blacksmith” are right next to each other in an alphabetically sorted list. So a b-tree index should be helpful in this case.

So what is the difficulty with indexing LIKE conditions in PostgreSQL?

An example to show the difficulties with indexing LIKE in PostgreSQL

We create a table with three columns. The columns will contain the same strings, but I define them with different collations:

The three collations I used are

  • The “binary” collation C, also known as POSIX. It compares strings byte by byte, so it is very fast and never changes.
  • The Czech collation as defined by the operating system's C library.
  • The Czech collation as defined by the ICU library.

Remember that (with the exception of the C collation) PostgreSQL does not implement collations itself. Rather, it uses the collations from the system's C library of the ICU library. See this article for problems that can arise from this design choice.

We inserts a few rows of data:

Then, we'll fill up the table with enough unimportant data so that the optimizer will prefer an index scan if possible:

Finally, let's run VACUUM and ANALYZE to set hint bits and gather statistics:

The good and bad sides of the C (binary) collation

Indexing LIKE is trivial: just create a regular b-tree index:

PostgreSQL scans the index for everything between 'abc' and 'abd' and eliminates false positive results with a filter. (There are no false positive results in our case, but there could be characters after the wildcard.)

While this works great and just as we want it to, the result of ORDER BY is rather poor:

Blech! Not only does the upper-case “N” sort before the lower-case “a”, but “č” sorts after all ASCII characters. So if you need natural language sorting, the C collation is not acceptable. However, if you don't need to sort strings, it is the best collation to use.

The good and bad sides of natural language collations

Let's look at s_icu (s_libc will behave just the same). First, PostgreSQL now sorts the strings correctly:

Please turn your attention to the position of 'abch': it appears after 'abh'! This is the correct way to sort the digraph “ch” in Czech (and that is why I chose Czech for my example). You see that natural language collations can be quite complicated. For details, read Peter Eisentraut's excellent blog series how collation works, how collation of punctuation and whitespace works and overview of ICU collation settings.

This peculiarity has no influence on how LIKE works (which the SQL standard defines in ISO/IEC 9075-2, 8.5 ():

Let's create an index on the column and see if it speeds up the LIKE condition:

PostgreSQL doesn't (and cannot) use the index. Can you guess the reason? If we scanned the index in the same manner as before, we would not find 'abch', because it is sorted elsewhere! With some natural language collations, you cannot find a string in the index if you only know a prefix. Not all collations are as complicated as that, but since PostgreSQL has no deeper insight into the collations (which are defined by external libraries), it has to stay on the safe side and avoid using a natural language index.

How to go about indexing LIKE in PostgreSQL

If you want to both have your cake and eat it, that is, if you want the benefits of a natural language collation and still like to index LIKE conditions, you have to use a PostgreSQL feature. In PostgreSQL you can specify an operator class in an index definition. That way, you can specify the set of comparison operators that the index supports. Now there is a special operator class text_pattern_ops, whose operators compare strings character for character. That is exactly what we need for indexing LIKE:

You can see the “character-wise comparison operators” in the execution plan. But this index can do more than speed up LIKE conditions: you can also use it to speed up equality comparisons. In fact, such an index is typically the only index you need on a string column, unless you have to index ORDER BY, which clearly needs the plain natural language index.

We have seen how to index LIKE in PostgreSQL, now what about Oracle?

An example for indexing LIKE in Oracle

We create a table similar to the test table in PostgreSQL. We have to define the string columns as NOT NULL, otherwise Oracle won't be able to speed up ORDER BY with an index (Oracle does not store index entries where all columns are NULL values):

Oracle uses the binary collation by default. This already goes a long way to explain why people claim that a “simple index” is sufficient for indexing LIKE! But we want to dig deeper, that's why we define two additional columns with the two natural language collations Oracle supports:

  • CZECH is a “poor man's natural language collation” that only supports part of the functionality
  • XCZECH is the collation that behavs as a natural language collation should

In the words of the Oracle documentation:

Some monolingual collations have an extended version that handles special linguistic cases. The name of the extended version is prefixed with the letter X. These special cases typically mean that one character is sorted like a sequence of two characters or a sequence of two characters is sorted as one character. For example, ch and ll are treated as a single character in XSPANISH. Extended monolingual collations may also define special language-specific uppercase and lowercase rules that override standard rules of a character set.

To insert data into Oracle, we can simply use pg_dump to export the PostgreSQL data and use that as an SQL script for Oracle. That guarantees that we have the same data in both systems:

After importing the data into Oracle, we calculate optimizer statistics:

Natural language collations in Oracle and ORDER BY

With the CZECH collation, the result is not satisfactory:

Like the Oracle documentation stated, Oracle does not sort 'abch' correctly. We need the XCZECH collation for that:

If we create an index on s_xczech, that index will speed up the query:

I have omitted some parts for brevity. Oracle seems to implement FETCH FIRST 9 ROWS ONLY using a window function, but I guess it is smart enough not to read the whole index. Oracle calls an index that uses a collation different from the binary one a linguistic index.

Indexing LIKE in Oracle

The suspense is mounting: this is what I originally wanted to investigate. Let's see if Oracle uses the linguistic index:

Oracle actually uses the index! But how is that possible? How can it find 'abch' that way? The mystery is solved as soon as we look at the actual query result:

Ouch! Oracle gave us the wrong result! 'abch' is missing. With the CZECH collation we get the correct result:

So Oracle gives us a choice: either get a correct result for ORDER BY and a wrong result for LIKE using the XCZECH collation, or get a wrong result for ORDER BY and a correct result for LIKE using the CZECH collation. To get the correct result, you'd have to add an explicit COLLATE clause:

But I found no way to create an index in Oracle that can support that query.

Conclusion

We have seen how to index LIKE in PostgreSQL. We have also verified that LIKE really never needs a special index in Oracle. However, that is not because Oracle is smarter than PostgreSQL, but because it does not care all that much about correct results.

To see more posts about the topic of collations, check out the following:

 


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.

One response to “Indexing "LIKE" in PostgreSQL and Oracle”

  1. PostgreSQLs support of trigram indexes, which even support searches where the search string starts with a wildcards, might also be worth mentioning here.

    As far as I'm aware, Oracle has nothing similar.

Leave a Reply

Your email address will not be published. Required fields are marked *

CYBERTEC Logo white
CYBERTEC PostgreSQL International GmbH
Römerstraße 19
2752 Wöllersdorf
Austria

+43 (0) 2622 93022-0
office@cybertec.at

Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram