CYBERTEC PostgreSQL Logo

UNION ALL, data types and performance

12.2022 / Category: / Tags: | | |
Cartoon about UNION ALL in PostgreSQL and type matching
© Laurenz Albe 2022

 

A while ago, I wrote about the performance impact of query parameter data types. Recently I encountered a similar problem with UNION ALL that baffled me for a while, so I'd like to share it with you.

UNION ALL to implement polymorphism

Different entities can have something in common. For example, a customer relationship management system has to deal with customers and employees, both of which are people with a name, birthday and other common attributes. Such polymorphism is easy to model in object oriented programming, but notoriously hard to implement in relational databases. One way to approach the problem is to have a separate table for each entity and to model the “superclass” with UNION ALL:

Let's fill the tables with random data so that we can test the performance:

A surprising performance difference between the UNION ALL views

We want to lookup animals by joining with a small lookup table:

Now let's check the join performance with both views:

Oops! What's wrong with those flying animals?

Investigating the difference

Let's look at the respective execution plans:

The query on mammal works just the way we'd like it: it performs a nested loop join, using the primary key indexes on bat and cat to quickly retrieve the few rows we need. Contrariwise, the query on flying_animal scans all 2 million rows from bird and bat to perform a hash join.

Can you spot the difference?

Investigating the difference between the UNION ALL queries

To get the fast execution plan, PostgreSQL has to propagate the join condition into the UNION ALL (which turns into “Append” in the execution plan). Looking closer, we notice the strange “Subquery Scan on "*SELECT* 2"” in the slow query. Indeed, the branches of UNION ALL are subqueries. But usually the PostgreSQL optimizer “pulls up” subqueries into the main join tree wherever possible. Let's read the source in pull_up_subqueries_recurse() in src/backend/optimizer/prep/prepjointree.c:

The function comment of is_simple_union_all() tells us more:

That gives us the decisive clue. Looking at the table definitions, we notice that wingspan has data type real in bird, but numeric in bat. The reason why the optimizer cannot flatten the UNION ALL properly is this difference in the data types. This preserves the subquery, which prevents the optimizer from finding the ideal execution plan.

The surprising thing that makes it difficult to reach that conclusion is that the column that caused the problem didn't play an important role in the query. We could even omit it from the SELECT list, and we'd still get the bad execution plan!

UNION ALL and data types

Both branches of a UNION must have the same number of columns, and the columns must have compatible data types. The documentation describes the exact meaning of “compatible”. Essentially, if the data types differ, PostgreSQL chooses the type for which there is an implicit conversion from the other type (the PostgreSQL jargon for this is a coercion). If there is an implicit conversion in either direction, PostgreSQL prefers the data type from the first branch of the UNION.

In our case, there is an implicit cast from numeric to real, but not the other way around, so wingspan is of data type real in flying_animals. While this allowed us to create the view without problems, it caused the performance problem we observed.

Explicit type casts as a solution for the performance problem

There is a simple way to make sure that the data types on both sides of the UNION ALL are identical: apply an explicit type cast! So we can redefine the view as

And, sure enough, we end up with a fast execution plan:

Conclusion

While it is convenient that PostgreSQL allows us to use slightly different data types on both sides of a UNION ALL, we saw that it can lead to bad execution plans and bad performance. To be on the safe side, apply explicit type casts to render the data types identical.

Perhaps it would be possible to make the optimizer smarter in cases like the one we investigated. However, it is easy to work around this limitation.

If you've enjoyed reading this blog, check out my blog about rewriting OR to UNION in PostgreSQL queries.

One response to “UNION ALL, data types and performance”

  1. This solved a big headache for me. I had a UNION ALL of three queries and it took 13 minutes to get the result. After making all the types the same, the result came back in .... 348ms. Thanks for great blog.

Leave a Reply

Your email address will not be published. Required fields are marked *

CYBERTEC Logo white
Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram