Understanding recursive queries in PostgreSQL

08.2020 | Category: | Tags: development , sql help , table

08.2020

Category:

recursive queries misunderstood — © Laurenz Albe 2020

Table of Contents

Many people consider recursive queries a difficult topic. Still, they enable you to do things that would otherwise be impossible in SQL.

This articles gives a simple introduction with examples and shows the differences to Oracle's implementation of recursive queries.

Common table expressions (`WITH` clauses)

A common table expression (CTE) can be seen as a view that is only valid for a single query:

WITH ctename AS (
   SELECT ...
)
SELECT ...
FROM ctename ...

WITH ctename AS (

SELECT ...

)

SELECT ...

FROM ctename ...

This could also be written as a subquery in FROM, but there are some advantages to using CTEs:

The query becomes more readable.
You can reference the CTE several times in the query, and it will be calculated only once.
You can use data modifying statements in the CTE (typically with a RETURNING clause).

Note that before v12, PostgreSQL always materialized CTEs. That means, the CTE was calculated independently from the containing query. From v12 on, CTEs can be “inlined” into the query, which provides further optimization potential.

CTEs (as well as the recursive CTEs mentioned below) are defined in the SQL standard, although PostgreSQL does not implement the SEARCH and CYCLE clause.

Syntax of recursive queries

Recursive queries are written using recursive CTEs, that are CTEs containing the RECURSIVE keyword:

WITH RECURSIVE ctename AS (
   SELECT /* non-recursive branch, cannot reference 'ctename' */
   UNION [ALL]
   SELECT /* recursive branch referencing 'ctename' */
)
SELECT ...
FROM ctename ...

WITH RECURSIVE ctename AS (

SELECT /* non-recursive branch, cannot reference 'ctename' */

UNION [ALL]

SELECT /* recursive branch referencing 'ctename' */

)

SELECT ...

FROM ctename ...

How recursive queries are processed

PostgreSQL internally uses a working table to process recursive CTEs. This processing is not really recursive, but rather iterative:

First, the working table is initialized by executing the non-recursive branch of the CTE. The result of the CTE is also initialized with this result set. If the recursive CTE uses UNION rather than UNION ALL, duplicate rows are removed.

Then, PostgreSQL repeats the following until the working table is empty:

Evaluate the recursive branch of the CTE, replacing the reference to the CTE with the working table.
Add all resulting rows to the CTE result. If UNION is used to combine the branches, discard duplicate rows.
Replace the working table with all new rows from the previous step (excluding any removed duplicates).

Note that the self-referential branch of the CTE is not executed using the complete CTE result so far, but only the rows that are new since the previous iteration (the working table).

You have to be aware of the danger of an endless loop here:

If the iteration never ends, the query will just keep running until the result table becomes big enough to cause an error. There are two ways to deal with that:

Often you can avoid infinite recursion by using UNION, which removes duplicate result rows (but of course requires extra processing effort).
Another way is to place a LIMIT clause on the query that uses the CTE, because PostgreSQL stops processing if the recursive CTE has calculated as many rows as are fetched by the parent query. Note that this technique is not portable to other standard compliant databases.

A simple example

Let's assume a self-referential table like this:

TABLE emp;

 empno | ename  |    job    | mgr  |  hiredate  |   sal   |  comm   | deptno 
-------+--------+-----------+------+------------+---------+---------+--------
  7839 | KING   | PRESIDENT |      | 1981-11-17 | 5000.00 |         |     10
  7698 | BLAKE  | MANAGER   | 7839 | 1981-05-01 | 2850.00 |         |     30
  7782 | CLARK  | MANAGER   | 7839 | 1981-06-09 | 2450.00 |         |     10
  7566 | JONES  | MANAGER   | 7839 | 1981-04-02 | 2975.00 |         |     20
  7902 | FORD   | ANALYST   | 7566 | 1981-12-03 | 3000.00 |         |     20
  7369 | SMITH  | CLERK     | 7902 | 1980-12-17 |  800.00 |         |     20
  7499 | ALLEN  | SALESMAN  | 7698 | 1981-02-20 | 1600.00 |  300.00 |     30
  7521 | WARD   | SALESMAN  | 7698 | 1981-02-22 | 1250.00 |  500.00 |     30
  7654 | MARTIN | SALESMAN  | 7698 | 1981-09-28 | 1250.00 | 1400.00 |     30
  7844 | TURNER | SALESMAN  | 7698 | 1981-09-08 | 1500.00 |    0.00 |     30
  7900 | JAMES  | CLERK     | 7698 | 1981-12-03 |  950.00 |         |     30
  7934 | MILLER | CLERK     | 7782 | 1982-01-23 | 1300.00 |         |     10
(12 rows)

TABLE emp;

-------+--------+-----------+------+------------+---------+---------+--------

7839 | KING | PRESIDENT | | 1981-11-17 | 5000.00 | | 10

7698 | BLAKE | MANAGER | 7839 | 1981-05-01 | 2850.00 | | 30

7782 | CLARK | MANAGER | 7839 | 1981-06-09 | 2450.00 | | 10

7566 | JONES | MANAGER | 7839 | 1981-04-02 | 2975.00 | | 20

7902 | FORD | ANALYST | 7566 | 1981-12-03 | 3000.00 | | 20

7369 | SMITH | CLERK | 7902 | 1980-12-17 | 800.00 | | 20

7499 | ALLEN | SALESMAN | 7698 | 1981-02-20 | 1600.00 | 300.00 | 30

7521 | WARD | SALESMAN | 7698 | 1981-02-22 | 1250.00 | 500.00 | 30

7654 | MARTIN | SALESMAN | 7698 | 1981-09-28 | 1250.00 | 1400.00 | 30

7844 | TURNER | SALESMAN | 7698 | 1981-09-08 | 1500.00 | 0.00 | 30

7900 | JAMES | CLERK | 7698 | 1981-12-03 | 950.00 | | 30

7934 | MILLER | CLERK | 7782 | 1982-01-23 | 1300.00 | | 10

(12 rows)

We want to find all the subordinates of person 7566, including the person itself.

The non-recursive branch of the query will be:

SELECT empno, ename
FROM emp
WHERE empno = 7566;

SELECT empno, ename

FROM emp

WHERE empno = 7566;

The recursive branch will find all subordinates of all entries in the working table:

SELECT emp.empno, emp.ename
FROM emp
   JOIN ctename ON emp.mgr = ctename.empno;

SELECT emp.empno, emp.ename

FROM emp

JOIN ctename ON emp.mgr = ctename.empno;

We can assume that the dependencies contain no cycles (nobody is his or her own manager, directly or indirectly). So we can combine the queries with UNION ALL, because no duplicates can occur.

So our complete query will be:

WITH RECURSIVE ctename AS (
      SELECT empno, ename
      FROM emp
      WHERE empno = 7566
   UNION ALL
      SELECT emp.empno, emp.ename
      FROM emp
         JOIN ctename ON emp.mgr = ctename.empno
)
SELECT * FROM ctename;

 empno | ename 
-------+-------
  7566 | JONES
  7902 | FORD
  7369 | SMITH
(3 rows)

WITH RECURSIVE ctename AS (

SELECT empno, ename

FROM emp

WHERE empno = 7566

UNION ALL

SELECT emp.empno, emp.ename

FROM emp

JOIN ctename ON emp.mgr = ctename.empno

)

SELECT * FROM ctename;

empno | ename

-------+-------

7566 | JONES

7902 | FORD

7369 | SMITH

(3 rows)

Adding generated columns

Sometimes you want to add more information, like the hierarchical level. You can do that by adding the starting level as a constant in the non-recursive branch. In the recursive branch you simply add 1 to the level:

WITH RECURSIVE ctename AS (
      SELECT empno, ename,
             0 AS level
      FROM emp
      WHERE empno = 7566
   UNION ALL
      SELECT emp.empno, emp.ename,
             ctename.level + 1
      FROM emp
         JOIN ctename ON emp.mgr = ctename.empno
)
SELECT * FROM ctename;

 empno | ename | level 
-------+-------+-------
  7566 | JONES |     0
  7902 | FORD  |     1
  7369 | SMITH |     2
(3 rows)

WITH RECURSIVE ctename AS (

SELECT empno, ename,

0 AS level

FROM emp

WHERE empno = 7566

UNION ALL

SELECT emp.empno, emp.ename,

ctename.level + 1

FROM emp

JOIN ctename ON emp.mgr = ctename.empno

)

SELECT * FROM ctename;

empno | ename | level

-------+-------+-------

7566 | JONES | 0

7902 | FORD | 1

7369 | SMITH | 2

(3 rows)

If you use UNION to avoid duplicated rows in the case of circular references, you cannot use this technique. This is because adding level will make rows that were identical before different. But in that case a hierarchical level wouldn't make much sense anyway because an entry could appear on infinitely many levels.

Another frequent requirement is to collect all ancestors in a “path”:

WITH RECURSIVE ctename AS (
      SELECT empno, ename,
             ename AS path
      FROM emp
      WHERE empno = 7566
   UNION ALL
      SELECT emp.empno, emp.ename,
             ctename.path || ' -> ' || emp.ename
      FROM emp
         JOIN ctename ON emp.mgr = ctename.empno
)
SELECT * FROM ctename;

 empno | ename |          path          
-------+-------+------------------------
  7566 | JONES | JONES
  7902 | FORD  | JONES -> FORD
  7369 | SMITH | JONES -> FORD -> SMITH

WITH RECURSIVE ctename AS (

SELECT empno, ename,

ename AS path

FROM emp

WHERE empno = 7566

UNION ALL

SELECT emp.empno, emp.ename,

ctename.path || ' -> ' || emp.ename

FROM emp

JOIN ctename ON emp.mgr = ctename.empno

)

SELECT * FROM ctename;

empno | ename | path

-------+-------+------------------------

7566 | JONES | JONES

7902 | FORD | JONES -> FORD

7369 | SMITH | JONES -> FORD -> SMITH

Comparison to Oracle

Oracle database has a different syntax for recursive queries that does not conform to the SQL standard. The original example would look like this:

SELECT empno, ename
FROM emp
START WITH empno = 7566
CONNECT BY PRIOR empno = mgr;

     EMPNO ENAME
---------- ----------
      7566 JONES
      7902 FORD
      7369 SMITH

SELECT empno, ename

FROM emp

START WITH empno = 7566

CONNECT BY PRIOR empno = mgr;

EMPNO ENAME

---------- ----------

7566 JONES

7902 FORD

7369 SMITH

This syntax is more concise, but less powerful than recursive CTEs. For more complicated queries involving joins, it can become difficult and confusing.

It is always easy to translate an Oracle “hierarchical query” to a recursive CTE:

The non-recursive branch is the Oracle query without the CONNECT BY clause but including the START WITH clause.
The recursive branch is the Oracle query without the START WITH clause but including the CONNECT BY clause. You add a join with the name of the recursive CTE and replace all PRIOR columns with columns from that joined CTE.
If the Oracle query uses CONNECT BY NOCYCLE, use UNION, otherwise UNION ALL.

Apart from that, Oracle also supports standard compliant recursive CTEs. These also support the SEARCH and CYCLE clauses that PostgreSQL doesn't implement.

The true power of recursive queries

Without recursive CTEs, many things that can be written in procedural languages cannot be written in SQL. That is usually no problem, because SQL is made to query databases. If you need procedural code, you can always write a database function in one of the many procedural languages available in PostgreSQL.

But recursive CTEs make SQL turing complete, that is, it can perform the same calculations as any other programming language. Obviously, such a “program” would often be quite complicated and inefficient, and this is a theoretical consideration and not of much practical value. Still, the previous examples showed that recursive CTEs can do useful work that you couldn't otherwise perform in SQL.

As an example for the power of recursive queries, here is a recursive CTE that computes the first elements of the Fibonacci sequence:

WITH RECURSIVE fib AS (
      SELECT 1 AS n,
             1::bigint AS 'fibₙ',
             1::bigint AS 'fibₙ₊₁'
   UNION ALL
      SELECT n+1,
             'fibₙ₊₁',
             'fibₙ' + 'fibₙ₊₁'
      FROM fib
)
SELECT n, 'fibₙ' FROM fib
LIMIT 20;

 n  | fibₙ 
----+------
  1 |    1
  2 |    1
  3 |    2
  4 |    3
  5 |    5
  6 |    8
  7 |   13
  8 |   21
  9 |   34
 10 |   55
 11 |   89
 12 |  144
 13 |  233
 14 |  377
 15 |  610
 16 |  987
 17 | 1597
 18 | 2584
 19 | 4181
 20 | 6765
(20 rows)

WITH RECURSIVE fib AS (

SELECT 1 AS n,

1::bigint AS 'fibₙ',

1::bigint AS 'fibₙ₊₁'

UNION ALL

SELECT n+1,

'fibₙ₊₁',

'fibₙ' + 'fibₙ₊₁'

FROM fib

)

SELECT n, 'fibₙ' FROM fib

LIMIT 20;

n | fibₙ

----+------

1 | 1

2 | 1

3 | 2

4 | 3

5 | 5

6 | 8

7 | 13

8 | 21

9 | 34

10 | 55

11 | 89

12 | 144

13 | 233

14 | 377

15 | 610

16 | 987

17 | 1597

18 | 2584

19 | 4181

20 | 6765

(20 rows)

This example also demonstrates how an endless loop can be avoided with a LIMIT clause on the parent query.

Conclusion

You shouldn't be afraid of recursive queries, even if the term “recursive” scares you. They are not that hard to understand, and they provide the simplest way to query databases containing hierarchies and graphs. In many cases they are also much more efficient than equivalent procedural code because they avoid repeated queries to the same tables.

Together with window functions, recursive queries are among the most powerful tools that SQL has to offer.

2 responses to “Understanding recursive queries in PostgreSQL”

Sean Anderson says:

August 17, 2020 at 6:52 pm

Probably more useful than deriving a "level" column would be a varchar breadcrumb of the IDs for each row that can be used as a sort key. Concatenating the IDs (unique attribute) with white space or underscores allows you to order a hierarchy in your results and see the tree expand out in your tabular display. Another useful convention is to derive yet another column that shows the origin or root parent for each row. From all that you can determine the root ID of any row that you are analyzing and issue a simple query (or create a table valued function) that returns the hierarchy for a specific parent (customer, employee, inventory, etc). This is all advantageous because in the real world you often have many hierarchies existing in one table/tuple and not the trivial manager/employee example that expands into a single clean tree.

Reply
Curtis Paris says:

August 17, 2020 at 8:28 pm

Another great feature with Postgres is Array column types.

In the case of recursive queries you can use it to prevent recursion re-entry. If somebody made a mistake and two people report to each in the tree, you can ensure that it never loops forever.

To do this, add another calculated column that is an Array. So in the first query you could do:
SELECT empno, ename, ename AS path, ARRAY[empno] as visited FROM emp WHERE empno = 7566

Then in the recursive 2nd clause:
UNION ALL SELECT emp.empno, emp.ename, ctename.path || ' -> ' || emp.ename, ctename.visited || emp.empno FROM emp JOIN ctename ON emp.mgr = ctename.empno WHERE NOT ctename.visited && ARRAY[emp.empno]

This will add an array that keeps appending each visited location. If you have previously visited the location, it will prevent recursion.

Reply

Understanding recursive queries in PostgreSQL

Common table expressions (`WITH` clauses)

Syntax of recursive queries

How recursive queries are processed

You have to be aware of the danger of an endless loop here:

A simple example

So our complete query will be:

Adding generated columns

Another frequent requirement is to collect all ancestors in a “path”:

Comparison to Oracle

The true power of recursive queries

Conclusion

2 responses to “Understanding recursive queries in PostgreSQL”

Leave a Reply Cancel reply

Laurenz Albe

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts

Understanding recursive queries in PostgreSQL

Common table expressions (WITH clauses)

Syntax of recursive queries

How recursive queries are processed

You have to be aware of the danger of an endless loop here:

A simple example

So our complete query will be:

Adding generated columns

Another frequent requirement is to collect all ancestors in a “path”:

Comparison to Oracle

The true power of recursive queries

Conclusion

2 responses to “Understanding recursive queries in PostgreSQL”

Leave a Reply Cancel reply

Laurenz Albe

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts

Common table expressions (`WITH` clauses)