Searching for the best compromise: “Skyline” queries

Ever since Hannes Eder published the idea of the SKYLINE OF operator on the PostgreSQL mailing list years ago, I was somewhat hooked on the idea of being able to make more intelligent queries in PostgreSQL. So, what is the idea of a “Skyline query”? Here is the basic concept: Imagine you want to go on holiday, and you are looking for a nice hotel on the beach. The trouble is: The hotels with a nice view of the beach are way too expensive – the hotels further back are cheap but far away from the sea. The question is: What is the best compromise?

Table of Contents

That's exactly what this post is all about.

Here is an example:

test=# CREATE TABLE t_hotel 
( 
     id               serial, 
     name             text, 
     price            numeric, 
     distance_beach   numeric 
); 
CREATE TABLE

test=# CREATE TABLE t_hotel

(

id serial,

name text,

price numeric,

distance_beach numeric

);

CREATE TABLE

The table stores the name of a hotel, the price and the distance to the beach. Let us add a couple of rows manually:

test=# INSERT INTO t_hotel (name, price, distance_beach) 
       VALUES ('ABC Motel', 120, 2.4); 
INSERT 0 1 
test=# INSERT INTO t_hotel (name, price, distance_beach) 
       VALUES ('Crapstone Hotel', 90, 2.2); 
INSERT 0 1 
test=# INSERT INTO t_hotel (name, price, distance_beach) 
       VALUES ('Luxury Watch Spa Hotel', 495, 0.2); 
INSERT 0 1 
test=# INSERT INTO t_hotel (name, price, distance_beach) 
       VALUES ('Nowhere Middle Hotel', 45, 9.6); 
INSERT 0 1

test=# INSERT INTO t_hotel (name, price, distance_beach)

VALUES ('ABC Motel', 120, 2.4);

INSERT 0 1

test=# INSERT INTO t_hotel (name, price, distance_beach)

VALUES ('Crapstone Hotel', 90, 2.2);

INSERT 0 1

test=# INSERT INTO t_hotel (name, price, distance_beach)

VALUES ('Luxury Watch Spa Hotel', 495, 0.2);

INSERT 0 1

test=# INSERT INTO t_hotel (name, price, distance_beach)

VALUES ('Nowhere Middle Hotel', 45, 9.6);

INSERT 0 1

If we select our hotels sorted by price, we will see that we will most likely end up far away from the beach in a cheap, low-quality hotel. Clearly, this is not desirable:

test=# SELECT * FROM t_hotel ORDER BY price; 
 id |       name             | price | distance_beach 
----+------------------------+-------+---------------- 
 4  |   Nowhere Middle Hotel |   45  | 9.6 
 2  |        Crapstone Hotel |   90  | 2.2 
 1  |              ABC Motel |  120  | 2.4 
 3  | Luxury Watch Spa Hotel |  495  | 0.2 
(4 rows)

test=# SELECT * FROM t_hotel ORDER BY price;

id | name | price | distance_beach

----+------------------------+-------+----------------

4 | Nowhere Middle Hotel | 45 | 9.6

2 | Crapstone Hotel | 90 | 2.2

1 | ABC Motel | 120 | 2.4

3 | Luxury Watch Spa Hotel | 495 | 0.2

(4 rows)

However, if we sort by distance, we will end up close to the beach, but we won't be able to afford it. The trouble is that none of those queries will actually offer us a good compromise:

test=#  SELECT * FROM t_hotel ORDER BY distance_beach; 
 id |          name          | price | distance_beach 
----+------------------------+-------+---------------- 
  3 | Luxury Watch Spa Hotel |   495 | 0.2 
  2 |        Crapstone Hotel |    90 | 2.2 
  1 |              ABC Motel |   120 | 2.4 
  4 |   Nowhere Middle Hotel |    45 | 9.6 
(4 rows)

test=# SELECT * FROM t_hotel ORDER BY distance_beach;

id | name | price | distance_beach

----+------------------------+-------+----------------

3 | Luxury Watch Spa Hotel | 495 | 0.2

2 | Crapstone Hotel | 90 | 2.2

1 | ABC Motel | 120 | 2.4

4 | Nowhere Middle Hotel | 45 | 9.6

(4 rows)

More advanced ordering in PostgreSQL

Fortunately PostgreSQL allows us to use more sophisticated sort criteria. Sorting by a single column is boring. What we want is to somehow treat different columns differently. In this case, customers might feel that distance is not really linear. Whether you are 20 or 50 meters away from the beach does not really matter anymore. However, being 50 meters or 1 km away really matters already. To make it easy, I decided to go for the square root of the distance, while still taking the price as it is. The result looks way more promising than before:

test=# SELECT price * sqrt(distance_beach), * 
       FROM   t_hotel 
       ORDER BY 1; 
   ?column?   | id |           name         | price | distance_beach 
--------------+----+------------------------+-------+---------------- 
   133.491572 |  2 |        Crapstone Hotel |    90 | 2.2 
   139.427400 |  4 |   Nowhere Middle Hotel |    45 | 9.6 
   185.903200 |  1 |              ABC Motel |   120 | 2.4 
 221.37072977 |  3 | Luxury Watch Spa Hotel |   495 | 0.2 
(4 rows)

test=# SELECT price * sqrt(distance_beach), *

FROM t_hotel

ORDER BY 1;

?column? | id | name | price | distance_beach

--------------+----+------------------------+-------+----------------

133.491572 | 2 | Crapstone Hotel | 90 | 2.2

139.427400 | 4 | Nowhere Middle Hotel | 45 | 9.6

185.903200 | 1 | ABC Motel | 120 | 2.4

221.37072977 | 3 | Luxury Watch Spa Hotel | 495 | 0.2

(4 rows)

It seems that the Crapstone hotel is the best bargain here. It is not the cheapest hotel, but it is pretty close and still reasonably priced, so maybe it is most fitting to book that one.

The trouble starts when we look at the execution plan of this tiny PostgreSQL query:

test=# explain SELECT price * sqrt(distance_beach), * 
        FROM    t_hotel 
        ORDER BY 1; 
                            QUERY PLAN
------------------------------------------------------------------ 
Sort (cost=48.74..50.32 rows=630 width=132) 
     Sort Key: ((price * sqrt(distance_beach))) 
     -> Seq Scan on t_hotel (cost=0.00..19.45 rows=630 width=132) 
(3 rows)

test=# explain SELECT price * sqrt(distance_beach), *

FROM t_hotel

ORDER BY 1;

QUERY PLAN

------------------------------------------------------------------

Sort (cost=48.74..50.32 rows=630 width=132)

Sort Key: ((price * sqrt(distance_beach)))

-> Seq Scan on t_hotel (cost=0.00..19.45 rows=630 width=132)

(3 rows)

PostgreSQL will read all the data and sort by our custom criterial. While this is nice for a small data set, it will kill us if the amount of data keeps growing

Scaling up: Increasing the size of our data set

Let us see what happens if we load 5 million rows into our table:

test=# TRUNCATE t_hotel ; 
TRUNCATE TABLE 
test=# INSERT INTO t_hotel (price, distance_beach) 
       SELECT  40 + random() * 200, random() * 15 
       FROM    generate_series(1, 5000000); 
INSERT 0 5000000

test=# TRUNCATE t_hotel ;

TRUNCATE TABLE

test=# INSERT INTO t_hotel (price, distance_beach)

SELECT 40 + random() * 200, random() * 15

FROM generate_series(1, 5000000);

INSERT 0 5000000

Loading all this data is clearly not a problem, but check out what happens now:

test=# \timing 
Timing is on. 
test=# SELECT price * sqrt(distance_beach), price, distance_beach FROM t_hotel 
        ?column?              |        price     |         distance_beach 
------------------------------+------------------+------------------------- 
 0.15700293278521447089153382 | 199.127877652645 | 0.000000621657818555832 
  0.3200968902259212440086465 | 77.0173728093505 | 0.0000172737054526806   
  0.3452837023672139082940331 | 59.0800635144114 | 0.0000341562554240227 
... 
(10 rows) 

Time: 18916.807 ms (00:18.917)

test=# \timing

Timing is on.

test=# SELECT price * sqrt(distance_beach), price, distance_beach FROM t_hotel

?column? | price | distance_beach

------------------------------+------------------+-------------------------

0.15700293278521447089153382 | 199.127877652645 | 0.000000621657818555832

0.3200968902259212440086465 | 77.0173728093505 | 0.0000172737054526806

0.3452837023672139082940331 | 59.0800635144114 | 0.0000341562554240227

...

(10 rows)

Time: 18916.807 ms (00:18.917)

It took almost 19 seconds (my laptop) to run the query. Most users would not tolerate this kind of behavior for too often, so we somehow need to improve the runtime.

The SKYLINE OF operator does not exist in PostgreSQL (nor in any other database engine I am aware of). However: PostgreSQL offers functional indexes, which are ideal in this case:

test=# CREATE FUNCTION rate_hotel(numeric, numeric) 
  RETURNS numeric AS 
$$ 
  SELECT $1 * sqrt($2) 
$$ 
LANGUAGE 'sql' IMMUTABLE; 
CREATE FUNCTION

test=# CREATE FUNCTION rate_hotel(numeric, numeric)

RETURNS numeric AS

SELECT $1 * sqrt($2)

LANGUAGE 'sql' IMMUTABLE;

CREATE FUNCTION

The important thing here is to use an IMMUTABLE function. We must assure that the function used to rank the data is perfectly deterministic and its result does not change over time given the same input parameters.
Creating the index is easy:

test=# CREATE INDEX idx_fix_hotel 
         ON t_hotel (rate_hotel(price, distance_beach)); 
CREATE INDEX 
Time: 22706.882 ms (00:22.707) 
test=# ANALYZE ; 
ANALYZE 
Time: 354.660 ms

test=# CREATE INDEX idx_fix_hotel

ON t_hotel (rate_hotel(price, distance_beach));

CREATE INDEX

Time: 22706.882 ms (00:22.707)

test=# ANALYZE ;

ANALYZE

Time: 354.660 ms

Speeding up the skyline query using an index

Our new index is really boosting things and reduces the runtime of this query to around 1 millisecond, which is around 20.000 faster than before. The result is the same:

test=# SELECT rate_hotel(price, distance_beach), 
              price, distance_beach 
       FROM   t_hotel 
       ORDER BY 1 
       LIMIT 10; 
           rate_hotel         |       price      |      distance_beach 
------------------------------+------------------+------------------------- 
 0.15700293278521447089153382 | 199.127877652645 | 0.000000621657818555832 
  0.3200968902259212440086465 | 77.0173728093505 | 0.0000172737054526806 
  0.3452837023672139082940331 | 59.0800635144114 | 0.0000341562554240227 
... 
(10 rows) 

Time: 1.024 ms

test=# SELECT rate_hotel(price, distance_beach),

price, distance_beach

FROM t_hotel

ORDER BY 1

LIMIT 10;

rate_hotel | price | distance_beach

------------------------------+------------------+-------------------------

0.15700293278521447089153382 | 199.127877652645 | 0.000000621657818555832

0.3200968902259212440086465 | 77.0173728093505 | 0.0000172737054526806

0.3452837023672139082940331 | 59.0800635144114 | 0.0000341562554240227

...

(10 rows)

Time: 1.024 ms

The execution plan shows that PostgreSQL will directly go to the index and fetch the data needed. Indexes in PostgreSQL return sorted data so there is no need for sorting and no need to touch more than a handful of rows:

test=# explain SELECT rate_hotel(price, distance_beach), 
                      price, distance_beach 
               FROM   t_hotel 
               ORDER BY 1 
               LIMIT 10; 
                                 QUERY PLAN 
-------------------------------------------------------------------------- 
Limit (cost=0.43..1.12 rows=10 width=55) 
 -> Index Scan using idx_fix_hotel on t_hotel 
    (cost=0.43..346214.73 rows=4999993 width=55) 
(2 rows)

test=# explain SELECT rate_hotel(price, distance_beach),

price, distance_beach

FROM t_hotel

ORDER BY 1

LIMIT 10;

QUERY PLAN

--------------------------------------------------------------------------

Limit (cost=0.43..1.12 rows=10 width=55)

-> Index Scan using idx_fix_hotel on t_hotel

(cost=0.43..346214.73 rows=4999993 width=55)

(2 rows)

Of course the approach is somewhat different than described in the paper about Skyline queries. However, the method I have shown you is easy to implement, efficient and serves most real-world purposes. You can make your rating function reasonably sophisticated without and still maintain good performance.

Read on about PostgreSQL in our latest performance blogs.

Searching for the best compromise: “Skyline” queries

More advanced ordering in PostgreSQL

Scaling up: Increasing the size of our data set

Speeding up the skyline query using an index

Leave a Reply Cancel reply

Hans-Jürgen Schönig

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts