Removing duplicates in PostgreSQL

12.2012 / Category: How To / Tags: sql help

Today somebody asked me how to remove duplicates that had accidentally made it into a table. The problem: a plain DELETE that filters on the duplicated value won't do, because it removes every copy of the row, not just the surplus one.

The magic word is "ctid"

To solve the problem, you can use a hidden system column called "ctid". The ctid identifies the physical location of a row inside its table, so two rows with identical contents still have different ctids. Here is an example:

test=# CREATE TABLE t_test (id int4);
CREATE TABLE

test=# INSERT INTO t_test VALUES (1), (2), (2), (3);
INSERT 0 4

As you can see, the value 2 shows up twice. To find out how to remove the duplicate, we can query the ctid:

test=# SELECT ctid, * FROM t_test;
 ctid  | id
-------+----
 (0,1) |  1
 (0,2) |  2
 (0,3) |  2
 (0,4) |  3
(4 rows)
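
To locate the duplicates before deleting anything, the values can be grouped and the ctids of their copies collected. This is a minimal sketch against the t_test table from above; the column aliases copies and ctids are made up for illustration:

```sql
-- For every value stored more than once, show how many copies exist
-- and at which ctids they are stored.
SELECT id,
       count(*)        AS copies,
       array_agg(ctid) AS ctids
FROM   t_test
GROUP  BY id
HAVING count(*) > 1;
```

For the sample data this should report a single row for id 2 with two copies.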

We can make use of the fact that the two copies have different ctids. The subselect finds the lowest ctid for each value that occurs more than once, and the DELETE removes exactly that row:

test=# DELETE FROM t_test
       WHERE ctid IN (SELECT min(ctid)
                      FROM   t_test
                      GROUP BY id
                      HAVING count(*) > 1
                     )
       RETURNING * ;
 id
----
  2
(1 row)

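This query works nicely as long as no value shows up more than twice; with three or more copies, each run removes only one of them. For the general case, the following example uses a table that, judging by the ctids in its output, holds the value 2 three times, at (0,2), (0,3) and (0,4). A simple window function counts the copies of each value in ctid order, so every copy after the first gets a count greater than 1: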

test=# SELECT ctid 
       FROM (SELECT ctid, id,
                    count(*) OVER (PARTITION BY id ORDER BY ctid)
             FROM   t_test
            ) AS x
       WHERE count > 1;
 ctid
------
 (0,3)
 (0,4)
(2 rows)
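
The reason count > 1 skips one copy per value is the ORDER BY inside the window: with it, count(*) becomes a running count within each partition instead of the partition total. A small sketch that puts both side by side (the aliases running_count and total_count are illustrative, not part of the original query):

```sql
SELECT ctid, id,
       -- counts rows of the same id up to and including the current one
       count(*) OVER (PARTITION BY id ORDER BY ctid) AS running_count,
       -- counts all rows of the same id
       count(*) OVER (PARTITION BY id)               AS total_count
FROM   t_test
ORDER  BY id, ctid;
```

Filtering on total_count > 1 would select every copy of a duplicated value, including the one that is supposed to stay. Filtering on the running count leaves the copy with the lowest ctid alone, which is what the DELETE relies on.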

test=# DELETE FROM t_test
       WHERE ctid IN (SELECT ctid
                      FROM (SELECT ctid, id,
                                   count(*) OVER (PARTITION BY id ORDER BY ctid)
                            FROM t_test
                           ) AS x
                      WHERE count > 1
                     )
       RETURNING * ;
 id
----
  2
  2
(2 rows)

Now we can check the result:

test=# SELECT ctid, * FROM t_test;
 ctid  | id
-------+----
 (0,1) |  1
 (0,2) |  2
 (0,5) |  3
(3 rows)
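
If window functions are not available, or a shorter statement is preferred, a self-join on ctid is a commonly used alternative. This is not the approach shown above, just a sketch of the same idea; the aliases a and b are arbitrary:

```sql
-- Delete every row for which another row with the same id
-- and a lower ctid exists. The copy with the lowest ctid survives.
DELETE FROM t_test a
USING  t_test b
WHERE  a.id = b.id
AND    a.ctid > b.ctid
RETURNING a.*;
```

Like the window function variant, this handles values that occur any number of times.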

2 responses to “Removing duplicates in PostgreSQL”

Bjoern Schilberg says:

April 4, 2014 at 7:58 pm

Just a quick note here. Window Functions appeared first in PostgreSQL 8.4.

Jorge Fernandez says:

November 18, 2014 at 4:39 pm

Finally a cristal-clear explanation.... thank you so much!

Hans-Jürgen Schönig
