remove duplicate rows while keeping one copy

Pet

unread,

May 26, 2009, 9:13:57 AM5/26/09

to

Hi,

how can duplicate rows be removed, so one copy is still in db?
Example rows (first row should remain and second should be deleted):

1 iso-33 bla
1 iso-33 bla
2 iso-34 bla

Table structure:
CREATE TABLE "public"."table" (
"id" int4 DEFAULT NULL,
"iso" varchar DEFAULT NULL,
"name" varchar(200) DEFAULT NULL
)
WITHOUT OIDS ;

Regards, Pet

Dave Hughes

unread,

May 26, 2009, 10:53:41 AM5/26/09

to

Assuming the table isn't massive, it's probably easiest to just create a
new table based on the results of a SELECT DISTINCT, then replace the
old table with the new:

CREATE TABLE foo AS (SELECT DISTINCT * FROM bar);
DROP TABLE bar;
ALTER TABLE foo RENAME TO bar;

Cheers,

Dave.

P.S. Quoted identifiers?! Bleh... ;-)

Pet

unread,

May 26, 2009, 11:08:24 AM5/26/09

to

On May 26, 4:53 pm, Dave Hughes <d...@waveform.plus.com> wrote:
> Pet wrote:
> > Hi,
>
> > how can duplicate rows be removed, so one copy is still in db?
> > Example rows (first row should remain and second should be deleted):
>
> > 1 iso-33 bla
> > 1 iso-33 bla
> > 2 iso-34 bla
>
> > Table structure:
> > CREATE TABLE "public"."table" (
> > "id" int4 DEFAULT NULL,
> > "iso" varchar DEFAULT NULL,
> > "name" varchar(200) DEFAULT NULL
> > )
> > WITHOUT OIDS ;
>
> > Regards, Pet
>
> Assuming the table isn't massive, it's probably easiest to just create a

It is really big...

> new table based on the results of a SELECT DISTINCT, then replace the
> old table with the new:
>
> CREATE TABLE foo AS (SELECT DISTINCT * FROM bar);
> DROP TABLE bar;
> ALTER TABLE foo RENAME TO bar;
>
> Cheers,
>
> Dave.
>
> P.S. Quoted identifiers?! Bleh... ;-)

it was automatically generated from some client. hate it too :)
Thanks, anyway

Bob Badour

unread,

May 26, 2009, 11:09:59 AM5/26/09

to

Pet wrote:

A lot of people would just ask: "Why didn't you use OID?"

You might be able to do something with the ctid system column, but in
the version of postgres I use, ctid does not have any inequality
comparisons defined, which makes that difficult.

The only sure-fire way is to copy the data into another table using
"select distinct" and then get drop the original table.

The best option is not to let the duplicates get in in the first place.

Thomas Kellerer

unread,

May 26, 2009, 11:21:31 AM5/26/09

to

Dave Hughes, 26.05.2009 16:53:

> P.S. Quoted identifiers?! Bleh... ;-)

As they are all lowercase the quotes aren't actually needed.

Laurenz Albe

unread,

May 26, 2009, 12:51:01 PM5/26/09

to

Pet wrote:
>>> how can duplicate rows be removed, so one copy is still in db?
>>>

>>> Table structure:
>>> CREATE TABLE "public"."table" (
>>> "id" int4 DEFAULT NULL,
>>> "iso" varchar DEFAULT NULL,
>>> "name" varchar(200) DEFAULT NULL
>>> )
>>> WITHOUT OIDS ;
>>

>> Assuming the table isn't massive, it's probably easiest to just create a

>> new table based on the results of a SELECT DISTINCT, then replace the
>> old table with the new:
>>
>> CREATE TABLE foo AS (SELECT DISTINCT * FROM bar);
>> DROP TABLE bar;
>> ALTER TABLE foo RENAME TO bar;
>

> It is really big...

It also works for big tables.

From PostgreSQL 8.2 on, you can also do something like:

DELETE FROM "table" AS t
WHERE EXISTS (SELECT 1 FROM "table" AS t1
WHERE t.id = t1.id AND t.iso = t1.iso AND t.name = t1.name
AND t.ctid > t1.ctid);

Then a VACUUM.

Yours,
Laurenz Albe

Marco Mariani

unread,

May 27, 2009, 5:20:12 AM5/27/09

to

Bob Badour wrote:

> A lot of people would just ask: "Why didn't you use OID?"

Oh, well, I think it's better to ask "why didn't you use a textbook on
database design"
You know, the ones that babble a lot about primary and foreign keys :-)

> The best option is not to let the duplicates get in in the first place.

Amen! If the question of the OP is in the context of a "big mess I am
into", and he knows what he's doing, it's all well and let's hope for
the best.

If he, on the other hand, has issues like this on a daily basis or this
is a query that must be ran more than once, the real WTF here is "how
could this happen".

Often, "SELECT DISTINCT" means we've gone from having a problem, to
having two problems.

With OIDs that would be three problems.

Laurenz Albe

unread,

May 27, 2009, 5:21:28 AM5/27/09

to

I wrote:
> From PostgreSQL 8.2 on, you can also do something like:
>
> DELETE FROM "table" AS t
> WHERE EXISTS (SELECT 1 FROM "table" AS t1
> WHERE t.id = t1.id AND t.iso = t1.iso AND t.name = t1.name
> AND t.ctid > t1.ctid);

Disclaimer: This may be very inefficient, particularly without indexes...

Yours,
Laurenz Albe

Bob Badour

unread,

May 27, 2009, 10:32:09 AM5/27/09

to

Marco Mariani wrote:

I'm sorry, Marco. Were you trying to say anything in particular? Because
I could not make any sense out of what you wrote.

Marco Mariani

unread,

May 27, 2009, 11:06:03 AM5/27/09

to

Bob Badour wrote:

> I'm sorry, Marco. Were you trying to say anything in particular?

I was trying to say, while being admittedly impolite, that the OP most
probably has a broken design, and should fix it as soon as possible.
Couple that with the fact that he is not experienced enough to make
proper use of a SELECT DISTINCT, and is handling a "really big" quantity
of data, I don't think quick fixes are what he really needs.

What is "really big" anyway? I cannot imagine a million-rows table which
has "iso" and "name" as the only columns.

A quick fix like "use OIDs or ctids" would probably flush any hope to
have a sane schema down the toilet. You're not supposed to know what a
ctid is at this level of usage.

> Because I could not make any sense out of what you wrote.

Oh, I would have understood if you said "trivial" or "trolling", but
"nonsense" is quite strong :-/

Have I been clearer now?
[yes/no/abort]

Bob Badour

unread,

May 27, 2009, 12:19:26 PM5/27/09

to

Marco Mariani wrote:

> Bob Badour wrote:
>
>> I'm sorry, Marco. Were you trying to say anything in particular?
>
> I was trying to say, while being admittedly impolite, that the OP most
> probably has a broken design, and should fix it as soon as possible.

I agree. In fact, I would say the OP has just had a hard lesson on the
importance of the information principle and of logical identity. Because
it was a hard lesson, I doubt anyone could say anything to improve the
effectiveness of the lesson.

> Couple that with the fact that he is not experienced enough to make
> proper use of a SELECT DISTINCT, and is handling a "really big" quantity
> of data, I don't think quick fixes are what he really needs.

I am not sure how you conclude he lacks enough experience to use select
distinct.

> What is "really big" anyway? I cannot imagine a million-rows table which
> has "iso" and "name" as the only columns.

I cannot imagine real data with "bla" repeated in every row. I
appreciate that the OP constructed a very simple test case to illustrate
the problem he was trying to overcome. I assume the real data is much
larger.

> A quick fix like "use OIDs or ctids" would probably flush any hope to
> have a sane schema down the toilet. You're not supposed to know what a
> ctid is at this level of usage.

Assuming, as you did, that the OP is not as experienced as some of us,
it's not his fault the product he is using let him get into such a mess.
For all we know, the OP was trying to declare a unique key and was
stopped by unexpected duplicates.

Then again, for all we know, the OP has a lot of experience with other
products and found himself in the unfortunate position of cleaning up
someone else's mess.

It's true that using ctid to remove duplicates is using one flaw to
overcome another. But those are flaws in the product not in the OP or in
the method. If one flaw in the product allows one to expediently get out
of a mess caused by another flaw, I suggest expedience rules--especially
when one considers that having two copies of the data may require
additional hardware.

The fact is one has to know what a ctid is at this level of usage
because the product exposes it in the logical interface. One cannot use
ctid as a name for some other column, for example.

> > Because I could not make any sense out of what you wrote.
>
> Oh, I would have understood if you said "trivial" or "trolling", but
> "nonsense" is quite strong :-/

I didn't say it was nonsense--only that I could not make sense of it.

> Have I been clearer now?
> [yes/no/abort]

Yes. Much clearer.

Pet

unread,

May 28, 2009, 4:40:09 PM5/28/09

to

I think, SELECT DISTINCT in temptable and then replacing original
table is the best solution.
And table wasn't created by me like other 90% of tables which I should
maintain now...

Marco Mariani

unread,

May 29, 2009, 7:01:33 AM5/29/09

to

Pet wrote:

> I think, SELECT DISTINCT in temptable and then replacing original
> table is the best solution.
> And table wasn't created by me like other 90% of tables which I should
> maintain now...

If other tables are unmainainable like this, you might use the advice
provided by Martin Fowler in "Evolutionary Database Design", and
sanitize the environment piece by piece...

http://databaserefactoring.com/

There is another book called "Refactoring SQL applications" -- which I
heartly recommend, but has to do with performance and scalability -- not
on improving the schema design.

Ben Finney

unread,

May 30, 2009, 9:52:34 AM5/30/09

to

Marco Mariani <ma...@sferacarta.com> writes:

> If other tables are unmainainable like this, you might use the advice
> provided by Martin Fowler in "Evolutionary Database Design", and
> sanitize the environment piece by piece...
>
> http://databaserefactoring.com/

I enthusiastically second the recommendation for this book, but note
that its authors are Scott W. Ambler and Pramod Sadalage.

--
\ “Pinky, are you pondering what I'm pondering?” “I think so, |
`\ Brain, but if the plural of mouse is mice, wouldn't the plural |
_o__) of spouse be spice?” —_Pinky and The Brain_ |
Ben Finney