"Seek and destroy" duplicate entries

Alex

unread,

Oct 16, 2006, 7:36:14 AM10/16/06

to

Hello NG,

considering the following table-structure (import-table):

Material_Nr || Alternative_Hierarchy
-------------||--------------------
A || 10xalpha
A || 10xalpha
A || 10xalpha
A || 10xalpha
B || 20xgamma
B || 20xgamma
B || 20xgamma
B || 20xgamma
...

The data I've to import isn't normalized. Therefore, the original
datasets are inflated to nearly 140.000.
While I'm able to delete about 130.000 datasets by killing those, who
have NULL as "Alternative_Hierarchy", I've still got >8.000 datasets
left.

But a lot of these are still duplicates!
Having only the mentioned columns, I failed to apply the method of
S-A-D using the MIN()/MAX() functions.

Can anyone provide me with a solution of killing the duplicates (except
one) ?
Any help will be appreciated!

Alex Sauer

Robert Klemme

unread,

Oct 16, 2006, 8:39:34 AM10/16/06

to

1. You can use the min(rowid) for every unique combination to delete all
superfluous records.

However, I find these approaches more attractive:

2. Use a staging table and move the data from there to the target table
that has proper constraints with a SELECT DISTINCT or GROUP BY. (Or is
your import table the staging table already?)

3. Filter during import to avoid duplicates during insertion (for
example by having SQL*Loader reject records that violate constraints).

Kind regards

robert

Charles Hooper

unread,

Oct 16, 2006, 8:42:52 AM10/16/06

to

The following should retrieve one row per MATERIAL_NR,
ALTERNATIVE_HIERARCHY combination:
SELECT
MATERIAL_NR,
ALTERNATIVE_HIERARCHY,
MIN(ROWID) ROW_ID
FROM
MY_TABLE
GROUP BY
MATERIAL_NR,
ALTERNATIVE_HIERARCHY;

Using the above as a starting point:
SELECT
MATERIAL_NR,
ALTERNATIVE_HIERARCHY,
ROWID ROW_ID
FROM
MY_TABLE
MINUS
SELECT
MATERIAL_NR,
ALTERNATIVE_HIERARCHY,
MIN(ROWID) ROW_ID
FROM
MY_TABLE
GROUP BY
MATERIAL_NR,
ALTERNATIVE_HIERARCHY;

The above SQL statement retrieves the MATERIAL_NR,
ALTERNATIVE_HIERARCHY of all rows, and then eliminates all of those
that were contained in the original query. Now to make use of the
results:
DELETE FROM
MY_TABLE
WHERE
(MATERIAL_NR,
ALTERNATIVE_HIERARCHY,
ROWID)
IN
(SELECT
MATERIAL_NR,
ALTERNATIVE_HIERARCHY,
ROWID ROW_ID
FROM
MY_TABLE
MINUS
SELECT
MATERIAL_NR,
ALTERNATIVE_HIERARCHY,
MIN(ROWID) ROW_ID
FROM
MY_TABLE
GROUP BY
MATERIAL_NR,
ALTERNATIVE_HIERARCHY);

Obviously, test the results before executing the DELETE.

Charles Hooper
PC Support Specialist
K&M Machine-Fabricating, Inc.

Alex

unread,

Oct 16, 2006, 8:48:08 AM10/16/06

to

Robert Klemme schrieb:

> 1. You can use the min(rowid) for every unique combination to delete all
> superfluous records.
>
> However, I find these approaches more attractive:
>
> 2. Use a staging table and move the data from there to the target table
> that has proper constraints with a SELECT DISTINCT or GROUP BY. (Or is
> your import table the staging table already?)
>
> 3. Filter during import to avoid duplicates during insertion (for
> example by having SQL*Loader reject records that violate constraints).
>
> Kind regards
>
> robert

Hello Robert,

thanks for your tips!
@2: The table I'm talking of ist already a staging table.
@3: For various reasons there are no constraints on the staging tables
defined at all.
So I stick to the rowid-method.

Thank you!
Alex Sauer

Alex

unread,

Oct 16, 2006, 8:57:35 AM10/16/06

to

Charles Hooper schrieb:

Thank you! I'll try that solution.
Can you tell me which one is speedier? "SELECT... MINUS SELECT ..." or
"SELECT ... WHERE ... NOT IN (SELECT...)"?

THX
Alex Sauer

Charles Hooper

unread,

Oct 16, 2006, 9:13:05 AM10/16/06

to

SELECT ... WHERE ... NOT IN (SELECT...) _should_ be much slower. The
NOT IN (SELECT) clause will be executed/evaluated once per row of the
input table.

Here is another way to find the rows to be deleted:
SELECT
MATERIAL_NR,
ALTERNATIVE_HIERARCHY,
ROW_ID
FROM

(SELECT
MATERIAL_NR,
ALTERNATIVE_HIERARCHY,
ROWID ROW_ID
FROM

MY_TABLE) M1,
(SELECT

MATERIAL_NR,
ALTERNATIVE_HIERARCHY,
MIN(ROWID) ROW_ID
FROM
MY_TABLE
GROUP BY
MATERIAL_NR,

ALTERNATIVE_HIERARCHY) M2
WHERE
M1.MATERIAL=M2.MATERIAL(+)
AND M1.ALTERNATIVE_HIERARCHY=M2.ALTERNATIVE_HIERARCHY(+)
AND M1.ROW_ID=M2.ROW_ID(+)
AND M2.ROW_ID IS NULL;

The above creates a left outer join between the two SQL statements, and
the "M2.ROW_ID IS NULL" specification basically means, find all rows
that are in the first result set that are not in the second. This
effectively does the same thing as the MINUS. Now is this faster than
the MINUS method? a 10046 trace should be able to tell you.

Alex

unread,

Oct 16, 2006, 9:20:12 AM10/16/06

to

Charles Hooper schrieb:

Thanks again!

Alex Sauer

Brian Peasland

unread,

Oct 16, 2006, 10:27:47 AM10/16/06

to

Depending on how many rows, you may find that the MIN(ROWID) option
previously discussed is slow. Since you have no FK constraints on this
table, you might want to try this approach:

CREATE TABLE import_table_temp
AS SELECT DISTINCT material_nr, alternative_hierarchy
FROM import_table;

TRUNCATE import_table;

INSERT INTO import_table SELECT material_nr, alternative_hierarchy
FROM import_table_temp;

DROP TABLE import_table_temp;

This is just an alternative solution. And there is no guarantee that
this is faster.

HTH,
Brian

--
===================================================================

Brian Peasland
d...@nospam.peasland.net
http://www.peasland.net

Remove the "nospam." from the email address to email me.

"I can give it to you cheap, quick, and good.
Now pick two out of the three" - Unknown

Robert Klemme

unread,

Oct 16, 2006, 10:40:19 AM10/16/06

to

On 16.10.2006 14:48, Alex wrote:
> Robert Klemme schrieb:
>> 1. You can use the min(rowid) for every unique combination to delete all
>> superfluous records.
>>
>> However, I find these approaches more attractive:
>>
>> 2. Use a staging table and move the data from there to the target table
>> that has proper constraints with a SELECT DISTINCT or GROUP BY. (Or is
>> your import table the staging table already?)
>>
>> 3. Filter during import to avoid duplicates during insertion (for
>> example by having SQL*Loader reject records that violate constraints).

> @2: The table I'm talking of ist already a staging table.

> @3: For various reasons there are no constraints on the staging tables
> defined at all.

Constraints on the /target/ tables!!

> So I stick to the rowid-method.

Well, if you copy anyway then distinct or group by seem to be the
easiest and most portable solutions.

robert

William Robertson

unread,

Oct 16, 2006, 11:41:25 AM10/16/06

to

One other standard dedupe would be:

DELETE import_table
WHERE ROWID IN
( SELECT LEAD(ROWID) OVER (PARTITION BY material_nr,
alternative_hierarchy ORDER BY NULL)
FROM import_table );

Whether it would be more efficient to save the non-duplicates and
truncate the table etc will depend on your data.