
Removing LOADed duplicates


James Kwan

Feb 26, 1996
I must admit that I fail to see the problem here. Even without using the
DISCARD statement, the LOAD utility will reject all duplicate entries if a
unique index is defined. Of course, a problem will occur if the job fails
before it finishes, and again it depends on whether he used LOG YES or not.
I would much appreciate it if you or the original author could clarify my
understanding.
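
For reference, a minimal sketch of a LOAD with a discard dataset (the
ddnames and table name here are made up, and the exact keywords should be
checked against the utility reference for your release):

   LOAD DATA INDDN SYSREC RESUME YES LOG NO
      DISCARDDN SYSDISC DISCARDS 0
      INTO TABLE JK.SAMPLE_TABLE

With a unique index defined, rows that would duplicate an existing key are
written to the SYSDISC dataset instead of staying in the table.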


Regards,

------------------------------------------------------------------------------
James KWAN | | | | | |
DB2 Specialist | | | | | |
jam...@mincom.com | | | | | |
Phone: +61 6 218 0212 | | | | | |
Fax: +61 6 218 0211 |M |I |N |C |O |M
------------------------------------------------------------------------------


On Sun, 25 Feb 1996, Hannan, Michael wrote:

> The DISCARD dataset is for removing duplicates during a Load. See 'DB2
> Command and Utility Reference Manual' under Load (Utility).
>
> From: Michael Hannan
>
> In reply to:
> ----------
> From: owner-db2-l
> To: Multiple recipients of list DB2-L
> Subject: Removing LOADed duplicates
> Date: Saturday, 24 February 1996 12:10PM
>
> I want to know if there are any good ways of solving the
> problem of loading duplicates into a table.
>
> Here's the situation:
> - The table already contains rows
> - I LOAD RESUME some additional rows (or accidentally the same
> data again!)
> - Some (or all) of the new rows are duplicates of the original
> data
> - The duplicates are now in the database. DB2 doesn't catch
> the problem until it's building the indexes.
> - I can't use the REPAIR utility or SQL to remove the
> duplicates, because there are too many of them.
>
> The only way I know of resolving the problem is:
> - Drop the unique indexes and primary key
> - Unload the data
> - Remove duplicates (using some form of sort)
> - Reload the data
> - Add the primary key and unique indexes
> - Replace all dropped foreign keys
> - CHECK all Check Pending tables
> - Rebind all dependent plans
>
> This is an enormous amount of work! (Thank goodness for third-party
> tools!) Have you got a better solution?
>
> (also posted in comp.databases.ibm-db2)
> ------------------------
>

mle...@tne.com

Feb 26, 1996
How about this:
1. Drop unique index
2. Load RESUME the data.
3. If you have Platinum or a comparable product, unload using SELECT DISTINCT
over the entire row (a sketch follows below). If you have QMF, save the data
from a SELECT DISTINCT query.
4. Reload the data. If you have QMF, replace the data.
5. Re-create unique index.
6. Re-create foreign keys.
I am not sure why you would run CHECK, as opposed to just starting the
dependent tablespaces, unless there were inserts done in the meantime.
You could rely on auto-rebind for the plans, unless you are afraid of
contention at execution time.
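
For step 3, the unload query would be something like this (names are made
up; list every column explicitly so that DISTINCT applies to the entire row):

   SELECT DISTINCT COL1, COL2, COL3
   FROM   ORIG.TABLE ;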

----------
From: Bill Goss <bill...@ZETA.ORG.A
To: Multiple recipients of list D; TNE/BOSTON4/SJANOWITZ;
TNE/BOSTON1/MLEVINE
Subject: Removing LOADed duplicates
Date: Monday, February 26, 1996 08:03AM


Linda L. Hagedorn

Feb 26, 1996
From Linda L. Hagedorn, DB2 Systems Support
BellSouth, 675 W. Peachtree St., Atl, GA 30375
404-529-6686, Pager 404-722-9757 (enter call back number)

>Bill Goss <bill...@ZETA.ORG.AU> wrote:

> The only way I know of resolving the problem is:
> - Drop the unique indexes and primary key
> - Unload the data
> - Remove duplicates (using some form of sort)
> - Reload the data
> - Add the primary key and unique indexes
> - Replace all dropped foreign keys
> - CHECK all Check Pending tables
> - Rebind all dependent plans

Well, this is kind of cheating, but ...

You can create two other tables - a key table and an extract table.
Insert into the key table the keys, grouped by the index columns, having
count(*) > 1; this gives you a list of the duplicate keys.
Insert into the extract table (select the columns from orig.table a,
key.table b where a.xx = b.xx); this gives you all the duplicate rows in
a separate extract table.
Then delete the duplicates from orig.table by key. I use SPUFI to
generate the delete statements:

   SELECT 'DELETE FROM ORIG.TABLE WHERE KEY1 = ', COL1,
          ' AND KEY2 = ', COL2
   FROM KEY.TABLE ;

Use the SPUFI output (FB 80) as input, stripping out the heading lines and
messages, and adding single quotes around the key values.
Use C X'40' X'5E' 62 (change the blank in column 62 to a semicolon) to add
the semicolon to every line.
Run the delete statements. You can now recover the unique index.
The duplicates can then be managed individually - shown to the user,
deleted by hand, reinserted letting DB2 kick out the duplicates, etc.
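
Spelled out a little more, the first two steps look like this (table and
column names here are hypothetical - adjust to your own keys):

   INSERT INTO KEY.TABLE (KEY1, KEY2)
      SELECT KEY1, KEY2
      FROM   ORIG.TABLE
      GROUP BY KEY1, KEY2
      HAVING COUNT(*) > 1 ;

   INSERT INTO EXTRACT.TABLE
      SELECT A.*
      FROM   ORIG.TABLE A, KEY.TABLE B
      WHERE  A.KEY1 = B.KEY1
      AND    A.KEY2 = B.KEY2 ;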

Send me a note if something I've said is unclear. Hope this helps.
Best regards, Linda Hagedorn.

James Kwan

Feb 27, 1996
See comment below.

Regards,

------------------------------------------------------------------------------
James KWAN | | | | | |
DB2 Specialist | | | | | |
jam...@mincom.com | | | | | |
Phone: +61 6 218 0212 | | | | | |
Fax: +61 6 218 0211 |M |I |N |C |O |M
------------------------------------------------------------------------------


On Mon, 26 Feb 1996 mle...@TNE.COM wrote:

> How about this:
> 1. Drop unique index
> 2. Load RESUME the data.
> 3. If you have Platinum or a comparable product, unload using SELECT DISTINCT
> over the entire row. If you have QMF, save the data from a SELECT DISTINCT
> query.
> 4. Reload the data. If you have QMF, replace the data.
> 5. Re-create unique index.

If you use SELECT DISTINCT on the entire row, you can't guarantee that all
keys are unique unless the entire row is the key. Am I correct?
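
For example (made-up values), these two rows are distinct as whole rows
yet share the same key:

   KEY1  DATA
   ----  ----
   1     A
   1     B

SELECT DISTINCT keeps both of them, so the unique index on KEY1 still
cannot be re-created.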


> 6. Re-create foreign keys.
> I am not sure why you would run CHECK, as opposed to just starting the
> dependent tablespaces, unless there were inserts done in the meantime.
> You could rely on auto-rebind for the plans, unless you are afraid of
> contention at execution time.
>
> ----------
> From: Bill Goss <bill...@ZETA.ORG.A
> To: Multiple recipients of list D; TNE/BOSTON4/SJANOWITZ;
> TNE/BOSTON1/MLEVINE
> Subject: Removing LOADed duplicates
> Date: Monday, February 26, 1996 08:03AM
>
> I want to know if there are any good ways of solving the
> problem of loading duplicates into a table.
>
> Here's the situation:
> - The table already contains rows
> - I LOAD RESUME some additional rows (or accidentally the same
> data again!)
> - Some (or all) of the new rows are duplicates of the original
> data
> - The duplicates are now in the database. DB2 doesn't catch
> the problem until it's building the indexes.
> - I can't use the REPAIR utility or SQL to remove the
> duplicates, because there are too many of them.
>

> The only way I know of resolving the problem is:
> - Drop the unique indexes and primary key
> - Unload the data
> - Remove duplicates (using some form of sort)
> - Reload the data
> - Add the primary key and unique indexes
> - Replace all dropped foreign keys
> - CHECK all Check Pending tables
> - Rebind all dependent plans
>

Sean A Dunn

Feb 27, 1996
> I must admit that I fail to see the problem here. Even without using the
> DISCARD statement, the LOAD utility will reject all duplicate entries if a
> unique index is defined. Of course, a problem will occur if the job fails
> before it finishes, and again it depends on whether he used LOG YES or not.
> I would much appreciate it if you or the original author could clarify my
> understanding.

One little problem, as I understand it, is that LOAD will not only remove
the NEW rows you just added that are duplicates, but also the ORIGINAL
rows that 'contended' with your new rows. Ouch - no longer a simple thing
to manage with just LOAD.

With a 3rd-party tool to manage this (e.g. Princeton Softech MOVE FOR
DB2), the insert/update processing is done for you, unless your data
volumes are just too great.
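
Without such a tool, the same insert-if-absent idea can be approximated in
plain SQL from a staging table (NEW.TABLE and ORIG.TABLE are hypothetical
names; KEY1 stands for the unique key):

   -- Weed out staged rows whose key already exists in the target.
   -- (Done on the staging side, because older DB2 releases do not let
   -- the target of an INSERT appear in the subselect's FROM clause.)
   DELETE FROM NEW.TABLE N
   WHERE EXISTS
         (SELECT 1
          FROM   ORIG.TABLE O
          WHERE  O.KEY1 = N.KEY1) ;

   -- Whatever is left in the staging table is safe to copy across.
   INSERT INTO ORIG.TABLE
      SELECT * FROM NEW.TABLE ;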

--
*** Sean Dunn, Wolverhampton, England ***
*** E-mail: se...@lilydale.demon.co.uk ***


Ruwu

Feb 28, 1996
In article <344061...@lilydale.demon.co.uk>, Sean A Dunn
<Se...@lilydale.demon.co.uk> writes:

>Subject: Re: Removing LOADed duplicates
>From: Sean A Dunn <Se...@lilydale.demon.co.uk>
>Date: Tue, 27 Feb 1996 19:02:20 GMT


>
>One little problem, as I understand it, is that LOAD will not only remove
>the NEW rows you just added that are duplicates, but also the ORIGINAL
>rows that 'contended' with your new rows. Ouch - no longer a simple thing
>to manage with just LOAD.
>
>With a 3rd-party tool to manage this (e.g. Princeton Softech MOVE FOR
>DB2), the insert/update processing is done for you, unless your data
>volumes are just too great.
>
>--
>*** Sean Dunn, Wolverhampton, England ***
>*** E-mail: se...@lilydale.demon.co.uk ***

No Sean,

only records from the newly loaded data will be discarded; the original
data won't be affected by a LOAD RESUME. Thus, if you load the same data
again into a table with a unique key, all of the newly loaded rows will be
removed during discard processing and the tablespace will be ok - no
problem at all.

But maybe the LOAD abends because of a line overflow or an out-of-space
condition on the error or discard dataset before it has discarded the
rubbish. Then indeed the easiest way is either a timestamp recovery, or:
start the tablespace ACCESS(FORCE), unload, sort unique, LOAD REPLACE, and
run REPAIR NOCHECKPEND on the children (CHECK is not needed because you
know the data is ok).
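
A rough sketch of that path (database, tablespace, and key positions are
made up; here the key is a 10-byte field in columns 1-10 of the unloaded
records):

   -START DATABASE(DBX) SPACENAM(TSX) ACCESS(FORCE)

   unload the table (e.g. with DSNTIAUL), then drop the duplicate-keyed
   records with DFSORT:

      SORT FIELDS=(1,10,CH,A)
      SUM FIELDS=NONE

   LOAD ... REPLACE from the sorted file, then reset the children:

      REPAIR SET TABLESPACE DBX.TSCHILD NOCHECKPEND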

Because the COND CODE of LOAD is 4 in the case of discarded records (and
the LOAD almost always ends with 4 anyway, because LOG NO is often used),
nobody looks at the LOAD output and takes notice of discarded records. So
it's a good idea to use the DIAGNOSE statement with the LOAD statement to
produce an abend when a discard message occurs. The abend takes place
after the last phase of the LOAD has finished; that means no problems or
pending flags.
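
A sketch of that pairing (the message number is a placeholder - look up
the discard message for your release in the messages manual before using
it; NODUMP suppresses the dump):

   DIAGNOSE
      ABEND MESSAGE 344 NODUMP
   LOAD DATA RESUME YES LOG NO
      INTO TABLE ORIG.TABLE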


So long,

Ruediger Wuepper / D-22299 Hamburg / Germany
Fax: Germany (0)40-512668
