I am trying to delete rows from a massive data table that are "almost"
duplicates. I can do a nice query
to find them when I join the table with itself, but I can't figure out
how to form a delete statement out
of this.
Query:
SELECT a.*
FROM Trafficdata a, Trafficdata b
WHERE a.id = b.id
AND a.date = b.date
AND a.flag = 'Q'
AND b.flag = 'Q'
AND a.dstop = b.dstop
AND (a.batch != b.batch OR (a.batch = b.batch AND a.pos_marker <
b.pos_marker) );
One idea I tried after searching the web was
DELETE FROM ( <above query> ); but that gives
ORA-01752 "Cannot delete from view without exactly one key preserved
table"
Any help would be greatly appreciated!
Jeff C.
Do you have a primary key on this Trafficdata table?
Assuming your results are the rows your really want to delete then a
delete of the form:
DELETE FROM Trafficdata WHERE pk IN (SELECT a.pk from ...) ;
I suggest you do this AFTER doing a backup.
Ed
Assuming you want to keep one copy of each row the following will work
delete from table_name
where (key_list, rowid) in
( select keys, rowid from table_name
minus
select keys, min(rowid) from table_name
group by keys
)
/
HTH -- Mark D Powell --
Thanks for your reply. No, there isn't don't have a primary key on
this data table. I also can't make a backup of the TrafficData table,
due to its huge size, but I did make a smaller table to experiment on.
Jeff
Sic?
> AND a.flag = 'Q'
> AND b.flag = 'Q'
> AND a.dstop = b.dstop
> AND (a.batch != b.batch OR (a.batch = b.batch AND
> a.pos_marker < b.pos_marker) );
Note that if there are rows with the same (id, date,
flag = 'Q', dstop) tuple, but different batch numbers,
then you'll get every row for that tuple.
Is that really what you want? Or do you only want to
remove all but the the top pos_markers for every (id,
date, flag = 'Q', dstop, batch) tuple? Or something
else...?
Note also that this will potentially give you multiple
copies of some rows from a.
> One idea I tried after searching the web was
> DELETE FROM ( <above query> ); but that gives
> ORA-01752 "Cannot delete from view without exactly
> one key preserved table"
Hint: A better select query might be...
select a.*
from trafficdata a
where a.flag = 'Q'
and exists
(select 1
from trafficdata b
where b.id = a.id
and b.date = a.date
and b.flag = a.flag
and b.dstop = a.dstop
and ( b.batch != a.batch
or b.batch = a.batch
and b.pos_marker > a.pos_marker));
--
Peter
I took a quick look, but if there are corresponding rows on (id, date,
flag, dstop) where a.batch <> b.batch it looks like BOTH rows will be
deleted. I don't know if that is what you really wanted...
Shakespeare
The code I post will work when no PK or UK exists. The key_list is
just the list of columns that qualify the data to be considered
duplicate. If could be one column or all the columns in the row.
Thanks, Mark. I was wondering about exactly that. I'll have another
go at it today.
--Jeff
Looks like you are skipping the condition that the record with the
largest pos_marker should be preserved, not the lowest rowid.
Furthermore, if a.batch != b.batch, both records would be deleted in the
original query (if it worked).
And is there no mutating table problem here? Won't this lead to
'snapshot too old' problems with large tables? (I'm not sure about that)
Shakespeare
I had similar problem. Large table with many rows to delete.
My solution was to write a short program which would connect to db, open full
table scan, reading all important columns and rowids. While reading i was making
in memory list of good rows to keep (actualy it was hash map) and for each row i
read i would check if i had row with similar important columns on my list, if no
i would add it to list, if yes i would check if it should replace row on list or
if it should be deleted. On second list i kept rowids rows to be deleted and
every time when its size reached 10k i was sending rowids from it to separate
deleter thread which would just loop through that rowids and issue delete from x
where rowid = y for each one.
So total cost of that solution was one full table scan plus numer of deletes by
rowid.
Since i was commiting every 10k deletes, snapshot too old was not a problem
because i was able to restart that job any time, and after each restart there
would be less rows in table.
Anyway in my case "cleaning" job duration droped from 120h (it was writen in
PL/SQL loop on the beginging) to 5h for table with ~150m rows (25m after
cleanup) and the bottleneck is now at redo log writers, how fast they are able
to log changes.
Sebastian K.
Is there an echo in here?! :-)
The query I posted replicates the semantics of the OP's
original select, except that it doesn't repeat matching
rows from the target table.
The hint about my script maybe being better was that it's
a one line change to make it delete instead of select rows.
That sort of select is useful in situations like formal
data change requests where you have to confirm the affected
rows with pre and post selects. Performing a review of the
change script prior to running in production is much easier
if the select and delete (or update) are all but identical.
Of course, it's important to get the select right in the
first place. ;-)
--
Peter
As you saw, ROWID works for this case.
But about backup, you are saying you (a developer) cannot back up the
table, right? I meant for the DBA to do a backup before changing a
table that is so massive. (And if the DBA cannot do the backup you
have bigger problems than duplicates!)
Good luck.
Ed
No, not in here, but in my other post. I noticed the deletion of too
many rows in your query, and later found out that the original query did
the same, and luckily did not work!
>
> The query I posted replicates the semantics of the OP's
> original select, except that it doesn't repeat matching
> rows from the target table.
>
> The hint about my script maybe being better was that it's
> a one line change to make it delete instead of select rows.
>
> That sort of select is useful in situations like formal
> data change requests where you have to confirm the affected
> rows with pre and post selects. Performing a review of the
> change script prior to running in production is much easier
> if the select and delete (or update) are all but identical.
>
> Of course, it's important to get the select right in the
> first place. ;-)
Right, especially when there is no space for backups!
>
> --
> Peter
Shakespeare
The delete is a generic template. It would be very easy to modify it
to keep the max whatever instead of the min rowid.
Mutating tables happen only when a trigger is involved and you attempt
to modify the base table from the trigger. There is no mention of a
trigger; however there is mention of a view error. The delete should
probably be aimed at one or more of the base tables in the view.
Remove the extra rows from the base table then the view will display
the right results since the data has been fixed. Either that or the
duplcate data should exist and it is the view definition which needs
fixing to filter out what is being called duplicate. I have seen both
issues over my career.
Right, missed that part.
But how about the snapshot too old issue? I think to remember that a
'delete from a where exists (select .. from a)' with large tables can
cause that error.. or was that only with use of cursors?
>There is no mention of a
> trigger; however there is mention of a view error. The delete should
> probably be aimed at one or more of the base tables in the view.
Yes. Because there is not exactly one key preserved table, it can not
determined from which table the row should be deleted (eventhough they
are actually the same one)
> Remove the extra rows from the base table then the view will display
> the right results since the data has been fixed. Either that or the
> duplcate data should exist and it is the view definition which needs
> fixing to filter out what is being called duplicate. I have seen both
> issues over my career.
>
> HTH -- Mark D Powell --
>
>
>
Shakespeare
Thanks for your reply, Peter (and thanks to everyone else who
responded!). Sorry for the delay in getting back to this thread, but
as you may have experienced yourselves, an issue that is really
important to management one day can be replaced by another hot potato
issue the next day.
You are correct that the original query was flawed, as I saw by
creating a small table and using rowids to pinpoint exactly which row
I was returning. I made the query work by getting all duplicate rows
with rowid created than the min(rowid) for that set of data. It works
great on a small table, but for 30+ million rows, it churns away for
many hours and isn't practical to do in the production enviroment.
--Jeff