Handling unique key violations with bulk inserts of a large number of rows


alchemy1

Jan 22, 2015, 9:15:52 PM
to sqlal...@googlegroups.com
If I'm doing a bulk insert of a very large number of rows, is it possible to add only the ones that don't violate unique constraints and log the rest? Since I'm inserting so many rows, I don't do a flush/commit after each one; instead I let my application batch several hundred at a time and then do a .commit(). But if there's even one row that violates a unique constraint, none of the hundreds of non-duplicate rows in that batch get inserted either.
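Roughly, my current loading loop looks like this (a simplified sketch; the batch size and the mapped objects stand in for what my application actually builds):

from sqlalchemy.exc import IntegrityError

BATCH_SIZE = 500   # "several hundred at a time"

def commit_batch(session):
    try:
        session.commit()    # a single duplicate in the batch raises here...
    except IntegrityError:
        session.rollback()  # ...and every pending row in the batch is discarded

def load_in_batches(session, objects):
    # objects: already-constructed ORM-mapped instances
    pending = 0
    for obj in objects:
        session.add(obj)
        pending += 1
        if pending == BATCH_SIZE:
            commit_batch(session)
            pending = 0
    if pending:
        commit_batch(session)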

I thought about adding the rows one at a time from my application (doing a .commit() after each row) so that I can add only the ones that don't violate the constraint. But then the loading process becomes extremely slow, presumably because of all the database round trips and disk writes. Is there a way to bulk load while keeping performance high?

I am currently using the ORM but would use Core if it's doable that way instead.

Michael Bayer

Jan 22, 2015, 10:47:31 PM
to sqlal...@googlegroups.com


alchemy1 <veeruk...@hotmail.com> wrote:

> If I'm doing a bulk insert of a very large number of rows, is it possible to add only the ones that don't violate unique constraints and log the rest? Since I'm inserting so many rows, I don't do a flush/commit after each one; instead I let my application batch several hundred at a time and then do a .commit(). But if there's even one row that violates a unique constraint, none of the hundreds of non-duplicate rows in that batch get inserted either.

you’d need to load the rows ahead of time and remove them from your bulk list.

if the number of rows is too large to do that at once, then you have to handle it in chunks. sort the items in your bulk list by the unique key and extract chunks of, say, 1000 at a time; for each chunk of 1000, load the rows from the DB within the same key range covered by those thousand, then remove from the chunk the rows already present. bulk insert what's left of the 1000, then move on to the next chunk.

ORM / core doesn’t matter.
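
a rough Core sketch of that, using a made-up "users" table whose unique key is a single "email" column (1.x-style select(); connection details are placeholders, and you'd adjust the range query for a composite key):

from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, select)

engine = create_engine("postgresql://scott:tiger@localhost/test")
metadata = MetaData()
users = Table(
    "users", metadata,
    Column("id", Integer, primary_key=True),
    Column("email", String, unique=True),
    Column("name", String),
)

def bulk_insert_skipping_dupes(conn, rows, chunk_size=1000):
    # sort by the unique key so each chunk covers a contiguous key range
    rows = sorted(rows, key=lambda r: r["email"])
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        low, high = chunk[0]["email"], chunk[-1]["email"]
        # fetch the keys already present in the DB within this chunk's range
        existing = set(
            email for (email,) in conn.execute(
                select([users.c.email]).where(users.c.email.between(low, high))
            )
        )
        for r in chunk:
            if r["email"] in existing:
                print("skipping duplicate: %s" % r["email"])  # "log the rest"
        fresh = [r for r in chunk if r["email"] not in existing]
        if fresh:
            conn.execute(users.insert(), fresh)  # executemany-style bulk insert

# usage, given my_rows as a list of {"email": ..., "name": ...} dicts:
# with engine.begin() as conn:
#     bulk_insert_skipping_dupes(conn, my_rows)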


>
> I thought about adding the rows one at a time from my application (doing a .commit() after each row) so that I can add only the ones that don't violate the constraint. But then the loading process becomes extremely slow, presumably because of all the database round trips and disk writes. Is there a way to bulk load while keeping performance high?
>
> I am currently using the ORM but would use Core if it's doable that way instead.

SF Markus Elfring

Jan 23, 2015, 2:48:55 AM
to veeruk...@hotmail.com, sqlal...@googlegroups.com
> If I'm doing a bulk insert of a very large number of rows
> is it possible to add only the ones that don't violate
> unique constraints and log the rest?

Yes, but such functionality depends on some design details.

Would you like to invest some software development effort to get the
involved application programming interfaces into a shape that provides
enough error information, so that the corresponding exception handling
can be made really safe, consistent and efficient?


> But if there's even one row that violates a unique constraint,
> none of the hundreds of non-duplicate rows in that batch get inserted either.

How much do you rely on this core piece of database functionality,
i.e. transaction management?

Does each record belong to a change which affects the consistency/integrity
of your data structures?

Can the data updates really be reordered for failure handling?

Regards,
Markus

Jonathan Vanasco

Jan 23, 2015, 10:59:24 AM
to sqlal...@googlegroups.com
+1 to mike's suggestion of batches to weed out potential dupes.  that will save a lot. 

you'd also have to do some pre-processing within each chunk of 1000, to make sure one member of that set won't violate a constraint created by another.
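
something like this, keyed on a hypothetical "email" column, keeps only the first occurrence of each key within a chunk:

def dedupe_chunk(chunk, key="email"):
    # drop rows whose unique key already appeared earlier in this chunk,
    # so the chunk can't violate the constraint against itself
    seen = set()
    deduped = []
    for row in chunk:
        if row[key] in seen:
            print("duplicate within batch, skipping: %s" % row[key])
            continue
        seen.add(row[key])
        deduped.append(row)
    return deduped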

if your DB supports savepoints, you can also add in a savepoint/nested transaction and then immediately flush. you'll be able to avoid most errors like that.
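
with the ORM, that savepoint-per-row pattern looks roughly like this (a sketch; it trades some speed for only losing the offending rows):

from sqlalchemy.exc import IntegrityError

def insert_with_savepoints(session, objects):
    # wrap each insert in a SAVEPOINT so a unique violation only rolls
    # back that one row instead of the whole pending batch
    for obj in objects:
        try:
            with session.begin_nested():
                session.add(obj)
                session.flush()
        except IntegrityError:
            print("skipping duplicate: %r" % obj)
    session.commit()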

i use postgres, and I found the best option FOR ME was to use Python + SQLAlchemy to query and track the duplicates, and then to generate a text file in the postgres COPY FROM format for bulk import. I don't know if other dbs have a similar function.
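
as an illustration, here's a hypothetical sketch of that last step, streaming CSV data to COPY via psycopg2's copy_expert() instead of writing an intermediate file (table and column names are made up):

import csv
import io

def copy_rows(raw_conn, rows, table="users", columns=("email", "name")):
    # write the already-deduplicated rows into an in-memory CSV buffer,
    # then stream it to PostgreSQL with a single COPY statement
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow([row[c] for c in columns])
    buf.seek(0)
    cur = raw_conn.cursor()
    cur.copy_expert(
        "COPY %s (%s) FROM STDIN WITH (FORMAT csv)" % (table, ", ".join(columns)),
        buf,
    )
    raw_conn.commit()

# usage with a SQLAlchemy engine:
# copy_rows(engine.raw_connection(), deduped_rows)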