Best design for commits?

Aviv Giladi

Apr 20, 2011, 10:53:19 AM
to sqlalchemy
Hey guys,

I have a Pylons back-end running on SQLAlchemy. I have a script that
reads a tree of XML files from an HTTP server (it downloads an xml X,
and then downloads that X's children, and then iterates the children,
and so forth in recursion). Each xml file represents an SQLAlchemy
model.

The problem is that I have thousands of these xml's (sometimes 5,000, sometimes 26,000). I was able to optimize the download process with HTTP pooling, but I cannot settle on the best approach for committing the models to the DB. Every time an xml file is downloaded, I create an orm object for it and add it to my session.

Problem 1: some xml's will exist multiple times in the tree, so I am checking that there is no duplicate insertion. Is the check in my code optimal, or should I keep an indexed collection on the side and use it to check for duplicates?

Problem 2: my autocommit is set to False because I don't want to commit on every add (not because it's bad design, but because of performance). But I also don't want to iterate the entire tree of thousands of categories without committing at all. Therefore, I created a constant number upon which my code commits the data. Is this a good approach? What would be a good number for that? It might be important to mention that I do not know in advance how many xml's I am looking at.

Here is what my pseudo-code looks like now (ignore syntax errors):

count = 0
COMMIT_EVERY = 50

def recursion(parent):
    global count
    pool = get_http_connection_pool(...)
    sub_xmls = get_sub_xmls(pool, parent)

    if sub_xmls is None:
        return

    for sub_xml in sub_xmls:
        orm_obj = MyObj(sub_xml)

        duplicate = Session.query(MyObj).filter(MyObj.id == orm_obj.id).first()
        if not duplicate:
            Session.add(orm_obj)
            count = count + 1
            if count % COMMIT_EVERY == 0:
                Session.commit()
            recursion(orm_obj.id)

recursion(0)

Richard Harding

Apr 20, 2011, 12:05:50 PM
to sqlal...@googlegroups.com
What I tend to do in cases like this is to break things into commit chunks. For instance, I've got an import script that goes through and processes 10 at a time and performs a commit every 10. This is tweakable via a config setting, but so far 10 works for my needs. 

As for the duplicates: if they're exact and you don't need to process changes between one record and another with the same pk, just wrap the insert in a try/except and catch the pk error on the second insert. With my import script, if I get an exception, I then run through all 10, committing one at a time and finding the 'bad egg' so that I can log that this one failed because of the exception. Sure, for that batch of 10 I'm doing a bunch of single commits, but more often than not I'm running in my batch mode.

I'd just let the db/exceptions tell me a record exists vs trying to query the server for each one to check first. Good ole case of 'better to ask forgiveness than permission'.
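The batch-then-fallback pattern above can be sketched with the stdlib sqlite3 module standing in for the ORM session (the `items` table and row shapes here are made up for illustration):

```python
import sqlite3

def insert_batch(conn, rows):
    """Try the whole batch in one transaction; on a pk clash, fall back
    to one-row commits and collect the 'bad eggs' for logging."""
    try:
        with conn:  # commits on success, rolls back the whole batch on error
            conn.executemany("INSERT INTO items (id, name) VALUES (?, ?)", rows)
        return []
    except sqlite3.IntegrityError:
        failed = []
        for row in rows:
            try:
                with conn:  # one commit per row in the fallback path
                    conn.execute("INSERT INTO items (id, name) VALUES (?, ?)", row)
            except sqlite3.IntegrityError:
                failed.append(row)  # duplicate pk: record it and move on
        return failed

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO items VALUES (2, 'already there')")
conn.commit()  # settle the setup rows before batching begins

bad = insert_batch(conn, [(1, 'a'), (2, 'dupe'), (3, 'c')])
print(bad)  # → [(2, 'dupe')]
```

Most batches commit in one shot; only a batch that contains a duplicate pays the per-row cost.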

Rick

Aviv Giladi

Apr 20, 2011, 12:10:10 PM
to sqlalchemy
Dear Rick,

Thank you for your reply. I understand, but is there not a better way
than doing a lot of single commits in case of a commit exception?
In other words, is there a way to tell SQLAlchemy to throw an
exception on the Session.add if there's a duplicate as opposed to on
the Session.commit?
Or else, is it not better to keep an indexed collection on the side to
check that an id was inserted before?

Thanks!

Richard Harding

Apr 20, 2011, 12:17:21 PM
to sqlal...@googlegroups.com
Not that I'm aware of. When you do a Session.add() it's not touching the database yet; that's part of the performance tradeoff. There's no good way for it to *know* there's a record in the db with that pk id until it does chat with the db about it. Sure, you can keep a list of ids on the side if you want, but it seems that you're going to have a try/except block there anyway in case of other issues (db connection fails, bad values, etc.), so you might as well catch the exception for a row already existing as well.

Rick

Aviv Giladi

Apr 20, 2011, 12:20:22 PM
to sqlalchemy
I agree, but the thing is that committing every 10 entries is a little low for me; I was thinking of around 50, in which case having 50 individual commits is quite costly.
In case I choose to implement your method, how would you go about it? How do you keep track of the last 50 or so records that have not been committed yet (taking into account my recursion)?

Thanks again!

Richard Harding

Apr 20, 2011, 12:43:18 PM
to sqlal...@googlegroups.com
I'm not sure, but I'd check the exception and see if you can get the info about which of your 50 was the dupe. I don't recall if it's in the traceback or the exception error. If you can identify it, then you could store it aside, remove it from the session, and retry the other 49.

Otherwise, it's the case of finding the mole. Maybe run some sort of binary split of the 50 so that you split the list in half, try to commit each half, one works, one fails. Split the fail side again, etc. In this way you should really only get down to what, 7 commits per 50? This is all assuming one dupe/bad record in the group of 50. 
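The binary-split idea can be sketched as a small recursive helper. Here `try_commit` is a hypothetical stand-in for committing a batch atomically (with SQLAlchemy it might be a savepoint plus commit); the toy version below just tracks seen ids:

```python
def commit_by_bisection(rows, try_commit):
    """Commit rows in as few attempts as possible, isolating bad
    records by splitting the batch in half on failure.
    try_commit(batch) commits all of batch atomically or raises.
    Returns the rows that could not be committed."""
    if not rows:
        return []
    try:
        try_commit(rows)
        return []
    except Exception:
        if len(rows) == 1:
            return rows  # found the bad egg
        mid = len(rows) // 2
        return (commit_by_bisection(rows[:mid], try_commit)
                + commit_by_bisection(rows[mid:], try_commit))

# toy stand-in for an atomic commit: fails if the batch holds a known id
seen = set()
def try_commit(batch):
    if any(i in seen for i in batch):
        raise ValueError("duplicate pk")
    seen.update(batch)

seen.add(7)  # pretend id 7 is already in the db
bad = commit_by_bisection([1, 2, 7, 9], try_commit)
print(bad)  # → [7], isolated in roughly log2(n) attempts instead of n
```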

King Simon-NFHD78

Apr 20, 2011, 1:02:57 PM
to sqlal...@googlegroups.com

I'm not sure I can comment on the overall approach, but there are a
couple of things that might help you.

1. If you use Query.get rather than Query.filter, you won't actually
query the database when the object already exists in the session. You'll
probably need to clear the session every now and then (I don't think
flush() or commit() clear it, but I could be wrong)

2. You may want to distinguish Session.flush() from Session.commit() -
you could flush every N new objects, and only commit once at the very
end.

3. If you know you are the only person writing to the database, consider
setting expire_on_commit=False on your session. Otherwise I think
accessing orm_obj.id after Session.commit() will trigger another
(possibly unnecessary) query to the database.
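A minimal sketch of points 1-3, assuming SQLAlchemy 1.4+ spellings (session.get() is the modern form of Query.get; the MyObj model here is made up):

```python
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class MyObj(Base):
    __tablename__ = "my_obj"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

# expire_on_commit=False: attributes stay usable after commit()
# without re-querying (safe only if nothing else writes to the db)
Session = sessionmaker(engine, expire_on_commit=False)
session = Session()

obj = MyObj(id=1, name="root")
session.add(obj)
session.flush()   # SQL is emitted, but the transaction stays open
session.commit()  # the single commit at the very end

# get() checks the session's identity map first: no SQL is issued
# when the object is already present, unlike query(...).filter(...).first()
same = session.get(MyObj, 1)
print(same is obj)  # → True
```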

Hope that helps,

Simon

Aviv Giladi

Apr 20, 2011, 1:36:56 PM
to sqlalchemy
Thanks again Rick. The issue is that I have a LOT of duplicates
(around 20-30%) - that's just how that tree is structured.
Therefore, I think I am going to go with catching DB exceptions
regardless, but also use an indexed collection to prevent duplicates.

Cheers!

Michael Bayer

Apr 20, 2011, 3:26:36 PM
to sqlal...@googlegroups.com
my practices with this kind of situation are:

1. There's just one commit() at the end. I'd like the whole operation in one transaction.
2. There are flush() calls every 100-1000 or so. 10 is very low.
3. I frequently will disable autoflush, if there are many flushes occurring due to queries for related data as the bulk proceeds.
4. I don't use try/except to find duplicates - this invalidates the transaction (SQLAlchemy does this, but many DBs force it anyway). I use a SELECT to get things ahead of time, preferably loading the entire database's worth of keys into a set, or loading the keys that I know we're dealing with, so that individual per-key SELECTs are not needed. Or, if the set of data I'm working with is the whole thing at once, I store the keys in a set as I get them, so I know which ones I've got as I go along.
5. If I really need to do try/except, I use savepoints, i.e. begin_nested().
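Point 4 can be sketched with the stdlib sqlite3 module standing in for the session (the table and ids are illustrative): one SELECT loads every existing key into a set up front, and the loop then skips duplicates without any per-key SELECT or try/except.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_obj (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO my_obj VALUES (?)", [(2,), (5,)])
conn.commit()

# one SELECT up front: every existing pk, loaded into a set
seen = {row[0] for row in conn.execute("SELECT id FROM my_obj")}

incoming = [1, 2, 3, 5, 8]  # ids parsed from the xml tree (made up)
inserted = 0
for pk in incoming:
    if pk in seen:          # duplicate: skip it, no SELECT, no exception
        continue
    conn.execute("INSERT INTO my_obj VALUES (?)", (pk,))
    seen.add(pk)            # track new keys as we go
    inserted += 1
    # a periodic flush() would go here every few hundred rows; with
    # sqlite3 the statement is already sent, the transaction stays open

conn.commit()               # a single commit at the very end
print(inserted)             # → 3
```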


Aviv Giladi

Apr 20, 2011, 8:33:16 PM
to sqlalchemy
Thank you for your responses everyone.
I have one more question - the really time-heavy task here is retrieving the URLs over HTTP (it takes almost a second per URL). I am using urllib3, which has connection pooling, but other than that, is there any other way to speed this up? Perhaps multi-threading?

Michael Bayer

Apr 20, 2011, 9:27:58 PM
to sqlal...@googlegroups.com
If the URL fetch is an IO-bound operation (i.e. the time is spent waiting for IO), it might work if you did a standard producer/consumer model using Queue.Queue. One thread retrieves data from each URL and places the datasets into the Queue; the other thread pulls off items and loads them into the DB.

Or the same idea using the multiprocessing module instead of threading, if the GIL is still getting in the way. Or using Celery. Or a deferred approach like that of Twisted. There are lots of ways to offload slow IO operations while work continues.
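A minimal producer/consumer sketch with the stdlib (Python 3 spells it queue.Queue; the fetch function here is a fake standing in for the urllib3 download):

```python
import queue
import threading
import time

def fetch(url):
    """Stand-in for the slow HTTP fetch (hypothetical payload)."""
    time.sleep(0.01)  # simulate network latency
    return url, "<xml/>"

def producer(urls, q):
    for url in urls:
        q.put(fetch(url))  # the slow IO happens here, off the DB thread
    q.put(None)            # sentinel: no more work

def consumer(q, results):
    while True:
        item = q.get()
        if item is None:
            break
        results.append(item)  # a real consumer would Session.add() here

q = queue.Queue(maxsize=100)  # bounded, so memory use stays flat
results = []
urls = ["http://example.com/%d.xml" % i for i in range(5)]

t1 = threading.Thread(target=producer, args=(urls, q))
t2 = threading.Thread(target=consumer, args=(q, results))
t1.start(); t2.start()
t1.join(); t2.join()
print(len(results))  # → 5
```

With more producer threads feeding one consumer, the downloads overlap while the DB writes stay serialized in a single session.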


Andrey Petrov

Apr 25, 2011, 10:05:02 PM
to sqlal...@googlegroups.com
Hi Aviv,

Since your bottleneck is fetching the urls, I suggest you look at using workerpool with urllib3. It helps you do exactly what Michael describes. (Disclaimer: I wrote both workerpool and urllib3; they were built to complement each other.)

There are even examples of how to use workerpool to download things in a multithreaded fashion:


Just substitute urllib with a urllib3 connection pool, and off you go. Experiment with different numbers of workers, depending on your server (5~15 is usually a good number for high-throughput servers).

For other examples of how to do multithreaded IO stuff, have a look at s3funnel (another tool I wrote using workerpool). s3funnel uses a slightly more interesting setup with EquippedWorkers.

- Andrey