Database performance issues with large number of output files

Maximilian Albert

unread,

May 30, 2013, 12:52:18 PM5/30/13

to sumatr...@googlegroups.com

Hi,

some of my simulations produce a large number of output files (>100).
Unfortunately, with such a large number of files Sumatra takes an
awfully long time to write the data record to the database. I seem to
remember that for a toy simulation which I ran recently the actual
simulation took ca. 1:30 minutes whereas the total runtime of the
sumatra run (including the database access) was ca. 8 minutes.

Is there an obvious way to tune the performance? I have no experience
with SQLite (or other databases, for that matter) and I haven't done
any profiling yet, but my impression is that the slowdown might be due
to the fact that each output file is recorded separately, instead of
writing all of them in a single transaction. Is this right or am I
barking up the wrong tree? I saw that there exists a bulk-create
method [1] in recent versions of django. Could this be useful somehow?
Any suggestions would be much appreciated.

Thanks,
Max

[1] https://docs.djangoproject.com/en/dev/ref/models/querysets/#bulk-create

Andrew Davison

unread,

May 30, 2013, 1:44:08 PM5/30/13

to sumatr...@googlegroups.com

I think it's necessary to do some profiling before trying to optimize. The simplest way I've found to profile "smt" calls is to apply the following diff to the "bin/smt" script:

diff -r 5a79eed0ab46 bin/smt

--- a/bin/smt Thu May 30 08:59:46 2013 -0400

+++ b/bin/smt Thu May 30 13:39:17 2013 -0400

@@ -27,8 +27,14 @@

print(usage)

sys.exit(1)

+import cProfile as profile

+

try:

- main(sys.argv[2:])

+ profile.run("main(sys.argv[2:])", "profile_stats")

except (VersionControlError, RecordStoreAccessError) as err:

print("Error: %s" % err.message)

sys.exit(1)

+

+import pstats

+p = pstats.Stats("profile_stats")

+p.sort_stats('cumulative').print_stats(20)

Can you let us know where most of the 8 minutes is spent?

Thanks,

Andrew

Maximilian Albert

unread,

May 31, 2013, 4:46:58 AM5/31/13

to sumatr...@googlegroups.com

Hi Andrew

> I think it's necessary to do some profiling before trying to optimize.

Yep, totally agree. :)

> The simplest way I've found to profile "smt" calls is to apply the following
> diff to the "bin/smt" script:
>

> [...]

Thanks for that recipe! After sending my email I actually did almost
precisely that. As expected, the costly function turned out to be the
DjangoRecordStore.save() in recordstore/django_store/__init__.py (line
172). I also did a line-profiling of that function, and it turns out
that 80% of the time is spent in line 193, inside the following for
loop:

for key in record.output_data:
db_record.output_data.add(self._get_db_obj('DataKey', key))

Just to make sure that it doesn't add much to the cost I also
temporarily separated out the call to self._get_db_obj(), but it turns
out that nearly all the time is indeed spent in the call to
'db_record.output_data.add()'.

I hope this helps. Any ideas how to optimise it? I think the fact that
writing records to the database takes so long (if there are many
output files) is actually the main reason why I also ran into some of
the concurrency issues with locked databases that were discussed on
the mailing list a while ago (see also this blog post: [1]). Btw, has
Daniel had a chance to submit a patch for postgres support yet? Or are
there any other ways of circumventing the concurrency issues that
could be integrated into Sumatra quickly?

Cheers,
Max

[1] http://wd15.github.io/2013/04/08/configuring-sumatra-for-postgres/

Andrew Davison

unread,

May 31, 2013, 7:25:53 AM5/31/13

to sumatr...@googlegroups.com

On Friday, 31 May 2013 04:46:58 UTC-4, Maximilian Albert wrote:

As expected, the costly function turned out to be the
DjangoRecordStore.save() in recordstore/django_store/__init__.py (line
172). I also did a line-profiling of that function, and it turns out
that 80% of the time is spent in line 193, inside the following for
loop:

for key in record.output_data:
db_record.output_data.add(self._get_db_obj('DataKey', key))

Just to make sure that it doesn't add much to the cost I also
temporarily separated out the call to self._get_db_obj(), but it turns
out that nearly all the time is indeed spent in the call to
'db_record.output_data.add()'.

Looking at the Django documentation, the first thing to try is to replace the for loop by a single call, something like

db_keys = (self._get_db_obj('DataKey', key) for key in record.output_data)

db_record.output_data.add(*db_keys)

Please can you profile that? If it doesn't help, we'll have to delve deeper into Django.

I hope this helps. Any ideas how to optimise it? I think the fact that
writing records to the database takes so long (if there are many
output files) is actually the main reason why I also ran into some of
the concurrency issues with locked databases that were discussed on
the mailing list a while ago (see also this blog post: [1]). Btw, has
Daniel had a chance to submit a patch for postgres support yet?

Yes, in fact I just accepted his pull request yesterday. To use postgres, you will have to create a new Sumatra project and pass "--store=postgres://username:password@hostname/databasename" to "smt init". You should then be able to use "smt sync" to copy your existing records from the old store to the new one (take a backup first, though!).

Cheers,

Andrew

Maximilian Albert

unread,

May 31, 2013, 11:07:49 AM5/31/13

to sumatr...@googlegroups.com

Hi Andrew,

> Looking at the Django documentation, the first thing to try is to replace
> the for loop by a single call, something like
>
> db_keys = (self._get_db_obj('DataKey', key) for key in record.output_data)
> db_record.output_data.add(*db_keys)

That's spot-on! Indeed, it makes the problem disappear completely. For
your entertainment I'm attaching a plot of Sumatra's runtimes for a
large number of output files before and after the change. With your
suggestion the database inserts only add a couple seconds to the
runtime, whereas before it became unberably slow even for a moderate
number of files.

I've just submitted a pull request with your change. Unfortunately,
SQLite seems to have problems with inserts that involve more than
~1000 entries, so I had to split larger inserts into chunks. However,
this should be hardly noticeable in terms of performance (at least it
wasn't in my test cases).

>> Btw, has Daniel had a chance to submit a patch for postgres support yet?
>
> Yes, in fact I just accepted his pull request yesterday. To use postgres,
> you will have to create a new Sumatra project and pass
> "--store=postgres://username:password@hostname/databasename" to "smt init".
> You should then be able to use "smt sync" to copy your existing records from
> the old store to the new one (take a backup first, though!).