Performance of data import

Iain D Keddie

Jul 7, 2005, 6:18:46 AM

to cpat...@googlegroups.com

Hi there,

I'm currently working on ways to improve the import performance of the
cPath admin tool. I have made some sizable performance fixes to the
indexing, mainly by threading the Lucene task and running it on a
multiprocessor machine. I'll feed back my changes once they're properly tested.

The importing process doesn't have such an obvious quick win. I'm
planning on adding some indexes to some of the cPath tables and
running a profiler against it, to look for bottlenecks. Do you have
any advice on areas that can be streamlined/threaded? The XML marshaling
is a little costly, so minimising that may be handy.

We're also having some problems with memory when we import a large
number of large PSI-MI files. We can do a few large files or thousands
of small files, but it dies somewhere in between.

I'm afraid we're still on the 0.3.2 branch at the moment. We're planning
on releasing with that one and introducing a staged upgrade when the
time is right.

Many thanks

Iain K

Ethan Cerami

Jul 8, 2005, 10:45:34 AM

to cpat...@googlegroups.com

Hi Iain:

>
> Hi there,
>
> I'm currently working on ways to improve the import performance of the
> cPath admin tool. I have made some sizable performance fixes to the
> indexing, mainly by threading the Lucene task and running it on a
> multiprocessor machine. I'll feed back my changes once they're properly tested.
>

Cool. Your changes would be very welcome.

> The importing process doesn't have such an obvious quick win. I'm
> planning on adding some indexes to some of the cPath tables and
> running a profiler against it, to look for bottlenecks. Do you have
> any advice on areas that can be streamlined/threaded? The XML marshaling
> is a little costly, so minimising that may be handy.
>

This is a hard problem. I have not actually done much performance
optimization on the import pipeline. Definitely, a big part of the
problem is Castor, as Castor reads the entire XML document into memory,
and this can cause problems when dealing with very large PSI-MI files.
If we moved to a SAX interface, we could probably save a huge amount of
memory, and be able to process much larger PSI-MI files. However, this
would also require a whole lot more hardcoding of PSI-MI specific
elements and attributes.

We have the same problem with importing BioPAX files, but the problem is
actually worse, and I have plans to refactor some of this existing code.

Another problem is data validation and look-ups. For example, for each
xref, we make a database call to see if we already have this specific
protein or the specified database. This results in thousands of mini
queries to the database, and it might make sense to cache some data in
memory rather than going back to MySQL for each mini-query.
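
To make the two ideas above a bit more concrete, here is a rough sketch
(not cPath code) that streams a document with SAX and puts a small
in-memory cache in front of the per-xref look-up. Everything in it is an
assumption for illustration: the class names, the FakeDatabase stand-in
for the MySQL query, and the choice of a "primaryRef" element with "db"
and "id" attributes as the xref to watch for. It also uses generics, so
it assumes Java 5 or later.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: stream xrefs with SAX and cache database look-ups.
class StreamingXrefImport {

    /** Stand-in for the real per-xref MySQL query; counts how often it is hit. */
    static class FakeDatabase {
        int queryCount = 0;

        boolean isKnown(String db, String id) {
            queryCount++;
            return "uniprot".equals(db); // arbitrary rule, for the demo only
        }
    }

    /** SAX handler: sees one element at a time, so the document is never held in memory. */
    static class XrefHandler extends DefaultHandler {
        private final FakeDatabase database;
        private final Map<String, Boolean> cache = new HashMap<String, Boolean>();
        int xrefsSeen = 0;
        int knownXrefs = 0;

        XrefHandler(FakeDatabase database) {
            this.database = database;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            // The element and attribute names are hardcoded here, which is
            // exactly the cost of moving to SAX that is mentioned above.
            if (!"primaryRef".equals(qName)) {
                return;
            }
            xrefsSeen++;
            String db = attrs.getValue("db");
            String id = attrs.getValue("id");
            String key = db + "|" + id;

            // Only go to the database the first time a db/id pair is seen.
            Boolean known = cache.get(key);
            if (known == null) {
                known = database.isKnown(db, id);
                cache.put(key, known);
            }
            if (known) {
                knownXrefs++;
            }
        }
    }

    /** Parses the XML as a stream and returns the handler with its counters. */
    static XrefHandler parse(String xml, FakeDatabase database) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            XrefHandler handler = new XrefHandler(database);
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<entrySet>"
                + "<xref><primaryRef db=\"uniprot\" id=\"P12345\"/></xref>"
                + "<xref><primaryRef db=\"uniprot\" id=\"P12345\"/></xref>"
                + "<xref><primaryRef db=\"other\" id=\"X9\"/></xref>"
                + "</entrySet>";
        FakeDatabase database = new FakeDatabase();
        XrefHandler handler = parse(xml, database);
        System.out.println("xrefs seen: " + handler.xrefsSeen);
        System.out.println("database queries: " + database.queryCount);
    }
}
```

In a real import the cache would need a size limit, since an unbounded
map would just move the memory problem from Castor to the cache.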

Having said that, my only real advice is just to do as you suggest, and
do some profiling to find out where the bottlenecks are. There might be
some surprising areas that we can optimize fairly easily.

Ethan

> We're also having some problems with memory when we import a large
> number of large PSI-MI files. We can do a few large files or thousands
> of small files, but it dies somewhere in between.
>
> I'm afraid we're still on the 0.3.2 branch at the moment. We're planning
> on releasing with that one and introducing a staged upgrade when the
> time is right.
>
> Many thanks
>
> Iain K
>

--
Ethan Cerami
Computational Biology Center
Memorial Sloan-Kettering Cancer Center
http://cbio.mskcc.org
Email: cer...@cbio.mskcc.org
Direct phone: (646) 735-8082