Performance/handling of datasets with a large number of features


Pontus Freyhult

Aug 26, 2021, 4:34:09 AM
to apollo
  Hi,

We're trying to publish a dataset in WebApollo and have been hitting various performance issues - enough that they cross over into actually breaking functionality. We're using the Docker build (https://hub.docker.com/r/gmod/apollo).

The issues we see do not seem to be related to the genome size (~700 Mb) or the number of contigs (12). What does stand out is the number of mRNAs, which reaches 142,546 (the number of rows in the `features` table is roughly 10 times that; this ratio also seems to hold for most of our datasets).

Loading this information takes forever (roughly 7-10 days for our latest runs, if I recall correctly); while we're not happy about that, we can live with it.

The "performance issues" we see are partly in resource usage for WebApollo when trying to load annotations: we kept running out of memory or had the system thrash and die. Finally, running on a large node, memory usage seems to plateau at just below 90 Gbytes when the client tries to load the annotations.

But even with that larger node, where the server doesn't have issues, the client never manages to actually load the annotations. I haven't traced exactly where it goes wrong, but after some 40-60 minutes something seems to break.

(We'd have a bit of a problem dedicating a node that size to this publication, but it seemed worthwhile to figure out how much memory it wanted and to test whether that worked, which it didn't.)

Does anyone have experience with this, or any suggestions? We've also been discussing various ways of splitting it (e.g. having several organisms), but haven't been really happy with how to do that or with the usability of the result.

We'd be happy for any suggestions.

cheers,
  /Pontus

Nathan Dunn

Aug 26, 2021, 10:17:13 AM
to Pontus Freyhult, apollo

Suggestions:

1 - Build natively (not on Docker) so you can tune both Tomcat and Postgres if you need to.

2 - Can you load annotations in smaller blocks via a script? (See the sketch after this list.)

3 - Instead of loading all annotations into the manual curation track at the top, you can load ONLY the tracks that differ from your official / predicted gene set and merge the two on export when you are done annotating. This is what a lot of groups do.
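For what it's worth, a minimal sketch of how such a split could be scripted. This is generic GFF3 handling in Python, not Apollo tooling; paths and file naming are arbitrary, and each resulting chunk would then be fed to whatever loading script you already use.

    # Sketch: split a GFF3 by scaffold so each chunk can be submitted on its own.
    # Generic GFF3 handling, not Apollo tooling; paths/naming are arbitrary.
    import sys
    from collections import defaultdict

    def split_gff3_by_scaffold(path):
        headers, chunks = [], defaultdict(list)   # seqid -> feature lines
        with open(path) as fh:
            for line in fh:
                if line.startswith("##FASTA"):
                    break                          # don't carry embedded sequence into chunks
                if line.startswith("#"):
                    headers.append(line)           # keep ##gff-version etc. in every chunk
                elif line.strip():
                    chunks[line.split("\t", 1)[0]].append(line)
        for seqid, lines in chunks.items():
            with open(f"{path}.{seqid}.gff3", "w") as out:
                out.writelines(headers)
                out.writelines(lines)

    if __name__ == "__main__":
        split_gff3_by_scaffold(sys.argv[1])

Chunking further within a scaffold (e.g. N genes at a time) would need Parent tracking, so one scaffold per chunk is the simplest safe split.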

I’m not sure what the 90GB is about.  That definitely doesn’t seem correct.

Nathan



Pontus Freyhult

Sep 3, 2021, 9:46:26 AM
to apollo, ndu...@gmail.com, apollo, Pontus Freyhult
  Hi,

and thanks for the response (and sorry for the delayed follow-up).

1 - We have a bit of a special situation where we feel a dockerized setup simplifies things. That said, we're perfectly comfortable applying any tuning (though so far we don't see any tuning that would get us out of this situation), and as a last resort we can set up our own custom container builds.

2 - Sorry if this was unclear: getting data into the Apollo instance is already done that way, by a script (this takes 8-10 days for this dataset, which is long but we can live with it).

3 - I think this may be roughly what we want, although I assume we would need to split into many more tracks, and we would appreciate suggestions on how to do the split. I'm also not sure we understand how to get it to do the join on export; do you have any good pointers?

(90 Gbytes is the memory usage at which the server flattens out when a client is trying to load the track. We think this is the maximum required for this dataset, but we are not certain, as we can't get a client to successfully load the track - it seems to get lost after half an hour or so. From other datasets we have the impression that slightly less memory is required by the server the first time a client loads the track, but that's nothing we feel certain about; the server using 90 Gbytes for this dataset is certain, though.)
 
regards,
  /Pontus

Nathan Dunn

Sep 3, 2021, 11:14:33 AM
to Pontus Freyhult, apollo


What script are you using to do this? 


It will use the same API, but will allow you to specify batch_size (default is 1), which may both be faster and lead to a lower memory footprint. 

The default perl script loads all of what you have at once, I believe, which is probably not what you want.

The stack will readily use the available memory, so if 90 GB is provided, it may use all of that, leaving stuff in cache. You should never need all of that, however.
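Roughly, a batched submission loop could look like the sketch below. The endpoint URL and payload shape here are placeholders rather than Apollo's documented API, so adapt them to whatever your existing script already posts; batch_size and the pause are arbitrary.

    # Sketch of batched submission with a pause between batches.
    # The endpoint and payload below are placeholders, not Apollo's actual API.
    import time
    import requests

    def submit_in_batches(features, url, username, password, batch_size=50):
        for start in range(0, len(features), batch_size):
            batch = features[start:start + batch_size]
            resp = requests.post(
                url,  # placeholder endpoint for whatever service your script talks to
                json={"username": username, "password": password, "features": batch},
                timeout=600,
            )
            resp.raise_for_status()
            time.sleep(1)  # give the server a moment between batches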

Nathan

Pontus Freyhult

Sep 3, 2021, 11:32:18 AM
to Nathan Dunn, apollo
Hi,

We have a perl script we use for submitting; its origins are somewhat unclear, but I think it's purely custom. It does split uploads in a way that is good enough for us (and during uploads we tend to be mostly I/O-bound on Postgres). But again, we can live with the time that takes.

And this dataset seems to need just south of a 90 Gbyte heap. I have obviously not tried every lower value, but IIRC e.g. a 70 Gbyte heap will lead to either an OutOfMemoryError due to lack of heap space or "GC overhead limit exceeded". We've also gone through a number of iterations moving to larger machines (which are more cumbersome to allocate for testing), so I'm quite confident this is the actual memory requirement for this specific dataset (though I also believe we may be hitting a weird case).
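For reference, heap bounds like these are usually passed to Tomcat through CATALINA_OPTS. Assuming the gmod/apollo image runs stock Tomcat (worth verifying against the image's entrypoint), and with example heap sizes that are placeholders, something like the following would apply them to the container:

    # Assumption: stock catalina.sh picks up CATALINA_OPTS from the environment.
    docker run -d -p 8080:8080 \
      -e CATALINA_OPTS="-Xms8g -Xmx90g -XX:+HeapDumpOnOutOfMemoryError" \
      gmod/apollo

The heap-dump flag is just a standard JVM option that can help pin down what is filling the heap when an OutOfMemoryError hits.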

/Pontus

Nathan Dunn

Sep 3, 2021, 11:35:25 AM
to Pontus Freyhult, apollo

Is it Apollo's JVM heap?

This was the older official perl script (so things change). 


If you are rolling your own, I would strongly suggest submitting in batches as there is no requirement that they come in all at once.  Minimally, I would do one chromosome / scaffold at a time.  

Also, what version of Apollo are you using? 

Nathan

Pontus Freyhult

Sep 3, 2021, 12:07:14 PM
to Nathan Dunn, apollo
Hi,

and yes, the Apollo JVM heap.

It looks like our perl script is based on that, yes. We might try the
python variant with a future project, but for now, I think we're not
too unhappy with our current setup.

It's not obvious what version is being run, but the
META-INF/MANIFEST.MF from the actual war file says 2.6.6-SNAPSHOT.
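(For reference, one way to read that manifest without unpacking the whole war; "apollo.war" here is a placeholder for whatever the deployed file is actually called:)

    unzip -p apollo.war META-INF/MANIFEST.MF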