
Data export very slow


Marc Hoeppner

Oct 17, 2023, 9:04:24 AM
to apollo
Hi,

so I am in the process of wrapping up a number of projects that we are hosting through a local WebApollo instance. The instance is currently running on a virtualized host with 16 GB RAM, using Docker and the gmod/2.6.6 container. We have 26 genomes on there, each around 1 GB in size and with roughly 20,000 user-curated models.

I would like to export each of those curation tracks, either through the web interface or, better yet, via a script from the command line (get_gff3.groovy, which I found as part of the code base).
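
For the command-line route, as far as I can tell get_gff3.groovy just wraps the IOService write endpoint, so stripped down my request boils down to roughly the following (URL, credentials and organism/sequence names are placeholders, and the parameter names are written from memory of the web-services docs, so treat them as approximate):

# Rough sketch of the export request I'm effectively making; endpoint and
# parameter names are as I remember them from the web-services docs and the
# groovy script, so double-check them -- everything else is a placeholder.
import requests

APOLLO = "https://apollo.example.org/apollo"   # placeholder URL

payload = {
    "username": "me@example.org",              # placeholder credentials
    "password": "changeme",
    "type": "GFF3",                            # export the user-curated models as GFF3
    "organism": "my_organism",                 # placeholder organism common name
    "sequences": ["contig_0001"],              # the single test contig
    "output": "text",                          # return the GFF3 in the response body
    "format": "text",
    "exportAllSequences": False,
}

resp = requests.post(f"{APOLLO}/IOService/write", json=payload, timeout=600)
resp.raise_for_status()

with open("contig_0001.gff3", "w") as fh:
    fh.write(resp.text)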

In either case, requesting a download is very, very slow. I limited it to a single contig that happened to have only one gene model on it - and even that took easily over a minute.

So my question is - which knobs do I need to turn to speed this up? Is the SQL database (Postgres 9) simply too large now? Do I need to allocate (much) more RAM to the VM? 

Happy to provide additional information if needed. 

Cheers,
Marc

Helena Rasche

Oct 18, 2023, 4:36:15 AM
to Marc Hoeppner, apollo

I guess a number of us use a separate Python library and its associated command-line tool (arrow), but I doubt that alone will give you a significant speed-up.
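
For reference, the library end of it is only a few lines - the URL and credentials below are placeholders and I'm quoting method names from memory, so check the python-apollo docs:

# Minimal sketch of using the python-apollo library ("arrow" is its CLI).
# URL and credentials are placeholders; method and key names are from memory,
# so verify them against the python-apollo documentation.
from apollo import ApolloInstance

wa = ApolloInstance("https://apollo.example.org/apollo",
                    "me@example.org", "changeme")

# List organisms -- essentially the same data the web UI's findAllOrganisms call returns.
for org in wa.organisms.get_organisms():
    print(org["commonName"], org.get("annotationCount"))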

One of the main speed-ups I take advantage of is proxying the findAllOrganisms call through a faster implementation. (Sorry Nathan, I'm rubbish at Groovy or we would have contributed it upstream.)
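
Roughly the shape of it, though mine does a bit more - a minimal sketch assuming the stock POST /organism/findAllOrganisms endpoint, with URL and port as placeholders. It naively caches a single response, which is only fine for us because the organism list barely changes:

# Minimal caching proxy sketch for the organism list. The Apollo URL, port
# and TTL are placeholders; credentials are passed straight through from the
# client, and one cached response is shared by everyone (naive on purpose).
import time
import requests
from flask import Flask, request, Response

APOLLO = "https://apollo.example.org/apollo"   # placeholder upstream URL
TTL = 300                                      # re-fetch the organism list every 5 minutes

app = Flask(__name__)
_cache = {"body": None, "at": 0.0}

@app.route("/organism/findAllOrganisms", methods=["POST"])
def find_all_organisms():
    now = time.time()
    if _cache["body"] is None or now - _cache["at"] > TTL:
        upstream = requests.post(f"{APOLLO}/organism/findAllOrganisms",
                                 json=request.get_json(force=True), timeout=120)
        upstream.raise_for_status()
        _cache.update(body=upstream.content, at=now)
    return Response(_cache["body"], mimetype="application/json")

if __name__ == "__main__":
    app.run(port=8081)                         # placeholder port; front it with your web server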

Generally it isn't the database (Postgres is very fast) but the queries that could be optimised further. (That said, the gene model export is genuinely complicated.)

Ciao,
Helena


Marc Hoeppner

Oct 19, 2023, 9:02:53 AM
to apollo, Helena Rasche, Marc Hoeppner
Thanks for the feedback. 

I have started moving the installation to a new machine, which includes dumping the SQL database and re-importing it into the new installation. It turns out the SQL dump has around 27 million lines, which I assume equates to many, many entries in the features table.

Without trying to navigate the REST documentation: is there any information on how the API runs its queries against the features table? Is there any point in indexing this table, and if so, on which columns (i.e. which SQL calls are performed "under the hood")? Failing that, are there any utility scripts around to selectively remove individual organisms and their data - in a timely manner, that is; if it takes as long as dumping the data, it wouldn't do much good ;)

Happy with nerdy SQL statements - I am guessing the REST stuff is, as mentioned, poorly optimized. 
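
In the meantime, this is roughly how I plan to see what actually hits Postgres during an export, and which indexes already exist - connection details are placeholders and the ALTER SYSTEM bit needs a superuser connection:

# Rough sketch: turn on statement logging so the Postgres log shows exactly
# which queries an export triggers, and list the indexes that already exist
# on the feature-related tables. Connection details are placeholders and
# ALTER SYSTEM requires superuser rights.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="apollo",
                        user="apollo", password="changeme")
conn.autocommit = True
cur = conn.cursor()

# Log every statement that takes longer than 250 ms (set to 0 to log everything).
cur.execute("ALTER SYSTEM SET log_min_duration_statement = 250;")
cur.execute("SELECT pg_reload_conf();")

# Which indexes do the feature tables already have?
cur.execute("""
    SELECT tablename, indexname, indexdef
    FROM pg_indexes
    WHERE schemaname = 'public' AND tablename LIKE 'feature%'
    ORDER BY tablename;
""")
for table, name, ddl in cur.fetchall():
    print(table, name, ddl, sep="\t")

cur.close()
conn.close()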

Cheers,
Marc

Marc Hoeppner

Oct 19, 2023, 9:19:20 AM
to apollo, Marc Hoeppner, Helena Rasche
Addendum - I have just finished importing the SQL dump into the new database. When I start the Apollo container with the same options as on the "production" machine, I now get all my organisms and tracks - but my user database and all the curation tracks are empty. I must be missing something?

/M

Nathan Dunn

Oct 19, 2023, 5:05:00 PM
to Marc Hoeppner, apollo, Helena Rasche

Helena Rasche

Oct 20, 2023, 3:52:53 AM
to Nathan Dunn, Marc Hoeppner, apollo
I guess that's the organism list, but I'm not sure that's what's slow for Marc when doing an export. Marc, your network logs might indicate which API could use some help?

Just wanted to second Nathan's comments: for exporting features it's definitely a lot of data spread across a large number of tables (organism, feature, featureLoc, annotations, ontologies) that needs to be pulled from the DB, which makes it a bit hairy.
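
To give a feel for the shape of it, per organism an export has to walk something like the join below. The table and column names are my rough guess at the Grails schema rather than the SQL Apollo actually generates, so check them against \d in psql before trusting the plan:

# Sketch only: roughly the join an export has to walk for one organism.
# Table/column names are guesses at Apollo's schema, not the generated SQL --
# verify them in psql first. Connection details and organism name are placeholders.
import psycopg2

QUERY = """
EXPLAIN ANALYZE
SELECT f.id, f.name, fl.fmin, fl.fmax, s.name AS sequence_name
FROM feature f
JOIN feature_location fl ON fl.feature_id = f.id
JOIN "sequence" s        ON s.id = fl.sequence_id
JOIN organism o          ON o.id = s.organism_id
WHERE o.common_name = %s;
"""

conn = psycopg2.connect(host="localhost", dbname="apollo",
                        user="apollo", password="changeme")
with conn, conn.cursor() as cur:
    cur.execute(QUERY, ("my_organism",))
    for (line,) in cur.fetchall():
        print(line)
conn.close()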

(unlike Nathan I'd keep this structure but I'm also deeply allergic to json columns and denormalization 😅)

Ciao,
H
[Attachment: Apollo.png]