
Data export very slow


Marc Hoeppner

Oct 17, 2023, 9:04:24 AM
to apollo
Hi,

so I am in the process of wrapping up a number of projects that we are hosting through a local WebApollo instance. The instance is currently running on a virtualized host with 16 GB RAM, using Docker and the gmod/2.6.6 container. We have 26 genomes on there, each around 1 GB in size and with roughly 20,000 user-curated models.

I would like to export each of those curation tracks, either through the web interface or, better yet, via a script from the command line (get_gff3.groovy, which I found as part of the code base).
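
For the command-line route, as far as I can tell get_gff3.groovy just wraps the IOService write endpoint, so stripped down my request boils down to roughly the following (URL, credentials and organism/sequence names are placeholders, and the parameter names are written from memory of the web-services docs, so treat them as approximate):

# Rough sketch of the export request I'm effectively making; endpoint and
# parameter names are as I remember them from the web-services docs and the
# groovy script, so double-check them -- everything else is a placeholder.
import requests

APOLLO = "https://apollo.example.org/apollo"   # placeholder URL

payload = {
    "username": "me@example.org",              # placeholder credentials
    "password": "changeme",
    "type": "GFF3",                            # export the user-curated models as GFF3
    "organism": "my_organism",                 # placeholder organism common name
    "sequences": ["contig_0001"],              # the single test contig
    "output": "text",                          # return the GFF3 in the response body
    "format": "text",
    "exportAllSequences": False,
}

resp = requests.post(f"{APOLLO}/IOService/write", json=payload, timeout=600)
resp.raise_for_status()

with open("contig_0001.gff3", "w") as fh:
    fh.write(resp.text)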

In either case, requesting a download is very, very slow. I limited it to a single contig that happened to have only one gene model on it - and even that took easily over a minute.

So my question is - which knobs do I need to turn to speed this up? Is the SQL database (Postgres 9) simply too large now? Do I need to allocate (much) more RAM to the VM? 

Happy to provide additional information if needed. 

Cheers,
Marc

Helena Rasche

Oct 18, 2023, 4:36:15 AM
to Marc Hoeppner, apollo

I guess a number of us use a separate Python library and its associated command-line tool (arrow), but I doubt that alone will give you a significant speed-up.
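
For reference, the library end of it is only a few lines - the URL and credentials below are placeholders and I'm quoting method names from memory, so check the python-apollo docs:

# Minimal sketch of using the python-apollo library ("arrow" is its CLI).
# URL and credentials are placeholders; method and key names are from memory,
# so verify them against the python-apollo documentation.
from apollo import ApolloInstance

wa = ApolloInstance("https://apollo.example.org/apollo",
                    "me@example.org", "changeme")

# List organisms -- essentially the same data the web UI's findAllOrganisms call returns.
for org in wa.organisms.get_organisms():
    print(org["commonName"], org.get("annotationCount"))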

One of the main speed-ups I take advantage of is proxying the findAllOrganisms call through a faster implementation. (Sorry Nathan, I'm rubbish at Groovy or we would have contributed it upstream.)
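
Roughly the shape of it, though mine does a bit more - a minimal sketch assuming the stock POST /organism/findAllOrganisms endpoint, with URL and port as placeholders. It naively caches a single response, which is only fine for us because the organism list barely changes:

# Minimal caching proxy sketch for the organism list. The Apollo URL, port
# and TTL are placeholders; credentials are passed straight through from the
# client, and one cached response is shared by everyone (naive on purpose).
import time
import requests
from flask import Flask, request, Response

APOLLO = "https://apollo.example.org/apollo"   # placeholder upstream URL
TTL = 300                                      # re-fetch the organism list every 5 minutes

app = Flask(__name__)
_cache = {"body": None, "at": 0.0}

@app.route("/organism/findAllOrganisms", methods=["POST"])
def find_all_organisms():
    now = time.time()
    if _cache["body"] is None or now - _cache["at"] > TTL:
        upstream = requests.post(f"{APOLLO}/organism/findAllOrganisms",
                                 json=request.get_json(force=True), timeout=120)
        upstream.raise_for_status()
        _cache.update(body=upstream.content, at=now)
    return Response(_cache["body"], mimetype="application/json")

if __name__ == "__main__":
    app.run(port=8081)                         # placeholder port; front it with your web server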

Generally it isn't the database (Postgres is very fast) but the queries that could be optimised further. (That said, the gene model export is genuinely complicated.)

Ciao,
Helena


Marc Hoeppner

Oct 19, 2023, 9:02:53 AM
to apollo, Helena Rasche, Marc Hoeppner
Thanks for the feedback. 

I have started moving the installation to a new machine, which includes dumping the SQL database and re-importing it into the new installation. It turns out the SQL dump has around 27 million lines, which I assume equates to many, many entries in the features table.

Without trying to navigate the REST documentation: is there any information on how the API runs its queries against the features table? Is there any point in indexing this table, and if so, on which columns (i.e. which SQL calls are performed "under the hood")? Failing that, are there any utility scripts around to selectively remove individual organisms and their data - in a timely manner, that is; if it takes as long as dumping the data, it wouldn't do much good ;)

Happy with nerdy SQL statements - I am guessing the REST stuff is, as mentioned, poorly optimized. 
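
In the meantime, this is roughly how I plan to see what actually hits Postgres during an export, and which indexes already exist - connection details are placeholders and the ALTER SYSTEM bit needs a superuser connection:

# Rough sketch: turn on statement logging so the Postgres log shows exactly
# which queries an export triggers, and list the indexes that already exist
# on the feature-related tables. Connection details are placeholders and
# ALTER SYSTEM requires superuser rights.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="apollo",
                        user="apollo", password="changeme")
conn.autocommit = True
cur = conn.cursor()

# Log every statement that takes longer than 250 ms (set to 0 to log everything).
cur.execute("ALTER SYSTEM SET log_min_duration_statement = 250;")
cur.execute("SELECT pg_reload_conf();")

# Which indexes do the feature tables already have?
cur.execute("""
    SELECT tablename, indexname, indexdef
    FROM pg_indexes
    WHERE schemaname = 'public' AND tablename LIKE 'feature%'
    ORDER BY tablename;
""")
for table, name, ddl in cur.fetchall():
    print(table, name, ddl, sep="\t")

cur.close()
conn.close()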

Cheers,
Marc

Marc Hoeppner

Oct 19, 2023, 9:19:20 AM
to apollo, Marc Hoeppner, Helena Rasche
Addendum - I have just finished importing the SQL dump into the new database. When I start the Apollo container with the same options as on the "production" machine, I now get all my organisms and tracks - but my user database and all the curation tracks are empty. I must be missing something?

/M

Nathan Dunn

Oct 19, 2023, 5:05:00 PM
to Marc Hoeppner, apollo, Helena Rasche

Helena Rasche

Oct 20, 2023, 3:52:53 AM
to Nathan Dunn, Marc Hoeppner, apollo
I guess that's the organism list, but I'm not sure that's what's slow for Marc when doing an export. Marc, your network logs might indicate which API could use some help?

Just wanted to second Nathan's comments: for exporting features it's definitely a lot of data spread across a large number of tables (organism, feature, featureLoc, annotations, ontologies) that needs to be pulled from the DB, which makes it a bit hairy.
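
To give a feel for the shape of it, per organism an export has to walk something like the join below. The table and column names are my rough guess at the Grails schema rather than the SQL Apollo actually generates, so check them against \d in psql before trusting the plan:

# Sketch only: roughly the join an export has to walk for one organism.
# Table/column names are guesses at Apollo's schema, not the generated SQL --
# verify them in psql first. Connection details and organism name are placeholders.
import psycopg2

QUERY = """
EXPLAIN ANALYZE
SELECT f.id, f.name, fl.fmin, fl.fmax, s.name AS sequence_name
FROM feature f
JOIN feature_location fl ON fl.feature_id = f.id
JOIN "sequence" s        ON s.id = fl.sequence_id
JOIN organism o          ON o.id = s.organism_id
WHERE o.common_name = %s;
"""

conn = psycopg2.connect(host="localhost", dbname="apollo",
                        user="apollo", password="changeme")
with conn, conn.cursor() as cur:
    cur.execute(QUERY, ("my_organism",))
    for (line,) in cur.fetchall():
        print(line)
conn.close()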

(unlike Nathan I'd keep this structure but I'm also deeply allergic to json columns and denormalization 😅)

Ciao,
H
[Attachment: Apollo.png]