How to speed up ERDDAP startup


marco...@ettsolutions.com

May 8, 2018, 3:15:45 AM
to ERDDAP
Hi all,
at the moment I have something like 53,000 datasets in EMODnet Physics ERDDAP (http://erddap.emodnet-physics.eu/erddap/index.html).
If I need to restart the Tomcat service or reboot the server (rarely, but it can happen), ERDDAP startup takes something like 5 hours to reload all the datasets.
Is there a way to speed up ERDDAP startup, maybe by storing the last LoadDatasets result on disk (in a file or database) in order to avoid re-reading all the NetCDF files?

Thanks

Ciao

Marco

Bob Simons

May 8, 2018, 11:27:51 AM
to ERDDAP
53,000! Excellent!

All of the EDD...From...Files datasets maintain a database-like list of information about each file. This info is stored in files in [bigParentDirectory]/dataset/[last2LettersOfDatasetID]/[datasetID]/ . To gather this information, a given data file is only read once (when it first appears), unless its file size or last modified date changes. So when ERDDAP is restarted, it should not be reading the data files (unless you changed the location of [bigParentDirectory], forcing ERDDAP to regenerate all that information).  If you see in the logs that ERDDAP is reading all the data files, please email the log to me directly.

When ERDDAP starts up, it can do a quickRestart or a regular restart. That is controlled by 
<quickRestart>true</quickRestart>
in your setup.xml file. If that tag is absent or set to false,  that would certainly explain what you are seeing.
With a quickRestart, ERDDAP attempts to load the datasets as quickly as possible, based on the stored information (as described above, and similarly for other types of datasets). It doesn't look for new or changed data files. So the datasets should load very quickly. If we assume ERDDAP can load 5/second (conservative), that's 300/min, or 18,000/hour, or about 3 hours for 53,000 datasets. So most of the problem may just be that huge number of datasets. 
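
That arithmetic can be sanity-checked in a few lines (the 5 datasets/second rate is the conservative assumption from above, not a measured figure):

```python
RATE = 5  # assumed quickRestart load rate, datasets per second

def restart_hours(n_datasets, rate=RATE):
    """Estimated hours to load n_datasets at `rate` datasets/second."""
    return n_datasets / rate / 3600

print(f"{restart_hours(53_000):.1f} hours")  # ~2.9 hours for 53,000 datasets
```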

Things which cause a dataset to be loaded more slowly:
* A subscription tied to a dataset causes an email to be sent (bad case: ~10 seconds) or a URL to be contacted (bad case: ~5 minutes).
* Some non-local-file and non-Opendap dataset EDD... types are slow to reload.
* Some remote 
* If ERDDAP's access to the "local" files or to the [bigParentDirectory] is slow, that will slow down ERDDAP.

Things you can do to speed up a restart:
* Set <quickRestart> to true.
* Get an SSD and move [bigParentDirectory] to it. SSDs are now relatively cheap ($250/TB). An SSD will speed up a large number of tasks in ERDDAP by ~5X.
* Consider aggregating datasets with the same type of information (e.g., from a given type of sensor) into one dataset. So you might end up with 53 datasets with ~1000 sensors in each. But I understand you might prefer your current setup.

Ways to diagnose this problem:
* Look in the log.txt file. Each dataset will indicate the time that the constructor took. Look for unusually large times. Try to figure out why those datasets loaded slowly.

I hope that helps. If not, please contact me directly and we'll try to figure out your specific case.

Marco Alba

May 9, 2018, 4:07:22 AM
to ERDDAP
Hi Bob (and all in the group),
thanks for the info!  <quickRestart> is already set to true, there are no subscriptions, and all files are on a local (non-SSD) disk. I'll try to ask my boss for an SSD.

For the aggregation of datasets... at the moment this is not easy.
I have 2 main types of datasets:
* latest (7,925 datasets at the moment): one dataset for each fixed station, with the last 60 days of data.
* monthly (37,605 datasets at the moment): one dataset for each fixed station, with data older than 60 days.
(I also have file datasets for downloading the .nc files from the last 60 days, but those datasets are 'easy'.)

The latest and monthly datasets are built using heavily modified NCML files for each .nc file, because the .nc files are not homogeneous/normalized, not only across different platforms but even within the same platform! (If you remember, you helped me on this matter when I started the ERDDAP configuration.)
The script that generates these NCML files is a collection of special cases, and almost every day I need to check whether the data producers have inserted some new 'gift' in their .nc files.

I've tried to aggregate data for each type of platform (mooring, drifting buoy, etc.), but it is hell without modifying the .nc files, and modifying the .nc files is not an option at the moment.
We are working hard with the data producers to get them to create homogeneous, normalized files... it is not acceptable that one day the fill value for a variable is "-999" and the next day it is "9999" (and this is the "easiest" case).

Thank you very much for the support (as always!)

Ciao


_________________________________________

Marco Alba

ETT spa
People and Technology

www.ettsolutions.com
marco...@ettsolutions.com

Via Sestri 37  16154 Genova - Italy - Tel: +39 010 6519116  Fax: +39 010 6518540
Via Giulio Venticinque 38  00136 Roma - Italy - Tel. +39 06 37352352  Fax +39 06 37358533
ETT Solutions Ltd - 2-4 Hoxton Square – London N1 6NU – UK – Tel. +44 (0)20 8133 8135

Facebook  |  Twitter  |  YouTube  |  Flickr  |  Instagram  |  LinkedIn

The information contained in this communication and its attachments may be confidential and is, in any case, intended exclusively for the persons or Company indicated above. Dissemination, distribution and/or copying of the transmitted document by anyone other than the addressee is prohibited, both under art. 616 of the Italian Criminal Code and under Legislative Decree no. 196/2003. If you have received this message in error, please destroy it and inform us immediately by telephone at +39 010 6519116 or by sending a message to the e-mail address in...@ettsolutions.com



Bob Simons

May 9, 2018, 11:22:07 AM
to ERDDAP
Thanks for the information.

1) In setup.xml, I encourage you to set logMaxSizeMB to a number larger than the default (which I think is 20), e.g., 100, so that more information is saved before being lost.
<logMaxSizeMB>100</logMaxSizeMB> 

2) Look in log.txt (and log.txt.previous if that has been created) right after ERDDAP has finished loading all the datasets after you restart ERDDAP (ideally, save a copy of these so you can analyze them later).
Look for "constructor finished." There will be one of those for each dataset. They will tell you the time it took to load each dataset.
2a) Can you estimate the average time?
2b) Look for datasets that took a long time to load. What is different about them?
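
A small script can pull those numbers out of log.txt. Note that the "constructor finished. TIME=...ms" pattern below is an assumption about the log format; check a few lines of your own log.txt and adjust the regex if it differs:

```python
import re
import statistics

# Assumed log line shape (verify against your log.txt):
# *** EDDTableFromMultidimNcFiles constructor finished. TIME=1234ms
PATTERN = re.compile(r"constructor finished.*?TIME=(\d+)")

def load_times(lines):
    """Per-dataset load times in ms, one per 'constructor finished' line."""
    return [int(m.group(1)) for m in map(PATTERN.search, lines) if m]

def summarize(times, slowest=10):
    """2a: the average time; 2b: the slowest loaders, worth investigating."""
    return {
        "datasets": len(times),
        "average_ms": statistics.mean(times),
        "slowest_ms": sorted(times, reverse=True)[:slowest],
    }

# In practice: summarize(load_times(open("log.txt", errors="replace")))
```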

3) In the long run, perhaps the best solution is to aggregate the 53000 datasets into ~20 datasets.
Different missing values in different files prevent aggregation. I see that you are using EDDTableFromMultidimNcFiles. I will try to make a new dataset type, EDDTableFromMultidimNcFilesUnpacked, which, like EDDGridFromNcFilesUnpacked, unpacks any packed values (by applying add_offset and scale_factor), makes all missing values consistent (e.g., NaN for floats and doubles), and makes all time values consistent (seconds since 1970-01-01) at a low level (when the files are read). Then you can aggregate files that have different missing values. If the files have different variable names for the same concept, you can use .ncml to change the variable names in individual files.
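
For the renaming, a hypothetical per-file .ncml wrapper might look like this (the file name and variable names are invented; orgName is the standard NcML attribute for renaming):

```xml
<!-- station_042.ncml: rename TEMP so this file aggregates with files
     that already use sea_water_temperature -->
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
        location="station_042.nc">
  <variable name="sea_water_temperature" orgName="TEMP"/>
</netcdf>
```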
I think that using aggregation to go from 53000 datasets to maybe 20 datasets will have several advantages:
* ERDDAP will restart and reload all datasets much much faster (a few seconds instead of 5 hours).
* Once the aggregated datasets are defined, you won't need to spend any more time creating new datasets in ERDDAP, although you may spend more time writing .ncml or making scripts to generate the .ncml.
* It will be much easier for users to work with multiple stations, because all of the stations in one dataset will have consistent variable names, consistent missing values, and consistent units. 
* Users will be able to easily query multiple stations at once (perhaps using a lat lon bounding box or a regular expression to select the desired stations) because all similar stations will be in one dataset.

Well, that's my suggestion. Take it with a grain of salt because I know very little about the details of these datasets and all the problems you face. 

I hope that helps.

Rob Fuller

May 10, 2018, 9:01:19 AM
to Bob Simons, ERDDAP
Hi,

This is an interesting thread. I have two main thoughts:

1. I remember from many years ago that using (Linux) hardlinks can be very useful when you want to maintain access to data files that may be rolled over (backing up a Solr index was the use case: http://lucene.472066.n3.nabble.com/Proper-way-to-backup-solr-tp4168498p4168748.html ). It might be worth considering for ERDDAP file datasets (I think they may currently go offline during a refresh?).

2. Whatever optimisations you come up with, the real problem outlined by Marco is one of scalability. The best solution would be to distribute those thousands of datasets across a number of backend ERDDAP servers, each responsible for some of the datasets; the frontend ERDDAPs would merely delegate/proxy requests to the appropriate backend server. Some consideration might also be given to HA, which might involve each dataset being on two ERDDAP backends, and ultimately everything being in two data centers. (Bob, is there an easy way to add "all datasets from another server"?)

Kind regards,
Rob.


Bob Simons

May 10, 2018, 11:45:55 AM
to ERDDAP
Interesting. Thanks for your comments.

1) I'm not sure hardlinks would help here. ERDDAP datasets don't go offline when they are refreshed. Instead, ERDDAP works in the background to make the new version of the dataset, then swaps it to be the accessible version when it is ready. If the old version is still processing a request when it is swapped out, it continues to process that request.  In any case, that isn't the issue here -- here we're dealing with datasets first being created after restarting ERDDAP.

2) Although having back end ERDDAPs solves a lot of problems, it doesn't solve this one. There is still the problem: when you restart the front end ERDDAP (e.g., after installing a new version), it still has to load those 53000 datasets. Given the quickRestart system, having the 53000 reside on back end servers won't allow the datasets on the front end ERDDAP to be loaded any faster. 

But I think you are right about other High Availability options, notably the option of having 2 front end ERDDAP servers (hot+warm). You could make the backup (warm) one live when it is time to update the active (hot) one.  This does solve (sidestep) the problem. But I would still like to solve the underlying problem. And I still think that aggregating similar datasets is a good solution because it solves the slow restart problem but also makes for a better user experience and simplifies administration of these datasets, e.g., the problem of missing_values changing each day.

2a) Yes, there is an easy way to add links to all of the datasets from a remote ERDDAP to your ERDDAP. Use the EDDGridFromErddap and EDDTableFromErddap options in GenerateDatasetsXml. See
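
Roughly, GenerateDatasetsXml produces a lightweight datasets.xml entry of this shape (the datasetID and URL below are placeholders): a local dataset that redirects requests to the same dataset on the remote ERDDAP.

```xml
<dataset type="EDDTableFromErddap" datasetID="remoteStations" active="true">
  <sourceUrl>https://remote.example.org/erddap/tabledap/remoteStations</sourceUrl>
</dataset>
```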

Rob Fuller

May 10, 2018, 12:09:30 PM
to Bob Simons, ERDDAP
Hi Bob,

Thanks.

2) I mean that scalability could be solved by a frontend ERDDAP which (mostly) proxies requests to the backend servers rather than answering for itself. Obviously it is a bit more complicated for the info and allDatasets special cases, as these must be aggregated from the responses of the backend servers. I appreciate this doesn't exist yet.

2a) I suppose it would need to be something dynamic, like AllGridsAndTablesFromOtherErddap, such that datasets added to or removed from the backend would appear or disappear at the frontend ;-)

Marco Alba

May 11, 2018, 4:05:00 AM
to ERDDAP
Hi All,
thanks for the interesting discussion!

My final goal is to reduce the number of datasets to 15, matching the number of different types of platforms, especially for this point (my boss really wants it!):

* Users will be able to easily query multiple stations at once (perhaps using a lat lon bounding box or a regular expression to select the desired stations) because all similar stations will be in one dataset.

At the moment I have not been able to do it, but if Bob can successfully create the EDDTableFromMultidimNcFilesUnpacked dataset type, I think I can make it! (uhm... no pressure here ;) )

Thanks

Ciao!


