Problem with max(time) query


PAUL WHELAN

May 24, 2023, 9:21:55 AM
to ERDDAP
Hi folks,

I am new to this Google Group, but our team has been using ERDDAP for quite some time for a near real-time monitoring application that displays status and science data from deployed assets.

Our ERDDAP datasets have current data trickling in at regular intervals, so frequent updating is critical for us. However, the datasets are quite large, extending back 10 or more years in some cases. This brings me to our actual problem. Doing a full reload of datasets using "reloadEveryNMinutes" in datasets.xml can take a long time and bogs down our system if we do it too frequently. We therefore don't. We make use of the "updateEveryNMillis" tag to catch new data. 

The actual problem we see is this: our application's queries for the most recent 24 hours of available data use max(time) as a query constraint. For reasons we do not understand, max(time) seems to reflect only the timestamp from the most recent full reload. I can look at the ERDDAP UI and see that it thinks that is what the max(time) value is. However, if I strip that time constraint, I can see data with newer timestamps that came in as a result of updates. Is there a setting that is causing updates to not update the appropriate metadata?

As an aside, I did try to query the most recent time value separately using the "orderByMax(time)" server-side function. It returns the right value, but because of our large datasets, it is quite slow, so it isn't practical to query that first from our UI and subsequently use the value in another query.
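
For readers following along, the two-step approach described above might look roughly like the sketch below. It only builds the URLs and parses a response; the base URL and dataset ID are placeholders, the helper names are invented for illustration, and it assumes ERDDAP's usual .csv layout (a column-name row, a units row, then data).

```python
import csv
import io
from datetime import datetime, timedelta
from urllib.parse import quote

ISO = "%Y-%m-%dT%H:%M:%SZ"  # ISO 8601 form used for time values in .csv output


def encode(part):
    # Percent-encode one query part, leaving '=' and parentheses readable.
    return quote(part, safe="=()!*'")


def latest_time_url(base, dataset_id):
    # Step 1: ask only for the row holding the newest time value.
    parts = ["time", 'orderByMax("time")']
    return f"{base}/{dataset_id}.csv?" + "&".join(encode(p) for p in parts)


def parse_latest_time(csv_text):
    # Assumed layout: row 0 = column names, row 1 = units, row 2 = first data row.
    rows = list(csv.reader(io.StringIO(csv_text)))
    return rows[2][0]


def window_url(base, dataset_id, latest_iso, hours=24):
    # Step 2: use explicit bounds, so the query no longer relies on max(time).
    latest = datetime.strptime(latest_iso, ISO)
    start = (latest - timedelta(hours=hours)).strftime(ISO)
    parts = ["time", f"time>={start}", f"time<={latest_iso}"]
    return f"{base}/{dataset_id}.csv?" + "&".join(encode(p) for p in parts)


# Placeholder server and dataset; not a real endpoint.
BASE = "https://example.org/erddap/tabledap"
print(latest_time_url(BASE, "my_dataset"))
print(window_url(BASE, "my_dataset", "2023-05-24T13:00:00Z"))
```

The cost noted above comes from step 1: the server has to find the newest row before the second request can even be built.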

That is the overview of the problem. If there are more details that I could provide, or related questions I could answer, please don't hesitate to ask. Thanks for any insights you can provide on this.

Regards,
Paul

bobsimons2.00

May 24, 2023, 12:01:22 PM
to ERDDAP
ERDDAP does different things when reloading vs. updating different types of datasets. And ERDDAP calculates/maintains the known max value differently.  What type of dataset is the dataset you are talking about?  

And if the dataset is a ...From...Files dataset, how is the data arranged in the files?  Notably, is it one big file that you frequently completely rewrite, or a series of files (e.g., one per year) where just the current year's file is rewritten/updated, or something else?

And in the queries you make, are you asking for data =max(time) or >=max(time)?  That shouldn't make a difference, but I think it does for some dataset types because of what ERDDAP knows and when it knows it.

If I know that information, I can give a more specific answer.

PAUL WHELAN

May 24, 2023, 12:28:20 PM
to ERDDAP
Thanks for the quick response. The datasets are of type EDDTableFromMultidimNcFiles. It is a series of netcdf files, generally one per day over long time periods. It is only the current day's file that would be getting updated. Older ones remain static.
The queries from our UI take the form in the following example, where we are looking for the most recently available 24 hrs of data for a given deployment. Note that a single dataset will usually have multiple deployments (1/yr) and some can be quite old; thus the "most recently available" part of the query.
https://prospect.whoi.net/erddap/tabledap/GI01SUMO-BUOY-METBK-01-1.csv?time,barometric_pressure&time>=max(time)-1days&time<=max(time)&deploy_id="D0009"
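
A minimal sketch of how a client might assemble that URL with the constraint characters percent-encoded. The helper name `tabledap_url` is invented for illustration; the dataset ID, variables and constraints are the ones in the example above.

```python
from urllib.parse import quote


def tabledap_url(base, dataset_id, variables, constraints, file_type=".csv"):
    # Variables are comma-separated; each constraint is appended with '&'.
    query = ",".join(variables)
    for constraint in constraints:
        # Encode characters such as '>', '<' and '"' but keep '=' and '()'.
        query += "&" + quote(constraint, safe="=()!*'")
    return f"{base}/{dataset_id}{file_type}?{query}"


url = tabledap_url(
    "https://prospect.whoi.net/erddap/tabledap",
    "GI01SUMO-BUOY-METBK-01-1",
    ["time", "barometric_pressure"],
    ["time>=max(time)-1days", "time<=max(time)", 'deploy_id="D0009"'],
)
print(url)
```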
Thanks again,
Paul

bobsimons2.00

May 24, 2023, 1:01:10 PM
to ERDDAP
A little more information please: 
In the dataset, is there just one buoy at one nominal location (although with multiple deployments, so the actual buoy might change over time) or multiple buoys at multiple nominal locations (i.e., many buoys simultaneously)?   
I ask because the dataset will only have one max(time), not one per buoy.
If there are multiple buoys, how many are there?

Also, why do you include a constraint like &time<=max(time) ?  Isn't that unnecessary?  If max(time) isn't perfectly up-to-date, that will cause ERDDAP to reject data that has time values after the known max(time).
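
To make the effect of a stale max(time) concrete, here is a toy model in Python. It is not ERDDAP code; it simply filters a list of made-up timestamps the way the two constraints would, assuming the server's known maximum lags behind the newest row.

```python
from datetime import datetime, timedelta


def select(times, known_max, with_upper_bound):
    # Mimics time>=max(time)-1days, optionally combined with time<=max(time).
    lower = known_max - timedelta(days=1)
    return [
        t for t in times
        if t >= lower and (not with_upper_bound or t <= known_max)
    ]


# Known maximum as of the last full reload (illustrative value).
known_max = datetime(2023, 5, 24, 0, 0)
times = [
    known_max - timedelta(hours=30),  # older than the 1-day window
    known_max - timedelta(hours=12),
    known_max,
    known_max + timedelta(hours=6),   # arrived via an update after the reload
]

with_bound = select(times, known_max, with_upper_bound=True)
without_bound = select(times, known_max, with_upper_bound=False)
print(len(with_bound), len(without_bound))
```

With the upper bound, the row that arrived after the reload is dropped; without it, that row is returned.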

Also, how up-to-date are the time values in the dataset?  If they are relatively recent (e.g., an hour's delay), why don't you just do a query with a specific time constraint based on the current time or a relative time, e.g., time>=now-24hours ?

PAUL WHELAN

May 24, 2023, 1:28:05 PM
to ERDDAP
Sorry for the limited information. I am trying to keep the details germane. Our reality is much more complex. 

Our datasets refer to the data for a single instrument. There are many instruments on a nominal buoy and, yes, we have a number of buoys in any given location. There are also multiple locations.

So, for the simple case of a single instrument = 1 dataset:  one of the reasons for including time <=max(time) along with matching deployment id is so that our user interface code can run a single set of logic and queries for new and old deployments alike. So, plotting data from a deployment that ended 9 years ago can work the same way as plotting the current deployment. In theory, querying data for the current deployment might not need that constraint. However, even ignoring old deployments, there are cases for the current deployment where telemetering of data gets delayed a day or two; so querying the "most recently available" 24 hrs is preferable to the last 24 hrs of data.

We do actually support plotting of data relative to the current time. However, most of our plots are shared with older deployments, so that point of reference fails to find data for those. We would have to generate a considerable number of individual plots if we cannot share the logic and queries.

Thanks again for your time.

Regards,
Paul

bobsimons2.00

May 24, 2023, 2:09:31 PM
to ERDDAP
The complexity of your system doesn't matter much here except regarding how things are set up in ERDDAP. Without knowing the details of how things are set up for this dataset in ERDDAP, I can't give a specific answer. "What is a dataset" in your ERDDAP installation is crucial.  If I misunderstand anything, I'll probably give you the wrong answer and the process of diagnosing the problem will take even longer.

So just to be clear, is one dataset in ERDDAP the data from one instrument (e.g., surface water temperature) on one buoy, or from multiple buoys?
  
Just to be sure I understand: If it is one buoy, then isn't the &deploy_id="D0009" constraint unnecessary if you are asking for data from the current/latest deployment?  (Earlier deployments will have much earlier max(time) values.)  

And again, what is the point of the &time<=max(time) constraint?

Roy Mendelssohn - NOAA Federal

May 24, 2023, 2:19:10 PM
to ERDDAP, bobsimons2.00
Hi Paul:

I think what might be helpful is to provide the header information for one of your netcdf files, as well as the appropriate xml snippet for that dataset.

-Roy
> --
> You received this message because you are subscribed to the Google Groups "ERDDAP" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to erddap+un...@googlegroups.com.
> To view this discussion on the web, visit https://groups.google.com/d/msgid/erddap/1306ee2d-aec1-4dbe-86a5-5187d7507ab6n%40googlegroups.com.

**********************
"The contents of this message do not reflect any position of the U.S. Government or NOAA."
**********************
Roy Mendelssohn
Supervisory Operations Research Analyst
NOAA/NMFS
Environmental Research Division
Southwest Fisheries Science Center
***Note new street address***
110 McAllister Way
Santa Cruz, CA 95060
Phone: (831)-420-3666
Fax: (831) 420-3980
e-mail: Roy.Men...@noaa.gov www: https://www.pfeg.noaa.gov/

"Old age and treachery will overcome youth and skill."
"From those who have been given much, much will be expected"
"the arc of the moral universe is long, but it bends toward justice" -MLK Jr.

bobsimons2.00

May 24, 2023, 2:28:04 PM
to ERDDAP
The things Roy asked for might be useful, but they don't answer my questions because they don't tell me about the variability of the files in the dataset.

I feel bad for having asked you so many questions, but I need a clear understanding of the dataset and I don't have that yet.
Without a clear understanding, there is a slightly different answer for each possible scenario. That is not a good way to diagnose this problem.

PAUL WHELAN

May 24, 2023, 2:41:49 PM
to ERDDAP
A dataset for us corresponds to the data for a single instrument on a single buoy. 

Yet another complexity for us in our system is that deploy_id can be important for the current deployment. While a current deployment is in the field, the next deployment is frequently under development on-shore. There are periods of time where that is being tested and reporting data into our system for a future deployment while a current deployment is reporting data.

The point of &time<=max(time) is to retrieve the newest 24 hours of data for a deployment, whether that deployment be from 10 years ago, from 3 days ago due to telemetry delays or from right now using a single query.

Below is a dump of the header for one of our netcdf files:

netcdf \20230523.flort1 {
dimensions:
station = 1 ;
t = 14331 ;
variables:
int crs ;
double t(t) ;
t:units = "seconds since 1990-01-01 00:00:00Z" ;
t:standard_name = "time" ;
t:axis = "T" ;
int station(station) ;
station:cf_role = "timeseries_id" ;
station:long_name = "station identifier" ;
double y(station) ;
y:axis = "Y" ;
double x(station) ;
x:axis = "X" ;
double z(station) ;
z:_FillValue = -9999.9 ;
z:axis = "Z" ;
z:long_name = "Sensor Depth" ;
z:standard_name = "depth" ;
z:units = "m" ;
z:comment = "Sensor depth below sea surface" ;
z:positive = "down" ;
string dcl_date_time_string(station, t) ;
dcl_date_time_string:long_name = "DCL Date and Time Stamp" ;
dcl_date_time_string:units = "1" ;
dcl_date_time_string:coordinates = "t z x y" ;
int measurement_wavelength_beta(station, t) ;
measurement_wavelength_beta:_FillValue = -9999 ;
measurement_wavelength_beta:long_name = "Wavelength" ;
measurement_wavelength_beta:standard_name = "radiation_wavelength" ;
measurement_wavelength_beta:units = "nm" ;
measurement_wavelength_beta:coordinates = "t z x y" ;
int raw_signal_beta(station, t) ;
raw_signal_beta:_FillValue = -9999 ;
raw_signal_beta:units = "counts" ;
raw_signal_beta:coordinates = "t z x y" ;
int measurement_wavelength_chl(station, t) ;
measurement_wavelength_chl:_FillValue = -9999 ;
measurement_wavelength_chl:long_name = "Wavelength" ;
measurement_wavelength_chl:standard_name = "radiation_wavelength" ;
measurement_wavelength_chl:units = "nm" ;
measurement_wavelength_chl:coordinates = "t z x y" ;
int raw_signal_chl(station, t) ;
raw_signal_chl:_FillValue = -9999 ;
raw_signal_chl:units = "counts" ;
raw_signal_chl:coordinates = "t z x y" ;
int measurement_wavelength_cdom(station, t) ;
measurement_wavelength_cdom:_FillValue = -9999 ;
measurement_wavelength_cdom:long_name = "Wavelength" ;
measurement_wavelength_cdom:standard_name = "radiation_wavelength" ;
measurement_wavelength_cdom:units = "nm" ;
measurement_wavelength_cdom:coordinates = "t z x y" ;
int raw_signal_cdom(station, t) ;
raw_signal_cdom:_FillValue = -9999 ;
raw_signal_cdom:units = "counts" ;
raw_signal_cdom:coordinates = "t z x y" ;
int raw_internal_temp(station, t) ;
raw_internal_temp:_FillValue = -9999 ;
raw_internal_temp:units = "counts" ;
raw_internal_temp:coordinates = "t z x y" ;
double estimated_chlorophyll(station, t) ;
estimated_chlorophyll:_FillValue = -9999.9 ;
estimated_chlorophyll:long_name = "Estimated Chlorophyll" ;
estimated_chlorophyll:standard_name = "mass_concentration_of_chlorophyll_in_sea_water" ;
estimated_chlorophyll:units = "mg L-1" ;
estimated_chlorophyll:coordinates = "t z x y" ;
double fluorometric_cdom(station, t) ;
fluorometric_cdom:_FillValue = -9999.9 ;
fluorometric_cdom:long_name = "Fluorometric CDOM" ;
fluorometric_cdom:units = "ppm" ;
fluorometric_cdom:coordinates = "t z x y" ;
double beta_700(station, t) ;
beta_700:_FillValue = -9999.9 ;
beta_700:long_name = "Volume Scattering Function at 700 nm" ;
beta_700:standard_name = "volume_scattering_function_of_radiative_flux_in_sea_water" ;
beta_700:units = "m-1 sr-1" ;
beta_700:coordinates = "t z x y" ;
double temperature(station, t) ;
temperature:_FillValue = -9999.9 ;
temperature:long_name = "Sea Water Temperature" ;
temperature:standard_name = "sea_water_temperature" ;
temperature:units = "degrees_Celsius" ;
temperature:comment = "Interpolated into record from co-located CTD" ;
temperature:coordinates = "t z x y" ;
double salinity(station, t) ;
salinity:_FillValue = -9999.9 ;
salinity:long_name = "Practical Salinity" ;
salinity:standard_name = "sea_water_practical_salinity" ;
salinity:units = "1" ;
salinity:comment = "Interpolated into record from co-located CTD" ;
salinity:coordinates = "t z x y" ;
double bback(station, t) ;
bback:_FillValue = -9999.9 ;
bback:long_name = "Total Optical Backscatter at 700 nm" ;
bback:units = "m-1" ;
bback:coordinates = "t z x y" ;
string deploy_id(station, t) ;
deploy_id:long_name = "Deployment ID" ;
deploy_id:units = "1" ;
deploy_id:coordinates = "t z x y" ;

// global attributes:
:Conventions = "CF-1.6" ;
:date_created = "2023-05-24T18:35:00Z" ;
:featureType = "timeseries" ;
:cdm_data_type = "Timeseries" ;
:project = "Ocean Observatories Initiative" ;
:institution = "Coastal and Global Scale Nodes (CGSN)" ;
:acknowledgement = "National Science Foundation" ;
:references = "http://oceanobservatories.org" ;
:creator_name = "Christopher Wingard" ;
:creator_email = "cwin...@coas.oregonstate.edu" ;
:creator_url = "http://oceanobservatories.org" ;
:comment = "Mooring ID: GI01SUMO-0009" ;
}
Below is an example dataset definition from our datasets.xml file:
<dataset active="true" datasetID="GI01SUMO-BUOY-FLORT-01-1" type="EDDTableFromNcCFFiles">
<reloadEveryNMinutes>720</reloadEveryNMinutes>
<updateEveryNMillis>10000</updateEveryNMillis>
<fileDir>/mnt/ooi_volume_07/data/processed/gi01sumo</fileDir>
<fileNameRegex>.*\.flort1\.nc</fileNameRegex>
<recursive>true</recursive>
<pathRegex>.*D[0-9]+/(|buoy/(|flort-1/))</pathRegex>
<metadataFrom>last</metadataFrom>
<preExtractRegex>120</preExtractRegex>
<postExtractRegex />
<extractRegex />
<columnNameForExtract />
<sortFilesBySourceNames>t</sortFilesBySourceNames>
<fileTableInMemory>false</fileTableInMemory>
<defaultGraphQuery>time,raw_signal_chl&amp;time&gt;=max(time)-7day&amp;.draw=lines</defaultGraphQuery>
<accessibleViaFiles>false</accessibleViaFiles>
<addAttributes>
<att name="title">GI01SUMO BUOY FLORT1</att>
<att name="_NCProperties">null</att>
<att name="Conventions">CF-1.6, COARDS, ACDD-1.3</att>
<att name="creator_type">person</att>
<att name="infoUrl">http://oceanobservatories.org</att>
<att name="institution">Coastal and Global Scale Nodes (CGSN)</att>
<att name="keywords">backscatter, bback, beta, beta_700, cdom, cgsn, chemistry, chl, chlorophyll, coastal, color,
colored, concentration, cp01cnsm, cp01cnsm-0007, crs, data, date, dcl, dcl_date_time_string, density, deploy_id,
deployment, depth, dissolved, estimated, estimated_chlorophyll, fluorometric, fluorometric_cdom, flux, function,
global, heat, heat flux, identifier, internal, latitude, local, longitude, mass,
mass_concentration_of_chlorophyll_in_sea_water, matter, measurement_wavelength_beta,
measurement_wavelength_cdom, measurement_wavelength_chl, mooring, nodes, ocean, ocean color, oceans,
Oceans &gt; Ocean Chemistry &gt; Chlorophyll,
Oceans &gt; Ocean Optics &gt; Radiance,
Oceans &gt; Ocean Temperature &gt; Water Temperature,
Oceans &gt; Salinity/Density &gt; Salinity,
optical, optical properties, optics, organic, practical, properties, radiance, radiation, radiation_wavelength,
radiative, raw, raw_internal_temp, raw_signal_beta, raw_signal_cdom, raw_signal_chl, salinity, scales,
scattering, sea, sea_water_practical_salinity, sea_water_temperature, seawater, sensor, signal, source, stamp,
station, temperature, time, total, volume, volume_scattering_function_of_radiative_flux_in_sea_water, water,
wavelength</att>
<att name="keywords_vocabulary">GCMD Science Keywords</att>
<att name="license">[standard]</att>
<att name="sourceUrl">(local files)</att>
<att name="standard_name_vocabulary">CF Standard Name Table v29</att>
<att name="summary">Mooring ID: CP01CNSM-0007. CGSN data from a local source.</att>
<att name="cdm_data_type">timeSeries</att>
<att name="subsetVariables">deploy_id</att>
<att name="cdm_timeseries_variables">latitude, longitude, station</att>
<att name="processing_level">processed</att>
</addAttributes>
<dataVariable>
<sourceName>crs</sourceName>
<destinationName>crs</destinationName>
<dataType>int</dataType>
<addAttributes>
<att name="ioos_category">Unknown</att>
<att name="long_name">CRS</att>
</addAttributes>
</dataVariable>
<dataVariable>
<sourceName>station</sourceName>
<destinationName>station</destinationName>
<dataType>int</dataType>
<addAttributes>
<att name="ioos_category">Identifier</att>
</addAttributes>
</dataVariable>
<dataVariable>
<sourceName>t</sourceName>
<destinationName>time</destinationName>
<dataType>double</dataType>
<addAttributes>
<att name="ioos_category">Time</att>
<att name="long_name">Time</att>
</addAttributes>
</dataVariable>
<dataVariable>
<sourceName>latitude</sourceName>
<destinationName>latitude</destinationName>
<dataType>double</dataType>
<addAttributes>
<att name="colorBarMaximum" type="double">90.0</att>
<att name="colorBarMinimum" type="double">-90.0</att>
<att name="ioos_category">Location</att>
<att name="long_name">Latitude</att>
<att name="standard_name">latitude</att>
<att name="units">degrees_north</att>
</addAttributes>
</dataVariable>
<dataVariable>
<sourceName>longitude</sourceName>
<destinationName>longitude</destinationName>
<dataType>double</dataType>
<addAttributes>
<att name="colorBarMaximum" type="double">180.0</att>
<att name="colorBarMinimum" type="double">-180.0</att>
<att name="ioos_category">Location</att>
<att name="long_name">Longitude</att>
<att name="standard_name">longitude</att>
<att name="units">degrees_east</att>
</addAttributes>
</dataVariable>
<dataVariable>
<sourceName>z</sourceName>
<destinationName>z</destinationName>
<dataType>double</dataType>
<addAttributes>
<att name="colorBarMaximum" type="double">8000.0</att>
<att name="colorBarMinimum" type="double">-8000.0</att>
<att name="colorBarPalette">TopographyDepth</att>
<att name="ioos_category">Location</att>
</addAttributes>
</dataVariable>
<dataVariable>
<sourceName>dcl_date_time_string</sourceName>
<destinationName>dcl_date_time_string</destinationName>
<dataType>String</dataType>
<addAttributes>
<att name="coordinates">null</att>
<att name="ioos_category">Time</att>
</addAttributes>
</dataVariable>
<dataVariable>
<sourceName>measurement_wavelength_beta</sourceName>
<destinationName>measurement_wavelength_beta</destinationName>
<dataType>int</dataType>
<addAttributes>
<att name="coordinates">null</att>
<att name="ioos_category">Optical Properties</att>
</addAttributes>
</dataVariable>
<dataVariable>
<sourceName>measurement_wavelength_cdom</sourceName>
<destinationName>measurement_wavelength_cdom</destinationName>
<dataType>int</dataType>
<addAttributes>
<att name="coordinates">null</att>
<att name="ioos_category">Optical Properties</att>
</addAttributes>
</dataVariable>
<dataVariable>
<sourceName>measurement_wavelength_chl</sourceName>
<destinationName>measurement_wavelength_chl</destinationName>
<dataType>int</dataType>
<addAttributes>
<att name="coordinates">null</att>
<att name="ioos_category">Optical Properties</att>
</addAttributes>
</dataVariable>
<dataVariable>
<sourceName>raw_internal_temp</sourceName>
<destinationName>raw_internal_temp</destinationName>
<dataType>int</dataType>
<addAttributes>
<att name="coordinates">null</att>
<att name="ioos_category">Unknown</att>
<att name="long_name">Raw Internal Temp</att>
</addAttributes>
</dataVariable>
<dataVariable>
<sourceName>raw_signal_beta</sourceName>
<destinationName>raw_signal_beta</destinationName>
<dataType>int</dataType>
<addAttributes>
<att name="coordinates">null</att>
<att name="ioos_category">Unknown</att>
<att name="long_name">Raw Signal Beta</att>
</addAttributes>
</dataVariable>
<dataVariable>
<sourceName>raw_signal_cdom</sourceName>
<destinationName>raw_signal_cdom</destinationName>
<dataType>int</dataType>
<addAttributes>
<att name="coordinates">null</att>
<att name="ioos_category">Unknown</att>
<att name="long_name">Raw Signal Cdom</att>
</addAttributes>
</dataVariable>
<dataVariable>
<sourceName>raw_signal_chl</sourceName>
<destinationName>raw_signal_chl</destinationName>
<dataType>int</dataType>
<addAttributes>
<att name="coordinates">null</att>
<att name="ioos_category">Unknown</att>
<att name="long_name">Raw Signal Chl</att>
</addAttributes>
</dataVariable>
<dataVariable>
<sourceName>estimated_chlorophyll</sourceName>
<destinationName>estimated_chlorophyll</destinationName>
<dataType>double</dataType>
<addAttributes>
<att name="colorBarMaximum" type="double">30.0</att>
<att name="colorBarMinimum" type="double">0.03</att>
<att name="colorBarScale">Log</att>
<att name="coordinates">null</att>
<att name="ioos_category">Ocean Color</att>
</addAttributes>
</dataVariable>
<dataVariable>
<sourceName>fluorometric_cdom</sourceName>
<destinationName>fluorometric_cdom</destinationName>
<dataType>double</dataType>
<addAttributes>
<att name="coordinates">null</att>
<att name="ioos_category">Optical Properties</att>
</addAttributes>
</dataVariable>
<dataVariable>
<sourceName>beta_700</sourceName>
<destinationName>beta_700</destinationName>
<dataType>double</dataType>
<addAttributes>
<att name="colorBarMaximum" type="double">500.0</att>
<att name="colorBarMinimum" type="double">-500.0</att>
<att name="coordinates">null</att>
<att name="ioos_category">Heat Flux</att>
</addAttributes>
</dataVariable>
<dataVariable>
<sourceName>temperature</sourceName>
<destinationName>temperature</destinationName>
<dataType>double</dataType>
<addAttributes>
<att name="colorBarMaximum" type="double">32.0</att>
<att name="colorBarMinimum" type="double">0.0</att>
<att name="coordinates">null</att>
<att name="ioos_category">Temperature</att>
</addAttributes>
</dataVariable>
<dataVariable>
<sourceName>salinity</sourceName>
<destinationName>salinity</destinationName>
<dataType>double</dataType>
<addAttributes>
<att name="colorBarMaximum" type="double">37.0</att>
<att name="colorBarMinimum" type="double">32.0</att>
<att name="coordinates">null</att>
<att name="ioos_category">Salinity</att>
<att name="standard_name">sea_water_practical_salinity</att>
<att name="units">PSU</att>
</addAttributes>
</dataVariable>
<dataVariable>
<sourceName>bback</sourceName>
<destinationName>bback</destinationName>
<dataType>double</dataType>
<addAttributes>
<att name="coordinates">null</att>
<att name="ioos_category">Optical Properties</att>
</addAttributes>
</dataVariable>
<dataVariable>
<sourceName>deploy_id</sourceName>
<destinationName>deploy_id</destinationName>
<dataType>String</dataType>
<addAttributes>
<att name="coordinates">null</att>
<att name="ioos_category">Identifier</att>
</addAttributes>
</dataVariable>
</dataset>

bobsimons2.00

May 24, 2023, 3:33:42 PM
to ERDDAP
1) For now, I would change <updateEveryNMillis> for the dataset to be 1. Unless (and maybe even if) your dataset is updated every few seconds and people are submitting requests to this dataset very frequently (e.g., every second), this will be a better setting.  10000 is the default only to avoid problems with extreme situations.  But I think this is an optimization, not a solution.

2) Your sample query (listed above) has 2 time constraints: 
&time>=max(time)-1days  
&time<=max(time)  
The first one makes sense. The second one doesn't. And if ERDDAP's understanding of max(time) isn't correct (i.e., it is out-of-date), then data from after the incorrect max(time) time value will be removed from the response. I think that is part of what triggered your initial email. The solution is to remove that second constraint. Do you then see all of the recent data?  Does that solve the main/immediate problem?

3) But that leaves the question: why is max(time) not being kept up-to-date by a dataset update? I don't know.  That could easily be a bug/oversight, although surprising that no one caught it until now. Chris John, can you please check the code that handles the "update" call to ensure that it correctly maintains the min and max for each variable? (You don't have to recreate the variable. I think there is a method to update the variable's min and max.)  

4) My understanding of your dataset makes your comment about a "dataset reload taking a long time" stand out. That makes no sense. Something is wrong. When ERDDAP reloads the dataset, it should quickly see that only 1 file has a different timestamp and size, so it only needs to open and read the data from that one file. That should occur quickly unless the file is huge. Is it huge? How many time points are typically in the currently changing file? I would set a flag for this dataset, then wait till it is reloaded, then visit status.html so the pending log messages are flushed to log.txt, then look in the log.txt file for diagnostics about what happened during the reload. Did ERDDAP try to reread all of the files or just the file that changed? How long did the dataset reload take? If it took long, is there an indication of why? The point here is to resolve this mystery, but also, if reloads are fast, then you can switch <reloadEveryNMinutes> to 15 (or better, use a flag when a file is changed). Then max(time) will be correct and you can bypass the max(time) bug associated with updates (if that is what it is). But in the long run, it is better if the update bug is fixed and you stick with infrequent reloads and frequent updates.
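
As a rough sketch of the "set a flag" step in point 4: one way to flag a dataset is to create a file named after the datasetID in the flag directory under ERDDAP's bigParentDirectory. The helper name below is invented, and a temporary directory stands in for the real bigParentDirectory so the example can run anywhere.

```python
import tempfile
from pathlib import Path


def set_flag(big_parent_directory, dataset_id):
    # Creating an (empty) file named after the datasetID in the flag
    # directory asks ERDDAP to reload that dataset as soon as it can.
    flag_file = Path(big_parent_directory) / "flag" / dataset_id
    flag_file.touch()
    return flag_file


# Demo only: a temporary directory stands in for bigParentDirectory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "flag").mkdir()  # ERDDAP normally creates this itself
    created = set_flag(tmp, "GI01SUMO-BUOY-FLORT-01-1")
    print(created.name, created.exists())
```

A process that writes the daily file could call something like this after each write, which is the "use a flag when a file is changed" option mentioned above.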

I hope that helps. Please let us know about #2 and #4. Chris John, please let us know about #3.

bobsimons2.00

May 24, 2023, 3:49:09 PM
to ERDDAP
I'll add:
Bugs like this (if this is the possible bug I described) can occur because different features in ERDDAP were added at different times and it is easy to miss an interaction of 2 different parts of ERDDAP. In this case, it is likely that the max() function was added after the update function. Update was always intended to be a quick and perhaps imperfect version of reload (quick so that it could be done quickly before normal processing of a user's request and not annoy the user). Apparently it needs to be slightly less quick so that each variable's min and max are up-to-date.

And for Chris John, I should have said "check the update() code in EDDTableFromFiles (the relevant superclass)", but also check all the other classes that actually do something in the update() method. (I forget the exact method name, but it is something like that.)

PAUL WHELAN

May 24, 2023, 4:00:31 PM
to ERDDAP
Thank you all. Also, I'll try to get you the additional responses as quickly as I can, though I am a little tied up at the moment.

Regards,
Paul

Roy Mendelssohn - NOAA Federal

May 25, 2023, 12:26:03 AM
to PAUL WHELAN, ERDDAP
Hi Paul:

I forgot to ask, are you running ERDDAP through Docker?

Thanks,

-Roy

PAUL WHELAN

May 25, 2023, 8:47:01 AM
to ERDDAP
Hello again folks,

Sorry for the delayed response. I had another, unrelated fire burning.

regarding Bob's most recent post:
1) will change <updateEveryNMillis> to 1.

2) I dug into this further based on your input. It seems that our UI guys do use a different query for older (static) deployments. So, the query using max(time) is only being used for the current deployments. That means that we could eliminate the "<=max(time)" constraint. That does solve the problem of not picking up newer data. What it does not give us is a way to query a set time period leading up to the last available data (unless max(time) is fixed to be accurate). For example, if I want the "most recently available 24 hrs" of data, I do not see a way to do that. Thoughts?

4) So, individual files in our datasets are not what I would think of as huge. On average, a daily file is about 1 MB. The range seems to be 0.5 MB to 5.0 MB, depending on the instrument. The number of points per file varies. Most instruments report 1 second resolution (86400/day) data or slower, some as few as once per hour. There are a few that report gridded data (spectra), so those have more. None are producing enormous individual files. 

There are a few oddball cases where the instruments log and telemeter single samples. This results in a proliferation of files for a single dataset. We have seen slow-downs as a result of that and previously addressed that with an internal cron task that consolidates them into an individual file per day. That did speed up reloads of those datasets.

I will have one of our team look into what is happening on the full reloads, as you have suggested. Our reload history looks like the following screen shot, with most being quick, but every so often, one takes up to 30 minutes. It makes me think more than a single file is reloading on those.
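
On point 2 above, one possible client-side workaround is to request time>=max(time)-1days without the upper bound and then trim the response to the 24 hours before the newest row actually returned. The sketch below assumes the queried deployment holds the dataset's newest data and that the server's max(time) is stale rather than ahead; the sample values are made up.

```python
from datetime import datetime, timedelta

ISO = "%Y-%m-%dT%H:%M:%SZ"


def most_recent_window(rows, hours=24):
    # rows: (ISO time string, value) pairs as returned by the query.
    if not rows:
        return []
    parsed = [(datetime.strptime(t, ISO), v) for t, v in rows]
    newest = max(t for t, _ in parsed)
    cutoff = newest - timedelta(hours=hours)
    # Keep only rows within `hours` of the newest row in the response.
    return [(t.strftime(ISO), v) for t, v in parsed if t >= cutoff]


# Made-up sample response: the server's max(time) lagged, so the
# response spans more than 24 hours.
rows = [
    ("2023-05-22T20:00:00Z", 1012.1),
    ("2023-05-23T10:00:00Z", 1011.4),
    ("2023-05-24T06:00:00Z", 1009.8),
]
recent = most_recent_window(rows)
print(recent)
```

Because the true newest time is at or after the server's known maximum, the lower bound max(time)-1days never cuts into the final 24-hour window; the client just discards the extra rows at the start.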

regarding Roy's question:
We are not using ERDDAP on Docker. We were, but were concerned that Docker might be a source of some of our performance issues, so we took it out.

Thanks again,
Paul

ERDDAP_status.png

Roy Mendelssohn - NOAA Federal

May 25, 2023, 10:13:20 AM
to PAUL WHELAN, ERDDAP
Thanks. Yes, that is why I asked. There have been several posts the last couple of months having to do with updating or flags (see for example the recent post from Callum Rollo), which in each case boiled down to how Docker was configured. We don't use Docker and know the bare minimum about it, but if there are update problems and Docker is involved (or for that matter NFS shares) that needs to be checked.

-Roy

> On May 25, 2023, at 5:47 AM, PAUL WHELAN <paul....@whoi.edu> wrote:
>
> regarding Roy's question:
> We are not using ERDDAP on Docker. We were, but were concerned that Docker might be a source of some of our performance issues, so we took it out.

bobsimons2.00

May 31, 2023, 6:58:27 AM
to ERDDAP
Replying to Paul's comments about my comments:

2) You are correct. Removing the &time<=max(time) constraint is not the full solution to the problem of returning precisely the last 24 hours' worth of data, but it does solve the problem of the very latest data not being returned to the user.  Sometimes you will get slightly more than 24 hours' worth of data. The full solution will require max(time) being kept perfectly up-to-date. That is something for Chris to pursue.

4) It is hard to imagine how a single 1MB data file could take a long time for ERDDAP to read and thus be the sole cause of a dataset loading slowly. If you ever see a dataset with this file organization loading slowly, please report the details (the information from log.txt regarding the dataset loading slowly) to Chris John so he can pursue it. 

But if I understand the other type of dataset correctly (there might be tens of thousands of data files, even if each file is tiny), it is easy to explain why those datasets reload slowly. Operating systems behave badly (very slowly) when there are more than 10,000 files in a given directory. That isn't a magical number. It's just that by the time you get anywhere near that number, it takes seconds for the operating system to open the file. If reloading the dataset involves opening and reading 10,000 files in a directory with 100,000 files, it will take a horrendously long time. There is nothing ERDDAP can do to speed this up because it is an operating system problem. But there are things you can do: 
1) Store the files in a series of subdirectories (one per hour?) so that no directory ever has more than a few thousand files. In extreme cases, you might set up a hierarchy of directories, e.g., year/month/day/hour. See https://coastwatch.pfeg.noaa.gov/erddap/download/setupDatasetsXml.html#EDDTableFromFiles_MillionsOfFiles
2) Store the data in aggregated files.  (This may be hard to set up, so option #1 is often much easier.)
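
Option 1 could be sketched as below. This is only an illustration: it assumes file names start with a YYYYMMDD date (as in the 20230523.flort1 example earlier in the thread), sorts them into year/month/day subdirectories, and relies on the dataset having <recursive>true</recursive> so ERDDAP still finds them. The function names are invented.

```python
import re
import shutil
import tempfile
from pathlib import Path

# Assumed naming convention: file names begin with YYYYMMDD.
DATE_PREFIX = re.compile(r"^(\d{4})(\d{2})(\d{2})")


def target_dir(root, filename):
    # Map a file name to root/YYYY/MM/DD, or None if it has no date prefix.
    match = DATE_PREFIX.match(filename)
    if match is None:
        return None
    year, month, day = match.groups()
    return Path(root) / year / month / day


def arrange(root):
    # Move every dated file directly under root into its date subdirectory.
    moved = 0
    for path in sorted(Path(root).iterdir()):
        if not path.is_file():
            continue
        dest = target_dir(root, path.name)
        if dest is None:
            continue
        dest.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(dest / path.name))
        moved += 1
    return moved


# Demo in a temporary directory with two empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["20230523.flort1.nc", "20230524.flort1.nc"]:
        (Path(tmp) / name).touch()
    print(arrange(tmp))
    print((Path(tmp) / "2023" / "05" / "23" / "20230523.flort1.nc").exists())
```

For instruments that write one file per sample, an hour level could be added to the hierarchy in the same way if the file names carry a time of day.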

I've portrayed two extremes: datasets with a few huge files (which doesn't seem to be the case) and datasets with many tiny files (more likely). But it is up to you to look in log.txt to find out which datasets are loading slowly and see what is going on when the dataset is loaded (by reading log.txt) that makes the dataset load slowly. Then it is either something you can fix (e.g., arrange the files in subdirectories) or something Chris can look into fixing. 

Note that if you can resolve the slow reload problem, you can go back to using a reloadEveryNMinutes setting of 10, so that max(time) is better maintained, so the original problem will be minimized until Chris figures out why max(time) is not being maintained by a dataset update.

---
I hope that helps. 

PAUL WHELAN

May 31, 2023, 7:41:00 AM
to ERDDAP
Thanks for all of the feedback and ideas on cleaning up the remaining performance issues. Your insights are much appreciated. We'll be pursuing cleaning up all of that stuff and look forward to any updates on the max(time) issue.

Best Regards,
Paul
