GOFS 3.1 Reanalysis Download Best Practices


Gustafson Drew

Mar 16, 2021, 2:16:13 PM
to fo...@hycom.org

Hello,


I am currently attempting to download GOFS 3.1 reanalysis data (expt_53.X) for a single grid point, spanning the full water-column profile for the entire temporal duration. At present, my script iterates through each time step individually and uses MATLAB's ncread function to download water_u and water_v data for all valid depths at the grid point of interest. However, this method is fairly time-consuming, often taking several days to collect the full time series. Do you have any recommendations for best practices to ensure an efficient download, whether that means changing the amount of data downloaded in each call, using a different access method (NCSS vs. OPeNDAP, etc.), or using different software altogether? Any assistance you can provide would be appreciated.


A sample call from my current process is below. As I mentioned above, the script iterates through time and downloads data for all valid depths at a specific grid point.


u = ncread('http://tds.hycom.org/thredds/dodsC/GLBv0.08/expt_53.X/data/2015', 'water_u', [1481 1631 1 1], [1 1 9 1]);


Thanks and regards,


Drew Gustafson


Michael McDonald

Mar 19, 2021, 7:19:53 PM
to Gustafson Drew, fo...@hycom.org
Drew,
The method you listed looks good... if it works and doesn't fail. 

The issue is that ncread with an OPeNDAP URL presents MATLAB with a *very large* virtual NetCDF object: with the aggregated expt_53.X dataset, every call effectively opens a ~13 TB NetCDF file.


Open the dataset's OPeNDAP access form on the THREDDS server and click the "Get Binary" button to see how much data this OPeNDAP dataset contains; the server refuses the full request:

  message = "Request too big=1.3812150708448E7 Mbytes, max=1000000.0";

The aggregation carries a lot of metadata and variables spanning every time index, and each *open* of it takes time to read and send over the network. If not cached properly, this open becomes the bottleneck/penalty in MATLAB.


To get around this you can open a (much) smaller OPeNDAP NetCDF object per request by using the *individual files* that make up this virtual data aggregation, which should speed up your queries. Give this method a try and report back on your findings.

Start at the top level THREDDS catalog here, http://tds.hycom.org/thredds/catalog.html

* Select the "All Data" link under "Unaggregated" at the top, then follow the directory tree to your desired dataset.


From here you can build a file listing from the catalog listing of server files. Two main formats are available: change the .html extension on the catalog URL to .xml to get a machine-readable listing you can process in a MATLAB for loop (or process the list externally).


Process this listing and extract all the urlPath="..." entries. These can be appended to the root/base DAP URL (https://tds.hycom.org/thredds/dodsC/ or http://tds.hycom.org/thredds/dodsC/; both secure and non-secure HTTP queries work).
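The extraction step above can be sketched in MATLAB. This is a minimal sketch, assuming the catalog XML URL shown below is the one you navigated to in the browser (with .html changed to .xml) — substitute the catalog URL for your actual dataset:

```matlab
% Sketch: build per-file DAP URLs from a THREDDS catalog XML listing.
% The catalog URL is illustrative; use the .xml version of the catalog
% page for your dataset/year.
catalogUrl = 'http://tds.hycom.org/thredds/catalog/datasets/GLBv0.08/expt_53.X/data/2015/catalog.xml';
xmlText = webread(catalogUrl);                       % fetch raw catalog XML

% Extract every urlPath="..." attribute value.
tokens = regexp(xmlText, 'urlPath="([^"]+)"', 'tokens');
paths  = [tokens{:}];                                % flatten to a cell array

% Prepend the root/base DAP URL to each entry.
dapRoot = 'http://tds.hycom.org/thredds/dodsC/';
urls = strcat(dapRoot, paths);                       % full per-file DAP URLs
```

Because the list comes from the server's own XML, it automatically skips any missing files rather than guessing dates.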

e.g., from one file urlPath from the XML listing
 urlPath="datasets/GLBv0.08/expt_53.X/data/2015/hycom_GLBv0.08_539_2015010112_t000.nc"

joined with the root DAP URL becomes
 https://tds.hycom.org/thredds/dodsC/datasets/GLBv0.08/expt_53.X/data/2015/hycom_GLBv0.08_539_2015010112_t000.nc
OR
 http://tds.hycom.org/thredds/dodsC/datasets/GLBv0.08/expt_53.X/data/2015/hycom_GLBv0.08_539_2015010112_t000.nc

With this unaggregated method, you do not need to specify the "date/time" index, as there is only one "time value" present per file; you specify only the variable name (water_u) and the lat/lon/depth subset.
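Put together, the per-file loop might look like the following sketch. It assumes `urls` is a cell array of per-file DAP URLs built from the catalog XML listing; the grid-point and depth indices are copied from the original aggregated call and are illustrative:

```matlab
% Sketch: loop over the individual (unaggregated) files. Each file holds
% a single time step, so only the grid point and depths are subset.
% 'urls' is a cell array of per-file DAP URLs from the catalog listing.
nFiles = numel(urls);
u = nan(9, nFiles);                                  % 9 depths x nFiles time steps
v = nan(9, nFiles);
for k = 1:nFiles
    % start = [lon lat depth], count = [1 1 9]; no time index is needed.
    u(:,k) = squeeze(ncread(urls{k}, 'water_u', [1481 1631 1], [1 1 9]));
    v(:,k) = squeeze(ncread(urls{k}, 'water_v', [1481 1631 1], [1 1 9]));
end
```

Each open now touches one small file instead of the ~13 TB aggregation, which is where the speedup comes from.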

NOTE: Please do not *guess* OPeNDAP URLs. Base your request list on the XML output the server reports as available/present. Repeated attempts to fetch non-existent files will get your IP auto-blocked by fail2ban. Note that there are known gaps in the GOFS 3.1 reanalysis; the THREDDS XML listing reflects these (no file is present for the missing times), whereas a simple "for loop" from year begin to year end will not be aware of them :-)



