I'm trying to use xarray to access data on OPeNDAP servers (TDS for the most part).
Sometimes the data have been aggregated properly on the servers, sometimes not.
We're using the netCDF4 library to actually access the server.
We're not trying to do any computation directly on the data -- but rather slicing and dicing the data, and then writing it out as a netCDF file with .to_netcdf.
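For concreteness, the workflow looks roughly like the sketch below. The URL, the helper names (subset, fetch_subset), and the dimension names (time, lat, lon) are placeholders for illustration, not our real datasets, and it assumes the coordinates are sorted so label-based slicing works.

```python
import xarray as xr

# Placeholder URL: substitute a real OPeNDAP endpoint (e.g. a TDS dodsC URL).
URL = "https://example.org/thredds/dodsC/some/dataset.nc"


def subset(ds, variables, time_range, lat_range, lon_range):
    """Pick a few variables and a space/time box (hypothetical helper).

    Assumes dimensions named time/lat/lon with ascending coordinates.
    The data variables stay lazy here; only the selection is recorded.
    """
    return ds[variables].sel(
        time=slice(*time_range),
        lat=slice(*lat_range),
        lon=slice(*lon_range),
    )


def fetch_subset(url, out_path, **box):
    """Open the remote dataset, slice it, and write the result locally."""
    # engine="netcdf4" makes xarray go through the netCDF4 library.
    with xr.open_dataset(url, engine="netcdf4") as ds:
        # The bulk of the transfer happens when the values are written out.
        subset(ds, **box).to_netcdf(out_path)
```

A call would look something like fetch_subset(URL, "subset.nc", variables=["temp"], time_range=("2020-01-01", "2020-01-31"), lat_range=(30, 40), lon_range=(-80, -70)), where "temp" is again a made-up variable name.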
It works for the most part, but the performance is awful.
I think the problem is that xarray (and dask) are not set up to support OPeNDAP per se -- the code treats the data as generic arrays, or *maybe* as netCDF files, but has no idea how to access OPeNDAP efficiently. It seems to be making a LOT of small requests, and it can't request more than part of one variable at a time, which affects performance as well.
So:
1) Is it possible to configure xarray / dask to do this efficiently?
2) Is there a better "driver" to use -- pyDAP maybe?
3) How can I instrument this? I haven't figured out any way to know what's going on -- what requests is it making? (or what sub-arrays it is downloading at a given time, etc.)
4) Is there any way to run this with dask distributed? I get errors when I try -- I'm guessing that it can't multiprocess 'cause a netCDF4 Dataset is not picklable. Any solutions?
I'm going to put together some simple examples to poke at this with, but any hints would be appreciated.
-CHB