xarray and OPeNDAP

Chris Barker

Jan 5, 2023, 9:15:09 PM
to xar...@googlegroups.com
I'm trying to use xarray to access data on OPeNDAP servers (THREDDS Data Server, TDS, for the most part).

Sometimes the data have been aggregated properly on the servers, sometimes not.

We're using the netCDF4 library to actually access the server.

We're not trying to do any computation directly on the data -- just slicing and dicing the data, and then writing it out as a netCDF file with .to_netcdf.

It works for the most part, but the performance is awful.

I think the problem is that xarray (and dask) are not set up to support OPeNDAP per se -- the code treats it as generic arrays, or *maybe* as netCDF files -- but has no idea how to access OPeNDAP efficiently. It seems to be making a LOT of small requests, and it can't request more than part of one variable at a time, which affects performance as well.
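
To make the "small requests" concern concrete, here is a minimal sketch of what a single DAP2 request for a sub-array looks like as a URL. The syntax ([start:stride:stop] with an inclusive stop, comma-separated variables, a .dods suffix for the binary response) is standard DAP2. The server URL, variable names, and shapes are made up for illustration, and this is not necessarily the exact request the netCDF client sends.

```python
def hyperslab(index: slice, length: int) -> str:
    """Render a Python slice as a DAP2 hyperslab "[start:stride:stop]".

    DAP2 stop indices are inclusive, unlike Python's exclusive stop.
    """
    start, stop, step = index.indices(length)
    if step <= 0:
        raise ValueError("DAP2 hyperslabs need a positive stride")
    if start >= stop:
        raise ValueError("empty selection")
    # Last index actually selected by the Python slice.
    last = start + ((stop - 1 - start) // step) * step
    return f"[{start}:{step}:{last}]"


def constraint_url(base_url, selections, shapes, response="dods"):
    """Build one DAP2 URL that asks for several sub-arrays at once.

    selections: {variable name: tuple of slices, one per dimension}
    shapes:     {variable name: tuple of dimension lengths}
    """
    projections = []
    for name, slices in selections.items():
        shape = shapes[name]
        if len(slices) != len(shape):
            raise ValueError(f"{name}: need one slice per dimension")
        slabs = "".join(hyperslab(s, n) for s, n in zip(slices, shape))
        projections.append(name + slabs)
    return f"{base_url}.{response}?{','.join(projections)}"


# Hypothetical dataset: temp(time=100, depth=20, node=60), time(time=100).
url = constraint_url(
    "http://example.com/thredds/dodsC/model",
    {"temp": (slice(0, 10), slice(None), slice(5, 50, 5)),
     "time": (slice(0, 10),)},
    {"temp": (100, 20, 60), "time": (100,)},
)
print(url)
```

The protocol itself permits several variables in one constraint expression; whether a given client ever batches them that way is a separate question.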

So:

1) Is it possible to configure xarray / dask to do this efficiently?

2) Is there a better "driver" to use -- pyDAP maybe?

3) How can I instrument this? I haven't figured out any way to know what's going on -- what requests is it making? (Or what sub-arrays is it downloading at a given time, etc.)

4) Is there any way to run this with dask distributed? I get errors when I try -- I'm guessing that it can't multiprocess because a netCDF4 Dataset is not picklable. Any solutions?
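
On question 3, one low-tech option when using netCDF4 directly is to wrap the variable in a thin proxy that logs every indexing call. This is only a sketch: TracedVariable is a hypothetical helper, not part of netCDF4 or xarray, and it records Python-level index requests rather than the HTTP traffic the C library generates underneath.

```python
import logging
import time

log = logging.getLogger("dap_trace")


class TracedVariable:
    """Wrap any indexable object and record every indexing call made on it."""

    def __init__(self, variable, name="var"):
        self._variable = variable
        self._name = name
        self.calls = []  # list of (key, elapsed seconds)

    def __getitem__(self, key):
        t0 = time.perf_counter()
        result = self._variable[key]
        elapsed = time.perf_counter() - t0
        self.calls.append((key, elapsed))
        log.info("%s[%r] took %.3f s", self._name, key, elapsed)
        return result

    def __getattr__(self, attr):
        # Only reached for attributes not found on the proxy itself.
        # Private names are not delegated, which avoids recursion if the
        # proxy is copied before _variable has been set.
        if attr.startswith("_"):
            raise AttributeError(attr)
        return getattr(self._variable, attr)
```

With netCDF4 this could wrap something like netCDF4.Dataset(url).variables["temp"] (the variable name is an assumption). Calling logging.basicConfig(level=logging.INFO) first makes the log lines visible, and the calls list can be inspected afterwards to see which sub-arrays were requested and how long each took.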

I'm going to put together some simple examples to poke at this with, but any hints would be appreciated.

-CHB



--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris....@noaa.gov

Ryan Abernathey

Jan 5, 2023, 9:55:59 PM
to xar...@googlegroups.com
Hi Chris! Xarray supports opendap 100% via the netcdf library. Xarray is just wrapping the netcdf4-python library (which in turn wraps the netCDF C library), and the C library is ultimately what makes those HTTP calls. (That's also why it's hard to debug / diagnose.) Is the Xarray performance worse than just using raw netCDF4?

You can try chunking with Dask to get some parallelism in your requests. There should be zero issue with serialization / pickling on any Dask scheduler. Xarray handles this case. For an example of using Xarray and Dask with opendap, check out https://gallery.pangeo.io/repos/pangeo-gallery/cmip6/search_and_load_with_esgf_opendap.html
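
For readers wondering how a non-picklable file handle can survive a trip to a Dask worker, the general pattern is to pickle the recipe for opening the dataset rather than the open handle, and to reopen lazily on the other side. The sketch below illustrates that idea only. LazyHandle and its opener argument are hypothetical names, and this is not xarray's actual implementation.

```python
class LazyHandle:
    """Hold an open resource, but pickle only the recipe for reopening it."""

    def __init__(self, opener, target):
        self._opener = opener    # callable such as netCDF4.Dataset
        self._target = target    # e.g. an OPeNDAP URL
        self._handle = None      # opened on first use

    def get(self):
        """Return the live handle, opening it if needed."""
        if self._handle is None:
            self._handle = self._opener(self._target)
        return self._handle

    def __getstate__(self):
        # The live handle is deliberately left out of the pickled state.
        return {"opener": self._opener, "target": self._target}

    def __setstate__(self, state):
        self._opener = state["opener"]
        self._target = state["target"]
        self._handle = None      # each process opens its own connection
```

For pickle to work, the opener has to be something that can be pickled by reference, such as a module-level function or class. Each worker then opens its own connection to the server the first time it calls get().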

In this paper we did some benchmarking of opendap scalability and throughput - https://ieeexplore.ieee.org/document/9354557
We have never been able to achieve very good throughput with opendap. Zarr scales much, much better. The bottleneck is usually on the server side.

Hope this is helpful.

Best,
Ryan

p.s. The discussion forum on github tends to get a lot more traffic than this mailing list recently - https://github.com/pydata/xarray/discussions


Benjamin Root

Jan 5, 2023, 10:40:06 PM
to xar...@googlegroups.com
I would also point out that TDS could potentially be the bottleneck. TDS doesn't blindly send blobs of netcdf to the client; it reads and interprets the netcdf data that it is serving. Just today, I achieved an order-of-magnitude speedup (seconds as opposed to minutes) for a time series selection at a single point just by supplying TDS with uncompressed netcdf4 files instead of compressed files.

I'll also note that I've been working with the Unidata folks on some performance issues I've encountered in TDS and the netcdf-java library, so make sure to stay on top of the latest version of TDS!

Cheers!
Ben Root


Chris Barker

Jan 6, 2023, 12:46:07 PM
to xar...@googlegroups.com
Thanks Ben and Ryan,

To answer the questions:

1) As Ryan said, we know OPeNDAP is supported, but when it's deep in the stack, it's hard to debug and, I suspect, tricky because the netCDF C lib presents the same API for files and for an OPeNDAP endpoint, even though there are substantial differences that influence how you may want to access them.

2) We are getting substantially better performance using netCDF4 directly. So we have a problem that *should* be solvable.

3) Yes, TDS is the real bottleneck here -- using raw netCDF, I'm getting throughput at least an order of magnitude less than what my home internet supports :-( -- but we don't control the servers we need to hit, so there's nothing I can do about that.
[note: we *may* be able to have some influence on future serving of at least some of this data, and I'm thinking zarr :-)]

4) I think Ben's example is a good one -- and a serious challenge -- you need to make requests that are not only friendly to OPeNDAP, but friendly to how the data are stored on the server side, which we probably don't know :-(
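
As an illustration of point 4, if the server-side chunk size along a dimension were known, a request could be expanded to whole chunks so the server never has to decompress a chunk for a partial read. This is a sketch under that assumption; the helper names are hypothetical, and in practice the chunk size often has to be guessed or read from whatever metadata the server happens to expose.

```python
def align_to_chunks(start, stop, chunk, length):
    """Expand the half-open range [start, stop) outward to chunk boundaries.

    The result is clipped to the dimension length.
    """
    if chunk <= 0 or not (0 <= start < stop <= length):
        raise ValueError("need chunk > 0 and 0 <= start < stop <= length")
    aligned_start = (start // chunk) * chunk
    aligned_stop = min(-(-stop // chunk) * chunk, length)  # ceiling division
    return aligned_start, aligned_stop


def chunk_aligned_pieces(start, stop, chunk, length):
    """Split the aligned range into one (start, stop) piece per chunk."""
    lo, hi = align_to_chunks(start, stop, chunk, length)
    return [(a, min(a + chunk, hi)) for a in range(lo, hi, chunk)]


# Wanted: indices 130..409 of a dimension of length 1000, with an assumed
# server-side chunk size of 100 along that dimension.
print(align_to_chunks(130, 410, 100, 1000))
print(chunk_aligned_pieces(130, 410, 100, 1000))
```

Reading the aligned range and trimming locally trades some extra bytes on the wire for fewer partial-chunk reads on the server. Whether that is a net win depends on the server and the compression in use.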

I'm going to try to boil this down to some smaller digestible examples, and then reach out on the GitHub discussion forum for more advice :-)

-CHB



Ryan Abernathey

Jan 6, 2023, 12:48:28 PM
to xar...@googlegroups.com

> 2) We are getting substantially better performance using netCDF4 directly. So we have a problem that *should* be solvable.

Ok, so THIS should definitely make its way to an Xarray github issue. Is your server public? I'd be happy to help you develop a minimal reproducible example.
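
A starting point for such a reproducible example might look like the sketch below, which times the same sub-array read through netCDF4 and through xarray. The helper names are hypothetical, the URL and variable name in the commented usage are placeholders, and netCDF4 and xarray are imported lazily so the timing helper itself has no dependencies.

```python
import statistics
import time


def time_reads(readers, repeats=3):
    """Time each zero-argument callable; return {label: median seconds}."""
    results = {}
    for label, read in readers.items():
        samples = []
        for _ in range(repeats):
            t0 = time.perf_counter()
            read()
            samples.append(time.perf_counter() - t0)
        results[label] = statistics.median(samples)
    return results


def read_with_netcdf4(url, name, key):
    """Read one sub-array with netCDF4 directly."""
    import netCDF4  # imported lazily so the helper above runs without it
    with netCDF4.Dataset(url) as nc:
        return nc.variables[name][key]


def read_with_xarray(url, name, key):
    """Read the same sub-array through xarray's netcdf4 backend."""
    import xarray as xr
    with xr.open_dataset(url, engine="netcdf4") as ds:
        return ds[name][key].values  # .values forces the actual download


# Usage with placeholder values (replace with a real public server):
# URL = "https://example.com/thredds/dodsC/some/dataset"
# KEY = (slice(0, 10), slice(None), slice(None))
# print(time_reads({
#     "netCDF4": lambda: read_with_netcdf4(URL, "temp", KEY),
#     "xarray": lambda: read_with_xarray(URL, "temp", KEY),
# }))
```

Server-side caching can favor whichever reader runs second, so it is worth swapping the order or using a different slice per reader before drawing conclusions.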
 