performance issues with open_mfdataset


Ryan Abernathey

May 16, 2017, 12:33:46 PM
to xar...@googlegroups.com
Hi XArray Folks,

I am trying to hunt down the source of a big performance bottleneck in xarray

The scenario is the following: I have a collection of large netCDF files on disk. Each file can be thought of as representing geospatial data: the first dimension is time, and the remaining dimensions are spatial (e.g. x and y). I want to open the data as an xarray dataset, concatenating along the time dimension (using open_mfdataset), select a single point in space, and load the data for that point into memory, creating a timeseries.

Instead of working lazily, xarray is loading the whole dataset into memory just to extract the one point.
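
For concreteness, a minimal sketch of the workflow (the file pattern and variable name here are hypothetical):

import xarray as xr

# Open many files lazily, concatenating along time (paths hypothetical).
ds = xr.open_mfdataset('data/*.nc', concat_dim='time')

# Select a single spatial point; this should remain lazy...
ts = ds['temperature'].isel(x=100, y=200)

# ...but loading it pulls far more than one value per file into memory.
ts.load()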

I have created a clearly documented, self-contained, reproducible example of this issue here:

And I have an open github issue here:

My impression is that the lazy indexing of the underlying netCDF arrays is breaking down somewhere within the dask graph. I don't know whether this bug has always been with us or was introduced in some recent version. The answer might lie in the netcdf4 backend. Whatever the case, I see it as a pretty serious problem that is becoming a major obstacle to my science. I often work with 10TB+ datasets. Right now, it seems more efficient to bypass xarray completely to do this sort of timeseries extraction.

I am happy to work with anyone who is interested in trying to sort out this bug. I imagine others on this list must have encountered it already.

Cheers,
Ryan Abernathey

Matthew Rocklin

May 16, 2017, 4:27:59 PM
to xarray
How unreasonable is it to assume that all files in a Pile-of-NetCDF-files have exactly the same structure?


Ryan Abernathey

May 16, 2017, 4:30:33 PM
to xar...@googlegroups.com
What do you mean by "structure"? Do you mean the same dimensions and coordinates?

Matthew Rocklin

May 16, 2017, 4:34:28 PM
to xarray
I'm asking if it's feasible to build the dask graph and collect all of the metadata for the XArray DataSet without opening all of the files. 

Matthew Rocklin

May 16, 2017, 4:34:42 PM
to xarray
(sorry for the lack of deep knowledge of xarray and the domain here)

Ryan Abernathey

May 16, 2017, 4:41:32 PM
to xar...@googlegroups.com
I think in some cases it is feasible. For example, if the user specifies that certain dimensions are "prealigned," then xarray can skip an expensive indexing step when it creates the dataset. I have a separate issue open for the slow performance of open_mfdataset due to this indexing and alignment step:

However, unless I am misunderstanding something, I don't think that is relevant to the performance problem described in my notebook. In this case, XArray has already opened all the files, aligned all the indexes, and concatenated them together into dask arrays. This performance problem comes when I try to select data from those arrays.

Matt, your expertise would be very helpful in deciphering the underlying dask graphs (see the bottom of the notebook). Can you tell why we are not able to lazily extract a single value from each dask array without loading the whole file into memory?


Matthew Rocklin

May 16, 2017, 4:48:44 PM
to xarray
Without optimizations, dask.array is probably told to pick out the entire slice from disk and then told to slice out a particular element. This would explain why you're loading everything and then throwing most of it away. Touching a tiny bit of a slice can be just as bad as touching the entire slice.

However, this was a problem a long time ago and was largely resolved through graph optimizations, which try to fuse slicing operations. Maybe these have failed in some way? You would want to look at the optimized graph of the resulting dask.array to check that things are still as you would expect:

result._optimize(result.dask, result._keys())
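
To compare the raw and optimized graphs visually, something like the following should also work (output filenames hypothetical; assumes result is a dask array):

# Render both graphs to files for side-by-side comparison.
result.visualize('raw.png', optimize_graph=False)
result.visualize('optimized.png', optimize_graph=True)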

If your data is compressed and unchunked, then loading a single element would also probably cost about the same as loading the entire slice.

If you're doing many of these operations, you might consider rechunking your arrays with the rechunk method. This can be an expensive operation to do once, but it can make future operations like this cheaper. It also assumes that you have sufficient RAM.
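
A rough sketch of that rechunking idea on a stand-in array (shapes and chunk sizes hypothetical):

import dask.array as da

# Stand-in for the concatenated data: one time step per chunk.
arr = da.zeros((1000, 512, 512), chunks=(1, 512, 512))

# Rechunk so each chunk spans the full time axis over a small spatial tile;
# extracting arr[:, j, i] then touches one chunk instead of 1000.
rechunked = arr.rechunk((1000, 32, 32))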

Stephan Hoyer

May 16, 2017, 7:30:31 PM
to xarray
Hi Ryan,

I'm sorry to hear you're running into trouble here. I have two guesses about what may be going on. The first is that xarray's decoding of netCDF files using CF conventions may not be happening in a lazy way. I think there is an open GitHub issue about that (and I'm happy to elaborate on what a fix would look like).
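
One quick way to test the first guess is to turn decoding off and compare timings (paths hypothetical):

# If lazy CF decoding is the culprit, this version should be fast.
ds_raw = xr.open_mfdataset('data/*.nc', decode_cf=False)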

The second guess, which I think is more likely, is that the dask array optimization to fuse slicing operations has indeed broken, in at least some important cases. Although we had it mostly working before, the design of a post-hoc optimization step is inherently somewhat fragile (but this is a hard problem). Hopefully we can debug this by looking at your dask graphs.

Best,
Stephan

Ryan Abernathey

May 17, 2017, 10:20:44 AM
to xar...@googlegroups.com
Thanks for the feedback.

I just re-ran my example with decode_cf=False, and nothing was qualitatively different. This suggests that the second possibility (the dask array optimization to fuse slicing operations breaking down) is more likely.

I also checked out earlier versions of xarray (0.7.2) and dask (0.7.6) to see if anything was different. It was not. So, contrary to what I implied in my original email, this problem is not new; it has been with us for a while.

Matt, I updated the notebook with the output of result._optimize(result.dask, result._keys()). I'm not sure exactly what to make of this output. I have attached partial dot graphs for both the original and optimized versions.

-Ryan

[Attached: partial dot graphs labeled "Original" and "Optimized" (inline images not preserved)]



Matthew Rocklin

May 17, 2017, 10:39:33 AM
to xarray
Yeah, so it looks like there is a getarray followed by a getitem. We'll have to dive into why these two weren't fused into a single getarray/getitem call.

I can look at your notebook, but probably won't get to this very soon (conferences this week and next). I suspect that if you were to put a simple test case on the dask issue tracker, one of the other dask devs (probably jcrist) would handle it quickly. We should probably have a setup that creates and slices from a simple netCDF file in the dask.array test suite to make sure that regressions like this don't occur in the future.
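
A minimal test case along those lines might look like this sketch (file name and sizes hypothetical; it reuses the private _optimize call from earlier in the thread):

import dask.array as da
import netCDF4
import numpy as np

# Write a small netCDF file to slice from.
with netCDF4.Dataset('test.nc', 'w') as nc:
    nc.createDimension('time', 100)
    nc.createDimension('x', 50)
    var = nc.createVariable('data', 'f8', ('time', 'x'))
    var[:] = np.random.rand(100, 50)

# Wrap the on-disk variable in dask and slice out a single point per chunk.
nc = netCDF4.Dataset('test.nc')
arr = da.from_array(nc.variables['data'], chunks=(10, 50))
point = arr[:, 0]

# After optimization, the read and the slice should be fused into one task
# per chunk; many more tasks than chunks would indicate a fusion regression.
optimized = point._optimize(point.dask, point._keys())
print(len(optimized), len(point.chunks[0]))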
