Computation on arrays larger than memory and serialization to split netCDF files


Phillip Wolfram

Jan 18, 2016, 4:07:39 PM
to xarray, Matthew Rocklin, Ryan Abernathey
Hi All,

Following up on a conversation I had with Matthew Rocklin and Ryan Abernathey this morning.  I've emailed this list in lieu of an issue, but am happy to submit an issue if that would be better.

What is the preferred xarray / dask approach for computing results (xarray arrays) whose output is larger than fits in memory?  Dask should be able to handle the computation itself, and a memory error will occur on a .load() or .values call, but not necessarily on serialization to output.  In particular, is there a way to serialize these large arrays, which won't fit into memory, into multiple output netCDF files, split across the unlimited netCDF dimension, say Time?

Currently I am opening multiple netCDF files for output and essentially handling the writes in an SPMD fashion via the python netCDF4 bindings, with each file spanning a small time segment.  This does not seem to be the optimal solution.  However, I'm not sure a single large O(petabyte) netCDF output file is the right approach either, because such a file is hard to archive.  Perhaps this is a non-issue, but the benefit of serializing to smaller files seems to be that the problem could be readily parallelized, with each processor or node responsible for writing its own file.

I'm hoping you all might have some advice / context on the best way to approach this problem.

Thanks in advance,
Phil
--------------------------------------------------
Phillip J. Wolfram, Ph.D.
Postdoctoral Research Associate
Climate, Ocean and Sea Ice Modeling
T-3 Fluid Dynamics and Structural Mechanics
Los Alamos National Laboratory

Matthew Rocklin

Jan 18, 2016, 4:19:22 PM
to Phillip Wolfram, xarray, Ryan Abernathey
From dask.array's perspective:

Dask.array writes data to any object that supports numpy-style setitem syntax, like the following:

dataset[my_slice] = my_numpy_array

Objects like h5py.Dataset and netcdf objects support this syntax.


So dask.array would work today without modification if we had such an object that represented many netcdf files at once and supported numpy-style setitem syntax, placing the numpy array properly across the right files.  This work could happen easily without deep knowledge of either project.
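
As a rough illustration of such an object (a hypothetical sketch; `SplitTarget` and its use of plain numpy arrays as stand-ins for per-file netCDF variables are not an existing API):

```python
import numpy as np

class SplitTarget:
    """Expose several per-file arrays as one target with numpy-style
    __setitem__, split along the first (time) axis.

    Plain numpy arrays stand in for the per-file netCDF variables
    here; with the python netCDF4 bindings each piece would instead
    be a variable from its own Dataset.
    """

    def __init__(self, pieces):
        self.pieces = pieces
        # starting offset of each piece along the time axis
        self.offsets = np.cumsum([0] + [p.shape[0] for p in pieces])

    def __setitem__(self, key, value):
        tslice, rest = key[0], key[1:]   # assume a tuple of slices
        start = tslice.start or 0
        stop = tslice.stop if tslice.stop is not None else self.offsets[-1]
        written = 0
        for piece, off, end in zip(self.pieces, self.offsets, self.offsets[1:]):
            lo, hi = max(start, off), min(stop, end)
            if lo < hi:  # this write overlaps this file's time range
                piece[(slice(lo - off, hi - off),) + rest] = \
                    value[(slice(written, written + hi - lo),) + rest]
                written += hi - lo
```

A dask array could then be written across the files with `da.store(my_dask_array, SplitTarget([...]))`, since store only ever calls setitem with tuples of slices.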

Alternatively, we could make the dask.array.store function optionally lazy so that users (or xarray) could call store many times before triggering execution.
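
For readers finding this later: the lazy-store idea described here can be sketched with the compute=False keyword that dask.array.store subsequently grew (the keyword name is per current dask; this thread predates it):

```python
import dask
import dask.array as da
import numpy as np

x = da.arange(10, chunks=5)
y = x * 2
t1 = np.empty(10, dtype="int64")
t2 = np.empty(10, dtype="int64")

# compute=False returns a lazy token instead of writing immediately,
# so several stores can be triggered together in one scheduler pass
d1 = da.store(x, t1, compute=False)
d2 = da.store(y, t2, compute=False)
dask.compute(d1, d2)
```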

Stephan Hoyer

Jan 18, 2016, 8:42:59 PM
to xar...@googlegroups.com, Phillip Wolfram, Ryan Abernathey
Hi Phil,

I agree that creating gigantic netCDF files is not a great solution.

I actually wrote an xarray function for exactly this purpose a few months ago:

It hasn't been extensively tested (nor advertised), but it did work for us. It jumps through some gymnastics to store multiple dask arrays simultaneously using a single da.store call.
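
The multiple-arrays-in-one-call pattern is part of dask's public interface: da.store accepts matching lists of sources and targets, so all the writes share a single scheduler pass. A minimal sketch with in-memory targets standing in for netCDF variables:

```python
import dask.array as da
import numpy as np

x = da.ones((4, 3), chunks=(2, 3))
y = da.zeros((4, 3), chunks=(2, 3))
tx = np.empty((4, 3))
ty = np.empty((4, 3))

# one call stores both dask arrays simultaneously
da.store([x, y], [tx, ty])
```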

Cheers,
Stephan


Phillip Wolfram

Jan 18, 2016, 9:57:45 PM
to Stephan Hoyer, xar...@googlegroups.com, Ryan Abernathey

Hi Stephan,

Awesome! This looks like what I need; I will try it out.  A follow-up question: does this imply that one could build up a set of dask arrays collected in an xarray Dataset such that they are larger than RAM, and then compute and write them out to netCDF files with on-node parallelism, all without encountering an out-of-memory error?

I'm following up with Matthew to see how to get off a node via distributed.  Can you envision problems in this marriage of the libraries?  The ultimate goal, of course, is a fully parallel (on and off node) post-processing framework for analysis of really large climate datasets.  I'm hoping, as I know Ryan is too, that we can have a really scalable solution with this dynamic duo.

Thanks,
Phil


Stephan Hoyer

Jan 20, 2016, 11:47:18 AM
to Phillip Wolfram, xar...@googlegroups.com, Ryan Abernathey
Hi Phil,

Yes, this is the xarray/dask dream! We have done this successfully on a single node.

There are still a few kinks to work out with the system (e.g., distributed IO, improvements to the dask scheduler), but it is definitely usable by advanced users right now. With your help, I think we could make this into a robust tool.
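
For readers arriving later: the single-node workflow discussed here closely matches what became xarray's save_mfdataset, which writes a list of datasets to a matching list of paths in one call (the exact function Stephan mentions isn't linked in the thread, so this is a hedged pointer; the tiny dataset and file names below are illustrative only):

```python
import os
import tempfile
import numpy as np
import xarray as xr

# a tiny, chunked stand-in for a larger-than-memory, dask-backed dataset
ds = xr.Dataset({"temp": ("time", np.arange(6.0))},
                coords={"time": np.arange(6)}).chunk({"time": 3})

outdir = tempfile.mkdtemp()
paths = [os.path.join(outdir, "piece_0.nc"),
         os.path.join(outdir, "piece_1.nc")]
pieces = [ds.isel(time=slice(0, 3)), ds.isel(time=slice(3, 6))]

# each time segment goes to its own netCDF file, in a single call
xr.save_mfdataset(pieces, paths)
```

Reopening with xr.open_mfdataset(paths) then recovers the whole dataset lazily.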

Cheers,
Stephan

Phillip Wolfram

Mar 18, 2016, 8:57:02 PM
to Stephan Hoyer, xar...@googlegroups.com, Matthew Rocklin, Ryan Abernathey
Thanks Stephan!  

Matt and I have been making progress with using distributed in production contexts to analyze data from 1320 netCDF4 files. 

Should I open an issue to see if we can get xarray to work with distributed, or is it still premature to start this process in your mind? I'm not sure exactly what is involved, and depending upon the breakdown of the challenge, perhaps you can solicit contributions from me and others.  For example, I know Ryan Abernathey is interested in this too.

Long term, I think this will be a huge benefit to the climate community, and there is value in doing this sooner rather than later, provided we can get a clear idea of what is involved and how to get the expertise (even funding, if required) to pull this off.

Thanks,
Phil

Stephan Hoyer

Mar 20, 2016, 11:53:01 PM
to Phillip Wolfram, xar...@googlegroups.com, Matthew Rocklin, Ryan Abernathey
Hi Phillip,

It's awesome to hear about your dask experiments!

I don't think it's premature to start thinking about using xarray with dask distributed, so an issue to keep track of things on the xarray side seems in order. It will definitely require some refactoring of the xarray backend system to make this work cleanly, but that's OK -- the xarray backend system is marked as experimental/internal API precisely because we hadn't figured out all the use cases yet.

To be honest, I've never been entirely happy with the design we took there (we use inheritance rather than composition for backend classes), but we did get it to work for our use cases. Some refactoring with an eye towards compatibility with dask distributed seems like a very worthwhile endeavor. We do have the benefit of a pretty large test suite covering existing use cases.

I am happy to give guidance on this and share expertise (especially the reasons for aspects of the existing design), but I can't commit to writing much code. I'd love to give an estimate for the difficulty here, but that's hard to say before we've started exploring the design.

Best,
Stephan

Phillip Wolfram

Mar 21, 2016, 7:20:38 PM
to Stephan Hoyer, xar...@googlegroups.com, Matthew Rocklin, Ryan Abernathey
Thanks Stephan!  Please see https://github.com/pydata/xarray/issues/798 for the issue.