writing a huge netcdf file


Ryan Abernathey

Jun 23, 2016, 9:48:17 AM6/23/16
to xar...@googlegroups.com
Hi xarrayers,

I want to use xarray to open several large netcdf files, concatenate them into one even bigger dataset, and write to disk using .to_netcdf. The final file size will be ~350 GB, larger than my RAM.

Will the entire concatenated dataset have to be read into memory (impossible), or can I make it write in chunks?
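
Roughly, the workflow looks like this (a minimal sketch; the file names and chunking are just placeholders):

import xarray as xr

# Hypothetical input files; each is opened lazily with dask, one chunk per timestep.
paths = ['run_000.nc', 'run_001.nc', 'run_002.nc']
datasets = [xr.open_dataset(p, chunks={'time': 1}) for p in paths]

# Concatenate along time into one big, still-lazy dataset, then write it out.
big = xr.concat(datasets, dim='time')
big.to_netcdf('combined.nc')  # does this stream chunk-by-chunk, or load everything first?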

Thanks,
Ryan

Matthew Rocklin

Jun 23, 2016, 10:20:24 AM6/23/16
to xarray
Dask.array will happily store on-disk array data in chunks
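
For example, roughly (a sketch, not xarray-specific; the array shape and file name are made up):

import dask.array as da
import h5py

# A lazy, chunked dask array; nothing is held in memory yet.
x = da.random.random((100000, 1000), chunks=(10000, 1000))

# Stream it into an on-disk HDF5 dataset one chunk at a time.
with h5py.File('big.h5', 'w') as f:
    dset = f.create_dataset('x', shape=x.shape, dtype=x.dtype)
    da.store(x, dset)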


Wolfram Jr., Phillip

Jun 23, 2016, 10:29:35 AM6/23/16
to xar...@googlegroups.com
Hi Ryan,

I was able to write large files, although we did run into a potential bug related to serialization in dask / HDF5: https://github.com/pydata/xarray/issues/793. I don't know whether you'll hit the same problem, but if you do I would be curious to see whether the error is similar. Hopefully the issue has been fully resolved.

Best regards,
Phil

Ryan Abernathey

Jun 23, 2016, 10:39:17 AM6/23/16
to xar...@googlegroups.com
On Thu, Jun 23, 2016 at 10:20 AM, Matthew Rocklin <mroc...@gmail.com> wrote:
Dask.array will happily store on-disk array data in chunks

Yes, I know this. What I don't know is whether xarray's .to_netcdf method will write each chunk one at a time or will instead read all the chunks into memory before writing.
 

Stephan Hoyer

Jun 23, 2016, 1:11:05 PM6/23/16
to xar...@googlegroups.com
Xarray should indeed write data out to netcdf using chunks. This only works with the netcdf4 and h5netcdf back ends though -- scipy does not support incremental writes.
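
For example, assuming ds is a dask-backed Dataset, either of these should write incrementally (a sketch, not a guarantee):

ds.to_netcdf('out.nc', engine='netcdf4')
ds.to_netcdf('out.nc', engine='h5netcdf')
# engine='scipy', by contrast, would need the whole array in memory before writing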

Ryan Abernathey

Dec 12, 2016, 10:54:55 PM12/12/16
to xar...@googlegroups.com
I am resurrecting this thread because I have not solved my issue.

I have a big dataset

region.nbytes / 1e9
>>> 429.981772112 # GB

It is chunked, and each chunk can easily fit in memory:

region.chunks
>>> Frozen(SortedKeysDict({'i': (2160,), 'k': (90,), 'time': (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 'j': (2160,)}))

When I call

region.to_netcdf('test.nc', engine='netcdf4')

all the memory on the node (128 GB) gradually gets eaten up until it crashes.

I was hoping that instead it would intelligently stream the data and write it to disk in a way that did not overflow the RAM.

This dataset is backed by my own custom datastore (https://github.com/xgcm/xmitgcm), which uses numpy memmaps to read the binary data. I wonder if this is part of the problem?

Any advice would be appreciated.

-Ryan



Stephan Hoyer

Dec 13, 2016, 11:58:10 AM12/13/16
to xarray
What does your computation look like? For debugging purposes, it would probably be best to start with something simple, just copying data from your datastore into a single netCDF file.

A few other ideas (rough code sketches follow the list):
- Can you do a computation in a streaming fashion that doesn't involve a write, e.g., calculating .mean()?
- It might help to set chunks in the resulting netCDF file, using "chunksizes" in encoding.
- It would be interesting to see whether writing multiple netCDF files helps, using save_mfdataset.
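
Sketches of what these experiments could look like (the variable name 'theta', the chunk sizes, and the grouping by year are placeholders):

import xarray as xr

# 1. A streaming computation with no write at all.
print(region['theta'].mean().compute())

# 2. Explicit on-disk chunking via the encoding argument (netcdf4/h5netcdf engines).
encoding = {'theta': {'chunksizes': (1, 90, 2160, 2160)}}  # one chunk per timestep
region.to_netcdf('test.nc', engine='netcdf4', encoding=encoding)

# 3. Split along time and write several smaller files in one call,
#    assuming 'time' is a datetime coordinate.
years, datasets = zip(*region.groupby('time.year'))
paths = ['out_%04d.nc' % y for y in years]
xr.save_mfdataset(datasets, paths)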



Ryan Abernathey

Dec 15, 2016, 12:39:34 PM12/15/16
to xar...@googlegroups.com
Thanks for the feedback.

Here is the pipeline: I am basically just copying data from my datastore to a netcdf file.

A few other ideas:
- Can you do a computation in a streaming fashion that doesn't involve a write, e.g., calculating .mean()?

I tried this. The results were inconclusive. With a small dataset (n=12), it seemed to work: CPU utilization gets high (~2000% on a 28-core machine) and RAM utilization gets high (90%) but never exceeds the available resident memory. So dask seems to be behaving well and scheduling the computation within the limits of the hardware. But when I tried a bigger dataset (n=128), the RAM was overloaded and the node crashed.

Part of the problem may be that, due to some annoying details of the way I read these files, much more memory is briefly allocated than what the dataset actually uses. So perhaps I need to tell dask to use a lower limit on its memory when scheduling.
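
A couple of ways I might try capping that (a sketch assuming a recent dask; whether the distributed scheduler cooperates with the netCDF writer is a separate question):

import dask
from dask.distributed import Client

# Option 1: single-threaded scheduler, so only one chunk is in flight at a time.
with dask.config.set(scheduler='synchronous'):
    region.to_netcdf('test.nc', engine='netcdf4')

# Option 2: a local cluster with few workers and an explicit per-worker memory cap.
client = Client(n_workers=4, threads_per_worker=1, memory_limit='16GB')
region.to_netcdf('test.nc', engine='netcdf4')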
 
- It might help to set chunks in the resulting netCDF file, using "chunksizes" in encoding.

I tried this and it made no difference. (I set the chunksizes to one timestep, (1, 90, 2160, 2160); I wasn't sure exactly what this keyword expects.)
But it looks like the problem might be deeper, given the above test.

- It would be interesting to see whether writing multiple netCDF files helps, using save_mfdataset.

No real luck with this either.
 