writing a huge netcdf file


Ryan Abernathey

Jun 23, 2016, 9:48:17 AM6/23/16
to xar...@googlegroups.com
Hi xarrayers,

I want to use xarray to open several large netcdf files, concatenate them into one even bigger dataset, and write to disk using .to_netcdf. The final file size will be ~350 GB, larger than my RAM.

Will the entire concatenated dataset have to be read into memory (impossible), or can I make it write in chunks?
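
Roughly, the workflow looks like this (a minimal sketch; the file names and chunking are just placeholders):

import xarray as xr

# Hypothetical input files; each is opened lazily with dask, one chunk per timestep.
paths = ['run_000.nc', 'run_001.nc', 'run_002.nc']
datasets = [xr.open_dataset(p, chunks={'time': 1}) for p in paths]

# Concatenate along time into one big, still-lazy dataset, then write it out.
big = xr.concat(datasets, dim='time')
big.to_netcdf('combined.nc')  # does this stream chunk-by-chunk, or load everything first?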

Thanks,
Ryan

Matthew Rocklin

Jun 23, 2016, 10:20:24 AM6/23/16
to xarray
Dask.array will happily store on-disk array data in chunks
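
For example, roughly (a sketch, not xarray-specific; the array shape and file name are made up):

import dask.array as da
import h5py

# A lazy, chunked dask array; nothing is held in memory yet.
x = da.random.random((100000, 1000), chunks=(10000, 1000))

# Stream it into an on-disk HDF5 dataset one chunk at a time.
with h5py.File('big.h5', 'w') as f:
    dset = f.create_dataset('x', shape=x.shape, dtype=x.dtype)
    da.store(x, dset)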


Wolfram Jr., Phillip

Jun 23, 2016, 10:29:35 AM6/23/16
to xar...@googlegroups.com
Hi Ryan,

I was able to write large files, although we did run into a potential bug related to serialization in dask / HDF5: https://github.com/pydata/xarray/issues/793. I don't know whether you'll hit the same problem, but if you do I would be curious to see whether the error is similar. Hopefully the issue has been fully resolved.

Best regards,
Phil

Ryan Abernathey

Jun 23, 2016, 10:39:17 AM6/23/16
to xar...@googlegroups.com
On Thu, Jun 23, 2016 at 10:20 AM, Matthew Rocklin <mroc...@gmail.com> wrote:
Dask.array will happily store on-disk array data in chunks

Yes, I know this. What I don't know is whether xarray's .to_netcdf method will write each chunk one at a time or will instead read all the chunks into memory before writing.
 

Stephan Hoyer

Jun 23, 2016, 1:11:05 PM6/23/16
to xar...@googlegroups.com
Xarray should indeed write data out to netcdf using chunks. This only works with the netcdf4 and h5netcdf back ends though -- scipy does not support incremental writes.
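
For example, assuming ds is a dask-backed Dataset, either of these should write incrementally (a sketch, not a guarantee):

ds.to_netcdf('out.nc', engine='netcdf4')
ds.to_netcdf('out.nc', engine='h5netcdf')
# engine='scipy', by contrast, would need the whole array in memory before writing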

Ryan Abernathey

Dec 12, 2016, 10:54:55 PM12/12/16
to xar...@googlegroups.com
I am resurrecting this thread because I have not solved my issue.

I have a big dataset

region.nbytes / 1e9
>>> 429.981772112 # GB

It is chunked, and each chunk can easily fit in memory:

region.chunks
>>> Frozen(SortedKeysDict({'i': (2160,), 'k': (90,), 'time': (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 'j': (2160,)}))

When I call

region.to_netcdf('test.nc', engine='netcdf4')

all the memory on the node (128 GB) gradually gets eaten up until it crashes.

I was hoping that instead it would intelligently stream the data and write it to disk in a way that did not overflow the RAM.

This dataset is backed by my own custom datastore (https://github.com/xgcm/xmitgcm), which uses numpy memmaps to read the binary data. I wonder if this is part of the problem?

Any advice would be appreciated.

-Ryan



Stephan Hoyer

Dec 13, 2016, 11:58:10 AM12/13/16
to xarray
What does your computation look like? For debugging purposes, it would probably be best to start with something simple, just copying data from your datastore into a single netCDF file.

A few other ideas (rough code sketches follow the list):
- Can you do a computation in a streaming fashion that doesn't involve a write, e.g., calculating .mean()?
- It might help to set chunks in the resulting netCDF file, using "chunksizes" in encoding.
- It would be interesting to see whether writing multiple netCDF files helps, using save_mfdataset.
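
Sketches of what these experiments could look like (the variable name 'theta', the chunk sizes, and the grouping by year are placeholders):

import xarray as xr

# 1. A streaming computation with no write at all.
print(region['theta'].mean().compute())

# 2. Explicit on-disk chunking via the encoding argument (netcdf4/h5netcdf engines).
encoding = {'theta': {'chunksizes': (1, 90, 2160, 2160)}}  # one chunk per timestep
region.to_netcdf('test.nc', engine='netcdf4', encoding=encoding)

# 3. Split along time and write several smaller files in one call,
#    assuming 'time' is a datetime coordinate.
years, datasets = zip(*region.groupby('time.year'))
paths = ['out_%04d.nc' % y for y in years]
xr.save_mfdataset(datasets, paths)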



Ryan Abernathey

Dec 15, 2016, 12:39:34 PM12/15/16
to xar...@googlegroups.com
Thanks for the feedback.

Here is the pipeline: I am basically just copying data from my datastore to a netcdf file.

A few other ideas:
- Can you do a computation in a streaming fashion that doesn't involve a write, e.g., calculating .mean()?

I tried this. The results were inconclusive. With a small dataset (n=12), it seemed to work: CPU utilization gets high (~2000% on a 28-core machine) and RAM utilization gets high (90%) but never exceeds the available resident memory. So dask seems to be behaving well and scheduling the computation within the limits of the hardware. But when I tried a bigger dataset (n=128), the RAM was overloaded and the node crashed.

Part of the problem may be that, due to some annoying details of the way I read these files, much more memory is briefly allocated than what the dataset actually uses. So perhaps I need to tell dask to use a lower limit on its memory when scheduling.
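
A couple of ways I might try capping that (a sketch assuming a recent dask; whether the distributed scheduler cooperates with the netCDF writer is a separate question):

import dask
from dask.distributed import Client

# Option 1: single-threaded scheduler, so only one chunk is in flight at a time.
with dask.config.set(scheduler='synchronous'):
    region.to_netcdf('test.nc', engine='netcdf4')

# Option 2: a local cluster with few workers and an explicit per-worker memory cap.
client = Client(n_workers=4, threads_per_worker=1, memory_limit='16GB')
region.to_netcdf('test.nc', engine='netcdf4')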
 
- It might help to set chunks in the resulting netCDF file, using "chunksizes" in encoding.

I tried this and it made no difference. (I set the chunksizes to one timestep, (1, 90, 2160, 2160); I wasn't sure exactly what this keyword expects.)
But it looks like the problem might be deeper, given the above test.

- It would be interesting to see whether writing multiple netCDF files helps, using save_mfdataset.

No real luck with this either.
 