Efficient use of ds.to_netcdf


SANJIV

May 10, 2020, 5:17:15 PM5/10/20
to xarray

Hi,

Objective: I am trying to write out netCDF files from xarray datasets using dataset.to_netcdf. My original dataset ds has just one 4d array, temp, with dimensions (306, 193, 50, 1336) and size 15 GB. The end goal is to create a second dataset, ds_rolling, that is a rolling mean of ds, computed along the fourth (unlimited) dimension of size 1336, and then write it out as a netCDF file. I would therefore expect the output file to be roughly 15 GB as well. I compute the rolling mean using:


ds_rolling = ds.copy()  # shallow copy, so the rolling mean below doesn't also replace ds['temp']
ds_rolling['temp'] = ds_rolling['temp'].rolling(ocean_time=112, center=True).mean()

where temp is the 4d array and ocean_time is the dimension of size 1336.


[I have plotted the rolling mean to confirm the command above is doing the correct thing.]


Resources: 18 cores on one node, 108 GB memory (6 GB per core)


Questions: (i) Calling ds_rolling.to_netcdf by itself never finishes executing, and I end up having to abort the job. Don't I have more than enough memory to write out a 15 GB file?
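For concreteness, a minimal sketch of the write step, including a delayed variant (compute=False is a standard to_netcdf option and the file name is a placeholder; I am not claiming the delayed form fixes the hang, it just makes progress visible):

# Plain write: this is the call that never finishes
ds_rolling.to_netcdf('ds_rolling.nc')

# Delayed variant: build the write lazily, then execute it with a progress bar
from dask.diagnostics import ProgressBar
delayed_write = ds_rolling.to_netcdf('ds_rolling.nc', compute=False)
with ProgressBar():
    delayed_write.compute()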


(ii) I decided to test writing out smaller slices. In the code below, dim1 is the first dimension of size 306 and dim2 is the second dimension of size 193.


Case 1: Write out a (175, 175, 50, 43) array


# Carve out a smaller slice
ds_rslice = ds_rolling.isel(ocean_time=range(57, 100), dim1=range(0, 175), dim2=range(0, 175))

# Load into memory. This step is not possible with the full 4d array;
# again, I don't understand why, given that I have ample memory.
ds_rslice = ds_rslice.persist()

# Disable the _FillValue encoding; the write was taking much longer without this
encoding = {'temp': {'_FillValue': False}}

ds_rslice.to_netcdf('test.nc', encoding=encoding)

# The final step executes in 222.066 s and creates a file of size 252 MB.

Case 2: Write out a (306, 193, 50, 1) array (much smaller than Case 1)


# Note this slice uses all the values along dim1 and dim2
ds_rslice = ds_rolling.isel(ocean_time=range(57, 58), dim1=range(0, 306), dim2=range(0, 193))

Using the same command sequence as in Case 1, the final write step executes in 433.296 s and creates a 12 MB file.
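Spelled out, the rest of that sequence is identical to Case 1:

# Load into memory, set the encoding, and write, exactly as in Case 1
ds_rslice = ds_rslice.persist()
encoding = {'temp': {'_FillValue': False}}
ds_rslice.to_netcdf('test.nc', encoding=encoding)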


Why is Case 2 so much slower than Case 1, even though its output file is about 20 times smaller?


Thanks for any insights!


Sanjiv

Texas A & M

Deepak Cherian

May 10, 2020, 5:42:41 PM5/10/20
to xar...@googlegroups.com

rolling creates a lot of tasks and holds a lot of data in memory, so it is pretty resource intensive AFAIK [len(ds_rolling.__dask_graph__()) will tell you the number of tasks]. Rolling with dask datasets was buggy, so the current implementation is a little inefficient AFAIR (but correct).
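For example (a sketch using the names from your post; __dask_graph__ is part of the standard dask collection protocol that xarray Datasets implement):

# Count the tasks dask would have to schedule for this dataset;
# rolling on a poorly chunked array can blow this number up
n_tasks = len(ds_rolling.__dask_graph__())
print(n_tasks)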

1. Does ds_rolling.compute() work? It should, depending on your memory situation, though it may take some time to actually start computing anything.

2. Is this a dask dataset? The difference between Case 1 and Case 2 will depend on chunking. Also note that slicing a dask dataset scales with the size of the entire dask graph, so in some respects it is, unintuitively, as slow as operating on the whole thing (e.g. https://github.com/dask/dask/issues/5918).

Ah, the chunking along ocean_time in the original dataset will be an issue. You'll want this to be O(100) to reduce the number of tasks that get created, I think. Some experimentation and checking with len(ds_rolling.__dask_graph__()) should help.
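Something along these lines (a sketch; a chunk size of 112 just matches your rolling window and is only a starting point for experimentation):

# Rechunk along ocean_time before computing the rolling mean
ds = ds.chunk({'ocean_time': 112})
ds_rolling = ds.copy()
ds_rolling['temp'] = ds_rolling['temp'].rolling(ocean_time=112, center=True).mean()

# The graph should shrink as the ocean_time chunks grow
print(len(ds_rolling.__dask_graph__()))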

Deepak


SANJIV

May 10, 2020, 8:03:20 PM5/10/20
to xarray
Hi Deepak,

Thanks for the response.

I am using dask datasets; I should have mentioned this in my earlier post. Also, just to be clear, the rolling mean calculation itself is not giving me any problems: it executes in a few seconds even on the original 4d array. Are you saying that the memory-intensive nature of the rolling mean negatively impacts the later step, when I call ds.to_netcdf?

Both Cases 1 and 2 use the same chunk sizes as the original 4d array, which is chunked along dim1 and dim2 but not along ocean_time. Based on what you say about the desirability of having chunks of O(100) along ocean_time, I will try different chunk sizes along this dimension as well.
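To see the current layout before experimenting, something like this (a sketch, assuming temp is the dask-backed variable):

print(ds['temp'].chunks)          # per-dimension tuples of chunk sizes
print(ds['temp'].data.chunksize)  # shape of the largest single chunk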

I tried ds_rolling.compute(), but that did not work and I had to kill the job. I haven't tried it yet on the smaller slices (Cases 1 and 2), but persist() seems to work there.

Sanjiv