Hi,
Objective: I am trying to write out netCDF files from xarray datasets using dataset.to_netcdf. My original dataset ds has just one 4D array, temp, with dimensions (306, 193, 50, 1336) and size 15 GB. The end goal is to create a second dataset ds_rolling that is a rolling mean of ds, computed along the fourth (unlimited) dimension of size 1336, and then write it out as a netCDF file. So I would expect the output netCDF file to be roughly 15 GB in size as well. I compute the rolling mean using:
ds_rolling = ds
ds_rolling['temp'] = ds_rolling['temp'].rolling(ocean_time=112, center=True).mean()
where temp is the 4D array and ocean_time is the dimension of size 1336.
[I have plotted the rolling mean to confirm the command above is doing the correct thing.]
Resources: 18 cores on one node, 108 GB memory (6 GB per core)
Questions: (i) Simply calling ds_rolling.to_netcdf never finishes executing, and I end up having to abort the job. Don't I have more than enough memory to write out a 15 GB file?
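For reference, the call in question is essentially just (the output filename here is only a placeholder):
# This never returns and I have to kill the job
ds_rolling.to_netcdf('temp_rolling.nc')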
(ii) I decided to test writing out smaller chunks. In the code below, dim1 is the first dimension of size 306 and dim2 is the second dimension of size 193.
Case 1: Write out a (175,175,50,43) array
# Carve out a smaller chunk
ds_rslice = ds_rolling.isel(ocean_time=range(57, 100), dim1=range(0, 175), dim2=range(0, 175))
# Load into memory (it was taking much longer without this step).
# This step is not possible with the full 4D array, which again I don't understand given that I have ample memory.
ds_rslice = ds_rslice.persist()
encoding = {'temp': {'_FillValue': False}}
ds_rslice.to_netcdf('test.nc', encoding=encoding)
# The final step executes in 222.066 s and creates a file of size 252 MB.
Case 2: Writing out a (306,193,50,1) array (much smaller than Case 1)
# Note this chunk is using all the values along dim1 and dim2
ds_rslice = ds_rolling.isel(ocean_time=range(57, 58), dim1=range(0, 306), dim2=range(0, 193))
Using the same command sequence as in Case 1, the final writing step executes in 433.296 s and creates a file of size 12 MB.
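For clarity, the rest of that sequence, spelled out (the output filename here is just a placeholder):
ds_rslice = ds_rslice.persist()
encoding = {'temp': {'_FillValue': False}}
ds_rslice.to_netcdf('test2.nc', encoding=encoding)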
Why is Case 2 so much slower than Case 1, even though its output file is about 20 times smaller?
Thanks for any insights!
Sanjiv
Texas A & M
rolling is memory intensive and creates a lot of tasks, so it is pretty resource intensive AFAIK [len(ds_rolling.__dask_graph__()) will tell you the number of tasks]. Rolling with dask datasets was buggy, so the current implementation is a little inefficient AFAIR (but correct).
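A quick sketch of that check (this assumes ds_rolling is dask-backed; __dask_graph__() returns None otherwise):
# count the tasks in the lazy rolling-mean computation
print(len(ds_rolling.__dask_graph__()))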
1. Does ds_rolling.compute() work? It should, depending on your memory situation, though it may take some time to actually start computing anything.
2. Is this a dask dataset? The differences between Case 1 and Case 2 will depend on chunking. Also note that slicing a dask dataset scales with the size of the entire dask graph, so in some respects it is unintuitively as slow as computing the whole thing (e.g. https://github.com/dask/dask/issues/5918).
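A quick way to check both of those (just a sketch):
# None means temp is a plain in-memory numpy array, not a dask array;
# otherwise this prints the block sizes along each dimension
print(ds_rolling['temp'].chunks)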
Ah, the chunking along ocean_time in the original dataset will be an issue. You'll want the chunk size there to be O(100) to reduce the number of tasks that get created, I think. Some experimentation and checking with len(ds_rolling.__dask_graph__()) should help.
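Something along these lines, for example (the chunk size of 112 is just a starting guess to experiment with):
# rechunk along ocean_time, rebuild the rolling mean, and compare task counts
ds_big = ds.chunk({'ocean_time': 112})
rolled = ds_big['temp'].rolling(ocean_time=112, center=True).mean()
print(len(rolled.__dask_graph__()))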
Deepak