Hi Stephan,
Thank you for your answer! Sorry for not replying sooner; I wanted to investigate this a bit further so I could come back with more meaningful information about the problem I'm facing.
I tried your suggestions. The on-disk chunk size differs between variables in the dataset: it's either one quarter or one third of the lat/lon resolution, so (1, 1800, 900) or (1, 2400, 1200) for the ones I checked. Is there a good heuristic for choosing a dask chunk size for the whole dataset when the chunk size differs between variables?
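For reference, this is the kind of thing I'm doing to check the on-disk chunking and to pass chunks to xarray; the paths, variable and dimension names are placeholders, and I'm assuming the full grid is 7200 x 3600 (which would match the chunk sizes above being a quarter or a third of the resolution):

    import netCDF4
    import xarray as xr

    # Inspect the on-disk chunking of each variable in one input file
    # (placeholder path); chunking() returns the chunk length per dimension,
    # or 'contiguous' if the variable is not chunked.
    nc = netCDF4.Dataset('/path/to/one_input_file.nc')
    for name, var in nc.variables.items():
        print(name, var.chunking())
    nc.close()

    # Dask chunks are specified per dimension and therefore apply to every
    # variable, so I picked a common multiple of the on-disk chunks: one
    # full lat/lon slice per time step (placeholder dimension names).
    ds = xr.open_mfdataset('/path/to/sst_*.nc',
                           chunks={'time': 1, 'lat': 3600, 'lon': 7200})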
I also tried changing the split_every setting, but it didn't make a noticeable difference.
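Just to make sure we're talking about the same thing, here is a sketch of the kind of split_every usage I mean, applied to the reduction on the underlying dask array; the paths, variable and dimension names are placeholders, not my exact script:

    import dask.array as da
    import xarray as xr

    ds = xr.open_mfdataset('/path/to/sst_*.nc')   # placeholder path
    sst = ds['analysed_sst']                      # placeholder variable name

    # Reduce over the time axis with an explicit split_every; dask's
    # nanmean accepts the keyword directly.
    axis = sst.get_axis_num('time')
    mean_data = da.nanmean(sst.data, axis=axis, split_every=2)

    # Wrap the result back into a DataArray before writing it out
    # (placeholder dimension names).
    sst_mean_time = xr.DataArray(mean_data, dims=['lat', 'lon'],
                                 coords={'lat': sst['lat'], 'lon': sst['lon']})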
However, I then realized that I only had something like 9 GB of free disk space on that machine. Checking available disk space with 'df' didn't show it being eaten away, but I increased it to 50 GB of free space anyway. That solved the problem of the VM becoming very sluggish while running the script, and it got far enough to end in an error. On that note: is there a rule of thumb for how much free disk space one should have when working with xarray, depending on dataset size?
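For what it's worth, here is the rough size arithmetic I have in mind, again assuming the full grid is 7200 x 3600 (which would match the chunk sizes above) and float32 values; these are only back-of-the-envelope estimates:

    # Back-of-the-envelope size estimate (assumed grid: 7200 x 3600, float32).
    nlat, nlon = 3600, 7200
    bytes_per_value = 4                  # float32; use 8 for float64
    field_mb = nlat * nlon * bytes_per_value / 1e6
    print('one 2-D field:        ~%.0f MB' % field_mb)                # ~104 MB
    print('one year, daily data: ~%.0f GB' % (365 * field_mb / 1e3))  # ~38 GB

So the result of the time mean itself should be small; it's the intermediate data I'm unsure about.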
So then, the error I got was this:
Traceback (most recent call last):
File "daily_sbox.py", line 47, in <module>
sst_mean_time.to_netcdf('/home/ccitbx/Desktop/sst_mean.nc')
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/core/dataset.py", line 782, in to_netcdf
engine=engine, encoding=encoding)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/api.py", line 354, in to_netcdf
dataset.dump_to_store(store, sync=sync, encoding=encoding)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/core/dataset.py", line 730, in dump_to_store
store.sync()
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/netCDF4_.py", line 289, in sync
super(NetCDF4DataStore, self).sync()
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/common.py", line 192, in sync
self.writer.sync()
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/common.py", line 171, in sync
da.store(self.sources, self.targets)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/array/core.py", line 712, in store
Array._get(dsk, keys, **kwargs)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/base.py", line 43, in _get
return get(dsk2, keys, **kwargs)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/threaded.py", line 57, in get
**kwargs)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/async.py", line 481, in get_async
raise(remote_exception(res, tb))
dask.async.MemoryError:
Traceback
---------
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/async.py", line 264, in execute_task
result = _execute_task(task, data)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/async.py", line 246, in _execute_task
return func(*args2)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/array/reductions.py", line 212, in mean_chunk
total = sum(x, dtype=dtype, **kwargs)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/numpy/lib/nanfunctions.py", line 513, in nansum
return np.sum(a, axis=axis, dtype=dtype, out=out, keepdims=keepdims)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/numpy/core/fromnumeric.py", line 1835, in sum
out=out, keepdims=keepdims)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/numpy/core/_methods.py", line 32, in _sum
return umr_sum(a, axis, dtype, out, keepdims)
So, obviously a memory problem. I ran it again, this time monitoring with 'top'. Peak memory consumption of the process was 48% (of 4 GB). I again got a MemoryError, but with a slightly different stack trace:
Traceback (most recent call last):
File "daily_sbox.py", line 47, in <module>
sst_mean_time.to_netcdf('/home/ccitbx/Desktop/sst_mean.nc')
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/core/dataset.py", line 782, in to_netcdf
engine=engine, encoding=encoding)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/api.py", line 354, in to_netcdf
dataset.dump_to_store(store, sync=sync, encoding=encoding)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/core/dataset.py", line 730, in dump_to_store
store.sync()
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/netCDF4_.py", line 289, in sync
super(NetCDF4DataStore, self).sync()
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/common.py", line 192, in sync
self.writer.sync()
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/common.py", line 171, in sync
da.store(self.sources, self.targets)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/array/core.py", line 712, in store
Array._get(dsk, keys, **kwargs)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/base.py", line 43, in _get
return get(dsk2, keys, **kwargs)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/threaded.py", line 57, in get
**kwargs)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/async.py", line 481, in get_async
raise(remote_exception(res, tb))
dask.async.MemoryError:
Traceback
---------
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/async.py", line 264, in execute_task
result = _execute_task(task, data)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/async.py", line 246, in _execute_task
return func(*args2)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/array/reductions.py", line 214, in mean_chunk
dtype=[('total', total.dtype), ('n', n.dtype)])
So I then tried custom chunking again. The code ran considerably slower, consumed up to 80% of memory, and again ended in a memory error, with yet another stack trace:
Traceback (most recent call last):
File "daily_sbox.py", line 47, in <module>
sst_mean_time.to_netcdf('/home/ccitbx/Desktop/sst_mean.nc')
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/core/dataset.py", line 782, in to_netcdf
engine=engine, encoding=encoding)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/api.py", line 354, in to_netcdf
dataset.dump_to_store(store, sync=sync, encoding=encoding)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/core/dataset.py", line 730, in dump_to_store
store.sync()
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/netCDF4_.py", line 289, in sync
super(NetCDF4DataStore, self).sync()
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/common.py", line 192, in sync
self.writer.sync()
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/common.py", line 171, in sync
da.store(self.sources, self.targets)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/array/core.py", line 712, in store
Array._get(dsk, keys, **kwargs)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/base.py", line 42, in _get
dsk2 = cls._optimize(dsk, keys, **kwargs)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/array/optimization.py", line 24, in optimize
dsk2 = cull(dsk, keys)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/optimize.py", line 39, in cull
seen.update(nxt)
MemoryError
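For this custom-chunking run the error happens already in dask's graph optimization step (cull), so I suspect my chunks were too small and the task graph itself got huge. If it would help, I can report the number of tasks; I'd check it with something like this (placeholder variable name; drop the indexing if sst_mean_time is already a DataArray):

    # Rough check of how many tasks the dask graph contains; a graph with
    # millions of keys would at least be consistent with a MemoryError
    # inside cull().
    arr = sst_mean_time['analysed_sst'].data   # placeholder variable name
    print(len(arr.dask))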
The last thing I tried was going back to default chunking and setting split_every=1 again. It consumed up to 48% of memory and ended with this stack trace:
Traceback (most recent call last):
File "daily_sbox.py", line 47, in <module>
sst_mean_time.to_netcdf('/home/ccitbx/Desktop/sst_mean.nc')
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/core/dataset.py", line 782, in to_netcdf
engine=engine, encoding=encoding)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/api.py", line 354, in to_netcdf
dataset.dump_to_store(store, sync=sync, encoding=encoding)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/core/dataset.py", line 730, in dump_to_store
store.sync()
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/netCDF4_.py", line 289, in sync
super(NetCDF4DataStore, self).sync()
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/common.py", line 192, in sync
self.writer.sync()
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/common.py", line 171, in sync
da.store(self.sources, self.targets)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/array/core.py", line 712, in store
Array._get(dsk, keys, **kwargs)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/base.py", line 43, in _get
return get(dsk2, keys, **kwargs)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/threaded.py", line 57, in get
**kwargs)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/async.py", line 481, in get_async
raise(remote_exception(res, tb))
dask.async.MemoryError:
Traceback
---------
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/async.py", line 264, in execute_task
result = _execute_task(task, data)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/async.py", line 246, in _execute_task
return func(*args2)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/toolz/functoolz.py", line 381, in __call__
ret = f(ret)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/dask/array/reductions.py", line 221, in mean_combine
n = sum(pair['n'], **kwargs)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/numpy/lib/nanfunctions.py", line 513, in nansum
return np.sum(a, axis=axis, dtype=dtype, out=out, keepdims=keepdims)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/numpy/core/fromnumeric.py", line 1835, in sum
out=out, keepdims=keepdims)
File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/numpy/core/_methods.py", line 32, in _sum
return umr_sum(a, axis, dtype, out, keepdims)
Sorry for all the stack traces. At this point I don't really have any ideas left, so I would appreciate any comments you might have. This is also not just about this particular case: we're using xarray as the backend for a tool we're working on. It has worked well so far, but eventually we will have to deal with larger datasets, maybe ten years of monthly data or a few years of daily data, so I would really like to understand what's happening and how to solve it, as this won't be the only time we run into something like this.
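In the meantime, the workaround I'm considering is to accumulate the mean myself over small blocks of time steps, so that only a limited number of 2-D fields are in memory at once, and write only the final result. This is just a sketch with placeholder paths and names, not tested on the real data:

    import xarray as xr

    ds = xr.open_mfdataset('/path/to/sst_*.nc')   # placeholder path
    step = 10                                     # time steps per block (placeholder)

    # Accumulate the sum and the count of non-NaN values block by block,
    # then divide at the end to get the skip-NaN mean over time.
    total = None
    count = None
    for start in range(0, ds.dims['time'], step):
        block = ds.isel(time=slice(start, start + step)).load()
        block_sum = block.sum(dim='time')
        block_cnt = block.count(dim='time')
        total = block_sum if total is None else total + block_sum
        count = block_cnt if count is None else count + block_cnt

    sst_mean_time = total / count
    sst_mean_time.to_netcdf('/home/ccitbx/Desktop/sst_mean.nc')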
If anyone is interested in looking into this in more depth and has the time to spare, I've put the data and the script I was using here:
Thanks for the insights you've given already; xarray has a really nice community!
Best,
Janis