Very slow retrieval of values from Dataset


mcki...@ucar.edu

Aug 17, 2016, 7:04:45 PM
to xarray
I've recently started using xarray and am having one insurmountable problem.

After manipulating my data as much as possible within the xarray framework, I get to the point where I need to get the values out as a numpy array, e.g. vals = ds.varname.values. However, this command seems to hang, even when there are not many values in my dataset.

For example, I have a dataset X with a variable in it, va. The command X.va yields:

<xarray.DataArray 'va' (time: 1110)>
dask.array<elemwis..., shape=(1110,), dtype=float64, chunksize=(1,)>
Coordinates:
    latitude   float32 39.75
    longitude  float32 0.0
  * time       (time) datetime64[ns] 1979-06-01 1979-06-02 1979-06-03 ...

but when I try X.va.values .... well, nothing happens. 

I thought the issue might be the chunking within dask, so I tried rechunking such that all time was in one chunk, but I have the same problem.

<xarray.DataArray 'va' (time: 1110)>
dask.array<rechunk..., shape=(1110,), dtype=float64, chunksize=(1110,)>
Coordinates:
    latitude   float32 39.75
    longitude  float32 0.0
  * time       (time) datetime64[ns] 1979-06-01 1979-06-02 1979-06-03 ...

Has anyone else had this issue, or know how to fix it?

Thanks,
Karen

Stephan Hoyer

Aug 17, 2016, 7:13:08 PM
to xarray
There's definitely something going wrong with dask here. Debugging that requires understanding what your computation looks like. What did you do to make this array?

If your data isn't that big to begin with, you may have more success calling .load() on it immediately after loading it from disk.
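
A minimal sketch of the .load() suggestion, on a tiny synthetic dataset rather than real files (the dataset contents and the chunking here are made up for illustration). After .load(), the variable is backed by a plain numpy array, so later operations run eagerly instead of building a dask graph:

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for a dataset opened lazily from disk:
# one dask chunk per time step, as in the repr shown earlier in the thread.
ds = xr.Dataset({"va": ("time", np.arange(10.0))}).chunk({"time": 1})

# Pull everything into memory up front; only sensible if the data fits in RAM.
ds = ds.load()

# From here on the arithmetic is plain numpy, so .values returns immediately.
anomaly = ds["va"] - ds["va"].mean("time")
vals = anomaly.values
```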

--
You received this message because you are subscribed to the Google Groups "xarray" group.
To unsubscribe from this group and stop receiving emails from it, send an email to xarray+unsubscribe@googlegroups.com.
To post to this group, send email to xar...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/xarray/061a2642-8a2c-4384-bc29-4bbe0958ebbf%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Karen McKinnon

Aug 18, 2016, 12:17:46 PM
to xar...@googlegroups.com
The data is pretty big to begin with, which is why I was trying to avoid loading it until I'd done some reductions. I did just try calling .load() on the whole thing, but (perhaps unsurprisingly, since there are 4*365*37*241*480 values) that is also very slow. Here's what I'm doing to get to the array I'm trying to load:

# ERA-I v winds, 6 hr, 1979-2015, 241 lat, 480 lon
ds1 = xr.open_mfdataset(var1Files)

# Average to daily
ds1 = ds1.resample('D', dim = 'time', how = 'mean') # va

newNames = {'v': 'va'}
ds1.rename(newNames, inplace = True)

# pull out a single month across all years, and a single lat/lon point
timeIdx = ds1.time.to_index().month == monthUse
pointVal = ds1.sel(time = timeIdx, latitude = locNA[0], longitude = locNA[1], method = 'nearest') # va

# remove mean across time
X = pointVal - pointVal.mean('time') # remove temporal mean for covariance calculation

# get data (this is slooooow / never finishes)
vals = X.va.values
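
For scale, the value count quoted at the top of this message can be worked out directly. The 8 bytes per value is an assumption based on the float64 dtype in the repr earlier in the thread; the files on disk may well store a smaller type.

```python
# 4 time steps per day * 365 days * 37 years * 241 latitudes * 480 longitudes
n_values = 4 * 365 * 37 * 241 * 480

# Assumed storage size per value (float64); not confirmed for the files on disk.
bytes_per_value = 8

size_gib = n_values * bytes_per_value / 2**30
print(f"{n_values:,} values, about {size_gib:.1f} GiB as float64")
```

That is roughly 6.2 billion values, far more than is comfortable to load in one go on a typical workstation.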

--
Karen McKinnon
ASP post-doctoral fellow
National Center for Atmospheric Research
1850 Table Mesa Dr., Boulder, CO, 80305

Stephan Hoyer

Aug 18, 2016, 12:44:09 PM
to xarray, Blaze Dev - Public
+blaze-dev

Hi Karen,

You've managed to run into most of the major limitations of dask.array that impact xarray! Fortunately, with care all of these issues can be worked around.

Specific things to try (we should add similar guidelines to xarray's docs):
1. Do your spatial and temporal indexing with .sel() earlier in the pipeline, specifically before you resample. Resample triggers some computation on all the blocks. In theory that computation should commute with indexing, but we haven't implemented this optimization in dask yet.
2. Save the temporal mean to disk as a netCDF file (and then load it again with open_dataset) before subtracting it. Again, in theory dask should be able to do the computation in a streaming fashion, but in practice this is a failure case for the dask scheduler, because it tries to keep every chunk of an array that it computes in memory.
3. Specify smaller chunks across space when using open_mfdataset, e.g., chunks={'latitude': 10, 'longitude': 10}. This makes spatial subsetting easier, because selecting a point no longer pulls in large chunks that mostly cover other locations (probably not necessary if you follow suggestion 1).
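
As a rough illustration of the three suggestions above, here is a minimal sketch on synthetic data. It is not the actual pipeline from this thread: the grid sizes, the names month_use and loc, and the random data are made up, and it uses the current resample(time=...) spelling rather than the 2016 one. Two deliberate substitutions: the temporal mean is materialized with .load() instead of being written to netCDF and reopened (reasonable only because the reduced result is tiny), and the month is selected after resampling, since resampling a series with gaps would produce empty days.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic 6-hourly field standing in for the ERA-Interim 'v' winds (made-up sizes).
time = pd.date_range("1979-01-01", "1980-12-31 18:00", freq="6h")
lat = np.linspace(90.0, -90.0, 19)
lon = np.arange(0.0, 360.0, 20.0)
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {"v": (("time", "latitude", "longitude"),
           rng.standard_normal((time.size, lat.size, lon.size)))},
    coords={"time": time, "latitude": lat, "longitude": lon},
)

# Suggestion 3: small spatial chunks (with real files, pass chunks= to open_mfdataset).
ds = ds.chunk({"time": 400, "latitude": 10, "longitude": 10})
ds = ds.rename({"v": "va"})

month_use = 6        # hypothetical month of interest
loc = (39.75, 0.0)   # hypothetical (latitude, longitude) point

# Suggestion 1: pick the single grid point BEFORE resampling.
point = ds.sel(latitude=loc[0], longitude=loc[1], method="nearest")

# Daily means of a 1-D series are cheap; then keep only the chosen month.
daily = point.resample(time="1D").mean()
daily = daily.isel(time=(daily.time.dt.month == month_use).values)

# Suggestion 2 (swapped in): materialize the temporal mean before subtracting it.
clim = daily.mean("time").load()

X = daily - clim
vals = X["va"].values
```

With the point selected first, the graph behind vals only touches the chunks containing that grid point, instead of resampling the full field.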

Best,
Stephan


Karen McKinnon

Aug 18, 2016, 1:03:32 PM
to xar...@googlegroups.com, Blaze Dev - Public
Hi Stephan,

Thanks for the fast reply! All those workarounds seem to have fixed the problem.

Cheers,
Karen

Edward

Nov 10, 2016, 8:46:07 AM
to xarray, blaz...@continuum.io
Suggestion #2 here is less obvious, wizard advice and HIGHLY recommended. Can save a lot of time.