Reading files in parallel with open

SANJIV

unread,

Jul 27, 2021, 10:30:10 PM7/27/21

to xarray

Hi all,

I am trying to read in a bunch of monthly files from the POP ocean model. There are 1212 files in all. The command I am using is:

files = xarray.open_mfdataset(fpath, parallel=True,chunks='auto')

I am finding xarray is reading in the files in a weird way. I made a short movie at this link showing the dask dashboard:

https://www.dropbox.com/s/w3fbmtbd4sfn6od/xarray_parallel_read.mkv?dl=0

The tasks first appear to get processed in parallel and then for some reason (0:22 in the video), the process appears to start all over again but in serial. This then takes a very long time to execute although eventually it does. I don't get why is this happening. Has anyone else experienced this issue?

I start the dask client as follows:

from distributed import Client
client=Client(n_workers=10,memory_limit='26GB')

I tried removing the 'chunks' option but still have the same problem.

Thanks,

Sanjiv

Maximilian Roos

unread,

Jul 28, 2021, 9:56:50 PM7/28/21

to xarray

Hi Sanjiv,

Thanks for the video. That does look surprising.

I'm not the expert here, but it might be worth trying with "join='exact'" to confirm whether the files are aligned. I've had some performance issues with unaligned indexes before.

FWIW we get a bit more traffic on https://github.com/pydata/xarray/discussions, so you may get more responses there. One issue with these dask-xarray problems on large datasets is they can be difficult to reproduce. It may also be worth seeing whether the same behavior occurs with a smaller number of files, or seeing whether attempting to write a version of the data and then read that version makes a difference.

And if others have other ideas for debugging in these sorts of cases, please add on...

Thanks,

Max

Deepak Cherian

unread,

Jul 29, 2021, 12:06:33 PM7/29/21

to xar...@googlegroups.com

Yes, it's checking various coordinate variables for equality, that's why you see the loads.

See https://xarray.pydata.org/en/stable/user-guide/io.html#reading-multi-file-datasets for solutions

Deepak

--
You received this message because you are subscribed to the Google Groups "xarray" group.
To unsubscribe from this group and stop receiving emails from it, send an email to xarray+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/xarray/acca8542-758b-4b45-a73a-f098c98bbed0n%40googlegroups.com.

SANJIV

unread,

Jul 29, 2021, 2:05:10 PM7/29/21

to xarray

Thanks for your responses, they was helpful! It appears adding compat='override' is crucial in my case. Without that option, the runtime for the open_mfdataset step increases from around 10 sec to a few minutes.

Reading files in parallel with open_mfdataset

SANJIV

Maximilian Roos

Deepak Cherian

SANJIV