xarray + dask + cloud storage


Ryan Abernathey

Jun 8, 2017, 10:38:58 PM
to xar...@googlegroups.com
Hi Xarrayers,

We are considering moving some analysis tasks to AWS / Google Cloud. I’m curious if anyone can share their experience using xarray in a cloud computing context.

The scenario would be:
- python + xarray running on a cloud instance (e.g. EC2)
- lots of netcdf data files living in cloud storage (e.g. S3)
- maybe a dask.distributed cluster running in the background (but that is an additional complication and perhaps a distraction from my main question)

Dask itself has pretty substantial support for accessing various remote data services: you can just pass a URL like "s3://bucket/path" to any of the data ingestion functions.
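
For example, a minimal sketch (the bucket and key names are made up, and it assumes the s3fs package is installed):

    import dask.dataframe as dd

    # dask dispatches "s3://" URLs to s3fs under the hood
    df = dd.read_csv("s3://my-bucket/data/*.csv",
                     storage_options={"anon": True})  # anonymous access for a public bucket
    print(df.head())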

My understanding is that the underlying netCDF4 library doesn't support reading from generic file-like objects, as discussed here:
Although that might be obsolete given recent developments in netcdf4-python:
But even with this fix, I think the target files have to be fully loaded into memory, which might negate many of the potential performance gains.

A different storage backend such as zarr would probably be much better in this context, but unfortunately most of us already have data in netCDF format.

I expect someone on this list has thought about these problems and how they might be overcome. Please let me know your thoughts.

Cheers,
Ryan Abernathey

Stephan Hoyer

Jun 9, 2017, 1:08:50 AM
to xarray
When I tried this a few years ago on AWS, S3 access on a single node was annoyingly slow (something like 50-100 MB/s) compared to disk access, and I didn't pursue it much further. This was without dask-distributed -- using multiple machines would definitely scale better.

If you use netCDF3 files, you can use engine='scipy' to read and write file-like objects on S3 or another distributed file-system.
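
Roughly, a sketch with placeholder bucket/key names (assuming the s3fs package is available):

    import s3fs
    import xarray as xr

    fs = s3fs.S3FileSystem(anon=True)                    # anonymous access for a public bucket
    with fs.open("my-bucket/path/file.nc", "rb") as f:   # f is a file-like object
        ds = xr.open_dataset(f, engine="scipy")          # the scipy backend accepts file objects
        print(ds)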

For netCDF4 files, you can read/write by making copies in a temporary directory on local disk. I've done this sort of thing using dask.delayed and dask.array.from_delayed to construct dask arrays manually.
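
Something along these lines (a rough sketch; the bucket, file names, variable name, shape, and dtype are all invented for illustration):

    import os
    import tempfile

    import dask
    import dask.array as da
    import netCDF4
    import numpy as np
    import s3fs

    fs = s3fs.S3FileSystem(anon=True)

    @dask.delayed
    def load_var(key, varname):
        # copy the netCDF4 file from S3 to local disk, read one variable, clean up
        tmp = tempfile.NamedTemporaryFile(suffix=".nc", delete=False)
        tmp.close()
        fs.get(key, tmp.name)
        with netCDF4.Dataset(tmp.name) as nc:
            data = np.asarray(nc.variables[varname][:])
        os.remove(tmp.name)
        return data

    # one delayed chunk per file; shape and dtype have to be known up front
    keys = ["my-bucket/data/file_%03d.nc" % i for i in range(10)]
    chunks = [da.from_delayed(load_var(k, "temperature"),
                              shape=(24, 180, 360), dtype="f4")
              for k in keys]
    arr = da.concatenate(chunks, axis=0)   # a (240, 180, 360) dask array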


Matthew Rocklin

Jun 9, 2017, 6:07:19 AM
to xarray
For what it's worth, I've had good experiences recently on Google's infrastructure (for which Dask also has decent support). I've enjoyed both the cluster setup using Google Container Engine (with dask-kubernetes) and the nicer network performance. I've moved most of my experimentation and demonstration system from AWS to Google and am pretty happy with the shift.

Ryan Abernathey

Jun 9, 2017, 12:01:17 PM
to xar...@googlegroups.com
Matt, are your container configurations shareable?

Matthew Rocklin

Jun 9, 2017, 1:51:46 PM
to xarray, Martin Durant

Matthew Rocklin

Jun 9, 2017, 2:16:26 PM
to Martin Durant, xarray
Martin sent the following message, but it got bounced from the list.

<begin Martin's message>

Tangentially, I do think it would be worthwhile to bring zarr completely in line with what xarray expects, i.e., a netCDF-like interface, not too far from the "groups" interface already in existence. It's something I've spoken about before (as some people on this thread know) and made some rough proof-of-concept code for. Of course, for data already in netCDF, there's probably no appetite.

On dask-kubernetes, please do use it and give feedback; there is certainly scope for improvements. The library gcsfs (http://gcsfs.readthedocs.io/en/latest/), included in the image, provides a very similar experience to s3fs (in dask, with URLs like 'gcs://mybucket/mypath') and should perform better on Google hardware.
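
For example (a small sketch; the project, bucket, and path are invented):

    import gcsfs

    fs = gcsfs.GCSFileSystem(project="my-project", token="anon")  # anonymous access to public data
    print(fs.ls("mybucket"))                    # list objects, just like s3fs
    with fs.open("mybucket/mypath/data.csv") as f:
        print(f.readline())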
 --
Dr Martin Durant
Software Engineer
Continuum Analytics | http://continuum.io

John Readey

Jul 21, 2017, 12:04:58 PM
to xarray, mdu...@continuum.io
Hi,

   Re: moving data access to the cloud. I gave a talk last week at SciPy about HDF Data Access on AWS. You can check out the slides here: http://s3.amazonaws.com/hdfgroup/docs/hdf_data_services_scipy2017.pptx, and the talk itself is on YouTube: https://www.youtube.com/watch?v=EmnCz1Hg-VM.


   The HDF Server discussed in the talk gets around the slow S3 throughput problem mentioned by Stephan by utilizing a cluster of nodes that can read/write to S3 in parallel.


    For xarray, I think it's unfortunately not all there yet. We have netCDF support planned but not yet implemented; hopefully I'll get this in over the next 3 months or so.


    We have a development server running on AWS now with ~10TB of data loaded.  If anyone would like to play around with it, please let me know.
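
If it helps, basic access from Python looks roughly like this (a sketch; the endpoint, domain, and dataset names below are placeholders, and it assumes the h5pyd package is installed):

    import h5pyd

    # h5pyd mirrors the h5py API but talks to HDF Server over HTTP
    f = h5pyd.File("hdfgroup.sample.data", "r", endpoint="http://example-server:5000")
    dset = f["temperature"]
    print(dset.shape, dset.dtype)
    print(dset[0, :10])   # each slice is fetched from the server on demand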


John Readey

Sr. Architect

The HDF Group

Stephan Hoyer

Jul 21, 2017, 12:59:30 PM
to xarray, Martin Durant
Hi John,

Thanks for sharing! It's exciting to see more interest in distributed array stores.

If you've already written an h5py-compatible interface, one idea would be to wrap it in a netCDF-compatible interface for xarray via my h5netcdf library (https://github.com/shoyer/h5netcdf). That would require a bit of refactoring, but I think it would be pretty easy.
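
For reference, this is roughly how the h5netcdf engine is reached from xarray today with a local file (the filename is a placeholder); the idea would be to put h5pyd underneath the same interface:

    import xarray as xr

    # the h5netcdf backend reads netCDF4/HDF5 files through h5py
    ds = xr.open_dataset("example.nc", engine="h5netcdf")
    print(ds)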

Cheers,
Stephan

John Readey

Jul 21, 2017, 1:30:30 PM
to xarray, mdu...@continuum.io
Thanks Stephan.  
I'll take a look at h5netcdf.  Rich Signell has already created an issue requesting h5netcdf support: https://github.com/HDFGroup/h5pyd/issues/30.  

John

Ryan Abernathey

Jul 21, 2017, 2:19:47 PM
to xar...@googlegroups.com
This sounds like a very promising path forward to a truly scalable data store for xarray!

Are there any plans to eventually support Google Cloud storage?

-Ryan


John Readey

Jul 21, 2017, 2:28:48 PM
to xarray
Google Cloud Storage supports the S3 API, so my naive thought would be that it would just work. :)

John

Ryan Abernathey

Jul 21, 2017, 3:27:22 PM
to xar...@googlegroups.com
John,

I just watched your full talk, and I am now extremely excited about h5pyd. It could be a game changer!

I think the h5pyd --> h5netcdf --> xarray stack could be the killer app for big data climate science.

I am still trying to figure out where dask fits in at this point. It would be cool to propagate the h5pyd chunks through so that each would correspond to a dask chunk within an xarray dataset. With a dask.distributed cluster running in the same datacenter as the h5pyd store, I am fantasizing about blazingly fast throughput on PB scale datasets.
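
In xarray terms, something like this (a loose sketch; the filename, variable, and chunk sizes are invented):

    import xarray as xr

    # ask xarray/dask for chunks that line up with the storage chunking
    ds = xr.open_dataset("example.nc", engine="h5netcdf",
                         chunks={"time": 24, "lat": 180, "lon": 360})
    print(ds["temperature"].data)   # a dask array with the requested chunking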

-Ryan



John Readey

Jul 21, 2017, 5:25:03 PM
to xarray
Hey Ryan,
   
    I had some issues with using one dask chunk per h5pyd chunk for larger datasets. Dask would flood the server with thousands of requests. Optimal (I think) would be to have each Dask worker iterate through its set of chunks serially. Is there a way to specify that?

    With dask.distributed and enough nodes on the server side, it should be possible to rip through those PB-scale datasets. NREL is expanding their wind dataset on AWS to 50 TB, so that will be a start.

   John

Matthew Rocklin

Jul 21, 2017, 7:54:33 PM
to xarray
By default, Dask uses a thread pool with as many threads as you have logical cores, so it will only have that many concurrent requests at once. All Dask is doing is calling dset[1000:2000, 5000:6000] or something similar. If you'd like to limit access, you can create a dask array with da.from_array(dset, lock=True), which will use a lock to limit concurrent access to your object.
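
In code, that looks roughly like this (a sketch; the file path, dataset name, and chunk sizes are placeholders):

    import h5py
    import dask.array as da

    f = h5py.File("example.h5", "r")            # or an h5pyd.File pointing at a server
    dset = f["temperature"]
    arr = da.from_array(dset, chunks=(1000, 1000), lock=True)  # the lock serializes access to dset
    print(arr[1000:2000, 5000:6000].mean().compute())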


stanisla...@petacube.com

Jul 25, 2017, 5:01:55 PM
to xarray
I worked on this problem for a while. I tried ingesting netCDF into SciDB, backed by distributed local storage, and then expressing calculations using SciDB's array query language; it works well. I am now trying an alternative strategy combining TileDB with xarray, but neither of these technologies is nearly as mature as SciDB (both still require software development).

Stan  

stanisla...@petacube.com

Jul 25, 2017, 5:11:29 PM
to xarray
Honestly, I don't think it matters whether it's AWS, Google, or Azure. You can get dedicated machines and high-performance network pipes in all of these environments if you need more hardware performance.

I think what matters is the basic efficiency of the storage software you are using, and netCDF is not efficient if you are trying to slice and dice data across the boundaries of multiple files; xarray helps alleviate the issue somewhat.
S3 is pretty cheap, but if you are concerned about filesystem performance I would use one of the parallel file systems AWS offers instead of S3.

Stanislav Se

Jul 27, 2017, 11:39:55 PM
to xarray
Ryan, this paper, which is being presented at VLDB 2017 next week, is probably helpful (dask/xarray is compared to a bunch of other things): https://arxiv.org/pdf/1612.02485.pdf

Daniel Rothenberg

Jul 28, 2017, 11:06:27 AM
to xarray
Stanislav, also see Jake VanderPlas' talk at SciPy 2017 on the work summarized in that preprint.

Stanislav Seltser

Jul 30, 2017, 11:19:30 AM
to xar...@googlegroups.com
Daniel,
Yes, the video is pretty good; thank you for sharing it.

Here is the GitHub repo for the code:


Stanislav