Towards an xarray-based, astropy-analogous package for atmos/ocean/climate/meteorology data analysis


Spencer Hill

Apr 10, 2016, 10:26:17 PM
to xarray
(For reference, this is the astropy package: http://astropy.readthedocs.org)

Continuing the discussion begun in this thread.

Quoting Julien Le Sommer in that thread: "so I guess the first question is probably to know whether we are focusing on physical oceanography only or also embrace the question of atmospheric data. From my standpoint there are practical specificities and ecosystem considerations that justify focusing on oceanography first"

By "practical specificities and ecosystem considerations", are you just saying that for you personally it would be better to start with ocean models (since that's what you work on), or are you saying that there are more general reasons for considering the atmosphere and ocean domains separately (and oceans first)?  If the latter, what do you have in mind?

I ask because I actually don't see any fundamental differences between the two domains that would force separate wholesale approaches.  Global coupled models are usually postprocessed such that the output data for both ocean and atmosphere (and land and sea ice) are very similar in format, metadata specifications, etc.

Thanks!

Ryan Abernathey

Apr 11, 2016, 9:49:41 AM
to xar...@googlegroups.com
Apologies for provoking this conversation then dropping out... ;)

I agree with Spencer that the scope should be ocean, atmosphere, and climate. I would not just focus on physical oceanography, because I think engaging the larger climate modeling community would bring lots of users, enthusiasm, and (possibly) funding. The upcoming release of CMIP6 will be a huge data challenge to many different groups of researchers.

How can we do this without taking on an impossibly ambitious and open-ended project? The example of astropy is useful here. I think the key is to make the collaborative project focused on a certain set of "core" services which are common to nearly all model-based research. These are the "hard" things to code which are currently heavily duplicated / fragmented in the existing ecosystem. I'm thinking about:
- scalable, out-of-core finite volume operations on common model grids (e.g. C-grid)
- scalable, out-of-core versions of common data analysis routines (e.g. multi-dimensional spectral analysis, EOFs, etc.)
- " " " interpolation / regridding
- units and constants? 
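As a toy illustration of the first item, here is a minimal numpy sketch of a finite-volume divergence on an Arakawa C-grid, where velocity components live on cell faces and the result lives at cell centers. This is a hypothetical example of my own, not code from any of the packages discussed here; a real out-of-core version would express the same stencil on dask/xarray objects.

```python
import numpy as np

def c_grid_divergence(u, v, dx, dy):
    """Finite-volume divergence on a 2D Arakawa C-grid.

    u is defined on west/east cell faces (shape ny, nx+1),
    v on south/north cell faces (shape ny+1, nx); the result
    lives at cell centers (shape ny, nx).
    """
    # Net flux through each cell, divided by the cell area.
    du = (u[:, 1:] - u[:, :-1]) * dy  # east minus west face transport
    dv = (v[1:, :] - v[:-1, :]) * dx  # north minus south face transport
    return (du + dv) / (dx * dy)
```

Because the operation is a local stencil, it maps naturally onto chunked arrays (e.g. via `dask.array.map_overlap`), which is what would make it scalable and out-of-core.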

On top of this, there could be many "affiliated packages" which build domain-specific projects on top of the core tools.

There are already many projects out there which purport to accomplish some or all of these goals (e.g. UV-CDAT, Iris, etc.). What would make our approach stand out is its first-class support for scalability via xarray and its focus on "difficult" types of analysis, rather than simply load / take mean / plot. (Also the fact that it arose organically via actual scientists rather than being mandated through some big agency initiative.)

-Ryan





--
You received this message because you are subscribed to the Google Groups "xarray" group.
To unsubscribe from this group and stop receiving emails from it, send an email to xarray+un...@googlegroups.com.
To post to this group, send email to xar...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/xarray/63222c0a-17de-4431-a74b-90a0cecb2c6d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Joe Hamman

Apr 11, 2016, 5:26:01 PM
to xarray
From my perspective, the climate community would benefit most from a set of general analysis tools. I would recommend against focusing too much on domain-specific applications (e.g. ocean or atmosphere). Here are some features I'd like to see that would likely be applicable across most of the climate data community:

- vector <--> raster transformations
- interpolation / regridding / spatial projections
- improved calendar support (hopefully we can push this further upstream to pandas or numpy)
- units support
- others (finite element, spectral analysis, EOFs, etc.) that are basically wrappers around existing tools that use xarray objects

These are things that haven't really fit within xarray up to this point but would be useful for many of us in the climate community.
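As a toy illustration of the units support mentioned above, here is a deliberately minimal sketch of unit propagation through arithmetic. This is a hypothetical example, not the API of any real package; in practice one would likely wrap an existing library such as pint or astropy.units around xarray objects rather than reinvent this.

```python
class Quantity:
    """Toy value-with-units container (illustrative only).

    Units are stored as a dict mapping base unit -> exponent,
    e.g. {"m": 1, "s": -1} for meters per second.
    """

    def __init__(self, value, unit):
        self.value = value
        self.unit = unit

    def __mul__(self, other):
        # Multiplication adds unit exponents, dropping any that cancel.
        unit = dict(self.unit)
        for u, p in other.unit.items():
            unit[u] = unit.get(u, 0) + p
        return Quantity(self.value * other.value,
                        {u: p for u, p in unit.items() if p})

    def __add__(self, other):
        # Addition requires identical units -- the check that catches bugs.
        if self.unit != other.unit:
            raise ValueError(f"incompatible units: {self.unit} vs {other.unit}")
        return Quantity(self.value + other.value, dict(self.unit))

speed = Quantity(10.0, {"m": 1, "s": -1})
time = Quantity(3.0, {"s": 1})
distance = speed * time  # units multiply: the "s" exponents cancel
```

The point is that carrying units through operations like these is exactly the sort of "core" service that is currently duplicated across many one-off scripts.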

I like the idea of domain specific packages that build on xarray and a more specific set of climate tools. I could see developing specific land, ocean, and atmosphere climate packages.

Joe

julien....@gmail.com

Apr 13, 2016, 11:58:08 AM
to xarray

Hi all,

(It took me a couple of days to see that you had initiated this new thread of discussion; sorry for being late here.)

I agree with the general perspective of a core set of tools that are generic enough and
not too specific to a particular field of "gridded geosciences". This will provide more
momentum, facilitate community uptake, and possibly help secure funding at some point.

I also agree with the idea of more specific packages that build on top of the core tools.
I think there are indeed needs that are specific to each field.

## general remarks

- **dependencies**: as a general rule, I would prefer that the packages have as few dependencies as possible. Optional dependencies could be allowed for specific functions but should not be a blocker for installation.

- **performance**: I think that performance on small (non-parallel) datasets should be as close as possible to standard numpy implementations to facilitate community uptake.
 
- **2D data**: we should not only focus on 3D model data, but also provide tools for 2D data (e.g. gridded satellite data). This is probably obvious to you, but it should be kept in mind when developing the actual tools.


## regarding the core tools

- having some tools for projecting 3D data onto 2D iso-surfaces defined from 3D fields would be very useful.

- we should try to provide the low-level finite difference operators so that users can reproduce the actual discretization used in the model they are analyzing. This aspect is key for closing budgets.
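To make the budget-closure point concrete, here is a minimal numpy sketch (hypothetical, written for this discussion) of a flux-form difference operator. Because interior face fluxes cancel pairwise when summed over the domain, the discrete integral of the divergence reduces exactly to the boundary fluxes, which is the property needed to close a budget with the model's own discretization.

```python
import numpy as np

def flux_divergence_1d(flux, dx):
    """Flux-form difference: divergence of a flux defined on cell faces.

    flux has n+1 face values; the result has n cell-center values.
    Interior faces cancel in the domain sum, so the discrete budget
    closes to machine precision.
    """
    return (flux[1:] - flux[:-1]) / dx

flux = np.array([1.0, 3.0, 2.0, 5.0, 4.0])  # 5 faces, 4 cells
dx = 0.5
div = flux_divergence_1d(flux, dx)
# The discrete budget closes exactly:
# sum(div) * dx == flux[-1] - flux[0]
```

A generic centered-difference routine would not have this telescoping property on a staggered model grid, which is why exposing the model's own operators matters.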

## regarding the physical oceanography layer

- providing tools that implement the common equations of state (EOS-80, TEOS-10) and all the derived quantities, ready for use with xarray, would be great.
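As a rough illustration of what an equation-of-state helper might look like, here is a linearized seawater density function. This is a deliberately simplified stand-in for EOS-80 / TEOS-10 (real implementations, e.g. the gsw package, are far more accurate); the coefficient values below are typical textbook numbers, not authoritative constants.

```python
def rho_linear(T, S, rho0=1025.0, T0=10.0, S0=35.0,
               alpha=2e-4, beta=7.6e-4):
    """Linearized equation of state for seawater density (kg/m^3).

    T: temperature (deg C), S: salinity (g/kg).
    alpha: thermal expansion coefficient (1/K),
    beta: haline contraction coefficient (1/(g/kg)).
    Works on scalars or numpy/xarray arrays via broadcasting.
    """
    return rho0 * (1.0 - alpha * (T - T0) + beta * (S - S0))
```

Because the function is just elementwise arithmetic, it would apply unchanged to xarray DataArrays and dask-backed arrays.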

my 2 cents.

--
Julien




Spencer Hill

Apr 18, 2016, 5:24:21 PM
to xarray
Hi all,

Julien, sorry for not posting about the new thread in your original one!  And sorry for the delay in this message overall; I was traveling this past week.

Ryan, I fully agree with your comment about leveraging the unique capabilities of xarray in order to not just replicate what Iris and UV-CDAT already do.  And it seems that the dask/out-of-core functionality is your candidate.  This seems reasonable to me too, although admittedly I have only cursory experience with dask (the cluster at GFDL has, depending on the node, up to 512 GB RAM :)).

Would this mean that we would be committing to dask from the outset?  I ask because xarray has deliberately kept it as an optional dependency, and it does not seem fully mature (again, my knowledge is limited).  I'm not necessarily opposed to this, just wanted to clarify.

Looking at the three "scalable, out-of-core..." services Ryan lists, they seem largely independent of one another.  So is what you're after an interface/data structure that facilitates performing each of them on, say, netCDF data (or xarray.Datasets more generally)?

Regarding units, which Ryan, Joe, and Julien all brought up: both astropy and Iris have units support.  I'm not familiar with the internals for either case, but surely they can be leveraged.  Would anybody experienced with either care to chime in?  And is this orthogonal to the above dask-related issues?

My two cents for now.  Thanks!

Best,
Spencer


Daniel Rothenberg

Apr 29, 2016, 12:27:36 PM
to xarray
Hello everyone,

I'd enthusiastically support this effort. Somewhere in one of my notes files I've got a list of "killer features" that an "aospy" (stealing your package name, Spencer!) would need in order to differentiate it from existing systems and attract new users.

I think there's critical mass to get some sort of live chat going about this. Perhaps a Google Hangout with interested people, or we could set up a Slack channel to discuss ideas live? If we can condense a vision of what an "aospy" package would look like out of such a meeting, it would be a fantastic starting point for a grant proposal or some other way of securing support for this project.

On the plus side, I'm anticipating a few months of "open" time between finishing my dissertation in the summer and starting a post-doc, and I'd be thrilled to coordinate and contribute a significant amount of time to this project in the fall.

- Daniel Rothenberg

Ravi Shekhar

Apr 29, 2016, 1:50:30 PM
to xarray
Hi everyone, 

I would also enthusiastically support this effort to create an aospy project. I feel that I've already duplicated a lot of this functionality over the past few years in order to keep my analysis toolkit mostly in Python. I'd like to be part of any organizational meetings about this.

I think we should start with a small, highly useful piece of code so the project can gain some momentum before trying to encompass everything under the sun; it's really easy to come up with a wish list of features that leads to a very large and unmaintainable codebase. I think `scikit-learn` is a good model to emulate here: while that library is large, the developers are very selective about what new features they add. To that effect, starting with the following basic features might be good. Once the package is usable, more functionality can be added.

-- Out of core from the outset (transparently work on numpy or dask arrays)
-- Regridding / Interpolating (I actually really like the approach Iris takes here)
-- Finite-difference differentiation with equal and unequal spacing (provide a default implementation, allow custom ones via a stencil)
-- Current xarray functionality

I have two small pieces of code that I can contribute if there aren't better implementations already floating around out there:
-- Finite difference derivative in one dimension of an N-dimensional array with unequal spacing. Pure Python and numpy; it should translate well to dask.
-- Optimized Cython code to linearly interpolate or regrid 2D or 3D data onto iso-surface(s). I've used it for isobars and iso-surfaces of potential temperature, but it should be quite general.
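For readers unfamiliar with the second item, here is a pure-numpy sketch (my own illustrative stand-in, not the optimized Cython code mentioned above) of linearly interpolating a 3D field onto an iso-surface of another 3D field, e.g. potential temperature onto an isobar.

```python
import numpy as np

def interp_to_isosurface(field, data, target):
    """Linearly interpolate `data` onto the surface where field == target.

    field, data: arrays of shape (nz, ny, nx). `field` is assumed
    monotonically increasing along axis 0 in every column (e.g.
    pressure increasing with depth). Returns an (ny, nx) array,
    NaN wherever the target value is not bracketed.
    """
    nz = field.shape[0]
    out = np.full(field.shape[1:], np.nan)
    for k in range(nz - 1):
        lo, hi = field[k], field[k + 1]
        # Columns where this level pair brackets the target (first hit wins).
        hit = (lo <= target) & (target < hi) & np.isnan(out)
        denom = np.where(hi != lo, hi - lo, np.inf)  # avoid divide-by-zero
        w = (target - lo) / denom
        out = np.where(hit, data[k] * (1 - w) + data[k + 1] * w, out)
    return out
```

The loop is over vertical levels only, so the cost scales with nz rather than the full grid size; a compiled version like the Cython routine described above would mainly remove the temporary arrays.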

Cheers,
Ravi

Ryan Abernathey

May 4, 2016, 9:48:00 AM
to xar...@googlegroups.com
Dear Colleagues,

I am excited by all the enthusiasm for this idea.

I have some funding from the Sloan Foundation to organize a small workshop / hackathon / sprint in the fall. I feel like sitting down together in person for a few days would really accelerate this idea and help transform our disparate individual packages into a coherent collaborative project. Let me know if you are interested in coming to New York during September or October and I can start organizing this. (I'm talking especially to Spencer, Julien, Joe, Daniel, and Ravi, plus anyone else on the list who would like to contribute.)

In the meantime, I think we should move this discussion off the xarray list and onto some other collaboration platform. Does anyone have any suggestions for that? Should we just start a new repo and start making PRs? Write some documentation and tests to start defining an API? Or perhaps use hackpads (or equivalent) to continue to brainstorm ideas? Input from the more experienced developers on how to get organized would be appreciated! ;)

Cheers,
Ryan


Daniel Rothenberg

May 4, 2016, 10:39:32 AM
to xarray
I'd definitely be interested in an in-person coding sprint in that timeframe. I have a dissertation to defend around the same time, but I'm sure I can make things work.

Before we actually begin any coding or design, some brainstorming would be a very good idea. At a minimum, we need to define who our potential users are and what sort of work they are performing. It would also be worth considering what sort of machines they use for their work: if we wish to support workflows on the larger supercomputers used in the field, for instance, then we have to make far more conservative assumptions about dependencies than if we're targeting users who can "conda update" their local setup each morning.

Toward that end, a hackpad is a great idea. I took the liberty of setting one up and recorded some of the ideas/discussion here.

- Daniel

Matthew Rocklin

May 4, 2016, 10:41:51 AM
to xarray
You all might consider a couple of online meetings using something like Skype or Google Hangouts as a precursor.

Wolfram Jr., Phillip

May 4, 2016, 5:56:07 PM
to xar...@googlegroups.com
Agreed.  I'm interested too, although I'm not 100% sure what my contribution could be at the moment.

Phil
--------------------------------------------------
Phillip J. Wolfram, Ph.D.
Climate, Ocean and Sea Ice Modeling
T-3 Fluid Dynamics and Structural Mechanics
Los Alamos National Laboratory
Phone:  (505) 667-3518
Email:   pwol...@lanl.gov

Spencer Hill

May 4, 2016, 11:14:00 PM
to xarray
Hi all,

Dan and Ravi, great to have you both contributing here!

Count me in for all of it -- this is really exciting.  I'm happy to migrate to the hackpad Dan set up.

No doubt Spencer Clark, a colleague of mine and the other developer on my aospy package (which, BTW, can change names if that's a better fit for the project currently under discussion!), will want to be a part of this and will be a great asset.

Despite the migration of the discussion elsewhere, I echo Ryan's plea for advice from more experienced developers on getting this off to an optimal start.  That would be really useful!

Best,
Spencer Hill

Spencer Clark

May 5, 2016, 7:58:38 AM
to xar...@googlegroups.com
Yes, I would be excited to be a part of this as well!  Pending my qualifying exams at the end of this month, I won't have any other major deadlines in the near future.  Like Phil, I'm not 100% sure what my contribution would be at the moment, but post-exams I should be able to give this some deeper thought.

Spencer Clark

julien....@gmail.com

May 9, 2016, 3:36:14 AM
to xarray


Hi all,

I'd be very keen to participate in an in-person code sprint over a few days in Sept/Oct as well. Keep me informed.

Also, because I want to prepare some tools for two PhD students who start in September, I have iterated a bit on the library I initially had in mind.

This is all very preliminary and I am happy to switch to our common library when we get started.

see : https://github.com/lesommer/oocgcm

I think this can provide some ideas as to how we might organize our library.

cheers
--
J.

Ryan Abernathey

May 10, 2016, 4:43:01 PM
to xar...@googlegroups.com
Julien,

This is very cool! I really like your proposed layout for the package.

I am partial to xgcm because I have already written a good amount of code and set up several painful ingredients (e.g. travis-ci integration, codecov, and readthedocs).

I would have no problem merging with oocgcm in the future once similar infrastructure exists.

In the meantime, I will try to contribute some ideas to the hackpad.

-Ryan

julien....@gmail.com

May 12, 2016, 5:01:14 PM
to xarray


On Tuesday, May 10, 2016 at 10:43:01 PM UTC+2, Ryan Abernathey wrote:
Julien,

This is very cool! I really like your proposed layout for the package.

I am partial to xgcm because I have already written a good amount of code and set up several painful ingredients (e.g. travis-ci integration, codecov, and readthedocs) 

I would have no problem merging with oocgcm in the future once similar infrastructure exists.

 - [x] travis-ci integration
 - [x] codecov
 - [x] readthedocs

https://github.com/lesommer/oocgcm
http://oocgcm.rtfd.io/

;-)
 

Ryan Abernathey

May 13, 2016, 10:43:39 AM
to xar...@googlegroups.com
Julien,

I am impressed by your rapid progress.

But now, even more so, we are really developing two (or more) duplicate packages, rather than a single collaborative effort.

Let's talk in Pasadena about how to genuinely collaborate and not waste our effort on duplicate code.

-Ryan

julien....@gmail.com

May 13, 2016, 11:11:13 AM
to xarray


I agree; we should coordinate our efforts as much as possible.
Let us talk about that in Pasadena. I will arrive there on June 11
and leave on June 16.

cheers
--
J.

Paul

Oct 9, 2017, 9:15:01 PM
to xarray
Hi everyone,

Just wondering: was there a conclusion to this thread of discussion? Is there a new combined effort taking place that we should all jump on?

Cheers,
Paul

Paul

Oct 9, 2017, 9:32:58 PM
to xarray
I think I can answer my own question: everyone should check out

https://pangeo-data.github.io/

Spencer Hill

Oct 9, 2017, 10:23:56 PM
to xarray
Yes, definitely!