Use xarray as a lower-level lib ???


Chris Barker

Oct 23, 2023, 8:03:30 PM
to xar...@googlegroups.com
I'm running into a LOT of frustration trying to use xarray for datasets that it, shall we say, wasn't designed for.

In particular, results from the FVCOM triangular mesh ocean model. But it's not just that.

IIRC, xarray (maybe when it was called xray) was essentially an implementation of the netcdf Data model. Since then, it's grown to do a lot of nifty stuff for you, with fancy indexing features, etc. But as a result, if you try to load a file it doesn't understand, it raises an Error and you are dead in the water.

See: https://github.com/pydata/xarray/issues/2233 as a good example -- that issue has been closed, and presumably resolved, but I can't open an FVCOM file right now, with another error (more on that later).

If you think of xarray as a data analysis tool, this makes sense -- try to load the file, if xarray doesn't understand it, then fix your file, and reload it -- all good.

However, if you want to use xarray as a lower-level library, you might want to use it TO FIX the issues it doesn't understand, and if you can't even load the file, you're dead.

I guess what I'm looking for is a "load-all-the-arrays-even-if-you-can't-figure-out-all-the-coordinates" mode.

Does such a mode exist?

-CHB

The problem at hand:

This file:

Is results from an operational FVCOM model.

When I try to load it with xarray, I get:

File ~/miniforge3/envs/gnome/lib/python3.11/site-packages/xarray/core/dataset.py:696, in Dataset.__init__(self, data_vars, coords, attrs)
    693 if isinstance(coords, Dataset):
    694     coords = coords._variables
--> 696 variables, coord_names, dims, indexes, _ = merge_data_and_coords(
    697     data_vars, coords
    698 )
    700 self._attrs = dict(attrs) if attrs is not None else None
    701 self._close = None

File ~/miniforge3/envs/gnome/lib/python3.11/site-packages/xarray/core/dataset.py:421, in merge_data_and_coords(data_vars, coords)
    419     coords = coords.copy()
    420 else:
--> 421     coords = create_coords_with_default_indexes(coords, data_vars)
    423 # exclude coords from alignment (all variables in a Coordinates object should
    424 # already be aligned together) and use coordinates' indexes to align data_vars
    425 return merge_core(
    426     [data_vars, coords],
    427     compat="broadcast_equals",
   (...)
    432     skip_align_args=[1],
    433 )


File ~/miniforge3/envs/gnome/lib/python3.11/site-packages/xarray/core/coordinates.py:957, in create_coords_with_default_indexes(coords, data_vars)
    955 all_variables = dict(coords)
    956 if data_vars is not None:
--> 957     all_variables.update(data_vars)
    959 indexes: dict[Hashable, Index] = {}
    960 variables: dict[Hashable, Variable] = {}

ValueError: dictionary update sequence element #0 has length 1; 2 is required


Clearly, it's not getting what it expects for coordinates, but I'd really like it to give me a warning and move on, rather than erroring out :-)

-CHB

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris....@noaa.gov

george trojan

Oct 23, 2023, 8:44:41 PM
to xar...@googlegroups.com
I had to disable decoding times to open this file:

import xarray as xr

file = '/home/george/Downloads/nos.lsofs.fields.f002.20231003.t18z.nc'
ds = xr.open_dataset(file, use_cftime=False, decode_coords=False, decode_times=False)
print(ds)

<xarray.Dataset>
Dimensions:             (nele: 174015, node: 90964, siglay: 20, siglev: 21,
                         three: 3, time: 1, maxnode: 11, maxelem: 9, four: 4)
Coordinates:
    siglay              (siglay, node) float32 ...
    siglev              (siglev, node) float32 ...
  * time                (time) float32 2.102e+03
Dimensions without coordinates: nele, node, three, maxnode, maxelem, four
Data variables: (12/53)
    nprocs              int32 ...
    partition           (nele) int32 ...
    x                   (node) float32 ...
    y                   (node) float32 ...
    lon                 (node) float32 ...
    lat                 (node) float32 ...
    ...                  ...
    vwind_speed         (time, nele) float32 ...
    wet_nodes           (time, node) int32 ...
    wet_cells           (time, nele) int32 ...
    wet_nodes_prev_int  (time, node) int32 ...
    wet_cells_prev_int  (time, nele) int32 ...
    wet_cells_prev_ext  (time, nele) int32 ...
Attributes: (12/14)
    title:                       LSOFS
    institution:                 School for Marine Science and Technology
    source:                      FVCOM_4.3
    history:                     model started at: 03/10/2023   20:17
    references:                  http://fvcom.smast.umassd.edu, http://codfis...
    Conventions:                 CF-1.0
    ...                          ...
    Tidal_Forcing:               TIDAL ELEVATION FORCING IS OFF!
    River_Forcing:               THERE ARE 20 RIVERS IN THIS MODEL.\nRIVER IN...
    GroundWater_Forcing:         GROUND WATER FORCING IS OFF!
    Surface_Heat_Forcing:        FVCOM variable surface heat forcing file:\nF...
    Surface_Wind_Forcing:        FVCOM variable surface Wind forcing:\nFILE N...
    Surface_PrecipEvap_Forcing:  FVCOM periodic surface precip forcing:\nFILE...


Benoît Bovy

Oct 24, 2023, 4:07:59 AM
to xar...@googlegroups.com
The error looks like either a bug or something that should have been
raised or warned with a clearer message before that point.

AFAIK there's no single option to disable all decoding when opening a
dataset, but I guess that disabling every "decode_*" argument in
open_dataset should be close to "raw" data loading from the file? Maybe
we could expose an option to use with "xarray.set_options()" for
convenience? Or expose an alternative function (e.g.,
https://github.com/pydata/xarray/discussions/8080#discussioncomment-7367079).

Chris, I'm not sure your point is best illustrated by the file example
you give. Xarray hasn't (never?) been able to open such file before
because it implemented a slightly constrained version of the netcdf data
model (to allow dimension coordinates have an index). As a result of the
fancy features recently added in Xarray, now in theory it should be able
to open it :-).

I understand your frustration, though. Do you have other examples that
Xarray cannot load? If yes, could you add them in this issue, please?
https://github.com/pydata/xarray/issues/2368

Benoît

Deepak Cherian

Oct 24, 2023, 4:52:30 PM
to xar...@googlegroups.com
> AFAIK there's no single option to disable all decoding when opening a
dataset,

decode_cf=False turns everything but automatic index creation off.

> Xarray hasn't (never?) been able to open such file before
because it implemented a slightly constrained version of the netcdf data
model (to allow dimension coordinates have an index).

+1, this wouldn't have helped much earlier :)

Here's the time decoding error on Xarray v2023.10.1
Failed to decode variable 'Itime2': unable to decode time units 'msec since 00:00:00' with 'the default calendar'. 
Try opening your dataset with decode_times=False or installing cftime if it is not installed.
It does suggest the solution ;)

What version are you running, Chris?

Deepak


Chris Barker

Oct 24, 2023, 6:01:43 PM
to xar...@googlegroups.com
Thanks all,

On Mon, Oct 23, 2023 at 5:44 PM george trojan <george...@gmail.com> wrote:
I had to disable decoding times to open this file:

import xarray as xr

file = '/home/george/Downloads/nos.lsofs.fields.f002.20231003.t18z.nc'
ds = xr.open_dataset(file, use_cftime=False, decode_coords=False, decode_times=False)
print(ds)

Thanks - that's not much of a surprise -- they are doing very odd (wrong) things with time in this one :-(

Now I need to look and see how to "fix" the time so that I can get a time-aware Dataset. There actually IS a valid time variable in there, but there are also invalid ones, so it's messy.

> The error looks like either a bug or something that should have been
raised or warned with a clearer message before that point

That was my fault, I wasn't paying attention, and passed the filename directly to the xr.Dataset constructor -- when I used open_dataset(), I got the time encoding error.

However, it looks like passing a filename into Dataset() is not (no longer) supported, so maybe that should raise an error (or at least a warning), rather than half-working. I don't see an issue on that -- maybe I'll go add one.
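For what it's worth, the ValueError in the original traceback is what you'd expect if the path string ends up being treated as data_vars: a plain dict.update() call on a string iterates over its characters and chokes on the first one. A minimal sketch in plain Python, with no xarray involved (the path is just a placeholder, not the real file):

```python
# A string handed to dict.update() is treated as an iterable of key/value
# pairs, so each single character is rejected as a "pair" of length 1.
path = "some_file.nc"  # placeholder path, not the real file

all_variables = {}
try:
    all_variables.update(path)
except ValueError as err:
    message = str(err)
    print(message)
```

So the message is accurate about what went wrong internally, but says nothing about the actual mistake, which is why an explicit check would be friendlier.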

> decode_cf=False turns everything but automatic index creation off.

I'll give that a try.

I think that does make sense -- I'll take a look at that discussion.


> Chris, I'm not sure your point is best illustrated by the file example
you give.

That was just the "problem at hand" -- probably not a good example, though maybe -- see below.

> Xarray hasn't (never?) been able to open such file before
because it implemented a slightly constrained version of the netcdf data
model (to allow dimension coordinates have an index).

Well, I'm pretty sure early versions of xray could, but that was a long time ago.

> As a result of the
fancy features recently added in Xarray, now in theory it should be able
to open it :-).

I think it may be the opposite: 

xarray provides a lot of nifty high-level features. But it also provides low-level features, like an abstraction around different file formats, and use of dask under the hood.

What I'd like is for the high-level features to be built upon the lower-level stuff, while still exposing the lower level stuff, which may be possible, but I'm struggling with finding that.

This may be what the NamedArray Proposal is heading toward, which is great.

But for now, the trick is that I'm having trouble figuring out if xarray can work for my needed use cases -- in short, as the underpinning of a higher level library [*] that needs to be able to work with "messy" data sources, and ideally without having to re-implement the stuff xarray does do well.

The problem is the target use cases -- we are putting code behind a web service, so a file simply not opening is not a good option, and we want to use xarray itself to "fix" the files that may be non-compliant.

What that means is that:
1) xarray should be able to load almost anything (probably any netcdf file, for instance) -- and any time it can't figure out what to do, it provides a computer-readable error or warning, or ?? so that code can figure out what to do from there.

one idea, for instance, when there's something wonky with a time variable:

ValueError: Failed to decode variable 'Itime2': unable to decode time units 'msec since 00:00:00' with 'the default calendar'. Try opening your dataset with decode_times=False or installing cftime if it is not installed

Rather than a raw ValueError, it could be a:

TimeDecodingError or some such.
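
A rough sketch of how that could look from the calling side, until something like it exists upstream. To be clear, TimeDecodingError and open_with_typed_errors are names I'm making up here, not xarray API, and matching on the message text is an assumption that would break if the wording ever changes:

```python
import re


class TimeDecodingError(ValueError):
    """Hypothetical exception type; xarray does not define this."""

    def __init__(self, message, variable=None):
        super().__init__(message)
        self.variable = variable  # name of the variable that failed to decode


# Assumes the message format quoted in this thread.
_TIME_ERROR = re.compile(
    r"Failed to decode variable '([^']+)': unable to decode time units"
)


def open_with_typed_errors(opener, *args, **kwargs):
    """Call opener(*args, **kwargs), re-raising time-decoding failures
    as TimeDecodingError so calling code can branch on the type."""
    try:
        return opener(*args, **kwargs)
    except ValueError as err:
        match = _TIME_ERROR.search(str(err))
        if match is None:
            raise  # some other ValueError: leave it alone
        raise TimeDecodingError(str(err), variable=match.group(1)) from err
```

Usage would be something like open_with_typed_errors(xr.open_dataset, path), and the except clause can then look at err.variable to decide which variable to repair or skip.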

I'll mess around with disabling all the decode_ options, and see how that goes.
 
[*] I'm working on two things:

One is building gridded on top of xarray:
https://github.com/NOAA-ORR-ERD/gridded and the other is a (currently internal) package that we use to gather model results from various servers and formats, and repackage them as compliant files. In both cases, we need to be able to read non-compliant files, and then fix the resulting Datasets in place.

Chris Barker

Oct 24, 2023, 6:22:46 PM
to xar...@googlegroups.com
On Tue, Oct 24, 2023 at 1:08 AM Benoît Bovy <ben...@gmail.com> wrote:
I understand your frustration, though. Do you have other examples that
Xarray cannot load? If yes, could you add them in this issue, please?
https://github.com/pydata/xarray/issues/2368

Not handy -- but I'll keep it in mind to add it there when I discover new ones.

-CHB

 

Chris Barker

Oct 24, 2023, 6:24:33 PM
to xar...@googlegroups.com
On Tue, Oct 24, 2023 at 3:22 PM Chris Barker <chris....@noaa.gov> wrote:
On Tue, Oct 24, 2023 at 1:08 AM Benoît Bovy <ben...@gmail.com> wrote:
I understand your frustration, though. Do you have other examples that
Xarray cannot load? If yes, could you add them in this issue, please?
https://github.com/pydata/xarray/issues/2368

Not handy -- but I'll keep it in mind to add it there when I discover new ones.

just as a note -- from the original issue from Ryan:

"Anything that can be written to netCDF should be readable by xarray."

which is what I thought, and I think is still a good goal.

-CHB

Deepak Cherian

Oct 24, 2023, 7:25:59 PM
to xar...@googlegroups.com
> I'll mess around with disabling all the decode_ options, and see how that goes.

Just set `decode_cf=False` and you have what you want :) . It can't be the default but you can just use that as a foundation for your libraries.
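
A minimal sketch of that pattern, under some assumptions: the raw Dataset below is an in-memory stand-in for what `xr.open_dataset(path, decode_cf=False)` would return, the variable names and units are modelled on the file in this thread, and `decode_tolerantly` is a hypothetical helper, not part of xarray. It tries each time-like variable on its own, warns about the ones that fail, and decodes the rest with `xr.decode_cf`:

```python
import warnings

import numpy as np
import xarray as xr


def decode_tolerantly(raw):
    """Decode CF metadata, leaving undecodable time variables untouched.

    Returns the decoded Dataset and the names of the variables left raw.
    """
    undecodable = []
    for name, var in raw.variables.items():
        units = var.attrs.get("units", "")
        if isinstance(units, str) and " since " in units:
            try:
                # Decode this variable on its own to see whether it is usable.
                xr.decode_cf(xr.Dataset({name: var}))
            except ValueError as err:
                warnings.warn(f"leaving {name!r} undecoded: {err}")
                undecodable.append(name)

    decoded = xr.decode_cf(raw.drop_vars(undecodable))
    # Re-attach the problem variables as plain, undecoded variables.
    for name in undecodable:
        decoded[name] = raw.variables[name]
    return decoded, undecodable


# Stand-in for xr.open_dataset(path, decode_cf=False): one usable time
# coordinate and one variable with the units reported in this thread.
raw = xr.Dataset(
    {
        "Itime2": ("time", np.array([72000000], dtype="int32"),
                   {"units": "msec since 00:00:00"}),
        "zeta": (("time", "node"), np.zeros((1, 3), dtype="float32")),
    },
    coords={
        "time": ("time", np.array([60220.75]),
                 {"units": "days since 1858-11-17 00:00:00"}),
    },
)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # keep the demo output quiet
    ds, left_raw = decode_tolerantly(raw)
print(ds["time"].dtype, left_raw)
```

A library built on this could then fix the attributes of whatever ends up in `left_raw` and run `xr.decode_cf` again, rather than failing at open time.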

Deepak

