Some small questions...


Chris Barker

Feb 21, 2014, 7:27:01 PM
to UGRID Interoperability
Hi folks,

I'm poking at the pyugrid code a bit -- trying to fill in a few missing pieces, and add more to the netcdf writing code. If you are a Pythonista, but not yet familiar with that project, it's here:


But these are questions about the standards, apart from any python issues.

1) The standard specifies a dummy variable, used to store information about the grid (cf_role = "mesh_topology").

  - In the examples, we call that "Mesh2" -- should there be a convention for that name? Or should it be (is it) arbitrary -- and could it be something like the_mesh_for_my_particular_model? In which case, we should make sure that reading code looks for a variable with the given cf_role.

 - it also has a cf_role and long_name, but no standard_name -- should it?


2) OK -- I lied. This one is related to the Python code (though not specific to it). In netcdf4, which builds on hdf5, you can specify a "chunk size" for variables. However, it turns out that the netcdf4 lib currently has really bad defaults for 1-d variables (or 2-d variables with one of the dimensions really small, like 2 or 3 -- a common case for the mesh).

So anyone have suggestions as to good chunk sizes? Option one is to use the full array size as the chunk size -- but that defeats the purpose of chunking. In some experiments for a different use case, I found having a minimum chunk size of about 1k helped a lot, and going to a larger one was helpful, but not hugely.

I'm thinking of using something like 1MB, or the length of the array, whichever is less.
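
Something like this sketch, in the netCDF4-python API (the 1MB cap and the file/variable names here are just made up for illustration):

import numpy as np
import netCDF4

def chunk_len(n_elements, dtype, max_bytes=1024 * 1024):
    """Whole array as one chunk, capped at roughly 1MB of data."""
    per_chunk = max(1, max_bytes // np.dtype(dtype).itemsize)
    return min(n_elements, per_chunk)

nc = netCDF4.Dataset("mesh.nc", "w")
nc.createDimension("node", 500000)
n = len(nc.dimensions["node"])
# passing chunksizes explicitly overrides the library's poor 1-d default
lon = nc.createVariable("lon", "f8", ("node",),
                        chunksizes=(chunk_len(n, "f8"),))
nc.close()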

thoughts?

-Chris











--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris....@noaa.gov

Rich Signell

Feb 22, 2014, 4:07:40 PM
to Chris Barker, UGRID Interoperability
Chris,

> * The standard specifies a dummy variable, used to store information about
> the grid (cf_role = "mesh_topology").
>
> - In the examples, we call that "Mesh2" -- should there be a convention for
> that name? Or should it be (is it) arbitrary -- and could it be something like
> the_mesh_for_my_particular_model? In which case, we should make sure that
> reading code looks for a variable with the given cf_role.

It's arbitrary, just as all netcdf variable names are in CF.
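
So reading code should key off the attribute rather than the name -- roughly something like this in netCDF4-python (the file name is made up):

import netCDF4

def find_mesh_vars(nc):
    # find mesh topology variables by cf_role, not by name
    return [v for v in nc.variables.values()
            if getattr(v, "cf_role", None) == "mesh_topology"]

nc = netCDF4.Dataset("some_ugrid_file.nc")
meshes = find_mesh_vars(nc)  # works whether it's called "Mesh2" or not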
>
> - it also has a cf_role and long_name, but no standard_name -- should it?

We changed from standard_name to cf_role because Jonathan (CF guru) told us to.
I can't remember why cf_role was deemed more appropriate, but I do
remember feeling persuaded by his argument at the time. ;-)


> 2 )Related to the Python code, in netcdf4, which builds on hdf5, you can specify a "chunk size" for variables.
> However, it turns out that the netcdf4 lib currently has really bad defaults
> for 1-d variables (or 2-d variables with one of the dimensions really small,
> like 2 or three -- common case for the mesh)
>
> So anyone have suggestions as to good chunk sizes? Option one is to use the
> full array size as the chunk size -- but that defeats the purpose of
> chunking. In some experiments for a different use case, I found having a
> minimum chunk size of about 1k helped a lot, and going to a larger one was
> helpful, but not hugely.
>

Russ Rew posted a very nice blog piece about selecting chunk size here:
http://www.unidata.ucar.edu/blogs/developer/entry/chunking_data_choosing_shapes
If you can't figure it out after reading that, just ask Russ. He's a great guy.

And please report back on what you found out.
--
Rich Signell
81 Queen St
Falmouth, MA 02540

Bert Jagers

Feb 24, 2014, 5:07:09 AM
to Chris Barker, UGRID Interoperability
Hi Chris,

Rich is indeed correct.

1) You shouldn't depend on the mesh_topology variable being called "Mesh2" or anything like that. I used Mesh1, Mesh2 and Mesh3 to separate 1D, 2D, and 3D meshes in the examples so that, at the end, I would be able to combine all three in one file containing a mesh composed of 1D, 2D, and 3D parts (similar to structured grid mosaics). The text and examples for the mosaic concept are still described on our wiki:
http://publicwiki.deltares.nl/display/NETCDF/Mosaics+of+meshes
and are not part of the current UGRID-0.9 convention. Other groups have in the meantime written UGRID files that contain two grids: one grid for the state variables of the hydrodynamic model, and one higher-resolution grid with subgrid information on bed topography that the hydrodynamic solver uses when computing storage volumes and cross-sectional areas.

2) The reason for changing from standard_name to cf_role was that standard_name is used solely for physical quantities, and none of the names that we introduced fall in that category. I like the fact that one can use the UGRID convention without immediately needing to go into an extensive discussion about the drawbacks and benefits of the CF standard_names in general, although the use of cf_role has introduced a kind of weird link to CF while UGRID remains a separate convention for the time being.

Best regards,

Bert

Chris Barker

Feb 24, 2014, 12:19:29 PM
to Bert Jagers, UGRID Interoperability
On Mon, Feb 24, 2014 at 2:07 AM, Bert Jagers <Bert....@deltares.nl> wrote:
 
1) You shouldn't depend on the mesh_topology variable being called "Mesh2" or anything like that. I used Mesh1, Mesh2 and Mesh3 to separate 1D, 2D, and 3D meshes in the examples such that at the end, I would be able to combine all three in one file containing a mesh composed of 1D, 2D, and 3D parts (similar to structured grid mosaics).

Thanks Bert -- and thanks for the example of multiple meshes -- that should clearly be taken into account in all this code -- I think it's semi-there at the moment.
 
2) The reason for changing from standard_name to cf_role was that standard_name is used solely for physical quantities and none of the names that we introduced fall in that category.

makes sense, yes.
 
I like the fact that one can use the UGRID convention without immediately needing to go into an extensive discussion about the drawbacks and benefits of the CF standard_names in general, although the use of cf_role has introduced a kind of weird link to CF while UGRID remains a separate convention for the time being.

yeah that is odd, but what the heck!

Are there accepted CF conventions that use a dummy variable and cf_role in this way? If so, it would be nice to point to that in the docs.

-Chris

Bert Jagers

Feb 24, 2014, 12:34:14 PM
to Chris Barker, UGRID Interoperability
>Are there accepted CF conventions that use a dummy variable and cf_role in this way? If so, it would be nice to point to that in the docs.

There is not one case in CF that is exactly like this. The "Horizontal Coordinate Reference Systems, Grid Mappings, and Projections" section uses dummy variables
http://cf-pcmdi.llnl.gov/documents/cf-conventions/1.6/cf-conventions.html#grid-mappings-and-projections
but they don’t use a cf_role. The cf_role keyword was introduced together with discrete geometry concepts.
http://cf-pcmdi.llnl.gov/documents/cf-conventions/1.6/cf-conventions.html#coordinates-metadata

The combination was proposed here
http://www.met.rdg.ac.uk/~jonathan/CF_metadata/ugrid_gridspec.html
by Jonathan Gregory. That was a discussion in the context of aligning mosaics in UGRID and GridSpec.

Best regards,

Bert

Chris Barker

Feb 24, 2014, 1:02:50 PM
to Rich Signell, UGRID Interoperability
On Sat, Feb 22, 2014 at 1:07 PM, Rich Signell <ri...@signell.us> wrote:
 
> 2 )Related to the Python code, in netcdf4, which builds on hdf5, you can specify a "chunk size" for variables.
> However, it turns out that the netcdf4 lib currently has really bad defaults
> for 1-d variables (or 2-d variables with one of the dimensions really small,
> like 2 or three -- common case for the mesh)
>
> So anyone have suggestions as to good chunk sizes? Option one is to use the
> full array size as the chunk size -- but that defeats the purpose of
> chunking. In some experiments for a different use case, I found having a
> minimum chunk size of about 1k helped a lot, and going to a larger one was
> helpful, but not hugely.
>

Russ Rew posted a very nice blog piece about selecting chunk size here:
http://www.unidata.ucar.edu/blogs/developer/entry/chunking_data_choosing_shapes
If you can't figure it out after reading that, just ask Russ.  He's a great guy.

That is a great post -- it helped me a lot in understanding what chunking is about -- but it still leaves questions in this case -- so yes, I'll send a note to Russ. But in the meantime:

That post, and most of what I've seen about chunking, is about 2-d and higher dimension arrays. But in this case, we have a lot of 1-d (or quasi 1-d) arrays, a use case I haven't seen discussed. In fact, the default netcdf chunking for 1-d arrays is seriously broken: if you create an unlimited-dimension 1-d array, it will use a chunksize of 1 -- which is really, really bad. We discovered it because someone using our code was writing some big (~2-3GB) files and it was crashing out on memory errors. I'm pretty sure the issue is that HDF5 needs to build a tree structure to manage the chunks, so that they can be located quickly. If you have one-element chunks, you are going to have a massive tree to manage for a large array, hence the crash (and some really slow writing speeds...)
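
(You can check what defaults you're getting with the Variable.chunking() method -- a quick sketch; what it reports will depend on your library version:)

import netCDF4

nc = netCDF4.Dataset("check_chunks.nc", "w")
nc.createDimension("time", None)  # unlimited
t = nc.createVariable("t", "f8", ("time",))
# chunking() returns 'contiguous' or a list of per-dimension chunk lengths;
# the affected library versions report [1] here
print(t.chunking())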

The reason that default is there is that a common use case, and the one designed for, is 3-d and 4-d arrays, where one dimension (usually time) is unlimited -- in that case, having a single "slab" as a chunk makes sense -- optimized for access to the data for a single time step.

So I'm still not sure what to do for 1-d arrays. It seems one needs to know the access patterns AND things like disk cache size to get optimal chunking, which we don't know in this case, so a reasonable default would be good.

The other issue is that for UGRIDs, you have a bunch of data associated with the nodes, or faces, or ? -- and this is, of course, unstructured, so if you want a subset of the data, you'll ask essentially for arbitrary, and probably non-contiguous, indices -- so how do you chunk efficiently for that? Maybe contiguous data is the way to go here anyway.

I suppose we need to do some profiling / performance testing and see what happens.

-Chris

Rich Signell

Feb 24, 2014, 3:21:36 PM
to Chris Barker, UGRID Interoperability
Chris,
The default chunking of netCDF is 1 for
unlimited dimensions, and chunk size matching the full dimension length
for fixed dimensions, unless those fixed dimensions are very large.

1. For 1D only vars, try making it big, like 1024 or something (or bigger)

You can play around easily with chunking using "nccopy":
https://www.unidata.ucar.edu/software/netcdf/docs/netcdf/nccopy.htm

Note the "-u" option to convert unlimited into fixed size on output.

2. For UGRID you might not think chunking on the node or element
dimension makes any sense, but if you have users trying to extract
time series from a particular node or element, it will pay off to
chunk your domain. If you break your domain into 10 pieces or so, it
won't slow down the extraction of the whole domain much, but will
really speed up time series extraction (see the sketch below).

3. I'm kind of stating stuff like I know what I'm talking about. It
would be good if you run some tests and report back.
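
Here's a sketch of what (2) could look like when creating a (time, node)
variable -- the ten-piece split is just my rule of thumb, and all the
names are invented:

import netCDF4

nc = netCDF4.Dataset("adcirc_like.nc", "w")
nc.createDimension("time", None)
nc.createDimension("node", 200000)

n_node = len(nc.dimensions["node"])
# ~10 chunks across the mesh: a whole-domain read at one time step
# touches all 10, but a single-node time series only touches one
# "column" of chunks instead of the whole file
zeta = nc.createVariable("zeta", "f4", ("time", "node"),
                         chunksizes=(10, n_node // 10))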

-Rich



Bert Jagers

Feb 24, 2014, 4:10:31 PM
to Rich Signell, UGRID Interoperability
Hi Rich,

> 2. For UGRID you might not think chunking on the node or element dimension makes any sense, but if you have users trying to extract time series from a particular node or element, it will pay off to chunk your domain. If you break your domain into 10 pieces or so, it won't slow down the extraction of the whole domain much, but will really speed up time series extraction.

Yes, you are absolutely correct. This issue was also raised, in a slightly different context, during a Met Ocean DWG session at last year's OGC TC meeting in Frascati. If I remember correctly, Chris Little (UK Met) gave a presentation on splitting data sets in space and grouping them together in time to balance time- and space-slicing overhead costs a bit. Since the OGC standard only concerns classic CF-netCDF3 files, I believe the focus there was on actually splitting files, but in this context it would imply increasing the chunk size for the time dimension and reducing it for the node/space dimension.

Cheers,

Bert


Chris Barker

Feb 24, 2014, 7:40:05 PM
to Rich Signell, UGRID Interoperability
On Mon, Feb 24, 2014 at 12:21 PM, Rich Signell <ri...@signell.us> wrote:
Chris,
The default chunking of netCDF is 1 for
unlimited dimensions, and chunk size matching the full dimension length
for fixed dimensions, unless those fixed dimensions are very large.

right -- which is a disaster for 1-D unlimited dimension arrays.... 
 
1. For 1D only vars, try making it big, like 1024 or something (or bigger)

yeah, I was thinking about that. In experiments with a different use case, I found that up to 1024 helped speed things up, but larger than that made much less difference -- though that was write speed, not read speed... And, of course, access patterns really matter.

2. For UGRID you might not think chunking on the node or element
dimension makes any sense, but if you have users trying to extract
time series from a particular node or element, it will pay off to
chunk your domain.  If you break your domain into 10 pieces or so, it
won't slow down the extraction of the whole domain much, but will
really speed up time series extraction.

Good point -- I was thinking to stick with a time chunk of 1 here, and have each chunk be the full grid. That would be fine for getting all the data at one time step, but probably lousy for a time series...

But this is where my understanding gets really limited -- let's say there is a (time X node) array with some data in it.

The user wants the data for all times at nodes 134, 1032, 12, and 478. I don't think there is a way to ask for a discontiguous set of data, so you make a separate request for each node. That's going to be pretty painful, as each time step requires a separate disk access to get one value. But does chunking on the nodes axis help? Those nodes may well be in different chunks, so that's not helpful. And even if some are in the same chunk, will that allow some data to be cached and therefore be available when the next node is requested? And that will depend on what order the user writes their loop in...
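
(For what it's worth, netCDF4-python does accept a sequence of indices -- though I'd guess it still does per-index reads under the hood, so the chunk-cache question stands. A sketch, with made-up names:)

import netCDF4

nc = netCDF4.Dataset("some_ugrid_results.nc")
zeta = nc.variables["zeta"]          # shape: (time, node)

nodes_of_interest = [12, 134, 478, 1032]
# integer-sequence ("orthogonal") indexing: the looping happens inside
# the library; whether chunks get reused still depends on the layout
# and the HDF5 chunk cache
series = zeta[:, nodes_of_interest]  # shape: (n_times, 4)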


3. I'm kind of stating stuff like I know what I'm talking about.  It
would be good if you run some tests and report back.

yeah -- me too -- the only thing I know for sure about profiling is that however much I think I know about the problem, I'm usually wrong.

Bert wrote:
 Yes, you are absolutely correct. This issue was also raised, in a slightly different context, during a Met Ocean DWG session at last year's OGC TC meeting in Frascati. If I remember correctly, Chris Little (UK Met) gave a presentation on splitting data sets in space and grouping them together in time to balance time- and space-slicing overhead costs a bit. Since the OGC standard only concerns classic CF-netCDF3 files, I believe the focus there was on actually splitting files, but in this context it would imply increasing the chunk size for the time dimension and reducing it for the node/space dimension.

OK -- maybe a good rule of thumb is that if you don't know what access pattern you need to support, you should use fairly symmetric chunks...

I'll play around a bit with this and report back.

-Chris



Stefan Vater

Mar 11, 2014, 10:06:50 AM
to ugrid-inter...@googlegroups.com
Hi Rich,

how does this work for dimensions? In the ugrid convention on github they are
always given like "n<MESHNAME>_node" or "n<MESHNAME>_face" for the number of
nodes and faces (cells) in a given mesh <MESHNAME>, and "Two", "Three" for
connectivity relations. However, when I look at the data on your server
(http://comt.sura.org/thredds/comt_1_archive_summary.html), we have
"node", "nele", "nvertex" for the number of nodes, faces (cells), and vertices
per face. Furthermore, the mesh variable has one dimension, "mesh", which is
not described in the document about mosaics of meshes
(http://publicwiki.deltares.nl/display/NETCDF/Mosaics+of+meshes). There, a
mesh variable is always a dimensionless integer.

It would be good to know how flexible the naming of these quantities can be,
since this greatly influences how the NetCDF structure is parsed at the
beginning of file reading.

Best regards,
Stefan

Rich Signell

Mar 11, 2014, 10:22:25 AM
to Stefan Vater, UGRID Interoperability
Stefan,
In UGRID, as in CF, the names of variables and dimensions are not important.

I see that some of the datasets on
http://comt.sura.org/thredds/comt_1_archive_summary.html
have the "mesh" variable with dimension 1 and others have it with
dimension 0. It does not matter, since no data is read from this
variable; only the attributes are read.

Stefan Vater

Mar 11, 2014, 10:35:21 AM
to Rich Signell, UGRID Interoperability
Hi Rich,

thanks a lot for clarifying this. One more question: How does this hold for
the time dimension/variable? Can this be arbitrary, as well? And must a file
always have a time variable?

Regards,
Stefan

Chris Barker - NOAA Federal

Mar 11, 2014, 10:43:18 AM
to Stefan Vater, Rich Signell, UGRID Interoperability
On Mar 11, 2014, at 7:35 AM, Stefan Vater <st.v...@web.de> wrote:

> thanks a lot for clarifying this. One more question: How does this hold for
> the time dimension/variable? Can this be arbitrary, as well?

Can what be arbitrary? We are trying to be standard CF wherever
possible, so time is handled exactly as it is for CF.

So (as I understand it) time can be named anything, though most people
do call it 'time'. The standard_name is what matters.

It needs to be the size of the number of time steps.

And it is not required if the data set doesn't need it. Some data sets
are time-independent.

I would probably use a time dimension of size 1 if I had a data set
that was for one particular datetime.

If you are doing this in Python for paraview, perhaps we should share
code with pyugrid?

Chris

Rich Signell

Mar 11, 2014, 10:45:46 AM
to Stefan Vater, UGRID Interoperability
Stefan,

Right. Nothing special for time. It doesn't matter what the name or
dimension name is.

Just as in CF, if you've specified the "coordinates" attribute for
data variables, client software will look through the list, find the one
with valid time units, and assign that as the time coordinate variable.

There is an old rule in CF that if coordinates attributes are not
specified, then variables that are named the same as their dimension
are assumed to be coordinate variables. So if you had
time(time)
or
my_goofy_time(my_goofy_time)
it would assume that those were coordinate variables. For that
reason, you will see many datasets with time(time).

I would recommend specifying the time coordinate variable in the
coordinates attribute.
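
A small sketch of that in netCDF4-python (all names arbitrary, per the
discussion above; a real file would normally also list lon/lat in
coordinates):

import netCDF4

nc = netCDF4.Dataset("with_time.nc", "w")
nc.createDimension("time", None)
nc.createDimension("node", 1000)

t = nc.createVariable("time", "f8", ("time",))
t.standard_name = "time"  # the standard_name is what matters, not the name
t.units = "seconds since 2014-01-01 00:00:00"

zeta = nc.createVariable("zeta", "f4", ("time", "node"))
zeta.coordinates = "time"  # point clients at the time coordinate explicitly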

Chris Barker

Mar 11, 2014, 11:47:18 AM
to Rich Signell, Stefan Vater, UGRID Interoperability
We should probably put up an example with time. For that matter, a handful of examples with various features...

-Chris

Stefan Vater

Mar 14, 2014, 7:07:31 AM
to Chris Barker, Rich Signell, UGRID Interoperability
Dear Chris, dear Rich,

thanks a lot for your comments. Unfortunately I am not yet fully familiar with
the CF convention. That's why I asked.

Some examples would definitely be good -- also in the form of NetCDF files, so
that one could try them in their own application/visualization program.

The reader we are writing is in C++, so sharing code with pyugrid
might not be so easy. However, it is a good basis for comparison, and I
am also interested in having a python interface for reading my files. Is there
a script which includes some visualization in the pyugrid repository?

I also downloaded some data for testing from the THREDDS server Rich
mentioned. Unfortunately, my reader has some problems with it. I am not sure
if this has to do with the data size or something else. It might also be due
to my download through nccopy, where I only fetched a subset. Well, this is
still under investigation...

Thanks so far,
Stefan

Rich Signell

Mar 14, 2014, 7:25:39 AM
to Stefan Vater, Chris Barker, UGRID Interoperability
Stefan,

I guess you probably know this, but you don't have to download UGRID
netcdf files -- just build your netcdf C library with OPeNDAP enabled,
and link your application to that. Then you can open an OPeNDAP URL
(e.g. http://comt.sura.org/thredds/dodsC/data/comt_1_archive/inundation_tropical/UND_ADCIRC/Hurricane_Ike_3D_final_run_with_waves)
just as if it were a local NetCDF file name.
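
(Same thing from Python, assuming netCDF4-python was built against a
DAP-enabled libnetcdf:)

import netCDF4

url = ("http://comt.sura.org/thredds/dodsC/data/comt_1_archive/"
       "inundation_tropical/UND_ADCIRC/"
       "Hurricane_Ike_3D_final_run_with_waves")
nc = netCDF4.Dataset(url)  # opened just like a local file
print(list(nc.variables))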

-Rich

Stefan Vater

Mar 14, 2014, 9:06:03 AM
to ugrid-inter...@googlegroups.com, Rich Signell, Chris Barker
Yes, I know this, but I haven't tried it yet with paraview. Furthermore, for
testing I find it much better to have the dataset locally right now. Also, some
smaller subset would be nice...

Stefan

Rich Signell

Mar 14, 2014, 1:19:55 PM
to Stefan Vater, UGRID Interoperability, Chris Barker
If you want to download a huge OPeNDAP dataset as NetCDF, you can use "nccopy",
as it buffers the data (does multiple small DAP requests to build the
local netCDF file). I've downloaded 300GB files this way. Just
make sure to set the buffer size larger than the default 5m.
Here I've set the buffer to 500m, the default maximum size for binary
DAP transfers on THREDDS Data Servers.

nccopy -k 4 -d 1 -m 500m
'http://comt.sura.org/thredds/dodsC/data/comt_1_archive/inundation_tropical/UND_ADCIRC/Hurricane_Ike_2D_final_run_without_waves'
adcirc_nc4.nc

This dataset is 33GB, however, so it will still take a long time to download.

If you want to clip out just a few time steps, however, you could use ncks.
Here's an example taking just the first two steps:

ncks -d time,0,1
'http://comt.sura.org/thredds/dodsC/data/comt_1_archive/inundation_tropical/UND_ADCIRC/Hurricane_Ike_2D_final_run_without_waves'
adcirc_tiny.nc


