Please: let's all read nexus docs

Carlos Pascual Izarra

Jan 27, 2010, 4:17:40 AM

to ma...@googlegroups.com

One lesson from the horribly long thread "quick comments about multidim..."
started by Darren is that we spend too much effort discussing things
only to reach virtually the same conclusions that the NeXus people reached
(e.g., that data from individual detectors/positioners are better stored in
their own arrays, or that attribute names should not contain punctuation
or whitespace).

During the workshop I advocated (and I thought that we generally agreed)
taking as much as possible from the NeXus conventions and only extending them
when something is clearly insufficient for our needs.
Even if that is not the general consensus, I think the discussions at MAHID
would benefit a lot if everybody had at least had a look at the basic NeXus
docs. This way we would at least all speak the same dialect.

And yes, I know the NeXus docs are far from ideal, but in order to grasp the
basic concepts, they are good enough.

--
+----------------------------------------------------+
Carlos Pascual Izarra
Scientific Software Contact
Computing Division
Cells / Alba Synchrotron [http://www.cells.es]
Carretera BP 1413 de Cerdanyola-Sant Cugat, Km. 3.3
E-08290 Cerdanyola del Valles (Barcelona), Spain
E-mail: carlos....@cells.es
Phone: +34 93 592 4428
+----------------------------------------------------+

"V. Armando Solé"

Jan 27, 2010, 5:28:44 AM

to ma...@googlegroups.com

Carlos Pascual Izarra wrote:
> One lesson from the horribly long thread "quick comments about multidim..."
> started by Darren is that we spend too much effort into discussing things
> only to reach virtually the same conclusions that the NeXus people reached
> (e.g., that data from individual detectors/positioners are better stored in
> their own arrays, or that attribute names better don't contain punctuation
> and whitespace,...)
>
> During the workshop I advocated (and I though that we generally agreed) to
> take as much as possible from the Nexus conventions and only extend them when
> something is clearly insufficient for our needs.
>

I also thought it was generally agreed and I reflected it in the report.

In the course of the above-mentioned thread I suggested splitting it into
pure acquisition-related storage problems and pure handling of
multidimensional data. The thread owner/starter considered the two
related enough not to split them. I think that made the thread harder to
follow, but I admit I cannot say he was wrong.

> Even if that is not the general consensus, I think the discussions at MAHID
> will benefit a lot if everybody had at least had a look at the basic NeXus
> docs. This way we would at least all speak the same dialect.
>
> And yes, I know the NeXus docs are far from ideal, but in order to grasp the
> basic concepts, they are good enough.
>

From what I have read up to now, and in terms of acquisition needs, all
I need is "NeXus + 1 Group" or "1 Group + NeXus", with that group
being a close equivalent of what currently is a specfile or what Darren
presented at the workshop. So the positions are quite close from my side:
I do not need to reinvent anything, only to add something.

In the last months I have had the opportunity to work with NeXus files
from PSI, SOLEIL and Diamond and HDF5 files from Elettra. There are serious
discrepancies and, in my opinion, serious mistakes in how data are
organized for further analysis in many (I would say most) of them. The
positive conclusion I reached was that NeXus was not as rigid as I
supposed :-)

Learning about the existence of definitions was quite useful for many of
us, and I intend to participate in Mark Koennecke's training on them
whenever it takes place.

That said, it is very easy to fall into the trap of the instrumental
versus the analysis-minded approach. We had an example in the mentioned
thread. Matt Newville and Darren Dale thought: we have ONE
multielement fluorescence DETECTOR, so it is logical to have (npoints,
ndetectors, channels). It is so logical that it was Elettra's initial
choice too. This clearly exemplifies the problem of the instrumental
approach. Even if the elements have the same number of channels and the
same calibration, they will not have the same experimental geometry and
they will very likely have different resolutions, leading to independent
analysis.

At the ESRF I have worked for a long time on beamline control software.
Experience shows that when the last guy(s) of the control chain
(hereafter the software guy) is also involved in the specifications of a
new instrument, the integration of the instrument in the beamline is
very smooth and efficient. When the software guy is left out because "he
has nothing to say in the specifications", the probability that the
integration of the instrument is cumbersome, inefficient or impossible
is extremely high.

The same can be applied here. If that multielement fluorescence detector
is split into single detectors, analysis can be done "out of the box".
The Elettra people just showed me (the "last software guy in the chain")
what they had in mind with an example. I gave them my opinion: it can be
supported, but it is not there yet and can be unproductive. So now they
can think about it. The nice point of this story is that they did it BEFORE
starting to generate data.
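
As a rough illustration of the split described above, here is a minimal Python sketch that turns one combined (npoints, ndetectors, channels) block into one (npoints, channels) array per detector element. The function name and the tiny data set are made up for illustration, and plain nested lists stand in for real HDF5 datasets.

```python
# Hypothetical helper, for illustration only: nested lists stand in for
# the datasets that would actually live in an HDF5 file.

def split_by_detector(spectra):
    """Split spectra[point][detector][channel] into a list with one
    entry per detector element, each indexed as [point][channel]."""
    n_detectors = len(spectra[0])
    return [
        [point[det] for point in spectra]
        for det in range(n_detectors)
    ]


# Made-up example: 2 scan points, 3 detector elements, 4 channels each.
combined = [
    [[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23]],
    [[4, 5, 6, 7], [14, 15, 16, 17], [24, 25, 26, 27]],
]

per_detector = split_by_detector(combined)

# Each element can now carry its own calibration, geometry and
# resolution metadata and be analysed independently.
for index, element in enumerate(per_detector):
    print("detector", index, "->", len(element), "points x", len(element[0]), "channels")
```

Whether the per-detector arrays are stored as separate datasets or exposed through references into one combined dataset is a separate design choice that this sketch does not address.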

Most of us are not experts on HDF5 and, if I understood Francesc's talk at
the workshop correctly, it should be possible to "map parts of datasets". I
think that is called references:

http://davis.lbl.gov/Manuals/HDF5-1.6.5/References.html

All the above is just to say that many issues can and will show up, not
only because of a lack of knowledge of NeXus, but also:

- because the NeXus instrumental approach itself is not ideal for
analysis. Most of this issue can be solved with NeXus definitions, and I
guess we have a lot of work in front of us on this subject, because they
can be used to unify some analysis procedures on files written in
totally different ways
- because of a lack of knowledge of HDF5
- because people doing analysis (scientists and programmers) have not
been involved from the beginning

I also hope/wish/dream next threads will be easier to follow.

Best regards,

Armando

Darren Dale

Jan 27, 2010, 7:51:19 AM

to ma...@googlegroups.com

On Wed, Jan 27, 2010 at 5:28 AM, "V. Armando Solé" <so...@esrf.fr> wrote:
> Carlos Pascual Izarra wrote:
>>
>> One lesson from the horribly long thread "quick comments about
>> multidim..." started by Darren is that we spend too much effort into
>> discussing things only to reach virtually the same conclusions that the
>> NeXus people reached (e.g., that data from individual detectors/positioners
>> are better stored in their own arrays, or that attribute names better don't
>> contain punctuation and whitespace,...)
>>
>> During the workshop I advocated (and I though that we generally agreed) to
>> take as much as possible from the Nexus conventions and only extend them
>> when something is clearly insufficient for our needs.
>
> I also thought it was generally agreed and I reflected it in the report.
>
> In the course of the above mentioned thread I suggested to split it into
> pure acquisition related storage problems and pure handling of of
> multidimensional data. The thread owner/starter considered both were related
> enough not to split them. I think that made the thread harder to be followed
> but I admit I cannot say he was wrong.

Matt and I were not talking about pure acquisition-related storage
problems. I'm sorry the thread ran so long, and that the topic did
wander a bit. That tends to happen from time to time on development
mailing lists.

[...]

> All the above is just because I want to say that many issues can show up and
> will show up, not only because of lack of knowledge of NeXus, but also:
>
> - because of NeXus instrumental approach itself not being ideal for
> analysis. Most of this issue can be solved with NeXus definitions and I
> guess we have a lot of work in front of us in this subject because it can be
> used to unify some analysis procedures on files written in totally different
> ways
> - because of lack of knowledge of HDF5

or of limitations on hdf5 features, as nexus supports those features
that are common between xml, hdf4 and hdf5.

Darren

Pete R. Jemian

Jan 27, 2010, 8:15:55 AM

to ma...@googlegroups.com

On 1/27/2010 3:17 AM, Carlos Pascual Izarra wrote:
> ... discussing things
> only to reach virtually the same conclusions that the NeXus people reached ...

This was my concern going into the workshop: that we would repeat
the discussions of the NIAC and yet end up with similar but not quite
identical conclusions. This is sensible only if the differences
are for technical reasons. The discussion/debate is good, but I
question why we would do things differently than NeXus if the reasoning
is not technical but semantic.

Possibly, better education about NeXus is important, as suggested.

Since the group is keen to have a look, here is the URL
of the current draft (open for review) of the NeXus manual,
updates will appear at the same URL until it is released:

http://www.nexusformat.org/images/f/fc/NeXusManual.pdf

The NeXus manual is moving from MediaWiki to DocBook
and the release copy is not ready. Most of the MediaWiki
content is moved over but the document needs editing (as noted
below) and reorganization. I'm working on that now.
Suggestions are welcome (including the Index).

Pete

On 1/27/2010 3:17 AM, Carlos Pascual Izarra wrote:
> One lesson from the horribly long thread "quick comments about multidim..."
> started by Darren is that we spend too much effort into discussing things
> only to reach virtually the same conclusions that the NeXus people reached
> (e.g., that data from individual detectors/positioners are better stored in
> their own arrays, or that attribute names better don't contain punctuation
> and whitespace,...)
>
> During the workshop I advocated (and I though that we generally agreed) to
> take as much as possible from the Nexus conventions and only extend them when
> something is clearly insufficient for our needs.
> Even if that is not the general consensus, I think the discussions at MAHID
> will benefit a lot if everybody had at least had a look at the basic NeXus
> docs. This way we would at least all speak the same dialect.
>
> And yes, I know the NeXus docs are far from ideal, but in order to grasp the
> basic concepts, they are good enough.
>
>
>
>

--
----------------------------------------------------------
Pete R. Jemian, Ph.D. <jem...@anl.gov>
Beam line Controls and Data Acquisition, Group Leader
Advanced Photon Source, Argonne National Laboratory
Argonne, IL 60439 630 - 252 - 3189
-----------------------------------------------------------
Education is the one thing for which people
are willing to pay yet not receive.
-----------------------------------------------------------

Darren Dale

Jan 27, 2010, 10:17:51 AM

to ma...@googlegroups.com

On Wed, Jan 27, 2010 at 8:15 AM, Pete R. Jemian <prje...@gmail.com> wrote:
>
> On 1/27/2010 3:17 AM, Carlos Pascual Izarra wrote:
>> ... discussing things
>> only to reach virtually the same conclusions that the NeXus people
>> reached ...
>
> This was my concern going in to the workshop, that we would repeat
> the discussions of the NIAC and yet result in similar yet not quite
> identical conclusions. This is sensible only if the differences
> are for technical reasons. The discussion/debate is good but I
> question why do differently than NeXus if the reasoning is not
> technical but semantics?

My own opinion is that semantic preferences are not sufficient to
justify using an alternative. If the result of such discussion were to
yield a solution that is essentially NeXus with different semantics,
the logical conclusion would be "this is technically no different than
NeXus, I don't think the community would accept it as a viable
alternative, so let's use NeXus". Result: people wary of NeXus become
reassured that it is the right approach. If technical issues are
found, perhaps they can be addressed in a NeXus application
definition, or other extensions to NeXus, if not, we should be aware
of that.

Please, let's be tolerant of discussion of perceived limitations and
alternatives to NeXus or other proposals. We don't all have to
participate in every discussion, and I recognize it can be frustrating
when it seems like some of us are debating settled issues, but
comments like "I thought we settled on NeXus" or "I thought we settled
this at the workshop" would be more helpful if coupled with a pointer
to documentation that is complete in the sense that it explains both
the issue and the solution (including semantics and examples of use).
Without that, such comments can be interpreted as "you may not
understand it, but it's settled, and you can trust us that it will all
work out".

Finally, I will try to keep Chris Jacobsen's comment in mind: "the
perfect can be the enemy of the good". I can too easily become focused
on finding an ideal solution (as I worry that developing a relatively
constrained format can lead to painting oneself into a corner). I feel
responsible for engendering a feeling that it is going to be difficult
to get anything accomplished, which is sort of ironic because I think
the amount of discussion this mailing list has had in the last few
days is proof that we are accomplishing plenty (most importantly: an
engaged community!)

> Possibly, better education about NeXus is important, as suggested.
>
> Since the group is keen to have a look, here is the URL
> of the current draft (open for review) of the NeXus manual,
> updates will appear at the same URL until it is released:
>
> http://www.nexusformat.org/images/f/fc/NeXusManual.pdf

Is there any difference between this document and the documentation on the web?

Darren

Wellenreuther, Gerd

Jan 27, 2010, 10:25:27 AM

to ma...@googlegroups.com

Dear colleagues,

since I noticed that a couple of subscribers to the mailing list
unsubscribed, I just wanted to give a short notice about how to cope with
the sometimes large amount of emails.

* It is possible to tell Google to put all emails of one day into one
email: you only get one email per day, which is long, and you can have
a look to see if anything particularly interesting is happening, and
comment on it. (If somebody wants to try this feature but does not know how
to switch it on, or cannot because they have no Google account ==> tell me!)

* A little bit more elaborate: All mails from MAHID should contain a
[MAHID] in the subject - most mailing programs should be able to filter
your mails, and e.g. move all MAHID-mails in a special folder.

And I should stress that I am happy if people are discussing here - in
whatever length.

Cheers, Gerd

Pete R. Jemian

Jan 27, 2010, 11:43:18 AM

to ma...@googlegroups.com

On 1/27/2010 9:17 AM, Darren Dale wrote:
>
> Is there any difference between this document and the documentation on the web?

The PDF file includes information about NXDL which is not available in the NeXus wiki.

> http://www.nexusformat.org/images/f/fc/NeXusManual.pdf

Carlos Pascual Izarra

Jan 27, 2010, 1:48:26 PM

to ma...@googlegroups.com

On Wednesday 27 January 2010 16:17:51 Darren Dale wrote:
> My own opinion is that semantic preferences are not sufficient to
> justify using an alternative.

That is exactly my opinion too. (i.e., I might prefer CamelCase over
underscores for attribute names, but I think it is worth using underscores if
that saves me a long discussion).

To state my original point again: even if we chose not to follow NeXus, it'd
be interesting to read its docs so that we could illustrate our points by
referring to the way nexus does it.

Darren Dale

Jan 27, 2010, 2:03:57 PM

to ma...@googlegroups.com

On Wed, Jan 27, 2010 at 10:17 AM, Darren Dale <dsda...@gmail.com> wrote:
> Please, let's be tolerant of discussion of perceived limitations and
> alternatives to NeXus or other proposals.

That was a general request, inspired by years spent on various
open-source development mailing lists. It was not directed at anyone,
and it was especially not directed at Pete (whose comments have been
both encouraging and insightful.)

Darren

Matt Newville

Jan 27, 2010, 3:58:49 PM

to ma...@googlegroups.com

Thanks Darren,

That was much more polite than I would have been.

For myself, I've ONLY discussed how to store data for interchange on
this list, NEVER how to store raw experimental data.

As for what is "settled" or "was decided at the workshop", I speak
only for myself. I came away from the workshop with the understanding
that those of us trying to exchange hyperspectral x-ray fluorescence
mapping data would use HDF5, but not much more beyond that. For one
thing, several people who do "hyperspectral x-ray fluorescence
mapping" (at beamlines and/or with analysis programs) were not present
at the workshop, and are not active in this discussion. I'm not sure
how aware these folks are of this effort, and I would not assert that
any details are settled.

To me, Nexus is interesting and provides both positive and negative
lessons. It seems to me that "semantic preferences" is the least of
the issues with Nexus, but I don't think anyone here suggested that
Nexus be ignored. So thanks, Pete, for the link to the manual:
documentation is one of the concerns about Nexus, so it is good to see
that this is being worked on!

When Carlos says

During the workshop I advocated (and I though that we generally agreed)
to take as much as possible from the Nexus conventions and only extend
them when something is clearly insufficient for our needs.

I don't understand exactly what "take ... from Nexus" means in detail.
Could you be specific?

Cheers,

--Matt Newville <newville at cars.uchicago.edu> 630-252-0431

Vicente Sole

Jan 27, 2010, 4:33:38 PM

to ma...@googlegroups.com

Hi Matt,

Quoting Matt Newville <newv...@cars.uchicago.edu>:

>
> When Carlos says
> During the workshop I advocated (and I though that we generally agreed)
> to take as much as possible from the Nexus conventions and only extend
> them when something is clearly insufficient for our needs.
>
> I don't understand exactly what "take ... from Nexus" means in detail.
> Could you be specific?
>

I am not Carlos, but I can be specific at least on a couple of points that I
mentioned at the workshop and that did not seem to be openly contested.

- When we were talking about units, we said we would try to use the
same attributes because if we had to start from scratch it would be
more time consuming.

- When we were talking about exchanging pretreated/processed data, I
suggested an NXdata group was convenient because it already
describes at least how the data are expected to be looked at. After
the discussion in the previous long thread, that should be even more
complete once the data_type attribute is added. The datasets I submitted to
the MAHID site are ready to be analyzed by multivariate techniques.
They show that both approaches can coexist in the same file: a name-based
approach not relying at all on attributes or on NeXus, and a strict NeXus
one. So I do not see why we should not try to use something
that can certainly accommodate that. If, in addition, one can add
definitions about how to do further processing, I have to say I will take it.

About the rest of NeXus, I certainly do not need it for exchanging
data. If I decide one day to put the full description of an instrument
in a file, and it happens to be described in NeXus, I will use it, but
that's all I foresee from my side concerning NeXus. Within the scope
of this list it is even more limited: units, NXdata and definitions. I
expect the definitions to come from this mailing list, and by
definition I understand just a dictionary of the physical quantities
needed to perform a particular type of analysis.

Criticism can be very positive. I really loved your presentation at
the workshop. It was extremely rich.

Best regards,

Armando

Carlos Pascual Izarra

Jan 28, 2010, 4:23:29 AM

to ma...@googlegroups.com

On Wednesday 27 January 2010 21:58:49 Matt Newville wrote:
> Thanks Darren,
> That was much more polite than I would have been.

I want to apologize if I sounded rude about the long thread above. Nothing
could be further from my intention than criticizing things being debated. I
was just suggesting that such debate would improve if we all had a few common
concepts in mind.

> When Carlos says
> During the workshop I advocated (and I though that we generally agreed)
> to take as much as possible from the Nexus conventions and only extend
> them when something is clearly insufficient for our needs.
>
> I don't understand exactly what "take ... from Nexus" means in detail.
> Could you be specific?

Apart from what Armando mentions in his reply to this, I was referring to
things like:
- The way of dealing with units.
- The conventions for describing geometries (which may occasionally need to
be extended, as was noted by the SAXS people).
- The naming conventions for attributes and groups.
- The way of storing data. For example, a scan of NP points involving 3
counters + 1 MCA with 1024 channels + 2 area detectors with 800x600 pixels
would be stored as 6 datasets: 3 of dim NPx1, 1 of dim NPx1024 and 2 of dim
NPx800x600 (this is the specification in the GenericScan Application
Definition).
- That each experiment hangs from its own NXentry group.
- ...

Note that I am not saying that we already agreed on these precise points; I am
just being more specific about what I was referring to.
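
To make the storage example in the list above concrete, here is a small Python sketch that only computes the dataset shapes for such a scan. The function and the dataset names (counter1, mca1, area1, ...) are hypothetical and are not taken from the GenericScan definition; nothing is written to a file.

```python
# Illustrative only: computes the shapes of the datasets for the scan
# described above. Names such as "counter1" are made up.

def scan_dataset_shapes(n_points, n_counters=3, mca_channels=(1024,),
                        area_shapes=((800, 600), (800, 600))):
    """Return a dict mapping a dataset name to its shape, with one
    dataset per counter, per MCA and per area detector."""
    shapes = {}
    for i in range(n_counters):
        shapes["counter%d" % (i + 1)] = (n_points, 1)
    for i, channels in enumerate(mca_channels):
        shapes["mca%d" % (i + 1)] = (n_points, channels)
    for i, (width, height) in enumerate(area_shapes):
        shapes["area%d" % (i + 1)] = (n_points, width, height)
    return shapes


shapes = scan_dataset_shapes(n_points=100)
for name, shape in shapes.items():
    print(name, shape)
```

With the defaults this yields six entries: three of shape (NP, 1), one of shape (NP, 1024) and two of shape (NP, 800, 600), matching the counts given in the example.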

Cheers,

Carlos

Darren Dale

Jan 28, 2010, 7:42:40 AM

to ma...@googlegroups.com

On Thu, Jan 28, 2010 at 4:23 AM, Carlos Pascual Izarra
<carlos....@cells.es> wrote:
> On Wednesday 27 January 2010 21:58:49 Matt Newville wrote:
>> Thanks Darren,
>> That was much more polite than I would have been.
>
> I want to apologize if I sounded rude about the long thread above. Nothing
> farther from my intention than criticizing things being debated. I was just
> suggesting that such debate wouldimprove if we all had in mind a few common
> concepts.

Understood, and thanks.

>> When Carlos says
>> During the workshop I advocated (and I though that we generally agreed)
>> to take as much as possible from the Nexus conventions and only extend
>> them when something is clearly insufficient for our needs.
>>
>> I don't understand exactly what "take ... from Nexus" means in detail.
>> Could you be specific?
>
> Apart from what Armando mentions in his reply to this, I was referring to
> things like:
> -the way of dealing with units

I would like to discuss units at some point (in a separate thread)

> -the conventions for describing geometries (which in occasions may need being
> extended, as it was noted by the SAXS people)
> -The naming conventions for attributes and groups.
> -The way of storing data. For example, that a scan of NP points involving 3
> counters+ 1 MCA with 1024 channels + 2 area detectors with 800x600pixels
> would be stored as 6 datasets: 3 of dim NP*1 , 1 of dim NPx1024 and 2 of dim
> NPx800x600 (This is the specification at the GenericScan Application
> Definition)
> -That each experiment hangs from it own NXentry group

If I am doing simultaneous powder diffraction and scanning xrf
mapping, do you consider those to be separate experiments? If so, we
discussed at the workshop the possibility of specifying multiple
application definitions in a single entry. This is a topic that will
need additional consideration.

Darren

Pete Jemian

Jan 28, 2010, 11:39:22 AM

to ma...@googlegroups.com

from the NeXus PDF manual (and not so clearly stated in the wiki):

"NeXus units are written as a string (NX_CHAR)
and describe the engineering units. The string
should be appropriate for the value.
Values for the NeXus units must be specified in
a format compatible with Unidata UDunits
(http://www.unidata.ucar.edu/software/udunits).
The UDunits specification also includes instructions
for derived units."

This comment (written by me) in the manual's source code also appears:

Pete Jemian

Jan 28, 2010, 12:06:58 PM

to ma...@googlegroups.com

There is a discussion about naming conventions that is summarized by
Carlos about adopting various standards:

Carlos Pascual Izarra wrote:
> ...

> -The naming conventions for attributes and groups.

This message is long. My summary: I suggest using the XML attribute name
rules since they are flexible and well considered by a standards body
that thought about this a lot. This does not resolve the issue of
camelCase versus names_with_underscores versus hyphenated-names.

The same applies to group names, but there is another NeXus rule: NX___,
where ___ is any character sequence, is reserved for use by the NIAC.
The NIAC must approve new NX___ names (mostly to prevent contention and
duplication).

The XML standards organization has a standard for naming attributes that
is broadly used and supported in various computer languages, including
languages used by scientists. Acknowledging that they have considered
how these names might be represented in the variety of computer
languages (where they might be used as the names of variables), it seems
reasonable to consider adopting their plans.

Many WWW pages provide information about XML. I've found the set from
w3schools.com to be the most useful. Here's the page describing XML
attributes:
http://www.w3schools.com/xml/xml_attributes.asp

Another document, from the XML standards body, provides more technically
accurate information:
http://www.w3.org/TR/REC-xml/#NT-Name
The excerpt below says that attribute names are pretty flexible but
must start with a letter, underscore or colon rather than a digit (see
http://www.w3.org/TR/REC-xml/#NT-NameStartChar if you are really
concerned about the full list). Almost all characters are permitted in
names except delimiters.

One other thing to remember is not to start any of these names with
"xml" in any combination of upper and lower case, since such names are
reserved.

Here's the quote:
--------------------
"The first character of a Name MUST be a NameStartChar, and any other
characters MUST be NameChars; this mechanism is used to prevent names
from beginning with European (ASCII) digits or with basic combining
characters. Almost all characters are permitted in names, except those
which either are or reasonably could be used as delimiters. The
intention is to be inclusive rather than exclusive, so that writing
systems not yet encoded in Unicode can be used in XML names. See J
Suggestions for XML Names for suggestions on the creation of names.

"Document authors are encouraged to use names which are meaningful words
or combinations of words in natural languages, and to avoid symbolic or
white space characters in names. Note that COLON, HYPHEN-MINUS, FULL
STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly
permitted.

"The ASCII symbols and punctuation marks, along with a fairly large
group of Unicode symbol characters, are excluded from names because they
are more useful as delimiters in contexts where XML names are used
outside XML documents; providing this group gives those contexts hard
guarantees about what cannot be part of an XML name. The character
#x037E, GREEK QUESTION MARK, is excluded because when normalized it
becomes a semicolon, which could change the meaning of entity references."

Carlos Pascual Izarra

Jan 28, 2010, 12:57:03 PM

to ma...@googlegroups.com

On Thursday 28 January 2010 18:06:58 Pete Jemian wrote:
> There is a discussion about naming conventions that is summarized by
> Carlos about adopting various standards:
>
> Carlos Pascual Izarra wrote:
> > ...
> > -The naming conventions for attributes and groups.
>
> This message is long. My summary: I suggest to use XML attribute name
> rules since they are flexible and well-considered by a standards body
> that thought about this a lot. It does not resolve issues of
> camelCase or whether names_should_have_underscored or hyphenated-names.

I am trying to find the doc where the NeXus people define their convention. I
am sure I once read about it, but I cannot find it again... the only thing I
got is slide 27 of:
http://www.cacr.caltech.edu/projects/ARCS/ARCS/Review/Osborn.ppt

Pete Jemian

Jan 28, 2010, 1:37:54 PM

to ma...@googlegroups.com

The NeXus manual currently provides very little constraint here.

From: http://www.nexusformat.org/images/f/fc/NeXusManual.pdf
Appendix A.1 Overview of NeXus classes
page 38/130 (PDF document page 54 of 146),

Name
Short name of the data field. Name must satisfy both HDF and XML naming rules.

Attributes
Attributes are additional metadata used to define this variable. Attributes are identified with a leading "at" symbol
(@) and belong with the preceding field or group. In the example above, the program_name element has two attributes:
version (required) and configuration (optional) while the thumbnail element has one attribute: mime_type
(optional).

Matt Newville

Jan 29, 2010, 12:16:10 PM

to ma...@googlegroups.com

Hi,

I think that the Nexus approach toward names, which is (correct me if I
have this wrong)
A Name for a Group, Dataset or attribute
must be a valid HDF5 and XML name.

is a bit too weak. To verify that a name is allowed, does one check both?
I don't actually see a simple grammar production for HDF5 names (I
believe it may simply be "char*"). Spaces and non-printable ASCII
characters are definitely allowed, and I suspect that unicode support
in names may vary with HDF5 versions and libraries.

I think non-printable characters and whitespace should be avoided. If
I read it correctly, one of the examples in the Nexus doc has a
dataset named " data " (Example 3.1, page 16):
<NXdata name=" data " >
<time_of_flight axis= 1 primary= 1 > 1500.0 1502.0 1504.0 ...
</time_of_flight>
<polar_angle axis= 2 primary= 1 > 15.0 15.6 16.2 ... </polar_angle>
<data > 5 7 14 ... </data>
</NXdata>

That could be unintentional, but (if I understand correctly) the
corresponding HDF5 file would have a Group named " data ", which is
allowed (both HDF5 and XML). That seems problematic to me (what if
there are Groups named 'data', ' data ', and ' data'?). I recommend a
much simplified variation of the XML grammar production that doesn't
allow whitespace, non-printable characters, or most punctuation in
names. Specifically, I suggest

Names for Groups, Datasets, and attributes must match:
NameStartChar ::= _ | a..z | A..Z
NameChar ::= NameStartChar | . | 0..9
Name ::= NameStartChar (NameChar)*

Or, as a regular expression: [_a-zA-Z][_a-zA-Z.0-9]*

We could consider other punctuation characters, such as '@$&~|:-', but
I think we could easily live without these too.
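
A minimal sketch of how the proposed rule could be checked in Python, using the regular expression given above. The function name is_valid_name is made up for illustration; re.fullmatch is used so that the whole name, not just a prefix, has to match.

```python
import re

# The pattern proposed above: first character is an underscore or an
# ASCII letter; later characters may also be digits or a period.
NAME_RE = re.compile(r"[_a-zA-Z][_a-zA-Z.0-9]*")


def is_valid_name(name):
    """Return True if the whole name matches the proposed grammar."""
    return NAME_RE.fullmatch(name) is not None


for candidate in ("time_of_flight", "data", " data ", "1counter",
                  "data.raw", "sample-temperature"):
    print(repr(candidate), is_valid_name(candidate))
```

Under this rule " data " with surrounding spaces is rejected, as are names that start with a digit or contain a hyphen.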

Any comments?

Cheers,

--Matt Newville

Pete R. Jemian

Jan 29, 2010, 1:08:19 PM

to ma...@googlegroups.com

Matt:

You are right. That example is disturbing and is
one of the things that needs to be cleaned up.

Matt's proposition looks sound and looks like a
good suggestion for NeXus, as well.

IMHO, don't use hyphens or the other special characters.
Why? If using the name as a variable name, any of those
characters violate naming rules in some popular languages
and need to be mapped to safe characters.
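
To illustrate the point with one language, the following Python sketch tests whether a name could be used directly as a variable or attribute name, and shows one possible mapping to safe characters. The helper to_safe_identifier and its replace-with-underscore rule are assumptions made for this example, not something NeXus prescribes.

```python
import re


def to_safe_identifier(name):
    """Hypothetical mapping: replace every character outside
    [_a-zA-Z0-9] with an underscore, and prefix an underscore if the
    result is empty or starts with a digit."""
    safe = re.sub(r"[^_a-zA-Z0-9]", "_", name)
    if not safe or safe[0].isdigit():
        safe = "_" + safe
    return safe


for name in ("time_of_flight", "sample-temperature", "data.raw", "2theta"):
    # str.isidentifier() reports whether Python would accept the name
    # as an identifier without any mapping.
    print(name, name.isidentifier(), to_safe_identifier(name))
```

Note that such a mapping is not reversible: "a-b" and "a_b" would both become "a_b", which is one more reason to keep the special characters out of the names in the first place.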

Ray Osborn used to be more involved with NeXus.
From slides he gave in a talk about NeXus in 2002,
there are these comments about a "naming convention"
that do not appear (yet) in the NeXus manual.

1 Lower case letters are used throughout, except
for common symbols and abbreviations such as FWHM.
2 Names are constructed from full words separated by
the underscore character e.g. time_of_flight.
3 For sequentially indexed group names, the sequential
number is simply appended to the name, e.g. filter1,
filter2. This convention should be used only for
data group names.
4 The hierarchical structure of NeXus files should be
used to simplify data names. e.g. “temperature” ,
not “sample_temperature”.

On inspection just now (checked facts before writing
something incorrect), all the "field" declarations
in the NeXus classes (NXDL files) consistently
use all lower case with under_score delimiters as
needed for the "name".

All "group" declarations use the same convention except for:
NXsample: <group name="external_ADC" type="NXlog">
Maybe the NIAC should take a look at this one for uniformity.

The "attribute" declarations mostly adhere to this,
but there are a few exceptions: "URL", "NeXus_version",
"HDF_version", "HDF5_version", and "XML_version".

The example data files and examples of data files in the manual
are much more inconsistent in how things are named, whether
camelCase, under_scores, or hyphenated-names. Since some of
these examples have been around for many years, we might assume
they are acceptable and not invalid. That is a very weak
specification of what is allowed. But a naming convention
is different from a declaration of what is allowed. Seems
that what is allowed is rather broad.

Pete

--

Pete R. Jemian

Jan 29, 2010, 3:33:11 PM

to ma...@googlegroups.com

Followup from NeXus Tech Committee:
The NeXus aim was to stick to character sequences that were
also valid as program variable names; this allows programming language
classes/structures to be built that mirror a defined file structure. The
scheme below allows "." which is usually invalid in program
variable names, instead being reserved as an operator. The expression
"[_a-zA-Z][_a-zA-Z0-9]*" fits with the NeXus aim and is probably the
best to use, but I've just noticed a small mistake in
http://svn.nexusformat.org/definitions/trunk/NeXus.xsd as it uses
"[_a-zA-Z0-9]+" for the "validName" restriction thus allowing variable
names to start with a digit rather than only contain them; we should
update it to "[_a-zA-Z][_a-zA-Z0-9]*"
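
To make the difference between the two expressions concrete, here is a small Python sketch (an illustration only, not part of the NeXus tooling). The function name `check` and the sample names are assumptions; the two patterns are the ones quoted above. The third result uses Python's own `str.isidentifier()` as one example of a "program variable name" rule, which is somewhat broader than the ASCII-only pattern.

```python
import re

# The "validName" pattern as quoted above from NeXus.xsd.
XSD_PATTERN = re.compile(r"[_a-zA-Z0-9]+")
# The suggested correction: digits allowed, but not as the first character.
PROPOSED_PATTERN = re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")


def check(name: str) -> tuple[bool, bool, bool]:
    """Return (matches XSD pattern, matches proposed pattern, is a Python identifier)."""
    return (
        XSD_PATTERN.fullmatch(name) is not None,
        PROPOSED_PATTERN.fullmatch(name) is not None,
        name.isidentifier(),
    )


# Hypothetical names chosen to exercise each rule.
for name in ["time_of_flight", "filter1", "2theta", "polar.angle"]:
    print(f"{name:16} {check(name)}")
```

A name such as "2theta" passes the first pattern but not the second, and it could not be used as an attribute or variable name in most languages.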

Pete R. Jemian

Jan 29, 2010, 3:41:47 PM

to ma...@googlegroups.com

simply:

NameStartChar ::= _ | a..z | A..Z

NameChar ::= NameStartChar | 0..9
Name ::= NameStartChar (NameChar)*

Or, as a regular expression:

[_a-zA-Z][_a-zA-Z0-9]*

On 1/29/2010 2:33 PM, Pete R. Jemian wrote:
>
> Followup from NeXus Tech Committee:
> The NeXus aim was to stick to character sequences that were
> also valid as program variable names; this allows programming language
> classes/structures to be built that mirror a defined file structure. The
> scheme below allows "." which is usually invalid in program
> variable names, instead being reserved as an operator.

Darren Dale

Jan 29, 2010, 8:54:07 PM

to ma...@googlegroups.com

I think I remember when we discussed units at the workshop that units
had to conform to the UDUnits specification, but also that certain
units had to be used, for example lengths had to be listed in meters.
Did I misunderstand? Also, there is a formative UDUnits2 package which
supports Unicode symbols for some units. Has there been any discussion
of UDUnits2 in NeXus?

In case anyone is interested, I have written a python package for
dealing with physical quantities (physical constants and arrays with
units), based on numpy, called quantities, which can be found at
http://packages.python.org/quantities/ . It should be compatible with
specifying units using strings according to udunits. The package is
functional and useful, and tested, but test coverage is not so
complete that I would advertise it for rocket science.

Darren

Carlos Pascual Izarra

Feb 1, 2010, 3:26:54 AM

to ma...@googlegroups.com

The proposition ( [_a-zA-Z][_a-zA-Z0-9]* ) seems the best to me for *valid*
names. And I would also at least *encourage* the stricter convention by R
Osborn:

1 Lower case letters are used throughout, except
for common symbols and abbreviations such as FWHM.
2 Names are constructed from full words separated by
the underscore character e.g. time_of_flight.
3 For sequentially indexed group names, the sequential
number is simply appended to the name, e.g. filter1,
filter2. This convention should be used only for
data group names.
4 The hierarchical structure of NeXus files should be
used to simplify data names. e.g. “temperature” ,
not “sample_temperature”.

--

Darren Dale

Feb 1, 2010, 7:44:41 AM

to ma...@googlegroups.com

On Mon, Feb 1, 2010 at 3:26 AM, Carlos Pascual Izarra
<carlos....@cells.es> wrote:
> The proposition ( [_a-zA-Z][_a-zA-Z0-9]* ) seems the best for me for *valid*
> names. And I would also at least *encourage* the stricter convention by R
> Osborn:
>
> 1 Lower case letters are used throughout, except
> for common symbols and abbreviations such as FWHM.
> 2 Names are constructed from full words separated by
> the underscore character e.g. time_of_flight.
> 3 For sequentially indexed group names, the sequential
> number is simply appended to the name, e.g. filter1,
> filter2. This convention should be used only for
> data group names.
> 4 The hierarchical structure of NeXus files should be
> used to simplify data names. e.g. “temperature” ,
> not “sample_temperature”.

+1

"V. Armando Solé"

Feb 1, 2010, 9:40:00 AM

to ma...@googlegroups.com

>> used to simplify data names. e.g. “temperature” ,
>> not “sample_temperature”.
>>
>

> +1
>
>
Well, I think point 3 can be improved (at least that is my point of view).

Let me try to explain what I have in mind.

If, instead of point 3, we agreed to use spaces only for sequentially
indexed group names, a simple "space splitting" plus "number conversion"
would make it easy to order the groups without nasty surprises such as
group11 being listed before group2. Also, MCA1 and MCA2 do not carry the
same meaning as "MCA 1" and "MCA 2".

The ambiguity shown in the points below would then not exist:

1 - MCA1 and MCA2 -> two sequentially indexed MCA groups or
2 - MCA1 and MCA2 -> two non sequentially indexed MCA1 and MCA2 groups?
3 - "MCA11" and "MCA12" -> two sequentially indexed MCA1 groups or MCA
groups?

With my hint:

- MCA1 and MCA2 are two non-sequentially indexed groups, MCA1 and MCA2.
- "MCA 1" and "MCA 2" are two sequentially indexed MCA groups
- "MCA1 1" and "MCA1 2" are two sequentially indexed MCA1 groups

I used MCA on purpose, should I have used mca or MCA? :-) I think rule
1 is not needed because we'll mostly use an attribute based approach,
but as long as those points are taken as hints, I do not mind.

All I want to say is that, if we are going to avoid spaces, I guess
we miss an opportunity for easily ordering sequentially indexed groups
by using them (see above).
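
The "space splitting" plus "number conversion" idea can be sketched in a few lines of Python. This is only an illustration under the convention described above; the function name `sequence_key`, the `sep` parameter, and the sample names are made up for the example.

```python
def sequence_key(name: str, sep: str = " ") -> tuple[str, int]:
    """Split 'MCA 11' into ('MCA', 11); names without a numeric index get -1."""
    base, found, index = name.rpartition(sep)
    if found and index.isdecimal():
        return (base, int(index))
    return (name, -1)


# Hypothetical group names following the "name, space, index" convention.
names = ["MCA 11", "MCA 2", "MCA 1", "MCA1 2", "MCA1 1", "MCA 10"]

plain = sorted(names)                      # character-by-character string sort
ordered = sorted(names, key=sequence_key)  # base name first, then numeric index
print(plain)
print(ordered)
```

The plain sort puts "MCA 10" and "MCA 11" before "MCA 2", while the keyed sort keeps each sequence in numeric order. Because the separator is a parameter, the same function works with any other single separator character.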

Armando

Pete R. Jemian

Feb 1, 2010, 10:26:51 AM

to ma...@googlegroups.com

Which are we talking about?
* a naming convention for attributes
* a convention for the "name" attribute

In the first case, spaces are not advised.

In the second case, the appearance of spaces has a benefit as
noted by Armando but also introduces a problem to be solved
when identifying a link using the element tag (such as
"NXthing") and the name (such as "mca 1").

An example link syntax might look like this:
/NXentry/NXdetector/data['mca 1']

Pete

--

Darren Dale

Feb 1, 2010, 10:29:09 AM

to ma...@googlegroups.com

On Mon, Feb 1, 2010 at 9:40 AM, "V. Armando Solé" <so...@esrf.fr> wrote:
> Darren Dale wrote:
>>
>> On Mon, Feb 1, 2010 at 3:26 AM, Carlos Pascual Izarra
>> <carlos....@cells.es> wrote:
>>
>>>
>>> The proposition ( [_a-zA-Z][_a-zA-Z0-9]* ) seems the best for me for
>>> *valid*
>>> names. And I would also at least *encourage* the stricter convention by R
>>> Osborn:
>>>
>>> 1 Lower case letters are used throughout, except
>>> for common symbols and abbreviations such as FWHM.
>>> 2 Names are constructed from full words separated by
>>> the underscore character e.g. time_of_flight.
>>> 3 For sequentially indexed group names, the sequential
>>> number is simply appended to the name, e.g. filter1,
>>> filter2. This convention should be used only for
>>> data group names.
>>> 4 The hierarchical structure of NeXus files should be
>>> used to simplify data names. e.g. “temperature” ,
>>> not “sample_temperature”.
>>>
>>
>> +1
>>
>>
>
> Well, I think point 3 can be improved (at least is my point of view).
>
> Let me try to explain what I have in mind.
>
> If, instead of point 3, we would instead agree on not using spaces for
> anything else than sequentially indexed group names, a simple "space
> splitting" plus "number conversion" allows to easily order the groups
> without nasty surprises like having group11 listed prior to group2 and
> similar things. Also MCA1 and MCA2 do not carry the same meaning as "MCA 1"
> and "MCA 2".

There is a really compelling reason not to use spaces in names,
quoting Pete/NeXus Tech Committee:

"The NeXus aim was to stick to character sequences that were
also valid as program variable names; this allows programming language
classes/structures to be built that mirror a defined file structure."

This is why "." and "-" should not be used either. Could you live with
"MCA_1" and "MCA_2"?

Personally, I would be uncomfortable producing files where the
difference between "MCA1" and "MCA_1" is significant.

> Ambiguity shown point 1 and point 2 below does not exist:
>
> 1 - MCA1 and MCA2 -> two sequentially indexed MCA groups or
> 2 - MCA1 and MCA2 -> two non sequentially indexed MCA1 and MCA2 groups?
> 3 - "MCA11" and "MCA12" -> two sequentially indexed MCA1 groups or MCA
> groups?
>
> With my hint:
>
> - MCA1 and MCA2 are two non-sequentially indexed groups, MCA1 and MCA2.
> - "MCA 1" and "MCA 2" are two sequentially indexed MCA groups
> - "MCA1 1" and "MCA1 2" are two sequentially indexed MCA1 groups

What is wrong with MCA1_1 and MCA1_2?

> I used MCA on purpose, should I have used mca or MCA? :-) I think rule 1 is
> not needed because we'll mostly use an attribute based approach, but as far
> as those points are taken as a hints, I do not mind.
>
> All what I want to say is that, if we are going to avoid spaces, I guess we
> miss an opportunity for easily ordering sequentially indexed groups by using
> them (see above)

I think the issue could be addressed without using spaces, and there
is a really compelling reason to try to do so.

Cheers,
Darren

"V. Armando Solé"

Feb 1, 2010, 10:30:42 AM

to ma...@googlegroups.com

Pete R. Jemian wrote:
> Which are we talking about?
> * a naming convention for attributes
> * a convention for the "name" attribute

I was at least talking about the names given to groups (not to
attributes) as point 3 was indicating:

3 For sequentially indexed group names, the sequential
number is simply appended to the name, e.g. filter1,
filter2. This convention should be used only for
data group names.

nothing else.

Armando

"V. Armando Solé"

Feb 1, 2010, 10:36:05 AM

to ma...@googlegroups.com

Darren Dale wrote:

> On Mon, Feb 1, 2010 at 9:40 AM, "V. Armando Solé" <so...@esrf.fr> wrote:
>
>> Darren Dale wrote:
>>
>>> On Mon, Feb 1, 2010 at 3:26 AM, Carlos Pascual Izarra
>>> <carlos....@cells.es> wrote:
>>>
>>>
>>>> The proposition ( [_a-zA-Z][_a-zA-Z0-9]* ) seems the best for me for
>>>> *valid*
>>>> names. And I would also at least *encourage* the stricter convention by R
>>>> Osborn:
>>>>
>>>> 1 Lower case letters are used throughout, except
>>>> for common symbols and abbreviations such as FWHM.
>>>> 2 Names are constructed from full words separated by
>>>> the underscore character e.g. time_of_flight.
>>>> 3 For sequentially indexed group names, the sequential
>>>> number is simply appended to the name, e.g. filter1,
>>>> filter2. This convention should be used only for
>>>> data group names.
>>>> 4 The hierarchical structure of NeXus files should be
>>>> used to simplify data names. e.g. “temperature” ,
>>>> not “sample_temperature”.

Not much if we agree that we are appending not just a number as in
Carlos' example but an underscore and a number as you are suggesting.
That is certainly not what point 3 says. I said point 3 can certainly be
improved. That's all. My examples are to be taken as examples, not as
the last word on the subject.

Armando

Matt Newville

Feb 1, 2010, 10:54:30 AM

to ma...@googlegroups.com

Hi,

Carlos wrote:
> 1 Lower case letters are used throughout, except
> for common symbols and abbreviations such as FWHM.

+1
Perhaps we should be explicit on what "common symbols and
abbreviations" would be allowed to be upper case. No strong
preference. I *do* like Pete Jemian's suggestion that the names be
constructed so that they may be converted to/from program
symbol/variable names. Since some languages are case-insensitive in
variable names, it might be best to simply restrict names to be
strictly lower case.

> 2 Names are constructed from full words separated by
> the underscore character e.g. time_of_flight.

+1

> 3 For sequentially indexed group names, the sequential
> number is simply appended to the name, e.g. filter1,
> filter2. This convention should be used only for
> data group names.

Comment below.

> 4 The hierarchical structure of NeXus files should be
> used to simplify data names. e.g. “temperature” ,
> not “sample_temperature”.

OK. As a comment, Point 2 and 4 seem a little contradictory to me.
If 'temperature' is better than 'sample_temperature', where would
other temperatures go? What if there are multiple readings of the
sample temperature?

Also, if 'temperature' of the Sample Group is better than
'sample_temperature' why is 'time_of_flight' better than 'time' or
'flighttime'? Could there be a "flight" group with a "time"
dataset?

Armando wrote:
> Well, I think point 3 can be improved (at least is my point of view).
>
> Let me try to explain what I have in mind.
>
> If, instead of point 3, we would instead agree on not using spaces for
> anything else than sequentially indexed group names, a simple "space
> splitting" plus "number conversion" allows to easily order the groups
> without nasty surprises like having group11 listed prior to group2 and
> similar things. Also MCA1 and MCA2 do not carry the same meaning as "MCA 1"
> and "MCA 2".
>
> Ambiguity shown point 1 and point 2 below does not exist:
>
> 1 - MCA1 and MCA2 -> two sequentially indexed MCA groups or
> 2 - MCA1 and MCA2 -> two non sequentially indexed MCA1 and MCA2 groups?
> 3 - "MCA11" and "MCA12" -> two sequentially indexed MCA1 groups or MCA
> groups?
>
> With my hint:
>
> - MCA1 and MCA2 are two non-sequentially indexed groups, MCA1 and MCA2.
> - "MCA 1" and "MCA 2" are two sequentially indexed MCA groups
> - "MCA1 1" and "MCA1 2" are two sequentially indexed MCA1 groups
>
> I used MCA on purpose, should I have used mca or MCA? :-) I think rule 1 is
> not needed because we'll mostly use an attribute based approach, but as far
> as those points are taken as a hints, I do not mind.
>
> All what I want to say is that, if we are going to avoid spaces, I guess we
> miss an opportunity for easily ordering sequentially indexed groups by using
> them (see above)

I'm +1 if 'space' is replaced with 'underscore' . For MCA v mca, I
slightly prefer all lower case.

I do not understand "I think rule 1 is not needed because we'll mostly
use an attribute based approach, ....". We are discussing rules for
naming of groups, datasets, and attributes, no? Or perhaps: what is
an attribute based approach?

--Matt

"V. Armando Solé"

Feb 1, 2010, 11:08:00 AM

to ma...@googlegroups.com

Hi,

Matt Newville wrote:
> I'm +1 if 'space' is replaced with 'underscore' . For MCA v mca, I
> slightly prefer all lower case.
>

No problem from my side. Appending just one number could be confusing.
Appending an underscore plus a number should solve the issues I was
pointing at, it should allow easy ordering and does not break other rules.

> I do not understand "I think rule 1 is not needed because we'll mostly
> use an attribute based approach, ....". We are discussing rules for
> naming of groups, datasets, and attributes, no?

I was just discussing point three that was, at least as Carlos presented
it, to be applied to groups and not to attributes.

> Or perhaps: what is
> an attribute based approach?
>

By that I mean that the type of the contents of a group will very often
be identified by an attribute (for instance NXclass in the NeXus way of
thinking, class or NXclass in the case of phynx) or by a definition, so
the actual name is secondary.

Armando

Darren Dale

Feb 1, 2010, 11:20:43 AM

to ma...@googlegroups.com

The assumption here is that the "temperature" item is located in a
group representing the sample.

> Also, if 'temperature' of the Sample Group is better than
> 'sample_temperature' why is 'time_of_flight' better than 'time' or
> 'flighttime'?

Probably because the term "time of flight" is unambiguous and also
very commonly (universally?) used to refer to this technique, the
others you suggested are not.

> Could there be a "flight" group with a "time"
> dataset?

I don't understand the motivation behind that abstraction.

Point 1 is largely an issue of style. If groups or datasets can have
an arbitrary name (as opposed to "if you have an MCA you have to name
it 'MCA' or the program will not recognize it"), then point 1 is a
recommendation of style, not syntax.

Concerning an attribute-based approach, if you want to find all the
groups that are MCA detectors, you can inspect the groups for an
attribute that declares the group represents an MCA. (This risks
spinning off into a tangent discussion; if you want to pursue it, let's
start another thread.)
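
To give an idea of what such an attribute-based lookup might look like, here is a minimal Python sketch. It does not use a real HDF5 library: the `Group` class is a stand-in for an HDF5 group, and the attribute name "class", its values, and the group names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Group:
    """Minimal stand-in for an HDF5 group: a name, attributes, and child groups."""
    name: str
    attrs: dict[str, str] = field(default_factory=dict)
    children: list["Group"] = field(default_factory=list)


def find_by_attribute(root: Group, key: str, value: str, path: str = "") -> Iterator[str]:
    """Yield the path of every group below root whose attribute `key` equals `value`."""
    for child in root.children:
        child_path = f"{path}/{child.name}"
        if child.attrs.get(key) == value:
            yield child_path
        # Recurse so that nested groups are inspected as well.
        yield from find_by_attribute(child, key, value, child_path)


# Hypothetical file layout: the group names are arbitrary, the attribute says what they are.
root = Group("", children=[
    Group("scan_1", {"class": "entry"}, [
        Group("vortex", {"class": "mca"}),
        Group("ion_chamber", {"class": "counter"}),
        Group("xflash", {"class": "mca"}),
    ]),
])
mca_paths = list(find_by_attribute(root, "class", "mca"))
print(mca_paths)
```

The group names carry no meaning for the search; only the attribute does, which is why the names can stay close to what the equipment is called at the beamline.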

Cheers,
Darren

Matt Newville

Feb 1, 2010, 5:39:59 PM

to ma...@googlegroups.com

Hi Darren, All,

I don't disagree with any of your comments. Trying to go one step at
a time, I had been limiting comments to naming conventions for groups,
datasets, and attributes, focusing on just the syntax of what is an
"allowed name": that appears to be hard enough! FWIW, I'm OK with ":"
(I don't mind "@" or "$", either), neutral on mixed-case, but am not
in favor of spaces.

When we talk about a "Sample" group having a "temperature" (dataset or
attribute) while "time_of_flight" is understood to be its own object,
that's definitely discussing what we want the names to mean. (and
since 'time_of_flight' parses a lot like 'temperature_of_sample', are
we sure it is clear?) I think we don't want to say that "foo_bar"
should mean thing "foo" of group "bar" in general. So saying "Don't
use 'sample_temperature', use Sample/temperature" already assumes many
things: that the temperature of the sample is held in
"sample_temperature", that there will be a group Sample that holds
information about the sample, and that Sample/temperature is
the best place to hold that data (is it an attribute or a dataset? if
it varies during the experiment, does it go in a measurement group
instead?). I'm not against this, but there's a lot going on there.
Have we actually decided on a Sample group? What are the groups that
should be included?

If I understand "attribute based" correctly, you and Armando mean that
datasets (perhaps groups?) are allowed to have arbitrary names (within
naming rules) without pre-defined semantics. Instead, pre-defined,
meaningful names of attributes for a dataset provide identifying
information to determine what type or class of data is held.

As far as I understand it, this is slightly different from "take what
we can from Nexus". Nexus has groups named NXEntry and NXSample; it
does not have groups named "Next measurement" with an attribute such
as "NxClass = 'entry'", or "Ge_14_AS1" with attribute "NxClass =
'sample'". Or am I wrong on this? Are you proposing that all groups
and datasets have their class determined by a 'Class' attribute?

I think having names for Groups and Datasets that signify the
class/type of object held is a simpler approach. I don't think
attribute lookup is expensive, but does every group and dataset need a
'class' or 'type' attribute? Why not instead have the class/type in
the name (which is already required and cannot be forgotten) and have
an optional label attribute that would hold the "arbitrary name". As
a bonus, the label would not have to adhere to the naming conventions
for groups, datasets, and attributes.

Sorry if this is too long or too dull for most of the people on the
list. I'm just trying to think these things through....

Cheers,

--Matt

Pete Jemian

Feb 1, 2010, 6:26:52 PM

to ma...@googlegroups.com

NeXus deviates from that convention as well. Look at NXsample, for example:
(http://trac.nexusformat.org/definitions/browser/trunk/base_classes/NXsample.nxdl.xml)

or

or

The point is that Ray's convention seems to be generally followed
but sometimes it is a choice to deviate from the convention.

On 2/1/2010 4:39 PM, Matt Newville wrote:
> Hi Darren, All,
>
> I don't disagree with any of your comments. Trying to go one step at
> a time, I had been limiting comments to naming conventions for groups,
> datasets, and attributes, focusing on just the syntax of what is an
> "allowed name": that appears to be hard enough! FWIW, I'm OK with ":"
> (I don't mind "@" or "$", either), neutral on mixed-case, but am not
> in favor of spaces.

":" is only used as a delimiter in links, to identify and differentiate a
specific instance of a class; it is not used in the names of things.

>
> When we talk about a "Sample" group having a "temperature" (dataset or
> attribute) while "time_of_flight" is understood to be its own object,
> that's definitely discussing what we want the names to mean. (and
> since 'time_of_flight' parses a lot like 'temperature_of_sample', are
> we sure it is clear?) I think we don't want to say that "foo_bar"

> should mean thing "foo" of group "bar" in general....

"V. Armando Solé"

Feb 2, 2010, 3:14:51 AM

to ma...@googlegroups.com

Hi Matt,

Matt Newville wrote:
> Hi Darren, All,
>
> I don't disagree with any of your comments. Trying to go one step at
> a time, I had been limiting comments to naming conventions for groups,
> datasets, and attributes, focusing on just the syntax of what is an
> "allowed name": that appears to be hard enough! FWIW, I'm OK with ":"
> (I don't mind "@" or "$", either), neutral on mixed-case, but am not
> in favor of spaces.
>

I was in favor of spaces, but they seem to give more problems than
advantages. So, bye, bye spaces.

> If I understand "attribute based" correctly, you and Armando mean that
> datasets (perhaps groups?) are allowed to have arbitrary names (within
> naming rules) without pre-defined semantics. Instead, pre-defined,
> meaningful names of attributes for a datasets provide identifying
> information to determine what type or class of data is held.
>

I think you have understood it correctly. That is the situation now.

> As far as I understand it, this is slightly different from "take what
> we can from Nexus". Nexus has groups name NXEntry and NXSample, it
> does not have groups named "Next measurement" with an attribute such
> as "NxClass = 'entry'", or "Ge_14_AS1' with attribute "NxClass =
> 'sample'". Or I am wrong on this?

I would say you are slightly wrong but I am not a NeXus expert.

NeXus has the NXentry, which is supposed to be associated with a
measurement. The next measurement would be another entry in the same file.
Since they foresee fields named start_time and end_time (field names, not
attributes, because the name carries all the meaning), you have everything
you need to identify your next measurement.

When I parse HDF5 files, I look for start_time, end_time or a hidden
sequence to order the entries in the file. HDF5 always reports names in
alphabetical order, so it can be messy if some sequential ordering is not there.
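
As a rough sketch of that ordering logic, the Python snippet below sorts entries by start_time when it is present and falls back to the name otherwise. The entry names and timestamps are invented, plain dictionaries stand in for the groups read from a file, and the ISO 8601 timestamp format is an assumption of the example.

```python
from datetime import datetime

# Hypothetical entries: name -> fields found in that entry (start_time may be missing).
entries = {
    "entry_10": {"start_time": "2010-02-01T09:40:00"},
    "entry_2": {"start_time": "2010-02-01T10:26:51"},
    "alignment": {"start_time": "2010-01-29T13:08:19"},
    "notes": {},
}


def entry_sort_key(item):
    """Entries with a start_time sort chronologically; the rest follow, by name."""
    name, fields = item
    start = fields.get("start_time")
    if start is None:
        return (1, datetime.min, name)
    return (0, datetime.fromisoformat(start), name)


ordered = [name for name, _ in sorted(entries.items(), key=entry_sort_key)]
print(ordered)
```

With this key the entry names no longer need to encode the sequence, so they are free to be descriptive.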

> Are you proposing that all groups
> and dataset have their class determined by a 'Class' attribute?
>

Not really, that would take away a lot of flexibility.

> I think having names for Groups and Datasets that signify the
> class/type of object held is a simpler approach. I don't think
> attribute lookup is expensive, but does every group and dataset need a
> 'class' or 'type' attribute?

Certainly not. I would say it is advisable that at least detector
generated datasets carry information about what is in the dataset
(spectra, images, encoded images, ...). Darren and I will submit a
proposal to the list with the conclusion of the previous discussion.

> Why not instead have the class/type in
> the name (which is already required and cannot be forgotten) and have
> an optional label attribute that would hold the "arbitrary name". As
> a bonus, the label would not have to adhere to the naming conventions
> for groups, datasets, and attributes.
>

My main objection would be that generic HDF5 utilities might have
trouble there.

You can already find the extreme case in files from a certain
European synchrotron. You perform a scan of one motor and all you
see in the file is something like "actuator_1"; then you perform a
second scan with two motors and you see "actuator_1" and "actuator_2".
The problem is that the underlying motors can be totally different. For
a program aware of their convention it is easy to browse for the
information; for a generic program or a human it is painful.

> Sorry if this is too long or too dull for most of the people on the
> list. I'm just trying to think these things through....
>

Matt, I think all of us are learning from each other (at least I am
doing so). Personally I am not trying to endorse NeXus and I do not like
some of their conventions. I think some of their classes and conventions
are perfect for exchanging data but some other conventions make no sense
to me.

NXentry makes it possible to put some order into the file. One measurement
(or, as I understand it, a command originating the data) corresponds to a
top level NXentry in the file. You can have several of them in the file and
you can name them as you prefer (although they usually name them along the
lines of "Entry_XX" or something similar). Personally I prefer to give a
more meaningful name because I give priority to start_time or end_time
(if present) to sort them.

NXdata allows transparently mixing "meaningful name" based approaches
with "attribute" based ones. So, it should suit all needs.

NeXus already has some conventions for units. We need them for
exchanging data. Whether they are good conventions or not, I do not
know. To me they have the merit of being already there.

I repeat, just to remove ambiguities: during the workshop and in the
report I explicitly talked about NXdata providing a good framework for
exchanging data. It suits name based approaches and attribute based ones
and already deals with units. Perhaps I made a mistake when I wrote "While
there is not endorsement of NeXus, wherever appropriate we will be using
the NeXus definitions / conventions." I think we can do good work on
the definitions because it is almost "virgin territory" and we should be
able to find good common solutions. The "wherever appropriate" in the
same sentence as "conventions" is proving that it will be really hard work,
and I think that sentence has one word too many.

About the rest of the NeXus classes, I said almost nothing at the workshop
and wrote nothing in the report. Personally (= just one opinion that is
as valid as anyone else's) I do not see why I need an NXmonitor when
experience at beamlines shows that scientists normalize against one
detector or another at will without the need to have it defined as a
special detector. The sample temperature can be treated as a
"positioner" that can be scanned and does not need to be treated differently
from any other motor or "positioner", and so on. On the other hand, NeXus
definitions can point to what should be taken as monitor or as sample
temperature for a particular analysis.

This post is much longer than I expected. Sorry.

Armando

Carlos Pascual Izarra

Feb 2, 2010, 3:31:43 AM

to ma...@googlegroups.com

On Monday 01 February 2010 17:08:00 V. Armando Solé wrote:
> Matt Newville wrote:
> > I'm +1 if 'space' is replaced with 'underscore' . For MCA v mca, I
> > slightly prefer all lower case.
> >
>
> No problem from my side. Appending just one number could be confusing.
> Appending an underscore plus a number should solve the issues I was
> pointing at, it should allow easy ordering and does not break other rules.

I also prefer using underscore over space to separate the sequence index (even
though it would be against suggestion #3 from Ray, which is not separating at
all)

Darren Dale

Feb 3, 2010, 2:17:32 PM

to ma...@googlegroups.com

Hi Matt,

Sorry for the delayed response.

On Mon, Feb 1, 2010 at 5:39 PM, Matt Newville
<newv...@cars.uchicago.edu> wrote:
> When we talk about a "Sample" group having a "temperature" (dataset or
> attribute) while "time_of_flight" is understood to be its own object,
> that's definitely discussing what we want the names to mean. (and
> since 'time_of_flight' parses a lot like 'temperature_of_sample', are
> we sure it is clear?)

I think I see your point: the proposal seems to suggest that in some
cases, the names are meaningful, and that in other cases, it is the
attributes that are meaningful. By contrast, NeXus application
definitions ensure that certain elements must be present, and the
definitions specify both the type and the name of each element.

> I think we don't want to say that "foo_bar"
> should mean thing "foo" of group "bar" in general. So saying "Don't
> use 'sample_temperature', use Sample/temperature" already assumes many
> things: that the temperature of the sample is held in
> "sample_temperature", that there will be a group Sample that holds
> sample information about the sample, and that Sample/temperature is
> the best place to hold that data (is it an attribute or a dataset? if
> it varies during the experiment, does it go in a measurement group
> instead?).

I guess the location of the sample temperature (or any other item)
would depend on the application definition, which I do not think
should require a measurement group.

(I have more comments on this and the utility of the measurement group
that I will write about in another thread, I hope people will provide
feedback there because it relates to how we might develop application
definitions and how we can share data from experiments incorporating
multiple techniques.)

> I'm not against this, but there's a lot going on there.
> Have we actually decided on a Sample group? What are the groups that
> should be included?
>
> If I understand "attribute based" correctly, you and Armando mean that
> datasets (perhaps groups?) are allowed to have arbitrary names (within
> naming rules) without pre-defined semantics. Instead, pre-defined,
> meaningful names of attributes for a datasets provide identifying
> information to determine what type or class of data is held.
>
> As far as I understand it, this is slightly different from "take what
> we can from Nexus". Nexus has groups name NXEntry and NXSample

NXentry and NXsample are group types, I don't think they are ever used
as the names of actual groups in the file (judging from the
application definitions)

> it
> does not have groups named "Next measurement" with an attribute such
> as "NxClass = 'entry'", or "Ge_14_AS1' with attribute "NxClass =
> 'sample'". Or I am wrong on this?

It does have groups called "entry" with an NX_class attribute =
"NXentry", and somewhere in that entry there may be a required group
named "sample" that has NX_class = "NXsample".

> Are you proposing that all groups
> and dataset have their class determined by a 'Class' attribute?

That seems to be the way NeXus does it (using NX_class).

> I think having names for Groups and Datasets that signify the
> class/type of object held is a simpler approach. I don't think
> attribute lookup is expensive, but does every group and dataset need a
> 'class' or 'type' attribute? Why not instead have the class/type in
> the name (which is already required and cannot be forgotten) and have
> an optional label attribute that would hold the "arbitrary name". As
> a bonus, the label would not have to adhere to the naming conventions
> for groups, datasets, and attributes.

I guess it is a question of preference. If the community agreed that
the name in the hdf5 hierarchy should reflect the item's type, and the
label should appear as an attribute, then I would adjust the phynx
interface to allow me to navigate the tree according to the labels
(and hope that people would remember to provide a meaningful label
attribute) so I could continue to call things by their given names
like I do when interacting with the equipment at acquisition time.
(Just like I would prefer it if you called me by my given name instead
of US_citizen_123-45-6789 or cornell_employee_123456).

Darren

Pete R. Jemian

Feb 3, 2010, 2:55:10 PM

to ma...@googlegroups.com

Let me clarify this statement if I can:

On 2/3/2010 1:17 PM, Darren Dale wrote:
> It does have groups called "entry" with an NX_class attribute =
> "NXentry", and somewhere in that entry there may be a required group
> named "sample" that has NX_class = "NXsample".

In NeXus, there are various groups of the form NX_class where "_class"
might be entry, sample, scan, log, sas, monopd, ...
Of course, there is no NeXus class called "NX_class"; we just use that
here for generic reference.

The default name attribute for each of these groups is the name of the class
with the leading "NX" stripped away. BUT, that is just the default and
better judgment may compel the use of a different name attribute.
For example, if there are two NXentry elements in a NeXus file,
they might be named "entry1" and "entry2" or they might be named
using yet another motive.
HDF rules say that now two items at the same level can have the same
name attribute so they have to be different names.

Here's a few examples:
# for one NXentry, this defaults to name="entry"
<group type="NXentry">

# for two NXentry items, default names change to use an index
<group type="NXentry"> default name="entry1"
<group type="NXentry"> default name="entry2"

# differentiate between two detectors
<group type="NXdetector" name="mca1">
<group type="NXdetector" name="mca2">
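
The default-naming rule described in this message can be sketched in a few lines of Python. This is only an illustration of the rule as stated here (strip the leading "NX", add an index when a class appears more than once at the same level); the function name `default_group_names` is hypothetical and is not part of any NeXus API.

```python
from collections import Counter


def default_group_names(nx_classes):
    """Return a default name for each NeXus class name in nx_classes.

    Rule as described in the message: drop the leading "NX"; if the same
    class occurs more than once at this level, append a 1-based index.
    """
    totals = Counter(nx_classes)   # how often each class occurs in total
    seen = Counter()               # how often each class has been seen so far
    names = []
    for nx_class in nx_classes:
        if not nx_class.startswith("NX"):
            raise ValueError(f"not a NeXus class name: {nx_class!r}")
        base = nx_class[2:]
        if totals[nx_class] > 1:
            seen[nx_class] += 1
            names.append(f"{base}{seen[nx_class]}")
        else:
            names.append(base)
    return names
```

A writer is still free to override these defaults, as with the "mca1" and "mca2" detectors above.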

No matter what, though, the name attribute must always match this
regular expression: [A-Za-z_][A-Za-z0-9_]* and is limited to no more than 63
characters by an HDF5 rule. So the name of a NeXus group is:
NX[A-Za-z_]\w*

Within any NeXus group, the children may be either groups or fields.
The "field" is a named item that holds information. Data or metadata.
Rules for naming the "field" are exactly the same as the name attribute
in groups: [A-Za-z_][A-Za-z0-9_]*
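
As a rough illustration, the naming rule above can be checked with Python's standard `re` module. The pattern and the 63-character limit are taken from this message as stated; the helper name `is_valid_name` is made up for this sketch.

```python
import re

# Pattern quoted in the message for group and field names.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Length limit quoted in the message (attributed there to an HDF5 rule).
MAX_NAME_LENGTH = 63


def is_valid_name(name):
    """Return True if name matches the whole pattern and is short enough."""
    return (
        len(name) <= MAX_NAME_LENGTH
        and NAME_PATTERN.fullmatch(name) is not None
    )
```

Under this check "mca1" and "Ge_14_AS1" pass, while "Next measurement" fails because of the space.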

Darren Dale

Feb 3, 2010, 3:34:35 PM

to ma...@googlegroups.com

On Wed, Feb 3, 2010 at 2:55 PM, Pete R. Jemian <prje...@gmail.com> wrote:
> Let me clarify this statement if I can:
>
> On 2/3/2010 1:17 PM, Darren Dale wrote:
>>
>> It does have groups called "entry" with an NX_class attribute =
>> "NXentry", and somewhere in that entry there may be a required group
>> named "sample" that has NX_class = "NXsample".
>
> In NeXus, there are various groups of the form NX_class where "_class"
> might be entry, sample, scan, log, sas, monopd, ...
> Of course, there is no NeXus class called "NX_class"; we just use that
> here for generic reference.

Sorry, this misunderstanding maybe comes from your emphasis on xml,
and my experience interacting with nexus hdf5 files using h5py. If I
open the ID34_not_complete.h5 file that was distributed at the
workshop, and I inspect the hdf5 attributes of the "/entry" group, I
find one attribute called 'NX_class', and its value is "NXentry".

> The default name attribute for each of these groups is the name of the class
> with the leading "NX" stripped away. BUT, that is just the default and
> better judgment may compel the use of a different name attribute.
> For example, if there are two NXentry elements in a NeXus file,
> they might be named "entry1" and "entry2" or they might be named
> using yet another motive.

When I look at the nexus application definition (like nxtomo), I see
something like:

<group type="NXentry" name="entry">

does that mean that at this point in the hierarchy, the file must
contain a group of type "NXentry" that is named "entry"? If it were an
NXentry but named something else, and I tried to validate the file
using nexus formative validation tool, what would be the expected
result?

> HDF rules say that now

I think you meant "no" instead of "now", right?

> two items at the same level can have the same
> name attribute so they have to be different names.
>
> Here's a few examples:
> # for one NXentry, this defaults to name="entry"
> <group type="NXentry">
>
> # for two NXentry items, default names change to use an index
> <group type="NXentry"> default name="entry1"
> <group type="NXentry"> default name="entry2"
>
> # differentiate between two detectors
> <group type="NXdetector" name="mca1">
> <group type="NXdetector" name="mca2">
>
> No matter what, though, the name attribute must always match this
> regular expression: [A-Za-z_][A-Za-z0-9_]* and is limited to no more than 63
> characters by an HDF5 rule. So the name of a NeXus group is:
> NX[A-Za-z_]\w*

Seeking clarification here: I think you mean the names of the
various *kinds* of NeXus groups, not the names of the groups
themselves. The name of the group found at "/entry" is "entry", and
its type is "NXentry". Also, there is no hdf5 attribute called "name"
associated with the nodes in NeXus hdf5 files. The name attribute is
specific to the xml format.

Darren

Pete Jemian

Feb 3, 2010, 6:51:24 PM

to ma...@googlegroups.com

On 2/3/2010 2:34 PM, Darren Dale wrote:
> ... If I
> open the ID34_not_complete.h5 file that was distributed at the
> workshop, and I inspect the hdf5 attributes of the "/entry" group, I
> find one attribute called 'NX_class', and its value is "NXentry".

That file is an example of a file we write here at APS
that has some NeXus structure and some non-NeXus structure.
It was not written with the NeXus API and is not certain to be valid.

That said, I must learn more about how things are actually named/stored in NeXus HDF5 files.

> When I look at the nexus application definition (like nxtomo), I see
> something like:
>
> <group type="NXentry" name="entry">
>
> does that mean that at this point in the hierarchy, the file must
> contain a group of type "NXentry" that is named "entry"?

yes

> If it were an
> NXentry but named something else, and I tried to validate the file
> using nexus formative validation tool, what would be the expected
> result?

It is perfectly valid NeXus but
it would fail against the NXtomo application definition.
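
To make the distinction concrete, here is a small Python sketch of the kind of check an application definition implies. It is not the NeXus validation tool; the `Group` class, the `missing_requirements` function, and the group name "scan_001" are all hypothetical, and the requirement list stands in for a line such as <group type="NXentry" name="entry">.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Toy stand-in for a group in a file: a name, a type, and children."""
    name: str
    nx_class: str
    children: list = field(default_factory=list)


def missing_requirements(parent, required):
    """Return the (nx_class, name) pairs in required that parent lacks."""
    present = {(child.nx_class, child.name) for child in parent.children}
    return [req for req in required if req not in present]


# The application definition asks for an NXentry that is named "entry".
required = [("NXentry", "entry")]

# An NXentry with another name does not satisfy that requirement,
# even though the group itself is of a legitimate type.
root = Group("/", "NXroot", [Group("scan_001", "NXentry")])
unmet = missing_requirements(root, required)
```

Here `unmet` contains the ("NXentry", "entry") requirement, which mirrors the answer above: a fine NXentry, but not what this particular application definition asks for.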

>
>> HDF rules say that now
>
> I think you meant "no" instead of "now", right?

yes, typing too fast today

>
>> two items at the same level can have the same
>> name attribute so they have to be different names.
>>
>> Here's a few examples:
>> # for one NXentry, this defaults to name="entry"
>> <group type="NXentry">
>>
>> # for two NXentry items, default names change to use an index
>> <group type="NXentry"> default name="entry1"
>> <group type="NXentry"> default name="entry2"
>>
>> # differentiate between two detectors
>> <group type="NXdetector" name="mca1">
>> <group type="NXdetector" name="mca2">
>>
>> No matter what, though, the name attribute must always match this
>> regular expression: [A-Za-z_][A-Za-z0-9_]* and is limited to no more than 63
>> characters by an HDF5 rule. So the name of a NeXus group is:
>> NX[A-Za-z_]\w*
>
> Seeking clarification here: I think you mean the names of the
> various *kinds* of NeXus groups, not the names of the groups
> themselves. The name of the group found at "/entry" is "entry", and
> its type is "NXentry". Also, there is no hdf5 attribute called "name"
> associated with the nodes in NeXus hdf5 files. The name attribute is
> specific to the xml format.

yes.
*kinds* == type (the term used in NXDL), same thing

And to be even more precise, the name attribute is specific to the NXDL format.
NeXus XML data files would take this name attribute and use it as the element name.
