quick comment about multidimensional hdf5 arrays

Darren Dale

Jan 23, 2010, 4:44:01 PM

to ma...@googlegroups.com

At the HDF5 workshop we talked a bit about the problem of identifying
the organization of data into multidimensional arrays. There is a
sort-of standard way to do this, where a 1-d scan of n_points
containing an n_channels spectrum at each point would have shape
(n_points, n_channels) [C indexing convention]. There was some
confusion about the nexus documentation, but apparently regularly
gridded 2-d data is stored as (nx,ny,n_channels) or
(ny,nx,n_channels). Herbert Bernstein suggested that, in addition to
such a standard, it would be a good idea to store information about
the identity of each dimension.
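
To make the C indexing convention concrete, here is a minimal pure-Python sketch of how a multi-index maps to a flat offset in a row-major array. The helper name `c_order_offset` and the sizes are made up for illustration; nothing here is taken from the HDF5 or NeXus APIs.

```python
def c_order_offset(index, shape):
    """Flat offset of a multi-index in a C-ordered (row-major) array."""
    if len(index) != len(shape):
        raise ValueError("index and shape must have the same length")
    offset = 0
    for i, n in zip(index, shape):
        if not 0 <= i < n:
            raise IndexError(f"index {i} out of range for dimension of size {n}")
        # The last dimension varies fastest.
        offset = offset * n + i
    return offset


n_points, n_channels = 5, 2048   # hypothetical 1-d scan of spectra
shape = (n_points, n_channels)

# Neighbouring channels of one spectrum are adjacent in the buffer ...
step_channel = c_order_offset((3, 1), shape) - c_order_offset((3, 0), shape)
# ... while neighbouring scan points are a whole spectrum apart.
step_point = c_order_offset((4, 0), shape) - c_order_offset((3, 0), shape)
print(step_channel, step_point)
```

With shape (n_points, n_channels), each point's spectrum is therefore one contiguous run of n_channels values.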

Judging from the nexus documentation at
http://www.nexusformat.org/NXdetector, looking at the "data" dataset,
it seems like there is already some support for the feature that
Armando was requesting: that the dimensions containing spectra or
images be identified with some sort of attribute. There is an optional
"axes" attribute that identifies the dimensions of the dataset. I'm
not sure if the nexus documentation is up to date, does anyone know?
Armando, does this meet your needs?

This also put me in mind of a project developed by Fernando Perez (of
ipython fame), called datarray (http://github.com/fperez/datarray). He
has recently been working on neuroimaging, and some of their data
consists of hyperdimensional arrays. He was unhappy with the potential
for indexing errors: do I want a[:,:,:,0,:] or a[:,:,0,:,:]? datarray
lets you name your axes so you can do a.axis.x[0]. It might be worth
incorporating features like these into phynx.
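
The named-axis idea can be sketched without any array library. The function below is a hypothetical helper, not datarray's API: it only turns axis names into the positional index tuple one would otherwise have to write by hand, and the axis names are invented for the example.

```python
def select(axis_names, **picks):
    """Build a positional index tuple from axis names.

    Axes that are not mentioned are left whole (equivalent to ':').
    """
    unknown = set(picks) - set(axis_names)
    if unknown:
        raise KeyError(f"unknown axis name(s): {sorted(unknown)}")
    return tuple(picks.get(name, slice(None)) for name in axis_names)


names = ("t", "x", "y", "z", "channel")   # hypothetical axis labels
index = select(names, z=0)                # same as a[:, :, :, 0, :]
print(index)
```

The resulting tuple can be passed to any object that accepts standard Python indexing, so the caller never has to count colons.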

Darren

Vicente Sole

Jan 24, 2010, 4:26:35 AM

to ma...@googlegroups.com

Hi Darren,

Quoting Darren Dale <dsda...@gmail.com>:

>
> Judging from the nexus documentation at
> http://www.nexusformat.org/NXdetector, looking at the "data" dataset,
> it seems like there is already some support for the feature that
> Armando was requesting: that the dimensions containing spectra or
> images be identified with some sort of attribute. There is an optional
> "axes" attribute that identifies the dimensions of the dataset. I'm
> not sure if the nexus documentation is up to date, does anyone know?
> Armando, does this meet your needs?
>

I was already aware of that, but that does not solve the problem. It
solves it for CHESS and for the ESRF but not for everybody. The axes
attribute only lets you know how to arrange the abscissas for a plot;
it does not tell you what kind of data you are dealing with unless you
have a "regular grid" and the dimensions match. Nor does it specify
whether the data for each grid point are in the first or the last
dimensions.

Example:
2 Axes dimensions 1000, 1000
Data dimensions = 1000, 1000, 1000

One would be very tempted to plot that as a regular mesh of 1D data,
but it could also be a potentially irregular sampling of 1000 images
of 1000x1000. Just saying the data are images solves the ambiguity.
As you can see, which dimensions are required is the other issue.
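
The ambiguity in this example can be shown with a small sketch. It assumes, purely for illustration, that the atomic data element occupies the trailing dimensions of a C-ordered dataset; the function name `split_shape` is made up.

```python
def split_shape(shape, intrinsic_ndim):
    """Split a dataset shape into (scan_shape, element_shape).

    Assumes the atomic data element (spectrum, image, ...) occupies the
    last `intrinsic_ndim` dimensions of a C-ordered dataset.
    """
    if not 0 <= intrinsic_ndim <= len(shape):
        raise ValueError("intrinsic_ndim does not fit the dataset shape")
    cut = len(shape) - intrinsic_ndim
    return tuple(shape[:cut]), tuple(shape[cut:])


shape = (1000, 1000, 1000)
print(split_shape(shape, 1))   # a 1000 x 1000 mesh of 1D data
print(split_shape(shape, 2))   # 1000 images of 1000 x 1000
```

The shape alone cannot distinguish the two readings; one extra piece of information about the element dimensionality can.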

In the CHESS/ESRF case, where you will always try to save your data as
(npoints, c_dim0, c_dim1, c_dim2, ..., c_dimN) you already know the
type of data you are dealing with and its ordering. A program can make
sure that the product of the axes dimensions specified in the axes
attribute matches the number of points (regular grid) or that each of
the axes dimensions is either 1 (constant position) or the number of
points (irregularly sampled data). I strongly recommend that way of
arranging things, but as I wrote in the workshop report, different
labs are saving data differently depending on whether they measure a
regular grid or not. I consider that a mistake, but everybody can be
happy just adding one attribute to the dataset and assuming C ordering
unless another attribute says otherwise. That was the conclusion of the
workshop and it was reflected in my report.
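
The consistency check described above could look roughly like the following sketch. The function name and the returned labels are invented for the example, and the data are assumed to be stored as (npoints, c_dim0, ..., c_dimN).

```python
from math import prod


def classify_axes(npoints, axis_lengths):
    """Classify axis arrays against the number of scan points.

    'regular'      : the product of the axis lengths equals npoints
    'irregular'    : every axis has length 1 (constant) or npoints
    'inconsistent' : neither rule holds
    """
    if prod(axis_lengths) == npoints:
        return "regular"
    if all(n in (1, npoints) for n in axis_lengths):
        return "irregular"
    return "inconsistent"


print(classify_axes(6, (2, 3)))   # 2 x 3 grid
print(classify_axes(6, (6, 6)))   # one position per point on each axis
print(classify_axes(6, (4, 5)))   # cannot describe 6 points
```

Note that a case such as axis lengths (6, 1) satisfies both rules; this sketch reports it as regular simply because that test runs first.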

Armando

Darren Dale

Jan 24, 2010, 7:59:26 AM

to ma...@googlegroups.com

On Sun, Jan 24, 2010 at 4:26 AM, Vicente Sole <so...@esrf.fr> wrote:
> Hi Darren,
>
> Quoting Darren Dale <dsda...@gmail.com>:
>
>>
>> Judging from the nexus documentation at
>> http://www.nexusformat.org/NXdetector, looking at the "data" dataset,
>> it seems like there is already some support for the feature that
>> Armando was requesting: that the dimensions containing spectra or
>> images be identified with some sort of attribute. There is an optional
>> "axes" attribute that identifies the dimensions of the dataset. I'm
>> not sure if the nexus documentation is up to date, does anyone know?
>> Armando, does this meet your needs?
>>
>
> I was already aware of that, but that does not solve the problem. It solves
> it for CHESS and for the ESRF but not for everybody. The axes attribute only
> lets you know how to arrange the abscissas for a plot, but does not let you
> know the data you are dealing with unless you are dealing with a "regular
> grid" and dimensions are matching. It does not specify either if the data
> for each grid point are at the first or the last dimensions.
>
> Example:
> 2 Axes dimensions 1000, 1000
> Data dimensions = 1000, 1000, 1000
>
> One would be very tempted to plot that as a regular mesh of 1D data, but it
> could also be a potentially irregular sampling of 1000 images of 1000x1000.
> Just saying the data are images solves the ambiguity. You can see that which
> dimensions are required is the other issue.

I agree that it would be useful to have metadata so datasets can
identify themselves as images or whatever.

The point I was making is that if each dimension (axis) is labeled,
there can be no ambiguity. If it's a 2D regular grid of 1d data, the
names of the axes could somehow communicate that: for example (nslow,
nfast, nchannels). If it were an irregular sampling of 2D images: (np,
x_offset, y_offset), where {x,y}_offset is the offset of the pixels in
the detector.

> In the CHESS/ESRF case, where you will always try to save your data as
> (npoints, c_dim0, c_dim1, c_dim2, ..., c_dimN) you already know the type of
> data you are dealing with and its ordering. A program can make sure that the
> product of the axes dimensions specified in the axes attribute matches the
> number of points (regular grid) or that each of the axes dimensions is
> either 1 (constant position) or the number of points (irregularly sampled
> data).

I think there is a misunderstanding here. The axes attribute as
described at the nexus detector page includes information about the
dimensionality of the data element (x_offset and y_offset correspond
to area detector pixel indices i and j). The problem with that page is
that for scans of dimensionality greater than 1, the number of scan
points dimension (length np) is broken up and divided into the
dimensions of the scan itself (which can only be done for regularly
gridded scans).

> I strongly recommend that way of arranging things, but as I wrote in
> the workshop report, different labs are saving data differently depending on
> whether they measure a regular grid or not. I consider that a mistake, but
> everybody can be happy just adding one attribute to the dataset and assuming
> C ordering unless another attribute says otherwise. That was the conclusion of
> the workshop and it was reflected in my report.

I thought your report suggested identifying in which dimensions of the
dataset one would find the atomic data element (a single area detector
image or spectrum). Is that incorrect? The point I was making is that
it would be beneficial to identify every dimension (as Herbert
suggested) and that nexus appears to have taken a first step in this
direction, at least for the detector data array.

Darren

Vicente Sole

Jan 24, 2010, 9:09:18 AM

to ma...@googlegroups.com

Quoting Darren Dale <dsda...@gmail.com>:

> On Sun, Jan 24, 2010 at 4:26 AM, Vicente Sole <so...@esrf.fr> wrote:
>
> I think there is a misunderstanding here. The axes attribute as
> described at the nexus detector page includes information about the
> dimensionality of the data element (x_offset and y_offset correspond
> to area detector pixel indices i and j). The problem with that page is
> that for scans of dimensionality greater than 1, the number of scan
> points dimension (length np) is broken up and divided into the
> dimensions of the scan itself (which can only be done for regularly
> gridded scans).

The misunderstanding here is that the very same attribute is used in
NXdata to specify how to arrange the plots:

http://www.nexusformat.org/NXdata

>
> I thought your report suggested identifying in which dimensions of the
> dataset one would find the atomic data element (a single area detector
> image or spectrum). Is that incorrect?

That's almost correct. We decided to go for two attributes.

One attribute says what we are dealing with, assuming C order to
identify its location unless a second attribute defines it. Herbert
suggested taking a look at img_CIF to see one possibility.

> The point I was making is that
> it would be beneficial to identify every dimension (as Herbert
> suggested) and that nexus appears to have taken a first step in this
> direction, at least for the detector data array.

Agreed, but it was also said we would take a look at how img_CIF
deals with the issue. In the meantime, we would assume C order
because it was the one most commonly used in the different labs, besides
being indirectly suggested on the NeXus web site. It is easier to
first define an attribute that says what we are dealing with before
moving on to more complex issues.

Armando

Darren Dale

Jan 24, 2010, 9:53:29 AM

to ma...@googlegroups.com

On Sun, Jan 24, 2010 at 9:09 AM, Vicente Sole <so...@esrf.fr> wrote:
> Quoting Darren Dale <dsda...@gmail.com>:
>
>> On Sun, Jan 24, 2010 at 4:26 AM, Vicente Sole <so...@esrf.fr> wrote:
>>
>> I think there is a misunderstanding here. The axes attribute as
>> described at the nexus detector page includes information about the
>> dimensionality of the data element (x_offset and y_offset correspond
>> to area detector pixel indices i and j). The problem with that page is
>> that for scans of dimensionality greater than 1, the number of scan
>> points dimension (length np) is broken up and divided into the
>> dimensions of the scan itself (which can only be done for regularly
>> gridded scans).
>
> The misunderstanding here is that the very same attribute is used in NXdata
> to specify how to arrange the plots:
>
> http://www.nexusformat.org/NXdata

My reading of that webpage is that the axes attribute is used in
exactly the same way as it is in the detector "data" dataset, but it
is not clear. Unfortunately the documentation on the nexus website is
misleading in some cases (like how to store 2D or higher scans), so
perhaps someone more familiar with how nexus is (intended to be) used
in practice can clarify.

>> I thought your report suggested identifying in which dimensions of the
>> dataset one would find the atomic data element (a single area detector
>> image or spectrum). Is that incorrect?
>
> That's almost correct. We decided to go for two attributes.
>
> One attribute says what we are dealing with, assuming C order to identify its
> location unless a second attribute defines it.

I'm sorry, I didn't completely follow that last statement. Could you
be more concrete, what exactly are the names and values of the
attributes, how do they relate to the dataset, and how are they
intended to be used?

I think we made some progress at the workshop, but we only had a
couple of hours to discuss a lot of implementation issues. I would
hope that the decisions of the workshop can still be discussed and
vetted. (Nexus itself has a year-long incubation period for this
purpose.)

Darren

Vicente Sole

Jan 24, 2010, 10:21:06 AM

to ma...@googlegroups.com, nexus-de...@nexusformat.org

Quoting Darren Dale <dsda...@gmail.com>:

> On Sun, Jan 24, 2010 at 9:09 AM, Vicente Sole <so...@esrf.fr> wrote:
>
> I'm sorry, I didn't completely follow that last statement. Could you
> be more concrete, what exactly are the names and values of the
> attributes, how do they relate to the dataset, and how are they
> intended to be used?

Well, since the NIAC accepted it, I would expect they would give a name.

If I have to propose it -perhaps one day I'll join the NIAC-, I would
go for a string attribute like:

NXintrinsic_dimension = "0D", "1D", "2D", "3D", ...

(or just NXintrinsic)

The goal of having a string attribute would be to be ready for some
more exotic detector acquisition modes, either present ("ListMode1D",
"ListMode2D", ...) or future.

>
> I think we made some progress at the workshop, but we only had a
> couple of hours to discuss a lot of implementation issues. I would
> hope that the decisions of the workshop can still be discussed and
> vetted. (Nexus itself has a year-long incubation period for this
> purpose.)
>

Well, I would expect us to try something first before deciding it is not useful.

Is anybody against the (agreed) attribute being named
NXintrinsic_dimension? Is it too long for the NeXus API?
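
For illustration only: if the attribute were a string like the values proposed above, a reader would have to parse it. The sketch below assumes values of the form `<optional mode><integer>D`; that grammar and the function name `parse_intrinsic` are assumptions made for this example, not an agreed NeXus convention.

```python
import re

# Optional alphabetic mode prefix, then an integer, then a literal "D".
_PATTERN = re.compile(r"([A-Za-z]*)(\d+)D")


def parse_intrinsic(value):
    """Return (mode, ndim) for strings such as "2D" or "ListMode1D".

    `mode` is None when the string carries only a dimensionality.
    """
    match = _PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"unrecognised intrinsic dimension: {value!r}")
    mode, ndim = match.groups()
    return (mode or None), int(ndim)


print(parse_intrinsic("2D"))
print(parse_intrinsic("ListMode1D"))
```

A string of this kind carries two pieces of information, so every reader needs the parsing step; an integer attribute would need none.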

Armando

Darren Dale

Jan 24, 2010, 12:00:55 PM

to ma...@googlegroups.com

On Sun, Jan 24, 2010 at 10:21 AM, Vicente Sole <so...@esrf.fr> wrote:
> Quoting Darren Dale <dsda...@gmail.com>:
>
>> On Sun, Jan 24, 2010 at 9:09 AM, Vicente Sole <so...@esrf.fr> wrote:
>>
>> I'm sorry, I didn't completely follow that last statement. Could you
>> be more concrete, what exactly are the names and values of the
>> attributes, how do they relate to the dataset, and how are they
>> intended to be used?
>
> Well, since the NIAC accepted it, I would expect they would give a name.
>
> If I have to propose it -perhaps one day I'll join the NIAC-, I would go for
> a string attribute like:
>
> NXintrinsic_dimension = "0D", "1D", "2D", "3D", ...

This is describing the dimensionality of the atomic data element (1D
for spectrum, 2D for area detectors), right? And so the attribute is
attached to the datasets themselves?

> (or just NXintrinsic)
>
> the goal of having a string attribute would be to be ready to some more
> exotic detector acquisition modes either present ("ListMode1D",
> "ListMode2D", ...) or future.

I don't recall this future extension being mentioned at the workshop.
It is not clear to me that it is a good idea to try to communicate so
much information in a single attribute. The information about the
acquisition mode can be communicated in another attribute, I don't see
what is gained by combining the two.

>> I think we made some progress at the workshop, but we only had a
>> couple of hours to discuss a lot of implementation issues. I would
>> hope that the decisions of the workshop can still be discussed and
>> vetted. (Nexus itself has a year-long incubation period for this
>> purpose.)
>>
>
> Well, I would expect first to try something prior to decide it is not
> useful.

That is what I mean by vetting.

> Is anybody against the (agreed) attribute being named NXintrinsic_dimension?
> Is it too long for the NeXus API?

I don't think it should be prefixed with "NX". None of the other
dataset attributes are prefixed with NX (only the data types from the
nexus API, and the NX_class attribute use this prefix, as far as I
know). I think "intrinsic_dimensionality" or "intrinsic_dimensions" is
descriptive, and I think it should be an integer value.

Darren

Pete Jemian

Jan 24, 2010, 12:17:12 PM

to ma...@googlegroups.com

No need for the NX prefix.
General rule for attributes is brief, but clear.

I'm seeking clarification on the "axes" attribute.

Pete

--

----------------------------------------------------------
Pete R. Jemian, Ph.D. <jem...@anl.gov>
Beam line Controls and Data Acquisition, Group Leader
Advanced Photon Source, Argonne National Laboratory
Argonne, IL 60439 630 - 252 - 3189
-----------------------------------------------------------
Education is the one thing for which people
are willing to pay yet not receive.
-----------------------------------------------------------

Vicente Sole

Jan 24, 2010, 12:16:56 PM

to ma...@googlegroups.com

Quoting Darren Dale <dsda...@gmail.com>:

> On Sun, Jan 24, 2010 at 10:21 AM, Vicente Sole <so...@esrf.fr> wrote:
>> Well, since the NIAC accepted it, I would expect they would give a name.
>>
>> If I have to propose it -perhaps one day I'll join the NIAC-, I would go for
>> a string attribute like:
>>
>> NXintrinsic_dimension = "0D", "1D", "2D", "3D", ...
>
> This is describing the dimensionality of the atomic data element (1D
> for spectrum, 2D for area detectors), right? And so the attribute is
> attached to the datasets themselves?

Yes.

>
>> (or just NXintrinsic)
>>
>> the goal of having a string attribute would be to be ready to some more
>> exotic detector acquisition modes either present ("ListMode1D",
>> "ListMode2D", ...) or future.
>
> I don't recall this future extension being mentioned at the workshop.
> It is not clear to me that it is a good idea to try to communicate so
> much information in a single attribute. The information about the
> acquisition mode can be communicated in another attribute, I don't see
> what is gained by combining the two.
>

No, it was not mentioned. That is just given as an example that a
string can leave the door open in case of need. A number would have to
be "translated" to something else.

What is gained is to avoid future discussions about naming a second
attribute and having to systematically check for the presence and
value of two attributes.

> I don't think it should be prefixed with "NX". None of the other
> dataset attributes are prefixed with NX (only the data types from the
> nexus API, and the NX_class attribute use this prefix, as far as I
> know). I think "intrinsic_dimensionality" or "intrinsic_dimensions" is
> descriptive, and I think it should be an integer value.

OK, then skip the NX for the attribute.

Personally, I consider it a pity to restrict ourselves to an integer, but
perhaps the others share your opinion. I would like to hear from those
already working in list mode.

Armando

Darren Dale

Jan 24, 2010, 12:27:20 PM

to ma...@googlegroups.com

Just to summarize my point here: if it is combined into one attribute,
we would be communicating both information about the intrinsic
dimensionality of a particular dataset, and information about the mode
in which it was collected. These are unrelated, and I would prefer an
additional attribute like "acquisition_type" (off the top of my head)
that might need to be referenced by applications at a point where they
are not interested in information about the dimensionality of a
particular dataset.

Darren

Matt Newville

Jan 24, 2010, 11:53:30 PM

to ma...@googlegroups.com

Hi Darren, Armando,

I'd like to reply to Darren's suggestion (discussed at the workshop)
on how to record data of multi-dimension scans. I apologize in
advance for this being so long. Also, I have to admit that I see
little reason to follow existing Nexus conventions. Perhaps that
betrays my ambivalence about using NeXus, but I don't see why we
should be constrained by what NeXus currently does.

I think Darren's suggestion of storing "positions in a
multi-dimensional scan" with a single array is the best approach.
Each of those array entries (each Point in the scan) can be
multi-dimensional, reflecting multiple things changing during the
scan. For a simple 2-D scan, the array of "Positions" could be as
simple as this:

  Point    X    Y
  ------------------
    0      1    1
    1      2    1
    2      1    2
    3      2    2

while the array of "Images" could be an NPTS x (N x M) array for an N x M
detector (C order -- happily, HDF5 hides this).
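
As a small sketch of the "Positions" array above, the following pure-Python function lists one (X, Y) row per scan point for a regular grid, with X as the fast axis. The function name `grid_positions` is made up for the example.

```python
def grid_positions(x_values, y_values):
    """One (x, y) row per scan point; x varies fastest, y slowest."""
    return [(x, y) for y in y_values for x in x_values]


positions = grid_positions([1, 2], [1, 2])
for point, (x, y) in enumerate(positions):
    print(point, x, y)
```

An irregular or non-rectangular scan is simply a different list of rows; nothing else about the layout has to change.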

A single Array of Positions (NPTS x N_Positioners) is easily
extensible: Replacing X by 3 motors with non-trivial constraints
between them would be supported, as would non-uniform and
non-rectangular scans. In addition, having multiple
independent-variable Positioners like Energy, Sample Temperature, or
pH would be allowed. But I do think there are other subtle issues
involved with this approach.

Since I called this Array "Positioner", let me describe the Epics Scan
Record (since many facilities use Epics). It's slightly different to
Spec, and a fairly low-level approach to scanning, but well-designed
(Of course, I have complaints about it!) and I think it makes a good
abstract model for collecting data. I'll stick with "Step Scan Mode"
for now, though slew scans aren't very different.

A 1-dimensional Scan is defined by 3 things:
  1) a set of independent variables to change (positions)
  2) a set of actions to take at each set of positions (triggers)
  3) a set of dependent variables to record at each position
     after the action has been completed (detectors).
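
The three ingredients listed above can be sketched as a plain loop. This is an abstract model only, with invented names; it does not use or imitate the Epics scan record interface.

```python
def step_scan(positions, move, triggers, detectors):
    """Run a 1-D step scan and return one row of readings per point.

    positions : target positions, one entry per scan point
    move      : callable that drives the positioner(s) to a target
    triggers  : callables run at each point (e.g. start an acquisition)
    detectors : callables returning one value each, read after triggers
    """
    rows = []
    for target in positions:
        move(target)
        for trigger in triggers:
            trigger()
        rows.append([read() for read in detectors])
    return rows


# Toy "hardware": a dict holding the motor position and a counter.
state = {"pos": None, "counts": 0}
rows = step_scan(
    positions=[1, 2, 3],
    move=lambda target: state.update(pos=target),
    triggers=[lambda: state.update(counts=state["counts"] + 1)],
    detectors=[lambda: state["pos"], lambda: state["counts"]],
)
print(rows)
```

Each returned row is one line of an NPTS x N_Detectors table, read only after all triggers at that point have completed.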

In Epics jargon, the independent variables are called "Positioners".
They can be any Epics Variable (need not be motors) and there can be
several of them so that a coordinated "motion" can be made (several
being "up to 4": more complicated motions can also be bundled into a
single variable with some upfront work). The full set of target
positions for all "Positioners" is specified before the scan begins.

The "actions" ("triggers" in Epics Speak) are typically used to begin
a detector acquisition, but can be any Epics Variable that can be set
to 1 to process and then report back when it is done processing.
There can be several triggers.

The dependent variables (detectors) are generally scalar values (up to
70 of them -- a frustrating limitation at times) for ion chamber,
ROIs, temperatures, etc. There are at least 2 ways to get
multi-dimensional data for each scan point: I won't dwell on those
here. I would like to emphasize that there is no explicit "Monitor"
here: one or more of these detectors is probably a useful Monitor, but
this is not explicitly separate. I think Epics is right (and NeXus
wrong) on this.

The Epics Scan Record has no explicit support for multi-dimensional
scans. Instead, one creates a Scan for the "fast positioners" (row)
and another for the "slow positioners" (column). The slow scan simply
sets the "slow Positioners" and has the Fast Scan as (one of) its
Actions. At each Column value a Row Scan is triggered and the Column
scan waits for it to finish. The Column Scan can also record Detectors
of its own. To make it even more mind-bending, nothing prevents the
parameters for the Row Scan from being changed for each Column Value, so
that arbitrary shapes or dynamically acquired scans ("base the range
and/or dwell time for Row N on the results of Row N-1") can be done --
these are rare in practice.
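
The nesting described above, where the slow scan has the fast scan as one of its actions, can be sketched as follows. All names are hypothetical and the code only models the control flow, not the Epics record itself.

```python
def scan(positions, move, actions):
    """Visit each position in turn and run every action there."""
    for target in positions:
        move(target)
        for action in actions:
            action()


def run_map(x_values, y_values):
    """2-D map built from two 1-D scans; returns the visited (x, y) points."""
    state = {"x": None, "y": None}
    visited = []

    def record():
        visited.append((state["x"], state["y"]))

    def row_scan():
        # Fast scan: move x and record at every point.
        scan(x_values, lambda x: state.update(x=x), [record])

    # Slow scan: move y, then run the whole row scan as its action.
    scan(y_values, lambda y: state.update(y=y), [row_scan])
    return visited


print(run_map([1, 2], [1, 2]))
```

Because `row_scan` is just an action, it could use different x values on each call, which is how non-rectangular maps would arise in this model.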

Though most of us do make all "Row Scans" the same when collecting XRF
maps, so that the map is a grid, the "right way" to store the data is
clearly the single multi-Positioner array. Among other things, this
will improve support for Slew Scans, where Motor Positions are not
frozen or (strictly) deterministic. Specifically, there must be an
NPTS x N_Positioners array of Positioner Values, and an N_Positioners
array of Names, and probably an array of Control System Addresses.

In addition, there needs to be an NPTS x N_Detectors Array of scalar
values for each of the detector values (intensities, ROIs, etc.)
recorded at each Position. There also need to be arrays of Names and
Addresses for these. I think it's wise to enforce that detectors
can't be changed during a multi-dimensional scan (though Epics allows
each row to be different).

Of course, there would need to be an NPTS x (NAREA_DETECTOR_SHAPE)
"image" array for the area detector(s) or multi-element XRF
detector(s). When discussed at the workshop, I think the consensus
was to use the "natural shape" of the detector. I would like to
discuss this more: if multiple multi-element detectors (2 Quad
Vortexes for example or 1 Quad Vortex and 1 XRD CCD) are used, it's
not clear how to best do this: Probably a separate "image" dataset for
each detector used??? That's not so obvious for 2 Quad Vortexes (one
could consider it an 8 x 2048 image). Perhaps the detector data
should also be unraveled to 1-d with a similar Legend Table as for
Positioners? I'm not convinced of this, but perhaps it should be
considered.

As mentioned above, each Scan can also have a set of detectors and
"environmental data" (typically used for Ring Current, Room
Temperature, and the like) recorded prior to each Inner Scan. I think
this would best be handled by having a set of arrays that were
dimensioned with NROWS (# of Rows in the Scan == # of times the
Environmental Data is recorded) and N_Environmental (# of
Environmental Variables). Since Environmental data is multi-typed, I
suggest these be recorded as strings.

In all, I think the following datasets would be needed (optional data
sets noted with leading *), with proposed names:

  Positions        NPTS x N_Positioners
  Position_Names   N_Positioners                 (descriptive labels)
  Position_Addrs   N_Positioners                 (addresses)

  Detectors        NPTS x N_Detectors            (N_Detectors = scalar values)
  Detector_Names   N_Detectors                   (descriptive labels)
  Detector_Addrs   N_Detectors                   (addresses)

  ENV_Rows         NROWS                         (Scan Point Value prior to recording Env data)
  ENV_Data         NROWS x N_Environmental       (strings)
  ENV_Names        N_Environmental               (name of Env vars)
  ENV_Addrs        N_Environmental               (addresses of Env vars)

 *XRF_Spectra      NPTS x N_Elems x N_Channels   (full spectra)
 *XRF_Corrected    NPTS x N_Elems x N_Channels   (corrected spectra)

 *XRD_Images       NPTS x N_X x N_Y              (XRD image)
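
As a rough sketch, the proposed layout can be written down as a mapping from dataset name to expected shape, which a reader or writer could use to sanity-check a file. The function name and its arguments are invented for the example; only the dataset names and dimensions come from the list above.

```python
def expected_shapes(npts, n_positioners, n_detectors, nrows, n_env,
                    xrf=None, xrd=None):
    """Expected shape of each dataset in the proposed layout.

    xrf : optional (n_elems, n_channels) for the XRF spectra datasets
    xrd : optional (n_x, n_y) for the XRD image dataset
    """
    shapes = {
        "Positions": (npts, n_positioners),
        "Position_Names": (n_positioners,),
        "Position_Addrs": (n_positioners,),
        "Detectors": (npts, n_detectors),
        "Detector_Names": (n_detectors,),
        "Detector_Addrs": (n_detectors,),
        "ENV_Rows": (nrows,),
        "ENV_Data": (nrows, n_env),
        "ENV_Names": (n_env,),
        "ENV_Addrs": (n_env,),
    }
    if xrf is not None:
        n_elems, n_channels = xrf
        shapes["XRF_Spectra"] = (npts, n_elems, n_channels)
        shapes["XRF_Corrected"] = (npts, n_elems, n_channels)
    if xrd is not None:
        n_x, n_y = xrd
        shapes["XRD_Images"] = (npts, n_x, n_y)
    return shapes


# Hypothetical 2 x 2 map with a 4-element XRF detector.
layout = expected_shapes(npts=4, n_positioners=2, n_detectors=3,
                         nrows=2, n_env=5, xrf=(4, 2048))
for name, shape in layout.items():
    print(f"{name:15s} {shape}")
```

Every per-point dataset shares NPTS as its first dimension, so the rows of Positions, Detectors and the spectra line up by scan point.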

I'm not committed to these names or this layout, but I do think such
an approach should be considered. For example, the above could be
made more hierarchical (a Positioners group, etc), but I wouldn't
necessarily recommend that. For attributes, I think all the dimensions
need to be recorded as attributes. With N_Positioners it is not
obvious that a scan is easily determined to be 1 or 2 or 10
dimensional. I think this does need to be explicitly set with
attributes and/or data sets:
  dimension        (single integer)
  Positioner_Axes = array (length=dimension) of Positioners to be treated
                    as principal scan Axes. Perhaps "Array Order" should
                    be an attribute of this dataset.

(If I understand, that was part of Armando's point).

Again, I apologize for this being so long. Again, I think unraveling
the Positioner values is a good idea.

--Matt Newville <newville at cars.uchicago.edu>

Stefan Vogt

Jan 25, 2010, 12:27:55 AM

to ma...@googlegroups.com

Dear Matt, Darren, Armando, all,

> I think Darren's suggestion of storing "positions in a
> multi-dimensional scan" with a single array is the best approach.

I tend to agree with Darren and Matt. This approach seems to me to be
the most suited one in order to deal with files of varying dimensions.

One incurs a small disadvantage in that, with the suggested format, one
can not just open up the file, look for something 2D and assume it will
be the image one wants to display. Instead, there will need to be
additional information allowing one to infer the 'natural'
dimensionality of the dataset. But, IMHO, one gains significant
flexibility to handle multidimensional files, as well as scans on
'irregular grids'.

Cheers,
Stefan

--
Dr. Stefan Vogt
Group Leader Microscopy Adj. Assoc. Professor
Advanced Photon Source Feinberg School of Medicine
Argonne National Lab. Northwestern University

phone: (630) 252-3071; beamline: -3711; fax: -0140
cell: (815) 302-1956
http://www.stefan.vogt.net/

"V. Armando Solé"

Jan 25, 2010, 4:12:16 AM

to ma...@googlegroups.com

Hi all,

Stefan Vogt wrote:
> Dear Matt, Darren, Armando, all,
>
>> I think Darren's suggestion of storing "positions in a
>> multi-dimensional scan" with a single array is the best approach.
>
> I tend to agree with Darren and Matt. This approach seems to me to be
> the most suited one in order to deal with files of varying dimensions.
>
> One incurs a small disadvantage in that, with the suggested format,
> one can not just open up the file, look for something 2D and assume it
> will be the image one wants to display. Instead, there will need to be
> additional information allowing one to infer the 'natural'
> dimensionality of the dataset. But, IMHO, one gains significant
> flexibility to handle multidimensional files, as well as scans on
> 'irregular grids'.

Well, the only thing I am not sure I have understood in the sentence:

"""
I think Darren's suggestion of storing "positions in a
multi-dimensional scan" with a single array is the best approach
"""

is the single array. I would expect one array per detector, for instance,
because of the different number of channels and so on. For the rest, the
proposal just reflects what Darren and I had figured out as being most
convenient. So, to me it is clearly "one of the best approaches", if not
the best, for scans: a simple and straightforward upgrade path from
what we already have. As was said, the goal of the workshop was not
to tell people how to write their data at acquisition time. My own view
for acquisition is to put each measurement into an "HDF5 folder" (it costs
me nothing to call it NXentry, so I call it that), with a data group
holding all the measured information considered relevant (the "Measurement
group" Darren and I agreed on, which is not NXwhatever). So, that
approach is 100% NeXus compatible while allowing us to write the
information in the way that Darren, I, and it seems you consider more
convenient.

The question then is, you have just used one NeXus class. What do you
gain by allowing NeXus compatibility?

a) A default plot. Just a click at the NXentry level, and the user can
have the very same plot and images he was looking at when collecting the
data. All that can be provided by NXdata.

b) Application definitions. Analysis programs aware of a technique can
get the information with little or no user intervention.

c) Instrument definitions. If you need to describe an instrument, why
not use something already decided? You avoid long discussions about
naming, units, ...

If you do not want any of them, do not use them, but why not leave the
door open? There are things to gain and very little effort to add.

That is how I see it. Of course I support Darren's proposal; you do not
need to convince me. Why should I do the opposite if that is just
implementing what I want, and Darren and I were at the origin of that
proposal? Just a couple of mails ago I said I considered it a mistake to
store multidimensional data in different ways depending on whether they
are on a regular grid or not. If that can be solved with an application
definition, fine; if not, they should reconsider how they are
writing their data, but that is their problem.

Armando

"V. Armando Solé"

Jan 25, 2010, 4:31:50 AM

to ma...@googlegroups.com

Hi Matt,

Just a small comment. In the original proposal made by Darren and me,
scanned motors are not in the same array. They were under the same
heading, in a group together with the measured counters. That simplifies
analysis because you do not necessarily want to visualize/analyze
certain data versus the position of the two motors in your example but,
for example, against one motor and the value of another counter.

So, unless Darren has changed his mind, our approach is each scanned
motor has its array (npoints, positions), each MCA detector has its
array (npoints, spectrum), each 2D detector too (npoints, rows,
columns), ...

"""
*XRF_Spectra NPTS x N_Elems x N_Channels (full spectra)
*XRF_Corrected NPTS x N_Elems x N_Channels (corrected spectra)
"""

Concerning multielement detector arrays in traditional (= non-list)
acquisition mode, I would recommend the same approach: (npoints,
spectrum). The different detectors do not need to collect the same
number of channels, and already available programs (PyMca) do not need
to handle special cases. In my opinion, one would be making a mistake
analogous to the one made when drawing a distinction between regular
grids and irregularly sampled data.
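
As an illustration of the per-detector layout argued for here, the following is a minimal sketch that uses a dict of NumPy arrays as a stand-in for HDF5 datasets. The detector names and channel counts are hypothetical and are not taken from any real file.

```python
import numpy as np

npoints = 5
# Hypothetical detector elements with different numbers of channels.
channels = {"mca_0": 2048, "mca_1": 1024, "mca_2": 4096}

# One (npoints, nchannels) array per detector, each with its own shape.
spectra = {
    name: np.zeros((npoints, n), dtype=np.uint32)
    for name, n in channels.items()
}


def can_stack(arrays):
    """Return True if all arrays share one shape, i.e. could be
    combined into a single (npts, nelements, nchannels) block."""
    return len({a.shape for a in arrays}) == 1


shapes = {name: a.shape for name, a in spectra.items()}
stackable = can_stack(spectra.values())
```

With unequal channel counts `stackable` is False, so a single (npts, nelements, nchannels) array would need padding or special handling, whereas the per-detector arrays need neither.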

Armando

Darren Dale

Jan 25, 2010, 8:03:29 AM

to ma...@googlegroups.com

That's an interesting point. Let's continue to consider details and
alternatives. My first instinct is to agree with one of Matt's
suggestions: XRF_Spectra NPTS x N_Elems x N_Channels (full spectra).
It could be burdensome on the user to concatenate all the spectra
together. But I had not considered that the length of the spectra from
individual elements might vary. I wonder if vlen arrays or tables
might provide enough additional flexibility for an elegant solution.

Darren

"V. Armando Solé"

Jan 25, 2010, 8:16:42 AM

to ma...@googlegroups.com

Darren Dale wrote:

Each detector can have different number of channels, different
calibration, different geometry, even different material.

In short, you will most likely have to process them separately anyway,
and in one case you already have an application able to deal with them,
while in the other you do not.

Armando

"V. Armando Solé"

Jan 25, 2010, 8:25:01 AM

to ma...@googlegroups.com

Darren Dale wrote:

Just to make clear, I am not concatenating all the detectors into
one. I am saying to save one array (npoints, spectrum) for each of them,
as one would do if they were independent detectors and not part of a
multielement one.

Each detector can have different number of channels, different
calibration, different geometry, even different material.

In short, you will most likely have to process them separately anyway,
and in one case you already have an application able to deal with them,
while in the other you do not.

Armando

Darren Dale

Jan 25, 2010, 9:20:56 AM

to ma...@googlegroups.com

Hi Matt,

On Sun, Jan 24, 2010 at 11:53 PM, Matt Newville
<newv...@cars.uchicago.edu> wrote:
> Hi Darren, Armando,
>
> I'd like to reply to Darren's suggestion (discussed at the workshop)
> on how to record data of multi-dimension scans. I apologize in
> advance for this being so long. Also, I have to admit that I see
> little reason to follow existing Nexus conventions. Perhaps that
> betrays my ambivalence about using NeXus, but I don't see why we
> should be constrained by what NeXus currently does.

I am also beginning to harbor some reservations about the general
NeXus approach. NeXus may overconstrain the problem for experiments
that tend to evolve quickly. The organization I presented at the
workshop, and which you discuss in this email, seems to me compatible
with many kinds of analyses. When I do xrf mapping, I save pymca's
configuration file (containing lots of xrf analysis information
specific to that particular measurement) in the group containing the
energy-dispersive detector datasets. We also do scanning powder
diffraction phase mapping of combinatorial thin films *in conjunction
with* scanning xrf mapping, which is an example of needing to consider
data in different contexts.

This approach could probably be improved by moving these
analysis-specific settings into their own group under entry, but as
far as being able to exchange this data and analyze it with different
programs, it would be up to the software developers to either agree on
how those parameters should be named and organized (or to import the
parameters from another program), or up to the human using the program
to provide context and fill in the missing details.

Still, strict NeXus seems to have momentum, and if it becomes a useful
exchange format supported by the majority of analysis programs, then
even if we use this alternative approach (which in my mind is
basically our current approach since it is very similar to spec and
epics way of doing things, only cast into hdf5), we may still end up
having to export to NeXus as an exchange format.

[...]

> I think Darren's suggestion of storing "positions in a
> multi-dimensional scan" with a single array is the best approach.

As Armando mentioned, for two or three years I have been using hdf5
with a layout similar to what you describe, but with individual
position and signal arrays organized into groups rather than a single
dataset. But otherwise what you describe is not too different.

[...]

> Of course, there would need to be an NPTS x (NAREA_DETECTOR_SHAPE)
> "image" array for the area detector(s) or multi-element XRF
> detector(s). When discussed at the workshop, I think the consensus
> was to use the "natural shape" of the detector. I would like to
> discuss this more: if multiple multi-element detectors (2 Quad
> Vortexes for example or 1 Quad Vortex and 1 XRD CCD) are used, it's
> not clear how to best do this: Probably a separate "image" dataset for
> each detector used??? That's not so obvious for 2 Quad Vortexes (one
> could consider it an 8 x 2048 image). Perhaps the detector data
> should also be unraveled to 1-d with a similar Legend Table as for
> Positioners? I'm not convinced of this, but perhaps it should be
> considered.

I think an array of shape (npts,nelements,nchannels) makes sense, but
Armando pointed out that nchannels may not be the same for every
element. Is that common?

I prefer splitting each positioner and detector into a separate array,
and placing them in an appropriate group. There are a number of
reasons: each dataset can contain its own attributes, and it may be
much easier to work with the data interactively in IDL or Matlab or
Python or whatever.

> For attributes, I think all the dimensions
> need to be recorded as attributes. With N_Positioners it is not
> obvious that a scan is easily determined to be 1 or 2 or 10
> dimensional. I think this does need to be explicitly set with
> attributes and/or data sets:
> dimension (single integer)
> Positioner_Axes = array (length=dimension) of Positioners to be
> treated as principal scan Axes. Perhaps "Array Order" should be
> an attribute of this dataset.

The way I have been handling this is to store an attribute in the
entry called "acquisition_shape", which any group or dataset contained
in the entry's hierarchy can access to determine the natural shape of
the scan itself. Datasets, like an ion chamber or a dead time or
whatever, can use this information to reshape the data in memory (not
in the file). In phynx, datasets have a map property which returns the
data in a numpy array, but reshaped to respect acquisition_shape. I
have not considered slew scans and non-uniform grids. Mapping these
data into a regularly gridded array should not be a problem, but it is
not as trivial as the supported case of scans on a regular grid.
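
The in-memory reshaping described above can be sketched as follows. This is only an illustration with NumPy, not the phynx implementation: the function name `as_map` and the example sizes are hypothetical, and it assumes a complete scan on a regular grid stored point by point in C (row-major) order.

```python
import numpy as np

# Hypothetical 2-d raster scan: 3 rows by 4 columns, stored point by point.
acquisition_shape = (3, 4)
npts = acquisition_shape[0] * acquisition_shape[1]

# A scalar counter stored as (npts,) and spectra stored as (npts, nchannels).
counter = np.arange(npts, dtype=float)
spectra = np.arange(npts * 8).reshape(npts, 8)


def as_map(data, acquisition_shape):
    """Reshape the leading (scan point) axis to the scan shape,
    leaving the axes of each data element untouched."""
    data = np.asarray(data)
    return data.reshape(tuple(acquisition_shape) + data.shape[1:])


counter_map = as_map(counter, acquisition_shape)  # shape (3, 4)
spectra_map = as_map(spectra, acquisition_shape)  # shape (3, 4, 8)
```

The data on disk keep their list-mode shape; only the array handed to the caller is reshaped.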

> (If I understand, that was part of Armando's point).
>
> Again, I apologize for this being so long. Again, I think unraveling
> the Positioner values is a good idea.

I, for one, appreciate you taking the time to write in depth and make
your views known (and I would say that even if you disagreed with my
point of view.)

Darren

Darren Dale

Jan 25, 2010, 9:23:07 AM

to ma...@googlegroups.com

On Mon, Jan 25, 2010 at 8:16 AM, "V. Armando Solé" <so...@esrf.fr> wrote:
> Darren Dale wrote:
>>

In that case, I would argue that grouping them all into one detector
is not appropriate.

> In short, you will most likely have to process them separately anyway,
> and in one case you already have an application able to deal with them,
> while in the other you do not.

What are those two cases? I didn't follow.

Darren

Darren Dale

Jan 25, 2010, 9:24:34 AM

to ma...@googlegroups.com

On Mon, Jan 25, 2010 at 8:25 AM, "V. Armando Solé" <so...@esrf.fr> wrote:
> Darren Dale wrote:
>>

Ah, ok. I see merit in that approach.

Darren

Darren Dale

Jan 25, 2010, 9:50:01 AM

to ma...@googlegroups.com

Hi Stefan,

On Mon, Jan 25, 2010 at 12:27 AM, Stefan Vogt <vo...@aps.anl.gov> wrote:
> Dear Matt, Darren, Armando, all,
>
>> I think Darren's suggestion of storing "positions in a
>> multi-dimensional scan" with a single array is the best approach.
>
> i tend to agree with Darren and Matt. This approach seems to me to be the
> most suited one in order to deal with files of varying dimensions.
>
> One incurs a small disadvantage in that, with the suggested format, one can
> not just open up the file, look for something 2D and assume it will be the
> image one wants to display. Instead, there will need to be additional
> information allowing one to infer the 'natural' dimensionality of the
> dataset.

This is true. As I mentioned in response to Matt, the way I have
handled this is to save the acquisition_shape as a property of the
entry, which can be used to provide a view of the array in the
appropriate shape.

I should mention a minor drawback and explain how it can be overcome.
My acquisition software lets you select a range of one or more pixels
from an element map to inspect and fit the average spectrum (useful
for calibrating using standards). This would probably be pretty easy
if an xrf mapping scan saved the spectra in shape (nx,ny,nchannels),
but since I save the data in shape (npts,nchannels), I end up finding
the x,y coordinates of the selected pixels, finding the indices in the
x,y datasets that contain those coordinates, and using those indices to
extract the desired spectra. It's only a few lines of code in python
(numpy's indexing tricks are similar to matlab's, which simplifies
things a great deal). I think this solution is compatible with
irregular scans, but the axes of the scan need to be known and present
in the file.
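
A minimal sketch of that lookup, assuming the positions and spectra are already loaded as NumPy arrays. The function name `average_spectrum`, the array names, and the example values are hypothetical; this is not the acquisition software's actual code.

```python
import numpy as np

# Hypothetical 3 x 3 raster stored in list mode: one entry per scan point.
x = np.array([0., 1., 2., 0., 1., 2., 0., 1., 2.])
y = np.array([0., 0., 0., 1., 1., 1., 2., 2., 2.])
spectra = np.arange(9 * 4, dtype=float).reshape(9, 4)  # (npts, nchannels)


def average_spectrum(x, y, spectra, xlim, ylim):
    """Average the spectra of all scan points whose x, y positions
    fall inside the inclusive limits xlim and ylim."""
    mask = (x >= xlim[0]) & (x <= xlim[1]) & (y >= ylim[0]) & (y <= ylim[1])
    indices = np.nonzero(mask)[0]
    return indices, spectra[indices].mean(axis=0)


indices, mean_spec = average_spectrum(x, y, spectra, xlim=(1, 2), ylim=(0, 1))
```

Because the selection works on the stored positions rather than on grid indices, the same function applies unchanged to points from an irregular scan.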

> But, IMHO, one gains significant flexibility to handle
> multidimensional files, as well as scans on 'irregular grids'.

I agree. These scans are acquired linearly over time, and the format
reflects that reality. It has served spec well for many years.

Thanks,
Darren

Carlos Pascual Izarra

Jan 25, 2010, 9:53:42 AM

to ma...@googlegroups.com

On Monday 25 January 2010 15:20:56 Darren Dale wrote:
> I think an array of shape (npts,nelements,nchannels) makes sense, but
> Armando pointed out that nchannels may not be the same for every
> element. Is that common?

I cannot say whether it is common or not, but it is at least something
we support in our scans.

> I prefer splitting each positioner and detector into a separate array,
> and placing them in an appropriate group. There are a number of
> reasons: each dataset can contain its own attributes, and it may be
> much easier to work with the data interactively in IDL or Matlab or
> Python or whatever.

Furthermore, having each positioner and detector in its own array is precisely
the NeXus GenericScan approach.

--
+----------------------------------------------------+
Carlos Pascual Izarra
Scientific Software Contact
Computing Division
Cells / Alba Synchrotron [http:/www.cells.es]
Carretera BP 1413 de Cerdanyola-Sant Cugat, Km. 3.3
E-08290 Cerdanyola del Valles (Barcelona), Spain
E-mail: carlos....@cells.es
Phone: +34 93 592 4428
+----------------------------------------------------+

Darren Dale

Jan 25, 2010, 9:56:53 AM

to ma...@googlegroups.com

On Mon, Jan 25, 2010 at 9:50 AM, Darren Dale <dsda...@gmail.com> wrote:
> Hi Stefan,
>
> On Mon, Jan 25, 2010 at 12:27 AM, Stefan Vogt <vo...@aps.anl.gov> wrote:
>> Dear Matt, Darren, Armando, all,
>>
>>> I think Darren's suggestion of storing "positions in a
>>> multi-dimensional scan" with a single array is the best approach.
>>
>> i tend to agree with Darren and Matt. This approach seems to me to be the
>> most suited one in order to deal with files of varying dimensions.
>>
>> One incurs a small disadvantage in that, with the suggested format, one can
>> not just open up the file, look for something 2D and assume it will be the
>> image one wants to display. Instead, there will need to be additional
>> information allowing one to infer the 'natural' dimensionality of the
>> dataset.
>
> This is true. As I mentioned in response to Matt, the way I have
> handled this is to save the acquisition_shape as a property of the
> entry, which can be used to provide a view of the array in the
> appropriate shape.

Sorry, I misread your comment. You were talking about the natural
dimensionality of the elements in the dataset, not the scan itself.
With the list mode I favor, in principle you can identify an area
detector because the dataset has a dimensionality of 3: (npts,
xpixels, ypixels). Some might find it desirable to store a single
image as (xpixels,ypixels) instead of (1, xpixels, ypixels), so
Armando's intrinsic_dimensionality attribute could help.
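
To illustrate how such an attribute removes the ambiguity, here is a small sketch in plain Python. The helper name `split_shape` is hypothetical; it only shows how a reader could separate scan axes from element axes once the intrinsic dimensionality is known.

```python
def split_shape(shape, intrinsic_dimensionality):
    """Split a dataset shape into (scan axes, element axes), where the
    last `intrinsic_dimensionality` axes belong to one data element."""
    n = int(intrinsic_dimensionality)
    if n < 0 or n > len(shape):
        raise ValueError("intrinsic dimensionality does not fit the shape")
    cut = len(shape) - n
    return tuple(shape[:cut]), tuple(shape[cut:])


series = split_shape((100, 512, 512), 2)  # 100 images of 512 x 512
single = split_shape((512, 512), 2)       # one image, no scan axis
spectra = split_shape((100, 2048), 1)     # 100 spectra of 2048 channels
counter = split_shape((100,), 0)          # 100 scalar readings
```

Without the attribute, a (512, 512) dataset could equally be read as 512 spectra of 512 channels; with it, the single image and the series of images are both identified as 2D data.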

Darren

"V. Armando Solé"

Jan 25, 2010, 10:45:56 AM

to ma...@googlegroups.com

Really, that attribute is so cheap and can solve so many issues and
different ways of saving data that I do not really understand why there
are still doubts about its usefulness. I thought it was clear at the
workshop and I cannot see the point of discussing it again. If you
consider that it was just the fruit of a sudden idea, it is because you
have not tried to support the large variety of data that can be found in
an HDF5 file.

I allow myself to remind everyone that the goal of the workshop and of
this mailing list was NOT to tell people how to store their data at
collection time. I have been trying to accommodate the needs of
everybody irrespective of how they originally stored their data, and
that can be facilitated with very simple approaches. The simplest of
them is to tell the programs that have to deal with the data what the
data are (spectra? images? counters?). If that cannot be done, I doubt
we'll be able to go much further.

Armando

"V. Armando Solé"

Jan 25, 2010, 10:52:14 AM

to ma...@googlegroups.com

Carlos Pascual Izarra wrote:
> On Monday 25 January 2010 15:20:56 Darren Dale wrote:
>
>> I think an array of shape (npts,nelements,nchannels) makes sense, but
>> Armando pointed out that nchannels may not be the same for every
>> element. Is that common?
>>
>
> I cannot say whether it is common or not, but it is at least something
> we support in our scans.
>

Same here and because of that we save them separately.

Armando

Darren Dale

Jan 25, 2010, 11:42:23 AM

to ma...@googlegroups.com

Hi Armando,

On Mon, Jan 25, 2010 at 10:45 AM, "V. Armando Solé" <so...@esrf.fr> wrote:
> Darren Dale wrote:
>>
>> On Mon, Jan 25, 2010 at 9:50 AM, Darren Dale <dsda...@gmail.com> wrote:
>> Sorry, I misread your comment. You were talking about the natural
>> dimensionality of the elements in the dataset, not the scan itself.
>> With the list mode I favor, in principle you can identify an area
>> detector because the dataset has a dimensionality of 3: (npts,
>> xpixels, ypixels). Some might find it desirable to store a single
>> image as (xpixels,ypixels) instead of (1, xpixels, ypixels), so
>> Armando's intrinsic_dimensionality attribute could help.
>>
>
> Really, that attribute is so cheap and can solve so many issues and
> different ways of saving data that I do not really understand why there are
> still doubts about its usefulness. I thought it was clear at the workshop
> and I cannot see the point of discussing it again. If you consider that it
> was just the fruit of a sudden idea, it is because you have not tried to
> support the large variety of data that can be found in an HDF5 file.

It seems most of the comments have been supportive of this attribute.
I respect your position, and hope you will respect it if other members
of this community want to continue the discussion for whatever reason.

> I allow myself to remind everyone that the goal of the workshop and of
> this mailing list was NOT to tell people how to store their data at
> collection time.

If members of this community want to consider how storing data at
collection time relates to how data will be used during analysis, I
think we should be allowed to do so. This seems to me an appropriate
forum for that discussion.

> I have been trying to accommodate the needs of everybody
> irrespective of how they originally stored their data, and that can be
> facilitated with very simple approaches. The simplest of them is to tell the
> programs that have to deal with the data what the data are (spectra? images?
> counters?).

I wasn't suggesting otherwise. I was just making an observation based
on having worked with this particular layout for the past couple
years, and I concluded by saying that the proposed attribute could be
useful. In my own files, I have been identifying the types of datasets
using a class attribute with values like "Spectrum" or "Signal"
(could/should be changed to "Counter"). This has worked fine for me,
but I see merit in providing intrinsic_dimensionality as well, which
may be more generally accessible to other analysis programs.

Darren

Andy Gotz

Jan 25, 2010, 11:54:08 AM

to ma...@googlegroups.com

I whole-heartedly support this "low-cost" attribute. I had hoped at
least this would be retained as a result of the HDF5 workshop.

I vote to have an attribute indicating the type of dimensionality of the
acquired or processed data. I think this needs to be stored with the
data and not only as part of the detector definition (i.e. NXdetector),
which (the detector, I mean) might no longer be meaningful by the time
the data have been processed, e.g. a series of spectra could be
converted into an array of images.

Andy

George Kourousias

Jan 25, 2010, 11:58:10 AM

to ma...@googlegroups.com

I also support this.

2010/1/25 Andy Gotz <andy...@esrf.fr>:

Pete R. Jemian

Jan 25, 2010, 12:23:32 PM

to ma...@googlegroups.com

Sensible suggestion. There has been a standing TRAC ticket
in NeXus for updating the NXdetector definition to better
support the type of usage at synchrotrons. This sounds
like a component of that update.
(http://trac.nexusformat.org/definitions/ticket/6)

Armando suggested

> NXintrinsic_dimension = "0D", "1D", "2D", "3D", ...

Working with that idea, a proposed addition to NXdetector that incorporates this attribute might look like this:
----------------------% clip here %----------------------
<field name="intrinsic_dimensions">
<doc>
Describes the natural dimensionality of the elements in the dataset.
</doc>
<enumeration>
<item value="0D"><doc>such as from a point detector</doc></item>
<item value="1D"><doc>such as from a line or strip detector</doc></item>
<item value="2D"><doc>such as from an area detector</doc></item>
<item value="3D"><doc>such as a 1D series of 2D images</doc></item>
<item value="4D"><doc>such as a 2D series of 2D images</doc></item>
</enumeration>
</field>
----------------------% clip here %----------------------

This says the "intrinsic_dimensions" attribute can take
only a few specific values as shown in the "item" elements above.

Does this represent the sense of the discussion?
What else to add here?

Pete

On 1/25/2010 10:58 AM, George Kourousias wrote:
> I also support this.
>
> 2010/1/25 Andy Gotz<andy...@esrf.fr>:

Darren Dale

Jan 25, 2010, 12:39:19 PM

to ma...@googlegroups.com

Not quite. The value is describing the intrinsic dimensionality of a
single data element. The intrinsic dimensionality of an area detector
is 2D, regardless of whether it is a 1D or 2D series of such images
(unless I am completely off base. The turn of the discussion suddenly
became more contentious than I had anticipated, and I am not sure why,
but I was probably the instigator. In which case, my apologies).

I suggested that the values should be integers, Armando favors strings.

> What else to add here?

Andy suggested that this attribute should be attached to the dataset,
not NXdetector. I agree that the attribute would be useful on datasets
that are not in an NXdetector group.

Darren

Matt Newville

Jan 25, 2010, 1:18:05 PM

to ma...@googlegroups.com

Hi Armando, Darren, All,

OK, now I'm sure that I've missed some parts of this wide-ranging conversation!

If I understand correctly (and trying to stay as specific as
possible), what you are proposing is to use HDF5 Groups for
Positioners and Detectors (names negotiable) with each of these groups
having several datasets. Many of these datasets would be one
dimensional of length NPTS, though I suppose multi-element detectors
would be NPTS x Detector_Shape.

The principal advantage of this approach is that attributes can be
assigned to individual datasets. The disadvantage is that there are
many more datasets, and so a naming convention for these datasets must
be described. Names like "Positioner/1" and "Detector/28" seem
vague, but I believe that Darren's HDF5 files had arrays with labels
like 'Ca'. I think relying on the name to convey scan-specific meaning
is probably bound to fail. Perhaps "I0" is "iodine from detector 0"?
A multi-element detector may need names like "mca1 Ca Kbeta", but that
leads to a host of potential problems (mca first or last, can unicode
be used?, etc). I guess it's best to use "Positioner/1" with
attributes of "Label", "Address", and so on.

Anyway, I'm fine with this layout. With more reliance on attributes,
it does allow Positioners and Detectors to appear more as Objects (or
at least Structures). Perhaps we'll want to define some Positioner
and Detector types (StepperMotor_Positioner, DCMono_Positioner,
Scalar_Detector, ROI_Detector, MultiElementFluorescence_Detector,
Area_Detector, etc) that have defined attributes. I'd hesitate to
make each of these datasets into groups, but that might be required
for completeness.

Speaking of completeness, I do find it interesting that it is
preferred for Positioners to be a set of 1D datasets of Single
Position Values that must be unraveled, while Detectors can either be
1D or multi-dimensional. I don't disagree with this, but it is
asymmetric, which makes me think we might be doing something wrong.

Finally, specifying the "intrinsic" or "intended" dimensionality of a
scan is definitely needed to understand the data. I would
suggest this be called "dimension" and be an integer (not a
string) attribute of the Scan (or Entry or Measurement or whatever the
Main Group is called).

Cheers,

--Matt Newville

Vicente Sole

Jan 25, 2010, 1:50:04 PM

to ma...@googlegroups.com

Hi Pete,

The attribute should be "attachable" to datasets. Not just to
detectors. Since the NXdetector will generate some sort of data, one
would expect that the attribute is added to those data.

There seems to be more acceptance for intrinsic_dimensionality than
for intrinsic_dimension. I am fine with both.

Personally I prefer to have a string such as "0D", "1D", "2D", "3D", ...,
just in case we get data more exotic than can be represented by an
integer.

Darren Dale prefers an integer and a new attribute if it is something
more exotic.
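
Whichever representation is chosen, a reading program can accept both forms. The following is a minimal sketch; the function name `parse_intrinsic_dimensionality` is hypothetical and no existing library is implied.

```python
def parse_intrinsic_dimensionality(value):
    """Return the dimensionality as an int, accepting an integer such
    as 2 or a string (or ASCII bytes) such as "2" or "2D"."""
    if isinstance(value, bytes):
        value = value.decode("ascii")
    if isinstance(value, str):
        text = value.strip().upper()
        if text.endswith("D"):
            text = text[:-1]
        value = int(text)  # raises ValueError for anything more exotic
    value = int(value)
    if value < 0:
        raise ValueError("dimensionality must be non-negative")
    return value
```

A value that is not of the form N or "ND" raises ValueError, which is where a separate attribute for more exotic data would have to take over.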

Armando

Darren Dale

Jan 25, 2010, 1:53:23 PM

to ma...@googlegroups.com

On Mon, Jan 25, 2010 at 1:18 PM, Matt Newville
<newv...@cars.uchicago.edu> wrote:
> Hi Armando, Darren, All,
>
> OK, now I'm sure that I've missed some parts of this wide-ranging conversation!
>
> If I understand correctly (and trying to stay as specific as
> possible), what you are proposing is to use HDF5 Groups for
> Positioners and Detectors (names negotiable) with each of these groups
> having several datasets. Many of these datasets would be one
> dimensional of length NPTS, though I suppose multi-element detectors
> would be NPTS x Detector_Shape.
>
> The principal advantage of this approach is that attributes can be
> assigned to individual datasets. The disadvantage is that there are
> many more datasets, and so a naming convention for these datasets must
> be described. Names like "Positioner/1" and "Detector/28" seem
> vague, but I believe that Darren's HDF5 files had arrays with labels
> like 'Ca'. I think relying on the name to convey scan-specific meaning
> is probably bound to fail.

I don't rely on the name to convey meaning, I use attributes for that.
The name, like 'Ca' (which in this case is a SCA ROI corresponding to
calcium fluorescence) comes from the identity of that particular
counter in the spec config file. At the workshop, Matt advocated for a
2D dataset of shape (npts, nSCAs) (did I remember correctly?) and I
suggested that identifying it as either a "Counter" or a single
channel analyzer in the attributes would allow a user or application
to identify these channels.

> Perhaps "I0" is "iodine from detector 0"?

In my talk at the workshop, I showed how I use a group to collect such
data. In the entry, next to the measurement group, I have a detector
group called "vortex", with an attribute that identifies it as a
multichannel analyzer. That group contains the spectra
(npts,nchannels) and the various dead time datasets (npts,), and could
also contain associated SCA ROIs.

> A multi-element detector may need names like "mca1 Ca Kbeta", but that
> leads to a host of potential problems (mca first or last, can unicode
> be used?, etc). I guess it's best to use "Positioner/1" with
> attributes of "Label", "Address", and so on.

I would strongly encourage using a more meaningful identifier,
whatever the user or beamline scientist decides to call channel 1, in
this case "Positioner/monoff" or whatever. If I want to work
interactively with the data, it is easier for me to open the hdf5 file
and request f['/entry1/measurement/positioners/monoff'] than it is to
remember which channel corresponds to the label or mnemonic that is
used to identify that positioner. I agree the address should be saved
as an attribute.
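
The two access styles can be compared with a small sketch that uses a plain dict as a stand-in for an HDF5 group and its attributes. All names and address values below are hypothetical.

```python
# Stand-in for an HDF5 group: dataset name -> attributes (hypothetical values).
positioners = {
    "monoff": {"address": "mono:offset", "class": "Positioner"},
    "samx": {"address": "stage:x", "class": "Positioner"},
}


def find_by_attribute(group, key, value):
    """Return the sorted names of all datasets whose attribute `key`
    equals `value`."""
    return sorted(
        name for name, attrs in group.items() if attrs.get(key) == value
    )


by_name = positioners["monoff"]  # direct access with a meaningful name
by_address = find_by_attribute(positioners, "address", "stage:x")
```

With meaningful names, interactive access is a single lookup, while programs that only know an address can still search the attributes.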

> Anyway, I'm fine with this layout. With more reliance on attributes,
> it does allow Positioners and Detectors to appear more as Objects (or
> at least Structures). Perhaps we'll want to define some Positioner
> and Detector types (StepperMotor_Positioner, DCMono_Positioner,
> Scalar_Detector, ROI_Detector, MultiElementFluorescence_Detector,
> Area_Detector, etc) that have defined attributes. I'd hesitate to
> make each of these datasets into groups, but that might be required
> for completeness.

I think I am in complete agreement, and some of this is already
implemented in phynx. I also have reservations about making each of
these datasets groups. I only envisioned groups for detectors that
produce multiple datasets, like an MCA (spectra,deadtime,SCAs). I have
not used groups for things like ion chambers, for which datasets and
associated attributes seem sufficient.

> Speaking of completeness, I do find it interesting that it is
> preferred for Positioners to be a set of 1D datasets of Single
> Position Values that must be unraveled,

I don't understand what you mean, could you clarify? A positioner can
only have one value per position (or two, a target position and a
measured position, which I haven't considered before)

> while Detectors can either be
> 1D or multi-dimensional. I don't disagree with this, but it is
> asymmetric which makes me think we might be doing something wrong.

It seems symmetric to me. I would prefer that iterating over a dataset
would yield data with the right intrinsic dimensionality:

# monoff is a 1D dataset
for item in f['/entry1/measurement/positioners/monoff']:
    pass  # item is a scalar position of monoff at this point in the scan

# spectra is a 2D dataset
for item in f['/entry1/measurement/vortex/spectra']:
    pass  # item is a 1D spectrum collected at this point in the scan

# images is a 2D dataset
for item in f['/entry1/measurement/ccd/images']:
    pass  # item is a 2D image collected at this point in the scan

> Finally, specifying the "intrinsic" or "intended" dimensionality of a
> scan is definitely needed to understand the data. I would
> suggest this be called "dimension" and be an integer (not a
> string) attribute of the Scan (or Entry or Measurement or whatever the
> Main Group is called).

I think there are two different dimensionalities to consider.
"intrinsic_dimensionality" indicates what is the natural
dimensionality of the "items" in my above examples. Some other
attribute (I suggested "acquisition_shape") would serve the purpose
you just described, and I agree that the Scan or Entry would be the
appropriate place.

Darren

Vicente Sole

Jan 25, 2010, 1:57:33 PM

to ma...@googlegroups.com

Hi Matt,

Quoting Matt Newville <newv...@cars.uchicago.edu>:

> Hi Armando, Darren, All,
>
> OK, now I'm sure that I've missed some parts of this wide-ranging
> conversation!
>
> If I understand correctly (and trying to stay as specific as
> possible), what you are proposing is to use HDF5 Groups for
> Positioners and Detectors (names negotiable) with each of these groups
> having several datasets. Many of these datasets would be one
> dimensional of length NPTS, though I suppose multi-element detectors
> would be NPTS x Detector_Shape.
>

That is what I understand.

> The principal advantage of this approach is that attributes can be
> assigned to individual datasets. The disadvantage is that there are
> many more datasets, and so a naming convention for these datasets must
> be described. Names like "Positioner/1" and "Detector/28" seem
> vague, but I believe that Darren's HDF5 files had arrays with labels
> like 'Ca'. I think relying on the name to convey scan-specific meaning
> is probably bound to fail. Perhaps "I0" is "iodine from detector 0"?
> A multi-element detector may need names like "mca1 Ca Kbeta", but that
> leads to a host of potential problems (mca first or last, can unicode
> be used?, etc). I guess it's best to use "Positioner/1" with
> attributes of "Label", "Address", and so on.

Darren is storing analysis results in a layout similar to that of
direct measurements.

At the ESRF, one of our initial ideas is to use the unique device
server name of each detector to name or to complement the name of a
group related to that detector.

>
> Anyway, I'm fine with this layout. With more reliance on attributes,
> it does allow Positioners and Detectors to appear more as Objects (or
> at least Structures). Perhaps we'll want to define some Positioner
> and Detector types (StepperMotor_Positioner, DCMono_Positioner,
> Scalar_Detector, ROI_Detector, MultiElementFluorescence_Detector,
> Area_Detector, etc) that have defined attributes. I'd hesitate to
> make each of these datasets into groups, but that might be required
> for completeness.
>
> Speaking of completeness, I do find it interesting that it is
> preferred for Positioners to be a set of 1D datasets of Single
> Position Values that must be unraveled, while Detectors can either be
> 1D or multi-dimensional. I don't disagree with this, but it is
> asymmetric, which makes me think we might be doing something wrong.
>
> Finally, specifying the "intrinsic" or "intended" dimensionality of a
> scan is definitely needed to understand the data. I would
> suggest this be called "dimension" and be an integer (not a
> string) attribute of the Scan (or Entry or Measurement or whatever the
> Main Group is called).
>

The attribute we are discussing is for datasets, not for scans themselves.

I support the measurement approach and I consider it the most
versatile one, but I would appreciate if we could discuss it in a
separate dedicated thread.

Armando

Darren Dale

Jan 25, 2010, 1:57:42 PM

to ma...@googlegroups.com

On Mon, Jan 25, 2010 at 1:53 PM, Darren Dale <dsda...@gmail.com> wrote:

> # images is a 2D dataset

That should have said "images is a 3D dataset", sorry for the mistake.

Darren Dale

Jan 25, 2010, 2:22:16 PM

to ma...@googlegroups.com

On Mon, Jan 25, 2010 at 1:57 PM, Vicente Sole <so...@esrf.fr> wrote:
> I support the measurement approach and I consider it the most versatile one,
> but I would appreciate it if we could discuss it in a separate dedicated
> thread.

I would like to point out that I started this thread, and the
discussion still seems relevant to "multidimensional hdf5 arrays"
(although the comments are decreasingly "quick").

Darren

Vicente Sole

Jan 25, 2010, 2:31:56 PM

to ma...@googlegroups.com

Quoting Darren Dale <dsda...@gmail.com>:

Ok. It's your right.

Armando

Matt Newville

Jan 25, 2010, 4:40:46 PM

to ma...@googlegroups.com

Hi.

> At the ESRF, one of our initial ideas is to use the unique device server
> name of each detector to name or to complement the name of a group related
> to that detector.

I think this "unique device server name" would be very similar to what
I called "Address" and used for Epics Process Variable. I would think
that these would not make very meaningful names for Groups or
Datasets. I think having short, simple, predictable names for
datasets that don't vary widely is probably wise. I admit that
"Positioner/1" is not very informative, but I'd like to avoid
"Positioner/13IDC:m17.VAL" (an Epics Address) or "Detector/Ca Kalpha
mca1" as it would make it difficult to automatically read in and
manage datafiles from different places (precisely the problem we are
trying to avoid).

>> Finally, specifying the "intrinsic" or "intended" dimensionality of a
>> scan is definitely needed to understand the data. I would
>> suggest this to be called "dimension" and it be an integer (not a
>> string) attribute of the Scan (or Entry or Measurement or whatever the
>> Main Group is called).
>>
> The attribute we are discussing is for datasets, not for scans themselves.

But wouldn't this attribute be the same for all datasets, and so be
pointlessly repeated if it was an attribute for all datasets?

The way I understand the issue, the Scan/Map/Measurement has an
intrinsic dimension which is hidden by unraveling the set of
Positioners and Detectors to be 1-D arrays (or well, multi-dimensional
for multi-dimensional detectors). The data must be restored to the
original multi-dimensional shape in order to understand the data, but
the transformation needed would be identical for all datasets in the
scan, no?

Or perhaps I'm completely misunderstanding your point? Since you're
also suggesting using strings, it may mean that you may not intend
this value to be used for automatically unwrapping the data.

--Matt Newville

Darren Dale

Jan 25, 2010, 5:08:31 PM

to ma...@googlegroups.com

On Mon, Jan 25, 2010 at 4:40 PM, Matt Newville
<newv...@cars.uchicago.edu> wrote:
> Hi.
>
>> At the ESRF, one of our initial ideas is to use the unique device server
>> name of each detector to name or to complement the name of a group related
>> to that detector.
>
> I think this "unique device server name" would be very similar to what
> I called "Address" and used for Epics Process Variable. I would think
> that these would not make very meaningful names for Groups or
> Datasets. I think having short, simple, predictable names for
> datasets that don't vary widely is probably wise. I admit that
> "Positioner/1" is not very informative, but I'd like to avoid
> "Positioner/13IDC:m17.VAL" (an Epics Address) or "Detector/Ca Kalpha
> mca1" as it would make it difficult to automatically read in and
> manage datafiles from different places (precisely the problem we are
> trying to avoid).

Do your epics PVs have a more accessible name or mnemonic? For the
epics devices we use at CHESS (vortex detector/MCA), I use "vortex1".
Others might prefer "MCA" or "MCA1".

>>> Finally, specifying the "intrinsic" or "intended" dimensionality of a
>>> scan is definitely needed to understand the data. I would
>>> suggest this to be called "dimension" and it be an integer (not a
>>> string) attribute of the Scan (or Entry or Measurement or whatever the
>>> Main Group is called).
>>>
>> The attribute we are discussing is for datasets, not for scans themselves.
>
> But wouldn't this attribute be the same for all datasets, and so be
> pointlessly repeated if it was an attribute for all datasets?
>
> The way I understand the issue, the Scan/Map/Measurement has an
> intrinsic dimension which is hidden by unraveling the set of
> Positioners and Detectors to be 1-D arrays (or well, multi-dimensional
> for multi-dimensional detectors).

Either you or I are confusing the intended purpose of
"intrinsic_dimension". (I think it might be you). More below.

> The data must be restored to the
> original multi-dimensional shape in order to understand the data, but
> the transformation needed would be identical for all datasets in the
> scan, no?

Correct.

> Or perhaps I'm completely misunderstanding your point? Since you're
> also suggesting using strings, it may mean that you may not intend
> this value to be used for automatically unwrapping the data.

"intrinsic_dimensions" would not be the same for all datasets. That
attribute does not communicate anything about the scan dimensions, it
communicates information about the atomic unit of a dataset. An area
detector would have intrinsic_dimensions value of 2 or "2D", a
single-element MCA would be 1 or "1D". Does that make sense?
"intrinsic_dimensions" is most necessary for the current nexus way of
storing scans over an area or volume.

"acquisition_shape" could convey information about the shape of the
scan itself. This is the data you are looking for to take a 1D dataset
of shape (npts,) and yield a 2D array of shape (ypts, xpts). I don't
think this attribute should be attached to all datasets, since it
could be attached to the entry and would apply to all datasets.
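
A minimal sketch of that reshaping, using NumPy arrays in place of real
HDF5 datasets. The attribute name acquisition_shape is the one proposed
above; the helper name restore_scan_shape and the example sizes are
hypothetical.

```python
import numpy as np

# Hypothetical values: a 3 x 4 raster stored point by point.
acquisition_shape = (3, 4)   # (ypts, xpts), assumed to be stored once on the entry
counts = np.arange(12)       # one counter, flattened to shape (npts,)


def restore_scan_shape(data, acquisition_shape):
    """Expand the leading npts axis to the scan shape, keeping any item axes."""
    npts = int(np.prod(acquisition_shape))
    if data.shape[0] != npts:
        raise ValueError("dataset length does not match acquisition_shape")
    return data.reshape(tuple(acquisition_shape) + data.shape[1:])


counter_map = restore_scan_shape(counts, acquisition_shape)
# The same call works for a detector that records a spectrum at each point.
spectra_map = restore_scan_shape(np.zeros((12, 2048)), acquisition_shape)
print(counter_map.shape, spectra_map.shape)
```

Because the scan shape is the same for every dataset in the entry, one
attribute on the entry is enough to unwrap all of them.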

Darren

Vicente Sole

Jan 25, 2010, 5:20:13 PM

to ma...@googlegroups.com

Quoting Matt Newville <newv...@cars.uchicago.edu>:

> Hi.
>
>> At the ESRF, one of our initial ideas is to use the unique device server
>> name of each detector to name or to complement the name of a group related
>> to that detector.
>
> I think this "unique device server name" would be very similar to what
> I called "Address" and used for Epics Process Variable. I would think
> that these would not make very meaningful names for Groups or
> Datasets. I think having short, simple, predictable names for
> datasets that don't vary widely is probably wise. I admit that
> "Positioner/1" is not very informative, but I'd like to avoid
> "Positioner/13IDC:m17.VAL" (an Epics Address) or "Detector/Ca Kalpha
> mca1" as it would make it difficult to automatically read in and
> manage datafiles from different places (precisely the problem we are
> trying to avoid).
>

Since the approach is based on attributes rather than names, there is
freedom to choose names. "frelon" or "wideangle" or "waxs" do not seem
so cumbersome.

>>> Finally, specifying the "intrinsic" or "intended" dimensionality of a
>>> scan is definitely needed to understand the data. I would
>>> suggest this to be called "dimension" and it be an integer (not a
>>> string) attribute of the Scan (or Entry or Measurement or whatever the
>>> Main Group is called).
>>>
>> The attribute we are discussing is for datasets, not for scans themselves.
>
> But wouldn't this attribute be the same for all datasets, and so be
> pointlessly repeated if it was an attribute for all datasets?
>

It would certainly not be the same for all datasets. You seem to be
talking about the "dimensionality" of a scan, which, as I see the
approach, does not need to be specified because it is reflected by the
data themselves. At the ESRF the datasets for a single scan can, and
do, come simultaneously from intensity monitors ("0D"
intrinsic_dimensionality), MCAs ("1D" intrinsic_dimensionality) and
CCDs ("2D" intrinsic_dimensionality).

I think the type of dimensionality you are talking about is almost the
same as what Darren has defined as acquisition_shape in his acquisition
system.

> The way I understand the issue, the Scan/Map/Measurement has an
> intrinsic dimension which is hidden by unraveling the set of
> Positioners and Detectors to be 1-D arrays (or well, multi-dimensional
> for multi-dimensional detectors). The data must be restored to the
> original multi-dimensional shape in order to understand the data, but
> the transformation needed would be identical for all datasets in the
> scan, no?
>
> Or perhaps I'm completely misunderstanding your point? Since you're
> also suggesting using strings, it may mean that you may not intend
> this value to be used for automatically unwrapping the data.
>

I have single images written in HDF5 files with each of these shapes:

(rows, columns)
(1, rows, columns)
(1, 1, rows, columns)

and none of them was part of a mesh. A simple indication that the
data are "2D" is already enough, provided one assumes C ordering. If
not, one still needs an additional way to indicate where to find the
image. Image_CIF specifies the order of faster/slower variation of the
indices, but image_CIF already knows it is dealing with images. In the
general case, you need the image_CIF approach plus knowledge of what
you are dealing with. Both needs are mentioned in the report.
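
As a rough illustration of the C-ordering case, the sketch below uses
NumPy arrays as stand-ins for the three datasets. The helper name
as_frames and the image size are made up for the example; the only
assumption taken from the discussion is that a "2D" item occupies the
last two axes.

```python
import numpy as np

rows, columns = 4, 5
image = np.arange(rows * columns).reshape(rows, columns)

# The same single image stored with the three shapes listed above.
variants = [image, image[np.newaxis], image[np.newaxis, np.newaxis]]


def as_frames(data, intrinsic_dim=2):
    """Collapse all leading axes, keeping the trailing item axes (C ordering)."""
    item_shape = data.shape[data.ndim - intrinsic_dim:]
    return data.reshape((-1,) + item_shape)


for variant in variants:
    frames = as_frames(variant)
    print(variant.shape, "->", frames.shape)
```

Without the C-ordering assumption, the reader would also need to be told
which axes hold the image, which is the part image_CIF covers.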

Really, I considered this to be already clear at the workshop, and it
seems I was not the only one. Otherwise I would not have reflected it
in the report. The only things left to decide were the name and type,
not the need. If I have to devote so much effort to an item that was
already agreed upon, I will give up, wait for somebody else to come up
with a proposal, implement it, and that's it. That was my position less
than 2 years ago.

I am suggesting strings because I already support listmode datasets
from ion beam analysis labs in other formats. One HDF5 analogue is a
continuous set of numbers. In the worst case we will never use
anything other than "0D", "1D", "2D", and so on, but why not leave the
door open to something more explicit than negative numbers that need
to be translated to something else?

Concerning list mode, I guess a way to go in that and other situations
can be indicated by the image corresponding to tutorial 1 at:

http://vitables.berlios.de/screenshots/index.html

Armando

Darren Dale

Jan 25, 2010, 5:59:06 PM

to ma...@googlegroups.com

On Mon, Jan 25, 2010 at 5:20 PM, Vicente Sole <so...@esrf.fr> wrote:
> Really, I considered this was already clear at the workshop and it seems I
> was not the only one. Otherwise I would not have reflected it in the report.
> The only thing left to decide was name and type but not need. If for such an
> already agreed item, I have to devote so much effort, I will give up,
> wait for somebody else to come up with a proposal, implement it and that's it.
> That was my position less than 2 years ago.

I think this is unfair. We spent a few minutes discussing this issue
at the workshop, and it was obviously not as clear as you seem to
believe. The implementation was not specified or discussed until it
was requested in this thread, and there has been obvious confusion
about the intended use (see Pete's interpretation, or Matt's.) If the
goal is a commonly accepted exchange format, consensus is needed.

> I am suggesting strings because I already support listmode
> datasets from ion beam analysis labs in other formats. One HDF5 analogue is
> a continuous set of numbers. In the worst of the cases we will never use
> anything else than "0D", "1D", "2D", and so on, but why not leave the
> door open to have something more explicit than negative numbers that need to
> be translated to something else?

You are advocating combining two different, unrelated aspects into a
single attribute, which is unadvisable. If you want to communicate
more information about how those intrinsic units are organized
(ListMode, or whatever), it is better to encapsulate that into a
separate attribute. What if analysis program X only cares if the
intrinsic dimensions are 1 or 2? Does it have to parse the string and
see if it starts or ends with, or contains "1D" or "2D"?
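
To make the concern concrete, here is a small sketch of what a reader
would have to do in each case. The string forms ("2D", "ListMode 1D")
and the second attribute name "organization" are hypothetical; nothing
here is an agreed convention.

```python
import re


def intrinsic_dim_from_string(value):
    """Extract N from strings such as '2D' or 'ListMode 1D' (hypothetical forms)."""
    match = re.search(r"(\d+)D", value)
    if match is None:
        raise ValueError(f"no dimensionality found in {value!r}")
    return int(match.group(1))


# Combined string attribute: every reader needs a parser like the one above.
combined = {"intrinsic_dimensions": "ListMode 1D"}
dim_a = intrinsic_dim_from_string(combined["intrinsic_dimensions"])

# Separate attributes: the integer is usable directly, the rest is optional.
separate = {"intrinsic_dimensions": 1, "organization": "ListMode"}
dim_b = separate["intrinsic_dimensions"]

print(dim_a, dim_b)
```

A program that only cares whether the items are 1D or 2D can ignore the
second attribute entirely in the separate-attribute layout.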

Darren

Matt Newville

Jan 25, 2010, 8:27:05 PM

to ma...@googlegroups.com

Hi Darren,

I definitely feel I'm having trouble keeping up with this conversation....

>> The principal advantage of this approach is that attributes can be
>> assigned to individual datasets. The disadvantage is that there are
>> many more datasets, and so a naming convention for these datasets must
>> be described. Names like "Positioner/1", and "Detector/28" seem
>> vague, but I believe that Darren's HDF5 files had arrays with labels
>> like 'Ca'. I think relying on the name to convey scan-specific meaning
>> is probably bound to fail.
>
> I don't rely on the name to convey meaning, I use attributes for that.
> The name, like 'Ca' (which in this case is a SCA ROI corresponding to
> calcium fluorescence) comes from the identity of that particular
> counter in the spec config file. At the workshop, Matt advocated for a
> 2D dataset of shape (npts, nSCAs) (did I remember correctly?) and I
> suggested that identifying it as either a "Counter" or a single
> channel analyzer in the attributes would allow a user or application
> to identify these channels.
>
>> Perhaps "I0" is "iodine from detector 0"?
>
> In my talk at the workshop, I showed how I use a group to collect such
> data. In the entry, next to the measurement group, I have a detector
> group called "vortex", with an attribute that identifies it as a
> multichannel analyzer. That group contains the spectra
> (npts,nchannels) and the various dead time datasets (npts,), and could
> also contain associated SCA ROIs.

So the idea would be that one goes through all datasets in the
detector group, not looking at the dataset name, and finds one with a
specific attribute (probably with a fixed name, say 'type') that
defines whether it is an MCA or Scalar or AreaDetector or whatever?
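
A sketch of that lookup, with plain dictionaries standing in for an HDF5
group and its dataset attributes (with h5py the attributes would come
from each dataset's .attrs mapping). The attribute name 'type', the
dataset names and the type strings are all assumptions for illustration.

```python
# Stand-in for a detector group: dataset name -> attributes.
detector_group = {
    "vortex1": {"type": "MCA", "label": "silicon drift detector"},
    "I0": {"type": "Scaler", "label": "upstream ion chamber"},
    "Ca": {"type": "ROI", "label": "Ca Kalpha"},
    "mca2": {"type": "MCA", "label": "second element"},
}


def find_by_type(group, wanted):
    """Return the names of all datasets whose 'type' attribute matches."""
    return sorted(
        name for name, attrs in group.items() if attrs.get("type") == wanted
    )


# "List all MCA spectra" without relying on the dataset names.
print(find_by_type(detector_group, "MCA"))
```

With fixed names such as MCA1 the same query would be a name match
instead, so the two schemes differ mainly in where the rigid part lives.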

That would be as opposed to naming the datasets MCA1, Scaler29, etc
and have a label as an attribute? I'm not sure this makes a big
difference, but it's the sort of thing to work out. Short,
predictable names might make parsing easier, while flexible names and
rigid attributes might make direct human viewing easier for those who
understand the selected names (of course, for those who don't
understand the selected names it might be harder). Will 'Vortex' mean
the same thing to everyone five years from now? MCA1 is a boring
name, but probably more universally understood, and understandable by
a machine (code for "list all MCA spectra" is obvious). As for Epics
Variable names: they are generally unique, if not exactly the most
mnemonic of names. They often mean something to the beamline
scientists, and not much to anyone else. 13IDC:m17.VAL is "Position
of Motor 17 at station IDC at sector 13" (of course!).
13SSD1:dpx1.PKTIM is "Peaking Time of DXP module 1 of Silicon Drift
Detector 1 at sector 13". I don't recommend presenting these to users
and asking them to find their data.

For non-predefined names for datasets, I'd worry about things like
whitespace, unicode, etc. I'm sure these are solvable, but it would
mean coming up with naming conventions. If instead the dataset names
could be relied upon to have meaning, then the "label" attribute could
be a string without formatting restrictions, to be understood only by
a person. Again, I can be persuaded either way.

>> A multi-element detector may need names like "mca1 Ca Kbeta", but that
>> leads to a host of potential problems (mca first or last, can unicode
>> be used?, etc). I guess it's best to use "Positioner/1" with
>> attributes of "Label", "Address", and so on.
>
> I would strongly encourage using a more meaningful identifier,
> whatever the user or beamline scientist decides to call channel 1, in
> this case "Positioner/monoff" or whatever. If I want to work
> interactively with the data, it is easier for me to open the hdf5 file
> and request f['/entry1/measurement/positioners/monoff'] than it is to
> remember which channel corresponds to the label or mnemonic that is
> used to identify that positioner. I agree the address should be saved
> as an attribute.

Right. I think the questions are: meaningful to whom, and in what way?
Using MCA1 might be fairly clear, even if the type of MCA is not.
Names like "Monitor" and "Fluorescence" might seem meaningful but be
deceptively ambiguous.

Already in this thread, I've been confused many times, for instance.
At the HDF5 Workshop, it took me quite a while to understand the
meaning of the word Application in Nexus Application Definitions
(Application as in Executable Program or Application as Scientific
Technique). Perhaps I'm unusually dense about such things.

>> Anyway, I'm fine with this layout. With more reliance on attributes,
>> it does allow Positioners and Detectors to appear more as Objects (or
>> at least Structures). Perhaps we'll want to define some Positioner
>> and Detector types (StepperMotor_Positioner, DCMono_Positioner,
>> Scalar_Detector, ROI_Detector, MultiElementFluorescence_Detector,
>> Area_Detector, etc) that have defined attributes. I'd hesitate to
>> make each of these datasets into groups, but that might be required
>> for completeness.
>
> I think I am in complete agreement, and some of this is already
> implemented in phynx. I also have reservations about making each of
> these datasets groups. I only envisioned groups for detectors that
> produce multiple datasets, like an MCA (spectra,deadtime,SCAs). I have
> not used groups for things like ion chambers, for which datasets and
> associated attributes seems sufficient.

If I typically collect 3 ion chamber intensities and 40 ROIs in
addition to 4 x 2048 MCAs at each point in the scan, how many detector
datasets should I have? I can see how it could be 1, 2, 44 or 47:
  1: 1 array of (npts, 43+4*2048)
  2: 1 array of (npts, 43), 1 array of (npts, 4, 2048)
 44: 43 arrays of (npts,), 1 array of (npts, 4, 2048)
 47: 43 arrays of (npts,), 4 arrays of (npts, 2048)

I'm sure there are other variations. But it seems to me that this is
precisely the question to be answered. I believe you're suggesting the
44 dataset solution, or perhaps it's the 47 datasets solution. Either
is OK with me.
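
For reference, the four layouts can be written out and checked with a few
lines of Python. The channel counts come from the example above; the
function names are just for this sketch.

```python
from math import prod

N_ION, N_ROI = 3, 40       # scalar channels recorded at each point
N_MCA, N_CHAN = 4, 2048    # MCA elements and channels per element
N_SCALAR = N_ION + N_ROI   # 43


def candidate_layouts(npts):
    """Return {dataset_count: [shape, ...]} for the four layouts listed above."""
    return {
        1: [(npts, N_SCALAR + N_MCA * N_CHAN)],
        2: [(npts, N_SCALAR), (npts, N_MCA, N_CHAN)],
        44: [(npts,)] * N_SCALAR + [(npts, N_MCA, N_CHAN)],
        47: [(npts,)] * N_SCALAR + [(npts, N_CHAN)] * N_MCA,
    }


def values_per_point(shapes):
    """Count the numbers stored per scan point, ignoring the leading npts axis."""
    return sum(prod(shape[1:]) for shape in shapes)


for count, shapes in candidate_layouts(npts=100).items():
    print(count, len(shapes), values_per_point(shapes))
```

All four hold the same 43 + 4*2048 values per point; they differ only in
how many datasets need names and attributes.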

>> Speaking of completeness, I do find it interesting that it is
>> preferred for Positioners to be as a set of 1D datasets of Single
>> Position Values that must be unraveled,
>
> I don't understand what you mean, could you clarify? A positioner can
> only have one value per position (or two, a target position and a
> measured position, which I haven't considered before)

One could consider a 2D Map to be a grid of points, sampling over a 2D
set of values for a thing called "Sample Position" -- a multi-axis
positioner. That's not actually different than an Area Detector (a 2D
array of point detectors) or a multi-element-multi-channel-array
(either a 1D array of MCA spectra or a 2D array of "data").

We're unraveling the map pixels to reduce the dimensionality of the
motion to a 1-d array of multi-axis motions. But we're not also
unraveling the detectors to a 1-d array of multi-dimensional data.
Again, I'm not advocating this (the "1 dataset option" above). But,
as with the Positioners, NOT using this approach does mean choices and
assumptions have to be made about the shape of detector data that do
not have to be made about the positioners. Again, I'm not advocating
this approach, just noting the difference as I try to think through
these things.
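
The asymmetry can be seen in a small NumPy sketch of a 2 x 3 map. The
axis values and the 4-channel detector are made up; the point is only the
resulting shapes.

```python
import numpy as np

x = np.array([0.0, 0.1, 0.2])   # fast axis positions
y = np.array([5.0, 6.0])        # slow axis positions

# The "multi-axis positioner" seen as a 2D grid of (y, x) values...
yy, xx = np.meshgrid(y, x, indexing="ij")   # both have shape (ny, nx)

# ...and unraveled into one 1D dataset per positioner, in acquisition order.
pos_x = xx.ravel()
pos_y = yy.ravel()

# A detector keeps its own item shape behind the point axis.
nchannels = 4
spectra = np.zeros((pos_x.size, nchannels))   # (npts, nchannels)

print(pos_x.shape, pos_y.shape, spectra.shape)
```

Each positioner ends up as (npts,), while the detector stays
(npts, nchannels), so only the detector side needs a decision about how
its per-point data are shaped.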

>> Finally, specifying the "intrinsic" or "intended" dimensionality of a
>> scan is definitely needed to understand the data. I would
>> suggest this to be called "dimension" and it be an integer (not a
>> string) attribute of the Scan (or Entry or Measurement or whatever the
>> Main Group is called).
>
> I think there are two different dimensionalities to consider.
> "intrinsic_dimensionality" indicates what is the natural
> dimensionality of the "items" in my above examples. Some other
> attribute (I suggested "acquisition_shape") would serve the purpose
> you just described, and I agree that the Scan or Entry would be the
> appropriate place.

OK, I got it now. Thanks, and sorry for adding noise. If detectors
are stored with their "intrinsic dimension" at each point in the scan
(seems to be the consensus), wouldn't the shape of the data array be
self-describing? Anyway, this is fine with me. I think I still may
not understand why it would be anything but an integer.

Again, sorry for not keeping up. Cheers,

--Matt

Matt Newville

Jan 25, 2010, 9:15:05 PM

to ma...@googlegroups.com

Hi Armando,

> I have a single image written in an HDF5 as:
>
> (rows, columns)
> (1, rows, columns)
> (1, 1, rows, columns)
>
> and none of them was part of a mesh. A simple indication about the data
> being "2D" is already enough, provided one assumes C ordering. If not, one
> still needs an additional way to indicate where to find the image. Image_CIF
> specifies the order of faster/slower variation of the indices, but image_CIF
> already knows it is dealing with images. In the general case, you need the
> image_CIF approach plus to know what you are dealing with. Both needs are
> mentioned in the report.

Perhaps I don't understand image_CIF well enough to get your point.
It seems you are saying that array ordering on disk is important. But
you keep calling this C order, which I don't understand. What you
*wrote* is C order, how it is stored on disk is an implementation
detail of HDF5. Someone using a "speedometer language" (Fortran,
IDL) to read that image from an HDF5 file written in a C-language in
C-order would also see the data laid out "the fast way" for that
language (that is as (columns,rows,1,1)). Isn't that one of the key
points of using HDF5?
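
A NumPy analogy for this point (NumPy only, not HDF5 itself): the same
buffer can be viewed with the last index varying fastest or with the
first index varying fastest, and the two views simply have reversed
shapes. The array size is arbitrary.

```python
import numpy as np

rows, columns = 2, 3

# Written "the C way": last index varies fastest.
c_view = np.arange(rows * columns).reshape(1, 1, rows, columns)

# The same buffer seen "the Fortran way": first index varies fastest.
f_view = c_view.T

print(c_view.shape, f_view.shape)
print(np.shares_memory(c_view, f_view))
```

No data are moved between the two views; only the order in which the
dimensions are listed changes.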

> Really, I considered this was already clear at the workshop and it seems I
> was not the only one. Otherwise I would not have reflected it in the report.
> The only thing left to decide was name and type but not need. If for such an
> already agreed item, I have to devote so much effort, I will give up,
> wait for somebody else to come up with a proposal, implement it and that's it.
> That was my position less than 2 years ago.

I left the meeting with the view that very little had been decided for
how to store MAHID data beyond "use HDF5", and that we would work on
the details over the next few months. If consensus is to be reached,
it will definitely take effort. I am trying to be as clear and
open-minded as possible, and trying to bring as little past baggage as
possible.

> I am suggesting strings because I already support listmode
> datasets from ion beam analysis labs in other formats. One HDF5 analogue is
> a continuous set of numbers. In the worst of the cases we will never use
> anything else than "0D", "1D", "2D", and so on, but why not leave the
> door open to have something more explicit than negative numbers that need to
> be translated to something else?

Sorry, I think I don't understand this. Deliberately keeping a
number as a string means that it is meant for human reading, not to be
used in any calculation. As an attribute of a Detector dataset, I
just can't imagine how "intrinsic_dimensionality" = "2D" is more
meaningful to anyone than "dimension=2", whereas the latter can be
used to create storage for the array.

Again, it's possible that I am just not understanding your point.

Cheers,

--Matt

Matt Newville

Jan 25, 2010, 10:06:17 PM

to ma...@googlegroups.com

> detail of HDF5. Someone using a "speedometer language" (Fortran,
> IDL) to read that image from an HDF5 file written in a C-language in
> C-order would also see the data laid out "the fast way" for that
> language (that is as (columns,rows,1,1)).

Of course I meant "non-speedometer language" (Fortran, IDL)....
Sorry,

--Matt

Darren Dale

Jan 25, 2010, 10:38:29 PM

to ma...@googlegroups.com

On Mon, Jan 25, 2010 at 8:27 PM, Matt Newville
<newv...@cars.uchicago.edu> wrote:
>> I would strongly encourage using a more meaningful identifier,
>> whatever the user or beamline scientist decides to call channel 1, in
>> this case "Positioner/monoff" or whatever. If I want to work
>> interactively with the data, it is easier for me to open the hdf5 file
>> and request f['/entry1/measurement/positioners/monoff'] than it is to
>> remember which channel corresponds to the label or mnemonic that is
>> used to identify that positioner. I agree the address should be saved
>> as an attribute.
>
> Right. I think the questions are: meaningful to whom, and in what way?
> Using MCA1 might be fairly clear, even if the type of MCA is not.
> Names like "Monitor" and "Fluorescence" might seem meaningful but be
> deceptively ambiguous.

I agree. That is a risk in allowing the user or beamline scientist to
choose whatever name they want, but that situation may be preferable
to trying to dictate what things must be called in order to conform to
a standard, especially when attributes can help provide context. Just
my opinion.

> Already in this thread, I've been confused many times, for instance.
> At the HDF5 Workshop, it took me quite a while to understand the
> meaning of the word Application in Nexus Application Definitions
> (Application as in Executable Program or Application as Scientific
> Technique). Perhaps I'm unusually dense about such things.

I would have suggested calling it NXanalysis instead of NXapplication,
I think that would have been clearer.

>>> Anyway, I'm fine with this layout. With more reliance on attributes,
>>> it does allow Positioners and Detectors to appear more as Objects (or
>>> at least Structures). Perhaps we'll want to define some Positioner
>>> and Detector types (StepperMotor_Positioner, DCMono_Positioner,
>>> Scalar_Detector, ROI_Detector, MultiElementFluorescence_Detector,
>>> Area_Detector, etc) that have defined attributes. I'd hesitate to
>>> make each of these datasets into groups, but that might be required
>>> for completeness.
>>
>> I think I am in complete agreement, and some of this is already
>> implemented in phynx. I also have reservations about making each of
>> these datasets groups. I only envisioned groups for detectors that
>> produce multiple datasets, like an MCA (spectra,deadtime,SCAs). I have
>> not used groups for things like ion chambers, for which datasets and
>> associated attributes seems sufficient.
>
> If I typically collect 3 ion chamber intensities and 40 ROIs in
> addition to 4 x 2048 MCAs at each point in the scan, how many detector
> datasets should I have? I can see how it could be 1, 2, 44 or 47:
> 1: 1 array of (npts, 43+4*2048)

Yuck.

> 2: 1 array of (npts,43) 1 array of (npts, 4, 2048)
> 44: 43 arrays of (npts,) 1 array of (npts, 4, 2048)
> 47: 43 arrays of (npts,) 4 arrays of (npts, 2048)

I think 44 or 47 makes sense. Armando suggested that if you have
independent calibrations for each of your 4 MCAs, it makes sense to
separate them. I agree.

> I'm sure there are other variations. But it seems to me that this is
> precisely the question to be answered. I believe you're suggesting the
> 44 dataset solution, or perhaps it's the 47 datasets solution. Either
> is OK with me.

Yes.

>>> Speaking of completeness, I do find it interesting that it is
>>> preferred for Positioners to be as a set of 1D datasets of Single
>>> Position Values that must be unraveled,
>>
>> I don't understand what you mean, could you clarify? A positioner can
>> only have one value per position (or two, a target position and a
>> measured position, which I haven't considered before)
>
> One could consider a 2D Map to be a grid of points, sampling over a 2D
> set of values for a thing called "Sample Position" -- a multi-axis
> positioner. That's not actually different than an Area Detector (a 2D
> array of point detectors) or a multi-element-multi-channel-array
> (either a 1D array of MCA spectra or a 2D array of "data").

But it is collected in a different way, the grid is sampled as a
function of time.

> We're unraveling the map pixels to reduce the dimensionality of the
> motion to a 1-d array of multi-axis motions. But we're not also
> unraveling the detectors to a 1-d array of multi-dimensional data.
> Again, I'm not advocating this (the "1 dataset option" above). But,
> as with the Positioners, NOT using this approach does mean choices and
> assumptions have to be made about the shape of detector data that do
> not have to be made about the positioners. Again, I'm not advocating
> this approach, just noting the difference as I try to think through
> these things.

I understand. I think this gets at the heart of the
intrinsic_dimensionality attribute that Armando has suggested. This
thread started when I was looking through the nexus documentation and
noticed an axes attribute on some datasets that I thought might
already address the issue in a different way. This is probably not the
case, since not all datasets carry axes attributes, so
intrinsic_dimensionality is a simple way of communicating (rather than
assuming) information about the shape of the detector dataset.

>>> Finally, specifying the "intrinsic" or "intended" dimensionality of a
>>> scan is definitely needed to understand the data. I would
>>> suggest this to be called "dimension" and it be an integer (not a
>>> string) attribute of the Scan (or Entry or Measurement or whatever the
>>> Main Group is called).
>>
>> I think there are two different dimensionalities to consider.
>> "intrinsic_dimensionality" indicates what is the natural
>> dimensionality of the "items" in my above examples. Some other
>> attribute (I suggested "acquisition_shape") would serve the purpose
>> you just described, and I agree that the Scan or Entry would be the
>> appropriate place.
>
> OK, I got it now. Thanks, and sorry for adding noise. If detectors
> are stored with their "intrinsic dimension" at each point in the scan
> (seems to be the consensus), wouldn't the shape of the data array be
> self-describing?

Only for the case where any scan is flattened into (npts,...) rather
than (nx,ny,...). If datasets are structured according to the latter
scheme, which appears to be the normal way for nexus, then you need
more information to know whether a 3-dimensional dataset, taken out of
context, is a 2D scan of spectra, a 3D scan of a counter, or a 1D scan
of images.
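
A short sketch of how an integer attribute resolves that ambiguity,
assuming C ordering so that the item occupies the trailing axes. The
function name split_shape and the example shape are hypothetical.

```python
def split_shape(shape, intrinsic_dim):
    """Split a dataset shape into (scan_shape, item_shape).

    Assumes C ordering, i.e. the item occupies the trailing axes.
    """
    if not 0 <= intrinsic_dim <= len(shape):
        raise ValueError("intrinsic dimension exceeds dataset rank")
    cut = len(shape) - intrinsic_dim
    return tuple(shape[:cut]), tuple(shape[cut:])


shape = (20, 30, 40)   # one 3-dimensional dataset, taken out of context
readings = [
    (0, "3D scan of a counter"),
    (1, "2D scan of spectra"),
    (2, "1D scan of images"),
]
for dim, meaning in readings:
    print(dim, meaning, split_shape(shape, dim))
```

The shape alone supports all three readings; the attribute picks one.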

> Anyway, this is fine with me. I think I still may
> not understand why it would be anything but an integer.

I don't either, but maybe Armando would be willing to clarify.

> Again, sorry for not keeping up. Cheers,

No need for apologies. Cheers,

Darren

Andy Gotz

Jan 25, 2010, 5:19:34 PM

to ma...@googlegroups.com

Hi Darren,

>
> "intrinsic_dimensions" would not be the same for all datasets. That
> attribute does not communicate anything about the scan dimensions, it
> communicates information about the atomic unit of a dataset. An area
> detector would have intrinsic_dimensions value of 2 or "2D", a
> single-element MCA would be 1 or "1D". Does that make sense?
> "intrinsic_dimensions" is most necessary for the current nexus way of
> storing scans over an area or volume.
>
>

This is exactly what I (and I think Armando) mean. Personally I find the
term "intrinsic_dimension" confusing. Intrinsic to whom? It appears to
me we are talking more about a data type, e.g. a 2D_Image of Floats or
Ints or whatever, or a 1D_Spectrum of Floats etc.

Looking at the EOS (Earth Sciences) community's use of HDF, they have
defined 3 datatypes specific to their needs: Point, Swath and Grid,
which are defined as:

The Point interface is designed to support data that has associated
geolocation information, but is not organized in any well defined
spatial or temporal way. The Swath interface is tailored to support
time-ordered data such as satellite swaths (which consist of a
time-ordered series of scanlines), or profilers (which consist of a
time-ordered series of profiles). The Grid interface is designed to
support data that has been stored in a rectilinear array based on a well
defined and explicitly supported projection.

IMHO we only need a few datatypes like Point, Spectrum, Image and Cube
to cover the 4 basic dimensions. Later on we might need more dimensions
to cover the hyperspectral case but I would be happy with a solution for
the basic types first.

> "acquisition_shape" could convey information about the shape of the
> scan itself. This is the data you are looking for to take a 1D dataset
> of shape (npts,) and yield a 2D array of shape (ypts, xpts). I don't
> think this attribute should be attached to all datasets, since it
> could be attached to the entry and would apply to all datasets.
>
>

Fine with me. We could even imagine different types of shapes like
RegularGrid, SparseGrid etc. Just an idea.
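
As a rough sketch of how a reader could use such an attribute, the snippet below regrids a flat (npts,) dataset using an "acquisition_shape" of (ypts, xpts). It assumes a regular grid stored in row-major order with the x positioner varying fastest; neither assumption is settled in this thread, and `to_grid` is a hypothetical helper.

```python
def to_grid(flat, acquisition_shape):
    """Regrid a flat (npts,) sequence into ypts rows of xpts values.

    Assumes row-major order: the x positioner varies fastest.
    """
    ypts, xpts = acquisition_shape
    if len(flat) != ypts * xpts:
        raise ValueError("acquisition_shape does not match the number of points")
    return [list(flat[row * xpts:(row + 1) * xpts]) for row in range(ypts)]


counts = [3, 5, 8, 2, 4, 6]          # one counter value per scan point
grid = to_grid(counts, (2, 3))       # 2 rows (y) by 3 columns (x)
print(grid)
```

A sparse or irregular scan would need the positioner values themselves rather than a shape, which is where labels such as RegularGrid or SparseGrid could come in.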

I think we agree.

Andy

"V. Armando Solé"

Jan 26, 2010, 2:11:07 AM

to ma...@googlegroups.com

Matt Newville wrote:
> Hi Armando,
>
>
>> I have a single image written in an HDF5 as:
>>
>> (rows, columns)
>> (1, rows, columns)
>> (1, 1, rows, columns)
>>
>> and none of them was part of a mesh. A simple indication about the data
>> being "2D" is already enough, provided one assumes C ordering. If not, one
>> still needs an additional way to indicate where to find the image. Image_CIF
>> specifies the order of faster/slower variation of the indices, but image_CIF
>> already knows it is dealing with images. In the general case, you need the
>> image_CIF approach plus knowledge of what you are dealing with. Both needs are
>> mentioned in the report.
>>
>
> Perhaps I don't understand image_CIF well enough to get your point.
> It seems you are saying that array ordering on disk is important.

No, it is not important in itself. It is important when someone other than
the person who wrote the data is going to read it. There are 8 different
ways of storing or reading a 2D image. SAXS people are well aware of
that, and those issues are handled by image_CIF.

> But
> you keep calling this C order, which I don't understand. What you
> *wrote* is C order, how it is stored on disk is an implementation
> detail of HDF5. Someone using a "speedometer language" (Fortran,
> IDL) to read that image from an HDF5 file written in a C-language in
> C-order would also see the data laid out "the fast way" for that
> language (that is as (columns,rows,1,1)). Isn't that one of the key
> points of using HDF5?
>

Yes. And what you say is true: MATLAB reports the shape as (columns, rows, 1, 1).
I keep calling it C order just to underline the fact that the data are
contiguous in memory when directly mapping the files to memory. If someone
tells me (columns, rows, 1, 1), I would think the images are not arranged one
after the other.
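
The point about the two conventions can be checked with plain index arithmetic. The sketch below is only an illustration (the helper names are made up, and the 3 x 4 image size is arbitrary): a C-order shape (1, 1, rows, columns) and the reversed Fortran-order shape (columns, rows, 1, 1) address the same contiguous buffer.

```python
def c_order_offset(index, shape):
    """Flat offset when the LAST index varies fastest (C convention)."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset


def fortran_order_offset(index, shape):
    """Flat offset when the FIRST index varies fastest (Fortran, IDL, MATLAB)."""
    offset, stride = 0, 1
    for i, n in zip(index, shape):
        offset += i * stride
        stride *= n
    return offset


c_shape = (1, 1, 3, 4)                    # (1, 1, rows, columns)
f_shape = tuple(reversed(c_shape))        # (columns, rows, 1, 1)

c_index = (0, 0, 1, 2)                    # row 1, column 2
f_index = tuple(reversed(c_index))        # column 2, row 1

# Both conventions land on the same element of the buffer.
print(c_order_offset(c_index, c_shape), fortran_order_offset(f_index, f_shape))
```

So the reported shape differs between languages, but each image still occupies one contiguous run of rows * columns values.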

>> I am suggesting strings because I already support listmode
>> datasets from ion beam analysis labs in other formats. One HDF5 analogue is
>> a continuous set of numbers. In the worst case we will never use
>> anything other than "0D", "1D", "2D", and so on, but why not leave the
>> door open to something more explicit than negative numbers that need to
>> be translated to something else?
>>
>
> Sorry, I think I don't understand this. Deliberately keeping a
> number a string means that it is meant for human reading, not to be
> used in any calculation.

Yes.

> As an attribute of a Detector dataset, I
> just can't imagine how "intrinsic_dimensionality" = "2D" is more
> meaningful to anyone than "dimension=2", whereas the latter can be
> used to create storage for the array.
>

The goal of that attribute is not to create storage for the array. In
fact, I do not read large datasets into memory, I just map
them. The "storage needs" are already given by HDF5 and nothing more is
needed. If the value of the attribute were "ScalarData",
"Spectrum", "ImageData", and so on, you would understand its meaning, why
Darren may not need it, and why a generic program needs it.

> Again, it's possible that I am just not understanding your point.
>

I think the last paragraph will help.

Armando

Carlos Pascual Izarra

Jan 26, 2010, 4:01:51 AM

to ma...@googlegroups.com

On Monday 25 January 2010 23:59:06 Darren Dale wrote:
> What if analysis program X only cares if the
> intrinsic dimensions are 1 or 2? Does it have to parse the string and
> see if it starts or ends with, or contains "1D" or "2D"?

I think the criticism above does not apply to Armando's proposal: the string
is not meant to be a flexible place where you can add a lot of info
that the analysis program will need to parse. Instead, it is just a
label from a closed and agreed-upon enumeration.
The merit of it is that you leave the door open to other "atomic data types",
compared to the more rigid use of integers.

That said, I would be happy with either Armando's (value=string) or Darren's
(value=int and possibly other attributes in the future) approaches.
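
A closed enumeration of string labels could look like the sketch below. The label values follow the Point/Spectrum/Image/Cube names floated earlier in the thread, but they are placeholders: no list of accepted values has been agreed on, and `DataType`, `RANK` and `parse_data_type` are hypothetical names.

```python
from enum import Enum


class DataType(Enum):
    """Hypothetical closed set of labels; the real list is still undecided."""
    POINT = "Point"
    SPECTRUM = "Spectrum"
    IMAGE = "Image"
    CUBE = "Cube"


# A program that only cares about dimensionality looks it up, no parsing.
RANK = {
    DataType.POINT: 0,
    DataType.SPECTRUM: 1,
    DataType.IMAGE: 2,
    DataType.CUBE: 3,
}


def parse_data_type(label):
    """Map an attribute string to the enumeration, rejecting unknown labels."""
    try:
        return DataType(label)
    except ValueError:
        raise ValueError(f"{label!r} is not in the agreed enumeration") from None


kind = parse_data_type("Spectrum")
print(kind, RANK[kind])
```

The reader does an exact match against known labels, so no substring parsing is involved, and new atomic types can be added to the enumeration without changing the attribute's type.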

--
+----------------------------------------------------+
Carlos Pascual Izarra
Scientific Software Contact
Computing Division
Cells / Alba Synchrotron [http://www.cells.es]
Carretera BP 1413 de Cerdanyola-Sant Cugat, Km. 3.3
E-08290 Cerdanyola del Valles (Barcelona), Spain
E-mail: carlos....@cells.es
Phone: +34 93 592 4428
+----------------------------------------------------+

Darren Dale

Jan 26, 2010, 8:11:56 AM

to ma...@googlegroups.com

On Tue, Jan 26, 2010 at 4:01 AM, Carlos Pascual Izarra
<carlos....@cells.es> wrote:
> On Monday 25 January 2010 23:59:06 Darren Dale wrote:
>> What if analysis program X only cares if the
>> intrinsic dimensions are 1 or 2? Does it have to parse the string and
>> see if it starts or ends with, or contains "1D" or "2D"?
>
> I think the criticism above does not apply to Armando's proposal: the string
> is not thought to be used as a flexible place where you can add a lot of info
> and which the analysis program will need to parse. Instead, it is just a
> label from a closed and agreed-upon enumeration.

He gave examples like "ListMode2D". That sounds to me like a dataset
of images organized in list mode. Maybe I misunderstood what he was
getting at.

I would be fine with an attribute named something like "data_type" and
values like "point", "spectrum", "image".

> The merit of it is that you leave the door open to other "atomic data types",
> compared to the more rigid use of integers.

Ok, but I would like to know what other atomic types people have in
mind whose dimensionality can't be expressed by an integer.

> That said, I would be happy with either Armando's (value=string) or Darren's
> (value=int and possibly other attributes in the future) approaches.

I want to be sure our proposals are being fully considered and
documented: semantics, implementation, intended use, and potential for
being used in unintended ways that frustrate the goal of a common
format.

Darren

"V. Armando Solé"

Jan 26, 2010, 8:50:12 AM

to ma...@googlegroups.com

Darren Dale wrote:
> On Tue, Jan 26, 2010 at 4:01 AM, Carlos Pascual Izarra
> <carlos....@cells.es> wrote:
>
>> On Monday 25 January 2010 23:59:06 Darren Dale wrote:
>>
>>> What if analysis program X only cares if the
>>> intrinsic dimensions are 1 or 2? Does it have to parse the string and
>>> see if it starts or ends with, or contains "1D" or "2D"?
>>>
>> I think the criticism above does not apply to Armando's proposal: the string
>> is not thought to be used as a flexible place where you can add a lot of info
>> and which the analysis program will need to parse. Instead, it is just a
>> label from a closed and agreed-upon enumeration.
>>
>
> He gave examples like "ListMode2D". That sounds to me like a dataset
> of images organized in list mode. Maybe I misunderstood what he was
> getting at.
>
>

That should only sound like whatever it is agreed to represent. If it
is agreed that it represents a movie, then it is a movie.

It is just an example, meant to show that an integer is too restrictive.

> I would be fine with an attribute named something like "data_type" and
> values like "point", "spectrum", "image".
>
>

"image_data" or "data_image" are better. For "image" I understand
something else.

Add "vertex" to the list and I can already go a long way.

>> The merit of it is that you leave the door open to other "atomic data types",
>> compared to the more rigid use of integers.
>>
>
> Ok, but I would like to know what other atomic types people have in
> mind whose dimensionality can't be expressed by an integer.
>

Not so difficult: any list mode, series of pictures, streamed data, ...

Armando

Darren Dale

Jan 26, 2010, 9:11:10 AM

to ma...@googlegroups.com

On Tue, Jan 26, 2010 at 8:50 AM, "V. Armando Solé" <so...@esrf.fr> wrote:
> Darren Dale wrote:
>>
>> On Tue, Jan 26, 2010 at 4:01 AM, Carlos Pascual Izarra
>> <carlos....@cells.es> wrote:
>>
>>>
>>> On Monday 25 January 2010 23:59:06 Darren Dale wrote:
>>>
>>>>
>>>> What if analysis program X only cares if the
>>>> intrinsic dimensions are 1 or 2? Does it have to parse the string and
>>>> see if it starts or ends with, or contains "1D" or "2D"?
>>>>
>>>
>>> I think the criticism above does not apply to Armando's proposal: the
>>> string
>>> is not thought to be used as a flexible place where you can add a lot of
>>> info
>>> and which the analysis program will need to parse. Instead, it is just a
>>> label from a closed and agreed-upon enumeration.
>>>
>>
>> He gave examples like "ListMode2D". That sounds to me like a dataset
>> of images organized in list mode. Maybe I misunderstood what he was
>> getting at.
>>
>>
>
> That should only sound as whatever that it is agreed to represent. If it is
> agreed that it represents a movie then it is a movie.
>
> It is just an example and just to show that an integer is too restrictive.

I misunderstood what you meant by "ListMode". Did you mean "a list of
2-dimensional arrays collected at each point in a measurement"? I
thought you meant "a 2-dimensional array collected at each point in
the scan, and the scan points are organized into a list".

>> I would be fine with an attribute named something like "data_type" and
>> values like "point", "spectrum", "image".
>>
>>
>
> "image_data" or "data_image" are better. For "image" I understand something
> else.

You mean that "image" might be something like a jpg, tiff or png, right?

> Add "vertex" to the list and I can already make a long way.

Ok, these examples are more illustrative, and I see where you are going.

>>> The merit of it is that you leave the door open to other "atomic data
>>> types",
>>> compared to the more rigid use of integers.
>>>
>>
>> Ok, but I would like to know what other atomic types people have in
>> mind whose dimensionality can't be expressed by an integer.
>>
>
> Not so difficult: any list mode, series of pictures, streamed data, ...

Ok, I think I am with you now. Let's be careful about selecting the
values and make sure accepted values are well documented. Matt, what
do you think?

Darren

"V. Armando Solé"

Jan 26, 2010, 9:25:12 AM

to ma...@googlegroups.com

Darren Dale wrote:

> On Tue, Jan 26, 2010 at 8:50 AM, "V. Armando Solé" <so...@esrf.fr> wrote:
>
>> Darren Dale wrote:
>>
>>>
>>> He gave examples like "ListMode2D". That sounds to me like a dataset
>>> of images organized in list mode. Maybe I misunderstood what he was
>>> getting at.
>>>
>>>
>>>
>> That should only sound as whatever that it is agreed to represent. If it is
>> agreed that it represents a movie then it is a movie.
>>
>> It is just an example and just to show that an integer is too restrictive.
>>
>
> I misunderstood what you meant by "ListMode". Did you mean "a list of
> 2-dimensional arrays collected at each point in a measurement"? I
> thought you meant "a 2-dimensional array collected at each point in
> the scan, and the scan points are organized into a list".
>

No, I did not mean that. Some of the list modes I support are just a
continuous list of integers structured as i1,i2,i3,i4, in which the first
two represent the raster position and the last two the channel hit at a
fluorescence detector and an RBS detector. That corresponds to a 2D scan
with two associated 1D detectors, so it is a mixture of several
things that need to be interpreted in a particular way. That particular
way can be identified by a particular label, which would need to be
decided the day the problem arises.
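
To make the list-mode example concrete, here is a minimal sketch of how a reader might unpack such a stream once the label tells it what the layout is. The field order (x, y, fluorescence channel, RBS channel) is an assumption consistent with the description above, and `decode_list_mode` is a hypothetical helper, not an existing API.

```python
from collections import Counter, defaultdict


def decode_list_mode(stream):
    """Accumulate per-pixel spectra from a flat i1,i2,i3,i4 event stream.

    Assumed layout of each event: raster x, raster y,
    fluorescence channel hit, RBS channel hit.
    """
    if len(stream) % 4 != 0:
        raise ValueError("stream length must be a multiple of 4")
    fluorescence = defaultdict(Counter)
    rbs = defaultdict(Counter)
    for k in range(0, len(stream), 4):
        x, y, fluo_channel, rbs_channel = stream[k:k + 4]
        fluorescence[(x, y)][fluo_channel] += 1
        rbs[(x, y)][rbs_channel] += 1
    return fluorescence, rbs


# Three events: two at pixel (0, 0), one at pixel (1, 0).
stream = [0, 0, 5, 7,
          0, 0, 5, 8,
          1, 0, 3, 7]
fluorescence, rbs = decode_list_mode(stream)
print(dict(fluorescence[(0, 0)]), dict(rbs[(0, 0)]))
```

Nothing in the array's shape reveals this structure, which is why an integer dimensionality alone cannot describe it.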

>>> I would be fine with an attribute named something like "data_type" and
>>> values like "point", "spectrum", "image".
>>>
>>>
>>>
>> "image_data" or "data_image" are better. For "image" I understand something
>> else.
>>
>
> You mean that "image" might be something like a jpg, tiff or png, right?
>

Yes. For some of them HDF5 already provides data types, but not for all
of them.
Again, it is a question of agreement. If we call one thing "image" and the
other one "picture", there is probably no confusion (at least in our
community).

>> Add "vertex" to the list and I can already make a long way.
>>
>
> Ok, these examples are more illustrative, and I see where you are going.
>
>

>> Not so difficult: any list mode, series of pictures, streamed data, ...
>>
>
> Ok, I think I am with you now. Let's be careful about selecting the
> values and make sure accepted values are well documented. Matt, what
> do you think?
>

Do we see the end of the tunnel?

Armando

Pete R. Jemian

Jan 26, 2010, 9:58:37 AM

to ma...@googlegroups.com

On 1/25/2010 9:38 PM, Darren Dale wrote:
>> in response to Matt Newville's observation:

>> Already in this thread, I've been confused many times, for instance.
>> At the HDF5 Workshop, it took me quite a while to understand the
>> meaning of the word Application in Nexus Application Definitions
>> (Application as in Executable Program or Application as Scientific
>> Technique). Perhaps I'm unusually dense about such things.
>
> I would have suggested calling it NXanalysis instead of NXapplication,
> I think that would have been clearer.

I did not understand this last paragraph's context.
But I did understand there is a need to describe NeXus a bit better.

--------------
short answers:
--------------

NeXus uses the latter definition: "Application" is meant to describe
NXDL specifications for scientific techniques and instrument definitions.

Neither of the terms "NXanalysis" or "NXapplication" exists in NeXus.
Was this their first use in this conversation or did I miss something?

-------------------
longer explanations:
-------------------

Class definitions in NeXus prior to 2008 had been in the form of base
classes and instrument definitions. All of these were in the same
category. As the development of NeXus had been led mostly by scientists
from neutron sources, this represented their typical situations.

Both those new to NeXus (myself included) and those already familiar with it saw
the previous emphasis on instrument definitions as a deficiency that
limited flexibility and possibly usage. The point was made that NeXus
should attempt to better describe reduced data and also data for analysis
since synchrotron instruments are rarely adhering to a fixed definition.

The design of NeXus is moving towards an object-oriented approach
where the base classes will be the objects and the "application definitions"
will use the objects to specify the required components as fits some
application. Here, "application" is very loosely defined to include:
* specification of a scientific instrument
  + example: TOF-USANS at SNS
* specification of what is expected for a scientific technique
  + example: small-angle scattering data for common analysis programs
* specification of a generic data acquisition stream
  + example: TOFRAW - raw time-of-flight data from a pulsed neutron source
* specification of input or output of a specific software program

The phrase "the sky is the limit" seems to apply.
The point of the "NeXus Application Definition" is that all of these
start with "NX" and all have been approved by the NIAC.

Those NXDL specifications not yet approved by the NIAC fall into
the category of "NeXus contributed definitions" for which NeXus has a
place in the repository. At present, this place is empty. Think of
this category as a place to put an NXDL (a candidate for a base class or
application definition) for the NIAC to consider approving.

Hope this helps,
Pete

Darren Dale

Jan 26, 2010, 10:03:14 AM

to ma...@googlegroups.com

On Tue, Jan 26, 2010 at 9:25 AM, "V. Armando Solé" <so...@esrf.fr> wrote:
> Darren Dale wrote:
>>

>> On Tue, Jan 26, 2010 at 8:50 AM, "V. Armando Solé" <so...@esrf.fr> wrote:
>>
>>>
>>> Darren Dale wrote:
>>>
>>>>
>>>> He gave examples like "ListMode2D". That sounds to me like a dataset
>>>> of images organized in list mode. Maybe I misunderstood what he was
>>>> getting at.
>>>>
>>>>
>>>>
>>>
>>> That should only sound as whatever that it is agreed to represent. If it
>>> is
>>> agreed that it represents a movie then it is a movie.
>>>
>>> It is just an example and just to show that an integer is too
>>> restrictive.
>>>
>>
>> I misunderstood what you meant by "ListMode". Did you mean "a list of
>> 2-dimensional arrays collected at each point in a measurement"? I
>> thought you meant "a 2-dimensional array collected at each point in
>> the scan, and the scan points are organized into a list".
>>
>
> No. I did not mean that. Some of the list modes I support are just a
> continuous list of integers structured as i1,i2,i3,i4 in which the first two
> represent the raster position and the last two the channel hit at a
> fluorescence detector and an RBS detector. That corresponds to a 2D scan but
> has two 1D detectors associated so, it is a mixture of several things that
> need to be interpreted in a particular way. That particular way can be
> identified by a particular label that would need to be decided the day the
> problem arrives.

This is very similar to some ideas Matt brought up yesterday. If a
common exchange format is to support cases like the above, I suggest
designating them with a value like "structured_data" rather than
"ListMode2D", and use an additional attribute or attributes to
communicate the actual organization of the data in the array. Does
that sound reasonable?

>>>> I would be fine with an attribute named something like "data_type" and
>>>> values like "point", "spectrum", "image".
>>>>
>>>>
>>>>
>>>
>>> "image_data" or "data_image" are better. For "image" I understand
>>> something
>>> else.
>>>
>>
>> You mean that "image" might be something like a jpg, tiff or png, right?
>>
>
> Yes. For some of them HDF5 already provides data types, but not for all of
> them.
> Again is a question of agreement. If we call one thing "image" and the other
> one "picture" probably there is no confusion (at least in our community).

Or image_array vs compressed_image?

>>> Add "vertex" to the list and I can already make a long way.
>>>
>>
>> Ok, these examples are more illustrative, and I see where you are going.
>>
>>
>>>
>>> Not so difficult: any list mode, series of pictures, streamed data, ...
>>>
>>
>> Ok, I think I am with you now. Let's be careful about selecting the
>> values and make sure accepted values are well documented. Matt, what
>> do you think?
>>
>
> Do we see the end of the tunnel?

I think I see it. I suggest we write up some documentation that
explains the issue, the proposed solution, its implementation,
intended use, and accepted values. We can post it to the list and
request (final?) comments, and entertain any additional atomic types
that are currently required or envisioned. How does that sound?

Darren

Darren Dale

Jan 26, 2010, 10:20:54 AM

to ma...@googlegroups.com

On Tue, Jan 26, 2010 at 9:58 AM, Pete R. Jemian <prje...@gmail.com> wrote:
>
>
> On 1/25/2010 9:38 PM, Darren Dale wrote:
>>> in response to Matt Newville's observation:
>>>
>>> Already in this thread, I've been confused many times, for instance.
>>> At the HDF5 Workshop, it took me quite a while to understand the
>>> meaning of the word Application in Nexus Application Definitions
>>> (Application as in Executable Program or Application as Scientific
>>> Technique). Perhaps I'm unusually dense about such things.
>>
>> I would have suggested calling it NXanalysis instead of NXapplication,
>> I think that would have been clearer.
>
> I did not understand this last paragraph's context.
> But I did understand there is a need to describe NeXus a bit better.
>
> --------------
> short answers:
> --------------
>
> NeXus uses the latter definition: "Application" is meant to describe
> NXDL specifications for scientific techniques and instrument definitions.
>
> Neither of the terms "NXanalysis" or "NXapplication" exist in NeXus.
> Was this their first use in this conversation or did I miss something?

This was my mistake. NXanalysis comes from a suggestion in my workshop
talk: the idea was that an NXentry could contain one or more analysis
groups that would contain all the information needed for different
kinds of analysis, of scanning x-ray fluorescence microscopy data for
example.

I don't know where I got NXapplication, I thought I saw it somewhere
in a data file. I was referring to the nexus definitions in the
applications directory in my svn checkout of the definitions trunk.
There was some talk at the workshop of using multiple application
definitions for a single entry, and I think I assumed that this meant
there would be separate groups to contain each "application". I
apologize for adding to the confusion. Looking at the application
definitions now, I see that this is not the case.

> Hope this helps,

It does, thank you.

Darren

"V. Armando Solé"

Jan 26, 2010, 10:33:47 AM

to ma...@googlegroups.com

Darren Dale wrote:

> On Tue, Jan 26, 2010 at 9:25 AM, "V. Armando Solé" <so...@esrf.fr> wrote:
>
>>
>> No. I did not mean that. Some of the list modes I support are just a
>> continuous list of integers structured as i1,i2,i3,i4 in which the first two
>> represent the raster position and the last two the channel hit at a
>> fluorescence detector and an RBS detector. That corresponds to a 2D scan but
>> has two 1D detectors associated so, it is a mixture of several things that
>> need to be interpreted in a particular way. That particular way can be
>> identified by a particular label that would need to be decided the day the
>> problem arrives.
>>
>
> This is very similar to some ideas Matt brought up yesterday. If a
> common exchange format is to support cases like the above, I suggest
> designating them with a value like "structured_data" rather than
> "ListMode2D", and use an additional attribute or attributes to
> communicate the actual organization of the data in the array. Does
> that sound reasonable?
>

My feeling is that list modes will very likely be specific to labs, and
that either each list mode will need a dedicated identifier or they will
just be tagged as list mode and something else will describe their structure.

As I said, I just wanted to leave the door open. We do not need to decide
on that. I guess Mark Rivers will be the first one recording data in
that format once he gets his detector.

> Or image_array vs compressed_image?
>
Well, images are not necessarily compressed. Back in the measurement
group we had chosen (something along the lines of) image_data to illustrate
that we had numbers and not a "traditional" image. My only problem with
image_array is that one can interpret that as an array of pictures :-)
What about image_data and image_picture? I do not like underscores, but
that is very explicit.

> I think I see it. I suggest we write up some documentation that
> explains the issue, the proposed solution, its implementation,
> intended use, and accepted values. We can post it to the list and
> request (final?) comments, and entertain any additional atomic types
> that are currently required or envisioned. How does that sound?
>
>

It seems OK. The issue was already shown at the workshop, but it seems it
needed clarification. For some of the types where we may be unsure about
the naming at the time of presenting it to the list, we could just ask
in a poll. After all, I am interested in the functionality and not in the
semantics.

Armando

Pete R. Jemian

Jan 26, 2010, 10:34:08 AM

to ma...@googlegroups.com

Why do things look so clear _after_ they are sent to a listserve?

...

On 1/26/2010 8:58 AM, Pete R. Jemian wrote:
> since synchrotron instruments are rarely adhering to a fixed definition.

this should have read:
since synchrotron instruments are rarely adhering to a fixed instrument definition.

Darren Dale

Jan 26, 2010, 11:09:44 AM

to ma...@googlegroups.com

On Tue, Jan 26, 2010 at 10:33 AM, "V. Armando Solé" <so...@esrf.fr> wrote:
> Darren Dale wrote:
>>

Ok.

>> Or image_array vs compressed_image?
>>
>
> Well, images are not necessarily compressed. Back in the measurement group
> we had chosen (something along the lines of) image_data to illustrate that
> we had numbers and not a "traditional" image. My only problem with
> image_array is that one can interpret that as an array of pictures :-)
> What about image_data and image_picture?

How about "image" for regularly-gridded images and "encoded_image" or
"formatted_image" for the others?

> I do not like underscores but that is very explicit.

All the existing NeXus attributes use underscores.

Darren

"V. Armando Solé"

Jan 26, 2010, 11:29:09 AM

to ma...@googlegroups.com

Darren Dale wrote:

> On Tue, Jan 26, 2010 at 10:33 AM, "V. Armando Solé" <so...@esrf.fr> wrote:
>
>
>
> How about "image" for regularly-gridded images and "encoded_image" or
> "formatted_image" for the others?
>

Fine. If I have to choose between the two I prefer "encoded_image".

Armando

Matt Newville

Jan 26, 2010, 3:11:41 PM

to ma...@googlegroups.com

Hi Darren, Armando,

I can't keep up with each topic in this thread. I am afraid that this
thread has gone quickly and deeply in many directions at once, which I
think is actually limiting participation in the conversation. I would
certainly not want my silence to be taken as tacit agreement, and am
certain there are interested parties not active in this discussion.
Concrete proposals for formats need to be put forth and thought about.

>> Right. I think the questions are meaningful to whom and in what way.
>> Using MCA1 might be fairly clear, even if the type of MCA is not.
>> Names like "Monitor" and "Fluorescence" might seem meaningful but be
>> deceptively ambiguous.
>
> I agree. That is a risk in allowing the user or beamline scientist to
> choose whatever name they want, but that situation may be preferable
> to trying to dictate what things must be called in order to conform to
> a standard, especially when attributes can help provide context. Just
> my opinion.

If I understand correctly, some names and layout will be mandated,
perhaps as
Entry/Positioners/
and
Entry/Detectors/
(All names here being negotiable, decorating with "NX_" as desired).

If that is already the case, I don't see that allowing
Entry/Detectors/Canberra Ge Elem #3/
with attributes including (type='MCA', npts=2048, dimension=1)

is all that much better than
Entry/Detectors/mca003/
with attributes including
(label='Canberra MED Ge Element #3, APS DetectorPool GE2',
npts=2048, dimension=1)

Having to deal with whitespace, punctuation, (and unicode?) in dataset
names seems like a bad idea to me. The detector label is meant to be
read by a human, but the dataset name will need to be traversed by the
reading program. The first approach would require looking for the
required 'type' attribute (which must be one of some pre-defined list
of valid types) to understand what the data is, and guarantees a
user-defined label. The second approach means the dataset name itself
tells the type (and the dataset name cannot be absent!), and has a
user-defined label as an attribute to help understand the values in
the dataset.
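
The difference between the two approaches, from the reading program's side, might be sketched as follows. Plain dictionaries stand in for HDF5 groups and attributes here; the `VALID_TYPES` set and the "lowercase type plus number" naming pattern are assumptions for illustration, not an agreed convention.

```python
import re

VALID_TYPES = {"MCA"}  # hypothetical pre-defined list of valid types

# First approach: free-form dataset name, type carried by an attribute.
first = {
    "Canberra Ge Elem #3": {"type": "MCA", "npts": 2048, "dimension": 1},
}

# Second approach: predictable dataset name, human label as an attribute.
second = {
    "mca003": {
        "label": "Canberra MED Ge Element #3, APS DetectorPool GE2",
        "npts": 2048,
        "dimension": 1,
    },
}


def type_from_attribute(attrs):
    """First approach: the reader must find and validate a 'type' attribute."""
    detector_type = attrs.get("type")
    if detector_type not in VALID_TYPES:
        raise ValueError(f"missing or unknown type: {detector_type!r}")
    return detector_type


def type_from_name(name):
    """Second approach: the type is encoded in the dataset name itself."""
    match = re.fullmatch(r"([a-z]+)(\d+)", name)
    if match is None or match.group(1).upper() not in VALID_TYPES:
        raise ValueError(f"name does not encode a known type: {name!r}")
    return match.group(1).upper()


for name, attrs in first.items():
    print(name, "->", type_from_attribute(attrs))
for name, attrs in second.items():
    print(name, "->", type_from_name(name), "|", attrs["label"])
```

Either way the reader needs a list of valid types; the choice is whether it lives in an attribute that could be missing or in a name that cannot.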

>> If I typically collect 3 ion chamber intensities and 40 ROIs in
>> addition to 4 x 2048 MCAs at each point in the scan, how many detector
>> datasets should I have? I can see how it could be 1, 2, 44 or 47:
>> 1: 1 array of (npts, 43+4*2048)
>
> Yuck.
>
>> 2: 1 array of (npts,43) 1 array of (npts, 4, 2048)
>> 44: 43 arrays of (npts,) 1 array of (npts, 4, 2048)
>> 47: 43 arrays of (npts,) 4 array of (npts, 2048)
>
> I think 44 or 47 makes sense. Armando suggested that if you have
> independent calibrations for each of your 4 MCAs, it makes sense to
> separate them. I agree.

That's a good point, and I can certainly see how 47 datasets looks
like the most reasonable solution here.

Well, except that this assumes that calibration information is best
held in attributes, which may or may not be the case. It may be best
to store an (npts, 4, 2048) array for the data and a (4, 2048) array
for the MCA Energies at each Channel.
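
The four layouts counted above (1, 2, 44 or 47 datasets) all carry the same number of values per scan point; only the grouping differs. The following sketch enumerates them, with the function names and default arguments chosen for this example only.

```python
from math import prod


def layouts(npts, n_scalars=43, n_mca=4, n_channels=2048):
    """Return {dataset_count: list of dataset shapes} for the four options."""
    return {
        1: [(npts, n_scalars + n_mca * n_channels)],
        2: [(npts, n_scalars), (npts, n_mca, n_channels)],
        44: [(npts,)] * n_scalars + [(npts, n_mca, n_channels)],
        47: [(npts,)] * n_scalars + [(npts, n_channels)] * n_mca,
    }


def values_per_point(shapes):
    """Total number of values stored at each scan point across all datasets."""
    return sum(prod(shape[1:]) for shape in shapes)


for count, shapes in layouts(npts=100).items():
    print(count, "datasets,", values_per_point(shapes), "values per point")
```

With 47 datasets, each MCA can carry its own calibration; with the (npts, 4, 2048) layout the calibration fits naturally in a companion (4, 2048) array.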

I think this just expresses the inherent tension of when something is
an attribute and when it is data. My bias is to use attributes
sparingly, and use them for labels, flags, and descriptive integers
(dimensions, etc) that describe or modify the data contained in the
dataset. If an attribute is required for all Detectors or Positions,
I tend to think something may be wrong.

But perhaps, especially given the "intrinsic_dimensionality"
confusion, we should postpone the Detectors discussion and limit the
topic to how to organize the Positioners Group.

>> OK, I got it now. Thanks, and sorry for adding noise. If detectors
>> are stored with their "intrinsic dimension" at each point in the scan
>> (seems to be the consensus), wouldn't the shape of the data array be
>> self-describing?
>
> Only for the case where any scan is flattened into (npts,...) rather
> than (nx,ny,...). If datasets are structured according to the latter
> scheme, which appears to be the normal way for nexus, then you need
> more information to know whether a 3-dimensional dataset, taken out of
> context, is a 2D scan of spectra, a 3D scan of a counter, or a 1D scan
> of images.

I'm definitely in favor of flattening the positioners from (nx,
ny,...) to (npts,...). But, as above, what is less clear to me is how
to store the Positioner data beyond that. One could have

Entry/Positioners/Positions (npts, NPositioners)
Entry/Positioners/PositionLabels (NPositioners)
Entry/Positioners/PositionAddrs (NPositioners)

At each Point i in the scan, the values of all relevant positioners
are recorded. All their labels and addresses are also stored. As a
bonus, the need for the Positioners Group seems weak, which could
flatten the structure.

I have the sense that some would prefer
Entry/Positioners/Position1 (npts,) attributes: label, address
Entry/Positioners/Position2 (npts,) attributes: label, address
...
Entry/Positioners/PositionN (npts,) attributes: label, address
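
The two Positioners layouts above hold the same information, and converting between them is mechanical. The sketch below uses nested lists and dictionaries in place of HDF5 datasets and attributes; the helper names and the example labels and addresses are made up for illustration.

```python
def split_positions(positions, labels, addrs):
    """First variation -> second: one (npts,) dataset per positioner."""
    per_positioner = {}
    for j, (label, addr) in enumerate(zip(labels, addrs)):
        per_positioner[f"Position{j + 1}"] = {
            "data": [row[j] for row in positions],
            "attrs": {"label": label, "address": addr},
        }
    return per_positioner


def combine_positions(per_positioner):
    """Second variation -> first: one (npts, NPositioners) array plus label arrays."""
    names = sorted(per_positioner, key=lambda name: int(name[len("Position"):]))
    columns = [per_positioner[name]["data"] for name in names]
    positions = [list(row) for row in zip(*columns)]
    labels = [per_positioner[name]["attrs"]["label"] for name in names]
    addrs = [per_positioner[name]["attrs"]["address"] for name in names]
    return positions, labels, addrs


positions = [[0.0, 1.0], [0.5, 1.0], [1.0, 1.0]]   # (npts=3, NPositioners=2)
labels = ["sample_x", "sample_y"]                  # hypothetical labels
addrs = ["motor:1", "motor:2"]                     # hypothetical addresses

per_positioner = split_positions(positions, labels, addrs)
print(per_positioner["Position1"])
print(combine_positions(per_positioner) == (positions, labels, addrs))
```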

At this point, I have a slight preference for the first variation, as
it has predictable names for datasets, a 2-D array of numerical data,
and replaces attributes which are required by every dataset with
"attribute arrays".

But we should probably think about how to move from a long email thread
to actual written documents. Google Wave doesn't seem quite ready for
us (or vice versa). Perhaps a wiki or group-wide Google Doc?

Cheers,

--Matt

Darren Dale

Jan 26, 2010, 3:56:00 PM

to ma...@googlegroups.com

On Tue, Jan 26, 2010 at 3:11 PM, Matt Newville
<newv...@cars.uchicago.edu> wrote:
> Hi Darren, Armando,
>
> I can't keep up with each topic in this thread. I am afraid that this
> thread has gone quickly and deeply in many directions at once, which I
> think is actually limiting participation in the conversation.

There is plenty to talk about, so soon after the workshop. I don't
think all discussions will be like this one, but in the future I will
try to find a balance between keeping a thread on topic and not
moderating the discussion.

> I would
> certainly not want my silence to be taken as tacit agreement, and am
> certain there are interested parties not active in this discussion.
> Concrete proposals for formats need to be put forth and thought about.

That is why I suggested we write a summary document describing the
proposal: the problem at hand, the proposed solution, its
implementation, and how it is intended to be used. Something akin to a
Python Enhancement Proposal. Probably things will be more productive
if such documents are written in advance of the discussion.