Re: HDF 5 files for spectromicroscopy and coherent diffraction


Wellenreuther, Gerd

Oct 2, 2009, 4:29:40 AM
to ma...@googlegroups.com
Dear Chris,

Chris Jacobsen wrote:
> Hi - several of us have been talking about sharing data files and
> analysis programs among different labs. This applies both to
> spectromicroscopy / spectrum imaging / hyperspectral data, and to
> coherent diffraction data.
>
> I'm writing this to stir the pot in two different ways.
>
> ------------------------
>
> The first way concerns thoughts on using HDF5 as a file storage
> protocol. It involves a slight performance hit:
> http://xray1.physics.sunysb.edu/~jacobsen/colormic/hdf5_tests.pdf
> However, it offers self-documentation and platform independence. We
> have defined a set of groups that we use in HDF5 files:
> http://xray1.physics.sunysb.edu/~micros/diffmic_recon/node20.html
> We have also done the same for spectromicroscopy files:
> http://xray1.physics.sunysb.edu/~jacobsen/colormic/
> However, Anton Barty has suggested that it's good to also make it
> possible to use a minimal definition, so that one could simply follow
> this IDL code example:
> fid = H5F_CREATE(filename)
> datatype_id = H5T_IDL_CREATE(data)
> dataspace_id = H5S_CREATE_SIMPLE(size(data,/DIMENSIONS))
> dataset_id = H5D_CREATE(fid,'data',datatype_id,dataspace_id)
> H5D_WRITE,dataset_id,data
> I think this could be very good, and one could combine it with more
> elaborate structures by simply adding a flag "stored_in_data" to a
> more tightly specified group so that one can have one's cake and eat
> it too: an absolutely simple definition that anyone can write a file
> into, and more complete information for when it is desired/required.
> I'd be interested in your thoughts and comments!
>
> ------------------
Since you asked for our thoughts and comments ...

At the ESRF we are currently studying what direction to follow
concerning data formats. We like NeXus for describing the instruments and
the idea of defining a default plot, but we do not want to lose the
versatility of HDF5. To keep it short, we are rather convinced about the
usefulness of HDF5 as a portable file system, and I have started to add
support for HDF5/NeXus in PyMca (current windows snapshot
http://ftp.esrf.fr/pub/bliss/PyMca4.3.1-20091001-snapshotSetup.exe),
which is one of the ESRF workhorses for this type of data analysis. If I
have properly understood the proposed definition, I would prefer
something slightly different: basically not to have the dataset at the
root level but inside a group. That should give the freedom to put
several datasets, with optional additional information, in the same file
without mixing them. In any case, I fully support having something
simple based on HDF5 and I have no major objections to your proposal.
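
To make the idea concrete, a minimal sketch of what I have in mind, assuming Python with h5py and numpy (the group names 'entry1'/'entry2' are only placeholders, not a proposal):

import numpy as np
import h5py

data = np.random.random((128, 128))            # placeholder array
with h5py.File('example.h5', 'w') as f:
    grp = f.create_group('entry1')             # one group per dataset
    grp.create_dataset('data', data=data)      # the data itself
    grp.attrs['description'] = 'first map'     # optional additional information
    # a second, independent dataset lives in its own group
    f.create_group('entry2').create_dataset('data', data=np.zeros((64, 64)))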
> The second way to stir the pot concerns thoughts about having an
> intensive 2-4 day workshop on spectromicroscopy / spectrum imaging /
> hyperspectral data analysis. The idea would be to really talk about
> details of the mathematics and the programming, perhaps among a group
> of 30 or so people from the synchrotron, electron beam, and maybe even
> satellite hyperspectral data communities. There are possibilities for
> hosting such a workshop at Petra III, or Soleil, or Argonne... Again,
> I'd appreciate any thoughts people have.
Well, honestly I do not consider myself an expert on the mathematics and
the programming associated with this type of data analysis. All I can
say is that I would really like to attend such a workshop and share
ideas and difficulties.

Sincerely,

V. Armando Sole - ESRF Data Analysis Unit

Wellenreuther, Gerd

Oct 2, 2009, 4:28:51 AM
to ma...@googlegroups.com

------------------------

------------------

The second way to stir the pot concerns thoughts about having an


intensive 2-4 day workshop on spectromicroscopy / spectrum imaging /
hyperspectral data analysis. The idea would be to really talk about
details of the mathematics and the programming, perhaps among a group
of 30 or so people from the synchrotron, electron beam, and maybe even
satellite hyperspectral data communities. There are possibilities for
hosting such a workshop at Petra III, or Soleil, or Argonne... Again,
I'd appreciate any thoughts people have.

--------

Sincerely, Chris Jacobsen

---
Prof. Chris Jacobsen, Dept. Physics & Astronomy, Stony Brook University
Chris.J...@stonybrook.edu, http://xray1.physics.sunysb.edu/~jacobsen/

Wellenreuther, Gerd

Oct 2, 2009, 4:37:48 AM
to Chris Jacobsen, ma...@googlegroups.com, Laszlo...@ugent.be, koen.j...@ua.ac.be, Cloetens Peter, Boutet Sebastien, Dumas Paul, Hornberger Benjamin, Ryan Chris, Maia Filipe, Vogt Stefan, Kotula Paul, Watts Benjamin, Andrews Joy, Kaulich Burkhard, Steinbrener Jan, Rau Christoph, Sole Armanda, Williams Garth, Barty Anton, Thibault Pierre
Dear colleagues,

I would like to follow up on the mail Chris just sent you. First of all,
I think Chris is making an excellent point when he suggests joining the
development of methods for hyperspectral data with the
introduction/expansion of the associated data format - it is a little
bit of an object-oriented approach to science. To get this going, I
think we need several things:

A mailing list / web page
=========================

I have talked to only a few of you before. And when I heard a talk by
Paul Kotula at the ICXOM in Karlsruhe a few weeks ago, I felt it was a
pity that there are people working in adjacent fields, using basically
the same methods, who do not know each other.

So I came up with the idea of setting up some web page where we all
could share our talks, references, algorithms and data. This would make
it much easier to communicate thoughts and achievements, or to get
advice. Currently, a first sketch of what this could look like can be
found at http://groups.google.com/group/mahid - most of the information
should be available to everyone without having to join the group.

But since this group could (and, in my opinion, should) also serve as a
mailing list for all people interested in the various issues which might
come up, I would like to invite you to this group. You will receive an
email with the invitation soon; please give it a try. As you can see, I
have also sent this email to the mailing list, so that anybody who joins
later can search and browse through whatever discussion we have had in
the meantime.

More datasets
=============

I have obtained some webspace at DESY where we could share our datasets.
The idea is to link those datasets using the google-group. Currently, I
am already hosting one dataset from Laszlo Vincze, and I am waiting to
get some other datasets, e.g. from Armando Sole, Koen Janssens (that
multi-detector Rembrandt dataset), Chris Ryan (using the Maia detector),
etc. First, I would like to have those on the web (together with the
corresponding publication) in their original formats, and in a second
step produce HDF5 datasets (for that particular part, I could use some
help!). Please feel free to contact me if you have a dataset which you
could provide.

A workshop
==========

As I already discussed with Chris and others at the ICXOM, it would be
great to find some organization(s) to host a workshop on hyperspectral
imaging + data analysis. I already talked to Hermann Franz, and there
might be an opportunity to organize such a workshop here at PETRA III /
DESY with money coming from an IT project. This still has to be decided,
but I will keep working on that, too.

That's my five cents.

Cheers, Gerd

--
Dr. Gerd Wellenreuther
beamline scientist P06 "Hard X-Ray Micro/Nano-Probe"
Petra III project
HASYLAB at DESY
Notkestr. 85
22603 Hamburg

Tel.: + 49 40 8998 5701

Gerd Wellenreuther

Oct 3, 2009, 3:40:47 AM
to ma...@googlegroups.com, "V. Armando Solé", Cloetens Peter, Boutet Sebastien, Dumas Paul, Hornberger Benjamin, Ryan Chris, Maia Filipe, Vogt Stefan, Kotula Paul, Watts Benjamin, Kaulich Burkhard, Rau Christoph, Williams Garth, Barty Anton, Thibault Pierre, fer...@esrf.fr, go...@esrf.fr, gerald.f...@desy.de, tnu...@mail.desy.de
Chris Jacobsen schrieb:

>> If I have properly understood the proposed definition, I would prefer
>> something slightly different: basically not to have the dataset at
>> the root level but inside a a group. That should give the freedom to
>> put several datasets, with optional additional information, in the
>> same file without mixing them. In any case, I fully support having
>> something simple based on HDF5 and I have no major objections to your
>> proposal.
> So to accommodate this yet still keep things simple, I would say we
> just make a group called "/data" which holds the most basic data.
> This means we add two lines to an IDL program.
I am currently approaching the same issue, but from a different side:
How should our beamline write complex data, e.g. 2-dimensional raster
scans? We will be using a container-format for sure. Right now we are
thinking about using the same approach as Soleil, which means using
NeXus to write HDF5 files (see
http://www.synchrotron-soleil.fr/images/File/instrumentation/Informatique/DataStorage.pdf).
According to www.nexusformat.org, Diamond (UK), ESRF (France) and
ALBA (Spain) will also be using NeXus, and last but really not least: the
APS *is* already using it for tomography.

So my simple question would be: why bother designing a new way to
write HDF5 files from scratch? Why not use the conventions imposed by
NeXus as a starting point, and see how this can be extended? Because
pure HDF5 does not tell you anything about which data to save where, you
are in principle completely free. In order to ensure some basic
cross-compatibility between facilities on the one hand, and software on
the other, it would be good to be rather strict about the format, IMHO.

In the NeXus format, a very simple dataset as Chris was suggesting would
contain exactly one NXentry (first level of the data structure, which
should represent one basic measurement/scan AFAIK), which could contain
exactly one NXdata object (second level). More data could either be
added in the same NXentry or, better, be put into the next NXentry. For
more detail please see http://www.nexusformat.org/Design .
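
For illustration only, a rough h5py sketch of such a minimal NeXus-style layout (the NX_class attributes follow the NeXus conventions; the dataset name 'counts' is just a placeholder, and the exact fields should be checked against the NeXus documentation):

import numpy as np
import h5py

with h5py.File('nexus_style.h5', 'w') as f:
    entry = f.create_group('entry1')          # first level: one measurement/scan
    entry.attrs['NX_class'] = 'NXentry'
    nxdata = entry.create_group('data')       # second level: data for the default plot
    nxdata.attrs['NX_class'] = 'NXdata'
    nxdata.create_dataset('counts', data=np.zeros((100, 100)))
    # further measurements would go into 'entry2', 'entry3', ...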

Further advantages:
* A lot of nomenclature about where to put which additional
data/metadata is already defined - we just have to see how our data fits
in there and if we need further fields, and define those for ourselves
(e.g. I am not sure what kind of metadata you want to/should write if
you do coherent diffraction imaging, IR or XEOL, just for example).
* A lot of APIs/tools already exist, e.g. a tool for IDL
(http://lns00.psi.ch/NeXus/NeXus_IDL.html), and one for Python is either
finished or at least under development.

Okay, so much about HDF5. I removed those people from the list of
recipients who have already joined the mailing list at
http://groups.google.com/group/mahid (Chris, Joy Andrews, Jan
Steinbrener, as well as Andre and myself).

Cheers, Gerd

Vicente Sole

Oct 3, 2009, 3:53:57 PM
to Gerd Wellenreuther, ma...@googlegroups.com, Cloetens Peter, Boutet Sebastien, Dumas Paul, Hornberger Benjamin, Ryan Chris, Maia Filipe, Vogt Stefan, Kotula Paul, Watts Benjamin, Kaulich Burkhard, Rau Christoph, Williams Garth, Barty Anton, Thibault Pierre, fer...@esrf.fr, go...@esrf.fr, gerald.f...@desy.de, tnu...@mail.desy.de
Quoting Gerd Wellenreuther <Gerd.Well...@desy.de>:

> Chris Jacobsen schrieb:
>>> If I have properly understood the proposed definition, I would
>>> prefer something slightly different: basically not to have the
>>> dataset at the root level but inside a a group. That should give
>>> the freedom to put several datasets, with optional additional
>>> information, in the same file without mixing them. In any case, I
>>> fully support having something simple based on HDF5 and I have no
>>> major objections to your proposal.
>> So to accommodate this yet still keep things simple, I would say we
>> just make a group called "/data" which holds the most basic data.
>> This means we add two lines to an IDL program.

Thanks, Chris.

> I am currently approaching the same issue, but from a different side:
> How should our beamline write complex data, e.g. 2-dimensional raster
> scans?

We are discussing how to share the data, and for that, just writing
them into an HDF5 file is enough. If you want to write some other
information, I would say you are free to do so.

> So my simple question would be: Why bother with designing a new way
> how to write HDF5-files from scratch? Why not use the conventions
> imposed by NeXus as a starting point, and see how this can be
> extended? Because pure HDF5 does not tell you
> anything about which data to save where, you are in principle completely
> free. In order to ensure some basic cross-compatibility between facilities
> on the one hand, and software on the other hand it would be good to be
> rather strict about the format, IMHO.


Concerning the NeXus NXdata group, my (personal!) opinion is that it is
fine for what it was intended: to define a default plot. If, in
addition to moving two motors, you are simultaneously taking data with
more than one detector and those detectors do not have the same
dimensions (a 2D detector, a 1D detector and a point detector is quite
common in this imaging field), you will see that most likely you are
going to need more than one NXdata group... It is partly because of
that type of problem that I have proposed to some members of the NIAC
alternative ways of storing the data in HDF5 without using NeXus-defined
fields. The little python script I sent you in a separate mail
illustrates that.
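
(I cannot reproduce that script here, but the idea was roughly along these lines - a hedged sketch only, with invented dataset names, assuming h5py:)

import numpy as np
import h5py

with h5py.File('scan.h5', 'w') as f:
    scan = f.create_group('scan_0001')
    # detectors of different dimensionality acquired in the same 2-motor scan
    scan.create_dataset('ccd', data=np.zeros((10, 10, 256, 256)))   # 2D detector
    scan.create_dataset('mca', data=np.zeros((10, 10, 2048)))       # 1D detector
    scan.create_dataset('diode', data=np.zeros((10, 10)))           # point detector
    scan.create_dataset('motor_x', data=np.linspace(0.0, 0.9, 10))
    scan.create_dataset('motor_y', data=np.linspace(0.0, 0.9, 10))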

> According to www.nexusformat.org also Diamond (UK), ESRF (France)
> and ALBA (Spain) will be using NeXus, and last but really not
> least: the APS *is* already using it for tomography.

ESRF is currently studying it, but we favour a hybrid solution based
on HDF5. If that type of use is accepted by the NIAC, then you can
call it NeXus. If not, you just call it HDF5. We can use NeXus to
describe instruments and to define default plots, but we do not want
to lose all the flexibility of HDF5. In particular we do not see why
using NeXus has to be incompatible with, for instance, writing
imageCIF-like data into HDF5:

http://portal.acm.org/citation.cfm?id=1562764.1562781&coll=portal&dl=ACM

I would say you are mixing two problems: how to exchange our imaging
data, and what you should use every day at your beamline. My experience
is that once the data are in an HDF5 file, I will be able to read them.

Best regards,

Armando

ambergino

Oct 3, 2009, 6:08:03 PM
to Methods for the analysis of hyperspectral image data
While I appreciate the work people have done on defining NeXus, it's
not so clear to me that it's widely adopted even at the facilities
where it's been developed. Few people at the APS at Argonne use it, I
would guess. I'm unaware of people at NSLS or ALS using it, though I
could be wrong. It's hard to ask people who are doing electron
microscopy to use it. Also, while you pointed to a web page that
describes a set of IDL routines for NeXus, the ftp site where one is
to download those files is unreachable by anonymous ftp. There's no
Python module that I'm aware of, or Matlab routines; whereas HDF5
support has long been available in Matlab, IDL, and Python. Finally,
many years after its introduction it still has no definition for
protein crystallography and x-ray absorption spectroscopy, which are
two of the most common synchrotron techniques in use (see
http://www.nexusformat.org/Instruments). That's why I think it serves
only a limited community thus far, and thus does not represent a
heavily adopted standard.

I do think that Anton Barty has made a very good point by suggesting
that one be able to write basic data in the simplest way possible.
One can define additional groups that store more widely recognized
information, and even groups that are specific to one beamline (and
which are ignored by most analysis programs). Those additional groups
can be done NeXus-style, leading to the basic-HDF5-plus-NeXus hybrid
approach that Armando speaks of.

I will write a little example read-write program for such a hybrid
approach in the coming days.

CJ


Gerd Wellenreuther

Oct 4, 2009, 7:01:50 AM
to Vicente Sole, ma...@googlegroups.com, Boutet Sebastien, Hornberger Benjamin, Ryan Chris, Maia Filipe, Vogt Stefan, Kotula Paul, Watts Benjamin, Kaulich Burkhard, Rau Christoph, Williams Garth, Barty Anton, Thibault Pierre, fer...@esrf.fr, go...@esrf.fr, gerald.f...@desy.de, tnu...@mail.desy.de
Vicente Sole schrieb:

> Quoting Gerd Wellenreuther <Gerd.Well...@desy.de>:
>> So my simple question would be: Why bother with designing a new way
>> how to write HDF5-files from scratch? Why not use the conventions
>> imposed by NeXus as a starting point, and see how this can be
>> extended? Because pure HDF5 does not tell you
>> anything about which data to save where, you are in principle completely
>> free. In order to ensure some basic cross-compatibility between
>> facilities
>> on the one hand, and software on the other hand it would be good to be
>> rather strict about the format, IMHO.
> Concerning NeXus NXdata group, my (personal!) opinion is that it is
> fine for what it was thought: to define a default plot. If, in
> addition to moving two motors, you are simultaneously taking data with
> more than one detector and those detectors do not have the same
> dimensions (a 2D detector, a 1D detector and a point detector is quite
> common in this imaging field), you will see that most likely you are
> going to need more than one NXdata group...
Sure, but that is possible - at least this is what they (=NeXus) claim
and propose. Or maybe I missed your point?

Just to make that clear: I am completely open to using all the
possibilities of HDF5 wherever we need them! We should never be
restrained by our data format. But on the other hand I am not so happy
imagining lots of different HDF5 files, all written and organized using
completely different design patterns. I would really appreciate some
kind of standard pattern which defines where to put which (meta)data,
and how this is designated/named. Otherwise, any program reading HDF5
and looking for some special part of the data would have to a) apply
some heuristics, which would vary from program to program, or b) ask the
user to browse the tree or otherwise indicate where the required data
are located (Armando, please correct me, but this is how PyMca is doing
it right now, right?).

So, to summarize: Right now (having not written a single HDF5-file
myself) I would try to adhere to the NeXus-specifications as long as
they do not restrain me, and otherwise try to come up with something
that blends into that design pattern. If NeXus turns out to be unusable,
one would have to find or develop some other kind of standard which is
more suitable.

Anyway, we agree that this standard, be it NeXus or something else,
should allow everyone to write very simple datasets in HDF5 files
without unnecessary difficulties. And I would add: if you want to
incorporate any kind of metadata, there should be a definition of where
to put it and how to find it. Is that unrealistic?

Cheers, Gerd

Vicente Sole

Oct 4, 2009, 9:38:46 AM
to Gerd Wellenreuther, ma...@googlegroups.com, so...@esrf.fr, Boutet Sebastien, Hornberger Benjamin, Ryan Chris, Maia Filipe, Vogt Stefan, Kotula Paul, Watts Benjamin, Kaulich Burkhard, Rau Christoph, Williams Garth, Barty Anton, Thibault Pierre, fer...@esrf.fr, go...@esrf.fr, gerald.f...@desy.de, tnu...@mail.desy.de
Hi Gerd,

Quoting Gerd Wellenreuther <Gerd.Well...@desy.de>:

> Vicente Sole schrieb:


>> Concerning NeXus NXdata group, my (personal!) opinion is that it is
>> fine for what it was thought: to define a default plot. If, in
>> addition to moving two motors, you are simultaneously taking data
>> with more than one detector and those detectors do not have the
>> same dimensions (a 2D detector, a 1D detector and a point detector
>> is quite common in this imaging field), you will see that most
>> likely you are going to need more than one NXdata group...
> Sure, but that is possible - at least this is what they (=NeXus) claim
> and propose. Or maybe I missed your point?

My point is that for further analysis (particularly of the
hyperspectral type) you will anyway have to browse the file for the
appropriate information. Please do not be mistaken, I intend to use
the NXdata group, but only for what it was intended: a default plot.
Simply put, if we are just going to share data, a simple approach as
Chris suggested is enough. The NXdata group does not provide the
metadata required for the analysis you are looking for, only for the
correct plot. Therefore, it does not bring much more than the simple
model proposed by Chris.

If you are not yet convinced, please take a close look at the NeXus
web page; you will read that one of its goals was to separate the
measured data from the metadata needed to generate them.

Quoting http://www.nexusformat.org/Design#NeXus_Classes

"""
One of the aims of the NeXus design was to make it possible to
separate the measured data in a NeXus file from all the metadata that
describe how that measurement was performed. In principle, it should
be possible for a plotting utility to identify the plottable data
automatically (or to provide a list of choices if there is more than
one set of data). In order to distinguish the actual measurements from
this metadata, it is stored separately in groups with the class NXdata.
"""

So, if you want us to follow their criteria, NXdata will not be enough:
you will need NXinstrument, NXdetector, you will miss detectors as
simple as an MCA, etc. Again, Chris's proposal fully meets its
goal: to share our data in the simplest of ways.

>
> Just to make that clear: I am completely open to use all possibilities
> of HDF5 wherever we need it! We should not be restrained by our
> dataformat, never.

Great to hear that from somebody else. I am still waiting to hear it
from the NIAC.

> But on the other side I am not so happy imaging lots
> of different HDF5-files, all written and organized using completely
> different design pattern. I would really appreciate some kind of
> standard pattern which defines where to put which (meta)data, and how
> this is designated/named.

My hope is that common use will lead to common needs and therefore to
consensus, although I am not so sure about the time scale for that to
happen.

> Otherwise, any program being able to read
> HDF5 looking for some special part of the data would have to

> a) either apply some heuristics, which would vary from program to program, or

as long as the heuristics work ... :-)

> b) the user would have to browse the tree or otherwise indicate where the
> required data is located (Armando, please correct me, but this is how
> PyMca is doing it right now, right?).

Yes, Gerd, PyMca asks the user to say where the relevant data
are located, but the user can save his preferences in order to
instruct the program about where to find the data. PyMca will support
properly defined NXdata groups too. Nevertheless, and again this is a
personal view, in the end everything can be reduced to a translation
dictionary: a type of analysis requires a set of metadata, the
program prompts the user for where to find them, the user asks the
program to remember the choice, and the problem is solved. Sure, you can
have as many configurations as instrumentation facilities, but access to
the data is guaranteed.

>
> Anyway, we agree that this standard, be it NeXus or something else,
> should allow everyone to write very simple datasets in HDF5-files
> without unnessary difficulties.

Chris was sending the minimal requirements for an IDL program. For
Python it's almost as simple as creating a dictionary, and you can do
it from the interpreter. Please, think about HDF5 as a file system:
you create a directory (= a group) where you create your data file (= a
dataset). You can write a description either as a separate file (=
another dataset in the same HDF5 group) or as file properties (=
metadata as dataset attributes).
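
In other words (an untested sketch with arbitrary names, just to illustrate the analogy):

import numpy as np
import h5py

with h5py.File('shared.h5', 'w') as f:
    grp = f.create_group('data')                                  # a "directory"
    dset = grp.create_dataset('data', data=np.ones((200, 200)))   # the data "file"
    grp.create_dataset('description', data='test map')            # a separate "file" holding a description
    dset.attrs['sample'] = 'my sample'                            # "file properties" = attributes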

> And I would add: if you want to
> incorporate any kind of metadata, there should exist a definition where
> to put it and how to find it. Is that unrealistic?

I think Chris' proposal was leaving the door open to write metadata
provided one was saying where to find them.

Best regards,

Armando


Anton Barty

Oct 5, 2009, 3:57:44 AM
to Vicente Sole, Gerd Wellenreuther, ma...@googlegroups.com, Cloetens Peter, Boutet Sebastien, Dumas Paul, Hornberger Benjamin, Ryan Chris, Maia Filipe, Vogt Stefan, Kotula Paul, Watts Benjamin, Kaulich Burkhard, Rau Christoph, Williams Garth, Thibault Pierre, fer...@esrf.fr, go...@esrf.fr, gerald.f...@desy.de, tnu...@mail.desy.de
Hi All,

Just a brief comment

Using Nexus conventions may be convenient for those already using
Nexus, but its existing format may not necessarily suit everyone
else's needs.

Hence the proposal that at the most basic level we just put the data
in the 'data' field of an HDF5 file. That way it is brain-dead to
write a reader/writer - lowering the barrier to entry and making the
file at the lowest level a 'bucket' for data. Keeping it as simple as
possible will make it much easier to share data.
Of course some groups may desire to put extra information in the file
- for example the configuration of an instrument. But at a base level
the simple reader will still work.
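
As a sketch of how small such a reader/writer pair can be - here in Python/h5py rather than IDL, untested, and with the top-level 'data' location being just the proposal under discussion:

import numpy as np
import h5py

def write_simple_hdf5(filename, data):
    with h5py.File(filename, 'w') as f:
        f.create_dataset('data', data=data)    # the whole "bucket"

def read_simple_hdf5(filename):
    with h5py.File(filename, 'r') as f:
        return f['data'][...]                  # extra groups, if present, are simply ignored

write_simple_hdf5('bucket.h5', np.random.random((256, 256)))
image = read_simple_hdf5('bucket.h5')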

To be sure: once the data is in HDF5 format most people with some
coding experience will be able to extract the data. That much is
clear. But that is not the point. The purpose of having a simple
interchange format is so that we do not have to recode for every new
HDF5 format that comes along.

At the base level I foresee this being a convenient container format
for sharing data amongst groups with minimum effort. (Another
discussion is whether we put all the data in '/data/data' instead of
'/data'. If we decide to put it in a group rather than at the top
level it's a few more lines of code, but that is a separate
discussion that should be settled soon.)

Anton

>
> I would say you are mixing two problems. How to exchange our imaging
> data and what you should use everyday at your beamline. My
> experience is that once the data are into an HDF5 file, I will be
> able to read to them.
>
> Best regards,
>
> Armando
>

----
Anton Barty
Centre for Free Electron Laser Science (CFEL)
Notkestrasse 85, 22607 Hamburg, Germany
phone: +49 (0)40 8998 5783
secretary: +49 (0)40 8998 5798
anton.barty @ desy.de

Wellenreuther, Gerd

Oct 6, 2009, 3:05:57 AM
to "V. Armando Solé", ma...@googlegroups.com, Laszlo...@ugent.be
Hi Armando,

if you can host the dataset yourself, just send me the link and I will
add it to the "Datasets" page of the group. Since it is public FTP, I
expect no special username/password is required?

Alternatively, you can send me the dataset, I will host it at DESY, and
link it.

Of course, this applies to everyone: I am happy to link any kind of
dataset, or host your dataset.

My plan for the dataset from Laszlo Vincze is to provide the raw data
as well as some kind of HDF5 file - or maybe two HDF5 files, one
containing the raw data, the other the elemental maps.

Cheers, Gerd

V. Armando Solé wrote:
> Dear Garth,
>
> Garth Jonathan Williams wrote:
>>
>> I prefer the /data/data location, but wonder if the dataset name
>> shouldn't
>> be a bit more specific regarding the nature of the data contained in the
>> file. For example, we often record normalization information from the
>> ring
>> current or ionization chambers that I would store in /data. As a
>> suggestion,
>> were I specifying this in isolation, I would choose /data/cdi for
>> coherent
>> scattering data.
>>
>> I see the benefit of this scheme as two-fold:
>> (1) datasets are easily recognized by name (I can have my /data/I0 for
>> normalization or /data/spect for complementary spectroscopy data.)
>> and
>> (2) metadata can be stored with flags linking them to the relevant
>> datasets (This cuts down on duplication when parameters are important
>> for multiple datasets.).
>>
>> We've previously used NeXus/HDF4 and I agree with Chris/Anton that a
>> simpler
>> format is desirable for distributing data.
>>
>> Ross Harder and I have separately discussed the idea of a workshop
>> along the
>> lines of Chris' earlier proposal, so I believe such an effort would be
>> well-attended.
>>
> I have a dataset ready (in the /data/data convention). It is a dataset
> belonging to the set of samples leading to the article published in
> Analytical Chemistry 79 (2007) 6988-6999. Basically it is just a 3D
> array in which the first two dimensions are the map size and the 3rd
> dimension holds the spectra. It is an X-ray fluorescence dataset taken
> at 4.707 keV, but I thought the main goal here was to apply pure
> statistical techniques to see how far we could go and which
> methodologies are more promising. So, just knowing that the basic data
> are 1D spectra is perhaps what is most relevant in this case, without
> even having the I0 correction.
>
> In your case, if you store your data as /data/cdi, /data/I0, and so on,
> it should be possible to nevertheless link to the main dataset
> (/data/cdi) with a link named /data/data. Specific coherent diffraction
> codes could find their way and generic ones would at least know where
> the main data are. Is it acceptable?
>
> Please, Gerd, let me know how I can make the dataset available. I can
> put it at the public ESRF ftp area but perhaps you would like to keep it
> somewhere else.
>
> Best regards,
>
> Armando

Wellenreuther, Gerd

Oct 6, 2009, 6:55:47 AM
to ma...@googlegroups.com, Nicola Coppola, so...@esrf.fr, fer...@esrf.fr, go...@esrf.fr, Andrew Aquila, Joachim Schulz, Nicola Coppola, Tomas Ekeberg, Thomas White
Dear all,

first of all: I would like to invite all of those who joined the
discussion later to join the mailing list - I will try to send an
invitation to all of you, but you can also directly join at
http://groups.google.com/group/mahid .

And if you know somebody, who should be incorporated into this
discussion, please forward them to the group, so we can start to lead
the discussion using only the mailing list. Anyone dropping in later
than can first read the previous discussions.

Second:

* I understood that there is a wish to share data using HDF5.

* I have not yet understood why "raw" data is not better shared using a
plain binary format. Why use HDF5 if the only thing I want to do is dump
an array into it, maybe in the same data structure, without taking into
account what that data is?

* I noticed that I have a different view on what a "simple" read/write
routine looks like. For me, simple means something like "browse the
tree of the HDF5 file in question until you find the first occurrence
of any entry matching some criteria (e.g. size of array, name, position
in the tree)".

But I am not a HDF5-expert ...

Cheers, Gerd

P.S.: Nicola, Anton and all the others at DESY: Maybe we can just meet
and discuss?

Nicola Coppola wrote:
> Dear Anton,
> the only problem that I see is that you think that you will be able to
> "recover" what is what from the raw data present in "/data". Unfortunately
> I REALLY doubt that you will be able to do so, given that everything (at
> RAW level) will be stored in the "/data" section.
> Example:
>
> We are not going to use HDF5 in this very beam-session at SCLS, but if
> it were so, we would have pnCCD+VMI+REMI in the same place (and in the
> future it will be more and more true that we are using a lot of
> different detectors), and again you would need to "parse" what is
> in the storage element. How are you able to distinguish which detector
> is which (everything is mixed in "/data")? Is your idea that you will
> always be able to "disconnect" any other measuring device once our
> group takes data (so that they do not appear in the files)?
>
> I know your point of view, but I do not see why you have a problem
> with having things stored in areas that are called "/raw/pnCCD",
> "/raw/xyz" and so on. And, in that case, just "propagate" a
> "read_simple_hdf5" function together with the data.
>
> I think that, in this way, also in a not so distant future, you will be
> able to reread data and "remember/recover" what they were.
>
> The most important thing is that the "datastream" has a self-explanatory
> description, to be able to interpret what is inside. And "data" does not
> carry a lot of information as a string.
>
> regards
> Nicola
>
> On Tue, 6 Oct 2009, Anton Barty wrote:
>
>> Hi Garth
>>
>> At first it is tempting to be more descriptive: it is a point many
>> have made and I understand the desire.
>>
>> However the idea is that it will be much easier to exchange data if we
>> do not have to worry about hunting for different tags for different
>> types of data (eg: /data/cdi) depending on what the data is or where
>> it came from. The aim is to simplify the sharing of not only
>> diffraction intensities, but also processed data, iterates,
>> reconstructions, 3D volume data in real or reciprocal space, etc...
>> in this respect HDF is self-describing but not self-parsing.
>>
>> 98% of the time my bet is that most people already know what is in the
>> file - by its location, or its filename. Let's say you send me a
>> file - the idea is that I can open it using the same program (say
>> read_simple_hdf5) regardless of what the data represents. No parsing
>> of whether it's cdi or iterate or volume required. Just open the file
>> and read it and start working with the data.
>>
>> For those that want to be more descriptive (and I see the benefits):
>> Chris has suggested a tag specifying what the data is (eg:/data/what =
>> 'cdi') and Filipe has suggested using a symbolic link. No problem if
>> you'd like to go down that route.
>>
>> Furthermore: for data coming from an instrument I can totally see the
>> virtue of adding groups like /ALS, or /FLASH, or /LCLS for beam
>> current, bunch ID, energy, etc. ; and an /instrument group for the
>> position of all motors. Those will necessarily be instrument specific,
>> although we could agree on some commonality (eg: for energy, pixel
>> size, etc.). But once again if all I want to do is look at your raw
>> data at least I will know where to look (/data/data) without having to
>> worry about the rest of the tags.
>>
>> My 2c worth.
>>
>>
>> Cheers
>>
>> Anton
>>> garth
>>>
>>> -----Original Message-----
>>> From: Anton Barty [mailto:anton...@desy.de]
>>> Sent: Tue 10/6/2009 4:55 AM
>>> To: Chris Jacobsen
>>> Cc: "V. Armando Solé"; Gerd Wellenreuther; Cloetens Peter; Boutet
>>> Sebastien; Dumas Paul; Hornberger Benjamin; Ryan Chris; Maia Filipe;
>>> Vogt Stefan; Kotula Paul; Watts Benjamin; Andrews Joy; Kaulich
>>> Burkhard; Steinbrener Jan; Rau Christoph; Williams Garth; Thibault
>>> Pierre; fer...@esrf.fr; go...@esrf.fr
>>> Subject: Re: HDF 5 files for spectromicroscopy and coherent diffraction
>>>
>>> Hi All,
>>>
>>> To speed things up - and to provide an example - here is some sample
>>> IDL code for reading and writing a very bare-bones HDF file of the
>>> type we have been talking about.
>>>
>>> As discussed it is designed to be as simple as possible - to promote
>>> the idea of sharing data processing tools and make it easy to be
>>> compatible with each other.
>>>
>>>
>>> Cheers
>>>
>>> Anton
>>>
>>>
>>>
>>
>> ----
>> Anton Barty
>> Centre for Free Electron Laser Science (CFEL)
>> Notkestrasse 85, 22607 Hamburg, Germany
>> phone: +49 (0)40 8998 5783
>> secretary: +49 (0)40 8998 5798
>> anton.barty @ desy.de
>>
>
> |\ _,,,--,,_
> \|||/ /,`.-'`' ._ \-;;,_
> (o o) /,4- ) )_ .;.( `'-'
> ---------oOO--(_)--OOo--------------'---''(_/._)-'(_\_)-------------
> | |
> | Nicola Coppola /\ DESY - CFEL & F1/ZEUS |
> | cop...@mail.desy.de /\\//\ Notkestrasse 85 |
> | Tel: +49 40 8998 5781/2909 \//\\/ 22607 Hamburg |
> | Tel: +49 40 8998 1958(fax) \/ Germany |
> | Fax: +49 40 8998 5793(clean room) |
> | Home: +49 40 5190 5631 In Italy: (+39 041 984108) |
> | DESY Deutsches Elektronen Synchrotron http://www.desy.de |
> --------------------------------------------------------------------

Ross Harder

Oct 6, 2009, 10:40:13 PM
to Methods for the analysis of hyperspectral image data
Greetings,

I've spent some time thinking about data files. One thing that I have
trouble getting past is the fact that HDF is not archival! A large
chunk of the astronomical community is still using an archaic file
format called FITS. It seems that they like it because there is no
reliance on a software API. The file is self-describing, following a
published standard. In twenty years, if the physical media is still
readable, the bits can be interpreted.

I know HDF has considered this and briefly summarizes its strategy
here: http://www.hdfgroup.org/about/history.html

Does anyone else care about this?

Ross




"V. Armando Solé"

Oct 7, 2009, 2:42:06 AM
to ma...@googlegroups.com
Dear Gerd,

Wellenreuther, Gerd wrote:
>
> Second:
>
> * I understood that there is a wish to share data using HDF5.
>
> * I have not understood yet why "raw"-data is not better shared using
> a binary format. Why use HDF5 if the only thing I want to do is dump
> an array into it, maybe in the same datastructure without taking into
> respect what that data is?

Because "raw" data is undefined.

The advantages I see with HDF5:

- you can drop in and mix data types and dimensions, unlike specialized
formats designed only for 2D data
- you do not have to care about whether you are reading floats, doubles,
integers, ... It is self-descriptive.
- you can chunk your data, allowing very fast read-out.
- you do not have to care about little-endian/big-endian problems
- you can have straightforward read-out with common tools (IDL, MATLAB,
Python, ...) in very few lines of code
- HDF5 is on its way to becoming an ISO standard
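
To give an idea of the chunking point (a rough sketch, assuming h5py; the array, chunk shape and compression level are arbitrary):

import numpy as np
import h5py

stack = np.zeros((100, 256, 256), dtype=np.uint16)   # e.g. a small image stack
with h5py.File('stack.h5', 'w') as f:
    f.create_dataset('data', data=stack, chunks=(1, 64, 64), compression='gzip')

with h5py.File('stack.h5', 'r') as f:
    frame = f['data'][10]    # reads only the chunks belonging to frame 10, not the whole stack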


>
> * I noticed that I have a different view on how a "simple"
> read-/write- routine is looking like. For me, simple means something
> like "browse the tree of the HDF5-file in question, until you find the
> first occurrence of any entry matching some criteria (e.g. size of
> array, name, position in the tree)".

import h5py
f = h5py.File(filename,'r')
data=f['/data/data']

Does it need to be simpler? It's just like looking into a hard disk,
because all you need is a path. Either you know the path and you take
it, or you browse the disk and find your path.


>
> But I am not a HDF5-expert ...

Neither am I, but having supported quite a few formats in PyMca, I can
say I appreciate it.

Armando


Gerd Wellenreuther

Oct 7, 2009, 2:46:38 AM
to ma...@googlegroups.com, Barty Anton, Nicola Coppola
Dear all,

just to summarize my insights after a discussion I just had with Anton
and Nicola, and the discussion over the last days: Most probably it
would be best to differentiate between two different applications for HDF5:

Simple data exchange (path-centered use of HDF5)
================================================
* items are found using a path
* the file contains only one dataset, or it is clear which data have to
be used and how
* users (Chris, Anton, Gerd) have to agree about / communicate paths
* purpose: use HDF5 as a machine/platform-independent container
* Anton proposed '/data' as a unified storage place for sharing; Armando
suggested going one hierarchical layer deeper, into '/data/data'. Garth
and Armando thought it would make sense to indicate the kind of data by
actually putting the data to be shared in e.g. '/data/cdi' and linking
that to '/data/data', which would allow other data, e.g. monitors, to be
put in '/data/I0'.


Rich (meta-)data storage
========================
* items are found using a heuristic, e.g. looking for a special
attribute/tag rather than an absolute path (e.g. looking for a
data group with a name or tag called 'coherent diffraction imaging data')
* in case several matching datasets are found, any program should ask
the user
* users (SOLEIL, DESY, ESRF, APS, NeXus, groupXYZ) should agree about
guidelines / philosophy concerning names / tags / hierarchical
structures in HDF5
* purpose: create container files containing all data + metadata
obtained during data acquisition, + additional data from preprocessing
and processing
* in order to achieve some kind of compatibility between synchrotron
labs, I suggested using NeXus as it is being used at SOLEIL. I think it
would be great to define some philosophies/guidelines about what should
be stored where and why, and how it is designated. Such general
guidelines + common use, as Armando suggested, will lead to the
evolution of the NeXus standard into something adequate for our purposes.

How to go on:
=============

* Armando is already creating HDF5-files, and I will start to link them
in the http://groups.google.com/group/mahid/web/datasets .
* I will also try to convert aspects of the datasets hosted by DESY
(e.g. the raw fluorescence spectra as well as the elemental contents)
into two different HDF5 files.
* Anybody else having spectroscopic datasets in HDF5 is welcome to join.
* Further advances benefitting those people using HDF5 for simple
sharing could come from an improvement of the APIs, e.g. allowing a
simple fetch of an item not only based on the path, but e.g. on a
certain name or tag.
* For the development of the rich-HDF5-format I really think that we
need a workshop :).

Cheers, Gerd

Gerd Wellenreuther

Oct 7, 2009, 2:57:27 AM
to ma...@googlegroups.com
V. Armando Solé schrieb:
> import h5py
> f = h5py.File(filename,'r')
> data=f['/data/data']
>
> Does it need to be simpler? It's just like looking into a hard disk
> because all what you need is a path. Either you know the path and you
> take it, either you browse the disk and find your path.
Sure, it is simpler. But I have an innate tendency to dislike any kind of
fixed path or constants in my programs. I agree that this is the
fastest and quickest way to implement reading of an HDF5 file in a
single-purpose script. And this is how you will and should read an
HDF5 file purely built for sharing data.

But I would rather not identify something in a possibly rich
HDF5 file just by its mere path; I would rather look at its shape, its
tags and so on. Especially if I want to make sure that I understand
what the program is doing after leaving it a couple of months
unattended, or after somebody willingly or mistakenly changes the
path structure of his HDF5 files.
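
Something like the following is what I have in mind (purely a sketch, assuming h5py; the 'signal_type' attribute is invented here, and the real tag names would be part of whatever convention we agree on):

import h5py

def find_by_tag(filename, wanted='fluorescence map'):
    # return the first dataset whose 'signal_type' attribute matches
    matches = []
    def visitor(name, obj):
        if isinstance(obj, h5py.Dataset) and obj.attrs.get('signal_type') == wanted:
            matches.append(name)
    with h5py.File(filename, 'r') as f:
        f.visititems(visitor)
        return f[matches[0]][...] if matches else None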

Cheers, Gerd


Andy Gotz

Oct 7, 2009, 3:03:36 AM
to ma...@googlegroups.com, Barty Anton, Nicola Coppola
Dear Gerd,

thank you for your excellent summary.

I have this naive view that it would be ideal to be able to analyse
data by only exchanging files, without the user having to guide the
program(s) at every step to find the data. I realise this might be too
naive for files containing multiple datasets. But for a single dataset
this should be possible. For this we need to agree on some conventions.
I think this is what this discussion has been all about. It looks
promising. Nexus has some useful conventions and metadata tags which we
should follow but Nexus alone is not enough to avoid the user clicking
to find the data. I think this can be changed by using a combination of
Nexus and some additional conventions where to find data in an HDF file
(as has been discussed on this mailing list).

I am very much in favour of a workshop and have discussed this at the
ESRF. We are ready to host such a workshop at the ESRF if people are
ready to come to Grenoble. We have EU funds for networking around the
topic of Data Analysis (project VEDAC). So if the majority agrees just
let us know and we will start organising such a workshop. Any preference
for dates - 2009 or early 2010 ?

Best regards

Andy

DUMAS Paul

Oct 7, 2009, 3:13:18 AM
to ma...@googlegroups.com
The idea of a workshop at ESRF is great, and I had already suggested this to Armando.
Why not a DOODLE.CH calendar?
Best regards
Paul


===============================
Paul Dumas
SOLEIL Synchrotron
BP 48- L'Orme des Merisiers
91192 Gif sur Yvette Cédex ( France)
Tel: +33(0)1 -69 35 9621
Fax:+33(0)1 -69 35 9456
============================


"V. Armando Solé"

Oct 7, 2009, 3:17:35 AM
to ma...@googlegroups.com
DUMAS Paul wrote:
> The idea of a workshop at ESRF is great, and I had already suggested this to Armando.
> Why not a DOODLE.CH calendar?
> Best regards
> Paul
>
Wow, no time to react! The things that happen during the time to take a
cup of coffee... ;-)

Armando

Gerd Wellenreuther

Oct 7, 2009, 3:36:56 AM
to ma...@googlegroups.com, thorste...@desy.de, tnu...@mail.desy.de, gerald.f...@desy.de, stepha...@desy.de
Hi Andy.

Great news! From HASYLAB at least Thorsten Kracht, Maria-Teresa
Nunez-Pardo-de-Vera and myself would most probably attend. Hopefully
Andre Rothkirch will also join, but maybe this is also a question of how
many people from IT can leave HASYLAB unattended for a few days :).
And since the SAXS people have also been thinking about this topic and
want to use HDF5/NeXus, they would/should also want to join. I cannot
speak for all the people from CFEL.

It would be great if this could happen in 2009, although it will make
things a little bit more difficult, especially for you.

Cheers, Gerd

Andy Gotz schrieb:
> Dear Gerd,

DUMAS Paul

Oct 7, 2009, 3:40:41 AM
to ma...@googlegroups.com
At SOLEIL? We have one person in charge of the data formatting with NeXus (Stéphane Poirier). Do you think he can be included in the discussion group?
Very best regards to all
Paul


===============================
Paul Dumas
SOLEIL Synchrotron
BP 48- L'Orme des Merisiers
91192 Gif sur Yvette Cédex ( France)
Tel: +33(0)1 -69 35 9621
Fax:+33(0)1 -69 35 9456
============================



Andy Gotz

Oct 7, 2009, 3:52:36 AM
to ma...@googlegroups.com, thorste...@desy.de, tnu...@mail.desy.de, gerald.f...@desy.de, stepha...@desy.de
HI Gerd,

we will do our best to organise the workshop this year. December might
be the only option to allow people (including us) to prepare. I wonder
however if this is not a problem for colleagues from USA. I have heard
DOE employees need a 3 month lead time for travel. Some feedback on this
topic from USA colleagues would be useful. We have started the ball
rolling over here ...

Andy

Aquila, Andrew

Oct 7, 2009, 4:13:08 AM
to ma...@googlegroups.com
Hi Andy,

December is bad for the people at CFEL as we have beamtime at LCLS. So I
doubt anyone from our group would be able to attend in December. January
would be better.

Andy

Gerd Wellenreuther

Oct 7, 2009, 4:22:46 AM
to ma...@googlegroups.com
Aquila, Andrew schrieb:

> Hi Andy,
>
> December is bad for the people at CFEL as we have beamtime at LCLS. So I
> doubt anyone from our group would be able to attend in December. January
> would be better.
Could you give us an overview of who would be attending the workshop? (I
am still only at the beginning of getting to know people at CFEL. Which
is a shame.)

Cheers, Gerd

ambergino

Oct 7, 2009, 5:21:23 AM
to Methods for the analysis of hyperspectral image data
Hi to all -
Thanks, Gerd, for really getting things rolling on this, and to others
for embracing it with so much enthusiasm that my mailbox overflows!

Over the years I have read a great many different file formats
produced by different programs. With binary data one has the problem
of needing to know how the data was written to disk (big or little
endian, datatype, is there a header and what does it contain, and so
on). Because HDF5 records things like a text string tag and info on
the data type (16 bit unsigned integer, or 32 bit IEEE float, or
whatever) along with the data, with HDF5 you can be given a data file
and learn all about its structure without knowing about the reading
program. To see an example, look at the results of h5dump on the
NeXus test file "nxtest.c" on this web page:
http://xray1.physics.sunysb.edu/~jacobsen/colormic/
It lets you see how an example NeXus file is put together.

Regarding Ross Harder's comment, I would say that since the source
code of HDF5 is available, since it is widely adopted, and since it is
self-describing, it is actually quite good for an archival file
format.

There are compelling reasons to specify the data as simply as possible
to maximize the ease of reading/writing and also the portability among
experiment types. But there are also compelling reasons to make the
data be as specific as possible so that one does not have to rely on
memory or logbooks to know what went into a data set, and it's also
very good to avoid making long dialogs in a program that reads a file.

To my thinking, we can accomplish both goals by making a layered
format. One can start by writing the key data (presumably a 2D or 3D
image array) into the first layer with an agreed-upon name, such as
"/data/data" or maybe "/image". One can then have optional tags in that
basic group which give pixel types (distance or energy) and pixel
sizes (microns or eV) for each dimension, along with date of
creation. One can follow that with specifics such as a group
"/image/spectrum" which provides basic tags used in spectromicroscopy,
or "/image/cdi" for coherent diffraction data, and so on. Finally, one
can follow with beamline-specific information like "/image/als901". A
data reading program could then start by reading the image, then
see if there's a "spectrum" subgroup and get more details, and then
see if it was written on a beamline it knows about and get even more
details. In general, a program can look for all parameters it wants
and insert default values or zeroes for parameters not stored in a
file.
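
As a sketch of what a writer for such a layered file might look like (hedged: I have used the "/data/data" variant here, the tag and group names are only placeholders for whatever is finally agreed upon, and the code assumes h5py):

import numpy as np
import h5py

with h5py.File('layered.h5', 'w') as f:
    # layer 1: the key data, readable by the simplest program
    data = f.create_dataset('/data/data', data=np.zeros((200, 200)))
    data.attrs['pixel_size_microns'] = 0.05
    data.attrs['creation_date'] = '2009-10-07'
    # layer 2: technique-specific tags in an optional subgroup
    spec = f.create_group('/data/spectrum')
    spec.attrs['energy_ev'] = 707.0
    # layer 3: beamline-specific information, ignored by generic programs
    f.create_group('/data/als901').attrs['ring_current_ma'] = 400.0

A reading program would fetch "/data/data" first, and only afterwards look
for the optional subgroups it knows about.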

One key thing I would strongly urge is that all tags include the units
in their name. It's much better to have a tag that says
"pixel_size_microns" rather than "pixel_size", or "energy_ev" rather
than "energy", or "angle_degrees" rather than "angle".

Beyond that, I think that there are many different ways that we can
accomplish the definition of a file format. I would like to maximize
compatibility with NeXus, and would like to see how the NeXus file
reading routines might deal with non-NeXus groups coexisting in the
same HDF5 file. But I do worry that NeXus moves slowly, given that it
has been around for years yet still does not cover crystallography,
spectroscopy, or image data. There is a NeXus "code camp" at Argonne
on Oct. 16-18, with 8 participants
(http://www.nexusformat.org/NIAC2009#Participants). I don't know any of
them, but Andrew Götz of
ESRF is on the NeXus international advisory committee and perhaps he
could shed some light.

How to proceed? Well, I think that we face two separate but related
issues:
1. Getting an agreed-upon data exchange file format, perhaps with a
layered approach in HDF5. This is urgent.
2. Having a workshop on data analysis algorithms.
It seems like the only way to make progress on (1) is to empower a
small group to get together to write a draft and provide example file
read/write code. For this task, I would suggest a small group, with
maybe 1 person from each of a few key institutions; this group could
see if they can make progress via email and/or a conference call, or
if it is best to meet in person. Regarding the workshop, it seems
like there are several volunteers including ESRF, and possibilities
for dates including December and January. I would rather start by
empowering a small committee to form a draft set of topics, and then
solicit speaker suggestions for the topics, and then think of dates
that might work for the key speakers.

CJ

ambergino

Oct 7, 2009, 5:27:25 AM
to Methods for the analysis of hyperspectral image data
By the way, I will endeavor to sign my messages "Chris Jacobsen"
instead of CJ from now on. "Ambergino" is an amalgam of the names of
our dogs from a few years ago, and I used it to be anonymous in a
Google group discussing the Toyota Prius...

Mark Rivers

Oct 7, 2009, 7:08:55 AM
to ma...@googlegroups.com
I am all in favor of this effort. My new EPICS areaDetector software, which collects 2-D image data from a variety of detectors, has file writers for both netCDF and NeXus. The NeXus writer (written by John Hammonds) is still in development and is currently limited to a single image per file, while the netCDF writer can stream an unlimited number of images to a single file. If you are interested you can learn more about this at:

http://cars.uchicago.edu/software/epics/areaDetector.html

I would be very interested in attending the workshop. December is bad for me, since there is APS beam and another large meeting, but January would be good, since that is a down month at APS.

Mark



Antonio Lanzirotti

Oct 7, 2009, 7:50:44 AM
to ma...@googlegroups.com
Hey Chris

>
> One key thing I would strong urge is that all tags include the units
> in their name. It's much better to have a tag that says
> "pixel_size_microns" rather than "pixel_size", or "energy_ev" rather
> than "energy", or "angle_degrees" rather than "angle".
>
I just wanted to comment on this one point you make, which is very
important. I 100% agree that it's critical that units be specified.
However, with HDF5 adding attributes is very easy. So I'd suggest,
rather than having a field descriptor that says pixel_size_microns,
having the data store itself (i.e. pixel_size) and then an attribute
that specifies the units. This would also then easily allow for differences
between beamlines or changes in configuration (for example changing an
electrometer setting from mA to nA).
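
For example (a minimal sketch with h5py; 'pixel_size' and the attribute name 'units' are only illustrations of the idea):

import h5py

with h5py.File('meta.h5', 'w') as f:
    ps = f.create_dataset('pixel_size', data=0.05)
    ps.attrs['units'] = 'microns'    # the units live in an attribute, not in the name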

On one other point regarding some advantages of HDF5: true, it may not be
the speediest way to read data, but its extensibility and
self-description make it a very good choice for wider acceptance. You
can also gzip-compress individual data fields in the file.
Individual groups can add attribute or data fields that are
specifically important for their operation, and as long as they don't
modify the "agreed-upon" fields, we can still extract what's needed for
data analysis because we look specifically for attribute and data fields
by name. For a beamline, this means I can also decide to add attributes
later, for example as hardware upgrades are incorporated,
without having to modify software written to extract those specific
named fields.
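
For instance (a sketch, assuming h5py; here only the large array is compressed and the small field is left alone):

import numpy as np
import h5py

with h5py.File('compressed.h5', 'w') as f:
    f.create_dataset('data/data', data=np.zeros((2048, 2048)),
                     compression='gzip', compression_opts=4)   # gzip only this field
    f.create_dataset('data/I0', data=np.ones(2048))            # stored uncompressed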

Anyhow, I think a workshop at the ESRF would be a great idea.

Tony Lanzirotti

--
Dr. Antonio Lanzirotti, Ph.D.
Senior Research Associate
The University of Chicago - CARS
National Synchrotron Light Source
Brookhaven National Laboratory
Upton, NY 11973
(631) 344-7174
mailto: lanzi...@uchicago.edu
or
mailto: lanzi...@bnl.gov

"V. Armando Solé"

Oct 7, 2009, 8:25:22 AM
to ma...@googlegroups.com
Hi Gerd,

Gerd Wellenreuther wrote:
> How to go on:
> =============
>
> * Armando is already creating HDF5-files, and I will start to link them
> in the http://groups.google.com/group/mahid/web/datasets
>

You can write a link to the data set:

http://ftp.esrf.fr/pub/scisoft/HDF5FILES/MGN1_4707eV.h5

I have tried to follow NeXus conventions as well as the agreed
/data/data way.

That should also serve to illustrate that by using links, one can meet
several standards.

Best regards,

Armando
PS. It is a fluorescence dataset of one of the samples used in
Analytical Chemistry 79 (2007) 6988-6994. I am not 100% sure it is part
of the region shown in Fig. 2, but it is the same sample.
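
For completeness, the kind of linking I mean looks roughly like this in h5py (a sketch only; the NeXus-style path is indicative and not the exact layout of the file above):

import numpy as np
import h5py

with h5py.File('linked.h5', 'w') as f:
    f.create_dataset('/entry1/data/counts', data=np.zeros((20, 30, 512)))
    # the same array exposed through the simple /data/data convention
    grp = f.require_group('data')
    grp['data'] = h5py.SoftLink('/entry1/data/counts')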

"V. Armando Solé"

unread,
Oct 7, 2009, 8:27:47 AM10/7/09
to ma...@googlegroups.com
Gerd Wellenreuther wrote:
> * Armando is already creating HDF5-files, and I will start to link them
> in the http://groups.google.com/group/mahid/web/datasets .
>

BTW, I have tried to download the daphnia dataset but I am prompted for
a username and a password.

Armando


ambergino

unread,
Oct 7, 2009, 8:33:01 AM10/7/09
to Methods for the analysis of hyperspectral image data

> One key thing I would strong urge is that all tags include the units
> in their name.  It's much better to have a tag that says
> "pixel_size_microns" rather than "pixel_size", or "energy_ev" rather
> than "energy", or "angle_degrees" rather than "angle".

Tony Lanzirotti made the point that in HDF5 one can set an attribute
for the units. I still like including it in the variable name,
because then a programmer is more likely to use that variable name
with units included in the source code. That can make for much
greater readability of code by others. Sure, one might have scale
changes on an input device, but one can convert upon reading into the
"standard" units.

Chris Jacobsen

Wellenreuther, Gerd

unread,
Oct 7, 2009, 8:35:05 AM10/7/09
to ma...@googlegroups.com

ambergino wrote:
> There are compelling reasons to specify the data as simply as possible
> to maximize the ease of reading/writing and also the portability among
> experiment types. But there are also compelling reasons to make the
> data be as specific as possible so that one does not have to rely on
> memory or logbooks to know what went tnto a data set, and it's also
> very good to avoid making long dialogs in a program that reads a file.
>
> To my thinking, we can accomplish both goals by making a layered
> format.

I guess I have to disagree, but please feel free to try to convince me. :)

If the purpose is just to share data, then we should keep everything
simple. It will not hurt if additional (meta-)data is in the file. But
typically, as I discussed with Anton and Nicola yesterday, the purpose
would be to just read that single dataset with a minimum of three lines
of code (open, read, close). If this is what you want to do, you should
stick to this routine and the paths, and you're fine. If you need to
write e.g. monitor data, just make a second HDF5 file.

On the other hand, I think we should not use these "simple" HDF5 files as
a starting point for something (much) more elaborate. Because whatever
evolves out of this process will most probably neither be easy to
read/write, nor have the proper design for advanced data analysis.

For example, one possible dead-end I see is connected to the usage of
fixed path-names. You always put data xyz into path abc. And now some
guy buys a second detector. Consequence: You have to find a new
convention for naming the path, tell everyone, and then they have to
change their programs in order to be able to read that data. That is bad
in my opinion. Very bad.

The only way to solve that misery is to distinguish between
"quick-sharing, path-oriented" simple HDF5 files and more elaborate,
rich HDF5 files. In my opinion, the individual items in the latter
should *not* be identified by their location in the HDF5 structure
(although this should be kept as well defined as necessary); the real
identification should be implemented as some kind of attribute or tag:
the data should identify itself as a fluorescence map, or a XANES scan,
or a CDI image. Then and only then could an elaborate program browse the
HDF5 tree and, for example, extract all fluorescence maps and do
something with them. Of course, this need for browsing the tree would be
a somewhat higher barrier for users.
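
A minimal sketch of what such tag-based retrieval could look like with
h5py (the 'signal_type' attribute name and its values are only
illustrative, nothing agreed upon):

import h5py

def collect_by_tag(filename, tag, value):
    """Browse the whole HDF5 tree and read every dataset whose tag equals value."""
    names = []

    def visitor(name, obj):
        if isinstance(obj, h5py.Dataset) and obj.attrs.get(tag) == value:
            names.append(name)

    with h5py.File(filename, 'r') as f:
        f.visititems(visitor)
        return [f[name][...] for name in names]

# e.g. all fluorescence maps, wherever they live in the tree:
# maps = collect_by_tag('rich_file.h5', 'signal_type', 'fluorescence map')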

But again: these two purposes are really different! Quick and really
easy sharing collides with exact, path-independent identification (at
least as long as there are no HDF5 routines available to all of us that
do the browsing and retrieval based on tags).

Cheers, Gerd

Wellenreuther, Gerd

unread,
Oct 7, 2009, 8:37:05 AM10/7/09
to ma...@googlegroups.com

Yupp, but username and password are displayed on the datasets page. I
guess I have to make this bigger ...

Mark Rivers

unread,
Oct 7, 2009, 8:46:20 AM10/7/09
to Methods for the analysis of hyperspectral image data
I have to disagree about including the units in the name. I think they should be separate, and there should be a convention for unit names. netCDF has had this for a long time, and it has utilities that take advantage of it, i.e. one can convert data from one set of units to another using completely general utilities.

Here is the link to the units package for netCDF.

http://www.unidata.ucar.edu/software/udunits/

NeXus also does this, with structures like:

<!--
This is the description for the general spatial location
of a component - it is used by the NXgeometry.xml class
-->
<NXtranslation name="{name of translation}">
  <NXgeometry name="{geometry}">
    {Link to other object if we are relative, else absent}?
  </NXgeometry>
  <distances type="NX_FLOAT[numobj,3]" units="">
    {(x,y,z)}{This field and the angle field describe the position of a component. For absolute position, the origin is the scattering center (where a perfectly aligned sample would be) with the z-axis pointing downstream and the y-axis pointing gravitationally up. For a relative position the NXtranslation is taken into account before the NXorientation. The axes are right-handed and orthonormal.}?
  </distances>
</NXtranslation>


Mark




ambergino

unread,
Oct 7, 2009, 8:54:00 AM10/7/09
to Methods for the analysis of hyperspectral image data


> If the purpose is just to share data, then we should keep everything
> simple. It will not hurt if additional (meta-)data is in the file. But
> typically, as I discussed with Anton and Nicola yesterday, the purpose
> would be to just read that single dataset with the minimum amount of
> three lines of code (open, read, close). If this is what you want to do,
> you should stick to this routine and the paths, and your fine. If you
> need to write e.g. monitor data, just make a second file HDF5-file.
>
> On the other hand, I think we should not use this "simple" HDF5-files as
> a starting point for something (much) more elaborate. Because whatever
> evolves out of this process will most probably neither have the property
> of being easy to read/write, nor will it have the proper design for
> advanced data analysis.

I don't like having multiple files because then you run the risk of
missing some of them when you copy data from an experiment.

But I don't think that having a simple group precludes richness.
Let's say you have both transmission and fluorescence data in one
measurement. You could have one group "/image" which for example
contains the transmission image, a group "/image/transmission" which
in fact has a flag set "stored in /image", and another group
"/image/microprobe" which contains the (X,Y,Energy) fluorescence data.
This provides for the simple readout of one type of data, and the full
readout of all of the data with proper tagging.
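
Something along these lines, sketched with h5py (all names, shapes and
the 'stored_in' flag spelling are invented for illustration, following
the idea above):

import numpy as np
import h5py

transmission = np.zeros((200, 200))          # transmission image
fluorescence = np.zeros((200, 200, 2048))    # (X, Y, Energy) fluorescence data

with h5py.File('layered_example.h5', 'w') as f:
    image = f.create_group('image')
    image.create_dataset('data', data=transmission)      # the simple payload anyone can read

    transmission_grp = image.create_group('transmission')
    transmission_grp.attrs['stored_in'] = '/image/data'  # flag pointing back at the simple copy

    microprobe = image.create_group('microprobe')
    microprobe.create_dataset('counts', data=fluorescence)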

"V. Armando Solé"

unread,
Oct 7, 2009, 8:59:18 AM10/7/09
to ma...@googlegroups.com
Mark Rivers wrote:
> I have to disagree about including the units in the name. I think they should be separate, and there should be a convention for unit names.
I fully agree with Mark. Even in common life the unit is an attribute,
as important as the value, but an attribute. A "wavelength" is a
"wavelength" independently of being in angstroms or meters and one will
look for a "wavelength" in a file and not for a
"wavelength_in_angstroms" or a "wavelength_in_meters".

Please, those of you who have not done it yet, play ASAP with hdf5 files
in your favorite language (C, Fortran, MATLAB, IDL, Python, ...); it
will be very instructive.

Best regards,

Armando

"V. Armando Solé"

unread,
Oct 7, 2009, 9:02:38 AM10/7/09
to ma...@googlegroups.com
ambergino wrote:
>
>
>> If the purpose is just to share data, then we should keep everything
>> simple. It will not hurt if additional (meta-)data is in the file. But
>> typically, as I discussed with Anton and Nicola yesterday, the purpose
>> would be to just read that single dataset with the minimum amount of
>> three lines of code (open, read, close). If this is what you want to do,
>> you should stick to this routine and the paths, and your fine. If you
>> need to write e.g. monitor data, just make a second file HDF5-file.
>>
>> On the other hand, I think we should not use this "simple" HDF5-files as
>> a starting point for something (much) more elaborate. Because whatever
>> evolves out of this process will most probably neither have the property
>> of being easy to read/write, nor will it have the proper design for
>> advanced data analysis.
>>
>
> I don't like having multiple files because then you run the risk of
> missing some of them when you copy data from an experiment.
>

I also agree here. If HDF5 itself is a "portable filesystem", why should
it be simpler to have two separate files than two groups inside the same
HDF5 file? Keeping everything in one file avoids missing data.

Armando

Wellenreuther, Gerd

unread,
Oct 7, 2009, 9:16:01 AM10/7/09
to ma...@googlegroups.com, Barty Anton

V. Armando Solé wrote:


> ambergino wrote:
>> I don't like having multiple files because then you run the risk of
>> missing some of them when you copy data from an experiment.
>>
> I also agree here. If HDF5 itself is a "portable filesytem", why should
> be simpler to have two separate files than two groups inside the same
> HDF5 file avoiding missing data?

Sure, as long as you have communicated where each piece of data lies,
everything is fine. But then you obviously need to define more than just
'/data/data', and you have to communicate that. There is absolutely
nothing wrong with setting up a rather strict data structure like the
proposed '/data/data'. But everyone using it should be well aware that
while it is very easy to read in, it has certain disadvantages - either
you need to communicate changes, or you need several files.

My point is that for the rich type, as I summarized it, as much
information as is necessary to understand the file should be *in* the
file, and not agreed upon beforehand and communicated separately (e.g.
via email).

Consequently, we should rather agree on certain conventions for
naming / tagging / structuring objects in an HDF5 file than fix absolute
paths, because there will always be something we haven't thought about.
Then it is good to have a convention written down somewhere which can be
used to put that new kind of data in a sensible position and give it a
sensible tag, instead of creating '/data/data1', '/data/data2' or
something like it.

ambergino

unread,
Oct 7, 2009, 9:17:19 AM10/7/09
to Methods for the analysis of hyperspectral image data
OK, I'm happy to have the HDF files use the attribute. But I still
tell my graduate students that it is much better to write programs
with variable names like "energy_ev" instead of "energy" because it
removes ambiguities when someone else reads the source code.

I agree that there should be a convention for the unit name, and I
also feel it is important to not specify redundant units. That is,
with photons one should specify either wavelength or energy but not
make it possible to store both (on the principle that a person with
two wristwatches never knows what time it is).

"V. Armando Solé"

unread,
Oct 7, 2009, 10:36:12 AM10/7/09
to ma...@googlegroups.com
That is what we should have the workshop for :-)

For applying statistical methods to the data, you do not need more than
what I sent in my file, other than knowing that the last dimension
corresponds to the measured data.

For specific ways of analysis, one would need a dedicated group, with
defined dataset names, linking to wherever in the file the actual data
is stored.

If one has followed the full NeXus convention, and wants to perform
azimuthal averaging of an image obtained by powder diffraction, one
would need a group where image, sample_detector_distance,
pixel_size_dim0, pixel_size_dim1, direct_beam_position, detector_tilt,
detector_rotation, wavelength, and perhaps something else is written. If
that "image_powder_diffraction_group" is available, everybody has agreed
on the names and so on, the analysis is possible irrespectively of the
convention used to store the data (NeXus in this example). If we could
get a consensus about the minimal set of information to perform a
particular analysis, FROM THE RAW DATA, of image powder diffraction
analysis and/or image SAXS (quite similar problems) and/or XRF mapping
and/or XANES mapping, and so on we would have taken a huge step forward.
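
As a sketch of what such a dedicated group could look like in h5py, using
HDF5 links so the raw data is not duplicated (every path and value below
is invented, just to show the idea):

import h5py

with h5py.File('powder_example.h5', 'a') as f:
    grp = f.create_group('image_powder_diffraction_group')

    # Link to the raw data and geometry wherever they actually live in the file.
    grp['image'] = h5py.SoftLink('/entry_1/measurement/ccd_1/counts')
    grp['wavelength'] = h5py.SoftLink('/entry_1/instrument/monochromator/wavelength')

    # Analysis-specific scalars can also be stored directly in the group.
    grp['sample_detector_distance'] = 0.25    # metres, illustrative value
    grp['pixel_size_dim0'] = 172e-6
    grp['pixel_size_dim1'] = 172e-6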

The /data/data approach was intended to share simple data. I think for
pure statistical analysis problems it can take us quite far in the meantime.

Armando

Antonio Lanzirotti

unread,
Oct 7, 2009, 10:53:59 AM10/7/09
to ma...@googlegroups.com

>
> If one has followed the full NeXus convention, and wants to perform
> azimuthal averaging of an image obtained by powder diffraction, one
> would need a group where image, sample_detector_distance,
> pixel_size_dim0, pixel_size_dim1, direct_beam_position, detector_tilt,
> detector_rotation, wavelength, and perhaps something else is written. If
> that "image_powder_diffraction_group" is available, everybody has agreed
> on the names and so on, the analysis is possible irrespectively of the
> convention used to store the data (NeXus in this example). If we could
> get a consensus about the minimal set of information to perform a
> particular analysis, FROM THE RAW DATA, of image powder diffraction
> analysis and/or image SAXS (quite similar problems) and/or XRF mapping
> and/or XANES mapping, and so on we would have taken a huge step forward.
>
> The /data/data approach was intended to share simple data. I think for
> pure statistical analysis problems can take us quite far in the mean time.
>
> Armando
>
I agree with Armando here. Getting a consensus about the minimal set of
information that needs to be included, its format, and how it should be
tagged would be a major step forward. I still believe that with HDF5, if
we define this upper-level set of structures, there will be latitude for
individuals to add what's needed for their own purposes without
detriment. Any software we write for data processing will simply make
calls for these specific pre-defined structures. Any additional
beamline-specific structures will be ignored. Again, we have been using
HDF5 here on our microprobes for quite some time, and as we add
parameters the older datasets are still valid and can be read without issue.

Tony

Darren Dale

unread,
Oct 7, 2009, 1:37:44 PM10/7/09
to Methods for the analysis of hyperspectral image data
Hello All,

On Oct 7, 5:21 am, ambergino <chris.j.jacob...@gmail.com> wrote:
> There are compelling reasons to specify the data as simply as possible
> to maximize the ease of reading/writing and also the portability among
> experiment types.  But there are also compelling reasons to make the
> data be as specific as possible so that one does not have to rely on
> memory or logbooks to know what went tnto a data set, and it's also
> very good to avoid making long dialogs in a program that reads a file.
>
> To my thinking, we can accomplish both goals by making a layered
> format.  One can start by writing the key data (presumably a 2D or 3D
> image array) into the first layer with an agreed-upon name, such as "/
> data/data" or maybe "/image."  One can then have optional tags in that
> basic group which give pixel types (distance or energy) and pixel
> sizes (microns or ev) for each dimension, along with date of
> creation.  One can follow that with specifics such as a group "/image/
> spectrum" which provides basic tags used in spectromicroscopy, or "/
> image/cdi" for coherent diffraction data, and so on.  Finally, one can
> follow with beamline-specific information like "/image/als901".  A
> data reading program could then start by reading the image, and then
> see if there's a "spectrum" subgroup and get more details, and then
> see if it was written on a beamline it knows about and get even more
> details.  In general, a program can look for all parameters it wants
> and insert default values or zeroes for parameters not stored in a
> file.

I have been putting together a high-level python interface to hdf5
files, based on h5py, called phynx. The name was inspired by "python
interface to hybrid nexus files". At CHESS, we are not ready to adopt
the nexus standard, but I wanted to use hdf5. So as a first step I
basically collaborated with Armando on an hdf5 organization that lets
us cast standard spec data files into hdf5. But we wanted to allow a
nexus format to be built around such data at a later time, to provide
an upgrade path if we decided to embrace nexus. These data files look
like:

/entry_1
 |- measurement
     |- scalar_data
         |- I0
         |- bicron
         |- sample_x
     |- positioners
         |- sample_x (reference position, before scan starts)
     |- mca_1
         |- counts
         |- deadtime

entry_1 can be identified as an NXentry, and all of the spec file data
is stored under the measurement group. This allows an upgrade path to
nexus: a full nexus instrument definition etc. can be created at a later
time, under entry_1 and alongside measurement.

In my phynx library, the idea is to handle the functionality of your
"layered format" without a layered format in the data file. The
layering is based on OO subclassing, creating a higher-level class
interface to the hdf5 groups and datasets. In python, when I define
classes that inherit from phynx.Group or phynx.dataset, let's say:

class Detector(Group):
    pass

class MultiChannelAnalyzer(Detector):
    # extend the functionality of Detector;
    # may use additional hdf5 data that was
    # overlooked by Detector
    pass

these classes are automatically added to a class registry in phynx.
Instances of these classes provide an intuitive interface (via h5py)
to the hdf5 node. That node automatically gets an hdf5 attribute
called "class" that would be equal to "Detector" or
"MultiChannelAnalyzer", which is used to identify the appropriate
constructor in the registry and thus return the right kind of object
in the future. Thus, when I do:

from phynx import File
f=File('foo.h5')
mca=f['/entry_1/measurement/mca_1']

mca is now an instance of MultiChannelAnalyzer. This layer of
abstraction allows you to provide a higher-level interface to the same
group or dataset. You still have the basic low-level access to the
groups, datasets, and attributes as well. The object-oriented
interface allows you to define additional methods and proxies that,
for example, provide an interface to the dead-time corrected data, or
to an array representing the energy for each bin in the MCA counts
dataset.

If a beamline needs to extend the built-in classes in phynx in order
to provide a more complete interface, that's easy to do. Just define
your class, create an instance, save the data. When you reopen the
group or dataset, you get your own interface back. I think this solves
the problem of enforcing strict adherence to some format: just
distribute your ALS extensions to phynx and I will have full access to
all the features of your high-level interface. Ok, perhaps this simply
pushes the conformance issue into the software interface, but I think
that can be managed by intelligently defining the base class
interfaces. I'm not arguing against a common format, just observing
that realistically, we beamline scientists occasionally need to store
additional information that has not yet been added to the format
specification.

> One key thing I would strong urge is that all tags include the units
> in their name.  It's much better to have a tag that says
> "pixel_size_microns" rather than "pixel_size", or "energy_ev" rather
> than "energy", or "angle_degrees" rather than "angle".

I would suggest that the units be saved as an attribute of the hdf5
node. I have written a library, based on numpy, that handles units and
physical constants, conversions etc, for this purpose, see
http://packages.python.org/quantities/user/tutorial.html if you are
interested. Quantities is still under development, so phynx just
returns plain-old numpy arrays for now, but I will add an option in
the future that users can activate to return quantities.

> Beyond that, I think that there are many different ways that we can
> accomplish the definition of a file format.  I would like to maximize
> compatibility with NeXus, and would like to see how the NeXus file
> reading routines might deal with non-NeXus groups coexisting in the
> same HDF5 file.  

I completely agree. Last I checked, the routines either ignored groups
they didn't recognize or attempted to use a default.

> But I do worry that NeXus moves slowly, given that it
> has been around for years yet still does not cover crystallography,
> spectroscopy, or image data.  There is a NeXus "code camp" at Argonne
> on Oct. 16-18, with 8 participants (http://www.nexusformat.org/
> NIAC2009#Participants).  I don't know any of them, but Andrew Götz of
> ESRF is on the NeXus international advisory committee and perhaps he
> could shed some light.

My biggest concern about nexus is ease of use. I wanted phynx to be
useful to developers, but it had to be intuitive and easy to use
interactively in ipython or in scripting. I even wrote a custom
ipython completer for h5py, which is now distributed with h5py, that
lets you tab-complete to navigate the file:

f['entry<tab>
# yields a list of possible matches: entry_1, entry_2, etc, or
autocompletes if there is only one match.

> How to proceed? Well, I think that we face two separate but related
> issues:
> 1. Getting an agreed-upon data exchange file format, perhaps with a
> layered approach in HDF5.  This is urgent.

I encourage a single format, and think hdf5 is the right choice. But I
suggest that the chances for success will be improved if we can
provide intuitive, extensible interfaces to data in this format. The
core of phynx is a really simple wrapper around h5py; it should not be
difficult to provide similar interfaces for other languages that make
the best use of each language's idioms.

> 2. Having a workshop on data analysis algorithms.
> It seems like the only way to make progress on (1) is to empower a
> small group to get together to write a draft and provide example file
> read/write code.  For this task, I would suggest a small group, with
> maybe 1 person from each of a few key institutions; this group could
> see if they can make progress via email and/or a conference call, or
> if it is best to meet in person.  Regarding the workshop, it seems
> like there are several volunteers including ESRF, and possibilities
> for dates including December and January.  I would rather start by
> empowering a small committee to form a draft set of topics, and then
> solicit speaker suggestions for the topics, and then think of dates
> that might work for the key speakers.

I am definitely interested. I've been giving this a lot of thought
over the last two years.

Darren

Darren Dale

unread,
Oct 7, 2009, 5:01:52 PM10/7/09
to Methods for the analysis of hyperspectral image data
On Oct 7, 2:57 am, Gerd Wellenreuther <Gerd.Wellenreut...@desy.de>
wrote:
> V. Armando Solé schrieb:> import h5py
> > f = h5py.File(filename,'r')
> > data=f['/data/data']
>
> > Does it need to be simpler? It's just like looking into a hard disk
> > because all what you need is a path. Either you know the path and you
> > take it, either you browse the disk and find your path.
>
> Sure, it is simpler. But I have a innate tendency to dislike any kind of
> fixed path or constants in my programs. I agree, that this is the
> fastest and quickest way to implement reading of an HDF5-file in a
> single-purpose script. And this is who you will and should read in an
> HDF5-file purely built for sharing data.
>
> But I would rather not consider to identify something in a possibly rich
> HDF5-file just by its mere path, but trying to look at its shape, its
> tags and so one. Especially, if I want to make sure that I understand
> what the program is doing after leaving it a couple of months
> unattended, and after somebody willingly or mistakenly changes the
> path-structure of his HDF5-files.

I think it is important to be able to do both. In phynx (pronounced
phoenix, by the way), you can do:

f['entry_1/my_badly_named_measurement/funny_scalar_data_name/bicron']
or
f['entry_1'].measurement.scalar_data.signals['bicron']
or:
f['entry_1'].measurement.scalar_data.get_sorted_signals_list()[0]

The latter two use measurement and scalar_data properties in python to
inspect the subgroups and return the one that was tagged as a
"Measurement", "ScalarData", and "Signal".

and I should probably add the ability to do
f.entries[0].measurement.scalar_data.get_sorted_signals_list()[0]

I guess my bigger point is that we can make things much easier on
ourselves by considering both the model (file format) and interface
together. The abstractions that can be made in the interface may
significantly reduce the complexity needed in the file. While
developing the format and interface, we can write unit tests which
define how the format should look, how the interface should behave,
and how both the format and interface are intended to be used.
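
For instance, a minimal sketch of such a test (pytest-style; the
/data/data layout is just the simple-exchange convention discussed in
this thread, and the file name is invented):

import h5py

def test_shared_file_has_data(tmp_path):
    path = str(tmp_path / 'shared.h5')

    # Write a minimal file following the simple-exchange convention...
    with h5py.File(path, 'w') as f:
        f.create_group('data').create_dataset('data', data=[[1.0, 2.0], [3.0, 4.0]])

    # ...and pin down what any reader is allowed to assume about it.
    with h5py.File(path, 'r') as f:
        assert 'data/data' in f
        assert f['/data/data'].ndim >= 2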

Darren Dale

unread,
Oct 7, 2009, 5:30:14 PM10/7/09
to Methods for the analysis of hyperspectral image data
On Oct 7, 8:35 am, "Wellenreuther, Gerd" <gerd.wellenreut...@desy.de>
wrote:
> ambergino wrote:
> > There are compelling reasons to specify the data as simply as possible
> > to maximize the ease of reading/writing and also the portability among
> > experiment types.  But there are also compelling reasons to make the
> > data be as specific as possible so that one does not have to rely on
> > memory or logbooks to know what went tnto a data set, and it's also
> > very good to avoid making long dialogs in a program that reads a file.
>
> > To my thinking, we can accomplish both goals by making a layered
> > format.
>
> I guess I have to disagree, but please feel free to try to convince me. :)
>
> If the purpose is just to share data, then we should keep everything
> simple. It will not hurt if additional (meta-)data is in the file. But
> typically, as I discussed with Anton and Nicola yesterday, the purpose
> would be to just read that single dataset with the minimum amount of
> three lines of code (open, read, close). If this is what you want to do,
> you should stick to this routine and the paths, and your fine. If you
> need to write e.g. monitor data, just make a second file HDF5-file.

I prefer using the hdf5 hierarchy rather than distributing data over
multiple files.

> On the other hand, I think we should not use this "simple" HDF5-files as
> a starting point for something (much) more elaborate. Because whatever
> evolves out of this process will most probably neither have the property
> of being easy to read/write, nor will it have the proper design for
> advanced data analysis.
>
> For example, one possible dead-end I see is connected to the usage of
> fixed path-names. You always put data xyz into path abc. And now some
> guy buys a second detector. Consequence: You have to find a new
> convention for naming the path, tell everyone, and then they have to
> change their programs in order to be able to read that data. That is bad
> in my opinion. Very bad.

It shouldn't be necessary to change programs to read the data, as long
as there is some agreement on where the detector groups are located
relative to root. Armando and I have been putting these groups in
/entry_1/measurement/, so the application developer can do:

def get_detectors(entry):
    detectors = []
    for subgroup in entry.measurement:
        if isinstance(subgroup, Detector):
            detectors.append(subgroup)
    return detectors
    # or better still:
    # return [sg for sg in entry.measurement if isinstance(sg, Detector)]

> The only way to solve that misery is to distinguish between
> "quick-sharing, path-oriented" simple HDF5-files, and more elaborate,
> rich HDF5-files.

I agree, phynx relies on tags in hdf5 attributes and other contextual
information to provide the higher-level interface. phynx provides an
easy enough interface for creating groups and datasets with these tags
(f.create_group('new_group', type='MultiChannelAnalyzer')). But anyone
can opt to create simple groups and datasets that do not have enough
context for advanced analysis.

> In my opinion, the individual items in the latter
> should *not* be identified by their location in the HDF5-structure
> (although this should be kept as defined as necessary), but the real
> identification is implemented as some kind of attribute or tag: The data
> itself should identify itself as a fluorescence map, or a XANES scan, or
> a CDI-image. Then and only then an elaborate program could browse the
> HDF5-tree, and for example extract all fluorescence maps and do
> something with them. Of course, this need for browsing the tree would be
> a somewhat higher barrier towards users.

I think there should be some general agreement on hierarchy, because
it helps isolate certain features in the data and lets you define
context. For example, let's say I have two detectors, each reporting
counts and deadtime. It is easier for both the simple interactive user
and the application developer to do (building on my last example, we
have a list of detectors):

detectors[0].dead_time

> But again: These two purposes are really different! Quick and really
> easy sharing collides with exact, path-independent identification (at
> least as long as their is no HDF5-routines available for all of us doing
> the browsing and retrieval based on tags).

Well, if you are willing to use python, the routines exist in phynx.

Stefan Vogt

unread,
Oct 7, 2009, 11:06:52 PM10/7/09
to ma...@googlegroups.com
Andy,

> be the only option to allow people (including us) to prepare. I wonder
> however if this is not a problem for colleagues from USA. I have heard
> DOE employees need a 3 month lead time for travel. Some feedback on this
> topic from USA colleagues would be useful. We have started the ball
> rolling over here ...

if more than 3 DOE people are coming from the same DOE lab, then yes, it
requires ~3 months lead time. If it is 2-3, it should likely be ok (the
limit is a total expenditure of $10000 for a conference).

Cheers,
Stefan
--
Dr. Stefan Vogt
Group Leader Microscopy, Advanced Photon Source, Argonne National Lab.
Adj. Assoc. Professor, Feinberg School of Medicine, Northwestern University

phone: (630) 252-3071; beamline: -3711; fax: -0140
cell: (815) 302-1956
http://www.stefan.vogt.net/

Gerd Wellenreuther

unread,
Oct 8, 2009, 1:26:47 AM10/8/09
to ma...@googlegroups.com, Barty Anton, Nicola Coppola
Darren Dale schrieb:

> On Oct 7, 8:35 am, "Wellenreuther, Gerd" <gerd.wellenreut...@desy.de>
> wrote:
>
>> If you
>> need to write e.g. monitor data, just make a second file HDF5-file.
>>
>
> I prefer using the hdf5 hierarchy rather than distributing data over
> multiple files.
>
Definitely me, too! But some people said they need to be able to read
HDF5 files in the easiest possible way. If you want to be able to read
all data using a fixed path, you have to make some compromises.

It is great that phynx is already able to do the browsing/searching by
tag name; I just wonder whether you know if something like that also
exists for IDL, Matlab, etc.? (I am fine with Python, but others aren't.)

Cheers, Gerd

Darren Dale

unread,
Oct 8, 2009, 7:32:56 AM10/8/09
to ma...@googlegroups.com, Barty Anton, Nicola Coppola

Not that I know of. I don't think it would be hard for someone
proficient in those languages to do, though. If anyone is interested,
I can post the phynx code somewhere in this group's collection of
documents. The project itself is hosted as part of a larger data
acquisition and analysis project (xpaxs) hosted at Launchpad. I've
been intending to split phynx off into its own separately branchable
project, as soon as the Bazaar version control system gets support for
nested trees.

Darren

"V. Armando Solé"

unread,
Oct 8, 2009, 8:49:50 AM10/8/09
to ma...@googlegroups.com
I have created a Doodle event in order to let all of you enter your
availability.

The idea is to organize a workshop at the ESRF.

The link to set your preferences is:

http://www.doodle.com/u3pusirdm5idzzdm


The SOLEIL users meeting is in January (20-21) and the ESRF users
meeting is in February (2nd week).
Just keep it in mind when choosing your dates. I guess we'll need at
least three days to get something productive but we'll see.

See you,

Armando

"V. Armando Solé"

unread,
Oct 8, 2009, 10:42:16 AM10/8/09
to ma...@googlegroups.com
I will try to send a reminder before the workshop, but please start to
think about what you need to perform your analysis.

At the ESRF I have identified some relatively common problems that could
benefit from such an agreement and surely there are more.
I am not necessarily an expert on the associated fields and perhaps
there are well defined systems to describe those measurements and we
just need to embed them into HDF5.

Some of those common problems that are closely related:

Powder diffraction data collection with 2D detectors
SAXS with 2D detectors
Conversion from image data to Q space
Conversion from image data to HKL space

Clearly there will not be a Nobel prize for solving those issues, but
the amount of time that can be lost on them just to be able to start the
actual analysis ...

The list is open. Please feel free to open new discussion subjects in
this mailing list, associated with particular techniques, too.

Armando


Nicola Coppola

unread,
Oct 10, 2009, 11:50:34 AM10/10/09
to Andy Gotz, ma...@googlegroups.com, Barty Anton
Dear all,
I think that a few ideas/things are missing in the whole discussion:

there seems to be nobody who thinks that, a few days after the "file has
been shared", people could forget what they have on their hard drive;

there is nobody who thinks that a utility can be written or used that
looks into the files and dumps what is inside, like an image browser. In
this way each person who receives the file could "browse" the file and
automatically discover that "/data/data" is actually
"/data/corrected_data" or whatever. I have joined a different
collaboration more than once, and I have seen new students, as well as
"old" postdocs/professors who were used to one version of the
reconstruction software, being totally sidetracked by new versions of
the software. All in all, most of the problems were not that things had
been moved or changed, but that the way the data were saved was not
"automatically" clear. People could not see, by opening a data file or
an ntuple or a ROOT tree, what had changed or what was not "canonical",
and they needed somebody to explain how to treat the values and what
meaning the objects in the files had.
That is why I think that a bit more information saved in the files is
better, even if it requires a bit more "browsing". Experienced people
will be able to browse the file, and inexperienced people will, by
browsing, learn what they have been given...
The only problem a new student or postdoc will have is if there is no
other way for him to look into the files than to run a full-fledged
program. It is vital that there is always a simple utility (under all,
and I stress ALL, architectures) to look into data files (and I say
"always" in the sense that this utility MUST be maintained).

regards

nicola

On Wed, 7 Oct 2009, Andy Gotz wrote:

> Dear Gerd,
>
> thank you for your excellent summary.
>
> I have this naive approach that it would be ideal to be able to analyse data
> by only exchanging files and not have the user have to guide the program(s)
> at every step to find the data. I realise this might be too naive for files
> containing multiple datasets. But for a single dataset this should be
> possible. For this we need to agree on some conventions. I think this is what
> this discussion has been all about. It looks promising. Nexus has some useful
> conventions and metadata tags which we should follow but Nexus alone is not
> enough to avoid the user clicking to find the data. I think this can be
> changed by using a combination of Nexus and some additional conventions where
> to find data in an HDF file (as has been discussed on this mailing list).
>
> I am very much in favour of a workshop and have discussed this at the ESRF.
> We are ready to host such a workshop at the ESRF if people are ready to come
> to Grenoble. We have EU funds for networking around the topic of Data
> Analysis (project VEDAC). So if the majority agrees just let us know and we
> will start organising such a workshop. Any preference for dates - 2009 or
> early 2010 ?
>
> Best regards
>
> Andy
>
> Gerd Wellenreuther wrote:
>> Dear all,
>>
>> just to summarize my insights after a discussion I just had with Anton
>> and Nicola, and the discussion over the last days: Most probably it
>> would be best to differentiate between two different applications for
>> HDF5:
>>
>> Simple data exchange (path-centered use of HDF5)
>> ================================================
>> * items are found using a path
>> * only contains one dataset or it is clear which/how the data has to be
>> used
>> * users (Chris, Anton, Gerd) have to agree about / communicate paths
>> * purpose: use HDF5 as a machine/platform-independent container
>> * Anton proposed '/data' as a unified storage place for sharing, Armando
>> suggested to go one hierarchical layer deeper into '/data/data'. Garth
>> and Armando thought it would make sense to indicate the kind of data by
>> actually putting the data to be shared in e.g. '/data/cdi', and link
>> that to '/data/data', which would enable other data e.g. monitors to be
>> put in '/data/I0'.
>>
>>
>> Rich (meta-)data storage
>> ========================
>> * items are rather found using a heuristic, e.g. looking for a special
>> attribute/tag than looking for an absolute path (e.g. looking for a
>> data-group with a name or tag called 'coherent diffraction imaging data')
>> * in case several matching datasets are found, any program should ask
>> the user
>> * users (SOLEIL, DESY, ESRF, APS, NeXus, groupXYZ) should agree about
>> guidelines / philosophy concerning names / tags / hierarchical
>> structures in HDF5
>> * purpose: create container-files containing all data + metadata
>> obtained during data acquisition, + additional data from preprocessing
>> and processing
>> * in order to achieve some kind of compatabiliy between synchrotron
>> labs, I suggested to use NeXus as is being used at SOLEIL. I think it
>> would be great to define some philosophies/guidelines about how what
>> should be stored where and why, and how it is being designated. Such
>> general guidelines + common use as Armando suggested it will lead to the
>> evolution of the NeXus-standard into something adequat for our purposes.
>>
>> How to go on:
>> =============
>>
>> * Armando is already creating HDF5-files, and I will start to link them
>> in the http://groups.google.com/group/mahid/web/datasets .
>> * I will also try to convert aspects of the datasets hosted by DESY
>> (e.g. the raw fluorescence spectra as well as the elemental contents) in
>> two different HDF5 files.
>> * Anybody else having spectroscopic datasets in HDF5 is welcome to join.
>> * Further advances benefitting those people using HDF5 for simple
>> sharing could come from an improvement of the APIs, e.g. allowing a
>> simple fetch of an item not only based on the path, but e.g. on a
>> certain name or tag.
>> * For the development of the rich-HDF5-format I really think that we
>> need a workshop :).
>>
>> Cheers, Gerd
>>
>> >>
>
>
>

---------oOO--(_)--OOo--------------'---''(_/._)-'(_\_)-------------
| |
| Nicola Coppola /\ DESY - CFEL & F1/ZEUS |
| cop...@mail.desy.de /\\//\ Notkestrasse 85 |
| Tel: +49 40 8998 5781/2909 \//\\/ 22607 Hamburg |
| Tel: +49 40 8998 1958(fax) \/ Germany |
| Fax: +49 40 8998 5793(clean room) |
| Home: +49 40 5190 5631 In Italy: (+39 041 984108) |
| DESY Deutsches Elektronen Synchrotron http://www.desy.de |
--------------------------------------------------------------------

Antonio Lanzirotti

unread,
Oct 10, 2009, 12:14:10 PM10/10/09
to ma...@googlegroups.com
I believe if we stay with HDF5 there are numerous freeware HDF browsers (one is incorporated as part of IDL, which is what we use primarily) that allow you to very quickly look at what SDs and attributes are in the file and even plot many of them in a rudimentary fashion. I think Chris Jacobsen expressed this earlier and I agree: I think we need to approach this as a two-part process. One is to agree on a data format that is generally agreeable across the board for the variety of beamlines that are available. The second is to develop the tools for imaging, processing (and browsing) the data. I think task 1 is more easily achievable in short order. Task 2 will take a bit more effort and I suspect in the end we'll end up with a variety of tools for processing. I think that's OK.

Tony

Vicente Sole

unread,
Oct 10, 2009, 1:01:22 PM10/10/09
to ma...@googlegroups.com, Nicola Coppola, Barty Anton
Quoting Nicola Coppola <cop...@mail.desy.de>:

>
> Dear all,
> I think that few ideas/things are missing in the whole discussion:
>
> there seems to be nobody who thinks that, few days after the "file has
> been shared", people could forget what they have on their hard drive;
>

Perhaps those who think the same thing are busy.

> there is nobody who thinks that a utility can be written or used, that
> looks in the files, and dumps what is inside, like a image browser.

There are those who work to make sure that such a utility exists and
goes well beyond that.

PyMca allows you to browse the files:

http://ftp.esrf.eu/pub/bliss/PyMca4.3.1-20091008-snapshotSetup.exe

and, if you take the associated ROI imaging tool, you can, besides ROI
imaging :-), do PCA analysis. The latest check-in I did yesterday to
the sourceforge svn repository even allows you to perform ICA on your
datasets. NNMA will follow when Gerd and I find the time to work
together on it or when I have more time.

My idea when making the datasets available was to compare pure
statistical methodologies of analysis. Please, just give it some time! :-)

So, the files are there and the applications are there. PyMca is just
an example, but I am sure there are others around.

If you want, you can play with the datasets yourself. The MGN1 dataset
is quite good for PCA, while PCA seems to bring very little information
in the Daphnia dataset, and ICA (I did it last night and maybe I did it
wrong) only seems to find 3-4 reliable components. You see, there is
already matter for discussion. Nevertheless, to discuss results and so
on, I would like to open another mailing-list thread and, if possible,
compare the different findings.


> In
> this way each person that receives the file could be able to "browse" the
> file and automatically discover that actually "/data/data" is
> "/data/corrected_data" or whatever.

I said the datasets were stored in /data/data, and they are.

> I have joined more than once a
> different collaboration, I have seen new students starting or "old"
> postdocs/professors, who were used to one version of the reconstruction
> software, being totally sidetracked by new versions of the software.
> All in all, most of the problems were not that things were moved or
> changed, but that the way that data were saved was not "automatically"
> clear. People could not see by opening a data file or a ntuple or a
> roottree what had changed or what was not "canonical" and they needed
> somebody who could explain how to treat the values and what kind of
> meaning had the objects in the files.

I think the submitted datasets can allow us fruitful discussions
working with pure statistical methods. All you need to know is that the
first two dimensions correspond to the dimensions of your map while the
last dimension corresponds to the measured 1D data. You do not even
need to know they are fluorescence data.

> That is why I think that a bit more of infos saved in the files is better,
> even if there is the need of a bit more of "browsing". Experienced people
> will be able to browse the file, and not experienced will by browsing know
> and learn what they have been given...

Nobody is saying /data/data is going to solve everything. But in the
meantime we already have something to work with that does not need
more information. Some people are interested in how to describe data
in a file, some are interested in just writing them, some just in
reading them, and some are interested in the science that can be done.
I belong to all of them and I do not want one of these issues to
prevent the others from going further.

> The only problem that a new student or postdoc will have is if there is
> no other way for him to look into the files than to run a "full" fledged
> program.

Not at all. You can inspect the files from the command line of popular
tools like MATLAB, IDL, Python, ... Two or three lines of code in those
tools and you have your data; you can plot them, print them, modify
them, export them, ...

With the submitted files it is even simpler because you already know
that the dataset is at /data/data.
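
For example, in Python/h5py the whole read is (using the MGN1 file
linked earlier in this thread):

import h5py

with h5py.File('MGN1_4707eV.h5', 'r') as f:
    data = f['/data/data'][...]

print(data.shape)   # first two dimensions: the map; last dimension: the measured data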

> It is vital that there is always a simple utility (under all, and
> I stress ALL, architectures) to look into data-files (and I say "always"
> in the sense that this utility MUST be maintained).
>

MATLAB is maintained and available in all architectures.
IDL is maintained and available in all architectures.
Python is maintained and available in all architectures.

PyMca is maintained and available in all architectures. In addition, it
provides hyperspectral analysis capabilities and you can freely
interact with the main author (myself).

I do not think the situation is so bad.

Armando

Pete Jemian

unread,
Oct 10, 2009, 2:14:11 PM10/10/09
to ma...@googlegroups.com
Actually, data standards people think about this as the first, best
"killer app" since that type of visualization is what may help convince
people that their particular data format is worth further examination.
NeXus, for example, has nxplot. One early design point for NeXus was to
store the data in such a way that a plotting tool could automatically
discover what and how to plot the data in a NeXus file.

One may also consider other particular needs important, too, such as
building an index for the library of accumulated data files and their
last known location as well as some high-level metadata to help in
searching and cross-referencing. This type of tool is essential as one
begins to think of grid resources for storage and computation.

Regards,
Pete

Nicola Coppola wrote:
> Dear all,
> I think that few ideas/things are missing in the whole discussion:
>
> there seems to be nobody who thinks that, few days after the "file has
> been shared", people could forget what they have on their hard drive;
>
> there is nobody who thinks that a utility can be written or used, that
> looks in the files, and dumps what is inside, like a image browser.

--
----------------------------------------------------------
Pete R. Jemian, Ph.D. <jem...@anl.gov>
Beam line Controls and Data Acquisition, Group Leader
Advanced Photon Source, Argonne National Laboratory
Argonne, IL 60439 630 - 252 - 3189
-----------------------------------------------------------
Education is the one thing for which people
are willing to pay yet not receive.
-----------------------------------------------------------

Vicente Sole

unread,
Oct 10, 2009, 2:29:45 PM10/10/09
to ma...@googlegroups.com
Hi Pete,

Quoting Pete Jemian <prje...@gmail.com>:

>
> Actually, data standards people think about this as the first, best
> "killer app" since that type of visualization is what may help convince
> people that their particular data format is worth further examination.
> NeXus, for example, has nxplot. One early design point for NeXus was to
> store the data in such a way that a plotting tool could automatically
> discover what and how to plot the data in a NeXus file.
>

NXdata is one of the classes I really like in the NeXus design.

In the currently available datasets, I have created the /data/data
dataset as a link to an array in an annexed NXdata group. I am
interested to know whether NXplot is tolerant enough to recognize the
presence of the NXentry and the NXdata groups despite the presence of
the link. If the NXdata group is not compliant, please let me know
and I will correct it.

Armando

ambergino

unread,
Oct 11, 2009, 8:47:40 PM10/11/09
to Methods for the analysis of hyperspectral image data


On Oct 10, 11:50 am, Nicola Coppola <copp...@mail.desy.de> wrote:
> Dear all,
> I think that few ideas/things are missing in the whole discussion:
>
> there seems to be nobody who thinks that, few days after the "file has
> been shared", people could forget what they have on their hard drive;
>


Hi Nicola (and others) - to the contrary, I think it is quite
important and useful to store as much information as possible, and the
NetCDF file formats that we've been using for 15 years for STXM at
Brookhaven, and 6-7 years for CDI at the ALS, incorporate all relevant
parameters automatically. We plan on shifting both to HDF5, and if
you look here
(http://xray1.physics.sunysb.edu/~jacobsen/colormic/spectromicro_draft2.pdf)
you'll see that we already have defined an HDF5 format for analysis,
but we're early enough in the use of this that we are happy to shift
to whatever consensus format emerges.

All I want to do is to also preserve the option for people to write
files in as simple a way as possible, to maximize adoption of the
standard.

> there is nobody who thinks that a utility can be written or used, that
> looks in the files, and dumps what is inside, like a image browser.

In fact I think that those who are already using HDF5 know quite well
that it's very easy to see what's in any HDF5 file and how it is
structured even if you are given no other info. The simplest way is
to use the "h5dump" command that gets built when you install HDF5;
there are also more elaborate browsers.

I've been meaning to dig up a nice example dataset and write it to
HDF5; hopefully this week.

Chris Jacobsen

Anton Barty

unread,
Oct 12, 2009, 7:57:55 AM10/12/09
to Nicola Coppola, Andy Gotz, ma...@googlegroups.com
Hi All

The other idea that needs to be remembered is that as someone working
with the data after the experiment, I want the same analysis code to
work regardless of where the data comes from. I do not want to spend
my days re-writing code to cope with different formats from different
instruments. Nor recompiling code to work with instrument A, then
instrument B, then instrument A again. That is needless busy-work.
Browsing helps, but fundamentally I don't want to have to change the
analysis code depending on what I find using some point-and-click
interface. I just want to write code that looks in the same, standard
part of the file for the data regardless of where it comes from, and
goes from there.

So we need both a simple format for sharing data between analysis
codes, with the data always in the same place, and a rich format to
save the data from instruments.
The two serve different purposes and can be linked by data extraction
tools (aka file converters) if need be.
----
Anton Barty
Centre for Free Electron Laser Science (CFEL)
Notkestrasse 85, 22607 Hamburg, Germany
phone: +49 (0)40 8998 5783
secretary: +49 (0)40 8998 5798
anton.barty @ desy.de

Wellenreuther, Gerd

unread,
Oct 12, 2009, 8:25:42 AM10/12/09
to ma...@googlegroups.com, Nicola Coppola
Hi Anton.

To be honest, I think this is more shifting your problem than solving it.
If you have a complex file in which whatever-data-you-want is not in
/data/data, do you really want to use a converter to first extract that
data to be compliant with the /data/data convention? Then you always
have to do the sorting yourself... I would rather give my script the
(sometimes changing) location of the data, and would call the program
for A or B with only this changed parameter.

And if somebody were to write a more elaborate API for IDL to open HDF5
files, e.g. enabling you to fetch the first data object in the file
matching certain criteria, would you use it? E.g. instead of

data=myHDF5.read('/data/data')

do

data=myHDF5.find(tag=='CDI-image')

Would you use something like that?
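
Sketched in Python/h5py rather than IDL (no such 'find' call exists in
either API as far as I know; the 'tag' attribute name is purely
illustrative), such a helper might look like:

import h5py

def find(f, tag_value, tag_name='tag'):
    """Return the first object in the file whose tag attribute equals tag_value."""
    matches = []

    def visitor(name, obj):
        if not matches and obj.attrs.get(tag_name) == tag_value:
            matches.append(name)

    f.visititems(visitor)
    return f[matches[0]] if matches else None

# with h5py.File('rich_file.h5', 'r') as f:
#     data = find(f, 'CDI-image')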

Cheers, Gerd

ambergino

unread,
Oct 12, 2009, 8:54:56 AM10/12/09
to Methods for the analysis of hyperspectral image data
What Gerd describes with "data=myHDF5.read('/data/data')" or
"data=myHDF5.find(tag=='CDI-image')" is at the level of programming
file reading routines. At the basic HDF5 level, one can define a
minimal set of required tags and attributes, and then specify a large
number of optionally-included tags and attributes. As long as one
agrees on the names of the optional tags (as well as the groups), one
can have the read-write program gather all information it knows about
and not have to rewrite it based on additional beamline-specific
information that many will want to archive in the HDF5 file. In HDF5
you can ask to read only the information you care about, and ignore
the rest, without having to rewrite your program.
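
A short sketch with h5py of reading only the agreed-upon items and
falling back to defaults for anything optional (the attribute names here
are illustrative, not a proposal):

import h5py

with h5py.File('shared.h5', 'r') as f:
    data = f['/data/data'][...]                      # the required payload
    attrs = f['/data/data'].attrs
    pixel_size = attrs.get('pixel_size', 0.0)        # optional; default if absent
    units = attrs.get('units', 'unknown')            # optional
    # any beamline-specific extras elsewhere in the file are simply never touched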

Darren Dale

unread,
Oct 12, 2009, 9:24:07 AM10/12/09
to ma...@googlegroups.com
Hi Chris,

On Sun, Oct 11, 2009 at 8:47 PM, ambergino <chris.j....@gmail.com> wrote:
>
>
>
> On Oct 10, 11:50 am, Nicola Coppola <copp...@mail.desy.de> wrote:
>> Dear all,
>> I think that few ideas/things are missing in the whole discussion:
>>
>> there seems to be nobody who thinks that, few days after the "file has
>> been shared", people could forget what they have on their hard drive;
>>
>
>
> Hi Nicola (and others) - to the contrary, I think it is quite
> important and useful to store as much information as possible, and the
> NetCDF file format that we've been using for 15 years for STXM at
> Brookhaven, and 6-7 years for CDI at the ALS incorporate all relevant
> parameters automatically.  We plan on shifting both to HDF5, and if
> you look here (http://xray1.physics.sunysb.edu/~jacobsen/colormic/
> spectromicro_draft2.pdf) you'll see that we already have defined a
> HDF5 format for analysis but I we're early enough in the use of this
> that we are happy to shift to whatever consensus format emerges.
>
> All I want to do is to also preserve the option for people to write
> files in as simple a way as possible, to maximize adoption of the
> standard.

Armando and I have been using a hierarchy that attempts to do the
same thing. It is very similar to spec datafiles, where each scan is a
group below the hdf5 root:

/entry1                # an N by M 2-D scan, for example
  /measurement
    /scalar_data       # arrays for motor positions, counters, like spec's table of data
    /positioners       # starting positions, like spec's scan header
    /mca_1             # a group containing mca data, for example
      /counts          # an array of shape (N times M) by Number_of_Bins
      /deadtime
    /ccd_2             # a group containing image data
      /counts          # an array of shape (N times M) by Ypixels by Xpixels

Instead of having an ev_array stored in the hdf5 file, I have been
storing an array of bin numbers and calibration parameters to convert
the bins to energy using a second order polynomial. When I ask for
mca_1.energy in python it automatically calculates the energy array. I
think the bin numbers and the calibration are the important values to
be stored.

We did not organize things into hdf5 groups according to technique, as
you suggest in your pdf; we just wanted an as-simple-as-possible way
to organize raw data, regardless of technique. But with hdf5 links,
your spectromicro group could be created to make the data available
with all the context of that particular technique. This way we can
have the best of both worlds.

In your mandatory /spectromicro items, you have an image_array of
shape [nx,ny,nz,n_ev]. I have been saving such data as [nx times ny
times nz, n_ev]. There are several reasons for this: it simplifies
data acquisition, since I only need to know the index of the point in
the scan instead all of my x and y and z indices, it supports
non-uniform images (data sampled at irregular intervals), and it also
supports scans of arbitrary dimensionality. Imagine I am working with
data interactively in some shell, and it was a 1D scan. If I want to
plot the first mca spectrum, I want to do this:

plot(mca.counts[0])

not this:

plot(mca.counts[0,0,0])

But I understand there are times when you want regularly-spaced data
formatted into [nx,ny,nz,n_ev], so I store the shape of the scan in the
hdf5 file as well and can provide a proxy that will let you index the array as if
it were formatted that way:

mca.counts.image[0,0,0]
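
In numpy terms the proxy is little more than a reshape; a sketch with
invented array sizes:

import numpy as np

counts = np.zeros((20 * 30, 1024))          # flat storage: (N*M points, n_bins)
scan_shape = (20, 30)                       # stored alongside in the hdf5 file

spectrum = counts[0]                        # first spectrum, 1D-scan style
image = counts.reshape(scan_shape + (-1,))  # (nx, ny, n_bins) view of the same data
pixel = image[0, 0]                         # image-style indexing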


>> there is nobody who thinks that a utility can be written or used, that
>> looks in the files, and dumps what is inside, like a image browser.
>
> In fact I think that those who are already using HDF5 know quite well
> that it's very easy to see what's in any HDF5 file and how it is
> structured even if you are given no other info.  The simplest way is
> to use the "h5dump" command that gets built when you install HDF5;
> there are also more elaborate browsers.

I wrote a PyQt4 based tree-view widget to allow files to be explored
in a GUI. Armando folded that into pymca and added all kinds of new
features and capabilities.
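
For people who prefer to stay in Python, a few lines of h5py do much
the same job as h5dump for getting an overview of an unknown file
(the file name below is a placeholder):

import h5py

def show(name, obj):
    # Print every group and dataset with its shape and type.
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, obj.dtype)
    else:
        print(name + '/')

with h5py.File('unknown.h5', 'r') as f:
    f.visititems(show)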

> I've been meaning to dig up a nice example dataset and write it to
> HDF5; hopefully this week.

I'll post a file of my own as well.

Darren

Darren Dale

unread,
Oct 12, 2009, 9:32:50 AM10/12/09
to ma...@googlegroups.com, Nicola Coppola
On Mon, Oct 12, 2009 at 8:25 AM, Wellenreuther, Gerd
<gerd.well...@desy.de> wrote:
>
> Hi Anton.
>
> To be honest, I think this is more shifting your problem than solving it.
> If you have a complex file in which whatever-data-you-want is not in
> /data/data, do you really want to use a converter to first extract that
> data to be compliant to the /data/data-convention?

Is /data/data seriously being considered for some kind of convention?
I'm not familiar with it, where is it discussed? I think the naming
convention /data/data is unfortunate.

Darren

Pete R. Jemian

unread,
Oct 12, 2009, 9:39:22 AM10/12/09
to ma...@googlegroups.com
Another point that seems forgotten is "acceptance of
a common data standard by those that create the data."

A major problem with data standards is the reluctance of
the community to accept a standard that is different from
what is presently done. The reluctance is strongest among
instrument scientists, who have often been using some
(local or perhaps broader) standard format for years and
see no added value in using something new.
Sometimes the reluctance is due to management pressure
("you do not have time or resources to spend on that effort"),
but most often it is more pragmatic: they have
reduction/analysis codes that already write and read their
format and have been producing science. In this case, a
new standard does not provide tangible benefit to their
instrument.

They miss the point, though. Big time. While many
researchers do not couple their data from one instrument
with data from another for a given analytical procedure,
there are plenty of researchers who need to do exactly
this. One example is contrast variation studies capitalizing
on the differences in the scattering of neutrons and X-rays.
These scientists could obtain substantial benefit from a common
data standard that would carry both kinds of information.
The vision is that such a common standard would provide benefit
for the "multiple instruments" scientists without penalizing
the "single instrument" investigators.

One requirement set by instrument scientists that is necessary
for any new data standard is "to be better than what is presently
in use." There must be a benefit or there is no need to change.
The benefit I can see, even if it is not yet fully convincing, is that
supporting scientists who collect data on multiple instruments
will benefit each and every one of those instruments in new ways:
greater visibility, a broader user base, more competition for
higher-quality science, etc.

Regards,
Pete

"V. Armando Solé"

unread,
Oct 12, 2009, 9:49:31 AM10/12/09
to ma...@googlegroups.com, Nicola Coppola

It was discussed at the beginning of this thread as just a fast and
easy way to get access to the main data. The datasets already available
(from me) follow that convention, but they also follow the NeXus
convention, just to illustrate that with links one can do everything.

Armando

"V. Armando Solé"

unread,
Oct 12, 2009, 10:22:39 AM10/12/09
to ma...@googlegroups.com
Obviously I agree with your suggestion :-)

I have already suggested defining HDF5 analysis groups according to the
type of analysis. It is not necessary to write the raw data directly
based on the type of analysis (if you are sure about what you are doing,
then go ahead). In the convention Darren and I use, those analysis
groups would be just links to information stored in the measurement
group. In the NeXus convention, those analysis groups would be links to
the relevant information spread among the NXdata, NXsample, NXmonitor and
NXinstrument groups. In my opinion, those analysis-minded groups can
save the whole NeXus approach from failure, because it is certainly not
analysis-minded (just see how many places you have to inspect to find
the relevant information, and that without even accounting for different
facilities' recipes). The most important thing to retain is that we
have to agree on the analysis-minded groups; we do not need to agree on
how to write the raw data.

In this thread I already mentioned how easy it would be to define a group
allowing the analysis of raw powder diffraction data and raw SAXS data
irrespective of the underlying structure of the file. Darren and I are
convinced of our approach, NeXus people are convinced of their approach,
others will be convinced of theirs, but the analysis problem is not
solved by any of them, just the storage. A bonus of the approach Darren
and I use is that similarly structured groups can be used to store
analysis results, but to me the main problem is not the storage. For the
storage you can choose an existing solution or invent one yourself.

> In your mandatory /spectromicro items, you have an image_array of
> shape [nx,ny,nz,n_ev]. I have been saving such data as [nx times ny
> times nz, n_ev].

Chris, if I can offer you a hint, try to avoid coupling "vertex"
information (nx, ny, nz) with "value" information (n_ev), because that
can make 3D visualization and handling cumbersome and memory hungry.
The "natural" analysis of 4D data is to have the vertices on one side
and the associated values on the other. That also allows you to reduce
the dataset size when the vertices can be replaced by a regular mesh:
instead of having nx*ny*nz values for x, for y and for z, you have nx x
values, ny y values and nz z values.
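
A small numpy illustration of that saving (sizes chosen arbitrarily):
for a regular mesh only the three 1-D axis arrays need to be stored,
and the full vertex list can be rebuilt on demand.

import numpy as np

nx, ny, nz, n_ev = 50, 40, 10, 256
x = np.linspace(0.0, 10.0, nx)               # nx values instead of nx*ny*nz
y = np.linspace(0.0, 8.0, ny)                # ny values
z = np.linspace(0.0, 2.0, nz)                # nz values
values = np.zeros((nx * ny * nz, n_ev))      # one spectrum per vertex

# Rebuild the full vertex list only if it is actually needed:
xx, yy, zz = np.meshgrid(x, y, z, indexing='ij')
vertices = np.column_stack([xx.ravel(), yy.ravel(), zz.ravel()])   # (nx*ny*nz, 3)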

Armando

Darren Dale

unread,
Oct 12, 2009, 10:52:30 AM10/12/09
to ma...@googlegroups.com

I saw the discussion of /data/data at the beginning of the list, and
still don't understand what it is or why it is preferable to other
existing conventions (or extensions thereof). Do people think it would
be useful to try to build on the NeXus conventions to create Technique
or Analysis definitions similar to their instrument definitions
(http://www.nexusformat.org/Instruments)? What I am imagining would
look similar to Chris's /spectromicro specification, and could be used
as a basis for sharing data for analysis, and later incorporated into
whatever archival format emerges.

Darren

ambergino

unread,
Oct 12, 2009, 12:29:55 PM10/12/09
to Methods for the analysis of hyperspectral image data
Hi -
Thanks for the various suggestions: array of [nx by ny by nz, n_ev]
instead of [nx,ny,nz,n_ev], for one. I compared 3D with 2D as in
[nx,ny,n_ev] with [nx,n_ev] and found only a modest speed difference,
but I didn't try the 4D instead of 2D test.

Regarding adoption of a data standard, I can think of one very good
mechanism that can drive it: the availability of multiple analysis
programs that are able to read and write the standard. That is, if
I'm a user of beamline A and I find out that program B is really good
at analyzing my kind of data, I now have a very good reason for
getting my data into program B. I think that PyMCA can provide one
sort of "driver" to promote the use of a data standard, and I will
certainly want to make my programs able to read and write files in our
standard format.

Regarding a generic group of /data/data, perhaps we can try something
slightly more descriptive like NXdata/image where the image can have
three spatial and one energy/wavelength dimension.

CJ

"V. Armando Solé"

unread,
Oct 12, 2009, 12:42:25 PM10/12/09
to ma...@googlegroups.com
ambergino wrote:
>
> Regarding a generic group of /data/data, perhaps we can try something
> slightly more descriptive like NXdata/image where the image can have
> three spatial and one energy/wavelength dimension.
>

Once you have accepted an NXdata group, you do not need to specify if it
is an image or not.

You just have to take a look at the attributes of the different datasets.

If you take a look at the datasets of the files I submitted, you will see
that three of them have the attribute axis and only one has the
attribute signal. That already tells you how to plot the data. If you
set the attribute axes of the dataset marked as signal, you should be
able to know if it is a 1D, 2D or 3D dataset.
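
In h5py that lookup is only a few lines (the group path and file name
below are placeholders):

import h5py

with h5py.File('example.h5', 'r') as f:
    nxdata = f['/entry1/data']                  # an NXdata group
    signal, axes = None, []
    for name, dset in nxdata.items():
        if 'signal' in dset.attrs:              # the dataset to be plotted
            signal = dset
        elif 'axis' in dset.attrs:              # the datasets describing the axes
            axes.append(dset)
    print(signal.name, signal.shape, [a.name for a in axes])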

Armando
