Chris Jacobsen wrote:
> Hi - several of us have been talking about sharing data files and
> analysis programs among different labs. This applies both to
> spectromicroscopy / spectrum imaging / hyperspectral data, and to
> coherent diffraction data.
>
> I'm writing this to stir the pot in two different ways.
>
> ------------------------
>
> The first way concerns thoughts on using HDF5 as a file storage
> protocol. It involves a slight performance hit:
> http://xray1.physics.sunysb.edu/~jacobsen/colormic/hdf5_tests.pdf
> However, it offers self-documentation and platform independence. We
> have defined a set of groups that we use in HDF5 files:
> http://xray1.physics.sunysb.edu/~micros/diffmic_recon/node20.html
> We have also done the same for spectromicroscopy files:
> http://xray1.physics.sunysb.edu/~jacobsen/colormic/
> However, Anton Barty has suggested that it's good to also make it
> possible to use a minimal definition, so that one could simply follow
> this IDL code example:
> fid = H5F_CREATE(filename)
> datatype_id = H5T_IDL_CREATE(data)
> dataspace_id = H5S_CREATE_SIMPLE(size(data,/DIMENSIONS))
> dataset_id = H5D_CREATE(fid,'data',datatype_id,dataspace_id)
> H5D_WRITE,dataset_id,data
> I think this could be very good, and one could combine it with more
> elaborate structures by simply adding a flag "stored_in_data" to a
> more tightly specified group so that one can have one's cake and eat
> it too: an absolutely simple definition that anyone can write a file
> into, and more complete information for when it is desired/required.
> I'd be interested in your thoughts and comments!
>
> ------------------
Since you asked for our thoughts and comments ...
At the ESRF we are currently studying what direction to follow
concerning data formats. We like NeXus to describe the instruments and
the idea of defining a default plot, but we find it does not fully profit from
the versatility of HDF5. To keep it short, we are rather convinced about the
usefulness of HDF5 as a portable file system and I have started to add
support for HDF5/NeXus on PyMca (current windows snapshot
http://ftp.esrf.fr/pub/bliss/PyMca4.3.1-20091001-snapshotSetup.exe) that
is one of the ESRF workhorses for this type of data analysis. If I
have properly understood the proposed definition, I would prefer
something slightly different: basically not to have the dataset at the
root level but inside a group. That should give the freedom to put
several datasets, with optional additional information, in the same file
without mixing them. In any case, I fully support having something
simple based on HDF5 and I have no major objections to your proposal.
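For illustration, an untested Python/h5py sketch of what I mean (the file, group and array names are only placeholders) would be:

import h5py
import numpy as np

data = np.zeros((10, 10))                      # placeholder array
with h5py.File('example.h5', 'w') as f:        # file name is just an example
    grp = f.create_group('data')               # the dataset lives inside a group ...
    grp.create_dataset('data', data=data)      # ... not at the root level
    # a second, independent dataset could then go into its own group:
    # f.create_group('data2').create_dataset('data', data=data)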
> The second way to stir the pot concerns thoughts about having an
> intensive 2-4 day workshop on spectromicroscopy / spectrum imaging /
> hyperspectral data analysis. The idea would be to really talk about
> details of the mathematics and the programming, perhaps among a group
> of 30 or so people from the synchrotron, electron beam, and maybe even
> satellite hyperspectral data communities. There are possibilities for
> hosting such a workshop at Petra III, or Soleil, or Argonne... Again,
> I'd appreciate any thoughts people have.
Well, honestly I do not consider myself an expert on the mathematics and
the programming associated with this type of data analysis. All I can
say is that I would really like to attend such a workshop and to
share ideas and difficulties.
Sincerely,
V. Armando Sole - ESRF Data Analysis Unit
I would like to follow-up on the mail Chris just sent you. First of all,
I think Chris is making an excellent point when he suggests joining the
development of methods for hyperspectral data with the
introduction/expansion of the associated data format - it is a little
bit of an object-oriented approach to science. To get this going, I
think, we need several things:
A mailing list / web page
=========================
I have talked only to a few of you before. And when I heard a talk by
Paul Kotula at the ICXOM in Karlsruhe a few weeks ago I felt it was a
pity that there are people working in adjacent fields using basically
the same methods, but not knowing each other.
So I came up with the idea of setting up some web page where we all
could share our talks, references, algorithms and data. This would make
it much easier to communicate thoughts and achievements, or to get
advice. Currently, a first sketch of how this could look can be
found at http://groups.google.com/group/mahid - most of the information
should be available to everyone without having to join the group.
But since this group could (and, in my opinion, should) also serve as a
mailing list for all people interested in the various issues which might
come up, I would like to invite you to this group. You will receive an
email with the invitation soon; please give it a try. As you can see, I
have also sent this email to the mailing list, so anybody who joins
later can search and browse through whatever discussion we have had in
the meantime.
More datasets
=============
I have obtained some webspace at DESY where we could share our datasets.
The idea is to link those datasets using the google-group. Currently, I
am already hosting one dataset from Laszlo Vincze, and I am waiting to
get some other datasets, e.g. from Armando Sole, Koen Janssens (that
multi-detector Rembrandt dataset), Chris Ryan using the Maia-detector
etc. First, I would like to have those on the web (together with the
corresponding publication) in their original formats, and in a second
step produce HDF5 datasets (for that particular part, I could use some
help!). Please feel free to contact me if you have a dataset which you
could provide.
A workshop
==========
As I already discussed with Chris and others at the ICXOM, it would be
great to find some organization(s) to host a workshop on hyperspectral
imaging + data analysis. I already talked to Hermann Franz, and there
might be an opportunity to organize a combined workshop here at PETRA III /
DESY with money coming from an IT project. This still has to be
decided, but I will keep working on that, too.
That's my five cents.
Cheers, Gerd
--
Dr. Gerd Wellenreuther
beamline scientist P06 "Hard X-Ray Micro/Nano-Probe"
Petra III project
HASYLAB at DESY
Notkestr. 85
22603 Hamburg
Tel.: + 49 40 8998 5701
So my simple question would be: Why bother designing a new way to
write HDF5 files from scratch? Why not use the conventions imposed by
NeXus as a starting point, and see how this can be extended? Because
pure HDF5 does not tell you anything about which data to save where, you
are in principle completely free. In order to ensure some basic
cross-compatibility between facilities on the one hand, and software on
the other hand it would be good to be rather strict about the format, IMHO.
In the NeXus format, a very simple dataset as Chris was suggesting would
contain exactly one NXentry (first level of the data structure, which should
represent one basic measurement/scan AFAIK), which could contain exactly
one NXdata object (second level). More data could either be added to the
same NXentry or, better, be put into the next NXentry. For more detail
please see http://www.nexusformat.org/Design .
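If I read those pages correctly, the minimal structure would look roughly like this untested h5py sketch (the NX_class and signal attribute names follow my understanding of the NeXus conventions; everything else is a placeholder):

import h5py
import numpy as np

counts = np.random.random((100, 100))           # placeholder measurement
with h5py.File('minimal_nexus.h5', 'w') as f:   # file name is just an example
    entry = f.create_group('entry1')
    entry.attrs['NX_class'] = 'NXentry'         # first level: one measurement/scan
    data = entry.create_group('data')
    data.attrs['NX_class'] = 'NXdata'           # second level: the plottable data
    ds = data.create_dataset('counts', data=counts)
    ds.attrs['signal'] = 1                      # mark the default plottable array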
Further advantages:
* A lot of nomenclature about where to put which additional
data/metadata is already defined - we just have to see how our data fits
in there and if we need further fields, and define those for ourselves
(e.g. I am not sure what kind of metadata you want to/should write if
you do coherent diffraction imaging, IR or XEOL, just for example).
* A lot of APIs/tools already exist, e.g. a tool for IDL
(http://lns00.psi.ch/NeXus/NeXus_IDL.html), and for Python one is either
finished or at least under development.
Okay, so much about HDF5. I removed those people from the list of
recipients who have already joined the mailing list at
http://groups.google.com/group/mahid (Chris, Joy Andrews, Jan
Steinbrener, as well as Andre and myself).
Cheers, Gerd
> Chris Jacobsen schrieb:
>>> If I have properly understood the proposed definition, I would
>>> prefer something slightly different: basically not to have the
>>> dataset at the root level but inside a group. That should give
>>> the freedom to put several datasets, with optional additional
>>> information, in the same file without mixing them. In any case, I
>>> fully support having something simple based on HDF5 and I have no
>>> major objections to your proposal.
>> So to accommodate this yet still keep things simple, I would say we
>> just make a group called "/data" which holds the most basic data.
>> This means we add two lines to an IDL program.
Thanks, Chris.
> I am currently approaching the same issue, but from a different side:
> How should our beamline write complex data, e.g. 2-dimensional raster
> scans?
We are discussing how to share the data, and for that, just writing
them into an HDF5 file is enough. If you want to write some other
information, I would say you are free to do so.
> So my simple question would be: Why bother with designing a new way
> how to write HDF5-files from scratch? Why not use the conventions
> imposed by NeXus as a starting point, and see how this can be
> extended? Because pure HDF5 does not tell you
> anything about which data to save where, you are in principle completely
> free. In order to ensure some basic cross-compatibility between facilities
> on the one hand, and software on the other hand it would be good to be
> rather strict about the format, IMHO.
Concerning the NeXus NXdata group, my (personal!) opinion is that it is
fine for what it was intended for: to define a default plot. If, in
addition to moving two motors, you are simultaneously taking data with
more than one detector and those detectors do not have the same
dimensions (a 2D detector, a 1D detector and a point detector is quite
common in this imaging field), you will see that most likely you are
going to need more than one NXdata group... It's partly because of
that type of problem that I have proposed to some members of the NIAC
alternative ways of storing the data into HDF5 without using NeXus
defined fields. The little python script I sent you in a separate mail
illustrates that.
> According to www.nexusformat.org also Diamond (UK), ESRF (France)
> and ALBA (Spain) will be using NeXus, and last but really not
> least: the APS *is* already using it for tomography.
ESRF is currently studying it but we favour a hybrid solution based
on HDF5. If that type of use is accepted by the NIAC, then you can
call it NeXus. If not, you just call it HDF5. We can use NeXus to
describe instruments and to define default plots, but we do not want
to miss all the flexibility of HDF5. In particular we do not see why
using NeXus has to be incompatible with, for instance, writing
imageCIF-like data into HDF5:
http://portal.acm.org/citation.cfm?id=1562764.1562781&coll=portal&dl=ACM
I would say you are mixing two problems: how to exchange our imaging
data, and what you should use every day at your beamline. My experience
is that once the data are in an HDF5 file, I will be able to read
them.
Best regards,
Armando
Just to make that clear: I am completely open to using all the possibilities of
HDF5 wherever we need them! We should not be restrained by our data format,
ever. But on the other hand I am not so happy imagining lots of different
HDF5 files, all written and organized using completely different design
patterns. I would really appreciate some kind of standard pattern which
defines where to put which (meta)data, and how this is designated/named.
Otherwise, any program able to read HDF5 that is looking for some special
part of the data would have to either a) apply some heuristics, which
would vary from program to program, or b) ask the user to browse
the tree or otherwise indicate where the required data is located
(Armando, please correct me, but this is how PyMca is doing it right
now, right?).
So, to summarize: Right now (having not written a single HDF5-file
myself) I would try to adhere to the NeXus-specifications as long as
they do not restrain me, and otherwise try to come up with something
that blends into that design pattern. If NeXus turns out to be unusable,
one would have to find or develop some other kind of standard which is
more suitable.
Anyway, we agree that this standard, be it NeXus or something else,
should allow everyone to write very simple datasets in HDF5-files
without unnecessary difficulties. And I would add: if you want to
incorporate any kind of metadata, there should exist a definition where
to put it and how to find it. Is that unrealistic?
Cheers, Gerd
Quoting Gerd Wellenreuther <Gerd.Well...@desy.de>:
> Vicente Sole schrieb:
>> Concerning NeXus NXdata group, my (personal!) opinion is that it is
>> fine for what it was intended for: to define a default plot. If, in
>> addition to moving two motors, you are simultaneously taking data
>> with more than one detector and those detectors do not have the
>> same dimensions (a 2D detector, a 1D detector and a point detector
>> is quite common in this imaging field), you will see that most
>> likely you are going to need more than one NXdata group...
> Sure, but that is possible - at least this is what they (=NeXus) claim
> and propose. Or maybe I missed your point?
My point is that for further analysis (particularly for the
hyperspectral type) you will anyway have to browse the file for the
appropriate information. Please do not be mistaken, I intend to use
the NXdata group, but only for what it was intended: a default plot.
Simply, if we are just going to share data, a simple approach as Chris
suggested is enough. The NXdata group does not provide the required
metadata for the analysis that you are looking for, only for the
correct plot. Therefore, it does not bring much more than the simple
model proposed by Chris.
If you are not yet convinced, please take a close look at the NeXus
web page; you will read that one of its goals was to separate the
measured data from the metadata needed to generate them.
Quoting http://www.nexusformat.org/Design#NeXus_Classes
"""
One of the aims of the NeXus design was to make it possible to
separate the measured data in a NeXus file from all the metadata that
describe how that measurement was performed. In principle, it should
be possible for a plotting utility to identify the plottable data
automatically (or to provide a list of choices if there is more than
one set of data). In order to distinguish the actual measurements from
this metadata, it is stored separately in groups with the class NXdata.
"""
So, if you want us to follow their criteria, NXdata will not be enough:
you will need NXinstrument, NXdetector, you will find detectors as
simple as an MCA missing, etc. Again, Chris's proposal fully meets its
goal: to share our data in the simplest of ways.
>
> Just to make that clear: I am completely open to using all the possibilities
> of HDF5 wherever we need them! We should not be restrained by our
> data format, ever.
Great to hear that from somebody else. I am still waiting to hear it
from the NIAC.
> But on the other hand I am not so happy imagining lots
> of different HDF5 files, all written and organized using completely
> different design patterns. I would really appreciate some kind of
> standard pattern which defines where to put which (meta)data, and how
> this is designated/named.
My hope is that common use will lead to common needs and therefore to
consensus, although I am not so sure about the time scale for
that to happen.
> Otherwise, any program being able to read
> HDF5 looking for some special part of the data would have to
> a) either apply some heuristics, which would vary from program to program, or
as far as the heuristics works ... :-)
> b) the user would have to browse the tree or otherwise indicate where the
> required data is located (Armando, please correct me, but this is how
> PyMca is doing it right now, right?).
Yes, Gerd, PyMca asks the user to say where the relevant data
are located, but the user can save his preferences in order to
instruct the program about where to find the data. PyMca will support
properly defined NXdata groups too. Nevertheless, and again this is a
personal view, in the end everything can be reduced to a translation
dictionary: a type of analysis requires a set of metadata, the
program prompts the user for where to find them, the user asks the program
to remember the choice, and the problem is solved. Sure, you can have as
many configurations as instrumentation facilities, but access to
the data is guaranteed.
>
> Anyway, we agree that this standard, be it NeXus or something else,
> should allow everyone to write very simple datasets in HDF5-files
> without unnecessary difficulties.
Chris was showing the minimal requirements for an IDL program. For
Python it's almost as simple as creating a dictionary, and you can do
it from the interpreter. Please think about HDF5 as a file system:
you create a directory (= a group) where you create your data file (=
dataset). You can write a description either as a separate file (=
another dataset in the same HDF5 group) or as file properties (=
metadata as dataset attributes).
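An untested sketch of that analogy in Python/h5py (all names are placeholders):

import h5py
import numpy as np

data = np.ones((64, 64))                        # placeholder array
with h5py.File('example.h5', 'w') as f:         # file name is just an example
    grp = f.create_group('data')                # the "directory"
    ds = grp.create_dataset('data', data=data)  # the "file" inside it
    # description as "file properties" = metadata attached to the dataset
    ds.attrs['description'] = 'transmission image, arbitrary units'
    # or as a "separate file" = another dataset in the same group
    grp.create_dataset('readme', data='transmission image, arbitrary units')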
> And I would add: if you want to
> incorporate any kind of metadata, there should exist a definition where
> to put it and how to find it. Is that unrealistic?
I think Chris' proposal leaves the door open to writing metadata,
provided one says where to find it.
Best regards,
Armando
Just a brief comment
Using Nexus conventions may be convenient for those already using
Nexus, but its existing format may not necessarily suit everyone
else's needs.
Hence the proposal that at the most basic level we just put the data
in the 'data' field of an HDF5 file. That way it is brain-dead simple to
write a reader/writer - lowering the barrier to entry and making the
file at the lowest level a 'bucket' for data. Keeping it as simple as
possible will make it much easier to share data.
Of course some groups may desire to put extra information in the file
- for example the configuration of an instrument. But at a base level
the simple reader will still work.
To be sure: once the data is in HDF5 format most people with some
coding experience will be able to extract the data. That much is
clear. But that is not the point. The purpose of having a simple
interchange format is so that we do not have to recode for every
new HDF5 format that comes along.
At the base level I foresee this being a convenient container format
for sharing data amongst groups with minimum effort. (Another
discussion is whether we put all the data in '/data/data' instead of
'/data'. If we decide to put it in a group rather than at the top
level it's a few more lines of code, but that is a separate
discussion that should be settled soon.)
Anton
>
> I would say you are mixing two problems: how to exchange our imaging
> data, and what you should use every day at your beamline. My
> experience is that once the data are in an HDF5 file, I will be
> able to read them.
>
> Best regards,
>
> Armando
>
----
Anton Barty
Centre for Free Electron Laser Science (CFEL)
Notkestrasse 85, 22607 Hamburg, Germany
phone: +49 (0)40 8998 5783
secretary: +49 (0)40 8998 5798
anton.barty @ desy.de
Wellenreuther, Gerd wrote:
>
> Second:
>
> * I understood that there is a wish to share data using HDF5.
>
> * I have not understood yet why "raw" data is not better shared using
> a binary format. Why use HDF5 if the only thing I want to do is dump
> an array into it, maybe in the same data structure, without taking into
> account what that data is?
Because "raw" data is undefined.
The advantages I see with HDF5:
- you can drop in and mix data types and dimensions, unlike specialized
formats designed only for 2D
- you do not have to care about whether you are reading floats, doubles,
integers, ... It is self-descriptive.
- you can chunk your data, allowing very fast readout.
- you do not have to care about little-endian/big-endian problems
- you get straightforward readout with common tools (IDL, MATLAB, Python,
...) in very few lines of code
- HDF5 is on its way to becoming an ISO standard
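To illustrate the chunking and compression points, an untested h5py sketch (sizes and names are arbitrary):

import h5py
import numpy as np

stack = np.zeros((500, 256, 256), dtype=np.float32)   # placeholder image stack
with h5py.File('chunked.h5', 'w') as f:               # file name is just an example
    # chunking by image lets you read single frames quickly;
    # gzip compression is applied per chunk
    f.create_dataset('data', data=stack,
                     chunks=(1, 256, 256), compression='gzip')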
>
> * I noticed that I have a different view on what a "simple"
> read-/write-routine looks like. For me, simple means something
> like "browse the tree of the HDF5-file in question, until you find the
> first occurrence of any entry matching some criteria (e.g. size of
> array, name, position in the tree)".
import h5py
f = h5py.File(filename,'r')
data=f['/data/data']
Does it need to be simpler? It's just like looking into a hard disk,
because all you need is a path. Either you know the path and you
take it, or you browse the disk and find your path.
>
> But I am not a HDF5-expert ...
Neither am I, but having supported quite a few formats in PyMca, I can
say I appreciate it.
Armando
Just to summarize my insights after a discussion I just had with Anton
and Nicola, and the discussion over the last few days: most probably it
would be best to differentiate between two different applications of HDF5:
Simple data exchange (path-centered use of HDF5)
================================================
* items are found using a path
* the file only contains one dataset, or it is otherwise clear which data has to be used and how
* users (Chris, Anton, Gerd) have to agree about / communicate paths
* purpose: use HDF5 as a machine/platform-independent container
* Anton proposed '/data' as a unified storage place for sharing; Armando
suggested going one hierarchical layer deeper, into '/data/data'. Garth
and Armando thought it would make sense to indicate the kind of data by
actually putting the data to be shared in e.g. '/data/cdi' and linking
that to '/data/data', which would enable other data, e.g. monitors, to be
put in '/data/I0' (see the sketch below).
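A possible (untested) h5py sketch of that last proposal, with placeholder arrays:

import h5py
import numpy as np

cdi = np.zeros((128, 128))                     # placeholder CDI frame
i0 = np.ones(128)                              # placeholder monitor values
with h5py.File('shared.h5', 'w') as f:         # file name is just an example
    grp = f.create_group('data')
    grp.create_dataset('cdi', data=cdi)        # the shared data, under a telling name
    grp.create_dataset('I0', data=i0)          # monitor data next to it
    grp['data'] = h5py.SoftLink('/data/cdi')   # '/data/data' still resolves for simple readers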
Rich (meta-)data storage
========================
* items are found using a heuristic, e.g. looking for a special
attribute/tag rather than an absolute path (e.g. looking for a
data group with a name or tag called 'coherent diffraction imaging data')
* in case several matching datasets are found, any program should ask
the user
* users (SOLEIL, DESY, ESRF, APS, NeXus, groupXYZ) should agree about
guidelines / philosophy concerning names / tags / hierarchical
structures in HDF5
* purpose: create container-files containing all data + metadata
obtained during data acquisition, + additional data from preprocessing
and processing
* in order to achieve some kind of compatibility between synchrotron
labs, I suggested using NeXus as it is being used at SOLEIL. I think it
would be great to define some philosophies/guidelines about what
should be stored where and why, and how it is designated. Such
general guidelines + common use, as Armando suggested, will lead to the
evolution of the NeXus standard into something adequate for our purposes.
How to go on:
=============
* Armando is already creating HDF5-files, and I will start to link them
in the http://groups.google.com/group/mahid/web/datasets .
* I will also try to convert parts of the datasets hosted by DESY
(e.g. the raw fluorescence spectra as well as the elemental contents) into
two different HDF5 files.
* Anybody else having spectroscopic datasets in HDF5 is welcome to join.
* Further advances benefitting those people using HDF5 for simple
sharing could come from an improvement of the APIs, e.g. allowing a
simple fetch of an item based not only on the path, but also e.g. on a
certain name or tag.
* For the development of the rich-HDF5-format I really think that we
need a workshop :).
Cheers, Gerd
Great news! From HASYLAB at least Thorsten Kracht, Maria-Teresa
Nunez-Pardo-de-Vera and myself would most probably attend. Hopefully
Andre Rothkirch will also join, but maybe this is also a question of how
many people from IT can leave HASYLAB unattended for a few days :).
And since the SAXS people have also been thinking about this topic and
want to use HDF5/NeXus, they would/should also want to join. I cannot
speak for all the people from CFEL.
It would be great if this could happen in 2009, although it will make
things a little bit more difficult, especially for you.
Cheers, Gerd
On one other point regarding some advantages of HDF5. True, it may not be
the speediest way to read data, but its extensibility and
self-description make it a very good choice for wider acceptance. You
can also gzip-compress individual data fields in the file.
Individual groups can actually add attribute or data fields that are
specifically important for their operation and as long as they don't
modify the "agreed-upon" fields, we can still extract what's needed for
data analysis because we look specifically for attribute and data fields
by name. For a beamline, this means I can also decide to add attributes
later in time, for example as hardware upgrades are incorporated,
without having to modify software written to extract those specific
named fields.
Anyhow, I think a workshop at the ESRF would be a great idea.
Tony Lanzirotti
--
Dr. Antonio Lanzirotti, Ph.D.
Senior Research Associate
The University of Chicago - CARS
National Synchrotron Light Source
Brookhaven National Laboratory
Upton, NY 11973
(631) 344-7174
mailto: lanzi...@uchicago.edu
or
mailto: lanzi...@bnl.gov
Gerd Wellenreuther wrote:
> How to go on:
> =============
>
> * Armando is already creating HDF5-files, and I will start to link them
> in the http://groups.google.com/group/mahid/web/datasets
>
You can write a link to the data set:
http://ftp.esrf.fr/pub/scisoft/HDF5FILES/MGN1_4707eV.h5
I have tried to follow NeXus conventions as well as the agreed
/data/data way.
That should also serve to illustrate that by using links, one can meet
several standards.
Best regards,
Armando
PS. It is a fluorescence dataset of one of the samples used in
Analytical Chemistry 79 (2007) 6988-6994. I am not 100% sure it is part
of the region shown in Fig. 2, but it is the same sample.
BTW, I have tried to download the daphnia dataset but I am prompted for
a username and a password.
Armando
ambergino wrote:
> There are compelling reasons to specify the data as simply as possible
> to maximize the ease of reading/writing and also the portability among
> experiment types. But there are also compelling reasons to make the
> data be as specific as possible so that one does not have to rely on
> memory or logbooks to know what went into a data set, and it's also
> very good to avoid making long dialogs in a program that reads a file.
>
> To my thinking, we can accomplish both goals by making a layered
> format.
I guess I have to disagree, but please feel free to try to convince me. :)
If the purpose is just to share data, then we should keep everything
simple. It will not hurt if additional (meta-)data is in the file. But
typically, as I discussed with Anton and Nicola yesterday, the purpose
would be to just read that single dataset with the minimum amount of
three lines of code (open, read, close). If this is what you want to do,
you should stick to this routine and the paths, and you're fine. If you
need to write e.g. monitor data, just make a second HDF5 file.
On the other hand, I think we should not use these "simple" HDF5 files as
a starting point for something (much) more elaborate, because whatever
evolves out of this process will most probably neither have the property
of being easy to read/write, nor will it have the proper design for
advanced data analysis.
For example, one possible dead-end I see is connected to the usage of
fixed path-names. You always put data xyz into path abc. And now some
guy buys a second detector. Consequence: You have to find a new
convention for naming the path, tell everyone, and then they have to
change their programs in order to be able to read that data. That is bad
in my opinion. Very bad.
The only way to solve that misery is to distinguish between
"quick-sharing, path-oriented" simple HDF5-files, and more elaborate,
rich HDF5-files. In my opinion, the individual items in the latter
should *not* be identified by their location in the HDF5-structure
(although this should be kept as defined as necessary), but the real
identification is implemented as some kind of attribute or tag: the data
should identify itself as a fluorescence map, or a XANES scan, or
a CDI-image. Then and only then an elaborate program could browse the
HDF5-tree, and for example extract all fluorescence maps and do
something with them. Of course, this need for browsing the tree would be
a somewhat higher barrier for users.
But again: these two purposes are really different! Quick and really
easy sharing collides with exact, path-independent identification (at
least as long as there are no HDF5 routines available to all of us that do
the browsing and retrieval based on tags).
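To make the tag idea concrete, an untested h5py sketch; the attribute name 'signal_type' and its values are only placeholders for whatever tag convention we would agree on:

import h5py

def find_tagged(filename, tag_value, tag_name='signal_type'):
    # return the paths of all datasets carrying the given tag attribute
    matches = []
    def visitor(name, obj):
        if isinstance(obj, h5py.Dataset) and obj.attrs.get(tag_name) == tag_value:
            matches.append(name)
    with h5py.File(filename, 'r') as f:
        f.visititems(visitor)
    return matches

# e.g. find_tagged('scan.h5', 'fluorescence map')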
Cheers, Gerd
Yupp, but username and password are displayed on the datasets page. I
guess I have to make this bigger ...
Please, those of you who have not done it yet, play ASAP with HDF5 files in
your favorite language (C, Fortran, MATLAB, IDL, Python, ...); it will
be very instructive.
Best regards,
Armando
I also agree here. If HDF5 itself is a "portable filesystem", why should
it be simpler to have two separate files than two groups inside the same
HDF5 file, which avoids missing data?
Armando
V. Armando Solé wrote:
> ambergino wrote:
>> I don't like having multiple files because then you run the risk of
>> missing some of them when you copy data from an experiment.
>>
> I also agree here. If HDF5 itself is a "portable filesystem", why should
> it be simpler to have two separate files than two groups inside the same
> HDF5 file, which avoids missing data?
Sure, as long as you have communicated where which data lies, everything is
fine. But then you obviously need to define more than just '/data/data',
and you have to communicate that. There is absolutely nothing wrong with
setting up a rather strict data-structure like the proposed
'/data/data'. But everyone using it should be well aware that while it
is very easy to read it in, it has certain disadvantages - either you
need to communicate changes, or you need several files.
My point is that for the rich type as summarized by me, as much
information as necessary to understand the file should be *in* the file,
and not agreed upon beforehand and communicated (e.g. via email).
And consequently we should rather agree on certain conventions
regarding naming / tagging / structuring objects in an HDF5 file than
fix absolute paths, because there will always be something we haven't
thought about. Then it is good to have a convention written down
somewhere which can be used to put that new kind of data in a sensible
position and give it a sensible tag, instead of creating '/data/data1',
'/data/data2' or something like it.
For applying statistical methods to the data, you do not need more than what I
sent in my file: just know that the last dimension corresponds to the
measured data.
For specific ways of analysis, one would need a dedicated group, with
defined dataset names, linking to wherever in the file the actual data
is stored.
If one has followed the full NeXus convention, and wants to perform
azimuthal averaging of an image obtained by powder diffraction, one
would need a group where image, sample_detector_distance,
pixel_size_dim0, pixel_size_dim1, direct_beam_position, detector_tilt,
detector_rotation, wavelength, and perhaps something else are written. If
that "image_powder_diffraction_group" is available and everybody has agreed
on the names and so on, the analysis is possible irrespective of the
convention used to store the data (NeXus in this example). If we could
get a consensus about the minimal set of information needed to perform a
particular analysis, FROM THE RAW DATA, for image powder diffraction
and/or image SAXS (quite similar problems) and/or XRF mapping
and/or XANES mapping, and so on, we would have taken a huge step forward.
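Such a group could even be built afterwards with links only; an untested h5py sketch, where the source paths merely stand for wherever a given file actually keeps those quantities:

import h5py

with h5py.File('powder.h5', 'a') as f:           # file name is just an example
    grp = f.require_group('image_powder_diffraction')
    grp['image'] = h5py.SoftLink('/entry1/instrument/detector/data')
    grp['sample_detector_distance'] = h5py.SoftLink('/entry1/instrument/detector/distance')
    grp['wavelength'] = h5py.SoftLink('/entry1/instrument/monochromator/wavelength')
    # pixel_size_dim0, pixel_size_dim1, direct_beam_position, detector_tilt,
    # detector_rotation, ... would be linked in the same way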
The /data/data approach was intended for sharing simple data. I think for
pure statistical analysis problems it can take us quite far in the meantime.
Armando
> be the only option to allow people (including us) to prepare. I wonder
> however if this is not a problem for colleagues from USA. I have heard
> DOE employees need a 3 month lead time for travel. Some feedback on this
> topic from USA colleagues would be useful. We have started the ball
> rolling over here ...
if more than 3 DOE people are coming from the same DOE lab, then yes, it
requires ~3 months lead time. If it is 2-3 it should likely be OK (the
limit is a total expenditure of $10000 for a conference).
Cheers,
Stefan
--
Dr. Stefan Vogt
Group Leader Microscopy Adj. Assoc. Professor
Advanced Photon Source Feinberg School of Medicine
Argonne National Lab. Northwestern University
phone: (630) 252-3071; beamline: -3711; fax: -0140
cell: (815) 302-1956
http://www.stefan.vogt.net/
It is great that phynx is already able to do the browsing/searching by
tag name; I just wonder if you know whether something like that also exists
for IDL, MATLAB, etc.? (I am fine with Python, but others aren't.)
Cheers, Gerd
Not that I know of. I don't think it would be hard for someone
proficient in those languages to do, though. If anyone is interested,
I can post the phynx code somewhere in this group's collection of
documents. The project itself is hosted as part of a larger data
acquisition and analysis project (xpaxs) hosted at Launchpad. I've
been intending to split phynx off into its own separately branchable
project, as soon as the Bazaar version control system gets support for
nested trees.
Darren
The idea is to organize a workshop at the ESRF.
The link to set your preferences is:
http://www.doodle.com/u3pusirdm5idzzdm
The SOLEIL users meeting is in January (20-21) and the ESRF users meeting is in
February (2nd week).
Just keep it in mind when choosing your dates. I guess we'll need at
least three days to get something productive but we'll see.
See you,
Armando
At the ESRF I have identified some relatively common problems that could
benefit from such an agreement and surely there are more.
I am not necessarily an expert in the associated fields, and perhaps
there are well-defined systems to describe those measurements that we
just need to embed into HDF5.
Some of those common problems that are closely related:
Powder diffraction data collection with 2D detectors
SAXS with 2D detectors
Conversion from image data to Q space
Conversion from image data to HKL space
Clearly there will not be a Nobel prize for solving those issues, but
the amount of time that can be lost on them just to be able to start the
actual analysis ...
The list is open. Please feel free to open new discussion subjects
associated with particular techniques in this mailing list, too.
Armando
>
> Dear all,
> I think that a few ideas/things are missing in the whole discussion:
>
> there seems to be nobody who thinks that, a few days after the "file has
> been shared", people could forget what they have on their hard drive;
>
Perhaps those who think the same thing are busy.
> there is nobody who thinks that a utility can be written or used, that
> looks in the files, and dumps what is inside, like an image browser.
There are those who work to make sure that such a utility exists and goes
well beyond that.
PyMca allows you to browse the files:
http://ftp.esrf.eu/pub/bliss/PyMca4.3.1-20091008-snapshotSetup.exe
and, if you take the associated ROI imaging tool, you can, besides ROI
imaging :-), do PCA analysis. The latest check-in I did yesterday to
the SourceForge SVN repository even allows you to perform ICA on your
datasets. NNMA will follow when Gerd and I find the time to work
together on it or when I have more time.
My idea when making the datasets available was to compare pure
statistical methodologies of analysis. Please, just give it some time! :-)
So, the files are there and the applications are there. PyMca is just
an example, but I am sure there are others around.
If you want, you can play with the datasets yourself. The MGN1 dataset
is quite good for PCA, while PCA seems to bring very little information
in the Daphnia dataset and ICA (I did it last night and
maybe I did it wrong) only seems to find 3-4 reliable components. You
see, there is already matter for discussion. Nevertheless, to discuss
results and so on, I would like to open another mailing list thread and,
if possible, compare the different findings.
> In
> this way each person that receives the file could be able to "browse" the
> file and automatically discover that actually "/data/data" is
> "/data/corrected_data" or whatever.
As I said, the datasets were stored in /data/data, and they are.
> I have joined more than once a
> different collaboration, I have seen new students starting or "old"
> postdocs/professors, who were used to one version of the reconstruction
> software, being totally sidetracked by new versions of the software.
> All in all, most of the problems were not that things were moved or
> changed, but that the way that data were saved was not "automatically"
> clear. People could not see by opening a data file or a ntuple or a
> roottree what had changed or what was not "canonical" and they needed
> somebody who could explain how to treat the values and what
> meaning the objects in the files had.
I think the submitted datasets can allow us fruitful discussions
working with pure statistical methods. All you need to know is that the
first two dimensions correspond to the dimensions of your map while
the last dimension corresponds to the measured 1D data. You do not even
need to know they are fluorescence data.
> That is why I think that a bit more info saved in the files is better,
> even if there is the need for a bit more "browsing". Experienced people
> will be able to browse the file, and inexperienced people will, by browsing,
> learn what they have been given...
Nobody is saying /data/data is going to solve everything. But in the
meantime we already have something to work with that does not need
more information. Some people are interested in how to describe data
in a file, some are interested in just writing it, some just in
reading it, and some are interested in the science that can be done.
I belong to all of those groups and I do not want one of the issues
to prevent the others from going further.
> The only problem that a new student or postdoc will have is if there is
> no other way for him to look into the files than to run a full-fledged
> program.
Not at all. You can inspect the files from the command line of popular
tools like MATLAB, IDL, Python, ... Two or three lines of code of
those tools and you have your data, you can plot them, print them,
modify them, export them, ...
With the submitted files it is even simpler because you already know that
the dataset is at /data/data.
> It is vital that there is always a simple utility (under all, and
> I stress ALL, architectures) to look into data-files (and I say "always"
> in the sense that this utility MUST be maintained).
>
MATLAB is maintained and available on all architectures.
IDL is maintained and available on all architectures.
Python is maintained and available on all architectures.
PyMca is maintained and available on all architectures. In addition, it
provides hyperspectral analysis capabilities and you can freely
interact with the main author (myself).
I do not think the situation is so bad.
Armando
One may also consider other particular needs important, too, such as
building an index for the library of accumulated data files and their
last known location as well as some high-level metadata to help in
searching and cross-referencing. This type of tool is essential as one
begins to think of grid resources for storage and computation.
Regards,
Pete
Nicola Coppola wrote:
> Dear all,
> I think that a few ideas/things are missing in the whole discussion:
>
> there seems to be nobody who thinks that, a few days after the "file has
> been shared", people could forget what they have on their hard drive;
>
> there is nobody who thinks that a utility can be written or used, that
> looks in the files, and dumps what is inside, like an image browser.
--
----------------------------------------------------------
Pete R. Jemian, Ph.D. <jem...@anl.gov>
Beam line Controls and Data Acquisition, Group Leader
Advanced Photon Source, Argonne National Laboratory
Argonne, IL 60439 630 - 252 - 3189
-----------------------------------------------------------
Education is the one thing for which people
are willing to pay yet not receive.
-----------------------------------------------------------
Quoting Pete Jemian <prje...@gmail.com>:
>
> Actually, data standards people think about this as the first, best
> "killer app" since that type of visualization is what may help convince
> people that their particular data format is worth further examination.
> NeXus, for example, has nxplot. One early design point for NeXus was to
> store the data in such a way that a plotting tool could automatically
> discover what and how to plot the data in a NeXus file.
>
NXdata is one of the classes I really like in the NeXus design.
In the currently available datasets, I have created the /data/data
dataset as a link to an array in an annex NXdata group. I am
interested to know if NXplot is tolerant enough to recognize the
presence of the NXentry and the NXdata groups despite the presence of
the link. If the NXdata group is not compliant, please let me know
and I will correct it.
Armando
On Sun, Oct 11, 2009 at 8:47 PM, ambergino <chris.j....@gmail.com> wrote:
>
>
>
> On Oct 10, 11:50 am, Nicola Coppola <copp...@mail.desy.de> wrote:
>> Dear all,
>> I think that a few ideas/things are missing in the whole discussion:
>>
>> there seems to be nobody who thinks that, a few days after the "file has
>> been shared", people could forget what they have on their hard drive;
>>
>
>
> Hi Nicola (and others) - to the contrary, I think it is quite
> important and useful to store as much information as possible, and the
> NetCDF file formats that we've been using for 15 years for STXM at
> Brookhaven, and 6-7 years for CDI at the ALS, incorporate all relevant
> parameters automatically. We plan on shifting both to HDF5, and if you
> look here (http://xray1.physics.sunysb.edu/~jacobsen/colormic/spectromicro_draft2.pdf)
> you'll see that we already have defined an HDF5 format for analysis,
> but we're early enough in the use of this that we are happy to shift
> to whatever consensus format emerges.
>
> All I want to do is to also preserve the option for people to write
> files in as simple a way as possible, to maximize adoption of the
> standard.
Armando and I have been using a hierarchy that attempts to do the
same thing. It is very similar to
spec datafiles, where each scan is a group below the hdf5 root:
/entry1                  # an N by M 2-D scan, for example
  /measurement
    /scalar_data         # arrays for motor positions, counters, like spec's table of data
    /positioners         # starting positions, like spec's scan header
    /mca_1               # a group containing mca data, for example
      /counts            # an array of shape (N times M) by Number_of_Bins
      /deadtime
    /ccd_2               # a group containing image data
      /counts            # an array of shape (N times M) by Ypixels by Xpixels
Instead of having an ev_array stored in the hdf5 file, I have been
storing an array of bin numbers and calibration parameters to convert
the bins to energy using a second order polynomial. When I ask for
mca_1.energy in python it automatically calculates the energy array. I
think the bin numbers and the calibration are the important values to
be stored.
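In Python that amounts to something like this untested sketch (the dataset paths and the layout of the calibration array are only illustrative):

import h5py

with h5py.File('scan.h5', 'r') as f:                     # file and paths are illustrative
    channels = f['/entry1/measurement/mca_1/channels'][...]
    a, b, c = f['/entry1/measurement/mca_1/calibration'][...]
energy = a + b * channels + c * channels**2              # second-order polynomial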
We did not organize things into hdf5 groups according to technique, as
you suggest in your pdf; we just wanted an as-simple-as-possible way
to organize raw data, regardless of technique. But with hdf5 links,
your spectromicro group could be created to make the data available
with all the context of that particular technique. This way we can
have the best of both worlds.
In your mandatory /spectromicro items, you have an image_array of
shape [nx,ny,nz,n_ev]. I have been saving such data as [nx times ny
times nz, n_ev]. There are several reasons for this: it simplifies
data acquisition, since I only need to know the index of the point in
the scan instead of all my x and y and z indices; it supports
non-uniform images (data sampled at irregular intervals), and it also
supports scans of arbitrary dimensionality. Imagine I am working with
data interactively in some shell, and it was a 1D scan. If I want to
plot the first mca spectrum, I want to do this:
plot(mca.counts[0])
not this:
plot(mca.counts[0,0,0])
But I understand there are times when you want regularly-spaced data
formatted into [nx,ny,nz,n_ev], so I store the shape of the scan in
hdf5 and can provide a proxy that will let you index the array as if
it were formatted that way:
mca.counts.image[0,0,0]
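An untested sketch of such a proxy, assuming the scan shape was stored alongside the data (the path used for it here is purely illustrative):

import h5py

with h5py.File('scan.h5', 'r') as f:                     # file and paths are illustrative
    counts = f['/entry1/measurement/mca_1/counts'][...]  # shape: (N*M, n_bins)
    scan_shape = tuple(f['/entry1/scan_shape'][...])     # e.g. (N, M)
image_view = counts.reshape(scan_shape + (-1,))          # index as [ix, iy, bin]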
>> there is nobody who thinks that a utility can be written or used, that
>> looks in the files, and dumps what is inside, like an image browser.
>
> In fact I think that those who are already using HDF5 know quite well
> that it's very easy to see what's in any HDF5 file and how it is
> structured even if you are given no other info. The simplest way is
> to use the "h5dump" command that gets built when you install HDF5;
> there are also more elaborate browsers.
I wrote a PyQt4 based tree-view widget to allow files to be explored
in a GUI. Armando folded that into pymca and added all kinds of new
features and capabilities.
> I've been meaning to dig up a nice example dataset and write it to
> HDF5; hopefully this week.
I'll post a file of my own as well.
Darren
Is /data/data seriously being considered for some kind of convention?
I'm not familiar with it, where is it discussed? I think the naming
convention /data/data is unfortunate.
Darren
A major problem with data standards is the reluctance of
the community to accept a standard that is different from
what is presently done. The reluctance is stronger from
those who are instrument scientists who often have been
using some (local or perhaps broader standard) format for
years and see no added value in using something new.
Sometimes, the reluctance is due to management pressure
("you do not have time or resources to spend on that effort")
but most often it is more pragmatic: they have
reduction/analysis codes that already write and read their
format and have been producing science. In this case, a
new standard does not provide tangible benefit to their
instrument.
They miss the point, though. Big time. While many
researchers do not couple their data from one instrument
with data from another for a given analytical procedure,
there are plenty of researchers who need to do exactly
this. One example is contrast variation studies capitalizing
on the differences in the scattering of neutrons and X-rays.
These scientists could obtain substantial benefit from a common
data standard that would carry both kinds of information.
The vision is that such a common standard would provide benefit
for the "multiple instruments" scientists without penalizing
the "single instrument" investigators.
One requirement set by instrument scientists that is necessary
for any new data standard is "to be better than what is presently
in use." There must be a benefit or there is no need to change.
The benefit I can see, which still is not convincing, is that
support of scientists who collect data on multiple instruments
will benefit each and all those instruments in new ways.
Gain notoriety, broader user base, more competition for
higher-quality science, etc.
Regards,
Pete
It was discussed at the beginning of this thread as just a fast and
easy way to get access to the main data. The already available datasets
(from me) follow that convention but they also follow the NeXus
convention just to illustrate that with links, one can do everything.
Armando
I have already suggested defining HDF5 analysis groups depending on the
type of analysis. It is not necessary to write the raw data directly
based on the type of analysis (if you are sure about what you are doing
then go ahead). In Darren's and my convention, those analysis
groups would be just links to information stored in the measurement
group. In the NeXus convention, those analysis groups would be links to
the relevant information spread among NXdata, NXsample, NXmonitor and
NXinstrument groups. In my opinion, those analysis-minded groups can
save the whole NeXus approach from failure because it is certainly not
analysis minded (just see how many places you have to inspect to find
the relevant information, and that without accounting for different
facilities' recipes). The most important thing to retain is that we
have to agree on the analysis-minded groups; we do not need to agree on
how to write the raw data.
In this thread I already mentioned how easy it would be to define a group
allowing the analysis of raw powder diffraction data and raw SAXS data
irrespective of the underlying structure of the file. Darren and I are
convinced of our approach, NeXus people are convinced of their approach,
others will be convinced of theirs, but the analysis problem is not
solved by any of them, just the storage. A bonus of Darren and Armando's
approach is that similarly structured groups can be used to store
analysis results but to me, the main problem is not the storage. For the
storage you can choose an existing solution or invent one yourself.
> In your mandatory /spectromicro items, you have an image_array of
> shape [nx,ny,nz,n_ev]. I have been saving such data as [nx times ny
> times nz, n_ev].
Chris, if I can offer you a hint, try to avoid coupling "vertex"
information (nx, ny, nz) with "value" information (n_ev) because that
can make 3D visualization and handling cumbersome and memory hungry.
The "natural" analysis of 4D data is to have the vertices on one side
and the associated values on the other side. That also allows reducing the
dataset size when the vertices can be replaced by a regular mesh, because
instead of having nx*ny*nz values for x, for y and for z, you have nx x
values, ny y values and nz z values.
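As an untested numpy sketch of that last point (the axis values are placeholders):

import numpy as np

# nx + ny + nz values instead of 3 * nx*ny*nz vertex coordinates
x = np.linspace(0.0, 1.0, 50)
y = np.linspace(0.0, 2.0, 60)
z = np.linspace(0.0, 0.5, 10)
# only when a tool really needs the full vertex arrays, expand them:
X, Y, Z = np.meshgrid(x, y, z, indexing='ij')   # each of shape (50, 60, 10)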
Armando
I saw the discussion of /data/data at the beginning of the list, and
still don't understand what it is or why it is preferable to other
existing conventions (or extensions thereof). Do people think it would
be useful to try to build on the NeXus conventions to create Technique
or Analysis definitions similar to their instrument definitions
(http://www.nexusformat.org/Instruments)? What I am imagining would
look similar to Chris's /spectromicro specification, and could be used
as a basis for sharing data for analysis, and later incorporated into
whatever archive.
Darren
Once you have accepted an NXdata group, you do not need to specify if it
is an image or not.
You just have to take a look at the attributes of the different datasets.
If you take a look at the datasets of the files I submitted, you will see
that three of them have the attribute axis and only one has the
attribute signal. That already tells you how to plot the data. If you
set the attribute axes of the dataset marked as signal, you should be
able to know if it is a 1D, 2D or 3D dataset.
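An untested h5py sketch of that inspection, following the signal/axis attribute convention described above (the group handling is only illustrative):

import h5py

def plottable(nxdata):
    # return the signal dataset and the axis datasets ordered by their 'axis' attribute
    signal, axes = None, []
    for name, ds in nxdata.items():
        if 'signal' in ds.attrs:
            signal = ds
        elif 'axis' in ds.attrs:
            axes.append((int(ds.attrs['axis']), ds))
    return signal, [ds for _, ds in sorted(axes, key=lambda item: item[0])]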
Armando