hdf5 for x-ray microprobe data


Matt Newville

Nov 24, 2009, 4:22:55 PM
to ma...@googlegroups.com
Hi Folks,

Though I haven't been using HDF5 files for storing data from our x-ray
microprobe, I am keen to start doing this. I had an opportunity to
try it out for an "exchange format", and thought it might be helpful
to write down some notes about my experiences.

We had a user come to our microprobe in October who wanted to use
Stefan Vogt's MAPS program to analyze their data. Though both Stefan
and I collect data with EPICS, the data file formats we use are not
the same. I thought that trying out HDF5 as an interchange format for
this one case would help me better understand:
1. How well can Python and IDL work together with HDF5 files?
2. What data is needed to fully explain a set of microprobe data?
3. What is the right data layout for HDF5 files?

Although I looked at the example data sets on the MAHID discussion
pages, I decided to not necessarily follow these examples, in part to
see how big the differences would be between the "naturally evolved"
datasets. I'm perfectly willing to move to a common standard format,
once that is figured out.

I was surprised at how difficult it was to deal with HDF5 files in
IDL. I managed to crash IDL and corrupt data files several times as I
was working on code, and IDL's tools for exploring HDF5 files are
crude and clunky.

I was also surprised at how non-obvious it was to encode "dead time
correction". While there are many ways to encode the deadtime
correction, I opted for recording the correction scale factor, and
putting both raw and corrected data sets in the HDF5 file.
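
To make that concrete, here is a minimal h5py sketch of the layout I
mean (group and dataset names are illustrative, not a proposed standard):

    import numpy as np
    import h5py

    # hypothetical 21x21 map with 2048-channel MCA spectra per pixel
    raw = np.random.poisson(100, size=(21, 21, 2048)).astype(np.int32)
    dt_factor = 1.0 + 0.05 * np.random.rand(21, 21)  # deadtime scale factor

    f = h5py.File('xrm_map.h5', 'w')
    grp = f.create_group('data')
    grp.create_dataset('det_raw', data=raw)
    grp.create_dataset('dt_factor', data=dt_factor)
    # store the corrected data as well, so readers need no calculation
    grp.create_dataset('det_corrected',
                       data=raw * dt_factor[:, :, np.newaxis])
    f.close()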

I've put more complete notes and description of the file format I
used, 2 example data files, and some initial code (Python and IDL) at
http://cars9.uchicago.edu/pybeamline/DataFormats/H5UsageNotes

I'd welcome any comments.

--Matt Newville

Matthew Dougherty

Nov 24, 2009, 5:33:01 PM
to ma...@googlegroups.com
Hi Matt

main comment:
1) don't use /data/ as your root group;
2) use /GSECARS/ instead;
3) attach the HDF5 attribute [DOMAIN_FORMAT=http://cars9.uchicago.edu/pybeamline/DataFormats/H5UsageNotes] to /GSECARS/.
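
A minimal h5py sketch of that suggestion (the file name is hypothetical):

    import h5py

    f = h5py.File('xrm_map.h5', 'w')
    # instrument-specific root group instead of a generic /data/
    grp = f.create_group('GSECARS')
    # point readers at the document describing this layout
    grp.attrs['DOMAIN_FORMAT'] = \
        'http://cars9.uchicago.edu/pybeamline/DataFormats/H5UsageNotes'
    f.close()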

Matthew Dougherty



Matthew Dougherty
National Center for Macromolecular Imaging
Baylor College of Medicine



Darren Dale

Nov 24, 2009, 8:50:55 PM
to ma...@googlegroups.com
On Tue, Nov 24, 2009 at 4:22 PM, Matt Newville
<newv...@cars.uchicago.edu> wrote:
> Hi Folks,
>
> Though I haven't been using HDF5 files for storing data from our x-ray
> microprobe, I  am keen to start doing this.  I had an opportunity to
> try it out for an "exchange format", and thought it might be helpful
> to write down some notes about my experiences.
[...]
> I've put more complete notes and description of the file format I
> used, 2 example data files, and some initial code (Python and IDL) at
>    http://cars9.uchicago.edu/pybeamline/DataFormats/H5UsageNotes
>
> I'd welcome any comments.

Thank you for publishing your notes.

You mentioned on the website that IDL sometimes cannot read a well
formed file, and that it might have been due to the file being open
multiple times. Do you mean that you had the file open in multiple
programs at the same time? If so, I think that is unfortunately a
situation that needs to be avoided. I don't think the hdf5 library
uses locks to prevent multiple processes from modifying/accessing the
file at the same time, so this can result in corrupted data. However,
I may be mistaken, and would love to have someone correct me. If I
understood Andrew Collette (author of h5py) correctly, this situation
is improved if hdf5 is compiled with support for mpi. Preliminary
tests seem to show that hdf5 can be built as a shared library
(required for python bindings) with mpi support, but I don't think the
hdf5 group tests this configuration.

Concerning dead time, I think I have an elegant solution in phynx: I
only save raw data; corrected data is calculated by proxies (which can
be indexed just like a dataset).
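In spirit, the proxy looks something
like this (a simplified sketch in plain h5py, not the actual phynx API;
the file and dataset names are made up):

    import h5py

    class CorrectedProxy(object):
        """Indexes like a dataset, applying the correction on access."""
        def __init__(self, raw, factor):
            self._raw = raw        # h5py dataset of raw counts
            self._factor = factor  # h5py dataset of correction factors

        def __getitem__(self, key):
            # only raw data is stored; corrected values are computed here
            return self._raw[key] * self._factor[key]

    f = h5py.File('xrm_map.h5', 'r')
    corrected = CorrectedProxy(f['/data/det_raw'], f['/data/dt_factor'])
    spectrum = corrected[5, 5]  # corrected spectrum for one map pixel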
I will be focusing on phynx documentation (and unit tests) in order to
prepare for the upcoming workshop. pytables and h5py (and phynx, which
extends h5py to specialize for our kinds of data) each provide such an
easy-to-use interface that all kinds of really elegant solutions become
possible for the kinds of problems we will be addressing. I'm sorry to
hear that IDL's support for hdf5 is so badly wanting. I would not
recommend using h5py with hdf5-1.6: hdf5-1.8 supports many essential
features, like being able to resize data and copy groups (and their
subgroups) from one file to another.
Plus, the hdf5 website itself states:

"Release 1.8.4 is the latest release of HDF5 1.8. New projects are
strongly encouraged to use this release as it contains many new
features, file optimizations, and performance enhancements."
[...]
"Release 1.6.10 is the latest release of HDF5 1.6. This release is
provided for users with existing major projects that cannot be ported
to the latest release. Users with tools that have not been ported to
the newer release may also need to use this release. PLEASE NOTE that
this is the last release of HDF5 1.6."

Darren

Matt Newville

Nov 24, 2009, 9:19:59 PM
to ma...@googlegroups.com
Hi Matthew,

Thanks for the comments. Those are all good suggestions. I was not
sure that a top-level "data" group was needed at all, though it does
allow multiple data sets in a single file. I have to admit that I
didn't adopt a top-level group until seeing the other files on the
MAHID web pages. I have no allegiance to the group name "data" (or
any of the other group names, or the layout at all).

Since IDL seems to have a very difficult time programmatically
exploring HDF5 files, I can see that storing data with a pre-defined
layout, with known group names and tags is very important. I don't
have a strong preference whether there is a published schema (à la
NeXus) or simply an agreed-upon format+API.

Cheers,

--Matt Newville

Werner Benger

Nov 25, 2009, 1:33:14 AM
to ma...@googlegroups.com
Hi Matt,

the only agreed-upon "official" way to specify images in HDF5 is the
HDF5 image API:

http://www.hdfgroup.org/HDF5/doc/ADGuide/ImageSpec.html

It has its limitations, but at least it's part of the HDF5 releases already.
In theory, all applications should support detecting a dataset that conforms
to these specifications as an image. In practice, not all might do so; I don't
know about IDL here. Technically, it would not matter at this point
whether you place all your images in the root group or in a subgroup
hierarchy, though bundling them in subgroups is a good idea to keep
together images that share common properties, e.g. the same creation
date or a relation to the same dataset. In my own work, I arrived at a
six-level scheme for organizing datasets (and images) in subgroups,
which I can tell you more about if you're interested, but it's not a
widely agreed standard.
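
For what it's worth, tagging a dataset per that specification only
takes a few attributes; a sketch in Python/h5py (note that the spec
strictly wants fixed-length strings, so picky readers may care how
these attributes are written):

    import numpy as np
    import h5py

    f = h5py.File('images.h5', 'w')
    dset = f.create_dataset('Fe_Ka_map',
                            data=np.zeros((256, 256), dtype=np.uint8))
    # attributes defined by the HDF5 Image and Palette specification
    dset.attrs['CLASS'] = 'IMAGE'
    dset.attrs['IMAGE_SUBCLASS'] = 'IMAGE_GRAYSCALE'
    dset.attrs['IMAGE_VERSION'] = '1.2'
    f.close()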

Werner
--
___________________________________________________________________________
Dr. Werner Benger Visualization Research
Laboratory for Creative Arts and Technology (LCAT)
Center for Computation & Technology at Louisiana State University (CCT/LSU)
211 Johnston Hall, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362

"V. Armando Solé"

Nov 25, 2009, 2:51:47 AM
to ma...@googlegroups.com
Matt Newville wrote:
> Hi Matthew,
>
> Thanks for the comments. Those are all good suggestions. I was not
> sure that a top-level "data" group was needed at all, though it does
> allow multiple data sets in a single file. I have to admit that I
> didn't adopt a top-level group until seeing the other files on the
> MAHID web pages. I have no allegiance to the group name "data" (or
> any of the other group names, or the layout at all).
>
> Since IDL seems to have a very difficult time programmatically
> exploring HDF5 files, I can see that storing data with a pre-defined
> layout, with known group names and tags is very important. I don't
> have a strong preference whether there is a published schema (ala
> NeXus) or simply an agreed upon format+API.
>

You can use a predefined layout with known (to you and to your software)
names and tags if IDL does not allow you to explore the file in a
convenient way. Matthew has given us nice hints to avoid collisions with
other formats also based on HDF5. All you have to do is somehow
identify your data. If you are not going to share your data outside a
small community, then I guess it is irrelevant.

Personally, I prefer an attribute-based layout to a name-based layout,
but nothing prevents you from having both. For simplicity, on this
mailing list we had decided to put the data in /data/data till we get a
better layout. If you take a look at the Pigment dataset I provided,
and you look at the attributes, you will see that /data is an NXentry
group, /data/data is a link, and the relevant information is inside an
NXdata group. So, the file follows "NeXus rules" despite having an
agreed name-based layout. When I said in my PANDATA talk that NeXus
could save us a lot of discussion, I meant that some things have
already been thought through and solved, and we could use them. The
main problem I see with NeXus is that it takes an instrument-based
approach to storing the data, which is far from adequate from the
analysis point of view. Darren and I have come up with a proposal that
may help to simplify the situation; in the extreme case (just a
Measurement group) it only borrows NXentry from NeXus.
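
For those unfamiliar with the convention: the "NeXus rules" mentioned
here amount to class attributes on otherwise ordinary HDF5 groups,
roughly like this (a sketch; file and dataset names are made up):

    import h5py

    f = h5py.File('pigment.h5', 'w')
    entry = f.create_group('data')
    entry.attrs['NX_class'] = 'NXentry'   # NeXus class attribute
    nxdata = entry.create_group('measurement')
    nxdata.attrs['NX_class'] = 'NXdata'
    counts = nxdata.create_dataset('counts', data=[[1, 2], [3, 4]])
    # /data/data is then just a link to the actual dataset
    f['/data/data'] = counts              # hard link
    f.close()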

Concerning IDL and HDF5, I thought the support level was good.
Nevertheless, I think IDL also gets some "good promotion" in scientific
fields from the fact that people like Stefan Vogt (MAPS), Chris Ryan
(GeoPIXE) and Manolo Sanchez (XOP) use it. I am unable to judge it, but
I guess that IF the HDF5 support in IDL is very limited, IDL itself
could make some effort to improve it if the request comes from "good
customers"...

Best regards,

Armando

Matt Newville

Nov 25, 2009, 11:12:19 AM
to ma...@googlegroups.com
Hi Darren, All,

Thanks for the replies.

Darren wrote:
> Concerning dead time, I think I have an elegant solution in phynx

I agree that the phynx solution is elegant. It does put some burden
on API implementers in other languages, and also means that the
corrected data is not actually stored.

I think there are enough subtleties with dead time corrections that
placing the burden of doing the correction at the point of origin is
preferred. For example, if there are two dead-times for a detector
(detectors using XIA's DXP electronics have this feature), the full
correction may include deadtimes (in nanoseconds) that have been
determined separately, and then the correction is done iteratively...
at least that's one way to do it. It's hard to imagine that this sort
of correction (or ALL the variations on how to store deadtime) would
be done by every library that can read an HDF5 file.
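
To illustrate why this is non-trivial: with a paralyzable detector
model, for example, the measured rate m relates to the true rate n by
m = n * exp(-n * tau), which has no closed-form inverse and is usually
solved iteratively. A sketch (this is one common model, not necessarily
what the DXP electronics require):

    import numpy as np

    def true_rate(measured, tau, niter=20):
        """Invert m = n * exp(-n * tau) for the true rate n by
        fixed-point iteration (paralyzable deadtime model)."""
        n = np.asarray(measured, dtype=float)
        for _ in range(niter):
            n = measured * np.exp(n * tau)
        return n

    # e.g. 200 kcps measured with a 1 microsecond deadtime
    print(true_rate(2.0e5, 1.0e-6))  # -> roughly 2.6e5 true counts/s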

I think that for an interchange format, it's best to not rely on
anything besides trivial calculations at the user end.

==Issues with IDL

Darren wrote:
> You mentioned on the website that IDL sometimes cannot read a well
> formed file, and that it might have been due to the file being open
> multiple times. Do you mean that you had the file open in multiple
> programs at the same time? If so, I think that is unfortunately a
> situation that needs to be avoided. I don't think the hdf5 library
> uses locks to prevent multiple processes from modifying/accessing the
> file at the same time, so this can result in corrupted data. However,
> I may be mistaken, and would love to have someone correct me. If I
> understood Andrew Collette (author of h5py) correctly, this situation
> is improved if hdf5 is compiled with support for mpi. Preliminary
> tests seem to show that hdf5 can be built as a shared library
> (required for python bindings) with mpi support, but I don't think the
> hdf5 group tests this configuration.

I need to (or better yet, somebody else needs to) look into this more
carefully. I had a hard time getting reliable failures. I was
definitely doing all of these things, which may lead to troubles:
- overwriting test files
- using the HDF5 Viewer, and not always closing the Viewer.
- reading files on both Linux and Windows.
- using files sitting on networked drives.

I never had trouble with Python (h5py only, I didn't test pytables),
but several problems with IDL. I was using IDL 7.0 on both Windows
and Linux (but mostly Linux, simply because restarting IDL is faster).

My suspicion is that IDL and the HDF5 Viewer are not always good at
actually closing file handles. I believe I never had trouble when IDL
opened a "brand new file" that had never been touched by another
application. So, perhaps I was beating on HDF5 files more than they
expect -- that worries me a little.

Armando wrote:
> Concerning IDL and HDF5, I thought the support level was good.
> Nevertheless, I think IDL also gets some "good promotion" in Scientific
> fields from the fact people like Stefan Vogts (MAPS), Chris Ryan
> (GeoPIXE) and Manolo Sanchez (XOP) use it. I am unable to judge it, but
> I guess IF the HDF5 support in IDL is very limited, IDL itself could do
> some effort to improve it if the request comes from "good customers"...

I won't claim to be a good IDL customer, so that's not my fight.
FWIW, the release notes for IDL 7.1 (May 2009) say it supports HDF5
1.6.7. According to the HDF5 web pages, 1.6.7 was released in Jan
2008, 1.6.8 in Nov 2008, and 1.6.9 in May 2009. 1.8.0 was released
in Feb 2008. So IDL is 6 to 12 months behind HDF5 releases, and more
reluctant to move up minor versions; both of these seem reasonable.
It does mean that assuming an application can read HDF5 1.8 files may
not be a good idea for a long time (many folks are still using IDL 6).

Personally, I'm more concerned that files written with HDF5 1.8 can
*crash* applications linked with the HDF5 1.6 library. That seems
like it has to be mostly an HDF5 problem to me.

==Data Layout
Armando also wrote:
> Personally I prefer to have an attribute based layout than a name based
> layout but nothing prevents you from having both.

I agree with this. I added "Version" and "Beamline" attributes to the
top-level data group. In principle, attributes such as these ought to
be able to explain the data layout well enough for a library
to read data from several different sources.
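
As a sketch of what I have in mind (the attribute names are from my
files; the dispatch itself is hypothetical):

    import h5py

    f = h5py.File('xrm_map.h5', 'r')
    grp = f['data']
    beamline = grp.attrs['Beamline']
    version = grp.attrs['Version']
    # a reading library could dispatch on these attributes
    if beamline == 'GSECARS 13-ID':
        raw = grp['det_raw'][...]  # read using the GSECARS layout
    f.close()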

> For simplicity, at this mailing list we had decided to put the data in
> /data/data till we get a better layout. If you take a look at the Pigment
> dataset I provided, and you look at the attributes, you will see that
> /data is an NXentry group, /data/data is a link and the relevant
> information is inside an NXdata group. So, the file is following "NeXus
> rules" despite having an agreed name based layout. When I said in my
> PANDATA talk, that NeXus could save us a lot of discussions, I meant that
> some things have been already thought and solved and we could use
> them. The main problem I see with NeXus is that it is an Instrument based
> approach to store the data that is far from adequate from the analysis
> point of view. Darren and I have come with a proposal that may help to
> simplify the situation. In the extreme case (just a Measurement group)
> only borrows NXentry from NeXus.

OK, but why 'data/data'?? What is gained by following NeXus conventions?

Thanks,

--Matt Newville

Darren Dale

Nov 25, 2009, 11:28:02 AM
to ma...@googlegroups.com
On Wed, Nov 25, 2009 at 11:12 AM, Matt Newville
<newv...@cars.uchicago.edu> wrote:
> Hi Darren, All,
>
> Thanks for the replies.
>
> Darren wrote:
>> Concerning dead time, I think I have an elegant solution in phynx
>
> I agree that the phynx solution is elegant.   It does put some burden
> on API implementers in other languages, and also means that the
> corrected data is not actually stored.
>
> I think there are enough subtleties with dead time corrections that
> placing the burden of doing the correction at the point of origin is
> preferred. For example, if there are two dead-times for a detector
> (detectors using XIA's DXP electronics have this feature), the full
> correction may include deadtimes (in nanoseconds) that have been
> determined separately, and then the correction is done iteratively...
> at least that's one way to do it. It's hard to imagine that this sort
> of correction (or ALL the variations on how to store deadtime) would
> be done by every library that can read an HDF5 file.
>
> I think that for an interchange format, it's best to not rely on
> anything besides trivial calculations at the user end.

I see your point.

> ==Issues with IDL
>
> Darren wrote:
>> You mentioned on the website that IDL sometimes cannot read a well
>> formed file, and that it might have been due to the file being open
>> multiple times. Do you mean that you had the file open in multiple
>> programs at the same time? If so, I think that is unfortunately a
>> situation that needs to be avoided. I don't think the hdf5 library
>> uses locks to prevent multiple processes from modifying/accessing the
>> file at the same time, so this can result in corrupted data. However,
>> I may be mistaken, and would love to have someone correct me. If I
>> understood Andrew Collette (author of h5py) correctly, this situation
>> is improved if hdf5 is compiled with support for mpi. Preliminary
>> tests seem to show that hdf5 can be built as a shared library
>> (required for python bindings) with mpi support, but I don't think the
>> hdf5 group tests this configuration.
>
> I need to (or better yet, somebody else needs to) look into this more
> carefully.  I had a hard time getting reliable failures.   I was
> definitely doing all of these things, which may lead to troubles:
>    - overwriting test files
>    - using the HDF5 Viewer, and not always closing the Viewer.
>    - reading files on both Linux and Windows.
>    - using files sitting on networked drives.

At one time, I was using python to work with files stored on a samba
share, and the program would crash. I don't remember the details, but
the program did not crash when I copied the data to the local disk and
opened that copy.

> I never had trouble with Python (h5py only, I didn't test pytables),
> but several problems with IDL. I was using IDL 7.0 on both Windows
> and Linux (but mostly Linux, simply because restarting IDL is faster).
>
> My suspicion is that IDL and the HDF5 Viewer are not always good at
> actually closing file handles.  I believe I never had trouble when IDL
> opened a "brand new file" that had never been touched by another
> application.  So, perhaps I was beating on HDF5 files more than they
> expect -- that worries me a little.

I wonder if IDL automatically closes files when you exit the program.
h5py and pytables will do this for you.

Darren

"V. Armando Solé"

Nov 25, 2009, 11:28:57 AM
to ma...@googlegroups.com
Hi Matt,

Matt Newville wrote:
> OK, but why 'data/data'?? What is gained by following NeXus conventions?
>
> Thanks

To make it short:

Chris suggested sharing data as /data.

I just suggested not to have the data themselves in the root directory.

Chris then suggested having the data at /data/data, and to me that is as
good and as bad as any other name-based convention.

Further discussions with other people on the mailing list just served to
show that a workshop was needed. You already know the rest.

Concerning what is gained by following NeXus conventions:

- It serves to illustrate concepts.
- A NeXus-aware analysis program (if such a thing exists) can take the
data and visualize them.
Armando

Wellenreuther, Gerd

Nov 26, 2009, 3:47:18 AM
to ma...@googlegroups.com
Dear Matt,

just by pure coincidence, people from CREASO (they sell and support
IDL in Germany, Switzerland and Austria) were asking me about the usage
of IDL at PETRA III. So I was finally able to answer their call, and
indicated the problems you were facing. According to them, some of your
problems could be related to the fact that you are using IDL 7.0 - IDL
is supposed to properly support HDF5 starting from 7.1. Anyway, I guess
someone has to test whether the serious issues of HDF5 1.8 vs 1.6 in IDL
are settled in 7.1.

Anyway, I have tried to make them aware of both the mailing list and the
workshop. Maybe they can/want to send someone to listen to our needs and
problems. So Armando + Andy: somebody may approach you, but most
probably they will first contact me again.

Cheers, Gerd

--
Dr. Gerd Wellenreuther
beamline scientist P06 "Hard X-Ray Micro/Nano-Probe"
Petra III project
HASYLAB at DESY
Notkestr. 85
22603 Hamburg

Tel.: + 49 40 8998 5701

Wellenreuther, Gerd

Nov 26, 2009, 5:12:21 AM
to ma...@googlegroups.com
Hi Matt,

maybe you have already found this discussion:

Wellenreuther, Gerd

Nov 26, 2009, 5:13:20 AM
to ma...@googlegroups.com
Hi Matt,

maybe you have already found this discussion:

http://mail.hdfgroup.org/pipermail/hdf-forum_hdfgroup.org/2009-August/001732.html

Cheers, Gerd

P.S.: Sometimes, email clients are quicker than their users. Sorry.

Wellenreuther, Gerd

Nov 26, 2009, 5:58:48 AM
to ma...@googlegroups.com
Hi Matt,

V. Armando Solé wrote:
> Hi Matt,
>
> Matt Newville wrote:
>> OK, but why 'data/data'?? What is gained by following NeXus conventions?
>>
>> Thanks
>
> To make it short:
>
> Chris suggested sharing data as /data.
>
> I just suggested not to have the data themselves in the root directory.
>
> Chris then suggested having the data at /data/data, and to me that is as
> good and as bad as any other name-based convention.
>
> Further discussions with other people on the mailing list just served to
> show that a workshop was needed. You already know the rest.

At one point in the discussion Armando mentions, I tried to summarize
it - with strong emphasis on "tried". Here you can find my old post,
which explains the two different ways people would like to use HDF5:
http://groups.google.com/group/mahid/msg/798987da1443bd65?

Cheers, Gerd

Matt Newville

Nov 27, 2009, 10:19:03 AM
to ma...@googlegroups.com
Hi Gerd,

Thanks -- I have read the thread on the HDF5 group mailing list about
IDL. There are similar messages on the h5py mailing list too. The
IDL 7 vs. HDF5 1.8 issue seems to be a common problem.

I don't really understand what is happening for these failures.
Playing with this some more, I have been able to create a very simple
HDF5 file with python2.6, hdf5 1.8.4, h5py 1.2.1 on one linux box and
read it with IDL 7.0 on another linux box and IDL 6.3 on yet another
linux box. This file does not use v1.8 features and is at

http://cars9.uchicago.edu/pybeamline/DataFormats/H5UsageNotes?action=AttachFile&do=view&target=test_v18_IDL.h5

So my earlier report (and reports on the HDF group mailing list and on
the h5py mailing list) that IDL crashes with files written with HDF5
v1.8 was not completely correct -- I have a file that can be read with
IDL 7.0 and IDL 6.3. Still, I (and others) have definitely seen
crashes and ended up with corrupted files. I haven't thoroughly
tested these issues. I also haven't tested any other packages that use
HDF5 (Matlab, Mathematica, IGOR Pro, LabVIEW, Perl/PDL, R, Octave
...), many of which seem to use the v1.6 series.

At this point, I'm reluctant to place all the blame on IDL. My
conclusion from this experience is that mixing HDF5 v1.8 files and
v1.6 libraries *can* cause problems. The HDF5 pages on 'compatibility'
acknowledge as much, though it's not clear to me what they expect to
happen if a v1.6 library reads a file with features that are new to v1.8.

For a group such as this that is attempting to define an exchange file
format, it is probably best to be conservative. I think that means
using the HDF5 v1.6 API until using the HDF5 v1.8 API is demonstrated
to not have problems. I don't know if there is a way to test which
library version was used when writing a file, but that might be a
useful attribute to include.

Cheers,

--Matt Newville <newville at cars.uchicago.edu>

"V. Armando Solé"

Nov 27, 2009, 10:29:50 AM
to ma...@googlegroups.com
Matt Newville wrote:
> I don't know if there is a way to test which
> library version was used when writing a file, but that might be a
> useful attribute to include.
>
It seems to be included by default among the attributes of the root
group "/"

At least I get:

HDF5_version = 1.8.3 (string)

in the files I have generated and in the files generated at PSI the same
attribute is there (with value = 1.6.whatever).

Have you tried to read my dataset on IDL?

Armando





Matt Newville

Nov 27, 2009, 11:13:52 AM
to ma...@googlegroups.com
Hi Armando,

On Fri, Nov 27, 2009 at 9:29 AM, "V. Armando Solé" <so...@esrf.fr> wrote:
> Matt Newville wrote:
>> I don't know if there is a way to test which
>> library version was used when writing a file, but that might be a
>> useful attribute to include.
>>
> It seems to be included by default among the attributes of the root
> group "/"
>
> At least I get:
>
> HDF5_version = 1.8.3 (string)
>
> in the files I have generated and in the files generated at PSI the same
> attribute is there (with value = 1.6.whatever).

A simple file created with python and h5py does not include this
information by default; it must be set by the generating software...
That's probably necessary. It appears that the HDF5 library does
not do rigorous self-checks for backwards and forwards compatibility.
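
Writing it explicitly is a one-liner in h5py, e.g. (a sketch; the file
name is hypothetical):

    import h5py

    f = h5py.File('xrm_map.h5', 'w')
    # record the version of the HDF5 library used to write the file
    f.attrs['HDF5_version'] = h5py.version.hdf5_version
    f.close()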

> Have you tried to read my dataset on IDL?

Yes. All the files on the MAHID list of data sets can be opened with
IDL 7.0 on linux. I haven't tried to actually read all the datasets.

-Matt

"V. Armando Solé"

Nov 27, 2009, 11:22:42 AM
to ma...@googlegroups.com
Hi Matt,

Matt Newville wrote:
> A simple file created with python and h5py does not include this
> information by default; it must be set by the generating software...
> That's probably necessary. It appears that the HDF5 library does
> not do rigorous self-checks for backwards and forwards compatibility.
>
You are right. I used Darren's module to create the file, and it adds
that information, I guess to be compliant with NeXus.

HDF5 files generated at SOLEIL and at PSI also include that attribute,
so I think we can already adopt it :-)

>> Have you tried to read my dataset on IDL?
>>
>
> Yes. All the files on the MAHID list of data sets can be opened with
> IDL 7.0 on linux. I haven't tried to actually read all the datasets.
>
Great!

Armando
