HDF best practices


Matthew Dougherty

Oct 13, 2009, 5:14:25 AM
to Methods for the analysis of hyperspectral image data, bio...@fsl.cs.sunysb.edu
To avoid namespace collisions and to support long-term stewardship, I
recommend the following:

1) When developing a new community format in HDF, establish a unique,
dedicated HDF group under the root to contain all of the community's
datasets, attributes & subgroups. For example, //SAXS/data/data/ as
opposed to //data/data/; //EELS/ is more distinctive than //image/. A
number of unrelated bio-imaging groups are planning to use HDF from
acquisition through archiving; integrating the data downstream into
complex models will be problematic if there are namespace conflicts in
attribute, dataset & group names, essentially prohibiting the
co-existence of different scientific communities within the same HDF
file. There is a strong parallel to internet domains: it would be
pretty dull if everybody had the same domain name.

2) Attach the attribute "DataDomainDefinition" to your dedicated root
group, such that its value is a URL to the community's format
definition (with the version ID embedded in the URL). This will also
reinforce ownership of the communities' group names by explicitly
tagging them in a common manner. In the future there may be a
registry allowing communities to assert their data domains, similar
to ICANN but with far fewer domains; this would also give communities
planning new designs a way to look at existing designs for
inspiration or adoption.

3) In the event of "archiving", the format specification document
should be included as a dataset under the community's data domain
group, with "DataDomainDefinition" then using an internal URL that
points to that embedded copy.
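
For concreteness, a minimal h5py sketch of 1)-3); the group name, URL
and contents below are hypothetical, not part of any agreed definition:

import h5py

with h5py.File("example_saxs.h5", "w") as f:
    # 1) one dedicated community group directly under the root
    saxs = f.create_group("SAXS")
    saxs.create_group("data").create_dataset("data", data=[[0.0, 1.0], [2.0, 3.0]])

    # 2) tag the community group with a versioned pointer to its definition
    saxs.attrs["DataDomainDefinition"] = "http://example.org/saxs/definition/v1.0"

    # 3) for archiving, embed the specification itself and point the
    #    attribute at the internal copy instead of the external URL
    saxs["format_specification"] = "...full text of the format definition..."
    # saxs.attrs["DataDomainDefinition"] = "/SAXS/format_specification"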

As for NeXus, don't worry about it. In a past conversation I was told
that such a re-instantiation could be done by a formal change in their
standard without causing a lot of problems. I am under the impression
this has effectively been done under their XML design.



In the long run, I am working with the HDF Group and various
scientific user communities to establish a formal de jure image
definition/standard using RDF, such that communities could voluntarily
use it within their data domains. This should not interfere with your
current or future designs, because it involves no fundamental change
in HDF design strategy and will not force a change in your core data
design. The intent here is to establish common nomenclature for
generic scientific multimodal-multidimensional images stored as HDF
datasets. This will allow researchers outside your communities to
identify where the images are within your data domains, much as
Dublin Core is used by libraries and archivists to provide basic
navigation. For example, this should make it a lot easier for
visualization developers in other communities to interact with your
images, and it also lays the groundwork for a common downstream
approach to archiving HDF files that will meet archivists' best
practices. If you would like to know more about this or participate
in the discussions/development, let me know.

On a related note, HDF5 is about to complete ISO standardization in
support of the engineering community. This should help in developing
common long-term strategy & design through voluntary consensus.

Matthew Dougherty
National Center for Macromolecular Imaging

"V. Armando Solé"

Oct 13, 2009, 6:14:35 AM
to ma...@googlegroups.com
Hi Matthew,

Thank you for the contribution. There are many important things to be
taken from it.

Matthew Dougherty wrote:
> For reasons of namespace collisions and long-term stewardship I
> recommend the following:
>
> 1) When developing a new community format in HDF, establish a unique
> dedicated HDF group under the root so as to contain all community's
> datasets, attributes & subgroups. For example //SAXS/data/data/ as
> opposed to //data/data/; //EELS/ is more unique than //image/. There
> are a number of unrelated bio-imaging groups planning to use HDF from
> acquisition through archiving; integrating the data downstream into
> complex models will be problematic if there are namespace conflicts in
> attribute, dataset, & group names, essentially prohibiting the co-
> existence of different scientific communities within the same HDF
> file. There is a strong parallel to internet domains, it would be
> pretty dull if everybody had the same domain name.
>

My idea goes more in the direction of extending NeXus and relying more
on group types/classes and attributes than on absolute path names.

If I look for a group of type/class NXentry (that may or may not be
named entry), and a group of type SAXS (that may or may not be called
SAXS), the chances of stepping on others' definitions become small. The
approach based on group types/classes may seem cumbersome at first, but
thinking in terms of object-oriented programming, those classes could
correspond to the actual base classes used to handle the associated
datasets.


> 2) Attach the attribute "DataDomainDefinition" to your dedicated root
> group, such that the value is a URL to the community's format
> definition (version ID embedded in URL). This will also reinforce
> ownership of the communities' group names by explicitly tagging them
> in a common manner. In the future there may be a registry, allowing
> for communities to assert their data domain, similar to ICANN but
> clearly not as many domains; this will also provide a means for
> different communities planning new designs to look at existing designs
> for inspiration or adoption.
>
> 3) In the event of "archiving", the format specification document
> should be included as a dataset under the community's data domain
> group, using an internal URL for "DataDomainDefinition".
>
>

Nice hints. I was just considering version numbering, but if nobody
knows what the version number corresponds to ...


> As for Nexus, don't worry about it. In a past conversation I was told
> that such a re-instantiation could be done by a formal change in their
> standard without causing a lot of problems. I am under the impression
> this has effectively been done under their XML design.
>
>
>
> In the long run, I am working with the HDF Group and various
> scientific user communities to establish a formal de-jure image
> definition/standard using RDF, such that communities could voluntarily
> use it within their data domains. This should not interfere with your
> current or future designs, because there is no fundamental change in
> HDF design strategy, and will not force a change in your core data
> design. The intent here is to establish common nomenclature for
> generic scientific multimodal-multidimensional images stored as HDF
> datasets. This will allow researchers outside your communities to
> identify where the images are within your data domains, similar to
> Dublin Core used by libraries and archivists to provide basic
> navigation. For example, this should make it a lot easier for
> visualization developers in other communities to interact with your
> images, and also lay the groundwork for a common downstream approach
> to archiving HDF files that will meet archivist's best practices. If
> you would like to know more about this or participate in the
> discussions/development, let me know.
>

Sorry, I am a bit lost here. When you say "The intent here is to
establish common nomenclature for generic scientific
multimodal-multidimensional images stored as HDF datasets", are you
talking about this mailing list or about your work with the HDF Group?

Thanks again,

Armando


ambergino

Oct 13, 2009, 8:02:59 AM
to Methods for the analysis of hyperspectral image data


On Oct 13, 5:14 am, Matthew Dougherty <matth...@bcm.edu> wrote:

> 1) When developing a new community format in HDF, establish a unique
> dedicated HDF group under the root so as to contain all community's
> datasets, attributes & subgroups.  For example //SAXS/data/data/ as
> opposed to //data/data/; //EELS/ is more unique than //image/.

I would argue instead that what we want is //image/EELS/ or //image/SAXS/.

That is, start from the most generic and work one's way downward.
That way a general image analysis program can be given an HDF5 file
and find the image regardless of the modality it was taken in.
Information specific to the method can then live in the EELS or SAXS
etc. subgroups.
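
(A hypothetical sketch of what that buys a generic viewer, assuming
h5py and a top-level /image group as above:)

import h5py

def list_image_modalities(filename):
    """Rough sketch: open /image and discover whatever modality
    subgroups (EELS, SAXS, ...) happen to be present, with no prior
    knowledge of the technique that produced the file."""
    with h5py.File(filename, "r") as f:
        if "image" not in f:
            return {}
        return {name: sorted(member.keys())      # modality -> its contents
                for name, member in f["image"].items()
                if isinstance(member, h5py.Group)}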

CJ

Pete R. Jemian

Oct 13, 2009, 10:10:51 AM
to ma...@googlegroups.com

NeXus (behind the scenes, beyond what you can find on the
main web page) is preparing the NeXus Definition Language
(NXDL). The intent of NXDL is to provide an easier, rules-based
method for defining a NeXus data file that is specific either
to an instrument (where NeXus has been for years) or to an area
of scientific technique or analysis. Think of the relevance to
small-angle scattering, with both SAXS and SANS data represented.
What's new, you ask? One of the efforts that motivated NeXus in
this direction was the canSAS format for storing reduced
small-angle scattering data in an XML file. See:
http://www.smallangles.net/wgwiki/index.php/cansas1d_documentation

An NXDL description will be a true (not pseudo-) XML file
whose structure can be validated by a schema. See below
for a draft example from the working repository. Since
the NXDL specification is not complete, expect that some
aspects of this example might change. NXDL is not intended
to change the location of information stored in existing
NeXus files, only to change (and simplify) the way the file
would be arranged for a specific instance such as an
instrument or technique.

Just this next weekend, I'm hosting a small code camp for the
NeXus technical group (those who actually seem to be able
to make time to work on the NeXus code) in Evanston, Illinois, USA.
(http://www.nexusformat.org/NIAC2009)

I will make sure the NeXus group is aware of this discussion.
One item we will need to finish is a good introduction
to NXDL and the rationale for it. The quick summary: groups
like this discussion group could use NXDL to define a standard
for spectromicroscopy and coherent diffraction files.

Pete


excerpt from NXDL draft specification
for raw data from a rotation camera
------------------% clip here %-----------------------
<definition name="NXxrot" extends="NXxbase" type="group"
xmlns="http://definition.nexusformat.org/nxdl/3.1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://definition.nexusformat.org/nxdl/3.1 ../nxdl.xsd ">
<doc>This is the application definition for raw data from a rotation camera.
It extends NXxbase, so the full definition is the content of
NXxbase plus the data defined here.</doc>
<group type="NXentry" name="entry">
<field name="definition">
<doc>
Official NeXus DTD or NXDL schema to which this file conforms
</doc>
<enumeration>
<item value="NXxrot"></item>
</enumeration>
</field>
<group type="NXinstrument" name="instrument">
<group type="NXdetector" name="detector">
<field name="polar_angle" type="NX_FLOAT" units="NX_ANGLE">
<doc>The polar_angle (two theta) where the detector is placed.</doc></field></group>
</group>
<group type="NXsample" name="sample">
<field name="rotation_angle" type="NX_FLOAT" units="NX_ANGLE">
<doc>This is an array holding the sample rotation angle at each scan point</doc>
<dimensions size="1">
<dim index="1" value="np" /></dimensions></field>
</group>
<group type="NXdata" name="name">
<field name="rotation_angle" type="NX_FLOAT" units="NX_ANGLE">
<doc>Link to data in /entry/sample/rotation_angle</doc>
<dimensions size="1">
<dim index="1" value="np" /></dimensions></field>
</group>
</group>
</definition>
------------------% clip here %-----------------------

--
----------------------------------------------------------
Pete R. Jemian, Ph.D. <jem...@anl.gov>
Beam line Controls and Data Acquisition, Group Leader
Advanced Photon Source, Argonne National Laboratory
Argonne, IL 60439 630 - 252 - 3189
-----------------------------------------------------------
Education is the one thing for which people
are willing to pay yet not receive.
-----------------------------------------------------------

Chris Jacobsen

Oct 13, 2009, 10:51:43 AM
to Methods for the analysis of hyperspectral image data


On Oct 13, 10:10 am, "Pete R. Jemian" <jem...@anl.gov> wrote:
> NeXus (behind the scenes from what you can find on the
> main web page) is preparing the NeXus Definition Language
> (NXDL).  
> An NXDL description will be a true (not pseudo) XML file
> which structure can be validated by a schema.

Let me ask a really ignorant question: can XML hold binary data, as
opposed to data stored as ASCII representation of numbers? The only
XML files I've seen thus far use ASCII representation, which is of
course unacceptable for big data arrays.

Since light and electron microscopists are also using HDF5, I see a
lot more advantages to HDF5 storage than to XML storage.

CJ

Pete R. Jemian

Oct 13, 2009, 11:44:05 AM
to ma...@googlegroups.com

Chris Jacobsen wrote:
> Let me ask a really ignorant question: can XML hold binary data, as
> opposed to data stored as ASCII representation of numbers?

Good question.
Answer: base64
http://en.wikipedia.org/wiki/Base64
and for example in Python:
http://docs.python.org/library/base64.html

For example, binary documents in my email arrive as base64-encoded attachments.
Content-Type: application/pdf;
x-mac-hide-extension=yes;
x-unix-mode=0644;
name="letter09.pdf"
Content-Transfer-Encoding: base64

Just to prove I cannot write a short reply:

Q: Who uses base64 in an XML file for scientific data?
A: GAML uses MIME base64-encoding of data values.
Development of GAML is supported by Thermo, such as
for two-dimensional gas chromatography.
Here's a scientific instance of xsd:base64Binary from GAML
(where xmlns:xsd="http://www.w3.org/2001/XMLSchema"):
http://pubs.acs.org/doi/abs/10.1021/ac031260c

see also:
http://www.gaml.org/Documentation/XML%20Analytical%20Archive%20Format.doc


By my guess, base64 may inflate the size of binary by about
a factor of 2. Anyone with real experience? This may be a
critical difference for extremely large data sets (time to
store from acquisition, time/bandwidth to transfer). But
the compelling argument for XML (text based) rather than
HDF (binary) is that the metadata is human-readable,
although buried in a lot of XML tags. It may be acceptable
to use XML files with binary data in base64 and all content
in UTF-8 or ASCII, "as long as readers and visualizers exist."
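
For what it's worth, a quick Python sketch of the round trip; base64
maps every 3 bytes to 4 characters, so the text is about 4/3 the size
of the raw binary before any compression:

import base64
import struct

raw = struct.pack("<1024f", *range(1024))   # 4 KiB of little-endian floats
encoded = base64.b64encode(raw)             # the text that would sit inside an XML element
decoded = base64.b64decode(encoded)

assert decoded == raw                       # lossless round trip
print(len(raw), len(encoded))               # 4096 bytes -> 5464 characters (~1.33x)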

Pete

Chris Jacobsen

Oct 13, 2009, 11:48:42 AM
to Methods for the analysis of hyperspectral image data

> By my guess, base64 may inflate the size of binary by about
> a factor of 2 Anyone with real experience?  This may be a
> critical difference for extremely large data sets (time to
> store from acquisition, time/bandwidth to transfer).  But
> the compelling argument for XML (text based) rather than
> HDF (binary) is that the metadata is human-readable,
> although buried in a lot of XML tags.  It may be acceptable
> to use XML files with binary data in base64 and all content
> in UTF-8 or ASCII, "as long as readers and visualizers exist."

But you can always get a human-readable view of the content of any
HDF5 file with "h5dump". Therefore I vote for efficiency.
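
(And for Python users, a rough h5py equivalent of that kind of listing:)

import h5py

def dump_structure(filename):
    """Rough analogue of an h5dump-style listing: print each group and
    dataset in the file, along with its attributes."""
    def visitor(name, obj):
        if isinstance(obj, h5py.Dataset):
            print("Dataset /%s  shape=%s dtype=%s" % (name, obj.shape, obj.dtype))
        else:
            print("Group   /" + name)
        for key, value in obj.attrs.items():
            print("    @%s = %r" % (key, value))

    with h5py.File(filename, "r") as f:
        f.visititems(visitor)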

Dougherty, Matthew T.

Oct 13, 2009, 1:57:42 PM
to ma...@googlegroups.com

Hi Armando,

My work.  My interest in this mailing list is to study the discussion.


Matthew Dougherty
713-433-3849


National Center for Macromolecular Imaging

Baylor College of Medicine/Houston Texas USA
=========================================================================
=========================================================================

Dougherty, Matthew T.

Oct 13, 2009, 2:44:59 PM
to ma...@googlegroups.com

Hi Chris,

In this case (//image/EELS/ or //data/SAXS/), I would recommend attaching the attribute (DataDomainDefinition) to the subgroups.

A critical question is whether you get any performance advantage from sprawling across the root group. Also, having //image/ or //data/ implies there is an overarching organizational plan, which begs the questions: Where is it? Who is writing it? How does or can an outside community participate? Clearly an ad hoc //image/EELS/ is less confusing than "everybody puts datasets into //image/".

My speculation is that having a single community group under the root compacts the organization, providing tighter containment of community data and less chance that another community will accidentally clobber it or misuse it (e.g., //image/1.img).

If these HDF files are to be used exclusively by your community and there is no chance that another scientific community (e.g. visualization, bioinformatics) will put data into the same HDF file, then it does not matter.






Matthew Dougherty
713-433-3849


National Center for Macromolecular Imaging

Baylor College of Medicine/Houston Texas USA
=========================================================================
=========================================================================




Darren Dale

Oct 13, 2009, 6:46:58 PM
to ma...@googlegroups.com

I don't understand the motivation behind this change in emphasis
towards xml. Isn't hdf5 more efficient in terms of speed and memory
consumption as well as storage? And more flexible? hdf5 has native
support for arrays, compound datatypes, hard and soft links, variable
length arrays, etc. I don't understand how to work with xml and I
don't know of any interfaces for xml that are as capable as the hdf5
library is or as intuitive as the 3rd party python bindings like h5py.

I have tried at least twice to learn how to work with xml and it just
doesn't seem to be a good fit. This isn't intended to cultivate fear,
uncertainty and doubt. Maybe someone could kindly post some code
snippets.
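
(For concreteness, a rough side-by-side sketch of the two approaches,
standard-library XML with base64 versus h5py, with all names purely
illustrative:)

import base64
import xml.etree.ElementTree as ET

import h5py
import numpy as np

data = np.arange(12, dtype=np.float32).reshape(3, 4)

# HDF5: the array is stored natively, with shape and dtype preserved.
with h5py.File("snippet.h5", "w") as f:
    f.create_dataset("data", data=data)

# XML: the same array has to be flattened and text-encoded (base64 here),
# with shape and dtype carried as attributes by convention.
root = ET.Element("data", shape="3 4", dtype="float32")
root.text = base64.b64encode(data.tobytes()).decode("ascii")
print(ET.tostring(root).decode("ascii"))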

Darren

Garth Williams

Oct 13, 2009, 6:57:58 PM
to ma...@googlegroups.com
I second this concern. I haven't had a serious look at XML in several
years, but at the time the parsing was really, really slow. Granted,
that's compared to extracting a known array element of an HDF in C,
but I needed to perform that task many times and it did add up to an
unacceptable delay.

garth

Mark Rivers

Oct 13, 2009, 8:05:05 PM
to ma...@googlegroups.com
I don't think anyone is seriously suggesting using XML as the primary file type for experimental data.

netCDF does support export to an ASCII file format, which can then be read back into the binary netCDF format with no loss of information. Having such a capability for HDF and using XML as the ASCII format could be very useful. With the file in ASCII it is easy to view, and easy to fix mistakes in header info, etc.

It is not only the file size that is larger with XML; it will also be MUCH slower to read and write. We are potentially talking about HUGE files for this application. I was just at the Australian Synchrotron collecting hard x-ray spectroscopy data. They are routinely collecting 4096x4096 x-ray fluorescence data sets. They are storing the data in list mode, i.e. each photon is recorded separately in the file with information that indicates its X, Y position and time above threshold for pileup rejection. That is more efficient than storing spectra when there are fewer than about 2000 photons per pixel. If that amount of data were stored as a 2048-channel spectrum at each pixel it would be 137 GB for one scan. The list-mode files were maybe 4-10 times smaller than this, but still, ASCII is not an option!
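
(For reference, the arithmetic behind that spectra-mode estimate:)

# Back-of-the-envelope check, single detector element:
pixels = 4096 * 4096          # scan size
channels = 2048               # spectrum length per pixel
bytes_per_channel = 4
print(pixels * channels * bytes_per_channel / 1e9)   # ~137 GB per scan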

Mark



Mark Rivers

Oct 13, 2009, 8:06:44 PM
to ma...@googlegroups.com
I just realized my file size was low by two orders of magnitude, because this is a 96-element detector, and we need to store 96 separate spectra at each pixel. The total file size for a 4Kx4K scan would be about 13 TB if it were stored in spectra format rather than list mode. That's assuming 2048-channel spectra and 4 bytes/channel, which is probably overkill, but not by much.

Mark



"V. Armando Solé"

Oct 14, 2009, 2:19:56 AM
to ma...@googlegroups.com
Hi Mark,

Mark Rivers wrote:

"""
I don't think anyone is seriously suggesting using XML as the primary
file type for experimental data.
"""

Right. At the ESRF we are not considering it at all :-)

"""
netCDF does support export to an ASCII file format, which can then be
read back into the binary netCDF format with no loss of information.
Having such a capability for HDF and using XML as the ASCII format could
be very useful. With the file in ASCII it is easy to view, and easy to
fix mistakes in header info, etc.
"""

Well, a proper HDF5 editor should be even better. SPEC saves in
ASCII format and, despite its simplicity, the files are so easy to edit
that sometimes the modified files are not readable by anybody other
than the person who edited them. I agree with you that, in general, it
is simpler to fix mistakes, but it is not so simple to ADD information
that was forgotten or missing at the moment the recording took place,
while that is a given with HDF5. I guess the versatility compensates
for the (not so great, anyway) readability of XML.

"""
I just realized my file size was low by two orders of magnitude, because this is a 96-element detector, and we need to store 96 separate spectra at each pixel. The total file size for a 4Kx4K scan would be about 13 TB if it were stored in spectra format rather than list mode. That's assuming 2048-channel spectra and 4 bytes/channel, which is probably overkill, but not by much.
"""


Isn't that the same way of storing data that the ion beam analysis
people are using? It is very efficient when you have few counts and,
in addition, you have the time information. Do you think it is still
applicable at synchrotrons? They move the beam and record few events
per "step", while we move the sample and record the full spectrum
(0.2-1 seconds). I must say at some beamlines the number of counts is
very low and touches only a few channels, but at others I would not
expect much gain because basically there are counts across the whole
spectrum; still, I am very curious. Have you already tried?

Armando

Mark Rivers

Oct 14, 2009, 8:30:09 AM
to ma...@googlegroups.com
> They move the beam and record few events per "step", while
> we move the sample and record the full spectrum (0.2-1 seconds). I must
> say at some beamlines the amount of counts is very low and touching few
> channels, but at others I would expect much gain because basically are
> counts in the whole spectrum but I am very curious. Have you already tried?

Yes, we've tried it, because the Maia detector that the BNL/CSIRO collaboration has built produces such data. The detector we used was a 96-element prototype; the next generation will be 384 elements. The sample is moved continuously, and data is collected as X is scanned in both directions (i.e. no flyback overhead). The data stream contains a 32-bit word for each photon event that encodes the detector number, photon energy, and time over threshold for pileup processing. There are also pixel address events, which encode the X and Y positions. But those address events only occur when a pixel boundary is crossed, so there are not many of them relative to photon events. They write this list-mode data in 100MB chunks, so that the individual files are not too large.
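
(To make the idea concrete, a purely illustrative sketch of unpacking
such a stream; the field widths below are invented for the example and
are not the actual Maia encoding:)

import numpy as np

def decode_events(words):
    """Split hypothetical 32-bit event words into their fields."""
    words = np.asarray(words, dtype=np.uint32)
    detector = words & 0x7F                        # illustrative bits 0-6
    energy = (words >> 7) & 0xFFF                  # illustrative bits 7-18
    time_over_threshold = (words >> 19) & 0x1FFF   # illustrative bits 19-31
    return detector, energy, time_over_threshold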

The only end-user software that can process these files right now is Chris Ryan's GeoPIXE program. It reads the event files and directly produces quantitative concentration maps using his Dynamic Analysis matrix technique. The data are never binned into spectra (except for a few summed spectra to get the fit parameters right initially), and there is no non-linear fitting of spectra. The quantification uses matrix arithmetic, so it is very fast.

In terms of this group, getting GeoPIXE to use the proposed HDF format for its output, and also to read our raw data as its input, would be a very worthwhile goal. I don't know if it makes sense to use HDF for the event-mode raw data files.

Mark












Mark Rivers

Oct 14, 2009, 9:23:48 AM
to ma...@googlegroups.com
A little more information about our experience with the prototype detector.

We typically ran with the event data stream collecting 2MB/sec. Since each x-ray event is 4 bytes, that was about 500,000 counts/sec total, over 96 detectors, or about 5,000 events/sec/detector. The MAXIMUM dwell time of this prototype detector (because it controlled the stages directly) was 16.6msec. Thus, even at the maximum dwell time there were only about 83 counts per pixel per detector. Under these conditions it is much more efficient to use list mode. Only when the number of photons per detector per pixel approaches 2000 would spectral mode be more efficient.
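
(The same numbers as a quick check:)

events_per_sec = 2e6 / 4          # 2 MB/s at 4 bytes/event -> 500,000 events/s total
per_detector = events_per_sec / 96
max_dwell = 16.6e-3               # seconds
print(per_detector * max_dwell)   # ~86 counts/pixel/detector (~83 if rounding to 5,000 events/s first)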

The next generation of the Maia will have on-board histogramming, so it will be possible to read spectra from the system if and when that is desired.

Mark



"V. Armando Solé"

Oct 14, 2009, 9:44:04 AM
to ma...@googlegroups.com
Mark Rivers wrote:
> A little more information about our experience with the prototype detector.
>
> We typically ran with the event data stream collecting 2MB/sec. Since each x-ray event is 4 bytes, that was about 500,000 counts/sec total, over 96 detectors, or about 5,000 events/sec/detector. The MAXIMUM dwell time of this prototype detector (because it controlled the stages directly) was 16.6msec. Thus, even at the maximum dwell time there were only about 83 counts per pixel per detector. Under these conditions it is much more efficient to use list mode. Only when the number of photons per detector per pixel approaches 2000 would spectral mode be more efficient.
>
So, just a quick check, for scans in the 0.5 seconds per pixel region we
are at (guesstimate): 0.5 s * 1.0E+05 = 500 photons/detector

That means the typical situation, for a 2K spectrum, is that (at most)
only one fourth of the channels will contain counts. On the other hand,
one needs to store 8 bytes (channel + counts) instead of 4 bytes
(counts). It is quite interesting.

Armando

"V. Armando Solé"

Oct 14, 2009, 9:51:30 AM
to ma...@googlegroups.com

Sorry, I made the check wrong :-)

So, just a quick check, for scans in the 0.5 seconds per pixel region we
are at (guesstimate): 0.5 s * 1.0E+05 cps = 50,000 photons/detector

Therefore spectral mode is convenient for us, because we'll be in the
5,000-50,000 integrated counts per detector range.

Sorry!

Armando

Mark Rivers

Oct 14, 2009, 10:35:41 AM
to ma...@googlegroups.com
> On the other hand, one needs to store 8 bytes (channel + counts)
> instead of 4 bytes (counts).

No, I don't think that is true. For list-mode one does not need to store 8 bytes. You don't store counts, you just store channel (=energy), which is actually only 2 bytes at most. The Maia uses 4 bytes rather than 2 because it also records time over threshold, which is used for pileup rejection during post-processing. But it could be changed to do the pileup rejection on-the-fly, and not record the piled-up events, so then 2 bytes per photon event would be enough.
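
(A small numpy sketch of the relationship: list mode keeps one channel
number per photon, and the spectrum is just the histogram of those
numbers.)

import numpy as np

events = np.array([5, 5, 7, 1023, 5], dtype=np.uint16)   # toy channel-number stream
spectrum = np.bincount(events, minlength=2048)            # counts per channel
assert spectrum[5] == 3 and spectrum[7] == 1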

Mark




"V. Armando Solé"

Oct 14, 2009, 10:46:19 AM
to ma...@googlegroups.com
Mark Rivers wrote:
>> On the other hand, one needs to store 8 bytes (channel + counts)
>> instead of 4 bytes (counts).
>
> No, I don't think that is true. For list-mode one does not need to store 8 bytes. You don't store counts, you just store channel (=energy), which is actually only 2 bytes at most. The Maia uses 4 bytes rather than 2 because it also records time over threshold, which is used for pileup rejection during post-processing. But it could be changed to do the pileup rejection on-the-fly, and not record the piled-up events, so then 2 bytes per photon event would be enough.
>

OK, understood: one stores the channel number as many times as there
are counts in that channel.

Armando

Chris Ryan

Oct 14, 2009, 9:56:31 PM
to Methods for the analysis of hyperspectral image data
Hi Armando,

On Oct 14, 5:19 pm, "V. Armando Solé" <s...@esrf.fr> wrote:

>
> Isn't that the same way of storing data the Ion Beam analysis people are
> using? It is very efficient when you have few counts and, in addition,
> you have the time information. Do you think it is still applicable at
> synchrotrons? They move the beam and record few events per "step", while
> we move the sample and record the full spectrum (0.2-1 seconds). I must
> say at some beamlines the amount of counts is very low and touching few
> channels, but at others I would expect much gain because basically are
> counts in the whole spectrum but I am very curious. Have you already tried?
>
> Armando

Yes, that's correct. It's been used in IBA and PIXE imaging since the
mid-70s, and in nuclear physics for a lot longer. At small "dwell" (or
transit) times per pixel it is quite efficient, as Mark has been
saying. For long dwell times, as you say, the spectra become more
populated and storing spectra can become more efficient.

A hybrid of channel & multiplicity entries, as you were alluding to,
is not a true list-mode file but can work similarly, and it is the
format I use for translating data-cube formats (e.g. APS MDA) into a
quasi-list-mode that GeoPIXE can process. For typical APS data sets
(~1 s dwell, 3-13 detectors) the result is a similar or slightly
larger file size. So even then a (quasi-)list-mode file is not bad.
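
(A toy numpy sketch of that channel & multiplicity idea, for
illustration only:)

import numpy as np

spectrum = np.zeros(2048, dtype=np.uint32)        # one pixel's sparse spectrum
spectrum[[5, 7, 300]] = [3, 1, 12]
channels = np.nonzero(spectrum)[0]
pairs = np.column_stack((channels, spectrum[channels]))   # (channel, multiplicity) rows
# 'pairs' approaches true list mode in size when spectra are this sparse.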

But as we go towards large detector arrays (96, 384), large conversion
ranges (e.g. 4K per detector) and large pixel counts (we have
demonstrated up to 102M pixels), which means smaller times per pixel,
list mode becomes a good lossless format. Our 96-detector prototype
can handle up to about 5M c/s (although with better spectrum quality
at ~1M c/s, as Mark was saying), which is a good match to disk
bandwidth. For larger arrays (e.g. the 384 nearing completion) we'll
need to filter the data in some way. Currently we either use the DA
method only and discard the raw data [not my favorite] or "throttle"
the data rates from intense peaks in the spectrum (pre-sample events
in intense peaks), which reduces total disk rates by a factor of
~2-10.

The other benefit (from my point of view, handling data input from 22
labs now in GeoPIXE) is that list mode is quite simple. You can
translate a more complex format into list mode (or the quasi-list-mode
mentioned above, either as a real file translation or on the fly) and
then process it using a common software approach.

Cheers, Chris.