Thank you for the contribution. There are many important points to
take from it.
Matthew Dougherty wrote:
> For reasons of namespace collisions and long-term stewardship, I
> recommend the following:
>
> 1) When developing a new community format in HDF, establish a unique
> dedicated HDF group under the root to contain all of the community's
> datasets, attributes & subgroups. For example, //SAXS/data/data/ as
> opposed to //data/data/; //EELS/ is more distinctive than //image/. There
> are a number of unrelated bio-imaging groups planning to use HDF from
> acquisition through archiving; integrating the data downstream into
> complex models will be problematic if there are namespace conflicts in
> attribute, dataset & group names, essentially prohibiting the
> coexistence of different scientific communities within the same HDF
> file. There is a strong parallel to internet domains: it would be
> pretty dull if everybody had the same domain name.
>
My idea goes more in the direction of extending NeXus and relying on
group types/classes and attributes rather than on absolute path names.
If I look for a group of type/class NXentry (that may or may not be
named entry), and within it a group of type SAXS (that may or may not
be called SAXS), the chances of stepping on others' definitions become
small. The approach based on group types/classes may seem cumbersome at
first, but thinking in terms of object-oriented programming, those
classes could correspond to the actual base classes used to handle the
associated datasets.
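As a sketch of what I mean (assuming the class is stored the way NeXus
does it, as an NX_class attribute on the group), the lookup is a few
lines with h5py; the file name is just an example:

    import h5py

    def find_groups_by_class(filename, nx_class):
        """Return the paths of all groups whose NX_class attribute matches."""
        matches = []
        def visit(name, obj):
            if isinstance(obj, h5py.Group):
                cls = obj.attrs.get("NX_class")
                if isinstance(cls, bytes):
                    cls = cls.decode()
                if cls == nx_class:
                    matches.append(name)
        with h5py.File(filename, "r") as f:
            f.visititems(visit)
        return matches

    # e.g. locate every NXentry, whatever the groups happen to be named:
    # find_groups_by_class("scan.h5", "NXentry")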
> 2) Attach the attribute "DataDomainDefinition" to your dedicated root
> group, such that the value is a URL to the community's format
> definition (with the version ID embedded in the URL). This will also
> reinforce ownership of the communities' group names by explicitly
> tagging them in a common manner. In the future there may be a
> registry allowing communities to assert their data domain, similar to
> ICANN but clearly with far fewer domains; this will also provide a
> means for communities planning new designs to look at existing
> designs for inspiration or adoption.
>
> 3) In the event of "archiving", the format specification document
> should be included as a dataset under the community's data domain
> group, using an internal URL for "DataDomainDefinition".
>
Nice hints. I was just considering version numbering, but if nobody
knows what the version number corresponds to ...
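In any case, tagging a group that way costs almost nothing; a minimal
h5py sketch of point 2 (the group name and URL are only illustrative):

    import h5py

    with h5py.File("example.h5", "a") as f:
        # dedicated community group under the root, as recommended in 1)
        saxs = f.require_group("/SAXS")
        # URL with the version ID embedded, as recommended in 2)
        saxs.attrs["DataDomainDefinition"] = \
            "http://www.example.org/SAXS/format-1.0"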
> As for NeXus, don't worry about it. In a past conversation I was told
> that such a re-instantiation could be done by a formal change in their
> standard without causing a lot of problems. I am under the impression
> this has effectively been done in their XML design.
>
> In the long run, I am working with the HDF Group and various
> scientific user communities to establish a formal de jure image
> definition/standard using RDF, such that communities could voluntarily
> use it within their data domains. This should not interfere with your
> current or future designs, because there is no fundamental change in
> HDF design strategy, and it will not force a change in your core data
> design. The intent here is to establish a common nomenclature for
> generic scientific multimodal-multidimensional images stored as HDF
> datasets. This will allow researchers outside your communities to
> identify where the images are within your data domains, similar to the
> Dublin Core used by libraries and archivists to provide basic
> navigation. For example, this should make it a lot easier for
> visualization developers in other communities to interact with your
> images, and also lay the groundwork for a common downstream approach
> to archiving HDF files that will meet archivists' best practices. If
> you would like to know more about this or participate in the
> discussions/development, let me know.
>
Sorry, I am a bit lost here. When you say "The intent here is to
establish common nomenclature for generic scientific
multimodal-multidimensional images stored as HDF datasets", are you
talking about this mailing list or about your work with the HDF Group?
Thanks again,
Armando
An NXDL description will be a true (not pseudo) XML file
whose structure can be validated against a schema. See below
for a draft example from the working repository. Since
the NXDL specification is not complete, expect that some
aspects of this example might change. NXDL is not intended
to change the location of information stored in existing
NeXus files, only to change (and simplify) the way the file
would be arranged for a specific instance such as an instrument
or technique.
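For illustration, validating such a file against the schema could be a
few lines of Python with lxml (a sketch; the file names are the ones
from the example below):

    from lxml import etree

    # parse the NXDL schema and a candidate definition file
    schema = etree.XMLSchema(etree.parse("nxdl.xsd"))
    doc = etree.parse("NXxrot.nxdl.xml")

    print(schema.validate(doc))     # True when the definition conforms
    for error in schema.error_log:  # details when it does not
        print(error)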
Just this next weekend, I'm hosting a small code camp for the
NeXus technical group (those who actually seem to be able
to make time to work on the NeXus code) in Evanston, Illinois, USA.
(http://www.nexusformat.org/NIAC2009)
I will make sure the NeXus group is aware of this discussion.
One item we will need to finish is a good introduction
to NXDL and the rationale behind it. The quick summary: so that groups
like this discussion group could use NXDL to define
a standard for spectromicroscopy and coherent diffraction files.
Pete
excerpt from NXDL draft specification
for raw data from a rotation camera
------------------% clip here %-----------------------
<definition name="NXxrot" extends="NXxbase" type="group"
    xmlns="http://definition.nexusformat.org/nxdl/3.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://definition.nexusformat.org/nxdl/3.1 ../nxdl.xsd">
  <doc>
    This is the application definition for raw data from a rotation camera.
    It extends NXxbase, so the full definition is the content of
    NXxbase plus the data defined here.
  </doc>
  <group type="NXentry" name="entry">
    <field name="definition">
      <doc>
        Official NeXus DTD or NXDL schema to which this file conforms
      </doc>
      <enumeration>
        <item value="NXxrot" />
      </enumeration>
    </field>
    <group type="NXinstrument" name="instrument">
      <group type="NXdetector" name="detector">
        <field name="polar_angle" type="NX_FLOAT" units="NX_ANGLE">
          <doc>The polar_angle (two theta) where the detector is placed.</doc>
        </field>
      </group>
    </group>
    <group type="NXsample" name="sample">
      <field name="rotation_angle" type="NX_FLOAT" units="NX_ANGLE">
        <doc>This is an array holding the sample rotation angle at each scan point</doc>
        <dimensions size="1">
          <dim index="1" value="np" />
        </dimensions>
      </field>
    </group>
    <group type="NXdata" name="name">
      <field name="rotation_angle" type="NX_FLOAT" units="NX_ANGLE">
        <doc>Link to data in /entry/sample/rotation_angle</doc>
        <dimensions size="1">
          <dim index="1" value="np" />
        </dimensions>
      </field>
    </group>
  </group>
</definition>
------------------% clip here %-----------------------
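For concreteness, here is a sketch of how a file following this
definition might be written with h5py. This is not part of NXDL; it
assumes the usual NX_class attributes mark the group classes, and all
values are illustrative:

    import h5py
    import numpy as np

    n_points = 10  # "np", the number of scan points

    with h5py.File("rotation_scan.h5", "w") as f:
        entry = f.create_group("entry")
        entry.attrs["NX_class"] = "NXentry"
        entry["definition"] = "NXxrot"

        inst = entry.create_group("instrument")
        inst.attrs["NX_class"] = "NXinstrument"
        det = inst.create_group("detector")
        det.attrs["NX_class"] = "NXdetector"
        det["polar_angle"] = 25.0
        det["polar_angle"].attrs["units"] = "degrees"

        sample = entry.create_group("sample")
        sample.attrs["NX_class"] = "NXsample"
        sample["rotation_angle"] = np.linspace(0.0, 180.0, n_points)
        sample["rotation_angle"].attrs["units"] = "degrees"

        data = entry.create_group("name")  # the NXdata group, named "name" in the draft
        data.attrs["NX_class"] = "NXdata"
        # hard link to the sample angles, as the <doc> element suggests
        data["rotation_angle"] = sample["rotation_angle"]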
--
----------------------------------------------------------
Pete R. Jemian, Ph.D. <jem...@anl.gov>
Beam line Controls and Data Acquisition, Group Leader
Advanced Photon Source, Argonne National Laboratory
Argonne, IL 60439        630-252-3189
-----------------------------------------------------------
Education is the one thing for which people
are willing to pay yet not receive.
-----------------------------------------------------------
Good question.
Answer: base64
http://en.wikipedia.org/wiki/Base64
and for example in Python:
http://docs.python.org/library/base64.html
For example, binary documents in my email arrive as base64-encoded attachments.
Content-Type: application/pdf;
x-mac-hide-extension=yes;
x-unix-mode=0644;
name="letter09.pdf"
Content-Transfer-Encoding: base64
Just to prove I cannot write a short reply:
Q: Who uses base64 in an XML file for scientific data?
A: GAML uses MIME base64-encoding of data values.
Development of GAML is supported by Thermo; it is used,
for example, for two-dimensional gas chromatography.
Here's a scientific instance of xsd:base64Binary from GAML
(where xmlns:xsd="http://www.w3.org/2001/XMLSchema"):
http://pubs.acs.org/doi/abs/10.1021/ac031260c
see also:
http://www.gaml.org/Documentation/XML%20Analytical%20Archive%20Format.doc
Base64 encodes every 3 bytes of binary as 4 ASCII characters, so it
inflates the size of the data by about a third (a bit more once line
breaks are added). Anyone with real experience of the practical
impact? This may be a critical difference for extremely large data
sets (time to store from acquisition, time/bandwidth to transfer). But
the compelling argument for XML (text based) rather than HDF (binary)
is that the metadata is human-readable, although buried in a lot of
XML tags. It may be acceptable to use XML files with binary data in
base64 and all content in UTF-8 or ASCII, "as long as readers and
visualizers exist."
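The overhead is easy to measure; a quick sketch in Python:

    import base64
    import os

    raw = os.urandom(3 * 10**6)        # 3 MB of arbitrary binary data
    encoded = base64.b64encode(raw)    # 4 ASCII bytes per 3 input bytes
    print(len(encoded) / len(raw))     # -> 1.333...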
Pete
Hi Armando,
My work with the HDF Group. My interest in this mailing list is to
study the discussion.
Matthew Dougherty
713-433-3849
National Center for Macromolecular Imaging
Baylor College of Medicine/Houston Texas USA
=========================================================================
=========================================================================
Hi Chris,
In this case (//image/EELS/ or //data/SAXS/), I would recommend
attaching the attribute (DataDomainDefinition) to the subgroups.
A critical question: do you gain any performance advantage by
sprawling across the root group? Also, having //image/ or //data/
implies there is an overarching organizational plan, begging the
questions: Where is it? Who is writing it? How does or can an outside
community participate? Clearly an ad-hoc //image/EELS/ is less
confusing than "everybody puts datasets into //image/".
My speculation is that having a single community group under the root
compacts the organization, providing tighter containment of community
data and less chance that another community will clobber it
accidentally or misuse it (e.g., //image/1.img).
If these HDF files are to be used exclusively by your community and
there is no chance that another scientific community (e.g. viz,
bioinformatics) will put data into the same HDF file, then it does
not matter.
Matthew Dougherty
713-433-3849
National Center for Macromolecular Imaging
Baylor College of Medicine/Houston Texas USA
=========================================================================
=========================================================================
I don't understand the motivation behind this change in emphasis
towards xml. Isn't hdf5 more efficient in terms of speed and memory
consumption as well as storage? And more flexible? hdf5 has native
support for arrays, compound datatypes, hard and soft links, variable
length arrays, etc. I don't understand how to work with xml and I
don't know of any interfaces for xml that are as capable as the hdf5
library is or as intuitive as the third-party python bindings like h5py.
I have tried at least twice to learn how to work with xml and it just
doesn't seem to be a good fit. This isn't intended to cultivate fear,
uncertainty and doubt. Maybe someone could kindly post some code
snippets.
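For comparison, the sort of thing that is straightforward with h5py (a
rough sketch, all names illustrative):

    import h5py
    import numpy as np

    with h5py.File("demo.h5", "w") as f:
        # native array support
        f["spectrum"] = np.arange(2048, dtype=np.uint32)

        # compound datatype: one record per scan point
        point = np.dtype([("x", np.float64), ("y", np.float64),
                          ("counts", np.uint32)])
        f.create_dataset("points", shape=(100,), dtype=point)

        # hard and soft links
        f["raw"] = f["spectrum"]                  # hard link
        f["latest"] = h5py.SoftLink("/spectrum")  # soft link

        # variable-length arrays: one ragged row of events per scan point
        vlen = h5py.special_dtype(vlen=np.uint16)
        f.create_dataset("events", shape=(10,), dtype=vlen)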
Darren
Mark Rivers wrote:
"""
I don't think anyone is seriously suggesting using XML as the primary
file type for experimental data.
"""
Right. At the ESRF we are not considering it at all :-)
"""
netCDF does support export to an ASCII file format, which can then be
read back into the binary netCDF format with no loss of information.
Having such a capability for HDF and using XML as the ASCII format could
be very useful. With the file in ASCII it is easy to view, and easy to
fix mistakes in header info, etc.
"""
Well, a proper HDF5 editor should be even better. SPEC saves in ASCII
format and, despite its simplicity, the files are so easy to edit that
sometimes the modified files are not readable by anybody other than
the person who edited them. I agree with you that, in general, it is
simpler to fix mistakes, but it is not so simple to ADD information
that was forgotten or missing at the moment the recording took place,
while that is a given with HDF5. I guess the versatility compensates
for the (not so great anyway) readability of XML.
"""
I just realized my file size was low by 2 orders of magnitude, because this is a 96-element detector, and we need to store 96 separate spectra at each pixel. The total file size for a 4Kx4K scan would be 1.3TB if it were stored in spectra format, rather than list mode. That's assuming 2048 channel spectra and 4 bytes/channel, which is probably overkill, but not by much.
"""
Isn't that the same way of storing data that the ion beam analysis
people are using? It is very efficient when you have few counts and,
in addition, you keep the time information. Do you think it is still
applicable at synchrotrons? They move the beam and record a few events
per "step", while we move the sample and record the full spectrum
(0.2-1 seconds). I must say that at some beamlines the number of
counts is very low, touching only a few channels, so I would expect a
big gain there, but at others there are counts across basically the
whole spectrum. I am very curious, though. Have you already tried?
Armando
That means that in the typical situation, for a 2K spectrum, at most
one fourth of the channels will contain counts. On the other hand, one
needs to store 8 bytes (channel + count) instead of 4 bytes (count).
It is quite interesting.
Armando
So, just a quick check: for scans in the 0.5 seconds per pixel region
we are at (guesstimate) 0.5 s * 1.0E+05 cps = 50000 photons/detector.
Therefore the spectral mode is convenient for us, because we will be
in the 5000-50000 integrated counts per detector range.
Sorry!
Armando
Ok, understood: one stores the channel number repeated as many times
as there are counts in that channel.
Armando
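A back-of-the-envelope sketch of that break-even point, assuming a
2048-channel spectrum with 4-byte counts versus list mode storing one
2-byte channel number per photon:

    CHANNELS = 2048
    BYTES_PER_BIN = 4    # spectral mode: one count per channel
    BYTES_PER_EVENT = 2  # list mode: a uint16 channel number per photon

    spectral_size = CHANNELS * BYTES_PER_BIN       # 8192 bytes, whatever the counts
    break_even = spectral_size // BYTES_PER_EVENT  # 4096 photons per spectrum

    photons = int(0.5 * 1.0e5)             # 0.5 s at 1.0E+05 cps = 50000 photons
    list_size = photons * BYTES_PER_EVENT  # 100000 bytes
    print(list_size / spectral_size)       # ~12x larger: spectral mode wins here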