Handling stacked (image) data

"V. Armando Solé"

Nov 4, 2009, 8:34:26 AM
to ma...@googlegroups.com
Dear colleagues,

Quite frequently we have to deal with stacks of data of the same type.
Since I am using Python and C, I advocate storing stacks of images as
(n_images, n_rows, n_columns), with an HDF5 chunksize of at least
(1, n_rows, n_columns). Analogously, when I store stacks of 1D data
(~spectra), I leave the last dimension as the one corresponding to the
spectrum length.

Anyway, at this moment I am mainly concerned with the stack of images.
I guess (this is really a guess!) that for HDF5 itself it does not
matter, because the images are going to be stored one by one whether my
chunksize is (1, n_rows, n_columns) or (n_rows, n_columns, 1). However,
I would expect huge performance differences when mapping the file
contents to memory.
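
For concreteness, a minimal h5py sketch of the layout described above;
the file name, dataset name, and sizes are illustrative only:

    import h5py
    import numpy as np

    n_images, n_rows, n_columns = 100, 512, 512  # hypothetical sizes

    with h5py.File('stack.h5', 'w') as f:
        # One chunk per image: reading image i touches exactly one chunk.
        dset = f.create_dataset('imagestack',
                                shape=(n_images, n_rows, n_columns),
                                dtype='float32',
                                chunks=(1, n_rows, n_columns))
        for i in range(n_images):
            dset[i] = np.random.rand(n_rows, n_columns)  # write image i in one shot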

My questions:

- Is there already a convention in HDF5 itself that could tell us which
index is used as the image counter?

- Can one safely retrieve and use the chunksize information itself to
deduce the image counter?

- Do we need a dataset attribute to specify the index containing the
image counter? Is there already a convention?

Any hint is welcome.

Armando

Mark Rivers

Nov 4, 2009, 8:59:56 AM
to ma...@googlegroups.com
I have a related question. In netCDF there is the concept of the "unlimited" or "record" dimension.

This is from the netCDF documentation:

"A netCDF classic or 64-bit offset dataset can have at most one unlimited dimension, but need not have any. If a variable has an unlimited dimension, that dimension must be the most significant (slowest changing) one. Thus any unlimited dimension must be the first dimension in a CDL shape and the first dimension in corresponding C array declarations."

This is a very useful concept, because one does not need to declare the size of the arrays in that dimension in advance; one can just keep appending information in the unlimited dimension. That is what I use for streaming video in the areaDetector module, where the n_images dimension is unlimited.

Does HDF have a similar concept?

Mark


Francesc Alted

Nov 4, 2009, 9:22:07 AM
to ma...@googlegroups.com
On Wednesday 04 November 2009 14:34:26, V. Armando Solé wrote:

> Dear colleagues,
>
> Quite frequently we have to deal with stacks of data of the same type.
> Since I am using python and C, I advocate to store stacks of images as
> (n_images, n_rows, n_columns) with an HDF5 chunksize of at least of (1,
> n_rows, n_columns). Analogously, when I store stacks of 1D data
> (~spectra), I leave the last dimension as the dimension corresponding to
> the spectrum length.
>
> Anyways, at this moment I am mainly concerned with the stack of images.
> I guess (this is really a guess!), for HDF5 itself it does not matter
> because the images are going to be stored one by one at least when my
> chunksize is (1, n_rows, n_columns) or (n_rows, n_columns, 1). However,
> I would expect huge performance differences when mapping the file
> contents to memory.

Well, a chunksize of (1, n_rows, n_columns) is clearly not the same as
(n_rows, n_columns, 1). For maximum performance, in the first case you
must make sure that your first dimension is the one used for stacking
images, while in the second case you should choose the last one (unless
you want to retrieve rows or columns of *different* images in one shot,
of course ;-). If you are using Python, you will probably want to follow
the C convention, so I'd go for the (1, n_rows, n_columns) case.

Also, please try not to define chunksizes that are too large or too
small. Finding a sensible value requires some experimentation, but it
should be worth the effort if you are going to access the images many
times. See, for example, the section "Understanding chunking" in the
PyTables manual:

http://www.pytables.org/docs/manual/ch05.html#chunksizeOptim

Although it talks about some PyTables features, most of what is said
there applies to HDF5 applications in general. In particular, pay
attention to how random reads may be affected by choosing too large a
chunksize, and how compression can help make reads more effective
(especially sequential reads).
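
As an illustration of those two knobs, a minimal h5py sketch with
per-image chunks and gzip compression; the names and parameter values
are illustrative, not recommendations from the manual:

    import h5py

    n_images, n_rows, n_columns = 100, 512, 512  # hypothetical sizes

    with h5py.File('stack_gzip.h5', 'w') as f:
        # HDF5 compresses chunk by chunk, so the chunk shape also fixes the
        # compression granularity: one image per chunk keeps single-image
        # reads cheap while letting gzip exploit within-image redundancy.
        f.create_dataset('imagestack',
                         shape=(n_images, n_rows, n_columns),
                         dtype='float32',
                         chunks=(1, n_rows, n_columns),
                         compression='gzip',
                         compression_opts=4,  # moderate speed/ratio trade-off
                         shuffle=True)        # byte-shuffle filter, often helps floats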

> My questions:
>
> - Is there already a convention in HDF5 itself, that could tell us what
> index is used as the image counter?

I don't think HDF5 imposes a convention on this. You should choose your own
convention.

> - Can one safely retrieve and use the chunksize information itself to
> deduce the image counter?

Mmh, you mean trusting the chunk shape for determining the image
dimension? I think it is *much* better to specify the index of the
corresponding image dimension explicitly (the first one, if using the C
convention).

> - Do we need a dataset attribute to specify the index containing the
> image counter? Is there already a convention?

Wouldn't specifying the index of the image dimension be enough? Or am I
missing something?

Hope that helps,

--
Francesc Alted

"V. Armando Solé"

Nov 4, 2009, 9:26:55 AM
to ma...@googlegroups.com
Mark Rivers wrote:
> I have a related question. In netCDF there is the concept of the "unlimited" or "record" dimension.
>
> This is from the netCDF documentation:
>
> "A netCDF classic or 64-bit offset dataset can have at most one unlimited dimension, but need not have any. If a variable has an unlimited dimension, that dimension must be the most significant (slowest changing) one. Thus any unlimited dimension must be the first dimension in a CDL shape and the first dimension in corresponding C array declarations."
>
> This is a very useful concept, because one does not need to declare in advance the size of the arrays in that dimension, one can just keep appending information in the unlimited dimension. That is what I use for streaming video in the areaDetector module, the n_images dimension is unlimited.
>
> Does HDF have a similar concept?
>
I'm not sure, Mark. I think Darren Dale was asking something along those
lines of the person who wrote one of the Python bindings to HDF5. In his
case, Darren was interested in the possibility of defining a larger
dataset than needed and then trimming it at the end to the number of
entries actually written. It's not exactly what you describe, but it is
close.

Anyway, reading the PyTables manual I found that it talks about
"enlargeable arrays" and says "You can add new elements to existing
arrays on disk in any dimension you want (but only one)". So I am
tempted to say yes, the concept is there or can be implemented, but I do
not know whether it is an HDF5 feature or a PyTables feature. By the
way, we expect the main author of PyTables (Francesc Alted) to
participate in the coming workshop.

See you,

Armando

"V. Armando Solé"

Nov 4, 2009, 9:41:02 AM
to ma...@googlegroups.com
Francesc Alted wrote:
> Wouldn't specifying the index of the image dimension be enough? Or am I
> missing something?
>
Welcome to the mailing list!

In my case I am following the C convention and I use the first dimension
as the image index for optimum performance. However, it seems to be
(quite?) common in electron microscopy to follow the Fortran convention
and use the last dimension as the image index.

Perhaps it is me who is missing something. If you are adding images one
by one in a sort of extensible dataset, just knowing the image size can
be enough. The problem is when you have a three-dimensional dataset
written by someone else, you know you are dealing with images or want to
treat the dataset as a stack of images, and you do not know which
dimension holds the image index. If that information is not there, and
we want to be able to share data analysis procedures, I guess we should:

- somehow agree on how to identify the image index

- always foresee the two possibilities and therefore always prompt the user

Armando


Francesc Alted

Nov 4, 2009, 10:44:08 AM
to ma...@googlegroups.com
On Wednesday 04 November 2009 15:26:55, V. Armando Solé wrote:

> Anyway, reading the PyTables manual I found that it talks about
> "enlargeable arrays" and says "You can add new elements to existing
> arrays on disk in any dimension you want (but only one)". So I am
> tempted to say yes, the concept is there or can be implemented, but I do
> not know whether it is an HDF5 feature or a PyTables feature.

It is an HDF5 feature. In fact, HDF5 lets you enlarge datasets in any
dimension declared as 'enlargeable'. PyTables explicitly restricts this
to just one (like NetCDF v3) for a number of reasons.
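
For example, a minimal PyTables sketch of such an enlargeable array; the
file name, node name, and sizes are illustrative:

    import numpy as np
    import tables

    n_rows, n_columns = 512, 512  # hypothetical image size

    with tables.open_file('growing_stack.h5', 'w') as f:
        # The zero in the shape marks the single enlargeable dimension.
        earray = f.create_earray(f.root, 'imagestack',
                                 atom=tables.Float32Atom(),
                                 shape=(0, n_rows, n_columns))
        # Append one image at a time along the enlargeable dimension.
        earray.append(np.random.rand(1, n_rows, n_columns).astype('float32'))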

> By the way, we
> expect the main author of PyTables (Francesc Alted) to participate in
> the coming workshop.

Yes, I expect that too :-)

--
Francesc Alted

Francesc Alted

Nov 4, 2009, 10:47:53 AM
to ma...@googlegroups.com
On Wednesday 04 November 2009 15:41:02, V. Armando Solé wrote:

> Francesc Alted wrote:
> > Wouldn't specifying the index of the image dimension be enough? Or am
> > I missing something?
>
> Welcome to the mailing list!

Thanks. In fact I've been lurking on the list for some time now...

> In my case I am following the C convention and I use the first dimension
> as the image index for optimum performance. However, it seems to be
> (quite?) common in electron microscopy to follow the Fortran convention
> and use the last dimension as the image index.
>
> Perhaps it is me who is missing something. If you are adding images one
> by one in a sort of extensible dataset, just knowing the image size can
> be enough. The problem is when you have a three-dimensional dataset
> written by someone else, you know you are dealing with images or want to
> treat the dataset as a stack of images, and you do not know which
> dimension holds the image index. If that information is not there, and
> we want to be able to share data analysis procedures, I guess we should:
>
> - somehow agree on how to identify the image index
>
> - always foresee the two possibilities and therefore always prompt the user

Ah! I see the problem now. Then yes, I'd use an attribute saying which
dimension holds the images themselves, or just a boolean attribute
saying whether the image stack follows Fortran or C order.
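
A minimal h5py sketch of such an attribute convention; the attribute
names are hypothetical, since nothing has been agreed yet:

    import h5py

    with h5py.File('stack.h5', 'a') as f:  # file from an earlier sketch
        dset = f['imagestack']
        dset.attrs['image_dimension'] = 0  # which axis counts the images
        dset.attrs['c_order'] = True       # or: a C-versus-Fortran flag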

Cheers,

--
Francesc Alted

Wellenreuther, Gerd

Nov 4, 2009, 10:53:01 AM
to ma...@googlegroups.com
Hi there.

Francesc Alted wrote:
> Ah! I see the problem now. Then yes, I'd use an attribute saying which
> dimension holds the images themselves, or just a boolean attribute
> saying whether the image stack follows Fortran or C order.

I would go for an attribute, because attributes are more flexible: you
can have more than just two settings, etc. Just imagine someone taking
3D datasets over time. Then time is the slowest-changing index, and the
(in my opinion) most natural way would be to create a 4D dataset, so you
would want to indicate that properly. Using attributes leaves us better
prepared for the future - whatever that might be.

Cheers, Gerd

--
Dr. Gerd Wellenreuther
beamline scientist P06 "Hard X-Ray Micro/Nano-Probe"
Petra III project
HASYLAB at DESY
Notkestr. 85
22603 Hamburg

Tel.: + 49 40 8998 5701

Werner Benger

Nov 4, 2009, 2:44:04 PM
to ma...@googlegroups.com
Hi there,

the HDF5 group had plans some time ago to add different index orderings
such as Fortran/C as a property of a dataspace. That would be most
elegant, but it was not implemented, as far as I know.

An alternative would be something like a "shared dataspace" which,
similar to a "named datatype", could be referenced explicitly in an HDF5
file and hold attributes. The ordering type could then be an attribute
of such a dataspace shared by many datasets. Unfortunately, shared
dataspaces did not get beyond the planning stage either.

In the current state of the implementation, it seems most appropriate to
store the ordering as an integer array attribute such as "int order[3];"
for a 3D dataset, which gives the mapping from dataset dimension to
coordinate; it could be attached either to the dataset or to a
higher-level group containing multiple (3D) datasets that all share the
same ordering. This is the approach I use in my F5 library.

However, in practice it seems impractical to store images in anything
other than FORTRAN ordering, as all image file formats and hardware that
I know and use so far use FORTRAN ordering anyway: the fastest index
runs along a horizontal line, the second-fastest along the vertical
direction, and the third coordinate is the z or stack direction, such
that indexing such a dataset in C directly yields an image.

imagestack[Z][Y][X] --> imagestack[Z] is an image[Y][X]

It appears complicated to me to use another convention here.

For time-dependent image stacks, 4D datasets seem impractical to me:
they make it difficult to insert images between already sampled ones, to
store attributes that change over time, or to have image stacks that
differ between time steps. It seems more flexible to use a group here
with a set of 3D datasets (and appropriate attributes for each, such as
the physical time of each instance as a float), with no loss of
functionality, except that it won't be easily possible to take a
four-dimensional hyperslab, though that hasn't seemed important in my
use cases so far.

cheers,
Werner


On Wed, 04 Nov 2009 07:53:01 -0800, Wellenreuther, Gerd <gerd.well...@desy.de> wrote:

>
> Hi there.
>
> Francesc Alted wrote:
>> Ah! I see the problem now. Then yes, I'd use an attribute saying which
>> dimension holds the images themselves, or just a boolean attribute
>> saying whether the image stack follows Fortran or C order.
>
> I would go for an attribute, because attributes are more flexible: you
> can have more than just two settings, etc. Just imagine someone taking
> 3D datasets over time. Then time is the slowest-changing index, and the
> (in my opinion) most natural way would be to create a 4D dataset, so you
> would want to indicate that properly. Using attributes leaves us better
> prepared for the future - whatever that might be.
>
> Cheers, Gerd
>


--
___________________________________________________________________________
Dr. Werner Benger Visualization Research
Laboratory for Creative Arts and Technology (LCAT)
Center for Computation & Technology at Louisiana State University (CCT/LSU)
211 Johnston Hall, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362

Vicente Sole

Nov 4, 2009, 3:29:17 PM
to ma...@googlegroups.com
Hi,

Quoting Werner Benger <wer...@cct.lsu.edu>:

>
> However, in practice it seems impractical to store images in anything
> other than FORTRAN ordering, as all image file formats and hardware
> that I know and use so far use FORTRAN ordering anyway: the fastest
> index runs along a horizontal line, the second-fastest along the
> vertical direction, and the third coordinate is the z or stack
> direction, such that indexing such a dataset in C directly yields an
> image.
>
> imagestack[Z][Y][X] --> imagestack[Z] is an image[Y][X]
>
> It appears complicated to me to use another convention here.


If you forget about Z, Y, X and just consider dimensions, that is
precisely the C ordering, not the Fortran ordering :-) In fact, my whole
point in storing the data as (n_images, n_rows, n_columns) is to have
the first image at imagestack[0] and the nth image at imagestack[n]. If
you translate that to display coordinates, USUALLY the rows of an image
map to the Y coordinate of a display and the columns map to the X
coordinate, thus arriving at what you assumed to be the logical order
and took to be the Fortran order.

I also thought that was the logical way of ordering data. But it does
not seem to be the case in electron microscopy, or I misunderstood
Matthew's talk:
http://www.medsbio.org/meetings/BSR_2007_imgCIF_Workshop/MTD_DLS_17Aug07_talk.pdf

If, when he talks about a Z stack of image data, he means Z,Y,X as you
do, then we are talking about the same thing, and almost nobody seems to
be storing stacks of images in Fortran order.


> For time-dependent image stacks, 4D datasets seem impractical to me:
> they make it difficult to insert images between already sampled ones,
> to store attributes that change over time, or to have image stacks
> that differ between time steps. It seems more flexible to use a group
> here with a set of 3D datasets (and appropriate attributes for each,
> such as the physical time of each instance as a float), with no loss
> of functionality, except that it won't be easily possible to take a
> four-dimensional hyperslab, though that hasn't seemed important in my
> use cases so far.


I agree. In my case, 4D data usually correspond to spatial information
plus a measurement, and I do not find it practical to keep them all
together either. Usually I keep the spatial information separate from
the associated measured scalar, 1D, or 2D value. For regularly spaced
data, that also helps reduce the amount of data to handle.

Armando

Mark Rivers

Nov 4, 2009, 4:07:57 PM
to ma...@googlegroups.com
In this discussion it's important to distinguish between the syntax (C,
FORTRAN, Python, IDL, etc.) and the physical organization of the data in
memory (or on disk). Generally we will want the physical organization
to be columns (X), rows (Y), images (Z) in terms of what is varying
fastest as one traverses the data from start to end.

In FORTRAN (and IDL, for example) the first array index is the fastest
varying, so the syntax for data organized this way is ARRAY(X, Y, Z),
while in C the syntax is reversed; the data are stored in the same way
physically.
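
A small NumPy illustration of that point: the two syntaxes are different
views of the same physical buffer (sizes are arbitrary):

    import numpy as np

    nx, ny, nz = 4, 3, 2            # columns (X), rows (Y), images (Z)
    flat = np.arange(nx * ny * nz)  # the physical stream, X varying fastest

    c_view = flat.reshape(nz, ny, nx)             # C syntax: array[z][y][x]
    f_view = flat.reshape(nx, ny, nz, order='F')  # FORTRAN syntax: ARRAY(x, y, z)

    x, y, z = 1, 2, 1
    # Same physical element, two syntaxes, one linear offset:
    assert c_view[z, y, x] == f_view[x, y, z] == x + y*nx + z*nx*ny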

Mark

Vicente Sole

Nov 4, 2009, 4:48:07 PM
to ma...@googlegroups.com
Hi Mark,

If I open the Daphnia_float32.h5 dataset in Python or with HDFView, the
dataset is reported as being 175x119x2048. Can any of you open the
dataset in IDL or FORTRAN to see whether it is reported as being
2048x119x175? If that is the case, then I guess all of us understand
what is going on and, most importantly, it seems we all agree about the
proper way to store the data on disk.

Armando

Darren Dale

Nov 4, 2009, 6:08:24 PM
to ma...@googlegroups.com
On Wed, Nov 4, 2009 at 9:26 AM, "V. Armando Solé" <so...@esrf.fr> wrote:
>
> Mark Rivers wrote:
>> I have a related question.  In netCDF there is the concept of the "unlimited" or "record" dimension.
>>
>> This is from the netCDF documentation:
>>
>> "A netCDF classic or 64-bit offset dataset can have at most one unlimited dimension, but need not have any. If a variable has an unlimited dimension, that dimension must be the most significant (slowest changing) one. Thus any unlimited dimension must be the first dimension in a CDL shape and the first dimension in corresponding C array declarations."
>>
>> This is a very useful concept, because one does not need to declare in advance the size of the arrays in that dimension, one can just keep appending information in the unlimited dimension.  That is what I use for streaming video in the areaDetector module, the n_images dimension is unlimited.
>>
>> Does HDF have a similar concept?
>>
> I'm not sure, Mark. I think Darren Dale was asking something along those
> lines of the person who wrote one of the Python bindings to HDF5. In his
> case, Darren was interested in the possibility of defining a larger
> dataset than needed and then trimming it at the end to the number of
> entries actually written. It's not exactly what you describe, but it is
> close.

With h5py it is possible to extend chunked datasets. What I am doing
currently for data acquisition is to extend datasets one element at a
time. This may be inefficient, but some quick tests seemed to show that
there is no problem for what I am doing (recording data about once a
second). I am not defining large datasets and then trimming them after
the fact.
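
A minimal h5py sketch of this one-element-at-a-time pattern; the dataset
name, dtype, and chunk size are illustrative:

    import h5py

    with h5py.File('acquisition.h5', 'w') as f:
        # Chunked, with an unlimited first dimension, so the dataset can
        # grow point by point as readings arrive.
        dset = f.create_dataset('counter', shape=(0,), maxshape=(None,),
                                dtype='float64', chunks=(1024,))
        for value in (1.0, 2.0, 3.0):  # stand-in for ~1 Hz readings
            dset.resize(dset.shape[0] + 1, axis=0)
            dset[-1] = value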

Darren

Dougherty, Matthew T.

Nov 4, 2009, 5:15:16 PM
to ma...@googlegroups.com

>>>I also thought that was the logical way of ordering data. But it does not seem to be the case on electron microscopy or I misunderstood Matthew's talk: http://www.medsbio.org/meetings/BSR_2007_imgCIF_Workshop/MTD_DLS_17Aug07_talk.pdf


Hi Vicente,

Here is my point, regardless of the C/Fortran issue or the axis ordering of choice:

For acquisition the images in microscopy are 2D. Storing 2D images (contiguous pixels) concatenated together to form a 3D stack is not a problem, plus it provides some computational & disk efficiencies.

After the stack of 2D images is processed through reconstruction, it forms a 3D image. At this point the XY, YZ, and XZ planes have the same informational relevance, regardless of the original axis of projection. There is a small exception to this in electron tomography, where the axis of projection is relevant/noticeable, but this does not change the underlying data organization problem.

As long as the 3D image will fit into RAM, no problem.

If my 3D image is, say, 512 GB, I have a problem: most (99%) of computers cannot visualize the data.

To solve this I need to reorganize my image using three coordinated methods:
1) restructure the 3D image pixels as bricks rather than 2D planes
2) subsample the original 3D image, 2x,4x,8x,16x (also bricked)
3) put it all into a single HDF file so they are contextually linked.

At this point visualization software and other software methods have a functional shot at interacting with the 3D image.
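
A rough h5py/NumPy sketch of this brick-plus-subsample scheme; the brick
size, dataset names, and the 2x averaging step are illustrative
assumptions (HDF5 chunks serve as the bricks):

    import h5py
    import numpy as np

    vol = np.random.rand(256, 256, 256).astype(np.float32)  # stand-in volume

    with h5py.File('bricked_volume.h5', 'w') as f:
        level = 0
        while min(vol.shape) >= 64:
            # 1) bricked storage: 64^3 chunks rather than 2D planes
            f.create_dataset('resolution_%dx' % 2**level, data=vol,
                             chunks=(64, 64, 64))
            # 2) subsample by averaging 2x2x2 cells (plain striding would alias)
            vol = vol.reshape(vol.shape[0] // 2, 2, vol.shape[1] // 2, 2,
                              vol.shape[2] // 2, 2).mean(axis=(1, 3, 5))
            level += 1
        # 3) all resolutions now live, contextually linked, in one HDF5 file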


Another point relating to the thread:
The MRC format, arguably the oldest digital microscopy format (1982), predating TIFF, uses a Fortran ordering like HDF5. Much of this can be attributed to Fortran existing before C. Now most of the code using MRC is C, so people are probably getting the ordering wrong when reading back Fortran-generated data. These programming-language arrays are abstract; the association of XYZ or ZYX with the array ordering is an arbitrary call, frequently made by a lone programmer, because there are no standards. Intuition would dictate the obvious (XYZ), but since there are two methods (Fortran & C), "obvious" is not a consensus in contiguous computer storage. If all the data fits into RAM, who cares, as long as you write it as ZYX and read it as ZYX? Overall this is much like the big/little-endian problem, which HDF deals with transparently. If the array-ordering API in HDF is broken it needs to be fixed; if programmers are not using it, then they should rethink their approach.




These problems are why I am pushing for a common scientific image standard. Much of this is avoidable if you have rules of the road: digital wall plugs, 60 Hz/50 Hz/120 V/220 V. At this time there is no ISO spec to subscribe to. I strongly urge you to create a distinctive root group equivalent to your format design/community as a means to enforce your designs, provide exclusive documentation, and avoid namespace collisions with other formats using an HDF file at the same time. Think of it like internet domain names: without containment and organization, the internet would be extremely limited, to only two nodes.



Matthew Dougherty
713-433-3849
National Center for Macromolecular Imaging
Baylor College of Medicine/Houston Texas USA
=========================================================================
=========================================================================






-----Original Message-----
From: ma...@googlegroups.com on behalf of Vicente Sole
Sent: Wed 11/4/2009 2:29 PM
To: ma...@googlegroups.com
Subject: [MAHID] Re: Handling stacked (image) data


"V. Armando Solé"

Nov 4, 2009, 6:29:01 PM
to ma...@googlegroups.com
Darren Dale wrote:
> With h5py it is possible to extend chunked datasets. What I am doing
> currently for data acquisition is extend datasets one element at a
> time. This may be inefficient, but some quick tests seemed to show
> that there was no problem for what I am doing (recording data about
> once a second). I am not defining large datasets and then trimming
> them after the fact.
>

OK, then I misunderstood your original request. I thought you wanted to
handle the situation of having foreseen a scan of npoints but, because
of an unexpected abort, having to trim the data to the actual number of
recorded points.
Armando

"V. Armando Solé"

Nov 4, 2009, 6:32:46 PM
to ma...@googlegroups.com
Hi Matthew,

Except for point 1), we are following the same approach.

>
>
> Another point relating to the thread:
> The MRC format, arguably the oldest digital microscopy format (1982)
> predating TIFF, uses a Fortran ordering like HDF5. Much of this can
> be attributed to Fortran existing before C. Now most of the code
> using MRC is C, so people are probably getting the ordering wrong if
> they were reading back fortran generated data. These programming
> language arrays are abstract; the association of XYZ or ZYX to the
> array ordering is an arbitrary call frequently by a lone programmer
> because there are no standards. Intuition would dictate the obvious
> (XYZ), but since there are two methods (fortran & C), obvious is not a
> consensus in contiguous computer storage. If all the data fits into
> RAM, who cares as long as you write it as ZYX and you read it as ZYX?
> Overall this is much like the big/little endian problem, which HDF
> deals with transparently. If the array ordering API in HDF is broken
> it needs to be fixed; if programmers are not using it, then they
> should rethink their approach.
>

I do not think the array ordering is broken. If it is just a question of
syntax, as Mark says, then we are already doing the proper thing,
because I doubt you are storing your image stack as (Z,X,Y) in your
syntax. Most likely you are writing it as (X,Y,Z) in your syntax, which
corresponds to (Z,Y,X) in mine, and both of us are saying the same
thing.

All my concerns about needing something to indicate the dimension
holding the image index come from huge differences in handling data
loaded dynamically, a situation I face whenever the data do not fit into
memory, which happens quite soon on 32-bit machines. I would not be able
to interact properly with your 512 GB dataset if it were stored as
(Z, X, Y), for instance, but speed issues aside, I would be able to deal
with it if it were stored as (X, Y, Z), always in your syntax.

>
> These problems are why I am pushing for a common scientific image
> standard. Much of this is avoidable if you got rules to the road,
> digital wall plugs, 60hz/50hz/120v/220v. At this time there is no ISO
> spec to subscribe to. I strongly urge you to create a distinctive
> root group equivalent to your format design/community as a means to
> enforce your designs, provide exclusive documentation, and avoid
> namespace collisions with other formats using an HDF file at the same
> time. Think of it like internet domain names, without containment and
> organization, the internet would be extremely limited to only two nodes.
>

These are things we have to discuss. My personal feeling is that I
should be able to use the NeXus root group and the NXentry. Then one
could use the rest of the NeXus groups as they are, if one wishes, but
with the possibility of defining new groups, not labeled with the
NX_class attribute NXwhatever unless recognized and accepted by the
NIAC. That should prevent any namespace collision. If the NIAC
explicitly forbids that use, I will have to give the opposite advice
about using NeXus, but looking at NIAC presentations
http://wiki.cacr.caltech.edu/danse/images/7/71/StateOfNeXus.ppt it seems
I am respecting their format because I would not be breaking their API.

Nevertheless, from the analysis-software developer's side, it would be a
pity to limit analysis capabilities to a particular format defined on
this mailing list. My intention is to offer generic but probably
cumbersome HDF5 support alongside more specific NeXus/MAHID/whatever
support requiring less user interaction. That way any scientific
community could benefit from any work arising from this mailing list.

Armando

Dougherty, Matthew T.

Nov 4, 2009, 7:19:48 PM
to ma...@googlegroups.com

Hi Armando,

Working under NXroot means playing by NeXus rules, and I would encourage it.

My NeXus strategy for electron microscopy is:
1) move MRC-formatted data to HDF in a form the acquisition & viz people can live with (performance & simplicity)
2) get the EM manufacturers committed to NeXus in order to devise the instrument class
3) graft the MRC/HDF pixels into the NeXus framework, preferably with #2's involvement.

Or to say it less subtly: for the EM acquisition community the case can be made to use NeXus, but it will require a considerable amount of persuasion to get firm commitments. For downstream communities such as 3D visualization or biological analysis, it would be a long shot at best; they use completely different frameworks and unrelated metadata, and have next to zero need for NeXus metadata. The primary connection between the communities that form the image pipeline is the images, not any one community's metadata (unless you are in that community).

Thanks for the PPT reference, State of NeXus.




Matthew Dougherty
713-433-3849
National Center for Macromolecular Imaging
Baylor College of Medicine/Houston Texas USA
=========================================================================
=========================================================================


Darren Dale

Nov 4, 2009, 8:18:16 PM
to ma...@googlegroups.com
On Wed, Nov 4, 2009 at 6:29 PM, "V. Armando Solé" <so...@esrf.fr> wrote:
>
> Darren Dale wrote:
>> With h5py it is possible to extend chunked datasets. What I am doing
>> currently for data acquisition is extend datasets one element at a
>> time. This may be inefficient, but some quick tests seemed to show
>> that there was no problem for what I am doing (recording data about
>> once a second). I am not defining large datasets and then trimming
>> them after the fact.
>>
>
> Ok, then I misunderstood your original request.

What request?

> I thought you wanted to
> handle the situation of having foreseen a scan of npoints
>  but because of an unexpected abort having to trim the data to the
> actual number of recorded points.

The way I do it is, if I know the dimensions ahead of time, I set the
maxdims argument for h5py accordingly, but initially create a small
dataset. Then the dataset can grow to those maxdims. It is also
possible to grow indefinitely, as long as the maxdims are not set to
constrict you.
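
A minimal h5py sketch of that pattern; the names and sizes are
illustrative:

    import h5py

    npoints = 1000  # hypothetical planned scan length

    with h5py.File('scan.h5', 'w') as f:
        # Start small, but allow growth up to the planned number of points;
        # an aborted scan simply leaves the dataset at its true size.
        dset = f.create_dataset('signal', shape=(0,), maxshape=(npoints,),
                                dtype='float64', chunks=(256,))
        dset.resize((500,))  # grow as points actually arrive
        # maxshape=(None,) instead would allow indefinite growth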

Werner Benger

Nov 5, 2009, 1:13:12 AM
to ma...@googlegroups.com
Indeed, and since HDF5 uses the C convention in dealing with the
dimensions of a dataset, it will show (e.g. in h5ls or other HDF5 tools)
the dimensions reversed, which in this case is <images> x <rows> x
<columns> for a stack of <images> single images that are <columns> x
<rows> in size.

That's not a problem, just a matter of convention, as long as the entire
dataset is read at once. However, when dealing with hyperslabs one must
take care to set up the indices correctly and, if necessary, reverse the
order relative to the "C" convention. The FORTRAN HDF5 API already
includes such an index reversal, so FORTRAN users don't need to bother
with different indexing conventions (only the indices are swapped, not
the data); only when running h5ls will they see the dimensions reversed
compared to how they specified them in their program.

The indexing (or dimension-name) convention is one thing; the data
layout is another. A better term for the data layout might be
"row-major":

http://en.wikipedia.org/wiki/Row-major_order

though the Wikipedia page is quite confusing here, as it says that
FORTRAN does *not* use row-major...

Werner

Gerd Wellenreuther

Nov 5, 2009, 1:24:18 AM
to ma...@googlegroups.com
Dear all,

I saved the links to the two presentations Armando mentioned (see below)
at http://groups.google.com/group/mahid/web/talks-presentations-posters.
This should allow us to quickly find them again without browsing all the
emails, which is especially good if someone new wants to get a quick
overview. In case there is something to add or correct and you are not
able to do it yourself, please send me an email directly, not via the
group's email.

Thanks, Gerd


>> does not seem to be the case on electron microscopy or I misunderstood
>> Matthew's talk:
>> http://www.medsbio.org/meetings/BSR_2007_imgCIF_Workshop/MTD_DLS_17Aug07_talk.pdf

Werner Benger

Nov 5, 2009, 1:28:24 AM
to ma...@googlegroups.com
On Wed, 04 Nov 2009 14:15:16 -0800, Dougherty, Matthew T. <matt...@bcm.tmc.edu> wrote:
> If my 3D image is say 512GB, I got a problem that most (99%) of the computers cannot visualize the data.
>
> To solve this I need to reorganize my image using three coordinated methods:
> 1) restructure the 3D image pixels as bricks rather than 2D planes
> 2) subsample the original 3D image, 2x,4x,8x,16x (also bricked)

Just as a comment, one should point out that a simple hyperslab would not be sufficient here, as it would lead to aliasing problems. It indeed needs subsampling with averaging of 3D pixels, similar to mipmaps.

> 3) put it all into a single HDF file so they are contextually linked.
>
> At this point visualization software and other software methods have a functional shot at interacting with the 3D image.
>
>
> Another point relating to the thread:
> The MRC format, arguably the oldest digital microscopy format (1982) predating TIFF, uses a Fortran ordering like HDF5. Much of this can be attributed to Fortran existing before C. Now most of the code using MRC is C, so people are probably getting the ordering wrong if they were reading back fortran generated data. These programming language arrays are abstract; the association of XYZ or ZYX to the array ordering is an arbitrary call frequently by a lone programmer because there are no standards. Intuition would dictate the obvious (XYZ), but since there are two methods (fortran & C), obvious is not a consensus in contiguous computer storage. If all the data fits into RAM, who cares as long as you write it as ZYX and you read it as ZYX? Overall this is much like the big/little endian problem, which HDF deals with transparently. If the array ordering API in HDF is broken it needs to be fixed; if programmers are not using it, then they should rethink their approach.
>
If a lone programmer uses an image convention that differs from what
standards such as OpenGL, JPEG, etc. (meaning: existing common hardware
and software) use, then he'll indeed become quite lonely. The
array-ordering API in HDF5 is consistent and does not need a fix: it is
consistently C ordering when using the C API, and consistently FORTRAN
ordering when using the FORTRAN API. However, the programmer needs to be
aware that in C the last index is the fastest, while in FORTRAN the
first index is the fastest. This is unrelated to HDF5, and programmers
had better be aware of the properties of their programming languages.
Meaning: in C they should name their image arrays ZYX and not XYZ. In
practice, however, you will usually want arrays whose dimensions are
determined at runtime rather than fixed at compile time, and since C
does not support multidimensional arrays whose extents are given at
runtime, the programmer has to implement such arrays first anyway, and
will most likely use the ZYX approach when computing a linear address
x + y*width + z*width*height, which is the row-major convention.

Werner

Francesc Alted

Nov 5, 2009, 4:31:48 AM
to ma...@googlegroups.com
On Wednesday 04 November 2009 20:44:04, Werner Benger wrote:

> Hi there,
>
> the HDF5 group had some plans some time ago to add different index
> orderings such as Fortran/C as a property to a dataspace. That would be
> most elegant, but was not implemented, as far as I know.

Mmh, my information is that they did that for HDF4, but they have no
plans to do the same for HDF5:

http://mail.hdfgroup.org/pipermail/hdf-forum_hdfgroup.org/2008-May/001166.html

although they recognize that having an attribute with a reserved name
describing how the data is ordered would be a good idea.

However, I still think that the current HDF5 approach to dealing with
Fortran/C orderings, even if an attribute is reserved, is ill-designed
and can be a source of confusion for many people. I expressed my
thoughts here:

http://mail.hdfgroup.org/pipermail/hdf-forum_hdfgroup.org/2008-June/001164.html

but I don't think they are going this route (unless they have changed
their minds recently). So I think it would be a good idea to express
interest in the reserved-attribute convention for specifying C/Fortran
(or, if you prefer, row-major/column-major) ordering.

At any rate, for this concrete case I think it is fine to also add an
attribute specifying which dimension the image stack follows.

Cheers,

--
Francesc Alted

"V. Armando Solé"

Nov 10, 2009, 5:15:08 AM
to ma...@googlegroups.com
Hi,

> If I open the Daphnia_float32.h5 dataset in Python or with HDFView, the
> dataset is reported as being 175x119x2048. Can any of you open the
> dataset in IDL or FORTRAN to see whether it is reported as being
> 2048x119x175? If that is the case, then I guess all of us understand
> what is going on and, most importantly, it seems we all agree about the
> proper way to store the data on disk

Please, can someone using MATLAB, FORTRAN, and/or IDL tell us the
dimensions reported by their tool when opening the data contained in the
Daphnia_float32.h5 file found on the datasets page:
http://groups.google.fr/group/mahid/web/datasets

BTW, I did not find a direct link from the main web page to the datasets
page. I was expecting to reach it via the Files link, which currently
points to a page with just one PDF.

Armando


Wellenreuther, Gerd

Nov 10, 2009, 5:28:36 AM
to ma...@googlegroups.com

V. Armando Solé wrote:
> BTW, I did not find a direct link from the main web page to the datasets
> page. I was expecting to reach it via the Files link, which currently
> points to a page with just one PDF.

This is because we cannot use Google itself to host the datasets; the
available storage space is just far too small (100 MB). So I have
created pages containing links to the datasets, and all those pages can
be found under "Pages" in the main menu.

I will now:

* write a comment on the main page
* write a comment on the Files-page

in order to make sure everybody can find the links.

BTW, if somebody has datasets to be shared, I can host and link them.
Currently we are using some DESY resources to do the hosting.

"V. Armando Solé"

Nov 10, 2009, 5:33:49 AM
to ma...@googlegroups.com
Hi Gerd,

Wellenreuther, Gerd wrote:
> V. Armando Solé wrote:
>
>> BTW, I did not find a direct link from the main web page to the
>> datasets page. I was expecting to reach it via the Files link, which
>> currently points to a page with just one PDF.
>>
>
> This is because we cannot use Google itself to host the datasets; the
> available storage space is just far too small (100 MB).

Is it possible to add a link as an entry on the Files page? That would
solve the problem elegantly by pointing to the datasets page.

Armando


Werner Benger

Nov 10, 2009, 8:09:20 AM
to ma...@googlegroups.com
On Tue, 10 Nov 2009 04:15:08 -0600, V. Armando Solé <so...@esrf.fr> wrote:

>
> Hi,
>> If I open the Daphnia_float32.h5 dataset in python or with HDFView, I
>> get the dataset reported as being 175x119x2048. Can anyone of you open
>> the dataset in IDL or FORTRAN to see if it is reported as being
>> 2048x119x175? If that is the case, then I guess all of us understand
>> what is going on and, most important, it seems all of us agree about
>> the proper way to store the data in disk
>
> Please can someone using MATLAB and/or FORTRAN and/or IDL tell us the
> dimensions reported by their tool when opening the data contained in the
> Daphnia_float32.h5 dataset found in the datasets page
> http://groups.google.fr/group/mahid/web/datasets
>
h5ls reports this dataset as {175/175, 119/119, 2048/2048}, so I would
treat it as an X=2048, Y=119, Z=175 dataset, i.e. 175 images of width
2048 and height 119.

However, there are some additional attributes that declare dims_2 to be
2048 and dims_0 to be 175.

*If* we consider X/width to be dims_2 and Z/stacksize to be dims_0, then
it's consistent. This, however, is just a naming convention, unrelated
to the dataset itself. I'd prefer to call width/X dimension 0, but it's
a matter of taste.
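
A small NumPy illustration of why both reports describe the same file
(sizes chosen to match Daphnia_float32.h5, purely for illustration):

    import numpy as np

    # 175 images of 119 rows x 2048 columns, contiguous in C (row-major)
    # order, which is how HDF5 lays the dataset out on disk.
    stack = np.zeros((175, 119, 2048), dtype=np.float32)

    # Transposing reverses the dimensions without touching the bytes: the
    # buffer that C-based tools report as (175, 119, 2048) is what a
    # column-major reader (FORTRAN, IDL, MATLAB) reports as (2048, 119, 175).
    fortran_view = stack.T
    print(fortran_view.shape)                  # (2048, 119, 175)
    print(fortran_view.flags['F_CONTIGUOUS'])  # True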

Werner


> BTW, I did not find a direct link from the main web page to the datasets
> page. I was expecting to reach the page via the Files link that
> currently is pointing to a page with just one pdf.
>
> Armando


Freddie Akeroyd

Dec 6, 2009, 6:06:02 PM
to Methods for the analysis of hyperspectral image data
>
> These are things we have to discuss. My personal feeling is that I
> should be able to use the NeXus root group, and the NXentry. Then one
> could use the rest of NeXus groups as they are if one wishes to do so,
> but having the possibility to define new groups, not labeled with
> NX_CLASS attribute NXwhatever unless recognized and accepted by the
> NIAC. That should prevent any namespace collision. If the NIAC
> explicitly forbids me that use, I will have to give an opposite advice
> about using NeXus, but looking at NIAC presentations
> http://wiki.cacr.caltech.edu/danse/images/7/71/StateOfNeXus.ppt it seems
> I am respecting their format because I would not be breaking their API.
>
Yes, my understanding too is that the NIAC only requests that you do not
create unofficial NeXus classes. You can use the NeXus API to create
your own classes (it does not restrict class names to only those
starting with NX), which allows you to use the NeXus API as a simplified
HDF5 API and also to make use of the other NeXus APIs (e.g. IDL, Python,
Java) and tools (e.g. nxbrowse).

Freddie