Quite frequently we have to deal with stacks of data of the same type.
Since I am using Python and C, I advocate storing stacks of images as
(n_images, n_rows, n_columns) with an HDF5 chunksize of at least
(1, n_rows, n_columns). Analogously, when I store stacks of 1D data
(~spectra), I leave the last dimension as the one corresponding to
the spectrum length.
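In h5py terms, what I mean looks roughly like this (the file and dataset
names are just examples):

    import numpy as np
    import h5py

    n_images, n_rows, n_columns = 100, 512, 512

    with h5py.File("images.h5", "w") as f:
        # One chunk per image: reading stack[i] touches exactly one chunk.
        stack = f.create_dataset("stack",
                                 shape=(n_images, n_rows, n_columns),
                                 dtype="float32",
                                 chunks=(1, n_rows, n_columns))
        for i in range(n_images):
            stack[i] = np.random.rand(n_rows, n_columns)  # stand-in for real data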
Anyway, at this moment I am mainly concerned with the stack of images.
I guess (this is really a guess!) that for HDF5 itself it does not
matter, because the images are going to be stored one by one at least
when my chunksize is (1, n_rows, n_columns) or (n_rows, n_columns, 1).
However, I would expect huge performance differences when mapping the
file contents to memory.
My questions:
- Is there already a convention in HDF5 itself, that could tell us what
index is used as the image counter?
- Can one safely retrieve and use the chunksize information itself to
deduce the image counter?
- Do we need a dataset attribute to specify the index containing the
image counter? Is there already a convention?
Any hint is welcome.
Armando
Well, clearly a chunksize of (1, n_rows, n_columns) is not the same as
(n_rows, n_columns, 1). For maximum performance, in the first case you must
make sure that your first dimension is the one used for stacking images, while
in the second case you should choose the last one (in case you don't want to
retrieve lines or columns of *different* images in one shot, of course ;-). If
you are using Python, you will probably want to follow the C convention, so
I'd go for the (1, n_rows, n_columns) case.
Also, please try not to define chunksizes that are too large or too small.
Finding a sensible value will require some experimentation, but it should be
worth the effort if you are going to access images many times. See, for
example, the section "Understanding chunking" in the PyTables manual:
http://www.pytables.org/docs/manual/ch05.html#chunksizeOptim
Although it talks about some PyTables features, most of what is said there
applies to HDF5 apps in general. In particular, pay attention to how
random reads may be affected by choosing too large a chunksize, or how
compression can help in making reads more effective (especially sequential
reads).
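As an illustrative sketch in h5py (the chunk shape and zlib level are just
example values to experiment with):

    import numpy as np
    import h5py

    data = np.random.rand(200, 512, 512).astype("float32")  # toy stack

    with h5py.File("chunk_test.h5", "w") as f:
        # One image per chunk plus zlib compression: a random read fetches
        # and decompresses exactly one chunk, while a much larger chunkshape
        # would force decompressing many images to obtain a single one.
        f.create_dataset("stack", data=data,
                         chunks=(1, 512, 512),
                         compression="gzip", compression_opts=4)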
> My questions:
>
> - Is there already a convention in HDF5 itself, that could tell us what
> index is used as the image counter?
I don't think HDF5 imposes a convention on this. You should choose your own
convention.
> - Can one safely retrieve and use the chunksize information itself to
> deduce the image counter?
Mmh, you mean trusting the chunk shape to determine the image dimension? I
think it is *much* better to specify the index of the image dimension
explicitly (the first one, if using the C convention).
> - Do we need a dataset attribute to specify the index containing the
> image counter? Is there already a convention?
Wouldn't specifying the index of the image dimension be enough? Or am I
missing something?
Hope that helps,
--
Francesc Alted
Anyway, reading the PyTables manual I found that it talks about
"enlargeable arrays", and it says "You can add new elements to existing
arrays on disk in any dimension you want (but only one)". So I am
tempted to say yes, the concept is there or can be implemented, but I do
not know whether it is an HDF5 feature or a PyTables feature. By the way,
we expect the main author of PyTables (Francesc Alted) to participate in
the coming workshop.
See you,
Armando
In my case I am following the C convention and I use the first dimension
as the image index for optimum performance. However, it seems to be
(quite?) common in electron microscopy to follow the Fortran convention
and use the last dimension as the image index.
Perhaps it is me who is missing something. If you are adding images one by
one to a sort of extensible dataset, just knowing the image size can be
enough. The problem is when you have a three-dimensional dataset
written by someone else, you know you are dealing with images or you
want to treat the dataset as a stack of images, and you do not know the
dimension holding the image index. If that information is not there, and
we want to be able to share data analysis procedures, I guess we
should:
- somehow agree on how to identify the image index
- always foresee the two possibilities and therefore always prompt the user
(see the sketch below)
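Something along these lines is what I have in mind; the 'image_axis'
attribute name is purely hypothetical, as no convention exists yet:

    import h5py

    def guess_image_axis(dset):
        """Try to find which axis of a 3D h5py dataset counts the images."""
        # 1) An explicit attribute would be the reliable way; the name
        #    'image_axis' is hypothetical, there is no convention (yet).
        if "image_axis" in dset.attrs:
            return int(dset.attrs["image_axis"])
        # 2) Fall back on the chunk shape: a single singleton chunk
        #    dimension suggests that axis is the image counter.
        if dset.chunks is not None:
            singletons = [i for i, c in enumerate(dset.chunks) if c == 1]
            if len(singletons) == 1:
                return singletons[0]
        # 3) No way to tell: the caller must prompt the user.
        return None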
Armando
It is an HDF5 feature. In fact, HDF5 lets you enlarge datasets in any
dimension declared as 'enlargeable'. PyTables explicitly restricts this to
just one (like NetCDF v3) for a number of reasons.
> By the way, we
> expect the main author of PyTables (Francesc Alted) to participate in
> the coming workshop.
Yes, I expect that too :-)
--
Francesc Alted
Thanks. In fact, I've been lurking on the list for some time now...
> In my case I am following the C convention and I use the first dimension
> as the image index for optimum performance. However, it seems to be
> (quite?) common in electron microscopy to follow the Fortran convention
> and use the last dimension as the image index.
>
> Perhaps it is me who is missing something. If you are adding images one by
> one to a sort of extensible dataset, just knowing the image size can be
> enough. The problem is when you have a three-dimensional dataset
> written by someone else, you know you are dealing with images or you
> want to treat the dataset as a stack of images, and you do not know the
> dimension holding the image index. If that information is not there, and
> we want to be able to share data analysis procedures, I guess we
> should:
>
> - somehow agree on how to identify the image index
>
> - always foresee the two possibilities and therefore always prompt the user
Ah! I see the problem now. Then yes, I'd use an attribute saying which
dimension holds the images themselves. Or just a boolean attribute saying
whether the image stack follows Fortran or C order.
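In h5py that could be as simple as the following sketch (the attribute
names are mere examples, not a convention; 'images.h5'/'stack' reuse the
example names from earlier in the thread):

    import h5py

    with h5py.File("images.h5", "a") as f:
        dset = f["stack"]
        dset.attrs["image_axis"] = 0   # dimension holding the image counter
        dset.attrs["c_order"] = True   # or a boolean: C versus Fortran stacking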
Cheers,
--
Francesc Alted
Francesc Alted wrote:
> Ah! I see the problem now. Then yes, I'd use an attribute saying which
> dimension holds the images themselves. Or just a boolean attribute saying
> whether the image stack follows Fortran or C order.
I would go for an attribute, because those are more flexible; you can
have more than just two settings, etc. Just imagine someone taking
3D datasets over time. Then time is the slowest changing index, and the
(in my opinion) most natural way would be to create a 4D dataset, so you
would like to properly indicate that. Using attributes, we are better
prepared for the future - whatever that might be.
Cheers, Gerd
--
Dr. Gerd Wellenreuther
beamline scientist P06 "Hard X-Ray Micro/Nano-Probe"
Petra III project
HASYLAB at DESY
Notkestr. 85
22603 Hamburg
Tel.: + 49 40 8998 5701
The HDF5 group had some plans some time ago to add different index orderings
such as Fortran/C as a property of a dataspace. That would be most elegant, but
was not implemented, as far as I know.
An alternative would be something like a "shared dataspace" which, similar to
a "named datatype", could be referenced explicitly in an HDF5 file and hold
attributes. So the ordering type could be an attribute of such a dataspace
shared by many datasets. Shared dataspaces unfortunately did not get beyond
the planning stage either.
With the current implementation state, it would seem most appropriate to store
the ordering as an integer array attribute such as "int order[3];" for a 3D
dataset, which gives the mapping from dataset dimension to coordinate;
to be added either to the dataset, or to a higher-level group containing
multiple (3D) datasets which then all share the same ordering. This is the
approach that I use in my F5 library.
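In h5py terms, such a mapping attribute might look like the sketch below
(dataset and attribute names are assumptions for illustration, not what F5
actually writes):

    import numpy as np
    import h5py

    with h5py.File("volumes.h5", "a") as f:
        dset = f.require_dataset("volume", shape=(64, 128, 256),
                                 dtype="float32")
        # order[k] = coordinate carried by dataset dimension k
        # (0 = x, 1 = y, 2 = z); here dimension 0 is z, 1 is y, 2 is x.
        dset.attrs["order"] = np.array([2, 1, 0], dtype=np.int32)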
However, in practice it seems impractical to store images in anything other
than FORTRAN ordering, as all image file formats and hardware that I know and
use so far employ FORTRAN ordering anyway: the fastest index runs along a
horizontal line, the second fastest index along the vertical direction, and
the third coordinate is then the z or stack direction, such that indexing
such a dataset in C directly yields an image:
imagestack[Z][Y][X] --> imagestack[Z] is an image[Y][X]
It appears complicated to me to use another convention here.
For time-dependent image stacks, 4D datasets seem impractical to me, since
they make insertion of images in between already sampled images difficult,
do not make it easy to store attributes that change over time, and cannot
accommodate image stacks that differ between time steps. It seems more
flexible to use a group here with a set of 3D datasets (and appropriate
attributes for each, such as the physical time as a float for each time
instance, etc.), with no lack of functionality, except that it won't be
easily possible to make a four-dimensional hyperslab, though this hasn't
seemed important to me so far in my use cases.
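A rough h5py sketch of such a layout (group/dataset names and the 'time'
attribute are assumptions):

    import numpy as np
    import h5py

    with h5py.File("timeseries.h5", "w") as f:
        grp = f.create_group("image_stacks")
        for step, t in enumerate([0.0, 0.5, 1.25]):  # irregular sampling is fine
            stack = np.zeros((10, 256, 256), dtype="float32")  # stand-in data
            dset = grp.create_dataset("t%04d" % step, data=stack,
                                      chunks=(1, 256, 256))
            dset.attrs["time"] = t  # physical time of this stack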
cheers,
Werner
On Wed, 04 Nov 2009 07:53:01 -0800, Wellenreuther, Gerd <gerd.well...@desy.de> wrote:
>
> Hi there.
>
> Francesc Alted wrote:
>> Ah! I see the problem now. Then yes, I'd use an attribute saying which
>> dimension holds the images themselves. Or just a boolean attribute saying
>> whether the image stack follows Fortran or C order.
>
> I would go for an attribute, because those are more flexible; you can
> have more than just two settings, etc. Just imagine someone taking
> 3D datasets over time. Then time is the slowest changing index, and the
> (in my opinion) most natural way would be to create a 4D dataset, so you
> would like to properly indicate that. Using attributes, we are better
> prepared for the future - whatever that might be.
>
> Cheers, Gerd
>
--
___________________________________________________________________________
Dr. Werner Benger Visualization Research
Laboratory for Creative Arts and Technology (LCAT)
Center for Computation & Technology at Louisiana State University (CCT/LSU)
211 Johnston Hall, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 Fax.: +1 225 578-5362
Quoting Werner Benger <wer...@cct.lsu.edu>:
>
> However, in practice it seems impractical to store images in anything
> other than FORTRAN ordering, as all image file formats and hardware that
> I know and use so far employ FORTRAN ordering anyway: the fastest index
> runs along a horizontal line, the second fastest index along the vertical
> direction, and the third coordinate is then the z or stack direction,
> such that indexing such a dataset in C directly yields an image:
>
> imagestack[Z][Y][X] --> imagestack[Z] is an image[Y][X]
>
> It appears complicated to me to use another convention here.
If you forget about the Z, Y, X labels and just consider the dimensions,
that is precisely the C ordering, not the Fortran ordering :-) In fact my
whole point (in my case) of storing the data as (n_images, n_rows,
n_columns) is to be able to have at imagestack[0] the first image, and
at imagestack[n] the nth image. If you translate that to display
coordinates, USUALLY the rows of an image will map to the Y coordinate
of a display and the columns will map to the X coordinate, therefore
arriving at what you assumed to be the logical order and thought was
the Fortran order.
I also thought that was the logical way of ordering data. But it does
not seem to be the case in electron microscopy, or I misunderstood
Matthew's talk:
http://www.medsbio.org/meetings/BSR_2007_imgCIF_Workshop/MTD_DLS_17Aug07_talk.pdf
If, when he talks about a Z stack of image data, he means Z,Y,X as you do,
then we are talking about the same thing, and almost nobody seems
to be storing stacks of images in Fortran order.
> For time-dependent image stacks, 4D datasets seem impractical to me,
> since they make insertion of images in between already sampled images
> difficult, do not make it easy to store attributes that change over
> time, and cannot accommodate image stacks that differ between time
> steps. It seems more flexible to use a group here with a set of 3D
> datasets (and appropriate attributes for each, such as the physical
> time as a float for each time instance, etc.), with no lack of
> functionality, except that it won't be easily possible to make a
> four-dimensional hyperslab, though this hasn't seemed important to me
> so far in my use cases.
I agree. In my case, 4D data usually correspond to spatial information
plus a measurement, and I do not find it practical to keep them all
together either. Usually I keep the spatial information separate from
the associated scalar, 1D or 2D measured value. For regularly spaced
data, that also helps in reducing the amount of data to handle.
Armando
With h5py it is possible to extend chunked datasets. What I am currently
doing for data acquisition is extending datasets one element at a time.
This may be inefficient, but some quick tests seemed to show that it is
not a problem for what I am doing (recording data about once a second).
I am not defining large datasets and then trimming them after the fact.
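For illustration, a minimal version of that acquisition pattern
(read_instrument is a made-up stand-in for the real data source):

    import h5py

    def read_instrument():
        # Stand-in for the real data source (hypothetical).
        for i in range(10):
            yield float(i)

    with h5py.File("acquisition.h5", "w") as f:
        # Start empty, unlimited along the first (and only) dimension.
        dset = f.create_dataset("readings", shape=(0,), maxshape=(None,),
                                dtype="float64", chunks=(64,))
        for value in read_instrument():
            dset.resize(dset.shape[0] + 1, axis=0)  # grow by one element
            dset[-1] = value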
Darren
>>>I also thought that was the logical way of ordering data. But it does not seem to be the case on electron microscopy or I misunderstood Matthew's talk: http://www.medsbio.org/meetings/BSR_2007_imgCIF_Workshop/MTD_DLS_17Aug07_talk.pdf
Hi Vicente,
Here is my point, regardless of the C/Fortran issue or the axis ordering of choice:
For acquisition, the images in microscopy are 2D. Storing 2D images (contiguous pixels) concatenated together to form a 3D stack is not a problem, plus it provides some computational & disk efficiencies.
After the stack of 2D images is processed through reconstruction, it forms a 3D image. At this point the XY, YZ, and XZ planes have the same informational relevance, regardless of the original axis of projection. There is a small exception to this in electron tomography, where the axis of projection is relevant/noticeable, but this does not change the underlying data organization problem.
As long as the 3D image fits into RAM, no problem.
If my 3D image is, say, 512 GB, I have a problem: most (99%) of computers cannot visualize the data.
To solve this I need to reorganize my image using three coordinated methods:
1) restructure the 3D image pixels as bricks rather than 2D planes
2) subsample the original 3D image, 2x,4x,8x,16x (also bricked)
3) put it all into a single HDF file so they are contextually linked.
At this point visualization software and other software methods have a functional shot at interacting with the 3D image.
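A rough h5py sketch of those three steps (the 64^3 brick size and the names are assumptions):

    import numpy as np
    import h5py

    volume = np.random.rand(256, 256, 256).astype("float32")  # stand-in 3D image

    with h5py.File("bricked_volume.h5", "w") as f:
        level, scale = volume, 1
        while True:
            # 1) + 3): store every level with cubic 64^3 "brick" chunks,
            # all inside the same file so they stay contextually linked.
            f.create_dataset("resolution_%dx" % scale, data=level,
                             chunks=(64, 64, 64))
            if min(level.shape) <= 64:
                break
            # 2): subsample by 2 in every direction for the next level.
            level = level[::2, ::2, ::2]
            scale *= 2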
Another point relating to the thread:
The MRC format, arguably the oldest digital microscopy format (1982), predating TIFF, uses a Fortran ordering like HDF5. Much of this can be attributed to Fortran existing before C. Now most of the code using MRC is C, so people are probably getting the ordering wrong when reading back Fortran-generated data. These programming language arrays are abstract; the association of XYZ or ZYX with the array ordering is an arbitrary call, frequently made by a lone programmer, because there are no standards. Intuition would dictate the obvious (XYZ), but since there are two methods (Fortran & C), obvious is not a consensus in contiguous computer storage. If all the data fits into RAM, who cares, as long as you write it as ZYX and you read it as ZYX? Overall this is much like the big/little-endian problem, which HDF deals with transparently. If the array ordering API in HDF is broken it needs to be fixed; if programmers are not using it, then they should rethink their approach.
These problems are why I am pushing for a common scientific image standard. Much of this is avoidable if you have rules of the road: digital wall plugs, 60 Hz/50 Hz/120 V/220 V. At this time there is no ISO spec to subscribe to. I strongly urge you to create a distinctive root group equivalent to your format design/community as a means to enforce your designs, provide exclusive documentation, and avoid namespace collisions with other formats using an HDF file at the same time. Think of it like internet domain names: without containment and organization, the internet would be extremely limited to only two nodes.
Matthew Dougherty
713-433-3849
National Center for Macromolecular Imaging
Baylor College of Medicine/Houston Texas USA
=========================================================================
-----Original Message-----
From: ma...@googlegroups.com on behalf of Vicente Sole
Sent: Wed 11/4/2009 2:29 PM
To: ma...@googlegroups.com
Subject: [MAHID] Re: Handling stacked (image) data
Ok, then I misunderstood your original request. I thought you wanted to
handle the situation of having foreseen a scan of npoints but, because
of an unexpected abort, having to trim the data to the actual number of
recorded points.
Armando
Except for point 1), we are following the same approach.
>
>
> Another point relating to the thread:
> The MRC format, arguably the oldest digital microscopy format (1982),
> predating TIFF, uses a Fortran ordering like HDF5. Much of this can
> be attributed to Fortran existing before C. Now most of the code
> using MRC is C, so people are probably getting the ordering wrong
> when reading back Fortran-generated data. These programming
> language arrays are abstract; the association of XYZ or ZYX with the
> array ordering is an arbitrary call, frequently made by a lone
> programmer, because there are no standards. Intuition would dictate
> the obvious (XYZ), but since there are two methods (Fortran & C),
> obvious is not a consensus in contiguous computer storage. If all the
> data fits into RAM, who cares, as long as you write it as ZYX and you
> read it as ZYX? Overall this is much like the big/little-endian
> problem, which HDF deals with transparently. If the array ordering
> API in HDF is broken it needs to be fixed; if programmers are not
> using it, then they should rethink their approach.
>
>
I do not think the array ordering is broken. If it is just a question of
syntax, as Mark is saying, then we are already doing the proper thing,
because I doubt you are storing your image stack as (Z,X,Y) following
your syntax. Most likely you are writing it as (X,Y,Z) in your syntax,
which corresponds to (Z,Y,X) in mine, and both of us are saying the same
thing.
All my concerns about the need for something to indicate the dimension
holding the image index came from huge differences in handling data
loaded dynamically, a situation I face when the data do not fit into
memory, which happens quite soon when using 32-bit machines. I would not
be able to properly interact with your 512 GB dataset if stored as
(Z, X, Y), for instance, but, speed issues aside, I would be able to
deal with it if stored as (X, Y, Z), always considering your syntax.
>
> These problems are why I am pushing for a common scientific image
> standard. Much of this is avoidable if you have rules of the road:
> digital wall plugs, 60 Hz/50 Hz/120 V/220 V. At this time there is no ISO
> spec to subscribe to. I strongly urge you to create a distinctive
> root group equivalent to your format design/community as a means to
> enforce your designs, provide exclusive documentation, and avoid
> namespace collisions with other formats using an HDF file at the same
> time. Think of it like internet domain names: without containment and
> organization, the internet would be extremely limited to only two nodes.
>
These are things we have to discuss. My personal feeling is that I
should be able to use the NeXus root group and the NXentry. Then one
could use the rest of the NeXus groups as they are if one wishes to do
so, but with the possibility of defining new groups, not labeled with
the NX_CLASS attribute NXwhatever unless recognized and accepted by the
NIAC. That should prevent any namespace collision. If the NIAC
explicitly forbids me that use, I will have to give the opposite advice
about using NeXus, but looking at the NIAC presentations
http://wiki.cacr.caltech.edu/danse/images/7/71/StateOfNeXus.ppt
it seems I am respecting their format because I would not be breaking
their API.
Nevertheless, from the analysis software developer's side, it would be a
pity to limit analysis capabilities to a particular format defined on
this mailing list. My intention is to offer generic but probably
cumbersome HDF5 support alongside more specific NeXus/MAHID/whatever
support requiring less user interaction. That way any scientific
community could benefit from any work arising from this mailing list.
Hi Armando,
Working under NXroot means playing by NeXus rules, and I would encourage it.
My NeXus strategy for electron microscopy is:
1) move MRC-formatted data to HDF in a way the acquisition & viz people can live with (performance & simplicity)
2) get the EM manufacturers committed to NeXus in order to devise the instrument class
3) graft the MRC/HDF pixels into the NeXus framework, preferably with #2's involvement.
Or to say it less subtly: for the EM acquisition community the case can be made to use NeXus, but it will require a considerable amount of persuasion to secure commitments. For downstream communities such as 3D visualization or biological analysis, it would be a long shot at best; they are using completely different frameworks and unrelated metadata, having next to zero need for NeXus metadata. The primary connection between the communities that form the image pipeline is the images, not a community's metadata (unless you are in that community).
Thanks for the PPT reference, State of NeXus.
Matthew Dougherty
713-433-3849
National Center for Macromolecular Imaging
Baylor College of Medicine/Houston Texas USA
=========================================================================
What request?
> I thought you wanted to
> handle the situation of having foreseen a scan of npoints but, because
> of an unexpected abort, having to trim the data to the actual number
> of recorded points.
The way I do it is: if I know the dimensions ahead of time, I set the
maxdims argument for h5py accordingly, but initially create a small
dataset. Then the dataset can grow to those maxdims. It is also
possible to grow indefinitely, as long as the maxdims are not set to
constrain you.
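For example, something along these lines (in the h5py call the keyword is
maxshape; npoints and the shapes are example values):

    import h5py

    npoints = 1000  # foreseen scan length (example value)

    with h5py.File("scan.h5", "w") as f:
        # Start with a single image; the dataset may grow up to npoints.
        dset = f.create_dataset("scan", shape=(1, 512, 512),
                                maxshape=(npoints, 512, 512),
                                dtype="float32", chunks=(1, 512, 512))
        dset.resize((2, 512, 512))  # grow by one image as points arrive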
I saved the links to the two presentations Armando mentioned (see below) at
http://groups.google.com/group/mahid/web/talks-presentations-posters
This should allow us to quickly find them again without browsing through
all the emails - which is especially good if someone new wants to get a
quick overview. In case there is something to add or correct and you
are not able to do it yourself, please send me an email directly - not
via the group's email.
Thanks, Gerd
>> does not seem to be the case in electron microscopy or I misunderstood
>> Matthew's talk:
>> http://www.medsbio.org/meetings/BSR_2007_imgCIF_Workshop/MTD_DLS_17Aug07_talk.pdf
Mmh, my information is that they did that for HDF4, but they don't have plans
to do the same for HDF5:
http://mail.hdfgroup.org/pipermail/hdf-forum_hdfgroup.org/2008-May/001166.html
although they recognize that having an attribute with a reserved name
describing how the data is ordered would be a good idea.
However, I still think that the current HDF5 approach to dealing with
Fortran/C orderings, even if an attribute is reserved, is ill-designed and can
be a source of confusion for many people. I expressed my thoughts here:
http://mail.hdfgroup.org/pipermail/hdf-forum_hdfgroup.org/2008-June/001164.html
but I don't think they are going down this route (unless they have changed
their minds recently). So, I think it would be a good idea to express interest
in the reserved-attribute convention for specifying C/Fortran (or, if you
prefer, row-major/column-major) ordering.
At any rate, and for this concrete case, I think it is fine to keep adding
another attribute specifying which dimension the image stack follows.
Cheers,
--
Francesc Alted
Please, can someone using MATLAB and/or FORTRAN and/or IDL tell us the
dimensions reported by their tool when opening the data contained in the
Daphnia_float32.h5 dataset found on the datasets page:
http://groups.google.fr/group/mahid/web/datasets
BTW, I did not find a direct link from the main web page to the datasets
page. I was expecting to reach the page via the Files link, which
currently points to a page with just one PDF.
Armando
V. Armando Solé wrote:
> BTW, I did not find a direct link from the main web page to the datasets
> page. I was expecting to reach the page via the Files link, which
> currently points to a page with just one PDF.
This is because we cannot use Google itself to host the datasets - the
available storage space is just far too small (100 MB). So I have
created pages containing links to the datasets, and all those pages can
be found under "Pages" in the main menu.
I will now:
* write a comment on the main page
* write a comment on the Files page
in order to make sure everybody can find the links.
BTW, if somebody has datasets to share, I can host + link them.
Currently we are using some DESY resources to do the hosting.
Wellenreuther, Gerd wrote:
> V. Armando Solé wrote:
>
>> BTW, I did not find a direct link from the main web page to the datasets
>> page. I was expecting to reach the page via the Files link, which
>> currently points to a page with just one PDF.
>>
>
> This is because we cannot use Google itself to host the datasets - the
> available storage space is just far too small (100 MB).
Is it possible to place a link as a file on the Files page? That could
solve the problem in an elegant manner by linking to the datasets page.
Armando