January 2010 - V.A. Sole HDF5 Workshop Report


"V. Armando Solé"

Jan 18, 2010, 8:28:48 AM
to ma...@googlegroups.com
Dear colleagues,

Please find attached my report on the workshop on HDF5 that took place
in Grenoble from January 11 to January 13, 2010.

Gerd and I will try to set up a way to share the talks for which
participants have supplied the slides.

Best wishes,

Armando

Report.odt

Elena

Feb 4, 2010, 1:31:40 PM
to Methods for the analysis of hyperspectral image data, epou...@hdfgroup.org
Dear members of the MAHID community,

Several long-term users of HDF5 brought to our attention concerns
expressed during the “HDF5 as hyperspectral data exchange and analysis
format” Workshop and encouraged us to address the issues on this
mailing list.

I will try to address three requests mentioned in Armando’s report and
will be more than happy to answer any other HDF5 questions you have.

1. “We request the HDF-group to keep the maximum value of the version
somewhere easily accessible in the file.”

When we were working on the HDF5 file format design, we specifically
voted against storing the library version in the file. This was done to
achieve better forward compatibility - old versions of the library can
read some data in newer files. The topic of forward/backward
compatibility is complex and needs another discussion thread; I will
be happy to answer your questions about how The HDF Group addresses
it.

To give you an example of why storing the version number is not very
helpful: if an application is linked with the 1.8.4 library and doesn't
use any new features, the file it creates can be read by the 1.6.10
library and earlier. A file created by the 1.6.10 library can be
modified by an application linked with the 1.8.4 library by adding new
objects like external links, which will make some part of the file
unavailable to the 1.6.10 library. And that is another change to the
file format! ☺

To address the problem, we encourage users to store the version of the
library that creates a file in an attribute or in the user block if
necessary. For our part, we will be providing a tool that reports the
earliest version of the HDF5 library that can read all data in a file
(the latest version should ALWAYS read all data in a file).
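Elena's suggestion can be sketched from Python with the h5py bindings. This is my illustration, not something from The HDF Group; the attribute name `hdf5_library_version` is an arbitrary convention of the sketch, not an HDF5 standard:

```python
import h5py

def create_file_with_version(path):
    """Create an HDF5 file and record the creating library's version
    in a root-group attribute, as suggested above. The attribute name
    is this sketch's own convention, not an HDF5 standard."""
    with h5py.File(path, "w") as f:
        f.attrs["hdf5_library_version"] = h5py.version.hdf5_version

def read_recorded_version(path):
    """Return the version string recorded at file-creation time."""
    with h5py.File(path, "r") as f:
        return f.attrs["hdf5_library_version"]
```

Note that this records only the version that *created* the file; as Elena's example shows, a later writer using newer features can still invalidate that information, which is why the planned "earliest readable version" tool is the more robust check.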


2. “There is a set_version() call in the HDF5 API that limits the
variability of versions that can be used with an HDF5 file. This
functionality is already there, not sure if the command line utility
support repack supports it yet. Would probably be easy to add such an
option. “

The H5Pset_libver_bounds function http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetLibverBounds
will be enhanced in the 1.10 release to allow the library to create
objects compatible with the 1.8.* libraries. Tools like h5repack can
be enhanced accordingly.
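For what it's worth, the version bounding behind H5Pset_libver_bounds is already reachable from Python: h5py exposes it through the `libver` argument of `h5py.File`. A hedged sketch (the wrapper function name is mine):

```python
import h5py

def create_bounded_file(path, latest=False):
    """Create a file with explicit library-version bounds.
    h5py's 'libver' keyword wraps H5Pset_libver_bounds: the default
    writes every object in the earliest format that can represent it
    (maximum forward compatibility), while 'latest' allows newer,
    more efficient on-disk structures at the cost of readability
    by old libraries."""
    if latest:
        return h5py.File(path, "w", libver="latest")
    return h5py.File(path, "w")  # default bounds: earliest-compatible
```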

3. “There was comment suggesting to allow saving/converting/
downgrading an HDF5 file to a particular version. Something similar to
what is commonly made in the text editors world. That would be very
convenient, although we do not know if it is possible to implement.”

This is an excellent suggestion that we will definitely consider.

I would also like to add that we work very closely with MATLAB and IDL
developers to help them move their software to the latest versions of
HDF5. We encourage you to contact their support when you have any
problems with reading/manipulating HDF files.

Our Help Desk he...@hdfgroup.org is one of your best resources. We
will be more than happy to help.

Sincerely,

Elena Pourmal on behalf of The HDF Group.


Darren Dale

Feb 4, 2010, 2:31:28 PM
to ma...@googlegroups.com, epou...@hdfgroup.org
Hello Elena,

On Thu, Feb 4, 2010 at 1:31 PM, Elena <epou...@gmail.com> wrote:
> Dear members of the MAHID community,
>
> Several long-term users of HDF5 brought to our attention concerns
> expressed during the “HDF5 as hyperspectral data exchange and analysis
> format” Workshop and encouraged us to address the issues on this
> mailing list.
>
> I will try to address three requests mentioned in Armando’s report and
> will be more than happy to answer any other HDF5 questions you have.
>
> 1.      “We request the HDF-group to keep the maximum value of the version
> somewhere easily accessible in the file.”
>
> When we were working on the HDF5 file format design, we specifically
> voted against storing library version in the file. This was done to
> achieve better forward compatibility - old versions of the library can
> read some data in the new files. The topic of forward/backward
> compatibility is complex and needs another discussion thread; I will
> be happy to answer your questions about how The HDF Group addresses
> it.

Does that mean that in such a modified file, all of the data that was
created with an old version of the library will always be accessible
by that older library version (unless it is deleted and replaced with
an incompatible element, of course)?

There was one other issue that we discussed at the workshop and I have
been meaning to follow up with the HDF Group. It seems that since the
HDF5 library does some caching in order to improve efficiency, it is
possible to open a single file with two separate processes or
programs, attempt to modify the file in both processes, and end up
corrupting it. Does the HDF5 library have a mechanism to protect
against this scenario?

Thanks,
Darren

Vicente Sole

Feb 4, 2010, 2:37:44 PM
to ma...@googlegroups.com, Elena, Methods for the analysis of hyperspectral image data, epou...@hdfgroup.org
Dear Elena,

First of all, thank you very much for your feedback.

Quoting Elena <epou...@gmail.com>:

>
> To give you an example why storing version number is not very helpful:
> if an application is linked with the 1.8.4 library and doesn't use any
> new features, the file it creates can be read by the 1.6.10 library
> and earlier. File created by the 1.6.10 library can be modified by an
> application linked with the 1.8.4 library by adding new objects like
> external links, which will make some part of the file unavailable to
> the 1.6.10 library. And it is another change to the file format! ☺


> To address the problem, we encourage users to store the version of the
> library that creates a file in an attribute or in the user-block if
> necessary.

One of the ideas behind having that information is the possibility of
limiting newer versions of the library to the features available when
the file was last modified. We do add version information at the root
level, but I guess if a program with a more recent version of the
library modifies/adds a group at a deeper level, that information will
be useless.

> On our part we will be coming with a tool that reports the
> earliest version of the HDF5 library that can read all data in a file
> (the latest version should ALWAYS read all data in a file).
>

That tool will already be very helpful for checking whether generated
files may cause trouble with particular programs/environments.

>
> 2. "There is a set_version() call in the HDF5 API that limits the
> variability of versions that can be used with an HDF5 file. This
> functionality is already there, not sure if the command line utility
> support repack supports it yet. Would probably be easy to add such an
> option."


>
> The H5Pset_libver_bounds function
> http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetLibverBounds
> will be enhanced in 1.10 release to allow the library to create
> objects compatible with the 1.8.* libraries. Tools, like h5repack, can
> be enhanced accordingly.
>

Nice.

> 3. "There was comment suggesting to allow saving/converting/
> downgrading an HDF5 file to a particular version. Something similar to
> what is commonly made in the text editors world. That would be very
> convenient, although we do not know if it is possible to implement."


>
> This is an excellent suggestion that we will definitely consider.
>

We do appreciate that The HDF Group is doing very good work. I think it
has just been unfortunate that, at the moment we got involved, some
very unusual issues were present.

Thanks again for your time and kindness.

Sincerely,

Armando


Darren Dale

Feb 4, 2010, 8:21:54 PM
to Elena Pourmal, ma...@googlegroups.com
Hi Elena,

On Thu, Feb 4, 2010 at 5:43 PM, Elena Pourmal <epou...@hdfgroup.org> wrote:


> On Feb 4, 2010, at 1:31 PM, Darren Dale wrote:
>> There was one other issue that we discussed at the workshop and I have
>> been meaning to follow up with the HDF Group. It seems that since the
>> HDF5 library does some caching in order to improve efficiency, it is
>> possible to open a single file with two separate processes or
>> programs, attempt to modify the file in both processes, and end up
>> corrupting it. Does the HDF5 library have a mechanism to protect
>> against this scenario?
>>

> Currently "no". Applications can take several steps to be able to read while writing to the same file; see FAQ http://www.hdfgroup.org/hdf5-quest.html#grdwt
>
> We are working on the library feature that will allow multiple readers/one writer, but there are no plans to have multiple writers.

I'm actually not all that interested in having support for multiple
writers; it's the possibility of accidentally corrupting data in this
way that concerns me. Instead of attempting to support multiple
writers, would it be possible for the HDF5 library to create a lock
file that would prevent more than one process from opening a file with
write capabilities?
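HDF5 provides no such lock file; as a sketch of the convention Darren is proposing (the `.lock` suffix and the class are hypothetical, not an HDF5 API):

```python
import os

class WriteLock:
    """Advisory lock-file sketch for single-writer access.
    All cooperating programs must agree on the same '<file>.lock'
    convention; nothing stops a program that ignores it from
    opening the HDF5 file directly."""

    def __init__(self, h5_path):
        self.lock_path = h5_path + ".lock"
        self.fd = None

    def __enter__(self):
        # O_EXCL makes creation atomic: it raises FileExistsError
        # if another process already holds the lock.
        self.fd = os.open(self.lock_path,
                          os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.write(self.fd, str(os.getpid()).encode())
        return self

    def __exit__(self, *exc):
        os.close(self.fd)
        os.remove(self.lock_path)
```

A writer would wrap every open-for-write in `with WriteLock(path): ...`. A crashed writer leaves a stale lock behind, which is one of the problems a library-level implementation would have to solve portably.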

Thanks,
Darren

Darren Dale

Feb 4, 2010, 8:41:29 PM
to Elena Pourmal, ma...@googlegroups.com
On Thu, Feb 4, 2010 at 5:43 PM, Elena Pourmal <epou...@hdfgroup.org> wrote:
> On Feb 4, 2010, at 1:31 PM, Darren Dale wrote:
>> Does that mean that in such a modified file, all of the data that was
>> created with an old version of the library will always be accessible
>> by that older library version (unless it is deleted and replaced with
>> an incompatible element, of course)?
>>
> Yes, but sometimes it is not very clear and we should provide a better documentation.
[...]
> So the summary is: if the root group is modified with the new feature, the whole
> file becomes inaccessible. If the group that contains an object is modified,
> then the object cannot be accessed with the old library.

It is situations like this that we wish to avoid. If adding some 1.8
feature simply meant that that entry in the file was inaccessible or
even transparent to the 1.6 API, but everything created with 1.6 was
still accessible, I guess there would not be a problem. But your
example shows that adding a 1.8 feature, like the external link above,
appears to affect the group that contains the new feature as well. /G1
was created with the 1.6 API and is now inaccessible from the 1.6
library. Isn't this a problem?

Darren

Elena Pourmal

Feb 4, 2010, 5:43:14 PM
to Darren Dale, ma...@googlegroups.com
Hello Darren,

On Feb 4, 2010, at 1:31 PM, Darren Dale wrote:

> Hello Elena,
>
> On Thu, Feb 4, 2010 at 1:31 PM, Elena <epou...@gmail.com> wrote:
>> Dear members of the MAHID community,
>>
>> Several long-term users of HDF5 brought to our attention concerns
>> expressed during the “HDF5 as hyperspectral data exchange and analysis
>> format” Workshop and encouraged us to address the issues on this
>> mailing list.
>>
>> I will try to address three requests mentioned in Armando’s report and
>> will be more than happy to answer any other HDF5 questions you have.
>>
>> 1. “We request the HDF-group to keep the maximum value of the version
>> somewhere easily accessible in the file.”
>>
>> When we were working on the HDF5 file format design, we specifically
>> voted against storing library version in the file. This was done to
>> achieve better forward compatibility - old versions of the library can
>> read some data in the new files. The topic of forward/backward
>> compatibility is complex and needs another discussion thread; I will
>> be happy to answer your questions about how The HDF Group addresses
>> it.
>
> Does that mean that in such a modified file, all of the data that was
> created with an old version of the library will always be accessible
> by that older library version (unless it is deleted and replaced with
> an incompatible element, of course)?
>

Yes, but sometimes it is not very clear and we should provide better documentation.

Here is a very simple example:

I created a file, h5ex_g_create.h5, that has one group, /G1, using the 1.6 version.
Then I used the 1.8 version to insert an external link (this feature is available in 1.8 only and doesn't exist in 1.6).

Here is the output of h5dump for the original file (both the 1.6 and 1.8 dumper versions can be used):

HDF5 "h5ex_g_create.h5" {
GROUP "/" {
   GROUP "G1" {
   }
}
}

After I added an external link and tried to dump the file with the 1.6 version of the dumper, I got the following:

HDF5 "h5ex_g_create.h5" {
GROUP "/" {
   h5dump error: unknown object "G1"
}
}

The dumper recognizes that G1 was converted to the new format to accommodate the external link.

The 1.8 version reads the file successfully:

HDF5 "h5ex_g_create.h5" {
GROUP "/" {
   GROUP "G1" {
      EXTERNAL_LINK "ext_dangle" {
         TARGETFILE "foo.h5"
         TARGETPATH "/group"
      }
   }
}
}

Now I add an external link to the root group, and the 1.6 dumper chokes:

h5dump error: internal error (file h5dump.c:line 3798)

1.8 shows

HDF5 "h5ex_g_create.h5" {
GROUP "/" {
   GROUP "G1" {
      EXTERNAL_LINK "ext_dangle" {
         TARGETFILE "foo.h5"
         TARGETPATH "/group"
      }
   }
   EXTERNAL_LINK "ext_dangle" {
      TARGETFILE "foo.h5"
      TARGETPATH "/group"
   }
}
}

Well... ;-) Clearly we need to do some work to report a more meaningful error.

So the summary is: if the root group is modified with the new feature, the whole file becomes inaccessible. If the group that contains an object is modified, then the object cannot be accessed with the old library.
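Elena's walk-through can be reproduced from Python via h5py, which wraps the 1.8 link API. A sketch (the file and link names follow her example; the helper function is mine):

```python
import h5py

def upgrade_group_with_external_link(path):
    """Re-create Elena's scenario: a plain group /G1 gains an
    external link, which forces the group onto the newer (1.8)
    on-disk format that 1.6 readers cannot parse."""
    with h5py.File(path, "w") as f:
        f.create_group("G1")  # readable by 1.6 as created
    with h5py.File(path, "a") as f:
        # Adding the link "upgrades" G1's format. The dangling
        # target foo.h5 need not exist until the link is traversed.
        f["G1"]["ext_dangle"] = h5py.ExternalLink("foo.h5", "/group")
```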

> There was one other issue that we discussed at the workshop and I have


> been meaning to follow up with the HDF Group. It seems that since the
> HDF5 library does some caching in order to improve efficiency, it is
> possible to open a single file with two separate processes or
> programs, attempt to modify the file in both processes, and end up
> corrupting it. Does the HDF5 library have a mechanism to protect
> against this scenario?
>

Currently "no". Applications can take several steps to be able to read while writing to the same file; see FAQ http://www.hdfgroup.org/hdf5-quest.html#grdwt

We are working on the library feature that will allow multiple readers/one writer, but there are no plans to have multiple writers.

Elena
> Thanks,
> Darren

Elena Pourmal

Feb 4, 2010, 6:04:38 PM
to Vicente Sole, ma...@googlegroups.com
Hello Armando,

On Feb 4, 2010, at 1:37 PM, Vicente Sole wrote:

> Dear Elena,
>
> First of all, thank you very much for your feedback.
>
> Quoting Elena <epou...@gmail.com>:
>
>> To give you an example why storing version number is not very helpful:
>> if an application is linked with the 1.8.4 library and doesn't use any
>> new features, the file it creates can be read by the 1.6.10 library
>> and earlier. File created by the 1.6.10 library can be modified by an
>> application linked with the 1.8.4 library by adding new objects like
>> external links, which will make some part of the file unavailable to
>> the 1.6.10 library. And it is another change to the file format! ☺
>> To address the problem, we encourage users to store the version of the
>> library that creates a file in an attribute or in the user-block if
>> necessary.
>
> One of the ideas behind having that information is to have the possibility to limit newer versions of the library to the features available when the file was last modified. We do add version information at the root level, but I guess if a program with a more recent version of the library modifies/adds a group at a deeper level, that information will be useless.

Yes. That is exactly why the "check" tool will be helpful to find out which library is needed.

>> On our part we will be coming with a tool that reports the
>> earliest version of the HDF5 library that can read all data in a file
>> (the latest version should ALWAYS read all data in a file).
>
> That tool will already be very helpful to check if generated files may give troubles with particular programs/environments.
>
>> 2. "There is a set_version() call in the HDF5 API that limits the
>> variability of versions that can be used with an HDF5 file. This
>> functionality is already there, not sure if the command line utility
>> support repack supports it yet. Would probably be easy to add such an
>> option."
>>
>> The H5Pset_libver_bounds function http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetLibverBounds
>> will be enhanced in 1.10 release to allow the library to create
>> objects compatible with the 1.8.* libraries. Tools, like h5repack, can
>> be enhanced accordingly.
>
> Nice.
>
>> 3. "There was comment suggesting to allow saving/converting/
>> downgrading an HDF5 file to a particular version. Something similar to
>> what is commonly made in the text editors world. That would be very
>> convenient, although we do not know if it is possible to implement."
>>
>> This is an excellent suggestion that we will definitely consider.
>
> We do appreciate that The HDF Group is doing very good work. I think it has just been unfortunate that, at the moment we got involved, some very unusual issues were present.

Well, you are not alone :-) Many of our users faced the same issues. 

Forward/backward compatibility is difficult. Our group learned a lot when we moved from 1.6 to 1.8. We think we did a decent job handling API forward compatibilities by introducing API versioning, but file format forward compatibility still requires a lot of work.

We already provide the standalone h5check tool (which does not use the HDF5 library itself) to validate HDF5 files (i.e., to tell whether a file is corrupted or the library reading it is simply too old).
A few improvements that are on our list:

1. Improve current tools to better handle "unknown" objects.
2. Enhance the H5Pset_libver_bounds function to allow applications to create files accessible by a specific HDF5 version X.Y.Z.
3. Provide a tool that identifies the earliest version of the HDF5 library needed to access a file (it may be an enhancement to the current h5check).

Thank you!


Elena


Elena Pourmal

Feb 4, 2010, 10:49:15 PM
to Darren Dale, ma...@googlegroups.com
Hi Darren,

Currently HDF5 doesn't have this capability and it is left to application(s) to implement.

If you are interested in a multithreaded application, then HDF5 can be built in a thread-safe mode that allows only one thread at a time to go into the file (which is different from what you are asking, but I thought it was worth mentioning here).
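An application can impose the same one-thread-at-a-time discipline itself. A minimal pure-Python sketch (the helper name is mine, and the actual HDF5 I/O is left as a placeholder callable):

```python
import threading

# One process-wide lock: only one thread performs HDF5 I/O at a time,
# imitating what a thread-safe HDF5 build enforces internally. It does
# NOT protect against other *processes* opening the same file.
_h5_lock = threading.Lock()

def with_h5(operation):
    """Run 'operation' (any callable that does the actual HDF5 work -
    a placeholder here) while holding the global lock, and return
    its result."""
    with _h5_lock:
        return operation()
```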

Elena

> Thanks,
> Darren

Elena Pourmal

Feb 4, 2010, 11:48:28 PM
to Darren Dale, ma...@googlegroups.com
Hi Darren,

Good question :-)

You are right that this is a problem, because changing a group in a hierarchy affects not only the group and its immediate members, but every object further down the tree (unless the object has another path to it).

On the other hand, the 1.8 library doesn't know if the file was created by 1.6, since the same application linked with the 1.8 library will create exactly the same file. (Aside: by default, HDF5 creates all objects using the earliest version available for that object; this way the old library, which knows about the object, can always read the object created by the new library - forward compatibility.) Since the application instructs the library to create an external link, the group is "upgraded" to the new format in order to store the link, making it "unknown" to the old library; i.e., HDF5 did what it was asked to do - upgrade.

We do not claim that a file touched by the new library is always accessible with the old library; we only claim that this is true when NO new features are used. Please let us know right away if you encounter a situation where a "round trip" creates a "corrupted" file.

In general, file accessibility with the old library depends on the features used by the new library, and this should be expected, right?

There are several ways to prevent this type of situation:

1. The community agrees on a common interface, like NeXus. If all files are accessed through that interface, there is no chance of "corruption".
2. If there is no such interface, then there should at least be an agreement on which features are used for which objects.
3. I mentioned the H5Pset_libver_bounds function in my previous email; i.e., HDF5 has better version control available to applications.
4. We (The HDF Group) need to educate HDF5 users about backward/forward compatibility and provide the documentation and tools needed.

Please let me know if I completely confuse you with my explanations.

Elena

> Darren

Darren Dale

Feb 5, 2010, 7:45:37 AM
to Elena Pourmal, ma...@googlegroups.com
On Thu, Feb 4, 2010 at 10:49 PM, Elena Pourmal <epou...@hdfgroup.org> wrote:
> Hi Darren,
>
> On Feb 4, 2010, at 7:21 PM, Darren Dale wrote:
>
>> Hi Elena,
>>
>> On Thu, Feb 4, 2010 at 5:43 PM, Elena Pourmal <epou...@hdfgroup.org> wrote:
>>> On Feb 4, 2010, at 1:31 PM, Darren Dale wrote:
>>>> There was one other issue that we discussed at the workshop and I have
>>>> been meaning to follow up with the HDF Group. It seems that since the
>>>> HDF5 library does some caching in order to improve efficiency, it is
>>>> possible to open a single file with two separate processes or
>>>> programs, attempt to modify the file in both processes, and end up
>>>> corrupting it. Does the HDF5 library have a mechanism to protect
>>>> against this scenario?
>>>>
>>> Currently "no". Applications can take several steps to be able to read while writing to the same file; see FAQ http://www.hdfgroup.org/hdf5-quest.html#grdwt
>>>
>>> We are working on the library feature that will allow multiple readers/one writer, but there are no plans to have multiple writers.
>>
>> I'm actually not all that interested in having support for multiple
>> writers, its the possibility of accidentally corrupting data in this
>> way that is concerning to me. Instead of attempting to support
>> multiple writers, would it be possible for the HDF5 library to create
>> a lock file that would prevent more than one process from opening a
>> file with write capabilities?
>>
> Currently HDF5 doesn't have this capability and it is left to application(s) to implement.

I would like to follow up on this one last time and ask if The HDF
Group would please consider implementing this capability. It is not
sufficient to implement such a lock at the application level: I might
unintentionally open the file with two completely different
applications (say MATLAB and Python), in which case the lock provides
no protection. If instead the underlying HDF5 library created the lock
file, the data would be protected (even when opening with two
completely different installations of the HDF5 library, because HDF5
would determine the name and path conventions).

Best regards,
Darren

Matt Newville

Feb 5, 2010, 1:20:29 PM
to ma...@googlegroups.com, epou...@hdfgroup.org
Hi Elena,

Thanks for your message.

> 1.      “We request the HDF-group to keep the maximum value of the version
> somewhere easily accessible in the file.”
>
> When we were working on the HDF5 file format design, we specifically
> voted against storing library version in the file. This was done to
> achieve better forward compatibility - old versions of the library can
> read some data in the new files. The topic of forward/backward
> compatibility is complex and needs another discussion thread; I will
> be happy to answer your questions about how The HDF Group addresses
> it.
>

> <snip>


>
> To address the problem, we encourage users to store the version of the
> library that creates a file in an attribute or in the user-block if
> necessary. On our part we will be coming with a tool that reports the
> earliest version of the HDF5 library that can read all data in a file
> (the latest version should ALWAYS read all data in a file).

It's a bit confusing to me to see you explain that The HDF Group
considered and rejected using file versions and then recommend that
users do this themselves. Wouldn't the same problems arise, and
wouldn't it be less robust to have users store the version information?

As you say, there is not necessarily one version of the library that
creates a file. But there is one version of the library that creates
(or replaces) a particular HDF5 object: these *could* be versioned,
and those versions used to detect forward incompatibilities (without
crashing). But the HDF5 objects are not versioned in a way that can
be detected through the API. This would be very helpful.

Tools that report information about HDF5 are good for developers, but
they are not a substitute for getting version information through the
API.

> 2.      “There is a set_version() call in the HDF5 API that limits the
> variability of versions that can be used with an HDF5 file. This
> functionality is already there, not sure if the command line utility
> support repack supports it yet. Would probably be easy to add such an
> option. “
>
>  The H5Pset_libver_bounds function
> http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetLibverBounds
> will be enhanced in 1.10 release to allow the library to create
> objects compatible with the 1.8.* libraries. Tools, like h5repack, can
> be enhanced accordingly.

This is a nice feature of the library, but it doesn't address our main
concern. This approach limits what versions of objects a particular
application will write. It does not prevent objects in a particular
file from being altered by an application in such a way that other
applications may no longer be able to read them. This has happened,
and it is our concern.

Allow me to explain our use case: We are looking to be able to share
datafiles created from multiple sources, to be used in multiple
applications by very many scientists who have no knowledge of HDF5.
HDF5 appears to be a good choice for the base format but, of course,
HDF5 is not the main point of the applications. The multiple sources
here are facilities around the world with a variety of software
infrastructures. Data producing applications will be custom-built,
many by local scientists, some by professional developers. The data
reading codes will also be many and various in the toolkits used,
almost all written by scientists. Again, HDF5 is (and should be)
viewed as a detail of how the data happens to be stored. Users of the
applications will not know that a program called h5dump exists.

There is simply no way to have all of these data-producing and
data-reading applications linked against the same version of HDF5.
Many installations of complex proprietary environments (MATLAB, IDL,
IgorPro, etc.) and many open-source providers will use slightly
out-of-date versions of the library. There is no way around it and no
point pretending that multiple applications could use the same version
of the library. The situation is exacerbated by The HDF Group's
simultaneous support for two libraries that have incompatibilities, so
that even asking which version of HDF5 is "the latest" is open for
debate.

One suggestion would be to be able to "lock" a file so that all
objects added (by any version of the API) would be guaranteed to be
readable by a specified version of the API. From our point of view,
this would be ideal.

Another option would be to be able to determine (in the API) versions
of the objects, so that a program could know and report why an object
couldn't be read.

I don't know whether the first option is feasible, but I'd be
surprised if the second was not feasible. I am sort of amazed that
we'd be the first people to ask to determine versions of objects.

Do you have any suggestion for how we should share HDF5 files?

Later you also wrote:

> In general, file accessibility with the old library depends on the
> features used by the new library, and this should be expected, right?

Well, if a file that was created by Application A is opened by
Application B, which adds some new data to the file, it might be
reasonable for the new data to be unreadable by Application A. It is
certainly not reasonable for the old data to be unreadable by
Application A. It is also reasonable to expect that the library used
by Application B could automatically detect from the file which
version of the data type it writes would be compatible with the
other objects in the file.

> There are several ways to prevent this type of situation:
>
> 1. Community agrees on common interface like NeXUS. If all files are
> accessed through that interface, there is no chance of
> "corruption".

NeXus uses HDF5, and is therefore open to the same problem.
We are using multiple supported versions of the API (Python h5py,
Python PyTables, IDL, Matlab, ...), and so are not accessing files
through a common interface. We are asking for file-level versioning,
not API-level versioning.

> 2. If there is no such interface, then there should be
> at least an agreement which features for which objects are used.
>
> 3. I mentioned the H5Pset_libver_bounds function in my previous email,
> i.e., HDF5 has a better version control available to applications.

Is there version information available from a file?

> 4. We (The HDF Group) need to educate HDF5 users about backward/forward
> compatibility and provide the documentation and tools needed.

OK, but if the compatibility issues were detected and handled
gracefully, you wouldn't need to educate users.

Thanks,

--Matt Newville <newville at cars.uchicago.edu> 630-252-0431

Elena Pourmal

Feb 5, 2010, 5:19:24 PM
to Darren Dale, ma...@googlegroups.com
Darren,

I entered your request into our issues database. But please understand that it is not a trivial enhancement and will require a substantial development effort (read: it needs funding for studying the problem, prototyping solutions (which OS, which file system to target, portability, etc.), testing, and further maintenance). We will be more than happy to work with the HDF5 user community on this issue.

Elena


> Best regards,
> Darren

Darren Dale

Feb 5, 2010, 5:23:56 PM
to Elena Pourmal, ma...@googlegroups.com

Thank you very much!

Darren

Elena Pourmal

Feb 5, 2010, 7:34:49 PM
to Matt Newville, ma...@googlegroups.com
Hi Matt,

On Feb 5, 2010, at 12:20 PM, Matt Newville wrote:

> Hi Elena,
>
> Thanks for your message.
>
>> 1. "We request the HDF-group to keep the maximum value of the version
>> somewhere easily accessible in the file."
>>
>> When we were working on the HDF5 file format design, we specifically
>> voted against storing library version in the file. This was done to
>> achieve better forward compatibility - old versions of the library can
>> read some data in the new files. The topic of forward/backward
>> compatibility is complex and needs another discussion thread; I will
>> be happy to answer your questions about how The HDF Group addresses
>> it.
>>
>> <snip>
>>
>> To address the problem, we encourage users to store the version of the
>> library that creates a file in an attribute or in the user-block if
>> necessary. On our part we will be coming with a tool that reports the
>> earliest version of the HDF5 library that can read all data in a file
>> (the latest version should ALWAYS read all data in a file).
>
> It's a bit confusing to me to see you explain that HDFGroup considered
> and rejected using file versions and then recommend that users do this
> themselves. Wouldn't the same problems arise, and it be less robust
> to have users store the version information?

Yes, it would be the same problem. But very often files are used as "read only", and then a version number that can be found without the HDF5 library (stored, for example, in the user block) is very helpful.

> As you say, there is not necessarily one version of the library that
> creates a file. But, there is one version of the library that creates
> (or replaces) a particular HDF5 object: these *could* be versioned,
> and these versions used to detect forward incompatibilities (without
> crashing). But, the HDF5 objects are not versioned in a way that can
> be detected through the API. This would be very helpful.

Well... we need to go into more detail :-)

HDF5 objects - datasets, groups, named datatypes - are composed of so-called "header messages", and it is those messages that are versioned, not the HDF5 objects themselves. For example, datasets with compound datatypes can have datatype header messages with version 1 or 2. The versions of the header messages are not available through the public APIs.

A forward compatibility issue should never crash an application. The HDF5 library should handle it gracefully, and we work really hard to achieve that. If there are any cases where the library crashes instead of failing with the error stack (which can be handled by the user's application, by the way), we need to know right away. IT IS A BUG.


Tools that report information about HDF5 are good for developers, but
they are not a substitute for getting version information through the
API.

I am not sure how this would help. It should be the HDF5 library's responsibility to handle this and to provide useful information to the application.
Could you please provide use cases for such an API? That would help us to better understand your request.

2.      “There is a set_version() call in the HDF5 API that limits the
variability of versions that can be used with an HDF5 file. This
functionality is already there; not sure if the command-line utility
h5repack supports it yet. Would probably be easy to add such an
option.”

 The H5Pset_libver_bounds function
 http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetLibverBounds
will be enhanced in the 1.10 release to allow the library to create
objects compatible with the 1.8.* libraries. Tools like h5repack can
be enhanced accordingly.

This is a nice feature of the library, but it doesn't address our main
concern.  This approach limits what versions of objects a particular
application will write.  It does not prevent objects in a particular
file from being altered by an application in such a way that other
applications may no longer be able to read them.  This has happened,
and it is our concern.

Yes, we understand. It is a common problem that doesn't have many good solutions...

Different versions of MS Word (plus OpenOffice, plus going back and forth between
Windows and Mac) have the same problem, right? How do we deal with it? We call or email, asking for the file to be saved in a format we can read. Or we upgrade our MS Word version, or we downgrade the file. And we blame Microsoft (but keep using it... :-)
HDF5 is not different. Well... it is different, because in most cases it is almost transparent to the users. And that is what we would like to achieve. Give us another 10 years :-) joking...

This discussion is very timely, since we are thinking about how to handle the 1.10 release (which will have file format changes). Believe me, your concerns are taken very seriously.

I want to assure you that there will be NO file format changes in future releases of 1.8.* (unless we discover a bug that causes data corruption).


Allow me to explain our use case: We are looking to be able to share
datafiles created from multiple sources, to be used in multiple
applications by very many scientists who have no knowledge of HDF5.
HDF5 appears to be a good choice for the base format but, of course,
HDF5 is not the main point of the applications.  The multiple sources
here are facilities around the world with a variety of software
infrastructures.  Data producing applications will be custom-built,
many by local scientists, some by professional developers.  The data
reading codes will also be many and various in the toolkits used,
almost all written by scientists.  Again, HDF5 is (and should be)
viewed as a detail of how the data happens to be stored.  Users of the
applications will not know that a program called h5dump exists.

I understand; it is a very common situation. But users will use some application to read the HDF5 data, right? So if we (THG) provide tool and application builders with some means to deal with forward compatibility, that will help, right? Like when MS Word opens a newer file: it tells you that some features will be lost, but most of the data is still available.

There is simply no way to have all of these data producing and data
reading applications linked against the same version of HDF5.  Many
installations of complex proprietary environments (Matlab, IDL,
IgorPro, etc) and many open source providers will use slightly out of
date versions of the library.  There is no way around it and no point
pretending that multiple applications could use the same version of
the library.
True. 

One of the problems with IDL was that it used a buggy 1.6.3 for a very long time. It was our mistake in 1.6.3 that caused data corruption; we had to fix it by changing the file format in released software, which is against our own rules, but mistakes do happen.

 The situation is exacerbated by HDFGroup's simultaneous
support for two libraries that have incompatibilities, so that even
asking which version of HDF5 is "the latest" is open for debate.

Well... the libraries don't have incompatibilities. One of them has more powerful features; the old features WERE NOT changed.
There will be no more releases of 1.6.*; 1.6.10 was the last one. We hope this helps.
 
One suggestion would be to be able to "lock" a file so that all
objects added (by any version of the API) would be guaranteed to be
readable by a specified version of the API.  From our point of view,
this would be ideal.

Noted. I don't think we have a solution now. 
 
Another option would be to be able to determine (in the API) versions
of the objects, so that a program could know and report why an object
couldn't be read.

See above. Objects do not have versions per se. We believe the library should handle this.

I don't know whether the first option is feasible, but I'd be
surprised if the second was not feasible.  I am sort of amazed that
we'd be the first people to ask to determine versions of objects.

Do you have any suggestion for how we should share HDF5 files?


Reading the workshop notes, I saw that file sharing was successful :-)
I've already suggested that your community agree on the HDF5 features to be used; then it is easy to find the "earliest" version needed.
But if some of you use new features that "upgrade" a file, those who stay on the old version are definitely in trouble.
If we provide tools (or other means) to downgrade the file, it will help, right?
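[Editor's note] The "agree on the features, then derive the earliest reader" idea can be made concrete with a small lookup table. The feature names and version pairs below are purely illustrative, not an official compatibility table:

```python
# Illustrative sketch: if a community whitelists the HDF5 features it will
# use, the earliest library series able to read a file is simply the
# maximum of the per-feature requirements. The table below is hypothetical.

FEATURE_MIN_READER = {
    "contiguous_dataset": (1, 4),    # readable since the 1.4 series
    "chunked_gzip_dataset": (1, 4),
    "new_style_group": (1, 8),       # 1.8 compact/indexed groups
    "external_link": (1, 8),
}

def earliest_reader(features):
    """Earliest (major, minor) library series able to read a file that
    uses exactly the given (non-empty) collection of features."""
    return max(FEATURE_MIN_READER[f] for f in features)
```

Under these assumptions, a file using only old-style groups and gzip-compressed chunked datasets stays readable by 1.4.* readers, while a single external link pushes the requirement up to 1.8.*.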


Later you also wrote:

In general, file accessibility with the old library depends on the
features used by the new library, and this should be expected, right?

Well, if a file that was created by Application A is opened by
Application B, which adds some new data to the file, it might be
reasonable for the new data to be unreadable by Application A. It is
certainly not reasonable for the old data to be unreadable by
Application A.

Unfortunately, this can happen in HDF5 when a hierarchical structure (a group) is used.

 It is also reasonable to expect that the library used
by Application B could automatically detect from the file which
version of the data type it is writing would be compatible with the
other objects in the file.

Hmmm... I think we are back to the problem that HDF5 doesn't have a file format version. Imagine a several-gigabyte file that has to be traversed to find a common denominator (a file format version).


There are several ways to prevent this type of situation:

1. The community agrees on a common interface like NeXus. If all files are
accessed through that interface, there is no chance of
"corruption".

Nexus uses HDF5.  Therefore it is open to the same problem.
No, it is not. If NeXus is linked with any of 1.4.*, 1.6.*, or 1.8.*, it will produce the same files, since it uses the same HDF5 library features. Objects created by all those versions will be the same. If NeXus decides to use the compact group (1.8) features (i.e., to change the properties of the objects used), then yes, 1.6 applications will not be able to read the new NeXus files.

We are using multiple supported versions of the API (Python h5py,
Python Pytables, IDL, Matlab, ....), and so are not accessing through
a common interface.  We are asking for file-level versioning, not
API-level versioning.

Sorry, no file-level versioning is possible with HDF5 at this point, and it would not help, as we already discussed.


2. If there is no such interface, then there should at least be an
agreement on which features are used for which objects.

3. I mentioned the H5Pset_libver_bounds function in my previous email;
i.e., HDF5 makes better version control available to applications.

Is there version information available from a file?

No. But we will provide a tool that determines the earliest version of the library needed to read all data in a file.

4. We (The HDF Group) need to educate HDF5 users about backward/forward
compatibility and provide the documentation and tools needed.

OK, but if the compatibility issues were detected and handled
gracefully, you wouldn't need to educate users.

Well... users know about PDF, doc, docx, and txt. They also know that a software upgrade is sometimes needed to read new files, and they know where to get it (or they get it automatically and are used to clicking buttons in the upgrade window). Files with the .h5 extension shouldn't be different.

Working on it :-)

To summarize:

We do understand your concerns and take them seriously.

One of the proposed solutions to the forward compatibility problem is to store a file version in the file (another file format change ;-), with corresponding APIs to handle it. It would be very helpful if you provided us with use cases showing how the file version would be used for data sharing and for addressing the forward compatibility issue.

Thank you!

Elena