SinglefileData after migration to v2.0.1

Atsushi Togo

May 19, 2022, 8:41:18 PM

to aiida...@googlegroups.com

Dear AiiDA development team,

Thanks for all your effort in the development of AiiDA. AiiDA helps me
a lot in my research.

I recently migrated one of my AiiDA environments from v1.6.5 to
v2.0.1. I took a full backup of this environment before the migration.

I have a stored SinglefileData node that contains an hdf5 file. I have
a question about handling this node. I think I am doing something
wrong or unusual, but I don't know where the problem is.

At v2.0.1, a newly created SinglefileData node that contains an hdf5
file can be opened as follows without raising any exception
(`n.outputs.ltc` is the SinglefileData node):

```
with n.outputs.ltc.open(mode='rb') as f:
    hf = h5py.File(f)
```

But with a migrated node, the same code raises an exception:

```
In [68]: with n.outputs.ltc.open(mode='rb') as f:
    ...:     hf = h5py.File(f)
    ...:
    ...:
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-68-d61f0b32713c> in <module>
      1 with n.outputs.ltc.open(mode='rb') as f:
----> 2     hf = h5py.File(f)
      3
      4

~/.miniconda/envs/demo/lib/python3.8/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, **kwds)
    440             with phil:
    441                 fapl = make_fapl(driver, libver, rdcc_nslots, rdcc_nbytes, rdcc_w0, **kwds)
--> 442                 fid = make_fid(name, mode, userblock_size,
    443                                fapl, fcpl=make_fcpl(track_order=track_order, fs_strategy=fs_strategy,
    444                                fs_persist=fs_persist, fs_threshold=fs_threshold),

~/.miniconda/envs/demo/lib/python3.8/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
    193         if swmr and swmr_support:
    194             flags |= h5f.ACC_SWMR_READ
--> 195         fid = h5f.open(name, flags, fapl=fapl)
    196     elif mode == 'r+':
    197         fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5f.pyx in h5py.h5f.open()

h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_get_eof()

h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_get_eof()

h5py/h5fd.pyx in h5py.h5fd.H5FD_fileobj_get_eof()

~/.miniconda/envs/demo/lib/python3.8/site-packages/disk_objectstore/utils.py in seek(self, target, whence)
    419         """
    420         if whence not in [0, 1]:
--> 421             raise NotImplementedError(
    422                 "Invalid value for `whence`: only 0 and 1 are currently implemented."
    423             )

NotImplementedError: Invalid value for `whence`: only 0 and 1 are currently implemented.
```

I can recover the hdf5 file in the following way, so the data is not lost.
```
In [75]: n.outputs.ltc.filename
Out[75]: 'kappa-m53106106.hdf5'

In [76]: with n.outputs.ltc.open(mode='rb') as f:
    ...:     with open(n.outputs.ltc.filename, 'wb') as w:
    ...:         w.write(f.read())
    ...:

In [77]: hf = h5py.File("kappa-m53106106.hdf5")

In [78]: list(hf)
Out[78]:
['P_matrix',
'Q_matrix',
'frequency',
'gamma',
'grid_matrix',
'group_velocity',
'gv_by_gv',
'heat_capacity',
'kappa',
'kappa_unit_conversion',
'mesh',
'mode_kappa',
'qpoint',
'temperature',
'version',
'weight']
```

I would appreciate it if you could give me any advice.

Best regards,

Togo

--
Atsushi Togo

Sebastiaan Huber

May 20, 2022, 1:29:51 AM

to aiida...@googlegroups.com

Dear Togo,

The problem here is that the `File` constructor of the `h5py` library is
trying to call `seek` on the file handle `f` with a value of `whence`
that is not supported (`whence=2`, i.e. relative to the end of the file).
Calling `seek` moves the read position to a
certain byte in the stream.
However, in AiiDA, we cannot allow this because the disk object-store
library (which manages the file repository) stores multiple files in one
single big binary file.
If we were to allow reading beyond the boundaries of the file that was
requested, you could start reading the contents of other files.

The only solution currently is the workaround you already
proposed: copying the file to a temporary file on disk.
Since it only needs to be temporary, I would adapt the example
slightly:

import shutil
import tempfile

import h5py

with node.open(mode='rb') as source:
    # 'w+b': the temporary file must be writable as well as readable
    with tempfile.TemporaryFile(mode='w+b') as target:
        shutil.copyfileobj(source, target)  # Copy the content of source to target in chunks
        target.seek(0)  # Make sure to reset the pointer to the beginning of the stream
        hf = h5py.File(target)

By using `tempfile` you don't have to worry whether the filename already
exists.
And by using `shutil.copyfileobj` you make sure the file is copied in
chunks and won't be read into memory entirely.
This is important if the file is potentially big and may not fit in
available memory.

It is a bit annoying and inefficient to have to do this, but I don't think
there is another way for now.

It would maybe make sense to create a `Hdf5Data` class that subclasses
`SinglefileData` that does this trick automatically when accessing the
underlying file.
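
As a rough sketch of what such a helper could look like, here is the copy-to-temporary-file trick wrapped in a context manager. The name `seekable_copy` is hypothetical and not part of AiiDA, and an in-memory `io.BytesIO` stands in for `node.open(mode='rb')` so that the snippet runs without a profile:

```python
import contextlib
import io
import shutil
import tempfile


@contextlib.contextmanager
def seekable_copy(source):
    """Yield a fully seekable temporary copy of the binary stream `source`.

    Hypothetical helper for illustration, not part of AiiDA.
    """
    # 'w+b' so the temporary file can be both written and read back.
    with tempfile.TemporaryFile(mode='w+b') as target:
        shutil.copyfileobj(source, target)  # copied in chunks, not all in memory
        target.seek(0)  # rewind so readers start at the beginning
        yield target  # the temporary file is removed when the block exits


# Stand-in for `node.open(mode='rb')`; any readable binary stream works.
source = io.BytesIO(b"example bytes")
with seekable_copy(source) as handle:
    end = handle.seek(0, 2)  # seeking relative to the end is now possible
    handle.seek(0)
    data = handle.read()
```

With a real node, one would nest it as `with node.open(mode='rb') as source, seekable_copy(source) as handle:` and call `h5py.File(handle, 'r')` inside the block. The datasets have to be read before the block exits, because the temporary file disappears at that point.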

Hope that helps,

Sebastiaan

Atsushi Togo

May 20, 2022, 2:17:37 AM

to aiida...@googlegroups.com

Dear Sebastiaan,

Thanks for your quick answer and the detailed explanation, which helps me
understand how it works.

Your solution worked perfectly (with mode='w+b' for the second line),
and this solution is very good for me.

Best wishes,

Togo


--
Atsushi Togo

Sebastiaan Huber

May 20, 2022, 2:49:18 AM

to aiida...@googlegroups.com

Dear Togo,

Glad that works, and indeed, the mode should allow both writing and
reading (I hadn't tested the snippet, my bad :)
I went back in the commits to see why I didn't implement `whence=2`.
I thought there was some fundamental limitation that prevented us from
implementing it, but I think that is not actually the case.
So I opened an issue to implement it:
https://github.com/aiidateam/disk-objectstore/issues/136
I will hopefully fix that soon and then make a new release.
I will let you know here when that is done, and then you can update
`disk-objectstore` and you can get rid of the workaround.
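
To illustrate why `whence=2` does not have to leak other files, here is a minimal sketch of a reader that exposes a single object inside a larger pack. This is not the `disk-objectstore` code; the class `BoundedReader` and its `offset`/`length` arguments are hypothetical names. Seeking from the end is interpreted relative to the end of the object itself, and every target is checked against the object's boundaries:

```python
import io


class BoundedReader:
    """Read-only view on `length` bytes of `stream`, starting at `offset`.

    Hypothetical illustration, not the disk-objectstore implementation.
    """

    def __init__(self, stream, offset, length):
        self._stream = stream
        self._offset = offset
        self._length = length
        self._pos = 0  # position relative to the start of the object

    def seek(self, target, whence=0):
        if whence == 0:  # relative to the start of the object
            new_pos = target
        elif whence == 1:  # relative to the current position
            new_pos = self._pos + target
        elif whence == 2:  # relative to the end of *this object*, not the pack
            new_pos = self._length + target
        else:
            raise ValueError(f"invalid whence: {whence}")
        if not 0 <= new_pos <= self._length:
            raise ValueError("seek target outside the object boundaries")
        self._pos = new_pos
        return self._pos

    def tell(self):
        return self._pos

    def read(self, size=-1):
        remaining = self._length - self._pos
        if size is None or size < 0 or size > remaining:
            size = remaining  # never read past the end of the object
        self._stream.seek(self._offset + self._pos)
        data = self._stream.read(size)
        self._pos += len(data)
        return data


# Two "files" stored back to back in one pack.
pack = io.BytesIO(b"first-object" + b"second-object")
reader = BoundedReader(pack, offset=0, length=len(b"first-object"))
size = reader.seek(0, 2)  # the kind of call that failed in the traceback
reader.seek(0)
content = reader.read()  # stops at the object boundary
```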

Regards,

Sebastiaan
