Can you provide more details about what you are using and what you would like to achieve?
Currently H5PYDataset provides out-of-memory datasets (the load_in_memory=False argument is the default, if I'm not mistaken), so every batch is fetched from disk. The data stream never stores data in memory unless you explicitly keep references to it.
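For illustration, here is a minimal sketch of streaming batches from disk with Fuel; the file name dataset.hdf5, the set name, and the batch size are made up:

    from fuel.datasets.hdf5 import H5PYDataset
    from fuel.streams import DataStream
    from fuel.schemes import SequentialScheme

    # load_in_memory=False (the default) keeps the data on disk
    train_set = H5PYDataset('dataset.hdf5', which_sets=('train',),
                            load_in_memory=False)
    stream = DataStream(train_set,
                        iteration_scheme=SequentialScheme(
                            examples=train_set.num_examples, batch_size=128))
    for batch in stream.get_epoch_iterator():
        pass  # each batch is read from disk on demand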
    import h5py
    import numpy as np

    def img_into_hdf5(data, hdf5_group, flag):
        # put the image patches (or their labels) into the hdf5 group
        if flag == "image":
            imgArray = np.reshape(data, (1, 28, 28))
            imgArray = imgArray.astype('float32')  # cast to float32, saving storage space
            if 'imagePatches' in hdf5_group:
                h5_append_dataset(hdf5_group['imagePatches'], imgArray)  # extends the imagePatches array
            else:
                # chunk size used to store the images in the hdf5 file
                chunksize_images = (500, 28, 28)  # roughly 1.5 MB per chunk for float32
                h5_create_extending_dataset(hdf5_group, 'imagePatches', imgArray,
                                            compress=True, chunks=chunksize_images)
        else:
            labelArray = np.asarray(data).reshape(-1)  # make sure labels form a 1-D array
            if 'label' in hdf5_group:
                h5_append_dataset(hdf5_group['label'], labelArray)  # extends the label array
            else:
                # chunk size used to store the labels in the hdf5 file
                chunksize_scalars = (1000,)  # roughly 8 kB per chunk for int64
                h5_create_extending_dataset(hdf5_group, 'label', labelArray,
                                            compress=True, chunks=chunksize_scalars)
    def h5_create_extending_dataset(rootgroup: h5py.Group, dataset_name: str,
                                    someData: np.ndarray, chunks: tuple, compress: bool):
        maxShape = (None,) + someData.shape[1:]  # None signals that this dim can be extended indefinitely
        if compress:
            dset = rootgroup.create_dataset(name=dataset_name, data=someData,
                                            maxshape=maxShape, chunks=chunks,
                                            compression="lzf")  # lzf seems to be much faster than gzip
        else:
            dset = rootgroup.create_dataset(name=dataset_name, data=someData,
                                            maxshape=maxShape, chunks=chunks)
        return dset
    def h5_append_dataset(dset, data):
        """Appends the data, dynamically enlarging the dataset.

        Neat: also works with 1-D arrays, e.g. np.ones(10).
        """
        assert dset.maxshape[0] is None, "cannot extend datasets that don't have maxshape[0] == None"
        oldSize = dset.shape
        assert oldSize[1:] == data.shape[1:], "shape of the appended data doesn't match"
        newshape = (oldSize[0] + data.shape[0],) + data.shape[1:]  # the first dim grows, the rest stays the same
        dset.resize(newshape)
        # put in the new data
        dset[oldSize[0]:] = data
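For completeness, a hypothetical usage sketch of these helpers; the file name, the random patches, and the cyclic labels are made up:

    with h5py.File('patches.hdf5', 'w') as f:
        for i in range(5000):
            patch = np.random.rand(28, 28)       # stand-in for a real image patch
            img_into_hdf5(patch, f, "image")     # the file object doubles as the root group
            img_into_hdf5(np.int64(i % 10), f, "label")
        print(f['imagePatches'].shape, f['label'].shape)  # (5000, 28, 28) (5000,)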
As far as I can see, your dataset is not compatible; I don't see where you add split information, for example.
Probably you'll have to implement your own dataset with your own open, get_data, and close methods. In Fuel the dataset object is stateless; it is more like a dataset descriptor, with a method open that returns the state, which in your case would be the h5py file object. get_data accepts this state and a request (a list of indices with which to construct a batch), and you need to implement this method without loading the whole dataset into memory. It is described in the Fuel documentation for Dataset.
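For example, a rough, untested sketch of such a class, assuming the imagePatches and label datasets from the code above (the class name and source names are placeholders):

    import h5py
    from fuel.datasets import Dataset

    class PatchDataset(Dataset):
        provides_sources = ('imagePatches', 'label')

        def __init__(self, path, **kwargs):
            self.path = path
            super(PatchDataset, self).__init__(**kwargs)

        def open(self):
            # the returned handle is the "state" that fuel threads through
            return h5py.File(self.path, 'r')

        def get_data(self, state=None, request=None):
            # request is a list of indices; only those rows are read from disk
            # (note that h5py wants fancy indices in increasing order)
            return (state['imagePatches'][request], state['label'][request])

        def close(self, state):
            state.close()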
Alternatively, you can try to create an hdf5 file compatible with H5PYDataset by following the tutorial you linked.
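Roughly, that amounts to adding a split attribute to the file; a sketch under the same assumptions as above (the 90/10 train/test split is arbitrary):

    import h5py
    from fuel.datasets.hdf5 import H5PYDataset

    with h5py.File('patches.hdf5', 'a') as f:
        n = f['imagePatches'].shape[0]
        cut = int(0.9 * n)
        split_dict = {
            'train': {'imagePatches': (0, cut), 'label': (0, cut)},
            'test': {'imagePatches': (cut, n), 'label': (cut, n)},
        }
        f.attrs['split'] = H5PYDataset.create_split_array(split_dict)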