Loading and generating batches using fuel for large dataset


h

Oct 6, 2016, 9:57:39 AM
to fuel-users
Hey,

I have two different HDF5 files containing train and test data, and my dataset is very large. I want to load the data with a batch generator in each epoch. Could you please explain how to generate batches without eating up memory?

Best,

Dmitriy Serdyuk

Oct 6, 2016, 4:25:42 PM
to fuel-users, rasoul...@gmail.com

Can you provide more details on what you are using and what you would like to achieve?

Currently, H5PYDataset provides out-of-memory datasets (the load_in_memory=False argument is the default, if I’m not mistaken), so every batch is fetched from disk. The data stream never stores data in memory unless you explicitly keep references to it.
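For example (not tested against your files, and the file name, batch size, and source names are just placeholders), iterating over such a dataset in batches could look roughly like this:

from fuel.datasets.hdf5 import H5PYDataset
from fuel.schemes import SequentialScheme
from fuel.streams import DataStream

# Open the dataset descriptor; nothing is loaded into memory at this point.
train_set = H5PYDataset('train.hdf5', which_sets=('train',),
                        load_in_memory=False)

# Visit the examples in order, 128 at a time; each batch is read from disk.
scheme = SequentialScheme(examples=train_set.num_examples, batch_size=128)
stream = DataStream(train_set, iteration_scheme=scheme)

for batch in stream.get_epoch_iterator():
    pass  # batch is a tuple of numpy arrays, one per source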


rasoul...@gmail.com

Oct 6, 2016, 5:49:19 PM
to fuel-users, rasoul...@gmail.com
Hey Dmitriy,

I've tried to follow this link, but the problem is that my dataset is stored in the HDF5 file differently. The reason is that the data is really huge (more than 100 million images), so I stored it in chunks:

import h5py
import numpy as np


def img_into_hdf5(list, hdf5_group, flag):
    # put the image patches / labels into hdf5
    if flag == "image":
        imgArray = np.reshape(list, (1, 28, 28))
        imgArray = imgArray.astype('float32')  # cast to float32, saving storage space

        if 'imagePatches' in hdf5_group:
            h5_append_dataset(hdf5_group['/imagePatches'], imgArray)  # extend the imagePatches array
        else:
            # chunk size used to store the images in the hdf5 file
            chunksize_images = (500, 28, 28)  # roughly 1.5 MB uncompressed for float32
            h5_create_extending_dataset(hdf5_group, 'imagePatches', imgArray,
                                        compress=True, chunks=chunksize_images)
    else:
        labelArray = np.array(list)
        if 'label' in hdf5_group:
            h5_append_dataset(hdf5_group['/label'], labelArray)  # extend the label array
        else:
            # chunk size used to store the labels in the hdf5 file
            chunksize_scalars = (1000,)  # about 8 kB for int64
            h5_create_extending_dataset(hdf5_group, 'label', labelArray,
                                        compress=True, chunks=chunksize_scalars)


def h5_create_extending_dataset(rootgroup: h5py.Group, dataset_name: str, someData: np.ndarray,
                                chunks: tuple, compress: bool):
    # None signals that the first dim can be extended indefinitely
    maxShape = (None,) + someData.shape[1:]
    if compress:
        dset = rootgroup.create_dataset(name=dataset_name, data=someData,
                                        maxshape=maxShape, chunks=chunks,
                                        compression="lzf")  # lzf seems to be much faster than gzip
    else:
        dset = rootgroup.create_dataset(name=dataset_name, data=someData,
                                        maxshape=maxShape, chunks=chunks)
    return dset


def h5_append_dataset(dset, data):
    """Append data along the first axis, dynamically enlarging the dataset.
    Also works with 1d arrays, e.g. ones(10)."""
    assert dset.maxshape[0] is None, 'cannot extend arrays that do not have maxshape[0] == None'

    oldSize = dset.shape
    assert oldSize[1:] == data.shape[1:], 'shape of the appended data does not match'
    newshape = (oldSize[0] + data.shape[0],) + data.shape[1:]  # first dim extends, rest stays the same
    dset.resize(newshape)

    # put in the new data
    dset[oldSize[0]:] = data


 
With the code above I created two datasets, train and test. Since both are very large, I want to load them on the fly in each epoch. To do so, I need Fuel to read the data from HDF5, generate batches, and feed them to the network.

I'm totally lost here. I guess the problem might be that my dataset has a different layout, so Fuel cannot read it.

Thanks for helping

Dmitriy Serdyuk

Oct 7, 2016, 3:55:57 PM
to fuel-users, rasoul...@gmail.com

As far as I can see, your dataset is not compatible. For example, I don’t see you adding the split information anywhere.

You’ll probably have to implement your own dataset with your own open, get_data, and close methods. In Fuel the dataset object is stateless; it is more like a dataset descriptor whose open method returns the state, which in your case would be the h5py file object. get_data accepts this state and a request (a list of indices to construct a batch, produced by the iteration scheme’s request iterator), and you need to implement it without loading the whole dataset into memory. This is described in the Fuel documentation for Dataset.
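As a rough sketch (untested; the class name, file path, and dataset names just mirror your snippet), such a dataset could look something like this:

import h5py

from fuel.datasets import Dataset


class PatchesDataset(Dataset):
    """Reads batches of image patches and labels directly from an HDF5 file."""

    provides_sources = ('imagePatches', 'label')

    def __init__(self, path, **kwargs):
        self.path = path
        super(PatchesDataset, self).__init__(**kwargs)

    def open(self):
        # The state is just the open h5py file; nothing is read yet.
        return h5py.File(self.path, 'r')

    def close(self, state):
        state.close()

    def get_data(self, state=None, request=None):
        # request is a list of indices coming from the iteration scheme;
        # only those rows are read from disk. Note that h5py wants the
        # index list to be sorted, which SequentialScheme guarantees.
        return tuple(state[source][request] for source in self.sources)

You can then wrap it in a DataStream with a SequentialScheme exactly as you would with H5PYDataset.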

Alternatively, you can try to create an HDF5 file compatible with H5PYDataset by following the tutorial you linked.
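The main missing piece in your current files is the split metadata. Roughly (untested, and assuming the imagePatches and label datasets sit at the root of the file), it amounts to something like this for the train file, and the same with 'test' for the other one:

import h5py

from fuel.datasets.hdf5 import H5PYDataset

with h5py.File('train.hdf5', 'a') as f:
    num_examples = f['imagePatches'].shape[0]
    # Mark every example in this file as belonging to the 'train' split.
    split_dict = {'train': {'imagePatches': (0, num_examples),
                            'label': (0, num_examples)}}
    f.attrs['split'] = H5PYDataset.create_split_array(split_dict)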
