Iterate over specific rows in a dataset


gre...@eng.ucsd.edu

Sep 4, 2017, 2:18:32 PM
to h5py
I've got an HDF5 file with a dataset that is 100,000 rows x 200 columns.  The 100,000 rows are from different 'groups' in the experiment.  I'd like to have h5py iterate over the dataset but skip a particular group. 

I believe I can just do something like 

self.inp = self.hdf_file['input'][subset_idx, :]

where subset_idx is a list of indices (e.g. subset_idx = [10,11,12,345,23001])

However, it seems that when I make the assignment to the self.inp variable, the entire HDF5 variable is read into memory (rather than an iterator over those specific rows being set up). I say this because the Python script hangs at this line, with the delay growing with the number of rows I select.

Is there a better way to do this? That is, how can I iterate over specific rows in the dataset?

Thanks.
-Tony
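[Editor's note] The pattern in the question can be demonstrated on a small file; h5py does support reading a list of rows in one call, provided the index list is in increasing order. A minimal sketch (file name and shapes are stand-ins for the poster's 100,000 x 200 dataset):

```python
import numpy as np
import h5py

# Build a small stand-in for the real dataset
with h5py.File("demo.h5", "w") as f:
    f.create_dataset("input", data=np.arange(20).reshape(10, 2))

with h5py.File("demo.h5", "r") as f:
    subset_idx = [1, 3, 7]            # h5py requires increasing order
    rows = f["input"][subset_idx, :]  # reads only these rows from disk
```

The read returns a NumPy array containing just the selected rows; nothing else is loaded.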

NumesSanguis

Sep 6, 2017, 3:11:14 AM
to h5py
Your question might be related to mine (which I just found the solution to): https://groups.google.com/forum/#!topic/h5py/Djh2kH4ZyxE
You can assign attributes to your data that refer to subsets of your data.

So you could assign a group as an attribute and then just iterate over all group labels
and skip the ones you don't want.

Please let me know if this solved your speed problem.
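[Editor's note] One way to read this suggestion: store a per-row group label alongside the data, then build the index list by filtering the labels. A sketch, with made-up names and a small label array (very large label arrays exceed the default 64 KiB attribute size limit, in which case a separate dataset is the safer home for them):

```python
import numpy as np
import h5py

with h5py.File("labeled.h5", "w") as f:
    dset = f.create_dataset("input", data=np.arange(400.0).reshape(100, 4))
    # Hypothetical attribute: one group label per row (fits easily here)
    dset.attrs["group"] = np.repeat([0, 1, 2, 3], 25)

with h5py.File("labeled.h5", "r") as f:
    labels = f["input"].attrs["group"]
    keep = np.flatnonzero(labels != 2)   # skip group 2; result is already sorted
    rows = f["input"][keep.tolist(), :]  # read only the kept rows
```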

NumesSanguis

Sep 6, 2017, 4:27:11 AM
to h5py
If attributes don't work, you could try chunking: http://docs.h5py.org/en/latest/high/dataset.html
Haven't tried it myself though.
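[Editor's note] Chunking is set at dataset creation time. A sketch of a row-oriented chunk layout, so that reading any whole row touches exactly one chunk; the chunk shape (64, 200) is only an illustrative choice, not a recommendation:

```python
import numpy as np
import h5py

with h5py.File("chunked.h5", "w") as f:
    # Each chunk holds 64 complete rows
    d = f.create_dataset("input", shape=(1000, 200), dtype="f4",
                         chunks=(64, 200))
    d[:64] = 1.0  # write the first chunk

with h5py.File("chunked.h5", "r") as f:
    chunks = f["input"].chunks  # layout is recorded in the file
```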

G Reina

Sep 6, 2017, 11:06:01 AM
to h5...@googlegroups.com
Thanks for the reply.

Yes, it's definitely the same thing you are doing (I am using this also for neural network training).  I can do the fancy indexing to slice the data, but according to the documentation for h5py if you try to take a slice greater than 1,000 rows then things will be really slow. 

I was hoping that there was something other than slicing. In particular, it would be nice to have an iterator that went through the data in a non-sequential fashion. However, I realize that is easier said than done.

Thanks again.
Best,
-Tony
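[Editor's note] A non-sequential iterator of the kind Tony describes can be built as a generator that sorts the requested indices once and then reads them lazily, a batch at a time. `iter_rows` is a hypothetical helper, not an h5py API:

```python
import numpy as np
import h5py

def iter_rows(path, dset, idx, batch=256):
    """Yield the rows at `idx` in batches, reading each batch only on demand."""
    idx = np.sort(np.asarray(idx))  # h5py wants increasing indices
    with h5py.File(path, "r") as f:
        d = f[dset]
        for start in range(0, len(idx), batch):
            yield d[idx[start:start + batch].tolist(), :]

# Tiny demonstration file
with h5py.File("it_demo.h5", "w") as f:
    f.create_dataset("x", data=np.arange(20).reshape(10, 2))

batches = list(iter_rows("it_demo.h5", "x", [9, 0, 4], batch=2))
```

Only one batch of rows is ever resident in memory at a time, which avoids the hang on assignment described in the original question.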



NumesSanguis

Sep 6, 2017, 9:19:51 PM
to h5py
I think chunking your data (and, if possible, putting your data on an SSD) might be a good solution.
This is a good piece about chunking and HDF5:
http://geology.beer/2015/02/10/hdf-for-large-arrays/
Especially note: "First off, to clear up confusion, accessing an h5py dataset returns an object that behaves fairly similarly to a numpy array, but does not load the data into memory until it’s sliced."

I guess even more optimal would be to write some parallel code which reads a slice of HDF5 while your batch is being trained on by the neural network, so it has the data in memory by the time the next batch is required.

Please let me know what you ended up doing, or if you find any more information related to this.
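[Editor's note] The parallel-read idea sketched above can be done with a producer thread and a bounded queue: one thread reads the next batch from HDF5 while the consumer trains on the current one. All names here are illustrative, and the queue's `maxsize` bounds how many batches sit in memory:

```python
import queue
import threading
import numpy as np
import h5py

def prefetch(path, dset, index_batches, out_q):
    """Producer thread: read each batch of rows from HDF5 and queue it."""
    with h5py.File(path, "r") as f:
        d = f[dset]
        for sel in index_batches:
            out_q.put(d[sorted(sel), :])  # h5py wants sorted indices
    out_q.put(None)                       # sentinel: no more batches

# Demonstration file and consumer loop
with h5py.File("pf_demo.h5", "w") as f:
    f.create_dataset("input", data=np.arange(40.0).reshape(10, 4))

q = queue.Queue(maxsize=2)  # at most two batches held in memory
t = threading.Thread(target=prefetch,
                     args=("pf_demo.h5", "input", [[0, 3], [5]], q))
t.start()
seen = []
while True:
    batch = q.get()
    if batch is None:
        break
    seen.append(batch)  # in real training: feed the batch to the network here
t.join()
```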

NumesSanguis

Sep 7, 2017, 4:04:25 AM
to h5py
I think you misunderstood the performance issue with more than 1,000 elements.
"Very long lists (> 1000 elements) may produce poor performance" is about selecting specific elements in a (many) row(s), not slicing.
http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing
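[Editor's note] The distinction can be shown side by side: a contiguous slice is a single hyperslab read and stays fast at any size, whereas an index list is a point selection whose cost grows with the length of the list, which is what the documentation's > 1000-element warning refers to. File and names are illustrative:

```python
import numpy as np
import h5py

with h5py.File("sel_demo.h5", "w") as f:
    f.create_dataset("input", data=np.arange(200.0).reshape(50, 4))

with h5py.File("sel_demo.h5", "r") as f:
    d = f["input"]
    block = d[0:10, :]          # contiguous slice: one hyperslab read
    picked = d[[3, 17, 42], :]  # fancy indexing: per-element selection
```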

Can you show the code you use to extract your rows?

G Reina

Sep 7, 2017, 11:09:32 AM
to h5...@googlegroups.com
Sure.

The code is a little abstracted, but I'll try to pull out the relevant parts. 

My HDF5 has multiple classes within the same ['input'] matrix. So the idea is to produce mini-batches for SGD that take a random sample of each class for every batch such that I grab the same number of samples for each class. For instance, if I have 2 classes and a mini-batch size of 1,024, then I'll grab 512 inputs from class 1 and 512 inputs from class 2 for every batch. That looks something like this:

def random_sample(self, n):
    '''
    Instead of just taking successive samples of the data for the mini-batch,
    we take random samples of the data for each mini-batch
    AND make sure that each class is equally represented.
    This helps balance training for imbalanced datasets (e.g. # class 1 >> # class 2).
    '''
    divs = n // self.nclass  # Divide mini-batch into nclass regions

    idx_array = []
    for i in range(self.nclass):
        rnd_idx = np.random.random_integers(0, len(self.class_idx[i]) - 1, divs)
        idx_array.extend(self.class_idx[i][rnd_idx])

    # If the mini-batch can't be divided evenly, the remainder of the array
    # is just one extra sample from each class
    extra_idx = n - len(idx_array)
    for i in range(extra_idx):
        idx_array.extend(self.class_idx[i][np.random.random_integers(0, len(self.class_idx[i]) - 1, 1)])

    return np.sort(idx_array)


idx_array = self.random_sample(1024) # Get the indices for 1024 elements

mini_batch_in[:, :bsz] = df['input'][idx_array, :]  # Slice those random indices from the HDF5 input array and put on the GPU tensor
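[Editor's note] For reference, the per-class loop above can be condensed with a vectorized draw per class. This is a sketch, not the author's code: `class_idx` is made up for the demo, and `np.random.default_rng` replaces the now-deprecated `np.random.random_integers`:

```python
import numpy as np

rng = np.random.default_rng(0)
class_idx = [np.arange(0, 50), np.arange(50, 100)]  # hypothetical per-class row indices
n = 8
divs = n // len(class_idx)

# One balanced draw per class, concatenated and sorted; sorted indices both
# satisfy h5py's fancy-indexing requirement and read faster from disk
idx_array = np.sort(np.concatenate(
    [rng.choice(ci, size=divs, replace=True) for ci in class_idx]))
```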



