Sure.
My HDF5 file has multiple classes within the same ['input'] matrix. The idea is to produce mini-batches for SGD that take a random sample of each class for every batch, so that each class contributes the same number of samples. For instance, if I have 2 classes and a mini-batch size of 1,024, then I'll grab 512 inputs from class 1 and 512 inputs from class 2 for every batch. That looks something like this:
def random_sample(self, n):
    '''
    Instead of taking successive slices of the data for each mini-batch,
    we take a random sample of the data for each mini-batch
    AND make sure that each class is equally represented.
    This helps balance training for imbalanced datasets (e.g. # class 1 >> # class 2).
    '''
    divs = n // self.nclass  # Divide the mini-batch into nclass equal regions
    idx_array = []
    for i in range(self.nclass):
        # np.random.random_integers is deprecated; np.random.randint samples from [0, high)
        rnd_idx = np.random.randint(0, len(self.class_idx[i]), divs)
        idx_array.extend(self.class_idx[i][rnd_idx])
    # If the mini-batch can't be divided evenly, fill the remainder with one extra sample per class
    extra_idx = n - len(idx_array)
    for i in range(extra_idx):
        idx_array.extend(self.class_idx[i][np.random.randint(0, len(self.class_idx[i]), 1)])
    # h5py fancy indexing requires the indices to be in increasing order
    return np.sort(idx_array)
idx_array = self.random_sample(bsz)
mini_batch_in[:, :bsz] = df['input'][idx_array, :]  # Slice those random indices from the HDF5 input array and copy into the GPU tensor
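In case it helps, here's a minimal, self-contained sketch of the same balanced-sampling idea using plain NumPy arrays in place of the HDF5 dataset. The names (`labels`, `batch_size`, `balanced_sample_indices`) are just illustrative, not part of my class:

```python
import numpy as np

def balanced_sample_indices(labels, batch_size, rng=None):
    """Return sorted indices with (roughly) equal counts per class.

    Samples with replacement within each class, so minority classes
    can appear more than once in a batch.
    """
    rng = np.random.default_rng() if rng is None else rng
    classes = np.unique(labels)
    per_class = batch_size // len(classes)
    idx = []
    for c in classes:
        pool = np.flatnonzero(labels == c)  # indices belonging to class c
        idx.extend(rng.choice(pool, size=per_class, replace=True))
    # If batch_size doesn't divide evenly, top up with one sample per class
    for c in classes[: batch_size - len(idx)]:
        pool = np.flatnonzero(labels == c)
        idx.extend(rng.choice(pool, size=1, replace=True))
    # Sorted indices keep HDF5-style fancy indexing happy
    return np.sort(np.asarray(idx))

# Example: imbalanced 2-class dataset (90/10 split), batch of 8
labels = np.array([0] * 90 + [1] * 10)
batch = balanced_sample_indices(labels, 8, rng=np.random.default_rng(0))
counts = np.bincount(labels[batch], minlength=2)
print(counts.tolist())  # [4, 4] -> each class equally represented
```

One caveat when applying this to a real HDF5 dataset: h5py's fancy indexing wants indices in increasing order, and (as far as I know) some versions also reject duplicate indices, so sampling with replacement may require deduplicating before slicing the dataset directly.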