Slow reading from an LMDB database with imbalanced training data


Alexey Abramov

Aug 4, 2017, 9:04:13 AM
to Caffe Users
Hello everyone,

   I have some questions regarding random access reads from LMDB. My training set contains about 20,000 RGB images with a resolution of 1200 x 700 px. However, the data is very imbalanced: the number of samples per class ranges from 10 to 2,000 images (for some classes it is very hard to collect data). During training, all batches should have a uniform distribution over classes, which makes reading from the LMDB database quite tricky. Instead of iterating through the whole database:

for key, value in lmdb_cursor:
    datum.ParseFromString(value)

I keep a list of LMDB keys per class and use random access to pick samples from every class:

db_dict = load_dictionary(os.path.join(db_directory, DB_DICTIONARY_NAME))

for key, value in db_dict.iteritems():
    index = np.random.randint(0, len(value))
    db_access_key = '{:08}'.format(value[index])
    lmdb_cursor.set_key(db_access_key)

The problem is that random access to the LMDB takes about 72 ms per image on average... which is even slower than reading the images from plain files. By contrast, picking samples sequentially (see the first code block) is very fast (about 13 ms per image), but then I cannot ensure a uniform distribution within a batch.

Is there any way to accelerate random access to the LMDB? Would another database be a better fit for this problem?

As a naive solution, I am thinking about creating a separate database for every class; that way I can avoid random access entirely. But maybe there is a more elegant solution.

Thanks a lot in advance for your help! 

Best,
Alexey

Przemek D

Aug 22, 2017, 6:36:36 AM
to Caffe Users
What about shuffling images while creating the DB? If I understand correctly, your current DB looks like:
1 1 1 1 1 2 2 2 2 3 4 4 4 5 5 6 6 6 6 6 6 7 ...
so in order to balance the batches you try reading from different locations - is that right?
The typical approach to this is to create a shuffled DB, like so:
1 6 2 4 7 3 5 1 ...
so it can be read sequentially, but still give varied batches.
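A minimal sketch of that idea in Python (the names `samples` and the per-class counts are hypothetical; assume you build the DB from a list of (image, label) pairs and that the actual LMDB write happens where the comment indicates):

```python
import random

# Hypothetical list of (image_path, label) pairs gathered per class.
samples = [("img_%03d.jpg" % i, label)
           for label, count in [(1, 5), (2, 4), (3, 1), (4, 3)]
           for i in range(count)]

# Shuffle once, before writing, so sequential reads yield mixed batches.
random.seed(0)
random.shuffle(samples)

# When writing to LMDB, use a zero-padded running index as the key: LMDB
# iterates keys in lexicographic order, so this preserves the shuffled order.
for new_index, (path, label) in enumerate(samples):
    db_key = '{:08}'.format(new_index)
    # txn.put(db_key.encode(), datum.SerializeToString())  # actual write here
```

Reading such a DB with the plain sequential cursor then gives varied batches at sequential-read speed.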

Christopher Bayer

Aug 24, 2017, 3:49:38 AM
to Caffe Users
I don't think that approach works for an imbalanced dataset:
Considering a DB of: 112222222222333,
a shuffled DB would be:
212223232221232

Taking a batch size of 3 would result in:
212, 223, 232, 221, 232
which is imbalanced (not uniformly distributed) and follows the distribution of the imbalanced dataset.

One possibility would be to insert copies of the minority classes to the dataset:
11[1][1][1][1][1][1][1][1]2222222222333[3][3][3][3][3][3][3], where [1] denotes a copy of a random image of the '1' class.

After shuffling, the batch would be then balanced, e.g.:
[3]21[3][1]2[3]2[1]2[1]3[3][1]232[1][3]22[1]1[3]2[1]32[1][3]

Nevertheless, this approach requires a huge amount of disk space, since every minority class is scaled up to the size of the most frequent class.

Given your image size, I'm not sure this approach is suitable.

Przemek D

Aug 25, 2017, 3:01:00 AM
to Caffe Users
I think I misunderstood the initial question. Still, following my answer will get you better results than putting all images in ordered sequence.
Inserting copies is an extreme brute-force approach; it would be much more efficient (if less elegant) to create a separate DB for each of the N classes, have several data layers each reading B/N images per iteration (where B is the total batch size), and simply concatenate their outputs along axis 0. This has the added advantage that each batch is different: since each database has a different number of records, they are cycled over with different periods.
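A small sketch of the batching behaviour this produces (pure Python; `class_records` and the record names are hypothetical stand-ins for N separate LMDBs, and `itertools.cycle` models a data layer wrapping around its database):

```python
import itertools

# Hypothetical per-class record streams standing in for N separate LMDBs.
class_records = {'cat': ['cat0', 'cat1', 'cat2'],
                 'dog': ['dog0', 'dog1', 'dog2', 'dog3', 'dog4'],
                 'fox': ['fox0', 'fox1', 'fox2']}

B = 6                      # total batch size
N = len(class_records)     # number of classes / databases
per_class = B // N         # records each data layer contributes per iteration

# One sequential cursor per database; cycle() models LMDB wrap-around.
cursors = {label: itertools.cycle(records)
           for label, records in class_records.items()}

def next_batch():
    """Concatenate per-class slices along axis 0, as a Concat layer would."""
    batch = []
    for label in sorted(cursors):
        batch.extend(itertools.islice(cursors[label], per_class))
    return batch
```

Each call to `next_batch()` is exactly balanced (B/N records per class), and because the per-class streams have different lengths, consecutive batches pair up the classes' records differently.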