create lmdb/leveldb from non-image data


Daffe

Sep 25, 2014, 8:44:51 PM
to caffe...@googlegroups.com
Hello,
How can I create an LMDB or LevelDB database from non-image data, so that I can pass the data into Caffe layers?
It seems that convert_imageset.cpp only converts images (and labels). Thanks!

Rayn Sakaguchi

Sep 26, 2014, 10:26:40 AM
to caffe...@googlegroups.com
The easiest way I have found to get data into Caffe is using HDF5 data files. For Python, you can check out this example: https://github.com/BVLC/caffe/blob/master/src/caffe/test/test_data/generate_sample_data.py
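For reference, here is a minimal h5py sketch along the same lines (my own, not from that script; the file and dataset names are just examples, but Caffe's HDF5Data layer does look up datasets named after its 'top' blobs and reads its file list from a 'source' text file):

import h5py
import numpy as np

# Example arrays: N samples of shape (channels, height, width), plus labels.
data = np.random.randn(100, 1, 28, 28).astype(np.float32)
labels = np.arange(100, dtype=np.float32)

# Dataset names must match the 'top' blob names of the HDF5Data layer.
with h5py.File('train.h5', 'w') as f:
    f.create_dataset('data', data=data)
    f.create_dataset('label', data=labels)

# The layer's 'source' parameter is a text file listing HDF5 file paths.
with open('train_h5_list.txt', 'w') as f:
    f.write('train.h5\n')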

I have also had success writing HDF5 files from MATLAB using something like this:

 h5create(['mnistTrain.h5'],'/data',[28 28 1 mnist.nObservations]);
 h5create(['mnistTrain.h5'],'/label',[1 mnist.nObservations]);
 h5write(['mnistTrain.h5'],'/data',mnistData);
 h5write(['mnistTrain.h5'],'/label',mnistLabels');

Obviously the sizes in the h5create calls depend on your data. Note that MATLAB is column-major and Caffe is row-major, so keep the dimension ordering in mind if you go the MATLAB route.
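For example (my own check, assuming the mnistTrain.h5 file from above): reading the MATLAB-written file back with h5py shows the dimension order reversed, because MATLAB flips the dimensions when mapping to HDF5's row-major layout:

import h5py

with h5py.File('mnistTrain.h5', 'r') as f:
    # MATLAB's [28 28 1 N] comes back as (N, 1, 28, 28) in Python.
    # Note H and W are also swapped relative to MATLAB's row/column order,
    # so permute the image axes in MATLAB first if that matters.
    print(f['data'].shape)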

Armin Kappeler

Feb 17, 2015, 6:51:28 PM
to caffe...@googlegroups.com
Hi Rayn,

Thanks for that answer. I have a similar problem, and this solves it. Now I have a follow-up question, as I haven't used the HDF5 format before. I thought maybe you would know:

My training database consists of about 30 million samples (dimension: 6x36x36), so the HDF5 file will grow very large. 
=> Should I use chunking? (Do I get higher read performance if I use chunking?)
=> If yes, what is a reasonable chunk size? 
        When I train my neural network I use a batch size of 128, so I thought a chunk size of 128x6x36x36 makes sense; that would give me about 230,000 chunks. Is that reasonable? (See the sketch below.)
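(For reference, a sketch of what that batch-aligned chunk layout could look like with h5py; the file name and dtype are just assumptions:)

import h5py
import numpy as np

n, c, h, w = 30000000, 6, 36, 36  # the numbers from the question
with h5py.File('train.h5', 'w') as f:
    # One chunk per 128-sample batch, i.e. ~3.8 MB of float32 per chunk;
    # HDF5 allocates chunks lazily as they are written.
    f.create_dataset('data', shape=(n, c, h, w), dtype=np.float32,
                     chunks=(128, c, h, w))
    f.create_dataset('label', shape=(n,), dtype=np.float32, chunks=(128,))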

Jianyu Lin

Aug 19, 2015, 10:18:35 PM
to Caffe Users
Hi Armin, have you found the answer? I ran into the same question, too.

Hidekazu

Aug 20, 2015, 10:00:57 AM
to Caffe Users
Hello,
I found the following web page, which also answers your question. LMDB (Lightning Memory-Mapped Database) is a high-performance database that can handle data sets larger than your RAM:

http://deepdish.io/2015/04/28/creating-lmdb-in-python/

I have been able to run the training process using
an LMDB database created that way.

I am still confused about how to deploy the caffemodel I created using this method, though,
and would very much appreciate help on deploying a network whose inputs are not images.

thanks!

Paolo

Jan 13, 2016, 10:26:51 AM
to Caffe Users, rayn.sa...@gmail.com
Does anybody know how to do that in Python?

Jan C Peters

Jan 26, 2016, 7:38:56 AM
to Caffe Users, rayn.sa...@gmail.com
Actually it is not that hard, and it is very similar to the way convert_imageset does it: assume your data is in a numpy array; just serialize that array and save the bytestring in the "data" attribute of a Datum. I once wrote a small Python script to convert an HDF5 db to LMDB. Here is the most interesting part:

from numpy import *
import lmdb
import h5py
import sys

caffe_root = <where your caffe installation is>
sys.path.insert(0, caffe_root + 'python')

import caffe

[...]

tgt_db = <pathname of the target lmdb>
src_db = <filename of the source HDF5 file>

[...]

env = lmdb.open(tgt_db, map_size=10000000000)

with h5py.File(src_db, 'r') as f:
    # extract data from hdf file
    ar_data = array(f['data'], dtype=float32)
    ar_label = array(f['label'], dtype=int)
    n, c, h, w = ar_data.shape
    ar_label = ar_label.flatten()
    assert len(ar_label) == n  # number of labels has to match the number of input images!

    # write data to lmdb
    for i in range(n):
        datum = caffe.proto.caffe_pb2.Datum()
        datum.channels = c
        datum.height = h
        datum.width = w
        datum.data = ar_data[i, :, :, :].tobytes()
        datum.label = int(ar_label[i])
        str_id = '{:08}'.format(i)  # create an 8-digit string id based on the index
        with env.begin(write=True) as txn:
            txn.put(str_id.encode('ascii'), datum.SerializeToString())
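One caveat with the snippet above (my note, based on Caffe's caffe.proto): the Datum 'data' field is a bytes field that the data layer decodes as uint8, so float-valued inputs are usually stored in the repeated 'float_data' field instead, along the lines of:

# Assumption: for float32 samples, fill float_data rather than the
# uint8-oriented bytes field, leaving datum.data unset.
datum.float_data.extend(ar_data[i].flat)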

I would only use it for small dbs (in the single-digit GB range), or the instruction array(f['data'], dtype=float32) will kill you, since it loads the entire dataset into memory at once...
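A possible workaround (my sketch, not part of the script above): read the HDF5 dataset in slices, since h5py only loads the slice you actually index:

# Sketch: stream the data in bounded-size slices instead of one huge array.
batch = 1000  # hypothetical slice size
with h5py.File(src_db, 'r') as f:
    dset = f['data']
    for start in range(0, dset.shape[0], batch):
        chunk = array(dset[start:start + batch], dtype=float32)
        # ... build and write the Datums for this slice as above ...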

Jan

Jan C Peters

Jan 26, 2016, 7:45:01 AM
to Caffe Users, rayn.sa...@gmail.com
Oh, I noticed just now that Hidekazu provided a link which demonstrates essentially the same thing I did, so my post was somewhat superfluous. But then why did you ask, Paolo? The linked page's examples _are_ in Python.

Jan