Python script to create new LMDB dataset for regression


Yoann

Jan 12, 2015, 11:31:32 AM
to caffe...@googlegroups.com
Dear all,

I had a hard time creating a script to convert my own dataset into the LMDB format with real-valued labels, so I want to share my script with you. Please tell me if you find errors.
As suggested by Evan Shelhamer in other threads dealing with regression problems, I create two LMDB datasets:
_ one LMDB that contains all the image data
_ another LMDB with all the real-valued labels in a (1,1,1) data channel
The script takes as input one txt file for the train set in which each line is "path/to/image.jpg 0.54651".
You will probably want to run the script a second time to create the test set from your second txt file.

Here is the script:

import lmdb
import re, fileinput, math
import sys
import numpy as np

# Make sure that caffe is on the python path:
caffe_root = '../../'  # this file is expected to be in {caffe_root}/examples
sys.path.insert(0, caffe_root + 'python')

import caffe

# Command line to check created files:
# python -mlmdb stat --env=./Downloads/caffe-master/data/liris-accede/train_score_lmdb/

data = 'train.txt'
lmdb_data_name = 'train_data_lmdb'
lmdb_label_name = 'train_score_lmdb'

Inputs = []
Labels = []

# Each line of the txt file is "path/to/image.jpg 0.54651"
for line in fileinput.input(data):
    entries = re.split(' ', line.strip())
    Inputs.append(entries[0])
    Labels.append(entries[1])

print('Writing labels')

# Size of buffer: 1000 elements to reduce memory consumption
for idx in range(int(math.ceil(len(Labels)/1000.0))):
    in_db_label = lmdb.open(lmdb_label_name, map_size=int(1e12))
    with in_db_label.begin(write=True) as in_txn:
        for label_idx, label_ in enumerate(Labels[(1000*idx):(1000*(idx+1))]):
            # Each label becomes a 1x1x1 (channels, height, width) float datum
            im_dat = caffe.io.array_to_datum(np.array(label_).astype(float).reshape(1,1,1))
            # Zero-padded key; encoded because LMDB keys must be bytes on Python 3
            key = '{:0>10d}'.format(1000*idx + label_idx).encode('ascii')
            in_txn.put(key, im_dat.SerializeToString())

        # Progress for the chunk that was just written
        string_ = str(1000*idx+label_idx+1) + ' / ' + str(len(Labels))
        sys.stdout.write("\r%s" % string_)
        sys.stdout.flush()
    in_db_label.close()
print('')

print('Writing image data')

for idx in range(int(math.ceil(len(Inputs)/1000.0))):
    in_db_data = lmdb.open(lmdb_data_name, map_size=int(1e12))
    with in_db_data.begin(write=True) as in_txn:
        for in_idx, in_ in enumerate(Inputs[(1000*idx):(1000*(idx+1))]):
            # load_image returns an H x W x 3 float array in [0, 1], RGB order
            im = caffe.io.load_image(in_)
            # Caffe expects channels first: (H, W, C) -> (C, H, W)
            im_dat = caffe.io.array_to_datum(im.astype(float).transpose((2, 0, 1)))
            # Same key scheme as the labels, so both LMDBs stay aligned
            key = '{:0>10d}'.format(1000*idx + in_idx).encode('ascii')
            in_txn.put(key, im_dat.SerializeToString())

        string_ = str(1000*idx+in_idx+1) + ' / ' + str(len(Inputs))
        sys.stdout.write("\r%s" % string_)
        sys.stdout.flush()
    in_db_data.close()
print('')
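
For readers who want to check the bookkeeping without installing Caffe or LMDB, here is a minimal sketch of the three pieces the script relies on: parsing a "path label" line, building the zero-padded key, and cutting the list into chunks of 1000. The helper names (parse_line, make_key, chunks) are hypothetical and not part of the script or of any library. The zero padding matters because LMDB orders keys lexicographically, so padded keys keep the image and label databases aligned entry by entry.

```python
def parse_line(line):
    """Split 'path/to/image.jpg 0.54651' into (path, float label)."""
    path, label = line.split()
    return path, float(label)


def make_key(index):
    """Zero-padded 10-digit key, encoded to bytes (required on Python 3)."""
    return '{:0>10d}'.format(index).encode('ascii')


def chunks(items, size=1000):
    """Yield (start_index, chunk) pairs holding at most `size` items each."""
    for start in range(0, len(items), size):
        yield start, items[start:start + size]


# Two made-up lines in the format the script expects
lines = ['path/to/image.jpg 0.54651', 'path/to/other.jpg 1.25']
pairs = [parse_line(line) for line in lines]
keys = [make_key(i) for i in range(len(pairs))]
```

Without the padding, the key "10" would sort before "9" and the two databases would no longer line up.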

Hope it helps!

Best,
Yoann

Wei Guo

Feb 6, 2015, 2:11:59 AM
to caffe...@googlegroups.com
Would you mind modifying your code to handle multiple labels, such as:

            "image_file_name regression_vector  classified_id  regression_vector2"

for example:

             "path/to/image.jpg 0.54   2.34   4.56  4  5.60  7.45"

Thanks,

Yoann

Feb 6, 2015, 4:02:37 AM
to caffe...@googlegroups.com
It can be done easily. Create as many arrays as needed (Label1 = [], Label2 = [], ...):

for line in fileinput.input(data):
    entries = re.split(' ', line.strip())
    Inputs.append(entries[0])
    Label1.append(entries[1])
    Label2.append(entries[2])
    ...

Then, for each array, copy and paste the "Writing labels" section:

print('Writing label 1')

# Size of buffer: 1000 elements to reduce memory consumption
for idx in range(int(math.ceil(len(Label1)/1000.0))):
    in_db_label = lmdb.open(lmdb_label1_name, map_size=int(1e12))
    with in_db_label.begin(write=True) as in_txn:
        for label_idx, label_ in enumerate(Label1[(1000*idx):(1000*(idx+1))]):
            im_dat = caffe.io.array_to_datum(np.array(label_).astype(float).reshape(1,1,1))
            # Encoded because LMDB keys must be bytes on Python 3
            key = '{:0>10d}'.format(1000*idx + label_idx).encode('ascii')
            in_txn.put(key, im_dat.SerializeToString())

        # Progress is counted against Label1, the array being written
        string_ = str(1000*idx+label_idx+1) + ' / ' + str(len(Label1))
        sys.stdout.write("\r%s" % string_)
        sys.stdout.flush()
    in_db_label.close()
print('')

You can also modify the code to automatically create the right number of arrays depending on the number of columns in your data file.
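
One way to do that is sketched here, assuming every line has the same number of whitespace-separated numeric columns after the image path. parse_columns is a hypothetical helper, not part of the script. It uses line.split() rather than re.split(' ', ...), so runs of several spaces (as in the example line earlier in this thread) do not produce empty entries.

```python
def parse_columns(lines):
    """Return (inputs, columns): image paths and one list of floats per label column."""
    inputs = []
    columns = []
    for line in lines:
        entries = line.split()  # tolerates repeated spaces and tabs
        if not entries:
            continue  # skip blank lines
        values = [float(v) for v in entries[1:]]
        if not inputs:
            # The first data line decides how many label arrays to create
            columns = [[] for _ in values]
        elif len(values) != len(columns):
            raise ValueError('expected %d labels, got %d: %r'
                             % (len(columns), len(values), line))
        inputs.append(entries[0])
        for column, value in zip(columns, values):
            column.append(value)
    return inputs, columns


# Made-up lines in the multi-label format discussed in this thread
lines = [
    'path/to/image.jpg 0.54   2.34   4.56  4  5.60  7.45',
    'path/to/other.jpg 1.0 2.0 3.0 1 4.0 5.0',
]
inputs, columns = parse_columns(lines)
```

Each entry of columns can then be written with its own copy of the "Writing labels" loop.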

Wei Guo

Feb 8, 2015, 1:00:12 AM
to caffe...@googlegroups.com
for line in fileinput.input(data):
    entries = re.split(' ', line.strip())
    Inputs.append(entries[0])
    Label1.append(entries[1:4])
    Label2.append(entries[4])
    Label3.append(entries[5:7])
    ...

And for Label1, should I modify the code as follows?

    with in_db_label.begin(write=True) as in_txn:
        for label_idx, label_ in enumerate(Label1[(1000*idx):(1000*(idx+1))]):
            im_dat = caffe.io.array_to_datum(np.array(label_).astype(float).reshape(1,3,1))

I don't know where the dimension 3 of the regression vector should go in the reshape arguments. Should it be reshape(1,3,1), reshape(3,1,1), or reshape(1,1,3)?

Best,

wei guo

Wei Guo

Feb 9, 2015, 9:04:13 AM
to caffe...@googlegroups.com
Dear all,

When I use Yoann's code to write an LMDB file, the Python program fails with the following error:
"
7000 / 10000Traceback (most recent call last):
  File "writelmdb.py", line 152, in <module>
    sys.stdout.flush()
lmdb.Error: mdb_txn_commit: Input/output error
"

It is raised at the following code:
string_ = str(1000*idx+label_idx+1) + ' / ' + str(len(Labels))
sys.stdout.write("\r%s" % string_)
sys.stdout.flush()
Can anyone tell me what the problem is?

Thanks,



On Friday, February 6, 2015 at 5:02:37 PM UTC+8, Yoann wrote:

丁少锦

Apr 2, 2015, 12:45:38 AM
to caffe...@googlegroups.com
Dear Yoann,
I'm new to Caffe. I'm trying to put a 10-dimensional label in an LMDB dataset. How should I modify the script to handle that?
Thanks!

On Tuesday, January 13, 2015 at 12:31:32 AM UTC+8, Yoann wrote:

wuxinh...@gmail.com

Apr 2, 2015, 9:51:32 AM
to caffe...@googlegroups.com
Could you tell me where you save the LMDB dataset for regression?


On Tuesday, January 13, 2015 at 12:31:32 AM UTC+8, Yoann wrote:

wuxinh...@gmail.com

Apr 2, 2015, 11:41:22 PM
to caffe...@googlegroups.com
How do I build the training, validation, and test sets for R-CNN? What work does that require?
R-CNN's training, validation, and test sets are completely different from those of a general classification model. Could you tell me? Thank you.


On Tuesday, January 13, 2015 at 12:31:32 AM UTC+8, Yoann wrote:

Xiao Yang

Apr 15, 2015, 3:36:46 PM
to caffe...@googlegroups.com
Hi Yoann,

Thanks for sharing your code! Also, do you know how to create an LMDB file with both image data and labels, the same way $caffe_root/tools/convert_imageset does? I want a Python version so I can do some extra work.

Best,
Xiao

On Monday, January 12, 2015 at 11:31:32 AM UTC-5, Yoann wrote:

Xiao Yang

Apr 15, 2015, 4:11:57 PM
to caffe...@googlegroups.com
I found a solution myself by looking into the array_to_datum() function in io.py.


Pastafarianist

Apr 24, 2015, 10:08:15 AM
to caffe...@googlegroups.com
I have encountered exactly the same problem. Have you found a solution?

On Sunday, February 8, 2015 at 9:00:12 UTC+3, Wei Guo wrote:

SHUBHAM PACHORI

Jul 29, 2015, 3:49:27 PM
to Caffe Users, yoann....@gmail.com
Hi Yoann,

First of all, thanks for sharing your code. I am working on a regression problem where my input (data) is an RGB image of size 256*256*3 and my output (labels) is also of size 256*256*3 (RGB). I am using your code to create my data and label files in LMDB format to train convolutional neural networks for regression in Caffe, but the data and label files are created separately. What modifications do I need to make to store data and labels in the same LMDB file? Also, is there any way to train my net in Caffe by importing the input files and target files separately? If so, could you please give me some hints?

Thanks!!

SHUBHAM PACHORI

Jul 29, 2015, 3:56:00 PM
to Caffe Users, yangx...@gmail.com
Hi Xiao,

First of all, thanks for sharing your code. I am working on a regression problem where my input (data) is an RGB image of size 256*256*3 and my output (labels) is also of size 256*256*3 (RGB). I am using the code given in this thread to create my image data and labels as LMDB files, but I am having trouble figuring out how to create a single LMDB file with both image data and labels. Could you please help with this?

Thanks

Xiao Yang

Jul 29, 2015, 5:25:18 PM
to Caffe Users, shubham...@iitgn.ac.in
Hi Shubham,

If your label is a scalar, you may want to take a look at the array_to_datum function in io.py.

In your case, since the input shape and label shape are equal, you can concatenate them into a new ndarray, then use a Slice layer in Caffe to separate them again.

For more general cases, it's better to use HDF5 or two LMDBs.
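
The concatenation idea can be illustrated without Caffe. This is only a sketch on nested lists indexed as [channel][row][col]; concat_channels and split_channels are hypothetical helpers. With NumPy arrays in C x H x W layout the stacking step would be np.concatenate((image, label), axis=0). On the Caffe side, a Slice layer on the channel axis with a slice point of 3 should play the role of split_channels, though that part is untested here.

```python
def concat_channels(image_chw, label_chw):
    """Stack two [channel][row][col] arrays along the channel axis.

    Returns the combined array and the slice point (number of image channels).
    """
    def height_width(arr):
        return len(arr[0]), len(arr[0][0])

    if height_width(image_chw) != height_width(label_chw):
        raise ValueError('image and label must share height and width')
    return image_chw + label_chw, len(image_chw)


def split_channels(combined, slice_point):
    """Undo concat_channels: cut the channel axis at slice_point."""
    return combined[:slice_point], combined[slice_point:]


# Tiny 3-channel 2x2 stand-ins for the 256x256 RGB image and RGB target
image = [[[c, c], [c, c]] for c in (1.0, 2.0, 3.0)]
label = [[[c, c], [c, c]] for c in (0.1, 0.2, 0.3)]

combined, slice_point = concat_channels(image, label)
data, target = split_channels(combined, slice_point)
```

The combined array has 6 channels, so a single LMDB entry carries both the input and its target.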

Best,
Xiao

On Wednesday, July 29, 2015 at 3:56:00 PM UTC-4, SHUBHAM PACHORI wrote:

Matt

Jul 29, 2015, 10:53:54 PM
to Caffe Users, shubham...@iitgn.ac.in, yangx...@gmail.com
Hi all,

I am sorry for the newb question. I am trying to do something pretty similar and am hesitant to post this, because the code here should have answered my question.

I'm trying to create and attach some 1-D time series data via LMDB to a CNN using pycaffe. For now, I'm using the code from the MNIST LeNet example and trying to create an LMDB using 9x100x1 numpy arrays (9 channels, 100x1 time series). I am following some example code from here as well as: https://groups.google.com/forum/#!msg/caffe-users/19XfmJqg34Q/0qBxNwEeSNkJ using array_to_datum() and have also tried manually creating a datum object and inserting into an LMDB similarly to here: http://deepdish.io/2015/04/28/creating-lmdb-in-python/. I've written a test that inserts a numpy array using this code, closes the db, and then reads it back out to ensure that it inserted correctly.

However, whenever I try to attach it to the MNIST example code with some tweaks, I always get "Check failed: shape[i] >= 0 (-3 vs. 0)" from blob.cpp when it tries to build the first convolutional layer. I've been working on this for a couple of days now and don't have the experience or skills to figure out what I'm doing wrong. I would be grateful if anyone had hints or ideas to point me in the right direction. Thanks!

Matt

Jul 30, 2015, 9:46:25 AM
to Caffe Users, shubham...@iitgn.ac.in, yangx...@gmail.com, matt...@gmail.com
I think I found roughly what was going on.  In conv_layer.cpp there is a compute_output_shape function with the line:

this->width_out_ = (this->width_ + 2 * this->pad_w_ - this->kernel_w_)
      / this->stride_w_ + 1;

With a width of 1 and the MNIST example's kernel width of 5 in the first convolutional layer, I was getting -3 for width_out_, which was then passed to the blob's Reshape() by the layer's SetUp() function.
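
The arithmetic can be reproduced in a few lines. This is a sketch of the quoted formula only, not Caffe code; conv_output_size is a hypothetical helper, and the MNIST width of 28 and the pad value of 2 are illustrative inputs. Here pad is the number of zero columns added on each side of the input before the kernel is applied.

```python
def conv_output_size(size, kernel, pad=0, stride=1):
    """Output width (or height) from the formula quoted above.

    C++ integer division truncates toward zero, so int(a / b) is used
    here instead of Python's floor division (//).
    """
    return int((size + 2 * pad - kernel) / stride) + 1


# The 9x100x1 time series has width 1; the LeNet kernel is 5 wide
time_series_width = conv_output_size(size=1, kernel=5)

# MNIST images are 28 wide, so the same kernel fits
mnist_width = conv_output_size(size=28, kernel=5)

# Padding 2 columns on each side makes a width-1 input large enough
padded_width = conv_output_size(size=1, kernel=5, pad=2)
```

The first call gives -3, matching the "(-3 vs. 0)" in the error message; the other two give 24 and 1.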

What it means is that I need to go back and brush up on my intuition for the implications of kernel width, stride width, etc. (what is pad width?). But at least I have some idea of where my mistake is happening.

Thanks,
Matt

Majid Azimi

Mar 9, 2016, 3:29:24 PM
to Caffe Users
Hi Yoann,

I have a regression problem, but in my case the label is an image. What can I do in this case, and how should I change your code? I have an RGB image and its label is a grayscale image with values from 0 to 1. I would really appreciate your help.

Jeremy Pinto

May 26, 2016, 2:14:45 PM
to Caffe Users
Hello, and thanks for the code, it works very well! My original images are .jpg files. Should they be normalized beforehand, or are integer values from 0 to 255 acceptable? I have continuous Y_label values that are normalized. Thanks.

J

Yoann

May 26, 2016, 3:00:09 PM
to Caffe Users
Hi,

You do not need to normalize the pictures if you use the script already available in Caffe that computes the mean image from the LMDB (I don't remember the name, sorry).
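
The idea behind a mean image can be sketched in a few lines. This is an illustration on flattened toy images, not the Caffe tool itself (most likely compute_image_mean in Caffe's tools directory, though that is an assumption about which script is meant); mean_image and subtract_mean are hypothetical helper names.

```python
def mean_image(images):
    """Per-pixel mean over a list of equally sized, flattened images."""
    count = len(images)
    return [sum(pixels) / count for pixels in zip(*images)]


def subtract_mean(image, mean):
    """Center one flattened image by subtracting the mean image."""
    return [pixel - m for pixel, m in zip(image, mean)]


# Two made-up 3-pixel "images" with values in the 0 to 255 range
images = [
    [0.0, 128.0, 255.0],
    [10.0, 128.0, 245.0],
]
mean = mean_image(images)
centered = subtract_mean(images[0], mean)
```

Subtracting the mean centers the inputs around zero, whatever their original range.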

Best,
Yoann

p.Paul

Feb 23, 2017, 11:21:30 AM
to Caffe Users
I have the same error. How did you fix it?

images_db = lmdb.open(images_file)  #, map_size=int(1e12), map_async=True, writemap=True)
lmdb.Error: ../data/HPElmdb/image-lmdb_train: Input/output error

lmdb.Error: mdb_txn_commit: Input/output error

in_db_label = lmdb.open(lmdb_label1_name, map_size=int(1e12))

Ranju Mandal

Jan 16, 2018, 8:45:02 PM
to Caffe Users
Hi Yoann,

Do I need to resize my images beforehand? I don't see any image resize in the code.