Text to protobuf Message C code?


Robert Frantz

May 10, 2016, 4:57:33 PM
to Caffe Users
I am trying to write an interface to convert text from an SQL database to a Datum (::protobuf::Message) format for input to my Caffe model.
 
From looking at the existing code, I need to convert the text representation to the coded Message format, or write my own parsing code to initialize the Blob.
 
Does anyone here know of existing C code to convert text (ints and floats represented as chars) to Message (varint, 32-bit, etc.) format?

Jan

May 11, 2016, 4:45:04 AM
to Caffe Users
First you should probably think about how you would convert the text into the values that are actually fed through the network (because that is what you need to put into the Datum.data field). I am not sure using raw ASCII values is a good idea, but on the other hand I have never worked with text input. If you really want to do that, you can directly copy the data into the Datum.data field: it is of type "bytes", and in ASCII every letter is one byte. As long as you set width = height = 1 and channels to the number of letters, you should be fine. Whether that works well is a completely different question, however ...

Jan

Robert Frantz

May 11, 2016, 2:11:17 PM
to Caffe Users
Thanks for your response Jan.
 
Raw ASCII values do not work, as the existing datum.ParseFromString() function expects protobuf::Message-encoded values.
 
I have written a simple parse function which converts all text int and float values to encoded ASCII float values.
 
Turns out just using float is simpler than converting int values to varint.
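For context on that last point, here is a rough stdlib-only sketch of the two protobuf wire encodings being compared (varint vs. the fixed 4-byte float). It is illustrative only, not Caffe or protobuf library code:

```python
import struct

def encode_varint(n):
    # protobuf varint: 7 bits per byte, least-significant group first,
    # high bit of each byte set when more bytes follow
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_float(x):
    # protobuf 32-bit float: fixed 4 bytes, little-endian IEEE 754
    return struct.pack('<f', x)

print(encode_varint(300).hex())        # ac02
print(encode_float(34533.0).hex())     # always exactly 4 bytes
```

Varints have a variable length that depends on the magnitude of the value, which is why encoding everything as fixed-width floats is the simpler route.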

Jan

May 12, 2016, 6:11:55 AM
to Caffe Users
I am not sure what you mean by


Raw ASCII values do not work as the existing datum.ParseFromString() function expects protobuf::Message encoded ASCII values.

Of course that doesn't work: ParseFromString is a function for deserializing a serialized "Datum" message, which has nothing to do with what you want to achieve. You just create a Datum message and fill your string into the data field. In Python this could be done as:

s = 'my text'

d = Datum()
d.width = 1
d.height = 1
d.channels = len(s)
d.data = s  # "data" is a scalar bytes field, so assign rather than extend
# maybe assign d.label

# put d into an lmdb or leveldb


Jan



Jan

May 12, 2016, 6:16:50 AM
to Caffe Users
Or did I misunderstand you in the way that you actually have a (human-readable) text representation of a Datum message (in the same way a network config is a text representation of a NetParameter message), and want to deserialize that into an actual Datum object?

That can be done using the text_format helper functions of the protobuf library; for Python these are documented here, for C++ here.
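In Python, that deserialization looks roughly like the sketch below. Since pycaffe's Datum class isn't assumed available here, FileDescriptorProto (which ships with the protobuf package) stands in for it:

```python
from google.protobuf import text_format
from google.protobuf.descriptor_pb2 import FileDescriptorProto

# parse a human-readable text representation into a message object;
# with pycaffe installed you would use caffe.proto.caffe_pb2.Datum instead
msg = text_format.Parse('name: "example.proto"', FileDescriptorProto())
print(msg.name)  # example.proto
```

The inverse direction is text_format.MessageToString(msg), which produces the same human-readable form the Caffe net configs use.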

Jan

Robert Frantz

May 12, 2016, 6:19:02 PM
to Caffe Users
Thank you for the follow up, Jan.
 
I have a large number of rows, each an array of floats and ints retrieved from my SQL DB in text format (e.g. "12.2343" or "34533"), separated by spaces. Each row is currently terminated by \n, but that is easily changed.
 
I am part way through parsing these values and converting them (all to floats) into the protobuf Message format readable by the datum.ParseFromString() function.
 
Looking at the text_format functions, it is not obvious to me which function there, if any, can do this.
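The row format described here (space-separated numbers, newline-terminated rows) can be handled with a short plain-Python sketch; the function name is mine, not from the thread:

```python
def parse_rows(text):
    # one row per line; each token is an int or float in text form,
    # and everything is promoted to float (as discussed in the thread)
    return [[float(tok) for tok in line.split()]
            for line in text.splitlines() if line.strip()]

sample = "12.2343 34533\n1.5 2\n"
print(parse_rows(sample))  # [[12.2343, 34533.0], [1.5, 2.0]]
```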

Jan

May 13, 2016, 3:47:40 AM
to Caffe Users
Ok, but what do you actually want to do with those values? Directly feed them through a caffe net? Or build a database in LMDB/LevelDB/HDF5 format which can then be used for training a caffe net? Is this for training a network or just for using a trained network (inference)?

Depending on your actual goal, I'd recommend a different approach. For instance, if you want to feed these values directly into a trained network to get a prediction, you don't need to bother with the "Datum" message. Just convert those lines to sequences of floats and put them directly into input blobs.
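A minimal sketch of that last step in plain Python (no Caffe dependency; the final assignment into a net is shown only as a comment, and the blob name is hypothetical):

```python
def to_blob_shape(rows):
    # arrange each row as (channels, height=1, width=1), matching the
    # width = height = 1, channels = number-of-values layout from earlier
    return [[[[v]] for v in row] for row in rows]

rows = [[12.2343, 34533.0]]
blob = to_blob_shape(rows)   # shape (N=1, C=2, H=1, W=1)
# with a real net (requires numpy + pycaffe, names are illustrative):
# net.blobs['data'].data[...] = np.array(blob, dtype=np.float32)
```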

Jan

Robert Frantz

May 13, 2016, 1:29:19 PM
to Caffe Users
At present, I am directly feeding them to a caffe net, and have figured out how to use them to initialize a Datum directly. Thanks for the suggestion.
 
But I thought the resulting Blobs could be used for training a caffe net. Do Blobs have to originate from LMDB, etc., to be used for training?
 
I have been under the maybe mistaken impression that the source of training data in a Blob does not matter. Not true?

Jan

May 24, 2016, 9:40:53 AM
to Caffe Users
The source does not matter, of course. It is just more convenient to have the data inside a DB for training, since the data layer in the caffe net can pull the next training examples as needed, and also rewind as needed. One other possibility is to use a "MemoryData" layer, but in that case you must feed all training examples to caffe at once. If you do serious training you usually have too much data for that (well, or you need huge amounts of RAM). You could also use an "Input" layer that just specifies the input blobs, but then for every single iteration it is YOUR job to put in the data, which requires a lot of additional logic and is not really recommended.
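For reference, the "Input" layer variant is declared in the net prototxt roughly like this (the layer name and shape here are illustrative, not taken from the thread):

```
layer {
  name: "data"
  type: "Input"
  top: "data"
  input_param { shape: { dim: 1 dim: 10 dim: 1 dim: 1 } }
}
```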

What I was saying about the "Datum" data structure is this: the "Datum" structure is just a format used in interaction with LMDB/LevelDB, and it is ONLY used there. Core caffe and all layers other than the LMDB/LevelDB "Data" layer do not use it. So you are only really required to use it if you are doing LMDB/LevelDB work.

Jan

robosmith

May 25, 2016, 12:48:14 PM
to Caffe Users
Thanks again, Jan.
 
What I have done is write an interface to SQL which mimics the LMDB/LevelDB interface by adding a new backend type and Layer setup class, so I can use my SQL data w/o having to understand the whole process.
 
What still confuses me is that it appears the Blobs are only populated by the pre-fetch thread and not by the original call to value(). Is that the case, and why?
 
It appears the first set of data retrieved by the initial value() call is ignored.

Jan

Jun 7, 2016, 7:20:29 AM
to Caffe Users
The top blobs are not populated in the pre-fetch thread, they are populated when forward() is called on the data layer. The pre-fetch thread only loads the actual data into some local storage, to be moved to the top blobs when appropriate.

I am not sure what you mean by "value() call" and how that is connected to the loading of data.

Jan

robosmith

Jun 7, 2016, 3:41:39 PM
to Caffe Users
cursor_->value() is the call to load a string from the database, which is parsed to initialize the Datum: datum.ParseFromString(cursor_->value());
 
I have determined that the first cursor_->value() call is ignored but is duplicated in the prefetch thread before cursor_->Next() is called to move the cursor to the next image record.
 
BTW, it appears that for batch_size = 1, there are redundant calls to datum.ParseFromString(cursor_->value()); which may unnecessarily slow things down.
