Hi All-
I have a net that requires some pretty complex data augmentation / selection / preparation before the data is fed in. Being far more capable with Python than C/C++, I've implemented all of this in Python. What I'm thinking of doing now is fetching the next batch of data in parallel while Caffe runs the current batch through the net, all via the pycaffe interface. Since I'm going to be training in this manner, I have a few questions.
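For concreteness, here's a rough sketch of the training loop I have in mind. This assumes the net declares input blobs named "data" and "label" (since it has no data layer), and load_and_augment_batch is just a stand-in for my real preprocessing; here it fabricates random batches so the sketch is self-contained:

```python
import threading
import numpy as np
import caffe

try:
    import Queue  # Python 2, which classic pycaffe targets
except ImportError:
    import queue as Queue  # Python 3

BATCH, CHANNELS, H, W = 32, 3, 227, 227

def load_and_augment_batch():
    # Stand-in for my real augmentation/selection/preparation code;
    # here it just fabricates a random batch of the right shape.
    data = np.random.randn(BATCH, CHANNELS, H, W).astype(np.float32)
    labels = np.random.randint(0, 10, size=(BATCH,)).astype(np.float32)
    return data, labels

def prefetch_worker(q):
    while True:
        q.put(load_and_augment_batch())

q = Queue.Queue(maxsize=2)  # stay at most 2 batches ahead of the GPU
t = threading.Thread(target=prefetch_worker, args=(q,))
t.daemon = True
t.start()

caffe.set_mode_gpu()
solver = caffe.SGDSolver('solver.prototxt')  # my solver definition

for it in range(100000):
    data, labels = q.get()
    # Copy the prefetched batch into the net's input blobs, then step once
    solver.net.blobs['data'].data[...] = data
    solver.net.blobs['label'].data[...] = labels
    solver.step(1)  # one forward/backward/update on this batch
```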
(1) How much overhead would people expect the disk -> Python -> GPU data shuttling to introduce?
(2) Since the net won't actually have a data layer, does this mean I will effectively be training on a deploy network? Does it still make sense to produce a train_val.prototxt?
(3) Will Caffe still attempt to test the net while I'm calling the forward and backward passes myself? I'm leaning towards no, since there won't actually be a solver driving a train_val network, but this is still unclear to me.
(4) Does it make more sense to engineer a Python layer as the data source? A rough sketch of what I'm imagining is below.
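This follows the caffe.Layer interface, with the prototxt wiring sketched in the docstring; again, the random data is just a placeholder for my real preprocessing, and I understand this route requires Caffe built with WITH_PYTHON_LAYER := 1:

```python
import numpy as np
import caffe

class AugmentingDataLayer(caffe.Layer):
    """Feeds augmented batches as a data source.

    Wired into the train prototxt with something like:
        layer {
          name: "data"
          type: "Python"
          top: "data"
          top: "label"
          python_param { module: "augmenting_data_layer"
                         layer: "AugmentingDataLayer" }
        }
    """

    def setup(self, bottom, top):
        self.batch = 32  # batch size; could be parsed from self.param_str

    def reshape(self, bottom, top):
        # Fixed output shapes here; could be made dynamic per batch
        top[0].reshape(self.batch, 3, 227, 227)
        top[1].reshape(self.batch)

    def forward(self, bottom, top):
        # My augmentation/selection/preparation would run here;
        # random data as a placeholder.
        top[0].data[...] = np.random.randn(*top[0].data.shape)
        top[1].data[...] = np.random.randint(0, 10, size=self.batch)

    def backward(self, top, propagate_down, bottom):
        pass  # a data layer has nothing to backpropagate
```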
(5) Is it possible to omit a test phase altogether?
Thanks!