Backpropagating through Concat layer

pwj

Mar 28, 2018, 4:59:45 PM
to Caffe Users
Dear Caffe users,

This is a question about how backpropagation is handled in Concat layers:

Let's assume we have two layers A and B, with weights a and b, that process data of batch sizes x and y, respectively.
The top blobs of A and B get concatenated in a Concat layer C along dimension 0 to yield a top blob of batch size x+y, which is forwarded to a fully connected layer F with weights f
that projects to a SoftmaxWithLoss layer L (which, of course, also receives the x+y concatenated labels).
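
For concreteness, here is roughly the kind of net I have in mind, sketched with pycaffe's NetSpec (the batch sizes, the feature size of 64 and num_output=10 are just placeholders, and Input layers stand in for the real layers A and B):

import caffe
from caffe import layers as L

x, y = 4, 6  # placeholder batch sizes for the two branches

n = caffe.NetSpec()
# Input layers standing in for the top blobs of A and B.
n.a = L.Input(shape=dict(dim=[x, 64]))
n.b = L.Input(shape=dict(dim=[y, 64]))
n.label = L.Input(shape=dict(dim=[x + y]))
# C: concatenate along the batch dimension (axis 0) -> shape (x+y, 64).
n.concat = L.Concat(n.a, n.b, axis=0)
# F: fully connected layer projecting to the class scores.
n.fc = L.InnerProduct(n.concat, num_output=10)
# L: a single loss over all x+y samples and their concatenated labels.
n.loss = L.SoftmaxWithLoss(n.fc, n.label)

print(n.to_proto())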

During backpropagation I would expect the flow to:
i) backward-pass to F the derivative of L w.r.t. f for all x+y batch items (i.e. f gets updated considering all x+y samples)
ii) backward-pass to A the derivative of L w.r.t. a (chained through dL/df) for the first x batch items
iii) backward-pass to B the derivative of L w.r.t. b (chained through dL/df) for the last y batch items
That is, the Concat layer now splits the merged information flow up again.
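
In NumPy terms, this is what I imagine the Concat backward doing with the gradient that arrives at its top (shapes are again just placeholders):

import numpy as np

x, y, d = 4, 6, 64  # placeholder batch sizes and feature dimension

# Forward: stack A's and B's top blobs along the batch axis.
top_a = np.random.randn(x, d)
top_b = np.random.randn(y, d)
top_c = np.concatenate([top_a, top_b], axis=0)  # shape (x+y, d)

# Backward: the gradient dL/d(top_c) coming from F and the loss
# is simply sliced apart along the same axis.
grad_top_c = np.random.randn(x + y, d)  # stand-in for dL/d(top_c)
grad_top_a = grad_top_c[:x]             # routed back to A
grad_top_b = grad_top_c[x:]             # routed back to B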

(1) Is my understanding of backpropagating through the Concat layer correct?
(2) Furthermore, would it make any difference for the updating of the weights a and b if we did not use the concatenation and a single SoftmaxWithLoss layer, but instead two separate SoftmaxWithLoss layers?
(3) What if we now concatenate the blob of A with itself along dimension 0, to yield a top blob of batch size 2x? Does A now receive two gradients, such that the weights of A get updated twice?

Thanks for any insights!

Przemek D

Mar 29, 2018, 4:22:04 AM
to Caffe Users
1. Yes, this is exactly how it works!
2. No difference at all from the mathematical point of view (a quick NumPy check of this is sketched below, after point 3). It might be more convenient to concatenate, as you have fewer blobs to care about (and to access from Python, for example). On the backend side, Concat means an additional data copy and more RAM used (for the intermediate, concatenated blob); however, the loss function's GPU kernel might be executed slightly more efficiently (a single kernel launch over a larger tensor, as opposed to two kernel launches over smaller tensors).
3. If you concatenate A with itself, it will receive the sum of the gradients coming from the top (but technically, it will still be updated just once; see the second sketch below). The actual value of the update will of course depend on what happens above the Concat, but in the trivial case of concatenating A along dim 0 and computing the loss right after that, with the labels also concatenated the same way, the update will simply be twice as large.
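
To convince yourself of point 2, here is a quick NumPy check (it uses the summed, i.e. unnormalized, cross-entropy so that both setups are normalized the same way; the shapes and labels are made up). The gradient reaching F's output is identical either way, so the updates to f, a and b are identical too:

import numpy as np

def softmax_xent_grad(logits, labels):
    # Gradient of the summed softmax cross-entropy loss w.r.t. the logits.
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    p[np.arange(len(labels)), labels] -= 1.0
    return p

rng = np.random.default_rng(0)
x, y, C = 4, 6, 10
logits_a = rng.normal(size=(x, C))   # F's outputs for A's samples
logits_b = rng.normal(size=(y, C))   # F's outputs for B's samples
labels_a = rng.integers(0, C, size=x)
labels_b = rng.integers(0, C, size=y)

# One loss over the concatenated batch...
g_joint = softmax_xent_grad(np.concatenate([logits_a, logits_b]),
                            np.concatenate([labels_a, labels_b]))
# ...versus two separate losses, one per branch.
g_split = np.concatenate([softmax_xent_grad(logits_a, labels_a),
                          softmax_xent_grad(logits_b, labels_b)])

assert np.allclose(g_joint, g_split)  # identical gradients either way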
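
And for point 3, this is roughly what the Concat backward does in the self-concatenation case (shapes made up again):

import numpy as np

x, d = 4, 64                          # placeholder shapes
grad_top = np.random.randn(2 * x, d)  # dL/d(top) arriving at the Concat

# Both halves originate from the same bottom blob, so their gradient slices
# are summed into one accumulated gradient for A: one update, not two.
grad_a = grad_top[:x] + grad_top[x:]

# In the trivial case where both halves see the same (concatenated) labels,
# the two slices are equal and grad_a is just twice the single-branch gradient.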

Hope that helps!