We all know that Faster-RCNN consists of two modules: the RPN module, which produces proposals (candidate object locations, i.e. a set of rectangles), and the Fast-RCNN module, which classifies each proposal into one class (e.g. airplane) and refines the proposal into a more precise location.
My problem is this: once I have all the proposals for the image in question from the output of the RPN module, how can I use these proposals simultaneously for another task, such as image captioning? More precisely, how can I implement a new Caffe layer that exploits all these proposals at once? The Fast-RCNN module of Faster-RCNN just classifies each proposal into one class and refines it into a more precise location, one proposal at a time. And this is what Caffe does in other CNN architectures too, e.g. DeepID2, AlexNet, VGG_16. But I want to exploit these proposals simultaneously!
My motivation is that some vision tasks depend on many parts of an image at once (e.g. all objects in an image, or all facial parts of a face image). So once I have all these proposals, how can I implement a new Caffe layer that processes them simultaneously?
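To make the question concrete, here is a minimal sketch of the kind of layer I have in mind. The class name, the max-pooling aggregation, and the blob shapes are all my own assumptions, not part of Faster-RCNN; in real Caffe this would subclass `caffe.Layer` and be declared as a `Python` layer in the prototxt, but plain NumPy arrays stand in for blobs here so the idea is self-contained:

```python
import numpy as np

class ProposalAggregateLayer:
    """Hypothetical layer that consumes ALL proposal features at once.

    bottom[0] holds one feature vector per proposal (num_proposals x channels),
    e.g. the output of RoI pooling plus a fully connected layer. Instead of
    handling proposals one by one, the layer max-pools element-wise across
    all proposals, producing a single image-level vector (1 x channels) that
    a downstream task (e.g. a captioning head) could consume.
    """

    def setup(self, bottom, top):
        pass  # nothing to configure in this sketch

    def reshape(self, bottom, top):
        # output: one aggregated feature vector per image
        top[0] = np.zeros((1, bottom[0].shape[1]), dtype=bottom[0].dtype)

    def forward(self, bottom, top):
        # element-wise max over all proposals at once
        top[0][...] = bottom[0].max(axis=0, keepdims=True)

    def backward(self, top_diff, bottom, bottom_diff):
        # route each channel's gradient to the proposal that won the max
        winners = bottom[0].argmax(axis=0)  # shape: (channels,)
        bottom_diff[0][...] = 0
        cols = np.arange(bottom[0].shape[1])
        bottom_diff[0][winners, cols] = top_diff[0][0]
```

Usage with three proposals and two feature channels:

```python
layer = ProposalAggregateLayer()
feats = np.array([[1., 5.], [3., 2.], [0., 4.]], dtype=np.float32)
bottom, top = [feats], [None]
layer.reshape(bottom, top)
layer.forward(bottom, top)
print(top[0])  # [[3. 5.]] — one vector summarizing all proposals
```

Max-pooling is just one choice of aggregation; averaging, or feeding the full `num_proposals x channels` matrix to a recurrent layer, would equally treat the proposals jointly rather than one by one.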
The figure below shows how Caffe processes proposals one by one:
![](https://lh3.googleusercontent.com/-NCO80bjVyRA/Vz8Cw_3bonI/AAAAAAAAAAM/wSKV8HkaiAstv0UgLPjH0CUlr7g9Vdf2QCLcB/s320/5D%2528YZ%255B%255D9%2540%255B0%2524WV1%2560YV%2560ZUD6.png)