After reading Girshick et al.'s articles on the matter and looking at the code, I'm still at a loss on this problem:
How does the network get around the fact that the number of predicted bounding boxes varies from image to image? If there are 3 positive predictions in image1 and 5 in image2, how is the loss calculated? I understand that the differing sizes are not a problem, since the RoI pooling layer transforms every region to the same fixed size, but what about the number of proposals? I suspect there's a separate loop for the loss calculation somewhere, but unfortunately I can't get my head around it.
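To make the question concrete, here is a toy sketch of what I imagine might be going on. This is purely my guess, not taken from the actual code. The RoIs of all images are flattened into one list, a loss is computed per RoI, and the results are averaged, so the count per image never matters. The names `softmax_cross_entropy` and `batch_loss` and the toy scores are all made up for illustration.

```python
import math

def softmax_cross_entropy(logits, label):
    """Cross-entropy loss for a single RoI, given its raw class scores."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def batch_loss(per_image_rois):
    """Average the per-RoI loss over every RoI of every image.

    per_image_rois: one list per image, each holding (logits, label) pairs.
    Images are allowed to contribute different numbers of RoIs.
    """
    # Flatten: after this step it no longer matters which image an RoI came from.
    flat = [roi for rois in per_image_rois for roi in rois]
    losses = [softmax_cross_entropy(logits, label) for logits, label in flat]
    return sum(losses) / len(losses), len(flat)

# Hypothetical toy data with two classes (0 = background, 1 = object).
image1 = [([0.0, 0.0], 1)] * 3  # 3 RoIs
image2 = [([0.0, 0.0], 0)] * 5  # 5 RoIs

loss, n_rois = batch_loss([image1, image2])
print(n_rois, loss)
```

If something like this is what happens, then the loss is just a mean over a flat list of 8 RoIs and no per-image loop is needed. Is that the right mental model, or is there more to it?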
Thanks