In the paper, there is a paragraph as follows: Why is this straightforward approach inefficient? Why does the RoI network still allocate memory and perform the backward pass for all RoIs? Most RoIs have zero loss, so does the network still calculate gradients for them, even though they contribute nothing to the loss? Could anyone explain this? Thanks.