By 'fine-grained visual information', the author is referring to the case where the object to be detected is small, which makes it hard for the network to detect. As the image passes through the network, the object's feature information can be lost after pooling or ReLU layers. It is therefore preferable to rescale the image so that the features of tiny objects can still be extracted.
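To make the effect concrete, here is a rough sketch of how much of a small object survives repeated downsampling. The numbers are illustrative assumptions, not taken from the paper: a 16-pixel-wide object, five stride-2 stages (a total stride of 32), and a 2x upscaling of the input. The function name `footprint_after_downsampling` is hypothetical.

```python
def footprint_after_downsampling(object_px, num_stride2_stages):
    """Approximate width of an object, in feature-map cells, after
    a number of stride-2 downsampling stages (each stage halves it)."""
    return object_px / (2 ** num_stride2_stages)


# Assumed values, for illustration only.
object_width_px = 16   # width of a small object in the input image
stages = 5             # five stride-2 stages, i.e. a total stride of 32
upscale_factor = 2     # rescale the input image to twice its size

original = footprint_after_downsampling(object_width_px, stages)
rescaled = footprint_after_downsampling(object_width_px * upscale_factor, stages)

print(original)  # 0.5 -> the object covers less than one feature-map cell
print(rescaled)  # 1.0 -> after upscaling it covers a full cell
```

Under these assumptions, the object occupies only half a cell in the final feature map at the original resolution, so its features are easily swamped by the background. Rescaling the input gives it a full cell to itself, which is the intuition behind preferring a larger input for tiny objects.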