What you have in mind is a "localisation" or "detection" task, which is much harder than plain "classification". There are several ways to approach it: train a fully convolutional neural network (FCNN) to produce a segmentation, then post-process its output; or train a classifier CNN and either find a way to propose interesting spots ("regions of interest") to classify, or slide a window over the image and classify every patch. That's all I know; anyone should feel free to correct what I said.
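To make the "try all patches and classify" option concrete, here is a minimal sliding-window sketch in Python with NumPy. Everything in it is an assumption for illustration: `sliding_windows`, `detect`, and `toy_classifier` are hypothetical names, and `toy_classifier` just scores a patch by its mean brightness where a trained CNN would normally go.

```python
import numpy as np

def sliding_windows(image, window, stride):
    """Yield (row, col, patch) for every window position that fits in the image."""
    height, width = image.shape[:2]
    win_h, win_w = window
    for row in range(0, height - win_h + 1, stride):
        for col in range(0, width - win_w + 1, stride):
            yield row, col, image[row:row + win_h, col:col + win_w]

def detect(image, classify_patch, window=(8, 8), stride=4, threshold=0.5):
    """Return (row, col, score) for each patch scoring above the threshold."""
    hits = []
    for row, col, patch in sliding_windows(image, window, stride):
        score = classify_patch(patch)
        if score > threshold:
            hits.append((row, col, score))
    return hits

def toy_classifier(patch):
    # Hypothetical stand-in for a trained CNN: score = mean brightness.
    return float(patch.mean())

# Toy image: black background with one bright 8x8 "object".
image = np.zeros((16, 16), dtype=np.float32)
image[8:16, 4:12] = 1.0

detections = detect(image, toy_classifier, threshold=0.9)
print(detections)  # top-left corners of the windows that fired
```

In practice you would probably also repeat this at several window sizes (or image scales) and merge overlapping hits, which is a large part of why detection costs so much more than classification.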
Keywords: sliding window, object detection, object localisation, segmentation, fully convolutional neural network.