To follow up on this with more details and precision:
1. When we developed the new feature extractor for CoralNet using EfficientNet, we experimented with a few different patch sizes and found that accuracy degraded with smaller patches. For example, reducing the patch size from 224x224 to 168x168 led to a 1% absolute reduction in accuracy across different versions of EfficientNet and ResNet. So we stuck with 224x224 for all the subsequent research.
2. Re: box vs. crosshair. The training semantics are that the label applies to the point under the crosshair; the box is there to build the context of the region around it. The meaning is not the "average of the region." If the region is 40% sand but the center is Acropora, the label is Acropora. The context around a point is crucial for determining what's at that point, and the network clearly weighs the central point more heavily than the outer points. One consequence of a larger context is that we can't label points within a border of 224/2 = 112 pixels from the image edge.
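The patch semantics above can be sketched in a few lines of NumPy. This is an illustrative sketch, not CoralNet's actual code; the function names and the simple slicing are my own, but it shows the two rules from point 2: the label belongs to the center pixel, and a point is only labelable if a full 224x224 patch fits around it (the 112-pixel border).

```python
import numpy as np

PATCH_SIZE = 224
HALF = PATCH_SIZE // 2  # 112-pixel border where points can't be labeled

def labelable(row: int, col: int, height: int, width: int) -> bool:
    """True if a full 224x224 patch fits around the point."""
    return HALF <= row < height - HALF and HALF <= col < width - HALF

def extract_patch(image: np.ndarray, row: int, col: int) -> np.ndarray:
    """Crop the 224x224 context patch centered on an annotated point.

    The patch's label is whatever lies under the center pixel,
    not an average over the region.
    """
    assert labelable(row, col, image.shape[0], image.shape[1])
    return image[row - HALF:row + HALF, col - HALF:col + HALF]

# Example: a 1000x1500 RGB image with one annotated point
image = np.zeros((1000, 1500, 3), dtype=np.uint8)
assert labelable(500, 700, 1000, 1500)
assert not labelable(50, 700, 1000, 1500)   # too close to the top edge
patch = extract_patch(image, 500, 700)
assert patch.shape == (224, 224, 3)
```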
3. There are other ways to establish context in computer vision beyond patches, but ultimately there's some region of the image that supports a decision.
4. Right from the start of CoralNet, there was an idea to rescale the image so that a pixel corresponds to a standard size on the benthic surface (e.g., one pixel is 0.5mm). If this could be applied over all training data and inference, it would make classification easier. To do this successfully, one would need to know the field of view of the camera-lens combination (easy) and the distance to the surface for every pixel (hard), and even an approximate distance (e.g., the transect was taken at 1 meter) is just not widely available. We opted for better coverage in CoralNet and just trained over a wide range of conditions, enjoying the benefit of lots of data and deep neural networks.
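To make the rescaling idea in point 4 concrete, here is a small sketch under a simple pinhole-camera assumption: at distance d, a lens with horizontal field of view theta images a strip of the surface of width 2 * d * tan(theta / 2). The function names, the 0.5mm target, and the example numbers are all illustrative, not from CoralNet.

```python
import math

def mm_per_pixel(fov_deg: float, distance_mm: float,
                 image_width_px: int) -> float:
    """Physical width covered by one pixel, assuming a pinhole model:
    the camera sees a surface strip of width 2 * d * tan(fov / 2)."""
    surface_width_mm = 2 * distance_mm * math.tan(math.radians(fov_deg) / 2)
    return surface_width_mm / image_width_px

def rescale_factor(fov_deg: float, distance_mm: float,
                   image_width_px: int,
                   target_mm_per_px: float = 0.5) -> float:
    """Factor to resize the image so one pixel covers
    target_mm_per_px on the benthic surface."""
    return mm_per_pixel(fov_deg, distance_mm, image_width_px) / target_mm_per_px

# Example: 60-degree horizontal FOV, transect at 1 m, 4000-px-wide image
scale = rescale_factor(fov_deg=60.0, distance_mm=1000.0, image_width_px=4000)
# scale < 1 means the image would be downsampled to reach 0.5 mm/pixel
```

The hard part, as noted above, is that distance_mm is rarely known per image (let alone per pixel), which is why training over a wide range of scales won out in practice.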
I hope this provides more context about CoralNet. More details can be found in:
David