Dear Yannik,
thank you for bringing this to our attention. In the following I will briefly describe 1) our reasoning behind using F1, 2) why it fails in some cases, and 3) how we will assess submissions moving forward.
Why F1?
The main idea of using a detection-based metric (F1) is to assess the quality of the segmentation from the detection perspective, especially the detection of smaller obstacles, which have only a minor impact on mIoU. In other words, it is meant to distinguish good methods from great ones. For this reason, we believe using only mIoU is not the best fit for this task.
Where does it go wrong?
However, the F1-score will indeed fail in certain corner cases, as you mentioned above. This is due to the way the numbers of TP, FN and FP detections in an image are calculated. TPs and FNs are determined based on segmentation coverage, i.e. whether or not the obstacle is covered by the predicted obstacle segmentation mask. Naturally, this leads to 100% recall when everything is predicted as an obstacle. That alone, however, is not the issue. The problem stems from the way FPs are calculated: as the number of predicted connected components on areas labelled as 'water' in the GT. When everything is predicted as an obstacle, there is just one huge component covering the entire image, so only a single FP is counted.
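To make the failure mode concrete, here is a minimal sketch of this counting scheme in pure Python (the function name and mask representation are hypothetical; the actual challenge evaluation code may differ):

```python
def count_false_positives(pred_obstacles, gt_water):
    """Count FP detections as the number of 4-connected components of the
    predicted-obstacle mask restricted to GT water pixels.
    Both masks are H x W lists of lists of bools."""
    h, w = len(gt_water), len(gt_water[0])
    # Predicted obstacle pixels that fall on GT water.
    fp = [[pred_obstacles[y][x] and gt_water[y][x] for x in range(w)]
          for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    components = 0
    for y in range(h):
        for x in range(w):
            if fp[y][x] and not seen[y][x]:
                components += 1
                # Flood-fill this component so it is counted once.
                stack = [(y, x)]
                seen[y][x] = True
                while stack:
                    cy, cx = stack.pop()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and fp[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return components

water = [[True] * 5 for _ in range(5)]
everything = [[True] * 5 for _ in range(5)]
print(count_false_positives(everything, water))  # 1 -- the corner case

two_blobs = [[False] * 5 for _ in range(5)]
two_blobs[0][0] = True
two_blobs[4][4] = True
print(count_false_positives(two_blobs, water))  # 2
```

As the example shows, predicting the entire image as an obstacle merges into a single component on water and costs only one FP, which is exactly the corner case raised above.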
Why are FPs computed this way?
During the development of the metric we considered different ways to compute FPs and their trade-offs. Perhaps the most straightforward measure is the area of FP segmentation. However, this cannot easily be combined with the counts of TPs and FNs into a traditional metric such as the F1-score. For this reason we opted for the connected-components method described above, which yields a count of FPs. Furthermore, this way we penalize methods that produce many small FP predictions, which would severely limit the navigation of the boat but are barely reflected in mIoU.
How can we fix this problem?
In the text above we have already seen some key differences between mIoU and F1. mIoU captures the general segmentation quality, but is not well suited for assessing critical yet small segmentation details, such as missing small obstacles or predicting many small FPs. F1, on the other hand, penalizes methods for missing small obstacles and for predicting many FPs, but fails in extreme cases such as predicting everything as an obstacle. Such cases, however, severely impact the mIoU score. For this reason, we believe a combination of mIoU and F1 can paint the whole picture.
Solution, TL;DR
From now on we will use a combined metric called Quality (Q), computed as Q = F1 x mIoU, to determine the standings in our USV segmentation challenges (USV-based Obstacle Segmentation and USV-based Embedded Obstacle Segmentation). Since mIoU and F1 are highly correlated, this does not change the current standings much, except to penalize extreme cases such as the one you mentioned.
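A small sketch of how the combined score behaves, using the standard F1 from detection counts (the function names and the numbers in the example are illustrative, not from the actual leaderboard):

```python
def f1_score(tp, fp, fn):
    """Standard detection F1 from counts: 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def quality(tp, fp, fn, miou):
    """Combined Quality metric: Q = F1 x mIoU."""
    return f1_score(tp, fp, fn) * miou

# Degenerate "everything is an obstacle" prediction: perfect recall
# (FN = 0) and only one counted FP give a deceptively high F1, but the
# collapsed mIoU drags Q down.
q_degenerate = quality(tp=10, fp=1, fn=0, miou=0.30)

# A reasonable method: slightly lower F1, but strong mIoU.
q_good = quality(tp=9, fp=2, fn=1, miou=0.90)

print(f"{q_degenerate:.3f} vs {q_good:.3f}")  # the good method wins
```

The design choice here is simply that a multiplicative combination requires a method to do well on both axes: gaming either metric alone no longer pays off.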
All the entries have been automatically updated so there is no need to re-upload your submission. The challenge instructions will also be updated to reflect the change.
Hope this answers your questions.
Best,
Lojze Žust
on behalf of the MaCVi team
On Monday, 25 September 2023 at 17:27:12 UTC+2, Yannik Steiniger wrote: