[USV Semantic Segmentation] Considering using different metric


Yannik Steiniger

Sep 25, 2023, 11:27:12 AM
to MaCVi Support
Dear MaCVi Team,

In the USV Semantic Segmentation challenge, the main metric is the dynamic object F1-score. The way this metric is calculated has the drawback that predicting the whole image as the semantic class "obstacle" results in 100% recall and thus a very high F1-score (see my submission to the leaderboard today: DLR-MI, 25 Sep).

I think calculating the true positives using the IoU is not useful either, since the predicted semantic obstacles have a very large area and thus the union will be very large. Maybe the mIoU should be considered as the main metric for this challenge?

Cheers

Yannik

Lojze Žust

Sep 26, 2023, 10:33:16 AM
to MaCVi Support
Dear Yannik,

Thank you for bringing this to our attention. In the following I will briefly describe 1) our reasoning behind using F1, 2) the reason it fails in some cases, and 3) how we will assess the submissions moving forward.

Why F1?
The main idea of using a detection-based metric (F1) is to assess the quality of the segmentation from the detection perspective, especially the detection of smaller obstacles, which have only a minor impact on mIoU. In other words, it is designed to distinguish good methods from great methods. For this reason, we believe using only mIoU is not the best fit for this task.

Where does it go wrong?
However, the F1-score will indeed fail in certain corner cases, as you mentioned above. This is due to the way the numbers of TP, FN and FP detections in an image are calculated. TPs and FNs are determined based on segmentation coverage, i.e. whether or not the obstacle is included as part of the predicted obstacle segmentation mask. Naturally, this leads to 100% recall when everything is predicted as an obstacle. That, however, is not the issue. The problem occurs due to the way FPs are calculated: as the number of predicted components on areas labelled as 'water' in the GT. In the case of predicting everything as obstacles, there is just one huge component that includes everything, so only one FP is counted.
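To make the corner case concrete, here is a minimal sketch of this kind of counting (not the official MaCVi evaluation code). It assumes binary NumPy masks pred_obst and gt_water, a list of per-obstacle GT masks gt_obstacles, and an illustrative coverage threshold:

import numpy as np
from scipy.ndimage import label

def count_detections(pred_obst, gt_obstacles, gt_water, coverage_thr=0.5):
    # TP/FN: a GT obstacle counts as detected if enough of its area is
    # covered by the predicted obstacle mask (coverage-based assignment).
    tp = sum(1 for m in gt_obstacles
             if (pred_obst & m).sum() / m.sum() >= coverage_thr)
    fn = len(gt_obstacles) - tp

    # FP: count connected components of the predicted obstacle mask that
    # fall on GT water. Predicting the entire image as "obstacle" leaves a
    # single huge component on the water region, hence only one FP.
    _, fp = label(pred_obst & gt_water)
    return tp, fn, fp

With boolean arrays, & acts as a logical AND, and scipy.ndimage.label returns the number of connected components as its second output.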

Why are FPs computed this way?
During the development of the metric we considered different ways to compute FPs and their trade-offs. Perhaps the most straightforward measure is the area of FP segmentation. However, this cannot easily be combined with the number of TPs and FNs to obtain a traditional metric such as the F1-score. For this reason we opted for the connected-components method described above, which gives us a count of FPs. Furthermore, this way we penalize methods for producing many small FP predictions, which severely limit the navigation of the boat but are not captured by mIoU.

How can we fix this problem?
In the text above we've already seen some key differences between mIoU and F1. mIoU captures the general segmentation quality but is poor at assessing critical but small segmentation details, such as missing small obstacles or predicting many small FPs. F1, on the other hand, penalizes methods for missing small obstacles and for predicting many FPs, but fails in extreme cases such as predicting everything as obstacles. However, such cases severely impact the mIoU score. For this reason, we believe a combination of mIoU and F1 can be used to tell the whole story.

Solution (TL;DR)
From now on we will use a combined metric called Quality (Q), computed as Q = F1 x mIoU, to determine the standings in our USV segmentation challenges (USV-based Obstacle Segmentation and USV-based Embedded Obstacle Segmentation). Since mIoU and F1 are highly correlated, this does not change the current standings much, apart from penalizing extreme cases such as the one you mentioned.
All entries have been automatically updated, so there is no need to re-upload your submission. The challenge instructions will also be updated to reflect the change.
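For clarity, a toy illustration of the combined metric (not the official scoring code), assuming F1 and mIoU are already available as fractions in [0, 1]; the numbers below are made up purely to show the effect:

def quality(f1, miou):
    # Combined metric: Q = F1 x mIoU
    return f1 * miou

# A degenerate "everything is obstacle" submission may keep a high F1 under
# the old counting but collapses in mIoU, so Q drops accordingly.
print(quality(0.95, 0.20))  # 0.19
print(quality(0.90, 0.85))  # 0.765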

Hope this answers your questions.

Best,
Lojze Žust
on behalf of the MaCVi team


On Monday, 25 September 2023 at 17:27:12 UTC+2, Yannik Steiniger wrote:

Yannik Steiniger

Sep 27, 2023, 2:37:58 AM
to MaCVi Support
Hi Lojze,

Thanks for your detailed answer. You are right: in principle, the F1 should deal with this "all pixels are obstacle" case through a huge false-alarm rate. Maybe calculating TP and FP on a pixel level would work, e.g. each pixel predicted as obstacle and lying inside a GT dynamic obstacle is counted as a TP, and each pixel predicted as obstacle but lying inside the GT water region is counted as an FP. But I also believe the introduced Quality metric is a good solution.
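A rough sketch of that pixel-level counting (illustrative only, not an official implementation), assuming binary NumPy masks pred_obst, gt_obst (dynamic obstacles) and gt_water, with an FN count added for completeness:

import numpy as np

def pixel_counts(pred_obst, gt_obst, gt_water):
    tp = np.logical_and(pred_obst, gt_obst).sum()   # obstacle pixels correctly predicted
    fp = np.logical_and(pred_obst, gt_water).sum()  # obstacle predicted on GT water
    fn = np.logical_and(np.logical_not(pred_obst), gt_obst).sum()  # obstacle pixels missed
    return tp, fp, fn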

Looking forward to submitting a "real" model.

Cheers

Yannik

Lojze Žust

Sep 27, 2023, 6:05:23 AM
to MaCVi Support
We also considered this approach in the past; however, it runs into a similar problem as mIoU: smaller obstacles have a much lower impact on the final score and get drowned out by larger obstacles. We thus opted for instance-level metrics.

Thanks for the discussion and we're looking forward to seeing the results of your work.

Best,
Lojze
On Wednesday, 27 September 2023 at 08:37:58 UTC+2, Yannik Steiniger wrote: