Dear Jona,
I am considering attending PIC 2025, which would bring me back to my old city of Aachen, where I spent five years for my PhD and a postdoc at RWTH.
I downloaded the results published for the CLIC 2022 challenge and was somewhat puzzled: individual images are not always compressed at the targeted bitrate. Instead, only the bitrate integrated over the 30 test images must meet the target, i.e. the sum of all file sizes relative to the total pixel count, not each file on its own.
The consequence is that, at the subjective evaluation stage, evaluators are asked to compare images encoded at very different rates. This is a problem.
For instance, I picked the smallest image from NewbieCodec, the purple flower at 8679 bytes: its effective bitrate is 0.025 bpp, not 0.075 bpp, three times less, and the image is clearly inferior, both visually and in PSNR, to the second- and third-ranked coders at 12605 and 12093 bytes. I found the same image compressed at 33111 bytes in the 0.15 bpp folder, which is close to the 0.075 bpp target of 2048 * 1327 * 0.075 / 8 = 25478 bytes.
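In case it is useful, here is the quick check I am describing as a minimal Python sketch; the dimensions are those of the flower image, and nothing here comes from the challenge's own tooling:

```python
def effective_bpp(file_size_bytes: int, width: int, height: int) -> float:
    """Bits actually spent per pixel for one compressed image."""
    return file_size_bytes * 8 / (width * height)

# The purple flower discussed above (2048 x 1327 pixels):
print(effective_bpp(8679, 2048, 1327))   # ~0.0255 bpp, far below the 0.075 target
print(effective_bpp(33111, 2048, 1327))  # ~0.097 bpp, the copy from the 0.15 bpp folder
print(2048 * 1327 * 0.075 / 8)           # 25478.4 bytes allowed at 0.075 bpp
```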
Knowing that, the rules of the challenge put more weight on devising a kind of poker strategy, where the rates are hand-picked and globally balanced across the 30 test images in the hope of gathering a high Elo rating after the matches.
When images at different rates are compared, the dimensionality of the evaluation space increases, and you will probably need millions of such pairwise comparisons to capture the Pareto front.
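For reference, the textbook Elo update driven by such pairwise outcomes looks like the sketch below; I do not know the exact scoring formula the challenge uses, so take the constants as assumptions:

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """One pairwise match: score_a is 1.0 if A wins, 0.0 if B wins.

    Standard chess-style Elo; the challenge's actual scoring may differ.
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b
```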
My first suggestion is to change the rules so that each individual image must not exceed the target file size, so that evaluations always compare images of roughly equal size.
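Concretely, the submission check could be as simple as the sketch below; the directory layout, the ".bin" naming, and the PNG originals are all assumptions on my part:

```python
from pathlib import Path

from PIL import Image  # Pillow, used only to read image dimensions

TARGET_BPP = 0.075

def check_submission(originals_dir: Path, encoded_dir: Path) -> list[str]:
    """Flag every encoded file that exceeds its own per-image byte budget."""
    violations = []
    for original in sorted(originals_dir.glob("*.png")):  # assuming PNG originals
        width, height = Image.open(original).size
        budget = width * height * TARGET_BPP / 8  # max bytes for this image
        encoded = encoded_dir / (original.stem + ".bin")  # hypothetical naming
        if encoded.stat().st_size > budget:
            violations.append(
                f"{encoded.name}: {encoded.stat().st_size} B > {budget:.0f} B"
            )
    return violations
```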
My second idea is to consider lower bitrates (a.k.a. extreme image compression), where image approximation capabilities become important: at 0.075 bpp, the global appearance of the results looks very similar to the reference, and closer inspection is needed.
On this point, if the visual difference between two candidates is small, there is a great danger that the evaluator picks left or right at random. I would allow a middle choice, where the user can mark both options as roughly equal in quality. This third option can also serve as a control to detect unfair (or tired) evaluators.
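With that third option, a draw would simply score 0.5 in the usual Elo update, and occasional catch trials showing the same image on both sides make the control cheap. A sketch, with an arbitrary threshold of my own choosing:

```python
def is_reliable(catch_trial_answers: list[str], max_non_tie_rate: float = 0.2) -> bool:
    """Catch trials present the same image on both sides, so an attentive
    evaluator should answer 'equal'; too many left/right picks flags a
    careless (or tired) rater. The 20% threshold is an arbitrary choice."""
    non_ties = sum(1 for a in catch_trial_answers if a != "equal")
    return non_ties / len(catch_trial_answers) <= max_non_tie_rate
```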
Colas