I'll only comment on the quality-evaluation aspects of this study, as it's a field I know quite well (I did my PhD on the topic, albeit on video specifically).
> I think the most important kind of comparison to do is a subjective blind test with real people. This is of course produces less accurate results, but more meaningful ones.
I don't get how more meaningful results could be less accurate... Running subjective quality tests is not as trivial as it sounds, at least if you want meaningful results, as you say. Of course, you can throw a bunch of images at some naive observers through a nice web interface, but what about the differences between their screens? What about the differences in their lighting conditions? How do you screen the observers for the test (visual acuity, color blindness)? I've run more than 600 test sessions with around 200 different observers. Each of them was tested before the session, and a normalized (ITU-R BT.500) room was dedicated to the process. I don't want to brag, I just mean it's a complicated matter, and not as sexy as it sounds :-)
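And even once the sessions are done, the raw votes still need care. Here is a minimal sketch (my own illustration, nothing from the article) of the kind of post-processing ITU-R BT.500 asks for: per-condition mean opinion scores with 95% confidence intervals. The ratings matrix is randomly generated stand-in data.

```python
# Minimal sketch of BT.500-style score aggregation (illustration only).
import numpy as np
from scipy import stats

# Stand-in data: 24 observers rating 10 test conditions on a 5-point scale.
# In a real test these would be the screened observers' raw votes.
rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(24, 10)).astype(float)

n_obs = ratings.shape[0]
mos = ratings.mean(axis=0)                          # mean opinion score per condition
sem = ratings.std(axis=0, ddof=1) / np.sqrt(n_obs)  # standard error of the mean
ci95 = stats.t.ppf(0.975, df=n_obs - 1) * sem       # 95% confidence half-width

for k, (m, c) in enumerate(zip(mos, ci95)):
    print(f"condition {k}: MOS = {m:.2f} +/- {c:.2f}")
```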
In this study, you used several objective quality metrics (Y-SSIM, RGB-SSIM, IW-SSIM, PSNR-HVS-M). You say yourself: "It's unclear which algorithm is best in terms of human visual perception, so we tested with four of the most respected algorithms." Still, the ultimate goal of your test is to compare different degrading systems (image coders here) at equivalent *perceived* quality, and as your graphs show, the metrics don't produce very consistent results (especially RGB-SSIM).

SSIM-based metrics are structural: they evaluate how the structure of the image differs from one version to the other, so they are very dependent on the content of the picture. Y-SSIM and IW-SSIM are applied to the luma channel only, which is not optimal in your case, as image coders tend to blend colors and a luma-only metric misses those chroma distortions. Still, IW-SSIM is the best performer in [1] (though it was also the subject of that study), so why not. Your results with RGB-SSIM are very different from the others, which disqualifies it for me; besides, averaging SSIM over the R, G and B channels makes no sense for the human visual system.

PSNR-HVS-M has the advantage of weighting PSNR with a contrast sensitivity function (CSF), but it was designed on artificial artefacts, so you don't know how it performs on compression artefacts. None of these metrics puts the human visual system at its heart; at best, they apply some HVS filter to PSNR or SSIM. For a metric actually built on an HVS model, which tends to perform well (correlation above 0.92 with subjective scores), look at [2] (from the lab I worked in). The code is a bit old now, but an R package seems to be available.
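To make the Y-SSIM vs. RGB-SSIM distinction concrete, here is a rough sketch of how the two are typically computed (my own illustration using scikit-image, not the study's actual pipeline; the file names are placeholders):

```python
# Y-SSIM vs. channel-averaged "RGB-SSIM" (illustration only).
import numpy as np
from skimage import io, color
from skimage.metrics import structural_similarity as ssim

ref = io.imread("reference.png")    # placeholder paths, H x W x 3 uint8 RGB
deg = io.imread("compressed.png")

# Y-SSIM: compare only the luma channel; chroma distortions are invisible to it.
# rgb2ycbcr gives Y in the BT.601 range [16, 235], hence data_range=219.
ref_y = color.rgb2ycbcr(ref)[..., 0]
deg_y = color.rgb2ycbcr(deg)[..., 0]
y_ssim = ssim(ref_y, deg_y, data_range=219)

# "RGB-SSIM": SSIM per channel, then averaged, which weights R, G and B
# equally even though the eye does not.
rgb_ssim = np.mean([ssim(ref[..., c], deg[..., c], data_range=255) for c in range(3)])

print(f"Y-SSIM:   {y_ssim:.4f}")
print(f"RGB-SSIM: {rgb_ssim:.4f}")
```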
You cite [1], in which five algorithms (PSNR, IW-PSNR, SSIM, MS-SSIM and IW-SSIM) are compared over six subject-rated independent image databases (LIVE, Cornell A57, IVC, Toyama, TID2008 and CSIQ). These databases contain images together with subjective quality scores obtained under normalized (i.e. repeatable) conditions. Most of them cover JPEG and JPEG2000 compression, but not the other coders you want to test. The LIVE database is known to be not spread enough, which yields high correlations in most studies (and is precisely why the other databases emerged). If you want to take this study further, consider starting from some of that data.
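The usual way to use such a database is to correlate the metric's output with the published subjective scores, typically with Pearson (after a nonlinear fitting step, omitted here) and Spearman rank correlation. A rough sketch, assuming the scores have already been extracted into two text files (hypothetical file names):

```python
# Validating an objective metric against a subjective database (illustration only).
import numpy as np
from scipy import stats

metric_scores = np.loadtxt("metric_scores.txt")  # one value per distorted image (placeholder)
dmos = np.loadtxt("dmos.txt")                    # published DMOS for the same images (placeholder)

plcc, _ = stats.pearsonr(metric_scores, dmos)    # linear correlation (prediction accuracy)
srocc, _ = stats.spearmanr(metric_scores, dmos)  # rank correlation (prediction monotonicity)

print(f"PLCC = {plcc:.3f}, SROCC = {srocc:.3f}")
```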
Finally, be careful when you average these values: did you check their distribution first?
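For instance, something as simple as the following (my own sketch, hypothetical file name) already tells you whether the mean is a sensible summary:

```python
# Quick sanity check before averaging per-image quality scores (illustration only).
import numpy as np
from scipy import stats

scores = np.loadtxt("per_image_scores.txt")  # hypothetical per-image metric values

print("mean:    ", np.mean(scores))
print("median:  ", np.median(scores))
print("skewness:", stats.skew(scores))
# A normality test gives a rough hint of whether the mean alone is meaningful.
print("Shapiro-Wilk p-value:", stats.shapiro(scores).pvalue)
```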
Stéphane Péchard
[1] https://ece.uwaterloo.ca/~z70wang/research/iwssim/
[2] http://www.irccyn.ec-nantes.fr/~autrusse/Komparator/index.html