Context model performance

lucie...@gmail.com

Mar 17, 2020, 2:45:02 AM
to tensorflow-compression
Hi,

I tried to reproduce the work in your paper "Joint autoregressive and hierarchical priors for learned image compression". However, I cannot get the quality results reported in the paper; I get only about a 1-2% bit rate saving from adding the context model. I read the other related topic, and I fully understand your decision not to release the code for that paper (perhaps due to limited time). Still, I would like to ask a few questions about the implementation of this paper; I hope you can give me some suggestions.

1. In the current conditional bottleneck layer, the actual value that is quantized and encoded is [y - mean]. If we want to add a context model, the mean actually depends on [y], so I have modified the current conditional bottleneck layer to quantize and code [y] directly. Is this the right way to do it?

2. In the context prediction module and the entropy parameters module, did you use ordinary Conv operators or SignalConv operators? Does this choice really matter?

3. Is there any additional hint you could give? I have been stuck on this reproduction for a really long time.

Thank you in advance!

Xinyu


mlk...@outlook.com

Apr 17, 2020, 2:04:48 AM
to tensorflow-compression
Hi Xinyu,

I'm also trying to reproduce the mean-scale hyperprior model's results and am having trouble.
Do you mind sharing how you were able to reproduce the mean-scale model's results (without the context model)? I trained using bmshj2018.py with default settings (with only a minor modification to model the means in addition to the scales), and I'm about 0.1 bpp worse than the results published in the paper (at fixed distortion). Did you use learning-rate decay or regularization techniques?

Thanks!
Tomas

Johannes Ballé

May 20, 2020, 8:20:15 PM
to tensorflow-compression, David Minnen
Hi Xinyu, Tomas:

My apologies for the late reply; unfortunately, I wasn't able to respond to your emails earlier and am only processing them now.

With the context model, there are some subtleties regarding how exactly to condition on the previously decoded values. David Minnen (CC'ed) knows more about the details here, since he implemented this code.

It sounds like the problem may be that you want to condition on the quantized values rather than the "noisy" values during training, since the decoder can only "see" the quantized values. For the encoder and decoder to stay in sync, you need to make sure both use the same values for the conditioning; otherwise there may be a performance drop due to the mismatch between them.
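
To illustrate the two options, here is a minimal TensorFlow sketch; this is not code from the library, and the straight-through rounding is just one common way of keeping gradients flowing:

```python
import tensorflow as tf

def noisy_latents(y):
    # Training-time proxy: additive uniform noise in place of quantization.
    return y + tf.random.uniform(tf.shape(y), minval=-0.5, maxval=0.5)

def quantized_latents(y):
    # Hard rounding with a straight-through gradient, so the context model
    # is conditioned on exactly what the decoder will see.
    return y + tf.stop_gradient(tf.round(y) - y)

# Whichever variant you pick, the encoder-side conditioning during actual
# compression must match what the decoder reconstructs, or the predicted
# distributions will drift apart.
```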

Regarding the other question, SignalConv is essentially the same as Conv, but it has more options for boundary handling, which can be quite significant in the case of image compression models (or fully convolutional autoencoders).
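
For reference, a typical analysis-transform layer using SignalConv looks roughly like this; take it as a sketch, since the exact argument names can differ between versions of tensorflow-compression:

```python
import tensorflow as tf
import tensorflow_compression as tfc

num_filters = 192

# SignalConv2D exposes explicit boundary handling (e.g. "same_zeros",
# "same_reflect") and separate down-/up-sampling strides.
signal_layer = tfc.SignalConv2D(
    num_filters, (5, 5), corr=True, strides_down=2,
    padding="same_zeros", use_bias=True, activation=tfc.GDN())

# A plain Keras Conv2D only offers "valid" or zero-padded "same".
plain_layer = tf.keras.layers.Conv2D(num_filters, 5, strides=2, padding="same")
```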

We didn't use any regularization or learning-rate tricks other than those described in the paper.

I hope this helps. Please let me or David know if you have any further questions!
Johannes

David Minnen

Jun 2, 2020, 3:32:09 PM
to Johannes Ballé, mlk...@outlook.com, tensorflow-compression
Hi. Hopefully, I can help you figure out how to train a more effective context-adaptive compression model. There are a bunch of small details that affect the final rate-distortion (RD) performance. Here are a few that may be relevant:
  1. We train the models for many steps using 256x256 patches with a batch size of 8. I think for this paper we trained for six million steps, though we've cut back since then to four or five million. You can use fewer steps for experiments to compare the relative performance of a new idea or new tweak, but the extra steps (i.e. going from 2M to 5M) will give you a few extra percent of rate savings so that's what we do for the final results in our papers.

  2. The size of the bottleneck (number of output channels in the last conv layer of the "encoder" -- sometimes called the "analysis transform") makes a difference. It needs to be larger for the best performance at higher bit rates (i.e. higher lambda in the R + lambda * D loss function). We typically use 320 channels and drop down to 192 for low bit rates and bump up to 512 for high bit rates. Roughly, "low" is less than 0.5 bpp and high is greater than 2.5 bpp, but you should test with your specific architecture.

  3. You're right that it makes more sense to code [y] instead of [y - mean]. Unfortunately, I always saw worse RD performance with this approach. So our best models still code using [y - mean].

  4. Johannes is right that using quantized values as the input to the "context model" makes sense. More experimentation is needed here, but for the paper, I trained with noisy values: y + U(-0.5, 0.5). I think I also tried training with a stop_gradient (so that gradients from the context model would NOT affect the encoder transform) but that led to worse performance.

  5. We've found it useful to start with a higher lambda and then shift down to the target lambda halfway through training. So if you want to use lambda=0.5 and train for 2M steps, train with lambda=1.0 for 1M steps and then continue training with lambda=0.5 for another 1M steps (there's a small schedule sketch after this list). We have not explored the effects of this method in detail. It seems to help most with small lambda values.

  6. The output of the context model is integrated with the output of the hyper-synthesis transform fairly early. Essentially, we get two latent tensors, concatenate them, and then pass them through additional 1x1 convs (the "entropy parameters" block in the paper); a rough sketch of this block follows the list. The output of that is then split in half to get mean and scale values. In later models, we actually use two sets of conv layers: one for mean and one for scale instead of using one set and splitting at the end.

  7. After publication, we were able to get better results with the mean & scale model (i.e. without context). There's still a benefit for using spatial context, but it's somewhat smaller than what is reported in the paper. We've posted CSV files with data for the RD curves in our github repo: https://github.com/tensorflow/compression/tree/master/results/image_compression. Files named "minnen-2018-neurips-no-context.txt" hold results for the hyperprior model with mean and scale parameters (but no spatial context).

  8. We do reduce learning rate over time, but our approach is fairly simple. I believe for the paper we trained for 5M steps with LR=1e-4 and then another 1M steps with LR=1e-5. The lower LR at the end helps, but I have no reason to believe that this schedule is optimal (or even close to optimal).

  9. Finally, we've mostly moved away from using spatial context explicitly since decoding is so slow. Other researchers have found these models to work well (e.g. Lee 2019 and Klopp 2018 are both cited on our github page, and many submissions to CLIC build on the hyperprior + context architecture). Instead, we've been exploring a new model that is autoregressive only in the channel dimension of the latent tensor. This approach leads to a simpler model that's faster and gives better RD performance when paired with a couple of other improvements. The model is described in a paper that was recently accepted at ICIP 2020: Channel-wise autoregressive entropy models for learned image compression. That's a draft, and I should have an expanded version up on arXiv soon.
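
To make point 6 concrete, here is a rough Keras-style sketch of the "entropy parameters" block; the channel counts are illustrative, and this is not our actual code:

```python
import tensorflow as tf

def make_entropy_parameters(latent_channels=320):
    # Input: hyper-synthesis output concatenated with the context-model
    # output along the channel axis (assumed here to be 2x latent_channels
    # each, hence 4x in total).
    inputs = tf.keras.Input(shape=(None, None, 4 * latent_channels))
    x = tf.keras.layers.Conv2D(640, 1, activation="relu")(inputs)
    x = tf.keras.layers.Conv2D(512, 1, activation="relu")(x)
    # The final layer predicts both parameters; split into mean and scale.
    x = tf.keras.layers.Conv2D(2 * latent_channels, 1)(x)
    return tf.keras.Model(inputs, x)

# Usage (shapes are illustrative):
#   combined = tf.concat([hyper_out, context_out], axis=-1)
#   mu, sigma = tf.split(make_entropy_parameters()(combined), 2, axis=-1)
```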
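And here is the staging from points 5 and 8 written out as a trivial schedule function; the step counts and values are just the examples given above, scaled to a 2M-step run:

```python
def schedule(step, target_lmbda=0.5, total_steps=2_000_000):
    # Point 5: train at a higher lambda for the first half of training,
    # then switch to the target lambda.
    lmbda = 2.0 * target_lmbda if step < total_steps // 2 else target_lmbda
    # Point 8: keep the learning rate at 1e-4 for most of training, then
    # drop to 1e-5 for the final stretch (5M + 1M steps for the paper).
    lr = 1e-4 if step < int(total_steps * 5 / 6) else 1e-5
    return lmbda, lr
```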

Yu Qin

Aug 20, 2020, 7:09:56 AM
to tensorflow-compression
Hi Xinyu,

Recently I have been reproducing the context model based on bmshj2018.py, but I ran into a problem: I am stuck on how to decode a tfci file using a model with a context model. I tried saving mu and sigma from the encoding process and using them to decode y with a Gaussian conditional model, and then using the decoded y to produce new mu and sigma for decoding y again. This process seems so strange that I am very confused.

I hope you can give me some suggestions. Thanks!

Best,
Yu Qin

qq1151...@gmail.com

Oct 13, 2020, 12:23:34 AM
to tensorflow-compression

I also have similar doubts about quantization. In the autoregressive model, mu is predicted step by step by the autoregressive model, and the quantized values are the input to the autoregressive model, but the quantized result itself depends on mu (mu + round(y - mu)). So at inference time, how does the encoder side perform quantization?

Thanks,
zmye

David Minnen

Oct 28, 2020, 3:18:57 AM
to tensorflow-compression
I think that Figure 1 in the NeurIPS 2018 paper isn't as clear as it should be. Johannes recently came up with a better way of representing the quantization and entropy coding in our models, which is shown in Figure 10 in Nonlinear Transform Coding (here's a direct link to the image). This image makes it clear that it's `round(y - mu)` that gets coded. In non-autoregressive models, mu only depends on \hat{z} so everything is straightforward. In the spatially autoregressive model, mu depends on both \hat{z} and "previous" values of \hat{y}. The decoder will have all the values it needs for each spatial location, but implementing it efficiently is difficult (and it's even more involved if you want it to be deterministic).
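
To make the decoding order concrete, here is a very rough, deliberately slow sketch of the serial decode; `context_model`, `entropy_parameters`, and `decode_symbol` are hypothetical placeholders, not functions from our released code:

```python
import numpy as np

def decode_latents(hyper_out, context_model, entropy_parameters,
                   decode_symbol, height, width, channels):
    # Raster-scan decode: mu and sigma at (i, j) depend on the hyperprior
    # and on y_hat values already decoded at earlier positions.
    y_hat = np.zeros((height, width, channels), dtype=np.float32)
    for i in range(height):
        for j in range(width):
            # Causal context, e.g. a masked convolution over y_hat that only
            # "sees" previously decoded locations.
            ctx = context_model(y_hat, i, j)
            mu, sigma = entropy_parameters(hyper_out[i, j], ctx)
            # The bitstream stores round(y - mu), so add mu back afterwards.
            residual = decode_symbol(sigma)  # integer symbols from the range coder
            y_hat[i, j] = residual + mu
    return y_hat
```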

We mostly punted on this problem since deployable models need to run quickly. Most codecs have an autoregressive component (e.g. CABAC) so the architecture isn't inherently slow, but when we got good RD performance from the channel-wise autoregressive model, we moved in that direction since it's much easier to implement within an end-to-end optimized TF model.
