Dear tf-compression authors,
I notice a sizable difference between `model(x, training=True)` and `model(x, training=False)` in the main models, e.g. bls2017 and bmshj2018: the bpp is often 10-20% higher with `training=True` than with `training=False`, and I wonder why. I thought the additive-uniform-noise quantization used during training was supposed to give a bpp very close to actual (rounded) quantization, according to the ICLR 2017 paper, no?
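Just so we're talking about the same thing, here is a minimal sketch of the two code paths I mean, using the same entropy-model classes as bls2017.py (the prior is untrained and the latent is a random stand-in, so the absolute bit counts are meaningless; it's only to illustrate noisy vs. rounded quantization):

```python
import tensorflow as tf
import tensorflow_compression as tfc

# Untrained stand-ins, only meant to show the two code paths.
prior = tfc.NoisyDeepFactorized(batch_shape=(192,))
entropy_model = tfc.ContinuousBatchedEntropyModel(
    prior, coding_rank=3, compression=False)

# Random tensor standing in for the analysis transform output.
y = tf.random.normal((1, 16, 16, 192))

y_noisy, bits_noisy = entropy_model(y, training=True)      # y + U(-0.5, 0.5) noise
y_rounded, bits_rounded = entropy_model(y, training=False)  # hard rounding
print(bits_noisy.numpy(), bits_rounded.numpy())
```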
I first noticed this in the Keras training logs, where the training loss/bpp/mse were consistently quite different from the validation loss/bpp/mse:
e.g., training `bls2017.py` on CLIC with `--num_filters 192` and `--lambda 0.01`:
Epoch 101/200
10000/10000 [==============================] - 439s 44ms/step - loss: 0.9100 - bpp: 0.4769 - mse: 43.3064 - val_loss: 0.8713 - val_bpp: 0.3983 - val_mse: 47.2942
Epoch 102/200
10000/10000 [==============================] - 416s 42ms/step - loss: 0.9121 - bpp: 0.4776 - mse: 43.4441 - val_loss: 0.8689 - val_bpp: 0.3980 - val_mse: 47.0877
The train loss is consistently higher than the val_loss (0.91 vs. 0.87), the train bpp is roughly 20% higher than the val_bpp (0.477 vs. 0.398), and the mse differs as well. It stays this way all the way to convergence.
I first suspected a mismatch between the training and validation data, but when I evaluated the validation set itself with `training=True`, I got numbers similar to the training metrics, so the data doesn't seem to be the cause.
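For reference, that check looked roughly like this (a sketch: the dataset pipeline and batch count are placeholders, and it assumes `model(x, training)` returns a `(loss, bpp, mse)` tuple as in bls2017.py):

```python
import tensorflow as tf

def eval_both_modes(model, val_dataset, num_batches=100):
    """Run validation images through the model in both modes and average the metrics."""
    means = {mode: [tf.keras.metrics.Mean() for _ in range(3)]
             for mode in (True, False)}
    for images in val_dataset.take(num_batches):
        for mode in (True, False):
            # Assumes the model's call() returns (loss, bpp, mse) as in bls2017.py.
            loss, bpp, mse = model(images, training=mode)
            for metric, value in zip(means[mode], (loss, bpp, mse)):
                metric.update_state(value)
    for mode in (True, False):
        loss, bpp, mse = (m.result().numpy() for m in means[mode])
        print(f"training={mode}: loss={loss:.4f} bpp={bpp:.4f} mse={mse:.4f}")
```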
Thank you!