Thank you. This is very much appreciated.
However, I feel bad that this ended up simply creating more work for you. If
there is any way I can help automate things, please let me know. In a previous
life, I was a sysops engineer.
> For the record, here is what I observed: ...
This is, indeed, surprising. On my machine, these same steps produce
identically-hashing contents. The discrepancy bothered me enough that I spent
some time sleuthing the root cause.
The quick 'n dirty is that gzip 1.6 produces non-deterministic output by
default. I was able to reproduce the behaviour you demonstrate by using gzip
1.6. The non-determinism is fixed in gzip version 1.10 (dated
2018-12-29).
The medium-length explanation is that gzip includes a timestamp in its header
by default. Up until version 1.9, when the input is stdin, this timestamp
becomes the current system time. Version 1.10 chooses to simply elide the
timestamp altogether in this case. However, when input is a normal file, all
versions, including 1.6, behave similarly by setting the timestamp to the
original file's modification time (mtime).
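For anyone who wants to look at the timestamp directly, here is a minimal
sketch. It relies on the gzip file format (RFC 1952) storing the 4-byte MTIME
field at byte offsets 4 through 7 of the header. The helper name
gzip_mtime_hex is made up for this illustration.

```shell
# Hypothetical helper: print the 4-byte MTIME header field (offsets 4-7)
# of a gzip stream read from stdin, as 8 hex digits.
gzip_mtime_hex() {
    od -An -tx1 -j4 -N4 | tr -d ' \n'
}

# Compress from stdin and show the embedded timestamp bytes.
printf 'hello\n' | gzip | gzip_mtime_hex
echo
```

Going by the behaviour described above, this should print all zeros with gzip
1.10 and a value that changes from run to run with 1.6, though I have only
sketched it here rather than tested it on every version.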
It turns out that you can manually tell gzip to elide the timestamp in the
latter case by providing the non-obviously named '-n' (--no-name) option. This
means that, with version 1.6, you should be able to get reproducible tar.gz
archives by manually chaining tar and gzip together:
$ tar -cf - path/to/metamath | gzip -n >metamath.tar.gz
or telling tar to chain for us with the -I switch:
$ tar -I 'gzip -n' -cf metamath.tar.gz path/to/metamath
As it turns out, either of those invocations produces archives that hash
identically between versions 1.6 and 1.10.
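If you want to check reproducibility on a single machine, something along
these lines should do. It is only a sketch: the scratch directory and file
names are made up, and it assumes a tar that writes the same bytes for
unchanged input (GNU tar does, as far as I can tell).

```shell
# Hypothetical scratch directory and file names, used only for this check.
workdir=$(mktemp -d)
mkdir "$workdir/src"
printf 'example\n' > "$workdir/src/file.txt"

# Build the same archive twice, a second apart, so that an embedded
# "current time" stamp would make the two results differ.
tar -cf - -C "$workdir" src | gzip -n > "$workdir/a.tar.gz"
sleep 1
tar -cf - -C "$workdir" src | gzip -n > "$workdir/b.tar.gz"

# Compare the two archives byte for byte.
if cmp -s "$workdir/a.tar.gz" "$workdir/b.tar.gz"; then
    result=identical
else
    result=different
fi
echo "$result"
rm -rf "$workdir"
```

Dropping the -n from both gzip calls should make the comparison fail on
versions up to 1.9, per the explanation above.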
And just for over-the-top kicks, if anyone is interested in checking out the
commit that introduced the above change to gzip, here it is:
url: git://git.savannah.gnu.org/gzip.git
commit: bce795d0a38ae10f13b3297f1253acdeb4defc21
Cheers.