archive/tar: write pax headers in sorted order

josh...@docker.com

Dec 24, 2014, 3:10:00 AM
to golan...@googlegroups.com
Hello,

I want to write a tool that creates deterministic tar archives, so that two separately created archives can be compared using a cryptographic hash of each. I can do most of this already with the `tar.Header` and `tar.Writer` types: I add the entries sorted by name, and I explicitly zero out atime/mtime/ctime since I don't consider them important for this tool. Normal header fields are always written in a fixed, deterministic order, but extended attributes are not: they are held in memory as a map and written in whatever order the map happens to be iterated. As the article "Go maps in action" states (https://blog.golang.org/go-maps-in-action#TOC_7.):

When iterating over a map with a range loop, the iteration order is not specified and is not guaranteed to be the same from one iteration to the next.

This makes the tool I'm planning to write unreliable for tar entries that have more than one xattr, as I currently cannot control the order in which they are written.

I've never contributed to the Go project before and I'd like this to be my first contribution. My plan is to use the method described in the "Go maps in action" article linked above to change how the `archive/tar` method `(*Writer).writePAXHeader` iterates over and writes `paxHeaders`: first sort the keys of the `paxHeaders` map, then write the records in that order.
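A rough sketch of the sorted-keys approach, outside of `archive/tar` itself. The helper `sortedKeys` and the sample records are illustrative only; the real change would live inside the writer and is not shown here.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns the keys of m in ascending byte-wise order, which gives
// a stable iteration order regardless of how the map is laid out in memory.
func sortedKeys(m map[string]string) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// Illustrative pax records; the key names are examples, not taken from
	// a real archive.
	paxHeaders := map[string]string{
		"SCHILY.xattr.user.b": "2",
		"SCHILY.xattr.user.a": "1",
		"path":                "some/long/name",
	}
	// Emit the records in sorted key order instead of ranging over the map.
	for _, k := range sortedKeys(paxHeaders) {
		fmt.Printf("%s=%s\n", k, paxHeaders[k])
	}
}
```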

Since this is my first contribution I'm not entirely sure this is the correct place to start. I started with the contribution guidelines (https://golang.org/doc/contribute.html#Design), which say to begin by discussing the design on the mailing list. The guidelines link to the go-nuts mailing list, but I get the feeling that the golang-dev mailing list is more appropriate. Let me know if I'm mistaken and I'll go ahead and cross-post this on that list instead.

Thanks!

- Josh

Ian Lance Taylor

Dec 24, 2014, 11:49:34 AM
to josh...@docker.com, golan...@googlegroups.com
On Wed, Dec 24, 2014 at 12:10 AM, <josh...@docker.com> wrote:
>
> I've never contributed to the Go project before and I'd like this to be my
> first contribution. My plan is to simply use the method described in the "Go
> maps in action" article linked above to change the way in which the method
> `archive.Tar.(Writer).writePAXHeader` iterates over and writes paxHeaders -
> by first sorting the keys of the `paxHeaders` map and selecting and writing
> them in that order.

Sounds like a good plan.

Ian

josh...@docker.com

Dec 24, 2014, 1:05:03 PM
to golan...@googlegroups.com, josh...@docker.com


On Wednesday, December 24, 2014 8:49:34 AM UTC-8, Ian Lance Taylor wrote:
> Sounds like a good plan.
>
> Ian

It looks like there is yet another issue that prevents me from creating a deterministic archive. The `writePAXHeader` method also includes the current pid as part of the name field of the pax header. There is a comment stating:

> The spec asks that we namespace our pseudo files with the current pid.

I'm not sure where the spec is, but I've found a description of this field in the GNU Tar Manual (http://www.gnu.org/software/tar/manual/html_section/tar_71.html#SEC146). Scroll down from that point just a bit to find the description for the option which specifies the format for writing pax header names:

exthdr.name=string

It then goes on to define format specifiers for the file dirname (%d), basename (%f), and tar process pid (%p). GNU Tar lets the user set the format but gives `%d/PaxHeaders.%p/%f` as the default. The choice seems pretty arbitrary, but `(*Writer).writePAXHeader` in `archive/tar` includes the PID. Would you guys be agreeable to omitting it?
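To show why the pid matters for determinism, here is a small sketch that expands the GNU default format for a given entry. The function `paxHeaderName` is hypothetical and only mimics the three specifiers mentioned above; it is not how `archive/tar` or GNU Tar actually builds the name.

```go
package main

import (
	"fmt"
	"os"
	"path"
	"strconv"
	"strings"
)

// paxHeaderName is a hypothetical helper that expands the %d (dirname),
// %f (basename), and %p (pid) specifiers in format for the entry name.
func paxHeaderName(format, name string, pid int) string {
	r := strings.NewReplacer(
		"%d", path.Dir(name),
		"%f", path.Base(name),
		"%p", strconv.Itoa(pid),
	)
	return r.Replace(format)
}

func main() {
	const gnuDefault = "%d/PaxHeaders.%p/%f"

	// With the real pid, the name changes from one process to the next.
	fmt.Println(paxHeaderName(gnuDefault, "a/b/c.txt", os.Getpid()))

	// With a fixed value the name is stable across runs.
	fmt.Println(paxHeaderName(gnuDefault, "a/b/c.txt", 0))
}
```

The second call yields `a/b/PaxHeaders.0/c.txt` on every run, while the first depends on whichever pid the process happens to get.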

- Josh

Alan Donovan

Aug 27, 2015, 9:44:32 AM
to golang-dev, josh...@docker.com
On Wednesday, 24 December 2014 13:05:03 UTC-5, josh...@docker.com wrote:
> It looks like there is yet another issue that prevents me from creating a deterministic archive. The `writePAXHeader` method also includes the current pid as part of the name field of the pax header. [...] Would you guys be agreeable to omitting it?

I just ran into the same problem.  Including the pid seems like a bad default and an even worse hard-wired behavior, POSIX be damned.

Vincent Batts

Aug 27, 2015, 10:55:41 AM
to Alan Donovan, golang-dev, Joshua Hawn
Seems less like a POSIX requirement and more like an available formatting
option. It seems fine to not actually use `os.Getpid()` and to write a generic
numerical marker instead. Omitting the value entirely would not conform to the
pattern other archive extractors expect.
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html
(search for "PaxHeaders")

Shane Hansen

Sep 14, 2015, 3:30:32 PM
to Vincent Batts, Alan Donovan, golang-dev, Joshua Hawn
Maybe we can use a deterministic method for filling in the "pid" field, such as a hash of the file name.
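A minimal sketch of that idea, assuming a 32-bit FNV-1a hash of the entry name stands in for the pid. Both the helper name `pseudoPID` and the choice of FNV are illustrative assumptions; nothing in the thread settles on a particular hash.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"path"
)

// pseudoPID is a hypothetical stand-in for os.Getpid(): it derives a number
// from the entry name, so the same name always yields the same value.
func pseudoPID(name string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(name)) // Write on a hash.Hash never returns an error
	return h.Sum32()
}

func main() {
	for _, name := range []string{"a/b/c.txt", "a/b/d.txt"} {
		// Same layout as the GNU default, with the hash in place of the pid.
		fmt.Printf("%s/PaxHeaders.%d/%s\n",
			path.Dir(name), pseudoPID(name), path.Base(name))
	}
}
```

This keeps the `PaxHeaders.<number>` pattern that extractors expect while removing the dependence on the running process.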