tar files sort order by date or numeric name


Janis Papanagnou

May 6, 2022, 12:41:37 PM
I want the files in a tar archive in sorted form. (Using GNU tar.)
Either by date of the file or by its name (containing a number).
For example I want these three files in sorted order like depicted:
rfc748.txt
rfc7168.txt
rfc8774.txt

I can add the files incrementally one by one to an empty archive,
but I wanted to know whether there's a trick that I missed to fill
the archive in one go, like tar cf sorted.tgz dir-with-files/

On a quick search and man page inspection I couldn't see anything.

Janis

Axel Reichert

May 6, 2022, 1:38:49 PM
Janis Papanagnou <janis_pa...@hotmail.com> writes:

> I want the files in a tar archive in sorted form. (Using GNU tar.)
> Either by date of the file or by its name (containing a number).
> For example I want these three files in sorted order like depicted:
> rfc748.txt
> rfc7168.txt
> rfc8774.txt
>
> I can add the files incrementally one by one to an empty archive,
> but I wanted to know whether there's a trick that I missed to fill
> the archive in one go, like tar cf sorted.tgz dir-with-files/

I assume that something along

ls -tr dir-with-files/ | xargs tar cf sorted.tgz

is too brittle for you?

Best regards

Axel

marrgol

May 6, 2022, 2:10:28 PM
Here a quick man page inspection reveals:

“--sort=ORDER
When creating an archive, sort directory entries according
to ORDER, which is one of none, name, or inode.”


--
mrg

Christian Weisgerber

May 6, 2022, 2:30:09 PM
On 2022-05-06, Janis Papanagnou <janis_pa...@hotmail.com> wrote:

> I want the files in a tar archive in sorted form. (Using GNU tar.)
>
> I can add the files incrementally one by one to an empty archive,
> but I wanted to know whether there's a trick that I missed to fill
> the archive in one go, like tar cf sorted.tgz dir-with-files/

Various tar(1) implementations can read a list of files to archive.

$ ls | sort >list
$ tar -c -I list -f sorted.tar

GNU tar also supports this.

$ gtar -c -T list -f sorted.tar

--
Christian "naddy" Weisgerber na...@mips.inka.de

Janis Papanagnou

May 6, 2022, 9:08:37 PM
Too brittle? - Hmm.. - thinking about what happens if the arguments'
length will result in more than one call of tar triggered by xargs.
But I suppose using also the tar option to add to an existing archive
will solve that issue.
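
A minimal sketch of that idea, assuming GNU find, sort, xargs and tar; the directory name rfcs and the dummy files are made up for illustration. Here xargs is forced to call tar once per file (-n 1) to mimic the multiple-invocation case, and each call appends (-r) instead of creating. Note that appending only works on an uncompressed archive.

```shell
# Scratch area with three dummy files (hypothetical names).
work=$(mktemp -d)
mkdir "$work/rfcs"
for n in 8774 748 7168; do : > "$work/rfcs/rfc$n.txt"; done

(
    cd "$work"
    # Start from an empty archive so that every later call can append.
    tar -cf sorted.tar -T /dev/null
    # -n 1 forces one tar call per file; -r appends in arrival order.
    find rfcs -type f -print0 |
        sort -z --version-sort |
        xargs -0 -n 1 tar -rf sorted.tar
)

listing=$(tar -tf "$work/sorted.tar")
printf '%s\n' "$listing"
rm -rf "$work"
```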

Thanks.

Janis

> Best regards
>
> Axel
>

Janis Papanagnou

May 6, 2022, 9:12:56 PM
That's what I also had found in the man page, and none of the three
options will sort by date or by name with a variable-length
numeric component. With 'name' the order would be
>> rfc7168.txt
>> rfc748.txt
>> rfc8774.txt
and with the other options arbitrary w.r.t. the stated requirement.

Janis

Janis Papanagnou

May 6, 2022, 9:13:35 PM
I missed that. Thanks.

Janis

Janis Papanagnou

May 6, 2022, 9:32:46 PM
On 07.05.2022 03:12, Janis Papanagnou wrote:
> On 06.05.2022 20:10, marrgol wrote:
>> On 06/05/2022 at 18.41, Janis Papanagnou wrote:
>>> I want the files in a tar archive in sorted form. (Using GNU tar.)
>>> Either by date of the file or by its name (containing a number).
>>> For example I want these three files in sorted order like depicted:
>>> rfc748.txt
>>> rfc7168.txt
>>> rfc8774.txt
>>>
>>> I can add the files incrementally one by one to an empty archive,
>>> but I wanted to know whether there's a trick that I missed to fill
>>> the archive in one go, like tar cf sorted.tgz dir-with-files/
>>>
>>> On a quick search and man page inspection I couldn't see anything.
>>
>> Here a quick man page inspection reveals:
>>
>> “--sort=ORDER
>> When creating an archive, sort directory entries according
>> to ORDER, which is one of none, name, or inode.”

I forgot to mention that this was the place where I'd have expected
some, say, --sort=mtime option variants. That way the call that I
currently use to create the tar file - I'm just tar'ing the directory
that contains the actual files - would stay simple and not require
xargs (incl. caveats) or separate file lists as suggested elsethread.
Needless to say, with the suggestions provided, it's just a matter of
convenience now, but maybe also a possible --sort extension candidate.

Janis

Brian Patrie

May 7, 2022, 3:04:18 AM
You can also use "-T -" to read the list of files from stdin. So:

find dir-with-files | sort --version-sort \
| tar -czvf sorted.tgz --sort=none --no-recursion -T -

I'm abusing sort's "--version-sort" option to get the order that Janis
wants (beware that this will sort decimals incorrectly--I couldn't get
"--numeric-sort" to do the desired thing, for some unknown reason).
"--sort=none" tells tar not to do its own sorting. "--no-recursion"
tells tar not to do its own directory diving--which would also muck
things up.
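
The difference between the orderings can be seen without tar at all. The sketch below assumes GNU sort and uses the three file names from the original question. A likely reason for the "--numeric-sort" trouble is that every line starts with letters, so each line has the numeric value 0 and sort falls back to plain byte comparison.

```shell
# The three example names from the original post.
names='rfc7168.txt
rfc748.txt
rfc8774.txt'

# Plain byte-wise order: "7168" sorts before "748" because '1' < '4'.
lexical=$(printf '%s\n' "$names" | LC_ALL=C sort)

# Version order: embedded digit runs are compared as numbers.
version=$(printf '%s\n' "$names" | LC_ALL=C sort --version-sort)

printf 'lexical:\n%s\n\nversion:\n%s\n' "$lexical" "$version"
```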

find (GNU findutils) 4.7.0-git
sort (GNU coreutils) 8.28
tar (GNU tar) 1.29

Axel Reichert

May 7, 2022, 3:28:25 AM
Brian Patrie <bpa...@bellsouth.spamisicky.net> writes:

> You can also use "-T -" to read the list of files from stdin.

Ah, this avoids my xargs, great!

> find dir-with-files | sort --version-sort \
> | tar -czvf sorted.tgz --sort=none --no-recursion -T -

[...]

> "--sort=none" tells tar not to do its own sorting.

Would this be done otherwise, even though the files are given directly
on the command line as arguments (respectively read from STDIN) and not
created by globbing?

> "--no-recursion" tells tar not to do its own directory diving

Is my understanding correct that this happens only if "find" returns
directories? So depending on the contents of Janis's "dir-with-files", a
simple

find dir-with-files -name "rfc*.txt"

might do, even without "-type f".

Best regards

Axel

Janis Papanagnou

May 7, 2022, 9:43:13 AM
On 07.05.2022 09:04, Brian Patrie wrote:
> Christian Weisgerber wrote:
>> On 2022-05-06, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>>
>>> I want the files in a tar archive in sorted form. (Using GNU tar.)
>>>
>>> I can add the files incrementally one by one to an empty archive,
>>> but I wanted to know whether there's a trick that I missed to fill
>>> the archive in one go, like tar cf sorted.tgz dir-with-files/
>>
>> Various tar(1) implementations can read a list of files to archive.
>>
>> $ ls | sort >list
>> $ tar -c -I list -f sorted.tar
>>
>> GNU tar also supports this.
>>
>> $ gtar -c -T list -f sorted.tar
>>
>
> You can also use "-T -" to read the list of files from stdin. So:
>
> find dir-with-files | sort --version-sort \
> | tar -czvf sorted.tgz --sort=none --no-recursion -T -
>
> I'm abusing sort's "--version-sort" option to get the order that Janis
> wants (beware that this will sort decimals incorrectly--i couldn't get
> "--numeric-sort" to do the desired thing, for some unknown reason).

With the 'sort' step I can use sort's -kn feature because of the
regularity of the file names in this case. Thanks.

Janis

Helmut Waitzmann

May 7, 2022, 12:05:50 PM
Janis Papanagnou <janis_pa...@hotmail.com>:
The trick of a thorough inspection of the GNU tar info manual (not
just the manual page, but see the SEE ALSO section of the manual
page for how to get it), will reveal the options "--no-recursion",
"--null", and "--files-from", which you could use like in this
example to have the file names sorted by version number:

find dir-with-files/ -print0 |
sort --zero-terminated --version-sort |
tar cf sorted.tgz --no-recursion --null --files-from=-

A rule of thumb:  Whenever you want to gain more control of what
files in what order to be processed by GNU tar, use the
"--no-recursion", "--null", and "--files-from" options.  Let GNU
"find … -print0" collect the file names according to the given
criteria, then sort them using "sort --zero-terminated …" and
finally feed them to GNU tar.
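
As an end-to-end check of this rule of thumb, here is a small sketch assuming GNU find, sort and tar. The rfcs directory, the dummy files and the added "-type f" are illustrative choices, not part of the recipe above. Note that "--null" has to come before "--files-from" on the tar command line.

```shell
# Scratch area with three dummy files (hypothetical names).
work=$(mktemp -d)
mkdir "$work/rfcs"
for n in 8774 748 7168; do
    printf 'dummy %s\n' "$n" > "$work/rfcs/rfc$n.txt"
done

(
    cd "$work"
    # find collects, sort orders, tar archives exactly in the order given.
    find rfcs -type f -print0 |
        sort --zero-terminated --version-sort |
        tar -cf sorted.tar --no-recursion --null --files-from=-
)

# The member order in the archive is the order produced by sort.
listing=$(tar -tf "$work/sorted.tar")
printf '%s\n' "$listing"
rm -rf "$work"
```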

If you want the file names to be sorted by their file contents
modification time, you could do

TZ=UTC0 find dir-with-files/ \
-printf '%TY %Tm-%TdT%TH:%TM:%TS %p\0' |
sort --zero-terminated -t ' ' -k 1,1n -k 2,2 -k 3 |
sed --zero-terminated -E -e '/^([[:graph:]]+ ){2}/s///' |
tar cf sorted.tgz --no-recursion --null --files-from=-

Let GNU find list the filenames, each of them prepended by its data
modification time, then let GNU sort sort the list of the filenames
by the prepended time stamps, then let GNU sed remove the prepended
timestamps from the filenames and finally feed them to GNU tar.

Thanks to the "-printf" GNU find predicate, the "--zero-terminated"
GNU sort and GNU sed options, and the "--null" GNU tar options that
will work with any filename.

Janis Papanagnou

May 7, 2022, 1:23:08 PM
On 07.05.2022 18:05, Helmut Waitzmann wrote:
>
> If you want the file names to be sorted by their file contents
> modification time, you could do
>
> TZ=UTC0 find dir-with-files/ \
> -printf '%TY %Tm-%TdT%TH:%TM:%TS %p\0' |
> sort --zero-terminated -t ' ' -k 1,1n -k 2,2 -k 3 |
> sed --zero-terminated -E -e '/^([[:graph:]]+ ){2}/s///' |
> tar cf sorted.tgz --no-recursion --null --files-from=-

Thanks for your reply. The nice thing about Unix is that we can
construct solutions of arbitrary complexity solving (almost)
every imaginable task.

The task I have presented is quite primitive. With the previous
posts I think using something like sort -k1.4,1.7n as part of
a pipe serves quite well.

In my opinion, how files are inserted into a tar archive should
be controlled by tar options. That's why I'd still favor the
existence of some tar --sort=mdate feature[*] instead of more
or less complex workarounds. Being able to sort by numeric name
could also be an option.

It's certainly arguable whether tar should have an option --sort
instead of letting an external tool do the sort, but tar option
--sort is already there in GNU tar, so it would seem obvious to
complete the set of option arguments.


To put the pieces together; for now I think something along

ls rfcs/* | sort -t/ -k2.4,2.7n | tar czf rfcs.tgz -T -

(untested!) would serve me best.

Janis

[*] BTW, I noticed just now that the file dates turned out to be
not significant to indicate generation of the respective files,
so I will have to rely on the file name numbering in this case.

Spiros Bousbouras

May 7, 2022, 1:26:06 PM
On Sat, 07 May 2022 18:05:21 +0200
Helmut Waitzmann <nn.th...@xoxy.net> wrote:
> If you want the file names to be sorted by their file contents
> modification time, you could do
>
> TZ=UTC0 find dir-with-files/ \
> -printf '%TY %Tm-%TdT%TH:%TM:%TS %p\0' |
> sort --zero-terminated -t ' ' -k 1,1n -k 2,2 -k 3 |
> sed --zero-terminated -E -e '/^([[:graph:]]+ ){2}/s///' |
> tar cf sorted.tgz --no-recursion --null --files-from=-
>
> Let GNU find list the filenames, each of them prepended by its data
> modification time, then let GNU sort sort the list of the filenames
> by the prepended time stamps, then let GNU sed remove the prepended
> timestamps from the filenames and finally feed them to GNU tar.

For sorting using modification time it's simpler to do

find dir-with-files/ -printf '%T@ %p\0' |
sort -z -n -k1 |
gawk 'BEGIN { RS = "\0" } {print $2}' | etc.
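
One caveat: the gawk step as written prints newline-terminated names and splits on whitespace, so it would not survive file names containing blanks. A NUL-safe variant might look like the sketch below, which assumes GNU find, sort, sed, touch and tar; it swaps the gawk step for "sed -z", and the directory, file names and dates are invented for the demonstration.

```shell
# Scratch area; the mtimes deliberately disagree with the name order.
work=$(mktemp -d)
mkdir "$work/rfcs"
touch -d '2001-01-01 00:00:00' "$work/rfcs/rfc8774.txt"
touch -d '2002-01-01 00:00:00' "$work/rfcs/rfc748.txt"
touch -d '2003-01-01 00:00:00' "$work/rfcs/rfc 7168.txt"  # blank in name

(
    cd "$work"
    # %T@ prints the mtime as seconds since the epoch; sort on that
    # first field only, then strip it again before handing names to tar.
    find rfcs -type f -printf '%T@ %p\0' |
        sort -z -k 1,1n |
        sed -z 's/^[^ ]* //' |
        tar -cf sorted.tar --no-recursion --null --files-from=-
)

listing=$(tar -tf "$work/sorted.tar")
printf '%s\n' "$listing"
rm -rf "$work"
```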

Helmut Waitzmann

May 7, 2022, 7:36:24 PM
Spiros Bousbouras <spi...@gmail.com>:
>On Sat, 07 May 2022 18:05:21 +0200
>Helmut Waitzmann <nn.th...@xoxy.net> wrote:
>> If you want the file names to be sorted by their file contents
>> modification time, you could do
>>
>> TZ=UTC0 find dir-with-files/ \
>> -printf '%TY %Tm-%TdT%TH:%TM:%TS %p\0'
>
>For sorting using modification time it's simpler to do
>
> find dir-with-files/ -printf '%T@ %p\0' |

I considered using "%T@" but refrained from using it because of the
paragraph in the GNU find info manual:  "Below are the formats for
the directives '%A', '%C', and '%T', which print the file's
timestamps. Some of these formats might not be available on all
systems, due to differences in the C 'strftime' function between
systems."

The POSIX definition of the "strftime" function
(<https://pubs.opengroup.org/onlinepubs/9699919799/functions/strftime.html#top>)
knows the conversion specifiers "Y", "m", "d", "H", "M", and "S",
but not "@".

Spiros Bousbouras

May 8, 2022, 2:10:14 AM
On Sat, 07 May 2022 23:23:54 +0200
Helmut Waitzmann <nn.th...@xoxy.net> wrote:
> Spiros Bousbouras <spi...@gmail.com>:
> >On Sat, 07 May 2022 18:05:21 +0200
> >Helmut Waitzmann <nn.th...@xoxy.net> wrote:
> >> If you want the file names to be sorted by their file contents
> >> modification time, you could do
> >>
> >> TZ=UTC0 find dir-with-files/ \
> >> -printf '%TY %Tm-%TdT%TH:%TM:%TS %p\0'
> >
> >For sorting using modification time it's simpler to do
> >
> > find dir-with-files/ -printf '%T@ %p\0' |
>
> I considered using "%T@" but refrained from using it because of the
> paragraph in the GNU find info manual: "Below are the formats for
> the directives '%A', '%C', and '%T', which print the file's
> timestamps. Some of these formats might not be available on all
> systems, due to differences in the C 'strftime' function between
> systems."

The man page says

%Ak File's last access time in the format specified by k, which is
either `@' or a directive for the C `strftime' function. The
possible values for k are listed below; some of them might not be
available on all systems, due to differences in `strftime' between
systems.

@ seconds since Jan. 1, 1970, 00:00 GMT, with fractional
part.

So the @ directive is on top of what strftime() offers. Note that
https://www.gnu.org/software/libc/manual/html_mono/libc.html#index-strftime
does not mention @ either. It does mention

%s

The number of seconds since the epoch, i.e., since
1970-01-01 00:00:00 UTC. Leap seconds are not counted
unless leap second support is available.

This format is a GNU extension.

but https://www.gnu.org/software/findutils/manual/html_mono/find.html does
not say you can use %s for seconds since the epoch.

> The POSIX definition of the "strftime" function
> (<https://pubs.opengroup.org/onlinepubs/9699919799/functions/strftime.html#top>)
> knows the conversion specifiers "Y", "m", "d", "H", "M", and "S",
> but not "@".

But POSIX does not mention --zero-terminated for sed or sort either.

Axel Reichert

May 8, 2022, 3:19:49 AM
Janis Papanagnou <janis_pa...@hotmail.com> writes:

> I want the files in a tar archive in sorted form.

After the IMHO fruitful discussion I would like to ask why you want to
have them in sorted order in your tar file. I could not come up with a
motivation for this myself. Could you please explain?

Best regards

Axel

Janis Papanagnou

May 8, 2022, 4:28:57 AM
Sure. In short: sorted item lists let you find specific items or detect
inconsistencies easier on inspection or on comparison with other data.

If I inspect a foreign tar file I typically inspect the contents before
the decision of unpacking them or not. If I obtain a package with sorted
numbered items that are a subset of a larger set I can easily identify
whether a set of entities is in that package or not. It's just the usual
effect that sorted item lists let you identify specific items easier and
faster. The alternative for me with an unsorted archive would be to sort
the output of 'tar tvf' for that purpose. It's easier, though, to sort
it once when populating the archive than to require it be sorted by the
unpacking users many times. It's similar to, say, 'ls'; I don't want to
type 'ls | sort -whatever' every time to get an order where I can easily
spot what I am looking for. It may be just me (or few people) who prefer
data sorted, but since it doesn't cost me anything to provide it sorted
I decided to just do it that way.

BTW, the displayed (and sorted by date) items let me (in the course of
the discussion posts in this thread) recognize that the file's 'mtime'
isn't consistent with the file numbers order. So we can consider the
sorting also as a quality measure of data sets that helps finding bugs
or data inconsistencies easier.

And it's not only convenience for humans, also for computers/programs.
I recall that in the 1990's (when I was closer to programming than I am
now) we had sorted *.a (or *.so, don't recall) library archives. I don't
recall the technical details or the exact rationale, but the reason was
to increase the performance of the build process.

And, finally, for those who don't see an advantage of sorted data, let
me also point you to Donald Knuth's decades-old book series "The Art of
Computer Programming" with the third book about "Sorting and Searching".
In the introduction he points to the "most important applications of
sorting"; a) Solving the "togetherness" problem, b) Matching items in
two or more files, and c) Searching for information by key values, that
closely resemble the reasons I had.

Janis

>
> Best regards
>
> Axel
>

Kenny McCormack

May 8, 2022, 6:06:19 AM
In article <Nlz6LfeP...@bongo-ra.co>,
Spiros Bousbouras <spi...@gmail.com> wrote:
...
>So the @ directive is on top of what strftime() offers. Note that
>https://www.gnu.org/software/libc/manual/html_mono/libc.html#index-strftime
>does not mention @ either. It does mention
>
> %s
>
> The number of seconds since the epoch, i.e., since
> 1970-01-01 00:00:00 UTC. Leap seconds are not counted
> unless leap second support is available.
>
> This format is a GNU extension.
>
>but https://www.gnu.org/software/findutils/manual/html_mono/find.html does
>not say you can use %s for seconds since the epoch.

I think the problem is that %s was already "taken" by "find" to mean
"size", so they couldn't use %s (from strftime) to mean seconds since the
epoch. So, they had to come up with something else (for "find" to use).

--
"Everything Roy (aka, AU8YOG) touches turns to crap."
--citizens of alt.obituaries--

Axel Reichert

May 8, 2022, 6:37:48 AM
Janis Papanagnou <janis_pa...@hotmail.com> writes:

> sorted item lists let you find specific items or detect
> inconsistencies easier

Thanks. Spotting inconsistencies did not occur to me, although I have
often used sorting for this.

> It may be just me (or few people) who prefer data sorted

Me too, a habit passed on by my father. It also helps to find structure
in the data and thus, contrary to its bean-counting image, might spark
creativity.

Best regards

Axel

Helmut Waitzmann

May 8, 2022, 8:51:57 AM
Strange.  I checked the manual page at my system (Debian buster)
and it's indeed the same as yours.  But the GNU find info manual at
my system says what I cited above.  The online GNU find info manual
(<https://www.gnu.org/software/findutils/manual/html_node/find_html/Time-Formats.html#Time-Formats>)
is even more clear:  "Below is an incomplete list of formats for
the directives ‘%A’, ‘%B’, ‘%C’, and ‘%T’, which print the file’s
timestamps.  Please refer to the documentation of strftime for the
full list.  Some of these formats might not be available on all
systems, due to differences in the implementation of the C strftime
function."

That is:  If a conversion specifier is not in the documentation of
the strftime function ("the full list"), then it will not be
available with GNU find.

So the info manuals disagree with the manual page.  Which one of
them is correct, which one is wrong?

Also, the manual page says (in the SEE ALSO section):  "The full
documentation for find is maintained as a Texinfo manual.  If the
info and find programs are properly installed at your site, the
command info find should give you access to the complete manual."

That lets me assume that the info manual is more complete than the
manual page.

In the BUGS section, the manual page says:  "The environment
variable LC_COLLATE has no effect on the -ok action." whereas in
the EXPRESSION and ENVIRONMENT VARIABLES sections it states that
the interpretation of the response given will be affected by the
environment variable LC_COLLATE.

Apparently the manual page contradicts itself.  That makes me doubt
its reliability.  Perhaps it's a compilation of different
sources?

> Note that
> https://www.gnu.org/software/libc/manual/html_mono/libc.html#index-strftime
> does not mention @ either. It does mention
>
> %s
>
> The number of seconds since the epoch, i.e., since
> 1970-01-01 00:00:00 UTC. Leap seconds are not counted
> unless leap second support is available.
>
> This format is a GNU extension.
>
> but
> https://www.gnu.org/software/findutils/manual/html_mono/find.html
> does not say you can use %s for seconds since the epoch.

But since %s is neither part of the POSIX definition of the
strftime function nor part of the find info manual I refrained from
using it as well.

>> The POSIX definition of the "strftime" function
>> (<https://pubs.opengroup.org/onlinepubs/9699919799/functions/strftime.html#top>)
>> knows the conversion specifiers "Y", "m", "d", "H", "M", and "S",
>> but not "@".
>
> But POSIX does not mention --zero-terminated for sed or sort
> either.

As in the OP Janis stated that he is using GNU tar, I assumed he
might use GNU find, GNU sort, and GNU sed as well.  As he made no
statement about the strftime function at his system, I preferred to
be better safe than sorry.

Brian Patrie

May 9, 2022, 5:20:03 PM
Axel Reichert wrote:
> Brian Patrie <bpa...@bellsouth.spamisicky.net> writes:
>
>> You can also use "-T -" to read the list of files from stdin.
>
> Ah, this avoids my xargs, great!
>
>> find dir-with-files | sort --version-sort \
>> | tar -czvf sorted.tgz --sort=none --no-recursion -T -
>
> [...]
>
>> "--sort=none" tells tar not to do its own sorting.
>
> Would this be done otherwise, even though the files are given directly
> on the command line as arguments (respectively read from STDIN) and not
> created by globbing?

It did for me.

>> "--no-recursion" tells tar not to do its own directory diving
>
> Is my understanding correct that this happens only if "find" returns
> directories? So depending on the contents of Janis's "dir-with-files", a
> simple
>
> find dir-with-files -name "rfc*.txt"
>
> might do, even without "-type f".

Yes, "-type f" would probably solve the same problem as
"--no-recursion", as long as no other types need to be caught and you
don't need directory metadata in the archive (which may be desirable,
depending on the use case). It would be needed even without
subdirectories, as find will normally yield the specified dir in its
output.

Brian Patrie

May 9, 2022, 5:46:56 PM
Janis Papanagnou wrote:
> To put the pieces together; for now I think something along
>
> ls rfcs/* | sort -t/ -k2.4,2.7n | tar czf rfcs.tgz -T -
>
> (untested!) would serve me best.

Just beware that subdirectories under rfcs/ may bugger things up. Also,
too many files might run you into the argv length limit (though that's
mighty huge, these days).

Janis Papanagnou

May 9, 2022, 6:09:17 PM
On 09.05.2022 23:46, Brian Patrie wrote:
> Janis Papanagnou wrote:
>> To put the pieces together; for now I think something along
>>
>> ls rfcs/* | sort -t/ -k2.4,2.7n | tar czf rfcs.tgz -T -
>>
>> (untested!) would serve me best.
>
> Just beware that subdirectories under rfcs/ may bugger things up.

I don't have subdirectories, that's why I said it serves me best.

> Also,
> too many files might run you into the argv length limit (though that's
> mighty huge, these days).

Ah, right. I might then replace that code by

printf "%s\n" rfcs/* | sort -t/ -k2.4,2.7n | tar czf rfcs.tgz -T -

which (as a shell built-in) doesn't have that limit. (Or I can use
find, as suggested elsethread, though I prefer efficient built-ins.)
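
Since the earlier version was marked untested, here is a small sketch that exercises the printf variant against three dummy files. It assumes GNU sort and GNU tar; the scratch directory and the empty files are made up for the test. The key -k2.4,2.7n takes characters 4 to 7 of the second "/"-separated field, so it relies on the "rfc" prefix and on numbers of at most four digits.

```shell
# Scratch area with three empty dummy files (hypothetical names).
work=$(mktemp -d)
mkdir "$work/rfcs"
for n in 8774 748 7168; do : > "$work/rfcs/rfc$n.txt"; done

(
    cd "$work"
    # Field 2 is e.g. "rfc748.txt"; characters 4-7 ("748.") are
    # compared numerically, which yields 748 < 7168 < 8774.
    printf '%s\n' rfcs/* | sort -t/ -k2.4,2.7n | tar -czf rfcs.tgz -T -
)

listing=$(tar -tzf "$work/rfcs.tgz")
printf '%s\n' "$listing"
rm -rf "$work"
```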

Janis
