tar::create with encoding?

Alexandru

Oct 16, 2022, 12:06:26 PM
Hi,

It seems, that there is no encoding option for the command tar:encoding.

I use this command to create an archive of multiple files:

set fd [open $zipfile wb]
zlib push gzip $fd -level 9
tar::create $fd $paths -chan
close $fd

Now I realize that all files with umlauts in the path/name are wrongly encoded when the archive is unpacked with the Windows program 7z.

What could be the solution to this issue?

Many thanks
Alexandru

Rich

Oct 16, 2022, 1:08:32 PM
Alexandru <alexandr...@meshparts.de> wrote:
> Hi,
>
> It seems, that there is no encoding option for the command
> tar:encoding.

There is also no command tar::encoding.

Tar (the archive format) is so old that it does not have an 'encoding'.
It just stores bytes, and upper level code has to decide what to do
with the bytes.

> I use this command to create an archive of multiple files:
>
> set fd [open $zipfile wb]
> zlib push gzip $fd -level 9
> tar::create $fd $paths -chan
> close $fd
>
> Now I realize that all files with umlauts in the path/name are wrongly
> encoded when the archive is unpacked with the Windows program 7z.

The issue here could be Tcllib tar, or it could be 7z. Right now you
don't know, and Tar (the format) has no way to communicate a flag that
says "filenames herein are UTF8 (or any other encoding)".

> What could be the solution to this issue?

Several:

1) (easiest, but may not be practical) -- don't use umlauts (or other
non-ASCII characters) in filenames.

2) If you look through the source of Tcllib's tar, you will find that
it inserts the filenames into the tar header block using binary format
"a" (which simply inserts the codepoint value modulo 256, and that will
only be correct for an 8-bit fixed length encoding). Which likely
means the breakage happens during tar::create.

If you look further up the call chain, you find that directories are
resolved to lists of filenames via glob, and the proc which writes each
tar component is fed a filename to work with.

So, you could use tcllib's find to pre-acquire the filenames you want to
pack into the tar file, pre-encode them into the appropriate encoding
using 'encoding convertto', and output the tar file by calling
'formatHeader' with the 'encoded' name, and fcopying the file contents
yourself.

3) You could patch tcllib's tar to encode filenames to an encoding
(including allowing specification of that encoding type via an option
to tar::create). And then contribute the patches back to Tcllib so
everyone benefits.
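
To make point 2 concrete, here is a small sketch (not taken from tar.tcl) of which bytes come out when a name goes through binary format "a", with and without a prior 'encoding convertto'. The file name is made up for illustration, and the umlaut is written as an escape so the script's own file encoding plays no role.

```tcl
# Hypothetical file name; \u00e4 is "a with diaeresis" (U+00E4).
set name "M\u00e4rz.txt"

# Raw string straight into binary format: one byte per character.
binary scan [binary format a* $name] H* rawHex

# Converted to UTF-8 first: the umlaut becomes the two bytes C3 A4.
binary scan [binary format a* [encoding convertto utf-8 $name]] H* utf8Hex

# Converted to cp1252 first: the umlaut is the single byte E4.
binary scan [binary format a* [encoding convertto cp1252 $name]] H* cpHex

puts "raw:    $rawHex"    ;# 4de4727a2e747874
puts "utf-8:  $utf8Hex"   ;# 4dc3a4727a2e747874
puts "cp1252: $cpHex"     ;# 4de4727a2e747874
```

For this particular character the raw and cp1252 results are the same single byte, while UTF-8 uses two bytes. Which of these an unpacker expects is exactly the open question.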

Alexandru

Oct 17, 2022, 6:58:30 AM
Thanks Rich. After looking at tar.tcl, I see that "-encoding binary" is used for the output channel (which must be the archive file) and also for encoding data inside the file, e.g. for composing the header.
I think that if I start playing around with the code, I might even make it work for my case, but most probably it won't work for other cases. This is due to my limited understanding of the whole encoding topic.
But I think changing the source is the best way.
What I don't quite understand is why pre-encoding the paths does not work:

tar::create $fd [encoding convertto $enc $paths] -chan

I tried enc=utf-8 and the Windows compatible encoding enc=cp1252 but both didn't work.
Shouldn't this be enough?

Rich

Oct 17, 2022, 8:00:42 AM
Alexandru <alexandr...@meshparts.de> wrote:
> Rich wrote on Sunday, October 16, 2022 at 19:08:32 UTC+2:
>> Alexandru <alexandr...@meshparts.de> wrote:
>> > Now I realize that all files with umlauts in the path/name are wrongly
>> > encoded when the archive is unpacked with the Windows program 7z.
> What I don't quite understand is why pre-encoding the paths does not work:
>
> tar::create $fd [encoding convertto $enc $paths] -chan
>
> I tried enc=utf-8 and the Windows compatible encoding enc=cp1252 but
> both didn't work. Shouldn't this be enough?

If you look through the source, the tar module uses the paths you
supply to also open each file and copy its contents to the output tar
file. If you pre-encode the strings, then those opens likely will not
find the correct file (because the name used to open will have been
changed by the encoding process).
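
A minimal sketch of that mismatch, using a made-up file name: once the name has been through 'encoding convertto', it is no longer the same string, so any later open would be handed a different name. Note also that in the call quoted above the conversion is applied to the whole list $paths at once, not to each path.

```tcl
set name "M\u00e4rz.txt"                      ;# hypothetical file name
set encoded [encoding convertto utf-8 $name]  ;# what the pre-encoding produces

puts [string length $name]          ;# 8 characters
puts [string length $encoded]       ;# 9: one element per UTF-8 byte
puts [string equal $name $encoded]  ;# 0: not the name that exists on disk
```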

The patch to tar, presuming it would work, would be to perform encoding
convertto on the path/name inside the writeheader proc that outputs the
paths/names into the tar header. That way the open gets back the
string it needs to open the correct file, but non-ascii characters get
encoded just before being output into the header.
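
The idea can be sketched as follows. This is an illustration under assumptions, not a tested patch to tar.tcl: headerNameBytes is a hypothetical helper, the default of utf-8 is only a guess at what the unpacker wants, and the sketch ignores the separate prefix field that ustar headers use for long paths.

```tcl
# Hypothetical helper (not part of tcllib): returns the bytes for the
# 100-byte name field and leaves the caller's original string untouched.
proc headerNameBytes {name {enc utf-8}} {
    set bytes [encoding convertto $enc $name]
    if {[string length $bytes] > 100} {
        return -code error "encoded name exceeds the 100-byte name field"
    }
    return $bytes
}

set path "M\u00e4rz.txt"   ;# still the string that open needs

# a100 pads the encoded name with NUL bytes up to the field width.
set field [binary format a100 [headerNameBytes $path]]
puts [string length $field]   ;# 100
```

The length check matters because the limit applies to encoded bytes, and a name with many non-ASCII characters grows when converted to UTF-8.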

Alexandru

Oct 17, 2022, 8:14:22 AM
Thanks. I changed the following statement in the source by adding "encoding convertto cp1252":

set header [binary format a100A8A8A8A12A12A8a1a100A6a2a32a32a8a8a155a12 \
[encoding convertto cp1252 $name] $A(mode)\x00 $ouid\x00 $ogid\x00\
$osize\x00 $omtime\x00 {} $type \
$A(linkname) ustar\x00 00 $A(uname) $A(gname)\
$A(devmajor) $A(devminor) $prefix {}]

I also tried utf-8. The result is a valid archive, but when I open it with 7z on Windows, the names in the archive show different special characters, not the umlauts I actually have in the original file names.

Rich

Oct 17, 2022, 9:35:56 AM
'encoding names' will list every encoding your Tcl supports.
Whether any of them works depends on what 7z expects to see in the
names inside the tar file. That is the big unknown here: you need to
write what 7z expects, and without knowing that, you are left trying
each encoding to see if one works.

Rich

Oct 17, 2022, 9:54:24 AM
Alexandru <alexandr...@meshparts.de> wrote:
> I also tried utf-8. The result is a valid archive, but when I open it
> with 7z on Windows, the names in the archive show different special
> characters, not the umlauts I actually have in the original file
> names.

Try one more test (if you can).

Create a tar file using 7z (if possible) and see if:

1) the umlaut is encoded correctly (if the answer is no here, then
this may not be possible with 7z)

2) if the answer is yes to #1, then open up the tar file in a hex
display/hex editor and try to work out which encoding 7z likely used
for the filename (and the umlauts).

Alexandru

Oct 17, 2022, 9:59:29 AM
Creating with Windows 7z is no problem.
I opened the archive with SublimeText as Hexadecimal file and I only see binary stuff:

377a bcaf 271c 0004 cf4b c197 6006 1b00
0000 0000 2400 0000 0000 0000 5fca de53
e3ac bcc0 235d 0006 82cf 6346 7fed db19
67c2 b2aa c224 0a02 1e57 167f 3a28 63ef
864b 7da8 71ed dc72 9494 456e b474 a34e
7646 3e62 b0bc fb35 b31f 98ec 0cde 30ab

How should that work? Which editor do you recommend (for Windows)?

Alexandru

Oct 17, 2022, 11:46:31 AM
OK, I added the HexViewer package to SublimeText and now I get the usual hex view, as in typical hex editors. Let's see where this leads...

Alexandru

Oct 17, 2022, 11:52:38 AM
Beats me, I only see gibberish characters. How the hell can I tell from the hex view what the damn encoding is?
I just ran "file --mime-encoding" in the Git console and the return value is binary. No surprise there.

Robert Heller

Oct 17, 2022, 12:08:13 PM
Copy the 7z created tar file to a Linux machine and see what 'tar tvf' gives
for file names.

Is 7z compressing the tar file? If so, you need to uncompress it before
looking for filename encoding.
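
One way to check is to look at the first bytes of the file. The first six bytes of the dump posted earlier in the thread, 37 7a bc af 27 1c, are the signature of the 7z container format, so that file appears to be a .7z archive rather than a bare tar file. The sketch below tells the common cases apart; sniffContainer is a hypothetical helper name, and it only checks the gzip and 7z signatures plus the "ustar" magic at offset 257 of a tar header.

```tcl
# Hypothetical helper: guess the container type from the leading bytes.
proc sniffContainer {bytes} {
    set magic ""
    binary scan $bytes H12 magic          ;# first 6 bytes as lowercase hex
    if {[string match 1f8b* $magic]} { return gzip }
    if {$magic eq "377abcaf271c"}    { return 7z }
    # ustar headers carry the string "ustar" at byte offset 257.
    if {[string range $bytes 257 261] eq "ustar"} { return tar }
    return unknown
}

# First 8 bytes of the hex dump posted earlier in the thread.
puts [sniffContainer [binary format H* 377abcaf271c0004]]   ;# 7z
```

To use it on a real file, read the first 512 bytes from a channel opened with "rb" and pass them in.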

--
Robert Heller -- Cell: 413-658-7953 GV: 978-633-5364
Deepwoods Software -- Custom Software Services
http://www.deepsoft.com/ -- Linux Administration Services
hel...@deepsoft.com -- Webhosting Services

Christian Gollwitzer

Oct 17, 2022, 1:21:17 PM
On 17.10.22 at 17:52, Alexandru wrote:
A tarred archive is usually first tarred, then gzipped. You need to undo
the gzip first to see the tar (the metadata is also compressed, unlike a
ZIP file)

On Linux you would do e.g.

zcat compressedfile > uncompressedfile

Or, if the file has the usual ending (e.g. data.tar.gz) then this should
do the trick:

gunzip data.tar.gz

which then yields data.tar

Christian

Rich

Oct 17, 2022, 2:21:17 PM
No idea, as I avoid windows like the plague it is.

Under linux I'd use xxd or hexdump, which can both be asked to give a
hex dump plus ascii equivalents in the same view.

If you can post the first 512 bytes of the tar file (that's the tar
header) as hex like the above, I can use the hex to convert it back to
binary, then dump with xxd or hexdump.

Rich

Oct 17, 2022, 2:22:49 PM
First you have to find the bytes that are the filename/path within the
header, then compare those bytes with the names displayed when viewed
in windows/7z. Whatever differs byte-wise from the displayed name will
then begin to suggest an encoding.
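
A sketch of that comparison in Tcl, assuming an uncompressed tar whose first 512 bytes are a header with the name in the first 100 bytes. nameFieldHex is a hypothetical helper, and the header here is built in memory from a made-up name so the example runs without a file.

```tcl
# Hypothetical helper: hex of the name field, cut at the first NUL byte.
proc nameFieldHex {header} {
    binary scan $header a100 field
    set nul [string first \x00 $field]
    if {$nul >= 0} {
        set field [string range $field 0 [expr {$nul - 1}]]
    }
    binary scan $field H* hex
    return $hex
}

# Stand-in for: set f [open data.tar rb]; set header [read $f 512]; close $f
set header [binary format a100a412 [encoding convertto utf-8 "M\u00e4rz.txt"] {}]

set hex [nameFieldHex $header]
puts $hex

# Decode the same bytes under candidate encodings and compare by eye.
foreach enc {utf-8 cp1252} {
    puts "$enc: [encoding convertfrom $enc [binary format H* $hex]]"
}
```

Only one of the candidate decodings should reproduce the name as it appears on disk; the others show the kind of substituted characters described earlier.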

Robert Heller

Oct 17, 2022, 3:29:21 PM
More correctly, compression is not part of the tar format: the tarball as
a whole is compressed afterwards. This is unlike a ZIP file, which compresses
each entry separately while the ZIP file itself is not compressed (e.g. the
ZIP directory is not compressed, even though some or all of the contents
are).

>
> On Linux you would do e.g.
>
> zcat compressedfile > uncompressedfile
>
> Or, if the file has the usual ending (e.g. data.tar.gz) then this should
> do the trick:
>
> gunzip data.tar.gz

(Modern) Linux tar itself recognizes the endings and will decompress
on the fly as needed:

tar tvf data.tar.gz
(or:
tar tvf data.tar.bz2
etc.)

No need to separately decompress the tarfile, if you are going to use the tar
command on it. This might actually be a good first step for the OP. Seeing
what tar displays for the file name might yield some enlightenment.

It should be noted that tar originated under UNIX and predates the use of
non-ASCII characters in file names.

Christian Gollwitzer

Oct 18, 2022, 1:48:43 AM
On 17.10.22 at 21:29, Robert Heller wrote:
But then you see again an interpretation of the file names through tar
and the terminal. Alexandru wants to see the raw data in order to
replicate tar in Tcl code. I would guess that tar simply stores the file
name as a bytestream, since on Linux the file systems do not have an
encoding as opposed to Windows - the file names you see depend on how
you set LOCALE on Linux, whereas they are converted to UTF16 on NTFS
file systems on Windows.

7z on Windows might again have a different idea of the tar format. Not
to mention that there are multiple tar formats out there; e.g. see here:

https://www.gnu.org/software/tar/manual/html_section/Formats.html



Christian

Alexandru

Oct 19, 2022, 4:40:54 AM
This whole encoding business is crazy to follow.
I was thinking that maybe I can get a workaround if I use Tcl to unpack the archive.
Since I could not find a manual for the tar package, I had to read the source code and other forums online to get close to a solution, which is not working right now.
So here is the code to unpack a 7z archive, which contains a tar archive:

set f [open $zipfile]
zlib push gunzip $f
set result [tar::untar $f -chan]
close $f

I get this type of error:
couldn't open "<filename> 100644 0 0 226212 1432372742" : filename is invalid on this platform

It seems that untar cannot correctly handle multiple file names in the header.
The first file name in the archive is handled correctly. The second one is somehow cut off in the middle, and the data that follows is appended to the file name instead.

The method used to create the 7z file is

set fd [open $zipfile wb]
zlib push gzip $fd -level 9
tar::create $fd $paths -chan
close $fd

where $paths is a list of full file paths.

Is this a bug? Or am I using the wrong method to unpack?
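
One difference between the two snippets is that the writing side opens the file with "wb" while the reading side uses a plain open without binary mode. Whether that explains the error is not established here. The sketch below only shows a gzip round trip with both channels in binary mode, using core Tcl's zlib and a made-up payload in place of real tar data, so the tar package is not involved at all.

```tcl
# Made-up payload standing in for tar data: two NUL-padded 512-byte blocks.
set payload [binary format a512a512 first.txt second.txt]

# Get a temporary file name; the channel from file tempfile is not needed.
close [file tempfile tmp]

# Write side: binary channel with a gzip transform pushed on top.
set out [open $tmp wb]
zlib push gzip $out -level 9
puts -nonewline $out $payload
close $out

# Read side: also binary ("rb"), then gunzip pushed on top.
set in [open $tmp rb]
zlib push gunzip $in
set back [read $in]
close $in
file delete $tmp

puts [expr {$back eq $payload}]   ;# 1
```

If the same change ("rb" on the reading side) makes no difference with tar::untar, the channel mode can at least be ruled out as the cause.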
