untar file by file in a loop

Alexandru

Nov 1, 2022, 3:39:07 AM

I have a procedure that unpacks files given by a list of file paths from an archive like this:

proc ::meshparts::AssemblyArchiveUnpack {zipfile {paths {}} {targetpaths {}}} {
    set f [open $zipfile rb]
    fconfigure $f -encoding binary -translation lf -eofchar {}
    zlib push gunzip $f
    if {[llength $paths]==0} {
        set result [tar::untar $f -chan]
    } else {
        foreach path $paths targetpath $targetpaths {
            set dir [file dirname $targetpath]
            set code [catch {file mkdir $dir} err]
            if {$code} {
                ::meshparts::message "*** [mc {%1$s} $err]" -errorlog 0
                continue
            }
            set result [tar::untar $f -file $path -dir $dir -chan]
            seek $f 0
        }
    }
    close $f
    return 1
}

The main part is the foreach:

foreach path $paths targetpath $targetpaths {
    set dir [file dirname $targetpath]
    set code [catch {file mkdir $dir} err]
    if {$code} {
        ::meshparts::message "*** [mc {%1$s} $err]" -errorlog 0
        continue
    }
    set result [tar::untar $f -file $path -dir $dir -chan]
    seek $f 0
}

It can be further reduced to:

foreach path $paths targetpath $targetpaths {
    set dir [file dirname $targetpath]
    set result [tar::untar $f -file $path -dir $dir -chan]
    seek $f 0
}

The problem is that it only works for the first file in the list.
The second file is not unpacked, and if a third file is given I get the error:

*** START OF ERROR MESSAGE ***
can't read "name": no such variable
can't read "name": no such variable
while executing
"set $x"
(procedure "readHeader" line 5)
invoked from within
"readHeader [read $fh 512]"
(procedure "tar::untar" line 24)
invoked from within
"tar::untar $f -file $path -dir $dir -chan"

To me, it looks like the untar procedure has a bug.
I added the "seek $f 0" command while trying to make it work, without success so far.
I think that, while the read channel stays open, the untar procedure reads until the end of the file, so the next untar command does not find the needed file.
But then the "seek $f 0" should actually solve the problem, and it doesn't.

Here is the untar procedure, maybe some trained eyes can see the issue better than me.

proc ::tar::untar {tar args} {
    set nooverwrite 0
    set data 0
    set nomtime 0
    set noperms 0
    set chan 0
    parseOpts {dir 1 file 1 glob 1 nooverwrite 0 nomtime 0 noperms 0 chan 0} $args
    if {![info exists dir]} {set dir [pwd]}
    set pattern *
    if {[info exists file]} {
        set pattern [string map {* \\* ? \\? \\ \\\\ \[ \\\[ \] \\\]} $file]
    } elseif {[info exists glob]} {
        set pattern $glob
    }

    set ret {}
    if {$chan} {
        set fh $tar
    } else {
        set fh [::open $tar]
        fconfigure $fh -encoding binary -translation lf -eofchar {}
    }
    while {![eof $fh]} {
        array set header [readHeader [read $fh 512]]
        HandleLongLink $fh header
        if {$header(name) == ""} break
        if {$header(prefix) != ""} {append header(prefix) /}
        set name [string trimleft $header(prefix)$header(name) /]
        if {![string match $pattern $name] || ($nooverwrite && [file exists $name])} {
            seekorskip $fh [expr {$header(size) + [pad $header(size)]}] current
            continue
        }

        if {$dir!=""} {
            if {[::tar::isabsolute $name]} {
                set name [file join $dir [file tail $name]]
            } else {
                set name [file join $dir $name]
            }
        }
        if {![file isdirectory [file dirname $name]]} {
            file mkdir [file dirname $name]
            lappend ret [file dirname $name] {}
        }
        if {[string match {[0346]} $header(type)]} {
            if {[catch {::open $name w+} new]} {
                # sometimes if we dont have write permission we can still delete
                catch {file delete -force $name}
                set new [::open $name w+]
            }
            fconfigure $new -encoding binary -translation lf -eofchar {}
            fcopy $fh $new -size $header(size)
            close $new
            lappend ret $name $header(size)
        } elseif {$header(type) == 5} {
            file mkdir $name
            lappend ret $name {}
        } elseif {[string match {[12]} $header(type)] && $::tcl_platform(platform) == "unix"} {
            catch {file delete $name}
            if {![catch {file link [string map {1 -hard 2 -symbolic} $header(type)] $name $header(linkname)}]} {
                lappend ret $name {}
            }
        }
        seekorskip $fh [pad $header(size)] current
        if {![file exists $name]} continue

        if {$::tcl_platform(platform) == "unix"} {
            if {!$noperms} {
                catch {file attributes $name -permissions 0[string range $header(mode) 2 end]}
            }
            catch {file attributes $name -owner $header(uid) -group $header(gid)}
            catch {file attributes $name -owner $header(uname) -group $header(gname)}
        }
        if {!$nomtime} {
            file mtime $name $header(mtime)
        }
    }
    if {!$chan} {
        close $fh
    }
    return $ret
}

Rich

Nov 1, 2022, 10:57:26 AM

Alexandru <alexandr...@meshparts.de> wrote:
> I have a procedure that unpacks files given by a list of file paths from an archive like this:
>
> proc ::meshparts::AssemblyArchiveUnpack {zipfile {paths {}} {targetpaths {}}} {

That name will cause confusion for yourself in the future. A zip file is not a tar
file, and a tar file is not a zip file (zip and tar are two very
different formats). Naming the variable 'zipfile'
implies a "zip", not a "tar", at first glance.

> set f [open $zipfile rb]
> fconfigure $f -encoding binary -translation lf -eofchar {}
> zlib push gunzip $f
> if {[llength $paths]==0} {
> set result [tar::untar $f -chan]
> } else {
> foreach path $paths targetpath $targetpaths {
> set dir [file dirname $targetpath]
> set code [catch {file mkdir $dir} err]
> if {$code} {
> ::meshparts::message "*** [mc {%1$s} $err]" -errorlog 0
> continue
> }
> set result [tar::untar $f -file $path -dir $dir -chan]
> seek $f 0
> }
> }
> close $f
> return 1
> }

If your tar file is indeed gzipped, implied by this:
> zlib push gunzip $f
then simply doing this:
> seek $f 0
will not work, because just seeking to the beginning does not reset the
gunzip state to the same as it was at initial file opening. Which is
most likely why things are failing for you.

Try closing and reopening the file inside the loop. If that works,
then this was the cause.

> For me, It looks like the untar procedure has a bug.

Looks to me like you are creating the problem by trying to seek around
inside gzipped data. You also have to be able to reset the gunzip
uncompress state to the identical state it was in for the file offset to
make that work.

If you can't formulate a glob pattern for the set of files you want to
extract, then you'll have to do one of four things:

1) unpack the entire tar file into a temporary location, then move out
the files of interest and delete the unwanted files

2) close and reopen the file inside the loop around tar::untar. But you
are still left with scanning all of the preceding tar data up to the
file of interest, which means you are quite close to an O(N^2)
complexity here

3) Create your own 'untar' by making calls into the tar module
internals to read file headers, decide if the header is for a file of
interest, and extract the file if so. This, however, does mean you are
calling procs that are not documented as part of the visible api to the
tar module, so should the internals change, your code would break until
you adapted. This method, however, does give you the most efficient
extract, because only a single pass over the tar file is needed.

4) Extend the tar module's untar proc to take an additional parameter
that is a list of filenames to match tar entries against and extract
each when found, and consider contributing the changes back to Tcllib.
This has the identical benefits of #3, with the added benefit that if
accepted, your change becomes part of the documented API so less likely
to change "out from under you" in the future.
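
The single-pass idea behind options 3 and 4 can be sketched without touching the tar module at all. The following is only an illustration under stated assumptions: `extractMany` and its `wanted` dict (member name -> target path) are made-up names, not part of Tcllib; the parser reads only the name and size fields of each 512-byte header, so it ignores the ustar prefix field, long-name extensions, checksums and entry types; and each wanted member is read fully into memory.

```tcl
# Extract several members from a gzipped tar in ONE forward pass.
# wanted: dict mapping member name -> target path (hypothetical interface).
proc extractMany {archive wanted} {
    set f [open $archive rb]
    zlib push gunzip $f
    set done {}
    try {
        # Stop early once every wanted member has been written.
        while {[dict size $wanted] > 0} {
            set header [read $f 512]
            if {[string length $header] < 512} break
            # name: bytes 0-99 (NUL padded); size: bytes 124-135 (octal text)
            binary scan $header a100x24a12 name size
            set name [string trimright $name \0]
            if {$name eq ""} break   ;# empty block marks the end of the archive
            scan [string trim $size " \0"] %o size
            # Member data is padded to a multiple of 512 bytes.
            set padded [expr {($size + 511) / 512 * 512}]
            if {[dict exists $wanted $name]} {
                set target [dict get $wanted $name]
                file mkdir [file dirname $target]
                set out [open $target wb]
                puts -nonewline $out [read $f $size]
                close $out
                read $f [expr {$padded - $size}]   ;# consume the padding
                lappend done $name
                dict unset wanted $name
            } else {
                # The gunzip stream is forward-only: read and discard.
                read $f $padded
            }
        }
    } finally {
        close $f
    }
    return $done
}
```

A call might look like `extractMany archive.tar.gz [dict create docs/a.txt /tmp/out/a.txt]` (paths invented for the example). The return value lists the members that were actually found, so the caller can report the ones that were missing.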

Alexandru

Nov 1, 2022, 11:36:00 AM

Thanks Rich,

I must admit, I still don't understand, how "read" can work on the channel but "seek" not.
I'll just follow your advice and see if I can add a -files option to the "untar" procedure and propose a change on github (your option 4).

Option 2 is of course a "no go". I can already see that the time needed to open the archive and find one file is huge. Doing this for multiple files would be a party breaker.

BTW: I know tar and zip are different formats. I have this habit of calling all types of archives a zip file.

Regards
Alexandru

Schelte

Nov 1, 2022, 11:59:14 AM

On 01/11/2022 16:35, Alexandru wrote:
> Option 2 is of course a "no go".

Instead of closing/reopening, you can also pop the gunzip channel
transformation, seek to the beginning, and then push the transformation
again. But I doubt that will make a big difference in performance.
Parsing the file multiple times is what makes it slow. Closing/opening
the file is probably negligible in comparison.
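
A minimal sketch of the pop/seek/push sequence, using only core Tcl 8.6 commands (`chan pop`, `seek`, `zlib push`). The helper name `rewindGunzip` and the throwaway test file are made up for the demonstration; the assumption is that the channel is a plain seekable file with exactly one gunzip transformation on top.

```tcl
# Rewind a gunzip-wrapped file channel: drop the transform, seek, push again.
proc rewindGunzip {chan} {
    chan pop $chan          ;# remove the gunzip transform together with its state
    seek $chan 0            ;# ordinary seek on the compressed bytes
    zlib push gunzip $chan  ;# fresh decompressor, starting at the gzip header
}

# Demonstration on a temporary file (contents invented for the example).
set tmp [file tempfile path]
fconfigure $tmp -translation binary
puts -nonewline $tmp [zlib gzip "hello tar world"]
close $tmp

set f [open $path rb]
zlib push gunzip $f
set first [read $f 5]
rewindGunzip $f
set again [read $f 5]   ;# should match $first if the rewind worked
close $f
file delete $path
puts [list $first $again]
```

In the loop from the original post, `rewindGunzip $f` would take the place of the bare `seek $f 0`. Every tar::untar call would still rescan the archive from the start, so this addresses the error, not the speed.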

Schelte.

Robert Heller

Nov 1, 2022, 12:44:12 PM

When you "read" a compressed tar file, you are not actually reading the tar
file itself, but the output of a pipeline from gunzip (or something like
gunzip). You can't seek on a pipeline -- I don't know if this is an actual pipe
device or a 'faked' pipe using VFS hackery and it does not matter which, the
effect is the same.

> Option 2 is of course a "no go". I can already see now the time needed to open the archive and finding one file is huge. Doing this for multiple files would be a party breaker.
>
> BTW: I know tar and zip are different formats. I have this habit of calling all types of archives a zip file.

Confusing tar and zip like this is probably what is getting you into lots of
trouble, esp. if you are confusing a gzipped tar file with a zip file.

Some important things to understand about tar and zip files:

Tar was originally designed for *tapes* (yes, those reels of plastic film
coated with iron oxide). Hardly anybody uses tapes anymore. Tar files don't have
compressed elements; the whole tar file gets compressed as a single blob. Tar
files are meant to be read and written sequentially and not randomly accessed.

*Zip* files contain an *uncompressed* table of contents, and each member element
is separately compressed (or not). Zip files were specifically designed to be
randomly accessed -- one can seek to the end and read the TOC and then seek to
specific files in the Zip archive and extract (and uncompress) them, in any
order you like.

>
> Regards
> Alexandru
>
>

--
Robert Heller -- Cell: 413-658-7953 GV: 978-633-5364
Deepwoods Software -- Custom Software Services
http://www.deepsoft.com/ -- Linux Administration Services
hel...@deepsoft.com -- Webhosting Services

Rich

Nov 1, 2022, 12:47:06 PM

Alexandru <alexandr...@meshparts.de> wrote:
> Thanks Rich,
>
> I must admit, I still don't understand, how "read" can work on the
> channel but "seek" not.

The seek works. You move the file pointer back and start reading from
a different offset.

But, your file is a gzip file. The gzip compressed format needs to be
read from the front, because to unpack byte X, you need the gzip
compression state that was created by unpacking bytes 0 through X-1.

If you are at offset Y, you have the gzip compression state created
from 0 through Y-1. If you now seek to X, you'll get the wrong result
from trying to decompress X using the gzip state of 0 through Y-1.

> I'll just follow your advice and see if I can add a -files option to
> the "untar" procedure and propose a change on github (your option 4).
>
> Option 2 is of course a "no go". I can already see now the time
> needed to open the archive and finding one file is huge. Doing this
> for multiple files would be a party breaker.

Tar is not zip. The expanded acronym gives a clue: (T)ape (Ar)chive.
It was created (originally) to package files onto magnetic tape. As
tape does not have "random seek ability", tar contains no features to
allow random access within the tar file. You have to either read it
from the start in a linear manner, or pre-index once up front (by
reading it from front to back in a linear manner) and then use your
index to randomly grab files out.
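
The pre-indexing idea can be sketched in plain Tcl as follows. This is an illustration only, under these assumptions: the channel is an uncompressed, seekable tar file opened in binary mode (a gunzip-wrapped channel would not allow the relative seeks); `tarIndex` is a made-up helper name; and the parser ignores the ustar prefix field, long-name extensions and checksums.

```tcl
# Build an index "member name -> {data offset, size}" for a plain tar channel.
# One linear pass over the headers; member data is skipped, not read.
proc tarIndex {chan} {
    set index [dict create]
    while {1} {
        set header [read $chan 512]
        if {[string length $header] < 512} break
        # name: bytes 0-99 (NUL padded); size: bytes 124-135 (octal text)
        binary scan $header a100x24a12 name size
        set name [string trimright $name \0]
        if {$name eq ""} break   ;# empty block marks the end of the archive
        scan [string trim $size " \0"] %o size
        # The member's data starts right after its header.
        dict set index $name [list [tell $chan] $size]
        # Data is padded to a multiple of 512 bytes; jump to the next header.
        seek $chan [expr {($size + 511) / 512 * 512}] current
    }
    return $index
}
```

With the index in hand, a member can be fetched in any order: `lassign [dict get $index $name] offset size`, then `seek $chan $offset` and `read $chan $size`.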

Zip files include index data as part of the format, so one can directly
access a single file in a zip without having to read the whole file
from the front in order to do so.

> BTW: I know tar and zip are different formats. I have this habit of
> calling all types of archives a zip file.

Which is fine, but it confuses others who call a tar file a tar file
and a zip file a zip file because they are two very different file
formats.

Rich

Nov 1, 2022, 12:48:36 PM

Schelte <nos...@wanadoo.nl> wrote:
> On 01/11/2022 16:35, Alexandru wrote:
>> Option 2 is of course a "no go".
> Instead of closing/reopening, you can also pop the gunzip channel
> transformation, seek to the beginning, and then push the transformation
> again.

Ah, that would reset the gzip state as well. I forgot about that
option.

> But I doubt that will make a big difference in performance. Parsing
> the file multiple times is what makes it slow. Closing/opening the
> file is probably negligible in comparison.

Agreed.

Alexandru

Nov 1, 2022, 2:44:15 PM

Thanks all for the help.
I added the -files and -dirs options to the untar procedure and committed the changes:
https://github.com/Meshparts/tcllib/blob/master/modules/tar/tar.tcl