I have a number of Tcl procedures I wrote a few years ago that are
designed to decode geometric data that's packed into an "Unformatted
Sequential File" that's written by a FORTRAN application.
The internal organization of this type of FORTRAN file is documented
at http://www.ae.utexas.edu/lrc/fortran/intel/f_ug1/pggfmsp.htm under
the "Unformatted Sequential Files" section.
Essentially, it's fairly straightforward, except data that's longer
than 128 bytes is broken into 128 byte chunks, with each chunk being
surrounded by a "length byte" at either end.
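To make the arithmetic concrete, here is how the on-disk size of a
logical record works out (physicalLength is a throwaway helper for
illustration only - it's not part of my actual code - and it assumes
the final short chunk is not padded):
proc physicalLength {len} {
    set full [expr {$len / 128}]   ;# complete 128-byte chunks
    set rem  [expr {$len % 128}]   ;# final, shorter chunk (if any)
    set phys [expr {$full * 130}]  ;# 128 data bytes + 2 marker bytes each
    if {$rem > 0} {
        incr phys [expr {$rem + 2}] ;# short chunk also gets 2 marker bytes
    }
    return $phys
}
# e.g. 28 doubles = 224 logical bytes -> 130 + 98 = 228 bytes on disk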
The file is stored in a bunch of sections, with the first few bytes of
each section specifying the data type of the upcoming object as well
as the number of that object-type to read. So, for instance, I might
find that I need to read 28 doubles from the next section. Once
that's done, I should be sitting on the next section header...
That's all well and good, but the 128-byte chunk limitation mentioned
above complicates things a bit. Because the underlying FORTRAN write
breaks the data apart, and pads it with the length bytes, I had to
devise a way to account for these breaks during my read. With that in
mind, I created the enclosed binaryScan procedure. It accepts a file
offset, a variable to hold the return data, a Tcl binary format
character for the read, and a count for the number of objects to read.
From that, it determines how to properly dodge the length-bytes and
reassembles the data into a continuous chunk for return to the caller.
The procedure seems to work well but now I need to make it faster. As
I don't deal with much binary data (outside of this), I'm sure there's
room for improvement. So, without further explanation, does anyone
see any glaring opportunities for optimization?
Thanks for any suggestions.
Jeff
proc ::msio::binaryScan {offset retVar format count {addPad 1}} {
    upvar $retVar data
    variable binaryData

    set data [list]

    # --- verify that we got a valid format
    #     if so, record the number of bytes read by the format
    switch -- $format {
        c {set bytes 1}
        s {set bytes 2}
        i {set bytes 4}
        f {set bytes 4}
        d {set bytes 8}
        default {return -code error "Invalid format statement - $format"}
    }

    # --- If addPad is 0, just do a raw read of the requested data. That is,
    #     don't pad the read with leader and trailer bytes...
    if {!$addPad} {
        binary scan $binaryData @${offset}$format$count data
        incr offset [expr {$bytes * $count}]
    } else {
        # --- determine the byte length of the requested read. If it exceeds
        #     128, the FORTRAN file will have been written in 128-byte records
        #     with each record being surrounded by its own "leader" and
        #     "trailer" bytes.
        set readLen [expr {$bytes * $count}]
        set thisFormat ""

        # --- format too large, break it down...
        if {$readLen > 128} {
            # --- find the number of <format> width reads that fit into a
            #     128-byte string.
            set fullRec [expr {128 / $bytes}]
            set thisFormat "c1${format}${fullRec}c1"
            set numBytes [expr {($bytes * $fullRec) + 2}]
            while {$readLen > 128} {
                binary scan $binaryData @${offset}$thisFormat leader \
                        thisData trailer
                incr offset $numBytes
                set data [concat $data $thisData]
                incr readLen -128
            }
            set remainder [expr {$readLen / $bytes}]
            binary scan $binaryData @${offset}c1${format}${remainder}c1 \
                    leader thisData trailer
            incr offset [expr {($bytes * $remainder) + 2}]
            set data [concat $data $thisData]
        } else {
            binary scan $binaryData @${offset}c1${format}${count}c1 \
                    leader data trailer
            incr offset [expr {($bytes * $count) + 2}]
        }
    }
    return $offset
}
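For reference, a typical call looks like this (the 512 here is a
made-up offset; ::msio::binaryData has already been loaded with the
raw file contents):
# read 28 doubles starting at (say) byte offset 512; the return
# value is the offset just past the data and its marker bytes
set offset [::msio::binaryScan 512 vals d 28]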
I would do it a bit differently, the following example is rather raw
and would just read a complete file:
proc fortran::usf::readformated {filename format args} {
    set fd [ open $filename r ]
    fconfigure $fd -encoding binary -translation binary
    binary scan [ read $fd 2 ] cc type nextlen
    set nextlen [ expr {$nextlen & 0xff} ] ;# "c" is signed, mask to 0..255
    switch -- $type {
        79 {
            # is an unformatted sequential file
        }
        default {
            puts stderr "Not the Mommy"
            error "wrong filetype"
        }
    }
    while { ![ append buffer [ read $fd $nextlen ] ; feof $fd ] } {
        binary scan [ read $fd 2 ] cc lastlen nextlen
        set nextlen [ expr {$nextlen & 0xff} ]
        if { $nextlen == 130 } break
    }
    close $fd
    # scan the whole payload in one go and hand back the values
    binary scan $buffer $format data
    return $data
}
uwe
> got an example somewhere?
Hi Uwe - thanks for the response.
Hmmmm - unfortunately, I don't have examples that I can share. The
file is a proprietary format, and I don't (at least right now) have
permission to share it...
> I would do it a bit differently, the following example is rather raw
> and would just read a complete file:
I'll need to study this more carefully, but I think it ignores the
important complications - probably due to the lack of a good
explanation on my part. Let me try to add some more details...
There are really two things at work here - the content and format of
the file as dictated by the original FORTRAN application that
wrote it, and then, at a lower level, the format of the file as
dictated by the compiler that was used to build the original FORTRAN
program.
From the standpoint of the application itself, the file is written in
defined "sections", though there can be a variable number of
sections. Though a bit simplified, the following description should
suffice for discussion purposes...
The beginning of the file contains a counter specifying how many
sections are in the file, and then the beginning of each section
contains a counter that specifies the quantity of a given data type
that should be read from that section. The actual data types of each
section are not actually stored in the file, though they are
documented. So, for instance, section number 1 always consists of a
certain number (specified in the section header) of double values.
From the (original) FORTRAN writer application's viewpoint, this is
simple. It just writes a counter and a counter's worth of data of a
specific datatype to the file, per section. The complication is
caused by the fact that when the data is actually written, it's broken
apart (if longer than 128 bytes) and padded with the mentioned pre and
post "length bytes" around each 128-byte chunk. That's documented at
the link mentioned in my previous post.
The breaking down of the data into 128-byte chunks and the padding of
each chunk is not done by the original FORTRAN application code. In
fact, it's not even aware of it. When it reads the file, it also
doesn't need to be aware of the crazy storage format - as the data is
magically rejoined and passed back to the FORTRAN read.
Unfortunately, from Tcl, I *do* need to be aware of the crazy storage
format, and account for it in my read procedure. I need to know that
each chunk is surrounded by the length bytes, and that longer chunks
are broken into 128-byte pieces, each surrounded by their own length
bytes. Without this knowledge, the reader gets lost in the process of
reading the file as there's not much to go by. You just find out how
much data you should read, read the "chunks" of data from the section
(dodging the length byte markers), and assume that you end up pointing
to the header of the next section. If the length byte calculations
are wrong, the process is hopelessly lost.
Hopefully, that makes more sense without being (overly) long-winded...
There are several levels to the reader that I'm currently
trying to optimize, with the current questions revolving around the
lowest one (which reads the data from a single "section" of the
file). I'm just looking for speed anywhere I can find it...
If anyone has additional questions, just ask.
Thanks for any additional input.
Jeff
So the Fortran file is a container for your domain specific file.
My example extracts blindly the payload of the
"Fortran unformatted sequential file" in one walk through.
Which imho is the fastest way to do it.
(Your) next step then would be parsing this single block
to the rules of your domain specific format.
In the example I return the payload as list of $format values
converted in one "rush". My experience is that binary scan works
better on large data items than on repeated nibbles of small
blocks.
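A quick way to check that on your own data would be something like
this (an untested sketch; needs Tcl 8.5 for lrepeat, and the sizes
are made up):
set buf [ binary format f25000 [ lrepeat 25000 1.0 ] ]
# one large scan over the whole buffer ...
puts [ time { binary scan $buf f25000 all } 100 ]
# ... versus many small scans, one float at a time
puts [ time {
    for { set o 0 } { $o < 100000 } { incr o 4 } {
        binary scan $buf @${o}f one
    }
} 100 ]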
uwe
> So the Fortran file is a container for your domain specific file.
>
> My example extracts blindly the payload of the
> "Fortran unformated sequential file" in one walk through.
> Which imho is the fastest way to do it.
Uwe,
Thanks again for the input, but I think one of us is missing the
point. At this point, I'd guess it may be me, so bear with me... ;^)
As mentioned before, I need to study your code in more detail (as I'm
not overly familiar with binary and friends), but I don't see how it
can be properly finding the payload of the file because it doesn't
appear to be dodging the nonsense introduced by the FORTRAN storage
method.
Without bypassing the cruft in the file (the length bytes and the
additional breaks in the data), the reader will never properly locate
the beginning of the next section. At that point, the reading in of
your "nextlen" variable would seem to load garbage from the data
stream, at which point the whole process would be hopelessly lost.
If you're misunderstanding something, I'd guess it's that the
combination of known elements (the data type to read and the quantity
of that data type to read) is not enough to determine how many bytes
need to be read. The other necessary info is how many *additional*
length bytes will have been sprinkled into the data stream, and where
those length bytes are located. Only with that additional information
is it possible to truly locate the file's payload.
Maybe I'm somehow overcomplicating things, but I don't see how the
calculation of the number and location of the length bytes can be
ignored during the read process...
Now, where am I confused?
Thanks,
Jeff
But I think you missed the format: you are blindly reading a block
size, then a block of that size - which is true at the *logical* level
12 <12 byte block> 50 <50 byte block> 200 <200 byte block> 10 <10 byte block>
but at the physical level - if the "block" is more than 128 bytes, then
it is split/padded into 128-byte chunks, so the above 200 byte block
would really be stored in 256 bytes - so your walkthrough would get off
by 56 bytes.
I *think* this is what the OP described - he can correct me if I am wrong ;)
Bruce
Bruce,
Thanks - that's close. I didn't spend much time defining the actual
write format as it's documented in the link given in my original
post. Maybe it deserves a little explanation here also...
While somewhat simplified, this should suffice for our purposes...
(below, "mb" = marker byte and "db" = data bytes)
If the data being written to the file is <= 128 bytes, it's written
like this...
<1 mb> <actual db> <1 mb>
If the data being written to the file is > 128 bytes, it's broken into
128-byte chunks and written like this...
<1 mb> <128 db> <1 mb> <1 mb> <128 db> <1 mb> ... ... <1 mb> <remaining db> <1 mb>
So, each 128-byte chunk is surrounded with a leading and trailing
marker byte. The last chunk, that's likely < 128 bytes, is just
written as is (though with the marker bytes). That is, it is not
padded out to 128 bytes.
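(So, for example, a single 200-byte logical write lands on disk as
1 + 128 + 1 plus 1 + 72 + 1 = 204 bytes - not 256.)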
Jeff
> Now, where am I confused?
I don't know, you may be completely right.
my understanding:
The container format has an
identifier: #76
end marker: #130
payload elements:
<#$len><$len raw bytes of payload><#$len>
with 0 < len <= 128
if you keep track of the prevlen,nextlen pairs
the format can be walked in both directions.
what my script does is
read typemarker,nextlen ( nextlen being the firstlen item)
loop
read firstlen bytes, append to payload
read lastlen,nextlen
break if nextlen == fileendmarker
there is no need imho to keep track of special
properties of the payload. ( the description
is of a true container format )
Having an example file would help.
uwe
Uwe,
I think I've seen the light... ;^)
I just re-read the URL I pointed to in my original post and saw
something that I've either forgotten about since originally coding
this or just never saw in the first place. That is, the marker bytes
actually contain meaningful data (the length of the next read).
Wow - that's a big miss... ;^)
My code doesn't currently make use of the length markers surrounding
each data chunk. It just determines where they are and discards
them. Knowing now that they'll help in navigation of the file, I see
the logic in your sample code.
I've got to run now, but I think you've put me onto a great track.
Sorry it took so much convincing... ;^)
Jeff
> but at the physical level - if the "block" is more than 128 bytes, the
> it is
> split/padded at 128 byte chunks, so the above 200 byte block would
> really be stored
> in 256 bytes - so you walkthrough would get off by 56 bytes.
>
> I *think* this is what the OP described - he can correct me if I am
> wrong ;)
I've read it again and see my misconception.
( But it should not make much of a difference; more tomorrow, it's midnight for me. )
proc fortran::usf::read {filename format args} {
    set fd [ open $filename r ]
    fconfigure $fd -encoding binary -translation binary
    binary scan [ read $fd 2 ] cc type nextlen
    set nextlen [ expr {$nextlen & 0xff} ] ;# "c" is signed, mask to 0..255
    switch -- $type {
        79 {
            # is an unformatted sequential file
        }
        default {
            puts stderr "Not the Mommy"
            error "wrong filetype"
        }
    }
    while { ![ append buffer [ read $fd $nextlen ] ; feof $fd ] } {
        binary scan [ read $fd 2 ] cc lastlen nextlen
        set nextlen [ expr {$nextlen & 0xff} ]
        if { $nextlen == 130 } break
        if { $nextlen == 129 } {
            incr nextlen -1
        } else {
            lappend ret $buffer
            unset buffer
        }
    }
    close $fd
    return $ret ;# list of items
}
uwe
>
> Bruce
With the replies and sample code provided by Uwe Klein, I've managed
to put together a very simple (and fast) routine that can pick apart
my binary "container" file and produce a clean data stream for later
processing. With that in mind, I've got another question...
The data I'm now processing is just one big byte stream containing
multiple sections which consist of the following information:
<objCount> <objSizeInBytes> <data> <objCount> <objSizeInBytes> <data> ...
It's fairly easy to pick through this data stream using binary scan,
though it looks like I need to keep track of a pointer offset in order
to place each [binary scan] at the proper point in the data stream.
Currently, I'm doing the read of each "section" (objectCount, size,
data) using something like the following:
# read the object count and size, starting at proper offset
binary scan $data @${ptr}ii count wrdl
# increment the offset to account for just-read data
incr ptr [expr {$bytes(i) * 2}]
# read "count" number of given data type, starting at proper offset
binary scan $data @${ptr}f$count segCoords
# increment the offset to account for just-read data
incr ptr [expr {$wrdl * $count}]
While the above is simple enough, I wonder if there's not a better way
to track and/or position the cursor used for reading the data. It'd
be nice if there were a way to tell [binary scan] to leave the cursor
at the end of the last read, in which case all of the above pointer
math would be unnecessary.
Not being overly familiar with [binary] and friends, maybe I'm missing
something simple?
Thanks,
Jeff
>
> Not being overly familiar with [binary] and friends, maybe I'm missing
> something simple?
Have a look at the x X and @ formatspecs to [binary scan]
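Or wrap the scan and the offset bookkeeping in a small helper,
something like this (a rough, untested sketch; "scanAt" is just a
made-up name, and it only knows the fixed-width specs c s i f d):
proc scanAt {buffer offsetVar spec args} {
    upvar 1 $offsetVar offset
    # byte width of one item for each fixed-size format character
    array set width {c 1 s 2 i 4 f 4 d 8}
    # split the spec into format char and count, e.g. "f28" -> f / 28
    if {![regexp {^([csifd])([0-9]*)$} $spec -> fmt cnt]} {
        return -code error "unsupported spec: $spec"
    }
    if {$cnt eq ""} { set cnt 1 }
    # scan in the caller's frame, then advance the caller's offset
    set n [uplevel 1 [list binary scan $buffer @${offset}$spec] $args]
    incr offset [expr {$width($fmt) * $cnt}]
    return $n
}
Then your section read becomes:
set ptr 0
scanAt $data ptr i2 hdr            ;# ptr advances by 8
foreach {count wrdl} $hdr break
scanAt $data ptr f$count segCoords ;# ptr advances by 4 * $count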
>
> Thanks,
>
> Jeff
Hi Uwe,
As you can see in my previous post, I'm currently using the "@" format
spec to position the read pointer for each separate [binary scan]
operation. When needing to position the pointer between separate scan
operations, both "@" and "x" would seem to be the same. One is an
absolute position and the other is an incremental position, but in
this case, they both start at 0, right?
Either way, it seems that I still need to calculate the length of the
previous read "by hand" before I can properly position the cursor for
the next read. I was just hoping there was an easier/faster/cleaner
way to read my data stream.
What I really need is a way to request the absolute position of the
scan cursor after each separate operation, so I can start it there for
the next operation. While tracking the pointer position is fairly
simple, it effectively doubles the amount of code required for this
task, as each scan requires another pointer calculation.
So, is there a better way to determine the pointer position after each
scan, or do I just need to calculate it by hand as the sample code in
my previous post does?
Thanks,
Jeff
Do you operate on a buffer or a stream?
What size are "objlen" and "objcnt"
are the objects of "simple" type i.e. float double longints ..?
my try at this for a buffer would be something like:
set offset 0
set objcnt 0
set objlen 0
set headsize 8 ;# assumed
set objfmt f   ;# simple format
set run 1
while {$run} {
    binary scan $buffer "@${offset}II" objcnt objlen
    incr offset $headsize
    unset -nocomplain obj
    # nonsimple objects:
    for {set i 0} {$i < $objcnt} {incr i ; incr offset $objlen} {
        # do binary scan into a known format or add code
        # for some complex object format.
        binary scan $buffer "@${offset}$objfmt" curobj
        lappend obj $curobj
    }
    # alternate (use instead of the loop above):
    binary scan $buffer "@${offset}$objfmt$objcnt" obj
    # get the offset right
    incr offset [ expr {$objcnt * $objlen} ]
    # alt end
    lappend res [ list $objcnt $objlen $obj ]
    # test for end of buffer ...
    if {$offset >= [string length $buffer]} {
        set run 0
    }
}
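( Note that I wrote big-endian "I" above out of habit; the docs you
linked are for Intel Fortran, so your data is presumably little-endian
and wants lowercase "i" - as your own snippet already uses. )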
uwe
> Do you operate on a buffer or a stream?
A buffer - just a big binary string, extracted from the "container"
file we discussed earlier in this thread. Maybe I used these 2 terms
a little loosely in previous replies...
> What size are "objlen" and "objcnt"
They are just 32-bit ints, properly read using the "i" specifier - so
your "headsize 8" was a good guess... ;^)
> are the objects of "simple" type i.e. float double longints ..?
Yes, each "section" in the file consists of "count" quantity of a
simple data type. The size of the data type is recorded in the file,
but not the actual format. For instance, I can tell from the file
content that an upcoming object is 4-bytes long, but I don't know if
it's and "i" or an "f". I *do* know that information, but only from
external documentation, not from the file's content...
> my try at this for a buffer would be something like:
My current code (a sample of which was shown in my previous post)
looks very much like your "alternate" method, so I may already be on
the right track...
For this type of "chunked" reading, it would be nice if [binary scan]
could (optionally) return the ending cursor position after each read.
That way, I could simply pass that pointer back into the next scan via
the "@" specifier. Currently, it seems that I'll just have to
calculate the next position after each scan - using whatever method is
the simplest.
Does getting an (optional) pointer position back from [binary scan]
make any sense? Would others find that useful?
Thanks again.
Jeff
> Does getting an (optional) pointer position back from [binary scan]
> make any sense? Would others find that useful?
I may be prejudiced but one of my close programming friends is %n.
So : YES
uwe