RS value that would never occur in a binary file

Hermann Peifer

Mar 16, 2015, 2:24:05 AM
Hi

I was just wondering if someone had a smart RS value for my gawk script
at hand: a value that would never occur when reading a binary file (say:
png image). In the script, I am using FS = "", so that the whole file is
read into a single record.

In other words, I am looking for the equivalent to using this for text
files: RS = "\0".

Thanks for your time, Hermann

P.S.
As c.l.a is a give & take, here are 3 short lines for people in desperate
need of the first million digits of Pi. Extending the script to get
the first 10 million digits isn't too difficult either.

$ gawk -M -vmillion=1 'BEGIN {
prec = 1e6 * million
PREC = prec * log(10) / log(2)
printf "%." prec "f", atan2(0,-1)
}'

Kenny McCormack

Mar 16, 2015, 6:43:16 AM
In article <me5su4$fd0$1...@news.albasani.net>,
Hermann Peifer <pei...@gmx.eu> wrote:
>Hi
>
>I was just wondering if someone had a smart RS value for my gawk script
>at hand: a value that would never occur when reading a binary file (say:
>png image). In the script, I am using FS = "", so that the whole file is
>read into a single record.
>
>In other words, I am looking for the equivalent to using this for text
>files: RS = "\0".

I'm going to assume that the actual problem here (the X in your XY problem)
is "How do I read a whole file in a single go?". Assuming that to be the
case, I will say that:

1) I recently hit this same problem (in the context of parsing
/proc/xxx/status files)

2) I found this (readfile.awk), tucked away in the GAWK distribution,
which seems to fit the bill. I used it in my script like this:

split(readfile("/proc/" pid "/status"),A,"\n")

--- Cut Here ---
# readfile.awk --- read an entire file at once
#
# Original idea by Denis Shirokov, cosm...@gmail.com, April 2013
#

function readfile(file, tmp, save_rs)
{
save_rs = RS
RS = "^$"
getline tmp < file
close(file)
RS = save_rs

return tmp
}
--- Cut Here ---
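
A minimal sketch of how the function above might be used on a binary file, assuming gawk is started with -b so that length() counts bytes rather than characters. The wrapper name file_bytes is hypothetical and only there to make the example callable:

```shell
# file_bytes (hypothetical wrapper): print the size in bytes of a file,
# slurped in one go with the readfile() function shown above.
file_bytes() {
    gawk -b '
    function readfile(file,    tmp, save_rs)
    {
        save_rs = RS
        RS = "^$"              # a regexp that can only match empty input
        getline tmp < file
        close(file)
        RS = save_rs

        return tmp
    }
    BEGIN { print length(readfile(ARGV[1])) }' "$1"
}

# Usage: file_bytes image.png
```

Since the program has only a BEGIN rule, gawk never opens the file named on the command line as regular input; readfile() is the only reader.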

>As c.l.a is a give & take, here a 3 short lines for people in desperate
>need of the first million digits of Pi. Expanding the script to getting
>the first 10 million digits isn't too difficult either..
>
>$ gawk -M -vmillion=1 'BEGIN {
> prec = 1e6 * million
> PREC = prec * log(10) / log(2)
> printf "%." prec "f", atan2(0,-1)
>}'

Interesting. I haven't tried it out yet, but note that this depends on
having GMP/MPFR compiled in in your version of GAWK. Correct?

--

There are many self-professed Christians who seem to think that because
they believe in Jesus' sacrifice they can reject Jesus' teachings about
how we should treat others. In this country, they show that they reject
Jesus' teachings by voting for Republicans.

Janis Papanagnou

Mar 16, 2015, 9:15:31 AM
Am 16.03.2015 um 11:43 schrieb Kenny McCormack:
> In article <me5su4$fd0$1...@news.albasani.net>,
> Hermann Peifer <pei...@gmx.eu> wrote:
>
>> As c.l.a is a give & take, here a 3 short lines for people in desperate

You could have put it in its own thread, so that it can later be
found when searching the archives. (I changed the subject for
that purpose, and kept all context intact.)

>> need of the first million digits of Pi. Expanding the script to getting
>> the first 10 million digits isn't too difficult either..
>>
>> $ gawk -M -vmillion=1 'BEGIN {
>> prec = 1e6 * million
>> PREC = prec * log(10) / log(2)
>> printf "%." prec "f", atan2(0,-1)
>> }'
>
> Interesting. I haven't tried it out yet, but note that this depends on
> having GMP/MPFR compiled in in your version of GAWK. Correct?

As far as memory serves, option -M was for that purpose and it
requires the compiled/linked multiple precision libraries, yes.

Janis

Hermann Peifer

Mar 16, 2015, 1:07:14 PM
On 2015-03-16 11:43, Kenny McCormack wrote:
>
> 2) I found this (readfile.awk), tucked away in the GAWK distribution,
> which seems to fit the bill. I used it in my script like this:
>
...
> RS = "^$"
...

This will do. And it is all explained in the manual. Silly me :-(

>> $ gawk -M -vmillion=1 'BEGIN {
>> prec = 1e6 * million
>> PREC = prec * log(10) / log(2)
>> printf "%." prec "f", atan2(0,-1)
>> }'
>
> Interesting. I haven't tried it out yet, but note that this depends on
> having GMP/MPFR compiled in in your version of GAWK. Correct?

Indeed. About -M or --bignum, from the manual:
> Force arbitrary-precision arithmetic on numbers. This option has no
> effect if gawk is not compiled to use the GNU MPFR and MP libraries

The 10 million digits which I generated using the above have the same
md5sum as the digits found on the Internet, e.g. at
http://www.pibel.de/

Hermann

Kaz Kylheku

Mar 16, 2015, 1:30:10 PM
On 2015-03-16, Hermann Peifer <pei...@gmx.eu> wrote:
> Hi
>
> I was just wondering if someone had a smart RS value for my gawk script
> at hand: a value that would never occur when reading a binary file (say:
> png image). In the script, I am using FS = "", so that the whole file is
> read into a single record.

A binary file of N bits can represent any of the 2**N possible bit strings.

There is no substring of M bits, M < N, which may not occur!

A specific binary format may have some assurance that certain bits
or bytes are part of a framing sequence or whatever.

That format may require fairly complicated parsing; and not just simple
scanning for a particular byte. Even if a byte is used as a delimiter,
there may be mechanisms in the file format to allow that same byte to appear in
literal data.

> In other words, I am looking for the equivalent to using this for text
> files: RS = "\0".

Text files may in fact contain null bytes, depending on system. For instance
/proc/<pid>/environ on Linux.

mjc

Mar 16, 2015, 9:17:12 PM
On Sunday, March 15, 2015 at 11:24:05 PM UTC-7, Hermann Peifer wrote:
> Hi
>
> I was just wondering if someone had a smart RS value for my gawk script
> at hand: a value that would never occur when reading a binary file (say:
> png image). In the script, I am using FS = "", so that the whole file is
> read into a single record.
>
> In other words, I am looking for the equivalent to using this for text
> files: RS = "\0".
>
> Thanks for your time, Hermann

I once had to do this exact thing. What I ended up doing is using dd to convert from raw binary to hex characters, then reading and parsing the resulting file.

This was, of course, extremely inefficient, but worked. gawk is my hammer.

Andrew Schorr

Mar 16, 2015, 11:50:32 PM
On Monday, March 16, 2015 at 2:24:05 AM UTC-4, Hermann Peifer wrote:
> I was just wondering if someone had a smart RS value for my gawk script
> at hand: a value that would never occur when reading a binary file (say:
> png image). In the script, I am using FS = "", so that the whole file is
> read into a single record.

The standard readfile extension library offers another solution.

bash-4.2$ gawk -lreadfile 'BEGIN { PROCINFO["readfile"]} {print FILENAME, FNR, NF, length($0)}' /etc/passwd /etc/group
/etc/passwd 1 424 11995
/etc/group 1 211 4148

Of course, the RS trick also works:

bash-4.2$ gawk -v "RS=^$" '{print FILENAME, FNR, NF, length($0)}' /etc/passwd /etc/group
/etc/passwd 1 424 11995
/etc/group 1 211 4148

Regards,
Andy

Hermann Peifer

Mar 17, 2015, 1:24:01 AM
On 2015-03-17 2:17, mjc wrote:
>
> ... using dd to convert from raw binary to hex characters, then ...
>

This was my first thought, too. The second thought was to read the whole
file into a single record, then loop over single-byte fields (using
--characters-as-bytes and FS = "") and eventually convert fields
(byte sequences) into integer and float values:

# Convert any (really: any?) number of bytes to an integer value
function Bytes2Number(bytes, a, i, idx, res) { ... }

# Convert 2, 4, 8 or 16 bytes to a float value, as per IEEE-754
function Bytes2Float(bytes, num, sign, exponent, fraction, res) {...}

My functions are essentially the slightly improved code found at:
http://awk.freeshell.org/ConvertHexToFloatingPoint
http://awk.info/?doc/bitmaps.html

Hermann

Hermann Peifer

Mar 17, 2015, 3:58:01 AM
If I understand things correctly, then "RS=^$" is the smartest RS value
I can get to, in my gawkish context. Both binary and text files will be
read into a single record, which is my goal. The only exception is the
border case of an empty file. As far as I can see from awkvars.out:
"RS=^$" is considered to be found in empty files, as RT's value changes
from "RT: uninitialized scalar" to empty string (RT: ""). Empty files
could be handled separately, if needed.

All of the above is of course nicely explained and documented in the
manual, most likely since the Beginning. I was just too stupid to find
section "10.2.8 Reading A Whole File At Once". Sigh.

http://www.gnu.org/software/gawk/manual/html_node/Readfile-Function.html#Readfile-Function

Thanks again to all for their time,

Hermann

Joe User

Mar 17, 2015, 1:33:29 PM
On Mon, 16 Mar 2015 07:24:04 +0100, Hermann Peifer wrote:

> I was just wondering if someone had a smart RS value for my gawk script
> at hand: a value that would never occur when reading a binary file (say:
> png image). In the script, I am using FS = "", so that the whole file is
> read into a single record.

Remember to use the -b option to gawk.

It's important to keep gawk from looking at locale and encoding, if you
are reading binary files. Otherwise, gawk will try to consider multi-
byte encodings of input records.

--
Extremism in the defense of liberty is no vice;
moderation in the pursuit of justice is no virtue.

-- Barry Goldwater (actually written by
Karl Hess)

Hermann Peifer

Mar 17, 2015, 2:52:18 PM
On 2015-03-17 7:17, Joe User wrote:
> On Mon, 16 Mar 2015 07:24:04 +0100, Hermann Peifer wrote:
>
> Remember to use the -b option to gawk.
>
> It's important to keep gawk from looking at locale and encoding, if you
> are reading binary files. Otherwise, gawk will try to consider multi-
> byte encodings of input records.
>

Thanks for the hint, I was aware. I also use BINMODE = "r", to avoid
silent "\r\n" to "\n" translations on read which otherwise could happen,
at least on some OS, as explained at
https://www.gnu.org/software/gawk/manual/html_node/PC-Using.html#PC-Using

Hermann

Kenny McCormack

Apr 10, 2015, 11:40:51 AM
In article <3572555e-14c4-4244...@googlegroups.com>,
Just out of curiosity, why is there both a "readfile" extension (shared lib)
and a "readfile" include file (GAWK source - the one I posted a while back) ?

--
Religion is regarded by the common people as true,
by the wise as foolish,
and by the rulers as useful.

(Seneca the Younger, 65 AD)
