GAWK: Converting a number to a string (the hard way!) - Discuss!

Kenny McCormack

Jan 13, 2022, 8:40:04 AM
I have a situation where I need to convert a number (a 32 bit integer) to a
4 byte string - i.e., the internal representation of that 32 bit number as 4
consecutive bytes. This is so that I can pass (the address of) that string
to a low-level routine that wants (basically) an "int *" value.

I managed to get it working, using the following function:

# This assumes 32 bit ints on a little-endian architecture.
# Call as: str = encode(number)
function encode(n,    i, s) {
    s = sprintf("%c", n)
    for (i = 1; i < 4; i++)
        s = s sprintf("%c", rshift(n, i*8))
    return s
}

This works, but I'm wondering if there is a better/more efficient/cuter way
to do it. Please discuss.

Note, BTW, that I have verified that when you printf with %c, it only uses
the low 8 bits of the number you pass in. So, you don't need to do any
"AND"ing.
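
A quick way to check the four bytes, as a minimal sketch rather than a replacement for the function above: the script runs the same kind of loop on 1094861636 (0x41424344) under LC_ALL=C. It assumes a non-negative input, masks each byte explicitly with `n % 256`, and substitutes `int(n / 256)` for `rshift(n, 8)` so that it should also run in awks without gawk's bit functions. The shell variable `out` is made up for the example.

```shell
# Byte-by-byte encoding, least significant byte first (little-endian order).
# int(n / 256) stands in for rshift(n, 8); this assumes n >= 0.
out=$(LC_ALL=C awk '
function encode(n,    i, s) {
    s = ""
    for (i = 0; i < 4; i++) {
        s = s sprintf("%c", n % 256)   # keep only the low 8 bits, explicitly
        n = int(n / 256)               # drop the byte just emitted
    }
    return s
}
BEGIN { printf "%s", encode(1094861636) }   # 1094861636 == 0x41424344
')
printf '%s\n' "$out"   # prints DCBA
```

The test value has no zero bytes on purpose: a NUL byte would not survive the shell's command substitution, so real binary data is better piped straight to its consumer.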

--
Modern Christian: Someone who can take time out from using Leviticus
to defend homophobia and Exodus to plaster the Ten Commandments on
every school and courthouse to claim that the Old Testament is merely
"ancient laws" that "only applies to Jews".

Janis Papanagnou

Jan 13, 2022, 11:43:59 AM
On 13.01.2022 14:40, Kenny McCormack wrote:
> I have a situation where I need to convert a number (a 32 bit integer) to a
> 4 byte string - i.e., the internal representation of that 32 bit number as 4
> consecutive bytes. This is so that I can pass (the address of) that string
> to a low-level routine that wants (basically) an "int *" value.
>
> I managed to get it working, using the following function:
>
> # This assumes 32 bit ints on a little-endian architecture.
> # Call as: str = encode(number)
> function encode(n,    i, s) {
>     s = sprintf("%c", n)
>     for (i = 1; i < 4; i++)
>         s = s sprintf("%c", rshift(n, i*8))
>     return s
> }
>
> This works, but I'm wondering if there is a better/more efficient/cuter way
> to do it. Please discuss.

Well, the task consists of a few standard data-splitting steps, which
you implemented in a straightforward way. It's basically fine and
minimal, I'd say.

Just one thought one might want to take into consideration...

Recursive counterparts of iterative functions are typically clearer,
since they don't require explicit variables to be defined and assigned.
(And I presume that the function call overhead is insignificant here.)
Such a function may look as simple as

function encode(i, n) {
    if (i > 0) {
        printf("%c", n)
        encode(i-1, rshift(n, 8))
    }
}

and is called with an additional argument indicating the number of
octets e.g., encode(4, 0x41424344) or encode(4, 1094861636) to
produce "DCBA".

To hide function parameters like the "4", one often defines a wrapper
function if there is no need to control the number of octets:

function e(n) { encode(4,n) }

which of course "complicates" the matter again a bit (one may think).

But keeping that parameter also allows fewer function calls in case
you want to extract just, say, 2 or 3 octets from that number, as in
encode(3, 0x41424344), which produces the same result as the call
encode(3, 0x00424344).
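
The octet-count behaviour described above can be tried with a short sketch. It is an illustration only: it assumes non-negative inputs, uses `n % 256` and `int(n / 256)` in place of the implicit truncation and `rshift(n, 8)` so it should run in any POSIX awk, and writes the constants in decimal because hex literals in awk source are not portable. The shell variables `encode_prog` and `res` are made up for the example.

```shell
# Recursive printing variant: emits i octets of n, low byte first.
encode_prog='
function encode(i, n) {
    if (i > 0) {
        printf("%c", n % 256)
        encode(i - 1, int(n / 256))
    }
}
BEGIN {
    encode(4, 1094861636); printf "\n"   # 0x41424344, all four octets
    encode(3, 1094861636); printf "\n"   # only the three low octets
    encode(3, 4342596);    printf "\n"   # 0x00424344, same three octets
}'
res=$(LC_ALL=C awk "$encode_prog")
printf '%s\n' "$res"
```

This prints DCBA followed by DCB twice, since the top octet is never reached when only three are requested.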

Whether the clarity of recursion is "better" or "cuter" certainly
lies in the eye of the beholder. While I have to admit that I rarely
use recursion, I almost always admire these recursive solutions once
I've written them down and see how perfect they are as a concept.

Janis

Janis Papanagnou

Jan 13, 2022, 12:08:49 PM
The function in my previous post just prints the result, but I notice
that the OP wanted it in a string. So here's a recursive variant that
returns the string

function encode(i, n) {
    if (i > 0)
        return sprintf("%c", n) encode(i-1, rshift(n, 8))
}

Or if the reverse octet order is desired, just change the order of
the concatenation

return encode(i-1,rshift(n,8)) sprintf("%c",n)

Note: I omitted the 'i<=0' case since awk seems to create an empty
value as default return value.
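
Both concatenation orders can be compared side by side. The following is a sketch under a few assumptions: non-negative inputs, `n % 256` and `int(n / 256)` instead of `rshift(n, 8)` for portability, and an explicit `return ""` for the base case rather than the default return value. The names `encode_le`, `encode_be` and `strs` are invented for the example.

```shell
strs=$(LC_ALL=C awk '
# Low byte first (little-endian order).
function encode_le(i, n) {
    if (i <= 0) return ""      # explicit base case instead of the default value
    return sprintf("%c", n % 256) encode_le(i - 1, int(n / 256))
}
# High byte first: same recursion, concatenation order swapped.
function encode_be(i, n) {
    if (i <= 0) return ""
    return encode_be(i - 1, int(n / 256)) sprintf("%c", n % 256)
}
BEGIN {
    print encode_le(4, 1094861636)   # 0x41424344
    print encode_be(4, 1094861636)
}')
printf '%s\n' "$strs"
```

The first line of output is DCBA and the second is ABCD.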

Janis Papanagnou

Jan 14, 2022, 2:00:10 AM
On 13.01.2022 14:40, Kenny McCormack wrote:
>
> Note, BTW, that I have verified that when you printf with %c, it only uses
> the low 8 bits of the number you pass in. So, you don't need to do any
> "AND"ing.

I also used that assumption in my code upthread but forgot to point
out that it is not reliable, and in general not even true, because
it depends on the locale that you have set. Just two samples from
a Unix context...

$ printf "%s\n" 65 65601 | LC_ALL=C awk '{printf "%c\n", $0}' | od -c -tx1
0000000   A  \n   A  \n
         41  0a  41  0a

$ printf "%s\n" 65 65601 | LC_ALL=C.UTF-8 awk '{printf "%c\n", $0}' | od -c -tx1
0000000   A  \n 360 220 201 201  \n
         41  0a  f0  90  81  81  0a

So depending on context and requirements the AND'ing might still be
necessary or the locale explicitly adjusted (as in the sample here).
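
As a small sketch of combining both precautions, the command below pins the locale for a single awk invocation and masks the value explicitly, then dumps the resulting byte in hex. The shell variable `masked` is made up for the example, and `n % 256` is used as a portable stand-in for an AND with 0xFF on non-negative values.

```shell
# Pin the locale for this one command and mask explicitly, so that neither
# assumption is left implicit. 65601 is 0x10041; its low byte is 0x41 ("A").
masked=$(LC_ALL=C awk 'BEGIN { printf "%c", 65601 % 256 }' | od -An -tx1 | tr -d ' \n')
printf '%s\n' "$masked"   # prints 41
```

Masking alone probably does not cover byte values 128 to 255 in a UTF-8 locale, where gawk treats the number as a code point and would emit a multibyte sequence, so pinning the locale looks like the safer of the two measures.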

Janis

Kenny McCormack

Jan 14, 2022, 9:29:48 AM
In article <srr71o$ll4$1...@dont-email.me>,
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>On 13.01.2022 14:40, Kenny McCormack wrote:
>>
>> Note, BTW, that I have verified that when you printf with %c, it only uses
>> the low 8 bits of the number you pass in. So, you don't need to do any
>> "AND"ing.
>
>I also used that assumption in my code upthread but forgot to point
>out that this is not reliable or is generally even not true because
>that depends on the locale that you have set. Just two samples from
>a Unix context...

I get it, but I am not too concerned about it. Since this method already
assumes 32 bits and little-endian, I would just add to the list of
assumptions: "No goofy locale settings". I.e., it works in the C locale.

In fact, on almost all of my machines, I put code in my startup files to
unset any locale related environment variables and/or set them to just "C".
Makes life a lot more predictable.

BTW(1), this is sort of the genesis of this thread. I was looking for a more
straightforward way to do it - that wouldn't depend on so many simplifying
assumptions in order to work. Seems there ought to be a simpler way to
just put 4 bytes into a string. That's what I was hoping for...

BTW(2), TAWK has this covered - there are functions "pack" and "unpack"
specifically for this sort of thing - packing values into (and unpacking
out of) strings that act as structs that you pass to low-level routines.
Of course, the fact that TAWK directly supports access to low-level
routines obliges it to provide these functionalities. Native GAWK does not
(yet) provide access to low-level stuff. The dialect of GAWK that I
program in, does.

Of course, I could make this whole problem go away by writing yet another
extension lib to do it - but I was trying to avoid doing that.

--
The randomly chosen signature file that would have appeared here is more than 4
lines long. As such, it violates one or more Usenet RFCs. In order to remain
in compliance with said RFCs, the actual sig can be found at the following URL:
http://user.xmission.com/~gazelle/Sigs/Infallibility

Janis Papanagnou

Jan 14, 2022, 8:26:02 PM
On 14.01.2022 15:29, Kenny McCormack wrote:
>
> I get it, but I am not too concerned about it. Since this method already
> assumes 32 bits and little-endian, I would just add to the list of
> assumptions: "No goofy locale settings". I.e., it works in the C locale.

Fair enough. For others here it may still be worth keeping in mind,
to avoid surprises.

>
> Of course, I could make this whole problem go away by writing yet another
> extension lib to do it - but I was trying to avoid doing that.

And that (with GNU Awk) would be the way to go.

Janis

Kpop 2GM

Jan 17, 2022, 5:39:43 AM
If you want to make it consistent regardless of locale settings, add a large multiple of 256 that is itself greater than 0x10FFFF (here 8^7 = 0x200000):

% LC_ALL="UTF-8" gawk -e 'BEGIN { printf("%c",65601+8^7) }' | od -baxco
0000000 101
A
0041
A
000101
0000001

% LC_ALL="UTF-8" gawk -e 'BEGIN { printf("%c",65601) }' | od -baxco
0000000 360 220 201 201
? 90 81 81
90f0 8181
360 220 201 201
110360 100601
0000004
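
The arithmetic behind that offset can be checked without relying on any locale. This is only a sketch of the numbers involved, not of gawk's internal behaviour: it confirms that 8^7 lies above 0x10FFFF (1114111, the highest Unicode code point), is a multiple of 256, and therefore leaves the low byte of 65601 unchanged. The names `offset` and `chk` are invented for the example.

```shell
chk=$(awk 'BEGIN {
    offset = 8^7                        # 2097152, i.e. 0x200000
    print offset
    print (offset > 1114111)            # 1114111 == 0x10FFFF; prints 1 for true
    print (offset % 256)                # 0: a multiple of 256 keeps the low byte
    print ((65601 + offset) % 256)      # 65, the code for "A"
}')
printf '%s\n' "$chk"
```

The four output lines are 2097152, 1, 0 and 65.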