gawk - interpretation of hexadecimal values in strings

Janis Papanagnou

Apr 29, 2016, 3:39:50 AM

You can define characters using hexadecimal codes, e.g. "\x40" for "@".

$ awk 'BEGIN { printf "#\x40field#\n" }' | od -tx1 -c

0000000 23 0f 69 65 6c 64 23 0a
# 017 i e l d # \n

Obviously "f" is taken as part of the number, and the "4" in "\x40f" is
completely ignored. (Looks like a bug to me, BTW.)

$ awk 'BEGIN { printf "#\x40\field#\n" }' | od -tx1 -c

0000000 23 40 0c 69 65 6c 64 23 0a
# @ \f i e l d # \n

We see that escaping the "f" does not help in this case; "\f" already
has a meaning of its own (form feed).

$ awk 'BEGIN { printf "#%cfield#\n", "\x40" }' | od -tx1 -c

0000000 23 40 66 69 65 6c 64 23 0a
# @ f i e l d # \n

This is the expected result, but if I want to use hex escapes in strings
without printf, say, sub(/ABC/,"\x40field"), that again won't work.
I can of course use sprintf and string concatenation, but that seems
quite bulky (and inefficient) for what is meant to be a compile-time
constant. The best option I see is splitting the string into parts and
concatenating them:

$ awk 'BEGIN { printf "#" "\x40" "field#\n" }' | od -tx1 -c

Or are there any better ways (something done within the string) that I
may have missed?
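
For the sub() case mentioned above, the split-string approach might look like the following sketch. The input "xABCy" and the wrapper name replace_abc are made up for illustration, and it assumes an awk that accepts the non-POSIX \x escape, as gawk does.

```shell
# Hypothetical wrapper: replace ABC with "@field" in the given text.
replace_abc() {
    # "\x40" ends at its closing quote, so the "f" of "field"
    # cannot be absorbed into the hex escape.
    printf '%s\n' "$1" | awk '{ sub(/ABC/, "\x40" "field"); print }'
}

replace_abc 'xABCy'
```

With gawk this should print x@fieldy, regardless of how many hex digits the \x parser is willing to read.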

Janis

Anton Treuenfels

Apr 29, 2016, 9:15:48 AM

"Janis Papanagnou" <janis_pa...@hotmail.com> wrote in message
news:nfv345$m53$1...@news.m-online.net...
> You can define characters using hexadecimal codes, e.g. "\x40" for "@".
>
> $ awk 'BEGIN { printf "#\x40field#\n" }' | od -tx1 -c
>
> 0000000 23 0f 69 65 6c 64 23 0a
> # 017 i e l d # \n
>
> Obviously "f" is taken as part of the number, and the "4" in "\x40f" is
> completely ignored. (Looks like a bug to me, BTW.)

Without commenting on the entire issue, I'm not sure that this behavior is a
bug. I've written code to interpret this kind of escape sequence in a string
constant, and I made it accept the longest part that looks like a hex number
but only convert at most the last two characters. As far as I can tell, that
is what is supposed to happen.
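
The "longest run, last two digits" rule described above can be sketched in awk itself. This is only an illustration of the arithmetic, not code from gawk or from any other implementation; the names last_two_hex and hexval are invented here.

```shell
# Hypothetical helper: the byte value a "\x" escape would get under the
# "read every hex digit, keep the last two" rule.
last_two_hex() {
    awk -v s="$1" '
    function hexval(str,    i, n, d) {
        n = 0
        for (i = 1; i <= length(str); i++) {
            d = index("0123456789abcdef", tolower(substr(str, i, 1))) - 1
            if (d < 0) break            # stop at the first non-hex character
            n = (n * 16 + d) % 256      # keep only the low 8 bits
        }
        return n
    }
    BEGIN { printf "%02x\n", hexval(s) }'
}

last_two_hex 40f    # prints 0f, the byte seen in the od output
```

Reducing modulo 256 at each step is equivalent to keeping only the last two hex digits, which is why "\x40f" ends up as 0x0f rather than 0x40 followed by "f".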

There is another escape I've sometimes seen, "\u", which converts four
characters at a time - "u" probably stands for "unsigned" or "unicode".

Ed Morton

Apr 29, 2016, 1:00:43 PM

On 4/29/2016 2:39 AM, Janis Papanagnou wrote:
> You can define characters using hexadecimal codes, e.g. "\x40" for "@".

Never use hex codes, use octal instead. See
http://awk.freeshell.org/PrintASingleQuote
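
As a minimal sketch of the octal alternative: an octal escape consumes at most three octal digits, so the text that follows it is never absorbed. The wrapper name octal_demo is invented for this example.

```shell
# Hypothetical wrapper around two octal-escape examples.
octal_demo() {
    # \100 is "@" and \047 is a single quote.
    awk 'BEGIN { printf "#\100field#\n"; print "\047quoted\047" }'
}

octal_demo
```

This should print #@field# on the first line and the word quoted wrapped in single quotes on the second.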

Ed.

Kenny McCormack

Apr 29, 2016, 1:35:31 PM

In article <ng03pa$kqp$1...@dont-email.me>,
Why? Octal is just like so, 1970s, man!

Unix was originally developed on a PDP11. I used to use a PDP11.
I haven't used octal since...

Personally, I do think it should work as Janis expects it to - that is, the
first two (exactly 2, neither more nor less) hex digits after the "\x"
should be used - and it should generate exactly one 8 bit character.

But then again, I suppose what this thread is really about is i18n, and the
fact that, in 2016, you can no longer assume that characters and bytes are
the same thing. If you are using a locale where characters require, say,
16 bits, then you'd want \xDEAD to do something reasonable (generate a
single character).

BTW, what is the value of the C macro CHAR_BIT on such a system?

--
The problem in US politics today is that it is no longer a Right/Left
thing, or a Conservative/Liberal thing, or even a Republican/Democrat
thing, but rather an Insane/not-Insane thing.

(And no, there's no way you can spin this into any confusion about
who's who...)

Andrew Schorr

Apr 30, 2016, 2:00:34 PM

On Friday, April 29, 2016 at 3:39:50 AM UTC-4, Janis Papanagnou wrote:
> You can define characters using hexadecimal codes, e.g. "\x40" for "@".
>
> $ awk 'BEGIN { printf "#\x40field#\n" }' | od -tx1 -c
>
> 0000000 23 0f 69 65 6c 64 23 0a
> # 017 i e l d # \n
>

I believe this is a bug. Using the gawk master development branch, this does
not happen:

bash-4.2$ ./gawk 'BEGIN { printf "#\x40field#\n" }' | od -tx1 -c
0000000 23 40 66 69 65 6c 64 23 0a
# @ f i e l d # \n
0000011

I think the fix was committed on 2014-08-20, but has not made it into the stable release:

2014-08-20 Arnold D. Robbins

* node.c (parse_escape): Max of 2 digits after \x.

Unfortunately, the recently released gawk-4.1.3f beta does not seem to fix this issue.
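
The "max of 2 digits" rule from the ChangeLog entry above can be illustrated with a small awk sketch. It only mimics the rule and is not the code in node.c; decode_x and decode are hypothetical names.

```shell
# Hypothetical helper: decode the text that follows "\x",
# consuming at most two hex digits.
decode_x() {
    awk -v s="$1" '
    function decode(str,    i, n, d) {
        n = 0
        for (i = 1; i <= 2; i++) {
            d = index("0123456789abcdef", tolower(substr(str, i, 1))) - 1
            if (d < 0) break            # fewer than two hex digits
            n = n * 16 + d
        }
        # Emit the decoded character, then the untouched remainder.
        return sprintf("%c", n) substr(str, i)
    }
    BEGIN { print "#" decode(s) }'
}

decode_x '40field#'    # prints #@field#
```

Under this rule the "f" of "field" stays ordinary text, matching the master-branch output shown above.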

Regards,
Andy

Andrew Schorr

Apr 30, 2016, 2:08:14 PM

On Saturday, April 30, 2016 at 2:00:34 PM UTC-4, Andrew Schorr wrote:
> Unfortunately, the recently released gawk-4.1.3f beta does not seem to fix this issue.

Also, please note the lint warning:

bash-4.2$ ./gawk --lint 'BEGIN { printf "#\x40field#\n" }' | od -tx1 -c
gawk: cmd. line:1: warning: POSIX does not allow `\x' escapes
gawk: cmd. line:1: warning: hex escape \x40f of 3 characters probably not interpreted the way you expect
0000000 23 0f 69 65 6c 64 23 0a
# 017 i e l d # \n
0000010

Regards,
Andy

Janis Papanagnou

Apr 30, 2016, 3:03:25 PM

On 30.04.2016 20:08, Andrew Schorr wrote:
> On Saturday, April 30, 2016 at 2:00:34 PM UTC-4, Andrew Schorr wrote:
>> Unfortunately, the recently released gawk-4.1.3f beta does not seem to fix this issue.
>
> Also, please note the lint warning:

Yes, I know it's non-standard.
But I haven't used lint on that simple test case. Thanks for the pointer!
Glad to hear you have an eye on that effect and probably a fix. Thanks again!

Janis

Aharon Robbins

Apr 30, 2016, 10:57:17 PM

In article <6c733a99-2a82-4755...@googlegroups.com>,
Andrew Schorr <asc...@telemetry-investments.com> wrote:
>On Friday, April 29, 2016 at 3:39:50 AM UTC-4, Janis Papanagnou wrote:
>> You can define characters using hexadecimal codes, e.g. "\x40" for "@".
>>
>> $ awk 'BEGIN { printf "#\x40field#\n" }' | od -tx1 -c
>>
>> 0000000 23 0f 69 65 6c 64 23 0a
>> # 017 i e l d # \n
>>
>
>I believe this is a bug.

The current behavior was intentional, based on my understanding at the
time of how C did it, which was to read as many hex characters as
followed the \x but effectively end up using only the last two.

>I think the fix was committed on 2014-08-20, but has not made it into
>the stable release:
>
>2014-08-20 Arnold D. Robbins
>
> * node.c (parse_escape): Max of 2 digits after \x.

Yes, I made the change.

>Unfortunately, the recently released gawk-4.1.3f beta does not seem to
>fix this issue.

It's only on the master branch, on purpose, since it's a behavioral
change.

In the meantime:

| $ awk 'BEGIN { printf "#\x40" "field#\n" }' | od -tx1 -c

is a workaround.
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com