Problem with substr() after match() with non-ASCII characters

Janis Papanagnou

Aug 21, 2015, 10:14:32 PM

I observe a problem with non-ASCII characters in data files if a matched
substring is referenced; those two functions, substr() and match(), seem
to not cooperate consistently. (Tried with gawk 4.1.3; testcase below.)

Note that this effect can be fixed by calling it as: LC_ALL=C awk '...'
It seems that the non-ASCII chars form an illegal UTF-8 encoding in my
UTF-8 environment. But I wonder why the RSTART/RLENGTH variables that are
set by match() can not be used in substr() in that case. Do those two
functions use (sort of) a different metric system? (Looks strange to me;
or is that a bug?)

It's 4 a.m. here, and I may be just too tired to see the obvious. Would
the correct way to process data with unknown encoding be to *always* set
LC_ALL=C (assuming one needs no locale-dependent sorting, or similar)?

Janis

$ awk '
match($0,/:deathdate=2007....:/) { print substr($0,RSTART+11,RLENGTH-16) }
' f.out

2007
0703
2007
0071

$ file f.out
f.out: Non-ISO extended-ASCII text

$ cat f.out
missile:deathdate=20070306:
P<94>rr<94>:deathdate=20070306:
wizard:deathdate=20071103:
Daithí:deathdate=20071103:

$ od -t x1 -c f.out
0000000  6d  69  73  73  69  6c  65  3a  64  65  61  74  68  64  61  74
          m   i   s   s   i   l   e   :   d   e   a   t   h   d   a   t
0000020  65  3d  32  30  30  37  30  33  30  36  3a  0a  50  94  72  72
          e   =   2   0   0   7   0   3   0   6   :  \n   P 224   r   r
0000040  94  3a  64  65  61  74  68  64  61  74  65  3d  32  30  30  37
        224   :   d   e   a   t   h   d   a   t   e   =   2   0   0   7
0000060  30  33  30  36  3a  0a  77  69  7a  61  72  64  3a  64  65  61
          0   3   0   6   :  \n   w   i   z   a   r   d   :   d   e   a
0000100  74  68  64  61  74  65  3d  32  30  30  37  31  31  30  33  3a
          t   h   d   a   t   e   =   2   0   0   7   1   1   0   3   :
0000120  0a  44  61  69  74  68  ed  3a  64  65  61  74  68  64  61  74
         \n   D   a   i   t   h 355   :   d   e   a   t   h   d   a   t
0000140  65  3d  32  30  30  37  31  31  30  33  3a  0a
          e   =   2   0   0   7   1   1   0   3   :  \n
0000154
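
For readers who want to reproduce the test case, the data file can be rebuilt from the od dump above. The following is a minimal sketch, assuming a POSIX shell whose printf accepts octal escapes; \224 and \355 are the octal forms of the bytes 0x94 and 0xed. Running match() under LC_ALL=C shows the byte-based RSTART and RLENGTH values, which can serve as a reference when comparing against a UTF-8 locale. The variable name offsets is illustrative only.

```shell
# Recreate f.out byte for byte (108 bytes, as in the od dump).
printf 'missile:deathdate=20070306:\nP\224rr\224:deathdate=20070306:\nwizard:deathdate=20071103:\nDaith\355:deathdate=20071103:\n' > f.out

# Print the offsets that match() reports when every byte counts as one character.
offsets=$(LC_ALL=C awk 'match($0, /:deathdate=2007....:/) { print RSTART, RLENGTH }' f.out)
printf '%s\n' "$offsets"
```

Under LC_ALL=C this prints 8 20, 6 20, 7 20 and 7 20, one pair per line. Running the same awk command without LC_ALL=C in a UTF-8 locale shows how the reported offsets change on a given system.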

Hermann Peifer

Aug 22, 2015, 1:45:20 AM

On 2015-08-22 4:14, Janis Papanagnou wrote:
> It's 4 a.m. here, and I may be just too tired to see the obvious. Would
> the correct way to process data with unknown encoding be to *always* set
> LC_ALL=C (assuming one needs no locale-dependent sorting, or similar)?
>
> Janis
>

Whenever the input data encoding doesn't match your locale, either use
LC_ALL=C or gawk -b, which according to the manual is an easy way to tell
gawk: "Hands off my data!". Both options seem to produce the expected
results [1].

In the gawk source code (do_match() in builtin.c), there is a comment
/* byte length */ next to the rlength variable, a few lines below [2],
whereas the usual way is "to do all string processing in terms of
characters, not bytes" (from the manual).

Hermann

[1]

$ LC_ALL=C gawk 'match($0,/:deathdate=2007....:/) {
print substr($0,RSTART+11,RLENGTH-16) }' f.out
2007
2007
2007
2007

$ gawk -b 'match($0,/:deathdate=2007....:/) {
print substr($0,RSTART+11,RLENGTH-16) }' f.out
2007
2007
2007
2007

[2]
http://git.savannah.gnu.org/cgit/gawk.git/tree/builtin.c?h=gawk-4.1-stable#n2446
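
A further workaround, not proposed in the thread, is to avoid RSTART arithmetic altogether. The sketch below assumes the record layout name:deathdate=YYYYMMDD: seen in f.out and splits each line on ':' and '='. substr() is then applied only to the date field, which is plain ASCII, so byte and character offsets coincide there. The test data is recreated first so that the snippet is self-contained; the variable name years is illustrative only.

```shell
# Recreate the test data (same bytes as in the od dump of f.out).
printf 'missile:deathdate=20070306:\nP\224rr\224:deathdate=20070306:\nwizard:deathdate=20071103:\nDaith\355:deathdate=20071103:\n' > f.out

# Split on ':' and '=' so that $2 is the key and $3 the date,
# then take the year from the ASCII-only date field.
years=$(awk -F'[:=]' '$2 == "deathdate" && $3 ~ /^2007....$/ { print substr($3, 1, 4) }' f.out)
printf '%s\n' "$years"
```

This only works because the delimiters and the extracted field are ASCII; it is not a general answer to the consistency question.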

Janis Papanagnou

Aug 22, 2015, 8:32:45 AM

On 22.08.2015 07:45, Hermann Peifer wrote:
> On 2015-08-22 4:14, Janis Papanagnou wrote:
>> It's 4 a.m. here, and I may be just too tired to see the obvious. Would
>> the correct way to process data with unknown encoding be to *always* set
>> LC_ALL=C (assuming one needs no locale-dependent sorting, or similar)?
>

> Whenever input data encoding doesn't match your locale:
> Either use LC_ALL=C, or gawk -b which according to the manual is an easy way
> to tell gawk: "Hands off my data!". Both options seem to produce the expected
> results [1].

Hmm, for specific cases that may be okay, but as a general "solution"?
Changing the locale setting would affect other behaviour as well. Well,
you're probably right that I should use -b then. Yes, I think I'll follow
that path. I wonder what the trade-offs of using -b (or LC_ALL=C) by
default would be in such cases; length() and the sorting functions come to
my mind.

>
> In Gawk source code (builtin.c/do_match()), there is some comment about /*
> byte length */ near the rlength variable, a few lines below [2], whereas the
> usual way is "to do all string processing in terms of characters, not bytes"
> (from the manual).

One part of my question was consistency; wouldn't one expect that match()
and substr() would be subject to the same character/byte interpretation?
I.e. an interpretation that works consistently whether bytes or characters
are defined or whatever locale was set.

I faintly seem to recall this effect might have already been discussed
(more than once) in the past. In a quick search I found this response:

Tue Jan 18 17:23:25 2005 Arnold D. Robbins <arn...@skeeve.com>
: Make gawk multibyte aware. This means that index(), length(), substr()
: and match() all work in terms of characters, not bytes.

This clearly suggests that substr() and match() would behave consistently
with "characters". With gawk 4.1.3, it doesn't seem so, though. Hmm..

Janis

> [...]
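
On the trade-off question about length() raised above, a small sketch shows what changes when bytes are counted instead of characters. It uses the name Daithí encoded as valid UTF-8 (the í is the byte pair 0xC3 0xAD, written here as octal escapes). The second call runs in whatever locale the shell has, so its result depends on the awk implementation and the locale and is deliberately left unstated. The variable names name, bytes and chars are illustrative only.

```shell
# "Daithí" as valid UTF-8: 'D','a','i','t','h' plus the two bytes 0xC3 0xAD.
name=$(printf 'Daith\303\255')

# With LC_ALL=C every byte counts as one character.
bytes=$(printf '%s\n' "$name" | LC_ALL=C awk '{ print length($0) }')
echo "length() under LC_ALL=C: $bytes"

# In the current locale the result depends on the awk and the locale in use.
chars=$(printf '%s\n' "$name" | awk '{ print length($0) }')
echo "length() in the current locale: $chars"
```

Under LC_ALL=C the count is 7, the number of bytes. A multibyte-aware awk in a UTF-8 locale would be expected to report 6 characters instead, which is the kind of difference to keep in mind before making -b or LC_ALL=C the default.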

Hermann Peifer

Aug 22, 2015, 10:41:42 AM

On 2015-08-22 14:32, Janis Papanagnou wrote:
>
> One part of my question was consistency; wouldn't one expect that match()
> and substr() would be subject to the same character/byte interpretation?
> I.e. an interpretation that works consistently whether bytes or characters
> are defined or whatever locale was set.
>

My gawk experience is that for many years, processing UTF-8 encoded
data in a UTF-8 locale has worked properly, in characters: all string
functions I use, and also FIELDWIDTHS and printf. I just never use match().
Maybe you should report the issue to bug-gawk.

Hermann

Aharon Robbins

Aug 22, 2015, 2:49:26 PM

In article <mra1n5$cc9$1...@news.albasani.net>,

Indeed. With program and data files *as attachments*.
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com

Janis Papanagnou

Aug 22, 2015, 4:35:39 PM

Okay. Done.

Janis

Janis Papanagnou

Aug 22, 2015, 7:23:35 PM

On 22.08.2015 23:47, Martin Neitzel wrote:
> JP> It seems that the non-ASCII chars form an illegal UTF-8 encoding in my
> JP> UTF-8 environment.
>
> This is indeed the case.
>
> JP> set by match() can not be used in substr() in that case. Do those two
> JP> functions use (sort of) a different metric system? (Looks strange to me;
> JP> or is that a bug?)
>
> Match() and substr() do use the same metric system: POSIX requires
> all awk string functions to measure in (LC_CTYPE-dependent) characters
> (not in bytes). However, it is your job to provide well-formed input
> with respect to the LC_CTYPE you tell awk to use (be it implicitly or
> explicitly). If you don't, don't be surprised about funny results, as
> happened to you. I wouldn't file this as a bug.

(Well, I did file it, since I was asked to do so by the gawk maintainer.)

And I understand your point. Still, I'm not sure. If I could provide the
data in a well-defined format that would be fine, but I can't. The file
carries no markers and there is no meta-information provided. But for the
given case it's also not necessary. I match a well-defined portion and
want to extract that portion. A match is detected and pointers returned.
And the return code of match() is neither 0 nor -1, but it is equal to
RSTART.

My point is that even if the data is of the form

trash trash trash sensible-data trash trash trash

if I match() the "sensible-data" *correctly*, then I'd expect to get
the pointers RSTART and RLENGTH back *correctly* as well, so that
another function, substr(), can use them consistently.

If that consistency can't be guaranteed, shouldn't match() then already
bail out with an error? (While that's not satisfying, it would at least
not silently produce hard-to-detect errors.)

Janis

>
> [...]
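
For data of unknown encoding, one option is to test up front whether a file is well-formed UTF-8 and fall back to byte semantics if it is not, rather than relying on match() to complain. The sketch below assumes that the current locale is a UTF-8 one, that the iconv utility is available, and that iconv exits with a non-zero status on invalid input. The variable names prog, mode and result are illustrative only. If iconv is missing, the check fails and the byte-wise branch is taken as well.

```shell
# Recreate the test data (same bytes as in the od dump of f.out).
printf 'missile:deathdate=20070306:\nP\224rr\224:deathdate=20070306:\nwizard:deathdate=20071103:\nDaith\355:deathdate=20071103:\n' > f.out

# The awk program from the original post.
prog='match($0, /:deathdate=2007....:/) { print substr($0, RSTART+11, RLENGTH-16) }'

# Converting UTF-8 to UTF-8 succeeds only if the input is valid UTF-8.
if iconv -f UTF-8 -t UTF-8 f.out >/dev/null 2>&1; then
    mode=locale
    result=$(awk "$prog" f.out)            # well-formed: use the locale as is
else
    mode=bytes
    result=$(LC_ALL=C awk "$prog" f.out)   # malformed: treat the data as bytes
fi
printf '%s\n%s\n' "$mode" "$result"
```

For f.out the byte-wise branch is chosen, because 0x94 and 0xed do not form valid UTF-8 sequences in this file. With gawk, the -b option mentioned earlier in the thread could be used in place of LC_ALL=C in that branch.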