On 22.08.2015 07:45, Hermann Peifer wrote:
> On 2015-08-22 4:14, Janis Papanagnou wrote:
>> It's 4 a.m. here, and I may be just too tired to see the obvious. Would
>> the correct way to process data with unknown encoding be to *always* set
>> LC_ALL=C (assuming one needs no locale-dependent sorting, or similar)?
>
> Whenever input data encoding doesn't match your locale:
> Either use LC_ALL=C, or gawk -b which according to the manual is an easy way
> to tell gawk: "Hands off my data!". Both options seem to produce the expected
> results [1].
Hmm.. - for specific cases that may be okay, but as a general "solution"?
Changing the locale setting would affect other behaviour as well. Well,
you're probably right that I should use -b then. Yes, I think I'll follow
that path. - I wonder what the trade-offs of using -b (or LC_ALL=C) by
default would be in such cases; length() and the sorting functions
(asort(), asorti()) come to mind.
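To make the length() trade-off concrete, here is a minimal sketch. It assumes a POSIX shell and that "awk" is gawk, as in this thread; the sample string and variable names are made up for illustration. Under LC_ALL=C every byte counts as one "character", so the two-byte UTF-8 encoding of "ä" has length 2. The second call depends on whatever locale is in effect, so I don't claim a fixed result for it.

```shell
# "ä" encoded as UTF-8 is two bytes: 0xC3 0xA4 (octal 303 244).
s=$(printf '\303\244')

# Byte semantics: with LC_ALL=C, length() counts bytes.
bytes=$(printf '%s\n' "$s" | LC_ALL=C awk '{ print length($0) }')
echo "LC_ALL=C: $bytes"

# Locale-dependent semantics: the result depends on the current locale
# (and on the awk in use), so it is printed but not assumed here.
printf '%s\n' "$s" | awk '{ print "current locale:", length($0) }'
```

So switching to byte semantics by default changes what length() means for any non-ASCII data, which is something to keep in mind before making it the default.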
>
> In Gawk source code (builtin.c/do_match()), there is some comment about /*
> byte length */ near the rlength variable, a few lines below [2], whereas the
> usual way is "to do all string processing in terms of characters, not bytes"
> (from the manual).
One part of my question was consistency; wouldn't one expect match()
and substr() to be subject to the same character/byte interpretation?
I.e. an interpretation that works consistently, whether the functions
count bytes or characters, and whatever locale is set.
I faintly recall that this effect has already been discussed (more than
once) in the past. In a quick search I found this response:
Tue Jan 18 17:23:25 2005 Arnold D. Robbins <arn...@skeeve.com>
: Make gawk multibyte aware. This means that index(), length(), substr()
: and match() all work in terms of characters, not bytes.
This clearly suggests that substr() and match() would behave consistently
with "characters". With gawk 4.1.3, it doesn't seem so, though. Hmm..
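A simple way to test the consistency expectation is sketched below. Again it assumes a POSIX shell and that "awk" is gawk; the sample line and the names "line", "check" and "c_result" are my own. The idea: if match() and substr() count in the same unit, then substr($0, RSTART, RLENGTH) must return exactly the text that matched. This is only a probe with well-formed UTF-8 input, not a reproduction of the mismatch seen with data that doesn't fit the locale.

```shell
# Sample line: "äöü" in UTF-8 (3 characters, 6 bytes) followed by ASCII "xyz".
line=$(printf '\303\244\303\266\303\274xyz')

# Print the match position, the match length, and what substr() extracts
# from those two values. A consistent awk prints "xyz" as the third field.
check='{ if (match($0, /xyz/)) print RSTART, RLENGTH, substr($0, RSTART, RLENGTH) }'

# Byte semantics: "xyz" starts at byte 7.
c_result=$(printf '%s\n' "$line" | LC_ALL=C awk "$check")
echo "LC_ALL=C: $c_result"

# Same check in the current locale; RSTART depends on that locale,
# so the output is printed but not assumed here.
printf '%s\n' "$line" | awk "$check"
```

Feeding the same check a line whose encoding does not match the locale would be the interesting case; if the third field then differs from the matched text, match() and substr() disagree about the unit.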
Janis
> [...]