BEGIN {print length("Südwestwind")} ## length=12 - ???
BEGIN {print length("Sudwestwind")} ## length=11
I'd expect a length of 11 in both cases. The gawk length() function returns
the number of bytes. Any ideas how to return the number of characters instead?
BTW, there's a similar issue with split() and probably with other string
functions as well.
Janis
My understanding is that if you set your locale to a UTF-8 locale, then gawk
does the right thing:
$ LC_ALL=C awk 'BEGIN {print length("Südwestwind")}'
12
$ LC_ALL=en_GB.utf8 awk 'BEGIN {print length("Südwestwind")}'
11
And I thought I had tried that without success.
Hmm.. - it works. I must have done something wrong tonight.
Thanks!
Janis
In my experience, gawk's string functions and FIELDWIDTHS work as expected, as long as your locale and the data encoding are in sync.
gawk's printf also counts characters rather than bytes, e.g.:
$ LC_ALL=en_US.UTF-8 gawk 'BEGIN {printf "|%-12s|\n", "Südwestwind"}'
|Südwestwind |
Hermann