UTF-8 string length() in gawk

Janis Papanagnou

Mar 28, 2010, 11:51:00 PM
How do we correctly handle UTF-8 string length() in gawk if characters are
used that require more than one byte in UTF-8 encoding?

BEGIN {print length("Südwestwind")} ## length=12 - ???
BEGIN {print length("Sudwestwind")} ## length=11

I'd expect a length of 11 in both cases. The gawk length() function returns
the number of bytes. Any ideas how to return the number of characters instead?

BTW, there's a similar issue with split() and probably with other string
functions as well.

Janis

pk

Mar 29, 2010, 5:10:04 AM
Janis Papanagnou wrote:

My understanding is that if you set your locale to a UTF-8 locale, then gawk
does the right thing:

$ LC_ALL=C awk 'BEGIN {print length("Südwestwind")}'
12
$ LC_ALL=en_GB.utf8 awk 'BEGIN {print length("Südwestwind")}'
11
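
One thing worth checking, since it is easy to trip over: the locale name has to be one that is actually installed on the system. As far as I know, when the requested locale cannot be set, awk stays in the C locale and length() counts bytes again. A minimal sketch of such a check follows. The helper names is_utf8_locale and awk_len are made up for illustration, and the character count of 11 assumes that awk is gawk, as in the commands above.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are not part of gawk.

# Succeed only if locale $1 can be set and its character map is UTF-8.
# `locale charmap` reports the codeset of the locale in effect.
is_utf8_locale() {
    [ "$(LC_ALL=$1 locale charmap 2>/dev/null)" = "UTF-8" ]
}

# Print length() of string $1 as awk computes it under locale $2.
awk_len() {
    LC_ALL=$2 awk -v s="$1" 'BEGIN { print length(s) }'
}

word='Südwestwind'
awk_len "$word" C                    # C locale counts bytes
if is_utf8_locale en_GB.utf8; then
    awk_len "$word" en_GB.utf8       # characters, assuming awk is gawk
else
    echo "en_GB.utf8 is not available; see 'locale -a' for installed names" >&2
fi
```

If the else branch fires, picking one of the UTF-8 names listed by `locale -a` and passing it to awk_len instead should give the character count.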

Janis Papanagnou

Mar 29, 2010, 5:58:42 AM

And I thought I had tried that without success.
Hmm... it works. I must have done something wrong tonight.

Thanks!

Janis

Hermann Peifer

Mar 29, 2010, 7:30:40 AM

From my experience, I can confirm that gawk's string functions and FIELDWIDTHS work as expected, as long as your locale and the data encoding are in sync.
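
"In sync" here means that the bytes in the data really are UTF-8 when the locale says UTF-8. A rough way to check that before blaming gawk is sketched below, using iconv. The helper name is_valid_utf8 and the guess that legacy data is ISO-8859-1 are assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical helper: succeed if standard input decodes cleanly as UTF-8.
# iconv exits with a non-zero status when it meets an invalid input sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# "Südwestwind" with the u-umlaut as the two UTF-8 bytes 0xC3 0xBC.
printf 'S\303\274dwestwind\n' | is_valid_utf8 && echo "UTF-8 ok"

# The same word in ISO-8859-1, where the u-umlaut is the single byte 0xFC.
if ! printf 'S\374dwestwind\n' | is_valid_utf8; then
    echo "not UTF-8, converting"
    # Assumes the data is ISO-8859-1; adjust -f to the real source encoding.
    printf 'S\374dwestwind\n' | iconv -f ISO-8859-1 -t UTF-8
fi
```

Converting the data once, rather than switching locales per file, keeps the gawk invocation the same for all inputs.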

And also, gawk's printf counts characters, rather than bytes, e.g.:

$ LC_ALL=en_US.UTF-8 gawk 'BEGIN {printf "|%-12s|\n", "Südwestwind"}'
|Südwestwind |
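
The single space before the closing bar is the padding: %-12s left-justifies the string in a field of 12, and in a UTF-8 locale gawk sees 11 characters. For contrast, here is a small sketch of the same call in the C locale, where the string is 12 bytes and therefore already fills the field. The function name pad12 is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print string $1 left-justified in a 12-wide field,
# as awk's printf formats it under locale $2.
pad12() {
    LC_ALL=$2 awk -v s="$1" 'BEGIN { printf "|%-12s|\n", s }'
}

# C locale: the 12 bytes fill the field, so no padding is added.
pad12 'Südwestwind' C
# Plain ASCII: 11 bytes, so one space of padding is added.
pad12 'Sudwestwind' C
```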

Hermann
