
[gawk] printf does not recognize .PREC if locale is en_US.UTF-8


tczy

Dec 31, 2009, 6:32:39 AM
to bug-...@gnu.org
*** ISSUE AND HOW TO REPRODUCE (IT): ***

echo nothing | awk '{printf "%.3s", "foobar"}'

produces 'foobar' if LC_ALL is en_US.UTF-8. Other variations of the same
program (with awk 'BEGIN{printf ...', etc.) produce the same. If LC_ALL
is set to C, everything is fine.

awk '{a=sprintf("%.3s", "foobar"); print a}'

also has this issue.

IRC reports 3.1.5 working well with UTF locale.

*** SYSTEM INFO ***

% gawk --version
GNU Awk 3.1.7

% uname -a
Linux sidep.ath.cx 2.6.31-ARCH #1 SMP PREEMPT Tue Nov 10 19:01:40 CET 2009 x86_64 Intel(R) Core(TM)2 Duo CPU T5670 @ 1.80GHz GenuineIntel GNU/Linux

Also GLibC 2.11.1.

--
PGP/GnuPG keyID: D7AE1B98
Key signature: 9D08 F20B C9CA 4174 5968 C0DE 9751 20F1 D7AE 1B98

Aharon Robbins

Jan 1, 2010, 4:46:32 AM
to bug-...@gnu.org, c...@wre.ath.cx
Greetings. Re: this:

> Date: Thu, 31 Dec 2009 13:32:39 +0200
> From: tczy <c...@wre.ath.cx>
> To: bug-...@gnu.org
> Subject: [gawk] printf does not recognize .PREC if locale is en_US.UTF-8


>
> *** ISSUE AND HOW TO REPRODUCE (IT): ***
>
> echo nothing | awk '{printf "%.3s", "foobar"}'
>
> produces 'foobar' if LC_ALL is en_US.UTF-8. Other variations of the same
> program (with awk 'BEGIN{printf ...', etc.) produce the same. If LC_ALL
> is set to C, everything is fine.
>
> awk '{a=sprintf("%.3s", "foobar"); print a}'
>
> also has this issue.
>
> IRC reports 3.1.5 working well with UTF locale.
>
> *** SYSTEM INFO ***
>
> % gawk --version
> GNU Awk 3.1.7
>
> % uname -a
> Linux sidep.ath.cx 2.6.31-ARCH #1 SMP PREEMPT Tue Nov 10 19:01:40 CET 2009 x86_64 Intel(R) Core(TM)2 Duo CPU T5670 @ 1.80GHz GenuineIntel GNU/Linux
>
> Also GLibC 2.11.1.

It is indeed a bug. Dealing with multibyte characters in general has been
a continuing source of pain. Attached is a patch. It will wend its way
into the Savannah CVS shortly.

Happy New Year!

Arnold
---------------------------------------------------------------------------------
Fri Jan 1 11:41:50 2010 Arnold D. Robbins <arn...@skeeve.com>

	* builtin.c (format_tree): At pr_tail, remember to take the precision
	into account when determining how many characters to copy out.
	Thanks to tczy <c...@wre.ath.cx> for the bug report.

Index: builtin.c
===================================================================
RCS file: /d/mongo/cvsrep/gawk-stable/builtin.c,v
retrieving revision 1.38
diff -u -r1.38 builtin.c
--- builtin.c 21 Nov 2009 21:16:50 -0000 1.38
+++ builtin.c 1 Jan 2010 09:40:49 -0000
@@ -1223,9 +1223,18 @@
 		if (fw == 0 && ! have_prec)
 			;
 		else if (gawk_mb_cur_max > 1 && (cs1 == 's' || cs1 == 'c')) {
+			int nchars_needed = 0;
+
 			assert(cp == arg->stptr || cp == cpbuf);
-			copy_count = mbc_byte_count(arg->stptr,
-					cs1 == 's' ? arg->stlen : 1);
+
+			if (cs1 == 'c')
+				nchars_needed = 1;
+			else if (have_prec)
+				nchars_needed = prec;
+			else
+				nchars_needed = arg->stlen;
+
+			copy_count = mbc_byte_count(arg->stptr, nchars_needed);
 		}
 		bchunk(cp, copy_count);
 		while (fw > prec) {

