
How to get GAWK to work with UTF-8 for Windows

Marc de Bourget

Feb 8, 2016, 4:33:10 PM
I'd like to use GAWK with UTF-8 on Windows.

My source awk file is encoded in UTF-8 with the content:
BEGIN {
print length("Céline")
}

However, the result is 7 instead of 6.
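The 7 is the UTF-8 byte count of the string: "é" is encoded as the two bytes 0xC3 0xA9. A minimal sketch that reproduces the number, assuming a POSIX shell (for example Cygwin bash) with awk on the PATH; the helper name `byte_length` is hypothetical, and forcing `LC_ALL=C` is only a way to make awk count bytes as the native Windows build does here:

```sh
# byte_length (hypothetical helper): print the length of its argument
# as seen by an awk running in the byte-oriented C locale.
byte_length() {
    printf '%s\n' "$1" | LC_ALL=C awk '{ print length($0) }'
}

# "Céline" written as explicit UTF-8 bytes: é is 0xC3 0xA9
# (octal 303 251), so the string has 6 characters but 7 bytes.
celine=$(printf 'C\303\251line')
byte_length "$celine"    # prints 7
```
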

I have read that a locale "en_US.UTF-8" has to be set but it seems the locale en_US.UTF-8 can't be set for Windows:
http://stackoverflow.com/questions/4324542/what-is-the-windows-equivalent-for-en-us-utf-8-locale

Am I doing something wrong or does GAWK not work with UTF-8 for Windows?

Manuel Collado

Feb 9, 2016, 11:06:54 AM
Which gawk Windows binary are you using?

The Cygwin port behaves like Linux. No problem setting a utf-8 locale.

MinGW-based gawk ports probably require a native Windows encoding. There
are Windows ports of the "iconv" utility that can convert from one
encoding to another. For instance:

http://sourceforge.net/projects/mingw/files/MinGW/Base/libiconv/

Hope this helps.
--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado

Marc de Bourget

Feb 9, 2016, 11:33:15 AM
Thank you Manuel! I use this version:
http://sourceforge.net/projects/ezwinports/files/gawk-4.1.3-w32-bin.zip/download

I'm not sure if I have completely understood: How to set utf-8 locale? I don't want to use iconv, actually I want native UTF-8 access with AWK.

Manuel Collado

Feb 10, 2016, 3:40:07 AM
On 09/02/2016 17:33, Marc de Bourget wrote:
> Thank you Manuel! I use this version:
> http://sourceforge.net/projects/ezwinports/files/gawk-4.1.3-w32-bin.zip/download

Ok. This is the recommended native Windows port.

>
> I'm not sure if I have completely understood: How to set utf-8 locale? I don't want to use iconv, actually I want native UTF-8 access with AWK.

I'm afraid the only way to achieve what you want is to install Cygwin
and use the gawk tool it provides. You could also use its bash shell
instead of the Windows cmd.exe one.

Regards,

Marc de Bourget

Feb 10, 2016, 4:33:25 AM
Thank you, Manuel.
@GAWK developers: Can you fix this for the official GAWK Windows port, please?

Aharon Robbins

Feb 10, 2016, 3:30:58 PM
In article <915d0012-d678-4e55...@googlegroups.com>,
Marc de Bourget <marcde...@gmail.com> wrote:
>Thank you, Manuel.
>@GAWK developers: Can you fix this for the official GAWK Windows port, please?

Not sure what "this" is.

In any case, please send an explicit description of the problem you're
having and what you'd like to happen to the bug-...@gnu.org which is
where such issues are dealt with.

I don't do Windows, so I'll let my Windows developer answer once you
submit a report. See

http://www.gnu.org/software/gawk/manual/html_node/Bugs.html#Bugs

for instructions on filing a bug report.
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com

Marc de Bourget

Feb 10, 2016, 4:34:59 PM
The problem is that GAWK for Windows counts bytes instead of characters.
Céline has 6 characters but 7 bytes due to the multibyte character "é".

The length function for the string "Céline" should result in 6 but it is 7.
Using gawk for Windows with UTF-8 produces wrong results for at least the functions length, substr, index, match, split("Céline", CHARS, "").

Maybe your GAWK for Windows developer can simply compile an additional version
with UTF-8 encoding (if I understand Manuel correctly, this may be a solution).
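Until a build handles UTF-8 itself, a byte-oriented awk can still count characters by ignoring UTF-8 continuation bytes (0x80 to 0xBF). This is a workaround sketch, not something proposed in the thread: the helper name `utf8_length` is hypothetical, it assumes valid UTF-8 input and a POSIX shell with awk, and it only covers `length`; `substr`, `index`, `match` and `split` would need similar byte-aware handling.

```sh
# utf8_length (hypothetical helper): count UTF-8 characters with a
# byte-oriented awk by dropping continuation bytes (0x80-0xBF, octal
# 200-277) before measuring. Assumes the input is valid UTF-8.
utf8_length() {
    printf '%s\n' "$1" | LC_ALL=C awk '{
        t = $0
        gsub(/[\200-\277]/, "", t)   # remove continuation bytes
        print length(t)              # one remaining byte per character
    }'
}

utf8_length "$(printf 'C\303\251line')"    # prints 6
```

The same `gsub` line can be placed in an awk function inside a script, so it does not depend on the shell wrapper.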

Janis Papanagnou

Feb 10, 2016, 5:09:03 PM
Manuel suggested installing Cygwin to run gawk in a Unix-like environment,
AFAIR.

Janis

Marc de Bourget

Feb 10, 2016, 5:29:42 PM
Thank you, Janis.
Aharon recommended using the GAWK Windows version by Eli Zaretskii.
He stated "This is the Windows port that I recommend".
I prefer to use Eli's GAWK version.

Janis Papanagnou

Feb 10, 2016, 5:55:46 PM
On 10.02.2016 23:29, Marc de Bourget wrote:
> Thank you, Janis.

[ Please quote context so that we know what you were referring to. ]

> Aharon recommended using the GAWK Windows version by Eli Zaretskii.

In your posting you were referring to your presumed opinion of Manuel.

> He stated "This is the Windows port that I recommend".
> I prefer to use Eli's GAWK version.

Your choice. Then do what Aharon suggested; and file an official report.
They will then decide what's possible to support on the Windows platform.

Janis

Aharon Robbins

Feb 10, 2016, 11:17:05 PM
In article <f0297886-9d89-4d75...@googlegroups.com>,
Marc de Bourget <marcde...@gmail.com> wrote:
I will modify my recommendation to be "use Eli's version if you don't
want to (or can't) use Cygwin". I myself use Cygwin when working
on Windows systems.

Aharon Robbins

Feb 10, 2016, 11:18:19 PM
In article <1ee9b74e-06de-47a3...@googlegroups.com>,
Marc de Bourget <marcde...@gmail.com> wrote:
Once again. You won't get any official response about this until
you file a bug report.

Thanks,

Arnold

Marc de Bourget

Feb 11, 2016, 4:46:15 AM
Thank you all. No need for a bug report.
I've figured out how to set the locale for Windows:
You have to set LC_ALL as an environment variable.

a.) Either permanently as described here:
http://www.nextofwindows.com/how-to-addedit-environment-variables-in-windows-7

b.) or as part of a DOS batch:
SET LC_ALL=en_US.UTF-8
gawk -f celine.awk

(Strange, I was sure I had tested the last one before starting this post.)
Now length of "Céline" is 6. I have to test a bit more but I think it works.
If so, IMHO Eli's version is the best for Windows (even better than Cygwin).

One last question, please:
Is this native UTF-8 support or kind of ICONV mapping?

Kenny McCormack

Feb 11, 2016, 5:16:01 AM
In article <9fb14ff0-9cff-47e2...@googlegroups.com>,
Marc de Bourget <marcde...@gmail.com> wrote:
...
>If so, IMHO Eli's version is the best for Windows (even better than Cygwin).

I'm curious about this. Why do you say that?

(Given that two noted AWK authorities, namely A. Robbins and myself, say
that Cygwin is best, ...)

--
Debating creationists on the topic of evolution is rather like trying to
play chess with a pigeon --- it knocks the pieces over, craps on the
board, and flies back to its flock to claim victory.

Marc de Bourget

Feb 11, 2016, 5:24:29 AM
Oh dear!
I accidentally tested with an ANSI file instead of a UTF-8 file!
With UTF-8, the length of Céline is still 7.

Maybe I'll install Cygwin for testing how it works there.
If so, what's next? Is GAWK already part of Cygwin?

@Kenny: Actually, you are right. I don't know Cygwin (probably I'll know it soon) but usually I prefer native software instead of simulations of other operating systems. I have to use AWK scripts inside DOS batches, so Cygwin is no option. But you are right, I can't say Eli's version is better because I don't know the Cygwin version yet.

Unfortunately I have had to adjust the header to the original issue :-)

Marc de Bourget

Feb 11, 2016, 6:47:46 AM
I've posted the bug report to GNU.

Luuk

Feb 13, 2016, 7:08:20 AM
C:\>CHCP 437
Active code page: 437

C:\>echo Céline | gawk "{ print length($0), $0, length(\"Céline\"); }"
7 C,line 6

C:\>chcp 65001
Active code page: 65001

C:\>echo Céline | gawk "{ print length($0), $0, length(\"Céline\"); }"
8 Céline 6

C:\>chcp 437
Active code page: 437

C:\>

(GAWK 4.1.3 on Windows 10)

BartC

Feb 13, 2016, 10:52:02 AM
I used 4.1.3 on Windows 7 and I get a result of 6 when the source code
is encoded as ANSI.

Switching the current code page to UTF-8 (using chcp 65001 at a command
prompt), it still assumed the input was ANSI.

If I use UTF-8 for the source code, then it grumbles at the
byte-order-mark at the beginning of the file. If I get rid of that, then
I get a result of 7 (é is a 2-byte UTF-8 sequence), no matter what the
code-page is.

It looks like this version doesn't know anything about UTF-8!
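The byte-order mark that gawk grumbles about is the three bytes 0xEF 0xBB 0xBF at the start of the file. If the editor cannot save without it, it can be removed afterwards. A minimal sketch, assuming a POSIX shell with awk; the helper name `strip_bom` and the file names in the usage comment are placeholders:

```sh
# strip_bom (hypothetical helper): print a file without a leading UTF-8
# byte-order mark (bytes 0xEF 0xBB 0xBF, octal 357 273 277).
strip_bom() {
    LC_ALL=C awk 'NR == 1 { sub(/^\357\273\277/, "") } { print }' "$1"
}

# Usage (file names are placeholders):
#   strip_bom celine-bom.awk > celine.awk
```

This only removes the mark; it does not change how gawk counts the remaining multibyte characters.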

--
bartc

Marc de Bourget

Feb 15, 2016, 9:14:28 AM
Yes, GNU has confirmed that a UTF-8 locale can't be set on Windows.

David Thompson

Feb 21, 2016, 12:02:05 AM
On Sat, 13 Feb 2016 13:08:23 +0100, Luuk <lu...@invalid.lan> wrote:

> C:\>CHCP 437
> Active code page: 437
>
> C:\>echo Céline | gawk "{ print length($0), $0, length(\"Céline\"); }"
> 7 C,line 6
>
> C:\>chcp 65001
> Active code page: 65001
>
> C:\>echo Céline | gawk "{ print length($0), $0, length(\"Céline\"); }"
> 8 Céline 6
>
> C:\>chcp 437
> Active code page: 437
>
> C:\>
>
> (GAWK 4.1.3 on Windows 10)

echo in Windows CMD includes trailing space(s) so the gawk input is:
C 0x82 l i n e SP for 'OEM'
or C 0xc3 0xa9 l i n e SP for UTF8.
And first output is
7=length($0) SP=OFS C 0x82 l i n e SP=data SP=OFS 6=length(literal)
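
The byte counts above can be checked by feeding the same bytes to a byte-oriented awk. A minimal sketch, assuming a POSIX shell with awk; the helper name `count_bytes` is hypothetical, `printf` stands in for cmd.exe's `echo`, and a single trailing space is assumed:

```sh
# count_bytes (hypothetical helper): print the byte length of the whole
# record and of its first field, using the byte-oriented C locale.
count_bytes() {
    LC_ALL=C awk '{ print length($0), length($1) }'
}

# é is the single byte 0x82 (octal 202) in code page 437 and the two
# bytes 0xC3 0xA9 (octal 303 251) in UTF-8. Each line ends in one space.
printf 'C\202line \n'     | count_bytes    # prints 7 6  (code page 437)
printf 'C\303\251line \n' | count_bytes    # prints 8 7  (code page 65001)
```

Measuring `$1` instead of `$0` drops the trailing space, because default field splitting discards surrounding whitespace.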