printf/scanf for UNICODE?

Ken Turkowski

May 28, 2005, 5:17:48 AM

Are there printf and scanf functions for Unicode?
--
comp.lang.c.moderated - moderation address: cl...@plethora.net -- you must
have an appropriate newsgroups line in your header for your mail to be seen,
or the newsgroup name in square brackets in the subject line. Sorry.

Roger Leigh

May 29, 2005, 1:03:08 AM

Ken Turkowski <tu...@worldserver.com> writes:

> Are there printf and scanf functions for Unicode?

This (very general) answer will necessarily have to be compiler- and
platform-dependent. You'll need to check how your implementation
provides these facilities.

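First of all, what do you mean by "Unicode"? The Unicode and
ISO-10646 standards define a Universal Character Set (UCS). ISO-10646
originally specified a 31-bit code space, but both standards now
restrict code points to the range U+0000 to U+10FFFF. There are
various encodings of this. UTF-8, UTF-16 and UTF-32 are in common
use; UTF-8 and UTF-16 are variable-length encodings, while UTF-32
stores every code point in a single 32-bit unit. Which are you using?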
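
To make "variable-length" concrete, here is a minimal sketch of how one
code point maps to between one and four UTF-8 bytes. The name
utf8_encode is made up for this illustration; it is not a C library
function, and it only performs the bit layout defined by the encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the C library): encode one code point
   as UTF-8. Writes 1 to 4 bytes to out and returns the count, or 0 if cp
   is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* outside the UCS range */
}
```

For instance, U+20AC (the euro sign) comes out as the three bytes
E2 82 AC, while plain ASCII characters stay one byte each.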

UTF-8 uses 8-bit code units, so it is representable with the normal
"char" type, and the usual printf/scanf functions should be
sufficient as long as you remember that they work in bytes (field
widths and precisions count bytes, not characters). It's in common
use on Linux and UNIX systems. UTF-16 is used on modern Microsoft
systems, and UTF-32 on both. UTF-16 and UTF-32 code units do not fit
in an 8-bit "char", and so there is a "wide character" type,
"wchar_t", available for this purpose. "wchar_t" is not specifically
intended for UCS, but is useful for all character sets wider than 8
bits. On GNU systems, wchar_t is 32 bits, but on MS Windows it's only
16 bits, so a single wchar_t there cannot hold every UCS-4/UTF-32
code point: anything above U+FFFF needs a surrogate pair.
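
The 16-bit limitation can be sketched as follows: a code point above
U+FFFF does not fit in one 16-bit unit, so UTF-16 splits it into two.
The helper name utf16_encode is hypothetical, and it uses uint16_t
rather than wchar_t so that it behaves the same on every platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16.
   Returns the number of 16-bit units written (1 or 2), or 0 if cp is a
   surrogate or lies above U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t) cp;           /* fits in a single unit */
        return 1;
    }
    cp -= 0x10000;                        /* 20 bits remain */
    out[0] = (uint16_t) (0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t) (0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

With a 32-bit wchar_t the same code point is a single element, which
is why code that assumes "one wchar_t is one character" may behave
differently on the two platforms.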

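When using wchar_t, you would normally use the "w" variants of printf
and scanf (wprintf, fwprintf, wscanf and so on). A stream becomes
wide-oriented on its first wide I/O call, or you can set the
orientation explicitly with fwide() before any other I/O on it. This
will allow direct output and input of wide characters.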
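
As a rough sketch of that sequence, the helper below sets a stream's
orientation and then writes a wide string to it. The name
write_wide_line is invented for this example; fwide() and fwprintf()
are the standard <wchar.h> functions.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: make fp wide-oriented (if it has no orientation
   yet) and write text followed by a newline. Returns the number of wide
   characters written, or -1 if the stream is already byte-oriented or
   the write fails. */
int write_wide_line(FILE *fp, const wchar_t *text)
{
    assert(fp != NULL && text != NULL);

    /* fwide() cannot change an orientation that is already set; a result
       <= 0 means the stream is not wide-oriented. */
    if (fwide(fp, 1) <= 0)
        return -1;

    int written = fwprintf(fp, L"%ls\n", text);
    return written < 0 ? -1 : written;
}
```

The orientation check matters because a stream that has already been
used with printf-style (byte) output stays byte-oriented, and mixing
the two families on one stream is not supported by the C standard.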

Now let's have a look at a specific implementation, as an example:
We will use a current Debian GNU/Linux system with GCC 3.4.3. On this
system the default /input encoding/ of source code is UTF-8. Whatever
the source code encoding, it will be converted to UTF-8 prior to being
compiled. This has two implications:

* "Narrow" (char) strings are /always/ encoded as UTF-8
* "Wide" (wchar_t, 32 bits on GNU) strings are /always/ encoded as
UCS-4 (UTF-32)

An additional point is that on GNU, the locale system causes the
output to be converted into the user's locale charmap. This means
that if the user has a UTF-8 locale, narrow strings are not converted,
and wide strings are converted to UTF-8 on output. If the locale uses
e.g. ISO-8859-2, both will be converted to this charmap.
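
To see which charmap your output will be converted to, something like
the following sketch may help. It assumes a POSIX system:
nl_langinfo() and <langinfo.h> are POSIX, not ISO C, so this is not
available everywhere (MS Windows, for one). The function names are made
up for the example.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: adopt the user's locale and return the name of
   its charmap, e.g. "UTF-8" or "ISO-8859-2". If setlocale() fails, the
   previous locale (by default "C") stays in effect. */
const char *user_charmap(void)
{
    setlocale(LC_ALL, "");
    return nl_langinfo(CODESET);
}

/* Print the charmap to the given stream. */
void report_charmap(FILE *out)
{
    assert(out != NULL);
    fprintf(out, "Locale charmap: %s\n", user_charmap());
}
```

Calling report_charmap(stdout) under different LANG or LC_ALL settings
shows the encoding that wide strings will be converted to on output.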

The upshot is that on GNU, you can use the wide and narrow strings
basically interchangeably. You can also output narrow strings to wide
streams (with %s), and wide strings to narrow streams (with %ls).
This means that you use whichever internal (narrow/wide)
representation you find appropriate, and whichever stream width you
find appropriate. However, this flexibility may not extend to other
systems: you'll need to test.

Lastly, here's a simple example for you to try. It's UTF-8 encoded,
and tested under GNU/Linux. It should run on any system with working
UCS and wide character support.

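#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    /* Adopt the user's locale so output is converted to its charmap. */
    setlocale(LC_ALL, "");

    const char *narrow = "Test Unicode (narrow): ïàý Ноя けたいと願う!\n";
    fprintf(stdout, "%s\n", narrow);

    /* Dump the bytes of the narrow (UTF-8) string. */
    fprintf(stdout, "Narrow bytes:\n");
    const unsigned char *nbytes = (const unsigned char *) narrow;
    for (size_t i = 0; i < strlen(narrow); ++i)
        fprintf(stdout, "%3zu: %02X\n", i, (unsigned int) nbytes[i]);

    /* stdout stays byte-oriented; stderr becomes wide-oriented. */
    if (fwide(stderr, 1) <= 0)
        fprintf(stdout, "Failed to set stderr to wide orientation\n");

    const wchar_t *wide = L"Test Unicode (wide): ïàý Ноя けたいと願う!\n";
    fwprintf(stderr, L"\n%ls\n", wide);

    /* %s on a wide stream converts a multibyte string to wide characters. */
    fwprintf(stderr, L"\nNarrow-to-wide: %s\n", narrow);

    /* %ls on a byte stream converts a wide string to multibyte. */
    fprintf(stdout, "\nWide-to-narrow: %ls\n", wide);

    /* Dump the in-memory bytes of the wide string (byte order depends on
       the machine's endianness). */
    fprintf(stdout, "Wide bytes:\n");
    const unsigned char *wbytes = (const unsigned char *) wide;
    for (size_t i = 0; i < wcslen(wide) * sizeof(wchar_t); ++i)
        fprintf(stdout, "%3zu: %02X\n", i, (unsigned int) wbytes[i]);

    return 0;
}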

Regards,
Roger

--
Roger Leigh
Printing on GNU/Linux? http://gimp-print.sourceforge.net/
Debian GNU/Linux http://www.debian.org/
GPG Public Key: 0x25BFB848. Please sign and encrypt your mail.

Brian Inglis

May 29, 2005, 1:04:34 AM

On Sat, 28 May 2005 09:17:48 -0000 in comp.lang.c.moderated, Ken
Turkowski <tu...@worldserver.com> wrote:

>Are there printf and scanf functions for Unicode?

There are multibyte to/from wide char conversion functions and wide
char variants of the printf and scanf family functions. An
implementation of these may support Unicode.
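
As an illustration of the conversion functions, here is a minimal
sketch built on the standard mbsrtowcs() and wcsrtombs(). The wrapper
names to_wide and to_multibyte are invented for this example. The
conversion uses whatever encoding the current locale specifies, so
whether non-ASCII text converts depends on the locale in effect.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical wrapper: convert a multibyte string (in the current
   locale's encoding) to a newly allocated wide string. Returns NULL on
   an invalid sequence or allocation failure. The caller frees it. */
wchar_t *to_wide(const char *mb)
{
    assert(mb != NULL);

    mbstate_t state;
    const char *src = mb;
    memset(&state, 0, sizeof state);
    size_t n = mbsrtowcs(NULL, &src, 0, &state);   /* count only */
    if (n == (size_t) -1)
        return NULL;

    wchar_t *out = malloc((n + 1) * sizeof *out);
    if (out == NULL)
        return NULL;
    src = mb;
    memset(&state, 0, sizeof state);
    mbsrtowcs(out, &src, n + 1, &state);           /* includes the L'\0' */
    return out;
}

/* Hypothetical wrapper: the reverse conversion, wide to multibyte. */
char *to_multibyte(const wchar_t *w)
{
    assert(w != NULL);

    mbstate_t state;
    const wchar_t *src = w;
    memset(&state, 0, sizeof state);
    size_t n = wcsrtombs(NULL, &src, 0, &state);   /* count only */
    if (n == (size_t) -1)
        return NULL;

    char *out = malloc(n + 1);
    if (out == NULL)
        return NULL;
    src = w;
    memset(&state, 0, sizeof state);
    wcsrtombs(out, &src, n + 1, &state);           /* includes the '\0' */
    return out;
}
```

A program would typically call setlocale(LC_ALL, "") first; without
it the "C" locale applies, which is only guaranteed to handle the basic
character set.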

--
Thanks. Take care, Brian Inglis Calgary, Alberta, Canada

Brian....@CSi.com (Brian[dot]Inglis{at}SystematicSW[dot]ab[dot]ca)
fake address use address above to reply