I am trying to write a function that converts WCHAR_T strings to UTF-8.
After experimenting for a while, I found that I can only convert
standard ASCII characters. When I include an accented vowel (for example), I
always get EILSEQ in errno. I am converting with the
following test function:
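#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

void convTest(void) {
    const wchar_t *str = L"a �tring";
    char *inBuf = (char *)str;  /* iconv() takes char **, even for wide input */
    size_t inBufSize = sizeof(wchar_t) * wcslen(str);
    char *outStart = malloc(1024);  /* keep the start: iconv() advances outBuf */
    char *outBuf = outStart;
    size_t outBufAvailSize = 1024;
    iconv_t ds = iconv_open("UTF-8", "WCHAR_T");
    if (outStart == NULL || ds == (iconv_t)-1) {
        if (ds != (iconv_t)-1) iconv_close(ds);
        free(outStart);
        return;
    }
    size_t converted = iconv(ds, &inBuf, &inBufSize, &outBuf, &outBufAvailSize);
    if (converted == (size_t)-1) {
        if (errno == EILSEQ)
            printf("invalid char sequence");
        else if (errno == EINVAL)
            printf("incomplete input");
        else if (errno == E2BIG)
            printf("not enough space");
    }
    iconv_close(ds);
    free(outStart);
}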
The � character causes an EILSEQ. I am in Portugal with a Portuguese keyboard.
Help !
Are we supposed to know what 'iconv_open' and 'iconv' are?
V
--
Please remove capital 'A's when replying by e-mail
I do not respond to top-posted replies, please don't ask
Ok,
I noticed in a terminal window that the locale was C, and from the
glibc docs learned that at startup the current locale of any C program
is also C.
I then called
<code>setlocale(LC_ALL, "pt_PT.UTF-8");</code>
before calling the function in the previous post and the iconv call succeeded.
But now I am confused. Isn't UTF-8 locale independent ? I was supposing
that UTF-8 contained every possible character and that a conversion
existed between it and wchar_t.
What if in my program I want to decode characters from different locales
than the one on my machine ? From what I've learned from the glibc
docs, the call to setlocale sets the locale machine-wide, so that is
not an option as it would mess up other programs, right ?
How do you deal with this when a single program must handle multiple locales ?
What is the output of running the command "iconv -l" on your computer?
iconv and your locale are not related.
The format of wchar_t characters is "undefined", so you can't depend on
it being anything in particular; all you can rely on is using them with
the C library wide char routines (mbtowc and family).
Many modern platforms use the 4-byte form of Unicode (UCS-4 or
UTF-32). Older platforms use 16-bit wide chars, though modern "16-bit"
platforms now seem to use UTF-16 as the wide character format.
I hope this helps.
The C++ standard does not impose anything on wchar_t so you really need
to know how your system is configured.
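One way to find out how a given system is configured is to ask the compiler directly. The following is only a minimal sketch: `struct wchar_info`, `describe_wchar` and `print_wchar_info` are hypothetical names made up for illustration. It reports the width of wchar_t and whether the implementation defines the standard macro `__STDC_ISO_10646__`, which (when present) indicates that wchar_t values are ISO 10646 (Unicode) code points. It does not tell you what a particular iconv build means by "WCHAR_T".

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper type: what the build can tell us about wchar_t. */
struct wchar_info {
    int bits;         /* width of wchar_t in bits */
    int is_iso10646;  /* 1 if the implementation defines __STDC_ISO_10646__ */
};

static struct wchar_info describe_wchar(void)
{
    struct wchar_info info;
    info.bits = (int)(sizeof(wchar_t) * CHAR_BIT);
#ifdef __STDC_ISO_10646__
    info.is_iso10646 = 1;   /* wchar_t values are ISO 10646 code points */
#else
    info.is_iso10646 = 0;   /* no promise from the implementation */
#endif
    assert(info.bits >= 8);
    return info;
}

static void print_wchar_info(void)
{
    struct wchar_info info = describe_wchar();
    printf("wchar_t is %d bits wide\n", info.bits);
    printf("__STDC_ISO_10646__ is %s\n",
           info.is_iso10646 ? "defined" : "not defined");
}
```

Calling `print_wchar_info()` once at startup is enough to see what you are dealing with. If the macro is not defined, the width alone is only a hint.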
The output is a list of convertible encodings. In that list I didn't
find the string "WCHAR_T" that I pass to iconv_open in my code, but I
am using the C function and the name is on its man page. I
think that using UCS-4 is the same (I am on Mac OS X, which is GNU).
When I set the locale the iconv function is affected as I said in the
previous post. From an answer I got in another mailing list, I was told
that the constant L"a string" is subject to the current locale. That is
why when I change the locale all its characters are considered valid by
iconv.
My next question is: if I receive a string for conversion as a
wchar_t* in a function, how do I know which locale it is in ? If
I get the locale wrong, the conversion will fail.
[...]
> But now I am confused. Isn't UTF-8 locale independent ?
The encoding used in char (and possibly in wchar_t as well) is
determined by the locale. At least for functions which depend
on the locale---I would expect, however, that a function which
takes names of encodings as arguments (although "WCHAR_T" is not
the name of any encoding I'm familiar with) would use those
names, and not the current global locale, to determine the
encoding.
> I was supposing that UTF-8 contained every possible character
> and that a conversion existed between it and wchar_t.
UTF-8 is a Unicode Transformation Format. As such, it has
encodings for all Unicode characters. Certainly not every
possible character. (I could invent a new character tomorrow,
for example.) UTF-8 encodes Unicode in octets.
wchar_t is a type, which has nothing to do with encoding. What
it actually corresponds to, and which encoding the system
libraries use by default with it, is implementation defined.
Typical implementations make it a 16 or 32 bit type, using
UTF-16, UTF-32 or EUC. There is an exact translation between
UTF-8 and UTF-16 or UTF-32, since all are encoding formats for
Unicode. I don't know off hand about EUC.
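To make the "exact translation" concrete, here is a minimal sketch of the UTF-32 to UTF-8 direction for a single code point. The function name `utf8_encode` is hypothetical, chosen for illustration; the bit layout itself is the standard UTF-8 one. It is not a substitute for iconv, just an illustration that the mapping is purely arithmetic and does not depend on any locale.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes one Unicode scalar value as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp is not a valid
   scalar value (a surrogate, or above U+10FFFF). */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes: 11110xxx then 3 x 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, U+00E9 (e with acute accent) encodes to the two bytes C3 A9, and U+20AC (the euro sign) to E2 82 AC.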
> What if in my program I want to decode characters from different
> locales than the one on my machine ? From what I've learned
> from the glibc docs, the call to setlocale sets the locale
> machine-wide, so that is not an option as it would mess up
> other programs, right ?
If you're working in C, or interfacing with a C library,
setlocale can be called with a null pointer to determine the
current global locale, which you can then restore. (Of course,
in a multithreaded environment, you'll have to ensure thread
safety when doing this.) In the case of C++, the standard idiom
is to pass a locale to the function, with possibly a default
argument of the global locale.
> How do you deal with this when a single program must handle
> multiple locales ?
In C++, you can maintain several locales, and pass them around.
In C, you have to read the current locale, and restore it when
finished.
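The C approach of reading the current locale and restoring it afterwards might look roughly like this. It is a sketch only: `with_locale`, `record_locale` and `LOCALE_NAME_MAX` are hypothetical names invented for illustration, and nothing here is thread-safe. Note that the string returned by setlocale has to be copied, because a later setlocale call may overwrite it.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LOCALE_NAME_MAX 64  /* hypothetical buffer size for the demo callback */

/* Runs fn(arg) with the global locale temporarily set to `name`, then
   restores the previous locale. Returns 0 on success, -1 if the locale
   could not be set or memory ran out (fn is not called in that case). */
static int with_locale(const char *name, void (*fn)(void *), void *arg)
{
    const char *current = setlocale(LC_ALL, NULL);  /* query only */
    if (current == NULL)
        return -1;
    char *saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);         /* copy: later calls may clobber it */

    if (setlocale(LC_ALL, name) == NULL) {
        free(saved);                /* locale is unchanged on failure */
        return -1;
    }
    fn(arg);
    setlocale(LC_ALL, saved);       /* restore the previous locale */
    free(saved);
    return 0;
}

/* Demo callback: stores the active LC_CTYPE name in a caller-supplied
   buffer of at least LOCALE_NAME_MAX bytes. */
static void record_locale(void *arg)
{
    const char *active = setlocale(LC_CTYPE, NULL);
    snprintf((char *)arg, LOCALE_NAME_MAX, "%s", active ? active : "");
}
```

A caller would pass something like `with_locale("pt_PT.UTF-8", do_conversion, &state)`, where `do_conversion` and `state` stand for whatever locale-dependent work needs doing. Whether a given locale name exists depends on what is installed on the machine.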
--
James Kanze (GABI Software) email:james...@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
The MacOS X man page for iconv_open does not mention WCHAR_T on the Mac I
have access to. It mentions wchar_t, though:
"Locale dependent, in terms of char or wchar_t (with machine dependent
endianness and alignment, and with semantics depending on the OS and the
current LC_CTYPE locale facet): char, wchar_t"
It seems that the meaning of "wchar_t locale" entirely depends on the
system and maybe on your locale settings. On Windows, for example, it
would be quite certain that wchar_t corresponds to UTF-16LE encoding. On
Mac I have no idea. If sizeof(wchar_t)==4, then it is probably UCS-4. The
"wchar_t locale" is probably meant for making this difference
transparent, so you should not care.
> When I set the locale the iconv function is affected as I said in the
> previous post. From an answer I got in another mailing list, I was
> told that the constant L"a string" is subject to the current locale.
> That is why when I change the locale all its characters are considered
> valid by iconv.
>
> My next question is: if I receive a string for conversion as a
> wchar_t* in a function, how do I know which locale it is in ? If
> I get the locale wrong, the conversion will fail.
>
If wchar_t is always used with the same encoding that the iconv facility
expects for the "wchar_t" locale identifier, then there should be no problem.
Otherwise, the encoding has to be passed as an extra piece of
information. Now when I have written this down after quite some
pondering, this seems very trivial ;-)
hth
Paavo
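As an illustration of passing the encoding as an extra piece of information, here is a minimal sketch built on the POSIX iconv calls already used in this thread. The wrapper name `to_utf8` and its parameter list are hypothetical. The assumption is that the caller knows the source encoding and names it explicitly (for example "UTF-32LE") instead of relying on what "WCHAR_T" or "wchar_t" happens to mean on the current system. Which encoding names are accepted depends on the iconv implementation, and on some systems you may need to link with -liconv.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Converts in_len bytes of `in`, encoded as `from`, to UTF-8 in `out`.
   Returns the number of bytes written (no terminating NUL is added),
   or (size_t)-1 with errno set by iconv_open() or iconv(). */
static size_t to_utf8(const char *from, const void *in, size_t in_len,
                      char *out, size_t out_cap)
{
    assert(from != NULL && in != NULL && out != NULL);

    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;          /* unknown encoding name: errno is EINVAL */

    char *src = (char *)in;         /* iconv() wants char **; data is not modified */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    int saved_errno = errno;        /* iconv_close() may change errno */
    iconv_close(cd);

    if (rc == (size_t)-1) {
        errno = saved_errno;        /* EILSEQ, EINVAL or E2BIG */
        return (size_t)-1;
    }
    return out_cap - dst_left;
}
```

With the input spelled out as UTF-32LE bytes (61 00 00 00, E9 00 00 00), a call such as `to_utf8("UTF-32LE", in, 8, out, sizeof out)` is expected to produce the three bytes 61 C3 A9, independently of the current locale.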