Unexplained behavior related to converters

MARC LORTET

Feb 23, 2026, 11:49:48 AM

to icu-support

Hello,

My program uses "ucnv_setDefaultName", which has no effect on the default converter because "U_CHARSET_IS_UTF8" is set to 1.
I found the cause after some searching.
"U_CHARSET_IS_UTF8" is not set in "utypes.h", as the documentation indicates, but in "/usr/include/unicode/platform.h".

[ICU v50 (CentOS 7)] <platform.h>

#ifdef U_CHARSET_IS_UTF8
#elif U_PLATFORM == U_PF_ANDROID || U_PLATFORM_IS_DARWIN_BASED
#define U_CHARSET_IS_UTF8 1
#else
#define U_CHARSET_IS_UTF8 0
#endif

[ICU 67.1 (RHEL 9)] <platform.h>

#ifdef U_CHARSET_IS_UTF8
#elif U_PLATFORM_IS_LINUX_BASED || U_PLATFORM_IS_DARWIN_BASED || U_PLATFORM == U_PF_EMSCRIPTEN
#define U_CHARSET_IS_UTF8 1
#else
#define U_CHARSET_IS_UTF8 0
#endif

To verify this easily, I used the following code:

#if U_CHARSET_IS_UTF8
std::cout << "U_CHARSET_IS_UTF8 = 1" << "\r\n";
#else
std::cout << "U_CHARSET_IS_UTF8 = 0" << "\r\n";
#endif

And indeed, on CentOS 7 this prints "U_CHARSET_IS_UTF8 = 0", while on RHEL 9 it prints "U_CHARSET_IS_UTF8 = 1"!
Now that I understand the problem, I want to get the same behavior as with ICU v50.

Question:
What am I missing with ICU and converters?
I am looking for an equivalent of "U_ICU_NAMESPACE::UnicodeString vStr(__mValue);" which works using a converter (source code below).
Any help would be greatly appreciated.

I've read all the ICU documentation for converters and studied the release notes from v50 to v67.1, but my various attempts have been unsuccessful.
I've tried unsuccessfully with both "Single-String" and "Convenience" conversions (https://unicode-org.github.io/icu/userguide/conversion/converters.html#usage-model).
The same compiler, "g++ (GCC) 12.3.1 20230926 (for GNAT Pro 24.2 20240606)", is used.
I receive a label as a string of characters encoded as signed bytes (-128 to 127).
I can't change this.
The character 'é' (&eacute; in HTML) has a decimal value of 233 in the extended ASCII table (ISO-8859-1), so as a signed byte its value is 233 - 256 = -23.
In the UTF-8 table (http://charset.org/utf-8), the character 'é' has a value of C3 A9, or -61 -87 in signed bytes.
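To make the signed-byte arithmetic above concrete, here is a minimal sketch in plain C++ (no ICU involved). The helper name `signedBytes` is hypothetical and only illustrative: it maps every byte above 127 to its value minus 256, which is how the values -23, -61 and -87 quoted above are obtained.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not part of ICU): return each byte of `bytes`
// as the value it has when read as a signed 8-bit integer.
std::vector<int> signedBytes(const std::string& bytes) {
    std::vector<int> out;
    for (unsigned char b : bytes) {
        // 0..127 stay unchanged; 128..255 become b - 256.
        out.push_back(b < 128 ? static_cast<int>(b) : static_cast<int>(b) - 256);
    }
    return out;
}
```

Called with the single Latin-1 byte "\xE9" it yields -23, and with the UTF-8 pair "\xC3\xA9" it yields -61 and -87.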
The Unix console locale is in UTF-8 (LANG is "fr_FR").
The small example code below helps to explain the problem; I'm using the octal escape \351 to simulate the received value -23.

Here are the results (the comments at the end of each line are mine):

With CentOS 7 and ICU v50:
> main8
Default ICU converter is UTF-8
U_CHARSET_IS_UTF8 = 0
Default ICU converter is ISO-8859-1

Before ICU <Libell�> <=== raw data on a UTF-8 console, that's OK
L i b e l l �=(-23) <=== input

After ICU <Libell�> <=== raw data on a UTF-8 console, that's OK
L i b e l l �=(-23) <=== converter is latin-1 so with extended ASCII in input, output is identical, OK for me

After ICU & UnicodeString with locale <Libellé> <=== UTF-8 converted data with UTF-8 in console, that's OK
L i b e l l �(=-61) �(=-87) <=== latin-1 to UTF-8, in the UTF-8 table (http://charset.org/utf-8), character 'é' is equal to C3 A9 or -61 -87 in signed bytes, that's fine

Converter closed.
After ICU UTF-16 to UTF-8 with ucnv_open <Libellé> <=== UTF-8 converted data with UTF-8 in console, still OK
L i b e l l �(=-61) �(=-87) <=== latin-1 to UTF-16 to UTF-8, the result is correct and understandable

With RHEL 9 and ICU v67.1:
> main8
Default ICU converter is UTF-8
U_CHARSET_IS_UTF8 = 0
Default ICU converter is UTF-8

Before ICU <Libell�> <=== raw data on a UTF-8 console, that's OK
L i b e l l �=(-23) <=== input

After ICU <Libell�> <=== raw data on a UTF-8 console, that's OK
L i b e l l �=(-17) �=(-65) �=(-67) <=== EF BF BD is the UTF-8 encoding of U+FFFD (the replacement character), so this indicates a Unicode conversion error (shown as "ï¿½" when read as Latin-1)

After ICU & UnicodeString with locale <Libellé> <=== UTF-8 converted data displayed correctly in UTF-8 console, it's OK
L i b e l l �(=-61) �(=-81) �(=-62) �(=-65) �(=-62) �(=-67) <=== ??? although the displayed value is correct, the internal data is not what I expected

Converter closed.
After ICU UTF-16 to UTF-8 with ucnv_open <Libellé> <=== UTF-16 to UTF-8 converted displayed correctly in UTF-8 console, correct
L i b e l l �(=-61) �(=-81) �(=-62) �(=-65) �(=-62) �(=-67) <=== ??? same
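
One possible reading of the six bytes marked ??? above (an assumption, not a confirmed diagnosis): Test 1 in the source writes its result back into __mValue, so on RHEL 9 the buffer would hold EF BF BD (U+FFFD) by the time Test 2 and Test 3 read it as ISO-8859-1. Decoding those three bytes as Latin-1 and re-encoding them as UTF-8 gives C3 AF C2 BF C2 BD, which is -61 -81 -62 -65 -62 -67 in signed bytes. The sketch below reproduces that arithmetic in plain C++ without ICU; `latin1ToUtf8` is a hypothetical helper, not an ICU function.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not an ICU API): encode ISO-8859-1 bytes as UTF-8.
// Every Latin-1 byte is the code point U+0000..U+00FF with the same value,
// so at most two UTF-8 bytes are needed per input byte.
std::string latin1ToUtf8(const std::string& latin1) {
    std::string utf8;
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            // ASCII range: copied unchanged.
            utf8 += static_cast<char>(c);
        } else {
            // U+0080..U+00FF: two-byte sequence 110xxxxx 10xxxxxx.
            utf8 += static_cast<char>(0xC0 | (c >> 6));
            utf8 += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return utf8;
}
```

With the original input "Libell\xE9" this gives "Libell\xC3\xA9", the result seen on CentOS 7. With "\xEF\xBF\xBD" it gives "\xC3\xAF\xC2\xBF\xC2\xBD", the six bytes seen on RHEL 9. If this reading is right, keeping the original input in a separate buffer before Test 1 would let Test 2 and Test 3 start from the Latin-1 byte E9 again.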

And now the source code, "main8.c":

// Compilation command :
//
// g++ main8.c -o main8 -I/usr/include/ -licuuc
//

#include <cstring>
#include <iostream>

// ICU
#include <unicode/uversion.h>
#include <unicode/unistr.h>
#include <unicode/ustream.h>
#include <unicode/ucnv.h>

#include <unicode/uloc.h>
#include <unicode/utypes.h>

#include <unicode/ustring.h>

int main() {

std::cout << "Default ICU converter is " << ucnv_getDefaultName() << "\r\n";

UErrorCode status = U_ZERO_ERROR;

#if U_CHARSET_IS_UTF8
std::cout << "U_CHARSET_IS_UTF8 = 1" << "\r\n";
#else
std::cout << "U_CHARSET_IS_UTF8 = 0" << "\r\n";
#endif

// ================= TEST 1 : DEFAULT CONVERTER SET TO LATIN-1 =================

ucnv_setDefaultName("ISO-8859-1");

std::cout << "Default ICU converter is " << ucnv_getDefaultName() << "\r\n";

char __mValue[50] = "Libell\351"; // Libellé 'é' = 233d = E9h = 351o

std::cout << "\r\n";
std::cout << "Before ICU <" << __mValue << ">" << "\r\n";
for( int i = 0; i < strlen(__mValue); i++) {
if ( (int)__mValue[i] < 0 ) {
std::cout << __mValue[i] << "=(" << (int)__mValue[i] << ") ";
} else {
std::cout << __mValue[i] << " ";
}
}
std::cout << "\r\n";
std::cout << "\r\n";

// ================= TEST 1 : DEFAULT CONVERTER =================

U_ICU_NAMESPACE::UnicodeString vStr(__mValue);
vStr = vStr.trim();
memset(__mValue, 0, 30);
vStr.extract(0, vStr.length(), __mValue);

std::cout << "After ICU <" << __mValue << ">" << "\r\n";

for( int i = 0; i < strlen(__mValue); i++) {
if ( (int)__mValue[i] < 0 ) {
std::cout << __mValue[i] << "=(" << (int)__mValue[i] << ") ";
} else {
std::cout << __mValue[i] << " ";
}
}
std::cout << "\r\n";
std::cout << "\r\n";

// ================= TEST 2 : FORCED CONVERTER ISO-8859-1 TO UTF-8 =================

char target[ 100 ];
U_ICU_NAMESPACE::UnicodeString str(__mValue, "ISO-8859-1");
int32_t targetSize = str.extract(0, str.length(), target, sizeof(target), "UTF-8");
target[targetSize] = 0;
std::cout << "After ICU & UnicodeString with locale <" << target << ">" << "\r\n";
for( int i = 0; i < strlen(target); i++) {
if ( target[i] < 0 ) {
std::cout << target[i] << "(=" << (int)target[i] << ") ";
} else {
std::cout << target[i] << " ";
}
}
std::cout << "\r\n";
std::cout << "\r\n";

// ================= TEST 3 : OPENED CONVERTER UTF-16 TO UTF-8 =================

UConverter *conv = ucnv_open("ISO-8859-1", &status);
if(U_FAILURE(status)) {
std::cout << "ucnv_open KO" << "\r\n";
}

UChar dest[ 100 ];
int32_t destLen = 0;

// Convert ISO-8859-1 to Unicode UTF-16 (destCapacity is counted in UChars, not bytes)
destLen = ucnv_toUChars( conv, dest, sizeof( dest ) / sizeof( dest[0] ), __mValue, -1, &status );
if(U_FAILURE(status)) {
std::cout << "FATAL: ICU conversion !" << "\r\n";
}

if ( conv != NULL ) {
ucnv_close(conv);
std::cout << "Converter closed." << "\r\n";
}

// Convert UTF-16 to UTF-8
char utf8[ 100 ];
status = U_ZERO_ERROR;

char * ret = u_strToUTF8( utf8, sizeof( utf8 ), NULL, dest, destLen, &status );
if(U_FAILURE(status)) {
std::cout << "FATAL: Conversion !" << "\r\n";
}

std::cout << "After ICU UTF-16 to UTF-8 with ucnv_open <" << utf8 << ">" << "\r\n";

for( int i = 0; i < strlen(utf8); i++) {
if ( utf8[i] < 0 ) {
std::cout << utf8[i] << "(=" << (int)utf8[i] << ") ";
} else {
std::cout << utf8[i] << " ";
}
}
std::cout << "\r\n";
std::cout << "\r\n";

}
