UnicodeString::toUTF8String, why BOM?


Gregorio Litenstein

Dec 17, 2024, 10:32:44 AM
to icu-support
While debugging some stuff related to text conversion, I have noticed that converting from UTF16 to UTF8 (via an intermediary `UnicodeString` and `toUTF8String`) results in a UTF8 string that starts with \xEF\xBB\xBF. Why is this BOM being appended to my string, and why does it only seem to happen when converting from UTF16?

P.S. I am using icu4c 74.2

Markus Scherer

Dec 17, 2024, 5:30:58 PM
to Gregorio Litenstein, icu-support
On Tue, Dec 17, 2024 at 7:32 AM Gregorio Litenstein <g.lite...@gmail.com> wrote:
While debugging some stuff related to text conversion, I have noticed that converting from UTF16 to UTF8 (via an intermediary `UnicodeString` and `toUTF8String`) results in a UTF8 string that starts with \xEF\xBB\xBF. Why is this BOM being appended to my string, and why does it only seem to happen when converting from UTF16?

UnicodeString::toUTF8String() does not prepend the BOM.

I just tried this:
    UnicodeString s16(u"abcçカ🚴");
    std::string s8;
    s16.toUTF8String(s8);
    printf("s8.length=%d [%2x %2x %2x %2x ...] \"%s\"\n",
           (int)s8.length(),
           (uint8_t)s8[0], (uint8_t)s8[1], (uint8_t)s8[2], (uint8_t)s8[3],
           s8.c_str());

As expected, this outputs 12 bytes, starting with 0x61 for 'a':
s8.length=12 [61 62 63 c3 ...] "abcçカ🚴"

If you get a BOM (which in UTF-8 is really merely a "signature byte sequence") in the output from UnicodeString::toUTF8String(), then the UnicodeString starts with U+FEFF.
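For readers wondering where the three bytes come from: they are simply the UTF-8 encoding of the single code point U+FEFF. The following is a minimal, hand-rolled sketch that applies the standard UTF-8 bit layout to one code point. The function name `encodeUtf8` is made up for illustration and is not an ICU API.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode scalar value as UTF-8.
// Illustrative only: real code would use ICU or another library.
std::string encodeUtf8(char32_t cp) {
    // Assumption: the caller passes a valid scalar value (no surrogates).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling `encodeUtf8(0xFEFF)` yields the bytes EF BB BF, which is why a UnicodeString that starts with U+FEFF produces that prefix in the `toUTF8String()` output.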

Best regards,
markus

Steven R. Loomis

Dec 17, 2024, 5:49:15 PM
to Markus Scherer, Gregorio Litenstein, icu-support
Thanks, Markus,

I actually wonder if the issue is the other way… Gregorio, perhaps you are choosing a converter type that does not recognize the BOM, but your input data has a BOM. In that case, ICU won’t automatically detect and strip it.

Steven

--
You received this message because you are subscribed to the Google Groups "icu-support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to icu-support...@unicode.org.
To view this discussion visit https://groups.google.com/a/unicode.org/d/msgid/icu-support/CAN49p6pEGnnjWcZE1NHGCucAjG5zYh%3DAVtC2m9UWMzBGTtf9Jw%40mail.gmail.com.

Gregorio Litenstein

Dec 17, 2024, 6:44:09 PM
to Markus Scherer, Steven R. Loomis, icu-support
Ok, I think the fundamental issue is that I didn't properly understand how the BOM actually works. I was assuming it behaved similarly to magic numbers/FourCC or such (and thus I was under the impression that BOMs were agreed-on but arbitrary sequences).

If I now understand it correctly, it's always a single sequence that is not a visible character in either endianness, but the result of trying to interpret it one way or the other can fail in a specific and expected way, such that catching that failure provides a hint for which endianness to use, yes?

Is 0xEFBBBF the direct translation of 0xFEFF? 

I just got the answer to my question from Wikipedia, it seems:

https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8 says, "...or that it was converted to UTF-8 from a stream that contained an optional BOM. The standard also does not recommend removing a BOM when it is there..."

Thanks again for helping me clear this up!


Gregorio Litenstein Goldzweig
Physician and Surgeon

Markus Scherer

Dec 17, 2024, 6:55:11 PM
to Gregorio Litenstein, Steven R. Loomis, icu-support
Hi Gregorio,

On Tue, Dec 17, 2024 at 3:44 PM Gregorio Litenstein <g.lite...@gmail.com> wrote:
Ok, I think the fundamental issue is that I didn't properly understand how the BOM actually works. I was assuming ...

Let me send you to the Unicode Standard here which I think does a good job explaining it:

markus

Steven R. Loomis

Dec 18, 2024, 9:31:56 AM
to Markus Scherer, Gregorio Litenstein, icu-support
Gregorio,
The character is invisible as U+FEFF, but it is *invalid* the other way: byte-swapped, it reads as U+FFFE, which is not a valid Unicode character.
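This asymmetry is what makes U+FEFF usable as a byte order mark. As a rough sketch (the enum and function names are hypothetical, not ICU API), a reader can look at the first two bytes of a buffer it believes to be UTF-16 and pick the byte order under which they decode to U+FEFF rather than U+FFFE:

```cpp
#include <cassert>
#include <cstddef>

enum class Utf16Order { BigEndian, LittleEndian, Unknown };

// Inspect the first two bytes of a buffer for the UTF-16 byte order mark.
// FE FF read big-endian is U+FEFF; FF FE read little-endian is U+FEFF.
// Reading either pair the wrong way round would give U+FFFE instead.
Utf16Order detectUtf16Bom(const unsigned char* bytes, std::size_t length) {
    assert(bytes != nullptr || length == 0);
    if (length >= 2) {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF) return Utf16Order::BigEndian;
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) return Utf16Order::LittleEndian;
    }
    return Utf16Order::Unknown;  // no BOM: the byte order must come from elsewhere
}
```

A buffer with no BOM yields `Unknown`, and the caller has to fall back on a declared or default byte order. Note that FF FE followed by 00 00 is also the UTF-32LE signature, which this sketch ignores.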

--
Steven R. Loomis
Code Hive Tx, LLC


Steven R. Loomis

Dec 19, 2024, 8:35:48 PM
to Gregorio Litenstein, icu-support
Gregorio,

Thanks for posting the code- that makes it much easier to see what is going on.

What’s happening is that you are using the codepage name “UTF-16” and converting that to “UTF-8”.   Note that the codepage “UTF-16” detects and writes a BOM, as you found out. If you would like UTF-16 without a BOM, please use the names “UTF-16LE” or “UTF-16BE” for little or big endian, where the endianness is specified directly.

However, if you know that your std::string_view is actually UTF-16 in your platform’s endianness, couldn’t you just do:

        icu::UnicodeString ustring(reinterpret_cast<const char16_t*>(sv.data()), static_cast<int32_t>(sv.length() / 2) /* bytes -> code units */);

.. since UnicodeString has a constructor that takes an array of UTF-16 code units.
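One caveat with handing raw bytes to the code-unit constructor: the buffer behind a `std::string_view` is only guaranteed to be `char`-aligned, so viewing it as `char16_t` is not strictly portable. A conservative sketch, assuming the bytes really are UTF-16 in native byte order (the helper name is hypothetical), copies them into a `std::u16string` first:

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <string_view>

// Copy raw bytes holding native-endian UTF-16 into char16_t code units.
// memcpy sidesteps the alignment and aliasing questions of casting sv.data().
std::u16string bytesToUtf16Units(std::string_view sv) {
    assert(sv.size() % 2 == 0);  // assumption: the input holds whole code units
    std::u16string units(sv.size() / 2, u'\0');
    if (!units.empty()) {
        std::memcpy(units.data(), sv.data(), units.size() * sizeof(char16_t));
    }
    return units;
}
```

The resulting buffer can then be passed as `icu::UnicodeString(units.data(), static_cast<int32_t>(units.size()))`. A leading U+FEFF, if present, is copied like any other code unit.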

As a side note, converters are already cached, so you can probably just use the thread-safe ucnv_open() and ucnv_close() when you need them instead of adding another map on top of them. The shared data is memory-mapped from a time- and space-efficient format. If you want to manage a map of converters (such as if they have custom options on them), you might be interested in ucnv_clone() and ucnv_reset().

Hope this helps,

Steven

--
Steven R. Loomis
Code Hive Tx, LLC



On Dec 17, 2024, at 10:39 AM, Gregorio Litenstein <g.lite...@gmail.com> wrote:

```c++
std::map<std::string, Converter> UnicodeUtil::m_converters{};

Converter::Converter(std::string const& codepage): m_codepage(codepage), m_converter(nullptr, &ucnv_close) {
    m_converter = std::unique_ptr<UConverter, decltype(&ucnv_close)>(ucnv_open(m_codepage.c_str(), m_error), &ucnv_close);
    if (m_error.isFailure()) throw std::runtime_error("unicode/error: " + std::to_string(m_error.get()) + ": " + std::string(m_error.errorName()));
}

Converter::Converter(Converter&& c) noexcept:
    m_codepage(std::move(c.m_codepage)),
    m_converter(std::move(c.m_converter)),
    m_error(std::move(c.m_error)) {}

icu::UnicodeString Converter::convertToUTF8(std::string_view sv) {
    std::scoped_lock l(m_lock);
    icu::UnicodeString ret(sv.data(), static_cast<int>(sv.length()), m_converter.get(), m_error);
    if (m_error.isFailure()) throw std::runtime_error("Couldn't convert string: " + std::string(sv) + " to UTF-8. Error: " + std::to_string(m_error.get()) + ": " + m_error.errorName());
    return ret;
}

Converter& UnicodeUtil::getConverter(std::string const& s) {
    return m_converters.try_emplace(s, Converter(s)).first->second; // FIXME: THIS NEEDS A LOCK.
}

std::string UnicodeUtil::convertToUTF8 (std::string_view str, std::string _filename, CaseMapping toCase, bool assumeUTF8) {
    icu::UnicodeString ustring;
    std::string charset;
    if (assumeUTF8) charset = "UTF-8";
    else charset = UnicodeUtil::getCharset(str);
    if (charset != "UTF-8") {
        if (!_filename.empty()) {
            SpdLogger::info(LogSystem::I18N, "Filename={} does not seem to be UTF-8. Detected encoding={}", _filename, charset);
        }
        ustring = UnicodeUtil::getConverter(charset).convertToUTF8(str);
    }
    else { ustring = icu::UnicodeString::fromUTF8(str.data()); }
    switch(toCase) {
        case CaseMapping::UPPER:
            ustring.toUpper();
            break;
        case CaseMapping::LOWER:
            ustring.toLower();
            break;
        case CaseMapping::TITLE:
            ustring.toTitle(0, icu::Locale(TranslationEngine::getCurrentLanguageCode().c_str()), U_TITLECASE_NO_LOWERCASE);
            break;
        case CaseMapping::NONE:
            break;
    }
    std::string ret;
    if (!ustring.isEmpty()) {
        ustring.toUTF8String(ret);
    }
    else {
        if (!ret.empty()) {
            SpdLogger::error(LogSystem::I18N, "Unable to convert text in unknown encoding={}", charset);
        }
    }
    return ret.substr(removeUTF8BOM(ret) ? 3 : 0); // For reasons unknown, it appears ICU appends an UTF-8 BOM when the source is UTF-16.
}
```

Before adding the `substr` solution at the end, I had put the following lines to get a better look at what was going on:

```c++

if (ret.length() >= 5 && SpdLogger::initialized()) {
    SpdLogger::debug(LogSystem::I18N, "Converted from charset={} -- Original string beginning: {}\nret[0]={:X}, ret[1]={:X}, ret[2]={:X}, ret[3]={:X}, ret[4]={:X}", charset, str.substr(0,16), ret[0], ret[1], ret[2], ret[3], ret[4]);
    if (removeUTF8BOM(ret)) {
        SpdLogger::debug(LogSystem::I18N, "After trying to remove the BOM... ret[0]={:X}, ret[1]={:X}, ret[2]={:X}, ret[3]={:X}, ret[4]={:X}", ret.substr(3)[0], ret.substr(3)[1], ret.substr(3)[2], ret.substr(3)[3], ret.substr(3)[4]);
    }
}
```

It seems that the BOM only gets appended for a UTF16 source (I tried converting from ANSI as well as Shift-JIS).

Considering for UTF8 the BOM is not encouraged, I would expect ICU to just remove the UTF16 BOM and not add a new one.


On 17 Dec 2024 13:30 -0300, Steven R. Loomis <srl...@gmail.com>, wrote:
Hi, 
 Can you post the exact code you’re using? You may have chosen an encoding which includes a BOM. 

 -s


Gregorio Litenstein

Dec 19, 2024, 8:35:48 PM
to Steven R. Loomis, icu-support
Hi and thanks for replying so quickly.

First, re the issue why I sent this question:

I think you misunderstood the issue: I want to use UTF-8 wherever possible, but I am working with user-provided text files, so I am converting those to UTF-8.

What actually happens is that I use CED to detect the encoding and create a UnicodeString, which is then converted to UTF-8.

My problem is not that the UTF-16 has a BOM (which is rather expected), but rather that after calling UnicodeString::toUTF8String, if the source was UTF16 and had a BOM, the resulting string has the 3-byte UTF-8 BOM prepended to it.

As I said above, I've already fixed it in my code by, essentially, checking for said three-byte BOM and removing it if found, but I'd like to understand (if you know, obviously) whether this behavior is intentional and why it was decided as such; maybe it should be explicitly documented.
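The workaround described above can be sketched without ICU. This is an illustrative stand-in, not the `removeUTF8BOM` helper from the posted code (whose implementation was not shown); the names are made up:

```cpp
#include <cassert>
#include <string_view>

// The UTF-8 encoding of U+FEFF.
constexpr std::string_view kUtf8Bom = "\xEF\xBB\xBF";
static_assert(kUtf8Bom.size() == 3, "the UTF-8 signature is three bytes long");

// Return the view without a leading UTF-8 signature, if there is one.
// Only the very first three bytes are checked; U+FEFF elsewhere is left alone.
std::string_view stripUtf8Bom(std::string_view s) {
    if (s.substr(0, kUtf8Bom.size()) == kUtf8Bom) {
        s.remove_prefix(kUtf8Bom.size());
    }
    return s;
}
```

Returning a view avoids copying; a caller that needs an owning string can construct a `std::string` from the result.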



And second, re: the `UConverter`s. I didn't remember that `ucnv_open` was thread-safe. In spite of that, I think it's specified in the documentation that if the converter is not reset before being used, the result will be gibberish, no? That's the reason why I added my map and wrappers on top, because I want to make sure that does not happen, while at the same time potentially allowing for two different Converters to be used at the same time.

That I don't think I could achieve without my own map?



Steven R. Loomis

Dec 19, 2024, 8:35:48 PM
to Gregorio Litenstein, icu-support

--
Steven R. Loomis
Code Hive Tx, LLC



On Dec 17, 2024, at 12:33 PM, Gregorio Litenstein <g.lite...@gmail.com> wrote:

Hi and thanks for replying so quickly.

First, re the issue why I sent this question:

I think you misunderstood the issue: I want to use UTF-8 wherever possible, but I am working with user-provided text files, so I am converting those to UTF-8.

What actually happens is that I use CED to detect the encoding and create a UnicodeString, which is then converted to UTF-8.

My problem is not that the UTF-16 has a BOM (which is rather expected), but rather that after calling UnicodeString::toUTF8String, if the source was UTF16 and had a BOM, the resulting string has the 3-byte UTF-8 BOM prepended to it.

If you are detecting the encoding, what is the actual string you are using to pass to the converter for the converter id? 

As I said above, I've already fixed it in my code by, essentially, checking for said three-byte BOM and removing it if found, but I'd like to understand (if you know, obviously) whether this behavior is intentional and why it was decided as such; maybe it should be explicitly documented.


Given a specific converter ID, do you have a sample data set (just a few bytes)?


And second, re: the `UConverter`s. I didn't remember that `ucnv_open` was thread-safe.

Yes, ucnv_open() is threadsafe.

In spite of that, I think it's specified in the documentation that if the converter is not reset before being used, the result will be gibberish, no?

If it’s not reset with ucnv_reset() before being RE-used, yes. If you call ucnv_reset(), you can use the same converter over again (in one thread at a time).

That's the reason why I added my map and wrappers on top, because I want to make sure that does not happen, while at the same time potentially allowing for two different Converters to be used at the same time.

That I don't think I could achieve without my own map?

If you call ucnv_open() in two threads, each thread can do its own conversion, as they will be using different converter objects.

-s

Steven R. Loomis

Dec 19, 2024, 8:35:48 PM
to Gregorio Litenstein, icu-support
Hi, 
 Can you post the exact code you’re using? You may have chosen an encoding which includes a BOM. 

 -s

Markus Scherer

Dec 19, 2024, 9:04:28 PM
to Gregorio Litenstein, Steven R. Loomis, icu-support
Minor code feedback --

On Thu, Dec 19, 2024 at 5:35 PM Gregorio Litenstein <g.lite...@gmail.com> wrote:
icu::UnicodeString Converter::convertToUTF8(std::string_view sv) {
std::scoped_lock l(m_lock);
icu::UnicodeString ret(sv.data(), static_cast<int>(sv.length()), m_converter.get(), m_error);
if (m_error.isFailure()) throw std::runtime_error("Couldn't convert string: " + std::string(sv) + " to UTF-8. Error: " + std::to_string(m_error.get()) + ": " + m_error.errorName());
return ret;
}

Misnomer: This converts to UTF-16, not to UTF-8.

std::string UnicodeUtil::convertToUTF8 (std::string_view str, std::string _filename, CaseMapping toCase, bool assumeUTF8) {
icu::UnicodeString ustring;
std::string charset;
if (assumeUTF8) charset = "UTF-8";
else charset = UnicodeUtil::getCharset(str);
if (charset != "UTF-8") {
if (!_filename.empty()) {
SpdLogger::info(LogSystem::I18N, "Filename={} does not seem to be UTF-8. Detected encoding={}", _filename, charset);
}
ustring = UnicodeUtil::getConverter(charset).convertToUTF8(str);
}
else { ustring = icu::UnicodeString::fromUTF8(str.data()); }

This line does not work if str contains NUL bytes. Just remove .data() -->
else { ustring = icu::UnicodeString::fromUTF8(str); }
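To illustrate the difference: an API that receives only a `const char*` has to search for a terminating NUL to find the length, while one that receives the view knows the length up front. A minimal standard-library sketch (function names are made up; this does not call ICU):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Length as seen by code that gets only a char pointer and scans for NUL.
std::size_t lengthViaPointer(std::string_view sv) {
    std::string copy(sv);  // owning copy, so c_str() is guaranteed to be NUL-terminated
    return std::strlen(copy.c_str());
}

// Length as seen by code that gets the view itself.
std::size_t lengthViaView(std::string_view sv) {
    return sv.size();
}
```

For a view over the five bytes `ab\0cd`, the pointer-based length is 2 and the view-based length is 5. Passing `sv.data()` is also risky because a `std::string_view` is not guaranteed to be NUL-terminated at all.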

switch(toCase) {
case CaseMapping::UPPER:
ustring.toUpper();
break;
...

Note that case mappings are language-sensitive. Calling these functions without a Locale parameter / locale ID string will use the machine's default locale. You will get different results if your machine is set to Turkish, Dutch, Greek, ...

It seems that the BOM only gets appended for a UTF16 source (I tried converting from ANSI as well as Shift-JIS).

As discussed, UnicodeString::toUTF8String() does not add a BOM. It will convert one if there is one.

Considering for UTF8 the BOM is not encouraged, I would expect ICU to just remove the UTF16 BOM and not add a new one.

On input, the "UTF-16" converter will detect and remove the BOM. The "UTF-16LE" and "UTF-16BE" converters will not remove the BOM. All according to the standard.
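In practice this means that if the charset name handed to the converter is "UTF-16LE" or "UTF-16BE" and the data starts with a BOM, U+FEFF ends up as the first code unit of the resulting string. A small sketch of dropping it at the UTF-16 level, before any conversion to UTF-8, using standard-library types as a stand-in for UnicodeString (names are hypothetical):

```cpp
#include <cassert>
#include <string_view>

constexpr char16_t kBomCodeUnit = 0xFEFF;

// Drop a single leading U+FEFF from a sequence of UTF-16 code units.
// Later occurrences are left untouched.
std::u16string_view stripLeadingBom(std::u16string_view units) {
    if (!units.empty() && units.front() == kBomCodeUnit) {
        units.remove_prefix(1);
    }
    return units;
}
```

With ICU itself the same check could presumably be written against the UnicodeString directly, for example by testing `charAt(0)` for 0xFEFF and calling `remove(0, 1)`; alternatively, passing the plain "UTF-16" converter name lets the converter consume the BOM on input.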

Best regards,
markus

Gregorio Litenstein

Dec 27, 2024, 11:46:24 AM
to Markus Scherer, Steven R. Loomis, icu-support


On 19 Dec 2024 at 23:04 -0300, Markus Scherer <marku...@gmail.com>, wrote:
Minor code feedback --

On Thu, Dec 19, 2024 at 5:35 PM Gregorio Litenstein <g.lite...@gmail.com> wrote:
icu::UnicodeString Converter::convertToUTF8(std::string_view sv) {
std::scoped_lock l(m_lock);
icu::UnicodeString ret(sv.data(), static_cast<int>(sv.length()), m_converter.get(), m_error);
if (m_error.isFailure()) throw std::runtime_error("Couldn't convert string: " + std::string(sv) + " to UTF-8. Error: " + std::to_string(m_error.get()) + ": " + m_error.errorName());
return ret;
}

Misnomer: This converts to UTF-16, not to UTF-8.

You're right of course. I guess I didn't think about it when I wrote that, but I'm loath to change it now after several years. I'll probably add a comment though.


std::string UnicodeUtil::convertToUTF8 (std::string_view str, std::string _filename, CaseMapping toCase, bool assumeUTF8) {
icu::UnicodeString ustring;
std::string charset;
if (assumeUTF8) charset = "UTF-8";
else charset = UnicodeUtil::getCharset(str);
if (charset != "UTF-8") {
if (!_filename.empty()) {
SpdLogger::info(LogSystem::I18N, "Filename={} does not seem to be UTF-8. Detected encoding={}", _filename, charset);
}
ustring = UnicodeUtil::getConverter(charset).convertToUTF8(str);
}
else { ustring = icu::UnicodeString::fromUTF8(str.data()); }

This line does not work if str contains NUL bytes. Just remove .data() -->
else { ustring = icu::UnicodeString::fromUTF8(str); }

That was a (possibly misguided) but deliberate decision at a time when some of the platforms we were working with had versions of ICU before 65. Thanks for pointing it out though.

switch(toCase) {
case CaseMapping::UPPER:
ustring.toUpper();
break;
...

Note that case mappings are language-sensitive. Calling these functions without a Locale parameter / locale ID string will use the machine's default locale. You will get different results if your machine is set to Turkish, Dutch, Greek, ...

This is very unlikely to actually come up in usage of our app, but thanks for pointing it out; it should be easy to correct.

Gregorio Litenstein

Dec 27, 2024, 12:42:43 PM
to Markus Scherer, Steven R. Loomis, icu-support
Actually… a couple follow-up questions:

1) Are instances of icu::Locale expensive to construct? i.e. does it make sense to keep a reference to a locale based on the app’s current language setting?
2) Do I need to worry about cleaning them up? and if so, how?

Thanks again!



Fredrik Roubert

Dec 29, 2024, 10:43:22 PM
to Gregorio Litenstein, Markus Scherer, Steven R. Loomis, icu-support
On Sat, Dec 28, 2024 at 2:42 AM Gregorio Litenstein
<g.lite...@gmail.com> wrote:

> 1) Are instances of icu::Locale expensive to construct?

Kind of, maybe, maybe not. They're definitely not trivial to
construct, but whether that's an issue for your use case is something
that you can only find out by taking measurements.

> 2) Do I need to worry about cleaning them up? and if so, how?

In terms of memory management there's nothing strange about them, they
behave just like any normal C++ object, once constructed you can think
of them and handle them just like you would do with std::string
objects.

--
Fredrik Roubert
rou...@google.com

Gregorio Litenstein

Dec 31, 2024, 8:48:07 AM
to Fredrik Roubert, Markus Scherer, Steven R. Loomis, icu-support
1) Fair. I don't use those calls THAT much but to be on the safe side, I think I'll just construct it on app startup or when the display language changes.

2) Awesome, thanks.


P.S. Happy new year to all who helped me here.

