While debugging some text-conversion code, I noticed that converting from UTF-16 to UTF-8 (via an intermediary `UnicodeString` and `toUTF8String`) produces a UTF-8 string that starts with \xEF\xBB\xBF. Why is this BOM being prepended to my string, and why does it only seem to happen when converting from UTF-16?
--
You received this message because you are subscribed to the Google Groups "icu-support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to icu-support...@unicode.org.
To view this discussion visit https://groups.google.com/a/unicode.org/d/msgid/icu-support/CAN49p6pEGnnjWcZE1NHGCucAjG5zYh%3DAVtC2m9UWMzBGTtf9Jw%40mail.gmail.com.
Gregorio Litenstein Goldzweig
Médico Cirujano
Ok, I think the fundamental issue is that I didn't properly understand how the BOM actually works. I was assuming ...
--
On Dec 17, 2024, at 10:39 AM, Gregorio Litenstein <g.lite...@gmail.com> wrote:
```c++
std::map<std::string, Converter> UnicodeUtil::m_converters{};

Converter::Converter(std::string const& codepage): m_codepage(codepage), m_converter(nullptr, &ucnv_close) {
    m_converter = std::unique_ptr<UConverter, decltype(&ucnv_close)>(ucnv_open(m_codepage.c_str(), m_error), &ucnv_close);
    if (m_error.isFailure()) throw std::runtime_error("unicode/error: " + std::to_string(m_error.get()) + ": " + std::string(m_error.errorName()));
}

Converter::Converter(Converter&& c) noexcept:
    m_codepage(std::move(c.m_codepage)),
    m_converter(std::move(c.m_converter)),
    m_error(std::move(c.m_error)) {}

icu::UnicodeString Converter::convertToUTF8(std::string_view sv) {
    std::scoped_lock l(m_lock);
    icu::UnicodeString ret(sv.data(), static_cast<int>(sv.length()), m_converter.get(), m_error);
    if (m_error.isFailure()) throw std::runtime_error("Couldn't convert string: " + std::string(sv) + " to UTF-8. Error: " + std::to_string(m_error.get()) + ": " + m_error.errorName());
    return ret;
}

Converter& UnicodeUtil::getConverter(std::string const& s) {
    return m_converters.try_emplace(s, Converter(s)).first->second; // FIXME: THIS NEEDS A LOCK.
}

std::string UnicodeUtil::convertToUTF8 (std::string_view str, std::string _filename, CaseMapping toCase, bool assumeUTF8) {
    icu::UnicodeString ustring;
    std::string charset;
    if (assumeUTF8) charset = "UTF-8";
    else charset = UnicodeUtil::getCharset(str);
    if (charset != "UTF-8") {
        if (!_filename.empty()) {
            SpdLogger::info(LogSystem::I18N, "Filename={} does not seem to be UTF-8. Detected encoding={}", _filename, charset);
        }
        ustring = UnicodeUtil::getConverter(charset).convertToUTF8(str);
    }
    else { ustring = icu::UnicodeString::fromUTF8(str.data()); }
    switch(toCase) {
        case CaseMapping::UPPER:
            ustring.toUpper();
            break;
        case CaseMapping::LOWER:
            ustring.toLower();
            break;
        case CaseMapping::TITLE:
            ustring.toTitle(0, icu::Locale(TranslationEngine::getCurrentLanguageCode().c_str()), U_TITLECASE_NO_LOWERCASE);
            break;
        case CaseMapping::NONE:
            break;
    }
    std::string ret;
    if (!ustring.isEmpty()) {
        ustring.toUTF8String(ret);
    }
    else {
        if (!ret.empty()) {
            SpdLogger::error(LogSystem::I18N, "Unable to convert text in unknown encoding={}", charset);
        }
    }
    return ret.substr(removeUTF8BOM(ret) ? 3 : 0); // For reasons unknown, it appears ICU appends an UTF-8 BOM when the source is UTF-16.
}
```
Before adding the `substr` solution at the end, I had put the following lines to get a better look at what was going on:
```c++
if (ret.length() >= 5 && SpdLogger::initialized()) {
    SpdLogger::debug(LogSystem::I18N, "Converted from charset={} -- Original string beginning: {}\nret[0]={:X}, ret[1]={:X}, ret[2]={:X}, ret[3]={:X}, ret[4]={:X}", charset, str.substr(0,16), ret[0], ret[1], ret[2], ret[3], ret[4]);
    if (removeUTF8BOM(ret)) {
        SpdLogger::debug(LogSystem::I18N, "After trying to remove the BOM... ret[0]={:X}, ret[1]={:X}, ret[2]={:X}, ret[3]={:X}, ret[4]={:X}", ret.substr(3)[0], ret.substr(3)[1], ret.substr(3)[2], ret.substr(3)[3], ret.substr(3)[4]);
    }
}
```
It seems that the BOM only gets added for a UTF-16 source (I also tried converting from ANSI and from Shift-JIS, and neither produced one).
Considering that a BOM is discouraged for UTF-8, I would expect ICU to just remove the UTF-16 BOM and not add a new one.
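The `removeUTF8BOM` helper used above is not shown in this thread. As a rough sketch of the workaround it supports, assuming it only needs to detect and skip a leading \xEF\xBB\xBF, something like the following would do. The names `startsWithUtf8Bom` and `stripUtf8Bom` are hypothetical and use only the standard library, not ICU.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper (not the poster's removeUTF8BOM): true if `s` begins
// with the bytes EF BB BF, which is the UTF-8 encoding of U+FEFF.
bool startsWithUtf8Bom(std::string_view s) {
    constexpr std::string_view bom{"\xEF\xBB\xBF", 3};
    // substr clamps the count, so strings shorter than 3 bytes are safe.
    return s.substr(0, bom.size()) == bom;
}

// Returns a copy of `s` with one leading UTF-8 BOM dropped, if present.
std::string stripUtf8Bom(std::string_view s) {
    if (startsWithUtf8Bom(s)) s.remove_prefix(3);
    return std::string(s);
}
```

Working on a `std::string_view` avoids the temporary copies that repeated `substr` calls on a `std::string` would create.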
On 17 Dec 2024 13:30 -0300, Steven R. Loomis <srl...@gmail.com>, wrote:
Hi,
Can you post the exact code you’re using? You may have chosen an encoding which includes a BOM.
-s
On Dec 17, 2024, at 9:32 AM, Gregorio Litenstein <g.lite...@gmail.com> wrote:
While debugging some text-conversion code, I noticed that converting from UTF-16 to UTF-8 (via an intermediary `UnicodeString` and `toUTF8String`) produces a UTF-8 string that starts with \xEF\xBB\xBF. Why is this BOM being prepended to my string, and why does it only seem to happen when converting from UTF-16?
P.S. I am using icu4c 74.2
On Dec 17, 2024, at 12:33 PM, Gregorio Litenstein <g.lite...@gmail.com> wrote:
Hi, and thanks for replying so quickly.
First, regarding the issue that prompted my question:
I think you misunderstood it. I want to use UTF-8 wherever possible, but I am working with user-provided text files, so I am converting those to UTF-8.
What actually happens is that I use CED to detect the encoding and create a `UnicodeString`, which is then converted to UTF-8.
My problem is not that the UTF-16 input has a BOM (which is expected), but that after calling `UnicodeString::toUTF8String`, if the source was UTF-16 and had a BOM, the resulting string has the 3-byte UTF-8 BOM prepended to it.
As I said above, I have already worked around it in my code by checking for that three-byte BOM and removing it if found, but I would like to understand (if you know, obviously) whether this behavior is intentional, why it was decided that way, and whether it should be explicitly documented.
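For what it's worth, \xEF\xBB\xBF is simply what U+FEFF looks like once encoded as UTF-8, so those bytes would show up whenever U+FEFF is still the first code point in the `UnicodeString`. One possible reading, in line with Steven's hint about the chosen encoding and not confirmed in this thread, is that the converter picked for the detected charset keeps the BOM as an ordinary character instead of consuming it. The sketch below is a plain standard-library encoder (`encodeUtf8` is a hypothetical name, not an ICU function) that shows the byte arithmetic.

```cpp
#include <string>

// Illustrative UTF-8 encoder for a single code point (not an ICU API).
// It does not validate its input: surrogates and values above U+10FFFF
// are not rejected.
std::string encodeUtf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // U+FEFF falls in this three-byte range.
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling `encodeUtf8(0xFEFF)` yields the three bytes EF BB BF, which matches the prefix seen in the converted output.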
Second, regarding the `UConverter`s: I didn't remember that `ucnv_open` was thread-safe.
Even so, I think the documentation says that if a converter is not reset before being reused, the result will be gibberish, no?
That is why I added my map and wrappers on top: I want to make sure that does not happen, while still allowing two different Converters to be used at the same time.
I don't think I could achieve that without my own map?
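Regarding the map itself, the `FIXME` in `getConverter` could be addressed along these lines. This is only a sketch under assumptions: `FakeConverter` and `ConverterRegistry` are hypothetical stand-ins with no ICU calls, and the per-converter mutex mirrors the `m_lock` member from the posted code.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for the posted Converter class (no ICU involved).
struct FakeConverter {
    explicit FakeConverter(std::string name) : codepage(std::move(name)) {}
    std::string codepage;
    std::mutex lock;  // per-converter lock, as m_lock in the posted code
};

class ConverterRegistry {
public:
    // Returns the entry for `codepage`, creating it on first use.
    // std::map nodes are never relocated, so the returned reference stays
    // valid after the registry lock is released.
    FakeConverter& get(const std::string& codepage) {
        std::scoped_lock guard(m_mutex);
        // Constructor arguments are forwarded, so nothing is built when
        // the key already exists.
        return m_converters.try_emplace(codepage, codepage).first->second;
    }

    std::size_t size() {
        std::scoped_lock guard(m_mutex);
        return m_converters.size();
    }

private:
    std::mutex m_mutex;  // guards the map only, not the conversions
    std::map<std::string, FakeConverter> m_converters;
};
```

The registry lock is held only for the lookup, so two threads using different converters would still run concurrently, each under its own converter's lock.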
Minor code feedback --
On Thu, Dec 19, 2024 at 5:35 PM Gregorio Litenstein <g.lite...@gmail.com> wrote:
icu::UnicodeString Converter::convertToUTF8(std::string_view sv) {
    std::scoped_lock l(m_lock);
    icu::UnicodeString ret(sv.data(), static_cast<int>(sv.length()), m_converter.get(), m_error);
    if (m_error.isFailure()) throw std::runtime_error("Couldn't convert string: " + std::string(sv) + " to UTF-8. Error: " + std::to_string(m_error.get()) + ": " + m_error.errorName());
    return ret;
}
Misnomer: This converts to UTF-16, not to UTF-8.
std::string UnicodeUtil::convertToUTF8 (std::string_view str, std::string _filename, CaseMapping toCase, bool assumeUTF8) {
    icu::UnicodeString ustring;
    std::string charset;
    if (assumeUTF8) charset = "UTF-8";
    else charset = UnicodeUtil::getCharset(str);
    if (charset != "UTF-8") {
        if (!_filename.empty()) {
            SpdLogger::info(LogSystem::I18N, "Filename={} does not seem to be UTF-8. Detected encoding={}", _filename, charset);
        }
        ustring = UnicodeUtil::getConverter(charset).convertToUTF8(str);
    }
    else { ustring = icu::UnicodeString::fromUTF8(str.data()); }
This line does not work if str contains NUL bytes. Just remove .data() -->
else { ustring = icu::UnicodeString::fromUTF8(str); }
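To illustrate the NUL-byte point without ICU: once only a `const char*` is handed over, the callee has to find the end by scanning for the first NUL, whereas passing the view keeps the explicit length. The sketch below uses `std::string` as a stand-in for the receiving side. The function names are hypothetical, and the assumption that `fromUTF8` behaves analogously when given a bare pointer rests on the comment above rather than on anything checked here.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Length seen by a callee that receives only a pointer: it stops at the
// first NUL byte. Takes a std::string so that data() is NUL-terminated.
std::size_t lengthViaPointer(const std::string& s) {
    return std::string(s.data()).size();
}

// Length seen by a callee that receives pointer plus length (a view):
// embedded NUL bytes are preserved.
std::size_t lengthViaView(const std::string& s) {
    return std::string(std::string_view(s)).size();
}
```

For a five-byte input such as `std::string("ab\0cd", 5)`, the pointer path sees two bytes and the view path sees all five.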
    switch(toCase) {
        case CaseMapping::UPPER:
            ustring.toUpper();
            break;
    ...
Note that case mappings are language-sensitive. Calling these functions without a Locale parameter / locale ID string will use the machine's default locale. You will get different results if your machine is set to Turkish, Dutch, Greek, ...