Is this the most idiomatic, reasonably performant way to normalize a std::string encoded with UTF8 using ICU4C?

prospero

Mar 28, 2025, 11:23:06 PM
to icu-s...@unicode.org
Just want to make sure I'm not missing anything obvious (besides StringPiece having non-explicit constructors for std::string), especially something that could cause an unnecessary degradation in performance:

int normalizeUTF8String(const icu::Normalizer2 *normalizer, const std::string &unnormalized_utf8_string, std::string &normalized_utf8_string)
{
    icu::UnicodeString unnormalized_string = icu::UnicodeString::fromUTF8(icu::StringPiece(unnormalized_utf8_string));

    icu::UnicodeString normalized_string;
    UErrorCode status = U_ZERO_ERROR;
    normalizer->normalize(unnormalized_string, normalized_string, status);
    if (status != U_ZERO_ERROR)
        return -1;

    normalized_string.toUTF8String(normalized_utf8_string);

    return 0;
}

Thanks.

Fredrik Roubert

Mar 31, 2025, 3:54:06 PM
to prospero, icu-s...@unicode.org
On Sat, Mar 29, 2025 at 4:23 AM 'prospero' via icu-support
<icu-s...@unicode.org> wrote:

> icu::UnicodeString unnormalized_string = […]
> icu::UnicodeString normalized_string;

There's no need to convert to and from icu::UnicodeString. You'll
waste fewer computing resources if you skip that step and work
directly with UTF-8, by calling the normalizeUTF8() method instead
of the normalize() method:

https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1Normalizer2.html#acf059465e9ced97d153fd21e5a048a37

icu::StringByteSink<std::string> sink(&normalized_utf8_string);
normalizer->normalizeUTF8(0, unnormalized_utf8_string, sink, nullptr, status);

--
Fredrik Roubert
rou...@google.com

prospero

Apr 7, 2025, 11:48:52 PM
to rou...@google.com, icu-s...@unicode.org
Easily 2-3x faster. Thanks for your help, Fredrik.
