Unexpected encoding issue with _() macro and UTF-8 source in Unicode build (wxWidgets 3.3.1)


Claudio Fabiano Rossato

Nov 14, 2025, 10:23:03 AM
to wx-u...@googlegroups.com

Hello everyone,

I'm encountering an unexpected encoding issue with string literals in a Unicode build of a wxWidgets application, and I’d appreciate any insights.

Environment

  • wxWidgets: 3.3.1

  • OS: Windows 11

  • Source files: UTF-8 with BOM

  • Compiler: Visual Studio 2022 with /utf-8

  • wxWidgets build: Unicode (UTF-16 internally)

Problem

Using the standard _() macro on a UTF-8 string literal containing Italian accented characters results in a corrupted wxString. It looks like the UTF-8 bytes are interpreted as single-byte characters.

Examples

Code:          wxString msg1 = _("Prova di log con accenti: è ò à ì ù");
Debug output:  L"Prova di log con accenti: è ò à ì ù"
Result:        Corrupted

Code:          wxString msg2 = _(wxS("Prova di log con accenti: è ò à ì ù"));
Debug output:  L"Prova di log con accenti: è ò à ì ù"
Result:        Correct
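
The effect described under Problem can be reproduced without wxWidgets. What follows is a minimal sketch in standard C++, not wxWidgets code: the helper names decode_single_byte and decode_utf8 are made up for illustration, and the UTF-8 decoder assumes well-formed input. It decodes the same bytes twice, once byte by byte as a single-byte code page would, and once as UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: treat every byte as one character, the way a
// single-byte code page does. For bytes 0xA0-0xFF, CP1252 and Latin-1
// agree, so the byte value is used directly as the code point
// (the CP1252-specific range 0x80-0x9F is not modelled here).
std::u32string decode_single_byte(const std::string& bytes) {
    std::u32string out;
    for (unsigned char b : bytes) {
        out.push_back(static_cast<char32_t>(b));
    }
    return out;
}

// Hypothetical helper: decode well-formed UTF-8 into code points.
// No validation is done; this is for illustration only.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        // Number of continuation bytes announced by the lead byte.
        const std::size_t extra =
            lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        // Keep only the payload bits of the lead byte.
        char32_t cp = extra == 0 ? lead : (lead & (0x3F >> extra));
        for (std::size_t k = 1; k <= extra && i + k < bytes.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += 1 + extra;
    }
    return out;
}
```

For the UTF-8 bytes of "è" (0xC3 0xA8), decode_utf8 yields the single code point U+00E8, while decode_single_byte yields two code points, U+00C3 and U+00A8, which display as "Ã¨".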

Clarification

I understand that wxS() fixes the issue because it turns the literal into a wide string (L"...").
What confuses me is that I expected _() alone to be sufficient, since the source file is UTF-8 and MSVC is instructed to treat all narrow literals as UTF-8 via /utf-8.

My question

Is it expected that _() still interprets narrow string literals according to the system code page (CP1252) instead of UTF-8, even when the compiler is set to UTF-8?
In other words: in a Unicode build, is using wxS() (or FromUTF8) the intended and required way when dealing with UTF-8 source files?

Thanks a lot for any clarification.

Best regards,
Claudio Rossato

Simon Richter

Nov 14, 2025, 10:47:48 AM
to wx-u...@googlegroups.com, Claudio Fabiano Rossato
Hi,

On 11/15/25 00:22, Claudio Fabiano Rossato wrote:

> Is it expected that |_()| still interprets narrow string literals
> according to the system code page (CP1252) instead of UTF-8, even when
> the compiler is set to UTF-8?

The argument to _() is a narrow string literal that is used for the
lookup of the string in the translation database only. In general it is
assumed to be ASCII text only.

If no entry exists in the database, then the context (if any) is
removed, this string is converted to wxString and returned from the
translation function. In Unicode builds, the wxString will consist of
wxUniChar that can then be converted to the currently active locale's
representation.

The entries in the translation database are converted from the codepage
in the translation file to Unicode, then converted to the active locale
on output.

So any non-ASCII characters must be part of the translation. Quite a few
projects use an English “translation” to get Unicode quote marks.
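
The lookup-with-fallback behaviour described above can be modelled with a toy catalog. This is only a sketch in standard C++ and not how wxWidgets implements its catalogs: Catalog and translate are hypothetical names, the catalog is assumed to hold UTF-8 text, and context handling is left out.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a message catalog:
// ASCII message ID -> translated text stored as UTF-8 bytes.
using Catalog = std::map<std::string, std::string>;

// Return the catalog entry for `id` if there is one; otherwise fall
// back to the ID itself, which is why the ID should be plain ASCII.
std::string translate(const Catalog& catalog, const std::string& id) {
    const auto it = catalog.find(id);
    return it != catalog.end() ? it->second : id;
}

// Example catalog: the accented text lives only on the translation side.
// "\xC3\xA8" is the UTF-8 encoding of "è".
Catalog make_italian_catalog() {
    return Catalog{
        {"Log test with accents", "Prova di log con accenti: \xC3\xA8"},
    };
}
```

With this catalog, translate(make_italian_catalog(), "Log test with accents") returns the Italian text, and an ID that is missing from the catalog comes back unchanged.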

For translated strings, there is no other way -- there isn't even a
guarantee that the file that the translators edit will be UTF-8, e.g. a
Japanese translator might prepare a .po file using SJIS, and
representing diacritics on the input strings would be difficult here.

> In other words: in a Unicode build, is using |wxS()| (or |FromUTF8|) the
> intended and required way when dealing with UTF-8 source files?

In any build, the parameter of _() is a narrow string and the return
value a wxString that can be output, so you never need wxS() or
FromUTF8() for a translated string.

Simon

Claudio Fabiano Rossato

Nov 14, 2025, 11:46:58 AM
to Simon Richter, wx-u...@googlegroups.com

Hi,

Thanks for the explanation — it clarifies the behavior of _().

I understand now that the argument is always treated as an ASCII key. I still find it surprising that this means a program cannot be written in Italian first and then translated to English. It seems that with wxWidgets, multi-language programs must always start with English strings.

Claudio.


Vadim Zeitlin

Nov 14, 2025, 12:07:59 PM
to wx-u...@googlegroups.com
On Fri, 14 Nov 2025 17:46:51 +0100 Claudio Fabiano Rossato wrote:

CFR> Thanks for the explanation — it clarifies the behavior of |_()|.
CFR>
CFR> I understand now that the argument is always treated as an ASCII key. I
CFR> still find it surprising that this means a program cannot be written in
CFR> Italian first and then translated to English. It seems that with
CFR> wxWidgets, multi-language programs must always start with English strings.

This is not quite correct: message IDs can use any encoding, it just needs
to be specified in the message catalog. OTOH _all_ "char*" strings are
assumed to be in the current locale encoding by wxString, and this is not
specific to _() at all. Unfortunately, while the current locale encoding is
almost always UTF-8 under Unix systems (including macOS), this is almost
never the case under Windows. You may set it explicitly, but if you don't
change the current locale, its encoding will be CP1252 and not UTF-8 by
default.

If you're sure that all your strings are in UTF-8, you may build wxWidgets
in UTF-8-only mode, but this is not the default. Alternatively, and
actually recommended if you are not working with a legacy code base where
this can be difficult to do, define wxNO_IMPLICIT_WXSTRING_ENCODING when
building your project (you do _not_ have to rebuild wxWidgets for this) to
get errors for all implicit conversions from "char*" as this will force you
to specify the (hopefully correct) encoding for all of them.

See https://wxwidgets.org/blog/2020/08/implicit_explicit_encoding/ for
more details.
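
The idea behind forbidding implicit "char*" conversions can be illustrated with a toy type. This is a sketch in standard C++ under stated assumptions, not wxString and not what wxNO_IMPLICIT_WXSTRING_ENCODING actually expands to: the class Text and its FromLatin1 factory are hypothetical, and only the name FromUTF8 is borrowed from the thread. The point is that a narrow string can only enter the type through a factory that names its encoding.

```cpp
#include <string>
#include <utility>

// Hypothetical text type (not wxString): stores UTF-8 internally and
// refuses to guess the encoding of a raw narrow string.
class Text {
public:
    // No implicit conversion: Text t = "abc"; does not compile.
    Text(const char*) = delete;

    // The caller states that the bytes are already UTF-8.
    static Text FromUTF8(const std::string& utf8) { return Text(utf8); }

    // The caller states that the bytes are Latin-1: each byte is a
    // code point below U+0100 and is re-encoded as UTF-8.
    static Text FromLatin1(const std::string& latin1) {
        std::string utf8;
        for (unsigned char b : latin1) {
            if (b < 0x80) {
                utf8.push_back(static_cast<char>(b));
            } else {
                utf8.push_back(static_cast<char>(0xC0 | (b >> 6)));
                utf8.push_back(static_cast<char>(0x80 | (b & 0x3F)));
            }
        }
        return Text(std::move(utf8));
    }

    const std::string& utf8() const { return utf8_; }

private:
    explicit Text(std::string utf8) : utf8_(std::move(utf8)) {}
    std::string utf8_;
};
```

Here Text::FromLatin1("\xE8") and Text::FromUTF8("\xC3\xA8") both end up holding the same UTF-8 bytes for "è", because each call says which encoding its input uses instead of leaving it to the current locale.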

Regards,
VZ

--
TT-Solutions: wxWidgets consultancy and technical support
https://www.tt-solutions.com/

Claudio Fabiano Rossato

Nov 17, 2025, 8:51:07 PM
to wx-users

Hi,

Thank you for the clarification — now everything is clear.

Claudio.
