Unicode string manipulation and file I/O

MarioCCCP

Oct 3, 2023, 12:37:51 PM

I have been away from C++ for too long and I am very
confused (particularly about the differences between wchar_t
strings and Unicode strings, like UTF-8).

I need to load / save text files of non-ASCII strings
(containing actual Unicode code points, not represented as
"entities" but as actual characters of variable size).

Does the standard library contain suitable functions for
UTF-8 Unicode string manipulation and input/output on files?

I am trying to use QString from Qt ... but frustratingly it
produces compile errors IN THEIR headers (not yet in my own
code), and I can't find what I am missing. So I was also
looking for a more standard and portable solution.

Thanks for any advice. Please also mention relevant headers, if any.

Thanks again

--
1) Resist, resist, resist.
2) If everyone pays taxes, the taxes are paid by everyone
MarioCPPP

Bonita Montero

Oct 3, 2023, 12:40:27 PM

On 03.10.2023 at 18:37, MarioCCCP wrote:
>
>
> I have been away from C++ for too long and I am very confused
> (particularly with the differences between wchar_t strings and unicode
> strings, like UTF-8).
>
> I'd need to load / save text files of non-Ascii strings (containing
> actual unicode codepoints not represented as "entities" but as actual
> characters, of variable size).
>
> does the standard library contain suitable functions for UTF-8 unicode
> string manipulation and input/output on files ?
>
> I am trying to use QString from QT ... but frustratingly produces
> compile errors IN THEIR headers (not yet in my own code), and I can't
> find what I am missing. So I was also looking for some more standard and
> portable solution.
>
> Tnx for any advice. Pls also mention relevant headers, if any

Maybe this helps:
https://stackoverflow.com/questions/4775437/read-unicode-utf-8-file-into-wstring

MarioCCCP

Oct 3, 2023, 2:14:54 PM

I'll have a look, tnx !
Ciao

James Kuyper

Oct 3, 2023, 11:28:43 PM

On 10/3/23 12:37, MarioCCCP wrote:
>
>
> I have been away from C++ for too long and I am very confused
> (particularly with the differences between wchar_t strings and unicode
> strings, like UTF-8).

The execution character set has a multibyte encoding that is used by
most standard library routines that take arguments as strings of
[[un]signed] char. It could use UTF-8 (or even UTF-16, if CHAR_BIT >=
16), but it is not required to do so.
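Since the narrow encoding is not guaranteed to be UTF-8, one common approach (an assumption here, not something stated above) is to treat file contents as opaque bytes in a std::string and leave decoding for later. A minimal sketch; the names save_utf8, load_utf8, and demo_round_trip and the file name are hypothetical:

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// Write the bytes of `text` to `path` unchanged. Binary mode avoids
// newline translation on platforms that perform it.
bool save_utf8(const std::string& path, const std::string& text) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out.write(text.data(), static_cast<std::streamsize>(text.size()));
    return static_cast<bool>(out);
}

// Read the whole file back as raw bytes. Nothing is decoded or validated,
// and a file that cannot be opened yields an empty string.
std::string load_utf8(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Usage sketch: round-trip "café", where é is the two UTF-8 bytes C3 A9.
void demo_round_trip() {
    const std::string text = "caf\xC3\xA9\n";
    const bool saved = save_utf8("utf8_demo.txt", text);
    assert(saved);
    assert(load_utf8("utf8_demo.txt") == text);
    (void)saved;  // silences unused-variable warnings when NDEBUG is defined
}
```

Note that size() on such a string counts bytes, not characters, so anything that needs code points still requires a decoding step.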

wchar_t uses a fixed-length encoding capable of encoding every supported
character. It could have a Unicode encoding, either UTF-32, or UCS-2 if
the set of supported characters is sufficiently restricted, but it is
not required to.

The C++ standard incorporates by reference functions for converting
between those encodings:
The <cstdlib> header declares routines for converting between wchar_t and
multi-byte strings: std::mbstowcs(), std::wcstombs(), std::mbtowc(), and
std::wctomb(). The <cwchar> header declares their restartable
counterparts: std::wcrtomb(), std::mbrtowc(), std::mbsrtowcs(), and
std::wcsrtombs(). Their definitions are not in the C++ standard itself,
but incorporated by reference from the C standard.
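As a rough illustration of the wchar_t conversions, the sketch below uses the restartable std::mbsrtowcs() and std::wcsrtombs() from <cwchar>. The wrapper names to_wide and to_multibyte are hypothetical, and the result depends on the current C locale: in the default "C" locale only plain ASCII is certain to convert.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>

// Multibyte (current locale's encoding) -> wchar_t.
// Returns false on an invalid sequence. Input stops at the first '\0'.
bool to_wide(const std::string& mb, std::wstring& out) {
    std::mbstate_t state{};
    const char* src = mb.c_str();
    // With a null destination the call only counts the wide characters needed.
    const std::size_t n = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (n == static_cast<std::size_t>(-1)) return false;
    out.assign(n, L'\0');
    state = std::mbstate_t{};
    src = mb.c_str();
    const std::size_t written = std::mbsrtowcs(&out[0], &src, n, &state);
    assert(written == n);  // the second pass must agree with the count
    (void)written;
    return true;
}

// wchar_t -> multibyte (current locale's encoding), same two-pass pattern.
bool to_multibyte(const std::wstring& wide, std::string& out) {
    std::mbstate_t state{};
    const wchar_t* src = wide.c_str();
    const std::size_t n = std::wcsrtombs(nullptr, &src, 0, &state);
    if (n == static_cast<std::size_t>(-1)) return false;
    out.assign(n, '\0');
    state = std::mbstate_t{};
    src = wide.c_str();
    const std::size_t written = std::wcsrtombs(&out[0], &src, n, &state);
    assert(written == n);
    (void)written;
    return true;
}
```

To convert non-ASCII text, a program would normally call std::setlocale(LC_ALL, "") from <clocale> once at startup so that the multibyte encoding follows the user's environment; whether that encoding is UTF-8 is up to the platform.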

char8_t (since C++20), char16_t, and char32_t are newer types used to
store strings in UTF-8, UTF-16, or UTF-32 format. In C++ they are
distinct built-in types; in C they are typedefs.
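To make the difference between the encodings concrete, here is a small sketch that stores the same text in each form and compares the number of code units. The variable and function names are hypothetical. The UTF-8 bytes are written with hex escapes in a plain std::string, because the type of u8"..." literals changed in C++20.

```cpp
#include <cassert>
#include <string>

// "café": é is U+00E9, which takes two bytes (C3 A9) in UTF-8.
const std::string    utf8  = "caf\xC3\xA9";
const std::u16string utf16 = u"caf\u00E9";
const std::u32string utf32 = U"caf\u00E9";

// U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs a
// surrogate pair for it while UTF-32 needs a single code unit.
const std::u16string grin16 = u"\U0001F600";
const std::u32string grin32 = U"\U0001F600";

// size() counts code units of the string's character type, not characters.
void demo_code_units() {
    assert(utf8.size() == 5);
    assert(utf16.size() == 4);
    assert(utf32.size() == 4);
    assert(grin16.size() == 2);
    assert(grin32.size() == 1);
}
```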

The <cuchar> header declares routines for converting between the
multi-byte encoding and any of the UTF-N encodings, with names like
mbrtoc8() and c8rtomb() (and similarly for c16 and c32). Their
definitions are also incorporated by reference from the C standard.
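A sketch of how the <cuchar> routines might be used, here with the c32 variants because mbrtoc8() and c8rtomb() are much newer and less widely available. The wrapper names to_utf32 and from_utf32 are hypothetical, the multibyte side is whatever the current C locale says it is, and some platforms have historically not shipped <cuchar> at all.

```cpp
#include <cassert>
#include <climits>   // MB_LEN_MAX
#include <cstddef>
#include <cuchar>
#include <string>

// Multibyte (current locale's encoding) -> UTF-32, one character at a time.
bool to_utf32(const std::string& mb, std::u32string& out) {
    std::mbstate_t state{};
    const char* p = mb.data();
    const char* end = p + mb.size();
    out.clear();
    while (p < end) {
        char32_t c = 0;
        const std::size_t rc =
            std::mbrtoc32(&c, p, static_cast<std::size_t>(end - p), &state);
        if (rc == static_cast<std::size_t>(-1) ||
            rc == static_cast<std::size_t>(-2)) {
            return false;  // invalid or truncated multibyte sequence
        }
        out.push_back(c);
        if (rc == static_cast<std::size_t>(-3)) continue;  // no input consumed
        p += (rc == 0) ? 1 : rc;  // rc == 0 means a null character was decoded
    }
    return true;
}

// UTF-32 -> multibyte (current locale's encoding).
bool from_utf32(const std::u32string& in, std::string& out) {
    std::mbstate_t state{};
    char buf[MB_LEN_MAX];
    out.clear();
    for (char32_t c : in) {
        const std::size_t rc = std::c32rtomb(buf, c, &state);
        if (rc == static_cast<std::size_t>(-1)) return false;  // not representable
        assert(rc <= sizeof buf);
        out.append(buf, rc);
    }
    return true;
}
```

In the default "C" locale only ASCII input is certain to succeed; with a UTF-8 locale selected through std::setlocale(), the multibyte side would be UTF-8.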

Combining those routines, you can use multi-byte strings as an
intermediary to convert between any pair of the other encodings
mentioned above.
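The combination described above might look like the following sketch, which goes from UTF-32 to wchar_t through a multibyte intermediate. The function name utf32_to_wide is hypothetical, and a character survives only if the current locale's multibyte encoding can represent it.

```cpp
#include <cassert>
#include <climits>   // MB_LEN_MAX
#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <string>

bool utf32_to_wide(const std::u32string& in, std::wstring& out) {
    // Step 1: UTF-32 -> multibyte, using c32rtomb() from <cuchar>.
    std::string mb;
    std::mbstate_t state{};
    char buf[MB_LEN_MAX];
    for (char32_t c : in) {
        const std::size_t n = std::c32rtomb(buf, c, &state);
        if (n == static_cast<std::size_t>(-1)) return false;
        assert(n <= sizeof buf);
        mb.append(buf, n);
    }

    // Step 2: multibyte -> wchar_t, using mbrtowc() from <cwchar>.
    out.clear();
    state = std::mbstate_t{};
    const char* p = mb.data();
    const char* end = p + mb.size();
    while (p < end) {
        wchar_t wc = L'\0';
        const std::size_t n =
            std::mbrtowc(&wc, p, static_cast<std::size_t>(end - p), &state);
        if (n == static_cast<std::size_t>(-1) ||
            n == static_cast<std::size_t>(-2)) {
            return false;  // invalid or truncated multibyte sequence
        }
        out.push_back(wc);
        p += (n == 0) ? 1 : n;  // n == 0 means a null character was decoded
    }
    return true;
}
```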

MarioCCCP

Oct 4, 2023, 12:24:50 PM

Thanks for the detailed compendium. I'll try to go through all
this!
Ciao