
How to convert strings?


Szyk Cech

Aug 17, 2019, 7:10:22 AM
Hello!

I want to write string conversion functions:
std::wstring <--> unsigned char
where first is UTF-32, but second can be with any encoding.

I want to have functions like this:

std::wstring gRawToUnicode(std::vector<unsigned char> aString,
std::wstring aEncoding);
std::vector<unsigned char> gUnicodeToRaw(std::wstring aString,
std::wstring aEncoding);

Important to me is ability to handle any input encoding (defined as
wstring) because I want to use these functions in future versions of my
text editor.

I have two questions:
1. Is this possible to make it in pure C++ (stl based)?!?
2. Do I have to use ICU library for this?!?

ad1. If so, please give me examples:
+ How to get list of all supported encodings?
+ How to convert strings in pure C++ when we know only input/output
format?!? So I don't want example with hardcoded input/output encoding -
I want to handle any input format and any output format (according to my
functions).

Thanks in advance! And best regards!
Szyk Cech

Alf P. Steinbach

Aug 17, 2019, 7:36:31 AM
On 17.08.2019 13:10, Szyk Cech wrote:
>
> I want to write string conversion functions:
> std::wstring <--> unsigned char
> where first is UTF-32, but second can be with any encoding.

As I read it, you want to convert between UTF-32 and any byte-oriented
encoding.

That's a noble goal.

You could look into what functionality is used by e.g. the Scintilla
component, but I remember from making a Notepad++ extension for UTF-8 as
default that it's rather messy and generally ungood.


> I want to have functions like this:
>
> std::wstring gRawToUnicode(std::vector<unsigned char> aString,
> std::wstring aEncoding);
> std::vector<unsigned char> gUnicodeToRaw(std::wstring aString,
> std::wstring aEncoding);

Surely not as the bottom foundation.

Stringly typed stuff belongs up near the user, invoking strongly typed
stuff below.

A reasonable approach to bridge the gap between user interface stringly
typed (e.g. where an editor has a textual command interface somewhere),
and internal strongly typed, can be to use an encoding id string as a
key to a repository of converters, which then hands you a converter for
that encoding, or fails to find one.
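
A minimal sketch of such a repository, to make the idea concrete. All names here (Bytes, Decoder, ConverterRegistry, decodeLatin1) are hypothetical and not taken from any existing library, and only the decoding direction is shown:

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; none of this comes from an existing library.
using Bytes = std::vector<unsigned char>;
using Decoder = std::function<std::u32string(const Bytes&)>;

class ConverterRegistry {
public:
    // Registers (or replaces) the decoder for an encoding id.
    void add(const std::string& id, Decoder d) { decoders_[id] = std::move(d); }

    // Returns a pointer to the decoder, or nullptr when none is registered.
    const Decoder* find(const std::string& id) const {
        const auto it = decoders_.find(id);
        return it == decoders_.end() ? nullptr : &it->second;
    }

private:
    std::map<std::string, Decoder> decoders_;
};

// ISO-8859-1 maps each byte to the code point with the same value.
inline std::u32string decodeLatin1(const Bytes& in) {
    return std::u32string(in.begin(), in.end());
}
```

The stringly typed lookup happens once, at the edge; everything after that works with a typed Decoder, and a failed lookup is an ordinary null check instead of an error deep inside the conversion.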

I'm sure there's a pattern name for that.

Like inversion or some silly name like that.


> Important to me is ability to handle any input encoding (defined as
> wstring) because I want to use these functions in future versions of my
> text editor.
>
> I have two questions:
> 1. Is this possible to make it in pure C++ (stl based)?!?

Yes, but then you have to implement almost all of it yourself.

The standard library supports only two general kinds of encoding
conversion: between wide text and the locale's multibyte strings, and
between the various UTF encodings.

The latter set of conversions has been deprecated, and they do not make
for very portable code anyway if they're used directly, even though
they're still “standard” as of C++17. E.g. g++ and MSVC differ in (1)
where they stop on detecting an input error, and (2) the endianness
(!) of the result.
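
To give a feel for what doing it yourself involves, here is a sketch of a hand-written UTF-8 to UTF-32 decoder. The name decodeUtf8 and the choice to throw on malformed input are assumptions for illustration; a real editor would more likely want replacement characters and error positions:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Decodes UTF-8 bytes into UTF-32 code points; throws on malformed input.
// decodeUtf8 is a hypothetical name, not a standard-library function.
inline std::u32string decodeUtf8(const std::vector<unsigned char>& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = in[i];
        char32_t cp = 0;
        std::size_t extra = 0;   // number of continuation bytes expected
        char32_t lowest = 0;     // smallest code point allowed for this length
        if (lead < 0x80)                { cp = lead;        extra = 0; lowest = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; lowest = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; lowest = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; lowest = 0x10000; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (in.size() - i - 1 < extra)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = in[i + k];
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("overlong or out-of-range sequence");
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

That is one direction of one encoding, with the simplest possible error policy, which is roughly why covering "any encoding" this way adds up quickly.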


> 2. Do I have to use ICU library for this?!?

Yes, in practice.


> ad1. If so, please give me examples:
> + How to get list of all supported encodings?
> + How to convert strings in pure C++ when we know only input/output
> format?!? So I don't want example with hardcoded input/output encoding -
> I want to handle any input format and any output format (according to my
> functions).

I know what I would do: I would just start doing it.

But since I haven't done it I can't help you other than just noting that
diving into stuff like that is in general both (1) much easier than you
thought, and (2) much more labor intensive, like orders of magnitude
more work, than you thought.

You have hereby been motivated and warned.


Cheers!,

- Alf

Sam

Aug 17, 2019, 9:15:03 AM
Szyk Cech writes:

> Hello!
>
> I want to write string conversion functions:
> std::wstring <--> unsigned char
> where first is UTF-32, but second can be with any encoding.
>
> I want to have functions like this:
>
> std::wstring gRawToUnicode(std::vector<unsigned char> aString, std::wstring
> aEncoding);
> std::vector<unsigned char> gUnicodeToRaw(std::wstring aString, std::wstring
> aEncoding);
>
> Important to me is ability to handle any input encoding (defined as wstring)
> because I want to use these functions in future versions of my text editor.
>
> I have two questions:
> 1. Is this possible to make it in pure C++ (stl based)?!?

Yes. This is done by using the std::codecvt facet, see
<URL:https://en.cppreference.com/w/cpp/locale/codecvt>.
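
For illustration, a sketch of driving a codecvt facet directly. It uses the std::codecvt<char32_t, char, std::mbstate_t> specialization, which converts between UTF-8 and UTF-32 only (and was deprecated in C++20); other encodings would have to come from a locale's codecvt<wchar_t, char, std::mbstate_t> facet, whose availability depends on the locales installed. The helper name utf8ToUtf32 is hypothetical:

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// utf8ToUtf32 is a hypothetical helper name, not part of the standard library.
inline std::u32string utf8ToUtf32(const std::string& in) {
    if (in.empty()) return std::u32string();

    using Facet = std::codecvt<char32_t, char, std::mbstate_t>;
    const Facet& cvt = std::use_facet<Facet>(std::locale::classic());

    std::mbstate_t state = std::mbstate_t();
    // UTF-32 never needs more code units than the UTF-8 input has bytes.
    std::u32string out(in.size(), U'\0');
    const char* fromNext = nullptr;
    char32_t* toNext = nullptr;

    const std::codecvt_base::result r = cvt.in(
        state,
        in.data(), in.data() + in.size(), fromNext,
        &out[0], &out[0] + out.size(), toNext);

    if (r != std::codecvt_base::ok)
        throw std::runtime_error("conversion failed");

    out.resize(static_cast<std::size_t>(toNext - &out[0]));
    return out;
}
```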

> 2. Do I have to use ICU library for this?!?

No, but I find the C++ library's implementation of this functionality to be
unnecessarily convoluted, and a royal pain. Using a third party library will
likely be easier. I use iconv, which seems to come standard as part of
glibc, on Linux. I'm sure that MS-Windows has its own interface you can use,
if you are unfortunate enough to be using C++ on MS-Windows. You should be
able to find some documentation on that in MSDN.

> ad1. If so, please give me examples:
> + How to get list of all supported encodings?

This is not supported by the C++ library itself. The C++ library expects you
to know which encoding you want to use, and then you use its arcane
interface to do the conversion.

> + How to convert strings in pure C++ when we know only input/output
> format?!? So I don't want example with hardcoded input/output encoding - I
> want to handle any input format and any output format (according to my
> functions).

I've given you some search terms, above. Google should be able to find
plenty of examples.

Paavo Helde

Aug 17, 2019, 11:05:53 AM
On 17.08.2019 14:10, Szyk Cech wrote:
> Hello!
>
> I want to write string conversion functions:
> std::wstring <--> unsigned char
> where first is UTF-32, but second can be with any encoding.

On Windows, wchar_t is 16 bits, so std::wstring is most probably UTF-16
(the native Windows string encoding), not UTF-32.
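
To show what that difference means in practice, here is a sketch that re-encodes UTF-32 as UTF-16: any code point above U+FFFF takes two 16-bit units (a surrogate pair), so a 16-bit wchar_t cannot hold it in one element. The name toUtf16 is hypothetical, and the input is assumed to contain only valid code points:

```cpp
#include <string>

// Encodes UTF-32 as UTF-16; toUtf16 is a hypothetical name.
// Assumes valid input: no surrogate code points, nothing above U+10FFFF.
inline std::u16string toUtf16(const std::u32string& in) {
    std::u16string out;
    for (const char32_t cp : in) {
        if (cp < 0x10000) {
            // Fits in a single 16-bit unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            const char32_t v = cp - 0x10000;   // 20 significant bits remain
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
        }
    }
    return out;
}
```

Code that must build on both platforms can check sizeof(wchar_t) at compile time, or sidestep the question by using std::u32string internally.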

If you are interested in Linux/POSIX only, then use iconv (man
iconv_open et al). Note that this is an extensible interface: the glibc
base implementation supports only a few encodings, whereas the glibc-locale
package adds support for a wide variety of encodings. One can type

iconv --list

to see what is supported by the current Linux installation.
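
A sketch of how the iconv calls fit together when both encoding names are only known at run time. The helper name convertEncoding and the error handling (throwing std::runtime_error) are assumptions for illustration; iconv_open, iconv and iconv_close are the POSIX functions:

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Converts bytes between two encodings named at run time.
// convertEncoding is a hypothetical helper name; the iconv calls are POSIX.
inline std::vector<unsigned char> convertEncoding(
        const std::vector<unsigned char>& in,
        const std::string& from, const std::string& to) {
    iconv_t cd = iconv_open(to.c_str(), from.c_str());   // note the order: to, from
    if (cd == reinterpret_cast<iconv_t>(-1))
        throw std::runtime_error("unsupported conversion: " + from + " -> " + to);

    std::vector<unsigned char> out;
    std::vector<char> inBuf(in.begin(), in.end());
    char* src = inBuf.data();
    std::size_t srcLeft = inBuf.size();
    char chunk[256];

    while (srcLeft > 0) {
        char* dst = chunk;
        std::size_t dstLeft = sizeof chunk;
        const std::size_t rc = iconv(cd, &src, &srcLeft, &dst, &dstLeft);
        const int err = errno;               // save before anything can change it
        out.insert(out.end(), chunk, dst);   // keep whatever was produced
        // E2BIG only means the chunk is full; any other error is fatal here.
        if (rc == static_cast<std::size_t>(-1) && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("invalid or incomplete input sequence");
        }
    }

    // Flush any pending shift sequence (matters for stateful encodings).
    char* dst = chunk;
    std::size_t dstLeft = sizeof chunk;
    iconv(cd, nullptr, nullptr, &dst, &dstLeft);
    out.insert(out.end(), chunk, dst);

    iconv_close(cd);
    return out;
}
```

For example, converting from "UTF-8" to "UCS-4LE" yields four little-endian bytes per code point. Which names are accepted depends on the installation, as the iconv --list output shows.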

On Windows one should use its own native SDK functions like
MultiByteToWideChar() et al.

>
> Important to me is ability to handle any input encoding (defined as
> wstring)

Most encodings use bytes, not wchar_t. wchar_t usually implies a
single fixed encoding (UTF-32 or UTF-16, depending on platform).

> because I want to use these functions in future versions of my
> text editor.

If this is a plain/code text editor, I strongly suggest incorporating
the Scintilla text editor control as the main component; this would
probably save 90% of the work.

Cheers
Paavo
