As I understand it, std::string, std::wstring, std::u16string and std::u32string are mainly intended to store UTF strings. Since at least UTF-8 and UTF-16 have variable storage sizes for their code points, the question of the "size of a string" storing either encoding is ambiguous:
- On one hand there is the storage size of the string, which can be obtained by either std::basic_string::size() or std::basic_string::length(). [Why two methods to get the same result?]
- On the other hand there is the number of stored code points, sadly without a standard way to ask for it. Or have I missed something? (A hand-rolled way of counting them is sketched just after this list.)
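For illustration, a minimal sketch of counting code points by hand in a UTF-8 std::string. count_code_points is a hypothetical helper, assuming well-formed UTF-8; nothing like it exists in the standard library:

    #include <cstddef>
    #include <iostream>
    #include <string>

    // Count UTF-8 code points by skipping continuation bytes
    // (those of the form 10xxxxxx). Assumes well-formed UTF-8.
    std::size_t count_code_points(const std::string& s)
    {
        std::size_t n = 0;
        for (unsigned char c : s)
            if ((c & 0xC0) != 0x80)  // lead byte or ASCII, not a continuation
                ++n;
        return n;
    }

    int main()
    {
        std::string s = "z\xC3\x9F\xE6\xB0\xB4";   // "zß水" as UTF-8 bytes
        std::cout << s.size() << '\n';             // 6: storage size in bytes
        std::cout << count_code_points(s) << '\n'; // 3: code points
    }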
Code points are not symbols. Many Unicode code points are combining characters, which means that when you visualize them, that code point and the preceding one are visually one "symbol". So assuming that the number of code points and the number of visible symbols are the same doesn't work in general.
Perhaps it does in your specific case. Perhaps your case forbids combining characters, or your renderer doesn't render them properly. But that doesn't change how they work as far as the Unicode standard is concerned.
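A small illustration of the difference, assuming a terminal that renders combining marks:

    #include <iostream>
    #include <string>

    int main()
    {
        // U+0065 'e' followed by U+0301 COMBINING ACUTE ACCENT:
        // two code points that render as the single symbol "é".
        std::u16string decomposed = u"e\u0301";
        std::cout << decomposed.size() << '\n';  // 2

        // The precomposed U+00E9 is one code point for the same symbol.
        std::u16string precomposed = u"\u00e9";
        std::cout << precomposed.size() << '\n';  // 1
    }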
On Saturday, 24 June 2017 11:36:39 PDT Tiger wrote:
> It would be quite simple to add a new trait telling the used encoding [even
> if it might not be possible to enforce it].
No, it wouldn't, because std::string (which is nothing more than
std::basic_string<char, std::char_traits<char>>) would be different from the
type that has the UTF-8 trait.
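To make that concrete, here is a sketch with a hypothetical utf8_traits tag; the resulting basic_string instantiation is a distinct type, unrelated to std::string:

    #include <string>

    // Hypothetical traits class carrying "this holds UTF-8" as a
    // type-level tag. It changes nothing about behaviour, only the
    // type identity of the resulting string.
    struct utf8_traits : std::char_traits<char> {};

    using utf8_string = std::basic_string<char, utf8_traits>;

    int main()
    {
        std::string plain = "hello";
        utf8_string tagged(plain.data(), plain.size());

        // plain = tagged;  // error: no conversion between the two types
    }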
> > Since at least utf-8 and utf-16 have variable storage sizes for their code
> > points the question for the "size of a string" storing either encoding is
> > ambiguous:
> > - On one hand there is the storage size of the string, which can be
> > obtained by either std::basic_string::size() or
> > std::basic_string::length().
> > [Why two methods to get the same result?]
> >
> > OK, just historical reasons, can be accepted. Although it might be worth a
> > thought if it's time to mark one of them [apparently
> > std::basic_string::length()] as deprecated?
Why? Is one of them wrong? Are they confusing?
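For what it's worth, they agree by definition; a quick sanity check:

    #include <cassert>
    #include <string>

    int main()
    {
        std::string s = "Tiger";
        // size() and length() are exact synonyms: both return the
        // number of char units stored, not code points or symbols.
        assert(s.size() == s.length());
        assert(s.size() == 5);
    }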
Yup, just as I predicted, you're asking for the wrong thing. You do NOT want
the number of codepoints.
You want the width of a string in a monospace font, measured in "ex" or "em"
(they're the same in a monospace font).
std::u16string(u"Tiger").size() = 10 (not 5)
std::u16string(u"Tiger").size() = 5 (not 10)