size of strings

Tiger

Jun 23, 2017, 10:28:02 PM
to ISO C++ Standard - Discussion

As I understand it, std::string, std::wstring, std::u16string and std::u32string are mainly intended to store UTF-encoded strings.
Since at least UTF-8 and UTF-16 use a variable number of code units per code point, the question of the "size of a string" in either encoding is ambiguous:

  • On one hand there is the storage size of the string, which can be obtained by either std::basic_string::size() or std::basic_string::length().
    [Why two methods to get the same result?]
  • On the other hand there is the number of stored code points.
    Sadly there is no standard way to ask for it, or have I missed something?

Thiago Macieira

Jun 23, 2017, 10:31:47 PM
to std-dis...@isocpp.org
On Friday, 23 June 2017 19:28:02 PDT Tiger wrote:
> As I understand it std::string, std::wstring, std::u16string and
> std::u32string are mainly intended to store utf-strings.

You're correct only for std::u16string and std::u32string. std::string and
std::wstring don't have that requirement. std::wstring is supposed to use the
wide-character encoding, whichever that is (though it's UTF-16 or UTF-32 in
all modern compilers).

std::string can store any content, in any encoding. It neither enforces nor
requires UTF-8, and it is not even expected to be used only with UTF-8.

> Since at least utf-8 and utf-16 have variable storage sizes for their code
> points the question for the "size of a string" storing either encoding is
> ambiguous:
>
> - On one hand there is the storage size of the string, which can be
> obtained by either std::basic_string::size() or
> std::basic_string::length().
> [Why two methods to get the same result?]
> - On the other hand the amount of stored code points.
> Sadly without a standard way to ask for, or have I missed something?

Why do you want to know the number of code points? What's your use-case?

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center

Nicol Bolas

Jun 24, 2017, 10:31:07 AM
to ISO C++ Standard - Discussion
On Friday, June 23, 2017 at 10:28:02 PM UTC-4, Tiger wrote:

> As I understand it std::string, std::wstring, std::u16string and std::u32string are mainly intended to store utf-strings.


No, they can store Unicode-encoded strings; they are not "intended" to do so. With `u16` and `u32`, you can reasonably assume that they do. But C++ has no way to tell if a `std::string` stores a UTF-8-encoded string, so that's up to convention.

> Since at least utf-8 and utf-16 have variable storage sizes for their code points the question for the "size of a string" storing either encoding is ambiguous:
>
> - On one hand there is the storage size of the string, which can be obtained by either std::basic_string::size() or std::basic_string::length().
> [Why two methods to get the same result?]

Because of the history of `std::basic_string`. Originally, `std::string` was taken from a regular old string library, and that library used "length", since that's pretty common for string types. But when it was standardized, the string type started adopting STL-style methods: `begin`, `end`, and so forth. And the STL standardized the sizing method with the name `size`.

Also, `basic_string` predates any form of Unicode support. It was standardized that `length` would be identical to `size`, and therefore had to be O(1). You can't just change that on people.
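A minimal sketch of that guarantee, using hypothetical helper names that are not part of any library: `size()` and `length()` always agree, and both count `char` code units rather than code points.

```cpp
#include <cstddef>
#include <string>

// size() and length() are required to return the same value, so this
// is true for every std::string.
inline bool size_equals_length(const std::string& s) {
    return s.size() == s.length();
}

// "\xC3\xA9" is the UTF-8 encoding of U+00E9: one code point stored in
// two char code units. size() reports the code units.
inline std::size_t code_units_in_e_acute() {
    const std::string e_acute = "\xC3\xA9";
    return e_acute.size();
}
```

Here `code_units_in_e_acute()` yields 2 even though the string holds a single code point.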
> - On the other hand the amount of stored code points.

I have to repeat Thiago's question: why do you need the number of codepoints? I've asked this question of many people, and not once have I gotten an answer that did not involve an inefficient algorithm.

For example, you might want the number of codepoints so that you could convert the Unicode-encoded string into an array of codepoints. But since counting the number of codepoints requires decoding the encoded sequence, it's faster if you just `push_back`ed them into a `vector/string`, rather than getting an exact count first. You can get a reasonable guess as to the number of codepoints from the length of the encoded sequence to `reserve` some memory. But the cost of occasionally having to do two memory allocations will overall be lower than the cost of converting the code unit sequence twice. 
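A rough sketch of the single-pass approach, assuming the input is well-formed UTF-8 (the hypothetical `decode_utf8` below does no validation). It reserves from the byte length and appends as it decodes; the number of code points is then simply the size of the result.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 into code points in one pass.
// Assumption: `utf8` is well-formed; malformed input is not detected.
inline std::u32string decode_utf8(const std::string& utf8) {
    std::u32string out;
    // Every code point occupies at least one byte, so the byte count is
    // an upper bound on the number of code points.
    out.reserve(utf8.size());
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp;
        std::size_t len;
        if (lead < 0x80)      { cp = lead;        len = 1; }  // ASCII
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }  // 2-byte sequence
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }  // 3-byte sequence
        else                  { cp = lead & 0x07; len = 4; }  // 4-byte sequence
        // Fold in the six payload bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Reserving the full byte count trades some memory for never reallocating; a smaller guess would save memory at the cost of an occasional second allocation, which is the trade-off described above.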
> - Sadly without a standard way to ask for, or have I missed something?

C++ doesn't really provide much Unicode support at present.

Thiago Macieira

Jun 24, 2017, 12:18:10 PM
to std-dis...@isocpp.org
On Saturday, 24 June 2017 07:31:07 PDT Nicol Bolas wrote:
> > I have to repeat Thiago's question: why do you need the number of
> > codepoints? I've asked this question of many people, and not once have I
> > gotten an answer that did not involve an inefficient algorithm.
>
> For example, you might want the number of codepoints so that you could
> convert the Unicode-encoded string into an array of codepoints. But since
> counting the number of codepoints *requires* decoding the encoded sequence,
> it's faster if you just `push_back`ed them into a `vector/string`, rather
> than getting an exact count first. You can get a reasonable guess as to the
> number of codepoints from the length of the encoded sequence to `reserve`
> some memory. But the cost of occasionally having to do two memory
> allocations will overall be lower than the cost of converting the code unit
> sequence twice.

Right, that would be about the only valid case for knowing the number of code
points: conversion to UTF-32. But like Nicol is saying, counting the number of
code points *is* the conversion, so you may as well just store the result.

I'm expecting the answer to "why do you need the number of codepoints" to be a
misunderstanding of Unicode or useless for the intended problem. Both show up
in what people usually want the count for, which is calculating text width:
* in monospace fonts, they expect 1 codepoint = 1 cell, but that's wrong when
you consider zero- or double-width characters
* in variable width fonts, the width of x and m is not the same

So Qt, where QString *is* UTF-16 only, does not provide a width function. The
two cases above would be solved by QFontMetrics, which requires specifying
which font you mean and gives you a width in pixels. There's also a
convenience function to insert "..." and cut the string.

Tiger

Jun 24, 2017, 2:36:40 PM
to ISO C++ Standard - Discussion

> As I understand it std::string, std::wstring, std::u16string and std::u32string are mainly intended to store utf-strings.

I never said anything about "required"...
And of course you may still store plain ASCII or ISO Latin-X in a std::string or even in a std::u32string, or, if you're really so inclined, even just some binary junk...
That C++ itself has no way to know the encoding in use goes without saying, but what
are the std::char_traits for?
It would be quite simple to add a new trait naming the encoding in use [even if it might not be possible to enforce it].

> Since at least utf-8 and utf-16 have variable storage sizes for their code points the question for the "size of a string" storing either encoding is ambiguous:
>
> - On one hand there is the storage size of the string, which can be obtained by either std::basic_string::size() or std::basic_string::length().
> [Why two methods to get the same result?]

OK, just historical reasons; that can be accepted. Although it might be worth considering whether it's time to mark one of them [apparently std::basic_string::length()] as deprecated?

> - On the other hand the amount of stored code points.
> Sadly without a standard way to ask for, or have I missed something?

I noticed this "flaw" while doing some preliminary work on modernizing an LPC interpreter.
In LPC it's standard to use sprintf(...) to format the required output, and more often than not you need to know the length of a particular substring...
Length as in the number of symbols (code points)...
 

Nicol Bolas

Jun 24, 2017, 2:58:45 PM
to ISO C++ Standard - Discussion

Well, Thiago called it ;)

Code points are not symbols. Many Unicode code points are combining characters, which means that when you visualize them, that code point and the preceding one are visually one "symbol". So assuming that the number of code points and the number of visible symbols are the same doesn't work in general.

Perhaps it does, in your specific case. Perhaps your case forbids combining characters, or your renderer doesn't render them properly. But that doesn't change how they work as far as the Unicode standard is concerned.
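To make the combining-character point concrete, here is a small sketch with two hypothetical helper functions. Both strings display as a single "é", yet they hold different numbers of code points.

```cpp
#include <string>

// U+00E9 LATIN SMALL LETTER E WITH ACUTE: one code point, one symbol.
inline std::u32string precomposed_e_acute() {
    return U"\u00E9";
}

// 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points that a
// renderer draws as one symbol.
inline std::u32string decomposed_e_acute() {
    return U"e\u0301";
}
```

In UTF-32 `size()` does equal the number of code points (1 and 2 here), and that number still fails to match the one symbol a reader sees in both cases.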

Thiago Macieira

Jun 24, 2017, 3:14:17 PM
to std-dis...@isocpp.org
On Saturday, 24 June 2017 11:36:39 PDT Tiger wrote:
> It would be quiet simple to add a new trait telling the used encoding [even
> if it might not possible to enforce it].

No, it wouldn't, because std::string (which is nothing more than
std::basic_string<char, std::char_traits<char>>) would be different from the
type that has the UTF-8 trait.
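As an illustration only: `utf8_char_traits`, `utf8_string` and `to_utf8_string` below are hypothetical names, not anything the standard provides. The sketch shows that a string with its own traits class is a distinct type from std::string, so values have to be copied across explicitly.

```cpp
#include <string>
#include <type_traits>

// Hypothetical traits class: behaves exactly like std::char_traits<char>
// and only serves as a label saying "this string holds UTF-8".
// Nothing enforces that the contents really are UTF-8.
struct utf8_char_traits : std::char_traits<char> {};

using utf8_string = std::basic_string<char, utf8_char_traits>;

// The traits class is part of the type, so the two are unrelated types.
static_assert(!std::is_same<std::string, utf8_string>::value,
              "a different traits class yields a different string type");

// Moving data between the two types needs an explicit copy.
inline utf8_string to_utf8_string(const std::string& s) {
    return utf8_string(s.data(), s.size());
}
```

Every existing function that takes a `const std::string&` would refuse a `utf8_string`, which is why adding such a trait is far from a simple change.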

> > Since at least utf-8 and utf-16 have variable storage sizes for their code
> > points the question for the "size of a string" storing either encoding is
> > ambiguous:
> > - On one hand there is the storage size of the string, which can be
> > obtained by either std::basic_string::size() or
> > std::basic_string::length().
> > [Why two methods to get the same result?]
> >
> > OK, just historical reasons, can be accepted. Although it might be worth a
> > thought if it´s time to mark one of them [apparently
> > std::basic_string::length()] as deprecated?

Why? Is one of them wrong? Are they confusing?

QString has not two, but three ways to get the number of elements in a string:
length()
size()
count()

We're going to deprecate the third one because it *is* confusing, since
there's an overload count(QChar) that returns the number of matching
characters. The other two shall remain because there's no issue with them.

> > - On the other hand the amount of stored code points.
> > Sadly without a standard way to ask for, or have I missed something?
> >
> > I noticed this "flaw" when doing some preliminary work in modernizing a
>
> LPC <https://en.wikipedia.org/wiki/LPC_(programming_language)>-interpreter.
> In LPC it´s standard to use sprintf(...) to format the requiered output and
> it´s more often than not requiered to know the length of a particular
> substring....
> Length as in amount of symbols (code points)...

Yup, just as I predicted, you're asking for the wrong thing. You do NOT want
the number of codepoints.

You want the width of a string in a monospace font, measured in "ex" or "em"
(they're the same in a monospace font).
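A deliberately incomplete sketch of what "width in cells" means. The `cell_width` and `display_cells` functions are hypothetical, and the ranges below cover only a handful of zero-width and fullwidth code points; real code would rely on complete Unicode width data or on the rendering library.

```cpp
#include <cstddef>
#include <string>

// Width of one code point in monospace cells.
// Assumption: only three illustrative ranges are classified here;
// everything else is treated as one cell, which is wrong in general.
inline int cell_width(char32_t cp) {
    if (cp >= 0x0300 && cp <= 0x036F) return 0;  // combining diacritical marks
    if (cp >= 0x200B && cp <= 0x200D) return 0;  // zero-width space, non-joiner, joiner
    if (cp >= 0xFF01 && cp <= 0xFF60) return 2;  // fullwidth ASCII variants
    return 1;
}

// Sums the cell widths of all code points in `text`.
inline std::size_t display_cells(const std::u32string& text) {
    std::size_t cells = 0;
    for (char32_t cp : text) {
        cells += static_cast<std::size_t>(cell_width(cp));
    }
    return cells;
}
```

Even in this toy version the result differs from the code point count as soon as a combining mark or a fullwidth letter appears.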

Tiger

Jun 27, 2017, 8:46:07 PM
to ISO C++ Standard - Discussion
On Saturday, 24 June 2017 20:58:45 UTC+2, Nicol Bolas wrote:
> Code points are not symbols. Many Unicode code points are combining characters, which means that when you visualize them, that code point and the preceding one are visually one "symbol". So assuming that the number of code points and the number of visible symbols are the same doesn't work in general.
>
> Perhaps it does, in your specific case. Perhaps your case forbids combining characters, or your renderer doesn't render them properly. But that doesn't change how they work as far as the Unicode standard is concerned.

Ouch...
Forgot about those :(

On Saturday, 24 June 2017 21:14:17 UTC+2, Thiago Macieira wrote:
> On Saturday, 24 June 2017 11:36:39 PDT Tiger wrote:
> > It would be quite simple to add a new trait naming the encoding in use [even
> > if it might not be possible to enforce it].
>
> No, it wouldn't, because std::string (which is nothing more than
> std::basic_string<char, std::char_traits<char>>) would be different from the
> type that has the UTF-8 trait.

Grmbl...
Obviously a half-baked idea of mine...

> > Since at least utf-8 and utf-16 have variable storage sizes for their code
> > points the question for the "size of a string" storing either encoding is
> > ambiguous:
> >    - On one hand there is the storage size of the string, which can be
> >    obtained by either std::basic_string::size() or
> >    std::basic_string::length().
> >    [Why two methods to get the same result?]
> >
> > OK, just historical reasons, can be accepted. Although it might be worth a
> > thought if  it´s time to mark one of them [apparently
> > std::basic_string::length()] as deprecated?

> Why? Is one of them wrong? Are they confusing?

I was thinking more along the lines of one of them being superfluous...
 
> Yup, just as I predicted, you're asking for the wrong thing. You do NOT want
> the number of codepoints.
>
> You want the width of a string in a monospace font, measured in "ex" or "em"
> (they're the same in a monospace font).

Sometimes even the best-considered ideas aren't as straightforward as one thinks, and this idea obviously wasn't as well thought through as I believed :(

Thiago Macieira

Jun 27, 2017, 9:01:15 PM
to std-dis...@isocpp.org
On Tuesday, 27 June 2017 17:46:07 PDT Tiger wrote:
> Ouch...
> Forgot about those

You also forgot the zero-width and full-width characters.

std::u16string(u"T‍i‍g‍e‍r‍").size() = 10 (not 5)
std::u16string(u"Ｔｉｇｅｒ").size() = 5 (not 10)
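The same two cases written with explicit escapes, so the invisible characters can be seen in the source. This is a sketch with hypothetical function names, and it assumes the second example was meant to use fullwidth letters (U+FF34 and friends), which look ten cells wide.

```cpp
#include <string>

// "Tiger" with U+200D ZERO WIDTH JOINER after every letter:
// displays about 5 cells wide but stores 10 UTF-16 code units.
inline std::u16string tiger_with_joiners() {
    return u"T\u200Di\u200Dg\u200De\u200Dr\u200D";
}

// "Tiger" in fullwidth letters (U+FF34 U+FF49 U+FF47 U+FF45 U+FF52):
// displays about 10 cells wide but stores 5 UTF-16 code units.
inline std::u16string tiger_fullwidth() {
    return u"\uFF34\uFF49\uFF47\uFF45\uFF52";
}
```

In both cases `size()` is a poor predictor of the displayed width, in opposite directions.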

Tiger

Jun 28, 2017, 11:00:50 PM
to ISO C++ Standard - Discussion


On Wednesday, 28 June 2017 03:01:15 UTC+2, Thiago Macieira wrote:
 You also forgot the zero-width and full-width characters.

std::u16string(u"T‍i‍g‍e‍r‍").size() = 10         (not 5)
std::u16string(u"Ｔｉｇｅｒ").size()  =  5        (not 10)

While I did indeed forget about the zero-width characters, I hadn't come across the half-/full-width characters before :(
But at least I see now that I've got to do this another way...