On 19.04.19 at 20:58, Bonita Montero wrote:
>> Some string implementations have handling for fast in place storage
>> for short strings.
>
> Short strings would fit only for very short strings.
Reserving unnecessarily large storage for every string (as fixed-size
strings do) can have a large impact as well. In my experience this is
either counterproductive or just a bad workaround for other problems.
>>> On the
>>> other side, there are many standard-library facilities that only accept
>>> a basic_string so this type of string would be incompatible.
>
>> Exactly. And converting strings over and over (with allocations) can
>> have much more impact than the optimization gain.
>
> There would be no additional conversions.
You just mentioned that library functions will not be compatible with
whatever other string implementation you use. This is an issue.
>> As long as there is no need to pass /mutable/ strings to library
>> functions the use of const char* is the least common denominator.
>
> Maybe, but there are other cases. Memory-allocation is simply slow.
It depends. But indeed, the standard allocators of several platforms are
not exactly brilliant.
>> It is quite easy to provide zero copy conversion operators to this
>> type for any string implementation.
>
> An overloaded += and other operators also wouldn't copy if the capacity
> were sufficient.
+= always requires some copying unless you have a rope implementation
with immutable fragments.
And no rule says that std::string needs to do an extra allocation for
every += call. In fact it does not. Of course it cannot estimate the
final size of your string, but feel free to call .reserve() to ensure
enough space before the concatenation. This is quite close to your
requirement, especially the part that only allocates when the buffer
gets too small.
It is true that you cannot simply inject an optimized custom allocator
into basic_string without breaking type compatibility. But you can reuse
an instance for building several strings. This reduces allocations a lot
as well, at least if your platform's string implementation supports COW.
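
For illustration, a minimal sketch of both points (the function names
are made up):

    #include <cstddef>
    #include <string>
    #include <vector>

    // One allocation up front, then += does not reallocate while the
    // estimate holds.
    std::string join(const std::vector<std::string>& parts)
    {
        std::size_t total = 0;
        for (const auto& p : parts)
            total += p.size() + 1;

        std::string result;
        result.reserve(total);
        for (const auto& p : parts) {
            result += p;
            result += ' ';
        }
        return result;
    }

    // Reusing one instance keeps the grown capacity alive, so later
    // iterations typically allocate nothing at all.
    void process(const std::vector<std::vector<std::string>>& batches)
    {
        std::string line;
        for (const auto& batch : batches) {
            line.clear();          // clear() normally keeps the capacity
            for (const auto& p : batch)
                line += p;
            // ... use line ...
        }
    }
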
You are right, std::string is not the optimum for every purpose. But
from my point of view, an optimized allocator is not what it mainly lacks.
The main issue is that there is no concept of *immutable strings* as a
distinct type in the standard library. Carrying the overhead of mutable
strings everywhere in the application has the largest impact. This
applies especially in multi-threaded contexts, i.e. almost any
application nowadays.
A simple base class that consists of just the size and a reference to
the data storage could reduce the need for copying a lot. Even COW
optimizations (which add further complexity) would then be superfluous.
In particular, different string implementations could provide implicit
conversions to that type as long as they use a compatible storage layout.
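
A minimal sketch of one possible shape of such a type (the name str_ref
is made up; std::string_view from C++17 standardizes essentially this
idea):

    #include <cstddef>
    #include <string>

    // Non-owning, immutable view: just a pointer and a length. Any string
    // implementation with contiguous storage can convert to it without
    // copying.
    struct str_ref
    {
        const char* data = nullptr;
        std::size_t size = 0;

        str_ref() = default;
        str_ref(const char* s, std::size_t n) : data(s), size(n) {}

        // zero-copy conversion from std::string
        str_ref(const std::string& s) : data(s.data()), size(s.size()) {}
    };

    // Library functions can take the view instead of a concrete string class.
    std::size_t count_spaces(str_ref s)
    {
        std::size_t n = 0;
        for (std::size_t i = 0; i < s.size; ++i)
            if (s.data[i] == ' ')
                ++n;
        return n;
    }
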
In fact I have had the best experience with an immutable, strongly
thread-safe string class. It uses exactly one storage object for every
string, consisting of the length, the content, and a reference counter.
Instances of the class are just a pointer to a storage object. Storage
objects are shared between instances. These pointers are actually binary
compatible with const char*, so a conversion to this type is just a no-op.
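
A rough sketch of that layout, not the actual class (error handling,
assignment, and moves are omitted; the header sits directly in front of
the characters, so the instance's pointer can double as a const char*):

    #include <atomic>
    #include <cstddef>
    #include <cstdlib>
    #include <cstring>
    #include <new>

    // Header that sits immediately in front of the character data.
    struct storage_header
    {
        std::atomic<std::size_t> refs;    // reference counter
        std::size_t              length;  // number of characters
        // the characters (length bytes plus '\0') follow directly behind
    };

    class imm_string
    {
        const char* chars_;   // points at the characters, not at the header

        static storage_header* header(const char* p)
        {
            return reinterpret_cast<storage_header*>(const_cast<char*>(p)) - 1;
        }

    public:
        imm_string(const char* src, std::size_t n)
        {
            // error handling omitted for brevity
            void* raw = std::malloc(sizeof(storage_header) + n + 1);
            auto* h = new (raw) storage_header;
            h->refs.store(1, std::memory_order_relaxed);
            h->length = n;
            char* text = reinterpret_cast<char*>(h + 1);
            std::memcpy(text, src, n);
            text[n] = '\0';
            chars_ = text;
        }

        imm_string(const imm_string& other) : chars_(other.chars_)
        {
            header(chars_)->refs.fetch_add(1, std::memory_order_relaxed);
        }

        imm_string& operator=(const imm_string&) = delete;  // omitted here

        ~imm_string()
        {
            if (header(chars_)->refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
                std::free(header(chars_));
        }

        // The whole point: conversion to const char* is a no-op.
        operator const char*() const { return chars_; }
        std::size_t size() const     { return header(chars_)->length; }
    };
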
You can get a further gain if you deduplicate identical strings in
memory. This can reduce the memory footprint a lot. While saving some
memory might not be of primary interest, the side effect of a
significantly higher cache hit rate can give a remarkable gain in
application performance. The working set size simply decreases. For
relational database applications with many users and typically large
redundancy, the memory gain can be up to one order of magnitude.
The additional effort for deduplication is usually insignificant. It can
even avoid the allocation entirely if the factory function already does
the deduplication when converting from raw storage.
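
A sketch of such a deduplicating factory, building on the imm_string
sketch above (a real version would key on the shared storage itself
rather than a separate std::string):

    #include <mutex>
    #include <string>
    #include <unordered_map>

    // Returns the one shared imm_string for a given character sequence;
    // identical inputs share the same storage object afterwards.
    class string_pool
    {
        std::mutex                                  mtx_;
        std::unordered_map<std::string, imm_string> pool_;

    public:
        imm_string intern(const char* src, std::size_t n)
        {
            std::string key(src, n);
            std::lock_guard<std::mutex> lock(mtx_);
            auto it = pool_.find(key);
            if (it == pool_.end())
                it = pool_.emplace(std::move(key), imm_string(src, n)).first;
            return it->second;   // only bumps the reference count
        }
    };
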
And there is another effect: memory consumption scales /below/ linearly.
The more data you load, the higher the probability that you already have
some of its content in memory.
But this concept requires /immutability/. It cannot be implemented on
top of std::string.
So from my point of view the problem with allocations of std::string is
mainly a consequence of its mutability.
Of course, you always need a second, mutable string class. But this one
can use completely different allocation strategies. And if your
implementation is smart enough, the storage object behind the string
builder class can be the same as the storage object behind the
immutable string. So extracting an immutable string from the string
builder can be a trivial O(1) operation with no allocation, as long as
you accept that the string builder is empty afterwards, which is the
typical use case anyway.
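
A sketch of that handoff, using std::shared_ptr and std::string as
stand-ins for the custom storage (all names are made up):

    #include <memory>
    #include <string>
    #include <utility>

    // The shared pointer to a const buffer plays the role of the shared
    // immutable storage object described above.
    class immutable_string
    {
        std::shared_ptr<const std::string> data_;
    public:
        explicit immutable_string(std::shared_ptr<const std::string> d)
            : data_(std::move(d)) {}
        const char* c_str() const { return data_->c_str(); }
        std::size_t size()  const { return data_->size(); }
    };

    class string_builder
    {
        std::shared_ptr<std::string> buf_ = std::make_shared<std::string>();
    public:
        string_builder& append(const char* s) { *buf_ += s; return *this; }

        // O(1) handoff: the builder's buffer becomes the immutable string's
        // storage; no characters are copied, and the builder is empty
        // afterwards.
        immutable_string extract()
        {
            auto done = std::move(buf_);
            buf_ = std::make_shared<std::string>();
            return immutable_string(std::move(done));
        }
    };
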
On the other hand, if you intend to build a large number of strings of
different sizes, it can be more helpful to allocate a shared buffer once
and reuse it for every string. This buffer could also be on the stack.
Now the conversion to an immutable string always requires an allocation
(unless deduplication kicks in, of course), but creating the string
itself likely causes no further allocations, which might be more
important.
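
And a sketch of this buffer-reuse variant, reusing the immutable_string
stand-in from the sketch above:

    #include <cstdio>
    #include <memory>
    #include <string>
    #include <vector>

    // One reusable stack buffer for building; each finished string costs
    // exactly one allocation (the copy into its immutable storage).
    void build_labels(const std::vector<int>& ids,
                      std::vector<immutable_string>& out)
    {
        char buf[128];                               // shared scratch buffer
        for (int id : ids) {
            int n = std::snprintf(buf, sizeof buf, "item-%d", id);
            if (n < 0 || n >= static_cast<int>(sizeof buf))
                continue;                            // truncated; handle properly in real code
            out.push_back(immutable_string(
                std::make_shared<const std::string>(buf,
                    static_cast<std::size_t>(n))));
        }
    }
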
std::string could be used for this purpose if you pass a custom
allocator. But you should always convert the result to a more compact
form when you are done rather than carrying the baggage of mutability
and custom allocators everywhere in your application.
I have built at least two larger applications that use the above
concepts: one in C#, which already has an immutable string class, and
another one in C++ with a custom string class. Both provide very high
performance at quite low memory usage.
Loading about 10 GB of data from an RDBMS shrinks to roughly 1 GB in
memory. Even several dozen concurrent users create no significant CPU
load. And the application does not wait for I/O either, as it mainly
operates in memory. Only remote calls to other applications can slow
down response times.
There are some other deduplication concepts in the application but the
largest gain is from strings.
Furthermore, once you have deduplication, other features become quite
easy. E.g. if you want to provide full versioning of changes to objects,
this now takes almost no memory, as new versions of objects are likely
to share most of their properties with the older ones. In fact, keeping
5 years of history of really everything did not cause any performance
issue at all. And access to this data is transparent: simply choose a
date and time and you see an old snapshot of all data, including all
work items that were past their deadline at that point, and so on.
Marcel
P.S.: There is another thing I do to prevent unnecessary allocations
when building strings: printf-like formatting allows the implementation
to allocate one final buffer of adequate size. This is impossible if you
add components with subsequent method calls. So creating an immutable
string from a printf-like format string is not that bad either. Of
course this is not always reasonable. Many platforms provide printf
format checking, so UB from the missing type safety can be mostly
eliminated.
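
A sketch of that, using the standard two-pass vsnprintf idiom to size
the final buffer exactly once; the attribute is the usual compiler
extension for the format checking mentioned above. In the scheme above,
the resulting buffer could directly become the storage of an immutable
string.

    #include <cstdarg>
    #include <cstdio>
    #include <string>

    // One measuring pass, exactly one allocation, one writing pass.
    #if defined(__GNUC__)
    __attribute__((format(printf, 1, 2)))   // compile-time format checking
    #endif
    std::string format_once(const char* fmt, ...)
    {
        va_list args, copy;
        va_start(args, fmt);
        va_copy(copy, args);
        int needed = std::vsnprintf(nullptr, 0, fmt, copy);  // measuring pass
        va_end(copy);

        std::string result;
        if (needed > 0) {
            result.resize(static_cast<std::size_t>(needed) + 1);
            std::vsnprintf(&result[0], result.size(), fmt, args);
            result.resize(static_cast<std::size_t>(needed));  // drop the '\0'
        }
        va_end(args);
        return result;
    }
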