Efficiently Building Strings

Travis Parks

unread,

May 21, 2012, 11:55:10 PM5/21/12

to

In most other languages, you have to be careful when building large strings. I never really worried about this in C++. Is there overhead in using the += operator over and over again? I am wondering if the class std::stringstream should be used to reduce overhead, or is it just implemented in terms of a std::string?

Paavo Helde

unread,

May 22, 2012, 1:10:01 AM5/22/12

to

Travis Parks <jehuga...@gmail.com> wrote in
news:85e6e5f4-e623-473e...@googlegroups.com:

It depends on the quality of the implementation. Basically it boils down
to whether the implementation increases the capacity of the string buffer
proactively by some factor (1.5 seems popular nowadays) or not. Current
mainstream compilers like msvc or g++ certainly do that. You can easily
test your implementation behavior with a small snippet:

#include <string>
#include <iostream>

int main() {
std::string s;
for (int i=0; i<100; ++i) {
s += "abc";
std::cout << s.capacity() << "\n";
}
}

Streams are specifically designed for such continuous appending so at the
first glance it may seem that std::stringstream might work better in some
implementation. However, the stream subsystem is riddled with virtual
functions and other complex unneeded features like locale support so it
will probably be significantly slower for this particul string
concatenation task (4 times in my tests with msvc2010).

But as always with performance issues, there is no sure way other than
actually measure/profile with your target compiler(s) and realistic data.

hth
Paavo

Ian Collins

unread,

May 22, 2012, 1:56:03 AM5/22/12

to

On 05/22/12 03:55 PM, Travis Parks wrote:

Please wrap lines to something sensible!

> In most other languages, you have to be careful when building large strings.. I never really worried about this in C++. Is there overhead in using the += operator over and over again? I am wondering if the class std::stringstream should be used to reduce overhead, or is it just implemented in terms of a std::string?

I don't thank it's a matter of efficiency, ostringstream is appending to
a string under the hood. It is more a matter of what fits. If you
adding text, then you may as well just use a string. If you are adding
something that requires formatting, use an ostringstream.

--
Ian Collins

Alf P. Steinbach

unread,

May 22, 2012, 2:10:59 AM5/22/12

to

On 22.05.2012 05:55, Travis Parks wrote:
> In most other languages, you have to be careful when building large strings.
> I never really worried about this in C++. Is there overhead in using the +=
> operator over and over again?

Not particularly. It's O(n) on average, where n is the length of the
final string.

In contrast, building a string by using infix "+" can be rather
inefficient, and if it's built up of many small pieces, typically O(n^2).

> I am wondering if the class std::stringstream should be used to reduce
> overhead, or is it just implemented in terms of a std::string?

std::stringstream is convenient but usually, due to silly locale support
etc., it's quite inefficient.

In other words, use it as a convenient default converter, but be aware
of the performance issues.

Consider instead using a class like

<code>
class S
{
private:
string text_;

template< class Type >
static string generalStringFrom( Type const& v )
{
ostringstream stream;
stream << v; return stream.str();
}

public:
template< class Type >
S& operator<<( Type const& v ) { text_ += generalStringFrom( v );
return *this; }

operator char const* () const { return text_.c_str(); }
operator string const& () const { return text_; }
};
<code>

It's easy to optimize by specializing `generalStringFrom` or a function
that it calls.

Anyway then you can write e.g.

void foo( char const s[] );

void bar()
{
foo( S() << "The answer is " << 6*7 << "!" );
}

and the expression evaluation is O(n) in the size of the resulting string.

Cheers & hth.,

- Alf

SG

unread,

May 22, 2012, 6:52:01 AM5/22/12

to

On May 22, 8:10 am, Alf P. Steinbach wrote:
> On 22.05.2012 05:55, Travis Parks wrote:
>
> > In most other languages, you have to be careful when building large strings.
> > I never really worried about this in C++. Is there overhead in using the +=
> > operator over and over again?
>
> Not particularly. It's O(n) on average, where n is the length of the
> final string.

I expect a quality implementation to have this behaviour. But I just
looked up whether += makes any complexity guarantees. It turns out,
+= is defined in terms of .append where many append overloads in turn
are defined in terms of the following member function

basic_string& append(const charT* s, type_type n);

(Effects: The function replaces the string controlled by *this with a
string of length size() + n whose first size() elements are a copy of
the original string controlled by *this and whose remaining elements
are a copy of the initial n elements of s)

But there's actually no sign of a complexity guarantee.

> In contrast, building a string by using infix "+" can be rather
> inefficient, and if it's built up of many small pieces, typically O(n^2).

We can expect this to change (C++2011) due to move semantics and
operator+ being also defined in terms of append -- at least with a
decent append implementation. ;)

Cheers!

Jorgen Grahn

unread,

May 22, 2012, 7:53:06 AM5/22/12

to

On Tue, 2012-05-22, Travis Parks wrote:
> In most other languages, you have to be careful when building large
> strings. I never really worried about this in C++.

I have two partly conflicting comments on that:

- You probably shouldn't worry in any language until you can measure a
problem. On modern hardware and outside critical sections, you are
unlikely to notice.

- Building large strings is IME usually a sign of a design weakness.
There's often an alternative if you look more closely. E.g. using
POSIX writev() on a socket instead of assembling the data yourself and
sending it as one buffer to the socket. Or doing needed
post-processing on the fly while writing to file. And so on.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Travis Parks

unread,

May 22, 2012, 11:04:55 AM5/22/12

to

On Tuesday, May 22, 2012 7:53:06 AM UTC-4, Jorgen Grahn wrote:
> On Tue, 2012-05-22, Travis Parks wrote:
> > In most other languages, you have to be careful when building large
> > strings. I never really worried about this in C++.
>
> I have two partly conflicting comments on that:
>
> - You probably shouldn't worry in any language until you can measure a
> problem. On modern hardware and outside critical sections, you are
> unlikely to notice.
>

I only brought it up because this time through "The C++ Programming Language"
it occurred to me I never considered string building performance. I had a
professor once tell me that it can be really inefficient to read an entire
file into a single string using noskipws. Instead, he suggested using a
vector<char> and an ostream_iterator, passing the result to the string ctor.
My guess this had to do with how strings were represented internally. I also
know, unlike most of the other languages I deal with, strings are mutable, so
they probably handle string building a little better.

> - Building large strings is IME usually a sign of a design weakness.
> There's often an alternative if you look more closely. E.g. using
> POSIX writev() on a socket instead of assembling the data yourself and
> sending it as one buffer to the socket. Or doing needed
> post-processing on the fly while writing to file. And so on.
>

I usually spit out results immediate as output. I rarely build large strings,
since it is rarely necessary in C++, thanks to fewer format strings and
operator<<. Still, I was wondering if += had performance issues.

Luca Risolia

unread,

May 22, 2012, 3:14:21 PM5/22/12

to

On 22/05/2012 17:04, Travis Parks wrote:
> I only brought it up because this time through "The C++ Programming Language"
> it occurred to me I never considered string building performance. I had a
> professor once tell me that it can be really inefficient to read an entire
> file into a single string using noskipws. Instead, he suggested using a
> vector<char> and an ostream_iterator, passing the result to the string ctor.
> My guess this had to do with how strings were represented internally. I also
> know, unlike most of the other languages I deal with, strings are mutable, so
> they probably handle string building a little better.

If you want to copy a file to a string, you don't need the vector. You
can build the string directly from stream buffer, as in the example
below. You can make it more efficient by preallocating the size of the
string and std::copy() to copy the buffer into the string (which will be
overwritten). Note that getting the size of the file can usually be done
in constant time, but there is no guarantee (it depends on the device).

#include <fstream>
#include <string>
#include <iterator>
#include <iostream>

using namespace std;

int main(int argc, char** argv) {
ifstream f(argv[1], ifstream::in | ifstream::binary);
istreambuf_iterator<char > i(f), eof;
string s(i, eof);
cout << s;
}

Jorgen Grahn

unread,

May 22, 2012, 5:06:23 PM5/22/12

to

On Tue, 2012-05-22, Luca Risolia wrote:
> On 22/05/2012 17:04, Travis Parks wrote:
>> I only brought it up because this time through "The C++ Programming Language"
>> it occurred to me I never considered string building performance. I had a
>> professor once tell me that it can be really inefficient to read an entire
>> file into a single string using noskipws. Instead, he suggested using a
>> vector<char> and an ostream_iterator, passing the result to the string ctor.
>> My guess this had to do with how strings were represented internally.

I don't know ... To me, part of the problem is once you have the file
as a string, you've lost the one-line-at-a-time functionality of
getline() and friends, and have to reinvent it if you want it (or turn
the string back into a stringstream).

Also you risk failing, or suffering, or being impolite to other
processes due to excessive memory usage. You cannot know for sure all
files will be small, and users will not like "error: file too large to
be safely processed" error messages.

>> I also
>> know, unlike most of the other languages I deal with, strings are mutable, so
>> they probably handle string building a little better.
>
> If you want to copy a file to a string, you don't need the vector. You
> can build the string directly from stream buffer, as in the example
> below. You can make it more efficient by preallocating the size of the
> string and std::copy() to copy the buffer into the string (which will be
> overwritten). Note that getting the size of the file can usually be done
> in constant time, but there is no guarantee (it depends on the device).

What's "usual" also depends on what OS you use and which user
interface you choose. I'm on Unix, and I try to support reading from
standard input and writing to standard output if it makes any sense at
all. Not only it it impossible to tell the size then -- the size may
turn out to be infinite. (And processing an infinite file may
actually be desireable. In such scenarios you *really* don't want to
read the whole file before processing it.)

Juha Nieminen

unread,

May 23, 2012, 10:59:53 AM5/23/12

to

Luca Risolia <luca.r...@studio.unibo.it> wrote:
> int main(int argc, char** argv) {
> ifstream f(argv[1], ifstream::in | ifstream::binary);
> istreambuf_iterator<char > i(f), eof;
> string s(i, eof);
> cout << s;
> }

Note that this is not the fastest possible way of reading a large file
because reading one byte at a time from a C++ stream is pretty slow.

If maximum reading speed is required, the best compromise between speed
and relative ease of usage is to use std::fread() instead. It would go
something like this:

std::FILE* inFile = std::fopen(argv[1], "rb");
if(!inFile) { std::perror(argv[1]); return 1; }
std::fseek(inFile, 0, SEEK_END);
std::string s(std::ftell(inFile), ' ');
std::rewind(inFile);
std::fread(&s[0], 1, s.length(), inFile);
std::fclose(inFile);

For example in my system reading a 700-megabyte file takes about 7.5 seconds
by using the istreambuf_iterator method, while with fread it takes about
1.5 seconds.

(If maximum reading speed is not imperative, or if input files are
expected to be small, then use whichever method is safest and most
convenient.)

Scott Lurndal

unread,

May 23, 2012, 11:40:10 AM5/23/12

to

Juha Nieminen <nos...@thanks.invalid> writes:
>Luca Risolia <luca.r...@studio.unibo.it> wrote:
>> int main(int argc, char** argv) {
>> ifstream f(argv[1], ifstream::in | ifstream::binary);
>> istreambuf_iterator<char > i(f), eof;
>> string s(i, eof);
>> cout << s;
>> }
>
>Note that this is not the fastest possible way of reading a large file
>because reading one byte at a time from a C++ stream is pretty slow.
>
>If maximum reading speed is required, the best compromise between speed
>and relative ease of usage is to use std::fread() instead. It would go
>something like this:
>
> std::FILE* inFile = std::fopen(argv[1], "rb");
> if(!inFile) { std::perror(argv[1]); return 1; }
> std::fseek(inFile, 0, SEEK_END);
> std::string s(std::ftell(inFile), ' ');
> std::rewind(inFile);
> std::fread(&s[0], 1, s.length(), inFile);
> std::fclose(inFile);

If I was really aiming for the fastest way to make an in-memory
string of a file, I'd use mmap:

int fd, diag;
struct stat st;

fd = ::open(argv[1], O_RDONLY, 0);
if (fd == -1) throw exception(errno);
diag = ::fstat(fd, &st);
if (diag == -1) { close (fd); throw exception(errno); }
void *base = ::mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, (off_t)0);
if (base == MAP_FAILED) { close(fd); throw exception(errno); }

So now, base points to the first byte of the string, which is st.st_size
bytes in length.

Unfortunatly, there is no std::string constructor that takes a pointer and length
and makes _it_ the string (rather than copy it).

This method will also defer the disk access until the page containing the string
is accessed for the first time, so for sparse access is much more efficient.

This method will not allow a string larger than the available virtual address
space (which is around 2GB or so on 32-bit systems, basically unlimited on 64-bit).

This method does not consume backing store (e.g. swap or paging file space).

There are equivalent API's (OpenFileMapping/MapViewOfFile) available for windows.

scott

Juha Nieminen

unread,

May 24, 2012, 9:20:13 AM5/24/12

to

Scott Lurndal <sc...@slp53.sl.home> wrote:
>> std::FILE* inFile = std::fopen(argv[1], "rb");
>> if(!inFile) { std::perror(argv[1]); return 1; }
>> std::fseek(inFile, 0, SEEK_END);
>> std::string s(std::ftell(inFile), ' ');
>> std::rewind(inFile);
>> std::fread(&s[0], 1, s.length(), inFile);
>> std::fclose(inFile);
>
>
> If I was really aiming for the fastest way to make an in-memory
> string of a file, I'd use mmap:

The difference is that my example code above is portable while mmap
isn't.

Scott Lurndal

unread,

May 24, 2012, 11:48:16 AM5/24/12

to

You can, of course, wrap it in a class and compile the windows version
or linux version of the "mapfile" class as necessary. Then it's just
a library call.

scott

Travis Parks

unread,

May 24, 2012, 1:56:40 PM5/24/12

to

I was just using the file reading to explain my point about string building taking a long time. Although, it is pretty cool to see so many potential implementations. I was more concerned with building strings in code, like HTML, XML or whatever. When building somewhat large strings that you can't immediately output (say you plan on storing it in a database), is there something better than string's += operator?

Luca Risolia

unread,

May 24, 2012, 2:13:15 PM5/24/12

to

On 23/05/2012 16:59, Juha Nieminen wrote:
> Luca Risolia<luca.r...@studio.unibo.it> wrote:
>> int main(int argc, char** argv) {
>> ifstream f(argv[1], ifstream::in | ifstream::binary);
>> istreambuf_iterator<char> i(f), eof;
>> string s(i, eof);
>> cout<< s;
>> }
>
> Note that this is not the fastest possible way of reading a large file
> because reading one byte at a time from a C++ stream is pretty slow.

You can provide your preferred char type and use it with basic_string<>
and basic_ifstream<>.

> If maximum reading speed is required, the best compromise between speed
> and relative ease of usage is to use std::fread() instead. It would go
> something like this:
>

> std::fseek(inFile, 0, SEEK_END);
> std::string s(std::ftell(inFile), ' ');
> std::rewind(inFile);

The couple seek() and rewind() can be done in constant time only if the
device supports random access storage, but it is slow (typically linear
time) with drives providing sequential access only.

Jorgen Grahn

unread,

May 24, 2012, 2:29:21 PM5/24/12

to

On Wed, 2012-05-23, Scott Lurndal wrote:
...

> If I was really aiming for the fastest way to make an in-memory
> string of a file, I'd use mmap:

...

> Unfortunatly, there is no std::string constructor that takes a
> pointer and length and makes _it_ the string (rather than copy it).

That would not be a string -- strings (like the containers) must own
their content. I agree though that it would be useful with something
like that, in some other class.

Luca Risolia

unread,

May 24, 2012, 8:54:13 PM5/24/12

to

On 24/05/2012 17:48, Scott Lurndal wrote:
>>> If I was really aiming for the fastest way to make an in-memory
>>> string of a file, I'd use mmap:
>>
>> The difference is that my example code above is portable while mmap
>> isn't.
>
> You can, of course, wrap it in a class and compile the windows version
> or linux version of the "mapfile" class as necessary. Then it's just
> a library call.

As I suspected, my fstream implementation on Linux already uses mmap()
internally, so the above mmap() optimization is pointless. I bet it's
the same on Windows. Once again, this shows that it's better to focus on
simple and portable C++ code, unless you are sure your implementation is
NOT already optimized.

Luca Risolia

unread,

May 24, 2012, 9:35:12 PM5/24/12

to

On 24/05/2012 17:48, Scott Lurndal wrote:

This the most portable and fastest way probably:

ifstream f(argv[1], ifstream::in | ifstream::binary);

ostringstream o;
o << f.rdbuf();
cout << o.str();

Juha Nieminen

unread,

May 25, 2012, 4:15:43 AM5/25/12

to

Luca Risolia <luca.r...@studio.unibo.it> wrote:
> This the most portable and fastest way probably:
>
> ifstream f(argv[1], ifstream::in | ifstream::binary);
> ostringstream o;
> o << f.rdbuf();
> cout << o.str();

Perhaps faster, but not fastest. In my system, with that same file, that
takes about 4.6 seconds (vs. the 1.5 seconds of the fread method).

Luca Risolia

unread,

May 25, 2012, 9:06:17 AM5/25/12

to

What is your platform? Can you trace o << f.rdbuf() ? I am curious.

Juha Nieminen

unread,

May 25, 2012, 11:30:45 AM5/25/12

to

Linux.

I believe that one problem may be that using ostringstream's operator<<
the string is being gradually resized as more data is read, while with the
fread method the size is set with one single allocation and the data read
directly into it.

If I replace the fread code with a similar code that reads smaller chunks
and resizes the string to match, the reading time increases to over 3
seconds.

Pavel

unread,

May 26, 2012, 10:06:33 PM5/26/12

to

Alf P. Steinbach wrote:
> On 22.05.2012 05:55, Travis Parks wrote:
>> In most other languages, you have to be careful when building large strings.
>> I never really worried about this in C++. Is there overhead in using the +=
>> operator over and over again?
>
> Not particularly. It's O(n) on average, where n is the length of the final string.
>
> In contrast, building a string by using infix "+" can be rather inefficient, and
> if it's built up of many small pieces, typically O(n^2).

I wonder why 26.3.1-3 relaxation for return types of valarray operations is not
extended to std::string '+' operation. This would allow an intelligent
implementation to achieve the same complexity for + and += in expressions with
multiple '+'; and my guess is that std::string's + syntax is used more often
than valarray.

>
> - Alf

-Pavel

Pavel

unread,

May 26, 2012, 10:55:31 PM5/26/12

to

Juha Nieminen wrote:
> Luca Risolia<luca.r...@studio.unibo.it> wrote:
>> int main(int argc, char** argv) {
>> ifstream f(argv[1], ifstream::in | ifstream::binary);
>> istreambuf_iterator<char> i(f), eof;
>> string s(i, eof);
>> cout<< s;
>> }
>
> Note that this is not the fastest possible way of reading a large file
> because reading one byte at a time from a C++ stream is pretty slow.
>
> If maximum reading speed is required, the best compromise between speed
> and relative ease of usage is to use std::fread() instead. It would go
> something like this:
>
> std::FILE* inFile = std::fopen(argv[1], "rb");
> if(!inFile) { std::perror(argv[1]); return 1; }
> std::fseek(inFile, 0, SEEK_END);
> std::string s(std::ftell(inFile), ' ');
> std::rewind(inFile);
> std::fread(&s[0], 1, s.length(), inFile);

I know it is often assumed, but std::string's characters are not guaranteed to
be stored contiguously (despite many implementations' doing so).

> std::fclose(inFile);
>
> For example in my system reading a 700-megabyte file takes about 7.5 seconds
> by using the istreambuf_iterator method, while with fread it takes about
> 1.5 seconds.
>
> (If maximum reading speed is not imperative, or if input files are
> expected to be small, then use whichever method is safest and most
> convenient.)

Just a nit-pick
-Pavel

Pavel

unread,

May 26, 2012, 11:00:47 PM5/26/12

to

Juha Nieminen wrote:
> Luca Risolia<luca.r...@studio.unibo.it> wrote:
>> int main(int argc, char** argv) {
>> ifstream f(argv[1], ifstream::in | ifstream::binary);
>> istreambuf_iterator<char> i(f), eof;
>> string s(i, eof);
>> cout<< s;
>> }
>
> Note that this is not the fastest possible way of reading a large file
> because reading one byte at a time from a C++ stream is pretty slow.
>
> If maximum reading speed is required, the best compromise between speed
> and relative ease of usage is to use std::fread() instead. It would go
> something like this:
>
> std::FILE* inFile = std::fopen(argv[1], "rb");
> if(!inFile) { std::perror(argv[1]); return 1; }
> std::fseek(inFile, 0, SEEK_END);
> std::string s(std::ftell(inFile), ' ');
> std::rewind(inFile);
> std::fread(&s[0], 1, s.length(), inFile);

Just a nit-pick: std::string is not guaranteed to be stored contiguously so
technically your code has UB.

> std::fclose(inFile);
>
> For example in my system reading a 700-megabyte file takes about 7.5 seconds
> by using the istreambuf_iterator method, while with fread it takes about
> 1.5 seconds.
>
> (If maximum reading speed is not imperative, or if input files are
> expected to be small, then use whichever method is safest and most
> convenient.)

-Pavel

woodb...@gmail.com

unread,

May 27, 2012, 12:40:55 AM5/27/12

to

On Saturday, May 26, 2012 10:00:47 PM UTC-5, Pavel wrote:
> Juha Nieminen wrote:
> > Luca Risolia<luca.r...@studio.unibo.it> wrote:
> >> int main(int argc, char** argv) {
> >> ifstream f(argv[1], ifstream::in | ifstream::binary);
> >> istreambuf_iterator<char> i(f), eof;
> >> string s(i, eof);
> >> cout<< s;
> >> }
> >
> > Note that this is not the fastest possible way of reading a large file
> > because reading one byte at a time from a C++ stream is pretty slow.
> >
> > If maximum reading speed is required, the best compromise between speed
> > and relative ease of usage is to use std::fread() instead. It would go
> > something like this:
> >
> > std::FILE* inFile = std::fopen(argv[1], "rb");
> > if(!inFile) { std::perror(argv[1]); return 1; }
> > std::fseek(inFile, 0, SEEK_END);
> > std::string s(std::ftell(inFile), ' ');
> > std::rewind(inFile);
> > std::fread(&s[0], 1, s.length(), inFile);
> Just a nit-pick: std::string is not guaranteed to be stored contiguously so
> technically your code has UB.
>

Iirc, this changed with C++ 2011. Now the storage is
guaranteed to be contiguous.

Brian
Ebenezer Enterprises
http://webEbenezer.net

Bo Persson

unread,

May 27, 2012, 7:24:00 AM5/27/12

to

Pavel skrev 2012-05-27 04:55:
> Juha Nieminen wrote:
>> Luca Risolia<luca.r...@studio.unibo.it> wrote:
>>> int main(int argc, char** argv) {
>>> ifstream f(argv[1], ifstream::in | ifstream::binary);
>>> istreambuf_iterator<char> i(f), eof;
>>> string s(i, eof);
>>> cout<< s;
>>> }
>>
>> Note that this is not the fastest possible way of reading a large file
>> because reading one byte at a time from a C++ stream is pretty slow.
>>
>> If maximum reading speed is required, the best compromise between speed
>> and relative ease of usage is to use std::fread() instead. It would go
>> something like this:
>>
>> std::FILE* inFile = std::fopen(argv[1], "rb");
>> if(!inFile) { std::perror(argv[1]); return 1; }
>> std::fseek(inFile, 0, SEEK_END);
>> std::string s(std::ftell(inFile), ' ');
>> std::rewind(inFile);
>> std::fread(&s[0], 1, s.length(), inFile);
> I know it is often assumed, but std::string's characters are not
> guaranteed to be stored contiguously (despite many implementations'
> doing so).
>

Change that to "ALL known implementations doing so".

In C++11 it IS required (non-controversial, because nobody had to change
anything).

Bo Persson

Pavel

unread,

May 27, 2012, 8:13:01 PM5/27/12

to

I just took a look at this part of C+11 and was surprised c_str() is no longer
required to invalidate iterators etc and even it does not seem to be required to
return a null-terminated string. Is this intentional?

-Pavel

Dombo

unread,

May 28, 2012, 6:07:55 AM5/28/12

to

Op 28-May-12 2:13, Pavel schreef:

If it doesn't return a null-terminated string it would be pretty useless
and break almost all code that uses c_str(). I didn't look it up in the
C++11 standard, but if what you say is true than I'm pretty sure it is
not intentional.

According to the C++ 2003 version of the standard c_str() does return a
null-terminated string, but AFAICT does not require iterators to be
invalidated when this member is called. However any subsequent call to a
non-const member of the string object may make the pointer returned by
the c_str() function invalid.

Jan Andres

unread,

May 28, 2012, 1:23:57 PM5/28/12

to

I tend to disagree on your second point above. 21.4.7.1(1) requires
c_str() and data() to return

A pointer p such that p + i == &operator[](i) for each i in [0,size()],

i.e. the interval includes the index size() which is one past the end of
the string. 21.4.5(2) specifies that operator[](pos) returns an object
of value charT() for pos == size(), which is a NUL character for your
typical charT.

Pavel

unread,

May 28, 2012, 8:39:47 PM5/28/12

to

See 21.3-5:
References, pointers, and iterators referring to the elements of a basic_string
sequence may be invalidated by the following uses of that basic_string object:
...
— Calling data() and c_str() member functions.

However any subsequent call to a non-const member of
> the string object may make the pointer returned by the c_str() function invalid.
>

-Pavel

Pavel

unread,

May 28, 2012, 9:06:40 PM5/28/12

to

> A pointer p such that p + i ==&operator[](i) for each i in [0,size()],

>
> i.e. the interval includes the index size() which is one past the end of
> the string. 21.4.5(2) specifies that operator[](pos) returns an object
> of value charT() for pos == size(), which is a NUL character for your
> typical charT.

Thanks, I see now. It was more explicit before. But, then, in conjunction with
non-invalidating iterators on c_str(), and the requirement of contiguous storage
this means that any practical implementation of std::string shall have its
contiguous storage ended with an extra NUL character, even if the client code
don't need it. This is probably minor, but good to know when calculating memory
requirements for many small strings.

As for adding the requirement contiguous storage,, I think it closed a
possibility for some optimizations related of string growing (the subject that
arose earlier in this thread). For example, before the change an std::string
implementation with size < C * pow(2, n) could keep an array of n contiguous
chunks with sizes C * pow(2, i) where C is a constant and 0 <= i < n. Then
appending would require neither memory deallocation nor data copying.

-Pavel

Jan Andres

unread,

May 29, 2012, 10:09:52 AM5/29/12

to

Agreed, though for strings so short that a single-character size
difference is significant, there are other types of overhead that must
be taken into account. Take e.g. the memory allocator's internal
overhead which can easily be something on the order of 2 pointers per
allocated block, plus many malloc libraries will round up the allocation
size to the next multiple of 16 bytes or so. IMHO that makes the one
extra NUL character irrelevant in the vast majority of cases. (Unless
you are using a truly gigantic character type.:-)

> As for adding the requirement contiguous storage,, I think it closed a
> possibility for some optimizations related of string growing (the subject that
> arose earlier in this thread). For example, before the change an std::string
> implementation with size < C * pow(2, n) could keep an array of n contiguous
> chunks with sizes C * pow(2, i) where C is a constant and 0 <= i < n. Then
> appending would require neither memory deallocation nor data copying.

Agreed also, but I'd like to add that from a complexity theoretic
perspective this isn't as bad as it sounds. Say the size of the buffer
is doubled every time it fills up and we are building up a string of
total length n where 2^(k-1) < n <= 2^k.
Then taking the sum over all reallocations they will copy a total of
2^0 + 2^1 + ... + 2^(k-1) = 2^k - 1 characters and there will be only
(k-1) deallocations, so the time required to build up the whole
n-character string is still linear in n.

Of course, the usual care must be taken when applying such complexity
theoretic considerations to practical programming.

Pavel

unread,

May 30, 2012, 9:56:14 PM5/30/12

to

True; but I have surprisingly many strings that have size of 4 bytes. By adding
an extra byte, they will break to another category (allocate 32 bytes instead of
16 otherwise -- on 32-bit system). This is all theoretical of course because
many implementations have chosen to always allocate NUL terminator even under
the previous standard.

>
>> As for adding the requirement contiguous storage,, I think it closed a
>> possibility for some optimizations related of string growing (the subject that
>> arose earlier in this thread). For example, before the change an std::string
>> implementation with size< C * pow(2, n) could keep an array of n contiguous
>> chunks with sizes C * pow(2, i) where C is a constant and 0<= i< n. Then
>> appending would require neither memory deallocation nor data copying.

>
> Agreed also, but I'd like to add that from a complexity theoretic
> perspective this isn't as bad as it sounds. Say the size of the buffer
> is doubled every time it fills up and we are building up a string of
> total length n where 2^(k-1)< n<= 2^k.
> Then taking the sum over all reallocations they will copy a total of
> 2^0 + 2^1 + ... + 2^(k-1) = 2^k - 1 characters and there will be only
> (k-1) deallocations, so the time required to build up the whole
> n-character string is still linear in n.

Yes big-O is same, but when you are in a business where speed is a competitive
advantage, constant C matters :-). No copying vs extra copy for every allocation
adds a factor about 2 for copying (4 reallocations from 4->8->16->32->64 will
copy 60 characters during reallocations whereas alternative approach copies
none). For extra k deallocations, its cost in comparison to allocations varies
wildly, which makes it difficult to make a general estimate. I only know it is
far from being free.

>
> Of course, the usual care must be taken when applying such complexity
> theoretic considerations to practical programming.

-Pavel

Paavo Helde

unread,

May 31, 2012, 1:42:29 AM5/31/12

to

Pavel <pauldont...@removeyourself.dontspam.yahoo> wrote in
news:4fc6cfc4$0$20896$c3e8da3$f626...@news.astraweb.com:

> Jan Andres wrote:
>>
>> Agreed, though for strings so short that a single-character size
>> difference is significant, there are other types of overhead that
>> must be taken into account. Take e.g. the memory allocator's internal
>> overhead which can easily be something on the order of 2 pointers per
>> allocated block, plus many malloc libraries will round up the
>> allocation size to the next multiple of 16 bytes or so. IMHO that
>> makes the one extra NUL character irrelevant in the vast majority of
>> cases. (Unless you are using a truly gigantic character type.:-)
> True; but I have surprisingly many strings that have size of 4 bytes.
> By adding an extra byte, they will break to another category (allocate
> 32 bytes instead of 16 otherwise -- on 32-bit system). This is all
> theoretical of course because many implementations have chosen to
> always allocate NUL terminator even under the previous standard.

Several implementations are using short-string optimization (e.g. for
strings less than 16 bytes or characters), so a 4-character string would
cause no allocation at all there, be it with a zero byte in the end or not.

Cheers
Paavo