
Why is sorting a pointer-array to strings slower than sorting the strings themselves?


Bonita Montero

Nov 29, 2021, 3:26:35 PM
Why is sorting a pointer-array to strings slower than sorting the
strings themselves? I checked that with this code:

#include <iostream>
#include <vector>
#include <fstream>
#include <string>
#include <algorithm>
#include <random>
#include <concepts>
#include <chrono>
#include <execution>

using namespace std;
using namespace chrono;

int main( int argc, char **argv )
{
    if( argc < 2 )
        return EXIT_FAILURE;
    using vc_t = vector<char>;
    using vc_it = vc_t::iterator;
    // read the whole file into one contiguous block
    vector<char> block;
    {
        ifstream ifs;
        ifs.exceptions( ios_base::badbit );
        ifs.open( argv[1], ios_base::binary );
        ifs.seekg( 0, ios_base::end );
        streampos size = ifs.tellg();
        if( size > (size_t)-1 || !size )
            return EXIT_FAILURE;
        ifs.seekg( 0, ios_base::beg );
        block.resize( (size_t)size );
        ifs.read( &*block.begin(), (streamsize)size );
    }
    cout << "read" << endl;
    // invoke the callback for every whitespace-separated token and return the token count
    auto parse = [&]<typename Callback>( Callback callback ) -> size_t
        requires requires( Callback callback, char const *str, size_t len )
        {
            { callback( str, len ) };
        }
    {
        size_t n = 0;
        for( vc_it cit = block.begin(); cit != block.end(); )
        {
            vc_it lineBegin = cit;
            for( ; cit != block.end() && (unsigned char)*cit > ' '; ++cit );
            if( cit != lineBegin )
                ++n, callback( &*lineBegin, cit - lineBegin );
            for( ; cit != block.end() && (unsigned char)*cit <= ' '; ++cit );
        }
        return n;
    };
    vector<string> strings;
    size_t sumLen = 0;
    // first pass: count the tokens and sum their lengths
    strings.reserve( parse( [&]( char const *, size_t len ) { sumLen += len; } ) );
    if( !strings.capacity() )
        return EXIT_FAILURE;
    cout << "counted, avg length: " << (double)(ptrdiff_t)sumLen / (ptrdiff_t)strings.capacity() << endl;
    // second pass: materialize the strings
    parse( [&]( char const *str, size_t len ) { strings.emplace_back( str, len ); } );
    cout << "parsed" << endl;
    sort( strings.begin(), strings.end() );
    strings.erase( unique( strings.begin(), strings.end() ), strings.end() );
    cout << "uniqued" << endl;
    strings.shrink_to_fit();
    block.resize( 0 ), block.reserve( 0 );
    // random permutation so each timed sort starts from unsorted input
    auto randomize = []<typename RandomIt>( RandomIt begin, size_t n )
        requires random_access_iterator<RandomIt>
    {
        mt19937_64 mt;
        uniform_int_distribution<size_t> uidIdx( 0, n - 1 );
        for( size_t i = 0, j; i != n; ++i )
        {
            while( (j = uidIdx( mt )) == i );
            swap( begin[i], begin[j] );
        }
    };
    // time sorting the string objects themselves
    randomize( strings.begin(), strings.size() );
    auto start = high_resolution_clock::now();
    sort( strings.begin(), strings.end() );
    double ns = (double)(int64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / (ptrdiff_t)strings.size();
    cout << "string-sort: " << ns << endl;
    // time sorting pointers to the same strings
    vector<string *> pStrings;
    pStrings.reserve( strings.size() );
    for( string &str : strings )
        pStrings.emplace_back( &str );
    randomize( pStrings.begin(), pStrings.size() );
    start = high_resolution_clock::now();
    sort( pStrings.begin(), pStrings.end(),
        []( string *left, string *right ) { return *left < *right; } );
    ns = (double)(int64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / (ptrdiff_t)strings.size();
    cout << "pString-sort: " << ns << endl;
}

I read 1.4 million strings with an average length of 9 characters,
so most strings should fall under the short string optimization,
and swapping the string objects should cost more than swapping
two pointers, or a pointer and a length. Yet sorting the string
pointers is actually nearly twice as slow as sorting the string
objects. Why is that?
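
(A quick way to check the SSO assumption, as a sketch; the exact capacity and object size depend on the standard library:)

#include <iostream>
#include <string>

int main()
{
    // A default-constructed std::string reports its in-place (SSO) capacity:
    // typically 15 chars with libstdc++ and MSVC, 22 with libc++.
    std::cout << "SSO capacity: " << std::string().capacity() << '\n';
    std::cout << "sizeof(std::string): " << sizeof(std::string) << '\n';
}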

Marcel Mueller

Nov 29, 2021, 4:55:06 PM
On 29.11.21 at 21:26, Bonita Montero wrote:
> Why is sorting a pointer-array to strings slower than sorting the
> strings themselves? I checked that with this code:
[...]
> I read 1.4 million strings with an average length of 9 characters,
> so most strings should fall under the short string optimization,
> and swapping the string objects should cost more than swapping
> two pointers, or a pointer and a length.

Swapping a few bytes is not the performance killer. Cache misses are the
performance killer. Every time you touch a memory address not in the
cache you get the latency of the RAM. And this latency has not improved
significantly in the last two decades.

> Yet sorting the string pointers
> is actually nearly twice as slow as sorting the string objects.
> Why is that?

Using pointers touches more distinct memory locations: the pointer
array and the string storage. With an array that large, almost every
(random) access is a cache miss.
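
(A minimal sketch of the difference per comparison; the function names are illustrative, and it assumes the SSO strings sit contiguously in the vector<string>:)

#include <string>

// Sorting the strings themselves: the comparator reads the two string
// objects directly; with SSO the characters live inside those objects,
// which sit contiguously in the vector<string>.
bool lessDirect( const std::string &a, const std::string &b )
{
    return a < b;
}

// Sorting pointers: the comparator first loads each pointer from the
// pointer array and then dereferences it to reach the string object,
// an extra dependent memory access per operand; with a large, randomly
// permuted array that second access is usually a cache miss.
bool lessIndirect( const std::string *a, const std::string *b )
{
    return *a < *b;
}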


Marcel

Scott Lurndal

Nov 30, 2021, 9:58:18 AM
Marcel Mueller <news.5...@spamgourmet.org> writes:
>On 29.11.21 at 21:26, Bonita Montero wrote:
>> Why is sorting a pointer-array to strings slower than sorting the
>> strings themselves? I checked that with this code:
>[...]
>> I read 1.4 million strings with an average length of 9 characters,
>> so most strings should fall under the short string optimization,
>> and swapping the string objects should cost more than swapping
>> two pointers, or a pointer and a length.
>
>Swapping a few bytes is not the performance killer. Cache misses are the
>performance killer. Every time you touch a memory address not in the
>cache you get the latency of the RAM. And this latency has not improved
>significantly in the last two decades.

Define significantly. Mid-2000s server DRAM access latency could
be as much as 250ns (Intel Westmere), while modern DRAM latency is
closer to 50-60ns. That's roughly a five-fold improvement (granted,
Westmere was abnormally slow when accessing the second socket).

Bonita Montero

Nov 30, 2021, 10:42:28 AM
On 30.11.2021 at 15:58, Scott Lurndal wrote:

> Define significantly. Mid-2000s server DRAM access latency could
> be as much as 250ns (Intel Westmere), while modern DRAM latency is
> closer to 50-60ns. That's roughly a five-fold improvement (granted,
> Westmere was abnormally slow when accessing the second socket).

Westmere was 2010:
https://en.wikipedia.org/wiki/Westmere_(microarchitecture)
DRAM latency in the 2000s wasn't significantly worse, but the Pentium 4
CPUs of that time had 128-byte cache lines, fetching a cache line
with two bursts, which resulted in a higher latency.

Scott Lurndal

Nov 30, 2021, 11:03:15 AM
Bonita Montero <Bonita....@gmail.com> writes:
>On 30.11.2021 at 15:58, Scott Lurndal wrote:
>
>> Define significantly. Mid-2000s server DRAM access latency could
>> be as much as 250ns (Intel Westmere), while modern DRAM latency is
>> closer to 50-60ns. That's roughly a five-fold improvement (granted,
>> Westmere was abnormally slow when accessing the second socket).
>
>Westmere was 2010:
>https://en.wikipedia.org/wiki/Westmere_(microarchitecture)

Nehalem suffered from the same issue, and we had samples of
Westmere in 2008.

Bonita Montero

Nov 30, 2021, 11:19:50 AM
No, there was no CPU architecture except the Pentium 4 that had
128-byte cache lines.

Scott Lurndal

Nov 30, 2021, 11:46:59 AM
I never, ever, mentioned 128-byte cache lines. Why do you insist
on changing the topic, which was memory latency, which on Nehalem
was significantly higher than on today's CPUs (mainly due to inter-socket
communication latency, it being a NUMA architecture)?

Note also that there are current CPU architectures with 128-byte
cache lines, for example Octeon (MIPS) and OcteonTx (ARM64).

Bonita Montero

Nov 30, 2021, 11:58:30 AM
On 30.11.2021 at 17:46, Scott Lurndal wrote:

> I never, ever, mentioned 128-byte cache lines. Why do you insist
> on changing the topic, which was memory latency, which on Nehalem
> was significantly higher than on today's CPUs (mainly due to inter-socket
> communication latency, it being a NUMA architecture)?

DRAM access latency hasn't varied much since DDR1.
The snoops are faster than the parallel speculative fetches,
so NUMA isn't a problem here.

Marcel Mueller

Nov 30, 2021, 1:17:28 PM
On 30.11.21 at 15:58, Scott Lurndal wrote:
>> Swapping a few bytes is not the performance killer. Cache misses are the
>> performance killer. Every time you touch a memory address not in the
>> cache you get the latency of the RAM. And this latency has not improved
>> significantly in the last two decades.
>
> Define significantly. Mid-2000s server DRAM access latency could
> be as much as 250ns (Intel Westmere), while modern DRAM latency is
> closer to 50-60ns.

1. Westmere appeared in 2010.

2. Let's take the CAS latency as an example:
DDR1: typ. 15 ns
DDR4: typ. 14 ns
(e.g. DDR-400 CL3: 3 cycles / 200 MHz = 15 ns; DDR4-2666 CL19: 19 cycles / 1333 MHz ≈ 14 ns)
Other values like tRAS are similar.

3. Maybe some CPUs are badly designed and introduce additional latency.
But that is no general rule. Even an old AMD Athlon took only about 50 ns,
or 70 ns on a page miss.

Did you confuse it with the latency of NUMA architectures?


Marcel

Öö Tiib

Nov 30, 2021, 4:26:49 PM
On Monday, 29 November 2021 at 22:26:35 UTC+2, Bonita Montero wrote:
> Why is sorting a pointer-array to strings slower than sorting the
> strings themselves?

Because std::string itself is a kind of "smart pointer" to the text it
contains. Its swap after a compare costs next to nothing. So using
additional indirection and memory for raw pointers is a clear "premature
pessimization": more complicated code for no gain.
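
(A tiny illustration of that point, as a sketch; the exact object size depends on the standard library:)

#include <iostream>
#include <string>
#include <utility>

int main()
{
    std::string a( 9, 'a' ), b( 9, 'b' );   // both short enough for SSO
    std::cout << "sizeof(std::string): " << sizeof(std::string) << '\n';
    // swap() exchanges the fixed-size representations, at most a few
    // machine words, regardless of how long the contained text is.
    std::swap( a, b );
    std::cout << a << ' ' << b << '\n';
}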


Bonita Montero

Dec 1, 2021, 2:29:19 AM
I already said that I'm swapping strings that fit inside the string
objects themselves. They have an average length of 9 chars and fit
inside the small string optimization.

Öö Tiib

Dec 1, 2021, 4:36:18 PM
What difference does that supposedly make? You still use a quarter more
memory for those pointers, add one indirection, and have to access
all that memory for sorting. Pure pessimization. Why did you expect
it to be quicker?
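
(Worked out with the numbers from the original post, assuming 8-byte pointers and 32-byte string objects: 1.4 million strings occupy about 45 MB, and the pointer array adds roughly another 11 MB on top of that, a quarter more, all of which the sort has to touch.)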

Bonita Montero

Dec 2, 2021, 3:09:58 AM
On 01.12.2021 at 22:36, Öö Tiib wrote:
> On Wednesday, 1 December 2021 at 09:29:19 UTC+2, Bonita Montero wrote:
>> On 30.11.2021 at 22:26, Öö Tiib wrote:
>>> On Monday, 29 November 2021 at 22:26:35 UTC+2, Bonita Montero wrote:
>>>> Why is sorting a pointer-array to strings slower than sorting the
>>>> strings themselves?
>>>
>>> Because std::string itself is a kind of "smart pointer" to the text it
>>> contains. Its swap after a compare costs next to nothing. So using
>>> additional indirection and memory for raw pointers is a clear "premature
>>> pessimization": more complicated code for no gain.
>> I already said that I'm swapping strings that fit inside the string
>> objects themselves. They have an average length of 9 chars and fit
>> inside the small string optimization.
>
> What difference does that supposedly make? ...

It makes the difference that you swap the string contents byte by byte
and not with two pointer swaps. Loading or storing a byte costs the same
as loading or storing a pointer.

Juha Nieminen

Dec 2, 2021, 3:39:43 AM
If the string objects are indeed using short string optimization, I doubt
that the object contents are being swapped byte-by-byte. Most likely
8 bytes at a time (just like the pointers).

One advantage over using pointers to the objects is, of course, that you
don't get the extra indirection.
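
(A sketch of what "8 bytes at a time" means here, using a hypothetical 32-byte fixed-size representation rather than any real library's layout:)

#include <cstdint>
#include <utility>

// Hypothetical 32-byte string representation, for illustration only.
struct Repr
{
    std::uint64_t words[4];
};

void swapRepr( Repr &a, Repr &b )
{
    // Four 8-byte swaps, no per-character work.
    for( int i = 0; i < 4; ++i )
        std::swap( a.words[i], b.words[i] );
}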

Bonita Montero

Dec 2, 2021, 4:44:38 AM
With MSVC, clang and g++ the contents are swapped as four SSE
128-bit words; the size of the string, and whether its buffer is
external or internal, is ignored. But it's more effort to swap
four 128-bit words than two pointers.
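
(One way to check that claim is to compile a bare swap and inspect the generated code, for example with "g++ -O2 -S" or on Compiler Explorer; a minimal sketch:)

#include <string>
#include <utility>

// Inspect the code the compiler emits for this function; it shows how
// the library's swap actually moves the string representation.
void swapStrings( std::string &a, std::string &b )
{
    std::swap( a, b );
}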

Juha Nieminen

Dec 2, 2021, 5:27:34 AM
Bonita Montero <Bonita....@gmail.com> wrote:
> With MSVC, clang and g++ the contents are swapped as four SSE
> 128-bit words; the size of the string, and whether its buffer is
> external or internal, is ignored.

That's strange. 4x128 bits would be 64 bytes. I just checked what
sizeof(std::string) is with a very recent version of gcc 11, and
it was 32 bytes.

If compiling for a modern enough architecture, that could be a
single 256-bit (AVX) register swap. But even if it uses two 128-bit
registers, it doesn't make all that much of a difference.

> But it's more effort to swap four 128 bit
> words than two pointers.

You are forgetting that you are not only swapping the strings, you are
also comparing them. With pointers you'll be using an extra level of
indirection.