>> Note that the original question was about the case where the character
>> type is biggish, not `char` but perhaps `wchar_t` or `char32_t`.
>
> I changed my code to that:
>
> using tchar = char32_t;
> using tstring = ::std::basic_string< tchar >;
>
> static tchar first_unique_char_in( tstring const & s )
> { ::std::unordered_multiset< tchar >const chars( begin( s ), end( s ));
> for( tchar const ch : s )
> if( chars.count( ch ) == 1 )return ch;
> return tchar{ 0 }; }
>
> static tchar first_unique_char_of( tstring const & s )
> { auto const z = s.size();
> auto i = z; for( i = 0; i < z; ++i )
> { tchar const ch = s[ i ]; bool found = false;
> auto j = z; for( j = 0; j < z; ++j )
> if( s[ j ]== ch && i != j ){ found = true; break; }
> if( !found )return ch; }
> return tchar{ 0 }; }
>
> When the program runs, it blocks my computer. So, I cannot
> afford to have it run for hours. So, the maximum feasible
> length for a string for my tests is about 200'000. It gives
>
> dti = 3317031700
> dtf = 1800200
>
> That means, in this case, »first_unique_char_of« is more
> than 1000 times faster. (Unless I made an error, so feel
> free to reproduce this.)
You've apparently made some errors of methodology, I /think/, but still
this is a fascinating result and a very good question. :)
First, the methodology:
• To measure performance, always turn on optimization, e.g. `g++ -O3`.
Without optimization every function call is kept intact so that one can
step into it in a debugger, and that abstraction cost favors the less
abstracted code. The cost is usually a roughly fixed factor, but a huge
one.
• Always use reasonable data for what you test.
For example, testing with a string of 200 000 '#' would be unreasonable:
for such input the brute force double loop dismisses each character
after at most a couple of comparisons, because the very next '#' is
already a duplicate, while the multiset based function still builds and
queries a hash table holding all 200 000 values. A small sketch of this
follows right after this list.
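Here is that sketch; it is purely illustrative and not part of the
measured program further down, and `first_unique_char_of` /
`first_unique_char_in` refer to your two functions quoted above:
[code]
// Sketch only: with an all-'#' string there is no unique character, so both
// functions return 0, but they do very different amounts of work to get there.
#include <cassert>
#include <cstddef>
#include <string>

using tchar   = wchar_t;
using tstring = std::basic_string< tchar >;

int main()
{
    tstring const degenerate( 200'000, '#' );

    // What the brute force double loop (first_unique_char_of) effectively does
    // for this input: for every i it meets another '#' at j == 0 or j == 1,
    // sets found and breaks, i.e. only ~2n comparisons in total.
    for( std::size_t i = 0; i < degenerate.size(); ++i )
    {
        std::size_t const j = (i == 0? 1 : 0);      // First index different from i.
        assert( degenerate[j] == degenerate[i] );   // A duplicate, found at once.
    }
    // first_unique_char_in, in contrast, first inserts all 200 000 values into
    // its hash container before it can report that no count equals 1.
}
[/code]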
Still, with this methodology in place there's a completely weird
“breaking point” at about string size 200 000 on the machine I'm writing
this on, a really old Asus laptop running Windows 7. Below this
breaking point the O(n^2) brute force algorithm is unreasonably fast;
with larger strings it exhibits the expected O(n^2) sluggishness.
I discuss that below.
The code below is the code that I used to test. It bears witness to the
way I homed in on the truth, at first suspecting some really ungood
implementation of `std::unordered_multiset` in both g++ and Visual
C++. That turned out not to be the case, but I let the workaround stand:
[code]
#include <iostream>
#include <iterator> // std::(begin, end)
#include <locale.h> // setlocale
#include <stddef.h> // ptrdiff_t
#include <stdlib.h> // rand, srand
#include <string> // std::basic_string
#include <set> // std::multiset
#include <time.h> // clock
#include <unordered_map> // std::unordered_map
#include <map>
#include <unordered_set> // std::unordered_multiset
namespace alf {
    using Size = ptrdiff_t;

    // Workaround: a minimal stand-in for std::unordered_multiset that just
    // keeps a count per distinct value.
    template< class Type >
    class Unordered_multiset_
    {
    private:
        std::unordered_map< Type, int > values_;

    public:
        auto count( Type const& value ) const
            -> Size
        {
            auto const it = values_.find( value );
            return (it == values_.end()? 0 : it->second);
        }

        template< class Iterator >
        Unordered_multiset_( Iterator const first, Iterator const beyond )
        {
            //values_.reserve( beyond - first );  // Ungood, more slow.
            for( Iterator it = first; it != beyond; ++it )
            {
                ++values_[*it];
            }
        }
    };
}  // namespace alf
namespace stefan {
    using tchar = wchar_t;
    using tstring = ::std::basic_string< tchar >;

    template< class Value >
    //using Multiset_ = std::unordered_multiset< Value >;
    using Multiset_ = alf::Unordered_multiset_<Value>;

    static tchar first_unique_char_in( tstring const & s )
    {
        Multiset_< tchar > const chars( begin( s ), end( s ));
        for( tchar const ch : s )
        {
            if( chars.count( ch ) == 1 ) { return ch; }
        }
        return tchar{ 0 };
    }

    static tchar first_unique_char_of( tstring const & s )
    {
        auto const z = s.size();
        auto i = z;
        for( i = 0; i < z; ++i )
        {
            tchar const ch = s[ i ];
            bool found = false;
            auto j = z;
            for( j = 0; j < z; ++j )
            {
                if( s[ j ] == ch && i != j )
                {
                    found = true; break;
                }
            }
            if( !found ) return ch;
        }
        return tchar{ 0 };
    }
}  // namespace stefan
namespace alf {
    using namespace std;
    using Func = auto( stefan::tstring const& ) -> stefan::tchar;

    void test(
        char const funcname[],
        Func* const func,
        stefan::tstring const& s
        )
    {
        clock_t const start = clock();
        auto const result = func( s );
        (void) result;
        clock_t const end = clock();
        double const duration = 1.0*(end - start)/CLOCKS_PER_SEC;
        wcout << funcname << ": " << duration << endl;
    }
}  // namespace alf
#define TEST( f, s ) alf::test( #f, f, s )
auto data( int const n )
    -> stefan::tstring
{
    stefan::tstring s( n, '\0' );
    srand( 42 );
    for( int i = 0; i < n; ++i )
    {
        s[i] = static_cast<stefan::tchar>( rand() );
    }
    return s;
}
auto main( int, char** args )
    -> int
{
    int const n = [=]() -> int
    {
        if( args[1] == nullptr ) { return 200'000; }    // No size argument given.
        try{ return std::stoi( args[1] ); } catch( ... ) { return 200'000; }
    }();
    setlocale( LC_ALL, "" );
    auto const s = data( n );
    using namespace std;
    wcout << "Testing string with " << n << " units." << endl;
    TEST( stefan::first_unique_char_in, s );
    TEST( stefan::first_unique_char_of, s );
}
[/code]
As you can see I've reformatted your code, in order to be able to relate
it more easily to generated assembly.
Also note that a string of completely random values, as in this code,
would be very unkind to a set implementation optimized for ¹Zipf's law,
i.e. for the observation that the frequency of any value is inversely
proportional to its rank in the frequency table, e.g. that ASCII
characters are very much more frequent than obscure old Chinese glyphs.
With such an optimization involved the completely random data would
skew the results, but I think it's good enough here.
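For reference, a minimal sketch of how roughly Zipf-distributed test
data could be generated instead, reusing the `stefan::tstring` /
`stefan::tchar` names from the program above (I have not measured with
this; it is just to show the alternative):
[code]
// Minimal sketch, not measured: roughly Zipf-distributed test data, where
// value k is drawn with weight proportional to 1/(k + 1).
#include <random>
#include <vector>

auto zipfish_data( int const n, int const n_distinct_values = 32768 )
    -> stefan::tstring
{
    std::vector<double> weights( n_distinct_values );
    for( int k = 0; k < n_distinct_values; ++k )
    {
        weights[k] = 1.0/(k + 1);
    }
    std::mt19937 bits( 42 );
    std::discrete_distribution<int> rank( weights.begin(), weights.end() );

    stefan::tstring s( n, '\0' );
    for( int i = 0; i < n; ++i )
    {
        s[i] = static_cast<stefan::tchar>( rank( bits ) );
    }
    return s;
}
[/code]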
[results]
[C:\my\forums\clc++\054]
> a
Testing string with 200000 units.
stefan::first_unique_char_in: 0.015
stefan::first_unique_char_of: 0
[C:\my\forums\clc++\054]
> a 400000
Testing string with 400000 units.
stefan::first_unique_char_in: 0.027
stefan::first_unique_char_of: 4.861
[C:\my\forums\clc++\054]
> a 800000
Testing string with 800000 units.
stefan::first_unique_char_in: 0.05
stefan::first_unique_char_of: 25.453
[C:\my\forums\clc++\054]
> _
[/results]
With 200 000 chars, the brute force algorithm uses ~0 time.
Above that, doubling the size from 400K to 800K yields roughly the
fourfold increase in the now sluggish behavior that one expects from
O(n^2) (here 25.453/4.861 ≈ 5.2).
To investigate the odd extreme performance of the double loop for a
string below 200 000 characters, I generated annotated assembly code
with these commands:
[commands]
[C:\my\forums\clc++\054]
> g++ -c compare.cpp -O3 -g
[C:\my\forums\clc++\054]
> objdump -d -M intel -S compare.o >compare.s
[C:\my\forums\clc++\054]
> _
[/commands]
The hypothesis was that maybe g++ was smart enough to recognize that
with a short enough string it could replace the inner loop with
something based on a table lookup. This was before I made the string
size a command line argument. I just used a fixed size string at the
start of the investigation, so the size was known to the compiler.
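Such an optimization would essentially turn the double loop into a
counting table, roughly like this sketch (mine, just to show what I was
looking for in the assembly; it assumes the stefan:: names from the
program above plus <vector>):
[code]
// Sketch of the hypothesized table lookup rewrite; not anything g++ emitted.
#include <vector>

static stefan::tchar first_unique_char_counted( stefan::tstring const& s )
{
    // One counter per possible 16-bit wchar_t value.
    std::vector<int> counts( 65536, 0 );
    for( stefan::tchar const ch : s ) { ++counts[ unsigned( ch ) & 0xFFFFu ]; }
    for( stefan::tchar const ch : s )
    {
        if( counts[ unsigned( ch ) & 0xFFFFu ] == 1 ) { return ch; }
    }
    return stefan::tchar{ 0 };
}
[/code]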
However, apparently g++ generates just very straightforward assembly
(I've manually removed some very long symbolic label comments):
[assembly]
static tchar first_unique_char_of( tstring const & s )
{
0: 4c 8b 41 08 mov r8,QWORD PTR [rcx+0x8]
auto const z = s.size();
auto i = z;
for( i = 0; i < z; ++i )
4: 4d 85 c0 test r8,r8
7: 74 39 je 42
9: 48 8b 09 mov rcx,QWORD PTR [rcx]
c: 45 31 c9 xor r9d,r9d
f: 44 0f b7 11 movzx r10d,WORD PTR [rcx]
tchar const ch = s[ i ];
bool found = false;
auto j = z;
for( j = 0; j < z; ++j )
{
if( s[ j ]== ch && i != j )
13: 4d 85 c9 test r9,r9
tchar const ch = s[ i ];
16: 42 0f b7 04 49 movzx eax,WORD PTR [rcx+r9*2]
if( s[ j ]== ch && i != j )
1b: 74 06 je 23
1d: 66 44 39 d0 cmp ax,r10w
21: 74 16 je 39
23: 31 d2 xor edx,edx
for( j = 0; j < z; ++j )
25: 48 83 c2 01 add rdx,0x1
29: 4c 39 c2 cmp rdx,r8
2c: 74 17 je 45
if( s[ j ]== ch && i != j )
2e: 66 39 04 51 cmp WORD PTR [rcx+rdx*2],ax
32: 75 f1 jne 25
34: 4c 39 ca cmp rdx,r9
37: 74 ec je 25
for( i = 0; i < z; ++i )
39: 49 83 c1 01 add r9,0x1
3d: 4d 39 c1 cmp r9,r8
40: 75 d1 jne 13
found = true; break;
}
}
if( !found )return ch;
}
return tchar{ 0 };
42: 31 c0 xor eax,eax
44: c3 ret
}
45: c3 ret
[/assembly]
There's no special optimization here.
My main hypothesis is therefore that the buffer of the string of 200 000
characters, which with two-byte `wchar_t` is around 400 KB, fits in a
processor cache, while the buffer of just double that size doesn't.
But the effect seems too drastic for just that: going from an
unmeasurably small time to several seconds for a mere doubling of the
size?
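One way to probe that hypothesis separately from the algorithms would be
a small sweep like this (a sketch, assuming nothing beyond the standard
library, and not something I have run as part of the numbers above):
time a constant amount of repeated sequential scanning over buffers of
increasing size and look for a jump.
[code]
// Minimal sketch: time a fixed total amount of sequential work over buffers
// of increasing size, and look for a sudden jump in the time where the
// buffer stops fitting in a cache level.
#include <cstddef>
#include <ctime>
#include <iostream>
#include <vector>

int main()
{
    for( std::size_t n = std::size_t( 1 ) << 16; n <= (std::size_t( 1 ) << 22); n <<= 1 )
    {
        std::vector<unsigned short> buffer( n, 1 );
        std::size_t const passes = (std::size_t( 1 ) << 28)/n;   // Constant total work.
        unsigned long long sum = 0;

        std::clock_t const start = std::clock();
        for( std::size_t p = 0; p < passes; ++p )
        {
            for( unsigned short const v : buffer ) { sum += v; }
        }
        std::clock_t const end = std::clock();

        std::cout << n*sizeof( unsigned short ) << " bytes: "
                  << 1.0*(end - start)/CLOCKS_PER_SEC << " s"
                  << "  (checksum " << sum << ")\n";
    }
}
[/code]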
Cheers!, still baffled,
- Alf
References:
¹ https://simple.wikipedia.org/wiki/Zipf's_law