
The first non-duplicated letter in a string


Paul

Jan 29, 2017, 9:31:35 AM
The code below is intended to return the value of the first-occurring
unique character in a string, and to return the null character if no
characters are unique. This can be done quite simply with a vector of
length 256 because the number of chars is small. Imagine that the char type
is huge. Is the below code correct and efficient or is there something
a bit unpleasant about it?

Many thanks,
Paul

#include <iostream>
#include <list>
#include <string>
#include <unordered_map>

char findFirstUnique(const std::string& str)
{
    std::unordered_map<char, std::list<char>::iterator> Map;
    std::list<char> letters;
    for(char letter : str)
        if(Map.find(letter) == Map.end())
        {
            letters.push_back(letter);
            Map[letter] = --letters.end();
        }
        else if(Map[letter] != letters.end())
        {
            letters.erase(Map[letter]);
            Map[letter] = letters.end();
        }

    if(letters.empty())
    {
        std::cout << "All letters duplicated";
        return 0;
    }

    return letters.front();
}

Daniel

Jan 29, 2017, 1:18:41 PM
On Sunday, January 29, 2017 at 9:31:35 AM UTC-5, Paul wrote:
> The code below is intended to return the value of the first-occurring
> unique character in a string, and to return the null character if no
> characters are unique. This can be done quite simply with a vector of
> length 256 because the number of chars is small. Imagine that the char type
> is huge. Is the below code correct and efficient

Using maps or sets for this is going to be less efficient than sequence
containers. Also, I think you should be more explicit about the empty string or
all duplicates case. An alternative might be

std::pair<char, bool> findFirstUnique(const std::string& str)
{
    // Needs <algorithm> for std::sort and <utility> for std::pair.
    std::string s(str);
    std::sort(s.begin(), s.end());

    auto prev = s.begin();
    auto it = s.begin();

    if (s.begin() != s.end())
    {
        ++it;
        bool done = false;
        while (!done && it != s.end())
        {
            if (*it == *prev)
            {
                ++it;
            }
            else if ((it - prev) == 1)
            {
                done = true;
            }
            else
            {
                prev = it;
                ++it;
            }
        }
    }
    return ((it - prev) == 1) ? std::make_pair(*prev, true)
                              : std::make_pair('\0', false);
}

Note, however, that a function like this would be mostly useless today because
an std::string would be expected to contain UTF-8 sequences, so you should
probably modify it to return the first unique UTF-8 sequence.
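For what it's worth, a rough sketch of that variant (the name
findFirstUniqueUtf8 is made up here; it assumes well-formed UTF-8 and does
no validation): group the bytes into sequences by absorbing continuation
bytes of the form 10xxxxxx, then count whole sequences instead of chars.

#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

std::pair<std::string, bool> findFirstUniqueUtf8(const std::string& str)
{
    // Split str into UTF-8 sequences: a new sequence starts at any byte
    // that is not a continuation byte (10xxxxxx).
    std::vector<std::string> seqs;
    for (std::size_t i = 0; i < str.size(); )
    {
        std::size_t j = i + 1;
        while (j < str.size() &&
               (static_cast<unsigned char>(str[j]) & 0xC0) == 0x80)
            ++j;
        seqs.push_back(str.substr(i, j - i));
        i = j;
    }

    std::unordered_map<std::string, int> counts;
    for (const auto& seq : seqs)
        ++counts[seq];                         // pass 1: count each sequence
    for (const auto& seq : seqs)
        if (counts[seq] == 1)
            return std::make_pair(seq, true);  // pass 2: first unique sequence
    return std::make_pair(std::string(), false);
}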

Daniel

Paul

Jan 29, 2017, 1:41:05 PM
Thanks for your thoughts, Daniel.
I should have said this before, but the context for the
question is my attempt to learn about Data Structures and Algorithms
for the purpose of performing well in job interviews.

Rightly or wrongly, solutions with suboptimal worst-case time complexity
are always frowned upon. Sorting the string would be heavily criticised
for giving an O(N * log N) algorithm instead of an O(N) algorithm, where
N is the string length.


Paul

Paul

Jan 29, 2017, 1:52:29 PM
On Sunday, January 29, 2017 at 6:18:41 PM UTC, Daniel wrote:
I don't think this solves the same problem that I solved.
I meant that the task is to find the unique character that occurs
earliest. For example f("character") == 'h' because the non-repeaters
are 'h', 'e' and 't', and 'h' occurs at an earlier point in the string
than the other non-repeaters.

Paul

Paavo Helde

Jan 29, 2017, 2:04:21 PM
This is finding the smallest unique character, not the first occurring
one as specified in the task description.

std::sort() is also O(N*log N), which is worse than the OP's O(N)
approach using hash map operations, which have average constant complexity.

Cheers
Paavo





Daniel

Jan 29, 2017, 2:06:32 PM
On Sunday, January 29, 2017 at 1:52:29 PM UTC-5, Paul wrote:
>
> I don't think this solves the same problem that I solved.

You're right. I replied too quickly.

Daniel

Paavo Helde

Jan 29, 2017, 2:15:20 PM
On 29.01.2017 16:31, Paul wrote:
> The code below is intended to return the value of the first-occurring
> unique character in a string, and to return the null character if no
> characters are unique. This can be done quite simply with a vector of
> length 256 because the number of chars is small. Imagine that the char type
> is huge. Is the below code correct and efficient or is there something
> a bit unpleasant about it?
>
> Many thanks,
> Paul
>
> #include <iostream>
> #include <list>
> #include <string>
> #include <unordered_map>
>
> char findFirstUnique(const std::string& str)
> {
>     std::unordered_map<char, std::list<char>::iterator> Map;
>     std::list<char> letters;
>     for(char letter : str)
>         if(Map.find(letter) == Map.end())
>         {
>             letters.push_back(letter);
>             Map[letter] = --letters.end();
>         }
>         else if(Map[letter] != letters.end())
>         {
>             letters.erase(Map[letter]);
>             Map[letter] = letters.end();
>         }

These multiple map find and lookup operations could be replaced by a
single insert(). Not sure the optimizer is able to do this by itself.

>     if(letters.empty())
>     {
>         std::cout << "All letters duplicated";
>         return 0;
>     }
>
>     return letters.front();
> }

The algorithm itself seems OK, as std::list iterators are not
invalidated when deleting or adding other elements to the list. This
might actually be one of the very few valid use cases for std::list I
have ever seen.

About performance: the requirement 'char type is huge' basically says
that a solution with a lookup array is infeasible, so something like a
hash map seems to be needed indeed.

I have some suspicions about the std::list still. It is a very expensive
data structure, each node requires separate dynamic allocation which is
slow, plus the nodes are not guaranteed to be compact in the memory
which makes accessing it yet slower. I would get rid of the list and
just use a hash map which counts the occurrences of characters. This
would mean two passes: first count the occurrences, then in the second
pass find the first character with count 1. My gut feeling tells me this
would be faster than messing with a std::list, but YMMV.
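A minimal sketch of that two-pass idea, keeping the OP's signature and
null-character convention (needs <unordered_map>):

char findFirstUnique(const std::string& str)
{
    std::unordered_map<char, int> counts;
    for (char letter : str)
        ++counts[letter];              // pass 1: count occurrences
    for (char letter : str)
        if (counts[letter] == 1)       // pass 2: first letter seen exactly once
            return letter;
    return 0;                          // no unique letter
}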

Cheers
Paavo




Paul

Jan 30, 2017, 5:11:59 AM
On Sunday, January 29, 2017 at 7:15:20 PM UTC, Paavo Helde wrote:
...
> > for(char letter : str)
> >     if(Map.find(letter) == Map.end())
> >     {
> >         letters.push_back(letter);
> >         Map[letter] = --letters.end();
> >     }
> >     else if(Map[letter] != letters.end())
> >     {
> >         letters.erase(Map[letter]);
> >         Map[letter] = letters.end();
> >     }
>
> These multiple map find and lookup operations could be replaced by a
> single insert().
...

Thanks a lot for your feedback. I'm still not sure how to replace the
find and lookup operations by a single insert().

Could you (or someone else) possibly say a bit more about how to do this?

Many thanks,
Paul




Alf P. Steinbach

Jan 30, 2017, 6:01:52 AM
On 29.01.2017 15:31, Paul wrote:
> The code below is intended to return the value of the first-occurring
> unique character in a string, and to return the null character if no
> characters are unique. This can be done quite simply with a vector of
> length 256 because the number of chars is small. Imagine that the char type
> is huge. Is the below code correct and efficient or is there something
> a bit unpleasant about it?
>
> #include <iostream>
> #include <list>
> #include <string>
> #include <unordered_map>
>
> char findFirstUnique(const std::string& str)
> {
>     std::unordered_map<char, std::list<char>::iterator> Map;
>     std::list<char> letters;
>     for(char letter : str)
>         if(Map.find(letter) == Map.end())
>         {
>             letters.push_back(letter);
>             Map[letter] = --letters.end();
>         }
>         else if(Map[letter] != letters.end())
>         {
>             letters.erase(Map[letter]);
>             Map[letter] = letters.end();
>         }
>     if(letters.empty())
>     {
>         std::cout << "All letters duplicated";
>         return 0;
>     }
>     return letters.front();
> }

I would do this:

[code]
#include <iostream>
#include <string>           // std::basic_string
#include <iterator>         // std::(begin, end)
#include <locale.h>         // setlocale
#include <unordered_set>    // std::unordered_multiset

#define TXT( s ) L ## s

using Char = decltype( TXT( ' ' ) );
using String = std::basic_string<Char>;

auto first_unique_char_in( String const& s )
    -> Char
{
    std::unordered_multiset<Char> const chars( begin( s ), end( s ) );
    for( Char const ch : s )
    {
        if( chars.count( ch ) == 1 ) { return ch; }
    }
    return Char{ 0 };
}

auto main()
    -> int
{
    using namespace std;
    setlocale( LC_ALL, "" );
    wcout << first_unique_char_in( TXT( "Blah dah Bladihdah!" ) ) << endl;
}
[/code]

Cheers & hth.,

- Alf


Paul

Jan 30, 2017, 8:22:24 AM
Thanks, Alf.
I'm a bit concerned that this appears to be O(N * log(N)), where N is the
length of the string. I'm not saying that it's slower than my code,
by any means. However, solutions are always frowned upon if their
theoretical time complexities are suboptimal.

I have a more novice-friendly version of your idea below:

Paul

char findFirstUnique(const std::string& str)
{
    std::unordered_multiset<char> word(str.begin(), str.end());
    for(char c : str)
        if(word.count(c) == 1)
            return c;

    std::cout << "No unique characters\n";
    return 0;
}

Paavo Helde

Jan 30, 2017, 9:24:20 AM
On 30.01.2017 12:11, Paul wrote:
> On Sunday, January 29, 2017 at 7:15:20 PM UTC, Paavo Helde wrote:
> ...
>>> for(char letter : str)
>>>     if(Map.find(letter) == Map.end())
>>>     {
>>>         letters.push_back(letter);
>>>         Map[letter] = --letters.end();
>>>     }
>>>     else if(Map[letter] != letters.end())
>>>     {
>>>         letters.erase(Map[letter]);
>>>         Map[letter] = letters.end();
>>>     }
>>
>> These multiple map find and lookup operations could be replaced by a
>> single insert().
> ...
>
> Thanks a lot for your feedback. I'm still not sure how to replace the
> find and lookup operations by a single insert().

E.g. (note that std::tie needs <tuple>):

typedef std::unordered_map<char, std::list<char>::iterator> Map_t;
Map_t Map;
std::list<char> letters;

for (char letter : str) {
    Map_t::iterator iter;
    bool inserted;

    std::tie(iter, inserted) =
        Map.insert(std::make_pair(letter, letters.end()));

    if (inserted) {
        letters.push_back(letter);
        iter->second = --letters.end();
    } else if (iter->second != letters.end()) {
        letters.erase(iter->second);
        iter->second = letters.end();
    }
}
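For comparison, the then-upcoming C++17 would let one write the same thing
more compactly with structured bindings and unordered_map's try_emplace;
a sketch, assuming a C++17 compiler:

for (char letter : str) {
    auto [iter, inserted] = Map.try_emplace(letter, letters.end());
    if (inserted) {
        letters.push_back(letter);
        iter->second = --letters.end();
    } else if (iter->second != letters.end()) {
        letters.erase(iter->second);
        iter->second = letters.end();
    }
}

try_emplace also guarantees the mapped value is not constructed when the
key already exists.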


Paavo Helde

Jan 30, 2017, 9:39:08 AM
On 30.01.2017 15:30, Stefan Ram wrote:
> "Alf P. Steinbach" <alf.p.stein...@gmail.com> writes:
>> auto first_unique_char_in( String const& s )
>
> I wrote another implementation which I called »first_unique_char_of«
> and a microbenchmark to show that under some circumstances it might
> seem to be faster.
>
> #include <cassert>
> #include <chrono>
> #include <cstdlib>
> #include <ctime>
> #include <iostream>
> #include <iterator>
> #include <ostream>
> #include <string>
> #include <vector>
> #include <unordered_set>
>
> using namespace ::std::literals;
>
> static void escape( void * p )
> { asm volatile( "" : : "g"(p) : "memory" ); }
>
> static char first_unique_char_in( ::std::string const& s )
> { ::std::unordered_multiset< char >const chars( begin( s ), end( s ));
> for( char const ch : s )
> if( chars.count( ch ) == 1 )return ch;
> return char{ 0 }; }
>
> static char first_unique_char_of( ::std::string const& s )
> { auto const z = s.size();
> auto i = z; for( i = 0; i < z; ++i )
> { char const ch = s[ i ]; bool found = false;
> auto j = z; for( j = 0; j < z; ++j )
> if( i != j && s[ j ]== ch )found = true;
> if( !found )return ch; }
> return char{ 0 }; }
>
> static void fill( ::std::vector< ::std::string >& v )
> { static const char set[] =
> "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz !";
> for( int i = 0; i < 1000; ++i )
> { ::std::string s{};
> while( true )
> { if( rand() < RAND_MAX / 100 )break;
> s.push_back( set[ rand() %( sizeof( set )- 1 )] ); }
> v.push_back( s ); }}

If I am able to interpret this line noise correctly, then you are
claiming that a simple O(N*N) algorithm is performing better than an
O(N*log N) algorithm which is using more complex data structures.

This is the expected result, *for small N*. Increase your data size to
e.g. 10 times the L3 cache size, e.g. 80 MB, and test again.



Scott Lurndal

Jan 30, 2017, 11:27:50 AM
r...@zedat.fu-berlin.de (Stefan Ram) writes:
> Distribution through any means other than regular usenet
> channels is forbidden. It is forbidden to publish this
> article in the world wide web. It is forbidden to change
> URIs of this article into links. It is forbidden to remove
> this notice or to transfer the body without this notice.
>X-No-Archive: Yes
>Archive: no
>X-No-Archive-Readme: "X-No-Archive" is only set, because this prevents some
> services to mirror the article via the web (HTTP). But Stefan Ram
> hereby allows to keep this article within a Usenet archive server
> with only NNTP access without any time limitation.
>X-No-Html: yes
>Content-Language: en
>X-Received-Body-CRC: 612936975
>X-Received-Bytes: 4067
>
>"Alf P. Steinbach" <alf.p.stein...@gmail.com> writes:
>>auto first_unique_char_in( String const& s )
>
> I wrote another implementation which I called »first_unique_char_of«
> and a microbenchmark to show that under some circumstances it might
> seem to be faster.

It's completely unreadable.

Gareth Owen

Jan 30, 2017, 2:04:45 PM
Paul <peps...@gmail.com> writes:

> Thanks, Alf.
> I'm a bit concerned that this appears to be O(N * log(N)), where N is the
> length of the string. I'm not saying that it's slower than my code,
> by any means. However, solutions are always frowned upon if their
> theoretical time complexities are suboptimal.


// Doesn't use a complicated data structure, but guaranteed O(n)
// Assumes CHAR_BIT is 8
char findFirstUnique(const std::string& str){
    std::array<int,256> arr;
    if(str.length() == 0) return 0;
    for(int i = 0; i < str.length(); ++i){
        unsigned char ch = str[i];
        if(arr[ch] == 0) arr[ch] = i + 1;  // remember position + 1 (0 means "unseen")
        else arr[ch] = INT_MAX;            // seen more than once
    }
    int minval = INT_MAX;
    int minch = 0;
    for(int i = 0; i < 256; ++i){
        if(arr[i] > 0 && arr[i] < minval){ // unique, and earliest so far
            minval = arr[i];
            minch = (char)i;
        }
    }
    return minch;
}

Ian Collins

Jan 30, 2017, 2:14:41 PM
I thought at first glance it was a post from that Relf bloke...

--
Ian

Alf P. Steinbach

Jan 30, 2017, 4:59:51 PM
Where on Earth did you get that idea?

Cheers!,

- Alf

Alf P. Steinbach

Jan 30, 2017, 5:11:03 PM
On 30.01.2017 20:04, Gareth Owen wrote:
> Paul <peps...@gmail.com> writes:
>
>> Thanks, Alf.
>> I'm a bit concerned that this appears to be O(N * log(N)), where N is the
>> length of the string. I'm not saying that it's slower than my code,
>> by any means. However, solutions are always frowned upon if their
>> theoretical time complexities are suboptimal.
>
>
> // Doesn't use a complicated data structure, but guaranteed O(n)
> // Assumes CHAR_BIT is 8

Note that the original question was about the case where the character
type is biggish, not `char` but perhaps `wchar_t` or `char32_t`.


> char findFirstUnique(const std::string& str){
>     std::array<int,256> arr;

I don't think std::array has a defined default constructor. The original
Boost implementation was based on `array` being a POD so that it could
be initialized with C++03 curly braces. I think that's so still after
C++11, and if so then due to the assumption in the code below of zero
array values, you have Undefined Behavior.
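For reference, value-initializing the array with empty braces zeroes the
elements, which would avoid that issue:

std::array<int, 256> arr{};  // value-initialization: all elements start at zero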


>     if(str.length() == 0) return 0;
>     for(int i = 0; i < str.length(); ++i){
>         unsigned char ch = str[i];
>         if(arr[ch] == 0) arr[ch] = i + 1;  // remember position + 1 (0 means "unseen")
>         else arr[ch] = INT_MAX;            // seen more than once
>     }
>     int minval = INT_MAX;
>     int minch = 0;
>     for(int i = 0; i < 256; ++i){
>         if(arr[i] > 0 && arr[i] < minval){ // unique, and earliest so far
>             minval = arr[i];
>             minch = (char)i;
>         }
>     }
>     return minch;
> }


Cheers!,

- Alf

Alf P. Steinbach

Jan 31, 2017, 7:33:07 PM
On 31.01.2017 15:46, Stefan Ram wrote:
> "Alf P. Steinbach" <alf.p.stein...@gmail.com> writes:
>> Note that the original question was about the case where the character
>> type is biggish, not `char` but perhaps `wchar_t` or `char32_t`.
>
> I changed my code to that:
>
> using tchar = char32_t;
> using tstring = ::std::basic_string< tchar >;
>
> static tchar first_unique_char_in( tstring const & s )
> { ::std::unordered_multiset< tchar >const chars( begin( s ), end( s ));
> for( tchar const ch : s )
> if( chars.count( ch ) == 1 )return ch;
> return tchar{ 0 }; }
>
> static tchar first_unique_char_of( tstring const & s )
> { auto const z = s.size();
> auto i = z; for( i = 0; i < z; ++i )
> { tchar const ch = s[ i ]; bool found = false;
> auto j = z; for( j = 0; j < z; ++j )
> if( s[ j ]== ch && i != j ){ found = true; break; }
> if( !found )return ch; }
> return tchar{ 0 }; }
>
> When the program runs, it blocks my computer. So, I cannot
> afford to have it run for hours. So, the maximum feasible
> length for a string for my tests is about 200'000. It gives
>
> dti = 3317031700
> dtf = 1800200
>
> That means, in this case, »first_unique_char_of« is more
> than 1000 times faster. (Unless I made an error, so feel
> free to reproduce this.)

You've apparently made some errors of methodology, I /think/, but still
this is a fascinating result and a very good question. :)

First, the methodology:

• To measure performance, always turn on optimization, e.g. `g++ -O3`.

Otherwise there is an abstraction cost for function calls, to make
sure that one can delve into them in a debugger. This favors non-
abstracted code. The cost is usually a fixed factor, but a huge one.

• Always use reasonable data for what you test.

For example, testing with a string of 200 000 '#' would be
unreasonable, because one function would use just a couple of
machine code instructions to find that single value '#', while
the other would delve into a complex data structure first.

Still with this methodology in place there's a completely weird
“breaking point” at about string size 200 000 on the machine I'm writing
this on, a really old Asus laptop running Windows 7. Before this
breaking point the O(n^2) brute force algorithm is unreasonably fast.
With larger strings it exhibits the expected O(n^2) sluggishness.

I discuss that below.

The code below is the code that I used to test. It bears witness to the
way I homed in on the truth, at first suspecting some really ungood
implementation of `std::unordered_multiset` with both g++ and Visual
C++. That turned out not to be the case, but I let the workaround stand:


[code]
#include <iostream>
#include <iterator>         // std::(begin, end)
#include <locale.h>         // setlocale
#include <stddef.h>         // ptrdiff_t
#include <stdlib.h>         // rand, srand
#include <string>           // std::basic_string
#include <set>              // std::multiset
#include <time.h>           // clock
#include <unordered_map>    // std::unordered_map
#include <map>
#include <unordered_set>    // std::unordered_multiset

namespace alf {
    using Size = ptrdiff_t;

    template< class Type >
    class Unordered_multiset_
    {
    private:
        std::unordered_map< Type, int > values_;

    public:
        auto count( Type const& value ) const
            -> Size
        {
            auto const it = values_.find( value );
            return (it == values_.end()? 0 : it->second);
        }

        template< class Iterator >
        Unordered_multiset_( Iterator const first, Iterator const beyond )
        {
            //values_.reserve( beyond - first ); // Ungood, more slow.
            for( Iterator it = first; it != beyond; ++it )
            {
                ++values_[*it];
            }
        }
    };
}  // namespace alf

namespace stefan {
    using tchar = wchar_t;
    using tstring = ::std::basic_string< tchar >;

    template< class Value >
    //using Multiset_ = std::unordered_multiset< Value >;
    using Multiset_ = alf::Unordered_multiset_<Value>;

    static tchar first_unique_char_in( tstring const & s )
    {
        Multiset_< tchar > const chars( begin( s ), end( s ) );
        for( tchar const ch : s )
        {
            if( chars.count( ch ) == 1 ) { return ch; }
        }
        return tchar{ 0 };
    }

    static tchar first_unique_char_of( tstring const & s )
    {
        auto const z = s.size();
        auto i = z;
        for( i = 0; i < z; ++i )
        {
            tchar const ch = s[ i ];
            bool found = false;
            auto j = z;
            for( j = 0; j < z; ++j )
            {
                if( s[ j ] == ch && i != j )
                {
                    found = true; break;
                }
            }
            if( !found ) return ch;
        }
        return tchar{ 0 };
    }
}

namespace alf {
    using namespace std;

    using Func = auto( stefan::tstring const& ) -> stefan::tchar;

    void test(
        char const funcname[],
        Func* const func,
        stefan::tstring const& s
        )
    {
        time_t const start = clock();
        auto const result = func( s );
        (void) result;
        time_t const end = clock();
        double const duration = 1.0*(end - start)/CLOCKS_PER_SEC;
        wcout << funcname << ": " << duration << endl;
    }
}

#define TEST( f, s ) alf::test( #f, f, s )

auto data( int const n )
    -> stefan::tstring
{
    stefan::tstring s( n, '\0' );
    srand( 42 );
    for( int i = 0; i < n; ++i )
    {
        s[i] = static_cast<stefan::tchar>( rand() );
    }
    return s;
}

auto main( int, char** args )
    -> int
{
    int const n = [=]
    {
        try { return std::stoi( args[1] ); }
        catch( ... ) { return 200'000; }
    }();
    setlocale( LC_ALL, "" );
    auto const s = data( n );

    using namespace std;
    wcout << "Testing string with " << n << " units." << endl;
    TEST( stefan::first_unique_char_in, s );
    TEST( stefan::first_unique_char_of, s );
}
[code]


As you can see I've reformatted your code, in order to be able to relate
it more easily to generated assembly.

Also note that a string of completely random values, as in this code,
would be very unkind to a set implementation optimized for ¹Zipf's law,
that the frequency of any value is inversely proportional to its rank in
the frequency table, e.g. that ASCII characters should be very much more
frequent than obscure old Chinese glyphs, say. So with such optimization
involved the completely random data would skew the results. But I think
it's good enough here.


[results]
[C:\my\forums\clc++\054]
> a
Testing string with 200000 units.
stefan::first_unique_char_in: 0.015
stefan::first_unique_char_of: 0

[C:\my\forums\clc++\054]
> a 400000
Testing string with 400000 units.
stefan::first_unique_char_in: 0.027
stefan::first_unique_char_of: 4.861

[C:\my\forums\clc++\054]
> a 800000
Testing string with 800000 units.
stefan::first_unique_char_in: 0.05
stefan::first_unique_char_of: 25.453

[C:\my\forums\clc++\054]
> _
[/results]


With 200 000 chars, the brute force algorithm uses ~0 time.

Above that, doubling from 400K to 800K yields a fourfold increase in
the now sluggish behavior, as expected from O(n^2).

To investigate the odd extreme performance of the double loop for a
string below 200 000 characters, I generated annotated assembly code
with these commands:


[commands]
[C:\my\forums\clc++\054]
> g++ -c compare.cpp -O3 -g

[C:\my\forums\clc++\054]
> objdump -d -M intel -S compare.o >compare.s

[C:\my\forums\clc++\054]
> _
[/commands]


The hypothesis was that maybe g++ was smart enough to recognize that
with a short enough string it could replace the inner loop with
something based on a table lookup. This was before I made the string
size a command line argument. I just used a fixed size string at the
start of the investigation, so the size was known to the compiler.

However, apparently g++ generates just very straightforward assembly
(I've manually removed some very long symbolic label comments):


[assembly]
static tchar first_unique_char_of( tstring const & s )
{
0: 4c 8b 41 08 mov r8,QWORD PTR [rcx+0x8]
auto const z = s.size();
auto i = z;
for( i = 0; i < z; ++i )
4: 4d 85 c0 test r8,r8
7: 74 39 je 42
9: 48 8b 09 mov rcx,QWORD PTR [rcx]
c: 45 31 c9 xor r9d,r9d
f: 44 0f b7 11 movzx r10d,WORD PTR [rcx]
tchar const ch = s[ i ];
bool found = false;
auto j = z;
for( j = 0; j < z; ++j )
{
if( s[ j ]== ch && i != j )
13: 4d 85 c9 test r9,r9
tchar const ch = s[ i ];
16: 42 0f b7 04 49 movzx eax,WORD PTR [rcx+r9*2]
if( s[ j ]== ch && i != j )
1b: 74 06 je 23
1d: 66 44 39 d0 cmp ax,r10w
21: 74 16 je 39
23: 31 d2 xor edx,edx
for( j = 0; j < z; ++j )
25: 48 83 c2 01 add rdx,0x1
29: 4c 39 c2 cmp rdx,r8
2c: 74 17 je 45
if( s[ j ]== ch && i != j )
2e: 66 39 04 51 cmp WORD PTR [rcx+rdx*2],ax
32: 75 f1 jne 25
34: 4c 39 ca cmp rdx,r9
37: 74 ec je 25
for( i = 0; i < z; ++i )
39: 49 83 c1 01 add r9,0x1
3d: 4d 39 c1 cmp r9,r8
40: 75 d1 jne 13
found = true; break;
}
}
if( !found )return ch;
}
return tchar{ 0 };
42: 31 c0 xor eax,eax
44: c3 ret
}
45: c3 ret
[/assembly]


There's no special optimization here.

My main hypothesis is therefore that the buffer of the string of 200 000
characters, which would be around 800 KB, fits in a processor cache,
while the buffer of just double that size doesn't.

But the effect seems to be too drastic for just that, at least a
millionfold improvement?


Cheers!, still baffled,

- Alf

References:
¹ https://simple.wikipedia.org/wiki/Zipf's_law

Ralf Goertz

Feb 1, 2017, 2:22:26 AM
Am Wed, 1 Feb 2017 01:32:03 +0100
schrieb "Alf P. Steinbach" <alf.p.stein...@gmail.com>:

> int const n = [=]{ try{ return std::stoi( args[1] ); }
> catch(...) { return 200'000; } }();

I get a syntax error:
error: missing terminating ' character
return 200'000; } }();
^


Why don't you? Is it a locale issue? Or c++17?

Alf P. Steinbach

Feb 1, 2017, 2:38:57 AM
C++14 introduced digit group separators (you can ¹add an apostrophe
between any pair of digits), and binary literals.

Unfortunately my compilers choke on `200'000` as user input. I think
that the standard library wasn't updated to follow the core language
syntax. But I'm not sure, I haven't checked the standard.
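A two-line illustration of the split between the core language and the
library conversion functions:

int a = 200'000;                  // OK: C++14 digit separator in a literal
int b = std::stoi( "200'000" );   // returns 200: conversion stops at the apostrophe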


Cheers!,

- Alf

References:
¹ <url: http://en.cppreference.com/w/cpp/language/integer_literal>

David Brown

Feb 1, 2017, 3:16:32 AM
On 01/02/17 02:45, Stefan Ram wrote:
> "Alf P. Steinbach" <alf.p.stein...@gmail.com> writes:
>> - To measure performance, always turn on optimization, e.g. `g++ -O3`.
>
> I did use
>
> -msse2 -march=native -Ofast -O3 -g -fgcse -fgcse-las
>
> (maybe I should still remove »-g« here, these are just my
> one-size-fits-all default settings.)
>

That's not a great set of flags - it shows a certain confusion about
what they actually do.

If you have "-march=native", the compiler will use all the instructions
it can based on the processor doing the compiling. "-msse2" is redundant.

"-Ofast" enables a wide set of optimisations, including everything in
"-O3" - so "-O3" is redundant. Note that it also turns off certain
strict compliance support, such as turning on "-ffast-math", which can
make floating point significantly faster at the expense of IEEE standard
handling of rounding, ordering, NaNs, etc.

"-g" is fine - it does not limit code generation.

"-fgcse" is redundant as it is included in -O2

"-fgcse-las" is an odd choice. It is an optimisation pass that is not
enabled by any "-O" flag. The two common reasons for not including the
pass in a "-O" level are that the pass sometimes makes code less
efficient, or that it can generate code that does not work the way the
programmer might expect (since not all programmers fully understand the
language). In this case, code that writes and then reads data through
incompatible pointers might fail. Typically these are passes that have
relatively little benefit for most code.

In general, these extra flags are usually unnecessary. But if you are
trying to get the /very/ best code, you might want to run tests with a
combination of them to see what works best for the code in question.
There are quite a lot of such flags - I don't know why you single out
this one.


That all leaves you with:

-march=native -Ofast -g

as a better choice of default flags.


David Brown

Feb 1, 2017, 3:17:04 AM
I forgot to add warnings:

-march=native -Ofast -g -Wall -Wextra



Robert Wessel

Feb 1, 2017, 4:24:40 AM
On Wed, 1 Feb 2017 08:37:57 +0100, "Alf P. Steinbach"
<alf.p.stein...@gmail.com> wrote:

>On 01.02.2017 08:22, Ralf Goertz wrote:
>> Am Wed, 1 Feb 2017 01:32:03 +0100
>> schrieb "Alf P. Steinbach" <alf.p.stein...@gmail.com>:
>>
>>> int const n = [=]{ try{ return std::stoi( args[1] ); }
>>> catch(...) { return 200'000; } }();
>>
>> I get a syntax error:
>> error: missing terminating ' character
>> return 200'000; } }();
>> ^
>>
>>
>> Why don't you? Is it a locale issue? Or c++17?
>
>C++14 introduced digit group separators (you can ¹add an apostrophe
>between any pair of digits), and binary literals.
>
>Unfortunately my compilers choke on `200'000` as user input. I think
>that the standard library wasn't updated to follow the core language
>syntax. But I'm not sure, I haven't checked the standard.


The input functions are basically unchanged, and don't accept digit
separators (at least not as digit separators).

Tim Rentsch

Feb 1, 2017, 9:43:09 AM
"Alf P. Steinbach" <alf.p.stein...@gmail.com> writes:

> On 29.01.2017 15:31, Paul wrote:
>> [..to find the first-occurring unique character in a string..]
>>
>
> I would do this:
>
> [...]
>
> auto first_unique_char_in( String const& s )
> -> Char
> {
> std::unordered_multiset<Char> const chars( begin( s ), end( s ) );
> for( Char const ch : s )
> {
> if( chars.count( ch ) == 1 ) { return ch; }
> }
> return Char{ 0 };
> }

This is a nice solution. Maybe not the fastest, but very
likely fast enough, and very nicely simple and obviously
right.

Tim Rentsch

Feb 1, 2017, 12:33:50 PM
> First, the methodology: [...]
>
> * Always use reasonable data for what you test. [...]
>
> Still with this methodology in place there's a completely weird
> "breaking point" at about string size 200 000 on the machine I'm
> writing this on, a really old Asus laptop running Windows 7. Before
> this breaking point the O(n^2) brute force algorithm is unreasonably
> fast. With larger strings it exhibits the expected O(n^2)
> sluggishness.

Nice analysis. My compliments on an interesting posting.

> I discuss that below.
>
> The code below is the code that I used to test. It bears witness to
> the way I homed in on the truth, at first suspecting some really
> ungood implementation of `std::unordered_multiset` with both g++ and
> Visual C++. That turned out not to be the case, but I let the
> workaround stand:
>
> [code] .. [you forgot the / in [/code] :)]
>
> [N]ote that a string of completely random values, as in this code,
> would be very unkind to a set implementation optimized for [1]Zipf's
> law, that the frequency of any value is inversely proportional to its
> rank in the frequency table, e.g. that ASCII characters should be very
> much more frequent than obscure old Chinese glyphs, say. So with such
> optimization involved the completely random data would skew the
> results. But I think it's good enough here.

The string being initialized to random values is important!
Continued below...

> [results] [...] [/results]
>
> With 200 000 chars, the brute force algorithm uses ~0 time.
>
> Above that, doubling from 400K to 800K yields a four time increase in
> the now sluggish behavior, as expected from O(n^2).
>
> [...special optimization hypothesis explored and rejected...]
>
> My main hypothesis is therefore that the buffer of the string of 200
> 000 characters, which would be around 800 KB, fits in a processor
> cache, while the buffer of just double that size doesn't.
>
> But the effect seems to be too drastic for just that, at least a
> millionfold improvement?

What you may be seeing is a consequence of using rand() and the
particular value of RAND_MAX on that system. If RAND_MAX is large
(eg, 2**31-1), using rand() to initialize the string is very kind
to the first_unique_char_of() algorithm. The reason is, in most
cases the first unique character will be the first or second
character in the string. Scanning 200,000 characters, whether once
or twice, takes almost no time.

If you investigate different values of RAND_MAX (eg, by using
'rand() & 0xffff'), I think you'll find that there is a knee in the
curve in different places (for where the first unique value occurs,
and hence also the running time), depending on how many distinct
random values there are.

I suggest time trials using a different initialization method.
For example, a string of length 2n+1 might be initialized to (if
you see what I mean):

1..n, 1..n, n+1

whereupon you should be able to see evidence of quadratic behavior
with much smaller string sizes (eg, n = 100 and n = 200), and say
one million iterations of calling first_unique_char_of().
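A sketch of that initializer, reusing the thread's stefan:: aliases (the
helper name is invented here):

// Builds 1..n, 1..n, n+1: every value occurs twice except n+1,
// so the only unique value sits at the very end of the string.
stefan::tstring two_copies_plus_one( int const n )
{
    stefan::tstring s;
    for( int pass = 0; pass < 2; ++pass )
        for( int k = 1; k <= n; ++k )
            s.push_back( static_cast<stefan::tchar>( k ) );
    s.push_back( static_cast<stefan::tchar>( n + 1 ) );
    return s;
}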

Tim Rentsch

Feb 1, 2017, 2:00:27 PM
r...@zedat.fu-berlin.de (Stefan Ram) writes:

> "Alf P. Steinbach" <alf.p.stein...@gmail.com> writes:
>> - Always use reasonable data for what you test.
>
> We would need to know from the OP what he has intended this
> algorithm for. I assume that in many programs many strings are
> sized well below 100'000 and use less than 1'000 of all the code
> points of Unicode.

Still more than enough for quadratic behavior to be slower
than linear behavior, even if the linear behavior has a
constant factor of 100.

>> this on, a really old Asus laptop running Windows 7. Before this
>> breaking point the O(n^2) brute force algorithm is unreasonably
>> fast. With larger strings it exhibits the expected O(n^2)
>> sluggishness.
>
> I expected the brute force algorithm to be quite fast because its
> inner loop has high cache locality and perfect predictability of
> its access pattern for the prefetcher.

That doesn't matter nearly as much as how many times that loop
is executed. In cases where the double loop algorithm is faster,
it is faster because the inner loop isn't executed very many
times.

>> s[i] = static_cast<stefan::tchar>( rand() );
>
> Sometimes, RAND_MAX is as small as 32767, so this might not fill
> the whole range of a tchar, which might have 32 bits, but then,
> realistic data might not fill it either.

Based on my tests, a smaller value of RAND_MAX will tend to be
worse for the double loop algorithm, because duplicate values
usually make it slow down.

>> But the effect seems to be too drastic for just that, at least a
>> millionfold improvement?
>
> When the string is small, the probability that some character is
> unique is high and the outer loop can exit early.

If the string is small (say, less than 25 characters), it won't
matter if an O( n**2 ) algorithm is used. But even relatively
short strings (a few hundred characters) can get bogged down,
depending mostly on the characters. The relevant question is
_where_ does the first unique character occur? If it's the
first character in the string, the double loop algorithm is
very fast. Somewhere up around the 20th or 30th character,
the double loop algorithm starts to lose to the algorithms
that go through the string a (small) fixed number of times.

> [...]
>
> Rule 3. Fancy algorithms are slow when n is small, and n is
> usually small. Fancy algorithms have big constants. Until
> you know that n is frequently going to be big, don't get fancy.
> - Rob Pike

This quote isn't really apt, because that isn't what's going on
here. The double loop algorithm is faster _on some inputs_, but
O( n**2 ) for other inputs, and even for shortish strings this
matters. If it were known that all the inputs were 25 characters
or less, the double loop algorithm would be fine. Above that, it
depends on the strings in question. Here we may need a fancier
(but still not all that fancy) algorithm not because of how large
n is but because of what the input values might be. Insertion
sort can be very fast on some inputs, but it still is a poor
choice for more than a hundred elements or so. The same kind
of reasoning applies to the find-first-unique question.

Alf P. Steinbach

Feb 1, 2017, 8:17:27 PM
On 01.02.2017 20:00, Tim Rentsch wrote:
> If the string is small (say, less than 25 characters), it won't
> matter if an O( n**2 ) algorithm is used. But even relatively
> short strings (a few hundred characters) can get bogged down,
> depending mostly on the characters. The relevant question is
> _where_ does the first unique character occur? If it's the
> first character in the string, the double loop algorithm is
> very fast. Somewhere up around the 20'th or 30'th character,
> the double loop algorithm starts to lose to the algorithms
> that go through the string a (small) fixed number of times.

Yeah, I should (ideally) test again with better test data, not data
where all values are likely unique.

I mentioned Zipf's law when I posted that test, but somehow didn't
trivially connect the dots.

Grumble.


Cheers!,

- Alf

Ben Bacarisse

Feb 4, 2017, 6:28:59 AM
r...@zedat.fu-berlin.de (Stefan Ram) writes:

> r...@zedat.fu-berlin.de (Stefan Ram) writes:
>>I do not claim that I fully understand the reasons for the
>>execution time of certain parts of the program, I will not
>>have time to delve deeper into it; but still, I tried to
>>optimize my code.
>
> Ok, we can consider the following cases:
>
> First Case:
>
> The string is artificially manufactured to have no
> repetitions of characters.
<snip>
> Second Case:
>
> The string is artificially manufactured to have only
> repetitions of characters. I.e., it only consists of
> the repetition of one character. This is kind of like
> the opposite of the preceding case.
<snip>
> Third Case:
>
> The string is filled with random characters that remotely
> resemble English text, i.e., with the common characters
> and digits. This is the worst case for my algorithm I have
> seen so far.

I think the actual worst-case is a string that is the concatenation of
two copies of a type one string (no repeated characters). It would be
interesting to see this case tested as well.

By the way, I have also found this thread very interesting.

<snip>
> Feel free to do your own tests guys using the following
> source code.

I don't have time at the moment, but maybe later.

<snip>
--
Ben.