Re: Breakthrough

MelissA

unread,

Jun 4, 2012, 8:20:01 AM6/4/12

to

On Mon, 04 Jun 2012 13:38:47 +0200
jacob navia <ja...@spamsink.net> wrote:

> Hi
>
> In a recent discussion ("Open letter to Ian Collins") we found out
> that the main problem in the C containers library was the time taken
> by the Sort function.
>
> The reasons for that is that it was a general sort function calling a
> general comparison function that uses memcmp...
>
> The first step to improve performance was to write data type specific
> functions using a data type generic file. As a bonus, we get compile
> time checking and better syntax.
>
> Still, the generic sort function was taking a lot of time.
>
> I have written then a data type generic quick sort function that
> receives as a parameter a macro that is used to compare two elements.
> This vastly improves performance.
>
> The times are:
>
> Generic sort function without any specialized heap
> (using malloc)
> real 0m17.029s
> user 0m16.654s
> sys 0m0.368s
>
> Generic sort function using a specialized heap
> (reduces the malloc overhead)
> real 0m11.759s
> user 0m11.500s
> sys 0m0.252s
>
> "Templated" sort function using a macro parameter
> (reduces comparisons to a simple expression)
> real 0m6.453s
> user 0m6.165s
> sys 0m0.281s
>
> The expression used to compare two list elements is:
> #define COMPARE_EXPRESSION(A, B) \
> ((*B)->Data > (*A)->Data ? -1 : (*B)->Data != (*A)->Data)
>
> I thank Pete for that proposal, I wouldn't have come to it
> alone :-)
>
> I would like to have the corresponding C++ times but I fear that
> if I write it it would be unfair since I am not a C++ wizard...
> Here is the C code for the example:
>
> #include <stdlib.h>
> #include "containers.h"
> #include "intlist.h"
> #define N 10000000
> int main(void)
> {
> intList *L1;
> size_t i;
> int d;
> long long sum=0,newSum=0;
> Iterator *it;
> int *pint;
>
> L1 = iintList.Create(sizeof(int));
> iintList.UseHeap(L1,NULL); // Use specialized Heap
> // Add N random numbers to the integer list
> for (i=0; i<N;i++) {
> d = rand();
> sum += d;
> iintList.Add(L1,d);
> }
> // Sort it
> iintList.Sort(L1);
> // Go through the sorted list using an iterator
> it = iintList.NewIterator(L1);
> i=0;
> for (pint = it->GetFirst(it); pint; pint = it->GetNext(it)) {
> newSum += *pint;
> }
> // Verify that both sums are identical
> if (sum != newSum)
> printf("Failed\n");
> else printf("Sum = %lld\n",sum);
> // Destroy the list
> iintList.Finalize(L1);
> }
>
> I have uploaded the latest code to:
>
> http://code.google.com/p/ccl/
>
> There you will find (in listgen.c) the "templated" version of quick
> sort in C.
>

Here are timings on mu computer.

bmaxa@maxa:~/jacob/ccl$ time ./cppvec
Sum = 10738138201479754

real 0m0.899s
user 0m0.832s
sys 0m0.064s
bmaxa@maxa:~/jacob/ccl$ time ./cpplist
Sum = 10738138201479754

real 0m7.493s
user 0m7.344s
sys 0m0.136s
bmaxa@maxa:~/jacob/ccl$ time ./testlist
Sum = 10738138201479754

real 0m4.320s
user 0m4.180s
sys 0m0.140s

Sources:
---- cppvec.cpp

#include <vector>
#include <cstdio>
#include <cstdlib>
#include <algorithm>

#define N 10000000

int main()
{
std::vector<int> L1;
long long sum=0,newSum=0;
for(int i = 0; i < N; ++i)
{
int d = rand();
sum+=d;
L1.push_back(d);
}

std::sort(L1.begin(),L1.end());

for(auto it : L1)
{
newSum+= it;
}

if (sum != newSum)
printf("Failed\n");
else printf("Sum = %lld\n",sum);

}

----cpplist.cpp

#include <list>
#include <cstdio>
#include <cstdlib>

#define N 10000000

int main()
{
std::list<int> L1;
long long sum=0,newSum=0;
for(int i = 0; i < N; ++i)
{
int d = rand();
sum+=d;
L1.push_back(d);
}

L1.sort();

for(auto it : L1)
{
newSum+= it;
}

if (sum != newSum)
printf("Failed\n");
else printf("Sum = %lld\n",sum);

}

Compile with -std=c++0x for gcc 4.6 or -std=c++11 for 4.7

As you can see your program is faster than list version
of C++, but is less elegant ;)

Greets!

Zoltan Juhasz

unread,

Jun 4, 2012, 10:08:45 AM6/4/12

to

On Monday, 4 June 2012 08:20:01 UTC-4, MelissA wrote:
> Here are timings on mu computer.
>
> bmaxa@maxa:~/jacob/ccl$ time ./cppvec
> Sum = 10738138201479754
>
> real 0m0.899s
> user 0m0.832s
> sys 0m0.064s
> bmaxa@maxa:~/jacob/ccl$ time ./cpplist
> Sum = 10738138201479754
>
> real 0m7.493s
> user 0m7.344s
> sys 0m0.136s
> bmaxa@maxa:~/jacob/ccl$ time ./testlist
> Sum = 10738138201479754
>
> real 0m4.320s
> user 0m4.180s
> sys 0m0.140s
>

I've got two problems with the analysis:

1. You are measuring a lot more than the sort. Narrow the scope of measurement down to the portion you want to measure.
2. Assuming that the C version uses doubly-linked list (just like the C++ version), the C version uses a small-object allocator to reduce the malloc's header overhead. To do a fair comparison, you'll need to do the same for the C++ version; a small-object allocator might enhance locality dramatically, plus eliminates the 8/16 byte overhead per allocation introduced by malloc (at least some glibc malloc does that).

I would expect that you'll end up seeing the same result.

MelissA

unread,

Jun 4, 2012, 10:18:39 AM6/4/12

to

timings without sort:

bmaxa@maxa:~/jacob/ccl$ time ./cpplist
Sum = 10738138201479754

real 0m0.626s
user 0m0.504s
sys 0m0.120s

bmaxa@maxa:~/jacob/ccl$ time ./testlist
Sum = 10738138201479754

real 0m0.440s
user 0m0.316s
sys 0m0.120s

So it's pretty small difference in allocation. Biggest time is spent
in sort function. After sort, traverse can be also slower, if sort
changes order of nodes.

jacob navia

unread,

Jun 4, 2012, 10:20:38 AM6/4/12

to

Le 04/06/12 16:08, Zoltan Juhasz a écrit :

(1)
The problem is that in older versions of C++ there isn't a single
linked list...

(2)
How would you use a small object allocator using the STL?
The C containers library comes with a small object allocator
already done for you. I thought that a classic problem like
that would be solved already.

Zoltan Juhasz

unread,

Jun 4, 2012, 10:54:49 AM6/4/12

to ja...@jspamsink.org

On Monday, 4 June 2012 10:20:38 UTC-4, jacob navia wrote:
> Le 04/06/12 16:08, Zoltan Juhasz a écrit :
>
> (1)
> The problem is that in older versions of C++ there isn't a single
> linked list...

Right, only C++11. That is why I wanted to make sure both
uses the same kind, otherwise the measurement is rather unfair.

> (2)
> How would you use a small object allocator using the STL?
> The C containers library comes with a small object allocator
> already done for you. I thought that a classic problem like
> that would be solved already.

I am not sure if the question is directed towards
- possibility or
- feasibility

As far as possibility is concerned:

STL containers provide an 'Allocator' template parameter,
where you can provide your own allocator. Your custom
allocator then can be built on top of malloc (such as
Loki::SmallObj allocator, by using pools), or you can even
use an alternative malloc implementation to malloc / free
memory.

As far a feasibility is concerned, I am moderately sure that
there are off-the-shelf, STL compatible small-object
allocators available; providing an additional template argument
should not be a problem.

Alternatively, there are other malloc implementations, such as
jmalloc, which segregate memory based on allocation size,
thus eliminating some of the overhead.

Either way, it is possible, and not horribly complex.

-- Zoltan

Juha Nieminen

unread,

Jun 4, 2012, 11:32:34 AM6/4/12

to

In comp.lang.c++ MelissA <me...@a.com> wrote:
> #define N 10000000
>
> int main()
> {
> std::list<int> L1;
> long long sum=0,newSum=0;
> for(int i = 0; i < N; ++i)
> {
> int d = rand();
> sum+=d;
> L1.push_back(d);

Your test is basically useless as a measurement of sorting speed because
the vast majority of the time will be spent on those push_back() calls.

If you want to test sorting speed and nothing else, you have to discard
the time spent in memory allocation.

Ike Naar

unread,

Jun 4, 2012, 11:44:57 AM6/4/12

to

On the machine that I tried, most of the time was spent on the
implicit destruction of the list at program termination.

jacob navia

unread,

Jun 4, 2012, 11:47:12 AM6/4/12

to

Le 04/06/12 17:32, Juha Nieminen a �crit :

> In comp.lang.c++ MelissA<me...@a.com> wrote:
>> #define N 10000000
>>
>> int main()
>> {
>> std::list<int> L1;
>> long long sum=0,newSum=0;
>> for(int i = 0; i< N; ++i)
>> {
>> int d = rand();
>> sum+=d;
>> L1.push_back(d);
>
> Your test is basically useless as a measurement of sorting speed because
> the vast majority of the time will be spent on those push_back() calls.
>

The purpose of this test is to compare the STL to the C Containers
Library (CCL). With this in mind I wanted a big test with both systems
doing roughly the same thing:

1) Allocate a list
2) Fill it with 100 million random numbers adding each number
to a grand total
3) Sort it
4) Go through the whole list again, reading the data and verify that
you get the same number.
5) Destroy the list.

> If you want to test sorting speed and nothing else, you have to discard
> the time spent in memory allocation.

Yes, but what I want to test is the speed to do ALL tasks mentioned above

jacob navia

unread,

Jun 4, 2012, 12:01:16 PM6/4/12

to

Le 04/06/12 16:54, Zoltan Juhasz a �crit :

> On Monday, 4 June 2012 10:20:38 UTC-4, jacob navia wrote:

>> Le 04/06/12 16:08, Zoltan Juhasz a �crit :

>>
>> (1)
>> The problem is that in older versions of C++ there isn't a single
>> linked list...
>
> Right, only C++11. That is why I wanted to make sure both
> uses the same kind, otherwise the measurement is rather unfair.
>

You are in principle right. I will rewrite the double linked
list module and redo the test.

>> (2)
>> How would you use a small object allocator using the STL?
>> The C containers library comes with a small object allocator
>> already done for you. I thought that a classic problem like
>> that would be solved already.
>
> I am not sure if the question is directed towards
> - possibility or
> - feasibility
>
> As far as possibility is concerned:
>
> STL containers provide an 'Allocator' template parameter,
> where you can provide your own allocator.

The same for the CCL

> Your custom
> allocator then can be built on top of malloc (such as
> Loki::SmallObj allocator, by using pools), or you can even
> use an alternative malloc implementation to malloc / free
> memory.
>

Yes, but in the CCL you get one already built for your. You just
tell the system

iList.UseHeap(MyList);

and the provided small object allocator is plugged in to the
list.

C++ is a bit more complicated in this respect, but I would
like to use a common library. Loki needs to be downloaded,
compiled, and it is lacking some examples...

In this page
http://sourceforge.net/projects/loki-lib/forums/forum/93009/topic/3828398

people say it runs slower than malloc...

> As far a feasibility is concerned, I am moderately sure that
> there are off-the-shelf, STL compatible small-object
> allocators available; providing an additional template argument
> should not be a problem.
>

Well, besides Loki what others would you propose?

> Alternatively, there are other malloc implementations, such as
> jmalloc, which segregate memory based on allocation size,
> thus eliminating some of the overhead.
>
> Either way, it is possible, and not horribly complex.
>
> -- Zoltan

Not horribly complex for you but for me it looks like a big deal...

jacob navia

unread,

Jun 4, 2012, 12:03:07 PM6/4/12

to

Le 04/06/12 17:44, Ike Naar a écrit :

>
> On the machine that I tried, most of the time was spent on the
> implicit destruction of the list at program termination.

Yes, I see a noticeable delay between the printing of the
result sum and the end of the program.

Apparently cleaning up is quit e expensive since destructors must be
called 100 million times.

In the CCL you can ADD a destructor if you want but they are NOT
required as in C++.

Paul

unread,

Jun 4, 2012, 12:32:16 PM6/4/12

to

If you're going to collect timing information, it should be
collected right around the working part of the test, and
inside the code itself. Rather than recording the runtime of
the entire executable, and all those rand() calls.

generate_random_numbers
...
get_time_before
do_the_sort
get_time_after
...
print ( get_time_after - get_time_before )

Paul

Zoltan Juhasz

unread,

Jun 4, 2012, 1:49:21 PM6/4/12

to

On Monday, 4 June 2012 12:01:16 UTC-4, jacob navia wrote:
> C++ is a bit more complicated in this respect, but I would
> like to use a common library. Loki needs to be downloaded,
> compiled, and it is lacking some examples...
>
> In this page
> http://sourceforge.net/projects/loki-lib/forums/forum/93009/topic/3828398
>
> people say it runs slower than malloc...

Yes, it was designed as a portable, book-example allocator,
not necessarily an industry strength, fully tuned one.

It might be slower than malloc to allocate the elements,
but from the test perspective allocation speed is irrelevant;
What you are after with Loki is a more compact representation
of your data.

> > As far a feasibility is concerned, I am moderately sure that
> > there are off-the-shelf, STL compatible small-object
> > allocators available; providing an additional template argument

> > s hould not be a problem.

> >
>
> Well, besides Loki what others would you propose?

Well, Google can only answer that :), but as I said, Loki should
be fine, also NSTL seems to provide its own SmallObjAlloc.

> Not horribly complex for you but for me it looks like a big deal...

Oh I agree, it is not a trivial exercise, but keep in mind
that you are talking about a specialized use-case.

Back to the original topic: if the edge is provided by your
sorting algorithm, the tests have to be based on close-to
identical environment, and your sort has to be measured, not
something else.

On the other hand, I also hear your argument that the test should
be based on a close-to identical environment that could be achieved
with resonably similar effort...

-- Zoltan

MelissA

unread,

Jun 4, 2012, 3:44:30 PM6/4/12

to

Here is just sorting time, C version is significantly faster:

bmaxa@maxa:~/jacob/ccl$ ./testlist
sort time 3.140
Sum = 10738138201479754
bmaxa@maxa:~/jacob/ccl$ ./cpplist
sort time 5.350
Sum = 10738138201479754
bmaxa@maxa:~/jacob/ccl$

MelissA

unread,

Jun 4, 2012, 4:01:28 PM6/4/12

to

But watch out this:
It is faster to populate vector from list, then
to sort vector , then to clear list, then
to populate list from vector then C version! ;)

bmaxa@maxa:~/jacob/ccl$ time ./cpplist
sort time 0.750
Sum = 10738138201479754

real 0m1.813s
user 0m1.660s
sys 0m0.152s
bmaxa@maxa:~/jacob/ccl$ time ./testlist

sort time 3.140
Sum = 10738138201479754

real 0m4.302s
user 0m4.208s
sys 0m0.096s
bmaxa@maxa:~/jacob/ccl$ cat cpplist.cpp
#include <list>
#include <vector>
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <ctime>

#define N 10000000

int main()
{
std::list<int> L1;
long long sum=0,newSum=0;
for(int i = 0; i < N; ++i)
{
int d = rand();
sum+=d;
L1.push_back(d);
}

std::vector<int> tmp(L1.begin(),L1.end());
clock_t t = clock();
std::sort(tmp.begin(),tmp.end());
clock_t et = clock();
L1.clear();
L1.insert(L1.begin(),tmp.begin(),tmp.end());
printf("sort time %.3f\n",float(et - t)/CLOCKS_PER_SEC);

jacob navia

unread,

Jun 4, 2012, 4:42:28 PM6/4/12

to

Confirmed.

real 0m5.112s
user 0m4.600s
sys 0m0.487s

That's faster than C by a small amount...

MelissA

unread,

Jun 4, 2012, 6:16:36 PM6/4/12

to

This is with my inplace qsort with bidirectional
iterators:

bmaxa@maxa:~/jacob/ccl$ time ./cpplist1
sort time 1.480
Sum = 10738138201479754

real 0m2.105s
user 0m1.936s
sys 0m0.168s
bmaxa@maxa:~/jacob/ccl$ time ./testlist
sort time 3.130
Sum = 10738138201479754

real 0m4.265s
user 0m4.128s
sys 0m0.136s
bmaxa@maxa:~/jacob/ccl$ cat QSort.h
#include <algorithm>

namespace VLib{
using std::swap;

template <typename T,typename F>
void qsort(T begin, T end, unsigned size, F f)
{
if(begin == end)return;

T high = end;
if(!size)
{
T tmp = begin;
while(tmp!=end){ high=tmp++;++size; }
}
else
{
--high;
}

if(size == 1)return;

T low = begin;
++low;

unsigned counthigh = 0,countlow = 0;
do
{
while(high != low && f(*begin,*high)){ --high;++counthigh; }
while(low != high && f(*low,*begin)){ ++low;++countlow; }

if(low != high && !f(*low,*begin) && !f(*begin,*low) && !f(*begin,*high) && !f(*high,*begin))
{
while(high != low && !f(*high,*low) && !f(*low,*high)){ --high;++counthigh; }
}

swap(*low,*high);
}while(low != high);
swap(*low,*begin);
T i = low;

while(++i != end && !f(*i,*low) && !f(*low,*i) )--counthigh;

VLib::qsort(begin,low,countlow,f);
VLib::qsort(i,end,counthigh,f);
}

template <typename T>
inline void qsort(T begin, T end, unsigned size = 0)
{
VLib::qsort(begin,end,size,std::less<typename std::iterator_traits<T>::value_type>());
}

}
bmaxa@maxa:~/jacob/ccl$ cat cpplist1.cpp
#include <list>

#include <cstdio>
#include <cstdlib>
#include <ctime>

#include "QSort.h"

#define N 10000000

int main()
{
std::list<int> L1;
long long sum=0,newSum=0;
for(int i = 0; i < N; ++i)
{
int d = rand();
sum+=d;
L1.push_back(d);
}

clock_t t = clock();
VLib::qsort(L1.begin(),L1.end());
clock_t et = clock();

Ian Collins

unread,

Jun 4, 2012, 6:23:56 PM6/4/12

to

So not leaking memory is optional :)

--
Ian Collins

jacob navia

unread,

Jun 4, 2012, 6:31:40 PM6/4/12

to

Le 05/06/12 00:23, Ian Collins a écrit :

>> In the CCL you can ADD a destructor if you want but they are NOT
>> required as in C++.
>
> So not leaking memory is optional :)

No. You do a

iintList.Finalize(list);

The advantage is that you free ALL memory in a single call to a function
that destroys the heap not individually (list element by list element)
but in blocks of 1000 list elements each you see?

C++ calls destructors 100 million times. That takes time.

MelissA

unread,

Jun 4, 2012, 6:47:14 PM6/4/12

to

On Tue, 5 Jun 2012 00:16:36 +0200
MelissA <me...@a.com> wrote:

>
> unsigned counthigh = 0,countlow = 0;

mine error
countlow should be 1, not 0 ;)

Luca Risolia

unread,

Jun 4, 2012, 7:43:34 PM6/4/12

to

On 05/06/2012 00:31, jacob navia wrote:
> Le 05/06/12 00:23, Ian Collins a �crit :

Not necessarily. Memory deallocation and object destruction are two
separate things usually controlled by an allocator. With regard to the
standard containers an allocator is always an (optional) template
parameter. If you really want, you can provide your preferred
deallocation strategy for a given type T to std::list, so that it frees
the memory in blocks of 1000 and doesn't destroy any object.

Martin Shobe

unread,

Jun 4, 2012, 8:26:18 PM6/4/12

to

In the code that lead to this, the list was a list of ints. They don't
even have destructors to call, so the time certainly wasn't spent
calling them.

Martin Shobe

MelissA

unread,

Jun 4, 2012, 9:34:07 PM6/4/12

to

Final QSort.h (corrected errors, this one is little bit slower)

#include <algorithm>

namespace VLib{
using std::swap;

template <typename T,typename F>
void qsort(T begin, T end, unsigned size, F f)
{
if(begin == end)return;

T high = end;
if(!size)
{
T tmp = begin;
while(tmp!=end){ high=tmp++;++size; }
}
else
{
--high;
}

if(size == 1)return;
T low = begin;

unsigned count = 0;
while(++count<size/2)++low;
// printf("size %u\n",size);
swap(*low,*begin);
low = begin;
++low;

unsigned counthigh = 0,countlow = 1;

Juha Nieminen

unread,

Jun 5, 2012, 1:36:40 AM6/5/12

to

In comp.lang.c++ jacob navia <ja...@spamsink.net> wrote:
>> If you want to test sorting speed and nothing else, you have to discard
>> the time spent in memory allocation.
>
> Yes, but what I want to test is the speed to do ALL tasks mentioned above

Then you will be comparing C++'s default memory allocator speed to whatever
that C list was using.