Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Why is java consumer/producer so much faster than C++

523 views
Skip to first unread message

Melzzzzz

unread,
Jul 22, 2012, 5:59:20 PM7/22/12
to
I have played little bit with new C++11 features and compared,
java performance to c++.
Actually this was meant to be GC vs RAII memory management,
but boiled down to speed of BlockingQueue class in java,
and mine in c++.
It seems that java implementation is so much more efficient
but I don't know why. I even tried c++ without dynamic
memory management (except queue itself) and that is *even slower*.
Must be some quirks with a queue ;)

These are timings:
(java openjdk 1.7)
bmaxa@maxa:~/examples$ time java consprod

real 0m13.411s
user 0m19.853s
sys 0m0.960s

(c++ gcc 4.6.3)
bmaxa@maxa:~/examples$ time ./consprod

real 0m28.726s
user 0m34.518s
sys 0m6.800s

Example programs follow (I think implementations of
blocking queues are similar):
// java
import java.util.concurrent.*;
import java.util.Random;

class Vars{
public final static int nitems = 100000000;
public final static Random rnd = new Random(12345);
public final static int size = 10000;
}

class Producer implements Runnable {
private final BlockingQueue<Integer> queue;
Producer(BlockingQueue<Integer> q) { queue = q; }
public void run() {
try {
int i = Vars.nitems;
while(i-- > 0) { queue.put(produce(i)); }
} catch (InterruptedException ex)
{

}
}
Integer produce(int i) { return new Integer(i); }
}

class Consumer implements Runnable {
private final BlockingQueue<Integer> queue;
Consumer(BlockingQueue<Integer> q)
{
queue = q;
}
public void run() {
try {
Integer[] arr = new Integer[10000];
int i = Vars.nitems;
while(i-- > 0) { consume(queue.take(),arr); }
} catch (InterruptedException ex)
{
}
}
void consume(Integer x, Integer[] arr)
{
arr[Vars.rnd.nextInt(Vars.size)] = x;
}
}

public class consprod {
public static void main(String [] args) {
try{
BlockingQueue<Integer> q = new ArrayBlockingQueue<Integer>(100000);
Producer p = new Producer(q);
Consumer c = new Consumer(q);
new Thread(p).start();
new Thread(c).start();
} catch(Exception e)
{
e.printStackTrace();
}
}
}
//-----------------------------------------
// c++
#include <condition_variable>
#include <mutex>
#include <thread>
#include <deque>
#include <cstdlib>

template <class T>
class BlockingQueue{
public:
BlockingQueue(unsigned cap):capacity_(cap)
{
}
void put(T t)
{
std::unique_lock<std::mutex> lock(m_);
while(queue_.size() >= capacity_)c_full_.wait(lock);
queue_.push_back(std::move(t));
c_empty_.notify_one();
}
T take()
{
std::unique_lock<std::mutex> lock(m_);
while(queue_.empty())c_empty_.wait(lock);
T tmp = std::move(queue_.front());
queue_.pop_front();
c_full_.notify_one();
return std::move(tmp);
}
bool empty()
{
std::unique_lock<std::mutex> lock(m_);
return queue_.empty();
}
private:
std::mutex m_;
std::condition_variable c_empty_,c_full_;
std::deque<T> queue_;
unsigned capacity_;
};

int main()
{
BlockingQueue<std::unique_ptr<int>> produced(100000);
const int nitems = 100000000;
std::srand(12345);


std::function<void()> f_prod = [&]() {
int i = nitems;
while(i-- > 0){
produced.put(std::unique_ptr<int>(new int(i)));
}
};

std::thread producer1(f_prod);

std::function<void()> f_cons = [&]() {
const int size = 10000;
std::unique_ptr<int> arr[size];
int i = nitems;
while(i-- > 0)
{
arr[std::rand()%size] = produced.take();
}
};

std::thread consumer1(f_cons);

producer1.join();
consumer1.join();
}
// --------------------------------------

Joshua Maurice

unread,
Jul 22, 2012, 6:42:55 PM7/22/12
to
What g++ optimization options did you use? If you didn't compile with
proper optimization flags (e.g. at least -O2), then the numbers you
have are meaningless.

Also, you might be testing the difference between std::rand() and
Java's Random.

Also, why did you use dynamic allocation in the C++ code for an int?
If your original goal was to compare the Java way to the C++ way, you
are not doing it right. This is especially troubling after you
apparently went through the effort to make the queue support move-only
types. (Not sure. I'm still new to move-semantics.)

Also,
> std::unique_ptr<int> arr[size];
...
> arr[std::rand()%size] = produced.take();
I wonder if it's possible to optimize that into a simple assignment,
or whether there will be a call to delete() - or at least a branch to
test if the internal member pointer is null.

Just off the top of my head.

Melzzzzz

unread,
Jul 22, 2012, 7:28:01 PM7/22/12
to
On Sun, 22 Jul 2012 15:42:55 -0700 (PDT)
Joshua Maurice <joshua...@gmail.com> wrote:

>
>
> What g++ optimization options did you use? If you didn't compile with
> proper optimization flags (e.g. at least -O2), then the numbers you
> have are meaningless.

I have used -O2.

>
> Also, you might be testing the difference between std::rand() and
> Java's Random.

Possibly.

>
> Also, why did you use dynamic allocation in the C++ code for an int?

Actually with non dynamical allocation seems to be slower ;)

> If your original goal was to compare the Java way to the C++ way, you
> are not doing it right. This is especially troubling after you
> apparently went through the effort to make the queue support move-only
> types. (Not sure. I'm still new to move-semantics.)

That was necessary in order to put unique_ptr in queue.

>
> Also,
> > std::unique_ptr<int> arr[size];
> ...
> > arr[std::rand()%size] = produced.take();
> I wonder if it's possible to optimize that into a simple assignment,
> or whether there will be a call to delete() - or at least a branch to
> test if the internal member pointer is null.

I have removed that statement , removed called to rand, left
only take, made int's non dynamic
and I got following time:

bmaxa@maxa:~/examples$ g++ -Wall -O2 -pthread -std=c++0x consprod1.cpp -o consprod1
consprod1.cpp: In lambda function:
consprod1.cpp:58:19: warning: unused variable ‘size’ [-Wunused-variable]
bmaxa@maxa:~/examples$ time ./consprod1

real 0m23.335s
user 0m22.177s
sys 0m16.417s

with java, I have removed rand too...
bmaxa@maxa:~/examples$ javac consprod.java
bmaxa@maxa:~/examples$ time java consprod

real 0m12.491s
user 0m21.001s
sys 0m0.524s

Still twice as fast as c++.
What it looks like, is that c++ spends much more time in syscalls
(futex) than java and that explains big difference.



Luca Risolia

unread,
Jul 22, 2012, 8:03:28 PM7/22/12
to
On 23/07/2012 01:28, Melzzzzz wrote:
> Still twice as fast as c++.
> What it looks like, is that c++ spends much more time in syscalls
> (futex) than java and that explains big difference.

A lock-free "BlockingQueue" would probably be more efficient.
You can also try to implement your Queue with std::array instead of
std::deque and see how it performs.

In the C++ version change:

while (queue_.size() >= capacity_)c_full_.wait(lock);
with
c_full_.wait(lock, [this]{return queue_.size() < capacity_;});

and
while (queue_.empty())c_empty_.wait(lock);
with
c_empty_.wait(lock, [this]{return !queue_.empty();});

Also use a std::lock_guard in empty() instead of std::unique_lock. The
former is faster.

(On a side note, the C++ version you have given does not have the best
possible interface for a thread-safe queue)

Juha Nieminen

unread,
Jul 23, 2012, 2:37:17 AM7/23/12
to
In comp.lang.c++ Melzzzzz <m...@zzzzz.com> wrote:
> I even tried c++ without dynamic
> memory management (except queue itself) and that is *even slower*.

If two C++ programs are otherwise identical, except that one uses
'new int' and the other just uses the ints by value, and the former
is faster than the latter, then there's something horribly wrong in
the way you are measuring, or something else. I don't think it's
physically possible for 'new int' to be faster than using ints by
value under any possible circumstance, even if we assumed a highly
optimized version of 'new' that does magic under the hood to be 10
times faster than the regular 'new'.

> std::unique_lock<std::mutex> lock(m_);

> produced.put(std::unique_ptr<int>(new int(i)));

> arr[std::rand()%size] = produced.take();

I don't know what's the cause of the slowness in your program, but
I quoted above the three possible culprits I would first investigate
(which happen to be in order of likeliness).

Locks are slow. They are probably slower than 'new'.

'new' is slow, especially when compared to java's.

std::rand() is slow, although probably does not account for that much
slowness, but could be a minor contributing factor.

Melzzzzz

unread,
Jul 23, 2012, 6:17:46 AM7/23/12
to
On Mon, 23 Jul 2012 02:03:28 +0200
Luca Risolia <luca.r...@studio.unibo.it> wrote:

> On 23/07/2012 01:28, Melzzzzz wrote:
> > Still twice as fast as c++.
> > What it looks like, is that c++ spends much more time in syscalls
> > (futex) than java and that explains big difference.
>
> A lock-free "BlockingQueue" would probably be more efficient.
> You can also try to implement your Queue with std::array instead of
> std::deque and see how it performs.

I will try. Seems that java ArrayBlockingQueue is more efficient
because if that.

>
> In the C++ version change:
>
> while (queue_.size() >= capacity_)c_full_.wait(lock);
> with
> c_full_.wait(lock, [this]{return queue_.size() < capacity_;});
>
> and
> while (queue_.empty())c_empty_.wait(lock);
> with
> c_empty_.wait(lock, [this]{return !queue_.empty();});
>
> Also use a std::lock_guard in empty() instead of std::unique_lock.
> The former is faster.

Yes, I got speed up of few seconds, thanks.

>
> (On a side note, the C++ version you have given does not have the
> best possible interface for a thread-safe queue)

I have just copied Java BlockingQueue interface, which is of course
suited for Java. Don't know actually how ideal interface would look
like.


>


Melzzzzz

unread,
Jul 23, 2012, 6:33:21 AM7/23/12
to
On Mon, 23 Jul 2012 06:37:17 +0000 (UTC)
Juha Nieminen <nos...@thanks.invalid> wrote:

> In comp.lang.c++ Melzzzzz <m...@zzzzz.com> wrote:
> > I even tried c++ without dynamic
> > memory management (except queue itself) and that is *even slower*.
>
> If two C++ programs are otherwise identical, except that one uses
> 'new int' and the other just uses the ints by value, and the former
> is faster than the latter, then there's something horribly wrong in
> the way you are measuring, or something else.

I think that pressure on condition variable is greater with non
dynamic allocation, therefore it is slower.
If I reduce queue capacity on eg 1000 (more pressure on condition
variable) than it is much slower than with capacity of 100000.


I don't think it's
> physically possible for 'new int' to be faster than using ints by
> value under any possible circumstance, even if we assumed a highly
> optimized version of 'new' that does magic under the hood to be 10
> times faster than the regular 'new'.

That's because new is very fast and don;t take much time of program
in this example.

>
> > std::unique_lock<std::mutex> lock(m_);
>
> > produced.put(std::unique_ptr<int>(new int(i)));
>
> > arr[std::rand()%size] = produced.take();
>
> I don't know what's the cause of the slowness in your program, but
> I quoted above the three possible culprits I would first investigate
> (which happen to be in order of likeliness).
>
> Locks are slow. They are probably slower than 'new'.

I think that locks are not that slow in comparison to
condition variables. I have feeling that somehow
Java has less pressure on condition variables
in it's queue implementation.

>
> 'new' is slow, especially when compared to java's.

In this example new takes very little time.

>
> std::rand() is slow, although probably does not account for that much
> slowness, but could be a minor contributing factor.

This also takes little time.
Here is difference between new and rand and without.
Both versions are optimized now as per Luca's suggestion.


(with new and rand array assignment)
bmaxa@maxa:~/examples$ time ./consprod

real 0m25.127s
user 0m32.202s
sys 0m5.540s
(without)
bmaxa@maxa:~/examples$ time ./consprod1

real 0m23.883s
user 0m23.009s
sys 0m17.153s

Java:
(without rand and array assignment)
bmaxa@maxa:~/examples$ time java consprod

real 0m12.825s
user 0m21.833s
sys 0m0.592s

(with rand and array assignment)
bmaxa@maxa:~/examples$ time java consprod

real 0m15.571s
user 0m25.634s
sys 0m1.168s

Java pays higher price for random assignments, I think.


Juha Nieminen

unread,
Jul 23, 2012, 7:46:37 AM7/23/12
to
In comp.lang.c++ Melzzzzz <m...@zzzzz.com> wrote:
> I think that pressure on condition variable is greater with non
> dynamic allocation, therefore it is slower.

That sentence doesn't make any kind of sense.

> I don't think it's
>> physically possible for 'new int' to be faster than using ints by
>> value under any possible circumstance, even if we assumed a highly
>> optimized version of 'new' that does magic under the hood to be 10
>> times faster than the regular 'new'.
>
> That's because new is very fast and don;t take much time of program
> in this example.

'new' is a very slow operation in C++, and even if it weren't, it just
can't be faster than using an integer by value. It's physically impossible.
(The pointer that points to the allocated int is also an integral value at
the low level, so by allocating something dynamically you are only adding
to the amount of values being handled, and the overall complexity of the
program.)

Can you give any scenario where handling pointers would be faster than
handling ints?

>> 'new' is slow, especially when compared to java's.
>
> In this example new takes very little time.

You are executing tens of thousands of 'new's. That is slow.

Melzzzzz

unread,
Jul 23, 2012, 9:33:15 AM7/23/12
to
On Mon, 23 Jul 2012 11:46:37 +0000 (UTC)
Juha Nieminen <nos...@thanks.invalid> wrote:

> In comp.lang.c++ Melzzzzz <m...@zzzzz.com> wrote:
> > I think that pressure on condition variable is greater with non
> > dynamic allocation, therefore it is slower.
>
> That sentence doesn't make any kind of sense.
I has sense in this context. I will provide example.

>
> > I don't think it's
> >> physically possible for 'new int' to be faster than using ints by
> >> value under any possible circumstance, even if we assumed a highly
> >> optimized version of 'new' that does magic under the hood to be 10
> >> times faster than the regular 'new'.
> >
> > That's because new is very fast and don;t take much time of program
> > in this example.
>
> 'new' is a very slow operation in C++, and even if it weren't, it just
> can't be faster than using an integer by value. It's physically
> impossible. (The pointer that points to the allocated int is also an
> integral value at the low level, so by allocating something
> dynamically you are only adding to the amount of values being
> handled, and the overall complexity of the program.)

Yes, but overall complexity can be masked with something else like in
this case.

>
> Can you give any scenario where handling pointers would be faster than
> handling ints?

This scenario is exactly example. My queue is limited to capacity
nodes. When capacity is reached, producer waits on condition
variable until it is emptied.
In first example queue was 100000 nodes so slow down with plain ints
isn't just that noticabe.
But if I put just 1000 nodes here is what happens:
(dynamic ints , unique_ptr, random function and array assignement)
bmaxa@maxa:~/examples$ time ./consprod

real 0m23.523s
user 0m32.390s
sys 0m10.581s

but look at this:
(no dynamic ints, no random function, no assignment)
:
bmaxa@maxa:~/examples$ time ./consprod1

real 0m35.654s
user 0m30.882s
sys 0m34.202s

it's more than 10 seconds slower! just ints!


>
> >> 'new' is slow, especially when compared to java's.
> >
> > In this example new takes very little time.
>
> You are executing tens of thousands of 'new's. That is slow.

It is slow but waiting on condition is much slower so,
faster producer increases number of times condition waits,
so actually that makes program slower...



Zoltan Juhasz

unread,
Jul 23, 2012, 9:57:03 AM7/23/12
to
On Sunday, 22 July 2012 17:59:20 UTC-4, Melzzzzz wrote:
> I have played little bit with new C++11 features and compared,
> java performance to c++.

My take on this:

I guess the Java BlockingQueue is a lock-free implementation
of the blocking circular-buffer concept. Your implementation
uses a global, single lock around operations, the _m mutex,
which essentially serializes your code (plus the synchronization
overhead). Your code might be executed concurrently, but not in
parallel for sure.

What is wrong with your code? That you compare a naive,
inefficient blocking queue implementation, with an industrial
strength solution.

- look up a lock-free, single producer / single consumer
circular-buffer implementation
- use rvalue semantics to move items around, no need for
ptrs; at very least specialize the container for built-in types
- eliminate noise (i.e. rand)

The most important part: profile your code to find out
where you spend your time, whether you have some kind of
contention etc. Performance optimization is a tricky business,
especially in a multi-threaded environment, naive
implementations usually produce catastrophic results (see above).

-- Zoltan

Melzzzzz

unread,
Jul 23, 2012, 12:42:14 PM7/23/12
to
On Mon, 23 Jul 2012 06:57:03 -0700 (PDT)
Zoltan Juhasz <zoltan...@gmail.com> wrote:

> On Sunday, 22 July 2012 17:59:20 UTC-4, Melzzzzz wrote:
> > I have played little bit with new C++11 features and compared,
> > java performance to c++.
>
> My take on this:
>
> I guess the Java BlockingQueue is a lock-free implementation
> of the blocking circular-buffer concept. Your implementation
> uses a global, single lock around operations, the _m mutex,
> which essentially serializes your code (plus the synchronization
> overhead). Your code might be executed concurrently, but not in
> parallel for sure.
>
This is implementation of put/take methods of Java queue:

public void put(E e) throws InterruptedException {
244: if (e == null) throw new NullPointerException();
245: final E[] items = this.items;
246: final ReentrantLock lock = this.lock;
247: lock.lockInterruptibly();
248: try {
249: try {
250: while (count == items.length)
251: notFull.await();
252: } catch (InterruptedException ie) {
253: notFull.signal(); // propagate to non-interrupted
thread 254: throw ie;
255: }
256: insert(e);
257: } finally {
258: lock.unlock();
259: }
260: }

310: public E take() throws InterruptedException {
311: final ReentrantLock lock = this.lock;
312: lock.lockInterruptibly();
313: try {
314: try {
315: while (count == 0)
316: notEmpty.await();
317: } catch (InterruptedException ie) {
318: notEmpty.signal(); // propagate to non-interrupted thread
319: throw ie;
320: }
321: E x = extract();
322: return x;
323: } finally {
324: lock.unlock();
325: }
326: }
327:

> What is wrong with your code? That you compare a naive,
> inefficient blocking queue implementation, with an industrial
> strength solution.

Heh, I don't think so. This implementation is pretty classic.
It does well for blocking queue, but this test is too
much for it. It gets contented with mutex/condition_variable
calls, so does Java, but for some reason, much less.


>
> - look up a lock-free, single producer / single consumer
> circular-buffer implementation

Java ArrayBlockingQueue use circular buffer but is
not lock free, rather implementation is identical,
as I see (one mutex, two condition variables)

> - use rvalue semantics to move items around, no need for
> ptrs; at very least specialize the container for built-in types
> - eliminate noise (i.e. rand)
>
> The most important part: profile your code to find out
> where you spend your time, whether you have some kind of
> contention etc. Performance optimization is a tricky business,
> especially in a multi-threaded environment, naive
> implementations usually produce catastrophic results (see above).

Agreed. I think that I am close. Somehow (probably because it
is faster? c++ puts much more pressure on queue than Java does).
Perhaps someone can confirm or deny this.



Howard Hinnant

unread,
Jul 23, 2012, 4:23:30 PM7/23/12
to
I made a few changes to your code:

1. I only signal if the capacity becomes non-zero or non-full.

2. If the queue is the queue has got some items in it and the mutex is locked, the put thread yields to the take thread. If the queue has some free space in it and the mutex is locked, the take thread yields to the put thread.

3. Traffic in int instead of unique_ptr<int>.


#include <condition_variable>
#include <mutex>
#include <thread>
#include <deque>
#include <cstdlib>

template <class T>
class BlockingQueue{
public:
BlockingQueue(unsigned cap):capacity_(cap)
{
}
void put(T t)
{
std::unique_lock<std::mutex> lock(m_, std::try_to_lock);
int retry = 0;
while (!lock.owns_lock())
{
if (queue_.size() > capacity_/4 && ++retry < 1000)
{
std::this_thread::yield();
}
else
{
lock.lock();
}
}
while(queue_.size() >= capacity_)c_full_.wait(lock);
queue_.push_back(std::move(t));
if (queue_.size() == 1)
c_empty_.notify_one();
}
T take()
{
std::unique_lock<std::mutex> lock(m_, std::try_to_lock);
int retry = 0;
while (!lock.owns_lock())
{
if (queue_.size() < 3*capacity_/4 && ++retry < 1000)
{
std::this_thread::yield();
}
else
{
lock.lock();
}
}
while(queue_.empty())c_empty_.wait(lock);
T tmp = std::move(queue_.front());
queue_.pop_front();
if (queue_.size() == capacity_-1)
c_full_.notify_one();
return tmp;
}
bool empty()
{
std::unique_lock<std::mutex> lock(m_);
return queue_.empty();
}
private:
std::mutex m_;
std::condition_variable c_empty_,c_full_;
std::deque<T> queue_;
unsigned capacity_;
};

int main()
{
BlockingQueue<int> produced(100000);
const int nitems = 100000000;
std::srand(12345);


std::function<void()> f_prod = [&]() {
int i = nitems;
while(i-- > 0){
produced.put(i);
}
};

std::thread producer1(f_prod);

std::function<void()> f_cons = [&]() {
const int size = 10000;
int arr[size];
int i = nitems;
while(i-- > 0)
{
arr[std::rand()%size] = produced.take();
}
};

std::thread consumer1(f_cons);

producer1.join();
consumer1.join();
}

On my system this sped the C++ solution up considerably (to about 13.7 seconds). I didn't time the Java solution.

Dombo

unread,
Jul 23, 2012, 4:57:38 PM7/23/12
to
Op 22-Jul-12 23:59, Melzzzzz schreef:
In the C++ version at the end of the main function it waits for the
threads to complete, in the Java version the threads are not joined at
the end of the main function. I don't know if Java keeps the process
alive until all threads have terminated, but if not it would explain the
difference.

I'm a bit surpirsed that the C++ version compiles without complaints on
your compiler because there is no return statement in the main() function.


Melzzzzz

unread,
Jul 23, 2012, 6:32:10 PM7/23/12
to
On Mon, 23 Jul 2012 13:23:30 -0700 (PDT)
Howard Hinnant <howard....@gmail.com> wrote:

> I made a few changes to your code:
>
>
> On my system this sped the C++ solution up considerably (to about
> 13.7 seconds). I didn't time the Java solution.

Yes, that's it, on my system:
bmaxa@maxa:~/examples$ time ./consprod2

real 0m9.478s
user 0m10.797s
sys 0m7.196s

which is more than twice speed of original program!
(and faster than java)

Luca Risolia

unread,
Jul 23, 2012, 6:33:10 PM7/23/12
to
On 23/07/2012 12:17, Melzzzzz wrote:
>> (On a side note, the C++ version you have given does not have the
>> best possible interface for a thread-safe queue)
>
> I have just copied Java BlockingQueue interface, which is of course
> suited for Java. Don't know actually how ideal interface would look
> like.

In C++11 the ideal interface for a generic thread-safe and
exception-safe queue should at least provide std::shared_ptr<T> take()
and/or void take(T&). T take() is limited to those types having no-throw
copy constructors or move-constructors. This also means that you should
check the type at compile-time by using type traits. However, in your
test case you used move-semantic everywhere, so that is not problem.
Also note that the term "pop" instead of "take" is more in the spirit of
C++.

Howard Hinnant

unread,
Jul 23, 2012, 9:11:58 PM7/23/12
to
On Monday, July 23, 2012 6:32:10 PM UTC-4, Melzzzzz wrote:
> On Mon, 23 Jul 2012 13:23:30 -0700 (PDT)
> Howard Hinnant wrote:
>
> &gt; I made a few changes to your code:
> &gt;
> &gt;
> &gt; On my system this sped the C++ solution up considerably (to about
> &gt; 13.7 seconds). I didn&#39;t time the Java solution.
>
> Yes, that&#39;s it, on my system:
> bmaxa@maxa:~/examples$ time ./consprod2
>
> real 0m9.478s
> user 0m10.797s
> sys 0m7.196s
>
> which is more than twice speed of original program!
> (and faster than java)

At second glance I wasn't that fond of my solution and tweaked it as below:


void put(T t)
{
std::unique_lock<std::mutex> lock(m_, std::try_to_lock);
while (!lock.owns_lock())
{
if (queue_.size() > capacity_/4)
{
for (int i = 0; i < 3250; ++i)
std::this_thread::yield();
lock.try_lock();
}
else
{
lock.lock();
break;
}
}
while(queue_.size() >= capacity_)c_full_.wait(lock);
queue_.push_back(std::move(t));
if (queue_.size() == 1)
c_empty_.notify_one();
}
T take()
{
std::unique_lock<std::mutex> lock(m_, std::try_to_lock);
int retry = 0;
while (!lock.owns_lock())
{
if (queue_.size() < 3*capacity_/4)
{
for (int i = 0; i < 3250; ++i)
std::this_thread::yield();
lock.try_lock();
}
else
{
lock.lock();
break;
}
}
while(queue_.empty())c_empty_.wait(lock);
T tmp = std::move(queue_.front());
queue_.pop_front();
if (queue_.size() == capacity_-1)
c_full_.notify_one();
return tmp;
}

I am admittedly coding to the benchmark (which is not a really great idea). But I got the time on my system down from 13.7 seconds to about 9.7 seconds (another 40%).

Joshua Maurice

unread,
Jul 24, 2012, 12:54:30 AM7/24/12
to
On Jul 23, 1:57 pm, Dombo <do...@disposable.invalid> wrote:
> I'm a bit surpirsed that the C++ version compiles without complaints on
> your compiler because there is no return statement in the main() function.

It's an exception in C and C++ for main only. Running off the end of
main is allowed, and is equivalent to return 0, itself equivalent to
exit(0), IIRC.

Melzzzzz

unread,
Jul 24, 2012, 5:21:26 AM7/24/12
to
On my system speed up is not that big.
(new version)
bmaxa@maxa:~/examples$ time ./consprod3

real 0m9.296s
user 0m10.061s
sys 0m8.069s

(old version)
bmaxa@maxa:~/examples$ time ./consprod2

real 0m9.496s
user 0m10.917s
sys 0m7.292s
bmaxa@maxa:~/examples$

(java with array assignment)
bmaxa@maxa:~/examples$ time java consprod

real 0m16.098s
user 0m25.674s
sys 0m1.932s


Jorgen Grahn

unread,
Jul 24, 2012, 8:50:17 AM7/24/12
to
["Followup-To:" header set to comp.lang.c++.]
On Tue, 2012-07-24, Joshua Maurice wrote:
> On Jul 23, 1:57�pm, Dombo <do...@disposable.invalid> wrote:
>> I'm a bit surpirsed that the C++ version compiles without complaints on
>> your compiler because there is no return statement in the main() function.
>
> It's an exception in C and C++ for main only. Running off the end of
> main is allowed, and is equivalent to return 0,

Sounds right.

> itself equivalent to exit(0), IIRC.

But this does not. Calling exit() prevent objects from being
destroyed properly.

Unless you mean "equivalent to returning from main (thereby destroying
a lot of objects) and then calling exit(0)".

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

none Yannick Tremblay

unread,
Jul 24, 2012, 8:51:20 AM7/24/12
to
In article <86c79114-84e6-4065...@googlegroups.com>,
Howard Hinnant <howard....@gmail.com> wrote:
>I made a few changes to your code:
>
>1. I only signal if the capacity becomes non-zero or non-full.

>2. If the queue is the queue has got some items in it and the mutex is locked, the put thread yields to the take
>thread. If the queue has some free space in it and the mutex is locked, the take thread yields to the put
>thread.

Are you sure about the usefulness of #2?

Doing a few tests myself, adding this in m queue seems to reduce performance by some 20%:
while (!lock.owns_lock())
{
if (queue_.size() > capacity_/4)
{
for (int i = 0; i < 3250; ++i)
std::this_thread::yield();
lock.try_lock();
}
else
{
lock.lock();
break;
}
}


>3. Traffic in int instead of unique_ptr<int>.


Yannick

Melzzzzz

unread,
Jul 24, 2012, 11:47:55 AM7/24/12
to
On Tue, 24 Jul 2012 12:51:20 +0000 (UTC)
yatremblay@bel1lin202.(none) (Yannick Tremblay) wrote:

> In article <86c79114-84e6-4065...@googlegroups.com>,
> Howard Hinnant <howard....@gmail.com> wrote:
> >I made a few changes to your code:
> >
> >1. I only signal if the capacity becomes non-zero or non-full.
>
> >2. If the queue is the queue has got some items in it and the mutex
> >is locked, the put thread yields to the take thread. If the queue
> >has some free space in it and the mutex is locked, the take thread
> >yields to the put thread.
>
> Are you sure about the usefulness of #2?

Problem is that we need to reduce waiting on condition to minimum
as more condition waits program is slower.
Ideally there would be no condition waits (maximum speed).

>
> Doing a few tests myself, adding this in m queue seems to reduce
> performance by some 20%: while (!lock.owns_lock())
> {
> if (queue_.size() > capacity_/4)
> {
> for (int i = 0; i < 3250; ++i)
> std::this_thread::yield();
> lock.try_lock();
> }
> else
> {
> lock.lock();
> break;
> }
> }
>

This is second version, on my system first version seems better
for smaller capacities...


Bo Persson

unread,
Jul 24, 2012, 1:50:22 PM7/24/12
to
Using move semantics in a return statement might interfere with the NRVO
that would otherwise likely kick in. It is not an optimization.


Bo Persson

Melzzzzz

unread,
Jul 24, 2012, 3:45:22 PM7/24/12
to
On Mon, 23 Jul 2012 18:11:58 -0700 (PDT)
Howard Hinnant <howard....@gmail.com> wrote:

> On Monday, July 23, 2012 6:32:10 PM UTC-4, Melzzzzz wrote:
> > On Mon, 23 Jul 2012 13:23:30 -0700 (PDT)
> > Howard Hinnant wrote:
> >
> > &gt; I made a few changes to your code:
> > &gt;
> > &gt;
> > &gt; On my system this sped the C++ solution up considerably (to
> > about &gt; 13.7 seconds). I didn&#39;t time the Java solution.
> >
> > Yes, that&#39;s it, on my system:
> > bmaxa@maxa:~/examples$ time ./consprod2
> >
> > real 0m9.478s
> > user 0m10.797s
> > sys 0m7.196s
> >
> > which is more than twice speed of original program!
> > (and faster than java)
>
> At second glance I wasn't that fond of my solution and tweaked it as
> below:
>
>
> void put(T t)
> {
> std::unique_lock<std::mutex> lock(m_, std::try_to_lock);
> while (!lock.owns_lock())
> {
> if (queue_.size() > capacity_/4)
> {

There is problem, as queue_ is not protected under lock.
This is version with array (identical implementation
as ArrayBlockingQueue in Java), instead of deque (I
used atomic type for count_). This executes about same speed as Java
as atomic type slows it down a bit.
Sadly, no noticing difference between deque and array.

#include <condition_variable>
#include <mutex>
#include <thread>
#include <cstdlib>
#include <atomic>

template <class T>
class BlockingQueue{
public:
BlockingQueue(unsigned cap)
:
queue_((T*)new char[cap*sizeof(T)]),
insPos_(0),
extPos_(0),
count_(0),
capacity_(cap)
{
}
BlockingQueue(const BlockingQueue&)=delete;
BlockingQueue& operator=(const BlockingQueue&)=delete;
~BlockingQueue()
{
delete[] (char*)queue_;
}
void put(T t)
{
std::unique_lock<std::mutex> lock(m_, std::try_to_lock);
int retry = 0;
while (!lock.owns_lock())
{
if (count_ > capacity_/4 && ++retry < 1000)
{
std::this_thread::yield();
}
else
{
lock.lock();
}
}
c_full_.wait(lock, [this]{return count_ < capacity_;});
new(queue_+insPos_)T(std::move(t));
inc(insPos_);
++count_;
if (count_ == 1)
c_empty_.notify_one();
}
T take()
{
std::unique_lock<std::mutex> lock(m_, std::try_to_lock);
int retry = 0;
while (!lock.owns_lock())
{
if (count_ < 3*capacity_/4 && ++retry < 1000)
{
std::this_thread::yield();
}
else
{
lock.lock();
}
}
c_empty_.wait(lock, [this]{return count_ > 0 ;});
T tmp = std::move(queue_[extPos_]);
queue_[extPos_].~T();
inc(extPos_);
--count_;
if (count_ == capacity_-1)
c_full_.notify_one();
return tmp;
}
bool empty()
{
std::lock_guard<std::mutex> lock(m_);
return count_ == 0;
}
private:
void inc(unsigned& i)
{
++i;
if(i == capacity_)i = 0;
}
std::mutex m_;
std::condition_variable c_empty_,c_full_;
T* queue_;
unsigned insPos_,extPos_;
std::atomic<unsigned> count_;
const unsigned capacity_;
};

int main()
{
BlockingQueue<int> produced(1000);

Luca Risolia

unread,
Jul 24, 2012, 5:24:07 PM7/24/12
to
On 24/07/2012 21:45, Melzzzzz wrote:
> This is version with array (identical implementation
> as ArrayBlockingQueue in Java), instead of deque (I
> used atomic type for count_).

Why do you need count_ to be an atomic type, it seems that all the
updates to that variable are already protected by a mutex..

Melzzzzz

unread,
Jul 24, 2012, 6:48:05 PM7/24/12
to
count_ is not protected when testing if lock is held in while loop.
Anyway this has not be precise and if count_ is not atomic program
is much faster.
I have tried Howard's algorithm with Java and got same
time as C++.
Perhaps Array/LinkedBlockingQueue implementation in Java has
to be updated ;)

Luca Risolia

unread,
Jul 24, 2012, 7:21:54 PM7/24/12
to
On 25/07/2012 00:48, Melzzzzz wrote:
> On Tue, 24 Jul 2012 23:24:07 +0200
> Luca Risolia <luca.r...@studio.unibo.it> wrote:
>
>> On 24/07/2012 21:45, Melzzzzz wrote:
>>> This is version with array (identical implementation
>>> as ArrayBlockingQueue in Java), instead of deque (I
>>> used atomic type for count_).
>>
>> Why do you need count_ to be an atomic type, it seems that all the
>> updates to that variable are already protected by a mutex..
>>
>
> count_ is not protected when testing if lock is held in while loop.

Correct, I did not read the code carefully, since I supposed the only
difference from your original version was the use of an array, which
made the program about 20-30% faster in my tests.

Howard Hinnant

unread,
Jul 24, 2012, 8:35:01 PM7/24/12
to
On Tuesday, July 24, 2012 3:45:22 PM UTC-4, Melzzzzz wrote:
> On Mon, 23 Jul 2012 18:11:58 -0700 (PDT)
> Howard Hinnant &lt;howard....@gmail.com&gt; wrote:
> &gt; void put(T t)
> &gt; {
> &gt; std::unique_lock&lt;std::mutex&gt; lock(m_, std::try_to_lock);
> &gt; while (!lock.owns_lock())
> &gt; {
> &gt; if (queue_.size() &gt; capacity_/4)
> &gt; {
>
> There is problem, as queue_ is not protected under lock.

Correct. But the code gets away with it anyway because:

1. It does not modify the queue.
2. It does not need an accurate value for queue_.size().

The *very* worst that could happen is that the value of queue_.size() read by one thread is never updated by the other as the other thread fills/empties the queue. Eventually the other thread will drop the mutex ownership, either because it runs out of things to put into the queue, or because it fills the queue and has to wait, or because it empties the queue and has to wait. In any event, the other thread drops the lock and this thread acquires it via the try_lock. It has not mattered, except perhaps for efficiency purposes, that this thread has been reading the wrong value of queue_.size().

And as you pointed out, efficiency hasn't suffered.

none Yannick Tremblay

unread,
Jul 25, 2012, 4:09:43 AM7/25/12
to
In article <jumg3b$nvn$1...@news.albasani.net>, Melzzzzz <m...@zzzzz.com> wrote:
>On Tue, 24 Jul 2012 12:51:20 +0000 (UTC)
>yatremblay@bel1lin202.(none) (Yannick Tremblay) wrote:
>
>> In article <86c79114-84e6-4065...@googlegroups.com>,
>> Howard Hinnant <howard....@gmail.com> wrote:
>> >I made a few changes to your code:
>> >
>> >1. I only signal if the capacity becomes non-zero or non-full.
>>
>> >2. If the queue is the queue has got some items in it and the mutex
>> >is locked, the put thread yields to the take thread. If the queue
>> >has some free space in it and the mutex is locked, the take thread
>> >yields to the put thread.
>>
>> Are you sure about the usefulness of #2?
>
>Problem is that we need to reduce waiting on condition to minimum
>as more condition waits program is slower.
>Ideally there would be no condition waits (maximum speed).

OK, for this test program, I guess it might make sense (albeit in my
case, this worsens the results)

However, you should be careful to generalizing this to a "real"
program. This test program passes integers back and forth via the
queue and do no "work" at all. A "real" program will most likely do
work.

Typically for a real program, fcons/fprod will look more like:

void fprod()
{
for(;;)
{
Job job = CreateJob();
queue.put(job);
}
}

void fcons()
{
for(;;)
{
Job job = queue.take();
ProcessJob(Job);
}
}

And the cost of ProcessJob()/CreateJob() will dwarf the cost of
put()/take() by several order of magnitude. (in your test program, the
cost of storing the integer is dwarfed by the cost of accessing the
queue)

For such a system, you actually want it to be a common occurence that
the consumer blocks waiting on a condition.

Howard Hinnant

unread,
Jul 25, 2012, 11:46:28 AM7/25/12
to
On Wednesday, July 25, 2012 4:09:43 AM UTC-4, none wrote:

> However, you should be careful to generalizing this to a &quot;real&quot;
> program.

100% agreed.

I wrote earlier:

> I am admittedly coding to the benchmark (which is not a really great idea).

It was fun. I had a good time. :-)

Howard

Luca Risolia

unread,
Jul 26, 2012, 3:19:21 PM7/26/12
to
On 24/07/2012 21:45, Melzzzzz wrote:
> This is version with array (identical implementation
> as ArrayBlockingQueue in Java), instead of deque (I
> used atomic type for count_). This executes about same speed as Java
> as atomic type slows it down a bit.

> Sadly, no noticing difference between deque and array.

Right, it may well depend on how smart the std::deque::allocator_type is.

Below is a version of the BlockingQueue using spin locks. Compared to
the original version in Java it is about 2.5 times faster on my
platform, provided that you remove std::rand() from the test, as it
seems to be a real bottleneck and does not give any precious
informations about the efficiency of the Queue (which is the thing to
measure after all).

Let me know your results.

---

#include <thread>
#include <cstdlib>
#include <functional>
#include <atomic>
#include <memory>
#include <type_traits>

class spinlock_mutex {
std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:

void lock() {
while (flag.test_and_set(std::memory_order_acquire)) {
std::this_thread::yield();
};
}

void unlock() {
flag.clear(std::memory_order_release);
}
};

template <class T, class A = std::allocator<T >>
class BlockingQueue {
const std::size_t cap_;
A alloc_;
T* queue_;
std::size_t n = 0, b = 0, f = 0;
spinlock_mutex m_;
public:

BlockingQueue(const std::size_t cap, const A& alloc = A())
: cap_(cap), alloc_(alloc), queue_(alloc_.allocate(cap)) { }

~BlockingQueue() {
alloc_.deallocate(queue_, cap_);
}

static_assert(std::is_nothrow_move_constructible<T>::value,
"BlockingQueue cannot be exception-safe");

void put(const T t) {
for (;;) {
m_.lock();
if (n < cap_)
break;
m_.unlock();
}
alloc_.construct(&queue_[f], std::move(t));
f = ++f % cap_;
++n;
m_.unlock();
}

T take() {
for (;;) {
m_.lock();
if (n > 0)
break;
m_.unlock();
}
const T t = std::move(queue_[b]);
alloc_.destroy(&queue_[b]);
b = ++b % cap_;
--n;
m_.unlock();
return t;
}
};

int main() {
BlockingQueue<int> produced(100000);
const int nitems = 100000000;
std::srand(12345);


std::function<void() > f_prod = [&](){
int i = nitems;
while (i-- > 0) {
produced.put(i);
}
};

std::thread producer1(f_prod);

std::function<void() > f_cons = [&](){
const int size = 10000;
int arr[size];
int i = nitems;
while (i-- > 0) {
arr[std::rand() % size] = produced.take();

Melzzzzz

unread,
Jul 26, 2012, 3:56:34 PM7/26/12
to
On Thu, 26 Jul 2012 21:19:21 +0200
Luca Risolia <luca.r...@studio.unibo.it> wrote:

> On 24/07/2012 21:45, Melzzzzz wrote:
> > This is version with array (identical implementation
> > as ArrayBlockingQueue in Java), instead of deque (I
> > used atomic type for count_). This executes about same speed as Java
> > as atomic type slows it down a bit.
>
> > Sadly, no noticing difference between deque and array.
>
> Right, it may well depend on how smart the std::deque::allocator_type
> is.
>
> Below is a version of the BlockingQueue using spin locks. Compared to
> the original version in Java it is about 2.5 times faster on my
> platform, provided that you remove std::rand() from the test, as it
> seems to be a real bottleneck and does not give any precious
> informations about the efficiency of the Queue (which is the thing to
> measure after all).
>
> Let me know your results.
(your version without rand and array assignment)
bmaxa@maxa:~/examples$ time ./consprod5

real 0m6.812s
user 0m7.532s
sys 0m5.136s
(Howards version without rand and array assignment)
bmaxa@maxa:~/examples$ time ./consprod4

real 0m6.748s
user 0m7.376s
sys 0m5.000s

(your version with rand and array assignment)
bmaxa@maxa:~/examples$ time ./consprod5

real 0m23.733s
user 0m27.042s
sys 0m16.849s

(Howards version with rand and array assignment)
bmaxa@maxa:~/examples$ time ./consprod4

real 0m9.351s
user 0m10.257s
sys 0m6.900s

(java (Howards version) without rand and array assignment)
bmaxa@maxa:~/examples$ time java consprod

real 0m8.088s
user 0m8.841s
sys 0m5.596s

(java (Howards version) with rand and array assignment)
bmaxa@maxa:~/examples$ time java consprod

real 0m10.052s
user 0m11.117s
sys 0m7.208s


>
> ---
>
> #include <thread>
> #include <cstdlib>
> #include <functional>
> #include <atomic>
> #include <memory>
> #include <type_traits>
>
> class spinlock_mutex {
> std::atomic_flag flag = ATOMIC_FLAG_INIT;

gcc 4.6.3 doesn't implements non static member initialization ;(

>
> static_assert(std::is_nothrow_move_constructible<T>::value,
> "BlockingQueue cannot be exception-safe");

this is not yet in library ;(

>
> void put(const T t) {
> for (;;) {
> m_.lock();
> if (n < cap_)
> break;
> m_.unlock();
> }
> alloc_.construct(&queue_[f], std::move(t));
> f = ++f % cap_;

compiler warns about undefined behavior here.

> ++n;
> m_.unlock();
> }
>
> T take() {
> for (;;) {
> m_.lock();
> if (n > 0)
> break;
> m_.unlock();
> }
> const T t = std::move(queue_[b]);
> alloc_.destroy(&queue_[b]);
> b = ++b % cap_;

also warning about undefined behavior.


Luca Risolia

unread,
Jul 26, 2012, 4:15:33 PM7/26/12
to
On 26/07/2012 21:56, Melzzzzz wrote:
>> f = ++f % cap_;
>
> compiler warns about undefined behavior here.

Ouch..split that:

++f;
f %= cap_;

>> b = ++b % cap_;

++b;
b %= cap_

I did not turn on all the warnings. My compiler produced the correct
code though.

Melzzzzz

unread,
Jul 26, 2012, 5:01:59 PM7/26/12
to
On Thu, 26 Jul 2012 22:15:33 +0200
Luca Risolia <luca.r...@studio.unibo.it> wrote:

> On 26/07/2012 21:56, Melzzzzz wrote:
> >> f = ++f % cap_;
> >
> > compiler warns about undefined behavior here.
>
> Ouch..split that:
>
> ++f;
> f %= cap_;

I corrected that already ;)

I have installed gcc-snapshot compiler
gcc version 4.8.0 20120314 (experimental) [trunk revision 185382]
(Ubuntu/Linaro 20120314-0ubuntu2)

Your timings with array assignment and rand included, are much worse
than gcc 4.6.3. Definitely something is wrong if
other thread does something else besides waiting on
spin lock.




Luca Risolia

unread,
Jul 26, 2012, 5:12:54 PM7/26/12
to
Yes, it's worth to strace std::rand() to see what it does exactly.
Out of my curiosity, try to yield() after releasing the lock in both
put() and take():

m_.unlock();
boost::this_thread::yield(); // add this line

do you see any improvements?

Marc

unread,
Jul 26, 2012, 5:10:40 PM7/26/12
to
Luca Risolia wrote:

> On 26/07/2012 21:56, Melzzzzz wrote:
>>> f = ++f % cap_;
>>
>> compiler warns about undefined behavior here.
>
> Ouch..split that:
>
> ++f;
> f %= cap_;

If you want to write it in one line, you can still do:
++f %= cap_;

Donkey Hottie

unread,
Jul 26, 2012, 6:05:27 PM7/26/12
to
How about f = (f+1) % cap_;

Paavo Helde

unread,
Jul 26, 2012, 6:12:09 PM7/26/12
to
Marc <marc....@gmail.com> wrote in news:jusbog$lv0$1...@news-rocq.inria.fr:
This is also undefined behavior in C++98 (two modifications of f without a
sequence point). In C++11 it seems to be defined however ("In all cases,
the assignment is sequenced after the value computation of the right and
left operands, and before the value computation of the assignment
expression.")

Melzzzzz

unread,
Jul 26, 2012, 6:24:00 PM7/26/12
to
On Thu, 26 Jul 2012 23:12:54 +0200
No, that make it drastically slower (when not assigning array and using
rand, also not any faster when rand is used).
You could speed up your program by 10-30% I think with just
using my original implementation of
++f;
if(f == cap_)f = 0;
instead of
++f;
f %= cap_;
because division is drastically slower on modern CPUs,
than check and assignment (that's why division by constant
is usually implemented by shift and multiply).
In this case I don;t think compiler can be that clever, it
has to use division and that makes performance *much* worse.

take a look at my time now (with if instead of modulo)
bmaxa@maxa:~/examples$ time ./consprod5

real 0m5.234s
user 0m5.912s
sys 0m3.976s

almost 30% improvement!

Melzzzzz

unread,
Jul 26, 2012, 6:48:59 PM7/26/12
to
On Thu, 26 Jul 2012 23:12:54 +0200
I have solved your problem. Seems that your program suffers same problem
as my original version. When applied Howards algorithm to your version
I've got fastest time both for rand array assignment and without:

take a look:
(merged version with rand array assignment)
bmaxa@maxa:~/examples$ time ./consprod5

real 0m4.731s
user 0m5.164s
sys 0m2.804s

(merged version without rand array assignment,
wow this is fast ;)
bmaxa@maxa:~/examples$ time ./consprod5

real 0m2.345s
user 0m2.624s
sys 0m0.820s

code:
#include <thread>
#include <cstdlib>
#include <functional>
#include <atomic>
#include <memory>
#include <type_traits>

class spinlock_mutex {
std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
void lock() {
while (flag.test_and_set(std::memory_order_acquire)) {
std::this_thread::yield();
};
}
bool try_lock(){
if(flag.test_and_set(std::memory_order_acquire))
return false;
else
return true;
}
void unlock() {
flag.clear(std::memory_order_release);
}
};

template <class T, class A = std::allocator<T >>
class BlockingQueue {
const std::size_t cap_;
A alloc_;
T* queue_;
std::size_t n = 0, b = 0 , f = 0;
spinlock_mutex m_;
public:

BlockingQueue(const std::size_t cap, const A& alloc = A())
: cap_(cap), alloc_(alloc), queue_(alloc_.allocate(cap)) { }

~BlockingQueue() {
alloc_.deallocate(queue_, cap_);
}

static_assert(std::is_nothrow_move_constructible<T>::value,
"BlockingQueue cannot be exception-safe");

void put(T t) {
bool locked = m_.try_lock();
int retry = 0;
while(!locked)
{
if(n > cap_ / 4 && ++retry < 1000)
{
std::this_thread::yield();
}
else
{
locked = m_.try_lock();
}
}
alloc_.construct(&queue_[f], std::move(t));
++f;
if(f == cap_)f = 0;
++n;
m_.unlock();
}

T take() {
bool locked = m_.try_lock();
int retry = 0;
while(!locked)
{
if(n < 3 * cap_ / 4 && ++retry < 1000)
{
std::this_thread::yield();
}
else
{
locked = m_.try_lock();
}
}
T t = std::move(queue_[b]);
alloc_.destroy(&queue_[b]);
++b;
if(b == cap_) b = 0;
--n;
m_.unlock();
return t;
}
};

int main() {
BlockingQueue<int> produced(1000);

Melzzzzz

unread,
Jul 26, 2012, 11:43:53 PM7/26/12
to
Nah, implementation is completely wrong (it was too late),disregard this
post, problem remains... ;(


Melzzzzz

unread,
Jul 27, 2012, 1:43:13 AM7/27/12
to
On Thu, 26 Jul 2012 23:12:54 +0200
Finally got it ;)
Problem was that producer thread constantly spins, but consumer does
some work. Queue was constantly full and that slowed it down
significantly.
I have used Howards trick, and this time correctly ;)

now times are fastest (queue size was 1000):
(with rand)
bmaxa@maxa:~/examples$ time ./consprod5

real 0m6.215s
user 0m6.440s
sys 0m5.108s

(without rand)
bmaxa@maxa:~/examples$ time ./consprod5

real 0m3.232s
user 0m3.024s
sys 0m2.968s

code:
void put(T t) {
for(;;)
{
m_.lock();
if(n < cap_)
{
int retry = 0;
while(n > 3 * cap_ / 4 && ++retry < 3000)
{
m_.unlock();
std::this_thread::yield();
m_.lock();
}
break;
}
m_.unlock();
}
alloc_.construct(&queue_[f], std::move(t));
++f;
if(f == cap_)f = 0;
++n;
m_.unlock();
}

T take() {
for(;;)
{
m_.lock();
if(n > 0)
{
int retry = 0;
while(n < cap_ / 4 && ++retry < 3000)
{
m_.unlock();
std::this_thread::yield();
m_.lock();
}
break;
}
m_.unlock();
}
T t = std::move(queue_[b]);
alloc_.destroy(&queue_[b]);

James Kanze

unread,
Jul 29, 2012, 6:27:18 AM7/29/12
to
On Sunday, July 22, 2012 10:59:20 PM UTC+1, Melzzzzz wrote:
> I have played little bit with new C++11 features and compared,

> java performance to c++.

> Actually this was meant to be GC vs RAII memory management,
> but boiled down to speed of BlockingQueue class in java,
> and mine in c++.

> It seems that java implementation is so much more efficient
> but I don't know why. I even tried c++ without dynamic
> memory management (except queue itself) and that is *even slower*.
> Must be some quirks with a queue ;)

Just a guess (I don't know anything about the Java
BlockingQueue), but your C++ implementation uses an extensible
queue; the restriction on the queue size is external. From the
code, I'd guess that the Java blocking queue is a fixed length
(established once in the constructor). This means that once the
queue has been constructed, there are no more allocations in
Java; in C++, I think you'll find that std::deque does allocate
and free new memory as it expands and contracts.

You might try replacing std::deque with a simple, std::vector
based circular buffer. Construct the std::vector to its fixed
size once at the beginning, and just keep two iterators into it:
one for writing, and one for reading. Wrap the iterators each
time they reach the end. (Since the limit conditions for full
and empty are sometimes difficult to implement, you'd be better
off searching an existing implementation on the net. I'd be
surprised if Boost didn't have one, for example.)

--
James

Melzzzzz

unread,
Jul 29, 2012, 8:40:12 AM7/29/12
to
On Sun, 29 Jul 2012 03:27:18 -0700 (PDT)
James Kanze <james...@gmail.com> wrote:

>
> You might try replacing std::deque with a simple, std::vector
> based circular buffer.

I already did that. If you follow thread, Howard solved problem.




Ricardo Nabinger Sanchez

unread,
Aug 14, 2012, 6:48:26 PM8/14/12
to
On Sun, 22 Jul 2012 23:59:20 +0200, Melzzzzz wrote:

> I have played little bit with new C++11 features and compared,
> java performance to c++.
> Actually this was meant to be GC vs RAII memory management,
> but boiled down to speed of BlockingQueue class in java,
> and mine in c++.
> It seems that java implementation is so much more efficient
> but I don't know why. I even tried c++ without dynamic
> memory management (except queue itself) and that is *even slower*.
> Must be some quirks with a queue

As others mentioned, most of your CPU time is being wasted on memory
management and synchronization. I've compiled your source with GCC-4.7.1,
-O2, and ran through perf (from Linux kernel). The output follows.

Cheers!

# captured on: Tue Aug 14 19:22:47 2012
# os release : 3.2.23
# perf version : 3.2.13
# arch : x86_64
# nrcpus online : 4
# nrcpus avail : 4
# cpudesc : Intel(R) Core(TM) i3 CPU 540 @ 3.07GHz
# cpuid : GenuineIntel,6,37,5
# total memory : 3897640 kB

# Overhead Command Shared Object Symbol
10.90% lala [kernel.kallsyms] [k] _raw_spin_lock
10.34% lala libc-2.15.so [.] _int_free
8.84% lala libpthread-2.15.so [.] pthread_cond_signal@@GLIBC_2.3.2
7.53% lala libc-2.15.so [.] free
7.28% lala libc-2.15.so [.] _int_malloc
6.89% lala libc-2.15.so [.] malloc
6.49% lala libpthread-2.15.so [.] pthread_mutex_lock
4.85% lala libpthread-2.15.so [.] __pthread_mutex_unlock_usercnt
3.20% lala libpthread-2.15.so [.] __lll_lock_wait
3.18% lala [kernel.kallsyms] [k] hash_futex
2.49% lala [kernel.kallsyms] [k] system_call
2.40% lala [kernel.kallsyms] [k] futex_wait_setup
2.02% lala [kernel.kallsyms] [k] copy_user_generic_string
1.99% lala libc-2.15.so [.] __random
1.81% lala libpthread-2.15.so [.] __lll_unlock_wake
1.76% lala lala [.] main::{lambda()#2}::operator()() const
1.60% lala lala [.] std::_Function_handler<void (), main::{lambda()#1}>::_M_invoke(std::_Any_data const&)
1.43% lala [kernel.kallsyms] [k] futex_wake
1.39% lala [kernel.kallsyms] [k] do_futex
1.32% lala [kernel.kallsyms] [k] system_call_after_swapgs
1.21% lala [kernel.kallsyms] [k] sys_futex
1.21% lala [kernel.kallsyms] [k] get_futex_key
1.15% lala [kernel.kallsyms] [k] try_to_wake_up
1.03% lala [kernel.kallsyms] [k] futex_wait
0.58% lala [kernel.kallsyms] [k] sysret_check
0.46% lala libpthread-2.15.so [.] pthread_cond_wait@@GLIBC_2.3.2
0.40% lala [kernel.kallsyms] [k] get_futex_value_locked
0.40% lala libstdc++.so.6.0.17 [.] operator new(unsigned long)
0.33% lala [kernel.kallsyms] [k] select_task_rq_fair
0.31% lala libc-2.15.so [.] __random_r
0.29% lala libpthread-2.15.so [.] _L_lock_965
0.29% lala [kernel.kallsyms] [k] __schedule
0.24% lala [kernel.kallsyms] [k] drop_futex_key_refs.isra.12
0.23% lala libstdc++.so.6.0.17 [.] std::condition_variable::notify_one()
0.20% lala lala [.] pthread_mutex_unlock@plt
0.20% lala [kernel.kallsyms] [k] get_futex_key_refs.isra.10
0.16% lala [kernel.kallsyms] [k] system_call_fastpath
0.16% lala libpthread-2.15.so [.] _L_unlock_567
0.15% lala [kernel.kallsyms] [k] task_waking_fair
0.14% lala lala [.] _ZNSt18condition_variable10notify_oneEv@plt
0.13% lala libstdc++.so.6.0.17 [.] malloc@plt
0.13% lala [kernel.kallsyms] [k] wake_futex
0.12% lala lala [.] pthread_mutex_lock@plt
0.11% lala [kernel.kallsyms] [k] intel_pmu_disable_all
0.11% lala libstdc++.so.6.0.17 [.] pthread_cond_signal@plt
0.11% lala libpthread-2.15.so [.] __pthread_mutex_cond_lock
0.10% lala libc-2.15.so [.] rand
0.10% lala libpthread-2.15.so [.] pthread_mutex_unlock
0.10% lala lala [.] rand@plt
0.09% lala [kernel.kallsyms] [k] default_send_IPI_mask_sequence_phys
0.08% lala lala [.] _Znwm@plt
0.07% lala [kernel.kallsyms] [k] _raw_spin_lock_irqsave
0.07% lala [kernel.kallsyms] [k] finish_task_switch
0.07% lala [kernel.kallsyms] [k] native_sched_clock
0.07% lala [kernel.kallsyms] [k] futex_wait_queue_me
0.06% lala [kernel.kallsyms] [k] intel_pmu_enable_all
0.06% lala [kernel.kallsyms] [k] cpuacct_charge
0.05% lala libc-2.15.so [.] __lll_lock_wait_private
0.05% lala [kernel.kallsyms] [k] plist_add
0.05% lala lala [.] _ZdlPv@plt
0.05% lala [kernel.kallsyms] [k] sched_clock_cpu
0.05% lala [kernel.kallsyms] [k] hrtick_update
0.05% lala [kernel.kallsyms] [k] update_curr
0.04% lala [kernel.kallsyms] [k] pick_next_task_stop
0.04% lala [kernel.kallsyms] [k] hrtick_start_fair
0.04% lala libpthread-2.15.so [.] __pthread_disable_asynccancel
0.04% lala libstdc++.so.6.0.17 [.] operator delete(void*)
0.04% lala [kernel.kallsyms] [k] dequeue_task_fair
0.04% lala libstdc++.so.6.0.17 [.] free@plt
0.04% lala [kernel.kallsyms] [k] irq_exit
0.03% lala [kernel.kallsyms] [k] pick_next_task_fair
0.03% lala [kernel.kallsyms] [k] local_clock
0.03% lala lala [.] void std::deque<std::unique_ptr<int, std::default_delete<int> >, std::allocator<std::unique_ptr<int, std::default_delete<int> > > >::_M_push_back_aux<std::unique_ptr<int, std::default_delete<int> > >(std::unique_ptr<int, std::default_delete<int> >&&)
0.03% lala [kernel.kallsyms] [k] rcu_note_context_switch
0.03% lala [kernel.kallsyms] [k] _raw_spin_lock_irq
0.03% lala [kernel.kallsyms] [k] find_next_bit
0.03% lala [kernel.kallsyms] [k] dequeue_entity
0.03% lala [kernel.kallsyms] [k] update_cfs_shares
0.03% lala [kernel.kallsyms] [k] pick_next_task_rt
0.02% lala [kernel.kallsyms] [k] account_entity_dequeue
0.02% lala [kernel.kallsyms] [k] put_prev_task_fair
0.02% lala [kernel.kallsyms] [k] __perf_event_task_sched_out
0.02% lala [kernel.kallsyms] [k] dequeue_task
0.02% lala libpthread-2.15.so [.] __pthread_enable_asynccancel
0.02% lala libc-2.15.so [.] __lll_unlock_wake_private
0.02% lala [kernel.kallsyms] [k] update_min_vruntime
0.02% lala [kernel.kallsyms] [k] update_rq_clock
0.02% lala [kernel.kallsyms] [k] update_context_time.isra.48
0.02% lala [kernel.kallsyms] [k] perf_pmu_disable
0.02% lala [kernel.kallsyms] [k] apic_timer_interrupt
0.02% lala [kernel.kallsyms] [k] schedule
0.02% lala libstdc++.so.6.0.17 [.] std::condition_variable::wait(std::unique_lock<std::mutex>&)
0.02% lala [kernel.kallsyms] [k] free_hot_cold_page
0.02% lala [kernel.kallsyms] [k] pick_next_task_idle
0.02% lala [kernel.kallsyms] [k] do_softirq
0.01% lala [kernel.kallsyms] [k] ktime_get
0.01% lala [kernel.kallsyms] [k] update_cfs_load
0.01% lala [kernel.kallsyms] [k] perf_event_context_sched_in
0.01% lala [kernel.kallsyms] [k] __perf_event_task_sched_in
0.01% lala [kernel.kallsyms] [k] perf_pmu_rotate_start.isra.44
0.01% lala [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
0.01% lala [kernel.kallsyms] [k] hrtimer_interrupt
0.01% lala libc-2.15.so [.] __memmove_ssse3_back
0.01% lala [kernel.kallsyms] [k] ctx_sched_out
0.01% lala [kernel.kallsyms] [k] physflat_send_IPI_mask
0.01% lala [kernel.kallsyms] [k] intel_pmu_nhm_enable_all
0.01% lala [kernel.kallsyms] [k] do_timer
0.01% lala [kernel.kallsyms] [k] smp_apic_timer_interrupt
0.01% lala [kernel.kallsyms] [k] x86_pmu_disable
0.01% lala [kernel.kallsyms] [k] plist_del
0.01% lala [kernel.kallsyms] [k] deactivate_task
0.01% lala [kernel.kallsyms] [k] run_timer_softirq
0.01% lala libc-2.15.so [.] _L_lock_4111
0.01% lala [kernel.kallsyms] [k] rcu_enter_nohz
0.01% lala [kernel.kallsyms] [k] __rcu_pending
0.01% lala [kernel.kallsyms] [k] perf_event_task_sched_out
0.01% lala [kernel.kallsyms] [k] native_smp_send_reschedule
0.01% lala [kernel.kallsyms] [k] __do_softirq
0.01% lala [kernel.kallsyms] [k] profile_tick
0.01% lala [kernel.kallsyms] [k] __run_hrtimer
0.01% lala [kernel.kallsyms] [k] scheduler_tick
0.01% lala [kernel.kallsyms] [k] task_tick_fair
0.01% lala [kernel.kallsyms] [k] tick_sched_timer
0.01% lala [kernel.kallsyms] [k] perf_event_task_tick
0.01% lala [kernel.kallsyms] [k] x86_pmu_enable
0.01% lala [kernel.kallsyms] [k] acct_update_integrals
0.00% lala libc-2.15.so [.] _L_unlock_12586
0.00% lala libpthread-2.15.so [.] _L_cond_lock_973
0.00% lala [kernel.kallsyms] [k] hrtimer_forward
0.00% lala [kernel.kallsyms] [k] effective_load.isra.70
0.00% lala [kernel.kallsyms] [k] update_process_times
0.00% lala [kernel.kallsyms] [k] __unqueue_futex
0.00% lala [kernel.kallsyms] [k] perf_pmu_enable
0.00% lala [kernel.kallsyms] [k] clear_buddies
0.00% lala [kernel.kallsyms] [k] rcu_check_callbacks
0.00% lala lala [.] _ZNSt18condition_variable4waitERSt11unique_lockISt5mutexE@plt
0.00% lala [kernel.kallsyms] [k] update_cpu_load
0.00% lala [kernel.kallsyms] [k] rcu_process_callbacks
0.00% lala [kernel.kallsyms] [k] raise_softirq
0.00% lala [kernel.kallsyms] [k] jiffies_to_timeval
0.00% lala [kernel.kallsyms] [k] timerqueue_add
0.00% lala [kernel.kallsyms] [k] ret_from_sys_call
0.00% lala [kernel.kallsyms] [k] flush_tlb_others_ipi
0.00% lala [kernel.kallsyms] [k] rb_insert_color
0.00% lala [kernel.kallsyms] [k] rcu_process_gp_end.isra.27
0.00% lala [kernel.kallsyms] [k] rcu_exit_nohz
0.00% lala [kernel.kallsyms] [k] hrtimer_run_queues
0.00% lala libc-2.15.so [.] _L_lock_12561
0.00% lala [kernel.kallsyms] [k] idle_cpu
0.00% lala [kernel.kallsyms] [k] do_lookup
0.00% lala [kernel.kallsyms] [k] calc_global_load
0.00% lala [kernel.kallsyms] [k] lookup_page_cgroup
0.00% lala [kernel.kallsyms] [k] irq_enter
0.00% lala [kernel.kallsyms] [k] force_quiescent_state
0.00% lala [kernel.kallsyms] [k] dyntick_save_progress_counter
0.00% lala [kernel.kallsyms] [k] run_posix_cpu_timers
0.00% lala [kernel.kallsyms] [k] __math_state_restore
0.00% lala [kernel.kallsyms] [k] clockevents_program_event
0.00% lala [kernel.kallsyms] [k] account_process_tick
0.00% lala [kernel.kallsyms] [k] find_busiest_group
0.00% lala [kernel.kallsyms] [k] call_softirq
0.00% lala [kernel.kallsyms] [k] rb_next
0.00% lala [kernel.kallsyms] [k] wake_up_state
0.00% lala [kernel.kallsyms] [k] printk_tick
0.00% lala [kernel.kallsyms] [k] find_vma_prev
0.00% lala [kernel.kallsyms] [k] __alloc_pages_nodemask
0.00% lala [kernel.kallsyms] [k] tick_program_event
0.00% lala [kernel.kallsyms] [k] update_shares
0.00% lala [kernel.kallsyms] [k] cpuacct_update_stats
0.00% lala [e1000e] [k] e1000_irq_enable
0.00% lala [kernel.kallsyms] [k] sched_clock_tick
0.00% lala [kernel.kallsyms] [k] irq_work_run
0.00% lala [kernel.kallsyms] [k] clear_page_c
0.00% lala [kernel.kallsyms] [k] account_user_time
0.00% lala [kernel.kallsyms] [k] fib_get_table
0.00% lala [kernel.kallsyms] [k] ehci_work
0.00% lala [kernel.kallsyms] [k] __current_kernel_time
0.00% lala [kernel.kallsyms] [k] __rcu_process_callbacks
0.00% lala [kernel.kallsyms] [k] rb_erase
0.00% lala [kernel.kallsyms] [k] __percpu_counter_add
0.00% lala [kernel.kallsyms] [k] update_vsyscall
0.00% lala [kernel.kallsyms] [k] mem_cgroup_pgfault
0.00% lala libstdc++.so.6.0.17 [.] std::thread::join()
0.00% lala libpthread-2.15.so [.] __read_nocancel
0.00% lala libpthread-2.15.so [.] start_thread
0.00% lala [kernel.kallsyms] [k] __mmdrop


--
Ricardo Nabinger Sanchez http://rnsanchez.wait4.org/
"Left to themselves, things tend to go from bad to worse."
0 new messages