Rhino Service Bus with Load Balancer pegs CPU around 100%


Michael Lyons

Nov 14, 2011, 7:06:29 PM
to rhino-t...@googlegroups.com
When using the load balancer with RSB, I'm seeing the CPU run at near 100% when the consumers are all busy, which causes the consumers to run slower and be free less often.
It can be simulated easily by setting up a load balancer with no consumers listening to it and trying to send some messages to the consumer.

In my specific situation I have 2 load balancers with 5 threads each (each load balancer runs a separate queue with different types of messages), and there is a consumer waiting at the other end of each load balancer with another 20 threads each. If one of the load balancers gets congested, then all consumers run slow. When I ran the load balancer without load, it averaged ~200ms to process a message; once the load balancer was under load (achieved by queuing over 1000 messages), the average rose to ~1750ms, which means the user waits more than 8 times longer for their tasks to complete.

Is there any way around this?

Corey Kaylor

Nov 15, 2011, 9:15:36 PM
to rhino-t...@googlegroups.com
Is each load balancer configured with a ready for work uri?


Corey Kaylor

Nov 15, 2011, 9:17:39 PM
to rhino-t...@googlegroups.com
Also, how many cores are on the load balancer machine? There shouldn't be that much demand on the CPU, but having said that, it really depends on the circumstances and environment.

Michael Lyons

Nov 15, 2011, 10:12:24 PM
to Rhino Tools Dev
The load balancers are configured with the readyForWorkEndpoint
attribute on the loadBalancer xml element.

The system is a quad core 2.83GHz core 2 duo. On the staging server,
which is running an older single core 2.8GHz Xeon (Dell 2650) with
hyper-threading, it sits at about 80%, and in production it sits
between 40% and 80% on a quad core 2.8GHz Xeon (Dell R210) where it is
allocated 2 cores.

Forgot to mention that RSB is version 2.2.



Corey Kaylor

Nov 15, 2011, 11:13:24 PM
to rhino-t...@googlegroups.com
To summarize your setup:

Load Balancer 1, configured for messages belonging to NamespaceA, with 5 threads, deployed to MachineA\queue1
   1 worker endpoint sending ready for work to MachineA\queue1.readyforwork, configured with 20 threads, deployed to MachineA

Load Balancer 2, configured for messages belonging to NamespaceB, with 5 threads, deployed to MachineA\queue2
   1 worker endpoint sending ready for work to MachineA\queue2.readyforwork, configured with 20 threads, deployed to MachineA

I assume that by staging server you mean a staging environment that is configured similarly to the above, but with different machine specs, as you've stated.

Is this correct?

Michael Lyons

Nov 15, 2011, 11:24:16 PM
to Rhino Tools Dev
Yes you're correct, it's a staging environment where we do our testing
before releasing into production.

That's pretty much the situation.

Here are the xml configurations for the 2 load balancers:

<loadBalancer threadCount="5"
endpoint="msmq://localhost/notifier.loadbalancer"
readyForWorkEndpoint="msmq://localhost/notifier.loadbalancer.acceptingwork"
/>

<loadBalancer threadCount="5"
endpoint="msmq://localhost/processor.loadbalancer"
readyForWorkEndpoint="msmq://localhost/processor.loadbalancer.acceptingwork"
/>

Consumers xml configuration is:

<bus threadCount="20"
loadBalancerEndpoint="msmq://localhost/processor.loadbalancer.acceptingwork"
numberOfRetries="5"
endpoint="msmq://localhost/processor"
/>

<bus threadCount="20"
loadBalancerEndpoint="msmq://localhost/notifier.loadbalancer.acceptingwork"
numberOfRetries="5"
endpoint="msmq://localhost/notifier"
/>

Corey Kaylor

Nov 15, 2011, 11:40:27 PM
to rhino-t...@googlegroups.com
I would try changing the thread counts on the consumers and the load balancer, and possibly add additional worker endpoint(s).

Ayende in previous conversations has recommended thread counts that are equal to the number of cores on the machine. I have found that isn't always a perfect recipe, so in our case we have run load tests, changing the thread configuration for each machine between runs.

When changing the thread counts on each test run, try to observe which specific process is utilizing the most CPU.

There may be places to optimize for sure, but it sounds to me like threads are competing for priority.
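
To illustrate the kind of test run described above, here is a minimal Python sketch (not RSB code) that times the same batch of simulated slow calls under several thread counts. The names `slow_call` and `time_run`, the task count, and the 5 ms delay are all assumptions made for the example; the sleeps only model I/O-bound waits, so real tuning still has to be measured against the actual endpoints.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_call(delay_s):
    """Stand-in for a handler that waits on a slow external service."""
    time.sleep(delay_s)
    return delay_s


def time_run(thread_count, task_count=20, delay_s=0.005):
    """Return the wall-clock seconds needed to finish task_count slow calls."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        # list() forces every task to complete before the timer stops.
        list(pool.map(slow_call, [delay_s] * task_count))
    return time.perf_counter() - start


# One measurement per candidate thread count (hypothetical values).
results = {n: time_run(n) for n in (1, 2, 5, 10)}
for n, elapsed in results.items():
    print(f"{n:>2} threads: {elapsed * 1000:.1f} ms")
```

Comparing the per-run timings while watching per-process CPU usage is one way to see where extra threads stop helping.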

Michael Lyons

Nov 16, 2011, 12:28:14 AM
to Rhino Tools Dev
I've run EQATEC profiler against the code, and when the load balancer
process is under load it records no activity between snapshots,
indicating it is sitting in RSB code.

I'd be happy to spot profile RSB in my app and point out where the
high CPU is coming from, but I'm assuming you already have a fair idea.

What do you mean by adding additional worker endpoints? Can you point
me to an example?

Corey Kaylor

Nov 16, 2011, 12:39:53 AM
to rhino-t...@googlegroups.com
I am happy to take any form of contribution you can offer.

By adding additional worker endpoints I mean:

Load Balancer 1, 5 threads, deployed to MachineA
  Worker endpoint 1, configured to send to MachineA\queue1.readyforwork, 5 threads, deployed to NewMachineB
  Worker endpoint 2, configured to send to MachineA\queue1.readyforwork, 5 threads, deployed to NewMachineC

Load balancing although completely *possible* to run on one machine, was designed to distribute load to multiple machines. You're not gaining any benefits from load balancing when there is only one worker sending ready for work messages to the load balancer. You would be better off in this case just having two endpoints without load balancing.
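
As a rough illustration of the ready-for-work idea described above, here is a toy Python model. It is not RSB's implementation: the class `LoadBalancerModel` and its methods are hypothetical, and the only behaviour modelled is that a message is handed to whichever worker has most recently announced it is free.

```python
from collections import deque


class LoadBalancerModel:
    """Toy model: forward each message to a worker that announced it is ready."""

    def __init__(self):
        self.ready_workers = deque()  # endpoints that sent "ready for work"
        self.pending = deque()        # messages waiting for a free worker
        self.delivered = []           # (worker, message) pairs, for inspection

    def ready_for_work(self, worker):
        """A worker reports that it can take one more message."""
        self.ready_workers.append(worker)
        self._dispatch()

    def send(self, message):
        """A client sends a message to the load balancer's queue."""
        self.pending.append(message)
        self._dispatch()

    def _dispatch(self):
        # Pair queued messages with ready workers until one side runs out.
        while self.pending and self.ready_workers:
            worker = self.ready_workers.popleft()
            self.delivered.append((worker, self.pending.popleft()))


lb = LoadBalancerModel()
for msg in ("m1", "m2", "m3"):
    lb.send(msg)                  # nobody is ready yet, so messages queue up
lb.ready_for_work("NewMachineB")
lb.ready_for_work("NewMachineC")
print(lb.delivered)
print(list(lb.pending))
```

With a single worker the model degenerates into a plain queue, which is the point made above about gaining nothing from load balancing in that case.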

Michael Lyons

Nov 16, 2011, 2:44:54 AM
to Rhino Tools Dev
Strangely enough, I'm going to be testing load balancing across
physical servers next week, as I provisioned another server last week
for the staging environment to test this out.

In our case the workers get tied up as they are contacting website
services which sometimes can be really slow (up to 120 seconds),
causing the load balancer's queue to grow. My idea with the load
balancer was that I can spin up a new worker process when the queue
becomes too large, which is what I can do currently and it works
perfectly. It's just that the load balancer is consuming more
resources than it needs to while the machine is really not under any
other stress.

I've just done some quick profiling and all the action seems to be
called from AbstractMsmqListener.PeekMessageOnBackgroundThread. It
spends 53% of its time in calls to
MsmqLoadBalancer.HandlePeekedMessage and its children, with the
remaining 47% in AbstractMsmqListener.TryPeek and its children.

So over a total period of 4 minutes RSB consumed 183 seconds out of
240 seconds of CPU time, excluding my app's time, which I think is a
bit excessive, particularly since it peeked at 226130 messages.

Shouldn't the load balancer pause for a second if it failed to get in
contact with any of the workers, instead of just blindly retrying?

Here are the top offenders in csv format - if you want I can email you
a full csv (it's actually tab delimited) or a pdf.
Total Time with children (ms), Average Time with children (ms), Total for self (ms), Average for self (ms), Calls, Method name
+183366,0.8,11384,0.1,226122,Rhino.ServiceBus.LoadBalancer.MsmqLoadBalancer.HandlePeekedMessage(Rhino.ServiceBus.Msmq.OpenedQueue,System.Messaging.Message)
+160651,0.7,4743,0,226130,Rhino.ServiceBus.Msmq.AbstractMsmqListener.TryPeek(Rhino.ServiceBus.Msmq.OpenedQueue,System.Messaging.Message&)
+155724,0.7,155724,0.7,226130,Rhino.ServiceBus.Msmq.OpenedQueue.Peek(System.TimeSpan)
+134270,0.6,134138,0.6,226125,Rhino.ServiceBus.Msmq.OpenedQueue.TryGetMessageFromQueue(System.String)
29787,0.2,1159,0,180430,Rhino.ServiceBus.LoadBalancer.MsmqLoadBalancer.HandleStandardMessage(Rhino.ServiceBus.Msmq.OpenedQueue,System.Messaging.Message)
28569,0.2,28546,0.2,180430,Rhino.ServiceBus.Msmq.OpenedQueue.Send(System.Messaging.Message)
7759,0,1312,0,180432,Rhino.ServiceBus.LoadBalancer.MsmqLoadBalancer.PersistEndpoint(Rhino.ServiceBus.Msmq.OpenedQueue,System.Messaging.Message)
3825,0,3825,0,180432,Rhino.ServiceBus.Msmq.MsmqUtil.GetQueueUri(System.Messaging.MessageQueue)
2622,0,2622,0,180431,Rhino.ServiceBus.DataStructures.Set`1.Add(T)
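
For anyone wanting to sanity-check the figures quoted in this post, the arithmetic can be reproduced with a few lines of Python. The variable names are only for illustration; the numbers are the ones reported above (a 240 second window, 183 seconds of RSB CPU time, 226130 peeks, and the two "total with children" values).

```python
window_s = 240       # profiling window: 4 minutes
rsb_cpu_s = 183      # CPU seconds attributed to RSB in the post
peeks = 226130       # calls to AbstractMsmqListener.TryPeek

# Share of the window spent in RSB code, and how often the queue was peeked.
cpu_share = rsb_cpu_s / window_s
peeks_per_second = peeks / window_s

# Split between the two branches under PeekMessageOnBackgroundThread.
handle_ms = 183366   # HandlePeekedMessage, total with children
try_peek_ms = 160651 # TryPeek, total with children
handle_fraction = handle_ms / (handle_ms + try_peek_ms)

print(f"RSB share of the window: {cpu_share:.0%}")
print(f"Peeks per second: {peeks_per_second:.0f}")
print(f"HandlePeekedMessage share: {handle_fraction:.0%}")
```

The peek rate works out to a little over 940 per second, which is consistent with the listener retrying without any pause.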

Michael Lyons

Nov 16, 2011, 3:00:11 AM
to Rhino Tools Dev
Sorry about that last message, for some reason it lost its formatting.

Corey Kaylor

Nov 16, 2011, 8:56:00 AM
to rhino-t...@googlegroups.com
Ok, I'll take a look when I get into the office. I may suggest changes to make and have you try them out. I have run into similar issues in the past with Rhino Queues being too eager in peeking messages.



Michael Lyons

Nov 16, 2011, 10:56:53 PM
to Rhino Tools Dev
Great news Corey, after a little bit of playing around I found what
seems to be a possible solution.

I reworked the code in MsmqLoadBalancer so that after a number of
failures to contact a worker it pauses the thread for a second and
resets the failure count back to zero. By doing so, the load
balancer's CPU usage dropped to around 7%.

It worked perfectly in the situation where a worker was busy and
another worker process was started, alleviating the queue backlog
without the load balancer trying to hog the system.

My code for the change to MsmqLoadBalancer.HandleStandardMessage can
be found here: http://pastebin.com/0PbC6ecB
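
For readers who cannot reach the pastebin, the idea described above can be sketched in Python roughly as follows. This is not the actual change to MsmqLoadBalancer: the function `forward_with_backoff`, the `try_send` callable, and the default threshold of 3 failures are all assumptions made for the illustration.

```python
import time


def forward_with_backoff(try_send, max_failures=3, pause_seconds=1.0,
                         sleep=time.sleep):
    """Retry try_send(); after max_failures failures in a row, pause and reset.

    try_send is a hypothetical callable that returns True once a worker
    accepts the message. Returns the number of pauses taken before success.
    """
    failures = 0
    pauses = 0
    while not try_send():
        failures += 1
        if failures >= max_failures:
            sleep(pause_seconds)  # give workers time instead of spinning
            failures = 0          # reset the failure count back to zero
            pauses += 1
    return pauses


# Simulated worker that is busy for the first 7 attempts, then accepts.
attempts = iter([False] * 7 + [True])
slept = []  # record requested pauses instead of really sleeping
pauses = forward_with_backoff(lambda: next(attempts), sleep=slept.append)
print(pauses, slept)
```

Passing `sleep` as a parameter keeps the sketch quick to try out; in real use the default `time.sleep` would provide the one second pause.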

Kaylor Mail

Nov 16, 2011, 11:18:08 PM
to rhino-t...@googlegroups.com
Can you send a pull request?

Kaylor Mail

Nov 16, 2011, 11:19:07 PM
to rhino-t...@googlegroups.com
Following the contribution guidelines of course :)

Michael Lyons

Nov 17, 2011, 2:30:31 AM
to Rhino Tools Dev
Sorry guys, didn't realise there was one until you told me. Maybe add
a link to it on the wiki.
Request has been sent.