mms cr5 performance under load


John

Feb 19, 2009, 10:17:21 AM
to mobicents-public
Hi,

[ please disregard if this appears as a repost. I'm assuming my first
attempt to post didn't stick because the forum site complained that my
session expired.]


I read in the 2/18 status meeting log that you are interested in a
public sharing of my load testing of mms cr5. I'm happy to do so.

My goal is to run 500 concurrent IVR-type calls. mobicents is
competing with an existing system that can process calls at that
rate. The existing system is difficult to extend and costly to
maintain. However, mobicents' advantages in these areas are not enough
by themselves for it to win.


My platform:

Java 1.6 (JRockit)
1 GB of heap

Intel Xeon 2.8 GHz
2 GB of RAM

Linux kernel 2.6.9, Red Hat distribution.

I use sipp as the traffic generator.

Call duration is about 60 seconds.

I measure performance by the number of concurrent users and cpu idle
time, using linux vmstat or top. I eyeball the cpu idle time during
periods of sustained media processing, disregarding cpu costs during
call establishment.

I am not measuring quality or jitter yet. I would like to, but only
after attaining maximum throughput through code changes.

In later tests, I have speex, gsm, and g729 dsp processing disabled
(a Processor.java edit). For the moment, I suspect these are sitting
on about 200 MB of memory which I cannot afford and am not using.



On my first run, using the announcement demo, 140 users could not be
attained without hitting 0% cpu idle time. I had avoided transcoding
by using alaw in the announcement wav file, and alaw on the rtp
channel. I had also used a call duration of 3 minutes to help isolate
costs of call management from media processing.

Bartosz, Amit or Oleg (or all three) advised me to modify the demo and
strike out PacketRelay. I also chose to avoid use of the audio file
servlet, using 'file:///' for the announcement file url. This new sbb
performed better, but still not good:
announcement only: 250 concurrent calls at 20% idle time
ivr: 130 calls at 20% idle time.





An engineer in my company made the point that inband dtmf detection is
expensive, and the yourkit profile confirmed it. Just for grins, I
modified mms to isolate the costs of the incoming media handling. I ran
three different tests:

1) the rtp message is only received and placed into the jitter buffer,
but not read from the jitter buffer.
230 concurrent calls at 70% idle.

2) the rtp message is read from the jitter buffer and processed, but
the dtmf detector is disabled in the demultiplexer. This result shows
that scheduling the receiver thread doesn't cost too much:
200 concurrent calls at 80% idle.

3) the dtmf detector is NOT disabled in the demultiplexer.
120 concurrent calls at 44% idle

So, it is clear that inband dtmf detection is a significant
bottleneck, and that is probably not a surprise to you.





Our application has a requirement for various kinds of inband
detection, not just keypad tones, so this result is not acceptable,
and I am pushing the system for better throughput. My first instinct
was to implement the dtmf detection natively, but while googling for
that, I found an alternative implementation of inband dtmf detection
on wikipedia. I transcribed it and replaced CR5's implementation.
Amit tells me it doesn't work, but it certainly is faster:

with inband dtmf detection from the wikipedia site:
200 users at 70% idle (remember, this dtmf implementation isn't
working).
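For context, inband dtmf detection of this kind is usually done with
Goertzel-style per-tone energy estimation. A rough sketch of the idea
(my own illustrative block size and test tone, no thresholding, and
not the code I actually transcribed, which evidently still has a bug):

import static java.lang.Math.PI;
import static java.lang.Math.sin;

public class GoertzelSketch {

    static final double SAMPLE_RATE = 8000.0;
    static final double[] DTMF_FREQS = {697, 770, 852, 941, 1209, 1336, 1477, 1633};

    // Squared magnitude of one target frequency over a block of 8 kHz PCM samples.
    static double energy(short[] samples, double freq) {
        double coeff = 2.0 * Math.cos(2.0 * PI * freq / SAMPLE_RATE);
        double s0, s1 = 0.0, s2 = 0.0;
        for (short sample : samples) {
            s0 = sample + coeff * s1 - s2;
            s2 = s1;
            s1 = s0;
        }
        return s1 * s1 + s2 * s2 - coeff * s1 * s2;
    }

    public static void main(String[] args) {
        // Synthesize one block of digit '5' (770 Hz + 1336 Hz) at 8 kHz.
        short[] block = new short[205]; // ~25 ms, a common Goertzel block size
        for (int n = 0; n < block.length; n++) {
            double t = n / SAMPLE_RATE;
            block[n] = (short) (4000 * sin(2 * PI * 770 * t)
                              + 4000 * sin(2 * PI * 1336 * t));
        }
        for (double f : DTMF_FREQS) {
            System.out.printf("%4.0f Hz -> %.3e%n", f, energy(block, f));
        }
    }
}

A detector reports a digit when exactly one low-group and one
high-group frequency dominate a block, so it only needs these eight
energies per block rather than a full spectrum, which is why it can
be cheap.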




I tried a few other things that didn't seem to have an impact or were
simply wrongheaded: 1) building jboss and mobicents slee and mms with
java 1.6 (for compiler help in runtime optimizations), 2) disabling
the BufferFactory (to avoid the permanent space compaction cycle).





Upcoming tests:
Amit and Oleg say that between CR5 and upcoming CR6 there are some
performance related fixes: to dtmf detection and buffer memory
management. I think I will try that next.


Implement use of NIO for datagram reception. This would be motivated
by this result:
200 calls, 15% user, 15% sys, 70% idle
250 calls, 40% user, 40% sys, 20% idle

I think there is a scaling problem here and am guessing it's due to
the os servicing blocking threads in receive-datagram system calls.
I'm willing to try out an implementation of RtpSocketImpl that uses
NIO for receiving, and will look at implementing that in the upcoming
days. You have probably already had some engineering discussion
around this decision. I'm interested to hear what you would expect
from that.
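Roughly what I have in mind, sketched with my own class and method
names (not existing mms code): a single selector thread servicing
every non-blocking RTP socket, in place of one blocked receiver
thread per socket.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

public class RtpSelectorSketch implements Runnable {

    private final Selector selector;

    public RtpSelectorSketch() throws IOException {
        selector = Selector.open();
    }

    // Called once per endpoint; the attachment stands in for that endpoint's jitter buffer.
    public void register(int port, Object jitterBuffer) throws IOException {
        DatagramChannel channel = DatagramChannel.open();
        channel.configureBlocking(false);
        channel.socket().bind(new InetSocketAddress(port)); // java 1.6: bind via the wrapped socket
        channel.register(selector, SelectionKey.OP_READ, jitterBuffer);
    }

    public void run() {
        ByteBuffer packet = ByteBuffer.allocateDirect(2048);
        try {
            while (!Thread.currentThread().isInterrupted()) {
                selector.select();
                for (SelectionKey key : selector.selectedKeys()) {
                    packet.clear();
                    ((DatagramChannel) key.channel()).receive(packet);
                    packet.flip();
                    // hand the packet to this endpoint's jitter buffer (omitted)
                    Object jitterBuffer = key.attachment();
                }
                selector.selectedKeys().clear();
            }
        } catch (IOException e) {
            // a real implementation would report and recover; omitted here
        }
    }
}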

Lastly, we think we can optimize the linux kernel (by reconfiguration)
for this application and will investigate.

Regards,
John

Amit Bhayani

Feb 20, 2009, 3:57:59 AM
to mobicent...@googlegroups.com
Hi John,

Thanks for such a comprehensive report. My comments in-line


The Speex and G729 codecs are definitely costly. For CR6 we are trying to fix this; look at issue http://code.google.com/p/mobicents/issues/detail?id=586
So if you are not using these codecs, it is better to remove them from jboss-service.xml too, and that extra memory will be saved.




On my first run, using the announcement demo, 140 users could not be
attained without hitting 0% cpu idle time.  I had avoided transcoding
by using  alaw in the announcement wav file, and alaw on the rtp
channel.  I had also used a call duration of 3 minutes to help isolate
costs of call management from media processing.

Bartosz, Amit or Oleg (or all three) advised me to modify the demo and
strike out PacketRelay.  I also chose to avoid use of the audio file
servlet, using 'file:///' for the announcement file url.  This new sbb
performed better, but still not good:
announcement only: 250 concurrent calls at 20% idle time
ivr: 130 calls at 20% idle time.





An engineer in my company made the point that inband dtmf detection is
expensive, and the yourkit profile confirmed it.  Just for grins, I
modified mms to isolate the costs of the incoming media handling.  I ran
three different tests:
Indeed. Inband calculations are very costly. However, with patch #580 (http://code.google.com/p/mobicents/issues/detail?id=580) a few calculations are modified, and profiling shows that it takes a little less CPU.


1) the rtp message is only received and placed into the jitter buffer,
but not read from the jitter buffer.
230 concurrent calls at 70% idle.

 2) the rtp message is read from the jitter buffer and processed, but
the dtmf detector is disabled in the demultiplexer.  This result shows
that scheduling the receiver thread doesn't cost too much:
200 concurrent calls at 80% idle.

 3) the dtmf detector is NOT disabled in the demultiplexer.
120 concurrent calls at 44% idle

So, it is clear that inband dtmf detection is a significant
bottleneck, and that is probably not a surprise to you.

Right. But unfortunately INBAND cannot be enabled/disabled on demand at runtime, as this causes unnecessary jitter and a bad user experience. So if you have enabled INBAND in jboss-service.xml, it will consume resources.








Our application has a requirement for various kinds of inband
detection, not just keypad tones, so this result is not acceptable,
and I am pushing the system for better throughput.  My first instinct
was to implement the dtmf detection natively, but while googling for
that, I found an alternative implementation of inband dtmf detection
on wikipedia.  I transcribed it and replaced CR5's implementation.
Amit tells me it doesn't work, but it certainly is faster:

with inband dtmf detection from the wikipedia site:
200 users at 70% idle (remember, this dtmf implementation isn't
working).

It would be a big win if we could make this work or modify the existing one to be lighter on CPU. Any input from you here is highly appreciated.





I tried a few other things that didn't seem to have an impact or were
simply wrongheaded: 1) building jboss and mobicents slee and mms with
java 1.6 (for compiler help in runtime optimizations), 2) disabling
the BufferFactory (to avoid the permanent space compaction cycle).

IMHO, disabling the BufferFactory may even reduce performance further, as Buffer objects will not be recycled and hence many objects get created and GC'd in a very short time. This will cause a bad user experience.







Upcoming tests:
Amit and Oleg say that between CR5 and upcoming CR6 there are some
performance related fixes: to dtmf detection and buffer memory
management.  I think I will try that next.


Implement use of NIO for datagram reception.  This would be motivated
by this result:
200 calls, 15% user, 15% sys, 70% idle
250 calls, 40% user, 40% sys, 20% idle

Absolutely. We have discussed NIO in the past and are planning to implement it in SVN trunk for MMS 2.x.y.


I think there is a scaling problem here and am guessing it's due to
the os servicing blocking threads in receive-datagram system calls.
I'm willing to try out an implementation of RtpSocketImpl that uses
NIO for receiving, and will look at implementing that in the upcoming
days.

Please let us know how you progress on implementing NIO for the socket. This is definitely an area of interest.

You have probably already had some engineering discussion
around this decision.  I'm interested to hear what you would expect
from that. 


Lastly, we think we can optimize the linux kernel (by reconfiguration)
for this application and will investigate.

Would love to hear about this part too; however, the goal is to keep MMS platform/OS independent ;)



John Franey

Feb 25, 2009, 5:42:44 PM
to mobicent...@googlegroups.com
On Fri, Feb 20, 2009 at 3:57 AM, Amit Bhayani <amit.b...@gmail.com> wrote:


Implement use of NIO for datagram reception.  This would be motivated
by this result:
200 calls, 15% user, 15% sys, 70% idle
250 calls, 40% user, 40% sys, 20% idle

Absolutely. We have discussed NIO in the past and are planning to implement it in SVN trunk for MMS 2.x.y.



I think there is a scaling problem here and am guessing it's due to
the os servicing blocking threads in receive-datagram system calls.
I'm willing to try out an implementation of RtpSocketImpl that uses
NIO for receiving, and will look at implementing that in the upcoming
days.

Please let us know how you progress on implementing NIO for the socket. This is definitely an area of interest.


Ok, I don't have this all cleaned up; there are some issues that I have not yet resolved.  However, I am very sure that the media server creates far too many threads.  See: http://www.javaspecialists.eu/archive/Issue149.html

I've been running a test with far fewer threads.

In the receiver, mms originally had two threads per endpoint: a udp receiver thread, which transfers media from the socket to a jitter buffer, and a receiver thread, scheduled to fire every packet interval, which sends media from the jitter buffer into the mms media paths.  Using NIO, I was able to drop that down.  In my test, there is one selector thread which transfers media from the ready socket into the right jitter buffer, and a few threads (probably too many), scheduled from a pool, which send media from the jitter buffer into the mms media paths.
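The reading side now looks roughly like this (a sketch only; the names are mine, not the actual mms classes): a small shared scheduled pool fires every packet interval for each endpoint, instead of one dedicated receiver thread per endpoint.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ReceiverPoolSketch {

    // Stand-in for whatever drains one endpoint's jitter buffer into the media path.
    public interface JitterBufferDrain {
        void pushOnePacketInterval();
    }

    // A handful of threads shared by every endpoint in the system.
    private final ScheduledExecutorService pool = Executors.newScheduledThreadPool(4);

    public void addEndpoint(final JitterBufferDrain drain) {
        // Fire every packet interval (20 ms here) instead of dedicating a thread.
        pool.scheduleAtFixedRate(new Runnable() {
            public void run() {
                drain.pushOnePacketInterval();
            }
        }, 0, 20, TimeUnit.MILLISECONDS);
    }
}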

In the transmitter, mms originally created three threads per endpoint: one scheduled every packet interval to get data from a source and send it, then one thread each for command and event processing.  I've reduced these down to a handful, system wide: a few in a scheduled pool for transmitting (probably too many), one for events and one for commands.

In the audio player, mms originally created one thread per player for command processing.  I reduced this down to one for the whole system.

Overall, for example, with 200 ivr endpoints, mms originally would create 1200 threads (six per endpoint).  The load test platform has only two cores!  My changes reduce the number of threads to 15 or so.

The system runs linux, which has a tool called vmstat.  vmstat reports the average count of runnable processes and threads (which are more or less indistinguishable to linux and vmstat) and the cpu utilization.

In a comparative test of 250 ivr calls, vmstat would report:

Original mms (cr5)
runq between 12-220, cpu: user: 53%, sys: 46%, idle: 1%

Now, vmstat reports:
runq between 1-16, cpu: user: 41%, sys: 21%, idle: 38%

This is preliminary.  I wanted to share this now because I have to take a break for a few days.  I don't know if such information would affect your milestone schedule.

I'm looking forward to hearing your thoughts on this.

One of my concerns is blocking on reads of the audio file.  Am I underestimating its impact?  (linux buffers file data in core memory, so in my load tests there is probably little disk access for an 8 MB file, but it's a different story if every announcement is a different file.)



Also, about nio: it has a good buffer class, ByteBuffer, which is 'native' to the nio api and is useful to the application as well.  This means adopting nio and its ByteBuffer can lead to less data copying.  I see that there is data copying internal to the demux, IIRC.  ByteBuffer has a 'read only' protection that might help eliminate the need for a copy there.
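For example (just to illustrate the ByteBuffer feature, not mms code):

import java.nio.ByteBuffer;
import java.nio.ReadOnlyBufferException;

public class ReadOnlyViewSketch {
    public static void main(String[] args) {
        ByteBuffer packet = ByteBuffer.wrap(new byte[] {1, 2, 3, 4});
        // The read-only view shares the same backing bytes, so no defensive copy
        // is needed before handing the packet to a consumer such as the demux.
        ByteBuffer view = packet.asReadOnlyBuffer();
        System.out.println(view.get());   // reading works
        try {
            view.put(0, (byte) 9);        // writing does not
        } catch (ReadOnlyBufferException expected) {
            System.out.println("view is read only");
        }
    }
}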

Regards,
John








Oleg Kulikov

Feb 25, 2009, 11:07:37 PM
to mobicent...@googlegroups.com
Hi John,
 
Thank you for the update. We were trying to use a pool but without any success. Yes, I agree about the performance, but do you receive a correct media stream? The thread pool causes:
1) out-of-sequence packets
2) out-of-order commands
3) media jitter
 
Regards,
Oleg
2009/2/26 John Franey <jjfr...@gmail.com>

Vladimir Ralev

Feb 25, 2009, 11:36:34 PM
to mobicent...@googlegroups.com
Sounds like a great improvement. Also, it is worth checking these results on both linux and windows, because they behave differently with many/idle threads.

Oleg Kulikov

Feb 25, 2009, 11:39:15 PM
to mobicent...@googlegroups.com
I would be interested in comparing audio quality as well. It is not enough to see CPU and memory usage figures without checking the media.


 
2009/2/26 Vladimir Ralev <vladimi...@gmail.com>

Amit Bhayani

Feb 26, 2009, 5:55:20 AM
to mobicent...@googlegroups.com
John,

As Oleg said, while doing the load test the quality of the announcement is also important. What I would suggest is that while you are doing the load test, you also connect to the same application using a real UA and listen for quality; this will give a fair idea of the jitter at heavy load.

Also would it be possible for you to share your changes for NIO before you go for a break?

Thanks

John Franey

Feb 26, 2009, 8:26:37 AM
to mobicent...@googlegroups.com
Amit,


On Thu, Feb 26, 2009 at 5:55 AM, Amit Bhayani <amit.b...@gmail.com> wrote:
John,

As Oleg said, while doing the load test the quality of the announcement is also important. What I would suggest is that while you are doing the load test, you also connect to the same application using a real UA and listen for quality; this will give a fair idea of the jitter at heavy load.

Yes, these results are preliminary.  I agree it is not worth mentioning cpu/mem usage without also comparing quality.  I haven't measured quality at all.  Would jitter analysis in wireshark suffice?  I guess maybe wireshark won't detect out-of-sequence media packets?

 

Also would it be possible for you to share your changes for NIO before you go for a break?

I'll try.

I have a bug in the code somewhere; it's not stable.  I discovered this after I originated this post (in too much haste, my apologies).  After a while it starts to lose sockets.  It needs investigation, and I'm sort of reluctant to burden you with that.


John Franey

Feb 26, 2009, 8:38:27 AM
to mobicent...@googlegroups.com

Oleg,

I agree.  It isn't helpful to give good performance results without reporting quality.  I intend to report media quality as well.  It doesn't make much sense for me to measure quality when the throughput isn't there.  Now that I'm seeing higher throughput, I'll turn to quality, as my upcoming schedule allows.


On Wed, Feb 25, 2009 at 11:07 PM, Oleg Kulikov <oleg.k...@gmail.com> wrote:
Hi John,
 
Thank you for the update. We were trying to use a pool but without any success. Yes, I agree about the performance, but do you receive a correct media stream? The thread pool causes:
1) out-of-sequence packets
2) out-of-order commands

Regarding out-of-order commands: that implies more than one thread was in the pool for commands.  Perhaps this pool could be of size 1 for the entire system, instead of 1 per endpoint.  Would one queue/thread handling a large number of commands for all endpoints be more performant than 250 queues/threads, each with a low count of commands?
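In other words, something like this (a sketch with my own names): a single-threaded executor shared by all endpoints, which keeps commands in submission order system-wide.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CommandQueueSketch {

    // One queue and one worker thread for every endpoint in the system.
    // A single-threaded executor runs tasks one at a time, in submission order,
    // so commands cannot be reordered the way they can in a multi-thread pool.
    private final ExecutorService commandExecutor = Executors.newSingleThreadExecutor();

    public void submit(Runnable command) {
        commandExecutor.execute(command);
    }
}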

 

Oleg Kulikov

Feb 26, 2009, 11:15:15 AM
to mobicent...@googlegroups.com
Hi John,

You may be correct here. I agree.

2009/2/26 John Franey <jjfr...@gmail.com>