Juggernaut performance/load testing thread

heisee

Apr 25, 2008, 4:07:27 AM
to Juggernaut for Rails
Hi there,

I thought it would be a good idea to start a thread dedicated to the current
performance and load concerns. There are several people working on
this topic and the results are quite scattered around in different
threads. Maybe we could use this thread for that?

First, this one:
Whenever you test with a lot of clients, you should verify that you
didn't configure a subscription_url in juggernaut.yml.
If you did, at least check that the rails app behind the url is running in
production env, because that's probably the limiting factor then
(every client login results in a call to this url).

With this load test client
==========================
require "socket"

class JugClient

  def run
    puts "start"
    threads = []
    1000.times do |a|
      threads << Thread.new(a) do |n|
        socket = TCPSocket.new("localhost", 5000)
        puts "start of thread #{n}"
        # subscribe request, terminated by a null byte
        s = %({ "command": "subscribe", "session_id": "abcef", ) +
            %("client_id": #{n}, "channels": ["loadtest"] }\0)
        # write instead of send: the 2nd argument of send is flags, not a length
        b = socket.write(s)
        puts "connecting client nr. #{n} and wrote #{b} bytes"
        socket.flush
        loop do
          s = socket.recv(1000).to_s
          puts "client nr. #{n} says #{s}" unless s.empty?
          # break instead of return (return raises an error inside a thread
          # block); an empty read means the server closed the connection
          break if s.empty? || s =~ /close/
        end
        socket.close
      end
    end
    threads.each { |t| t.join }
    puts "end"
  end
end

JugClient.new.run
=================

I'm able to establish 1000 client connections in one ruby process. On
my machine (updated Ubuntu 8.04, amd64), the limit is at 1024.
Starting several other ruby processes with an additional 1000 clients each is
no problem without a subscription url. I stopped testing with 5
parallel client processes, which resulted in 5000 client connections.
But when I configured a subscription url and put my rails app behind it (in
production mode of course :-)), the first client process with 1000 connections
connected as it should, but the second one stopped at around 500
connections and hung with 100% cpu. I reproduced this several times.
Any ideas? A multi-threading problem with eventmachine and
normal outgoing HTTP requests?

Connecting 1000 clients (without subscription url) takes about 10
seconds, and pushing 1000 messages takes around 2 seconds (including debug
output).

For me, the other main problem with using Juggernaut in high-traffic
production is that the sockets of disconnected clients still get data
pushed.
To see this, simply
* connect client 1,
* disconnect it (you'll see "lost client 1" in the log file),
* connect client 2,
* let juggernaut push something and
* in the log you can still see that it gets pushed to client 1 and
client 2

I hope we can push these limits and thanks for your help,
Heiko Seebach

Alex MacCaw

Apr 25, 2008, 4:29:59 AM
to Juggernaut...@googlegroups.com
Francis, the author of eventmachine, says that EM should be able to scale way beyond the 1000 connection limit - basically the limiting factor will be the server's network connection.
So there must be some sort of memory leak or performance issue with Juggernaut.
--
http://www.eribium.org | http://juggernaut.rubyforge.org | http://www.aireohq.com | Skype: oldmanorhouse

koko

Apr 25, 2008, 7:03:09 AM
to Juggernaut for Rails
It's possible, and yet it's worthwhile finding an example of such
behavior in EM. (I managed to push Juggernaut to 2k on ubuntu, but I
want to get 60k as a result.)


heisee

Apr 25, 2008, 7:05:12 AM
to Juggernaut for Rails
Hi koko,
> I want to get 60k as a result
hehe, me too :-)

I disabled debug logging in juggernaut, enabled epoll using "--fdsize
60000" and got some numbers:

I connected 25000 clients (in 25 different jug_client processes)
without any problems. Pushing a string to all clients also worked - the
clients were running on the same machine as the server.

In bash on Linux you could use e.g.
for i in `seq 1 5`; do ruby jug_client.rb > $i.txt & done
for 5000 clients in 5 ruby processes. Creating 25 ruby processes at
once didn't work because of connection timeouts. But you can create
the next 5 processes with
for i in `seq 6 10`; do ruby jug_client.rb > $i.txt & done
and so on. But you should wait until the first 5000 connections are
established. On my ubuntu this took about 15 seconds.

Or you simply use
for i in `seq 1 25`; do ruby jug_client.rb > $i.txt & sleep 5; done
to start all 25 client processes one after the other, with 5 seconds in between.

For about 5 seconds you can see how the juggernaut server itself is
working at 100% CPU (probably iterating over the list of clients, which are
all subscribed to the channel "loadtest") and after that, the clients
all start printing out the "client nr. xyz says ..." lines for about
10 seconds.

Pushing data to all connected clients can be done easily if you open
the rails console and do a
Juggernaut.send_to_channel("push me","loadtest")

You can disconnect all clients with
Juggernaut.send_to_channel("close","loadtest")


I can still see a barrier at around 30300 clients. Here, the client or
server hangs (but without cpu load; in the post above I said it hung
with 100% cpu, that was wrong).
The reason could be that my local ports are all used up by the clients.
lsof shows me that ruby assigns them from around port number 30000 on
up to 64K.
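
A quick way to check the "local ports are used up" theory: every connection from the same client IP to the same server IP and port needs its own local port, so the size of the local port range caps the number of connections per client machine. The sketch below only does that arithmetic. The helper name is made up for illustration, the file path is the usual Linux location, and the fallback numbers are an example input, not a measurement from this machine.

```ruby
# Count the local (ephemeral) ports available for outgoing connections.
# range_text has the format of /proc/sys/net/ipv4/ip_local_port_range
# on Linux: lowest and highest port, separated by whitespace.
# (ephemeral_port_count is a hypothetical helper name.)
def ephemeral_port_count(range_text)
  low, high = range_text.split.map { |v| Integer(v) }
  high - low + 1
end

RANGE_FILE = "/proc/sys/net/ipv4/ip_local_port_range"

if File.readable?(RANGE_FILE)
  puts "local ports available: #{ephemeral_port_count(File.read(RANGE_FILE))}"
else
  # Example input only, not a value taken from the test machine.
  puts "example range 30000-65535: #{ephemeral_port_count("30000 65535")} ports"
end
```

If the number printed is close to the point where the clients stall, running the additional clients from a second machine (or a second local IP) would be one way to confirm it.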

BUT WHEN I define the subscription url, restart the server and try to
do the same, I have problems even with around 1000 clients.

Any idea why the subscription call could be the problem here?

thanks,
Heiko Seebach

Shrek

Apr 25, 2008, 7:39:07 PM
to Juggernaut for Rails


wow! you guys have pushed the benchmark to a decent level :). let me
get my EC2 instance tonight and become part of the elite 10K+
connections extractor group. thanks guys.

oh, btw i never used 'subscription url' in my tests so far.
question of the day is: what happens in your test if you comment out
your authentication code in the 'subscription url' callback function and
simply return http code 200? this way the load test will be purely Jug
related.

I'LL BE BACK WITH MY results soon.

koko

Apr 26, 2008, 1:43:29 PM
to Juggernaut for Rails
waiting with patience for these...

heisee

Apr 26, 2008, 6:15:51 PM
to Juggernaut for Rails
Hi,

> question of the day is, what happens in your test if you comment out
> your authentication code in 'subscription url' callback function and
> simply return http code 200. this way load test will be purely Jug
> related.

as I already wrote: when you don't configure a subscription url at
all, no http call is made, so it IS ALREADY purely jug related. I could
connect 30000 clients without this call, but only about 1500 with
it.

Heiko

Srikanth Pagadala

Apr 26, 2008, 6:24:48 PM
to Juggernaut...@googlegroups.com
 
Heiko
 
30000 is really brilliant. :)
 
would u please share your OS (kernel version etc) + any server config changes that you had to do?
I'm at a stage where i can switch my EC2 os etc easily. then i can start my testing. i'm so excited, ha ha
 
Thanks
Shrek

heisee

Apr 27, 2008, 5:15:10 AM
to Juggernaut for Rails
Hi guys,

> would u please share your OS (kernel version etc) + any server config
> changes that you had to do?

it's Ubuntu 8.04 (Kernel 2.6.24) amd64, 4GB RAM, but you should get
the same results on any 2.6.x Kernel. 30K clients isn't that much,
take a look at http://en.wikipedia.org/wiki/Razorback2 :
"Razorback2 was a server (195.245.244.243) of the eDonkey network,
known for being able to handle 1 million users simultaneously, meaning
that they had capacity for 1.3 million users and were indexing around
170 million files." As far as I know, these were 1 million
standing connections on one host, but maybe this was just the ip of a
load balancer. But there's still a long way to go for us! (I don't
know if these numbers are possible with eventmachine)

Depending on your Linux distribution and configuration, you
might have to edit /etc/security/limits.conf and
add the following two lines one line above the "# End of File":
* hard nofile 65535
* soft nofile 65535
(including the stars!)
On some distributions (e.g. Ubuntu) you also have to add the following
line:
session required pam_limits.so
to the file /etc/pam.d/common-session .
Then reboot...
This step is only necessary if the value printed after "New descriptor-table
size is..." is still lower than what was passed as parameter.
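
To see whether the new limits are actually active in the session you start ruby from, you can ask the process itself. This is just a small check using Ruby's standard Process.getrlimit; the 65535 is the value from the limits.conf lines above, everything else is illustrative.

```ruby
# Show the open-file limits of the current process.
# Process.getrlimit returns [soft_limit, hard_limit].
soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
puts "open files: soft limit #{soft}, hard limit #{hard}"

# 65535 is the value from the limits.conf lines; a lower soft limit means
# the new settings are not active for this shell/session (yet).
wanted = 65_535
puts "soft limit is below #{wanted}" if soft < wanted
```

Run it from the same shell you use to start the jug server or the test clients, because the limit is inherited from the parent process.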

good luck,
Heiko

heisee

Apr 27, 2008, 5:57:34 AM
to Juggernaut for Rails
Oh, and very important: don't forget to use the --fdsize option and
use the newest trunk (not the 0.5.4 gem). Maybe someone could build a
0.5.5 gem from the trunk, Alex?

thanks, Heiko

Srikanth Pagadala

Apr 27, 2008, 6:00:39 AM
to Juggernaut...@googlegroups.com
Thanks Heiko. I'm on it as i write this to you. EC2 project kinna took longer than i had expected. Still finishing few things on EC2....

koko

Apr 28, 2008, 12:48:01 PM
to Juggernaut for Rails
wow, I somehow missed this post
30k is really great(!)

I have no idea about the subscription_url parameter but,.. hear my
thoughts:

why is it needed?

if we assign each channel a really long random name
'SDKFASODFKAOFKOSAVM<ASDO#@(KERW)EFKV(SADVMCKODSMWEDPOW'
we can avoid any kind of authentication because the statistical
chance of hitting:
1. any channel
2. a channel one wishes to eavesdrop on
is virtually 0
and so the name itself represents both the name and the key.

just an idea...
what did I miss?
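
For what it's worth, generating such a name is a one-liner with Ruby's SecureRandom. A minimal sketch, where the helper name and the length of 32 bytes are my own choices for illustration and not anything Juggernaut prescribes:

```ruby
require "securerandom"

# Hypothetical helper: 32 random bytes written as 64 hex characters,
# so guessing an existing channel name is not practical.
def random_channel_name
  SecureRandom.hex(32)
end

puts random_channel_name
```

Hex output also avoids characters like < or # that might need escaping in a url or in the JSON subscribe request.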

heisee

Apr 28, 2008, 7:24:46 PM
to Juggernaut for Rails
hmm, sounds good. it would be some kind of session_id.
Some thoughts:
The jug server has to store this channel name, right? How does it get
it, and when does it expire? The rails server has to give this
channel name to the jug server BEFORE the client connects, right?
Another possibility would be that the real channel name is
cryptographically encoded inside this string. Then the rails server could
give it to the browser, which could send it to the jug server, which
decrypts it to get the real channel name (using a shared secret between
rails server and jug server)
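
A rough sketch of the shared-secret idea, under some loud assumptions: it signs the channel name with an HMAC instead of encrypting it (so the name stays readable, but cannot be forged without the secret), all method names and the secret are made up, and nothing like this exists in Juggernaut as far as this thread shows. It also does not answer the expiry question.

```ruby
require "openssl"

# Assumption: both the rails server and the jug server know this secret.
SHARED_SECRET = "change-me"

def signature(channel)
  OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new("SHA256"), SHARED_SECRET, channel)
end

# Rails side: hand this token to the browser instead of the bare channel name.
def channel_token(channel)
  "#{channel}.#{signature(channel)}"
end

# Jug side: return the channel name if the signature matches, else nil.
def channel_from_token(token)
  channel, _, sig = token.to_s.rpartition(".")
  expected = signature(channel)
  return nil unless sig.bytesize == expected.bytesize
  # compare all bytes instead of stopping at the first difference
  diff = 0
  sig.bytes.zip(expected.bytes) { |a, b| diff |= a ^ b }
  diff.zero? ? channel : nil
end

token = channel_token("loadtest")
puts channel_from_token(token).inspect         # valid token
puts channel_from_token(token + "0").inspect   # tampered token
```

The jug server would not need to store anything per channel this way, since it can verify each token on its own when the client subscribes.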

In the rails server, people might use the subscription url to
initialize some stuff (not when the page is served, but when flash
connects, which is always some time later), but this could be alright
in most cases, especially when using :store_messages

So, I think that this could work.
But one thing that is really helpful, at least for me, is the call to
the logout_url, which is necessary for telling whether users load a
page in a new browser tab or in the same browser tab (in the first
case, they have 2 chat windows open!)

I didn't test the case "no subscription_url, but a logout_url"
with so many clients, but I think that the jug server will have the same
problems here.

So I looked again at the source code and the solution is oh so simple:
  begin
    open(url.to_s, "User-Agent" => "Ruby/#{RUBY_VERSION}")
  rescue => e
    return false
  end
the url gets opened, but not closed ... THAT'S IT :-)

If you take a look at open_uri.rb, you can see:
  def OpenURI.open_uri(name, *rest)
    ...
    if block_given?
      begin
        yield io
      ensure
        io.close
      end
    else
      io
    end
  end
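
The difference between the two branches is easy to reproduce without any network. The following is only a stand-in that mimics the pattern quoted above: open_like is a made-up name, and a StringIO plays the role of the connection.

```ruby
require "stringio"

# Hypothetical stand-in for OpenURI.open_uri; a StringIO plays the
# role of the network connection so the example runs offline.
def open_like(text)
  io = StringIO.new(text)
  return io unless block_given?   # the caller is responsible for closing

  begin
    yield io
  ensure
    io.close                      # closed even if the block raises
  end
end

leaked = open_like("response body")
puts leaked.closed?               # no block given: the io is still open

seen = nil
open_like("response body") { |io| seen = io; io.read }
puts seen.closed?                 # block given: the io has been closed
```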

And there was no block given, so the io didn't close. But even if I
gave a block, it didn't close: with "netstat" I could still see all
the connections. Well, they were not OPEN, but in TIME_WAIT
state. There were about 1000 sockets in this state, and ruby
doesn't allocate more than about 1000 sockets.
If you look at this:
http://www.softlab.ntua.gr/facilities/documentation/unix/unix-socket-faq/unix-socket-faq-2.html#time_wait
you can see that this is normal from the point of view of the TCP/IP stack...
What do you think, should we play around with the SO_LINGER option to
make these sockets leave the TIME_WAIT state faster?
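
In case someone wants to experiment: setting SO_LINGER from ruby could look roughly like this. It is a sketch on a throwaway loopback connection, not a patch for Juggernaut or open-uri. With linger switched on and a timeout of 0, close resets the connection instead of doing the normal shutdown, so the closing side should not end up in TIME_WAIT, but any unsent data is thrown away.

```ruby
require "socket"

# Throwaway listener on a free loopback port so the example is self-contained.
server = TCPServer.new("127.0.0.1", 0)
client = TCPSocket.new("127.0.0.1", server.addr[1])
peer   = server.accept

# linger on, timeout 0: close() resets the connection instead of going
# through the normal shutdown. Unsent data is discarded, so this is
# something to try on test connections only.
client.setsockopt(Socket::Option.linger(true, 0))

onoff, seconds = client.getsockopt(Socket::SOL_SOCKET, Socket::SO_LINGER).linger
puts "linger enabled: #{onoff}, timeout: #{seconds}s"

client.close
peer.close
server.close
```

Whether this is a good idea for the subscription call is a different question, since the reset can cut off a response that has not been read completely.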

thanks, Heiko

heisee

Apr 29, 2008, 5:44:33 AM
to Juggernaut for Rails
Hi again, I did some more tests and have to correct some of the above
statements:
This test program
===============
require "open-uri"

class GetTest

  def run
    puts "start"
    threads = []
    # 300 threads with 300 requests each
    300.times do |a|
      threads << Thread.new(a) do
        for i in (1..300) do
          puts i
          open("http://localhost:3000/jug_server/login?client_id=#{i}&session_id=abcef&channels[]=loadtest")
        end
      end
    end
    threads.each { |t| t.join }
    puts "end"
  end
end

GetTest.new.run
===============
This program works fine and does 90000 requests. "netstat" shows that
(on my machine) there are always around 3000 sockets in TIME_WAIT
state.
So the open(...)-method is fine.
Maybe some kind of multi-threading problem?
Any more ideas?

thanks, Heiko

Shrek

May 1, 2008, 3:53:45 AM
to Juggernaut for Rails
Heiko

I still have 0.5.4 gem. Is '--fdsize' option part of 0.5.5 gem (or
Trunk)?
if yes, what is the recommended value for the option?

Alex

As Heiko requested, when can you release 0.5.5 gem?
Or, can you give me some instructions to make gem from the trunk?

Heiko, Alex, Koko

Finally i got my EC2 setup going. From my initial load testing
(without subscription urls and without any of the ubuntu os config changes), i
noticed a considerable delay in 'events-reception' when running the
'test-program' on a remote machine rather than on the Jug server box itself.
Sometimes the delays were on the order of a few mins (for approx 500
bots/clients). And when i ran the same program on the Jug server box, there
was no delay at all.

As i said, nothing is conclusive yet. this is just my preliminary
observation.
Awaiting Heiko's recommendation and Alex's 0.5.5 gem.

Sample output from my Test framework:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++
...
...
...

5. [QUIT] Bye
2
[Thu May 01 00:06:59 PDT 2008]
722 Bots have received 0 events.
[FE @: null, LE @: null]
1 Bots have become inactive ...

====== Welcome to Jug Load Tester (Java version) ========

1. [CREATE_BOTS] Add bots (in chunks of as configured in props file)
2. [SHOW_REPORT] Display # of alive bots and # of events received
3. [CLEAR_EVENTS_DATA] Will reset events information in the data
collector for a fresh test
4. [BROADCAST] Send an event to all bot channels (works iff broadcast
is authorized from this IP)
5. [QUIT] Bye
2

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++
Thanks
Srikanth

Alex MacCaw

May 1, 2008, 4:16:11 AM
to Juggernaut...@googlegroups.com
You can make the gem with 'rake gem'.
I've got a few other patches for the gem - I'll try and get it out this weekend.
Keep up the good work!

Thanks
Alex
--

Shrek

May 1, 2008, 2:23:53 PM
to Juggernaut for Rails

Baby steps to build gem from the trunk (thanks Alex)

1. mkdir tmp
2. cd tmp
3. svn co http://juggernaut.rubyforge.org/svn/trunk/gem
4. cd gem
5. vi lib/juggernaut.rb, change the constant VERSION number. This will
become the gem's version number.
6. rake gem, you should see a success build message, and the gem is built
in the 'pkg' folder.
7. cd pkg
8. gem install juggernaut-x.gem (where x is the version you specified
in step 5)
9. gem list (to verify that your gem made it to the gem repo)
10. gem clean (to clean up the old jug gem)
11. rm -rf tmp (from the directory where you ran step 1)

Happy juggling.

Srikanth Pagadala

May 1, 2008, 11:21:25 PM
to maccman, heiko....@googlemail.com, d...@dorkalev.com, Juggernaut...@googlegroups.com

sorry i didn't know any easy way to publish to the google group with attachments. can someone do it for me? Pastie doesn't support attachments, does it?

ok, so i conducted my first load test (with the latest gem and all of heiko's ubuntu configs). i'll jot down some points here, but most of the story is in the attachments.

jug is running on ec2 (ubuntu 8.04 etc etc).
created 3500 bots on xp box (remote machine)
simultaneously started creating 3500 bots on Ubuntu (jug box).

all 3500 xp bots survived the creation phase
only 1016 ubuntu bots made it.
Sockets on dead bots somehow closed (either from server side or bot side). i don't have reconnect logic in my bots (should i?)

Ok rest of the story is better told by the attached pictures.

I'll fine tune my testing framework tonight. do you guys have any more suggestions for things to build into the framework (like reconnect)? or any test scenarios to be tested?
 
 
happy juggling again
ubuntu-bots-report.GIF
xp-bots-report.GIF

Srikanth Pagadala

May 1, 2008, 11:30:30 PM
to Juggernaut...@googlegroups.com
o wow, all my attachments made it to the group. thanks google.
 
one more last-minute news flash:

After an hour or so, all 3500 XP bots were still alive and again
all 3500/3500 events made it home safely.

this time, Jug took about 10 mins to deliver all 3500 events.

Srikanth Pagadala

May 2, 2008, 5:30:16 AM
to Juggernaut...@googlegroups.com
 
here is one more test result.
 
i called it "Juggernaut 0.5.5 baseline performance numbers"; in this test 12K bots are created on the Jug box itself.
 
some of the conclusions of the test are these:
  1. Jug can handle many concurrent connections (~30K, heiko has shown it, so i stopped myself at 12K)
  2. it is very reliable (when bots are on the same box). all 24K events made it home safely.
  3. under this kind of load, Jug's average turn around time for 1000 events is 5.82 mins
  4. for more details take a look at the attachment.
Note: I (we) haven't looked at memory usage (leaks), cpu utilization etc. someone?

coming soon: case 2, 12K bots created on remote machine (xp).

shrek

Juggernaut 0.5.5 baseline performance numbers - part I.xls

weepy

May 2, 2008, 8:48:33 AM
to Juggernaut for Rails

1000 in 5mins=300s means that the average turnaround for 1 event is
about 0.3s - can this really be true ?
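
Spelled out with the reported 5.82 mins instead of the rounded 5 mins (the helper name is made up, the numbers are the ones from the post above):

```ruby
# Average time per event, given the time reported for a whole batch.
# (seconds_per_event is a hypothetical helper name.)
def seconds_per_event(minutes_per_batch, batch_size)
  minutes_per_batch * 60.0 / batch_size
end

per_event = seconds_per_event(5.82, 1000)
puts format("%.2f min per 1000 events = %.3f s per event", 5.82, per_event)
```

So the reported figure works out to roughly 0.35 s per event on average, assuming the 5.82 mins really is the total time for the 1000 events.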



heisee

May 2, 2008, 9:27:45 AM
to Juggernaut for Rails
Hi,

>only 1016 ubuntu bots made it.
As I said above, there's some kind of limitation in ruby that lets
you have only 1024 sockets open at once within one process.
But your test client is written in Java. I don't know if there's a "max
socket/open files limit" set there, but maybe it's just defined by the
bash process. You should be able to increase it using
ulimit -n 65530
then try again.

>dont have reconnect logic in my bots (should i)?
no, juggernaut tests should work without that

rock on,
Heiko

Srikanth Pagadala

May 2, 2008, 2:09:40 PM
to Juggernaut...@googlegroups.com
 
yes, my conclusions are not definitive. it is kinda hard to draw a clear conclusion at this point.
that's one of the reasons why i kept the language of my conclusion very vague.
eg: "Under this kind of load, average time is 5min". i didn't (couldn't) define what that "kind of load" means.

but the data collected during the test is as accurate as possible.
that's why i sent the entire data out, so that the community can look at it, discuss it and then we can draw some agreeable conclusions.

for eg: look at "process 1" [in both datasets]: it totally assassinated the average turnaround time, by a factor of 9.
i think "process 1" is fast because at this point Jug is not really loaded. but as more and more processes start publishing, Jug is really getting smacked down hard. Pay attention to the overlapping "publish times". so as the 4th, 5th, 6th processes are pushing, Jug is kinda simultaneously managing 4000-6000 push events. I don't know if this makes any difference, but we are talking about 4000-6000 different channels here. Every bot has its own unique channel (as opposed to all 1000 bots listening to only one channel). Alex, Heiko, comments? [i haven't looked into the Jug code]

long story short, what i'm trying to say here is: "LOAD" here is much much more than 1000 events. Jug is actually handling something like 6000 events when I'm taking numbers for 1000 events. see my point?

so the test data is out there. if we have any statistics guru among us, maybe he can help us ....

or maybe i should load jug in some other systematic manner. maybe my approach is entirely flawed. suggestions? comments?

Srikanth Pagadala

May 2, 2008, 2:13:50 PM
to Juggernaut...@googlegroups.com
thanks heiko.
 
yes, i figured that out in two cycles.
thats why in my "baseline performance numbers" experiment, i have 12 processes instead of 1 BIG process. :)
 
thanks once again. Your UBUNTU configs did wonders.

heisee

May 2, 2008, 8:30:00 PM
to Juggernaut for Rails
To the "LOAD" topic:

I think what is interesting here is whether the jug server process is
taking a lot of cpu at the moment when it publishes events to the
clients.
There are thousands of clients connected, so there is also at least
one array with thousands of client objects (i mean the one in
client.rb) that hold open socket connections and channel names.
if you send a command to broadcast, it does e.g. this:
find {|client| client.id == id }.first
in line 38 of client.rb
and I think this can become really slow.
the other method that i think may become slow when many clients are
connected is
Juggernaut::Client.find_all.each {|client|
client.send_message(msg, channels) } #server.rb:306
which in the end results in
@channels.include?(channel) #server.rb: 182

I think that using some hashes here instead of arrays could improve
performance (if the jug-server cpu is under heavy load in your tests)
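
To give a feeling for the difference, here is a small stand-alone comparison. It does not use Juggernaut's classes: the Client struct, the 20000 clients and the 200 lookups are all assumptions made up for the illustration.

```ruby
require "benchmark"

# Hypothetical stand-in for the client objects; not Juggernaut's class.
Client = Struct.new(:id, :channels)

clients = Array.new(20_000) { |i| Client.new(i, ["loadtest"]) }

# Build the index once (or keep it up to date on connect/disconnect).
clients_by_id = {}
clients.each { |c| clients_by_id[c.id] = c }

wanted  = 19_999   # worst case for the scan: the last client in the array
lookups = 200

array_time = Benchmark.realtime do
  lookups.times { clients.find { |c| c.id == wanted } }   # scans the array
end
hash_time = Benchmark.realtime do
  lookups.times { clients_by_id[wanted] }                 # direct lookup
end

puts format("array scan: %.4fs, hash lookup: %.6fs", array_time, hash_time)
```

The same idea would apply to the channel check: a hash from channel name to its subscribers avoids asking every client whether it is on the channel. Whether this is the real bottleneck would still have to be confirmed by profiling the jug server.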

happy juggin',
Heiko

Srikanth Pagadala

May 3, 2008, 7:55:14 PM
to Juggernaut...@googlegroups.com
 
 
here are the results of "load testing jug" with remote bots.
 
a few things to note:
  1. the numbers are more or less similar to the numbers of the part-I test (which involved only local bots).
  2. so it looks like remote or local bots really don't make much difference (which should be the case, and is very good news).
  3. all further tests will involve only remote bots; local bots will be used only for the purpose of creating load on Jug.
  4. new tests will be more realistic, meaning they will be as close as possible to real life scenarios. Stay Tuned.
Does someone else wanna get involved? I can use some help. we can run many more realistic scenarios.
We will have to repeat our tests once Alex (other authors?) finds bottlenecks and improves the code, to verify that the numbers are improving.


Numbers from today's test are:

Under this kind of load, Jug's average turn around time for 1000 events is: 7.13 mins
Only remote bots: 7.16 mins
Only xp remote bots: 7.54 mins
Only ubuntu remote bots: 6.87 mins
Only ubuntu local bots: 7.08 mins
Juggernaut_0.5.5_baseline_performance_numbers_-_part_II.xls