File descriptor problem at 5k TPS on RHEL


Bob Nilsen

May 2, 2013, 12:48:06 PM
to iago-...@googlegroups.com
I'm getting an exception when trying to run at 5k TPS.  I recognize that this is likely an OS limitation... but I thought I'd report it here, since the goal is to be able to do 10k TPS with one box.  Maybe you guys know how to overcome this problem?

System Details:
Red Hat Enterprise Linux Server release 5.6 (Tikanga)
HP Blade BL460 G6 Dual Quad Core 48 GB RAM

[11:19:57 sg212844] $ java -jar iago-0.5.1-SNAPSHOT.jar -f config/5000_bigip_reuse.scala
Configs generated, are you ready to do some damage? [yes]
sh scripts/local-parrot.sh
initialized parrot
java.lang.InternalError: errno: 24 error: Unable to open directory /proc/self/fd

        at com.sun.management.UnixOperatingSystem.getOpenFileDescriptorCount(Native Method)
        at com.twitter.ostrich.stats.StatsCollection.fillInJvmGauges(StatsCollection.scala:72)
        at com.twitter.ostrich.stats.StatsCollection.getGauges(StatsCollection.scala:205)
        at com.twitter.ostrich.stats.StatsCollection.getGauges(StatsCollection.scala:30)
        at com.twitter.ostrich.stats.StatsProvider$class.get(StatsProvider.scala:184)
        at com.twitter.ostrich.stats.StatsCollection.get(StatsCollection.scala:30)
        at com.twitter.ostrich.admin.CommandHandler$$anonfun$handleCommand$6.apply(CommandHandler.scala:113)
        at com.twitter.ostrich.admin.CommandHandler$$anonfun$handleCommand$6.apply(CommandHandler.scala:113)
        at scala.Option.getOrElse(Option.scala:108)
        at com.twitter.ostrich.admin.CommandHandler.handleCommand(CommandHandler.scala:111)
        at com.twitter.ostrich.admin.CommandHandler.apply(CommandHandler.scala:65)
        at com.twitter.ostrich.admin.CommandRequestHandler.handle(AdminHttpService.scala:296)
        at com.twitter.ostrich.admin.CgiRequestHandler.handle(AdminHttpService.scala:154)
        at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:65)
        at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:65)
        at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:68)
        at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:554)
        at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:65)
        at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:526)
        at sun.net.httpserver.ServerImpl$DefaultExecutor.execute(ServerImpl.java:117)
        at sun.net.httpserver.ServerImpl$Dispatcher.handle(ServerImpl.java:347)
        at sun.net.httpserver.ServerImpl$Dispatcher.run(ServerImpl.java:319)
        at java.lang.Thread.run(Thread.java:662)
shutting down client
shut down parrot
done.


ostrich shows:

counters:
  400: 104363
  client/connects: 105480
  client/failures/com.twitter.finagle.WriteException: 627
  client/failures/org.jboss.netty.channel.ChannelException: 7053
  client/received_bytes: 51555816
  client/requests: 104852
  client/requests/10.14.42.20:9567: 104852
  client/sent_bytes: 11355159
  client/success: 104363
  client/success/10.14.42.20:9567: 104363
  jvm_gc_ConcurrentMarkSweep_cycles: 0
  jvm_gc_ConcurrentMarkSweep_msec: 0
  jvm_gc_ParNew_cycles: 15
  jvm_gc_ParNew_msec: 1465
  jvm_gc_cycles: 15
  jvm_gc_msec: 1465
  records-read: 411600
  requests_sent: 112533
  unexpected_error: 7680
  unexpected_error/com.twitter.finagle.WriteException: 627
  unexpected_error/org.jboss.netty.channel.ChannelException: 7053
gauges:
  client/connections: 469
  client/loadbalancer/available/failure_accrual_watermark_pool_caching_pool_host:10.14.42.20/10.14.42.20:9567: 0
  client/loadbalancer/load/failure_accrual_watermark_pool_caching_pool_host:10.14.42.20/10.14.42.20:9567: 505
  client/loadbalancer/size: 1
  client/pending: 505
  client/pending/10.14.42.20:9567: 505
  client/pool_cached: 0
  client/pool_cached/10.14.42.20:9567: 0
  client/pool_size: 505
  client/pool_size/10.14.42.20:9567: 505
  client/pool_waiters: 0
  client/pool_waiters/10.14.42.20:9567: 0
  clock_error: 0
  jvm_fd_count: 694
  jvm_fd_limit: 1024
  jvm_heap_committed: 2043478016
  jvm_heap_max: 4140630016
  jvm_heap_used: 820676128
  jvm_nonheap_committed: 53608448
  jvm_nonheap_max: 136314880
  jvm_nonheap_used: 53103840
  jvm_num_cpus: 16
  jvm_post_gc_CMS_Old_Gen_used: 0
  jvm_post_gc_CMS_Perm_Gen_used: 0
  jvm_post_gc_Par_Eden_Space_used: 0
  jvm_post_gc_Par_Survivor_Space_used: 53673984
  jvm_post_gc_used: 53673984
  jvm_start_time: 1367511620035
  jvm_thread_count: 48
  jvm_thread_daemon_count: 7
  jvm_thread_peak_count: 48
  jvm_uptime: 29051
  queue_depth: 299956
labels:
metrics:
  client/codec_connection_preparation_latency_ms: (average=0, count=112532, maximum=386, minimum=0, p25=0, p50=0, p75=0, p90=0, p95=0, p99=0, p999=1, p9999=57, sum=2658)
  client/codec_connection_preparation_latency_ms/10.14.42.20:9567: (average=0, count=112532, maximum=386, minimum=0, p25=0, p50=0, p75=0, p90=0, p95=0, p99=0, p999=1, p9999=57, sum=2658)
  client/connect_latency_ms: (average=0, count=104852, maximum=386, minimum=0, p25=0, p50=0, p75=0, p90=0, p95=0, p99=0, p999=1, p9999=57, sum=2531)
  client/connect_latency_ms/10.14.42.20:9567: (average=0, count=104852, maximum=386, minimum=0, p25=0, p50=0, p75=0, p90=0, p95=0, p99=0, p999=1, p9999=57, sum=2531)
  client/connection_duration: (average=43, count=104991, maximum=6365, minimum=0, p25=0, p50=0, p75=1, p90=23, p95=52, p99=1161, p999=6365, p9999=6365, sum=4562565)
  client/connection_received_bytes: (average=491, count=104992, maximum=472, minimum=0, p25=472, p50=472, p75=472, p90=472, p95=472, p99=472, p999=472, p9999=472, sum=51556310)
  client/connection_requests: (average=0, count=104991, maximum=1, minimum=0, p25=1, p50=1, p75=1, p90=1, p95=1, p99=1, p999=1, p9999=1, sum=104364)
  client/connection_sent_bytes: (average=107, count=104992, maximum=105, minimum=0, p25=105, p50=105, p75=105, p90=105, p95=105, p99=105, p999=105, p9999=105, sum=11302416)
  client/failed_connect_latency_ms: (average=0, count=627, maximum=3, minimum=0, p25=0, p50=0, p75=0, p90=0, p95=0, p99=0, p999=3, p9999=3, sum=8)
  client/failed_connect_latency_ms/10.14.42.20:9567: (average=0, count=627, maximum=3, minimum=0, p25=0, p50=0, p75=0, p90=0, p95=0, p99=0, p999=3, p9999=3, sum=8)
  client/request_latency_ms: (average=43, count=104364, maximum=6365, minimum=0, p25=0, p50=0, p75=0, p90=23, p95=47, p99=1161, p999=6365, p9999=6365, sum=4549459)
  client/request_latency_ms/10.14.42.20:9567: (average=43, count=104364, maximum=6365, minimum=0, p25=0, p50=0, p75=0, p90=23, p95=47, p99=1161, p999=6365, p9999=6365, sum=4549459)


ulimit shows:

[11:41:07 sg212844] $ ulimit
unlimited


Thanks for any help you can give,

Bob

Bob Nilsen

May 2, 2013, 12:51:57 PM
to iago-...@googlegroups.com
System-wide file descriptor limit:

[root@tmp]# /sbin/sysctl fs.file-max
fs.file-max = 4874583
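
(For reference: fs.file-max is the system-wide ceiling, while the errno 24 error above is about the per-process limit -- the jvm_fd_limit of 1024 in the gauges.  Here is a minimal Scala sketch of how to read the numbers the JVM itself sees, using the same MXBean that appears in the stack trace; the object name is just illustrative.)

import java.lang.management.ManagementFactory
import com.sun.management.UnixOperatingSystemMXBean

// Per-process descriptor numbers as the JVM sees them -- the same MXBean
// behind ostrich's jvm_fd_count / jvm_fd_limit gauges.
object FdCheck extends App {
  val os = ManagementFactory.getOperatingSystemMXBean
    .asInstanceOf[UnixOperatingSystemMXBean]
  println(s"open fds: ${os.getOpenFileDescriptorCount}")
  println(s"fd limit: ${os.getMaxFileDescriptorCount}")  // tracks `ulimit -n`, not fs.file-max
}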

Bob Nilsen

May 2, 2013, 1:19:56 PM
to iago-...@googlegroups.com
I raised my ulimits as follows:

[12:15:11 sg212844] $ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 397311
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 10240
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 397311
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited


Now, the exception does not happen.  So yay!

Bob Nilsen

May 2, 2013, 1:21:02 PM
to iago-...@googlegroups.com
However, now at this high TPS rate I see exceptions in ostrich:


counters:
  400: 262708
  client/connects: 272781
  client/failures/com.twitter.finagle.WriteException: 9674
  client/received_bytes: 129779234
  client/requests: 263106
  client/requests/10.14.42.20:9567: 263106
  client/sent_bytes: 28493948
  client/success: 262711
  client/success/10.14.42.20:9567: 262711
  jvm_gc_ConcurrentMarkSweep_cycles: 1
  jvm_gc_ConcurrentMarkSweep_msec: 418
  jvm_gc_ParNew_cycles: 36
  jvm_gc_ParNew_msec: 2907
  jvm_gc_cycles: 37
  jvm_gc_msec: 3325
  records-read: 572400
  requests_sent: 272781
  unexpected_error: 9674
  unexpected_error/com.twitter.finagle.WriteException: 9674
gauges:
  client/connections: 395
  client/loadbalancer/available/failure_accrual_watermark_pool_caching_pool_host:10.14.42.20/10.14.42.20:9567: 0
  client/loadbalancer/load/failure_accrual_watermark_pool_caching_pool_host:10.14.42.20/10.14.42.20:9567: 398
  client/loadbalancer/size: 1
  client/pending: 397
  client/pending/10.14.42.20:9567: 397
  client/pool_cached: 0
  client/pool_cached/10.14.42.20:9567: 0
  client/pool_size: 398
  client/pool_size/10.14.42.20:9567: 398
  client/pool_waiters: 0
  client/pool_waiters/10.14.42.20:9567: 0
  clock_error: 34085352912
  jvm_fd_count: 602
  jvm_fd_limit: 10240
  jvm_heap_committed: 2043478016
  jvm_heap_max: 4140630016
  jvm_heap_used: 499718800
  jvm_nonheap_committed: 84877312
  jvm_nonheap_max: 136314880
  jvm_nonheap_used: 53377432
  jvm_num_cpus: 16
  jvm_post_gc_CMS_Old_Gen_used: 251215936
  jvm_post_gc_CMS_Perm_Gen_used: 47070816
  jvm_post_gc_Par_Eden_Space_used: 0
  jvm_post_gc_Par_Survivor_Space_used: 47152792
  jvm_post_gc_used: 345439544
  jvm_start_time: 1367514968508
  jvm_thread_count: 48
  jvm_thread_daemon_count: 7
  jvm_thread_peak_count: 48
  jvm_uptime: 95183
  queue_depth: 299610
labels:
metrics:
  client/codec_connection_preparation_latency_ms: (average=0, count=272780, maximum=2858, minimum=0, p25=0, p50=0, p75=0, p90=0, p95=1, p99=4, p999=5, p9999=95, sum=60943)
  client/codec_connection_preparation_latency_ms/10.14.42.20:9567: (average=0, count=272780, maximum=2858, minimum=0, p25=0, p50=0, p75=0, p90=0, p95=1, p99=4, p999=5, p9999=95, sum=60943)
  client/connect_latency_ms: (average=0, count=263106, maximum=2858, minimum=0, p25=0, p50=0, p75=0, p90=0, p95=0, p99=2, p999=5, p9999=95, sum=24598)
  client/connect_latency_ms/10.14.42.20:9567: (average=0, count=263106, maximum=2858, minimum=0, p25=0, p50=0, p75=0, p90=0, p95=0, p99=2, p999=5, p9999=95, sum=24598)
  client/connection_duration: (average=128, count=272385, maximum=9498, minimum=0, p25=0, p50=0, p75=1, p90=13, p95=42, p99=3158, p999=6365, p9999=9498, sum=35060789)
  client/connection_received_bytes: (average=476, count=272385, maximum=472, minimum=0, p25=472, p50=472, p75=472, p90=472, p95=472, p99=472, p999=472, p9999=472, sum=129779234)
  client/connection_requests: (average=0, count=272385, maximum=1, minimum=0, p25=1, p50=1, p75=1, p90=1, p95=1, p99=1, p999=1, p9999=1, sum=262711)
  client/connection_sent_bytes: (average=104, count=272385, maximum=105, minimum=0, p25=105, p50=105, p75=105, p90=105, p95=105, p99=105, p999=105, p9999=105, sum=28451164)
  client/failed_connect_latency_ms: (average=3, count=9674, maximum=52, minimum=0, p25=3, p50=4, p75=4, p90=4, p95=4, p99=4, p999=10, p9999=52, sum=34833)
  client/failed_connect_latency_ms/10.14.42.20:9567: (average=3, count=9674, maximum=52, minimum=0, p25=3, p50=4, p75=4, p90=4, p95=4, p99=4, p999=10, p9999=52, sum=34833)
  client/request_latency_ms: (average=133, count=262711, maximum=9498, minimum=0, p25=0, p50=0, p75=0, p90=13, p95=52, p99=3158, p999=6365, p9999=9498, sum=34969140)
  client/request_latency_ms/10.14.42.20:9567: (average=133, count=262711, maximum=9498, minimum=0, p25=0, p50=0, p75=0, p90=13, p95=52, p99=3158, p999=6365, p9999=9498, sum=34969140)

Bob Nilsen

May 2, 2013, 1:28:24 PM
to iago-...@googlegroups.com
I think I'm hitting port exhaustion now.  I can see a maximum of about 28k open TCP connections from Iago (nearly all in TIME_WAIT).

After one minute (which I think is the kernel's TIME_WAIT interval for each connection) I can see the connection count drop, and Iago starts sending again.

Is there anything else I can set to tell Iago to do *more* connection reuse?  

James Waldrop

May 2, 2013, 1:42:33 PM
to iago-...@googlegroups.com
Ephemeral port exhaustion is actually the most common reason to need to add more servers, at least here at Twitter. That said, it's usually at higher RPS that we see this problem. Hitting it at 5k RPS means that your service is responding in more than 5s, which I *hope* means it's saturated. If you're wondering how I get there, there's a law (Little's Law) that says concurrency is equal to request rate multiplied by your service time.
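
To make that concrete, here's a rough sketch using the ~28k TIME_WAIT socket count from your earlier message (a back-of-the-envelope illustration only, and TIME_WAIT counts toward the holding time here; the object name is just illustrative):

// Little's Law: concurrency = arrival rate * time each connection is held.
object LittlesLaw extends App {
  val requestRate = 5000.0    // requests per second
  val openSockets = 28000.0   // connections observed, nearly all in TIME_WAIT
  val holdSeconds = openSockets / requestRate
  println(f"each connection is held for roughly $holdSeconds%.1f s")  // ~5.6 s
}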

Is there a reason you want more load even though it seems like your service is saturated? Do you expect to normally have requests that take more than 5s to respond to?

If so, then you probably need to add another server instance -- Tom Howland is currently working on getting support for this added to the GitHub repo and expects to be done shortly, I believe (although I don't want to speak for him). We do it internally with Mesos, and I was reluctant to ship code that depended on Mesos, since I don't expect most people to have it, or to set up a Mesos cluster just because Iago can use one to scale automatically.

James





Bob Nilsen

May 2, 2013, 3:03:00 PM
to iago-...@googlegroups.com
Can Iago tell me what my response times are in this case?

ostrich shows this line:

client/request_latency_ms: (average=133, count=262711, maximum=9498, minimum=0, p25=0, p50=0, p75=0, p90=13, p95=52, p99=3158, p999=6365, p9999=9498, sum=34969140)

If that's the correct line to be looking at, it shows that my average response time is 133 ms... which means my service times are acceptably fast.  I should be able to do this with about 665 connections (5,000 req/s x 0.133 s = 665), if connections were pooled and left open after each request.

But that's not what I'm seeing... from my perspective it seems like connections are being closed quickly, and that's why I'm running out.  I don't know how else I'd get to 28k connections in TIME_WAIT.

As background, I'm not really testing any particular service in this experiment... I'm testing and trying to understand Iago.  We use JMeter, LoadRunner, Visual Studio, LoadUI, etc. very frequently in our performance testing group at work.  But knowing the functional shortcomings of thread-based load test tools, we are looking into Java- or Python-based async I/O load-driver frameworks like Iago to see how they scale.

Thanks for all the help, sorry to flood you guys with questions.

-Bob

James Waldrop

May 2, 2013, 4:46:23 PM
to iago-...@googlegroups.com
Do you have reuseConnections set to false? That's the other thing that could explain this -- SSL without reuseConnections being true will often exhaust the port space because the handshake takes so long.

James



Bob Nilsen

May 2, 2013, 4:48:25 PM
to iago-...@googlegroups.com
To verify my test apparatus I just saw that this same machine using JMeter can do a max of 42k TPS to this apache server.

James Waldrop

May 2, 2013, 4:50:43 PM
to iago-...@googlegroups.com
Cool, that's useful.  Seems worth getting to the bottom of, then.


On Thu, May 2, 2013 at 1:48 PM, Bob Nilsen <rwni...@gmail.com> wrote:
To verify my test apparatus I just saw that this same machine using JMeter can do a max of 42k TPS to this apache server.


Bob Nilsen

May 2, 2013, 4:52:00 PM
to iago-...@googlegroups.com
Here's my config; reuseConnections is true... but that's just it: setting reuseConnections doesn't help nearly as much as I would expect:


import com.twitter.parrot.config.ParrotLauncherConfig

new ParrotLauncherConfig {
  doConfirm = false
  localMode = true               // run the feeder and server locally on this box
  jobName = "testrun"
  port = 8090
  victims = "10.19.148.199"      // the system under test
  log = "config/replay.log"      // replay log of requests to send
  requestRate = 5000             // requests per second
  duration = 2                   // minutes
  reuseConnections = true        // keep connections open across requests
}


--
Bob Nilsen
rwni...@gmail.com

James Waldrop

May 2, 2013, 5:05:53 PM
to iago-...@googlegroups.com
Can you attach the parrot server log?

Bob Nilsen

May 2, 2013, 11:35:17 PM
to iago-...@googlegroups.com
Sure thing, James.

The parrot-server.log files only show lines like this:

ERR [20130502-22:00:01.007] logging: Unable to open socket to scribe server at localhost:1463: java.net.ConnectException: Connection refused
ERR [20130502-22:01:01.008] logging: Unable to open socket to scribe server at localhost:1463: java.net.ConnectException: Connection refused
ERR [20130502-22:02:01.008] logging: Unable to open socket to scribe server at localhost:1463: java.net.ConnectException: Connection refused
ERR [20130502-22:03:01.007] logging: Unable to open socket to scribe server at localhost:1463: java.net.ConnectException: Connection refused
ERR [20130502-22:04:01.007] logging: Unable to open socket to scribe server at localhost:1463: java.net.ConnectException: Connection refused

Do I need to put Iago into super-debug mode in order to show more errors?

-Bob



Tom Howland

May 3, 2013, 12:37:38 PM
to iago-...@googlegroups.com
Bob

Those errors come from an attempt to log to a Scribe server; they're caused by a bug in the default logging configuration. Fix it with:

diff --git a/src/main/resources/templates/local-template-server.scala b/src/main/resources/templates/local-template-server.scala
index 8211167..958290f 100644
--- a/src/main/resources/templates/local-template-server.scala
+++ b/src/main/resources/templates/local-template-server.scala
@@ -12,16 +12,6 @@ new ParrotServerConfig[#{requestType}, #{responseType}] {
       rollPolicy = Policy.Hourly,
       rotateCount = 6
     )
-  ) :: new LoggerFactory(
-    node = "stats",
-    level = Level.INFO,
-    useParents = false,
-    handlers = ScribeHandler(
-      hostname = "localhost",
-      category = "cuckoo_json",
-      maxMessagesPerTransaction = 100,
-      formatter = BareFormatter
-    )
   ) :: loggers
 
   statsName = "parrot_#{jobName}"

or wait until I do, which may not be for a week or two.

James Waldrop

May 3, 2013, 1:26:15 PM
to iago-...@googlegroups.com
I think the performance stats are what we need. You're already fetching them, but we may need to see them over time to understand what's happening. Rather than putting a lot of work on your plate, I think I'm inclined to just replicate your results locally where we can debug it directly.

James

Bob Nilsen

May 3, 2013, 2:07:35 PM
to iago-...@googlegroups.com
The stats from ostrich?  I might be able to capture them for you.  I'm only doing 2-minute test runs, so I could poll ostrich manually, dumping each sample to a timestamped file.

stand by...

-Bob

Bob Nilsen

May 3, 2013, 2:50:29 PM
to iago-...@googlegroups.com
James,

Please find attached 1-second samples of the ostrich output during a 2-minute test run.  They are in chronological order by epoch time in their filenames.

I have also included the logs, which are *much* more interesting than those I posted previously.  Lots of exceptions to dig into.

-Bob


--
Bob Nilsen
rwni...@gmail.com
iago_performance.tgz
iago_logs.tgz

James Waldrop

May 3, 2013, 4:11:50 PM
to iago-...@googlegroups.com
You're right, these are interesting in a number of ways.

INF [20130503-13:18:23.313] server: Creating job named testrun

This is the start of the run. You should get load shortly after this (shortly meaning basically immediately, modulo how expensive it is to create your request objects and in this case they're very cheap to create).

ERR [20130503-13:19:45.280] server: unexpected error: com.twitter.finagle.WriteException: java.net.BindException: Cannot assign requested address [many repeated]

This usually means you've run out of local (ephemeral) ports to bind for new connections.

ERR [20130503-13:20:30.013] server: unexpected error: com.twitter.finagle.WriteException: java.net.ConnectException: Connection timed out [many repeated]

This is our first obvious indication that your system is too saturated to handle any more requests.


There are several messages that are useless noise, either because they're Scribe-related or because they only happen during shutdown:

ERR [20130503-13:20:33.227] server: unexpected error: java.lang.IllegalArgumentException: requirement failed: newTimeout on inactive timer
ERR [20130503-13:20:32.827] server: unexpected error: com.twitter.finagle.WriteException: java.nio.channels.ClosedChannelException
FAT [20130503-13:20:56.319] monitor: Exception propagated to the root monitor!

All scary-looking, none worth worrying about.  Note that they're slightly beyond the 2-minute mark.  We have fixes for the timer exceptions that should land on GitHub soon.


Now, I'm stating that your system is saturated based on the evidence of the exceptions. It would be useful if the metrics backed this up. We do get that, although someone who isn't familiar with Finagle might not immediately diagnose it:

1367605229_iago_stats.txt:  client/failed_connect_latency_ms: (average=3, count=594, maximum=10, minimum=3, p25=3, p50=3, p75=4, p90=4, p95=4, p99=4, p999=10, p9999=10, sum=1962)

1367605230_iago_stats.txt:  client/failed_connect_latency_ms: (average=10712, count=1479, maximum=21153, minimum=3, p25=3, p50=21153, p75=21153, p90=21153, p95=21153, p99=21153, p999=21153, p9999=21153, sum=15843245)

These are samples ~1s apart, I believe, based on what you've stated above.  You can see a sudden, large increase in failed connections, with a p50 of 21s.  My guess is that your Apache server has a ~20s timeout configured, after which it gives up on a connection that couldn't get an available worker.

All of this raises the obvious question of why this happens with Iago and not with JMeter. My theory is that you're falling prey to the underlying design difference between JMeter and Iago: JMeter is coupled to your system under test, and that gives you a false sense of security about expected performance in production. In systems-theory terms, the primary difference is that Iago will continue to send requests at a specified rate regardless of anything happening in the system under test; JMeter can't do that for any reasonable thread pool size. So with JMeter you get an accurate estimate of the maximum throughput of the system, but not an accurate estimate of where it will fail when it encounters a specific production load.
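
Here's a toy sketch of that difference (illustrative only -- this is not Iago's or JMeter's actual code, the 2-second "response time" is made up to show the effect, and it assumes Scala 2.12+ for the lambda-to-Runnable conversion):

import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicLong

// Contrast open-loop (fixed-rate) and closed-loop (wait-for-response) load
// generation against a deliberately slow fake target.
object LoadModels extends App {
  def fakeRequest(counter: AtomicLong): Unit = {
    counter.incrementAndGet()
    Thread.sleep(2000)   // pretend the system under test takes 2 s to respond
  }

  // Open loop (Iago-style): a clock fires every 10 ms and hands each request to
  // a separate pool, so ~100 req/s keep going out no matter how slow the target is.
  val openSent = new AtomicLong(0)
  val senders  = Executors.newCachedThreadPool()
  val clock    = Executors.newSingleThreadScheduledExecutor()
  clock.scheduleAtFixedRate(() => senders.execute(() => fakeRequest(openSent)),
    0, 10, TimeUnit.MILLISECONDS)

  // Closed loop (thread-pool style): 10 workers each wait for a response before
  // sending the next request, so throughput silently degrades to 10 / 2 s = 5 req/s.
  val closedSent = new AtomicLong(0)
  val workers    = Executors.newFixedThreadPool(10)
  (1 to 10).foreach(_ => workers.execute(() => while (true) fakeRequest(closedSent)))

  Thread.sleep(5000)
  println(s"open loop started ${openSent.get} requests, closed loop started ${closedSent.get}")
  sys.exit(0)
}

Run it for a few seconds and the open-loop counter keeps climbing at the scheduled rate while the closed-loop counter flattens out as soon as the target slows down.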

James



Bob Nilsen

May 3, 2013, 6:06:30 PM
to iago-...@googlegroups.com
Hi James,

Thanks again for the help.

I see what you mean about the 21-second connection failures.  I'll look at Apache for a 20-second timeout.

However, that same file shows those timeouts represent only 1,479 of the 628,000 connection attempts (0.2%).  99.09% of the requests were successful, with an average response time of 8 ms.  So it's not as if the Apache server is *overwhelmed*.

But it still seems like Iago uses up a heck of a lot of connections, regardless of the status of the reuseConnections setting.

client/connection_duration shows there were 623,995 connections, for 623,966 requests?

client/connection_received_bytes is *almost always* exactly 472, and sent_bytes is nearly always 105.  If connections were actually being reused, I'd expect to see some variety here (multiple requests' worth of bytes on a single connection).

Believe me, I'm on board with the Iago mission... I don't like thread-based load drivers either.  I'm happy to try out other suggestions to understand what's going on here.

-Bob







James Waldrop

May 6, 2013, 1:27:58 PM
to iago-...@googlegroups.com
I agree, there's something confusing going on here.  We're digging on our side; it may be the version of Finagle we're using, which is woefully out of date.