File descriptor problem at 5k TPS on RHEL


Bob Nilsen

May 2, 2013, 12:48:06 PM
to iago-...@googlegroups.com
I'm getting an exception when trying to run at 5k TPS.  I recognize that this is likely an OS limitation... but I thought I'd report it here, since the goal is to be able to do 10k TPS with one box.  Maybe you guys know how to overcome this problem?

System Details:
Red Hat Enterprise Linux Server release 5.6 (Tikanga)
HP Blade BL460 G6 Dual Quad Core 48 GB RAM

[11:19:57 sg212844] $ java -jar iago-0.5.1-SNAPSHOT.jar -f config/5000_bigip_reuse.scala
Configs generated, are you ready to do some damage? [yes]
sh scripts/local-parrot.sh
initialized parrot
java.lang.InternalError: errno: 24 error: Unable to open directory /proc/self/fd

        at com.sun.management.UnixOperatingSystem.getOpenFileDescriptorCount(Native Method)
        at com.twitter.ostrich.stats.StatsCollection.fillInJvmGauges(StatsCollection.scala:72)
        at com.twitter.ostrich.stats.StatsCollection.getGauges(StatsCollection.scala:205)
        at com.twitter.ostrich.stats.StatsCollection.getGauges(StatsCollection.scala:30)
        at com.twitter.ostrich.stats.StatsProvider$class.get(StatsProvider.scala:184)
        at com.twitter.ostrich.stats.StatsCollection.get(StatsCollection.scala:30)
        at com.twitter.ostrich.admin.CommandHandler$$anonfun$handleCommand$6.apply(CommandHandler.scala:113)
        at com.twitter.ostrich.admin.CommandHandler$$anonfun$handleCommand$6.apply(CommandHandler.scala:113)
        at scala.Option.getOrElse(Option.scala:108)
        at com.twitter.ostrich.admin.CommandHandler.handleCommand(CommandHandler.scala:111)
        at com.twitter.ostrich.admin.CommandHandler.apply(CommandHandler.scala:65)
        at com.twitter.ostrich.admin.CommandRequestHandler.handle(AdminHttpService.scala:296)
        at com.twitter.ostrich.admin.CgiRequestHandler.handle(AdminHttpService.scala:154)
        at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:65)
        at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:65)
        at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:68)
        at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:554)
        at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:65)
        at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:526)
        at sun.net.httpserver.ServerImpl$DefaultExecutor.execute(ServerImpl.java:117)
        at sun.net.httpserver.ServerImpl$Dispatcher.handle(ServerImpl.java:347)
        at sun.net.httpserver.ServerImpl$Dispatcher.run(ServerImpl.java:319)
        at java.lang.Thread.run(Thread.java:662)
shutting down client
shut down parrot
done.


ostrich shows:

counters:
  400: 104363
  client/connects: 105480
  client/failures/com.twitter.finagle.WriteException: 627
  client/failures/org.jboss.netty.channel.ChannelException: 7053
  client/received_bytes: 51555816
  client/requests: 104852
  client/requests/10.14.42.20:9567: 104852
  client/sent_bytes: 11355159
  client/success: 104363
  client/success/10.14.42.20:9567: 104363
  jvm_gc_ConcurrentMarkSweep_cycles: 0
  jvm_gc_ConcurrentMarkSweep_msec: 0
  jvm_gc_ParNew_cycles: 15
  jvm_gc_ParNew_msec: 1465
  jvm_gc_cycles: 15
  jvm_gc_msec: 1465
  records-read: 411600
  requests_sent: 112533
  unexpected_error: 7680
  unexpected_error/com.twitter.finagle.WriteException: 627
  unexpected_error/org.jboss.netty.channel.ChannelException: 7053
gauges:
  client/connections: 469
  client/loadbalancer/available/failure_accrual_watermark_pool_caching_pool_host:10.14.42.20/10.14.42.20:9567: 0
  client/loadbalancer/load/failure_accrual_watermark_pool_caching_pool_host:10.14.42.20/10.14.42.20:9567: 505
  client/loadbalancer/size: 1
  client/pending: 505
  client/pending/10.14.42.20:9567: 505
  client/pool_cached: 0
  client/pool_cached/10.14.42.20:9567: 0
  client/pool_size: 505
  client/pool_size/10.14.42.20:9567: 505
  client/pool_waiters: 0
  client/pool_waiters/10.14.42.20:9567: 0
  clock_error: 0
  jvm_fd_count: 694
  jvm_fd_limit: 1024
  jvm_heap_committed: 2043478016
  jvm_heap_max: 4140630016
  jvm_heap_used: 820676128
  jvm_nonheap_committed: 53608448
  jvm_nonheap_max: 136314880
  jvm_nonheap_used: 53103840
  jvm_num_cpus: 16
  jvm_post_gc_CMS_Old_Gen_used: 0
  jvm_post_gc_CMS_Perm_Gen_used: 0
  jvm_post_gc_Par_Eden_Space_used: 0
  jvm_post_gc_Par_Survivor_Space_used: 53673984
  jvm_post_gc_used: 53673984
  jvm_start_time: 1367511620035
  jvm_thread_count: 48
  jvm_thread_daemon_count: 7
  jvm_thread_peak_count: 48
  jvm_uptime: 29051
  queue_depth: 299956
labels:
metrics:
  client/codec_connection_preparation_latency_ms: (average=0, count=112532, maximum=386, minimum=0, p25=0, p50=0, p75=0, p90=0, p95=0, p99=0, p999=1, p9999=57, sum=2658)
  client/codec_connection_preparation_latency_ms/10.14.42.20:9567: (average=0, count=112532, maximum=386, minimum=0, p25=0, p50=0, p75=0, p90=0, p95=0, p99=0, p999=1, p9999=57, sum=2658)
  client/connect_latency_ms: (average=0, count=104852, maximum=386, minimum=0, p25=0, p50=0, p75=0, p90=0, p95=0, p99=0, p999=1, p9999=57, sum=2531)
  client/connect_latency_ms/10.14.42.20:9567: (average=0, count=104852, maximum=386, minimum=0, p25=0, p50=0, p75=0, p90=0, p95=0, p99=0, p999=1, p9999=57, sum=2531)
  client/connection_duration: (average=43, count=104991, maximum=6365, minimum=0, p25=0, p50=0, p75=1, p90=23, p95=52, p99=1161, p999=6365, p9999=6365, sum=4562565)
  client/connection_received_bytes: (average=491, count=104992, maximum=472, minimum=0, p25=472, p50=472, p75=472, p90=472, p95=472, p99=472, p999=472, p9999=472, sum=51556310)
  client/connection_requests: (average=0, count=104991, maximum=1, minimum=0, p25=1, p50=1, p75=1, p90=1, p95=1, p99=1, p999=1, p9999=1, sum=104364)
  client/connection_sent_bytes: (average=107, count=104992, maximum=105, minimum=0, p25=105, p50=105, p75=105, p90=105, p95=105, p99=105, p999=105, p9999=105, sum=11302416)
  client/failed_connect_latency_ms: (average=0, count=627, maximum=3, minimum=0, p25=0, p50=0, p75=0, p90=0, p95=0, p99=0, p999=3, p9999=3, sum=8)
  client/failed_connect_latency_ms/10.14.42.20:9567: (average=0, count=627, maximum=3, minimum=0, p25=0, p50=0, p75=0, p90=0, p95=0, p99=0, p999=3, p9999=3, sum=8)
  client/request_latency_ms: (average=43, count=104364, maximum=6365, minimum=0, p25=0, p50=0, p75=0, p90=23, p95=47, p99=1161, p999=6365, p9999=6365, sum=4549459)
  client/request_latency_ms/10.14.42.20:9567: (average=43, count=104364, maximum=6365, minimum=0, p25=0, p50=0, p75=0, p90=23, p95=47, p99=1161, p999=6365, p9999=6365, sum=4549459)


ulimit shows:

[11:41:07 sg212844] $ ulimit
unlimited


Thanks for any help you can give,

Bob

Bob Nilsen

May 2, 2013, 12:51:57 PM
to iago-...@googlegroups.com
System-wide file descriptor limit:

[root@tmp]# /sbin/sysctl fs.file-max
fs.file-max = 4874583
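
(For reference: fs.file-max is the system-wide ceiling, while the errno 24 error above is about the per-process limit -- the jvm_fd_limit of 1024 in the gauges.  Here is a minimal Scala sketch of how to read the numbers the JVM itself sees, using the same MXBean that appears in the stack trace; the object name is just illustrative.)

import java.lang.management.ManagementFactory
import com.sun.management.UnixOperatingSystemMXBean

// Per-process descriptor numbers as the JVM sees them -- the same MXBean
// behind ostrich's jvm_fd_count / jvm_fd_limit gauges.
object FdCheck extends App {
  val os = ManagementFactory.getOperatingSystemMXBean
    .asInstanceOf[UnixOperatingSystemMXBean]
  println(s"open fds: ${os.getOpenFileDescriptorCount}")
  println(s"fd limit: ${os.getMaxFileDescriptorCount}")  // tracks `ulimit -n`, not fs.file-max
}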

Bob Nilsen

May 2, 2013, 1:19:56 PM
to iago-...@googlegroups.com
I raised my ulimits as follows:

[12:15:11 sg212844] $ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 397311
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 10240
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 397311
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited


Now, the exception does not happen.  So yay!

Bob Nilsen

May 2, 2013, 1:21:02 PM
to iago-...@googlegroups.com
However, now at this high TPS rate I see exceptions in ostrich:


counters:
  400: 262708
  client/connects: 272781
  client/failures/com.twitter.finagle.WriteException: 9674
  client/received_bytes: 129779234
  client/requests: 263106
  client/requests/10.14.42.20:9567: 263106
  client/sent_bytes: 28493948
  client/success: 262711
  client/success/10.14.42.20:9567: 262711
  jvm_gc_ConcurrentMarkSweep_cycles: 1
  jvm_gc_ConcurrentMarkSweep_msec: 418
  jvm_gc_ParNew_cycles: 36
  jvm_gc_ParNew_msec: 2907
  jvm_gc_cycles: 37
  jvm_gc_msec: 3325
  records-read: 572400
  requests_sent: 272781
  unexpected_error: 9674
  unexpected_error/com.twitter.finagle.WriteException: 9674
gauges:
  client/connections: 395
  client/loadbalancer/available/failure_accrual_watermark_pool_caching_pool_host:10.14.42.20/10.14.42.20:9567: 0
  client/loadbalancer/load/failure_accrual_watermark_pool_caching_pool_host:10.14.42.20/10.14.42.20:9567: 398
  client/loadbalancer/size: 1
  client/pending: 397
  client/pending/10.14.42.20:9567: 397
  client/pool_cached: 0
  client/pool_cached/10.14.42.20:9567: 0
  client/pool_size: 398
  client/pool_size/10.14.42.20:9567: 398
  client/pool_waiters: 0
  client/pool_waiters/10.14.42.20:9567: 0
  clock_error: 34085352912
  jvm_fd_count: 602
  jvm_fd_limit: 10240
  jvm_heap_committed: 2043478016
  jvm_heap_max: 4140630016
  jvm_heap_used: 499718800
  jvm_nonheap_committed: 84877312
  jvm_nonheap_max: 136314880
  jvm_nonheap_used: 53377432
  jvm_num_cpus: 16
  jvm_post_gc_CMS_Old_Gen_used: 251215936
  jvm_post_gc_CMS_Perm_Gen_used: 47070816
  jvm_post_gc_Par_Eden_Space_used: 0
  jvm_post_gc_Par_Survivor_Space_used: 47152792
  jvm_post_gc_used: 345439544
  jvm_start_time: 1367514968508
  jvm_thread_count: 48
  jvm_thread_daemon_count: 7
  jvm_thread_peak_count: 48
  jvm_uptime: 95183
  queue_depth: 299610
labels:
metrics:
  client/codec_connection_preparation_latency_ms: (average=0, count=272780, maximum=2858, minimum=0, p25=0, p50=0, p75=0, p90=0, p95=1, p99=4, p999=5, p9999=95, sum=60943)
  client/codec_connection_preparation_latency_ms/10.14.42.20:9567: (average=0, count=272780, maximum=2858, minimum=0, p25=0, p50=0, p75=0, p90=0, p95=1, p99=4, p999=5, p9999=95, sum=60943)
  client/connect_latency_ms: (average=0, count=263106, maximum=2858, minimum=0, p25=0, p50=0, p75=0, p90=0, p95=0, p99=2, p999=5, p9999=95, sum=24598)
  client/connect_latency_ms/10.14.42.20:9567: (average=0, count=263106, maximum=2858, minimum=0, p25=0, p50=0, p75=0, p90=0, p95=0, p99=2, p999=5, p9999=95, sum=24598)
  client/connection_duration: (average=128, count=272385, maximum=9498, minimum=0, p25=0, p50=0, p75=1, p90=13, p95=42, p99=3158, p999=6365, p9999=9498, sum=35060789)
  client/connection_received_bytes: (average=476, count=272385, maximum=472, minimum=0, p25=472, p50=472, p75=472, p90=472, p95=472, p99=472, p999=472, p9999=472, sum=129779234)
  client/connection_requests: (average=0, count=272385, maximum=1, minimum=0, p25=1, p50=1, p75=1, p90=1, p95=1, p99=1, p999=1, p9999=1, sum=262711)
  client/connection_sent_bytes: (average=104, count=272385, maximum=105, minimum=0, p25=105, p50=105, p75=105, p90=105, p95=105, p99=105, p999=105, p9999=105, sum=28451164)
  client/failed_connect_latency_ms: (average=3, count=9674, maximum=52, minimum=0, p25=3, p50=4, p75=4, p90=4, p95=4, p99=4, p999=10, p9999=52, sum=34833)
  client/failed_connect_latency_ms/10.14.42.20:9567: (average=3, count=9674, maximum=52, minimum=0, p25=3, p50=4, p75=4, p90=4, p95=4, p99=4, p999=10, p9999=52, sum=34833)
  client/request_latency_ms: (average=133, count=262711, maximum=9498, minimum=0, p25=0, p50=0, p75=0, p90=13, p95=52, p99=3158, p999=6365, p9999=9498, sum=34969140)
  client/request_latency_ms/10.14.42.20:9567: (average=133, count=262711, maximum=9498, minimum=0, p25=0, p50=0, p75=0, p90=13, p95=52, p99=3158, p999=6365, p9999=9498, sum=34969140)

Bob Nilsen

May 2, 2013, 1:28:24 PM
to iago-...@googlegroups.com
I think I'm hitting port exhaustion now.  I can see a maximum of about 28k open TCP connections from Iago (nearly all in TIME_WAIT).

After one minute (which I think is the kernel's TIME_WAIT interval for each connection) I can see the connection count drop, and Iago starts sending again.

Is there anything else I can set to tell Iago to do *more* connection reuse?  

James Waldrop

May 2, 2013, 1:42:33 PM
to iago-...@googlegroups.com
Ephemeral port exhaustion is actually the most common reason to need to add more servers, at least here at Twitter. That said, it's usually at higher RPS that we see this problem. Hitting it at 5k RPS means that your service is responding in more than 5s, which I *hope* means it's saturated. If you're wondering how I get there, there's a law (Little's Law) that says concurrency is equal to request rate multiplied by your service time.
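
To make that concrete, here's a rough sketch using the ~28k TIME_WAIT socket count from your earlier message (a back-of-the-envelope illustration only, and TIME_WAIT counts toward the holding time here; the object name is just illustrative):

// Little's Law: concurrency = arrival rate * time each connection is held.
object LittlesLaw extends App {
  val requestRate = 5000.0    // requests per second
  val openSockets = 28000.0   // connections observed, nearly all in TIME_WAIT
  val holdSeconds = openSockets / requestRate
  println(f"each connection is held for roughly $holdSeconds%.1f s")  // ~5.6 s
}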

Is there a reason you want more load even though it seems like your service is saturated? Do you expect to normally have requests that take more than 5s to respond to?

If so, then you probably need to add another server instance -- Tom Howland is currently working on getting support for this added to the GitHub repo and expects to be done shortly, I believe (although I don't want to speak for him). We do it internally with Mesos, and I was reluctant to ship code that depended on Mesos, since I don't expect most people to have it, or to set up a Mesos cluster just because Iago can use one to scale automatically.

James





Bob Nilsen

May 2, 2013, 3:03:00 PM
to iago-...@googlegroups.com
Can Iago tell me what my response times are in this case?

ostrich shows this line:

client/request_latency_ms: (average=133, count=262711, maximum=9498, minimum=0, p25=0, p50=0, p75=0, p90=13, p95=52, p99=3158, p999=6365, p9999=9498, sum=34969140)

If that's the correct line to be looking at, it shows that my average response time is 133 ms... which means my service times are acceptably fast.  I should be able to do this with about 665 connections (5,000 req/s x 0.133 s = 665), if connections were pooled and left open after each request.

But that's not what I'm seeing... from my perspective it seems like connections are being closed quickly, and that's why I'm running out.  I don't know how else I'd get to 28k connections in TIME_WAIT.

As background, I'm not really testing any particular service in this experiment... I'm testing and trying to understand Iago.  We use JMeter, LoadRunner, Visual Studio, LoadUI, etc. very frequently in our performance testing group at work.  But knowing the functional shortcomings of thread-based load test tools, we are looking into Java- or Python-based async I/O load-driver frameworks like Iago to see how they scale.

Thanks for all the help, sorry to flood you guys with questions.

-Bob

James Waldrop

May 2, 2013, 4:46:23 PM
to iago-...@googlegroups.com
Do you have reuseConnections set to false? That's the other thing that could explain this -- SSL without reuseConnections being true will often exhaust the port space because the handshake takes so long.

James



Bob Nilsen

May 2, 2013, 4:48:25 PM
to iago-...@googlegroups.com
To verify my test apparatus I just saw that this same machine using JMeter can do a max of 42k TPS to this apache server.

James Waldrop

May 2, 2013, 4:50:43 PM
to iago-...@googlegroups.com
Cool, that's useful.  Seems worth getting to the bottom of, then.


On Thu, May 2, 2013 at 1:48 PM, Bob Nilsen <rwni...@gmail.com> wrote:
To verify my test apparatus I just saw that this same machine using JMeter can do a max of 42k TPS to this apache server.


Bob Nilsen

May 2, 2013, 4:52:00 PM
to iago-...@googlegroups.com
Here's my config; reuseConnections is true... but that's just it: setting reuseConnections doesn't help nearly as much as I would expect:


import com.twitter.parrot.config.ParrotLauncherConfig

new ParrotLauncherConfig {
  doConfirm = false
  localMode = true               // run the feeder and server locally on this box
  jobName = "testrun"
  port = 8090
  victims = "10.19.148.199"      // the system under test
  log = "config/replay.log"      // replay log of requests to send
  requestRate = 5000             // requests per second
  duration = 2                   // minutes
  reuseConnections = true        // keep connections open across requests
}


--
Bob Nilsen
rwni...@gmail.com

James Waldrop

May 2, 2013, 5:05:53 PM
to iago-...@googlegroups.com
Can you attach the parrot server log?

Bob Nilsen

May 2, 2013, 11:35:17 PM
to iago-...@googlegroups.com
Sure thing, James.

The parrot-server.log files only show lines like this:

ERR [20130502-22:00:01.007] logging: Unable to open socket to scribe server at localhost:1463: java.net.ConnectException: Connection refused
ERR [20130502-22:01:01.008] logging: Unable to open socket to scribe server at localhost:1463: java.net.ConnectException: Connection refused
ERR [20130502-22:02:01.008] logging: Unable to open socket to scribe server at localhost:1463: java.net.ConnectException: Connection refused
ERR [20130502-22:03:01.007] logging: Unable to open socket to scribe server at localhost:1463: java.net.ConnectException: Connection refused
ERR [20130502-22:04:01.007] logging: Unable to open socket to scribe server at localhost:1463: java.net.ConnectException: Connection refused

Do I need to put Iago into super-debug mode in order to show more errors?

-Bob



Tom Howland

May 3, 2013, 12:37:38 PM
to iago-...@googlegroups.com
Bob

Those errors come from an attempt to log to a Scribe server; they're caused by a bug in the default logging configuration. Fix it with:

diff --git a/src/main/resources/templates/local-template-server.scala b/src/main/resources/templates/local-template-server.scala
index 8211167..958290f 100644
--- a/src/main/resources/templates/local-template-server.scala
+++ b/src/main/resources/templates/local-template-server.scala
@@ -12,16 +12,6 @@ new ParrotServerConfig[#{requestType}, #{responseType}] {
       rollPolicy = Policy.Hourly,
       rotateCount = 6
     )
-  ) :: new LoggerFactory(
-    node = "stats",
-    level = Level.INFO,
-    useParents = false,
-    handlers = ScribeHandler(
-      hostname = "localhost",
-      category = "cuckoo_json",
-      maxMessagesPerTransaction = 100,
-      formatter = BareFormatter
-    )
   ) :: loggers
 
   statsName = "parrot_#{jobName}"

or wait until I do, which may not be for a week or two.

James Waldrop

May 3, 2013, 1:26:15 PM
to iago-...@googlegroups.com
I think the performance stats are what we need. You're already fetching them, but we may need to see them over time to understand what's happening. Rather than putting a lot of work on your plate, I think I'm inclined to just replicate your results locally where we can debug it directly.

James

Bob Nilsen

May 3, 2013, 2:07:35 PM
to iago-...@googlegroups.com
The stats from ostrich?  I might be able to capture them for you.  I'm only doing 2-minute test runs, so I could poll ostrich manually, dumping each sample to a timestamped file.

stand by...

-Bob

Bob Nilsen

May 3, 2013, 2:50:29 PM
to iago-...@googlegroups.com
James,

Please find attached 1-second samples of the ostrich output during a 2-minute test run.  They are in chronological order by epoch time in their filenames.

I have also included the logs, which are *much* more interesting than those I posted previously.  Lots of exceptions to dig into.

-Bob


--
Bob Nilsen
rwni...@gmail.com
iago_performance.tgz
iago_logs.tgz

James Waldrop

May 3, 2013, 4:11:50 PM
to iago-...@googlegroups.com
You're right, these are interesting in a number of ways.

INF [20130503-13:18:23.313] server: Creating job named testrun

This is the start of the run. You should get load shortly after this (shortly meaning basically immediately, modulo how expensive it is to create your request objects and in this case they're very cheap to create).

ERR [20130503-13:19:45.280] server: unexpected error: com.twitter.finagle.WriteException: java.net.BindException: Cannot assign requested address [many repeated]

This usually means you've run out of local (ephemeral) ports to bind for new connections.

ERR [20130503-13:20:30.013] server: unexpected error: com.twitter.finagle.WriteException: java.net.ConnectException: Connection timed out [many repeated]

This is our first obvious indication that your system is too saturated to handle any more requests.


There are several messages that are useless noise, either because they're Scribe-related or because they only happen during shutdown:

ERR [20130503-13:20:33.227] server: unexpected error: java.lang.IllegalArgumentException: requirement failed: newTimeout on inactive timer
ERR [20130503-13:20:32.827] server: unexpected error: com.twitter.finagle.WriteException: java.nio.channels.ClosedChannelException
FAT [20130503-13:20:56.319] monitor: Exception propagated to the root monitor!

All scary-looking, none worth worrying about.  Note that they're slightly beyond the 2-minute mark.  We have fixes for the timer exceptions that should land on GitHub soon.


Now, I'm stating that your system is saturated based on the evidence of the exceptions. It would be useful if the metrics backed this up. We do get that, although someone who isn't familiar with Finagle might not immediately diagnose it:

1367605229_iago_stats.txt:  client/failed_connect_latency_ms: (average=3, count=594, maximum=10, minimum=3, p25=3, p50=3, p75=4, p90=4, p95=4, p99=4, p999=10, p9999=10, sum=1962)

1367605230_iago_stats.txt:  client/failed_connect_latency_ms: (average=10712, count=1479, maximum=21153, minimum=3, p25=3, p50=21153, p75=21153, p90=21153, p95=21153, p99=21153, p999=21153, p9999=21153, sum=15843245)

These are samples ~1s apart, I believe, based on what you've stated above.  You can see a sudden, large increase in failed connections, with a p50 of 21s.  My guess is that your Apache server has a ~20s timeout configured, after which it gives up on a connection that couldn't get an available worker.

All of this raises the obvious question of why this happens with Iago and not with JMeter. My theory is that you're falling prey to the underlying design difference between JMeter and Iago: JMeter is coupled to your system under test, and that gives you a false sense of security about expected performance in production. In systems-theory terms, the primary difference is that Iago will continue to send requests at a specified rate regardless of anything happening in the system under test; JMeter can't do that for any reasonable thread pool size. So with JMeter you get an accurate estimate of the maximum throughput of the system, but not an accurate estimate of where it will fail when it encounters a specific production load.
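
Here's a toy sketch of that difference (illustrative only -- this is not Iago's or JMeter's actual code, the 2-second "response time" is made up to show the effect, and it assumes Scala 2.12+ for the lambda-to-Runnable conversion):

import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicLong

// Contrast open-loop (fixed-rate) and closed-loop (wait-for-response) load
// generation against a deliberately slow fake target.
object LoadModels extends App {
  def fakeRequest(counter: AtomicLong): Unit = {
    counter.incrementAndGet()
    Thread.sleep(2000)   // pretend the system under test takes 2 s to respond
  }

  // Open loop (Iago-style): a clock fires every 10 ms and hands each request to
  // a separate pool, so ~100 req/s keep going out no matter how slow the target is.
  val openSent = new AtomicLong(0)
  val senders  = Executors.newCachedThreadPool()
  val clock    = Executors.newSingleThreadScheduledExecutor()
  clock.scheduleAtFixedRate(() => senders.execute(() => fakeRequest(openSent)),
    0, 10, TimeUnit.MILLISECONDS)

  // Closed loop (thread-pool style): 10 workers each wait for a response before
  // sending the next request, so throughput silently degrades to 10 / 2 s = 5 req/s.
  val closedSent = new AtomicLong(0)
  val workers    = Executors.newFixedThreadPool(10)
  (1 to 10).foreach(_ => workers.execute(() => while (true) fakeRequest(closedSent)))

  Thread.sleep(5000)
  println(s"open loop started ${openSent.get} requests, closed loop started ${closedSent.get}")
  sys.exit(0)
}

Run it for a few seconds and the open-loop counter keeps climbing at the scheduled rate while the closed-loop counter flattens out as soon as the target slows down.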

James



Bob Nilsen

May 3, 2013, 6:06:30 PM
to iago-...@googlegroups.com
Hi James,

Thanks again for the help.

I see what you mean about the 21-second connection failures.  I'll look at Apache for a 20-second timeout.

However, that same file shows those timeouts represent only 1,479 of the 628,000 connection attempts (0.2%).  99.09% of the requests were successful, with an average response time of 8 ms.  So it's not as if the Apache server is *overwhelmed*.

But it still seems like Iago uses up a heck of a lot of connections, regardless of the status of the reuseConnections setting.

client/connection_duration shows there were 623,995 connections, for 623,966 requests?

client/connection_received_bytes is *almost always* exactly 472, and sent_bytes is nearly always 105.  If connections were actually being reused, I'd expect to see some variety here (multiple requests' worth of bytes on a single connection).

Believe me, I'm on board with the Iago mission... I don't like thread-based load drivers either.  I'm happy to try out other suggestions to understand what's going on here.

-Bob







James Waldrop

May 6, 2013, 1:27:58 PM
to iago-...@googlegroups.com
I agree, there's something confusing going on here.  We're digging on our side; it may be the version of Finagle we're using, which is woefully out of date.