Zipkin collector performance issue


RAUNAK AGRAWAL

Dec 11, 2014, 6:07:57 AM
to zipki...@googlegroups.com
Hi,

I have been trying out Zipkin and integrated it with the Java Brave library as my client, which pushes spans and traces directly to the Zipkin collector. But surprisingly, as soon as I put a load of 100 QPS on it, the Zipkin collector started throwing errors. Any idea what is going on?


WAR [20141128-08:56:14.596] nio: Failed to accept a connection.
WAR [20141128-08:56:14.596] nio: java.io.IOException: Too many open files
WAR [20141128-08:56:14.596] nio:     at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
WAR [20141128-08:56:14.596] nio:     at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:241)
WAR [20141128-08:56:14.596] nio:     at org.jboss.netty.channel.socket.nio.NioServerBoss.process(NioServerBoss.java:100)
WAR [20141128-08:56:14.596] nio:     at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
WAR [20141128-08:56:14.596] nio:     at org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
WAR [20141128-08:56:14.596] nio:     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
WAR [20141128-08:56:14.596] nio:     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
WAR [20141128-08:56:14.596] nio:     at java.lang.Thread.run(Thread.java:744)
ERR [20141128-08:56:14.589] processor: indexSpanDuration
ERR [20141128-08:56:14.589] processor: org.jboss.netty.channel.ChannelException: Failed to open a socket.
ERR [20141128-08:56:14.589] processor:     at org.jboss.netty.channel.socket.nio.NioClientSocketChannel.newSocket(NioClientSocketChannel.java:43)
ERR [20141128-08:56:14.589] processor:     at org.jboss.netty.channel.socket.nio.NioClientSocketChannel.<init>(NioClientSocketChannel.java:82)
ERR [20141128-08:56:14.589] processor:     at org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.newChannel(NioClientSocketChannelFactory.java:212)
ERR [20141128-08:56:14.589] processor:     at org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.newChannel(NioClientSocketChannelFactory.java:82)
ERR [20141128-08:56:14.589] processor:     at com.twitter.finagle.netty3.Netty3Transporter$$anonfun$2.apply(Netty3Transporter.scala:138)
ERR [20141128-08:56:14.589] processor:     at com.twitter.finagle.netty3.Netty3Transporter$$anonfun$2.apply(Netty3Transporter.scala:138)
ERR [20141128-08:56:14.589] processor:     at com.twitter.finagle.netty3.Netty3Transporter.com$twitter$finagle$netty3$Netty3Transporter$$newConfiguredChannel(Netty3Transporter.scala:301)
ERR [20141128-08:56:14.589] processor:     at com.twitter.finagle.netty3.Netty3Transporter$$anonfun$14.apply(Netty3Transporter.scala:308)
ERR [20141128-08:56:14.589] processor:     at com.twitter.finagle.netty3.Netty3Transporter$$anonfun$14.apply(Netty3Transporter.scala:308)
ERR [20141128-08:56:14.589] processor:     at com.twitter.finagle.netty3.ChannelConnector.apply(Netty3Transporter.scala:38)
ERR [20141128-08:56:14.589] processor:     at com.twitter.finagle.netty3.Netty3Transporter.apply(Netty3Transporter.scala:310)
ERR [20141128-08:56:14.589] processor:     at com.twitter.finagle.netty3.Netty3Transporter$$anon$3.apply(Netty3Transporter.scala:159)
ERR [20141128-08:56:14.589] processor:     at com.twitter.finagle.client.StackClient$$anon$2$$anonfun$make$1.apply(StackClient.scala:163)
ERR [20141128-08:56:14.589] processor:     at com.twitter.finagle.client.StackClient$$anon$2$$anonfun$make$1.apply(StackClient.scala:163)
ERR [20141128-08:56:14.589] processor:     at com.twitter.finagle.ServiceFactory$$anon$7.apply(Service.scala:171)
ERR [20141128-08:56:14.589] processor:     at com.twitter.finagle.ServiceFactory$$anon$5.apply(Service.scala:134)
ERR [20141128-08:56:14.589] processor:     at com.twitter.finagle.ServiceFactory$$anon$5.apply(Service.scala:134)
ERR [20141128-08:56:14.589] processor:     at com.twitter.finagle.ProxyServiceFactory$class.apply(Service.scala:179)
ERR [20141128-08:56:14.589] processor:     at com.twitter.finagle.builder.ClientBuilder$$anonfun$codec$2$$anon$6$$anon$1.apply(ClientBuilder.scala:344)
ERR [20141128-08:56:14.589] processor:     at com.twitter.finagle.pool.CachingPool.apply(CachingPool.scala:55)
ERR [20141128-08:56:14.589] processor:     at com.twitter.finagle.pool.WatermarkPool.apply(WatermarkPool.scala:128)
ERR [20141128-08:56:14.589] processor:     at com.twitter.finagle.Filter$$anon$3.apply(Filter.scala:81)
ERR [20141128-08:56:14.589] processor:     at com.twitter.finagle.service.FailureAccrualFactory.apply(FailureAccrualFactory.scala:86)
ERR [20141128-08:56:14.589] processor:     at com.twitter.finagle.Filter$$anon$3.apply(Filter.scala:81)
ERR [20141128-08:56:14.589] processor:     at com.twitter.finagle.Filter$$anon$3.apply(Filter.scala:81)
ERR [20141128-08:56:14.589] processor:     at com.twitter.finagle.Filter$$anon$3.apply(Filter.scala:81)
ERR [20141128-08:56:14.589] processor:     at com.twitter.finagle.loadbalancer.HeapBalancer.apply(HeapBalancer.scala:215)

RAUNAK AGRAWAL

Dec 11, 2014, 6:48:21 AM
to zipki...@googlegroups.com
Mind you, the file descriptor limit is set to 1024. Shouldn't that be enough?
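
For what it's worth, here is a rough sketch of how a JVM process can report its own descriptor usage, so I could drop something like it into my client service and watch how close it gets to that 1024 limit. It assumes a HotSpot/OpenJDK runtime on a Unix-like OS, where the platform bean implements com.sun.management.UnixOperatingSystemMXBean; the class name is just for illustration:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Rough sketch: report how many file descriptors this JVM process holds
// against its limit. Only works where the platform OS bean is a
// com.sun.management.UnixOperatingSystemMXBean (HotSpot/OpenJDK on Unix).
public class FdUsage {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                (com.sun.management.UnixOperatingSystemMXBean) os;
            System.out.println("open fds: " + unix.getOpenFileDescriptorCount()
                + " / max: " + unix.getMaxFileDescriptorCount());
        } else {
            System.out.println("file descriptor stats not available on this JVM");
        }
    }
}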


Eirik Sletteberg

Dec 11, 2014, 4:51:34 PM
to zipki...@googlegroups.com
This line:
at com.twitter.finagle.builder.ClientBuilder$$anonfun$codec$2$$anon$6$$anon$1.apply(ClientBuilder.scala:344)
makes me think it has something to do with the connection to the database/Cassandra (since it's the ClientBuilder). Is the database slow, underprovisioned, or behind high network latency?

RAUNAK AGRAWAL

Dec 15, 2014, 6:52:59 AM
to zipki...@googlegroups.com
Hi Eirik,

Thanks for looking into this. Yes, the Cassandra machine is underprovisioned: it's just a one-node cluster, and on a rather underpowered machine at that :)

Kristof Adriaenssens

Dec 24, 2014, 4:29:49 AM
to zipki...@googlegroups.com
You can get some statistics from your zipkin-collector instances that indicate whether you are underprovisioned.
When you run

curl http://<zipkin-host>:9900/stats.txt

you get a bunch of information (9900 is the default admin port, if I'm not mistaken). The following metrics:

collector.log: 909314296
collector.pushback: 185210

indicate whether the collector is keeping up with persisting the spans it receives. The collector.pushback number counts spans that are dropped by the collector, which results in incomplete traces.
The collector.log number counts spans that were successfully logged and persisted.

So if you have a high pushback number, you should upgrade/tune your Cassandra cluster, increase the number of collectors, or lower the sample rate.
I recently found out that you can also change the sample rate in the zipkin-collector itself, in addition to changing the sample rate for your services.
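
As a rough illustration (not something that ships with zipkin), a small Java helper along these lines could poll those two counters and print the drop ratio. The URL is a placeholder, and the parsing simply assumes the plain "name: value" lines shown above:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Rough sketch: fetch stats.txt from the collector admin port and report
// the fraction of spans the collector pushed back (dropped) versus logged.
// The URL/port are placeholders; adjust to your collector's admin address.
public class CollectorPushback {
    public static void main(String[] args) throws Exception {
        String statsUrl = args.length > 0 ? args[0] : "http://localhost:9900/stats.txt";
        long logged = 0, pushback = 0;
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(statsUrl).openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                // lines look like "collector.log: 909314296"
                String[] parts = line.split(":\\s*", 2);
                if (parts.length != 2) continue;
                if (parts[0].trim().equals("collector.log")) logged = Long.parseLong(parts[1].trim());
                if (parts[0].trim().equals("collector.pushback")) pushback = Long.parseLong(parts[1].trim());
            }
        }
        long total = logged + pushback;
        double dropped = total == 0 ? 0.0 : 100.0 * pushback / total;
        System.out.printf("logged=%d pushback=%d dropped=%.2f%%%n", logged, pushback, dropped);
    }
}

With the numbers above, that works out to roughly 0.02% of spans being dropped.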


