How to limit the error traces from an agent

604 views

Pavani Manthana

Nov 20, 2018, 9:59:57 AM
to Glowroot
Hi Trask - Is there a way to limit the number of traces collected from an agent within a given time window, or based on error rate? We are noticing that when there are issues on the client side, the agent sends too many error traces to the collector, which affects it in our high-volume environment.

Thanks,
Pavani

Trask Stalnaker

Nov 28, 2018, 4:17:32 PM
to Glowroot
Hi Pavani,

Thanks for posting!

Do you have a ballpark for how many traces per minute start to cause heavy collector load in your environment? (I'm wondering what to set as the default limit.)

And/or maybe the limit should (also?) be based on total trace entries? Do you have a ballpark for how many total trace entries per minute start to cause heavy server load in your environment?

Trask

Pavani Manthana

Dec 3, 2018, 6:26:30 PM
to Glowroot
Hi Trask - I would say around 10-15 traces per minute per rollup. But I guess it also depends on how many rollups are having the issue. In our case there are about 1000 agents and 70+ rollups, but only 4-5 of them have heavy error traces, and that is causing slowness in aggregation.

Thanks,
Pavani

Trask Stalnaker

Dec 3, 2018, 7:48:59 PM
to Glowroot
Hi Pavani,

Since you are monitoring the collector itself, can you check the gRPC / Trace throughput on the collector, e.g.

[attached screenshot: 2018-12-03_16-47-28.png]


Thanks,
Trask

mailar...@gmail.com

Dec 5, 2018, 11:20:12 AM
to Glowroot

Hello Pavani,
     We are in the middle of rolling out a multi-collector setup; this is what our topology looks like:

  Client (agents) --> Virtual IP --> Nginx 1 --> Collector 1 --> Cassandra 1
                                             --> Collector 2 --> Cassandra 2
                                 --> Nginx 2 --> Collector 3 --> Cassandra 3
                                             --> Collector 4 --> Cassandra 4

It looks like you have lots of agents running and a fairly large setup.

Could you share some of your experiences? I'm also really interested in what your setup looks like: are you using client-side load balancing? What is your Cassandra cluster size, replication factor, etc.?
What are your data retention configurations?
And what is the client-side app load you are managing?

Appreciate your response.

AK.

Pavani Manthana

Dec 7, 2018, 1:14:06 PM
to Glowroot

[attached screenshot: grpc_trace.png]

Pavani Manthana

Dec 7, 2018, 1:30:27 PM
to Glowroot
Hi AK - We have two nginx load balancers in front of a 3-node collector cluster, which uses a 4-node Cassandra cluster (RF=3). The LBs handle the GUI traffic and the gRPC agent connections respectively.

We have a bunch of these clusters. Each cluster handles around 600-1100 agents. Most of our apps are microservices, so the agents are dynamic.
We have a requirement to retain data for at least 30 days and traces for 14 days.

Key learnings -
1. In high-load situations, the collectors that handle agent traffic are affected by the aggregation process. We have brought up collector instances that do nothing but aggregation, so the main collectors that handle gRPC connections do no aggregation at all. This helps us avoid disrupting agent connections when the aggregation collectors have to be bounced.
2. You will need connection pooling and appropriate settings, which Trask seems to have implemented in the latest version of Glowroot.
3. Aggregation still causes us heartache and has trouble keeping up in high-traffic situations.
4. Set up the nginx load balancers to limit the maximum connections to each node.
5. If you can, set any limits or connection-handling configuration at the nginx level rather than at the collectors, since once an agent connects to a collector, there is no way for the collector to prevent it from connecting again.
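As a minimal sketch of point 4 (hostnames, ports, and limits here are illustrative, not our actual values), nginx can cap connections per collector node with `max_conns` on each upstream server:

```nginx
# sketch only: cap concurrent connections per collector node
upstream glowroot_grpc {
    server collector1.internal:8181 max_conns=400;
    server collector2.internal:8181 max_conns=400;
    server collector3.internal:8181 max_conns=400;
}

server {
    listen 8181 http2;  # gRPC requires HTTP/2
    location / {
        grpc_pass grpc://glowroot_grpc;
    }
}
```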

Thanks,
Pavani

mailar...@gmail.com

Dec 8, 2018, 9:29:56 AM
to Glowroot
Pavani,
   I really appreciate your detailed response - all very important learnings. I faced point 1 and wasn't sure how to solve it. One idea I had was to make the aggregation collectors auto-scale, by running a layer of collectors on Docker and spinning them up based on load, though I never tried it. I like your idea of segregating collectors and will give it a go.
I see you have something similar going.

Thanks again, you are a life saver :-)

Thanks again Trask, for the wonderful work. 

Trask Stalnaker

Dec 11, 2018, 9:04:21 PM
to Glowroot
Hi AK and Pavani!

I made some changes to rollups (aggregation) in 0.12.3. Previously, rollups and gRPC requests fought each other for the right to queue up Cassandra writes. Now rollups and gRPC requests each have their own Semaphore limiting access to queue up Cassandra writes, which I think should make them coexist much better.
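A rough sketch of that idea (illustrative only, not the actual Glowroot code; the class name and permit counts are made up): each workload holds its own Semaphore, so exhausting one pool of permits never blocks the other.

```java
import java.util.concurrent.Semaphore;

// Illustrative sketch: rollups and gRPC ingest each get their own Semaphore
// guarding access to the Cassandra write queue, so a burst in one workload
// cannot starve the other of permits. Permit counts are made up.
class WriteQueuePermits {

    private final Semaphore grpcPermits = new Semaphore(512);
    private final Semaphore rollupPermits = new Semaphore(512);

    // returns false instead of blocking when the gRPC side is saturated,
    // leaving the rollup permits untouched
    boolean tryQueueGrpcWrite(Runnable write) {
        if (!grpcPermits.tryAcquire()) {
            return false;
        }
        try {
            write.run(); // stand-in for queueing an async Cassandra write
        } finally {
            grpcPermits.release();
        }
        return true;
    }

    boolean tryQueueRollupWrite(Runnable write) {
        if (!rollupPermits.tryAcquire()) {
            return false;
        }
        try {
            write.run(); // stand-in for queueing a rollup write
        } finally {
            rollupPermits.release();
        }
        return true;
    }
}
```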

Also, if you monitor your central collector(s) with glowroot agent (using the embedded collector), it would be really helpful if you could zip and upload the embedded collector's data folder to https://www.dropbox.com/request/pAKZ8qAARMXHKDoqzuxU so that I can review your central collector performance bottlenecks.

My preference is to find a way for aggregation and gRPC to co-exist so that you don't need to run separate nodes just dedicated to rollup (aggregation).

Another thought is that you may need to increase "cassandra.pool.maxRequestsPerConnection" in the glowroot-central.properties file (the default is 1024) if you have a lot of traffic. Starting in 0.12.3, the central collector exposes MBeans for the key Semaphores limiting access to the Cassandra write queue. If you add this set of gauges when monitoring the central collector, you can see whether you are frequently running out of permits, which would indicate you need to bump cassandra.pool.maxRequestsPerConnection:

    {
      "mbeanObjectName": "org.glowroot.central:type=*QuerySemaphore",
      "mbeanAttributes": [
        {
          "name": "AvailablePermits"
        },
        {
          "name": "QueueLength"
        }
      ]
    }
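For reference, the corresponding properties change would look like this (the value 4096 is purely illustrative; only the default of 1024 comes from the discussion above):

```properties
# glowroot-central.properties
# default is 1024; raise only if the Semaphore gauges show permit exhaustion
cassandra.pool.maxRequestsPerConnection=4096
```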

I'm very interested in making the central collector work smoothly for large installations, so please keep me in the loop and don't hesitate to ask questions / post issues.

Thanks,
Trask

Trask Stalnaker

Jan 2, 2019, 12:47:50 PM
to Glowroot
Hi Pavani!

Maybe good news on traces overloading the central collector.

This happened to me recently. When I took down and updated the central collector, one of the agents that was running a load test flooded the central collector with traces when it came back up, causing the central collector to run really slowly for a while. I found that the issue (at least in this case) was that the central collector was accepting all of the gRPC requests from the "problem" agent (de-serializing them) and only then throttling them when writing them to the database. This caused a problem because all of those accepted/de-serialized gRPC requests ran the system to the brink of running out of memory (it did slowly clear, but caused major GC-related slowness in the meantime).

I fixed the agent to not flood the central collector with requests (limiting to one gRPC request at a time of a given type) in https://github.com/glowroot/glowroot/commit/148ceb05f9f100d3fc888340a71f5e7a68f7e990.
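An illustrative sketch of the idea behind that fix (not the real agent code; the class and method names are made up): allow at most one in-flight request of a given type, skipping further sends until the pending one completes, so a freshly restarted collector is not flooded with a backlog of uploads.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch: at most one in-flight trace upload at a time;
// additional uploads attempted while one is pending are simply skipped.
class SingleInFlightSender {

    private final AtomicBoolean traceUploadPending = new AtomicBoolean(false);

    // returns true if the send was started, false if one is already in flight
    boolean trySendTrace(Runnable send) {
        if (!traceUploadPending.compareAndSet(false, true)) {
            return false; // a trace upload is already pending; skip this one
        }
        try {
            send.run(); // stand-in for the gRPC call
        } finally {
            traceUploadPending.set(false);
        }
        return true;
    }
}
```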

Thanks,
Trask

Prashant kumar Gupta

Apr 12, 2019, 11:24:38 PM
to Glowroot
Hi Trask,
I'm using Glowroot 0.12.3 (both agent & collector). We are facing an issue due to error traces. The agent is attached to a Spring Boot application which has an error rate of 30%. The number of error traces captured is around 8000 in 10 hours, and then the agent stops sending data to the collector with the message "not sending data to the central collector because pending request limit 100 exceeded".
Any suggestions?

Thanks!!

fairly accurate

Apr 13, 2019, 7:30:33 AM
to Glowroot
Additionally, we see the following logs:
2019-04-11 19:34:55.411 INFO  org.glowroot - Java version: 1.6.0_05 (BEA Systems, Inc. / Linux)
2019-04-11 19:34:55.415 INFO  org.glowroot - Java args: -Xms1024m -Xmx1024m -javaagent:/usr/local/wls/glowroot/coreApiAgent/glowroot.jar -Xverify:none -da
2019-04-11 19:34:57.235 WARN  org.glowroot - one or more important classes were loaded before Glowroot instrumentation could be applied to them: java.net.HttpURLConnection
2019-04-11 19:34:57.528 INFO  org.glowroot - agent id: XXXXXX
2019-04-11 19:34:58.296 WARN  o.g.a.s.c.google.protobuf.UnsafeUtil - platform method missing - proto runtime falling back to safer methods: java.lang.NoSuchMethodException: sun.misc.Unsafe.copyMemory(java.lang.Object, long, java.lang.Object, long, long)
2019-04-11 19:34:58.711 INFO  org.glowroot - connected to the central collector http://XXXXX version 0.12.4-SNAPSHOT, built 2019-01-07 05:00:08 +0000
2019-04-11 19:35:57.435 ERROR o.g.a.util.LazyPlatformMBeanServer - platform mbean server was never created by container


After some time, it kills the agent.

Trask Stalnaker

Apr 16, 2019, 5:47:08 PM
to Glowroot
Hi Prashant,

The agent version 0.13.0 has much better behavior under load due to https://github.com/glowroot/glowroot/commit/148ceb05f9f100d3fc888340a71f5e7a68f7e990.

Can you update to that version and see if you still have the issue?

Thanks,
Trask

Trask Stalnaker

Apr 16, 2019, 5:50:02 PM
to Glowroot
Hi fairly accurate,

What version of WebLogic are you running?

Thanks,
Trask

Prashant kumar Gupta

Apr 17, 2019, 12:06:46 AM
to Glowroot
I'll upgrade to 0.13.0 very soon and update you. I just wanted to know whether the problem is solved, and I think I got the answer.

Thanks!!

fairly accurate

Apr 17, 2019, 12:15:32 AM
to Glowroot

Thanks Trask,

The app is running on WebLogic 10.3 and Java 1.6.0_05 (BEA JRockit(R)).