Tuning Taurus for high-volume tests


Jarrod P

Aug 9, 2018, 12:12:32 AM
to codename-taurus
I have a test that pushes ~100,000 transactions/hour, all of them JMS requests.

The problem I'm having is that Taurus (1.12.1) is running at 100% CPU (JMeter is in distributed mode on multiple different servers). I've added the config items below to the Taurus definition, but they don't seem to help. Yes, I can bump the server up to a larger size, but I would like to exhaust all avenues for tuning Taurus before throwing more money at it.

Are there any other tweaks I can make to reduce the processing Taurus does? During the test I don't need any response time processing, as all the requests are asynchronous. I also do not have any passfail assessments. After setting check-interval to 30s, I noticed that the engine loop went down from 100 to 1.

modules:
  consolidator:
    max-buffer-len: 2h      # maximal length of buffer (default: infinity)
    rtimes-len: 500         # size of storage for response time values (default: 1000)  
    percentiles:            # percentile levels to track
    - 0.0
    - 50.0
    - 95.0
    - 100.0
  console:
    disable: true

settings:
  check-interval: 30s 
  check-updates: false


services:
- module: monitoring
  local:
  - interval: 10s 
    metrics:
    - cpu
    - disk-space
    - engine-loop

Andrey Pokhilko

Aug 9, 2018, 4:21:14 AM
to codenam...@googlegroups.com

Hi,

check-interval and engine-loop are related like this: engine-loop = <time Taurus spent processing modules in one iteration> / check-interval. So it makes sense that engine-loop goes down when you increase check-interval, but I don't think that solves the problem you observe. The tweaks you made look fine, but if they don't help, the problem is somewhere else.
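
As a rough illustration of the formula above, here it is with made-up numbers. The 0.5 s of processing time per iteration is hypothetical, not measured from this test, and it is assumed to stay the same when check-interval changes:

```latex
\[
\text{engine-loop} = \frac{t_{\text{processing}}}{\text{check-interval}},
\qquad
\frac{0.5\,\text{s}}{1\,\text{s}} = 0.5,
\qquad
\frac{0.5\,\text{s}}{30\,\text{s}} \approx 0.017
\]
```

Under that assumption, a longer check-interval lowers the ratio without reducing the work done per iteration, which is why a lower engine-loop value alone says little about CPU load.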

100K transactions/hr ≈ 1,667 transactions/min ≈ 28 transactions/sec, so the hit rate does not look high. Taurus can process thousands of hits/s on an average CPU.

To help diagnose the problem, can you add the -v flag, run the problematic test again, and send me bzt.log (privately if needed; a zip of all artifacts would be awesome)? Please leave check-interval alone, as changing it does not help in your case. Also, don't disable the console; it does no harm. It would also be really useful to have engine health collected for this test: can you run it with the "-report" flag and share the report link with me?

That will give me much more information to spot the problem. Currently I suspect that you have very long response times, which makes the aggregator subsystem spend a lot of time arranging histograms.

One more experiment comes to mind: also set "min-buffer-len: 1h". This causes Taurus to buffer most of the results it reads and process them at the very end of the execution. It also means Taurus will use RAM for that buffer, which may cause problems of its own.
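
A minimal sketch of how that experiment might look when merged into the consolidator settings from the original post. The values come from this thread; the assumption that max-buffer-len should stay at or above min-buffer-len is mine:

```yaml
modules:
  consolidator:
    min-buffer-len: 1h   # suggested experiment: hold results back before aggregating
    max-buffer-len: 2h   # value from the original config; assumed to need to be >= min-buffer-len
```

The buffered results are kept in memory, so it is worth watching RAM on the Taurus host while the test runs.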

--

Andrey Pokhilko
Open Source Initiatives Leader
CA
          BlazeMeter


Jarrod P

Aug 13, 2018, 12:38:30 AM
to codename-taurus

What's the best way to get the data to you securely?

Andrey Pokhilko

Aug 13, 2018, 2:58:35 AM
to codenam...@googlegroups.com

Just send me a direct email at this address.


Jarrod P

Aug 15, 2018, 2:25:45 AM
to codename-taurus
A few things to note:
- We get a lot of failures, which are expected on our side and do not impact test validity. I'm not sure whether they contribute to the high CPU usage.
- bzt uses 100% of one core. JMeter is also running on the same box and consumes ~200%. It's an m4.xlarge (4 cores). Overall usage is about 60%.
- bzt dies unexpectedly while aggregating the results. JMeter continues running in the background, but the death of Taurus tricks my wrapper into thinking the test has finished, and it then tears down the environment. I've run a test with min-buffer-len set to 2 hours and max-buffer-len set to 3 hours. That test ran fine for the 2 hours but died while aggregating results. When min-buffer-len is not set, Taurus dies after completing ramp-up.
- The slaves look okay resource-wise.
- One of the error .jtl files was large (130 MB) due to the high number of errors.

I run Taurus from a Python wrapper script that handles a number of tasks such as environment provisioning and results collation. In top I'm seeing bzt as the process with high CPU (not python generically). I've also run the test directly through Taurus without my wrapper, and the performance is the same.

Andrey Pokhilko

Aug 17, 2018, 12:18:04 AM
to codenam...@googlegroups.com

Hi,

From your logs, I see nothing obvious that points to the problem.

One suspect is the process of aggregating results from JMeter. Usually that problem is diagnosed from the post-process phase: after Taurus gets slow and you press Ctrl+C, it switches back to detailed logging, and the details can be found in bzt.log.

Can you run your test until it gets slow and then shut Taurus down gracefully, so that all post-processing completes? Then share bzt.log with me.

Another thing to try is installing the latest snapshot (http://gettaurus.org/docs/DeveloperGuide/#Python-Egg-Snapshots), because we've recently made important performance improvements in the aggregator module.


Sachin Patel

Aug 21, 2023, 9:54:11 AM
to codename-taurus
Hi, 

I am observing a similar problem with a performance test I am running. Here are the specifications of the JMeter test:

- The test sends messages to IBM MQ. There are two transactions: 1. Create the JMS session and create a message (each message must be unique); 2. Send the message and close the session.
- The requirement is to exceed 300 hits/s overall. Essentially, the test needs to reach 150 hits/s for sending messages; counting the creation of the message and JMS session, the cumulative rate is 300 hits/s.
- The average response time for sending a message is 0.03 seconds.
- The modules section is set up as follows:
modules:
  jmeter:
    path: ${JMETER_BIN_PATH}/jmeter
    properties:
      basedir: ${JMETER_HOME}
      output: ${TAURUS_ARTIFACTS_DIR}/output/
    memory-xmx: 15G
    cpu: 8
    detect plugins: true

- The test uses a Concurrency Thread Group (with the property ${__tstFeedback(TST_Name,1,150,10)}) and a Throughput Shaping Timer set to 300 RPS. Because of this, the yml file doesn't specify concurrency or duration.
- I have tried running this test on a high-spec performance testing pod (16 GB memory and 10 CPU cores). The test has been run with 1/10/50/100/150/200... up to 1000 threads. However, the throughput never goes above 200 hits/s.
- To confirm this wasn't a limitation of IBM MQ, I ran a second instance of the bzt test on the same machine, side by side with the first. The second instance ran fine but also peaked at 200 hits/s, which shows that IBM MQ can process more than 200 hits/s in total.

What other limitation could prevent a single instance of Taurus from processing more than 200 hits/s, when two instances together work fine? I also need to run a spike test that will go above 300 hits/s for a short period. How can I measure that if I cannot run it from one instance of Taurus?

Regards, 

Sachin  

DT

Aug 21, 2023, 10:15:48 AM
to codename-taurus
Taurus doesn't generate any load per se. If you're using the JMeter executor, you need to tune JMeter, not Taurus, for high loads.

It might be that you're hitting the limit of JMeter's maximum heap size, resulting in frequent garbage collection that slows the JMeter engine down. If so, you can increase it. If you need more tuning, you can apply it to a JMeter installation and configure Taurus to use that JMeter instance.

Another possibility is that your machine cannot generate that load because of resource limits (CPU, RAM, etc.). If so, you can start as many JMeter slave machines as needed and tell Taurus to run the JMeter test in distributed mode. See the article "How to Perform Distributed Testing in JMeter" for more information.
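
A minimal sketch of what the two suggestions above might look like in a Taurus config. The host names, scenario name, script file, and the 4G heap value are all hypothetical placeholders, not taken from this thread:

```yaml
execution:
- distributed:                     # JMeter servers that generate the load (hypothetical hosts)
  - jmeter-server-1.example.com
  - jmeter-server-2.example.com
  scenario: mq-send

scenarios:
  mq-send:
    script: mq_test.jmx            # hypothetical existing JMeter script

modules:
  jmeter:
    memory-xmx: 4G                 # max heap for the JMeter process Taurus starts; illustrative value
```

This assumes jmeter-server is already running on each listed host. The heap of those remote servers would presumably be tuned on the hosts themselves rather than through this setting.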