Hi Gatling team,
I am using Graphite to store both load-related metrics from Gatling and resource metrics from the system under test (SUT) in one place. From there I use the Graphite REST API to present my load test results in (real-time) graph dashboards, compare metrics between different builds, and organize my test runs per project.
I found that the 1-second retention Gatling requires causes some trouble in Graphite for my use case:
- Opening the results of a 48-hour load test leads to very slow queries (which makes sense considering the number of datapoints).
- Retaining datapoints at this granularity requires a lot of storage space.
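To put rough numbers on this, here is a back-of-the-envelope sketch in Python. It assumes 12 bytes per datapoint (the size Whisper uses for one stored point) and ignores file headers and any additional archives, so treat the result as an estimate per metric rather than an exact figure. The function names are made up for illustration.

```python
# Rough estimate of datapoints and storage for ONE metric.
# Assumption: 12 bytes per stored datapoint, header overhead ignored.
BYTES_PER_POINT = 12

def datapoints(duration_hours, resolution_seconds):
    """Number of datapoints needed to cover the duration at the given resolution."""
    return duration_hours * 3600 // resolution_seconds

def storage_bytes(duration_hours, resolution_seconds):
    """Approximate bytes needed to retain those datapoints."""
    return datapoints(duration_hours, resolution_seconds) * BYTES_PER_POINT

for resolution in (1, 10):
    print(resolution, datapoints(48, resolution), storage_bytes(48, resolution))
# 1 172800 2073600
# 10 17280 207360
```

Multiplied by the number of simulations, requests, statuses and metric types, the factor of ten adds up quickly.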
To tackle these issues I set up a carbon-aggregator process that aggregates the incoming Gatling data into 10-second intervals using these aggregation rules:
gatling2.<simulation>.users.<scenario>.active (10) = avg gatling2.<simulation>.users.<scenario>.active
gatling2.<simulation>.users.<scenario>.done (10) = avg gatling2.<simulation>.users.<scenario>.done
gatling2.<simulation>.users.<scenario>.waiting (10) = avg gatling2.<simulation>.users.<scenario>.waiting
gatling2.<simulation>.<request>.<status>.count (10) = sum gatling2.<simulation>.<request>.<status>.count
gatling2.<simulation>.<request>.<status>.max (10) = avg gatling2.<simulation>.<request>.<status>.max
gatling2.<simulation>.<request>.<status>.min (10) = avg gatling2.<simulation>.<request>.<status>.min
gatling2.<simulation>.<request>.<status>.percentiles95 (10) = avg gatling2.<simulation>.<request>.<status>.percentiles95
gatling2.<simulation>.<request>.<status>.percentiles99 (10) = avg gatling2.<simulation>.<request>.<status>.percentiles99
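To make the effect of these rules concrete, here is a minimal Python sketch of what a 10-second window does to 1-second datapoints. It only illustrates the bucketing idea and is not carbon-aggregator's actual code; the function name `aggregate` and the sample values are made up.

```python
from collections import defaultdict

def aggregate(points, interval, method):
    """Group (timestamp, value) points into fixed windows and reduce each window.

    method is "avg" or "sum", mirroring the two methods used in the rules above.
    """
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align each timestamp to the start of its window.
        buckets[timestamp - timestamp % interval].append(value)
    result = {}
    for start, values in sorted(buckets.items()):
        total = sum(values)
        result[start] = total if method == "sum" else total / len(values)
    return result

# Twenty 1-second datapoints: 5 requests per second, 100 active users.
counts = [(t, 5) for t in range(20)]
active = [(t, 100) for t in range(20)]

print(aggregate(counts, 10, "sum"))  # {0: 50, 10: 50}
print(aggregate(active, 10, "avg"))  # {0: 100.0, 10: 100.0}
```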
This solved the slow queries and the storage problem, but introduced some new issues:
- Using the "avg" aggregation method for the min, max and percentile metrics does not forward the correct values. Instead it forwards the 10-second average of these datapoints, not the true minimum, maximum or percentile for the interval.
- It seems that if, for some reason (network latency, lost packets?), datapoints from Gatling do not reach the carbon-aggregator (in time), this results in "hiccups" in the "count" metrics. My "transactions per second" graph then shows a dip in throughput that I can't correlate with any SUT resource graph or transaction response times.
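Both effects are easy to reproduce with made-up numbers. The Python sketch below uses hypothetical per-second values for a single 10-second window; it does not reflect real measurements.

```python
# Hypothetical per-second values for one 10-second window.
per_second_max = [120, 130, 110, 900, 125, 115, 140, 135, 120, 105]  # ms
per_second_count = [50] * 10  # requests per second

# Issue 1: "avg" of the per-second maxima hides the real worst case.
avg_of_max = sum(per_second_max) / len(per_second_max)
true_max = max(per_second_max)
print(avg_of_max, true_max)  # 200.0 900

# Issue 2: if two datapoints miss the window, the summed count drops
# although the real throughput did not change.
received = per_second_count[:8]
print(sum(per_second_count), sum(received))  # 500 400
```

The 900 ms spike is flattened to 200 ms, and the missing datapoints show up as a 20% dip in throughput.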
What would be really convenient for my use case is if Gatling had an option to configure a 10-second Graphite push interval and aggregated the metric data accordingly before sending it. Would you consider implementing such a feature?
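As a rough illustration of what aggregating before sending could look like, here is a sketch in Python. It is not Gatling's actual code: the function names, the nearest-rank percentile method and the sample data are all assumptions. The point is that min, max and percentiles are computed from all response times collected during the push interval, rather than from per-second summaries.

```python
import math

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def summarize(response_times_ms):
    """Reduce all response times collected during one push interval."""
    values = sorted(response_times_ms)
    return {
        "count": len(values),
        "min": values[0],
        "max": values[-1],
        "percentiles95": percentile(values, 95),
        "percentiles99": percentile(values, 99),
    }

# Hypothetical response times gathered over one 10-second interval.
window = list(range(1, 101))  # 1..100 ms
print(summarize(window))
```

With this approach each 10-second datapoint carries the true extremes and percentiles for its interval, and the count does not depend on every 1-second datapoint arriving at an external aggregator in time.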
Cheers
Daniel