Realtime dependency calculation

Alexander Jeyaraj

Jun 25, 2018, 2:58:46 AM
to Jaeger Tracing
We use the API

api/dependencies?endTs=xxxxxxxxxxxxxxxxx&lookback=xxxxxxxxxx

to get the dependencies, along with the number of calls made between services in a given duration. This worked fine on all-in-one, but in production we run spark-dependencies once every 3 hours, and the number of calls is inaccurate. When the lookback time is less than 6 hours, the API doesn't return any data; sometimes I have to go 20 hours back to get any data. The data returned also doesn't seem accurate: we hit the service 5 times every 10 minutes, but when queried, the number of calls does not match the duration.
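For reference, both endTs and lookback in this query are millisecond values (an epoch timestamp and a duration, respectively). A minimal sketch of building such a query; the query-service host and port are assumptions:

```python
import time
from urllib.parse import urlencode

def dependencies_url(base_url, lookback_ms, end_ts_ms=None):
    """Build a /api/dependencies query; endTs and lookback are milliseconds."""
    if end_ts_ms is None:
        end_ts_ms = int(time.time() * 1000)  # default: "now" in epoch ms
    return base_url.rstrip("/") + "/api/dependencies?" + urlencode(
        {"endTs": end_ts_ms, "lookback": lookback_ms}
    )

# e.g. the last 6 hours ending at a fixed timestamp:
dependencies_url("http://localhost:16686", 6 * 60 * 60 * 1000, 1530000000000)
```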

When we use the API

api/traces?start=xxxxxxxxxxxxxx&lookback=1h

the data returned matches the calls we made.
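For comparison, a sketch of the traces query above; the service name is an illustrative assumption, and note that per Jaeger's query API the start parameter is in microseconds while lookback takes a human-readable duration:

```python
from urllib.parse import urlencode

# Hypothetical parameters for an api/traces query (service name assumed).
params = {
    "service": "frontend",
    "start": 1530000000000000,  # epoch MICROseconds, unlike endTs above
    "lookback": "1h",
}
traces_query = "api/traces?" + urlencode(params)
```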

My questions are:
1. Am I missing something about the way spark-dependencies works?
2. Are there any configuration parameters I can tweak to make the reporting accurate?


There was some discussion on this at https://github.com/jaegertracing/jaeger/issues/215, but the solution seems to be spark-dependencies, which is not working for our case. Ideally, we would like to generate accurate dependency counts in real time. If that isn't possible, what would be a reasonable lookback time to expect?

Yuri Shkuro

Jun 25, 2018, 11:50:46 AM
to Jaeger Tracing
iirc the spark job only writes data once a day. Not sure if it has a configuration option to use a shorter window.

Alexander Jeyaraj

Jun 26, 2018, 12:52:51 AM
to Jaeger Tracing
Thanks Yuri. You are right, it builds the dependencies for spans since midnight. We run this job every three hours, so the expectation is that every 3 hours the dependencies get regenerated for that day since midnight. However, in the API the lookback time must be greater than 21600000 ms (6 hours); below this value we get no dependencies. Another thing: when I increase the lookback to 7 or 8 hours, the callCount between services remains the same (while the expectation is that it increases by 30, as we generate 30 calls per hour).

Also, we even reduced the cron job interval to 10 minutes, but the result is the same. Ours is a small test cluster with 4 services and 30 calls per hour, and the job finishes in less than a minute. Not sure what is causing this discrepancy.

Yuri Shkuro

Jun 26, 2018, 9:58:43 AM
to Jaeger Tracing
Have you checked what actually gets stored in the database? Dependencies are recorded as a simple blob, so they're easy to inspect.

Alexander Jeyaraj

Jun 27, 2018, 1:47:32 AM
to Jaeger Tracing
OK, I went through the dependencies in Elasticsearch, and that put things in perspective.

Observations:
1. The timestamp for all the records is set with 'date' granularity; hour, minute, and second are all zero.
2. Each time the job runs, it creates a dependency entry for that day.
3. When queried with a lookback time, the API adds up all the records whose dates fall in the time range. So when the lookback is less than 24 hours, we may get one or two days' worth of data, depending on the time of querying.
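The interaction between day-granularity timestamps and the lookback window can be sketched with a toy model (field names are illustrative assumptions, not Jaeger's actual storage schema):

```python
from datetime import datetime, timedelta

# Toy model: each job run writes one record stamped at midnight, and a
# lookback query sums callCount over records in [endTs - lookback, endTs].
def total_call_count(records, end_ts, lookback):
    start = end_ts - lookback
    return sum(r["callCount"] for r in records if start <= r["ts"] <= end_ts)

midnight = datetime(2018, 6, 27)
records = [
    {"ts": midnight, "callCount": 150},  # first run of the day
    {"ts": midnight, "callCount": 180},  # later run, covering the same spans
]
end = midnight + timedelta(hours=20)  # querying at 8 PM

total_call_count(records, end, timedelta(hours=24))  # 330: double-counted
total_call_count(records, end, timedelta(hours=6))   # 0: midnight is out of range
```

This reproduces both symptoms: duplicate same-day records are summed together, and a lookback shorter than the time elapsed since midnight matches no records at all.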

Conclusion:
1. Run the spark job only once a day. Running it multiple times creates duplicate records for that day and inflates the call counts when queried with a lookback time.
2. Do not use the call counts for real-time reporting; they are reliable only for the previous day.
3. Use the dependencies API only for detecting dependencies, not for determining how many calls were made between them.

Having said that, is there any work planned to get call counts in real time, or is that something that will never happen?

Yuri Shkuro

Jun 27, 2018, 9:48:11 AM
to Jaeger Tracing
Good summary.

Regarding the roadmap, a couple of things.

1) This summer we are planning to open-source a Kafka/Flink-based data pipeline and a dependencies job that would run continuously and flush the aggregated service graph more frequently, e.g. every 15 min (configurable).

2) There are no concrete plans for accurate real-time counts in the service graph, because most orgs use Jaeger with head-based sampling, which makes accurate counts impossible (you can only extrapolate based on sampling probabilities). We plan to do a proof of concept for tail-based sampling this year; then we might reconsider real-time measures.
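The extrapolation mentioned above is just a division by the sampling probability; a hypothetical illustration, not Jaeger code:

```python
# With head-based sampling at probability p, an observed span count n
# only estimates the true call count as n / p.
def extrapolate_call_count(observed, sampling_probability):
    return observed / sampling_probability

extrapolate_call_count(30, 0.001)  # 30 sampled calls at p=0.001 suggest ~30000 real calls
```

The estimate's variance grows as p shrinks, which is one reason sampled counts cannot be treated as accurate.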

Alexander Jeyaraj

Jun 28, 2018, 12:17:13 AM
to Jaeger Tracing
Thanks Yuri. A 15-minute frequency will be of great help. Looking forward to that feature.