Apache Flink vs Spark for stream processing in Jaeger

Gary Brown

May 4, 2017, 3:15:32 PM

to Jaeger Tracing

Hi

The original architecture diagram in the post https://eng.uber.com/distributed-tracing/ shows Apache Spark. However, in recent discussions Apache Flink has been mentioned as an alternative.

Would it be possible to provide the details behind the switch?

Regards
Gary

Yuri Shkuro

May 4, 2017, 3:35:37 PM

to Gary Brown, Jaeger Tracing

Actually, there are two parts here. First, there is the question of where the business logic for aggregation should reside. We recently decided to write that logic in Go, because we already have a lot of infrastructure in Go for reading traces from storage, converting them to the domain model, and applying various adjustments / enrichments. Rewriting all of that in Java would have been rather wasteful. So our current data pipeline looks something like this:

Stage 1:

collectors write spans to Kafka
a Flink job applies time window grouping and produces a stream of trace IDs that are considered "completed" (as in: no more spans are expected for these traces)
completed trace IDs are written to another Kafka topic
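
A rough sketch of the Stage 1 idea, written in Python purely for illustration; it is not the actual Flink job. It assumes a trace counts as "completed" once no span has arrived for it during a fixed quiet period, which is one possible reading of "time window grouping". The class name `CompletionTracker` and the parameter `inactivity_gap_s` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CompletionTracker:
    """Tracks the latest span timestamp seen for each trace ID."""

    # Hypothetical parameter: how long a trace must stay quiet
    # before it is treated as completed.
    inactivity_gap_s: float
    last_seen: dict = field(default_factory=dict)

    def observe(self, trace_id: str, span_time_s: float) -> None:
        # Remember the most recent span timestamp for this trace.
        prev = self.last_seen.get(trace_id, float("-inf"))
        self.last_seen[trace_id] = max(prev, span_time_s)

    def flush(self, now_s: float) -> list:
        # Emit, then forget, every trace ID that has been quiet
        # for at least the inactivity gap.
        done = sorted(
            trace_id
            for trace_id, ts in self.last_seen.items()
            if now_s - ts >= self.inactivity_gap_s
        )
        for trace_id in done:
            del self.last_seen[trace_id]
        return done


tracker = CompletionTracker(inactivity_gap_s=30.0)
tracker.observe("a", 0.0)
tracker.observe("b", 10.0)
tracker.observe("a", 5.0)
print(tracker.flush(36.0))  # only trace "a" has been quiet for 30 s
```

In a real deployment the streaming framework would own the windowing state and the emitted IDs would go to the second Kafka topic; the sketch only shows the grouping decision itself.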

After this first stage, a number of parallel aggregation jobs are running:

Stage 2:

read completed trace IDs from Kafka
make an RPC call to a Go service
Go service loads the trace from storage, extracts statistics sought by the job, and returns them (e.g. a set of dependency links [parent, child, callCount])
Flink job then aggregates statistics and writes the summary to storage every N minutes
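
To make the Stage 2 data flow concrete, here is a small Python sketch under stated assumptions; it is not the Go service or the Flink job themselves. `extract_links` stands in for the per-trace extraction of dependency links [parent, child, callCount], and `aggregate` stands in for the summing done by the streaming job. The span fields `span_id`, `parent_id`, and `service` are hypothetical, and skipping parent/child pairs within the same service is a choice made for this sketch, not something stated in this thread.

```python
from collections import Counter


def extract_links(trace):
    """Return sorted (parent, child, call_count) tuples for one trace.

    `trace` is a list of span dicts with hypothetical keys
    span_id, parent_id, and service.
    """
    service_by_span = {span["span_id"]: span["service"] for span in trace}
    counts = Counter()
    for span in trace:
        parent_service = service_by_span.get(span.get("parent_id"))
        # Root spans have no known parent; same-service calls are
        # skipped here (an assumption of this sketch).
        if parent_service is not None and parent_service != span["service"]:
            counts[(parent_service, span["service"])] += 1
    return [(p, c, n) for (p, c), n in sorted(counts.items())]


def aggregate(link_batches):
    """Sum call counts per (parent, child) pair across many traces."""
    totals = Counter()
    for links in link_batches:
        for parent, child, n in links:
            totals[(parent, child)] += n
    return dict(totals)


trace_1 = [
    {"span_id": 1, "parent_id": None, "service": "frontend"},
    {"span_id": 2, "parent_id": 1, "service": "api"},
    {"span_id": 3, "parent_id": 1, "service": "api"},
    {"span_id": 4, "parent_id": 2, "service": "db"},
]
trace_2 = [
    {"span_id": 1, "parent_id": None, "service": "frontend"},
    {"span_id": 2, "parent_id": 1, "service": "api"},
]
summary = aggregate([extract_links(trace_1), extract_links(trace_2)])
print(summary)
```

The periodic write to storage is left out; in the pipeline described above, the summary would be flushed by the streaming job at the end of each aggregation interval.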

In this approach, the amount of Java code is fairly minimal, and the actual streaming framework is not that important: it can be Flink, Spark, Storm, or anything else. We decided to try Flink instead of Spark because of internal politics and levels of infra support.

Our previous Spark solution, the one mentioned in the blog post, was very different: it was a batch solution rather than a streaming one, all of the code was in Scala, and the job had to read the entire Cassandra database once a day, which made it rather fragile.

Hope that helps.

--
You received this message because you are subscribed to the Google Groups "Jaeger Tracing" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jaeger-tracing+unsubscribe@googlegroups.com.
To post to this group, send email to jaeger-tracing@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/jaeger-tracing/acff6eb7-87d3-4a3b-b006-c2b7ee75edb7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Gary Brown

May 4, 2017, 3:50:18 PM

to Yuri Shkuro, Jaeger Tracing

Hi Yuri

Thanks for the information.

So, as with the recent discussion about possibly supporting various spanstore implementations (e.g. Elasticsearch, https://github.com/uber/jaeger/issues/140), it sounds like it would be good if more than one streaming framework could be supported, since (for the same reasons you cite) Spark may be our preferred option.

Regards

Gary


Gary Brown

May 5, 2017, 3:07:47 AM

to Yuri Shkuro, Jaeger Tracing

Any idea when this functionality will land in the open source project?

Will it be in a separate repo, or in the main jaeger repo?

Yuri Shkuro

May 5, 2017, 1:49:41 PM

to Gary Brown, Yuri Shkuro, Jaeger Tracing

> Any idea when this functionality will land in the open source project?

Don't know yet, we just started on the new Java + Go approach. Once we prove that it works in our setup, we'll push it to GitHub. The Go code should be in the main jaeger repo; the Java code (Flink / Spark integration) will be in a separate repo.


Gary Brown

May 7, 2017, 8:55:44 AM

to Jaeger Tracing

Would it be possible to get the Kafka publisher code/config supported in the server earlier? If so, I'll create a separate issue on GitHub.


Yuri Shkuro

May 7, 2017, 2:21:41 PM

to Gary Brown, Jaeger Tracing

Gary,

Can you open a ticket so we can discuss the requirements? Our current Kafka publisher is tied to Uber's infrastructure, e.g. it's not writing to Kafka directly, but via an HTTP proxy (which avoids heavy Kafka dependencies in the applications). And it is writing in a specific binary format that is supported by our streaming infrastructure, which may or may not be appropriate for open source.

Today the agents talk to collectors p2p, and collectors write to Kafka. Another possible deployment model is to have agents write to Kafka and collectors read from it. There are pros and cons in both approaches (and it's not very difficult to support both, it's just a matter of wiring the final apps).


Gary Brown

May 8, 2017, 6:01:00 AM

to Jaeger Tracing

Done: https://github.com/uber/jaeger/issues/150

For now I think we should just publish from the server, although it would be good to support both approaches eventually to provide greater choice for users.

Jalpesh Shelar

Mar 16, 2020, 3:55:41 AM

to Jaeger Tracing

Hi Yuri

I am new to Jaeger.

We have span data in an Elasticsearch backend.

We want to merge the spans into traces and store those traces in Elasticsearch.

So we are planning to use Kafka and Flink.

Please suggest how to proceed.

Thanks & Regards,

Jalpesh Shelar

