[akka-stream] How to implement Spark's cogroup/join operation

Jakub Liska

Sep 27, 2016, 8:53:28 AM
to Akka User List
Hey,

I'm reimplementing a few Spark batch jobs as Akka Streams.

I got stuck on the last one. It takes two PairRDD[Key, Value]s and cogroups them by Key, which returns an RDD[Key, Seq[Value]]. It then processes the Seq[Value] for each Key that is present in both original PairRDDs, which is a rather batch-like operation. Moreover, Key cardinality is high: roughly 50% of the keys are unique.

So if I merged those two Sources and used groupBy, it would create as many SubFlows as there are distinct Keys, which could be up to 5 million.

So my questions are:

1) Is there another way to do this? Note that I cannot use reduce-like ops; I need the Seq[Value] physically present when the stream ends.
2) If not, is it OK to have around 5M tiny SubFlows?
3) What should the parallelism be for this kind of groupBy operation in mergeSubstreamsWithParallelism?
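For reference, the cogroup semantics being ported can be sketched with plain Scala collections (a toy illustration only, not Akka Streams code; the helper name `cogroup` and the sample data are made up):

```scala
// A toy stand-in for Spark's cogroup on two keyed collections:
// for every key present in either input, collect the values from both sides.
def cogroup[K, V](left: Seq[(K, V)], right: Seq[(K, V)]): Map[K, (Seq[V], Seq[V])] = {
  val l = left.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
  val r = right.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
  (l.keySet ++ r.keySet).map { k =>
    (k, (l.getOrElse(k, Seq.empty[V]), r.getOrElse(k, Seq.empty[V])))
  }.toMap
}

// Keys present in BOTH inputs are those with two non-empty value sequences.
val grouped = cogroup(Seq(1 -> "a", 2 -> "b"), Seq(2 -> "c", 3 -> "d"))
// grouped(2) == (Seq("b"), Seq("c"))
```

The processing described above would then run only over the entries where both sequences are non-empty.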

Jakub Liska

Sep 29, 2016, 12:03:55 PM
to Akka User List
I'm trying to figure out why this is hanging/idling indefinitely:

Source.fromIterator(() => Iterator.from(0).take(500).map(_ -> 1))
  .groupBy(Int.MaxValue, _._1)
  .mergeSubstreamsWithParallelism(256)
  .runWith(Sink.seq)

This seems to be the only way to avoid instantiating a ridiculous number of substreams.

Akka Team

Oct 7, 2016, 3:33:34 PM
to Akka User List
If you groupBy with an upper bound of Int.MaxValue substreams, you can at some point have more than 256 open substreams, which is the limit you give mergeSubstreamsWithParallelism. The merge will then backpressure indefinitely: none of the substreams will complete until upstream completes, and upstream is now being backpressured.

5 million tiny subflows with minimal processing each will mean a lot of overhead for materializing those substreams, and will take a lot of memory on top of the large amount you will already need to actually collect your values per key in memory.

groupBy is likely not a good fit for this. I'd say a custom graph stage is the way to go if you need to do it with streams, or just fold over some suitable data structure where you can collect elements per key.
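The folding approach suggested here can be sketched with a plain foldLeft standing in for the stream (a minimal sketch; the sample `merged` data is made up, and in Akka Streams the same accumulator would run as the zero/function arguments of Sink.fold over the merged source):

```scala
// Collect values per key by folding over the merged key-value stream,
// instead of materializing one substream per key.
val merged: Seq[(Int, Int)] = Seq(1 -> 10, 2 -> 20, 1 -> 30)

val byKey: Map[Int, Vector[Int]] =
  merged.foldLeft(Map.empty[Int, Vector[Int]]) { case (acc, (k, v)) =>
    acc.updated(k, acc.getOrElse(k, Vector.empty) :+ v)
  }
// byKey == Map(1 -> Vector(10, 30), 2 -> Vector(20))
```

This keeps exactly one accumulator map alive for the whole stream, so the per-key Seq[Value] is physically present when the stream ends, without any substream materialization overhead.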

--
Johan
Akka Team
