Hey,
I'm reimplementing a few Spark batch jobs as Akka Streams.
I got stuck on the last one, which takes two PairRdd[Key,Value]s and cogroups them by Key,
returning an Rdd[Key,Seq[Value]]. It then processes the Seq[Value] for each unique Key that is present in both original PairRdds, which is a rather "batchy" operation. Moreover, Key cardinality is high: roughly 50% of the keys are unique.
So if I merged the two Sources and used groupBy, it would create as many SubFlows as there are distinct Keys, which could be up to 5 million.
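To make that concrete, here is a minimal sketch of the approach I'm describing (the Key/Value type aliases, the toy data, and the maxDistinctKeys / parallelism numbers are just placeholders I made up):

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}

object CogroupSketch extends App {
  implicit val system: ActorSystem = ActorSystem("cogroup")

  // Hypothetical key/value types and toy inputs standing in for the two PairRdds.
  type Key = String
  type Value = Int
  val left: Source[(Key, Value), _]  = Source(List("a" -> 1, "b" -> 2, "a" -> 5))
  val right: Source[(Key, Value), _] = Source(List("a" -> 3, "c" -> 4))

  // Upper bound on distinct keys, hence on the number of substreams.
  val maxDistinctKeys = 5000000

  val cogrouped =
    left.merge(right)
      .groupBy(maxDistinctKeys, _._1)          // one SubFlow per distinct Key
      .fold(Option.empty[Key] -> Seq.empty[Value]) {
        case ((_, acc), (k, v)) => Some(k) -> (acc :+ v)
      }                                        // materialize the full Seq[Value] per Key
      .mergeSubstreamsWithParallelism(1024)    // parallelism value is a guess
      .runWith(Sink.seq)
}
```

Note that fold only emits when a substream completes, so all the Seq[Value]s are held in memory until the upstream finishes, which is exactly the "batchy" behavior (and memory concern) I mean.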
So my questions are:
1) Is there another way to do this? Note that I can't use reduce-like ops; I need the Seq[Value] physically present when the stream ends.
2) If not, is it OK to have on the order of 5M tiny SubFlows?
3) What should the parallelism be for this kind of groupBy when calling mergeSubstreamsWithParallelism?