Hi folks,
I was testing some Scalding code (on Cascading 3.2, running on Tez) that has a couple of group clauses, and I couldn't figure out how to set different parallelism for different Tez vertices in Cascading.
I noticed that the resulting Tez graph always has a parallelism of 1 on all the 'reducer' / scatter-gather vertices. This seems to be because we set cascading.flow.runtime.gather.partitions.num = 1 for our groupAll in Scalding.
This worked fine on MR, where these steps end up as separate jobs, but on Tez a value such as the gather partitions setting seems to be applied to every scatter-gather step.
Is there a recommended way to set parallelism for individual Tez vertices / Cascading nodes? Or do we need to use the Tez parallelism settings directly? Any other recommendations / suggestions?
Here’s the code at a high level:
myTypedPipe.flatMap( tweet => tweet.getText.split("\\s+") )
.filter(_.contains("#"))
.map(_.toLowerCase)
.map(word => (word, 1L))
.sumByKey // first groupBy, would like higher parallelism here
.toTypedPipe
.aggregate(top10000) // groupAll, this can be parallelism of 1
.flatten
.write(TypedTsv("hashtag_output"))
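
In case it helps to see the two grouping stages in isolation, here is a minimal plain-Scala sketch of the same logic using only standard collections (no Scalding, Cascading or Tez involved). The names HashtagCountSketch and topHashtags are made up for illustration, the input is assumed to be plain tweet text rather than tweet objects, and the sort-and-take step is only a stand-in for the top10000 aggregator, whose definition isn't shown above.

```scala
object HashtagCountSketch {
  // Hypothetical helper: counts hashtag-bearing words, then keeps the n most frequent.
  def topHashtags(texts: Seq[String], n: Int): Seq[(String, Long)] = {
    // Stage 1: per-key count (the sumByKey step; this is the part that
    // would benefit from higher parallelism on a cluster).
    val counts: Map[String, Long] =
      texts
        .flatMap(_.split("\\s+"))
        .filter(_.contains("#"))
        .map(_.toLowerCase)
        .groupBy(identity)
        .map { case (word, occurrences) => (word, occurrences.size.toLong) }

    // Stage 2: global top-n over all keys (the groupAll step; a single
    // partition is enough here). Ties are broken alphabetically.
    counts.toSeq
      .sortBy { case (word, count) => (-count, word) }
      .take(n)
  }

  def main(args: Array[String]): Unit = {
    val sample = Seq("Hello #Scala", "more #scala and #Tez", "no tags here")
    topHashtags(sample, 2).foreach(println)
  }
}
```

The point is only that stage 1 partitions naturally by key while stage 2 has to see every key at once, which is why I'd like to give the two vertices different parallelism.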