Issues trying to set up multi-format source taps when using Cascading on Tez


Piyush Narang

Aug 12, 2016, 3:58:51 PM
to cascading-user
Hi folks,

We are trying to run some of our existing Scalding jobs (which currently run on Hadoop with Cascading 3) on Tez, and we are hitting failures caused by missing keys in the JobConf. These jobs use a variant of Cascading's MultiSourceTap; the difference is that in our class the underlying schemes and taps need not be of the same class. (This supports users who have datasets in multiple formats or are migrating between them.)

The tap is part of our internal repo, so I've put the file in a gist: https://gist.github.com/piyushnarang/a6a8878fff30ea7438997d9f01d4c2e2
A few things to note:
1) The tap we create (MultiFormatSouceTap) extends Cascading's SourceTap and implements CompositeTap, so its constructor takes a list of input Taps.
2) As part of sourceConfInit() we add a config entry to the JobConf for each contained tap, holding its format-specific settings.
3) When we try to run the job on Tez, it fails with the error: Key scalding_internal.multiformat.tapindex.168e1283-9933-44a5-85e2-f9188d2af2d6 not found in job conf
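
To make the failure mode in point 3 concrete, here is a minimal, self-contained sketch. It is not the code from the gist: a plain `Map<String, String>` stands in for `JobConf`, and the names `TapConfSketch`, `register` and `lookup` are hypothetical. Only the key prefix is taken from the error message above. It shows the pattern of writing one entry per child tap at configuration time, and what happens when the conf seen at read time never received that entry.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical model: a Map stands in for Hadoop's JobConf.
final class TapConfSketch {
    // Prefix copied from the error message in the post.
    static final String PREFIX = "scalding_internal.multiformat.tapindex.";

    // Configuration time: store the child tap's index under a per-tap key.
    static String register(Map<String, String> conf, String tapId, int tapIndex) {
        String key = PREFIX + tapId;
        conf.put(key, Integer.toString(tapIndex));
        return key;
    }

    // Read time: resolve the child tap's index, failing if the key never arrived.
    static int lookup(Map<String, String> conf, String tapId) {
        String value = conf.get(PREFIX + tapId);
        if (value == null) {
            throw new IllegalStateException("Key " + PREFIX + tapId + " not found in job conf");
        }
        return Integer.parseInt(value);
    }

    public static void main(String[] args) {
        String tapId = UUID.randomUUID().toString();

        // The conf that was populated during configuration resolves the key.
        Map<String, String> populatedConf = new HashMap<>();
        register(populatedConf, tapId, 0);
        System.out.println("tap index: " + lookup(populatedConf, tapId));

        // A conf that was built without the per-tap entries fails at lookup.
        Map<String, String> emptyConf = new HashMap<>();
        try {
            lookup(emptyConf, tapId);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The second lookup mirrors the situation described above: the entries are written in one conf, but the conf that is later consulted was assembled through a different path and does not contain them.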

If I understand correctly, on the Hadoop side Cascading calls MultiInputFormat.addInputFormat(...) in initFromSources (https://github.com/cwensel/cascading/blob/wip-3.1/cascading-hadoop/src/main/shared-mr1/cascading/flow/hadoop/HadoopFlowStep.java#L432), which is what gets these per-tap JobConfs set up.


What is the recommended way to set up these JobConfs per tap in Cascading for Tez?

Thanks,
Piyush

Chris K Wensel

Aug 12, 2016, 4:27:28 PM
to cascadi...@googlegroups.com
There is no need for MultiInputFormat on Tez, so there isn’t one.

MultiInputFormat exists solely to provide support for CoGroup (joins).

On Tez, when doing a CoGroup, one Node takes the lhs and another takes the rhs. 

On MR, the mapper decides if it is lhs or rhs at runtime, cluster side, based on the split it was handed.

You could write a rule in the planner to detect your custom Tap and replace it with its children on unique branches.
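
As a rough illustration of that rewrite, the sketch below expands a composite source into one entry per child, so that each child can feed its own branch. This is a plain-Java model, not Cascading's planner API: `ExpandCompositeSketch`, `Source` and `expand` are hypothetical names, the source names are placeholders, and a real rule would operate on the planner's element graph rather than on a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of "replace a composite tap with its children".
final class ExpandCompositeSketch {

    // Stand-in for a tap: a name plus zero or more child sources.
    static final class Source {
        final String name;
        final List<Source> children;

        Source(String name, Source... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }

        boolean isComposite() {
            return !children.isEmpty();
        }
    }

    // Replaces every composite source with its leaf children, preserving order,
    // so that each leaf can be wired to its own branch.
    static List<Source> expand(List<Source> sources) {
        List<Source> result = new ArrayList<>();
        for (Source source : sources) {
            if (source.isComposite()) {
                result.addAll(expand(source.children));
            } else {
                result.add(source);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Source multi = new Source("multiFormat", new Source("formatA"), new Source("formatB"));
        for (Source leaf : expand(Arrays.asList(multi, new Source("other")))) {
            System.out.println(leaf.name);
        }
    }
}
```

With this shape, each leaf source is configured on its own, so nothing needs to pick a child tap at runtime based on the split.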

ckw



Piyush Narang

Aug 15, 2016, 2:19:13 PM
to cascading-user
Thanks for the reply, Chris. I'm not very familiar with the code, so I didn't know that MultiInputFormat was there to support CoGroups. I can dig into writing a custom planner rule. I'm guessing it would be along the lines of adding something to the *Hadoop2TezRuleRegistry classes (HashJoinHadoop2TezRuleRegistry / NoHashJoinHadoop2TezRuleRegistry)?

Thanks,