Cascading3 on Tez skipping distributed cache on job submission?


Piyush Narang

Dec 9, 2016, 8:54:08 PM
to cascading-user
hi folks,

While testing the performance of a simple map-only Scalding job on Tez vs Hadoop2, we noticed that the Tez job runs substantially slower than the Hadoop job. Our job basically converts thrift files to Parquet (1.5K files, each 300-800 MB in size).

We measure costs using Hadoop's MB_MILLIS counter, and by that metric this job costs roughly double on Tez what it does on Hadoop. Looking at the namenode logs, we noticed that we're spending a lot of time localizing jars. Looking at the cascading3 code that sets up and submits jars (https://github.com/cwensel/cascading/blob/wip-3.2/cascading-hadoop2-tez/src/main/java/cascading/flow/tez/util/TezUtil.java#L218), we don't seem to be adding jars to the distributed cache. On Hadoop we do seem to be doing so: https://github.com/cwensel/cascading/blob/wip-3.2/cascading-hadoop/src/main/shared-mr1/cascading/flow/hadoop/util/HadoopMRUtil.java#L145. Based on talking to some people more familiar with Hadoop internals, it sounds like adding those jars to the dist cache does help reduce localization time in the Hadoop case.
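For reference, the MR-side behavior I mean is roughly the following sketch (the jar path and job name here are made up for illustration, not our actual code): a jar added to the distributed cache classpath is downloaded once per node and reused by later tasks on that node.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

import java.io.IOException;

public class DistCacheSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "thrift-to-parquet");

    // Jar already staged on HDFS (illustrative path); MR adds it to each
    // task's CLASSPATH and the NodeManager caches the download per node.
    job.addFileToClassPath(new Path("hdfs:///libs/extra-dependency.jar"));
  }
}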

Does anyone know why Cascading isn't doing that in the Tez case? Any other suggestions to minimize localization time there?

Thanks,
Piyush

Chris K Wensel

Dec 11, 2016, 4:24:18 PM
to cascadi...@googlegroups.com
Tez doesn’t provide a distributed cache. It relies on the YARN equivalent, local resources (or something like that, the naming is very confusing).

Cascading does go to great pains to emulate the distributed cache behavior by adding the things that would be in the MR distcache to the YARN resource interface. fwiw, it also pre-configures YARN to recognize the ‘lib’ folder of the job jar, if any. We do this so users moving from MR to Tez don’t have to use YARN APIs to get the same behaviors.
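Roughly speaking, that emulation means registering each staged jar as a YARN LocalResource. A sketch of the general pattern follows (names are illustrative, not the exact TezUtil code; see the linked source for the real path):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class LocalResourceSketch {
  public static Map<String, LocalResource> register(Configuration conf, Path jarOnHdfs)
      throws IOException {
    FileSystem fs = jarOnHdfs.getFileSystem(conf);
    FileStatus status = fs.getFileStatus(jarOnHdfs);

    // APPLICATION visibility: the NodeManager localizes the file once per
    // application and cleans it up when the application finishes.
    LocalResource resource = LocalResource.newInstance(
        ConverterUtils.getYarnUrlFromPath(jarOnHdfs),
        LocalResourceType.FILE,
        LocalResourceVisibility.APPLICATION,
        status.getLen(),
        status.getModificationTime());

    Map<String, LocalResource> resources = new HashMap<>();
    resources.put(jarOnHdfs.getName(), resource);
    return resources;
  }
}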

not to say there aren’t bugs, but for external libraries (ones not stuck into the lib folder of the job jar) to show up on disk in the remote CLASSPATH, this mechanism has to work. There is no other way external libs will show up ‘local’ to the job jar once on the cluster.

so if we weren’t pushing jars into the cluster to be loaded into the local CLASSPATH, the jobs would fail, not just go slow.

maybe i’m missing the issue. 

or it’s an issue with YARN.

ckw



Piyush Narang

Dec 12, 2016, 1:53:17 PM
to cascading-user
Thanks for getting back, Chris. I poked around at this with someone more familiar with YARN. It seems we're currently setting up these jar resources in YARN with 'application' visibility: https://github.com/cwensel/cascading/blob/911c7c30f934284974a4c42604b13a465d6b3ffa/cascading-hadoop2-tez/src/main/java/cascading/flow/tez/util/TezUtil.java#L276. From what I understand, that only enables reuse across containers/tasks of the same application.

In our case we have a lot of tasks in the application being scheduled at once, so based on the YARN logs there's not much scope for reuse within the same application. When we tried setting this to LocalResourceVisibility.PUBLIC, we saw much better performance over re-runs of the same job, since Tez skips the localization thanks to the cache. Considering that a lot of jobs run on a regular cadence (hourly/daily), would it make sense to set the visibility to PUBLIC by default in Cascading?
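Concretely, what we experimented with amounts to building the LocalResource with PUBLIC visibility instead of APPLICATION; a rough sketch below (class and method names are made up for illustration). One caveat we're aware of: YARN only localizes a PUBLIC resource if the staged file and its parent directories are world-readable.

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class PublicResourceSketch {
  static LocalResource publicJar(Path jarOnHdfs, FileStatus status) {
    return LocalResource.newInstance(
        ConverterUtils.getYarnUrlFromPath(jarOnHdfs),
        LocalResourceType.FILE,
        LocalResourceVisibility.PUBLIC, // was APPLICATION; PUBLIC is shared across applications
        status.getLen(),
        status.getModificationTime());
  }
}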

Thanks,
Piyush