Loading remote data set repositories


Edward C. Skoviak

Mar 27, 2015, 9:09:23 AM
to cdk...@cloudera.org
I am attempting to build a Crunch pipeline to synchronize data from a remote cluster's data set repository into a host cluster's data set repository. Both clusters have HA configurations, so I provide the name service configuration for each as part of the job configuration (and inject the additional name service into Kite using the DefaultConfiguration class). When the job runs in the initial JVM, it loads the data set repository as expected. However, after the tasks are submitted to the mappers, I get the following failure:

Error: java.lang.IllegalArgumentException: java.net.UnknownHostException: Douglas
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:373)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:258)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:153)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:632)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:570)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:147)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.kitesdk.data.spi.filesystem.Loader$URIBuilder.getFromOptions(Loader.java:66)
    at org.kitesdk.data.spi.filesystem.Loader$URIBuilder.getFromOptions(Loader.java:50)
    at org.kitesdk.data.spi.Registration.lookupDatasetUri(Registration.java:106)
    at org.kitesdk.data.Datasets.load(Datasets.java:103)
    at org.kitesdk.data.Datasets.load(Datasets.java:165)
    at org.kitesdk.data.mapreduce.DatasetKeyInputFormat.load(DatasetKeyInputFormat.java:246)
    at org.kitesdk.data.mapreduce.DatasetKeyInputFormat.setConf(DatasetKeyInputFormat.java:192)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:746)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.net.UnknownHostException: Douglas
    ... 27 more

It appears that because of this line[1], the tasks running on the mappers are using the DefaultConfiguration and not the configuration from the JobContext.

Is there a supported method for loading data from a remote data set repository?

[1] data/kite-data-core/src/main/java/org/kitesdk/data/spi/filesystem/Loader.java#L63

Joey Echeverria

Mar 27, 2015, 11:46:10 AM
to Edward C. Skoviak, cdk...@cloudera.org
Did you also add the name service to the Configuration object passed
to the MRPipeline()?

--
Joey Echeverria
Senior Infrastructure Engineer

Edward C. Skoviak

Mar 27, 2015, 2:06:03 PM
to cdk...@cloudera.org, edward....@gmail.com
Yes, and when I check the job configuration on the job history server, the name service appears in the config.
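
For readers following along, here is a rough sketch of the kind of client-side name service settings being discussed. The property keys are the standard HDFS HA client keys; the name service names (`local-ns`, `remote-ns`), the NameNode addresses, and the helper class itself are hypothetical and only there for illustration. The stack trace above goes through `createNonHAProxy`, which suggests the task did not see these settings and treated the logical name as a plain hostname.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: collects the HDFS HA client properties for one logical name service. */
final class RemoteNameServiceConfig {

    /** Builds the per-name-service keys; NameNode ids are generated as nn1, nn2, ... */
    static Map<String, String> haProperties(String nameService, String... namenodeRpcAddresses) {
        Map<String, String> props = new LinkedHashMap<>();
        StringBuilder ids = new StringBuilder();
        for (int i = 0; i < namenodeRpcAddresses.length; i++) {
            String id = "nn" + (i + 1);
            if (i > 0) {
                ids.append(',');
            }
            ids.append(id);
            // RPC address of each NameNode that backs the logical name service.
            props.put("dfs.namenode.rpc-address." + nameService + "." + id, namenodeRpcAddresses[i]);
        }
        props.put("dfs.ha.namenodes." + nameService, ids.toString());
        // Tells the HDFS client how to fail over between the NameNodes listed above.
        props.put("dfs.client.failover.proxy.provider." + nameService,
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return props;
    }

    /** Appends a name service to an existing dfs.nameservices value without duplicating it. */
    static String mergeNameServices(String existing, String extra) {
        if (existing == null || existing.trim().isEmpty()) {
            return extra;
        }
        for (String ns : existing.split(",")) {
            if (ns.trim().equals(extra)) {
                return existing;
            }
        }
        return existing + "," + extra;
    }
}
```

Each entry of the returned map would then be copied onto the Hadoop job configuration (for example with `conf.set(key, value)`), and `dfs.nameservices` would be set to the merged list so that both clusters are known to the client.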

Joey Echeverria

Mar 29, 2015, 12:04:13 AM
to cdk...@cloudera.org
I think this is a bug. We should be using the job configuration to set
DefaultConfiguration before we load the dataset, but we're not doing
that when we load[1] the dataset.

Do you want to file a JIRA[2] for this and I'll take a look at a fix?

-Joey

[1] https://github.com/kite-sdk/kite/blob/master/kite-data/kite-data-mapreduce/src/main/java/org/kitesdk/data/mapreduce/DatasetKeyInputFormat.java#L246
[2] https://issues.cloudera.org/browse/CDK
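
To make the diagnosis above concrete, here is a minimal, self-contained model of the problem. It is not the Kite code and not the eventual patch: `DefaultConf`, `Loader`, and `InputFormatSketch` are hypothetical stand-ins, and plain maps stand in for Hadoop `Configuration` objects. The sketch only shows why a loader that reads a process-wide default fails inside a task JVM unless the job configuration is installed as that default first.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a process-wide default configuration holder. */
final class DefaultConf {
    private static Map<String, String> conf = new HashMap<>();

    static synchronized Map<String, String> get() {
        return conf;
    }

    static synchronized void set(Map<String, String> newConf) {
        conf = newConf;
    }
}

/** Hypothetical stand-in for a loader that only consults the process-wide default. */
final class Loader {
    static String resolve(String authority) {
        String nameServices = DefaultConf.get().getOrDefault("dfs.nameservices", "");
        for (String ns : nameServices.split(",")) {
            if (ns.trim().equals(authority)) {
                return "ha:" + authority;
            }
        }
        // Models the failure mode: an unknown logical name is treated as a hostname.
        throw new IllegalArgumentException("UnknownHostException: " + authority);
    }
}

/** Hypothetical stand-in for an input format that is configured inside the task JVM. */
final class InputFormatSketch {
    private String resolved;

    /** Problem variant: the job configuration is received but never reaches the loader. */
    void setConfWithoutDefault(Map<String, String> jobConf, String authority) {
        resolved = Loader.resolve(authority);
    }

    /** Fixed variant: install the job configuration as the default before loading. */
    void setConfWithDefault(Map<String, String> jobConf, String authority) {
        DefaultConf.set(jobConf);
        resolved = Loader.resolve(authority);
    }

    String resolved() {
        return resolved;
    }
}
```

With a job configuration that lists `remote-ns` in `dfs.nameservices`, the first variant throws in a fresh JVM while the second resolves the name service, which mirrors the difference between the client JVM (where the default had already been set by hand) and the mapper tasks (where it had not).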


Edward C. Skoviak

Mar 30, 2015, 6:06:01 PM
to cdk...@cloudera.org

Logged issue here[1]. I forked the classes into our project and made the changes, and it seems to fix the issues I was running into. Thanks for the help.

Joey Echeverria

Mar 30, 2015, 6:21:09 PM
to Edward C. Skoviak, cdk...@cloudera.org
If you have a patch that works, you can post it to the issue or create
a GitHub PR and we can commit it!
