I am trying to build a Crunch pipeline that synchronizes data from a remote cluster's dataset repository into a host cluster's dataset repository. Both clusters are configured for HA, so I provide the name service configuration for each as part of the job configuration (and inject the additional name service into Kite using the DefaultConfiguration class). When the job runs in the initial JVM, it loads the dataset repository as expected. However, once the tasks are submitted to the mappers, I get the following failure:
Error: java.lang.IllegalArgumentException: java.net.UnknownHostException: Douglas
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:373)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:258)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:153)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:632)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:570)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:147)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.kitesdk.data.spi.filesystem.Loader$URIBuilder.getFromOptions(Loader.java:66)
    at org.kitesdk.data.spi.filesystem.Loader$URIBuilder.getFromOptions(Loader.java:50)
    at org.kitesdk.data.spi.Registration.lookupDatasetUri(Registration.java:106)
    at org.kitesdk.data.Datasets.load(Datasets.java:103)
    at org.kitesdk.data.Datasets.load(Datasets.java:165)
    at org.kitesdk.data.mapreduce.DatasetKeyInputFormat.load(DatasetKeyInputFormat.java:246)
    at org.kitesdk.data.mapreduce.DatasetKeyInputFormat.setConf(DatasetKeyInputFormat.java:192)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:746)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.net.UnknownHostException: Douglas
    ... 27 more
It appears that, because of this line [1], the tasks running on the mappers use the DefaultConfiguration rather than the configuration from the JobContext.
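For reference, the createNonHAProxy frame in the trace suggests that the task-side configuration has no HA settings for the remote name service, so its logical name is probably being resolved as a plain hostname. Below is a minimal sketch of the kind of HA client settings involved. The property keys are the standard HDFS HA client keys; the class name, the name service IDs (local-ns, remote-ns), the NameNode IDs (nn1, nn2) and the host names are all hypothetical and for illustration only. The sketch only builds the key/value pairs; in a real job they would be copied onto the Hadoop Configuration used by the job.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: builds the HDFS HA client properties that a task would
// need in order to resolve a remote logical name service. All IDs and hosts
// passed in are hypothetical placeholders.
final class RemoteNameserviceProps {

    static Map<String, String> haClientProps(String localNs, String remoteNs,
                                             String nn1Addr, String nn2Addr) {
        Map<String, String> props = new LinkedHashMap<>();
        // Both name services must be listed, otherwise the remote one is
        // treated as an ordinary (non-HA) host name.
        props.put("dfs.nameservices", localNs + "," + remoteNs);
        // Logical NameNode IDs for the remote name service (assumed nn1, nn2).
        props.put("dfs.ha.namenodes." + remoteNs, "nn1,nn2");
        props.put("dfs.namenode.rpc-address." + remoteNs + ".nn1", nn1Addr);
        props.put("dfs.namenode.rpc-address." + remoteNs + ".nn2", nn2Addr);
        // Client-side failover provider shipped with HDFS.
        props.put("dfs.client.failover.proxy.provider." + remoteNs,
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> props = haClientProps(
                "local-ns", "remote-ns",
                "nn1.remote.example.com:8020", "nn2.remote.example.com:8020");
        // Print the pairs in the order they would be applied to a Configuration.
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

If these pairs are present only in the job's configuration and not in whatever configuration the task-side loader consults, the remote name service would be unknown inside the mapper, which would be consistent with the failure above.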
Is there a supported method for loading data from a remote data set repository?
[1] data/kite-data-core/src/main/java/org/kitesdk/data/spi/filesystem/Loader.java#L63