I'm not sure I understood you correctly, but maybe this will help (I was planning to write a blog post about this).
I just jumped into Big Data/Hadoop land, and the thing that bothers me is that 95% of the examples on the web show submitting an MR job to Hadoop via the shell command (hadoop jar). I find that a bit limiting, because I want the freedom to decide where I submit my jobs from: a Hadoop or non-Hadoop machine, remotely from my web app running in Tomcat, remotely from my IDE on Windows, and so on. Being able to submit a job from your IDE is especially useful, because it gives you the fastest development cycle. The problem is that Hadoop jobs need their code JARs deployed, so you have to deal with how the "job driver" app is packaged, and that collides with how the app is deployed when you run it inside an IDE or on some other deployment platform.
So currently I ended up using Gradle: I put all 3rd-party JARs needed by my MR jobs (such as the Cascalog JARs) into a special "configuration", so I can copy them to an external directory before running the application. That directory also contains the JAR with the user-defined classes the jobs need (such as your MR functions):
    configurations {
        mapreduce {
            description = 'Map reduce jobs dependencies'
        }
        compile {
            extendsFrom mapreduce
        }
    }

    task prepareMapReduceLibs(type: Sync, dependsOn: jar) {
        from jar.outputs.files
        from configurations.mapreduce.files
        into 'mapreducelib'
    }
I don't want the Hadoop JARs in this directory, so I exclude them (unfortunately a bit clumsy, since I did it on a per-dependency basis), for example:
    mapreduce ("cascalog:cascalog-core:${cascalogVersion}") {
        exclude group: "org.apache.hadoop", module: "hadoop-core"
    }
Of course, the Hadoop dependency itself is included in the Gradle "compile" configuration.
Now you can use this directory ("mapreducelib" in the example above) to copy these JARs to HDFS automatically when your application starts, and then add them to the distributed cache so the jobs can use them. I created a utility class, JobHelper, to encapsulate that code; here is just how it is used:
    String hdfsJarsDir = "/myjobs/mylibs";
    JobHelper.copyLocalJarsToHdfs("./mapreducelib", hdfsJarsDir, configuration);
    JobHelper.addHdfsJarsToDistributedCache(hdfsJarsDir, configuration);
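JobHelper itself is not shown here, so the following is only a rough sketch of the local half of what a method like copyLocalJarsToHdfs has to do: find the JARs in the directory and work out where each one should land in HDFS. The class JarPlan and its plan method are made-up names for illustration, not part of Hadoop or Cascalog. The actual upload and cache registration are left out so the sketch runs with plain Java; with Hadoop on the classpath they would presumably go through FileSystem.copyFromLocalFile and DistributedCache.addFileToClassPath.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not a Hadoop or Cascalog class): plans which local JAR
// goes to which HDFS path. The upload itself is intentionally left out.
final class JarPlan {

    private JarPlan() {}

    // Maps each regular *.jar file directly inside localDir to "<hdfsDir>/<file name>".
    // A TreeMap keeps the result sorted by local path, so the order is deterministic.
    static Map<Path, String> plan(Path localDir, String hdfsDir) throws IOException {
        // Drop a trailing slash so we never produce "dir//file.jar".
        String base = hdfsDir.endsWith("/")
                ? hdfsDir.substring(0, hdfsDir.length() - 1)
                : hdfsDir;
        Map<Path, String> result = new TreeMap<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(localDir, "*.jar")) {
            for (Path jar : jars) {
                if (Files.isRegularFile(jar)) {
                    result.put(jar, base + "/" + jar.getFileName());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Same directories as in the usage snippet above.
        Path localDir = Paths.get("./mapreducelib");
        if (!Files.isDirectory(localDir)) {
            System.out.println("No such directory: " + localDir);
            return;
        }
        for (Map.Entry<Path, String> e : plan(localDir, "/myjobs/mylibs").entrySet()) {
            // With Hadoop available, this is where each file would be copied
            // to HDFS and then added to the distributed cache.
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Keeping the planning step separate from the Hadoop calls means it can be checked from the IDE without a cluster.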
JCascalog requires configuration properties to be set via a Map, so you have to convert your Configuration into a Map, like this:
    // Needs: java.util.HashMap, java.util.Map,
    //        org.apache.hadoop.conf.Configuration, jcascalog.Api

    private static void configureCascalog(Configuration configuration) {
        Map<String, String> map = convertConfigurationToMap(configuration);
        System.out.println("Configuring Cascalog with properties: " + map);
        Api.setApplicationConf(map);
    }

    // Configuration is iterable over its key/value entries, so a plain loop is enough.
    private static Map<String, String> convertConfigurationToMap(Configuration configuration) {
        Map<String, String> map = new HashMap<String, String>();
        for (Map.Entry<String, String> configurationEntry : configuration) {
            map.put(configurationEntry.getKey(), configurationEntry.getValue());
        }
        return map;
    }
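Since the conversion loop only relies on Configuration being iterable over Map.Entry<String, String>, the same logic can be written against Iterable and tried out without Hadoop on the classpath. This is just a minimal sketch under that assumption: ConfMaps and toMap are made-up names, and the property values in main are illustrative placeholders, not settings from a real cluster.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the conversion helper; it accepts any Iterable of
// String entries, which is the only thing the loop needs from its argument.
final class ConfMaps {

    private ConfMaps() {}

    // Copies every entry into a new HashMap; a later duplicate key overwrites an earlier one.
    static Map<String, String> toMap(Iterable<Map.Entry<String, String>> entries) {
        Map<String, String> map = new HashMap<String, String>();
        for (Map.Entry<String, String> entry : entries) {
            map.put(entry.getKey(), entry.getValue());
        }
        return map;
    }

    public static void main(String[] args) {
        // Fake "configuration" entries; the host names are placeholders.
        List<Map.Entry<String, String>> fake = new ArrayList<Map.Entry<String, String>>();
        fake.add(new SimpleEntry<String, String>("fs.default.name", "hdfs://namenode.example:8020"));
        fake.add(new SimpleEntry<String, String>("mapred.job.tracker", "namenode.example:8021"));
        System.out.println(toMap(fake));
    }
}
```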
regards,
Vjeran