Using ElephantDB with plain old Hadoop

Chris Stucchio

Mar 16, 2011, 9:02:37 AM
to elephantdb-user
Greetings,

I'm using ordinary Hadoop for a project, without Cascading or Cascalog.
How do I go about writing ElephantDB output using plain Hadoop?

Looking inside the source code suggests using ElephantOutputFormat, with
IntWritable / ElephantRecordWritable as the key/value pairs. I tried this
approach and wrote a small utility to convert a SequenceFile into an
ElephantDB output file (code at the bottom).

But when I run this, I get the following:

java.lang.NullPointerException
        at elephantdb.hadoop.ElephantOutputFormat.checkOutputSpecs(ElephantOutputFormat.java:126)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at stylewok.tools.ElephantDBWriter.convertTextJSONFileToElephantDB(Unknown Source)
        at stylewok.textindex.ItemFeatureViewBuilder.run(Unknown Source)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at stylewok.textindex.ItemFeatureViewBuilder.main(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Apparently I'm doing something wrong. Can someone suggest the right way
to create an ElephantDB database in plain old Hadoop?


import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

import elephantdb.hadoop.*;

public class ElephantDBWriter {

    public static class TextJSONElephantMapper extends MapReduceBase
            implements Mapper<Text, JSONObjectWritable, IntWritable, ElephantRecordWritable> {
        public void map(Text key, JSONObjectWritable value,
                        OutputCollector<IntWritable, ElephantRecordWritable> output,
                        Reporter reporter) throws IOException {
            // Wrap the raw key/value bytes in an ElephantRecordWritable
            // and key each record by its hash.
            ElephantRecordWritable record = new ElephantRecordWritable(
                    key.toString().getBytes(), value.toString().getBytes());
            output.collect(new IntWritable(record.hashCode()), record);
        }
    }

    public static void convertTextJSONFileToElephantDB(String inputPath, String outputPath) {
        JobClient client = new JobClient();
        JobConf conf = new JobConf(ElephantDBWriter.class);

        conf.setJobName("ConvertFileToElephantDB");

        conf.setOutputKeyClass(IntWritable.class);
        conf.setOutputValueClass(ElephantRecordWritable.class);

        FileInputFormat.addInputPath(conf, new Path(inputPath));
        FileOutputFormat.setOutputPath(conf, new Path(outputPath));
        conf.setInputFormat(SequenceFileInputFormat.class);
        conf.setOutputFormat(ElephantOutputFormat.class);

        conf.setMapperClass(TextJSONElephantMapper.class);

        client.setConf(conf);

        try {
            JobClient.runJob(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

nathanmarz

Mar 16, 2011, 9:24:20 PM
to elephantdb-user
Hi Chris,

I added an Exporter class to ElephantDB so that you can create an
ElephantDB domain using just MapReduce. The code is available on
GitHub and in the Maven repository on Clojars.

What you need to do is create a directory on HDFS containing key/value
pairs in SequenceFiles (both keys and values are BytesWritable). Then
you can export the pairs into ElephantDB with a call like the following:

Exporter.export(sequenceFileDirPath, edbDomainPath, new DomainSpec(new JavaBerkDB(), 32));

This will run a job that creates a 32-shard Java Berkeley DB domain
at edbDomainPath.
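
If it helps, here's a minimal sketch of a map-only job that produces such
a SequenceFile directory. The input format and the Text-to-bytes
conversion are placeholders for whatever your source data looks like; the
only requirement is that the output keys and values are BytesWritable:

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class EdbInputPrep {

    // Hypothetical mapper: copies each Text key/value into raw bytes.
    // Arrays.copyOf trims Text's backing array to its actual length.
    public static class BytesMapper extends MapReduceBase
            implements Mapper<Text, Text, BytesWritable, BytesWritable> {
        public void map(Text key, Text value,
                        OutputCollector<BytesWritable, BytesWritable> out,
                        Reporter reporter) throws IOException {
            out.collect(
                new BytesWritable(Arrays.copyOf(key.getBytes(), key.getLength())),
                new BytesWritable(Arrays.copyOf(value.getBytes(), value.getLength())));
        }
    }

    public static void prepare(String inputPath, String seqFileDirPath) throws IOException {
        JobConf conf = new JobConf(EdbInputPrep.class);
        conf.setJobName("PrepareElephantDBInput");

        conf.setMapperClass(BytesMapper.class);
        conf.setNumReduceTasks(0); // map-only: output goes straight to the SequenceFiles

        conf.setOutputKeyClass(BytesWritable.class);
        conf.setOutputValueClass(BytesWritable.class);

        conf.setInputFormat(KeyValueTextInputFormat.class);   // placeholder for your real input
        conf.setOutputFormat(SequenceFileOutputFormat.class); // BytesWritable k/v, as required

        FileInputFormat.addInputPath(conf, new Path(inputPath));
        FileOutputFormat.setOutputPath(conf, new Path(seqFileDirPath));

        JobClient.runJob(conf);
    }
}

Once that directory exists, the Exporter.export call above will shard its
contents into the domain.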

You can plug in your own incremental updating code by passing an
Exporter.Args instead of a DomainSpec and setting the updater within.
If the updater is set to null, incremental updates are disabled and each
new version of the domain will contain only what you export to it.
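
Disabling incremental updates would look roughly like this (a sketch only:
the Exporter.Args constructor and field names here are assumptions based
on the description above, so check the source for the exact shape):

// Assumed shape of Exporter.Args: it wraps the DomainSpec and exposes
// the updater described above (constructor and field names assumed).
Exporter.Args args = new Exporter.Args(new DomainSpec(new JavaBerkDB(), 32));
args.updater = null; // null updater: each version contains only what you export
Exporter.export(sequenceFileDirPath, edbDomainPath, args);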

Hope that helps,
Nathan