I've finished a C++ to Java interface so that the InputFormat can
now properly return InputSplits along with their locations.
The files that are part of this interface are both C++ and Java.
I have a small script for compiling them, but I'd like to know
how to compile the Java ones, and what the preferred way is of
doing the compilation within the existing Hypertable source tree,
using the existing compilation process.
I have no Java background at all,
so here's how I'm doing it at the moment.
javac -classpath
/Users/m/work/hypertable-mapreduce/lib/hadoop-0.17.0-core.jar:.
TableSplit.java TableInputFormat.java
g++ -c -I/System/Library/Frameworks/JavaVM.framework/Headers
-I/usr/local/0.9.0.5/include -I/usr/local/include/boost-1_34_1 -I.
TableInputFormat.cc
g++ -dynamiclib -o libhypertable.jnilib TableInputFormat.o -framework
JavaVM -lHypertable -lHyperComm -lHyperCommon -llog4cpp -lexpat
-lHyperspace -lz -L/usr/local/0.9.0.5/lib/ -L/usr/local/lib/
-lboost_thread-mt -lboost_iostreams-mt
Unit test compilation
javac -classpath
/Users/m/work/hypertable-mapreduce/lib/hadoop-0.17.0-core.jar:/Users/m/work/hypertable-mapreduce/lib/log4j-1.2.13.jar:/Users/m/work/hypertable-mapreduce/lib/commons-logging-1.0.4.jar:.
TableMapTest.java
Launching unit test
java -classpath
/Users/m/work/hypertable-mapreduce/lib/hadoop-0.17.0-core.jar:/Users/m/work/hypertable-mapreduce/lib/log4j-1.2.13.jar:/Users/m/work/hypertable-mapreduce/lib/commons-logging-1.0.4.jar:.
TableMapTest
Several parts of the compilation process, like the g++ flags, are
platform-specific (e.g. the -dynamiclib and -framework flags).
One more thing worth noting is that I did not put the TableInputFormat
class in any sort of package,
like org.apache.hadoop.hypertable.mapred, because I ran into a lot
of difficulties
trying to compile and test it that way.
Is anyone willing to help me with this issue, i.e. setting up the
compilation so that it works
cross-platform and is integrated with the existing build process?
As a side note, I'm going to test the Pipes API for MapReduce
in a couple of hours and see if everything works as expected.
Mateusz
--
Mateusz Berezecki
http://www.meme.pl
Yes, you are correct. The JNI extension is for getting information
about the number of tablets in a table and their locations. The C++
side does that using the TableRangeMap class I developed.
This class is strictly for fetching that particular information
given a table name. Java then retrieves this information
in the form of an array, and the InputSplit array is created from that
very information coming from the C++ side. In addition to TableSplit (whose
parent class is Hadoop's InputSplit), I created a TableInputFormat
which provides dummy record readers and writers - they do nothing
but are required to satisfy the interface specification. I expected to run
some tests, but I have an exam tomorrow, so the only thing I've managed so far
while trying to run a simple identity MapReduce job was getting the
sample application to compile correctly (this is harder than I initially thought).
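To make the split-construction step concrete, here is a small self-contained sketch in plain Java. Everything in it is hypothetical: `TableSplit` and `createSplits` are illustrative stand-ins for the real classes, and the (start row, end row, location) triples that the real code fetches from C++ over JNI are hard-coded here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: turning per-tablet (start, end, location) triples
// -- which the real code retrieves from the C++ side via JNI -- into an
// array of split objects, one split per tablet.
public class SplitSketch {
    // Stand-in for the TableSplit class described in the mail
    // (the real one has Hadoop's InputSplit as its parent class).
    public static class TableSplit {
        public final String startRow, endRow, location;
        public TableSplit(String startRow, String endRow, String location) {
            this.startRow = startRow;
            this.endRow = endRow;
            this.location = location;
        }
    }

    // One String[3] entry per tablet, as the JNI layer might report it.
    public static TableSplit[] createSplits(String[][] tablets) {
        List<TableSplit> splits = new ArrayList<>();
        for (String[] t : tablets) {
            splits.add(new TableSplit(t[0], t[1], t[2]));
        }
        return splits.toArray(new TableSplit[0]);
    }

    public static void main(String[] args) {
        // Hard-coded stand-in for the data the C++ side would return;
        // "\u00ff\u00ff" mimics the end-of-table row key.
        String[][] tablets = {
            {"", "key2", "rs1.example.com"},
            {"key2", "\u00ff\u00ff", "rs2.example.com"},
        };
        System.out.println(createSplits(tablets).length + " splits");
    }
}
```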
In case you're interested, you can track the progress here:
http://github.com/mateuszb/hypertable/commits/master
Please note that I have not added a sample application to this tree yet.
Mateusz
One thing worth adding is that the initial design will not contain
Java record readers and writers - which, by the way, are not that
hard to code, as the JNI interface for them will be relatively simple
once you get acquainted with it. The first version's goal is
to support the C++ Pipes API for Hadoop's MapReduce.
Mateusz
It seems to me that custom InputFormat classes are not meant
to be bundled with Hadoop. At least that's the case with HBase and
its TableInputFormat - it sits in the HBase repository.
The reason for that is simple: MapReduce as implemented
in Hadoop does not need the InputFormat classes to be
part of Hadoop. They get configured in the job's XML configuration
file, and if they are not part of Hadoop they automatically
get redistributed via HDFS to the nodes participating in the processing.
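The mechanism behind this is that Hadoop loads the configured InputFormat class by name via reflection, so the class only needs to be on the job's classpath, not compiled into Hadoop itself. A minimal self-contained illustration of name-based loading (`DummyFormat` and `loadByName` are local stand-ins, not the real TableInputFormat or Hadoop's actual loading code):

```java
// Sketch of reflection-based class loading: the mechanism that lets a job
// name an InputFormat in its configuration without that class being part
// of Hadoop. "DummyFormat" here is a made-up stand-in class.
public class ReflectionSketch {
    public static class DummyFormat {
        public DummyFormat() {}
    }

    // Load and instantiate a class given only its name, as the framework
    // would do with the class name taken from the job configuration.
    public static Object loadByName(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("cannot load " + className, e);
        }
    }

    public static void main(String[] args) {
        Object fmt = loadByName("ReflectionSketch$DummyFormat");
        System.out.println(fmt.getClass().getSimpleName());
    }
}
```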
WRT SWIG - no, I have not tried it, although I've been thinking
about it a little. What would the major difference be between SWIG and JNI?
I'm considering moving everything to Java as soon as the Thrift broker
is published, though.
Mateusz
But it's still hard to achieve :-) I peeked at how HBase does it, and
I currently have a pretty solid testing environment, but not the
kind I'd like to make available to everyone (it requires some work
to set up). I'm considering something with zero setup required.
>
>> WRT SWIG - no, I have not tried it, although I've been thinking
>> about it a little. What would the major difference be between SWIG and JNI?
>
> SWIG's java extension uses JNI, it might help hiding platform specific
> compiler/linker stuff for you.
>
I will definitely look at it this week then.
>> I'm considering moving everything to Java as soon as the Thrift broker
>> is published, though.
>
> Yeah, make it work first :) I look forward to the results. Thanks!
Some really preliminary stuff (2 tests coming tomorrow and 2 more
the day after tomorrow ;) )
m:work m$ ./hadoop-0.17.0/bin/hadoop pipes -conf mapredconf.xml
-program /mapreduce/applications/mapred -input TEST -jar ./Table.jar
-inputformat TableInputFormat -output test_output
Configuring MapReduce for table TEST
Configuring MapReduce for table TEST
08/06/09 18:18:23 INFO mapred.JobClient: Running job: job_200806091806_0007
08/06/09 18:18:24 INFO mapred.JobClient: map 0% reduce 0%
08/06/09 18:18:29 INFO mapred.JobClient: Task Id :
task_200806091806_0007_m_000000_0, Status : FAILED
java.lang.NullPointerException
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:166)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
task_200806091806_0007_m_000000_0: Configuring MapReduce for table TEST
task_200806091806_0007_m_000000_0: Hadoop Pipes Exception:
RecordReader defined when not needed. at impl/HadoopPipes.cc:648 in
virtual void HadoopPipes::TaskContextImpl::runMap(std::string, int,
bool)
08/06/09 18:18:34 INFO mapred.JobClient: Task Id :
task_200806091806_0007_m_000000_1, Status : FAILED
java.lang.NullPointerException
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:166)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
I'm still trying to figure out how to define a RecordReader in C++
and inform MR not to use the InputFormat's dummy one.
Once that's done, I'll start testing it, and if all goes well,
I'll try to bundle everything up into some sort of patch.
Mateusz
I couldn't give up on this one so easily :-)
m:work m$ ./hadoop-0.17.0/bin/hadoop pipes -conf mapredconf.xml
-program /mapreduce/applications/mapred -input TEST -jar ./Table.jar
-inputformat TableInputFormat -output test_output -jobconf
hadoop.pipes.java.recordreader=false,hadoop.pipes.java.recordwriter=false
Configuring MapReduce for table TEST
Configuring MapReduce for table TEST
08/06/10 03:32:27 INFO mapred.JobClient: Running job: job_200806091806_0037
08/06/10 03:32:28 INFO mapred.JobClient: map 0% reduce 0%
08/06/10 03:32:35 INFO mapred.JobClient: map 100% reduce 0%
08/06/10 03:32:47 INFO mapred.JobClient: map 100% reduce 50%
08/06/10 03:32:49 INFO mapred.JobClient: map 100% reduce 100%
08/06/10 03:32:50 INFO mapred.JobClient: Job complete: job_200806091806_0037
08/06/10 03:32:50 INFO mapred.JobClient: Counters: 14
08/06/10 03:32:50 INFO mapred.JobClient: Job Counters
08/06/10 03:32:50 INFO mapred.JobClient: Launched map tasks=1
08/06/10 03:32:50 INFO mapred.JobClient: Launched reduce tasks=2
08/06/10 03:32:50 INFO mapred.JobClient: Data-local map tasks=1
08/06/10 03:32:50 INFO mapred.JobClient: Map-Reduce Framework
08/06/10 03:32:50 INFO mapred.JobClient: Map input records=0
08/06/10 03:32:50 INFO mapred.JobClient: Map output records=0
08/06/10 03:32:50 INFO mapred.JobClient: Map input bytes=0
08/06/10 03:32:50 INFO mapred.JobClient: Map output bytes=0
08/06/10 03:32:50 INFO mapred.JobClient: Combine input records=0
08/06/10 03:32:50 INFO mapred.JobClient: Combine output records=0
08/06/10 03:32:50 INFO mapred.JobClient: Reduce input groups=0
08/06/10 03:32:50 INFO mapred.JobClient: Reduce input records=0
08/06/10 03:32:50 INFO mapred.JobClient: Reduce output records=0
08/06/10 03:32:50 INFO mapred.JobClient: File Systems
08/06/10 03:32:50 INFO mapred.JobClient: Local bytes read=220
08/06/10 03:32:50 INFO mapred.JobClient: Local bytes written=432
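The two -jobconf properties used on the command line above could presumably also be placed in mapredconf.xml, since the -conf file is loaded into the same job configuration. A sketch of what those entries might look like, in standard Hadoop configuration-file format (whether this particular setup honors them from the file rather than the command line is an assumption on my part):

```xml
<!-- Hypothetical mapredconf.xml entries equivalent to the -jobconf flags -->
<property>
  <name>hadoop.pipes.java.recordreader</name>
  <value>false</value>
</property>
<property>
  <name>hadoop.pipes.java.recordwriter</name>
  <value>false</value>
</property>
```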
This is the first run of MR over Hypertable that I have completed. I'm going to polish it
a bit this week and send more information. The first observation is that
it runs slowly even with empty map and reduce phases. I'm not
drawing any conclusions from this observation yet.
Mateusz
Luke, not yet. I haven't tried to compare or relate the execution time to any
other MR implementation yet. But I got output from another simple test.
hypertable> select * from TEST;
key1 column1 value1
key2 column1 value2
key3 column1 value3
Configuring MapReduce for table TEST
Record reader will open table 'TEST'
Scanning range:
start row:
end row:ÿÿ
key: key1
value: value1
key: key2
value: value2
key: key3
value: value3
It is a simple table with some content. I'm currently testing the map
phase only;
the reduce phase still does nothing. I'm thinking of generating some data to put
into the table and testing on bigger tables.
Do you have any such data generators for testing Hypertable out there?
I'd really appreciate such a tool, as my head is going to explode soon -
there are 3 more final exams coming tomorrow ;-)
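For what it's worth, a throwaway generator is easy to sketch. The example below is entirely an assumption on my part: `GenRows` is a made-up name, and the tab-separated (row, column, value) output format is only a guess at what a bulk load into the table would want.

```java
import java.util.Random;

// Hypothetical test-data generator: emits tab-separated (row, column, value)
// lines to stdout, one per record, that could be bulk-loaded into a table.
// The exact load format Hypertable expects is an assumption here.
public class GenRows {
    // Build one record; the fixed-width key keeps rows in lexicographic order.
    public static String row(int i, Random rnd) {
        return String.format("key%06d\tcolumn1\tvalue%d", i, rnd.nextInt(1000000));
    }

    public static void main(String[] args) {
        int n = args.length > 0 ? Integer.parseInt(args[0]) : 1000;
        Random rnd = new Random(42); // fixed seed, so runs are reproducible
        for (int i = 0; i < n; i++) {
            System.out.println(row(i, rnd));
        }
    }
}
```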
Mateusz
I had a week-long break, but I'm back to work now.
I am now focusing my efforts on the output phase,
as I have successfully run complete jobs (i.e. with non-trivial
map and reduce phases) over some (still small) tables.
What are your suggestions on the output part?
I'd like to know whether anybody wants to see any specific
functionality when outputting data to tables.
Mateusz