MapReduce Java extensions

Mateusz Berezecki

Jun 7, 2008, 5:47:31 PM
to hyperta...@googlegroups.com
Hello List,

I've finished a C++ to Java interface, so the InputFormat can
now properly return InputSplits along with their locations.

The files that are part of this interface are both C++ and Java.
I have a small script for compiling them, but I'd like to know
how to compile the Java ones and what the preferred way is of
doing the compilation within the existing Hypertable source tree,
using the existing build process.
I don't have any Java background at all,
so here's how I'm doing it at the moment.

javac -classpath
/Users/m/work/hypertable-mapreduce/lib/hadoop-0.17.0-core.jar:.
TableSplit.java TableInputFormat.java

g++ -c -I/System/Library/Frameworks/JavaVM.framework/Headers
-I/usr/local/0.9.0.5/include -I/usr/local/include/boost-1_34_1 -I.
TableInputFormat.cc

g++ -dynamiclib -o libhypertable.jnilib TableInputFormat.o -framework
JavaVM -lHypertable -lHyperComm -lHyperCommon -llog4cpp -lexpat
-lHyperspace -lz -L/usr/local/0.9.0.5/lib/ -L/usr/local/lib/
-lboost_thread-mt -lboost_iostreams-mt

Unit test compilation
javac -classpath
/Users/m/work/hypertable-mapreduce/lib/hadoop-0.17.0-core.jar:/Users/m/work/hypertable-mapreduce/lib/log4j-1.2.13.jar:/Users/m/work/hypertable-mapreduce/lib/commons-logging-1.0.4.jar:.
TableMapTest.java

Launching unit test
java -classpath
/Users/m/work/hypertable-mapreduce/lib/hadoop-0.17.0-core.jar:/Users/m/work/hypertable-mapreduce/lib/log4j-1.2.13.jar:/Users/m/work/hypertable-mapreduce/lib/commons-logging-1.0.4.jar:.
TableMapTest

Several parts of the compilation process, like the g++ flags, are
platform specific (e.g. the -dynamiclib or -framework flags).
One more thing worth noting is that I did not put the TableInputFormat
class in any package, like org.apache.hadoop.hypertable.mapred,
because I ran into a lot of difficulties trying to compile and test
it that way.

Is anyone willing to help me with this, i.e. setting up the
compilation so that it works cross-platform and is integrated with
the existing build process?

As a side note, I'm going to test the Pipes API for MapReduce
in a couple of hours and see if everything works as expected.


Mateusz
--
Mateusz Berezecki
http://www.meme.pl

Luke

Jun 8, 2008, 3:05:24 PM
to Hypertable Development
So you're developing a JNI extension for Java? I assume this is for
the tracker to split the input and launch jobs. Even if this isn't
part of the client API, it could be part of the Thrift broker
interface, which (when it's finished) will be relatively easier to
deploy since the code is pure Java.

Anyway, can you describe how the whole thing works together? Thanks!

Mateusz Berezecki

Jun 8, 2008, 3:38:03 PM
to hyperta...@googlegroups.com
On Sun, Jun 8, 2008 at 9:05 PM, Luke <vic...@gmail.com> wrote:
>
> So you're developing a JNI extension for Java? I assume this is for
> the tracker to split the input and launch jobs. Even if this isn't
> part of the client API, it could be part of the Thrift broker
> interface, which (when it's finished) will be relatively easier to
> deploy since the code is pure Java.

Yes, you are correct. The JNI extension is for getting information
about the number of tablets in a table and their locations. The C++
side does that using a TableRangeMap class I developed; this class
is strictly for fetching that particular information given a table
name. The information is then retrieved by Java in the form of an
array, and the InputSplit array is built from that very information
coming from the C++ side. In addition to TableSplit (whose parent
class is Hadoop's InputSplit) I created a TableInputFormat which
provides dummy record readers and writers - they do nothing but are
required to satisfy the interface specification. I expected to run
some tests, but I have an exam tomorrow, so the only thing I've
managed so far with a simple identity MapReduce job is getting the
sample application to compile correctly (which turned out to be
harder than I initially thought).
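Roughly, the flow looks like the sketch below. This is only an
illustration: the native method, the property name, and the TableSplit
constructor shown here are placeholders, not the actual code, which
lives in the tree linked below.

import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;

// Sketch only - shows how tablet ranges fetched over JNI could be turned
// into input splits. The native method, property name and TableSplit
// constructor are placeholders for illustration.
public class TableSplitSketch {

  static { System.loadLibrary("hypertable"); }  // the libhypertable.jnilib built earlier

  // Hypothetical JNI entry point backed by the C++ TableRangeMap class:
  // returns one {startRow, endRow, location} triple per tablet of the table.
  private native String[][] getTabletRanges(String tableName);

  public InputSplit[] getSplits(JobConf job, int numSplits) {
    String table = job.get("hypertable.table.name");  // placeholder property name
    String[][] ranges = getTabletRanges(table);
    InputSplit[] splits = new InputSplit[ranges.length];
    for (int i = 0; i < ranges.length; i++) {
      // hypothetical TableSplit(table, startRow, endRow, locations[]) constructor
      splits[i] = new TableSplit(table, ranges[i][0], ranges[i][1],
                                 new String[] { ranges[i][2] });
    }
    return splits;
  }
}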

In case you're interested, you can track the progress here:
http://github.com/mateuszb/hypertable/commits/master

Please note that I have not added a sample application to this tree yet.

Mateusz

--
Mateusz Berezecki
http://www.meme.pl

Mateusz Berezecki

Jun 8, 2008, 3:40:09 PM
to hyperta...@googlegroups.com
On Sun, Jun 8, 2008 at 9:38 PM, Mateusz Berezecki <mate...@gmail.com> wrote:
> On Sun, Jun 8, 2008 at 9:05 PM, Luke <vic...@gmail.com> wrote:
>>
>> So you're developing a JNI extension for Java? I assume this is for
>> the tracker to split the input and launch jobs. Even if this isn't
>> part of the client API, it could be part of the Thrift broker
>> interface, which (when it's finished) will be relatively easier to
>> deploy since the code is pure Java.
>
> Yes, you are correct. The JNI extension is for getting information
> about the number of tablets in a table and their locations. The C++
> side does that using a TableRangeMap class I developed; this class
> is strictly for fetching that particular information given a table
> name. The information is then retrieved by Java in the form of an
> array, and the InputSplit array is built from that very information
> coming from the C++ side. In addition to TableSplit (whose parent
> class is Hadoop's InputSplit) I created a TableInputFormat which
> provides dummy record readers and writers - they do nothing but are
> required to satisfy the interface specification.

One thing worth adding is that the initial design will not contain
Java record readers and writers - which, by the way, would not be that
hard to code, as the JNI interface for them is relatively simple
once you get acquainted with it. The first version's goal is
to support the C++ Pipes API for Hadoop's MapReduce.

Mateusz

Luke

Jun 8, 2008, 9:18:44 PM
to Hypertable Development
Understood. Thanks!

Just curious, I wonder if you tried SWIG's Java extension generator
as well. The problem with a JNI extension is not that it's hard, but
that it's ugly (subjective, of course), slow (like IPC/RPC), and makes
the code hard to distribute. Think of trying to contribute it back to
Hadoop so that the Hypertable support can be bundled; if it's just
pure Java, it's a lot easier.

__Luke

Mateusz Berezecki

Jun 8, 2008, 9:44:19 PM
to hyperta...@googlegroups.com
On Mon, Jun 9, 2008 at 3:18 AM, Luke <vic...@gmail.com> wrote:
>
> Understood. Thanks!

>
>
> Just curious, I wonder if you tried SWIG's Java extension generator
> as well. The problem with a JNI extension is not that it's hard, but
> that it's ugly (subjective, of course), slow (like IPC/RPC), and makes
> the code hard to distribute. Think of trying to contribute it back to
> Hadoop so that the Hypertable support can be bundled; if it's just
> pure Java, it's a lot easier.

It seems to me that custom InputFormat classes are not meant
to be bundled with Hadoop. At least that's the case with HBase and
its TableInputFormat - it sits in the HBase repository.

The reason for that is simple: MapReduce as implemented
in Hadoop does not need InputFormat classes to be
part of Hadoop. They get configured in the job's XML configuration
file, and if they are not part of Hadoop they automatically
get redistributed via HDFS to the nodes participating in the processing.
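For example, on the Java side this is only a matter of naming the class
and shipping the jar that contains it. The following is a generic sketch
of the old mapred API, not the Hypertable-specific code (with Pipes the
same thing is done via the -inputformat and -jar flags, as shown later
in this thread); it assumes TableInputFormat is packaged in Table.jar:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Generic sketch: the custom InputFormat does not need to be bundled with
// Hadoop - it only has to be named in the job configuration and be on the
// job's classpath (e.g. inside the job jar that gets shipped to the nodes).
public class SubmitSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitSketch.class);
    conf.setJobName("identity-over-hypertable");
    conf.setJar("Table.jar");                     // jar containing TableInputFormat
    conf.setInputFormat(TableInputFormat.class);  // point the job at the custom class
    // (mapper, reducer and output settings omitted in this sketch)
    JobClient.runJob(conf);
  }
}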

WRT SWIG - no, I have not tried it, although I was thinking
a little about it. What would be the major difference between SWIG and JNI?

I'm considering moving everything to Java as soon as the thrift broker
is published though.

Mateusz

Luke

Jun 9, 2008, 2:35:58 PM
to Hypertable Development
On Jun 8, 6:44 pm, "Mateusz Berezecki" <mateu...@gmail.com> wrote:
> It seems to me that custom InputFormat classes are not meant
> to be bundled with Hadoop. At least that's the case with HBase and
> its TableInputFormat - it sits in the HBase repository.
>
> The reason for that is simple: MapReduce as implemented
> in Hadoop does not need InputFormat classes to be
> part of Hadoop. They get configured in the job's XML configuration
> file, and if they are not part of Hadoop they automatically
> get redistributed via HDFS to the nodes participating in the processing.

Makes sense.

> WRT SWIG - no, I have not tried it, although I was thinking
> a little about it. What would be the major difference between SWIG and JNI?

SWIG's Java extension uses JNI; it might help hide the platform-specific
compiler/linker stuff for you.

> I'm considering moving everything to Java as soon as the thrift broker
> is published though.

Yeah, make it work first :) I look forward to the results. Thanks!

__Luke

Mateusz Berezecki

Jun 9, 2008, 4:56:51 PM
to hyperta...@googlegroups.com
On Mon, Jun 9, 2008 at 8:35 PM, Luke <vic...@gmail.com> wrote:
>
> On Jun 8, 6:44 pm, "Mateusz Berezecki" <mateu...@gmail.com> wrote:
>> It seems to me that custom InputFormat classes are not meant
>> to be bundled with Hadoop. At least that's the case with HBase and
>> its TableInputFormat - it sits in the HBase repository.
>>
>> The reason for that is simple: MapReduce as implemented
>> in Hadoop does not need InputFormat classes to be
>> part of Hadoop. They get configured in the job's XML configuration
>> file, and if they are not part of Hadoop they automatically
>> get redistributed via HDFS to the nodes participating in the processing.
>
> Makes sense.

But it's still hard to achieve :-) I peeked at how HBase does it, and
currently I have a pretty solid testing environment, but not the
one I'd like to make available to everyone (it requires some work
to set up). I'm aiming for something with zero setup required.


>
>> WRT SWIG - no, I have not tried it, although I was thinking
>> a little about it. What would be the major difference between SWIG and JNI?
>
> SWIG's Java extension uses JNI; it might help hide the platform-specific
> compiler/linker stuff for you.
>

I will definitely look at it this week then.

>> I'm considering moving everything to Java as soon as the thrift broker
>> is published though.
>
> Yeah, make it work first :) I look forward to the results. Thanks!


Some really preliminary stuff (2 tests coming tomorrow and 2 more a
day after tomorrow ;) )

m:work m$ ./hadoop-0.17.0/bin/hadoop pipes -conf mapredconf.xml
-program /mapreduce/applications/mapred -input TEST -jar ./Table.jar
-inputformat TableInputFormat -output test_output
Configuring MapReduce for table TEST
Configuring MapReduce for table TEST
08/06/09 18:18:23 INFO mapred.JobClient: Running job: job_200806091806_0007
08/06/09 18:18:24 INFO mapred.JobClient: map 0% reduce 0%
08/06/09 18:18:29 INFO mapred.JobClient: Task Id :
task_200806091806_0007_m_000000_0, Status : FAILED
java.lang.NullPointerException
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:166)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

task_200806091806_0007_m_000000_0: Configuring MapReduce for table TEST
task_200806091806_0007_m_000000_0: Hadoop Pipes Exception:
RecordReader defined when not needed. at impl/HadoopPipes.cc:648 in
virtual void HadoopPipes::TaskContextImpl::runMap(std::string, int,
bool)
08/06/09 18:18:34 INFO mapred.JobClient: Task Id :
task_200806091806_0007_m_000000_1, Status : FAILED
java.lang.NullPointerException
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:166)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

I'm still trying to figure out how to define a RecordReader in C++
and inform MR not to use the InputFormat's dummy one.
Once that's done I'll start testing it, and if all goes well I'll
try to bundle everything up into some sort of patch.

Mateusz

Mateusz Berezecki

Jun 9, 2008, 9:37:24 PM
to hyperta...@googlegroups.com
On Mon, Jun 9, 2008 at 10:56 PM, Mateusz Berezecki <mate...@gmail.com> wrote:
>>> I'm considering moving everything to Java as soon as the thrift broker
>>> is published though.
>>
>> Yeah, make it work first :) I look forward to the results. Thanks!
>
>
> Some really preliminary stuff (2 tests coming tomorrow and 2 more a
> day after tomorrow ;) )
>

I couldn't give up on this one so easily :-)

m:work m$ ./hadoop-0.17.0/bin/hadoop pipes -conf mapredconf.xml
-program /mapreduce/applications/mapred -input TEST -jar ./Table.jar
-inputformat TableInputFormat -output test_output -jobconf
hadoop.pipes.java.recordreader=false,hadoop.pipes.java.recordwriter=false

Configuring MapReduce for table TEST
Configuring MapReduce for table TEST

08/06/10 03:32:27 INFO mapred.JobClient: Running job: job_200806091806_0037
08/06/10 03:32:28 INFO mapred.JobClient: map 0% reduce 0%
08/06/10 03:32:35 INFO mapred.JobClient: map 100% reduce 0%
08/06/10 03:32:47 INFO mapred.JobClient: map 100% reduce 50%
08/06/10 03:32:49 INFO mapred.JobClient: map 100% reduce 100%
08/06/10 03:32:50 INFO mapred.JobClient: Job complete: job_200806091806_0037
08/06/10 03:32:50 INFO mapred.JobClient: Counters: 14
08/06/10 03:32:50 INFO mapred.JobClient: Job Counters
08/06/10 03:32:50 INFO mapred.JobClient: Launched map tasks=1
08/06/10 03:32:50 INFO mapred.JobClient: Launched reduce tasks=2
08/06/10 03:32:50 INFO mapred.JobClient: Data-local map tasks=1
08/06/10 03:32:50 INFO mapred.JobClient: Map-Reduce Framework
08/06/10 03:32:50 INFO mapred.JobClient: Map input records=0
08/06/10 03:32:50 INFO mapred.JobClient: Map output records=0
08/06/10 03:32:50 INFO mapred.JobClient: Map input bytes=0
08/06/10 03:32:50 INFO mapred.JobClient: Map output bytes=0
08/06/10 03:32:50 INFO mapred.JobClient: Combine input records=0
08/06/10 03:32:50 INFO mapred.JobClient: Combine output records=0
08/06/10 03:32:50 INFO mapred.JobClient: Reduce input groups=0
08/06/10 03:32:50 INFO mapred.JobClient: Reduce input records=0
08/06/10 03:32:50 INFO mapred.JobClient: Reduce output records=0
08/06/10 03:32:50 INFO mapred.JobClient: File Systems
08/06/10 03:32:50 INFO mapred.JobClient: Local bytes read=220
08/06/10 03:32:50 INFO mapred.JobClient: Local bytes written=432

This is the first complete MR run over Hypertable. I'm going to polish it
a bit this week and send more information. The first observation is that
it runs slowly even with empty Map and Reduce phases, but I'm not
drawing any conclusions from that yet.
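For reference, the two -jobconf switches above can also be set in
mapredconf.xml or programmatically; a small sketch that uses only the
property names visible in the command above, nothing else assumed:

import org.apache.hadoop.mapred.JobConf;

// Sketch: the same switches that -jobconf passes on the command line above,
// telling Pipes that the record reader and writer live on the C++ side
// rather than in Java.
public class PipesConfSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    conf.setBoolean("hadoop.pipes.java.recordreader", false);
    conf.setBoolean("hadoop.pipes.java.recordwriter", false);
    System.out.println(conf.getBoolean("hadoop.pipes.java.recordreader", true));
  }
}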

Mateusz

Luke

Jun 9, 2008, 9:52:12 PM
to Hypertable Development
Congrats, Mateusz! Have you tried to compare it (empty map/reduce
tasks) with regular Hadoop jobs and HBase jobs? Even Google's own M/R
implementation has significant scheduling delays, according to their
papers.

Mateusz Berezecki

Jun 10, 2008, 12:59:38 PM
to hyperta...@googlegroups.com
On Tue, Jun 10, 2008 at 3:52 AM, Luke <vic...@gmail.com> wrote:
>
> Congrats, Mateusz! Have you tried to compare it (empty map/reduce
> tasks) with regular Hadoop jobs and HBase jobs? Even Google's own M/R
> implementation has significant scheduling delays, according to their
> papers.
>

Luke, not yet. I haven't tried to compare or relate the execution time to any
other MR implementation yet. But I got output from another simple test.

hypertable> select * from TEST;
key1 column1 value1
key2 column1 value2
key3 column1 value3

Configuring MapReduce for table TEST

Record reader will open table 'TEST'
Scanning range:
start row:
end row:ÿÿ
key: key1
value: value1
key: key2
value: value2
key: key3
value: value3

It is a simple table with some contents. I'm currently testing the map
phase only; the reduce phase still does nothing. I'm thinking of
generating some data to put into the table so I can test on bigger tables.

Do you have any such data generators for testing Hypertable out there?
I'd really appreciate such a tool, as my head is going to explode soon -
there are 3 more final exams coming tomorrow ;-)

Mateusz

Doug Judd

Jun 10, 2008, 2:22:09 PM
to hyperta...@googlegroups.com
Hi Mateusz,

Very cool!  As far as the test data generators go, several have been written, but they're all somewhat ad-hoc.  There is one checked into the tree that's worth taking a look at.  You can find it here:

src/cc/Hypertable/Lib/generate_test_data.cc

The output it generates is suitable for input to the rsclient tool.  It would be good to have one that generates data that can be loaded with LOAD DATA INFILE.

It would be great to have the uber-test-data-generator.  You could imagine having the ability to control every aspect of how the test data gets generated, like how compressible the data for a given column should be, as well as what the data should look like for each column (e.g. numbers, ASCII date strings, English text) and on average how big it should be.

Feel free to file an issue.  It would be a great project for someone to tackle.
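As a very rough illustration (not the generator checked into the tree), something along these lines would be a starting point; the tab-separated row/column/value layout is only an assumption here, so the exact format LOAD DATA INFILE expects should be double-checked:

import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.Random;

// Minimal sketch of a test-data generator. It writes tab-separated
// row/column/value lines; the exact layout expected by LOAD DATA INFILE
// is an assumption and should be verified against the documentation.
public class GenTestData {
  public static void main(String[] args) throws Exception {
    int rows = Integer.parseInt(args[0]);        // e.g. 1000000
    PrintWriter out = new PrintWriter(new FileWriter(args[1]));
    Random rand = new Random(42);                // fixed seed -> reproducible data
    for (int i = 0; i < rows; i++) {
      String key = String.format("row%09d", i);
      // low-entropy values compress well; swap in random strings for the opposite
      String value = "value-" + rand.nextInt(100);
      out.println(key + "\tcolumn1\t" + value);
    }
    out.close();
  }
}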

- Doug

Gordon

Jun 14, 2008, 7:07:21 PM
to hyperta...@googlegroups.com
Great first step! Congratulations! We might have an opportunity to test it soon.

Mateusz Berezecki

Jun 21, 2008, 8:01:19 PM
to hyperta...@googlegroups.com
On Sun, Jun 15, 2008 at 1:07 AM, Gordon <gpa...@gmail.com> wrote:
> Great first step! Congratulations! We might have an opportunity to test it
> soon.

I had a week-long break, but I'm back to work now.

I am now focusing my efforts on the output phase,
as I have successfully run complete jobs (i.e. with non-trivial
map and reduce phases) over some small (for now) tables.

What are your suggestions on the output part?
I'd like to know whether anybody wants to see any specific
functionality when outputting data to tables.

Mateusz
