Dumbo & MongoDB

Nathan

Jun 30, 2011, 2:34:12 PM

to dumbo-user

I was using HBase for a while and was happy when I found the lasthbase
driver on GitHub, which worked great with dumbo. Recently I have started
working with MongoDB and found a mongo-hadoop driver here:

https://github.com/mongodb/mongo-hadoop/

I asked a friend of mine who is much more familiar with Java to
compare the two, to see if we can use the MongoDB classes in
the same way dumbo uses lasthbase.jar. For reference, here are the
input and output format classes for both the HBase and MongoDB projects:

https://github.com/mongodb/mongo-hadoop/tree/master/src/main/com/mongodb/hadoop

https://github.com/tims/lasthbase/tree/master/src/java/fm/last/hbase/mapred

With lasthbase, the input and output information is specified on the
command line, but the mongo-hadoop project has a WordCountXML example
that reads all connection, query, and other configurable information
from an XML file. I like this approach, but have some questions. It
seems as though the lasthbase classes extend a JobConfigurable
class, but it's been a long time since it was updated. mongo-hadoop
does not have this. A LOT of the setup looks the same, but I am
looking for a good starting point for making their classes work with
dumbo.

What is dumbo expecting, or better yet, what is lasthbase sending to
dumbo? What does dumbo need from the jar file to start streaming the
data to the map/reduce job(s)? And how should it be streamed? I don't
know Java, but my friend is willing to try and help get it going if I
can get him all the information possible. To him it SEEMS some things
can be moved around and into the input & output format classes on
mongodb-hadoop, tell it to read the xml file, and then you have
another driver that connects to a document database for use with
dumbo.

But he has no understanding of dumbo, and we could use some assistance.

Nathan

Jun 30, 2011, 11:51:45 PM

to dumbo-user

For instance, I compiled the mongo-hadoop.jar file, and I wanted to
just see what happened. I put the file in my /usr/lib/hadoop-0.20
folder. Then I ran this command just to see what would happen:

dumbo test-in.py -libjar mongo-hadoop.jar -inputformat
com.mongodb.hadoop.mapred.MongoInputFormat -outputformat
com.mongodb.hadoop.mapred.MongoOutputFormat -input
mongodb://localhost/demo.yield_historical.in -output
mongodb://localhost/demo.yield_historical.out

EXEC: PYTHONPATH="/usr/local/lib/python2.7/dist-packages/dumbo-0.21.30-py2.7.egg:$PYTHONPATH" python -m dumbo.cmd encodepipe -file mongodb://localhost/demo.yield_historical.in | PYTHONPATH="/usr/local/lib/python2.7/dist-packages/dumbo-0.21.30-py2.7.egg:$PYTHONPATH" dumbo_mrbase_class='dumbo.backends.common.MapRedBase' dumbo_jk_class='dumbo.backends.common.JoinKey' dumbo_runinfo_class='dumbo.backends.common.RunInfo' python -m test-in map 0 262144000 > 'mongodb://localhost/demo.yield_historical.out'
/bin/sh: cannot create mongodb://localhost/demo.yield_historical.out: Directory nonexistent
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/dist-packages/dumbo-0.21.30-py2.7.egg/dumbo/cmd.py", line 170, in <module>
    sys.exit(dumbo())
  File "/usr/local/lib/python2.7/dist-packages/dumbo-0.21.30-py2.7.egg/dumbo/cmd.py", line 53, in dumbo
    retval = encodepipe(parseargs(sys.argv[2:]))
  File "/usr/local/lib/python2.7/dist-packages/dumbo-0.21.30-py2.7.egg/dumbo/cmd.py", line 133, in encodepipe
    for file in files:
  File "/usr/local/lib/python2.7/dist-packages/dumbo-0.21.30-py2.7.egg/dumbo/cmd.py", line 130, in <genexpr>
    files = (open(f) for f in addedopts['file'])
IOError: [Errno 2] No such file or directory: 'mongodb://localhost/demo.yield_historical.in'

From the error above, it doesn't seem to be picking up the JAR file I
passed on the command line. I just installed dumbo from GitHub today and
Cloudera's CDH3 from their repo. Any tips? Does -libjar still work? I
looked at the source and only found references to libegg, unless I was
looking in the wrong place.
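
A note on the traceback above: the encodepipe command and the open(f) call in it suggest that, without a -hadoop option, dumbo ran the job locally and treated the -input value as an ordinary file path, so the jar and the input format never came into play. A minimal Python sketch of that failure mode follows. The helper name try_open_local is made up for illustration and is not part of dumbo.

```python
def try_open_local(path):
    """Try to open `path` as a plain local file, as a local run would."""
    try:
        with open(path) as handle:
            return handle.read()
    except IOError as exc:  # IOError is an alias of OSError on Python 3
        # A mongodb:// URI is not a file on disk, so opening it fails.
        return "cannot open %r: %s" % (path, exc.__class__.__name__)


if __name__ == "__main__":
    print(try_open_local("mongodb://localhost/demo.yield_historical.in"))
```

The URI only has meaning to a Java input format running inside Hadoop, which is why the same arguments behave differently once the job is actually submitted to a cluster.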

Anyone else interested in using mongodb as their source/sink for
hadoop? :)


Nathan

Jul 1, 2011, 3:56:36 PM

to dumbo-user

I changed my cli argument to this:

dumbo test-in.py -hadoop /usr/lib/hadoop -libjar mongo-hadoop.jar
-inputformat com.mongodb.hadoop.mapred.MongoInputFormat -outputformat
com.mongodb.hadoop.mapred.MongoOutputFormat -input
mongodb://localhost/demo.yield_historical.in -output
mongodb://localhost/demo.yield_historical.out

I added the -hadoop path. It can't find mongo-hadoop.jar now. I
believe I just need to update the HADOOP_CLASSPATH in my install, but
the file IS located in /usr/lib/hadoop along with all the default
jars. My original questions still remain, though, as I stumble my way
through this. What is the interaction between dumbo, mongo-hadoop.jar,
and Hadoop? Are there specific methods that need to be in
place that do certain things? Can the MongoInputClass be altered to
look for an XML file fed in through the CLI (passed through dumbo, of
course)?

I am guessing dumbo would need to be altered. But not sure how all the
communication works, and where in the code. If I can get a better
understanding, I am going to fork the project, and create an "addon"
that allows for mongo access.

This group seems pretty dead lately though...

Nathan

Jul 1, 2011, 11:53:55 PM

to dumbo-user

OK, a little bit further. I added the mongo-java driver and
mongo-hadoop.jar to the HADOOP_CLASSPATH and added the -conf wordcount.xml
file from their example project. Now I am getting this error:

2011-07-01 22:49:13,688 INFO org.apache.hadoop.mapred.Task: Cleaning up job
2011-07-01 22:49:13,688 INFO org.apache.hadoop.mapred.Task: Aborting job with runstate : FAILED
2011-07-01 22:49:13,729 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2011-07-01 22:49:13,731 WARN org.apache.hadoop.mapred.Child: Error running child
java.io.IOException: No FileSystem for scheme: mongodb
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1511)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1548)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1530)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:228)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:183)
    at org.apache.hadoop.mapred.FileOutputCommitter.cleanupJob(FileOutputCommitter.java:94)
    at org.apache.hadoop.mapred.FileOutputCommitter.abortJob(FileOutputCommitter.java:112)
    at org.apache.hadoop.mapred.OutputCommitter.abortJob(OutputCommitter.java:185)
    at org.apache.hadoop.mapred.Task.runJobCleanupTask(Task.java:948)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:309)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
2011-07-01 22:49:13,734 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
2011-07-01 22:49:13,735 WARN org.apache.hadoop.mapred.FileOutputCommitter: java.io.IOException: No FileSystem for scheme: mongodb
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1511)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1548)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1530)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:228)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:183)
    at org.apache.hadoop.mapred.FileOutputCommitter.getTempTaskOutputPath(FileOutputCommitter.java:234)
    at org.apache.hadoop.mapred.FileOutputCommitter.abortTask(FileOutputCommitter.java:179)
    at org.apache.hadoop.mapred.OutputCommitter.abortTask(OutputCommitter.java:233)
    at org.apache.hadoop.mapred.Task.taskCleanup(Task.java:933)
    at org.apache.hadoop.mapred.Child$5.run(Child.java:300)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:297)

2011-07-01 22:49:13,735 WARN org.apache.hadoop.mapred.FileOutputCommitter: Error discarding outputjava.io.IOException: No FileSystem for scheme: mongodb
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1511)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1548)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1530)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:228)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:183)
    at org.apache.hadoop.mapred.FileOutputCommitter.abortTask(FileOutputCommitter.java:182)
    at org.apache.hadoop.mapred.OutputCommitter.abortTask(OutputCommitter.java:233)
    at org.apache.hadoop.mapred.Task.taskCleanup(Task.java:933)
    at org.apache.hadoop.mapred.Child$5.run(Child.java:300)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:297)

Nathan

Jul 2, 2011, 1:11:17 AM

to dumbo-user

Even closer. I am doing a simple word count on the "in" collection of
the test db (from the mongo-hadoop README) with a simple dumbo
map/reduce job. It starts up just fine, but fails on the map job(s). It
never gets to reducing; instead it throws this error. The
"4e0e98380bfb6ce2d9091ea6" ObjectId is the id from the db.in
collection in the test db.

java.io.IOException: Can't write: 4e0e98380bfb6ce2d9091ea6 as class org.bson.types.ObjectId
    at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:162)
    at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:70)
    at org.apache.hadoop.typedbytes.TypedBytesWritableOutput.writeWritable(TypedBytesWritableOutput.java:217)
    at org.apache.hadoop.typedbytes.TypedBytesWritableOutput.write(TypedBytesWritableOutput.java:136)
    at org.apache.hadoop.streaming.io.TypedBytesInputWriter.writeTypedBytes(TypedBytesInputWriter.java:57)
    at org.apache.hadoop.streaming.io.TypedBytesInputWriter.writeKey(TypedBytesInputWriter.java:47)
    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:108)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:390)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)


Klaas Bosteels

Jul 2, 2011, 1:03:36 PM

to dumbo...@googlegroups.com

Hi Nathan,

Based on what you told us, I don't think there's a real difference between how the two take configuration params. The mongodb example probably just makes use of the possibility that Hadoop provides for putting the params in an xml file and reading them from that file instead of passing them directly.
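
To make the XML route concrete: a Hadoop configuration file is a list of property elements, each with a name and a value, and it can be handed to a job with -conf. The Python sketch below only parses such a file to show its shape. The property names and values in it are assumptions for illustration, so check the mongo-hadoop example for the names it actually reads.

```python
import xml.etree.ElementTree as ET

# Illustrative Hadoop-style configuration; names and values are assumptions.
EXAMPLE_CONF = """
<configuration>
  <property>
    <name>mongo.input.uri</name>
    <value>mongodb://localhost/test.in</value>
  </property>
  <property>
    <name>mongo.output.uri</name>
    <value>mongodb://localhost/test.out</value>
  </property>
</configuration>
"""


def parse_hadoop_conf(xml_text):
    """Return a dict mapping property names to their values."""
    root = ET.fromstring(xml_text.strip())
    conf = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            conf[name.strip()] = (value or "").strip()
    return conf


if __name__ == "__main__":
    for key, value in sorted(parse_hadoop_conf(EXAMPLE_CONF).items()):
        print("%s = %s" % (key, value))
```

Whether a setting arrives through such a file or as a command-line option, the job sees it as the same kind of name/value pair, which matches the point that the two drivers do not really differ here.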

To make mongo input or output work, you will need to write a custom input or output format that writes or reads typed bytes writables. I haven't looked at the code much, but you might be able to do this by wrapping the mongo-hadoop formats. You should be able to figure out how to work with typed bytes writables by having a look at the lasthbase code.
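
For orientation, typed bytes is a simple binary encoding: each value is a one-byte type code followed by its payload. The Python sketch below encodes only integers and strings, assuming the type codes documented for Hadoop's typedbytes package (3 for a 32-bit integer, 7 for a UTF-8 string with a 4-byte big-endian length prefix). It illustrates the wire format only and is not a replacement for the typedbytes library that dumbo uses.

```python
import struct

TYPE_INT = 3     # assumed type code: 32-bit signed integer
TYPE_STRING = 7  # assumed type code: length-prefixed UTF-8 string


def encode(value):
    """Encode an int or str as a single typed bytes value."""
    if isinstance(value, bool):
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(value, int):
        # One type byte, then the integer as 4 big-endian bytes.
        # Values outside the 32-bit range raise struct.error.
        return struct.pack(">bi", TYPE_INT, value)
    if isinstance(value, str):
        data = value.encode("utf-8")
        # One type byte, a 4-byte length, then the raw UTF-8 bytes.
        return struct.pack(">bi", TYPE_STRING, len(data)) + data
    raise TypeError("unsupported type: %s" % type(value).__name__)


def encode_pair(key, value):
    """A key/value record is the encoded key followed by the encoded value."""
    return encode(key) + encode(value)
```

A custom input format has to hand streaming keys and values that can be written this way, which is why a raw Mongo type has to be converted first.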

Also, to use (Java) input or output formats you need to run on Hadoop. That's the reason why the local run you pasted in one of your emails failed miserably.

Sorry for the late answer, and please share your code if you figure out how to do this!

Regards,

-Klaas


Nathan

Jul 2, 2011, 2:09:50 PM

to dumbo-user

Thanks for your reply. As of the last message I posted, it's reading from
MongoDB just fine, and their mongo-hadoop driver uses TypedBytes as
well. This is the error I am currently struggling with:

java.io.IOException: Can't write: 4e0e98380bfb6ce2d9091ea6 as class org.bson.types.ObjectId

4e0e98380bfb6ce2d9091ea6 is the mongodb objectId string of the first
record in my test collection, so I know it's able to access the data.
Also, in the error stack trace, it outputs this:

org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:162)
at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:70)
at org.apache.hadoop.typedbytes.TypedBytesWritableOutput.writeWritable(TypedBytesWritableOutput.java:217)

So I know their driver is trying to use typed bytes. They have working
examples in pure Java, but I have grown accustomed to dumbo, and would
like to use it and help this project grow. Supposedly the project
supports streaming jobs too, so there should be no problem working
with dumbo as is once everything is figured out. I am not sure what is
happening yet, but I will share as soon as I have something working. I
also encourage anyone else interested to please take a look or share
their opinions. :)
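
One way to read this error: the record key still holds an org.bson.types.ObjectId, which the writer in the stack trace does not know how to serialize, so the value would need converting (for example to its hex string) before it reaches streaming. That conversion belongs on the Java side, in a wrapper format. The Python sketch below only illustrates the idea, with a hypothetical FakeObjectId class standing in for the real one.

```python
class FakeObjectId(object):
    """Hypothetical stand-in for a MongoDB ObjectId (not the real class)."""

    def __init__(self, hex_string):
        self._hex = hex_string

    def __str__(self):
        return self._hex


# Plain types that a typed-bytes style encoder can be expected to handle.
SUPPORTED = (bool, int, float, str, bytes)


def to_serializable(value):
    """Convert a value into something built only from plain types."""
    if value is None or isinstance(value, SUPPORTED):
        return value
    if isinstance(value, dict):
        return dict((to_serializable(k), to_serializable(v))
                    for k, v in value.items())
    if isinstance(value, (list, tuple)):
        return [to_serializable(item) for item in value]
    # Anything else (such as an ObjectId) falls back to its string form.
    return str(value)
```

For example, a document whose _id is an ObjectId comes out with a plain string id while its other plain fields are left untouched.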


Nathan

Jul 2, 2011, 2:35:59 PM

to dumbo-user

I get what you are saying though. I am going to try and create a
wrapper this weekend, but don't expect much success since I am not a
Java guy. :)

They have a lot of the same methods in their input & output formats,
but are there specific methods that must be overridden? Are there very
specific things that MUST happen in the input & output formats? Any
tips are appreciated. Hopefully this is pretty straightforward, as
there are only two classes to mess with.

Nathan

Jul 2, 2011, 4:02:51 PM

to dumbo-user

The odd thing is that it can't find this package when I try to import it
(I have all my jars in the build path, including hadoop-streaming):

import org.apache.hadoop.typedbytes.TypedBytesWritable;

It says there is no typedbytes package in hadoop. Eclipse tries to
resolve this error by importing the hadoop-streaming.jar from the
lasthbase project. I have looked, and this is definitely not a
deprecated class, so it should be there. I don't know what the
problem is.

Nathan

Jul 2, 2011, 9:03:40 PM

to dumbo-user

I feel so close. This class mimics theirs, but uses
TypedBytesWritable instead of BSONObjects.

@SuppressWarnings("deprecation")
public class TypedBytesTableInputFormat implements
        InputFormat<TypedBytesWritable, TypedBytesWritable> {

    @Override
    public RecordReader<TypedBytesWritable, TypedBytesWritable>
            getRecordReader(InputSplit split, JobConf job, Reporter reporter) {

        if (!(split instanceof MongoInputSplit))
            throw new IllegalStateException(
                "Creation of a new RecordReader requires a MongoInputSplit instance.");

        final MongoInputSplit mis = (MongoInputSplit) split;

        return (RecordReader<TypedBytesWritable, TypedBytesWritable>)
                new TypedBytesMongoRecordReader(mis);
    }
    ....
    ....
    ....
    ....

public class TypedBytesMongoRecordReader extends
        RecordReader<TypedBytesWritable, TypedBytesWritable> {

    public TypedBytesMongoRecordReader(MongoInputSplit mis) {
        _cursor = mis.getCursor();
    }
    ...
    ...
    ...
    ...

Unfortunately I get this error:

java.lang.ClassCastException: com.mongodb.hadoop.input.TypedBytesMongoRecordReader cannot be cast to org.apache.hadoop.mapred.RecordReader
    at com.mongodb.hadoop.TypedBytesTableInputFormat.getRecordReader(TypedBytesTableInputFormat.java:31)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:370)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)

I feel so close! Not sure why I get a ClassCastException when my
TypedBytesMongoRecordReader is a child of the RecordReader. Any Java
people care to chime in?
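
One possible explanation, offered as a guess from the snippet rather than a diagnosis: Hadoop ships two unrelated types named RecordReader, an interface in org.apache.hadoop.mapred and an abstract class in org.apache.hadoop.mapreduce. Since the reader above uses "extends", it may be inheriting from the mapreduce one, while the old-API InputFormat expects the mapred one, so the cast fails even though the names match. The same effect in a self-contained Python analogy, where the mapred and mapreduce namespaces are stand-ins and not Hadoop bindings:

```python
import types

# Two stand-in "packages", each defining its own, unrelated RecordReader type.
mapred = types.SimpleNamespace(
    RecordReader=type("RecordReader", (object,), {}))
mapreduce = types.SimpleNamespace(
    RecordReader=type("RecordReader", (object,), {}))


class TypedBytesMongoRecordReader(mapreduce.RecordReader):
    """Inherits from the 'new API' base, as `extends RecordReader` may have."""


def get_record_reader(reader):
    """Mimic the old-API check: accept only mapred.RecordReader instances."""
    if not isinstance(reader, mapred.RecordReader):
        raise TypeError("%s cannot be cast to mapred.RecordReader"
                        % type(reader).__name__)
    return reader
```

If that is the cause, having the reader implement the org.apache.hadoop.mapred.RecordReader interface instead of extending the other class should let the cast succeed.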

Nathan

Jul 2, 2011, 10:31:03 PM

to dumbo-user

OK, I got it reading records just fine. It completes the M/R job, but
it's not writing it to the database. I am not getting errors though.
It says output written to test.out (the db.collection_name I am trying
to write to in MongoDB), but there is nothing in that hadoop fs folder
except an empty _SUCCESS file and a bunch of logs.

So I don't know where my output is going.

Nathan

Jul 3, 2011, 8:45:10 PM

to dumbo-user

OK, everything is reading from and writing to MongoDB using the dumbo
wordcount demo. The columns it writes to are hard-coded for now, but I
will make that a configurable property in the XML file, so you can choose
where the values are output. Also, right now it will probably only let
you write to one collection, with a single key/value pair. If it becomes
necessary to save actual BSONObjects with multiple k/v pairs, I will try
that next.

But it's working. Woop woop!
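
For readers who have not seen it, a dumbo word count job of the kind referred to here looks roughly like the sketch below. The mapper/reducer pair follows dumbo's standard example. The assumption that each Mongo record's value reaches the mapper as a line of text is mine, and run_locally is a made-up helper for trying the functions without Hadoop.

```python
def mapper(key, value):
    """Emit (word, 1) for every whitespace-separated word in the value."""
    for word in value.split():
        yield word, 1


def reducer(key, values):
    """Sum the counts collected for one word."""
    yield key, sum(values)


def run_locally(records):
    """Tiny in-memory stand-in for the shuffle, for trying the functions out."""
    grouped = {}
    for key, value in records:
        for word, count in mapper(key, value):
            grouped.setdefault(word, []).append(count)
    results = {}
    for word in sorted(grouped):
        for out_key, total in reducer(word, grouped[word]):
            results[out_key] = total
    return results


# Under dumbo the script would end with:
#     import dumbo
#     dumbo.run(mapper, reducer)
```

With the input and output formats doing the MongoDB work on the Java side, the Python job itself stays unchanged from the file-based version.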

Nathan

Jul 4, 2011, 12:53:09 PM

to dumbo-user

Haha. Feels like a long journey just in this thread from "I don't know
Java" to "Hey I got it working!"

Anyways, I am going to try and do some tweaks to it so you can store
the output document structure in the XML file and have all the data
loaded into the driver instead of on the command line. I have it
checked in on github right now, but it only works if I hard-code the
output fields in the driver. Working on making it more robust.


Klaas Bosteels

Jul 4, 2011, 12:54:28 PM

to dumbo...@googlegroups.com

Cool, thanks for sharing!

-K

Jon Eisen

Aug 31, 2012, 12:08:47 PM

to dumbo...@googlegroups.com

Hey Nathan, did you ever publish your code to get that working? I'm working on the same thing right now.


Paul DeCoursey

Oct 31, 2012, 2:55:08 PM

to dumbo...@googlegroups.com

I'm also curious about sample code. I can't get dumbo to talk to mongo for the life of me.

Paul DeCoursey

Nov 7, 2012, 2:28:46 PM

to dumbo...@googlegroups.com

Ok, I've got it working, but it won't do splits... and why the heck would I even want to use Hadoop if I can't do splits!?!?
