I was using HBase for a while and was happy when I found the lasthbase
driver on github that worked great with dumbo. Recently I have started
working with MongoDB and found a mongodb-hadoop driver here:
I asked a friend of mine who is much more familiar with Java to
compare the two, to see if we can use the mongodb classes easily in
the same way dumbo uses the lasthbase.jar. For reference, here are the
Input & Output format classes for both the HBase & mongodb projects:
With lasthbase, the input & output information is specified on the
command line, but in the mongodb, they have a WordCountXML example
that reads all connection, query, and other configurable information
from an XML file. I liked this approach, but had some questions. It
seems as though the lasthbase classes extended a JobConfigurable
class, but it's been a long time since it's been updated. Mongodb-hadoop
does not have this. A LOT of the setup looks the same, but I was
looking for a good starting point on making their classes work with
dumbo.
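For anyone who has not seen the XML approach mentioned above: Hadoop job settings can live in a file of name/value properties that is handed to the job instead of being spelled out on the command line. The sketch below only shows the shape of such a file and how the pairs come out of it. The property names `mongo.input.uri` and `mongo.output.uri` are an assumption about what mongo-hadoop reads, and `read_hadoop_conf` is a made-up helper, not part of dumbo or Hadoop.

```python
import xml.etree.ElementTree as ET

# Hadoop-style configuration: a <configuration> root holding <property>
# elements, each with a <name> and a <value>. The two property names are
# assumptions for illustration; the URIs are the ones used in this thread.
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>mongo.input.uri</name>
    <value>mongodb://localhost/demo.yield_historical.in</value>
  </property>
  <property>
    <name>mongo.output.uri</name>
    <value>mongodb://localhost/demo.yield_historical.out</value>
  </property>
</configuration>
"""


def read_hadoop_conf(xml_text):
    """Return the name/value pairs of a Hadoop-style configuration as a dict."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }


if __name__ == "__main__":
    for name, value in sorted(read_hadoop_conf(SAMPLE).items()):
        print(name, "=", value)
```

Whether the values arrive through such a file or as individual options, the job ends up with the same set of key/value pairs.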
What is dumbo expecting, or better yet, what is lasthbase sending to
dumbo? What does dumbo need from the jar file to start streaming the
data to the map/reduce job(s)? And how should it be streamed? I don't
know Java, but my friend is willing to try and help get it going if I
can get him all the information possible. To him it SEEMS some things
can be moved around and into the input & output format classes on
mongodb-hadoop, tell it to read the xml file, and then you have
another driver that connects to a document database for use with
dumbo.
But he has no understanding of dumbo, and we could use some assistance.
For instance, I compiled the mongo-hadoop.jar file, and I wanted to
just see what happened. I put the file in my /usr/lib/hadoop-0.20
folder. Then I ran this command just to see what would happen:
XEC: PYTHONPATH="/usr/local/lib/python2.7/dist-packages/dumbo-0.21.30-py2.7.egg:$PYTHONPATH" python -m dumbo.cmd encodepipe -file mongodb://localhost/demo.yield_historical.in | PYTHONPATH="/usr/local/lib/python2.7/dist-packages/dumbo-0.21.30-py2.7.egg:$PYTHONPATH" dumbo_mrbase_class='dumbo.backends.common.MapRedBase' dumbo_jk_class='dumbo.backends.common.JoinKey' dumbo_runinfo_class='dumbo.backends.common.RunInfo' python -m test-in map 0 262144000 > 'mongodb://localhost/demo.yield_historical.out'
/bin/sh: cannot create mongodb://localhost/demo.yield_historical.out: Directory nonexistent
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/dist-packages/dumbo-0.21.30-py2.7.egg/dumbo/cmd.py", line 170, in <module>
    sys.exit(dumbo())
  File "/usr/local/lib/python2.7/dist-packages/dumbo-0.21.30-py2.7.egg/dumbo/cmd.py", line 53, in dumbo
    retval = encodepipe(parseargs(sys.argv[2:]))
  File "/usr/local/lib/python2.7/dist-packages/dumbo-0.21.30-py2.7.egg/dumbo/cmd.py", line 133, in encodepipe
    for file in files:
  File "/usr/local/lib/python2.7/dist-packages/dumbo-0.21.30-py2.7.egg/dumbo/cmd.py", line 130, in <genexpr>
    files = (open(f) for f in addedopts['file'])
IOError: [Errno 2] No such file or directory: 'mongodb://localhost/demo.yield_historical.in'
From the error above, it doesn't seem to be picking up the JAR file I
passed on the command line. I just installed dumbo from github today and
Cloudera's CDH3 from their repo. Any tips? Does -libjar still work? I
looked at the source and only found references to libegg, unless I was
looking in the wrong place.
Anyone else interested in using mongodb as their source/sink for
hadoop? :)
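The last frame of the traceback above is a plain `open(f)` on each `-file` argument, so in this local pipeline the `mongodb://` URI is treated as a path on the local disk and no Java class is ever involved. A minimal sketch of that behaviour follows; the helper names `open_inputs` and `try_local_read` are made up for illustration, and it is written for Python 3 rather than the 2.7 used in the thread.

```python
def open_inputs(paths):
    """Mimic the generator from the traceback: open every input as a local file."""
    return (open(path) for path in paths)


def try_local_read(paths):
    """Return None if every path opens, otherwise the OSError that was raised."""
    try:
        for handle in open_inputs(paths):
            handle.close()
    except OSError as exc:  # IOError is an alias of OSError on Python 3
        return exc
    return None


if __name__ == "__main__":
    error = try_local_read(["mongodb://localhost/demo.yield_historical.in"])
    print(type(error).__name__, error)
```

The shell redirect to the `.out` URI fails for the same reason: `/bin/sh` also sees it as a file name.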
On Jun 30, 1:34 pm, Nathan <nbyl...@gmail.com> wrote:
> I was using HBase for a while and was happy when I found the lasthbase
> driver on github that worked great with dumbo. [...]
After adding the -hadoop path, it now can't find the mongo-hadoop.jar. I
believe I just need to update the HADOOP_CLASSPATH in my install, but
the file IS located in /usr/lib/hadoop along with all the default
jars. My original questions still remain though as I stumble my way
through this. What is the interaction between dumbo, the
mongo-hadoop.jar, and hadoop? Are there specific methods that need to be
in place that do certain things? Can the MongoInputClass be altered to
look for an xml file fed in through the cli (passed through dumbo, of
course)?
I am guessing dumbo would need to be altered, but I am not sure how all
the communication works, or where in the code it happens. If I can get a
better understanding, I am going to fork the project and create an
"addon" that allows for mongo access.
This group seems pretty dead lately though...
On Jun 30, 10:51 pm, Nathan <nbyl...@gmail.com> wrote:
> For instance, I compiled the mongo-hadoop.jar file, and I wanted to
> just see what happened. [...]
OK, a little bit further. I added the mongo-java driver & the
mongo-hadoop.jar to the HADOOP_CLASSPATH and passed -conf with the
wordcount.xml file from their example project. Now I am getting this error:
2011-07-01 22:49:13,688 INFO org.apache.hadoop.mapred.Task: Cleaning up job
2011-07-01 22:49:13,688 INFO org.apache.hadoop.mapred.Task: Aborting job with runstate : FAILED
2011-07-01 22:49:13,729 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2011-07-01 22:49:13,731 WARN org.apache.hadoop.mapred.Child: Error running child
java.io.IOException: No FileSystem for scheme: mongodb
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1511)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1548)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1530)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:228)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:183)
    at org.apache.hadoop.mapred.FileOutputCommitter.cleanupJob(FileOutputCommitter.java:94)
    at org.apache.hadoop.mapred.FileOutputCommitter.abortJob(FileOutputCommitter.java:112)
    at org.apache.hadoop.mapred.OutputCommitter.abortJob(OutputCommitter.java:185)
    at org.apache.hadoop.mapred.Task.runJobCleanupTask(Task.java:948)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:309)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
2011-07-01 22:49:13,734 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
2011-07-01 22:49:13,735 WARN org.apache.hadoop.mapred.FileOutputCommitter: java.io.IOException: No FileSystem for scheme: mongodb
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1511)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1548)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1530)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:228)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:183)
    at org.apache.hadoop.mapred.FileOutputCommitter.getTempTaskOutputPath(FileOutputCommitter.java:234)
    at org.apache.hadoop.mapred.FileOutputCommitter.abortTask(FileOutputCommitter.java:179)
    at org.apache.hadoop.mapred.OutputCommitter.abortTask(OutputCommitter.java:233)
    at org.apache.hadoop.mapred.Task.taskCleanup(Task.java:933)
    at org.apache.hadoop.mapred.Child$5.run(Child.java:300)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:297)
2011-07-01 22:49:13,735 WARN org.apache.hadoop.mapred.FileOutputCommitter: Error discarding outputjava.io.IOException: No FileSystem for scheme: mongodb
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1511)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1548)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1530)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:228)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:183)
    at org.apache.hadoop.mapred.FileOutputCommitter.abortTask(FileOutputCommitter.java:182)
    at org.apache.hadoop.mapred.OutputCommitter.abortTask(OutputCommitter.java:233)
    at org.apache.hadoop.mapred.Task.taskCleanup(Task.java:933)
    at org.apache.hadoop.mapred.Child$5.run(Child.java:300)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:297)
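All three traces above pass through FileOutputCommitter and end in a FileSystem lookup for the output path, which suggests the job is still treating the `mongodb://` output URI as a file location. As far as I understand Hadoop 0.20, the lookup is by URI scheme through an `fs.<scheme>.impl` configuration key. The following is a toy model of that lookup in Python, not Hadoop's actual code; `CONF` and `filesystem_for` are names invented for this sketch.

```python
from urllib.parse import urlparse

# Toy stand-in for the job configuration. The two entries mirror default
# "fs.<scheme>.impl" settings; note that nothing is registered for "mongodb".
CONF = {
    "fs.file.impl": "org.apache.hadoop.fs.LocalFileSystem",
    "fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem",
}


def filesystem_for(path, conf=CONF):
    """Return the class name registered for the path's scheme, or raise IOError."""
    scheme = urlparse(path).scheme or "file"  # bare paths count as local files
    impl = conf.get("fs.{}.impl".format(scheme))
    if impl is None:
        raise IOError("No FileSystem for scheme: {}".format(scheme))
    return impl


if __name__ == "__main__":
    print(filesystem_for("hdfs://namenode/user/nathan/out"))
    try:
        filesystem_for("mongodb://localhost/demo.yield_historical.out")
    except IOError as exc:
        print(exc)
```

In other words, the message is about the committer doing file-style cleanup on the output location, not about the mongo-hadoop jar being missing from the classpath.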
On Jul 1, 2:56 pm, Nathan <nbyl...@gmail.com> wrote:
> Adding the -hadoop path. It can't find the mongo-hadoop.jar now. [...]
Even closer. I am doing a simple word count on the "in" collection of the
test db (from the mongo-hadoop README) using a simple dumbo map reduce
job. It starts up just fine, but fails on the map job(s). It never gets
to reducing; instead it throws this error. The
"4e0e98380bfb6ce2d9091ea6" objectId is the id from the db.in
collection in the test db.
java.io.IOException: Can't write: 4e0e98380bfb6ce2d9091ea6 as class org.bson.types.ObjectId
    at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:162)
    at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:70)
    at org.apache.hadoop.typedbytes.TypedBytesWritableOutput.writeWritable(TypedBytesWritableOutput.java:217)
    at org.apache.hadoop.typedbytes.TypedBytesWritableOutput.write(TypedBytesWritableOutput.java:136)
    at org.apache.hadoop.streaming.io.TypedBytesInputWriter.writeTypedBytes(TypedBytesInputWriter.java:57)
    at org.apache.hadoop.streaming.io.TypedBytesInputWriter.writeKey(TypedBytesInputWriter.java:47)
    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:108)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:390)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
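The trace above fails inside the streaming layer (TypedBytesInputWriter.writeKey) while it serializes the record key, an org.bson.types.ObjectId, before anything is handed to the Python mapper. The sketch below is not the Hadoop code. It is a small Python encoder that follows the typed bytes layout as I understand it (one type-code byte, then a big-endian payload) and shows why a serializer that only knows a fixed set of types rejects such a key, while the same id as a plain string goes through. `encode_typedbytes` and `FakeObjectId` are hypothetical names for illustration only.

```python
import struct


def encode_typedbytes(value):
    """Encode a few basic values as a type-code byte plus big-endian payload."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return struct.pack(">BB", 2, int(value))       # code 2: boolean
    if isinstance(value, int):
        if -2**31 <= value < 2**31:
            return struct.pack(">Bi", 3, value)        # code 3: 32-bit integer
        return struct.pack(">Bq", 4, value)            # code 4: 64-bit long
    if isinstance(value, float):
        return struct.pack(">Bd", 6, value)            # code 6: double
    if isinstance(value, str):
        data = value.encode("utf-8")
        return struct.pack(">Bi", 7, len(data)) + data  # code 7: length + UTF-8
    raise TypeError(
        "Can't write: {} as {}".format(value, type(value).__name__)
    )


class FakeObjectId(object):
    """Hypothetical stand-in for org.bson.types.ObjectId."""

    def __init__(self, hex_id):
        self.hex_id = hex_id

    def __str__(self):
        return self.hex_id


if __name__ == "__main__":
    key = FakeObjectId("4e0e98380bfb6ce2d9091ea6")
    try:
        encode_typedbytes(key)
    except TypeError as exc:
        print("rejected:", exc)
    print("as string:", encode_typedbytes(str(key)))
```

The point of the sketch is only that the key type has to be one the serializer knows about before the record reaches the streaming boundary.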
On Jul 1, 10:53 pm, Nathan <nbyl...@gmail.com> wrote:
> OK a little bit farther. [...]
Based on what you told us, I don't think there's a real difference between how the two take configuration params. The mongodb example probably just makes use of the possibility that Hadoop provides for putting the params in an xml file and reading them from that file instead of passing them directly.
To make mongo input or output work, you will need to write a custom input or output format that writes or reads typed bytes writables. I haven't looked at the code much, but you might be able to do this by wrapping the mongo-hadoop formats. You should be able to figure out how to work with typed bytes writables by having a look at the lasthbase code.
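The wrapping idea in the paragraph above can be sketched roughly as follows. A real wrapper would be a Java input format whose record reader delegates to the mongo-hadoop one and emits typed bytes writables; the Python below only illustrates the delegate-and-convert step, under the assumption that anything that is not a basic type can be represented by its string form. `WrappedReader`, `to_basic`, and `FakeObjectId` are hypothetical names, not part of dumbo, lasthbase, or mongo-hadoop.

```python
BASIC_TYPES = (bool, int, float, str, bytes)


def to_basic(value):
    """Recursively reduce a value to basic types, lists, and dicts."""
    if isinstance(value, BASIC_TYPES):
        return value
    if isinstance(value, dict):
        return {to_basic(k): to_basic(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_basic(item) for item in value]
    return str(value)  # fallback, e.g. an object id becomes its hex string


class WrappedReader(object):
    """Hypothetical wrapper: delegates to an underlying source of
    (key, value) records and yields them with basic types only."""

    def __init__(self, records):
        self._records = records

    def __iter__(self):
        for key, value in self._records:
            yield to_basic(key), to_basic(value)


class FakeObjectId(object):
    """Hypothetical stand-in for a driver-specific id type."""

    def __init__(self, hex_id):
        self.hex_id = hex_id

    def __str__(self):
        return self.hex_id


if __name__ == "__main__":
    oid = FakeObjectId("4e0e98380bfb6ce2d9091ea6")
    source = [(oid, {"_id": oid, "x": "hello world"})]
    for key, value in WrappedReader(source):
        print(key, value)
```

The underlying reader stays untouched; only the types crossing into the streaming layer change.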
Also, to use (Java) input or output formats you need to run on Hadoop. That's the reason why the local run you pasted in one of your emails failed miserably.
Sorry for the late answer, and please share your code if you figure out how to do this!
On Thu, Jun 30, 2011 at 8:34 PM, Nathan <nbyl...@gmail.com> wrote:
> I was using HBase for a while and was happy when I found the lasthbase
> driver on github that worked great with dumbo. [...]
> --
> You received this message because you are subscribed to the Google Groups
> "dumbo-user" group.
> To post to this group, send email to dumbo-user@googlegroups.com.
> To unsubscribe from this group, send email to
> dumbo-user+unsubscribe@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/dumbo-user?hl=en.
Thanks for your reply. As of the last message I posted, it's reading from
MongoDB just fine, and their mongodb-hadoop driver uses TypedBytes as
well. This is the error I am currently struggling with:
java.io.IOException: Can't write: 4e0e98380bfb6ce2d9091ea6 as class
org.bson.types.ObjectId
4e0e98380bfb6ce2d9091ea6 is the mongodb objectId string of the first
record in my test collection, so I know it's able to access the data.
Also, in the error stack trace, it outputs this:
    at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:162)
    at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:70)
    at org.apache.hadoop.typedbytes.TypedBytesWritableOutput.writeWritable(TypedBytesWritableOutput.java:217)
So I know their driver is trying to use typed bytes. They have working
examples in pure Java, but I have grown accustomed to dumbo, and would
like to use it and help this project grow. Supposedly the project
supports streaming jobs too, so there should be no problem working
with dumbo as is once everything is figured out. I am not sure what is
happening yet, but I will share as soon as I have something working. I
also encourage anyone else interested to please take a look or share
their opinions. :)
On Jul 2, 12:03 pm, Klaas Bosteels <klaas.boste...@gmail.com> wrote:
> Based on what you told us, I don't think there's a real difference between
> how the two take configuration params. [...]
> Regards,
> -Klaas
I get what you are saying though. I am going to try to create a
wrapper this weekend, but don't expect much success since I am not a
Java guy. :)
They have a lot of the same methods in their input & output formats,
but are there specific methods that must be overridden? Are there very
specific things that MUST happen in the input & output formats? Any
tips are appreciated. Hopefully this is pretty straightforward, as
there are only two classes to mess with.
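The odd thing is it can't find this package when I try to import it
(I have all my jars in the build path, including the hadoop streaming
jar): it says there is no typedbytes package in hadoop. Eclipse tries
to resolve this error by importing the hadoop-streaming.jar from the
lasthbase project. I have looked, and this is definitely not a
deprecated method, so it should be there, and I don't know what the
problem is.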
java.lang.ClassCastException: com.mongodb.hadoop.input.TypedBytesMongoRecordReader cannot be cast to org.apache.hadoop.mapred.RecordReader
    at com.mongodb.hadoop.TypedBytesTableInputFormat.getRecordReader(TypedBytesTableInputFormat.java:31)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:370)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
I feel so close! Not sure why I get a ClassCastException when my
TypedBytesMongoRecordReader is a child of the RecordReader. Any Java
people care to chime in?
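A possible explanation, guessed from the stack trace rather than confirmed in this thread: Hadoop ships two unrelated types named RecordReader, the old interface org.apache.hadoop.mapred.RecordReader and the newer abstract class org.apache.hadoop.mapreduce.RecordReader. Streaming jobs go through the old API (note `runOldMapper` in the trace), so a reader that extends the newer class cannot be cast to the old interface even though it "is a RecordReader". The sketch below reproduces that situation with stand-in types and shows an adapter as one way out. Every type name in it is hypothetical and none of it is Hadoop or mongo-hadoop code.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class RecordReaderMismatch {
    // Stand-in for the old-API interface: the caller passes in a
    // reusable value object and next() fills it.
    interface OldStyleReader {
        boolean next(StringBuilder value);
    }

    // Stand-in for the new-API abstract class: the reader hands out
    // the current value itself.
    static abstract class NewStyleReader {
        abstract boolean nextKeyValue();
        abstract String getCurrentValue();
    }

    // A reader written against the new-style base class only.
    static class ListReader extends NewStyleReader {
        private final Iterator<String> rows;
        private String current;
        ListReader(List<String> rows) { this.rows = rows.iterator(); }
        @Override boolean nextKeyValue() {
            if (!rows.hasNext()) return false;
            current = rows.next();
            return true;
        }
        @Override String getCurrentValue() { return current; }
    }

    // Adapter: exposes a new-style reader through the old-style interface.
    static class OldStyleAdapter implements OldStyleReader {
        private final NewStyleReader delegate;
        OldStyleAdapter(NewStyleReader delegate) { this.delegate = delegate; }
        @Override public boolean next(StringBuilder value) {
            if (!delegate.nextKeyValue()) return false;
            value.setLength(0);
            value.append(delegate.getCurrentValue());
            return true;
        }
    }

    // Returns true when casting the reader to the old-style type fails.
    static boolean castFails(Object reader) {
        try {
            OldStyleReader r = (OldStyleReader) reader;
            return r == null;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        NewStyleReader reader = new ListReader(Arrays.asList("a", "b"));
        System.out.println("direct cast fails: " + castFails(reader));
        OldStyleReader adapted = new OldStyleAdapter(reader);
        StringBuilder value = new StringBuilder();
        while (adapted.next(value)) {
            System.out.println(value);
        }
    }
}
```

If this is the cause, checking which package the RecordReader import comes from in TypedBytesMongoRecordReader would confirm it.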
OK, I got it reading records just fine. It completes the M/R job, but
it's not writing the output to the database. I am not getting errors though.
It says output written to test.out (the db.collection_name I am trying
to write to in MongoDB), but there is nothing in that hadoop fs folder
except an empty _SUCCESS file and a bunch of logs.
So I don't know where my output is going.
OK, everything is reading from and writing to mongodb using the dumbo
wordcount demo. The columns it writes to are hard-coded for now, but I
will add a configurable property in the XML file that sets where the
values are output. Also, right now it will probably only let you write
to one collection, with a key / value pair. If it becomes necessary to
try and save actual BSONObjects with multiple k/v pairs, I will try that
next.
But it's working. Woop woop!
> On Jul 2, 8:03 pm, Nathan <nbyl...@gmail.com> wrote:
> > I feel so close. This class mimics theirs, but uses
> > TypedBytesWritable instead of BSONObjects.
> > @SuppressWarnings("deprecation")
> > public class TypedBytesTableInputFormat implements InputFormat<TypedBytesWritable, TypedBytesWritable> {
> >     if (!(split instanceof MongoInputSplit))
> >         throw new IllegalStateException("Creation of a new RecordReader requires a MongoInputSplit instance.");
> >     final MongoInputSplit mis = (MongoInputSplit) split;
Haha. Feels like a long journey just in this thread from "I don't know
Java" to "Hey I got it working!"
Anyways, I am going to try and do some tweaks to it so you can store
the output document structure in the XML file and have all the data
loaded into the driver instead of on the command line. I have it
checked in on github right now, but it only works if I hard-code the
output fields in the driver. Working on making it more robust.
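The idea of moving the output document structure into the XML file can be sketched in a few lines. The snippet below is only an illustration in Python (the actual driver is Java): it parses a Hadoop-style `<configuration>` file and maps a key/value pair onto the configured field names. `mongo.output.uri` follows the property naming used in the mongo-hadoop examples; `mongo.output.fields` is a made-up property name for this sketch, not an existing mongo-hadoop setting.

```python
import xml.etree.ElementTree as ET

# Hadoop-style configuration. "mongo.output.fields" is a hypothetical
# property invented for this sketch; it is not part of mongo-hadoop.
SAMPLE_XML = """
<configuration>
  <property>
    <name>mongo.output.uri</name>
    <value>mongodb://localhost/test.out</value>
  </property>
  <property>
    <name>mongo.output.fields</name>
    <value>word,count</value>
  </property>
</configuration>
"""


def load_properties(xml_text):
    """Return the name -> value pairs of a Hadoop-style configuration."""
    root = ET.fromstring(xml_text.strip())
    props = {}
    for prop in root.findall("property"):
        props[prop.findtext("name")] = prop.findtext("value")
    return props


def build_document(fields, key, value):
    """Map one key/value pair onto the configured output field names."""
    return dict(zip(fields, (key, value)))


props = load_properties(SAMPLE_XML)
fields = props["mongo.output.fields"].split(",")
doc = build_document(fields, "hadoop", 3)
print(doc)
```

With something like this, changing the output fields would only mean editing the XML file rather than recompiling the driver.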
On Jul 3, 7:45 pm, Nathan <nbyl...@gmail.com> wrote:
> OK, everything is reading from and writing to MongoDB using the dumbo
> wordcount demo. The columns it writes to are hard-coded for now, but I
> will add a configurable property to the XML file where you can specify
> the output fields. Also, right now it will probably only let you write to one
> collection, with a key/value pair. If it becomes necessary to try
> and save actual BSONObjects with multiple k/v pairs, I will try that
> next.
> But it's working. Woop woop!
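For readers who have not seen it, the dumbo wordcount demo mentioned here boils down to a mapper and a reducer written as plain Python generators. The sketch below shows those two functions and simulates the shuffle locally so it can be run without Hadoop. The sample input and the `simulate` helper are made up for illustration; on a cluster the same two functions would be handed to `dumbo.run(mapper, reducer)` instead.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    # key is the record key, value is one line of text.
    for word in value.split():
        yield word, 1


def reducer(key, values):
    # values iterates over every count emitted for this word.
    yield key, sum(values)


def simulate(records):
    """Hypothetical local stand-in for the map, sort and reduce phases."""
    mapped = [pair for k, v in records for pair in mapper(k, v)]
    mapped.sort(key=itemgetter(0))
    output = []
    for word, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(word, (count for _, count in group)))
    return output


sample = [(0, "to be or not to be"), (1, "be quick")]
print(simulate(sample))
```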
> On Jul 2, 9:31 pm, Nathan <nbyl...@gmail.com> wrote:
> > OK, I got it reading records just fine. It completes the M/R job, but
> > it's not writing it to the database. I am not getting errors though.
> > It says output written to test.out (the db.collection_name I am trying
> > to write to in MongoDB), but there is nothing in that hadoop fs folder
> > except an empty _SUCCESS file and a bunch of logs.
> > So I don't know where my output is going.
> > On Jul 2, 8:03 pm, Nathan <nbyl...@gmail.com> wrote:
> > > I feel so close. This class mimics theirs, but uses
> > > TypedBytesWritable instead of BSONObjects.
> > > @SuppressWarnings("deprecation")
> > > public class TypedBytesTableInputFormat implements
> > >         InputFormat<TypedBytesWritable, TypedBytesWritable> {
> > >     if (!(split instanceof MongoInputSplit))
> > >         throw new IllegalStateException(
> > >             "Creation of a new RecordReader requires a MongoInputSplit instance.");
> > >     final MongoInputSplit mis = (MongoInputSplit) split;
> > > java.lang.ClassCastException: com.mongodb.hadoop.input.TypedBytesMongoRecordReader cannot be cast to org.apache.hadoop.mapred.RecordReader
> > > at com.mongodb.hadoop.TypedBytesTableInputFormat.getRecordReader(TypedBytesTableInputFormat.java:31)
> > > at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:370)
> > > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
> > > at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
> > > at java.security.AccessController.doPrivileged(Native Method)
> > > at javax.security.auth.Subject.doAs(Subject.java:396)
> > > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
> > > at org.apache.hadoop.mapred.Child.main(Child.java:262)
> > > I feel so close! Not sure why I get a ClassCastException when my
> > > TypedBytesMongoRecordReader is a child of the RecordReader. Any Java
> > > people care to chime in?
> > > On Jul 2, 3:02 pm, Nathan <nbyl...@gmail.com> wrote:
> > > > The odd thing is it can't find this package when I try to import it
> > > > (I have all my jars in the build path, including the hadoop streaming one):
> > > > It says there is no typedbytes package in hadoop. Eclipse tries to
> > > > resolve this error by importing the hadoop-streaming.jar from the
> > > > lasthbase project. I have looked, and this is definitely not a
> > > > deprecated method, so it should be there, and I don't know what the
> > > > problem is.
> > > > On Jul 2, 1:35 pm, Nathan <nbyl...@gmail.com> wrote:
> > > > > I get what you are saying though. I am going to try and create a
> > > > > wrapper this weekend, but don't expect much success since I am not a
> > > > > Java guy. :)
> > > > > They have a lot of the same methods in their input & output formats,
> > > > > but are there specific methods that must be overridden? Are there very
> > > > > specific things that MUST happen in the input & output formats? Any
> > > > > tips are appreciated. Hopefully this is pretty straightforward, as
> > > > > there are only two classes to mess with.
> > > > > On Jul 2, 1:09 pm, Nathan <nbyl...@gmail.com> wrote:
> > > > > > Thanks for your reply. As of the last message I posted, it's reading from
> > > > > > MongoDB just fine, and their mongodb-hadoop driver uses TypedBytes as
> > > > > > well. This is the error I am currently struggling with:
> > > > > > java.io.IOException: Can't write: 4e0e98380bfb6ce2d9091ea6 as class org.bson.types.ObjectId
> > > > > > 4e0e98380bfb6ce2d9091ea6 is the MongoDB ObjectId string of the first
> > > > > > record in my test collection, so I know it's able to access the data.
> > > > > > Also, in the error stack trace, it outputs this:
> > > > > > at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:162)
> > > > > > at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:70)
> > > > > > at org.apache.hadoop.typedbytes.TypedBytesWritableOutput.writeWritable(TypedBytesWritableOutput.java:217)
> > > > > > So I know their driver is trying to use typed bytes. They have working
> > > > > > examples in pure Java, but I have grown accustomed to dumbo, and would
> > > > > > like to use it and help this project grow. Supposedly the project
> > > > > > supports streaming jobs too, so there should be no problem working
> > > > > > with dumbo as is once everything is figured out. I am not sure what is
> > > > > > happening yet, but I will share as soon as I have something working. I
> > > > > > also encourage anyone else interested to please take a look or share
> > > > > > their opinions. :)
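The `Can't write: ... as class org.bson.types.ObjectId` error is consistent with how typed bytes works: the format only knows a fixed set of type codes (bytes, booleans, numbers, strings, lists, maps and so on), so a value such as an ObjectId presumably has to be converted to one of those, for example its hex string, before it is written. Below is a minimal sketch of that idea in Python, covering only the int (code 3) and string (code 7) cases. The `ObjectId` class and the `encode` function are stand-ins written for this example, not the real BSON class or the Hadoop serializer.

```python
import struct


class ObjectId(object):
    """Stand-in for org.bson.types.ObjectId, written for this sketch."""

    def __init__(self, hex_string):
        self.hex_string = hex_string

    def __str__(self):
        return self.hex_string


def encode(value):
    """Encode an int or a string as typed bytes; reject anything else."""
    if isinstance(value, bool):
        raise TypeError("bool is not handled in this sketch")
    if isinstance(value, int):
        # Type code 3: 32-bit signed integer, big-endian.
        return struct.pack(">bi", 3, value)
    if isinstance(value, str):
        # Type code 7: 32-bit length followed by the UTF-8 bytes.
        data = value.encode("utf-8")
        return struct.pack(">bi", 7, len(data)) + data
    raise TypeError("Can't write: %s as %s" % (value, type(value)))


oid = ObjectId("4e0e98380bfb6ce2d9091ea6")

try:
    encode(oid)
except TypeError as exc:
    print("rejected:", exc)

# Converting to a plain string first gives typed bytes a type it knows.
encoded = encode(str(oid))
print(len(encoded))
```

In the Java wrapper the equivalent step would be turning the ObjectId into a string (or another supported type) before handing it to the typed bytes writable.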
On Friday, August 31, 2012 11:08:47 AM UTC-5, Jon Eisen wrote:
> Hey Nathan, did you ever publish your code to get that working? I'm
> working on the same thing right now.
On Wednesday, October 31, 2012 1:55:08 PM UTC-5, Paul DeCoursey wrote:
> I'm also curious about sample code. I can't get dumbo to talk to mongo
> for the life of me.
> On Friday, August 31, 2012 11:08:47 AM UTC-5, Jon Eisen wrote:
>> Hey Nathan, did you ever publish your code to get that working? I'm >> working on the same thing right now.
>> On Monday, July 4, 2011 12:53:09 PM UTC-4, Nathan wrote:
>>> Haha. Feels like a long journey just in this thread from "I don't know >>> Java" to "Hey I got it working!"
>>> Anyways, I am going to try and do some tweaks to it so you can store >>> the output document structure in the XML file and have all the data >>> loaded into the driver instead of on the command line. I have it >>> checked in on github right now, but it only works if I hard-code the >>> output fields in the driver. Working on making it more robust.
>>> On Jul 3, 7:45 pm, Nathan <nbyl...@gmail.com> wrote: >>> > OK everything is reading and writing to mongodb using the dumbo >>> > wordcount demo. The columns it writes to is hard coded for now, but I >>> > will make a configurable property in the XML file where you can output >>> > the values. Also, right now it will probably only let you write to one >>> > collection, with a key / value pair. If it becomes necessary to try >>> > and save actual BSONObjects with multiple k/v pairs, I will try that >>> > next.
>>> > But it's working. Woop woop!
>>> > On Jul 2, 9:31 pm, Nathan <nbyl...@gmail.com> wrote:
>>> > > OK, I got it reading records just fine. It completes the M/R job, >>> but >>> > > it's not writing it to the database. I am not getting errors though. >>> > > It says output written to test.out (the db.collection_name I am >>> trying >>> > > to write to in MongoDB), but there is nothing in that hadoop fs >>> folder >>> > > except an empty _SUCCESS file and a bunch of logs
>>> > > > if (!(split instanceof MongoInputSplit))
>>> > > >     throw new IllegalStateException("Creation of a new RecordReader requires a MongoInputSplit instance.");
>>> > > > final MongoInputSplit mis = (MongoInputSplit) split;
>>> > > > I feel so close! Not sure why I get a ClassCastException when my TypedBytesMongoRecordReader is a child of the RecordReader. Any Java people care to chime in?
>>> > > > > The odd thing is it can't find this package when I try to import it (I have all my jars in the build path, including the hadoop streaming jar):
>>> > > > > Says there is no typedbytes package in hadoop. Eclipse tries to resolve this error by importing the hadoop-streaming.jar from the lasthbase project. I have looked, and this is definitely not a deprecated method, so it should be there, so I don't know what that problem is.
>>> > > > > > I get what you are saying though. I am going to try and create a wrapper this weekend, but don't expect much success since I am not a Java guy. :)
>>> > > > > > They have a lot of the same methods in their input & output formats, but are there specific methods that must be overridden? Are there very specific things that MUST happen in the input & output formats? Any tips are appreciated. Hopefully this is pretty straightforward, as there are only two classes to mess with.
>>> > > > > > > Thanks for your reply. As of the last message I posted, it's reading from MongoDB just fine, and their mongodb-hadoop driver uses TypedBytes as well. This is the error I am currently struggling with:
>>> > > > > > > 4e0e98380bfb6ce2d9091ea6 is the mongodb objectId string of the first record in my test collection, so I know it's able to access the data. Also, in the error stack trace, it outputs this:
>>> > > > > > > So I know their driver is trying to use typed bytes. They have working examples in pure Java, but I have grown accustomed to dumbo, and would like to use it and help this project grow. Supposedly the project supports streaming jobs too, so there should be no problem working with dumbo as is once everything is figured out. I am not sure what is happening yet, but I will share as soon as I have something working. I also encourage anyone else interested to please take a look or share their opinions. :)
>>> > > > > > > > Based on what you told us, I don't think there's a real difference between how the two take configuration params. The mongodb example probably just makes use of the possibility that Hadoop provides for putting the params in an xml file and reading them from that file instead of passing them directly.
>>> > > > > > > > To make mongo input or output work, you will need to write a custom input or output format that writes or reads typed bytes writables. I haven't looked at the code much, but you might be able to do this by wrapping the mongo-hadoop formats. You should be able to figure out how to work with typed bytes writables by having a look at the lasthbase code.
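[Editor's sketch] The wrapping idea described above can be illustrated without any Hadoop or MongoDB dependency. What follows is a minimal sketch, not the lasthbase or mongo-hadoop implementation. The class names `TypedBytesSketch` and `TypedBytesWrappingReader` are hypothetical, and a plain `Iterator` stands in for the mongo-hadoop record reader. The type codes (3 = int, 4 = long, 7 = string, each followed by big-endian data) follow the typed bytes format as I understand it from the Hadoop streaming typedbytes package; treat them as an assumption and check them against your Hadoop version. A real wrapper would extend Hadoop's input format and record reader classes and emit typed bytes writables rather than raw byte arrays.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.Map;

// Hypothetical helper: serializes a few Java types as typed bytes
// (one type-code byte followed by the big-endian payload).
class TypedBytesSketch {
    static final int TYPE_INT = 3;     // assumed code for a 32-bit integer
    static final int TYPE_LONG = 4;    // assumed code for a 64-bit integer
    static final int TYPE_STRING = 7;  // assumed code for a UTF-8 string

    static byte[] encode(Object value) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        try {
            if (value instanceof Integer) {
                out.writeByte(TYPE_INT);
                out.writeInt((Integer) value);
            } else if (value instanceof Long) {
                out.writeByte(TYPE_LONG);
                out.writeLong((Long) value);
            } else {
                // Fallback: anything else is written as its string form,
                // prefixed with the byte length of the UTF-8 encoding.
                byte[] utf8 = String.valueOf(value).getBytes(StandardCharsets.UTF_8);
                out.writeByte(TYPE_STRING);
                out.writeInt(utf8.length);
                out.write(utf8);
            }
            out.flush();
        } catch (IOException e) {
            // An in-memory stream should not fail; rethrow unchecked if it does.
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }
}

// Hypothetical wrapper: delegates to an inner record source and converts
// each key/value pair to typed bytes before handing it on.
class TypedBytesWrappingReader {
    private final Iterator<Map.Entry<String, Object>> inner;

    TypedBytesWrappingReader(Iterator<Map.Entry<String, Object>> inner) {
        this.inner = inner;
    }

    boolean hasNext() {
        return inner.hasNext();
    }

    // Returns a two-element array: {encoded key, encoded value}.
    byte[][] next() {
        Map.Entry<String, Object> record = inner.next();
        return new byte[][] {
            TypedBytesSketch.encode(record.getKey()),
            TypedBytesSketch.encode(record.getValue())
        };
    }
}
```

The point of the design is that the wrapper never needs to know how the inner reader talks to the database; it only translates each record on the way through, which is why wrapping the existing formats may be less work than rewriting them.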
>>> > > > > > > > Also, to use (Java) input or output formats you need to run on Hadoop. That's the reason why the local run you pasted in one of your emails failed miserably.
>>> > > > > > > > Sorry for the late answer, and please share your code if you figure out how to do this!
>>> > > > > > > > On Thu, Jun 30, 2011 at 8:34 PM, Nathan <nbyl...@gmail.com> wrote:
>>> > > > > > > > > I was using HBase for a while and was happy when I found the lasthbase driver on github that worked great with dumbo. Recently I