Message from discussion
Dumbo & MongoDB
Date: Mon, 4 Jul 2011 09:53:09 -0700 (PDT)
Subject: Re: Dumbo & MongoDB
From: Nathan <nbyl...@gmail.com>
To: dumbo-user <dumbo-user@googlegroups.com>
Haha. Feels like a long journey just in this thread from "I don't know
Java" to "Hey I got it working!"
Anyway, I am going to tweak it so that the output document structure
is stored in the XML file and all the settings are loaded by the
driver instead of being passed on the command line. The code is
checked in on GitHub right now, but it only works if I hard-code the
output fields in the driver. I'm working on making it more robust.
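
As a rough illustration of the XML idea above, here is a minimal sketch that reads Hadoop-style `<property>` entries (name/value pairs inside a `<configuration>` element) and pulls out a comma-separated list of output fields. The property name `example.output.fields` and the class name are made up for this example; they are not taken from mongo-hadoop or from the code on GitHub.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class OutputFieldsConfig {

    /** Parses Hadoop-style <property><name>/<value> pairs into an ordered map. */
    public static Map<String, String> parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> props = new LinkedHashMap<>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element p = (Element) nodes.item(i);
            String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
            String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
            props.put(name, value);
        }
        return props;
    }

    /** Splits the comma-separated field list stored under the hypothetical key. */
    public static String[] outputFields(Map<String, String> props) {
        return props.getOrDefault("example.output.fields", "").split("\\s*,\\s*");
    }

    public static void main(String[] args) throws Exception {
        String xml = "<configuration>"
                + "<property><name>example.output.fields</name>"
                + "<value>word, count</value></property>"
                + "</configuration>";
        // The driver would use these names instead of hard-coded output fields.
        System.out.println(String.join("|", outputFields(parse(xml))));
    }
}
```

In a real job the values would more likely come from Hadoop's own Configuration object than from a hand-rolled parser; the sketch only shows the shape of the file and the lookup.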
On Jul 3, 7:45 pm, Nathan <nbyl...@gmail.com> wrote:
> OK, everything is reading from and writing to MongoDB using the dumbo
> wordcount demo. The columns it writes to are hard-coded for now, but I
> will add a configurable property in the XML file where you can specify
> the output values. Also, right now it will probably only let you write
> to one collection, with a key/value pair. If it becomes necessary to
> save actual BSONObjects with multiple k/v pairs, I will try that
> next.
>
> But it's working. Woop woop!
>
> On Jul 2, 9:31 pm, Nathan <nbyl...@gmail.com> wrote:
>
> > OK, I got it reading records just fine. It completes the M/R job, but
> > it's not writing the output to the database. I am not getting errors
> > though. It says output written to test.out (the db.collection_name I
> > am trying to write to in MongoDB), but there is nothing in that Hadoop
> > FS folder except an empty _SUCCESS file and a bunch of logs.
>
> > So I don't know where my output is going.
>
> > On Jul 2, 8:03 pm, Nathan <nbyl...@gmail.com> wrote:
>
> > > I feel so close. This class mimics theirs, but uses
> > > TypedBytesWritable instead of BSONObjects.
>
> > > @SuppressWarnings("deprecation")
> > > public class TypedBytesTableInputFormat implements
> > > InputFormat<TypedBytesWritable, TypedBytesWritable> {
>
> > >     @Override
> > >     public RecordReader<TypedBytesWritable, TypedBytesWritable> getRecordReader(
> > >             InputSplit split, JobConf job, Reporter reporter) {
>
> > >         if (!(split instanceof MongoInputSplit))
> > >             throw new IllegalStateException(
> > >                 "Creation of a new RecordReader requires a MongoInputSplit instance.");
>
> > >         final MongoInputSplit mis = (MongoInputSplit) split;
>
> > >         return (RecordReader<TypedBytesWritable, TypedBytesWritable>)
> > >             new TypedBytesMongoRecordReader(mis);
> > >     }
> > > ....
> > > ....
> > > ....
> > > ....
>
> > > public class TypedBytesMongoRecordReader extends
> > > RecordReader<TypedBytesWritable, TypedBytesWritable> {
>
> > >     public TypedBytesMongoRecordReader(MongoInputSplit mis) {
> > >         _cursor = mis.getCursor();
> > >     }
> > > ...
> > > ...
> > > ...
> > > ...
>
> > > Unfortunately I get this error:
>
> > > java.lang.ClassCastException:
> > > com.mongodb.hadoop.input.TypedBytesMongoRecordReader cannot be cast to
> > > org.apache.hadoop.mapred.RecordReader
> > >     at com.mongodb.hadoop.TypedBytesTableInputFormat.getRecordReader(TypedBytesTableInputFormat.java:31)
> > >     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:370)
> > >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
> > >     at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
> > >     at java.security.AccessController.doPrivileged(Native Method)
> > >     at javax.security.auth.Subject.doAs(Subject.java:396)
> > >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
> > >     at org.apache.hadoop.mapred.Child.main(Child.java:262)
>
> > > I feel so close! Not sure why I get a ClassCastException when my
> > > TypedBytesMongoRecordReader is a child of the RecordReader. Any Java
> > > people care to chime in?
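
A likely explanation, offered as an assumption since the full source is not shown in the thread: Hadoop ships two unrelated types named RecordReader. The old API's org.apache.hadoop.mapred.RecordReader is an interface, while the new API's org.apache.hadoop.mapreduce.RecordReader is an abstract class. The reader quoted above uses `extends RecordReader`, which suggests it inherits from the new-API class, whereas the input format implements the old-API InputFormat and casts to the old-API interface. The sketch below reproduces that situation with stand-in types (every name in it is hypothetical, so it runs without Hadoop on the classpath):

```java
public class RecordReaderCastDemo {

    // Stand-in for the old-API interface (org.apache.hadoop.mapred.RecordReader).
    public interface OldApiRecordReader<K, V> {
        boolean next(K key, V value);
    }

    // Stand-in for the new-API abstract class (org.apache.hadoop.mapreduce.RecordReader).
    public abstract static class NewApiRecordReader<K, V> {
        public abstract boolean nextKeyValue();
    }

    // Mirrors the failing reader: it only extends the new-API type.
    public static class ExtendsNewApi extends NewApiRecordReader<String, String> {
        @Override
        public boolean nextKeyValue() { return false; }
    }

    // Mirrors a reader that an old-API input format can legitimately return.
    public static class ImplementsOldApi implements OldApiRecordReader<String, String> {
        @Override
        public boolean next(String key, String value) { return false; }
    }

    /** Returns true if the object can be cast to the old-API stand-in. */
    public static boolean castsToOldApi(Object reader) {
        try {
            OldApiRecordReader<?, ?> r = (OldApiRecordReader<?, ?>) reader;
            return r != null;
        } catch (ClassCastException e) {
            // Same failure mode as in the stack trace: the types share a name, not a hierarchy.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(castsToOldApi(new ExtendsNewApi()));    // prints false
        System.out.println(castsToOldApi(new ImplementsOldApi())); // prints true
    }
}
```

If that is indeed the cause, the reader would need to implement the old-API interface (or the input format would need to move to the new API) so that both sides use the same package.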
>
> > > On Jul 2, 3:02 pm, Nathan <nbyl...@gmail.com> wrote:
>
> > > > The odd thing is it can't find this package when I try to import it
> > > > (I have all my JARs in the build path, including hadoop-streaming):
>
> > > > import org.apache.hadoop.typedbytes.TypedBytesWritable;
>
> > > > It says there is no typedbytes package in Hadoop. Eclipse tries to
> > > > resolve this error by importing the hadoop-streaming.jar from the
> > > > lasthbase project. I have looked, and this is definitely not
> > > > deprecated, so it should be there. I don't know what the
> > > > problem is.
>
> > > > On Jul 2, 1:35 pm, Nathan <nbyl...@gmail.com> wrote:
>
> > > > > I get what you are saying though. I am going to try to create a
> > > > > wrapper this weekend, but don't expect much success since I am not a
> > > > > Java guy. :)
>
> > > > > They have a lot of the same methods in their input & output formats,
> > > > > but are there specific methods that must be overridden? Are there very
> > > > > specific things that MUST happen in the input & output formats? Any
> > > > > tips are appreciated. Hopefully this is pretty straightforward, as
> > > > > there are only two classes to mess with.
>
> > > > > On Jul 2, 1:09 pm, Nathan <nbyl...@gmail.com> wrote:
>
> > > > > > Thanks for your reply. As of the last message I posted, it's reading
> > > > > > from MongoDB just fine, and their mongodb-hadoop driver uses TypedBytes
> > > > > > as well. This is the error I am currently struggling with:
>
> > > > > > java.io.IOException: Can't write: 4e0e98380bfb6ce2d9091ea6 as class
> > > > > > org.bson.types.ObjectId
>
> > > > > > 4e0e98380bfb6ce2d9091ea6 is the mongodb objectId string of the first
> > > > > > record in my test collection, so I know it's able to access the data.
> > > > > > Also, in the error stack trace, it outputs this:
>
> > > > > > org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:162)
> > > > > > at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:70)
> > > > > > at org.apache.hadoop.typedbytes.TypedBytesWritableOutput.writeWritable(TypedBytesWritableOutput.java:217)
>
> > > > > > So I know their driver is trying to use typed bytes. They have working
> > > > > > examples in pure Java, but I have grown accustomed to dumbo, and would
> > > > > > like to use it and help this project grow. Supposedly the project
> > > > > > supports streaming jobs too, so there should be no problem working
> > > > > > with dumbo as is once everything is figured out. I am not sure what is
> > > > > > happening yet, but I will share as soon as I have something working. I
> > > > > > also encourage anyone else interested to please take a look or share
> > > > > > their opinions. :)
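
One possible workaround for the "Can't write ... as class org.bson.types.ObjectId" error, sketched under the assumption (suggested by the stack trace, not confirmed) that the typed bytes writer falls back to ObjectWritable for a value it has no native encoding for: convert each value to a type typed bytes does encode (strings, numbers, booleans, byte arrays, lists, maps) before handing it over. The class below is a self-contained illustration; `TypedBytesFriendly` and `FakeObjectId` are hypothetical names, and the stand-in only imitates the fact that an ObjectId prints as its hex string.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TypedBytesFriendly {

    /** Recursively maps a value to a typed-bytes-friendly type; unknown types become strings. */
    public static Object convert(Object value) {
        if (value == null
                || value instanceof String || value instanceof Boolean
                || value instanceof Byte || value instanceof Integer
                || value instanceof Long || value instanceof Float
                || value instanceof Double || value instanceof byte[]) {
            return value; // passed through unchanged (null handling is left to the caller)
        }
        if (value instanceof Map) {
            Map<Object, Object> out = new LinkedHashMap<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                out.put(convert(e.getKey()), convert(e.getValue()));
            }
            return out;
        }
        if (value instanceof Iterable) {
            List<Object> out = new ArrayList<>();
            for (Object item : (Iterable<?>) value) {
                out.add(convert(item));
            }
            return out;
        }
        return value.toString(); // anything else is written as its string form
    }

    // Hypothetical stand-in for org.bson.types.ObjectId, used only for the demo.
    static class FakeObjectId {
        private final String hex;
        FakeObjectId(String hex) { this.hex = hex; }
        @Override
        public String toString() { return hex; }
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("_id", new FakeObjectId("4e0e98380bfb6ce2d9091ea6"));
        doc.put("count", 3);
        System.out.println(convert(doc)); // prints {_id=4e0e98380bfb6ce2d9091ea6, count=3}
    }
}
```

The trade-off is that the id arrives in the dumbo job as a plain string, so anything written back would have to turn it into an ObjectId again on the Java side.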
>
> > > > > > On Jul 2, 12:03 pm, Klaas Bosteels <klaas.boste...@gmail.com> wrote:
>
> > > > > > > Hi Nathan,
>
> > > > > > > Based on what you told us, I don't think there's a real difference between
> > > > > > > how the two take configuration params. The mongodb example probably just
> > > > > > > makes use of the possibility that Hadoop provides for putting the params in
> > > > > > > an xml file and reading them from that file instead of passing them
> > > > > > > directly.
>
> > > > > > > To make mongo input or output work, you will need to write a custom input or
> > > > > > > output format that writes or reads typed bytes writables. I haven't looked
> > > > > > > at the code much, but you might be able to do this by wrapping the
> > > > > > > mongo-hadoop formats. You should be able to figure out how to work with
> > > > > > > typed bytes writables by having a look at the lasthbase code.
>
> > > > > > > Also, to use (Java) input or output formats you need to run on Hadoop.
> > > > > > > That's the reason why the local run you pasted in one of your emails failed
> > > > > > > miserably.
>
> > > > > > > Sorry for the late answer, and please share your code if you figure out how
> > > > > > > to do this!
>
> > > > > > > Regards,
> > > > > > > -Klaas
>
> > > > > > > On Thu, Jun 30, 2011 at 8:34 PM, Nathan <nbyl...@gmail.com> wrote:
> > > > > > > > I was using HBase for a while and was happy when I found the lasthbase
> > > > > > > > driver on github that worked great with dumbo. Recently I have started
> > > > > > > > working with MongoDB and found a mongodb-hadoop driver here:
>
> > > > > > > >https://github.com/mongodb/mongo-hadoop/
>
> > > > > > > > I asked a friend of mine who is much more familiar with Java to
> > > > > > > > compare the two, to see if we can use the mongodb classes easily in
> > > > > > > > the same way dumbo uses the lasthbase.jar. For reference, here are the
> > > > > > > > Input & Output format classes for both the HBase & mongodb projects:
>
> > > > > > > >https://github.com/mongodb/mongo-hadoop/tree/master/src/main/com/mong...
>
> > > > > > > >https://github.com/tims/lasthbase/tree/master/src/java/fm/last/hbase/...
>
> > > > > > > > With lasthbase, the input & output information is specified on the
> > > > > > > > command line, but mongodb-hadoop has a WordCountXML example
> > > > > > > > that reads all connection, query, and other configurable information
> > > > > > > > from an XML file. I liked this approach, but had some questions. It
> > > > > > > > seems as though the lasthbase classes extend a JobConfigurable
> > > > > > > > class, but it's been a long time since it was updated. Mongodb-hadoop
> > > > > > > > does not have this. A LOT of the setup looks the same, but I was
> > > > > > > > looking for a good starting point on making their classes work with
> > > > > > > > dumbo.
>
> > > > > > > > What is dumbo expecting, or better yet, what is lasthbase sending to
> > > > > > > > dumbo? What does dumbo need from the jar file to start streaming the
> > > > > > > > data to the map/reduce job(s)? And how should it be streamed? I don't
> > > > > > > > know Java, but my friend is willing to try and help get it going if I
> > > > > > > > can get him all the information possible. To him it SEEMS some things
> > > > > > > > can be moved around and into the input & output format classes on
> > > > > > > > mongodb-hadoop, tell it to read the xml file, and then you have
> > > > > > > > another driver that connects to a document database for use with
> > > > > > > > dumbo.
>
> > > > > > > > But he has no understanding of dumbo, and we could use some assistance.
>