how to split bson dumps for consumption by hadoop


Mark Lewandowski

May 9, 2012, 8:31:42 PM5/9/12
to mongod...@googlegroups.com
Hi All,

I'm currently using mongodump to write my entire database to a BSON file.  Next, I'd like to translate it to JSON and store it on S3 for consumption by EMR.  I currently have a Python script that can do this serially, but it seems to make more sense to do it in EMR to take advantage of the parallelism I can get through Hadoop.  The only problem is that I can't figure out how to break the BSON file into chunks for consumption by Hadoop.

Anyone have experience with this?

Thanks,

-Mark

Scott Hernandez

May 9, 2012, 8:34:51 PM5/9/12
to mongod...@googlegroups.com
I wrote a Java program to do this; it splits each BSON file into n-MB
chunks... I can post it somewhere if you like :)
> --
> You received this message because you are subscribed to the Google Groups
> "mongodb-user" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/mongodb-user/-/lHU3u6i7540J.
> To post to this group, send email to mongod...@googlegroups.com.
> To unsubscribe from this group, send email to
> mongodb-user...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/mongodb-user?hl=en.

Mark Lewandowski

May 9, 2012, 8:40:27 PM5/9/12
to mongod...@googlegroups.com
Scott, that would be great.  Maybe I did something wrong, but when I tried to split the file into chunks I was getting corrupted data.  I assumed it was because you can't depend on newlines delimiting each record.  Is there a better record delimiter?

-Mark

Scott Hernandez

May 10, 2012, 4:42:49 PM5/10/12
to mongod...@googlegroups.com
Yeah, the format of the file is just a stream of BSON documents
(http://bsonspec.org/). Linefeeds/line breaks do not separate documents.
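Since every BSON document begins with a 4-byte little-endian int32 length prefix (per the spec linked above), document boundaries can be located without decoding anything. A minimal Python sketch (mine, not from this thread) that walks a dump document by document, showing why newline splitting corrupts the data:

```python
import struct

def iter_bson_docs(stream):
    """Yield raw BSON documents from a stream of concatenated documents.

    Each BSON document starts with a 4-byte little-endian int32 giving the
    total document length (prefix included), so no newline or other
    delimiter exists -- and none is safe to rely on.
    """
    while True:
        prefix = stream.read(4)
        if len(prefix) < 4:  # clean end of stream
            return
        (length,) = struct.unpack("<i", prefix)
        body = stream.read(length - 4)
        yield prefix + body
```

Usage would look like `for doc in iter_bson_docs(open("dump.bson", "rb")): ...`, with each yielded value being one complete, uncorrupted document.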

Here is the sample Java code:

import java.io.*;
import com.mongodb.*;

public class BsonSplitter {
    public static void main(String[] args) throws Exception {
        String fn = args[0];
        long sizeMB = args.length > 1 ? Long.parseLong(args[1]) : 2;
        InputStream in = new FileInputStream(fn);
        LazyDBDecoder decoder = new LazyDBDecoder();
        int i = 0;
        long written = 0;
        OutputStream out = new FileOutputStream(fn + "." + i + ".bson");
        while (in.available() > 0) {
            LazyDBObject dbObj =
                (LazyDBObject) decoder.decode(in, (DBCollection) null);
            written += dbObj.getBSONSize();
            dbObj.pipe(out);

            if (written > sizeMB * 1024 * 1024) {
                out.close();
                i++;  // advance before opening, so chunk 0 isn't overwritten
                out = new FileOutputStream(fn + "." + i + ".bson");
                written = 0;
            }
        }
        in.close();
        out.close();
    }
}


I can supply a full working project/lib if you need.


Emmanuel VINET

May 29, 2012, 5:39:34 AM5/29/12
to mongod...@googlegroups.com
Hi Scott,

Your code is nice, but it fails to compile on dbObj.pipe(out), which
doesn't exist for LazyDBObject in driver 2.7.3 of mongo.

Do you have an idea?

Thanks.
Emmanuel



Scott Hernandez

May 29, 2012, 8:56:55 AM5/29/12
to mongod...@googlegroups.com

It uses the latest code, which will be in the 2.8.0 release. Please use master or a 2.8.0 snapshot.
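For anyone stuck on an older driver: the split doesn't strictly need a MongoDB driver at all, since only the 4-byte length prefixes have to be parsed to cut on document boundaries. A hedged Python sketch of that alternative (the function name split_bson and the chunk_mb parameter are mine, not from this thread; it mirrors the naming scheme of the Java program above):

```python
import struct

def split_bson(path, chunk_mb=2):
    """Split a mongodump .bson file into roughly chunk_mb-sized pieces,
    always cutting on a document boundary. Returns the chunk filenames."""
    limit = chunk_mb * 1024 * 1024
    names = []
    with open(path, "rb") as src:
        i, written, out = 0, 0, None
        while True:
            prefix = src.read(4)
            if len(prefix) < 4:  # end of dump
                break
            (length,) = struct.unpack("<i", prefix)
            doc = prefix + src.read(length - 4)
            if out is None or written > limit:
                if out is not None:
                    out.close()
                    i += 1
                name = "%s.%d.bson" % (path, i)
                names.append(name)
                out = open(name, "wb")
                written = 0
            out.write(doc)
            written += len(doc)
        if out is not None:
            out.close()
    return names
```

Because documents are copied as raw bytes, this never decodes anything and works regardless of driver version; concatenating the chunks back together reproduces the original dump exactly.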


Emmanuel VINET

May 29, 2012, 9:45:06 AM5/29/12
to mongod...@googlegroups.com
Thanks, I'll try it.

Jesse Riggins

Oct 25, 2012, 7:43:01 AM10/25/12
to mongod...@googlegroups.com
I think the answer is 'yes', looking at LazyBSONDecoder#decode, but a confirmation would be great.

Thanks in advance,
 Jesse

On Thursday, October 25, 2012 5:27:02 AM UTC-5, Jesse Riggins wrote:
(I hope it's ok to comment on old threads).

Scott, 

Thanks for posting the code.  Does using LazyDBDecoder with the InputStream avoid bringing the entire BSON file into memory?

-Jesse