how to split bson dumps for consumption by hadoop


Mark Lewandowski

May 9, 2012, 8:31:42 PM5/9/12
to mongod...@googlegroups.com
Hi All,

I'm currently using mongodump to write my entire database to a BSON file.  Next, I'd like to translate it to JSON and store it on S3 for consumption by EMR.  I currently have a Python script that can do this serially, but it seems to make more sense to do it in EMR to take advantage of the parallelism I can get through Hadoop.  The only problem is that I can't figure out how to break the BSON file into chunks for consumption by Hadoop.

Anyone have experience with this?

Thanks,

-Mark

Scott Hernandez

May 9, 2012, 8:34:51 PM5/9/12
to mongod...@googlegroups.com
I wrote a Java program to do this; it splits each BSON file into n-MB
chunks... I can post it somewhere if you like :)
> --
> You received this message because you are subscribed to the Google Groups
> "mongodb-user" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/mongodb-user/-/lHU3u6i7540J.
> To post to this group, send email to mongod...@googlegroups.com.
> To unsubscribe from this group, send email to
> mongodb-user...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/mongodb-user?hl=en.

Mark Lewandowski

May 9, 2012, 8:40:27 PM5/9/12
to mongod...@googlegroups.com
Scott, that would be great.  Maybe I did something wrong, but when I tried to split the file into chunks I was getting corrupted data.  I assumed it was because you can't depend on newlines delimiting each record.  Is there a better record delimiter?

-Mark

Scott Hernandez

May 10, 2012, 4:42:49 PM5/10/12
to mongod...@googlegroups.com
Yeah, the format of the file is just a stream of BSON documents
(http://bsonspec.org/). Linefeeds/line breaks do not separate documents.
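Since every BSON document begins with a 4-byte little-endian int32 length prefix (per the spec linked above), document boundaries can be located without decoding anything. A minimal Python sketch (mine, not from this thread) that walks a dump document by document, showing why newline splitting corrupts the data:

```python
import struct

def iter_bson_docs(stream):
    """Yield raw BSON documents from a stream of concatenated documents.

    Each BSON document starts with a 4-byte little-endian int32 giving the
    total document length (prefix included), so no newline or other
    delimiter exists -- and none is safe to rely on.
    """
    while True:
        prefix = stream.read(4)
        if len(prefix) < 4:  # clean end of stream
            return
        (length,) = struct.unpack("<i", prefix)
        body = stream.read(length - 4)
        yield prefix + body
```

Usage would look like `for doc in iter_bson_docs(open("dump.bson", "rb")): ...`, with each yielded value being one complete, uncorrupted document.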

Here is the sample Java code:

import java.io.*;
import com.mongodb.*;

public class BsonSplitter {
    public static void main(String[] args) throws Exception {
        String fn = args[0];
        long sizeMB = args.length > 1 ? Long.parseLong(args[1]) : 2;
        InputStream in = new FileInputStream(fn);
        LazyDBDecoder decoder = new LazyDBDecoder();
        int i = 0;
        long written = 0;
        OutputStream out = new FileOutputStream(fn + "." + i + ".bson");
        while (in.available() > 0) {
            LazyDBObject dbObj =
                (LazyDBObject) decoder.decode(in, (DBCollection) null);
            written += dbObj.getBSONSize();
            dbObj.pipe(out);

            if (written > sizeMB * 1024 * 1024) {
                out.close();
                i++;  // advance before opening, so chunk 0 isn't overwritten
                out = new FileOutputStream(fn + "." + i + ".bson");
                written = 0;
            }
        }
        in.close();
        out.close();
    }
}


I can supply a full working project/lib if you need.


Emmanuel VINET

May 29, 2012, 5:39:34 AM5/29/12
to mongod...@googlegroups.com
Hi Scott,

Your code is nice, but it fails to compile on dbObj.pipe(out), which
doesn't exist for LazyDBObject in driver 2.7.3 of mongo.

Do you have an idea?

Thanks.
Emmanuel



Scott Hernandez

May 29, 2012, 8:56:55 AM5/29/12
to mongod...@googlegroups.com

It uses the latest code, which will be in the 2.8.0 release. Please use master or a 2.8.0 snapshot.
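For anyone stuck on an older driver: the split doesn't strictly need a MongoDB driver at all, since only the 4-byte length prefixes have to be parsed to cut on document boundaries. A hedged Python sketch of that alternative (the function name split_bson and the chunk_mb parameter are mine, not from this thread; it mirrors the naming scheme of the Java program above):

```python
import struct

def split_bson(path, chunk_mb=2):
    """Split a mongodump .bson file into roughly chunk_mb-sized pieces,
    always cutting on a document boundary. Returns the chunk filenames."""
    limit = chunk_mb * 1024 * 1024
    names = []
    with open(path, "rb") as src:
        i, written, out = 0, 0, None
        while True:
            prefix = src.read(4)
            if len(prefix) < 4:  # end of dump
                break
            (length,) = struct.unpack("<i", prefix)
            doc = prefix + src.read(length - 4)
            if out is None or written > limit:
                if out is not None:
                    out.close()
                    i += 1
                name = "%s.%d.bson" % (path, i)
                names.append(name)
                out = open(name, "wb")
                written = 0
            out.write(doc)
            written += len(doc)
        if out is not None:
            out.close()
    return names
```

Because documents are copied as raw bytes, this never decodes anything and works regardless of driver version; concatenating the chunks back together reproduces the original dump exactly.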


Emmanuel VINET

May 29, 2012, 9:45:06 AM5/29/12
to mongod...@googlegroups.com
Thanks, I'll try it.

Jesse Riggins

Oct 25, 2012, 7:43:01 AM10/25/12
to mongod...@googlegroups.com
I think the answer is 'yes', looking at LazyBSONDecoder#decode, but a confirmation would be great.

Thanks in advance,
 Jesse

On Thursday, October 25, 2012 5:27:02 AM UTC-5, Jesse Riggins wrote:
(I hope it's ok to comment on old threads).

Scott, 

Thanks for posting the code.  Does using LazyDBDecoder with the InputStream avoid bringing the entire BSON file into memory?

-Jesse