mongo-hadoop connector and s3 directory source


Dmitriy Selivanov

Jan 19, 2015, 10:45:05 AM
to mongod...@googlegroups.com
Hi MongoDB users. I have an issue with the mongo-hadoop connector, which I use with an Apache Spark standalone cluster. What I am actually trying to do is read a directory of BSON files from S3 (produced by Spark in a previous job). My configuration is listed below:
config.set("mongo.job.input.format", "com.mongodb.hadoop.BSONFileInputFormat");
config.set("mapred.input.dir", args(0));

Then I read from S3:
spark/bin/spark-submit  .... s3n://ID:SECRET@BUCKET/JOB_DIRECTORY/
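
For reference, the load inside the job looks roughly like this (a sketch along the lines of the connector's documented Spark usage, not my exact code; the Object/BSONObject key and value classes and the asSubclass cast may differ depending on the connector version, and sc is the SparkContext):

import com.mongodb.hadoop.BSONFileInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.bson.BSONObject

val config = new Configuration()          // plus the two config.set(...) calls above
// args(0) is the s3n:// directory passed to spark-submit
val bsonRDD = sc.newAPIHadoopFile(
  args(0),
  classOf[BSONFileInputFormat].asSubclass(
    classOf[FileInputFormat[Object, BSONObject]]),
  classOf[Object],
  classOf[BSONObject],
  config)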

After that, I see the following in the logs:
............... 
15/01/19 12:26:08 INFO NativeS3FileSystem: Opening 's3n://ID:SECRET@BUCKET/test/.part-r-00000.bson.splits' for reading
15/01/19 12:26:08 INFO BSONSplitter: Found split file at : FileStatus{path=s3n://ID:SECRET@BUCKET/test/.part-r-00001.bson.splits; isDirectory=false; length=27; replication=1; blocksize=67108864; modification_time=1421487730000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
15/01/19 12:26:08 INFO NativeS3FileSystem: Opening 's3n://ID:SECRET@BUCKET/test/.part-r-00001.bson.splits' for reading
15/01/19 12:26:08 INFO BSONSplitter: Found split file at : FileStatus{path=s3n://ID:SECRET@BUCKET/.part-r-00002.bson.splits; isDirectory=false; length=27; replication=1; blocksize=67108864; modification_time=1421487730000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
15/01/19 12:26:08 INFO NativeS3FileSystem: Opening 's3n://ID:SECRET@BUCKET/.part-r-00002.bson.splits' for reading
............... 
And then nothing actually happens (I checked network usage with the nload tool).
This seems very strange: if I use an individual file, the job works as expected.
spark/bin/spark-submit  .... s3n://ID:SECRET@BUCKET/JOB_DIRECTORY/part-r-00002.bson
I can also read the same directory from HDFS without any problems...
What is going wrong?

Luke Lovett

Jan 20, 2015, 8:34:00 PM
to mongod...@googlegroups.com
Hi Dmitriy,

My knowledge of S3 is fairly limited, but I did come across an excellent troubleshooting guide from Amazon on input and output errors with Hadoop and S3: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-troubleshoot-errors-io.html. One thing that stuck out to me is that listing all objects within an S3 bucket is an extremely expensive operation (see the last section on that page), and I'm pretty sure that providing a path to a directory within S3 will cause Hadoop to do exactly that. Perhaps it seems as if "nothing is happening" because Hadoop is waiting on a listing of all the files in the S3 bucket, which may be taking a long time. I'm sorry I can't be of more help.
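
If you want to check whether it really is the listing, one rough thing you could try (just a sketch; the s3n URI below is the placeholder from your post, and the s3n filesystem needs your credentials configured) is timing a bare directory listing through the Hadoop FileSystem API, outside of Spark:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val dir  = new Path("s3n://ID:SECRET@BUCKET/JOB_DIRECTORY/")   // placeholder path from your post
val fs   = FileSystem.get(dir.toUri, conf)

val start    = System.currentTimeMillis()
val statuses = fs.listStatus(dir)                              // enumerate the directory contents
println(s"Listed ${statuses.length} entries in ${System.currentTimeMillis() - start} ms")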

Luke

Dmitriy Selivanov

Jan 21, 2015, 4:01:50 AM
to mongod...@googlegroups.com
Hi Luke, thank you for the answer. But it seems the problem is not the S3 listing. In the first pass Spark sees all the files in the directory and produces the split files (like .file-part-1.bson.split) without any problems. But then, when it tries to read the files, it freezes (no network activity, no CPU activity, both on the slaves and the master)...
But the link you provided is very useful!

On Wednesday, January 21, 2015 at 4:34:00 AM UTC+3, Luke Lovett wrote:

Dmitriy Selivanov

Jan 22, 2015, 11:50:29 AM
to mongod...@googlegroups.com
I found a solution here:
This pull request fixes the problem; I can now read a directory from S3.

But this issue, https://jira.mongodb.org/browse/HADOOP-178, is still annoying me. All records are doubled, so I use a workaround [very ugly :-( ] like this:
bsonRDD.reduceByKey((x, y) => x)

Hope this will be useful for somebody.


On Monday, January 19, 2015 at 18:45:05 UTC+3, Dmitriy Selivanov wrote:

Luke Lovett

Feb 19, 2015, 5:59:39 PM
to mongod...@googlegroups.com
Hey Dmitriy,
It seems that specifying "mapred.input.dir" and using newAPIHadoopFile together is what's causing the duplicate records. I'm investigating why this happens. In the meantime, you can either:

1. not set "mapred.input.dir" and keep using newAPIHadoopFile to load the BSON into an RDD, or
2. use newAPIHadoopRDD instead of newAPIHadoopFile and set "mapred.input.dir" to the location where your BSON files are kept (see the sketch after this list).
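
A rough sketch of option 2 (hypothetical: the path is the placeholder from this thread, and the key/value classes follow the connector's documented Spark usage, so adjust for your version):

import com.mongodb.hadoop.BSONFileInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.bson.BSONObject

val config = new Configuration()
config.set("mapred.input.dir", "s3n://ID:SECRET@BUCKET/JOB_DIRECTORY/")   // placeholder path

// the input location now comes from the configuration, not from a path argument
val bsonRDD = sc.newAPIHadoopRDD(
  config,
  classOf[BSONFileInputFormat].asSubclass(
    classOf[FileInputFormat[Object, BSONObject]]),
  classOf[Object],
  classOf[BSONObject])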

I'll update this discussion again when I have more information.

Luke Lovett

Feb 19, 2015, 7:12:49 PM
to mongod...@googlegroups.com
Update:

Spark itself appends the path given to "newAPIHadoopFile" to "mapreduce.input.fileinputformat.inputdir", which is the same setting as "mapred.input.dir" (the latter is deprecated). BSONFileInputFormat has no control over this. Setting "mapred.input.dir" yourself is therefore redundant; doing both is what produces the duplicate records.
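
In other words, a minimal sketch of the fix on the caller's side (reusing the placeholder path from this thread) is to drop the manual setting and let newAPIHadoopFile supply the path:

import com.mongodb.hadoop.BSONFileInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.bson.BSONObject

val config = new Configuration()
// do NOT set "mapred.input.dir" here; newAPIHadoopFile fills in
// "mapreduce.input.fileinputformat.inputdir" from its path argument

val bsonRDD = sc.newAPIHadoopFile(
  "s3n://ID:SECRET@BUCKET/JOB_DIRECTORY/",   // placeholder path
  classOf[BSONFileInputFormat].asSubclass(
    classOf[FileInputFormat[Object, BSONObject]]),
  classOf[Object],
  classOf[BSONObject],
  config)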