Hi MongoDB users. I have an issue with the mongo-hadoop connector, which I'm using with an Apache Spark standalone cluster. What I'm trying to do is read a directory of BSON files from S3 (produced by Spark in a previous job). My configuration is listed below:
config.set("mongo.job.input.format", "com.mongodb.hadoop.BSONFileInputFormat");
config.set("mapred.input.dir", args(0));
Then I read from S3:
spark/bin/spark-submit .... s3n://ID:SECRET@BUCKET/JOB_DIRECTORY/
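For reference, inside the job the read itself follows the usual `newAPIHadoopRDD` pattern from the mongo-hadoop examples — roughly like the sketch below (simplified; class names are from the connector's docs and may differ slightly from my actual code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}
import org.bson.BSONObject
import com.mongodb.hadoop.BSONFileInputFormat

object ReadBsonJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-bson"))

    val config = new Configuration()
    config.set("mongo.job.input.format", "com.mongodb.hadoop.BSONFileInputFormat")
    config.set("mapred.input.dir", args(0)) // the s3n:// directory from the command line

    // One (key, BSONObject) record per BSON document in the input files.
    val rdd = sc.newAPIHadoopRDD(
      config,
      classOf[BSONFileInputFormat].asSubclass(
        classOf[org.apache.hadoop.mapreduce.lib.input.FileInputFormat[Object, BSONObject]]),
      classOf[Object],
      classOf[BSONObject])

    println(rdd.count())
  }
}
```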
After that I see the following in the logs:
...............
15/01/19 12:26:08 INFO NativeS3FileSystem: Opening 's3n://ID:SECRET@BUCKET/test/.part-r-00000.bson.splits' for reading
15/01/19 12:26:08 INFO BSONSplitter: Found split file at : FileStatus{path=s3n://ID:SECRET@BUCKET/test/.part-r-00001.bson.splits; isDirectory=false; length=27; replication=1; blocksize=67108864; modification_time=1421487730000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
15/01/19 12:26:08 INFO NativeS3FileSystem: Opening 's3n://ID:SECRET@BUCKET/test/.part-r-00001.bson.splits' for reading
15/01/19 12:26:08 INFO BSONSplitter: Found split file at : FileStatus{path=s3n://ID:SECRET@BUCKET/.part-r-00002.bson.splits; isDirectory=false; length=27; replication=1; blocksize=67108864; modification_time=1421487730000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
15/01/19 12:26:08 INFO NativeS3FileSystem: Opening 's3n://ID:SECRET@BUCKET/.part-r-00002.bson.splits' for reading
...............
And then nothing happens (I checked network usage with the nload tool).
This seems very strange: if I point the job at an individual file, it works as expected:
spark/bin/spark-submit .... s3n://ID:SECRET@BUCKET/JOB_DIRECTORY/part-r-00002.bson
I can also read the same directory from HDFS without any problems.
What is going wrong?