Mongo-Hadoop: Pig can write correct BSON to HDFS or S3 but doesn't write directly to MongoDB

Iver Walkoe

Jun 25, 2015, 4:48:07 PM
to mongod...@googlegroups.com
Hello,

I've been going nuts over this.

I have data imported into HDFS from an RDBMS by Sqoop. The data is in CSV format.

I have an EMR cluster set up with Spark, Hive, Pig, and all the 1.4-rc0 Mongo-Hadoop jars. (FWIW, I also tried the 1.3.2 jars, but they didn't work from the outset: I got the "Output directory not set" error.)
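For completeness, the jars are registered at the top of the script along these lines (a sketch: the paths and exact version strings below are placeholders for whatever your cluster actually has):

REGISTER /usr/lib/mongo-hadoop/mongo-hadoop-core-1.4.0-rc0.jar;
REGISTER /usr/lib/mongo-hadoop/mongo-hadoop-pig-1.4.0-rc0.jar;
REGISTER /usr/lib/mongo-hadoop/mongo-java-driver-3.0.2.jar;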

Everything seems fine.

I can process the data above with Pig and output it in BSON format to HDFS or S3. I've even downloaded a small Pig-exported dataset from S3 to my laptop, scp'd it to the MongoDB server, and MongoDB imported it perfectly, implying Pig is formatting the BSON output correctly. The table metadata is preserved and the JSON records look perfect. So the data is fine and Pig "gets" BSON.
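That working BSON export is just the stock BSONStorage loader pointed at HDFS or S3, something like this (a sketch; the bucket and path are placeholders):

STORE data_out INTO 's3://my-bucket/pig-output' USING com.mongodb.hadoop.pig.BSONStorage();

The resulting .bson files are what I copied over and imported on the MongoDB side.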

Additionally, the MongoDB server and the EMR cluster are in the same security group and can talk to one another; pings work perfectly.

Here's the part I just can't figure out: I run a Pig job and specify the MongoDB server in the STORE command. The Pig script runs without error, reports "Success!" at the conclusion, and gives the correct number of records and bytes as written.

But when I look in MongoDB... there's nothing there.

I'll add that I also tried the local output option, with STORE pointing at a file:/// destination on the EMR machine; that too reports success but writes nothing. The absence of error messages is particularly perplexing: everything looks fine, yet nothing actually happens.
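(For that local test the STORE was along these lines; I'm assuming the same BSONStorage loader here, and the path is a placeholder, not my actual destination:)

STORE data_out INTO 'file:///tmp/pig-bson-test' USING com.mongodb.hadoop.pig.BSONStorage();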

Sample output is below (with machine and database names obscured):

HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.4.0 0.12.0 hadoop 2015-06-25 19:45:21 2015-06-25 19:46:13 UNKNOWN

Success!

Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_1435165736183_0027 1 0 36 36 36 36 n/a n/a n/a n/a data_in,data_out MAP_ONLY mongodb://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:27017/whatever.something,

Input(s):
Successfully read 14198 records (9182636 bytes) from: "hdfs://xx-xx-xx-xx:9000/user/sqoop/this_sample/part-m-00000"

Output(s):
Successfully stored 14198 records (21312399 bytes) in: "mongodb://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:27017/whatever.something"

Counters:
Total records written : 14198
Total bytes written : 21312399
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0 

P.S. I've tried pointing Pig both at a directory (i.e., the output of the Sqoop process) and at a particular file in that directory, as above; both give the same result: it reports success but nothing shows up in Mongo.

I've tried several variations on the STORE instruction including:

STORE data_out INTO 'mongodb://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:27017/whatever.something' USING com.mongodb.hadoop.pig.MongoInsertStorage('', '' );
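For context, the whole script boils down to roughly the following (a sketch: the load path and schema are placeholders standing in for my real columns, but the data_in/data_out aliases match the job stats above):

-- jars registered as shown earlier
data_in  = LOAD 'hdfs:///user/sqoop/this_sample' USING PigStorage(',')
           AS (id:int, name:chararray, value:double);
data_out = FOREACH data_in GENERATE id, name, value;
STORE data_out INTO 'mongodb://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:27017/whatever.something'
      USING com.mongodb.hadoop.pig.MongoInsertStorage('', '');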

Any and all assistance is greatly appreciated!

Best regards,

Iver
