Hi all,
I'm trying to insert some documents into MongoDB using the Hadoop connector with Spark (in Python).
So far I can connect to Mongo and read the collection. My problem is that I can't find a way to insert a document and write it back to the database: the job does write the collection back, but it comes out empty, without the document I tried to insert.
Could anyone help me with this?
Thanks,
Marcus
Below is the code I'm using:
__author__ = 'marcusrehm'
from pyspark import SparkContext
import xmltodict
nfce = open('D:/35150400776574000741550030073022671331273267.xml','r')
o = xmltodict.parse(nfce)
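(For context, xmltodict.parse turns the XML into nested Python dicts. The tags below are made up just to show the shape -- the real NF-e file has many more fields, and I can't share it here:)

```python
# Rough shape of what xmltodict.parse returns for an NF-e file
# (hypothetical tags -- illustration only):
doc = {
    "NFe": {
        "infNFe": {
            "emit": {"xNome": "ACME LTDA"},   # element text comes back as str
            "total": {"vNF": "123.45"},
        }
    }
}

# Nested elements become nested dicts, values are strings:
assert doc["NFe"]["infNFe"]["total"]["vNF"] == "123.45"
```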
# set up parameters for reading from MongoDB via Hadoop input format
config = {"mongo.input.uri": "mongodb://ds.mongolab.com:39250/td.nf",
          "mongo.input.split.create_input_splits": "false",
          "mongo.output.uri": "mongodb://ds.mongolab.com:39250/td.nf"}
inputFormatClassName = "com.mongodb.hadoop.MongoInputFormat"
outputFormatClassName = "com.mongodb.hadoop.MongoOutputFormat"
# these values worked but others might as well
keyClassName = "org.apache.hadoop.io.Text"
valueClassName = "org.apache.hadoop.io.MapWritable"
# read the NFC-e collection from MongoDB into a Spark RDD of (key, value) pairs
sc = SparkContext()
nfRDD = sc.newAPIHadoopRDD(inputFormatClassName, keyClassName, valueClassName, None, None, config)
# map takes a function and returns a new RDD, so the result must be assigned;
# here the value of every existing pair is replaced with the parsed document o
notasRDD = nfRDD.map(lambda kv: (kv[0], o))
# write the (key, value) pairs back out through the Mongo output format
notasRDD.saveAsNewAPIHadoopFile("file:///placeholder", outputFormatClassName, None, None, None, None, config)
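In case it helps, this is roughly what I expected the write side to look like: saveAsNewAPIHadoopFile just writes whatever (key, value) pairs the RDD holds, so mapping over an RDD read from an empty collection never adds my new document to anything. A sketch of building the pairs myself (the key handling is my guess -- I'm not sure whether MongoOutputFormat accepts a None key or needs an _id):

```python
# Hypothetical sketch: the new document has to be *in* the RDD that gets
# saved, so build the (key, value) pairs in plain Python first.
o = {"xNome": "ACME LTDA"}   # stand-in for the parsed NF-e dict
pairs = [(None, o)]          # key None -> hoping Mongo assigns the _id (unverified)

assert pairs[0][1]["xNome"] == "ACME LTDA"

# ...then (untested) hand them to Spark and save:
# docRDD = sc.parallelize(pairs)
# docRDD.saveAsNewAPIHadoopFile("file:///placeholder",
#                               outputFormatClassName,
#                               None, None, None, None, config)
```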