process mongodb data using spark Graphx


Pooja Parab

Feb 7, 2017, 8:43:23 AM2/7/17
to mongodb-user
hi,
I am new to mongodb-spark. I am trying to use Spark GraphX to process data, but I am not sure how to create a case class to define the schema and use it.
My dataset looks like this:

    {
    "_id" : "5877fe1ae4b0d1d303d9fec2", 
    "child_id" : "AGCLT", 
    "child_name" : "aaa.com Inc", 
    "parent" : "MLAB", 
    "group_id" : NumberInt(101337), 
    "member_id": NumberInt(1013)
    }

Below is my code.

object graphx1 {

  def main(args: Array[String]) {

    // set hadoop.home.dir for the Windows environment
    System.setProperty("hadoop.home.dir", "C:\\winutils\\")

    val sparkConf = new SparkConf()
      .setMaster("local")
      .setAppName("MongoSpark")
      .set("spark.mongodb.input.uri", "mongodb://localhost:27017/test.data")

    val sc = new SparkContext(sparkConf)

    case class Example(_id: Long, child_id: String, child_name: String, parent_name: String, group_id: Long, member_id: Int)

    def parseData(str: String): Example = {
      val line = str.split(",")
      Example(line(0).toLong, line(1), line(2), line(3), line(4).toLong, line(5).toInt)
    }

    val dataRDD = MongoSpark.load(sc, ReadConfig(Map("collection" -> "data"), Some(ReadConfig(sc))))

    val newRDD = dataRDD.map(parseData).cache()
  }
}
  

I am getting an error at the highlighted line: type mismatch; found : String ⇒ Example, required: org.bson.Document ⇒ ?

-pooja






Wan Bachtiar

Feb 13, 2017, 1:49:04 AM2/13/17
to mongodb-user

I am getting an error at the highlighted line: type mismatch; found : String ⇒ Example, required: org.bson.Document ⇒ ?

Hi Pooja,

This is because your variable dataRDD contains org.bson.Document values instead of String. See also BSON.Documents for more information.

As an example, you could try the snippet below:

case class Example(child_id: String, member_id: Int)

def parseData(doc: Document): Example = {
  Example(doc.get("child_id").toString,
          doc.get("member_id").asInstanceOf[Number].intValue)
}
val newrdd = rdd.map(parseData)
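For illustration, the same Document-to-case-class conversion can be sketched without Spark or the MongoDB driver, using a plain Scala Map as a stand-in for org.bson.Document (an assumption for the sketch only; a real Document exposes values by String key in the same way):

```scala
// Minimal sketch of parseData, with Map[String, Any] standing in for
// org.bson.Document (assumption: both return values by String key).
case class Example(child_id: String, member_id: Int)

def parseData(doc: Map[String, Any]): Example =
  Example(doc("child_id").toString,
          // MongoDB integers arrive boxed (java.lang.Integer/Long), i.e. a Number
          doc("member_id").asInstanceOf[Number].intValue)

val doc = Map[String, Any]("child_id" -> "AGCLT", "member_id" -> 1013)
println(parseData(doc)) // prints Example(AGCLT,1013)
```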

I am new to mongodb-spark. I am trying to use Spark GraphX to process data.

You may also find the example in this post useful: Spark Mongo Connector - GraphX.

Regards,

Wan.

Pooja Parab

Feb 13, 2017, 8:15:59 AM2/13/17
to mongodb-user
hi Wan,
Thanks for the solution.

The error is gone now, but when I create a new RDD by passing the data through the parseData method and try to print that RDD, it does not give me any result.

val dataRDD = sc.loadFromMongoDB()
println(dataRDD.first())

// case class
case class Example(child_id: String, child_name: String, parent: String, parent_name: String, member_id: Int)

def parseData(doc: Document): Example = {
  Example(doc.get("child_id").toString,
          doc.get("child_name").toString,
          doc.get("parent").toString,
          doc.get("parent_name").toString,
          doc.get("member_id").asInstanceOf[Number].intValue)
}

val newrdd = dataRDD.map(parseData)
println(newrdd)

val defaultID = "defaultID"
val vertices = newrdd.map(Example => ("child_id", "child_name"))
val relation = newrdd.map(Example => ("parent", "parent_name"))

//val graph = Graph(vertices, relation, defaultID)

for (doc <- vertices.take(10)) println(doc)


I am not getting any output for the second RDD. When I try to print vertices I get the output below, not the actual values:

(child_id,child_name)
(child_id,child_name)

Can anyone tell me how to solve this?

Wan Bachtiar

Feb 22, 2017, 1:18:09 AM2/22/17
to mongodb-user

val vertices = newrdd.map(Example => ("child_id", "child_name"))

When I try to print vertices I get the output below, not the actual values.

Hi Pooja,

This is because you have mapped every Example object in the RDD to the literal strings "child_id" and "child_name".

If you’re trying to retrieve the values of child_id and child_name instead, you could try the example below:

val vertices = newrdd.map(x => (x.child_id, x.child_name))
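One further point when feeding these pairs to GraphX: Graph expects vertices as (VertexId, attr) pairs where VertexId is a Long, while child_id here is a String. A common workaround (an assumption, not something discussed in this thread) is to derive a Long key from the string, for example with hashCode. A driver-free sketch, using a plain List in place of the RDD so it runs without Spark:

```scala
// Sketch: keying String ids to Long vertex ids for GraphX via hashCode.
// A plain List stands in for the RDD (assumption, so the sketch runs without Spark).
case class Example(child_id: String, child_name: String, parent: String, member_id: Int)

def vertexId(id: String): Long = id.hashCode.toLong

val rows = List(Example("AGCLT", "aaa.com Inc", "MLAB", 1013))
val vertices = rows.map(x => (vertexId(x.child_id), x.child_name))
// With real RDDs, pairs like these could feed Graph(vertices, edges, defaultAttr).
println(vertices.head)
```

Note that hashCode can collide; for production graphs a collision-free id (e.g. from zipWithUniqueId or a 64-bit hash) would be safer.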

If you have further questions on Apache Spark or Scala, I would recommend posting a question on StackOverflow to reach a wider audience.

Best regards,

Wan.
