Trying to use MongoDB-hadoop connector for Spark in Scala


Kim Ngo

unread,
Jul 17, 2015, 4:56:34 PM
to mongod...@googlegroups.com
I'm new to using Spark and MongoDB, and I'm trying to read from an existing database that is on MongoDB.

I am receiving an error when trying to call first() on an RDD.

I'm using mongodb-driver-3.0.2.jar and mongo-hadoop-core-1.4.0.jar when starting up my spark-shell. I'd appreciate any input, thanks!
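
For reference, the shell is being launched with something along these lines (the jar paths here are illustrative placeholders):

spark-shell --jars /path/to/mongodb-driver-3.0.2.jar,/path/to/mongo-hadoop-core-1.4.0.jar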

import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import org.bson.BasicBSONObject

val config = new Configuration()
config.set("mongo.input.uri", "mongodb://host:27017/collection.database")
config.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat")

val mongoRDD = sc.newAPIHadoopRDD(config, classOf[com.mongodb.hadoop.MongoInputFormat], classOf[Object], classOf[BSONObject])

mongoRDD.first()


java.lang.NoClassDefFoundError: com/mongodb/ReadPreference
        at com.mongodb.MongoClientOptions$Builder.<init>(MongoClientOptions.java:686)
        at com.mongodb.MongoClientURI.<init>(MongoClientURI.java:150)
        at com.mongodb.hadoop.util.MongoConfigUtil.getMongoClientURI(MongoConfigUtil.java:367)
        at com.mongodb.hadoop.util.MongoConfigUtil.getInputURI(MongoConfigUtil.java:371)
        at com.mongodb.hadoop.splitter.MongoSplitterFactory.getSplitter(MongoSplitterFactory.java:113)
        at com.mongodb.hadoop.MongoInputFormat.getSplits(MongoInputFormat.java:56)
        at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
        at org.apache.spark.rdd.RDD.take(RDD.scala:1156)
        at org.apache.spark.rdd.RDD.first(RDD.scala:1189)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:41)
        at $iwC$$iwC$$iwC.<init>(<console>:43)
        at $iwC$$iwC.<init>(<console>:45)
        at $iwC.<init>(<console>:47)
        at <init>(<console>:49)
        at .<init>(<console>:53)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
        at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
        at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
        at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
        at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
        at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
        at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
        at org.apache.spark.repl.Main$.main(Main.scala:31)
        at org.apache.spark.repl.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.mongodb.ReadPreference
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        ... 59 more



Jeff Yemin

unread,
Jul 17, 2015, 5:12:56 PM
to mongod...@googlegroups.com
Hi Kim, as documented in the installation guide, mongodb-driver-3.0.2.jar depends on mongodb-driver-core-3.0.2.jar and bson-3.0.2.jar.
So you'll either need to use all three jar files, or else use the mongo-java-driver-3.0.2.jar uber-jar.
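
A rough sketch of either option on the spark-shell command line (jar paths are placeholders; the mongo-hadoop-core connector jar still needs to be supplied alongside the driver jars):

spark-shell --jars mongodb-driver-3.0.2.jar,mongodb-driver-core-3.0.2.jar,bson-3.0.2.jar,mongo-hadoop-core-1.4.0.jar

or

spark-shell --jars mongo-java-driver-3.0.2.jar,mongo-hadoop-core-1.4.0.jar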

Please let me know if that solves your issue.

Regards,
Jeff




Jeff Yemin

unread,
Jul 20, 2015, 10:25:34 PM
to mongod...@googlegroups.com
I'm pretty certain you still have a configuration error, as that jar definitely contains that class:

~/.m2/repository/org/mongodb/mongo-java-driver/3.0.2$ jar tvf mongo-java-driver-3.0.2.jar | grep "com/mongodb/ReadPreference.class"
  4366 Thu May 28 14:19:42 EDT 2015 com/mongodb/ReadPreference.class
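
The same kind of check can be run against the connector jar, to confirm the Hadoop input format class is present as well (assuming mongo-hadoop-core-1.4.0.jar is the jar in use):

jar tvf mongo-hadoop-core-1.4.0.jar | grep "com/mongodb/hadoop/MongoInputFormat.class"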


Regards,
Jeff

On Mon, Jul 20, 2015 at 4:52 PM, Kim Ngo <ngoa...@gmail.com> wrote:
Hi Jeff,

Thanks for clarifying that. I tried both methods and unfortunately still ran into problems.

Using all three jar files: I get the same error

Using mongo-java-driver-3.0.2.jar

val mongoRDD = sc.newAPIHadoopRDD(config, classOf[com.mongodb.hadoop.MongoInputFormat], classOf[Object], classOf[BSONObject])

error: object hadoop is not a member of package com.mongodb

      val mongoRDD = sc.newAPIHadoopRDD(config, classOf[com.mongodb.hadoop.MongoInputFormat], classOf[Object], classOf[BSONObject])




On Friday, July 17, 2015 at 5:12:56 PM UTC-4, Jeff Yemin wrote:
Hi Kim, as documented in the installation guide, mongodb-driver-3.0.2.jar depends on mongodb-driver-core-3.0.2.jar and bson-3.0.2.jar.
So you'll either need to use all three jar files, or else use the mongo-java-driver-3.0.2.jar uber-jar.

Please let me know if that solves your issue.

Regards,
Jeff


Kim Ngo

unread,
Jul 21, 2015, 11:56:15 AM
to mongod...@googlegroups.com
Hi Jeff,

Thanks for looking into my issue. You were right--I did not have all the jars needed.
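
For anyone else hitting this, a launch command along these lines (placeholder paths) keeps both the driver uber-jar and the connector jar on the classpath, which is likely what the "object hadoop is not a member of package com.mongodb" error above was pointing at:

spark-shell --jars /path/to/mongo-java-driver-3.0.2.jar,/path/to/mongo-hadoop-core-1.4.0.jar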

Chenna Varri

unread,
Aug 27, 2015, 6:51:33 PM
to mongodb-user
Thanks, Jeff. That definitely solved the issue for me.

Astro

unread,
Oct 19, 2015, 8:40:51 AM
to mongodb-user
Hi,

Recently I encountered the same issue while trying Spark with MongoDB.
I have been following this blog to set up MongoDB + Spark.

Mongodb 2.6.7, Spark 1.3.0, python 2.6.6

As discussed above I ran pyspark in two different ways:

1. pyspark --jars mongo-java-driver-3.0.4.jar, mongo-hadoop-core-1.4.0.jar (using uber jar)

2. pyspark --jars mongodb-driver-3.0.4.jar, bson-3.0.4.jar, mongodb-core-3.0.4.jar (all three)

I have checked that ReadPreference.class is available there, both in the uber jar and in mongodb-driver-3.0.4.jar/mongodb-core-3.0.4.jar.

But it failed either way with:

java.lang.NoClassDefFoundError: com/mongodb/ReadPreference
        at com.mongodb.MongoClientOptions$Builder.<init>(MongoClientOptions.java:686)
        at com.mongodb.MongoClientURI.<init>(MongoClientURI.java:150)

Caused by: java.lang.ClassNotFoundException: com.mongodb.ReadPreference


Help!

Luke Lovett

unread,
Oct 19, 2015, 11:30:35 AM
to mongodb-user
Hey Astro,

It's hard to tell from the formatting of the post, but it looks like you might have some whitespace after the ',' in --jars. You should have no whitespace there. To make sure that the jars are actually found in the pyspark shell, look for log messages like:

15/10/19 08:27:27 INFO SparkContext: Added JAR file:/home/spark/spark-1.3.1-bin-hadoop2.6/lib/mongo-java-driver.jar at http://192.168.59.3:60728/jars/mongo-java-driver.jar with timestamp 1445268447895

for each jar that you supplied to --jars.
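
For example, an invocation along these lines (placeholder paths), with the jars separated only by a comma:

pyspark --jars /path/to/mongo-java-driver-3.0.4.jar,/path/to/mongo-hadoop-core-1.4.0.jar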

Astro

unread,
Oct 20, 2015, 1:54:17 AM
to mongodb-user
Hi Luke,

I added the jars without any whitespace, for sure. They were separated only by commas, and I can see all of those log messages showing that each jar (mongo-java and mongo-hadoop) was added to the SparkContext with a timestamp.

Despite verifying all of this, I still get the above-mentioned error.


Thanks,


Luke Lovett

unread,
Oct 20, 2015, 1:19:22 PM
to mongodb-user
Perhaps you're hitting this: https://issues.apache.org/jira/browse/SPARK-5185

If that's the case, try also adding these jars to the '--driver-class-path' option in pyspark.
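
A sketch of what that might look like (placeholder paths; note that --jars takes a comma-separated list, while --driver-class-path takes an ordinary colon-separated classpath):

pyspark --jars /path/to/mongo-java-driver-3.0.4.jar,/path/to/mongo-hadoop-core-1.4.0.jar --driver-class-path /path/to/mongo-java-driver-3.0.4.jar:/path/to/mongo-hadoop-core-1.4.0.jar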


Luke Lovett

unread,
Oct 22, 2015, 2:04:22 PM
to mongod...@googlegroups.com
I'm not sure what you're missing, either. I'm curious about "mongo-hadoop-core-1.0.0-sources.jar," though. That is a very old version of the connector. I'm not sure that it will make a difference, but you should try using the latest version (1.4.1).

Other than that, I can tell you that mongo-java-driver-2.13.0.jar does contain com.mongodb.ReadPreference. I'm not sure what's wrong with your current setup. It might be helpful to see what the actual classpath of PySpark is when it launches. You can do that with this command:

SPARK_PRINT_LAUNCH_COMMAND=1 bin/spark-shell

Then you can double-check that mongo-java-driver.jar is on the classpath.

On Thu, Oct 22, 2015 at 10:51 AM, Astro <atish....@raweng.com> wrote:
> Hi Luke,
>
> This is what I am trying to do so far:
>
> Setup:
> Python 2.7.6, Spark 1.3.0, MongoDB 2.6.7, Java 1.7.0_67.
>
> Command:
> pyspark --jars
> /home/cloudera/mongo-java-driver-2.13.0.jar,/home/cloudera/astro/mongodb-spark/jars/mongo-hadoop-core-1.0.0-sources.jar
> --driver-class-path
> /home/cloudera/mongo-java-driver-2.13.0.jar,/home/cloudera/astro/mongodb-spark/jars/mongo-hadoop-core-1.0.0-sources.jar
>
> In logs:
> INFO spark.SparkContext: Added JAR
> file:/home/cloudera/mongo-java-driver-2.13.0.jar at
> http://172.16.93.132:56379/jars/mongo-java-driver-2.13.0.jar with timestamp
> 1445534602340
>
> INFO spark.SparkContext: Added JAR
> file:/home/cloudera/astro/mongodb-spark/jars/mongo-hadoop-core-1.0.0-sources.jar
> at http://172.16.93.132:39043/jars/mongo-hadoop-core-1.0.0-sources.jar with
> timestamp 1445535061266
>
> Error in logs:
>
> An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
> : java.lang.NoClassDefFoundError: com/mongodb/ReadPreference
> at
> com.mongodb.MongoClientOptions$Builder.<init>(MongoClientOptions.java:686)
> at com.mongodb.MongoClientURI.<init>(MongoClientURI.java:150)
> at
> com.mongodb.hadoop.util.MongoConfigUtil.getMongoClientURI(MongoConfigUtil.java:367)
> at
> com.mongodb.hadoop.util.MongoConfigUtil.getInputURI(MongoConfigUtil.java:371)
> at
> com.mongodb.hadoop.splitter.MongoSplitterFactory.getSplitter(MongoSplitterFactory.java:113)
> at com.mongodb.hadoop.MongoInputFormat.getSplits(MongoInputFormat.java:56)
> at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
> at
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
> at org.apache.spark.rdd.RDD.take(RDD.scala:1156)
> at
> org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:205)
> at
> org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:483)
> at org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:259)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException: com.mongodb.ReadPreference
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>
>
> Just wondering what I'm missing.
>
> Thanks,
>
>
>