hadoop cluster for querying data on mongodb

Martinus Martinus

Dec 20, 2011, 10:29:55 PM
to mongod...@googlegroups.com
Hi,

I have a Hadoop cluster running and my data inside a MongoDB database. I have already written Java code to query MongoDB using the mongodb-java driver. Now I want to use the Hadoop cluster to run my Java code to get data from and put data into the Mongo database. Has anyone done this before? Can you explain to me how to do it?

Thanks.

Eliot Horowitz

Dec 21, 2011, 1:08:34 AM
to mongod...@googlegroups.com
Take a look at:
https://github.com/mongodb/mongo-hadoop
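
The repo also ships a WordCount example; boiled down, a job that reads its input from and writes its output to MongoDB looks roughly like the sketch below (untested here; test.in / test.out are placeholder namespaces, and the string field "x" matches the field the project's example data uses):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.bson.BSONObject;

    import com.mongodb.hadoop.MongoInputFormat;
    import com.mongodb.hadoop.MongoOutputFormat;
    import com.mongodb.hadoop.util.MongoConfigUtil;

    public class MongoWordCount {

        // MongoInputFormat hands each document to the mapper as a BSONObject;
        // here we assume a string field "x" holding one line of text.
        public static class TokenizerMapper
                extends Mapper<Object, BSONObject, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, BSONObject value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.get("x").toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Input and output are MongoDB collections, named by connection URI.
            MongoConfigUtil.setInputURI(conf, "mongodb://localhost/test.in");
            MongoConfigUtil.setOutputURI(conf, "mongodb://localhost/test.out");

            Job job = new Job(conf, "mongo word count");
            job.setJarByClass(MongoWordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setInputFormatClass(MongoInputFormat.class);
            job.setOutputFormatClass(MongoOutputFormat.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }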


Martinus Martinus

Dec 21, 2011, 1:20:42 AM
to mongod...@googlegroups.com
Hi Eliot,

I tried to build the jar file from the core folder inside it, but it gave me a source-5 error, so I added this to the pom.xml file below </resources>:

    <plugins>
     <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
       <source>1.5</source>
       <target>1.5</target>
      </configuration>
     </plugin>
    </plugins>

After that it builds with mvn package and gives me mongo-hadoop-core-1.0-SNAPSHOT.jar in the target folder, but I still don't know how to use this library together with MongoDB inside Eclipse.

Thanks.

Martinus Martinus

Dec 25, 2011, 11:46:12 PM
to mongod...@googlegroups.com
Hi Eliot,

I tried to use the mongo-hadoop plugin with hadoop-0.20.2. Do I need to add all of the Hadoop libraries as external libraries? When I tried to run the WordCount.java program in Eclipse, it gave me this error:

Conf: Configuration: core-default.xml, core-site.xml
11/12/26 12:42:46 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/12/26 12:42:46 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/12/26 12:42:46 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
11/12/26 12:42:58 INFO util.MongoSplitter:  Calculate Splits Code ... Use Shards? false, Use Chunks? true; Collection Sharded? false
11/12/26 12:42:58 INFO util.MongoSplitter: Creation of Input Splits is enabled.
11/12/26 12:42:58 INFO util.MongoSplitter: Using Unsharded Split mode (Calculating multiple splits though)
11/12/26 12:42:58 INFO util.MongoSplitter: Calculating unsharded input splits on namespace 'test.in' with Split Key '{ "_id" : 1}' and a split size of '8'mb per
Exception in thread "main" java.lang.IllegalArgumentException: Unable to calculate input splits: ns not found
    at com.mongodb.hadoop.util.MongoSplitter.calculateUnshardedSplits(MongoSplitter.java:106)
    at com.mongodb.hadoop.util.MongoSplitter.calculateSplits(MongoSplitter.java:75)
    at com.mongodb.hadoop.MongoInputFormat.getSplits(MongoInputFormat.java:51)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
    at WordCount.main(WordCount.java:76)

Would you be so kind as to tell me how to fix this problem?

Thanks.

Martinus Martinus

Dec 26, 2011, 4:02:03 AM
to mongod...@googlegroups.com
Hi Eliot,

I found where the problem was: I had not created the "in" collection before running the program, which caused the error above.

Thanks.

Merry Christmas.

Martinus Martinus

Jan 2, 2012, 4:44:46 AM
to mongod...@googlegroups.com
Hi,

Is there a better way to do map/reduce in parallel on many machines, to query and write data in a MongoDB database, besides using Hadoop?

Thanks and Happy New Year 2012.

Scott Hernandez

Jan 2, 2012, 7:36:32 AM
to mongod...@googlegroups.com
Do you mean other than using this? https://github.com/mongodb/mongo-hadoop

It sounded like you got it working once you set the correct params.

Martinus Martinus

Jan 2, 2012, 11:37:43 AM
to mongod...@googlegroups.com
Hi Scott,

Yes, I tried the example, but I'm wondering whether there is another distributed computing framework like Hadoop that can do map/reduce with MongoDB, and do it faster.

Thanks and Happy New Year 2012.

Scott Hernandez

Jan 2, 2012, 11:53:14 AM
to mongod...@googlegroups.com
Conceptually there are many, and people have even written their own map/reduce (or batch-processing) frameworks for MongoDB, but in practice there are only a few established patterns and well-used systems. There is the built-in map/reduce system in MongoDB, and there are external extensions like the Hadoop integration. That is about all I know of that is actively used in the open source community.

Here is an example of someone rolling their own, as I mentioned above:
http://sourceforge.net/p/zarkov/blog/2011/07/zarkov-is-a-lightweight-map-reduce-framework/

The (possible) advantage of taking map/reduce jobs off MongoDB is that you can (dynamically) scale CPU/processing and enable richer programming logic/libraries using an external system.
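
For comparison, the built-in map/reduce runs JavaScript inside mongod; calling it from the Java driver looks roughly like this (a sketch, again assuming word-count style documents with a string field "x"):

    import com.mongodb.DBCollection;
    import com.mongodb.DBObject;
    import com.mongodb.MapReduceOutput;
    import com.mongodb.Mongo;

    public class BuiltInWordCount {
        public static void main(String[] args) throws Exception {
            Mongo m = new Mongo("localhost", 27017);
            DBCollection in = m.getDB("test").getCollection("in");

            // Map and reduce are JavaScript strings, executed server-side.
            String map = "function() {"
                       + "  this.x.split(' ').forEach(function(w) { emit(w, 1); });"
                       + "}";
            String reduce = "function(key, values) {"
                          + "  var sum = 0;"
                          + "  values.forEach(function(v) { sum += v; });"
                          + "  return sum;"
                          + "}";

            // Write results to the 'out' collection; a null query means all documents.
            MapReduceOutput out = in.mapReduce(map, reduce, "out", null);
            for (DBObject doc : out.results()) {
                System.out.println(doc);
            }
            m.close();
        }
    }

The trade-off is the one described above: this runs inside mongod and is limited to JavaScript, while an external system like Hadoop can scale processing independently of the database.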

Martinus Martinus

Jan 4, 2012, 3:54:42 AM
to mongod...@googlegroups.com
Hi Scott,

I remember that Hadoop has a single point of failure in its namenode. Is there any way to make this hadoop-mongo setup more fault tolerant?

Thanks.

Scott Hernandez

Jan 13, 2012, 6:10:51 PM
to mongod...@googlegroups.com
Not sure what you mean. Do you mean independent of the Hadoop framework? On the MongoDB side you can use a replica set or a sharded cluster to take care of any single points of failure (SPF).

I'm not an SPF expert for Hadoop, but I thought they'd taken care of this a while ago.

Martinus Martinus

Jan 16, 2012, 1:19:47 AM
to mongod...@googlegroups.com
Hi Scott,

Can mongo-hadoop also put data into MongoDB in a replicated and sharded environment? And how do I set up a connection to a replicated and sharded MongoDB deployment using the Java driver?

Thanks.

Scott Hernandez

Jan 16, 2012, 8:48:45 AM
to mongod...@googlegroups.com
In a sharded environment use one of the mongos addresses when
connecting. For a replica set you just need to use a seed list of more
than one address.
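
In Java driver terms that comes out roughly as below (the host names are placeholders):

    import java.util.Arrays;

    import com.mongodb.Mongo;
    import com.mongodb.ServerAddress;

    public class Connections {
        public static void main(String[] args) throws Exception {
            // Sharded cluster: connect through a mongos router.
            Mongo sharded = new Mongo(new ServerAddress("mongos1.example.com", 27017));

            // Replica set: give a seed list of two or more members;
            // the driver discovers the rest and follows failovers.
            Mongo replicaSet = new Mongo(Arrays.asList(
                    new ServerAddress("rs1.example.com", 27017),
                    new ServerAddress("rs2.example.com", 27017)));

            sharded.close();
            replicaSet.close();
        }
    }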

Brendan W. McAdams

Jan 16, 2012, 8:51:48 AM
to mongod...@googlegroups.com

There is also some explanation of the different settings for mongo-hadoop with regard to sharding in the README.
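
Those settings are plain Hadoop configuration properties. Roughly, and with the caveat that the property names below are taken from the current MongoConfigUtil and the README should be treated as authoritative:

    import org.apache.hadoop.conf.Configuration;

    import com.mongodb.hadoop.util.MongoConfigUtil;

    public class ShardSettingsSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Point the input at a mongos router (placeholder host name).
            MongoConfigUtil.setInputURI(conf, "mongodb://mongos1.example.com/test.in");
            // One input split per shard chunk (the "Use Chunks?" flag in the logs).
            conf.setBoolean("mongo.input.split.read_shard_chunks", true);
            // Read directly from the shard mongods instead of through mongos
            // (the "Use Shards?" flag in the logs); off by default.
            conf.setBoolean("mongo.input.split.read_from_shards", false);
        }
    }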

elif

Feb 14, 2012, 4:42:12 PM
to mongodb-user
How did you create the "in" collection later? I am getting the same error and don't know how to proceed. Are we supposed to import beyond_lies_the_wub.txt into MongoDB, or do we need to set it up as the input?

Thanks.

Martinus Martinus

Feb 14, 2012, 9:17:06 PM
to mongod...@googlegroups.com
Hi elif,

You need to create a collection named "in" in your MongoDB database from the Mongo shell, and then you can run the WordCount.java example; otherwise there is nothing for mongo-hadoop to map/reduce.
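
If you would rather do it from Java than the shell, a single insert is enough, since MongoDB creates collections lazily (a sketch; the "x" field is what the WordCount mapper reads):

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;
    import com.mongodb.Mongo;

    public class CreateInCollection {
        public static void main(String[] args) throws Exception {
            Mongo m = new Mongo("localhost", 27017);
            DBCollection in = m.getDB("test").getCollection("in");
            // The first insert implicitly creates the test.in collection.
            in.insert(new BasicDBObject("x", "hello world hello"));
            m.close();
        }
    }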

Thanks.

elif

Feb 15, 2012, 11:47:56 AM
to mongodb-user
Thanks Martinus. I did that and now WordCount runs. But how do I import text into the "in" collection? Right now I have an empty "in" and therefore an empty "out" collection. I am assuming that "beyond_lies_the_wub.txt" is the sample input they provide for this example, but how do I import it into MongoDB, since it is not in JSON, CSV, or TSV format?

thanks,

elif

Martinus Martinus

Feb 16, 2012, 9:38:49 PM
to mongod...@googlegroups.com
Hi Elif,

I guess you should read this first:

http://www.mongodb.org/display/DOCS/Import+Export+Tools.

I have never tested that either. I guess the developers know this better than I do. :)
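
Since the file is plain text rather than JSON, CSV, or TSV, one option is to skip mongoimport and load it yourself with the Java driver, one document per line (an untested sketch; the "x" field name matches what the WordCount mapper expects):

    import java.io.BufferedReader;
    import java.io.FileReader;

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;
    import com.mongodb.Mongo;

    public class LoadTextFile {
        public static void main(String[] args) throws Exception {
            Mongo m = new Mongo("localhost", 27017);
            DBCollection in = m.getDB("test").getCollection("in");

            BufferedReader reader = new BufferedReader(
                    new FileReader("beyond_lies_the_wub.txt"));
            String line;
            while ((line = reader.readLine()) != null) {
                // One document per line of the source text.
                in.insert(new BasicDBObject("x", line));
            }
            reader.close();
            m.close();
        }
    }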

Thanks.

kg

Mar 15, 2012, 5:13:36 PM
to mongod...@googlegroups.com
Hi Elif/Martinus,

I am running into the same issue as you both, even though I have the collection 'in' and the database 'test' created in mongos.

hduser@ip-10-252-31-236:/usr/lib/hadoop-0.20$ bin/hadoop jar WordCount.jar WordCount
Conf: Configuration: core-default.xml, core-site.xml
12/03/15 20:33:00 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
12/03/15 20:33:00 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/03/15 20:33:01 INFO util.MongoSplitter:  Calculate Splits Code ... Use Shards? false, Use Chunks? true; Collection Sharded? false
12/03/15 20:33:01 INFO util.MongoSplitter: Creation of Input Splits is enabled.
12/03/15 20:33:01 INFO util.MongoSplitter: Using Unsharded Split mode (Calculating multiple splits though)
12/03/15 20:33:01 INFO util.MongoSplitter: Calculating unsharded input splits on namespace 'test.in' with Split Key '{ "_id" : 1}' and a split size of '8'mb per
12/03/15 20:33:01 INFO mapred.JobClient: Cleaning up the staging area hdfs://master:54310/app/hadoop/tmp/mapred/staging/hduser/.staging/job_201203151831_0003
Exception in thread "main" java.lang.IllegalArgumentException: Error calculating splits: { "serverUsed" : "ec2-50-112-19-33.us-west-2.compute.amazonaws.com:27017" , "$err" : "unrecognized command: splitVector" , "code" : 13390}
        at com.mongodb.hadoop.util.MongoSplitter.calculateUnshardedSplits(MongoSplitter.java:104)
        at com.mongodb.hadoop.util.MongoSplitter.calculateSplits(MongoSplitter.java:75)
        at com.mongodb.hadoop.MongoInputFormat.getSplits(MongoInputFormat.java:51)
        at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:961)
        at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:476)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:506)
        at WordCount.main(WordCount.java:97)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:197)

Any ideas on what I am doing wrong?


Brendan W. McAdams

Mar 15, 2012, 5:24:52 PM
to mongod...@googlegroups.com
Are you running in a sharded cluster?

If so, mongos does not currently expose the splitVector command needed to split an unsharded collection.

For now, you'll need to either run with no splits, or point hadoop directly at the mongod which is the primary for that collection.

I am working on a workaround for this issue, and a ticket is open to expose splitVector through mongos.
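
In the meantime, turning off split creation is a single configuration property (the name below comes from MongoConfigUtil in the current snapshot; check the README in case it changes):

    import org.apache.hadoop.conf.Configuration;

    import com.mongodb.hadoop.util.MongoConfigUtil;

    public class NoSplitsSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // The mongos address is a placeholder; pointing directly at the
            // primary mongod for the collection also works, as noted above.
            MongoConfigUtil.setInputURI(conf, "mongodb://mongos1.example.com/test.in");
            // Skip split calculation; the whole collection becomes a single split.
            conf.setBoolean("mongo.input.split.create_input_splits", false);
        }
    }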
