MongoDB storage driver for Apache Tajo

Janaka Thilakarathna

Apr 29, 2016, 6:12:59 AM

to mongodb-dev

Hi everyone,

I am Janaka, a student selected for Google Summer of Code 2016. My project is to develop a MongoDB storage plugin for Apache Tajo (a big data warehouse). I am going to use the MongoDB Java driver for this module. I am currently in the community bonding period and will start coding after 23rd May.

If you want more information about my project, please refer to these links.

Project Link: https://summerofcode.withgoogle.com/projects/#4541666853650432
Proposal: https://cwiki.apache.org/confluence/display/TAJO/Add+MongoDB+to+Tajo+Storage+-+Proposal
Jira Issue: https://issues.apache.org/jira/browse/TAJO-2079?filter=12334770

I am starting this topic because it would be really great if any of you could give me some advice, especially any material to read on mapping a document-based NoSQL database to a column-based one.

Further, I found that there is a MongoDB connector for Hadoop. I think it has the functionality I am trying to understand. I would be glad if you could give me some help with this project.

Thanks!

Regards,

Janaka.

Luke Lovett

May 4, 2016, 1:31:03 PM

to mongodb-dev

Hi Janaka,

Sounds like an exciting project!

I'm the maintainer of the MongoDB Hadoop connector. I can tell you that the connector does have to map some MongoDB-specific types onto types that are native to another system (e.g. Hive and Pig), which is likely one problem that you'll have to solve when integrating MongoDB with Tajo. You can see how the connector deals with transforming these types here: https://github.com/mongodb/mongo-hadoop/wiki/Hive-Usage#serialization-and-deserialization. In this case, Hive already has some notion of "nested types," so the transformation here is fairly straightforward.

I have no experience with Tajo whatsoever, so I can't give much advice on that front. However, you might look into what other storage engines for Tajo do, when their source/sink is a non-relational data source. Do the storage engines allow the user to configure every detail of how data transformations are applied (e.g. do they require that users declare fields/types in advance)? What kinds of assumptions do the storage engines make (e.g. do they assume that every document in a collection looks roughly the same)?
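To make those two questions concrete, here is a minimal sketch of the "declare fields/types in advance" approach, written in Python with plain dicts standing in for MongoDB documents. Everything in it is an assumption for illustration: the `SCHEMA` layout, the column names, and the `lookup` and `to_row` helpers are hypothetical and are not part of Tajo, the Hadoop connector, or any MongoDB driver.

```python
# Hypothetical column declarations supplied by the user in advance:
# column name -> (dotted path into the document, expected Python type)
SCHEMA = {
    "name": ("name", str),
    "city": ("address.city", str),
    "age": ("age", int),
}


def lookup(doc, path):
    """Follow a dotted path through nested dicts; return None if any step is missing."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def to_row(doc, schema):
    """Flatten one document into a row with exactly the declared columns."""
    row = {}
    for column, (path, expected_type) in schema.items():
        value = lookup(doc, path)
        # A missing field or a value of an unexpected type becomes NULL (None)
        # instead of failing the whole scan.
        row[column] = value if isinstance(value, expected_type) else None
    return row


docs = [
    {"name": "Ada", "address": {"city": "London"}, "age": 36},
    {"name": "Bob", "age": "unknown"},  # no address, and age has a different type
]
rows = [to_row(d, SCHEMA) for d in docs]
```

The second document shows why the "every document looks roughly the same" assumption matters: with a declared schema, documents that deviate still produce a row, but the deviating fields are silently nulled, which may or may not be what users expect.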

MongoDB also has a "BI" (Business Intelligence) connector, which has to map MongoDB documents onto a relational structure. Perhaps the documentation for this will give some inspiration or guidance: https://docs.mongodb.org/bi-connector/schema-configuration/.

Best of luck on your project! It sounds like a lot of fun. Please feel free to ask for help anytime.

Luke

Janaka Thilakarathna

May 9, 2016, 4:32:40 AM

to mongodb-dev

Hi Luke,

It's really great to meet you. Thanks for the reply, and I am really sorry for being so late to respond.

I am still studying the Tajo storage structure and haven't yet had a chance to go through the Mongo-Hadoop connector. I think it will be interesting.

Thanks for the help. Over the next few days, I will go through the documentation you have provided.
