I want to contribute as part of my thesis, I have some questions about where to start.

Ricardo Amendoeira

Oct 27, 2016, 2:44:02 PM
to mongodb-dev

Hello,


First of all let me thank you for reading and hopefully helping me out! :)

My name is Ricardo Amendoeira. I'm an Electrical Engineering student from Portugal, and for my EE Master's thesis I'll be trying to add a different way of sharding geospatial data in MongoDB, based on Voronoi diagrams.

First I have some more general questions:

1) Which commit/tag should I base my work on? The latest commit on the master branch, some recent release, or something else?

2) What sort of things should I be careful with in order to increase the chances that my final work will be accepted? (Besides updating/creating tests and commenting my code)

3) I'm getting a bit familiar with the file structure and organization of the repo but is there some documentation for contributors with this information?

4) Any other general tips for contributing? This will probably be my first contribution to an open-source project.


For the more specific questions I should probably give some more detail about how the idea is supposed to work: the db admin selects a virtual coordinate location for each sharding cluster. Each piece of geospatial data is then inserted into the cluster that is "closest" to it, based on the cluster's virtual coordinate.
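To make the assignment rule concrete, here is a minimal Python sketch of nearest-site (Voronoi) routing. It is only an illustration under stated assumptions: the shard names, the site coordinates, and the helper names `haversine_km` and `nearest_shard` are hypothetical and are not part of MongoDB, and great-circle distance is assumed as the meaning of "closest".

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used for great-circle distance


def haversine_km(a, b):
    """Great-circle distance in km between two (longitude, latitude) pairs in degrees."""
    lon1, lat1 = map(math.radians, a)
    lon2, lat2 = map(math.radians, b)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def nearest_shard(point, shard_sites):
    """Return the name of the shard whose virtual coordinate is closest to `point`.

    Ties are broken by shard name so the result is deterministic.
    """
    return min(shard_sites, key=lambda name: (haversine_km(point, shard_sites[name]), name))


# Hypothetical virtual coordinates chosen by the admin, as (longitude, latitude).
SHARD_SITES = {
    "shard-eu": (-9.14, 38.72),    # near Lisbon
    "shard-us": (-74.01, 40.71),   # near New York
    "shard-asia": (139.69, 35.69), # near Tokyo
}

porto = (-8.61, 41.15)
boston = (-71.06, 42.36)
print(nearest_shard(porto, SHARD_SITES))   # shard-eu
print(nearest_shard(boston, SHARD_SITES))  # shard-us
```

The set of points routed to one shard is exactly that shard's Voronoi cell, so a region query only needs to contact the shards whose cells intersect the region.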

This is a more efficient way of sharding geospatial data, since it allows queries to hit fewer servers when searching for data in a certain region. It is also more flexible than the method currently used by MongoDB (Quad-Tree) in how it allows the space to be divided among clusters. Source: the attached document, which is an investigation into different geo sharding methods and the reason for my thesis.


So, my more specific questions:

5) I read on some third-party sources that MongoDB supports sharding by geolocation but the documentation says otherwise. Is it supported?

6) My current idea for implementing this is to use Shard Tags: I'll add support for 2dsphere sharding and tagging, so that the user can tag each cluster with a 2dsphere coordinate of their choice. Is there a problem with this approach, or a better way to do it?

7) As far as I understand, only mongos servers need to know about sharding keys, so my first step should be to drill down into the sh.shardCollection() command and modify the relevant files to accept 2dsphere coordinates as sharding keys, correct? Are there other components that this would significantly affect and that I should look into?

8) After that's done I think my next steps are to:

a) create a new command like sh.addTagGeoRange() (new command because the Tag behavior will be different)

b) modify commands related to queries/inserts/updates/deletes to behave according to the distance of the data to the virtual locations of the sharding clusters.

    Any issues with this plan?


Thank you, I hope to make a valuable contribution to the project! :)

Ricardo Amendoeira
 




main.pdf

Asya Kamsky

Nov 11, 2016, 6:21:52 AM
to mongo...@googlegroups.com
Ricardo:

Thank you for your interest in contributing to MongoDB.

The first thing would be to read the information here: https://github.com/mongodb/mongo/blob/master/CONTRIBUTING.rst and https://github.com/mongodb/mongo/wiki. There are links there to write-ups on coding style, how to write tests, how to build the server, and more.

The specific work you propose would address Jira ticket https://jira.mongodb.org/browse/SERVER-1982. You are correct that sharding on a 2dsphere field/index is currently not allowed. You can ask specific questions about preferred implementation details in the ticket or on this list.

You should keep in mind that no matter how you define shard key partitioning, GeoJSON objects can span multiple "ranges" (when they have multiple points), and in MongoDB each document must be assigned to exactly one shard (which shard that is can change, but an operation targeted to a single shard key value must be routed to exactly one shard).

Another way of saying this: whether the shard key GeoJSON object is a point or an object that contains multiple points (for instance, a polygon), it must always be deterministically associated with the one shard that "owns" the appropriate range of shard keys. There is no mechanism in MongoDB for the same document to have its "home" on more than one shard.
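The multi-point problem can be illustrated with a small Python sketch. It assumes a nearest-site (Voronoi) partitioning with two hypothetical shard sites and checks which cells the vertices of a GeoJSON geometry fall into. The names `cells_touched`, `nearest_site`, and the shard names are made up for illustration and are not MongoDB APIs. Looking only at vertices is a simplification, since an edge can also cross a cell that contains none of the vertices.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (longitude, latitude) pairs in degrees."""
    lon1, lat1 = map(math.radians, a)
    lon2, lat2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(min(1.0, h)))


def nearest_site(point, sites):
    """Name of the site closest to `point`; ties broken by name for determinism."""
    return min(sites, key=lambda name: (haversine_km(point, sites[name]), name))


def cells_touched(geometry, sites):
    """Set of site names that are nearest to at least one vertex of a GeoJSON geometry."""
    if geometry["type"] == "Point":
        vertices = [geometry["coordinates"]]
    elif geometry["type"] == "Polygon":
        # A GeoJSON Polygon is a list of linear rings, each a list of positions.
        vertices = [v for ring in geometry["coordinates"] for v in ring]
    else:
        raise ValueError("unsupported geometry type: " + geometry["type"])
    return {nearest_site(tuple(v), sites) for v in vertices}


# Hypothetical shard sites as (longitude, latitude).
SITES = {"shard-eu": (-9.14, 38.72), "shard-us": (-74.01, 40.71)}

point = {"type": "Point", "coordinates": [-8.61, 41.15]}
polygon = {
    "type": "Polygon",
    "coordinates": [[[-9.0, 38.0], [-74.0, 40.0], [-40.0, 60.0], [-9.0, 38.0]]],
}

print(cells_touched(point, SITES))         # a single shard
print(len(cells_touched(polygon, SITES)))  # more than one shard
```

A point always maps to exactly one cell, while the polygon above touches both, so any scheme along these lines still needs a deterministic rule that picks a single owning shard for such a document.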

I'm not an expert in geospatial state-of-the-art research, but if you can solve this issue, that would be fantastic.

Wishing you the best,
Asya Kamsky



