Grid De-Duplication Standard

20 views
Skip to first unread message

Sunny Hirai

unread,
Jun 11, 2010, 5:10:50 PM6/11/10
to mongodb-user
I'd like to propose a Grid De-Duplication standard.

If it sounds good, I can write a working version for Ruby.

Since MongoDB's use case includes massive scaling, reducing
duplication in Grid is valuable. It should be not that difficult to
handle at the driver level.

The benefit is that, to developers, there is no change in how they use
the library but improvements in performance and less disk space usage.

Without a standard, there will be no adoption in the drivers and also,
likely, no adoption across languages. If it's in the drivers, it would
be easy for libraries to take advantage of de-duplication.

The proposal is this:

1. De-duplication happens at the chunks level
2. As each chunk is uploaded, a 128-bit (16 byte) hash is calculated
3. The 16 byte hash is used as the ID for the chunk (either in binary
format or as a Mongo ID)

Some points:

1. Using a 128-bit hash, the chance of a hash collision assuming a
file system with 1-trillion chunks is 1 in a trillion. See probability
table in birthday problem wikipedia entry:
http://en.wikipedia.org/wiki/Birthday_problem#Probability_table
2. Reads would be the same.
3. Inserts would need to ignore duplicate chunks.
4. Deletes need to make sure that the chunks aren't being used by
other files before deleting them.
5. It would require an index on the 'chunks' key of the files
collection (required for deletions).
6. Need to decide whether we want to handle the 1 in a trillion chance
of the hash collision or just ignore it.
7. I did not de-duplicate the files because we may have the same file
but with different meta-data. This could be useful if, for example, we
use meta-data to store who owns a given file.

Like I said, I can do a Ruby implementation if the standard sounds
good.

The problem is that this cannot be implemented without going into the
driver. In other words, I can't write this above the driver. It would
have to be in the driver.

In terms of experience, I've written a library that uses MongoDB. I am
preparing to release it but includes some neat ideas like arbitrary
joins, infinitely nestable validations, ability to insert files and
images directly into a document (which are stored as references to
files in GridFS). I've also found and fixed several driver bugs. I
just mention this in case you are curious about my credentials on
messing with the Ruby driver.

Sunny Hirai
CEO, MeZine Inc.

Kyle Banker

unread,
Jun 12, 2010, 10:32:13 AM6/12/10
to mongod...@googlegroups.com
Hi Sunny,

While this would certainly be valuable for some users, I'm not certain it belongs in the driver. It'd be easy to write a thin layer that implements the de-duplication as a sort of plugin. Alternatively, you could maintain a fork of the driver that includes this.

If enough users want it include, then we could certainly go that route.  I've gone ahead and created an issue to track interest:


Kyle


--
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To post to this group, send email to mongod...@googlegroups.com.
To unsubscribe from this group, send email to mongodb-user...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.


Reply all
Reply to author
Forward
0 new messages