I'd like to propose a Grid De-Duplication standard.
If it sounds good, I can write a working version for Ruby.
Since MongoDB's use case includes massive scaling, reducing
duplication in Grid is valuable. It should be not that difficult to
handle at the driver level.
The benefit is that, to developers, there is no change in how they use
the library but improvements in performance and less disk space usage.
Without a standard, there will be no adoption in the drivers and also,
likely, no adoption across languages. If it's in the drivers, it would
be easy for libraries to take advantage of de-duplication.
The proposal is this:
1. De-duplication happens at the chunks level
2. As each chunk is uploaded, a 128-bit (16 byte) hash is calculated
3. The 16 byte hash is used as the ID for the chunk (either in binary
format or as a Mongo ID)
Some points:
1. Using a 128-bit hash, the chance of a hash collision assuming a
file system with 1-trillion chunks is 1 in a trillion. See probability
table in birthday problem wikipedia entry:
http://en.wikipedia.org/wiki/Birthday_problem#Probability_table
2. Reads would be the same.
3. Inserts would need to ignore duplicate chunks.
4. Deletes need to make sure that the chunks aren't being used by
other files before deleting them.
5. It would require an index on the 'chunks' key of the files
collection (required for deletions).
6. Need to decide whether we want to handle the 1 in a trillion chance
of the hash collision or just ignore it.
7. I did not de-duplicate the files because we may have the same file
but with different meta-data. This could be useful if, for example, we
use meta-data to store who owns a given file.
Like I said, I can do a Ruby implementation if the standard sounds
good.
The problem is that this cannot be implemented without going into the
driver. In other words, I can't write this above the driver. It would
have to be in the driver.
In terms of experience, I've written a library that uses MongoDB. I am
preparing to release it but includes some neat ideas like arbitrary
joins, infinitely nestable validations, ability to insert files and
images directly into a document (which are stored as references to
files in GridFS). I've also found and fixed several driver bugs. I
just mention this in case you are curious about my credentials on
messing with the Ruby driver.
Sunny Hirai
CEO, MeZine Inc.