GridFS, inserts/deletes and database size control

Jerome

Mar 13, 2012, 4:47:10 AM
to mongod...@googlegroups.com
Hi,

This is my first try at Mongo and I'm running into all sorts of problems I can't easily figure out.

Here is what we're trying to achieve:
We need to store files (mostly between 3 and 10 MB, some up to 300 MB) along with their associated metadata, to present them on a small web site. Users can query them easily, and a preview is embedded with the metadata, avoiding retrieval of the complete raw file.
The metadata (after conversion to BSON) has an estimated size of about 0.15% of the raw file size.
File storage must act as a queue, meaning that retention varies per file based on its category and status (whether anyone has viewed the file through the web page).
When running out of space (we want to maximize retention, using as much space as is available on the server), we need to delete files according to their status and category priority.
For now, the system runs on a single node (no sharding, no replication), and neither feature is expected in the near future. The target system size is around 8 TB, i.e. something like ~100k documents. This should be reasonable because the whole metadata collection would fit in memory, allowing fast queries.
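For reference, a files entry would look roughly like this (the metadata fields here are just an illustration, not our exact schema):

    // Shape of an fs.files document as the shell shows it:
    db.fs.files.findOne()
    {
        "_id" : ObjectId("4f5f0a1e8a7b4c2d3e4f5a6b"),
        "length" : 5242880,            // raw file size in bytes
        "chunkSize" : 262144,          // 2.0 default chunk size (256 KB)
        "uploadDate" : ISODate("2012-03-13T09:47:10Z"),
        "md5" : "9e107d9d372bb6826bd81d3542a419d6",
        "filename" : "sample.bin",
        "metadata" : { "category" : "A", "priority" : 2, "viewed" : false }
    }

    // Metadata queries stay fast as long as fs.files fits in RAM, e.g.:
    db.fs.files.find({ "metadata.category" : "A", "metadata.viewed" : false })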

Toying around with GridFS, I've had a lot of pain trying to constrain the size of the database as a whole. To achieve the above scenario, I've stored a JS function that gets called before I insert a file into the db; the function takes a number of chunks to free and removes files accordingly. Although there seems to be enough free space (db.fs.chunks.validate() shows enough deletedSpace), the server keeps allocating new files on disk. That seems pretty strange to me because, with GridFS, every chunk document should be the same size, so no fragmentation should happen, right? I've tried turning off prealloc and creating the collections with a preset size, but neither works (i.e. the server keeps allocating).
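For reference, here is a simplified sketch of the stored function (the eviction order and the metadata field names are illustrative, not our exact schema):

    // Stored in db.system.js, then called via db.eval("freeChunks(n)").
    // Eviction order: already-viewed files first, then lowest category
    // priority, then oldest upload date.
    db.system.js.save({
        _id: "freeChunks",
        value: function (chunksNeeded) {
            var freed = 0;
            db.fs.files.find()
                .sort({ "metadata.viewed": -1,   // viewed files first
                        "metadata.priority": 1,  // low priority first
                        uploadDate: 1 })         // oldest first
                .forEach(function (f) {
                    if (freed >= chunksNeeded) return;  // freed enough, skip the rest
                    freed += Math.ceil(f.length / f.chunkSize);
                    db.fs.chunks.remove({ files_id: f._id });
                    db.fs.files.remove({ _id: f._id });
                });
            return freed;
        }
    });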

Someone else seems to be experiencing problems with this: https://groups.google.com/forum/?fromgroups#!topic/mongodb-user/adpfShPx23w and it leads to this bug: https://jira.mongodb.org/browse/SERVER-2958

Although I could store the files directly on disk instead (which would be less convenient), I fear the problem would be even harder with the metadata, because each document has a different size. Inserts/deletes over time will lead to fragmentation, and I won't be able to restrain the database size (leading to an annoying scenario where Mongo blocks all write operations and you have to restart it by hand) :(

Capped collections seemed a good choice at first but, as we can't delete from them, documents expire in natural order, which doesn't match some of our constraints (delete already-viewed docs first).
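Just to illustrate what I mean (collection name and size are made up):

    // A capped collection bounds the size, but documents can only age
    // out in insertion (natural) order and explicit removes are not
    // allowed, so "delete viewed files first" is impossible:
    db.createCollection("files", { capped: true, size: 8 * 1024 * 1024 * 1024 })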

So right now, I'm worried about the long-term manageability of the database: being unable to control the database size will lead to problems and maintenance burdens. Searching the web, there seem to be few to no questions about this. Am I trying something crazy?

Thanks for your help

Jérôme

Kevin Matulef

Mar 13, 2012, 11:37:41 AM
to mongod...@googlegroups.com
Hi Jerome,

What version of mongo are you using?  

As you point out, GridFS chunks are fixed size, so in theory mongo should be able to reuse the space freed by deleted files without much fragmentation. For the metadata, fragmentation is more likely, but it can be mitigated by running the "compact" command on the affected collections periodically.
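For example, something along these lines should work on 2.0 (the collection names assume the default GridFS prefix; note that compact blocks operations on the database while it runs, so schedule it for a quiet period):

    db.runCommand({ compact: "fs.chunks" })
    db.runCommand({ compact: "fs.files" })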

Can you try running "compact" and see if that helps?  

-Kevin

Jerome

Mar 13, 2012, 2:48:52 PM
to mongod...@googlegroups.com
Hi,


> What version of mongo are you using?

Mongo 2.0.0, running on 64-bit Debian Squeeze.

> As you point out, GridFS chunks are fixed size, so in theory mongo should be able to reuse the space freed by deleted files without much fragmentation. For the metadata, fragmentation is more likely, but it can be mitigated by running the "compact" command on the affected collections periodically.
>
> Can you try running "compact" and see if that helps?

I've tried "compact", but that leads to another problem: compact may need an extra extent in order to complete. In my case that would be OK, if it weren't for this bug :(
https://jira.mongodb.org/browse/SERVER-3791
I'll have to wait for 2.2.0 to get both fixes (I hope).

Testing with larger collections today, I finally got the thing working, although in a somewhat degraded mode. I turned off prealloc and pre-sized every collection. Every time I need to insert a new file, I run a stored function on Mongo that frees up 2 GB + sizeof(file_to_insert). This way, the database size is stable and no extra "database.x" files keep spawning.
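Roughly, the setup looks like this (a simplified sketch; the sizes are examples, and 256 KB is the 2.0 default chunk size):

    // mongod is started without data file preallocation:
    //     mongod --noprealloc ...
    // and the GridFS collections are created pre-sized, before first use:
    db.createCollection("fs.chunks", { size: 500 * 1024 * 1024 * 1024 })
    db.createCollection("fs.files",  { size: 1024 * 1024 * 1024 })

    // Before each insert, free 2 GB of headroom plus the incoming file,
    // reusing the freeChunks function from my first message
    // (fileSize is the size in bytes of the file about to be inserted):
    var headroomBytes = 2 * 1024 * 1024 * 1024 + fileSize;
    var chunksToFree = Math.ceil(headroomBytes / (256 * 1024));
    db.eval("return freeChunks(" + chunksToFree + ");");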

My guess (to be confirmed) is that I had to keep a larger margin of free space in order to avoid fragmentation (filesystems tend to behave the same way, AFAIK).

Anyway, maxing out the total database capacity is going to be tricky; capped GridFS collections (or at least the ability to set a maxSize on non-capped collections) would be the answer to that problem.