Adding a TTL index to GridFS files collection


Mark Horstmeier

Jan 29, 2013, 3:37:02 PM
to mongod...@googlegroups.com
Is uploadDate set by the client (Perl in my case) or is it created by default?
Documents in the files collection contain some or all of the following fields. Applications may create additional arbitrary fields:
files.uploadDate

The date the document was first stored by GridFS. This value has the Date type.

If I add a TTL index on uploadDate:
db.files.ensureIndex({"uploadDate" : 1},{expireAfterSeconds : 3600})

That document will be eventually reaped by a background process, right?

From what I currently read, nothing will happen to the chunks collection, but without the document in db.files the file will be effectively unreachable.

If I then run a periodic process on db.chunks that checks for files_id values that no longer exist as an _id in db.files, I can manually remove the chunks for any orphans (excluding very recent additions that might still be uploading and not have an entry in db.files yet).
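
Something like this shell sketch is what I have in mind for the sweep (collection names follow what I've been using here rather than the stock fs.files / fs.chunks, and the ten-minute grace period is just a placeholder):

// periodic orphan sweep: drop chunks whose parent files document is gone
var cutoff = new Date(Date.now() - 10 * 60 * 1000);    // grace period for in-flight uploads
db.chunks.distinct("files_id").forEach(function (fid) {
    if (fid.getTimestamp() > cutoff) return;            // skip uploads that may still be in progress
    if (db.files.count({ _id: fid }) === 0) {
        db.chunks.remove({ files_id: fid });             // parent was reaped, remove the orphans
    }
});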

Aside from the clunkiness, is there anything broken about implementing short-term file storage in this manner?

If I can treat db.files as a normal collection, is there anything that would keep me from creating db.files as a capped collection?  With that, I suppose that I could also use a tailable cursor on the oplog and let the delete operation on db.files trigger a clean-up of the associated db.chunks documents...
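
Roughly what I'm picturing for the oplog tailing piece, assuming a replica set (so the oplog is local.oplog.rs) and a made-up "mydb" namespace; a real worker would re-open the tailable cursor whenever it dies:

// tail the oplog for deletes on the files collection and purge the chunks
var oplog = db.getSiblingDB("local").oplog.rs;
var cur = oplog.find({ op: "d", ns: "mydb.files" })
               .addOption(DBQuery.Option.tailable)
               .addOption(DBQuery.Option.awaitData);
while (cur.hasNext()) {
    var entry = cur.next();                        // entry.o._id is the deleted file's _id
    db.chunks.remove({ files_id: entry.o._id });
}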

Ronald Stalder

Jan 29, 2013, 11:03:02 PM
to mongod...@googlegroups.com
Sorry, Mark, aren't you overcomplicating this a bit? Why not just have a periodically run background job that does a find() for expired db.files documents (uploadDate older than 3600 seconds ago) and then remove them together with the chunks that reference them?
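
Something along these lines, using your collection names; the one-hour window matches your expireAfterSeconds value:

// periodic job: remove expired files documents together with their chunks
var expired = new Date(Date.now() - 3600 * 1000);
db.files.find({ uploadDate: { $lt: expired } }, { _id: 1 }).forEach(function (doc) {
    db.files.remove({ _id: doc._id });         // remove the metadata document first
    db.chunks.remove({ files_id: doc._id });   // then its chunks
});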

Wojons Tech

Jan 30, 2013, 8:12:41 PM
to mongod...@googlegroups.com
I have to agree with Ronald on this one, because you need to think about the files and their chunks being deleted. Depending on the size and number of the files, you're going to have to delete a lot of things every minute because it's a background job, and that is time that you will not be serving content from your cluster. And knowing MongoDB, it is going to try to delete them as close to one another as it can, and if you have more deletes than can be done in one minute, that could be another issue. You're better off doing your deletes during your low-traffic periods, having them queued up and deleted slowly over some period of time.

Mark Horstmeier

Jan 31, 2013, 12:16:02 PM
to mongod...@googlegroups.com
If the MongoDB TTL index can't handle the delete queue then it's functionally useless.

The TTL index appears to offer an alternative to doing a time-relative query and delete (assuming that the TTL index is faster because it is an internal MongoDB operation).  I would rather do a straight _id lookup on the collections and handle the sorting in my code (not that I'm saying I could do it faster, but processing the deletes individually allows me to adjust to traffic if my response times start to increase).  In 1.8, at least, I found this was a more effective way to delete millions of documents without noticeably impacting the performance of regular operations.

Additionally, I'm looking at this as an opportunity to exercise TTL and tailable cursors on a low-profile, low-impact feature.  The GridFS store is only a prototype, and I can afford to fiddle and even throw it away if it doesn't meet my needs.  I have production needs that could benefit from TTL and tailable cursors (an on-data-change event would be a nice alternative to polling).

So I admit that this is a somewhat complexified approach to my problem, but I have an ulterior motive that coincides with a feature that I am developing.


Wojons Tech

Jan 31, 2013, 1:58:36 PM
to mongod...@googlegroups.com

It's not that MongoDB can't handle the deletes, it's that you're going to end up with a high lock rate if you let Mongo handle the deletes and you have a busy server, or you were busy and now it's time to delete things. Just as you said, it being internal means it will be faster, but would you rather be able to serve reads and add new files, or make sure old files are deleted on time? I recommend making sure to use memcache for the simple lookups on this when you can, instead of lots of queries that will then have to serve files. Also, one thing to think about with TTL is that it is not perfect at deleting on time, depending on how busy the server is and on scheduling issues.

http://docs.mongodb.org/manual/tutorial/expire-data/

Note

TTL indexes expire data by removing documents in a background task that runs once a minute. As a result, the TTL index provides no guarantees that expired documents will not exist in the collection. Consider that:

  • Documents may remain in a collection after they expire and before the background process runs.
  • The duration of the removal operations depends on the workload of your mongod instance.

Don't get me wrong, you're free to try, and it can make some of your coding life a lot easier to handle, but unless you have done lots of testing for your case I would not run it in production.

Mark Horstmeier

Jan 31, 2013, 2:57:32 PM
to mongod...@googlegroups.com
For these purposes, I'm okay with a timeframe of a few minutes to eventual deletion.  I do use memcache for my regular operations.

So for GridFS, db.files should just be one document per file.  A TTL on that collection should be dead simple to process.  db.chunks could have many documents, so that is where I would have more concern, but I'm not leaving that up to the TTL process, so I will have more control over backing off if my lock rate is high.
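
For the chunks side, I'm picturing something along these lines, where the queue-length threshold and the sleep are placeholders I would tune, and purgeIds is a list of files_id values my own code has already decided are expired:

// throttled chunk cleanup: back off when the global lock queue builds up
purgeIds.forEach(function (fid) {
    var q = db.serverStatus().globalLock.currentQueue;
    if (q && q.total > 50) sleep(1000);        // crude backoff under load
    db.chunks.remove({ files_id: fid });
});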

Sam Millman

Jan 31, 2013, 4:55:50 PM
to mongod...@googlegroups.com
"From what I currently read, nothing will happen to the chunks collection, but without the document in db.files the file will be effectively unreachable"

Exactly.  A TTL index cannot be applied to a GridFS collection in a way that takes its chunks into consideration (I believe).  The mongod has no knowledge of GridFS; it is an application (driver) convention, not something maintained DB-side.
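
Since fs.files and fs.chunks are just ordinary collections, the TTL index from the first post works the same way against the stock naming; only the namespace changes:

// same index as above, against the default "fs" prefix
db.fs.files.ensureIndex({ uploadDate: 1 }, { expireAfterSeconds: 3600 })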


"If I then run a periodic process on db.chunks that checks for _id that don't exist in db.files, I can manually remove the chunks for any orphans (excluding very recent additions that might be uploading and not have an entry in db.files yet)."

Yes, this is actually a good method; imagine if your main server handling these deletes goes down and this is a user file.

MongoDB will not see the file, meaning that you can maintain compliance with certain laws on handling user data even while your cronjob is down (if the DB itself is down then screw it, tbh).

You have got to consider, however, that the chunks collection can get huge, and when I say huge I mean huge. A chunk by default in most drivers is about 256KB, so even a single 100MB file is already a few hundred chunks, and with any real volume of files you will soon find you are iterating the ids of millions of rows.
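
One thing that helps keep that sweep from scanning every chunk document: most drivers (I believe) already maintain an index on { files_id: 1, n: 1 } in the chunks collection, so walking the distinct files_id values means one entry per stored file rather than one per 256KB chunk, e.g.:

db.chunks.distinct("files_id").length   // one entry per stored file
db.chunks.count()                       // versus one document per chunk

(With millions of files you would want to walk that index in ranges instead, since a distinct result has to fit in a single 16MB reply.)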


"If I can treat files.db as a normal collection, is there anything that would keep me from creating db.files as a capped collection?  With that, I suppose that I could also use a tailable cursor on the oplog and let the delete operation on db.files trigger a clean-up of associated db.chunk files..."

I am unsure, but I would bet that there is a problem with making the main files collection capped. Instead, what you could do is make a message queue out of a capped collection and use that side by side with your GridFS.
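
A rough sketch of what I mean, with made-up names and sizes (deletedFileId stands in for whatever _id your delete path just removed):

// a small capped collection acting as a delete queue next to GridFS
db.createCollection("gridfs_delete_queue", { capped: true, size: 10 * 1024 * 1024 })

// producer side: record the _id whenever a files document is removed
db.gridfs_delete_queue.insert({ files_id: deletedFileId, ts: new Date() })

// consumer side: tail the queue and clean up the orphaned chunks
var cur = db.gridfs_delete_queue.find()
            .addOption(DBQuery.Option.tailable)
            .addOption(DBQuery.Option.awaitData);
while (cur.hasNext()) {
    db.chunks.remove({ files_id: cur.next().files_id });
}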