GridFS orphan chunk cleanup

Mini

Apr 7, 2011, 6:08:00 AM

to mongodb-user

Hello,

I'm using pymongo (1.10) and trying to store various files in
GridFS.

From some tests, I found that when GridFS.put or GridIn.write writes
data, it adds the data to fs.chunks first, then adds the entry to
fs.files after it finishes writing all the data.

Generally I like this approach, as a half-written file is not shown.

But in an error case, when a process that is writing data to GridFS
dies for any reason, some orphan fs.chunks documents are left behind:
they have {files_id: ObjectId(..)}, but that files_id is not in
fs.files.

The orphan documents just consume disk and would not harm my
application. But I might have to clean them through a management
script.

Is there any way to find the orphan documents efficiently? The
following management script is getting slower.

-------------
# Remove every chunk whose files_id has no matching fs.files document.
for chunk in fs.chunks.find():
    if not fs.files.find_one({'_id': chunk['files_id']}):
        fs.chunks.remove(chunk['_id'])
------------

Thanks

Nat

Apr 7, 2011, 8:57:21 AM

to mongodb-user

Right now, that is probably the way to clean up orphan chunks. You can
probably optimize it a bit. If you clean them up regularly, you can
search only for chunks that were written recently, but not too
recently, since very new chunks may belong to an upload whose fs.files
entry hasn't been written yet.

You can vote for http://jira.mongodb.org/browse/SERVER-858 for a
utility to do so.
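
One way to narrow the scan by age, as a rough sketch: GridFS chunks carry
no date field, but if the chunk _ids are driver-generated ObjectIds, the
first four bytes hold the creation time in seconds, so an _id range can
stand in for a date range. The function names and the make_id parameter
below are made up for illustration, and the timestamp comes from the
client's clock, so treat the window as approximate.

```python
import calendar
import struct
from datetime import datetime, timedelta


def objectid_hex_at(moment):
    """Hex form of the smallest ObjectId stamped with `moment` (naive UTC)."""
    seconds = calendar.timegm(moment.utctimetuple())
    # 4-byte big-endian timestamp followed by 8 zero bytes.
    return struct.pack(">I", seconds).hex() + "00" * 8


def chunk_age_filter(now, min_age, max_age, make_id=str):
    """Filter for chunks created between now - max_age and now - min_age.

    make_id converts the 24-character hex string into the _id type stored
    in the collection (hypothetical hook; pass bson.ObjectId with pymongo).
    """
    return {
        "_id": {
            "$gte": make_id(objectid_hex_at(now - max_age)),
            "$lt": make_id(objectid_hex_at(now - min_age)),
        }
    }
```

With pymongo this could be used as
fs.chunks.find(chunk_age_filter(datetime.utcnow(), timedelta(hours=1),
timedelta(days=7), make_id=ObjectId)), so uploads that started within the
last hour are left alone.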

D Boyd

Apr 11, 2011, 10:52:31 AM

to mongodb-user

If you reverse the query order, you can do this with two queries.
First, query fs.files and build an array of all file ids.
Then query fs.chunks for documents whose files_id is not in that array.
Finally, do a remove with the list of _ids returned
from the chunk query.

And like the other responder said, you can use
the date to further narrow the queries.

Just make sure you have indexes on whatever
you are querying against.
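
The two-query approach could look roughly like the sketch below. It is
only an illustration: the function name is made up, files and chunks are
assumed to be the fs.files and fs.chunks collection objects, and
delete_many is the PyMongo 3+ call (on 1.10 you would use remove
instead). With a very large fs.files the id array may get too big to
send in one query, so you might need to combine this with a date window.

```python
def remove_orphan_chunks(files, chunks):
    """Delete chunks whose files_id has no fs.files document.

    Returns the number of orphan chunks that were found and deleted.
    """
    # Query 1: collect the _id of every document in fs.files.
    file_ids = [doc["_id"] for doc in files.find({}, {"_id": 1})]

    # Query 2: find chunks that point at a files_id not in that list.
    orphan_chunk_ids = [
        doc["_id"]
        for doc in chunks.find({"files_id": {"$nin": file_ids}}, {"_id": 1})
    ]

    # Remove the orphans in one call, keyed by their own _id.
    if orphan_chunk_ids:
        chunks.delete_many({"_id": {"$in": orphan_chunk_ids}})
    return len(orphan_chunk_ids)
```

Called as remove_orphan_chunks(db.fs.files, db.fs.chunks). Note that it
will also delete chunks of an upload that is still in progress, so it is
safest to run it when nothing is writing to GridFS.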
