GridFS write performance problems on Windows


Sigurd Høgsbro

Sep 3, 2010, 8:08:58 AM
to mongod...@googlegroups.com
Hello,

We're using GridFS for a batch job to transcode from WMA Lossless to FLAC, which requires the use of Windows for the actual conversion (ffmpeg does not yet support the WMA *lossless* codec).

The model is one of a controller running on a Linux server (Ubuntu 8.04.2), talking to mongoDB on the same server. All code is Python (2.5.2 on controller, 2.6 on workers), using pymongo 1.8.1. Worker apps run on Windows XP instances hosted within VirtualBox on Linux servers (Ubuntu 10.04).

We're seeing some issues that I'd welcome feedback on:
  1. GridFS read performance is pretty good (average 10-20MB/sec), but write performance is a very different beast. Writing an 18MB file can take 1 min 45 secs, though it can also complete in around 11 secs.

    We never see such slow writes on the controller task dumping the WMA file into GridFS, so I fear this isn't just triggered by the allocation of another datafile for the database.

  2. We sometimes receive exceptions from the server which I'd like to understand the cause of:
    command SON([('filemd5', ObjectId('...')), ('root', u'worker_flac')]) failed: exception: chunks out of order

  3. When a worker crashes we sometimes end up with files left over in the GridFS collection. When doing GridFS.delete() on such files, immediately followed by a GridFS.new_file() using the same '_id', we get error 11000 (see below). This is probably caused by the missing safe=True on the call to remove the chunks in GridFS.delete().
13:38:12 [conn532]  Caught Assertion in insert , continuing
13:38:12 [conn532] insert transcode.worker_flac.chunks exception 11000 E11000 duplicate key error index: transcode.worker_flac.chunks.$files_id_1_n_1  dup key: { : ObjectId('4c636968acb81b35a1073e59'), : 166 } 1ms

Ideas?

Sigurd

Richard Kreuter

Sep 3, 2010, 10:28:57 AM
to mongod...@googlegroups.com
What version of mongodb are you running for the server?

--
Richard


Mark M

Sep 3, 2010, 10:39:49 AM
to mongodb-user
Are you chunking the writes? I noticed that using larger, rather than
smaller, chunk sizes when writing tended to speed up the write.

Mark






Sigurd Høgsbro

Sep 3, 2010, 10:55:48 AM
to mongodb-user
1.6.2 on 64-bit system

Sigurd Høgsbro

Sep 3, 2010, 10:57:44 AM
to mongodb-user
Yes, I'm sending the file in chunks using the chunk_size exposed on
GridIn class. What did you use instead?

Sigurd

Mark M

Sep 6, 2010, 3:37:30 AM
to mongodb-user
Hi Sigurd,

The chunk size that is implemented in GridIn is the size of the chunks
to be stored in the database; you cannot change that parameter.

I am talking about how you are passing the data to the GridIn.write
function. Are you passing a file-like object to the GridIn.write
function and allowing it to decide how to buffer the data, or are
you looping over it and sending smaller pieces at a time?

If you send it a file-like object then GridIn will handle chunking it.
In other words, you should not have to worry about chunking a large
file yourself, as the driver does it for you.
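
For illustration, the two ways of feeding GridIn.write described above could look roughly like this. This is a minimal sketch in Python 3 syntax (the thread itself used Python 2.5/2.6). It assumes `fs` is a `gridfs.GridFS` instance; `store_stream`, `store_looped` and the 64K `block_size` default are hypothetical names and values chosen for the example, not part of pymongo.

```python
def store_stream(fs, path):
    """Pass the open file to GridIn.write() and let the driver chunk it."""
    grid_in = fs.new_file()
    with open(path, "rb") as source:
        # write() accepts an object with a read() method as well as raw bytes.
        grid_in.write(source)
    grid_in.close()  # close() flushes the remaining data and finalises the file
    return grid_in._id


def store_looped(fs, path, block_size=64 * 1024):
    """Read the file in caller-chosen blocks and write each block separately."""
    grid_in = fs.new_file()
    with open(path, "rb") as source:
        while True:
            block = source.read(block_size)
            if not block:
                break
            grid_in.write(block)
    grid_in.close()
    return grid_in._id
```

Both variants should store the same bytes; the difference is only in who decides how the data is buffered before it reaches the server.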

Sigurd Høgsbro

Sep 6, 2010, 3:50:14 PM
to mongodb-user
I changed it to pass an iterable stream - open(path, 'rb') - but see
no difference so far.

I have noticed a correlation between the slow writes and the eventual
return of the error below from the server. This error occurs on both
the controller and the workers. The occurrences are much less frequent
when the GridFS writer is on the same machine as mongoDB. In fact, it
was when I shifted GridFS to another server that the Linux controller
started giving this error too.

13:01:36 INFO 4c623b9aacb81b35a1006462: read input in 0:00:01.372000 (13.11MB)
13:01:40 INFO 4c623b9aacb81b35a1006462: generated FLAC in 0:00:03.826000 (13.45MB)
13:06:54 ERRO ** put 4c623b9aacb81b35a1006462 failed **: command SON([('filemd5', ObjectId('4c623b9aacb81b35a1006462')), ('root', u'worker_flac')]) failed: exception: chunks out of order
Traceback (most recent call last):
  File "C:\Museeka\lib\site-packages\museekatranscodewma-0.2.0-py2.6.egg\transcode_wma\transcode.py", line 304, in _put_outputfile
    flac_mongo.close()
  File "C:\Museeka\lib\site-packages\pymongo-1.8.1-py2.6-win32.egg\gridfs\grid_file.py", line 218, in close
    self.__flush()
  File "C:\Museeka\lib\site-packages\pymongo-1.8.1-py2.6-win32.egg\gridfs\grid_file.py", line 200, in __flush
    root=self._coll.name)["md5"]
  File "C:\Museeka\lib\site-packages\pymongo-1.8.1-py2.6-win32.egg\pymongo\database.py", line 294, in command
    (command, result["errmsg"]))
OperationFailure: command SON([('filemd5', ObjectId('4c623b9aacb81b35a1006462')), ('root', u'worker_flac')]) failed: exception: chunks out of order
13:12:11 WARN *** 4c623b9aacb81b35a1006462: wrote output in 0:10:30.376000 - exiting ***

Regards,

Sigurd

Michael Dirolf

Sep 7, 2010, 2:26:32 PM
to mongod...@googlegroups.com
Are you doing concurrent writes and deletes? You need to be careful w/
deletes as they aren't concurrency safe. Any more info about what
you're doing (better yet, a reproducible test case) would help us
diagnose this.


Sigurd Høgsbro

Sep 7, 2010, 4:17:30 PM
to mongod...@googlegroups.com
I'm using GridFS as a shared file-system between a Linux controller process and a large number of worker processes on separate hosts to convert ~600k WMA files to FLAC. Files range from under 1MB to 400MB each. Average throughput is

All hosts are 64-bit Ubuntu 8.04.02. Using pymongo 1.8.1 and mongoDB 1.6.2 everywhere. I now have 2 mongoDBs: one used for GridFS (on same server that runs controller), and another on a separate server holding application state.

With regard to the GridFS delete() method, I noticed that there is no 'safe=True' option, so the mongo operations could overlap or get out of sync in a high-concurrency scenario, especially when doing a sequence of exists()/delete()/new_file() with a specific _id, as I used to do. I've now changed the code to let GridFS allocate a new _id for each file I write.

The controller maintains an outstanding queue of transcoding work items from which the workers consume. The flow for each file is:
  1. Controller stores WMA file in GridFS
  2. Controller sends WMA _id in AMQP payload to the worker-queue
  3. Worker process reads a message from the worker-queue
  4. Worker copies the WMA file from GridFS to C:\Temp within the Windows VM
  5. Worker runs a Windows console app to convert WMA Lossless to FLAC
  6. Worker writes FLAC file into GridFS
  7. Worker sends _id references of WMA and FLAC files back to controller on AMQP response queue
  8. Controller reads AMQP response queue
  9. Controller saves the FLAC to the SAN
  10. Controller deletes both files from GridFS
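
The worker side of this flow (steps 4 to 7) might be sketched as follows. This is only an outline under stated assumptions, in Python 3 syntax: `fs` is assumed to be a `gridfs.GridFS` instance, `convert` is a hypothetical callable standing in for the Windows console converter, `handle_work_item` is an illustrative name, and the AMQP plumbing is left out entirely.

```python
import os
import shutil
import tempfile


def handle_work_item(fs, wma_id, convert):
    """Run steps 4-7 for one work item and return (wma_id, flac_id)."""
    with tempfile.TemporaryDirectory() as workdir:
        wma_path = os.path.join(workdir, "input.wma")
        flac_path = os.path.join(workdir, "output.flac")

        # Step 4: copy the WMA file from GridFS to local disk in blocks,
        # so a large file is not held in memory all at once.
        with open(wma_path, "wb") as target:
            shutil.copyfileobj(fs.get(wma_id), target)

        # Step 5: run the external converter (hypothetical callable).
        convert(wma_path, flac_path)

        # Step 6: store the FLAC file, letting GridFS allocate a new _id
        # rather than reusing one.
        with open(flac_path, "rb") as source:
            flac_id = fs.put(source)

    # Step 7: these ids are what the worker reports back to the controller.
    return wma_id, flac_id
```

Because a fresh _id is allocated on every write, a crashed worker can leave an orphaned file behind, but a retry never collides with its chunks.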
Passing a file stream to GridIn.write(), instead of looping myself and calling GridIn.write() with blocks of GridIn.chunk_size bytes each, seems to have reduced or eliminated the 'chunks out of order' problem.

There are still relatively frequent occurrences of error 10053 on worker writes to GridFS, which I've worked around by running the Python app from an endless looping CMD file.

The WinSock errors suggest pymongo might be overflowing Windows TCP buffers. There seems to be a difference in how Linux and Windows deal with very large writes, which I also came up against in Pika - see http://github.com/tonyg/pika/issues#issue/14. My solution there was to limit the size of the calls to send() to the negotiated connection fragment size, which whilst affecting performance negligibly made the Windows workers stable. 

Maybe the sock.sendall() in connection.py should be changed to do something like this (lifted from Mercurial osutils.py):

def send_all(socket, bytes):
    """Send all bytes on a socket.

    Regular socket.sendall() can give socket error 10053 on Windows.  This
    implementation sends no more than 64k at a time, which avoids this problem.
    """
    chunk_size = 2**16
    for pos in xrange(0, len(bytes), chunk_size):
        socket.sendall(bytes[pos:pos+chunk_size])
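
For readers on Python 3, the same idea could be written as below. This is a sketch of the technique, not the patch that was tested on the cluster; the `chunk_size` parameter and the use of `memoryview` are additions made for this example.

```python
def send_all(sock, data, chunk_size=2 ** 16):
    """Send data over sock in slices of at most chunk_size bytes."""
    # Slicing a memoryview avoids copying each slice of the payload.
    view = memoryview(data)
    for pos in range(0, len(view), chunk_size):
        sock.sendall(view[pos:pos + chunk_size])
```

Each call to sendall() then sees at most 64K, which is the limit the docstring above reports as avoiding error 10053 on Windows.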

I also found this page interesting reading: http://www.itamarst.org/writings/win32sockets.html

Regards,

Sigurd

Michael Dirolf

Sep 7, 2010, 4:39:39 PM
to mongod...@googlegroups.com
2010/9/7 Sigurd Høgsbro <sigurd....@museeka.com>:

> With regard to the GridFS delete() method, I noticed that there is no
> 'Safe=True' option, and that it thus would be possible for the mongo
> operations to overlap/get out of sync in a high concurrency scenario,
> especially if doing a sequence of  exist()/delete()/new_file() with a
> specific _id as I used to do. I've now changed the code to letting GridFS
> allocate a new _id for each file I write.

The issue doesn't really have anything to do with safe mode, but with
the fact that we can't do multiple, isolated operations, which is
needed to do a delete that is concurrency safe. This limitation is
noted strongly in the API docs, IIRC.

This seems like something that should be integrated into Python's
sendall method, if it's the proper fix for this. Has there been any
discussion / is there a case open for this w/ the Python project? If
there is a case that just hasn't made it in yet, then we could
definitely add it to pymongo for the time being, as well.

Sigurd Høgsbro

Sep 7, 2010, 5:52:41 PM
to mongod...@googlegroups.com
I haven't checked whether this is in the Python tracker, but since we have to live with the Python installations currently found in the wild I suggest a workaround is applied to pymongo. 

I've found reference to this issue in Twisted (link in previous mail), Mercurial and Bzr (see below), so the empirical evidence suggests a generic Python/Windows limitation. Whether 64K is the optimal 'frame' size remains to be seen.

I'm running a test on the cluster with a locally released pymongo where sendall() is called for 64K segments, and will report my findings later.

Regards,

Michael Dirolf

Sep 7, 2010, 6:02:42 PM
to mongod...@googlegroups.com
Okay, please let us know if you see improvements by limiting sendall
calls to 64K buffers. If so, I think we can apply your patch.