Does set_contents_from_filename stream?


Michael Miller

Sep 9, 2010, 9:21:42 PM
to boto-...@googlegroups.com
Hi,

I'm using the following (*) simple code as the nugget of a worker object. Running on a 2 x Intel quad-core Apple machine on a university network, I'm getting decent throughput to S3. I use a thread pool and played around with the numbers to saturate the aggregate throughput (which happens at around 10 threads). I'm reading about 200 GB of data from a non-RAIDed drive. dd tests show that I can read data as fast as 50 MB/sec. However, I'm a bit confused about the system behavior. Looking at the disk and network usage (which are 100% anti-correlated), it almost looks like the entire file is read into memory before the network transaction begins. Perhaps I've fooled myself into the wrong diagnosis, but any peek behind the curtain on how set_contents_from_filename streams data would be greatly appreciated.

-Mike

(*)

from boto.s3.key import Key

# bucket, keyname and fname are set up elsewhere in the worker
k = Key(bucket)
k.key = keyname
k.set_contents_from_filename(fname)

Mitchell Garnaat

Sep 9, 2010, 9:42:20 PM
to boto-...@googlegroups.com
Hi -

I suspect that you are observing the calculation of the MD5 checksum prior to sending the file to S3.  Once computed, the value is sent as the Content-MD5 header in the HTTP request and provides the crucial integrity check necessary in networked operations.  Even here, the file is not being read into memory in its entirety.  It is read in buffers (the default buffer size is 8192 bytes), just as it is when actually uploading to S3.  However, because the whole file is read from disk once for the checksum before any of it goes over the network, it might look strange when viewed from the perspective of disk and network usage.
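
To illustrate why this pass shows up as disk activity with no network traffic, here is a minimal sketch of a buffered MD5 computation using only the Python standard library. It is not boto's actual code; the function name md5_in_chunks is hypothetical, and the 8192-byte default simply mirrors the buffer size mentioned above.

```python
import hashlib


def md5_in_chunks(path, buf_size=8192):
    """Return the hex MD5 of a file, reading buf_size bytes at a time.

    Hypothetical helper for illustration; not part of boto.
    """
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        while True:
            chunk = fp.read(buf_size)
            if not chunk:
                # End of file: nothing more to hash.
                break
            digest.update(chunk)
    return digest.hexdigest()
```

Only one buffer is held in memory at a time, but the whole file still has to be read from disk before the digest is known, which would be consistent with the anti-correlated disk and network graphs.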

The boto library does not provide a way to circumvent this header (nor should it, really) but it does allow you to pass in a tuple as the "md5" parameter to set_contents_from_filename which contains a pre-calculated checksum in the same format returned by the compute_md5 method of the Key class.  So, perhaps you could get better overall throughput for large numbers of files by doing some pre-calculation of the MD5.  I haven't tried that but it seems like it could be useful in your situation.
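
A rough sketch of that pre-calculation, using only the standard library, might look like the following. It assumes the tuple has the form (hex digest, base64 digest), which is my understanding of what compute_md5 returns in boto 2 but should be checked against the installed version. The name precompute_md5 is hypothetical, and the boto call is shown only as a comment.

```python
import base64
import hashlib


def precompute_md5(path, buf_size=8192):
    """Return (hex_md5, base64_md5) for a file, read in buf_size chunks.

    Assumption: this mirrors the tuple format of Key.compute_md5 in boto 2.
    """
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: fp.read(buf_size), b""):
            digest.update(chunk)
    hex_md5 = digest.hexdigest()
    b64_md5 = base64.b64encode(digest.digest()).decode("ascii")
    return (hex_md5, b64_md5)


# Hypothetical use with boto (not executed here):
#   k.set_contents_from_filename(fname, md5=precompute_md5(fname))
```

The checksums could then be computed ahead of time, or in separate workers, so that the upload threads spend their time on the network rather than on a first pass over the disk.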

Mitch




Michael Miller

Sep 9, 2010, 10:01:17 PM
to boto-...@googlegroups.com
Hi,

Ahh, that's obvious. I wonder why I didn't see a corresponding spike in CPU usage. Perhaps it's ultimately limited by the bus speed and scheduling/data affinity on the chip. I'll play around with your suggestion to bypass the MD5 calculation (only temporarily) to see if it behaves as predicted. Thanks, issue closed.

-M
