Writes to cloud storage hanging after opening file?


Ben Smithers

Sep 10, 2013, 6:14:58 AM
to google-a...@googlegroups.com
Hi,

I've been having some problems writing data to google cloud storage using the python client library (https://developers.google.com/appengine/docs/python/googlecloudstorageclient/).

Frequently, the write to cloudstorage will 'hang' indefinitely. The call to open a new file is successful, but the write (1MB in size in this test case) never returns and the file never appears in the bucket. For example, I launch 10 instances of a (backend) module, each of which attempts to write a file. Typically, somewhere between 4 and 9 of these will succeed, with the others hanging after opening. This is the code I am running:

class StartHandler(webapp2.RequestHandler):

    GCS_BUCKET = "/"

    def debugMessage(self, msg):
        logging.debug(msg)
        logservice.flush()

    def get(self):
        suffix = str(backends.get_instance())
        filename = self.GCS_BUCKET + "/testwrite" + suffix + ".txt"
        gcs_file = cloudstorage.open(filename, 'w', content_type='text/plain')
        self.debugMessage("opened file")
        gcs_file.write("f" * 1024 * 1024 * 1 + '\n')
        self.debugMessage("data written")
        gcs_file.close()
        self.debugMessage("file closed")


I have also attached a tarred example of the full application in case it is relevant (to run, you should only need to modify the application name in the two .yaml files and the bucket name in storagetest.py). A few additional things:

1.) I wondered if this was a problem with simultaneous writes, so I had each instance sleep for 30 seconds * its instance number; I observe the same behaviour.
2.) I have seen this behaviour on frontend instances, but far, far more rarely. I modified the above to run in response to a user request - once out of 60 times the write hung after opening (a DeadlineExceededError was then thrown).
3.) I have experimented with the RetryParams (though according to the documentation, the defaults should be sufficient), but to no avail - a minimal sketch of what I tried is below. I also find it hard to believe this is the issue, as I would then expect to be getting a TimeoutError.
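
For reference, this is roughly how I was setting the retry parameters (a minimal sketch - the values shown are illustrative, not the only combination I tried):

import cloudstorage

# Illustrative values only - I experimented with several combinations.
retry_params = cloudstorage.RetryParams(initial_delay=0.2,
                                        backoff_factor=2,
                                        max_retry_period=15)
cloudstorage.set_default_retry_params(retry_params)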

Has anyone else observed this behaviour? Does anyone have any suggestions for what I am doing wrong? Or a different approach to try?

Very grateful for any help,
Ben
example.tar.gz

Ye Yuan

Sep 10, 2013, 4:12:36 PM
to google-a...@googlegroups.com
Hi Ben,

Thanks for the detailed post. What do the application logs for those hung requests look like? Do they all only have a DeadlineExceededException? What does the stack trace look like?

I don't see anything you are doing wrong. Just make sure you are using a relatively recent release of the library.

Ben Smithers

Sep 10, 2013, 4:52:52 PM
to google-a...@googlegroups.com
Hi Ye,

I am using the latest release (r65) downloaded from: https://code.google.com/p/appengine-gcs-client/downloads/list 

I've generally been testing with manually scaled backend modules, in which case there is no exception; they just hang indefinitely. A successful write simply has the following in the log:

  1. 2013-09-10 10:30:17.830 /_ah/start 200 3156ms 0kb instance=0
    0.1.0.3 - - [10/Sep/2013:02:30:17 -0700] "GET /_ah/start HTTP/1.1" 200 119 - - "0.v1.storagetest.benstoragetest.appspot.com" ms=3156 cpu_ms=606 cpm_usd=0.000013 loading_request=1 instance=0 app_engine_release=1.8.4 
  2. D2013-09-10 10:30:15.582
    sleeping
  3. D2013-09-10 10:30:15.810
    opened file
  4. D2013-09-10 10:30:16.915
    data written
  5. D2013-09-10 10:30:17.728
    file closed
  6. I2013-09-10 10:30:17.830
    This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This request may thus take longer and use more CPU than a typical request for your application.

For an instance that hangs, the log is terminated when I stop the instances:

  1. 2013-09-10 10:46:33.041 /_ah/start 500 978206ms 0kb instance=7
    0.1.0.3 - - [10/Sep/2013:02:46:33 -0700] "GET /_ah/start HTTP/1.1" 500 0 - - "7.v1.storagetest.benstoragetest.appspot.com" ms=978207 cpu_ms=866 loading_request=1 exit_code=114 instance=7 app_engine_release=1.8.4 
  2. D2013-09-10 10:30:16.038
    sleeping
  3. D2013-09-10 10:33:46.377
    opened file
  4. I2013-09-10 10:46:33.040
    This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This request may thus take longer and use more CPU than a typical request for your application.
  5. E2013-09-10 10:46:33.040
    Process terminated because the backend took too long to shutdown.

If it is useful, here is an example of the failure of a frontend instance:

  1. 2013-09-10 10:51:41.971 /gcs?suffix=15 500 60704ms 0kb Wget/1.13.4 (linux-gnu)
    137.222.108.37 - - [10/Sep/2013:02:51:41 -0700] "GET /gcs?suffix=15 HTTP/1.1" 500 0 - "Wget/1.13.4 (linux-gnu)" "benstoragetest.appspot.com" ms=60705 cpu_ms=1280 loading_request=1 app_engine_release=1.8.4 instance=00c61b117c1bed3e63c87fa8b71e8c142c32b7
  2. D2013-09-10 10:50:44.196
    opened file
  3. E2013-09-10 10:51:41.944
    Traceback (most recent call last):
      File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 266, in Handle
        result = handler(dict(self._environ), self._StartResponse)
      File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1505, in __call__
        rv = self.router.dispatch(request, response)
      File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1253, in default_dispatcher
        return route.handler_adapter(request, response)
      File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 1077, in __call__
        return handler.dispatch()
      File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.3/webapp2.py", line 545, in dispatch
        return method(*args, **kwargs)
      File "/base/data/home/apps/s~benstoragetest/v1.370120547795114149/storagetest.py", line 26, in get
        gcs_file.write("f" * 1024 * 1024 * 1 + '\n')
      File "/base/data/home/apps/s~benstoragetest/v1.370120547795114149/cloudstorage/storage_api.py", line 578, in write
        self._flush()
      File "/base/data/home/apps/s~benstoragetest/v1.370120547795114149/cloudstorage/storage_api.py", line 659, in _flush
        self._send_data(''.join(buffer), last)
      File "/base/data/home/apps/s~benstoragetest/v1.370120547795114149/cloudstorage/storage_api.py", line 676, in _send_data
        self._path_with_token, payload=data, headers=headers)
      File "/base/data/home/apps/s~benstoragetest/v1.370120547795114149/cloudstorage/rest_api.py", line 41, in sync_wrapper
        return future.get_result()
      File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 325, in get_result
        self.check_success()
      File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 320, in check_success
        self.wait()
      File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 304, in wait
        if not ev.run1():
      File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/ext/ndb/eventloop.py", line 219, in run1
        delay = self.run0()
      File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/ext/ndb/eventloop.py", line 181, in run0
        callback(*args, **kwds)
      File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/ext/ndb/tasklets.py", line 375, in _help_tasklet_along
        ds_conn = datastore._GetConnection()
      File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/datastore.py", line 391, in _GetConnection
        return _thread_local.connection_stack[-1]
    DeadlineExceededError
  4. I2013-09-10 10:51:41.971
    This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This request may thus take longer and use more CPU than a typical request for your application.

I have also had a friend replicate the issue (with manually scaled backend modules at least) using the code I posted.

Thanks for any suggestions,

Ben

Vinny P

Sep 10, 2013, 9:45:04 PM
to google-a...@googlegroups.com
On Tue, Sep 10, 2013 at 5:14 AM, Ben Smithers <smithe...@googlemail.com> wrote:
Has anyone else observed this behaviour? Does anyone have any suggestions for what I am doing wrong?



Backends are the red-headed stepchild of GAE - they exhibit weird behavior periodically. Did these errors start happening recently, or is this new code? If you're using the Modules version of backends, can you try using regular backends (and not defining any modules whatsoever)?

Can you try seeing what an AppStats profile of this request looks like? Here's the link: https://developers.google.com/appengine/docs/python/tools/appstats . Specifically, what I'm looking for is whether the library is hanging during the connection to GCS, and what the connection information looks like.

If you don't want to install AppStats, you can also try splitting up file creation and writing to the file: create the file through the GCS library, write a single character, then close the connection. Sleep for a bit, then try writing to the file. The problem might isolate itself to a particular part of the file writing process.
 
-----------------
-Vinny P
Technology & Media Advisor
Chicago, IL

App Engine Code Samples: http://www.learntogoogleit.com
 
   

Ben Smithers

Sep 11, 2013, 8:04:45 AM
to google-a...@googlegroups.com
Hi Vinny,

Thanks for your suggestions. As a redhead, I'll try not to take offence at that :)

I have been working with App Engine for only a couple of months, so all code is relatively new. My *guess* is that I've been having a similar issue pretty much from the start; it just took me a while to debug it (I think I was being fooled by log messages prior to flushing them). However, I was originally working with the blobstore. Since I was having issues and the blobstore has been deprecated, I switched to GCS - and have been having this exact problem since.

I tried switching back to regular backends and removing my modules - same behaviour.


I tried your suggestion of writing a single character, closing the file and trying the larger write:

        gcs_file = cloudstorage.open(filename, 'w', content_type='text/plain')
        self.debugMessage("opened file")

        gcs_file.write("f")
        self.debugMessage("One character written")
        gcs_file.close()
        self.debugMessage("file closed. Sleeping")

        time.sleep(int(suffix) * 10)

        gcs_file = cloudstorage.open(filename, 'w', content_type='text/plain')
        self.debugMessage("opened file")
        gcs_file.write("f" * 1024 * 1024 * 1 + '\n')
        self.debugMessage("data written")
        gcs_file.close()
        self.debugMessage("file closed")

I ran this three times (10 instances each time) - in all cases the file is written with the single character successfully, but a number of the larger writes then hang.

I also enabled AppStats. However, it looks to me like the requests that hang never reach the state of writing the AppStats log (e.g. I never see something like "Saved; key: __appstats__:073900, part: 29 bytes, full: 1364 bytes" in the log), so I am not sure how useful that will be. Nonetheless, I modified the AppStats code to allow public access (this is a test application, no sensitive data), so if you wish to do so, you can check for yourself at http://benstoragetest.appspot.com/stats
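
For completeness, AppStats was wired in the usual way via appengine_config.py (a minimal sketch; the tweak I made to allow public access to /stats is not shown here):

# appengine_config.py
def webapp_add_wsgi_middleware(app):
    # Wrap the WSGI app so AppStats records RPC traces for each request.
    from google.appengine.ext.appstats import recording
    app = recording.appstats_wsgi_middleware(app)
    return app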

Thanks for your input,
Ben

Vinny P

Sep 11, 2013, 10:48:12 AM
to google-a...@googlegroups.com
On Wed, Sep 11, 2013 at 7:04 AM, Ben Smithers <smithe...@googlemail.com> wrote:
I ran this three times (10 instances each time) - in all cases the file is written with the single character successfully, but a number of the larger writes then hang.



So what this looks like to me is that GCS and the library are working fine, but the backend (and, more importantly, the urlfetch connection from the backend) is having difficulty piping the data down to the GCS servers.

To speak of this problem in a hypothetical sense, at some point there is an "inflection point" - a certain size of data write that the urlfetch is having difficulty with. You were able to write a single character without a problem, but 1 MB of characters was too much for it. So if I were you, I'd try to find that inflection point and then save to Cloud Storage in chunks smaller than that. For example, try writing to Cloud Storage in chunks of 250 KB; if that works, try writing in chunks of 400 KB, and so on. Of course this is only a band-aid solution - something is clearly wrong with the underlying urlfetch implementation.
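
As a rough illustration of what I mean (a sketch only; chunk_size is the knob to experiment with, and gcs_file is the handle your code already opens):

def write_in_chunks(gcs_file, data, chunk_size=250 * 1024):
    # Feed the data to the GCS file handle in fixed-size pieces so you can
    # vary chunk_size and look for the size at which writes start to hang.
    for start in range(0, len(data), chunk_size):
        gcs_file.write(data[start:start + chunk_size])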

But if I had to make a recommendation, I'd say skip the whole backend writing thing altogether. It's a waste of money to load up backends for the purposes of writing (unless you're Bill Gates or similar!). You said frontend instances seem to be much better in terms of writing - try using task queues to spawn off frontend requests which are solely dedicated to writing to Cloud Storage. If you need a temporary place to store data, write it to the datastore or Cloud SQL until the tasks can get to the GCS writing.
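
Something along these lines (the handler URL is hypothetical; the task carries a pointer to the staged data rather than the data itself):

from google.appengine.api import taskqueue

def enqueue_gcs_write(staged_key_urlsafe):
    # staged_key_urlsafe: urlsafe datastore key of the entity holding the
    # data the backend staged earlier. A frontend handler at this (made-up)
    # URL would load the entity and do the actual Cloud Storage write.
    taskqueue.add(url='/tasks/write_to_gcs',
                  params={'staged_key': staged_key_urlsafe})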

Ben Smithers

Sep 12, 2013, 4:57:10 AM
to google-a...@googlegroups.com
On Wednesday, September 11, 2013 3:48:12 PM UTC+1, Vinny P wrote:

 You were able to write a single character without a problem, but 1 MB of characters was too much for it. So if I were you, I'd try to find that inflection point and then save to Cloud Storage in chunks smaller than that. For example, try writing to Cloud Storage in chunks of 250 KB; if that works, try writing in chunks of 400 KB, and so on. Of course this is only a band-aid solution - something is clearly wrong with the underlying urlfetch implementation.
   

I'm not too convinced - I wouldn't expect it to be so binary. Writes of 1MB aren't *always* failing, so if there is some size-dependent problem, I would expect smaller writes to continue to fail, just with a lower probability (this might be fine if a failure didn't cause the instance to hang indefinitely, as you could then retry!)


On Wednesday, September 11, 2013 3:48:12 PM UTC+1, Vinny P wrote:

But if I had to make a recommendation, I'd say skip the whole backend writing thing altogether. It's a waste of money to load up backends for the purposes of writing (unless you're Bill Gates or similar!). You said frontend instances seem to be much better in terms of writing - try using task queues to spawn off frontend requests which are solely dedicated to writing to Cloud Storage. If you need a temporary place to store data, write it to the datastore or Cloud SQL until the tasks can get to the GCS writing.

Well yes, clearly my actual requirement isn't just to store some number of 'f's in GCS; the backends are required to do some amount of processing before writing results. Thanks for the suggestion though - that's definitely something I can explore if there is no actual fix for this issue.

Cheers,
Ben 

Ben Smithers

Sep 18, 2013, 9:46:32 AM
to google-a...@googlegroups.com
I think there must be a problem with RPCs from backends. Given my continued attempts to debug this problem and Vinny's suggestion, I set up another small test case, this time writing to the datastore. I observe the exact same behaviour: a large percentage of writes to the datastore hang (seemingly) indefinitely. I was writing 750KB of data. Test code:

import webapp2
from google.appengine.ext import db
from google.appengine.api.logservice import logservice
import logging

class TestEntity(db.Model):
    data = db.TextProperty()

class StartHandler(webapp2.RequestHandler):

    def debugMessage(self, msg):
        logging.debug(msg)
        logservice.flush()

    def get(self):
        data = "f" * 1024 * 750
        entity = TestEntity(data=data)
        self.debugMessage("Entity created")
        entity.put()
        self.debugMessage("Entity saved")

app = webapp2.WSGIApplication([
    ('/_ah/start', StartHandler),
], debug=True)


I ran this with 10 instances, 3 times. The number of successful writes was 7, 3 and 7 for each time I tested. The unsuccessful writes hang after "Entity created" appears in the log but before "Entity saved" is logged (and these entities do not appear in the datastore). Additional info:

- The max time I waited for this was ~15 minutes; I observed hanging of ~1 hour when I was writing/debugging "real" code though. 
- I have tried with both db and ndb (a rough ndb sketch is below). Same behaviour.
- Things look ok running on frontends. 60/60 successful writes (using ndb in this case, didn't test with the older API). 
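
For reference, the ndb variant looked roughly like this (same write pattern, just the ndb model API; the class name here is illustrative):

from google.appengine.ext import ndb

class TestEntityNDB(ndb.Model):
    # ndb equivalent of the db.TextProperty entity above
    data = ndb.TextProperty()

# inside the handler, same as before:
entity = TestEntityNDB(data="f" * 1024 * 750)
entity.put()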

Unless I'm missing something, this seems to be a real problem with backend instances.

Ben Smithers

Sep 19, 2013, 8:46:49 AM
to google-a...@googlegroups.com
A few more updates on this,

It looks like Vinny was correct - writes of a smaller size do not seem to cause a problem. Specifically, writes of 32KB from backend instances to the datastore, blobstore and Cloud Storage all display the hanging behaviour I have been talking about. Writes of 30KB to each of these all seem to work without problem (the actual size at which there is a problem seems to vary slightly depending on the service, presumably because of varying overhead - I have had full success with writes of 31.5KB to the datastore).

My guess therefore is that there is some kind of block size used somewhere in the underlying urlfetch/RPC implementation and whenever multiple blocks are required for the request, there is some probability of the request hanging. Why this issue affects backend instances only, I have no idea.

Using the deprecated files API for blobstore, it is possible to write large files simply by writing in chunks of 30KB:

    def write_in_chunks(self, data):
        chunk_size = 1024 * 30
        file_name = files.blobstore.create(mime_type="text/plain")
        # Float division so the final partial chunk isn't dropped when
        # len(data) is not an exact multiple of chunk_size.
        num_chunks = int(math.ceil(len(data) / float(chunk_size)))
        with files.open(file_name, 'a') as f:
            for i in range(0, num_chunks):
                f.write(data[i*chunk_size:(i+1)*chunk_size])
        files.finalize(file_name)

Using the cloudstorage library, this isn't possible. It looks like this is because the library only flushes data once 256K characters have been written; this can be seen in the write method in storage_api.py (the blocksize is set to 256K on line 489): 

  def write(self, data):
    """Write some bytes."""
    self._check_open()
    assert isinstance(data, str)
    if not data:
      return
    self._buffer.append(data)
    self._buffered += len(data)
    self._offset += len(data)
    if self._buffered >= self._blocksize:
      self._flush()

I tried fiddling with this parameter, but this seems to cause too many requests to GCS, as I started receiving 503 responses, which correspond to a request rate that is too high (https://developers.google.com/storage/docs/reference-status#standardcodes).
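
For the record, "fiddling" here meant something like the following (a sketch only - it pokes at the private _blocksize attribute used by the write method quoted above, and given the 503s it is clearly not a supported approach):

import cloudstorage

filename = '/your-bucket/testwrite.txt'  # placeholder bucket/object name
gcs_file = cloudstorage.open(filename, 'w', content_type='text/plain')
# Shrink the flush threshold so each ~30KB write triggers its own request
# to GCS instead of being buffered up to 256KB first.
gcs_file._blocksize = 1024 * 30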

Hopefully this is helpful for others and for identifying and fixing the underlying issue.

Ben

Doug Anderson

Sep 19, 2013, 11:36:46 AM
to google-a...@googlegroups.com
  1. Why hasn't the GCS client download been updated since early June?  Numerous changes have been checked in since then.  As I understand it one of them is crucial to GCS reliability.
  2. Why hasn't the GCS client been integrated into the App Engine SDK / runtime?  This seems like the best way to ensure developers are using the latest approved version.
  3. These 'hangs' are reminiscent of the unreliability of the deprecated files API... my hope remains that GCS can become MUCH more reliable in the very near future!
  4. On a side note... the python SDK installation can take A LONG time... I believe this is largely due to more and more Django versions getting added to the SDK.  Without Django versions v0.96 thru v1.5 the uncompressed SDK size drops from 138MB to 24MB (from 19k+ files to just 1.9k files).  I'm a Django fan but I'm not using it at the moment with App Engine... it would be nice if the Django installation was optional and/or let you select the versions of Django to install... I doubt any developer needs all versions from 0.96 to 1.5.  Just a minor annoyance really... please fix the GCS issues asap!


Ye Yuan

Sep 19, 2013, 12:47:08 PM
to google-a...@googlegroups.com
Ben,

Speaking personally, thank you very much for your posts! They are very helpful for identifying this issue. I was focused on the GCS client and wasn't aware of the full scope. More people have been looped in and I'll reach out to other relevant folks today. Very sorry for all the inconvenience!

I tried fiddling this parameter, but this seems to cause too many requests to GCS as I started receiving 503 responses, which corresponds to a request rate that is too high (https://developers.google.com/storage/docs/reference-status#standardcodes).
GCS has a minimum chunk size of 256KB. Flushing data of smaller size has undefined behavior.

Ye Yuan

Sep 19, 2013, 12:50:24 PM
to google-a...@googlegroups.com
Why hasn't the GCS client download been updated since early June?  Numerous changes have been checked in since then.  As I understand it one of them is crucial to GCS reliability.
The SVN is updated with fixes. Sorry about the download zips. We are in the process of sorting out distributions.


These 'hangs' are reminiscent of the unreliability of the deprecated files API... my hope remains that GCS can become MUCH more reliable in the very near future!
Yeah, from the evidence so far I am also hoping the hangs don't originate from the GCS lib.

On a side note... the python SDK installation can take A LONG time... I believe this is largely due to more and more Django versions getting added to the SDK.  Without Django versions v0.96 thru v1.5 the uncompressed SDK size drops from 138MB to 24MB (from 19k+ files to just 1.9k files).  I'm a Django fan but I'm not using it at the moment with App Engine... it would be nice if the Django installation was optional and/or let you select the versions of Django to install... I doubt any developer needs all versions from 0.96 to 1.5.  Just a minor annoyance really... please fix the GCS issues asap!
I'll send this feedback to relevant people. Thanks!

Ben Smithers

Sep 19, 2013, 12:53:46 PM
to google-a...@googlegroups.com
Hi Ye,

No problems. Thanks for the update, I appreciate it.

In light of Doug's comments, I have also confirmed that the same behaviour (success on writing 30KB; some hangs on writing >=32KB) occurs with the latest release (r105) of the client library from SVN. I was expecting this, as it doesn't look like the GCS library is the problem.

Ben

Vinny P

Sep 19, 2013, 2:03:44 PM
to google-a...@googlegroups.com
On Thu, Sep 19, 2013 at 7:46 AM, Ben Smithers <smithe...@googlemail.com> wrote:
It looks like Vinny was correct - writes of a smaller size do not seem to cause a problem. Specifically, writes from backend instances to datastore, blobstore and cloud storage of 32KBs all display the hanging behaviour


Your experiences parallel mine - backend instances seem to have more difficulty communicating with other Google services compared to frontend instances. I often speculate that there is an additional communications layer within the backends system, or different architectural decisions.


On Thu, Sep 19, 2013 at 11:53 AM, Ben Smithers <smithe...@googlemail.com> wrote:
In light of Doug's comments, I have also confirmed the same behaviour (success on writing 30KB; some hangs on writing >=32KBs) occurs with the latest release (r105) of the client library from SVN. I was expecting this, as it doesn't look like the GCS library is the problem.



+1. I see similar behavior when communicating directly to GCS via the JSON API. 


On Thu, Sep 19, 2013 at 11:50 AM, Ye Yuan <y...@google.com> wrote:
Yeah from the evidences so far I am also hoping the hangs aren't originated from the GCS lib.



Ye, I want to emphasize something since you're reading this thread. I believe there are two different issues here. One, backend-instance urlfetch to anything (not just GCS) seems to be less stable than frontend-instance urlfetch (perhaps something is chunking requests from the backend?). Two, GCS itself (not the library, but the GCS servers) seems to have difficulty with differing chunk sizes. You wrote that GCS has a minimum chunk size of 256KB, but I've flushed chunks much smaller than that reliably and without difficulty.

So the bottom line here is when you're testing, test from App Engine frontend and backend instances, Compute Engine machines, and from an external-to-Google-networks machine. I'd focus less on the client library. 
 

Doug Anderson

Sep 19, 2013, 3:36:45 PM
to google-a...@googlegroups.com
I realize this is likely a GCS server issue and I don't mean to distract from that, but I'd also like to point out that the SDK's google-api-python-client (which I assume GCS relies upon under the hood) is v1.0beta6 from Oct 28, 2011, whereas the latest release seems to be v1.2 from Aug 7, 2013.  I assume the runtime versions on the live servers are up to date :)  PKG-INFO snippets below:

--- Latest PKG-INFO snippet ---
Metadata-Version: 1.1
Name: google-api-python-client
Version: 1.2
Summary: Google API Client Library for Python
Classifier: Development Status :: 5 - Production/Stable

--- App Engine SDK 1.8.5 PKG-INFO snippet ---
Metadata-Version: 1.0
Name: google-api-python-client
Version: 1.0beta6
Summary: Google API Client Library for Python
Classifier: Development Status :: 4 - Beta
