Reading DBF records from blobstore object is VERY SLOW

Mike Lucente

Dec 6, 2016, 1:45:09 PM
to Google App Engine
I'm using dbfpy to read records from a blobstore entry and am unable to get through 24K records before hitting the 10-minute wall (my process runs in a task queue). Here's my code:

    def get(self):
        count = 0
        cols = ['R_MEM_NAME','R_MEM_ID','R_EXP_DATE','R_STATE','R_RATING1','R_RATING2']

        blobkey = self.request.get('blobkey')
        blob_reader = blobstore.BlobReader(blobkey)

        dbf_in = dbf.Dbf(blob_reader, True)

        try:
            if dbf_in.fieldNames[0] == 'R_MEM_NAME':
                pass
        except:
            logging.info("Invalid record type: %s", dbf_in.fieldNames[0])
            return

        mysql = mysqlConnect.connect('ratings')
        db = mysql.db
        cursor = db.cursor()

        for rec in dbf_in:
            count = count + 1
            if count == 1:          # skip the first record
                continue

            continue                # bypass the MySQL inserts (see note below)

---
This simple loop should finish in seconds. Instead it gets through a few thousand records and then hits the wall.

Note the last "continue" that I added to bypass the MySQL inserts (which I previously thought were the culprit).

I'm stumped and stuck.

Mike Lucente

Dec 6, 2016, 1:59:59 PM
to Google App Engine
Note also that this works fine when running locally (dev).

Nicholas (Google Cloud Support)

Dec 7, 2016, 4:57:15 PM
to Google App Engine
Hey Mike,

I'm not familiar with dbfpy or how it implements iteration, but if no other point in your example consumes much time, iterating through dbf_in seems to be the issue. Since it implements __getitem__ to serve records as a stream, it's possible that this is what costs cycles by issuing many small requests to the blob reader. I would strongly recommend using Stackdriver Trace to see the life of a request and where it spends the bulk of its time. Let me know what you find.
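To illustrate the __getitem__ point, here is a rough sketch of how that kind of iterator tends to behave (this is an assumption about dbfpy's internals, not a reading of its source):

    # Hypothetical sketch, not dbfpy's actual code: a record "stream" built on
    # __getitem__, where every record access does its own seek + small read on
    # the underlying file-like object (here, a BlobReader).
    class RecordStream(object):
        def __init__(self, stream, header_size, record_size, record_count):
            self.stream = stream
            self.header_size = header_size
            self.record_size = record_size
            self.record_count = record_count

        def __getitem__(self, index):
            if index >= self.record_count:
                raise IndexError(index)
            # Against Blobstore, each of these small reads can become its own
            # FetchData call rather than a cheap local file read.
            self.stream.seek(self.header_size + index * self.record_size)
            return self.stream.read(self.record_size)

    # "for rec in dbf_in" falls back to the __getitem__ protocol, calling
    # stream[0], stream[1], ... until IndexError, i.e. one access per record.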

Nicholas


Mike Lucente

Dec 8, 2016, 11:12:44 AM
to google-a...@googlegroups.com
Yes, Stackdriver shows that the blobstore file is being accessed for every record (/blobstore.FetchData (56 ms), for example). But it works fine when run locally(?) I would have expected the blobstore to behave like a regular file, where opening it and reading records is a non-issue. Why is this so painfully slow when accessing the blobstore? Do I have to slurp the entire file into memory and parse it??
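If slurping is the way to go, I'm picturing something like this (untested sketch; it assumes the whole DBF fits comfortably in instance memory and that dbf.Dbf accepts any file-like object, the way it accepts the BlobReader above):

    from cStringIO import StringIO

    from google.appengine.ext import blobstore
    from dbfpy import dbf

    def open_dbf_from_blob(blobkey):
        # Read the whole blob in one go, then let dbfpy parse it from memory,
        # so record-by-record reads never touch the blobstore again.
        blob_reader = blobstore.BlobReader(blobkey)
        data = blob_reader.read()        # one bulk fetch instead of one per record
        return dbf.Dbf(StringIO(data), True)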


Nicholas (Google Cloud Support)

Dec 12, 2016, 2:30:28 PM
to Google App Engine
Every call to FetchData issues a small network request to the blobstore. This will most certainly take time when scaled to 24,000 fetches. The reason you don't encounter this when testing in the development environment (dev_appserver.py) is that the dev blobstore is hosted locally on your machine, so each fetch costs about as much as a local file read rather than a full network request. The latency you're seeing in production is therefore reasonable and to be expected.

To solve this issue, you would need to either reduce the latency of these network calls or reduce how many of them are made per request to this module. 56 ms for these internal Blobstore calls is entirely expected, so you won't really be able to cut that down. Thus, we are left with reducing how many of these calls are made with each request:
  • What is this task queue-issued request doing during its lifespan that affects thousands of records?
  • Is it possible for this request to instead be broken up into multiple requests affecting fewer records?
  • If not, why must all records be processed in a single request?
To provide some more practical advice, I would need to know more about your implementation, starting with the questions above.
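For example, splitting the work up might look roughly like this (the handler URL, parameter names and chunk size below are placeholders, not anything from your app):

    from google.appengine.api import taskqueue

    RECORDS_PER_TASK = 2000  # placeholder chunk size; tune to stay well under 10 minutes

    def enqueue_import_chunks(blobkey, total_records):
        # Fan the import out so each task only touches a slice of the records.
        # The /tasks/import_chunk handler (hypothetical) would open the blob and
        # process only records start..stop.
        for start in range(0, total_records, RECORDS_PER_TASK):
            stop = min(start + RECORDS_PER_TASK, total_records)
            taskqueue.add(url='/tasks/import_chunk',
                          params={'blobkey': blobkey,
                                  'start': str(start),
                                  'stop': str(stop)})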

Mike Lucente

Dec 13, 2016, 12:51:17 PM
to google-a...@googlegroups.com
I've changed the app to read from a flat file instead of a DBF and am now processing 100 inserts/updates at a time against MySQL. All good now.
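Roughly like this (sketch only; the table and column names are placeholders and I've left out the update handling, but the batching is the important part):

    BATCH_SIZE = 100  # placeholder batch size

    # Placeholder schema; the real table/columns differ.
    INSERT_SQL = ("INSERT INTO ratings "
                  "(mem_name, mem_id, exp_date, state, rating1, rating2) "
                  "VALUES (%s, %s, %s, %s, %s, %s)")

    def import_flat_file(flat_file, cursor, db):
        # Collect rows and hand them to MySQL BATCH_SIZE at a time instead of
        # making one round trip per record.
        batch = []
        for line in flat_file:
            batch.append(line.rstrip('\n').split('\t'))  # assumes a tab-delimited export
            if len(batch) >= BATCH_SIZE:
                cursor.executemany(INSERT_SQL, batch)
                db.commit()
                batch = []
        if batch:                                        # flush the final partial batch
            cursor.executemany(INSERT_SQL, batch)
            db.commit()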

Nicholas (Google Cloud Support)

Dec 13, 2016, 2:22:57 PM
to Google App Engine
Good to hear that you've found a solution.  Generally speaking, bulk tasks are often better broken down into smaller tasks as this tends to lend itself quite well to horizontal scaling.  Happy coding!

