Blobstore filename created in MapReduce job too long to create BlobKey


Jamie N.

Mar 1, 2013, 11:25:17 PM
to google-a...@googlegroups.com
I've been receiving intermittent errors from MapReduce jobs. I'm running Python 2.7.

The specific error is "BadValueError: name must be under 500 bytes" which is raised when calling datastore.Key.from_path() within blobstore.get_blob_key(); the filename being provided is way too long to make a key from.
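That 500-byte ceiling is easy to reproduce outside App Engine with a plain-Python stand-in for the key-name check (the helper name and exception type here are illustrative, not the actual datastore API):

```python
def check_key_name(name):
    """Reject key names at or over 500 bytes, mirroring the limit behind
    "BadValueError: name must be under 500 bytes"."""
    encoded = name.encode("utf-8")
    if len(encoded) >= 500:
        raise ValueError("name must be under 500 bytes (got %d)" % len(encoded))
    return name

check_key_name("f" * 288)    # a 288-byte filename still fits under the limit
# check_key_name("f" * 992)  # would raise: 992 bytes is well over 500
```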

This all occurs within the code in the mapreduce package… nothing in my code seems to affect it.

Some of the filenames are 288 bytes long, while some are 992. The M/R spec name and id in each case is nearly the same and is very short; I don't see where this variance comes from.

The sequence of events is this:
mapreduce.output_writers.init_job() creates a reasonable, short filename and passes it to files.blobstore.create()
create() calls files.file._create('blobstore', …, filename)
_create() sets up an rpc with that filename and calls _make_call('Create', ...)

And that call sometimes returns a filename that's 288 bytes, sometimes 992. I have no idea why or how to work around this — any help would be appreciated.

Thanks,
Jamie

Alex Burgel

Mar 5, 2013, 4:48:04 PM
to google-a...@googlegroups.com
On Friday, March 1, 2013 11:25:17 PM UTC-5, Jamie Niemasik wrote:
Some of the filenames are 288 bytes long, while some are 992. The M/R spec name and id in each case is nearly the same and is very short; I don't see where this variance comes from.

Have you noticed if the long file names contain the word 'writable' at the beginning?

If so, it might be similar to an issue that I had (my issue was with Google Storage, not blobstore, but their APIs are similar).


It seems that when a file is writable, it's in a special state with a very long filename. When the MR job finishes, it finalizes the file, which then gets a new, shorter filename. My issue was that the MR job wasn't finishing properly: some of my code was throwing exceptions, the job never finalized, and so the file never got the shorter name.

I would take a look at your logs to see if there are any errors. They may be preventing the MR job from finishing properly, causing it to return unfinalized filenames.
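One quick way to spot the state Alex describes is to check filenames for the writable marker before handing them to get_blob_key. The prefix format below is an assumption pieced together from this thread, not a documented API:

```python
def looks_unfinalized(filename):
    """Heuristic: writable-state Files API names carry a 'writable'
    marker near the front (per the observations in this thread)."""
    return "writable" in filename[:40]

# Hypothetical examples of the two states:
looks_unfinalized("/writable:blobstore/abc123")  # still writable
looks_unfinalized("/blobstore/abc123")           # finalized
```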

--Alex

Jamie N.

Mar 5, 2013, 7:02:08 PM
to google-a...@googlegroups.com
Thanks Alex. Yes, they are writable; however, I can't find any errors in the logs other than the name errors themselves (and those occur right after "Final result for job … is 'success'"). The mapreduce code gets through files.finalize(filename), but blows up in get_blob_key() because the filename is no shorter than before.

Ben, I wish I could do that, but the mapreduce lib is creating a __BlobFileIndex__ datastore key using this filename as the id, so I don't know what sort of change I could make there. Unfortunately it's not something I'm storing as a property on my own model.

Jamie

--
You received this message because you are subscribed to a topic in the Google Groups "Google App Engine" group.

bmurr

Mar 6, 2013, 5:13:08 AM
to google-a...@googlegroups.com
Well, I think I have it solved.

I was using an older version of the mapreduce library, which uses mapreduce.lib.files to interact with the blobstore.
The newer version of mapreduce uses google.appengine.api.files instead, which doesn't cause this problem.

These two libraries seem pretty similar, so I'm not sure precisely what was causing the issue.

Jamie N.

Mar 6, 2013, 7:08:09 AM
to google-a...@googlegroups.com
: )  I tried the same thing last night, but wasn't ready to declare victory because I saw new errors. Happily, it turns out those were from old tasks (with retry counts nearing 1000) whose states were not compatible with the new MR code. I've purged the queue and cleaned out the associated blobs and everything's humming along now.

When I first started using MR, I had to make a lot of modifications to get it working with py2.7 and NDB, using two different versions of the MR lib that I found. I was hesitant to touch that code since it had been working perfectly for so long, until sometime in February when I started experiencing these errors. I'm still not sure why that happened.

But, happily, all I had to do to the svn version of MR this time around was change mapreduce/main.py to mapreduce.main.APP in include.yaml. Nice!
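For reference, that include.yaml change would look something like this; the URL pattern and login setting are assumptions based on the stock mapreduce config, not taken from this thread:

```yaml
# include.yaml for the mapreduce library -- the script entry now
# points at the WSGI app object instead of a script path.
handlers:
- url: /mapreduce(/.*)?
  script: mapreduce.main.APP   # was: mapreduce/main.py
  login: admin
```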
