MapReduce Roadmap


PK

Nov 25, 2013, 3:53:09 AM
to google-a...@googlegroups.com
MapReduce has been around for a very long time, yet it still has not been integrated into the GAE SDK and remains experimental (https://developers.google.com/appengine/docs/python/dataprocessing/).

Could somebody shed some light on the roadmap plan?

Thanks,

PK

Dec 5, 2013, 9:11:58 PM
to google-a...@googlegroups.com
I am resending this in case the right Product Manager(s) at Google missed my question.

Thanks,

Chris Ramsdale

Dec 5, 2013, 11:27:16 PM
to google-a...@googlegroups.com

PK,

We're definitely planning on moving MapReduce to GA. The plan is to finalize the API and then move it through the standard Preview => GA channel.

Questions for you:

- how would you prefer to access the library itself?

- is there something about having it outside of the SDK that causes substantial friction?

- or, is it the fact that it's sat in experimental for way too long that is the larger concern?

-- Chris

Product Manager, Google App Engine

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-appengi...@googlegroups.com.
To post to this group, send email to google-a...@googlegroups.com.
Visit this group at http://groups.google.com/group/google-appengine.
For more options, visit https://groups.google.com/groups/opt_out.

PK

Dec 6, 2013, 12:59:02 AM
to google-a...@googlegroups.com, Chris Ramsdale
Thanks Chris.

Great to hear that you plan to move MapReduce to GA. The reasons I asked are:

1. It has been experimental for about 3 years, if not longer, so I had started wondering; I think that is fair.
2. More importantly, some bugs still in NEW state make me wonder whether the library keeps pace with other changes in the platform. Seeing issues stay in NEW state for months is what concerns me most about how active the effort is.

(An example of such a bug: Issue 203, "map reduce is broken since r534".)

Thanks,
PK

Chris Ramsdale

Dec 6, 2013, 1:10:37 AM
to PK, google-a...@googlegroups.com
Thanks, PK. We'll follow up on the issue you cited.

-- Chris

D X

Dec 6, 2013, 1:51:48 PM
to google-a...@googlegroups.com, PK
I've been using the mapreduce library for the last 18 months or so.
In addition to what's already been mentioned, some additional comments:

- The docs are confusing because there are several different sets of them. The mere fact that the docs are disorganized gives the impression that this is a low-priority project that isn't well maintained.
Keeping one set of well-maintained docs would help give the sense that MapReduce is a first-class citizen.

- Using MapReduce for schema changes is probably a very common yet simple use case. I've heard more than one comment that the MapReduce pipeline seems too complicated to pick up for the simple task of updating a bunch of entities.
mapper_spec? reducer_spec? input_reader? output_reader? Do I have to learn all of these just to add an extra field to my entities? While the wordcount demo shows more of the pipeline, it would probably be easier for users to pick up if there were a simple demo of how to update your 'schema' in five lines of Python. (A DatastoreInputReader that supports filtering would be great too.)
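For illustration, the "add one field" migration described above might boil down to a mapper like the sketch below. It is written against plain dicts so the idea is visible without the App Engine runtime; in the real library the mapper would receive a datastore entity and yield a put operation instead of returning a value. Names here (`add_default_field`, the sample entities) are hypothetical.

```python
# Sketch of a schema-migration mapper: add a field with a default
# value to every entity that doesn't already have it. The framework
# applies the mapper once per entity; we simulate that with a list.

def add_default_field(entity, field="status", default="active"):
    """Mapper body: set `field` to `default` only if it is missing."""
    if field not in entity:
        entity[field] = default
    return entity

entities = [{"name": "a"}, {"name": "b", "status": "archived"}]
migrated = [add_default_field(dict(e)) for e in entities]
# Existing values are preserved; only missing fields get the default.
```

The whole migration really is a handful of lines; the learning curve the poster describes comes from the surrounding spec plumbing, not the mapper itself.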

If this sounds too negative, you can interpret these comments as saying that the rest of the GAE docs are great and easy to follow.

- Packaged versions would be great. I mean, they were great; I'm not sure why you got rid of them. Maybe I'm not hardcore enough to just sync with the repo (actually, I did). A packaged version suggests that it's tested and stable, and if I see bugs I can check online to see whether anyone else is hitting the same issue. When syncing with the repo, I have no idea how stable the latest check-ins are. Maybe something just broke and I'm the one person who synced right after the broken change, and I'm obviously not going to be constantly syncing the MR library, because I actually have other things to work on.
What about the version that's included in the SDK? I toyed with using that, but the docs indicate I should be downloading from the repo. So is the repo more recent and the SDK version outdated? Would the SDK version be more stable? Again, confusion.

amits

Dec 6, 2013, 2:11:56 PM
to google-a...@googlegroups.com, PK
I agree with D X that a packaged version would really help.

One piece of functionality we badly need is the ability to supply a query filter in MapReduce (e.g. where x = "vvv"). Currently, it just iterates through all entities of a given kind in the datastore. So if I want to update 10,000 rows of an entity kind that has 5M records, it iterates through all 5M entities and we have to put the filter logic in our own code. This really defeats the purpose and also has huge implications for read costs.

Jacob Taylor

Dec 6, 2013, 2:31:37 PM
to google-a...@googlegroups.com, PK
It would also be great to offer some caching support. If we could easily decide which kinds of entities should be subject to caching, that would be great.

E.g. we are iterating through Users and referencing some common entities. We either need to build our own cache for the common data or pound on the datastore.
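The "build our own cache" option mentioned above can be as small as a read-through dict kept for the life of a shard, so each shared entity is fetched once instead of once per User. The sketch below uses an in-memory dict as a stand-in for the datastore (the store contents, key names, and a fetch counter are all illustrative); in production the miss path would be a datastore or memcache get.

```python
# Minimal read-through cache for common entities referenced by many
# Users. Distinct keys hit the backing store once; repeats are served
# from the cache. A counter makes the savings visible.

fetch_count = 0
BACKING_STORE = {"plan:free": {"quota": 10}, "plan:pro": {"quota": 1000}}

_cache = {}

def get_cached(key):
    """Return the entity for `key`, fetching it at most once."""
    global fetch_count
    if key not in _cache:
        fetch_count += 1              # one real fetch per distinct key
        _cache[key] = BACKING_STORE[key]
    return _cache[key]

users = [{"plan": "plan:free"}] * 3 + [{"plan": "plan:pro"}] * 2
quotas = [get_cached(u["plan"])["quota"] for u in users]
# Five users processed, but only two backing-store reads.
```

One caveat: a per-shard cache like this lives in instance memory, so it trades staleness for read cost; entries persist only as long as the mapper instance does.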

I would also love to see shard status represented in the graph (green if done?) and keep the shard graph after the mapreduce is done. This will help me understand how balanced the shards are.

We have too many namespaces now and sharding happens on an alphabetical sort of namespace names. This is a huge problem since a few of our larger namespaces are adjacent: we end up with about 10x the response time. I think this might also be affecting the production index builder.

Thanks,
Jacob

D X

Dec 7, 2013, 1:36:11 PM
to google-a...@googlegroups.com, PK
So we had written our own input readers that applied filtering, and it was pretty simple by extending the DatastoreInputReader code.  I didn't have the actual code in front of me, so I took a look back at the library code to see what I needed to override.

Then I noticed that filtering functionality is included now! It's just not documented online. Look at mapreduce.input_readers.DatastoreInputReader.validate_filters.

Looks like you need to include a 'filters' entry in your mapper_spec params.

This supports my poor documentation point...
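Based on reading DatastoreInputReader as described above, the filters appear to be passed as a list of (property, operator, value) tuples inside the mapper parameters. The sketch below shows the shape of such a params dict; the entity kind is a made-up example, and the exact accepted operators and structure should be verified against validate_filters in your copy of the library.

```python
# Hypothetical mapper params using the undocumented filter support:
# restrict the DatastoreInputReader to entities where x == "vvv",
# instead of scanning the whole kind and filtering in the mapper.

mapper_params = {
    "entity_kind": "myapp.models.MyEntity",   # illustrative kind name
    "filters": [("x", "=", "vvv")],           # (property, op, value)
}
```

This is exactly the read-cost fix amits asked for earlier in the thread: the filter is applied by the reader, so unmatched entities are never handed to the mapper.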

Sridhar Ragunathan

May 9, 2014, 9:18:30 AM
to google-a...@googlegroups.com
Hi DX,

For a beginner like me, it would be really helpful if you had a sample of using a filter.
I badly need it.
Thanks,
Sridhar R