Best practices on sorting values received by the reducer?


Gilberto Torrezan Filho

Mar 19, 2014, 7:45:42 AM
to app-engine-...@googlegroups.com
Hi folks,

I have the following data structure on my Datastore:

Date | Name | int A | int B

I need to split the data into 3 groups: records where A > B, where A = B, and where A < B. Each group has to be saved to a separate CSV file, sorted by date.

My mapper just reads the values and emits each one under the correct group. My reducer receives the shuffled and grouped results and saves them to a file on Cloud Storage, but the values it receives aren't sorted, and so neither is my CSV file.

What's the best practice here to sort the values received by the reducer? I could read all the values into memory, sort them, and then save the CSV file, but I don't think this approach is healthy for the server's memory, considering I have about 3.2M records to read.

By the way, to avoid holding a whole group in the reducer's memory, I'm not using the output stage: I write to Cloud Storage record by record as I stream them from the reducer.

Thanks.

Tom Kaitchuck

Mar 21, 2014, 6:39:06 PM
to app-engine-...@googlegroups.com
Make the framework do the sorting for you. 
If you emit a key that sorts in the order you want, it will be delivered to the reducer in that order. 

So if you made your key "<name>-<TimeStamp>", you would receive all the items with the same name consecutively, sorted by time within each name.
Or you could do something like "<1, 2, or 3>-<group>", where 1 means A > B, 2 means A = B, and 3 means A < B, so that these are all pre-grouped for you.
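
To illustrate the composite-key idea, here is a minimal sketch in plain Python, not tied to the MapReduce library. It assumes keys are compared as plain strings; `make_key` and the sample records are hypothetical. A fixed-width ISO timestamp is used because it sorts lexicographically in time order.

```python
from datetime import datetime


def make_key(a: int, b: int, when: datetime) -> str:
    """Build a '<group>-<timestamp>' key (hypothetical helper)."""
    # 1 means A > B, 2 means A = B, 3 means A < B.
    group = 1 if a > b else 2 if a == b else 3
    # Fixed-width ISO 8601 text sorts in the same order as the time it encodes.
    return f"{group}-{when.strftime('%Y-%m-%dT%H:%M:%S')}"


# Made-up sample rows: (date, name, A, B).
records = [
    (datetime(2014, 3, 2), "carol", 5, 1),
    (datetime(2014, 3, 1), "alice", 2, 2),
    (datetime(2014, 1, 15), "bob", 7, 3),
    (datetime(2014, 2, 10), "dave", 1, 9),
]

# Sorting the keys as strings groups the records first, then orders by date.
keys = sorted(make_key(a, b, when) for when, _name, a, b in records)
for key in keys:
    print(key)
```

If timestamps were written without zero padding (for example "3/2/2014"), string order would no longer match time order, so the key format matters.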


--
You received this message because you are subscribed to the Google Groups "Google App Engine Pipeline API" group.
To unsubscribe from this group and stop receiving emails from it, send an email to app-engine-pipeli...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Gilberto Torrezan Filho

Mar 22, 2014, 8:21:39 AM
to app-engine-...@googlegroups.com

Hi Tom, thanks for your answer.

But how would that work with multiple reducer shards?


Tom Kaitchuck

Mar 24, 2014, 2:28:18 PM
to app-engine-...@googlegroups.com
Keys are sorted only within a shard, since shards are totally independent. Keys will be spread across shards uniformly, but all items for a given key will end up on the same shard.
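
A rough simulation of that behaviour in plain Python may help. It assumes keys are assigned to shards by a stable hash, which is an assumption for illustration and not something confirmed about the library; `shard_for` and `shuffle` are hypothetical names.

```python
import zlib
from collections import defaultdict


def shard_for(key: str, num_shards: int) -> int:
    # Assumption: a stable hash picks the shard, so a key always lands on the same one.
    return zlib.crc32(key.encode("utf-8")) % num_shards


def shuffle(pairs, num_shards):
    """Group values by key per shard, with keys sorted inside each shard."""
    shards = defaultdict(lambda: defaultdict(list))
    for key, value in pairs:
        shards[shard_for(key, num_shards)][key].append(value)
    # Order is guaranteed only inside a shard, never across shards.
    return {shard: sorted(groups.items()) for shard, groups in shards.items()}


# Made-up mapper output: (key, value) pairs in arbitrary order.
pairs = [
    ("2014-03-02", "carol"),
    ("2014-01-15", "bob"),
    ("2014-03-02", "erin"),
    ("2014-02-10", "dave"),
    ("2014-03-01", "alice"),
]

result = shuffle(pairs, num_shards=2)
for shard, groups in sorted(result.items()):
    print(shard, groups)
```

With more than one shard, each reducer sees a sorted run of its own keys, but concatenating the shards' outputs does not give a globally sorted file.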

Tom Kaitchuck

Mar 24, 2014, 4:31:16 PM
to app-engine-...@googlegroups.com
To clarify: in this case you could have a single reducer write to 3 files, one each for A > B, A = B, and A < B, and have the mapper emit keys that are prefixed by the date so that all the data arrives in order.
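
As a sketch of that single-reducer approach in plain Python: in-memory `io.StringIO` buffers stand in for the three Cloud Storage files, and `sorted()` stands in for the framework's shuffle. The names `group_of` and `reduce_stream` and the sample data are hypothetical. Rows are written one at a time, so no group is ever held in memory.

```python
import csv
import io


def group_of(a: int, b: int) -> str:
    """Pick the output file for a record (hypothetical file names)."""
    if a > b:
        return "a_gt_b"
    if a == b:
        return "a_eq_b"
    return "a_lt_b"


def reduce_stream(sorted_pairs, outputs):
    """Stream date-ordered records to the output for their group, row by row."""
    writers = {name: csv.writer(f, lineterminator="\n") for name, f in outputs.items()}
    for date_key, (name, a, b) in sorted_pairs:
        writers[group_of(a, b)].writerow([date_key, name, a, b])


# Made-up mapper output: the key is the date, the value is (name, A, B).
mapper_output = [
    ("2014-03-02", ("carol", 5, 1)),
    ("2014-03-01", ("alice", 2, 2)),
    ("2014-01-15", ("bob", 7, 3)),
    ("2014-02-10", ("dave", 1, 9)),
]

outputs = {name: io.StringIO() for name in ("a_gt_b", "a_eq_b", "a_lt_b")}
# sorted() plays the role of the shuffle, which delivers keys in order.
reduce_stream(sorted(mapper_output, key=lambda pair: pair[0]), outputs)

for name, f in outputs.items():
    print(name)
    print(f.getvalue(), end="")
```

Because the records reach the reducer already ordered by date, each of the three files ends up sorted by date without any in-memory sort.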

If instead you had a large number of categories, or needed to do substantial other work in the reducer, then it might be better to chain MapReduce jobs to solve the problem. But in this case a single one will almost certainly be faster.