Primary sort not happening in MacOS Catalina and Ubuntu 18.04.3

191 views
Skip to first unread message

Julius Raškevičius

unread,
Dec 4, 2019, 1:49:52 PM12/4/19
to mrjob
Hi, 
to test MrJob (v0.7.0) I am using a simple word frequency counter program (Python 3.8 and 3.7). I run this both default runner "-r inline" and simulation "-r local":

"""The classic MapReduce job: count the frequency of words.
"""
from mrjob.job import MRJob

class MRWordFreqCount(MRJob):

def mapper(self, _, line):
yield (line, 1)

def reducer(self, keyword, counts):
total_tally = sum(counts)
yield (keyword, total_tally)

if __name__ == '__main__':
MRWordFreqCount.run()


Here is my input:
a
c
d
z
a
z

I expect the output to be this:
"a" 2
"c" 1
"d" 1
"z" 2

Instead on MacOS Catalina the output is this:
"d" 1
"z" 2
"a" 2
"c" 1

The output is the same on Ubuntu 18.04.3.

However, on Windows 10 the output is correct:
"a" 2
"c" 1
"d" 1
"z" 2

From what I read in Hadoop documentation, I should be getting a primary sort on keyword, like in Windows case?

Your help is appreciated!

Dave Marin

unread,
Dec 4, 2019, 1:55:10 PM12/4/19
to mr...@googlegroups.com
Nope. By default Hadoop partitions your data with a hash table, so to simulate this, the local and inline runners deliberately try *not* to output data in sorted order.

IIRC, on Windows, the sort utility doesn’t have the ability to sort in reverse order, so you get output in sorted order as a side-effect of using Windows sort.

If you *do* want your output sorted, ask for it explicitly by adding SORT_VALUES = True to your mrjob class.

-Dave

--
You received this message because you are subscribed to the Google Groups "mrjob" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mrjob+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/mrjob/fa126fb1-cb2f-47db-98c0-267e7e6cabc6%40googlegroups.com.

Julius Raškevičius

unread,
Dec 4, 2019, 2:51:02 PM12/4/19
to mrjob
I am new to MapReduce, but it seems that you are talking about secondary sort and I am taking about primary sort. From MrJob documentation:

MRJob.SORT_VALUES = None

Set this to True if you would like reducers to receive the values associated with any key in sorted order (sorted by their encoded value). Also known as secondary sort.

So this setting takes care of the reducer input. In my case the reducer input is intermediate key-value pair (<word>, 1). If I was emitting something else than "1" as the value of this key-value pair, I would observe that value order is random in the reducer, because, like you said, shuffling into reducers depends on the hash function done on the key, not the value.

Instead, I observe that the reducer output is not ordered by they key. And from what I read, it should be. 

This is my understanding of how the process unfolds:
0. Intermediate key-value pairs are generated by the mapper().
1. Partitioning happens by calculating which reducer an intermediate key-value pair should be sent to using hash(key) % n_reducers.
2. Each partition is sorted by key (using quick sort). This is done because reducer node requires input to be ordered key-wise so that it knows when the last intermediate key-value pair arrives for a specific key and reducer() can be called. 
3. Intermediates are shuffled to reducer nodes.
4. Once all reducers finish their work, an output from each is passed to a final node. This node merges a locally pre-sorted reducer output and returns it to the user. This is done using merge sort.

If I use a bigger dataset, I see big chunks of primary-sorted lines that are "out of order", like they haven't been merged correctly. As you described, value has no effect on the ordering of the lines, but this is not my issue. Please correct me if you see something wrong in this.

Hopefully I got the process right, but 
To unsubscribe from this group and stop receiving emails from it, send an email to mr...@googlegroups.com.

Dave Marin

unread,
Dec 4, 2019, 2:53:34 PM12/4/19
to mr...@googlegroups.com
You know, you may be right, local and inline mode might be over-zealous about not sorting output. I’ll look into this.

And yes, you are correct about the difference between primary and secondary sort. I just mean that turning on secondary sort will also force primary sort. :)

-Dave

To unsubscribe from this group and stop receiving emails from it, send an email to mrjob+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/mrjob/b776d5fb-fc5b-4706-98c0-c333a5357adf%40googlegroups.com.

Reply all
Reply to author
Forward
0 new messages