Sort_values

Agustin Caminero

May 21, 2019, 10:39:34 AM

to mrjob

Dear all,

I have a question about SORT_VALUES. I have two datasets, one with countries and one with clients. Each line of the countries file has the structure (country name, country code), for example:

South Georgia and the South Sandwich Islands,GS
South Sudan,SS
Spain,ES
Sri Lanka,LK
Sudan,SD
Suriname,SR

Each line of the clients file has the client name, a rating and the country code, for example:

Manda Wingate ,regular,SI
Anna Rappold ,regular,SB
Albina Lamore ,malo,SO
Carolyn Machado ,bueno,ZA
Jeni Espinoza ,bueno,GS
Charisse Salzman ,bueno,SS

My program counts how many good clients (clients with a "bueno" rating) there are per country. It has a mapper, a combiner and a reducer. The problem is that I set SORT_VALUES to True, but the values for each key do not arrive sorted in the reducer.
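To make the goal concrete, here is a plain-Python sketch of the result I am after, computed only from the sample lines above. It is not the MapReduce job itself; the variable names (`country_lines`, `client_lines`, `names`, `counts`) are just for this illustration.

```python
from collections import Counter

# Sample lines copied from the two files above.
country_lines = [
    "South Georgia and the South Sandwich Islands,GS",
    "South Sudan,SS",
    "Spain,ES",
    "Sri Lanka,LK",
    "Sudan,SD",
    "Suriname,SR",
]
client_lines = [
    "Manda Wingate ,regular,SI",
    "Anna Rappold ,regular,SB",
    "Albina Lamore ,malo,SO",
    "Carolyn Machado ,bueno,ZA",
    "Jeni Espinoza ,bueno,GS",
    "Charisse Salzman ,bueno,SS",
]

# country code -> country name
names = {}
for line in country_lines:
    name, code = line.rsplit(",", 1)
    names[code] = name

# country code -> number of "bueno" clients
counts = Counter()
for line in client_lines:
    _client, rating, code = line.split(",")
    if rating == "bueno":
        counts[code] += 1

for code in sorted(counts):
    # Fall back to the code when the country is not in the sample.
    print(names.get(code, code), counts[code])
```

In the real job the same join has to happen per key in the reducer, which is why I need the country record ('A') to arrive before the client counts ('X').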

My code is below (it is also attached to this message):

import sys, os, re
from mrjob.job import MRJob
from mrjob.step import MRStep

class MRJoin(MRJob):

    SORT_VALUES = True

    def mapper(self, _, line):
        splits = line.rstrip("\n").split(",")
        if len(splits) == 2:  # countries file: name, code
            symbol = 'A'  # 'A' sorts before 'X': countries before clients
            country2digit = splits[1]
            yield country2digit, (symbol, splits[0])
        else:  # clients file: name, rating, code
            symbol = 'X'
            country2digit = splits[2]
            if splits[1] == 'bueno':
                yield country2digit, (symbol, 1)

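    def combiner(self, key, values):
        bueno = 0
        for value in values:
            if value[0] == 'A':
                # pass country records through unchanged
                yield key, ('A', value[1])
            else:
                # add the partial count instead of counting records,
                # since a combiner may also receive combiner output
                bueno = bueno + value[1]

        if bueno > 0:
            yield key, ('X', bueno)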

    def reducerSimple(self, key, values):
        for value in values:
            yield key,value

    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   combiner=self.combiner,
                   reducer=self.reducerSimple)
        ]


if __name__ == '__main__':
    MRJoin.run()

The output I get when I run it on Hadoop is:

"ES"   ["X", 3]
"ES"   ["A", "Spain"]

When I run it locally, the order is fine.
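For reference, this is the order I expect, sketched in plain Python without mrjob. It assumes that values are compared by their JSON-encoded text, which is my understanding of how SORT_VALUES orders them; I may be wrong about that.

```python
import json

# Values for key "ES", in the order they appear in the Hadoop output.
values = [["X", 3], ["A", "Spain"]]

# Assumption: values are ordered by their JSON-encoded text,
# so '["A", ...' should come before '["X", ...'.
encoded = sorted(json.dumps(v) for v in values)
expected_order = [json.loads(e) for e in encoded]

print(expected_order)
```

Under that assumption, ["A", "Spain"] should reach the reducer before ["X", 3], which is what I see locally but not on Hadoop.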

Could anybody tell me why "X" comes before "A" even though SORT_VALUES is True? Am I doing something wrong?

Thanks for your help.

Agustin