Dear all,
I have a question about SORT_VALUES in mrjob. I have two datasets, one with countries and one with clients. Each line of the countries file has the form (country name, country code), for example:
South Georgia and the South Sandwich Islands,GS
South Sudan,SS
Spain,ES
Sri Lanka,LK
Sudan,SD
Suriname,SR
Each line of the clients file has the client name, a rating and the country code, for example:
Manda Wingate ,regular,SI
Anna Rappold ,regular,SB
Albina Lamore ,malo,SO
Carolyn Machado ,bueno,ZA
Jeni Espinoza ,bueno,GS
Charisse Salzman ,bueno,SS
My program calculates how many good clients (clients with a "bueno" rating) there are per country. It has a mapper, a combiner and a reducer. The problem is that I am setting SORT_VALUES to True, but the values for each key do not arrive sorted in the reducer.
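For reference, the intended result can be sketched in plain Python, without mrjob or Hadoop. This is only an in-memory illustration built on a few of the sample lines above; the function name count_good_clients and the variable names are hypothetical and are not part of the job itself.

```python
from collections import Counter

# A few of the sample lines from the two files shown above.
countries_lines = [
    "South Georgia and the South Sandwich Islands,GS",
    "South Sudan,SS",
    "Spain,ES",
]
clients_lines = [
    "Albina Lamore ,malo,SO",
    "Carolyn Machado ,bueno,ZA",
    "Jeni Espinoza ,bueno,GS",
    "Charisse Salzman ,bueno,SS",
]


def count_good_clients(countries, clients):
    """Return {country code: (country name, number of 'bueno' clients)}."""
    # Map each country code to its name.
    names = {}
    for line in countries:
        name, code = line.rstrip("\n").split(",")
        names[code] = name

    # Count the clients rated 'bueno' per country code.
    good = Counter()
    for line in clients:
        _client, rating, code = line.rstrip("\n").split(",")
        if rating == "bueno":
            good[code] += 1

    # Keep only known countries that have at least one good client.
    return {code: (names[code], good[code]) for code in names if good[code] > 0}


result = count_good_clients(countries_lines, clients_lines)
for code, (name, total) in sorted(result.items()):
    print(code, name, total)
```

In the MapReduce version the same join depends on the country record ('A') reaching the reducer before the client counts ('X') for each key, which is why the value order matters.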
My code is below (it is also attached to this message):
import sys, os, re
from mrjob.job import MRJob
from mrjob.step import MRStep


class MRJoin(MRJob):
    SORT_VALUES = True

    def mapper(self, _, line):
        splits = line.rstrip("\n").split(",")
        if len(splits) == 2:  # countries
            symbol = 'A'  # countries before clients
            country2digit = splits[1]
            yield country2digit, (symbol, splits[0])
        else:  # clients
            symbol = 'X'
            country2digit = splits[2]
            if splits[1] == 'bueno':
                yield country2digit, (symbol, 1)

    def combiner(self, key, values):
        bueno = 0
        for value in values:
            if value[0] == 'A':
                yield key, ('A', value[1])
            else:
                bueno = bueno + 1
        if bueno > 0:
            yield key, ('X', bueno)

    def reducerSimple(self, key, values):
        for value in values:
            yield key, value

    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   combiner=self.combiner,
                   reducer=self.reducerSimple)
        ]


if __name__ == '__main__':
    MRJoin.run()
The output that I get when I run it on Hadoop is:
"ES" ["X", 3]
"ES" ["A", "Spain"]
When I run it locally, the order is fine.
Could anybody tell me why "X" comes before "A" considering that SORT_VALUES is true? Am I doing something wrong?
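To make the expected order concrete, it can be reproduced in plain Python. This is only a sketch, assuming that values are compared by their JSON-encoded text; it does not use mrjob or Hadoop, and the variable names are illustrative.

```python
import json

# The two values emitted for key "ES", in the order Hadoop returned them.
values = [("X", 3), ("A", "Spain")]

# Assumption: values are ordered by their JSON-encoded text.
encoded = sorted(json.dumps(value) for value in values)
decoded = [json.loads(text) for text in encoded]

for value in decoded:
    print(json.dumps("ES"), json.dumps(value))
```

Under that assumption the "A" record sorts before the "X" record, which is the order seen in the local run.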
Thanks for your help
Agustin