Again, thank you very much for your reply.
To test things, I restricted the map function so that it only emits the key "18". I logged the output of the map function (using write.table in append mode just before returning), and I also logged the values that the reduce function receives.
worker1 : "age" "quantity" "var1" "var2" "var3" "var4"
worker1 : "1747" 18 0 0 0 0 0
worker1 : "1844" 18 0 0 0 0 0
(...)
worker1 : "2523" 18 0 0 0 0 0
worker1 : "2620" 18 0 0 0 0 0
worker1 : "2717" 18 1 0.00148370210305262 0.000608698298688255 0.000243479319475302 9.13047448032383e-05
worker1 : "2814" 18 2 0.00298049399474725 0.00113612889898781 0.000575441909876944 0.00016230412842683
worker1 : "2911" 18 8 0.010596732180914 0.00586076249112006 0.00201278711816244 0.000947193937958797
worker1 : "3008" 18 26 0.0347183285625756 0.0170714433263493 0.00786437276831824 0.00345265145926166
worker1 : "3105" 18 53 0.0729443510332484 0.0309025329500461 0.0183259207029343 0.00646797201280035
worker1 : "3202" 18 40 0.0576033750709901 0.023798577494118 0.0100062200827541 0.00567920599291451
(...)
worker1 : "age" "quantity" "var1" "var2" "var3" "var4"
worker1 : "98" 18 61 0.106296886998641 0.030177506054699 0.0117106739913757 0.00585533699568787
worker1 : "195" 18 67 0.110259902394734 0.0389711723981387 0.0104556803995006 0.00950516399954602
worker1 : "292" 18 65 0.107460886224146 0.0378188526625749 0.0133749600879838 0.00507326072302835
worker1 : "389" 18 0 0 0 0 0
(...)
The header appears multiple times simply because each block is the output of a separate call to the map function.
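To illustrate why a repeated header marks a separate call, here is a minimal base R sketch. It is not my actual job code: `log_chunk`, the temporary log path, and the two small data frames are made up for the example. It only shows that write.table in append mode writes the column names again on every call.

```r
# Hypothetical helper: append a data frame to a log file, the way a map or
# reduce function might do just before returning.
log_chunk <- function(df, path) {
  # With append = TRUE and col.names = TRUE the header is written on every
  # call. R may warn about appending column names; that is expected here.
  suppressWarnings(
    write.table(df, file = path, append = TRUE, col.names = TRUE)
  )
}

log_path <- tempfile(fileext = ".log")

# Two made-up chunks standing in for the output of two separate calls.
chunk1 <- data.frame(age = c(18, 18), quantity = c(0, 1))
chunk2 <- data.frame(age = 18, quantity = 61)

log_chunk(chunk1, log_path)
log_chunk(chunk2, log_path)

# One header line per call, so counting headers counts calls.
lines <- readLines(log_path)
n_headers <- sum(lines == '"age" "quantity"')
print(n_headers)
```

So the number of header lines in the log equals the number of times the logging function was called.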
The reduce should only be called once, right? There is only one key. This is the relevant output from the MapReduce job (run with 30 map tasks):
Job Counters
    Launched map tasks=30
    Launched reduce tasks=1
Map-Reduce Framework
    Map input records=2618
    Reduce input groups=1
    Reduce shuffle bytes=12750
    Reduce input records=90
    Reduce output records=3
    Shuffled Maps =30
rmr
    reduce calls=31
My first question is: why are there 31 reduce calls? On the other hand, I see "Reduce input groups = 1", and that is correct, as there is only one key.
This is the log of the values (not the keys, which I included above) that the reduce function received:
(...)
worker1 : "X1" "X2" "X3" "X4" "X5"
worker1 : "1" 0 0 0 0 0
worker1 : "2" 0 0 0 0 0
(...)
worker1 : "28" 1 0.00148370210305262 0.000608698298688255 0.000243479319475302 9.13047448032383e-05
worker1 : "29" 2 0.00298049399474725 0.00113612889898781 0.000575441909876944 0.00016230412842683
worker1 : "30" 8 0.010596732180914 0.00586076249112006 0.00201278711816244 0.000947193937958797
worker1 : "31" 26 0.0347183285625756 0.0170714433263493 0.00786437276831824 0.00345265145926166
worker1 : "32" 53 0.0729443510332484 0.0309025329500461 0.0183259207029343 0.00646797201280035
worker1 : "33" 40 0.0576033750709901 0.023798577494118 0.0100062200827541 0.00567920599291451
(...)
worker1 : "X1" "X2" "X3" "X4" "X5"
worker1 : "1" 0 0 0 0 0
worker1 : "2" 0 0 0 0 0
worker1 : "3" 0 0 0 0 0
(...)
The values match, so that part is fine. However, why is the header shown multiple times? That can only mean that the reduce is being called multiple times for the same key. That is my problem, and I think it is why my results fluctuate. Moreover, when I count the number of times the header repeats, I get 31, which is exactly the number of reduce calls reported above. I would expect a single call, since there is only one key.
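For completeness, this is roughly how the headers can be counted. It is a sketch that assumes the log lines carry the "worker1 : " prefix shown above. `count_headers` is a hypothetical helper and `log_lines` is a shortened stand-in for the real log, not the full data.

```r
# Hypothetical helper: count how many log lines end with the header that
# write.table emits for the reduce values.
count_headers <- function(lines, header = '"X1" "X2" "X3" "X4" "X5"') {
  sum(endsWith(trimws(lines), header))
}

# Shortened stand-in for the reduce log (two calls' worth of output).
log_lines <- c(
  'worker1 : "X1" "X2" "X3" "X4" "X5"',
  'worker1 : "1" 0 0 0 0 0',
  'worker1 : "2" 0 0 0 0 0',
  'worker1 : "X1" "X2" "X3" "X4" "X5"',
  'worker1 : "1" 0 0 0 0 0'
)

n_calls <- count_headers(log_lines)
print(n_calls)
```

On my real log, the same kind of count gives the 31 mentioned above.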
Thank you very much.