Hi Philippe,
I think the first thing we need to know is what distribution you are aiming for. Without specifying that, implementing a Bernoulli sample is a fairly trivial undertaking. rmr2 has a function rmr.sample which has two methods: one is the fastest but comes without statistical guarantees (as in "give me any items"), and the other is Bernoulli. In the newer package plyrmr there is the function sample, which also has these two methods, plus uniform without replacement, using a priority technique: you just assign a random number from the uniform distribution to each item and then perform a top-k selection on those priorities, which I guess involves maintaining and merging partial top-k "reservoirs", even if they are not normally called that way.
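To make the "fairly trivial" part concrete, here is a minimal sketch of map-side Bernoulli sampling with plain mapreduce; the inclusion probability p and the toy input are made up for illustration:

library(rmr2)
p = 0.1  # made-up inclusion probability
# keep each record independently with probability p; map-only, no reduce needed
from.dfs(
  mapreduce(
    to.dfs(1:1000),
    map = function(k, v) keyval(NULL, v[runif(length(v)) < p])))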
For uniform sampling with replacement, I found this article about achieving that in a streaming context (memory << data size, single or small number of sequential passes over the data); streaming algorithms are usually a good inspiration for map reduce algorithms, even though they are sequential. The article you mentioned has the priority algorithm as Algorithm 3, but surprisingly insists on improving the sorting step, which is absolutely unnecessary, as they themselves point out. I am looking at your implementation and, besides the identifiers in French, which are a challenge in and of themselves, I am not sure why you'd want to select a random real number as the key: unless you have ties, each group will have exactly one element, in which case you might as well complete your processing map-side.
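Before the MapReduce version, for intuition, the priority technique fits in a couple of lines of plain R (x and k here are made up):

x = rnorm(1000)
k = 10
# tag each item with a uniform priority, then do a top-k selection on the
# priorities: a uniform sample of size k without replacement
x[order(runif(length(x)))[1:k]]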
This is how I would write a selection procedure:
library(rmr2)
# keep the 6 smallest values seen so far (head() defaults to n = 6)
selfun = function(k, v) keyval(1, head(sort(v)))
from.dfs(mapreduce(to.dfs(runif(100)), map = selfun, reduce = selfun, combine = TRUE))
Now, turning that into a sampling procedure is an incremental step:
library(rmr2)
library(plyr)  # for arrange()
# arrange() works on data frames; this helper extends it to matrices
arrange.matrix = function(x, ...) as.matrix(arrange(as.data.frame(x), ...))
n = 4
from.dfs(
  mapreduce(
    to.dfs(matrix(1:100, ncol = 10)),
    # give each row a uniform priority, keep the n rows with the smallest ones
    map = function(k, v) {
      v = cbind(priority = runif(nrow(v)), v)
      keyval(1, arrange.matrix(v, priority)[1:n, ])},
    # merging partial selections is the same top-n operation
    reduce = function(k, v) keyval(1, arrange.matrix(v, priority)[1:n, ]),
    combine = TRUE))
As you can see, there is a single distinct key, 1, and therefore there is a single reduce call. To make that work, we have to have a combiner, which requires the combine operation to be associative and commutative; that is true here. With a single key, the vectorized reduce option doesn't help, because its effect is to process multiple keys in the same call. I would agree that that feature has not been explained as well as it could have been, but the problem is that to take advantage of it one needs not only to understand it but also to write an efficient multi-key reducer, which is not easy either (dplyr comes to the rescue, though, if you are manipulating data frames). So there was less emphasis on documenting it. I am working on changes in the new package plyrmr that take advantage of that feature without the user having to know about it. Maybe that's what vectorized reduce was for all along: something to develop on top of, rather than for the end user.
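For the record, here is a sketch of what a multi-key vectorized reducer could look like with dplyr doing the grouping; the data and column names are made up, and I am assuming the vectorized.reduce argument to mapreduce with keys and values aligned element-wise in each reduce call:

library(rmr2)
library(dplyr)
# with vectorized.reduce = TRUE the reduce function sees many keys per call
# and has to do the grouping itself; dplyr makes that part easy
from.dfs(
  mapreduce(
    to.dfs(keyval(sample(letters[1:5], 100, replace = TRUE), runif(100))),
    reduce = function(k, v) {
      sums =
        data.frame(k = k, v = v, stringsAsFactors = FALSE) %>%
        group_by(k) %>%
        summarize(v = sum(v))
      keyval(sums$k, sums$v)},
    vectorized.reduce = TRUE))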
Antonio