Hi Philippe,
I think the first thing we need to know is what distribution you are aiming for. Without specifying that, implementing a Bernoulli sample is a fairly trivial undertaking. rmr2 has a function rmr.sample which has two methods: one is the fastest but comes without statistical guarantees (as in "give me any items"), and the other is Bernoulli. In the newer package plyrmr there is the function sample, which also has these two methods, plus uniform without replacement, using a priority technique: you just assign a random number from the uniform distribution to each item and then perform a top-k selection on those priorities, which I guess involves maintaining and merging partial top-k "reservoirs", even if they are not normally called that way.
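To make the "fairly trivial" part concrete, here is a minimal sketch of map-side Bernoulli sampling with plain mapreduce; the inclusion probability p and the toy input are made up for illustration:

library(rmr2)
p = 0.1  # made-up inclusion probability
# keep each record independently with probability p; map-only, no reduce needed
from.dfs(
  mapreduce(
    to.dfs(1:1000),
    map = function(k, v) keyval(NULL, v[runif(length(v)) < p])))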
For uniform sampling with replacement, I found this article about achieving that in a streaming context (memory << data size, single or small number of sequential passes over the data); streaming algorithms are usually a good inspiration for map reduce algorithms, even though they are sequential. The article you mentioned has the priority algorithm as Algorithm 3, but surprisingly insists on improving the sorting step, which is absolutely unnecessary, as they themselves point out. I am looking at your implementation and, besides the identifiers in French, which are a challenge in and of themselves, I am not sure why you'd want to select a random real number as the key: unless you have ties, each group will have exactly one element, in which case you might as well complete your processing map-side.
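Before the MapReduce version, for intuition, the priority technique fits in a couple of lines of plain R (x and k here are made up):

x = rnorm(1000)
k = 10
# tag each item with a uniform priority, then do a top-k selection on the
# priorities: a uniform sample of size k without replacement
x[order(runif(length(x)))[1:k]]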
This is how I would write a selection procedure:
library(rmr2)
# keep the 6 smallest values seen so far (head() defaults to n = 6)
selfun = function(k, v) keyval(1, head(sort(v)))
from.dfs(mapreduce(to.dfs(runif(100)), map = selfun, reduce = selfun, combine = TRUE))
Now, turning that into a sampling procedure is an incremental step:
library(rmr2)
library(plyr)  # for arrange()
# arrange() works on data frames; this helper extends it to matrices
arrange.matrix = function(x, ...) as.matrix(arrange(as.data.frame(x), ...))
n = 4
from.dfs(
  mapreduce(
    to.dfs(matrix(1:100, ncol = 10)),
    # give each row a uniform priority, keep the n rows with the smallest ones
    map = function(k, v) {
      v = cbind(priority = runif(nrow(v)), v)
      keyval(1, arrange.matrix(v, priority)[1:n, ])},
    # merging partial selections is the same top-n operation
    reduce = function(k, v) keyval(1, arrange.matrix(v, priority)[1:n, ]),
    combine = TRUE))
As you can see, there is a single distinct key, 1, and therefore there is a single reduce call. To make that work, we have to have a combiner, which requires the combine operation to be associative and commutative; that is true here. With a single key, the vectorized reduce option doesn't help, because its effect is to process multiple keys in the same call. I would agree that that feature has not been explained as well as it could have been, but the problem is that to take advantage of it one needs not only to understand it but also to write an efficient multi-key reducer, which is not easy either (dplyr comes to the rescue, though, if you are manipulating data frames). So there was less emphasis on documenting it. I am working on changes in the new package plyrmr that take advantage of that feature without the user having to know about it. Maybe that's what vectorized reduce was for all along: something to develop on top of, rather than for the end user.
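For the record, here is a sketch of what a multi-key vectorized reducer could look like with dplyr doing the grouping; the data and column names are made up, and I am assuming the vectorized.reduce argument to mapreduce with keys and values aligned element-wise in each reduce call:

library(rmr2)
library(dplyr)
# with vectorized.reduce = TRUE the reduce function sees many keys per call
# and has to do the grouping itself; dplyr makes that part easy
from.dfs(
  mapreduce(
    to.dfs(keyval(sample(letters[1:5], 100, replace = TRUE), runif(100))),
    reduce = function(k, v) {
      sums =
        data.frame(k = k, v = v, stringsAsFactors = FALSE) %>%
        group_by(k) %>%
        summarize(v = sum(v))
      keyval(sums$k, sums$v)},
    vectorized.reduce = TRUE))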
Antonio