rmr2 and lasso regression


Pranav Kabra

Mar 5, 2015, 12:58:49 AM
to rha...@googlegroups.com
Hello All,

I need to run a lasso regression in MapReduce fashion. I modelled my code on the wordcount and other sample programs of rmr2.
I have set all the path variables required for the packages rhdfs and rmr2. I am also using glmnet for the regression.

This is the sample data set that I want to run the regression on.

1000 [0,0,1,1] 3.4
1000 [1,1,2,0] 4.5
1000 [4,3,5,2] 3.6
1000 [2,1,3,2] 2.6
1000 [1,2,1,1] 4.5

The second column holds the X values and the third column holds the Y values. I want the aggregation (forming the matrix for the regression) to happen in the reducer.

Thus the output of my mapper is:
Key: Column 1 (1000)
Value: Column 2, Column 3 (e.g. 0,0,1,1,3.4)

For the sake of convenience I have chosen only one distinct value for the key, so that only one reducer is used.

The mapper output is as required. However, to create the matrix for the regression I need to be able to access each of the values for a key individually, which I am not able to do. Kindly suggest how.

I have been able to form a continuous vector ($total), but the ordering varies, so my matrix is not formed properly. I believe there is a gap in my understanding of the architecture of rmr programs.

1. Environment: Ubuntu 14.04, Apache Hadoop 2.6.0 (pseudo-distributed mode), R 3.1.2, latest packages of rmr2 and rhdfs.
2. For the mapper, is the input read line by line or all at once? How do I access individual values in a set of values in the reducer?
3. The length of the values received in the mapper is always 2, but the number of values passed is 5. Is that because of the number of splits in the mapper?


Kindly suggest where the mistake is.
  
Below is my code. 

library(rmr2)
library(glmnet)

lasso =
  function(
    input,
    output) {
    l.map =
      function(., lines) {
        mat = matrix(unlist(strsplit(lines, "\t")), nrow = length(lines), byrow = TRUE)
        temp = gsub("[][]", "", mat[, 2])  # Get rid of the brackets in the input, on every row
        keyval(mat[, 1], paste(temp, mat[, 3], sep = ","))  # The output after this step is e.g. "1000" "0,0,1,1,3.4"
      }
    l.reduce =
      function(key2, val2) {
        total = c()
        for (i in seq_along(val2)) {  # length is seen as 2 but when run in a loop it can access all elements. But val2[3] is NA
          a3 = as.numeric(unlist(strsplit(val2[i], ",")))  # Creating a vector with all the numeric values
          total = append(total, a3)  # continuously appending
        }
        mat1 = matrix(total, ncol = 5, byrow = TRUE)
        v = mat1[, 5]  # This shows NA
        glmnet(mat1[, 1:4], v)  # columns 1-4 are X, column 5 is Y
      }
    mapreduce(
      input = input,
      output = output,
      input.format = "text",
      map = l.map,
      reduce = l.reduce,
      combine = TRUE)
  }


Thanks in advance! Also, is there any other resource for understanding the flow of a program in rmr2?

Pranav

Antonio Piccolboni

Mar 5, 2015, 1:43:05 AM
to RHadoop Google Group
Hi, 
I am not sure I understand the problem. The map function is vectorized, in the sense that at each call it receives multiple keys and multiple values (the same number of each), an unspecified portion of the data: say, a vector with 10 elements and a data frame with 10 rows. Exactly which portion is not something you can count on, although in the actual implementation the splits are consecutive chunks. I am not sure I appreciate the difference between the "length of values" and the "number of values". As for one key vs. multiple keys, you don't explain what you need to do, so I am not sure. As for the order, sorry, but there's no order guarantee. You may want to supplement your matrices with additional information concerning the ordering. I am not sure why the order of the data set would matter here.
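To make the vectorized calling convention concrete, here is a minimal base-R sketch that needs neither Hadoop nor rmr2. The helper name `parse_batch` is hypothetical, and the tab-separated layout is an assumption taken from the sample data in the first post. The function handles whatever batch of lines it is handed in a single call, which is the shape a map function is expected to have.

```r
# Hypothetical helper: parse a whole batch of input lines at once,
# the way a vectorized map function receives them.
parse_batch <- function(lines) {
  # One row per line, one column per tab-separated field (assumed layout).
  fields <- do.call(rbind, strsplit(lines, "\t", fixed = TRUE))
  # Drop "[" and "]" on every row of the second field.
  x <- gsub("[][]", "", fields[, 2])
  list(key = fields[, 1],
       val = paste(x, fields[, 3], sep = ","))
}

batch <- c("1000\t[0,0,1,1]\t3.4",
           "1000\t[1,1,2,0]\t4.5")
out <- parse_batch(batch)
print(out$val)
```

The same function works unchanged whether the batch holds two lines or two thousand, so it does not matter how the input happens to be split.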


Antonio

Pranav Kabra

Mar 6, 2015, 3:55:36 AM
to rha...@googlegroups.com, ant...@piccolboni.info
Hi Antonio,

Thanks for the prompt reply. I will try to explain my problem better. 

At the mapper I am receiving the dataset that I posted earlier. I want the key-value pairs to be formed as key: 1st column ("1000") and value: 2nd column, 3rd column ("0,0,1,1,3.4"). This part is executing.

With this dataset, as we can see, all the keys emitted by the mapper are the same, and thus all the key-value pairs would be handled by a single reducer.

At the reducer, I want to split the value into two portions to form my X and Y for the regression.

For example, the mapper will emit: "1000" "0,0,1,1,3.4", "1000" "1,1,2,0,4.5", "1000" "4,3,5,2,3.6", "1000" "2,1,3,2,2.6", "1000" "1,2,1,1,4.5"

At the reducer we will get a set of these values, e.g.: "1000", "0,0,1,1,3.4" "1,1,2,0,4.5" "4,3,5,2,3.6" "2,1,3,2,2.6" "1,2,1,1,4.5"

Now I want to create a matrix using these values at the reducer. 

My X matrix would be:

[0 0 1 1
 1 1 2 0
 4 3 5 2
 2 1 3 2
 1 2 1 1]

and my Y vector would be [3.4, 4.5, 3.6, 2.6, 4.5].


Thus the ordering for me is important so that I can form these matrices and run the regression.

Is it possible to run such a scenario?

Thanks,
Pranav

Antonio Piccolboni

Mar 6, 2015, 12:27:44 PM
to RHadoop Google Group
It isn't built in, but it may be possible to implement it. Each row of data has to carry information about its position, say in an additional column. Then you can sort it again in the reducer. But the exact order in which records arrive at the reducer is not guaranteed. Because rmr2 tries to batch small records together, small examples may be misleading, in that order will be preserved up to a certain size. With multiple map processes, this will not be guaranteed. In any classifier I know of, though, you can rearrange the order of the data points as long as predictors and predicted variables are rearranged the same way, which seems built into the way you represent your data. It seems like your algorithm is the exception.
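As a rough illustration of both points, the base-R sketch below tags each row with its position, shuffles the rows to imitate an arbitrary arrival order, and sorts them back. The column name `row.id` is an assumption, the data are cut down to two predictors for brevity, and `lm` stands in for glmnet only so that the sketch runs without extra packages.

```r
# Rows tagged with their original position (row.id is an assumed extra column).
d <- data.frame(row.id = 1:5,
                x1 = c(0, 1, 4, 2, 1),
                x2 = c(0, 1, 3, 1, 2),
                y  = c(3.4, 4.5, 3.6, 2.6, 4.5))

# Imitate records arriving at the reducer in an arbitrary order.
arrived <- d[c(3, 5, 1, 4, 2), ]

# Restore the original order from the carried position.
restored <- arrived[order(arrived$row.id), ]
rownames(restored) <- NULL

# A fit only needs each y to stay with its own x values, so the
# shuffled rows give the same coefficients as the original rows.
fit_original <- coef(lm(y ~ x1 + x2, data = d))
fit_arrived  <- coef(lm(y ~ x1 + x2, data = arrived))
print(all.equal(fit_original, fit_arrived))
```

If the fit does not depend on row order, as here, the `row.id` column and the re-sort can be dropped altogether.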

Antonio

Pranav Kabra

Mar 9, 2015, 11:44:36 PM
to rha...@googlegroups.com, ant...@piccolboni.info
Thanks for the explanation. Here is a similar scenario, a generalised case.

The reducer receives K1, { "V1_V2_V3" "V4_V5_V6" "V7_V8_V9" }

I want to create a matrix such that V1, V2, V3 appear in the same row; it doesn't matter whether that is the first, second or third row, so the ordering of rows is not important. However, each combination (V1, V2, V3 or V4, V5, V6) should stay together in one row.

This requires me to split on "_" and then build the matrix. My issue is that once I split the data, I am not able to control the ordering of the elements. Kindly suggest how I can build such a matrix after splitting the values at the reducer.
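For the string layout described here, one possible base-R sketch is to turn each string into its own row before binding the rows together, so that the pieces of one string can never end up in different rows. The numeric strings are stand-ins for "V1_V2_V3" and so on, and plain R is used rather than anything rmr2-specific.

```r
# Stand-ins for "V1_V2_V3", "V4_V5_V6", "V7_V8_V9".
vals <- c("1_2_3", "4_5_6", "7_8_9")

# strsplit returns one character vector per input string.
parts <- strsplit(vals, "_", fixed = TRUE)

# Convert each piece to numeric and stack the pieces as rows.
m <- do.call(rbind, lapply(parts, as.numeric))
print(m)
```

Whatever order the strings arrive in, only the order of the rows changes, not their contents.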

Thanks.

Pranav

Antonio Piccolboni

Mar 10, 2015, 2:56:21 PM
to rha...@googlegroups.com, ant...@piccolboni.info
rmr2 exists so that you don't have to deal with strings, separators, splitting and whatnot. If you are happy with that level of programming, just use Hadoop streaming directly. rmr2 is only overhead unless you use its features. If the reducer receives k, v1 ... v9, it's hopeless, because you have lost the row information. So that information has to be sent in from the mapper, which you don't show. So assume that the mapper returns


keyval(1, data.frame(row.ids, values))

then the reducer gets, as its second argument, a data frame with row ids and values. Now you need to put this into a wide format, which is regular R programming and not covered in this group. I don't see where the rmr-related problem is here.
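A minimal sketch of that long-to-wide step in base R follows. It assumes the data frame also carries a column position, here called `col.ids`, which is not part of the example above; without some such column the position of a value within its row cannot be recovered.

```r
# Long format as it might reach the reducer, in arbitrary order.
long <- data.frame(row.ids = c(2, 1, 2, 1, 1, 2),
                   col.ids = c(1, 1, 3, 3, 2, 2),   # assumed extra column
                   values  = c(1, 0, 2, 1, 0, 1))

# Allocate the wide matrix, then place each value at (row.ids, col.ids).
wide <- matrix(NA_real_, nrow = max(long$row.ids), ncol = max(long$col.ids))
wide[cbind(long$row.ids, long$col.ids)] <- long$values
print(wide)
```

Indexing a matrix with a two-column matrix of (row, column) pairs assigns each value to its own cell, so the arrival order of the records has no effect on the result.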