Hello All,
I need to run a lasso regression in a MapReduce fashion. I modelled my code on the wordcount and other sample programs of rmr2.
I have set all the path variables required for the packages rhdfs and rmr2. I am also using glmnet for the regression.
This is the sample data set that I want to run the regression on:
1000 [0,0,1,1] 3.4
1000 [1,1,2,0] 4.5
1000 [4,3,5,2] 3.6
1000 [2,1,3,2] 2.6
1000 [1,2,1,1] 4.5
The second column holds the X values and the third column holds the Y value. I want the aggregation (formation of the matrix for regression) to happen in the reducer.
Thus the output of my mapper is:
Key: Column 1 (1000)
Value: Column 2, Column 3 (e.g. 0,0,1,1,3.4)
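To make the intended mapper output concrete, here is a minimal sketch in plain R, outside Hadoop. It assumes the three columns are tab-separated (as in the strsplit call in my mapper); parse_lines is a hypothetical helper name, not part of rmr2.

```r
# Hypothetical helper: turn raw input lines into keys and "x1,...,xn,y" values.
# Assumes tab-separated columns: key, bracketed X values, Y value.
parse_lines <- function(lines) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  key <- vapply(fields, function(f) f[1], character(1))
  # Strip the square brackets around the X values, line by line
  x <- vapply(fields, function(f) gsub("\\[|\\]", "", f[2]), character(1))
  y <- vapply(fields, function(f) f[3], character(1))
  list(key = key, val = paste(x, y, sep = ","))
}

sample_lines <- c("1000\t[0,0,1,1]\t3.4", "1000\t[1,1,2,0]\t4.5")
parsed <- parse_lines(sample_lines)
parsed$key  # "1000" "1000"
parsed$val  # "0,0,1,1,3.4" "1,1,2,0,4.5"
```

Each line is handled separately here, so every row keeps its own X values.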
For the sake of convenience I have chosen only one distinct value for the key, so that only one reducer is used.
The mapper output is as required. However, to create the matrix for regression I need to be able to access each of these keys individually, which I am not able to do. Kindly suggest how to do this.
I have been able to form one continuous vector ($total), but the ordering varies, so my matrix is not formed properly. I believe there is a gap in my understanding of the architecture of rmr programs.
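What I am trying to achieve in the reducer is roughly the following, sketched in plain R outside Hadoop. It assumes every value has the same number of comma-separated fields and that the last field is Y; values_to_xy is a hypothetical helper name. I am not sure this is how rmr2 expects it to be done.

```r
# Hypothetical helper: rebuild the regression inputs from values like "0,0,1,1,3.4".
# Assumes equal-length values with Y as the last field.
values_to_xy <- function(vals) {
  rows <- lapply(strsplit(vals, ",", fixed = TRUE), as.numeric)
  mat <- do.call(rbind, rows)  # one matrix row per value, so fields cannot interleave
  p <- ncol(mat) - 1
  list(x = mat[, seq_len(p), drop = FALSE], y = mat[, p + 1])
}

xy <- values_to_xy(c("0,0,1,1,3.4", "1,1,2,0,4.5", "4,3,5,2,3.6"))
dim(xy$x)  # 3 4
xy$y       # 3.4 4.5 3.6
# With glmnet loaded, the fit would then be glmnet(xy$x, xy$y).
```

Building the matrix one row per value, instead of appending everything to a single long vector, keeps each observation together regardless of the order in which the values arrive.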
1. Environment: Ubuntu 14.04, Apache Hadoop 2.6.0 (pseudo-distributed mode), R 3.1.2, latest packages of rmr2 and rhdfs.
2. For the mapper, is the input read line by line or all at once? How can I access the individual values in a set of values in the reducer?
3. The length of the values received in the mapper is always 2, but the number of values passed is 5. Is it because of the number of splits in the mapper?
Kindly suggest where the mistake is.
Below is my code.
library(rmr2)
library(glmnet)

lasso =
  function(input, output) {
    l.map =
      function(., lines) {
        mat = matrix(unlist(strsplit(lines, "\t")), nrow = length(lines), byrow = TRUE)
        # Get rid of the brackets in the input
        temp = strsplit((strsplit(mat[, 2], "[[]"))[[1]][2], "[]]")
        # The output after this step is e.g. "1000" "0,0,1,1,3.4"
        keyval(mat[, 1], paste(temp, mat[, 3], sep = ","))
      }
    l.reduce =
      function(key2, val2) {
        total = c()
        # length is seen as 2, but when run in a loop it can access all elements. But val2[3] is NA
        for (i in seq_along(val2)) {
          a3 = as.numeric(unlist(strsplit(val2[i], ",")))  # create a vector with all the numeric values
          total = append(total, a3)                         # continuously append
        }
        mat1 = matrix(total, ncol = 5, byrow = TRUE)
        v = mat1[, 5]  # This shows NA
        glmnet(mat1, v)
      }
    mapreduce(
      input = input,
      output = output,
      input.format = "text",
      map = l.map,
      reduce = l.reduce,
      combine = TRUE)
  }
Thanks in advance! Also, is there any other resource for understanding the flow of a program in rmr2?
Pranav