error while running lm in map/red script


sandy

Oct 24, 2011, 10:28:57 AM

to Bangalore R Users - BRU

I was trying to apply the lm method to my data. The data and the related
map/reduce script are given below, and I am seeing the following error
when I run the script.

Error Message:-
-------------------------

Error in `[[<-.data.frame`(`*tmp*`, i, value = c(177L, 272L, 39L, 177L, :
  replacement has 5076 rows, data has 141

R ERROR END
===========

at org.godhuli.rhipe.RHMRHelper$MRErrorThread.run(RHMRHelper.java:391)

at org.godhuli.rhipe.RHMRHelper.checkOuterrThreadsThrowable(RHMRHelper.java:236)
at org.godhuli.rhipe.RHMRReducer.run(RHMRReducer.java:68)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

R script:-
-------------

#! /usr/bin/env Rscript

library(Rhipe)
rhinit(TRUE, TRUE)

map <- expression({
# For each input record, parse out required fields and output new record:
extractDeptDelays = function(line) {
fields <- unlist(strsplit(line, "\\,"))
valueline<-paste(line,",","\n")
rhcollect(fields[1],valueline)
}
# Process each record in map input:
lapply(map.values, extractDeptDelays)
})

reduce <- expression(
pre = {
delays <- numeric(0)
},
reduce = {
line <- c(line,reduce.values)
},
post = {
line<-unlist(strsplit(paste(reduce.values), "\n"))
countline<-length(line)
word<-unlist(strsplit(line[1],","))
col<-length(word)
clustermat<-matrix(0, nrow = countline, ncol = col-1)
colval<-NULL
for(i in 1:length(line)){
words<-NULL
words<-unlist(strsplit(line[i],","))
colval<-NULL
wordlen<-length(words)-1
for(j in 1:wordlen){
colval<-cbind(colval,words[j])
}
clustermat[i,]=colval
}
clustmatdf= data.frame()
clustmatdf = clustermat
z<-NULL
for(k in 3:length(clustmatdf[1,]))
{
z<-cbind(z,clustmatdf[,k])
}
x<-c(clustmatdf[,2])
reg_cluster<-lm(x~z)

rhcollect(reduce.key,paste(length(words),length(line),length(z[1,]),length(z[,
1]),length(clustmatdf),length(clustmatdf),clustmatdf[1,2],x[1]))
}
)

inputPath <- "/regressioninput"
outputPath <- "/regressionout"

# Create job object:
z <- rhmr(map=map, reduce=reduce,
ifolder=inputPath, ofolder=outputPath,
inout=c('text', 'text'), jobname='Regression',
mapred=list(mapred.reduce.tasks=2))
# Run it:
rhex(z)
# Get the results from HDFS and use to create a dataframe:
results <- rhread(paste(outputPath, "/part-*", sep = ""), type =
"text")
write(results, file="regout.dat")

Sample Input Data:-
----------------------------

Cluster yld/ftg std ln footage Factor 1 Factor 2 Factor 3 Factor 4 Factor 5 Factor 6 Factor 7 Factor 8 Factor 9 Factor 10 Factor 11 Factor 12 Factor 13 Factor 14 Factor 15 Factor 16 Factor 17 Factor 18 Factor 19 Factor 20 Factor 21 Factor 22 Factor 23 Factor 24 Factor 25 Factor 26 Factor 27 Factor 28 Factor 29 Factor 30 Factor 31 Factor 32 Factor 33 Factor 34 Factor 35
2 -0.51 0.52 1.81 -0.93 -0.49 0.19 0.32 1.1 0.46 -0.44 -0.26 -0.94 0.82 0.46 0.35 -0.13 -0.79 -0.76 0.45 0.08 -0.33 -0.71 0.97 -0.15 0.73 -0.15 0.43 -0.5 -0.06 0.06 0.17 -0.58 0.38 -0.03 0.26 -0.53 -0.11
3 -0.53 0.94 -0.73 5.22 -0.39 -1.25 1.66 -0.53 0.99 -2.18 -0.33 0.6 1.15 -0.31 1.11 1.06 -0.89 -1 0.22 1 -0.32 -0.33 0.19 -0.47 -0.12 -0.18 -0.04 0.75 -0.33 -0.11 0.19 -0.47 -0.1 -0.22 0.27 0.35 0.05
2 0.44 0.52 2.2 4.67 -3.18 -3.33 1.62 0.11 1.62 -2.47 0.02 1.45 1.77 -1.22 1.38 0.97 -0.05 -1.51 1.07 0.57 -0.66 1.15 0.46 0.09 -0.37 -0.56 -0.46 0.02 -1.02 -0.44 0.52 -0.23 -0.3 0.31 0.17 0.19 0.92
3 -1.34 -0.43 -0.32 6.32 0.18 -2.14 1.51 -1.07 -0.36 -2.24 1.6 0.45 1.2 0.36 0.56 0.46 -0.04 -0.78 0.86 0.47 -0.64 -1.21 1.02 -0.4 0.42 1.52 0.05 -0.39 -0.66 0.1 -0.37 -0.09 0.11 -0.69 0.55 0.3 -0.06
2 -1.3 -0.43 0.92 2.91 4.32 0.29 1.74 1.28 1.86 -2.78 -0.42 0.58 0.54 0.77 1.3 0.56 0.33 -1.42 0.57 0.86 -0.4 -0.66 0.78 -1.15 0.76 1.41 -0.05 -0.68 -0.04 -0.4 0.22 -0.02 0.21 -0.29 0.7 0.1 0.04
3 -0.28 0.11 -1.26 6.55 -0.69 -3.84 1.63 -1.74 -0.99 -1 3.02 -1.09 0.46 0.23 0.38 0.21 -0.54 -0.23 0.21 -0.1 -1.06 -0.74 0.84 -0.32 0.24 0.9 0.09 -0.37 -0.37 -0.19 -0.24 -0.12 0.3 -0.38 0.47 0.81 0.01
1 1.01 -1.58 -4.66 -3.42 4.69 6.58 -1.94 -1.18 0 -0.33 0.18 1.86 0.44 0.14 0.23 0.05 0.07 0.18 0.38 0.33 -0.58 -0.02 0.4 -0.61 0.06 0.38 -0.57 0.45 0.42 0.02 0.28 0 -0.2 -0.38 0.3 0.21 0.22
3 0.4 -0.43 -2.15 1.21 6.92 8.02 0.19 0.38 -0.51 -0.17 0.87 2.45 1.13 0.49 -0.09 -0.42 -0.23 -0.13 0.64 0.37 -0.36 -0.44 0.33 -0.62 0.27 0.31 -0.09 0.05 0.66 -0.12 -0.01 -0.17 -0.69 -0.3 -0.27 0.31 0.39
3 -1.28 0.52 -5.43 3.19 -1.52 1.77 -1.43 -1.64 -0.54 1.17 -0.29 0.91 0.99 0.21 0.42 -0.01 0.64 -0.41 0.55 0 0.63 -0.12 0.24 -0.22 0.11 0.6 -0.18 0.26 -0.12 0.1 0.1 -0.13 -0.19 -0.08 0.17 0.1 0.32

In the mapper script I use the cluster id as the key and send each record
to the reducer. In the reducer I receive the records cluster-wise and
apply the lm function to the yld/ftg column and the rest of the dataset.
I am seeing the error pasted above; any help will be greatly appreciated.

Thanks in advance
Sandeep
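
The per-cluster regression described above can be tried in plain R first, without Rhipe or Hadoop. What follows is only a rough sketch under assumptions, not the job from the post: the sample records, the comma delimiter, and the helper name fit_cluster are hypothetical. It converts the split fields to numeric before calling lm, because strsplit returns character data. One possible reading of the error (5076 is 141 x 36, and the values printed are integer codes) is that a character matrix reached lm and was treated as one long factor; nothing in the thread confirms this.

```r
# Hypothetical comma-separated records: cluster id, response, two predictors.
lines <- c("2,-0.51,0.52,1.81", "3,-0.53,0.94,-0.73",
           "2,0.44,0.52,2.20",  "3,-1.34,-0.43,-0.32",
           "2,-1.30,-0.43,0.92", "3,-0.28,0.11,-1.26",
           "2,1.01,-1.58,-4.66", "3,0.40,-0.43,-2.15")

# "Map" step: the first field is the key, the whole line is the value.
keys <- vapply(strsplit(lines, ",", fixed = TRUE), `[`, character(1), 1)
groups <- split(lines, keys)

# "Reduce" step: build a numeric matrix for one key, then fit the model.
fit_cluster <- function(group_lines) {
  fields <- strsplit(group_lines, ",", fixed = TRUE)
  mat <- do.call(rbind, lapply(fields, as.numeric))  # character -> numeric
  y <- mat[, 2]                                      # response column
  X <- mat[, 3:ncol(mat), drop = FALSE]              # predictor columns
  lm(y ~ X)
}

fits <- lapply(groups, fit_cluster)
```

Calling coef(fits[["2"]]) then shows the coefficients for cluster 2. Checking is.numeric() on the matrix before the lm call is a cheap way to rule the type problem in or out.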

Shekhar

Oct 27, 2011, 9:51:00 AM

to Bangalore R Users - BRU

Hi Sandeep,
Can you please check the mapper code? The values coming to the mapper
are in list form, so you need to unlist them first, which gives you a
line. Then you need to split the line on the required token. Splitting
again creates a list, so you need to unlist once more.

So the code which you have written inside the mapper should be
something like this:
fields <- unlist(strsplit(unlist(line), "\\,"))

Let me know if you have any further issues.
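
The unlist / split / unlist sequence can be checked in plain R. This is a small sketch that assumes a value arrives wrapped in a list, as described above; the sample line is made up, and the variable names flat and parts are only for illustration.

```r
# A value as it might arrive in the mapper: a list holding one line (hypothetical data).
line <- list("2,-0.51,0.52,1.81")

# unlist() turns the one-element list into a character vector of length 1.
flat <- unlist(line)

# strsplit() returns a list with one element per input string ...
parts <- strsplit(flat, ",", fixed = TRUE)

# ... so a second unlist() gives a plain character vector of fields.
fields <- unlist(parts)

# The fields are still character; convert before doing arithmetic on them.
values <- as.numeric(fields[-1])
```

Here fields[1] is the cluster id used as the key, and values holds the remaining columns as numbers.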

> map <- expression({
> # For each input record, parse out required fields and output new
> record:
> extractDeptDelays = function(line) {
> fields <- unlist(strsplit(line, "\\,"))