Unable to read MR job output using R

coder404
Oct 7, 2015, 12:58:06 PM
to RHadoop

Environment:
Hortonworks 2.1 cluster integrated with Kerberos and Active Directory
R version: 3.1.3

Issue:
I am trying to run a simple MR job using R on a Kerberos-enabled Hadoop cluster. The R code is given below:
# Point rhdfs/rmr2 at the local Hadoop installation.
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming-2.4.0.2.1.5.0-695.jar")
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_CONF_DIR="/etc/hadoop/conf")
library(rhdfs)
library(rmr2)
hdfs.init()
ints = to.dfs(1:100)   # write the integers 1..100 to HDFS
calc = mapreduce(input = ints, map = function(k, v) cbind(v, 2*v))   # pair each value with its double

Up to this point the mapreduce job runs successfully, but when I try to read the results back with the following command, an error is thrown:
from.dfs(calc)

The error is:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  line 1 did not have 8 elements

The same error is thrown when reading the output of any MR job (wordcount, pi estimation, etc.).
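
To check what from.dfs() is actually parsing, I believe the raw listing can be reproduced with the hadoop CLI directly. A minimal diagnostic sketch (calling calc() to obtain the HDFS output path is my reading of rmr2's big.data objects; if that is wrong, paste the path in by hand):

# Diagnostic sketch: list the job output directory with the hadoop CLI, the
# same way rhdfs does internally, and count the whitespace-separated fields
# on each line. A well-formed listing has a "Found N items" header followed
# by lines of exactly 8 fields; anything else would break the parse.
# NOTE: calc() returning the HDFS path is an assumption on my part.
out.path <- calc()
listing  <- system(paste(Sys.getenv("HADOOP_CMD"), "fs", "-ls", out.path),
                   intern = TRUE)
print(listing)
sapply(strsplit(sub("^\\s+", "", listing), "\\s+"), length)  # anything other than 8 (past the header) is the culprit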

The traceback() function displays the following:
7: scan(file = file, what = what, sep = sep, quote = quote, dec = dec,
nmax = nrows, skip = 0, na.strings = na.strings, quiet = TRUE,
fill = fill, strip.white = strip.white, blank.lines.skip = blank.lines.skip,
multi.line = FALSE, comment.char = comment.char, allowEscapes = allowEscapes,
flush = flush, encoding = encoding, skipNul = skipNul)
6: read.table(textConnection(hdfs("ls", fname, intern = TRUE)),
skip = 1, col.names = c("permissions", "links", "owner",
"group", "size", "date", "time", "path"), stringsAsFactors = FALSE)
5: hdfs.ls(fname)
4: part.list(fname)
3: lapply(src, function(x) system(paste(hadoop.streaming(), "dumptb",
rmr.normalize.path(x), ">>", rmr.normalize.path(dest))))
2: dumptb(part.list(fname), tmp)
1: from.dfs(calc)
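
From the traceback, from.dfs() ends up in hdfs.ls(), which parses the textual output of hadoop fs -ls with read.table(), skipping one header line and expecting exactly 8 columns. My guess is that on this Kerberized cluster the CLI prints an extra line (a warning or a Kerberos message) into that output. The snippet below is only a local reproduction of that failure mode; the WARN line is made up:

# Local reproduction sketch (no cluster needed): a stray line that does not
# split into 8 fields makes scan() fail exactly as above. The WARN line is
# hypothetical, a stand-in for whatever the CLI might actually be printing.
listing <- c(
  "Found 1 items",                                      # consumed by skip = 1
  "15/10/07 12:58:01 WARN util.NativeCodeLoader: ...",  # hypothetical stray line
  "-rw-r--r--   3 me hdfs 1363 2015-10-07 12:58 /tmp/file123/part-00000"
)
read.table(textConnection(listing), skip = 1,
           col.names = c("permissions", "links", "owner", "group",
                         "size", "date", "time", "path"),
           stringsAsFactors = FALSE)
# Error in scan(...) : line 1 did not have 8 elements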


I am not sure why this error occurs. 
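
If the extra line does turn out to be a log4j WARN/INFO message from the hadoop launcher, rather than something Kerberos-specific, one untested idea is to raise the CLI's log threshold before initializing rhdfs:

# Untested sketch: HADOOP_ROOT_LOGGER is read by the hadoop launcher script
# and controls the log4j verbosity of CLI commands such as `hadoop fs -ls`.
# This only helps if the stray lines are log4j messages.
Sys.setenv(HADOOP_ROOT_LOGGER = "ERROR,console")
hdfs.init()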
