Hello all,
I am using RHadoop to compute a correlation matrix grouped by a specified key. The code works fine on both the local and Hadoop backends. The problem is that when the dataset grows from 20 million to 50 million rows, the computation time increases accordingly.
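Roughly, the job looks like this (a simplified sketch only; the path, column names, and key field below are placeholders, not my actual code):

```r
library(rmr2)

# Simplified sketch of the job (identifiers are placeholders):
# the map step keys each record by the grouping column, and the
# reduce step computes a correlation matrix over each key's rows.
out <- mapreduce(
  input  = "/user/kostas/modeldata",                 # placeholder HDFS path
  map    = function(k, v) keyval(v$group_key, v),    # re-key by grouping column
  reduce = function(k, rows) {
    m <- cor(rows[, sapply(rows, is.numeric)])       # per-key correlation matrix
    keyval(k, list(m))
  }
)
```

For some keys the per-key correlation in the reducer runs for a long time without emitting anything, which is presumably when the timeout fires.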
I think the problem has to do with the task's response time. The error I am getting is:
Task attempt_201503181218_296033_r_000000_0 failed to report status for 600 seconds. Killing!. Diagnostic information will be saved in userlogs
It is true that computing the correlation matrix for a specific key was taking quite long. From a Google search I found some posts suggesting that I must set additional parameters in mapred-site.xml. Below you may find a relevant link:
If this is the case, how should I change the default timeout? It is currently set to 600,000 ms = 600 s.
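If raising the timeout is indeed the fix, I assume the mapred-site.xml entry would look something like the following (the property is named mapred.task.timeout on MRv1 clusters like this one; on YARN/MRv2 it is mapreduce.task.timeout; the 1,800,000 ms value is just an illustrative choice):

```xml
<!-- mapred-site.xml: raise the task status-report timeout -->
<!-- MRv1 property name; on YARN/MRv2 use mapreduce.task.timeout -->
<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value> <!-- 30 minutes, in milliseconds; 0 disables the timeout -->
</property>
```

Alternatively, if I understand the rmr2 docs correctly, such settings can be passed per job via `backend.parameters = list(hadoop = list(D = "mapred.task.timeout=1800000"))`, which would leave the cluster-wide default untouched. Is one of these the recommended approach, or should the reducer instead report progress periodically during long computations so the task is not killed?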
Below you may find the relevant stderr logs:
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
Loading objects:
associateAttributeLevel
associateVariousLevels
attributeAlignment
census
ChisqAttributeAlignment
cmdPath
CorrelClusterAlgorithm
CorrelClusterMapReduceExecutor
CorrelClusterParallelExecutor
curDirectory
currTime
email_to
filesToValidate
fileToUpload
finalTableName
ftpLnx
GenStartFiles
GetFTPcsvFile
GetStoreList
GetStubFileFromServer
host_id
jobNum
lastWeek
listFiles
listFilesTable
lnxHost_id
lnxPassWord
lnxPath
lnxUserName
lnxuserPass
ManipulateSTUB
ManipulationModeldataR
mapperCorrelClusterfunction
mapReduceCorrelaClusterFlag
MapReduceCorrelClusterAlgorithm
modeldata
modelDataFileName
moveFilesToHadoop
.N
numCores
numOfPeriods
outputFolder
passWord
path
path_id
produceCMDFiles
.Random.seed
reducerCorrelClusterfunction
saveRDSName
splitDataPath
storeListTable
storesFile
storesTables
stub
stubData1
stubFileName
stubFiles
substrRight
syshostname
sysuserid
tempMatrix
tempMatrixNumeric
.testLogger
trim.leading
trim.trailing
UpcSelectFileTable
UpcSelectTable
upcToPPGConverter
url_id
userName
Loading objects:
backend.parameters
combine
combine.file
combine.line
debug
default.input.format
Warning: S3 methods ‘gorder.default’, ‘gorder.factor’, ‘gorder.data.frame’, ‘gorder.matrix’, ‘gorder.raw’ were declared in NAMESPACE but not found
Please review your hadoop settings. See help(hadoop.settings)
default.output.format
in.folder
in.memory.combine
input.format
libs
map
map.file
map.line
out.folder
output.format
pkg.opts
postamble
preamble
profile.nodes
reduce
reduce.file
reduce.line
rmr.global.env
rmr.local.env
save.env
tempfile
vectorized.reduce
verbose
work.dir
Loading required package: methods
Loading required package: data.table
Attaching package: ‘data.table’
The following object is masked _by_ ‘.GlobalEnv’:
.N
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
Loading required package: doParallel
Loading required package: rmr2
Loading required package: plyr
Loading required package: rJava
Loading required package: rhdfs
Error : .onLoad failed in loadNamespace() for 'rhdfs', details:
call: fun(libname, pkgname)
error: Environment variable HADOOP_CMD must be set before loading package rhdfs
Warning in FUN(c("base", "methods", "datasets", "utils", "grDevices", "graphics", :
can't load rhdfs
Loading required package: stringr
Loading required package: bitops
Loading required package: RCurl
Attaching package: ‘RCurl’
The following object is masked from ‘package:rJava’:
clone
Loading objects:
backend.parameters
combine
combine.file
combine.line
debug
default.input.format
default.output.format
in.folder
in.memory.combine
input.format
libs
map
map.file
map.line
out.folder
output.format
pkg.opts
postamble
preamble
profile.nodes
reduce
reduce.file
reduce.line
rmr.global.env
rmr.local.env
save.env
tempfile
vectorized.reduce
verbose
work.dir
Loading required package: grid
Loading required package: lattice
Loading required package: survival
Loading required package: splines
Loading required package: Formula
Loading required package: ggplot2
Attaching package: ‘Hmisc’
The following objects are masked from ‘package:plyr’:
is.discrete, summarize
The following objects are masked from ‘package:base’:
format.pval, round.POSIXt, trunc.POSIXt, units
2015-06-20 10:24:16,7760 ERROR Client fs/client/fileclient/cc/writebuf.cc:154 Thread: 5405 FlushWrite failed: File part-00000, error: Stale File handle(116), pfid 2149.29771415.392936198, off 0, fid 2149.29771415.392936198
log4j:ERROR Could not write to: com.mapr.fs.MapRFsOutStream@7e21e65f. Failing over to local logging
java.io.IOException: stream closed
at com.mapr.fs.MapRFsOutStream.checkClosed(MapRFsOutStream.java:43)
at com.mapr.fs.MapRFsOutStream.write(MapRFsOutStream.java:80)
at com.mapr.fs.MapRFsDataOutputStream.write(MapRFsDataOutputStream.java:46)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at com.mapr.log4j.MaprfsLogAppender.append(MaprfsLogAppender.java:419)
at com.mapr.log4j.CentralTaskLogAppender.append(CentralTaskLogAppender.java:102)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856)
at org.apache.commons.logging.impl.Log4JLogger.error(Log4JLogger.java:181)
at com.mapr.fs.LoggerProxy.error(LoggerProxy.java:18)
at com.mapr.fs.Inode.flushPages(Inode.java:421)
at com.mapr.fs.Inode.syncInternal(Inode.java:530)
at com.mapr.fs.Inode.sync(Inode.java:543)
at com.mapr.fs.Inode.closeWrite(Inode.java:553)
at com.mapr.fs.Inode.close(Inode.java:1006)
at com.mapr.fs.MapRFsOutStream.close(MapRFsOutStream.java:223)
at com.mapr.fs.Inode.closeAll(Inode.java:1078)
at com.mapr.fs.BackgroundWork.close(BackgroundWork.java:99)
at com.mapr.fs.MapRFileSystem.close(MapRFileSystem.java:1236)
at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:1626)
at org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer.run(FileSystem.java:1596)
log4j:WARN com.mapr.log4j.CentralTaskLogAppender@61579cd: closed and disabled due to errors.
java.io.IOException: stream closed
at com.mapr.fs.MapRFsOutStream.checkClosed(MapRFsOutStream.java:43)
at com.mapr.fs.MapRFsOutStream.write(MapRFsOutStream.java:80)
at com.mapr.fs.MapRFsDataOutputStream.write(MapRFsDataOutputStream.java:46)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at com.mapr.log4j.MaprfsLogAppender.append(MaprfsLogAppender.java:419)
at com.mapr.log4j.CentralTaskLogAppender.append(CentralTaskLogAppender.java:102)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856)
at org.apache.commons.logging.impl.Log4JLogger.error(Log4JLogger.java:181)
at com.mapr.fs.LoggerProxy.error(LoggerProxy.java:18)
at com.mapr.fs.Inode.flushPages(Inode.java:421)
at com.mapr.fs.Inode.syncInternal(Inode.java:530)
at com.mapr.fs.Inode.sync(Inode.java:543)
at com.mapr.fs.Inode.closeWrite(Inode.java:553)
at com.mapr.fs.Inode.close(Inode.java:1006)
at com.mapr.fs.MapRFsOutStream.close(MapRFsOutStream.java:223)
at com.mapr.fs.Inode.closeAll(Inode.java:1078)
at com.mapr.fs.BackgroundWork.close(BackgroundWork.java:99)
at com.mapr.fs.MapRFileSystem.close(MapRFileSystem.java:1236)
at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:1626)
at org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer.run(FileSystem.java:1596)
log4j:ERROR Attempted to append to closed appender named [maprfsTLA].
log4j:ERROR Attempted to append to closed appender named [maprfsTLA].
Any assistance is appreciated.
Best Regards,
Kostas