
So many unexpected "lost task tracker" errors when using the topicmodels library


simm13

Aug 12, 2013, 12:34:05 PM
to rha...@googlegroups.com

I'm using the topicmodels library to extract topics from text files on HDFS. When the data is small (10,000 records) everything is fine, but when the data grows beyond 100,000 records I get a lot of errors like:

Lost task tracker: tracker_datanodeX:localhost.localdomain/127.0.0.1:xxxxx

Before the error appears, the task attempt has been running for a while. Sometimes another attempt of the same task is launched in parallel, completes quickly, and so the first one gets killed (the second attempt can even be launched on the same task tracker and succeed). The job also takes up more datanodes as it runs, e.g. 10 datanodes grows to 17. In the end the job does complete, but with a lot of "lost task tracker" errors and some of the task trackers blacklisted.

When the data is bigger than 1,000,000 records, the job cannot complete at all: the map tasks keep generating the "lost task tracker" error and never finish.

How can I deal with this? Thanks.
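The parallel attempt described here is Hadoop's speculative execution, and a killed first attempt is normally harmless by itself. As a sketch only (the property names are Hadoop 0.20/MRv1 names and the paths are placeholders; verify both against the CDH3 documentation), the task timeout and speculation settings can be passed per job through rmr2's backend.parameters:

```r
library(rmr2)

# Hypothetical job sketch: raise the per-task liveness timeout and turn
# off speculative (duplicate) task attempts for this one job.
mapreduce(
  input  = "/some/input",    # placeholder paths, not from the thread
  output = "/some/output",
  map    = function(k, v) keyval(k, v),
  backend.parameters = list(hadoop = list(
    D = "mapred.task.timeout=36000000",
    D = "mapred.map.tasks.speculative.execution=false",
    D = "mapred.reduce.tasks.speculative.execution=false"
  ))
)
```

Each D entry is passed to the Hadoop streaming command line as a -D generic option; duplicate names in the list are intentional for that reason.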

Antonio Piccolboni

Aug 12, 2013, 1:12:08 PM
to RHadoop Google Group
Which RHadoop component are you using, and could you at least sketch the general structure of your program? My guess is that you are using rmr2 to split the data and build separate topic models reduce-side, but it's better if you tell us rather than me trying to guess. It could be that some tasks are timing out, but you need to tell us a bit more about your program.


Antonio


--
post: rha...@googlegroups.com ||
unsubscribe: rhadoop+u...@googlegroups.com ||
web: https://groups.google.com/d/forum/rhadoop?hl=en-US
---
You received this message because you are subscribed to the Google Groups "RHadoop" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rhadoop+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

simm13

Aug 12, 2013, 8:41:17 PM
to rha...@googlegroups.com, ant...@piccolboni.info
Thanks for responding. I'm using Hadoop 0.20.2 with CDH3u4; my cluster has 1 namenode and 10 datanodes, and RHadoop is installed with rmr2 2.2.2.
my code is here:


library(topicmodels)
library(rmr2)
library(tm)
library(slam)

tm_mapreduce <- function(x) {
  words <- strsplit(x, ',')
  corpus <- Corpus(VectorSource(words))
  sample.dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(1, Inf)))
  k <- 3
  setAs("NULL", "CTM_VEMcontrol", function(from, to) new(to))
  VEM <- LDA(sample.dtm, k = k, control = NULL)
  Terms <- terms(VEM, 10)
  return(Terms)
}

try_lda <- function(x) {
  out <- tryCatch(
    tm_mapreduce(x),
    error = function(e) {
      message(paste("it seems error on:", x, "message:", e))
      return("ERROR")
    }
  )
  return(out)
}

try_word <- function(x, ind, split) {
  out <- tryCatch(
    word(x, ind, sep = fixed(split)),
    error = function(e) {
      message(paste("it seems split error on:", x, "message:", e))
      return("ERROR")
    }
  )
  return(out)
}

tmp <- tryCatch(
  tm_mapreduce(x),
  error = identity
)

keyword <- function(input, output) {
  mapreduce(
    input = input,
    output = output,
    map = function(k, v) {
      keyval(try_word(v, 3, "\001"), try_word(v, 5, "\001"))
      # keyval(1, v)
    },
    reduce = function(k, vv) {
      d <- data.frame(k, vv)
      acc_nbr <- as.character(unique(d$k))
      term <- lapply(acc_nbr, function(x) { as.character(d[which(d$k == x), ]$vv) })
      keyval(acc_nbr, term)
    }
  )
}

tm_lda <- function(input, output) {
  mapreduce(
    input = input, output = output,
    map = function(k, v) {
      n <- 1:length(v)
      val <- lapply(n, function(x) { try_lda(v[[x]]) })
      keyval(k, val)
    },
    backend.parameters = list(hadoop = list(D = 'mapred.task.timeout=36000000'))
  )
}

keyword('/user/hive/warehouse/tmp_hfh_keyword2', '/rhipe/keyword')
tm_lda('/rhipe/keyword', '/rhipe/lda')

-------
Are there any problems with it?

On Tuesday, August 13, 2013 at 1:12:08 AM UTC+8, Antonio Piccolboni wrote:

Antonio Piccolboni

Aug 12, 2013, 10:27:19 PM
to RHadoop Google Group
It's a bit complicated for me to tell at a glance. I am not familiar with three out of the four packages you are using, and there are multiple mapreduce calls, so I don't know which one is failing.

One step forward would be to add some logging statements, such as rmr.str("working on xyz"); at least you would know where it gets stuck. rmr.str writes to stderr, which is an important log to monitor, and I don't think you have mentioned it so far.

The other thing is to try to build a curve of the running times of successful runs. You say it runs at 10^4 records but not at 10^5. Can you run it at different sizes (below the failure point) and record how long each run takes and how many tasks it uses? This will tell you what to expect. If a run on 12,000 input records takes 10 times longer than the 10,000-record run, we have a working hypothesis right away. I once had a group give me timings line by line in a map function, and 90% of the time was spent in one line whose complexity grew with the square of the size of the input. We fixed that line and everything was fine. But you need to do that kind of detailed analysis. Or you could do a few local runs using the debugger (see rmr.options, backend).

In short, rmr2 or not, you cannot get away without debugging. You have to learn how to debug programs in Hadoop. It's not me, you, or rmr2; it's how programming is.
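The rmr.str logging suggested above could look like this inside the tm_lda map function from the earlier post (a sketch only; rmr.str is rmr2's debugging helper, which writes its argument to standard error so it shows up in the task attempt logs, and try_lda is the function defined in that post):

```r
library(rmr2)

map_with_logging <- function(k, v) {
  rmr.str(length(v))            # how many records this task received
  val <- lapply(seq_along(v), function(i) {
    rmr.str(i)                  # the last index printed shows where it got stuck
    try_lda(v[[i]])             # try_lda as defined in simm13's code
  })
  keyval(k, val)
}
```

For the local runs mentioned at the end, rmr.options(backend = "local") makes the same mapreduce() call run inside the current R session, where browser() and the standard R debugger work.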


Antonio