Performance issue of k-means

WenHun, Xu

Oct 23, 2015, 1:03:34 PM
to RHadoop
Hi all, I'm new to RHadoop.

I am testing k-means with a 1 GB dataset. (The k-means code is from https://github.com/RevolutionAnalytics/rmr2/blob/master/pkg/tests/kmeans.R )

Here is my test data:

 P = 
    do.call(
      rbind, 
      rep(
        list(
          matrix(
            rnorm(9000000, sd = 10), 
            ncol=30000)), 
        10)) + 
    matrix(rnorm(90000000), ncol =30000)

I then saved it as a .csv file.
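
For anyone who wants to reproduce the data step without generating the full matrix, here is a scaled-down sketch. It is only an illustration under assumptions: the export call I used is not shown above, so base R's write.table is assumed here, and the names P_small and csv_path are hypothetical.

```r
# Scaled-down version of the test matrix: 30 x 300 instead of 3000 x 30000.
set.seed(0)
P_small =
  do.call(
    rbind,
    rep(
      list(matrix(rnorm(900, sd = 10), ncol = 300)),  # one 3 x 300 block
      10)) +                                          # stacked 10 times
  matrix(rnorm(9000), ncol = 300)                     # per-cell noise, 30 x 300

# Assumption: plain comma-separated values, no header row and no row names.
csv_path = tempfile(fileext = ".csv")  # hypothetical output location
write.table(P_small, csv_path, sep = ",", row.names = FALSE, col.names = FALSE)

dim(P_small)         # number of rows and columns
file.size(csv_path)  # size of the CSV on disk, in bytes
```

Checking file.size on the small file gives a rough idea of how many bytes each value takes as text, which is usually more than the 8 bytes a double takes in memory.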

The following is the part of kmeans.R that I changed:

out = list()

## for local mode
# library(bigmemory)
# ID00 = read.big.matrix("/usr/local/AMI/20150422_AMI/ID01_96_analysis_data.csv")

ptm <- proc.time()
 
for(be in c("hadoop")) {
  rmr.options(backend = be)
  set.seed(0)

  out[[be]] = 
    # kmeans.mr() is the function defined in the linked kmeans.R
    kmeans.mr(
      '/usr/local/3000x20000.csv',

      ## for local mode
      #  to.dfs(ID00[1:3000, 1:20000]),

      num.clusters = 3, 
      num.iter = 1,
      combine = FALSE,
      in.memory.combine = FALSE)
}
proc.time() - ptm

Here are my results:

                   local    hadoop
processing time    60s      1000s

Is it normal for processing to be so much faster in local mode than on Hadoop?

My Hadoop cluster has 3 nodes (memory: 6 GB, 8 GB, and 12 GB) and runs Hadoop 2.6.0.

Thanks,
Xu
