Performance issue of k-means

Oct 23, 2015, 1:03:34 PM

to RHadoop

Hi all, I'm new to RHadoop.

Here is my test data:

P =
  do.call(
    rbind,
    rep(
      list(
        matrix(
          rnorm(9000000, sd = 10),
          ncol = 30000)),
      10)) +
  matrix(rnorm(90000000), ncol = 30000)

I then saved it as a .csv file.
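For reference, here is a scaled-down sketch of how a matrix of this shape can be built and written to a .csv file. The small dimensions, the name P.small, and the temporary output path are assumptions for illustration only, not the sizes or paths I actually used.

```r
# Scaled-down version of the test data: 10 stacked copies of one random
# block (sd = 10) plus unit-variance noise. Sizes are illustrative only.
set.seed(0)
n.col = 30       # stands in for 30000 columns
block.rows = 3   # stands in for 300 rows per block
block = matrix(rnorm(block.rows * n.col, sd = 10), ncol = n.col)
P.small = do.call(rbind, rep(list(block), 10)) +
  matrix(rnorm(10 * block.rows * n.col), ncol = n.col)

# Write without row or column names so each line holds one data point.
csv.path = tempfile(fileext = ".csv")   # hypothetical output location
write.table(P.small, csv.path, sep = ",",
            row.names = FALSE, col.names = FALSE)
dim(P.small)
```

With the original sizes the same steps produce a 3000 x 30000 matrix, so the file is far larger than this sketch suggests.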

The following is the part of the k-means example that I changed:

out = list()
## for local mode
# library(bigmemory)
# ID00 = read.big.matrix("/usr/local/AMI/20150422_AMI/ID01_96_analysis_data.csv")
ptm <- proc.time()
for(be in c("hadoop")) {
  rmr.options(backend = be)
  set.seed(0)
  out[[be]] =
    kmeans.mr(
      '/usr/local/3000x20000.csv',
      ## for local mode
      # to.dfs(ID00[1:3000, 1:20000]),
      num.clusters = 3,
      num.iter = 1,
      combine = FALSE,
      in.memory.combine = FALSE)
}
proc.time() - ptm
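The timing above wraps the whole loop, so a per-backend comparison could be sketched roughly as follows. Note that run.kmeans is a hypothetical stand-in that calls stats::kmeans on a small random matrix so the snippet runs without rmr2 or a cluster; in the real script it would set rmr.options(backend = backend) and call the MapReduce k-means instead.

```r
# Sketch: time one run per backend and keep the elapsed seconds by name.
# run.kmeans is a hypothetical placeholder; its backend argument is unused
# here and only marks where the real backend switch would go.
run.kmeans = function(backend) {
  set.seed(0)
  x = matrix(rnorm(2000), ncol = 20)   # 100 points, 20 dimensions
  kmeans(x, centers = 3, iter.max = 10)
}

timings = sapply(c("local", "hadoop"), function(be) {
  system.time(run.kmeans(be))[["elapsed"]]
})
timings
```

Keeping one elapsed time per backend name makes it easy to rerun the same input on both and compare the numbers directly.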

Here is my result:

                  local    hadoop
processing time   60s      1000s

Is it normal for the processing time on the local backend to be so much shorter than on the hadoop backend?

My Hadoop cluster has 3 nodes (memory: 6 GB, 8 GB, 12 GB) and runs Hadoop version 2.6.0.