I am doing a simple principal component transformation on a dataset, once per sample. The data looks like this:
Sample X Y Z
1 ... (some entries for X Y and Z)
1 ...
1 ...
1 ...
2 ...
2 ...
2 ...
2 ...
3 ...
3 ...
3 ...
3 ...
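For reference, a small made-up version of the same layout (placeholder values, not my real data) can be built like this:

set.seed(1)
# toy data frame with the same Sample / X / Y / Z layout as above
Data <- data.frame(Sample = rep(1:3, each = 4),
                   X = rnorm(12),
                   Y = rnorm(12),
                   Z = rnorm(12))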
When writing the map function, I did the following:
unqsample <- unique(Data$Sample)          # all distinct Sample IDs

pca_mapper <- function(k, input) {
  generate.sample <- function(i) {
    # subset the rows belonging to Sample i and emit them under key i
    select.input <- Data[Data$Sample == i, ]
    keyval(i, select.input)
  }
  c.keyval(lapply(unqsample, generate.sample))
}
The data is pre-loaded from a CSV file:
column <- c("Sample", "X", "Y", "Z")

pca.input.format <- make.input.format(
  "csv",
  sep = ",",
  row.names = NULL,
  col.names = column,
  na.strings = c("NA"),
  colClasses = c(Sample = "numeric",
                 X = "numeric",
                 Y = "numeric",
                 Z = "numeric")
)
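Just to check the format itself, I assume (but have not confirmed) that the same input format can be tried out with rmr2's local backend before going to the cluster, something like:

library(rmr2)
rmr.options(backend = "local")            # run everything in-process, no Hadoop needed
chunk <- from.dfs(mapreduce(input = "rawpca.csv",
                            input.format = pca.input.format))
str(values(chunk))                        # expect a data.frame with Sample, X, Y, Z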
An error is reported when I run:
mapreduce(input="rawpca.csv", input.format = pca.input.format, map=pca_mapper)
I wonder if this is because the data is distributed across different data nodes, so that Data[Data$Sample == i, ] is not subsetting from the full dataset.
I am new to Hadoop and wonder what the best strategy is for writing the map function so that the key is the Sample ID and the value is the subset of the data corresponding to that Sample ID (a rough sketch of what I had in mind is below).
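For context, this is roughly what I was imagining: key every row by its Sample value in the map, let the shuffle bring the rows of each sample together, and run the PCA per sample in the reduce. I am not sure whether this is the right rmr2 idiom or whether it runs into the same distribution problem:

library(rmr2)

pca_mapper2 <- function(k, v) {
  # v is only the chunk of rows this mapper happens to see;
  # emit each row under its own Sample ID so rows of one sample meet in the reduce
  keyval(v$Sample, v)
}

pca_reducer <- function(sample_id, rows) {
  # rows should be all rows for this Sample ID, regardless of which node stored them
  pc <- prcomp(rows[, c("X", "Y", "Z")], scale. = TRUE)
  keyval(sample_id, list(pc))
}

result <- mapreduce(input = "rawpca.csv",
                    input.format = pca.input.format,
                    map = pca_mapper2,
                    reduce = pca_reducer)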
Thank you for your help!