Access a subset of data using Rhadoop?


Liang Zhou

Apr 7, 2014, 5:45:44 PM
to rha...@googlegroups.com
I am running a simple principal component transformation on a dataset, separately for each sample. The data looks like this:

Sample X Y Z
1           ... (some entries for X Y and Z)
1           ...
1           ...
1           ...
2           ...
2           ...
2           ...
2           ...
3           ...
3           ...
3           ...
3           ...

When I wrote the map function, I did the following:

unqsample <- unique(data$Sample)

pca_mapper <- function(k,input) {
    generate.sample = function(i) {
        select.input = Data[Data$Sample==i,]
        keyval(i,select.input)
    }
    c.keyval(lapply(unqsample, generate.sample))
}

The data is read from a CSV file using this input format:

column = c("Sample","X","Y","Z")
pca.input.format =
    make.input.format(
        "csv",
        sep=",",
        row.names=NULL,
        col.names=column,
        na.strings=c("NA"),
        colClasses=c(Sample="numeric",
                     X="numeric",
                     Y="numeric",
                     Z="numeric")
    )


It reports an error when I run:
mapreduce(input="rawpca.csv", input.format = pca.input.format, map=pca_mapper)

I wonder whether this is because the data is distributed across different data nodes, so that Data[Data$Sample==i,] is not subsetting from the full dataset?

I am new to Hadoop and would like to know the best strategy for writing the map function so that the key is the Sample ID and the value is the subset of the data corresponding to that Sample ID.

Thank you for your help!

Antonio Piccolboni

Apr 7, 2014, 5:58:50 PM
to RHadoop Google Group
Real data, real code, real error information => real help. Pseudo-data, pseudo-code, no error details => guesses.
My guess per your specs:

pca_mapper = function(k, input) keyval(input$Sample, input[, -1])

That's all

Antonio
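
The one-line mapper above can be illustrated without Hadoop. The following is a minimal base-R sketch, not rmr2 code: it assumes that the framework hands each map call only one chunk of the file and later groups the emitted rows by key. The toy data, the two-chunk split, and the helper name map_chunk are hypothetical; split() and rbind() merely stand in for the grouping that happens between map and reduce.

```r
# Hypothetical toy data standing in for rawpca.csv (values are made up).
set.seed(1)
full <- data.frame(Sample = rep(1:3, each = 4),
                   X = rnorm(12), Y = rnorm(12), Z = rnorm(12))

# Assume the file was split into two chunks; each map call sees only one,
# so Sample 2 is spread across both chunks.
chunks <- list(full[1:6, ], full[7:12, ])

# Stand-in for keyval(input$Sample, input[, -1]): one key per row,
# with the remaining columns as the value.
map_chunk <- function(input) list(key = input$Sample, val = input[, -1])
mapped <- lapply(chunks, map_chunk)

# Stand-in for the grouping step: collect all emitted rows, then group by key.
keys <- unlist(lapply(mapped, `[[`, "key"))
vals <- do.call(rbind, lapply(mapped, `[[`, "val"))
groups <- split(vals, keys)

# Stand-in for a reduce step: one PCA per sample, on the complete subset.
pcas <- lapply(groups, prcomp)
```

Under these assumptions the mapper never needs to see the full dataset: it only tags each row with its Sample ID, and the per-sample subsets are assembled afterwards, which is where a per-sample computation such as prcomp would go.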



Liang Zhou

Apr 7, 2014, 10:34:52 PM
to rha...@googlegroups.com, ant...@piccolboni.info
Thank you, Antonio. It turned out to be as simple as you pointed out.