How to import a library with RHadoop

Shalini Ravishankar

Jan 28, 2015, 1:10:21 PM
to rha...@googlegroups.com
Hello Everyone,

I am new to RHadoop and R. I have an ordinary R program that loads a package with library(). Can someone give me some insight into how to run this R program on Hadoop? What do I need to modify in the original program? It would be a real help if someone could give me some idea.

Thanks,
shalini




library(methylKit)
# Read the three methylation call files
file.list <- list("new_sample1.txt", "new_sample2.txt", "n_sample3.txt")
myobj <- read(file.list, sample.id = list("test1", "test2", "ctrl1"),
              assembly = "hg19", treatment = c(1, 1, 0), context = "CpG",
              pipeline = list(fraction = TRUE, chr.col = 1, start.col = 2, end.col = 2,
                              coverage.col = 6, strand.col = 3, freqC.col = 5))
getMethylationStats(myobj[[1]],plot=F,both.strands=F)
pdf("sample1_statistics.pdf")
getMethylationStats(myobj[[1]],plot=T,both.strands=F)
dev.off()
getMethylationStats(myobj[[2]],plot=F,both.strands=F)
pdf("sample2_statistics.pdf")
getMethylationStats(myobj[[2]],plot=T,both.strands=F)
dev.off()
getMethylationStats(myobj[[3]],plot=F,both.strands=F)
pdf("sample3_statistics.pdf")
getMethylationStats(myobj[[3]],plot=T,both.strands=F)
dev.off()
library("graphics")
pdf("sample1_coverage.pdf")
getCoverageStats(myobj[[1]], plot = T, both.strands = F)
dev.off()
pdf("sample2_coverage.pdf")
getCoverageStats(myobj[[2]], plot = T, both.strands = F)
dev.off()
pdf("sample3_coverage.pdf")
getCoverageStats(myobj[[3]], plot = T, both.strands = F)
dev.off()
meth=unite(myobj, destrand=FALSE)
pdf("correlation.pdf")
getCorrelation(meth,plot=T)
dev.off()
pdf("cluster.pdf")
clusterSamples(meth, dist="correlation",method="ward", plot=TRUE)
dev.off()
hc <- clusterSamples(meth, dist = "correlation", method = "ward",plot = FALSE)
pdf("pca.pdf")
PCASamples(meth, screeplot = TRUE)
PCASamples(meth)
# Differential methylation: at least 25% difference, q-value < 0.01
myDiff <- calculateDiffMeth(meth)
write.table(myDiff, "mydiff.txt", sep='\t')
myDiff25p.hyper <- get.methylDiff(myDiff, difference=25, qvalue=0.01, type="hyper")
myDiff25p.hyper
write.table(myDiff25p.hyper, "hyper_methylated.txt", sep='\t')
myDiff25p.hypo <- get.methylDiff(myDiff, difference=25, qvalue=0.01, type="hypo")
myDiff25p.hypo
write.table(myDiff25p.hypo, "hypo_methylated.txt", sep='\t')
myDiff25p <- get.methylDiff(myDiff, difference=25, qvalue=0.01)
myDiff25p
write.table(myDiff25p, "differentially_methylated.txt", sep='\t')
diffMethPerChr(myDiff,plot=FALSE,qvalue.cutoff=0.01,meth.cutoff=25)
pdf("diffMethPerChr.pdf")
diffMethPerChr(myDiff,plot=TRUE,qvalue.cutoff=0.01,meth.cutoff=25)
dev.off()
gene.obj <- read.transcript.features(system.file("extdata","refseq.hg18.bed.txt", package = "methylKit"))
write.table(gene.obj,"gene_obj.txt", sep='\t')
annotate.WithGenicParts(myDiff25p, gene.obj)
cpg.obj <- read.feature.flank(system.file("extdata","cpgi.hg18.bed.txt", package = "methylKit"),feature.flank.name = c("CpGi","shores"))
write.table(cpg.obj,"cpg_obj.txt", sep='\t')
diffCpGann <- annotate.WithFeature.Flank(myDiff25p,cpg.obj$CpGi, cpg.obj$shores, feature.name = "CpGi",flank.name = "shores")
write.table(diffCpGann,"diffCpGann.txt", sep='\t')
diffCpGann 
promoters <- regionCounts(myobj, gene.obj$promoters)
head(promoters[[1]])
write.table(promoters,"promoters.txt", sep='\t')
diffAnn <- annotate.WithGenicParts(myDiff25p, gene.obj)
head(getAssociationWithTSS(diffAnn))
diffAnn
write.table(getAssociationWithTSS(diffAnn),"diff_ann.txt", sep='\t')
getTargetAnnotationStats(diffAnn, percentage = TRUE,precedence = TRUE)
pdf("piechart1.pdf")
plotTargetAnnotation(diffAnn, precedence = TRUE, main ="differential methylation annotation")
dev.off()
pdf("piechart2.pdf")
plotTargetAnnotation(diffCpGann, col = c("green","gray", "white"), main = "differential methylation annotation")
dev.off()
getFeatsWithTargetsStats(diffAnn, percentage = TRUE)
Attachment: sample.R

Antonio Piccolboni

Jan 28, 2015, 1:43:07 PM
to rha...@googlegroups.com
There isn't anything special about importing libraries in rmr2. You just call library() as usual; as long as the package is installed on every node, it will be loaded into each instance of R that starts on the cluster, so you don't have to do anything extra.


library(somepackage)

mapreduce(input, output, map = function(k, v) function.from.somepackage(v, other.args))

It should just work. In case your expectations are set a little too high, please note that all I did was write a program with rmr2, specifically one that uses the function mapreduce and a function from some other library. That means it is still up to the developer to write a program that combines rmr2 with the other library. No library that I know of is aware of rmr2 and able to use it without custom code. Oh wait, there is one that I wrote, plyrmr. And a couple of proprietary ones from Revolution Analytics and Mu Sigma, but that's about it AFAIK.
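To make that pattern concrete, here is a minimal sketch; the package name, input path, and function are placeholders I've made up, not a tested pipeline:

```r
# Sketch only: 'somepackage', the HDFS path, and the function are placeholders.
library(rmr2)
library(somepackage)  # must be installed on every node; each worker R session loads it too

out <- mapreduce(
  input = "/tmp/my_input",                               # hypothetical HDFS path
  map   = function(k, v) keyval(k, function.from.somepackage(v))
)
```

The only cluster-specific requirement is that the package be installed on every node; the library() call itself needs no changes.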

Shalini Ravishankar

Jan 28, 2015, 1:52:40 PM
to rha...@googlegroups.com, picc...@gmail.com
Thanks Antonio for the information. 

I have another doubt. This R program produces multiple outputs (most are pdfs of graphs) and takes 3 txt files as input. I am wondering whether that will cause a problem: when the program runs on the data chunks, it will create output based on those chunks. Will there be any issue when reducing the pdfs?

--
Thanks & Regards,
Shalini Ravishankar.


Antonio Piccolboni

Jan 29, 2015, 7:12:43 PM
to Shalini Ravishankar, RHadoop Google Group
Yes, it sounds like a fairly complicated situation.

Multiple files can be specified as input to mapreduce. They will all be read, but not in any specific order. Another way to combine multiple inputs is to use one or more joins, via the function equijoin in rmr2. A final possibility, if some of the inputs fit in memory, is to read them in before calling mapreduce; they then become available inside the map and reduce functions according to normal scoping rules (a process similar to what is known as a map-side join). Which one is possible or best in your case is hard to tell without a deep analysis of the algorithms involved.

As far as the chunks go: yes, if for instance your program averages a column of the data and you ignore the fact that you are running on MapReduce, you will end up with multiple averages, one for every chunk. That is relatively easy to solve, but something like quantiles requires a little more art and some degree of approximation. So working only on chunks of the data is one of the fundamental constraints of writing programs for MapReduce; the other is that communication among processors happens only in the shuffle phase.

As for the pdfs, I don't know what "reducing a pdf" would mean, but graphics devices in R write only to the local file system, so your files would end up scattered all over the cluster and be pretty much impossible to retrieve. You will need to read them back in as raw vectors with readBin and return them as the return value of the map or reduce function.
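The per-chunk averaging pitfall described above can be sketched in plain R, without a cluster: the fix is for each "map" to emit a partial (sum, count) pair and for the "reduce" to combine them, instead of averaging each chunk separately. The data here is illustrative:

```r
x <- 1:100
chunks <- split(x, rep(1:4, each = 25))   # pretend the data arrives in 4 chunks

# Wrong: averaging per chunk gives 4 different answers, not one global mean
sapply(chunks, mean)                      # 13, 38, 63, 88

# Right: each "map" emits a partial (sum, count); the "reduce" combines them
partials <- lapply(chunks, function(ch) c(sum = sum(ch), n = length(ch)))
totals   <- Reduce(`+`, partials)
totals[["sum"]] / totals[["n"]]           # 50.5, the true mean of x
```

The same decomposition carries over to an rmr2 job: the map emits the partials as key-value pairs and the reduce sums them.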
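A hedged sketch of the readBin idea for the pdfs; the plotting call is a placeholder, and this assumes rmr2's keyval/from.dfs API:

```r
library(rmr2)

# Each map task draws its plot into a local temp file, then ships the bytes
# back through the key-value stream instead of leaving them on a node's disk.
map.with.pdf <- function(k, v) {
  f <- tempfile(fileext = ".pdf")
  pdf(f)
  plot(v)                                  # placeholder for the real plotting code
  dev.off()
  bytes <- readBin(f, what = "raw", n = file.info(f)$size)
  keyval(basename(f), list(bytes))         # the pdf travels as a raw vector
}

# Back on the driver, retrieve the pairs with from.dfs() and write each
# raw vector to a local pdf file with writeBin().
```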

