Begin forwarded message:
> From: Frank <fzie...@gmail.com>
> Date: 2 March 2009 10:19:59 PM
> To: Mark Robinson <mrob...@wehi.edu.au>
> Subject: Re: FIRMA score
>
> Hi Mark, Hi everyboy,
>
> I would like to revive this thread and put the focus on the
> visualization of "FIRMA-data" with GenomeGraphs. Maybe the vignette,
> Mark has spoken of earlier, is already somewhere available?
>
> In any case I have some problems I could use help with. If you want to
> use the ExonArray-class in GenomeGraphs you have to provide the
> intensities (or whatsoever values you want to plot) for every probeset
> and the end and start coordinates (and some more stuff). Like already
> said in this thread before, you can find all this in the probeset
> annotation files from affymetrix (link somewhere above) and you can
> import this file with read.csv(...).
>
> But what I'm asking myself, how could you efficiently import this
> file? For the HuEx its around 418 MB and contains over 1.4 million
> lines. On my machine there is no way to import this file. Right now
> I'm trying to build a function that works with chunks and that removes
> all the "unnecessary" columns. But this looks not very convenient to
> me. Isn't there any other way in R to extract only the rows of
> interest, like you could do with grep under Unix? The grep and scan
> functions in R don't seem to work on files that are not already read
> in, or do I miss something?
>
> Maybe someone has a script to post here, that does nearly the
> following:
> Suppose you have an experiment with control/treatment and n chips for
> each group. Your ResidualSet is stored in "rs", "csN" holds the
> normalized intensities and "fs" the FirmaScores. For a given ensembl-
> gene (e.g. "ENSG00000060237") the script should provide all the needed
> values to create the object
> exonarray<-new("ExonArray", intensity = ..., probeStart = ,
> probeEnd=..., probeId = ..., nProbes = ...) in GenomeGraphs (where
> "intensity=..." should plot the FirmaScores for instance).
> My feelings are, that one could do a lot of mistakes here when putting
> all these values together. So some advice from someone who knows how
> to do it, would we superb.
>
> Thanks in advance for your help and feedback,
> Frank
------------------------------
Mark Robinson
Epigenetics Laboratory, Garvan
Bioinformatics Division, WEHI
e: m.rob...@garvan.org.au
e: mrob...@wehi.edu.au
p: +61 (0)3 9345 2628
f: +61 (0)3 9347 0852
------------------------------
A very good question and unfortunately, I haven't had time to put a
vignette together. I have given a working example below that will at
least to get you started.
Yes, I've struggled with that before as well. This annotation file
contains a LOT of information with respect to Ensembl/Genbank/etc
identifiers and so on. A lot of this I don't need. One way to cut
down on the size before reading into R (at least in osx/unix/linux) is
a combination of grep and awk just to pull out a small subset of the
columns:
mac1618$ grep -v "^#" HuEx-1_0-st-v2.na27.hg18.probeset.csv | awk
'{FS=","; print $1,$2,$3,$4,$5,$7}' > HuEx.small.csv
mac1618$ ls -l
total 965816
-rw-r--r-- 1 mrobinson lab0605 76411632 3 Mar 18:55 HuEx.small.csv
-rw-r--r--@ 1 mrobinson lab0605 418038269 24 Nov 19:05 HuEx-1_0-st-
v2.na27.hg18.probeset.csv
This leaves a 75MB file, which is a lot more manageable. This reads
in rather quickly:
> psCsv <- read.table("HuEx.small.csv",skip=1,sep=" ",comment.char="")
> colnames(psCsv) <-
c
("probeset_id
","seqname","strand","start","stop","transcript_cluster_id")
> dim(psCsv)
[1] 1425647 6
Lots of other ways to do this of course. Perhaps others on the list
have other tricks.
>> Maybe someone has a script to post here, that does nearly the
>> following:
>> Suppose you have an experiment with control/treatment and n chips for
>> each group. Your ResidualSet is stored in "rs", "csN" holds the
>> normalized intensities and "fs" the FirmaScores. For a given ensembl-
>> gene (e.g. "ENSG00000060237") the script should provide all the
>> needed
>> values to create the object
>> exonarray<-new("ExonArray", intensity = ..., probeStart = ,
>> probeEnd=..., probeId = ..., nProbes = ...) in GenomeGraphs (where
>> "intensity=..." should plot the FirmaScores for instance).
>> My feelings are, that one could do a lot of mistakes here when
>> putting
>> all these values together. So some advice from someone who knows how
>> to do it, would we superb.
Indeed, you need to do this carefully. Here is a possible set of
steps to follow (below). Give me your feedback on this and I will
start to make a documentation page on it.
Say you have followed the steps in the human exon array use-case: you
have an object 'plm' which is your probe level model, you have fit it,
calculated the residuals and have run FIRMA, etc. Now, say you want
to plot the normalized data and residuals in the context on known
(say, Ensembl) transcripts for that gene. You may need to do some
slight modifications for your dataset. Here, I'm using the 33 sample
tissue dataset. And, for example, I know that WNK1 has a very
interesting kidney-specific isoform ...
# starting from PLM and 'psCsv' from above
ds <- getDataSet(plm)
rs <- getResidualSet(plm)
cdf <- getCdf(ds)
u <- indexOf(cdf,"3400034")
ugcM <- getUnitGroupCellMap(getCdf(ds), units=u, retNames=TRUE) # get
table listing unit,group,cell
ind <-
unlist(getCellIndices(cdf,units=u,verbose=verbose),use.names=FALSE) #
indices to read for this unit
d <- log2(extractMatrix(cs,cells=ind,verbose=verbose))
r <- log2(extractMatrix(rs,cells=ind,verbose=verbose))
nProbes <- table(ugcM$group)
m <- match(names(nProbes), psCsv$probeset_id)
lwds <- rep(1,ncol(d)); lwds[10:12] <- 3
cols <- rep("grey",ncol(d)); cols[10:12] <- "blue"
library(GenomeGraphs)
ea1 <- new("ExonArray", intensity=d, probeStart=psCsv[m,"start"],
probeEnd=psCsv[m,"stop"],
probeId=names(nProbes),
nProbes=as.numeric(nProbes),
dp = DisplayPars(plotMap=FALSE, color=cols,
lwd=lwds))
ea2 <- new("ExonArray", intensity=r, probeStart=psCsv[m,"start"],
probeEnd=psCsv[m,"stop"],
probeId=names(nProbes),
nProbes=as.numeric(nProbes),
dp = DisplayPars(plotMap=TRUE, color=cols,
lwd=lwds))
f <- m[1]
mart=useMart(biomart="ensembl",dataset="hsapiens_gene_ensembl")
ch <- gsub("chr","",psCsv[f,"seqname"])
gr <- new("GeneRegion", chromosome=ch, start=min(psCsv[m,"start"]),
end = max(psCsv[m,"stop"]), strand =
psCsv[f,"strand"], biomart = mart)
tr<-new("Transcript",id=unique(gr@ens[,"ensembl_gene_id"]),
biomart=mart,dp=DisplayPars(plotId=TRUE))
ga <- new("GenomeAxis", add53 = TRUE)
png("exon_array_example_plot.png",width=1600,height=1200)
gdPlot( list(ea1,ga,ea2,gr,tr) )
dev.off()
... you can see this plot at:
http://bioinf.wehi.edu.au/folders/mrobinson/aroma/exon_array_example_plot.png
(Note: this link may not stay forever)
Note of course that there are more options that can be specified. See
the GenomeGraphs documentation.
As an exercise for the reader, how would you add FIRMA scores to this?
Cheers,
Mark