Using GenomeGraphs with FIRMA (Was: FIRMA score)

53 views
Skip to first unread message

Mark Robinson

unread,
Mar 2, 2009, 5:34:02 PM3/2/09
to aroma-af...@googlegroups.com

[this was sent to me offline. i've created a new thread and will try
to respond in the next day or two.]


Begin forwarded message:

> From: Frank <fzie...@gmail.com>
> Date: 2 March 2009 10:19:59 PM
> To: Mark Robinson <mrob...@wehi.edu.au>
> Subject: Re: FIRMA score
>
> Hi Mark, Hi everyboy,
>
> I would like to revive this thread and put the focus on the
> visualization of "FIRMA-data" with GenomeGraphs. Maybe the vignette,
> Mark has spoken of earlier, is already somewhere available?
>
> In any case I have some problems I could use help with. If you want to
> use the ExonArray-class in GenomeGraphs you have to provide the
> intensities (or whatsoever values you want to plot) for every probeset
> and the end and start coordinates (and some more stuff). Like already
> said in this thread before, you can find all this in the probeset
> annotation files from affymetrix (link somewhere above) and you can
> import this file with read.csv(...).
>
> But what I'm asking myself, how could you efficiently import this
> file? For the HuEx its around 418 MB and contains over 1.4 million
> lines. On my machine there is no way to import this file. Right now
> I'm trying to build a function that works with chunks and that removes
> all the "unnecessary" columns. But this looks not very convenient to
> me. Isn't there any other way in R to extract only the rows of
> interest, like you could do with grep under Unix? The grep and scan
> functions in R don't seem to work on files that are not already read
> in, or do I miss something?
>
> Maybe someone has a script to post here, that does nearly the
> following:
> Suppose you have an experiment with control/treatment and n chips for
> each group. Your ResidualSet is stored in "rs", "csN" holds the
> normalized intensities and "fs" the FirmaScores. For a given ensembl-
> gene (e.g. "ENSG00000060237") the script should provide all the needed
> values to create the object
> exonarray<-new("ExonArray", intensity = ..., probeStart = ,
> probeEnd=..., probeId = ..., nProbes = ...) in GenomeGraphs (where
> "intensity=..." should plot the FirmaScores for instance).
> My feelings are, that one could do a lot of mistakes here when putting
> all these values together. So some advice from someone who knows how
> to do it, would we superb.
>
> Thanks in advance for your help and feedback,
> Frank

------------------------------
Mark Robinson
Epigenetics Laboratory, Garvan
Bioinformatics Division, WEHI
e: m.rob...@garvan.org.au
e: mrob...@wehi.edu.au
p: +61 (0)3 9345 2628
f: +61 (0)3 9347 0852
------------------------------


Mark Robinson

unread,
Mar 4, 2009, 12:34:52 AM3/4/09
to aroma-af...@googlegroups.com
Hi Frank.

A very good question and unfortunately, I haven't had time to put a
vignette together. I have given a working example below that will at
least to get you started.


Yes, I've struggled with that before as well. This annotation file
contains a LOT of information with respect to Ensembl/Genbank/etc
identifiers and so on. A lot of this I don't need. One way to cut
down on the size before reading into R (at least in osx/unix/linux) is
a combination of grep and awk just to pull out a small subset of the
columns:

mac1618$ grep -v "^#" HuEx-1_0-st-v2.na27.hg18.probeset.csv | awk
'{FS=","; print $1,$2,$3,$4,$5,$7}' > HuEx.small.csv
mac1618$ ls -l
total 965816
-rw-r--r-- 1 mrobinson lab0605 76411632 3 Mar 18:55 HuEx.small.csv
-rw-r--r--@ 1 mrobinson lab0605 418038269 24 Nov 19:05 HuEx-1_0-st-
v2.na27.hg18.probeset.csv

This leaves a 75MB file, which is a lot more manageable. This reads
in rather quickly:

> psCsv <- read.table("HuEx.small.csv",skip=1,sep=" ",comment.char="")
> colnames(psCsv) <-
c
("probeset_id
","seqname","strand","start","stop","transcript_cluster_id")
> dim(psCsv)
[1] 1425647 6

Lots of other ways to do this of course. Perhaps others on the list
have other tricks.


>> Maybe someone has a script to post here, that does nearly the
>> following:
>> Suppose you have an experiment with control/treatment and n chips for
>> each group. Your ResidualSet is stored in "rs", "csN" holds the
>> normalized intensities and "fs" the FirmaScores. For a given ensembl-
>> gene (e.g. "ENSG00000060237") the script should provide all the
>> needed
>> values to create the object
>> exonarray<-new("ExonArray", intensity = ..., probeStart = ,
>> probeEnd=..., probeId = ..., nProbes = ...) in GenomeGraphs (where
>> "intensity=..." should plot the FirmaScores for instance).
>> My feelings are, that one could do a lot of mistakes here when
>> putting
>> all these values together. So some advice from someone who knows how
>> to do it, would we superb.


Indeed, you need to do this carefully. Here is a possible set of
steps to follow (below). Give me your feedback on this and I will
start to make a documentation page on it.

Say you have followed the steps in the human exon array use-case: you
have an object 'plm' which is your probe level model, you have fit it,
calculated the residuals and have run FIRMA, etc. Now, say you want
to plot the normalized data and residuals in the context on known
(say, Ensembl) transcripts for that gene. You may need to do some
slight modifications for your dataset. Here, I'm using the 33 sample
tissue dataset. And, for example, I know that WNK1 has a very
interesting kidney-specific isoform ...


# starting from PLM and 'psCsv' from above
ds <- getDataSet(plm)
rs <- getResidualSet(plm)
cdf <- getCdf(ds)

u <- indexOf(cdf,"3400034")
ugcM <- getUnitGroupCellMap(getCdf(ds), units=u, retNames=TRUE) # get
table listing unit,group,cell

ind <-
unlist(getCellIndices(cdf,units=u,verbose=verbose),use.names=FALSE) #
indices to read for this unit
d <- log2(extractMatrix(cs,cells=ind,verbose=verbose))
r <- log2(extractMatrix(rs,cells=ind,verbose=verbose))

nProbes <- table(ugcM$group)
m <- match(names(nProbes), psCsv$probeset_id)

lwds <- rep(1,ncol(d)); lwds[10:12] <- 3
cols <- rep("grey",ncol(d)); cols[10:12] <- "blue"

library(GenomeGraphs)
ea1 <- new("ExonArray", intensity=d, probeStart=psCsv[m,"start"],
probeEnd=psCsv[m,"stop"],
probeId=names(nProbes),
nProbes=as.numeric(nProbes),
dp = DisplayPars(plotMap=FALSE, color=cols,
lwd=lwds))
ea2 <- new("ExonArray", intensity=r, probeStart=psCsv[m,"start"],
probeEnd=psCsv[m,"stop"],
probeId=names(nProbes),
nProbes=as.numeric(nProbes),
dp = DisplayPars(plotMap=TRUE, color=cols,
lwd=lwds))
f <- m[1]
mart=useMart(biomart="ensembl",dataset="hsapiens_gene_ensembl")

ch <- gsub("chr","",psCsv[f,"seqname"])

gr <- new("GeneRegion", chromosome=ch, start=min(psCsv[m,"start"]),
end = max(psCsv[m,"stop"]), strand =
psCsv[f,"strand"], biomart = mart)

tr<-new("Transcript",id=unique(gr@ens[,"ensembl_gene_id"]),
biomart=mart,dp=DisplayPars(plotId=TRUE))

ga <- new("GenomeAxis", add53 = TRUE)

png("exon_array_example_plot.png",width=1600,height=1200)
gdPlot( list(ea1,ga,ea2,gr,tr) )
dev.off()

... you can see this plot at:
http://bioinf.wehi.edu.au/folders/mrobinson/aroma/exon_array_example_plot.png

(Note: this link may not stay forever)

Note of course that there are more options that can be specified. See
the GenomeGraphs documentation.

As an exercise for the reader, how would you add FIRMA scores to this?

Cheers,
Mark

Frank

unread,
Mar 4, 2009, 3:39:34 PM3/4/09
to aroma.affymetrix
Hi Mark,

thanks for your help. The awk-hint is wonderful. The annotation file
is MUCH smaller and better manageable in R.

I want to give some advice for Windows-Users:
- get the UnxUtils (update your PATH variable to use the commands
system-wide)
- the awk command is called gawk in UnxUtils
- on Windows the ' have to be "
- I think the | (pipe) has to be masked with ^| but somehow I didn't
manage to pipe the output of grep into gawk, that's why I used a
temporary file...
- execute (grep will remove the comments in the header, gawk will
extract columns 1,4,5,6: "probeset_id", "start", "stop",
"probe_count"):
grep -v "^#" HuEx-1_0-st-v2.na27.hg18.probeset.csv > HuEx_small.tmp
gawk "{FS=\",\"; print $1,$4,$5,$6}" HuEx_small.tmp >
HuEx.small.csv
del HuEx_small.tmp

This resulted in a 53MB file.

When you read this file in, you have to skip the first line (its the
old column names) and reset the column names (identical to what Mark
wrote above):
psCsv <- read.table(HuEx.small.csv,skip=1,sep=" ",comment.char="")
colnames(psCsv) <- c("probeset_id", "start", "stop", "probe_count")

Hope I could help anybody who's trying Mark's hint on a Windows
System,
Frank

Frank

unread,
Mar 4, 2009, 4:28:18 PM3/4/09
to aroma.affymetrix
Hi everybody,

based on Mark's hints I wrote some code to plot FirmaScores with
GenomeGraphs. Mark's way of extracting the data (d <-log2(extractMatrix
(cs,cells=ind,verbose=verbose)) ) is very elegant and effective. But
somehow I didn't manage to pull the firma scores data off in the same
way.
That's why I wrote (a pretty ugly) routine for this:
getPlotDataGenomeGraphs<- function
(id,idname=NULL,CelSet,annotationfile,groups,data="intensities"){
cdf <- getCdf(CelSet)
u<-which(getUnitNames(cdf)==id)

ugcM <- getUnitGroupCellMap(cdf, units=u, retNames=TRUE)
nProbes <- table(ugcM$group)
m <- match(names(nProbes), annotationfile$probeset_id)
probesetdata<-annotationfile[m,]

x <- readUnits(CelSet,units=u)
ind <- 1 # only read the first unit!
this <- x[[ind]]
thisunitname <- names(x)[1] #gene_id
thisgroupnames <- names(this) #probeset_ids

## build intensitymatrix ##

intensities<-matrix(ncol=nbrOfArrays(CelSet),nrow=sum(probesetdata
$probe_count))
lastrow<-0
for (i in 1:nrow(probesetdata)){
thisgroupname <- thisgroupnames [i]
thisnrow <- nrow(x[[thisunitname]][[thisgroupname]][[data]])
thisnprobe <- probesetdata$probe_count[i]
if (thisnrow!=thisnprobe){ #e.g. when working with firma scores
intensities[seq(lastrow+1,lastrow+thisnprobe),]<-matrix(rep(x
[[thisunitname]][[thisgroupname]][[data]],thisnprobe/
thisnrow),ncol=nbrOfArrays(CelSet),nrow=thisnprobe,byrow = TRUE) #no
check here, if (thisnprobe%%thisnrow==0)!!!
}else{
intensities[seq(lastrow+1,lastrow+thisnprobe),]<-x[[thisunitname]]
[[thisgroupname]][[data]]
}
lastrow <- lastrow+thisnprobe
}
intensities <- cbind(intensities ,apply(intensities[,groups[[1]]],
1,mean),apply(intensities [,groups[[2]]],1,mean)) #add means
intensities <- log2(intensities)
## DONE: build intensitymatrix ##
return(list
(id=id,idname=idname,probesetdata=probesetdata,intensities=intensities))
}

The part ## build intensitymatrix ## should be corrected/improved!!!

### SETUP ###

source(paste(getwd(),"/#Scripts_DA/LowLevelAnalysis_ensb.R",sep=""))
#do low-level analysis here and provide variables celsetN,fs,plmEx
library(GenomeGraphs)
mart = useMart("ensembl", dataset = "hsapiens_gene_ensembl")

CelSet<-celsetN #celsetN <- process(QuantileNormalization(celsetBC,
typesToUpdate="pm"))
CelSetFirma<-fs #fs <- getFirmaScores(firma)
CelSetCes<-getChipEffectSet(plmEx) #plmEx <- ExonRmaPlm(celsetN,
mergeGroups=FALSE)

file <- paste(getwd(),"/annotationData/CSV/HuEx.small.csv",sep="")
#HuEx.small.csv is condensed with awk/gawk
psCsv <- read.table(file,skip=1,sep=" ",comment.char="")
colnames(psCsv) <- c("probeset_id", "start", "stop", "probe_count")
groups <- list(group1=c(1:4),group2=c(5:8))


### GENERATE PLOTDATA ###


id<-"ENSG00000030582"; idname<-"PGRN"

plotdata1<-getPlotDataGenomeGraphs
(id=id,idname=idname,CelSet=CelSet,annotationfile=psCsv,groups=groups)
plotdata2<-getPlotDataGenomeGraphs
(id=id,CelSet=CelSetCes,annotationfile=psCsv,groups=groups,data="theta")
plotdata2$intensities<-plotdata2$intensities[,(9:10)] #remove
everything but the means
plotdata3<-getPlotDataGenomeGraphs
(id=id,CelSet=CelSetFirma,annotationfile=psCsv,groups=groups)
plotdata3$intensities<-plotdata3$intensities[,(9:10)] #remove
everything but the means
plotdata3$intensities<- cbind
(plotdata3$intensities,plotdata3$intensities[,1]-plotdata3$intensities
[,2]) #add the diff of means

### MAYBE PLOTDATA ###

file <- paste(getwd(),"/#Scripts_DA/",id,".plotdata",sep="")
if(TRUE){
save(plotdata,plotdata2,plotdata3,file=file)
}else{
load(file)
}

### WRITE PLOTS TO FILE ###

fileformat<-"pdf"
plotpath <- "/#Scripts_DA/"
plotfilename <- paste("GenomeGraphs_",id,sep="")
if (fileformat=="pdf"){pdf(file = paste(getwd
(),plotpath,plotfilename,".pdf", sep=""), width = 12, height = 12,
pointsize = 10, bg = "transparent", family="Times")
}else{png(file = paste(getwd(),plotpath,plotfilename,".png", sep=""),
width = 1600, height = 1200, units = "px", pointsize = 10, bg =
"transparent", res = NA, restoreConsole = TRUE)}

colgroup1="royalblue3"
colgroup2="springgreen4"
coldiff="red"
colgene="orange"
coltranscript="lightskyblue"
col=c(rep("gray70",8),colgroup1,colgroup2)
lwd=c(rep(1,8),rep(2,2))

title = new("Title", title =paste(plotdata1$id,plotdata1$idname,sep="
- "), dp = DisplayPars(color = "darkslategray"))
exon1 = new("ExonArray", intensity = plotdata1$intensities, probeStart
= plotdata1$probesetdata[,2], probeEnd=plotdata1$probesetdata[,3],
probeId = as.character(plotdata1$probesetdata[,1]), nProbes =
plotdata1$probesetdata[,4], dp = DisplayPars(color = col, lwd=lwd,
mapColor = "gray80", plotMap=FALSE), displayProbesets=FALSE)
exon2 = new("ExonArray", intensity = plotdata2$intensities, probeStart
= plotdata2$probesetdata[,2], probeEnd=plotdata2$probesetdata[,3],
probeId = as.character(plotdata2$probesetdata[,1]), nProbes =
plotdata2$probesetdata[,4], dp = DisplayPars(color = c
(colgroup1,colgroup2), lwd=2, mapColor = "gray80", plotMap=FALSE),
displayProbesets=TRUE)
exon3 = new("ExonArray", intensity = plotdata3$intensities, probeStart
= plotdata3$probesetdata[,2], probeEnd=plotdata3$probesetdata[,3],
probeId = as.character(plotdata3$probesetdata[,1]), nProbes =
plotdata3$probesetdata[,4], dp = DisplayPars(color = c
(colgroup1,colgroup2,coldiff), lwd=c(1,1,2), mapColor = "gray80",
plotMap=TRUE), displayProbesets=FALSE)

gene = new("Gene", id = plotdata1$id , biomart = mart,dp = DisplayPars
(color = colgene))
transcript = new("Transcript", id = plotdata1$id , biomart = mart, dp
= DisplayPars(color = coltranscript,plotId=TRUE))
legend = new("Legend", legend = c("mean(group1)","mean(group2)","mean
(group1)-mean(group2)","gene","transcripts"), dp = DisplayPars(color= c
(colgroup1,colgroup2,coldiff,colgene,coltranscript)))
genomeaxis <- new("GenomeAxis", add53 = TRUE)
gdPlot(list(title,exon1,exon2,exon3, gene, genomeaxis, transcript,
legend), minBase = min(exon1@probeStart), maxBase=max(exon1@probeEnd))
#no y-axis labels here
dev.off()

In the section ### SAVE PLOTDATA ### you could save your generated
plotdata. If you want to change your plot later (color, fonts,
highlight-box), you simply can load your data, which is much faster...
Like Mark said, there are a lot customizations possible, but maybe
this is a start to generate your own plots.

Frank

Frank Ziegner

unread,
Mar 4, 2009, 8:34:34 PM3/4/09
to aroma-af...@googlegroups.com
Hi Mark,

I have some questions regarding you plot and some more general questions regarding probesets/exons.

I assume your plot shows the WNK1 (ENSG00000060237) gene.

When doing:
  u<-which(getUnitNames(cdf)=="ENSG00000060237") #cdf is the Ensembl-cdf

  ugcM <- getUnitGroupCellMap(cdf, units=u, retNames=TRUE)
  nProbes <- table(ugcM$group)
  length(nProbes)
I got 58 probesets. But you plot does only shows around 52 probesets. Did you leave any probesets out (can't see where...)?

Your plot shows four transcripts (ENST00000315939, ENST00000252477, ENST00000386624, ENST00000408512). But Ensemble says, that there only three transcripts (ENST00000315939, ENST00000252477, ENST00000340908) belong to this gene. So you missed one (the last) transcript of WNK1 and added to more transcripts:
ENST00000408512 and ENST00000386624. I think it has something to do with the way you selected the transcript:
  "tr<-new("Transcript",id=unique(gr@ens[,"ensembl_gene_id"]),..."
What was the idea behind?

But my more important question is: Ensembl tells me, that WNK1 has 29 Exons (like also shown in your plot, if one counts very carefully in the first two transcripts). And the transcripts ENST00000315939, ENST00000252477, ENST00000340908 each have 28, 26 and 25 Exons, respectively. My memory on the basics (How the HuEx was build and annotated) is a bit weak, but I remember as a rule of thumb that a probeset roughly equals an exon. Obviously this rule does not apply here!
Maybe we have to be a bit more precise. I think one can say: Every probeset matches to only one Exon. But different probesets can match to the same exon.  Is this statement right?
In your plot one could assume, that the first four probesets match to the first exon. But that arises the question: Shouldn't then the expression values for these probesets be identical? Wait, answer to myself: different probesets have different probe affinities and so they have different intensities. Is this the reason?

Two more questions:
In  GenomeGraphs one can plot the names of the probesets with  exon = new("ExonArray", ...,DisplayPars=(...,displayProbesets=TRUE)). By default this plots the affymetrix probeset_ids. But I would like to plot the Ensembl-names of the Exons to which the probesets match (e.g. ENSE00000437674). I guess I have to replace the "probeset_id"s with the Ensembl-name of the corresponding exon. Somehow this must be possible since the "plotMap=TRUE"-Options provides a mapping between probeset and exon. But I don't know, where to find this mapping. (Or is this mapping just done by coordinates?) Got anybody ideas? After all I also want to map a possible probeset (obtained the probeset_id with limma) to an exon, not only to a gene...

In your example plot your first row shows the data "ds <- getDataSet(plm)" (I think there is a typo somewhere ds<->cs). What exactly is this data? I though, the idea of a probe level model was to somehow merge all probe values to a single probeset-value. So afterwards you have one summarized intensity for each probeset and each array. Following this thought, there should be only one intensity-value for each probeset in the plot. But your plot shows (1st. row) more than one value per probeset (mostly 4, one value  per probe). So what exactly did you plot there?
In my plot above I used the values from "celsetN <- process(QuantileNormalization(celsetBC, typesToUpdate="pm"))" and expected to plot the normalized intensities... Did I get i wrong?
I have my plot and the corresponding code attached.

Right now I'm pretty confused... I'm hoping for some enlightenment!
Frank

P.S.: Right now (2009-03-05, around 3 GMT+1) I can't connect to Ensembl through Biomart:

> library(GenomeGraphs)
> mart = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
Opening and ending tag mismatch: meta line 3 and body
Premature end of data in tag html line 1
Fehler: 1: Opening and ending tag mismatch: meta line 3 and body
2: Premature end of data in tag html line 1

Hopefully this is just a temporary error... Does anybody have the same problem?
GenomeGraphs_ENSG00000060237.pdf
GenomeGraphs_ENSG00000060237.R

Mark Robinson

unread,
Mar 4, 2009, 11:38:56 PM3/4/09
to aroma-af...@googlegroups.com
Hi Frank.

Comments below.


On 05/03/2009, at 12:34 PM, Frank Ziegner wrote:

> Hi Mark,
>
> I have some questions regarding you plot and some more general
> questions regarding probesets/exons.
>
> I assume your plot shows the WNK1 (ENSG00000060237) gene.
>
> When doing:
> u<-which(getUnitNames(cdf)=="ENSG00000060237") #cdf is the Ensembl-
> cdf
> ugcM <- getUnitGroupCellMap(cdf, units=u, retNames=TRUE)
> nProbes <- table(ugcM$group)
> length(nProbes)
> I got 58 probesets. But you plot does only shows around 52
> probesets. Did you leave any probesets out (can't see where...)?


We are using different CDFs. I am using the "coreR3,A20071112,EP"
one, not the Ensembl one, which is the source of the difference.


> Your plot shows four transcripts (ENST00000315939, ENST00000252477,
> ENST00000386624, ENST00000408512). But Ensemble says, that there
> only three transcripts (ENST00000315939, ENST00000252477,
> ENST00000340908) belong to this gene. So you missed one (the last)
> transcript of WNK1 and added to more transcripts:
> ENST00000408512 and ENST00000386624. I think it has something to do
> with the way you selected the transcript:
> "tr<-new("Transcript",id=unique(gr@ens[,"ensembl_gene_id"]),..."
> What was the idea behind?

Biomart seems to be down at the moment, so I can't track this down,
but probably this has to do with the previous step, the "GeneRegion".
The command was:

gr <- new("GeneRegion", chromosome=ch, start=min(psCsv[m,"start"]),
end = max(psCsv[m,"stop"]), strand =
psCsv[f,"strand"], biomart = mart)

and this just searches Ensembl for "genes" in that region. Perhaps we
should be more restrictive in this search. ENST00000408512 is an
snRNA gene, ENST00000386624 is an snRNA_pseudogene. According to:

http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000060237

ENSG00000060237 has 3 transcripts.




> But my more important question is: Ensembl tells me, that WNK1 has
> 29 Exons (like also shown in your plot, if one counts very carefully
> in the first two transcripts). And the transcripts ENST00000315939,
> ENST00000252477, ENST00000340908 each have 28, 26 and 25 Exons,
> respectively. My memory on the basics (How the HuEx was build and
> annotated) is a bit weak, but I remember as a rule of thumb that a
> probeset roughly equals an exon. Obviously this rule does not apply
> here!

A probeset is for a probe selection region (PSR). You can have 1 PSR
per exon or many. It depends on the Affy rules and the mass of
annotation projected onto the genome.


> Maybe we have to be a bit more precise. I think one can say: Every
> probeset matches to only one Exon. But different probesets can match
> to the same exon. Is this statement right?

Yes, I agree.


> In your plot one could assume, that the first four probesets match
> to the first exon. But that arises the question: Shouldn't then the
> expression values for these probesets be identical? Wait, answer to
> myself: different probesets have different probe affinities and so
> they have different intensities. Is this the reason?

Yes, different PROBEs have different affinities, but indeed this
carries over to PROBESETs. Gene-level summaries are generally not
comparable from gene-to-gene (although people sometimes do it, at
least indirectly), only sample-to-sample, so why would probeset-level
summaries be?


> Two more questions:
> In GenomeGraphs one can plot the names of the probesets with exon
> = new("ExonArray", ...,DisplayPars=(...,displayProbesets=TRUE)). By
> default this plots the affymetrix probeset_ids. But I would like to
> plot the Ensembl-names of the Exons to which the probesets match
> (e.g. ENSE00000437674). I guess I have to replace the "probeset_id"s
> with the Ensembl-name of the corresponding exon. Somehow this must
> be possible since the "plotMap=TRUE"-Options provides a mapping
> between probeset and exon. But I don't know, where to find this
> mapping. (Or is this mapping just done by coordinates?) Got anybody
> ideas? After all I also want to map a possible probeset (obtained
> the probeset_id with limma) to an exon, not only to a gene...

If you want to plot the Ensembl exon identifiers, I suspect it'll be
tricky. For example:

> gr@ens[,c("ensembl_exon_id","exon_chrom_start","exon_chrom_end")]
[1:2,]
ensembl_exon_id exon_chrom_start exon_chrom_end
1 ENSE00001565147 760558 760626
2 ENSE00001501630 760559 760687

You have 2 exons that cover the same area of the genome (i.e.
overlap). You don't have 2 PSRs that cover the same region of the
genome (i.e. no overlap). Therefore, IMHO, its easier to deal with
PSRs.

If you did want to do this, you have 2 options, as I see it. First
(easier) is take the PSR identifier and replace it with some Ensembl
gene identifer. Due to the above problem, you may have some choices
to make about which Ensembl identifier you give to a certain PSR.
Second (harder) is to create a CDF file where the "groups" are
labelled with Ensembl identifiers.


> In your example plot your first row shows the data "ds <-
> getDataSet(plm)" (I think there is a typo somewhere ds<->cs). What
> exactly is this data?


Thanks for spotting the typo. Good find. Indeed, I should have
written:

ds <- getDataSet(plm)
[...]
d <- log2(extractMatrix(ds,cells=ind,verbose=verbose))

So, in my plot, I was plotting unnormalized raw intensity data, not
the BG-adjusted quantile normalized, since my 'cs' was defined as:

cs <- AffymetrixCelSet$byName("tissues", cdf=cdf)


> I though, the idea of a probe level model was to somehow merge all
> probe values to a single probeset-value. So afterwards you have one
> summarized intensity for each probeset and each array. Following
> this thought, there should be only one intensity-value for each
> probeset in the plot. But your plot shows (1st. row) more than one
> value per probeset (mostly 4, one value per probe). So what exactly
> did you plot there?


I'm plotting all probes. Top plot is the raw data, lower plot is the
residuals ... then all the gene annotation at the bottom.

> In my plot above I used the values from "celsetN <-
> process(QuantileNormalization(celsetBC, typesToUpdate="pm"))" and
> expected to plot the normalized intensities... Did I get i wrong?
> I have my plot and the corresponding code attached.

Your 'plotdata1' is the normalized data. Your 'plotdata2' and
'plotdata3' are the chip effects from probeset-level summaries (PLM
and FIRMA, respectively). Therefore, the nProbes element of 'exon2'
and 'exon3' doesn't actually match the data, does it?

What is the result of:

nrow(plotdata2$intensities) == sum(plotdata2$probesetdata[,4])


In my example:

> nrow(d) == sum(as.numeric(nProbes))
[1] TRUE

Is that a possible source of the problem?


> Right now I'm pretty confused... I'm hoping for some enlightenment!
> Frank

Hopefully I haven't confused you more. There are a lot of questions/
intricacies here!

Good luck.
Mark



> P.S.: Right now (2009-03-05, around 3 GMT+1) I can't connect to
> Ensembl through Biomart:
> > library(GenomeGraphs)
> > mart = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
> Opening and ending tag mismatch: meta line 3 and body
> Premature end of data in tag html line 1
> Fehler: 1: Opening and ending tag mismatch: meta line 3 and body
> 2: Premature end of data in tag html line 1
>
> Hopefully this is just a temporary error... Does anybody have the
> same problem?
>
> >
> <GenomeGraphs_ENSG00000060237.pdf><GenomeGraphs_ENSG00000060237.R>

Frank

unread,
Mar 6, 2009, 10:27:35 PM3/6/09
to aroma.affymetrix
A note to my own post above:

> psCsv <- read.table(HuEx.small.csv,skip=1,sep=" ",comment.char="")
> colnames(psCsv) <- c("probeset_id","start", "stop", "probe_count")

If you are reading more columns in (like "strand") which are
characters, you should keep in mind, that R usually treats them as
factors. So for example you would get something like:
> psCsv[1,"strand"]
[1] -
Levels: - --- +
> class(psCsv[1,"strand"])
[1] "factor"

This is ok, but you should know how to deal with this (I had to find
out the last hour...). Maybe the easiest way is NOT to treat this
character-values as factors. Therefore you should use:
pcCSV <- read.table(HuEx.small.csv,skip=1,sep="
",comment.char="",stringsAsFactors=FALSE)

Frank
Reply all
Reply to author
Forward
0 new messages