tags


Elizabeth Purdom

Dec 5, 2007, 8:28:05 PM
to aroma-af...@googlegroups.com
Hi,
I have a question about manually giving your own tags in the analysis. What I would *like* to do is  at different points in my analysis, use a different cdf, and note this by putting a tag (I'm working on exon arrays, so I have many I can choose from!). But this seems to overwrite the default tag rather than append to it. For example, for BG correction I do this:
csBGTissue<-RmaBackgroundCorrection(csTissue,tags="main") #indicating I used cdf with all 'main' design probesets

This appears to give me the folder "Affy_tissue_comm,main", not "Affy_tissue_comm,RBC,main", which I would prefer. I'm assuming there's not a way to do this, except manually change the folder name or tag, then redefine csBGTissue using AffymetrixCelSet$fromFile (I might like "appendTag" as well as 'tag' option on commands, but that seems a big change...). So my question is to double check how this trickles down -- I haven't run the continuing analyses yet and I'd like to know.
1) When I quantile normalize, I assume I will get a folder "Affy_tissue_comm,RBC,main,QN", assuming I fix the tag for my background correction

2) Then when I do my plm call, I'm going to switch to the 'core' cdf (a subset of the probes in the 'main' cdf, so I'm assuming this won't create any problems). So if I didn't add any further tags in my call, I'd get a folder "Affy_tissue_comm,RBC,main,QN,RMA,merged" (because I'm doing ExonRmaPlm). The following call,
ExonRmaPlm(csNTissue, mergeGroups=TRUE,tag="core")
will give me a folder "Affy_tissue_comm,core", which is not what I want, so I need to do
ExonRmaPlm(csNTissue, mergeGroups=TRUE,tag="RBC,main,QN,RMA,merged,core")

3) Then the FIRMA function seems to do something odd by default, comparatively, though I have not rerun it since last spring, so please correct me if this isn't current somebody, but it takes forever to run, so I'd like to fix it upfront rather than run it and then correct it. It makes a folder 'modelFirmaModel' at the level of 'plmData' and 'rawData', etc. Then under that it makes the folder "Affy_tissue_comm,FIRMA", removing my previous collection of tags. And then the files are called 'xyz,FIRMAscores.cel'. That seems like an awful lot of 'FIRMA'. Mark already has a beta function that rewrites a bit of the firma functions to allow other options. I think it's a good time to update this as well, e.g. a folder 'scoresData' (so there might be other scores for other chips, etc. that could go here), then at the next level, keep all the previous tags as well as give a tag 'FIRMA', then under that save all the .cel files for whichever of the possible options, with informative filename tags, e.g. 'xyz,medResScores.cel' and 'xyz,UQWtScores.cel' in the same folder. Feedback?

By the way, what if I was not working with cdfs that are subsets of each other? Are the cells not contained in the cdf just left blank for the various processes, or are the original values copied over?

Thanks,
Elizabeth

Henrik Bengtsson

Dec 5, 2007, 10:10:02 PM
to aroma-af...@googlegroups.com
Hi.

On 05/12/2007, Elizabeth Purdom <epu...@gmail.com> wrote:
> Hi,
> I have a question about manually giving your own tags in the analysis. What
> I would *like* to do is at different points in my analysis, use a different
> cdf, and note this by putting a tag (I'm working on exon arrays, so I have
> many I can choose from!). But this seems to overwrite the default tag rather
> than append to it. For example, for BG correction I do this:
> csBGTissue<-RmaBackgroundCorrection(csTissue,tags="main")
> #indicating I used cdf with all 'main' design probesets
>
> This appears to give me the folder "Affy_tissue_comm,main", not
> "Affy_tissue_comm,RBC,main", which I would prefer. I'm assuming there's not
> a way to do this, except manually change the folder name or tag, then
> redefine csBGTissue using AffymetrixCelSet$fromFile (I might like
> "appendTag" as well as 'tag' option on commands, but that seems a big
> change...). So my question is to double check how this trickles down -- I
> haven't run the continuing analyses yet and I'd like to know.

There is a concept I call the "asterisk tag" allowing you to do:

csBGTissue <- RmaBackgroundCorrection(csTissue, tags=c("*", "main"))
print(getFullName(csBGTissue))
## [1] "Affymetrix-HeartBrain,RBC,main"

Using tags="*,main" should also work. For most Transform and Model
classes there is an internal getAsteriskTag() method that is used to
generate the asterisk tag(s) given class, parameters etc. I went
through most classes and made them utilize this, but it was not a
complete scan; for instance, I avoided touching the exon classes since
I knew you guys were working on those. Previously, a "*" tag was (as
is still the case in ExonRmaPlm) identified and set/replaced in the
constructor function, whereas you don't want to parse it until you
actually query the tags, e.g. via getTags() or otherwise. I can do a
2nd scan through all classes and update those - if so, make sure to
commit what you have and let me know when you're done.
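
A minimal sketch of the asterisk-tag idea in plain R, independent of aroma.affymetrix. The helpers expandAsteriskTag() and fullName() are hypothetical names made up for illustration; they are not the package's getAsteriskTag() or getFullName(), and only show how "*" could be replaced by the default tags at the time the tags are queried.

```r
# Hypothetical helper (not part of aroma.affymetrix): expand "*" in a
# user-supplied tag vector into the default tags, at query time.
expandAsteriskTag <- function(tags, defaultTags) {
  # Accept both c("*", "main") and the single string "*,main"
  tags <- unlist(strsplit(tags, split = ",", fixed = TRUE))
  tags <- trimws(tags)
  # Replace each "*" by the default tags, keeping the order of the rest
  expanded <- lapply(tags, function(tag) {
    if (identical(tag, "*")) defaultTags else tag
  })
  unlist(expanded)
}

# Hypothetical helper: a full name is the name followed by its tags
fullName <- function(name, tags) paste(c(name, tags), collapse = ",")

tags <- expandAsteriskTag(c("*", "main"), defaultTags = "RBC")
print(fullName("Affy_tissue_comm", tags))
```

Without the "*", the same helper returns only the user tags, which corresponds to the overwriting behaviour described in the question.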

> 1) When I quantile normalize, I assume I will get a folder
> "Affy_tissue_comm,RBC,main,QN", assuming I fix the tag for my background
> correction

Yes, every step is just appending tags.

>
> 2) Then when I do my plm call, I'm going to switch to the 'core' cdf (a
> subset of the probes in the 'main' cdf, so I'm assuming this won't create
> any problems). So if I didn't add any further tags in my call, I'd get a
> folder "Affy_tissue_comm,RBC,main,QN,RMA,merged" (because
> I'm doing ExonRmaPlm). The following call,
> ExonRmaPlm(csNTissue, mergeGroups=TRUE,tag="core")
> will give me a folder "Affy_tissue_comm,core", which is not what I want, so
> I need to do
> ExonRmaPlm(csNTissue,
> mergeGroups=TRUE,tag="RBC,main,QN,RMA,merged,core")

As described above, the idea is to do:

plm <- ExonRmaPlm(csNTissue, mergeGroups=TRUE, tags="*,core")
print(getFullName(plm))
## [1] "Affymetrix-HeartBrain,RBC,main,QN,RMA,merged,core"

However, the current implementation puts "merged" at the end (try):

## [1] "Affymetrix-HeartBrain,RBC,main,QN,RMA,core,merged"

This is going to be corrected when I do the update described above. Let me know.
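
A rough sketch of how tags accumulate across steps, again in plain R with a made-up helper (appendStepTags() is hypothetical and not how the package is implemented). It assumes each step contributes its own default tags at the position of "*" and that the result is appended to the tags inherited from the input, which gives the intended ordering with "merged" before the user tag.

```r
# Hypothetical illustration: each step appends its tags to the tags
# inherited from its input; "*" marks where the step's own tags go.
appendStepTags <- function(inputTags, stepTags, userTags = "*") {
  userTags <- unlist(strsplit(userTags, split = ",", fixed = TRUE))
  own <- unlist(lapply(userTags, function(tag) {
    if (identical(tag, "*")) stepTags else tag
  }))
  c(inputTags, own)
}

tags <- character(0)
# Background correction with the user tag "main"
tags <- appendStepTags(tags, "RBC", userTags = "*,main")
# Quantile normalization with default tags only
tags <- appendStepTags(tags, "QN")
# Exon RMA with merged groups and the user tag "core"
tags <- appendStepTags(tags, c("RMA", "merged"), userTags = "*,core")
print(paste(c("Affymetrix-HeartBrain", tags), collapse = ","))
```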

>
> 3) Then the FIRMA function seems to do something odd by default,
> comparatively, though I have not rerun it since last spring, so please
> correct me if this isn't current somebody, but it takes forever to run, so
> I'd like to fix it upfront rather than run it and then correct it.

I don't really see why FirmaModel should be slower than in the
spring, but there might be updates and confounded effects causing
this. Mark, have you noticed a slowdown?

However, one thing I have noticed but haven't troubleshot yet is
that with R v2.6.0 (or R v2.6.1), on both Linux and Windows, when
some calls return to the R prompt, some warnings take awfully long to
display (generate), e.g. if I have six arrays and get six warnings
that some NAs were generated in a log transform, each of them takes a
very long time to display (and Ctrl-C does not work). My best guess
is that this is something with R v2.6.x and not aroma.affymetrix,
but I might be wrong. I haven't done thorough tests.

> It makes
> a folder 'modelFirmaModel' at the level of 'plmData' and 'rawData', etc.
> Then under that it makes the folder "Affy_tissue_comm,FIRMA", removing my
> previous collection of tags.

The 'modelFirmaModel', called a "root path", is specified in the
getRootPath() method, so you need to update getRootPath.FirmaModel().

In order to "pass down" tags from previous steps, you have to let
getTags() of FirmaModel take care of this. The current
implementation seems to only return the tags for FirmaModel, cf.
this$.tags. I can update this to do the default, i.e. basically the
effect of "*,<new tags>". Let me know.

> And then the files are called
> 'xyz,FIRMAscores.cel'. That seems like an awful lot of 'FIRMA'. Mark has
> already a beta function to rewrite a bit of the firma functions to allow
> other options. I think it's a good time to update this as well. e.g. a folder
> 'scoresData' (so there might be other scores for other chips, etc. that
> could go here), then at the next level, keep all the previous tags as well
> as give a tag 'FIRMA', then under that save all the .cel files for
> whichever of the possible options, with informative filename tags, e.g.
> 'xyz,medResScores.cel' and 'xyz,UQWtScores.cel' in the same folder.
> Feedback?

This is a design issue and I've been thinking about making a similar
move for segmentation data. Currently aroma.affymetrix provides two
segmentation methods via classes GladModel and more recently CbsModel.
For historical reasons, GLAD estimates are stored in gladData/ and
CBS estimates in cbsData/, e.g.

gladData/HapMap270,100K,CEU,ACC,-XY,+300,RMA,A+B,FLN/
cbsData/HapMap270,100K,CEU,ACC,-XY,+300,RMA,A+B,FLN/

However, note I do *not* add tags GLAD and CBS. Since more
segmentation methods will come around, I've been thinking of doing the
following instead:

cnsData/HapMap270,100K,CEU,ACC,-XY,+300,RMA,A+B,FLN,GLAD/
cnsData/HapMap270,100K,CEU,ACC,-XY,+300,RMA,A+B,FLN,CBS/

where 'cns' is short for "copy-number segmentation" (or something
similar). However, I haven't made the move yet, because first of all
I want to be sure it is a good one before breaking people's existing
folders (although manual renaming, so that existing results are found
automatically, is not that hard). On a more philosophical level
there is a design decision that has to be made, namely, should the
above two subdirectories hold identical file types or not? In the
current setup, aroma.affymetrix knows that all files found under
gladData/ are of a certain kind, and the same for cbsData/. If instead
cnsData/ is used, the file format has to be inferred from the tags -
and then you start adding restrictions and information to the tags.
It is important to think through the consequences of such a move.
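
To make the trade-off concrete, here is a small plain-R sketch of the two lookup strategies. Both functions are hypothetical and exist only for this illustration; they are not how aroma.affymetrix locates its files. The first infers the method from the root path, the second from the tags of the directory name.

```r
# Hypothetical sketch: two ways to tell which segmentation method
# produced the files in a result directory.

# (1) Infer the method from the root path (current layout)
methodFromRootPath <- function(path) {
  root <- strsplit(path, "/", fixed = TRUE)[[1]][1]
  switch(root, gladData = "GLAD", cbsData = "CBS", NA_character_)
}

# (2) Infer the method from the tags (generic root such as cnsData/)
methodFromTags <- function(path) {
  fullname <- basename(path)  # trailing "/" is dropped by basename()
  tags <- strsplit(fullname, ",", fixed = TRUE)[[1]][-1]
  hit <- intersect(tags, c("GLAD", "CBS"))
  if (length(hit) == 1) hit else NA_character_
}

print(methodFromRootPath("gladData/HapMap270,100K,CEU,ACC,-XY,+300,RMA,A+B,FLN/"))
print(methodFromTags("cnsData/HapMap270,100K,CEU,ACC,-XY,+300,RMA,A+B,FLN,CBS/"))
# A directory whose method tag was renamed or dropped cannot be classified:
print(methodFromTags("cnsData/HapMap270,100K,CEU,ACC,-XY,+300,RMA,A+B,FLN/"))
```

The last call shows the risk of the tag-based layout: once the tags carry the file type, editing them changes how the data is interpreted.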

Before making the big move, one could keep the current root paths, but
add tags, e.g.

gladData/HapMap270,100K,CEU,ACC,-XY,+300,RMA,A+B,FLN,GLAD/
cbsData/HapMap270,100K,CEU,ACC,-XY,+300,RMA,A+B,FLN,CBS/

So at this stage I'm not sure if it is "safe" to use the generic name
scoresData/ for everything. I am happy to discuss it though (either
online or in person).

But I agree, it is better to make the big changes now before too many
people get affected.

>
> By the way, what if I was not working with cdfs that are subsets of each
> other? Are the cells not contained in the cdf just left blank for the
> various processes, or are the original values copied over?

Are you asking about outputted CEL files?

For most (all?) cases where the output CEL file is of the same
dimension (nrow*ncol) as the input CEL file, the method createFrom()
of AffymetrixCelFile is used. Two of the arguments it takes are
'methods=c("copy", "create")' and 'clear=FALSE' (default values).
Basically, the first one lets you either do a file copy of the
existing CEL file or create/build one from scratch. If copying,
clear=TRUE will afterwards go and blank all CEL values (set them to
zero). If "creating", the created CEL file is already blank, so
clear=FALSE will then read the data from the input file and write it
to the created file. Thus, effectively, when clear=TRUE, the new CEL
will be blank and when clear=FALSE the new CEL will contain the same
values as the input file. FYI, the createFrom() call is "atomic",
that is, if something fails or it is interrupted during the call,
there will be no output file. This is why it is safe to hit Ctrl-C
almost anytime.
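
The effect of 'clear' and the "atomic" behaviour can be sketched in plain R as follows. This is only a stand-in under stated assumptions: createFromSketch() is a hypothetical function, a numeric vector saved with saveRDS() plays the role of a CEL file, and writing to a temporary file followed by a rename is one common way to get all-or-nothing output; whether createFrom() works this way internally is an assumption.

```r
# Hypothetical stand-in for the described behaviour, using a plain
# numeric vector saved with saveRDS() instead of a real CEL file.
createFromSketch <- function(inFile, outFile, clear = FALSE) {
  tmpFile <- paste0(outFile, ".tmp")
  # If anything fails before the rename, remove the partial file so
  # that no output file is left behind
  on.exit(if (file.exists(tmpFile)) file.remove(tmpFile), add = TRUE)
  values <- readRDS(inFile)
  if (clear) values[] <- 0  # blank all values, keep the dimensions
  saveRDS(values, file = tmpFile)
  # The output only appears under its final name once it is complete
  if (!file.rename(tmpFile, outFile)) stop("could not create ", outFile)
  invisible(outFile)
}

inFile <- tempfile(fileext = ".rds")
saveRDS(c(10.5, 20, 30), file = inFile)
blank <- createFromSketch(inFile, tempfile(fileext = ".rds"), clear = TRUE)
copy <- createFromSketch(inFile, tempfile(fileext = ".rds"), clear = FALSE)
print(readRDS(blank))  # all zeros
print(readRDS(copy))   # same values as the input
```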

In the case where chip-effect CEL files are created, they will always
be blank, because they are not created from an input CEL file but from
the CDF.

Hope this helps

Henrik

>
> Thanks,
> Elizabeth

epurdom

Dec 5, 2007, 11:40:14 PM
to aroma.affymetrix
Thanks Henrik. The asterisk tag is what I wanted. And I think the Firma
and ExonRmaPlm functions should be updated to do this, in line with
everything else.

> plm <- ExonRmaPlm(csNTissue, mergeGroups=TRUE, tags="*,core")
> print(getFullName(plm))
> ## [1] "Affymetrix-HeartBrain,RBC,main,QN,RMA,merged,core"
>
> However, the current implementation puts "merged" at the end (try):
>
> ## [1] "Affymetrix-HeartBrain,RBC,main,QN,RMA,core,merged"
>

I do think it would be a little better to have 'merged' put before the
user tag, since it really modifies 'RMA' while the user tag modifies
the entire process of 'RMA,merged'. I'm trying to set up tagging
rules for an aroma.affymetrix setup that will be shared between
people. So the added tag might also be someone's initials so it would
definitely be better at the end for that purpose. (On the other hand,
we may manually be entering new tags anyway, because we might want to
drop off the earlier initials for later steps and just have the
initials on the first entry of those series of tags to make the tags
easier to decipher...)

> I don't really see why FirmaModel should be slower than in the
> spring, but there might be updates and confounded effects causing
> this. Mark, have you noticed a slowdown?

I don't know if it is slower -- I meant to say that it was very slow
in the spring, so I haven't tried rerunning anything (bad grammar and
commas on my part). FIRMA might be better now. I think it was the
quantile() function, but it's hard to say.

I'm not tied to 'scoresData' and now that I read your explanation I
think that it is better to not depend too much on the tags -- what if
someone manually changes them or overrides them in their call to the
function? It's probably better to make folder names beyond the root
directory informative but not crucial. So I think I'd vote for keeping
each root folder for a different main method. I also think it's better
(for FIRMA) to keep the structure for another reason too; because we
plan to have the option to do variations of the summary statistic, so
it would be better to have

firmaData/Affy_tissue_comm,QN,RMA,merged,MEDRES/
firmaData/Affy_tissue_comm,QN,RMA,merged,UQWT/

rather than put these different scores in the same folder. That way,
if there is some further analysis, the tag 'MEDRES' will be kept and
moved on down. This is also in line with the other folders, where the
root folder is the basic algorithm, and the subfolders distinguish
different options (at least for plmData for the exon arrays this is
the case). After all, you might have a lot of different folders there
just from different choices from previous steps that gave different
tags in addition to different datasets, and so it would get quickly
confusing. For example, my plmData folder looks like this,

plmData/Affy_tissue_comm,QN,RMA,merged/
plmData/Affy_tissue_comm,QN,RMA/
plmData/Affy_tissue_comm,RBC,QN,RMA,merged/
plmData/Affy_tissue_comm,RBC,QN,RMA/
plmData/BCGC_2006,QN,RMA,merged/
plmData/BCGC_2006,QN,RMA/

and once we add the option for multiple summary statistics in the
Firma folder, it will be similar, if not worse. Granted, I don't have
to do much with these, but still one less tag to decipher when I
decide which ones to zip up would be good!

I think that the cel files that save the scores should have a tag
"xyz, chip-by-exonScores.CEL" or something similar (maybe more
compact) that is not related to the method but to the format/size of
the information in the file; then every method that gives a score per
exon and chip (as opposed to per gene) has the same tag on the CEL
file. And gene scores could be 'chipScores', like 'chipEffects' in
plmData, though that's not really what we're doing but maybe with U133
that would be relevant.

Best,
Elizabeth