Re: Ram option is aroma


Elizabeth Purdom

Nov 8, 2007, 4:19:34 PM
to Ju Han, aroma-af...@googlegroups.com
Hi Ju,

If, for example, you look at the help for 'fit.ProbeLevelModel', you will
see the 'ram' option (this is why I said in conversation that you need to
look at the method-specific help). You can also look at the code to get a
better idea of what the numbers mean -- I've pasted the relevant bits of
the code below:

bytesPerChunk <- 1e+08
bytesPerUnitAndArray <- 500
bytesPerUnit <- nbrOfArrays * bytesPerUnitAndArray
unitsPerChunk <- ram * bytesPerChunk / bytesPerUnit
unitsPerChunk <- as.integer(max(unitsPerChunk, 1))
nbrOfChunks <- ceiling(nbrOfUnits / unitsPerChunk)
verbose && cat(verbose, "Number units per chunk: ", unitsPerChunk)
With these numbers, you can increase the 'ram' value so that
'unitsPerChunk' is large and 'nbrOfChunks' is small (the number of
iterations, and therefore the number of calls to write the results).
However, I don't actually know how much speed-up this will give, because
ultimately you still need to read/write the same number of results; only
the number of calls to do so goes down. Like I said, I've never played
with the option. Maybe Henrik could comment on this.
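To make the arithmetic concrete, here is a small sketch in plain R using the constants from the pasted code above; the array and unit counts are made-up illustrative values, not anyone's actual data set:

```r
# Constants taken from the fit.ProbeLevelModel code pasted above.
bytesPerChunk <- 1e+08
bytesPerUnitAndArray <- 500

# Hypothetical example: 120 arrays and 1e6 units (illustrative only).
nbrOfArrays <- 120
nbrOfUnits <- 1e6
bytesPerUnit <- nbrOfArrays * bytesPerUnitAndArray  # 60,000 bytes per unit

# Number of chunks (i.e. read/write iterations) for a given 'ram' value.
chunks <- function(ram) {
  unitsPerChunk <- as.integer(max(ram * bytesPerChunk / bytesPerUnit, 1))
  ceiling(nbrOfUnits / unitsPerChunk)
}

chunks(1)   # default: 1666 units/chunk -> 601 chunks
chunks(10)  # ram=10: 16666 units/chunk -> 61 chunks
```

So a tenfold increase in 'ram' roughly divides the number of write calls by ten, but each chunk still processes the same total amount of data.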

Also, passing along information I got from Henrik earlier, "the first
thing to check is if [you] run [your] analysis over a shared file system
or from a local disk. If you have a slow file system network that
will kill you. Working towards a local drive is way faster."

Ju, could you respond (to the group -- you should register if you are not
already) as to what you are running? Specifically, could you answer the
following questions:
1) How many arrays are you running?
2) How long does it take to run each of the steps in your code below
(e.g. BG correction, quantile normalization, fitting the linear model,
quality assessment), and where is the worst hang-up, time-wise (or is it
just all of it)?
3) What are the specs of the server you are using?
4) Is your cdf the Affy cdf where each probeset is a unit (no transcript
cluster IDs)? I'm assuming that from the code; since you pick
mergeGroups=F, it would have the same effect regardless. Is it actually
Affy's publicly available cdf or the one created by Alan?
5) Are you running all the probesets, not a subset (like core, extended,
full)? What are the corresponding numbers if you run on just the core or
extended, for example?
6) Do these probesets include the QC probesets? If so, Affy defined long
probesets containing the genomic background probes, and these might be
included in that cdf (e.g. 25 probesets with 1000 probes each).

If you do get a speed up from changing the ram option, could you also
post your results, because I think others would be interested.

Best,
Elizabeth

This is the code you sent me earlier, which I assume is what you are
still using.

library(aroma.affymetrix)

ds <- AffymetrixCelSet$fromName("GBM", chipType="HuEx-1_0-st-v2");
cdf <- AffymetrixCdfFile$fromChipType("HuEx-1_0-st-v2");
setCdf(ds, cdf);

dsBG <- RmaBackgroundCorrection(ds, addJitter=TRUE);
process(dsBG,verbose=-2);
db <- getOutputDataSet(dsBG);

qn <- QuantileNormalization(ds, typesToUpdate="pm");
process(qn, verbose=-2);
dsNorm <- getOutputDataSet(qn);

ceModel <- ExonRmaPlm(dsNorm, mergeGroups=FALSE);
fit(ceModel, verbose=-10);

qam <- QualityAssessmentModel(ceModel);
plotNuse(qam);
plotRle(qam);


Ju Han wrote:
> Hi Elizabeth,
>
> I have checked the help files of aroma package, but still have no clue
> about changing # of units to be processed each time to speed up the
> process. Could you please show me an example?
>
> Thanks,
>
> Ju

jhanre...@gmail.com

Nov 8, 2007, 5:01:05 PM
to aroma.affymetrix
Hi Elizabeth,

Thanks for your kind reply. I am still using the previous code I sent to
you. The problem in that code is not a speed issue: I was trying to
output the plot to a bmp file by using dev.print after plotNuse(qam), but
failed. The system just hung, and no bmp file was generated.

As for the ram issue, please find my answer as follows.

> 1) how many arrays are you running

~120

> 2) How long does it take to run each of steps in your code below (e.g.
> BG correction, Quantile normalization, fitting the linear model, quality
> assessment) and where is the (worst) hang up, time-wise (or is it just
> all of it)?

I cannot remember the details. Fitting the linear model is the most
time-consuming step: it took ~2 days to finish. All other steps together
account for ~1 day.

> 3) What are the specs of the server you are using?

Dual-core Intel CPU @ 2 GHz, 8 GB RAM, and 8 GB swap.

> 4) Is your cdf the Affy cdf where each probeset is a unit (no transcript
> cluster IDs) -- I'm assuming that from the code; since you pick
> mergeGroups=F, it would have the same effect regardless. Is it actually
> Affy's publicly available cdf or the one created by Alan?

The cdf files originally come from Ken.

> 5) Are you running all the probesets, not a subset? (like core,
> extended, full). What are the corresponding numbers if you run on just
> the core or extended, for example?

I am running all probesets. I have not tried other options on these
data.
I will report back.

> 6) Do these probesets include the QC probesets; if so, affy defined long
> probesets that contain the genomic background probes that might be
> included in that cdf (e.g. 25 probesets with 1000 probes each)

Actually I don't know the answer.


>
> If you do get a speed up from changing the ram option, could you also
> post your results, because I think others would be interested.

Sure.


Thanks,

Ju

Henrik Bengtsson

Nov 9, 2007, 7:20:57 AM
to aroma-af...@googlegroups.com
Hi.

On 11/8/07, jhanre...@gmail.com <jhanre...@gmail.com> wrote:
>
> Hi Elizabeth,
>
> Thanks for your kindly reply. I am still using the previous code I
> sent to you. The problem in that code is not the speed issue: I was
> trying to output the plot into a bmp file by using dev.print after
> plotNuse(qam), but failed. The system just hang up and no bmp files
> was generated.
>
> As for the ram issue, please find my answer as follows.
>
> > 1) how many arrays are you running
> ~120
>
> > 2) How long does it take to run each of steps in your code below (e.g.
> > BG correction, Quantile normalization, fitting the linear model, quality
> > assessment) and where is the (worst) hang up, time-wise (or is it just
> > all of it)?
> I cannot remember the details. Fitting the linear model is most time-
> consuming.
> It tooks ~2 days to finish. All other steps account for ~1 day.

Fitting the log-additive model to probesets that have a large/huge
number of probes *will* take time. Ken Simpson has already written
about this problem:

http://groups.google.com/group/aroma-affymetrix/web/optimization-speed-file-space-ram

and it was also recently discussed on the Bioconductor mailing list:

https://stat.ethz.ch/pipermail/bioconductor/2007-May/017006.html

See also the aroma.affymetrix 'Summerization Speed Under Unix Options'
thread started July 17, 2007
[http://groups.google.com/group/aroma-affymetrix/browse_thread/thread/f58b942da77d48d6#]
and especially the comments "It took around 20 hours to do the
summerization[ ...]" and later "Now, it only took about 30 min.".

You want to exclude those extremely large probesets.

You write "All other steps account for ~1 day." What are these other
steps? For most of the preprocessing methods, it should not take that
long, even with 120 arrays.

Also, use the 'verbose' argument with a "timestamping" Verbose object
to get more detailed information about what is taking time. See one of
the use-case examples for how to do this.
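As a sketch, following the pattern used in the aroma.affymetrix use cases (this assumes the 'ceModel' object from the code earlier in the thread; fitting it still requires the actual CEL data on disk):

```r
library(R.utils)  # provides the Arguments and Verbose classes

# Create a verbose object that prefixes each message with a timestamp,
# so each processing step in the log shows when it started.
verbose <- Arguments$getVerbose(-8, timestamp=TRUE)

# Pass it to the fitting step to see per-chunk timings.
fit(ceModel, verbose=verbose)
```

The timestamps make it straightforward to see which chunk or preprocessing step dominates the total running time.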

Cheers

/Henrik

Henrik Bengtsson

Nov 9, 2007, 7:31:56 AM
to aroma-af...@googlegroups.com
Hi.

On 11/8/07, Elizabeth Purdom <epu...@gmail.com> wrote:
>
> Hi Ju,
>
> If, for example, you look at the help of 'fit.ProbeLevelModel' you see
> the option ram (this is why I said in conversation you need to look at
> the method-specific help). You can also look at the code to get a better
> idea of what the numbers will mean -- I've pasted bits of the code below
> that are relevant.
> bytesPerChunk <- 1e+08
> bytesPerUnitAndArray <- 500
> bytesPerUnit <- nbrOfArrays * bytesPerUnitAndArray
> unitsPerChunk <- ram * bytesPerChunk/bytesPerUnit
> unitsPerChunk <- as.integer(max(unitsPerChunk, 1))
> nbrOfChunks <- ceiling(nbrOfUnits/unitsPerChunk)
> verbose && cat(verbose, "Number units per chunk: ", unitsPerChunk)
> With these numbers, you can change the ram so that 'unitsPerChunk' is
> large and so 'nbrOfChunks' is small (the number of iterations, and
> therefore the number of calls to write the results). However, I don't
> actually know how much speed up this will give, because you are
> ultimately still needing to read/write the same number of results, just
> the number of calls to do it goes down. Like I said, I've never played
> with the option. Maybe Henrik could comment on this.

The 'ram' argument is a scale factor that allows you to adjust whether
you want aroma.affymetrix to process data in smaller chunks (ram < 1) or
larger chunks (ram > 1) than the default (ram = 1).

Its main purpose is to make an algorithm run in less memory if it turns
out to be consuming a large amount of memory. Use a process manager/Unix
'top' and/or detailed verbose output to see the memory consumption. If R
is consuming more memory than the available RAM, it will start swapping
RAM to disk ("virtual memory"), and that will slow down the processing
100-1000 fold! If this happens, set 'ram' to a smaller value.

Other than avoiding swapping, the value of 'ram' does not affect the
speed very much (unless you set 'ram' to a ridiculously small value so
that, say, only a few probesets are processed in each chunk, in which
case the relative I/O overhead will dominate).
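As a sketch of how the argument is passed in practice (using the 'ceModel' from Ju's code; the value 0.5 is purely illustrative):

```r
# Process half as many units per chunk as the default (ram=1),
# roughly halving peak memory use at the cost of more read/write calls.
fit(ceModel, ram=0.5, verbose=-2)
```

Per the chunking arithmetic quoted earlier in the thread, ram=0.5 halves 'unitsPerChunk' and therefore doubles 'nbrOfChunks'.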

Cheers

Henrik

Elizabeth Purdom

Nov 9, 2007, 3:19:21 PM
to aroma-af...@googlegroups.com
Hi Ju,

Following up on Henrik's helpful references to the other discussions...

I looked at the Affy cdf (the one I also got from Ken) where each unit is
just a probeset. Is that the one you are using, Ju?

As described in one of the posts Henrik referenced, there are a number
of Affy-defined probesets that have more than 4 probes. According to
their documentation, this should not happen (except maybe for
non-aligned mRNA probesets?), so I think these are due to
quality-control probes or an error in their cdf file. Another oddity:
unlike all of the other units, which are just a single probeset, there
is one unit that has (many) groups with a variable number of probes per
group -- one group has 31259 probes, but even the other groups have
75-300. Again, these are clearly some kind of controls -- they could
also be the genomic background controls that we were told were combined
into one probeset.

A summary of these probesets (not including the 'oddity') -- note some
very large numbers of probes even then!

> summary(unlist(nprobesPerUnit[whichBig[whichBig != hasGroups]]))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    5.00    40.25    91.00   367.50    96.75 16890.00

Anyway, since you are fitting probeset by probeset, I think you should
remove any probeset that has more than 4 probes, at least for the
model-fitting stage. While the controls might be useful for background
correction (although we're not currently using them smartly), they
certainly don't need a linear model fit to them, especially since we
don't know what controls are being thrown together as one probeset. If
you switch to fitting by transcript cluster, there will still be
transcript clusters with large numbers of probes, but you can't get
around that -- those are biologically relevant!

Another interesting note is that the cdf Ken created for 'full' has some
of these probesets with more than 4 probes per probeset. But they only
have 20-40 probes per probeset, so the really huge ones have been cut
out. So if you used that cdf with mergeGroups=FALSE, you would have
probeset-by-probeset fitting without the really huge ones. I don't know
what these probesets are -- i.e. whether they are legit; they may be
mRNA products that do not align to the genome. A big item on my to-do
list is to go back to these cdf definitions and try both to update them
according to Affy's new annotation and to check out these strange
probesets and create a quality-control cdf.

Ju, could you tell us how long it takes when you remove these units
(assuming you're using that cdf)?

You should be able to do that with something like the following:

nprobesPerUnit <- readCdfNbrOfCellsPerUnitGroup(getPathname(cdf), units=NULL)
whichBig <- which(sapply(nprobesPerUnit, function(x) { any(unlist(x) > 4) }))
fit(plm, units=indexOf(cdf, names=names(nprobesPerUnit)[-whichBig]), ...)

Best,
Elizabeth

Kasper Daniel Hansen

Nov 10, 2007, 4:26:33 PM
to aroma.affymetrix
Hi Elizabeth

I have recently been a bit interested in the Affy QC probes (not
necessarily for the exon arrays, however). Could you post the names of
the probesets with more than, say, 4 probes (or put the threshold a bit
higher if you get too many)? I might be able to tell you what they are.

Kasper


Kasper Daniel Hansen

Nov 10, 2007, 4:42:35 PM
to aroma.affymetrix

On Nov 8, 11:01 pm, jhanresea...@gmail.com wrote:
> Hi Elizabeth,
>
> Thanks for your kindly reply. I am still using the previous code I
> sent to you. The problem in that code is not the speed issue: I was
> trying to output the plot into a bmp file by using dev.print after
> plotNuse(qam), but failed. The system just hang up and no bmp files
> was generated.

This kind of indicates that you are printing a big thing. Generally I
would refrain from using dev.print on bigger images. How the various
images are generated can be a bit confusing, but there are certainly
differences between the methods. I would instead do the following (for
plotting to PDF):

pdf(file = "something.pdf")
plotcommand
dev.off()

There is also a BMP option, although I fail to see why you would want
BMP. Other options are PNG (there are actually two ways of doing this)
and PostScript.

Kasper

Elizabeth Purdom

Nov 14, 2007, 2:02:02 PM
to aroma-af...@googlegroups.com
Hi Kasper,
Sure, but I can't do them right away -- hopefully in a few days.
Best,
Elizabeth