Running aroma.affymetrix on a large dataset and time/space issue


Sooraj

Jan 4, 2010, 4:43:38 PM
to aroma.affymetrix
Hi Henrik,

I'm picking up a thread from exactly a year ago, which I found via
this link: http://www.mail-archive.com/aroma-af...@googlegroups.com/msg00285.html.
I'm running a very large number of samples (1400+), and the estimated
time was 269324.6 minutes, roughly 187 days. The link above talks
about splitting data sets into chunks and running them in parallel, and
also points to another thread at
http://groups.google.com/group/aroma-affymetrix/browse_thread/thread/4bff6520f5ad0a70#
which talks about appending the completed batch runs.

However, you mentioned in the latter thread that with a newer version
all of these steps (splitting and merging) would be handled. Since it's
been a year, I'm hoping there's a newer version out there that can do
all of the above. Can you please post whether this has been taken care
of, or whether it still needs to be done manually? Thanks!

P.S. I'm running CRMAv2, and it was brought up at my workplace that,
since CRMAv2 works one chip at a time, the final result is the same
regardless of whether it is run in batches or all at once. So why isn't
batching already built into the program, instead of leaving it to the
end user to do manually?

- Sooraj Maharjan (Mayo Clinic)

Henrik Bengtsson

Jan 4, 2010, 9:37:50 PM
to aroma-affymetrix
Hi.

On Mon, Jan 4, 2010 at 1:43 PM, Sooraj <soor...@gmail.com> wrote:
> Hi Henrik,
>
> I'm continuing a thread exactly after a year that I found following
> this link http://www.mail-archive.com/aroma-af...@googlegroups.com/msg00285.html.
> I'm running a very large number of samples (~1400+), and the estimated
> time was 269324.6 minutes, roughly 187 days.

That is approx 269324.6/1400 = 192 min/array = 3:12h/array. If you run
CRMAv2 on a single array, you should typically expect 5-10 min/array.
I suspect that you are paying a large price for so-called cache misses
on your file system, because each chunk processed iterates over all
1400+ arrays, making it hard for the file system to optimize its cache
and do so-called "read ahead".

Also, make sure to read the page 'Improving processing time'
[http://groups.google.com/group/aroma-affymetrix/web/improving-processing-time],
which has some useful suggestions. One of the points there is that
you will pay a price if you do the analysis over a networked file
system (less problematic on fast networks).

If you have a slow file-system network, at least try to use a file
cache directory on a local hard drive. See the page 'Caching of
computationally expensive tasks (memoization)'
[http://groups.google.com/group/aroma-affymetrix/web/caching] for how
to specify the cache directory. This will also be useful when you run
CRMAv2 in batches on different machines.
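
For example, if your aroma.* installation uses the R.cache package for
memoization (as recent versions do), pointing the cache to a local disk
could look like the sketch below; the path is just an example of a
directory on a local drive.

library("R.cache");
setCacheRootPath("/tmp/aroma-cache");   # example path on a local disk
print(getCacheRootPath());              # verify where the cache now lives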

Given your current performance on 1400+ arrays, it would be
interesting to all of us to see how much improvement you can get from
the above steps, without trying to process things in chunks. If you
could do some basic benchmarking and share it here, we can all learn
more about how to avoid bottlenecks.

> The link above talks
> about splitting datasets into chunks and running them in parallel, and
> also directs to another thread at this link
> http://groups.google.com/group/aroma-affymetrix/browse_thread/thread/4bff6520f5ad0a70#
> where it talks about appending the completed batched runs.
>
> However, you mentioned in the latter thread that with newer version
> all of the steps (splitting and merging) would be handled. Since it's
> been a year I'm hoping there's a newer version out there which can do
> all of the above? Can you please post if it's taken care or whether it
> still needs to be done manually? Thanks!

Things have improved since that thread. The main thing is that the
splitting can now be done via a simple

csR <- extract(csR, idxs);

before running CRMAv2, where 'idxs' specifies a subset of the
arrays. See the DOCUMENT below for details.

The merging still has to be done, but that now comes down to loading
all chip effect files at once. Something like this:

dataSet <- "HapMap270,6.0,CEU,testSet";
tags <- "ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY";
chipType <- "GenomeWideSNP_6";
cesN <- CnChipEffectSet$byName(dataSet, tags=tags, chipType=chipType,
mergeStrands=TRUE, combineAlleles=TRUE);
print(cesN);
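
For what it's worth, a quick sanity check after loading the merged set
could look like the following sketch; the exact number of arrays to
expect is of course your own total (e.g. 1400+).

print(length(getNames(cesN)));                # should equal the total number of arrays
stopifnot(!any(duplicated(getNames(cesN))));  # no array appears twice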

>
> P.S. I'm running CRMAv2, and it was brought up at my workplace that
> since CRMAv2 works one chip at a time, the final result is the same
> regardless of it run in batches or all at once, so why wasn't it
> already included in the program to run in batches instead of forcing
> the end user to do it manually?

It is pretty much done now, but you still have to launch your batches
manually using non-overlapping (disjoint) sets of 'idxs' vectors. If
you by mistake process the same array (overlapping 'idxs' vectors) in
parallel, there is a risk that the same array and CRMAv2 step will be
processed at the same time. This might cause a conflict when it comes
to storing the result for that array. There are mechanisms in
aroma.affymetrix that try to detect this and *minimize* the risk of
corrupting the data, and if it happens you will most likely get an
exception telling you that there is a conflict. However, this is not
bullet proof, so make sure to process disjoint sets of arrays.
Technical details: In order to make it bullet proof, we would need a
file-locking mechanism that is itself bullet proof, and that turns out
to be a very hard problem, especially on shared file systems [it may
take up to 30 seconds for a file update to be seen by other computers
on an NFS file system]. Making it a cross-platform solution is even
harder. If someone knows a solution, please share it.
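
For instance, one simple way to construct such disjoint index sets is
sketched below (plain base R; the number of batches is just an example
value):

n <- length(getNames(csR));                      # total number of arrays
nbrOfBatches <- 8;                               # example value
batchId <- ceiling(seq_len(n) * nbrOfBatches / n);
idxsList <- split(seq_len(n), batchId);          # disjoint, contiguous batches
# In batch k (one R session/job per batch):
#   csRk <- extract(csR, idxsList[[k]]);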

Finally, below is a "how to" document that I'm writing that should get
you going:

BEGIN DOCUMENT

This document explains how to do CRMAv2 processing on a single array
or on a subset of arrays in a data set. To show how this can be done,
we use the same data and steps as in Vignette 'CRMA v2: Estimation of
total copy numbers using the CRMA v2 method (10K-GWS6)'. It is
assumed that you have the same setup.

IMPORTANT: This approach will only work with methods, such as CRMAv2,
that are truly single-array methods. If you try it with other methods,
for instance CRMA (v1), it will process the arrays for you, but you
will not get the same results when processing each array independently
as when processing it as part of a larger data set. Currently, CRMAv2
is the only truly single-array method.

Start by loading the complete data set:

library("aroma.affymetrix");
verbose <- Arguments$getVerbose(-8, timestamp=TRUE);
cdf <- AffymetrixCdfFile$byChipType("GenomeWideSNP_6", tags="Full");
csR <- AffymetrixCelSet$byName("HapMap270,6.0,CEU,testSet", cdf=cdf);

AffymetrixCelSet:
Name: HapMap270
Tags: 6.0,CEU,testSet
Path: rawData/HapMap270,6.0,CEU,testSet/GenomeWideSNP_6
Platform: Affymetrix
Chip type: GenomeWideSNP_6,Full
Number of arrays: 6
Names: NA06985, NA06991, ..., NA07019
Time period: 2007-03-06 12:13:04 -- 2007-03-06 19:17:16
Total file size: 395.13MB

Then, in order to do CRMAv2 on a subset of the arrays, all we have to
do is extract a new data set consisting of the arrays of interest.
Say the arrays are numbers 2, 3, and 5. To further illustrate that
aroma correctly preserves both the subset and the order of the arrays
throughout the analysis, we will process these arrays in the order
2, 5, 3.

idxs <- c(2,5,3);

Alternatively, if you wish to select the subset based on the names of
the arrays, then use:

names <- c("NA06991", "NA07000", "NA06993");
idxs <- indexOf(csR, names);
print(idxs);

NA06991 NA07000 NA06993
      2       5       3

Extract this subset as a new data set:

csR <- extract(csR, idxs);
print(csR);

AffymetrixCelSet:
Name: HapMap270
Tags: 6.0,CEU,testSet
Path: rawData/HapMap270,6.0,CEU,testSet/GenomeWideSNP_6
Platform: Affymetrix
Chip type: GenomeWideSNP_6,Full
Number of arrays: 3
Names: NA06991, NA07000, NA06993
Time period: 2007-03-06 16:15:53 -- 2007-03-06 19:17:16
Total file size: 197.56MB


From here, proceed with CRMA v2 as usual. It is extremely important
that you use the target="zero" argument throughout, otherwise it will
not be a truly single-array method.

# Allelic crosstalk calibration (CRMAv2 flavor)
acc <- AllelicCrosstalkCalibration(csR, model="CRMAv2");
csC <- process(acc, verbose=verbose);

# Nucleotide-position normalization
bpn <- BasePositionNormalization(csC, target="zero");
csN <- process(bpn, verbose=verbose);

# Probe-level summarization (strands merged, alleles combined)
plm <- AvgCnPlm(csN, mergeStrands=TRUE, combineAlleles=TRUE);
if (length(findUnitsTodo(plm)) > 0) {
  units <- fitCnProbes(plm, verbose=verbose);
  units <- fit(plm, verbose=verbose);
}
ces <- getChipEffectSet(plm);

# Fragment-length normalization
fln <- FragmentLengthNormalization(ces, target="zero");
cesN <- process(fln, verbose=verbose);
print(cesN);

CnChipEffectSet:
Name: HapMap270
Tags: 6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY
Path: plmData/HapMap270,6.0,CEU,testSet,ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY/GenomeWideSNP_6
Platform: Affymetrix
Chip type: GenomeWideSNP_6,Full,monocell
Number of arrays: 3
Names: NA06991, NA07000, NA06993
Time period: 2010-01-04 16:47:13 -- 2010-01-04 16:47:14
Total file size: 80.85MB
Parameters: (probeModel: chr "pm", mergeStrands: logi TRUE,
combineAlleles: logi TRUE)


Note that the subset and the ordering of the arrays are preserved
throughout the analysis:

stopifnot(getNames(csC) == getNames(csR));
stopifnot(getNames(csN) == getNames(csR));
stopifnot(getNames(ces) == getNames(csR));
stopifnot(getNames(cesN) == getNames(csR));

Furthermore, this is also true if some or all of the steps have
already been processed for all or some other arrays. That is, if there
are other arrays in the same data directories, they will not affect
the analysis of this subset. This is the key to doing CRMAv2 in
batches. Try, for instance, to repeat the above with idxs <- c(1,3,4).

END DOCUMENT

Hope this helps and let us know how it goes.

/Henrik

>
> - Sooraj Maharjan (Mayo Clinic)
>


Sooraj

Jan 11, 2010, 5:32:12 PM
to aroma.affymetrix
A) After reading 'Improving processing time'
[http://groups.google.com/group/aroma-affymetrix/web/improving-process...],
I did the following:

1. Ran R from local space (I'm assuming that by "local" you mean
running the program on the same file server where the data is, i.e.
setwd(path/to/data)?).

2. Changed 'ram' to 20, to allow more memory usage. I'm currently
testing on a cluster node that has 16 processors and 128GB of memory,
so I have plenty to play with. However, when I ran "top" at the shell
prompt, the following is what the R program was using. I assumed that
the higher the 'ram' value (the default was 5), the more memory it
would use, but I'd seen it use a larger amount of memory even with
ram=10.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21107 m055555 18 0 783m 683m 5224 R 43 0.5 12:11.62 R

3. After the above two, the estimated time dropped to 17896.1 min =
298.27 hrs = 12.43 days, which is a lot better than 187 days :). Also,
17896/1400 = 12.78 min/array, which is close to your prediction of
5-10 min/array.

B) A few observations while testing memory usage:
1. I changed ram from 20 to 30 and the number of chunks decreased from
313 to 207, and memory usage was as follows:

  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
24858 m047314  18  0  953m 852m 5060 D   14  0.7 7:45.73 R
and
24858 m047314  18  0 1018m 854m 5060 D   15  0.7 8:04.37 R

A few lines of the log file while memory usage was at the above level:

20100109 16:27:59| Getting model fit for 874959 units.
20100109 16:27:59| Creating CEL file...
20100109 16:27:59| Chip type: GenomeWideSNP_6,Full
20100109 16:27:59| Pathname: plmData/run1,ACC,ra,-XY,BPN,-XY,AVG,A+B/GenomeWideSNP_6/probeAffinities.CEL
20100109 16:27:59| Returning already existing file.
20100109 16:28:00| Creating CEL file...done
20100109 16:28:04| Number units per chunk: 4231
          used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 2467545 131.8    3708127 198.1  3708127 198.1
Vcells 2888272  22.1    9244554  70.6 10920626  83.4
20100109 16:28:10| Fitting chunk #1 of 207...
20100109 16:28:10| Units:
 int [1:4231] 60631 60632 60633 60634 60635 60636 60637 60638 60639 60640 ...
20100109 16:28:10| Reading probe intensities from 1418 arrays...
20100109 16:28:10| Identifying CDF cell indices...
$`SNP_A-1997587`
$`SNP_A-1997587`$groups
$`SNP_A-1997587`$groups$C
$`SNP_A-1997587`$groups$C$indices
[1] 208700 4316042 4033394


$`SNP_A-1997587`$groups$G
$`SNP_A-1997587`$groups$G$indices
[1] 208699 4316041 4033393


20100109 16:28:14| Identifying CDF cell indices...done

- Estimated time dropped to 7753.2 min = 129.22 hrs = 5.38 days, and
7753.2/1400 arrays = 5.54 min/array.
- This finding was absolutely stunning!!! ...so I tried more...

2. ram=50 on a machine with mem=53.16670 GB (this was run on the grid):
Estimated time left: 10335.8 min = 172.26 hrs = 7.18 days
10335.8/1400 = 7.38 min/array
chunk #1 of 124
vmem=2.008G, maxvmem=2.475G

3. ram=100 on the same machine where I was running an interactive job:
Estimated time left: 5302.1 min = 88.37 hrs = 3.68 days
chunk #1 of 62
memory: 3.0GB

4. ram=200 on the same machine, interactive job (currently running):
Estimated time left: 4005.1 min = 2.78 days
chunk #1 of 31
memory: ~4.0GB

Question:
- My initial plan was to subset the arrays and spread the 1400 arrays
out as individual jobs over the grid as a grid array job. Since each
would take 5-10 minutes, as we saw above, all jobs could finish in
anywhere from 5 minutes to 10 days, depending on the available slots.
However, since the last memory setting of ram=200 gave an estimate of
~3 days, I'm currently running that, and it should be done in another
34 hours.

Meanwhile, I thought I should ask you about this, as I'm a bit
confused about merging the jobs once they've finished running. You had
a code sample, which I copied below.

dataSet <- "HapMap270,6.0,CEU,testSet";
tags <- "ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY";
chipType <- "GenomeWideSNP_6";
cesN <- CnChipEffectSet$byName(dataSet, tags=tags, chipType=chipType,
mergeStrands=TRUE, combineAlleles=TRUE);
print(cesN);

So, if I ran 1400 jobs on the grid and each returned an object
back... how do I go about merging all of them? Do you think it is
better to run it all on one machine that can handle big jobs, like the
one I'm currently using, or to go the route I'm suggesting (array
job)? Also, please let me know if you have any other suggestions.
Thanks!

- Sooraj



Henrik Bengtsson

Jan 11, 2010, 10:56:49 PM
to aroma-affymetrix
Wow - really wow!

I've added this thread to 'User success stories' on the "how to" page
'Improve processing time' [http://aroma-project.org/node/105].

Thanks for the detailed benchmarking. It is certainly helpful. Here
is a cut'n'paste tabular summary for the record:

ram:  Real memory usage:  Processing time:
 20   0.8GB               298 hours  (12.78 min/array; 100%;  4.47x)
 30   1.0GB               129 hours  (5.54 min/array;  43.3%; 1.94x)
100   3.0GB               88.4 hours (3.79 min/array;  29.7%; 1.33x)
200   4.0GB               66.6 hours (2.86 min/array;  22.3%; 1.00x)

(All of the above were run on the same machine; the ram=50 result was
obtained on the grid and is hence not listed.)

[Please correct any of the above, if I got it wrong.]

The 'ram' argument is basically a scale factor for how many units are
processed in each chunk. From your benchmarking it is clear that the
memory profile is not linear in the 'ram' option. Instead it shows
that there is an initial overhead regardless of chunk size/'ram', and
that memory use then increases slowly. Ideally we would have better
knowledge of the relationship between the 'ram' argument and the real
memory usage. This relationship will certainly depend on the chip
type/CDF, and it is probably also operating-system, hardware (cache)
and file-system dependent. It is probably not linear in the number of
arrays. It would be nice if the software/R could measure the memory
usage and adjust so that it stays within a certain range.
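
As a side note, one can at least watch R's own memory use from within
a session; a rough base-R sketch (these are the same numbers that gc()
prints in the log excerpt above):

memUsedMB <- function() {
  sum(gc()[, 2]);   # column 2 of gc()'s output is the "(Mb)" used column
}
print(memUsedMB());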

More comments below:

On Mon, Jan 11, 2010 at 2:32 PM, Sooraj <soor...@gmail.com> wrote:
> A) After reading 'Improving processing time' [http://groups.google.com/
> group/aroma-affymetrix/web/improving-process...], I did following:
>
> 1. Ran R from local space (I'm assuming when you say "local" is to run
> program on the same file server where the data is. So, setwd(path/to/
> data)?).

I mean a local hard drive, so that data does not have to be passed
back and forth over the file-system network. On Linux, it is hard for
a user to tell whether a certain directory is local or not; it is
better to ask the sysadmin. Typically /tmp/ and /var/tmp/ are on the
local machine.

Did you move the data? Did you see an improvement from doing that,
without adjusting the 'ram' option?

>
> 2. Changed "Ram" to 20, to allow more memory usage. I'm currently
> testing it on a cluster that has 16 processors and 128GB memory, so I
> have plenty to play with. However, when I ran "top" on shell prompt,
> following was what the "R" progam was using. I was assuming higher the
> number of "ram" (default was 5), more memory it would use, but I'd
> seen it use higher amount of memory even with ram=10.
>
>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 21107 m055555   18   0  783m 683m 5224 R   43  0.5  12:11.62 R

I'm not sure I understand what your question is. Do you mean that you
would expect a doubling of the memory usage? The reason you don't see
a large increase is that there is an initial overhead from the package
and other things stored in memory. As explained above, 'ram' is
basically a scale factor adjusting how many units are processed per
chunk. The more arrays there are, the fewer units are processed per
chunk. With the same 'ram' value, a data set with 200 arrays processes
half the number of units per chunk compared with a data set with 100
arrays.

Also, when you say the default was ram=5, do you mean that you
previously used (or changed the settings) so that ram == 5? The
default in the aroma.* framework is ram=1. How to change the aroma.*
settings is explained under 'Settings' on the web page
[http://aroma-project.org/settings].
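
For the record, and with the caveat that I'm writing this from memory
(the settings page above is the authoritative reference), changing the
value for the current session looks roughly like this:

library("aroma.affymetrix");
setOption(aromaSettings, "memory/ram", 20.0);  # in effect for this session
# saveAnywhere(aromaSettings);                 # uncomment to make it permanent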

>
> 3. After the above two, the estimated time dropped to 17896.1min =
> 298.27hrs = 12.43days, which is a lot better than 187 days :). Also,
> 17896/1400 = 12.78 min/array, which is close to your prediction of
> 5-10 min/array.

Repeating my question from above: can you clarify/confirm that this
improvement was due to both (i) a change of location of the data files
and (ii) an increase of 'ram', and not just an increase of 'ram'?

Yup, now you're down to a time scale where it is worth the wait - I
always say one should compare the computing time to the time it took
to run the assays on those 1400 arrays. Being able to preprocess the
data in 3 days is then not bad at all. ...and you'll probably spend
months/years analyzing the results anyway.

>
> Meanwhile, I thought I should ask you about this as I'm a bit confused
> in merging the jobs once they finished running. You had a code sample
> that I copied below.
>
> dataSet <- "HapMap270,6.0,CEU,testSet";
> tags <- "ACC,ra,-XY,BPN,-XY,AVG,A+B,FLN,-XY";
> chipType <- "GenomeWideSNP_6";
> cesN <- CnChipEffectSet$byName(dataSet, tags=tags, chipType=chipType,
> mergeStrands=TRUE, combineAlleles=TRUE);
> print(cesN);
>
> So, if I ran 1400 jobs on the grid and each returned an object
> back...how do I go about merging all of them?

Remember, the data is sitting on the file system; it is not sitting in
memory (RAM). This means that any host running the above lines of code
will have access to the complete data set.

> Do you think it is
> better to run all if you have a machine that can handle big jobs like
> I'm running currently, or go about this route I'm suggesting (array
> job)?

If you process the data set in K subsets of arrays, you should see
roughly a K-times speedup. Of course, there might be bottlenecks in
the file system preventing a linear speedup, but considering that I/O
is only a small fraction of the processing time, that shouldn't be too
much of a problem unless you use a huge number of hosts. Now, if you
have a fancy grid system that can distribute the workload of a single
R session automatically, then you might not see an improvement.
However, I don't think a grid system can do this, but I might also be
completely wrong.
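
As a sketch of what each grid job could look like, the hypothetical
script below reads a batch index from the command line, extracts that
batch of arrays, and then runs the CRMAv2 steps from the "how to"
document above; the number of batches and the split rule are example
choices and must of course be identical across all jobs.

# Hypothetical per-batch script, e.g. launched as:  Rscript crmav2_batch.R 3
library("aroma.affymetrix");
batch <- as.integer(commandArgs(trailingOnly=TRUE)[1]);
nbrOfBatches <- 8;                          # example value; the same in every job
cdf <- AffymetrixCdfFile$byChipType("GenomeWideSNP_6", tags="Full");
csR <- AffymetrixCelSet$byName("HapMap270,6.0,CEU,testSet", cdf=cdf);
n <- length(getNames(csR));
idxs <- which(ceiling(seq_len(n) * nbrOfBatches / n) == batch);  # disjoint per batch
csR <- extract(csR, idxs);
# ...then run the CRMAv2 steps from the document above on this subset.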

> Also, please let me know of other suggestings if you have any.

I think you nailed it.

Currently there is a lot of internal overhead when fitting the PLM,
due to wrapping and unwrapping of data in nested list structures that
mirror the structure of the CDF file. I know of design changes that
would speed this up by several orders of magnitude, but they would
require a lot of work and it has simply never made it to the top of my
priority list.

> Thanks!

No worries.

/Henrik

Sooraj

Jan 19, 2010, 1:27:36 PM
to aroma.affymetrix
> > 1. Ran R from local space (I'm assuming when you say "local" is to run
> > program on the same file server where the data is. So, setwd(path/to/
> > data)?).
>
> I mean a local hard drive, so that data does not have to be passed
> back and forth over the file system network.  On Linux, it is hard for
> a user to see whether a certain directory is local or not.  It is
> better to ask the sys admin.  Typically /tmp/ and /vartmp/ are on the
> local machines.
>
> Did you move data?  Did you see an improvement from doing that,
> without adjusting the 'ram' option?
>

I didn't move the data. The results were from files sitting on a file
server. We don't have much space to play with, so I simply used more
ram, as that was the more feasible route.

> > 3. After the above two, the estimated time dropped to 17896.1min =
> > 298.27hrs = 12.43days, which is a lot better than 187 days :). Also,
> > 17896/1400 = 12.78 min/array, which is close to your prediction of
> > 5-10 min/array.
>
> Repeating my above question: Can you clarify/confirm that this
> improvement was due to both (i) a change of location of data files and
> (ii) an increase of 'ram'?  ...and not just an increase of 'ram'.
>

This improvement was only due to (ii), the increase of ram. As stated
above, I didn't move the data files.

Also, I wanted to mention that, apart from the "Estimated time" that I
got out of the log, it took about another day to finish running
additional functions. So the entire run for 1400+ samples completed in
~4 days. Thanks a lot for your help.

- Sooraj
