However, you mentioned in the latter thread that with newer version all of the steps (splitting and merging) would be handled. Since it's been a year I'm hoping there's a newer version out there which can do all of the above? Can you please post if it's taken care or whether it still needs to be done manually? Thanks!
P.S. I'm running CRMAv2, and it was brought up at my workplace that since CRMAv2 works one chip at a time, the final result is the same regardless of it run in batches or all at once, so why wasn't it already included in the program to run in batches instead of forcing the end user to do it manually?
That is approx 269324.6/1400 = 192 min/array = 3:12h/array. If you do CRMAv2 on a single array, you should typically expect 5-10 mins/array. I suspect that you are paying a large price of so called cache misses on your file system, because each chunk processed iterates over all 1400+ arrays making it hard for the file system to optimize the cache and do so called "read ahead".
Also, make sure to read Page 'Improving processing time' [http://groups.google.com/group/aroma-affymetrix/web/improving-process...], which has some useful suggestions. One of the comments there is that you will pay a price if you do the analysis over a file network (less problematic on fast networks).
If you have a slow file-system network, at least try to use a file cache directory on a local hard drive. See Page 'Caching of computational expensive tasks (memoization)' [http://groups.google.com/group/aroma-affymetrix/web/caching] on how to specify the cache directory. This will useful also when you run CRMAv2 in batches on different machines.
Given your current performance on 1400+ arrays, it would be interesting to all of us to see how much improvement you can get from the above steps, without trying to process things in chunks. If you could do some basic benchmarking and share it here, we can learn more on how to avoid bottlenecks.
> However, you mentioned in the latter thread that with newer version > all of the steps (splitting and merging) would be handled. Since it's > been a year I'm hoping there's a newer version out there which can do > all of the above? Can you please post if it's taken care or whether it > still needs to be done manually? Thanks!
Things have improved since that thread. The main thing is that splitting can be done via a simple
csR <- extract(csR, idxs);
before running CRMAv2 where 'idxs' is specifying a subset of the arrays. See below DOCUMENT for details.
The merging still has to be done, but that now comes down to loading all chip effect files at once. Something like this:
> P.S. I'm running CRMAv2, and it was brought up at my workplace that > since CRMAv2 works one chip at a time, the final result is the same > regardless of it run in batches or all at once, so why wasn't it > already included in the program to run in batches instead of forcing > the end user to do it manually?
It is pretty much done now, but you still have to launch your batches manually using non-overlapping (disjoint) sets of 'idxs' vectors. If you by mistake process the same array (overlapping 'idxs' vectors) in parallel, there is a risk that the same array and CRMAv2 step will be processed at the same time. This might cause a conflict when it comes to storing the result for that array. There are mechanisms in aroma.affymetrix that tries to detect this and *minimize* the risk for corrupting the data, and if it happens you will most likely get an exception telling you that there is a conflict. However, this is not bullet proof, so make sure to process disjoint sets of arrays. Technical details: In order to make it bullet proof we a file-locking mechanism that is bullet proof, and that turns out to be a very hard problem, especially on shared file systems [it may take up to 30 seconds for a file update to be seen by other computers on an NFS file systems]. Making it a cross-platform solution is even harder. If someone knows a solution, please share it.
Finally, below is a "how to" document that I'm writing that should get you going:
BEGIN DOCUMENT
This document explains how to do CRMAv2 processing on a single array or on a subset of arrays in a data set. To show how this can be done, we use the same data and steps as in Vignette 'CRMA v2: Estimation of total copy numbers using the CRMA v2 method (10K-GWS6)'. It is assumed that you have the same setup.
IMPORTANT: This approach will only work with methods such as CRMAv2 that are truly single-array methods. If you try it on other methods, for instance CRMA (v1), it will process the arrays for you, but you will not get the same results if you process each array independently or as part of a larger data set. Currently, it is only CRMAv2 that is a truly single-array method.
AffymetrixCelSet: Name: HapMap270 Tags: 6.0,CEU,testSet Path: rawData/HapMap270,6.0,CEU,testSet/GenomeWideSNP_6 Platform: Affymetrix Chip type: GenomeWideSNP_6,Full Number of arrays: 6 Names: NA06985, NA06991, ..., NA07019 Time period: 2007-03-06 12:13:04 -- 2007-03-06 19:17:16 Total file size: 395.13MB
Then, in order to do CRMAv2 on a subset of the arrays, all we have to do is to extract a new data set consisting of the arrays of interest. Say the arrays are number 2, 3, and 5. To further illustrate that aroma correctly preserves both the subset and the order* of the arrays throughout the analysis, we will process these arrays in the order of 2, 5, and 3.
idxs <- c(2,5,3);
Alternatively, if you wish to select the subset based on the names of the arrays, the use:
AffymetrixCelSet: Name: HapMap270 Tags: 6.0,CEU,testSet Path: rawData/HapMap270,6.0,CEU,testSet/GenomeWideSNP_6 Platform: Affymetrix Chip type: GenomeWideSNP_6,Full Number of arrays: 3 Names: NA06991, NA07000, NA06993 Time period: 2007-03-06 16:15:53 -- 2007-03-06 19:17:16 Total file size: 197.56MB
From here, proceed with CRMA v2 as usual. It is extremely important that you use the target="zero" arguments throughout, otherwise it will not be a truly single-array method.
Furthermore, this is also true if some or all of the steps have already been processed for all or other arrays. That is, if there are other arrays in the same data directories, they will not affect the analysis of this subset. This is the key for doing CRMAv2 in batches. Try for instance to repeat the above with idxs <- c(1,3,4).
> -- > When reporting problems on aroma.affymetrix, make sure 1) to run the latest version of the package, 2) to report the output of sessionInfo() and traceback(), and 3) to post a complete code example.
> You received this message because you are subscribed to the Google Groups "aroma.affymetrix" group. > To post to this group, send email to aroma-affymetrix@googlegroups.com > To unsubscribe from this group, send email to aroma-affymetrix-unsubscribe@googlegroups.com > For more options, visit this group at http://groups.google.com/group/aroma-affymetrix?hl=en
A) After reading 'Improving processing time' [http://groups.google.com/ group/aroma-affymetrix/web/improving-process...], I did following:
1. Ran R from local space (I'm assuming when you say "local" is to run program on the same file server where the data is. So, setwd(path/to/ data)?).
2. Changed "Ram" to 20, to allow more memory usage. I'm currently testing it on a cluster that has 16 processors and 128GB memory, so I have plenty to play with. However, when I ran "top" on shell prompt, following was what the "R" progam was using. I was assuming higher the number of "ram" (default was 5), more memory it would use, but I'd seen it use higher amount of memory even with ram=10.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 21107 m055555 18 0 783m 683m 5224 R 43 0.5 12:11.62 R
3. After the above two, the estimated time dropped to 17896.1min = 298.27hrs = 12.43days, which is a lot better than 187 days :). Also, 17896/1400 = 12.78 min/array, which is close to your prediction of 5-10 min/array.
B) Few observations while testing memory usage: 1. I changed ram from 20 to 30 and the number of chunks decreased from 313 to 207, and the memory usage were as follows: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 24858 m047314 18 0 953m 852m 5060 D 14 0.7 7:45.73 R and 24858 m047314 18 0 1018m 854m 5060 D 15 0.7 8:04.37 R
Few lines of log file while the memory usage was above were as follows: 20100109 16:27:59| Getting model fit for 874959 units. 20100109 16:27:59| Creating CEL file... 20100109 16:27:59| Chip type: GenomeWideSNP_6,Full 20100109 16:27:59| Pathname: plmData/run1,ACC,ra,-XY,BPN,-XY,AVG,A+B/ GenomeWideSNP_6/probeAffinities.CEL 20100109 16:27:59| Returning already existing file. 20100109 16:28:00| Creating CEL file...done 20100109 16:28:04| Number units per chunk: 4231 used (Mb) gc trigger (Mb) max used (Mb) Ncells 2467545 131.8 3708127 198.1 3708127 198.1 Vcells 2888272 22.1 9244554 70.6 10920626 83.4 20100109 16:28:10| Fitting chunk #1 of 207... 20100109 16:28:10| Units: int [1:4231] 60631 60632 60633 60634 60635 60636 60637 60638 60639 60640 ... 20100109 16:28:10| Reading probe intensities from 1418 arrays... 20100109 16:28:10| Identifying CDF cell indices... $`SNP_A-1997587` $`SNP_A-1997587`$groups $`SNP_A-1997587`$groups$C $`SNP_A-1997587`$groups$C$indices [1] 208700 4316042 4033394
- Estimated time dropped to 7753.2min = 129.22hrs = 5.38 days, and 7753.2/1400 arrays = 5.538min/array. - This finding was absolutely stunning!!!...so I tried more...
2. ram=50 on a machine with mem=53.16670 GB (this was run on the grid) Estimated time left: 10335.8min = 172.26hrs = 7.18 days 10335.8/1400 = 7.38 min/array chunk #1 of 124 vmem=2.008G, maxvmem=2.475G
3. ram=100 on same machine I was running interactive job Estimated time left: 5302.1min = 88.37 hrs = 3.68 days chunk #1 of 62 memory: 3.0GB
4. ram=200 on the same machine, interactive job (currently running) Estimated time left: 4005.1min = 2.78 days chunk #1 of 31 memory: ~4.0GB
Question: - My initial plan was to subset arrays and spread out 1400 arrays as indiviaul jobs over the grid as grid-array-job. Since it'd take 5-10 minutes as we saw above, depending on the available slots all jobs could finish from 5 minutes to 10 days. However, since the last memory input of "200" estimated ~3 days I'm currently running it, and it should be done in another 34 hours.
Meanwhile, I thought I should ask you about this as I'm a bit confused in merging the jobs once they finished running. You had a code sample that I copied below.
So, if I ran 1400 jobs on the grid and each returned an object back...how do I go about merging all of them? Do you think it is better to run all if you have a machine that can handle big jobs like I'm running currently, or go about this route I'm suggesting (array job)? Also, please let me know of other suggestings if you have any. Thanks!
- Sooraj
On Jan 4, 8:37 pm, Henrik Bengtsson <henrik.bengts...@gmail.com> wrote:
> On Mon, Jan 4, 2010 at 1:43 PM,Sooraj<soora...@gmail.com> wrote: > > Hi Henrik,
> > I'm continuing a thread exactly after a year that I found following > > this linkhttp://www.mail-archive.com/aroma-affymetrix@googlegroups.com/msg0028.... > > I'm running a very large number of samples (~1400+), and the estimated > > time was 269324.6 minutes, roughly 187 days.
> That is approx 269324.6/1400 = 192 min/array = 3:12h/array. If you do > CRMAv2 on a single array, you should typically expect 5-10 mins/array. > I suspect that you are paying a large price of so called cache misses > on your file system, because each chunk processed iterates over all > 1400+ arrays making it hard for the file system to optimize the cache > and do so called "read ahead".
> Also, make sure to read Page 'Improving processing time' > [http://groups.google.com/group/aroma-affymetrix/web/improving-process...], > which has some useful suggestions. One of the comments there is that > you will pay a price if you do the analysis over a file network (less > problematic on fast networks).
> If you have a slow file-system network, at least try to use a file > cache directory on a local hard drive. See Page 'Caching of > computational expensive tasks (memoization)' > [http://groups.google.com/group/aroma-affymetrix/web/caching] on how > to specify the cache directory. This will useful also when you run > CRMAv2 in batches on different machines.
> Given your current performance on 1400+ arrays, it would be > interesting to all of us to see how much improvement you can get from > the above steps, without trying to process things in chunks. If you > could do some basic benchmarking and share it here, we can learn more > on how to avoid bottlenecks.
> > The link above talks > > about splitting datasets into chunks and running them in parallel, and > > also directs to another thread at this link > >http://groups.google.com/group/aroma-affymetrix/browse_thread/thread/... > > where it talks about appending the completed batched runs.
> > However, you mentioned in the latter thread that with newer version > > all of the steps (splitting and merging) would be handled. Since it's > > been a year I'm hoping there's a newer version out there which can do > > all of the above? Can you please post if it's taken care or whether it > > still needs to be done manually? Thanks!
> Things have improved since that thread. The main thing is that > splitting can be done via a simple
> csR <- extract(csR, idxs);
> before running CRMAv2 where 'idxs' is specifying a subset of the > arrays. See below DOCUMENT for details.
> The merging still has to be done, but that now comes down to loading > all chip effect files at once. Something like this:
> > P.S. I'm running CRMAv2, and it was brought up at my workplace that > > since CRMAv2 works one chip at a time, the final result is the same > > regardless of it run in batches or all at once, so why wasn't it > > already included in the program to run in batches instead of forcing > > the end user to do it manually?
> It is pretty much done now, but you still have to launch your batches > manually using non-overlapping (disjoint) sets of 'idxs' vectors. If > you by mistake process the same array (overlapping 'idxs' vectors) in > parallel, there is a risk that the same array and CRMAv2 step will be > processed at the same time. This might cause a conflict when it comes > to storing the result for that array. There are mechanisms in > aroma.affymetrix that tries to detect this and *minimize* the risk for > corrupting the data, and if it happens you will most likely get an > exception telling you that there is a conflict. However, this is not > bullet proof, so make sure to process disjoint sets of arrays. > Technical details: In order to make it bullet proof we a file-locking > mechanism that is bullet proof, and that turns out to be a very hard > problem, especially on shared file systems [it may take up to 30 > seconds for a file update to be seen by other computers on an NFS file > systems]. Making it a cross-platform solution is even harder. If > someone knows a solution, please share it.
> Finally, below is a "how to" document that I'm writing that should get > you going:
> BEGIN DOCUMENT
> This document explains how to do CRMAv2 processing on a single array > or on a subset of arrays in a data set. To show how this can be done, > we use the same data and steps as in Vignette 'CRMA v2: Estimation of > total copy numbers using the CRMA v2 method (10K-GWS6)'. It is > assumed that you have the same setup.
> IMPORTANT: This approach will only work with methods such as CRMAv2 > that are truly single-array methods. If you try it on other methods, > for instance CRMA (v1), it will process the arrays for you, but you > will not get the same results if you process each array independently > or as part of a larger data set. Currently, it is only CRMAv2 that is > a truly single-array method.
(All of the above was on the same machine; the ram=50 results was done on the grid and hence not listed).
[Please correct any of the above, if I got it wrong.]
The 'ram' argument is basically a scale factor of how many arrays/units is processed in each chunk. From your benchmarking it is clear that the memory profile is not linear with the 'ram' options. Instead it shows that there is an initial overhead regardless of chunk size/'ram', and then it increases slowly. Ideally we would have better knowledge of the relationship between argument 'ram' and the really memory usage. This relationship will certainly depend on chip type/CDF, and it is probably also operating system, hardware (cache) and file system dependent. It is probably not linear in the number of arrays. It would be nice if the software/R could measure the memory usage and adjust so that the memory usage stays within a certain range.
More comments below:
On Mon, Jan 11, 2010 at 2:32 PM, Sooraj <soora...@gmail.com> wrote: > A) After reading 'Improving processing time' [http://groups.google.com/ > group/aroma-affymetrix/web/improving-process...], I did following:
> 1. Ran R from local space (I'm assuming when you say "local" is to run > program on the same file server where the data is. So, setwd(path/to/ > data)?).
I mean a local hard drive, so that data does not have to be passed back and forth over the file system network. On Linux, it is hard for a user to see whether a certain directory is local or not. It is better to ask the sys admin. Typically /tmp/ and /vartmp/ are on the local machines.
Did you move data? Did you see an improvement from doing that, without adjusting the 'ram' option?
> 2. Changed "Ram" to 20, to allow more memory usage. I'm currently > testing it on a cluster that has 16 processors and 128GB memory, so I > have plenty to play with. However, when I ran "top" on shell prompt, > following was what the "R" progam was using. I was assuming higher the > number of "ram" (default was 5), more memory it would use, but I'd > seen it use higher amount of memory even with ram=10.
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 21107 m055555 18 0 783m 683m 5224 R 43 0.5 12:11.62 R
I'm not sure I understand what your question is. You mean you would expect a doubling of the memory usage? The reason you don't see a large increase is that there is an initial overhead of the package and other things stored in memory. As explained above, 'ram' is basically a scale factor adjusting how many units per chunk is process. The more arrays there are, the fewer units are processed. With the same 'ram' value, a data set with 200 arrays processes half the number of units per chunk as a data set with 100 arrays.
Also, when you say default was ram=5, you mean that you previously used (or changed the settings) so that ram == 5? The default of the aroma.* frame is ram=1. How to change the aroma.* settings is explained under 'Settings' on the web page [http://aroma-project.org/settings].
> 3. After the above two, the estimated time dropped to 17896.1min = > 298.27hrs = 12.43days, which is a lot better than 187 days :). Also, > 17896/1400 = 12.78 min/array, which is close to your prediction of > 5-10 min/array.
Repeating my above question: Can you clarify/confirm that this improvement was due to both (i) a change of location of data files and (ii) an increase of 'ram'? ...and not just an increase of 'ram'.
> B) Few observations while testing memory usage: > 1. I changed ram from 20 to 30 and the number of chunks decreased from > 313 to 207, and the memory usage were as follows: > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ > COMMAND > 24858 m047314 18 0 953m 852m 5060 D 14 0.7 7:45.73 R > and > 24858 m047314 18 0 1018m 854m 5060 D 15 0.7 8:04.37 R
> Few lines of log file while the memory usage was above were as > follows: > 20100109 16:27:59| Getting model fit for 874959 units. > 20100109 16:27:59| Creating CEL file... > 20100109 16:27:59| Chip type: GenomeWideSNP_6,Full > 20100109 16:27:59| Pathname: plmData/run1,ACC,ra,-XY,BPN,-XY,AVG,A+B/ > GenomeWideSNP_6/probeAffinities.CEL > 20100109 16:27:59| Returning already existing file. > 20100109 16:28:00| Creating CEL file...done > 20100109 16:28:04| Number units per chunk: 4231 > used (Mb) gc trigger (Mb) max used (Mb) > Ncells 2467545 131.8 3708127 198.1 3708127 198.1 > Vcells 2888272 22.1 9244554 70.6 10920626 83.4 > 20100109 16:28:10| Fitting chunk #1 of 207... > 20100109 16:28:10| Units: > int [1:4231] 60631 60632 60633 60634 60635 60636 60637 60638 60639 > 60640 ... > 20100109 16:28:10| Reading probe intensities from 1418 arrays... > 20100109 16:28:10| Identifying CDF cell indices... > $`SNP_A-1997587` > $`SNP_A-1997587`$groups > $`SNP_A-1997587`$groups$C > $`SNP_A-1997587`$groups$C$indices > [1] 208700 4316042 4033394
> - Estimated time dropped to 7753.2min = 129.22hrs = 5.38 days, and > 7753.2/1400 arrays = 5.538min/array. > - This finding was absolutely stunning!!!...so I tried more...
> 2. ram=50 on a machine with mem=53.16670 GB (this was run on the grid) > Estimated time left: 10335.8min = 172.26hrs = 7.18 days > 10335.8/1400 = 7.38 min/array > chunk #1 of 124 > vmem=2.008G, maxvmem=2.475G
> 3. ram=100 on same machine I was running interactive job > Estimated time left: 5302.1min = 88.37 hrs = 3.68 days > chunk #1 of 62 > memory: 3.0GB
> 4. ram=200 on the same machine, interactive job (currently running) > Estimated time left: 4005.1min = 2.78 days > chunk #1 of 31 > memory: ~4.0GB
> Question: > - My initial plan was to subset arrays and spread out 1400 arrays as > indiviaul jobs over the grid as grid-array-job. Since it'd take 5-10 > minutes as we saw above, depending on the available slots all jobs > could finish from 5 minutes to 10 days. However, since the last memory > input of "200" estimated ~3 days I'm currently running it, and it > should be done in another 34 hours.
Yupp, now your down to a time scale where it is worth waiting - I always say one should compare the computer time to the time it takes to run the assay of those 1400 arrays. Being able to preprocess the data in 3 days is then not bad at all. ...and you'll probably spend months/years analysis the results anyway.
> Meanwhile, I thought I should ask you about this as I'm a bit confused > in merging the jobs once they finished running. You had a code sample > that I copied below.
> So, if I ran 1400 jobs on the grid and each returned an object > back...how do I go about merging all of them?
Remember, the data is sitting on the file system. It is not sitting in memory (RAM). This means that any host running the above lines of code, will have access to the complete data set.
> Do you think it is > better to run all if you have a machine that can handle big jobs like > I'm running currently, or go about this route I'm suggesting (array > job)?
If you process the data set in K subsets of arrays you should see roughly a K times speed up. Of course, there might be bottlenecks in the file system preventing a linear speedup, but considered that I/O is only a small fraction of the processing time, that shouldn't be too much of a problem unless you use a huge number of hosts. Now, if you have a fancy grid system that can distribute the workload of a single R session automatically, then you might not see an improvement. However, I don't think a grid system can do this, but I might also be completely wrong
> Also, please let me know of other suggestings if you have any.
I think you nailed it.
Currently there is a lot of internal overhead when fitting the PLM due to wrapping and unwrapping of data in nested list structure that are shaped as the structure of the CDF file. I know of design changes that would speed up this up several orders of magnitude, but it would require a lot of work and it has simply never made it up to the top of my priority list.
> On Jan 4, 8:37 pm, Henrik Bengtsson <henrik.bengts...@gmail.com> > wrote: >> Hi.
>> On Mon, Jan 4, 2010 at 1:43 PM,Sooraj<soora...@gmail.com> wrote: >> > Hi Henrik,
>> > I'm continuing a thread exactly after a year that I found following >> > this linkhttp://www.mail-archive.com/aroma-affymetrix@googlegroups.com/msg0028.... >> > I'm running a very large number of samples (~1400+), and the estimated >> > time was 269324.6 minutes, roughly 187 days.
>> That is approx 269324.6/1400 = 192 min/array = 3:12h/array. If you do >> CRMAv2 on a single array, you should typically expect 5-10 mins/array. >> I suspect that you are paying a large price of so called cache misses >> on your file system, because each chunk processed iterates over all >> 1400+ arrays making it hard for the file system to optimize the cache >> and do so
> > 1. Ran R from local space (I'm assuming when you say "local" is to run > > program on the same file server where the data is. So, setwd(path/to/ > > data)?).
> I mean a local hard drive, so that data does not have to be passed > back and forth over the file system network. On Linux, it is hard for > a user to see whether a certain directory is local or not. It is > better to ask the sys admin. Typically /tmp/ and /vartmp/ are on the > local machines.
> Did you move data? Did you see an improvement from doing that, > without adjusting the 'ram' option?
I didn't move data. The results were from the files sitting in a file server. We don't have much space to play with, so I simply used more ram as that was a more feasible route.
> > 3. After the above two, the estimated time dropped to 17896.1min = > > 298.27hrs = 12.43days, which is a lot better than 187 days :). Also, > > 17896/1400 = 12.78 min/array, which is close to your prediction of > > 5-10 min/array.
> Repeating my above question: Can you clarify/confirm that this > improvement was due to both (i) a change of location of data files and > (ii) an increase of 'ram'? ...and not just an increase of 'ram'.
This improvement was only due to (ii) increase of ram. As stated above, I didn't move the data files.
Also, I wanted to mention that apart from the "Estimated time" that I got out of the log, it took about another day to finish up running additional functions. So, the entire run for 1400+ samples completed in ~4days. Thanks a lot for your help.