Corset with big datasets.

Fernando Razo

Aug 29, 2017, 11:48:06 AM
to corset-project
Hi Nadia,
I'm working with a big dataset, about 200,000,000 2x150 NextSeq reads, and I'm attempting to use Corset as a way to reduce complexity. I'm aligning with bowtie2 using -a --very-sensitive --threads 32 --score-min L,-0.1,-0.1 and feeding the resulting bam files to Corset. Reading the bam files and writing the intermediate summary files took no time at all, but the clustering has now been running for 3 days and is down to 214,600 clusters (from an initial 260,752), so it's going to take about two weeks to finish.
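For reference, the full command looks something like this (the index name and read file names are placeholders, and I'm assuming paired-end input since these are 2x150 reads):

  bowtie2 -a --very-sensitive --threads 32 --score-min L,-0.1,-0.1 \
      -x assembly_index -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz \
      | samtools view -bS - > sample.bam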
I'm using a workstation with 32 cores and 256 GB of RAM, but the run is only using 3% of the CPU and 10% of the memory. I've been reading other posts on the group searching for a solution. I've already read that the hierarchical clustering algorithm is not easy to run in parallel, so the only other option I've seen is reducing the number of alignments reported by bowtie2. I've seen this on the wiki:
If you have a large dataset, you might like to limit the number of reported alignments to a large (but finite) number, as this will be faster and the resulting bam files will be smaller. In bowtie you could do this with -k 40 (40 reported alignments).
So I was wondering: is 40 a suggested number, or just an example of how to use the -k option? How would you estimate a proper number of alignments (how large should it be)? And lastly, is there anything else you can suggest to speed up the clustering part?
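If I understand the wiki correctly, the only change to my command above would be swapping -a for the cap, for example:

  bowtie2 -k 40 --very-sensitive --threads 32 --score-min L,-0.1,-0.1 \
      -x assembly_index -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz \
      | samtools view -bS - > sample.bam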

Thanks! (and sorry for the long fuzzy post)
Fernando. 

Nadia Davidson

Aug 31, 2017, 8:29:59 PM
to corset-project
Hi Fernando,

It sounds like the clustering is going to take a very long time for you. Ideally, about 100,000 contigs or fewer should be going into the clustering for it to finish in a reasonable time.

These are the options I would recommend (in the following order):
- Using the latest Corset (1.06), play with the command line options -x and -l (see https://github.com/Oshlack/Corset/releases). You should be able to do this using the summary files as input (so no need to remap); there's a sketch of this after the list.
- Do some preliminary clustering with tools like cd-hit-est or BBMap's dedupe. Both do a good job of removing fully redundant (i.e. identical or near-identical) contigs, which may help bring down the number of contigs. You'd then need to remap and rerun Corset (also sketched below).
- Play with the alignment settings. I think this is less likely to help, as you've already got pretty strict settings. You should align the reads as paired if you're not already doing that. Playing with -k might help, but I'm a bit pessimistic, as it hasn't helped in some of my past analyses (it can reduce the bam file size and speed up reading the bams in, but less so the clustering step).
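For the first option, rerunning from the existing summary files would look something like this (the file names and the -x/-l values here are only placeholders; check corset --help and the release notes for what the options mean and which values suit your data):

  corset -i corset -x 5 -l 10 sample1.bam.corset-reads sample2.bam.corset-reads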
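For the second option, a cd-hit-est run to collapse near-identical contigs might look like this (the identity threshold and file names are illustrative, so tune -c for your assembly):

  cd-hit-est -i transcripts.fasta -o transcripts_nr.fasta -c 0.98 -T 32 -M 64000

You'd then rebuild the bowtie2 index on transcripts_nr.fasta, remap the reads and rerun Corset as usual.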

Let us know if you find a solution that works for you, as many people seem to run into this sort of problem. Unfortunately I won't have time to parallelise the clustering part of the algorithm any time soon (but the code is open source, so I welcome others to try if they like!).

Cheers,
Nadia.