Hi Nadia,
I'm working with a big dataset, about 200,0000,000 2x150 NextSeq reads, and I'm attempting to use Corset as a way to reduce complexity. I'm aligning with bowtie2 using -a --very-sensitive --threads 32 --score-min L,-0.1,-0.1, and I fed the bowtie2 output to Corset. Reading the BAM files and writing the intermediate summary files took almost no time, but the clustering has been running for 3 days and it reports that it is down to 214,600 clusters (from a cluster with 260,752). At this rate it is going to take about two weeks to finish.
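The two-week figure is consistent with a simple linear extrapolation from the numbers above. A rough sketch, assuming the merge rate seen in the first 3 days stays constant and treating every remaining cluster as still to be merged (both are simplifying assumptions, so this is closer to an upper bound than a prediction):

```python
# Rough linear extrapolation of the remaining clustering time.
# Assumption: the merge rate observed so far stays constant.
start_clusters = 260_752      # cluster count reported at the start (my reading of the log)
current_clusters = 214_600    # cluster count reported after 3 days
days_elapsed = 3

# Clusters merged away per day so far.
merged_per_day = (start_clusters - current_clusters) / days_elapsed

# Days left if every remaining cluster still had to be merged at that rate.
days_remaining = current_clusters / merged_per_day

print(f"merged per day: {merged_per_day:.0f}")
print(f"estimated days remaining: {days_remaining:.1f}")
```

That comes out at roughly 14 days.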
I'm using a workstation with 32 cores and 256 GB of RAM, but it is only using 3% of the CPU and 10% of the memory. I've been reading other posts on the group looking for a solution. I already read that the hierarchical clustering algorithm is not easy to run in parallel, so the only other option I've seen is reducing the number of alignments reported by bowtie2. I've seen this on the wiki:
If you have a large dataset, you might like to limit the number of reported alignment to a large (but finite) number, as this will be faster and the resulting bam files will be smaller. In bowtie you could do this with -k 40 (40 reported alignments).
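Applied to the bowtie2 options mentioned above, that would mean replacing -a with -k 40. A minimal sketch of the resulting command line, built in Python only to make the change explicit; the index name, read file names, and output file are hypothetical placeholders, and 40 is just the wiki's example value:

```python
import shlex

K = 40  # maximum alignments reported per read; the wiki's example value

cmd = [
    "bowtie2",
    "-k", str(K),                  # replaces -a (report all alignments)
    "--very-sensitive",
    "--threads", "32",
    "--score-min", "L,-0.1,-0.1",
    "-x", "transcriptome_index",   # hypothetical index basename
    "-1", "reads_R1.fastq.gz",     # hypothetical paired-end read files
    "-2", "reads_R2.fastq.gz",
    "-S", "sample.sam",            # hypothetical output file
]

# Print the command for inspection; it is not executed here.
print(shlex.join(cmd))
```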
So I was wondering: is 40 a suggested number, or just an example of how to use the -k option? How would you estimate a proper number of alignments (how big should it be)? And lastly, is there anything else you can suggest to speed up the clustering part?
Thanks! (and sorry for the long fuzzy post)
Fernando.