Stacks very slow.


mig...@uji.es

Apr 15, 2015, 3:22:10 AM
to stacks...@googlegroups.com
Hi, 

Stacks is running extremely slowly on several batches of data. The data are single-end reads, around 100 FASTQ files. I have parallelized the first step, ustacks, but building the catalog with cstacks takes a very long time (6 or fewer samples/day). Is this normal?

Comparing my new data with previous batches that worked fine with Stacks, I noticed that in some cases the duplication levels are quite different (the new batch has a lower number of duplicates). Could this be a reason for the speed problem?

I attach two logs from the ustacks step, and two more from denovo_map.pl, for a file from the new batch (very slow) and one from the old batch (fast).

Any idea? Thank you!


Miquel


denovo_map_new.log
denovo_map_old.log
new.o3712326
old.o3712325
new.html
old.html

Jason

Apr 15, 2015, 7:02:25 AM
to stacks...@googlegroups.com
Hi Miquel,

The speed of cstacks will depend on a number of factors, including how many mismatches you allow between stacks (the -n parameter), etc. However, your run time of about 7 individuals per day is what I am also getting, despite parallelizing the cstacks step across 16 cores and using >64 GB of memory. cstacks is a very slow step and represents the bottleneck of the entire pipeline.

The best way to proceed, as others have suggested, is to generate your catalogue using only a subset of individuals. For a 140-individual project, I use about 20 individuals chosen to maximize geographic and/or taxonomic coverage, and which have the highest coverage (as approximated from the size of the tag files). These should be sufficient to generate your catalogue.

Also, a nice feature of the more recent versions of cstacks is the ability to add one or just a few samples at a time to the catalogue. That way, if your cluster allows jobs to run for only 2 days (as on our grid), you can simply submit a new cstacks job with the remaining samples that need to be added.
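In recent versions this incremental mode looks something like the following (sample names and paths are hypothetical; check cstacks -h for the exact option names in your version):

# build the initial catalogue from a subset of high-coverage samples
cstacks -b 1 -p 16 -n 2 -o ./stacks -s ./stacks/sample_01 -s ./stacks/sample_02

# in a later job, add more samples to the existing catalogue
cstacks -b 1 -p 16 -n 2 -o ./stacks --catalog ./stacks/batch_1 -s ./stacks/sample_03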

Hope this helps,
Jason


mig...@uji.es

Apr 15, 2015, 8:47:16 PM
to stacks...@googlegroups.com
Thanks for your answer, Jason. 
Yes, I am trying to generate the catalog using only a subset of individuals. The problem is that we have a lot of different species, so I cannot select only 20 samples. Anyway, I would like to know why, with the same configuration, Stacks sometimes runs faster. I suppose there are several factors, like the number of duplicates, but I don't know exactly what differs between the runs of my old and new batches with Stacks.

Thanks, 

Miquel

Jason

Apr 17, 2015, 7:08:58 AM
to stacks...@googlegroups.com
One thing you might try to make cstacks run a little faster is to alter your input order so that closely related individuals/species are added to the catalogue first. These might add quickly, as there will be fewer differences between them. Then add the most distant ones. If you add the two most distant individuals/species first, the catalogue will be large from the beginning. Just a suggestion, and I don't know how much of an impact it will have.

mig...@uji.es

Apr 19, 2015, 9:30:40 PM
to stacks...@googlegroups.com
Hi Jason, 

Thanks for the suggestion, I will try it.
Anyway, I would like to know what the 'normal' speed for ustacks and cstacks is. As I said, cstacks takes a very long time (6 or fewer samples/day), and the same goes for ustacks. I don't know if this is normal. If I try running cstacks with individuals from the same species, it is also very slow (for the new data).

For the old batch of data, ustacks takes 3 days for 200 samples and cstacks around a week. I thought that was the normal speed. What could be the reason? Can you extract any information from the log files that I attached?

Thanks again,

Miquel

mig...@uji.es

Apr 23, 2015, 5:23:22 AM
to stacks...@googlegroups.com
Please, can anyone help?
Users' experiences are welcome!

I have tried with 16/32 threads and 100 GB of RAM, with the same results. My files are single-end reads from Illumina. cstacks finished 10 samples in one week; ustacks finished 30 samples in one week.

Thank you!


Miquel

Julian Catchen

Apr 23, 2015, 11:50:17 AM
to stacks...@googlegroups.com, mig...@uji.es
Hi Miquel,

Why don't you describe your data a little more? How many samples are you loading, and how many reads per sample?

Are the reads the same length, and what is that length?

Can you tell if cstacks is spending its time in the matching stage (as it kmerizes and tries to match loci together), or is it mostly in the disk reading/writing stage?

If you watch cstacks in top as it runs, what is the behavior? Is it spending most of its time parallelized (meaning it is in the matching stage), or most of its time as a single process, using little CPU (disk reading/writing)?
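For example, something like

top -p $(pgrep cstacks)

will restrict top to the cstacks process (assuming a single cstacks instance is running; with several, pass top a comma-separated list of PIDs).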

julian

mig...@uji.es

Apr 23, 2015, 10:33:12 PM
to stacks...@googlegroups.com, jcat...@illinois.edu, mig...@uji.es
Hi Julian, 

Thanks for your help.

Samples have around 2255550 reads, 41 bases in length (the same for all). I have tested batches with different numbers of samples: a batch of 20 samples takes 1 week to finish; 40 files take 3 weeks or more; with more than 100 samples, it cannot finish. I have tried using a sub-batch to build the catalog, with 2-3 individuals of each species, 40 samples in total, with the same results.

The point I don't completely understand is why the old sample is much faster than the new one. Both files have the same number of reads and the same read length. You can see the logs in my first post in this thread.

I think cstacks is spending most of its time in the matching stage. In the log file, it stays a long time at this step:
Searching for sequence matches...
  Distance allowed between stacks: 2
  Using a k-mer length of 17
  Number of kmers per sequence: 39
  Minimum number of k-mers to define a match: 5
  962533 loci in the catalog, 15779340 kmers in the catalog hash.

I am generating three catalogs for the new data, each with 16 threads as an input parameter. Regarding top: two of the runs are using two processes each (not a single one); the other is parallelized. No CPUs are overloaded. All log files are at the "Searching for sequence matches..." stage, as above. Each catalog has finished around 6 samples in almost two days.

It is not only cstacks that is slower; ustacks also runs much more slowly on the new files than on the old ones.

Thanks!

Miquel

Julian Catchen

May 12, 2015, 2:04:37 PM
to mig...@uji.es, stacks...@googlegroups.com
Hi Miquel,

I have looked at the data you provided and can draw a few conclusions.

1) You provided two sets of data (batch_0 and batch_2); each has between 2 and 3 million reads:

(I am giving some commands for the list in case others want to try this on their own data sets; each is a single pipeline, broken across lines here for readability.)

ls -1 *.fq |
while read line;
do
   echo -n "$line    "; cat "$line" | sed -n '2~4p' | wc -l;
done

EGP0027E06_batch2.fq    2948364
EGP0089A05_batch2.fq    2255550
EMS1906_E1_batch0.fq    2973760
EMS2088_B12_batch0.fq   2255550

2) I checked the distribution of lengths for each set:

ls -1 *.fq |
while read line;
do
  echo -n "$line    ";
  cat "$line" | sed -n '2~4p' |
  awk '{print length}' | sort -n | uniq -c;
done

EGP0027E06_batch2.fq    2948364 41
EGP0089A05_batch2.fq    2255550 41
EMS1906_E1_batch0.fq    2973760 41
EMS2088_B12_batch0.fq   2255550 41

So, each data set is approximately the same size, and all the reads are 41bp in length (we expect all reads to be the same length for ustacks). So far so good.

3) I moved forward with one file from each batch and I ran ustacks on those files:

For EMS1906_E1_batch0.fq, I see this output from ustacks, immediately after it reads the input FASTQ:

Loaded 2973760 RAD-Tags; inserted 252194 elements into the RAD-Tags hash map.
Coverage mean: 23; stdev: 85.5674

For EGP0027E06_batch2.fq I see this:

Loaded 2948364 RAD-Tags; inserted 501911 elements into the RAD-Tags hash map.
Coverage mean: 12; stdev: 40.0631

When Stacks slows down on a particular data set, it is tempting to blame the software, or to assume there must be a "bug." However, although both data files have approximately the same number of reads, the underlying data are very different.

The first stage of the ustacks algorithm is to find putative alleles. This is done by reducing down all exactly-matching reads to a single representation (the allele). In the first data file, 2.9 million reads immediately reduce down to 252,000 putative alleles, with an average depth of 23x. In the second case, the 2.9 million reads only reduce down to 500,000 alleles with half the coverage, at 12x. So, something is drastically different in the underlying libraries.
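You can approximate this first reduction yourself with the same style of one-liner as above (assuming a plain four-line-per-record FASTQ; the output file name is just an example):

sed -n '2~4p' EGP0027E06_batch2.fq | sort | uniq -c | sort -rn > allele_counts.txt
wc -l < allele_counts.txt    # distinct reads, i.e. putative alleles

The line count should land close to the "inserted ... elements into the RAD-Tags hash map" figure, and the first column of allele_counts.txt gives each putative allele's read depth.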

4) Checking your parameters:

You ran ustacks with a minimum stack depth (allele depth) of 3 reads, and allowed up to 4 mismatches between alleles. However, your reads are only 41bp long. Is it really biologically reasonable to assume that four SNPs can occur within 41 basepairs?

In terms of the ustacks algorithm, what is happening is the following: ustacks needs to compare all alleles against all other alleles to check for matches. Given n alleles, that results in on the order of n^2 comparisons, and as n increases, the number of comparisons required grows with the square of n. To speed this up, ustacks identifies all the kmers (subsequences of length k) in all the alleles and only compares two alleles if they have a certain number of kmers in common. Most of the time this drastically reduces the number of comparisons and ustacks runs quickly. However, in the worst case, the algorithm can revert to n^2 comparisons.
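To put numbers on that: with the 252,194 alleles of the first file, an exhaustive all-against-all pass is on the order of 252,194^2 / 2, roughly 3 x 10^10 comparisons; with the 501,911 alleles of the second, roughly 1.3 x 10^11. Doubling the number of alleles quadruples the worst-case work.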

Now, you have asked for up to four mismatches in 41bp of sequence. To allow this many mismatches in a short amount of sequence, ustacks must use a very small kmer size so that no potentially matching alleles are missed. ustacks automatically calculates this and prints out its decision:

Calculating distance between stacks...
  Distance allowed between stacks: 4
  Using a k-mer length of 7
  Number of kmers per sequence: 35
  Minimum number of k-mers to define a match: 7
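These numbers follow from a pigeonhole argument: a read of length L contains L - k + 1 k-mers, and each of d mismatches can disrupt at most k of them, so two sequences within d mismatches must still share at least (L - k + 1) - d*k k-mers. Here that is (41 - 7 + 1) - 4*7 = 35 - 28 = 7, the minimum match count printed in the log; for the relaxed pass shown below it is (41 - 5 + 1) - 6*5 = 37 - 30 = 7 again.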

With a kmer of length 7, many, many alleles will have spurious kmer matches by chance. It takes a very long time for ustacks to iterate through all possible matches. In addition, after alleles are merged into loci, ustacks aligns the secondary reads back to the assembled loci to increase locus depth for SNP calling. To do this, ustacks relaxes the number of allowed mismatches from 4 to 6, resulting in an even smaller kmer being chosen:

Merging remainder radtags
  367084 remainder sequences left to merge.
  Distance allowed between stacks: 6
  Using a k-mer length of 5
  Number of kmers per sequence: 37
  Minimum number of k-mers to define a match: 7

So this process is even slower. And the reason your data sets behave so differently is that the second one is a lot dirtier, resulting in many more allele comparisons and many more secondary reads that have to be matched because they never fit into one of the initially assembled alleles. Since the algorithm is worst-case n^2, your second data set, with twice as many alleles, has to make roughly four times as many comparisons.

We see the result of this with the run time (-M 4, -N 6; 41bp sequences):

EMS1906_E1_batch0:   0:32:47
EGP0027E06_batch2:   4:44:49

So, the dirtier data takes nearly nine times longer to execute. However, if we reduce the number of allowed mismatches to a more biologically plausible level, the run time is much more manageable (-M 2, -N 3):

EMS1906_E1_batch0:   0:05:54
EGP0027E06_batch2:   1:52:30

And for comparison (-M 1, -N 2):

EGP0027E06_batch2:    1:04:11

I have also exposed the kmer length argument in ustacks and cstacks, so now you can manually override the kmer setting. This can result in some reads not being matched, but it will drastically speed up matching. Running ustacks using a kmer length of 19, we further reduce the run time:

ustacks -t fastq -p 32 -i 3 -f ./ustacks/EGP0027E06_batch2.fq -o out_19k/ -d -r -M 4 --k_len 19

EGP0027E06_batch2:   36:02.66


Anyway, I'm not claiming that the ustacks algorithm could not be improved. It has not been worked on in quite a while and is on the list to be improved in the future. However, the combination of your short reads, high error rate, and large number of mismatches is causing the ustacks algorithm to perform in its rare, but worst case behavior. If your reads were 100bp, choosing the right kmer with 4 mismatches would not be an issue.

I recommend that you decrease the number of mismatches (-M 2, -N 3) and increase the minimum stack depth (-m 5). You should also seriously consider why your second batch of data looks so different from your first: PCR error? Lower quality of input DNA? You could also consider not aligning secondary reads (-N 0), which skips a large matching step but costs you the use of those reads (though your data is noisy... so). The same considerations apply to the matching algorithm in cstacks (its -n parameter uses the same kmer matching algorithm).
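With those settings, the command line from above would look something like this (paths as in the earlier example):

ustacks -t fastq -p 32 -i 3 -f ./ustacks/EGP0027E06_batch2.fq -o out/ -d -r -m 5 -M 2 -N 3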

Finally, if you want to play around with manually setting the kmer length, you can try out the code (although I consider it experimental at the moment):

http://creskolab.uoregon.edu/stacks/source/stacks-1.31.tar.gz


Best wishes,

julian
