Hi Miquel,
I have looked at the data you provided and can draw a few conclusions.
1) You provided two sets of data
(batch_0 and batch_2), each file containing between 2 and 3 million reads:
(I am including the commands in case others on the list want to try this
on their own data sets; each command should be on a single line, I have
broken them up here to make them more readable.)
# Count the number of reads in each FASTQ file (the sequence is
# every 4th line, starting at line 2).
ls -1 *.fq |
while read line; do
    echo -n "$line ";
    cat $line | sed -n '2~4p' | wc -l;
done
EGP0027E06_batch2.fq 2948364
EGP0089A05_batch2.fq 2255550
EMS1906_E1_batch0.fq 2973760
EMS2088_B12_batch0.fq 2255550
2) I checked the distribution of read lengths for each set:
# Tabulate the distribution of read lengths in each FASTQ file.
ls -1 *.fq |
while read line; do
    echo -n "$line ";
    cat $line | sed -n '2~4p' |
    awk '{print length}' | sort -n | uniq -c;
done
EGP0027E06_batch2.fq 2948364 41
EGP0089A05_batch2.fq 2255550 41
EMS1906_E1_batch0.fq 2973760 41
EMS2088_B12_batch0.fq 2255550 41
So, each data set is approximately the same size, and all the reads are
41bp in length (we expect all reads to be the same length for ustacks).
So far so good.
3) I moved forward with one file from each batch and ran ustacks on them:
For EMS1906_E1_batch0.fq, I see this output from ustacks, immediately
after it reads the input FASTQ:
Loaded 2973760 RAD-Tags; inserted 252194 elements into the RAD-Tags hash map.
Coverage mean: 23; stdev: 85.5674
For EGP0027E06_batch2.fq I see this:
Loaded 2948364 RAD-Tags; inserted 501911 elements into the RAD-Tags hash map.
Coverage mean: 12; stdev: 40.0631
When Stacks slows down on a particular data set, it is tempting to blame
the software, or to assume there must be a "bug." However, although both
data files have approximately the same number of reads, the underlying
data is very different.
The first stage of the ustacks algorithm is to find putative alleles.
It does this by collapsing all exactly-matching reads into a single
representation (the allele). In the first data file, 2.9 million reads
immediately reduce to 252,000 putative alleles, with an average
depth of 23x. In the second case, the 2.9 million reads only reduce
to 500,000 alleles at half the coverage, 12x. So, something is
drastically different in the underlying libraries.
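To make that first stage concrete, here is a minimal Python sketch of the
exact-match reduction (an illustration of the idea only, not the actual
ustacks C++ implementation):

from collections import Counter

def reduce_to_alleles(reads):
    # Collapse exactly-matching reads into putative alleles; the
    # count attached to each unique sequence is its stack depth.
    return Counter(reads)

# Toy example: four reads collapse to two putative alleles.
reads = ["ACGTTGCA", "ACGTTGCA", "ACGTTGCA", "ACGTTGGA"]
for allele, depth in reduce_to_alleles(reads).items():
    print(allele, depth)

The fewer sequencing errors and PCR artifacts in the reads, the fewer
unique sequences survive this step, and the higher the depth of each one.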
4) Checking your parameters:
You ran ustacks with a minimum stack depth (allele depth) of 3 reads,
and allowed up to 4 mismatches between alleles. However, your reads are
only 41bp long. Is it really biologically reasonable to assume that four
SNPs can occur within 41 basepairs (that is, alleles diverging at nearly
10% of their sites)?
In terms of the ustacks algorithm, what is happening is the following:
ustacks needs to compare all alleles against all other alleles to check
for matches. Given n alleles, that results in n^2 comparisons, which
take a very long time, and as n increases the number of comparisons
required grows with the square of n. To speed this up, ustacks
identifies all the kmers (subsequences of length k) in the alleles and
only compares two alleles if they have a certain number of kmers in
common. Most of the time, this drastically reduces the number of
comparisons and ustacks executes quickly. However, in the worst possible
case, the algorithm can revert to n^2 comparisons.
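Here is a minimal Python sketch of this kmer prefiltering idea (again
just an illustration of the technique, not the ustacks code itself):

def kmers(seq, k):
    # All subsequences of length k in seq, as a set.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def candidate_pairs(alleles, k, min_shared):
    # Index each allele by its kmers, then only keep pairs that
    # share at least min_shared kmers, instead of testing all
    # n^2 pairs directly.
    index = {}
    for i, seq in enumerate(alleles):
        for km in kmers(seq, k):
            index.setdefault(km, []).append(i)
    shared = {}
    for ids in index.values():
        # If k is very small, a given kmer occurs in many alleles
        # and this inner loop itself approaches n^2 -- the
        # worst-case behavior described above.
        for a in ids:
            for b in ids:
                if a < b:
                    shared[(a, b)] = shared.get((a, b), 0) + 1
    return [pair for pair, n in shared.items() if n >= min_shared]

Note how the benefit evaporates as k shrinks: with a tiny k, nearly
every allele shares kmers with nearly every other allele by chance.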
Now, you have asked for up to four mismatches in 41bp of sequence. To
allow this many mismatches in a short amount of sequence, ustacks must
use a very small kmer size so that no potentially matching alleles are
missed. ustacks automatically calculates this and prints out its
decision:
Calculating distance between stacks...
  Distance allowed between stacks: 4
  Using a k-mer length of 7
  Number of kmers per sequence: 35
  Minimum number of k-mers to define a match: 7
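These numbers follow from a simple pigeonhole argument, which we can
check in a few lines of Python (my reconstruction of the calculation; it
reproduces the values ustacks prints here and in the secondary-read step
below):

def min_matching_kmers(read_len, k, mismatches):
    # A sequence of length L contains L - k + 1 kmers.
    n_kmers = read_len - k + 1
    # Each mismatch can disrupt at most k overlapping kmers, so two
    # sequences within `mismatches` of each other must still share
    # at least this many kmers.
    return n_kmers, n_kmers - mismatches * k

print(min_matching_kmers(41, 7, 4))  # (35, 7)
print(min_matching_kmers(41, 5, 6))  # (37, 7)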
With a kmer of length 7, many, many alleles will have spurious kmer
matches by chance. It takes a very long time for ustacks to iterate
through all possible matches. In addition, after alleles are merged into
loci, ustacks aligns the secondary reads back to the assembled loci to
increase locus depth for SNP calling. To do this, ustacks relaxes the
number of allowed mismatches from 4 to 6, resulting in an even smaller
kmer being chosen:
Merging remainder radtags
  367084 remainder sequences left to merge.
  Distance allowed between stacks: 6
  Using a k-mer length of 5
  Number of kmers per sequence: 37
  Minimum number of k-mers to define a match: 7
So this process is even slower. The reason your data sets behave so
differently is that the second data set is much dirtier, resulting in
many more allele comparisons and many more secondary reads that must be
matched because they never fit into one of the initially assembled
alleles. Since this algorithm is worst case n^2, your second data set,
with twice as many alleles, has to make roughly four times as many
comparisons in the worst case ((2n)^2 = 4n^2).
We see the result of this in the run times (-M 4, -N 6; 41bp sequences):
EMS1906_E1_batch0: 0:32:47
EGP0027E06_batch2: 4:44:49
So, the dirtier data takes nearly nine times as long to execute.
However, if we reduce the number of allowed mismatches to a more
biologically plausible number, the run time is much more manageable (-M
2, -N 3):
EMS1906_E1_batch0: 0:05:54
EGP0027E06_batch2: 1:52:30
And for comparison (-M 1, -N 2):
EGP0027E06_batch2: 1:04:11
I have also exposed the kmer length argument in ustacks and cstacks, so
now you can manually override the kmer setting. This can result in some
reads not being matched, but it will drastically speed up matching.
Running ustacks using a kmer length of 19, we further reduce the run
time:
ustacks -t fastq -p 32 -i 3 -f ./ustacks/EGP0027E06_batch2.fq -o out_19k/ -d -r -M 4 --k_len 19
EGP0027E06_batch2: 36:02.66
Anyway, I'm not claiming that the ustacks algorithm could not be
improved. It has not been worked on in quite a while and is on the list
for future improvement. However, the combination of your short reads,
high error rate, and large number of allowed mismatches is pushing the
ustacks algorithm into its rare, worst-case behavior. If your reads were
100bp, choosing the right kmer length with 4 mismatches would not be an
issue.
I recommend that you decrease the number of mismatches (-M 2, -N 3) and
increase the minimum stack depth (-m 5). You should also seriously
consider why your second batch of data looks so different from your
first: PCR error? Lower quality input DNA? You could also consider
not aligning secondary reads (-N 0), which will skip a large matching
step but will cost you the use of those reads (though given how noisy
your data is, that may be acceptable). The same considerations apply to
cstacks, whose -n parameter uses the same kmer matching algorithm.
Finally, if you want to play around with manually setting the kmer
length, you can try out the code (although I consider it experimental at
the moment):
http://creskolab.uoregon.edu/stacks/source/stacks-1.31.tar.gz
Best wishes,
julian
April 23, 2015 at 9:34 PM