Hi Stacks community,
this post is not directly related to Stacks itself, but I thought it might be useful to collect some information/opinions
about which read mapper people are using prior to using Stacks.
Personally, I have tried a couple of different read mappers and am not perfectly happy with any of them.
In particular it seems that the use of different read mappers can have a profound impact on the SNP calling in Stacks.
If we look at the distribution of SNPs over the length of reads (parsed out from the sumstats.tsv file)
this becomes pretty obvious (attached).
So my experience so far (only for the read mappers that have an option to turn-off terminal alignments):
GSNAP: performs quite well, but calls almost no SNPs in the last three basepairs. This wouldn't be too much of a problem
as it is conservative, but the problem with GSNAP is that the parameters to report only reads that map uniquely is not sensitive,
i.e. even when you tell GSNAP to only report reads with a single hit you will still get reads that can be mapped to different positions
in the genome. Just blast some of your loci in your catalogue back against your reference genome and you will likely see the problem
(the insensitivity of this option was also kindly confirmed to me by the developer).
Bowtie2: no problem with multiple hits if reads with an XS: flag are removed from the SAM files, but the number of called SNPs increases
linearly over the length of the read from position 75 on or so. Note, that I of course I used the --end-to-end mode to avoid terminal alignments.
Bowtie: the mapper I'm using right now, as it performs well with the issue of multiple hits and seems very conservative. However, for some reason
there is an increase in the number of called SNPs in the last three basepair positions. Note, that this does not seem to result from sequence quality
issues, as the problem remains even if I trim the last basepairs prior to mapping and running Stacks!
Attached are the SNP distribution plots for GSNAP, Bowtie2 and Bowtie. The x-axis gives the position along the read in bp and the y-axis is number of SNPs.
The header contains the relevant flags I used to run the software. Note, that the problem with position 47 seems to be indeed actually a sequencing problem, so
please ignore it.
Also attached is the output for a subset of the data with the trimmed and full-length reads mapped with Bowtie.
So far I got around this issue by using Bowtie and simply using custom scripts to modify the internal stacks files to remove SNPs from these ambiguous positions,
but this does of course not work for all the nice and new haplotype-based options of Stacks.
So please weigh in and share your experiences if you have similar problems or which mapper you use. And of course please
let me know if I'm missing something obvious here.
Sorry for this long post, but I hope this will be of interest to many Stacks users.
Cheers,
Andi