which read mapper to use prior to Stacks

577 views
Skip to first unread message

AFK

unread,
Aug 7, 2014, 6:21:57 AM8/7/14
to stacks...@googlegroups.com
Hi Stacks community,

this post is not directly related to Stacks itself, but I thought it might be useful to collect some information/opinions
about which read mapper people are using prior to using Stacks.

Personally, I have tried a couple of different read mappers and am not perfectly happy with any of them.

In particular it seems that the use of different read mappers can have a profound impact on the SNP calling in Stacks.
If we look at the distribution of SNPs over the length of reads (parsed out from the sumstats.tsv file)
this becomes pretty obvious (attached).


So my experience so far (only for the read mappers that have an option to turn-off terminal alignments):

GSNAP: performs quite well, but calls almost no SNPs in the last three basepairs. This wouldn't be too much of  a problem
as it is conservative, but the problem with GSNAP is that the parameters to report only reads that map uniquely is not sensitive,
i.e. even when you tell GSNAP to only report reads with a single hit you will still get reads that can be mapped to different positions
in the genome. Just blast some of your loci in your catalogue back against your reference genome and you will likely see the problem
(the insensitivity of this option was also kindly confirmed to me by the developer).

Bowtie2: no problem with multiple hits if reads with an XS: flag are removed from the SAM files, but the number of called SNPs increases
linearly over the length of the read
from position 75 on or so. Note, that I of course I used the --end-to-end mode to avoid terminal alignments.

Bowtie: the mapper I'm using right now, as it performs well with the issue of multiple hits and seems very conservative. However, for some reason
there is an increase in the number of called SNPs in the last three basepair positions. Note, that this does not seem to result from sequence quality
issues, as the problem remains even if I trim the last basepairs prior to mapping and running Stacks!



Attached are the SNP distribution plots for GSNAP, Bowtie2 and Bowtie. The x-axis gives the position along the read in bp and the y-axis is number of SNPs.
The header contains the relevant flags I used to run the software. Note, that the problem with position 47 seems to be indeed actually a sequencing problem, so
please ignore it.
Also attached is the output for a subset of the data with the trimmed and full-length reads mapped with Bowtie.

So far I got around this issue by using Bowtie and simply using custom scripts to modify the internal stacks files to remove SNPs from these ambiguous positions,
but this does of course not work for all the nice and new haplotype-based options of Stacks.

So please weigh in and share your experiences if you have similar problems or which mapper you use. And of course please
let me know if I'm missing something obvious here.

Sorry for this long post, but I hope this will be of interest to many Stacks users.

Cheers,
Andi









GSNAP.JPG
Bowtie_different_data_set_trimmed_full-length.JPG
Bowtie.JPG
Bowtie2.JPG
Message has been deleted
Message has been deleted
Message has been deleted

Celine Reisser

unread,
Aug 26, 2014, 6:35:49 PM8/26/14
to stacks...@googlegroups.com
Hi Andy,

I had the same experience with Bowtie2. I was also using end-to-end.
I noticed that the main problem came from their threshold for mapping quality. It is a formula directly dependent of the length of your read, and although for short reads of 50bp it seems ok, you increase that threshold dramatically for reads of 100bp (or 90bp, trimmed from the MID and the last 5 nucleotides).
The formula: -0.6 + -0.6 * L
You can see that you get a threshold of -54.6
Hence you either need to re-parameter all the penalties or give to substitutions, gap openings and extensions, or you need to modify the parameters of the formula (especially the 0.6 * L).

I also noticed that trimming my SAM files to keep only reads with a mapping quality MAPQ over 25 helped in the problem of "accumulation of terminal SNPs on reads".

However now, I switched to BWA (mostly because I am going back and forth between Stacks and GATK, and GATK input formats are a nightmare to produce... BWA is the only aligner I know that makes it easier). I also do an extensive "in house" trimming, removing the reads with XA flag (multiple alignments), with soft clipping or gaps (keep only cigar string of 90M), under a MAPQ of 25, and with more than 8 substitutions (I am sometimes working with moderately divergent populations than that used for the reference genome so it can get up to 6 or 7).

With BWA and the custom filters, I obtain really great results with Stacks analysis. We did a crossing panel and only had 4% estimated error in genotypes. Also, I ran duplicates on each of my library to test for sequencing error and genotyping error, and it is also under the 5%.


So yes, the main point of it all is that you cannot entirely trust the mapping softwares, and you still need to get your hands dirty and dig in to modify parameters or filter them out yourself...

Cheers

Celine.
Reply all
Reply to author
Forward
0 new messages