soft_clipping bases - discarding reads


Angélica Cuevas

Jun 16, 2017, 7:41:55 AM
to Stacks
Hello all,

I would like to ask the community and @Julian about the --max_clipped flag for pstacks.

I understand the problem with using reads that have a high number of soft-clipped (SC) bases, especially if alleles are present and supported by those SC bases, but is there a different effect from just not using the bases rather than discarding the whole read? Would I be biasing the variant calling if I just tell the variant-calling command not to use the bases that have been soft-clipped, instead of discarding the read altogether?

My question is more specifically related to the flag --dontUseSoftClippedBases from HaplotypeCaller. I'm assuming that if I were to use that flag, the read would still be used but the SC bases would not when the SNPs are called. Any thoughts?

And finally, if the best option is to discard the read completely, is it possible to do it right after the mapping, before going into STACKS or a variant caller? I mean from the *bam files. I know some mappers allow you to turn off soft clipping, but I'm using bwa mem and I can't find a flag that would allow me to do so. Do you know of any tool/script that would allow me to discard reads that contain SC bases or, even better, reads with a specific % of SC bases (like the --max_clipped flag for pstacks)?


Many thanks in advance for your help!

best,

Angelica

Julian Catchen

Jun 19, 2017, 1:48:54 PM
to stacks...@googlegroups.com, cuevas.a...@gmail.com
Hi Angelica,

The major problem is that when you allow excessive soft-clipped bases,
you end up creating a region of the locus, when viewed across the
individual and/or the population, that contains many weakly supported
SNP calls. By default, if you have soft-masked (i.e., soft-clipped) bases, Stacks will treat
those bases as Ns, but there is rarely a reason for them to be present,
unless the aligner could only align a seed from the read (~40bp or so),
in which case the read is definitely not aligned properly.
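
To make the "treat those bases as Ns" idea concrete, here is a minimal Python sketch that masks the soft-clipped part of a read using its CIGAR string. It is not the code Stacks uses; the function name mask_soft_clipped is made up for illustration, and it assumes the SEQ and CIGAR values come straight from a SAM record.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM spec).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def mask_soft_clipped(seq, cigar):
    """Return seq with the bases covered by soft-clip (S) operations set to N.

    Hypothetical helper for illustration; not part of Stacks.
    """
    out = []
    pos = 0  # current offset into the read sequence
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "S":
            out.append("N" * n)           # soft-clipped bases become Ns
            pos += n
        elif op in "MI=X":                # operations that consume read bases
            out.append(seq[pos:pos + n])
            pos += n
        # D, N, H and P do not consume bases stored in SEQ
    return "".join(out)

print(mask_soft_clipped("ACGTACGTAC", "3S5M2S"))  # NNNTACGTNN
```

A read with long runs of N at either end carries little information for the locus, which is why discarding it outright is usually the simpler choice.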

I don't know if HaplotypeCaller includes soft-clipped bases in calls by
default (I would doubt it), but if so, that would be incorrect. I do
know that GATK does local realignment, so that may alleviate this
problem, but I am not an expert on GATK.

The pstacks module from Stacks allows you to control the amount of a
read that can contain soft-masked bases and it will tell you the
percentage of reads it is discarding (it should not be very many in a
good data set), so you can judge if this is really a problem for your data.
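
If you want to apply the same kind of filter before Stacks or another variant caller, one possible approach is to compute the soft-clipped fraction from the CIGAR field of each SAM record and drop records above a threshold. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the 0.15 threshold is an arbitrary example value, and the input is assumed to be SAM text lines (for example, the output of samtools view -h).

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM spec).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def clipped_fraction(cigar):
    """Fraction of the read's bases (M, I, S, =, X) that are soft-clipped."""
    clipped = 0
    total = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "MIS=X":   # operations that consume bases of the read
            total += n
        if op == "S":
            clipped += n
    # A CIGAR of "*" (unmapped read) has no operations and yields 0.0.
    return clipped / total if total else 0.0

def filter_sam(lines, max_clipped=0.15):
    """Yield header lines and records with clipped fraction <= max_clipped.

    Hypothetical helper; max_clipped=0.15 is an example value only.
    """
    for line in lines:
        if line.startswith("@"):      # keep the SAM header untouched
            yield line
            continue
        cigar = line.split("\t")[5]   # CIGAR is the 6th SAM column
        if clipped_fraction(cigar) <= max_clipped:
            yield line
```

Note that with paired-end data this drops individual records, so a mate can be left without its partner; whether that matters depends on the downstream tool.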

The process_radtags program will discard reads with progressively
worsening quality scores from the 5' to 3' end of the reads; these are
the types of reads that could legitimately have soft-masking at the
ends. Other reads may contain fragments of unique sequence along with
segments of repeats, and it is probably better to discard those loci
(which Stacks would mostly end up doing).

So, if you are seeing a large amount of soft-masking in your data,
either your library has some intrinsic problems that you need to
consider, or your reference genome is not closely related or it has
intrinsic quality problems. One good option is to assemble de novo, and
then align the consensus sequences of those assembled loci back against
your reference genome and compare the results (there should not be soft
masking in these alignments).

You can indirectly control for this in BWA by setting the mapping
quality score; it doesn't allow you to directly control for soft
masking, but provides one number you can set that incorporates
mismatches and other errors.
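
In practice the mapping quality filter is usually applied after alignment, for example with the -q option of samtools view. The same idea as a small Python sketch over SAM text lines is shown here; the function names and the cutoff of 20 are assumptions chosen for illustration, not recommended settings.

```python
def passes_mapq(sam_line, min_mapq=20):
    """True if the record's MAPQ (5th SAM column) is at least min_mapq."""
    fields = sam_line.rstrip("\n").split("\t")
    return int(fields[4]) >= min_mapq

def filter_by_mapq(lines, min_mapq=20):
    """Yield header lines and records that pass the MAPQ cutoff.

    Hypothetical helper; min_mapq=20 is an example value only.
    """
    for line in lines:
        if line.startswith("@") or passes_mapq(line, min_mapq):
            yield line
```

Because MAPQ summarizes mismatches, clipping and alignment ambiguity in one number, this only indirectly removes heavily clipped reads; some clipped reads with a unique placement will still pass.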

Best,

julian

Angélica Cuevas

Jun 19, 2017, 5:19:04 PM
to Julian Catchen, stacks...@googlegroups.com
Thanks a lot, Julian, for all the insights! Really helpful! I think it's worth getting rid of the excessively soft-clipped reads. I'm using a reference genome from the sister species, but there is probably already a lot of interspecific variation that creates an excess of soft-clipping in my data.

Best,

Angelica