Hi Angelica,
The major problem is that when you allow excessive soft-clipped bases,
you end up creating a region of the locus that, when viewed across the
individual and/or the population, contains many weakly supported
SNP calls. By default, if you have soft-clipped bases, Stacks will treat
those bases as Ns, but there is rarely a reason for them to be present
unless the aligner could only align a seed from the read (~40bp or so),
in which case the read is definitely not aligned properly.
I don't know if HaplotypeCaller includes soft-clipped bases in calls by
default (I would doubt it), but if so, that would be incorrect. I do
know that GATK does local realignment, so that may alleviate this
problem, but I am not an expert on GATK.
The pstacks module from Stacks allows you to control the fraction of a
read that can consist of soft-clipped bases, and it will tell you the
percentage of reads it is discarding (it should not be very many in a
good data set), so you can judge whether this is really a problem for
your data.
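If you want to check this yourself before running pstacks, here is a rough sketch in Python that reads the CIGAR strings from a SAM file and counts how many aligned reads exceed a soft-clipping threshold. It is not how pstacks does it internally; the function names and the 0.15 cutoff are just assumptions I picked for illustration, so adjust them to your data.

```python
import re

# One CIGAR element: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_fraction(cigar):
    """Return the fraction of read bases that are soft-clipped (op 'S')."""
    clipped = 0
    read_len = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "MIS=X":  # operations that consume bases of the read
            read_len += n
        if op == "S":
            clipped += n
    return clipped / read_len if read_len else 0.0


def summarize(sam_lines, max_clipped=0.15):
    """Count aligned reads and those above the clipping threshold.

    max_clipped is an illustrative cutoff (an assumption), not a
    recommendation. Returns (aligned_reads, reads_over_threshold).
    """
    total = 0
    discarded = 0
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete SAM record
            continue
        flag = int(fields[1])
        cigar = fields[5]
        if flag & 0x4 or cigar == "*":  # unmapped read
            continue
        total += 1
        if soft_clipped_fraction(cigar) > max_clipped:
            discarded += 1
    return total, discarded
```

You would call it as `summarize(open("sample.sam"))`, where the file name is a placeholder for your own uncompressed SAM file. If the second number is a large share of the first, that points to the library or reference problems discussed in this message.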
The process_radtags program will discard reads with progressively
worsening quality scores from the 5' to 3' end of the reads; these are
the types of reads that could legitimately have soft-clipping at the
ends. Other reads may contain fragments of unique sequence along with
segments of repeats, and it is probably better to discard those loci
(which Stacks would mostly end up doing).
So, if you are seeing a large amount of soft-clipping in your data,
either your library has some intrinsic problems that you need to
consider, or your reference genome is not closely related to your
species or has quality problems of its own. One good option is to
assemble de novo, then align the consensus sequences of those assembled
loci back against your reference genome and compare the results (there
should not be soft-clipping in these alignments).
You can indirectly control for this in BWA by setting the mapping
quality score. It doesn't allow you to directly control for soft
clipping, but it provides one number you can set that incorporates
mismatches and other errors.
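As a rough illustration of filtering on mapping quality after the fact, here is a minimal Python sketch that keeps only SAM records whose MAPQ (the fifth column) meets a threshold. The function name and the cutoff of 20 are assumptions for the example, not values I am recommending for your data.

```python
def filter_by_mapq(sam_lines, min_mapq=20):
    """Yield header lines and alignments with MAPQ >= min_mapq.

    min_mapq=20 is an illustrative threshold (an assumption).
    """
    for line in sam_lines:
        if line.startswith("@"):  # pass header lines through unchanged
            yield line
            continue
        fields = line.split("\t")
        if len(fields) < 11:  # skip incomplete records
            continue
        if int(fields[4]) >= min_mapq:  # column 5 of a SAM record is MAPQ
            yield line
```

Writing the result back out is just a matter of iterating over the generator and writing each line to a new file.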
Best,
julian