filter bwa output (sam file) based on CIGAR string

358 views
Skip to first unread message

Xiaodong

unread,
Dec 21, 2017, 3:31:44 PM12/21/17
to dDocent User Help Forum
Dear Jon,

I have a question about filtering bwa output. If the dDocent code file, the mapping command for pair-end reads is below:

bwa mem reference.fasta $i.R1.fq.gz -L 20,5 -t $NUMProc -a -M -T 10 -A $optA -B $optB -O $optO -R "@RG\tID:$i\tSM:$i\tPL:Illumina" 2> bwa.$i.log | mawk '$6 !~/[2-9].[SH]/ && $6 !~ /[1-9][0-9].[SH]/' | samtools view -@$NUMProc -q 1 -SbT reference.fasta - > $i.bam 2>$i.bam.log

The command filters the output (sam files) generate by bwa based on CIGAR. It removes soft or hard clipped alignment. But this part was missing in the old version of dDocent, for example, the outdated dDocent.GATK. Its mapping command is below:

bwa mem reference.fasta $i.R1.fq $i.R2.fq -t $NUMProc -a -M -T 10 -A $optA -B $optB -O $optO -R "@RG\tID:$i\tSM:$i\tPL:Illumina" > $i.sam 2> bwa.$i.log

Therefore, I wonder why you added filtering part based on CIGAR in the new version. As far as I know, variant callers like GATK or freebayes  ignore soft/hard clipped alignments, unless looking for structural variants.

Any opinions are very appreciated!

Many thanks,
Ivan 

Jon Puritz

unread,
Jan 2, 2018, 9:11:22 AM1/2/18
to ddo...@googlegroups.com
Hi Ivan,

The mawk script removes reads with substantial hard clipping, relative to the length of 100 bp.  They could be adjusted for longer read lengths.  Modern SNP callers, such as freebayes and GATK, ignore the clipped part of reads, but not the whole read.  Working with a lot of different empirical data sets, I found that reads with substantial clipping were usually paralogs or contamination that would sometimes lead to erroneous SNP calls.  Excluding them rarely affects coverage that much (unless trying to align to a distant genome), and greatly improves the accuracy of SNP calls.


Hope that helps,

Jon

-- 
Jon Puritz, PhD

Assistant Professor
Department of Biological Sciences
University of Rhode Island
120 Flagg Road, Kingston, RI 02881

Webpage: MarineEvoEco.com

Email: 
jpu...@gmail.com 

Cell: 401-338-8739

"The most valuable of all talents is that of never using two words when one will do.” -Thomas Jefferson

--
You received this message because you are subscribed to the Google Groups "dDocent User Help Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ddocent+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Reply all
Reply to author
Forward
0 new messages