identifying and removing paralogs when you have a reference genome

1,046 views
Skip to first unread message

Kim

unread,
Aug 12, 2014, 12:53:13 PM8/12/14
to stacks...@googlegroups.com
I am looking for advice on how to identify/remove paralogous RAD loci when you have a reference genome.  It seems like the best way to do this would be to have a genome alignment program that tells you when one read aligns to >1 locations in the genome, and then to remove that locus.

I have been using the genome aligner BWA.  My understanding is that when BWA finds a read that maps to more than one location in a genome (i.e. a "multi-mapping read"), it randomly chooses one of those locations to map the read.  And I don't think there is an option for BWA to "flag" or remove reads that align to >1 location in the genome.  Therefore I don't see how paralogous loci could be identified/removed at this stage of the analysis with BWA.

Is there a way to flag and remove multi-mapping reads in other alignment software?  In looking at the online manuals, it seems like Bowtie2 can do this, but GSnap can't, but I have not used these myself.

Any advice on how to identify/remove paralogous loci (when you have a reference genome), and which software to use would be great!

Thanks,
Kim

Celine Reisser

unread,
Aug 26, 2014, 5:49:18 PM8/26/14
to stacks...@googlegroups.com
Hi Kim,

I do not know if you figured out a way yet, but there are other aligners than BWA you could use. NovoAlign allows you to drop the reads that map to more than one place in the genome.
Alternatively, and because BWA is a very good aligner with many great options, you good trim your SAM files yourself using basic unix scripting.

For paralogous sequences, a three step filter can be used:

1. As you know already, a read could map to more than one area in the genome. BWA will indeed report the best alignment as the main alignment, but will include the other alignment found at the end of the line of that read, under a parameter called XA. It should look like this in your file: XA:Z:scaffold01899,+9941,90M,4;
This gives you the chromosome/scaffold it maps to, if it is forward or reverse, the position of the first nucleotide, the cigar string (very useful for further filtering) and finally the mapping quality of this alternate mapping. What you could do is grep the name of all the loci showing reads mapping to them and having this XA column at the end.

2. Alternatively to the grep (but maybe less precise), it is known that having an alternative mapping will bring the mapping score down for the read considered. You could filter your SAM file to only keep loci with a mapping quality of at least 25 (cat yourfile.sam | awk '$5 > 24 {print $0}' > newfile.sam)

3. Another type of paralogous behaviour in a SAM file is linked with loci depth: you could have reads coming from two paralogs in in sequencing pannel, but your reference genome, but your reference genome assembled only one of those paralogous sequences. Then you should see a bias in the depth of the locus compared to the other, and have a lot of reads mapping there. You could do a grep and count the unique occurrence. You would have a rough idea of which loci can be problematic and decide to remove them from all individuals...
(cat yourfile.sam | awk '{print $3 "_" $4"}' | sort | uniq -c > loci_coverage.txt). The number on the right hand should be the depth of the locus, open the file with excel, sort by depth, calculate the mean or median depth and decide of a cut off for acceptable upper limit depth (like, 4 time the median individual depth maximum).

Also to remember, you could also filter according to the max number of substitution you would allow between your sample and the reference genome. This number is reported in column 12 of your SAM file. However, if you are aligning on a genome that is not your species, you would expect higher number of substitutions for certain reads...

I hope this helps.

Cheers

Celine

Chad Smith

unread,
Aug 27, 2014, 11:41:22 AM8/27/14
to stacks...@googlegroups.com
Too add to what Kim said, using samtools flags also will accomplish these tasks:


Output alignments that are not primary (i.e. secondary alignments) 

samtools flag -b -f 256 bam_in > bam_out 

Output primary alignments 

samtools flag -b -F 256 bam_in > bam_out 

Remove reads with mapping scores >20

samtools view -b -q20 bam_in > bam_out

Plus so much more! 

Stacks has a "lumberjack stacks" filter for removing loci with too much coverage. 

Cheers,
Chad

--
Stacks website: http://creskolab.uoregon.edu/stacks/
---
You received this message because you are subscribed to a topic in the Google Groups "Stacks" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/stacks-users/FXCV1AOU6ik/unsubscribe.
To unsubscribe from this group and all its topics, send an email to stacks-users...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Chad Smith
Postdoctoral Fellow
Dept of Integrative Biology
University of Texas at Austin
1 University Station C0990
Austin, TX 78712

Office: PAT 632

Julian Catchen

unread,
Aug 28, 2014, 8:48:18 PM8/28/14
to stacks...@googlegroups.com, kiman...@gmail.com
Hi Kim,

Gsnap can remove multimapping reads. BWA also can have problems because
it insists on soft-masking alignments, which are really only partially
matching reads. These soft-masked alignments will cause errors in SNP
calling.

You also have to think about what you consider to be truly multimapping.
If a read aligns perfectly to one place in the genome, but then aligns
to a second location with one mismatch, is that read multimapped?

Best,

julian

Kim

unread,
Oct 18, 2014, 6:14:48 PM10/18/14
to stacks...@googlegroups.com, kiman...@gmail.com, jcat...@uoregon.edu
Hi Celine, Chad, and Julian,

Many thanks for your very helpful replies! 

I have just a few more questions...

1. Celine, your unix scripting commands are very helpful.  However, I'm wondering how these commands will influence the sam file header, and if that matters to Stacks.  For example, if I use grep to copy all reads that are *not* multi-mapping into a new file, that file will no longer have a sam header--is Stacks ok with an input samfile with no header?

2. Chad, you mentioned that Stacks has a lumberjack function for removing loci with high coverage, but my understanding is that function only applies to the denovo analysis.  Is there a way to filter out loci with high coverage using refmap?  If not, does anyone have any suggestions for an efficient way to filter out loci with high coverage when you have hundreds of samples?

Thanks!
Kim

Matthew Hare

unread,
Oct 23, 2014, 8:35:48 AM10/23/14
to stacks...@googlegroups.com, kiman...@gmail.com, jcat...@uoregon.edu, dc...@cornell.edu
Hi Kim,

We recently published an empirical clustering protocol for application at the level of single individuals without a reference, designed to evaluate the sequence distance threshold that minimizes oversplitting of loci (because of divergent alleles) and avoids excessive clustering of paralogous sequences. It is equally useful when you have a reference and need to decide how much sequence difference to allow in the read mapping. The scripts are available from the lead author and include procedures for this de novo evaluation as well as generation of a fasta pseudoreference based on the optimal threshold. One of the take-home messages from our work is that for those of us not working on recent polyploids or recently duplicated genomes, paralogs are a relatively minor problem with 100 bp RAD tags and for many applications it is better to allow higher sequence differences (after proper QC filtering) so that allelic diversity is accurately reflected and artifactual homozygosity is reduced.

Ilut, D. C., Nydam, M. L., & Hare, M. P. (2014). Defining Loci in Restriction-Based Reduced Representation Genomic Data from Nonmodel Species: Sources of Bias and Diagnostics for Optimal Clustering. BioMed research internationalVolume 2014 (2014), Article ID 675158, 9 pages
http://dx.doi.org/10.1155/2014/675158
Good luck,
Matt
Reply all
Reply to author
Forward
0 new messages