STACKS - Duplicated SNP (forward and reverse)

1,315 views
Skip to first unread message

Baptiste Guitton

unread,
Jun 13, 2014, 7:55:13 AM6/13/14
to stacks...@googlegroups.com

Dear all,

I am using STACKS to analyze GBS data on Sorghum for which we have a reference genome. I went through the pipeline to call SNPs (demultiplexing, mapping with bowtie, ref_map.pl and Populations program). Everything went well, but I observed in my output VCF file that duplicated SNP were present. At the same genomic position, the same SNP can be present twice and have the same alleles. These data seem to come from the same read that have been sequenced in both forward and reverse sense. As a result of that, I have data for one SNP for some given individuals and some data for the other duplicated SNP for some other individuals.

Here is an example from my VCF file:

#CHROM

POS

ID

REF

ALT

QUAL

FILTER

INFO

Sb01

27283

144413

C

G

.

PASS

NS=35;AF=0.786,0.214

Sb01

27283

144398

C

G

.

PASS

NS=43;AF=0.721,0.279

 

I am wondering why STACKS pipeline did not merged these duplicated SNP? Furthermore, I would like to merge them, to conserve only one SNP when necessary in order to merge information of depth and through the different individuals (i.e. I have in total 62 individuals, in the example above I have SNP data for 35 individuals for the first SNP and for 43 individuals for the second SNP).

Is someone has already meet this problem? Am I using STACKS the wrong way? Please, is anyone can point me out to a solution?

Many thanks,

 

Baptiste

노한성

unread,
Nov 18, 2014, 4:09:01 AM11/18/14
to stacks...@googlegroups.com
Dear Baptiste,

I also met this problem. I think you used double enzyme like me. I used ApekI & MspI.
When these enzyme recognition sites are closely located, read 1 and read 2 were overlapped.

The Stacks don't merge this locus and treat seperately. It's the reason of duplicated SNPs at same position.
It is violated VCF protocol.

There are three options I think.
1. simply abandon overlapping reads
2. chose one of duplicated SNPs at VCF file.
3. merge reads before mapping. (I'm almost wrote codes for reference-based reads merging.)

I wish you solve the problem.

Hanseong Roh

Baptiste Guitton

unread,
Nov 18, 2014, 5:27:20 AM11/18/14
to stacks...@googlegroups.com
Dear Hanseong,

Thank you for your answer.
I used only one enzyme ApekI.
To me, abandoning overlaping reads or choosing one of the duplicated SNP are realy not pertinent solutions, since I will loose a lot of information. My objectif was actually to merge reads, but I did not success to do it using STACKS (that was the topic of my question).
The solution I found was to use the TASSEL pipeline, which do merge duplicated SNP.
Anyway, if you have a solution to do it using STACKS, I would still be interested.

Cheers,

Baptiste

--
Stacks website: http://creskolab.uoregon.edu/stacks/
---
You received this message because you are subscribed to a topic in the Google Groups "Stacks" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/stacks-users/Ag8YyEFe7z0/unsubscribe.
To unsubscribe from this group and all its topics, send an email to stacks-users...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Baptiste Guitton
Domaine de Caudalie, Bat D, Apt 14,
2650 Route de Mende
34980 Montferrier-Sur-Lez
06-83-92-90-72

Julian Catchen

unread,
Dec 3, 2014, 11:38:25 PM12/3/14
to stacks...@googlegroups.com, baptiste...@gmail.com
Hi Baptiste,

As you know, Stacks does not merge overlapping SNPs in the VCF. We are working on an option to merge overlapping loci in the populations program and when it is complete it will also affect the various outputs. The statistical and kernel smoothing algorithms in the populations program *do* merge overlapping SNP positions (and have for some time).

There are very few circumstances when you should see overlapping RAD loci. It is worth examining why your loci are overlapping by visualizing the data and checking that your alignments are good.

Best,

julian

Sheina Sim

unread,
Jan 4, 2016, 7:15:48 PM1/4/16
to Stacks

Aloha Julian,

I ran Stacks and used the --merge_sites and --merge_prune_lim options but there were still duplicates within my dataset (different stacks resulted in the same locus/loci).  Is there a way to get around this other than manually removing the duplicates?

Thanks!
Sheina

Julian Catchen

unread,
Jan 5, 2016, 2:03:36 PM1/5/16
to stacks...@googlegroups.com, shein...@gmail.com
Aloha Sheina,

You'll have to provide more information. What type of data is it (is it
reference aligned) and how are you detecting the duplicates? Can you
please provide a test case?

Best,

julian

Sheina Sim wrote:
>
> Aloha Julian,
>
> I ran Stacks and used the --merge_sites and--merge_prune_lim options but

Sheina Sim

unread,
Jan 5, 2016, 2:19:46 PM1/5/16
to Stacks, shein...@gmail.com, jcat...@illinois.edu
Aloha!  

Thank you so much for your reply!

My data is from a double digest RAD library for which we have paired-end sequences but I processing as single-end.  The single-end sequences were processed using process_radtags, aligned to a reference using BWA mem, and then run through the populations module.

I detected the duplicates by looking at the .plink.map output from populations and noticed that there were some loci from the same position on the same scaffold (columns 1 and 4 from .plink.map output) but they came from different "stacks" (column 2 from .plink.map output).

I looked at the tags.tsv files and they look like loci that are from overlapping stacks that weren't merged.  For the most part the genotypes for an individual are the same between the duplicate loci.

Example .plink.map output:

Column 2 is different indicating that the loci are coming from different stacks but columns 1 and 4 are identical.

gi|751338028|ref|NW_011866214.1|        864_83  0       1225
gi|751338028|ref|NW_011866214.1|        865_69  0       1225

A different duplicate:

gi|751337793|ref|NW_011866449.1|        768_23  0       1485
gi|751337793|ref|NW_011866449.1|        771_76  0       1485

Out of 25K SNP loci, there are about 250 that are duplicated. 

Any help is greatly appreciated.

Thanks!
Sheina

Julian Catchen

unread,
Jan 5, 2016, 2:56:13 PM1/5/16
to stacks...@googlegroups.com, shein...@gmail.com
Hi Sheina,

Okay, so I haven't written code to provide the ordered export for every
output file type. If you check those genomic locations in the sumstats
file, you should only find one of the two. The same should also work for
Structure, GenePop, and VCF files. I'll have to add PLINK format to the
todo list.

There are programs that will convert between many of these formats. You
could also take your 250 loci and place them in a blacklist and then you
could use any of the output formats.

maeva leitwein

unread,
Feb 17, 2016, 8:12:51 AM2/17/16
to Stacks, shein...@gmail.com, jcat...@illinois.edu
Hi Julian, 

I have a similar problem as Sheina's, except that I found the duplicates inside the sumstats file. 
My data are PE from double digest librairy, that I aligned with BWAmem then run through the population module (after pstacks, cstacks and sstacks).

I know that I have overlapping loci, so I ordered export, in my vcf file, only unique SNPs appear.

populations -b 6 -P stacks60indRefmap/ -M popmap.txt -m 3 -r 0.8 -p 6 --max_obs_het 0.9 --min_maf 0.02 --ordered_export --vcf -k --window_size 250000 

In sumstats file :
# Batch ID      Locus ID        Chr     BP      Col 
6       64533   CM003279.1      124727  103 
6       64573   CM003279.1      124727  40     

In vcf file: 
#CHROM  POS     ID
CM003279.1      124672  64533
CM003279.1      124727  64533 

CM003279.1      124690  64573 
CM003279.1      124737  64573
CM003279.1      124740  64573 

I tried to create a white list from the loci present in the vcf file and rerun populations. But I have both loci 64533 and 64573 in my white list. So, is it possible to create a white list with the loci and the snp position, instead of the colomn (since I don't have the colomn information in the vcf file) ?

Thanks, 
Maeva

Norah Saarman

unread,
Jan 14, 2019, 12:43:49 PM1/14/19
to Stacks
Hello all,

Is there an update on this for Stacks v2? 

I have double digest data (nlaIII and mlucI) with a size selection that only left ~100 bp insert length. The reason for this is to match previous ddRAD data, not by my design. I have used Illumina paired-end to get the most out of the sequencing, even though I know this is probably not the best strategy. 


I have demultiplexed in Stacks, assembled to a reference in BWA-mem, realigned indels in GATK, and then run Stacks_v2 ref_map script:

$ ref_map.pl -o ./ --samples ../GATK --popmap OkoleKyoga.popmap \
-T 12 -X "gstacks: --min-mapq 15 --var-alpha 0.01 --gt-alpha 0.01 --details "  \
-X "populations: -p 1 -r 0.5 --min_maf 0.05 --max_obs_het 0.5 --batch_size 5000 --renz nlaIII --merge_sites --vcf --plink -O ./populations_2019-01-05"

Using the VCF output, I can convert to a map file and see that there are 1185 instances of duplicate SNPs (counted at two different loci at the same position on the same scaffold.

$ plink --vcf populations.snps.vcf --double-id --allow-extra-chr --make-bed --out GATKplusStacks_1
$ plink --bfile GATKplusStacks_1 --double-id --allow-extra-chr --recode --tab --out GATKplusStacks_1

$ cat GATKplusStacks_1.map | awk '{print $1 "_" $4}' | uniq -c | grep ^"      2" | head

#      2 Scaffold0_188976

#      2 Scaffold0_189003

#      2 Scaffold0_189006

#      2 Scaffold0_1337593

#      2 Scaffold0_1694204

#      2 Scaffold0_1805958

#      2 Scaffold0_1805978

#      2 Scaffold1_213295

#      2 Scaffold1_213362

#      2 Scaffold1_364081


$ cat GATKplusStacks_1.map | awk '{print $1 "_" $4}' | uniq -c | grep ^"      2" | wc -l

#1185



This means that merging does not seem to be doing what I had hoped, even in the VCF output. I tried putting both enzymes into this command and this doesn't work, so I'm not sure how to get these loci to merge.


In any case, now that I have many libraries sequenced with this paired-end approach, I really want to get all of the data out of the paired-end reads that I can. I want maximum read depth, maximum quality, and I really want to use the overlapping reads to help improve genotyping quality rather than throwing any of the data out. Thus, I would really like to get Stacks working in the way it is meant to: "For double-digest RAD, both the single-end and paired-end reads are anchored by a contig and Stacks will assemble them into two loci. In both cases, the paired-end contig/locus will be merged with the single-end locus."

Can anyone let me know if they have figured out how to use overlapping paired-end reads from two different enzymes in a way that maximizes read-depth and merges duplicate SNPs?

Thank you!!!
Norah

Norah Saarman

unread,
Jan 15, 2019, 12:13:35 PM1/15/19
to Stacks
Hi again,

I did try the same script again with --ordered_export, and there are no longer duplicates. 

ref_map.pl -o ./ --samples ../GATK/realigned/ --popmap OkoleKyoga.popmap \
-T 12 -X "gstacks: --min-mapq 15 --var-alpha 0.01 --gt-alpha 0.01 --details "  \
-X "populations: -p 1 -r 0.5 --min_maf 0.05 --max_obs_het 0.5 --batch_size 5000 \
--renz nlaIII --merge_sites --ordered_export --vcf --plink -O ./populations_2019-01-14"

My question remains, are the duplicate forward/reverse sites merged? Or is just one selected in the ordered output? If only one of the two is selected, then these locations are not trustworthy biallelic SNPs, and I would like to remove those without matching forward/reverse alleles and merge the sites with matching forward/reverse alleles.

Please send input!

Thanks,
Norah

Catchen, Julian

unread,
Jan 15, 2019, 12:50:38 PM1/15/19
to stacks...@googlegroups.com, Norah Saarman
Hi Norah,

In Stacks 2 the entire locus is assembled as a single object, so asking
if they are 'merged' is not quite right, because it implies that the
single-end reads and paired-end reads are assembled separately and then
put together. Anyway, the end result is the same in that each locus
should contain both the single and paired-end reads.

The --ordered_export flag further uses the genome to find overlapping,
but distinct RAD loci, and only export the data from one representative
for each genome nucleotide.

I would take a look at the SNPs that exist twice in the VCF, find the
locus they originate from, and then have a look at the alignment
positions for those loci to see how the loci relate to one another in
the genome.

julian

Norah Saarman wrote on 1/15/19 6:13 PM:
> $ ref_map.pl <http://ref_map.pl> -o ./ --samples ../GATK --popmap
> meant to: "_For double-digest RAD, both the single-end and
> paired-end reads are anchored by a contig and Stacks will assemble
> them into two loci. In both cases, the paired-end contig/locus will
> be merged with the single-end locus._"

Norah Saarman

unread,
Jan 15, 2019, 1:38:25 PM1/15/19
to Stacks
Hi Julian,

Thank you so much for your help! 

I am still confused about how you get two SNPs at the same genome nucleotide with mismatched alleles when the entire locus (forward and reverse reads, right?) is assembled as a single object. Maybe I'm not understanding something about the assembly process... can you clarify how this could happen, and if it is something I should continue to worry about? 

I am using the SNP calls for GWAS analysis, and I am finding that there are many SNPs (up to 20%) with no minor allele homozygotes in the 132 individuals genotyped, and so I am concerned about the genotype quality I am able to produce with the reference available. My hope is that I can use strict filters in Stacks v2 to remove suspect loci before the GWAS analysis.

Thank you once again, your continued help and dedication to this program is much appreciated!

- Norah

Catchen, Julian

unread,
Jan 15, 2019, 2:01:13 PM1/15/19
to stacks...@googlegroups.com, Norah Saarman
Hi Norah,

You can have RAD sites that are nearby in the genome and overlap
portions of each other. Both sites will be sequenced, albeit both at
lower coverage (see Davey, et al. 2012. Mol Ecol).

As to the SNP calling, Stacks does not use the reference genome for SNP
calling, it is only used for alignment. Stacks applies a Bayesian model
that looks at the frequency of the alleles in the population and then
calls the specific genotype for each sample (Maruki and Lynch, 2017).

I would suggest you check your depth of coverage and the distribution of
how many samples each loci are found in (recorded in
gstacks.log.distrib). If you have low coverage or most loci are not
found widely in your population, you may have quite a bit of allele drop
out.

julian

Norah Saarman wrote on 1/15/19 7:38 PM:
> > ref_map.pl <http://ref_map.pl> -o ./ --samples ../GATK/realigned/
> >     $ ref_map.pl <http://ref_map.pl> <http://ref_map.pl> -o ./
Reply all
Reply to author
Forward
0 new messages