fasta_strict output - more loci than expected

Gemma

unread,

Apr 14, 2016, 12:10:53 PM4/14/16

to Stacks

Hi all,

I apologise if this is a dumb question or it has been answered elsewhere. I'm trying to export my data both as SNPs in a vcf file, and with the full RAD locus surrounding the SNPs in fasta format. When I export to vcf, writing one random snp per tag, I get 10560 SNPs, and my sumstats_summary file agrees with this. However, when I also export the -fasta_strict file, I get a file that has too many lines in it (i.e. it has more than 10560 RADtags) and I can't work out why.

I'm running version 1.35 and my populations command is:

populations -b 1 -P ./stacks/rx/populations_m6p7r08_fasta_repeat/ -M ./stacks/rx/popmap -s -t 32 -p 7 -m 6 -r 0.8 --min_maf 0.01 --max_obs_het 0.5 --write_random_snp --fasta_strict --vcf

Then:

$ wc -l batch_1.strict.fa

7520840 batch_1.strict.fa

So my fasta file has 7520840 lines in it.

Half of the lines in the file are header lines -> 7520840 / 2 = 3760420 lines of sequence

There are two sequences per individual at each locus as they are diploid -> 3760420 / 2 = 1880210

I have 82 individuals -> 1880210 / 82 = 22929 loci roughly. (This is an underestimate of the number of RAD loci, as missing genotypes are not exported to the fasta file I believe.)

I have checked for individuals having multiple alleles:

$ grep _Allele_2 batch_1.strict.fa

$ grep _Allele_3 batch_1.strict.fa

Can someone explain why there appear to be so many more RADtags in the strict fasta file than the number of SNPs in the vcf?

Many thanks for any help,

Gemma

Julian Catchen

unread,

Apr 20, 2016, 4:58:37 PM4/20/16

to stacks...@googlegroups.com, gvcl...@gmail.com

Hi Gemma,

I want to walk you through how I would answer this question. I will use my own data set, but you should be able to apply this to your dataset.

The batch_1.sumstats.tsv file contains all the variable sites in each population across the dataset. We can use this to see how many sites there should be.

1) Calculate the number of SNPs exported. The sumstats file provides the SNPs per population, so we have to reduce them down to a single copy. (The following command extracts columns 2 and 5 from the file which contain the catalog locus ID and SNP column number. These two numbers combined make each SNP unique in the dataset.)

cat batch_1.sumstats.tsv | grep -v "^#" | cut -f 2,5 | sort | uniq | wc -l
39017

Now, there can be more than one SNP per locus, and the FASTA sequences are one sequence per locus, so we need to reduce this number down to the unique number of catalog loci that exported the 39017 SNPs:

cat batch_1.sumstats.tsv | grep -v "^#" | cut -f 2 | sort | uniq | wc -l
18316

Okay, so, 39K SNPs from 18K catalog loci.

2) Let's cross check this against the VCF file to make sure we are getting the same results:

The VCF file will report one SNP per line, so it's easy to count the number of SNPs:

cat batch_1.vcf | grep -v "^#" | wc -l
39017

Now, let's do the same thing and reduce the SNPs down to the common set of catalog loci they originate from (column 3 contains the catalog locus each SNP originates from):

cat batch_1.vcf | grep -v "^#" | cut -f 3 | sort | uniq | wc -l
18316

Okay, good, the two files agree exactly!

3) Now, we need to see what is being reported in the FASTA strict file. Each FASTA header contains the catalog locus that the sequence originates from ("CLocus"):

head batch_1.strict.fa
>CLocus_3_Sample_27_Locus_3_Allele_0 [rb_2217.011]
TGCAGGGGAACCAATGNNNNTACTGCGGTCAAAGCGGTCATTATATGAACGCTTGTCCGGTAAAANGACAGGGC
>CLocus_3_Sample_27_Locus_3_Allele_1 [rb_2217.011]
TGCAGGGGAACCAATGNNNNTACTGCGGTCAAAGCGGTCATTTTGTGAACGCTTGTCCAGTAAAANGACAGGGC
>CLocus_5_Sample_1_Locus_5_Allele_0 [ms_2067.01]
TGCAGGCACGCAAAGCAGAGAGATCCGACTTTATAATGCAGTTACATCAATTCCTCGCCNNNNNTCGTAAAAAN
>CLocus_5_Sample_1_Locus_5_Allele_1 [ms_2067.01]
TGCAGGCACGCAAAGCAGAGAGATCCGACTTTATAATGCAGTTACATCAATTCCTCGCCNNNNNTCGTAAAAAN

Let's count the number of catalog loci in the data set (we extract the FASTA header lines and feed them into sed to extract the catalog locus number embedded in the header, then sort and uniq them to get a count):

cat batch_1.strict.fa | grep "^>" | sed -E -e 's/^>CLocus_([0-9]+)_Sample.+/\1/' | sort -n | uniq | wc -l
26469

Shit, we have about 8K too many catalog loci here.

4) Well, let's figure out which ones are different between the two files by capturing the list of catalog loci from the two files:

cat batch_1.sumstats.tsv | grep -v "^#" | cut -f 2 | sort -n | uniq > sumstats_cat_ids

and

cat batch_1.strict.fa | grep "^>" | sed -E -e 's/^>CLocus_([0-9]+)_Sample.+/\1/' | sort -n | uniq > fasta_cat_ids

Now, let's find what's different between the files:

diff -u sumstats_cat_ids fasta_cat_ids | more

--- fasta_cat_ids      2016-04-20 15:45:09.000000000 -0500
+++ sumstats_cat_ids    2016-04-20 15:30:03.000000000 -0500
@@ -11,7 +11,6 @@
18
19
20
-21
23
24
25
@@ -41,11 +40,9 @@
56
57
58
-59
60
61
62
-63
64
65
66
...

What we can see is that locus 21, 59, and 63 are in the FASTA file, but not the sumstats file. Why?

The best thing to do now is to look up those loci in the web interface. If you don't have that, you can grep them out of the catalog:

zgrep "^0    1    21    " batch_1.catalog.snps.tsv.gz

When I do this, I see that the missing catalog loci are all fixed. So, the FASTA file contains fixed and variable loci while the sumstats and VCF files contain only variable loci. If I want exactly the same loci from both files, I can specify a whitelist. I already made one from the sumstats file above -- it is the list of the variable catalog loci, I called it sumstats_cat_ids. You can feed this into populations with the -W parameter and it will only output these loci in the FASTA file (and all other files).

Hope this helps.

Best,

julian

ps - UNIX rules.

Gemma

April 14, 2016 at 11:10 AM

Gemma

unread,

Apr 21, 2016, 12:05:12 PM4/21/16

to Stacks, gvcl...@gmail.com, jcat...@illinois.edu

Hi Julian,

This is fantastic, thank you so much! It all makes sense now. I had (eventually) worked out I could get the loci I needed using a whitelist, but it always feels better when you get to the bottom of a puzzle!

One tiny correction (I think) for anyone else who is working through this: in the last bit where you show how to find the fixed loci in the catalog rather than through the web interface, you want to search for the loci in the batch_1.catalog.tags.tsv.gz rather than the batch_1.catalog.snps.tsv.gz - you won't find them there as there are no snps in those loci!

Thanks again for the help.

Best,

Gemma

P.S. Agreed!

Natalia Bayona Vásquez

unread,

Jan 13, 2017, 3:26:06 PM1/13/17

to Stacks, gvcl...@gmail.com, jcat...@illinois.edu

Hi Julian,

I wanted to create a fasta.strict file with only those loci that are polymorphic, so I followed your suggestion to create a whitelist list from the sumstats.tsv file and use that one with the -W parameter in populations. Anyway giving a look to the fasta output using a small dataset I find loci that all the alleles recorded per locus are the exact same sequence. I was wonering why this is happening. Attached I am sending you the fasta.strict file obtained, the whitelist and the sumstats.tsv.

As an example, locus 407, 593, 1122, 2535 etc. seem to be not polymorphic.

Can we track what is happening in this case?

I will really appreciate it.

Thanks!

Natalia Bayona

batch_210.strict.fa

fasta_cat_ids

ssierra_sumstats_cat_ids

Reply all

Reply to author

Forward