Hi Gemma,
I want to walk you through how I would answer this question. I will use
my own data set, but you should be able to apply this to your dataset.
The
batch_1.sumstats.tsv
file contains all the variable sites in each population across the
dataset. We can use this to see how many sites there should be.
1) Calculate the number of SNPs exported. The sumstats file provides the
SNPs per population, so we have to reduce them down to a single copy.
(The following command extracts columns 2 and 5 from the file which
contain the catalog locus ID and SNP column number. These two numbers
combined make each SNP unique in the dataset.)
cat batch_1.sumstats.tsv | grep -v
"^#" | cut -f 2,5 | sort | uniq | wc -l
39017
Now, there can be more than one SNP per locus, and the FASTA sequences
are one sequence per locus, so we need to reduce this number down to the
unique number of catalog loci that exported the 39017 SNPs:
cat batch_1.sumstats.tsv | grep -v
"^#" | cut -f 2 | sort | uniq | wc -l
18316
Okay, so, 39K SNPs from 18K catalog loci.
2) Let's cross check this against the VCF file to make sure we are
getting the same results:
The VCF file will report one SNP per line, so it's easy to count the
number of SNPs:
cat batch_1.vcf | grep -v "^#" |
wc -l
39017
Now, let's do the same thing and reduce the SNPs down to the common set
of catalog loci they originate from (column 3 contains the catalog locus
each SNP originates from):
cat batch_1.vcf | grep -v "^#" |
cut -f 3 | sort | uniq | wc -l
18316
Okay, good, the two files agree exactly!
3) Now, we need to see what is being reported in the FASTA strict file.
Each FASTA header contains the catalog locus that the sequence
originates from ("CLocus"):
head batch_1.strict.fa
>CLocus_3_Sample_27_Locus_3_Allele_0
[rb_2217.011]
TGCAGGGGAACCAATGNNNNTACTGCGGTCAAAGCGGTCATTATATGAACGCTTGTCCGGTAAAANGACAGGGC
>CLocus_3_Sample_27_Locus_3_Allele_1
[rb_2217.011]
TGCAGGGGAACCAATGNNNNTACTGCGGTCAAAGCGGTCATTTTGTGAACGCTTGTCCAGTAAAANGACAGGGC
>CLocus_5_Sample_1_Locus_5_Allele_0
[ms_2067.01]
TGCAGGCACGCAAAGCAGAGAGATCCGACTTTATAATGCAGTTACATCAATTCCTCGCCNNNNNTCGTAAAAAN
>CLocus_5_Sample_1_Locus_5_Allele_1
[ms_2067.01]
TGCAGGCACGCAAAGCAGAGAGATCCGACTTTATAATGCAGTTACATCAATTCCTCGCCNNNNNTCGTAAAAAN
Let's count the number of catalog loci in the data set (we extract the
FASTA header lines and feed them into
sed to extract the catalog locus number embedded in
the header, then
sort and
uniq them to get a count):
cat batch_1.strict.fa | grep
"^>" | sed -E -e 's/^>CLocus_([0-9]+)_Sample.+/\1/' | sort -n |
uniq | wc -l
26469
Shit, we have about 8K too many catalog loci here.
4) Well, let's figure out which ones are different between the two files
by capturing the list of catalog loci from the two files:
cat batch_1.sumstats.tsv | grep -v
"^#" | cut -f 2 | sort -n | uniq > sumstats_cat_ids
and
cat batch_1.strict.fa | grep
"^>" | sed -E -e 's/^>CLocus_([0-9]+)_Sample.+/\1/' | sort -n |
uniq > fasta_cat_ids
Now, let's find what's different between the files:
diff -u sumstats_cat_ids
fasta_cat_ids | more
--- fasta_cat_ids 2016-04-20
15:45:09.000000000 -0500
+++ sumstats_cat_ids 2016-04-20
15:30:03.000000000 -0500
@@ -11,7 +11,6 @@
18
19
20
-21
23
24
25
@@ -41,11 +40,9 @@
56
57
58
-59
60
61
62
-63
64
65
66
...
What we can see is that locus 21, 59, and 63 are in the FASTA file, but
not the sumstats file. Why?
The best thing to do now is to look up those loci in the web interface.
If you don't have that, you can grep them out of the catalog:
zgrep "^0 1 21 "
batch_1.catalog.snps.tsv.gz
When I do this, I see that the missing catalog loci are all fixed. So,
the FASTA file contains fixed and variable loci while the sumstats and
VCF files contain only variable loci. If I want exactly the same loci
from both files, I can specify a whitelist. I already made one from the
sumstats file above -- it is the list of the variable catalog loci, I
called it
sumstats_cat_ids.
You can feed this into
populations
with the
-W parameter and
it will only output these loci in the FASTA file (and all other files).
Hope this helps.
Best,
julian
ps - UNIX rules.
April
14, 2016 at 11:10 AM