ddocent SNP filtering, vcf file, rad_haplotyper

123 views
Skip to first unread message

oswa...@gmail.com

unread,
Feb 27, 2020, 6:10:18 PM2/27/20
to dDocent User Help Forum
Hi all, 
 For a first pass on my dataset, I used the ddocent assembly tutorial followed by the SNP filtering tutorial. I filtered via: 
1. kept variants successfully genotyped:

       > in 50% of individuals

       > w/min. quality score of 30 

       > minor allele count of 3 (this is low but there is script you can run to determine the probability of including bad data)

2. removed individuals with w/more that 55% missing data. 

3. kept variants with a genotype call rate (95%) across all individuals.


Right then I got a vcf file with my filtered SNPs (looks like I have around 3604). How do you all proceed from here? Do you use that vcf file in analyses? What do you all use to convert that file into say a fasta or a STRUCTURE file?

I used a python script to convert the vcf to a fasta from this step and it removes more SNPs because of missing data (which I don't totally understand since I presumably filtered those out already using ddocent; this likely has to do with the script, etc.). The resulting fasta alignment is horrific with lots of ambiguities (perhaps driven by heterozygous sites).

Next step then - use rad_haplotyper to sort out those heterozygous sites. However when I do this even more data is removed and labeled as "missing" when using this script (assessed by the comment section in the .out file). Any idea why haplotyper is removing SNPs because they're missing when I am giving it my filtered vcf dataset (with the missing data already removed)? Right so from the filtered SNP alignment 1/3 of the SNPs are removed because of complexity issues, 1/3 removed bc of missing data, and the remaining 1/3 are uninformative (site is homogeneous with say As and a few Ns in the mix).

The taxon I'm working on is a land snail. Are they polyploid? No idea. There is no closely related species with a reference genome available. Any ideas or tips would be most welcome. Thanks!
Jess







Joel Anderson

unread,
Apr 14, 2020, 3:35:46 PM4/14/20
to dDocent User Help Forum
Without having the benefit of looking at your data, I would say the following:

1. I would skip the step of converting your .vcf to a FASTA
2. rad_haplotyper is useful because it combines SNPs rather than eliminates them when they are located on the same contig. However I wonder if you aren't losing some loci this way when one SNP or the other is "missing", meaning that the whole locus is "missing" for an individual
3. You can read a .vcf file into R with the command read.vcfR() found in the R package "vcfR"
4. Once you have your data read into R, a good package to do some downstream analysis is "Adagenet". You can convert to Adagenet by using the vcfR function vcfR2genind(), assuming both vcfR and Adagenet are installed.
5. A genind object in Adagenet can be converted to Structure format using a number of R hacks that can be googled

R is probably the best environment to convert your filtered .vcf file into formats that can be used for downstream analysis.

oswa...@gmail.com

unread,
Apr 23, 2020, 4:26:32 PM4/23/20
to dDocent User Help Forum
Thanks, Joel, for your help. 
 Best, 
 Jessica 
Reply all
Reply to author
Forward
0 new messages