Input and output files

Sarah E

unread,

Aug 22, 2016, 2:17:50 PM8/22/16

to polyfreqs-users

Hello,

I am excited to come across polyfreqs and hoping it may be the solution I am looking for. I am relatively new to analyzing RAD-seq data as well as working with polyploids, and have been struggling to make headway with some pilot data from a putatively tetraploid plant species. I have a few questions regarding the input format and eventual output from the package, and any help is much appreciated!

1) Can polyfreqs handle SNP calls from any pipeline? I have run the Stacks denovo pipeline with recommended parameters for polyploids, allowing for up to 5 stacks per locus, and if I look at a histogram of the allele frequencies for a single individual across all SNPs I get peaks around 25%, 50%, and 75%, seeming to confirm that it is tetraploid. However, it has been difficult to figure out exactly what is happening because the individual SNPs files only show 2 possible alleles. Are the results in the SNP catalog usable here, or do I need to start somewhere else?

2) Is there any program that rearranges SNP data into a format similar to the input required for polyfreqs?

3) If my species is actually allotetraploid instead of autotetraploid, will I be able to tell from poor model fit? Again, I have had a tough time determining this from Stacks output but perhaps there is a simpler way to figure it out.

4) Can polyfreqs produce output files that are compatible other pop gen programs, like fastStructure?

Thanks a lot,

Sarah

Paul Blischak

unread,

Aug 23, 2016, 10:18:27 AM8/23/16

to polyfreqs-users

Hi Sarah,

Thanks for your interest in using polyfreqs! I'll do my best to answer your questions below but please feel free to let me know if anything is unclear.

1) polyfreqs should be able to handle SNP data that are collected from any pipeline because it uses the read counts (total number of reads and number of reference reads mapping to a SNP locus) and not the genotype calls that any particular pipeline will give you. These read counts can be specified in separate, tab-delimited text files that are then read into R. Since you are using Stacks, you will need to get the total number of sequencing reads that map to each SNP that it identifies, as well as the number of reads for the reference allele (I'm guessing the program usually chooses this, which is fine) for each individual that you have sampled. To make this easier, I think that Stacks can output its SNP calls in VCF format, which should have those numbers embedded within the output VCF file (the AD field within the genotype section has the "allele depths" that you will want). Once you have the total read counts and reference read counts for each individual at each locus, you can read them into R, convert them to matrices (using `as.matrix()`), and then pass them as arguments to the `polyfreqs()` function.

2) Lindsay Clark has a Python program called TagDigger (https://github.com/lvclark/tagdigger) that looks like it could be good for getting the read counts that polyfreqs uses, too. Otherwise, it will likely take a bit of scripting to extract the read counts from the VCF file Stacks gives you.

3) This is a really good question that I unfortunately do not know the answer to because I have not explored it yet. I'm working on some models for allopolyploids right now and have thought about ways of determining if something is an auto- or allopolyploid but these ideas are not really far enough along to be useful to you at the moment. However, I'd be happy to share them with you once they are more fully developed.

4) The outputs of polyfreqs are the posterior samples of the allele frequencies, and the observed and expected heterozygosity at each locus. If you set `genotypes=TRUE` when you run the program, it will also give you a matrix of genotype calls for each individual at each locus. This genotype matrix can then be saved as a tab-delimited text file (using the `write.table()` function in R). I believe fastStructure can take data that are in the original Structure format, so it should be fairly easy to take this genotypes files from polyfreqs and format it for fastStructure. I don't currently have a function for doing this automatically, though.

--------

Hopefully that helps! If you have any other questions please don't hesitate to ask.

Best,

Paul

Message has been deleted

Sarah E

unread,

Aug 23, 2016, 2:23:25 PM8/23/16

to polyfreqs-users

Hi Paul,

Thanks for the quick response! Regarding my first question, I can get those AD values from the VCF files output by Stacks, but I am concerned that the VCF is produced by the populations program of Stacks, which removes any loci with >2 alleles. Does the polyfreqs model assume all loci are biallelic anyway? I can see why most SNP loci in an autopolyploid would be biallelic, but assume that there would be some loci with 3 or even 4 alleles due to random mutation since duplication (or error). Is my concern valid or am I thinking about this the wrong way?

Thanks,

Sarah

Paul Blischak

unread,

Aug 23, 2016, 2:54:32 PM8/23/16

to polyfreqs-users

Hi Sarah,

polyfreqs does assume that that all SNP are biallelic. So the fact that Stacks throws out loci with more than two alleles present won't be a problem, at least not for using polyfreqs, because it wouldn't be able to analyze them anyway. Sorry for not making that more clear from the start.

Paul

Sarah E

unread,

Aug 24, 2016, 12:53:28 PM8/24/16

to polyfreqs-users

Thanks, that's pretty much what I gathered from the publication, but wanted to check. I got everything formatted as specified and now of course have more questions. I am getting the following error:

> out1<-polyfreqs(total_table,ref_table,ploidy=4) #default run

Starting MCMC...

Error: Invalid random integer generated

Any idea what would cause that? Also can polyfreqs handle missing data or do all loci need to have non-zero total reads values for every individual?

Sarah

Paul Blischak

unread,

Aug 24, 2016, 1:02:47 PM8/24/16

to polyfreqs-users

That's an error message I wrote in myself for when genotypes are sampled during the MCMC. How are you coding your missing data? Are you using 'NA' or '0'. polyfreqs can handle missing data but you need to enter it as '0' in the total and reference read matrices (not the usual 'NA' that R uses).