missing data and SFS

1,303 views
Skip to first unread message

jason....@gmail.com

unread,
Mar 19, 2015, 8:12:20 AM3/19/15
to fasts...@googlegroups.com
Dear All,

I am using fastsimco2 to estimate parameters from an observed (joint) SFS derived from GBS/RAD Tag type data (using 1 SNP per tag). The problem is that GBS generated SNP datasets have lots of missing data. For a dataset with many individuals, there are few SNPs with genotypes called for all individuals. My question is whether this is an issue in generating joint SFS with missing data and using those to estimate parameters using the likelihood functions of fastsimco2? I am guessing it is and if so if there is a work around?

Thanks,

Jason

Ivan Scotti

unread,
Mar 19, 2015, 8:32:10 AM3/19/15
to fasts...@googlegroups.com
The problem is that you cannot put together in the same SFS loci with different sample sizes (=numbers of VALID genotypes). This goes beyond the software, it is a problem of how you treat and interpret SFSs (if you mix loci with different sample sizes, you bias the SFS in favour of loci with small frequencies). So you should either work with the smallest single-locus sample size, or you should select a sample size threshold and (a) throw away the loci with fewer samples and (b) reduce the sample size to the threshold sample size for those that have more.

Ivan

jason....@gmail.com

unread,
Mar 19, 2015, 8:54:33 AM3/19/15
to fasts...@googlegroups.com
Hi Ivan,

I expected this was the case (i.e. inability to include loci with different sample size). This is unfortunate, because for GBS/RAD TAG type datasets with large numbers of individuals, it doesn't matter what sort of filtering strategy is done to take out individuals and/or loci with missing data, you often will still end up with no (or very few) loci found in all individuals. As GBS-type datasets become more commonly used, it might be worth thinking of a work around to this problem.

For example:
1) simulate SFS in such a way that allows for different sample size (seems easy enough), so that the likelihoods take this into account (I am not sure about this part).
2) have a program like fastsimco2 take a series of SFS, each for a different sample size. The likelihoods would then be computed for the series.

Jason

Ivan Scotti

unread,
Mar 19, 2015, 9:26:28 AM3/19/15
to fasts...@googlegroups.com
Hi,

I do not know about the methodological / statistical development. I suspect that since all theory on SFS is based on the assumption that allele counts are "homogeneous", this will be a rather big chunk of work... on the other hand, not all analytical methods are suitable for all types of data... and I'd say that typically SFS is not suitable for RADseq. 

Having said that, it seems to me that the SFS-based tests are very powerful even with small sample sizes, so you might be able to make use of a substantial subset of your data.

Ivan

Laurent Excoffier

unread,
Mar 21, 2015, 9:29:37 AM3/21/15
to fasts...@googlegroups.com
I agree with Ivan, but yes these missing data in NGS are a real problem.
My own recommendation would be to increase the quality of your data, i.e. increase read depth to minimize missing data, for instance by reducing the number of individuals sampled. This would also lower the costs of the analyses as individual libraries are usually quite costly.
I think that good quality RAd seq data should be amenable to SFS analyses.

Ivan Scotti

unread,
Mar 22, 2015, 5:04:08 AM3/22/15
to fasts...@googlegroups.com
I think that the problem with RADtag data is even more serious: data are not necessarily "missing" strictly speaking, because part of the sequence data are just absent from the data set due to restriction site polymorphism (you only "sample" the chromosomes that carry a functional restriction site at a given locus). This introduces a further bias (LD in the remaining data? artificial reduction of polymorphism due to LD with the restriction site?) and I would be very cautious with the interpretation of within-pop analyses. It may be that the bias is less serious for population comparisons, if the populations are not too divergent.

See for example this paper:

Arnold B, Corbett-Detig RB, Hartl D, Bomblies K (2013) RADseq underestimates diversity and introduces genealogical biases due to nonrandom haplotype sampling. Mol Ecol 22:3179–3190. doi: 10.1111/mec.12276

In our tree population data, we have observed very low levels of polymorphism in sequences from RAD tags, which seems coherent with the fact that polymorphisms are in LD with the restriction site, so that when you "throw away" the chromosomes with a mutation in the restriction site, you also throw away associated SNP variants.

Ivan





jason....@gmail.com

unread,
Mar 23, 2015, 12:59:25 PM3/23/15
to fasts...@googlegroups.com
Hi Ivan,

Of course it makes sense that a bias will be introduced, and this bias will be even stronger when 2 restriction enzymes are used (e.g. double digest methods). Still, it would seem that the vast majority of missing data in RADseq/GBS type studies comes from low sequencing depth rather than mutations in cut-sites. Many of those that come from mutations in cut sites could be easily filtered (i.e. when comparing 2 deeply diverged populations, and only 1 population has data for a loci). However, I was actually surprised by how weak the bias appears in the Arnold et al. paper for loci in which only a few individuals have dropped out due to mutations in the cut sites.

Given the current popularity of RADseq, I think what is needed is a abc or likelihood method developed specifically to handle missing data resulting from low to medium coverage.

Mikhail Matz

unread,
Mar 29, 2015, 2:07:25 PM3/29/15
to fasts...@googlegroups.com
Hello all - I second Jason's notion that in RAD allele drop out (ADO) due to cut site mutation is quite rare compared to missing calls due to low coverage. In 2bRAD we get a few thousand markers present in 75-80% of all individuals, so ADO is probably not a big problem (much ADO about nothing, hehe), but we still need to solve the missing data problem to make fastsimcoal run. How about these solutions:
1. Unless I am confusing things, dadi has a downsampling option to produce SFS with lower than actual sample size but with all markers represented, can we use that as a front-end to fastsimcoal?
2. Would it be completely silly to simply impute the missing calls based on within-deme allele frequency (by binomial sampling), just for the purposes of SFS construction to make fastsimcoal run?

kritika...@gmail.com

unread,
Nov 26, 2015, 6:47:36 AM11/26/15
to fastsimcoal
Hi All,
I am using ddRADSeq to obtain SNPs for my work. I am interested in using fastsimcoal for model comparison (based on AIC) and parameter estimation. I have a question regarding missing data. I have low level of missing data (~10%). Allowing for less than 10% missing data I have isolated ~3000 SNPs (these are assumed to be unlinked as I have isolated only a single SNP per fragment). If I remove the loci with missing data I am left with ~1300 SNPs. Should I only use SNPs with no missing data or can I use all the SNPs isolated for SFS generation. Will 10% missing data also affect SFS.

Thank you

Regards
Kritika

Laurent Excoffier

unread,
Nov 26, 2015, 7:14:40 AM11/26/15
to fastsimcoal
You should onla use SNPs without missing data, but in your case I'm afraid that 1300 SNPs will not lead to very meaningful estimates. To reassure you, 3000 SNPs would  also probably not be enough...
For this small number of loci, maybe full likelihood methods, e.g. IMa, could be used
best
laurent

Schyler Nunziata

unread,
Dec 16, 2015, 10:57:01 AM12/16/15
to fastsimcoal
Hi,
I have been reading a few papers that use a couple thousand SNPs for fsc2. Is there a rough estimate of the minimum number of SNPs that would lead to meaningful results with this program?

Laurent Excoffier

unread,
Dec 20, 2015, 4:23:17 PM12/20/15
to fastsimcoal
Hi,

some guidelines on the relationship between number of SNPs and accuracy on parameter estimation is found in 
http://arxiv.org/abs/1505.04228

But it all depends on the model and the number of parameters to estimate from the SFS.

A few thousand SNPs should be enough to estimate a few parameters (1-3 relatively well but not for all models (e.g. bottleneck sizes would require much more.

Bootstrapping should give you some idea about the precision of your parameters.

Laurent


Reply all
Reply to author
Forward
0 new messages