Suitability of input data: interpreting 1D & 2D folded SFS

425 views
Skip to first unread message

Deon de Jager

unread,
Dec 18, 2019, 8:28:07 AM12/18/19
to dadi-user
Dear Ryan and dadi users,

I am trying to figure out if my data are suitable for use in dadi by plotting the 1D and 2D spectra. The plots I am getting look weird to me in the sense that I haven't seen similar plots on this forum or in the literature. Thus, I am unsure whether the spectra I am getting are biologically realistic/plausible or whether they result from some technical issues in my data processing pipeline.

I have low coverage (~7X), whole-genome sequencing data from three populations (AENP, KNP and HiP) of the same subspecies of African buffalo. I used ANGSD (command: realSFS dadi) and the realsfs2dadi.pl script (https://github.com/z0on/2bRAD_denovo/blob/master/realsfs2dadi.pl) to create my SNPs input file for dadi, which I also manually thinned to get ~170 000 unlinked sites (10 Kb apart, on average). I have 5 samples from each population for this analysis (I actually have 15 samples from each of KNP and HiP, but I randomly chose 5 samples from each to match the sample size from AENP, to mitigate any potential confounding factors as a result of differences in sample size).

I don't have missing data, so I didn't project down (I projected to 10, which is my sample size in "chromosomes"). Example of commands loading 1D and 2D fs from data dictionary:
fs1 = dadi.Spectrum.from_data_dict(dd, pop_ids=['AENP'], projections=[10], polarized=False)
fs12
= dadi.Spectrum.from_data_dict(dd, pop_ids=['AENP', 'KNP'], projections=[10, 10], polarized=False)

I know from previous microsatellite studies that these three populations are quite differentiated and when I calculated pairwise Fst in dadi using the 2D frequency spectra, it gave me more or less the results I expected:
AENP-KNP: Fst = 0.098
AENP-HiP:   Fst = 0.13
KNP-HiP:     Fst = 0.085
This tells me that my SNP data are able to distinguish the populations as expected and gives me some confidence that my SNP data are OK.

However, when I plot the 1D folded SFS (ancestral state not known), I get a sharp increase in the number of sites in the final SFS entry (see attached figures) for all three populations. I am unsure how to interpret this. Does it indicate that there are many sites with a minor allele frequency close to or at 0.5, in other words it can't tell which is the minor allele? Should I remove these sites from the dataset? Is perhaps indicative of a bottleneck or is it a technical artefact? The first entry for both AENP ad HiP has fewer sites than KNP, which makes sense given what we think happened in these populations: AENP and HiP underwent fairly strong bottlenecks and slow growth after that, whereas KNP underwent a less severe bottleneck and more rapid expansion after that, so KNP should have a higher frequency of rare alleles than the other two populations.

When looking at the 2D SFS plots, I again am not sure how to interpret them. It seems to reflect a scenario where populations are diverged, as the diagonal is depopulated in relation to the edges, but the difference doesn't seem to be as stark as in other figures I've seen (perhaps I just need to change the y-axis scale?). Also, the top-right corner of the diagonal is densely populated, while in most other plots it is generally empty because it is masked by default. Does this densely populated block in the top-right reflect the same sites as the spike in the 1D spectra at the 5th SFS entry? And then also half the plot is empty, but is this just because the spectrum is folded and I have low sample sizes?

Sorry for the many questions! I would really appreciate any insight into these plots to decide whether I should even be using dadi in the first place (in essence: is my sample size too small?) or whether I should go back to my data processing pipeline to try and sort out any technical issues that are resulting in these plots, or whether my data are fine to continue with.

Kind regards
Deon
AENP_fs.png
HiP_fs.png
KNP_fs.png
AENP-HiP_fs_vmin0.1.png
AENP-KNP_fs_vmin0.1.png
KNP-HiP_fs_vmin0.1.png

Ryan Gutenkunst

unread,
Dec 18, 2019, 5:08:24 PM12/18/19
to dadi-user
Hello Deon,

Yes, there are definitely weird things in your spectra.

1D: That huge excess of alleles at exactly 50% frequency is not something any sensible dadi model will reproduce. Potential data issues: 1) a huge number of paralogs that are causing mismapping, 2) some issue with ANGSD and folding.

2D: It seems these results are folded based on the minor allele for each population separately. In dadi, we fold on the overall minor allele. As an example, based on your plot, I’m guessing the SNP with frequencies AENP: 9/10 and HiP: 0/10 is getting represented as 4/10 AENP and 0/10 HiP. Dadi can’t interpret the data this way. ANGSD should have the capability to output properly folded 2D spectra, but I’m not sure how to coax it.

Best,
Ryan
> --
> You received this message because you are subscribed to the Google Groups "dadi-user" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/b7f0126e-215e-46dd-a878-11b8b5a598bc%40googlegroups.com.
> <AENP_fs.png><HiP_fs.png><KNP_fs.png><AENP-HiP_fs_vmin0.1.png><AENP-KNP_fs_vmin0.1.png><KNP-HiP_fs_vmin0.1.png>

Deon de Jager

unread,
Dec 19, 2019, 4:58:18 AM12/19/19
to dadi-user
Hi Ryan,

This helps a lot, thanks! I don't really expect many paralogs in the buffalo genome, so I think it is more likely an issue with ANGSD and folding. I'll go back to ANGSD and see what I can work out from there, or try some alternative approaches.

Thanks for your commitment to helping everyone out on this forum. It really is commendable!

Kind regards
Deon
> To unsubscribe from this group and stop receiving emails from it, send an email to dadi...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages