Dear Ryan and dadi users,
I am trying to figure out if my data are suitable for use in dadi by plotting the 1D and 2D spectra. The plots I am getting look weird to me in the sense that I haven't seen similar plots on this forum or in the literature. Thus, I am unsure whether the spectra I am getting are biologically realistic/plausible or whether they result from some technical issues in my data processing pipeline.
I have low coverage (~7X), whole-genome sequencing data from three populations (AENP, KNP and HiP) of the same subspecies of African buffalo. I used ANGSD (command: realSFS dadi) and the
realsfs2dadi.pl script (
https://github.com/z0on/2bRAD_denovo/blob/master/realsfs2dadi.pl) to create my SNPs input file for dadi, which I also manually thinned to get ~170 000 unlinked sites (10 Kb apart, on average). I have 5 samples from each population for this analysis (I actually have 15 samples from each of KNP and HiP, but I randomly chose 5 samples from each to match the sample size from AENP, to mitigate any potential confounding factors as a result of differences in sample size).
I don't have missing data, so I didn't project down (I projected to 10, which is my sample size in "chromosomes"). Example of commands loading 1D and 2D fs from data dictionary:
fs1 = dadi.Spectrum.from_data_dict(dd, pop_ids=['AENP'], projections=[10], polarized=False)
fs12 = dadi.Spectrum.from_data_dict(dd, pop_ids=['AENP', 'KNP'], projections=[10, 10], polarized=False)
I know from previous microsatellite studies that these three populations are quite differentiated and when I calculated pairwise Fst in dadi using the 2D frequency spectra, it gave me more or less the results I expected:
AENP-KNP: Fst = 0.098
AENP-HiP: Fst = 0.13
KNP-HiP: Fst = 0.085
This tells me that my SNP data are able to distinguish the populations as expected and gives me some confidence that my SNP data are OK.
However, when I plot the 1D folded SFS (ancestral state not known), I get a sharp increase in the number of sites in the final SFS entry (see attached figures) for all three populations. I am unsure how to interpret this. Does it indicate that there are many sites with a minor allele frequency close to or at 0.5, in other words it can't tell which is the minor allele? Should I remove these sites from the dataset? Is perhaps indicative of a bottleneck or is it a technical artefact? The first entry for both AENP ad HiP has fewer sites than KNP, which makes sense given what we think happened in these populations: AENP and HiP underwent fairly strong bottlenecks and slow growth after that, whereas KNP underwent a less severe bottleneck and more rapid expansion after that, so KNP should have a higher frequency of rare alleles than the other two populations.
When looking at the 2D SFS plots, I again am not sure how to interpret them. It seems to reflect a scenario where populations are diverged, as the diagonal is depopulated in relation to the edges, but the difference doesn't seem to be as stark as in other figures I've seen (perhaps I just need to change the y-axis scale?). Also, the top-right corner of the diagonal is densely populated, while in most other plots it is generally empty because it is masked by default. Does this densely populated block in the top-right reflect the same sites as the spike in the 1D spectra at the 5th SFS entry? And then also half the plot is empty, but is this just because the spectrum is folded and I have low sample sizes?
Sorry for the many questions! I would really appreciate any insight into these plots to decide whether I should even be using dadi in the first place (in essence: is my sample size too small?) or whether I should go back to my data processing pipeline to try and sort out any technical issues that are resulting in these plots, or whether my data are fine to continue with.
Kind regards
Deon