Hello!
I am working on 3 lineages of tree-frogs, and I am interested in testing whether they experienced a period of isolation before coming into contact at 2 hybrid zones (the current situation).
I am using low-coverage (~4x) whole-genome resequencing data (NovaSeq). I am therefore estimating genotype likelihoods in ANGSD and then using those likelihoods to produce the SFSs. I am somewhat limited by sample size, as I have 4, 6 and 7 diploid individuals per lineage sequenced for lcWGS, although to my understanding that should be roughly OK (mostly based on https://peerj.com/articles/9939/ and on seeing people work with similar sample sizes).
I did some standard filtering (Q>20, mapQ>20, min individuals 3, min depth 3, etc.). I have not yet filtered for LD or for paralogues.
I then produced 1d SFS with the resulting data.
Including only polymorphic sites the 1dSFS looks like this:
![8_1_1dSFS.png](https://groups.google.com/group/dadi-user/attach/6ba064bfd0c0/8_1_1dSFS.png?part=0.1&view=1)
To my understanding there is an issue as the LNs population in particular has a U shape instead of a monotonically decreasing shape. The LS population also has an increase towards the right. I am working with different filters (e.g., increase mapQ, add heterozygosity filter, filter paralogues) to address this issue.
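To make "U shape" concrete, this is roughly the check I have in mind, sketched in plain Python. The counts below are made-up placeholders (not my data) for 8 chromosomes, i.e. 4 diploids, and the function names are hypothetical helpers of my own. The sketch assumes the standard neutral expectation, where the expected count in derived-frequency class i is proportional to 1/i, so the polymorphic classes should decrease from left to right.

```python
def neutral_expectation(n_chrom, theta=1.0):
    """Expected unfolded SFS for classes 1..n_chrom-1 under the
    standard neutral model (theta / i). theta is an arbitrary scale here."""
    return [theta / i for i in range(1, n_chrom)]


def is_monotone_decreasing(sfs_polymorphic):
    """True if no frequency class has more sites than the class before it."""
    return all(a >= b for a, b in zip(sfs_polymorphic, sfs_polymorphic[1:]))


def first_uptick(sfs_polymorphic):
    """Return the 1-based frequency class where counts first increase, or None."""
    for i in range(1, len(sfs_polymorphic)):
        if sfs_polymorphic[i] > sfs_polymorphic[i - 1]:
            return i + 1  # list index i holds frequency class i + 1
    return None


# Placeholder counts for classes 1..7 (polymorphic sites only), NOT real data.
u_shaped = [900, 400, 250, 180, 150, 170, 260]

print(is_monotone_decreasing(neutral_expectation(8)))  # neutral expectation
print(is_monotone_decreasing(u_shaped))                # U-shaped example
print(first_uptick(u_shaped))                          # class where it turns up
```

This only flags the shape; it says nothing about whether the upturn comes from paralogues, ancestral-state misidentification, or something else.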
As a sanity check, I produced a 1d SFS with ddRAD data for the same populations (exactly the same sampling locations; I tested for structure and admixture, and the results match between the two datasets). The ddRAD data has ~10x coverage and includes 40 samples per lineage. It was produced by aligning to the reference genome and calling genotypes with Stacks.
The resulting 1d SFSs, produced with easySFS from the filtered VCF file, look like this:
These 1d SFSs do not show a U-shape for any of the lineages. Thus I am pretty certain that the U-shape in the lcWGS data is an artefact. I suspect it is the result of paralogous regions, given that I am working with a frog species and the Hi-C contact maps from the assembly showed many such regions.
Thus, my first question is: am I correct in thinking that the U-shape in the lcWGS 1d SFS is an artefact of my pipeline? (For context, the genome assembly is from an LS individual; LM is very closely related, having probably split ~10,000 generations ago or less, while LNs is the most distantly related.)
Furthermore, in the first figure I am showing only polymorphic sites (i.e., masking the first and last positions).
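In case my wording is unclear, this is what I mean by masking, as a minimal plain-Python sketch. The counts are placeholders (not my data) and `polymorphic_classes` is a hypothetical helper name. For n diploids the unfolded 1d SFS has 2n + 1 entries: entry 0 (derived allele absent from the sample), entries 1 to 2n - 1 (polymorphic in the sample), and entry 2n (derived allele fixed in the sample).

```python
def polymorphic_classes(sfs):
    """Drop entry 0 (derived allele absent) and the last entry
    (derived allele fixed in the sample), keeping polymorphic classes only."""
    return sfs[1:-1]


n_diploid = 4
# Placeholder counts for entries 0..8, NOT real data.
full_sfs = [50000, 900, 400, 250, 180, 150, 170, 260, 3000]
assert len(full_sfs) == 2 * n_diploid + 1

poly = polymorphic_classes(full_sfs)
print(len(poly))  # number of polymorphic frequency classes
print(poly)
```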
If I include the last position, it looks like this:
To my understanding, the peak on the right represents fixed derived loci (the SFS is unfolded, using an outgroup to infer the ancestral sequence before computing the SFS). This figure still doesn't include the '0' frequency class, which would be even higher than the fixed derived class.
To my understanding, this peak to the right is to be expected, and likely correlated with sample size (e.g., here the lineage with the smallest sample size has the most fixed derived loci).
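My reasoning for the sample-size effect is simple binomial sampling, sketched below in plain Python. This assumes error-free genotypes, which is clearly not true at ~4x, and it only covers sites that are still segregating in the population, not true fixed differences from the outgroup. The population frequency 0.9 is an arbitrary example value, and `prob_sample_fixed` is a hypothetical helper name.

```python
def prob_sample_fixed(x, n_chrom):
    """Probability that all n_chrom sampled chromosomes carry the derived
    allele when its population frequency is x (independent draws)."""
    return x ** n_chrom


x = 0.9  # arbitrary example population frequency of the derived allele
for n_diploid in (4, 6, 7):  # my per-lineage sample sizes
    n_chrom = 2 * n_diploid
    print(n_diploid, n_chrom, round(prob_sample_fixed(x, n_chrom), 3))
```

The probability shrinks as the number of sampled chromosomes grows, so a high-frequency derived allele is more likely to look fixed in the lineage with only 4 individuals.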
Thus, my second question is: is the large number of fixed derived sites normal and to be expected? Especially given the small sample size?
My last question is: I expect that some of these sites will be fixed derived in all 3 lineages, and thus end up in the top-right corner of the 2d SFS, which gets masked, while any that fall outside that corner (e.g., fixed in one lineage but not in the other) won't get masked. Is this correct?
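To make this last question concrete, here is a plain-Python sketch (not dadi itself) of the masking I am picturing for a 2d SFS of two lineages with 4 and 6 diploids. The assumption, which is exactly what I would like confirmed, is that only the two corner cells are masked: derived allele absent from both samples, and derived allele fixed in both. `corner_mask` is a hypothetical helper name.

```python
def corner_mask(n_chrom_1, n_chrom_2):
    """Boolean grid for a 2d SFS; True marks a masked cell.
    Only the absent-in-both and fixed-in-both corners are masked (my assumption)."""
    rows, cols = n_chrom_1 + 1, n_chrom_2 + 1
    mask = [[False] * cols for _ in range(rows)]
    mask[0][0] = True      # derived allele absent from both samples
    mask[-1][-1] = True    # derived allele fixed in both samples
    return mask


mask = corner_mask(8, 12)  # 4 and 6 diploids -> 8 and 12 chromosomes

print(mask[8][12])  # fixed derived in both lineages
print(mask[8][0])   # fixed derived in lineage 1, absent in lineage 2
print(mask[8][11])  # fixed in lineage 1, nearly fixed in lineage 2
```

If that picture is right, sites fixed derived in one lineage but polymorphic or absent in the other would still contribute to the fit.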
Apologies for the lengthy text and the many questions. I am new to working with SFSs, ANGSD and dadi, so some of my statements might be very off.
Cheers,
LVB