1D SFS from ANGSD with lcWGS data vs 1D SFS from easySFS with ddRAD data

Lorenzo Bertola

unread,

Jun 5, 2024, 8:19:15 AMJun 5

to dadi-user

Hello!

I am working on 3 lineages of tree-frogs, and I am interested in testing whether they experienced a period of isolation before coming into contact at 2 hybrid zones (current status).

I am using low-coverage (~4x) Whole Genome Resequencing data (NovaSeq). I am thus estimating genotype likelihoods in ANGSD and then using those likelihoods to produce the SFSs. I am somewhat limited by sample size, as I have 4, 6 and 7 diploid organisms per lineage sequenced for lcWGS. Although to my understanding that should be ~ok (mostly based on this https://peerj.com/articles/9939/ and on seeing people work with similar sample sizes.).

I did some standard filtering (Q>20, mapQ>20, min individuals 3, min depth 3, etc). I have not yet filtered for LD nor for paralogues.

I then produced 1d SFS with the resulting data.

Including only polymorphic sites the 1dSFS looks like this:

To my understanding there is an issue as the LNs population in particular has a U shape instead of a monotonically decreasing shape. The LS population also has an increase towards the right. I am working with different filters (e.g., increase mapQ, add heterozygosity filter, filter paralogues) to address this issue.

As a sanity check, I produced a 1d SFS with ddRAD data for the same populations (same exact sampling location, tested for structure and admixture and it matches between the two dataset). The ddRAD data has ~10x coverage, and includes 40 samples per lineage. It was produced by aligning to the reference genome and calling genotypes with stacks.

The resulting 1d SFSs. produced with easySFS from the filtered VCF file, look like this:

These 1d SFSs do not show a U-shape for any of the lineages. Thus I am pretty certain that the U-shape in lcWGS data is an artefact. I suspect it is the results of paralogous regions given I have working with a frog species and HiC contact maps from the assembly showed many such regions.

Thus, my first question is. Am i correct in thinking that the U-shape in the lcWGS 1d SFS is an artefact of my pipeline? (For context, the genome assembly is from an LS individual, LM is very closely related, probably split ~10,000 generations ago or less, while LNs is the most distantly related).

Furthermore, in the first figure I am showing only polymorphic sites. (i.e., masking first and last position.

If i include the last position, it looks like this:

To my understanding, the peak on the right represents fixed derived loci (this is unfolded using an outgroup to produce ancestral sequence before doing the SFS). This figure still doesn't include the '0' f.class, which would be even higher than the fixed derived loci.

To my understanding, this peak to the right is to be expected, and likely correlated with sample size (e.g., here the lineage with the smallest sample size has the most fixed derived loci).

For instance, in this tutorial their SFS including such sites looks similar to mine (Except that they have equal sample size between lineages). https://evomics.org/wp-content/uploads/2018/01/realSFS.pdf

I assume that most people do not report this sites when using ANGSD, as I see that 1d SFS stop at 1-2N, where N is the number of diploid individuals.

See for instance this question: https://groups.google.com/g/dadi-user/c/RdU6i0Tooc0

Also, in this paper they get a similar shape when including fixed derived sites in the SFS, except that their sample size is larger so the proportion of fixed derived sites is lower. https://onlinelibrary.wiley.com/doi/10.1111/mec.13351.

Thus, my second question is: is the large number of fixed derived sites normal and to be expected? Especially given the small sample size?

My last question is: I expect that some of these will be fixed and derived between all 3 lineages, and thus end in the top right of 2d SFS which gets masked, while any that fall off the diagonal won't get masked. Is this correct?

Apologies for the lengthy text, and apologies for the many questions. I am new to working with SFSs, ANGSD and dadi thus some of my statements might be very off.

Cheers,
LVB

Lorenzo Bertola

unread,

Jun 5, 2024, 11:13:37 PMJun 5

to dadi-user

Doing some testing with mapQ and maxHetFilter to see if it addresses the U-shape and if anything it makes it worst. The maxHetFilter in particular causes jaggedness in the 1d SFS when set at 0.5, and it worsens the u shape at 0.75. I am not sure what could be causing this but there's at least another example of someone experiencing this with lcWGS data (~3x) https://github.com/ANGSD/angsd/issues/156. The mapQ filter (mapping quality) gets rid of a lot of alleles present only 1, and it also worsens the U-shape. (All graphs are for the LNs population only, shown above in blue)

q refers to the mapQ filter, while h refers to the maxHetFreq filter (none = not applied, 50 = 0.5, 75 = 0.75). I also included the sfs obtained without identifying the ancestral state with the outgroup.

The below graph includes fixed derived sites

I will try with ngsParalog as mentioned in the github issue and see if it improves the sfs or not.

Katharine Prata

unread,

Jun 7, 2024, 1:13:18 AMJun 7

to dadi...@googlegroups.com

Hi Lorenzo!

I had a similar issue and I was able to resolve this by following Mikhail Matz's method he mentioned in that ANGSD issue thread by projecting down to 80% of haplotypes.

See https://groups.google.com/g/dadi-user/c/-8WRjwO4SSk how it smoothed out the SFS, mine looked similar to example 1 and 2 using the het05 filter -your third example looks a bit strange why singletons are lower than doubletons though but that was just made more extreme by the het filter.

Cheers,

Kat

--
You received this message because you are subscribed to the Google Groups "dadi-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/eda44fee-8b9f-4dcf-9677-598fd0c6d13fn%40googlegroups.com.

Lorenzo Bertola

unread,

Jun 7, 2024, 3:26:57 AMJun 7

to dadi-user

Hello Katharine,

Thanks for your reply. Re-reading your question now I see how we were indeed experiencing the same issue.

I am a bit confused about what Mikhail means in his answer, and how that is achieved in ANGSD. Is he suggesting to do projection similar to how easySFS does it to minimise missing data? If so, how is that achieved in ANGSD?

Thank you for chiming in and helping by the way!

Cheers,
L

Ryan Gutenkunst

unread,

Jun 7, 2024, 1:42:54 PMJun 7

to dadi-user

Hello Lorenzo,

Regarding the U-shape, since these are unfolded spectra I would not expect paralogs to generate it. Typically, I think paralogs generate a peak near 50%, when fixed differences between duplicated regions are inferred to actually be heterozygous in a single region due to mapping errors.

A large number of fixed derived sites is expected if your three populations are pretty diverged from each other. If a variant is fixed in one population but not the others, it will be on the “edge” of the 2D SFS and in the final entry of the 1D SFS.

Best,
Ryan

> On Jun 4, 2024, at 10:43 PM, Lorenzo Bertola <lorenz...@gmail.com> wrote:
>
> Hello!
>
> I am working on 3 lineages of tree-frogs, and I am interested in testing whether they experienced a period of isolation before coming into contact at 2 hybrid zones (current status).
>
> I am using low-coverage (~4x) Whole Genome Resequencing data (NovaSeq). I am thus estimating genotype likelihoods in ANGSD and then using those likelihoods to produce the SFSs. I am somewhat limited by sample size, as I have 4, 6 and 7 diploid organisms per lineage sequenced for lcWGS. Although to my understanding that should be ~ok (mostly based on this https://peerj.com/articles/9939/ and on seeing people work with similar sample sizes.).
>
> I did some standard filtering (Q>20, mapQ>20, min individuals 3, min depth 3, etc). I have not yet filtered for LD nor for paralogues.
>
> I then produced 1d SFS with the resulting data.
>
> Including only polymorphic sites the 1dSFS looks like this:
>

> <8_1_1dSFS.png>
>
> To my understanding there is an issue as the LNs population in particular has a U shape instead of a monotonically decreasing shape. The LS population also has an increase towards the right. I am working with different filters (e.g., increase mapQ, add heterozygosity filter, filter paralogues) to address this issue.
>
> As a sanity check, I produced a 1d SFS with ddRAD data for the same populations (same exact sampling location, tested for structure and admixture and it matches between the two dataset). The ddRAD data has ~10x coverage, and includes 40 samples per lineage. It was produced by aligning to the reference genome and calling genotypes with stacks.
>
> The resulting 1d SFSs. produced with easySFS from the filtered VCF file, look like this:
>

> <12_4_sfs_byLineage.png>
>
> These 1d SFSs do not show a U-shape for any of the lineages. Thus I am pretty certain that the U-shape in lcWGS data is an artefact. I suspect it is the results of paralogous regions given I have working with a frog species and HiC contact maps from the assembly showed many such regions.
>
> Thus, my first question is. Am i correct in thinking that the U-shape in the lcWGS 1d SFS is an artefact of my pipeline? (For context, the genome assembly is from an LS individual, LM is very closely related, probably split ~10,000 generations ago or less, while LNs is the most distantly related).
>
> Furthermore, in the first figure I am showing only polymorphic sites. (i.e., masking first and last position.
>
> If i include the last position, it looks like this:
>

> <8_1_1dSFS_withFixed.png>
>
> To my understanding, the peak on the right represents fixed derived loci (this is unfolded using an outgroup to produce ancestral sequence before doing the SFS). This figure still doesn't include the '0' f.class, which would be even higher than the fixed derived loci.
>
> To my understanding, this peak to the right is to be expected, and likely correlated with sample size (e.g., here the lineage with the smallest sample size has the most fixed derived loci).
>
> For instance, in this tutorial their SFS including such sites looks similar to mine (Except that they have equal sample size between lineages). https://evomics.org/wp-content/uploads/2018/01/realSFS.pdf
>
> I assume that most people do not report this sites when using ANGSD, as I see that 1d SFS stop at 1-2N, where N is the number of diploid individuals.
>
> See for instance this question: https://groups.google.com/g/dadi-user/c/RdU6i0Tooc0
>
> Also, in this paper they get a similar shape when including fixed derived sites in the SFS, except that their sample size is larger so the proportion of fixed derived sites is lower. https://onlinelibrary.wiley.com/doi/10.1111/mec.13351.
>
> Thus, my second question is: is the large number of fixed derived sites normal and to be expected? Especially given the small sample size?
>
> My last question is: I expect that some of these will be fixed and derived between all 3 lineages, and thus end in the top right of 2d SFS which gets masked, while any that fall off the diagonal won't get masked. Is this correct?
>
> Apologies for the lengthy text, and apologies for the many questions. I am new to working with SFSs, ANGSD and dadi thus some of my statements might be very off.
>
> Cheers,
> LVB
>
>

> --
> You received this message because you are subscribed to the Google Groups "dadi-user" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.

> To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/cf4fe9b8-dce3-4b35-8c97-3d0ca5779fa1n%40googlegroups.com.
> <8_1_1dSFS.png><8_1_1dSFS_withFixed.png><12_4_sfs_byLineage.png>

Lorenzo Bertola

unread,

Jun 7, 2024, 9:52:57 PMJun 7

to dadi-user

Hello Ryan,

Thank you very much for your input! That would make a lot of sense, as the U shape is observed only in one of the three lineages, and particularly in the lineage that is most diverged from the reference assembly.

I will check a couple more things but everything else in the data seemed alright in terms of quality, coverage, heterozygosity, differentiation, etc.

Cheers,
Lorenzo

Reply all

Reply to author

Forward