calcDxy() function

33 views
Skip to first unread message

Vítor Sudbrack

unread,
May 28, 2025, 8:31:35 AMMay 28
to slim-discuss
Hi all!

I write to share a suggestion of a new function and also to collect feedback on how to improve its performance. 

I've been working on a function to calculate d_{xy}, as in eq. (10.20) of Nei (1987) Molecular Evolutionary Genetics. This is equal to the expected number of differences between two sequences (haplosomes in SLiM) sampled one from population X and another from population Y. It reads as
d_{xy} = \sum_{ij}x_i y_j d_{ij}
where x_i is the frequency of the i-th sequence in population X (here, one over pop size, 1/N_x), y_j is the frequency of the j-th sequence in population Y (again, 1/N_y) and d_{ij} is the number of differences between sequences i and j. As the equation shows, this function performs (N_x)*(N_y)*(#positions) comparisons, and thus it needs some optimization to reach a reasonable computing time. 
Below, I mention some of its relation with FST and nucleotide diversity (Pi), both examplified in the code I attach below, and more details of its suggested implementation and a current flaw on its calculation at the low mutation regime. 

Relation to Fst
It sounds a bit redundant to FST, that is, genetic differentiation between populations X and Y, but it isn't. I think Cruickshank & Hahn (2014) explains well the difference. In a few words, FST measures how much more differentiated two populations are compared to the total genetic variance present (between+within populations). FST is thus a relative measure of genetic differentiation. d_{xy}, meanwhile, only looks at differences between populations, disregarding any variation within populations. It is an absolute measure of differentiation. SLiM actually provides a great platform to exemplify it, I'm attaching the code at end of this message. 
This code basically simulates the neutral model from Recipe 1 "in steroids"... higher mutation rate and shorter genome to speed up our point. Two isolated subpopulations gain neutral mutations and differentiate as time passes. The code prints "(1,2): FST | Dxy" for each generation. As time passes, we find that Dxy increases up to 100 (the genome size). This means that if we sample a sequence from p1 and another sequence for p2, on average, we expect 100 different mutations -- that is, a different mutation in each position. FST, on the other hand, increases but plateaus, around 0.7 in my simulations, despite no gene flow between populations. This is because the (high) mutation rate introduces new mutations to each population, and only 70% of the genetic variation is distributed between populations (the other 30% of genetic variation is within populations). Dxy is blind to such variation within the subpopulations. 

I think this example also clarifies why this distinction is only important in multi-allelic sites: because it requires variation within subpopulation. If p1 fixes for A and p2 fixes for a (absence of A), then the variation within population is zero, and FST=Dxy. It is needed that at least either p1 contains A1, A2,... or p2 contains a1, a2,... at the same site in order to matter these different metrics. In the function I drafted I tried to create an error message when the stack policy is not first or last to avoid this misuse (perhaps an warning could have been enough). 

Relation to Pi
d_{xy} also sounds related to the function of genetic diversity within populations, calcPi(), but in fact calcPi() is a specific case of calcDxy(). The calcPi(p1.haplosomes) is in fact 2*calcDxy(p1.haplosomes, p1.haplosomes). The factor 2 here is because you have twice as many ways of generation a pair of sequences when they come from the same subpopulation -- that is, you can sample (xi, xj) or (xj, xi), it is the same pair/comparison -- than when they come from two distinct populations, when there is only one way you can sample a pair (xi, yj) -- since the sequence xi cannot be sampled from population Y and vice-versa. In the same example code I give below, I also compare these two metrics and they are quite close.  

Current flaw in the implementation
So currently, the function goes through all mutations (which, again, cannot overlap), and calculates if the genome contains or it not (0 or 1). d_{ij} will increase one unit if this presence variable is 1x0 or 0x1 for each pair of genomes from population X and Y. Therefore, it is simply the sum of how many pairs are (present*absent)+(absent*present) for a given mutation. 
After, I have to divide the overall sum by 2 because we summed a difference in "A x B" when we looked at presence of A, and later at the presence of B. So heterozygous mutants were counted twice. 
This implementation (which is of course very similar to the implementation of calcPi) massively simplifies the pairwise comparisons one would normally have to do.  However, it does have one flaw: when one of the haplosomes is "empty" (or "wildtype") at a given position, it is only summed once - that is, "A x _" is only counted as a difference when looking at the presence of A, but the "empty" genome is never counted. This problem decreases with time as less and less "empty" sites in the genome persist, but still for realistic mutations rates it can be important. 

So I was wondering if someone had already thought about it, and if there is any work around it that does not involve a for-loop across all pairs. 

All feedback is already appreciated :) Thanks!
Vitor

****

// Defining a function calcDxy and testing its relation with calcFST and calcPi

initialize() {

initializeMutationRate(1e-4);

initializeMutationType("m1", 0.5, "f", 0.0); //neutral

initializeGenomicElementType("g1", m1, 1.0);

m1.mutationStackPolicy = 'l'; // this is important for Dxy to be relevant

defineGlobal("L", 100); // genome size

initializeGenomicElement(g1, 0, L-1);

initializeRecombinationRate(1e-8);

}


1 late() { // create two isolated populations of 500 individuals

sim.addSubpop("p1", 500); sim.addSubpop("p2", 500);

}


2: late() {// outputting calculating at every generation

catn("Gen. "+sim.cycle+": -------");

// comparing FST with Dxy

fst = calcFST(p1.haplosomes, p2.haplosomes);

dxy = calcDxy(p1.haplosomes, p2.haplosomes);

catn("(1,2):"+fst+"|"+dxy);

// comparing calcPi with calcDxy for each subpopulation

pi = calcPi(p1.haplosomes);

dxy = calcDxy(p1.haplosomes, p1.haplosomes);

catn("( 1 ):"+pi+"|"+2*dxy/L); // dxy divided by chr size

pi = calcPi(p2.haplosomes);

dxy = calcDxy(p2.haplosomes, p2.haplosomes);

catn("( 2 ):"+pi+"|"+2*dxy/L); // dxy divided by chr size

}


2000000 late() { sim.simulationFinished(); }


// Function calcDxy

function (float$)calcDxy(object<Haplosome> haplosomes1, object<Haplosome> haplosomes2, [No<Mutation> muts = NULL], [Ni$ start = NULL], [Ni$ end = NULL]){


n1 = haplosomes1.length();

n2 = haplosomes2.length();

if (n1 == 0 | n2 == 0)

stop("ERROR (calcDxy): haplosomes1 and haplosomes2 must both be non-empty.");

species = haplosomes1[0].individual.subpopulation.species;

if (community.allSpecies.length() > 1)

{

if (any(c(haplosomes1, haplosomes2).individual.subpopulation.species != species))

stop("ERROR (calcDxy): all haplosomes must belong to the same species.");

if (!isNULL(muts))

if (any(muts.mutationType.species != species))

stop("ERROR (calcDxy): all mutations must belong to the same species as the haplosomes.");

}

chromosome = haplosomes1[0].chromosome;

if (species.chromosomes.length() > 1)

{

if (any(c(haplosomes1, haplosomes2).chromosome != chromosome))

stop("ERROR (calcDxy): all haplosomes must be associated with the same chromosome.");

if (!isNULL(muts))

if (any(muts.chromosome != chromosome))

stop("ERROR (calcDxy): all mutations must be associated with the same chromosome as the haplosomes.");

}

length = chromosome.length;

if (isNULL(muts))

muts = species.subsetMutations(chromosome=chromosome);

// handle windowing

if (!isNULL(start) & !isNULL(end))

{

if (start > end)

stop("ERROR (calcDxy): start must be less than or equal to end.");

if ((start < 0) | (end >= length))

stop("ERROR (calcDxy): start and end must be within the bounds of the focal chromosome");

length = end - start + 1;

mpos = muts.position;

muts = muts[(mpos >= start) & (mpos <= end)];

}

else if (isNULL(start) & isNULL(end))

{

start = 0;

end = chromosome.length - 1;

}

else if (!isNULL(start) | !isNULL(end))

{

stop("ERROR (calcDxy): start and end must both be NULL or both be non-NULL.");

}

// narrow down to the mutations that are actually present in the haplosomes and aren't fixed

p = c(haplosomes1, haplosomes2).mutationFrequenciesInHaplosomes(muts);

muts = muts[(p != 0.0) & (p != 1.0)];

if (size(muts) == 0) // if there are no mutations segregating in this genomic window, Dxy vanishes

return 0.0;

// check if two mutations cannot be in the same position of the same haplotype

stackGroup = muts[0].mutationType.mutationStackGroup;

if(any(muts.mutationType.mutationStackGroup != stackGroup) | muts[0].mutationType.mutationStackPolicy=='s')

stop("ERROR (calcDxy): all mutations must be from the same stack group with either last or first stackPolicy. If mutations can stack, one should probably use calcFST()");


// do the calculation for each mutation

dos1 = haplosomes1.mutationCountsInHaplosomes(muts);

dos2 = haplosomes2.mutationCountsInHaplosomes(muts);

// number of differences is: 1x0 + 0x1

diffs_per_site = dos1 * (n2 - dos2) + (n1 - dos1) * dos2;

diff = sum(diffs_per_site);

return diff / 2.0 / n1 / n2; //it may or may not be divided by "length"... original definition by Nei does not divide, to be consistent with other functions in SLiM it could be normalized

}

****

Ben Haller

unread,
May 28, 2025, 1:13:46 PMMay 28
to Vítor Sudbrack, slim-discuss
Hi Vitor!  Hmm, I wrote a reply to this, and Google Groups seems to have eaten it, so I'll write it again.  :-<

Thanks for this; it looks like it'd be a useful addition to SLiM's popgen utilities.  I think it would be better as a GitHub issue, though; can you please make a new issue with all of this?

Let's move the discussion over to that issue.  A couple hundred people subscribe to this list, so we don't want to spam everybody with that discussion.  :->  If you're reading this message and you're interested in being in the loop on this, please either watch for the new issue that Vitor will probably create soon, or email Vitor off-list to get a link to it if you can't find it.  Let's not continue this thread here, thanks.

Thanks Vitor, I really appreciate contributions like this!  :->  I'll have to put on my thinking cap to try to figure it all out, though, since this kind of thing is not my area of expertise!

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


Vítor Sudbrack wrote on 5/28/25 8:31 AM:
--
SLiM forward genetic simulation: http://messerlab.org/slim/
---
You received this message because you are subscribed to the Google Groups "slim-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to slim-discuss...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/slim-discuss/00f0b638-cdf9-4b23-ad1c-1b028101c33en%40googlegroups.com.

Reply all
Reply to author
Forward
0 new messages