calcFST() in SLiM 5.0 multichromosome model

52 views
Skip to first unread message

Kamolphat Atsawawaranunt

unread,
Jun 11, 2025, 11:54:35 PMJun 11
to slim-discuss
Hi Ben,

I have been trying to calculate population pairwise FST between two sub-populations with the multi-chromosome model and I have been getting the following error:

ERROR (calcFST): all haplosomes must be associated with the same chromosome.


Is this the desired behaviour of this function in a multi-chromosome model? Are there ways  around this error?


An example code to reproduce this error:


initialize() {

defineConstant("chrom_len", 1e5); // Chromosome length

defineConstant("MU_base", 1e-7); // Mutation rate

defineConstant("R", 1e-8); // Recombination rate

initializeMutationType("m1", 0.5, "f", 0.0);

initializeGenomicElementType("g1", m1, 1.0);

for (i in seq(1, 3)){

initializeChromosome(i, chrom_len);

initializeGenomicElement(g1, 0, chrom_len-1);

initializeMutationRate(MU_base);

initializeRecombinationRate(R);

}

}


1 early() {

sim.addSubpop("p1", 1000);

sim.addSubpop("p2", 1000);

p1.setMigrationRates(p2, 0.001);

p2.setMigrationRates(p1, 0.001);

}


5 early() {

calcFST(p1.haplosomes, p2.haplosomes);

}


10 late() { }


Yours sincerely,

A


Peter Ralph

unread,
Jun 12, 2025, 1:10:51 AMJun 12
to Kamolphat Atsawawaranunt, slim-discuss
One way around the error is to use haplosomesForChromosomes instead of haplosomes, like

calcFST(p1.haplosomesForChromosomes(1), p2.haplosomesForChromosomes(1));

That would only compute Fst on chromosome 1; so probably you're asking if there's a way to computer Fst across all chromosomes all at once. The answer, I think, is 'no'; and for most use cases, a chromosome is plenty long (just average the values across chromosomes if you want a single value)?

--peter

From: slim-d...@googlegroups.com <slim-d...@googlegroups.com> on behalf of Kamolphat Atsawawaranunt <a.kam...@gmail.com>
Sent: Wednesday, June 11, 2025 8:54 PM
To: slim-discuss <slim-d...@googlegroups.com>
Subject: calcFST() in SLiM 5.0 multichromosome model
 
--
SLiM forward genetic simulation: http://messerlab.org/slim/
---
You received this message because you are subscribed to the Google Groups "slim-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to slim-discuss...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/slim-discuss/2850a60b-c75e-4fbd-b9d6-070f6b916d82n%40googlegroups.com.

Gregor Gorjanc

unread,
Jun 12, 2025, 1:29:50 AMJun 12
to slim-discuss
Hi,

I can not comment on haplosomes, but discussion on the calculation of whole-genome statistics across multiple chromosomes applies here as recently discussed on tskit GitHub https://github.com/tskit-dev/tskit/issues/176#issuecomment-2959692679

gg

Kamolphat Atsawawaranunt

unread,
Jun 12, 2025, 2:58:20 AMJun 12
to slim-discuss
Hi Peter and Gregor Gorjanc,

Thank you so much for your replies, and thank you for pointing to the tskit GitHub thread on the whole-genome statistics. Gregor Gorjanc, am I understanding correctly from your message on the tskit GitHub page that the whole-genome ratio statistics (like the population pairwise-FST as often used in population genetic studies) should not be calculated using weighted averages of the different chromosomes?

Is there a way to "force" overwrite the different "chromosomes" in the haplosome objects (i.e., in pop.haplosomes.chromosome?) before calculating FST?

Failing that, is there also a way to calculate genome-wide HS (average heterozygosity in the two subpopulations), and HT (total heterozygosity when both subpopulations are combined) and use that to calculate population pairwise FST?

Yours sincerely,
A

Gregor Gorjanc

unread,
Jun 12, 2025, 4:06:49 AMJun 12
to slim-discuss
Weighted average works for non-ratio stats. FST is a ratio. You could first calculate whole-genome nominator and denominator values and then take their ratio. I don't know how critical this is however!

Ben Haller

unread,
Jun 12, 2025, 10:43:53 AMJun 12
to Kamolphat Atsawawaranunt, slim-discuss
Hi A!

You very much do NOT want to try to change the chromosome property of haplosomes, and no, there there is no way to do that, nor should there be.  :->  Calculating something like FST between haplosomes from chromosome 1 and haplosomes from chromosome 2 would be worse than meaningless.  This is precisely why the functions like calcFST() only allow haplosomes from a single chromosome to be passed in; that is a safety measure.

Here is some fairly simple code that (I think) calculates FST across a full genome using a weighted average, which I think is what Peter and Gregor have been suggesting?

allFSTs = sapply(sim.chromosomes, "calcFST(p1.haplosomesForChromosomes(1, includeNulls=F), p2.haplosomesForChromosomes(1, includeNulls=F));");

weights = sim.chromosomes.length / sum(sim.chromosomes.length);

print(sum(allFSTs * weights));


But I'm not sure it is so simple in the general case.  For example, if one of your chromosomes is a Y chromosome in a sexual model, there will be half as many Y haplosomes as there are haplosomes for a given diploid autosome (assuming a 50-50 sex ratio; if the sex ratio is not 50-50, then it's even a bit more complicated!).  It is not clear to me whether one would want to weigh the FST of the Y chromosome equally with the FST of a diploid autosome of the same length, or whether one would want to discount it.  Moreover, depending upon the question being asked, it is not even obvious to me that one wants to weigh the FSTs by their length at all; perhaps there are questions where a simple average of allFSTs above would be what you would want.  These sorts of considerations are the reason why we decided not to have SLiM provide built-in support for whole-genome calculations of metrics like FST; it is not difficult to do it yourself, as above, but first you have to decide what exactly it is that you want to calculate, and we decided that it was best for SLiM not to assume that, even if there is a common case that would cover most users.

And as they have noted, often – especially for neutral models – just taking the metric from the longest chromosome would likely be quite close to this weighted average, so that might be simpler and essentially equivalent.  But if strong selection is acting, it is less clear that that would be true, of course; in that case, you might want to use information from all of the chromosomes, or so it seems to me.  But in that case, combining them together into a single metric might not be what you want to do anyhow – if FST is very different for different chromosomes, that's an interesting fact that tells you something very important about what's going on in the model!  So you might want to keep those allFSTs values separate, in fact, and NOT do any sort of average; each one might be an interesting piece of data in its own right.  Just some thoughts.  :->

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University



Kamolphat Atsawawaranunt wrote on 6/12/25 2:58 AM:

Kamolphat Atsawawaranunt

unread,
Jun 12, 2025, 7:36:43 PMJun 12
to Ben Haller, slim-discuss
Hi Ben,

Thank you so much for your reply, and your explanations and suggestions. I think what Gregor suggested was that weighted average of FST across chromosomes does not equate to the genome-wide calculation of FST as FST is a ratio statistic. In any case, I will try to work with the options available.

What I was trying to do in SLiM was to alter gene flow to yield similar genome-wide population pairwise FST to the observed statistics in my species, essentially creating populations with the same population structure as the observation. Alternatively, I can estimate the geneflow first (e.g., through demographic inference softwares like fastsimcoal2) before I run SLiM, but as the species has some specific biology that does not necessarily follow Wright-Fisher (e.g., monogamy, potentially selection), I was hoping to do it in SLiM to take that into account (and also nice to play around in SLiM).

I was doing this by altering gene flow in SLiM depending on how the population pairwise FST compares to the observed genome-wide population pairwise FST (e.g., increase gene flow if simulated population pairwise FST is significantly higher than the observed population pairwise FST).

What I think I have observed with respect to this (in SLiM 4.3) is that, population pairwise FST calculated from multiple linkage groups seem much more stable, and therefore, is easier to find the optimal value. Due to that, I am reluctant to use a value from just one chromosome.

Yours sincerely, 
A

Ben Haller

unread,
Jun 13, 2025, 8:28:27 AMJun 13
to Kamolphat Atsawawaranunt, slim-discuss
Hi A!
> Thank you so much for your reply, and your explanations and
> suggestions. I think what Gregor suggested was that weighted average
> of FST across chromosomes does not equate to the genome-wide
> calculation of FST as FST is a ratio statistic. In any case, I will
> try to work with the options available.

Ah, yes, I missed that, thanks!  So the code snippet I posted is indeed
incorrect, I guess – for those reading this thread, watch out!

You can see the Eidos source code for calcFST() by doing:

functionSource("calcFST")

in the Eidos console.  So if you want to get into it, you could
copy/paste calcFST()'s code to get mean(H_s) and mean(H_t) separately,
for each chromosome, add them together to get the whole-genome numerator
and denominator, and then calculate the final FST as 1 - H_s/H_t as
SLiM's code does, but with the whole-genome values.  But Gregor did also
write "I don't know how critical this is however!"  I'm not math-y
enough to have a sense of that either – whether this is a tiny detail,
or whether it might actually make an important difference.  Perhaps
someone more math-y like Peter could comment on that.  :->  If it does
potentially make an important difference, then I suppose there is a bug
here – SLiM ought to provide a way to calculate the correct statistic
across multiple chromosomes.  (And likewise, then, for other popgen
metrics that SLiM offers, I suppose.)

> What I was trying to do in SLiM was to alter gene flow to yield
> similar genome-wide population pairwise FST to the observed statistics
> in my species, essentially creating populations with the same
> population structure as the observation. Alternatively, I can estimate
> the geneflow first (e.g., through demographic inference softwares like
> fastsimcoal2) before I run SLiM, but as the species has some specific
> biology that does not necessarily follow Wright-Fisher (e.g.,
> monogamy, potentially selection), I was hoping to do it in SLiM to
> take that into account (and also nice to play around in SLiM).

Yes, this makes sense to me as an approach.

> I was doing this by altering gene flow in SLiM depending on how the
> population pairwise FST compares to the observed genome-wide
> population pairwise FST (e.g., increase gene flow if simulated
> population pairwise FST is significantly higher than the observed
> population pairwise FST).
>
> What I think I have observed with respect to this (in SLiM 4.3) is
> that, population pairwise FST calculated from multiple linkage groups
> seem much more stable, and therefore, is easier to find the optimal
> value. Due to that, I am reluctant to use a value from just one
> chromosome.

OK, fair enough.  Are you modeling full-length chromosomes for your
species of interest, or just subsets of the full chromosomes?  And are
your population sizes very small?  Probably FST will become more "jumpy"
the shorter the chromosomes get, the smaller the population sizes, and
the smaller the migration rates, I would imagine...?  Probably also for
a low mutation rate (less mutational data for it to be based upon), and
perhaps other factors.  So yes, I can see needing to average across
multiple chromosomes in some cases.

It sounds like you have an empirical genome-wide FST that you're
comparing against?  In that case, I'd recommend calculating FST in SLiM
in the same way that that empirical estimate was calculated, so that
you're truly matching the empirical value.  I wonder whether they
followed the procedure recommended by Gregor, because FST is a ratio, or
whether they just did a weighted average or even an unweighted average? 
Do you know?  I'd be curious to know what people do in "the real world",
i.e., with empirical data.

Cheers,
-B.

Peter Ralph

unread,
Jun 13, 2025, 12:02:54 PMJun 13
to Kamolphat Atsawawaranunt, Ben Haller, slim-discuss
I've been asked for input, so: I agree - Fst is a ratio; averages of ratios are not ratios of averages; and so calculating Fst with all chromosomes squished together (were that possible) would not be the same thing as averaging the per-chromosome Fst values.

However, SLiM is calculating Fst as 1 - H_s / H_t, and both H_s and H_t are averages along the chromosome. They're averaging something without high variance, and chromosomes are generally long, so I would expect the Fst calculated on separate recombining chromosomes without selection to be very close. And if Fst on all the chromosomes are close to each other, then there's going to be almost no difference between (Fst on all the chromosomes at once) and (average Fst across the chromosomes).

More generally, average-Fst-by-chromosome - perhaps weighted by length - seems like a perfectly fine measure of Fst, even if it slightly differs from the Fst you'd get by squishing the chromosomes together. Of course, if some chromosomes are short or nonrecombining then Fst values will be more variable across replicates, so depending on what you want to do with the Fst values this may or may not be important.

One note, however: it would make sense to do:

 calcFST(p1.haplosomes, p2.haplosomes);

because there are the same set of chromosomes in both arguments to calcFST(). It would not make sense to compute Fst between (a set of haplosomes from chr1) and (a set of haplosomes from chr2); but I see no reason that shouldn't work. I don't think it's terribly important, but I guess it is a natural thing to do.

-peter



From: 'Ben Haller' via slim-discuss <slim-d...@googlegroups.com>
Sent: Friday, June 13, 2025 5:28 AM
To: Kamolphat Atsawawaranunt <a.kam...@gmail.com>
Cc: slim-discuss <slim-d...@googlegroups.com>
Subject: Re: calcFST() in SLiM 5.0 multichromosome model
 
---
You received this message because you are subscribed to the Google Groups "slim-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to slim-discuss...@googlegroups.com.

Ben Haller

unread,
Jun 13, 2025, 12:28:35 PMJun 13
to Peter Ralph, Kamolphat Atsawawaranunt, slim-discuss
Thanks Peter!  I suspected something like that, but it's helpful to have it stated in more mathematically rigorous terms.  :->

I see you've opened a new issue, https://github.com/MesserLab/SLiM/issues/529, with a suggested implementation that supports multiple chromosomes.  That's a nice extension; I'll give it some testing, and probably it'll appear in the next release of SLiM, whenever that happens.  In the meantime, anyone who wants a multi-chromosome calcFST() could try using the implementation that Peter has given there.  :->  I guess I ought to think about similar extensions for the other pop-gen functions provided by SLiM.  :-O


Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


Peter Ralph wrote on 6/13/25 12:02 PM:
SLiM forward genetic simulation: http://messerlab.org/slim/

---
You received this message because you are subscribed to the Google Groups "slim-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to slim-discuss...@googlegroups.com.

Kamolphat Atsawawaranunt

unread,
Jun 15, 2025, 8:04:33 PMJun 15
to slim-discuss
Hi Ben and Peter,

Thank you so much for the replies, and the discussions surrounding this, and also the open issue on the github page. 

I agree that the average of the different chromosomes are probably going to be quite similar to the genome-wide average, and I think I will probably go down that path but I might give Peter's implementation a quick go first. For my purpose, it probably does not matter much as I only need it in the same ball park.

I am modelling full-length chromosomes (hoping to use this to help evaluate whether or not I can use reduced representation data for the study), and my population size is 1000 per population, and eight populations in total. I am not modelling many other aspects of the genome architecture (e.g., specific genes, varying recombination and mutation rates, etc.) and I believe that there is stochasticity in the empirical FST calculations (e.g., sampling biases) that the simulated values just needs to be in the same ball park.

In the empirical population genomic studies, I believe that, unless the study is involving evolution of certain genes or chromosomes (e.g., sex chromosome evolution), the population pairwise-FST is often done as the genome-wide average. However, there is a whole different rabbit hole to go down there. There are multiple different calculations of population pairwise-FST out there and also different ways in which they deal with missingness which will interfere with the comparability. I believe that SLiM already followed the more-or-less latest suggestions (Bhatia et al., 2013, based on Hudson et al., 1992), but the Weir and Cockerham's FST (Weir & Cockerham, 1984) is still very popular and there are many other options. I think an option for SLiM to calculate Weir and Cockerham's FST would be welcomed, but perhaps people (including myself) should just move towards the newer FST-calculations based on Hudson's FST anyways. These different calculations are quite similar when they are under ideal conditions, but Weir & Cockerham's FST had some biases with regards to unequal sample sizes.

Yours sincerely,
A

Ben Haller

unread,
Jun 16, 2025, 7:16:40 AMJun 16
to Kamolphat Atsawawaranunt, slim-discuss
Hi A,

Yes, as you say there are many different ways to calculate FST, and other pop-gen statistics as well.  Our goal isn't to provide implementations for all of them, just one reasonable method that can be used by most people.  If you want to write your own implementation, for a different method, you should certainly be able to do that in Eidos, and the existing calcFST() implementation, which you can view with functionSource(), might provide some guidance in how to go about it.

I do think I will extend calcFST() to support multiple chromosomes since that lack of that support can be regarded as a bug, though; see the issue Pter opened (linked previously) for that.  :->

Good luck and happy modeling!


Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


Kamolphat Atsawawaranunt wrote on 6/15/25 8:04 PM:
Reply all
Reply to author
Forward
0 new messages