Optimizing subsetting individual fitness from amino acid matrix

13 views
Skip to first unread message

Nick Bailey

unread,
Oct 28, 2025, 7:00:11 AM (3 days ago) Oct 28
to slim-discuss
Hi all,

SLiM 5.1 has nice matrix options now (thanks Ben and Vitor!) so I've been reworking a convoluted fitness simulation I mentioned in this thread (https://groups.google.com/g/slim-discuss/c/ahApxZLF7PA/m/Ts6m01I4AwAJ). I've got the following code right now in the initialization step

// Read in matrix of Amino Acid fitnesses and then format as matrix in SLiM

// The text file must be formatted with a header of Amino Acids where the left to right order matches SLiM amino acid integer IDs from 0-20

// In single-letter AA IDs these are "X A R N D C Q E G H I L K M F P S T W Y V" where X is stop codon

// Rows then should correspond to integer amino acid positions in the given sequence, of course matching intended sequence length

// Cells are filled with selection coefficients for a given amino acid at a given position in sequence

defineConstant('AAFile', readFile(paste(FIT, '.fl', sep = '')));

defineGlobal('AAMatrix', sapply(AAFile, "strsplit(applyValue, sep = ',');", simplify = "matrix"));

// make all stop codons have fitness of 0

AAMatrix[2, 1:L] = 0;

// define fitness matrix

defineConstant('AAFitnesses', asFloat(AAMatrix[2:22, 1:L]));


// Take the mean fitness of all cells in the matrix except stop codons, which have 0

meanFitness = mean(AAFitnesses);


// take the natural log of this value and multiply by the length of sequence to generate a normalizing factor for fitness

defineGlobal('scaleFitness', log(meanFitness) * L);


// define log of all fitnesses and the range of all sites along the genome

defineConstant('logFitnesses', log(AAFitnesses));

defineConstant('genRange', 0:(L - 1));



Then I have a fitness effect like this

fitnessEffect() {

// Loop through all genomes and define nucleotide, codon and amino acid sequences for comparison to previously defined constants

// Define amino acid sequence for both haplotypes of the given individual


AAs1 = codonsToAminoAcids(individual.haploidGenome1.nucleotides(format = 'codon'), long = 0);

AAs2 = codonsToAminoAcids(individual.haploidGenome2.nucleotides(format = 'codon'), long = 0);


return exp(sum(logFitnesses[cbind(AAs1, genRange)] + logFitnesses[cbind(AAs2, genRange)]) - scaleFitness);

}


This is much cleaner overall than the mess I made in my previous implementation, which I showed in one of my comments in the above thread. However, it does run a good bit slower. Doing a simulation of a kind of near-neutral amino acid landscape, 100 individuals up to 1000 generations with rec rate 1e-8, the current approach has it so the fitness calculation can consistently take up about 84% of the sim time (using the runtime test option in SLiMGui) whereas my former messy approach was about 75%. If recombination rate is raised (as we do in the study I'm using this for) that becomes the limiting factor really, so this isn't the worst thing but the fitness could still make a difference of a few hours in practice given some other params I'm interested in. I figure that a key part is doing as little within the fitnessEffect as possible, and I've tried to get everything in initialization when possible, so I figure there must be a replacement for the two cbinds at the end but I'm not sure. Or is there a better way to get the amino acids out of individual haploid genomes? The log scaling stuff is necessary because individuals can have very low fitnesses with the multiplicative scheme I've implemented, which seems pretty well-grounded in theoretical pop gen stuff. I'll probably stick with this approach over my last one anyways for better user-readability even if I can't cut down on time, but figured it's worth asking now before I go down a rabbit-hole like that one!


Cheers,

Nick








Ben Haller

unread,
Oct 28, 2025, 10:26:19 AM (3 days ago) Oct 28
to Nick Bailey, slim-discuss
Hi Nick!

I'm trying to puzzle through what your code is doing, and I'm somewhat failing.  :->  In future, it would be helpful for you to supply several things:

- an example input file so that I can actually read that file in and follow through the steps that you're performing

- a working minimal model with various parameter values defined so that it runs under SLiMgui

- a profile taken, in SLiMgui or at the command line, showing exactly where the hotspots are

As far as where my points of confusion are, here are some:

- your comment says that "rows then should correspond to integer amino acid positions in the given sequence", but when you "make all stop codons have a fitness of 0", you do that for row 2, not column 2; so is the comment wrong, or is the code wrong, or am I confused?

- at the same point, "make all stop codons have a fitness of 0", I'm puzzled as to why 2 is the magic number, since it sounds like your file has a single header row; stop codons, being index 0, would then be row (or column) 1 in the resulting matrix, not row (or column) 2, no?  remember that Eidos uses zero-based indices whereas R uses 1-based indices; is there an error here of that sort?

- where your comment says "Take the mean fitness of all cells in the matrix except stop codons, which have 0", I do not see that you are excluding stop codons from the mean; where is that being done, or am I misunderstanding something?

- the name "genRange" confused me for a little while because I took "gen" to be "generation", and thought you were changing selection pressures over time, but it eventually dawned on me that this is the range of genome positions :->; and also might I suggest following the suggested convention that defined constants and globals use uppercase?  so, "GENOME_RANGE" would be much clearer, to me.

The reason I'd like to see a profile is mostly to know how much of the time is spent getting AAs1 and AAs2 versus doing the final fitness calculation.  I'm wondering a bit whether providing a "format='aa'" option to nucleotides() would make a difference, to avoid the extra call to codonsToAminoAcids(), or if that's not the bottleneck anyhow.

But, modulo the above, I think I understand the code well enough to have some suggestions for how to speed it up without uglifying it too much.  :->  The main thing is vectorization:

- It seems to me that rather than getting individual.haploidGenome1 and individual.haploidGenome2 and processing them separately, you could do individual.haplosomes.nucleotides(format = 'codon') to get a single vector containing the codons for both haplosomes, and then do a single call to codonsToAminoAcids() to convert to codons, and then a single subset logFitnesses[cbind(AAs, GENOME_RANGE_REP)], where GENOME_RANGE_REP is the genome range repeated twice.  That avoids some repeated steps; I'm not sure whether it will make much difference, but maybe.

- Second, and probably more importantly, get rid of the fitnessEffect() callback altogether and use fitnessScaling instead.  Use sapply() in an event to loop through all individuals, and pass something like sum(logFitnesses[cbind(codonsToAminoAcids(applyValue.haplosomes.nucleotides(format='codon')), GENOME_RANGE_REP)]) to sapply() get a single value back per individual.  Then do individuals.fitnessScaling = exp(x - scaleFitness) just once to apply the calculated effects across the whole vector of individuals.  At present you are doing the "x - scaleFitness" and "exp(x)" calculations on singleton values, which is far less efficient than doing it on the whole vector of values; and there is a fair bit of overhead involved in calling out to a fitnessEffect() callback per individual, too.  fitnessEffect() callbacks have been de-emphasized in modern SLiM, because they encourage a paradigm of doing singleton operations once per individual like this; when fitnessScaling was added it became the preferred method for adding fitness effects to individuals, because it encourages a paradigm of doing such calculations in vectorized fashion (to the extent possible).

I am optimistic that with these changes the model will run at least as fast as your original code, or perhaps a bit faster.  :->  Let me know!

Cheers,
-B.


Nick Bailey wrote on 10/28/25 7:00 AM:
--
SLiM forward genetic simulation: http://messerlab.org/slim/
Before posting, please read http://benhaller.com/slim/bugs_and_questions.html
---
You received this message because you are subscribed to the Google Groups "slim-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to slim-discuss...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/slim-discuss/02a9f684-4192-4d33-80de-978241725787n%40googlegroups.com.

Nick Bailey

unread,
Oct 28, 2025, 12:38:41 PM (3 days ago) Oct 28
to slim-discuss
Hi Ben,

Thank you! This is already very helpful. Apologies for not providing a minimal example; I'm lazy and didn't want to clean up my script to remove all extraneous info. I've done this now and of course, as I should've expected, it makes it easier for me to test changes anyways. This is, I believe, the minimal I can do to illustrate what I meant to at first:

    initialize() {

    setSeed(6479552501706741767);

    mu = 1e-8;

    defineConstant('L', 100);

    defineConstant('N', 100);

     r = 1e-8;

     FIT = "AA_Fitness_Landscape";


     initializeSLiMOptions(nucleotideBased = T);

     defineConstant('mutMatrix', mmKimura(mu/2, mu/4));

     defineConstant('end', L * 3 - 1);

      initializeMutationTypeNuc("m1", 0.5, "f", 0.0);

      initializeGenomicElementType("g1", m1, 1, mutMatrix);

      initializeGenomicElement(g1, 0, end);

      initializeRecombinationRate(r);

      defineConstant('CodonFile', readFile(paste(FIT, '.sd', sep = '')));

      defineGlobal('CodonMatrix', sapply(CodonFile, "strsplit(applyValue, sep = ',');", simplify = "matrix"));

      for (pos in 1:L) {

      codonVector = c(CodonMatrix[2:65,pos]);

      negCodonVector = c(CodonMatrix[2:65,pos] < 0);

      codonVector[negCodonVector] = 0;

      CodonMatrix[2:65,pos] = codonVector;

      }

      // redefine stop codons to all be 0

      CodonMatrix[50, 1:L] = 0;

      CodonMatrix[52, 1:L] = 0;

       CodonMatrix[58, 1:L] = 0;

       defineGlobal('AncestralSequence', NULL);

      for (pos in 1:L) {

      defineGlobal('AncestralSequence', paste(AncestralSequence, sample(CodonMatrix[2:65, 0], 1, weights = asFloat(CodonMatrix[2:65,pos])), sep = ''));

      }

      initializeAncestralNucleotides(AncestralSequence);

     // Read in matrix of Amino Acid fitnesses and then format as matrix in SLiM

    // The text file must be formatted with a header of Amino Acids where the left to right order matches SLiM amino acid integer IDs from 0-20

    // In single-letter AA IDs these are "X A R N D C Q E G H I L K M F P S T W Y V" where X is stop codon

    // Rows then should correspond to integer amino acid positions in the given sequence, of course matching intended sequence length

    // Cells are filled with selection coefficients for a given amino acid at a given position in sequence

   defineConstant('AAFile', readFile(paste(FIT, '.fl', sep = '')));

   defineGlobal('AAMatrix', sapply(AAFile, "strsplit(applyValue, sep = ',');", simplify = "matrix"));

   // make all stop codons have fitness of 0

    AAMatrix[2, 1:L] = 0;

   // define fitness matrix

   defineConstant('AAFitnesses', asFloat(AAMatrix[2:22, 1:L]));

   defineConstant('logFitnesses', log(AAFitnesses));

   defineConstant('genomeRange', 0:(L - 1));


   // Take the mean fitness of all cells in the matrix except stop codons, which have 0

   meanFitness = mean(AAFitnesses[1:20, ]);


   // take the natural log of this value and multiply by the length of sequence to generate a normalizing factor for fitness

    defineGlobal('scaleFitness', log(meanFitness) * L);

   }


   fitnessEffect() {

  // Loop through all genomes and define nucleotide, codon and amino acid sequences for comparison to previously defined constants

   // Define amino acid sequence for both haplotypes of the given individual

   AAs1 = codonsToAminoAcids(individual.haploidGenome1.nucleotides(format = 'codon'), long = 0);

   AAs2 = codonsToAminoAcids(individual.haploidGenome2.nucleotides(format = 'codon'), long = 0);

   // subset logFitnesses by cbound matrices of AAs and range of genome and exponentiate after applying scale factor

   print(exp(sum(logFitnesses[cbind(AAs1, genomeRange)] + logFitnesses[cbind(AAs2, genomeRange)]) - scaleFitness));

   return exp(sum(logFitnesses[cbind(AAs1, genomeRange)] + logFitnesses[cbind(AAs2, genomeRange)]) - scaleFitness);

   }


   // create a population of 100 individuals

   1 early() {

   sim.addSubpop("p1", N);

   }


   // output samples of 10 haplosomes periodically, all fixed mutations at end

   1000 late() { print('done'); }


I'm also attaching the necessary AA Fitness matrix files. Working directory or the "FIT" variable may need to be redefined to use these. All the codon matrix stuff is necessary so that the fitness of the ancestral individuals doesn't fall to far out of line with expectations but isn't otherwise relevant for my issue (I think...). And thanks for the pointers on other mistakes, meanFitness used to exclude stop codons until I made my most recent changes so now it's back the way it should be. My comments on how the matrix works are probably wrong though I've left them for now. The input matrix has AAs for columns and site number for rows but the steps of reading in "AAFile" and "AAMatrix" may actually transpose that, I copied them from another paper. I'm confident those lines script subset as I intend even if I wrote it wrong. The 2 as "magic number" can be seen in the attached AA_Fitness_Landscape.fl file. The first two columns (0 and 1) are site number and optimal AA (the one with highest fitness) so those are what's being excluded a lot. 

Additionally, I've implemented your changes and the fitnessScaling seems to help a lot! Given a particular random seed there's about a 5% difference. I kept a random seed to see that the fitnesses of individuals is the same while changing up the fitness callback. The combining of haplosomes perhaps not as much but I'd be curious about potential changes there, I have had to use the nucleotides -> codons -> amino acids line for this project quite a bit. Here is a script with the changes. I have images of part of the profile where "before" is the above code and "after" is the below. Hope all this all helps without being too much.

initialize() {

setSeed(6479552501706741767);

mu = 1e-8;

defineConstant('L', 100);

defineConstant('N', 100);

r = 1e-8;

FIT = "AA_Fitness_Landscape";


initializeSLiMOptions(nucleotideBased = T);

defineConstant('mutMatrix', mmKimura(mu/2, mu/4));

defineConstant('end', L * 3 - 1);

initializeMutationTypeNuc("m1", 0.5, "f", 0.0);

initializeGenomicElementType("g1", m1, 1, mutMatrix);

initializeGenomicElement(g1, 0, end);

initializeRecombinationRate(r);

defineConstant('CodonFile', readFile(paste(FIT, '.sd', sep = '')));

defineGlobal('CodonMatrix', sapply(CodonFile, "strsplit(applyValue, sep = ',');", simplify = "matrix"));

for (pos in 1:L) {

codonVector = c(CodonMatrix[2:65,pos]);

negCodonVector = c(CodonMatrix[2:65,pos] < 0);

codonVector[negCodonVector] = 0;

CodonMatrix[2:65,pos] = codonVector;

}

// redefine stop codons to all be 0

CodonMatrix[50, 1:L] = 0;

CodonMatrix[52, 1:L] = 0;

CodonMatrix[58, 1:L] = 0;

defineGlobal('AncestralSequence', NULL);

for (pos in 1:L) {

defineGlobal('AncestralSequence', paste(AncestralSequence, sample(CodonMatrix[2:65, 0], 1, weights = asFloat(CodonMatrix[2:65,pos])), sep = ''));

}

initializeAncestralNucleotides(AncestralSequence);

// Read in matrix of Amino Acid fitnesses and then format as matrix in SLiM

// The text file must be formatted with a header of Amino Acids where the left to right order matches SLiM amino acid integer IDs from 0-20

// In single-letter AA IDs these are "X A R N D C Q E G H I L K M F P S T W Y V" where X is stop codon

// Rows then should correspond to integer amino acid positions in the given sequence, of course matching intended sequence length

// Cells are filled with selection coefficients for a given amino acid at a given position in sequence

defineConstant('AAFile', readFile(paste(FIT, '.fl', sep = '')));

defineGlobal('AAMatrix', sapply(AAFile, "strsplit(applyValue, sep = ',');", simplify = "matrix"));

// make all stop codons have fitness of 0

AAMatrix[2, 1:L] = 0;

// define fitness matrix

defineConstant('AAFitnesses', asFloat(AAMatrix[2:22, 1:L]));

defineConstant('logFitnesses', log(AAFitnesses));

defineConstant('genomeRange', rep(0:(L - 1),2));


// Take the mean fitness of all cells in the matrix except stop codons, which have 0

meanFitness = mean(AAFitnesses[1:20, ]);


// take the natural log of this value and multiply by the length of sequence to generate a normalizing factor for fitness

defineGlobal('scaleFitness', log(meanFitness) * L);

}



// create a population of 100 individuals

1 early() {

sim.addSubpop("p1", N);

}


// output samples of 10 haplosomes periodically, all fixed mutations at end

1:1000 late() {

x = sapply(p1.individuals, "sum(logFitnesses[cbind(codonsToAminoAcids(applyValue.haplosomes.nucleotides(format='codon'), long = 0), genomeRange)]);");

print(exp(x - scaleFitness));

p1.individuals.fitnessScaling = exp(x - scaleFitness);

}




AA_Fitness_After.png
AA_Fitness_Landscape.sd
AA_Fitness_Before_1.png
AA_Fitness_Landscape.fl

Ben Haller

unread,
Oct 28, 2025, 12:56:50 PM (3 days ago) Oct 28
to Nick Bailey, slim-discuss
Hi Nick!  Glad to see that my recommendations improved the performance as expected.  :->  I bit hard to tell how much exactly, since of course the print() calls are taking time as well, especially since the final calculation is being done twice – once for the print() and once for the return value.  But anyhow, it's improved.

Thanks for the full script etc.  I'm not going to look through it in detail now, since it sounds like the issues are resolved, but it's great to have it in the list here for posterity; others will doubtless find it useful.  :->

 Re: the nucleotides -> codons -> amino acids thing, if you do think that that process is an important bottleneck in itself please do open an issue and I can look into it further.

Thanks for the followup; let me know if you have any further questions/ideas!  :->  Happy modeling!

Cheers,
-B.


Nick Bailey wrote on 10/28/25 12:38 PM:

Nick Bailey

unread,
Oct 29, 2025, 6:31:40 AM (2 days ago) Oct 29
to slim-discuss
Hi Ben,

I appreciate it! I guess I should say your advice helped improve the speed as is but overall things still aren't running as fast as when I had my vector-based instead of matrix based approach. This is an example of what I mean by the "vector-based" approach. Basically, here AAFitnesses is a super long vector constructed from the original matrix file, not a matrix itself. A long index has to be created then to index this. This is an example of this approach with the suggestions you gave and overall is the fastest I can do this yet. I have to think it's possible to get a cleaner matrix approach to run as quickly but that could be my ignorance of what's going on under the hood.

defineGlobal('AAFitnesses', NULL);

for (pos in 1:L) {

defineGlobal('AAFitnesses', c(AAFitnesses, asFloat(AAMatrix[2:22,pos])));

}

defineConstant('logFitnesses', log(AAFitnesses));


// Take the mean fitness of all cells in the matrix except stop codons, which have 0

meanFitness = mean(asFloat(AAMatrix[3:22, 1:L]));


// take the natural log of this value and multiply by the length of sequence to generate a normalizing factor for fitness

defineGlobal('scaleFitness', log(meanFitness) * L);

// Given that length L is a 1-indexed position, the vector 1:L must be subtracted by 1

// Then this value is multipled by 21 (the number of amino acids SLiM defines), and added to the AA value in the fitnessScaling below returns the correct selection coefficient position in the logFitnesses vector

// this must be replicated twice for both haplotypes in an individual

defineConstant('AAIndex', rep((1:L - 1) * 21, 2));

}



// create a population of 100 individuals

1 early() {

sim.addSubpop("p1", N);

}


1:1000 late() {

// Given that logFitnesses for an individuals is 0-indexed vector of all the amino acids (in SLiM integer IDs) in the individual, they will correspond to the defined vector order

// For example, the selection coefficient of Arginine (Long ID, Arg; Short ID, R; Integer ID, 2) at position 5 can be obtained like so:

// AA = 2, Pos = 5, so 2 + (5 - 1) * 21 = 86. Therefore 86 is the correct index in the vector for this selection coefficient

// All values obtained this way will form a vector that is converted to a float, all values in vector are added to 1, which obtains fitness values for all amino acids separately, then the product works as a multiplicative function to produce fitness for a given individual

AAs = sapply(p1.individuals, "sum(logFitnesses[codonsToAminoAcids(applyValue.haplosomes.nucleotides(format='codon'), long = 0) + AAIndex]);");


p1.individuals.fitnessScaling = exp(AAs - scaleFitness);

}


I'm also attaching two images of something like the first minimal script I posted here to see relatively how slow codonsToAminoAcids is, with the same parameters as above (e.g. pop size = 100 and sequence length = 300 nucleotides etc.). The first image implements the fitness effect separate from the codonsToAminoAcids work and the second image just returns fitness of 1 to see how that function does in relative isolation. Definitely seems like it can make a difference in isolation but otherwise I don't know how best to test that kind of thing.


Cheers,

Nick

AA_Fitness_codonsToAmino_Test_1.png
AA_Fitness_codonsToAmino_Test_2.png

Ben Haller

unread,
Oct 29, 2025, 7:00:06 AM (2 days ago) Oct 29
to Nick Bailey, slim-discuss
Hi Nick!  That's a nice comparison, and a nice example of the general tradeoff between code optimization, and code comprehensability.  :->  In general, it is a sad rule of thumb that the faster you get a piece of code to run, the more unreadable, unmaintainable, and incomprehensible it will become.

I don't think a matrix-based approach can compete with this vector-based approach, actually; all else being equal (i.e., the code being similarly optimized in other respects), the vector-based approach will always be faster.  Dealing with a matrix is inherently slower; constructing a matrix of coordinates to look up is going to be slow, involving cbind() or rbind() or at least a call to matrix(), and then using those coordinates to look up values involves multiplying each column index by the number of rows and doing an add.  It's just inherently more complex.  If the problem can be translated into equivalent terms utilizing a vector, and avoiding building a coordinate matrix and then avoiding those coordinate multiplies, that's a win.  BUT the optimized code is worse in every way except speed.  So that's always a decision – how important is speed to you?

In SLiM's C++ code there are places where, as you might imagine, things have gotten quite heavily optimized, and it often presents a problem – the debugging and validation and maintenance of those parts of SLiM's code is quite a headache sometimes!  Sometimes what I do is keep around both the optimized and the unoptimized version of a piece of code; I can then switch back to the unoptimized version if I need to.  But of course that's at least twice the work to maintain.

I think probably building the functionality of codonsToAminoAcids() directly into the nucleotides() method would not be a big win, but it's hard to really tell for sure without actually doing it.  There are bigger fish to fry.  :->

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


Nick Bailey wrote on 10/29/25 6:31 AM:

Nick Bailey

unread,
Oct 29, 2025, 7:27:31 AM (2 days ago) Oct 29
to slim-discuss
Hi Ben,

Thanks for the info! I'll definitely have to think on what approach I'll move on with. It's good to know it seems like I'm longer missing anything obvious, so I've got a better idea of what my options are! And I'm definitely moving forward with fitnessScaling regardless, I had seen it in the manual but had trouble wrapping my head around exactly how to implement it for this. I also just realized this whole operation in the final script I posted:


defineGlobal('AAFitnesses', NULL);

for (pos in 1:L) {

defineGlobal('AAFitnesses', c(AAFitnesses, asFloat(AAMatrix[2:22,pos])));

}

defineConstant('logFitnesses', log(AAFitnesses));

can now (with SLiM 5.1) just be

AAFitnesses = asVector(asFloat(AAMatrix[2:22,1:L]));


so that already helps make the vector approach more tidy if I stick with it.

Cheers,
Nick
Reply all
Reply to author
Forward
0 new messages