Efficiently reading from a dictionary

Darren Li

Aug 28, 2025, 10:49:32 PM
to slim-discuss
Hello everyone,

I have a dataframe/dictionary of values from which I want to get a specific value (row X, column Y) many times. To vectorise the process, I use sapply:

phenoFitness_mhw = sapply(ordered_indices_pop, "get_phenoFit_mhw(applyValue, bsi_given_dhw50_df, indices_of_interest, indices_of_quantiles);");

Where the function get_phenoFit_mhw is defined as:

function (float)get_phenoFit_mhw (integer index, lifso bsi_given_dhw50_var, integer indices_of_interest, integer indices_of_quantiles)
{
    return(bsi_given_dhw50_var.getValue(asString(indices_of_quantiles[index]))[indices_of_interest[index]]);
}


bsi_given_dhw50_df is the dictionary of values, from which I want to obtain the specific fitness of an individual, given its phenotype (row X) and the severity of the marine heatwave it experienced (column Y). ordered_indices_pop is simply a vector of indices that I use to obtain the corresponding phenotype and marine heatwave severity for each individual; both are specific to each individual (because in the model, the heat stress experienced by an individual is assumed to depend on its phenotype).


I know that the above part of the code is the bottleneck based on the profile, and also I saw someone experiencing the same issue here:



where the process is slowed down significantly because the sapply() call is doing a dictionary lookup for each value.

I'm wondering how I can make the above code more efficient. I can't do a simple lookup like in the other thread:

destination_densities_3 = densities[destinations];

Because I'm not just reading from a single vector here, but from a dictionary.

Thanks,
Darren

Ben Haller

Sep 8, 2025, 3:10:56 PM
to Darren Li, slim-discuss
Hi Darren!  Sorry for the slow reply.

OK, so, this is kind of a complex question.

One point would be that the user-defined function here is not doing your performance any good.  Internally, to call a user-defined function Eidos has to create (and then tear down) a new Eidos interpreter, set up a new symbol table with local variables for all the parameters, etc.  It's not super slow, but if you're running a tight loop and the user-defined function does something small/trivial, it will help performance a lot to just inline the code into the loop rather than calling out to a function.

Another point would be that the asString() call there is again not doing you any favors.  That has to turn the integer value into a string value, and allocate a new Eidos value to hold the string, and then do a string-based lookup in the dictionary, which is relatively slow.  Much faster would be to simply use integer keys in your dictionary in the first place, rather than strings.  Dictionary has supported integer keys for a while now, so unless you're running a fairly old version of SLiM that feature ought to be available.

A third point is that there is possible vectorization here that you're not taking advantage of.  For each index in ordered_indices_pop, you're looking up indices_of_quantiles[index] and indices_of_interest[index] separately, one at a time.  That's very slow.  You want to do those subset operations with the whole vector of indices in one go.  Always vectorize performance-sensitive code if you can.

So a rewrite of your code might look like:

phenoFitness_mhw = NULL;
quantiles = indices_of_quantiles[ordered_indices_pop];
indices = indices_of_interest[ordered_indices_pop];

for (quantile in quantiles, index in indices)
    phenoFitness_mhw = c(phenoFitness_mhw, bsi_given_dhw50_df.getValue(quantile)[index]);

Given that we want to loop through both quantiles and indices in synchrony, I think the for loop is going to be faster than using sapply().  Note that I didn't put the statement inside the for loop inside curly braces {}; that would make the code slower, since it would have to interpret the curly braces every time through the for loop.  (In interpreted languages like Eidos, pretty much everything you do makes your code slower, even putting a statement inside curly braces.  If Eidos had a smarter optimizer, that could get optimized out, but it doesn't.  :->)  Also note that I assumed here that the Dictionary has been recast to use integer keys instead of strings.
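
For anyone who wants to try the loop in isolation, here is a minimal toy sketch. The dictionary contents and the index vectors are made up for illustration, and it assumes a SLiM version recent enough to support integer Dictionary keys and multi-variable for loops.

```eidos
// Hypothetical stand-in for bsi_given_dhw50_df:
// integer key (column) -> float vector (rows)
d = Dictionary();
d.setValue(0, c(0.10, 0.20, 0.30));
d.setValue(1, c(0.40, 0.50, 0.60));

// Made-up per-individual lookups
indices_of_quantiles = c(1, 0, 1, 0);   // which column each individual needs
indices_of_interest = c(2, 0, 1, 2);    // which row each individual needs
ordered_indices_pop = 0:3;

// Subset both index vectors once, outside the loop
quantiles = indices_of_quantiles[ordered_indices_pop];
indices = indices_of_interest[ordered_indices_pop];

// One dictionary lookup per individual, with no user-defined function
// and no asString() conversion
fitness = NULL;
for (quantile in quantiles, index in indices)
    fitness = c(fitness, d.getValue(quantile)[index]);

print(fitness);   // values: 0.6 0.1 0.5 0.3
```

The loop still does one getValue() call per individual; only the per-call overhead is reduced.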

Taking a step back, the fact that this code is using a Dictionary might not be ideal in the first place.  Dictionary is not terribly fast.  If this code remains unacceptably slow after the above changes, you might think about storing your data as a matrix instead.  That might or might not be faster, depending on what exactly you're doing, but it might be worth a try.

But perhaps the best idea, if it works for your purposes, is to use the DataFrame class instead of the Dictionary class.  This would be nice because DataFrame has a method, subset(), that is designed to do exactly what you are trying to do, if I have understood correctly.  With DataFrame, the code above becomes one line:

phenoFitness_mhw = bsi_given_dhw50_df.subset(indices_of_interest[ordered_indices_pop], indices_of_quantiles[ordered_indices_pop]);

I'd imagine that is both the fastest and the cleanest way to do what you're trying to do, as long as your data can be recast as a DataFrame object.  It's a subclass of Dictionary, so probably it will be suitable for you as long as your data is "rectangular" – same number of rows for each column.

I hope this helps; post again if not.  Happy modeling!

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


--
SLiM forward genetic simulation: http://messerlab.org/slim/
---
You received this message because you are subscribed to the Google Groups "slim-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to slim-discuss...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/slim-discuss/a250ee7b-31fc-4e76-a051-9b9d08496f34n%40googlegroups.com.


Darren Li

Sep 9, 2025, 9:35:54 PM
to slim-discuss
Dear Ben,

(trying to post again since my last message wasn't posted, for some reason...)

Not a problem at all, and thanks for these details!

I remember trying the last option that you gave with DataFrame. When I do it as you described, however, I obtain a second DataFrame object (attached image), with the values that I'm interested in lying along the diagonal. With my limited SLiM knowledge, I could use sapply() to extract those values one at a time (which is why I initially used sapply() in my first post), but I'm not sure whether there's a better way to do it. Since, as you said, this is probably the fastest approach, I want to keep trying it.

Thanks a lot!
Darren
Screenshot 2025-09-10 110748.png

Ben Haller

Sep 10, 2025, 7:35:42 AM
to Darren Li, slim-discuss
Hi Darren!

Yes, sorry about the list issues, Google Groups seems to randomly reject messages as "spam" and there is apparently nothing I can do to stop it.  If that happens to you, please email me directly at bha...@mac.com and I can rescue the rejected message from moderation.  You're lucky that your re-post made it through; usually once it decides to reject one message, it keeps rejecting re-posts also!

Great point re: subset() with DataFrame!  Yes, that method does not do what you want it to do; I had forgotten what it actually does.  What it actually does is useful, but not what you want.  :->  But that makes me realize that there should be a way to do what you want!  I've opened a new issue at https://github.com/MesserLab/SLiM/issues/558.  The right place to put this functionality is on matrix, not DataFrame, now that I think about it more; a DataFrame can have a different type for each column, so extracting a subset of row/column pairs would, in the general case, produce a hodgepodge of types that would require type-promotion, like calling c() with a mix of different types.  Not necessarily wrong, but not obviously useful either.  The right place to put it is as a style of subsetting for a matrix, like how in R you can do:

> m = matrix(1:12, nrow=3)
> m
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> r = c(1,2,3,2)
> c = c(2,3,1,4)
> cbind(r,c)
     r c
[1,] 1 2
[2,] 2 3
[3,] 3 1
[4,] 2 4
> m[cbind(r,c)]
[1]  4  8  3 11

Each row of cbind(r,c) specifies a (row,column) pair for an element to be extracted from m.  (Note that R uses 1-based indices whereas Eidos uses 0-based indices, so this example would look slightly different in Eidos.)  This is what you need, right?  See the issue I've just made for further discussion.  I will implement this feature ASAP so that it rolls into the next SLiM release, which I'm working on finalizing now; so this feature should be available to you within a couple of weeks, hopefully sooner.  Thanks for calling this to my attention; sorry it took me a little while to understand, and thanks for your persistence!  :->
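
Until that style of subsetting is available, the same (row, column) lookup can be sketched with plain index arithmetic. The example below is an illustration rather than the new feature: it reuses the R matrix above with 0-based indices, and the names rows, cols, flat and picked are made up. It relies on matrix() filling column by column, so element (row, col) sits at position row + col * nrow(m) once the dimensions are stripped.

```eidos
// Same matrix as the R example: 3 rows, 4 columns, filled column by column
m = matrix(1:12, nrow=3);

// R's r = c(1,2,3,2) and c = c(2,3,1,4), shifted to 0-based indices
rows = c(0, 1, 2, 1);
cols = c(1, 2, 0, 3);

// c() strips the matrix dimensions, leaving a column-major vector
flat = c(m);

// One vectorized subset: element (row, col) is at row + col * nrow(m)
picked = flat[rows + cols * nrow(m)];

print(picked);   // values: 4 8 3 11, matching m[cbind(r,c)] in R
```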

Cheers,
-B.



Nick Bailey

Sep 10, 2025, 10:09:13 AM
to slim-discuss
Hi Darren and Ben,

I believe I've dealt with a similar problem before and I was hoping SLiM would index matrices like R but it did not (as Ben says here https://github.com/MesserLab/SLiM/issues/558). I ended up doing a hacky workaround where I turned the matrix into a very long vector and subsetted that instead. These should be the relevant lines of code (with a lot of comments, I knew I'd forget how this worked later!). These are in different code blocks in my script but I think I'm including all necessary variables here:

// Read in matrix of Amino Acid fitnesses and then format as matrix in SLiM
// The text file must be formatted with a header of Amino Acids where the left to right order matches SLiM amino acid integer IDs from 0-20
// In single-letter AA IDs these are "X A R N D C Q E G H I L K M F P S T W Y V" where X is stop codon
// Rows then should correspond to integer amino acid positions in the given sequence, of course matching intended sequence length
// Cells are filled with selection coefficients for a given amino acid at a given position in sequence
defineConstant('AAFile', readFile(paste(FIT, '.fl', sep = '')));
defineGlobal('AAMatrix', sapply(AAFile, "strsplit(applyValue, sep = ',');", simplify = "matrix"));

// It seems SLiM can't index a matrix by a matrix, which would speed up FitnessEffect calculations
// Instead it's possible to take a "hacky" approach and redefine the Matrix above as a very long vector
// The vector is composed of the 21 amino acid selection coefficients in order repeated L (length of sequence) times
// For example, a 3x3 matrix where each row is composed of the values c(1,2,3) will result in a vector of c(1,2,3,1,2,3,1,2,3)
// This vector will be subset in FitnessEffect so see below how this assists that process
defineGlobal('AAFitnesses', NULL);
defineConstant('L', 1000);
for (pos in 1:L) {
    defineGlobal('AAFitnesses', c(AAFitnesses, asFloat(AAMatrix[2:22,pos])));
}

// This gives the proper way of indexing the AAFitnesses, described a bit more in FitnessEffect
// Saves a little computation to put here instead of computing every fitness check
defineConstant('AAIndex', (1:L - 1) * 21);

// Define amino acid sequence for both haplotypes of the given individual
AAs1 = codonsToAminoAcids(individual.genome1.nucleotides(format = 'codon'), long = 0);
AAs2 = codonsToAminoAcids(individual.genome2.nucleotides(format = 'codon'), long = 0);

// Compute fitness by a formula like the above but using data read in from a matrix where selection coefficients can vary by amino acid type and position
// See Initialize step for how AAIndex is defined
// Given that AAs1 and AAs2 are 0-indexed vector of all the amino acids (in SLiM integer IDs) in an individual, they will correspond to the defined vector order
// Given that length L is a 1-indexed position, the vector 1:L must be subtracted by 1
// Then this value is multiplied by 21 (the number of amino acids including stop codon that SLiM defines), and added to the AA value returns the correct selection coefficient position in the AASelCoef vector
// For example, the selection coefficient of Arginine (Long ID, Arg; Short ID, R; Integer ID, 2) at position 5 can be obtained like so:
// AA = 2, Pos = 5, so 2 + (5 - 1) * 21 = 86. Therefore 86 is the correct index in the vector for this selection coefficient
// All values obtained this way will form a vector that is converted to a float, all values in vector are added to 1, which obtains fitness values for all amino acids separately, then the product works as a multiplicative function to produce fitness for a given individual
// This is a substantial speed-up from more intuitive looping approaches I took before
return product(AAFitnesses[AAs1 + AAIndex] + AAFitnesses[AAs2 + AAIndex]);

I've also attached a matrix file that would be read in for "AAFile", so you can plug in the actual code and see the shape of the matrices/vectors throughout. I'm sure some of this is superfluous for your purposes and still requires explanation, so I'll try to do that. Basically, you have some matrix (AAFile and AAMatrix), which it seems like you already have, and it gets turned into a vector (AAFitnesses) by looping through each of the columns while keeping rows consistent, and concatenating those columns into the vector. The trick in AAIndex is that the zero-based column number from the original matrix (1:L - 1) is multiplied by the number of rows from the original matrix (21), giving a vector that holds the first possible index for each column of the original matrix. Because those are just the first possible indices, they must be added to the real index of interest. This happens in the final line (return product etc.), where AAs1 holds the index of the value of interest within each column and is added to AAIndex (e.g. as in AAFitnesses[AAs1 + AAIndex]); the two vectors have the same length, which is the number of columns in the original matrix. I suppose in the context of what you're doing ordered_indices_pop is analogous to AAs1 here? I know the code isn't pretty but the indexing part is fast and consistent. I suppose the key points are 1) turning the matrix into a vector of known length based on the given columns/rows and 2) then indexing by a vector of length equal to the number of columns, which will functionally only take a single cell from each row. If I'm completely off, sorry! If not but I didn't clarify enough, let me know.
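
The offset trick described above can be seen on a toy scale in the sketch below. It is not taken from the original script: the sizes (3 rows instead of 21, 4 positions instead of 1000), the values, and the names nAA, nPos, offsets and aas are all made up, and c() is used in place of the concatenation loop to flatten the matrix.

```eidos
nAA = 3;    // rows per column (21 in the script above)
nPos = 4;   // number of columns, i.e. sequence positions (L in the script above)

// One column per position; the value encodes position.row for readability
m = matrix(c(0.0, 0.1, 0.2,  1.0, 1.1, 1.2,  2.0, 2.1, 2.2,  3.0, 3.1, 3.2), nrow=nAA);

// Flatten column by column, like the loop that builds AAFitnesses
flat = c(m);

// First index of each column in the flat vector, like AAIndex
offsets = (0:(nPos - 1)) * nAA;

// Made-up row choice for each column, playing the role of AAs1
aas = c(2, 0, 1, 2);

// One vectorized subset picks a single cell from every column
picked = flat[aas + offsets];

print(picked);   // values: 0.2 1.0 2.1 3.2
```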

Cheers,
Nick
AA_Fitness_Landscape.sd

Ben Haller

Sep 10, 2025, 11:46:32 AM
to Nick Bailey, slim-discuss
Wow, that's quite a hack, Nick!  :->

Once #558 is fixed this sort of thing ought to be easy and clean.  Stay tuned; I expect to release SLiM 5.1, with this and other new stuff, fairly soon.

My only other remark would be: when you discover something like this that ought to be easy in Eidos/SLiM but seems to be very hard, please do file an issue!  :->  Here the basic problem is that Eidos is just missing a style of subsetting that is supported by R; I wasn't aware of that R feature, and so I didn't put it into Eidos.  Please bring shortcomings like this to my attention.  :->  Thanks, and happy modeling!


Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University



Darren Li

Sep 10, 2025, 8:57:29 PM
to slim-discuss
Dear Ben and Nick,

Thanks for sharing all the above details, it's very useful! You both understood my issue perfectly. 

Nick, what you did is very smart! Thank you so much for sharing your code. I also thought of doing it this way, but didn't bother writing the code because I thought surely something that looks that simple wouldn't require writing new code, right... so I ended up going for the lazy and "simpler" yet more computationally expensive way of using sapply(), because I initially thought sapply() would still be fast, since it's supposed to vectorise the computations...

Ben, thank you for making this functionality available in the next release! We really appreciate all the hard work you're putting into SLiM.

I'm now torn between whether to implement Nick's approach, or wait for Ben to release this new feature. I have other work to do in the meantime, so I will see how I go and I may implement Nick's approach if I finish everything early.

Noted Ben, regarding filing issues. Thanks again to you both for your help!

Best,
Darren


Nick Bailey

Sep 11, 2025, 9:49:44 AM
to slim-discuss
Hi Ben and Darren,

Thanks Ben, that's good to know. I think with stuff like that it's easy to keep getting closer to a solution and keep thinking "I'll just try this one more thing" before raising an issue, until finally I end up with something that works after all! But I'll definitely make a point of raising an issue in the future. I don't think I've had a comparable headache like that in some time.

And thanks Darren, glad it's potentially useful, depending on what you do! The script I took that from is still SLiM 4 (e.g. the "genomes" in place of "haplosomes", which I forgot until now), and really I should update it to SLiM 5 anyways as I'd like it to be usable as long as possible. So I'll probably wait for the new feature, so the code is cleaner for potential users, and change that as part of updating my script to SLiM 5.

Cheers,
Nick

Ben Haller

Sep 11, 2025, 11:56:14 AM
to slim-discuss
Hi all.  Just a quick followup to note that this style of subsetting a matrix/array has been added to Eidos in the GitHub head version of SLiM.  You can build from sources to get it now, as discussed in the SLiM manual (chapter 2), or wait for SLiM 5.1 to be released, which should happen pretty soon now.  :->

The new feature is discussed in issue https://github.com/MesserLab/SLiM/issues/558 and the new documentation that will be added to the Eidos manual for it is given there.  Happy modeling!

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


Darren Li

Sep 12, 2025, 11:59:24 AM
to slim-discuss
Dear Ben and Nick,

Thanks for these details! 

Ben, thanks for adding this new feature so quickly! I'm looking forward to seeing how fast my code runs after this new implementation, yay!

Best,
Darren
