Efficiently translating a vector with a dictionary

Noam Vogt-Vincent

Mar 14, 2025, 8:44:27 PM
to slim-discuss
Hey all,

I'm running a non-WF configuration, where offspring disperse to different subpopulations within the reproduction callback. For each individual, I (1) calculate the individual-specific fecundity f, (2) generate dispersal destinations for all f offspring using a weighted sample() call, (3) calculate the recruitment probability, which is density-dependent (it depends on the number of individuals at the dispersal destination), and (4) generate each successfully recruited offspring at its dispersal destination.

At the moment, I perform steps (2) and (3) with the following lines (density is a dictionary of the form {subpop_id: density}, which I use because density in this model actually depends on biomass, an individual property that is pre-computed each cycle):

destinations = sample(subpopulations, fecundity, replace=T, weights=dispersal_likelihood); // Get destination populations

destination_densities = sapply( destinations, 'density.getValue(applyValue.id);'); // Get density at destinations


My problem is that the second line, the sapply call, takes up ~90% of the processing time for the entire simulation according to the profiling tool. The destinations vector has a typical length of around 100, and the line runs once per fecund individual, so there are a lot of calls. Still, other parts of the model involve a much greater number of operations, yet they don't seem to be hogging performance. Is there a more efficient way to write this line?
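
For context, the callback is structured roughly like this (a simplified sketch, not my actual code; F_SCALE, K, the biomass-based fecundity formula, and the recruitment formula are all placeholders):

reproduction() {
    // (1) individual-specific fecundity; the formula is a placeholder
    fecundity = rpois(1, individual.getValue('biomass') * F_SCALE);
    if (fecundity == 0)
        return;
    
    // (2) weighted sample of a dispersal destination for each offspring
    destinations = sample(subpopulations, fecundity, replace=T, weights=dispersal_likelihood);
    
    // (3) density-dependent recruitment probability (placeholder form)
    destination_densities = sapply(destinations, 'density.getValue(applyValue.id);');
    p_recruit = exp(-destination_densities / K);
    
    // (4) generate each successfully recruited offspring at its destination
    recruited = destinations[runif(fecundity) < p_recruit];
    for (dest in recruited)
        dest.addCrossed(individual, dest.sampleIndividuals(1));    // mate choice simplified
}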


tl;dr: I want to translate each element of a vector based on a dictionary. Using sapply appears to be really inefficient. Is there a better way?


Thanks!


Noam

Noam Vogt-Vincent

Mar 14, 2025, 9:26:17 PM
to slim-discuss
OK, found a solution. I created two vectors, one containing the subpopulations (subpopulations) and the other containing the corresponding densities (densities), and then computed the destination densities with the following line:

destination_densities = densities[match(destinations, subpopulations)];
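
In case it's useful to anyone else, the full setup looks roughly like this (a sketch; the density formula with K is a placeholder, since my model actually derives density from biomass):

// once per cycle, build parallel vectors of subpopulations and their densities
subpopulations = sim.subpopulations;
densities = subpopulations.individualCount / K;    // placeholder density

// per individual, translate sampled destinations into densities
destinations = sample(subpopulations, fecundity, replace=T, weights=dispersal_likelihood);
destination_densities = densities[match(destinations, subpopulations)];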


Looks like this indexing approach is way faster than using sapply and dictionaries (is sapply not vectorised?).

Ben Haller

Mar 14, 2025, 10:25:20 PM
to Noam Vogt-Vincent, slim-discuss
Hi Noam.  Using sapply() to run through a vector of length 100 should be extremely fast.  I'm not sure what problem you might be encountering; we'd need to see a full (but minimal) script that reproduces the problem to diagnose it.  I wonder whether your code might be executing many more times than you think it is.  For example, see this recent thread:

https://groups.google.com/g/slim-discuss/c/cdI37zWtrAQ/m/4OXJiKc0AAAJ

The match() function is nice and fast, yes.  :->  But I suspect that the fact that your dictionary code was so slow is indicative of a deeper problem, and using match() might have fixed the visible manifestation of the bug (the slow speed), but the bug itself might still remain.  It's best to get to the bottom of the bug and really understand it.  Good luck!

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University



Noam Vogt-Vincent

Mar 19, 2025, 7:45:06 AM
to slim-discuss
Hi Ben,

Thanks for your reply! Yes, sapply is being run a large number of times (once per individual), but it still seems to be much slower than the match-indexing approach. I've pasted a simple example below (sorry, I tried attaching it but my post got deleted so maybe it doesn't like code samples as attachments). I guess I could reduce the number of sapply calls by just making one huge destinations vector first (and thereby only calling sapply once), but (1) this would be really complicated because so many of the requisite calculations are individual-specific; (2) this would only save time if most time in the small sapply calls is spent on overhead rather than computation; and (3) I'd still need a 100*10000*100 loop (in this example) since addCrossed isn't vectorised.

All the best,

Noam

initialize() {
    defineConstant('n_subpopulations', 100);
    defineConstant('n_individuals', 10000);
    defineConstant('n_offspring', 100);
    
    defineConstant('subpopulations', 0:(n_subpopulations-1));
    defineConstant('densities', runif(n_subpopulations));
    
    // Create a dictionary translating subpopulations to density
    defineGlobal('density_dictionary', Dictionary());
    for (p in subpopulations) {
        density_dictionary.setValue(p, densities[p]);
    }
}

1 early() {
    for (p in subpopulations) {
        for (i in 0:(n_individuals-1)) {
            destinations = sample(subpopulations, n_offspring, replace=T);
            
            // Option 1: dictionary + sapply (this is very slow)
            destination_densities_1 = sapply(destinations, 'density_dictionary.getValue(applyValue);');
            
            // Option 2: match + indexing (this is faster)
            destination_densities_2 = densities[match(destinations, subpopulations)];
        }
    }
}

Ben Haller

Mar 20, 2025, 9:32:05 PM
to Noam Vogt-Vincent, slim-discuss
Hi Noam,

First of all, sorry for the difficulties posting.  (A general message to slim-discuss users: Google Groups seems to randomly decide to veto messages as spam.  If that happens to you, please send me an email off-list immediately, telling me that your message is in quarantine.  I can approve it – but I don't know it's there, because Google Groups takes up to a week to bother sending me an email saying "I've randomly rejected a message to your group, you might want to have a look".  This is out of my control; it's a bug they seem disinclined to fix, because they apparently can't afford to hire an intern to fix their bugs.  <deep breaths>)

Looking at your example code, I guess I'm puzzled by the whole structure of things.  I don't see the purpose of the dictionary, or the match() call.  The vector `subpopulations` is 0, 1, ..., N-1, such that its value at an index i *is* the index i; `subpopulations[10]` is 10, for example.  So when you do `sample(subpopulations, n_offspring, replace=T)`, yes, you get back a vector of subpopulation IDs; but you also get back a vector of indices.  So instead of your options 1 or 2 (dictionary or match()), you can simply do:

            // Option 3: simple lookup
            destination_densities_3 = densities[destinations];

That produces identical results to your options 1 and 2, and is of course faster.

So if your real code is structured in the same way as your example code, then I think you can get rid of all that complication.  You might also be interested in a few other things that might speed things up, depending on what you're doing in the surrounding code:

- the Community method subpopulationsWithIDs()
- the fact that subpopulations (and most objects in SLiM) are subclasses of Dictionary, which means that you could attach density values directly to subpopulations, like subpop.setValue("density", x), and then look them up like subpop.getValue("density").  Keeping the densities as a separate vector of values will probably be even faster for you, though, since your code uses subpopulation *indices*, not subpopulation *objects*.  Both ideas are sketched below.
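
In sketch form (the density formula and the ID values here are just placeholders):

// attach a density value directly to each subpopulation (they are Dictionary subclasses)
for (subpop in sim.subpopulations)
    subpop.setValue("density", subpop.individualCount / 1000.0);

// look the value up later from any subpopulation object
d = subpop.getValue("density");

// or fetch subpopulation objects by ID in one vectorized call
subpops = community.subpopulationsWithIDs(c(1, 5, 7));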

In general, dictionaries are slow but general; you can put anything into them, and keep any number of values, so they're very flexible, but they are much slower than simpler (but less flexible) approaches.  The problem in your "option 1" is not sapply() – that is quite fast.  It is that the sapply() call is doing a dictionary lookup for each value, and that's going to be very slow.

Your option 2, on the other hand, uses match(), which is basically a vectorized loop like sapply() (so on that ground they are roughly equal); but whereas option 1 does a dictionary lookup per element, option 2 does what's called a "hash table lookup" per element, using an extremely fast C++ data structure that only needs to be built once and can then be used again and again to provide lookups very efficiently – much faster than a dictionary lookup.  So yes, match() is going to be fast, because it avoids the dictionary.

But the fastest approach, if it works for you, is my option 3: simply using the indices that you have sampled as indices, without using match() at all, since (at least in your example code) all that match() is telling you is that the value 10 is at index 10, the value 17 is at index 17, etc., which you knew already.  :->
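
You can check that equivalence directly with the constants from your example (identical() just compares the two results):

// with subpopulations == 0:(n_subpopulations-1), sampled IDs double as indices
destinations = sample(subpopulations, n_offspring, replace=T);
d2 = densities[match(destinations, subpopulations)];    // option 2
d3 = densities[destinations];                           // option 3
if (!identical(d2, d3))
    stop("options 2 and 3 disagree");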

I hope this helps; happy modeling!


Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University



Noam Vogt-Vincent

Mar 20, 2025, 10:52:11 PM
to slim-discuss
Hi Ben,

Thank you for your reply! Sorry, to be clear, that code was just a contrived example to illustrate the problem - in reality subpopulations is a vector of subpopulation objects. But thank you for clarifying that the rate-limiting step is the dictionary lookup - that explains everything!

All the best,

Noam