Thanks for the feedback, Oliver!
On the first topic of refactoring, I will try to submit the changes as separately as possible so that the PRs can be judged individually.
On Johannes' unwrapping efforts, I did ask in issue 2376 but got no reply. His work seems to go in a direction orthogonal to my caching ideas, so both should be considered, as we may be able to get more than one type of speedup. I also put forward an idea of quickly pre-checking which molecules are broken, which already gave large gains, but I don't know whether Johannes implemented anything along those lines too.
On the topic of Universe-based caching, I'd love to present this briefly! Here's a summary:
Problem: Fragment computation is slow.
Solution: We do it once and cache the result.
Problem (also faced by Johannes): Caches are at the AtomGroup level. If the topology changes (e.g. bonds added/deleted), how can AtomGroups be notified that those caches need to be invalidated? (Ergo, there is currently only limited caching of fragments).
Solution: We cache fragment information in a per-AtomGroup cache specific for fragments, under the Universe object. We then make all topology-modifying operations invalidate said cache.
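A minimal sketch of what that could look like (all names here, such as `_fragment_cache` and `add_bonds`, are hypothetical stand-ins, not the actual MDAnalysis API):

```python
class Universe:
    def __init__(self):
        # per-AtomGroup fragment results, keyed by the group's atom indices
        self._fragment_cache = {}

    def add_bonds(self, bonds):
        # ... modify the bond table ...
        # any topology-modifying operation invalidates the fragment cache
        self._fragment_cache.clear()


class AtomGroup:
    def __init__(self, universe, indices):
        self.universe = universe
        self.indices = tuple(indices)

    @property
    def fragments(self):
        cache = self.universe._fragment_cache
        if self.indices not in cache:
            # expensive connected-components search happens only on a miss
            cache[self.indices] = self._compute_fragments()
        return cache[self.indices]

    def _compute_fragments(self):
        return ["frag"]  # stand-in for the real fragment computation
```

The point being that the AtomGroup never has to be "notified": it always reads through the Universe-held cache, and topology edits simply empty it.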
Alternative - Cache versioning: Instead of caching under Universe, we keep some sort of topology version number at the Universe level, incremented every time the topology changes. Cache management at the AtomGroup level then records the version its cache was computed against, and triggers a recalculation whenever a newer version is found.
The idea extends to caching of any type of compound group (Segments, Residues, etc.). In that case, we want specific topology changes to invalidate only the respective caches (a residue reassignment will affect residue caches, but not fragment ones). Caching under Universe must then distinguish between fragment-related, residue-related, and segment-related caches, etc. Likewise for cache versioning, which must keep independent version numbers (or hashes) for different parts of the topology.
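A sketch of the per-compound versioning variant, again with entirely hypothetical names (one counter per part of the topology, so a residue reassignment leaves fragment caches untouched):

```python
class Universe:
    def __init__(self):
        # one independent version counter per part of the topology
        self._topology_versions = {"bonds": 0, "residues": 0, "segments": 0}

    def add_bonds(self, bonds):
        # ... modify the bond table ...
        self._topology_versions["bonds"] += 1  # affects fragment caches only

    def reassign_residues(self, mapping):
        # ... modify the residue assignment ...
        self._topology_versions["residues"] += 1  # leaves fragments valid


class AtomGroup:
    def __init__(self, universe):
        self.universe = universe
        self._cache = {}           # cached results, e.g. {"fragments": [...]}
        self._cache_versions = {}  # topology version each entry was built at

    def _cached(self, name, depends_on, compute):
        current = self.universe._topology_versions[depends_on]
        if self._cache_versions.get(name) != current:
            self._cache[name] = compute()
            self._cache_versions[name] = current
        return self._cache[name]

    @property
    def fragments(self):
        # fragments depend on bonds, so only bond edits force a recompute
        return self._cached("fragments", "bonds", lambda: ["frag"])
```

The `depends_on` key is where the "specific topology changes invalidate only the respective caches" part lives: each cached quantity declares which counter it is tied to.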
Finally, as to funding, so far the work seems manageable without. I'd say we keep that in mind if this turns into a deeper rabbit hole?
What do you guys think?
Cheers,
Manel