Hi!
I'm reaching out to make sure my workflow is correct. My goal is to simulate a complex trait involving loci on multiple chromosomes using real genetic data from a population VCF file.
To avoid burdening SLiM with tracking neutral mutations, I first infer the tree sequence from the VCF file and then remove all neutral mutations. After the forward simulation in SLiM, I reintegrate these neutral mutations into my tree sequence.
However, there's one point I'm still unsure about: is it acceptable to infer a single tree sequence from a VCF file that contains multiple chromosomes? I've seen msprime workflows that use an msprime.RateMap to improve the accuracy of the trees. Should I also supply an msprime.RateMap to my tsinfer inference? And are nodes shared between my chromosomes a problem? Since I mainly want to use the tree sequence as a data structure for efficiently storing and overlaying mutations, rather than for the precision of the trees themselves, I'm inclined to believe I don't need to worry about this. Nevertheless, to be safe, I wanted to ask.
When importing the .trees file into SLiM, I make sure the chromosomes are treated as separate by using the recombination rate = 0.5 trick between them. During the simulation, I avoid simplification so that all nodes are retained (to accurately overlay the mutations afterward).
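To make the rate = 0.5 trick concrete, here is a minimal sketch of how I build the rates and end positions that I then pass to SLiM's initializeRecombinationRate(). The function name slim_recombination_map, the within-chromosome rate of 1e-8, and the one-base spacer between chromosomes are my own assumptions for illustration; the layout would need to match the coordinates used in the inferred tree sequence.

```python
def slim_recombination_map(chrom_lengths, within_rate=1e-8):
    """Build (rates, ends) for chromosomes laid end to end on one genome.

    Hypothetical helper: a one-base spacer with rate 0.5 is placed between
    consecutive chromosomes so that they segregate independently.
    `ends` holds the last (0-based, inclusive) position of each segment.
    """
    rates, ends = [], []
    offset = 0  # first position of the next segment
    for i, length in enumerate(chrom_lengths):
        if i > 0:
            # Spacer base separating this chromosome from the previous one.
            rates.append(0.5)
            ends.append(offset)
            offset += 1
        # The chromosome itself, with the ordinary per-base rate.
        rates.append(within_rate)
        ends.append(offset + length - 1)
        offset += length
    return rates, ends


# Example with two toy chromosomes of 100 and 50 bases (assumed lengths).
rates, ends = slim_recombination_map([100, 50])
print(rates)
print(ends)
```

The two lists can then be pasted into the SLiM script as the rates and ends vectors. If the tree sequence has no spacer base between chromosomes, the ends would have to be shifted accordingly.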
If my assumptions are correct, this workflow seems viable after all! (:
Thank you!
Tati