Including originalNuc in output & efficiently counting nucleotides at a position

Chase W. Nelson

unread,

Nov 22, 2021, 2:15:32 PM11/22/21

to slim-discuss

Greetings Ben and all!

I'm running a nucleotide simulation in which I output all end-of-run mutations for a particular subpopulation, p1:

p1.individuals.genomes.output();

For each mutation, I want to determine the ancestral allele (AA). However, due to separation and lack of migration, the context is such that a nucleotide can effectively fix (reach 100%) in one subpopulation but not every population, and therefore will not get incorporated into the "ancestral" reference sequence (will not TECHNICALLY fix). This potentiates back and recurrent mutations that differ by descent but not state.

Given this context, I want to know each mutation's "originalNuc". To do this, the only solution I can think of is to write a mutation() callback that prints this information for all mutations, to be joined later on .id:

mutation() {
writeFile("mutation_metadata.tsv", mut.id + "\t" + originalNuc, append=T);

return T;
}

Unfortunately, this is very inefficient because it prints for every mutation that ever occurs, and it is impossible to know ahead of time which mutations will still be around later. Thus, my first question is whether there is a straightforward way (e.g., a flag argument) that will include the 'originalNuc' in .output() (or, at least, a better way than what I came up with here)?

My second question is the most efficient way of counting all nucleotides (A, C, G, T) at a particular genome position in a particular subpopulation. The best I've been able to come up with is a loop:

this_pos = mut.position;

this_pos_nucs = "";

for (this_genome in p1.genomes) {

this_pos_nucs = this_pos_nucs + substr(this_genome.nucleotides(), this_pos, this_pos);

}

this_nuc_counts = nucleotideCounts(this_pos_nucs);

However, this is incredibly slow. Is there a more computationally efficient (say, vectorized) way to do this?

Thanks a TON!

Chase

Ben Haller

unread,

Nov 22, 2021, 5:20:32 PM11/22/21

to Chase W. Nelson, slim-discuss

Hi Chase. Interesting puzzles. :->

Here's a model that I think solves these issues for you:

initialize() {
    setSeed(6);
    defineConstant("L", 1e5);
    defineConstant("ACGT", c("A", "C", "G", "T"));
    initializeSLiMOptions(nucleotideBased=T);
    initializeAncestralNucleotides(randomNucleotides(L));
    initializeMutationTypeNuc("m1", 0.5, "f", 0.0);
    initializeGenomicElementType("g1", m1, 1.0, mmJukesCantor(1e-7));
    initializeGenomicElement(g1, 0, L-1);
    initializeRecombinationRate(1e-8);
}
1 {
    sim.addSubpop("p1", 250);
    sim.addSubpop("p2", 250);
    p1.setMigrationRates(p2, 0.01);
    p2.setMigrationRates(p1, 0.01);
}
mutation() {
    mut.setValue("O", originalNuc);
    return T;
}
5000 late() {
    // sort mutations by position
    muts = sim.mutations;
    ord = order(muts.position);
    muts = muts[ord];
    pos = muts.position;

    catn("segregating mutations:");
    for (mut in muts)
        catn("   " + mut.position + ": " + ACGT[mut.getValue("O")]
            + " -> " + mut.nucleotide);
    catn();

    catn("fixed mutations:");
    for (sub in sim.substitutions)
        catn("   " + sub.position + ": " + ACGT[sub.getValue("O")]
            + " -> " + sub.nucleotide);

    catn("positions with two segregating mutations:");
    pos = sort(muts.position);
    pos_len = length(pos);
    pos_dup = (pos[0:(pos_len - 2)] == pos[1:(pos_len - 1)]);
    dups = pos[0:(pos_len - 2)][pos_dup];
    for (dup in dups)
    {
        dup_muts = muts[muts.position == dup];
        catn("   position " + dup + " (" + size(dup_muts) + " mutations):");
        for (subpop in sim.subpopulations)
        {
            p1g = subpop.individuals.genomes;
            p_nucs = p1g.nucleotides(dup, dup, "integer");
            p_counts = nucleotideCounts(p_nucs);
            catn("      p" + subpop.id + ": " + paste(ACGT + "=" + p_counts));
        }
    }
}

The mutation() callback simply saves originalNuc in a key named "O" on the mutation. If you're not using the tag property, saving it in tag would be faster, but setValue() works even if you're using tag for other purposes. Since originalNuc is an integer 0/1/2/3, I define a vector named ACGT that can be used to look up the corresponding nucleotide letter for output.

The output event demonstrates several things. First it sorts all the segregating mutations by position, which is nice both for outputting them, and is used later to find duplicate positions. Then it outputs those sorted mutations, using the saved value in "O" to show what the mutation was a mutation from/to. Then it prints out the fixed mutations, which is the same (didn't bother sorting them), but shows that the "O" key-value pair gets carried over from the Mutation to the Substitution when substitution occurs. Finally, it finds positions that have more than one mutation segregating (sorting and then comparing the vector to itself shifted over 1 is a good fast way to find duplicates), and loops over those positions, and prints the nucleotide counts for each such position in each subpopulation. The code there does vectorize, so it's reasonably fast. Probably the way it looks up all the genomes for each subpopulation, with "p1g = subpop.individuals.genomes", for each duplicate, over and over, is a significant time sink, but I didn't bother profiling the model in SLiMgui; if that does turn out to be slow, it could be moved outside the loop with a little more work, particularly if you know ahead of time what subpopulations exist (rather than this code that uses sim.subpopulations to be general). The ACGT vector is useful for this output formatting, too.

The setSeed(6) value is chosen because it results in two duplicate positions, one of which has two segregating mutations to different nucleotides (separate mutational lineages, and genetically different), the other of which has two segregating mutations to *different* nucleotides (separate mutational lineages, but genetically identical):

positions with two segregating mutations:

position 62376 (2 mutations):

p1: A=480 C=7 G=0 T=13

p2: A=491 C=0 G=0 T=9

position 70918 (2 mutations):

p1: A=0 C=35 G=0 T=465

p2: A=0 C=19 G=0 T=481

Earlier in the output you can easily find the mutations in question, since they are output in sorted order:

62376: A -> C

62376: A -> T

...

70918: T -> C

Of course you could also output the generation the mutations occurred, etc. I hope this helps! I believe this code is correct, but of course always check and validate code that you get off of the internet. ;-> Happy modeling!

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University

Chase W. Nelson wrote on 11/22/21 2:15 PM:

--
SLiM forward genetic simulation: http://messerlab.org/slim/
---
You received this message because you are subscribed to the Google Groups "slim-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to slim-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/slim-discuss/7dbc2fc1-3b41-4412-81e9-06ef9d0d14efn%40googlegroups.com.

Peter Ralph

unread,

Nov 22, 2021, 5:30:20 PM11/22/21

to Ben Haller, Chase W. Nelson, slim-discuss

FWIW, you could also pull this information out of the tree sequence
after the fact, if you save it out that way, since mutations know the
background on which they occurred.

-- peter

https://groups.google.com/d/msgid/slim-discuss/735a1741-ce6d-e683-07f8-2e17ddf0591c%40mac.com.

Ben Haller

unread,

Nov 22, 2021, 6:07:24 PM11/22/21

to Peter Ralph, Chase W. Nelson, slim-discuss

Ah yes, thanks Peter, I meant to mention that.

Also, there is a small but in the duplicate-finding code I posted. After finding duplicated entries, it should call unique() so that if a position has three or more mutations segregating at it, that position still only gets written to the output once. So:

dups = unique(pos[0:(pos_len - 2)][pos_dup]);

instead of:

dups = pos[0:(pos_len - 2)][pos_dup];

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University

Peter Ralph wrote on 11/22/21 5:30 PM:

Chase W. Nelson

unread,

Nov 23, 2021, 1:30:00 PM11/23/21

to slim-discuss

Dear Ben and Peter:

You're both simply wonderful — this is perfect! Because my model outputs mutation data at several points along several lineages, I have written an extensive post-processing Python script to analyze all the output. Thus, I took the approach of incorporating your suggested code into the mutation() callback and a custom function, output_metadata():

mutation() {

// Record ORIGINAL NUCLEOTIDE

mut.setValue("O", originalNuc);

// Record NUCLEOTIDE COUNTS at THIS POSITION in THIS SUBPOPULATION

nucs = subpop.genomes.nucleotides(mut.position, mut.position, "integer");

nucs_counts = nucleotideCounts(nucs);

mut.setValue("A", nucs_counts[0]);

mut.setValue("C", nucs_counts[1]);

mut.setValue("G", nucs_counts[2]);

mut.setValue("T", nucs_counts[3]);

return T;

}

function (void)output_metadata(string$ outfilename) {

// Write HEADER, replacing outfilename if present

writeFile(outfilename, "class\tID\tposition\toriginalNuc\tnucleotide\toriginGeneration\tsubpopID\tA\tC\tG\tT", append=F);

// MUTATION (mut) metadata

for (mut in sim.mutations) {

writeFile(outfilename,

"mut\t" +

mut.id + "\t" +

mut.position + "\t" +

ACGT[mut.getValue("O")] + "\t" + // .tag is the INTEGER of the NUCLEOTIDE

mut.nucleotide + "\t" +

mut.originGeneration + "\t" +

mut.subpopID + "\t" +

mut.getValue("A") + "\t" +

mut.getValue("C") + "\t" +

mut.getValue("G") + "\t" +

mut.getValue("T"),

append=T);

}

// SUBSTITUTION (sub) metadata

for (sub in sim.substitutions) {

writeFile(outfilename,

"sub\t" +

sub.id + "\t" +

sub.position + "\t" +

ACGT[sub.getValue("O")] + "\t" + // .tag is the INTEGER of the NUCLEOTIDE

sub.nucleotide + "\t" +

sub.originGeneration + "\t" +

sub.subpopID + "\t" +

sub.getValue("A") + "\t" +

sub.getValue("C") + "\t" +

sub.getValue("G") + "\t" +

sub.getValue("T"),

append=T);

}

Then, I simply call the above function output_metadata() right after each time I call .output(), and I can join these data (some of which are redundant) using .id. as the key in Python. Since there aren't THAT many mutations, it's no problem to output this for ALL segregating mutations in the whole sim rather than just the subpop (lineage) of interest. The arguments to .nucleotides() were the missing vectorization magic I needed! And it's AWESOME that .tag gets carried over from Mutation to Substitution!

Have I said before that SLiM is beautiful?

Regarding counting nucleotides at a position, I do this in the mutation() callback because I am hoping to find out if the population was polymorphic at that position at the moment the mutation arose. This works but is still slow, so I may leave it out and just use .tag for the originalNuc. (It shouldn't actually be necessary for determining the AAs.)

Note that the above approach did lead to a peculiarity that at first had me puzzled, namely, rows of the following type:

class=mut | ID=1233079 | position=83150 | originalNuc=G | nucleotide=A | originGeneration=182459 | subpopID=3 | A=0 | C=0 | G=1641 | T=359

What seemed odd here was that (1) more than one nucleotide had a count greater than 0 (namely, G=1641 and T=3509); BUT (2) this was the ONLY mutation listed at this position, 83150. Because both G and T were already segregating, at least one mutation must have already occurred: G>T if G was the ancestral sequence, or T>G if T was the ancestral sequence. Therefore I thought there should be more than one row for this site — until I remembered that these were, by design, the nucleotide tallies in the population when the mutation arose, not when the state was output! So, the fact that a second mutation (which must have been G>T) isn't included in the output means that it went extinct since the reported mutation (G>A) occurred!

Regarding tree sequence, I ended up abandoning this in my specific case. Specifically, I am using a custom 64x4 trinucleotide-based mutation rate matrix, so recapitation could not be used anyway; detecting coalescence is not really that important for my purpose anyway; and the length of time required for coalescence to occur with recombination had a prohibitively gargantuan tail, given coalescence is not central to my research question.