My understanding is that by doing this, I would either commit to equal base probabilities, or I would have to specify *particular* non-equal probabilities (using the weights arguments to sample()) to avoid that outcome. In contrast, if I draw base probability vectors for individual sites from a Dirichlet distribution, this allows every site to have (1) its own base probabilities that (2) may be non-equal without being explicitly set by the user.
Thank you! Yes, that's exactly the kind of solution I had in mind as well. So if I understand correctly, something like the following should work?
function (float) getDirichletBaseFrequencies(float alpha) {
if (size(alpha) != 4) {
stop("The alpha vector has to be of size 4.");
}
freqs = float(4);
sum = 0.0;
for (i in 0:3) {
freqs[i] = rexp(1, 1/alpha[i]);
sum = sum + freqs[i];
}
for (i in 0:3) {
freqs[i] = freqs[i] / sum;
}
return freqs;
}
// need to put this inside a loop iterating over sites
site_probs = getDirichletBaseFrequencies( c(1, 1, 1, 1) ); // flat Dirichlet
// not sure yet what function to use to assign these, but the general idea should be correct
sample(c("A", "C", "G", "T"), size(inds), replace = T, weights = site_probs);