How does RevBayes treat missing character states?

28 views
Skip to first unread message

MUHAMMAD REHAN

unread,
May 19, 2025, 9:38:31 PMMay 19
to revbayes-users
Hi everyone, 

Apologies for the weird formatting of this question, it's my first time on RevBayes forums. I am a linguist working on a phylogenetic inference of the Germanic languages, and a couple of taxa (Vandalic and Burgundian) have a lot of missing states. These taxa seem to be pulled towards the root, and I was curious if there is some information out there about how RevBayes treats missing character states that are marked with a "?". 

I have attached a script here in case there are any questions about my model: 

clear()

# Put the correct path to your data file
data <- readDiscreteCharacterData("C:\\Users\\Dodd User\\Desktop\\revBayes\\data\\germanic_data_root.nex")

n_taxa <- data.ntaxa()
n_branches <- 2 * n_taxa - 3
taxa <- data.taxa()

moves = VectorMoves()
monitors = VectorMonitors()

outgrouped = clade("PGMC")

###################################
# 3. GAMMA RATE VARIATION (4 cat) #
###################################
alpha ~ dnExponential(1.0)
moves.append( mvScale(alpha, weight=1.0) )
rates := fnDiscretizeGamma(alpha, alpha, 4)



######################
# Irreversible Model #
######################

# Transition rate from 0 to 1
a ~ dnExponential(1.0)
moves.append( mvScale(a, weight=1.0) )


Q := fnFreeK([a,0], rescale=false)

# Root always starts in state 0
root_freq <- simplex(1, 0)

##############
# Tree Model #
##############

topology ~ dnUniformTopology(taxa, outgroup=outgrouped)

moves.append( mvNNI(topology, weight=n_taxa/2.0) )
moves.append( mvSPR(topology, weight=n_taxa/10.0) )

for (i in 1:n_branches) {
    bl[i] ~ dnExponential(10.0)
    moves.append( mvScale(bl[i], weight=1.0) )
}

TL := sum(bl)
psi := treeAssembly(topology, bl)

###################
# PhyloCTMC Model #
###################

seq ~ dnPhyloCTMC(
    tree=psi,
    Q=Q,
    type="Standard",
    rootFrequencies=root_freq,
    siteRates=rates
)
seq.clamp(data)

############
# Analysis #
############

my_model = model(seq)

monitors.append( mnScreen(TL, a, printgen=1000) )
monitors.append( mnFile(psi, filename="output/germanic_mk_exp.trees", printgen=100) )
monitors.append( mnModel(filename="output/germanic_mk_exp.log", printgen=100) )

mymcmc = mcmc(my_model, moves, monitors)
mymcmc.run(generations=100000, tuningInterval=100)

mymcmc.operatorSummary()

###################
# Post-processing #
###################

treetrace = readTreeTrace("output/germanic_mk_exp.trees", outgroup=outgrouped, treetype="non-clock")
map_tree = mapTree(treetrace, "output/germanic_mk_exp_map.tree")
mcc_tree = mccTree(treetrace, "output/germanic_mk_exp_mcc.tree")

Thank you so much for for your help with this! 

Best, 
Rehan

Benjamin Redelings

unread,
May 30, 2025, 1:47:18 PMMay 30
to revbaye...@googlegroups.com

Hi Rehan,

1. Missing character states are treated as a lack of knowledge of the what the state is -- like you simply didn't look at the character.  This can be said in several equivalent ways:

- Pr(missing data|s) = 1 for every state s

- missing states are union of all possible states.  For DNA N = A \union T \union C \union G.

2. Missing states are presumed to be "missing completely at random" (MCAR), so that the probability of being missing is independent of both observed and unobserved features of the data.  This assumption might be violated if character states are more likely to be missing from (for example) faster-evolving characters.  I see you have gamma rate heterogeneity in your model, so if characters are preferentially missing in one of the gamma bins, that could possibly be problematic.  Its not a problem for some taxa to have more missing characters though.

3. If the model is not violated -- including the missing-completely-at-random assumption -- then missing data shouldn't "pull" taxa anywhere.  Instead, you might consider if (a) the prior favors balanced topologies over unbalanced ones (or the reverse) and (b) if the odd features of tree estimate are weakly supported.  People often show consensus trees, but if the posterior probability of a feature is only (say) 0.7, then it is only weakly supported.

Does that help?

-BenRI

--
You received this message because you are subscribed to the Google Groups "revbayes-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to revbayes-user...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/revbayes-users/16eac06f-0a0f-4ebc-a9a9-188f08e6e723n%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages