Missing data in *BEAST


Roberto Márquez

Oct 16, 2013, 1:16:16 PM
to beast...@googlegroups.com
Hi All, 

I'm using *BEAST to reconstruct a species tree from 5 unlinked loci (a concatenation of 2 mtDNA genes + 5 unlinked nuclear genes), and have a couple of questions regarding missing data:

1. I'm missing complete loci for some species. *BEAST won't run if you don't give it at least one sequence of each species for every locus. This can be worked around by creating a dummy alignment (a long string of ?s or Ns), but I don't know if this would be bad for the analysis. 

Has anyone used this workaround? Does it work well? Any other (hopefully better) suggestions? 


Thanks!

Roberto Márquez
Department of Ecology & Evolution 
University of Chicago


Simon Joly

Nov 5, 2013, 9:50:17 AM
to beast...@googlegroups.com
Dear Roberto,

Theoretically, the inclusion of dummy sequences (strings of Ns of the same length as the alignment) should not affect the likelihood, and it is the way to go. I have tried this on a dataset of 10 genes for 11 species. 5 genes were sequenced for all species, but some data were missing for the remaining genes (sometimes more than one species was missing for a given gene). The problem I see with dummy sequences is that if there are too many, they might affect the mixing of the chain and convergence between runs. This is more likely if your data don't contain much information (sequence variation). In my case, to make sure it didn't affect the results, I ran several analyses including (i) only species for which I had all genes sequenced and (ii) x (x=1,2,3,...) additional species with missing data. With my data, the inclusion of dummy sequences never affected the topology of the species tree. However, I guess this could well depend on the data at hand. Therefore, if you want to make sure that the inclusion of dummy sequences doesn't affect the results, you could test it the way I did.
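
For illustration only, here is a minimal Python sketch of how such dummy sequences could be generated before building the *BEAST input. It assumes each locus is held as a plain dict mapping sequence name to aligned sequence; the function name `pad_missing_taxa` and the sequence names are hypothetical and are not part of BEAST or BEAUti.

```python
def pad_missing_taxa(alignment, all_taxa, fill="N"):
    """Return a copy of `alignment` with an all-N dummy row for each absent taxon.

    `alignment` maps sequence name -> aligned sequence (an assumed layout).
    `all_taxa` lists every sequence name that must be present in the locus.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences in an alignment must have the same length")
    (length,) = lengths

    padded = dict(alignment)
    for taxon in all_taxa:
        # Existing sequences are kept; only missing taxa get a dummy row.
        padded.setdefault(taxon, fill * length)
    return padded


# Hypothetical locus in which species C was not sequenced.
locus = {"sp_A_1": "ACGTAC", "sp_B_1": "ACGTTC"}
padded = pad_missing_taxa(locus, ["sp_A_1", "sp_B_1", "sp_C_1"])
```

The dummy row has the same length as the rest of the alignment, as described above, and the real sequences are left unchanged.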

Hope it helps,

Simon

Roberto Marquez

Nov 7, 2013, 8:34:51 AM
to beast...@googlegroups.com
Hi Simon, 
Thanks a lot for your advice, I'm definitely trying it out. 

Best, 

Roberto 

pepster

Nov 7, 2013, 12:46:35 PM
to beast...@googlegroups.com


On Thursday, October 17, 2013 6:16:16 AM UTC+13, Roberto Márquez wrote:
1. I'm missing complete loci for some species. *BEAST won't run if you don't give it at least one sequence of each species for every locus. This can be worked around by creating a dummy alignment (a long string of ?s or Ns), but I don't know if this would be bad for the analysis. 

That is the recommended workaround. You want to have at least 2 individuals for each species in *some* of the loci. Any unnecessary "dummy" sequences will slow the mixing of the chain.
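
As a rough way to check this before a run, the sketch below counts, for each species, the loci in which at least 2 individuals have real (non-dummy) sequences. It is only an illustration: the dict-based data layout, the treatment of N, ? and - as missing, and the names `is_dummy` and `loci_with_two_individuals` are assumptions, not anything provided by BEAST.

```python
from collections import Counter

MISSING = set("N?-")  # characters treated as missing data (an assumption)


def is_dummy(seq):
    """True if the sequence consists only of missing-data characters."""
    return all(ch.upper() in MISSING for ch in seq)


def loci_with_two_individuals(loci, taxon_to_species):
    """Map each species to the number of loci with >= 2 non-dummy individuals.

    `loci` is a list of alignments (dict: sequence name -> sequence).
    `taxon_to_species` maps each sequence name to its species.
    """
    counts = {species: 0 for species in set(taxon_to_species.values())}
    for alignment in loci:
        per_species = Counter(
            taxon_to_species[name]
            for name, seq in alignment.items()
            if not is_dummy(seq)
        )
        for species, n in per_species.items():
            if n >= 2:
                counts[species] += 1
    return counts


# Hypothetical two-locus dataset; B1 is a dummy sequence in the first locus.
loci = [
    {"A1": "ACGT", "A2": "ACGA", "B1": "NNNN"},
    {"A1": "ACGT", "B1": "ACGT", "B2": "ACTT"},
]
mapping = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}
counts = loci_with_two_individuals(loci, mapping)
```

A species whose count is 0 has no locus with two real individuals, which is the situation the advice above says to avoid.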

-Joseph

Roberto Marquez

Nov 7, 2013, 12:55:22 PM
to beast...@googlegroups.com
Thanks!

