RaxML incorrectly assumes sequences are entirely undetermined


Richard Thompson

Mar 29, 2012, 12:17:26 PM
to ra...@googlegroups.com
I have an input file in which a number of sequences are missing the initial 20-30 nucleotides (annotated as '-').
I've run RaxML on the full data set using CIPRES without any problems. However, I'm now trying to run RaxML (7.2.6) locally on a subset, with this command:

raxmlHPC -x 4362 -m GTRGAMMA -o outgroup -# 100 -n T3 -s testpro.G.phy

but I'm getting the following error:

'Sequence <sequence_name> consists entirely of undetermined values which will be treated as missing data'

referring to these sequences which are missing the start. I tried both raxmlHPC and raxmlHPC_PTHREADS, with the same result.

Has anyone come across this behaviour before? 
Does anyone have any suggestions as to how I might get this to work?

Thanks
Richard



Alexis

Mar 29, 2012, 12:21:15 PM
to ra...@googlegroups.com
Normally this means that at least one complete sequence consists of undetermined characters, and RAxML exits, because this is meaningless. Should this not be the case (please double-check), I assume there is a problem with the parser, so in this case please send me the alignment file via email and I will have a look.

Alexis

Richard Thompson

Mar 29, 2012, 1:05:14 PM
to ra...@googlegroups.com
Hi Alexis,

Many thanks for the quick response and I owe you an apology.

It turns out the script I had written to create the subset files had a typo, which resulted in the length of the sequences being incorrectly annotated at the start of the .phy file. I'd been looking at the sequences for the source of the error and never thought to check that. (hangs head in shame!)

Therefore RaxML was working perfectly, given it thought the sequences were only 10 nucleotides long!

Apologies and thanks
Richard
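
For anyone who lands here with the same message: a minimal sketch of a check for the mistake described above, where the header line of the .phy file disagrees with the sequences that follow. It assumes a sequential (one taxon per line) PHYLIP file with the name and sequence separated by whitespace; the function name `check_phylip_header` is made up for this illustration and is not part of RAxML.

```python
def check_phylip_header(text):
    """Compare the header of a sequential PHYLIP alignment with its contents.

    Assumption: one taxon per line, written as <name> <whitespace> <sequence>.
    Returns a list of human-readable problems (empty if none were found).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The first non-empty line holds: number of taxa, number of sites.
    n_taxa, n_sites = (int(x) for x in lines[0].split()[:2])

    records = [ln.split(None, 1) for ln in lines[1:]]
    problems = []
    if len(records) != n_taxa:
        problems.append(f"header says {n_taxa} taxa, file has {len(records)}")

    for rec in records:
        name = rec[0]
        # Drop any spaces inside the sequence before measuring its length.
        seq = "".join(rec[1].split()) if len(rec) > 1 else ""
        if len(seq) != n_sites:
            problems.append(
                f"{name}: header says {n_sites} sites, sequence has {len(seq)}"
            )
    return problems


if __name__ == "__main__":
    # Header claims 10 sites, but each sequence is 18 characters long.
    example = "2 10\nseqA ----------ACGTACGT\nseqB ACGTACGTACGTACGTAC\n"
    for problem in check_phylip_header(example):
        print(problem)
```

In the example, the first 10 characters of seqA are all gaps, which is the situation described above: a header that is too short can make a sequence with a gapped start look entirely undetermined.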

Prabh Basra

Apr 22, 2015, 2:37:55 PM
to ra...@googlegroups.com
Hi

I am having a similar problem. I have an alignment where a number of sequences are missing the initial nucleotides (annotated as '-'). I am trying to run RAxML version 8.1.5 on this data. The command I'm using is as follows

raxmlHPC -m GTRGAMMA -p 12345 -# 100 -s Alignment_24Sampleswith-l200_April17_renamed.phy -o nameofoutgroup -n April22raxmlT13

but I'm getting the following error

Sequence <nameofsequence> consists entirely of undetermined values which will be treated as missing data. 

I've checked my alignment and it's not the case that any complete sequence consists of undetermined characters.

Any ideas on what might be causing this? 

Thanks

Prabh

Alexandros Stamatakis

Apr 23, 2015, 11:23:15 AM
to ra...@googlegroups.com
Probably you just didn't look carefully; I am pretty sure that there is
no bug in the function that checks for this. Note that, besides -, the
characters N, O, X, and ? are also treated as undetermined.

Alexis

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University
of Arizona at Tucson

www.exelixis-lab.org
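
A quick way to double-check an alignment against the character list in the reply above is to test each sequence yourself. This is only a sketch of that idea, not RAxML's actual check: the name `is_entirely_undetermined` is hypothetical, and treating lower-case letters the same as upper-case is an assumption.

```python
# Characters named as undetermined in the reply above.
UNDETERMINED = set("-NOX?")


def is_entirely_undetermined(seq):
    """Return True if every character of seq is in UNDETERMINED.

    Assumptions: whitespace inside the sequence is ignored and case does
    not matter. An empty sequence also returns True.
    """
    cleaned = "".join(seq.split()).upper()
    return all(ch in UNDETERMINED for ch in cleaned)


if __name__ == "__main__":
    sequences = {
        "gapped_start": "-----ACGTACGT",  # has real data after the gaps
        "all_missing": "--NN??XX--",      # nothing but undetermined characters
    }
    for name, seq in sequences.items():
        print(name, is_entirely_undetermined(seq))
```

A sequence that only starts with gaps should not trigger this, so if it does, the file is probably being read differently from how it looks in an editor.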

Brian Foley

Apr 24, 2015, 8:46:50 AM
to ra...@googlegroups.com
> but I'm getting the following error
>
> Sequence <nameofsequence> consists entirely of undetermined values which will be treated as missing data.


That is a common error message from any phylogenetic program; I get it at least once a week. It is almost always
a problem in the input file, and some of these problems can be difficult to spot. A FASTA format file can look perfectly fine in a
text editor, but have problems with "invisible characters" such as carriage returns, line endings, or tab characters
in the file.
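
If you do want to see the invisible characters, a small script can list them. This is a rough sketch: the function name `find_invisible_characters` is made up here, and flagging every byte outside printable ASCII as suspicious is an assumption that suits plain nucleotide or protein alignments.

```python
def find_invisible_characters(data):
    """Return (line_number, column, description) for each suspicious byte.

    data is the raw file content as bytes. Lines are split on b"\n" only,
    so a stray carriage return stays on its line and gets reported.
    """
    findings = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            if byte == 0x0D:
                findings.append((line_no, col, "carriage return"))
            elif byte == 0x09:
                findings.append((line_no, col, "tab"))
            elif byte < 0x20 or byte > 0x7E:
                # Control characters and anything outside printable ASCII.
                findings.append((line_no, col, f"byte 0x{byte:02X}"))
    return findings


if __name__ == "__main__":
    # Windows line ending, a tab, and a non-breaking space (0xC2 0xA0 in UTF-8).
    sample = b">seq1\r\nACGT\tACGT\n>seq2\nAC\xc2\xa0GT\n"
    for line_no, col, what in find_invisible_characters(sample):
        print(f"line {line_no}, column {col}: {what}")
```

For a real file, read it in binary mode, e.g. `open(path, "rb").read()`, so that Python does not silently translate the line endings.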

I often do not care to find out exactly what the problem is; I just run the input file through another round of
"format conversion" from FASTA to FASTA (or Nexus, or PHYLIP format or whatever input I need) and see
if that fixes the problem. There are many ways to convert formats. One is to open the file in a program such
as BioEdit, SeAl, AliView, MEGA, DAMBE, etc., and then "save as" in the format you need. Another is to use
an online multiple sequence alignment format converter such as:
http://www.hiv.lanl.gov/content/sequence/FORMAT_CONVERSION/form.html
http://www.ebi.ac.uk/Tools/sfc/readseq/
etc...
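
The same round-trip idea can be sketched in a few lines of Python, for cases where a GUI or web tool is not handy. This is an illustration only, not a replacement for the tools listed above: the function name `fasta_to_phylip` is made up, it writes a simple one-taxon-per-line PHYLIP layout, and keeping only the first word of each FASTA header as the taxon name is an assumption.

```python
def fasta_to_phylip(fasta_text):
    """Re-write a FASTA alignment as sequential PHYLIP text.

    Whitespace inside sequences (including tabs and carriage returns) is
    dropped, and the header counts are computed from the data itself.
    """
    records = []  # [name, sequence] pairs, in file order
    for raw in fasta_text.splitlines():  # also splits on \r\n and bare \r
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            # Assumption: the first word of the header is the taxon name.
            name = words[0] if words else f"seq{len(records) + 1}"
            records.append([name, ""])
        elif records:
            records[-1][1] += "".join(line.split())

    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError(f"sequences are not aligned; lengths: {sorted(lengths)}")

    n_sites = lengths.pop()
    out = [f"{len(records)} {n_sites}"]
    out += [f"{name} {seq}" for name, seq in records]
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    messy = ">a some description\r\nAC\tGT\r\n--\r\n>b\r\nACGT\r\nAC\r\n"
    print(fasta_to_phylip(messy), end="")
```

Because the taxon and site counts in the first line are derived from the sequences, the output cannot carry over a stale or mistyped header.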

Another good practice is to always have a simple "test file" on hand which you know is in the right
format, to troubleshoot whether the problem is in the infile or in the software.