Error running PaPaRa "illegal character in DNA/RNA sequence"

183 views
Skip to first unread message

Roey Angel

unread,
Jun 10, 2014, 8:13:37 AM6/10/14
to ra...@googlegroups.com
Hi,
Trying to run PaPaRa using the following command:

papara -t SSURefNR99_1200_slv_115_nogroup.tree -s SILVA115_bac.DNA.phylip -q HiPlants.final.fasta -j 20

I get:

header: 418498 20227
illegal character: 46
terminate called after throwing an instance of 'std::runtime_error'
  what():  illegal character in DNA/RNA sequence
zsh: abort (core dumped)  papara -t SSURefNR99_1200_slv_115_nogroup.tree -s SILVA115_bac.DNA.phylip -q
papara -t SSURefNR99_1200_slv_115_nogroup.tree -s SILVA115_bac.DNA.phylip -q   16.46s user 11.20s system 19% cpu 2:23.40 total

Seems pretty self explanatory except that I don't understand where the problem lies.
Is it an illegal character in the ref file or the query file and is it really an "illegal character" or just a wrong format.

My reference file is the Silva 115 alignment, bacteria only, exported from Arb and then converted to phylip using fasta2phylip.pl (I also changed all U's to T's). My data contains no N's but the reference file has some.

Can PaPaRa handle N's?

It would be helpful to be able to get the name of the file which is causing the problem and the line number.

Thanks in advance,
Roey

Alexandros Stamatakis

unread,
Jun 10, 2014, 10:48:25 AM6/10/14
to ra...@googlegroups.com, Simon Berger
Hi Roey,

Unfortunately, Simon who developed this code is not at the lab any more.

He might still help you, but it could take a while.

In the meantime, do you have the latest PaPaRa version from Simon's github:

https://github.com/sim82/papara_nt

the code is described here:

http://sco.h-its.org/exelixis/pubs/Exelixis-RRDR-2012-5.pdf

Alexis


On 06/10/2014 02:13 PM, Roey Angel wrote:
> Hi,
> Trying to run PaPaRa using the following command:
>
> papara -t SSURefNR99_1200_slv_115_nogroup.tree -s SILVA115_bac.DNA.phylip
>> -q HiPlants.final.fasta -j 20
>>
>
> I get:
>
> header: 418498 20227
>> illegal character: 46
>> terminate called after throwing an instance of 'std::runtime_error'
>> what(): illegal character in DNA/RNA sequence
>> zsh: abort (core dumped) papara -t SSURefNR99_1200_slv_115_nogroup.tree
>> -s SILVA115_bac.DNA.phylip -q
>> papara -t SSURefNR99_1200_slv_115_nogroup.tree -s SILVA115_bac.DNA.phylip
>> -q 16.46s user 11.20s system 19% cpu 2:23.40 total
>>
>
> Seems pretty self explanatory except that I don't understand where the
> problem lies.
> Is it an illegal character in the ref file or the query file and is it
> really an "illegal character" or just a wrong format.
>
> My reference file is the Silva 115 alignment, bacteria only, exported from
> Arb and then converted to phylip using fasta2phylip.pl
> <http://indra.mullins.microbiol.washington.edu/perlscript/docs/Sequence.html>
> (I also changed all U's to T's). My data contains no N's but the reference
> file has some.
>
> Can PaPaRa handle N's?
>
> It would be helpful to be able to get the name of the file which is causing
> the problem and the line number.
>
> Thanks in advance,
> Roey
>

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University
of Arizona at Tucson

www.exelixis-lab.org

Simon Berger

unread,
Jun 11, 2014, 2:34:12 PM6/11/14
to ra...@googlegroups.com
Hello Roey,

according to the error message, one of your sequence files contains a dot ('.') character in one of the sequences (It would interpret all N and ? characters as gaps and U as T, to answer your other question). From the output I can not tell which of the files contains the character, but I suspect it is the phylip file.
Printing the file and line number is a good idea, and I'll look into it if I can easily implement it. Also, if there is a sensible mapping for the dot character I could also integrate that.

Regards, Simon

Simon Berger

unread,
Jun 11, 2014, 5:04:47 PM6/11/14
to ra...@googlegroups.com
Hi Again,

I've prepared a new papara version which treats '.' as a gap character and has an improved error messages in case if illegal characters. Supporting . as gap seems like the 'right thing to do' (TM), at least I think I remember having seen it in a couple of alignments. If there is a better suggestion, I can still change it.
Alexi: If you want, you can officially put it on the lab page as papara v2.4

Regards, Simon

Alexandros Stamatakis

unread,
Jun 12, 2014, 5:01:31 AM6/12/14
to ra...@googlegroups.com
Dear Simon,

Thank you so much. I personally appreciate it very much that you are
still helping us on this :-)

Thanks again,

Alexis
>>> <http://indra.mullins.microbiol.washington.edu/perlscript/docs/Sequence.html>
>>> (I also changed all U's to T's). My data contains no N's but the reference
>>> file has some.
>>>
>>> Can PaPaRa handle N's?
>>>
>>> It would be helpful to be able to get the name of the file which is
>>> causing the problem and the line number.
>>>
>>> Thanks in advance,
>>> Roey
>>>
>>>
>

Roey Angel

unread,
Jun 12, 2014, 5:01:48 AM6/12/14
to ra...@googlegroups.com
Hi Simon,
Thanks for updating the tool so quickly, it seems to be running without an error for me now (though I haven't fully tested it yet).
As mentioned, the ref alignment is the Silva 115. It contains, '-' as gaps and '.' at the ends of sequences to represent non-sequenced regions. They did this to represent a difference between a true gap and simply regions which were not sequenced. I'm not experienced enough in the nuts and bolts of MSA to say whether it's a good idea to treat all '.' as gaps.
Same goes for treating N as a gap. Why don't you treat N's and ? simply as wildcards?

Cheers,
Roey

Alexandros Stamatakis

unread,
Jun 12, 2014, 11:53:15 AM6/12/14
to ra...@googlegroups.com
Hi Roey,

> Thanks for updating the tool so quickly, it seems to be running without an
> error for me now (though I haven't fully tested it yet).
> As mentioned, the ref alignment is the Silva 115. It contains, '-' as gaps
> and '.' at the ends of sequences to represent non-sequenced regions. They
> did this to represent a difference between a true gap and simply regions
> which were not sequenced. I'm not experienced enough in the nuts and bolts
> of MSA to say whether it's a good idea to treat all '.' as gaps.
> Same goes for treating N as a gap. Why don't you treat N's and ? simply as
> wildcards?

It makes sense to represent this as N and . from the point of view of
the Silva developers. However, for all downstream analyses using PaPaRa
and the EPA, Ns and . are simply treated in the same way as undetermined
characters.
For likelihood calculations gaps are treated as undetermined characters
(see previous discussions on this topic in this google group).

Cheers,

Alexis
>>>> <http://indra.mullins.microbiol.washington.edu/perlscript/docs/Sequence.html>
>>>> (I also changed all U's to T's). My data contains no N's but the reference
>>>> file has some.
>>>>
>>>> Can PaPaRa handle N's?
>>>>
>>>> It would be helpful to be able to get the name of the file which is
>>>> causing the problem and the line number.
>>>>
>>>> Thanks in advance,
>>>> Roey
>>>>
>>>>
>

Simon Berger

unread,
Jun 12, 2014, 4:43:52 PM6/12/14
to ra...@googlegroups.com
Hi Roey,

as Alexi has written, -, N, ? and '.' are all treated as undetermined data. In papara this has the result that any query-sequence character can match equally well against these undetermined positions. This means that they are already the same as wildcards in a certain sense.
On the other hand there is the gap signal, that penalizes alignment against any 'gappy' alignment column (including ones containing N, ? and '.'). This works against (or balances out) their wildcard-like behavior (otherwise papara would tend to align query-sequences against large gappy blocks).

In certain cases there might be a benefit of having real 'wildcards' that, on one hand, match against all query-sequence characters but, on the other hand, do not penalize alignment as real gaps do. This should be fairly easy to implement, but as always with sequence-alignment it will be hard to evaluate objectively.

Regards, Simon

Simon Berger

unread,
Jun 12, 2014, 4:49:33 PM6/12/14
to ra...@googlegroups.com
Hi Alexi,

my pleasure! Also it's nice to work on a project with less than 3 quadrillion lines of code for a change ;-)

Simon

Luis

unread,
Jul 17, 2014, 11:00:15 AM7/17/14
to ra...@googlegroups.com
Hi Simon,

I have been using EPA code through RAxML (-f v option). 
I run the analysis with fossil data and a reference tree, and I've got a fabulous tree with the query inserted onto the most likely branch.
However, I would like to explore also the second or third most likely positions of the fossil in the tree, is that possible to obtain?
Using parsimony, I managed to get several trees with the most parsimonius places. But, I would like to implement your code to do that, as I found it very interesting when you applied it to Juglandaceae.

Cheers

Luis

Alexandros Stamatakis

unread,
Jul 17, 2014, 11:06:21 AM7/17/14
to ra...@googlegroups.com
Dear Luis,

The EPA will generate an output-file called .jplace if I remember
correctly that also contains alternative placements of queries with
lower likelihood weights.

The format of this output-file is specified in:

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0031009

If you need further help, let me know.

Alexis
Reply all
Reply to author
Forward
0 new messages