Papara

Juan Enciso

unread,

Jul 2, 2012, 5:02:15 AM7/2/12

to ra...@googlegroups.com

Hello,

I plan to use Papara to identify several metagenomic reads with eukaryotic markers derived from automatic orthologous clustering. When I try to run Papara i feed it with a reference alignment (phylip format), a tree file calculated using RAxML (bipartitions tree in newick format), and the query sequences that I want to map (fasta). When I run it, an error is printed in console: header: 17 248

papara: ivy_mike/src/multiple_alignment.cpp:60: bool ivy_mike::multiple_alignment::load_phylip(std::istream&): Assertion `n == nTaxon' failed.

Aborted. Can anyone please tell me the possible cause and how to overcome it? Thanks in advance.

Juan Enciso

unread,

Jul 2, 2012, 3:06:21 PM7/2/12

to ra...@googlegroups.com

I've already checked and both, reference tree and alignment have the same number of taxa. Anyone have used papara before? any ideas?

Simon Berger

unread,

Jul 3, 2012, 4:33:52 AM7/3/12

to ra...@googlegroups.com

Hello Juan,

the problem is, that papara currently can not read 'interleaved' phylip files. Fixing this is on my TODO list, but I had no time to do it, yet. An easy fix is to convert the file with a script. I can send you the converted file and/or the script if you like (can I post your data here publicly?).

There is another problem: one of your query sequence contains a 'X'. I'm not sure how to interpret this in AA data. Should the X be removed as 'missing data'?

Simon

Juan Enciso

unread,

Jul 3, 2012, 6:34:10 AM7/3/12

to ra...@googlegroups.com

Hello Simon,

I see no problem if we manage my data through here. I think it's better if you send me the script and its usage instructions since there are several files to fix. Should I replace ALL the X's (in the alignments as well as in the queries) for gaps?

Juan.

Juan Enciso

unread,

Jul 3, 2012, 6:42:33 AM7/3/12

to ra...@googlegroups.com

Hello Simon,

I'm sending you one of my alignments just to be shure that you confirm that the interleaved format is the troubling fact and anything else is wrong.

Juan.

txt_my_prefix1448.fasta.phylip

Simon Berger

unread,

Jul 3, 2012, 9:34:24 AM7/3/12

to ra...@googlegroups.com

Hello Juan,

I was able to align the sequences with the two fixes that I mentioned above. I've attached the uninterleaved phylip file (deleaf.phy), and the corresponding papara output (papara_alignment.default). For the fasta file, just remove the single X is enough.

I've also attached the ruby script that i've used to fix the phylip file. It reads the input from STDIN and writes the result to STDOUT (i.e., to convert file x.phy into y.phy use 'ruby phy_deleaf.rb < x.phy > y.phy'.

Note that we do not have much experience with papara on protein data. Especially, the default parameters were tuned on rRNA, but still the result on your data looks ok to me.

Simon

deleaf.phy

papara_alignment.default

phy_deleaf.rb

Juan Enciso

unread,

Jul 3, 2012, 12:02:14 PM7/3/12

to ra...@googlegroups.com

Hello Simon,

After trying to run papara with the suggested fixes I'm still having trouble, now the program prints the following error

papara: stepwise_align.h:695: void pvec_aligner_vec<score_t, W>::align(biter, biter, score_t, score_t, score_t, score_t, oiter) [with biter = __gnu_cxx::__normal_iterator<const unsigned char*, std::vector<unsigned char> >, oiter = __gnu_cxx::__normal_iterator<int*, std::vector<int, ab_internal_::alloc<int, 4096ul> > >, score_t = int, long unsigned int W = 4u]: Assertion `av_size >= block_width * W' failed.

Aborted

I'm really embarrassed that everything seems to work fine to you but I can't manage to achieve it and I still have to run it for like 14 alignments more. Thanks, I really appreciate your time.

Juan

Simon Berger

unread,

Jul 4, 2012, 4:50:20 AM7/4/12

to ra...@googlegroups.com

uuuh, sorry this is a 'false positive' error caused by a too strict error check in the the original papara 2.0 release. I've fixed it long ago, so I didn't notice the problem with my version. I'll have to prepare a new source release that fixes all the problems. In the meantime you could get the current version from github with the following commands (you'll need to install the 'git' packet):

git clone git://github.com/sim82/papara_nt.git

cd papara_nt
sh prepare_extern.sh

sh build_papara2.sh

If the above procedure fails for some reason, you can as an alternative edit the file 'stepwise_align.h' from the original papara 2.0 source code and remove line 695 (it is the one that reads 'assert( av_size >= block_width * W ); // the code below should handle this case, but is untested'). Then run 'sh build_papara2.sh' again.

Sorry for the confusion...

Simon

Juan Enciso

unread,

Jul 4, 2012, 5:27:40 PM7/4/12

to ra...@googlegroups.com

Hello Simon,

I was unable to get the current version of papara using git (connection always timed out), however I silenced (//) the line 695 in the 'stepwise_align.h' file, ran the build script and tried to run papara again. Bad news, I'm still getting errors:

papara_nt instantiated as: references<(anonymous namespace)::pvec_pgap, sequence_model::tag_aa>

...

edges: 31

scoring scheme: -3 -1 2 -3

start scoring, using 3 threads

terminate called after throwing an instance of 'std::runtime_error'

what(): missing sse4.1

uncaught std::runtime_error in ivy_mike::thread: Aborted

I think this is out of our hands since it seems to be related directly to the processor architecture. The machine I use to run things has an 8-core Intel Xeon processor. If you have any ideas of what else might be happening please write to me. I would also like to have the new version of papara, if there's another way different from git to get it please tell me. Thank you again.

Best regards,

Juan.

Simon Berger

unread,

Jul 5, 2012, 7:51:24 AM7/5/12

to ra...@googlegroups.com

Hello Juan,

ok, that's another protein (and processor) specific problem. As a quick fix, please remove line 356 in vec_unit.h (it starts with throw std::runtime_error....) and replace it with the following 2 lines:

const vec_t ma = _mm_cmpgt_epi32( a, b );

return _mm_or_si128( _mm_and_si128( ma, a ), _mm_andnot_si128( ma, b ) );

sorry if this is getting annoying, but this is basically the first real protein test-case for the software... once this is all settled, I'll prepare a new release. Thanks for your help!

Simon

Simon Berger

unread,

Jul 5, 2012, 8:43:44 AM7/5/12

to ra...@googlegroups.com

Hi Juan,

in the meantime I've prepared a new papara release (version 2.2): http://dl.dropbox.com/u/20749046/papara_nt_20120705.tar.gz

This should fix all problems (it also automatically ignores the X characters in the query sequences). Please let me know if this works.

Thanks, Simon

Juan Enciso

unread,

Jul 5, 2012, 10:20:36 AM7/5/12

to ra...@googlegroups.com

Hello Simon,

version 2.2 of papara seems to be working perfectly, thank you so much for your help. If i have any further issue and or question I'll let you know.

Juan.

Juan Enciso

unread,

Jul 5, 2012, 11:26:32 AM7/5/12

to ra...@googlegroups.com

Hello,

just another question, papara ran several alignments successfully but for others it still complained about some illegal characters. I'm attaching you the query and the alignment for one of those so you can check it out and tell me what do I have to adjust/fix on it. I think stop codons (*) might be the problem but then again, you can detect the exact failure.

Here's the error as printed on console

illegal character: 42

terminate called after throwing an instance of 'std::runtime_error'

what(): illegal character in protein sequence

Thanks,

Juan.

I_1452.fasta

FIX_txt_my_prefix1452.fasta.phylip

Simon Berger

unread,

Jul 5, 2012, 12:35:06 PM7/5/12

to ra...@googlegroups.com

Hello Juan,

you are right, the number (42) in the error message is the ascii code of the bad character (42='*'). I think that, like the X character, it should be ignored in this setting. I should probably change the approach and simply ignore _all_ characters that do not directly correspond to amino acids, instead of crashing... I'll look into this tomorrow. In the meantime I can give you another quick fix for this specific case: Replace the file sequence_model.cpp with the one I've attached.

Simon

sequence_model.cpp

Juan Enciso

unread,

Jul 5, 2012, 1:07:33 PM7/5/12

to ra...@googlegroups.com

Hello Simon,

only as a suggestion, it would be useful for the program client to run successfully the program being warned of which characters are going to be ignored as this might be having some biological effect in the results and should be taken into account in the analysis and discussion of whatever problem they're trying to solve. I hope that all the feedback I've been giving is useful for you.

Juan.

Simon Berger

unread,

Jul 6, 2012, 12:02:19 PM7/6/12

to ra...@googlegroups.com

Hello Juan,

yes, that sounds like a good idea. I'll implement that next week.

Simon

Simon Berger

unread,

Jul 7, 2012, 7:54:02 AM7/7/12

to ra...@googlegroups.com

Hello Juan,

I found out that there were quite a few problems with the way the unknown characters were handled in the papara version I sent you (It was so embarrassing that it could not wait for monday...). I've just uploaded a fixed version, which now also prints out warnings about the unknown characters: http://dl.dropbox.com/u/20749046/papara_nt_20120707.tar.gz

Thanks again for your input,

Simon

Juan Enciso

unread,

Jul 9, 2012, 12:13:54 AM7/9/12

to ra...@googlegroups.com

Hello Simon,

there were no apparent issues with the version you shared previously, I'll try this new improved one and let you know how it works for me.

Juan.

Simon Berger

unread,

Jul 10, 2012, 5:32:19 AM7/10/12

to ra...@googlegroups.com

Hello Juan,

the problem with the previous version was that it did not consistently ignore the X characters, which might have caused slightly undefined results...

Simon

Reply all

Reply to author

Forward