EPA: error while running command for .PARAMS

29 views
Skip to first unread message

haris zaf

unread,
Mar 31, 2018, 8:51:53 AM3/31/18
to raxml
While running the command below:

raxmlHPC -f e -m GTRGAMMA -s aligned_silva_with_Ns_final.fasta -t my_tree.ntree -n PARAMS

I got this error:

raxmlHPC-PTHREADS-AVX: treeIO.c:1409: treeReadLen: Assertion `0' failed.
Aborted


Could anyone tell me what went wrong?

Thank you in advance!

Alexandros Stamatakis

unread,
Apr 3, 2018, 3:22:28 AM4/3/18
to ra...@googlegroups.com
looks like a problem with the input tree format, could you provide the
tree file?

Also, there is a new version of EPA available now, which I'd recommend
you should start using:

https://www.biorxiv.org/content/early/2018/03/29/291658

Alexis
> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.exelixis-lab.org

haris zaf

unread,
Apr 3, 2018, 8:12:22 AM4/3/18
to ra...@googlegroups.com

here is the tree I used. as alignment I used the file SILVA_132_SSURef_Nr99_tax_silva_full_align_trunc.fasta from silva exports: https://www.arb-silva.de/no_cache/download/archive/current/Exports/
thank you for the new version as well.

Haris

haris zaf

unread,
Apr 3, 2018, 8:18:00 AM4/3/18
to ra...@googlegroups.com
The whole command was:


raxmlHPC -f e -m GTRGAMMA -s aligned_silva_with_Ns_final.fasta -t my_tree.ntree -n PARAMS

with the files i mentioned in the previous message. the only difference between aligned_silva_with_Ns_final.fasta and SILVA_132_SSURef_Nr99_tax_silva_full_align_trunc.fasta is that i replaced "." with "N" as
RAxML suggested.

Thank you very much for your time once more!

Haris

Lucas Czech

unread,
Apr 3, 2018, 9:29:17 AM4/3/18
to ra...@googlegroups.com

Hi Haris,

the tree file might contain some parts that do not work in RAxML:

  • The file starts with a Newick comment, enclosed in square brackets "[...]". Remove this part, so that the first line is the one with parenthesis "(((...".
  • Furthermore, the tree contains node labels of the form "'FPLN01005196, metagenome'", that is, with single quotation marks, commas, and spaces. All this might be the reason for RAxML to fail. Try removing them, for example using sed: cat my_tree.ntree | sed "s/'//g" | sed "s/[, ]/_/g" > clean_tree.newick
    This of course has to be applied to the alignment as well.

If this still does not work, please send your cleaned tree again, so that we can look for any remaining issues.

Best
Lucas

--
You received this message because you are subscribed to the Google Groups "raxml" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raxml+un...@googlegroups.com.

Lucas Czech

unread,
Apr 3, 2018, 9:43:22 AM4/3/18
to ra...@googlegroups.com

Sorry, the situation seems a bit more complex. First, my sed command contained a little error and removed too many commas. But furthermore, there are also ":" in the taxon labels, which make it hard to just remove the quotation marks around them. And lastly, there are duplicate sequences in the tree, for example "CP003297, Escherichia coli O104:H4 str. 2009EL-2050".

How did you obtain this tree? It seems a bit messy. By using some sed-regex-magic, it should be possible to get rid of the conflicting characters, but the duplicates are still suspicious.

Lucas

haris zaf

unread,
Apr 3, 2018, 1:32:13 PM4/3/18
to ra...@googlegroups.com
Hi Lucas!

This is a tree I asked SILVA for. It is a tree that is corresponding to the alignment of the RefSeq last version of SILVA database. I will try to do the manipulations you said and I will reply as soon as i get a result!

To unsubscribe from this group and stop receiving emails from it, send an email to raxml+unsubscribe@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "raxml" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raxml+unsubscribe@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "raxml" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raxml+unsubscribe@googlegroups.com.

haris zaf

unread,
Apr 3, 2018, 1:33:56 PM4/3/18
to ra...@googlegroups.com
In case of the duplicates, do you have any suggestions about how could we manipulate them?

haris zaf

unread,
Apr 3, 2018, 2:12:59 PM4/3/18
to ra...@googlegroups.com
Here is the tree with the changes you suggested. I am not sure if I understood them all but i tried to include them all.
However, i did not do anything about the duplicates..

Thank you very much for your time Lucas!

Lucas Czech

unread,
Apr 5, 2018, 10:19:37 AM4/5/18
to ra...@googlegroups.com

Hi Haris,

hm, I don't think that just removing taxa from the tree is a good approach. If you want to run EPA using this tree, you will need the alignment of the sequences in the tree as well. Thus, the sequences and the taxa of the tree have to be the same. That is, for each sequences in the alignment, there needs to be exactly one taxon (terminal branch) in the tree - and all of them need to have unique names. The default pipeline is to have a set of "reference" sequences that you want to use, which you then align to each other, and then use those to infer a tree. This way, consistency is ensured.

Now, in your case, inferring a tree that large from the full alignment is a bit too much, with roughly 600k sequences... So that is not an option. As a side remark: I'm not sure why you actually want to have a tree this big as your reference tree. This will give you a lot of trouble with further analyses, memory requirements for EPA, visual inspection, etc. Is there a particular reason that you want a tree with ALL Silva sequences? Usually, people use a subset of those, namely the ones they are interested in and which they expect to be related to the sequences that they want to place on that tree. (Alternatively, you can wait a few days for our new paper on automatic reference trees, in which we use Silva to do an automatic selection of reference sequences for building reference trees - out soon!)

Long story short:

  1. Why do you need a tree that large? Maybe a smaller tree specific for the dataset that you want to place would be better.
  2. If you _do_ need it, do you have access to a computing cluster than can run EPA and other analyses with such large trees?
  3. As for your actual question of how to get rid of duplicates: Either remove them in the alignment and infer a new tree, or ask the people from whom you got the tree why they have duplicates, and ask them how to remove them...

Hope that helps
Lucas




On 04/05/2018 03:58 PM, haris zaf wrote:
Hi Lucas!

Sorry bothering you again!
I am quite new and I try to make my tree right but I have some duplicates.
Do you know what is the best approach in order to remove them?

Thank you very much for your time.

Haris
To unsubscribe from this group and stop receiving emails from it, send an email to raxml+un...@googlegroups.com.

haris zaf

unread,
Apr 5, 2018, 10:39:26 AM4/5/18
to ra...@googlegroups.com
Our thought was:  

We have edna samples and we need to do metabarcoding analysis.
In order to get the taxonomy assignment, we use EPA and SILVA RefSeq alignment and the tree that comes out of that as reference.
We do have access in a cluster that will allow us to do so.

I am thinking of inferring a new tree. I could do that through RAxML couldn't I? Do you think that should I do so or do you suggest another approach?

Thank you very much for all your help!!

Haris

Pierre Barbera

unread,
Apr 5, 2018, 11:46:00 AM4/5/18
to ra...@googlegroups.com

Hi Haris,

you could infer a tree from the SILVA alignment, yes, but making a tree from that order of magnitude of sequences will not only take a substantial time and compute resource investment, but will almost certainly cause a lot of down stream tools to fail. Even if in the end you manage to place your eDNA sequences against something like a 600k taxa tree, you would have a hard time visualizing, or even post processing the results.

I would say in this case the technique that Lucas mentioned he will soon publish could be your best bet, if you insist on using the full SILVA tree. In essence, it does a multi-stage placement, where it first breaks down the tree into an inner backbone tree, against which you can feasibly perform placement. Then, if a sequence was placed on a branch where the full tree had been pruned, placement recurses into that particular sub-tree.

However be somewhat weary of the quality of results: this is essentially a heuristic; a shortcut. I would view this as a pre-study to find out what kind of genera to include in your final reference tree, against which you do your actual placement. This could be a more rigorous approach to what is typically (to my impression) done for the type of study you are attempting: selecting a set of reference taxa by hand, using knowledge from literature. Examples: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0037818 and more recently https://www.nature.com/articles/s41559-017-0091 (The reference sizes there are: 797 taxa / 2763 sites, and 512 taxa / 3374 sites).

Another thing to consider is: to what detail do you want your result to be? If you just want to classify your eDNA sequences to genus level, then you need a lot fewer of those SILVA sequences. Probably even if you just pick one representative sequence per species the database will shrink significantly. Additionally, we found that placement has a very hard time distinguishing between strains, especially for something like 16S.

Lastly: consider looking at https://github.com/lczech/genesis for possible (and blazingly fast :) ) post analysis of placements. We are also currently developing a program wrapping the most common post analysis steps (like taxonomic assignment!) in a new tool called gappa, but it is not well documented yet.

Happy Placement,
Pierre
To unsubscribe from this group and stop receiving emails from it, send an email to raxml+un...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

-- 
MSc Pierre Barbera

Phone: +49 6221 533 258
Fax: +49 6221 533 298
E-Mail: pierre....@h-its.org

HITS gGmbH
Schloss-Wolfsbrunnenweg 35
D-69118 Heidelberg
Amtsgericht Mannheim / HRB 337446
Managing Director: Dr. Gesa Schönberger
Scientific Director: Prof. Dr. Michael Strube

haris zaf

unread,
Apr 7, 2018, 9:53:08 AM4/7/18
to ra...@googlegroups.com
Hi Pierre!

Thank you very much for all this information! It is more than appreciated!

In fact, I think I will both wait for the new paper and the procedures you suggest! I think there is quite a difficulty in order to get a tree that you can actually rely on it...

Thanks again,
Haris


-- 
MSc Pierre Barbera

Phone: +49 6221 533 258
Fax: +49 6221 533 298
E-Mail: pierre....@h-its.org

HITS gGmbH
Schloss-Wolfsbrunnenweg 35
D-69118 Heidelberg
Amtsgericht Mannheim / HRB 337446
Managing Director: Dr. Gesa Schönberger
Scientific Director: Prof. Dr. Michael Strube

  

--
Reply all
Reply to author
Forward
0 new messages