raxml trees for ASTRAL

857 views
Skip to first unread message

chris blair

unread,
Oct 15, 2017, 9:23:08 AM10/15/17
to raxml
Hi all, 

I would like to estimate ML trees for downstream species tree analysis in ASTRAL and have a few questions.

1.  My first question is if I should now switch to RAxML-NG, or stick with RAxML until additional bugs are worked out. 

2.  The data consist of 500 alignments. However, each alignment consists of different individuals with all missing data. These should probably be removed prior to RAxML. I'm sure it's possible to write a script for this, but I'm not sure how.

3.  As I cannot find an ASTRAL discussion group, any recommendations from users who used RAxML for their gene trees would be welcome. 

Thanks everyone.

Chris

Alexandros Stamatakis

unread,
Oct 16, 2017, 5:51:12 AM10/16/17
to ra...@googlegroups.com
Dear Chris,

> I would like to estimate ML trees for downstream species tree analysis
> in ASTRAL and have a few questions.
>
> 1.  My first question is if I should now switch to RAxML-NG, or stick
> with RAxML until additional bugs are worked out.

I'd use RAxML-NG as its faster and pretty stable already.

> 2.  The data consist of 500 alignments. However, each alignment consists
> of different individuals with all missing data. These should probably be
> removed prior to RAxML. I'm sure it's possible to write a script for
> this, but I'm not sure how.

Maybe you can find a local geek for help?

> 3.  As I cannot find an ASTRAL discussion group, any recommendations
> from users who used RAxML for their gene trees would be welcome.

Haven't used ASTRAL, but I'd just do standard ML inference and
bootstraps and maybe also apply the bootstrap convergence criterion.

Alexis

>
> Thanks everyone.
>
> Chris
>
> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.exelixis-lab.org

Simey

unread,
Oct 25, 2017, 5:17:05 PM10/25/17
to raxml
I have a very similar question:

I am writing a script to automate RAxML analyses (ML & bootstrap) on more than 600 UCE alignments to be used in Astral. Every alignment has duplicate sequences. I want to include in my script a way for raxml to write a.reduced phylip file for each alignment without actually running a raxml analysis for every alignment. I can then include these .reduced files into my pipeline. Alternatively, have raxml automatically use the .reduced file if one is generated. I could not find such options in the manual or in the –h list.


Chris Blair, I will post the script in this thread when done. It sounds like we are doing the same thing.

chris blair

unread,
Oct 25, 2017, 10:16:03 PM10/25/17
to raxml
There appears to be quite a bit of new work highlighting some of the limitations of gene tree reconciliation approaches like ASTRAL. It seems like the performance of concatenated ML analysis and coalescent analyses in svdquartets often produce more reliable and consistent results. Part of the reason has to do with the difficulty in estimating accurate gene trees with highly conserved loci. I don't really have a question here, but perhaps this will spark some further discussion about how to use these methods for species tree analysis. I haven't come across any papers looking at the effect of including identical sequences during gene tree construction on subsequent species tree inference. 

Another issue to think about is practicality. It is quite tedious to manually inspect and edit hundreds to thousands of individual alignments in hopes of generating semi-accurate gene trees. This involves removing taxa with all Ns and possibly removing taxa with limited sequence information that would bias gene tree reconstruction. I hope someone can convince me otherwise, but the amount of work that needs to go into processing the data does not seem to be worth the result, especially since concatenated ML and other coalescent analyses appear to provide better results. 

Chris

Alexandros Stamatakis

unread,
Oct 26, 2017, 4:43:45 AM10/26/17
to ra...@googlegroups.com
There is no option for this, but a work-around:

you can specify -y which will only comute a parsimony starting tree on
the full alignment, and then produce a reduced one, this will be fairly
faced and produce a reduced file,

alexis
> > an email to raxml+un...@googlegroups.com <javascript:>
> > <mailto:raxml+un...@googlegroups.com <javascript:>>.
> > For more options, visit https://groups.google.com/d/optout
> <https://groups.google.com/d/optout>.
>
> --
> Alexandros (Alexis) Stamatakis
>
> Research Group Leader, Heidelberg Institute for Theoretical Studies
> Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
>
> www.exelixis-lab.org <http://www.exelixis-lab.org>

Alexey Kozlov

unread,
Oct 26, 2017, 7:10:44 AM10/26/17
to ra...@googlegroups.com
with old RAxML, I was using the "-f c" option for this ("check alignment")

in RAxML-NG there is no equivalent option yet, but I'll add it in the next release

regarding automatical removal of duplicates I'm not sure it's such a good idea, since it's kind of arbitrary which
sequences will be removed...

Bryan McLean

unread,
Oct 26, 2017, 8:58:40 AM10/26/17
to raxml
Chris, Simey, et al.,

A couple of thoughts RE:ASTRAL etc.

1) ASTRAL can accept multi-individual (i.e. multiple individuals sampled per OTU) gene trees for analysis. Beware that version must be downloaded from multi-individual branch (https://github.com/smirarab/ASTRAL/tree/multiind). However, that is the version you want if this is what you mean by duplicate sequences, and thus no need to develop code to prune fasta files and may even increase prob. of observing coalescent events.

2) There is definitely danger in using low-information gene trees in summary coalescent analyses. I myself have gotten some "weird" results. However, this is actually information about the dataset and real biological processes that have generated it. So I would caution against preferring only single method based on better results and instead draw conclusions from mutliple analyses. SVDquartets does have some desirable aspects but will not necessarily yield a well-supported phylogeny if heterogeneity among genes (or SNPs) is high.

3) Side note: For specific questions about ASTRAL, I have had good luck emailing the author.

Best,
Bryan

Tomáš Fér

unread,
Oct 26, 2017, 9:24:43 AM10/26/17
to raxml
Dear Simey,
I wrote some scripts that worked around this issue - removing samples with too much missing data and checking whether there are invariable sites and identical sequences in the alignment using the option -y (producing parsimony tree and reduced alignment). My scripts are implemented in my HybPhyloMaker pipeline (https://github.com/tomas-fer/HybPhyloMaker), especially in the scripts HybPhyloMaker5_missingdataremoval.sh and HybPhyloMaker6a_RAxML_for_selected.sh.
You can reuse it if you find it useful.
Best,
Tomas


Dne čtvrtek 26. října 2017 10:43:45 UTC+2 Alexis napsal(a):

Simey

unread,
Nov 17, 2017, 2:14:39 PM11/17/17
to raxml
The following are four scripts we wrote to run Astral using RAxML trees generated from UCE data produce in the Phyluce pipeline. I am currently learning python and working on rewriting this in python. Note that it works in our environment and has some hard coded directories etc, so it would need to be edited for anyone else to use. 
#####################################################################################
# The following are 4 scripts required to run Astral for UCE data generated in a Phyluce pipeline.              #
# only two of the four files are manually executed ('astral_prep.sh' and 'astral_run.sh').                               #
# The first script, 'astral_prep.sh', is a shell script that:                                                                                  #
# 1.  creates directories for each UCE alignment                                                                                           #
# 2.  calls an R sript, 'RCmds'                                                                                                                        #
#                                                                                                                                                                     #
# 3.  RCmds uses 'ips' and 'parallel' libs to convert formats and run many instances of RAxML in parallel  #
# 4.  RCmds' also calls a third script, 'run_RAxML.sh', which launches parallel RAxML analyses                #
#                                                                                                                                                                     #
# 5.  The fourth and final script, 'astral_run.sh':                                                                                             #
# 6.  creates new directories                                                                                                                          #
# 7.  moves the RAxMl trees into one of the directories                                                                                #
# 8.  merges all of the trees into a single file                                                                                                 #
# 9.  moves the bootstrap files into a directory                                                                                              #
# 10. creates a file with a list of paths pointing to each bootstrap file                                                           #
# 11. runs Astral.                                                                                                                                           #
####################################################################################

###### astral_prep.sh


#!/bin/bash
#make directories for the phylip files with name of uce
for f in uce*.nexus; do
    dir=$(basename $f .nexus)
    [ ! -d $dir ] && mkdir $dir
done

Rscript RCmds

####### RCmds - R code script

library(ips)
library(parallel)
setwd("/data/Phyluce_uce-alignments/mafft-nexus-min75_110117/")
files=list.files(pattern = '.nexus')
nex = length(files)
# get number of files for 1:length(1:n files)

for(i in 1:length(1:nex)){
        phylip = read.nex(files[i])
        this.name = files[i]
        newname=sub(".nexus",".phy",files[i])
        print(newname)
        save.path = getwd()
        save.path = paste(save.path,"/",sub(".nexus","",files[i]),sep = "")
        write.phy(phylip, file=file.path(save.path, newname), interleave=FALSE)
}
phy_files_list = list.files(pattern = "*.phy$", recursive = TRUE, path = getwd(), full.names = TRUE)

cmd =list()

for(i in 1:length(phy_files_list)){
        cmd[[i]] = paste(getwd(), "/run_RAxML.sh ", phy_files_list[i], sep="")
}



final_raxml = mclapply(cmd, system, mc.cores=getOption("mc.cores", 48))  ### 48 cores


####### run_RAxML.sh

#!/bin/bash
cd $(dirname $1)
id=$(basename $(dirname $1))
phy=$(basename $1)

~/standard-RAxML/raxmlHPC-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 25258 -n ${id}.txt -s $phy

####### astral_run.sh - ASTRAL ANALYSIS shell script file

# cp gene tree files to single dir from working dir
mkdir tree_files boot_trees

find . -name "RAxML_bipartitions.*" -exec cp {} tree_files/ \;


# merge all trees into a single file for Astral
cd tree_files 
rename 's/$/\.tre/' */RAxML_b*
sed -n wRAx_genetrees_merge.tre *.tre
cd ..

# make file with list of paths to bootstrap files
find . -name "RAxML_bootstrap.*" -exec cp {} boot_trees/ \;
cd boot_trees
for file in ./*
do
    readlink -f "$file" >> bootstrap.filedir.list.txt
done

java -Xmx100G -jar ~/ASTRAL/astral.5.5.6.jar -i tree_files/RAx_genetrees_merge.tre -b boot_trees/bootstrap.filedir.list.txt -r 100 -o SyngUCE_sp_tree.tre

Christopher Blair

unread,
Nov 17, 2017, 3:39:18 PM11/17/17
to ra...@googlegroups.com
Cool thanks for the post. Does your program also remove taxa with all missing data and taxa with identical sequences from each alignment? This should probably be done prior to analysis in RAxML.

Chris

To unsubscribe from this group and stop receiving emails from it, send an email to raxml+unsubscribe@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--
***********************************************
Christopher Blair, Ph.D. 
Assistant Professor
Department of Biological Sciences
New York City College of Technology and
Ecology, Evolution and Behavior Program
Graduate Center
The City University of New York
300 Jay Street
Brooklyn, NY 11201

Simey

unread,
Nov 17, 2017, 4:19:16 PM11/17/17
to raxml
Hi Chris,
we decided not to remove identical sequences for now, but will test this to see what difference it makes with Astral. 
To unsubscribe from this group and stop receiving emails from it, send an email to raxml+un...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Alexandros Stamatakis

unread,
Nov 18, 2017, 8:41:13 AM11/18/17
to ra...@googlegroups.com
you should definitely remove all this random noise (i.e. duplicate
sequences and sequences with no date) as it just doesn't make sense.

thanks for posting your scripts :-)

alexis
> this.name <http://this.name> = files[i]
> <https://github.com/tomas-fer/HybPhyloMaker>), especially in
> the scripts HybPhyloMaker5_missingdataremoval.sh
> <https://github.com/tomas-fer/HybPhyloMaker/blob/master/HybPhyloMaker5_missingdataremoval.sh> and
> HybPhyloMaker6a_RAxML_for_selected.sh
> <https://github.com/tomas-fer/HybPhyloMaker/blob/master/HybPhyloMaker6a_RAxML_for_selected.sh>.
> it, send an email to raxml+un...@googlegroups.com <javascript:>.
> For more options, visit https://groups.google.com/d/optout
> <https://groups.google.com/d/optout>.
>
>
>
>
> --
> ***********************************************
> Christopher Blair, Ph.D.
> Assistant Professor
> Department of Biological Sciences
> New York City College of Technology and
> Ecology, Evolution and Behavior Program
> Graduate Center
> The City University of New York
> 300 Jay Street
> Brooklyn, NY 11201
> CBl...@citytech.cuny.edu <javascript:>; cbl...@gc.cuny.edu <javascript:>
> <http://individual.utoronto.ca/chrisblair/index.html>
> Website: https://sites.google.com/site/christopherblairphd/home

Simey

unread,
Nov 18, 2017, 1:09:12 PM11/18/17
to raxml
Yes, we are working on eliminating the noise form the data this weekend. I was worried about incomplete gene trees and heterogeneous taxon sets as input into Astral, but Siavash Mirarab's dissertation chapter on Astral states "The optimization problem used in ASTRAL can easily handle incomplete gene trees; i.e., gene trees where some of the leaves are not present" and later writes "When an input gene tree has missing data, at least one of its two parts (but possibly both parts) would not be in the complete gene tree, and therefore the inclusion of that part in X' is unlikely to be helpful (recall that X' is the set of all parts from all bipartitions in X)"

Once we have updated the scripts to use RAxML's reduced alignments I will post on our GitHub and send along a link.

Note that, I found a path mistake in the script above using 'rename'. It is actually not necessary to rename the tree files from .txt to .tre. If the rename command is commented out, then the following sed command needs to be sed -n wRAx_genetrees_merge.tre *.txt.
But I hope to have this on out GitHub by the end of the weekend.

Simey

unread,
Nov 19, 2017, 2:18:40 PM11/19/17
to raxml
Since I want to compare the results of the complete UCE alignments to the .reduced alignments I simply modified our R code and the run_RAxML.sh file to use the .reduced files as RAxML input:

####### RCmds_reduced - R code script

library(parallel)
setwd("/data/Phyluce_uce-alignments/mafft-nexus-min75_110117/")

phy_files_list = list.files(pattern = "*.reduced$", recursive = TRUE, path = getwd(), full.names = TRUE)

cmd =list()

for(i in 1:length(phy_files_list)){
        cmd[[i]] = paste(getwd(), "/run_RAxML_reduced.sh ", phy_files_list[i], sep="")
}

final_raxml = mclapply(cmd, system, mc.cores=getOption("mc.cores", 48))  ### 48 cores


####### run_RAxML_reduced.sh

#!/bin/bash
cd $(dirname $1)
id=$(basename $(dirname $1))
reduced=$(basename $1)

~/standard-RAxML/raxmlHPC-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 25258 -n ${id}.txt -s $reduced

Simey

unread,
Nov 21, 2017, 6:29:02 PM11/21/17
to raxml
I finally added all of the vetted scripts to Github.

Christopher Blair

unread,
Nov 21, 2017, 9:09:10 PM11/21/17
to ra...@googlegroups.com
Cool thanks for sharing. This should help many of us working with astral.

To unsubscribe from this group and stop receiving emails from it, send an email to raxml+un...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.
--
***********************************************
Christopher Blair, Ph.D. 
Assistant Professor
Department of Biological Sciences
New York City College of Technology and
Ecology, Evolution and Behavior Program
Graduate Center
The City University of New York
300 Jay Street
Brooklyn, NY 11201

Peter Lageweg

unread,
Jul 2, 2018, 6:44:59 AM7/2/18
to raxml
Hi Chris,

Did you already manage it to use RaxML trees with ASTRAL? I can't find anything :(
Hope you are able to help me!

Cheers!

Op zondag 15 oktober 2017 15:23:08 UTC+2 schreef chris blair:

chris blair

unread,
Jul 2, 2018, 8:00:01 AM7/2/18
to raxml
Hi Peter, 

There is a handy perl script that comes with the SNaQ package that automates RAxML + ASTRAL analyses. May be worth a look. 

Chris

chris blair

unread,
Jul 2, 2018, 8:05:52 AM7/2/18
to raxml

Peter Lageweg

unread,
Jul 2, 2018, 8:15:07 AM7/2/18
to raxml
Ah, thank you for your quick answer! I will look into this.

Op maandag 2 juli 2018 14:00:01 UTC+2 schreef chris blair:

Peter Lageweg

unread,
Jul 5, 2018, 9:24:54 AM7/5/18
to raxml
Hi Chris,

The pipeline works for my dataset! Only problem is now that the outgroup for my data is mixing with the ingroup sequences.
Do you recommend to rename this outgroup to the species they belong too?
Or exclude the outgroup?

Looking forward to your answer

data is biallelic and is like this:

>354_musculus_COI_1
ATGCTGCTCGTTTCCATCGGATATTCTGTG
>876_norvegicus_COI_2
TCATGCGTAGATGCTCTCGTTTCCATCGG
>987_zibethicus_COI_1
GTGAGTGTTCAATGCCACAGGTGCAGCA
>212_rattus_COI_2
CGTCCATTTATTCCATGCTGCTCGTTCCA
>Scaffold_3212_MM
GGAAGGCTTGCTTTTTGTAGGGTCCACA

These are 73 alignment files and all Scaffold sequences are mixing up with the ingroup sequences.

Op maandag 2 juli 2018 14:05:52 UTC+2 schreef chris blair:

chris blair

unread,
Jul 6, 2018, 10:28:41 AM7/6/18
to raxml
Hi Peter, 

Which script did you use? The one that comes with SNaQ/PhyloNetworks will return unrooted gene trees (not sure about the other scripts). You could probably easily edit the RAxML commands in the script to root using the OG if you like. However, if I recall correctly ASTRAL assumes unrooted gene trees....

Chris
Reply all
Reply to author
Forward
0 new messages