Bootstrap with different phylip inputs

559 views
Skip to first unread message

Muriel Gros-Balthazard

unread,
Sep 24, 2015, 4:03:28 AM9/24/15
to raxml
Hi everybody !

I want to make a tree using SNPs in vcf files.
The thing is that I downsampled my vcf file containing 600.000 SNPs to obtain 25.000 SNPs and made 10 replicates so that I have 10 vcf files containing 25.000 randomly sampled.
My goal is to check that the 25.000 SNPs downsampled won't have effects on the tree obtained.

For each of this 10 vcf files, I created 10 bootstrap replicates.
raxmlHPC -# 10 -b 12345 -f j -m GTRGAMMA -s $phy.phylip -n $i\.REPS
So that I have 100 replicates in total.

I then calculated a starting tree for each of the 100 bootstrap replicates.
raxmlHPC-PTHREADS -T 2 -y -s ../2_BSrep/$phy.phylip.BS$r -m GTRCAT -n vc_$i\_BS$r -p 12$i$r

And ran Examl for each.
mpirun -n 16 \
    examl \
    -s my_phylip_binary \
    -n examlOut \
    -m GAMMA \
    -t my_starting_tree \
    -S


I thus obtained 100 trees.

I want to calculate the bootstrap value so I put the 100 trees in a single file called TREES.
And calculated bootstrap this way:
raxmlHPC \
    -f b \
    -m GTRGAMMA \
    -s vc_1.phylip.binary \
    -z TREES \
    -t ExaML_result.examlOut_vc_1_BS1 \
    -n Bootstrap_tree_vc_1


The thing is that I am not sure this is correct.
Indeed, the vc_1.phylip.binary file is the phylip file corresponding only to 10 replicates since for the 90 others I used 9 other vcf files...
What about the tree ExaML_result.examlOut_vc_1_BS1 ? This is the same. It corresponds to a single vcf file but not to the 9 others. Although I believe here it is not problematic.

What do you think of this pipeline given the goal I described above ?
Is it correct or do you have another idea ?

Should I rather calculate the bootstrap value for each of the 10 vcf replicates separately ?

Thanks a lot for your help,

Muriel





Alexandros Stamatakis

unread,
Sep 29, 2015, 10:12:03 AM9/29/15
to ra...@googlegroups.com
Hi Muriel,

> I want to make a tree using SNPs in vcf files.
> The thing is that I downsampled my vcf file containing 600.000 SNPs to
> obtain 25.000 SNPs and made 10 replicates so that I have 10 vcf files
> containing 25.000 randomly sampled.
>
> My goal is to check that the 25.000 SNPs downsampled won't have effects
> on the tree obtained.

Okay, so your intention is to compare the 25,000 SNP trees to the
600,000 SNP tree, correct?

Keep in mind that you need to deal with ascertainment bias correction
for SNP only datasets, see:
http://sysbio.oxfordjournals.org/content/early/2015/09/15/sysbio.syv053

> For each of this 10 vcf files, I created 10 bootstrap replicates.
> raxmlHPC -# 10 -b 12345 -f j -m GTRGAMMA -s $phy.phylip -n $i\.REPS
> So that I have 100 replicates in total.

Okay.

> I then calculated a starting tree for each of the 100 bootstrap replicates.
> raxmlHPC-PTHREADS -T 2 -y -s ../2_BSrep/$phy.phylip.BS$r -m GTRCAT -n
> vc_$i\_BS$r -p 12$i$r
>
> And ran Examl for each.
> mpirun -n 16 \
> examl \
> -s my_phylip_binary \
> -n examlOut \
> -m GAMMA \
> -t my_starting_tree \
> -S
>
> I thus obtained 100 trees.

Okay, so far.

> I want to calculate the bootstrap value so I put the 100 trees in a
> single file called TREES.
> And calculated bootstrap this way:
> raxmlHPC \
> -f b \
> -m GTRGAMMA \
> -s vc_1.phylip.binary \
> -z TREES \
> -t ExaML_result.examlOut_vc_1_BS1 \
> -n Bootstrap_tree_vc_1
>
> The thing is that I am not sure this is correct.
> Indeed, the vc_1.phylip.binary file is the phylip file corresponding
> only to 10 replicates since for the 90 others I used 9 other vcf files...

that doesn't matter, the MSA file is ignored by RAxML for this command,
despite the fact that the file name must be provided ...

> What about the tree ExaML_result.examlOut_vc_1_BS1 ? This is the same.
> It corresponds to a single vcf file but not to the 9 others. Although I
> believe here it is not problematic.

But why would you want to draw support values on one of the BS replicate
trees???

> What do you think of this pipeline given the goal I described above ?
> Is it correct or do you have another idea ?
>
> Should I rather calculate the bootstrap value for each of the 10 vcf
> replicates separately ?

It's hard to tell, I don't understand yet what the goal of your
subsampling really is. If you want to compare with the full dataset what
you could do is the following:

1. Compute ML trees for each of the 10 subsamples
2. compute the RF distance between these trees and the 600,000 SNP ML tree
3. Compute BS replicates for the full dataset, draw them on the ML tree
of the full dataset
4. Compute 100 BS replicates per subsample, draw the support values from
these subsamples on the ML tree of the full dataset and compare them
with the BS values of the full dataset (e.g., by quantifying the squared
error)
5. build consensus trees of the BS replicates for the subsamples and the
full tree and compute the topological distances between them.

cheers,

alexis

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University
of Arizona at Tucson

www.exelixis-lab.org

Muriel Gros-Balthazard

unread,
Sep 30, 2015, 5:01:58 AM9/30/15
to raxml
Hi Alexis,

Thanks a lot for your answer.

Let me describe in a better way my goal !
I have these 600.000 SNPs but I thought that calculating a ML tree on it might be problematic because of linkage disequilibrium (this is RNA seq data BTW).
So I thought about downsampling my vcf to have 1 SNP per contig that is ~25.000 SNPs.

But then, I thought that maybe the tree would be different depending on the 25.000 SNPs sampled...

Maybe what I could do is rather do 1.000 downsampling and calculate a ML tree for each and then use this 1.000 trees as if they were BS replicates ?

Can you exactly tell me the diffreence between consensus tree and tree with boostrap, it is not clear to me ?

I just did what I said earlier. Downsampling 1000 times, calculate 1000 trees and combining thos trees ina  single file called TREES.

Running RaxML to obtain a consensus:

raxmlHPC \
    -f b \
    -m GTRGAMMA \
    -z TREES \
    -J MR \
    -n consensus

I obtained:
(OC9,(DA1,(Oeu039,(OS3,(Oeu080,(OC16,Oeu067,(Oc417,(Oeu066,((Oeu077,((Oeu007,(Oeu041,Oeu089,(Ost52,Ost45,OC11,Ost25,OS10,OS9,OS5,Ost30,OS4,Ost27,Ost23,OS1,Ost43,Ost37,Ost22,(Ost46,(Oeu084,(Oc605,(Oc582,Oeu088):1.0[55]):1.0[100]):1.0[57]):1.0[51],(Oc593,(OC6,(OC3,Oeu085):1.0[98]):1.0[100]):1.0[65],((Ost35,Ost33):1.0[100],(Ost40,Ost39):1.0[100]):1.0[80],(Ost58,(Ost51,Ost54):1.0[100]):1.0[83]):1.0[96],(Oeu056,(Oc624,(Oeu038,Oeu076,(Oeu042,Oeu054):1.0[93],(Oeu095,OC1):1.0[100]):1.0[51]):1.0[93]):1.0[61],(Oeu008,(Oeu017,OC14):1.0[71]):1.0[92],(OC13,OC15):1.0[53]):1.0[90]):1.0[57],(Oeu097,OC12):1.0[97]):1.0[57]):1.0[64],(Oeu037,(Oeu069,(Oeu073,(Oc586,Oeu070):1.0[100]):1.0[62]):1.0[93]):1.0[98]):1.0[69]):1.0[74]):1.0[66],(Ocu27,OB4):1.0[100]):1.0[100]):1.0[52]):1.0[93]):1.0[100]):1.0[99],(DA3,(MZ4,MZ1):1.0[100]):1.0[97]);

But I can't plot it using Trex (http://www.trex.uqam.ca/index.php?action=newick). It says that it's not an OK Newicq format, why ?

I thought I could calculate bootstrap rather than make a consensus. But then, which tree should I use to put the bootstrap values ? I can't use the consensus one since there is multifurcation...

Thanks a gain for your help,

Muriel

Alexandros Stamatakis

unread,
Sep 30, 2015, 3:43:26 PM9/30/15
to ra...@googlegroups.com
Hi Muriel,

> Let me describe in a better way my goal !
> I have these 600.000 SNPs but I thought that calculating a ML tree on it
> might be problematic because of linkage disequilibrium (this is RNA seq
> data BTW).

Why should linkage desequilibrium affect this? Can you explain in more
detail.

> So I thought about downsampling my vcf to have 1 SNP per contig that is
> ~25.000 SNPs.
>
> But then, I thought that maybe the tree would be different depending on
> the 25.000 SNPs sampled...

Yes, that would rather be expected.

> Maybe what I could do is rather do 1.000 downsampling and calculate a ML
> tree for each and then use this 1.000 trees as if they were BS replicates ?

Well no, technically this is not a bootstrap but a sort of jackknife.

> Can you exactly tell me the diffreence between consensus tree and tree
> with boostrap, it is not clear to me ?

given n BS trees you calculate the consensus trees
given n BS trees and the best-known ML tree on the original alignment
you count how frequentl each bipartition/split of the ML tree appears in
the BS replicates

>
> I just did what I said earlier. Downsampling 1000 times, calculate 1000
> trees and combining thos trees ina single file called TREES.
>
> Running RaxML to obtain a consensus:
> raxmlHPC \
> -f b \
> -m GTRGAMMA \
> -z TREES \
> -J MR \
> -n consensus
>
> I obtained:
> (OC9,(DA1,(Oeu039,(OS3,(Oeu080,(OC16,Oeu067,(Oc417,(Oeu066,((Oeu077,((Oeu007,(Oeu041,Oeu089,(Ost52,Ost45,OC11,Ost25,OS10,OS9,OS5,Ost30,OS4,Ost27,Ost23,OS1,Ost43,Ost37,Ost22,(Ost46,(Oeu084,(Oc605,(Oc582,Oeu088):1.0[55]):1.0[100]):1.0[57]):1.0[51],(Oc593,(OC6,(OC3,Oeu085):1.0[98]):1.0[100]):1.0[65],((Ost35,Ost33):1.0[100],(Ost40,Ost39):1.0[100]):1.0[80],(Ost58,(Ost51,Ost54):1.0[100]):1.0[83]):1.0[96],(Oeu056,(Oc624,(Oeu038,Oeu076,(Oeu042,Oeu054):1.0[93],(Oeu095,OC1):1.0[100]):1.0[51]):1.0[93]):1.0[61],(Oeu008,(Oeu017,OC14):1.0[71]):1.0[92],(OC13,OC15):1.0[53]):1.0[90]):1.0[57],(Oeu097,OC12):1.0[97]):1.0[57]):1.0[64],(Oeu037,(Oeu069,(Oeu073,(Oc586,Oeu070):1.0[100]):1.0[62]):1.0[93]):1.0[98]):1.0[69]):1.0[74]):1.0[66],(Ocu27,OB4):1.0[100]):1.0[100]):1.0[52]):1.0[93]):1.0[100]):1.0[99],(DA3,(MZ4,MZ1):1.0[100]):1.0[97]);
>
> But I can't plot it using Trex
> (http://www.trex.uqam.ca/index.php?action=newick). It says that it's not
> an OK Newicq format, why ?

you probably need to use Dendroscope for it to display correctly.

> I thought I could calculate bootstrap rather than make a consensus. But
> then, which tree should I use to put the bootstrap values ? I can't use
> the consensus one since there is multifurcation...

The ML tree on the original alignment, please read the RAxML manual, it
is all outlined in great detail there.

alexis

>
> Thanks a gain for your help,
>
> Muriel
>
>
> On Thursday, September 24, 2015 at 10:03:28 AM UTC+2, Muriel
> Gros-Balthazard wrote:
>
> Hi everybody !
>
> I want to make a tree using SNPs in vcf files.
> The thing is that I downsampled my vcf file containing 600.000 SNPs
> to obtain 25.000 SNPs and made 10 replicates so that I have 10 vcf
> files containing 25.000 randomly sampled.
> My goal is to check that the 25.000 SNPs downsampled won't have
> effects on the tree obtained.
>
> For each of this 10 vcf files, I created 10 bootstrap replicates.
> raxmlHPC -# 10 -b 12345 -f j -m GTRGAMMA -s $phy.phylip -n $i\.REPS
> So that I have 100 replicates in total.
>
> I then calculated a starting tree for each of the 100 bootstrap
> replicates.
> raxmlHPC-PTHREADS -T 2 -y -s ../2_BSrep/$phy.phylip.BS
> <http://phy.phylip.BS>$r -m GTRCAT -n vc_$i\_BS$r -p 12$i$r
> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

Sukesh Kalva

unread,
Oct 6, 2017, 4:22:12 AM10/6/17
to raxml
Dear Sir,

I wanted to generate a phylogenetic tree for 40 genomes of different rice varieties. What are important things to keep in mind while merging the vcf files and how to merge these files. And after merging the vcf files, which software must be used to convert vcf files to phylip format.

Thanks
Sukesh

Alexandros Stamatakis

unread,
Oct 6, 2017, 6:00:02 AM10/6/17
to ra...@googlegroups.com
Dear Sukesh,

I personally don't know, but maybe some other members of the group could
answer this?

Alexis

On 06.10.2017 10:22, Sukesh Kalva wrote:
> Dear Sir,
>
> I wanted to generate a phylogenetic tree for 40 genomes of different
> rice varieties. What are important things to keep in mind while merging
> the vcf files and how to merge these files. And after merging the vcf
> files, which software must be used to convert vcf files to phylip format.
>
> Thanks
> Sukesh
>
> On Thursday, September 24, 2015 at 1:33:28 PM UTC+5:30, Muriel
> Gros-Balthazard wrote:
>
> Hi everybody !
>
> I want to make a tree using SNPs in vcf files.
> The thing is that I downsampled my vcf file containing 600.000 SNPs
> to obtain 25.000 SNPs and made 10 replicates so that I have 10 vcf
> files containing 25.000 randomly sampled.
> My goal is to check that the 25.000 SNPs downsampled won't have
> effects on the tree obtained.
>
> For each of this 10 vcf files, I created 10 bootstrap replicates.
> raxmlHPC -# 10 -b 12345 -f j -m GTRGAMMA -s $phy.phylip -n $i\.REPS
> So that I have 100 replicates in total.
>
> I then calculated a starting tree for each of the 100 bootstrap
> replicates.
> raxmlHPC-PTHREADS -T 2 -y -s ../2_BSrep/$phy.phylip.BS
> <http://phy.phylip.BS>$r -m GTRCAT -n vc_$i\_BS$r -p 12$i$r
> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.exelixis-lab.org
Reply all
Reply to author
Forward
0 new messages