How to build a phylogenetic tree using NGS data


Kalpi

Aug 9, 2013, 10:50:39 PM
to stacks...@googlegroups.com
Hi,

I have a GBS/RAD data set generated with high-throughput NGS, and I ultimately want to build a phylogenetic tree from it. I'm hoping to assemble the reads de novo using Stacks. Can someone please tell me which tool I can use to build a phylogenetic tree from the Stacks output?
Can I use MEGA for the tree building? I don't know whether it will work with NGS data, so I need some advice.
Thank you in advance.

Kalpi

Julian Catchen

Aug 13, 2013, 6:31:48 PM
to stacks...@googlegroups.com, kmkde...@gmail.com
Hi Kalpi,

The populations program will output a PHYLIP file containing SNPs that are fixed
within and variable between populations. This file can be loaded into any
standard phylogenetics software. You can also generate distance trees using
population genetic measures such as Fst between populations.
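
As a sketch of what that invocation might look like (flag names are taken from the Stacks manual; the batch ID, directories, and population map file here are placeholders, and flags can differ between versions, so check `populations --help` on your install):

```shell
# Build the command as a string so each piece can be annotated; on a real run
# you would execute it directly. All paths and the batch ID are hypothetical.
cmd="populations -b 1 -P ./stacks_output -M ./popmap.tsv -t 8 --phylip"
# -b        batch ID used when the catalog was built
# -P        directory containing the Stacks output files
# -M        population map: tab-separated sample name and population
# -t        number of threads
# --phylip  write fixed-within/variable-among SNPs in PHYLIP format
echo "$cmd"
```

The resulting `batch_1.phylip` file can then be fed to standard tree-building software.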

Please see our recent paper in Molecular Ecology on using Stacks for population
genomics, which describes this feature in some detail.

Best,

julian

Kalpi

Sep 24, 2013, 2:59:46 AM
to stacks...@googlegroups.com
Hi julian,

Thank you for the reply. I read your recent paper in Molecular Ecology on Stacks for population genomics.

At one point it says, "Data generated from pools of DNA can be used, but the fixed model must be specified to ustacks or pstacks manually. Instead of identifying polymorphisms, this model identifies fixed sites and masks out all other sites."

As I'm new to this kind of study, I do not understand what this fixed model is or how it needs to be specified to ustacks and pstacks manually. Does that mean I cannot use the commands given for ustacks and pstacks in the manual and tutorials? What modifications are needed?
How can I find the steps to follow to generate a PHYLIP file?

I'm sorry for having so many questions. I hope you can help me clear up my doubts.
Thank you in advance.

Regards,
Kalpi

Julian Catchen

Oct 2, 2013, 11:43:43 PM
to stacks...@googlegroups.com, Kalpi
Hi Kalpi,

You can use the populations program to generate a PHYLIP file based on SNPs that
are fixed within and variable among the populations in your study.

However, if you pool DNA, then the SNP-calling model cannot be guaranteed to
make the correct call, as it is designed for diploid individuals. For example,
since it is a pool, a minor allele can appear as a very small number of reads
relative to the total number of reads at the site, and the SNP model will judge
these reads as error, not as a minor allele.

To help with this problem, you can turn on the fixed model in ustacks or
pstacks, which will instead call sites that are fixed or homozygous and turn all
sites that may be polymorphic into Ns. This methodology will still let you
accurately generate your PHYLIP file to build a tree while circumventing the
problem that pooled DNA creates with a diploid model.
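
As a concrete sketch of enabling the fixed model (flag names are from the Stacks 1.x manual; the file names, sample ID, and error-rate value below are placeholders, so verify against `ustacks --help` for your version):

```shell
# Hypothetical ustacks call with the fixed model enabled for pooled DNA.
# --model_type fixed turns on the fixed-site caller; --bc_err_freq supplies
# the sequencing error rate it uses when masking possibly polymorphic sites
# as Ns. The error rate of 0.01 is an illustrative value, not a recommendation.
cmd="ustacks -t gzfastq -f ./samples/pool_01.fq.gz -o ./stacks_output -i 1 -p 8 --model_type fixed --bc_err_freq 0.01"
echo "$cmd"
```

The downstream cstacks/sstacks/populations steps are then run as usual.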

Best,

julian

Kalpi

Oct 8, 2013, 2:38:31 AM
to stacks...@googlegroups.com
Hi julian,

Thank you so much for your clear explanation.
So it will be more accurate if I run ustacks manually (assigning the fixed model), then cstacks and sstacks, and then go on to the populations analysis.

I already ran my data through the denovo_map.pl wrapper with default parameters, just to get some idea of the results. It has been a week now and it is still running. The terminal shows that it has finished ustacks and is now generating the catalog with cstacks, but no progress has been visible for 7 days. There are no error messages. My data set is about 9.8 GB.

I'm wondering whether to terminate it or keep it running. Will it give any results within a few more days? I need some advice on how to check on it.

Regards,
Kalpi


Julian Catchen

Oct 16, 2013, 9:13:39 PM
to stacks...@googlegroups.com, kmkde...@gmail.com
Hi Kalpi,

You should be able to tail the denovo_map.log file, which should be in your
Stacks output directory. cstacks will print its progress as it adds each sample
to the catalog.
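
On a live run that is, for example, `tail -f ./stacks_output/denovo_map.log`. A self-contained sketch of counting progress lines (the log wording below is made up for the demo; match whatever your version of cstacks actually prints):

```shell
# Simulate a fragment of a denovo_map.log and count how many samples
# cstacks has merged so far. The line format here is illustrative only.
cat > demo_denovo_map.log <<'EOF'
Processing sample ./output/sample_01 [1 of 37]
Processing sample ./output/sample_02 [2 of 37]
EOF
grep -c '^Processing sample' demo_denovo_map.log   # prints: 2
```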

julian

Adam

Oct 21, 2013, 4:12:44 PM
to stacks...@googlegroups.com, kmkde...@gmail.com, jcat...@uoregon.edu
Hi Julian and Kalpi,

I have also noticed similar "stalled" cstacks behavior when running on my university's cluster. When I check the denovo_map.log file while it is running, it contains only the initial execution line of cstacks. It will usually run for several more days without printing anything to the log file. Sometimes these jobs finish, but usually they run out of time and are killed, so my solution has been to just request more time, often 7+ days.

It was actually this issue that brought me to the forum today where I stumbled upon this post...

Let me add a few more details about the study:  

The cluster is currently running v.1.05

The Data:
37 individuals from 2 populations (haploid moss gametophytes)
Each individual contains ~700,000 single-end reads after stringent filtering.
I have run denovo_map.pl using the following parameters: 

denovo_map.pl -m 2 -M 1 -N 2 -n 3 -t -T 5 -b 7 -B alaska_radtags   .....the rest is probably unimportant 

In the most recent run of this data set I asked for 5 processors, a total of 50 GB of RAM, and a run time of 6 days.
I routinely check the denovo_map.log file, and ustacks will often run for around 16 hours before initiating cstacks. This is where the "stall" occurs; the following is usually the last thing I see in the denovo_map.log file before the job is killed, often 5 days later:

/apps/stacks/1.05/bin//cstacks -b 7 -o ./output -s ./output/alaska_cyano_pluro_20 <<long list of samples here>> -n 3 -p 5 2>&1


So I am wondering whether others have noticed this as well, and whether this is normal behavior.

I have had several runs with fewer individuals which completed in less time, and in those cases there is cstacks output for each sample written to the denovo_map.log file. What I am not sure of is whether this is written as cstacks works or only printed to the log after cstacks has finished. I am assuming it is designed to write to the log file as it adds each individual to the catalog (as this is the behavior of ustacks when writing to the log file), but I wanted to ask around. Additionally, in the runs that do not finish, a batch_X.catalog.X file is never created in the output.

I ask in part to help inform how much time I allocate for each run. I have some rather large sample sets we plan on running in the near future, and short of trial and error I have no way of knowing how far cstacks has progressed in X amount of time. In most runs cstacks seems to be the portion of the pipeline that takes the longest to complete.

Julian, in your recent stickleback paper your data set was far larger; how did you handle that catalog creation? Can you thread your way out of this issue, i.e., is cstacks a portion of the pipeline that can utilize multi-threading?

Thanks a lot in advance for any input!  

-Adam

Ryan McCormick

Oct 21, 2013, 4:37:09 PM
to stacks...@googlegroups.com, kmkde...@gmail.com, jcat...@uoregon.edu
Adam and Kalpi,

I'm not a shared-memory expert, but it may be worth checking how much memory a single node on your cluster has. You may have requested more memory than any one node has, and I'm not sure that Stacks can spread its memory across nodes (it uses threads, not MPI). Maybe it's writing to disk (swap) and causing the slowdown you're seeing?

I've successfully run data sets many times larger than yours in less than 12 hours (loading into the MySQL database can take considerably longer). My observation is that cstacks and populations are the most memory-intensive applications, but only with my largest data sets do I ever exceed 30 GB of memory. Maybe try dropping the memory allocation request to your cluster to around 15 or 20 GB (or whatever the max is on a single node for your cluster; I think ours is 24 GB at our university).
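
One quick check on a Linux node (run it inside an interactive job so you see the node you actually landed on, not the login node):

```shell
# Report total physical memory on this node in GB, read from /proc/meminfo.
# Linux-specific: MemTotal is reported in kB, so divide by 1024 twice.
awk '/^MemTotal:/ {printf "%.1f GB\n", $2 / 1024 / 1024}' /proc/meminfo
```

If that number is smaller than your scheduler request, the request can never be satisfied on a single node.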

Ryan McCormick

Adam

Oct 22, 2013, 12:09:26 AM
to stacks...@googlegroups.com, kmkde...@gmail.com, jcat...@uoregon.edu

Hi Ryan, 

Thanks for your thoughts. I have a few more pieces to add to the puzzle. Our university just invested a fortune in a new cluster and the available resources are generous: each node has 250 GB of RAM and 64 processors, so MPI is not necessary, as most jobs can now be run entirely on a single node.

I recently completed a job with 17 individuals, ~700,000 reads/individual, using denovo_map.pl and the following parameters:

 -m 2 -M 3 -n 3 -t -T 4.  

The final cluster job report showed it ran for 38 hours, used 137 hours of CPU time, and used 41.4 GB of RAM. I also ran this same analysis by hand (ustacks, cstacks, sstacks, populations, load into database...) and each component used considerably fewer resources; the hungriest component was cstacks, using 5.4 GB of RAM and running for 30 CPU hours. ustacks ran for 20 CPU hours and used 1.4 GB of RAM.
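
For anyone wanting to reproduce these per-component numbers, GNU time (the binary, not the shell builtin) reports peak memory; the cstacks arguments below are copied from the log line quoted earlier in the thread and the sample path is a placeholder:

```shell
# Wrap one pipeline stage in GNU time to capture wall time and peak RAM.
# In the -v output, "Maximum resident set size" is the peak memory figure.
# Shown as a string here rather than executed, since cstacks and the sample
# files are only present on the cluster.
cmd="/usr/bin/time -v cstacks -b 7 -o ./output -s ./output/sample_01 -n 3 -p 5"
echo "$cmd"
```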
 
If others are experiencing shorter run times with more reasonable memory use, perhaps it is an issue with how the program is loaded on our cluster or how it interacts with the various components such as the database. The compiling of this program was done by our IT folks, who are quite competent, but that is so far outside my skill set I can't even comment on how it was set up.

Thanks again for any insights, and apologies for hijacking this thread.  In hindsight I should have made a new post.  

-Adam

monoc...@gmail.com

Mar 1, 2014, 4:10:49 AM
to stacks...@googlegroups.com, kmkde...@gmail.com, jcat...@uoregon.edu
Hi Julian,
I would like to come back to this issue. I use Stacks for processing ddRAD data (several tens of GB), and indeed, monitoring the cstacks process is the only weak point I have found. When I run cstacks separately, the output from adding each sample is sent both to the monitor and to the log file. When I use the denovo pipeline, all the output is written to the denovo_map.log file only AFTER the whole catalogue has been built! Thus, especially with new data, it can happen that after waiting for a week I simply run out of memory and have to start again. Your advice or help on this point would be really appreciated!

Second (and not that important): in my experience, running cstacks in multithreaded mode does not speed up the process and can even slightly slow it down (which is not that surprising to me, considering the way the catalogue is built). Am I wrong?


Many thanks,
Lubomir



Julian Catchen

Mar 1, 2014, 9:45:55 AM
to stacks...@googlegroups.com, monoc...@gmail.com
Hi Lubomir,

What version of Stacks are you running? In Stacks 1.11 I changed the logging method in denovo_map.pl/ref_map.pl to read continuously from each process and output to the log file:

Stacks 1.11 - Jan 09, 2014
--------------------------
Bugfix: changed logging in denovo_map.pl/ref_map.pl to write outputs from Stacks programs continuously instead of waiting until the program completed to write output to log file.

If you have 1.11 or later and this is still not working, there may be some buffering of the files being done by the kernel that I may need to deal with.

As for running cstacks in parallel, I believe the only part of cstacks that is parallelized is the matching between loci when merging a new sample into the catalog. Parallel execution should definitely speed this up, but if you are using a reference genome, then you are not doing any matching by sequence and parallel execution would not have any effect.

julian


monocirrhus

Mar 1, 2014, 9:58:16 AM
to Julian Catchen, stacks...@googlegroups.com

Dear Julian,

That could be the case, many thanks. I will reinstall and check once more. I had not realized your upgrades are that frequent; many thanks for them and for your support!

Lubomir
