Hi Uyen,
Some replies to your questions below.
Hồng Vũ Thúy Uyên wrote on 6/30/21 8:11 PM:
> Hi Julian,
>
> I am going to use STACKS output for building the Maximum Likelihood
> phylogeny tree. When I read the manual from your website, I am so
> confused about which parameter that I should use to export the ".phy"
> file for IQ-tree software.
>
> I did try 3 parameters to see how different they are. From the results,
> I know that
> --phylip-var: will export all the variants that are reported in the vcf
> file. In my case, I got 165,341 SNPs in my ".vcf" file and the length of
> sequences in ".phy" file is the same 165,341.
> --phylip: wrote in manual "output nucleotides that are fixed-within, and
> variant among populations in Phylip format for phylogenetic tree
> construction" - in my case, I just got 67,491 bp for the length of
> sequences which is much smaller than 165,341 variants I got. Could you
> explain a little bit more about "nucleotides that are fixed-within" and
> how --phylip working to give me the output, please?
These are positions in the genome that are differentially fixed between
populations. For example, an 'A' allele may be fixed at 100% frequency
in one population while a 'C' allele is fixed at 100% frequency in
another. Any site that is still segregating within a population is left
out, which is why --phylip gives you fewer sites than --phylip-var. This
is the classic data type for building trees, since most models
implemented in phylogenetics software assume fixed differences (that is,
they were designed not around SNPs, but around fixed sequence
differences between species that are potentially very distantly
related). With this type of data, you are generally only looking at the
branching pattern, or topology, of the tree, not the branch lengths.
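
To make "fixed-within, and variant among populations" concrete, here is
a minimal Python sketch of the filter. It is only an illustration under
simplifying assumptions (one haploid sequence per individual, no missing
data, no heterozygotes), not how Stacks implements it; the function name
and the toy data are hypothetical.

```python
def differentially_fixed_sites(pops):
    """Return indices of sites fixed within every population
    but differing between at least two populations.

    pops: dict mapping population name -> list of equal-length
          sequences, one per individual (toy, haploid representation).
    """
    n_sites = len(next(iter(pops.values()))[0])
    keep = []
    for i in range(n_sites):
        # Set of alleles observed at site i in each population
        per_pop = [{seq[i] for seq in seqs} for seqs in pops.values()]
        fixed_within = all(len(alleles) == 1 for alleles in per_pop)
        variant_among = len(set().union(*per_pop)) > 1
        if fixed_within and variant_among:
            keep.append(i)
    return keep


# Hypothetical toy data: two populations, two individuals each
pops = {
    "pop1": ["ACGT", "ACGT"],
    "pop2": ["CCGT", "CCAT"],
}
result = differentially_fixed_sites(pops)
print(result)
```

In this toy data, site 0 is 'A' in all of pop1 and 'C' in all of pop2,
so it is kept. Site 2 is variable, but it is still segregating within
pop2 (G and A), so it is the kind of site that --phylip-var would report
and --phylip would not.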
> --phylip-var-all: I can understand the way STACKS produced it from
> another question in this group. However, I still have a question on this
> parameter. "Should we use this parameter to get the output for phylogeny
> analysis?". In my case, although the length of sequences is much longer
> than the --phylip-var, when I run IQ_tree, the number of
> parsimony-informative sites for building the tree is still the same,
> which is 57,642.
Typically the branch lengths in a generated tree are scaled by the total
amount of sequence that is looked at, so including the invariant sites
along with the variant sites can allow for branch length estimation (but
as IQ-TREE tells you, the added invariant sites aren't useful for
determining the branching pattern, i.e. they are not
'parsimony-informative').
Some Bayesian coalescent software packages also depend on having the
variable sites embedded among their fixed, neighboring sites to estimate
the parameters of their models. I am not an expert on these packages.
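
As a rough illustration of both points, the Python sketch below counts
parsimony-informative sites (at least two different bases, each seen in
at least two sequences) and computes a simple per-site distance. The
p-distance here is only a stand-in for the model-based branch lengths
that IQ-TREE estimates, and the alignments and function names are made
up for the example.

```python
from collections import Counter


def count_parsimony_informative(alignment):
    """Count columns with >= 2 bases that each occur in >= 2 sequences."""
    count = 0
    for column in zip(*alignment):
        states = Counter(b for b in column if b in "ACGT")
        shared_states = sum(1 for n in states.values() if n >= 2)
        if shared_states >= 2:
            count += 1
    return count


def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs / len(a)


# Hypothetical alignment of variant sites only
variant_only = ["AC", "AT", "CT", "CT"]
# The same variant sites with two invariant 'G' sites between them
with_invariant = ["AGGC", "AGGT", "CGGT", "CGGT"]

pi_variant = count_parsimony_informative(variant_only)
pi_all = count_parsimony_informative(with_invariant)
d_variant = p_distance(variant_only[0], variant_only[3])
d_all = p_distance(with_invariant[0], with_invariant[3])
print(pi_variant, pi_all, d_variant, d_all)
```

The number of parsimony-informative sites is the same in both toy
alignments, but the per-site distance between the first and last
sequence is halved once the invariant sites are counted, which is the
sense in which the extra sites matter for branch lengths and not for
topology.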
>
> My last question is "which parameter should I use to get the output for
> phylogeny analysis?"
> I attached here 3 screenshots for --phylip/--phylip-var/--phylip-var-all
> output to the IQ-tree.
>
Of course, it depends on the type of tree you want to generate and the
model the software uses to estimate the tree. Maximum likelihood
phylogenetic software, such as RAxML, can take differentially fixed
sites (--phylip) or variant sites that are still segregating
(--phylip-var). These are the two most commonly used, but as I said, I
am not an expert on all their applications.
> Hope my questions don't bother you a lot.
> Thank you and your team so much!
> Best regards,
> Uyen
>
Best,
Julian