about haplotype phase, etc. for SNP data analyses


Michelle Liu

Feb 24, 2016, 1:53:07 PM
to raxml
Dear RAxML users:

I am very new to the phylogenetic analyses and will appreciate your help with the following 3 questions I have about the program. I searched the group postings here but was not able to find the direct answers to these questions.

First, as an example, in the original file, I have two individuals (with ids 1045_3 and 1157_5) and 3 markers with each marker having two alleles:
 
1045_3 T T C T G A
1157_5 T T C C A A

After converting this file to the fasta file format, I have

>1045_3
TTCTGA
>1157_5
TTCCAA
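
For reference, the conversion described above can be sketched in a few lines of Python. This is only an illustration of the reformatting step: the function name `genotype_table_to_fasta` is made up here and is not part of RAxML, and the input is assumed to be whitespace-separated with the sample id in the first column.

```python
def genotype_table_to_fasta(lines):
    """Turn 'id allele allele ...' rows into FASTA text, one record per individual."""
    records = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        sample_id, alleles = fields[0], fields[1:]
        # Concatenate the alleles in the order given; no phasing is attempted.
        records.append(f">{sample_id}\n{''.join(alleles)}")
    return "\n".join(records) + "\n"


table = ["1045_3 T T C T G A", "1157_5 T T C C A A"]
fasta_text = genotype_table_to_fasta(table)
print(fasta_text, end="")
```

Note that this keeps both alleles of every marker as separate alignment columns, exactly as in the example above.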

Is this the correct input file format for SNP data analyses for RAxML?

Second, since this is a diploid dataset, should I determine the haplotype phase first? In this example, will the results be different if I have

>1045_3
TTTCAG      rather than TTCTGA
>1157_5
TTCCAA

Last, about the assumption that different sites (markers) evolve independently of each other, does this mean that the markers shouldn't be in linkage disequilibrium with each other? 

These are very basic questions. Thank you for your attention and patience!

Michelle

Alexandros Stamatakis

Feb 25, 2016, 3:35:13 AM
to ra...@googlegroups.com
Dear Michelle,

> I am very new to the phylogenetic analyses and will appreciate your help
> with the following 3 questions I have about the program. I searched the
> group postings here but was not able to find the direct answers to these
> questions.
>
> First, as an example, in the original file, I have two individuals (with
> ids 1045_3 and 1157_5) and 3 markers with each marker having two alleles:
>
> 1045_3 T T C T G A
> 1157_5 T T C C A A
>
> After converting this file to the fasta file format, I have
>
> >1045_3
> TTCTGA
> >1157_5
> TTCCAA
>
> *Is this the correct input file format for SNP data analyses for RAxML?*

RAxML can read FASTA and relaxed PHYLIP. Both formats are specified in
the comprehensive manual, which I'd ask you to please read:

http://sco.h-its.org/exelixis/resource/download/NewManual.pdf

>
> Second, since this is a diploid dataset, should I determine the *haplotype
> phase* first. In this example, will the results be different if I have
>
> >1045_3
> *TTTCAG rather than TTCTGA*
> >1157_5
> TTCCAA

Yes, the result will be different.

> Last, about the assumption that different sites (markers) evolve
> independently of each other, does this mean that the markers shouldn't be
> in *linkage disequilibrium* with each other?

Phylogenetic likelihood models always assume that sites evolve
independently. You can, however, try to model these dependencies or
independences by letting the corresponding markers evolve under the same
model or under different models, that is, by putting them into a single
partition or into separate partitions.

alexis

>
> These are very basic questions. Thank you for your attention and patience!
>
> Michelle
>

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University
of Arizona at Tucson

www.exelixis-lab.org

Michelle Liu

Feb 25, 2016, 4:26:33 PM
to raxml
Thanks a lot for your reply, Alexis!

Based on your answer to my second question, I am wondering if it is better to perform the analyses in the following order:

first, derive the haplotype information using another program,
then, use the haplotypes as input to RAxML (i.e. two rows for each sample). 

As an example, the data will be 

>1045_3_1
TCG                     (assume that this is the estimated haplotype)
>1045_3_2
TTA
>1157_5_1
TCA
>1157_5_2
TCA

Will this be better than using

>1045_3
TTCTGA
>1157_5
TTCCAA
?

For the last question from my previous post, could you recommend a paper or book on the background of partition methods and how to do partitions?

Many thanks!

Michelle

Alexandros Stamatakis

Feb 26, 2016, 7:13:27 AM
to ra...@googlegroups.com
dear michelle,

> >1045_3_1
> TCG (assume that this is the estimated haplotype)
> >1045_3_2
> TTA
> >1157_5_1
> TCA
> >1157_5_2
> TCA
>
> Will this be better than using
>
> >1045_3
> TTCTGA
> >1157_5
> TTCCAA
> ?

I'd rather encode the haplotypes as ambiguous DNA characters
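
A minimal sketch of what such an encoding could look like, assuming the two alleles of each marker sit next to each other in the input string (as in the TTCTGA example). The mapping uses the standard IUPAC two-base ambiguity codes; the function name `encode_diploid` is hypothetical and not part of RAxML.

```python
# Standard IUPAC codes for single bases and unordered pairs of bases.
IUPAC_PAIR = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def encode_diploid(alleles):
    """Collapse consecutive allele pairs into one IUPAC character per marker."""
    pairs = zip(alleles[0::2], alleles[1::2])
    return "".join(IUPAC_PAIR[frozenset(pair)] for pair in pairs)


print(encode_diploid("TTCTGA"))  # 1045_3
print(encode_diploid("TTCCAA"))  # 1157_5
```

Because the pair is treated as an unordered set, the order of the two alleles within a marker no longer matters, so TTCTGA and TTTCAG produce the same encoded sequence.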

> For the last question from my previous post, could you recommend a paper or
> book on the background of partition methods and how to do partitions?

You just need to specify the respective partitions for your dataset. The
RAxML partition file format is specified in the manual:

http://sco.h-its.org/exelixis/resource/download/NewManual.pdf
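
As a rough illustration only, the sketch below writes partition lines of the form `DNA, name = start-end`, which is the basic layout the manual describes for DNA partitions. The block names and sizes are invented for the example, and the helper `partition_lines` is hypothetical; please check the manual for the exact syntax and for the data-type keywords that apply to your analysis.

```python
def partition_lines(blocks, data_type="DNA"):
    """blocks: list of (name, number_of_sites) tuples in alignment order."""
    lines, start = [], 1
    for name, n_sites in blocks:
        end = start + n_sites - 1
        lines.append(f"{data_type}, {name} = {start}-{end}")
        start = end + 1  # the next block begins right after this one
    return lines


# Hypothetical grouping of markers, e.g. by linkage group.
blocks = [("linkage_group_1", 120), ("linkage_group_2", 80)]
partition_text = "\n".join(partition_lines(blocks)) + "\n"
print(partition_text, end="")
```

The resulting text would be saved to a file and passed to RAxML as the partition file (the `-q` option in RAxML 8, as far as I am aware).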

alexis

Michelle Liu

Mar 1, 2016, 1:41:32 AM
to raxml
Sorry to bother you again. I read the tutorial for PartitionFinder. It seems that this program is for sequencing data and is NOT for SNP-only data. Could you recommend a program that can find the best partitioning strategy for SNP-only data?

Thanks again!

Michelle

Alexandros Stamatakis

Mar 7, 2016, 8:50:23 AM
to ra...@googlegroups.com
hi michelle,

the approach should be the same, in particular if you also include and
have the invariable sites for your SNP alignment as suggested in our paper:

http://sysbio.oxfordjournals.org/content/64/6/1032

if you don't have the invariable sites available in your data you might
ask the partitionfinder developers to add an option for partitionfinding
on SNP datasets using the ascertainment bias correction we propose in
the above paper,

alexis

Michelle Liu

Mar 7, 2016, 9:24:56 AM
to raxml
Thanks a lot, Alexis! I really appreciate your help!

Michelle

san...@gmail.com

Nov 2, 2016, 5:18:43 PM
to raxml
Dear Alexis,

My dataset very much resembles that of Michelle, and your comments have been very helpful!

However, encoding the two diploid alleles as ambiguous DNA characters remains problematic, and I would very much appreciate your advice.

For any given locus, if there are only heterozygous SNPs, RAxML (incorrectly) regards the site as "invariant" (due to the definition of ambiguity codes). How then could I include these sites while applying acquisition bias corrections in RAxML?

Thank you very much!

Young-Jun



Alexandros Stamatakis

Nov 3, 2016, 5:20:22 AM
to ra...@googlegroups.com
Dear Young-Jun,

> My dataset very much resembles that of Michelle, and your comments have
> been very helpful!

:-)

> However, encoding the two diploid alleles as ambiguous DNA characters
> remains problematic, and I would very much appreciate your advice.
>
> For any given locus, if there are only heterozygous SNPs, RAxML
> (incorrectly) regards the site as "invariant" (due to the definition of
> ambiguity codes).

That is correct, assuming, of course, that all samples are heterozygous
at the specific position in the locus in exactly the same way, e.g., we
have AT there for all samples.

> How then could I include these sites while applying
> acquisition bias corrections in RAxML?

It is just an invariant site if there is no variation at all at that
SNP. With acquisition bias correction enabled, RAxML does not allow
partitions to contain invariant sites.

So either, you remove that site and run RAxML with acquisition bias
correction, or, if you have data for all the invariant sites, you just
run it on the big, comprehensive dataset without any correction enabled
(which is more accurate based on the paper:
http://sysbio.oxfordjournals.org/content/64/6/1032).

Cheers,

Alexis

san...@gmail.com

Nov 3, 2016, 11:33:12 AM
to raxml
Thank you very much for the prompt reply!

I realize that I wasn't very clear in my original posting. Please consider the following example, where there are three samples and the IUPAC ambiguity code M is used to represent a heterozygous SNP (A/C). In addition to AAA and MMM, RAxML considers AAM, AMA, MAA, AMM, MMA and MAM as invariant sites.
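
The behaviour described above can be reproduced with a small check. This is a sketch under one assumed reading of "invariant": a column counts as potentially invariant if at least one nucleotide is compatible with the code of every sample. It is not taken from the RAxML source, and the function name `may_be_invariant` is made up for illustration.

```python
# Nucleotides compatible with each IUPAC code.
IUPAC_STATES = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
    "B": set("CGT"), "D": set("AGT"), "H": set("ACT"),
    "V": set("ACG"), "N": set("ACGT"),
}


def may_be_invariant(column):
    """True if a single nucleotide could explain every character in the column."""
    shared = set("ACGT")
    for code in column:
        shared &= IUPAC_STATES[code]
    return bool(shared)


for column in ["AAA", "MMM", "AAM", "AMA", "MAA", "AMM", "MMA", "MAM", "ACM"]:
    print(column, may_be_invariant(column))
```

Under this reading, every column listed in the example shares the nucleotide A and is flagged, whereas a column such as ACM is not.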

Because of this behavior of RAxML, I had to exclude more than half of all variable loci from my dataset, and I was wondering if there’s a better approach than simple exclusion.

Thanks again,
Young-Jun

Alexandros Stamatakis

Nov 4, 2016, 3:59:39 AM
to ra...@googlegroups.com


On 03.11.2016 16:33, san...@gmail.com wrote:
> Thank you very much for the prompt reply!
>
> I realize that I wasn't very clear in my original posting. Please
> consider a following example where there are three samples and the IUPAC
> ambiguity code M is used to represent a heterozygous SNP (A/C). (In
> addition to AAA and MMM) RAxML considers AAM, AMA, MAA, AMM, MMA and MAM
> as invariant sites.

I see.

> Because of this behavior of RAxML, I had to exclude more than half of
> all variable loci from my dataset, and I was wondering if there’s a
> better approach than simple exclusion.

Okay, what one may consider is to maybe use the weight vector option in
RAxML (-a if I remember correctly). What you could do is the following:

All variable sites receive a weight of 2, all heterozygous SNPs, e.g.,
AAM are split up into two sites: AAA and AAC and both of those sites
receive a weight of 1.
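
A minimal sketch of this weighting scheme, assuming the phased haplotypes are known and supplied as two dictionaries (first and second haplotype per sample). The function name and the sample names are hypothetical; the only RAxML-specific assumption is that the weight file holds one integer per alignment column, separated by whitespace.

```python
def split_columns_with_weights(hap1, hap2):
    """hap1/hap2 map sample -> haplotype string; all strings have equal length."""
    samples = sorted(hap1)
    n_sites = len(hap1[samples[0]])
    columns, weights = [], []
    for i in range(n_sites):
        col1 = "".join(hap1[s][i] for s in samples)
        col2 = "".join(hap2[s][i] for s in samples)
        if col1 == col2:
            # No sample is heterozygous here: keep one column with weight 2.
            columns.append(col1)
            weights.append(2)
        else:
            # At least one heterozygous sample: two columns with weight 1 each.
            columns += [col1, col2]
            weights += [1, 1]
    return samples, columns, weights


hap1 = {"s1": "AA", "s2": "AC", "s3": "AC"}
hap2 = {"s1": "AA", "s2": "AC", "s3": "CC"}  # s3 is heterozygous (A/C) at site 1
samples, columns, weights = split_columns_with_weights(hap1, hap2)
weight_line = " ".join(str(w) for w in weights)
print(columns, weight_line)
```

The assignment of each sample's alleles to the first or second column is taken as given by the input here, so the result depends on how the haplotypes were labelled.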

Alexis

san...@gmail.com

Nov 7, 2016, 9:29:45 AM
to raxml
Thanks again for the suggestion! I'll give that a try.
Have a great week ahead,
Young-Jun

benjamin v

Nov 28, 2016, 9:32:00 AM
to raxml
> Okay, what one may consider is to maybe use the weight vector option in
> RAxML (-a if I remember correctly). What you could do is the following:
>
> All variable sites receive a weight of 2, all heterozygous SNPs, e.g.,
> AAM are split up into two sites: AAA and AAC and both of those sites
> receive a weight of 1.

Under this solution, if one site is MAM, how would you set the split sites and weights?  CAC and AAA can be arranged so that only one mutation occurred [(2,(1,3))], whereas AAC and CAA require two mutations.

Alexandros Stamatakis

Nov 29, 2016, 9:10:12 AM
to ra...@googlegroups.com
Good point. I thought that you had the actual haplotypes, i.e., you
know whether it is AAC and CAA or CAC and AAA, for instance. If this is
not the case, then you might have to enumerate all possible
incarnations of MAM, i.e., AAA, AAC, CAA, CAC, and re-calculate the weights
accordingly. It depends, of course, on how many such possible combinations
you really have in your data.
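
The enumeration step could be sketched as follows. Only the two-base IUPAC codes are expanded, and the function name `enumerate_resolutions` is hypothetical. How the weights should be re-calculated afterwards is left open here, since that depends on the data.

```python
from itertools import product

# Two-base IUPAC codes and the nucleotides they stand for.
HET_CODES = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}


def enumerate_resolutions(column):
    """List every unambiguous column compatible with the given IUPAC column."""
    choices = [HET_CODES.get(code, code) for code in column]
    return ["".join(combo) for combo in product(*choices)]


resolutions = enumerate_resolutions("MAM")
print(resolutions)  # ['AAA', 'AAC', 'CAA', 'CAC']
```

The number of resolutions doubles with every heterozygous sample in a column, which is why the number of such columns in the data matters.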

alexis