IQ-TREE and 6-state Dayhoff groups?

Elsa

Dec 6, 2016, 8:51:50 PM
to IQ-TREE

Hi everybody,
I have a set of proteins that seems to be saturated, and I was wondering:
1) Is it possible to reconstruct an ML tree from a protein alignment that was recoded numerically using Dayhoff groups (i.e. 6-state groups: ASTGP (1), DENQ (2), RKH (3), MVIL (4), FWY (5), C (6)) while applying a site-specific model (such as PMSF)? (The recoding was done to account for saturation and has been used in papers such as http://www.nature.com/nature/journal/v432/n7017/full/nature03149.html)

If so,
2) Since the amino acids are mapped to the groups as multi-state characters (1 to 6), is there a way to specify multi-state characters for any of those models, or how should I deal with the numeric coding?

Any reply or advice would be greatly appreciated,

Elsa

Bui Quang Minh

Dec 7, 2016, 4:52:58 AM
to iqt...@googlegroups.com, Elsa
Dear Elsa,

Yes, there is a way, as IQ-TREE supports multi-state data. First, recode your protein sequences as you described, but using the characters 0-5 (instead of 1-6). The recoded alignment can be input into IQ-TREE, but morphological models are then used by default (see http://www.iqtree.org/doc/Substitution-Models/#binary-and-morphological-models). To use mixture models (e.g. C10), you have to define the model manually in a NEXUS model file; see this thread for more details:


There we discussed the Dayhoff4 group, but the principle is the same. There is even a model file for the C60-SR4 model.
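
The recoding step described above could be sketched in Python roughly as follows. This is only an illustration, not part of IQ-TREE: the function name `recode_dayhoff6` is hypothetical, the groups are the standard Dayhoff grouping (AGPST, DENQ, HKR, ILMV, FWY, C), and the use of "?" for residues outside the table (X, B, Z, ...) is an assumption that should be checked against the input format you use.

```python
# Dayhoff 6-state groups, numbered 0-5 as suggested in the reply above.
DAYHOFF6_GROUPS = ["AGPST", "DENQ", "HKR", "ILMV", "FWY", "C"]

# Build a residue -> state lookup table from the groups.
DAYHOFF6 = {
    aa: str(state)
    for state, group in enumerate(DAYHOFF6_GROUPS)
    for aa in group
}


def recode_dayhoff6(seq, missing="?"):
    """Recode one aligned protein sequence into the states 0-5.

    Gaps ("-") are kept as they are; any other character that is not one
    of the 20 amino acids is replaced by `missing` (an assumption).
    """
    out = []
    for ch in seq.upper():
        if ch == "-":
            out.append("-")
        else:
            out.append(DAYHOFF6.get(ch, missing))
    return "".join(out)


print(recode_dayhoff6("MKC-WXA"))
```

Each sequence of the alignment would be passed through the function before the alignment is written back out; the sequence names and the alignment length stay unchanged.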

Hope that helps
Minh

--
You received this message because you are subscribed to the Google Groups "IQ-TREE" group.
To unsubscribe from this group and stop receiving emails from it, send an email to iqtree+un...@googlegroups.com.
To post to this group, send email to iqt...@googlegroups.com.
Visit this group at https://groups.google.com/group/iqtree.
For more options, visit https://groups.google.com/d/optout.

--
Bui Quang Minh
Center for Integrative Bioinformatics Vienna (CIBIV)
Campus Vienna Biocenter 5, VBC5, Ebene 1
A-1030 Vienna, Austria
Phone: ++43 1 4277 74326
Email: minh.bui (AT) univie.ac.at

Elsa

Dec 8, 2016, 9:31:32 AM
to IQ-TREE
Hi Minh,
Many thanks for your reply. I managed to understand how to create the new profiles and run a preliminary tree.
After reading Susko and Roger (2007), I decided to try SR4 (with C60) and PMSF, but I ran into a problem.

I was able to obtain a guide tree by running:
iqtree-omp -s alig_recoded.phy -nt 8 -pre iqtree -mdef IqtreeC60SR4.nex -bb 1000 -alrt 1000

But when I then try to use PMSF, I get this error: 'ERROR: No mixture model was specified!'

The command line I launched for PMSF is: iqtree-omp -s alig_recoded.phy -nt 8 -pre iqtree -mdef IqtreeC60SR4.nex -ft iqtree.treefile -b 10

This is how the defined model looks in the Iqtree_C60SR4.nex file (I didn't paste the 60 profiles to save space):
model C60SR4=GTR+G+FMIX{C60NT1,C60NT2,C60NT3,C60NT4,C60NT5,C60NT6,C60NT7,C60NT8,C60NT9,C60NT10,C60NT11,C60NT12,C60NT13,C60NT14,C60NT15,C60NT16,C60NT17,C60NT18,C60NT19,C60NT20,C60NT21,C60NT22,C60NT23,C60NT24,C60NT25,C60NT26,C60NT27,C60NT28,C60NT29,C60NT30,C60NT31,C60NT32,C60NT33,C60NT34,C60NT35,C60NT36,C60NT37,C60NT38,C60NT39,C60NT40,C60NT41,C60NT42,C60NT43,C60NT44,C60NT45,C60NT46,C60NT47,C60NT48,C60NT49,C60NT50,C60NT51,C60NT52,C60NT53,C60NT54,C60NT55,C60NT56,C60NT57,C60NT58,C60NT59,C60NT60}+F;
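
Writing such a long model line by hand is error-prone, so it could be generated with a few lines of Python. This is only a sketch: the helper `fmix_model_line` is hypothetical, and it simply reproduces the naming pattern seen in this thread (profiles called C60NT1 ... C60NT60, model called C60SR4). The frequency profiles themselves still have to be defined in the same NEXUS file.

```python
def fmix_model_line(n_classes, recoding="SR4", prefix="NT", base="GTR+G"):
    """Build a 'model NAME=BASE+FMIX{...}+F;' line for a NEXUS model file.

    The profile names follow the pattern C<n><prefix><i>, as in the
    model line quoted above; all names here are illustrative.
    """
    name = f"C{n_classes}{recoding}"
    profiles = ",".join(
        f"C{n_classes}{prefix}{i}" for i in range(1, n_classes + 1)
    )
    return f"model {name}={base}+FMIX{{{profiles}}}+F;"


print(fmix_model_line(10))
```

Calling `fmix_model_line(60)` yields a line of the same form as the C60SR4 definition above.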

Do I need to use a different command line, or can't I use PMSF with SR4?

Many thanks for your advice,

ElSa

Bui Quang Minh

Dec 8, 2016, 4:23:04 PM
to iqt...@googlegroups.com, Elsa
Hi Elsa,

You are welcome!

The "No mixture model was specified" error occurs simply because your command line lacks "-m C60SR4". Adding this option should work.

Minh


Elsa

Dec 8, 2016, 5:42:15 PM
to IQ-TREE
Dear Minh,
Thanks again for your prompt reply. I'm sorry to keep asking questions, but I forgot to mention that I had also tried that command line, which gave the following error:

iqtree-omp -s iqtree.uniqueseq.phy -nt 8 -pre iqtree -m C60SR4 -mdef  IqtreeC60SR4.nex -ft iqtree.treefile -b 100

ERROR: Unknown morphological model name

By the way, the version I'm using is 1.5.1.

Would you have any other advice?

ElSa

Bui Quang Minh

Dec 9, 2016, 5:09:03 AM
to iqt...@googlegroups.com, Elsa
Dear Elsa,

Sorry, I overlooked that. This is because GTR is not a valid model name for multi-state data. To work around this, you just need to recode into DNA characters (A, C, G, T) instead of 0-3. That is, the SR4 recoding is the following:

A,G,N,P,S,T = A; 
C,H,W,Y = C; 
D,E,K,Q,R = G; 
F,I,L,M,V = T 
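
The mapping above could be applied with a small Python helper like the one below. It is a sketch under stated assumptions: the function name `recode_sr4` is hypothetical, gaps are kept as "-", and residues outside the 20 amino acids (X, B, Z, ...) are written as "N", which is an assumption about how ambiguity should be represented.

```python
# SR4 recoding into DNA characters, exactly as listed above.
SR4_GROUPS = {
    "A": "AGNPST",
    "C": "CHWY",
    "G": "DEKQR",
    "T": "FILMV",
}

# Invert the table: amino acid -> nucleotide character.
SR4 = {aa: nt for nt, group in SR4_GROUPS.items() for aa in group}


def recode_sr4(seq, missing="N"):
    """Recode one aligned protein sequence into the four SR4 states.

    Gaps are kept; unknown residues become `missing` (an assumption).
    """
    return "".join(
        ch if ch == "-" else SR4.get(ch, missing) for ch in seq.upper()
    )


print(recode_sr4("MKC-WXA"))
```

Because the output uses the DNA alphabet, the recoded alignment is read as nucleotide data, which is what makes GTR a valid model name.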

Moreover, I recommend also trying C10 to C50 and then comparing the BIC/AIC scores to choose the model, to avoid the danger of over-parameterization.
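
For reference, the two criteria are AIC = -2 lnL + 2k and BIC = -2 lnL + k ln(n), where lnL is the log-likelihood, k the number of free parameters and n the number of alignment sites; lower is better. The Python sketch below only illustrates the comparison. The function names are hypothetical, and the numbers in `example_fits` are placeholders invented for the example, not results from any run in this thread.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: -2 lnL + 2k."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: -2 lnL + k ln(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_sites)


def pick_best(fits, n_sites):
    """Return the model name with the lowest BIC.

    `fits` maps a model name to a (log_likelihood, n_params) tuple.
    """
    return min(fits, key=lambda name: bic(*fits[name], n_sites))


# Placeholder values for illustration only (not from a real analysis).
example_fits = {"C10SR4": (-1000.0, 20), "C20SR4": (-995.0, 40)}
print(pick_best(example_fits, n_sites=500))
```

In this made-up case the small gain in likelihood of the richer model does not pay for its extra parameters, which is the over-parameterization risk mentioned above.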

Minh

Elsa

Dec 10, 2016, 2:13:46 PM
to IQ-TREE
Dear Minh,

I followed your advice and it initially worked. However, the core gets dumped and I don't see any indication of why.
The problem is not always reproducible: it can occur either while I am generating the guide tree or after I have obtained it (i.e. if I am lucky and get the guide tree, the core then gets dumped while running PMSF).

Could you please help me figure out what is happening?

Details of the problem:
CASE: The job sometimes runs and finishes properly and sometimes stops completely. It may get fixed if I raise -b from 100 to 1000, although I have previously run PMSF with -b 100 without problems.

DATA and MODEL: data recoded as SR4 states (Susko and Roger 2007), customized C10SR4 model

The following is one example where I was lucky and got the guide tree, but PMSF with -b 100 failed. The problem goes away when I increase -b to 1000 (I would like to be able to run it with only 100, as I have done with other PMSF runs).

Command line for the guide tree:
iqtree-omp -s TD2_promalsRec.uniqueseq.phy -nt 8 -m C10SR4 -pre TD2_promalsRec -mdef IqtreeC10SR4.nex -bb 1000
(Here I used -m C10SR4 because I kept getting a message that the predetermined model was under-fitting the data.)

Command line for PMSF:
iqtree-omp -s TD2_promalsRec.uniqueseq.phy -nt 8 -pre PMSFTD2_promalsRec -m C10SR4 -mdef IqtreeC10SR4.nex -ft TD2_promalsRec.treefile -b 100

Below is the output I get (after the chi2 test). By the way, I get this output regardless of whether the run failed while obtaining the guide tree or while running PMSF.

error log:
...
  96  XP_008876480_1_A_invadans                   30.63%    passed     56.58%
WARNING: 1 sequences contain more than 50% gaps/ambiguity
****  TOTAL                                                 13.92%  7 sequences failed composition chi2 test (p-value<5%; df=3)

===> COMPUTING SITE FREQUENCY MODEL BASED ON TREE FILE TD2_promalsRec.treefile
Reading model definition file IqtreeC10SR4.nex ... 48 models and 224 frequency vectors loaded
Model C10SR4 is alias for GTR+G+FMIX{C10NT1,C10NT2,C10NT3,C10NT4,C10NT5,C10NT6,C10NT7,C10NT8,C10NT9,C10NT10}+F
NOTE: 22 MB RAM is required!

system error file:

STACK TRACE FOR DEBUGGING:

*** IQ-TREE CRASHES WITH SIGNAL ILLEGAL INSTRUCTION
*** For bug report please send to developers:
***    Log file: PMSFTD2_promalsRec.log
***    Alignment files (if possible)
Illegal instruction (core dumped)


Many thanks for your help and advice,

ElSa

Bui Quang Minh

Dec 11, 2016, 1:45:57 PM
to iqt...@googlegroups.com, Elsa
Hi Elsa, 

This illegal instruction error suggests that IQ-TREE is not properly detecting the instruction set supported by your computer. Can you please add the option "-nofma" to disable the FMA instructions and let me know whether it runs?

Thanks, Minh

Elsa

Dec 11, 2016, 5:43:33 PM
to IQ-TREE
Hi Minh,
I used -nofma as you recommended (for both the guide tree and PMSF). I was able to get a guide tree, but the PMSF reconstruction failed with the error shown at the bottom of this message. Do you think the error could be related to the data itself?

Many thanks for all your advice :)

INFO:
Command line (I used -safe because I initially got an error related to the kernel):
iqtree-omp -safe -nofma -s TD2_fastarec.phy -nt 8 -pre PMSFiqtree -m C10SR4 -mdef IqtreeC10SR4.nex -ft TD2_fastarec.treefile -b 100

tail of PMSFiqtree.log:
===> CONTINUE ANALYSIS USING THE INFERRED SITE FREQUENCY MODEL

===> START BOOTSTRAP REPLICATE NUMBER 1

Creating bootstrap alignment (seed: 74768)...

Create initial parsimony tree by phylogenetic likelihood library (PLL)... 0.035 seconds
Reading model definition file IqtreeC10SR4.nex ... 48 models and 224 frequency vectors loaded
Model C10SR4 is alias for GTR+G+FMIX{C10NT1,C10NT2,C10NT3,C10NT4,C10NT5,C10NT6,C10NT7,C10NT8,C10NT9,C10NT10}+F

NOTE: 4 MB RAM (0 GB) is required!
Estimate model parameters (epsilon = 0.100)
1. Initial log-likelihood: -26233.260

Error reported:
iqtree-omp: /home/CIBIV/minh/Dropbox/iqtree-git/phylokernelnew.h:2566: double PhyloTree::computeLikelihoodBranchSIMD(PhyloNeighbor*, PhyloNode*) [with VectorClass = Vec4d; bool SAFE_NUMERIC = true; int nstates = 4; bool FMA = false; bool SITE_MODEL = true]: Assertion `!std::isnan(tree_lh) && !std::isinf(tree_lh) && "Numerical underflow for lh-branch"' failed.
STACK TRACE FOR DEBUGGING:

*** IQ-TREE CRASHES WITH SIGNAL ABORTED
*** For bug report please send to developers:
***    Log file: PMSFiqtree.log
***    Alignment files (if possible)
Aborted (core dumped)

Bui Quang Minh

Dec 12, 2016, 5:03:26 AM
to iqt...@googlegroups.com, Elsa
Hi Elsa, this sounds like a bug. Can you send me your recoded alignment, model file and the log-file via my personal email?

One question: did the run with the original protein alignment work?

Bottom line: I never tried this recoded stuff myself, thus I may have overlooked something.

Cheers, Minh

Elsa

Dec 12, 2016, 1:22:41 PM
to IQ-TREE
Minh,
I have sent the files to your email.
Many thanks for looking into this,
ElSa


Elsa

Dec 22, 2016, 1:45:59 PM
to IQ-TREE
Dear Minh,
Could you please give me some advice? I am getting an error (i.e. the core gets dumped after some replicates). It happens when I run PMSF on recoded (SR4) data. Here is the command line I launched:
iqtree-omp -s MRE11_rec.phy -safe -nofma -nt 4 -pre MRE11_PMSF_rec_C20SR4 -m C20SR4 -mdef IqtreeC20SR4.nex -ft MRE11_rec_C10SR4.treefile -b 100

***** Error ***:
iqtree-omp: /home/CIBIV/minh/Dropbox/iqtree-git/phylokernelnew.h:2566: double PhyloTree::computeLikelihoodBranchSIMD(PhyloNeighbor*, PhyloNode*) [with VectorClass = Vec4d; bool SAFE_NUMERIC = true; int nstates = 4; bool FMA = false; bool SITE_MODEL = true]: Assertion `!std::isnan(tree_lh) && !std::isinf(tree_lh) && "Numerical underflow for lh-branch"' failed.
STACK TRACE FOR DEBUGGING:
*** IQ-TREE CRASHES WITH SIGNAL ABORTED
*** For bug report please send to developers:
***    Log file: MRE11_PMSF_rec_C20SR4.log
***    Alignment files (if possible)
Aborted (core dumped)
Many thanks for your advice,
ElSa

Bui Quang Minh

Jan 5, 2017, 5:43:36 PM
to iqt...@googlegroups.com, Elsa
Dear Elsa,

Sorry for the delayed reply. Right now the PMSF model only supports amino-acid data, so it does not work with your recoded data. Please refrain from using it for now. Note that you can still use mixture models with your recoded data.

Long answer: The reason is that empirical amino-acid models (e.g. LG) have fixed exchange rates, so there is no need to optimize such model parameters. For your recoded data the GTR model is used, and IQ-TREE does not know how to optimize the GTR parameters under the PMSF scheme. I need to discuss this with my co-authors before getting back to you.

Cheers, Minh

Sergio Andrés Muñoz Gomez

Jun 9, 2017, 5:33:44 PM
to IQ-TREE, minh...@univie.ac.at
Dear Minh,

If I insist on using a 6-character-state recoding scheme, how can I use the GTR model for such a multi-state character dataset?

Thanks a lot,
Sergio

Bui Quang Minh

Jun 12, 2017, 3:00:28 AM
to Sergio Andrés Muñoz Gomez, IQ-TREE
Hi Sergio,

Right now MK and ORDERED are the only two models available for multi-state data, because I was reluctant to provide the parameter-rich GTR model. Parameter estimation might be unreliable there, especially for short alignments with little phylogenetic signal. However, if you can give a compelling reason to use GTR, I will be willing to add it.

Cheers, Minh

Sergio Andrés Muñoz Gomez

Jun 12, 2017, 11:23:15 AM
to IQ-TREE, sergi...@gmail.com, minh...@univie.ac.at
Thank you for your reply, Minh. I think the GTR model for multi-state characters would be particularly useful for those of us who are interested in using amino acid data recoded into 6 states. Some people in our lab have noted that some phylogenetic signal can be lost when amino acid data are recoded into 4 states, whereas 6 states appear to conserve more signal while still simplifying the complex signal enough to account for the heterogeneity. We usually have 'phylogenomic' alignments that are long enough. I suppose the GTR model for multi-state data would also be quite versatile, allowing different amino acid recoding schemes to be used (not only 4- or 6-state).

Cheers,
Sergio

Bui Quang Minh

Jun 13, 2017, 5:28:06 PM
to Sergio Andrés Muñoz Gomez, IQ-TREE
Hi Sergio, OK, I see. In fact, we have opened an enhancement request (https://github.com/Cibiv/IQ-TREE/issues/22) that will allow users to convert the data type automatically. This will be included in the future version 1.6. I will let you know when that happens.

Cheers, Minh

Sergio Andrés Muñoz Gómez

Jun 13, 2017, 6:33:39 PM
to Bui Quang Minh, IQ-TREE
Great, thanks for the update. 

taua...@gmail.com

Feb 7, 2018, 4:41:46 PM
to IQ-TREE
Hi Minh,

I am also running analyses with the Dayhoff-6 recoding. Does the new 1.6 IQTREE contain the GTR model for multistate data? It would be so good to use IQTREE for all of my analyses!

Thanks!

Bui Quang Minh

Feb 8, 2018, 6:49:30 PM
to iqt...@googlegroups.com
Hi Tauana, 
No, the GTR model is not yet supported for multistate data. It is not difficult to implement, but the parameter optimisation has to be tested thoroughly. In your case, the 6-state GTR model will have 19 free parameters (14 substitution rates and 5 state frequencies), which are harder to estimate accurately than those of GTR for DNA (8 parameters). That's the main reason.
However, since you are recoding protein sequences, the amount of data is much larger than for morphological data, which should provide enough information to estimate the model parameters. Therefore, we may add such a model in the near future.
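
The parameter counts quoted above follow from a simple rule: a reversible model with n states has n(n-1)/2 exchange rates, one of which is fixed to set the scale, plus n-1 free state frequencies (they sum to 1). A short Python check of that arithmetic, with a hypothetical function name:

```python
def gtr_free_parameters(n_states):
    """Count the free parameters of a GTR model with `n_states` states.

    Returns (free substitution rates, free state frequencies, total).
    One exchange rate is fixed as the reference, and the frequencies
    are constrained to sum to 1.
    """
    rates = n_states * (n_states - 1) // 2 - 1
    freqs = n_states - 1
    return rates, freqs, rates + freqs


for n in (4, 6):
    print(n, gtr_free_parameters(n))
```

For n = 4 this gives 5 rates and 3 frequencies (8 parameters), and for n = 6 it gives 14 rates and 5 frequencies (19 parameters), matching the numbers in the reply. These counts exclude branch lengths and rate-heterogeneity parameters.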

Cheers,
Minh

Tauana Junqueira da Cunha

Feb 8, 2018, 8:09:40 PM
to iqt...@googlegroups.com
Thanks Minh!
I’ll keep an eye out for the next updates.