How to set up RAxML-HPC2 partitioned AA analysis in CIPRES?

Michael Forthman

Jul 22, 2015, 11:57:36 AM
to CIPRES Science Gateway Users
Hello,

I have experience running RAxML on partitioned DNA datasets, but this is the first time I've tried to analyze a partitioned AA dataset. I'm hoping that someone can help me set up the analysis correctly.

I have an 80-gene AA matrix. The best partitioning scheme determined by PartitionFinderProtein includes 19 partitions:

LG, p1 = 1-338
JTT, p2 = 339-562, 1471-2221, 2222-2808, 4129-4575, 5221-5664, 7964-8729, 9459-9994, 9995-10542, 14073-14842
JTTF, p3 = 563-782, 1222-1470, 6664-7052, 9008-9196, 10543-11007, 15229-15711, 15712-16606, 18951-19843, 20292-20673, 21947-22316, 22623-23099, 23632-24113, 28902-29302, 31129-31522
JTT, p4 = 783-1049, 2809-3313, 4792-5220, 7554-7963, 12159-12566, 12875-13578, 14843-15228, 18445-18950, 24268-24743, 25884-26363, 30224-31128, 32114-33338
LG, p5 = 1050-1221
JTTF, p6 = 3314-3727, 11008-11343, 11803-12158, 12567-12874, 19844-20291, 25106-25883, 26364-26829, 27704-28463, 35718-36403
LG, p7 = 3728-4128, 6046-6454, 22317-22622
LG, p8 = 4576-4791
LGF, p9 = 5665-6045
LG, p10 = 6455-6663, 7394-7553, 9197-9458, 11344-11802, 17405-17676, 17980-18302, 21079-21447, 33339-34497
LG, p11 = 7053-7393
LG, p12 = 8730-9007
LG, p13 = 13579-14072, 17677-17979, 18303-18444, 23100-23631, 28464-28901, 29303-29666, 29667-30014, 30015-30223, 34683-35119
JTTF, p14 = 16607-17404
LGF, p15 = 20674-21078, 21715-21946, 24114-24267, 24744-25105, 26830-27007, 35120-35717
LG, p16 = 21448-21714
JTTF, p17 = 27008-27703
LG, p18 = 31523-32113
JTT, p19 = 34498-34682

Furthermore, the best model for each partition includes G and some include I. As far as I can tell, RAxML doesn't allow for I to be applied to specific partitions, but rather would apply to all partitions if selected.

Checking to ensure that I'm setting the analysis up correctly:
1) "Use a mixed/partitioned model? (-q)" - this would be a partition file as formatted above.
2) "Estimate proportion of invariable sites (GTRGAMMA + I)" - keep default (no) for reason mentioned above.
2) "Choose GAMMA or CAT model:" - select "Protein GAMMA".
3) I get confused at "Protein Analysis Option". My questions are: do I skip the "Protein Substitution Matrix" option and instead select "Use a Partition file that specifies AA Matrices"? How is the latter different from the -q option? If I select the latter, the help section indicates that the filenames must be specified as firstpartition, secondpartition, thirdpartition, fourthpartition, and fifthpartition, in order. Are these separate files, and what exactly is supposed to be in them?
4) "Use empirical frequencies" - some partitions include F, but if I select this option, would this include F for all partitions or just those indicating F in the partition file?

I appreciate any assistance and feedback! Cheers, Michael

Mark Miller

Jul 23, 2015, 1:42:26 AM
to CIPRES Science Gateway Users, mfor...@ucr.edu


>Furthermore, the best model for each partition includes G and some include I.
>As far as I can tell, RAxML doesn't allow for I to be applied to specific partitions,
>but rather would apply to all partitions if selected.

>>Checking to ensure that I'm setting the analysis up correctly:
>> 1) "Use a mixed/partitioned model? (-q)" - this would be a partition file as formatted above.


YES
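
Before uploading, it may be worth checking the partition file locally for overlapping or missing sites. The following is only a minimal Python sketch, not part of CIPRES or RAxML. The function names parse_partitions and check_coverage are hypothetical, and it assumes the "MODEL, name = start-end, start-end" layout shown in the question (single-site entries and strides such as 1-300\3 are not handled).

```python
import re

def parse_partitions(text):
    """Parse lines of the form 'MODEL, name = 1-338, 400-500'."""
    partitions = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        left, ranges = line.split("=", 1)
        model, name = (s.strip() for s in left.split(",", 1))
        spans = []
        for chunk in ranges.split(","):
            m = re.fullmatch(r"\s*(\d+)\s*-\s*(\d+)\s*", chunk)
            if m is None:
                raise ValueError(f"bad range {chunk!r} in partition {name}")
            start, end = int(m.group(1)), int(m.group(2))
            if start > end:
                raise ValueError(f"reversed range {chunk!r} in partition {name}")
            spans.append((start, end))
        partitions.append((model, name, spans))
    return partitions

def check_coverage(partitions, n_sites):
    """Return (overlaps, gaps) as lists of inclusive (start, end) site ranges."""
    spans = sorted(s for _, _, ss in partitions for s in ss)
    overlaps, gaps = [], []
    next_free = 1  # first site not yet claimed by any partition
    for start, end in spans:
        if start < next_free:
            overlaps.append((start, min(end, next_free - 1)))
        elif start > next_free:
            gaps.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    if next_free <= n_sites:
        gaps.append((next_free, n_sites))
    return overlaps, gaps

# Toy example (made-up coordinates, not the 80-gene matrix from the question).
example = "LG, p1 = 1-20\nJTTF, p2 = 21-30, 41-50\nLG, p3 = 31-40\n"
overlaps, gaps = check_coverage(parse_partitions(example), n_sites=50)
print("overlaps:", overlaps, "gaps:", gaps)
```

For the real file, n_sites would be the alignment length; two empty lists mean every site is assigned to exactly one partition.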

>>2) "Estimate proportion of invariable sites (GTRGAMMA + I)" - keep default (no) for reason mentioned above.

I think you can specify this in the partition file, for example as WAGI, gene3 = 31-50. I would at least try it, or ask this question on the RAxML Google group, where you will get a quick response.

>>2) "Choose GAMMA or CAT model:" - select "Protein GAMMA".
SEEMS FINE

>>3) I get confused at "Protein Analysis Option". My questions are,
>>1) do I not bother with the "Protein Substitution Matrix" option, but instead select "Use a Partition file that specifies AA Matrices"?
>>How is the latter different from the -q option?

It is different in that it is meant to be used with one or more custom AA matrices created by the user, rather than with the matrices that RAxML offers.
If you have several different matrices, you create them and upload them via the interface, and the interface names them firstpartition, secondpartition, and so on.


Files needed:
data.phy (input file)
matrix1 (custom matrix)
matrix2 (custom matrix)
matrix3 (custom matrix)
part    (partition file)


The "matrix" files have a specific format, the same as for any single custom matrix. See the manual.
The "partition" file is similar to the one used in regular partitioned analyses.

It differs only in that each line begins with the name of the matrix file instead of
the name of the model. The -q option tells the software to read this file.

A small "partition" file for this option in CIPRES looks like:

firstpartition, gene1 = 1-20
secondpartition, gene2 = 21-30
thirdpartition, gene3 = 31-50

The first part of each line says: find this file and use it as the custom matrix. You can also mix custom and built-in matrices, and use

firstpartition, gene1 = 1-20
WAGF, gene2 = 21-30
WAG, gene3 = 31-50

In this case you would upload only one custom matrix file (firstpartition).


Then the command is: ./raxmlHPC -m PROTGAMMAMTREV -q part -p 123 -s smallprotein.phy -n test


From the PROTGAMMAMTREV token, the software takes only the fact that we are talking about proteins (PROT) and the among-site rate variation part (GAMMA),
discarding the r-matrix (MTREV) part. The r-matrices are instead taken (separately for each partition) from the matrix files named
in the partition file. The among-site rate variation has to be the same for all partitions (not much of a problem).
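
To spell out what each piece of that command does, here is a short Python sketch that only assembles and prints the argument list; it does not run RAxML. The helper name build_raxml_command is hypothetical, the file names are the ones from the example command above, and on CIPRES the interface builds the command for you.

```python
import shlex

def build_raxml_command(alignment, partition_file, run_name,
                        seed=123, model="PROTGAMMAMTREV",
                        binary="./raxmlHPC"):
    """Assemble the example command as an argument list (nothing is executed)."""
    return [
        binary,
        "-m", model,           # PROT and GAMMA are used; the matrix part is overridden per partition
        "-q", partition_file,  # partition file naming a matrix or model for each partition
        "-p", str(seed),       # random number seed
        "-s", alignment,       # input alignment
        "-n", run_name,        # suffix used for the output files
    ]

cmd = build_raxml_command("smallprotein.phy", "part", "test")
print(shlex.join(cmd))
```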

>>If I select the latter, the help section indicates that the filenames must be specified as firstpartition, secondpartition, thirdpartition, fourthpartition, and fifthpartition, in order.
>>Are these separate files and what exactly is supposed to be within these files?

(see above)

>>4) "Use empirical frequencies" - some partitions include F, but if I select this option, would this include F for all partitions or just those indicating F in the partition file?

The choice made in the interface will be ignored, and the choice made for each partition in the partition file will be respected. For example:

firstpartition, gene1 = 1-20
WAGF, gene2 = 21-30
WAG, gene3 = 31-50
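
To make the per-partition behaviour concrete, here is a small Python sketch that looks at the first field of each partition line and reports whether it asks for empirical frequencies. It is an illustration, not RAxML's own parsing: the function name describe_model is hypothetical, BUILT_IN is a deliberately partial list of matrix names, and it assumes that a trailing F on a built-in name means empirical frequencies while any unrecognised name refers to a custom matrix file such as firstpartition.

```python
# Partial list for illustration only; RAxML offers more matrices than these.
BUILT_IN = {"DAYHOFF", "JTT", "LG", "MTREV", "WAG"}

def describe_model(token):
    """Classify the first field of a partition line.

    Returns (kind, name, empirical_freqs), where empirical_freqs is None
    for custom matrix files because this sketch cannot tell.
    """
    token = token.strip()
    upper = token.upper()
    # Check the full name first so that DAYHOFF is not misread as "DAYHOF" + F.
    if upper in BUILT_IN:
        return ("built-in", upper, False)
    if upper.endswith("F") and upper[:-1] in BUILT_IN:
        return ("built-in", upper[:-1], True)
    return ("custom", token, None)

for line in ["firstpartition, gene1 = 1-20",
             "WAGF, gene2 = 21-30",
             "WAG, gene3 = 31-50"]:
    model_field = line.split(",", 1)[0]
    print(model_field, "->", describe_model(model_field))
```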

Let me know if that isn't clear.
Best,
mark

Michael Forthman

Jul 23, 2015, 10:45:07 AM
to CIPRES Science Gateway Users, mmi...@sdsc.edu
Hi Mark, thank you very much for clearing up some of those confusing parts! I'll post on the RAxML group regarding P-Invar for confirmation. Cheers, Michael