Using data from many single copy orthologs to build single tree

Scott Brainard

Jan 23, 2023, 5:58:08 PM

to PAML discussion group

Attached is a very small example dataset:

2 FASTA files containing the protein sequences for 6 taxa, one file for each of two distinct single-copy orthologs. I’ve added species IDs to the header lines to show that the order of the species in each FASTA is the same;

2 NEXUS files generated by performing an MSA on each of these two FASTA files.

If I read these NEXUS files directly into PAML, it assumes there are 12 taxa, since the gene IDs are being used as the identifiers in the NEXUS files.

So I’m wondering: if I want to use data from all of these different orthologous sites when building the tree, how should I perform the MSA step to generate NEXUS files (or FASTA, the file format is irrelevant) that preserve this taxonomic consistency?

I could just loop through all the FASTA files, and strip off the unique gene IDs (which were originally taken from those species’ GFF annotations), and replace them with a consistent string corresponding to the species. But this feels a bit hacky, and not standard practice.
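A minimal sketch of that relabeling step, assuming only what is stated above: every FASTA lists the species in the same order, so each header can be replaced by position. The function name `relabel_fasta` and the species tags are hypothetical placeholders, not names from the attached files.

```python
def relabel_fasta(text, species_tags):
    """Replace the i-th FASTA header with species_tags[i].

    Assumes every ortholog file lists the species in the same order,
    so position alone identifies the species.
    """
    out = []
    index = 0
    for line in text.splitlines():
        if line.startswith(">"):
            if index >= len(species_tags):
                raise ValueError("more sequences than species tags")
            out.append(">" + species_tags[index])
            index += 1
        elif line.strip():
            # Keep sequence lines as they are, minus surrounding whitespace.
            out.append(line.strip())
    if index != len(species_tags):
        raise ValueError("fewer sequences than species tags")
    return "\n".join(out) + "\n"


# Hypothetical two-species example; real tags would be one per taxon.
example = ">geneA_0001 some description\nMKV\nLLA\n>geneB_0002\nMKI\n"
relabeled = relabel_fasta(example, ["sp1", "sp2"])
```

Applied to each ortholog file in turn, this gives every file the same set of identifiers, so the sequences for one species can be matched up across files.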

Thanks for any suggestions!

Best,

Scott

aln-nexus_SCO46.txt

aln-nexus_SCO45.txt

OG0007046.fa

OG0007045.fa

Sandra AC

May 10, 2023, 5:51:23 AM

to PAML discussion group

Hi Scott,

I believe you are using the wrong format for the input files. I am not sure which PAML program you are trying to run, but the input formats are quite standard: PHYLIP format for your alignment file and NEWICK format for your tree file. Also, the tags that you use for the taxa in the alignment (the IDs that identify which taxon each sequence corresponds to) need to be the same tags that you use in your tree file. Try to keep the tags short and avoid metacharacters (please read the PAML documentation provided in the package to see which symbols you should avoid in your tags), as this will make your tree file and alignment file easier to read.
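As a rough illustration of these two requirements, here is a minimal Python sketch that writes already-aligned sequences in a sequential PHYLIP-style layout and compares the alignment tags with the tip labels of a NEWICK string. The function names, tags, and tree are hypothetical, and the two-space gap between tag and sequence is an assumption about what PAML expects; please check the PAML documentation for the exact rules.

```python
import re


def to_phylip(records):
    """Format {tag: aligned_sequence} as sequential PHYLIP-style text."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length; align them first")
    nsites = lengths.pop()
    width = max(len(tag) for tag in records)
    # Header line: number of taxa, then number of sites.
    lines = [f"{len(records)} {nsites}"]
    for tag, seq in records.items():
        # Assumption: at least two spaces separate the tag from the sequence.
        lines.append(f"{tag.ljust(width)}  {seq}")
    return "\n".join(lines) + "\n"


def newick_tips(newick):
    """Collect tip labels: names that directly follow '(' or ','."""
    return set(re.findall(r"[(,]\s*([^():,;\s]+)", newick))


# Hypothetical tags and tree, for illustration only.
alignment = {"sp1": "MK-L", "sp22": "MKAL", "sp3": "MKVL"}
tree = "((sp1,sp22),sp3);"
phylip_text = to_phylip(alignment)
tags_match = set(alignment) == newick_tips(tree)
```

If `tags_match` is false, the alignment and the tree do not refer to the same set of taxa, and the tags should be fixed before running any PAML program.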

Before running PAML programs, make sure that you have an MSA, as the files you shared seem to be unaligned. There are many programs you can use to generate the alignment, so you can choose whichever is best for your dataset. For an idea of some of the options, you may want to check Table 3 in Álvarez-Carretero & dos Reis, 2018, although there are many, many more! I recommend you also read this paper to better understand the "general workflow" of phylogenomic analyses, as I believe you have skipped some of its steps (i.e., you are trying to run PAML without having an alignment or a tree hypothesis).