Attached is a very small example dataset:
2 FASTA files, each containing the protein sequences of the 6 taxa for one of two distinct single-copy orthologs. I’ve added species IDs to the header lines to show that the order of the species in each FASTA is the same;
2 NEXUS files generated by performing MSA on each of these two FASTA files.
If I read these NEXUS files directly into PAML, it assumes there are 12 taxa, since the gene IDs are used as the sequence identifiers in the NEXUS files.
So I’m wondering: if I want to use data from all of these orthologs when building the tree, how should I perform the MSA step so that the resulting NEXUS files (or FASTA; the file format is irrelevant) keep the taxon labels consistent across genes?
I could just loop through all the FASTA files, and strip off the unique gene IDs (which were originally taken from those species’ GFF annotations), and replace them with a consistent string corresponding to the species. But this feels a bit hacky, and not standard practice.
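To make the renaming idea concrete, here is a minimal sketch of what I have in mind. It relies only on the fact that the species appear in the same order in every FASTA file, and replaces the i-th header with the i-th species label. The species labels, the directory names (`orthologs`, `relabelled`) and the function name `relabel_fasta` are all placeholders I made up for illustration, not anything from my real data.

```python
from pathlib import Path


def relabel_fasta(text, species):
    """Replace the i-th FASTA header in `text` with `species[i]`.

    Assumes every file lists the taxa in the same order as `species`.
    """
    out = []
    idx = 0
    for line in text.splitlines():
        if line.startswith(">"):
            if idx >= len(species):
                raise ValueError("more sequences than species labels")
            # Drop the gene ID entirely and keep only the species label.
            out.append(">" + species[idx])
            idx += 1
        else:
            out.append(line)
    if idx != len(species):
        raise ValueError("fewer sequences than species labels")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical labels and paths; adjust to the real dataset.
    species = ["sp1", "sp2", "sp3", "sp4", "sp5", "sp6"]
    in_dir = Path("orthologs")
    out_dir = Path("relabelled")
    out_dir.mkdir(exist_ok=True)
    for fasta in sorted(in_dir.glob("*.fasta")):
        relabelled = relabel_fasta(fasta.read_text(), species)
        (out_dir / fasta.name).write_text(relabelled)
```

The relabelled files would then go through the MSA step as before, so that every alignment uses the same 6 taxon names. The sketch raises an error if a file does not contain exactly as many sequences as there are labels, since a silent mismatch would assign sequences to the wrong species.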
Thanks for any suggestions!
Best,
Scott