Inquiry Regarding Positive Selection Analysis Workflow

Hyeonseon Park

Jun 24, 2024, 7:52:35 AM

to PAML discussion group

Dear PAML discussion group,

I hope this message finds you well. I am currently performing a positive selection gene analysis and would greatly appreciate your expertise on a few points regarding my workflow. Below is a summary of the steps I have taken:

1. Identified 398 single-copy genes across 11 species using OrthoFinder.
2. Used MAFFT for alignment of the CDS sequences of each orthologous gene and verified with pal2nal.
3. Trimmed the codon alignments using Gblocks.
4. Created PHYLIP input files from the alignments of genes where more than 50% of the alignment block remained (223 genes).
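
A minimal sketch of the retention filter described in the last step, assuming the untrimmed and Gblocks-trimmed alignments are both available as FASTA text. The function names and the strict "more than 50%" comparison are illustrative assumptions, not part of any of the tools mentioned above:

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence}.

    Whitespace inside sequence lines is dropped, since some trimmers
    write sequences in space-separated blocks.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append("".join(line.split()))
    return {n: "".join(parts) for n, parts in records.items()}


def alignment_length(records):
    """Return the number of columns, checking that all rows agree."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length; not an alignment")
    return lengths.pop()


def retained_fraction(untrimmed, trimmed):
    """Fraction of alignment columns that survived trimming."""
    return alignment_length(trimmed) / alignment_length(untrimmed)


def passes_filter(untrimmed, trimmed, min_fraction=0.5):
    """Keep a gene only if MORE than min_fraction of its columns remain."""
    return retained_fraction(untrimmed, trimmed) > min_fraction
```

Applying `passes_filter` to each gene's pair of alignments would give the subset of genes to convert to PHYLIP; whether the comparison should be strict or inclusive at exactly 50% is a choice to state when reporting the filter.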

I have two specific questions:

1. Is there any issue with the above workflow? Particularly, is it acceptable that the number of single-copy orthologs has been reduced to 223 genes for the analysis?
2. For constructing the gene tree, should I use RAxML to build the tree from the DNA sequences of the 223 single-copy orthologs, or is it better to use the species tree provided by OrthoFinder that is constructed from single-copy orthologs?

Thank you very much for your time and assistance.

Best regards,

Hyeonseon Park

Janet Young

Jun 24, 2024, 4:36:55 PM

to PAML discussion group

hi Hyeonseon,

I'm sure others will have advice too, but here are my thoughts:

1. yes, the workflow looks fine to me. Your "is it acceptable" question very much depends on the overall goals of your project: think about what hypothesis you are testing, or what question you are asking, and think about whether 223 genes is OK for that purpose.

2. you will probably get different answers to this from different people. I prefer to use the tree built from each alignment rather than the accepted species tree, mostly because it should be more robust to issues like incomplete lineage sorting, recombination/gene conversion (and possibly horizontal transfer, for some species sets). Others prefer the species tree to avoid noisy or incorrect tree topologies: for example, alignments that are very short or very rapidly evolving might give bad trees.

Janet

Sandra AC

Jun 25, 2024, 6:57:07 AM

to PAML discussion group

Hi everyone,

I completely agree with everything that Janet has discussed!! :) I just wanted to add some thoughts on the workflow (although they are not directly related to the PAML programs) in case other users encounter this thread in the group.

I like thinking of the workflow to generate the input sequence and tree files in the following way (see also section 2.1 in Álvarez-Carretero et al. [2020] where we discussed this workflow):

Data collection:
  - If the species you want to study have had their genome (or the gene/s you are interested in studying) already sequenced, you may start your analysis by downloading your sequence/s from database/s where they are available.
  - Otherwise, you will have to collect samples from your species of interest, send them for sequencing, and carry out a bioinformatics analysis to process and filter your sequences.
Data filtering:
  - It does not matter whether you have downloaded your sequences or whether you have processed your own samples: you will always need to apply further filters depending on your biological question/s.
Inferring sequence alignment:
  - Once you have filtered your dataset, you are ready to infer the sequence alignment. You can decide whether you want to concatenate or partition your alignment.
Model selection:
  - You may want to run a model selection analysis to figure out which model best fits your data.
Inferring phylogeny:
  - Once you have your alignment and you have found out which model best fits your data, you can proceed to infer your phylogeny.

Note that there are many tools that you can use for every single step in this general workflow, and you may decide to choose one tool or another depending on many factors. E.g.:

- The type of data you have (e.g., nucleotide, amino acid, codon data, morphological discrete/continuous data, recoded data, etc.).
- Restrictions you want to apply to keep sequences in an alignment (e.g., % of missing data allowed, number of codons/orthologs allowed, minimum number of taxa in an alignment, etc.).
- Biological question you want to ask (e.g., does concatenating/partitioning my sequence alignment affect X? Do you want to test the effect of different priors on Y? Do you want to check the effect of taxon sampling and how this may affect Z? Do you want to account for ILS, coalescence,...? Do you want to benchmark different tools for task W? Etc.).
- And many others...
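
As an illustration of the kind of restrictions listed above, here is a minimal sketch that drops sequences with too much missing data and then rejects an alignment left with too few taxa. It assumes nucleotide data held in a plain dictionary; the thresholds, the set of "missing" characters, and the function names are all hypothetical choices that would need to be justified for a real dataset:

```python
# Characters counted as missing data (an assumption for nucleotide alignments;
# "N" would be a valid residue in an amino acid alignment).
MISSING = set("-?Nn")


def missing_fraction(seq):
    """Fraction of positions in one aligned sequence that are missing."""
    if not seq:
        return 1.0
    return sum(ch in MISSING for ch in seq) / len(seq)


def filter_alignment(alignment, max_missing=0.5, min_taxa=4):
    """Apply two simple filters to an alignment given as {taxon: sequence}.

    1. Remove taxa whose sequence has more than max_missing missing data.
    2. Return None if fewer than min_taxa taxa remain, else the kept taxa.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) > 1:
        raise ValueError("sequences differ in length; not an alignment")
    kept = {
        name: seq
        for name, seq in alignment.items()
        if missing_fraction(seq) <= max_missing
    }
    if len(kept) < min_taxa:
        return None
    return kept
```

Recording the thresholds used alongside the results makes it easier for other people to reproduce the filtering.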

For these reasons, it is very hard to say "use tool X or Y" for a specific task: every researcher may follow a pipeline that they have designed themselves or that has already been set up in their lab. There is no right or wrong in using one tool or another, concatenating or partitioning your dataset, using the species/consensus/gene trees, etc. If you have broadly followed the general workflow, have applied the filters you could think of to "clean" your data (and can justify them and provide instructions so that other people can reproduce your results without problems), and are happy with the sequence alignment and phylogeny you have inferred because they let you test your biological hypothesis or gain insight into your main research question/s... I would say you are ready to go!

There are lots of tools out there to help users answer their biological questions. Nevertheless, users are ultimately responsible for understanding their datasets and for making sure that they use the tools they choose correctly. All programs will (in general) generate one or more output files, which users need to understand to make sure that the results are sensible given the data they used, the filters they applied, and the options they chose to run the program (reading the documentation, while it may take a while, is key to using tools properly!). If the output file/s display warnings/errors, users have to pay attention to what these mean: they can pinpoint issues with the input files, control files, options chosen, etc., which is key to understanding what may have gone wrong and what can be done to fix it.

Other people here may also want to add their own thoughts, which I am sure will complement what has already been discussed :)

Hope this helps!
S.
