paml-tutorial/positive-selection/02_extra_analyses/case_c

352 views
Skip to first unread message

Romina Batista

unread,
Sep 13, 2023, 9:17:19 AM9/13/23
to PAML discussion group
Dear all, 
I am trying to test for positive selection some genes. For that I've choose codeml to help me. It is the first time I am using the program. I installed paml4.10.6 and I started by running case_c from the tutorial, since that is the scenario I have and I am interested in testing. I downloaded the repository on the HPC I usually work and tried to run from there (interactively). I have used the following command line to run (following the tutorial):
$ /home/users/rominab/software/paml-4.10.6/bin/codeml case_c.ctl | tee logfile_caseC.txt
But I am getting a weird error message that I could not understand. I can not replicate case_c. 
I would appreciate if you have any advice to help me solving it so I can go further testing my dataset. 
I have attached the log file this run generated, as well as the files (contol file, tree, and alignment) I've downloaded from paml-tutorial github and used to run this test.
Kind Regards, 
Romina Batista
logfile_caseC.txt
case_c.ctl
case_c.tree
case_c.phy

Sandra AC

unread,
Sep 13, 2023, 5:17:24 PM9/13/23
to PAML discussion group
Dear Romina, 

I have run this case again with both PAML v4.10.6 and PAML v4.10.7 and I get no errors. Perhaps the issue is related to the interactive session you are running on your HPC or memory allocated, but I am not sure. To help me troubleshoot further, can you include the error that you are getting with CODEML? I cannot see any error messages in the log file that you attach. 

Cheers!
S.

Romina Batista

unread,
Sep 14, 2023, 8:04:42 AM9/14/23
to PAML discussion group

Dear Sandra, thank you for your prompt reply! 

One of the files I sent (logfile_caseC.txt) show a different output I would expect for this analysis. The last line says: Extracting gene tree for dataset # 2 from main tree...
If I compare it with the logfile  you have made available on github, yours is a big one (mine has 109 lines), with more lines and information, ending with: Time used:  0:01
I cannot remember any "classical" error msg. But somehow I am trying to run it again (now), using both my local machine and the HPC and I have the following msg: 

# seqs in tree file does not match.  Read as the nexus format.

If I check the beginning of the logfile it says: 

CODONML in paml version 4.10.0, September 2020

I am assuming this a new feature from PAML verion 4.10.6, but I cannot understand why my machine is calling the previous version (4.10.0). 

I have activate the conda env I've prepared to install the version 4.10.6. 

If I list this env, it says I have the right version in both machines:


1) machine1 (MacBook Pro)

$ conda list -p /Users/ses317/opt/miniconda3/envs/paml-v.4.10.6

# packages in environment at /Users/ses317/opt/miniconda3/envs/paml-v.4.10.6:

#

# Name                    Version                   Build  Channel

paml                      4.10.6               h2413b67_2    bioconda


2) machine2 (HPC)

$ conda list -p /home/users/rominab/.conda/envs/paml4.10.6
# packages in environment at /home/users/rominab/.conda/envs/paml4.10.6:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
libgcc-ng                 13.2.0               h807b86a_0    conda-forge
libgomp                   13.2.0               h807b86a_0    conda-forge
paml                      4.10.6               h031d066_2    bioconda


I can only replicate the "error" if I use the full PATH where I have also the compiled the program on the HPC:

/home/users/rominab/software/paml-4.10.6/bin/codeml case_c.ctl | tee logfile_caseC.txt


I hope I made myself clear, but at the end I brought a new issue, related to conda envs? Anyway, I cannot reproduce your logfile.

Thank you,

Romina

Sandra AC

unread,
Sep 19, 2023, 6:25:59 AM9/19/23
to PAML discussion group
Dear Romina, 

Apologies for the late reply, this message somehow did not go to my inbox and just saw it now when going through the google group!

The problem is indeed related to your conda environment, not to the new PAML version -- as you have guessed, this is a new feature implemented in the CODEML program available in the latest PAML release (from v4.10.6 onward). You may want to download the latest PAML version as some bugs were recently fixed (i.e., PAML v4.10.7).

Perhaps you need to export the PATH to the new PAML version in your `~/.bashrc`? This hidden file is the one that you can use to export the paths to software you want to run without having to use the absolute path to the corresponding executable files. If you do not have a `~/.bashrc`, perhaps you will find a `~/.bash_profile` or similar. I guess the path you have exported there still points to the old PAML version and, when you call `codeml`, it is still calling the codeml executable file from an old PAML version -- but I may be wrong, you may want to check what is going on!

Hope this helps and you can use the correct PAML version in your conda envs!

All the best,
Sandra

PAML discussion group

unread,
Sep 19, 2023, 6:31:55 AM9/19/23
to PAML discussion group
Dear Sandra, thank you so much for your reply. I was actually writing an update to send here.
A quick follow up, I was able to run using the correct version (calling the full path from where I have paml in my HPC), never using conda. Never mind. 
Also, I was able to run case C, using my empirical data set (a test with only 10 genes [ca.4% of my entire dataset]), and it took about 2h to finish. I am wondering if you would advise any resource when running ca. >>100 genes.
Thank you!
Romina

PAML discussion group

unread,
Sep 19, 2023, 6:51:46 AM9/19/23
to PAML discussion group
That's great, good to know it ended up working! Try to do what I suggest with the `~/.bashrc` or `~/.bash_profile` and export the path to the directory where you have the executable files for the PAML programs -- that may be the problem and you will not need to use absolute paths :) 

Given that PAML programs cannot be parallelised, I am afraid that they will take long to run the more alignment blocks you include in your analyses. Have not tested this yet and may only be worth doing if your alignment blocks do not need to be analysed one after the other. In other words, if you are to run independent analyses for each alignment block in your alignment file -- and the only thing you are after is to get the corresponding pruned trees from the species tree --, you could do the following:

1. Run case `d` to get the `generatetrees.trees` file with as many pruned trees as alignment blocks you have in your alignment file.
2. Write a script that generates a file structure from `1` to `n`, being `n` the number of alignment blocks you have in your alignment file.
3. If you do "2" within a loop, then during the same loop iteration extract the corresponding alignment block from the alignment file (e.g., iteration 1 will extract first alignment block and save it in directory `1`).
4. Then, you can also extracts the lines from `generatetrees.tree` that refer to the corresponding pruned tree (i.e., for alignment 1, you should get the first tree block "PHYLIP header + Newick tree"). Then, output a tree file and save it in the corresponding directory (i.e., for iteration 1, it would be directory `1`).
5. Once you have the individual alignment blocks and corresponding pruned trees saved in individual directories, you can write a bash script that executes CODEML in a job array manner. In that way, you will be able to run independent CODEML analyses (one CODEML analysis per individual alignment block) as soon as resources in your HPC are available. Nevertheless, please make sure that you contact the IT services managing your HPC before you submit a "big" job array so that they can guide you with regards to how much memory you should ask for and how big the job arrays. Sometimes, they will advise you to send job arrays in batches (e.g., in batches of 50, you would then analyse your data in directories 1-50 first, then in 51-100, etc.) so you do not oversaturate the queue and other users can keep submitting their jobs :)

Hope this helps!

All the best,
S.

Romina Batista

unread,
Sep 20, 2023, 12:19:41 PM9/20/23
to PAML discussion group
Dear Sandra, thank you for insisting about the path. I carefully reviewed my .bashrc, and I realised I was also calling paPAML there, and for that software I had installed an older version of paml, my fault. I think that was the reason it was messing up the versions. Anyway, all sorted. Calling v.4.10.6 smooth from my tab. And also made my SLURM script look better with less variable. :)

Regarding the step-by-step you sent me. I appreciated that! I was trying to figured out what would be the best way. Your advice helped a lot, I will work on this workflow. I have sent job arrays recently, I think I can use the same set up (very similar to what you've suggested) to avoid being unfair with other users. It was a very kind advice. 

You made me think about my hypothesis, so I have a questions regarding to what you've suggested:
1. I have a species tree, which I've used to prepare several hypothesis I want to test (I moved the syntax for foreground species (#1), according to what I want to test). I ended with ca. 20 different files. They are actually the same tree, the only thing that differs is the position of the #1 on the tree where I want to test that specific clade for positive selection. My tree has 85 species. Not all my alignments have 85 sequences, because I lack some genes for some species. To me looks like this is the scenario of case_c. But you are saying I could try to run case_d to access the gene trees, but then I would lack some clades, and then case_d accommodates that by pruning the tree, OK. My question: Do I need to have that to run Brach-Site-Model? Am I breaking any premise if I use my species tree instead? 

Thank you very much for your time and help.
Best, 
Romina
 

Sandra AC

unread,
Sep 21, 2023, 12:22:03 PM9/21/23
to PAML discussion group
Hi Romina, 

Great to see that two problems have now been sorted! With regards to your last concern, I do not think I fully understand your situation, but let me give you some ideas according to what I think you are trying to do, but I may be wrong:

1. Taxa in one alignment block need to also be present in the corresponding tree block. In other words, if one of your block alignments does not have the 85 sequences you mention, then the species tree needs to be pruned. Some notes about the new cases we implemented to better deal with alignment blocks that miss some of the taxa present in the species tree:
  • "Case b" will run CODEML assuming that you have already tree blocks with pruned Newick trees in the tree file, and thus these tree blocks will include the same taxa as their matching alignment blocks. If you run "case d" first, you will obtain the "generatetrees.tree" file, which you can then use as the tree file to run your data under "case b".
  • "Case c" assumes that you only have a species tree but not a tree file with tree blocks with pruned trees. Therefore, before analysing your data, CODEML will read an alignment block, go to the species tree, prune the tree according to the taxa present in such alignment block, and then analyse your data under the model you have specified in the control file. After analysing this dataset, CODEML will read the next alignment block and will do the same until all alignment blocks have been analysed.
  • "Case d" only prunes the corresponding tree blocks and generates a file with all of them, but CODEML does not run.
2. If pruning the species trees will result in removing a clade that you have labelled to test under the branch or the branch-site model, you need to re-consider the hypothesis that you are testing (e.g., if the pruned tree does not include the branch that you have specified as foreground, then analysing the corresponding alignment block may not make sense if your hypothesis can no longer be tested). I paste below the warning I wrote on our GitHub repository under "case c":

```

Note that, if you run the branch or the branch-site model (i.e., you select foreground branches with the tag #1), you should not include gene alignments with missing taxa which tree topology is incompatible with these model. While sometimes this situation can be obvious and it is easy to manually check which tree topologies will be generated for each gene alignment, it gets very time consuming with more taxa and more genes. Under case cCODEML will automatically prune the "main" tree you have included in the tree file when a gene tree has missing taxa. If you have enabled a branch or a branch-site model and the pruned tree is not compatible with the branch or the branch-site model (i.e., the #1 that selects the foreground branch/es has been lost after pruning and there is no #1 left), CODEML will stop. If you want to avoid this from happening, you need to read the details about case d.

There is another important outcome of the pruning that you should consider. If you have more than one label (e.g., several branches have been labelled with #1 or you have different #X tags), some labels could disappear after pruning but others may remain. In this case, as there is at least one branch marked as foreground, CODEML will run. You must always check the results you obtain with gene alignments with missing taxa: while some labels may remain (and hence your hypothesis), others may disappear (and affect your main hypothesis).

```

3. One thing you could do is running "case d" first and see how many tree blocks keep the branch (or branches) that you have labelled as foreground to test your hypothesis/hypotheses. In that way, you can discard the alignment blocks which corresponding pruned trees do not include your foreground branch/es. Then, you could generate a new alignment file with these alignment blocks and a new tree file with only the tree blocks with your branch/es labelled as foreground and run CODEML under "case b". Alternatively, you can run CODEML in a job array manner with only those alignment block + matching tree blocks that keep the branch label to speed things up. You will need to make a decision based on what your hypothesis and whether your data can really be used to test that :)

Hope this makes sense, but please let me know if something is not clear about the newly implemented cases!

All the best,
Sandra
Reply all
Reply to author
Forward
0 new messages