codeML - Analyses with more than one gene stopped without error message

496 views
Skip to first unread message

Vinciane Mossion

unread,
Sep 11, 2023, 5:25:17 AM9/11/23
to PAML discussion group
Hi there,

I am trying to run codeML case a for about 2,000 genes. The analyses started and ran for 3 - 4 genes then stopped apparently without error messages. Here are the log-files for the analysis with a null model and with a selection model. Both analysis did not stop at the same time. Any idea about what could be wrong?

Best wishes,
Vinciane
logfile_casea_select.txt
logfile_casea_null.txt

Sandra AC

unread,
Sep 13, 2023, 5:22:39 PM9/13/23
to PAML discussion group
Hi Vinciane, 

If you are running these analyses on an HPC, perhaps you need to allocate more memory or time for the session? You may have more information in the the error and output files that are normally output from submitted jobs (e.g., those files that end with ".e" and ".o"). If you are running these analyses on your personal computer, which is the command you are using? Have you seen them stop on the terminal and a message has shown on the screen of the terminal you ran them?

Hope some of these suggestions help, but it is difficult to guess what may have happened without the data or more details about how the program was executed.

Cheers!
S.

Vinciane Mossion

unread,
Sep 15, 2023, 5:50:48 AM9/15/23
to PAML discussion group
Hi Sandra,

I am running these analyses on HPC. I allocated a node which is 128 GB memory, and the analysis stopped before the end of the requested time for the run. Unfortunately, the information in the slurm files are the same than in the log files attached in the previous message.
Here is the command I use codeml caseA-nullmodel.ctl | tee logfile_casea_null.txt . I have not tried to run it locally because I supposed it would take too long and that I had so much memory on my own machine. I can share the tree, msa and control files if it might be of use to solve this issue.

Cheers,
Vinciane

Sandra AC

unread,
Sep 15, 2023, 8:46:35 AM9/15/23
to PAML discussion group
Hi Vinciane,

I see, that is odd... But yes, if you send me your input files, I will start troubleshooting things both on my PC and an HPC :)

Cheers!
S.

Vinciane Mossion

unread,
Sep 15, 2023, 8:55:02 AM9/15/23
to PAML discussion group
Hi Sandra,

I cannot directly attached them to the message the alignment is too large.

Cheers,
Vinciane

Sandra AC

unread,
Sep 15, 2023, 9:18:51 AM9/15/23
to PAML discussion group
Hi again,

I may be wrong, but I think you have 2,413 alignment blocks in your alignment file instead of 2,414. Some tests are still running, but you may want to change `ndata = 2414` to `ndata = 2413` and run CODEML again in the meantime -- hopefully this is it!

Cheers!
S.

Vinciane Mossion

unread,
Sep 15, 2023, 10:11:02 AM9/15/23
to PAML discussion group
I had 2,414 at the beginning but I had to exclude one block. I am sorry, I forgot to modify the control file I sent you. However, I just ran again CODEML and unfortunately it stopped just the same.

Cheers,
Vinciane

Sandra AC

unread,
Sep 15, 2023, 1:10:01 PM9/15/23
to PAML discussion group
No worries, have managed to reproduce your error! Not sure whether the problem has to do with a specific alignment block or to memory having exceeded, need to investigate further. I will look into this issue, hopefully next week I can let you know what may be causing this segfault.

Have a good weekend!
S.

PAML discussion group

unread,
Sep 19, 2023, 6:26:41 AM9/19/23
to PAML discussion group
Dear Vinciane, 

After further troubleshooting your dataset by analysing individual alignment blocks, I got segfaults similar to the one you reported in a recent thread in this group (e.g., your dataset #4 seems to be one of those). I believe the problem you are experiencing has to do with the number of ambiguous codons present in some of the alignment blocks in your alignment file -- which is what is causing CODEML to abruptly terminate. Do these ambiguous sites represent heterozygotes or uncertainties in the reads? Note that CODEML does not deal with heterozygotes or within-population polymorphism. 

I have also searched on this Google group for similar problems and found this issue raised some years ago. The answer Ziheng gave is the following:

```
there is a limit to the number of ambiguous codons in the program.  it is either 256 or 128, depending on whether i have used signed or unsigned char for a codon, but it looks like it is 128.
i suspect there might be some gross errors or a mismatch between your data and the way the program interprets them.  why do you have those many ambiguous codons?  they do not look like any sequences that are generated by those sequencing technologies.  do you use YRW etc to represent heterozygotes, or uncertaines in the reads.  they look very strange since you have only 80 sites in the alignment.
best, ziheng
```

While not exactly your issue, there are some similarities (i.e., segfault and issues with the number of ambiguous codons). Following Ziheng's recommendations, you may want to evaluate what the ambiguous codons mean in your alignment blocks so that you are not misinterpreting the data.

Hope this helps!

All the best,
Sandra

Ziheng

unread,
Sep 22, 2023, 11:32:25 PM9/22/23
to PAML discussion group
hi vincent and all, 
i edited the codeml code a bit so that it will accept a maximum of 256 distinct codons in the alignment (instead of 128).  i posted the new code at the dev/ branch at the github repo.  you can download a copy and recompile.
the program will run with your data files.  however, the ambiguities in your alignments look like heterozygotes, as  Y seems to occur only in the column where other sequences have T or C,  R occurs only in the column where other sequences have A or G, and so on.  I am not sure whether they might represent base-call errors.  perhaps you need to check the procedures/pipelines used to generate the data and make sure that there are no errors. 

best wishes, 
ziheng

Vinciane Mossion

unread,
Sep 29, 2023, 6:16:56 AM9/29/23
to PAML discussion group
Hi Sandra and Ziheng,

Thank you so much for your help!

Following your suggestions, I've investigated the reason of those many ambiguous codons. The method I've used to produce multiple alignments is primarily meant to identify parental lineages of polyploid species. Then, I realized that all heterozygote sites were retained. I produced new multiple alignments keeping only bi-allelic sites which reduced the number of ambiguous codons. However, there is still a fair amount of ambiguities. My dataset contains several individuals per species, all of the same population. Some of these ambuities could then be within-population polymorphism. Also, I cannot completely exclude base-call errors.

I tried to run the program (after recompiling codeml) on the new MSA, and it stopped again on the fourth alignment (69 ambiguous codons). I am now trying with more memory.

Best wishes

Vinciane Mossion

unread,
Oct 16, 2023, 10:26:23 AM10/16/23
to PAML discussion group
Hi Sandra and Ziheng,

I could not manage to run codeml (the modified version accepting 256 distinct ambiguities) on the curate MSAs even if the limit number of ambiguous codons was not reached. We work on Salix spp. which are know to have large proportion of heterozygous sites. Could you please explain me how heterozygous sites are treated by PAML codeML? Would you advise that we reduce our dataset to one individual per species?

Best wishes,
Vinciane

Vinciane Mossion

unread,
Oct 19, 2023, 8:17:50 AM10/19/23
to PAML discussion group
Hi Sandra and Ziheng,

I ran codeML per gene and it turned out that it failed for many genes that had much less ambiguities than the maximum number. As before, I have no error messages. 

Also, I do not understand how to interpret the BEB results. Here is an example of what I've got:

Bayes Empirical Bayes (BEB) analysis (Yang, Wong & Nielsen 2005. Mol. Biol. Evol. 22:1107-1118)
Positive sites for foreground lineages Prob(w>1):

The grid (see ternary graph for p0-p1)

w0:   0.050  0.150  0.250  0.350  0.450  0.550  0.650  0.750  0.850  0.950
w2:   1.500  2.500  3.500  4.500  5.500  6.500  7.500  8.500  9.500 10.500

Posterior on the grid

w0:   1.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
w2:   0.682  0.197  0.068  0.028  0.013  0.006  0.003  0.002  0.001  0.001

Posterior for p0-p1 (see the ternary graph) (YWN2015, fig. 1)

 0.000
 0.000 0.000 0.000
 0.000 0.000 0.000 0.000 0.000
 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.001 0.001 0.410
 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.009 0.001 0.549 0.029

sum of density on p0-p1 =   1.000000


Best wishes,
Vinciane

Sandra AC

unread,
Oct 19, 2023, 7:11:49 PM10/19/23
to PAML discussion group
Hi Vinciane,

Regarding the BEB output, all the theoretical information about its calculation and application to various empirical datasets is detailed in Yang, Wong, Nielsen (2005). The BEB output that you show refers to the M2a model, I assume. I am going to go through the output file you get when running `CODEML` with the example dataset used by Yang, Wong, Nielsen (2005), and then compare that to your output:

1. Maximum-likelihood estimates (MLEs): while you do not ask about them, I find it is important to mention that you will always find the MLEs for p0, p1, (p2), w0, w1, and w2 after the line that starts with "MLEs". Note that, for M8, you will have p0, p, q, (p1), and w after the line that starts with "Parameters in M8". If you run `CODEML` with the dataset that was used by Yang, Wong, Nielsen (2005), which can be found on the PAML GitHub repository, you will reproduce the results they report on their Table 4, where you shall see that w2>1. Below, you can find the results I got when I re-ran the analyses with the latest version of `CODEML` (I added p0, p1, p2, w0, w1, and w2 for M2a within parentheses, these are not printed on the output) together with a screenshot of Table 4 in their study:

```
(M2a)                                                                         | (M8)
MLEs of dN/dS (w) for site classes (K=3)           | Parameters in M8 (beta&w>1):
                                                                                    |
p:   (p0) 0.37712  (p1)  0.44169  (p2)  0.18119   |  p0 =   0.79950  p =   0.16719 q =   0.14883
w:  (w0) 0.05998  (w1) 1.00000  (w2) 3.62564   |  (p1 =   0.20050) w =   3.47036
```
  t4.png

2. Bayesian estimates: the BEB calculation will give you Bayesian estimates for the model parameters under M2a and M8. In that way, you can compare the estimates obtained under a maximum-likelihood approach to those obtained under a Bayesian approach. To get more familiar with the "ternary graph" and "the grid" you see in the output file, however, you will need to read Yang, Wong, Nielsen (2005)

In short, the Bayesian empirical Bayes (BEB) approach assigns a prior to the model parameters (i.e., the first lines that you see (i) under header "The grid (see ternary graph for p0-p1)" for w0 and w2 and (ii) a ternary graph that is not shown here but is shown as part of Fig. 1 in Yang, Wong, Nielsen (2005) for p0 and p1) and integrates over their uncertainties.

The prior used for w0 and w2 is a uniform distribution: w0 ~ U(0,1) and w2 ~U(0,1). This distribution is discretised into 10 categories, and you can see the values used for each category in the output. For w0, from 0.05 to 0.95. For w2, from 1.5 to 10.5. Once all the calculations take place, you get the posterior estimated values for these parameters under the "Posterior on the grid" header. In this case, the posterior distribution for w0 "peaks" under the first category (i.e., w0=0.05) with a probability of 0.4. For w2, it is the third category: w2=3.5 with a probability of 0.503. The ternary graph is a bit harder to visualise, and so I have tried to "draw" the graph from 0.1 to 1.0 for both p0 (horizontal) and p1 (vertical) trying to match the representation in Fig. 1 (Yang, Wong, Nielsen (2005)) with the results output by the program -- this is the image you shall see as part of the output. Following the same procedure as with the grid, the posterior distribution seems to peak somewhere between 0.3-0.4 for p0 and between 0.4-0.5 for p1 with a probability of 0.098. I guess that, if you were to compute the posterior following the equations in their study, you would get the values they report in the main text: p0=0.37 and p1=0.47 with a probability of 0.0908 -- not sure whether the estimated values are output in a specific file, have not found them.

```
The grid (see ternary graph for p0-p1)

w0:   0.050  0.150  0.250  0.350  0.450  0.550  0.650  0.750  0.850  0.950
w2:   1.500  2.500  3.500  4.500  5.500  6.500  7.500  8.500  9.500 10.500

Posterior on the grid

w0:   0.400  0.339  0.182  0.063  0.013  0.002  0.000  0.000  0.000  0.000
w2:   0.000  0.203  0.503  0.208  0.056  0.018  0.007  0.003  0.002  0.001


Posterior for p0-p1 (see the ternary graph) (YWN2015, fig. 1)

graph.png


sum of density on p0-p1 =   1.000000
```

For your output, then I would say that, according to the grid, w0=0.05 with probability 1 and w2=1.5 with probability 0.682. According to the graph, p0 is estimated to be between 0.8 and 0.9 and p1 between 0 and 0.1 with probability 0.549.

```
The grid (see ternary graph for p0-p1)

w0:   0.050  0.150  0.250  0.350  0.450  0.550  0.650  0.750  0.850  0.950
w2:   1.500  2.500  3.500  4.500  5.500  6.500  7.500  8.500  9.500 10.500

Posterior on the grid

w0:   1.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
w2:   0.682  0.197  0.068  0.028  0.013  0.006  0.003  0.002  0.001  0.001

Posterior for p0-p1 (see the ternary graph) (YWN2015, fig. 1)

 0.000
 0.000 0.000 0.000
 0.000 0.000 0.000 0.000 0.000
 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.001 0.001 0.410
 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.009 0.001 0.549 0.029

```

The same procedure follows for understanding the output regarding analyses under M8 (you can also read the original paper with their results). If you want to check the BEB probabilities for the classes of each model for each AA, then you should find this in the output file called `rst`. You can read the PAML documentation for more details about this file :)

With regards to heterozygous sites, I am not entirely sure what is the best next thing to do as, per default, `CODEML` does not really deal with heterozygotes or within-species polymorphism -- Ziheng shall give you a better answer for that!

Hope that at least the explanation about the BEB output is helpful. Have a good one!
Sandra

Vinciane Mossion

unread,
Oct 24, 2023, 4:35:56 AM10/24/23
to PAML discussion group
Hi Sandra,

Thanks a lot for going through the BEB output and for the reference.

Regarding running CODEML, I down-sampled my dataset to one individual per species which removed large part of the heterozygous sites. However, I got a new error: Error in `/proj/snic2021-6-33/bioinformatics_tools/paml-4.10.7/bin/codeml': corrupted size vs. prev_size: 0x00000000068c6e20 ***

The error occurred on the second dataset. I ran the case C for the null model, with one maintree and all the alignments in one text file. Here attached is the log file. Do you have any suggestion regarding this error?

I noticed my tree should be unrooted, but I supposed it would not crush the program.

Best wishes,
Vinciane
logfile_casec_null.txt

Sandra AC

unread,
Oct 24, 2023, 5:20:20 AM10/24/23
to PAML discussion group
Hi Vinciane,

Good that it helped! 

With regards to your new error, it seems a memory problem (see this StackOverflow thread of messages). If you are using the compiled `CODEML` program using the source code in the `dev` branch, perhaps what was implemented to allow for more distinct codons (256 vs 128) now broke something related to memory allocation. If you are using the stable `CODEML` program compiled under the stable PAML version, then I am not really sure what may have happened as I had never seen this error. Perhaps it has to do with the settings you specified when submitting the job to the server (if you are using a server), but I am not sure. If you search for this error on the Internet, you will find more information too. Note that he tree being rooted or unrooted is not causing an error. You can read our supplementary material for the protocol for beginners where we discuss how users can decide whether using a rooted or an unrooted tree depending on the hypothesis they want to test.

If the problem is due to using the `dev` version, you may try to reduce your dataset further until heterozygous sites do not cause problems. Perhaps you can run a test to see which alignments pass the filters and can be analysed by CODEML. Then, you can just use such alignments and discard the rest.

Hope that helps!
Sandra

Vinciane Mossion

unread,
Oct 26, 2023, 6:18:10 AM10/26/23
to PAML discussion group
Hi Sandra,

I was using the compiled `CODEML` from the `dev` branch, but then I tried with the stable `CODEML` program and I got the same error. Here attached is my bash script in which you can see the settings I specified when submitting the job to the server (working on Uppmax servers). I searched for similar errors and found some information but nothing that I could or knew how to use to help me to fix the issue. I've also asked Uppmax support to look at this problem, in case it would come from something server related.

I've been using your protocol for beginners since I have started with PAML. It helped me a lot by the way :). In my case, the tree would be best unrooted, which I'll do once the analysis is running.

May I ask what you had in mind while mentioning a test to see which alignments pass the filters and can be analysed by CODEML? I mean other than proper codon alignments, no stop codons and a low number of ambiguities.

Best wishes,
Vinciane
13.6-codeML-nullm.bash

Sandra AC

unread,
Oct 26, 2023, 12:51:53 PM10/26/23
to PAML discussion group
Hi Vinciane,

I was thinking that you could submit a job array so that each alignment is analysed in independent runs. For instance, you could have your tree file, a template control file with flags (e.g., `seqfile = ALN` and `treefile = TREE`; flags are "ALN" and "TREE", which can then replaced with the corresponding paths), and an alignment in your directory. Then, you can run a loop to generate directories from `1` to `n` in each iteration, being `n` the maximum number of alignments you have in your alignment file and the max. number of iterations the loop will run. For each iteration:

a) Extract the partition that corresponds to the iteration and saves it in the corresponding `n` directory.
b) Copies the control file and replaces ALN flag with the corresponding name of the alignment (i.e., control file in same directory as alignment, name is enough).
c) You replace the flag TREE with a relative path to the tree so that you do not duplicate tree files and save disk space.

E.g., if you called your alignment file `aln.phy`, your tree file `tree.tree`, and your template control file with flags `codeml.ctl`, then saved them in the same directory, perhaps something like this may work:

# Run from main dir where you have alignment,
# tree, and control files.
count=0
pattern="^[0-9]* "
while IFS= read -r line
do
if [[ $line =~ $pattern ]]
then
count=$(( count + 1 ))
printf "\nCreate directory "$count"\n"
echo Alignment PHYLIP header":" "$line"
mkdir $count
# Add header
echo "$line" > $count/aln_$count".phy"
# Copy control file to corresponding new dir
cp codeml.ctl $count/codeml.ctl
# Find rel path to tree
name_tt=`ls *tree`
sed -i 's/TREE/\.\.\/'${name_tt}'/' $count/codeml.ctl
# Now, modify the flag with the new individual
# aln file name
sed -i 's/ALN/aln_'${count}'.phy/' $count/codeml.ctl
else
# Add alignment
echo "$line" >> $count/aln_$count".phy"
fi
done < aln.phy

If the code above works, it should create `1` to `n` directories with a control file with the correct paths to the individual alignment and the tree file. What you can do then is submit a job array that accesses directories `1` to `n` (e.g., through `$SGE_TASK_ID` if SGE scheduler or `$SLURM_ARRAY_TASK_ID` if SLURM scheduler) and runs CODEML. Those alignments that "crash", will have failed jobs and raise errors (i.e., CODEML cannot deal with these alignments as they have heterozygous sites). Nevertheless, you can then use the results with those alignments that CODEML can analyse.

With regards to the heterozygous sites, I talked with Ziheng and he said the following:

```
Heterozygotes (or polymorphisms in this case) in the alignment are mistreated as ambiguities by codeml, so that TTY in the alignment is interpreted to mean there is a codon there, either TTT or TTC, but it’s not known which one it is.  This interpretation is incorrect.  In the case of heterozygotes we know there are two codons, TTT and TTC, and both exist.

 

In the likelihood calculation, ambiguous codons such as TTY are resolved into TTT and TTC probabilistically.  Intuitively, if the other sequences have TTT at the site, this ambiguous codon TTY will be resolved into TTT with high probability and into TTC with a small probability, and the precise resolution depends on parameters in the model such as branch lengths, kappa, w, etc.  this is an over-simplification and is a bit misleading, and more proper explanation is perhaps to say that the alignments with ambiguities are incomplete data while the fully resolved data are the complete data.  Anyway the math details are explained in the section “4.2.6.2 Ambiguities and missing data” in my 2014 book.  However, all this theory makes sense only if the ambiguities represent true ambiguities, rather than heterozygotes.

Basically, there is no proper way for dealing with heterozygotes in baseml or codeml.
```

Good luck with the analyses, hope things make more sense now!

All the best,
Sandra
Reply all
Reply to author
Forward
0 new messages