Bug report ???

shaun...@phac-aspc.gc.ca

Sep 29, 2014, 6:38:17 PM

to prodigal...@googlegroups.com

If this isn’t a bug please explain what is going on.

I am using prodigal 2.60 for CDS predictions on some Neisseria meningitidis isolates. While doing the genome comparisons I came across one particular gene which should be identical in all of the isolates, but there appeared to be two different versions in the group being studied. I eventually traced this back to the start codon being chosen by prodigal. In one case an rbs_motif is detected and the start codon is placed further into the ORF (I’m using the term ORF to mean stop to stop). In the other case no rbs_motif is detected, the start codon is the first Met in the ORF, and the result is a larger protein. The confusing thing is that the sequences of the ORFs are identical. So why is prodigal picking up the rbs_motif in one case and not the other?

I’ve run a couple of these assemblies through multiple iterations of prodigal (20 repetitions each) and the results are identical for each isolate. So the behaviour is reproducible.

I’ve attached a bit of a summary, if you can make sense of it. I did find one commonality between the two groups, but it would seem unrelated. In the group with the larger CDS prediction, the assembly breaks the contig a few hundred bases upstream of the start of the ORF. In the other group the ORF is contiguous with the upstream region. I’ve been able to order the contigs in this region because the genes in question are part of the capsule synthesis pathway, and the organization is highly conserved over 3 or 4 operons.

One final note. When I run prodigal with the -n switch, the CDS predictions are the same between the two groups, with the same rbs_motif identified in all of them (although it appears to be a different motif from the one found when the -n switch is not applied).

If you agree this is odd behaviour I would be happy to share the input data if you need it to investigate and correct.

Shaun

cds_diff.txt

dhyatt1

Sep 29, 2014, 7:13:11 PM

to prodigal...@googlegroups.com

Choosing the right start codon remains a difficult problem. Sometimes, the difference between two start candidates is very minor (within 1 point of each other in the score), which means even small differences in the rest of the genomic sequence (resulting in very slightly different training files, or coding sequences that are slightly higher/lower score) can cause it to go one way or the other. This is very common behavior among all gene prediction programs; it is easy for a human to say "these starts should all be the same!", but w/o some sort of explicit ortholog finding across multiple sequences followed by picking a winning start (like the program Genome Majority Vote does), it's not as easy as it seems to get 100% agreement.

Prodigal (imo) is more internally consistent than other programs, but you will still not always see the same start codon chosen across all orthologs from the same species. (See, for example, this paper: http://www.biomedcentral.com/1471-2164/12/125 , which examines this problem in more detail, although admittedly not at the strain level).

The best way to see what is going on is to pick one example of each start codon and run with the "-s" switch to generate a starts file. This shows the complete score breakdown of all the start candidates. It probably isn't that Prodigal is "missing" any RBS motif; it is just scoring the start without the motif higher in some cases (due to a higher coding or upstream score). Looking at this file, you can see the breakdown and see why it's scoring one candidate higher than the other. My guess is the two starts will be very close in score in both files, with the shorter winning in one case and the longer in the other.

You can also make a single training file and use that for each of the isolates, which may increase the consistency somewhat, although even this won't guarantee the same start codon is always chosen (minor differences in coding or upstream sequence can still lead to a slightly lower or higher coding score, which could cause a different start to be chosen).
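As a rough sketch of the two suggestions above (a starts file per isolate, and one shared training file), the commands could be assembled along these lines. The file names are hypothetical, and the flags assume Prodigal 2.x, where -t writes a training file if the named file does not exist yet and reads it otherwise; check `prodigal -h` for your version before relying on this.

```python
import shlex
from pathlib import Path


def training_cmd(reference_fasta, training_file):
    """Command to train once on a reference genome (hypothetical paths).

    Assumes Prodigal 2.x behaviour: -t writes the training file when it
    does not exist yet.
    """
    return ["prodigal", "-i", str(reference_fasta), "-t", str(training_file)]


def predict_cmd(isolate_fasta, training_file, out_dir):
    """Command to predict genes for one isolate with the shared training
    file, also writing a starts file (-s) for later inspection."""
    stem = Path(isolate_fasta).stem
    out_dir = Path(out_dir)
    return [
        "prodigal",
        "-i", str(isolate_fasta),
        "-t", str(training_file),
        "-o", str(out_dir / f"{stem}.gbk"),
        "-s", str(out_dir / f"{stem}.starts"),
    ]


if __name__ == "__main__":
    # Print the commands rather than running them; paths are placeholders.
    print(shlex.join(training_cmd("reference.fna", "shared.trn")))
    for fasta in ["isolate_A.fna", "isolate_B.fna"]:
        print(shlex.join(predict_cmd(fasta, "shared.trn", "out")))
```

The commands are only printed here; pass each list to `subprocess.run` once the paths point at real files. Training on one reference and reusing that file removes per-isolate training differences, but as noted above it does not guarantee identical starts.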

dhyatt1

Sep 29, 2014, 7:19:42 PM

to prodigal...@googlegroups.com

I see you did note that the sequence of the ORFs is identical (presumably at the nucleotide level, not just protein). Assuming the upstream sequence (Prodigal looks upstream 45 or so bases) is also identical, the difference would have to be due to slightly different training parameters, which means standardizing the training file should solve the problem in this case (presumably... you'd have to try it out).

doug

shaun...@phac-aspc.gc.ca

Sep 29, 2014, 7:54:53 PM

to prodigal...@googlegroups.com

The ORFs are identical on the nucleotide level. There is the odd SNP amongst the entire group but nothing within the region that would be considered for rbs predictions. And no consistency between the two groups that I can see.

I didn’t mention it, but I also tried a training set generated from one strain and that didn’t change anything. I should also mention that the isolates I’m looking at are clonal: not just the same genus or species, but virtually identical. Well, pretty close. These bug (ers) do a lot of recombination, which is proving to be a big pain. But the majority of the gene content, genome organisation, etc. is consistent with comparing virtually identical isolates.

So the only thing I can see that is consistent is that the one group contains contig breaks so the CDS predictions are at the start of the contigs (I’m not allowing for partials) and the others have a contiguous sequence through this region.

Shaun

Torsten Seemann

Oct 4, 2014, 11:23:12 PM

to prodigal...@googlegroups.com

> So the only thing I can see that is consistent is that the one group contains contig breaks so the CDS predictions are at the start of the contigs (I’m not allowing for partials) and the others have a contiguous sequence through this region.

This sounds like a pretty major difference. I wonder why it isn't assembling like the others. It suggests a bad sequencing run, or a repeat. Possibly a transposon insertion?

--

--Dr Torsten Seemann
--Victorian Bioinformatics Consortium, Monash University, AUSTRALIA

--Life Sciences Computation Centre, VLSCI, Parkville, AUSTRALIA
--http://www.bioinformatics.net.au/

shaun...@phac-aspc.gc.ca

Oct 7, 2014, 11:24:13 AM

to prodigal...@googlegroups.com, torsten...@monash.edu

Pretty good guess ;-) Yes, there is an IS element just upstream which causes the break. It is actually present in the GenBank sequence I've been using as a reference. Some of my strains have it and some don't. Obviously the ones that are contiguous are lacking the IS. However, that still doesn’t seem to explain the difference in start predictions and why prodigal finds an rbs motif in one case and not the other. Again, the sequences of the ORFs are identical, and when I bypass the Shine-Dalgarno trainer (-n) the motif is found in both versions.

Doug Hyatt

Oct 7, 2014, 11:31:39 AM

to prodigal-discuss, torsten...@monash.edu

Again, this is what the "-s" option is for, to be able to see why Prodigal chooses one start over another. Prodigal is not "finding" the RBS motif in one case or "missing" it in the other. I'm assuming it finds the RBS motif in both cases; the start with the RBS is just still scoring lower in some cases. Sometimes two starts are almost tied in score (which I suspect is the case here), and the slightest thing can make a difference. Without examples, I can't really help much; my suggestion is to generate starts files with -s and just look at both potential start candidates in each case.
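For anyone wanting to do this comparison programmatically, here is a minimal sketch that reads candidate rows and returns the top scorers. It assumes whitespace-separated columns in the order Beg, End, Std, Total, CodPot, StrtSc, Codon, RBSMot, Spacer, as in the starts output quoted in this thread; `StartCandidate`, `parse_start_line` and `top_candidates` are illustrative names, not part of Prodigal, and the caller is expected to pass only the rows for a single gene (one stop codon).

```python
from typing import NamedTuple


class StartCandidate(NamedTuple):
    beg: int
    end: int
    strand: str
    total: float
    codpot: float
    strtsc: float
    codon: str
    rbs_motif: str
    rbs_spacer: str


def parse_start_line(line):
    """Parse one candidate row; return None for headers or blank lines."""
    fields = line.split()
    if len(fields) < 9 or not fields[0].isdigit():
        return None
    return StartCandidate(
        int(fields[0]), int(fields[1]), fields[2],
        float(fields[3]), float(fields[4]), float(fields[5]),
        fields[6], fields[7], fields[8],
    )


def top_candidates(lines, n=2):
    """Return the n highest-Total candidates from an iterable of lines."""
    rows = [c for c in map(parse_start_line, lines) if c is not None]
    return sorted(rows, key=lambda c: c.total, reverse=True)[:n]


if __name__ == "__main__":
    sample = [
        "Beg End Std Total CodPot StrtSc Codon RBSMot Spacer",
        "346 1500 + 36.13 33.07 3.06 ATG None None -3.32 1.83 4.54 0.377",
        "580 1500 + 35.87 23.85 12.02 ATG GGA/GAG/AGG 5-10bp 2.31 5.17 4.54 0.382",
    ]
    for cand in top_candidates(sample):
        print(cand.beg, cand.total, cand.rbs_motif)
```

Printing the top two candidates side by side for each isolate makes it easy to see whether the starts are nearly tied.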


shaun...@phac-aspc.gc.ca

Oct 7, 2014, 1:36:11 PM

to prodigal...@googlegroups.com, torsten...@monash.edu

Sorry I did do that but I'm still confused. But maybe I just don't understand the scoring.

Here are the results when the contig is fragmented.

CDS 346..1500

/note="ID=36_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.377;conf=99.98;score=36.13;cscore=33.07;sscore=3.06;rscore=-3.32;uscore=1.83;tscore=4.54;"

Beg    End    Std   Total  CodPot  StrtSc  Codon  RBSMot       Spacer  RBSScr  UpsScr  TypeScr  GCCont
307    1500   +     21.74   31.54   -9.80  TTG    None         None     -3.32    3.44    -9.92   0.379
346    1500   +     36.13   33.07    3.06  ATG    None         None     -3.32    1.83     4.54   0.377
367    1500   +     33.50   30.42    3.08  ATG    None         None     -3.32    1.85     4.54   0.378
460    1500   +     18.77   27.70   -8.93  TTG    None         None     -3.32    4.31    -9.92   0.378
466    1500   +     17.24   26.69   -9.45  TTG    None         None     -3.32    3.79    -9.92   0.378
481    1500   +     28.12   25.49    2.63  ATG    None         None     -3.32    1.40     4.54   0.378
487    1500   +     27.18   24.01    3.17  ATG    None         None     -3.32    1.94     4.54   0.379
490    1500   +     25.22   22.80    2.41  ATG    None         None     -3.32    1.19     4.54   0.379
580    1500   +     35.87   23.85   12.02  ATG    GGA/GAG/AGG  5-10bp    2.31    5.17     4.54   0.382
646    1500   +     17.68   14.23    3.45  ATG    None         None     -3.32    2.22     4.54   0.381
652    1500   +     14.88   11.07    3.81  ATG    None         None     -3.32    2.59     4.54   0.380
802    1500   +     -3.81    5.42   -9.23  TTG    None         None     -3.32    4.01    -9.92   0.369
835    1500   +      4.28    8.99   -4.70  GTG    None         None     -3.32    3.20    -4.59   0.365
841    1500   +     10.30    7.58    2.73  ATG    None         None     -3.32    1.50     4.54   0.364
895    1500   +     -1.02   -3.50    2.48  ATG    None         None     -3.32    1.75     4.54   0.360
973    1500   +     -9.79  -14.79    5.00  ATG    None         None     -3.32    4.27     4.54   0.356
985    1500   +    -23.28  -17.25   -6.03  GTG    None         None     -3.32    2.38    -4.59   0.353
1006   1500   +    -11.89  -16.88    4.99  ATG    AGxAG        5-10bp   -1.15    2.10     4.54   0.352
1183   1500   +    -36.18  -24.53  -11.66  TTG    None         None     -3.32    2.08    -9.92   0.377
1201   1500   +    -21.57  -26.57    4.99  ATG    None         None     -3.32    4.27     4.54   0.387
1291   1500   +    -22.78  -22.31   -0.48  ATG    None         None     -4.01    0.27     3.76   0.343

And this is when the region is contiguous

CDS 85550..86470

/note="ID=1_78;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.383;conf=99.99;score=39.01;cscore=27.78;sscore=11.23;rscore=2.21;uscore=5.11;tscore=4.55;"

Beg    End    Std   Total  CodPot  StrtSc  Codon  RBSMot       Spacer  RBSScr  UpsScr  TypeScr  GCCont
85277  86470  +     23.82   33.29   -9.47  TTG    None         None     -3.23    3.39    -9.63   0.380
85316  86470  +     37.79   34.48    3.31  ATG    None         None     -3.23    1.98     4.55   0.378
85337  86470  +     35.02   31.84    3.19  ATG    None         None     -3.23    1.86     4.55   0.380
85430  86470  +     20.40   29.00   -8.60  TTG    None         None     -3.23    4.26    -9.63   0.379
85436  86470  +     18.77   28.01   -9.23  TTG    None         None     -3.23    3.63    -9.63   0.380
85451  86470  +     30.03   27.06    2.97  ATG    None         None     -3.23    1.65     4.55   0.380
85457  86470  +     28.87   25.56    3.31  ATG    None         None     -3.23    1.98     4.55   0.381
85460  86470  +     26.73   24.37    2.36  ATG    None         None     -3.23    1.03     4.55   0.381
85550  86470  +     39.66   27.78   11.88  ATG    GGA/GAG/AGG  5-10bp    2.21    5.11     4.55   0.383
85616  86470  +     20.85   17.44    3.41  ATG    None         None     -3.23    2.08     4.55   0.382
85622  86470  +     18.23   14.28    3.95  ATG    None         None     -3.23    2.63     4.55   0.382
85772  86470  +     -2.59    6.31   -8.90  TTG    None         None     -3.23    3.96    -9.63   0.369
85805  86470  +      4.39    9.42   -5.03  GTG    None         None     -3.23    3.09    -4.89   0.365
85811  86470  +     10.80    8.03    2.77  ATG    None         None     -3.23    1.44     4.55   0.364
85865  86470  +     -0.67   -3.44    2.78  ATG    None         None     -3.23    1.95     4.55   0.360
85943  86470  +     -9.57  -14.46    4.89  ATG    None         None     -3.23    4.07     4.55   0.356
85955  86470  +    -22.67  -16.69   -5.98  GTG    None         None     -3.23    2.64    -4.89   0.353
85976  86470  +    -11.04  -15.50    4.46  ATG    AGxAG        5-10bp   -1.67    2.07     4.55   0.352
86153  86470  +    -35.08  -23.71  -11.37  TTG    None         None     -3.23    1.99    -9.63   0.377
86171  86470  +    -20.92  -26.16    5.23  ATG    None         None     -3.23    4.41     4.55   0.387
86261  86470  +    -23.67  -23.32   -0.35  ATG    None         None     -3.90    0.27     3.77   0.343

So yes, in the first case the CDS chosen has the highest overall score, but the start score at 580 is clearly the highest. I would have thought that would have more of an influence.
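To put numbers on the comparison, here is a small sketch using the rows for the two leading candidates, copied from the starts output in this message. In these four rows Total equals CodPot + StrtSc, and StrtSc equals RBSScr + UpsScr + TypeScr to within rounding; that is an observation about these rows only, not a statement of Prodigal's full scoring formula.

```python
# Rows copied from the starts output in this thread:
# (Beg, Total, CodPot, StrtSc, RBSScr, UpsScr, TypeScr)
FRAGMENTED = [
    (346, 36.13, 33.07, 3.06, -3.32, 1.83, 4.54),
    (580, 35.87, 23.85, 12.02, 2.31, 5.17, 4.54),
]
CONTIGUOUS = [
    (85316, 37.79, 34.48, 3.31, -3.23, 1.98, 4.55),
    (85550, 39.66, 27.78, 11.88, 2.21, 5.11, 4.55),
]


def check_row(row, tol=0.02):
    """True if the printed columns add up within rounding error."""
    beg, total, codpot, strtsc, rbs, ups, typ = row
    return (abs(codpot + strtsc - total) <= tol
            and abs(rbs + ups + typ - strtsc) <= tol)


def winner_and_margin(rows):
    """Return (Beg of the top-scoring start, lead over the runner-up)."""
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)
    return ranked[0][0], round(ranked[0][1] - ranked[1][1], 2)


if __name__ == "__main__":
    for name, rows in [("fragmented", FRAGMENTED), ("contiguous", CONTIGUOUS)]:
        beg, margin = winner_and_margin(rows)
        print(f"{name}: start {beg} wins by {margin} points")
```

On these rows the longer start (346) leads by 0.26 points in the fragmented assembly, while the shorter start (85550) leads by 1.87 points in the contiguous one. The start scores of the RBS candidate barely move (12.02 vs 11.88); most of the swing is in its CodPot column (23.85 vs 27.78).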

I'm also trying to understand why the scoring for the two versions is different when the ORF sequences are virtually identical (2 SNPs). If the scoring were based solely on analysis of the ORF, I would expect fairly similar results. That is why I made a point of mentioning the fragmented vs contiguous aspect of these assemblies. I take it the surrounding sequence must play some role as well.
