Why does Prodigal output "X" characters in predicted protein-coding sequences?

Brandon Kieft

May 16, 2016, 1:40:33 PM
to prodigal-discuss
Hello,

I am using Prodigal to predict proteins in metagenome contigs (mostly bacteria). When I supply the program with my assembled nucleotide contig fasta file, it outputs a file where several predicted proteins contain an "X" character or a string of "X"s as amino acid residues. What does this mean and is there a way to keep it from happening? It is messing with my downstream application.

Brandon

Doug Hyatt

May 16, 2016, 1:46:31 PM
to prodigal-discuss
Prodigal's protein translation routine translates any codon containing an ambiguous character (N or any other ambiguity code) into an X. At some point, I may make the routine smart enough to handle single N's in codons where possible (e.g., when the last base of a codon is ambiguous but does not affect the amino acid), but at the moment, even a single N in a codon causes that codon to be translated to X.
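
To make the behavior concrete, here is a minimal Python sketch of the two strategies described above. It is not Prodigal's actual code: the function names are hypothetical, and it assumes the standard amino acid assignments (the same ones used by the bacterial table 11). `translate_strict` mimics the current behavior, where any codon with a non-ACGT character becomes X. `translate_lenient` illustrates the possible future improvement for N's only, resolving a codon when every substitution yields the same amino acid.

```python
from itertools import product

# Standard codon table, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(dna):
    """Yield complete, upper-cased codons from a nucleotide string."""
    dna = dna.upper()
    for i in range(0, len(dna) - 2, 3):
        yield dna[i:i + 3]


def translate_strict(dna):
    """Any codon containing a non-ACGT character is translated to X."""
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons(dna))


def resolve_codon(codon):
    """Resolve N's only when every possible base gives the same residue."""
    if codon in CODON_TABLE:
        return CODON_TABLE[codon]
    if set(codon) - set("ACGTN"):
        return "X"  # other ambiguity codes are not handled in this sketch
    options = [BASES if base == "N" else base for base in codon]
    residues = {CODON_TABLE["".join(c)] for c in product(*options)}
    return residues.pop() if len(residues) == 1 else "X"


def translate_lenient(dna):
    """Hypothetical smarter translation, as speculated about above."""
    return "".join(resolve_codon(codon) for codon in codons(dna))
```

For example, `translate_strict("ATGGCNTAA")` gives `MX*`, while `translate_lenient("ATGGCNTAA")` gives `MA*`, because GCA, GCC, GCG, and GCT all encode alanine.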

The only way to completely guarantee it won't happen is to preprocess the input and replace ambiguous characters with an A, C, T, or G.  Prodigal v2 can mask long stretches of N's (50+bp, I believe) with the -m flag and won't include them in gene models.  Prodigal v3 will have more complete gap/N-handling routines (it will treat any 2-3 consecutive codons with ambiguity in them as a gap in one of its modes).
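
As a rough illustration of the preprocessing option, the Python sketch below replaces every non-ACGT character in the sequence lines of a FASTA file with a fixed base before the file is passed to Prodigal. The function name `replace_ambiguous` and the default fill base are assumptions made for this example, not part of Prodigal. Note that the substituted base is arbitrary, so the affected residues in the predicted proteins are guesses rather than real calls.

```python
import re

# Anything that is not an unambiguous base (either case) counts as ambiguous.
AMBIGUOUS = re.compile(r"[^ACGTacgt]")


def replace_ambiguous(lines, fill="A"):
    """Yield FASTA lines with ambiguous bases replaced by `fill`.

    Header lines (starting with '>') are passed through unchanged.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            yield line
        else:
            yield AMBIGUOUS.sub(fill, line)


def clean_fasta(in_path, out_path, fill="A"):
    """Write a cleaned copy of a FASTA file (paths are caller-supplied)."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in replace_ambiguous(src, fill):
            dst.write(line + "\n")
```

Calling `clean_fasta("contigs.fasta", "contigs.clean.fasta")` (file names hypothetical) would produce a file with no ambiguous characters, so no X residues would appear in the translations.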

regards,
doug
