Hey Marc (and Misko, for when you get around to it) and others who I thought might appreciate this (okay, I'm just CCing the google group),
At the bottom is a test genome that I think pretty clearly highlights a number of boundary cases. If you think I have an error in my code (entirely possible), please let everyone know. :)
DEBUG GUIDE:
- If you have 8 CCC's and 7 GGG's, you're not counting ORF codons on the - strand correctly (they have to be read off the reverse complement, not the + strand).
- If you get any CCG's, CGG's, GCC's, etc., you're not properly handling the frame inside of ORFs.
- If you don't have 8 AAA's, or have any ORF AAA's, you're not handling the non-ORF regions or boundaries correctly.
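For anyone who wants to check the first two points by hand, here is a minimal Python sketch of the ORF-side counting. It assumes ORF coordinates are 1-indexed, inclusive, and given on the + strand, and that a - strand ORF is read as the reverse complement of that span. The names `revcomp` and `orf_codons` are made up for illustration and are not taken from anyone's actual code.

```python
from collections import Counter

GENOME = "AAAAAAAAATGCCCGGGCCCGGGTTAGGGCCCTAACCCGGGCCCGGGCCCCATAAATGCCCGGGTAAAAAA"
# (start, end, strand), 1-indexed inclusive, copied from the ORF list in this post
ORFS = [(9, 35, "+"), (56, 67, "+"), (24, 53, "-")]

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string (hypothetical helper)."""
    return seq.translate(_COMPLEMENT)[::-1]

def orf_codons(genome, start, end, strand):
    """Split one ORF into codons, in frame from its own start codon."""
    seq = genome[start - 1:end]      # convert to 0-indexed, half-open slice
    if strand == "-":
        seq = revcomp(seq)           # read the - strand 5' to 3'
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]

orf_counts = Counter()
for start, end, strand in ORFS:
    orf_counts.update(orf_codons(GENOME, start, end, strand))

for codon in sorted(orf_counts):
    print(codon, orf_counts[codon])
```

Dropping the `revcomp` call is enough to flip the totals to 8 CCC's and 7 GGG's, which is the first symptom in the list.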
MY RELEVANT OUTPUT:
ORFs (1-indexed, inclusive):
9 35 +
56 67 +
24 53 -
non-zero codon counts (ORF, not-ORF):
AAA: 0 8
ATG: 3 0 # the starts of the three ORFs
CCC: 7 0
GGG: 8 0
TAA: 3 0 # the stops of the three ORFs
TTA: 2 0
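
The post does not spell out how the not-ORF column is counted. One rule that reproduces the 8 AAA's is to count every 3-base window on the + strand (all three frames) that lies entirely outside every ORF. That rule is an inference from the numbers above, not a stated spec, and `non_orf_windows` is a hypothetical name. A rough Python sketch under that assumption:

```python
from collections import Counter

GENOME = "AAAAAAAAATGCCCGGGCCCGGGTTAGGGCCCTAACCCGGGCCCGGGCCCCATAAATGCCCGGGTAAAAAA"
# (start, end), 1-indexed inclusive; strand does not matter for coverage
ORF_SPANS = [(9, 35), (56, 67), (24, 53)]

def non_orf_windows(genome, spans):
    """Count 3-base windows that do not touch any ORF base (assumed rule)."""
    covered = [False] * len(genome)
    for start, end in spans:
        for i in range(start - 1, end):
            covered[i] = True
    counts = Counter()
    for i in range(len(genome) - 2):
        if not any(covered[i:i + 3]):   # window must be wholly outside ORFs
            counts[genome[i:i + 3]] += 1
    return counts

non_orf_counts = non_orf_windows(GENOME, ORF_SPANS)
print(dict(non_orf_counts))
```

Under this rule the two uncovered bases at 54-55 contribute nothing, since no full window fits between the end of the - strand ORF and the start of the second + strand ORF.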
non-zero codon frequencies (ORF, not-ORF):
AAA: 0.00% 100.00%
ATG: 13.04% 0.00%
CCC: 30.43% 0.00%
GGG: 34.78% 0.00%
TAA: 13.04% 0.00%
TTA: 8.70% 0.00%
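
The ORF percentages are just each count over the 23 ORF codons (9 + 4 + 10 across the three ORFs). A small Python sketch of that arithmetic, with the counts copied from the table above:

```python
# ORF codon counts copied from the "non-zero codon counts" table
orf_counts = {"ATG": 3, "CCC": 7, "GGG": 8, "TAA": 3, "TTA": 2}

total = sum(orf_counts.values())  # 23 ORF codons in total
orf_freqs = {codon: 100 * n / total for codon, n in orf_counts.items()}

for codon in sorted(orf_freqs):
    print(f"{codon}: {orf_freqs[codon]:.2f}%")
```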
non-zero AA frequencies:
Gly: 34.78%
Leu: 8.70%
Met: 13.04%
Pro: 30.43%
Stop: 13.04%
Longest ORF (bp): 30
Shortest ORF (bp): 12
Mean ORF length (bp): 23.0
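
The length figures follow directly from the inclusive coordinates (length = end - start + 1, stop codon included). A quick Python check, using the ORF list from this post:

```python
# (start, end, strand), 1-indexed inclusive, copied from the ORF list
ORFS = [(9, 35, "+"), (56, 67, "+"), (24, 53, "-")]

# Inclusive coordinates, so add 1; lengths include the stop codon
lengths = [end - start + 1 for start, end, _strand in ORFS]

longest = max(lengths)
shortest = min(lengths)
mean_len = sum(lengths) / len(lengths)

print(longest, shortest, mean_len)
```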
TEST GENOME:
>test
AAAAAAAAATGCCCGGGCCCGGGTTAGGGCCCTAACCCGGGCCCGGGCCCCATAAATGCCCGGGTAAAAAA