Hi, I thought I would try asking my question here as NCBI was not able to give me much assistance. In preparation for submitting to NCBI, I converted my MAKER gff3 to NCBI tbl format using the gff32tbl script that Carson posted a link to in this thread (http://gmod.827538.n3.nabble.com/NCBI-feature-table-tt4040473.html#a4040475). It seemed to convert fine; however, when I run NCBI's tbl2asn program I get numerous errors in my errorsummary.val file:
4 ERROR: SEQ_FEAT.BadTrailingCharacter
217 ERROR: SEQ_FEAT.NoStop
438 ERROR: SEQ_FEAT.ShortIntron
171 ERROR: SEQ_FEAT.StartCodon
171 ERROR: SEQ_INST.BadProteinStart
291 WARNING: SEQ_FEAT.NotSpliceConsensusAcceptor
648 WARNING: SEQ_FEAT.NotSpliceConsensusDonor
118 WARNING: SEQ_FEAT.ShortExon
In addition, all of the gene, CDS, and mRNA coordinates in the resulting sqn files are decreased by one. For example, my tbl file will have gene coordinates of 440869 – 441931, but the sqn file will have 440868 – 441930. Any ideas what might be causing this?
Thanks,
Brian
The only one that may be a real error is the first one (I'm not sure what it means). You probably need to find those four features and open them in a viewer like Apollo. The rest I would consider warnings (the NCBI tool doesn't like any weirdness or uncertainty). You often have to manually edit things to get NCBI to accept all models without complaining (sometimes even going against real biology). I know some groups use the always_complete=1 option in MAKER to force start and stop codons into every model, for example (even though those forced codons are probably false).
*Not sure about this one --> 4 ERROR: SEQ_FEAT.BadTrailingCharacter
*These are partial genes with no stop (these usually occur at the edge of contigs or near strings of NNNN) --> 217 ERROR: SEQ_FEAT.NoStop
*These are just short introns (intron size is under control of the ab initio predictors) --> 438 ERROR: SEQ_FEAT.ShortIntron
*These are partial genes with no start (these usually occur at the edge of contigs or near strings of NNNN) --> 171 ERROR: SEQ_FEAT.StartCodon
*These are partial genes with no start (these usually occur at the edge of contigs or near strings of NNNN) --> 171 ERROR: SEQ_INST.BadProteinStart
*Non-canonical splicing (can be produced by the ab initio predictor or suggested by EST evidence) --> 291 WARNING: SEQ_FEAT.NotSpliceConsensusAcceptor
*Non-canonical splicing (can be produced by the ab initio predictor or suggested by EST evidence) --> 648 WARNING: SEQ_FEAT.NotSpliceConsensusDonor
*These are just short exons (exon size is under control of the ab initio predictors) --> 118 WARNING: SEQ_FEAT.ShortExon
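If you want to list the models behind the ShortIntron messages before opening them in a viewer, something like the following might help. It is only a minimal sketch: the function name find_short_introns and the 10 bp default cutoff are my own assumptions (not taken from gff32tbl or tbl2asn), and it assumes the GFF3 has explicit exon lines whose Parent attribute names the mRNA.

```python
from collections import defaultdict

def find_short_introns(gff3_lines, min_intron=10):
    """Return (mRNA_id, intron_start, intron_end, length) for short introns.

    Coordinates are 1-based and inclusive, as in GFF3.
    min_intron is an assumed cutoff; adjust it to whatever the validator enforces.
    """
    exons = defaultdict(list)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        # Column 9 is a ';'-separated list of key=value attributes.
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        # An exon shared by several transcripts lists them comma-separated.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons[parent].append((int(cols[3]), int(cols[4])))

    short = []
    for mrna, spans in exons.items():
        spans.sort()
        # An intron is the gap between two consecutive exons.
        for (s1, e1), (s2, e2) in zip(spans, spans[1:]):
            length = s2 - e1 - 1
            if length < min_intron:
                short.append((mrna, e1 + 1, s2 - 1, length))
    return short
```

You would call it with an open file handle, e.g. find_short_introns(open("my.gff3")), where my.gff3 is a placeholder file name. The same loop could be adapted to flag short exons by testing e1 - s1 + 1 instead.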
Hi Brian,
We have a tool in development to deal with this. You should not upload your MAKER output to NCBI directly; you need to filter out genes, check that things are sane, etc.
http://brianreallymany.github.io/GAG/
It is still in active development; the first full release is planned for the end of this month (if you can wait 1.5 weeks). It has no dependencies and maintains parent/child relationships (for example, if you remove a gene, it will also remove the associated CDS/mRNA). In the release planned for the end of the month, you will be able to perform functions like removing short features, removing long features, flagging things for review, etc. It also generates an updated genome.fasta file, gff3 file, and sequence files for CDS/mRNA/peptide based on the edits made. Hopefully this is helpful to you.
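To illustrate what maintaining parent/child relationships means in practice, here is a rough sketch of removing a gene together with its mRNA/CDS/exon children from a GFF3. This is not GAG's code; the function name remove_with_children is made up for the example, and it makes the simplifying assumption that a child with several parents is dropped as soon as any one of them is removed.

```python
def remove_with_children(gff3_lines, remove_ids):
    """Drop the features named in remove_ids plus everything descended from them.

    gff3_lines must be a list (it is scanned more than once).
    """
    def attrs_of(line):
        # Return the column-9 attributes as a dict, or None for non-feature lines.
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            return None
        return dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)

    # Grow the set of doomed IDs until no new descendants are found,
    # so the result does not depend on the order of lines in the file.
    doomed = set(remove_ids)
    changed = True
    while changed:
        changed = False
        for line in gff3_lines:
            attrs = attrs_of(line)
            if not attrs or "ID" not in attrs or attrs["ID"] in doomed:
                continue
            parents = attrs.get("Parent", "").split(",")
            if any(p in doomed for p in parents):
                doomed.add(attrs["ID"])
                changed = True

    kept = []
    for line in gff3_lines:
        attrs = attrs_of(line)
        if attrs is not None:
            parents = [p for p in attrs.get("Parent", "").split(",") if p]
            # Also catches children that have no ID of their own.
            if attrs.get("ID") in doomed or any(p in doomed for p in parents):
                continue
        kept.append(line)  # comments and directives pass through unchanged
    return kept
```

For example, remove_with_children(lines, {"gene_0001"}) (a placeholder ID) would return the file without that gene and without any feature whose Parent chain leads back to it.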
Scott
Just so you are not discouraged: the current version has limited functionality and is pretty much undocumented (although it will write a .tbl file). I will email the list when the first real release is complete and documented.
Scott
Hi,
I know Carson had posted a script to generate a tbl file before. If you want to do more filtering, GAG should work. If you come across any issues, please post a bug on the GitHub page.
http://genomeannotation.github.io
Also, NCBI is a bit of a moving target as to what format they currently accept. You should be able to supply a scaffold assembly, but they will have limits on how short your CDS can be, will question single-exon genes, etc. Hopefully GAG can help you get to where they are happy.
If they want a contig + agp file, you will also need to split your GFF file (we can do this, but I am not sure it is posted on the GitHub page).
Scott