[maker-devel] CDS retrieve from augustus_masked


Kang, Yang Jae

Apr 6, 2013, 3:25:40 AM
to maker...@yandell-lab.org

Dear everyone!

I want to retrieve CDS sequences from the MAKER output. However, the augustus_masked features, unlike the maker features, carry no CDS or exon annotation. Is there any way to retrieve the CDS from augustus_masked? The output directory contains protein sequences but no CDS information.

Thank you!

Kang, Yang Jae
Ph.D.
Cropgenomics Lab.
College of Agriculture and Life Science
Seoul National University
Korea

Michael Thon

Apr 6, 2013, 7:20:16 AM
to Kang, Yang Jae, maker...@yandell-lab.org
Hi Kang - After running fasta_merge there should be a file:

[prefix].all.maker.augustus_masked.transcripts.fasta

in the output directory.  Is that what you need?
Mike
_______________________________________________
maker-devel mailing list
maker...@box290.bluehost.com
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Kang, Yang Jae

Apr 6, 2013, 7:24:31 AM
to Michael Thon, maker...@yandell-lab.org

Thanks for your quick response, Mike.

I looked at the transcripts file, but I suspect it might include UTRs. I want to calculate Ka/Ks values, so I need coding sequences. Is there any indication in the transcripts file of where exactly the START and STOP codons are?

Thank you

Carson Holt

Apr 6, 2013, 9:54:15 AM
to Kang, Yang Jae, Michael Thon, maker...@yandell-lab.org
It's all CDS, from start to finish. There is never any UTR in the ab initio reference match/match_part alignments. There are two reasons for this. First, most ab initio predictors don't produce UTR. Second, GFF3 has no is_analysis flag, so it is impossible to separate final gene models from predicted gene models if both are in the form gene/mRNA/exon/CDS. Augustus can predict UTR, but given the limitation just mentioned, if I reject the model, I have to trim it before adding it to the reference information.

We've actually been in discussion with the apollo development group over this limitation.  Original apollo found the same limitation, so they make the same assumption for loading data into the browsing window (gene/mRNA/exon/CDS features always go in the middle annotation track and everything else goes in the reference evidence track).  With the new web apollo, we're working on getting the default behavior to allow UTR in the gene predictions by using the SO predicted gene term in the GFF3 (which previously wasn't available for use in apollo and maker).

So, in summary: nothing but CDS for now, but the sequence will include UTR when available in the near future.

Thanks,
Carson

Michael Thon

Apr 6, 2013, 10:37:28 AM
to Kang, Yang Jae, maker...@yandell-lab.org
That's a good point, because 'transcripts' implies that the file would contain the UTRs. Does Augustus predict UTRs? I manually checked the translations of the .transcripts. file and found only valid translations, but that does not mean UTRs could not be present...

Carson Holt

Apr 6, 2013, 11:13:16 AM
to Michael Thon, Kang, Yang Jae, maker...@yandell-lab.org
Augustus only predicts UTR for a handful of organisms.  I trim them off the rejected models before outputting to the GFF3 as match/match_part features (per my previous e-mail concerning the limitations of GFF3).
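
For readers who want to check this on their own output, here is a minimal Python sketch (not part of MAKER) of the kind of sanity check Mike describes. It assumes the sequences in the augustus_masked transcripts file are plain nucleotide FASTA records and that a complete CDS is a multiple of three in length, starts with ATG, and ends with a standard stop codon. The function names `read_fasta` and `cds_report` are made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard nuclear code only (assumption)


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def cds_report(seq):
    """Report whether a sequence looks like a complete CDS."""
    seq = seq.upper()
    return {
        "in_frame": len(seq) % 3 == 0,
        "has_start": seq.startswith("ATG"),
        "has_stop": seq[-3:] in STOP_CODONS,
    }


if __name__ == "__main__":
    # Toy records; with real data, pass an open file handle to read_fasta().
    demo = [">model_1", "ATGAAA", "TAA", ">model_2", "CCCAAAT"]
    for name, seq in read_fasta(demo):
        print(name, cds_report(seq))
```

A record that fails `has_start` or `has_stop` is not necessarily carrying UTR; it may simply be a partial model, so the check narrows down which records to inspect rather than proving anything.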

 --Carson

Kang, Yang Jae

Apr 6, 2013, 2:45:02 PM
to Carson Holt, maker...@yandell-lab.org

Thank you for the quick response again!

I found sequences that do not start with ATG in the transcript file. I thought these might be traces of UTR, and I also found an offset value some way after the ‘>’ character in the header. Does that indicate the position of the starting ATG?

Secondly, there are several files named *.augustus_masked.proteins.fasta, *.non_overlapping_ab_initio.proteins.fasta, and *.proteins.fasta. What are the criteria for splitting those files? I ask because some genes are redundant between *.augustus_masked.proteins.fasta and *.proteins.fasta.

Thank you

Barry Moore

Apr 6, 2013, 4:50:29 PM
to Kang, Yang Jae, maker...@yandell-lab.org
On Apr 6, 2013, at 12:45 PM, Kang, Yang Jae wrote:

> Thank you for the quick response again!
>
> I found sequences that do not start with ATG in the transcript file. I thought these might be traces of UTR, and

The gene predictors will occasionally produce a transcript with no start/stop codon. Set always_complete=1 in maker_opts.ctl to make MAKER try hard to force a start/stop codon.

> I also found an offset value some way after the ‘>’ character in the header. Does that indicate the position of the starting ATG?

I didn't really understand that question...

> Secondly, there are several files named *.augustus_masked.proteins.fasta, *.non_overlapping_ab_initio.proteins.fasta, and *.proteins.fasta. What are the criteria for splitting those files?

augustus_masked is a file that contains the proteins of all predictions made by Augustus when working on masked sequence. Setting unmask=1 in maker_opts.ctl would instruct MAKER to also run the gene predictors on unmasked sequence, and then you'd have an augustus_unmasked file for those predictions. The non_overlapping_ab_initio files contain proteins predicted by any of the gene predictors for which MAKER could not find protein/RNA evidence, so they are unsupported by physical evidence. These unsupported predictions are not promoted by MAKER into annotations in its final output, but they are included in these files in case you want to work with them. The non_overlapping part of the name means that if multiple gene predictors produce overlapping unsupported ab initio predictions, MAKER will only output one of them.

> I ask because some genes are redundant between *.augustus_masked.proteins.fasta and *.proteins.fasta.

Yes, the proteins for genes for which MAKER creates annotations will be in both files.


Barry Moore
Research Scientist
Dept. of Human Genetics
University of Utah
Salt Lake City, UT 84112
--------------------------------------------




Carson Holt

Apr 6, 2013, 5:00:19 PM
to Kang, Yang Jae, maker...@yandell-lab.org
> I also found an offset value some way after the ‘>’ character in the header. Does that indicate the position of the starting ATG?

Only maker.transcripts.fasta will have offsets other than 0; you can use these to find where translation starts within the transcript. All other *.transcripts.fasta files will always have an offset of 0, for the reason previously mentioned. Some genes will not start with ATG or have stop codons. These are partial models. Set always_complete=1 to reduce these.
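
As an illustration of how the offset could be used to recover a CDS for Ka/Ks work, here is a hedged Python sketch (not part of MAKER). It makes two assumptions that are not stated in this thread: that the FASTA header carries a token of the form `offset:N`, with N the number of bases preceding the first coding base, and that the CDS ends at the first in-frame standard stop codon. The header in the example and the function name `cds_from_transcript` are hypothetical.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}       # standard nuclear code (assumption)
OFFSET_RE = re.compile(r"offset:(\d+)")   # assumed header token, e.g. "offset:4"


def cds_from_transcript(header, seq):
    """Return the coding part of a transcript.

    Starts at the offset given in the header (0 if none is found) and
    stops after the first in-frame stop codon. If no in-frame stop is
    found, the rest of the sequence is returned (likely a partial model).
    """
    match = OFFSET_RE.search(header)
    offset = int(match.group(1)) if match else 0
    seq = seq.upper()
    for i in range(offset, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return seq[offset:i + 3]
    return seq[offset:]


if __name__ == "__main__":
    # Hypothetical record: 4 bases of 5' UTR, then ATG AAA TGA, then 3' UTR.
    header = "gene1-mRNA-1 transcript offset:4"
    print(cds_from_transcript(header, "GGCCATGAAATGACCC"))
```

Comparing the translation of the returned sequence against the matching entry in maker.proteins.fasta would be a sensible check before trusting the result.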


> Secondly, there are several files named *.augustus_masked.proteins.fasta, *.non_overlapping_ab_initio.proteins.fasta, and *.proteins.fasta. What are the criteria for splitting those files?

Final selected annotations go in the maker.proteins.fasta and maker.transcripts.fasta files. Raw, unfiltered ab initio predictions from Augustus go in the augustus_masked.proteins.fasta and augustus_masked.transcripts.fasta files (these are for reference purposes). A set of non-redundant rejected models goes in the non_overlapping_ab_initio.transcripts.fasta and non_overlapping_ab_initio.proteins.fasta files (if you are missing a gene you expected to find, look in these files first; you can add such models back if, for example, you find protein domains in them).


> I ask because some genes are redundant between *.augustus_masked.proteins.fasta and *.proteins.fasta.

This is because some of the Augustus-generated models made it into the final annotation set.
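
To see which Augustus predictions were promoted, one rough approach is to compare the two protein files by sequence. The Python sketch below (not part of MAKER) assumes that a promoted model keeps an identical protein sequence in both files, which may not hold if MAKER adjusted the model. The record names in the example are invented, and `parse_fasta` and `shared_proteins` are illustrative helper names.

```python
def parse_fasta(text):
    """Return {record_id: sequence} from FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line[1:].split()[0]  # ID is the first word of the header
            records[header] = ""
        elif header is not None and line:
            records[header] += line
    return records


def shared_proteins(final_text, ab_initio_text):
    """List (ab_initio_id, final_id) pairs whose protein sequences are identical."""
    by_seq = {}
    for name, seq in parse_fasta(final_text).items():
        # Ignore a trailing '*' stop marker, if present, when comparing.
        by_seq.setdefault(seq.rstrip("*"), []).append(name)
    pairs = []
    for name, seq in parse_fasta(ab_initio_text).items():
        for final_name in by_seq.get(seq.rstrip("*"), []):
            pairs.append((name, final_name))
    return pairs


if __name__ == "__main__":
    final = ">maker-1\nMKT\nLL\n>maker-2\nMAA\n"
    ab_initio = ">aug-1\nMKTLL\n>aug-2\nMCC\n"
    print(shared_proteins(final, ab_initio))
```

With real data, read each file's contents with `open(path).read()` and pass the two strings in. Augustus records with no match are the candidates to look for in the non_overlapping_ab_initio files.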


Thanks,
Carson

xu zhang

Apr 10, 2013, 12:30:38 PM
to maker...@yandell-lab.org
Hi All,

Does anybody have a GeneMark .mod file for yeast? I tried to create my own model file with the following command, where the sequence was downloaded from NCBI:

    gm_es.pl S288C_reference_sequence_R64-1-1_20110203.fsa

It failed with this error:

    warning, error in input file format:
    -3
    error reading parameter BRANCH_MAT
    error in model file /gscmnt/gc2124/info/annotation/personal_dir/xzhang/yeast/s_cerevisiae/genemark/training2/mod/es.mod
    Error on system: prediction step

and

    Error: unknown line format

I also tried the sample file (pythium_ultimum_scaffolds.fasta) from Carson. A mod file was created, although it also reported some errors:

    warning, error in input file format:
    -13
    5654 dna.fa.good.gb.acc.ph2
    first order for ACC 2
    Error: unknown line format
    GC% ntron

Any suggestions and comments are appreciated.

Thanks,
Xu

xu zhang

Apr 12, 2013, 8:47:08 AM
to maker...@yandell-lab.org, Kymberlie Hallsworth-Pepin, Michael Nhan
I know how to do that now. I tried a different initial mod file, and it worked on my sequences with the org_S1_55.0mtx initial mod. I don't know why. If somebody knows, please let me know.

Thanks,
Xu

Jason Stajich

Apr 12, 2013, 11:48:53 AM
to xu zhang, maker...@yandell-lab.org, Michael Nhan
Did you email the GeneMark authors? They would be a better source of help.
I experienced the same problems when training on yeast data and didn't use GeneMark for those species. It may be that it expects more introns, so the training files are empty on some rounds.

Jason