Annotate and extract a single CDS by homology

OBBARD Darren

Sep 3, 2020, 12:28:22 PM
to ashworth-c...@googlegroups.com
Hi all!

I'm looking for a tool that I think should exist, but seems not to.

Imagine I have an unannotated DNA sequence (maybe even a whole genome) and I'm only interested in a single gene. I have several homologs for the gene I'm interested in (as protein sequences), although they may be distant.

Is there a tool that would make a homology-based CDS prediction for that single gene and extract the CDS? It feels like a much smaller task than annotating a whole genome, and there should be a tool that does it: basically one that runs tblastn, then tries to identify introns by splice sites, infers a good guess of the CDS, and extracts it.

What combination of tools would you use for this task?

Regards!

D

--

Darren Obbard
darren...@ed.ac.uk

Institute of Evolutionary Biology
University of Edinburgh
Ashworth Laboratories, Charlotte Auerbach Road
Edinburgh EH9 3FL

Office 0131 651 7781
Mobile: 07968 838 635

http://obbard.bio.ed.ac.uk/

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

Lewis Stevens

Sep 3, 2020, 1:43:19 PM
to ashworth-c...@googlegroups.com
Hi Darren, 

I was doing a similar thing recently and found that genewise and exonerate (command-line) both infer very accurate gene models. genewise has a web server (https://www.ebi.ac.uk/Tools/psa/genewise/) that will spit out GFF, CDS (I think) and protein, but note that you may have to reverse complement your DNA sequence beforehand, as it only predicts on one strand. I was basically doing the upstream part by hand (i.e. running TBLASTN, pulling out the region of interest (± a few kb), and then running one of the above tools), though I'm sure it could be automated if you need to do it for many genes. I tried skipping the upstream step and running exonerate across the entire genome, but it predicted genes wherever it found even a hint of homology and would sometimes fuse portions of genes together. I found the targeted approach to be much easier and more accurate.
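If the upstream step needs automating, a minimal Python sketch might look like the following. It assumes the TBLASTN hits were saved in the standard 12-column tabular format (where minus-strand hits have subject start greater than subject end), and that taking the contig and strand with the highest summed bit score is a reasonable way to pick the locus. The names `Region`, `region_from_tblastn` and `extract_region` are made up for illustration, and the 3 kb default flank is only a stand-in for "a few kb".

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Region(NamedTuple):
    contig: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def region_from_tblastn(lines: Iterable[str], flank: int = 3000) -> Region:
    """Return the padded span of hits on the best-scoring contig and strand."""
    groups = defaultdict(list)  # (contig, strand) -> [(low, high, bitscore)]
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.rstrip("\n").split("\t")
        contig = cols[1]
        sstart, send, bits = int(cols[8]), int(cols[9]), float(cols[11])
        # Assumption: subject start > subject end marks a minus-strand hit.
        strand = "+" if sstart <= send else "-"
        groups[(contig, strand)].append((min(sstart, send), max(sstart, send), bits))
    if not groups:
        raise ValueError("no hits found")
    # Pick the contig/strand whose hits have the highest total bit score.
    (contig, strand), hits = max(
        groups.items(), key=lambda kv: sum(h[2] for h in kv[1])
    )
    start = max(1, min(h[0] for h in hits) - flank)  # clamp at contig start
    end = max(h[1] for h in hits) + flank            # slicing clamps the end
    return Region(contig, start, end, strand)


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_region(contig_seq: str, region: Region) -> str:
    """Cut the region out of the contig, reverse complementing minus-strand loci."""
    sub = contig_seq[region.start - 1:region.end]
    return sub.translate(_COMPLEMENT)[::-1] if region.strand == "-" else sub
```

Because the extracted region is already reverse complemented for minus-strand loci, it can go straight into a tool that only predicts on one strand. Note that this simple span would balloon if two paralogs sat far apart on the same contig, so the chosen region is still worth checking by eye.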

My exonerate command looked something like this (depending on the gene, I sometimes had to increase the percentage identity and decrease max intron size to ensure it didn't predict fused gene models): 

exonerate --model p2g --showvulgar no --showalignment no --showquerygff no --percent 60 --showtargetgff yes --maxintron 4000 [protein.fa] [extract_genomic_region.fa] > [output.gff]
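For running this over several genes, the command above could be assembled in Python so that the percentage identity and maximum intron size are easy to tighten per gene. This is only a sketch: `exonerate_command` is a hypothetical helper, the file names are placeholders, and it uses exactly the flags from the command above with nothing added.

```python
import shlex
from typing import List


def exonerate_command(
    protein_fasta: str,
    region_fasta: str,
    percent: int = 60,
    max_intron: int = 4000,
) -> List[str]:
    """Build the exonerate argument list, using only the flags shown above."""
    return [
        "exonerate",
        "--model", "p2g",
        "--showvulgar", "no",
        "--showalignment", "no",
        "--showquerygff", "no",
        "--percent", str(percent),
        "--showtargetgff", "yes",
        "--maxintron", str(max_intron),
        protein_fasta,
        region_fasta,
    ]


def as_shell(cmd: List[str]) -> str:
    """Render the argument list as a shell-safe string, e.g. for logging."""
    return " ".join(shlex.quote(arg) for arg in cmd)


# A stricter variant for a gene where the default settings fuse neighbouring models.
strict = exonerate_command("protein.fa", "region.fa", percent=70, max_intron=2000)
print(as_shell(strict))
```

To actually run it, something like `subprocess.run(cmd, stdout=out_handle, check=True)` with `out_handle` opened on the output GFF would work, provided exonerate is on the PATH.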

To extract the predicted protein from the exonerate GFF, I used AGAT (https://github.com/NBISweden/AGAT), though there are probably a bunch of tools that would do the same thing. 
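If installing another tool is not an option, the same extraction can be sketched in plain Python. This is not what AGAT does internally; it is a minimal illustration that assumes the GFF holds a single gene model, that its coding segments are tab-separated rows with `cds` in the feature column (which should be checked against the actual exonerate output), that coordinates are 1-based on the forward strand of the extracted region, and that the standard genetic code applies. The names `cds_from_gff` and `translate` are made up for the example.

```python
from typing import Iterable

_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = dict(
    zip((a + b + c for a in _BASES for b in _BASES for c in _BASES), _AMINO)
)
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq: str) -> str:
    """Reverse complement a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def cds_from_gff(gff_lines: Iterable[str], target_seq: str, feature: str = "cds") -> str:
    """Join the coding segments of a single gene model into one CDS string."""
    intervals, strand = [], "+"
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8 or cols[2].lower() != feature:
            continue  # keep only the coding-segment rows
        intervals.append((int(cols[3]), int(cols[4])))
        strand = cols[6]
    intervals.sort()
    # Concatenate in forward-strand order, then flip once for minus-strand genes.
    cds = "".join(target_seq[start - 1:end] for start, end in intervals)
    return revcomp(cds) if strand == "-" else cds


def translate(cds: str) -> str:
    """Translate a CDS; codons with ambiguous bases become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))
```

A quick sanity check on the result is whether the translation starts with M, ends with a stop, and has no internal stops; failures there usually point to a mis-called exon boundary.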

Thanks, 

Lewis 




Sujai

Sep 3, 2020, 6:49:19 PM
to Ashworth Codemonkeys

BARKER Daniel

Sep 4, 2020, 5:28:56 AM
to Ashworth Codemonkeys
Any protein-homology-based genome annotation software (e.g. GeneWise) should give a predicted CDS in some format from a protein query and an unannotated genome, but it would not necessarily use BLAST.

BLAST should help find the region of the gene (and others in the gene family) in the DNA, which can help narrow down the search. But is a BLAST-based option going to be best to get the gene structure, for a distantly related homolog?

The GenBlastG paper below reports excellent results for annotating the C. elegans genome on the basis of C. elegans proteins, and relatively good results for annotating A. thaliana on the basis of O. sativa proteins. Still the paper says: "62.5% of genBlastG predicted A.thaliana gene models are base pair to base pair identical to curated orthologs in A.thaliana" (i.e. 37.5% are not). This is not to criticise GenBlastG - it's a difficult problem. Manual review should help.

Daniel

Dr Daniel Barker
Institute of Evolutionary Biology
School of Biological Sciences
University of Edinburgh
Charlotte Auerbach Road
The Kings Buildings
Edinburgh
EH9 3FL
United Kingdom




 