protein coding gene fasta sequence - how to, and where does the intron data come from?


mheydt

Jun 19, 2018, 11:35:55 AM
to gen...@soe.ucsc.edu
Dear Sir or Madam

I have been browsing a lot of questions and their answers in the google groups page, and found out how to use the table browser to retrieve the information I need:
FASTA sequences for known protein coding genes - without isoforms.
Although some aspects need some clarification still:

I'm selecting the human genome GRCh37 assembly, genes and gene predictions, ucsc genes from knownGene tables.

Then I need to filter - and this is where it isn't all that clear to me:
I'm first selecting fields from primary and related tables, to see what data I'm getting before I retrieve their sequences.

* How do I specify I want ONLY protein coding genes? I think I should be able to use a filter such as: description does match mRNA => but that returns nothing
* I found this thread: https://groups.google.com/a/soe.ucsc.edu/forum/#!topic/genome/RqHkHN1dQa8 where they suggest using filters like: description not like "%non-coding%" and
description not like "%miRNA%"
=> Does this ensure 100% mRNA protein coding genes?

The thread dates from 2012, and the person Steve Heifner says: "There is no simple way to filter out the isoforms as you suggest.  It would
probably be easiest to devise a post-output method of scanning and removing items with duplicate gene symbols."
Is there perhaps nowadays another UCSC track, table, or filter set that allows me to retrieve one canonical record per gene?

Even if I am stuck with multiple records per gene; I still need to retrieve the FASTA sequences.
When I select to retrieve output as 'sequence', the tool allows you to retrieve genomic sequences, and then asks for the optional regions.
When you select "introns", where does this information come from? The UCSC known genes set I am using is derived from UniProt and GenBank mRNA data. Does this mean that UCSC "pastes" the intron/UTR sequences in between the CDS sequences?

Kind regards

Mathias Heydt
PhD student
Centre of Medical Genetics
Prins Boudewijnlaan 43
B2650 Edegem
Belgium

Brian Lee

Jun 21, 2018, 2:07:39 PM
to mheydt, UCSC Genome Browser Mailing List

Dear Mathias,

Thank you for using the UCSC Genome Browser and for your question about obtaining protein-coding gene FASTA sequence without isoforms.

The desire to have a single isoform is understandable; however, there is still no simple way to filter various gene predictions and decide on one canonical version. Either the selection would be arbitrary (and thus best left to the end user) or it would have to be made by some designed methodology, which is not easily accomplished.

You can search our archives and learn about the knownCanonical approach (Steve's email describes it as well) where there is an attempt to find the longest isoform for each gene: https://groups.google.com/a/soe.ucsc.edu/forum/#!searchin/genome/knownCanonical
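
As a rough illustration of the post-output approach (keeping one record per gene symbol), here is a minimal Python sketch. It simply keeps the longest transcript span per symbol; this is not the knownCanonical procedure itself, and the record layout (dicts with `symbol`, `name`, `txStart`, `txEnd`) and the example values are hypothetical, chosen only for demonstration.

```python
def longest_per_gene(records):
    """Keep, for each gene symbol, the record with the longest transcript span.

    Each record is assumed to be a dict with the keys
    'symbol', 'name', 'txStart' and 'txEnd' (a hypothetical layout).
    Ties are resolved in favour of the record seen first.
    """
    best = {}
    for rec in records:
        length = rec["txEnd"] - rec["txStart"]
        current = best.get(rec["symbol"])
        if current is None or length > current["txEnd"] - current["txStart"]:
            best[rec["symbol"]] = rec
    return best


# Made-up records: two isoforms of GENE1, one of GENE2.
records = [
    {"symbol": "GENE1", "name": "tx1", "txStart": 100, "txEnd": 900},
    {"symbol": "GENE1", "name": "tx2", "txStart": 100, "txEnd": 1500},
    {"symbol": "GENE2", "name": "tx3", "txStart": 5000, "txEnd": 5600},
]
chosen = longest_per_gene(records)
print({symbol: rec["name"] for symbol, rec in chosen.items()})
```

Whether "longest" is the right criterion depends on your analysis; any other rule can be swapped in at the comparison step.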

You can also see how challenging this topic is by looking at external groups such as the principal splice isoforms selected by APPRIS (click into their track to see their methodology): http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&hubUrl=http://apprisws.bioinfo.cnio.es/trackHub/hub.txt
https://www.ncbi.nlm.nih.gov/pubmed/23161672

An approach for selecting the protein-coding genes is to use the free-form query on the Table Browser and set cdsStart != cdsEnd.
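
If you would rather apply the same test to a file you have already downloaded, the cdsStart != cdsEnd check can be sketched in Python as below. This assumes tab-separated Table Browser output with a header line that starts with "#" and includes `name`, `cdsStart` and `cdsEnd` columns; the two example rows are made up.

```python
import csv
import io


def coding_rows(tsv_text):
    """Yield rows whose cdsStart differs from cdsEnd.

    Assumes tab-separated text whose first line is a header,
    optionally prefixed with '#', naming the columns.
    """
    reader = csv.DictReader(io.StringIO(tsv_text.lstrip("#")), delimiter="\t")
    for row in reader:
        # Equal cdsStart and cdsEnd means no coding region is annotated.
        if int(row["cdsStart"]) != int(row["cdsEnd"]):
            yield row


# Made-up example: txA has a coding region, txB does not.
example = (
    "#name\tchrom\tstrand\ttxStart\ttxEnd\tcdsStart\tcdsEnd\n"
    "txA\tchr1\t+\t1000\t5000\t1200\t4800\n"
    "txB\tchr1\t-\t7000\t9000\t9000\t9000\n"
)
kept = [row["name"] for row in coding_rows(example)]
print(kept)
```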

This way, instead of filtering on the description for the absence of a non-coding note, you select every gene where cdsStart != cdsEnd, that is, every gene that does not display as non-coding. (Conversely, non-coding genes could be selected by filtering for table entries where cdsStart = cdsEnd.)

You also ask where the information comes from when you select "introns". It comes from the gene prediction models, which you can see more clearly in the browser. Exons are shown as darker boxes (related to how cdsStart != cdsEnd), and between the exons the introns are shown as lines with arrows indicating the strand of the gene. Another way to think of this: if you take the mRNA from GenBank and align it to the genome, it will align with gaps, and those gaps are the introns.
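
To make the "gaps between exons" idea concrete, here is a small Python sketch that derives intron coordinates from the exon coordinates of one gene model. It assumes the exonStarts and exonEnds fields are comma-separated lists sorted by position (a trailing comma is tolerated), with starts and ends in the same half-open coordinate convention; the coordinates in the example are made up.

```python
def introns(exon_starts, exon_ends):
    """Return (start, end) pairs for the gaps between consecutive exons.

    exon_starts / exon_ends are comma-separated strings such as
    "100,300,600," (assumed sorted by genomic position).
    """
    starts = [int(x) for x in exon_starts.split(",") if x]
    ends = [int(x) for x in exon_ends.split(",") if x]
    # Each intron runs from the end of one exon to the start of the next.
    return list(zip(ends[:-1], starts[1:]))


# Made-up gene model with three exons, hence two introns.
gaps = introns("100,300,600,", "200,400,700,")
print(gaps)
```

A single-exon model yields an empty list, since there is no gap to report.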

Thank you again for your inquiry and using the UCSC Genome Browser. If you have any further public questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee
UC Santa Cruz Genomics Institute


