Dear Mathias,
Thank you for using the UCSC Genome Browser and for your question about obtaining protein-coding gene FASTA sequences without isoforms.
The desire to have a single isoform per gene is understandable; however, there is still no simple way to filter the various gene predictions and decide on one canonical version. Either the selection would be arbitrary (and thus best left to the end user), or it would be made by some designed methodology, which is not easily accomplished.
You can search our archives and learn about the knownCanonical approach (Steve's email describes it as well) where there is an attempt to find the longest isoform for each gene: https://groups.google.com/a/soe.ucsc.edu/forum/#!searchin/genome/knownCanonical
You can also see how challenging this topic is by looking at external groups such as the principal splice isoforms selected by APPRIS (click into their track to see their methodology): http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&hubUrl=http://apprisws.bioinfo.cnio.es/trackHub/hub.txt
https://www.ncbi.nlm.nih.gov/pubmed/23161672
An approach for selecting the protein-coding genes is to use the free-form query on the Table Browser and set cdsStart != cdsEnd.
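The same cdsStart != cdsEnd test can also be applied to a table you have already downloaded. The following is only a minimal sketch in Python: it assumes tab-separated output with a header row that uses the gene prediction column names (name, cdsStart, cdsEnd), and the two sample rows are hypothetical, with made-up coordinates rather than real annotations.

```python
import csv
import io

# Hypothetical rows in gene-prediction column order; the coordinates are
# invented for illustration and do not describe real transcripts.
SAMPLE = (
    "name\tchrom\tstrand\ttxStart\ttxEnd\tcdsStart\tcdsEnd\n"
    "txCoding\tchr1\t+\t1000\t5000\t1200\t4800\n"
    "txNoncoding\tchr1\t-\t7000\t9000\t7000\t7000\n"
)


def coding_only(rows):
    """Keep rows whose coding region is non-empty (cdsStart != cdsEnd)."""
    return [r for r in rows if int(r["cdsStart"]) != int(r["cdsEnd"])]


rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
coding = coding_only(rows)
print([r["name"] for r in coding])
```

Only the row whose cdsStart and cdsEnd differ is kept; the row where they are equal is treated as non-coding and dropped.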
With this filter, rather than screening the description for the absence of a note about non-coding, you will find all genes where cdsStart != cdsEnd, which in essence are the ones that do not display as non-coding. (Non-coding genes could be selected with the opposite filter, cdsStart = cdsEnd, since they have no coding region.)
You also asked where the information comes from when you select "introns". It comes from the gene prediction models, which you can see more clearly in the browser: exons are drawn as boxes (with the coding portion, between cdsStart and cdsEnd, drawn thicker), and between the exons the introns are shown as lines with arrows indicating the strand of the gene. Another way to think of this is that if you take the mRNA from GenBank and align it to the genome, it will align with gaps, and those gaps are the introns.
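To make the "gaps between exons" idea concrete, here is a small Python sketch that derives intron coordinates from the exonStarts and exonEnds columns of a gene prediction row. It assumes those columns are comma-separated lists of coordinates (possibly with a trailing comma); the function name and the example coordinates are hypothetical.

```python
def introns(exon_starts, exon_ends):
    """Return (start, end) gaps between consecutive exons.

    Both arguments are comma-separated strings, as in the exonStarts and
    exonEnds columns of a gene prediction table; a trailing comma is allowed.
    """
    starts = [int(x) for x in exon_starts.split(",") if x]
    ends = [int(x) for x in exon_ends.split(",") if x]
    # An intron runs from the end of one exon to the start of the next.
    return list(zip(ends[:-1], starts[1:]))


# Made-up transcript with three exons, hence two introns.
example = introns("1000,2000,3500,", "1500,2600,5000,")
print(example)
```

A transcript with a single exon yields an empty list, since there is no gap to report.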
Thank you again for your inquiry and for using the UCSC Genome Browser. If you have any further public questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.
All the best,
Brian Lee
UC Santa Cruz Genomics Institute